* [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
@ 2021-12-21  2:48 Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 01/11] RDMA: Add ERDMA to rdma_driver_id definition Cheng Xu
                   ` (11 more replies)
  0 siblings, 12 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

Hello all,

This patch set introduces the Elastic RDMA Adapter (ERDMA) driver, which
was released at the Apsara Conference 2021 by Alibaba.

ERDMA enables large-scale RDMA acceleration in the Alibaba ECS
environment, initially offered on the g7re instance type. It can
significantly improve the efficiency of large-scale distributed computing
and communication, and expands dynamically with the cluster scale of
Alibaba Cloud.

ERDMA is an RDMA networking adapter based on the Alibaba MOC hardware. It
works in the VPC network environment (overlay network) and uses the iWarp
transport protocol. ERDMA supports reliable connection (RC), and supports
both kernel-space and user-space verbs. We already support HPC/AI
applications with libfabric, NoF and some other internal verbs libraries,
such as xrdma and epsl.

For an ECS instance with RDMA enabled, two kinds of devices are
allocated: one for ERDMA, and one for the original netdev (virtio-net).
They are separate PCI devices. The ERDMA driver can learn which netdev it
is attached to from its PCIe BAR space (by MAC address matching), as
sketched below.
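
As a rough illustration of that matching (a hedged sketch, not code from
this series; erdma_reg_read32() and the MAC register names come from
patch 02, everything else is hypothetical):

static int erdma_attach_netdev(struct erdma_dev *dev)
{
	u32 mac_l = erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_L_REG);
	u32 mac_h = erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_H_REG);
	struct net_device *netdev;
	u8 mac[ETH_ALEN];

	/* Assumes the high register carries the upper bits of the MAC. */
	u64_to_ether_addr(((u64)mac_h << 32) | mac_l, mac);

	rtnl_lock();
	for_each_netdev(&init_net, netdev) {
		if (ether_addr_equal(netdev->dev_addr, mac)) {
			dev->netdev = netdev;
			break;
		}
	}
	rtnl_unlock();

	return dev->netdev ? 0 : -ENODEV;
}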

Thanks,
Cheng Xu

Cheng Xu (11):
  RDMA: Add ERDMA to rdma_driver_id definition
  RDMA/erdma: Add the hardware related definitions
  RDMA/erdma: Add main include file
  RDMA/erdma: Add cmdq implementation
  RDMA/erdma: Add event queue implementation
  RDMA/erdma: Add verbs header file
  RDMA/erdma: Add verbs implementation
  RDMA/erdma: Add connection management (CM) support
  RDMA/erdma: Add the erdma module
  RDMA/erdma: Add the ABI definitions
  RDMA/erdma: Add driver to kernel build environment

 MAINTAINERS                               |    8 +
 drivers/infiniband/Kconfig                |    1 +
 drivers/infiniband/hw/Makefile            |    1 +
 drivers/infiniband/hw/erdma/Kconfig       |   10 +
 drivers/infiniband/hw/erdma/Makefile      |    5 +
 drivers/infiniband/hw/erdma/erdma.h       |  381 +++++
 drivers/infiniband/hw/erdma/erdma_cm.c    | 1585 +++++++++++++++++++++
 drivers/infiniband/hw/erdma/erdma_cm.h    |  158 ++
 drivers/infiniband/hw/erdma/erdma_cmdq.c  |  489 +++++++
 drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
 drivers/infiniband/hw/erdma/erdma_debug.c |  314 ++++
 drivers/infiniband/hw/erdma/erdma_debug.h |   18 +
 drivers/infiniband/hw/erdma/erdma_eq.c    |  346 +++++
 drivers/infiniband/hw/erdma/erdma_hw.h    |  474 ++++++
 drivers/infiniband/hw/erdma/erdma_main.c  |  711 +++++++++
 drivers/infiniband/hw/erdma/erdma_qp.c    |  624 ++++++++
 drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++
 drivers/infiniband/hw/erdma/erdma_verbs.h |  366 +++++
 include/uapi/rdma/erdma-abi.h             |   49 +
 include/uapi/rdma/ib_user_ioctl_verbs.h   |    1 +
 20 files changed, 7219 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/Kconfig
 create mode 100644 drivers/infiniband/hw/erdma/Makefile
 create mode 100644 drivers/infiniband/hw/erdma/erdma.h
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.h
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cmdq.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.h
 create mode 100644 drivers/infiniband/hw/erdma/erdma_eq.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_hw.h
 create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.h
 create mode 100644 include/uapi/rdma/erdma-abi.h

-- 
2.27.0



* [PATCH rdma-next 01/11] RDMA: Add ERDMA to rdma_driver_id definition
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 02/11] RDMA/erdma: Add the hardware related definitions Cheng Xu
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 include/uapi/rdma/ib_user_ioctl_verbs.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
index 3072e5d6b692..7dd56210226f 100644
--- a/include/uapi/rdma/ib_user_ioctl_verbs.h
+++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
@@ -250,6 +250,7 @@ enum rdma_driver_id {
 	RDMA_DRIVER_QIB,
 	RDMA_DRIVER_EFA,
 	RDMA_DRIVER_SIW,
+	RDMA_DRIVER_ERDMA,
 };
 
 enum ib_uverbs_gid_type {
-- 
2.27.0



* [PATCH rdma-next 02/11] RDMA/erdma: Add the hardware related definitions
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 01/11] RDMA: Add ERDMA to rdma_driver_id definition Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 03/11] RDMA/erdma: Add main include file Cheng Xu
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

ERDMA is a PCIe device, and this file provides the ERDMA hardware-related
definitions, mainly including PCIe device capability restrictions, device
register definitions, doorbell space, doorbell structure definitions and
WQE definitions.
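
As an orientation for readers: the mask-style definitions below are meant
to be combined with FIELD_PREP()/FIELD_GET(). For example, a CQ doorbell
value could be composed like this (illustrative only; cqn, cmdsn and ci
are placeholder variables, not values from this patch):

	u64 db_data = FIELD_PREP(ERDMA_CQDB_CQN_MASK, cqn) |
		      FIELD_PREP(ERDMA_CQDB_ARM_MASK, 1) |
		      FIELD_PREP(ERDMA_CQDB_CMDSN_MASK, cmdsn) |
		      FIELD_PREP(ERDMA_CQDB_CI_MASK, ci);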

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_hw.h | 474 +++++++++++++++++++++++++
 1 file changed, 474 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_hw.h

diff --git a/drivers/infiniband/hw/erdma/erdma_hw.h b/drivers/infiniband/hw/erdma/erdma_hw.h
new file mode 100644
index 000000000000..cd5683b04078
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_hw.h
@@ -0,0 +1,474 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+
+#ifndef __ERDMA_HW_H__
+#define __ERDMA_HW_H__
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+/* PCIe device related definition. */
+#define PCI_VENDOR_ID_ALIBABA 0x1ded
+
+#define ERDMA_FUNC_BAR     0
+#define ERDMA_MSIX_BAR     2
+
+#define ERDMA_BAR_MASK (BIT(ERDMA_FUNC_BAR) | BIT(ERDMA_MSIX_BAR))
+
+/* MSI-X related. */
+#define ERDMA_NUM_MSIX_VEC        32
+#define ERDMA_MSIX_VECTOR_CMDQ    0
+
+/* PCIe Bar0 Registers. */
+#define ERDMA_REGS_VERSION_REG             0x0
+#define ERDMA_REGS_DEV_CTRL_REG            0x10
+#define ERDMA_REGS_DEV_ST_REG              0x14
+#define ERDMA_REGS_NETDEV_MAC_L_REG        0x18
+#define ERDMA_REGS_NETDEV_MAC_H_REG        0x1C
+#define ERDMA_REGS_CMDQ_SQ_ADDR_L_REG      0x20
+#define ERDMA_REGS_CMDQ_SQ_ADDR_H_REG      0x24
+#define ERDMA_REGS_CMDQ_CQ_ADDR_L_REG      0x28
+#define ERDMA_REGS_CMDQ_CQ_ADDR_H_REG      0x2C
+#define ERDMA_REGS_CMDQ_DEPTH_REG          0x30
+#define ERDMA_REGS_CMDQ_EQ_DEPTH_REG       0x34
+#define ERDMA_REGS_CMDQ_EQ_ADDR_L_REG      0x38
+#define ERDMA_REGS_CMDQ_EQ_ADDR_H_REG      0x3C
+#define ERDMA_REGS_AEQ_ADDR_L_REG          0x40
+#define ERDMA_REGS_AEQ_ADDR_H_REG          0x44
+#define ERDMA_REGS_AEQ_DEPTH_REG           0x48
+#define ERDMA_REGS_GRP_NUM_REG             0x4c
+#define ERDMA_REGS_AEQ_DB_REG              0x50
+#define ERDMA_CMDQ_SQ_DB_HOST_ADDR_REG     0x60
+#define ERDMA_CMDQ_CQ_DB_HOST_ADDR_REG     0x68
+#define ERDMA_CMDQ_EQ_DB_HOST_ADDR_REG     0x70
+#define ERDMA_AEQ_DB_HOST_ADDR_REG         0x78
+#define ERDMA_REGS_CEQ_DB_BASE_REG         0x100
+#define ERDMA_CMDQ_SQDB_REG                0x200
+#define ERDMA_CMDQ_CQDB_REG                0x300
+
+/* DEV_CTRL_REG details. */
+#define ERDMA_REG_DEV_CTRL_RESET_MASK       0x00000001
+#define ERDMA_REG_DEV_CTRL_INIT_MASK        0x00000002
+
+/* DEV_ST_REG details. */
+#define ERDMA_REG_DEV_ST_RESET_DONE_MASK    0x00000001U
+#define ERDMA_REG_DEV_ST_INIT_DONE_MASK     0x00000002U
+
+/* eRDMA PCIe DBs definition. */
+#define ERDMA_BAR_DB_SPACE_BASE     4096
+
+#define ERDMA_BAR_SQDB_SPACE_OFFSET ERDMA_BAR_DB_SPACE_BASE
+#define ERDMA_BAR_SQDB_SPACE_SIZE   (384 * 1024)
+
+#define ERDMA_BAR_RQDB_SPACE_OFFSET (ERDMA_BAR_SQDB_SPACE_OFFSET + ERDMA_BAR_SQDB_SPACE_SIZE)
+#define ERDMA_BAR_RQDB_SPACE_SIZE   (96 * 1024)
+
+#define ERDMA_BAR_CQDB_SPACE_OFFSET (ERDMA_BAR_RQDB_SPACE_OFFSET + ERDMA_BAR_RQDB_SPACE_SIZE)
+
+/* Doorbell page resources related. */
+/*
+ * The max number of directSQEs issued in parallel is 3072 per device;
+ * hardware organizes them into 24 groups, each group having 128 credits.
+ */
+#define ERDMA_DWQE_MAX_GRP_CNT          24
+#define ERDMA_DWQE_NUM_PER_GRP          128
+
+#define ERDMA_DWQE_TYPE0_CNT            64
+#define ERDMA_DWQE_TYPE1_CNT            496
+#define ERDMA_DWQE_TYPE1_CNT_PER_PAGE   16  /* a type1 DB contains 2 DBs and takes 256 bytes. */
+
+#define ERDMA_SDB_SHARED_PAGE_INDEX     95
+
+/* Doorbell related. */
+#define ERDMA_CQDB_EQN_MASK              GENMASK_ULL(63, 56)
+#define ERDMA_CQDB_CQN_MASK              GENMASK_ULL(55, 32)
+#define ERDMA_CQDB_ARM_MASK              BIT_ULL(31)
+#define ERDMA_CQDB_SOL_MASK              BIT_ULL(30)
+#define ERDMA_CQDB_CMDSN_MASK            GENMASK_ULL(29, 28)
+#define ERDMA_CQDB_CI_MASK               GENMASK_ULL(23, 0)
+
+#define ERDMA_EQDB_ARM_MASK              BIT(31)
+#define ERDMA_EQDB_CI_MASK               GENMASK_ULL(23, 0)
+
+/* WQE related. */
+#define EQE_SIZE 16
+#define EQE_SHIFT 4
+#define RQE_SIZE 32
+#define RQE_SHIFT 5
+#define CQE_SIZE 32
+#define CQE_SHIFT 5
+#define SQEBB_SIZE 32
+#define SQEBB_SHIFT 5
+#define SQEBB_MASK (~(SQEBB_SIZE - 1))
+#define SQEBB_ALIGN(size) ((size + SQEBB_SIZE - 1) & SQEBB_MASK)
+#define SQEBB_COUNT(size) (SQEBB_ALIGN(size) >> SQEBB_SHIFT)
+
+#define ERDMA_MAX_SQE_SIZE 128
+#define ERDMA_MAX_WQEBB_PER_SQE 4
+
+/* CMDQ related. */
+#define ERDMA_CMDQ_MAX_OUTSTANDING      128
+#define ERDMA_CMDQ_SQE_SIZE             64
+
+/* cmdq sub module definition. */
+enum CMDQ_WQE_SUB_MOD {
+	CMDQ_SUBMOD_RDMA    = 0,
+	CMDQ_SUBMOD_COMMON  = 1
+};
+
+enum CMDQ_RDMA_OPCODE {
+	CMDQ_OPCODE_QUERY_DEVICE = 0,
+	CMDQ_OPCODE_CREATE_QP    = 1,
+	CMDQ_OPCODE_DESTROY_QP   = 2,
+	CMDQ_OPCODE_MODIFY_QP    = 3,
+	CMDQ_OPCODE_CREATE_CQ    = 4,
+	CMDQ_OPCODE_DESTROY_CQ   = 5,
+	CMDQ_OPCODE_REG_MR       = 8,
+	CMDQ_OPCODE_DEREG_MR     = 9
+};
+
+enum CMDQ_COMMON_OPCODE {
+	CMDQ_OPCODE_CREATE_EQ  = 0,
+	CMDQ_OPCODE_DESTROY_EQ = 1
+};
+
+/* cmdq-SQE HDR */
+#define ERDMA_CMD_HDR_WQEBB_CNT_MASK     GENMASK_ULL(54, 52)
+#define ERDMA_CMD_HDR_CONTEXT_COOKIE     GENMASK_ULL(47, 32)
+#define ERDMA_CMD_HDR_SUB_MOD_MASK       GENMASK_ULL(25, 24)
+#define ERDMA_CMD_HDR_OPCODE_MASK        GENMASK_ULL(23, 16)
+#define ERDMA_CMD_HDR_WQEBB_INDEX_MASK   GENMASK_ULL(15, 0)
+
+struct erdma_cmdq_destroy_cq_req {
+	u64 hdr;
+	u32 cqn;
+};
+
+struct erdma_cmdq_create_eq_req {
+	u64 hdr;
+	u64 qbuf_addr;
+	u8  vector_idx;
+	u8  eqn;
+	u8  depth;
+	u8  qtype;
+	u32 db_dma_addr_l;
+	u32 db_dma_addr_h;
+};
+
+struct erdma_cmdq_destroy_eq_req {
+	u64 hdr;
+	u64 rsvd0;
+	u8  vector_idx;
+	u8  eqn;
+	u8  rsvd1;
+	u8  qtype;
+};
+
+/* create_cq cfg0 */
+#define ERDMA_CMD_CREATE_CQ_DEPTH_MASK      GENMASK(31, 24)
+#define ERDMA_CMD_CREATE_CQ_PAGESIZE_MASK   GENMASK(23, 20)
+#define ERDMA_CMD_CREATE_CQ_CQN_MASK        GENMASK(19, 0)
+
+/* create_cq cfg1 */
+#define ERDMA_CMD_CREATE_CQ_MTT_CNT_MASK    GENMASK(31, 16)
+#define ERDMA_CMD_CREATE_CQ_MTT_TYPE_MASK   BIT(15)
+#define ERDMA_CMD_CREATE_CQ_EQN_MASK        GENMASK(9, 0)
+
+struct erdma_cmdq_create_cq_req {
+	u64 hdr;
+	u32 cfg0;
+	u32 qbuf_addr_l;
+	u32 qbuf_addr_h;
+	u32 cfg1;
+	u64 cq_db_info_addr;
+	u32 first_page_offset;
+};
+
+/* regmr/deregmr cfg0 */
+#define ERDMA_CMD_MR_VALID_MASK    BIT(31)
+#define ERDMA_CMD_MR_KEY_MASK      GENMASK(27, 20)
+#define ERDMA_CMD_MR_MPT_IDX_MASK  GENMASK(19, 0)
+
+/* regmr cfg1 */
+#define ERDMA_CMD_REGMR_PD_MASK       GENMASK(31, 12)
+#define ERDMA_CMD_REGMR_TYPE_MASK     GENMASK(7, 6)
+#define ERDMA_CMD_REGMR_RIGHT_MASK    GENMASK(5, 2)
+#define ERDMA_CMD_REGMR_ACC_MODE_MASK GENMASK(1, 0)
+
+/* regmr cfg2 */
+#define ERDMA_CMD_REGMR_PAGESIZE_MASK GENMASK(31, 27)
+#define ERDMA_CMD_REGMR_MTT_TYPE_MASK GENMASK(21, 20)
+#define ERDMA_CMD_REGMR_MTT_CNT_MASK  GENMASK(19, 0)
+
+struct erdma_cmdq_reg_mr_req {
+	u64 hdr;
+	u32 cfg0;
+	u32 cfg1;
+	u64 start_va;
+	u32 size;
+	u32 cfg2;
+	u64 phy_addr[4];
+};
+
+struct erdma_cmdq_dereg_mr_req {
+	u64 hdr;
+	u32 cfg0;
+};
+
+/* modify qp cfg0 */
+#define ERDMA_CMD_MODIFY_QP_STATE_MASK  GENMASK(31, 24)
+#define ERDMA_CMD_MODIFY_QP_CC_MASK     GENMASK(23, 20)
+#define ERDMA_CMD_MODIFY_QP_QPN_MASK    GENMASK(19, 0)
+
+struct erdma_cmdq_modify_qp_req {
+	u64 hdr;
+	u32 cfg0;
+	u32 remote_qpn;
+	u32 dip;
+	u32 sip;
+	u16 sport;
+	u16 dport;
+	u32 send_nxt;
+	u32 recv_nxt;
+};
+
+/* create qp cfg0 */
+#define ERDMA_CMD_CREATE_QP_SQ_DEPTH_MASK   GENMASK(31, 20)
+#define ERDMA_CMD_CREATE_QP_QPN_MASK        GENMASK(19, 0)
+
+/* create qp cfg1 */
+#define ERDMA_CMD_CREATE_QP_RQ_DEPTH_MASK   GENMASK(31, 20)
+#define ERDMA_CMD_CREATE_QP_PD_MASK         GENMASK(19, 0)
+
+/* cqn_mtt_cfg */
+#define ERDMA_CMD_CREATE_QP_PAGE_SIZE_MASK  GENMASK(31, 28)
+#define ERDMA_CMD_CREATE_QP_CQN_MASK        GENMASK(23, 0)
+
+/* mtt_cfg */
+#define ERDMA_CMD_CREATE_QP_PAGE_OFFSET_MASK GENMASK(31, 12)
+#define ERDMA_CMD_CREATE_QP_MTT_CNT_MASK     GENMASK(11, 1)
+#define ERDMA_CMD_CREATE_QP_MTT_TYPE_MASK    BIT(0)
+
+struct erdma_cmdq_create_qp_req {
+	u64 hdr;
+	u32 cfg0;
+	u32 cfg1;
+	u32 sq_cqn_mtt_cfg;
+	u32 rq_cqn_mtt_cfg;
+	u64 sq_buf_addr;
+	u64 rq_buf_addr;
+	u32 sq_mtt_cfg;
+	u32 rq_mtt_cfg;
+	u64 sq_db_info_dma_addr;
+	u64 rq_db_info_dma_addr;
+};
+
+struct erdma_cmdq_destroy_qp_req {
+	u64 hdr;
+	u32 qpn;
+};
+
+#define ERDMA_CMD_DEV_CAP0_MAX_CQE_MASK      GENMASK_ULL(47, 40)
+#define ERDMA_CMD_DEV_CAP0_MAX_RECV_WR_MASK  GENMASK_ULL(23, 16)
+#define ERDMA_CMD_DEV_CAP0_MAX_MR_SIZE_MASK  GENMASK_ULL(7, 0)
+
+#define ERDMA_CMD_DEV_CAP1_DMA_LOCAL_KEY_MASK GENMASK_ULL(63, 32)
+#define ERDMA_CMD_DEV_CAP1_DEFAULT_CC_MASK    GENMASK_ULL(31, 28)
+#define ERDMA_CMD_DEV_CAP1_QBLOCK_MASK        GENMASK_ULL(27, 16)
+#define ERDMA_CMD_DEV_CAP1_MAX_MW_MASK        GENMASK_ULL(7, 0)
+
+#define ERDMA_NQP_PER_QBLOCK 1024
+
+/* CQE hdr */
+#define ERDMA_CQE_HDR_OWNER_MASK         BIT(31)
+#define ERDMA_CQE_HDR_OPCODE_MASK        GENMASK(23, 16)
+#define ERDMA_CQE_HDR_QTYPE_MASK         GENMASK(15, 8)
+#define ERDMA_CQE_HDR_SYNDROME_MASK      GENMASK(7, 0)
+
+#define ERDMA_CQE_QTYPE_SQ    0
+#define ERDMA_CQE_QTYPE_RQ    1
+#define ERDMA_CQE_QTYPE_CMDQ  2
+
+struct erdma_cqe {
+	__be32 hdr;
+	__be32 qe_idx;
+	__be32 qpn;
+	__be32 imm_data;
+	__be32 size;
+	__be32 rsvd[3];
+};
+
+struct erdma_sge {
+	__aligned_le64 laddr;
+	__le32         length;
+	__le32         lkey;
+};
+
+/* Receive Queue Element */
+struct erdma_rqe {
+	__le16 qe_idx;
+	__le16 rsvd;
+	__le32 qpn;
+	__le32 rsvd2;
+	__le32 rsvd3;
+	__le64 to;
+	__le32 length;
+	__le32 stag;
+};
+
+/* SQE */
+#define ERDMA_SQE_HDR_SGL_LEN_MASK       GENMASK_ULL(63, 56)
+#define ERDMA_SQE_HDR_WQEBB_CNT_MASK     GENMASK_ULL(54, 52)
+#define ERDMA_SQE_HDR_QPN_MASK           GENMASK_ULL(51, 32)
+#define ERDMA_SQE_HDR_OPCODE_MASK        GENMASK_ULL(31, 27)
+#define ERDMA_SQE_HDR_DWQE_MASK          BIT_ULL(26)
+#define ERDMA_SQE_HDR_INLINE_MASK        BIT_ULL(25)
+#define ERDMA_SQE_HDR_FENCE_MASK         BIT_ULL(24)
+#define ERDMA_SQE_HDR_SE_MASK            BIT_ULL(23)
+#define ERDMA_SQE_HDR_CE_MASK            BIT_ULL(22)
+#define ERDMA_SQE_HDR_WQEBB_INDEX_MASK   GENMASK_ULL(15, 0)
+
+/* REG MR attrs */
+#define ERDMA_SQE_MR_ACCESS_MODE_MASK	GENMASK_ULL(1, 0)
+#define ERDMA_SQE_MR_ACCESS_RIGHT_MASK	GENMASK_ULL(5, 2)
+#define ERDMA_SQE_MR_MTT_TYPE_MASK		GENMASK_ULL(7, 6)
+#define ERDMA_SQE_MR_MTT_COUNT_MASK		GENMASK_ULL(31, 12)
+
+struct erdma_write_sqe {
+	__le64 hdr;
+	__le32 imm_data;
+	__le32 length;
+
+	__le32 sink_stag;
+	__le32 sink_to_low;
+	__le32 sink_to_high;
+
+	__le32 rsvd;
+
+	struct erdma_sge sgl[];
+};
+
+struct erdma_send_sqe {
+	__le64 hdr;
+	__le32 imm_data;
+	__le32 length;
+	struct erdma_sge sgl[];
+};
+
+struct erdma_readreq_sqe {
+	__le64 hdr;
+	__le32 invalid_stag;
+	__le32 length;
+	__le32 sink_stag;
+	__le32 sink_to_low;
+	__le32 sink_to_high;
+	__le32 rsvd0;
+};
+
+struct erdma_reg_mr_sqe {
+	__le64 hdr;
+	__le64 addr;
+	__le32 length;
+	__le32 stag;
+	__le32 attrs;
+	__le32 reserved;
+	__u64 inline_addr[];
+};
+
+/* EQ related. */
+#define ERDMA_DEFAULT_EQ_DEPTH 256
+
+/* ceqe */
+#define ERDMA_CEQE_HDR_DB_MASK  BIT_ULL(63)
+#define ERDMA_CEQE_HDR_PI_MASK  GENMASK_ULL(55, 32)
+#define ERDMA_CEQE_HDR_O_MASK   BIT_ULL(31)
+#define ERDMA_CEQE_HDR_CQN_MASK GENMASK_ULL(19, 0)
+
+/* aeqe */
+#define ERDMA_AEQE_HDR_O_MASK       BIT(31)
+#define ERDMA_AEQE_HDR_TYPE_MASK    GENMASK(23, 16)
+#define ERDMA_AEQE_HDR_SUBTYPE_MASK GENMASK(7, 0)
+
+#define ERDMA_AE_TYPE_QP_FATAL_EVENT     0
+#define ERDMA_AE_TYPE_QP_ERQ_ERR_EVENT   1
+#define ERDMA_AE_TYPE_ACC_ERR_EVENT      2
+#define ERDMA_AE_TYPE_CQ_ERR             3
+#define ERDMA_AE_TYPE_OTHER_ERROR        4
+
+struct erdma_aeqe {
+	__le32   hdr;
+	__le32   event_data0;
+	__le32   event_data1;
+	__le32   rsvd2;
+};
+
+enum erdma_opcode {
+	ERDMA_OP_WRITE           = 0,
+	ERDMA_OP_READ            = 1,
+	ERDMA_OP_SEND            = 2,
+	ERDMA_OP_SEND_WITH_IMM   = 3,
+
+	ERDMA_OP_RECEIVE         = 4,
+	ERDMA_OP_RECV_IMM        = 5,
+	ERDMA_OP_RECV_INV        = 6,
+
+	ERDMA_OP_REQ_ERR         = 7,
+	ERDMA_OP_READ_RESPONSE   = 8,
+	ERDMA_OP_WRITE_WITH_IMM  = 9,
+
+	ERDMA_OP_RECV_ERR        = 10,
+
+	ERDMA_OP_INVALIDATE     = 11,
+	ERDMA_OP_RSP_SEND_IMM   = 12,
+	ERDMA_OP_SEND_WITH_INV  = 13,
+
+	ERDMA_OP_REG_MR         = 14,
+	ERDMA_OP_LOCAL_INV      = 15,
+	ERDMA_OP_READ_WITH_INV  = 16,
+	ERDMA_NUM_OPCODES       = 17,
+	ERDMA_OP_INVALID        = ERDMA_NUM_OPCODES + 1
+};
+
+enum erdma_wc_status {
+	ERDMA_WC_SUCCESS = 0,
+	ERDMA_WC_GENERAL_ERR = 1,
+	ERDMA_WC_RECV_WQE_FORMAT_ERR = 2,
+	ERDMA_WC_RECV_STAG_INVALID_ERR = 3,
+	ERDMA_WC_RECV_ADDR_VIOLATION_ERR = 4,
+	ERDMA_WC_RECV_RIGHT_VIOLATION_ERR = 5,
+	ERDMA_WC_RECV_PDID_ERR = 6,
+	ERDMA_WC_RECV_WRAPPING_ERR = 7,
+	ERDMA_WC_SEND_WQE_FORMAT_ERR = 8,
+	ERDMA_WC_SEND_WQE_ORD_EXCEED = 9,
+	ERDMA_WC_SEND_STAG_INVALID_ERR = 10,
+	ERDMA_WC_SEND_ADDR_VIOLATION_ERR = 11,
+	ERDMA_WC_SEND_RIGHT_VIOLATION_ERR = 12,
+	ERDMA_WC_SEND_PDID_ERR = 13,
+	ERDMA_WC_SEND_WRAPPING_ERR = 14,
+	ERDMA_WC_FLUSH_ERR = 15,
+	ERDMA_WC_RETRY_EXC_ERR = 16,
+	ERDMA_NUM_WC_STATUS
+};
+
+enum erdma_vendor_err {
+	ERDMA_WC_VENDOR_NO_ERR = 0,
+	ERDMA_WC_VENDOR_INVALID_RQE = 1,
+	ERDMA_WC_VENDOR_RQE_INVALID_STAG = 2,
+	ERDMA_WC_VENDOR_RQE_ADDR_VIOLATION = 3,
+	ERDMA_WC_VENDOR_RQE_ACCESS_RIGHT_ERR = 4,
+	ERDMA_WC_VENDOR_RQE_INVALID_PD = 5,
+	ERDMA_WC_VENDOR_RQE_WRAP_ERR = 6,
+	ERDMA_WC_VENDOR_INVALID_SQE = 0x20,
+	ERDMA_WC_VENDOR_ZERO_ORD = 0x21,
+	ERDMA_WC_VENDOR_SQE_INVALID_STAG = 0x30,
+	ERDMA_WC_VENDOR_SQE_ADDR_VIOLATION = 0x31,
+	ERDMA_WC_VENDOR_SQE_ACCESS_ERR = 0x32,
+	ERDMA_WC_VENDOR_SQE_INVALID_PD = 0x33,
+	ERDMA_WC_VENDOR_SQE_WRAP_ERR = 0x34
+};
+
+#endif
-- 
2.27.0



* [PATCH rdma-next 03/11] RDMA/erdma: Add main include file
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 01/11] RDMA: Add ERDMA to rdma_driver_id definition Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 02/11] RDMA/erdma: Add the hardware related definitions Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 04/11] RDMA/erdma: Add cmdq implementation Cheng Xu
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

Add the ERDMA driver main header file, defining the internally used data
structures and operations. The defined data structures include *cmdq*,
which is used as the communication channel between the ERDMA driver and
the hardware.
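
The intended usage pattern of the cmdq interfaces declared here can be
sketched as follows (a hedged illustration mirroring how later patches in
this series issue commands; the request struct and opcode come from patch
02, error handling is omitted):

	struct erdma_cmdq_destroy_cq_req req;
	int err;

	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_RDMA,
				 CMDQ_OPCODE_DESTROY_CQ);
	req.cqn = cq->cqn;

	err = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req, sizeof(req),
				  NULL, NULL);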

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma.h | 381 ++++++++++++++++++++++++++++
 1 file changed, 381 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma.h

diff --git a/drivers/infiniband/hw/erdma/erdma.h b/drivers/infiniband/hw/erdma/erdma.h
new file mode 100644
index 000000000000..fd0fee698874
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma.h
@@ -0,0 +1,381 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+
+#ifndef __ERDMA_H__
+#define __ERDMA_H__
+
+#include <linux/bitfield.h>
+#include <linux/netdevice.h>
+#include <linux/xarray.h>
+#include <rdma/ib_verbs.h>
+
+#include "erdma_hw.h"
+
+#define DRV_MODULE_NAME "erdma"
+
+struct erdma_eq {
+	void *qbuf;
+	dma_addr_t qbuf_dma_addr;
+
+	u32 depth;
+	u64 __iomem *db_addr;
+
+	spinlock_t lock;
+
+	u16 ci;
+	u16 owner;
+
+	atomic64_t event_num;
+	atomic64_t notify_num;
+
+	void *db_info;
+};
+
+struct erdma_cmdq_sq {
+	void *qbuf;
+	dma_addr_t qbuf_dma_addr;
+
+	spinlock_t lock;
+	u64 __iomem *db_addr;
+
+	u16 ci;
+	u16 pi;
+
+	u16 depth;
+	u16 wqebb_cnt;
+
+	void *db_info;
+
+	u64 total_cmds;
+	u64 total_comp_cmds;
+};
+
+struct erdma_cmdq_cq {
+	void *qbuf;
+
+	dma_addr_t qbuf_dma_addr;
+
+	u64 __iomem *db_addr;
+	spinlock_t lock;
+
+	u32 ci;
+	u16 owner;
+	u16 depth;
+
+	void *db_info;
+
+	atomic64_t cq_armed_num;
+};
+
+enum {
+	ERDMA_CMD_STATUS_INIT,
+	ERDMA_CMD_STATUS_ISSUED,
+	ERDMA_CMD_STATUS_FINISHED,
+	ERDMA_CMD_STATUS_TIMEOUT
+};
+
+struct erdma_comp_wait {
+	struct completion wait_event;
+	u32 cmd_status;
+	u32 ctx_id;
+	u16 sq_pi;
+	u8 comp_status;
+	u8 rsvd;
+	u32 comp_data[4];
+};
+
+enum {
+	ERDMA_CMDQ_STATE_OK_BIT = 0,
+	ERDMA_CMDQ_STATE_TIMEOUT_BIT = 1,
+	ERDMA_CMDQ_STATE_CTX_ERR_BIT = 2,
+};
+
+#define ERDMA_CMDQ_TIMEOUT_MS       15000
+#define ERDMA_REG_ACCESS_WAIT_MS    20
+#define ERDMA_WAIT_DEV_DONE_CNT     500
+
+struct erdma_cmdq {
+	void *dev;
+
+	unsigned long *comp_wait_bitmap;
+	struct erdma_comp_wait *wait_pool;
+	spinlock_t lock;
+
+	u8 use_event;
+
+	struct erdma_cmdq_sq sq;
+	struct erdma_cmdq_cq cq;
+	struct erdma_eq eq;
+
+	unsigned long state;
+
+	struct semaphore credits;
+	u16 max_outstandings;
+};
+
+struct erdma_devattr {
+	unsigned int device;
+	unsigned int version;
+
+	u32 vendor_id;
+	u32 vendor_part_id;
+	u32 sw_version;
+	u32 max_qp;
+	u32 max_send_wr;
+	u32 max_recv_wr;
+	u32 max_ord;
+	u32 max_ird;
+
+	enum ib_device_cap_flags cap_flags;
+	u32 max_send_sge;
+	u32 max_recv_sge;
+	u32 max_sge_rd;
+	u32 max_cq;
+	u32 max_cqe;
+	u64 max_mr_size;
+	u32 max_mr;
+	u32 max_pd;
+	u32 max_mw;
+	u32 max_srq;
+	u32 max_srq_wr;
+	u32 max_srq_sge;
+	u32 local_dma_key;
+};
+
+#define ERDMA_IRQNAME_SIZE 50
+struct erdma_irq_info {
+	char name[ERDMA_IRQNAME_SIZE];
+	irq_handler_t handler;
+	u32 msix_vector;
+	void *data;
+	int cpu;
+	cpumask_t affinity_hint_mask;
+};
+
+struct erdma_eq_cb {
+	u8 ready;
+	u8 rsvd[3];
+	void *dev;
+	struct erdma_irq_info irq_info;
+	struct erdma_eq eq;
+	struct tasklet_struct tasklet;
+};
+
+#define COMPROMISE_CC ERDMA_CC_CUBIC
+enum erdma_cc_method {
+	ERDMA_CC_NEWRENO = 0,
+	ERDMA_CC_CUBIC,
+	ERDMA_CC_HPCC_RTT,
+	ERDMA_CC_HPCC_ECN,
+	ERDMA_CC_HPCC_INT,
+	ERDMA_CC_METHODS_NUM
+};
+
+struct erdma_resource_cb {
+	unsigned long *bitmap;
+	spinlock_t lock;
+	u32 next_alloc_idx;
+	u32 max_cap;
+};
+
+enum {
+	ERDMA_RES_TYPE_PD = 0,
+	ERDMA_RES_TYPE_STAG_IDX = 1,
+	ERDMA_RES_CNT = 2,
+};
+
+static inline int erdma_alloc_idx(struct erdma_resource_cb *res_cb)
+{
+	int idx;
+	unsigned long flags;
+	u32 start_idx = res_cb->next_alloc_idx;
+
+	spin_lock_irqsave(&res_cb->lock, flags);
+	idx = find_next_zero_bit(res_cb->bitmap, res_cb->max_cap, start_idx);
+	if (idx == res_cb->max_cap) {
+		idx = find_first_zero_bit(res_cb->bitmap, res_cb->max_cap);
+		if (idx == res_cb->max_cap) {
+			res_cb->next_alloc_idx = 1;
+			spin_unlock_irqrestore(&res_cb->lock, flags);
+			return -ENOSPC;
+		}
+	}
+
+	set_bit(idx, res_cb->bitmap);
+	spin_unlock_irqrestore(&res_cb->lock, flags);
+	res_cb->next_alloc_idx = idx + 1;
+	return idx;
+}
+
+static inline void erdma_free_idx(struct erdma_resource_cb *res_cb, u32 idx)
+{
+	unsigned long flags;
+	u32 used;
+
+	spin_lock_irqsave(&res_cb->lock, flags);
+	used = test_and_clear_bit(idx, res_cb->bitmap);
+	spin_unlock_irqrestore(&res_cb->lock, flags);
+	WARN_ON(!used);
+}
+
+#define ERDMA_EXTRA_BUFFER_SIZE 8
+
+struct erdma_dev {
+	struct ib_device ibdev;
+	struct net_device *netdev;
+	struct pci_dev *pdev;
+
+	struct notifier_block netdev_nb;
+	unsigned char peer_addr[MAX_ADDR_LEN];
+
+	/* physical port state (only one port per device) */
+	enum ib_port_state state;
+
+	u8 __iomem *func_bar;
+
+	resource_size_t func_bar_addr;
+	resource_size_t func_bar_len;
+
+	u32 dma_width;
+
+	struct erdma_irq_info comm_irq;
+	struct erdma_cmdq cmdq;
+
+	u16 irq_num;
+	u16 rsvd;
+
+	struct erdma_eq_cb aeq;
+	struct erdma_eq_cb ceqs[31];
+
+	struct erdma_devattr attrs;
+
+	spinlock_t lock;
+
+	struct erdma_resource_cb res_cb[ERDMA_RES_CNT];
+	struct xarray qp_xa;
+	struct xarray cq_xa;
+
+	u32 next_alloc_qpn;
+	u32 next_alloc_cqn;
+
+	spinlock_t db_bitmap_lock;
+
+	/* We provide 64 uContexts, each owning one SQ doorbell page. */
+	DECLARE_BITMAP(sdb_page, ERDMA_DWQE_TYPE0_CNT);
+	/* We provide 496 uContexts, each with one normal SQ db and one directWQE db. */
+	DECLARE_BITMAP(sdb_entry, ERDMA_DWQE_TYPE1_CNT);
+
+	atomic_t num_pd;
+	atomic_t num_qp;
+	atomic_t num_cq;
+	atomic_t num_cep;
+	atomic_t num_mr;
+	atomic_t num_ctx;
+
+	struct list_head cep_list;
+
+	u32 is_registered;
+	struct dentry *debugfs;
+
+	int numa_node;
+	int cc_method;
+	int grp_num;
+	int disable_dwqe;
+	int dwqe_pages;
+	int dwqe_entries;
+};
+
+static inline struct erdma_dev *to_edev(struct ib_device *ibdev)
+{
+	return container_of(ibdev, struct erdma_dev, ibdev);
+}
+
+static inline u32 erdma_reg_read32(struct erdma_dev *dev, u32 reg)
+{
+	return readl(dev->func_bar + reg);
+}
+
+static inline u64 erdma_reg_read64(struct erdma_dev *dev, u32 reg)
+{
+	return readq(dev->func_bar + reg);
+}
+
+static inline void erdma_reg_write32(struct erdma_dev *dev, u32 reg, u32 value)
+{
+	writel(value, dev->func_bar + reg);
+}
+
+static inline void erdma_reg_write64(struct erdma_dev *dev, u32 reg, u64 value)
+{
+	writeq(value, dev->func_bar + reg);
+}
+
+static inline u32 erdma_reg_read32_field(struct erdma_dev *dev, u32 reg, u32 field_mask)
+{
+	u32 val = erdma_reg_read32(dev, reg);
+
+	return FIELD_GET(field_mask, val);
+}
+
+static inline int erdma_poll_ceq_event(struct erdma_eq *ceq)
+{
+	__le64 *ceqe;
+	u16 queue_size_mask = ceq->depth - 1;
+	u64 val;
+
+	ceqe = ceq->qbuf + ((ceq->ci & queue_size_mask) << EQE_SHIFT);
+
+	val = READ_ONCE(*ceqe);
+	if (FIELD_GET(ERDMA_CEQE_HDR_O_MASK, val) == ceq->owner) {
+		dma_rmb();
+		ceq->ci++;
+
+		if ((ceq->ci & queue_size_mask) == 0)
+			ceq->owner = !ceq->owner;
+
+		atomic64_inc(&ceq->event_num);
+
+		return FIELD_GET(ERDMA_CEQE_HDR_CQN_MASK, val);
+	}
+
+	return -1;
+}
+
+static inline void notify_eq(struct erdma_eq *eq)
+{
+	u64 db_data = FIELD_PREP(ERDMA_EQDB_CI_MASK, eq->ci) |
+		FIELD_PREP(ERDMA_EQDB_ARM_MASK, 1);
+
+	*(u64 *)eq->db_info = db_data;
+	writeq(db_data, eq->db_addr);
+
+	atomic64_inc(&eq->notify_num);
+}
+
+int erdma_cmdq_init(struct erdma_dev *dev);
+void erdma_finish_cmdq_init(struct erdma_dev *dev);
+void erdma_cmdq_destroy(struct erdma_dev *dev);
+
+#define ERDMA_CMDQ_BUILD_REQ_HDR(hdr, mod, op) \
+do { \
+	*(u64 *)(hdr) = FIELD_PREP(ERDMA_CMD_HDR_SUB_MOD_MASK, mod); \
+	*(u64 *)(hdr) |= FIELD_PREP(ERDMA_CMD_HDR_OPCODE_MASK, op); \
+} while (0)
+
+int erdma_post_cmd_wait(struct erdma_cmdq *cmdq, u64 *req, u32 req_size, u64 *resp0, u64 *resp1);
+void erdma_cmdq_completion_handler(struct erdma_dev *dev);
+
+int erdma_ceqs_init(struct erdma_dev *dev);
+void erdma_ceqs_uninit(struct erdma_dev *dev);
+
+int erdma_aeq_init(struct erdma_dev *dev);
+void erdma_aeq_destroy(struct erdma_dev *dev);
+
+void erdma_aeq_event_handler(struct erdma_dev *dev);
+void erdma_ceq_completion_handler(struct erdma_eq_cb *ceq_cb);
+
+#endif
-- 
2.27.0



* [PATCH rdma-next 04/11] RDMA/erdma: Add cmdq implementation
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
                   ` (2 preceding siblings ...)
  2021-12-21  2:48 ` [PATCH rdma-next 03/11] RDMA/erdma: Add main include file Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 05/11] RDMA/erdma: Add event queue implementation Cheng Xu
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

The cmdq is the main control-plane channel between the erdma driver and
the hardware. After the erdma device is initialized, the cmdq channel
stays active for the whole lifecycle of the driver.
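
One convention worth noting: completions are recognized via an owner bit
whose expected value flips on every queue wrap. A minimal sketch of that
check (using the CQE masks from patch 02; the variable names are
illustrative, not taken verbatim from this patch):

	hdr = __be32_to_cpu(READ_ONCE(*cqe));
	if (FIELD_GET(ERDMA_CQE_HDR_OWNER_MASK, hdr) != expect_owner)
		return;				/* no new completion yet */

	if (++cqe_idx == cq_depth) {		/* wrapped around */
		cqe_idx = 0;
		expect_owner = !expect_owner;
	}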

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_cmdq.c | 489 +++++++++++++++++++++++
 1 file changed, 489 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cmdq.c

diff --git a/drivers/infiniband/hw/erdma/erdma_cmdq.c b/drivers/infiniband/hw/erdma/erdma_cmdq.c
new file mode 100644
index 000000000000..3eca524fca24
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_cmdq.c
@@ -0,0 +1,489 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/pci.h>
+
+#include "erdma.h"
+#include "erdma_hw.h"
+#include "erdma_verbs.h"
+
+static inline void arm_cmdq_cq(struct erdma_cmdq *cmdq)
+{
+	u64 db_data = FIELD_PREP(ERDMA_CQDB_CI_MASK, cmdq->cq.ci) |
+		FIELD_PREP(ERDMA_CQDB_ARM_MASK, 1);
+
+	*(u64 *)cmdq->cq.db_info = db_data;
+	writeq(db_data, cmdq->cq.db_addr);
+
+	atomic64_inc(&cmdq->cq.cq_armed_num);
+}
+
+static inline void kick_cmdq_db(struct erdma_cmdq *cmdq)
+{
+	u64 db_data = FIELD_PREP(ERDMA_CMD_HDR_WQEBB_INDEX_MASK, cmdq->sq.pi);
+
+	*(u64 *)cmdq->sq.db_info = db_data;
+	writeq(db_data, cmdq->sq.db_addr);
+}
+
+
+static struct erdma_comp_wait *get_comp_wait(struct erdma_cmdq *cmdq)
+{
+	int comp_idx;
+
+	spin_lock(&cmdq->lock);
+	comp_idx = find_first_zero_bit(cmdq->comp_wait_bitmap, cmdq->max_outstandings);
+	if (comp_idx == cmdq->max_outstandings) {
+		spin_unlock(&cmdq->lock);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	set_bit(comp_idx, cmdq->comp_wait_bitmap);
+	spin_unlock(&cmdq->lock);
+
+	return &cmdq->wait_pool[comp_idx];
+}
+
+static void put_comp_wait(struct erdma_cmdq *cmdq, struct erdma_comp_wait *comp_wait)
+{
+	int used;
+
+	cmdq->wait_pool[comp_wait->ctx_id].cmd_status = ERDMA_CMD_STATUS_INIT;
+	spin_lock(&cmdq->lock);
+	used = test_and_clear_bit(comp_wait->ctx_id, cmdq->comp_wait_bitmap);
+	spin_unlock(&cmdq->lock);
+
+	WARN_ON(!used);
+}
+
+static int erdma_cmdq_wait_res_init(struct erdma_dev *dev, struct erdma_cmdq *cmdq)
+{
+	int i;
+
+	cmdq->wait_pool = devm_kcalloc(&dev->pdev->dev, cmdq->max_outstandings,
+		sizeof(struct erdma_comp_wait), GFP_KERNEL);
+	if (!cmdq->wait_pool)
+		return -ENOMEM;
+
+	spin_lock_init(&cmdq->lock);
+	cmdq->comp_wait_bitmap = devm_kcalloc(&dev->pdev->dev,
+		BITS_TO_LONGS(cmdq->max_outstandings), sizeof(unsigned long), GFP_KERNEL);
+	if (!cmdq->comp_wait_bitmap) {
+		devm_kfree(&dev->pdev->dev, cmdq->wait_pool);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < cmdq->max_outstandings; i++) {
+		init_completion(&cmdq->wait_pool[i].wait_event);
+		cmdq->wait_pool[i].ctx_id = i;
+	}
+
+	return 0;
+}
+
+static int erdma_cmdq_sq_init(struct erdma_dev *dev)
+{
+	struct erdma_cmdq *cmdq = &dev->cmdq;
+	struct erdma_cmdq_sq *sq = &cmdq->sq;
+	u32 buf_size;
+
+	sq->wqebb_cnt = SQEBB_COUNT(ERDMA_CMDQ_SQE_SIZE);
+	sq->depth = cmdq->max_outstandings * sq->wqebb_cnt;
+
+	buf_size = sq->depth << SQEBB_SHIFT;
+
+	sq->qbuf = dma_alloc_coherent(&dev->pdev->dev, buf_size + ERDMA_EXTRA_BUFFER_SIZE,
+		&sq->qbuf_dma_addr, GFP_KERNEL);
+	if (!sq->qbuf)
+		return -ENOMEM;
+
+	sq->db_info = sq->qbuf + buf_size;
+
+	spin_lock_init(&sq->lock);
+
+	sq->ci = 0;
+	sq->pi = 0;
+	sq->total_cmds = 0;
+	sq->total_comp_cmds = 0;
+
+	sq->db_addr = (u64 __iomem *)(dev->func_bar + ERDMA_CMDQ_SQDB_REG);
+
+	erdma_reg_write32(dev, ERDMA_REGS_CMDQ_SQ_ADDR_H_REG, upper_32_bits(sq->qbuf_dma_addr));
+	erdma_reg_write32(dev, ERDMA_REGS_CMDQ_SQ_ADDR_L_REG, lower_32_bits(sq->qbuf_dma_addr));
+	erdma_reg_write32(dev, ERDMA_REGS_CMDQ_DEPTH_REG, sq->depth);
+	erdma_reg_write64(dev, ERDMA_CMDQ_SQ_DB_HOST_ADDR_REG, sq->qbuf_dma_addr + buf_size);
+
+	return 0;
+}
+
+static int erdma_cmdq_cq_init(struct erdma_dev *dev)
+{
+	struct erdma_cmdq *cmdq = &dev->cmdq;
+	struct erdma_cmdq_cq *cq = &cmdq->cq;
+	u32 buf_size;
+
+	cq->depth = cmdq->max_outstandings * 2;
+	buf_size = cq->depth << CQE_SHIFT;
+
+	cq->qbuf = dma_alloc_coherent(&dev->pdev->dev, buf_size + ERDMA_EXTRA_BUFFER_SIZE,
+			&cq->qbuf_dma_addr, GFP_KERNEL);
+	if (!cq->qbuf)
+		return -ENOMEM;
+
+	cq->db_info = cq->qbuf + buf_size;
+
+	memset(cq->qbuf, 0, buf_size + ERDMA_EXTRA_BUFFER_SIZE);
+
+	spin_lock_init(&cq->lock);
+
+	cq->db_addr = (u64 __iomem *)(dev->func_bar + ERDMA_CMDQ_CQDB_REG);
+	cq->ci = 0;
+	cq->owner = 1;
+
+	atomic64_set(&cq->cq_armed_num, 0);
+
+	erdma_reg_write32(dev, ERDMA_REGS_CMDQ_CQ_ADDR_H_REG, upper_32_bits(cq->qbuf_dma_addr));
+	erdma_reg_write32(dev, ERDMA_REGS_CMDQ_CQ_ADDR_L_REG, lower_32_bits(cq->qbuf_dma_addr));
+	erdma_reg_write64(dev, ERDMA_CMDQ_CQ_DB_HOST_ADDR_REG, cq->qbuf_dma_addr + buf_size);
+
+	return 0;
+}
+
+static int erdma_cmdq_eq_init(struct erdma_dev *dev)
+{
+	struct erdma_cmdq *cmdq = &dev->cmdq;
+	struct erdma_eq *eq = &cmdq->eq;
+	u32 buf_size;
+
+	eq->depth = cmdq->max_outstandings;
+	buf_size = eq->depth << EQE_SHIFT;
+
+	eq->qbuf = dma_alloc_coherent(&dev->pdev->dev, buf_size + ERDMA_EXTRA_BUFFER_SIZE,
+			&eq->qbuf_dma_addr, GFP_KERNEL);
+	if (!eq->qbuf)
+		return -ENOMEM;
+
+	eq->db_info = eq->qbuf + buf_size;
+
+	memset(eq->qbuf, 0, buf_size);
+	memset(eq->db_info, 0, 8);
+
+	spin_lock_init(&eq->lock);
+	atomic64_set(&eq->event_num, 0);
+
+	eq->db_addr = (u64 __iomem *)(dev->func_bar + ERDMA_REGS_CEQ_DB_BASE_REG);
+	eq->ci = 0;
+	eq->owner = 1;
+
+	erdma_reg_write32(dev, ERDMA_REGS_CMDQ_EQ_ADDR_H_REG, upper_32_bits(eq->qbuf_dma_addr));
+	erdma_reg_write32(dev, ERDMA_REGS_CMDQ_EQ_ADDR_L_REG, lower_32_bits(eq->qbuf_dma_addr));
+	erdma_reg_write32(dev, ERDMA_REGS_CMDQ_EQ_DEPTH_REG, eq->depth);
+	erdma_reg_write64(dev, ERDMA_CMDQ_EQ_DB_HOST_ADDR_REG, eq->qbuf_dma_addr + buf_size);
+
+	return 0;
+}
+
+int erdma_cmdq_init(struct erdma_dev *dev)
+{
+	int err, i;
+	struct erdma_cmdq *cmdq = &dev->cmdq;
+	u32 status, ctrl;
+
+	cmdq->max_outstandings = ERDMA_CMDQ_MAX_OUTSTANDING;
+	cmdq->dev = dev;
+	cmdq->use_event = 0;
+
+	sema_init(&cmdq->credits, cmdq->max_outstandings);
+
+	err = erdma_cmdq_wait_res_init(dev, cmdq);
+	if (err)
+		return err;
+
+	err = erdma_cmdq_sq_init(dev);
+	if (err)
+		return err;
+
+	err = erdma_cmdq_cq_init(dev);
+	if (err)
+		goto err_destroy_sq;
+
+	err = erdma_cmdq_eq_init(dev);
+	if (err)
+		goto err_destroy_cq;
+
+	ctrl = FIELD_PREP(ERDMA_REG_DEV_CTRL_INIT_MASK, 1);
+	erdma_reg_write32(dev, ERDMA_REGS_DEV_CTRL_REG, ctrl);
+
+	for (i = 0; i < ERDMA_WAIT_DEV_DONE_CNT; i++) {
+		status = erdma_reg_read32_field(dev, ERDMA_REGS_DEV_ST_REG,
+			ERDMA_REG_DEV_ST_INIT_DONE_MASK);
+		if (status)
+			break;
+
+		msleep(ERDMA_REG_ACCESS_WAIT_MS);
+	}
+
+	if (i == ERDMA_WAIT_DEV_DONE_CNT) {
+		dev_err(&dev->pdev->dev, "wait for device init done timed out.\n");
+		err = -ETIMEDOUT;
+		goto err_destroy_eq;
+	}
+
+	set_bit(ERDMA_CMDQ_STATE_OK_BIT, &cmdq->state);
+
+	return 0;
+
+err_destroy_eq:
+	dma_free_coherent(&dev->pdev->dev,
+		(cmdq->eq.depth << EQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+		cmdq->eq.qbuf, cmdq->eq.qbuf_dma_addr);
+
+err_destroy_cq:
+	dma_free_coherent(&dev->pdev->dev,
+		(cmdq->cq.depth << CQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+		cmdq->cq.qbuf, cmdq->cq.qbuf_dma_addr);
+
+err_destroy_sq:
+	dma_free_coherent(&dev->pdev->dev,
+		(cmdq->sq.depth << SQEBB_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+		cmdq->sq.qbuf, cmdq->sq.qbuf_dma_addr);
+
+	return err;
+}
+
+void erdma_finish_cmdq_init(struct erdma_dev *dev)
+{
+	/* After device init succeeds, switch cmdq to event mode. */
+	dev->cmdq.use_event = true;
+	arm_cmdq_cq(&dev->cmdq);
+}
+
+void erdma_cmdq_destroy(struct erdma_dev *dev)
+{
+	struct erdma_cmdq *cmdq = &dev->cmdq;
+
+	clear_bit(ERDMA_CMDQ_STATE_OK_BIT, &cmdq->state);
+
+	dma_free_coherent(&dev->pdev->dev,
+		(cmdq->eq.depth << EQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+		cmdq->eq.qbuf, cmdq->eq.qbuf_dma_addr);
+	dma_free_coherent(&dev->pdev->dev,
+		(cmdq->sq.depth << SQEBB_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+		cmdq->sq.qbuf, cmdq->sq.qbuf_dma_addr);
+	dma_free_coherent(&dev->pdev->dev,
+		(cmdq->cq.depth << CQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+		cmdq->cq.qbuf, cmdq->cq.qbuf_dma_addr);
+}
+
+static inline void *get_cmdq_sqe(struct erdma_cmdq *cmdq, u16 idx)
+{
+	idx &= (cmdq->sq.depth - 1);
+	return cmdq->sq.qbuf + (idx << SQEBB_SHIFT);
+}
+
+static inline void *get_cmdq_cqe(struct erdma_cmdq *cmdq, u16 idx)
+{
+	idx &= (cmdq->cq.depth - 1);
+	return cmdq->cq.qbuf + (idx << CQE_SHIFT);
+}
+
+static void push_cmdq_sqe(struct erdma_cmdq *cmdq, u64 *req,
+			 size_t req_len, struct erdma_comp_wait *comp_wait)
+{
+	__le64 *wqe;
+	u64 hdr = *req;
+
+	comp_wait->cmd_status = ERDMA_CMD_STATUS_ISSUED;
+	reinit_completion(&comp_wait->wait_event);
+	comp_wait->sq_pi = cmdq->sq.pi;
+
+	wqe = get_cmdq_sqe(cmdq, cmdq->sq.pi);
+	memcpy(wqe, req, req_len);
+
+	cmdq->sq.pi += cmdq->sq.wqebb_cnt;
+	hdr |= FIELD_PREP(ERDMA_CMD_HDR_WQEBB_INDEX_MASK, cmdq->sq.pi);
+	hdr |= FIELD_PREP(ERDMA_CMD_HDR_CONTEXT_COOKIE, comp_wait->ctx_id);
+	hdr |= FIELD_PREP(ERDMA_CMD_HDR_WQEBB_CNT_MASK, cmdq->sq.wqebb_cnt - 1);
+	*wqe = hdr;
+
+	cmdq->sq.total_cmds++;
+
+	kick_cmdq_db(cmdq);
+}
+
+static void erdma_poll_single_cmd_completion(struct erdma_cmdq *cmdq, __be32 *cqe)
+{
+	struct erdma_comp_wait *comp_wait;
+	u16 sqe_idx, ctx_id;
+	u64 *sqe;
+	int i;
+	u32 hdr0 = __be32_to_cpu(*cqe);
+
+	sqe_idx = __be32_to_cpu(*(cqe + 1));
+	sqe = (u64 *)get_cmdq_sqe(cmdq, sqe_idx);
+
+	ctx_id = FIELD_GET(ERDMA_CMD_HDR_CONTEXT_COOKIE, *sqe);
+	comp_wait = &cmdq->wait_pool[ctx_id];
+	if (comp_wait->cmd_status != ERDMA_CMD_STATUS_ISSUED)
+		return;
+
+	comp_wait->cmd_status = ERDMA_CMD_STATUS_FINISHED;
+	comp_wait->comp_status = FIELD_GET(ERDMA_CQE_HDR_SYNDROME_MASK, hdr0);
+	cmdq->sq.ci += cmdq->sq.wqebb_cnt;
+
+	for (i = 0; i < 4; i++)
+		comp_wait->comp_data[i] = __be32_to_cpu(*(cqe + 2 + i));
+
+	if (cmdq->use_event)
+		complete(&comp_wait->wait_event);
+}
+
+static void erdma_polling_cmd_completions(struct erdma_cmdq *cmdq)
+{
+	u32 hdr;
+	__be32 *cqe;
+	unsigned long flags;
+	u16 comp_num = 0;
+	u8 owner, expect_owner;
+	u16 cqe_idx;
+
+	spin_lock_irqsave(&cmdq->cq.lock, flags);
+
+	expect_owner = cmdq->cq.owner;
+	cqe_idx = cmdq->cq.ci & (cmdq->cq.depth - 1);
+
+	while (1) {
+		cqe = (__be32 *)get_cmdq_cqe(cmdq, cqe_idx);
+		hdr = __be32_to_cpu(READ_ONCE(*cqe));
+
+		owner = FIELD_GET(ERDMA_CQE_HDR_OWNER_MASK, hdr);
+		if (owner != expect_owner)
+			break;
+
+		dma_rmb();
+		erdma_poll_single_cmd_completion(cmdq, cqe);
+		comp_num++;
+		if (cqe_idx == cmdq->cq.depth - 1) {
+			cqe_idx = 0;
+			expect_owner = !expect_owner;
+		} else {
+			cqe_idx++;
+		}
+	}
+
+	if (comp_num) {
+		cmdq->cq.ci += comp_num;
+		cmdq->cq.owner = expect_owner;
+		cmdq->sq.total_comp_cmds += comp_num;
+
+		if (cmdq->use_event)
+			arm_cmdq_cq(cmdq);
+	}
+
+	spin_unlock_irqrestore(&cmdq->cq.lock, flags);
+}
+
+void erdma_cmdq_completion_handler(struct erdma_dev *dev)
+{
+	int cqn, got_event = 0;
+
+	if (!test_bit(ERDMA_CMDQ_STATE_OK_BIT, &dev->cmdq.state) || !dev->cmdq.use_event)
+		return;
+
+	while ((cqn = erdma_poll_ceq_event(&dev->cmdq.eq)) != -1)
+		got_event++;
+
+	if (got_event)
+		erdma_polling_cmd_completions(&dev->cmdq);
+
+	notify_eq(&dev->cmdq.eq);
+}
+
+static int erdma_poll_cmd_completion(struct erdma_comp_wait *comp_ctx,
+				     struct erdma_cmdq *cmdq, u32 timeout)
+{
+	unsigned long comp_timeout = jiffies + msecs_to_jiffies(timeout);
+
+	while (1) {
+		erdma_polling_cmd_completions(cmdq);
+		if (comp_ctx->cmd_status != ERDMA_CMD_STATUS_ISSUED)
+			break;
+
+		if (time_is_before_jiffies(comp_timeout))
+			return -ETIME;
+
+		msleep(20);
+	}
+
+	return 0;
+}
+
+static int erdma_wait_cmd_completion(struct erdma_comp_wait *comp_ctx,
+				     struct erdma_cmdq *cmdq, u32 timeout)
+{
+	unsigned long flags = 0;
+
+	wait_for_completion_timeout(&comp_ctx->wait_event,
+		msecs_to_jiffies(timeout));
+
+	if (unlikely(comp_ctx->cmd_status != ERDMA_CMD_STATUS_FINISHED)) {
+		spin_lock_irqsave(&cmdq->cq.lock, flags);
+		comp_ctx->cmd_status = ERDMA_CMD_STATUS_TIMEOUT;
+		spin_unlock_irqrestore(&cmdq->cq.lock, flags);
+		return -ETIME;
+	}
+
+	return 0;
+}
+
+int erdma_post_cmd_wait(struct erdma_cmdq *cmdq, u64 *req, u32 req_size, u64 *resp0, u64 *resp1)
+{
+	struct erdma_comp_wait *comp_wait;
+	int ret;
+
+	if (!test_bit(ERDMA_CMDQ_STATE_OK_BIT, &cmdq->state))
+		return -ENODEV;
+
+	down(&cmdq->credits);
+
+	comp_wait = get_comp_wait(cmdq);
+	if (IS_ERR(comp_wait)) {
+		clear_bit(ERDMA_CMDQ_STATE_OK_BIT, &cmdq->state);
+		set_bit(ERDMA_CMDQ_STATE_CTX_ERR_BIT, &cmdq->state);
+		up(&cmdq->credits);
+		return PTR_ERR(comp_wait);
+	}
+
+	spin_lock(&cmdq->sq.lock);
+	push_cmdq_sqe(cmdq, req, req_size, comp_wait);
+	spin_unlock(&cmdq->sq.lock);
+
+	if (cmdq->use_event)
+		ret = erdma_wait_cmd_completion(comp_wait, cmdq, ERDMA_CMDQ_TIMEOUT_MS);
+	else
+		ret = erdma_poll_cmd_completion(comp_wait, cmdq, ERDMA_CMDQ_TIMEOUT_MS);
+
+	if (ret) {
+		set_bit(ERDMA_CMDQ_STATE_TIMEOUT_BIT, &cmdq->state);
+		clear_bit(ERDMA_CMDQ_STATE_OK_BIT, &cmdq->state);
+		goto out;
+	}
+
+	ret = comp_wait->comp_status;
+
+	if (resp0 && resp1) {
+		*resp0 = *((u64 *)&comp_wait->comp_data[0]);
+		*resp1 = *((u64 *)&comp_wait->comp_data[2]);
+	}
+	put_comp_wait(cmdq, comp_wait);
+
+out:
+	up(&cmdq->credits);
+
+	return ret;
+}
-- 
2.27.0



* [PATCH rdma-next 05/11] RDMA/erdma: Add event queue implementation
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
                   ` (3 preceding siblings ...)
  2021-12-21  2:48 ` [PATCH rdma-next 04/11] RDMA/erdma: Add cmdq implementation Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 06/11] RDMA/erdma: Add verbs header file Cheng Xu
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

The event queue (EQ) is the main notification mechanism from the erdma
hardware to its driver. Each erdma device contains two kinds of EQs:
asynchronous EQ (AEQ) and completion EQ (CEQ). Each device has one AEQ,
which is used for RDMA async event reporting, and up to 32 CEQs (numbered
CEQ0 to CEQ31). CEQ0 is used for cmdq completion event reporting, and the
rest of the CEQs are used for RDMA completion event reporting.
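
The handlers below share one consumption pattern: drain all pending
events, then ring the doorbell to re-arm the EQ. Roughly (a hedged
sketch; dispatch_cq() is a hypothetical stand-in for the per-CQ dispatch
actually done in this patch):

	while ((cqn = erdma_poll_ceq_event(eq)) != -1)
		dispatch_cq(dev, cqn);

	notify_eq(eq);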

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_eq.c | 346 +++++++++++++++++++++++++
 1 file changed, 346 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_eq.c

diff --git a/drivers/infiniband/hw/erdma/erdma_eq.c b/drivers/infiniband/hw/erdma/erdma_eq.c
new file mode 100644
index 000000000000..67dae3a6a245
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_eq.c
@@ -0,0 +1,346 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/pci.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "erdma.h"
+#include "erdma_cm.h"
+#include "erdma_hw.h"
+#include "erdma_verbs.h"
+
+static inline int
+erdma_poll_aeq_event(struct erdma_eq *aeq, void *out)
+{
+	struct erdma_aeqe *aeqe = (struct erdma_aeqe *)aeq->qbuf + (aeq->ci & (aeq->depth - 1));
+	u32 val;
+
+	val = le32_to_cpu(READ_ONCE(aeqe->hdr));
+	if (FIELD_GET(ERDMA_AEQE_HDR_O_MASK, val) == aeq->owner) {
+		dma_rmb();
+		aeq->ci++;
+		if ((aeq->ci & (aeq->depth - 1)) == 0)
+			aeq->owner = !aeq->owner;
+
+		atomic64_inc(&aeq->event_num);
+		if (out)
+			memcpy(out, aeqe, sizeof(struct erdma_aeqe));
+
+		return 1;
+	}
+
+	return 0;
+}
+
+void erdma_aeq_event_handler(struct erdma_dev *dev)
+{
+	struct erdma_aeqe aeqe;
+	u32 cqn, qpn;
+	struct erdma_qp *qp;
+	struct erdma_cq *cq;
+	struct ib_event event;
+
+	memset(&event, 0, sizeof(event));
+	while (erdma_poll_aeq_event(&dev->aeq.eq, &aeqe)) {
+		if (FIELD_GET(ERDMA_AEQE_HDR_TYPE_MASK, le32_to_cpu(aeqe.hdr)) == ERDMA_AE_TYPE_CQ_ERR) {
+			cqn = le32_to_cpu(aeqe.event_data0);
+			cq = find_cq_by_cqn(dev, cqn);
+			if (!cq)
+				continue;
+			event.device = cq->ibcq.device;
+			event.element.cq = &cq->ibcq;
+			event.event = IB_EVENT_CQ_ERR;
+			if (cq->ibcq.event_handler)
+				cq->ibcq.event_handler(&event, cq->ibcq.cq_context);
+		} else {
+			qpn = le32_to_cpu(aeqe.event_data0);
+			qp = find_qp_by_qpn(dev, qpn);
+			if (!qp)
+				continue;
+
+			event.device = qp->ibqp.device;
+			event.element.qp = &qp->ibqp;
+			event.event = IB_EVENT_QP_FATAL;
+			if (qp->ibqp.event_handler)
+				qp->ibqp.event_handler(&event, qp->ibqp.qp_context);
+		}
+	}
+
+	notify_eq(&dev->aeq.eq);
+}
+
+int erdma_aeq_init(struct erdma_dev *dev)
+{
+	struct erdma_eq *eq = &dev->aeq.eq;
+	u32 buf_size = ERDMA_DEFAULT_EQ_DEPTH << EQE_SHIFT;
+
+	eq->qbuf = dma_alloc_coherent(&dev->pdev->dev, buf_size + ERDMA_EXTRA_BUFFER_SIZE,
+		&eq->qbuf_dma_addr, GFP_KERNEL);
+	if (!eq->qbuf)
+		return -ENOMEM;
+
+	eq->db_info = eq->qbuf + buf_size;
+
+	memset(eq->qbuf, 0, buf_size);
+	memset(eq->db_info, 0, 8);
+
+	spin_lock_init(&eq->lock);
+	atomic64_set(&eq->event_num, 0);
+	atomic64_set(&eq->notify_num, 0);
+
+	eq->depth = ERDMA_DEFAULT_EQ_DEPTH;
+	eq->db_addr = (u64 __iomem *)(dev->func_bar + ERDMA_REGS_AEQ_DB_REG);
+	eq->ci = 0;
+
+	eq->owner = 1;
+	dev->aeq.dev = dev;
+
+	dev->aeq.ready = 1;
+
+	erdma_reg_write32(dev, ERDMA_REGS_AEQ_ADDR_H_REG, upper_32_bits(eq->qbuf_dma_addr));
+	erdma_reg_write32(dev, ERDMA_REGS_AEQ_ADDR_L_REG, lower_32_bits(eq->qbuf_dma_addr));
+	erdma_reg_write32(dev, ERDMA_REGS_AEQ_DEPTH_REG, eq->depth);
+	erdma_reg_write64(dev, ERDMA_AEQ_DB_HOST_ADDR_REG, eq->qbuf_dma_addr + buf_size);
+
+	return 0;
+}
+
+void erdma_aeq_destroy(struct erdma_dev *dev)
+{
+	struct erdma_eq *eq = &dev->aeq.eq;
+	u32 buf_size  = ERDMA_DEFAULT_EQ_DEPTH << EQE_SHIFT;
+
+	dev->aeq.ready = 0;
+
+	dma_free_coherent(&dev->pdev->dev, buf_size + ERDMA_EXTRA_BUFFER_SIZE,
+		eq->qbuf, eq->qbuf_dma_addr);
+}
+
+
+#define MAX_POLL_CHUNK_SIZE 16
+void erdma_ceq_completion_handler(struct erdma_eq_cb *ceq_cb)
+{
+	int cqn;
+	struct erdma_cq *cq;
+	struct erdma_dev *dev = ceq_cb->dev;
+	u32 poll_cnt = 0;
+
+	if (!ceq_cb->ready)
+		return;
+
+	while ((cqn = erdma_poll_ceq_event(&ceq_cb->eq)) != -1) {
+		poll_cnt++;
+		if (cqn == 0)
+			continue;
+
+		cq = find_cq_by_cqn(dev, cqn);
+		if (!cq)
+			continue;
+
+		if (cq->ibcq.comp_handler)
+			cq->ibcq.comp_handler(&cq->ibcq, cq->ibcq.cq_context);
+
+		if (poll_cnt >= MAX_POLL_CHUNK_SIZE)
+			break;
+	}
+
+	notify_eq(&ceq_cb->eq);
+}
+
+
+static irqreturn_t erdma_intr_ceq_handler(int irq, void *data)
+{
+	struct erdma_eq_cb *ceq_cb = data;
+
+	tasklet_schedule(&ceq_cb->tasklet);
+
+	return IRQ_HANDLED;
+}
+
+static void erdma_intr_ceq_task(unsigned long data)
+{
+	erdma_ceq_completion_handler((struct erdma_eq_cb *)data);
+}
+
+static int erdma_set_ceq_irq(struct erdma_dev *dev, u16 eqn)
+{
+	u32 cpu;
+	int err;
+	struct erdma_irq_info *irq_info = &dev->ceqs[eqn - 1].irq_info;
+
+	snprintf(irq_info->name, ERDMA_IRQNAME_SIZE, "erdma-ceq%u@pci:%s",
+		eqn - 1, pci_name(dev->pdev));
+	irq_info->handler = erdma_intr_ceq_handler;
+	irq_info->data = &dev->ceqs[eqn - 1];
+	irq_info->msix_vector = pci_irq_vector(dev->pdev, eqn);
+
+	tasklet_init(&dev->ceqs[eqn - 1].tasklet, erdma_intr_ceq_task,
+			(unsigned long)&dev->ceqs[eqn - 1]);
+
+	cpu = cpumask_local_spread(eqn, dev->numa_node);
+	irq_info->cpu = cpu;
+	cpumask_set_cpu(cpu, &irq_info->affinity_hint_mask);
+	dev_info(&dev->pdev->dev, "setup irq:%p vector:%d name:%s\n",
+		 irq_info,
+		 irq_info->msix_vector,
+		 irq_info->name);
+
+	err = request_irq(irq_info->msix_vector, irq_info->handler, 0,
+		irq_info->name, irq_info->data);
+	if (err) {
+		dev_err(&dev->pdev->dev, "failed to request_irq(%d)\n", err);
+		return err;
+	}
+
+	irq_set_affinity_hint(irq_info->msix_vector, &irq_info->affinity_hint_mask);
+
+	return 0;
+}
+
+
+static void erdma_free_ceq_irq(struct erdma_dev *dev, u16 eqn)
+{
+	struct erdma_irq_info *irq_info = &dev->ceqs[eqn - 1].irq_info;
+
+	irq_set_affinity_hint(irq_info->msix_vector, NULL);
+	free_irq(irq_info->msix_vector, irq_info->data);
+}
+
+static inline int
+create_eq_cmd(struct erdma_dev *dev, u32 eqn, struct erdma_eq *eq)
+{
+	int err;
+	struct erdma_cmdq_create_eq_req req;
+	dma_addr_t db_info_dma_addr;
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_COMMON, CMDQ_OPCODE_CREATE_EQ);
+	req.eqn = eqn;
+	req.depth = ilog2(eq->depth);
+	req.qbuf_addr = eq->qbuf_dma_addr;
+	req.qtype = 1; /* CEQ */
+	/* Vector index is the same as EQN. */
+	req.vector_idx = eqn;
+	db_info_dma_addr = eq->qbuf_dma_addr + (eq->depth << EQE_SHIFT);
+	req.db_dma_addr_l = lower_32_bits(db_info_dma_addr);
+	req.db_dma_addr_h = upper_32_bits(db_info_dma_addr);
+
+	err = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req,
+		sizeof(struct erdma_cmdq_create_eq_req), NULL, NULL);
+	if (err) {
+		dev_err(&dev->pdev->dev,
+			"create EQ command failed, err = %d.\n", err);
+		return err;
+	}
+
+	return 0;
+}
+
+static int erdma_ceq_init_one(struct erdma_dev *dev, u16 eqn)
+{
+	/* CEQs are indexed from 1; 0 is reserved for the cmdq EQ. */
+	struct erdma_eq *eq = &dev->ceqs[eqn - 1].eq;
+	u32 buf_size = ERDMA_DEFAULT_EQ_DEPTH << EQE_SHIFT;
+	int ret;
+
+	eq->qbuf = dma_alloc_coherent(&dev->pdev->dev, buf_size + ERDMA_EXTRA_BUFFER_SIZE,
+		&eq->qbuf_dma_addr, GFP_KERNEL);
+	if (!eq->qbuf)
+		return -ENOMEM;
+
+	eq->db_info = eq->qbuf + buf_size;
+
+	memset(eq->qbuf, 0, buf_size);
+	memset(eq->db_info, 0, ERDMA_EXTRA_BUFFER_SIZE);
+
+	spin_lock_init(&eq->lock);
+	atomic64_set(&eq->event_num, 0);
+	atomic64_set(&eq->notify_num, 0);
+
+	eq->depth = ERDMA_DEFAULT_EQ_DEPTH;
+	eq->db_addr = (u64 __iomem *)(dev->func_bar + ERDMA_REGS_CEQ_DB_BASE_REG + eqn * 8);
+	eq->ci = 0;
+	eq->owner = 1;
+	dev->ceqs[eqn - 1].dev = dev;
+
+	ret = create_eq_cmd(dev, eqn, eq);
+	if (ret) {
+		dev->ceqs[eqn - 1].ready = 0;
+		return ret;
+	}
+
+	dev->ceqs[eqn - 1].ready = 1;
+
+	return ret;
+}
+
+static void erdma_ceq_uninit_one(struct erdma_dev *dev, u16 eqn)
+{
+	struct erdma_eq *eq = &dev->ceqs[eqn - 1].eq;
+	u32 buf_size = ERDMA_DEFAULT_EQ_DEPTH << EQE_SHIFT;
+	struct erdma_cmdq_destroy_eq_req req;
+	int err;
+
+	dev->ceqs[eqn - 1].ready = 0;
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_COMMON, CMDQ_OPCODE_DESTROY_EQ);
+	req.eqn = eqn;
+	req.qtype = 1;
+	req.vector_idx = eqn;
+
+	err = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req, sizeof(req), NULL, NULL);
+	if (err) {
+		dev_err(&dev->pdev->dev,
+			"destroy EQ command failed, err = %d.\n", err);
+		return;
+	}
+
+	dma_free_coherent(&dev->pdev->dev, buf_size + ERDMA_EXTRA_BUFFER_SIZE,
+		eq->qbuf, eq->qbuf_dma_addr);
+}
+
+int erdma_ceqs_init(struct erdma_dev *dev)
+{
+	u32 i, j;
+	int err = 0;
+
+	for (i = 1; i < dev->irq_num; i++) {
+		err = erdma_ceq_init_one(dev, i);
+		if (err)
+			goto out_err;
+
+		err = erdma_set_ceq_irq(dev, i);
+		if (err) {
+			erdma_ceq_uninit_one(dev, i);
+			goto out_err;
+		}
+	}
+
+	return 0;
+
+out_err:
+	for (j = 1; j < i; j++) {
+		erdma_free_ceq_irq(dev, j);
+		erdma_ceq_uninit_one(dev, j);
+	}
+
+	return err;
+}
+
+void erdma_ceqs_uninit(struct erdma_dev *dev)
+{
+	u32 i;
+
+	for (i = 1; i < dev->irq_num; i++) {
+		erdma_free_ceq_irq(dev, i);
+		erdma_ceq_uninit_one(dev, i);
+	}
+}
-- 
2.27.0



* [PATCH rdma-next 06/11] RDMA/erdma: Add verbs header file
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
                   ` (4 preceding siblings ...)
  2021-12-21  2:48 ` [PATCH rdma-next 05/11] RDMA/erdma: Add event queue implementation Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21 13:28   ` Leon Romanovsky
  2021-12-21  2:48 ` [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation Cheng Xu
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

This header file defines the main structures and functions used for RDMA
verbs, including qp, cq, mr, ucontext, etc.
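
These wrapper structures embed their ib_* counterparts and are recovered
with container_of(); for instance (an illustrative helper mirroring the
to_edev() pattern from patch 03, not a declaration from this header):

static inline struct erdma_cq *to_ecq(struct ib_cq *ibcq)
{
	return container_of(ibcq, struct erdma_cq, ibcq);
}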

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_verbs.h | 366 ++++++++++++++++++++++
 1 file changed, 366 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.h

diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.h b/drivers/infiniband/hw/erdma/erdma_verbs.h
new file mode 100644
index 000000000000..6eda8843d0d5
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.h
@@ -0,0 +1,366 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+
+#ifndef __ERDMA_VERBS_H__
+#define __ERDMA_VERBS_H__
+
+#include <linux/errno.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "erdma.h"
+#include "erdma_cm.h"
+#include "erdma_hw.h"
+
+/* RDMA Capability. */
+#define ERDMA_MAX_PD         (128 * 1024)
+#define ERDMA_MAX_SEND_WR    4096
+#define ERDMA_MAX_ORD        128
+#define ERDMA_MAX_IRD        128
+#define ERDMA_MAX_SGE_RD     1
+#define ERDMA_MAX_FMR        0
+#define ERDMA_MAX_SRQ        0 /* SRQ not supported yet. */
+#define ERDMA_MAX_SRQ_WR     0 /* SRQ not supported yet. */
+#define ERDMA_MAX_SRQ_SGE    0 /* SRQ not supported yet. */
+#define ERDMA_MAX_CONTEXT    (128 * 1024)
+#define ERDMA_MAX_SEND_SGE   6
+#define ERDMA_MAX_RECV_SGE   1
+#define ERDMA_MAX_INLINE     (sizeof(struct erdma_sge) * (ERDMA_MAX_SEND_SGE))
+#define ERDMA_MAX_FRMR_PA    512
+
+enum {
+	ERDMA_MMAP_IO_NC = 0,  /* no cache */
+};
+
+struct erdma_user_mmap_entry {
+	struct rdma_user_mmap_entry rdma_entry;
+	u64 address;
+	u8 mmap_flag;
+};
+
+struct erdma_ucontext {
+	struct ib_ucontext ibucontext;
+	struct erdma_dev *dev;
+
+	u32 sdb_type;
+	u32 sdb_idx;
+	u32 sdb_page_idx;
+	u32 sdb_page_off;
+	u64 sdb;
+	u64 rdb;
+	u64 cdb;
+
+	struct rdma_user_mmap_entry *sq_db_mmap_entry;
+	struct rdma_user_mmap_entry *rq_db_mmap_entry;
+	struct rdma_user_mmap_entry *cq_db_mmap_entry;
+
+	/* doorbell records */
+	struct list_head dbrecords_page_list;
+	struct mutex dbrecords_page_mutex;
+};
+
+
+struct erdma_pd {
+	struct ib_pd ibpd;
+	u32 pdn;
+};
+
+/*
+ * Memory region (MR) definitions.
+ */
+#define ERDMA_MAX_INLINE_MTT_ENTRIES 4
+#define MTT_SIZE(x) ((x) << 3) /* each MTT entry takes 8 bytes. */
+#define ERDMA_MR_MAX_MTT_CNT  524288
+#define ERDMA_MTT_ENTRY_SIZE  8
+
+#define ERDMA_MR_TYPE_NORMAL  0
+#define ERDMA_MR_TYPE_FRMR    1
+#define ERDMA_MR_TYPE_DMA     2
+
+#define ERDMA_MR_INLINE_MTT   0
+#define ERDMA_MR_INDIRECT_MTT 1
+
+#define ERDMA_MR_ACC_LR BIT(0)
+#define ERDMA_MR_ACC_LW BIT(1)
+#define ERDMA_MR_ACC_RR BIT(2)
+#define ERDMA_MR_ACC_RW BIT(3)
+
+struct erdma_mem {
+	struct ib_umem *umem;
+	void *mtt_buf;
+	u32 mtt_type;
+	u32 page_size;
+	u32 page_offset;
+	u32 page_cnt;
+	u32 mtt_nents;
+
+	u64 va;
+	u64 len;
+
+	u64 mtt_entry[ERDMA_MAX_INLINE_MTT_ENTRIES];
+};
+
+struct erdma_mr {
+	struct ib_mr ibmr;
+	struct erdma_mem mem;
+	u8 type;
+	u8 access;
+	u8 valid;
+};
+
+struct erdma_user_dbrecords_page {
+	struct list_head list;
+	struct ib_umem *umem;
+	u64 va;
+	int refcnt;
+};
+
+struct erdma_uqp {
+	struct erdma_mem sq_mtt;
+	struct erdma_mem rq_mtt;
+
+	dma_addr_t sq_db_info_dma_addr;
+	dma_addr_t rq_db_info_dma_addr;
+
+	struct erdma_user_dbrecords_page *user_dbr_page;
+
+	u32 rq_offset;
+};
+
+struct erdma_kqp {
+	u16 sq_pi;
+	u16 sq_ci;
+
+	u16 rq_pi;
+	u16 rq_ci;
+
+	u64 *swr_tbl;
+	u64 *rwr_tbl;
+
+	void *hw_sq_db;
+	void *hw_rq_db;
+
+	void *sq_buf;
+	dma_addr_t sq_buf_dma_addr;
+
+	void *rq_buf;
+	dma_addr_t rq_buf_dma_addr;
+
+	void *sq_db_info;
+	void *rq_db_info;
+
+	u8 sig_all;
+};
+
+enum erdma_qp_state {
+	ERDMA_QP_STATE_IDLE      = 0,
+	ERDMA_QP_STATE_RTR       = 1,
+	ERDMA_QP_STATE_RTS       = 2,
+	ERDMA_QP_STATE_CLOSING   = 3,
+	ERDMA_QP_STATE_TERMINATE = 4,
+	ERDMA_QP_STATE_ERROR     = 5,
+	ERDMA_QP_STATE_UNDEF     = 7,
+	ERDMA_QP_STATE_COUNT     = 8
+};
+
+enum erdma_qp_flags {
+	ERDMA_BIND_ENABLED	= (1 << 0),
+	ERDMA_WRITE_ENABLED	= (1 << 1),
+	ERDMA_READ_ENABLED	= (1 << 2)
+};
+
+enum erdma_qp_attr_mask {
+	ERDMA_QP_ATTR_STATE             = (1 << 0),
+	ERDMA_QP_ATTR_ACCESS_FLAGS      = (1 << 1),
+	ERDMA_QP_ATTR_LLP_HANDLE        = (1 << 2),
+	ERDMA_QP_ATTR_ORD               = (1 << 3),
+	ERDMA_QP_ATTR_IRD               = (1 << 4),
+	ERDMA_QP_ATTR_SQ_SIZE           = (1 << 5),
+	ERDMA_QP_ATTR_RQ_SIZE           = (1 << 6),
+	ERDMA_QP_ATTR_MPA               = (1 << 7)
+};
+
+struct erdma_qp_attrs {
+	enum erdma_qp_state state;
+	u32 sq_size;
+	u32 rq_size;
+	u32 orq_size;
+	u32 irq_size;
+	u32 max_send_sge;
+	u32 max_recv_sge;
+	enum erdma_qp_flags flags;
+
+	struct socket *llp_stream_handle;
+	u32 sip;
+	u32 dip;
+	u16 sport;
+	u16 dport;
+	u16 origin_sport;
+	u32 remote_qpn;
+};
+
+struct erdma_qp {
+	struct ib_qp ibqp;
+	struct kref ref;
+	struct completion safe_free;
+	struct erdma_dev *dev;
+	struct erdma_cep *cep;
+	struct rw_semaphore state_lock;
+	bool is_kernel_qp;
+
+	union {
+		struct erdma_kqp kern_qp;
+		struct erdma_uqp user_qp;
+	};
+
+	struct erdma_cq *scq;
+	struct erdma_cq *rcq;
+
+	struct erdma_qp_attrs attrs;
+	spinlock_t lock;
+
+	u8 cc_method;
+#define ERDMA_QP_TYPE_CLIENT 0
+#define ERDMA_QP_TYPE_SERVER 1
+	u8 qp_type;
+	u8 private_data_len;
+};
+
+struct erdma_kcq_info {
+	struct erdma_cqe *qbuf;
+	dma_addr_t qbuf_dma_addr;
+	u32 ci;
+	u32 owner;
+	u32 cmdsn;
+	void *db;
+	spinlock_t lock;
+	void *db_info;
+};
+
+struct erdma_ucq_info {
+	struct erdma_mem qbuf_mtt;
+	struct erdma_user_dbrecords_page *user_dbr_page;
+	dma_addr_t db_info_dma_addr;
+};
+
+struct erdma_cq {
+	struct ib_cq ibcq;
+	u32 cqn;
+
+	u32 depth;
+	u32 assoc_eqn;
+	u32 is_kernel_cq;
+
+	union {
+		struct erdma_kcq_info kern_cq;
+		struct erdma_ucq_info user_cq;
+	};
+};
+
+#define QP_ID(qp) ((qp)->ibqp.qp_num)
+
+static inline struct erdma_qp *find_qp_by_qpn(struct erdma_dev *dev, int id)
+{
+	return (struct erdma_qp *)xa_load(&dev->qp_xa, id);
+}
+
+static inline struct erdma_cq *find_cq_by_cqn(struct erdma_dev *dev, int id)
+{
+	return (struct erdma_cq *)xa_load(&dev->cq_xa, id);
+}
+
+extern void erdma_qp_get(struct erdma_qp *qp);
+extern void erdma_qp_put(struct erdma_qp *qp);
+extern int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
+				    enum erdma_qp_attr_mask mask);
+extern void erdma_qp_llp_close(struct erdma_qp *qp);
+extern void erdma_qp_cm_drop(struct erdma_qp *qp, int sched);
+
+static inline struct erdma_ucontext *to_ectx(struct ib_ucontext *ibctx)
+{
+	return container_of(ibctx, struct erdma_ucontext, ibucontext);
+}
+
+static inline struct erdma_pd *to_epd(struct ib_pd *pd)
+{
+	return container_of(pd, struct erdma_pd, ibpd);
+}
+
+static inline struct erdma_mr *to_emr(struct ib_mr *ibmr)
+{
+	return container_of(ibmr, struct erdma_mr, ibmr);
+}
+
+static inline struct erdma_qp *to_eqp(struct ib_qp *qp)
+{
+	return container_of(qp, struct erdma_qp, ibqp);
+}
+
+static inline struct erdma_cq *to_ecq(struct ib_cq *ibcq)
+{
+	return container_of(ibcq, struct erdma_cq, ibcq);
+}
+
+static inline struct erdma_user_mmap_entry *to_emmap(struct rdma_user_mmap_entry *ibmmap)
+{
+	return container_of(ibmmap, struct erdma_user_mmap_entry, rdma_entry);
+}
+
+static inline void *get_sq_entry(struct erdma_qp *qp, u16 idx)
+{
+	idx &= (qp->attrs.sq_size - 1);
+	return qp->kern_qp.sq_buf + (idx << SQEBB_SHIFT);
+}
+
+extern int erdma_alloc_ucontext(struct ib_ucontext *ctx, struct ib_udata *data);
+extern void erdma_dealloc_ucontext(struct ib_ucontext *ctx);
+extern int erdma_query_device(struct ib_device *dev, struct ib_device_attr *attr,
+			      struct ib_udata *data);
+extern int erdma_get_port_immutable(struct ib_device *dev, u32 port,
+				    struct ib_port_immutable *ib_port_immutable);
+extern int erdma_create_cq(struct ib_cq *cq,
+			   const struct ib_cq_init_attr *attr,
+			   struct ib_udata *data);
+extern int erdma_query_port(struct ib_device *dev, u32 port, struct ib_port_attr *attr);
+extern int erdma_query_pkey(struct ib_device *dev, u32 port, u16 idx, u16 *pkey);
+extern int erdma_query_gid(struct ib_device *dev, u32 port, int idx, union ib_gid *gid);
+extern int erdma_alloc_pd(struct ib_pd *pd, struct ib_udata *data);
+extern int erdma_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata);
+extern int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
+				   struct ib_udata *data);
+extern int erdma_query_qp(struct ib_qp *qp, struct ib_qp_attr *attr, int mask,
+			struct ib_qp_init_attr *init_attr);
+extern int erdma_modify_qp(struct ib_qp *qp, struct ib_qp_attr *attr, int mask,
+			      struct ib_udata *data);
+extern int erdma_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata);
+extern int erdma_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata);
+extern int erdma_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags);
+extern struct ib_mr *erdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
+				      u64 virt, int access, struct ib_udata *udata);
+extern struct ib_mr *erdma_get_dma_mr(struct ib_pd *ibpd, int rights);
+extern int erdma_dereg_mr(struct ib_mr *mr, struct ib_udata *data);
+extern int erdma_mmap(struct ib_ucontext *ctx, struct vm_area_struct *vma);
+extern void erdma_qp_get_ref(struct ib_qp *qp);
+extern void erdma_qp_put_ref(struct ib_qp *qp);
+extern struct ib_qp *erdma_get_ibqp(struct ib_device *dev, int id);
+extern int erdma_post_send(struct ib_qp *qp, const struct ib_send_wr *send_wr,
+			   const struct ib_send_wr **bad_send_wr);
+extern int erdma_post_recv(struct ib_qp *qp, const struct ib_recv_wr *recv_wr,
+			   const struct ib_recv_wr **bad_recv_wr);
+extern int erdma_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc);
+extern struct ib_mr *erdma_ib_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
+				       u32 max_num_sg);
+extern int erdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+			   int sg_nents, unsigned int *sg_offset);
+extern struct net_device *erdma_get_netdev(struct ib_device *device, u32 port_num);
+extern void erdma_disassociate_ucontext(struct ib_ucontext *ibcontext);
+extern void erdma_port_event(struct erdma_dev *dev, enum ib_event_type reason);
+
+#endif /* __ERDMA_VERBS_H__ */
-- 
2.27.0


* [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
                   ` (5 preceding siblings ...)
  2021-12-21  2:48 ` [PATCH rdma-next 06/11] RDMA/erdma: Add verbs header file Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21 13:32   ` Leon Romanovsky
  2021-12-21  2:48 ` [PATCH rdma-next 08/11] RDMA/erdma: Add connection management (CM) support Cheng Xu
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

The RDMA verbs implementation of erdma is divided into three files:
erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internally used functions and
the datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c; the
rest is in erdma_verbs.c.
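
As a rough sketch (not part of this patch), a kernel ULP reaches the
datapath entry points added here through the IB core; the function
example_send_and_poll and its busy-wait loop are purely illustrative:

	static int example_send_and_poll(struct ib_qp *qp, struct ib_cq *scq,
					 struct ib_sge *sge)
	{
		const struct ib_send_wr *bad_wr;
		struct ib_send_wr wr = {
			.opcode     = IB_WR_SEND,
			.send_flags = IB_SEND_SIGNALED,
			.sg_list    = sge,
			.num_sge    = 1,
		};
		struct ib_wc wc;
		int ret;

		ret = ib_post_send(qp, &wr, &bad_wr);	/* -> erdma_post_send() */
		if (ret)
			return ret;

		do {					/* -> erdma_poll_cq() */
			ret = ib_poll_cq(scq, 1, &wc);
		} while (ret == 0);

		if (ret < 0)
			return ret;

		return wc.status == IB_WC_SUCCESS ? 0 : -EIO;
	}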

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
 drivers/infiniband/hw/erdma/erdma_qp.c    |  624 +++++++++
 drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++++
 3 files changed, 2302 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c

diff --git a/drivers/infiniband/hw/erdma/erdma_cq.c b/drivers/infiniband/hw/erdma/erdma_cq.c
new file mode 100644
index 000000000000..11bf1b26c05a
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_cq.c
@@ -0,0 +1,201 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+
+#include <rdma/ib_verbs.h>
+
+#include "erdma_hw.h"
+#include "erdma_verbs.h"
+
+static inline int erdma_cq_notempty(struct erdma_cq *cq)
+{
+	struct erdma_cqe *cqe;
+	unsigned long flags;
+	u32 hdr;
+
+	spin_lock_irqsave(&cq->kern_cq.lock, flags);
+
+	cqe = &cq->kern_cq.qbuf[cq->kern_cq.ci & (cq->depth - 1)];
+	hdr = be32_to_cpu(READ_ONCE(cqe->hdr));
+	if (FIELD_GET(ERDMA_CQE_HDR_OWNER_MASK, hdr) != cq->kern_cq.owner) {
+		spin_unlock_irqrestore(&cq->kern_cq.lock, flags);
+		return 0;
+	}
+
+	spin_unlock_irqrestore(&cq->kern_cq.lock, flags);
+	return 1;
+}
+
+static inline void notify_cq(struct erdma_cq *cq, u8 solicited)
+{
+	u64 db_data = FIELD_PREP(ERDMA_CQDB_EQN_MASK, cq->assoc_eqn) |
+		FIELD_PREP(ERDMA_CQDB_CQN_MASK, cq->cqn) |
+		FIELD_PREP(ERDMA_CQDB_ARM_MASK, 1) |
+		FIELD_PREP(ERDMA_CQDB_SOL_MASK, solicited) |
+		FIELD_PREP(ERDMA_CQDB_CMDSN_MASK, cq->kern_cq.cmdsn) |
+		FIELD_PREP(ERDMA_CQDB_CI_MASK, cq->kern_cq.ci);
+
+	*(u64 *)cq->kern_cq.db_info = db_data;
+	writeq(db_data, cq->kern_cq.db);
+}
+
+int erdma_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags)
+{
+	struct erdma_cq *cq = to_ecq(ibcq);
+	int ret = 0;
+
+	notify_cq(cq, (flags & IB_CQ_SOLICITED_MASK) == IB_CQ_SOLICITED);
+
+	if (flags & IB_CQ_REPORT_MISSED_EVENTS)
+		ret = erdma_cq_notempty(cq);
+
+	return ret;
+}
+
+static const struct {
+	enum erdma_opcode erdma;
+	enum ib_wc_opcode base;
+} map_cqe_opcode[ERDMA_NUM_OPCODES] = {
+	{ ERDMA_OP_WRITE, IB_WC_RDMA_WRITE },
+	{ ERDMA_OP_READ, IB_WC_RDMA_READ },
+	{ ERDMA_OP_SEND, IB_WC_SEND },
+	{ ERDMA_OP_SEND_WITH_IMM, IB_WC_SEND },
+	{ ERDMA_OP_RECEIVE, IB_WC_RECV },
+	{ ERDMA_OP_RECV_IMM, IB_WC_RECV_RDMA_WITH_IMM },
+	{ ERDMA_OP_RECV_INV, IB_WC_LOCAL_INV }, /* confirm afterwards */
+	{ ERDMA_OP_REQ_ERR, IB_WC_RECV }, /* remove afterwards */
+	{ ERDNA_OP_READ_RESPONSE, IB_WC_RECV }, /* cannot appear */
+	{ ERDMA_OP_WRITE_WITH_IMM, IB_WC_RDMA_WRITE },
+	{ ERDMA_OP_RECV_ERR, IB_WC_RECV_RDMA_WITH_IMM }, /* cannot appear */
+	{ ERDMA_OP_INVALIDATE, IB_WC_LOCAL_INV },
+	{ ERDMA_OP_RSP_SEND_IMM, IB_WC_RECV },
+	{ ERDMA_OP_SEND_WITH_INV, IB_WC_SEND },
+	{ ERDMA_OP_REG_MR, IB_WC_REG_MR },
+	{ ERDMA_OP_LOCAL_INV, IB_WC_LOCAL_INV },
+	{ ERDMA_OP_READ_WITH_INV, IB_WC_RDMA_READ },
+};
+
+static const struct {
+	enum erdma_wc_status erdma;
+	enum ib_wc_status base;
+	enum erdma_vendor_err vendor;
+} map_cqe_status[ERDMA_NUM_WC_STATUS] = {
+	{ ERDMA_WC_SUCCESS, IB_WC_SUCCESS, ERDMA_WC_VENDOR_NO_ERR },
+	{ ERDMA_WC_GENERAL_ERR, IB_WC_GENERAL_ERR, ERDMA_WC_VENDOR_NO_ERR },
+	{ ERDMA_WC_RECV_WQE_FORMAT_ERR, IB_WC_GENERAL_ERR, ERDMA_WC_VENDOR_INVALID_RQE },
+	{ ERDMA_WC_RECV_STAG_INVALID_ERR, IB_WC_REM_ACCESS_ERR,
+			ERDMA_WC_VENDOR_RQE_INVALID_STAG },
+	{ ERDMA_WC_RECV_ADDR_VIOLATION_ERR, IB_WC_REM_ACCESS_ERR,
+			ERDMA_WC_VENDOR_RQE_ADDR_VIOLATION },
+	{ ERDMA_WC_RECV_RIGHT_VIOLATION_ERR, IB_WC_REM_ACCESS_ERR,
+			ERDMA_WC_VENDOR_RQE_ACCESS_RIGHT_ERR },
+	{ ERDMA_WC_RECV_PDID_ERR, IB_WC_REM_ACCESS_ERR, ERDMA_WC_VENDOR_RQE_INVALID_PD },
+	{ ERDMA_WC_RECV_WARRPING_ERR, IB_WC_REM_ACCESS_ERR, ERDMA_WC_VENDOR_RQE_WRAP_ERR },
+	{ ERDMA_WC_SEND_WQE_FORMAT_ERR, IB_WC_LOC_QP_OP_ERR, ERDMA_WC_VENDOR_INVALID_SQE },
+	{ ERDMA_WC_SEND_WQE_ORD_EXCEED, IB_WC_GENERAL_ERR, ERDMA_WC_VENDOR_ZERO_ORD },
+	{ ERDMA_WC_SEND_STAG_INVALID_ERR, IB_WC_LOC_ACCESS_ERR,
+			ERDMA_WC_VENDOR_SQE_INVALID_STAG },
+	{ ERDMA_WC_SEND_ADDR_VIOLATION_ERR, IB_WC_LOC_ACCESS_ERR,
+			ERDMA_WC_VENDOR_SQE_ADDR_VIOLATION },
+	{ ERDMA_WC_SEND_RIGHT_VIOLATION_ERR, IB_WC_LOC_ACCESS_ERR,
+			ERDMA_WC_VENDOR_SQE_ACCESS_ERR },
+	{ ERDMA_WC_SEND_PDID_ERR, IB_WC_LOC_ACCESS_ERR, ERDMA_WC_VENDOR_SQE_INVALID_PD },
+	{ ERDMA_WC_SEND_WARRPING_ERR, IB_WC_LOC_ACCESS_ERR, ERDMA_WC_VENDOR_SQE_WARP_ERR },
+	{ ERDMA_WC_FLUSH_ERR, IB_WC_WR_FLUSH_ERR, ERDMA_WC_VENDOR_NO_ERR },
+	{ ERDMA_WC_RETRY_EXC_ERR, IB_WC_RETRY_EXC_ERR, ERDMA_WC_VENDOR_NO_ERR },
+};
+
+static int
+erdma_poll_one_cqe(struct erdma_cq *cq, struct erdma_cqe *cqe, struct ib_wc *wc)
+{
+	struct erdma_dev *dev = to_edev(cq->ibcq.device);
+	struct erdma_qp *qp;
+	struct erdma_kqp *kern_qp;
+	u64 *wqe_hdr;
+	u64 *id_table;
+	u32 qpn = be32_to_cpu(cqe->qpn);
+	u16 wqe_idx = be32_to_cpu(cqe->qe_idx);
+	u32 hdr = be32_to_cpu(cqe->hdr);
+	u16 depth;
+	u8 opcode, syndrome, qtype;
+
+	qp = find_qp_by_qpn(dev, qpn);
+	kern_qp = &qp->kern_qp;
+
+	qtype = FIELD_GET(ERDMA_CQE_HDR_QTYPE_MASK, hdr);
+	syndrome = FIELD_GET(ERDMA_CQE_HDR_SYNDROME_MASK, hdr);
+	opcode = FIELD_GET(ERDMA_CQE_HDR_OPCODE_MASK, hdr);
+
+	if (qtype == ERDMA_CQE_QTYPE_SQ) {
+		id_table = kern_qp->swr_tbl;
+		depth = qp->attrs.sq_size;
+		wqe_hdr = (u64 *)get_sq_entry(qp, wqe_idx);
+		kern_qp->sq_ci = wqe_idx + FIELD_GET(ERDMA_SQE_HDR_WQEBB_CNT_MASK, *wqe_hdr) + 1;
+	} else {
+		id_table = kern_qp->rwr_tbl;
+		depth = qp->attrs.rq_size;
+	}
+	wc->wr_id = id_table[wqe_idx & (depth - 1)];
+	wc->byte_len = be32_to_cpu(cqe->size);
+
+	wc->wc_flags = 0;
+
+	wc->opcode = map_cqe_opcode[opcode].base;
+	if (wc->opcode == IB_WC_RECV_RDMA_WITH_IMM) {
+		wc->ex.imm_data = be32_to_cpu(cqe->imm_data);
+		wc->wc_flags |= IB_WC_WITH_IMM;
+	}
+
+	if (syndrome >= ERDMA_NUM_WC_STATUS)
+		syndrome = ERDMA_WC_GENERAL_ERR;
+
+	wc->status = map_cqe_status[syndrome].base;
+	wc->vendor_err = map_cqe_status[syndrome].vendor;
+	wc->qp = &qp->ibqp;
+
+	return 0;
+}
+
+int erdma_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
+{
+	struct erdma_cq *cq = to_ecq(ibcq);
+	struct erdma_cqe *cqe;
+	unsigned long flags;
+	u32 owner;
+	u32 ci;
+	int i, ret;
+	int new = 0;
+	u32 hdr;
+
+	spin_lock_irqsave(&cq->kern_cq.lock, flags);
+
+	owner = cq->kern_cq.owner;
+	ci = cq->kern_cq.ci;
+
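+	/*
+	 * A CQE is owned by SW only while its owner bit matches the ring's
+	 * current owner value; the expected value flips each time the
+	 * circular CQ buffer wraps around.
+	 */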
+	for (i = 0; i < num_entries; i++) {
+		cqe = &cq->kern_cq.qbuf[ci & (cq->depth - 1)];
+
+		hdr = be32_to_cpu(READ_ONCE(cqe->hdr));
+		if (FIELD_GET(ERDMA_CQE_HDR_OWNER_MASK, hdr) != owner)
+			break;
+
+		/* CQE contents should be ready when we poll. */
+		dma_rmb();
+		ret = erdma_poll_one_cqe(cq, cqe, wc);
+		ci++;
+		if ((ci & (cq->depth - 1)) == 0)
+			owner = !owner;
+		if (ret)
+			continue;
+		wc++;
+		new++;
+	}
+	cq->kern_cq.owner = owner;
+	cq->kern_cq.ci = ci;
+
+	spin_unlock_irqrestore(&cq->kern_cq.lock, flags);
+	return new;
+}
diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c b/drivers/infiniband/hw/erdma/erdma_qp.c
new file mode 100644
index 000000000000..8c02215cee04
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_qp.c
@@ -0,0 +1,624 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ *
+ * Authors: Bernard Metzler <bmt@zurich.ibm.com>
+ *          Fredy Neeser <nfd@zurich.ibm.com>
+ * Copyright (c) 2008-2016, IBM Corporation
+ */
+
+#include <linux/errno.h>
+#include <linux/pci.h>
+#include <linux/scatterlist.h>
+#include <linux/types.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_user_verbs.h>
+#include <rdma/ib_verbs.h>
+
+#include "erdma.h"
+#include "erdma_cm.h"
+#include "erdma_verbs.h"
+
+struct ib_qp *erdma_get_ibqp(struct ib_device *ibdev, int id)
+{
+	struct erdma_qp *qp = find_qp_by_qpn(to_edev(ibdev), id);
+
+	if (qp)
+		return &qp->ibqp;
+
+	return NULL;
+}
+
+static int erdma_modify_qp_state_to_rts(struct erdma_qp *qp,
+					struct erdma_qp_attrs *attrs,
+					enum erdma_qp_attr_mask mask)
+{
+	int ret;
+	struct erdma_dev *dev = qp->dev;
+	struct erdma_cmdq_modify_qp_req req;
+	struct tcp_sock *tp;
+
+	if (!(mask & ERDMA_QP_ATTR_LLP_HANDLE))
+		return -EINVAL;
+
+	if (!(mask & ERDMA_QP_ATTR_MPA))
+		return -EINVAL;
+
+	qp->attrs.state = ERDMA_QP_STATE_RTS;
+	qp->attrs.remote_qpn = qp->cep->mpa.remote_qpn;
+	qp->attrs.llp_stream_handle = attrs->llp_stream_handle;
+
+	tp = tcp_sk(attrs->llp_stream_handle->sk);
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_MODIFY_QP);
+
+	req.cfg0 = FIELD_PREP(ERDMA_CMD_MODIFY_QP_STATE_MASK, qp->attrs.state) |
+		FIELD_PREP(ERDMA_CMD_MODIFY_QP_CC_MASK, qp->cc_method) |
+		FIELD_PREP(ERDMA_CMD_MODIFY_QP_QPN_MASK, QP_ID(qp));
+
+	req.remote_qpn = qp->attrs.remote_qpn;
+	req.dip = qp->cep->llp.raddr.sin_addr.s_addr;
+	req.sip = qp->cep->llp.laddr.sin_addr.s_addr;
+	req.dport = qp->cep->llp.raddr.sin_port;
+	req.sport = qp->cep->llp.laddr.sin_port;
+
+	req.send_nxt = tp->snd_nxt;
+	if (qp->qp_type == ERDMA_QP_TYPE_SERVER)
+		req.send_nxt += 20 + qp->private_data_len; /* rsvd tcp seq for mpa-rsp */
+	req.recv_nxt = tp->rcv_nxt;
+
+	ret = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req,
+		sizeof(req), NULL, NULL);
+	if (ret) {
+		dev_err(&dev->pdev->dev,
+			"ERROR: err code = %d, cmd of modify qp failed.\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+static int erdma_modify_qp_state_to_stop(struct erdma_qp *qp,
+					 struct erdma_qp_attrs *attrs,
+					 enum erdma_qp_attr_mask mask)
+{
+	int ret;
+	struct erdma_dev *dev = qp->dev;
+	struct erdma_cmdq_modify_qp_req req;
+
+	qp->attrs.state = attrs->state;
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_MODIFY_QP);
+
+	req.cfg0 = FIELD_PREP(ERDMA_CMD_MODIFY_QP_STATE_MASK, attrs->state) |
+		FIELD_PREP(ERDMA_CMD_MODIFY_QP_QPN_MASK, QP_ID(qp));
+
+	ret = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req,
+		sizeof(req), NULL, NULL);
+	if (ret) {
+		dev_err(&dev->pdev->dev,
+			"ERROR: err code = %d, cmd of modify qp failed.\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+
+int erdma_modify_qp_internal(struct erdma_qp *qp, struct erdma_qp_attrs *attrs,
+			     enum erdma_qp_attr_mask mask)
+{
+	int drop_conn = 0, ret = 0;
+
+	if (!mask)
+		return 0;
+
+	if (mask != ERDMA_QP_ATTR_STATE) {
+		/*
+		 * changes of qp attributes (maybe state, too)
+		 */
+		if (mask & ERDMA_QP_ATTR_ACCESS_FLAGS) {
+			if (attrs->flags & ERDMA_BIND_ENABLED)
+				qp->attrs.flags |= ERDMA_BIND_ENABLED;
+			else
+				qp->attrs.flags &= ~ERDMA_BIND_ENABLED;
+
+			if (attrs->flags & ERDMA_WRITE_ENABLED)
+				qp->attrs.flags |= ERDMA_WRITE_ENABLED;
+			else
+				qp->attrs.flags &= ~ERDMA_WRITE_ENABLED;
+
+			if (attrs->flags & ERDMA_READ_ENABLED)
+				qp->attrs.flags |= ERDMA_READ_ENABLED;
+			else
+				qp->attrs.flags &= ~ERDMA_READ_ENABLED;
+
+		}
+	if (!(mask & ERDMA_QP_ATTR_STATE))
+		return 0;
+
+	switch (qp->attrs.state) {
+	case ERDMA_QP_STATE_IDLE:
+	case ERDMA_QP_STATE_RTR:
+		switch (attrs->state) {
+		case ERDMA_QP_STATE_RTS:
+			ret = erdma_modify_qp_state_to_rts(qp, attrs, mask);
+			break;
+		case ERDMA_QP_STATE_ERROR:
+			qp->attrs.state = ERDMA_QP_STATE_ERROR;
+			if (qp->cep) {
+				erdma_cep_put(qp->cep);
+				qp->cep = NULL;
+			}
+			ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
+			break;
+		case ERDMA_QP_STATE_RTR:
+			break;
+		default:
+			break;
+		}
+		break;
+	case ERDMA_QP_STATE_RTS:
+		switch (attrs->state) {
+		case ERDMA_QP_STATE_CLOSING:
+			/*
+			 * Verbs: move to IDLE if SQ and ORQ are empty.
+			 * Move to ERROR otherwise. But first of all we must
+			 * close the connection. So we keep CLOSING or ERROR
+			 * as a transient state, schedule connection drop work
+			 * and wait for the socket state change upcall to
+			 * come back closed.
+			 */
+			ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
+			drop_conn = 1;
+			break;
+		case ERDMA_QP_STATE_TERMINATE:
+			qp->attrs.state = ERDMA_QP_STATE_TERMINATE;
+			ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
+			drop_conn = 1;
+			break;
+		case ERDMA_QP_STATE_ERROR:
+			/*
+			 * This is an emergency close.
+			 *
+			 * Any in progress transmit operation will get
+			 * cancelled.
+			 * This will likely result in a protocol failure,
+			 * if a TX operation is in transit. The caller
+			 * could unconditionally wait to give the current
+			 * operation a chance to complete.
+			 * Esp., how to handle the non-empty IRQ case?
+			 * The peer was asking for data transfer at a valid
+			 * point in time.
+			 */
+			ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
+			qp->attrs.state = ERDMA_QP_STATE_ERROR;
+			drop_conn = 1;
+			break;
+		default:
+			break;
+		}
+		break;
+	case ERDMA_QP_STATE_TERMINATE:
+		switch (attrs->state) {
+		case ERDMA_QP_STATE_ERROR:
+			qp->attrs.state = ERDMA_QP_STATE_ERROR;
+			break;
+		default:
+			break;
+		}
+		break;
+	case ERDMA_QP_STATE_CLOSING:
+		switch (attrs->state) {
+		case ERDMA_QP_STATE_IDLE:
+			qp->attrs.state = ERDMA_QP_STATE_IDLE;
+			break;
+		case ERDMA_QP_STATE_CLOSING:
+			/*
+			 * The LLP may already moved the QP to closing
+			 * due to graceful peer close init
+			 */
+			break;
+		case ERDMA_QP_STATE_ERROR:
+			/*
+			 * QP was moved to CLOSING by LLP event
+			 * not yet seen by user.
+			 */
+			ret = erdma_modify_qp_state_to_stop(qp, attrs, mask);
+			qp->attrs.state = ERDMA_QP_STATE_ERROR;
+			break;
+		default:
+			return -ECONNABORTED;
+		}
+		break;
+	default:
+		break;
+	}
+
+	if (drop_conn)
+		erdma_qp_cm_drop(qp, 0);
+
+	return ret;
+}
+
+void erdma_qp_llp_close(struct erdma_qp *qp)
+{
+	struct erdma_qp_attrs qp_attrs;
+
+	down_write(&qp->state_lock);
+
+	qp->attrs.llp_stream_handle = NULL;
+
+	switch (qp->attrs.state) {
+	case ERDMA_QP_STATE_RTS:
+	case ERDMA_QP_STATE_RTR:
+	case ERDMA_QP_STATE_IDLE:
+	case ERDMA_QP_STATE_TERMINATE:
+		qp_attrs.state = ERDMA_QP_STATE_CLOSING;
+		(void)erdma_modify_qp_internal(qp, &qp_attrs, ERDMA_QP_ATTR_STATE);
+		break;
+	case ERDMA_QP_STATE_CLOSING:
+		qp->attrs.state = ERDMA_QP_STATE_IDLE;
+		break;
+	default:
+		break;
+	}
+
+	if (qp->cep) {
+		erdma_cep_put(qp->cep);
+		qp->cep = NULL;
+	}
+
+	up_write(&qp->state_lock);
+}
+
+static void erdma_qp_safe_free(struct kref *ref)
+{
+	struct erdma_qp	*qp = container_of(ref, struct erdma_qp, ref);
+
+	complete(&qp->safe_free);
+}
+
+void erdma_qp_put(struct erdma_qp *qp)
+{
+	WARN_ON(kref_read(&qp->ref) < 1);
+	kref_put(&qp->ref, erdma_qp_safe_free);
+}
+
+void erdma_qp_get(struct erdma_qp *qp)
+{
+	kref_get(&qp->ref);
+}
+
+static inline int fill_inline_data(struct erdma_qp *qp, const struct ib_send_wr *send_wr,
+				   u16 wqebb_idx, u32 sgl_offset, u32 *length_field)
+{
+	int i = 0;
+	char *data;
+	u32 remain_size, copy_size, data_off, bytes = 0;
+
+	wqebb_idx += (sgl_offset >> SQEBB_SHIFT);
+	sgl_offset &= (SQEBB_SIZE - 1);
+	data = get_sq_entry(qp, wqebb_idx);
+
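+	/*
+	 * Copy the inline payload SGE by SGE directly into the SQ ring,
+	 * splitting each copy at SQEBB boundaries and wrapping around the
+	 * ring as needed.
+	 */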
+	while (i < send_wr->num_sge) {
+		bytes += send_wr->sg_list[i].length;
+		if (bytes > (int)ERDMA_MAX_INLINE)
+			return -EINVAL;
+
+		remain_size = send_wr->sg_list[i].length;
+		data_off = 0;
+
+		while (1) {
+			copy_size = min(remain_size, SQEBB_SIZE - sgl_offset);
+
+			memcpy(data + sgl_offset,
+				(void *)(uintptr_t)send_wr->sg_list[i].addr + data_off,
+				copy_size);
+			remain_size -= copy_size;
+			data_off += copy_size;
+			sgl_offset += copy_size;
+			wqebb_idx += (sgl_offset >> SQEBB_SHIFT);
+			sgl_offset &= (SQEBB_SIZE - 1);
+
+			data = get_sq_entry(qp, wqebb_idx);
+			if (!remain_size)
+				break;
+		}
+
+		i++;
+	}
+	*length_field = bytes;
+
+	return bytes;
+}
+
+static inline int fill_sgl(struct erdma_qp *qp, const struct ib_send_wr *send_wr,
+			   u16 wqebb_idx, u32 sgl_offset, u32 *length_field)
+{
+	int i = 0;
+	u32 bytes = 0;
+	char *sgl;
+
+	if (send_wr->num_sge > qp->dev->attrs.max_send_sge)
+		return -EINVAL;
+
+	if (sgl_offset & 0xf) {
+		ibdev_warn(&qp->dev->ibdev, "SGL in WQE is not 16-byte aligned.");
+		return -EINVAL;
+	}
+
+	while (i < send_wr->num_sge) {
+		wqebb_idx += (sgl_offset >> SQEBB_SHIFT);
+		sgl_offset &= (SQEBB_SIZE - 1);
+		sgl = get_sq_entry(qp, wqebb_idx);
+
+		bytes += send_wr->sg_list[i].length;
+		memcpy(sgl + sgl_offset, &send_wr->sg_list[i], sizeof(struct ib_sge));
+
+		sgl_offset += sizeof(struct ib_sge);
+		i++;
+	}
+
+	*length_field = bytes;
+	return 0;
+}
+
+static inline int erdma_push_one_sqe(struct erdma_qp *qp, u16 *pi,
+				     const struct ib_send_wr *send_wr)
+{
+	struct erdma_write_sqe *write_sqe;
+	struct erdma_send_sqe *send_sqe;
+	struct erdma_readreq_sqe *read_sqe;
+	struct erdma_reg_mr_sqe *regmr_sge;
+	struct erdma_mr *mr;
+	struct ib_rdma_wr *rdma_wr;
+	struct ib_sge *sge;
+	u32 wqe_size, wqebb_cnt;
+	int ret;
+	u32 flags = send_wr->send_flags;
+	u32 idx = *pi & (qp->attrs.sq_size - 1);
+	u64 *entry = (u64 *)get_sq_entry(qp, idx);
+	u64 wqe_hdr;
+	u32 *length_field = NULL;
+	u64 sgl_offset = 0;
+	enum ib_wr_opcode op = send_wr->opcode;
+	u64 *inline_data;
+
+	*entry = 0;
+	qp->kern_qp.swr_tbl[idx] = send_wr->wr_id;
+
+	wqe_hdr = FIELD_PREP(ERDMA_SQE_HDR_CE_MASK,
+		((flags & IB_SEND_SIGNALED) || qp->kern_qp.sig_all) ? 1 : 0);
+	wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_SE_MASK, flags & IB_SEND_SOLICITED ? 1 : 0);
+	wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_FENCE_MASK, flags & IB_SEND_FENCE ? 1 : 0);
+	wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_INLINE_MASK, flags & IB_SEND_INLINE ? 1 : 0);
+	wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_QPN_MASK, QP_ID(qp));
+
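+	/*
+	 * Fill the opcode-specific part of the WQE; the common header at
+	 * the start of the WQE is written last, once its size is known.
+	 */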
+	switch (op) {
+	case IB_WR_RDMA_WRITE:
+	case IB_WR_RDMA_WRITE_WITH_IMM:
+		wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_OPCODE_MASK,
+			op == IB_WR_RDMA_WRITE ? ERDMA_OP_WRITE : ERDMA_OP_WRITE_WITH_IMM);
+		rdma_wr = container_of(send_wr, struct ib_rdma_wr, wr);
+		write_sqe = (struct erdma_write_sqe *)entry;
+
+		write_sqe->imm_data = send_wr->ex.imm_data;
+		write_sqe->sink_stag = rdma_wr->rkey;
+		write_sqe->sink_to_high = (rdma_wr->remote_addr >> 32) & 0xFFFFFFFF;
+		write_sqe->sink_to_low = rdma_wr->remote_addr & 0xFFFFFFFF;
+
+		length_field = &write_sqe->length;
+		wqe_size = sizeof(struct erdma_write_sqe);
+		sgl_offset = wqe_size;
+		break;
+	case IB_WR_RDMA_READ:
+	case IB_WR_RDMA_READ_WITH_INV:
+		if (unlikely(send_wr->num_sge != 1))
+			return -EINVAL;
+		wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_OPCODE_MASK,
+			op == IB_WR_RDMA_READ ? ERDMA_OP_READ : ERDMA_OP_READ_WITH_INV);
+		rdma_wr = container_of(send_wr, struct ib_rdma_wr, wr);
+		read_sqe = (struct erdma_readreq_sqe *)entry;
+		if (op == IB_WR_RDMA_READ_WITH_INV)
+			read_sqe->invalid_stag = send_wr->ex.invalidate_rkey;
+
+		read_sqe->length = send_wr->sg_list[0].length;
+		read_sqe->sink_stag = send_wr->sg_list[0].lkey;
+		read_sqe->sink_to_low = *(__u32 *)(&send_wr->sg_list[0].addr);
+		read_sqe->sink_to_high = *((__u32 *)(&send_wr->sg_list[0].addr) + 1);
+
+		sge = (struct ib_sge *)get_sq_entry(qp, idx + 1);
+		sge->addr = rdma_wr->remote_addr;
+		sge->lkey = rdma_wr->rkey;
+		sge->length = send_wr->sg_list[0].length;
+		wqe_size = sizeof(struct erdma_readreq_sqe) + send_wr->num_sge * 16;
+
+		goto out;
+	case IB_WR_SEND:
+	case IB_WR_SEND_WITH_IMM:
+	case IB_WR_SEND_WITH_INV:
+		wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_OPCODE_MASK,
+			op == IB_WR_SEND ? ERDMA_OP_SEND : (op == IB_WR_SEND_WITH_IMM ?
+			ERDMA_OP_SEND_WITH_IMM : ERDMA_OP_SEND_WITH_INV));
+		send_sqe = (struct erdma_send_sqe *)entry;
+		if (op == IB_WR_SEND_WITH_INV)
+			send_sqe->imm_data = send_wr->ex.invalidate_rkey;
+		else
+			send_sqe->imm_data = send_wr->ex.imm_data;
+
+		length_field = &send_sqe->length;
+		wqe_size = sizeof(struct erdma_send_sqe);
+		sgl_offset = wqe_size;
+
+		break;
+	case IB_WR_REG_MR:
+		wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_OPCODE_MASK, ERDMA_OP_REG_MR);
+		regmr_sge = (struct erdma_reg_mr_sqe *)entry;
+		mr = to_emr(reg_wr(send_wr)->mr);
+
+		mr->access = ERDMA_MR_ACC_LR |
+			(reg_wr(send_wr)->access & IB_ACCESS_REMOTE_READ  ? ERDMA_MR_ACC_RR : 0) |
+			(reg_wr(send_wr)->access & IB_ACCESS_LOCAL_WRITE  ? ERDMA_MR_ACC_LW : 0) |
+			(reg_wr(send_wr)->access & IB_ACCESS_REMOTE_WRITE ? ERDMA_MR_ACC_RW : 0);
+		regmr_sge->addr = mr->ibmr.iova;
+		regmr_sge->length = mr->ibmr.length;
+		regmr_sge->stag = mr->ibmr.lkey;
+		regmr_sge->attrs |= FIELD_PREP(ERDMA_SQE_MR_ACCESS_MODE_MASK, 0);
+		regmr_sge->attrs |= FIELD_PREP(ERDMA_SQE_MR_ACCESS_RIGHT_MASK, mr->access);
+		regmr_sge->attrs |= FIELD_PREP(ERDMA_SQE_MR_MTT_COUNT_MASK, mr->mem.mtt_nents);
+
+		if (mr->mem.mtt_nents < 4) {
+			regmr_sge->attrs |= FIELD_PREP(ERDMA_SQE_MR_MTT_TYPE_MASK, 0);
+			inline_data = (u64 *)get_sq_entry(qp, idx + 1);
+			memcpy(inline_data, mr->mem.mtt_buf, mr->mem.mtt_nents * 8);
+			wqe_size = sizeof(struct erdma_reg_mr_sqe) + mr->mem.mtt_nents * 8;
+		} else {
+			regmr_sge->attrs |= FIELD_PREP(ERDMA_SQE_MR_MTT_TYPE_MASK, 1);
+			wqe_size = sizeof(struct erdma_reg_mr_sqe);
+		}
+
+		goto out;
+	case IB_WR_LOCAL_INV:
+		wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_OPCODE_MASK, ERDMA_OP_LOCAL_INV);
+		regmr_sge = (struct erdma_reg_mr_sqe *)entry;
+		regmr_sge->stag = send_wr->ex.invalidate_rkey;
+		wqe_size = sizeof(struct erdma_reg_mr_sqe);
+		goto out;
+	default:
+		return -EOPNOTSUPP;
+	}
+
+	if (flags & IB_SEND_INLINE) {
+		ret = fill_inline_data(qp, send_wr, idx, sgl_offset, length_field);
+		if (ret < 0)
+			return -EINVAL;
+		wqe_size += ret;
+		wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_SGL_LEN_MASK, ret);
+	} else {
+		ret = fill_sgl(qp, send_wr, idx, sgl_offset, length_field);
+		if (ret)
+			return -EINVAL;
+		wqe_size += send_wr->num_sge * sizeof(struct ib_sge);
+		wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_SGL_LEN_MASK, send_wr->num_sge);
+	}
+
+out:
+	wqebb_cnt = SQEBB_COUNT(wqe_size);
+	wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_WQEBB_CNT_MASK, wqebb_cnt - 1);
+	*pi += wqebb_cnt;
+	wqe_hdr |= FIELD_PREP(ERDMA_SQE_HDR_WQEBB_INDEX_MASK, *pi);
+
+	*entry = wqe_hdr;
+
+	return 0;
+}
+
+static inline void kick_sq_db(struct erdma_qp *qp, u16 pi)
+{
+	u64 db_data = FIELD_PREP(ERDMA_SQE_HDR_QPN_MASK, QP_ID(qp)) |
+		FIELD_PREP(ERDMA_SQE_HDR_WQEBB_INDEX_MASK, pi);
+
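+	/* Publish the new producer index to the doorbell record, then ring HW. */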
+	*(u64 *)qp->kern_qp.sq_db_info = db_data;
+	writeq(db_data, qp->kern_qp.hw_sq_db);
+}
+
+int erdma_post_send(struct ib_qp *ibqp,
+		    const struct ib_send_wr *send_wr,
+		    const struct ib_send_wr **bad_send_wr)
+{
+	struct erdma_qp *qp = to_eqp(ibqp);
+	int ret = 0;
+	const struct ib_send_wr *wr = send_wr;
+	unsigned long flags;
+	u16 sq_pi;
+
+	if (!send_wr)
+		return -EINVAL;
+
+	spin_lock_irqsave(&qp->lock, flags);
+	sq_pi = qp->kern_qp.sq_pi;
+
+	while (wr) {
+		if ((u16)(sq_pi - qp->kern_qp.sq_ci) >= qp->attrs.sq_size) {
+			ret = -ENOMEM;
+			*bad_send_wr = wr;
+			break;
+		}
+
+		ret = erdma_push_one_sqe(qp, &sq_pi, wr);
+		if (ret) {
+			*bad_send_wr = wr;
+			break;
+		}
+		qp->kern_qp.sq_pi = sq_pi;
+		kick_sq_db(qp, sq_pi);
+
+		wr = wr->next;
+	}
+	spin_unlock_irqrestore(&qp->lock, flags);
+
+	return ret;
+}
+
+static inline int erdma_post_recv_one(struct ib_qp *ibqp, const struct ib_recv_wr *recv_wr,
+		const struct ib_recv_wr **bad_recv_wr)
+{
+	struct erdma_qp *qp = to_eqp(ibqp);
+	struct erdma_rqe *rqe;
+	unsigned int rq_pi;
+	u16 idx;
+
+	rq_pi = qp->kern_qp.rq_pi;
+	idx = rq_pi & (qp->attrs.rq_size - 1);
+	rqe = (struct erdma_rqe *)qp->kern_qp.rq_buf + idx;
+
+	rqe->qe_idx = rq_pi + 1;
+	rqe->qpn = QP_ID(qp);
+
+	if (recv_wr->num_sge == 0) {
+		rqe->length = 0;
+	} else if (recv_wr->num_sge == 1) {
+		rqe->stag = recv_wr->sg_list[0].lkey;
+		rqe->to = recv_wr->sg_list[0].addr;
+		rqe->length = recv_wr->sg_list[0].length;
+	} else {
+		return -EINVAL;
+	}
+
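+	/* The first 8 bytes of the RQE double as the RQ doorbell payload. */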
+	*(u64 *)qp->kern_qp.rq_db_info = *(u64 *)rqe;
+	writeq(*(u64 *)rqe, qp->kern_qp.hw_rq_db);
+
+	qp->kern_qp.rwr_tbl[idx] = recv_wr->wr_id;
+	qp->kern_qp.rq_pi = rq_pi + 1;
+
+	return 0;
+}
+
+int erdma_post_recv(struct ib_qp *qp,
+		    const struct ib_recv_wr *recv_wr,
+		    const struct ib_recv_wr **bad_recv_wr)
+{
+	struct erdma_qp *eqp = to_eqp(qp);
+	int ret = 0;
+	const struct ib_recv_wr *wr = recv_wr;
+	unsigned long flags;
+
+	if (!qp || !recv_wr)
+		return -EINVAL;
+
+	spin_lock_irqsave(&eqp->lock, flags);
+	while (wr) {
+		ret = erdma_post_recv_one(qp, wr, bad_recv_wr);
+		if (ret) {
+			*bad_recv_wr = wr;
+			break;
+		}
+		wr = wr->next;
+	}
+	spin_unlock_irqrestore(&eqp->lock, flags);
+	return ret;
+}
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
new file mode 100644
index 000000000000..9ff9ef66610f
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -0,0 +1,1477 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ *
+ * Authors: Bernard Metzler <bmt@zurich.ibm.com>
+ *          Fredy Neeser <nfd@zurich.ibm.com>
+ * Copyright (c) 2008-2016, IBM Corporation
+ *
+ * Copyright (c) 2013-2015, Mellanox Technologies. All rights reserved.
+ */
+
+#include <linux/errno.h>
+#include <linux/pci.h>
+#include <linux/types.h>
+#include <linux/uaccess.h>
+#include <linux/vmalloc.h>
+
+#include <rdma/erdma-abi.h>
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_umem.h>
+#include <rdma/ib_user_verbs.h>
+#include <rdma/uverbs_ioctl.h>
+
+#include "erdma.h"
+#include "erdma_cm.h"
+#include "erdma_hw.h"
+#include "erdma_verbs.h"
+
+static inline int
+create_qp_cmd(struct erdma_dev *dev, struct erdma_qp *qp)
+{
+	struct erdma_cmdq_create_qp_req req;
+	struct erdma_pd *pd = to_epd(qp->ibqp.pd);
+	struct erdma_uqp *user_qp;
+	int err;
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_CREATE_QP);
+
+	req.cfg0 = FIELD_PREP(ERDMA_CMD_CREATE_QP_SQ_DEPTH_MASK, ilog2(qp->attrs.sq_size)) |
+		FIELD_PREP(ERDMA_CMD_CREATE_QP_QPN_MASK, QP_ID(qp));
+	req.cfg1 = FIELD_PREP(ERDMA_CMD_CREATE_QP_RQ_DEPTH_MASK, ilog2(qp->attrs.rq_size)) |
+		FIELD_PREP(ERDMA_CMD_CREATE_QP_PD_MASK, pd->pdn);
+
+	if (qp->is_kernel_qp) {
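+		/*
+		 * Kernel QP queue buffers are physically contiguous, so a
+		 * single inline MTT entry describes each queue.
+		 */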
+		u32 pg_sz_field = ilog2(SZ_1M) - 12;
+
+		req.sq_cqn_mtt_cfg = FIELD_PREP(ERDMA_CMD_CREATE_QP_PAGE_SIZE_MASK, pg_sz_field) |
+			FIELD_PREP(ERDMA_CMD_CREATE_QP_CQN_MASK, qp->scq->cqn);
+		req.rq_cqn_mtt_cfg = FIELD_PREP(ERDMA_CMD_CREATE_QP_PAGE_SIZE_MASK, pg_sz_field) |
+			FIELD_PREP(ERDMA_CMD_CREATE_QP_CQN_MASK, qp->rcq->cqn);
+
+		req.sq_mtt_cfg = FIELD_PREP(ERDMA_CMD_CREATE_QP_PAGE_OFFSET_MASK, 0) |
+			FIELD_PREP(ERDMA_CMD_CREATE_QP_MTT_CNT_MASK, 1) |
+			FIELD_PREP(ERDMA_CMD_CREATE_QP_MTT_TYPE_MASK, ERDMA_MR_INLINE_MTT);
+		req.rq_mtt_cfg = req.sq_mtt_cfg;
+
+		req.rq_buf_addr = qp->kern_qp.rq_buf_dma_addr;
+		req.sq_buf_addr = qp->kern_qp.sq_buf_dma_addr;
+		req.sq_db_info_dma_addr =
+			qp->kern_qp.sq_buf_dma_addr + (qp->attrs.sq_size << SQEBB_SHIFT);
+		req.rq_db_info_dma_addr =
+			qp->kern_qp.rq_buf_dma_addr + (qp->attrs.rq_size << RQE_SHIFT);
+	} else {
+		user_qp = &qp->user_qp;
+		req.sq_cqn_mtt_cfg = FIELD_PREP(ERDMA_CMD_CREATE_QP_PAGE_SIZE_MASK,
+			ilog2(user_qp->sq_mtt.page_size) - 12);
+		req.sq_cqn_mtt_cfg |= FIELD_PREP(ERDMA_CMD_CREATE_QP_CQN_MASK, qp->scq->cqn);
+
+		req.rq_cqn_mtt_cfg = FIELD_PREP(ERDMA_CMD_CREATE_QP_PAGE_SIZE_MASK,
+			ilog2(user_qp->rq_mtt.page_size) - 12);
+		req.rq_cqn_mtt_cfg |= FIELD_PREP(ERDMA_CMD_CREATE_QP_CQN_MASK, qp->rcq->cqn);
+
+		req.sq_mtt_cfg = user_qp->sq_mtt.page_offset;
+		req.sq_mtt_cfg |=
+			FIELD_PREP(ERDMA_CMD_CREATE_QP_MTT_CNT_MASK, user_qp->sq_mtt.mtt_nents) |
+			FIELD_PREP(ERDMA_CMD_CREATE_QP_MTT_TYPE_MASK, user_qp->sq_mtt.mtt_type);
+
+		req.rq_mtt_cfg = user_qp->rq_mtt.page_offset;
+		req.rq_mtt_cfg |=
+			FIELD_PREP(ERDMA_CMD_CREATE_QP_MTT_CNT_MASK, user_qp->rq_mtt.mtt_nents) |
+			FIELD_PREP(ERDMA_CMD_CREATE_QP_MTT_TYPE_MASK, user_qp->rq_mtt.mtt_type);
+
+		if (user_qp->sq_mtt.mtt_nents == 1)
+			req.sq_buf_addr = *(u64 *)user_qp->sq_mtt.mtt_buf;
+		else
+			req.sq_buf_addr = user_qp->sq_mtt.mtt_entry[0];
+
+		if (user_qp->rq_mtt.mtt_nents == 1)
+			req.rq_buf_addr = *(u64 *)user_qp->rq_mtt.mtt_buf;
+		else
+			req.rq_buf_addr = user_qp->rq_mtt.mtt_entry[0];
+
+		req.sq_db_info_dma_addr = user_qp->sq_db_info_dma_addr;
+		req.rq_db_info_dma_addr = user_qp->rq_db_info_dma_addr;
+	}
+
+	err = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req, sizeof(req), NULL, NULL);
+	if (err) {
+		dev_err(&dev->pdev->dev,
+			"ERROR: err code = %d, cmd of create qp failed.\n", err);
+		return err;
+	}
+
+	return 0;
+}
+
+static inline int
+regmr_cmd(struct erdma_dev *dev, struct erdma_mr *mr)
+{
+	struct erdma_cmdq_reg_mr_req req;
+	struct erdma_pd *pd = to_epd(mr->ibmr.pd);
+	u64 *phy_addr;
+	int err, i;
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_REG_MR);
+
+	req.cfg0 = FIELD_PREP(ERDMA_CMD_MR_VALID_MASK, mr->valid) |
+		FIELD_PREP(ERDMA_CMD_MR_KEY_MASK, mr->ibmr.lkey & 0xFF) |
+		FIELD_PREP(ERDMA_CMD_MR_MPT_IDX_MASK, mr->ibmr.lkey >> 8);
+	req.cfg1 = FIELD_PREP(ERDMA_CMD_REGMR_PD_MASK, pd->pdn) |
+		FIELD_PREP(ERDMA_CMD_REGMR_TYPE_MASK, mr->type) |
+		FIELD_PREP(ERDMA_CMD_REGMR_RIGHT_MASK, mr->access) |
+		FIELD_PREP(ERDMA_CMD_REGMR_ACC_MODE_MASK, 0);
+	req.cfg2 = FIELD_PREP(ERDMA_CMD_REGMR_PAGESIZE_MASK, ilog2(mr->mem.page_size)) |
+		FIELD_PREP(ERDMA_CMD_REGMR_MTT_TYPE_MASK, mr->mem.mtt_type) |
+		FIELD_PREP(ERDMA_CMD_REGMR_MTT_CNT_MASK, mr->mem.page_cnt);
+
+	if (mr->type == ERDMA_MR_TYPE_DMA)
+		goto post_cmd;
+
+	if (mr->type == ERDMA_MR_TYPE_NORMAL) {
+		req.start_va = mr->mem.va;
+		req.size = mr->mem.len;
+	}
+
+	if (mr->type == ERDMA_MR_TYPE_FRMR || mr->mem.mtt_type == ERDMA_MR_INDIRECT_MTT) {
+		phy_addr = req.phy_addr;
+		*phy_addr = mr->mem.mtt_entry[0];
+	} else {
+		phy_addr = req.phy_addr;
+		for (i = 0; i < mr->mem.mtt_nents; i++)
+			*phy_addr++ = mr->mem.mtt_entry[i];
+	}
+
+post_cmd:
+	err = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req, sizeof(req), NULL, NULL);
+	if (err) {
+		dev_err(&dev->pdev->dev,
+			"ERROR: err code = %d, cmd of reg mr failed.\n", err);
+		return err;
+	}
+
+	return err;
+}
+
+static inline int
+create_cq_cmd(struct erdma_dev *dev, struct erdma_cq *cq)
+{
+	int err;
+	struct erdma_cmdq_create_cq_req req;
+	u32 page_size;
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_CREATE_CQ);
+
+	req.cfg0 = FIELD_PREP(ERDMA_CMD_CREATE_CQ_CQN_MASK, cq->cqn) |
+		FIELD_PREP(ERDMA_CMD_CREATE_CQ_DEPTH_MASK, ilog2(cq->depth));
+	req.cfg1 = FIELD_PREP(ERDMA_CMD_CREATE_CQ_EQN_MASK, cq->assoc_eqn);
+
+	if (cq->is_kernel_cq) {
+		page_size = SZ_32M;
+		req.cfg0 |= FIELD_PREP(ERDMA_CMD_CREATE_CQ_PAGESIZE_MASK, ilog2(page_size) - 12);
+		req.qbuf_addr_l = lower_32_bits(cq->kern_cq.qbuf_dma_addr);
+		req.qbuf_addr_h = upper_32_bits(cq->kern_cq.qbuf_dma_addr);
+
+		req.cfg1 |= FIELD_PREP(ERDMA_CMD_CREATE_CQ_MTT_CNT_MASK, 1) |
+			FIELD_PREP(ERDMA_CMD_CREATE_CQ_MTT_TYPE_MASK, ERDMA_MR_INLINE_MTT);
+
+		req.first_page_offset = 0;
+		req.cq_db_info_addr = cq->kern_cq.qbuf_dma_addr + (cq->depth << CQE_SHIFT);
+	} else {
+		req.cfg0 |= FIELD_PREP(ERDMA_CMD_CREATE_CQ_PAGESIZE_MASK,
+			ilog2(cq->user_cq.qbuf_mtt.page_size) - 12);
+		if (cq->user_cq.qbuf_mtt.mtt_nents == 1) {
+			req.qbuf_addr_l = lower_32_bits(*(u64 *)cq->user_cq.qbuf_mtt.mtt_buf);
+			req.qbuf_addr_h = upper_32_bits(*(u64 *)cq->user_cq.qbuf_mtt.mtt_buf);
+		} else {
+			req.qbuf_addr_l = lower_32_bits(cq->user_cq.qbuf_mtt.mtt_entry[0]);
+			req.qbuf_addr_h = upper_32_bits(cq->user_cq.qbuf_mtt.mtt_entry[0]);
+		}
+		req.cfg1 |= FIELD_PREP(ERDMA_CMD_CREATE_CQ_MTT_CNT_MASK,
+			cq->user_cq.qbuf_mtt.mtt_nents);
+		req.cfg1 |= FIELD_PREP(ERDMA_CMD_CREATE_CQ_MTT_TYPE_MASK,
+			cq->user_cq.qbuf_mtt.mtt_type);
+
+		req.first_page_offset = cq->user_cq.qbuf_mtt.page_offset;
+		req.cq_db_info_addr = cq->user_cq.db_info_dma_addr;
+	}
+
+	err = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req, sizeof(req), NULL, NULL);
+	if (err) {
+		dev_err(&dev->pdev->dev,
+			"ERROR: err code = %d, cmd of create cq failed.\n", err);
+		return err;
+	}
+
+	return 0;
+}
+
+static struct rdma_user_mmap_entry *
+erdma_user_mmap_entry_insert(struct erdma_ucontext *uctx, void *address, u32 size,
+			     u8 mmap_flag, u64 *mmap_offset)
+{
+	struct erdma_user_mmap_entry *entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	int ret;
+
+	if (!entry)
+		return NULL;
+
+	entry->address = (u64)address;
+	entry->mmap_flag = mmap_flag;
+
+	size = PAGE_ALIGN(size);
+
+	ret = rdma_user_mmap_entry_insert(&uctx->ibucontext,
+		&entry->rdma_entry, size);
+	if (ret) {
+		kfree(entry);
+		return NULL;
+	}
+
+	*mmap_offset = rdma_user_mmap_get_offset(&entry->rdma_entry);
+
+	return &entry->rdma_entry;
+}
+
+int erdma_query_device(struct ib_device *ibdev, struct ib_device_attr *attr,
+		       struct ib_udata *unused)
+{
+	struct erdma_dev *dev = to_edev(ibdev);
+
+	memset(attr, 0, sizeof(*attr));
+
+	attr->max_mr_size = dev->attrs.max_mr_size;
+	attr->vendor_id = dev->attrs.vendor_id;
+	attr->vendor_part_id = 0;
+	attr->max_qp = dev->attrs.max_qp;
+	attr->max_qp_wr = min(dev->attrs.max_send_wr, dev->attrs.max_recv_wr);
+
+	attr->max_qp_rd_atom = dev->attrs.max_ord;
+	attr->max_qp_init_rd_atom = dev->attrs.max_ird;
+	attr->max_res_rd_atom = dev->attrs.max_qp * dev->attrs.max_ird;
+	attr->device_cap_flags = dev->attrs.cap_flags;
+	ibdev->local_dma_lkey = dev->attrs.local_dma_key;
+	attr->max_send_sge = dev->attrs.max_send_sge;
+	attr->max_recv_sge = dev->attrs.max_recv_sge;
+	attr->max_sge_rd = dev->attrs.max_sge_rd;
+	attr->max_cq = dev->attrs.max_cq;
+	attr->max_cqe = dev->attrs.max_cqe;
+	attr->max_mr = dev->attrs.max_mr;
+	attr->max_pd = dev->attrs.max_pd;
+	attr->max_mw = dev->attrs.max_mw;
+	attr->max_srq = dev->attrs.max_srq;
+	attr->max_srq_wr = dev->attrs.max_srq_wr;
+	attr->max_srq_sge = dev->attrs.max_srq_sge;
+	attr->max_fast_reg_page_list_len = ERDMA_MAX_FRMR_PA;
+
+	memcpy(&attr->sys_image_guid, dev->netdev->dev_addr, 6);
+
+	return 0;
+}
+
+int erdma_query_pkey(struct ib_device *ibdev, u32 port, u16 idx, u16 *pkey)
+{
+	*pkey = 0xffff;
+	return 0;
+}
+
+int erdma_query_gid(struct ib_device *ibdev, u32 port, int idx,
+		    union ib_gid *gid)
+{
+	struct erdma_dev *dev = to_edev(ibdev);
+
+	memset(gid, 0, sizeof(*gid));
+	memcpy(&gid->raw[0], dev->netdev->dev_addr, 6);
+
+	return 0;
+}
+
+int erdma_query_port(struct ib_device *ibdev, u32 port,
+		     struct ib_port_attr *attr)
+{
+	struct erdma_dev *dev = to_edev(ibdev);
+
+	memset(attr, 0, sizeof(*attr));
+
+	attr->state = dev->state;
+	attr->max_mtu = IB_MTU_1024;
+	attr->active_mtu = attr->max_mtu;
+	attr->gid_tbl_len = 1;
+	attr->port_cap_flags = IB_PORT_CM_SUP;
+	attr->port_cap_flags |= IB_PORT_DEVICE_MGMT_SUP;
+	attr->max_msg_sz = -1;
+	attr->pkey_tbl_len = 1;
+	attr->active_width = 2;
+	attr->active_speed = 2;
+	attr->phys_state = dev->state == IB_PORT_ACTIVE ? 5 : 3;
+
+	return 0;
+}
+
+int erdma_get_port_immutable(struct ib_device *ibdev, u32 port,
+			     struct ib_port_immutable *port_immutable)
+{
+	struct ib_port_attr attr;
+	int ret = erdma_query_port(ibdev, port, &attr);
+
+	if (ret)
+		return ret;
+
+	port_immutable->pkey_tbl_len = attr.pkey_tbl_len;
+	port_immutable->gid_tbl_len = attr.gid_tbl_len;
+	port_immutable->core_cap_flags = RDMA_CORE_PORT_IWARP;
+
+	return 0;
+}
+
+int erdma_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+	struct erdma_pd *pd = to_epd(ibpd);
+	struct erdma_dev *dev = to_edev(ibpd->device);
+	int pdn;
+
+	pdn = erdma_alloc_idx(&dev->res_cb[ERDMA_RES_TYPE_PD]);
+	if (pdn < 0)
+		return pdn;
+
+	pd->pdn = pdn;
+
+	atomic_inc(&dev->num_pd);
+
+	return 0;
+}
+
+int erdma_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
+{
+	struct erdma_pd *pd = to_epd(ibpd);
+	struct erdma_dev *dev = to_edev(ibpd->device);
+
+	erdma_free_idx(&dev->res_cb[ERDMA_RES_TYPE_PD], pd->pdn);
+	atomic_dec(&dev->num_pd);
+
+	return 0;
+}
+
+static inline int
+erdma_qp_validate_cap(struct erdma_dev *dev, struct ib_qp_init_attr *attrs)
+{
+	if ((attrs->cap.max_send_wr > dev->attrs.max_send_wr) ||
+	    (attrs->cap.max_recv_wr > dev->attrs.max_recv_wr) ||
+	    (attrs->cap.max_send_sge > dev->attrs.max_send_sge) ||
+	    (attrs->cap.max_recv_sge > dev->attrs.max_recv_sge) ||
+	    (attrs->cap.max_inline_data > ERDMA_MAX_INLINE) ||
+	    !attrs->cap.max_send_wr ||
+	    !attrs->cap.max_recv_wr) {
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static inline int
+erdma_qp_validate_attr(struct erdma_dev *dev, struct ib_qp_init_attr *attrs)
+{
+	if (attrs->qp_type != IB_QPT_RC) {
+		ibdev_err_ratelimited(&dev->ibdev, "only RC QPs are supported.");
+		return -EOPNOTSUPP;
+	}
+
+	if (attrs->srq) {
+		ibdev_err_ratelimited(&dev->ibdev, "SRQ is not supported yet.");
+		return -EOPNOTSUPP;
+	}
+
+	if (!attrs->send_cq || !attrs->recv_cq) {
+		ibdev_err_ratelimited(&dev->ibdev, "send CQ or recv CQ is NULL.");
+		return -EOPNOTSUPP;
+	}
+
+	return 0;
+}
+
+static void free_kernel_qp(struct erdma_qp *qp)
+{
+	struct erdma_dev *dev = qp->dev;
+
+	vfree(qp->kern_qp.swr_tbl);
+	vfree(qp->kern_qp.rwr_tbl);
+
+	if (qp->kern_qp.sq_buf) {
+		dma_free_coherent(&dev->pdev->dev,
+			(qp->attrs.sq_size << SQEBB_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+			qp->kern_qp.sq_buf, qp->kern_qp.sq_buf_dma_addr);
+	}
+
+	if (qp->kern_qp.rq_buf) {
+		dma_free_coherent(&dev->pdev->dev,
+			(qp->attrs.rq_size << RQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+			qp->kern_qp.rq_buf, qp->kern_qp.rq_buf_dma_addr);
+	}
+}
+
+static int
+init_kernel_qp(struct erdma_dev *dev, struct erdma_qp *qp, struct ib_qp_init_attr *attrs)
+{
+	int ret = -ENOMEM;
+
+	if (attrs->sq_sig_type == IB_SIGNAL_ALL_WR)
+		qp->kern_qp.sig_all = 1;
+
+	qp->is_kernel_qp = 1;
+	qp->kern_qp.sq_pi = 0;
+	qp->kern_qp.sq_ci = 0;
+	qp->kern_qp.rq_pi = 0;
+	qp->kern_qp.rq_ci = 0;
+	qp->kern_qp.hw_sq_db = dev->func_bar + ERDMA_BAR_SQDB_SPACE_OFFSET +
+		(ERDMA_SDB_SHARED_PAGE_INDEX << PAGE_SHIFT);
+	qp->kern_qp.hw_rq_db = dev->func_bar + ERDMA_BAR_RQDB_SPACE_OFFSET;
+
+	qp->kern_qp.swr_tbl = vmalloc(qp->attrs.sq_size * sizeof(u64));
+	qp->kern_qp.rwr_tbl = vmalloc(qp->attrs.rq_size * sizeof(u64));
+	if (!qp->kern_qp.swr_tbl || !qp->kern_qp.rwr_tbl)
+		goto err_out;
+
+	qp->kern_qp.sq_buf = dma_alloc_coherent(&dev->pdev->dev,
+		(qp->attrs.sq_size << SQEBB_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+		&qp->kern_qp.sq_buf_dma_addr, GFP_KERNEL);
+	if (!qp->kern_qp.sq_buf)
+		goto err_out;
+
+	qp->kern_qp.rq_buf = dma_alloc_coherent(&dev->pdev->dev,
+		(qp->attrs.rq_size << RQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+		&qp->kern_qp.rq_buf_dma_addr, GFP_KERNEL);
+	if (!qp->kern_qp.rq_buf)
+		goto err_out;
+
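+	/*
+	 * The doorbell records live in the extra tail space
+	 * (ERDMA_EXTRA_BUFFER_SIZE) allocated behind each queue buffer.
+	 */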
+	qp->kern_qp.sq_db_info = qp->kern_qp.sq_buf + (qp->attrs.sq_size << SQEBB_SHIFT);
+	qp->kern_qp.rq_db_info = qp->kern_qp.rq_buf + (qp->attrs.rq_size << RQE_SHIFT);
+
+	return 0;
+
+err_out:
+	free_kernel_qp(qp);
+	return ret;
+}
+
+static inline int
+get_mtt_entries(struct erdma_dev *dev, struct erdma_mem *mem, u64 start,
+		u64 len, int access, u64 virt, unsigned long req_page_size, u8 force_indirect_mtt)
+{
+	struct ib_block_iter biter;
+	uint64_t *phy_addr = NULL;
+	int ret = 0;
+
+	mem->umem = ib_umem_get(&dev->ibdev, start, len, access);
+	if (IS_ERR(mem->umem)) {
+		ret = PTR_ERR(mem->umem);
+		mem->umem = NULL;
+		return ret;
+	}
+
+	mem->page_size = ib_umem_find_best_pgsz(mem->umem, req_page_size, virt);
+	mem->page_offset = start & (mem->page_size - 1);
+	mem->mtt_nents = ib_umem_num_dma_blocks(mem->umem, mem->page_size);
+	mem->page_cnt = mem->mtt_nents;
+
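+	/*
+	 * Small page lists are carried inline in the command/WQE; larger
+	 * ones go through an indirect MTT buffer that HW fetches via DMA.
+	 */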
+	if (mem->page_cnt > ERDMA_MAX_INLINE_MTT_ENTRIES || force_indirect_mtt) {
+		mem->mtt_type = ERDMA_MR_INDIRECT_MTT;
+		mem->mtt_buf = alloc_pages_exact(MTT_SIZE(mem->page_cnt), GFP_KERNEL);
+		if (!mem->mtt_buf) {
+			ret = -ENOMEM;
+			goto error_ret;
+		}
+		phy_addr = mem->mtt_buf;
+	} else {
+		mem->mtt_type = ERDMA_MR_INLINE_MTT;
+		phy_addr = mem->mtt_entry;
+	}
+
+	rdma_umem_for_each_dma_block(mem->umem, &biter, mem->page_size) {
+		*phy_addr = rdma_block_iter_dma_address(&biter);
+		phy_addr++;
+	}
+
+	if (mem->mtt_type == ERDMA_MR_INDIRECT_MTT) {
+		mem->mtt_entry[0] = dma_map_single(&dev->pdev->dev, mem->mtt_buf,
+			MTT_SIZE(mem->page_cnt), DMA_TO_DEVICE);
+		if (dma_mapping_error(&dev->pdev->dev, mem->mtt_entry[0])) {
+			ibdev_err(&dev->ibdev, "failed to map DMA address.\n");
+			free_pages_exact(mem->mtt_buf, MTT_SIZE(mem->page_cnt));
+			mem->mtt_buf = NULL;
+			ret = -ENOMEM;
+			goto error_ret;
+		}
+	}
+
+	return 0;
+
+error_ret:
+	if (mem->umem) {
+		ib_umem_release(mem->umem);
+		mem->umem = NULL;
+	}
+
+	return ret;
+}
+
+static void
+put_mtt_entries(struct erdma_dev *dev, struct erdma_mem *mem)
+{
+	if (mem->umem) {
+		ib_umem_release(mem->umem);
+		mem->umem = NULL;
+	}
+
+	if (mem->mtt_buf) {
+		dma_unmap_single(&dev->pdev->dev, mem->mtt_entry[0],
+				MTT_SIZE(mem->page_cnt), DMA_TO_DEVICE);
+		free_pages_exact(mem->mtt_buf, MTT_SIZE(mem->page_cnt));
+	}
+}
+
+static int
+erdma_map_user_dbrecords(struct erdma_ucontext *ctx, u64 dbrecords_va,
+	struct erdma_user_dbrecords_page **dbr_page, dma_addr_t *dma_addr)
+{
+	struct erdma_user_dbrecords_page *page = NULL;
+	int rv = 0;
+
+	mutex_lock(&ctx->dbrecords_page_mutex);
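+	/*
+	 * Doorbell records of several QPs/CQs may share one user page:
+	 * pin each page only once and refcount it across its users.
+	 */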
+
+	list_for_each_entry(page, &ctx->dbrecords_page_list, list)
+		if (page->va == (dbrecords_va & PAGE_MASK))
+			goto found;
+
+	page = kmalloc(sizeof(*page), GFP_KERNEL);
+	if (!page) {
+		rv = -ENOMEM;
+		goto out;
+	}
+
+	page->va = (dbrecords_va & PAGE_MASK);
+	page->refcnt = 0;
+
+	page->umem = ib_umem_get(ctx->ibucontext.device, dbrecords_va & PAGE_MASK, PAGE_SIZE, 0);
+	if (IS_ERR(page->umem)) {
+		rv = PTR_ERR(page->umem);
+		kfree(page);
+		goto out;
+	}
+
+	list_add(&page->list, &ctx->dbrecords_page_list);
+
+found:
+	*dma_addr = sg_dma_address(page->umem->sgt_append.sgt.sgl) + (dbrecords_va & ~PAGE_MASK);
+	*dbr_page = page;
+	page->refcnt++;
+
+out:
+	mutex_unlock(&ctx->dbrecords_page_mutex);
+	return rv;
+}
+
+static void
+erdma_unmap_user_dbrecords(struct erdma_ucontext *ctx, struct erdma_user_dbrecords_page **dbr_page)
+{
+	if (!ctx || !(*dbr_page))
+		return;
+
+	mutex_lock(&ctx->dbrecords_page_mutex);
+	if (--(*dbr_page)->refcnt == 0) {
+		list_del(&(*dbr_page)->list);
+		ib_umem_release((*dbr_page)->umem);
+		kfree(*dbr_page);
+	}
+
+	*dbr_page = NULL;
+	mutex_unlock(&ctx->dbrecords_page_mutex);
+}
+
+static int
+init_user_qp(struct erdma_qp *qp, struct erdma_ucontext *uctx, u64 va, u32 len, u64 db_info_va)
+{
+	int ret;
+	dma_addr_t db_info_dma_addr;
+	u32 rq_offset;
+
+	qp->is_kernel_qp = false;
+	if (len < (PAGE_ALIGN(qp->attrs.sq_size * SQEBB_SIZE) + qp->attrs.rq_size * RQE_SIZE)) {
+		ibdev_err(&qp->dev->ibdev, "queue len error qbuf(%u) sq(%u) rq(%u).\n", len,
+			qp->attrs.sq_size, qp->attrs.rq_size);
+		return -EINVAL;
+	}
+
+	ret = get_mtt_entries(qp->dev, &qp->user_qp.sq_mtt, va,
+		qp->attrs.sq_size << SQEBB_SHIFT, 0, va, (SZ_1M - SZ_4K), 1);
+	if (ret)
+		goto err_out;
+
+	rq_offset = PAGE_ALIGN(qp->attrs.sq_size << SQEBB_SHIFT);
+	qp->user_qp.rq_offset = rq_offset;
+
+	ret = get_mtt_entries(qp->dev, &qp->user_qp.rq_mtt, va + rq_offset,
+		qp->attrs.rq_size << RQE_SHIFT, 0, va + rq_offset, (SZ_1M - SZ_4K), 1);
+	if (ret)
+		goto err_out;
+
+	ret = erdma_map_user_dbrecords(uctx, db_info_va,
+		&qp->user_qp.user_dbr_page, &db_info_dma_addr);
+	if (ret)
+		goto err_out;
+
+	qp->user_qp.sq_db_info_dma_addr = db_info_dma_addr;
+	qp->user_qp.rq_db_info_dma_addr = db_info_dma_addr + 8;
+
+	return 0;
+
+err_out:
+	return ret;
+}
+
+static void
+free_user_qp(struct erdma_qp *qp, struct erdma_ucontext *uctx)
+{
+	put_mtt_entries(qp->dev, &qp->user_qp.sq_mtt);
+	put_mtt_entries(qp->dev, &qp->user_qp.rq_mtt);
+	erdma_unmap_user_dbrecords(uctx, &qp->user_qp.user_dbr_page);
+}
+
+int erdma_create_qp(struct ib_qp *ibqp,
+		    struct ib_qp_init_attr *attrs,
+		    struct ib_udata *udata)
+{
+	struct erdma_qp *qp = to_eqp(ibqp);
+	struct erdma_dev *dev = to_edev(ibqp->device);
+	struct erdma_ucontext *uctx =
+		rdma_udata_to_drv_context(udata, struct erdma_ucontext, ibucontext);
+	struct erdma_ureq_create_qp ureq;
+	struct erdma_uresp_create_qp uresp;
+	int ret;
+
+	ret = erdma_qp_validate_cap(dev, attrs);
+	if (ret)
+		goto err_out;
+
+	ret = erdma_qp_validate_attr(dev, attrs);
+	if (ret)
+		goto err_out;
+
+	qp->scq = to_ecq(attrs->send_cq);
+	qp->rcq = to_ecq(attrs->recv_cq);
+	qp->dev = dev;
+
+	init_rwsem(&qp->state_lock);
+	kref_init(&qp->ref);
+	init_completion(&qp->safe_free);
+
+	ret = xa_alloc_cyclic(&dev->qp_xa, &qp->ibqp.qp_num, qp,
+		XA_LIMIT(1, dev->attrs.max_qp - 1), &dev->next_alloc_qpn, GFP_KERNEL);
+	if (ret < 0) {
+		ret = -ENOMEM;
+		goto err_out;
+	}
+
+	qp->attrs.sq_size = roundup_pow_of_two(attrs->cap.max_send_wr * ERDMA_MAX_WQEBB_PER_SQE);
+	qp->attrs.rq_size = roundup_pow_of_two(attrs->cap.max_recv_wr);
+
+	if (uctx) {
+		ret = ib_copy_from_udata(&ureq, udata, min(sizeof(ureq), udata->inlen));
+		if (ret)
+			goto err_out_xa;
+
+		ret = init_user_qp(qp, uctx, ureq.qbuf_va, ureq.qbuf_len,
+				   ureq.db_record_va);
+		if (ret)
+			goto err_out_xa;
+
+		memset(&uresp, 0, sizeof(uresp));
+
+		uresp.num_sqe = qp->attrs.sq_size;
+		uresp.num_rqe = qp->attrs.rq_size;
+		uresp.qp_id = QP_ID(qp);
+		uresp.rq_offset = qp->user_qp.rq_offset;
+
+		ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+		if (ret)
+			goto err_out_xa;
+	} else {
+		ret = init_kernel_qp(dev, qp, attrs);
+		if (ret)
+			goto err_out_xa;
+	}
+
+	qp->attrs.max_send_sge = attrs->cap.max_send_sge;
+	qp->attrs.max_recv_sge = attrs->cap.max_recv_sge;
+	qp->attrs.state = ERDMA_QP_STATE_IDLE;
+
+	ret = create_qp_cmd(dev, qp);
+	if (ret)
+		goto err_out_cmd;
+
+	spin_lock_init(&qp->lock);
+	atomic_inc(&dev->num_qp);
+
+	return 0;
+
+err_out_cmd:
+	if (qp->is_kernel_qp)
+		free_kernel_qp(qp);
+	else
+		free_user_qp(qp, uctx);
+err_out_xa:
+	xa_erase(&dev->qp_xa, QP_ID(qp));
+err_out:
+	return ret;
+}
+
+static inline int erdma_create_stag(struct erdma_dev *dev, u32 *stag)
+{
+	int stag_idx;
+	u32 key = 0;
+
+	stag_idx = erdma_alloc_idx(&dev->res_cb[ERDMA_RES_TYPE_STAG_IDX]);
+	if (stag_idx < 0)
+		return stag_idx;
+
+	*stag = (stag_idx << 8) | (key & 0xFF);
+
+	return 0;
+}
+
+struct ib_mr *erdma_get_dma_mr(struct ib_pd *ibpd, int mr_access_flags)
+{
+	struct erdma_mr *mr;
+	struct erdma_dev *dev = to_edev(ibpd->device);
+	int ret;
+	u32 stag;
+
+	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	ret = erdma_create_stag(dev, &stag);
+	if (ret)
+		goto out_free;
+
+	mr->type = ERDMA_MR_TYPE_DMA;
+
+	mr->ibmr.lkey = stag;
+	mr->ibmr.rkey = stag;
+	mr->ibmr.pd = ibpd;
+	mr->access = ERDMA_MR_ACC_LR |
+		(mr_access_flags & IB_ACCESS_REMOTE_READ ? ERDMA_MR_ACC_RR : 0) |
+		(mr_access_flags & IB_ACCESS_LOCAL_WRITE ? ERDMA_MR_ACC_LW : 0) |
+		(mr_access_flags & IB_ACCESS_REMOTE_WRITE ? ERDMA_MR_ACC_RW : 0);
+	ret = regmr_cmd(dev, mr);
+	if (ret) {
+		ret = -EIO;
+		goto out_remove_stag;
+	}
+
+	atomic_inc(&dev->num_mr);
+	return &mr->ibmr;
+
+out_remove_stag:
+	erdma_free_idx(&dev->res_cb[ERDMA_RES_TYPE_STAG_IDX], mr->ibmr.lkey >> 8);
+
+out_free:
+	kfree(mr);
+
+	return ERR_PTR(ret);
+}
+
+struct ib_mr *
+erdma_ib_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type, u32 max_num_sg)
+{
+	struct erdma_mr *mr;
+	struct erdma_dev *dev = to_edev(ibpd->device);
+	int ret;
+	u32 stag;
+
+	if (mr_type != IB_MR_TYPE_MEM_REG)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (max_num_sg > ERDMA_MR_MAX_MTT_CNT) {
+		ibdev_err(&dev->ibdev, "max_num_sg too large:%u", max_num_sg);
+		return ERR_PTR(-EINVAL);
+	}
+
+	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	ret = erdma_create_stag(dev, &stag);
+	if (ret)
+		goto out_free;
+
+	mr->type = ERDMA_MR_TYPE_FRMR;
+
+	mr->ibmr.lkey = stag;
+	mr->ibmr.rkey = stag;
+	mr->ibmr.pd = ibpd;
+	/* update it in FRMR. */
+	mr->access = ERDMA_MR_ACC_LR | ERDMA_MR_ACC_LW | ERDMA_MR_ACC_RR | ERDMA_MR_ACC_RW;
+
+	mr->mem.page_size = PAGE_SIZE; /* update it later. */
+	mr->mem.page_cnt = max_num_sg;
+	mr->mem.mtt_type = ERDMA_MR_INDIRECT_MTT;
+	mr->mem.mtt_buf = alloc_pages_exact(MTT_SIZE(mr->mem.page_cnt), GFP_KERNEL);
+	if (!mr->mem.mtt_buf) {
+		ret = -ENOMEM;
+		goto out_remove_stag;
+	}
+
+	mr->mem.mtt_entry[0] = dma_map_single(&dev->pdev->dev, mr->mem.mtt_buf,
+		MTT_SIZE(mr->mem.page_cnt), DMA_TO_DEVICE);
+	if (dma_mapping_error(&dev->pdev->dev, mr->mem.mtt_entry[0])) {
+		ret = -ENOMEM;
+		goto out_free_mtt;
+	}
+
+	ret = regmr_cmd(dev, mr);
+	if (ret) {
+		ret = -EIO;
+		goto out_dma_unmap;
+	}
+
+	atomic_inc(&dev->num_mr);
+	return &mr->ibmr;
+
+out_dma_unmap:
+	dma_unmap_single(&dev->pdev->dev, mr->mem.mtt_entry[0],
+		MTT_SIZE(mr->mem.page_cnt), DMA_TO_DEVICE);
+out_free_mtt:
+	free_pages_exact(mr->mem.mtt_buf, MTT_SIZE(mr->mem.page_cnt));
+
+out_remove_stag:
+	erdma_free_idx(&dev->res_cb[ERDMA_RES_TYPE_STAG_IDX], mr->ibmr.lkey >> 8);
+
+out_free:
+	kfree(mr);
+
+	return ERR_PTR(ret);
+}
+
+static int erdma_set_page(struct ib_mr *ibmr, u64 addr)
+{
+	struct erdma_mr *mr = to_emr(ibmr);
+
+	if (mr->mem.mtt_nents >= mr->mem.page_cnt)
+		return -1;
+
+	*((u64 *)mr->mem.mtt_buf + mr->mem.mtt_nents) = addr;
+	mr->mem.mtt_nents++;
+
+	return 0;
+}
+
+int erdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
+		    unsigned int *sg_offset)
+{
+	struct erdma_mr *mr = to_emr(ibmr);
+	int num;
+
+	mr->mem.mtt_nents = 0;
+
+	num = ib_sg_to_pages(&mr->ibmr, sg, sg_nents, sg_offset, erdma_set_page);
+
+	return num;
+}
+
+struct ib_mr *erdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
+				u64 virt, int access, struct ib_udata *udata)
+{
+	struct erdma_mr *mr = NULL;
+	struct erdma_dev *dev = to_edev(ibpd->device);
+	u32 stag;
+	int ret;
+
+	if (!len || len > dev->attrs.max_mr_size) {
+		ibdev_err(&dev->ibdev, "ERROR: Out of mr size: %llu, max %llu\n",
+			len, dev->attrs.max_mr_size);
+		return ERR_PTR(-EINVAL);
+	}
+
+	mr = kzalloc(sizeof(*mr), GFP_KERNEL);
+	if (!mr)
+		return ERR_PTR(-ENOMEM);
+
+	ret = get_mtt_entries(dev, &mr->mem, start, len, access, virt, SZ_2G - SZ_4K, 0);
+	if (ret)
+		goto err_out_free;
+
+	ret = erdma_create_stag(dev, &stag);
+	if (ret)
+		goto err_out_put_mtt;
+
+	mr->ibmr.lkey = mr->ibmr.rkey = stag;
+	mr->ibmr.pd = ibpd;
+	mr->mem.va = virt;
+	mr->mem.len = len;
+	mr->access = ERDMA_MR_ACC_LR |
+		(access & IB_ACCESS_REMOTE_READ ? ERDMA_MR_ACC_RR : 0) |
+		(access & IB_ACCESS_LOCAL_WRITE ? ERDMA_MR_ACC_LW : 0) |
+		(access & IB_ACCESS_REMOTE_WRITE ? ERDMA_MR_ACC_RW : 0);
+	mr->valid = 1;
+	mr->type = ERDMA_MR_TYPE_NORMAL;
+
+	ret = regmr_cmd(dev, mr);
+	if (ret) {
+		ret = -EIO;
+		goto err_out_mr;
+	}
+
+	atomic_inc(&dev->num_mr);
+
+	return &mr->ibmr;
+
+err_out_mr:
+	erdma_free_idx(&dev->res_cb[ERDMA_RES_TYPE_STAG_IDX], mr->ibmr.lkey >> 8);
+
+err_out_put_mtt:
+	put_mtt_entries(dev, &mr->mem);
+
+err_out_free:
+	kfree(mr);
+
+	return ERR_PTR(ret);
+}
+
+int erdma_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
+{
+	struct erdma_mr *mr;
+	struct erdma_dev *dev = to_edev(ibmr->device);
+	struct erdma_cmdq_dereg_mr_req req;
+	int ret;
+
+	mr = to_emr(ibmr);
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_DEREG_MR);
+
+	req.cfg0 = FIELD_PREP(ERDMA_CMD_MR_MPT_IDX_MASK, ibmr->lkey >> 8) |
+		FIELD_PREP(ERDMA_CMD_MR_KEY_MASK, ibmr->lkey & 0xFF);
+
+	ret = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req, sizeof(req), NULL, NULL);
+	if (ret) {
+		dev_err(&dev->pdev->dev,
+			"ERROR: err code = %d, cmd of dereg mr failed.\n", ret);
+		return ret;
+	}
+
+	erdma_free_idx(&dev->res_cb[ERDMA_RES_TYPE_STAG_IDX], ibmr->lkey >> 8);
+	atomic_dec(&dev->num_mr);
+
+	put_mtt_entries(dev, &mr->mem);
+
+	kfree(mr);
+	return 0;
+}
+
+int erdma_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
+{
+	struct erdma_cq *cq = to_ecq(ibcq);
+	struct erdma_dev *dev = to_edev(ibcq->device);
+	struct erdma_ucontext *ctx =
+		rdma_udata_to_drv_context(udata, struct erdma_ucontext, ibucontext);
+	int err;
+	struct erdma_cmdq_destroy_cq_req req;
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_DESTROY_CQ);
+	req.cqn = cq->cqn;
+
+	err = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req,
+		sizeof(req), NULL, NULL);
+	if (err) {
+		dev_err(&dev->pdev->dev,
+			"ERROR: err code = %d, cmd of destroy cq failed.\n", err);
+		return err;
+	}
+
+	if (cq->is_kernel_cq) {
+		dma_free_coherent(&dev->pdev->dev,
+			(cq->depth << CQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+			cq->kern_cq.qbuf, cq->kern_cq.qbuf_dma_addr);
+	} else {
+		erdma_unmap_user_dbrecords(ctx, &cq->user_cq.user_dbr_page);
+		put_mtt_entries(dev, &cq->user_cq.qbuf_mtt);
+	}
+
+	xa_erase(&dev->cq_xa, cq->cqn);
+	atomic_dec(&dev->num_cq);
+
+	return 0;
+}
+
+int erdma_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
+{
+	struct erdma_qp *qp = to_eqp(ibqp);
+	struct erdma_dev *dev = to_edev(ibqp->device);
+	struct erdma_ucontext *ctx =
+		rdma_udata_to_drv_context(udata, struct erdma_ucontext, ibucontext);
+	struct erdma_qp_attrs qp_attrs;
+	int err;
+	struct erdma_cmdq_destroy_qp_req req;
+
+	down_write(&qp->state_lock);
+	qp_attrs.state = ERDMA_QP_STATE_ERROR;
+	(void)erdma_modify_qp_internal(qp, &qp_attrs, ERDMA_QP_ATTR_STATE);
+	up_write(&qp->state_lock);
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_DESTROY_QP);
+	req.qpn = QP_ID(qp);
+
+	erdma_qp_put(qp);
+	wait_for_completion(&qp->safe_free);
+
+	err = erdma_post_cmd_wait(&dev->cmdq, (u64 *)&req,
+		sizeof(req), NULL, NULL);
+	if (err) {
+		dev_err(&dev->pdev->dev,
+			"ERROR: err code = %d, cmd of destroy qp failed.\n", err);
+		return err;
+	}
+
+	if (qp->is_kernel_qp) {
+		vfree(qp->kern_qp.swr_tbl);
+		vfree(qp->kern_qp.rwr_tbl);
+		dma_free_coherent(&dev->pdev->dev,
+			(qp->attrs.rq_size << RQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+			qp->kern_qp.rq_buf, qp->kern_qp.rq_buf_dma_addr);
+		dma_free_coherent(&dev->pdev->dev,
+			(qp->attrs.sq_size << SQEBB_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+			qp->kern_qp.sq_buf, qp->kern_qp.sq_buf_dma_addr);
+	} else {
+		put_mtt_entries(dev, &qp->user_qp.sq_mtt);
+		put_mtt_entries(dev, &qp->user_qp.rq_mtt);
+		erdma_unmap_user_dbrecords(ctx, &qp->user_qp.user_dbr_page);
+	}
+
+	if (qp->cep)
+		erdma_cep_put(qp->cep);
+	xa_erase(&dev->qp_xa, QP_ID(qp));
+	atomic_dec(&dev->num_qp);
+
+	return 0;
+}
+
+void erdma_qp_get_ref(struct ib_qp *ibqp)
+{
+	erdma_qp_get(to_eqp(ibqp));
+}
+
+void erdma_qp_put_ref(struct ib_qp *ibqp)
+{
+	erdma_qp_put(to_eqp(ibqp));
+}
+
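+/*
+ * erdma_mmap() backs the doorbell offsets returned to user space from
+ * erdma_alloc_ucontext(): the pgoff is looked up in the per-context
+ * rdma_user_mmap_entry table, and the matching BAR page is mapped
+ * non-cached into the process address space.
+ */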
+int erdma_mmap(struct ib_ucontext *ctx, struct vm_area_struct *vma)
+{
+	struct rdma_user_mmap_entry *rdma_entry;
+	struct erdma_user_mmap_entry *entry;
+	int err = -EINVAL;
+
+	if (vma->vm_start & (PAGE_SIZE - 1)) {
+		pr_warn("WARN: map not page aligned\n");
+		goto out;
+	}
+
+	rdma_entry = rdma_user_mmap_entry_get(ctx, vma);
+	if (!rdma_entry) {
+		pr_warn("WARN: mmap lookup failed: %lx\n", vma->vm_pgoff);
+		goto out;
+	}
+
+	entry = to_emmap(rdma_entry);
+
+	switch (entry->mmap_flag) {
+	case ERDMA_MMAP_IO_NC:
+		/* map doorbell. */
+		vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+		err = io_remap_pfn_range(vma, vma->vm_start, PFN_DOWN(entry->address),
+			PAGE_SIZE, vma->vm_page_prot);
+		break;
+	default:
+		pr_err("mmap failed, uobj type = %d\n", entry->mmap_flag);
+		err = -EINVAL;
+		break;
+	}
+
+	rdma_user_mmap_entry_put(rdma_entry);
+out:
+	return err;
+}
+
+#define ERDMA_SDB_PAGE     0
+#define ERDMA_SDB_ENTRY    1
+#define ERDMA_SDB_SHARED   2
+
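+/*
+ * SQ doorbell space is handed out in three flavors, tried in order:
+ * ERDMA_SDB_PAGE (a whole doorbell page owned by one context, usable
+ * for direct WQE writes), ERDMA_SDB_ENTRY (a single slot inside a
+ * shared DWQE page), and ERDMA_SDB_SHARED (the common fallback page).
+ */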
+static void alloc_db_resources(struct erdma_dev *dev, struct erdma_ucontext *ctx)
+{
+	u32 bitmap_idx;
+
+	if (dev->disable_dwqe)
+		goto alloc_normal_db;
+
+	/* Try to alloc independent SDB page. */
+	spin_lock(&dev->db_bitmap_lock);
+	bitmap_idx = find_first_zero_bit(dev->sdb_page, dev->dwqe_pages);
+	if (bitmap_idx != dev->dwqe_pages) {
+		set_bit(bitmap_idx, dev->sdb_page);
+		spin_unlock(&dev->db_bitmap_lock);
+
+		ctx->sdb_type = ERDMA_SDB_PAGE;
+		ctx->sdb_idx = bitmap_idx;
+		ctx->sdb_page_idx = bitmap_idx;
+		ctx->sdb = dev->func_bar_addr +
+			ERDMA_BAR_SQDB_SPACE_OFFSET + (bitmap_idx << PAGE_SHIFT);
+		ctx->sdb_page_off = 0;
+
+		return;
+	}
+
+	bitmap_idx = find_first_zero_bit(dev->sdb_entry, dev->dwqe_entries);
+	if (bitmap_idx != dev->dwqe_entries) {
+		set_bit(bitmap_idx, dev->sdb_entry);
+		spin_unlock(&dev->db_bitmap_lock);
+
+		ctx->sdb_type = ERDMA_SDB_ENTRY;
+		ctx->sdb_idx = bitmap_idx;
+		ctx->sdb_page_idx = ERDMA_DWQE_TYPE0_CNT +
+			bitmap_idx / ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
+		ctx->sdb_page_off = bitmap_idx % ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
+
+		ctx->sdb = dev->func_bar_addr +
+			ERDMA_BAR_SQDB_SPACE_OFFSET + (ctx->sdb_page_idx << PAGE_SHIFT);
+
+		return;
+	}
+
+	spin_unlock(&dev->db_bitmap_lock);
+
+alloc_normal_db:
+	ctx->sdb_type = ERDMA_SDB_SHARED;
+	ctx->sdb_idx = 0;
+	ctx->sdb_page_idx = ERDMA_SDB_SHARED_PAGE_INDEX;
+	ctx->sdb_page_off = 0;
+
+	ctx->sdb = dev->func_bar_addr +
+		ERDMA_BAR_SQDB_SPACE_OFFSET + (ctx->sdb_page_idx << PAGE_SHIFT);
+}
+
+static void erdma_uctx_user_mmap_entries_remove(struct erdma_ucontext *uctx)
+{
+	rdma_user_mmap_entry_remove(uctx->sq_db_mmap_entry);
+	rdma_user_mmap_entry_remove(uctx->rq_db_mmap_entry);
+	rdma_user_mmap_entry_remove(uctx->cq_db_mmap_entry);
+}
+
+int erdma_alloc_ucontext(struct ib_ucontext *ibctx,
+			 struct ib_udata *udata)
+{
+	struct erdma_ucontext *ctx = to_ectx(ibctx);
+	struct erdma_dev *dev = to_edev(ibctx->device);
+	int ret;
+	struct erdma_uresp_alloc_ctx uresp = {};
+
+	if (atomic_inc_return(&dev->num_ctx) > ERDMA_MAX_CONTEXT) {
+		ret = -ENOMEM;
+		goto err_out;
+	}
+
+	INIT_LIST_HEAD(&ctx->dbrecords_page_list);
+	mutex_init(&ctx->dbrecords_page_mutex);
+	ctx->dev = dev;
+
+	alloc_db_resources(dev, ctx);
+
+	ctx->rdb = dev->func_bar_addr + ERDMA_BAR_RQDB_SPACE_OFFSET;
+	ctx->cdb = dev->func_bar_addr + ERDMA_BAR_CQDB_SPACE_OFFSET;
+
+	if (udata->outlen < sizeof(uresp)) {
+		ret = -EINVAL;
+		goto err_out;
+	}
+
+	ctx->sq_db_mmap_entry = erdma_user_mmap_entry_insert(ctx, (void *)ctx->sdb,
+		PAGE_SIZE, ERDMA_MMAP_IO_NC, &uresp.sdb);
+	if (!ctx->sq_db_mmap_entry) {
+		ret = -ENOMEM;
+		goto err_out;
+	}
+
+	ctx->rq_db_mmap_entry = erdma_user_mmap_entry_insert(ctx, (void *)ctx->rdb,
+		PAGE_SIZE, ERDMA_MMAP_IO_NC, &uresp.rdb);
+	if (!ctx->rq_db_mmap_entry) {
+		ret = -ENOMEM;
+		goto err_out;
+	}
+
+	ctx->cq_db_mmap_entry = erdma_user_mmap_entry_insert(ctx, (void *)ctx->cdb,
+		PAGE_SIZE, ERDMA_MMAP_IO_NC, &uresp.cdb);
+	if (!ctx->cq_db_mmap_entry) {
+		ret = -ENOMEM;
+		goto err_out;
+	}
+
+	uresp.dev_id = dev->attrs.vendor_part_id;
+	uresp.sdb_type = ctx->sdb_type;
+	uresp.sdb_offset = ctx->sdb_page_off;
+
+	ret = ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+	if (ret)
+		goto err_out;
+
+	return 0;
+
+err_out:
+	erdma_uctx_user_mmap_entries_remove(ctx);
+	atomic_dec(&dev->num_ctx);
+	return ret;
+}
+
+void erdma_dealloc_ucontext(struct ib_ucontext *ibctx)
+{
+	struct erdma_ucontext *ctx = to_ectx(ibctx);
+	struct erdma_dev *dev = ctx->dev;
+
+	spin_lock(&dev->db_bitmap_lock);
+	if (ctx->sdb_type == ERDMA_SDB_PAGE)
+		clear_bit(ctx->sdb_idx, dev->sdb_page);
+	else if (ctx->sdb_type == ERDMA_SDB_ENTRY)
+		clear_bit(ctx->sdb_idx, dev->sdb_entry);
+	spin_unlock(&dev->db_bitmap_lock);
+
+	/* rdma_user_mmap_entry_remove() may sleep; call it unlocked. */
+	erdma_uctx_user_mmap_entries_remove(ctx);
+
+	atomic_dec(&ctx->dev->num_ctx);
+}
+
+static int ib_qp_state_to_erdma_qp_state[IB_QPS_ERR + 1] = {
+	[IB_QPS_RESET]	= ERDMA_QP_STATE_IDLE,
+	[IB_QPS_INIT]	= ERDMA_QP_STATE_IDLE,
+	[IB_QPS_RTR]	= ERDMA_QP_STATE_RTR,
+	[IB_QPS_RTS]	= ERDMA_QP_STATE_RTS,
+	[IB_QPS_SQD]	= ERDMA_QP_STATE_CLOSING,
+	[IB_QPS_SQE]	= ERDMA_QP_STATE_TERMINATE,
+	[IB_QPS_ERR]	= ERDMA_QP_STATE_ERROR
+};
+
+int erdma_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		    int attr_mask, struct ib_udata *udata)
+{
+	struct erdma_qp_attrs new_attrs;
+	enum erdma_qp_attr_mask erdma_attr_mask = 0;
+	struct erdma_qp *qp = to_eqp(ibqp);
+	int ret = 0;
+
+	if (!attr_mask)
+		goto out;
+
+	memset(&new_attrs, 0, sizeof(new_attrs));
+
+	if (attr_mask & IB_QP_ACCESS_FLAGS) {
+		erdma_attr_mask |= ERDMA_QP_ATTR_ACCESS_FLAGS;
+
+		if (attr->qp_access_flags & IB_ACCESS_REMOTE_READ)
+			new_attrs.flags |= ERDMA_READ_ENABLED;
+		if (attr->qp_access_flags & IB_ACCESS_REMOTE_WRITE)
+			new_attrs.flags |= ERDMA_WRITE_ENABLED;
+		if (attr->qp_access_flags & IB_ACCESS_MW_BIND)
+			new_attrs.flags |= ERDMA_BIND_ENABLED;
+	}
+
+	if (attr_mask & IB_QP_STATE) {
+		if (attr->qp_state > IB_QPS_ERR)
+			return -EINVAL;
+
+		new_attrs.state = ib_qp_state_to_erdma_qp_state[attr->qp_state];
+
+		if (new_attrs.state == ERDMA_QP_STATE_UNDEF)
+			return -EINVAL;
+
+		erdma_attr_mask |= ERDMA_QP_ATTR_STATE;
+	}
+
+	down_write(&qp->state_lock);
+
+	ret = erdma_modify_qp_internal(qp, &new_attrs, erdma_attr_mask);
+
+	up_write(&qp->state_lock);
+
+out:
+	return ret;
+}
+
+static inline enum ib_mtu erdma_mtu_net2ib(unsigned short mtu)
+{
+	if (mtu >= 4096)
+		return IB_MTU_4096;
+	if (mtu >= 2048)
+		return IB_MTU_2048;
+	if (mtu >= 1024)
+		return IB_MTU_1024;
+	if (mtu >= 512)
+		return IB_MTU_512;
+	if (mtu >= 256)
+		return IB_MTU_256;
+	return IB_MTU_4096;
+}
+
+int erdma_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr,
+		 int qp_attr_mask, struct ib_qp_init_attr *qp_init_attr)
+{
+	struct erdma_qp *qp;
+	struct erdma_dev *dev;
+
+	if (!ibqp || !qp_attr || !qp_init_attr)
+		return -EINVAL;
+
+	qp = to_eqp(ibqp);
+	dev = to_edev(ibqp->device);
+
+	qp_attr->cap.max_inline_data = ERDMA_MAX_INLINE;
+	qp_init_attr->cap.max_inline_data = ERDMA_MAX_INLINE;
+
+	qp_attr->cap.max_send_wr = qp->attrs.sq_size;
+	qp_attr->cap.max_recv_wr = qp->attrs.rq_size;
+	qp_attr->cap.max_send_sge = qp->attrs.max_send_sge;
+	qp_attr->cap.max_recv_sge = qp->attrs.max_recv_sge;
+
+	qp_attr->path_mtu = erdma_mtu_net2ib(dev->netdev->mtu);
+	qp_attr->max_rd_atomic = qp->attrs.irq_size;
+	qp_attr->max_dest_rd_atomic = qp->attrs.orq_size;
+
+	qp_attr->qp_access_flags = IB_ACCESS_LOCAL_WRITE |
+		IB_ACCESS_REMOTE_WRITE | IB_ACCESS_REMOTE_READ;
+
+	qp_init_attr->cap = qp_attr->cap;
+
+	return 0;
+}
+
+int erdma_create_cq(struct ib_cq *ibcq,
+		    const struct ib_cq_init_attr *attr,
+		    struct ib_udata *udata)
+{
+	struct erdma_cq *cq = to_ecq(ibcq);
+	struct erdma_dev *dev = to_edev(ibcq->device);
+	unsigned int depth = attr->cqe;
+	int ret;
+	struct erdma_ucontext *ctx =
+		rdma_udata_to_drv_context(udata, struct erdma_ucontext, ibucontext);
+
+	if (depth > dev->attrs.max_cqe) {
+		dev_warn(&dev->pdev->dev,
+			"WARN: requested cqe (%u) exceeds capability (%u)\n",
+			depth, dev->attrs.max_cqe);
+		return -EINVAL;
+	}
+
+	depth = roundup_pow_of_two(depth);
+	cq->ibcq.cqe = depth;
+	cq->depth = depth;
+	cq->assoc_eqn = attr->comp_vector + 1;
+
+	ret = xa_alloc_cyclic(&dev->cq_xa, &cq->cqn, cq,
+		XA_LIMIT(1, dev->attrs.max_cq - 1), &dev->next_alloc_cqn, GFP_KERNEL);
+	if (ret < 0)
+		return ret;
+
+	if (udata) {
+		struct erdma_ureq_create_cq ureq;
+		struct erdma_uresp_create_cq uresp = {};
+
+		ret = ib_copy_from_udata(&ureq, udata, min(udata->inlen, sizeof(ureq)));
+		if (ret)
+			goto err_out_xa;
+		cq->is_kernel_cq = 0;
+
+		ret = get_mtt_entries(dev, &cq->user_cq.qbuf_mtt, ureq.qbuf_va, ureq.qbuf_len,
+			0, ureq.qbuf_va, SZ_64M - SZ_4K, 1);
+		if (ret)
+			goto err_out_xa;
+
+		ret = erdma_map_user_dbrecords(ctx, ureq.db_record_va, &cq->user_cq.user_dbr_page,
+			&cq->user_cq.db_info_dma_addr);
+		if (ret) {
+			put_mtt_entries(dev, &cq->user_cq.qbuf_mtt);
+			goto err_out_xa;
+		}
+
+		uresp.cq_id = cq->cqn;
+		uresp.num_cqe = depth;
+
+		ret = ib_copy_to_udata(udata, &uresp, min(sizeof(uresp), udata->outlen));
+		if (ret) {
+			erdma_unmap_user_dbrecords(ctx, &cq->user_cq.user_dbr_page);
+			put_mtt_entries(dev, &cq->user_cq.qbuf_mtt);
+			goto err_out_xa;
+		}
+	} else {
+		cq->is_kernel_cq = 1;
+		cq->kern_cq.owner = 1;
+
+		cq->kern_cq.qbuf = dma_alloc_coherent(&dev->pdev->dev,
+			(depth << CQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+			&cq->kern_cq.qbuf_dma_addr, GFP_KERNEL);
+		if (!cq->kern_cq.qbuf) {
+			ret = -ENOMEM;
+			goto err_out_xa;
+		}
+
+		cq->kern_cq.db_info = cq->kern_cq.qbuf + (depth << CQE_SHIFT);
+		spin_lock_init(&cq->kern_cq.lock);
+		/* use default cqdb. */
+		cq->kern_cq.db = dev->func_bar + ERDMA_BAR_CQDB_SPACE_OFFSET;
+	}
+
+	ret = create_cq_cmd(dev, cq);
+	if (ret)
+		goto err_free_res;
+
+	atomic_inc(&dev->num_cq);
+	return 0;
+
+err_free_res:
+	if (udata) {
+		erdma_unmap_user_dbrecords(ctx, &cq->user_cq.user_dbr_page);
+		put_mtt_entries(dev, &cq->user_cq.qbuf_mtt);
+	} else {
+		dma_free_coherent(&dev->pdev->dev, (depth << CQE_SHIFT) + ERDMA_EXTRA_BUFFER_SIZE,
+			cq->kern_cq.qbuf, cq->kern_cq.qbuf_dma_addr);
+	}
+
+err_out_xa:
+	xa_erase(&dev->cq_xa, cq->cqn);
+
+	return ret;
+}
+
+struct net_device *erdma_get_netdev(struct ib_device *device, u32 port_num)
+{
+	struct erdma_dev *dev = to_edev(device);
+
+	if (dev->netdev)
+		dev_hold(dev->netdev);
+
+	return dev->netdev;
+}
+
+void erdma_disassociate_ucontext(struct ib_ucontext *ibcontext)
+{
+}
+
+void erdma_port_event(struct erdma_dev *dev, enum ib_event_type reason)
+{
+	struct ib_event event;
+
+	event.device = &dev->ibdev;
+	event.element.port_num = 1;
+	event.event = reason;
+
+	ib_dispatch_event(&event);
+}
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH rdma-next 08/11] RDMA/erdma: Add connection management (CM) support
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
                   ` (6 preceding siblings ...)
  2021-12-21  2:48 ` [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21  2:48 ` [PATCH rdma-next 09/11] RDMA/erdma: Add the erdma module Cheng Xu
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

ERDMA's transport protocol is iWarp, so the driver must support the CM
interface. For the CM part, we take the same approach as SoftiWarp: a
kernel socket is used to set up the connection, and MPA negotiation is
then performed in the kernel. This code is therefore derived mainly
from SoftiWarp; on top of it, we add some features, such as a
non-blocking iw_connect implementation.
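
For reference, these CM entry points are wired into the device's
ib_device_ops by the module patch (09/11). A minimal sketch of that
hookup, assuming the standard iWarp hooks in include/rdma/ib_verbs.h
(the iw_get_qp helper name below is illustrative only):

	static const struct ib_device_ops erdma_device_ops = {
		.iw_accept = erdma_accept,
		.iw_add_ref = erdma_qp_get_ref,
		.iw_connect = erdma_connect,
		.iw_create_listen = erdma_create_listen,
		.iw_destroy_listen = erdma_destroy_listen,
		.iw_get_qp = erdma_get_ibqp,	/* illustrative name */
		.iw_reject = erdma_reject,
		.iw_rem_ref = erdma_qp_put_ref,
		/* ...other verbs ops elided... */
	};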

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_cm.c | 1585 ++++++++++++++++++++++++
 drivers/infiniband/hw/erdma/erdma_cm.h |  158 +++
 2 files changed, 1743 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.h

diff --git a/drivers/infiniband/hw/erdma/erdma_cm.c b/drivers/infiniband/hw/erdma/erdma_cm.c
new file mode 100644
index 000000000000..36d4f353d5c6
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_cm.c
@@ -0,0 +1,1585 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ *
+ * Authors: Bernard Metzler <bmt@zurich.ibm.com>
+ *          Fredy Neeser <nfd@zurich.ibm.com>
+ * Copyright (c) 2008-2016, IBM Corporation
+ */
+
+#include <linux/errno.h>
+#include <linux/inetdevice.h>
+#include <linux/net.h>
+#include <net/addrconf.h>
+#include <linux/tcp.h>
+#include <linux/types.h>
+#include <linux/workqueue.h>
+#include <net/sock.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+#include <rdma/ib_verbs.h>
+
+#include "erdma.h"
+#include "erdma_cm.h"
+#include "erdma_verbs.h"
+
+static bool mpa_crc_strict = true;
+module_param(mpa_crc_strict, bool, 0644);
+MODULE_PARM_DESC(mpa_crc_strict, "Reject CRC usage when CRC is not required");
+
+static bool mpa_crc_required;
+module_param(mpa_crc_required, bool, 0644);
+MODULE_PARM_DESC(mpa_crc_required, "MPA CRC required");
+
+static void erdma_cm_llp_state_change(struct sock *sk);
+static void erdma_cm_llp_data_ready(struct sock *sk);
+static void erdma_cm_llp_error_report(struct sock *sk);
+
+static void erdma_sk_assign_cm_upcalls(struct sock *sk)
+{
+	write_lock_bh(&sk->sk_callback_lock);
+	sk->sk_state_change = erdma_cm_llp_state_change;
+	sk->sk_data_ready = erdma_cm_llp_data_ready;
+	sk->sk_error_report = erdma_cm_llp_error_report;
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void erdma_sk_save_upcalls(struct sock *sk)
+{
+	struct erdma_cep *cep = sk_to_cep(sk);
+
+	WARN_ON(!cep);
+
+	write_lock_bh(&sk->sk_callback_lock);
+	cep->sk_state_change = sk->sk_state_change;
+	cep->sk_data_ready = sk->sk_data_ready;
+	cep->sk_error_report = sk->sk_error_report;
+	write_unlock_bh(&sk->sk_callback_lock);
+}
+
+static void erdma_sk_restore_upcalls(struct sock *sk, struct erdma_cep *cep)
+{
+	sk->sk_state_change = cep->sk_state_change;
+	sk->sk_data_ready = cep->sk_data_ready;
+	sk->sk_error_report = cep->sk_error_report;
+	sk->sk_user_data = NULL;
+}
+
+static void erdma_socket_disassoc(struct socket *s)
+{
+	struct sock *sk = s->sk;
+	struct erdma_cep *cep;
+
+	if (sk) {
+		write_lock_bh(&sk->sk_callback_lock);
+		cep = sk_to_cep(sk);
+		if (cep) {
+			erdma_sk_restore_upcalls(sk, cep);
+			erdma_cep_put(cep);
+		} else {
+			pr_warn("cannot restore sk callbacks: no ep\n");
+		}
+		write_unlock_bh(&sk->sk_callback_lock);
+	} else {
+		pr_warn("cannot restore sk callbacks: no sk\n");
+	}
+}
+
+static inline int kernel_peername(struct socket *s, struct sockaddr_in *addr)
+{
+	return s->ops->getname(s, (struct sockaddr *)addr, 1);
+}
+
+static inline int kernel_localname(struct socket *s, struct sockaddr_in *addr)
+{
+	return s->ops->getname(s, (struct sockaddr *)addr, 0);
+}
+
+static void erdma_cep_socket_assoc(struct erdma_cep *cep, struct socket *s)
+{
+	cep->llp.sock = s;
+	erdma_cep_get(cep);
+	s->sk->sk_user_data = cep;
+
+	erdma_sk_save_upcalls(s->sk);
+	erdma_sk_assign_cm_upcalls(s->sk);
+}
+
+static struct erdma_cep *erdma_cep_alloc(struct erdma_dev *dev)
+{
+	struct erdma_cep *cep = kzalloc(sizeof(*cep), GFP_KERNEL);
+
+	if (cep) {
+		unsigned long flags;
+
+		INIT_LIST_HEAD(&cep->listenq);
+		INIT_LIST_HEAD(&cep->devq);
+		INIT_LIST_HEAD(&cep->work_freelist);
+
+		kref_init(&cep->ref);
+		cep->state = ERDMA_EPSTATE_IDLE;
+		init_waitqueue_head(&cep->waitq);
+		spin_lock_init(&cep->lock);
+		cep->dev = dev;
+
+		spin_lock_irqsave(&dev->lock, flags);
+		list_add_tail(&cep->devq, &dev->cep_list);
+		spin_unlock_irqrestore(&dev->lock, flags);
+		atomic_inc(&dev->num_cep);
+	}
+	return cep;
+}
+
+static void erdma_cm_free_work(struct erdma_cep *cep)
+{
+	struct list_head	*w, *tmp;
+	struct erdma_cm_work	*work;
+
+	list_for_each_safe(w, tmp, &cep->work_freelist) {
+		work = list_entry(w, struct erdma_cm_work, list);
+		list_del(&work->list);
+		kfree(work);
+	}
+}
+
+static void erdma_cancel_mpatimer(struct erdma_cep *cep)
+{
+	spin_lock_bh(&cep->lock);
+	if (cep->mpa_timer) {
+		if (cancel_delayed_work(&cep->mpa_timer->work)) {
+			erdma_cep_put(cep);
+			kfree(cep->mpa_timer); /* not needed again */
+		}
+		cep->mpa_timer = NULL;
+	}
+	spin_unlock_bh(&cep->lock);
+}
+
+static void erdma_put_work(struct erdma_cm_work *work)
+{
+	INIT_LIST_HEAD(&work->list);
+	spin_lock_bh(&work->cep->lock);
+	list_add(&work->list, &work->cep->work_freelist);
+	spin_unlock_bh(&work->cep->lock);
+}
+
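+/*
+ * Long-running operations on a CEP are serialized by the in_use flag
+ * rather than by holding cep->lock: erdma_cep_set_inuse() sleeps on
+ * cep->waitq until the CEP becomes free, and erdma_cep_set_free()
+ * releases it and wakes up the next waiter.
+ */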
+static void erdma_cep_set_inuse(struct erdma_cep *cep)
+{
+	unsigned long flags;
+	int ret;
+retry:
+	spin_lock_irqsave(&cep->lock, flags);
+
+	if (cep->in_use) {
+		spin_unlock_irqrestore(&cep->lock, flags);
+		ret = wait_event_interruptible(cep->waitq, !cep->in_use);
+		if (signal_pending(current))
+			flush_signals(current);
+		goto retry;
+	} else {
+		cep->in_use = 1;
+		spin_unlock_irqrestore(&cep->lock, flags);
+	}
+}
+
+static void erdma_cep_set_free(struct erdma_cep *cep)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&cep->lock, flags);
+	cep->in_use = 0;
+	spin_unlock_irqrestore(&cep->lock, flags);
+
+	wake_up(&cep->waitq);
+}
+
+static void __erdma_cep_dealloc(struct kref *ref)
+{
+	struct erdma_cep *cep = container_of(ref, struct erdma_cep, ref);
+	struct erdma_dev *dev = cep->dev;
+	unsigned long flags;
+
+	WARN_ON(cep->listen_cep);
+
+	/* kfree(NULL) is safe */
+	kfree(cep->private_storage);
+	kfree(cep->mpa.pdata);
+	spin_lock_bh(&cep->lock);
+	if (!list_empty(&cep->work_freelist))
+		erdma_cm_free_work(cep);
+	spin_unlock_bh(&cep->lock);
+
+	spin_lock_irqsave(&dev->lock, flags);
+	list_del(&cep->devq);
+	spin_unlock_irqrestore(&dev->lock, flags);
+	atomic_dec(&dev->num_cep);
+	kfree(cep);
+}
+
+static struct erdma_cm_work *erdma_get_work(struct erdma_cep *cep)
+{
+	struct erdma_cm_work    *work = NULL;
+	unsigned long           flags;
+
+	spin_lock_irqsave(&cep->lock, flags);
+	if (!list_empty(&cep->work_freelist)) {
+		work = list_entry(cep->work_freelist.next, struct erdma_cm_work,
+				  list);
+		list_del_init(&work->list);
+	}
+	spin_unlock_irqrestore(&cep->lock, flags);
+	return work;
+}
+
+static int erdma_cm_alloc_work(struct erdma_cep *cep, int num)
+{
+	struct erdma_cm_work        *work;
+
+	if (!list_empty(&cep->work_freelist)) {
+		pr_err("ERROR: Not init work_freelist.\n");
+		return -ENOMEM;
+	}
+
+	while (num--) {
+		work = kmalloc(sizeof(*work), GFP_KERNEL);
+		if (!work) {
+			if (!(list_empty(&cep->work_freelist)))
+				erdma_cm_free_work(cep);
+			return -ENOMEM;
+		}
+		work->cep = cep;
+		INIT_LIST_HEAD(&work->list);
+		list_add(&work->list, &cep->work_freelist);
+	}
+	return 0;
+}
+
+/*
+ * erdma_cm_upcall()
+ *
+ * Upcall to IWCM to inform about async connection events
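+ *
+ * IW_CM_EVENT_CONNECT_REQUEST is delivered on the listening cm_id;
+ * all other events are delivered on the connection's own cm_id.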
+ */
+static int erdma_cm_upcall(struct erdma_cep *cep, enum iw_cm_event_type reason,
+			   int status)
+{
+	struct iw_cm_event event;
+	struct iw_cm_id *cm_id;
+
+	memset(&event, 0, sizeof(event));
+	event.status = status;
+	event.event = reason;
+
+	if (reason == IW_CM_EVENT_CONNECT_REQUEST ||
+	    reason == IW_CM_EVENT_CONNECT_REPLY) {
+		u16 pd_len = be16_to_cpu(cep->mpa.hdr.params.pd_len);
+
+		if (pd_len && cep->mpa.pdata) {
+			event.private_data_len = pd_len;
+			event.private_data = cep->mpa.pdata;
+		}
+
+		to_sockaddr_in(event.local_addr) = cep->llp.laddr;
+		to_sockaddr_in(event.remote_addr) = cep->llp.raddr;
+	}
+	if (reason == IW_CM_EVENT_CONNECT_REQUEST) {
+		event.ird = cep->dev->attrs.max_ird;
+		event.ord = cep->dev->attrs.max_ord;
+		event.provider_data = cep;
+		cm_id = cep->listen_cep->cm_id;
+	} else {
+		cm_id = cep->cm_id;
+	}
+
+	if (!cep->is_connecting && reason == IW_CM_EVENT_CONNECT_REPLY)
+		return 0;
+
+	cep->is_connecting = false;
+
+	return cm_id->event_handler(cm_id, &event);
+}
+
+/*
+ * erdma_qp_cm_drop()
+ *
+ * Drops established LLP connection if present and not already
+ * scheduled for dropping. Called from user context, SQ workqueue
+ * or receive IRQ. Caller signals if socket can be immediately
+ * closed (basically, if not in IRQ).
+ */
+void erdma_qp_cm_drop(struct erdma_qp *qp, int schedule)
+{
+	struct erdma_cep *cep = qp->cep;
+
+	if (!qp->cep)
+		return;
+
+	if (schedule) {
+		erdma_cm_queue_work(cep, ERDMA_CM_WORK_CLOSE_LLP);
+	} else {
+		erdma_cep_set_inuse(cep);
+
+		if (cep->state == ERDMA_EPSTATE_CLOSED)
+			goto out;
+
+		if (cep->cm_id) {
+			switch (cep->state) {
+			case ERDMA_EPSTATE_AWAIT_MPAREP:
+				erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+						-EINVAL);
+				break;
+			case ERDMA_EPSTATE_RDMA_MODE:
+				erdma_cm_upcall(cep, IW_CM_EVENT_CLOSE, 0);
+				break;
+			case ERDMA_EPSTATE_IDLE:
+			case ERDMA_EPSTATE_LISTENING:
+			case ERDMA_EPSTATE_CONNECTING:
+			case ERDMA_EPSTATE_AWAIT_MPAREQ:
+			case ERDMA_EPSTATE_RECVD_MPAREQ:
+			case ERDMA_EPSTATE_CLOSED:
+			default:
+				break;
+			}
+			cep->cm_id->rem_ref(cep->cm_id);
+			cep->cm_id = NULL;
+			erdma_cep_put(cep);
+		}
+		cep->state = ERDMA_EPSTATE_CLOSED;
+
+		if (cep->llp.sock) {
+			erdma_socket_disassoc(cep->llp.sock);
+			sock_release(cep->llp.sock);
+			cep->llp.sock = NULL;
+		}
+		if (cep->qp) {
+			WARN_ON(qp != cep->qp);
+			cep->qp = NULL;
+			erdma_qp_put(qp);
+		}
+out:
+		erdma_cep_set_free(cep);
+	}
+}
+
+void erdma_cep_put(struct erdma_cep *cep)
+{
+	WARN_ON(kref_read(&cep->ref) < 1);
+	kref_put(&cep->ref, __erdma_cep_dealloc);
+}
+
+void erdma_cep_get(struct erdma_cep *cep)
+{
+	kref_get(&cep->ref);
+}
+
+static inline int ksock_recv(struct socket *sock, char *buf, size_t size,
+			     int flags)
+{
+	struct kvec iov = {buf, size};
+	struct msghdr msg = {.msg_name = NULL, .msg_flags = flags};
+
+	return kernel_recvmsg(sock, &msg, &iov, 1, size, flags);
+}
+
+static inline void __mpa_rr_set_cc(u16 *bits, u16 cc)
+{
+	*bits = (*bits & ~MPA_RR_DESIRED_CC) | (cc & MPA_RR_DESIRED_CC);
+}
+
+static inline u8 __mpa_rr_cc(u16 mpa_rr_bits)
+{
+	u16 rev = (mpa_rr_bits & MPA_RR_DESIRED_CC);
+
+	return (u8)rev;
+}
+
+static inline void __mpa_rr_set_revision(u16 *bits, u8 rev)
+{
+	*bits = (*bits & ~MPA_RR_MASK_REVISION) |
+		(cpu_to_be16(rev) & MPA_RR_MASK_REVISION);
+}
+
+static inline u8 __mpa_rr_revision(u16 mpa_rr_bits)
+{
+	u16 rev = mpa_rr_bits & MPA_RR_MASK_REVISION;
+
+	return (u8)be16_to_cpu(rev);
+}
+
+/*
+ * Expects params->pd_len in host byte order
+ */
+static int erdma_send_mpareqrep(struct erdma_cep *cep, const void *pdata,
+			      u8 pd_len)
+{
+	struct socket	*s = cep->llp.sock;
+	struct mpa_rr	*rr = &cep->mpa.hdr;
+	struct kvec	iov[2];
+	struct msghdr	msg;
+	int		ret;
+
+	memset(&msg, 0, sizeof(msg));
+
+	rr->params.pd_len = cpu_to_be16(pd_len);
+
+	iov[0].iov_base = rr;
+	iov[0].iov_len = sizeof(*rr);
+
+	if (pd_len) {
+		iov[1].iov_base = (char *)pdata;
+		iov[1].iov_len = pd_len;
+
+		ret = kernel_sendmsg(s, &msg, iov, 2, pd_len + sizeof(*rr));
+	} else {
+		ret = kernel_sendmsg(s, &msg, iov, 1, sizeof(*rr));
+	}
+
+	return ret < 0 ? ret : 0;
+}
+
+/*
+ * Receive MPA Request/Reply header.
+ *
+ * Returns 0 if the complete MPA Request/Reply header, including
+ * any private data, was received. Returns -EAGAIN if the header
+ * was only partially received, or a negative error code otherwise.
+ *
+ * Context: May be called in process context only
+ */
+static int erdma_recv_mpa_rr(struct erdma_cep *cep)
+{
+	struct mpa_rr	*hdr = &cep->mpa.hdr;
+	struct socket	*s = cep->llp.sock;
+	u16		pd_len;
+	int		rcvd, to_rcv;
+
+	if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr)) {
+		rcvd = ksock_recv(s, (char *)hdr + cep->mpa.bytes_rcvd,
+				  sizeof(struct mpa_rr) -
+				  cep->mpa.bytes_rcvd, MSG_DONTWAIT);
+		/* we use DONTWAIT mode, so EAGAIN may appear. */
+		if (rcvd == -EAGAIN)
+			return -EAGAIN;
+
+		if (rcvd <= 0)
+			return -ECONNABORTED;
+
+		cep->mpa.bytes_rcvd += rcvd;
+
+		if (cep->mpa.bytes_rcvd < sizeof(struct mpa_rr))
+			return -EAGAIN;
+
+		if (be16_to_cpu(hdr->params.pd_len) > MPA_MAX_PRIVDATA)
+			return -EPROTO;
+	}
+	pd_len = be16_to_cpu(hdr->params.pd_len);
+
+	/*
+	 * At least the MPA Request/Reply header (frame not including
+	 * private data) has been received.
+	 * Receive (or continue receiving) any private data.
+	 */
+	to_rcv = pd_len - (cep->mpa.bytes_rcvd - sizeof(struct mpa_rr));
+
+	if (!to_rcv) {
+		/*
+		 * We must have hdr->params.pd_len == 0 and thus received a
+		 * complete MPA Request/Reply frame.
+		 * Check against peer protocol violation.
+		 */
+		u32 word;
+
+		rcvd = ksock_recv(s, (char *)&word, sizeof(word), MSG_DONTWAIT);
+		if (rcvd == -EAGAIN)
+			return 0;
+
+		if (rcvd == 0)
+			return -EPIPE;
+
+		if (rcvd < 0)
+			return rcvd;
+		return -EPROTO;
+	}
+
+	/*
+	 * At this point, we must have hdr->params.pd_len != 0.
+	 * A private data buffer gets allocated if hdr->params.pd_len != 0.
+	 */
+	if (!cep->mpa.pdata) {
+		cep->mpa.pdata = kmalloc(pd_len + 4, GFP_KERNEL);
+		if (!cep->mpa.pdata)
+			return -ENOMEM;
+	}
+	rcvd = ksock_recv(s, cep->mpa.pdata + cep->mpa.bytes_rcvd
+			  - sizeof(struct mpa_rr), to_rcv + 4, MSG_DONTWAIT);
+
+	if (rcvd < 0)
+		return rcvd;
+
+	if (rcvd > to_rcv)
+		return -EPROTO;
+
+	cep->mpa.bytes_rcvd += rcvd;
+
+	if (to_rcv == rcvd)
+		return 0;
+
+	return -EAGAIN;
+}
+
+/*
+ * erdma_proc_mpareq()
+ *
+ * Read MPA Request from socket and signal new connection to IWCM
+ * if success. Caller must hold lock on corresponding listening CEP.
+ */
+static int erdma_proc_mpareq(struct erdma_cep *cep)
+{
+	struct mpa_rr      *req;
+	int                ret;
+
+	ret = erdma_recv_mpa_rr(cep);
+	if (ret)
+		goto out;
+
+	req = &cep->mpa.hdr;
+
+	if (__mpa_rr_revision(req->params.bits) > MPA_REVISION_1) {
+		/* allow for 0 and 1 only */
+		ret = -EPROTO;
+		goto out;
+	}
+
+	if (memcmp(req->key, MPA_KEY_REQ, 12)) {
+		ret = -EPROTO;
+		goto out;
+	}
+
+	cep->mpa.remote_qpn = *(u32 *)&req->key[12];
+	/*
+	 * Prepare for sending MPA reply
+	 */
+	memcpy(req->key, MPA_KEY_REP, 12);
+
+	if (req->params.bits & MPA_RR_FLAG_MARKERS ||
+	    (req->params.bits & MPA_RR_FLAG_CRC &&
+	    !mpa_crc_required && mpa_crc_strict)) {
+		req->params.bits &= ~MPA_RR_FLAG_MARKERS;
+		req->params.bits |= MPA_RR_FLAG_REJECT; /* reject */
+
+		if (!mpa_crc_required && mpa_crc_strict)
+			req->params.bits &= ~MPA_RR_FLAG_CRC;
+
+		kfree(cep->mpa.pdata);
+		cep->mpa.pdata = NULL;
+
+		(void)erdma_send_mpareqrep(cep, NULL, 0);
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	/*
+	 * Enable CRC if requested by module initialization
+	 */
+	if (!(req->params.bits & MPA_RR_FLAG_CRC) && mpa_crc_required)
+		req->params.bits |= MPA_RR_FLAG_CRC;
+
+	cep->state = ERDMA_EPSTATE_RECVD_MPAREQ;
+
+	/* Keep reference until IWCM accepts/rejects */
+	erdma_cep_get(cep);
+	ret = erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REQUEST, 0);
+	if (ret)
+		erdma_cep_put(cep);
+out:
+	return ret;
+}
+
+static int erdma_proc_mpareply(struct erdma_cep *cep)
+{
+	struct erdma_qp_attrs qp_attrs;
+	struct erdma_qp *qp = cep->qp;
+	struct mpa_rr *rep;
+	int ret;
+
+	ret = erdma_recv_mpa_rr(cep);
+	if (ret != -EAGAIN)
+		erdma_cancel_mpatimer(cep);
+	if (ret)
+		goto out_err;
+
+	rep = &cep->mpa.hdr;
+
+	if (__mpa_rr_revision(rep->params.bits) > MPA_REVISION_1) {
+		/* allow for 0 and 1 only */
+		ret = -EPROTO;
+		goto out_err;
+	}
+	if (memcmp(rep->key, MPA_KEY_REP, 12)) {
+		ret = -EPROTO;
+		goto out_err;
+	}
+
+	cep->mpa.remote_qpn = *(u32 *)&rep->key[12];
+
+	if (rep->params.bits & MPA_RR_FLAG_REJECT) {
+		(void)erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY, -ECONNRESET);
+		ret = -ECONNRESET;
+		goto out;
+	}
+
+	if ((rep->params.bits & MPA_RR_FLAG_MARKERS) ||
+	    (mpa_crc_required && !(rep->params.bits & MPA_RR_FLAG_CRC)) ||
+	    (mpa_crc_strict && !mpa_crc_required &&
+	     (rep->params.bits & MPA_RR_FLAG_CRC))) {
+		(void)erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+				      -ECONNREFUSED);
+		ret = -EINVAL;
+		goto out;
+	}
+	memset(&qp_attrs, 0, sizeof(qp_attrs));
+	qp_attrs.irq_size = cep->ird;
+	qp_attrs.orq_size = cep->ord;
+	qp_attrs.llp_stream_handle = cep->llp.sock;
+	qp_attrs.state = ERDMA_QP_STATE_RTS;
+
+	down_write(&qp->state_lock);
+	if (qp->attrs.state > ERDMA_QP_STATE_RTS) {
+		ret = -EINVAL;
+		up_write(&qp->state_lock);
+		goto out_err;
+	}
+
+	qp->qp_type = ERDMA_QP_TYPE_CLIENT;
+	qp->cc_method = __mpa_rr_cc(rep->params.bits) == qp->dev->cc_method ?
+			qp->dev->cc_method : COMPROMISE_CC;
+	ret = erdma_modify_qp_internal(qp, &qp_attrs, ERDMA_QP_ATTR_STATE |
+						      ERDMA_QP_ATTR_LLP_HANDLE |
+						      ERDMA_QP_ATTR_MPA);
+
+	up_write(&qp->state_lock);
+
+	if (!ret) {
+		ret = erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY, 0);
+		if (!ret)
+			cep->state = ERDMA_EPSTATE_RDMA_MODE;
+
+		goto out;
+	}
+
+out_err:
+	(void)erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY, -EINVAL);
+out:
+	return ret;
+}
+
+/*
+ * erdma_accept_newconn - accept an incoming pending connection
+ *
+ */
+static void erdma_accept_newconn(struct erdma_cep *cep)
+{
+	struct socket    *s       = cep->llp.sock;
+	struct socket    *new_s   = NULL;
+	struct erdma_cep *new_cep = NULL;
+	int              ret = 0;
+
+	if (cep->state != ERDMA_EPSTATE_LISTENING)
+		goto error;
+
+	new_cep = erdma_cep_alloc(cep->dev);
+	if (!new_cep)
+		goto error;
+
+	if (erdma_cm_alloc_work(new_cep, 6) != 0)
+		goto error;
+
+	/*
+	 * Copy saved socket callbacks from listening CEP
+	 * and assign new socket with new CEP
+	 */
+	new_cep->sk_state_change = cep->sk_state_change;
+	new_cep->sk_data_ready   = cep->sk_data_ready;
+	new_cep->sk_error_report = cep->sk_error_report;
+
+	ret = kernel_accept(s, &new_s, O_NONBLOCK);
+	if (ret != 0)
+		goto error;
+
+	new_cep->llp.sock = new_s;
+	new_s->sk->sk_user_data = new_cep;
+	erdma_cep_get(new_cep);
+
+	tcp_sock_set_nodelay(new_s->sk);
+
+	ret = kernel_peername(new_s, &new_cep->llp.raddr);
+	if (ret < 0)
+		goto error;
+
+	ret = kernel_localname(new_s, &new_cep->llp.laddr);
+	if (ret < 0)
+		goto error;
+
+	new_cep->state = ERDMA_EPSTATE_AWAIT_MPAREQ;
+
+	ret = erdma_cm_queue_work(new_cep, ERDMA_CM_WORK_MPATIMEOUT);
+	if (ret)
+		goto error;
+	/*
+	 * See erdma_proc_mpareq() etc. for the use of new_cep->listen_cep.
+	 */
+	new_cep->listen_cep = cep;
+	erdma_cep_get(cep);
+
+	if (atomic_read(&new_s->sk->sk_rmem_alloc)) {
+		/*
+		 * MPA REQ already queued
+		 */
+		erdma_cep_set_inuse(new_cep);
+		ret = erdma_proc_mpareq(new_cep);
+		erdma_cep_set_free(new_cep);
+
+		if (ret != -EAGAIN) {
+			erdma_cep_put(cep);
+			new_cep->listen_cep = NULL;
+			if (ret)
+				goto error;
+		}
+	}
+	return;
+
+error:
+	if (new_cep) {
+		new_cep->state = ERDMA_EPSTATE_CLOSED;
+		/* clear the socket backpointer before the final put,
+		 * which may free new_cep.
+		 */
+		new_cep->llp.sock = NULL;
+		erdma_cancel_mpatimer(new_cep);
+
+		erdma_cep_put(new_cep);
+	}
+
+	if (new_s) {
+		erdma_socket_disassoc(new_s);
+		sock_release(new_s);
+	}
+}
+
+static int erdma_newconn_connected(struct erdma_cep *cep)
+{
+	struct socket    *s       = cep->llp.sock;
+	int              ret;
+	int              qpn;
+
+	ret = kernel_peername(s, &cep->llp.raddr);
+	if (ret < 0)
+		goto error;
+
+	ret = kernel_localname(s, &cep->llp.laddr);
+	if (ret < 0)
+		goto error;
+
+	cep->mpa.hdr.params.bits = 0;
+	__mpa_rr_set_revision(&cep->mpa.hdr.params.bits, MPA_REVISION_1);
+	__mpa_rr_set_cc(&cep->mpa.hdr.params.bits, cep->dev->cc_method);
+
+	if (mpa_crc_required)
+		cep->mpa.hdr.params.bits |= MPA_RR_FLAG_CRC;
+
+	qpn = QP_ID(cep->qp);
+	memcpy(cep->mpa.hdr.key, MPA_KEY_REQ, 12);
+	memcpy(&cep->mpa.hdr.key[12], &qpn, 4);
+
+	ret = erdma_send_mpareqrep(cep, cep->private_storage, cep->pd_len);
+
+	cep->mpa.hdr.params.pd_len = 0;
+
+	if (ret >= 0) {
+		cep->state = ERDMA_EPSTATE_AWAIT_MPAREP;
+		ret = erdma_cm_queue_work(cep, ERDMA_CM_WORK_MPATIMEOUT);
+		if (!ret)
+			return 0;
+	}
+
+error:
+	return ret;
+}
+
+static void erdma_cm_work_handler(struct work_struct *w)
+{
+	struct erdma_cm_work *work;
+	struct erdma_cep     *cep;
+	int                  release_cep = 0, ret = 0;
+
+	work = container_of(w, struct erdma_cm_work, work.work);
+	cep = work->cep;
+
+	erdma_cep_set_inuse(cep);
+
+	switch (work->type) {
+	case ERDMA_CM_WORK_CONNECTED:
+		erdma_cancel_mpatimer(cep);
+		if (cep->state == ERDMA_EPSTATE_CONNECTING) {
+			ret = erdma_newconn_connected(cep);
+			if (ret) {
+				erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY, -EIO);
+				release_cep = 1;
+			}
+		}
+
+		break;
+	case ERDMA_CM_WORK_CONNECTTIMEOUT:
+		if (cep->state == ERDMA_EPSTATE_CONNECTING) {
+			cep->mpa_timer = NULL;
+			erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+					-ETIMEDOUT);
+			release_cep = 1;
+		}
+		break;
+	case ERDMA_CM_WORK_ACCEPT:
+		erdma_accept_newconn(cep);
+		break;
+
+	case ERDMA_CM_WORK_READ_MPAHDR:
+		switch (cep->state) {
+		case ERDMA_EPSTATE_AWAIT_MPAREQ:
+			if (cep->listen_cep) {
+				erdma_cep_set_inuse(cep->listen_cep);
+
+				if (cep->listen_cep->state ==
+				    ERDMA_EPSTATE_LISTENING)
+					ret = erdma_proc_mpareq(cep);
+				else
+					ret = -EFAULT;
+
+				erdma_cep_set_free(cep->listen_cep);
+
+				if (ret != -EAGAIN) {
+					erdma_cep_put(cep->listen_cep);
+					cep->listen_cep = NULL;
+					if (ret)
+						erdma_cep_put(cep);
+				}
+			}
+			break;
+
+		case ERDMA_EPSTATE_AWAIT_MPAREP:
+			ret = erdma_proc_mpareply(cep);
+			break;
+		default:
+			break;
+		}
+		if (ret && ret != -EAGAIN)
+			release_cep = 1;
+		break;
+	case ERDMA_CM_WORK_CLOSE_LLP:
+		if (cep->cm_id)
+			erdma_cm_upcall(cep, IW_CM_EVENT_CLOSE, 0);
+		release_cep = 1;
+		break;
+	case ERDMA_CM_WORK_PEER_CLOSE:
+		if (cep->cm_id) {
+			switch (cep->state) {
+			case ERDMA_EPSTATE_CONNECTING:
+			case ERDMA_EPSTATE_AWAIT_MPAREP:
+				/*
+				 * MPA reply not received, but connection drop
+				 */
+				erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY,
+					      -ECONNRESET);
+				break;
+			case ERDMA_EPSTATE_RDMA_MODE:
+				/*
+				 * NOTE: IW_CM_EVENT_DISCONNECT is given just
+				 *       to transition IWCM into CLOSING.
+				 *       FIXME: is that needed?
+				 */
+				erdma_cm_upcall(cep, IW_CM_EVENT_DISCONNECT, 0);
+				erdma_cm_upcall(cep, IW_CM_EVENT_CLOSE, 0);
+				break;
+			default:
+				break;
+			}
+		} else {
+			switch (cep->state) {
+			case ERDMA_EPSTATE_RECVD_MPAREQ:
+				break;
+			case ERDMA_EPSTATE_AWAIT_MPAREQ:
+				/*
+				 * Socket close before MPA request received.
+				 */
+				if (cep->listen_cep) {
+					erdma_cep_put(cep->listen_cep);
+					cep->listen_cep = NULL;
+				}
+				break;
+			default:
+				break;
+			}
+		}
+		release_cep = 1;
+		break;
+	case ERDMA_CM_WORK_MPATIMEOUT:
+		cep->mpa_timer = NULL;
+		if (cep->state == ERDMA_EPSTATE_AWAIT_MPAREP) {
+			/*
+			 * MPA request timed out:
+			 * Hide any partially received private data and signal
+			 * timeout
+			 */
+			cep->mpa.hdr.params.pd_len = 0;
+
+			if (cep->cm_id)
+				erdma_cm_upcall(cep, IW_CM_EVENT_CONNECT_REPLY, -ETIMEDOUT);
+			release_cep = 1;
+		} else if (cep->state == ERDMA_EPSTATE_AWAIT_MPAREQ) {
+			/*
+			 * No MPA request received after peer TCP stream setup.
+			 */
+			if (cep->listen_cep) {
+				erdma_cep_put(cep->listen_cep);
+				cep->listen_cep = NULL;
+			}
+
+			erdma_cep_put(cep);
+			release_cep = 1;
+		}
+		break;
+	default:
+		pr_err("ERROR: work task type:%u.\n", work->type);
+		break;
+	}
+
+	if (release_cep) {
+		erdma_cancel_mpatimer(cep);
+		cep->state = ERDMA_EPSTATE_CLOSED;
+		if (cep->qp) {
+			struct erdma_qp *qp = cep->qp;
+
+			/*
+			 * Serialize a potential race with application
+			 * closing the QP and calling erdma_qp_cm_drop()
+			 */
+			erdma_qp_get(qp);
+			erdma_cep_set_free(cep);
+
+			erdma_qp_llp_close(qp);
+			erdma_qp_put(qp);
+
+			erdma_cep_set_inuse(cep);
+			cep->qp = NULL;
+			erdma_qp_put(qp);
+		}
+		if (cep->llp.sock) {
+			erdma_socket_disassoc(cep->llp.sock);
+			sock_release(cep->llp.sock);
+			cep->llp.sock = NULL;
+		}
+
+		if (cep->cm_id) {
+			cep->cm_id->rem_ref(cep->cm_id);
+			cep->cm_id = NULL;
+			if (cep->state != ERDMA_EPSTATE_LISTENING)
+				erdma_cep_put(cep);
+		}
+	}
+	erdma_cep_set_free(cep);
+	erdma_put_work(work);
+	erdma_cep_put(cep);
+}
+
+static struct workqueue_struct *erdma_cm_wq;
+
+int erdma_cm_queue_work(struct erdma_cep *cep, enum erdma_work_type type)
+{
+	struct erdma_cm_work *work = erdma_get_work(cep);
+	unsigned long delay = 0;
+
+	if (!work)
+		return -ENOMEM;
+
+	work->type = type;
+	work->cep = cep;
+
+	erdma_cep_get(cep);
+
+	INIT_DELAYED_WORK(&work->work, erdma_cm_work_handler);
+
+	if (type == ERDMA_CM_WORK_MPATIMEOUT) {
+		cep->mpa_timer = work;
+
+		if (cep->state == ERDMA_EPSTATE_AWAIT_MPAREP)
+			delay = MPAREQ_TIMEOUT;
+		else
+			delay = MPAREP_TIMEOUT;
+	} else if (type == ERDMA_CM_WORK_CONNECTTIMEOUT) {
+		cep->mpa_timer = work;
+
+		delay = CONNECT_TIMEOUT;
+	}
+
+	queue_delayed_work(erdma_cm_wq, &work->work, delay);
+
+	return 0;
+}
+
+static void erdma_cm_llp_data_ready(struct sock *sk)
+{
+	struct erdma_cep *cep;
+
+	read_lock(&sk->sk_callback_lock);
+
+	cep = sk_to_cep(sk);
+	if (!cep)
+		goto out;
+
+	switch (cep->state) {
+	case ERDMA_EPSTATE_RDMA_MODE:
+	case ERDMA_EPSTATE_LISTENING:
+		break;
+	case ERDMA_EPSTATE_AWAIT_MPAREQ:
+	case ERDMA_EPSTATE_AWAIT_MPAREP:
+		erdma_cm_queue_work(cep, ERDMA_CM_WORK_READ_MPAHDR);
+		break;
+	default:
+		break;
+	}
+out:
+	read_unlock(&sk->sk_callback_lock);
+}
+
+static void erdma_cm_llp_error_report(struct sock *sk)
+{
+	struct erdma_cep *cep = sk_to_cep(sk);
+
+	if (cep) {
+		cep->sk_error = sk->sk_err;
+		cep->sk_error_report(sk);
+	}
+}
+
+static void erdma_cm_llp_state_change(struct sock *sk)
+{
+	struct erdma_cep *cep;
+	struct socket *s;
+	void (*orig_state_change)(struct sock *sk);
+
+	read_lock(&sk->sk_callback_lock);
+
+	cep = sk_to_cep(sk);
+	if (!cep) {
+		read_unlock(&sk->sk_callback_lock);
+		return;
+	}
+	orig_state_change = cep->sk_state_change;
+
+	s = sk->sk_socket;
+
+	switch (sk->sk_state) {
+	case TCP_ESTABLISHED:
+		if (cep->state == ERDMA_EPSTATE_CONNECTING)
+			erdma_cm_queue_work(cep, ERDMA_CM_WORK_CONNECTED);
+		else
+			erdma_cm_queue_work(cep, ERDMA_CM_WORK_ACCEPT);
+		break;
+	case TCP_CLOSE:
+	case TCP_CLOSE_WAIT:
+		if (cep->state != ERDMA_EPSTATE_LISTENING)
+			erdma_cm_queue_work(cep, ERDMA_CM_WORK_PEER_CLOSE);
+		break;
+	default:
+		break;
+	}
+	read_unlock(&sk->sk_callback_lock);
+	orig_state_change(sk);
+}
+
+static int kernel_bindconnect(struct socket *s,
+			      struct sockaddr *laddr, int laddrlen,
+			      struct sockaddr *raddr, int raddrlen, int flags)
+{
+	int err;
+	struct sock *sk = s->sk;
+
+	/*
+	 * Make address available again asap.
+	 */
+	sock_set_reuseaddr(s->sk);
+
+	err = s->ops->bind(s, laddr, laddrlen);
+	if (err < 0) {
+		pr_info("binding port %u failed, trying an ephemeral port\n",
+			ntohs(((struct sockaddr_in *)laddr)->sin_port));
+		/* Fall back to an ephemeral port instead of the RDMA port. */
+		((struct sockaddr_in *)laddr)->sin_port = 0;
+		err = s->ops->bind(s, laddr, laddrlen);
+		if (err < 0)
+			goto done;
+		pr_info("allocated source port %u.\n", inet_sk(sk)->inet_num);
+	}
+	}
+
+	err = s->ops->connect(s, raddr, raddrlen, flags);
+	if (err < 0)
+		goto done;
+
+	err = s->ops->getname(s, laddr, 0);
+done:
+	return err;
+}
+
+int erdma_connect(struct iw_cm_id *id, struct iw_cm_conn_param *params)
+{
+	struct erdma_dev *dev = to_edev(id->device);
+	struct erdma_qp *qp;
+	struct erdma_cep *cep = NULL;
+	struct socket *s = NULL;
+	struct sockaddr *laddr, *raddr;
+	u16 pd_len = params->private_data_len;
+	int ret;
+
+	if (pd_len > MPA_MAX_PRIVDATA)
+		return -EINVAL;
+
+	qp = find_qp_by_qpn(dev, params->qpn);
+	if (!qp)
+		return -ENOENT;
+
+	laddr = (struct sockaddr *)&id->m_local_addr;
+	raddr = (struct sockaddr *)&id->m_remote_addr;
+
+	qp->attrs.sip = ntohl(to_sockaddr_in(id->local_addr).sin_addr.s_addr);
+	qp->attrs.origin_sport = ntohs(to_sockaddr_in(id->local_addr).sin_port);
+	qp->attrs.dip = ntohl(to_sockaddr_in(id->remote_addr).sin_addr.s_addr);
+	qp->attrs.dport = ntohs(to_sockaddr_in(id->m_remote_addr).sin_port);
+
+	ret = sock_create(AF_INET, SOCK_STREAM, IPPROTO_TCP, &s);
+	if (ret < 0)
+		goto error_put_qp;
+
+	cep = erdma_cep_alloc(dev);
+	if (!cep) {
+		ret = -ENOMEM;
+		goto error_release_sock;
+	}
+
+	erdma_cep_set_inuse(cep);
+
+	/* Associate QP with CEP */
+	erdma_cep_get(cep);
+	qp->cep = cep;
+	erdma_qp_get(qp);
+	cep->qp = qp;
+
+	/* Associate cm_id with CEP */
+	id->add_ref(id);
+	cep->cm_id = id;
+
+	ret = erdma_cm_alloc_work(cep, 6);
+	if (ret != 0) {
+		ret = -ENOMEM;
+		goto error_release_cep;
+	}
+
+	cep->ird = params->ird;
+	cep->ord = params->ord;
+	cep->state = ERDMA_EPSTATE_CONNECTING;
+	cep->is_connecting = true;
+
+	erdma_cep_socket_assoc(cep, s);
+
+	cep->pd_len = pd_len;
+	cep->private_storage = kmalloc(pd_len, GFP_KERNEL);
+	if (!cep->private_storage) {
+		ret = -ENOMEM;
+		goto error_disassoc;
+	}
+
+	memcpy(cep->private_storage, params->private_data, params->private_data_len);
+
+	ret = kernel_bindconnect(s, laddr, sizeof(*laddr), raddr,
+				 sizeof(*raddr), O_NONBLOCK);
+	if (ret != -EINPROGRESS && ret != 0) {
+		goto error_disassoc;
+	} else if (ret == 0) {
+		ret = erdma_cm_queue_work(cep, ERDMA_CM_WORK_CONNECTED);
+		if (ret)
+			goto error_disassoc;
+	} else {
+		ret = erdma_cm_queue_work(cep, ERDMA_CM_WORK_CONNECTTIMEOUT);
+		if (ret)
+			goto error_disassoc;
+	}
+
+	erdma_cep_set_free(cep);
+	return 0;
+
+error_disassoc:
+	kfree(cep->private_storage);
+	cep->private_storage = NULL;
+	cep->pd_len = 0;
+
+	erdma_socket_disassoc(s);
+
+error_release_cep:
+	/* disassoc with cm_id */
+	cep->cm_id = NULL;
+	id->rem_ref(id);
+
+	/* disassoc with qp */
+	qp->cep = NULL;
+	erdma_cep_put(cep);
+	cep->qp = NULL;
+	erdma_qp_put(qp);
+
+	cep->state = ERDMA_EPSTATE_CLOSED;
+
+	erdma_cep_set_free(cep);
+
+	/* release the cep. */
+	erdma_cep_put(cep);
+
+error_release_sock:
+	if (s)
+		sock_release(s);
+error_put_qp:
+	erdma_qp_put(qp);
+
+	return ret;
+}
+
+int erdma_accept(struct iw_cm_id *id, struct iw_cm_conn_param *params)
+{
+	struct erdma_dev *dev = to_edev(id->device);
+	struct erdma_cep *cep = (struct erdma_cep *)id->provider_data;
+	struct erdma_qp *qp;
+	struct erdma_qp_attrs qp_attrs;
+	int ret;
+
+	erdma_cep_set_inuse(cep);
+	erdma_cep_put(cep);
+
+	/* Free lingering inbound private data */
+	if (cep->mpa.hdr.params.pd_len) {
+		cep->mpa.hdr.params.pd_len = 0;
+		kfree(cep->mpa.pdata);
+		cep->mpa.pdata = NULL;
+	}
+	erdma_cancel_mpatimer(cep);
+
+	if (cep->state != ERDMA_EPSTATE_RECVD_MPAREQ) {
+		if (cep->state == ERDMA_EPSTATE_CLOSED) {
+			erdma_cep_set_free(cep);
+			erdma_cep_put(cep);
+			return -ECONNRESET;
+		}
+		erdma_cep_set_free(cep);
+		return -EBADFD;
+	}
+
+	qp = find_qp_by_qpn(dev, params->qpn);
+	if (!qp) {
+		erdma_cep_set_free(cep);
+		return -ENOENT;
+	}
+
+	down_write(&qp->state_lock);
+	if (qp->attrs.state > ERDMA_QP_STATE_RTS) {
+		ret = -EINVAL;
+		up_write(&qp->state_lock);
+		goto error;
+	}
+
+	if (params->ord > dev->attrs.max_ord ||
+	    params->ird > dev->attrs.max_ird) {
+		ret = -EINVAL;
+		up_write(&qp->state_lock);
+		goto error;
+	}
+
+	if (params->private_data_len > MPA_MAX_PRIVDATA) {
+		ret = -EINVAL;
+		up_write(&qp->state_lock);
+		goto error;
+	}
+
+	cep->cm_id = id;
+	id->add_ref(id);
+
+	memset(&qp_attrs, 0, sizeof(qp_attrs));
+	qp_attrs.orq_size = params->ord;
+	qp_attrs.irq_size = params->ird;
+	qp_attrs.llp_stream_handle = cep->llp.sock;
+
+	qp_attrs.state = ERDMA_QP_STATE_RTS;
+
+	qp->attrs.sip = ntohl(cep->llp.laddr.sin_addr.s_addr);
+	qp->attrs.origin_sport = ntohs(cep->llp.laddr.sin_port);
+	qp->attrs.dip = ntohl(cep->llp.raddr.sin_addr.s_addr);
+	qp->attrs.dport = ntohs(cep->llp.raddr.sin_port);
+	qp->attrs.sport = ntohs(cep->llp.laddr.sin_port);
+
+	/* Associate QP with CEP */
+	erdma_cep_get(cep);
+	qp->cep = cep;
+
+	erdma_qp_get(qp);
+	cep->qp = qp;
+
+	cep->state = ERDMA_EPSTATE_RDMA_MODE;
+
+	qp->qp_type = ERDMA_QP_TYPE_SERVER;
+	qp->private_data_len = params->private_data_len;
+	qp->cc_method = __mpa_rr_cc(cep->mpa.hdr.params.bits) == qp->dev->cc_method ?
+		qp->dev->cc_method : COMPROMISE_CC;
+
+	/* move to rts */
+	ret = erdma_modify_qp_internal(qp, &qp_attrs, ERDMA_QP_ATTR_STATE |
+				       ERDMA_QP_ATTR_LLP_HANDLE |
+				       ERDMA_QP_ATTR_MPA);
+	up_write(&qp->state_lock);
+
+	if (ret)
+		goto error;
+
+	__mpa_rr_set_cc(&cep->mpa.hdr.params.bits, qp->dev->cc_method);
+	memcpy(&cep->mpa.hdr.key[12], (u32 *)&QP_ID(qp), 4);
+	ret = erdma_send_mpareqrep(cep, params->private_data,
+				params->private_data_len);
+
+	if (!ret) {
+		ret = erdma_cm_upcall(cep, IW_CM_EVENT_ESTABLISHED, 0);
+		if (ret)
+			goto error;
+
+		erdma_cep_set_free(cep);
+
+		return 0;
+	}
+
+error:
+	erdma_socket_disassoc(cep->llp.sock);
+	sock_release(cep->llp.sock);
+	cep->llp.sock = NULL;
+
+	cep->state = ERDMA_EPSTATE_CLOSED;
+
+	if (cep->cm_id) {
+		cep->cm_id->rem_ref(cep->cm_id);
+		cep->cm_id = NULL;
+	}
+	if (qp->cep) {
+		erdma_cep_put(cep);
+		qp->cep = NULL;
+	}
+
+	cep->qp = NULL;
+	erdma_qp_put(qp);
+
+	erdma_cep_set_free(cep);
+	erdma_cep_put(cep);
+
+	return ret;
+}
+
+/*
+ * erdma_reject()
+ *
+ * Local connection reject case. Send private data back to peer,
+ * close connection and dereference connection id.
+ */
+int erdma_reject(struct iw_cm_id *id, const void *pdata, u8 plen)
+{
+	struct erdma_cep	*cep = (struct erdma_cep *)id->provider_data;
+
+	erdma_cep_set_inuse(cep);
+	erdma_cep_put(cep);
+
+	erdma_cancel_mpatimer(cep);
+
+	if (cep->state != ERDMA_EPSTATE_RECVD_MPAREQ) {
+		if (cep->state == ERDMA_EPSTATE_CLOSED) {
+			erdma_cep_set_free(cep);
+			erdma_cep_put(cep); /* should be last reference */
+
+			return -ECONNRESET;
+		}
+		erdma_cep_set_free(cep);
+		return -EBADFD;
+	}
+
+	if (__mpa_rr_revision(cep->mpa.hdr.params.bits) == MPA_REVISION_1) {
+		cep->mpa.hdr.params.bits |= MPA_RR_FLAG_REJECT; /* reject */
+		(void)erdma_send_mpareqrep(cep, pdata, plen);
+	}
+	erdma_socket_disassoc(cep->llp.sock);
+	sock_release(cep->llp.sock);
+	cep->llp.sock = NULL;
+
+	cep->state = ERDMA_EPSTATE_CLOSED;
+
+	erdma_cep_set_free(cep);
+	erdma_cep_put(cep);
+
+	return 0;
+}
+
+int erdma_create_listen(struct iw_cm_id *id, int backlog)
+{
+	struct socket *s;
+	struct erdma_cep *cep = NULL;
+	int ret = 0;
+	struct erdma_dev *dev = to_edev(id->device);
+	int addr_family = id->local_addr.ss_family;
+
+	if (addr_family != AF_INET)
+		return -EAFNOSUPPORT;
+
+	ret = sock_create(addr_family, SOCK_STREAM, IPPROTO_TCP, &s);
+	if (ret < 0)
+		return ret;
+
+	/*
+	 * Allow binding local port when still in TIME_WAIT from last close.
+	 */
+	sock_set_reuseaddr(s->sk);
+
+	if (addr_family == AF_INET) {
+		struct sockaddr_in *laddr = &to_sockaddr_in(id->local_addr);
+
+		/* For wildcard addr, limit binding to current device only */
+		if (ipv4_is_zeronet(laddr->sin_addr.s_addr))
+			s->sk->sk_bound_dev_if = dev->netdev->ifindex;
+
+		ret = s->ops->bind(s, (struct sockaddr *)laddr,
+				   sizeof(struct sockaddr_in));
+	} else {
+		ret = -EAFNOSUPPORT;
+		goto error;
+	}
+
+	if (ret != 0)
+		goto error;
+
+	cep = erdma_cep_alloc(dev);
+	if (!cep) {
+		ret = -ENOMEM;
+		goto error;
+	}
+	erdma_cep_socket_assoc(cep, s);
+
+	ret = erdma_cm_alloc_work(cep, backlog);
+	if (ret != 0)
+		goto error;
+
+	ret = s->ops->listen(s, backlog);
+	if (ret != 0)
+		goto error;
+
+	memcpy(&cep->llp.laddr, &id->local_addr, sizeof(cep->llp.laddr));
+	memcpy(&cep->llp.raddr, &id->remote_addr, sizeof(cep->llp.raddr));
+
+	cep->cm_id = id;
+	id->add_ref(id);
+
+	if (!id->provider_data) {
+		id->provider_data = kmalloc(sizeof(struct list_head), GFP_KERNEL);
+		if (!id->provider_data) {
+			ret = -ENOMEM;
+			goto error;
+		}
+		INIT_LIST_HEAD((struct list_head *)id->provider_data);
+	}
+
+	list_add_tail(&cep->listenq, (struct list_head *)id->provider_data);
+	cep->state = ERDMA_EPSTATE_LISTENING;
+
+	return 0;
+
+error:
+	if (cep) {
+		erdma_cep_set_inuse(cep);
+
+		if (cep->cm_id) {
+			cep->cm_id->rem_ref(cep->cm_id);
+			cep->cm_id = NULL;
+		}
+		cep->llp.sock = NULL;
+		erdma_socket_disassoc(s);
+		cep->state = ERDMA_EPSTATE_CLOSED;
+
+		erdma_cep_set_free(cep);
+		erdma_cep_put(cep);
+	}
+	sock_release(s);
+
+	return ret;
+}
+
+static void erdma_drop_listeners(struct iw_cm_id *id)
+{
+	struct list_head	*p, *tmp;
+	/*
+	 * In case of a wildcard rdma_listen on a multi-homed device,
+	 * a listener's IWCM id is associated with more than one listening CEP.
+	 */
+	list_for_each_safe(p, tmp, (struct list_head *)id->provider_data) {
+		struct erdma_cep *cep = list_entry(p, struct erdma_cep, listenq);
+
+		list_del(p);
+
+		erdma_cep_set_inuse(cep);
+
+		if (cep->cm_id) {
+			cep->cm_id->rem_ref(cep->cm_id);
+			cep->cm_id = NULL;
+		}
+		if (cep->llp.sock) {
+			erdma_socket_disassoc(cep->llp.sock);
+			sock_release(cep->llp.sock);
+			cep->llp.sock = NULL;
+		}
+		cep->state = ERDMA_EPSTATE_CLOSED;
+		erdma_cep_set_free(cep);
+		erdma_cep_put(cep);
+	}
+}
+
+int erdma_destroy_listen(struct iw_cm_id *id)
+{
+	if (!id->provider_data)
+		return 0;
+
+	erdma_drop_listeners(id);
+	kfree(id->provider_data);
+	id->provider_data = NULL;
+
+	return 0;
+}
+
+int erdma_cm_init(void)
+{
+	erdma_cm_wq = create_singlethread_workqueue("erdma_cm_wq");
+	if (!erdma_cm_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void erdma_cm_exit(void)
+{
+	if (erdma_cm_wq) {
+		flush_workqueue(erdma_cm_wq);
+		destroy_workqueue(erdma_cm_wq);
+	}
+}
diff --git a/drivers/infiniband/hw/erdma/erdma_cm.h b/drivers/infiniband/hw/erdma/erdma_cm.h
new file mode 100644
index 000000000000..7c5406d55de4
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_cm.h
@@ -0,0 +1,158 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ *
+ * Authors: Bernard Metzler <bmt@zurich.ibm.com>
+ * Copyright (c) 2008-2016, IBM Corporation
+ */
+
+#ifndef __ERDMA_CM_H__
+#define __ERDMA_CM_H__
+
+#include <net/sock.h>
+#include <linux/tcp.h>
+
+#include <rdma/iw_cm.h>
+
+/* iWarp MPA protocol defs */
+#define RDMAP_VERSION		1
+#define DDP_VERSION		1
+#define MPA_REVISION_1		1
+#define MPA_MAX_PRIVDATA	RDMA_MAX_PRIVATE_DATA
+#define MPA_KEY_REQ		"MPA ID Req F"
+#define MPA_KEY_REP		"MPA ID Rep F"
+
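+/*
+ * Note: only the first 12 bytes of the 16-byte MPA key carry the
+ * protocol identifier; bytes 12..15 are reused to exchange the peer's
+ * QP number (see erdma_proc_mpareq() and erdma_newconn_connected()).
+ */
+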
+struct mpa_rr_params {
+	__be16 bits;
+	__be16 pd_len;
+};
+
+/*
+ * MPA request/response Hdr bits & fields
+ */
+enum {
+	MPA_RR_FLAG_MARKERS  = __cpu_to_be16(0x8000),
+	MPA_RR_FLAG_CRC      = __cpu_to_be16(0x4000),
+	MPA_RR_FLAG_REJECT   = __cpu_to_be16(0x2000),
+	MPA_RR_DESIRED_CC    = __cpu_to_be16(0x0f00),
+	MPA_RR_RESERVED      = __cpu_to_be16(0x1000),
+	MPA_RR_MASK_REVISION = __cpu_to_be16(0x00ff)
+};
+
+/*
+ * MPA request/reply header
+ */
+struct mpa_rr {
+	u8 key[16];
+	struct mpa_rr_params params;
+};
+
+struct erdma_mpa_info {
+	struct mpa_rr hdr;	/* peer mpa hdr in host byte order */
+	char          *pdata;
+	int           bytes_rcvd;
+	u32           remote_qpn;
+};
+
+struct erdma_sk_upcalls {
+	void (*sk_state_change)(struct sock *sk);
+	void (*sk_data_ready)(struct sock *sk);
+	void (*sk_error_report)(struct sock *sk);
+};
+
+struct erdma_llp_info {
+	struct socket           *sock;
+	struct sockaddr_in      laddr;	/* redundant with socket info above */
+	struct sockaddr_in      raddr;	/* ditto, consider removal */
+	struct erdma_sk_upcalls sk_def_upcalls;
+};
+
+struct erdma_dev;
+
+enum erdma_cep_state {
+	ERDMA_EPSTATE_IDLE = 1,
+	ERDMA_EPSTATE_LISTENING,
+	ERDMA_EPSTATE_CONNECTING,
+	ERDMA_EPSTATE_AWAIT_MPAREQ,
+	ERDMA_EPSTATE_RECVD_MPAREQ,
+	ERDMA_EPSTATE_AWAIT_MPAREP,
+	ERDMA_EPSTATE_RDMA_MODE,
+	ERDMA_EPSTATE_CLOSED
+};
+
+struct erdma_cep {
+	struct iw_cm_id *cm_id;
+	struct erdma_dev *dev;
+
+	struct list_head devq;
+	/*
+	 * The provider_data element of a listener IWCM ID
+	 * refers to a list of one or more listener CEPs
+	 */
+	struct list_head listenq;
+	struct erdma_cep *listen_cep;
+	struct erdma_qp *qp;
+	spinlock_t lock;
+	wait_queue_head_t waitq;
+	struct kref ref;
+	enum erdma_cep_state state;
+	short in_use;
+	struct erdma_cm_work *mpa_timer;
+	struct list_head work_freelist;
+	struct erdma_llp_info llp;
+	struct erdma_mpa_info mpa;
+	int ord;
+	int ird;
+	int sk_error;
+	int pd_len;
+	void *private_storage;
+
+	/* Saved upcalls of socket llp.sock */
+	void (*sk_state_change)(struct sock *sk);
+	void (*sk_data_ready)(struct sock *sk);
+	void (*sk_error_report)(struct sock *sk);
+
+	bool is_connecting;
+};
+
+#define MPAREQ_TIMEOUT		(HZ * 20)
+#define MPAREP_TIMEOUT		(HZ * 10)
+#define CONNECT_TIMEOUT		(HZ * 10)
+
+enum erdma_work_type {
+	ERDMA_CM_WORK_ACCEPT	= 1,
+	ERDMA_CM_WORK_READ_MPAHDR,
+	ERDMA_CM_WORK_CLOSE_LLP,		/* close socket */
+	ERDMA_CM_WORK_PEER_CLOSE,		/* socket indicated peer close */
+	ERDMA_CM_WORK_MPATIMEOUT,
+	ERDMA_CM_WORK_CONNECTED,
+	ERDMA_CM_WORK_CONNECTTIMEOUT
+};
+
+struct erdma_cm_work {
+	struct delayed_work	work;
+	struct list_head	list;
+	enum erdma_work_type	type;
+	struct erdma_cep	*cep;
+};
+
+#define to_sockaddr_in(a) (*(struct sockaddr_in *)(&(a)))
+
+extern int erdma_connect(struct iw_cm_id *id, struct iw_cm_conn_param *param);
+extern int erdma_accept(struct iw_cm_id *id, struct iw_cm_conn_param *param);
+extern int erdma_reject(struct iw_cm_id *id, const void *pdata, u8 plen);
+extern int erdma_create_listen(struct iw_cm_id *id, int backlog);
+extern int erdma_destroy_listen(struct iw_cm_id *id);
+
+extern void erdma_cep_get(struct erdma_cep *cep);
+extern void erdma_cep_put(struct erdma_cep *cep);
+extern int erdma_cm_queue_work(struct erdma_cep *cep, enum erdma_work_type type);
+
+extern int erdma_cm_init(void);
+extern void erdma_cm_exit(void);
+
+#define sk_to_cep(sk)	((struct erdma_cep *)((sk)->sk_user_data))
+
+#endif
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH rdma-next 09/11] RDMA/erdma: Add the erdma module
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
                   ` (7 preceding siblings ...)
  2021-12-21  2:48 ` [PATCH rdma-next 08/11] RDMA/erdma: Add connection management (CM) support Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21 13:26   ` Leon Romanovsky
  2021-12-21  2:48 ` [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions Cheng Xu
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

Add the main erdma module and its debugfs files. The main module provides
the interface to the InfiniBand subsystem, and the debugfs part gives
users a way to query the core status of the device and to select the
preferred congestion control algorithm.
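
As an illustration, a minimal userspace sketch of the intended debugfs
usage (untested; the device directory name below is hypothetical, it
follows the registered ibdev name):

	#include <stdio.h>

	int main(void)
	{
		const char *p = "/sys/kernel/debug/erdma/erdma_0a0b0c/cc";
		char cur[64] = { 0 };
		FILE *f = fopen(p, "r");

		if (!f)
			return 1;
		if (fgets(cur, sizeof(cur), f))
			printf("current cc: %s", cur);	/* e.g. "newreno" */
		fclose(f);

		f = fopen(p, "w");	/* select another algorithm */
		if (!f)
			return 1;
		fputs("cubic\n", f);	/* must match a name in cc_method_string[] */
		fclose(f);
		return 0;
	}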

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 drivers/infiniband/hw/erdma/erdma_debug.c | 314 ++++++++++
 drivers/infiniband/hw/erdma/erdma_debug.h |  18 +
 drivers/infiniband/hw/erdma/erdma_main.c  | 711 ++++++++++++++++++++++
 3 files changed, 1043 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.c
 create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.h
 create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c

diff --git a/drivers/infiniband/hw/erdma/erdma_debug.c b/drivers/infiniband/hw/erdma/erdma_debug.c
new file mode 100644
index 000000000000..3cbed4dde0e2
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_debug.c
@@ -0,0 +1,314 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/debugfs.h>
+
+#include <rdma/iw_cm.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_smi.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "erdma.h"
+#include "erdma_cm.h"
+#include "erdma_debug.h"
+#include "erdma_verbs.h"
+
+char *cc_method_string[ERDMA_CC_METHODS_NUM] = {
+	[ERDMA_CC_NEWRENO] = "newreno",
+	[ERDMA_CC_CUBIC] = "cubic",
+	[ERDMA_CC_HPCC_RTT] = "hpcc_rtt",
+	[ERDMA_CC_HPCC_ECN] = "hpcc_ecn",
+	[ERDMA_CC_HPCC_INT] = "hpcc_int"
+};
+
+static struct dentry *erdma_debugfs;
+
+static int erdma_dbgfs_file_open(struct inode *inode, struct file *fp)
+{
+	fp->private_data = inode->i_private;
+	return nonseekable_open(inode, fp);
+}
+
+static ssize_t erdma_show_stats(struct file *fp, char __user *buf, size_t space,
+			      loff_t *ppos)
+{
+	struct erdma_dev *dev = fp->private_data;
+	char *kbuf = NULL;
+	int len = 0;
+
+	if (*ppos)
+		goto out;
+
+	kbuf = kmalloc(space, GFP_KERNEL);
+	if (!kbuf)
+		goto out;
+
+	len = snprintf(kbuf, space, "Resource Summary of %s:\n"
+		"%s: %d\n%s: %d\n%s: %d\n%s: %d\n%s: %d\n%s: %d\n",
+		dev->ibdev.name,
+		"ucontext ", atomic_read(&dev->num_ctx),
+		"pd       ", atomic_read(&dev->num_pd),
+		"qp       ", atomic_read(&dev->num_qp),
+		"cq       ", atomic_read(&dev->num_cq),
+		"mr       ", atomic_read(&dev->num_mr),
+		"cep      ", atomic_read(&dev->num_cep));
+	if (len > space)
+		len = space;
+out:
+	if (len)
+		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
+
+	kfree(kbuf);
+	return len;
+}
+
+static ssize_t erdma_show_cmdq(struct file *fp, char __user *buf, size_t space,
+			       loff_t *ppos)
+{
+	struct erdma_dev *dev = fp->private_data;
+	char *kbuf = NULL;
+	int len = 0, n;
+
+	if (*ppos)
+		goto out;
+
+	kbuf = kmalloc(space, GFP_KERNEL);
+	if (!kbuf)
+		goto out;
+
+	len = snprintf(kbuf, space,
+		"CMDQ Summary:\n"
+		"submitted:%llu, completed:%llu.\n"
+		"cmdq-eq event:%llu, notify:%llu aeq event:%llu, notify:%llu cq armed:%llu\n",
+		dev->cmdq.sq.total_cmds, dev->cmdq.sq.total_comp_cmds,
+		atomic64_read(&dev->cmdq.eq.event_num),
+		atomic64_read(&dev->cmdq.eq.notify_num),
+		atomic64_read(&dev->aeq.eq.event_num),
+		atomic64_read(&dev->aeq.eq.notify_num),
+		atomic64_read(&dev->cmdq.cq.cq_armed_num));
+	if (len > space) {
+		len = space;
+		goto out;
+	}
+
+	space -= len;
+	/* scnprintf() returns the number of bytes actually written, so
+	 * len and space stay consistent even once the buffer fills up.
+	 */
+	n = scnprintf(kbuf + len, space,
+		"SQ-buf depth:%u, ci:0x%x, pi:0x%x\n",
+		dev->cmdq.sq.depth, dev->cmdq.sq.ci, dev->cmdq.sq.pi);
+	len += n;
+	space -= n;
+	n = scnprintf(kbuf + len, space,
+		"CQ-buf depth:%u, ci:0x%x\n",
+		dev->cmdq.cq.depth, dev->cmdq.cq.ci);
+	len += n;
+	space -= n;
+	n = scnprintf(kbuf + len, space,
+		"EQ-buf depth:%u, ci:0x%x\n",
+		dev->cmdq.eq.depth, dev->cmdq.eq.ci);
+	len += n;
+	space -= n;
+	n = scnprintf(kbuf + len, space,
+		"AEQ-buf depth:%u, ci:0x%x\n",
+		dev->aeq.eq.depth, dev->aeq.eq.ci);
+	len += n;
+	space -= n;
+	n = scnprintf(kbuf + len, space,
+		"q-flags:0x%lx\n", dev->cmdq.state);
+	len += n;
+
+out:
+	if (len)
+		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
+
+	kfree(kbuf);
+	return len;
+}
+
+static ssize_t erdma_show_ceq(struct file *fp, char __user *buf, size_t space,
+			      loff_t *ppos)
+{
+	struct erdma_dev *dev = fp->private_data;
+	char *kbuf = NULL;
+	int len = 0, n, i;
+	struct erdma_eq_cb *eq_cb;
+
+	if (*ppos)
+		goto out;
+
+	kbuf = kmalloc(space, GFP_KERNEL);
+	if (!kbuf)
+		goto out;
+
+	len = snprintf(kbuf, space, "CEQs Summary:\n");
+	if (len > space) {
+		len = space;
+		goto out;
+	}
+
+	space -= len;
+
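+	/*
+	 * EQ 0 is shared by the cmdq and the aeq (see
+	 * erdma_comm_irq_handler()), so at most 31 of the 32 MSI-X
+	 * vectors serve completion EQs.
+	 */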
+	for (i = 0; i < 31; i++) {
+		eq_cb = &dev->ceqs[i];
+		n = snprintf(kbuf + len, space,
+			"%d ready:%u,event_num:%llu,notify_num:%llu,depth:%u,ci:0x%x\n",
+			i, eq_cb->ready,
+			atomic64_read(&eq_cb->eq.event_num),
+			atomic64_read(&eq_cb->eq.notify_num),
+			eq_cb->eq.depth, eq_cb->eq.ci);
+		if (n < space) {
+			len += n;
+			space -= n;
+		} else {
+			len += space;
+			break;
+		}
+	}
+
+out:
+	if (len)
+		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
+
+	kfree(kbuf);
+	return len;
+}
+
+static ssize_t erdma_show_cc(struct file *fp, char __user *buf, size_t space,
+			     loff_t *ppos)
+{
+	struct erdma_dev *dev = fp->private_data;
+	char *kbuf = NULL;
+	int len = 0;
+
+	if (*ppos)
+		goto out;
+
+	if (dev->cc_method < 0 || dev->cc_method >= ERDMA_CC_METHODS_NUM)
+		goto out;
+
+	kbuf = kmalloc(space, GFP_KERNEL);
+	if (!kbuf)
+		goto out;
+
+	len = snprintf(kbuf, space, "%s\n", cc_method_string[dev->cc_method]);
+	if (len > space)
+		len = space;
+out:
+	if (len)
+		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
+
+	kfree(kbuf);
+	return len;
+}
+
+static ssize_t erdma_set_cc(struct file *fp, const char __user *buf, size_t count, loff_t *ppos)
+{
+	struct erdma_dev *dev = fp->private_data;
+	char cmd_buf[64];
+	int i;
+
+	if (*ppos != 0)
+		return 0;
+
+	if (count >= sizeof(cmd_buf))
+		return -ENOSPC;
+
+	/* copy_from_user() returns the number of bytes *not* copied; it is
+	 * never negative, so treat any shortfall as a failure.
+	 */
+	if (copy_from_user(cmd_buf, buf, count))
+		return -EFAULT;
+
+	cmd_buf[count] = '\0';
+	*ppos = 0;
+
+	for (i = 0; i < ERDMA_CC_METHODS_NUM; i++) {
+		/* sysfs_streq() tolerates the trailing newline from echo */
+		if (sysfs_streq(cmd_buf, cc_method_string[i])) {
+			dev->cc_method = i;
+			return count;
+		}
+	}
+
+	return -EINVAL;
+}
+
+static const struct file_operations erdma_stats_debug_fops = {
+	.owner = THIS_MODULE,
+	.open = erdma_dbgfs_file_open,
+	.read = erdma_show_stats
+};
+
+static const struct file_operations erdma_cmdq_debug_fops = {
+	.owner = THIS_MODULE,
+	.open = erdma_dbgfs_file_open,
+	.read = erdma_show_cmdq
+};
+
+static const struct file_operations erdma_ceq_debug_fops = {
+	.owner = THIS_MODULE,
+	.open = erdma_dbgfs_file_open,
+	.read = erdma_show_ceq
+};
+
+static const struct file_operations erdma_cc_fops = {
+	.owner = THIS_MODULE,
+	.open = erdma_dbgfs_file_open,
+	.read = erdma_show_cc,
+	.write = erdma_set_cc,
+};
+
+void erdma_debugfs_add_one(struct erdma_dev *dev)
+{
+	if (!erdma_debugfs)
+		return;
+
+	/* debugfs_create_dir() never returns NULL, and debugfs creation
+	 * errors are deliberately ignored: the files are best-effort
+	 * debug aids.
+	 */
+	dev->debugfs = debugfs_create_dir(dev->ibdev.name, erdma_debugfs);
+	debugfs_create_file("stats", 0400, dev->debugfs, dev,
+			    &erdma_stats_debug_fops);
+	debugfs_create_file("cmdq", 0400, dev->debugfs, dev,
+			    &erdma_cmdq_debug_fops);
+	debugfs_create_file("ceq", 0400, dev->debugfs, dev,
+			    &erdma_ceq_debug_fops);
+	/* "cc" has a write handler, so it must be user-writable (0600) */
+	debugfs_create_file("cc", 0600, dev->debugfs, dev, &erdma_cc_fops);
+}
+
+void erdma_debugfs_remove_one(struct erdma_dev *dev)
+{
+	debugfs_remove_recursive(dev->debugfs);
+	dev->debugfs = NULL;
+}
+
+void erdma_debugfs_init(void)
+{
+	erdma_debugfs = debugfs_create_dir("erdma", NULL);
+}
+
+void erdma_debugfs_exit(void)
+{
+	debugfs_remove_recursive(erdma_debugfs);
+	erdma_debugfs = NULL;
+}
diff --git a/drivers/infiniband/hw/erdma/erdma_debug.h b/drivers/infiniband/hw/erdma/erdma_debug.h
new file mode 100644
index 000000000000..73e170719b17
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_debug.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+ *
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+
+#ifndef __ERDMA_DEBUG_H__
+#define __ERDMA_DEBUG_H__
+
+#include <linux/uaccess.h>
+
+extern void erdma_debugfs_init(void);
+extern void erdma_debugfs_add_one(struct erdma_dev *dev);
+extern void erdma_debugfs_remove_one(struct erdma_dev *dev);
+extern void erdma_debugfs_exit(void);
+
+#endif
diff --git a/drivers/infiniband/hw/erdma/erdma_main.c b/drivers/infiniband/hw/erdma/erdma_main.c
new file mode 100644
index 000000000000..12ace2921fb3
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/erdma_main.c
@@ -0,0 +1,711 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Authors: Cheng Xu <chengyou@linux.alibaba.com>
+ *          Kai Shen <kaishen@linux.alibaba.com>
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/pci.h>
+#include <rdma/erdma-abi.h>
+#include <rdma/ib_verbs.h>
+#include <rdma/ib_user_verbs.h>
+
+#include "erdma.h"
+#include "erdma_cm.h"
+#include "erdma_debug.h"
+#include "erdma_hw.h"
+#include "erdma_verbs.h"
+
+#define DESC "Elastic RDMA (iWarp) Driver"
+
+MODULE_AUTHOR("Alibaba");
+MODULE_DESCRIPTION(DESC);
+MODULE_LICENSE("GPL v2");
+MODULE_VERSION("1.0");
+
+/* Common string matched by the userspace library to accept the device */
+#define ERDMA_NODE_DESC_COMMON "Elastic RDMA(iWARP) stack"
+#define ERDMA_IBDEV_PREFIX "erdma_"
+
+static int max_vectors = 32;
+module_param(max_vectors, int, 0644);
+MODULE_PARM_DESC(max_vectors, "Maximum number of MSI-X vectors to use, must be within [1, 32].");
+
+static void erdma_device_register(struct erdma_dev *dev)
+{
+	struct ib_device *ibdev = &dev->ibdev;
+	struct net_device *netdev = dev->netdev;
+	int ret;
+
+	memset(ibdev->name, 0, IB_DEVICE_NAME_MAX);
+	ret = snprintf(ibdev->name, IB_DEVICE_NAME_MAX, "%s%.2x%.2x%.2x",
+		ERDMA_IBDEV_PREFIX,
+		*((u8 *)dev->netdev->dev_addr + 3),
+		*((u8 *)dev->netdev->dev_addr + 4),
+		*((u8 *)dev->netdev->dev_addr + 5));
+	if (ret < 0) {
+		pr_err("ERROR: copy ibdev name failed.\n");
+		return;
+	}
+
+	memset(&ibdev->node_guid, 0, sizeof(ibdev->node_guid));
+	memcpy(&ibdev->node_guid, netdev->dev_addr, 6);
+
+	ibdev->phys_port_cnt = 1;
+	ret = ib_device_set_netdev(ibdev, dev->netdev, 1);
+	if (ret)
+		return;
+
+	ret = ib_register_device(ibdev, ibdev->name, &dev->pdev->dev);
+	if (ret) {
+		pr_err("ERROR: ib_register_device(%s) failed: ret = %d\n",
+			ibdev->name, ret);
+		return;
+	}
+
+	erdma_debugfs_add_one(dev);
+
+	dev->is_registered = 1;
+}
+
+static void erdma_device_deregister(struct erdma_dev *dev)
+{
+	erdma_debugfs_remove_one(dev);
+
+	ib_unregister_device(&dev->ibdev);
+
+	WARN_ON(atomic_read(&dev->num_ctx));
+	WARN_ON(atomic_read(&dev->num_qp));
+	WARN_ON(atomic_read(&dev->num_cq));
+	WARN_ON(atomic_read(&dev->num_mr));
+	WARN_ON(atomic_read(&dev->num_pd));
+	WARN_ON(atomic_read(&dev->num_cep));
+}
+
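+/*
+ * The ERDMA function and its companion netdev are separate PCI devices;
+ * the only link between them is the netdev MAC address exposed through
+ * the ERDMA BAR registers (stored in dev->peer_addr at probe time), so
+ * candidate netdevs are matched by their permanent MAC address.
+ */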
+static int erdma_netdev_matched_edev(struct net_device *netdev, struct erdma_dev *dev)
+{
+	if (netdev->perm_addr[0] == dev->peer_addr[0] &&
+	    netdev->perm_addr[1] == dev->peer_addr[1] &&
+	    netdev->perm_addr[2] == dev->peer_addr[2] &&
+	    netdev->perm_addr[3] == dev->peer_addr[3] &&
+	    netdev->perm_addr[4] == dev->peer_addr[4] &&
+	    netdev->perm_addr[5] == dev->peer_addr[5])
+		return 1;
+
+	return 0;
+}
+
+static int erdma_netdev_event(struct notifier_block *nb, unsigned long event,
+			      void *arg)
+{
+	struct net_device *netdev = netdev_notifier_info_to_dev(arg);
+	struct erdma_dev *dev = container_of(nb, struct erdma_dev, netdev_nb);
+
+	if (dev->netdev != NULL && dev->netdev != netdev)
+		goto done;
+
+	switch (event) {
+	case NETDEV_UP:
+		if (dev->is_registered) {
+			dev->state = IB_PORT_ACTIVE;
+			erdma_port_event(dev, IB_EVENT_PORT_ACTIVE);
+		}
+		break;
+	case NETDEV_DOWN:
+		if (dev->is_registered) {
+			dev->state = IB_PORT_DOWN;
+			erdma_port_event(dev, IB_EVENT_PORT_ERR);
+		}
+		break;
+	case NETDEV_REGISTER:
+		if (!dev->is_registered && erdma_netdev_matched_edev(netdev, dev)) {
+			dev->netdev = netdev;
+			dev->state = IB_PORT_INIT;
+			erdma_device_register(dev);
+		}
+		break;
+	case NETDEV_UNREGISTER:
+	case NETDEV_CHANGEADDR:
+	case NETDEV_CHANGEMTU:
+	case NETDEV_GOING_DOWN:
+	case NETDEV_CHANGE:
+	default:
+		break;
+	}
+
+done:
+	return NOTIFY_OK;
+}
+
+static irqreturn_t erdma_comm_irq_handler(int irq, void *data)
+{
+	struct erdma_dev *dev = data;
+
+	erdma_cmdq_completion_handler(dev);
+	erdma_aeq_event_handler(dev);
+
+	return IRQ_HANDLED;
+}
+
+static int erdma_request_vectors(struct erdma_dev *dev)
+{
+	int msix_vecs, irq_num;
+
+	msix_vecs = max_vectors;
+	if (msix_vecs < 1 || msix_vecs > ERDMA_NUM_MSIX_VEC)
+		return -EINVAL;
+
+	irq_num = pci_alloc_irq_vectors(dev->pdev, 1, msix_vecs, PCI_IRQ_MSIX);
+
+	if (irq_num <= 0) {
+		dev_err(&dev->pdev->dev, "request irq vectors failed(%d), expected(%d).\n",
+			irq_num, msix_vecs);
+		return -ENOSPC;
+	}
+
+	dev_info(&dev->pdev->dev, "hardware return %d irqs.\n", irq_num);
+	dev->irq_num = irq_num;
+
+	return 0;
+}
+
+static int erdma_comm_irq_init(struct erdma_dev *dev)
+{
+	u32 cpu = 0;
+	int err;
+	struct erdma_irq_info *irq_info = &dev->comm_irq;
+
+	snprintf(irq_info->name, ERDMA_IRQNAME_SIZE, "erdma-common@pci:%s", pci_name(dev->pdev));
+	irq_info->handler = erdma_comm_irq_handler;
+	irq_info->data = dev;
+	irq_info->msix_vector = pci_irq_vector(dev->pdev, ERDMA_MSIX_VECTOR_CMDQ);
+
+	if (dev->numa_node >= 0)
+		cpu = cpumask_first(cpumask_of_node(dev->numa_node));
+
+	irq_info->cpu = cpu;
+	cpumask_set_cpu(cpu, &irq_info->affinity_hint_mask);
+	dev_info(&dev->pdev->dev, "setup irq:%p vector:%d name:%s\n",
+		 irq_info,
+		 irq_info->msix_vector,
+		 irq_info->name);
+
+	err = request_irq(irq_info->msix_vector, irq_info->handler, 0,
+		irq_info->name, irq_info->data);
+	if (err) {
+		dev_err(&dev->pdev->dev, "failed to request_irq(%d)\n", err);
+		return err;
+	}
+
+	irq_set_affinity_hint(irq_info->msix_vector, &irq_info->affinity_hint_mask);
+
+	return 0;
+}
+
+static void erdma_comm_irq_uninit(struct erdma_dev *dev)
+{
+	struct erdma_irq_info *irq_info = &dev->comm_irq;
+
+	irq_set_affinity_hint(irq_info->msix_vector, NULL);
+	free_irq(irq_info->msix_vector, irq_info->data);
+}
+
+static void __erdma_dwqe_resource_init(struct erdma_dev *dev, int grp_num)
+{
+	int total_pages, type0, type1, shared;
+
+	if (grp_num < 4)
+		dev->disable_dwqe = 1;
+	else
+		dev->disable_dwqe = 0;
+
+	/* One page contains 4 groups. */
+	total_pages = grp_num * 4;
+	shared = 1;
+	if (grp_num >= ERDMA_DWQE_MAX_GRP_CNT) {
+		grp_num = ERDMA_DWQE_MAX_GRP_CNT;
+		type0 = ERDMA_DWQE_TYPE0_CNT;
+		type1 = ERDMA_DWQE_TYPE1_CNT / ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
+	} else {
+		type1 = total_pages / 3;
+		type0 = total_pages - type1;
+	}
+
+	dev->dwqe_pages = type0;
+	dev->dwqe_entries = type1 * ERDMA_DWQE_TYPE1_CNT_PER_PAGE;
+
+	pr_info("grp_num:%d, total pages:%d, type0:%d, type1:%d, type1_db_cnt:%d, shared:%d\n",
+		grp_num, total_pages, type0, type1, type1 * 16, shared);
+}
+
+static int erdma_device_init(struct erdma_dev *dev, struct pci_dev *pdev)
+{
+	int err;
+
+	dev->grp_num = erdma_reg_read32(dev, ERDMA_REGS_GRP_NUM_REG);
+
+	dev_info(&pdev->dev, "hardware returned grp_num:%d\n", dev->grp_num);
+
+	__erdma_dwqe_resource_init(dev, dev->grp_num);
+
+	/* force dma width to 64. */
+	dev->dma_width = 64;
+
+	err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(dev->dma_width));
+	if (err) {
+		dev_err(&pdev->dev, "dma_set_mask_and_coherent failed(%d)\n", err);
+		return err;
+	}
+
+	return err;
+}
+
+static void erdma_device_uninit(struct erdma_dev *dev)
+{
+	u32 ctrl;
+
+	ctrl = FIELD_PREP(ERDMA_REG_DEV_CTRL_RESET_MASK, 1);
+	erdma_reg_write32(dev, ERDMA_REGS_DEV_CTRL_REG, ctrl);
+}
+
+static const struct pci_device_id erdma_pci_tbl[] = {
+	{PCI_DEVICE(PCI_VENDOR_ID_ALIBABA, 0x107f)},
+	{PCI_DEVICE(PCI_VENDOR_ID_ALIBABA, 0x5007)},
+	{}
+};
+
+static int erdma_probe_dev(struct pci_dev *pdev)
+{
+	int err;
+	struct erdma_dev *dev;
+	u32 version;
+	int bars;
+	struct ib_device *ibdev;
+
+	err = pci_enable_device(pdev);
+	if (err) {
+		dev_err(&pdev->dev, "pci_enable_device failed(%d)\n", err);
+		return err;
+	}
+
+	pci_set_master(pdev);
+
+	dev = ib_alloc_device(erdma_dev, ibdev);
+	if (!dev) {
+		dev_err(&pdev->dev, "ib_alloc_device failed\n");
+		err = -ENOMEM;
+		goto err_disable_device;
+	}
+
+	ibdev = &dev->ibdev;
+
+	pci_set_drvdata(pdev, dev);
+	dev->pdev = pdev;
+	dev->numa_node = pdev->dev.numa_node;
+
+	bars = pci_select_bars(pdev, IORESOURCE_MEM);
+	err = pci_request_selected_regions(pdev, bars, DRV_MODULE_NAME);
+	if (bars != ERDMA_BAR_MASK || err) {
+		dev_err(&pdev->dev,
+			"pci_request_selected_regions failed(bars:%d, err:%d)\n", bars, err);
+		err = err == 0 ? -EINVAL : err;
+		goto err_ib_device_release;
+	}
+
+	dev->func_bar_addr = pci_resource_start(pdev, ERDMA_FUNC_BAR);
+	dev->func_bar_len = pci_resource_len(pdev, ERDMA_FUNC_BAR);
+
+	dev->func_bar = devm_ioremap(&pdev->dev, dev->func_bar_addr, dev->func_bar_len);
+	if (!dev->func_bar) {
+		dev_err(&pdev->dev, "devm_ioremap failed.\n");
+		err = -EFAULT;
+		goto err_release_bars;
+	}
+
+	version = erdma_reg_read32(dev, ERDMA_REGS_VERSION_REG);
+	if (version == 0) {
+		/* a zero version register means this PCI function is non-functional */
+		err = -ENODEV;
+		goto err_iounmap_func_bar;
+	}
+
+	err = erdma_device_init(dev, pdev);
+	if (err)
+		goto err_iounmap_func_bar;
+
+	err = erdma_request_vectors(dev);
+	if (err)
+		goto err_iounmap_func_bar;
+
+	err = erdma_comm_irq_init(dev);
+	if (err)
+		goto err_free_vectors;
+
+	err = erdma_aeq_init(dev);
+	if (err)
+		goto err_uninit_comm_irq;
+
+	err = erdma_cmdq_init(dev);
+	if (err)
+		goto err_uninit_aeq;
+
+	err = erdma_ceqs_init(dev);
+	if (err)
+		goto err_uninit_cmdq;
+
+	erdma_finish_cmdq_init(dev);
+
+	return 0;
+
+err_uninit_cmdq:
+	erdma_device_uninit(dev);
+	erdma_cmdq_destroy(dev);
+
+err_uninit_aeq:
+	erdma_aeq_destroy(dev);
+
+err_uninit_comm_irq:
+	erdma_comm_irq_uninit(dev);
+
+err_free_vectors:
+	pci_free_irq_vectors(dev->pdev);
+
+err_iounmap_func_bar:
+	devm_iounmap(&pdev->dev, dev->func_bar);
+
+err_release_bars:
+	pci_release_selected_regions(pdev, bars);
+
+err_ib_device_release:
+	ib_dealloc_device(&dev->ibdev);
+
+err_disable_device:
+	pci_disable_device(pdev);
+
+	return err;
+}
+
+static void erdma_remove_dev(struct pci_dev *pdev)
+{
+	struct erdma_dev *dev = pci_get_drvdata(pdev);
+
+	erdma_ceqs_uninit(dev);
+
+	erdma_device_uninit(dev);
+
+	erdma_cmdq_destroy(dev);
+	erdma_aeq_destroy(dev);
+	erdma_comm_irq_uninit(dev);
+	pci_free_irq_vectors(dev->pdev);
+
+	devm_iounmap(&pdev->dev, dev->func_bar);
+	pci_release_selected_regions(pdev, ERDMA_BAR_MASK);
+
+	ib_dealloc_device(&dev->ibdev);
+
+	pci_disable_device(pdev);
+}
+
+static int erdma_dev_attrs_init(struct erdma_dev *dev)
+{
+	int err;
+	u64 req_hdr, cap0, cap1;
+
+	ERDMA_CMDQ_BUILD_REQ_HDR(&req_hdr, CMDQ_SUBMOD_RDMA, CMDQ_OPCODE_QUERY_DEVICE);
+
+	err = erdma_post_cmd_wait(&dev->cmdq, &req_hdr, sizeof(req_hdr), &cap0, &cap1);
+	if (err) {
+		dev_err(&dev->pdev->dev,
+			"ERROR: err code = %d, query capability command failed.\n", err);
+		return err;
+	}
+
+	dev->attrs.max_cqe = 1 << FIELD_GET(ERDMA_CMD_DEV_CAP0_MAX_CQE_MASK, cap0);
+	dev->attrs.max_mr_size = 1 << FIELD_GET(ERDMA_CMD_DEV_CAP0_MAX_MR_SIZE_MASK, cap0);
+	dev->attrs.max_mw = 1 << FIELD_GET(ERDMA_CMD_DEV_CAP1_MAX_MW_MASK, cap1);
+	dev->attrs.max_recv_wr = 1 << FIELD_GET(ERDMA_CMD_DEV_CAP0_MAX_RECV_WR_MASK, cap0);
+	dev->attrs.local_dma_key = FIELD_GET(ERDMA_CMD_DEV_CAP1_DMA_LOCAL_KEY_MASK, cap1);
+	dev->cc_method = FIELD_GET(ERDMA_CMD_DEV_CAP1_DEFAULT_CC_MASK, cap1);
+	dev->attrs.max_qp = ERDMA_NQP_PER_QBLOCK * FIELD_GET(ERDMA_CMD_DEV_CAP1_QBLOCK_MASK, cap1);
+	dev->attrs.max_mr = 2 * dev->attrs.max_qp;
+	dev->attrs.max_cq = 2 * dev->attrs.max_qp;
+
+	dev->attrs.max_send_wr = ERDMA_MAX_SEND_WR;
+	dev->attrs.vendor_id = PCI_VENDOR_ID_ALIBABA;
+	dev->attrs.max_ord = ERDMA_MAX_ORD;
+	dev->attrs.max_ird = ERDMA_MAX_IRD;
+	dev->attrs.cap_flags = IB_DEVICE_LOCAL_DMA_LKEY | IB_DEVICE_MEM_MGT_EXTENSIONS;
+	dev->attrs.max_send_sge = ERDMA_MAX_SEND_SGE;
+	dev->attrs.max_recv_sge = ERDMA_MAX_RECV_SGE;
+	dev->attrs.max_sge_rd = ERDMA_MAX_SGE_RD;
+	dev->attrs.max_pd = ERDMA_MAX_PD;
+	dev->attrs.max_srq = ERDMA_MAX_SRQ;
+	dev->attrs.max_srq_wr = ERDMA_MAX_SRQ_WR;
+	dev->attrs.max_srq_sge = ERDMA_MAX_SRQ_SGE;
+
+	dev->res_cb[ERDMA_RES_TYPE_PD].max_cap = ERDMA_MAX_PD;
+	dev->res_cb[ERDMA_RES_TYPE_STAG_IDX].max_cap = dev->attrs.max_mr;
+
+	return 0;
+}
+
+int erdma_res_cb_init(struct erdma_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < ERDMA_RES_CNT; i++) {
+		dev->res_cb[i].next_alloc_idx = 1;
+		spin_lock_init(&dev->res_cb[i].lock);
+		dev->res_cb[i].bitmap = kcalloc(BITS_TO_LONGS(dev->res_cb[i].max_cap),
+			sizeof(unsigned long), GFP_KERNEL);
+		/* We will free the memory in erdma_res_cb_free */
+		if (!dev->res_cb[i].bitmap)
+			return -ENOMEM;
+	}
+
+	return 0;
+}
+
+void erdma_res_cb_free(struct erdma_dev *dev)
+{
+	int i;
+
+	for (i = 0; i < ERDMA_RES_CNT; i++)
+		kfree(dev->res_cb[i].bitmap);
+}
+
+static const struct ib_device_ops erdma_device_ops = {
+	.owner = THIS_MODULE,
+	.driver_id = RDMA_DRIVER_ERDMA,
+	.uverbs_abi_ver = ERDMA_ABI_VERSION,
+
+	.alloc_mr = erdma_ib_alloc_mr,
+	.alloc_pd = erdma_alloc_pd,
+	.alloc_ucontext = erdma_alloc_ucontext,
+	.create_cq = erdma_create_cq,
+	.create_qp = erdma_create_qp,
+	.dealloc_pd = erdma_dealloc_pd,
+	.dealloc_ucontext = erdma_dealloc_ucontext,
+	.dereg_mr = erdma_dereg_mr,
+	.destroy_cq = erdma_destroy_cq,
+	.destroy_qp = erdma_destroy_qp,
+	.disassociate_ucontext = erdma_disassociate_ucontext,
+	.get_dma_mr = erdma_get_dma_mr,
+	.get_netdev = erdma_get_netdev,
+	.get_port_immutable = erdma_get_port_immutable,
+	.iw_accept = erdma_accept,
+	.iw_add_ref = erdma_qp_get_ref,
+	.iw_connect = erdma_connect,
+	.iw_create_listen = erdma_create_listen,
+	.iw_destroy_listen = erdma_destroy_listen,
+	.iw_get_qp = erdma_get_ibqp,
+	.iw_reject = erdma_reject,
+	.iw_rem_ref = erdma_qp_put_ref,
+	.map_mr_sg = erdma_map_mr_sg,
+	.mmap = erdma_mmap,
+	.modify_qp = erdma_modify_qp,
+	.post_recv = erdma_post_recv,
+	.post_send = erdma_post_send,
+	.poll_cq = erdma_poll_cq,
+	.query_device = erdma_query_device,
+	.query_gid = erdma_query_gid,
+	.query_pkey = erdma_query_pkey,
+	.query_port = erdma_query_port,
+	.query_qp = erdma_query_qp,
+	.req_notify_cq = erdma_req_notify_cq,
+	.reg_user_mr = erdma_reg_user_mr,
+
+	INIT_RDMA_OBJ_SIZE(ib_cq, erdma_cq, ibcq),
+	INIT_RDMA_OBJ_SIZE(ib_pd, erdma_pd, ibpd),
+	INIT_RDMA_OBJ_SIZE(ib_ucontext, erdma_ucontext, ibucontext),
+	INIT_RDMA_OBJ_SIZE(ib_qp, erdma_qp, ibqp),
+};
+
+static int erdma_ib_device_add(struct pci_dev *pdev)
+{
+	struct erdma_dev *dev = pci_get_drvdata(pdev);
+	struct ib_device *ibdev = &dev->ibdev;
+	u32 mac_h, mac_l;
+	int ret = 0;
+
+	ret = erdma_dev_attrs_init(dev);
+	if (ret)
+		goto out;
+
+	ibdev->uverbs_cmd_mask =
+		(1ull << IB_USER_VERBS_CMD_GET_CONTEXT) |
+		(1ull << IB_USER_VERBS_CMD_QUERY_DEVICE) |
+		(1ull << IB_USER_VERBS_CMD_QUERY_PORT) |
+		(1ull << IB_USER_VERBS_CMD_ALLOC_PD) |
+		(1ull << IB_USER_VERBS_CMD_DEALLOC_PD) |
+		(1ull << IB_USER_VERBS_CMD_REG_MR) |
+		(1ull << IB_USER_VERBS_CMD_DEREG_MR) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_COMP_CHANNEL) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_CQ) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_CQ) |
+		(1ull << IB_USER_VERBS_CMD_CREATE_QP) |
+		(1ull << IB_USER_VERBS_CMD_QUERY_QP) |
+		(1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+		(1ull << IB_USER_VERBS_CMD_DESTROY_QP);
+
+	ibdev->node_type = RDMA_NODE_RNIC;
+	memcpy(ibdev->node_desc, ERDMA_NODE_DESC_COMMON, sizeof(ERDMA_NODE_DESC_COMMON));
+
+	/*
+	 * Current model (one-to-one device association):
+	 * One ERDMA device per net_device or, equivalently,
+	 * per physical port.
+	 */
+	ibdev->phys_port_cnt = 1;
+	ibdev->num_comp_vectors = dev->irq_num - 1;
+
+	ib_set_device_ops(ibdev, &erdma_device_ops);
+
+	INIT_LIST_HEAD(&dev->cep_list);
+
+	spin_lock_init(&dev->lock);
+	xa_init_flags(&dev->qp_xa, XA_FLAGS_ALLOC1);
+	xa_init_flags(&dev->cq_xa, XA_FLAGS_ALLOC1);
+	dev->next_alloc_cqn = 1;
+	dev->next_alloc_qpn = 1;
+
+	ret = erdma_res_cb_init(dev);
+	if (ret)
+		goto out;
+
+	spin_lock_init(&dev->db_bitmap_lock);
+	bitmap_zero(dev->sdb_page, ERDMA_DWQE_TYPE0_CNT);
+	bitmap_zero(dev->sdb_entry, ERDMA_DWQE_TYPE1_CNT);
+
+	atomic_set(&dev->num_ctx, 0);
+	atomic_set(&dev->num_qp, 0);
+	atomic_set(&dev->num_cq, 0);
+	atomic_set(&dev->num_mr, 0);
+	atomic_set(&dev->num_pd, 0);
+	atomic_set(&dev->num_cep, 0);
+
+	mac_l = erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_L_REG);
+	mac_h = erdma_reg_read32(dev, ERDMA_REGS_NETDEV_MAC_H_REG);
+
+	dev->peer_addr[0] = (mac_h >> 8) & 0xFF;
+	dev->peer_addr[1] = mac_h & 0xFF;
+	dev->peer_addr[2] = (mac_l >> 24) & 0xFF;
+	dev->peer_addr[3] = (mac_l >> 16) & 0xFF;
+	dev->peer_addr[4] = (mac_l >> 8) & 0xFF;
+	dev->peer_addr[5] = mac_l & 0xFF;
+
+	dev_info(&pdev->dev, "associated netdev MAC address is %pM.\n",
+		 dev->peer_addr);
+
+	dev->netdev_nb.notifier_call = erdma_netdev_event;
+	dev->netdev = NULL;
+
+	ret = register_netdevice_notifier(&dev->netdev_nb);
+	if (ret)
+		goto out;
+
+	return 0;
+out:
+	erdma_res_cb_free(dev);
+	xa_destroy(&dev->qp_xa);
+	xa_destroy(&dev->cq_xa);
+
+	return ret;
+}
+
+static void erdma_ib_device_remove(struct pci_dev *pdev)
+{
+	struct erdma_dev *dev = pci_get_drvdata(pdev);
+
+	unregister_netdevice_notifier(&dev->netdev_nb);
+
+	if (dev->is_registered) {
+		erdma_device_deregister(dev);
+		dev->is_registered = 0;
+	}
+
+	erdma_res_cb_free(dev);
+	xa_destroy(&dev->qp_xa);
+	xa_destroy(&dev->cq_xa);
+}
+
+static int erdma_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+	int ret;
+
+	ret = erdma_probe_dev(pdev);
+	if (ret)
+		return ret;
+
+	ret = erdma_ib_device_add(pdev);
+	if (ret) {
+		erdma_remove_dev(pdev);
+		return ret;
+	}
+
+	return 0;
+}
+
+static void erdma_remove(struct pci_dev *pdev)
+{
+	erdma_ib_device_remove(pdev);
+	erdma_remove_dev(pdev);
+}
+
+static struct pci_driver erdma_pci_driver = {
+	.name = DRV_MODULE_NAME,
+	.id_table = erdma_pci_tbl,
+	.probe = erdma_probe,
+	.remove = erdma_remove
+};
+
+MODULE_DEVICE_TABLE(pci, erdma_pci_tbl);
+
+static int __init erdma_init_module(void)
+{
+	int ret;
+
+	erdma_debugfs_init();
+
+	ret = erdma_cm_init();
+	if (ret)
+		goto uninit_dbgfs;
+
+	ret = pci_register_driver(&erdma_pci_driver);
+	if (ret) {
+		pr_err("Couldn't register erdma driver.\n");
+		goto uninit_cm;
+	}
+
+	return ret;
+
+uninit_cm:
+	erdma_cm_exit();
+
+uninit_dbgfs:
+	erdma_debugfs_exit();
+
+	return ret;
+}
+
+static void __exit erdma_exit_module(void)
+{
+	pci_unregister_driver(&erdma_pci_driver);
+
+	erdma_cm_exit();
+	erdma_debugfs_exit();
+}
+
+module_init(erdma_init_module);
+module_exit(erdma_exit_module);
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
                   ` (8 preceding siblings ...)
  2021-12-21  2:48 ` [PATCH rdma-next 09/11] RDMA/erdma: Add the erdma module Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-21 11:57     ` kernel test robot
                     ` (2 more replies)
  2021-12-21  2:48 ` [PATCH rdma-next 11/11] RDMA/erdma: Add driver to kernel build environment Cheng Xu
  2021-12-21 13:09 ` [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Leon Romanovsky
  11 siblings, 3 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 include/uapi/rdma/erdma-abi.h | 49 +++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)
 create mode 100644 include/uapi/rdma/erdma-abi.h

diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
new file mode 100644
index 000000000000..6bcba10c1e41
--- /dev/null
+++ b/include/uapi/rdma/erdma-abi.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
+/*
+ * Copyright (c) 2020-2021, Alibaba Group.
+ */
+
+#ifndef __ERDMA_USER_H__
+#define __ERDMA_USER_H__
+
+#include <linux/types.h>
+
+#define ERDMA_ABI_VERSION       1
+
+struct erdma_ureq_create_cq {
+	u64 db_record_va;
+	u64 qbuf_va;
+	u32 qbuf_len;
+	u32 rsvd0;
+};
+
+struct erdma_uresp_create_cq {
+	u32 cq_id;
+	u32 num_cqe;
+};
+
+struct erdma_ureq_create_qp {
+	u64 db_record_va;
+	u64 qbuf_va;
+	u32 qbuf_len;
+	u32 rsvd0;
+};
+
+struct erdma_uresp_create_qp {
+	u32 qp_id;
+	u32 num_sqe;
+	u32 num_rqe;
+	u32 rq_offset;
+};
+
+struct erdma_uresp_alloc_ctx {
+	u32 dev_id;
+	u32 pad;
+	u32 sdb_type;
+	u32 sdb_offset;
+	u64 sdb;
+	u64 rdb;
+	u64 cdb;
+};
+
+#endif
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH rdma-next 11/11] RDMA/erdma: Add driver to kernel build environment
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
                   ` (9 preceding siblings ...)
  2021-12-21  2:48 ` [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions Cheng Xu
@ 2021-12-21  2:48 ` Cheng Xu
  2021-12-22  0:58     ` kernel test robot
  2021-12-21 13:09 ` [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Leon Romanovsky
  11 siblings, 1 reply; 52+ messages in thread
From: Cheng Xu @ 2021-12-21  2:48 UTC (permalink / raw)
  To: jgg, dledford; +Cc: leon, linux-rdma, KaiShen, chengyou, tonylu

Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
---
 MAINTAINERS                          |  8 ++++++++
 drivers/infiniband/Kconfig           |  1 +
 drivers/infiniband/hw/Makefile       |  1 +
 drivers/infiniband/hw/erdma/Kconfig  | 10 ++++++++++
 drivers/infiniband/hw/erdma/Makefile |  5 +++++
 5 files changed, 25 insertions(+)
 create mode 100644 drivers/infiniband/hw/erdma/Kconfig
 create mode 100644 drivers/infiniband/hw/erdma/Makefile

diff --git a/MAINTAINERS b/MAINTAINERS
index e9d484507c06..ac2b54c6439b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -722,6 +722,14 @@ S:	Maintained
 F:	Documentation/i2c/busses/i2c-ali1563.rst
 F:	drivers/i2c/busses/i2c-ali1563.c
 
+ALIBABA ELASTIC RDMA DRIVER
+M:	Cheng Xu <chengyou@linux.alibaba.com>
+M:	Kai Shen <kaishen@linux.alibaba.com>
+L:	linux-rdma@vger.kernel.org
+S:	Supported
+F:	drivers/infiniband/hw/erdma
+F:	include/uapi/rdma/erdma-abi.h
+
 ALIENWARE WMI DRIVER
 L:	Dell.Client.Kernel@dell.com
 S:	Maintained
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 33d3ce9c888e..cc6a7ff88ff3 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -92,6 +92,7 @@ source "drivers/infiniband/hw/hns/Kconfig"
 source "drivers/infiniband/hw/bnxt_re/Kconfig"
 source "drivers/infiniband/hw/hfi1/Kconfig"
 source "drivers/infiniband/hw/qedr/Kconfig"
+source "drivers/infiniband/hw/erdma/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
 source "drivers/infiniband/sw/siw/Kconfig"
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index fba0b3be903e..6b3a88046125 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -13,3 +13,4 @@ obj-$(CONFIG_INFINIBAND_HFI1)		+= hfi1/
 obj-$(CONFIG_INFINIBAND_HNS)		+= hns/
 obj-$(CONFIG_INFINIBAND_QEDR)		+= qedr/
 obj-$(CONFIG_INFINIBAND_BNXT_RE)	+= bnxt_re/
+obj-$(CONFIG_INFINIBAND_ERDMA)		+= erdma/
diff --git a/drivers/infiniband/hw/erdma/Kconfig b/drivers/infiniband/hw/erdma/Kconfig
new file mode 100644
index 000000000000..8526689fede7
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/Kconfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+config INFINIBAND_ERDMA
+	tristate "Alibaba Elastic RDMA Adapter (ERDMA) support"
+	depends on PCI_MSI && 64BIT && !CPU_BIG_ENDIAN
+	depends on INFINIBAND_ADDR_TRANS
+	depends on INFINIBAND_USER_ACCESS
+	help
+	  This is an RDMA/iWarp driver for the Alibaba Elastic RDMA Adapter (ERDMA).
+
+	  To compile this driver as a module, choose M here.
diff --git a/drivers/infiniband/hw/erdma/Makefile b/drivers/infiniband/hw/erdma/Makefile
new file mode 100644
index 000000000000..149d22a80aa6
--- /dev/null
+++ b/drivers/infiniband/hw/erdma/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_INFINIBAND_ERDMA) := erdma.o
+
+erdma-y := erdma_cm.o erdma_main.o erdma_cmdq.o erdma_debug.o erdma_verbs.o erdma_qp.o erdma_eq.o \
+	erdma_cq.o
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-21  2:48 ` [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions Cheng Xu
@ 2021-12-21 11:57     ` kernel test robot
  2021-12-22 16:14     ` kernel test robot
  2021-12-23 15:46   ` Yanjun Zhu
  2 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2021-12-21 11:57 UTC (permalink / raw)
  To: Cheng Xu, jgg, dledford
  Cc: kbuild-all, leon, linux-rdma, KaiShen, chengyou, tonylu

Hi Cheng,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on rdma/for-next]
[also build test ERROR on linus/master v5.16-rc6 next-20211220]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Cheng-Xu/Elastic-RDMA-Adapter-ERDMA-driver/20211221-105044
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-next
config: x86_64-randconfig-a004-20211220 (https://download.01.org/0day-ci/archive/20211221/202112211925.cA7D5851-lkp@intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
        # https://github.com/0day-ci/linux/commit/8bafa2877f1dd44153ce36bb8a0a0c491f990b6b
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Cheng-Xu/Elastic-RDMA-Adapter-ERDMA-driver/20211221-105044
        git checkout 8bafa2877f1dd44153ce36bb8a0a0c491f990b6b
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from <command-line>:32:
>> ./usr/include/rdma/erdma-abi.h:14:2: error: unknown type name 'u64'
      14 |  u64 db_record_va;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:15:2: error: unknown type name 'u64'
      15 |  u64 qbuf_va;
         |  ^~~
>> ./usr/include/rdma/erdma-abi.h:16:2: error: unknown type name 'u32'
      16 |  u32 qbuf_len;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:17:2: error: unknown type name 'u32'
      17 |  u32 rsvd0;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:21:2: error: unknown type name 'u32'
      21 |  u32 cq_id;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:22:2: error: unknown type name 'u32'
      22 |  u32 num_cqe;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:26:2: error: unknown type name 'u64'
      26 |  u64 db_record_va;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:27:2: error: unknown type name 'u64'
      27 |  u64 qbuf_va;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:28:2: error: unknown type name 'u32'
      28 |  u32 qbuf_len;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:29:2: error: unknown type name 'u32'
      29 |  u32 rsvd0;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:33:2: error: unknown type name 'u32'
      33 |  u32 qp_id;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:34:2: error: unknown type name 'u32'
      34 |  u32 num_sqe;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:35:2: error: unknown type name 'u32'
      35 |  u32 num_rqe;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:36:2: error: unknown type name 'u32'
      36 |  u32 rq_offset;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:40:2: error: unknown type name 'u32'
      40 |  u32 dev_id;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:41:2: error: unknown type name 'u32'
      41 |  u32 pad;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:42:2: error: unknown type name 'u32'
      42 |  u32 sdb_type;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:43:2: error: unknown type name 'u32'
      43 |  u32 sdb_offset;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:44:2: error: unknown type name 'u64'
      44 |  u64 sdb;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:45:2: error: unknown type name 'u64'
      45 |  u64 rdb;
         |  ^~~
   ./usr/include/rdma/erdma-abi.h:46:2: error: unknown type name 'u64'
      46 |  u64 cdb;
         |  ^~~
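
For reference, a minimal sketch of the likely fix: erdma-abi.h is
exported to userspace, where the kernel-internal u32/u64 typedefs do
not exist, so the UAPI structs must use the exported __u32/__u64 types
from <linux/types.h>. For example (the same substitution applies to
every struct in the header):

	#include <linux/types.h>

	struct erdma_ureq_create_cq {
		__u64 db_record_va;
		__u64 qbuf_va;
		__u32 qbuf_len;
		__u32 rsvd0;
	};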

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
                   ` (10 preceding siblings ...)
  2021-12-21  2:48 ` [PATCH rdma-next 11/11] RDMA/erdma: Add driver to kernel build environment Cheng Xu
@ 2021-12-21 13:09 ` Leon Romanovsky
  2021-12-22  3:35   ` Cheng Xu
  11 siblings, 1 reply; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-21 13:09 UTC (permalink / raw)
  To: Cheng Xu; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu

On Tue, Dec 21, 2021 at 10:48:47AM +0800, Cheng Xu wrote:
> Hello all,
> 
> This patch set introduces the Elastic RDMA Adapter (ERDMA) driver, which
> released in Apsara Conference 2021 by Alibaba.
> 
> ERDMA enables large-scale RDMA acceleration capability in Alibaba ECS
> environment, initially offered in g7re instance. It can improve the
> efficiency of large-scale distributed computing and communication
> significantly and expand dynamically with the cluster scale of Alibaba
> Cloud.
> 
> ERDMA is a RDMA networking adapter based on the Alibaba MOC hardware. It
> works in the VPC network environment (overlay network), and uses iWarp
> tranport protocol. ERDMA supports reliable connection (RC). ERDMA also
> supports both kernel space and user space verbs. Now we have already
> supported HPC/AI applications with libfabric, NoF and some other internal
> verbs libraries, such as xrdma, epsl, etc,.

We will need to get erdma provider implementation in the rdma-core too,
in order to consider to merge it.

> 
> For the ECS instance with RDMA enabled, there are two kinds of devices
> allocated, one for ERDMA, and one for the original netdev (virtio-net).
> They are different PCI deivces. ERDMA driver can get the information about
> which netdev attached to in its PCIe barspace (by MAC address matching).

This is very questionable. The netdev part should be kept in the
drivers/ethernet/... part of the kernel.

Thanks

> 
> Thanks,
> Cheng Xu
> 
> Cheng Xu (11):
>   RDMA: Add ERDMA to rdma_driver_id definition
>   RDMA/erdma: Add the hardware related definitions
>   RDMA/erdma: Add main include file
>   RDMA/erdma: Add cmdq implementation
>   RDMA/erdma: Add event queue implementation
>   RDMA/erdma: Add verbs header file
>   RDMA/erdma: Add verbs implementation
>   RDMA/erdma: Add connection management (CM) support
>   RDMA/erdma: Add the erdma module
>   RDMA/erdma: Add the ABI definitions
>   RDMA/erdma: Add driver to kernel build environment
> 
>  MAINTAINERS                               |    8 +
>  drivers/infiniband/Kconfig                |    1 +
>  drivers/infiniband/hw/Makefile            |    1 +
>  drivers/infiniband/hw/erdma/Kconfig       |   10 +
>  drivers/infiniband/hw/erdma/Makefile      |    5 +
>  drivers/infiniband/hw/erdma/erdma.h       |  381 +++++
>  drivers/infiniband/hw/erdma/erdma_cm.c    | 1585 +++++++++++++++++++++
>  drivers/infiniband/hw/erdma/erdma_cm.h    |  158 ++
>  drivers/infiniband/hw/erdma/erdma_cmdq.c  |  489 +++++++
>  drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
>  drivers/infiniband/hw/erdma/erdma_debug.c |  314 ++++
>  drivers/infiniband/hw/erdma/erdma_debug.h |   18 +
>  drivers/infiniband/hw/erdma/erdma_eq.c    |  346 +++++
>  drivers/infiniband/hw/erdma/erdma_hw.h    |  474 ++++++
>  drivers/infiniband/hw/erdma/erdma_main.c  |  711 +++++++++
>  drivers/infiniband/hw/erdma/erdma_qp.c    |  624 ++++++++
>  drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++
>  drivers/infiniband/hw/erdma/erdma_verbs.h |  366 +++++
>  include/uapi/rdma/erdma-abi.h             |   49 +
>  include/uapi/rdma/ib_user_ioctl_verbs.h   |    1 +
>  20 files changed, 7219 insertions(+)
>  create mode 100644 drivers/infiniband/hw/erdma/Kconfig
>  create mode 100644 drivers/infiniband/hw/erdma/Makefile
>  create mode 100644 drivers/infiniband/hw/erdma/erdma.h
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.h
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_cmdq.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.h
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_eq.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_hw.h
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.h
>  create mode 100644 include/uapi/rdma/erdma-abi.h
> 
> -- 
> 2.27.0
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 09/11] RDMA/erdma: Add the erdma module
  2021-12-21  2:48 ` [PATCH rdma-next 09/11] RDMA/erdma: Add the erdma module Cheng Xu
@ 2021-12-21 13:26   ` Leon Romanovsky
  2021-12-22  2:33     ` Cheng Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-21 13:26 UTC (permalink / raw)
  To: Cheng Xu; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu

On Tue, Dec 21, 2021 at 10:48:56AM +0800, Cheng Xu wrote:
> Add the main erdma module and its debugfs files. The main module provides
> the interface to the InfiniBand subsystem, and the debugfs part gives
> users a way to query the core status of the device and to select the
> preferred congestion control algorithm.

debugfs is for debug - dump various information.
It is not the right interface to set configuration properties.
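
Something along these lines would be closer to the expected interface
for the "cc" knob. An untested sketch, assuming the ->device_group hook
of ib_device_ops is available on the target kernel (all names here are
illustrative only):

	static ssize_t cc_method_show(struct device *device,
				      struct device_attribute *attr, char *buf)
	{
		struct erdma_dev *dev =
			rdma_device_to_drv_device(device, struct erdma_dev, ibdev);

		return sysfs_emit(buf, "%s\n", cc_method_string[dev->cc_method]);
	}
	static DEVICE_ATTR_RO(cc_method);

	static struct attribute *erdma_dev_attrs[] = {
		&dev_attr_cc_method.attr,
		NULL,
	};

	static const struct attribute_group erdma_dev_attr_group = {
		.attrs = erdma_dev_attrs,
	};

	/* hooked up via .device_group = &erdma_dev_attr_group in erdma_device_ops */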

> 
> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> ---
>  drivers/infiniband/hw/erdma/erdma_debug.c | 314 ++++++++++
>  drivers/infiniband/hw/erdma/erdma_debug.h |  18 +
>  drivers/infiniband/hw/erdma/erdma_main.c  | 711 ++++++++++++++++++++++
>  3 files changed, 1043 insertions(+)
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.h
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c
> 
> diff --git a/drivers/infiniband/hw/erdma/erdma_debug.c b/drivers/infiniband/hw/erdma/erdma_debug.c
> new file mode 100644
> index 000000000000..3cbed4dde0e2
> --- /dev/null
> +++ b/drivers/infiniband/hw/erdma/erdma_debug.c
> @@ -0,0 +1,314 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
> + *          Kai Shen <kaishen@linux.alibaba.com>
> + * Copyright (c) 2020-2021, Alibaba Group.
> + */
> +#include <linux/errno.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +#include <linux/debugfs.h>
> +
> +#include <rdma/iw_cm.h>
> +#include <rdma/ib_verbs.h>
> +#include <rdma/ib_smi.h>
> +#include <rdma/ib_user_verbs.h>
> +
> +#include "erdma.h"
> +#include "erdma_cm.h"
> +#include "erdma_debug.h"
> +#include "erdma_verbs.h"
> +
> +char *cc_method_string[ERDMA_CC_METHODS_NUM] = {
> +	[ERDMA_CC_NEWRENO] = "newreno",
> +	[ERDMA_CC_CUBIC] = "cubic",
> +	[ERDMA_CC_HPCC_RTT] = "hpcc_rtt",
> +	[ERDMA_CC_HPCC_ECN] = "hpcc_ecn",
> +	[ERDMA_CC_HPCC_INT] = "hpcc_int"
> +};
> +
> +static struct dentry *erdma_debugfs;
> +
> +static int erdma_dbgfs_file_open(struct inode *inode, struct file *fp)
> +{
> +	fp->private_data = inode->i_private;
> +	return nonseekable_open(inode, fp);
> +}
> +
> +static ssize_t erdma_show_stats(struct file *fp, char __user *buf, size_t space,
> +			      loff_t *ppos)
> +{
> +	struct erdma_dev *dev = fp->private_data;
> +	char *kbuf = NULL;
> +	int len = 0;
> +
> +	if (*ppos)
> +		goto out;
> +
> +	kbuf = kmalloc(space, GFP_KERNEL);
> +	if (!kbuf)
> +		goto out;
> +
> +	len = snprintf(kbuf, space, "Resource Summary of %s:\n"
> +		"%s: %d\n%s: %d\n%s: %d\n%s: %d\n%s: %d\n%s: %d\n",
> +		dev->ibdev.name,
> +		"ucontext ", atomic_read(&dev->num_ctx),
> +		"pd       ", atomic_read(&dev->num_pd),
> +		"qp       ", atomic_read(&dev->num_qp),
> +		"cq       ", atomic_read(&dev->num_cq),
> +		"mr       ", atomic_read(&dev->num_mr),

Why do you need to duplicate what restrack ("rdma res show ...") already provides?

> +		"cep      ", atomic_read(&dev->num_cep));
> +	if (len > space)
> +		len = space;
> +out:
> +	if (len)
> +		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
> +
> +	kfree(kbuf);
> +	return len;
> +
> +}
> +

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 06/11] RDMA/erdma: Add verbs header file
  2021-12-21  2:48 ` [PATCH rdma-next 06/11] RDMA/erdma: Add verbs header file Cheng Xu
@ 2021-12-21 13:28   ` Leon Romanovsky
  2021-12-22  2:36     ` Cheng Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-21 13:28 UTC (permalink / raw)
  To: Cheng Xu; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu

On Tue, Dec 21, 2021 at 10:48:53AM +0800, Cheng Xu wrote:
> This header file defines the main structures and functions used for RDMA
> Verbs, including qp, cq, mr, ucontext, etc.
> 
> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> ---
>  drivers/infiniband/hw/erdma/erdma_verbs.h | 366 ++++++++++++++++++++++
>  1 file changed, 366 insertions(+)
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.h
> 
> diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.h b/drivers/infiniband/hw/erdma/erdma_verbs.h
> new file mode 100644
> index 000000000000..6eda8843d0d5
> --- /dev/null
> +++ b/drivers/infiniband/hw/erdma/erdma_verbs.h
> @@ -0,0 +1,366 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
> + *          Kai Shen <kaishen@linux.alibaba.com>
> + * Copyright (c) 2020-2021, Alibaba Group.
> + */
> +
> +#ifndef __ERDMA_VERBS_H__
> +#define __ERDMA_VERBS_H__

<...>

> +extern int erdma_query_port(struct ib_device *dev, u32 port, struct ib_port_attr *attr);
> +extern int erdma_query_pkey(struct ib_device *dev, u32 port, u16 idx, u16 *pkey);
> +extern int erdma_query_gid(struct ib_device *dev, u32 port, int idx, union ib_gid *gid);
> +extern int erdma_alloc_pd(struct ib_pd *pd, struct ib_udata *data);
> +extern int erdma_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata);
> +extern int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
> +				   struct ib_udata *data);
> +extern int erdma_query_qp(struct ib_qp *qp, struct ib_qp_attr *attr, int mask,
> +			struct ib_qp_init_attr *init_attr);
> +extern int erdma_modify_qp(struct ib_qp *qp, struct ib_qp_attr *attr, int mask,
> +			      struct ib_udata *data);
> +extern int erdma_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata);
> +extern int erdma_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata);
> +extern int erdma_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags);
> +extern struct ib_mr *erdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
> +				      u64 virt, int access, struct ib_udata *udata);
> +extern struct ib_mr *erdma_get_dma_mr(struct ib_pd *ibpd, int rights);
> +extern int erdma_dereg_mr(struct ib_mr *mr, struct ib_udata *data);
> +extern int erdma_mmap(struct ib_ucontext *ctx, struct vm_area_struct *vma);
> +extern void erdma_qp_get_ref(struct ib_qp *qp);
> +extern void erdma_qp_put_ref(struct ib_qp *qp);
> +extern struct ib_qp *erdma_get_ibqp(struct ib_device *dev, int id);
> +extern int erdma_post_send(struct ib_qp *qp, const struct ib_send_wr *send_wr,
> +			   const struct ib_send_wr **bad_send_wr);
> +extern int erdma_post_recv(struct ib_qp *qp, const struct ib_recv_wr *recv_wr,
> +			   const struct ib_recv_wr **bad_recv_wr);
> +extern int erdma_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc);
> +extern struct ib_mr *erdma_ib_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
> +				       u32 max_num_sg);
> +extern int erdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
> +			   int sg_nents, unsigned int *sg_offset);
> +extern struct net_device *erdma_get_netdev(struct ib_device *device, u32 port_num);
> +extern void erdma_disassociate_ucontext(struct ib_ucontext *ibcontext);
> +extern void erdma_port_event(struct erdma_dev *dev, enum ib_event_type reason);

Why do you add "extern" to function declarations?

Thanks

> +
> +#endif
> -- 
> 2.27.0
> 


* Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation
  2021-12-21  2:48 ` [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation Cheng Xu
@ 2021-12-21 13:32   ` Leon Romanovsky
  2021-12-21 15:20     ` Bernard Metzler
  2021-12-22  2:50     ` Cheng Xu
  0 siblings, 2 replies; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-21 13:32 UTC (permalink / raw)
  To: Cheng Xu; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu

On Tue, Dec 21, 2021 at 10:48:54AM +0800, Cheng Xu wrote:
> The RDMA verbs implementation of erdma is divided into three files:
> erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internally used functions and
> datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c; the rest
> is in erdma_verbs.c.
> 
> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> ---
>  drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
>  drivers/infiniband/hw/erdma/erdma_qp.c    |  624 +++++++++
>  drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++++
>  3 files changed, 2302 insertions(+)
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
>  create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c


Please no inline functions in .c files and no void casting for the
return values of functions.

<...>

> diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c b/drivers/infiniband/hw/erdma/erdma_qp.c
> new file mode 100644
> index 000000000000..8c02215cee04
> --- /dev/null
> +++ b/drivers/infiniband/hw/erdma/erdma_qp.c
> @@ -0,0 +1,624 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
> + *          Kai Shen <kaishen@linux.alibaba.com>
> + * Copyright (c) 2020-2021, Alibaba Group.
> + *
> + * Authors: Bernard Metzler <bmt@zurich.ibm.com>
> + *          Fredy Neeser <nfd@zurich.ibm.com>
> + * Copyright (c) 2008-2016, IBM Corporation

What does it mean?

Thanks


* RE: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation
  2021-12-21 13:32   ` Leon Romanovsky
@ 2021-12-21 15:20     ` Bernard Metzler
  2021-12-22  3:11       ` Cheng Xu
  2021-12-22  2:50     ` Cheng Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Bernard Metzler @ 2021-12-21 15:20 UTC (permalink / raw)
  To: Leon Romanovsky, Cheng Xu; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu

> -----Original Message-----
> From: Leon Romanovsky <leon@kernel.org>
> Sent: Tuesday, 21 December 2021 14:32
> To: Cheng Xu <chengyou@linux.alibaba.com>
> Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org;
> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
> Subject: [EXTERNAL] Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs
> implementation
> 
> On Tue, Dec 21, 2021 at 10:48:54AM +0800, Cheng Xu wrote:
> > The RDMA verbs implementation of erdma is divided into three files:
> > erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internally used functions and
> > datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c; the
> > rest is in erdma_verbs.c.
> >
> > Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> > ---
> >  drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
> >  drivers/infiniband/hw/erdma/erdma_qp.c    |  624 +++++++++
> >  drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++++
> >  3 files changed, 2302 insertions(+)
> >  create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
> >  create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
> >  create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
> 
> 
> Please no inline functions in .c files and no void casting for the
> return values of functions.
> 
> <...>
> 
> > diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c
> b/drivers/infiniband/hw/erdma/erdma_qp.c
> > new file mode 100644
> > index 000000000000..8c02215cee04
> > --- /dev/null
> > +++ b/drivers/infiniband/hw/erdma/erdma_qp.c
> > @@ -0,0 +1,624 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
> > + *          Kai Shen <kaishen@linux.alibaba.com>
> > + * Copyright (c) 2020-2021, Alibaba Group.
> > + *
> > + * Authors: Bernard Metzler <bmt@zurich.ibm.com>
> > + *          Fredy Neeser <nfd@zurich.ibm.com>
> > + * Copyright (c) 2008-2016, IBM Corporation
> 
> What does it mean?
> 

Significant parts of the driver have been taken from siw it seems.
Probably really from an old version of it.
In that case I would have recommended to take the upstream siw
code, which has been cleaned from those issues we now see again
(including debugfs code, extern definitions, inline in .c code,
casting issues, etc etc.). Why starting in 2020 with
code from 2016, if better code is available?

Bernard.


* Re: [PATCH rdma-next 11/11] RDMA/erdma: Add driver to kernel build environment
  2021-12-21  2:48 ` [PATCH rdma-next 11/11] RDMA/erdma: Add driver to kernel build environment Cheng Xu
@ 2021-12-22  0:58     ` kernel test robot
  0 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2021-12-22  0:58 UTC (permalink / raw)
  To: Cheng Xu, jgg, dledford
  Cc: kbuild-all, leon, linux-rdma, KaiShen, chengyou, tonylu

Hi Cheng,

I love your patch! Perhaps something to improve:

[auto build test WARNING on rdma/for-next]
[also build test WARNING on linus/master v5.16-rc6 next-20211221]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Cheng-Xu/Elastic-RDMA-Adapter-ERDMA-driver/20211221-105044
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-next
config: ia64-allyesconfig (https://download.01.org/0day-ci/archive/20211222/202112220838.tXmQUWZb-lkp@intel.com/config)
compiler: ia64-linux-gcc (GCC) 11.2.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/3b0243fb79f0f12a5b5c020c6f26c82de2c3c57e
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Cheng-Xu/Elastic-RDMA-Adapter-ERDMA-driver/20211221-105044
        git checkout 3b0243fb79f0f12a5b5c020c6f26c82de2c3c57e
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross O=build_dir ARCH=ia64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/infiniband/hw/erdma/erdma_cm.c: In function 'erdma_cep_set_inuse':
>> drivers/infiniband/hw/erdma/erdma_cm.c:173:13: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
     173 |         int ret;
         |             ^~~
   drivers/infiniband/hw/erdma/erdma_cm.c: In function 'erdma_cm_llp_state_change':
>> drivers/infiniband/hw/erdma/erdma_cm.c:1077:24: warning: variable 's' set but not used [-Wunused-but-set-variable]
    1077 |         struct socket *s;
         |                        ^
   drivers/infiniband/hw/erdma/erdma_cm.c: In function 'erdma_create_listen':
>> drivers/infiniband/hw/erdma/erdma_cm.c:1456:28: warning: variable 'r_ip' set but not used [-Wunused-but-set-variable]
    1456 |                 u8 *l_ip, *r_ip;
         |                            ^~~~
>> drivers/infiniband/hw/erdma/erdma_cm.c:1456:21: warning: variable 'l_ip' set but not used [-Wunused-but-set-variable]
    1456 |                 u8 *l_ip, *r_ip;
         |                     ^~~~
--
   drivers/infiniband/hw/erdma/erdma_main.c: In function 'erdma_probe_dev':
>> drivers/infiniband/hw/erdma/erdma_main.c:294:27: warning: variable 'ibdev' set but not used [-Wunused-but-set-variable]
     294 |         struct ib_device *ibdev;
         |                           ^~~~~
   drivers/infiniband/hw/erdma/erdma_main.c: At top level:
>> drivers/infiniband/hw/erdma/erdma_main.c:464:5: warning: no previous prototype for 'erdma_res_cb_init' [-Wmissing-prototypes]
     464 | int erdma_res_cb_init(struct erdma_dev *dev)
         |     ^~~~~~~~~~~~~~~~~
>> drivers/infiniband/hw/erdma/erdma_main.c:481:6: warning: no previous prototype for 'erdma_res_cb_free' [-Wmissing-prototypes]
     481 | void erdma_res_cb_free(struct erdma_dev *dev)
         |      ^~~~~~~~~~~~~~~~~
--
>> drivers/infiniband/hw/erdma/erdma_eq.c:169:6: warning: no previous prototype for 'erdma_intr_ceq_task' [-Wmissing-prototypes]
     169 | void erdma_intr_ceq_task(unsigned long data)
         |      ^~~~~~~~~~~~~~~~~~~


vim +/ret +173 drivers/infiniband/hw/erdma/erdma_cm.c

1d17ac4bdb13af Cheng Xu 2021-12-21  169  
1d17ac4bdb13af Cheng Xu 2021-12-21  170  static void erdma_cep_set_inuse(struct erdma_cep *cep)
1d17ac4bdb13af Cheng Xu 2021-12-21  171  {
1d17ac4bdb13af Cheng Xu 2021-12-21  172  	unsigned long flags;
1d17ac4bdb13af Cheng Xu 2021-12-21 @173  	int ret;
1d17ac4bdb13af Cheng Xu 2021-12-21  174  retry:
1d17ac4bdb13af Cheng Xu 2021-12-21  175  	spin_lock_irqsave(&cep->lock, flags);
1d17ac4bdb13af Cheng Xu 2021-12-21  176  
1d17ac4bdb13af Cheng Xu 2021-12-21  177  	if (cep->in_use) {
1d17ac4bdb13af Cheng Xu 2021-12-21  178  		spin_unlock_irqrestore(&cep->lock, flags);
1d17ac4bdb13af Cheng Xu 2021-12-21  179  		ret = wait_event_interruptible(cep->waitq, !cep->in_use);
1d17ac4bdb13af Cheng Xu 2021-12-21  180  		if (signal_pending(current))
1d17ac4bdb13af Cheng Xu 2021-12-21  181  			flush_signals(current);
1d17ac4bdb13af Cheng Xu 2021-12-21  182  		goto retry;
1d17ac4bdb13af Cheng Xu 2021-12-21  183  	} else {
1d17ac4bdb13af Cheng Xu 2021-12-21  184  		cep->in_use = 1;
1d17ac4bdb13af Cheng Xu 2021-12-21  185  		spin_unlock_irqrestore(&cep->lock, flags);
1d17ac4bdb13af Cheng Xu 2021-12-21  186  	}
1d17ac4bdb13af Cheng Xu 2021-12-21  187  }
1d17ac4bdb13af Cheng Xu 2021-12-21  188  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org



* Re: [PATCH rdma-next 09/11] RDMA/erdma: Add the erdma module
  2021-12-21 13:26   ` Leon Romanovsky
@ 2021-12-22  2:33     ` Cheng Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-22  2:33 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/21/21 9:26 PM, Leon Romanovsky wrote:
> On Tue, Dec 21, 2021 at 10:48:56AM +0800, Cheng Xu wrote:
>> Add the main erdma module and debugfs files. The main module provides
>> the interface to the infiniband subsystem, and the debugfs module
>> provides a way to allow the user to get the core status of the device
>> and set the preferred congestion control algorithm.
> 
> debugfs is for debug - dump various information.
> It is not the right interface to set configuration properties.

I agree. At first we wanted to implement the 'device_group' interface,
but it is not recommended for new drivers, and we found that the current
netlink commands do not meet our requirement (maybe we missed something).
So we use debugfs as the CC configuration interface temporarily. It would
be helpful if you could give us some suggestions.

>>
>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>> ---
>>   drivers/infiniband/hw/erdma/erdma_debug.c | 314 ++++++++++
>>   drivers/infiniband/hw/erdma/erdma_debug.h |  18 +
>>   drivers/infiniband/hw/erdma/erdma_main.c  | 711 ++++++++++++++++++++++
>>   3 files changed, 1043 insertions(+)
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.h
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c
>>
>> diff --git a/drivers/infiniband/hw/erdma/erdma_debug.c b/drivers/infiniband/hw/erdma/erdma_debug.c
>> new file mode 100644
>> index 000000000000..3cbed4dde0e2
>> --- /dev/null
>> +++ b/drivers/infiniband/hw/erdma/erdma_debug.c
>> @@ -0,0 +1,314 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
>> + *          Kai Shen <kaishen@linux.alibaba.com>
>> + * Copyright (c) 2020-2021, Alibaba Group.
>> + */
>> +#include <linux/errno.h>
>> +#include <linux/types.h>
>> +#include <linux/list.h>
>> +#include <linux/debugfs.h>
>> +
>> +#include <rdma/iw_cm.h>
>> +#include <rdma/ib_verbs.h>
>> +#include <rdma/ib_smi.h>
>> +#include <rdma/ib_user_verbs.h>
>> +
>> +#include "erdma.h"
>> +#include "erdma_cm.h"
>> +#include "erdma_debug.h"
>> +#include "erdma_verbs.h"
>> +
>> +char *cc_method_string[ERDMA_CC_METHODS_NUM] = {
>> +	[ERDMA_CC_NEWRENO] = "newreno",
>> +	[ERDMA_CC_CUBIC] = "cubic",
>> +	[ERDMA_CC_HPCC_RTT] = "hpcc_rtt",
>> +	[ERDMA_CC_HPCC_ECN] = "hpcc_ecn",
>> +	[ERDMA_CC_HPCC_INT] = "hpcc_int"
>> +};
>> +
>> +static struct dentry *erdma_debugfs;
>> +
>> +
>> +static int erdma_dbgfs_file_open(struct inode *inode, struct file *fp)
>> +{
>> +	fp->private_data = inode->i_private;
>> +	return nonseekable_open(inode, fp);
>> +}
>> +
>> +static ssize_t erdma_show_stats(struct file *fp, char __user *buf, size_t space,
>> +			      loff_t *ppos)
>> +{
>> +	struct erdma_dev *dev = fp->private_data;
>> +	char *kbuf = NULL;
>> +	int len = 0;
>> +
>> +	if (*ppos)
>> +		goto out;
>> +
>> +	kbuf = kmalloc(space, GFP_KERNEL);
>> +	if (!kbuf)
>> +		goto out;
>> +
>> +	len = snprintf(kbuf, space, "Resource Summary of %s:\n"
>> +		"%s: %d\n%s: %d\n%s: %d\n%s: %d\n%s: %d\n%s: %d\n",
>> +		dev->ibdev.name,
>> +		"ucontext ", atomic_read(&dev->num_ctx),
>> +		"pd       ", atomic_read(&dev->num_pd),
>> +		"qp       ", atomic_read(&dev->num_qp),
>> +		"cq       ", atomic_read(&dev->num_cq),
>> +		"mr       ", atomic_read(&dev->num_mr),
> 
> Why do you need to duplicate "restrack res ..."?

We will remove this unnecessary code.

Thanks,
Cheng Xu

> 
>> +		"cep      ", atomic_read(&dev->num_cep));
>> +	if (len > space)
>> +		len = space;
>> +out:
>> +	if (len)
>> +		len = simple_read_from_buffer(buf, len, ppos, kbuf, len);
>> +
>> +	kfree(kbuf);
>> +	return len;
>> +
>> +}
>> +
> 
> Thanks


* Re: [PATCH rdma-next 06/11] RDMA/erdma: Add verbs header file
  2021-12-21 13:28   ` Leon Romanovsky
@ 2021-12-22  2:36     ` Cheng Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-22  2:36 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/21/21 9:28 PM, Leon Romanovsky wrote:
> On Tue, Dec 21, 2021 at 10:48:53AM +0800, Cheng Xu wrote:
>> This header file defines the main structures and functions used for RDMA
>> Verbs, including QP, CQ, MR, ucontext, etc.
>>
>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>> ---
>>   drivers/infiniband/hw/erdma/erdma_verbs.h | 366 ++++++++++++++++++++++
>>   1 file changed, 366 insertions(+)
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.h
>>
>> diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.h b/drivers/infiniband/hw/erdma/erdma_verbs.h
>> new file mode 100644
>> index 000000000000..6eda8843d0d5
>> --- /dev/null
>> +++ b/drivers/infiniband/hw/erdma/erdma_verbs.h
>> @@ -0,0 +1,366 @@
>> +/* SPDX-License-Identifier: GPL-2.0
>> + *
>> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
>> + *          Kai Shen <kaishen@linux.alibaba.com>
>> + * Copyright (c) 2020-2021, Alibaba Group.
>> + */
>> +
>> +#ifndef __ERDMA_VERBS_H__
>> +#define __ERDMA_VERBS_H__
> 
> <...>
> 
>> +extern int erdma_query_port(struct ib_device *dev, u32 port, struct ib_port_attr *attr);
>> +extern int erdma_query_pkey(struct ib_device *dev, u32 port, u16 idx, u16 *pkey);
>> +extern int erdma_query_gid(struct ib_device *dev, u32 port, int idx, union ib_gid *gid);
>> +extern int erdma_alloc_pd(struct ib_pd *pd, struct ib_udata *data);
>> +extern int erdma_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata);
>> +extern int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
>> +				   struct ib_udata *data);
>> +extern int erdma_query_qp(struct ib_qp *qp, struct ib_qp_attr *attr, int mask,
>> +			struct ib_qp_init_attr *init_attr);
>> +extern int erdma_modify_qp(struct ib_qp *qp, struct ib_qp_attr *attr, int mask,
>> +			      struct ib_udata *data);
>> +extern int erdma_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata);
>> +extern int erdma_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata);
>> +extern int erdma_req_notify_cq(struct ib_cq *cq, enum ib_cq_notify_flags flags);
>> +extern struct ib_mr *erdma_reg_user_mr(struct ib_pd *ibpd, u64 start, u64 len,
>> +				      u64 virt, int access, struct ib_udata *udata);
>> +extern struct ib_mr *erdma_get_dma_mr(struct ib_pd *ibpd, int rights);
>> +extern int erdma_dereg_mr(struct ib_mr *mr, struct ib_udata *data);
>> +extern int erdma_mmap(struct ib_ucontext *ctx, struct vm_area_struct *vma);
>> +extern void erdma_qp_get_ref(struct ib_qp *qp);
>> +extern void erdma_qp_put_ref(struct ib_qp *qp);
>> +extern struct ib_qp *erdma_get_ibqp(struct ib_device *dev, int id);
>> +extern int erdma_post_send(struct ib_qp *qp, const struct ib_send_wr *send_wr,
>> +			   const struct ib_send_wr **bad_send_wr);
>> +extern int erdma_post_recv(struct ib_qp *qp, const struct ib_recv_wr *recv_wr,
>> +			   const struct ib_recv_wr **bad_recv_wr);
>> +extern int erdma_poll_cq(struct ib_cq *cq, int num_entries, struct ib_wc *wc);
>> +extern struct ib_mr *erdma_ib_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
>> +				       u32 max_num_sg);
>> +extern int erdma_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
>> +			   int sg_nents, unsigned int *sg_offset);
>> +extern struct net_device *erdma_get_netdev(struct ib_device *device, u32 port_num);
>> +extern void erdma_disassociate_ucontext(struct ib_ucontext *ibcontext);
>> +extern void erdma_port_event(struct erdma_dev *dev, enum ib_event_type reason);
> 
> Why do you add "extern" to function declarations?
> 
> Thanks
> 

We misunderstood the usage of "extern", and will fix it.

Thanks,
Cheng Xu

>> +
>> +#endif
>> -- 
>> 2.27.0
>>


* Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation
  2021-12-21 13:32   ` Leon Romanovsky
  2021-12-21 15:20     ` Bernard Metzler
@ 2021-12-22  2:50     ` Cheng Xu
  1 sibling, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-22  2:50 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/21/21 9:32 PM, Leon Romanovsky wrote:
> On Tue, Dec 21, 2021 at 10:48:54AM +0800, Cheng Xu wrote:
>> The RDMA verbs implementation of erdma is divided into three files:
>> erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internally used functions and
>> datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c; the rest
>> is in erdma_verbs.c.
>>
>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>> ---
>>   drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
>>   drivers/infiniband/hw/erdma/erdma_qp.c    |  624 +++++++++
>>   drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++++
>>   3 files changed, 2302 insertions(+)
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
> 
> 
> Please no inline functions in .c files and no void casting for the
> return values of functions.

Will fix it.

> <...>
> 
>> diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c b/drivers/infiniband/hw/erdma/erdma_qp.c
>> new file mode 100644
>> index 000000000000..8c02215cee04
>> --- /dev/null
>> +++ b/drivers/infiniband/hw/erdma/erdma_qp.c
>> @@ -0,0 +1,624 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
>> + *          Kai Shen <kaishen@linux.alibaba.com>
>> + * Copyright (c) 2020-2021, Alibaba Group.
>> + *
>> + * Authors: Bernard Metzler <bmt@zurich.ibm.com>
>> + *          Fredy Neeser <nfd@zurich.ibm.com>
>> + * Copyright (c) 2008-2016, IBM Corporation
> 
> What does it mean?
> 
> Thanks

As mentioned in patch 08, parts of our code come from siw with some
modifications. In "erdma_qp.c" and "erdma_verbs.c", the code related to
the CM module is also based on siw, mainly the QP state machine
implementation. So we keep the original authors and copyright
information in those files.

Thanks,
Cheng Xu


* Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation
  2021-12-21 15:20     ` Bernard Metzler
@ 2021-12-22  3:11       ` Cheng Xu
  2021-12-22  4:18         ` Cheng Xu
  2021-12-22 12:46         ` Bernard Metzler
  0 siblings, 2 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-22  3:11 UTC (permalink / raw)
  To: Bernard Metzler, Leon Romanovsky
  Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/21/21 11:20 PM, Bernard Metzler wrote:
>> -----Original Message-----
>> From: Leon Romanovsky <leon@kernel.org>
>> Sent: Tuesday, 21 December 2021 14:32
>> To: Cheng Xu <chengyou@linux.alibaba.com>
>> Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org;
>> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
>> Subject: [EXTERNAL] Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs
>> implementation
>>
>> On Tue, Dec 21, 2021 at 10:48:54AM +0800, Cheng Xu wrote:
>>> The RDMA verbs implementation of erdma is divided into three files:
>>> erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internally used functions and
>>> datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c; the
>>> rest is in erdma_verbs.c.
>>>
>>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>>> ---
>>>   drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
>>>   drivers/infiniband/hw/erdma/erdma_qp.c    |  624 +++++++++
>>>   drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++++
>>>   3 files changed, 2302 insertions(+)
>>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
>>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
>>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
>>
>>
>> Please no inline functions in .c files and no void casting for the
>> return values of functions.
>>
>> <...>
>>
>>> diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c
>> b/drivers/infiniband/hw/erdma/erdma_qp.c
>>> new file mode 100644
>>> index 000000000000..8c02215cee04
>>> --- /dev/null
>>> +++ b/drivers/infiniband/hw/erdma/erdma_qp.c
>>> @@ -0,0 +1,624 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
>>> + *          Kai Shen <kaishen@linux.alibaba.com>
>>> + * Copyright (c) 2020-2021, Alibaba Group.
>>> + *
>>> + * Authors: Bernard Metzler <bmt@zurich.ibm.com>
>>> + *          Fredy Neeser <nfd@zurich.ibm.com>
>>> + * Copyright (c) 2008-2016, IBM Corporation
>>
>> What does it mean?
>>
> 
> Significant parts of the driver have been taken from siw it seems.
> Probably really from an old version of it.
> In that case I would have recommended to take the upstream siw
> code, which has been cleaned from those issues we now see again
> (including debugfs code, extern definitions, inline in .c code,
> casting issues, etc etc.). Why starting in 2020 with
> code from 2016, if better code is available?
> 
> Bernard.

First of all, thank you for developing siw, Bernard and Fredy, so we
can build our erdma based on your work.
At the beginning, we started developing the erdma driver on kernels
4.9/4.19/5.10, and didn't know about the upstream siw version since it
is only in newer kernels. As a result, we developed erdma based on the
older version.
Thank you for your recommendation. We will check the differences and
take the upstream siw code if needed.

Thanks,
Cheng Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-21 13:09 ` [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Leon Romanovsky
@ 2021-12-22  3:35   ` Cheng Xu
  2021-12-23 10:23     ` Leon Romanovsky
  0 siblings, 1 reply; 52+ messages in thread
From: Cheng Xu @ 2021-12-22  3:35 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/21/21 9:09 PM, Leon Romanovsky wrote:
> On Tue, Dec 21, 2021 at 10:48:47AM +0800, Cheng Xu wrote:
>> Hello all,
>>
>> This patch set introduces the Elastic RDMA Adapter (ERDMA) driver, which
>> was released at the Apsara Conference 2021 by Alibaba.
>>
>> ERDMA enables large-scale RDMA acceleration capability in the Alibaba ECS
>> environment, initially offered in the g7re instance. It can significantly
>> improve the efficiency of large-scale distributed computing and
>> communication, and it expands dynamically with the cluster scale of
>> Alibaba Cloud.
>>
>> ERDMA is an RDMA networking adapter based on the Alibaba MOC hardware. It
>> works in the VPC network environment (overlay network) and uses the iWarp
>> transport protocol. ERDMA supports reliable connection (RC). ERDMA also
>> supports both kernel-space and user-space verbs. We have already
>> supported HPC/AI applications with libfabric, NoF and some other internal
>> verbs libraries, such as xrdma, epsl, etc.
> 
> We will need to get erdma provider implementation in the rdma-core too,
> in order to consider to merge it.

Sure, I will submit the erdma userspace provider implementation within
2 days.

>>
>> For the ECS instance with RDMA enabled, there are two kinds of devices
>> allocated, one for ERDMA, and one for the original netdev (virtio-net).
>> They are different PCI devices. The ERDMA driver can get the information
>> about which netdev it is attached to from its PCIe BAR space (by MAC
>> address matching).
> 
> This is very questionable. The netdev part should be kept in the
> drivers/ethernet/... part of the kernel.
> 
> Thanks

The net device used in an Alibaba ECS instance is a virtio-net device,
driven by the virtio-pci/virtio-net drivers. The ERDMA device does not
need its own net device, and will be attached to an existing virtio-net
device. The relationship between ibdev and netdev in erdma is similar
to siw/rxe.

>>
>> Thanks,
>> Cheng Xu
>>
>> Cheng Xu (11):
>>    RDMA: Add ERDMA to rdma_driver_id definition
>>    RDMA/erdma: Add the hardware related definitions
>>    RDMA/erdma: Add main include file
>>    RDMA/erdma: Add cmdq implementation
>>    RDMA/erdma: Add event queue implementation
>>    RDMA/erdma: Add verbs header file
>>    RDMA/erdma: Add verbs implementation
>>    RDMA/erdma: Add connection management (CM) support
>>    RDMA/erdma: Add the erdma module
>>    RDMA/erdma: Add the ABI definitions
>>    RDMA/erdma: Add driver to kernel build environment
>>
>>   MAINTAINERS                               |    8 +
>>   drivers/infiniband/Kconfig                |    1 +
>>   drivers/infiniband/hw/Makefile            |    1 +
>>   drivers/infiniband/hw/erdma/Kconfig       |   10 +
>>   drivers/infiniband/hw/erdma/Makefile      |    5 +
>>   drivers/infiniband/hw/erdma/erdma.h       |  381 +++++
>>   drivers/infiniband/hw/erdma/erdma_cm.c    | 1585 +++++++++++++++++++++
>>   drivers/infiniband/hw/erdma/erdma_cm.h    |  158 ++
>>   drivers/infiniband/hw/erdma/erdma_cmdq.c  |  489 +++++++
>>   drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
>>   drivers/infiniband/hw/erdma/erdma_debug.c |  314 ++++
>>   drivers/infiniband/hw/erdma/erdma_debug.h |   18 +
>>   drivers/infiniband/hw/erdma/erdma_eq.c    |  346 +++++
>>   drivers/infiniband/hw/erdma/erdma_hw.h    |  474 ++++++
>>   drivers/infiniband/hw/erdma/erdma_main.c  |  711 +++++++++
>>   drivers/infiniband/hw/erdma/erdma_qp.c    |  624 ++++++++
>>   drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 +++++++++++++++++++
>>   drivers/infiniband/hw/erdma/erdma_verbs.h |  366 +++++
>>   include/uapi/rdma/erdma-abi.h             |   49 +
>>   include/uapi/rdma/ib_user_ioctl_verbs.h   |    1 +
>>   20 files changed, 7219 insertions(+)
>>   create mode 100644 drivers/infiniband/hw/erdma/Kconfig
>>   create mode 100644 drivers/infiniband/hw/erdma/Makefile
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma.h
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_cm.h
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_cmdq.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_debug.h
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_eq.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_hw.h
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_main.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.h
>>   create mode 100644 include/uapi/rdma/erdma-abi.h
>>
>> -- 
>> 2.27.0
>>


* Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation
  2021-12-22  3:11       ` Cheng Xu
@ 2021-12-22  4:18         ` Cheng Xu
  2021-12-22 12:46         ` Bernard Metzler
  1 sibling, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-22  4:18 UTC (permalink / raw)
  To: Bernard Metzler, Leon Romanovsky
  Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/22/21 11:11 AM, Cheng Xu wrote:
> 
> 
> On 12/21/21 11:20 PM, Bernard Metzler wrote:
>>> -----Original Message-----
>>> From: Leon Romanovsky <leon@kernel.org>
>>> Sent: Tuesday, 21 December 2021 14:32
>>> To: Cheng Xu <chengyou@linux.alibaba.com>
>>> Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org;
>>> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
>>> Subject: [EXTERNAL] Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs
>>> implementation
>>>
>>> On Tue, Dec 21, 2021 at 10:48:54AM +0800, Cheng Xu wrote:
>>>> The RDMA verbs implementation of erdma is divided into three files:
>>>> erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internally used functions and
>>>> datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c; the
>>>> rest is in erdma_verbs.c.
>>>>
>>>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>>>> ---
>>>>   drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
>>>>   drivers/infiniband/hw/erdma/erdma_qp.c    |  624 +++++++++
>>>>   drivers/infiniband/hw/erdma/erdma_verbs.c | 1477 
>>>> +++++++++++++++++++++
>>>>   3 files changed, 2302 insertions(+)
>>>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
>>>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
>>>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
>>>
>>>
>>> Please no inline functions in .c files and no void casting for the
>>> return values of functions.
>>>
>>> <...>
>>>
>>>> diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c
>>> b/drivers/infiniband/hw/erdma/erdma_qp.c
>>>> new file mode 100644
>>>> index 000000000000..8c02215cee04
>>>> --- /dev/null
>>>> +++ b/drivers/infiniband/hw/erdma/erdma_qp.c
>>>> @@ -0,0 +1,624 @@
>>>> +// SPDX-License-Identifier: GPL-2.0
>>>> +/*
>>>> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
>>>> + *          Kai Shen <kaishen@linux.alibaba.com>
>>>> + * Copyright (c) 2020-2021, Alibaba Group.
>>>> + *
>>>> + * Authors: Bernard Metzler <bmt@zurich.ibm.com>
>>>> + *          Fredy Neeser <nfd@zurich.ibm.com>
>>>> + * Copyright (c) 2008-2016, IBM Corporation
>>>
>>> What does it mean?
>>>
>>
>> Significant parts of the driver have been taken from siw it seems.
>> Probably really from an old version of it.
>> In that case I would have recommended to take the upstream siw
>> code, which has been cleaned from those issues we now see again
>> (including debugfs code, extern definitions, inline in .c code,
>> casting issues, etc etc.). Why starting in 2020 with
>> code from 2016, if better code is available?
>>
>> Bernard.
> 
> First of all, thank you for developing siw, Bernard and Fredy, so we
> can build our erdma based on your work.
> At the beginning, we started developing the erdma driver on kernels
> 4.9/4.19/5.10, and didn't know about the upstream siw version since it

Correction: at first we developed against kernels 4.9/4.19; support
for kernel 5.10 was added later.

Thanks,
Cheng Xu

> is only in newer kernels. As a result, we developed erdma based on the
> older version.
> Thank you for your recommendation. We will check the differences and
> take the upstream siw code if needed.
> 
> Thanks,
> Cheng Xu


* RE: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation
  2021-12-22  3:11       ` Cheng Xu
  2021-12-22  4:18         ` Cheng Xu
@ 2021-12-22 12:46         ` Bernard Metzler
  2021-12-23  8:38           ` Cheng Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Bernard Metzler @ 2021-12-22 12:46 UTC (permalink / raw)
  To: Cheng Xu, Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu

> -----Original Message-----
> From: Cheng Xu <chengyou@linux.alibaba.com>
> Sent: Wednesday, 22 December 2021 04:11
> To: Bernard Metzler <BMT@zurich.ibm.com>; Leon Romanovsky
> <leon@kernel.org>
> Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org;
> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
> Subject: [EXTERNAL] Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs
> implementation
> 
> 
> 
> On 12/21/21 11:20 PM, Bernard Metzler wrote:
> >> -----Original Message-----
> >> From: Leon Romanovsky <leon@kernel.org>
> >> Sent: Tuesday, 21 December 2021 14:32
> >> To: Cheng Xu <chengyou@linux.alibaba.com>
> >> Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org;
> >> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
> >> Subject: [EXTERNAL] Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs
> >> implementation
> >>
> >> On Tue, Dec 21, 2021 at 10:48:54AM +0800, Cheng Xu wrote:
> >>> The RDMA verbs implementation of erdma is divided into three files:
> >>> erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internally used functions and
> >>> datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c; the
> >>> rest is in erdma_verbs.c.
> >>>
> >>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> >>> ---
> >>>   drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
> >>>   drivers/infiniband/hw/erdma/erdma_qp.c    |  624 +++++++++
> >>>   drivers/infiniband/hw/erdma/erdma_verbs.c | 1477
> +++++++++++++++++++++
> >>>   3 files changed, 2302 insertions(+)
> >>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
> >>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
> >>>   create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
> >>
> >>
> >> Please no inline functions in .c files and no void casting for the
> >> return values of functions.
> >>
> >> <...>
> >>
> >>> diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c
> >> b/drivers/infiniband/hw/erdma/erdma_qp.c
> >>> new file mode 100644
> >>> index 000000000000..8c02215cee04
> >>> --- /dev/null
> >>> +++ b/drivers/infiniband/hw/erdma/erdma_qp.c
> >>> @@ -0,0 +1,624 @@
> >>> +// SPDX-License-Identifier: GPL-2.0
> >>> +/*
> >>> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
> >>> + *          Kai Shen <kaishen@linux.alibaba.com>
> >>> + * Copyright (c) 2020-2021, Alibaba Group.
> >>> + *
> >>> + * Authors: Bernard Metzler <bmt@zurich.ibm.com>
> >>> + *          Fredy Neeser <nfd@zurich.ibm.com>
> >>> + * Copyright (c) 2008-2016, IBM Corporation
> >>
> >> What does it mean?
> >>
> >
> > Significant parts of the driver have been taken from siw it seems.
> > Probably really from an old version of it.
> > In that case I would have recommended to take the upstream siw
> > code, which has been cleaned from those issues we now see again
> > (including debugfs code, extern definitions, inline in .c code,
> > casting issues, etc etc.). Why starting in 2020 with
> > code from 2016, if better code is available?
> >
> > Bernard.
> 
> First of all, thank you for developing siw, Bernard and Fredy, so we
> can build our erdma based on your work.

You are welcome.
You probably got the code from https://github.com/zrlio/softiwarp
where I stopped pushing updates 4 years ago. By then, I started working
on making it acceptable for upstream. As said, I highly recommend taking
it from there, since the community already invested time and effort to
make the code better, and finally acceptable. If you do so, please also
update the copyright notice.
Fredy hasn't been part of it for almost 10 years, and is not reachable
via the email provided. And, by 2016, his contributions were limited to
the siw_cm.c code only.



> At the beginning, we started developing the erdma driver on kernels
> 4.9/4.19/5.10, and didn't know about the upstream siw version since it


siw is in the Linux kernel since v5.3

> is only in newer kernels. As a result, we developed erdma based on the
> older version.
> Thank you for your recommendation. We will check the differences and
> take the upstream siw code if needed.
> 
> Thanks,
> Cheng Xu


* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-21  2:48 ` [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions Cheng Xu
@ 2021-12-22 16:14     ` kernel test robot
  2021-12-23 15:46   ` Yanjun Zhu
  2 siblings, 0 replies; 52+ messages in thread
From: kernel test robot @ 2021-12-22 16:14 UTC (permalink / raw)
  To: Cheng Xu, jgg, dledford
  Cc: llvm, kbuild-all, leon, linux-rdma, KaiShen, chengyou, tonylu

Hi Cheng,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on rdma/for-next]
[also build test ERROR on linus/master v5.16-rc6 next-20211222]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Cheng-Xu/Elastic-RDMA-Adapter-ERDMA-driver/20211221-105044
base:   https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git for-next
config: i386-randconfig-a014-20211220 (https://download.01.org/0day-ci/archive/20211223/202112230027.47XqoqUH-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project 555eacf75f21cd1dfc6363d73ad187b730349543)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/8bafa2877f1dd44153ce36bb8a0a0c491f990b6b
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Cheng-Xu/Elastic-RDMA-Adapter-ERDMA-driver/20211221-105044
        git checkout 8bafa2877f1dd44153ce36bb8a0a0c491f990b6b
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from <built-in>:1:
>> ./usr/include/rdma/erdma-abi.h:14:2: error: unknown type name 'u64'
           u64 db_record_va;
           ^
   ./usr/include/rdma/erdma-abi.h:15:2: error: unknown type name 'u64'
           u64 qbuf_va;
           ^
>> ./usr/include/rdma/erdma-abi.h:16:2: error: unknown type name 'u32'
           u32 qbuf_len;
           ^
   ./usr/include/rdma/erdma-abi.h:17:2: error: unknown type name 'u32'
           u32 rsvd0;
           ^
   ./usr/include/rdma/erdma-abi.h:21:2: error: unknown type name 'u32'
           u32 cq_id;
           ^
   ./usr/include/rdma/erdma-abi.h:22:2: error: unknown type name 'u32'
           u32 num_cqe;
           ^
   ./usr/include/rdma/erdma-abi.h:26:2: error: unknown type name 'u64'
           u64 db_record_va;
           ^
   ./usr/include/rdma/erdma-abi.h:27:2: error: unknown type name 'u64'
           u64 qbuf_va;
           ^
   ./usr/include/rdma/erdma-abi.h:28:2: error: unknown type name 'u32'
           u32 qbuf_len;
           ^
   ./usr/include/rdma/erdma-abi.h:29:2: error: unknown type name 'u32'
           u32 rsvd0;
           ^
   ./usr/include/rdma/erdma-abi.h:33:2: error: unknown type name 'u32'
           u32 qp_id;
           ^
   ./usr/include/rdma/erdma-abi.h:34:2: error: unknown type name 'u32'
           u32 num_sqe;
           ^
   ./usr/include/rdma/erdma-abi.h:35:2: error: unknown type name 'u32'
           u32 num_rqe;
           ^
   ./usr/include/rdma/erdma-abi.h:36:2: error: unknown type name 'u32'
           u32 rq_offset;
           ^
   ./usr/include/rdma/erdma-abi.h:40:2: error: unknown type name 'u32'
           u32 dev_id;
           ^
   ./usr/include/rdma/erdma-abi.h:41:2: error: unknown type name 'u32'
           u32 pad;
           ^
   ./usr/include/rdma/erdma-abi.h:42:2: error: unknown type name 'u32'
           u32 sdb_type;
           ^
   ./usr/include/rdma/erdma-abi.h:43:2: error: unknown type name 'u32'
           u32 sdb_offset;
           ^
   ./usr/include/rdma/erdma-abi.h:44:2: error: unknown type name 'u64'
           u64 sdb;
           ^
   fatal error: too many errors emitted, stopping now [-ferror-limit=]
   20 errors generated.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org



* Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation
  2021-12-22 12:46         ` Bernard Metzler
@ 2021-12-23  8:38           ` Cheng Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-23  8:38 UTC (permalink / raw)
  To: Bernard Metzler, Leon Romanovsky
  Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/22/21 8:46 PM, Bernard Metzler wrote:
>> -----Original Message-----
>> From: Cheng Xu <chengyou@linux.alibaba.com>
>> Sent: Wednesday, 22 December 2021 04:11
>> To: Bernard Metzler <BMT@zurich.ibm.com>; Leon Romanovsky
>> <leon@kernel.org>
>> Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org;
>> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
>> Subject: [EXTERNAL] Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs
>> implementation
>>
>>
>>
>> On 12/21/21 11:20 PM, Bernard Metzler wrote:
>>>> -----Original Message-----
>>>> From: Leon Romanovsky <leon@kernel.org>
>>>> Sent: Tuesday, 21 December 2021 14:32
>>>> To: Cheng Xu <chengyou@linux.alibaba.com>
>>>> Cc: jgg@ziepe.ca; dledford@redhat.com; linux-rdma@vger.kernel.org;
>>>> KaiShen@linux.alibaba.com; tonylu@linux.alibaba.com
>>>> Subject: [EXTERNAL] Re: [PATCH rdma-next 07/11] RDMA/erdma: Add verbs
>>>> implementation
>>>>
>>>> On Tue, Dec 21, 2021 at 10:48:54AM +0800, Cheng Xu wrote:
>>>>> The RDMA verbs implementation of erdma is divided into three files:
>>>>> erdma_qp.c, erdma_cq.c, and erdma_verbs.c. Internally used functions and
>>>>> datapath functions of QP/CQ are put in erdma_qp.c and erdma_cq.c; the
>>>>> rest is in erdma_verbs.c.
>>>>>
>>>>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>>>>> ---
>>>>>    drivers/infiniband/hw/erdma/erdma_cq.c    |  201 +++
>>>>>    drivers/infiniband/hw/erdma/erdma_qp.c    |  624 +++++++++
>>>>>    drivers/infiniband/hw/erdma/erdma_verbs.c | 1477
>> +++++++++++++++++++++
>>>>>    3 files changed, 2302 insertions(+)
>>>>>    create mode 100644 drivers/infiniband/hw/erdma/erdma_cq.c
>>>>>    create mode 100644 drivers/infiniband/hw/erdma/erdma_qp.c
>>>>>    create mode 100644 drivers/infiniband/hw/erdma/erdma_verbs.c
>>>>
>>>>
>>>> Please no inline functions in .c files and no void casting for the
>>>> return values of functions.
>>>>
>>>> <...>
>>>>
>>>>> diff --git a/drivers/infiniband/hw/erdma/erdma_qp.c
>>>> b/drivers/infiniband/hw/erdma/erdma_qp.c
>>>>> new file mode 100644
>>>>> index 000000000000..8c02215cee04
>>>>> --- /dev/null
>>>>> +++ b/drivers/infiniband/hw/erdma/erdma_qp.c
>>>>> @@ -0,0 +1,624 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>> +/*
>>>>> + * Authors: Cheng Xu <chengyou@linux.alibaba.com>
>>>>> + *          Kai Shen <kaishen@linux.alibaba.com>
>>>>> + * Copyright (c) 2020-2021, Alibaba Group.
>>>>> + *
>>>>> + * Authors: Bernard Metzler <bmt@zurich.ibm.com>
>>>>> + *          Fredy Neeser <nfd@zurich.ibm.com>
>>>>> + * Copyright (c) 2008-2016, IBM Corporation
>>>>
>>>> What does it mean?
>>>>
>>>
>>> Significant parts of the driver have been taken from siw it seems.
>>> Probably really from an old version of it.
>>> In that case I would have recommended to take the upstream siw
>>> code, which has been cleaned from those issues we now see again
>>> (including debugfs code, extern definitions, inline in .c code,
>>> casting issues, etc etc.). Why starting in 2020 with
>>> code from 2016, if better code is available?
>>>
>>> Bernard.
>>
>> First of all, thank you for developing siw, Bernard and Fredy, so we
>> can build our erdma based on your work.
> 
> You are welcome.
> You probably got the code from https://github.com/zrlio/softiwarp
> where I stopped pushing updates 4 years ago. By then, I started working
> on making it acceptable for upstream. As said, I highly recommend taking
> it from there, since the community already invested time and effort to
> make the code better, and finally acceptable. If you do so, please also
> update the copyright notice.

Thank you, and I will follow your recommendation.

> Fredy hasn't been part of it for almost 10 years, and is not reachable
> via the email provided. And, by 2016, his contributions were limited to
> the siw_cm.c code only.
> 
> 
> 
>> At the beginning, we started developing the erdma driver on kernels
>> 4.9/4.19/5.10, and didn't know about the upstream siw version since it
> 
> 
> siw is in the Linux kernel since v5.3
> 

Yes, I got it wrong.

Thanks,
Cheng Xu

>> the newer kernel version. As a result, we develop erdma based on the
>> older version.
>> Thank you for your recommendation. We will check the differences and
>> take the upstream siw code if needed.
>>
>> Thanks,
>> Cheng Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-22  3:35   ` Cheng Xu
@ 2021-12-23 10:23     ` Leon Romanovsky
  2021-12-23 12:59       ` Cheng Xu
  0 siblings, 1 reply; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-23 10:23 UTC (permalink / raw)
  To: Cheng Xu; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu

On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
> 

<...>

> > > 
> > > For the ECS instance with RDMA enabled, there are two kinds of devices
> > > allocated, one for ERDMA, and one for the original netdev (virtio-net).
> > > They are different PCI deivces. ERDMA driver can get the information about
> > > which netdev attached to in its PCIe barspace (by MAC address matching).
> > 
> > This is very questionable. The netdev part should be kept in the
> > drivers/ethernet/... part of the kernel.
> > 
> > Thanks
> 
> The net device used in Alibaba ECS instance is virtio-net device, driven
> by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
> device, and will be attached to an existed virtio-net device. The
> relationship between ibdev and netdev in erdma is similar to siw/rxe.

siw/rxe bind through the RDMA_NLDEV_CMD_NEWLINK netlink command, not
through MAC matching.
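
(For reference, this is the path that the iproute2 rdma tool drives; a siw
link, for example, is created with something like

    # rdma link add siw0 type siw netdev eth0

where "siw0" and "eth0" are placeholder names.)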

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-23 10:23     ` Leon Romanovsky
@ 2021-12-23 12:59       ` Cheng Xu
  2021-12-23 13:44         ` Leon Romanovsky
  2022-01-07 14:24         ` Jason Gunthorpe
  0 siblings, 2 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-23 12:59 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/23/21 6:23 PM, Leon Romanovsky wrote:
> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>
> 
> <...>
> 
>>>>
>>>> For the ECS instance with RDMA enabled, there are two kinds of devices
>>>> allocated, one for ERDMA, and one for the original netdev (virtio-net).
>>>> They are different PCI deivces. ERDMA driver can get the information about
>>>> which netdev attached to in its PCIe barspace (by MAC address matching).
>>>
>>> This is very questionable. The netdev part should be kept in the
>>> drivers/ethernet/... part of the kernel.
>>>
>>> Thanks
>>
>> The net device used in Alibaba ECS instance is virtio-net device, driven
>> by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
>> device, and will be attached to an existed virtio-net device. The
>> relationship between ibdev and netdev in erdma is similar to siw/rxe.
> 
> siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
> through MAC's matching.
> 
> Thanks

None of siw/rxe/erdma needs to implement the netdev part; this is what I
wanted to express when I said 'similar'.
What you mentioned (the bind mechanism) is one major difference between
erdma and siw/rxe. For siw/rxe, the user can attach an ibdev to any netdev
they want, but that is not true for erdma. When a user buys the erdma
service, they must specify which ENI (elastic network interface) it is to
be bound to; this means the attached erdma device can only be bound to
that specific netdev. Due to the uniqueness of MAC addresses in our ECS
instances, we use the MAC address as the identification, so the driver
knows which netdev it should be bound to.
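
To make that concrete: the matching is conceptually just a lookup of the
netdev whose MAC address equals the one read from the BAR. A minimal
sketch -- not the actual erdma code, and erdma_find_netdev() is a name I
made up here:

#include <linux/netdevice.h>
#include <linux/etherdevice.h>

/* Look up the netdev whose MAC matches the one the device reports in
 * its PCIe BAR space; returns it with a reference held, or NULL.
 */
static struct net_device *erdma_find_netdev(const u8 *want_mac)
{
	struct net_device *ndev, *found = NULL;

	rcu_read_lock();
	for_each_netdev_rcu(&init_net, ndev) {
		if (ether_addr_equal(ndev->dev_addr, want_mac)) {
			found = ndev;
			dev_hold(found);	/* caller must dev_put() */
			break;
		}
	}
	rcu_read_unlock();

	return found;
}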

Thanks,
Cheng Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-23 12:59       ` Cheng Xu
@ 2021-12-23 13:44         ` Leon Romanovsky
  2021-12-24  7:07           ` Cheng Xu
  2022-01-07 14:24         ` Jason Gunthorpe
  1 sibling, 1 reply; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-23 13:44 UTC (permalink / raw)
  To: Cheng Xu; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu

On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
> 
> 
> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
> > On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
> > > 
> > 
> > <...>
> > 
> > > > > 
> > > > > For the ECS instance with RDMA enabled, there are two kinds of devices
> > > > > allocated, one for ERDMA, and one for the original netdev (virtio-net).
> > > > > They are different PCI deivces. ERDMA driver can get the information about
> > > > > which netdev attached to in its PCIe barspace (by MAC address matching).
> > > > 
> > > > This is very questionable. The netdev part should be kept in the
> > > > drivers/ethernet/... part of the kernel.
> > > > 
> > > > Thanks
> > > 
> > > The net device used in Alibaba ECS instance is virtio-net device, driven
> > > by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
> > > device, and will be attached to an existed virtio-net device. The
> > > relationship between ibdev and netdev in erdma is similar to siw/rxe.
> > 
> > siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
> > through MAC's matching.
> > 
> > Thanks
> 
> Both siw/rxe/erdma don't need to implement netdev part, this is what I
> wanted to express when I said 'similar'.
> What you mentioned (the bind mechanism) is one major difference between
> erdma and siw/rxe. For siw/rxe, user can attach ibdev to every netdev if
> he/she wants, but it is not true for erdma. When user buys the erdma
> service, he/she must specify which ENI (elastic network interface) to be
> binded, it means that the attached erdma device can only be binded to
> the specific netdev. Due to the uniqueness of MAC address in our ECS
> instance, we use the MAC address as the identification, then the driver
> knows which netdev should be binded to.

Nothing prohibits you from implementing this MAC check in RDMA_NLDEV_CMD_NEWLINK.
I personally don't like the idea that the bind logic is performed "magically".

BTW,
1. No module parameters
2. No driver versions

Thanks

> 
> Thanks,
> Cheng Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-21  2:48 ` [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions Cheng Xu
  2021-12-21 11:57     ` kernel test robot
  2021-12-22 16:14     ` kernel test robot
@ 2021-12-23 15:46   ` Yanjun Zhu
  2021-12-23 18:45     ` Leon Romanovsky
  2 siblings, 1 reply; 52+ messages in thread
From: Yanjun Zhu @ 2021-12-23 15:46 UTC (permalink / raw)
  To: Cheng Xu, jgg, dledford; +Cc: leon, linux-rdma, KaiShen, tonylu

On 2021/12/21 10:48, Cheng Xu wrote:
> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> ---
>   include/uapi/rdma/erdma-abi.h | 49 +++++++++++++++++++++++++++++++++++
>   1 file changed, 49 insertions(+)
>   create mode 100644 include/uapi/rdma/erdma-abi.h
> 
> diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
> new file mode 100644
> index 000000000000..6bcba10c1e41
> --- /dev/null
> +++ b/include/uapi/rdma/erdma-abi.h
> @@ -0,0 +1,49 @@
> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
> +/*
> + * Copyright (c) 2020-2021, Alibaba Group.
> + */
> +
> +#ifndef __ERDMA_USER_H__
> +#define __ERDMA_USER_H__
> +
> +#include <linux/types.h>
> +
> +#define ERDMA_ABI_VERSION       1

ERDMA_ABI_VERSION should be 2?

Zhu Yanjun
> +
> +struct erdma_ureq_create_cq {
> +	u64 db_record_va;
> +	u64 qbuf_va;
> +	u32 qbuf_len;
> +	u32 rsvd0;
> +};
> +
> +struct erdma_uresp_create_cq {
> +	u32 cq_id;
> +	u32 num_cqe;
> +};
> +
> +struct erdma_ureq_create_qp {
> +	u64 db_record_va;
> +	u64 qbuf_va;
> +	u32 qbuf_len;
> +	u32 rsvd0;
> +};
> +
> +struct erdma_uresp_create_qp {
> +	u32 qp_id;
> +	u32 num_sqe;
> +	u32 num_rqe;
> +	u32 rq_offset;
> +};
> +
> +struct erdma_uresp_alloc_ctx {
> +	u32 dev_id;
> +	u32 pad;
> +	u32 sdb_type;
> +	u32 sdb_offset;
> +	u64 sdb;
> +	u64 rdb;
> +	u64 cdb;
> +};
> +
> +#endif


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-23 15:46   ` Yanjun Zhu
@ 2021-12-23 18:45     ` Leon Romanovsky
  2021-12-23 22:55       ` Yanjun Zhu
  0 siblings, 1 reply; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-23 18:45 UTC (permalink / raw)
  To: Yanjun Zhu; +Cc: Cheng Xu, jgg, dledford, linux-rdma, KaiShen, tonylu

On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
> On 2021/12/21 10:48, Cheng Xu wrote:
> > Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> > ---
> >   include/uapi/rdma/erdma-abi.h | 49 +++++++++++++++++++++++++++++++++++
> >   1 file changed, 49 insertions(+)
> >   create mode 100644 include/uapi/rdma/erdma-abi.h
> > 
> > diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
> > new file mode 100644
> > index 000000000000..6bcba10c1e41
> > --- /dev/null
> > +++ b/include/uapi/rdma/erdma-abi.h
> > @@ -0,0 +1,49 @@
> > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
> > +/*
> > + * Copyright (c) 2020-2021, Alibaba Group.
> > + */
> > +
> > +#ifndef __ERDMA_USER_H__
> > +#define __ERDMA_USER_H__
> > +
> > +#include <linux/types.h>
> > +
> > +#define ERDMA_ABI_VERSION       1
> 
> ERDMA_ABI_VERSION should be 2?

Why?

This field is for rdma-core, and we don't have an erdma provider in that
library yet. It always starts from 1 for new drivers.

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-23 18:45     ` Leon Romanovsky
@ 2021-12-23 22:55       ` Yanjun Zhu
  2021-12-24  6:04         ` Leon Romanovsky
  2021-12-24  7:12         ` Cheng Xu
  0 siblings, 2 replies; 52+ messages in thread
From: Yanjun Zhu @ 2021-12-23 22:55 UTC (permalink / raw)
  To: Leon Romanovsky, Yanjun Zhu
  Cc: Cheng Xu, jgg, dledford, linux-rdma, KaiShen, tonylu

On 2021/12/24 2:45, Leon Romanovsky wrote:
> On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
>> On 2021/12/21 10:48, Cheng Xu wrote:
>>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>>> ---
>>>    include/uapi/rdma/erdma-abi.h | 49 +++++++++++++++++++++++++++++++++++
>>>    1 file changed, 49 insertions(+)
>>>    create mode 100644 include/uapi/rdma/erdma-abi.h
>>>
>>> diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
>>> new file mode 100644
>>> index 000000000000..6bcba10c1e41
>>> --- /dev/null
>>> +++ b/include/uapi/rdma/erdma-abi.h
>>> @@ -0,0 +1,49 @@
>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
>>> +/*
>>> + * Copyright (c) 2020-2021, Alibaba Group.
>>> + */
>>> +
>>> +#ifndef __ERDMA_USER_H__
>>> +#define __ERDMA_USER_H__
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define ERDMA_ABI_VERSION       1
>>
>> ERDMA_ABI_VERSION should be 2?
> 
> Why?
> 
> This field is for rdma-core and we don't have erdma provider in that
> library yet. It always starts from 1 for new drivers.
Please check this link: 
http://mail.spinics.net/lists/linux-rdma/msg63012.html

Jason mentioned in this link:

"
/*
  * For 64 bit machines ABI version 1 and 2 are the same. Otherwise 32
  * bit machines require ABI version 2 which guarentees the user and
  * kernel use the same ABI.
  */
"

Zhu Yanjun
> 
> Thanks


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-23 22:55       ` Yanjun Zhu
@ 2021-12-24  6:04         ` Leon Romanovsky
  2021-12-24  7:54           ` Yanjun Zhu
  2021-12-24  7:12         ` Cheng Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-24  6:04 UTC (permalink / raw)
  To: Yanjun Zhu; +Cc: Cheng Xu, jgg, dledford, linux-rdma, KaiShen, tonylu

On Fri, Dec 24, 2021 at 06:55:41AM +0800, Yanjun Zhu wrote:
> On 2021/12/24 2:45, Leon Romanovsky wrote:
> > On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
> > > On 2021/12/21 10:48, Cheng Xu wrote:
> > > > Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> > > > ---
> > > >    include/uapi/rdma/erdma-abi.h | 49 +++++++++++++++++++++++++++++++++++
> > > >    1 file changed, 49 insertions(+)
> > > >    create mode 100644 include/uapi/rdma/erdma-abi.h
> > > > 
> > > > diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
> > > > new file mode 100644
> > > > index 000000000000..6bcba10c1e41
> > > > --- /dev/null
> > > > +++ b/include/uapi/rdma/erdma-abi.h
> > > > @@ -0,0 +1,49 @@
> > > > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
> > > > +/*
> > > > + * Copyright (c) 2020-2021, Alibaba Group.
> > > > + */
> > > > +
> > > > +#ifndef __ERDMA_USER_H__
> > > > +#define __ERDMA_USER_H__
> > > > +
> > > > +#include <linux/types.h>
> > > > +
> > > > +#define ERDMA_ABI_VERSION       1
> > > 
> > > ERDMA_ABI_VERSION should be 2?
> > 
> > Why?
> > 
> > This field is for rdma-core and we don't have erdma provider in that
> > library yet. It always starts from 1 for new drivers.
> Please check this link:
> http://mail.spinics.net/lists/linux-rdma/msg63012.html

OK, I still don't understand why.

The RXE case is different: rdma-core already had a broken RXE
implementation, which is why the version was incremented.

> 
> Jason mentioned in this link:
> 
> "
> /*
>  * For 64 bit machines ABI version 1 and 2 are the same. Otherwise 32
>  * bit machines require ABI version 2 which guarentees the user and
>  * kernel use the same ABI.
>  */
> "
> 
> Zhu Yanjun
> > 
> > Thanks
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-23 13:44         ` Leon Romanovsky
@ 2021-12-24  7:07           ` Cheng Xu
  2021-12-24 18:26             ` Leon Romanovsky
  0 siblings, 1 reply; 52+ messages in thread
From: Cheng Xu @ 2021-12-24  7:07 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/23/21 9:44 PM, Leon Romanovsky wrote:
> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
>>> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>>>
>>>
>>> <...>
>>>
>>>>>>
>>>>>> For the ECS instance with RDMA enabled, there are two kinds of devices
>>>>>> allocated, one for ERDMA, and one for the original netdev (virtio-net).
>>>>>> They are different PCI deivces. ERDMA driver can get the information about
>>>>>> which netdev attached to in its PCIe barspace (by MAC address matching).
>>>>>
>>>>> This is very questionable. The netdev part should be kept in the
>>>>> drivers/ethernet/... part of the kernel.
>>>>>
>>>>> Thanks
>>>>
>>>> The net device used in Alibaba ECS instance is virtio-net device, driven
>>>> by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
>>>> device, and will be attached to an existed virtio-net device. The
>>>> relationship between ibdev and netdev in erdma is similar to siw/rxe.
>>>
>>> siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
>>> through MAC's matching.
>>>
>>> Thanks
>>
>> Both siw/rxe/erdma don't need to implement netdev part, this is what I
>> wanted to express when I said 'similar'.
>> What you mentioned (the bind mechanism) is one major difference between
>> erdma and siw/rxe. For siw/rxe, user can attach ibdev to every netdev if
>> he/she wants, but it is not true for erdma. When user buys the erdma
>> service, he/she must specify which ENI (elastic network interface) to be
>> binded, it means that the attached erdma device can only be binded to
>> the specific netdev. Due to the uniqueness of MAC address in our ECS
>> instance, we use the MAC address as the identification, then the driver
>> knows which netdev should be binded to.
> 
> Nothing prohibits from you to implement this MAC check in RDMA_NLDEV_CMD_NEWLINK.
> I personally don't like the idea that bind logic is performed "magically".
> 

OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But it
means that erdma will not be ready to use like other RDMA HCAs until the
user configures the link manually, which may not be friendly to them.
I'm not sure whether our current method is acceptable or not. If you
strongly recommend that we use RDMA_NLDEV_CMD_NEWLINK, we will change to
it.

Thanks,
Cheng Xu

> BTW,
> 1. No module parameters
> 2. No driver versions
> 

Will fix them.

> Thanks
> 
>>
>> Thanks,
>> Cheng Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-23 22:55       ` Yanjun Zhu
  2021-12-24  6:04         ` Leon Romanovsky
@ 2021-12-24  7:12         ` Cheng Xu
  2021-12-24  8:02           ` Yanjun Zhu
  2021-12-24 18:19           ` Leon Romanovsky
  1 sibling, 2 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-24  7:12 UTC (permalink / raw)
  To: Yanjun Zhu, Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/24/21 6:55 AM, Yanjun Zhu wrote:
> On 2021/12/24 2:45, Leon Romanovsky wrote:
>> On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
>>> On 2021/12/21 10:48, Cheng Xu wrote:
>>>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>>>> ---
>>>>    include/uapi/rdma/erdma-abi.h | 49 
>>>> +++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 49 insertions(+)
>>>>    create mode 100644 include/uapi/rdma/erdma-abi.h
>>>>
>>>> diff --git a/include/uapi/rdma/erdma-abi.h 
>>>> b/include/uapi/rdma/erdma-abi.h
>>>> new file mode 100644
>>>> index 000000000000..6bcba10c1e41
>>>> --- /dev/null
>>>> +++ b/include/uapi/rdma/erdma-abi.h
>>>> @@ -0,0 +1,49 @@
>>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR 
>>>> Linux-OpenIB) */
>>>> +/*
>>>> + * Copyright (c) 2020-2021, Alibaba Group.
>>>> + */
>>>> +
>>>> +#ifndef __ERDMA_USER_H__
>>>> +#define __ERDMA_USER_H__
>>>> +
>>>> +#include <linux/types.h>
>>>> +
>>>> +#define ERDMA_ABI_VERSION       1
>>>
>>> ERDMA_ABI_VERSION should be 2?
>>
>> Why?
>>
>> This field is for rdma-core and we don't have erdma provider in that
>> library yet. It always starts from 1 for new drivers.
> Please check this link: 
> http://mail.spinics.net/lists/linux-rdma/msg63012.html
> 
> Jason mentioned in this link:
> 
> "
> /*
>   * For 64 bit machines ABI version 1 and 2 are the same. Otherwise 32
>   * bit machines require ABI version 2 which guarentees the user and
>   * kernel use the same ABI.
>   */
> "
> 
> Zhu Yanjun

Even though I do not fully understand the reason, as mentioned above I 
think ERDMA_ABI_VERSION = 1 is fine, because ERDMA can only work on 
64-bit machines.
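
That constraint is normally encoded in Kconfig, so the driver cannot even
be built for 32-bit kernels. A sketch, with the dependency list being
illustrative rather than quoted from this series:

config INFINIBAND_ERDMA
	tristate "Alibaba Elastic RDMA Adapter (ERDMA) support"
	depends on PCI_MSI && 64BIT
	help
	  RDMA driver for the Alibaba Elastic RDMA Adapter.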

Thanks,
Cheng Xu




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-24  6:04         ` Leon Romanovsky
@ 2021-12-24  7:54           ` Yanjun Zhu
  2021-12-24 18:11             ` Leon Romanovsky
  0 siblings, 1 reply; 52+ messages in thread
From: Yanjun Zhu @ 2021-12-24  7:54 UTC (permalink / raw)
  To: Leon Romanovsky, Yanjun Zhu
  Cc: Cheng Xu, jgg, dledford, linux-rdma, KaiShen, tonylu

On 2021/12/24 14:04, Leon Romanovsky wrote:
> On Fri, Dec 24, 2021 at 06:55:41AM +0800, Yanjun Zhu wrote:
>> On 2021/12/24 2:45, Leon Romanovsky wrote:
>>> On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
>>>> On 2021/12/21 10:48, Cheng Xu wrote:
>>>>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>>>>> ---
>>>>>     include/uapi/rdma/erdma-abi.h | 49 +++++++++++++++++++++++++++++++++++
>>>>>     1 file changed, 49 insertions(+)
>>>>>     create mode 100644 include/uapi/rdma/erdma-abi.h
>>>>>
>>>>> diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
>>>>> new file mode 100644
>>>>> index 000000000000..6bcba10c1e41
>>>>> --- /dev/null
>>>>> +++ b/include/uapi/rdma/erdma-abi.h
>>>>> @@ -0,0 +1,49 @@
>>>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
>>>>> +/*
>>>>> + * Copyright (c) 2020-2021, Alibaba Group.
>>>>> + */
>>>>> +
>>>>> +#ifndef __ERDMA_USER_H__
>>>>> +#define __ERDMA_USER_H__
>>>>> +
>>>>> +#include <linux/types.h>
>>>>> +
>>>>> +#define ERDMA_ABI_VERSION       1
>>>>
>>>> ERDMA_ABI_VERSION should be 2?
>>>
>>> Why?
>>>
>>> This field is for rdma-core and we don't have erdma provider in that
>>> library yet. It always starts from 1 for new drivers.
>> Please check this link:
>> http://mail.spinics.net/lists/linux-rdma/msg63012.html
> 
> OK, I still don't understand why.


Perhaps 32-bit machines require ABI version 2, which guarantees that the 
user and kernel use the same ABI.

Zhu Yanjun

> 
> RXE case is different, because rdma-core already had broken RXE
> implementation, so this is why the version was incremented.
> 
>>
>> Jason mentioned in this link:
>>
>> "
>> /*
>>   * For 64 bit machines ABI version 1 and 2 are the same. Otherwise 32
>>   * bit machines require ABI version 2 which guarentees the user and
>>   * kernel use the same ABI.
>>   */
>> "
>>
>> Zhu Yanjun
>>>
>>> Thanks
>>


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-24  7:12         ` Cheng Xu
@ 2021-12-24  8:02           ` Yanjun Zhu
  2021-12-24 18:19           ` Leon Romanovsky
  1 sibling, 0 replies; 52+ messages in thread
From: Yanjun Zhu @ 2021-12-24  8:02 UTC (permalink / raw)
  To: Cheng Xu, Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu


On 2021/12/24 15:12, Cheng Xu wrote:
>
>
> On 12/24/21 6:55 AM, Yanjun Zhu wrote:
>> On 2021/12/24 2:45, Leon Romanovsky wrote:
>>> On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
>>>> On 2021/12/21 10:48, Cheng Xu wrote:
>>>>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>>>>> ---
>>>>>    include/uapi/rdma/erdma-abi.h | 49 
>>>>> +++++++++++++++++++++++++++++++++++
>>>>>    1 file changed, 49 insertions(+)
>>>>>    create mode 100644 include/uapi/rdma/erdma-abi.h
>>>>>
>>>>> diff --git a/include/uapi/rdma/erdma-abi.h 
>>>>> b/include/uapi/rdma/erdma-abi.h
>>>>> new file mode 100644
>>>>> index 000000000000..6bcba10c1e41
>>>>> --- /dev/null
>>>>> +++ b/include/uapi/rdma/erdma-abi.h
>>>>> @@ -0,0 +1,49 @@
>>>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR 
>>>>> Linux-OpenIB) */
>>>>> +/*
>>>>> + * Copyright (c) 2020-2021, Alibaba Group.
>>>>> + */
>>>>> +
>>>>> +#ifndef __ERDMA_USER_H__
>>>>> +#define __ERDMA_USER_H__
>>>>> +
>>>>> +#include <linux/types.h>
>>>>> +
>>>>> +#define ERDMA_ABI_VERSION       1
>>>>
>>>> ERDMA_ABI_VERSION should be 2?
>>>
>>> Why?
>>>
>>> This field is for rdma-core and we don't have erdma provider in that
>>> library yet. It always starts from 1 for new drivers.
>> Please check this link: 
>> http://mail.spinics.net/lists/linux-rdma/msg63012.html
>>
>> Jason mentioned in this link:
>>
>> "
>> /*
>>   * For 64 bit machines ABI version 1 and 2 are the same. Otherwise 32
>>   * bit machines require ABI version 2 which guarentees the user and
>>   * kernel use the same ABI.
>>   */
>> "
>>
>> Zhu Yanjun
>
> Even though I do not understand the reason, but as mentioned above, I 
> think ERDMA_ABI_VERSION = 1 is fine, because ERDMA can only work in 
> 64bit machines.


Sure. If ERDMA can only work on 64-bit machines, ERDMA_ABI_VERSION = 1 is 
fine.

Zhu Yanjun


>
> Thanks,
> Cheng Xu
>
>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-24  7:54           ` Yanjun Zhu
@ 2021-12-24 18:11             ` Leon Romanovsky
  0 siblings, 0 replies; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-24 18:11 UTC (permalink / raw)
  To: Yanjun Zhu; +Cc: Cheng Xu, jgg, dledford, linux-rdma, KaiShen, tonylu

On Fri, Dec 24, 2021 at 03:54:18PM +0800, Yanjun Zhu wrote:
> On 2021/12/24 14:04, Leon Romanovsky wrote:
> > On Fri, Dec 24, 2021 at 06:55:41AM +0800, Yanjun Zhu wrote:
> > > On 2021/12/24 2:45, Leon Romanovsky wrote:
> > > > On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
> > > > > On 2021/12/21 10:48, Cheng Xu wrote:
> > > > > > Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> > > > > > ---
> > > > > >     include/uapi/rdma/erdma-abi.h | 49 +++++++++++++++++++++++++++++++++++
> > > > > >     1 file changed, 49 insertions(+)
> > > > > >     create mode 100644 include/uapi/rdma/erdma-abi.h
> > > > > > 
> > > > > > diff --git a/include/uapi/rdma/erdma-abi.h b/include/uapi/rdma/erdma-abi.h
> > > > > > new file mode 100644
> > > > > > index 000000000000..6bcba10c1e41
> > > > > > --- /dev/null
> > > > > > +++ b/include/uapi/rdma/erdma-abi.h
> > > > > > @@ -0,0 +1,49 @@
> > > > > > +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR Linux-OpenIB) */
> > > > > > +/*
> > > > > > + * Copyright (c) 2020-2021, Alibaba Group.
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef __ERDMA_USER_H__
> > > > > > +#define __ERDMA_USER_H__
> > > > > > +
> > > > > > +#include <linux/types.h>
> > > > > > +
> > > > > > +#define ERDMA_ABI_VERSION       1
> > > > > 
> > > > > ERDMA_ABI_VERSION should be 2?
> > > > 
> > > > Why?
> > > > 
> > > > This field is for rdma-core and we don't have erdma provider in that
> > > > library yet. It always starts from 1 for new drivers.
> > > Please check this link:
> > > http://mail.spinics.net/lists/linux-rdma/msg63012.html
> > 
> > OK, I still don't understand why.
> 
> 
> Perhaps 32 bit machines require ABI version 2 which guarentees the user and
> kernel use the same ABI.

Nope, it is not.

> 
> Zhu Yanjun
> 
> > 
> > RXE case is different, because rdma-core already had broken RXE
> > implementation, so this is why the version was incremented.
> > 
> > > 
> > > Jason mentioned in this link:
> > > 
> > > "
> > > /*
> > >   * For 64 bit machines ABI version 1 and 2 are the same. Otherwise 32
> > >   * bit machines require ABI version 2 which guarentees the user and
> > >   * kernel use the same ABI.
> > >   */
> > > "
> > > 
> > > Zhu Yanjun
> > > > 
> > > > Thanks
> > > 
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-24  7:12         ` Cheng Xu
  2021-12-24  8:02           ` Yanjun Zhu
@ 2021-12-24 18:19           ` Leon Romanovsky
  2021-12-25  0:03             ` Yanjun Zhu
  2021-12-25  3:36             ` Cheng Xu
  1 sibling, 2 replies; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-24 18:19 UTC (permalink / raw)
  To: Cheng Xu; +Cc: Yanjun Zhu, jgg, dledford, linux-rdma, KaiShen, tonylu

On Fri, Dec 24, 2021 at 03:12:35PM +0800, Cheng Xu wrote:
> 
> 
> On 12/24/21 6:55 AM, Yanjun Zhu wrote:
> > On 2021/12/24 2:45, Leon Romanovsky wrote:
> > > On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
> > > > On 2021/12/21 10:48, Cheng Xu wrote:
> > > > > Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
> > > > > ---
> > > > >    include/uapi/rdma/erdma-abi.h | 49
> > > > > +++++++++++++++++++++++++++++++++++
> > > > >    1 file changed, 49 insertions(+)
> > > > >    create mode 100644 include/uapi/rdma/erdma-abi.h
> > > > > 
> > > > > diff --git a/include/uapi/rdma/erdma-abi.h
> > > > > b/include/uapi/rdma/erdma-abi.h
> > > > > new file mode 100644
> > > > > index 000000000000..6bcba10c1e41
> > > > > --- /dev/null
> > > > > +++ b/include/uapi/rdma/erdma-abi.h
> > > > > @@ -0,0 +1,49 @@
> > > > > +/* SPDX-License-Identifier: ((GPL-2.0 WITH
> > > > > Linux-syscall-note) OR Linux-OpenIB) */
> > > > > +/*
> > > > > + * Copyright (c) 2020-2021, Alibaba Group.
> > > > > + */
> > > > > +
> > > > > +#ifndef __ERDMA_USER_H__
> > > > > +#define __ERDMA_USER_H__
> > > > > +
> > > > > +#include <linux/types.h>
> > > > > +
> > > > > +#define ERDMA_ABI_VERSION       1
> > > > 
> > > > ERDMA_ABI_VERSION should be 2?
> > > 
> > > Why?
> > > 
> > > This field is for rdma-core and we don't have erdma provider in that
> > > library yet. It always starts from 1 for new drivers.
> > Please check this link:
> > http://mail.spinics.net/lists/linux-rdma/msg63012.html
> > 
> > Jason mentioned in this link:
> > 
> > "
> > /*
> >   * For 64 bit machines ABI version 1 and 2 are the same. Otherwise 32
> >   * bit machines require ABI version 2 which guarentees the user and
> >   * kernel use the same ABI.
> >   */
> > "
> > 
> > Zhu Yanjun
> 
> Even though I do not understand the reason, but as mentioned above, I think
> ERDMA_ABI_VERSION = 1 is fine, because ERDMA can only work in 64bit
> machines.

Jason's comment came after we discovered that many of our API structures had
a problematic layout and weren't aligned to 64 bits. This caused issues when
32-bit software tried to use a 64-bit kernel.

So we didn't have much choice but to bump the ABI version for the broken
drivers, and RXE was one of them.

You are proposing a new driver, so it should start from 1.
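
As an illustration of that layout problem (example struct, not taken from
any driver's uAPI):

#include <linux/types.h>

struct abi_example {
	__u32 flags;
	/*
	 * No explicit padding: 32-bit x86 userspace aligns __u64 to
	 * 4 bytes and puts 'addr' at offset 4, while a 64-bit kernel
	 * aligns it to 8 bytes and puts it at offset 8, so the two
	 * sides disagree about the layout (and about sizeof()).
	 */
	__u64 addr;
};

That is why new uAPI structs are expected to be explicitly padded to
64-bit boundaries, as the rsvd0 fields in the erdma structures above do.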

Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-24  7:07           ` Cheng Xu
@ 2021-12-24 18:26             ` Leon Romanovsky
  2021-12-25  2:54               ` Cheng Xu
                                 ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Leon Romanovsky @ 2021-12-24 18:26 UTC (permalink / raw)
  To: Cheng Xu; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu

On Fri, Dec 24, 2021 at 03:07:57PM +0800, Cheng Xu wrote:
> 
> 
> On 12/23/21 9:44 PM, Leon Romanovsky wrote:
> > On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
> > > 
> > > 
> > > On 12/23/21 6:23 PM, Leon Romanovsky wrote:
> > > > On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
> > > > > 
> > > > 
> > > > <...>
> > > > 
> > > > > > > 
> > > > > > > For the ECS instance with RDMA enabled, there are two kinds of devices
> > > > > > > allocated, one for ERDMA, and one for the original netdev (virtio-net).
> > > > > > > They are different PCI deivces. ERDMA driver can get the information about
> > > > > > > which netdev attached to in its PCIe barspace (by MAC address matching).
> > > > > > 
> > > > > > This is very questionable. The netdev part should be kept in the
> > > > > > drivers/ethernet/... part of the kernel.
> > > > > > 
> > > > > > Thanks
> > > > > 
> > > > > The net device used in Alibaba ECS instance is virtio-net device, driven
> > > > > by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
> > > > > device, and will be attached to an existed virtio-net device. The
> > > > > relationship between ibdev and netdev in erdma is similar to siw/rxe.
> > > > 
> > > > siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
> > > > through MAC's matching.
> > > > 
> > > > Thanks
> > > 
> > > Both siw/rxe/erdma don't need to implement netdev part, this is what I
> > > wanted to express when I said 'similar'.
> > > What you mentioned (the bind mechanism) is one major difference between
> > > erdma and siw/rxe. For siw/rxe, user can attach ibdev to every netdev if
> > > he/she wants, but it is not true for erdma. When user buys the erdma
> > > service, he/she must specify which ENI (elastic network interface) to be
> > > binded, it means that the attached erdma device can only be binded to
> > > the specific netdev. Due to the uniqueness of MAC address in our ECS
> > > instance, we use the MAC address as the identification, then the driver
> > > knows which netdev should be binded to.
> > 
> > Nothing prohibits from you to implement this MAC check in RDMA_NLDEV_CMD_NEWLINK.
> > I personally don't like the idea that bind logic is performed "magically".
> > 
> 
> OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But it
> means that erdma can not be ready to use like other RDMA HCAs, until
> user configure the link manually. This way may be not friendly to them.
> I'm not sure that our current method is acceptable or not. If you
> strongly recommend us to use RDMA_NLDEV_CMD_NEWLINK, we will change to
> it.

Before you rush to change that logic, could you please explain
the security model of this binding?

As the owner of a VM, I can replace the kernel code with any code I want
and remove your MAC matching (or replace it with something different).
How will you protect against such a flow?

If you don't trust the VM, you should perform the binding in the hypervisor,
and this erdma driver will work out-of-the-box in the VM.

Thanks

> 
> Thanks,
> Cheng Xu
> 
> > BTW,
> > 1. No module parameters
> > 2. No driver versions
> > 
> 
> Will fix them.
> 
> > Thanks
> > 
> > > 
> > > Thanks,
> > > Cheng Xu

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-24 18:19           ` Leon Romanovsky
@ 2021-12-25  0:03             ` Yanjun Zhu
  2021-12-25  3:36             ` Cheng Xu
  1 sibling, 0 replies; 52+ messages in thread
From: Yanjun Zhu @ 2021-12-25  0:03 UTC (permalink / raw)
  To: Leon Romanovsky, Cheng Xu; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu


On 2021/12/25 2:19, Leon Romanovsky wrote:
> On Fri, Dec 24, 2021 at 03:12:35PM +0800, Cheng Xu wrote:
>>
>> On 12/24/21 6:55 AM, Yanjun Zhu wrote:
>>> On 2021/12/24 2:45, Leon Romanovsky wrote:
>>>> On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
>>>>> On 2021/12/21 10:48, Cheng Xu wrote:
>>>>>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>>>>>> ---
>>>>>>     include/uapi/rdma/erdma-abi.h | 49
>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 49 insertions(+)
>>>>>>     create mode 100644 include/uapi/rdma/erdma-abi.h
>>>>>>
>>>>>> diff --git a/include/uapi/rdma/erdma-abi.h
>>>>>> b/include/uapi/rdma/erdma-abi.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..6bcba10c1e41
>>>>>> --- /dev/null
>>>>>> +++ b/include/uapi/rdma/erdma-abi.h
>>>>>> @@ -0,0 +1,49 @@
>>>>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH
>>>>>> Linux-syscall-note) OR Linux-OpenIB) */
>>>>>> +/*
>>>>>> + * Copyright (c) 2020-2021, Alibaba Group.
>>>>>> + */
>>>>>> +
>>>>>> +#ifndef __ERDMA_USER_H__
>>>>>> +#define __ERDMA_USER_H__
>>>>>> +
>>>>>> +#include <linux/types.h>
>>>>>> +
>>>>>> +#define ERDMA_ABI_VERSION       1
>>>>> ERDMA_ABI_VERSION should be 2?
>>>> Why?
>>>>
>>>> This field is for rdma-core and we don't have erdma provider in that
>>>> library yet. It always starts from 1 for new drivers.
>>> Please check this link:
>>> http://mail.spinics.net/lists/linux-rdma/msg63012.html
>>>
>>> Jason mentioned in this link:
>>>
>>> "
>>> /*
>>>    * For 64 bit machines ABI version 1 and 2 are the same. Otherwise 32
>>>    * bit machines require ABI version 2 which guarentees the user and
>>>    * kernel use the same ABI.
>>>    */
>>> "
>>>
>>> Zhu Yanjun
>> Even though I do not understand the reason, but as mentioned above, I think
>> ERDMA_ABI_VERSION = 1 is fine, because ERDMA can only work in 64bit
>> machines.
> Jason's comment came after we discovered that many of our API structures had
> problematic layout and weren't aligned to 64bits. This caused to issues when
> the 32bits software tried to use 64bit kernel.

Got it. Thanks

Zhu Yanjun

>
> So we didn't have many choices but bump ABI versions for broken drivers
> and RXE was one of them.
>
> You are proposing new driver, it should start from 1.
>
> Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-24 18:26             ` Leon Romanovsky
@ 2021-12-25  2:54               ` Cheng Xu
  2021-12-25  2:57               ` Cheng Xu
  2021-12-25  3:03               ` [Please ignore the two former responses]Re: " Cheng Xu
  2 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-25  2:54 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/25/21 2:26 AM, Leon Romanovsky wrote:
> On Fri, Dec 24, 2021 at 03:07:57PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/23/21 9:44 PM, Leon Romanovsky wrote:
>>> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
>>>>
>>>>
>>>> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
>>>>> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>>>>>
>>>>>
>>>>> <...>
>>>>>
>>>>>>>>
>>>>>>>> For the ECS instance with RDMA enabled, there are two kinds of devices
>>>>>>>> allocated, one for ERDMA, and one for the original netdev (virtio-net).
>>>>>>>> They are different PCI deivces. ERDMA driver can get the information about
>>>>>>>> which netdev attached to in its PCIe barspace (by MAC address matching).
>>>>>>>
>>>>>>> This is very questionable. The netdev part should be kept in the
>>>>>>> drivers/ethernet/... part of the kernel.
>>>>>>>
>>>>>>> Thanks
>>>>>>
>>>>>> The net device used in Alibaba ECS instance is virtio-net device, driven
>>>>>> by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
>>>>>> device, and will be attached to an existed virtio-net device. The
>>>>>> relationship between ibdev and netdev in erdma is similar to siw/rxe.
>>>>>
>>>>> siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
>>>>> through MAC's matching.
>>>>>
>>>>> Thanks
>>>>
>>>> Both siw/rxe/erdma don't need to implement netdev part, this is what I
>>>> wanted to express when I said 'similar'.
>>>> What you mentioned (the bind mechanism) is one major difference between
>>>> erdma and siw/rxe. For siw/rxe, user can attach ibdev to every netdev if
>>>> he/she wants, but it is not true for erdma. When user buys the erdma
>>>> service, he/she must specify which ENI (elastic network interface) to be
>>>> binded, it means that the attached erdma device can only be binded to
>>>> the specific netdev. Due to the uniqueness of MAC address in our ECS
>>>> instance, we use the MAC address as the identification, then the driver
>>>> knows which netdev should be binded to.
>>>
>>> Nothing prohibits from you to implement this MAC check in RDMA_NLDEV_CMD_NEWLINK.
>>> I personally don't like the idea that bind logic is performed "magically".
>>>
>>
>> OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But it
>> means that erdma can not be ready to use like other RDMA HCAs, until
>> user configure the link manually. This way may be not friendly to them.
>> I'm not sure that our current method is acceptable or not. If you
>> strongly recommend us to use RDMA_NLDEV_CMD_NEWLINK, we will change to
>> it.
> 
> Before you are rushing to change that logic, could you please explain
> the security model of this binding?
> 
> I'm as an owner of VM can replace kernel code with any code I want and
> remove your MAC matching (or replace to something different). How will
> you protect from such flow?

I think this topic belongs to attack mitigation. One principle of attack
mitigation in our cloud is that an attacker MUST NOT be able to affect
any users but themselves.

Before I answer the question, I want to describe some more details of
our architecture.

In our MOC architecture, the virtio-net device (i.e., the virtio-net
back-end) is fully offloaded to the MOC, not implemented in the host
hypervisor. Each virtio-net device belongs to a vport, and if it has a
peer erdma device, the erdma device also belongs to that vport. The
protocol headers of the network flows from the virtio-net and erdma
devices must be consistent with the vport configuration (MAC address,
IP, etc.), which is enforced by checking the OVS rules.

Back to the question: we cannot prevent attackers from modifying the
code, binding the devices wrongly in the front-end, or, in some worse
cases, making the driver send invalid commands to the devices. If the
binding is wrong, the erdma network will simply be unreachable, because
the OVS module in the MOC hardware can detect this situation and drop all
the invalid network packets, so it has no influence on other users.


> If you don't trust VM, you should perform binding in hypervisor and
> this erdma driver will work out-of-the-box in the VM.

As mentioned above, we also have the binding configuration in the
back-end (i.e., the MOC hardware); erdma can only work properly when the
front-end configuration is correct.

> Thanks
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-24 18:26             ` Leon Romanovsky
  2021-12-25  2:54               ` Cheng Xu
@ 2021-12-25  2:57               ` Cheng Xu
  2021-12-25  3:03               ` [Please ignore the two former responses]Re: " Cheng Xu
  2 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-25  2:57 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/25/21 2:26 AM, Leon Romanovsky wrote:
> On Fri, Dec 24, 2021 at 03:07:57PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/23/21 9:44 PM, Leon Romanovsky wrote:
>>> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
>>>>
>>>>
>>>> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
>>>>> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>>>>>
>>>>>
>>>>> <...>
>>>>>
>>>>>>>>
>>>>>>>> For the ECS instance with RDMA enabled, there are two kinds of devices
>>>>>>>> allocated, one for ERDMA, and one for the original netdev (virtio-net).
>>>>>>>> They are different PCI deivces. ERDMA driver can get the information about
>>>>>>>> which netdev attached to in its PCIe barspace (by MAC address matching).
>>>>>>>
>>>>>>> This is very questionable. The netdev part should be kept in the
>>>>>>> drivers/ethernet/... part of the kernel.
>>>>>>>
>>>>>>> Thanks
>>>>>>
>>>>>> The net device used in Alibaba ECS instance is virtio-net device, driven
>>>>>> by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
>>>>>> device, and will be attached to an existed virtio-net device. The
>>>>>> relationship between ibdev and netdev in erdma is similar to siw/rxe.
>>>>>
>>>>> siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
>>>>> through MAC's matching.
>>>>>
>>>>> Thanks
>>>>
>>>> Both siw/rxe/erdma don't need to implement netdev part, this is what I
>>>> wanted to express when I said 'similar'.
>>>> What you mentioned (the bind mechanism) is one major difference between
>>>> erdma and siw/rxe. For siw/rxe, user can attach ibdev to every netdev if
>>>> he/she wants, but it is not true for erdma. When user buys the erdma
>>>> service, he/she must specify which ENI (elastic network interface) to be
>>>> binded, it means that the attached erdma device can only be binded to
>>>> the specific netdev. Due to the uniqueness of MAC address in our ECS
>>>> instance, we use the MAC address as the identification, then the driver
>>>> knows which netdev should be binded to.
>>>
>>> Nothing prohibits from you to implement this MAC check in RDMA_NLDEV_CMD_NEWLINK.
>>> I personally don't like the idea that bind logic is performed "magically".
>>>
>>
>> OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But it
>> means that erdma can not be ready to use like other RDMA HCAs, until
>> user configure the link manually. This way may be not friendly to them.
>> I'm not sure that our current method is acceptable or not. If you
>> strongly recommend us to use RDMA_NLDEV_CMD_NEWLINK, we will change to
>> it.
> 
> Before you are rushing to change that logic, could you please explain
> the security model of this binding?
> 
> I'm as an owner of VM can replace kernel code with any code I want and
> remove your MAC matching (or replace to something different). How will
> you protect from such flow?

In our MOC architecture, the virtio-net device (i.e., the virtio-net
back-end) is fully offloaded to the MOC, not implemented in the host
hypervisor. Each virtio-net device belongs to a vport, and if it has a
peer erdma device, the erdma device also belongs to that vport. The
protocol headers of the network flows from the virtio-net and erdma
devices must be consistent with the vport configuration (MAC address,
IP, etc.), which is enforced by checking the OVS rules.

Back to the question: we cannot prevent attackers from modifying the
code, binding the devices wrongly in the front-end, or, in some worse
cases, making the driver send invalid commands to the devices. If the
binding is wrong, the erdma network will simply be unreachable, because
the OVS module in the MOC hardware can detect this situation and drop all
the invalid network packets, so it has no influence on other users.

> If you don't trust VM, you should perform binding in hypervisor and
> this erdma driver will work out-of-the-box in the VM.

As mentioned above, we also have the binding configuration in the
back-end (i.e., the MOC hardware); erdma can only work properly when the
front-end configuration is correct.

Thanks,
Cheng Xu

> Thanks
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Please ignore the two former responses]Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-24 18:26             ` Leon Romanovsky
  2021-12-25  2:54               ` Cheng Xu
  2021-12-25  2:57               ` Cheng Xu
@ 2021-12-25  3:03               ` Cheng Xu
  2 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-25  3:03 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/25/21 2:26 AM, Leon Romanovsky wrote:
> On Fri, Dec 24, 2021 at 03:07:57PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/23/21 9:44 PM, Leon Romanovsky wrote:
>>> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
>>>>
>>>>
>>>> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
>>>>> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>>>>>
>>>>>
>>>>> <...>
>>>>>
>>>>>>>>
>>>>>>>> For the ECS instance with RDMA enabled, there are two kinds of devices
>>>>>>>> allocated, one for ERDMA, and one for the original netdev (virtio-net).
>>>>>>>> They are different PCI deivces. ERDMA driver can get the information about
>>>>>>>> which netdev attached to in its PCIe barspace (by MAC address matching).
>>>>>>>
>>>>>>> This is very questionable. The netdev part should be kept in the
>>>>>>> drivers/ethernet/... part of the kernel.
>>>>>>>
>>>>>>> Thanks
>>>>>>
>>>>>> The net device used in Alibaba ECS instance is virtio-net device, driven
>>>>>> by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
>>>>>> device, and will be attached to an existed virtio-net device. The
>>>>>> relationship between ibdev and netdev in erdma is similar to siw/rxe.
>>>>>
>>>>> siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
>>>>> through MAC's matching.
>>>>>
>>>>> Thanks
>>>>
>>>> Both siw/rxe/erdma don't need to implement netdev part, this is what I
>>>> wanted to express when I said 'similar'.
>>>> What you mentioned (the bind mechanism) is one major difference between
>>>> erdma and siw/rxe. For siw/rxe, user can attach ibdev to every netdev if
>>>> he/she wants, but it is not true for erdma. When user buys the erdma
>>>> service, he/she must specify which ENI (elastic network interface) to be
>>>> binded, it means that the attached erdma device can only be binded to
>>>> the specific netdev. Due to the uniqueness of MAC address in our ECS
>>>> instance, we use the MAC address as the identification, then the driver
>>>> knows which netdev should be binded to.
>>>
>>> Nothing prohibits from you to implement this MAC check in RDMA_NLDEV_CMD_NEWLINK.
>>> I personally don't like the idea that bind logic is performed "magically".
>>>
>>
>> OK, I agree with you that using RDMA_NLDEV_CMD_NEWLINK is better. But it
>> means that erdma can not be ready to use like other RDMA HCAs, until
>> user configure the link manually. This way may be not friendly to them.
>> I'm not sure that our current method is acceptable or not. If you
>> strongly recommend us to use RDMA_NLDEV_CMD_NEWLINK, we will change to
>> it.
> 
> Before you are rushing to change that logic, could you please explain
> the security model of this binding?
> 
> I'm as an owner of VM can replace kernel code with any code I want and
> remove your MAC matching (or replace to something different). How will
> you protect from such flow?

(I'm sorry for the wrong editing format in the two former responses;
please ignore them.)

I think this topic belongs to attack mitigation. One principle of attack
mitigation in our cloud is that an attacker MUST NOT be able to affect
any users but themselves.

Before I answer the question, I want to describe some more details of
our architecture.

In our MOC architecture, the virtio-net device (i.e., the virtio-net
back-end) is fully offloaded to the MOC, not implemented in the host
hypervisor. Each virtio-net device belongs to a vport, and if it has a
peer erdma device, the erdma device also belongs to that vport. The
protocol headers of the network flows from the virtio-net and erdma
devices must be consistent with the vport configuration (MAC address,
IP, etc.), which is enforced by checking the OVS rules.

Back to the question: we cannot prevent attackers from modifying the
code, binding the devices wrongly in the front-end, or, in some worse
cases, making the driver send invalid commands to the devices. If the
binding is wrong, the erdma network will simply be unreachable, because
the OVS module in the MOC hardware can detect this situation and drop all
the invalid network packets, so it has no influence on other users.

> If you don't trust VM, you should perform binding in hypervisor and
> this erdma driver will work out-of-the-box in the VM.

As mentioned above, we also have the binding configuration in the
back-end (i.e., the MOC hardware); erdma can only work properly when the
front-end configuration is correct.

Thanks,
Cheng Xu

> Thanks
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions
  2021-12-24 18:19           ` Leon Romanovsky
  2021-12-25  0:03             ` Yanjun Zhu
@ 2021-12-25  3:36             ` Cheng Xu
  1 sibling, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2021-12-25  3:36 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Yanjun Zhu, jgg, dledford, linux-rdma, KaiShen, tonylu



On 12/25/21 2:19 AM, Leon Romanovsky wrote:
> On Fri, Dec 24, 2021 at 03:12:35PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/24/21 6:55 AM, Yanjun Zhu wrote:
>>> On 2021/12/24 2:45, Leon Romanovsky wrote:
>>>> On Thu, Dec 23, 2021 at 11:46:03PM +0800, Yanjun Zhu wrote:
>>>>> On 2021/12/21 10:48, Cheng Xu wrote:
>>>>>> Signed-off-by: Cheng Xu <chengyou@linux.alibaba.com>
>>>>>> ---
>>>>>>     include/uapi/rdma/erdma-abi.h | 49
>>>>>> +++++++++++++++++++++++++++++++++++
>>>>>>     1 file changed, 49 insertions(+)
>>>>>>     create mode 100644 include/uapi/rdma/erdma-abi.h
>>>>>>
>>>>>> diff --git a/include/uapi/rdma/erdma-abi.h
>>>>>> b/include/uapi/rdma/erdma-abi.h
>>>>>> new file mode 100644
>>>>>> index 000000000000..6bcba10c1e41
>>>>>> --- /dev/null
>>>>>> +++ b/include/uapi/rdma/erdma-abi.h
>>>>>> @@ -0,0 +1,49 @@
>>>>>> +/* SPDX-License-Identifier: ((GPL-2.0 WITH
>>>>>> Linux-syscall-note) OR Linux-OpenIB) */
>>>>>> +/*
>>>>>> + * Copyright (c) 2020-2021, Alibaba Group.
>>>>>> + */
>>>>>> +
>>>>>> +#ifndef __ERDMA_USER_H__
>>>>>> +#define __ERDMA_USER_H__
>>>>>> +
>>>>>> +#include <linux/types.h>
>>>>>> +
>>>>>> +#define ERDMA_ABI_VERSION       1
>>>>>
>>>>> ERDMA_ABI_VERSION should be 2?
>>>>
>>>> Why?
>>>>
>>>> This field is for rdma-core and we don't have erdma provider in that
>>>> library yet. It always starts from 1 for new drivers.
>>> Please check this link:
>>> http://mail.spinics.net/lists/linux-rdma/msg63012.html
>>>
>>> Jason mentioned in this link:
>>>
>>> "
>>> /*
>>>    * For 64 bit machines ABI version 1 and 2 are the same. Otherwise 32
>>>    * bit machines require ABI version 2 which guarentees the user and
>>>    * kernel use the same ABI.
>>>    */
>>> "
>>>
>>> Zhu Yanjun
>>
>> Even though I do not understand the reason, but as mentioned above, I think
>> ERDMA_ABI_VERSION = 1 is fine, because ERDMA can only work in 64bit
>> machines.
> 
> Jason's comment came after we discovered that many of our API structures had
> problematic layout and weren't aligned to 64bits. This caused to issues when
> the 32bits software tried to use 64bit kernel.
> 
> So we didn't have many choices but bump ABI versions for broken drivers
> and RXE was one of them.
> 
> You are proposing new driver, it should start from 1.

Thanks for your explanation.

Thanks,
Cheng Xu

> Thanks

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2021-12-23 12:59       ` Cheng Xu
  2021-12-23 13:44         ` Leon Romanovsky
@ 2022-01-07 14:24         ` Jason Gunthorpe
  2022-01-10 10:07           ` Cheng Xu
  1 sibling, 1 reply; 52+ messages in thread
From: Jason Gunthorpe @ 2022-01-07 14:24 UTC (permalink / raw)
  To: Cheng Xu; +Cc: Leon Romanovsky, dledford, linux-rdma, KaiShen, tonylu

On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
> 
> 
> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
> > On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
> > > 
> > 
> > <...>
> > 
> > > > > 
> > > > > For the ECS instance with RDMA enabled, there are two kinds of devices
> > > > > allocated, one for ERDMA, and one for the original netdev (virtio-net).
> > > > > They are different PCI deivces. ERDMA driver can get the information about
> > > > > which netdev attached to in its PCIe barspace (by MAC address matching).
> > > > 
> > > > This is very questionable. The netdev part should be kept in the
> > > > drivers/ethernet/... part of the kernel.
> > > > 
> > > > Thanks
> > > 
> > > The net device used in Alibaba ECS instance is virtio-net device, driven
> > > by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
> > > device, and will be attached to an existed virtio-net device. The
> > > relationship between ibdev and netdev in erdma is similar to siw/rxe.
> > 
> > siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
> > through MAC's matching.
> > 
> > Thanks
> 
> Both siw/rxe/erdma don't need to implement netdev part, this is what I
> wanted to express when I said 'similar'.
> What you mentioned (the bind mechanism) is one major difference between
> erdma and siw/rxe. For siw/rxe, user can attach ibdev to every netdev if
> he/she wants, but it is not true for erdma. When user buys the erdma
> service, he/she must specify which ENI (elastic network interface) to be
> binded, it means that the attached erdma device can only be binded to
> the specific netdev. Due to the uniqueness of MAC address in our ECS
> instance, we use the MAC address as the identification, then the driver
> knows which netdev should be binded to.

It really doesn't match our driver binding model to rely on MAC
addresses.

Our standard model would expect that the virtio-net driver would
detect it has RDMA capability and spawn an aux device to link the two
things together.

Using net notifiers to try to link the lifecycles together has been a
mess so far.
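
A rough sketch of that model -- purely illustrative, since virtio-net does
not do this today, and all names here are made up:

#include <linux/auxiliary_bus.h>
#include <linux/slab.h>

static void vnet_rdma_adev_release(struct device *dev)
{
	kfree(container_of(dev, struct auxiliary_device, dev));
}

/* Net-driver side: publish an auxiliary device once RDMA capability is
 * detected; the RDMA driver then binds to the "virtio_net.rdma" id.
 */
static struct auxiliary_device *vnet_spawn_rdma_adev(struct device *parent)
{
	struct auxiliary_device *adev = kzalloc(sizeof(*adev), GFP_KERNEL);

	if (!adev)
		return NULL;

	adev->name = "rdma";	/* matched as "<modname>.rdma" */
	adev->dev.parent = parent;
	adev->dev.release = vnet_rdma_adev_release;

	if (auxiliary_device_init(adev)) {
		kfree(adev);	/* init failed, not yet refcounted */
		return NULL;
	}
	if (auxiliary_device_add(adev)) {
		auxiliary_device_uninit(adev);	/* drops the ref */
		return NULL;
	}

	return adev;
}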

Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver
  2022-01-07 14:24         ` Jason Gunthorpe
@ 2022-01-10 10:07           ` Cheng Xu
  0 siblings, 0 replies; 52+ messages in thread
From: Cheng Xu @ 2022-01-10 10:07 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Leon Romanovsky, dledford, linux-rdma, KaiShen, tonylu



On 1/7/22 10:24 PM, Jason Gunthorpe wrote:
> On Thu, Dec 23, 2021 at 08:59:14PM +0800, Cheng Xu wrote:
>>
>>
>> On 12/23/21 6:23 PM, Leon Romanovsky wrote:
>>> On Wed, Dec 22, 2021 at 11:35:44AM +0800, Cheng Xu wrote:
>>>>
>>>
>>> <...>
>>>
>>>>>>
>>>>>> For the ECS instance with RDMA enabled, there are two kinds of devices
>>>>>> allocated, one for ERDMA, and one for the original netdev (virtio-net).
>>>>>> They are different PCI deivces. ERDMA driver can get the information about
>>>>>> which netdev attached to in its PCIe barspace (by MAC address matching).
>>>>>
>>>>> This is very questionable. The netdev part should be kept in the
>>>>> drivers/ethernet/... part of the kernel.
>>>>>
>>>>> Thanks
>>>>
>>>> The net device used in Alibaba ECS instance is virtio-net device, driven
>>>> by virtio-pci/virtio-net drivers. ERDMA device does not need its own net
>>>> device, and will be attached to an existed virtio-net device. The
>>>> relationship between ibdev and netdev in erdma is similar to siw/rxe.
>>>
>>> siw/rxe binds through RDMA_NLDEV_CMD_NEWLINK netlink command and not
>>> through MAC's matching.
>>>
>>> Thanks
>>
>> Neither siw/rxe nor erdma needs to implement the netdev part; this is
>> what I wanted to express when I said 'similar'.
>> What you mentioned (the bind mechanism) is one major difference between
>> erdma and siw/rxe. For siw/rxe, the user can attach an ibdev to any
>> netdev they want, but that is not true for erdma. When a user buys the
>> erdma service, they must specify which ENI (elastic network interface)
>> it is bound to, which means the attached erdma device can only be bound
>> to that specific netdev. Because MAC addresses are unique within our
>> ECS instance, we use the MAC address as the identification, so the
>> driver knows which netdev it should be bound to.
> 
> It really doesn't match our driver binding model to rely on MAC
> addresses.
> 
> Our standard model would expect that the virtio-net driver would
> detect it has RDMA capability and spawn an aux device to link the two
> things together.
> 
> Using net notifiers to try to link the lifecycles together has been a
> mess so far.

Thanks for your explanation.

I guess this model requires that the netdev and its associated ibdev
share the same physical hardware (PCI device or platform device)? ERDMA
is a separate PCI device; it is only because the ENIs in our cloud are
virtio-net devices that we bind ERDMA to virtio-net. It can also work
with other types of netdev.
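
For illustration, the MAC-based lookup described above could look like
the following sketch (example_find_netdev_by_mac is a hypothetical
name, not the actual erdma code):

#include <linux/netdevice.h>
#include <linux/etherdevice.h>

/* Find the netdev whose address matches the MAC read from BAR space. */
static struct net_device *example_find_netdev_by_mac(const u8 *mac)
{
	struct net_device *ndev, *found = NULL;

	rcu_read_lock();
	for_each_netdev_rcu(&init_net, ndev) {
		if (ether_addr_equal(ndev->dev_addr, mac)) {
			found = ndev;
			dev_hold(found);	/* caller must dev_put() */
			break;
		}
	}
	rcu_read_unlock();
	return found;
}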

As you and Leon said, using net notifiers is not a good approach, so I
am reworking our bind mechanism to use RDMA_NLDEV_CMD_NEWLINK instead.
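
A minimal sketch of what that could look like, modeled on how siw/rxe
hook into RDMA_NLDEV_CMD_NEWLINK through rdma_link_ops (erdma_newlink
here is a hypothetical stub, not the final code):

#include <rdma/rdma_netlink.h>

static int erdma_newlink(const char *ibdev_name, struct net_device *ndev)
{
	/*
	 * Check that ndev is the ENI this erdma device was sold with,
	 * e.g. by comparing ndev->dev_addr against the MAC in BAR space,
	 * then register the ibdev under ibdev_name.
	 */
	return 0;
}

static struct rdma_link_ops erdma_link_ops = {
	.type = "erdma",
	.newlink = erdma_newlink,
};

/* In module init/exit: rdma_link_register()/rdma_link_unregister(). */

Userspace would then request the binding explicitly with iproute2, for
example:

  rdma link add erdma0 type erdma netdev eth0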

Thanks,
Cheng Xu

> Jason

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2022-01-10 10:08 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-21  2:48 [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 01/11] RDMA: Add ERDMA to rdma_driver_id definition Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 02/11] RDMA/erdma: Add the hardware related definitions Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 03/11] RDMA/erdma: Add main include file Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 04/11] RDMA/erdma: Add cmdq implementation Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 05/11] RDMA/erdma: Add event queue implementation Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 06/11] RDMA/erdma: Add verbs header file Cheng Xu
2021-12-21 13:28   ` Leon Romanovsky
2021-12-22  2:36     ` Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 07/11] RDMA/erdma: Add verbs implementation Cheng Xu
2021-12-21 13:32   ` Leon Romanovsky
2021-12-21 15:20     ` Bernard Metzler
2021-12-22  3:11       ` Cheng Xu
2021-12-22  4:18         ` Cheng Xu
2021-12-22 12:46         ` Bernard Metzler
2021-12-23  8:38           ` Cheng Xu
2021-12-22  2:50     ` Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 08/11] RDMA/erdma: Add connection management (CM) support Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 09/11] RDMA/erdma: Add the erdma module Cheng Xu
2021-12-21 13:26   ` Leon Romanovsky
2021-12-22  2:33     ` Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 10/11] RDMA/erdma: Add the ABI definitions Cheng Xu
2021-12-21 11:57   ` kernel test robot
2021-12-21 11:57     ` kernel test robot
2021-12-22 16:14   ` kernel test robot
2021-12-22 16:14     ` kernel test robot
2021-12-23 15:46   ` Yanjun Zhu
2021-12-23 18:45     ` Leon Romanovsky
2021-12-23 22:55       ` Yanjun Zhu
2021-12-24  6:04         ` Leon Romanovsky
2021-12-24  7:54           ` Yanjun Zhu
2021-12-24 18:11             ` Leon Romanovsky
2021-12-24  7:12         ` Cheng Xu
2021-12-24  8:02           ` Yanjun Zhu
2021-12-24 18:19           ` Leon Romanovsky
2021-12-25  0:03             ` Yanjun Zhu
2021-12-25  3:36             ` Cheng Xu
2021-12-21  2:48 ` [PATCH rdma-next 11/11] RDMA/erdma: Add driver to kernel build environment Cheng Xu
2021-12-22  0:58   ` kernel test robot
2021-12-22  0:58     ` kernel test robot
2021-12-21 13:09 ` [PATCH rdma-next 00/11] Elastic RDMA Adapter (ERDMA) driver Leon Romanovsky
2021-12-22  3:35   ` Cheng Xu
2021-12-23 10:23     ` Leon Romanovsky
2021-12-23 12:59       ` Cheng Xu
2021-12-23 13:44         ` Leon Romanovsky
2021-12-24  7:07           ` Cheng Xu
2021-12-24 18:26             ` Leon Romanovsky
2021-12-25  2:54               ` Cheng Xu
2021-12-25  2:57               ` Cheng Xu
2021-12-25  3:03               ` [Please ignore the two former responses]Re: " Cheng Xu
2022-01-07 14:24         ` Jason Gunthorpe
2022-01-10 10:07           ` Cheng Xu
