* [PATCH rdma-core 0/4] libhns: Add support for direct WQE
@ 2021-05-28  9:32 Weihang Li
  2021-05-28  9:32 ` [PATCH rdma-core 1/4] Update kernel headers Weihang Li
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Weihang Li @ 2021-05-28  9:32 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm

Direct WQE is a mechanism that fills a WQE directly into the hardware. Under
light load, the WQE is written into the PCIe BAR space of the hardware,
which saves one memory access operation and therefore reduces the latency.

This series first refactors the current UAR mmap process to add a branch for
direct WQE, then fixes an issue in the doorbell-write interface, and finally
enables the feature.

The kernel part is named "RDMA/hns: Add support for userspace Direct WQE".

Lang Cheng (1):
  libhns: Fixes data type when writing doorbell

Weihang Li (1):
  Update kernel headers

Xi Wang (1):
  libhns: Refactor hns uar mmap flow

Yixing Liu (1):
  libhns: Add support for direct wqe

 kernel-headers/rdma/hns-abi.h    |  6 +++
 providers/hns/hns_roce_u.c       | 76 ++++++++++++++++++++++++--------------
 providers/hns/hns_roce_u.h       | 19 +++++++++-
 providers/hns/hns_roce_u_db.h    | 13 ++-----
 providers/hns/hns_roce_u_hw_v1.c | 17 +++++----
 providers/hns/hns_roce_u_hw_v2.c | 80 ++++++++++++++++++++++++++++++++--------
 providers/hns/hns_roce_u_hw_v2.h | 29 ++++++++-------
 providers/hns/hns_roce_u_verbs.c | 39 ++++++++++++++++++++
 8 files changed, 204 insertions(+), 75 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH rdma-core 1/4] Update kernel headers
  2021-05-28  9:32 [PATCH rdma-core 0/4] libhns: Add support for direct WQE Weihang Li
@ 2021-05-28  9:32 ` Weihang Li
  2021-05-28  9:32 ` [PATCH rdma-core 2/4] libhns: Refactor hns uar mmap flow Weihang Li
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Weihang Li @ 2021-05-28  9:32 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm

To commit ?? ("RDMA/hns: Support direct WQE of userspace").

Signed-off-by: Weihang Li <liweihang@huawei.com>
---
 kernel-headers/rdma/hns-abi.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel-headers/rdma/hns-abi.h b/kernel-headers/rdma/hns-abi.h
index 42b1776..248c611 100644
--- a/kernel-headers/rdma/hns-abi.h
+++ b/kernel-headers/rdma/hns-abi.h
@@ -77,6 +77,7 @@ enum hns_roce_qp_cap_flags {
 	HNS_ROCE_QP_CAP_RQ_RECORD_DB = 1 << 0,
 	HNS_ROCE_QP_CAP_SQ_RECORD_DB = 1 << 1,
 	HNS_ROCE_QP_CAP_OWNER_DB = 1 << 2,
+	HNS_ROCE_QP_CAP_DIRECT_WQE = 1 << 5,
 };
 
 struct hns_roce_ib_create_qp_resp {
@@ -94,4 +95,9 @@ struct hns_roce_ib_alloc_pd_resp {
 	__u32 pdn;
 };
 
+enum {
+	HNS_ROCE_MMAP_REGULAR_PAGE,
+	HNS_ROCE_MMAP_DWQE_PAGE,
+};
+
 #endif /* HNS_ABI_USER_H */
-- 
2.7.4



* [PATCH rdma-core 2/4] libhns: Refactor hns uar mmap flow
  2021-05-28  9:32 [PATCH rdma-core 0/4] libhns: Add support for direct WQE Weihang Li
  2021-05-28  9:32 ` [PATCH rdma-core 1/4] Update kernel headers Weihang Li
@ 2021-05-28  9:32 ` Weihang Li
  2021-05-28  9:32 ` [PATCH rdma-core 3/4] libhns: Fixes data type when writing doorbell Weihang Li
  2021-05-28  9:32 ` [PATCH rdma-core 4/4] libhns: Add support for direct wqe Weihang Li
  3 siblings, 0 replies; 13+ messages in thread
From: Weihang Li @ 2021-05-28  9:32 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm

From: Xi Wang <wangxi11@huawei.com>

Classify the UAR address by encoding the UAR type and start page into the
offset used for the RDMA I/O mmap.

Signed-off-by: Xi Wang <wangxi11@huawei.com>
Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
---
 providers/hns/hns_roce_u.c | 76 ++++++++++++++++++++++++++++++----------------
 providers/hns/hns_roce_u.h | 12 ++++++++
 2 files changed, 61 insertions(+), 27 deletions(-)

diff --git a/providers/hns/hns_roce_u.c b/providers/hns/hns_roce_u.c
index 3b31ad3..2d9c46f 100644
--- a/providers/hns/hns_roce_u.c
+++ b/providers/hns/hns_roce_u.c
@@ -95,16 +95,58 @@ static const struct verbs_context_ops hns_common_ops = {
 	.get_srq_num = hns_roce_u_get_srq_num,
 };
 
-static struct verbs_context *hns_roce_alloc_context(struct ibv_device *ibdev,
-						    int cmd_fd,
-						    void *private_data)
+static off_t get_uar_mmap_offset(unsigned long idx, int page_size, int cmd)
+{
+	off_t offset = 0;
+
+	hns_roce_mmap_set_command(cmd, &offset);
+	hns_roce_mmap_set_index(idx, &offset);
+
+	return offset * page_size;
+}
+
+static int hns_roce_mmap(struct hns_roce_device *hr_dev,
+			 struct hns_roce_context *context, int cmd_fd)
+{
+	int page_size = hr_dev->page_size;
+	off_t offset;
+
+	offset = get_uar_mmap_offset(0, page_size, HNS_ROCE_MMAP_REGULAR_PAGE);
+	context->uar = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
+			    MAP_SHARED, cmd_fd, offset);
+	if (context->uar == MAP_FAILED)
+		return -EINVAL;
+
+	offset = get_uar_mmap_offset(1, page_size, HNS_ROCE_MMAP_REGULAR_PAGE);
+
+	if (hr_dev->hw_version == HNS_ROCE_HW_VER1) {
+		/*
+		 * when vma->vm_pgoff is 1, the cq_tptr_base includes 64K CQ,
+		 * a pointer of CQ need 2B size
+		 */
+		context->cq_tptr_base = mmap(NULL, HNS_ROCE_CQ_DB_BUF_SIZE,
+					     PROT_READ | PROT_WRITE, MAP_SHARED,
+					     cmd_fd, offset);
+		if (context->cq_tptr_base == MAP_FAILED)
+			goto db_free;
+	}
+
+	return 0;
+
+db_free:
+	munmap(context->uar, hr_dev->page_size);
+
+	return -EINVAL;
+}
+
+static struct verbs_context *
+hns_roce_alloc_context(struct ibv_device *ibdev, int cmd_fd, void *private_data)
 {
 	struct hns_roce_device *hr_dev = to_hr_dev(ibdev);
 	struct hns_roce_alloc_ucontext_resp resp = {};
 	struct ibv_device_attr dev_attrs;
 	struct hns_roce_context *context;
 	struct ibv_get_context cmd;
-	int offset = 0;
 	int i;
 
 	context = verbs_init_and_alloc_context(ibdev, cmd_fd, context, ibv_ctx,
@@ -154,35 +196,15 @@ static struct verbs_context *hns_roce_alloc_context(struct ibv_device *ibdev,
 	context->max_srq_wr = dev_attrs.max_srq_wr;
 	context->max_srq_sge = dev_attrs.max_srq_sge;
 
-	context->uar = mmap(NULL, hr_dev->page_size, PROT_READ | PROT_WRITE,
-			    MAP_SHARED, cmd_fd, offset);
-	if (context->uar == MAP_FAILED)
-		goto err_free;
-
-	offset += hr_dev->page_size;
-
-	if (hr_dev->hw_version == HNS_ROCE_HW_VER1) {
-		/*
-		 * when vma->vm_pgoff is 1, the cq_tptr_base includes 64K CQ,
-		 * a pointer of CQ need 2B size
-		 */
-		context->cq_tptr_base = mmap(NULL, HNS_ROCE_CQ_DB_BUF_SIZE,
-					     PROT_READ | PROT_WRITE, MAP_SHARED,
-					     cmd_fd, offset);
-		if (context->cq_tptr_base == MAP_FAILED)
-			goto db_free;
-	}
-
 	pthread_spin_init(&context->uar_lock, PTHREAD_PROCESS_PRIVATE);
 
 	verbs_set_ops(&context->ibv_ctx, &hns_common_ops);
 	verbs_set_ops(&context->ibv_ctx, &hr_dev->u_hw->hw_ops);
 
-	return &context->ibv_ctx;
+	if (hns_roce_mmap(hr_dev, context, cmd_fd))
+		goto err_free;
 
-db_free:
-	munmap(context->uar, hr_dev->page_size);
-	context->uar = NULL;
+	return &context->ibv_ctx;
 
 err_free:
 	verbs_uninit_context(&context->ibv_ctx);
diff --git a/providers/hns/hns_roce_u.h b/providers/hns/hns_roce_u.h
index 0d7abd8..3c4b162 100644
--- a/providers/hns/hns_roce_u.h
+++ b/providers/hns/hns_roce_u.h
@@ -357,6 +357,18 @@ static inline struct hns_roce_ah *to_hr_ah(struct ibv_ah *ibv_ah)
 	return container_of(ibv_ah, struct hns_roce_ah, ibv_ah);
 }
 
+/* command value is offset[15:8] */
+static inline void hns_roce_mmap_set_command(int command, off_t *offset)
+{
+	*offset |= (command & 0xff) << 8;
+}
+
+/* index value is offset[63:16] | offset[7:0] */
+static inline void hns_roce_mmap_set_index(unsigned long index, off_t *offset)
+{
+	*offset |= (index & 0xff) | ((index >> 8) << 16);
+}
+
 int hns_roce_u_query_device(struct ibv_context *context,
 			    const struct ibv_query_device_ex_input *input,
 			    struct ibv_device_attr_ex *attr, size_t attr_size);
-- 
2.7.4



* [PATCH rdma-core 3/4] libhns: Fixes data type when writing doorbell
  2021-05-28  9:32 [PATCH rdma-core 0/4] libhns: Add support for direct WQE Weihang Li
  2021-05-28  9:32 ` [PATCH rdma-core 1/4] Update kernel headers Weihang Li
  2021-05-28  9:32 ` [PATCH rdma-core 2/4] libhns: Refactor hns uar mmap flow Weihang Li
@ 2021-05-28  9:32 ` Weihang Li
  2021-06-04 14:43   ` Jason Gunthorpe
  2021-05-28  9:32 ` [PATCH rdma-core 4/4] libhns: Add support for direct wqe Weihang Li
  3 siblings, 1 reply; 13+ messages in thread
From: Weihang Li @ 2021-05-28  9:32 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm

From: Lang Cheng <chenglang@huawei.com>

The doorbell data is a __le32[] array rather than uint32_t[], and the DB
register should be written with little-endian data rather than a uint64_t.

Signed-off-by: Lang Cheng <chenglang@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
---
 providers/hns/hns_roce_u_db.h    | 13 +++----------
 providers/hns/hns_roce_u_hw_v1.c | 17 +++++++++--------
 providers/hns/hns_roce_u_hw_v2.c | 28 +++++++++++++++++-----------
 3 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/providers/hns/hns_roce_u_db.h b/providers/hns/hns_roce_u_db.h
index b44e64d..453fa5a 100644
--- a/providers/hns/hns_roce_u_db.h
+++ b/providers/hns/hns_roce_u_db.h
@@ -37,18 +37,11 @@
 #ifndef _HNS_ROCE_U_DB_H
 #define _HNS_ROCE_U_DB_H
 
-#if __BYTE_ORDER == __LITTLE_ENDIAN
-#define HNS_ROCE_PAIR_TO_64(val) ((uint64_t) val[1] << 32 | val[0])
-#elif __BYTE_ORDER == __BIG_ENDIAN
-#define HNS_ROCE_PAIR_TO_64(val) ((uint64_t) val[0] << 32 | val[1])
-#else
-#error __BYTE_ORDER not defined
-#endif
+#define HNS_ROCE_WORD_NUM 2
 
-static inline void hns_roce_write64(uint32_t val[2],
-				    struct hns_roce_context *ctx, int offset)
+static inline void hns_roce_write64(__le64 *dest, __le32 val[HNS_ROCE_WORD_NUM])
 {
-	*(volatile uint64_t *) (ctx->uar + offset) = HNS_ROCE_PAIR_TO_64(val);
+	*(volatile __le64 *)dest = *(__le64 *)val;
 }
 
 void *hns_roce_alloc_db(struct hns_roce_context *ctx,
diff --git a/providers/hns/hns_roce_u_hw_v1.c b/providers/hns/hns_roce_u_hw_v1.c
index 8f0a71a..d00230c 100644
--- a/providers/hns/hns_roce_u_hw_v1.c
+++ b/providers/hns/hns_roce_u_hw_v1.c
@@ -65,7 +65,7 @@ static void hns_roce_update_rq_head(struct hns_roce_context *ctx,
 
 	udma_to_device_barrier();
 
-	hns_roce_write64((uint32_t *)&rq_db, ctx, ROCEE_DB_OTHERS_L_0_REG);
+	hns_roce_write64(ctx->uar + ROCEE_DB_OTHERS_L_0_REG, (__le32 *)&rq_db);
 }
 
 static void hns_roce_update_sq_head(struct hns_roce_context *ctx,
@@ -84,7 +84,7 @@ static void hns_roce_update_sq_head(struct hns_roce_context *ctx,
 
 	udma_to_device_barrier();
 
-	hns_roce_write64((uint32_t *)&sq_db, ctx, ROCEE_DB_SQ_L_0_REG);
+	hns_roce_write64(ctx->uar + ROCEE_DB_SQ_L_0_REG, (__le32 *)&sq_db);
 }
 
 static void hns_roce_update_cq_cons_index(struct hns_roce_context *ctx,
@@ -102,7 +102,7 @@ static void hns_roce_update_cq_cons_index(struct hns_roce_context *ctx,
 		       CQ_DB_U32_4_CONS_IDX_S,
 		       cq->cons_index & ((cq->cq_depth << 1) - 1));
 
-	hns_roce_write64((uint32_t *)&cq_db, ctx, ROCEE_DB_OTHERS_L_0_REG);
+	hns_roce_write64(ctx->uar + ROCEE_DB_OTHERS_L_0_REG, (__le32 *)&cq_db);
 }
 
 static void hns_roce_handle_error_cqe(struct hns_roce_cqe *cqe,
@@ -422,10 +422,11 @@ static int hns_roce_u_v1_poll_cq(struct ibv_cq *ibvcq, int ne,
  */
 static int hns_roce_u_v1_arm_cq(struct ibv_cq *ibvcq, int solicited)
 {
-	uint32_t ci;
-	uint32_t solicited_flag;
-	struct hns_roce_cq_db cq_db = {};
+	struct hns_roce_context *ctx = to_hr_ctx(ibvcq->context);
 	struct hns_roce_cq *cq = to_hr_cq(ibvcq);
+	struct hns_roce_cq_db cq_db = {};
+	uint32_t solicited_flag;
+	uint32_t ci;
 
 	ci = cq->cons_index & ((cq->cq_depth << 1) - 1);
 	solicited_flag = solicited ? HNS_ROCE_CQ_DB_REQ_SOL :
@@ -441,8 +442,8 @@ static int hns_roce_u_v1_arm_cq(struct ibv_cq *ibvcq, int solicited)
 	roce_set_field(cq_db.u32_4, CQ_DB_U32_4_CONS_IDX_M,
 		       CQ_DB_U32_4_CONS_IDX_S, ci);
 
-	hns_roce_write64((uint32_t *)&cq_db, to_hr_ctx(ibvcq->context),
-			  ROCEE_DB_OTHERS_L_0_REG);
+	hns_roce_write64(ctx->uar + ROCEE_DB_OTHERS_L_0_REG,
+			 (uint32_t *)&cq_db);
 	return 0;
 }
 
diff --git a/providers/hns/hns_roce_u_hw_v2.c b/providers/hns/hns_roce_u_hw_v2.c
index 2308f78..aa57cc4 100644
--- a/providers/hns/hns_roce_u_hw_v2.c
+++ b/providers/hns/hns_roce_u_hw_v2.c
@@ -293,7 +293,8 @@ static void hns_roce_update_rq_db(struct hns_roce_context *ctx,
 		       HNS_ROCE_V2_RQ_DB);
 	rq_db.parameter = htole32(rq_head);
 
-	hns_roce_write64((uint32_t *)&rq_db, ctx, ROCEE_VF_DB_CFG0_OFFSET);
+	hns_roce_write64(ctx->uar + ROCEE_VF_DB_CFG0_OFFSET,
+			 (__le32 *)&rq_db);
 }
 
 static void hns_roce_update_sq_db(struct hns_roce_context *ctx,
@@ -308,7 +309,8 @@ static void hns_roce_update_sq_db(struct hns_roce_context *ctx,
 	sq_db.parameter = htole32(sq_head);
 	roce_set_field(sq_db.parameter, DB_PARAM_SL_M, DB_PARAM_SL_S, sl);
 
-	hns_roce_write64((uint32_t *)&sq_db, ctx, ROCEE_VF_DB_CFG0_OFFSET);
+	hns_roce_write64(ctx->uar + ROCEE_VF_DB_CFG0_OFFSET,
+			 (__le32 *)&sq_db);
 }
 
 static void hns_roce_v2_update_cq_cons_index(struct hns_roce_context *ctx,
@@ -325,7 +327,8 @@ static void hns_roce_v2_update_cq_cons_index(struct hns_roce_context *ctx,
 	roce_set_field(cq_db.parameter, DB_PARAM_CQ_CMD_SN_M,
 		       DB_PARAM_CQ_CMD_SN_S, 1);
 
-	hns_roce_write64((uint32_t *)&cq_db, ctx, ROCEE_VF_DB_CFG0_OFFSET);
+	hns_roce_write64(ctx->uar + ROCEE_VF_DB_CFG0_OFFSET,
+			 (__le32 *)&cq_db);
 }
 
 static struct hns_roce_qp *hns_roce_v2_find_qp(struct hns_roce_context *ctx,
@@ -659,11 +662,12 @@ static int hns_roce_u_v2_poll_cq(struct ibv_cq *ibvcq, int ne,
 
 static int hns_roce_u_v2_arm_cq(struct ibv_cq *ibvcq, int solicited)
 {
-	uint32_t ci;
-	uint32_t cmd_sn;
-	uint32_t solicited_flag;
-	struct hns_roce_db cq_db = {};
+	struct hns_roce_context *ctx = to_hr_ctx(ibvcq->context);
 	struct hns_roce_cq *cq = to_hr_cq(ibvcq);
+	struct hns_roce_db cq_db = {};
+	uint32_t solicited_flag;
+	uint32_t cmd_sn;
+	uint32_t ci;
 
 	ci = cq->cons_index & ((cq->cq_depth << 1) - 1);
 	cmd_sn = cq->arm_sn & HNS_ROCE_CMDSN_MASK;
@@ -681,8 +685,9 @@ static int hns_roce_u_v2_arm_cq(struct ibv_cq *ibvcq, int solicited)
 		       DB_PARAM_CQ_CMD_SN_S, cmd_sn);
 	roce_set_bit(cq_db.parameter, DB_PARAM_CQ_NOTIFY_S, solicited_flag);
 
-	hns_roce_write64((uint32_t *)&cq_db, to_hr_ctx(ibvcq->context),
-			  ROCEE_VF_DB_CFG0_OFFSET);
+	hns_roce_write64(ctx->uar + ROCEE_VF_DB_CFG0_OFFSET,
+			 (uint32_t *)&cq_db);
+
 	return 0;
 }
 
@@ -1693,8 +1698,9 @@ static int hns_roce_u_v2_post_srq_recv(struct ibv_srq *ib_srq,
 		srq_db.parameter = htole32(srq->idx_que.head &
 					   DB_PARAM_SRQ_PRODUCER_COUNTER_M);
 
-		hns_roce_write64((uint32_t *)&srq_db, ctx,
-				 ROCEE_VF_DB_CFG0_OFFSET);
+		hns_roce_write64(ctx->uar + ROCEE_VF_DB_CFG0_OFFSET,
+				 (uint32_t *)&srq_db);
+
 	}
 
 	pthread_spin_unlock(&srq->lock);
-- 
2.7.4



* [PATCH rdma-core 4/4] libhns: Add support for direct wqe
  2021-05-28  9:32 [PATCH rdma-core 0/4] libhns: Add support for direct WQE Weihang Li
                   ` (2 preceding siblings ...)
  2021-05-28  9:32 ` [PATCH rdma-core 3/4] libhns: Fixes data type when writing doorbell Weihang Li
@ 2021-05-28  9:32 ` Weihang Li
  2021-06-04 14:50   ` Jason Gunthorpe
  3 siblings, 1 reply; 13+ messages in thread
From: Weihang Li @ 2021-05-28  9:32 UTC (permalink / raw)
  To: jgg, leon; +Cc: linux-rdma, linuxarm

From: Yixing Liu <liuyixing1@huawei.com>

The current WQE write mechanism writes to DDR first and then notifies the
hardware through the doorbell to read the data. Direct WQE is a mechanism
that fills a WQE directly into the hardware. Under light load, the WQE is
written into the PCIe BAR space of the hardware, which saves one memory
access operation and therefore reduces the latency. SIMD instructions allow
the CPU to write 512 bits to device memory at one time, so they can be used
for posting a direct WQE.

Signed-off-by: Yixing Liu <liuyixing1@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Signed-off-by: Weihang Li <liweihang@huawei.com>
---
 providers/hns/hns_roce_u.h       |  7 ++++--
 providers/hns/hns_roce_u_hw_v2.c | 52 ++++++++++++++++++++++++++++++++++++----
 providers/hns/hns_roce_u_hw_v2.h | 29 ++++++++++++----------
 providers/hns/hns_roce_u_verbs.c | 39 ++++++++++++++++++++++++++++++
 4 files changed, 108 insertions(+), 19 deletions(-)

diff --git a/providers/hns/hns_roce_u.h b/providers/hns/hns_roce_u.h
index 3c4b162..2ffb604 100644
--- a/providers/hns/hns_roce_u.h
+++ b/providers/hns/hns_roce_u.h
@@ -81,6 +81,8 @@
 
 #define INVALID_SGE_LENGTH 0x80000000
 
+#define HNS_ROCE_DWQE_PAGE_SIZE 65536
+
 #define HNS_ROCE_ADDRESS_MASK 0xFFFFFFFF
 #define HNS_ROCE_ADDRESS_SHIFT 32
 
@@ -280,13 +282,14 @@ struct hns_roce_qp {
 	struct hns_roce_sge_ex		ex_sge;
 	unsigned int			next_sge;
 	int				port_num;
-	int				sl;
+	uint8_t				sl;
 	unsigned int			qkey;
 	enum ibv_mtu			path_mtu;
 
 	struct hns_roce_rinl_buf	rq_rinl_buf;
 	unsigned long			flags;
 	int				refcnt; /* specially used for XRC */
+	void				*dwqe_page;
 };
 
 struct hns_roce_av {
@@ -417,7 +420,7 @@ hns_roce_u_create_qp_ex(struct ibv_context *context,
 
 struct ibv_qp *hns_roce_u_open_qp(struct ibv_context *context,
 				  struct ibv_qp_open_attr *attr);
-
+void hns_roce_v2_clear_qp(struct hns_roce_context *ctx, struct hns_roce_qp *qp);
 int hns_roce_u_query_qp(struct ibv_qp *ibqp, struct ibv_qp_attr *attr,
 			int attr_mask, struct ibv_qp_init_attr *init_attr);
 
diff --git a/providers/hns/hns_roce_u_hw_v2.c b/providers/hns/hns_roce_u_hw_v2.c
index aa57cc4..28d455b 100644
--- a/providers/hns/hns_roce_u_hw_v2.c
+++ b/providers/hns/hns_roce_u_hw_v2.c
@@ -33,10 +33,15 @@
 #define _GNU_SOURCE
 #include <stdio.h>
 #include <string.h>
+#include <sys/mman.h>
 #include "hns_roce_u.h"
 #include "hns_roce_u_db.h"
 #include "hns_roce_u_hw_v2.h"
 
+#if defined(__aarch64__) || defined(__arm__)
+#include <arm_neon.h>
+#endif
+
 #define HR_IBV_OPC_MAP(ib_key, hr_key) \
 		[IBV_WR_ ## ib_key] = HNS_ROCE_WQE_OP_ ## hr_key
 
@@ -313,6 +318,39 @@ static void hns_roce_update_sq_db(struct hns_roce_context *ctx,
 			 (__le32 *)&sq_db);
 }
 
+static inline void hns_roce_write512(uint64_t *dest, uint64_t *val)
+{
+#if defined(__aarch64__) || defined(__arm__)
+	uint64x2x4_t dwqe;
+
+	/* Load multiple 4-element structures to 4 registers */
+	dwqe = vld4q_u64(val);
+	/* store multiple 4-element structures from 4 registers */
+	vst4q_u64(dest, dwqe);
+#else
+	int i;
+
+	for (i = 0; i < HNS_ROCE_WRITE_TIMES; i++)
+		hns_roce_write64(dest + i, val + HNS_ROCE_WORD_NUM * i);
+#endif
+}
+
+static void hns_roce_write_dwqe(struct hns_roce_qp *qp, void *wqe)
+{
+	struct hns_roce_rc_sq_wqe *rc_sq_wqe = wqe;
+
+	/* All kinds of DirectWQE have the same header field layout */
+	roce_set_bit(rc_sq_wqe->byte_4, RC_SQ_WQE_BYTE_4_FLAG_S, 1);
+	roce_set_field(rc_sq_wqe->byte_4, RC_SQ_WQE_BYTE_4_DB_SL_L_M,
+		       RC_SQ_WQE_BYTE_4_DB_SL_L_S, qp->sl);
+	roce_set_field(rc_sq_wqe->byte_4, RC_SQ_WQE_BYTE_4_DB_SL_H_M,
+		       RC_SQ_WQE_BYTE_4_DB_SL_H_S, qp->sl >> HNS_ROCE_SL_SHIFT);
+	roce_set_field(rc_sq_wqe->byte_4, RC_SQ_WQE_BYTE_4_WQE_INDEX_M,
+		       RC_SQ_WQE_BYTE_4_WQE_INDEX_S, qp->sq.head);
+
+	hns_roce_write512(qp->dwqe_page, wqe);
+}
+
 static void hns_roce_v2_update_cq_cons_index(struct hns_roce_context *ctx,
 					     struct hns_roce_cq *cq)
 {
@@ -342,8 +380,7 @@ static struct hns_roce_qp *hns_roce_v2_find_qp(struct hns_roce_context *ctx,
 		return NULL;
 }
 
-static void hns_roce_v2_clear_qp(struct hns_roce_context *ctx,
-				 struct hns_roce_qp *qp)
+void hns_roce_v2_clear_qp(struct hns_roce_context *ctx, struct hns_roce_qp *qp)
 {
 	uint32_t qpn = qp->verbs_qp.qp.qp_num;
 	uint32_t tind = (qpn & (ctx->num_qps - 1)) >> ctx->qp_table_shift;
@@ -1240,6 +1277,7 @@ int hns_roce_u_v2_post_send(struct ibv_qp *ibvqp, struct ibv_send_wr *wr,
 			break;
 		case IBV_QPT_UD:
 			ret = set_ud_wqe(wqe, qp, wr, nreq, &sge_info);
+			qp->sl = to_hr_ah(wr->wr.ud.ah)->av.sl;
 			break;
 		default:
 			ret = EINVAL;
@@ -1255,10 +1293,13 @@ out:
 	if (likely(nreq)) {
 		qp->sq.head += nreq;
 		qp->next_sge = sge_info.start_idx;
-
 		udma_to_device_barrier();
 
-		hns_roce_update_sq_db(ctx, ibvqp->qp_num, qp->sl, qp->sq.head);
+		if (nreq == 1 && (qp->flags & HNS_ROCE_QP_CAP_DIRECT_WQE))
+			hns_roce_write_dwqe(qp, wqe);
+		else
+			hns_roce_update_sq_db(ctx, qp->verbs_qp.qp.qp_num, qp->sl,
+					      qp->sq.head);
 
 		if (qp->flags & HNS_ROCE_QP_CAP_SQ_RECORD_DB)
 			*(qp->sdb) = qp->sq.head & 0xffff;
@@ -1564,6 +1605,9 @@ static int hns_roce_u_v2_destroy_qp(struct ibv_qp *ibqp)
 
 	hns_roce_unlock_cqs(ibqp);
 
+	if (qp->flags & HNS_ROCE_QP_CAP_DIRECT_WQE)
+		munmap(qp->dwqe_page, HNS_ROCE_DWQE_PAGE_SIZE);
+
 	hns_roce_free_qp_buf(qp, ctx);
 
 	free(qp);
diff --git a/providers/hns/hns_roce_u_hw_v2.h b/providers/hns/hns_roce_u_hw_v2.h
index c13d82e..b319826 100644
--- a/providers/hns/hns_roce_u_hw_v2.h
+++ b/providers/hns/hns_roce_u_hw_v2.h
@@ -40,6 +40,8 @@
 
 #define HNS_ROCE_CMDSN_MASK			0x3
 
+#define HNS_ROCE_SL_SHIFT 2
+
 /* V2 REG DEFINITION */
 #define ROCEE_VF_DB_CFG0_OFFSET			0x0230
 
@@ -133,6 +135,8 @@ struct hns_roce_db {
 #define DB_BYTE_4_CMD_S 24
 #define DB_BYTE_4_CMD_M GENMASK(27, 24)
 
+#define DB_BYTE_4_FLAG_S 31
+
 #define DB_PARAM_SRQ_PRODUCER_COUNTER_S 0
 #define DB_PARAM_SRQ_PRODUCER_COUNTER_M GENMASK(15, 0)
 
@@ -216,8 +220,16 @@ struct hns_roce_rc_sq_wqe {
 };
 
 #define RC_SQ_WQE_BYTE_4_OPCODE_S 0
-#define RC_SQ_WQE_BYTE_4_OPCODE_M \
-	(((1UL << 5) - 1) << RC_SQ_WQE_BYTE_4_OPCODE_S)
+#define RC_SQ_WQE_BYTE_4_OPCODE_M GENMASK(4, 0)
+
+#define RC_SQ_WQE_BYTE_4_DB_SL_L_S 5
+#define RC_SQ_WQE_BYTE_4_DB_SL_L_M GENMASK(6, 5)
+
+#define RC_SQ_WQE_BYTE_4_DB_SL_H_S 13
+#define RC_SQ_WQE_BYTE_4_DB_SL_H_M GENMASK(14, 13)
+
+#define RC_SQ_WQE_BYTE_4_WQE_INDEX_S 15
+#define RC_SQ_WQE_BYTE_4_WQE_INDEX_M GENMASK(30, 15)
 
 #define RC_SQ_WQE_BYTE_4_OWNER_S 7
 
@@ -239,6 +251,8 @@ struct hns_roce_rc_sq_wqe {
 
 #define RC_SQ_WQE_BYTE_4_RDMA_WRITE_S 22
 
+#define RC_SQ_WQE_BYTE_4_FLAG_S 31
+
 #define RC_SQ_WQE_BYTE_16_XRC_SRQN_S 0
 #define RC_SQ_WQE_BYTE_16_XRC_SRQN_M \
 	(((1UL << 24) - 1) << RC_SQ_WQE_BYTE_16_XRC_SRQN_S)
@@ -311,23 +325,12 @@ struct hns_roce_ud_sq_wqe {
 #define UD_SQ_WQE_OPCODE_S 0
 #define UD_SQ_WQE_OPCODE_M GENMASK(4, 0)
 
-#define UD_SQ_WQE_DB_SL_L_S 5
-#define UD_SQ_WQE_DB_SL_L_M GENMASK(6, 5)
-
-#define UD_SQ_WQE_DB_SL_H_S 13
-#define UD_SQ_WQE_DB_SL_H_M GENMASK(14, 13)
-
-#define UD_SQ_WQE_INDEX_S 15
-#define UD_SQ_WQE_INDEX_M GENMASK(30, 15)
-
 #define UD_SQ_WQE_OWNER_S 7
 
 #define UD_SQ_WQE_CQE_S 8
 
 #define UD_SQ_WQE_SE_S 11
 
-#define UD_SQ_WQE_FLAG_S 31
-
 #define UD_SQ_WQE_PD_S 0
 #define UD_SQ_WQE_PD_M GENMASK(23, 0)
 
diff --git a/providers/hns/hns_roce_u_verbs.c b/providers/hns/hns_roce_u_verbs.c
index 7b44829..f97144e 100644
--- a/providers/hns/hns_roce_u_verbs.c
+++ b/providers/hns/hns_roce_u_verbs.c
@@ -1115,6 +1115,37 @@ static int hns_roce_store_qp(struct hns_roce_context *ctx,
 	return 0;
 }
 
+static off_t get_dwqe_mmap_offset(unsigned long qpn, int page_size, int cmd)
+{
+	off_t offset = 0;
+	unsigned long idx;
+
+	idx = qpn * (HNS_ROCE_DWQE_PAGE_SIZE / page_size);
+
+	hns_roce_mmap_set_command(cmd, &offset);
+	hns_roce_mmap_set_index(idx, &offset);
+
+	return offset * page_size;
+}
+
+static int mmap_dwqe(struct ibv_context *ibv_ctx, struct hns_roce_qp *qp)
+{
+	struct hns_roce_device *hr_dev = to_hr_dev(ibv_ctx->device);
+	int page_size = hr_dev->page_size;
+	off_t offset;
+
+	offset = get_dwqe_mmap_offset(qp->verbs_qp.qp.qp_num, page_size,
+				      HNS_ROCE_MMAP_DWQE_PAGE);
+
+	qp->dwqe_page = mmap(NULL, HNS_ROCE_DWQE_PAGE_SIZE, PROT_WRITE,
+			     MAP_SHARED, ibv_ctx->cmd_fd, offset);
+
+	if (qp->dwqe_page == MAP_FAILED)
+		return -EINVAL;
+
+	return 0;
+}
+
 static int qp_exec_create_cmd(struct ibv_qp_init_attr_ex *attr,
 			      struct hns_roce_qp *qp,
 			      struct hns_roce_context *ctx)
@@ -1216,10 +1247,18 @@ static struct ibv_qp *create_qp(struct ibv_context *ibv_ctx,
 	if (ret)
 		goto err_store;
 
+	if (qp->flags & HNS_ROCE_QP_CAP_DIRECT_WQE) {
+		ret = mmap_dwqe(ibv_ctx, qp);
+		if (ret)
+			goto err_dwqe;
+	}
+
 	qp_setup_config(attr, qp, context);
 
 	return &qp->verbs_qp.qp;
 
+err_dwqe:
+	hns_roce_v2_clear_qp(context, qp);
 err_store:
 	ibv_cmd_destroy_qp(&qp->verbs_qp.qp);
 err_cmd:
-- 
2.7.4



* Re: [PATCH rdma-core 3/4] libhns: Fixes data type when writing doorbell
  2021-05-28  9:32 ` [PATCH rdma-core 3/4] libhns: Fixes data type when writing doorbell Weihang Li
@ 2021-06-04 14:43   ` Jason Gunthorpe
  2021-06-09  3:35     ` liweihang
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 14:43 UTC (permalink / raw)
  To: Weihang Li; +Cc: leon, linux-rdma, linuxarm

On Fri, May 28, 2021 at 05:32:58PM +0800, Weihang Li wrote:
> From: Lang Cheng <chenglang@huawei.com>
> 
> The doorbell data is a __le32[] value instead of uint32_t[], and the DB
> register should be written with a little-endian data instead of uint64_t.
> 
> Signed-off-by: Lang Cheng <chenglang@huawei.com>
> Signed-off-by: Weihang Li <liweihang@huawei.com>
>  providers/hns/hns_roce_u_db.h    | 13 +++----------
>  providers/hns/hns_roce_u_hw_v1.c | 17 +++++++++--------
>  providers/hns/hns_roce_u_hw_v2.c | 28 +++++++++++++++++-----------
>  3 files changed, 29 insertions(+), 29 deletions(-)
> 
> diff --git a/providers/hns/hns_roce_u_db.h b/providers/hns/hns_roce_u_db.h
> index b44e64d..453fa5a 100644
> +++ b/providers/hns/hns_roce_u_db.h
> @@ -37,18 +37,11 @@
>  #ifndef _HNS_ROCE_U_DB_H
>  #define _HNS_ROCE_U_DB_H
>  
> -#if __BYTE_ORDER == __LITTLE_ENDIAN
> -#define HNS_ROCE_PAIR_TO_64(val) ((uint64_t) val[1] << 32 | val[0])
> -#elif __BYTE_ORDER == __BIG_ENDIAN
> -#define HNS_ROCE_PAIR_TO_64(val) ((uint64_t) val[0] << 32 | val[1])
> -#else
> -#error __BYTE_ORDER not defined
> -#endif
> +#define HNS_ROCE_WORD_NUM 2
>  
> -static inline void hns_roce_write64(uint32_t val[2],
> -				    struct hns_roce_context *ctx, int offset)
> +static inline void hns_roce_write64(__le64 *dest, __le32 val[HNS_ROCE_WORD_NUM])
>  {
> -	*(volatile uint64_t *) (ctx->uar + offset) = HNS_ROCE_PAIR_TO_64(val);
> +	*(volatile __le64 *)dest = *(__le64 *)val;
>  }

Please use the macros in util/mmio.h

Jason


* Re: [PATCH rdma-core 4/4] libhns: Add support for direct wqe
  2021-05-28  9:32 ` [PATCH rdma-core 4/4] libhns: Add support for direct wqe Weihang Li
@ 2021-06-04 14:50   ` Jason Gunthorpe
  2021-06-11  9:20     ` liweihang
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2021-06-04 14:50 UTC (permalink / raw)
  To: Weihang Li; +Cc: leon, linux-rdma, linuxarm

On Fri, May 28, 2021 at 05:32:59PM +0800, Weihang Li wrote:
> diff --git a/providers/hns/hns_roce_u_hw_v2.c b/providers/hns/hns_roce_u_hw_v2.c
> index aa57cc4..28d455b 100644
> +++ b/providers/hns/hns_roce_u_hw_v2.c
> @@ -33,10 +33,15 @@
>  #define _GNU_SOURCE
>  #include <stdio.h>
>  #include <string.h>
> +#include <sys/mman.h>
>  #include "hns_roce_u.h"
>  #include "hns_roce_u_db.h"
>  #include "hns_roce_u_hw_v2.h"
>  
> +#if defined(__aarch64__) || defined(__arm__)
> +#include <arm_neon.h>
> +#endif
> +
>  #define HR_IBV_OPC_MAP(ib_key, hr_key) \
>  		[IBV_WR_ ## ib_key] = HNS_ROCE_WQE_OP_ ## hr_key
>  
> @@ -313,6 +318,39 @@ static void hns_roce_update_sq_db(struct hns_roce_context *ctx,
>  			 (__le32 *)&sq_db);
>  }
>  
> +static inline void hns_roce_write512(uint64_t *dest, uint64_t *val)
> +{
> +#if defined(__aarch64__) || defined(__arm__)
> +	uint64x2x4_t dwqe;
> +
> +	/* Load multiple 4-element structures to 4 registers */
> +	dwqe = vld4q_u64(val);
> +	/* store multiple 4-element structures from 4 registers */
> +	vst4q_u64(dest, dwqe);
> +#else
> +	int i;
> +
> +	for (i = 0; i < HNS_ROCE_WRITE_TIMES; i++)
> +		hns_roce_write64(dest + i, val + HNS_ROCE_WORD_NUM * i);
> +#endif
> +}

No code like this in providers. This should be done similarly to how
SSE is handled on x86

This is 

   mmio_memcpy_x64(dest, val, 64);

The above should be conditionalized to trigger NEON

#if defined(__aarch64__) || defined(__arm__)
static inline void __mmio_memcpy_x64_64b(..)
{..
    vst4q_u64(dest, vld4q_u64(src))
..}
#endif

#define mmio_memcpy_x64(dest, src, bytecount)
 ({if (__builtin_constant_p(bytecount == 64)
        __mmio_memcpy_x64_64b(dest,src,bytecount)
   ...

And I'm not sure what barriers you need for prot_device, but certainly
more than none. If you don't know then use the WC barriers

Jason


* Re: [PATCH rdma-core 3/4] libhns: Fixes data type when writing doorbell
  2021-06-04 14:43   ` Jason Gunthorpe
@ 2021-06-09  3:35     ` liweihang
  0 siblings, 0 replies; 13+ messages in thread
From: liweihang @ 2021-06-09  3:35 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, linux-rdma, Linuxarm

On 2021/6/4 22:43, Jason Gunthorpe wrote:
> On Fri, May 28, 2021 at 05:32:58PM +0800, Weihang Li wrote:
>> From: Lang Cheng <chenglang@huawei.com>
>>
>> The doorbell data is a __le32[] value instead of uint32_t[], and the DB
>> register should be written with a little-endian data instead of uint64_t.
>>
>> Signed-off-by: Lang Cheng <chenglang@huawei.com>
>> Signed-off-by: Weihang Li <liweihang@huawei.com>
>>  providers/hns/hns_roce_u_db.h    | 13 +++----------
>>  providers/hns/hns_roce_u_hw_v1.c | 17 +++++++++--------
>>  providers/hns/hns_roce_u_hw_v2.c | 28 +++++++++++++++++-----------
>>  3 files changed, 29 insertions(+), 29 deletions(-)
>>
>> diff --git a/providers/hns/hns_roce_u_db.h b/providers/hns/hns_roce_u_db.h
>> index b44e64d..453fa5a 100644
>> +++ b/providers/hns/hns_roce_u_db.h
>> @@ -37,18 +37,11 @@
>>  #ifndef _HNS_ROCE_U_DB_H
>>  #define _HNS_ROCE_U_DB_H
>>  
>> -#if __BYTE_ORDER == __LITTLE_ENDIAN
>> -#define HNS_ROCE_PAIR_TO_64(val) ((uint64_t) val[1] << 32 | val[0])
>> -#elif __BYTE_ORDER == __BIG_ENDIAN
>> -#define HNS_ROCE_PAIR_TO_64(val) ((uint64_t) val[0] << 32 | val[1])
>> -#else
>> -#error __BYTE_ORDER not defined
>> -#endif
>> +#define HNS_ROCE_WORD_NUM 2
>>  
>> -static inline void hns_roce_write64(uint32_t val[2],
>> -				    struct hns_roce_context *ctx, int offset)
>> +static inline void hns_roce_write64(__le64 *dest, __le32 val[HNS_ROCE_WORD_NUM])
>>  {
>> -	*(volatile uint64_t *) (ctx->uar + offset) = HNS_ROCE_PAIR_TO_64(val);
>> +	*(volatile __le64 *)dest = *(__le64 *)val;
>>  }
> 
> Please use the macros in util/mmio.h
> 
> Jason
> 

OK, thank you.

Weihang



* Re: [PATCH rdma-core 4/4] libhns: Add support for direct wqe
  2021-06-04 14:50   ` Jason Gunthorpe
@ 2021-06-11  9:20     ` liweihang
  2021-06-11 11:31       ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: liweihang @ 2021-06-11  9:20 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, linux-rdma, Linuxarm

On 2021/6/4 22:50, Jason Gunthorpe wrote:
> On Fri, May 28, 2021 at 05:32:59PM +0800, Weihang Li wrote:
>> diff --git a/providers/hns/hns_roce_u_hw_v2.c b/providers/hns/hns_roce_u_hw_v2.c
>> index aa57cc4..28d455b 100644
>> +++ b/providers/hns/hns_roce_u_hw_v2.c
>> @@ -33,10 +33,15 @@
>>  #define _GNU_SOURCE
>>  #include <stdio.h>
>>  #include <string.h>
>> +#include <sys/mman.h>
>>  #include "hns_roce_u.h"
>>  #include "hns_roce_u_db.h"
>>  #include "hns_roce_u_hw_v2.h"
>>  
>> +#if defined(__aarch64__) || defined(__arm__)
>> +#include <arm_neon.h>
>> +#endif
>> +
>>  #define HR_IBV_OPC_MAP(ib_key, hr_key) \
>>  		[IBV_WR_ ## ib_key] = HNS_ROCE_WQE_OP_ ## hr_key
>>  
>> @@ -313,6 +318,39 @@ static void hns_roce_update_sq_db(struct hns_roce_context *ctx,
>>  			 (__le32 *)&sq_db);
>>  }
>>  
>> +static inline void hns_roce_write512(uint64_t *dest, uint64_t *val)
>> +{
>> +#if defined(__aarch64__) || defined(__arm__)
>> +	uint64x2x4_t dwqe;
>> +
>> +	/* Load multiple 4-element structures to 4 registers */
>> +	dwqe = vld4q_u64(val);
>> +	/* store multiple 4-element structures from 4 registers */
>> +	vst4q_u64(dest, dwqe);
>> +#else
>> +	int i;
>> +
>> +	for (i = 0; i < HNS_ROCE_WRITE_TIMES; i++)
>> +		hns_roce_write64(dest + i, val + HNS_ROCE_WORD_NUM * i);
>> +#endif
>> +}
> 
> No code like this in providers. This should be done similiarly to how
> SSE is handled on x86
> 
> This is 
> 
>    mmio_memcpy_x64(dest, val, 64);
> 
> The above should be conditionalized to trigger NEON
> 
> #if defined(__aarch64__) || defined(__arm__)
> static inline void __mmio_memcpy_x64_64b(..)
> {..
>     vst4q_u64(dest, vld4q_u64(src))
> ..}
> #endif
> 
> #define mmio_memcpy_x64(dest, src, bytecount)
>  ({if (__builtin_constant_p(bytecount == 64)
>         __mmio_memcpy_x64_64b(dest,src,bytecount)
>    ...
> 

OK, thank you.

> And I'm not sure what barriers you need for prot_device, but certainly
> more than none. If you don't know then use the WC barriers
> 

An ST4 instruction guarantees that all 64 bytes of data are written at once, so
we don't need a barrier.

Weihang

> Jason
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH rdma-core 4/4] libhns: Add support for direct wqe
  2021-06-11  9:20     ` liweihang
@ 2021-06-11 11:31       ` Jason Gunthorpe
  2021-06-16  9:55         ` liweihang
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2021-06-11 11:31 UTC (permalink / raw)
  To: liweihang; +Cc: leon, linux-rdma, Linuxarm

On Fri, Jun 11, 2021 at 09:20:51AM +0000, liweihang wrote:
> On 2021/6/4 22:50, Jason Gunthorpe wrote:
> > On Fri, May 28, 2021 at 05:32:59PM +0800, Weihang Li wrote:
> >> diff --git a/providers/hns/hns_roce_u_hw_v2.c b/providers/hns/hns_roce_u_hw_v2.c
> >> index aa57cc4..28d455b 100644
> >> +++ b/providers/hns/hns_roce_u_hw_v2.c
> >> @@ -33,10 +33,15 @@
> >>  #define _GNU_SOURCE
> >>  #include <stdio.h>
> >>  #include <string.h>
> >> +#include <sys/mman.h>
> >>  #include "hns_roce_u.h"
> >>  #include "hns_roce_u_db.h"
> >>  #include "hns_roce_u_hw_v2.h"
> >>  
> >> +#if defined(__aarch64__) || defined(__arm__)
> >> +#include <arm_neon.h>
> >> +#endif
> >> +
> >>  #define HR_IBV_OPC_MAP(ib_key, hr_key) \
> >>  		[IBV_WR_ ## ib_key] = HNS_ROCE_WQE_OP_ ## hr_key
> >>  
> >> @@ -313,6 +318,39 @@ static void hns_roce_update_sq_db(struct hns_roce_context *ctx,
> >>  			 (__le32 *)&sq_db);
> >>  }
> >>  
> >> +static inline void hns_roce_write512(uint64_t *dest, uint64_t *val)
> >> +{
> >> +#if defined(__aarch64__) || defined(__arm__)
> >> +	uint64x2x4_t dwqe;
> >> +
> >> +	/* Load multiple 4-element structures to 4 registers */
> >> +	dwqe = vld4q_u64(val);
> >> +	/* store multiple 4-element structures from 4 registers */
> >> +	vst4q_u64(dest, dwqe);
> >> +#else
> >> +	int i;
> >> +
> >> +	for (i = 0; i < HNS_ROCE_WRITE_TIMES; i++)
> >> +		hns_roce_write64(dest + i, val + HNS_ROCE_WORD_NUM * i);
> >> +#endif
> >> +}
> > 
> > No code like this in providers. This should be done similiarly to how
> > SSE is handled on x86
> > 
> > This is 
> > 
> >    mmio_memcpy_x64(dest, val, 64);
> > 
> > The above should be conditionalized to trigger NEON
> > 
> > #if defined(__aarch64__) || defined(__arm__)
> > static inline void __mmio_memcpy_x64_64b(..)
> > {..
> >     vst4q_u64(dest, vld4q_u64(src))
> > ..}
> > #endif
> > 
> > #define mmio_memcpy_x64(dest, src, bytecount)
> >  ({if (__builtin_constant_p(bytecount == 64)
> >         __mmio_memcpy_x64_64b(dest,src,bytecount)
> >    ...
> > 
> 
> OK, thank you.
> 
> > And I'm not sure what barriers you need for prot_device, but certainly
> > more than none. If you don't know then use the WC barriers
> > 
> 
> An ST4 instruction guarantees that all 64 bytes of data are written at once,
> so we don't need a barrier.

ARM always has a relaxed, out-of-order storage model; you need barriers
to ensure that the ST4 is observed in order with the other writes that
might be going on

Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH rdma-core 4/4] libhns: Add support for direct wqe
  2021-06-11 11:31       ` Jason Gunthorpe
@ 2021-06-16  9:55         ` liweihang
  2021-06-16 19:14           ` Jason Gunthorpe
  0 siblings, 1 reply; 13+ messages in thread
From: liweihang @ 2021-06-16  9:55 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, linux-rdma, Linuxarm

On 2021/6/11 19:31, Jason Gunthorpe wrote:
> On Fri, Jun 11, 2021 at 09:20:51AM +0000, liweihang wrote:
>> On 2021/6/4 22:50, Jason Gunthorpe wrote:
>>> On Fri, May 28, 2021 at 05:32:59PM +0800, Weihang Li wrote:
>>>> diff --git a/providers/hns/hns_roce_u_hw_v2.c b/providers/hns/hns_roce_u_hw_v2.c
>>>> index aa57cc4..28d455b 100644
>>>> +++ b/providers/hns/hns_roce_u_hw_v2.c
>>>> @@ -33,10 +33,15 @@
>>>>  #define _GNU_SOURCE
>>>>  #include <stdio.h>
>>>>  #include <string.h>
>>>> +#include <sys/mman.h>
>>>>  #include "hns_roce_u.h"
>>>>  #include "hns_roce_u_db.h"
>>>>  #include "hns_roce_u_hw_v2.h"
>>>>  
>>>> +#if defined(__aarch64__) || defined(__arm__)
>>>> +#include <arm_neon.h>
>>>> +#endif
>>>> +
>>>>  #define HR_IBV_OPC_MAP(ib_key, hr_key) \
>>>>  		[IBV_WR_ ## ib_key] = HNS_ROCE_WQE_OP_ ## hr_key
>>>>  
>>>> @@ -313,6 +318,39 @@ static void hns_roce_update_sq_db(struct hns_roce_context *ctx,
>>>>  			 (__le32 *)&sq_db);
>>>>  }
>>>>  
>>>> +static inline void hns_roce_write512(uint64_t *dest, uint64_t *val)
>>>> +{
>>>> +#if defined(__aarch64__) || defined(__arm__)
>>>> +	uint64x2x4_t dwqe;
>>>> +
>>>> +	/* Load multiple 4-element structures to 4 registers */
>>>> +	dwqe = vld4q_u64(val);
>>>> +	/* store multiple 4-element structures from 4 registers */
>>>> +	vst4q_u64(dest, dwqe);
>>>> +#else
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < HNS_ROCE_WRITE_TIMES; i++)
>>>> +		hns_roce_write64(dest + i, val + HNS_ROCE_WORD_NUM * i);
>>>> +#endif
>>>> +}
>>>
>>> No code like this in providers. This should be done similiarly to how
>>> SSE is handled on x86
>>>
>>> This is 
>>>
>>>    mmio_memcpy_x64(dest, val, 64);
>>>
>>> The above should be conditionalized to trigger NEON
>>>
>>> #if defined(__aarch64__) || defined(__arm__)
>>> static inline void __mmio_memcpy_x64_64b(..)
>>> {..
>>>     vst4q_u64(dest, vld4q_u64(src))
>>> ..}
>>> #endif
>>>
>>> #define mmio_memcpy_x64(dest, src, bytecount)
>>>  ({if (__builtin_constant_p(bytecount == 64)
>>>         __mmio_memcpy_x64_64b(dest,src,bytecount)
>>>    ...
>>>
>>
>> OK, thank you.
>>
>>> And I'm not sure what barriers you need for prot_device, but certainly
>>> more than none. If you don't know then use the WC barriers
>>>
>>
>> An ST4 instruction guarantees that all 64 bytes of data are written at once,
>> so we don't need a barrier.
> 
> ARM always has a relaxed, out-of-order storage model; you need barriers
> to ensure that the ST4 is observed in order with the other writes that
> might be going on
> 
> Jason
> 

Hi Jason

Sorry for the late reply. Here is the post-send process of HIP08/09:

   +-----------+
   | post send |
   +-----+-----+
         |
   +-----+-----+
   | write WQE |
   +-----+-----+
         |
         | udma_to_device_barrier()
         |
   +-----+-----+   Y  +-----------+  N
   |  HIP09 ?  +------+ multi WR ?+-------------+
   +-----+-----+      +-----+-----+             |
         | N                | Y                 |
   +-----+-----+      +-----+-----+    +--------+--------+
   |  ring DB  |      |  ring DB  |    |direct WQE (ST4) |
   +-----------+      +-----------+    +-----------------+

After users call ibv_post_send(), the driver writes the WQE into memory and
adds a barrier to ensure that the whole WQE has been fully written. Then, for
HIP09, we check whether there is only one WR; if so, we write the WQE into the
PCI BAR space via ST4 instructions, and the hardware gets the WQE directly. If
there is more than one WQE, we generate an SQ doorbell to tell the hardware to
read the WQEs from memory.

Direct WQE merges ringing the doorbell and fetching the WQE from memory into a
single step, avoiding a read of the WQEs from memory after the doorbell is
updated. To the hardware, the ST4 store is as atomic as a doorbell ring, and
before the ST4 the WQE has been fully written into memory. So I think the
current barrier is enough for Direct WQE.
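[Editor's sketch] The branch at the bottom of the chart can be modeled as a small decision function. The enum values and the helper name below are illustrative, not the driver's identifiers:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the tail of the flow chart above. */
enum ring_mode { RING_DOORBELL, DIRECT_WQE_ST4 };

/* udma_to_device_barrier() has already run at this point, so the WQE is
 * fully visible in memory before either kind of store is issued. */
static enum ring_mode choose_ring_mode(bool is_hip09, unsigned int wr_cnt)
{
	if (is_hip09 && wr_cnt == 1)
		return DIRECT_WQE_ST4;	/* write the WQE into the PCI BAR */
	return RING_DOORBELL;		/* tell the hardware to fetch WQEs */
}
```

With this model, a single WR on HIP09 takes the Direct WQE path and everything else falls back to a doorbell.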

If there are still any issues in this process, could you please tell us where to
add the barrier? Thank you :)

Weihang


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH rdma-core 4/4] libhns: Add support for direct wqe
  2021-06-16  9:55         ` liweihang
@ 2021-06-16 19:14           ` Jason Gunthorpe
  2021-06-18  7:23             ` liweihang
  0 siblings, 1 reply; 13+ messages in thread
From: Jason Gunthorpe @ 2021-06-16 19:14 UTC (permalink / raw)
  To: liweihang; +Cc: leon, linux-rdma, Linuxarm

On Wed, Jun 16, 2021 at 09:55:45AM +0000, liweihang wrote:

> If there is still any issues in this process, could you please tell us where to
> add the barrier? Thank you :)

I don't know ARM perfectly well, but generally look at

 1) Do these special stores barrier with the spin unlock protecting
    the post send? Allowing them to leak out will get things out of
    order

 2) ARM MMIO stores are not ordered, so the DB store and the ST4 store
    are not guaranteed to execute in program order without a barrier.
    The spinlock is not an MMIO barrier

You could ignore some of this when the DB rings were basically
idempotent, but if you are xfering data it is more tricky. This is why
we always see a barrier after a WC store to put all future MMIO
strongly in order with the store.

Jason

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH rdma-core 4/4] libhns: Add support for direct wqe
  2021-06-16 19:14           ` Jason Gunthorpe
@ 2021-06-18  7:23             ` liweihang
  0 siblings, 0 replies; 13+ messages in thread
From: liweihang @ 2021-06-18  7:23 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, linux-rdma, Linuxarm, linyunsheng

On 2021/6/17 3:14, Jason Gunthorpe wrote:
> On Wed, Jun 16, 2021 at 09:55:45AM +0000, liweihang wrote:
> 
>> If there is still any issues in this process, could you please tell us where to
>> add the barrier? Thank you :)
> 
> I don't know ARM perfectly well, but generally look at
> 
>  1) Do these special stores barrier with the spin unlock protecting
>     the post send? Allowing them to leak out will get things out of
>     order

I do not think we need to rely on the spin unlock to ensure correct ordering of
the ST4 store.
An ST4 store is similar to a DB store; the difference is that a DB store writes
8 bytes to the device's MMIO space while an ST4 store writes 64 bytes. The ST4
store can be ordered by udma_to_device_barrier() as well, which means we can
also use udma_to_device_barrier() to ensure correct ordering between the ST4
store and the DB store.

> 
>  2) ARM MMIO stores are not ordered, so that DB store the ST4 store
>     are not guaranteed to execute in program order without a barrier.
>     The spinlock is not a MMIO barrier
> 

As there is a udma_to_device_barrier() between each round of post send, we can
guarantee that the last DB store/ST4 store reaches the device before the next
DB store/ST4 store is issued.
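[Editor's sketch] A toy trace of two consecutive rounds spells out the ordering being claimed here. All names are illustrative, and the "barrier" is only a plain-C placeholder for udma_to_device_barrier(), not a real DMB:

```c
#include <assert.h>
#include <string.h>

/* Records the order of operations across two post-send rounds: each
 * round writes its WQE, issues the (modeled) barrier, then rings via
 * DB or ST4.  The barrier separates each store from the next round. */
static char trace[128];

static void write_wqe(int n)	{ strcat(trace, n ? "wqe1," : "wqe0,"); }
static void barrier_model(void)	{ strcat(trace, "bar,"); }
static void ring_model(int n)	{ strcat(trace, n ? "st4," : "db,"); }

static const char *two_rounds(void)
{
	trace[0] = '\0';
	for (int i = 0; i < 2; i++) {
		write_wqe(i);
		barrier_model();	/* udma_to_device_barrier() in the driver */
		ring_model(i);
	}
	return trace;
}
```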

> You could ignore some of this when the DB rings were basically
> idempotent, but if you are xfering data it is more tricky. This is why
> we always see a barrier after a WC store to put all future MMIO
> strongly in order with the store.
> 
> Jason
> 

"st4 store" writes the doorbell and the content of WQE to the roce engine, and
the st4 store ensure doorbell and the content of WQE both reach the roce engine
at the same time. we tried to avoid WC store by using st4 store here, as WC
store might need a different barrier in order to flush the data to the device.

Thanks
Weihang

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-06-18  7:23 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-28  9:32 [PATCH rdma-core 0/4] libhns: Add support for direct WQE Weihang Li
2021-05-28  9:32 ` [PATCH rdma-core 1/4] Update kernel headers Weihang Li
2021-05-28  9:32 ` [PATCH rdma-core 2/4] libhns: Refactor hns uar mmap flow Weihang Li
2021-05-28  9:32 ` [PATCH rdma-core 3/4] libhns: Fixes data type when writing doorbell Weihang Li
2021-06-04 14:43   ` Jason Gunthorpe
2021-06-09  3:35     ` liweihang
2021-05-28  9:32 ` [PATCH rdma-core 4/4] libhns: Add support for direct wqe Weihang Li
2021-06-04 14:50   ` Jason Gunthorpe
2021-06-11  9:20     ` liweihang
2021-06-11 11:31       ` Jason Gunthorpe
2021-06-16  9:55         ` liweihang
2021-06-16 19:14           ` Jason Gunthorpe
2021-06-18  7:23             ` liweihang
