linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation
@ 2021-12-28  8:07 Li Zhijian
  2021-12-28  8:07 ` [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem Li Zhijian
                   ` (10 more replies)
  0 siblings, 11 replies; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

Hey folks,

These patches implement a *NEW* RDMA opcode, "RDMA FLUSH".
In IB SPEC 1.5[1][2], two new opcodes, ATOMIC WRITE and RDMA FLUSH, were
added in the MEMORY PLACEMENT EXTENSIONS section.

FLUSH is used by the requesting node to achieve guarantees on the
placement, within the memory subsystem, of data from preceding accesses
to a single memory region, such as those performed by RDMA WRITE, Atomic
and ATOMIC WRITE requests.

The operation indicates a region within the virtual address space of the
destination node where the guarantees should apply. This range must be
contiguous in the virtual space of the memory key, but it is not
necessarily a contiguous range of physical memory.

FLUSH packets carry a FLUSH extended transport header (FETH, see below)
to specify the placement type and the selectivity level of the operation,
and an RDMA extended header (RETH, see the base document RETH definition)
to specify the R_Key, VA and Length associated with this request. These
headers follow the BTH in RC, the RDETH in RD, and the XRCETH in XRC.

RC FLUSH:
+----+------+------+
|BTH | FETH | RETH |
+----+------+------+

RD FLUSH:
+----+------+------+------+
|BTH | RDETH| FETH | RETH |
+----+------+------+------+

XRC FLUSH:
+----+-------+------+------+
|BTH | XRCETH| FETH | RETH |
+----+-------+------+------+

Currently, we introduce the RC and RD services only, since rxe has not
implemented XRC yet.
NOTE: only the RC service has been tested so far, and since no other HCA
has implemented FLUSH yet, the FLUSH operation can only be tested between
SoftRoCE/rxe devices.

The corresponding rdma-core and FLUSH example are available on:
https://github.com/zhijianli88/rdma-core/tree/rfc
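
For illustration, below is a minimal, hypothetical sketch of how an
application might post a FLUSH work request with the proposed verbs
extensions. The opcode name IBV_WR_RDMA_FLUSH and the placement-type /
selectivity constants are assumptions mirroring the kernel uapi names
added in this series; the authoritative definitions live in the rdma-core
branch above. Note that in this proposal the flush length for the
"memory region range" selectivity is derived from the SGE list (see the
discussion later in this thread); the sketch uses the "whole MR"
selectivity, where the length is ignored.

#include <string.h>
#include <infiniband/verbs.h>

static int post_flush(struct ibv_qp *qp, uint64_t remote_addr,
		      uint32_t rkey, uint64_t wr_id)
{
	struct ibv_send_wr wr;
	struct ibv_send_wr *bad_wr;

	memset(&wr, 0, sizeof(wr));
	wr.wr_id = wr_id;
	wr.opcode = IBV_WR_RDMA_FLUSH;		/* proposed opcode */
	wr.send_flags = IBV_SEND_SIGNALED;
	wr.wr.flush.remote_addr = remote_addr;	/* RETH:VA */
	wr.wr.flush.rkey = rkey;		/* RETH:R_Key */
	wr.wr.flush.type = IB_EXT_PLT_PERSIST;	/* request persistence */
	wr.wr.flush.level = IB_EXT_SEL_MR_WHOLE; /* flush the whole MR */

	/* oA19-2: FLUSH carries no payload, so no local SGEs are needed. */
	wr.sg_list = NULL;
	wr.num_sge = 0;

	return ibv_post_send(qp, &wr, &bad_wr);
}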

Listed below are some details about the FLUSH transport packet:

A FLUSH message consists of a single FLUSH request packet and is
acknowledged with an RDMA READ response of zero size.

oA19-2: FLUSH shall be single packet message and shall have no payload.
oA19-5: FLUSH BTH shall hold the Opcode = 0x1C

FLUSH Extended Transport Header(FETH)
+-----+-----------+------------------------+----------------------+
|Bits |   31-6    |          5-4           |        3-0           |
+-----+-----------+------------------------+----------------------+
|     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
+-----+-----------+------------------------+----------------------+

Selectivity Level (SEL) – defines the memory region scope the FLUSH
should apply on. Values are as follows:
• b’00 - Memory Region Range: FLUSH applies for all preceding memory
         updates to the RETH range on this QP. All RETH fields shall be
         valid in this selectivity mode. RETH:DMALen field shall be
         between zero and (2^31 - 1) bytes (inclusive).
• b’01 - Memory Region: FLUSH applies for all preceding memory up-
         dates to RETH.R_key on this QP. RETH:DMALen and RETH:VA
         shall be ignored in this mode.
• b'10 - Reserved.
• b'11 - Reserved.

Placement Type (PLT) – Defines the memory placement guarantee of
this FLUSH. Multiple bits may be set in this field. Values are as follows:
• Bit 0 if set to '1' indicates that the FLUSH should guarantee Global
  Visibility.
• Bit 1 if set to '1' indicates that the FLUSH should guarantee
  Persistence.
• Bits 3:2 are reserved
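
For clarity, the FETH word can be packed and unpacked as sketched below
(host byte order; patch 06 adds the equivalent feth_init() helper and
handles the big-endian wire encoding). The helper names here are
illustrative only, not part of this series.

#include <stdint.h>

/* FETH layout per the table above: bits 3:0 = PLT, bits 5:4 = SEL,
 * bits 31:6 reserved. */
#define FETH_PLT_SHIFT	0
#define FETH_PLT_MASK	0x0000000fu
#define FETH_SEL_SHIFT	4
#define FETH_SEL_MASK	0x00000030u

static inline uint32_t feth_pack(uint32_t sel, uint32_t plt)
{
	return ((sel << FETH_SEL_SHIFT) & FETH_SEL_MASK) |
	       ((plt << FETH_PLT_SHIFT) & FETH_PLT_MASK);
}

static inline uint32_t feth_get_sel(uint32_t feth)
{
	return (feth & FETH_SEL_MASK) >> FETH_SEL_SHIFT;
}

static inline uint32_t feth_get_plt(uint32_t feth)
{
	return (feth & FETH_PLT_MASK) >> FETH_PLT_SHIFT;
}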

[1]: https://www.infinibandta.org/ibta-specification/ # login required
[2]: https://www.infinibandta.org/wp-content/uploads/2021/08/IBTA-Overview-of-IBTA-Volume-1-Release-1.5-and-MPE-2021-08-17-Secure.pptx

CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Zhu Yanjun <zyjzyj2000@gmail.com>
CC: Leon Romanovsky <leon@kernel.org>
CC: Bob Pearson <rpearsonhpe@gmail.com>
CC: Weihang Li <liweihang@huawei.com>
CC: Mark Bloch <mbloch@nvidia.com>
CC: Wenpeng Liang <liangwenpeng@huawei.com>
CC: Aharon Landau <aharonl@nvidia.com>
CC: linux-rdma@vger.kernel.org
CC: linux-kernel@vger.kernel.org

Li Zhijian (10):
  RDMA: mr: Introduce is_pmem
  RDMA: Allow registering MR with flush access flags
  RDMA/rxe: Allow registering FLUSH flags for supported device only
  RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device
  RDMA/rxe: Allow registering persistent flag for pmem MR only
  RDMA/rxe: Implement RC RDMA FLUSH service in requester side
  RDMA/rxe: Set BTH's SE to zero for FLUSH packet
  RDMA/rxe: Implement flush execution in responder side
  RDMA/rxe: Implement flush completion
  RDMA/rxe: Add RD FLUSH service support

 drivers/infiniband/core/uverbs_cmd.c    |  16 +++
 drivers/infiniband/sw/rxe/rxe_comp.c    |   4 +-
 drivers/infiniband/sw/rxe/rxe_hdr.h     |  52 ++++++++++
 drivers/infiniband/sw/rxe/rxe_loc.h     |   2 +
 drivers/infiniband/sw/rxe/rxe_mr.c      |  63 +++++++++++-
 drivers/infiniband/sw/rxe/rxe_opcode.c  |  33 ++++++
 drivers/infiniband/sw/rxe/rxe_opcode.h  |   3 +
 drivers/infiniband/sw/rxe/rxe_param.h   |   3 +-
 drivers/infiniband/sw/rxe/rxe_req.c     |  14 ++-
 drivers/infiniband/sw/rxe/rxe_resp.c    | 131 +++++++++++++++++++++++-
 include/rdma/ib_pack.h                  |   3 +
 include/rdma/ib_verbs.h                 |  22 +++-
 include/uapi/rdma/ib_user_ioctl_verbs.h |   2 +
 include/uapi/rdma/ib_user_verbs.h       |  18 ++++
 include/uapi/rdma/rdma_user_rxe.h       |   6 ++
 15 files changed, 362 insertions(+), 10 deletions(-)

-- 
2.31.1




^ permalink raw reply	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2022-01-06  0:21   ` Jason Gunthorpe
  2021-12-28  8:07 ` [RFC PATCH rdma-next 02/10] RDMA: Allow registering MR with flush access flags Li Zhijian
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

Introduce is_pmem to indicate whether the MR being registered is backed
by pmem/nvdimm.

Currently we only set it in the rxe driver; other devices/drivers should
implement it if needed.

RDMA FLUSH will use it to support the persistence placement type for
pmem/nvdimm.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c | 47 ++++++++++++++++++++++++++++++
 include/rdma/ib_verbs.h            |  1 +
 2 files changed, 48 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 7c4cd19a9db2..bcd5e7afa475 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -162,6 +162,50 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
 	mr->type = IB_MR_TYPE_DMA;
 }
 
+// XXX: the logic is similar to mm/memory-failure.c
+static bool page_in_dev_pagemap(struct page *page)
+{
+	unsigned long pfn;
+	struct page *p;
+	struct dev_pagemap *pgmap = NULL;
+
+	pfn = page_to_pfn(page);
+	if (!pfn) {
+		pr_err("no such pfn for page %p\n", page);
+		return false;
+	}
+
+	p = pfn_to_online_page(pfn);
+	if (!p) {
+		if (pfn_valid(pfn)) {
+			pgmap = get_dev_pagemap(pfn, NULL);
+			if (pgmap)
+				put_dev_pagemap(pgmap);
+		}
+	}
+
+	return !!pgmap;
+}
+
+static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
+{
+	struct page *page = NULL;
+	char *vaddr = iova_to_vaddr(mr, iova, length);
+
+	if (!vaddr) {
+		pr_err("not a valid iova %llu\n", iova);
+		return false;
+	}
+
+	page = virt_to_page(vaddr);
+	if (!page) {
+		pr_err("no such page for vaddr %p\n", vaddr);
+		return false;
+	}
+
+	return page_in_dev_pagemap(page);
+}
+
 int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr)
 {
@@ -236,6 +280,9 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 	set->va = start;
 	set->offset = ib_umem_offset(umem);
 
+	// iova_in_pmem must be called after set is updated
+	mr->ibmr.is_pmem = iova_in_pmem(mr, iova, length);
+
 	return 0;
 
 err_release_umem:
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6e9ad656ecb7..822ebb3425dc 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1807,6 +1807,7 @@ struct ib_mr {
 	unsigned int	   page_size;
 	enum ib_mr_type	   type;
 	bool		   need_inval;
+	bool		   is_pmem;
 	union {
 		struct ib_uobject	*uobject;	/* user */
 		struct list_head	qp_entry;	/* FR */
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 02/10] RDMA: Allow registering MR with flush access flags
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
  2021-12-28  8:07 ` [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2021-12-28  8:07 ` [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only Li Zhijian
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

Allow users to register an MR with the FLUSH access flags via
ibv_reg_mr(3).

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 include/rdma/ib_verbs.h                 | 6 +++++-
 include/uapi/rdma/ib_user_ioctl_verbs.h | 2 ++
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 822ebb3425dc..f04d66539879 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1444,10 +1444,14 @@ enum ib_access_flags {
 	IB_ACCESS_ON_DEMAND = IB_UVERBS_ACCESS_ON_DEMAND,
 	IB_ACCESS_HUGETLB = IB_UVERBS_ACCESS_HUGETLB,
 	IB_ACCESS_RELAXED_ORDERING = IB_UVERBS_ACCESS_RELAXED_ORDERING,
+	IB_ACCESS_FLUSH_GLOBAL_VISIBILITY = IB_UVERBS_ACCESS_FLUSH_GLOBAL_VISIBILITY,
+	IB_ACCESS_FLUSH_PERSISTENT = IB_UVERBS_ACCESS_FLUSH_PERSISTENT,
 
+	IB_ACCESS_FLUSH = IB_ACCESS_FLUSH_GLOBAL_VISIBILITY |
+			  IB_ACCESS_FLUSH_PERSISTENT,
 	IB_ACCESS_OPTIONAL = IB_UVERBS_ACCESS_OPTIONAL_RANGE,
 	IB_ACCESS_SUPPORTED =
-		((IB_ACCESS_HUGETLB << 1) - 1) | IB_ACCESS_OPTIONAL,
+		((IB_ACCESS_FLUSH_PERSISTENT << 1) - 1) | IB_ACCESS_OPTIONAL,
 };
 
 /*
diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
index 3072e5d6b692..4763d3e3ea16 100644
--- a/include/uapi/rdma/ib_user_ioctl_verbs.h
+++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
@@ -57,6 +57,8 @@ enum ib_uverbs_access_flags {
 	IB_UVERBS_ACCESS_ZERO_BASED = 1 << 5,
 	IB_UVERBS_ACCESS_ON_DEMAND = 1 << 6,
 	IB_UVERBS_ACCESS_HUGETLB = 1 << 7,
+	IB_UVERBS_ACCESS_FLUSH_GLOBAL_VISIBILITY = 1 << 8,
+	IB_UVERBS_ACCESS_FLUSH_PERSISTENT = 1 << 9,
 
 	IB_UVERBS_ACCESS_RELAXED_ORDERING = IB_UVERBS_ACCESS_OPTIONAL_FIRST,
 	IB_UVERBS_ACCESS_OPTIONAL_RANGE =
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
  2021-12-28  8:07 ` [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem Li Zhijian
  2021-12-28  8:07 ` [RFC PATCH rdma-next 02/10] RDMA: Allow registering MR with flush access flags Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2022-01-06  0:22   ` Jason Gunthorpe
  2022-01-13  6:43   ` lizhijian
  2021-12-28  8:07 ` [RFC PATCH rdma-next 04/10] RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device Li Zhijian
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

A device should enable the IB_DEVICE_RDMA_FLUSH capability if it wants
to support RDMA FLUSH.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 include/rdma/ib_verbs.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f04d66539879..51d58b641201 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -291,6 +291,7 @@ enum ib_device_cap_flags {
 	/* The device supports padding incoming writes to cacheline. */
 	IB_DEVICE_PCI_WRITE_END_PADDING		= (1ULL << 36),
 	IB_DEVICE_ALLOW_USER_UNREG		= (1ULL << 37),
+	IB_DEVICE_RDMA_FLUSH			= (1ULL << 38),
 };
 
 enum ib_atomic_cap {
@@ -4319,6 +4320,10 @@ static inline int ib_check_mr_access(struct ib_device *ib_dev,
 	if (flags & IB_ACCESS_ON_DEMAND &&
 	    !(ib_dev->attrs.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
 		return -EINVAL;
+
+	if (flags & IB_ACCESS_FLUSH &&
+	    !(ib_dev->attrs.device_cap_flags & IB_DEVICE_RDMA_FLUSH))
+		return -EINVAL;
 	return 0;
 }
 
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 04/10] RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (2 preceding siblings ...)
  2021-12-28  8:07 ` [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2022-01-06  0:22   ` Jason Gunthorpe
  2021-12-28  8:07 ` [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only Li Zhijian
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

Advertise the IB_DEVICE_RDMA_FLUSH capability on the rxe device. The
FLUSH operation itself is implemented by the following patches.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_param.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index 918270e34a35..5c140f311aa2 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -53,7 +53,8 @@ enum rxe_device_param {
 					| IB_DEVICE_ALLOW_USER_UNREG
 					| IB_DEVICE_MEM_WINDOW
 					| IB_DEVICE_MEM_WINDOW_TYPE_2A
-					| IB_DEVICE_MEM_WINDOW_TYPE_2B,
+					| IB_DEVICE_MEM_WINDOW_TYPE_2B
+					| IB_DEVICE_RDMA_FLUSH,
 	RXE_MAX_SGE			= 32,
 	RXE_MAX_WQE_SIZE		= sizeof(struct rxe_send_wqe) +
 					  sizeof(struct ib_sge) * RXE_MAX_SGE,
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (3 preceding siblings ...)
  2021-12-28  8:07 ` [RFC PATCH rdma-next 04/10] RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2021-12-30 22:25   ` Tom Talpey
  2021-12-28  8:07 ` [RFC PATCH rdma-next 06/10] RDMA/rxe: Implement RC RDMA FLUSH service in requester side Li Zhijian
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

A memory region can support two placement types: IB_ACCESS_FLUSH_PERSISTENT
and IB_ACCESS_FLUSH_GLOBAL_VISIBILITY, but only pmem/nvdimm is able to
persist data, so IB_ACCESS_FLUSH_PERSISTENT may only be registered for a
pmem MR.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index bcd5e7afa475..21616d058f29 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -206,6 +206,11 @@ static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
 	return page_in_dev_pagemap(page);
 }
 
+static bool ib_check_flush_access_flags(struct ib_mr *mr, u32 flags)
+{
+	return mr->is_pmem || !(flags & IB_ACCESS_FLUSH_PERSISTENT);
+}
+
 int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr)
 {
@@ -282,6 +287,13 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 
 	// iova_in_pmem must be called after set is updated
 	mr->ibmr.is_pmem = iova_in_pmem(mr, iova, length);
+	if (!ib_check_flush_access_flags(&mr->ibmr, access)) {
+		pr_err("Cannot set IB_ACCESS_FLUSH_PERSISTENT for non-pmem memory\n");
+		mr->state = RXE_MR_STATE_INVALID;
+		mr->umem = NULL;
+		err = -EINVAL;
+		goto err_release_umem;
+	}
 
 	return 0;
 
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 06/10] RDMA/rxe: Implement RC RDMA FLUSH service in requester side
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (4 preceding siblings ...)
  2021-12-28  8:07 ` [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2021-12-28  8:07 ` [RFC PATCH rdma-next 07/10] RDMA/rxe: Set BTH's SE to zero for FLUSH packet Li Zhijian
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

An RC FLUSH packet consists of:
+----+------+------+
|BTH | FETH | RETH |
+----+------+------+

oA19-2: FLUSH shall be single packet message and shall have no payload.
oA19-5: FLUSH BTH shall hold the Opcode = 0x1C

FLUSH Extended Transport Header(FETH)
+-----+-----------+------------------------+----------------------+
|Bits |   31-6    |          5-4           |        3-0           |
+-----+-----------+------------------------+----------------------+
|     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
+-----+-----------+------------------------+----------------------+

Selectivity Level (SEL) – defines the memory region scope the FLUSH
should apply on. Values are as follows:
• b’00 - Memory Region Range: FLUSH applies for all preceding memory
         updates to the RETH range on this QP. All RETH fields shall be
         valid in this selectivity mode. RETH:DMALen field shall be between
         zero and (2^31 - 1) bytes (inclusive).
• b’01 - Memory Region: FLUSH applies for all preceding memory updates
         to RETH.R_key on this QP. RETH:DMALen and RETH:VA shall be
         ignored in this mode.
• b'10 - Reserved.
• b'11 - Reserved.

Placement Type (PLT) – Defines the memory placement guarantee of
this FLUSH. Multiple bits may be set in this field. Values are as follows:
• Bit 0 if set to '1' indicates that the FLUSH should guarantee Global
  Visibility.
• Bit 1 if set to '1' indicates that the FLUSH should guarantee
  Persistence.
• Bits 3:2 are reserved

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 drivers/infiniband/core/uverbs_cmd.c   | 16 ++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_hdr.h    | 24 ++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_opcode.c | 13 +++++++++++++
 drivers/infiniband/sw/rxe/rxe_opcode.h |  3 +++
 drivers/infiniband/sw/rxe/rxe_req.c    | 10 ++++++++++
 include/rdma/ib_pack.h                 |  2 ++
 include/rdma/ib_verbs.h                |  9 +++++++++
 include/uapi/rdma/ib_user_verbs.h      |  7 +++++++
 include/uapi/rdma/rdma_user_rxe.h      |  6 ++++++
 9 files changed, 90 insertions(+)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6b6393176b3c..ed54fb0fd13d 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -2080,6 +2080,22 @@ static int ib_uverbs_post_send(struct uverbs_attr_bundle *attrs)
 			rdma->rkey = user_wr->wr.rdma.rkey;
 
 			next = &rdma->wr;
+		} else if (user_wr->opcode == IB_WR_RDMA_FLUSH) {
+			struct ib_flush_wr *flush;
+
+			next_size = sizeof(*flush);
+			flush = alloc_wr(next_size, user_wr->num_sge);
+			if (!flush) {
+				ret = -ENOMEM;
+				goto out_put;
+			}
+
+			flush->remote_addr = user_wr->wr.flush.remote_addr;
+			flush->rkey = user_wr->wr.flush.rkey;
+			flush->type = user_wr->wr.flush.type;
+			flush->level = user_wr->wr.flush.level;
+
+			next = &flush->wr;
 		} else if (user_wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP ||
 			   user_wr->opcode == IB_WR_ATOMIC_FETCH_AND_ADD) {
 			struct ib_atomic_wr *atomic;
diff --git a/drivers/infiniband/sw/rxe/rxe_hdr.h b/drivers/infiniband/sw/rxe/rxe_hdr.h
index e432f9e37795..e37aa1944b18 100644
--- a/drivers/infiniband/sw/rxe/rxe_hdr.h
+++ b/drivers/infiniband/sw/rxe/rxe_hdr.h
@@ -607,6 +607,29 @@ static inline void reth_set_len(struct rxe_pkt_info *pkt, u32 len)
 		rxe_opcode[pkt->opcode].offset[RXE_RETH], len);
 }
 
+/*
+ * FLUSH Extended Transport Header(FETH)
+ * +-----+-----------+------------------------+----------------------+
+ * |Bits |   31-6    |          5-4           |        3-0           |
+ * +-----+-----------+------------------------+----------------------+
+ *       | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
+ * +-----+-----------+------------------------+----------------------+
+ */
+#define FETH_PLT_SHIFT 0UL
+#define FETH_SEL_SHIFT 4UL
+#define FETH_RESERVED_SHIFT 6UL
+#define FETH_PLT_MASK ((1UL << FETH_SEL_SHIFT) - 1UL)
+#define FETH_SEL_MASK (~FETH_PLT_MASK & ((1UL << FETH_RESERVED_SHIFT) - 1UL))
+
+static inline void feth_init(struct rxe_pkt_info *pkt, u32 type, u32 level)
+{
+	u32 *p = (u32 *)(pkt->hdr + rxe_opcode[pkt->opcode].offset[RXE_FETH]);
+	u32 feth = ((level << FETH_SEL_SHIFT) & FETH_SEL_MASK) |
+		   ((type << FETH_PLT_SHIFT) & FETH_PLT_MASK);
+
+	*p = cpu_to_be32(feth);
+}
+
 /******************************************************************************
  * Atomic Extended Transport Header
  ******************************************************************************/
@@ -910,6 +933,7 @@ enum rxe_hdr_length {
 	RXE_ATMETH_BYTES	= sizeof(struct rxe_atmeth),
 	RXE_IETH_BYTES		= sizeof(struct rxe_ieth),
 	RXE_RDETH_BYTES		= sizeof(struct rxe_rdeth),
+	RXE_FETH_BYTES		= sizeof(u32),
 };
 
 static inline size_t header_size(struct rxe_pkt_info *pkt)
diff --git a/drivers/infiniband/sw/rxe/rxe_opcode.c b/drivers/infiniband/sw/rxe/rxe_opcode.c
index 3ef5a10a6efd..d61c8b354af4 100644
--- a/drivers/infiniband/sw/rxe/rxe_opcode.c
+++ b/drivers/infiniband/sw/rxe/rxe_opcode.c
@@ -316,6 +316,19 @@ struct rxe_opcode_info rxe_opcode[RXE_NUM_OPCODE] = {
 						+ RXE_AETH_BYTES,
 		}
 	},
+	[IB_OPCODE_RC_RDMA_FLUSH]			= {
+		.name	= "IB_OPCODE_RC_RDMA_FLUSH",
+		.mask	= RXE_FETH_MASK | RXE_RETH_MASK | RXE_FLUSH_MASK
+				| RXE_START_MASK | RXE_END_MASK | RXE_REQ_MASK,
+		.length = RXE_BTH_BYTES + RXE_FETH_BYTES + RXE_RETH_BYTES,
+		.offset = {
+			[RXE_BTH]	= 0,
+			[RXE_FETH]	= RXE_BTH_BYTES,
+			[RXE_RETH]	= RXE_BTH_BYTES	+ RXE_FETH_BYTES,
+			[RXE_PAYLOAD]	= RXE_BTH_BYTES
+					+ RXE_FETH_BYTES + RXE_RETH_BYTES,
+		}
+	},
 	[IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE]			= {
 		.name	= "IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE",
 		.mask	= RXE_AETH_MASK | RXE_ATMACK_MASK | RXE_ACK_MASK
diff --git a/drivers/infiniband/sw/rxe/rxe_opcode.h b/drivers/infiniband/sw/rxe/rxe_opcode.h
index 8f9aaaf260f2..dbc2eca8a92c 100644
--- a/drivers/infiniband/sw/rxe/rxe_opcode.h
+++ b/drivers/infiniband/sw/rxe/rxe_opcode.h
@@ -48,6 +48,7 @@ enum rxe_hdr_type {
 	RXE_DETH,
 	RXE_IMMDT,
 	RXE_PAYLOAD,
+	RXE_FETH,
 	NUM_HDR_TYPES
 };
 
@@ -63,6 +64,7 @@ enum rxe_hdr_mask {
 	RXE_IETH_MASK		= BIT(RXE_IETH),
 	RXE_RDETH_MASK		= BIT(RXE_RDETH),
 	RXE_DETH_MASK		= BIT(RXE_DETH),
+	RXE_FETH_MASK		= BIT(RXE_FETH),
 	RXE_PAYLOAD_MASK	= BIT(RXE_PAYLOAD),
 
 	RXE_REQ_MASK		= BIT(NUM_HDR_TYPES + 0),
@@ -80,6 +82,7 @@ enum rxe_hdr_mask {
 	RXE_END_MASK		= BIT(NUM_HDR_TYPES + 10),
 
 	RXE_LOOPBACK_MASK	= BIT(NUM_HDR_TYPES + 12),
+	RXE_FLUSH_MASK		= BIT(NUM_HDR_TYPES + 13),
 
 	RXE_READ_OR_ATOMIC_MASK	= (RXE_READ_MASK | RXE_ATOMIC_MASK),
 	RXE_WRITE_OR_SEND_MASK	= (RXE_WRITE_MASK | RXE_SEND_MASK),
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 5eb89052dd66..a3e9351873e2 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -220,6 +220,9 @@ static int next_opcode_rc(struct rxe_qp *qp, u32 opcode, int fits)
 				IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE :
 				IB_OPCODE_RC_SEND_FIRST;
 
+	case IB_WR_RDMA_FLUSH:
+		return IB_OPCODE_RC_RDMA_FLUSH;
+
 	case IB_WR_RDMA_READ:
 		return IB_OPCODE_RC_RDMA_READ_REQUEST;
 
@@ -418,6 +421,10 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 		reth_set_len(pkt, wqe->dma.resid);
 	}
 
+	/* Fill Flush Extension Transport Header */
+	if (pkt->mask & RXE_FETH_MASK)
+		feth_init(pkt, ibwr->wr.flush.type, ibwr->wr.flush.level);
+
 	if (pkt->mask & RXE_IMMDT_MASK)
 		immdt_set_imm(pkt, ibwr->ex.imm_data);
 
@@ -477,6 +484,9 @@ static int finish_packet(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 
 			memset(pad, 0, bth_pad(pkt));
 		}
+	} else if (pkt->mask & RXE_FLUSH_MASK) {
+		// oA19-2: shall have no payload.
+		wqe->dma.resid = 0;
 	}
 
 	return 0;
diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h
index a9162f25beaf..d19edb502de6 100644
--- a/include/rdma/ib_pack.h
+++ b/include/rdma/ib_pack.h
@@ -84,6 +84,7 @@ enum {
 	/* opcode 0x15 is reserved */
 	IB_OPCODE_SEND_LAST_WITH_INVALIDATE         = 0x16,
 	IB_OPCODE_SEND_ONLY_WITH_INVALIDATE         = 0x17,
+	IB_OPCODE_RDMA_FLUSH                        = 0x1C,
 
 	/* real constants follow -- see comment about above IB_OPCODE()
 	   macro for more details */
@@ -112,6 +113,7 @@ enum {
 	IB_OPCODE(RC, FETCH_ADD),
 	IB_OPCODE(RC, SEND_LAST_WITH_INVALIDATE),
 	IB_OPCODE(RC, SEND_ONLY_WITH_INVALIDATE),
+	IB_OPCODE(RC, RDMA_FLUSH),
 
 	/* UC */
 	IB_OPCODE(UC, SEND_FIRST),
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 51d58b641201..d336f2f8bb69 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1297,6 +1297,7 @@ struct ib_qp_attr {
 enum ib_wr_opcode {
 	/* These are shared with userspace */
 	IB_WR_RDMA_WRITE = IB_UVERBS_WR_RDMA_WRITE,
+	IB_WR_RDMA_FLUSH = IB_UVERBS_WR_RDMA_FLUSH,
 	IB_WR_RDMA_WRITE_WITH_IMM = IB_UVERBS_WR_RDMA_WRITE_WITH_IMM,
 	IB_WR_SEND = IB_UVERBS_WR_SEND,
 	IB_WR_SEND_WITH_IMM = IB_UVERBS_WR_SEND_WITH_IMM,
@@ -1391,6 +1392,14 @@ struct ib_atomic_wr {
 	u32			rkey;
 };
 
+struct ib_flush_wr {
+	struct ib_send_wr	wr;
+	u64			remote_addr;
+	u32			rkey;
+	u32			type;
+	u32			level;
+};
+
 static inline const struct ib_atomic_wr *atomic_wr(const struct ib_send_wr *wr)
 {
 	return container_of(wr, struct ib_atomic_wr, wr);
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 7ee73a0652f1..4b7093f58259 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -784,6 +784,7 @@ enum ib_uverbs_wr_opcode {
 	IB_UVERBS_WR_RDMA_READ_WITH_INV = 11,
 	IB_UVERBS_WR_MASKED_ATOMIC_CMP_AND_SWP = 12,
 	IB_UVERBS_WR_MASKED_ATOMIC_FETCH_AND_ADD = 13,
+	IB_UVERBS_WR_RDMA_FLUSH = 14,
 	/* Review enum ib_wr_opcode before modifying this */
 };
 
@@ -797,6 +798,12 @@ struct ib_uverbs_send_wr {
 		__u32 invalidate_rkey;
 	} ex;
 	union {
+		struct {
+			__aligned_u64 remote_addr;
+			__u32	rkey;
+			__u32	type;
+			__u32	level;
+		} flush;
 		struct {
 			__aligned_u64 remote_addr;
 			__u32 rkey;
diff --git a/include/uapi/rdma/rdma_user_rxe.h b/include/uapi/rdma/rdma_user_rxe.h
index f09c5c9e3dd5..45f1b5b25414 100644
--- a/include/uapi/rdma/rdma_user_rxe.h
+++ b/include/uapi/rdma/rdma_user_rxe.h
@@ -82,6 +82,12 @@ struct rxe_send_wr {
 		__u32		invalidate_rkey;
 	} ex;
 	union {
+		struct {
+			__aligned_u64 remote_addr;
+			__u32	rkey;
+			__u32	type;
+			__u32	level;
+		} flush;
 		struct {
 			__aligned_u64 remote_addr;
 			__u32	rkey;
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 07/10] RDMA/rxe: Set BTH's SE to zero for FLUSH packet
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (5 preceding siblings ...)
  2021-12-28  8:07 ` [RFC PATCH rdma-next 06/10] RDMA/rxe: Implement RC RDMA FLUSH service in requester side Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2021-12-28  8:07 ` [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side Li Zhijian
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

The spec says:
oA19-6: FLUSH BTH header field solicited event (SE) indication shall be
set to zero.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_req.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index a3e9351873e2..082c5f76f29b 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -401,7 +401,9 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 			(pkt->mask & RXE_END_MASK) &&
 			((pkt->mask & (RXE_SEND_MASK)) ||
 			(pkt->mask & (RXE_WRITE_MASK | RXE_IMMDT_MASK)) ==
-			(RXE_WRITE_MASK | RXE_IMMDT_MASK));
+			(RXE_WRITE_MASK | RXE_IMMDT_MASK)) &&
+			/* oA19-6: always set SE to zero */
+			!(pkt->mask & RXE_FETH_MASK);
 
 	qp_num = (pkt->mask & RXE_DETH_MASK) ? ibwr->wr.ud.remote_qpn :
 					 qp->attr.dest_qp_num;
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (6 preceding siblings ...)
  2021-12-28  8:07 ` [RFC PATCH rdma-next 07/10] RDMA/rxe: Set BTH's SE to zero for FLUSH packet Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2021-12-30 22:18   ` Tom Talpey
                     ` (2 more replies)
  2021-12-28  8:07 ` [RFC PATCH rdma-next 09/10] RDMA/rxe: Implement flush completion Li Zhijian
                   ` (2 subsequent siblings)
  10 siblings, 3 replies; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

In contrast to other opcodes, after the usual sanity checks, the FLUSH
opcode performs a placement type check before it actually executes the
flush. The responder also replies with a NAK "Remote Access Error" if it
finds a placement type violation.

Data is persisted via arch_wb_cache_pmem(), which may be architecture
specific.

After execution, the responder indicates success by replying with an
RDMA READ response of zero size.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_hdr.h  |  28 ++++++
 drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
 drivers/infiniband/sw/rxe/rxe_mr.c   |   4 +-
 drivers/infiniband/sw/rxe/rxe_resp.c | 131 ++++++++++++++++++++++++++-
 include/uapi/rdma/ib_user_verbs.h    |  10 ++
 5 files changed, 169 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_hdr.h b/drivers/infiniband/sw/rxe/rxe_hdr.h
index e37aa1944b18..cdfd393b8bd8 100644
--- a/drivers/infiniband/sw/rxe/rxe_hdr.h
+++ b/drivers/infiniband/sw/rxe/rxe_hdr.h
@@ -630,6 +630,34 @@ static inline void feth_init(struct rxe_pkt_info *pkt, u32 type, u32 level)
 	*p = cpu_to_be32(feth);
 }
 
+static inline u32 __feth_plt(void *arg)
+{
+	u32 *fethp = arg;
+	u32 feth = be32_to_cpu(*fethp);
+
+	return (feth & FETH_PLT_MASK) >> FETH_PLT_SHIFT;
+}
+
+static inline u32 __feth_sel(void *arg)
+{
+	u32 *fethp = arg;
+	u32 feth = be32_to_cpu(*fethp);
+
+	return (feth & FETH_SEL_MASK) >> FETH_SEL_SHIFT;
+}
+
+static inline u32 feth_plt(struct rxe_pkt_info *pkt)
+{
+	return __feth_plt(pkt->hdr +
+		rxe_opcode[pkt->opcode].offset[RXE_FETH]);
+}
+
+static inline u32 feth_sel(struct rxe_pkt_info *pkt)
+{
+	return __feth_sel(pkt->hdr +
+		rxe_opcode[pkt->opcode].offset[RXE_FETH]);
+}
+
 /******************************************************************************
  * Atomic Extended Transport Header
  ******************************************************************************/
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index b1e174afb1d4..73c39ff11e28 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -80,6 +80,8 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		enum rxe_mr_copy_dir dir);
 int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
 	      void *addr, int length, enum rxe_mr_copy_dir dir);
+void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
+		 size_t *offset_out);
 void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 21616d058f29..2cb530305392 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -325,8 +325,8 @@ int rxe_mr_init_fast(struct rxe_pd *pd, int max_pages, struct rxe_mr *mr)
 	return err;
 }
 
-static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
-			size_t *offset_out)
+void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
+		 size_t *offset_out)
 {
 	struct rxe_map_set *set = mr->cur_map_set;
 	size_t offset = iova - set->iova + set->offset;
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index e8f435fa6e4d..6730336037d1 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -5,6 +5,7 @@
  */
 
 #include <linux/skbuff.h>
+#include <linux/libnvdimm.h>
 
 #include "rxe.h"
 #include "rxe_loc.h"
@@ -19,6 +20,7 @@ enum resp_states {
 	RESPST_CHK_RESOURCE,
 	RESPST_CHK_LENGTH,
 	RESPST_CHK_RKEY,
+	RESPST_CHK_PLT,
 	RESPST_EXECUTE,
 	RESPST_READ_REPLY,
 	RESPST_COMPLETE,
@@ -35,6 +37,7 @@ enum resp_states {
 	RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
 	RESPST_ERR_RNR,
 	RESPST_ERR_RKEY_VIOLATION,
+	RESPST_ERR_PLT_VIOLATION,
 	RESPST_ERR_INVALIDATE_RKEY,
 	RESPST_ERR_LENGTH,
 	RESPST_ERR_CQ_OVERFLOW,
@@ -53,6 +56,7 @@ static char *resp_state_name[] = {
 	[RESPST_CHK_RESOURCE]			= "CHK_RESOURCE",
 	[RESPST_CHK_LENGTH]			= "CHK_LENGTH",
 	[RESPST_CHK_RKEY]			= "CHK_RKEY",
+	[RESPST_CHK_PLT]			= "CHK_PLACEMENT_TYPE",
 	[RESPST_EXECUTE]			= "EXECUTE",
 	[RESPST_READ_REPLY]			= "READ_REPLY",
 	[RESPST_COMPLETE]			= "COMPLETE",
@@ -69,6 +73,7 @@ static char *resp_state_name[] = {
 	[RESPST_ERR_TOO_MANY_RDMA_ATM_REQ]	= "ERR_TOO_MANY_RDMA_ATM_REQ",
 	[RESPST_ERR_RNR]			= "ERR_RNR",
 	[RESPST_ERR_RKEY_VIOLATION]		= "ERR_RKEY_VIOLATION",
+	[RESPST_ERR_PLT_VIOLATION]		= "ERR_PLACEMENT_TYPE_VIOLATION",
 	[RESPST_ERR_INVALIDATE_RKEY]		= "ERR_INVALIDATE_RKEY_VIOLATION",
 	[RESPST_ERR_LENGTH]			= "ERR_LENGTH",
 	[RESPST_ERR_CQ_OVERFLOW]		= "ERR_CQ_OVERFLOW",
@@ -400,6 +405,30 @@ static enum resp_states check_length(struct rxe_qp *qp,
 	}
 }
 
+static enum resp_states check_placement_type(struct rxe_qp *qp,
+					     struct rxe_pkt_info *pkt)
+{
+	struct rxe_mr *mr = qp->resp.mr;
+	u32 plt = feth_plt(pkt);
+
+	// no FLUSH access flag MR
+	if ((mr->access & IB_ACCESS_FLUSH) == 0) {
+		pr_err("This mr isn't registered any flush access permission\n");
+		return RESPST_ERR_PLT_VIOLATION;
+	}
+
+	if ((plt & IB_EXT_PLT_GLB_VIS &&
+	    !(mr->access & IB_ACCESS_FLUSH_GLOBAL_VISIBILITY)) ||
+	    (plt & IB_EXT_PLT_PERSIST &&
+	    !(mr->access & IB_ACCESS_FLUSH_PERSISTENT))) {
+		pr_err("Target MR don't allow this placement type, is_pmem: %d, register flag: %x, request flag: %x\n",
+		       mr->ibmr.is_pmem, mr->access >> 8, plt);
+		return RESPST_ERR_PLT_VIOLATION;
+	}
+
+	return RESPST_EXECUTE;
+}
+
 static enum resp_states check_rkey(struct rxe_qp *qp,
 				   struct rxe_pkt_info *pkt)
 {
@@ -413,7 +442,7 @@ static enum resp_states check_rkey(struct rxe_qp *qp,
 	enum resp_states state;
 	int access;
 
-	if (pkt->mask & RXE_READ_OR_WRITE_MASK) {
+	if (pkt->mask & (RXE_READ_OR_WRITE_MASK | RXE_FLUSH_MASK)) {
 		if (pkt->mask & RXE_RETH_MASK) {
 			qp->resp.va = reth_va(pkt);
 			qp->resp.offset = 0;
@@ -434,8 +463,10 @@ static enum resp_states check_rkey(struct rxe_qp *qp,
 	}
 
 	/* A zero-byte op is not required to set an addr or rkey. */
+	/* RXE_FETH_MASK carries a zero-byte payload */
 	if ((pkt->mask & RXE_READ_OR_WRITE_MASK) &&
 	    (pkt->mask & RXE_RETH_MASK) &&
+	    !(pkt->mask & RXE_FETH_MASK) &&
 	    reth_len(pkt) == 0) {
 		return RESPST_EXECUTE;
 	}
@@ -503,7 +534,7 @@ static enum resp_states check_rkey(struct rxe_qp *qp,
 	WARN_ON_ONCE(qp->resp.mr);
 
 	qp->resp.mr = mr;
-	return RESPST_EXECUTE;
+	return pkt->mask & RXE_FETH_MASK ? RESPST_CHK_PLT : RESPST_EXECUTE;
 
 err:
 	if (mr)
@@ -549,6 +580,91 @@ static enum resp_states write_data_in(struct rxe_qp *qp,
 	return rc;
 }
 
+static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
+{
+	int			err;
+	int			bytes;
+	u8			*va;
+	struct rxe_map		**map;
+	struct rxe_phys_buf	*buf;
+	int			m;
+	int			i;
+	size_t			offset;
+
+	if (length == 0)
+		return 0;
+
+	if (mr->type == IB_MR_TYPE_DMA) {
+		arch_wb_cache_pmem((void *)iova, length);
+		return 0;
+	}
+
+	WARN_ON_ONCE(!mr->cur_map_set);
+
+	err = mr_check_range(mr, iova, length);
+	if (err) {
+		err = -EFAULT;
+		goto err1;
+	}
+
+	lookup_iova(mr, iova, &m, &i, &offset);
+
+	map = mr->cur_map_set->map + m;
+	buf	= map[0]->buf + i;
+
+	while (length > 0) {
+		va	= (u8 *)(uintptr_t)buf->addr + offset;
+		bytes	= buf->size - offset;
+
+		if (bytes > length)
+			bytes = length;
+
+		arch_wb_cache_pmem(va, bytes);
+
+		length	-= bytes;
+
+		offset	= 0;
+		buf++;
+		i++;
+
+		if (i == RXE_BUF_PER_MAP) {
+			i = 0;
+			map++;
+			buf = map[0]->buf;
+		}
+	}
+
+	return 0;
+
+err1:
+	return err;
+}
+
+static enum resp_states process_flush(struct rxe_qp *qp,
+				       struct rxe_pkt_info *pkt)
+{
+	u64 length = 0, start = qp->resp.va;
+	u32 sel = feth_sel(pkt);
+	u32 plt = feth_plt(pkt);
+	struct rxe_mr *mr = qp->resp.mr;
+
+	if (sel == IB_EXT_SEL_MR_RANGE)
+		length = qp->resp.length;
+	else if (sel == IB_EXT_SEL_MR_WHOLE)
+		length = mr->cur_map_set->length;
+
+	if (plt == IB_EXT_PLT_PERSIST) {
+		nvdimm_flush_iova(mr, start, length);
+		wmb(); // clwb is followed by an sfence
+	} else if (plt == IB_EXT_PLT_GLB_VIS)
+		wmb(); // sfence is enough
+
+	/* set RDMA READ response of zero */
+	qp->resp.resid = 0;
+
+	return RESPST_READ_REPLY;
+}
+
 /* Guarantee atomicity of atomic operations at the machine level. */
 static DEFINE_SPINLOCK(atomic_ops_lock);
 
@@ -801,6 +917,8 @@ static enum resp_states execute(struct rxe_qp *qp, struct rxe_pkt_info *pkt)
 		err = process_atomic(qp, pkt);
 		if (err)
 			return err;
+	} else if (pkt->mask & RXE_FLUSH_MASK) {
+		return process_flush(qp, pkt);
 	} else {
 		/* Unreachable */
 		WARN_ON_ONCE(1);
@@ -1059,7 +1177,7 @@ static enum resp_states duplicate_request(struct rxe_qp *qp,
 		/* SEND. Ack again and cleanup. C9-105. */
 		send_ack(qp, pkt, AETH_ACK_UNLIMITED, prev_psn);
 		return RESPST_CLEANUP;
-	} else if (pkt->mask & RXE_READ_MASK) {
+	} else if (pkt->mask & RXE_READ_MASK || pkt->mask & RXE_FLUSH_MASK) {
 		struct resp_res *res;
 
 		res = find_resource(qp, pkt->psn);
@@ -1098,7 +1216,7 @@ static enum resp_states duplicate_request(struct rxe_qp *qp,
 			/* Reset the resource, except length. */
 			res->read.va_org = iova;
 			res->read.va = iova;
-			res->read.resid = resid;
+			res->read.resid = pkt->mask & RXE_FLUSH_MASK ? 0 : resid;
 
 			/* Replay the RDMA read reply. */
 			qp->resp.res = res;
@@ -1244,6 +1362,9 @@ int rxe_responder(void *arg)
 		case RESPST_CHK_RKEY:
 			state = check_rkey(qp, pkt);
 			break;
+		case RESPST_CHK_PLT:
+			state = check_placement_type(qp, pkt);
+			break;
 		case RESPST_EXECUTE:
 			state = execute(qp, pkt);
 			break;
@@ -1298,6 +1419,8 @@ int rxe_responder(void *arg)
 			break;
 
 		case RESPST_ERR_RKEY_VIOLATION:
+		/* oA19-13 8 */
+		case RESPST_ERR_PLT_VIOLATION:
 			if (qp_type(qp) == IB_QPT_RC) {
 				/* Class C */
 				do_class_ac_error(qp, AETH_NAK_REM_ACC_ERR,
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 4b7093f58259..efa06f53c7c6 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -105,6 +105,16 @@ enum {
 	IB_USER_VERBS_EX_CMD_MODIFY_CQ
 };
 
+enum ib_ext_placement_type {
+	IB_EXT_PLT_GLB_VIS = 1 << 0,
+	IB_EXT_PLT_PERSIST = 1 << 1,
+};
+
+enum ib_ext_selectivity_level {
+	IB_EXT_SEL_MR_RANGE = 0,
+	IB_EXT_SEL_MR_WHOLE,
+};
+
 /*
  * Make sure that all structs defined in this file remain laid out so
  * that they pack the same way on 32-bit and 64-bit architectures (to
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 09/10] RDMA/rxe: Implement flush completion
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (7 preceding siblings ...)
  2021-12-28  8:07 ` [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2021-12-28  8:07 ` [RFC PATCH rdma-next 10/10] RDMA/rxe: Add RD FLUSH service support Li Zhijian
  2021-12-29  8:49 ` [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Gromadzki, Tomasz
  10 siblings, 0 replies; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

Introduce a new IB_UVERBS_WC_FLUSH opcode to report a FLUSH completion
to userspace.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c | 4 +++-
 include/rdma/ib_verbs.h              | 1 +
 include/uapi/rdma/ib_user_verbs.h    | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index f363fe3fa414..e5b9d07eba93 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -104,6 +104,7 @@ static enum ib_wc_opcode wr_to_wc_opcode(enum ib_wr_opcode opcode)
 	case IB_WR_LOCAL_INV:			return IB_WC_LOCAL_INV;
 	case IB_WR_REG_MR:			return IB_WC_REG_MR;
 	case IB_WR_BIND_MW:			return IB_WC_BIND_MW;
+	case IB_WR_RDMA_FLUSH:			return IB_WC_RDMA_FLUSH;
 
 	default:
 		return 0xff;
@@ -261,7 +262,8 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
 		 */
 	case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
 		if (wqe->wr.opcode != IB_WR_RDMA_READ &&
-		    wqe->wr.opcode != IB_WR_RDMA_READ_WITH_INV) {
+		    wqe->wr.opcode != IB_WR_RDMA_READ_WITH_INV &&
+		    wqe->wr.opcode != IB_WR_RDMA_FLUSH) {
 			wqe->status = IB_WC_FATAL_ERR;
 			return COMPST_ERROR;
 		}
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index d336f2f8bb69..d528f2c4eba0 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -963,6 +963,7 @@ const char *__attribute_const__ ib_wc_status_msg(enum ib_wc_status status);
 enum ib_wc_opcode {
 	IB_WC_SEND = IB_UVERBS_WC_SEND,
 	IB_WC_RDMA_WRITE = IB_UVERBS_WC_RDMA_WRITE,
+	IB_WC_RDMA_FLUSH = IB_UVERBS_WC_FLUSH,
 	IB_WC_RDMA_READ = IB_UVERBS_WC_RDMA_READ,
 	IB_WC_COMP_SWAP = IB_UVERBS_WC_COMP_SWAP,
 	IB_WC_FETCH_ADD = IB_UVERBS_WC_FETCH_ADD,
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index efa06f53c7c6..b9d6b3ca5708 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -476,6 +476,7 @@ enum ib_uverbs_wc_opcode {
 	IB_UVERBS_WC_BIND_MW = 5,
 	IB_UVERBS_WC_LOCAL_INV = 6,
 	IB_UVERBS_WC_TSO = 7,
+	IB_UVERBS_WC_FLUSH = 8,
 };
 
 struct ib_uverbs_wc {
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC PATCH rdma-next 10/10] RDMA/rxe: Add RD FLUSH service support
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (8 preceding siblings ...)
  2021-12-28  8:07 ` [RFC PATCH rdma-next 09/10] RDMA/rxe: Implement flush completion Li Zhijian
@ 2021-12-28  8:07 ` Li Zhijian
  2021-12-29  8:49 ` [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Gromadzki, Tomasz
  10 siblings, 0 replies; 49+ messages in thread
From: Li Zhijian @ 2021-12-28  8:07 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto, Li Zhijian

Although the spec says FLUSH is supported by the RC/RD/XRC services, XRC
is not yet supported by rxe.

So XRC FLUSH cannot be supported until rxe implements the XRC service.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
I have not set up an RD environment to test this protocol
---
 drivers/infiniband/sw/rxe/rxe_opcode.c | 20 ++++++++++++++++++++
 include/rdma/ib_pack.h                 |  1 +
 2 files changed, 21 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe_opcode.c b/drivers/infiniband/sw/rxe/rxe_opcode.c
index d61c8b354af4..0d5f555f5ec5 100644
--- a/drivers/infiniband/sw/rxe/rxe_opcode.c
+++ b/drivers/infiniband/sw/rxe/rxe_opcode.c
@@ -919,6 +919,26 @@ struct rxe_opcode_info rxe_opcode[RXE_NUM_OPCODE] = {
 						+ RXE_RDETH_BYTES,
 		}
 	},
+	[IB_OPCODE_RD_RDMA_FLUSH]			= {
+		.name	= "IB_OPCODE_RD_RDMA_FLUSH",
+		.mask	= RXE_RDETH_MASK | RXE_FETH_MASK | RXE_RETH_MASK
+				| RXE_FLUSH_MASK | RXE_START_MASK
+				| RXE_END_MASK | RXE_REQ_MASK,
+		.length = RXE_BTH_BYTES + RXE_FETH_BYTES + RXE_RETH_BYTES,
+		.offset = {
+			[RXE_BTH]	= 0,
+			[RXE_RDETH]	= RXE_BTH_BYTES,
+			[RXE_FETH]	= RXE_BTH_BYTES
+							+ RXE_RDETH_BYTES,
+			[RXE_RETH]	= RXE_BTH_BYTES
+							+ RXE_RDETH_BYTES
+							+ RXE_FETH_BYTES,
+			[RXE_PAYLOAD]	= RXE_BTH_BYTES
+							+ RXE_RDETH_BYTES
+							+ RXE_FETH_BYTES
+							+ RXE_RETH_BYTES,
+		}
+	},
 
 	/* UD */
 	[IB_OPCODE_UD_SEND_ONLY]			= {
diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h
index d19edb502de6..40568a33ead8 100644
--- a/include/rdma/ib_pack.h
+++ b/include/rdma/ib_pack.h
@@ -151,6 +151,7 @@ enum {
 	IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE),
 	IB_OPCODE(RD, COMPARE_SWAP),
 	IB_OPCODE(RD, FETCH_ADD),
+	IB_OPCODE(RD, RDMA_FLUSH),
 
 	/* UD */
 	IB_OPCODE(UD, SEND_ONLY),
-- 
2.31.1




^ permalink raw reply related	[flat|nested] 49+ messages in thread

* RE: [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation
  2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (9 preceding siblings ...)
  2021-12-28  8:07 ` [RFC PATCH rdma-next 10/10] RDMA/rxe: Add RD FLUSH service support Li Zhijian
@ 2021-12-29  8:49 ` Gromadzki, Tomasz
  2021-12-29 14:35   ` Tom Talpey
  10 siblings, 1 reply; 49+ messages in thread
From: Gromadzki, Tomasz @ 2021-12-29  8:49 UTC (permalink / raw)
  To: Li Zhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto

1)
> rdma_post_flush(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl,
>		int nsge, int flags, uint32_t type, uint32_t level,
>		uint64_t remote_addr, uint32_t rkey)
Shall we not use struct ibv_sge *sgl in this API call but instead pass the flush length (DMALen) argument explicitly?
That would be similar to the other rdma_post_XXX API calls, where ibv_sge is not used at all.
ibv_sge is used only in the rdma_post_XXXv API calls.

2)
>struct ibv_send_wr {
>...
> 	union {
>		struct {
>			uint64_t	remote_addr;
>			uint32_t	rkey;
>			uint32_t	type;
>			uint32_t	level;
>		} flush;

Shall we extend this structure with
uint32_t length
and abandon the use of *sg_list, since in the RDMA verbs it relates to local memory access only:

Ibv_post_send.3:
struct ibv_sge         *sg_list;                /* Pointer to the s/g array */
int                     num_sge;                /* Size of the s/g array */

In the case of flush, there is no local context at all.
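
For illustration, a sketch (not the actual rdma-core definition) of what
the flush member could look like with the suggested explicit length
added; the field names follow the proposal quoted above:

#include <stdint.h>

/* Hypothetical layout only. */
struct flush_wr_params {
	uint64_t remote_addr;	/* RETH:VA */
	uint32_t rkey;		/* RETH:R_Key */
	uint32_t type;		/* placement type (PLT) */
	uint32_t level;		/* selectivity level (SEL) */
	uint32_t length;	/* DMALen: 0..(2^31 - 1) bytes */
};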

Thanks
Tomasz

> -----Original Message-----
> From: Li Zhijian <lizhijian@cn.fujitsu.com>
> Sent: Tuesday, December 28, 2021 9:07 AM
> To: linux-rdma@vger.kernel.org; zyjzyj2000@gmail.com; jgg@ziepe.ca;
> aharonl@nvidia.com; leon@kernel.org
> Cc: linux-kernel@vger.kernel.org; mbloch@nvidia.com;
> liweihang@huawei.com; liangwenpeng@huawei.com;
> yangx.jy@cn.fujitsu.com; rpearsonhpe@gmail.com; y-goto@fujitsu.com; Li
> Zhijian <lizhijian@cn.fujitsu.com>
> Subject: [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH
> operation
> 
> Hey folks,
> 
> These patches are going to implement a *NEW* RDMA opcode "RDMA
> FLUSH".
> In IB SPEC 1.5[1][2], 2 new opcodes, ATOMIC WRITE and RDMA FLUSH were
> added in the MEMORY PLACEMENT EXTENSIONS section.
> 
> FLUSH is used by the requesting node to achieve guarantees on the data
> placement within the memory subsystem of preceding accesses to a single
> memory region, such as those performed by RDMA WRITE, Atomics and
> ATOMIC WRITE requests.
> 
> The operation indicates the virtual address space of a destination node and
> where the guarantees should apply. This range must be contiguous in the
> virtual space of the memory key but it is not necessarily a contiguous range
> of physical memory.
> 
> FLUSH packets carry FLUSH extended transport header (see below) to
> specify the placement type and the selectivity level of the operation and
> RDMA extended header (RETH, see base document RETH definition) to
> specify the R_Key VA and Length associated with this request following the
> BTH in RC, RDETH in RD and XRCETH in XRC.
> 
> RC FLUSH:
> +----+------+------+
> |BTH | FETH | RETH |
> +----+------+------+
> 
> RD FLUSH:
> +----+------+------+------+
> |BTH | RDETH| FETH | RETH |
> +----+------+------+------+
> 
> XRC FLUSH:
> +----+-------+------+------+
> |BTH | XRCETH| FETH | RETH |
> +----+-------+------+------+
> 
> Currently, we introduce RC and RD services only, since XRC has not been
> implemented by rxe yet.
> NOTE: only RC service is tested now, and since other HCAs have not
> added/implemented FLUSH yet, we can only test FLUSH operation in both
> SoftRoCE/rxe devices.
> 
> The corresponding rdma-core and FLUSH example are available on:
> https://github.com/zhijianli88/rdma-core/tree/rfc
> 
> Below list some details about FLUSH transport packet:
> 
> A FLUSH message is built upon FLUSH request packet and is responded
> successfully by RDMA READ response of zero size.
> 
> oA19-2: FLUSH shall be single packet message and shall have no payload.
> 
> oA19-2: FLUSH shall be single packet message and shall have no payload.
> oA19-5: FLUSH BTH shall hold the Opcode = 0x1C
> 
> FLUSH Extended Transport Header(FETH)
> +-----+-----------+------------------------+----------------------+
> |Bits |   31-6    |          5-4           |        3-0           |
> +-----+-----------+------------------------+----------------------+
> |     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
> +-----+-----------+------------------------+----------------------+
> 
> Selectivity Level (SEL) – defines the memory region scope the FLUSH should
> apply on. Values are as follows:
> • b’00 - Memory Region Range: FLUSH applies for all preceding memory
>          updates to the RETH range on this QP. All RETH fields shall be
>          valid in this selectivity mode. RETH:DMALen field shall be
>          between zero and (2 31 -1) bytes (inclusive).
> • b’01 - Memory Region: FLUSH applies for all preceding memory up-
>          dates to RETH.R_key on this QP. RETH:DMALen and RETH:VA
>          shall be ignored in this mode.
> • b'10 - Reserved.
> • b'11 - Reserved.
> 
> Placement Type (PLT) – Defines the memory placement guarantee of this
> FLUSH. Multiple bits may be set in this field. Values are as follows:
> • Bit 0 if set to '1' indicated that the FLUSH should guarantee Global
>   Visibility.
> • Bit 1 if set to '1' indicated that the FLUSH should guarantee
>   Persistence.
> • Bits 3:2 are reserved
> 
> [1]: https://www.infinibandta.org/ibta-specification/ # login required
> [2]: https://www.infinibandta.org/wp-content/uploads/2021/08/IBTA-
> Overview-of-IBTA-Volume-1-Release-1.5-and-MPE-2021-08-17-Secure.pptx
> 
> CC: Jason Gunthorpe <jgg@ziepe.ca>
> CC: Zhu Yanjun <zyjzyj2000@gmail.com
> CC: Leon Romanovsky <leon@kernel.org>
> CC: Bob Pearson <rpearsonhpe@gmail.com>
> CC: Weihang Li <liweihang@huawei.com>
> CC: Mark Bloch <mbloch@nvidia.com>
> CC: Wenpeng Liang <liangwenpeng@huawei.com>
> CC: Aharon Landau <aharonl@nvidia.com>
> CC: linux-rdma@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
> 
> Li Zhijian (10):
>   RDMA: mr: Introduce is_pmem
>   RDMA: Allow registering MR with flush access flags
>   RDMA/rxe: Allow registering FLUSH flags for supported device only
>   RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device
>   RDMA/rxe: Allow registering persistent flag for pmem MR only
>   RDMA/rxe: Implement RC RDMA FLUSH service in requester side
>   RDMA/rxe: Set BTH's SE to zero for FLUSH packet
>   RDMA/rxe: Implement flush execution in responder side
>   RDMA/rxe: Implement flush completion
>   RDMA/rxe: Add RD FLUSH service support
> 
>  drivers/infiniband/core/uverbs_cmd.c    |  16 +++
>  drivers/infiniband/sw/rxe/rxe_comp.c    |   4 +-
>  drivers/infiniband/sw/rxe/rxe_hdr.h     |  52 ++++++++++
>  drivers/infiniband/sw/rxe/rxe_loc.h     |   2 +
>  drivers/infiniband/sw/rxe/rxe_mr.c      |  63 +++++++++++-
>  drivers/infiniband/sw/rxe/rxe_opcode.c  |  33 ++++++
>  drivers/infiniband/sw/rxe/rxe_opcode.h  |   3 +
>  drivers/infiniband/sw/rxe/rxe_param.h   |   3 +-
>  drivers/infiniband/sw/rxe/rxe_req.c     |  14 ++-
>  drivers/infiniband/sw/rxe/rxe_resp.c    | 131 +++++++++++++++++++++++-
>  include/rdma/ib_pack.h                  |   3 +
>  include/rdma/ib_verbs.h                 |  22 +++-
>  include/uapi/rdma/ib_user_ioctl_verbs.h |   2 +
>  include/uapi/rdma/ib_user_verbs.h       |  18 ++++
>  include/uapi/rdma/rdma_user_rxe.h       |   6 ++
>  15 files changed, 362 insertions(+), 10 deletions(-)
> 
> --
> 2.31.1
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation
  2021-12-29  8:49 ` [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Gromadzki, Tomasz
@ 2021-12-29 14:35   ` Tom Talpey
  2021-12-31  1:10     ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Tom Talpey @ 2021-12-29 14:35 UTC (permalink / raw)
  To: Gromadzki, Tomasz, Li Zhijian, linux-rdma, zyjzyj2000, jgg,
	aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto

On 12/29/2021 3:49 AM, Gromadzki, Tomasz wrote:
> 1)
>> rdma_post_flush(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl,
>> 		int nsge, int flags, uint32_t type, uint32_t level,
>> 		uint64_t remote_addr, uint32_t rkey)
> Shall we not use struct ibv_sge *sgl in this API call but explicitly use flush length (DMALen) argument?
> It will be similar to other rdma_post_XXX API calls where ibv_sge is not used at all.
> Ibv_sge is used only in rdma_post_XXXv API calls.

I agree. The Flush operation does not access any local memory and
therefore an SGE is inappropriate. Better to pass the actual
Flush parameters, i.e. add the length. See next comment.
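
Just to make the suggestion concrete, a rough sketch of such a wrapper
(treat this as illustrative only: IBV_WR_FLUSH and the wr.flush fields
are taken from the RFC rdma-core branch quoted above, the extra length
member is the addition being suggested here, and I'm assuming the flush
struct lives in the existing wr union):

	/* needs <rdma/rdma_cma.h> and <infiniband/verbs.h> */
	static inline int
	rdma_post_flush(struct rdma_cm_id *id, void *context, int flags,
			uint32_t type, uint32_t level, uint64_t remote_addr,
			uint32_t length, uint32_t rkey)
	{
		struct ibv_send_wr wr, *bad;

		memset(&wr, 0, sizeof(wr));
		wr.wr_id = (uintptr_t)context;
		wr.opcode = IBV_WR_FLUSH;	/* proposed opcode, RFC branch */
		wr.send_flags = flags;
		wr.num_sge = 0;			/* no local buffer involved */
		wr.wr.flush.remote_addr = remote_addr;
		wr.wr.flush.rkey = rkey;
		wr.wr.flush.type = type;
		wr.wr.flush.level = level;
		wr.wr.flush.length = length;	/* the new field suggested above */

		return ibv_post_send(id->qp, &wr, &bad);
	}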

> 2)
>> struct ibv_send_wr {
>> ...
>> 	union {
>> 		struct {
>> 			uint64_t	remote_addr;
>> 			uint32_t	rkey;
>> 			uint32_t	type;
>> 			uint32_t	level;
>> 		} flush;
> 
> Shall we extend this structure with
> uint32_t length

Also agree. The remote length of the subregion to be flushed is
a required parameter and should be passed here.

Is it really necessary to promote the type and level to full 32-bit
fields in the API? In the protocol, these are just 4 bits and 2 bits,
respectively, and are packed together into a single 32-bit FETH field.
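
For illustration, one possible narrower layout (a sketch only, not the
final uapi; the struct name and field names here are mine):

	#include <stdint.h>

	/* sketch of the flush WR fields: add the remote length, and keep
	 * type/level small since FETH only carries 4 + 2 bits for them
	 */
	struct flush_wr_sketch {
		uint64_t remote_addr;
		uint32_t rkey;
		uint32_t length;	/* remote DMALen to flush */
		uint8_t  type;		/* placement type (FETH.PLT, 4 bits) */
		uint8_t  level;		/* selectivity level (FETH.SEL, 2 bits) */
	};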

Tom.

> and abandon using *sg_list as it is related in RDMA verbs to local memory access only:
> 
> Ibv_post_send.3:
> struct ibv_sge         *sg_list;                /* Pointer to the s/g array */
> int                     num_sge;                /* Size of the s/g array */
> 
> In the case of flush, there is no local context at all.
> 
> Thanks
> Tomasz
> 
>> -----Original Message-----
>> From: Li Zhijian <lizhijian@cn.fujitsu.com>
>> Sent: Tuesday, December 28, 2021 9:07 AM
>> To: linux-rdma@vger.kernel.org; zyjzyj2000@gmail.com; jgg@ziepe.ca;
>> aharonl@nvidia.com; leon@kernel.org
>> Cc: linux-kernel@vger.kernel.org; mbloch@nvidia.com;
>> liweihang@huawei.com; liangwenpeng@huawei.com;
>> yangx.jy@cn.fujitsu.com; rpearsonhpe@gmail.com; y-goto@fujitsu.com; Li
>> Zhijian <lizhijian@cn.fujitsu.com>
>> Subject: [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH
>> operation
>>
>> <snip>
>>
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2021-12-28  8:07 ` [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side Li Zhijian
@ 2021-12-30 22:18   ` Tom Talpey
  2021-12-31  1:37     ` lizhijian
  2022-01-04 16:40   ` Tom Talpey
  2022-01-06  0:28   ` Jason Gunthorpe
  2 siblings, 1 reply; 49+ messages in thread
From: Tom Talpey @ 2021-12-30 22:18 UTC (permalink / raw)
  To: Li Zhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto

On 12/28/2021 3:07 AM, Li Zhijian wrote:
> In contrast to other opcodes, after a series of sanity checks, the FLUSH
> opcode does a placement type check before it actually performs the flush
> operation. The responder also replies with a NAK "Remote Access Error" if
> it finds a placement type violation.
> 
> We persist data via arch_wb_cache_pmem(), which could be
> architecture specific.
> 
> After execution, the responder reports success by replying with an
> RDMA READ response of zero size.
> 
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> ---
>   drivers/infiniband/sw/rxe/rxe_hdr.h  |  28 ++++++
>   drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
>   drivers/infiniband/sw/rxe/rxe_mr.c   |   4 +-
>   drivers/infiniband/sw/rxe/rxe_resp.c | 131 ++++++++++++++++++++++++++-
>   include/uapi/rdma/ib_user_verbs.h    |  10 ++
>   5 files changed, 169 insertions(+), 6 deletions(-)
> 

<snip>

> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
> +{
> +	int			err;
> +	int			bytes;
> +	u8			*va;
> +	struct rxe_map		**map;
> +	struct rxe_phys_buf	*buf;
> +	int			m;
> +	int			i;
> +	size_t			offset;
> +
> +	if (length == 0)
> +		return 0;

The length is only relevant when the flush type is "Memory Region
Range".

When the flush type is "Memory Region", the entire region must be
flushed successfully before completing the operation.

> +
> +	if (mr->type == IB_MR_TYPE_DMA) {
> +		arch_wb_cache_pmem((void *)iova, length);
> +		return 0;
> +	}

Are DMA MRs supported for remote access? I thought that was
prevented on first principles now. I might suggest not allowing
them to be flushed in any event. There is no length restriction,
and it's a VERY costly operation. At a minimum, protect this
closely.
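
One conservative option would be to refuse the flush outright for DMA
MRs in process_flush(), before any writeback is attempted, something
like the sketch below (which resp_states/NAK value is the right one
here is a separate question):

	/* sketch: refuse to flush an unbounded physical range on behalf
	 * of a remote peer
	 */
	if (mr->type == IB_MR_TYPE_DMA)
		return RESPST_ERR_RKEY_VIOLATION;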

> +
> +	WARN_ON_ONCE(!mr->cur_map_set);

The WARN is rather pointless because the code will crash just
seven lines below.

> +
> +	err = mr_check_range(mr, iova, length);
> +	if (err) {
> +		err = -EFAULT;
> +		goto err1;
> +	}
> +
> +	lookup_iova(mr, iova, &m, &i, &offset);
> +
> +	map = mr->cur_map_set->map + m;
> +	buf	= map[0]->buf + i;
> +
> +	while (length > 0) {
> +		va	= (u8 *)(uintptr_t)buf->addr + offset;
> +		bytes	= buf->size - offset;
> +
> +		if (bytes > length)
> +			bytes = length;
> +
> +		arch_wb_cache_pmem(va, bytes);
> +
> +		length	-= bytes;
> +
> +		offset	= 0;
> +		buf++;
> +		i++;
> +
> +		if (i == RXE_BUF_PER_MAP) {
> +			i = 0;
> +			map++;
> +			buf = map[0]->buf;
> +		}
> +	}
> +
> +	return 0;
> +
> +err1:
> +	return err;
> +}
> +
> +static enum resp_states process_flush(struct rxe_qp *qp,
> +				       struct rxe_pkt_info *pkt)
> +{
> +	u64 length = 0, start = qp->resp.va;
> +	u32 sel = feth_sel(pkt);
> +	u32 plt = feth_plt(pkt);
> +	struct rxe_mr *mr = qp->resp.mr;
> +
> +	if (sel == IB_EXT_SEL_MR_RANGE)
> +		length = qp->resp.length;
> +	else if (sel == IB_EXT_SEL_MR_WHOLE)
> +		length = mr->cur_map_set->length;

I'm going to have to think about these
> +
> +	if (plt == IB_EXT_PLT_PERSIST) {
> +		nvdimm_flush_iova(mr, start, length);
> +		wmb(); // clwb follows by a sfence
> +	} else if (plt == IB_EXT_PLT_GLB_VIS)
> +		wmb(); // sfence is enough

The persistence and global visibility bits are not mutually
exclusive, and in fact persistence does not imply global
visibility in some platforms. They must be tested and
processed individually.

	if (plt & IB_EXT_PLT_PERSIST)
		...
	else if (plt & IB_EXT_PLT_GLB_VIS)
		..

Second, the "clwb" and "sfence" comments are completely
Intel-specific. What processing will be done on other
processor platforms???

Tom.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only
  2021-12-28  8:07 ` [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only Li Zhijian
@ 2021-12-30 22:25   ` Tom Talpey
  2021-12-31  3:34     ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Tom Talpey @ 2021-12-30 22:25 UTC (permalink / raw)
  To: Li Zhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto

On 12/28/2021 3:07 AM, Li Zhijian wrote:
> Memory region should support 2 placement types: IB_ACCESS_FLUSH_PERSISTENT
> and IB_ACCESS_FLUSH_GLOBAL_VISIBILITY, and only pmem/nvdimm has ability to
> persist data(IB_ACCESS_FLUSH_PERSISTENT).
> 
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> ---
>   drivers/infiniband/sw/rxe/rxe_mr.c | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index bcd5e7afa475..21616d058f29 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -206,6 +206,11 @@ static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
>   	return page_in_dev_pagemap(page);
>   }
>   
> +static bool ib_check_flush_access_flags(struct ib_mr *mr, u32 flags)
> +{
> +	return mr->is_pmem || !(flags & IB_ACCESS_FLUSH_PERSISTENT);
> +}

It is perfectly allowed to flush ordinary memory, persistence is
another matter entirely. Is this subroutine checking for flush,
or persistence? Its name is confusing and needs to be clarified.

> +
>   int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>   		     int access, struct rxe_mr *mr)
>   {
> @@ -282,6 +287,13 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>   
>   	// iova_in_pmem must be called after set is updated
>   	mr->ibmr.is_pmem = iova_in_pmem(mr, iova, length);
> +	if (!ib_check_flush_access_flags(&mr->ibmr, access)) {
> +		pr_err("Cannot set IB_ACCESS_FLUSH_PERSISTENT for non-pmem memory\n");
> +		mr->state = RXE_MR_STATE_INVALID;
> +		mr->umem = NULL;
> +		err = -EINVAL;
> +		goto err_release_umem;
> +	}

Setting is_pmem is reasonable, but again, this is confusing with respect
to the region being flushable. In general, all memory is flushable,
provided the platform supports any kind of cache flush (i.e. all of them).

Tom.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation
  2021-12-29 14:35   ` Tom Talpey
@ 2021-12-31  1:10     ` lizhijian
  0 siblings, 0 replies; 49+ messages in thread
From: lizhijian @ 2021-12-31  1:10 UTC (permalink / raw)
  To: Tom Talpey, Gromadzki, Tomasz, lizhijian, linux-rdma, zyjzyj2000,
	jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto

Tom, Tomasz,

All your reviews are greatly appreciated.


On 29/12/2021 22:35, Tom Talpey wrote:
> On 12/29/2021 3:49 AM, Gromadzki, Tomasz wrote:
>> 1)
>>> rdma_post_flush(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl,
>>>         int nsge, int flags, uint32_t type, uint32_t level,
>>>         uint64_t remote_addr, uint32_t rkey)
>> Shall we not use struct ibv_sge *sgl in this API call but explicitly use flush length (DMALen) argument?
>> It will be similar to other rdma_post_XXX API calls where ibv_sge is not used at all.
>> Ibv_sge is used only in rdma_post_XXXv API calls.
Indeed, I missed rdma_post_XXX and rdma_post_XXXv.



>
> I agree. The Flush operation does not access any local memory and
> therefore an SGE is inappropriate. Better to pass the actual
> Flush parameters, i.e. add the length. See next comment.

It sounds good.


>
>> 2)
>>> struct ibv_send_wr {
>>> ...
>>>     union {
>>>         struct {
>>>             uint64_t    remote_addr;
>>>             uint32_t    rkey;
>>>             uint32_t    type;
>>>             uint32_t    level;
>>>         } flush;
>>
>> Shall we extend this structure with
>> uint32_t length
>
> Also agree. The remote length of the subregion to be flushed is
> a required parameter and should be passed here.

Agree.


>
> Is it really necessary to promote the type and level to full 32-bit
> fields in the API? In the protocol, these are just 4 bits and 2 bits,
> respectively, and are packed together into a single 32-bit FETH field.
Good catch, I will update both of them to u8 instead.



>
> Tom.
>
>> and abandon using *sg_list as it is related in RDMA verbs to local memory access only:

Oh, good to know; that's new to me. Many thanks.


>>
>> Ibv_post_send.3:
>> struct ibv_sge         *sg_list;                /* Pointer to the s/g array */
>> int                     num_sge;                /* Size of the s/g array */
>>
>> In the case of flush, there is no local context at all.

Got it.


Thanks
Zhijian

>>
>> Thanks
>> Tomasz
>>
>>> -----Original Message-----
>>> From: Li Zhijian <lizhijian@cn.fujitsu.com>
>>> Sent: Tuesday, December 28, 2021 9:07 AM
>>> To: linux-rdma@vger.kernel.org; zyjzyj2000@gmail.com; jgg@ziepe.ca;
>>> aharonl@nvidia.com; leon@kernel.org
>>> Cc: linux-kernel@vger.kernel.org; mbloch@nvidia.com;
>>> liweihang@huawei.com; liangwenpeng@huawei.com;
>>> yangx.jy@cn.fujitsu.com; rpearsonhpe@gmail.com; y-goto@fujitsu.com; Li
>>> Zhijian <lizhijian@cn.fujitsu.com>
>>> Subject: [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH
>>> operation
>>>
>>> <snip>
>>>
>>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2021-12-30 22:18   ` Tom Talpey
@ 2021-12-31  1:37     ` lizhijian
  2021-12-31  2:32       ` Tom Talpey
  0 siblings, 1 reply; 49+ messages in thread
From: lizhijian @ 2021-12-31  1:37 UTC (permalink / raw)
  To: Tom Talpey, lizhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto



On 31/12/2021 06:18, Tom Talpey wrote:
> On 12/28/2021 3:07 AM, Li Zhijian wrote:
>> In contrast to other opcodes, after a series of sanity checking, FLUSH
>> opcode will do a Placement Type checking before it really do the FLUSH
>> operation. Responder will also reply NAK "Remote Access Error" if it
>> found a placement type violation.
>>
>> We will persist data via arch_wb_cache_pmem(), which could be
>> architecture specific.
>>
>> After the execution, responder would reply a responded successfully by
>> RDMA READ response of zero size.
>>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>>   drivers/infiniband/sw/rxe/rxe_hdr.h  |  28 ++++++
>>   drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
>>   drivers/infiniband/sw/rxe/rxe_mr.c   |   4 +-
>>   drivers/infiniband/sw/rxe/rxe_resp.c | 131 ++++++++++++++++++++++++++-
>>   include/uapi/rdma/ib_user_verbs.h    |  10 ++
>>   5 files changed, 169 insertions(+), 6 deletions(-)
>>
>
> <snip>
>
>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>> +{
>> +    int            err;
>> +    int            bytes;
>> +    u8            *va;
>> +    struct rxe_map        **map;
>> +    struct rxe_phys_buf    *buf;
>> +    int            m;
>> +    int            i;
>> +    size_t            offset;
>> +
>> +    if (length == 0)
>> +        return 0;
>
> The length is only relevant when the flush type is "Memory Region
> Range".
>
> When the flush type is "Memory Region", the entire region must be
> flushed successfully before completing the operation.

Yes, currently the length is expanded to the MR's length in that case.


>
>> +
>> +    if (mr->type == IB_MR_TYPE_DMA) {
>> +        arch_wb_cache_pmem((void *)iova, length);
>> +        return 0;
>> +    }
>
> Are dmamr's supported for remote access? I thought that was
> prevented on first principles now. I might suggest not allowing
> them to be flushed in any event. There is no length restriction,
> and it's a VERY costly operation. At a minimum, protect this
> closely.
Indeed, I wasn't confident about this; the main logic comes from rxe_mr_copy().
Thanks for the suggestion.



>
>> +
>> +    WARN_ON_ONCE(!mr->cur_map_set);
>
> The WARN is rather pointless because the code will crash just
> seven lines below.
>
>> +
>> +    err = mr_check_range(mr, iova, length);
>> +    if (err) {
>> +        err = -EFAULT;
>> +        goto err1;
>> +    }
>> +
>> +    lookup_iova(mr, iova, &m, &i, &offset);
>> +
>> +    map = mr->cur_map_set->map + m;
>> +    buf    = map[0]->buf + i;
>> +
>> +    while (length > 0) {
>> +        va    = (u8 *)(uintptr_t)buf->addr + offset;
>> +        bytes    = buf->size - offset;
>> +
>> +        if (bytes > length)
>> +            bytes = length;
>> +
>> +        arch_wb_cache_pmem(va, bytes);
>> +
>> +        length    -= bytes;
>> +
>> +        offset    = 0;
>> +        buf++;
>> +        i++;
>> +
>> +        if (i == RXE_BUF_PER_MAP) {
>> +            i = 0;
>> +            map++;
>> +            buf = map[0]->buf;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +
>> +err1:
>> +    return err;
>> +}
>> +
>> +static enum resp_states process_flush(struct rxe_qp *qp,
>> +                       struct rxe_pkt_info *pkt)
>> +{
>> +    u64 length = 0, start = qp->resp.va;
>> +    u32 sel = feth_sel(pkt);
>> +    u32 plt = feth_plt(pkt);
>> +    struct rxe_mr *mr = qp->resp.mr;
>> +
>> +    if (sel == IB_EXT_SEL_MR_RANGE)
>> +        length = qp->resp.length;
>> +    else if (sel == IB_EXT_SEL_MR_WHOLE)
>> +        length = mr->cur_map_set->length;
>
> I'm going to have to think about these

Yes, you've made me realize that we should also consider adjusting the iova start to the MR's start.


>> +
>> +    if (plt == IB_EXT_PLT_PERSIST) {
>> +        nvdimm_flush_iova(mr, start, length);
>> +        wmb(); // clwb follows by a sfence
>> +    } else if (plt == IB_EXT_PLT_GLB_VIS)
>> +        wmb(); // sfence is enough
>
> The persistence and global visibility bits are not mutually
> exclusive,
My bad, it never crossed my mind. o(╯□╰)o




> and in fact persistence does not imply global
> visibility in some platforms. 
If so, and per the SPEC, why not
    if (plt & IB_EXT_PLT_PERSIST)
       do_somethingA();
    if (plt & IB_EXT_PLT_GLB_VIS)
       do_somethingB();



> They must be tested and
> processed individually.
>
>     if (plt & IB_EXT_PLT_PERSIST)
>         ...
>     else if (plt & IB_EXT_PLT_GLB_VIS)
>         ..
>
> Second, the "clwb" and "sfence" comments are completely
> Intel-specific. 
good catch.


> What processing will be done on other
> processor platforms???

I haven't dug into architectures other than Intel yet.
In this function, I think I just need to call the higher-level wrappers; are wmb() and
arch_wb_cache_pmem() enough?

Again, thank you.

Thanks.




>
> Tom.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2021-12-31  1:37     ` lizhijian
@ 2021-12-31  2:32       ` Tom Talpey
  2022-01-04  8:51         ` lizhijian
  2022-01-06  0:35         ` Jason Gunthorpe
  0 siblings, 2 replies; 49+ messages in thread
From: Tom Talpey @ 2021-12-31  2:32 UTC (permalink / raw)
  To: lizhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto


On 12/30/2021 8:37 PM, lizhijian@fujitsu.com wrote:
>>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>>> +{
>>> +    int            err;
>>> +    int            bytes;
>>> +    u8            *va;
>>> +    struct rxe_map        **map;
>>> +    struct rxe_phys_buf    *buf;
>>> +    int            m;
>>> +    int            i;
>>> +    size_t            offset;
>>> +
>>> +    if (length == 0)
>>> +        return 0;
>>
>> The length is only relevant when the flush type is "Memory Region
>> Range".
>>
>> When the flush type is "Memory Region", the entire region must be
>> flushed successfully before completing the operation.
> 
> Yes, currently, the length has been expanded to the MR's length in such case.

I'll dig deeper, but this is not immediately clear in this context.
A zero-length operation is however valid, even though it's a no-op,
the application may use it to ensure certain ordering constraints.
Will comment later if I have a specific concern.

>>> +
>>> +    if (mr->type == IB_MR_TYPE_DMA) {
>>> +        arch_wb_cache_pmem((void *)iova, length);
>>> +        return 0;
>>> +    }
>>
>> Are dmamr's supported for remote access? I thought that was
>> prevented on first principles now. I might suggest not allowing
>> them to be flushed in any event. There is no length restriction,
>> and it's a VERY costly operation. At a minimum, protect this
>> closely.
> Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy()
> Thanks for the suggestion.

Well, be careful when following semantics from local behaviors. When you
are processing a command from the wire, extreme caution is needed.
ESPECIALLY WHEN IT'S A DMA_MR, WHICH PROVIDES ACCESS TO ALL PHYSICAL!!!

Sorry for shouting. :)

>>> +
>>> +    WARN_ON_ONCE(!mr->cur_map_set);
>>
>> The WARN is rather pointless because the code will crash just
>> seven lines below.
>>
>>> +
>>> +    err = mr_check_range(mr, iova, length);
>>> +    if (err) {
>>> +        err = -EFAULT;
>>> +        goto err1;
>>> +    }
>>> +
>>> +    lookup_iova(mr, iova, &m, &i, &offset);
>>> +
>>> +    map = mr->cur_map_set->map + m;
>>> +    buf    = map[0]->buf + i;
>>> +
>>> +    while (length > 0) {
>>> +        va    = (u8 *)(uintptr_t)buf->addr + offset;
>>> +        bytes    = buf->size - offset;
>>> +
>>> +        if (bytes > length)
>>> +            bytes = length;
>>> +
>>> +        arch_wb_cache_pmem(va, bytes);
>>> +
>>> +        length    -= bytes;
>>> +
>>> +        offset    = 0;
>>> +        buf++;
>>> +        i++;
>>> +
>>> +        if (i == RXE_BUF_PER_MAP) {
>>> +            i = 0;
>>> +            map++;
>>> +            buf = map[0]->buf;
>>> +        }
>>> +    }
>>> +
>>> +    return 0;
>>> +
>>> +err1:
>>> +    return err;
>>> +}
>>> +
>>> +static enum resp_states process_flush(struct rxe_qp *qp,
>>> +                       struct rxe_pkt_info *pkt)
>>> +{
>>> +    u64 length = 0, start = qp->resp.va;
>>> +    u32 sel = feth_sel(pkt);
>>> +    u32 plt = feth_plt(pkt);
>>> +    struct rxe_mr *mr = qp->resp.mr;
>>> +
>>> +    if (sel == IB_EXT_SEL_MR_RANGE)
>>> +        length = qp->resp.length;
>>> +    else if (sel == IB_EXT_SEL_MR_WHOLE)
>>> +        length = mr->cur_map_set->length;
>>
>> I'm going to have to think about these
> 
> Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well.

I'll review again when you decide.
>>> +
>>> +    if (plt == IB_EXT_PLT_PERSIST) {
>>> +        nvdimm_flush_iova(mr, start, length);
>>> +        wmb(); // clwb follows by a sfence
>>> +    } else if (plt == IB_EXT_PLT_GLB_VIS)
>>> +        wmb(); // sfence is enough
>>
>> The persistence and global visibility bits are not mutually
>> exclusive,
> My bad, it ever appeared in my mind. o(╯□╰)o
> 
> 
> 
> 
>> and in fact persistence does not imply global
>> visibility in some platforms.
> If so, and per the SPEC, why not
>      if (plt & IB_EXT_PLT_PERSIST)
>         do_somethingA();
>      if (plt & IB_EXT_PLT_GLB_VIS)
>         do_somethingB();

At the simplest, yes. But there are many subtle other possibilities.

The global visibility is oriented toward supporting distributed
shared memory workloads, or publish/subscribe on high scale systems.
It's mainly about ensuring that all devices and CPUs see the data,
something ordinary RDMA Write does not guarantee. This often means
flushing write pipelines, possibly followed by invalidating caches.

The persistence, of course, is about ensuring the data is stored. This
is entirely different from making it visible.

In fact, you often want to optimize the two scenarios together, since
these operations are all about optimizing latency. The choice is going
to depend greatly on the behavior of the platform itself. For example,
certain caches provide persistence when batteries are present. So,
providing persistence and providing visibility are one and the same.
No need to do that twice.

On the other hand, some systems may provide persistence only after a
significant cost, such as writing into flash from a DRAM buffer, or
when an Optane DIMM becomes overloaded from continuous writes and
begins to throttle them. The flush may need some rather intricate
processing, to avoid deadlock.

Your code does not appear to envision these, so the simple way might
be best for now. But you should consider other situations.

>> Second, the "clwb" and "sfence" comments are completely
>> Intel-specific.
> good catch.
> 
> 
>> What processing will be done on other
>> processor platforms???
> 
> I didn't dig other ARCH yet but INTEL.
> In this function, i think i just need to call the higher level wrapper, like wmb() and
> arch_wb_cache_pmem are enough, right ?

Well, higher level wrappers may signal errors, for example if they're
not available or unreliable, and you will need to handle them. Also,
they may block. Is that ok in this context?

You need to get this right on all platforms which will run this code.
It is not acceptable to guess at whether the data is in the required
state before completing the RDMA_FLUSH. If this is not guaranteed,
the operation must raise the required error. To do anything else will
violate the protocol contract, and the remote application may fail.
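
As a sketch of what I mean, reusing the kernel's existing
CONFIG_ARCH_HAS_PMEM_API gate (the exact resp_states error to return is
your call, IB_EXT_PLT_PERSIST and feth_plt() are from your patch), the
responder could refuse a persistence flush it cannot honor:

	/* sketch: if the architecture cannot write CPU caches back to
	 * pmem, a persistence flush must fail rather than silently
	 * "succeeding"
	 */
	if ((feth_plt(pkt) & IB_EXT_PLT_PERSIST) &&
	    !IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API))
		return RESPST_ERR_UNSUPPORTED_OPCODE;	/* or whichever NAK fits */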

Tom.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only
  2021-12-30 22:25   ` Tom Talpey
@ 2021-12-31  3:34     ` lizhijian
  2021-12-31 14:40       ` Tom Talpey
  0 siblings, 1 reply; 49+ messages in thread
From: lizhijian @ 2021-12-31  3:34 UTC (permalink / raw)
  To: Tom Talpey, lizhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto


Dropped the unreachable liweihang@huawei.com from the recipients.


On 31/12/2021 06:25, Tom Talpey wrote:
> On 12/28/2021 3:07 AM, Li Zhijian wrote:
>> Memory region should support 2 placement types: IB_ACCESS_FLUSH_PERSISTENT
>> and IB_ACCESS_FLUSH_GLOBAL_VISIBILITY, and only pmem/nvdimm has ability to
>> persist data(IB_ACCESS_FLUSH_PERSISTENT).
>>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>>   drivers/infiniband/sw/rxe/rxe_mr.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
>> index bcd5e7afa475..21616d058f29 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>> @@ -206,6 +206,11 @@ static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
>>       return page_in_dev_pagemap(page);
>>   }
>>   +static bool ib_check_flush_access_flags(struct ib_mr *mr, u32 flags)
>> +{
>> +    return mr->is_pmem || !(flags & IB_ACCESS_FLUSH_PERSISTENT);
>> +}
>
> It is perfectly allowed to flush ordinary memory, persistence is
> another matter entirely. 
It is, but only for MRs that were registered with the FLUSH access flags.



> Is this subroutine checking for flush,
> or persistence? 

Both, but it is only called at MR registration time.
We have two checking points. This is the first gate, where we prevent a
local user from registering the persistent access flag on ordinary memory.

The second is in [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side,
where we prevent a remote user from requesting that data be persisted to ordinary memory.
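
Roughly, that second gate on the responder side is meant to look like
this (a sketch of the intent only, not the exact code from patch 08):

	/* a FLUSH that requests persistence must target an MR that was
	 * registered on pmem
	 */
	if ((feth_plt(pkt) & IB_EXT_PLT_PERSIST) && !mr->ibmr.is_pmem)
		return RESPST_ERR_RKEY_VIOLATION;	/* NAK Remote Access Error */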

> Its name is confusing and needs to be clarified.

Err, let me see... a more suitable name is very welcome.


Thanks

>
>> +
>>   int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>>                int access, struct rxe_mr *mr)
>>   {
>> @@ -282,6 +287,13 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>>         // iova_in_pmem must be called after set is updated
>>       mr->ibmr.is_pmem = iova_in_pmem(mr, iova, length);
>> +    if (!ib_check_flush_access_flags(&mr->ibmr, access)) {
>> +        pr_err("Cannot set IB_ACCESS_FLUSH_PERSISTENT for non-pmem memory\n");
>> +        mr->state = RXE_MR_STATE_INVALID;
>> +        mr->umem = NULL;
>> +        err = -EINVAL;
>> +        goto err_release_umem;
>> +    }
>
> Setting is_pmem is reasonable, but again, this is confusing with respect
> to the region being flushable. In general, all memory is flushable,
> provided the platform supports any kind of cache flush (i.e. all of them).
>
> Tom.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only
  2021-12-31  3:34     ` lizhijian
@ 2021-12-31 14:40       ` Tom Talpey
  2022-01-04  1:32         ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Tom Talpey @ 2021-12-31 14:40 UTC (permalink / raw)
  To: lizhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto

On 12/30/2021 10:34 PM, lizhijian@fujitsu.com wrote:

>>>    +static bool ib_check_flush_access_flags(struct ib_mr *mr, u32 flags)
>>> +{
>>> +    return mr->is_pmem || !(flags & IB_ACCESS_FLUSH_PERSISTENT);
>>> +}

>> Its name is confusing and needs to be clarified.
> 
> Err,  let me see.... a more suitable name is very welcome.

Since the subroutine is rather simple, and with only a single
reference in a single file, it would be best to just pull
the test inline and delete it. This would also remove some
inefficient code.

  	if (flags & IB_ACCESS_FLUSH_PERSISTENT) {
		if (!iova_in_pmem(mr, iova, length)) {
			pr_err("Cannot set IB_ACCESS_FLUSH_PERSISTENT for non-pmem memory\n");
			mr->state = RXE_MR_STATE_INVALID;
			mr->umem = NULL;
			err = -EINVAL;
			goto err_release_umem;
		}
		mr->ibmr.is_pmem = true;
	}
...


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only
  2021-12-31 14:40       ` Tom Talpey
@ 2022-01-04  1:32         ` lizhijian
  0 siblings, 0 replies; 49+ messages in thread
From: lizhijian @ 2022-01-04  1:32 UTC (permalink / raw)
  To: Tom Talpey, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto



On 31/12/2021 22:40, Tom Talpey wrote:
> On 12/30/2021 10:34 PM, lizhijian@fujitsu.com wrote:
>
>>>>    +static bool ib_check_flush_access_flags(struct ib_mr *mr, u32 flags)
>>>> +{
>>>> +    return mr->is_pmem || !(flags & IB_ACCESS_FLUSH_PERSISTENT);
>>>> +}
>
>>> Its name is confusing and needs to be clarified.
>>
>> Err,  let me see.... a more suitable name is very welcome.
>
> Since the subroutine is rather simple, and with only a single
> reference in a single file, it would be best to just pull
> the test inline and delete it. This would also remove some
> inefficient code.
>
>      if (flags & IB_ACCESS_FLUSH_PERSISTENT) {
>         if (!iova_in_pmem(mr, iova, length) {
>             pr_err("Cannot set IB_ACCESS_FLUSH_PERSISTENT for non-pmem memory\n");
>             mr->state = RXE_MR_STATE_INVALID;
>             mr->umem = NULL;
>             err = -EINVAL;
>             goto err_release_umem;
>         }
>         mr-> ibmr.is_pmem = true;
>     }
> ...

Make sense.

Thanks
Zhijian


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2021-12-31  2:32       ` Tom Talpey
@ 2022-01-04  8:51         ` lizhijian
  2022-01-04 16:02           ` Tom Talpey
  2022-01-06  0:35         ` Jason Gunthorpe
  1 sibling, 1 reply; 49+ messages in thread
From: lizhijian @ 2022-01-04  8:51 UTC (permalink / raw)
  To: Tom Talpey, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, rpearsonhpe, y-goto



On 31/12/2021 10:32, Tom Talpey wrote:
>
> On 12/30/2021 8:37 PM, lizhijian@fujitsu.com wrote
>>>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>>>> +{
>>>> +    int            err;
>>>> +    int            bytes;
>>>> +    u8            *va;
>>>> +    struct rxe_map        **map;
>>>> +    struct rxe_phys_buf    *buf;
>>>> +    int            m;
>>>> +    int            i;
>>>> +    size_t            offset;
>>>> +
>>>> +    if (length == 0)
>>>> +        return 0;
>>>
>>> The length is only relevant when the flush type is "Memory Region
>>> Range".
>>>
>>> When the flush type is "Memory Region", the entire region must be
>>> flushed successfully before completing the operation.
>>
>> Yes, currently, the length has been expanded to the MR's length in such case.
>
> I'll dig deeper, but this is not immediately clear in this context.
> A zero-length operation is however valid, even though it's a no-op,

If it's a no-op, what should we do in this case?


> the application may use it to ensure certain ordering constraints.
> Will comment later if I have a specific concern.

kindly welcome :)




>
>>>> +
>>>> +    if (mr->type == IB_MR_TYPE_DMA) {
>>>> +        arch_wb_cache_pmem((void *)iova, length);
>>>> +        return 0;
>>>> +    }
>>>
>>> Are dmamr's supported for remote access? I thought that was
>>> prevented on first principles now. I might suggest not allowing
>>> them to be flushed in any event. There is no length restriction,
>>> and it's a VERY costly operation. At a minimum, protect this
>>> closely.
>> Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy()
>> Thanks for the suggestion.
>
> Well, be careful when following semantics from local behaviors. When you
> are processing a command from the wire, extreme caution is needed.
> ESPECIALLY WHEN IT'S A DMA_MR, WHICH PROVIDES ACCESS TO ALL PHYSICAL!!!
>
> Sorry for shouting. :)

Never mind :)





>
>>>> +
>>>> +static enum resp_states process_flush(struct rxe_qp *qp,
>>>> +                       struct rxe_pkt_info *pkt)
>>>> +{
>>>> +    u64 length = 0, start = qp->resp.va;
>>>> +    u32 sel = feth_sel(pkt);
>>>> +    u32 plt = feth_plt(pkt);
>>>> +    struct rxe_mr *mr = qp->resp.mr;
>>>> +
>>>> +    if (sel == IB_EXT_SEL_MR_RANGE)
>>>> +        length = qp->resp.length;
>>>> +    else if (sel == IB_EXT_SEL_MR_WHOLE)
>>>> +        length = mr->cur_map_set->length;
>>>
>>> I'm going to have to think about these
>>
>> Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well.
>
> I'll review again when you decide.
>>>> +
>>>> +    if (plt == IB_EXT_PLT_PERSIST) {
>>>> +        nvdimm_flush_iova(mr, start, length);
>>>> +        wmb(); // clwb follows by a sfence
>>>> +    } else if (plt == IB_EXT_PLT_GLB_VIS)
>>>> +        wmb(); // sfence is enough
>>>
>>> The persistence and global visibility bits are not mutually
>>> exclusive,
>> My bad, it ever appeared in my mind. o(╯□╰)o
>>
>>
>>
>>
>>> and in fact persistence does not imply global
>>> visibility in some platforms.
>> If so, and per the SPEC, why not
>>      if (plt & IB_EXT_PLT_PERSIST)
>>         do_somethingA();
>>      if (plt & IB_EXT_PLT_GLB_VIS)
>>         do_somethingB();
>
> At the simplest, yes. But there are many subtle other possibilities.
>
> The global visibility is oriented toward supporting distributed
> shared memory workloads, or publish/subscribe on high scale systems.
> It's mainly about ensuring that all devices and CPUs see the data,
> something ordinary RDMA Write does not guarantee. This often means
> flushing write pipelines, possibly followed by invalidating caches.
>
> The persistence, of course, is about ensuring the data is stored. This
> is entirely different from making it visible.
>
> In fact, you often want to optimize the two scenarios together, since
> these operations are all about optimizing latency. The choice is going
> to depend greatly on the behavior of the platform itself. For example,
> certain caches provide persistence when batteries are present. So,
> providing persistence and providing visibility are one and the same.
> No need to do that twice.
>
> On the other hand, some systems may provide persistence only after a
> significant cost, such as writing into flash from a DRAM buffer, or
> when an Optane DIMM becomes overloaded from continuous writes and
> begins to throttle them. The flush may need some rather intricate
> processing, to avoid deadlock.
>
> Your code does not appear to envision these, so the simple way might
> be best for now. But you should consider other situations.

Thanks a lot for your detailed explanation.



>
>>> Second, the "clwb" and "sfence" comments are completely
>>> Intel-specific.
>> good catch.
>>
>>
>>> What processing will be done on other
>>> processor platforms???
>>
>> I didn't dig other ARCH yet but INTEL.
>> In this function, i think i just need to call the higher level wrapper, like wmb() and
>> arch_wb_cache_pmem are enough, right ?
>
> Well, higher level wrappers may signal errors, for example if they're
> not available or unreliable, and you will need to handle them. Also,
> they may block. Is that ok in this context?

Good question.

My bad, I forgot to check the return value of nvdimm_flush_iova() previously.

But if you mean we should guarantee that arch_wb_cache_pmem() and wmb() themselves
succeed, I have no idea how to do that yet.

The arch_wb_cache_pmem() and wmb() calls I'm currently using just hide the
assembly instructions on different architectures, and both return void.

I wonder if we can always assume they work once the code reaches them,
since all current local flushes to pmem do something like that AFAIK (feel free
to correct me if I'm wrong).
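
(For reference, the pattern I mean is roughly the following; this is my
understanding of how the wrapper is stubbed out when the architecture
has no pmem API, not a verbatim quote of the header:)

	#if IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API)
	void arch_wb_cache_pmem(void *addr, size_t size);
	#else
	static inline void arch_wb_cache_pmem(void *addr, size_t size)
	{
		/* silently does nothing, which is exactly the risk you mention */
	}
	#endif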

Thanks
Zhijian

>
> You need to get this right on all platforms which will run this code.
> It is not acceptable to guess at whether the data is in the required
> state before completing the RDMA_FLUSH. If this is not guaranteed,
> the operation must raise the required error. To do anything else will
> violate the protocol contract, and the remote application may fail.


>
> Tom.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-04  8:51         ` lizhijian
@ 2022-01-04 16:02           ` Tom Talpey
  0 siblings, 0 replies; 49+ messages in thread
From: Tom Talpey @ 2022-01-04 16:02 UTC (permalink / raw)
  To: lizhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On 1/4/2022 3:51 AM, lizhijian@fujitsu.com wrote:
> 
> 
> On 31/12/2021 10:32, Tom Talpey wrote:
>>
>> On 12/30/2021 8:37 PM, lizhijian@fujitsu.com wrote
>>>>> +static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
>>>>> +{
>>>>> +    int            err;
>>>>> +    int            bytes;
>>>>> +    u8            *va;
>>>>> +    struct rxe_map        **map;
>>>>> +    struct rxe_phys_buf    *buf;
>>>>> +    int            m;
>>>>> +    int            i;
>>>>> +    size_t            offset;
>>>>> +
>>>>> +    if (length == 0)
>>>>> +        return 0;
>>>>
>>>> The length is only relevant when the flush type is "Memory Region
>>>> Range".
>>>>
>>>> When the flush type is "Memory Region", the entire region must be
>>>> flushed successfully before completing the operation.
>>>
>>> Yes, currently, the length has been expanded to the MR's length in such case.
>>
>> I'll dig deeper, but this is not immediately clear in this context.
>> A zero-length operation is however valid, even though it's a no-op,
> 
> If it's no-op, what shall we do in this case.

It's a no-op only in the case that the flush has the MRR (range)
flag set. When the Region flag is set, the length is ignored. As
you point out, the length is determined in process_flush().

I think the confusion arises because the routine is being used
for both cases, and the zero-length case is only valid for one.
To me, it would make more sense to make this test before calling
it. Something like:

+static enum resp_states process_flush(struct rxe_qp *qp,
+				       struct rxe_pkt_info *pkt)
+{
+	u64 length = 0, start = qp->resp.va;
+	u32 sel = feth_sel(pkt);
+	u32 plt = feth_plt(pkt);
+	struct rxe_mr *mr = qp->resp.mr;
+
+	if (sel == IB_EXT_SEL_MR_RANGE)
+		length = qp->resp.length;
+	else if (sel == IB_EXT_SEL_MR_WHOLE)
	{
+		length = mr->cur_map_set->length;
		WARN_ON(length == 0);
	}

	if (length > 0) {
		// there may be optimizations possible here...
		if (plt & IB_EXT_PLT_PERSIST)
			(flush to persistence)
		if (plt & IB_EXT_PLT_GLB_VIS)
			(flush to global visibility)
	}

+	/* set RDMA READ response of zero */
+	qp->resp.resid = 0;
+
+	return RESPST_READ_REPLY;
+}

Tom.

> 
> 
>> the application may use it to ensure certain ordering constraints.
>> Will comment later if I have a specific concern.
> 
> kindly welcome :)
> 
> 
> 
> 
>>
>>>>> +
>>>>> +    if (mr->type == IB_MR_TYPE_DMA) {
>>>>> +        arch_wb_cache_pmem((void *)iova, length);
>>>>> +        return 0;
>>>>> +    }
>>>>
>>>> Are dmamr's supported for remote access? I thought that was
>>>> prevented on first principles now. I might suggest not allowing
>>>> them to be flushed in any event. There is no length restriction,
>>>> and it's a VERY costly operation. At a minimum, protect this
>>>> closely.
>>> Indeed, I didn't confidence about this, the main logic comes from rxe_mr_copy()
>>> Thanks for the suggestion.
>>
>> Well, be careful when following semantics from local behaviors. When you
>> are processing a command from the wire, extreme caution is needed.
>> ESPECIALLY WHEN IT'S A DMA_MR, WHICH PROVIDES ACCESS TO ALL PHYSICAL!!!
>>
>> Sorry for shouting. :)
> 
> Never mind :)
> 
> 
> 
> 
> 
>>
>>>>> +
>>>>> +static enum resp_states process_flush(struct rxe_qp *qp,
>>>>> +                       struct rxe_pkt_info *pkt)
>>>>> +{
>>>>> +    u64 length = 0, start = qp->resp.va;
>>>>> +    u32 sel = feth_sel(pkt);
>>>>> +    u32 plt = feth_plt(pkt);
>>>>> +    struct rxe_mr *mr = qp->resp.mr;
>>>>> +
>>>>> +    if (sel == IB_EXT_SEL_MR_RANGE)
>>>>> +        length = qp->resp.length;
>>>>> +    else if (sel == IB_EXT_SEL_MR_WHOLE)
>>>>> +        length = mr->cur_map_set->length;
>>>>
>>>> I'm going to have to think about these
>>>
>>> Yes, you inspire me that we should consider to adjust the start of iova to the MR's start as well.
>>
>> I'll review again when you decide.
>>>>> +
>>>>> +    if (plt == IB_EXT_PLT_PERSIST) {
>>>>> +        nvdimm_flush_iova(mr, start, length);
>>>>> +        wmb(); // clwb follows by a sfence
>>>>> +    } else if (plt == IB_EXT_PLT_GLB_VIS)
>>>>> +        wmb(); // sfence is enough
>>>>
>>>> The persistence and global visibility bits are not mutually
>>>> exclusive,
>>> My bad, it ever appeared in my mind. o(╯□╰)o
>>>
>>>
>>>
>>>
>>>> and in fact persistence does not imply global
>>>> visibility in some platforms.
>>> If so, and per the SPEC, why not
>>>       if (plt & IB_EXT_PLT_PERSIST)
>>>          do_somethingA();
>>>       if (plt & IB_EXT_PLT_GLB_VIS)
>>>          do_somethingB();
>>
>> At the simplest, yes. But there are many subtle other possibilities.
>>
>> The global visibility is oriented toward supporting distributed
>> shared memory workloads, or publish/subscribe on high scale systems.
>> It's mainly about ensuring that all devices and CPUs see the data,
>> something ordinary RDMA Write does not guarantee. This often means
>> flushing write pipelines, possibly followed by invalidating caches.
>>
>> The persistence, of course, is about ensuring the data is stored. This
>> is entirely different from making it visible.
>>
>> In fact, you often want to optimize the two scenarios together, since
>> these operations are all about optimizing latency. The choice is going
>> to depend greatly on the behavior of the platform itself. For example,
>> certain caches provide persistence when batteries are present. So,
>> providing persistence and providing visibility are one and the same.
>> No need to do that twice.
>>
>> On the other hand, some systems may provide persistence only after a
>> significant cost, such as writing into flash from a DRAM buffer, or
>> when an Optane DIMM becomes overloaded from continuous writes and
>> begins to throttle them. The flush may need some rather intricate
>> processing, to avoid deadlock.
>>
>> Your code does not appear to envision these, so the simple way might
>> be best for now. But you should consider other situations.
> 
> Thanks a lot for your detailed explanation.
> 
> 
> 
>>
>>>> Second, the "clwb" and "sfence" comments are completely
>>>> Intel-specific.
>>> good catch.
>>>
>>>
>>>> What processing will be done on other
>>>> processor platforms???
>>>
>>> I didn't dig other ARCH yet but INTEL.
>>> In this function, i think i just need to call the higher level wrapper, like wmb() and
>>> arch_wb_cache_pmem are enough, right ?
>>
>> Well, higher level wrappers may signal errors, for example if they're
>> not available or unreliable, and you will need to handle them. Also,
>> they may block. Is that ok in this context?
> 
> Good question.
> 
> My bad, I forgot to check the return value of nvdimm_flush_iova() previously.
> 
> But if you mean we should guarantee that arch_wb_cache_pmem() and wmb() succeed,
> I have no idea how to do that yet.
> 
> arch_wb_cache_pmem() and wmb(), which I'm currently using, just hide the
> assembly instructions on different architectures, and they return void.
> 
> I wonder if we can always assume they work once the code reaches them, since
> all current local flushing to pmem does something like that AFAIK (feel free
> to correct me if I'm wrong).
> 
> Thanks
> Zhijian
> 
>>
>> You need to get this right on all platforms which will run this code.
>> It is not acceptable to guess at whether the data is in the required
>> state before completing the RDMA_FLUSH. If this is not guaranteed,
>> the operation must raise the required error. To do anything else will
>> violate the protocol contract, and the remote application may fail.
> 
> 
>>
>> Tom.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2021-12-28  8:07 ` [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side Li Zhijian
  2021-12-30 22:18   ` Tom Talpey
@ 2022-01-04 16:40   ` Tom Talpey
  2022-01-05  1:43     ` lizhijian
  2022-01-06  0:28   ` Jason Gunthorpe
  2 siblings, 1 reply; 49+ messages in thread
From: Tom Talpey @ 2022-01-04 16:40 UTC (permalink / raw)
  To: Li Zhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto


On 12/28/2021 3:07 AM, Li Zhijian wrote:

> +static enum resp_states process_flush(struct rxe_qp *qp,
> +				       struct rxe_pkt_info *pkt)
> +{
> +	u64 length = 0, start = qp->resp.va;
> +	u32 sel = feth_sel(pkt);
> +	u32 plt = feth_plt(pkt);
> +	struct rxe_mr *mr = qp->resp.mr;
> +
> +	if (sel == IB_EXT_SEL_MR_RANGE)
> +		length = qp->resp.length;
> +	else if (sel == IB_EXT_SEL_MR_WHOLE)
> +		length = mr->cur_map_set->length;

I noticed another issue in this. When the "Memory Region" flag is
set, the RETH:VA field in the request is ignored, in addition to
the RETH:DMALEN. Therefore, both the start and length must be set
from the map.

Is the mr->cur_map_set[0]->addr field a virtual address, or a
physical? I can't find the cur_map_set in any patch. The virtual
(mapped) address will be needed, to pass to arch_wb_cache_pmem().

Something like this?

	if (sel == IB_EXT_SEL_MR_WHOLE) {
		length = mr->cur_map_set->length;
		start = mr->cur_map_set[0]->addr;
	}

(in addition to my other review suggestions about 0-length, etc...)

Tom.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-04 16:40   ` Tom Talpey
@ 2022-01-05  1:43     ` lizhijian
  0 siblings, 0 replies; 49+ messages in thread
From: lizhijian @ 2022-01-05  1:43 UTC (permalink / raw)
  To: Tom Talpey, lizhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon
  Cc: linux-kernel, mbloch, liweihang, liangwenpeng, yangx.jy,
	rpearsonhpe, y-goto



On 05/01/2022 00:40, Tom Talpey wrote:
>
> On 12/28/2021 3:07 AM, Li Zhijian wrote:
>
>> +static enum resp_states process_flush(struct rxe_qp *qp,
>> +                       struct rxe_pkt_info *pkt)
>> +{
>> +    u64 length = 0, start = qp->resp.va;
>> +    u32 sel = feth_sel(pkt);
>> +    u32 plt = feth_plt(pkt);
>> +    struct rxe_mr *mr = qp->resp.mr;
>> +
>> +    if (sel == IB_EXT_SEL_MR_RANGE)
>> +        length = qp->resp.length;
>> +    else if (sel == IB_EXT_SEL_MR_WHOLE)
>> +        length = mr->cur_map_set->length;
>
> I noticed another issue in this. When the "Memory Region" flag is
> set, the RETH:VA field in the request is ignored, in addition to
> the RETH:DMALEN. Therefore, both the start and length must be set
> from the map.
Yes, that's true

Actually, I drafted a new version last week when you first pointed this out.
https://github.com/zhijianli88/linux/commit/a368d821b163f74836d527638625d414c7a62d00#diff-1b675c68d2f5f664abefea05cbf415f6307c9920b31a13bed41105c4a84e0259R641
the latest code is like:
     if (sel == IB_EXT_SEL_MR_RANGE) {
         start = qp->resp.va;
         length = qp->resp.length;
     } else { /* sel == IB_EXT_SEL_MR_WHOLE */
         start = mr->cur_map_set->iova;
         length = mr->cur_map_set->length;
     }


>
> Is the mr->cur_map_set[0]->addr field a virtual address, or a
> physical? 
mr->cur_map_set[0]->addr is a virtual address in kernel space.

nvdimm_flush_iova() accepts a user space address (alloc()'d or mmap()'d by
applications); this address is translated to a kernel virtual address by
lookup_iova(), which refers to mr->cur_map_set[n]->addr.

So I think we should use qp->resp.va and mr->cur_map_set->iova, which are both
user space addresses.



> I can't find the cur_map_set in any patch. 
cur_map_set will be filled in rxe_mr_init_user(),
rxe_mr_init_user()
  -> ib_umem_get()
    -> pin_user_pages_fast()
  -> update cur_map_set


BTW: some of your comments are already addressed in
https://github.com/zhijianli88/linux/tree/rdma-flush
and it may be force-updated irregularly.


Thanks
Zhijian

> The virtual
> (mapped) address will be needed, to pass to arch_wb_cache_pmem().
>
> Something like this?
>
>     if (sel == IB_EXT_SEL_MR_WHOLE) {
>         length = mr->cur_map_set->length;
>         start = mr->cur_map_set[0]->addr;
>     }
>
> (in addition to my other review suggestions about 0-length, etc...)
>
> Tom.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
  2021-12-28  8:07 ` [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem Li Zhijian
@ 2022-01-06  0:21   ` Jason Gunthorpe
  2022-01-06  6:12     ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2022-01-06  0:21 UTC (permalink / raw)
  To: Li Zhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liweihang, liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On Tue, Dec 28, 2021 at 04:07:08PM +0800, Li Zhijian wrote:
> We can use it to indicate whether the registering mr is associated with
> a pmem/nvdimm or not.
> 
> Currently, we only assign it in rxe driver, for other device/drivers,
> they should implement it if needed.
> 
> RDMA FLUSH will support the persistence feature for a pmem/nvdimm.
> 
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>  drivers/infiniband/sw/rxe/rxe_mr.c | 47 ++++++++++++++++++++++++++++++
>  include/rdma/ib_verbs.h            |  1 +
>  2 files changed, 48 insertions(+)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 7c4cd19a9db2..bcd5e7afa475 100644
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -162,6 +162,50 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
>  	mr->type = IB_MR_TYPE_DMA;
>  }
>  
> +// XXX: the logic is similar with mm/memory-failure.c
> +static bool page_in_dev_pagemap(struct page *page)
> +{
> +	unsigned long pfn;
> +	struct page *p;
> +	struct dev_pagemap *pgmap = NULL;
> +
> +	pfn = page_to_pfn(page);
> +	if (!pfn) {
> +		pr_err("no such pfn for page %p\n", page);
> +		return false;
> +	}
> +
> +	p = pfn_to_online_page(pfn);
> +	if (!p) {
> +		if (pfn_valid(pfn)) {
> +			pgmap = get_dev_pagemap(pfn, NULL);
> +			if (pgmap)
> +				put_dev_pagemap(pgmap);
> +		}
> +	}
> +
> +	return !!pgmap;

You need to get Dan to check this out, but I'm pretty sure this should
be more like this:

if (is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_FS_DAX)
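
Spelled out as a self-contained check (the helper name is made up, and it
assumes the page came from a pinned umem so page->pgmap is valid whenever
is_zone_device_page() is true; needs <linux/mm.h> and <linux/memremap.h>):

	/* Hypothetical sketch, not from the posted series: classify the page
	 * by its dev_pagemap type instead of probing get_dev_pagemap() by hand.
	 */
	static bool rxe_page_is_fsdax(struct page *page)
	{
		return is_zone_device_page(page) &&
		       page->pgmap->type == MEMORY_DEVICE_FS_DAX;
	}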


> +static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
> +{
> +	struct page *page = NULL;
> +	char *vaddr = iova_to_vaddr(mr, iova, length);
> +
> +	if (!vaddr) {
> +		pr_err("not a valid iova %llu\n", iova);
> +		return false;
> +	}
> +
> +	page = virt_to_page(vaddr);

And obviously this isn't uniform for the entire umem, so I don't even
know what this is supposed to mean.

> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 6e9ad656ecb7..822ebb3425dc 100644
> +++ b/include/rdma/ib_verbs.h
> @@ -1807,6 +1807,7 @@ struct ib_mr {
>  	unsigned int	   page_size;
>  	enum ib_mr_type	   type;
>  	bool		   need_inval;
> +	bool		   is_pmem;

Or why it is being stored in the global struct?

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only
  2021-12-28  8:07 ` [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only Li Zhijian
@ 2022-01-06  0:22   ` Jason Gunthorpe
  2022-01-06  6:20     ` lizhijian
  2022-01-13  6:43   ` lizhijian
  1 sibling, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2022-01-06  0:22 UTC (permalink / raw)
  To: Li Zhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liweihang, liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On Tue, Dec 28, 2021 at 04:07:10PM +0800, Li Zhijian wrote:
> Device should enable IB_DEVICE_RDMA_FLUSH capability if it want to
> support RDMA FLUSH.
> 
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> ---
>  include/rdma/ib_verbs.h | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index f04d66539879..51d58b641201 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -291,6 +291,7 @@ enum ib_device_cap_flags {
>  	/* The device supports padding incoming writes to cacheline. */
>  	IB_DEVICE_PCI_WRITE_END_PADDING		= (1ULL << 36),
>  	IB_DEVICE_ALLOW_USER_UNREG		= (1ULL << 37),
> +	IB_DEVICE_RDMA_FLUSH			= (1ULL << 38),
>  };
>  
>  enum ib_atomic_cap {
> @@ -4319,6 +4320,10 @@ static inline int ib_check_mr_access(struct ib_device *ib_dev,
>  	if (flags & IB_ACCESS_ON_DEMAND &&
>  	    !(ib_dev->attrs.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
>  		return -EINVAL;
> +
> +	if (flags & IB_ACCESS_FLUSH &&
> +	    !(ib_dev->attrs.device_cap_flags & IB_DEVICE_RDMA_FLUSH))
> +		return -EINVAL;
>  	return 0;
>  }

This needs to be combined with the previous patch

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 04/10] RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device
  2021-12-28  8:07 ` [RFC PATCH rdma-next 04/10] RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device Li Zhijian
@ 2022-01-06  0:22   ` Jason Gunthorpe
  2022-01-06  6:26     ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2022-01-06  0:22 UTC (permalink / raw)
  To: Li Zhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liweihang, liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On Tue, Dec 28, 2021 at 04:07:11PM +0800, Li Zhijian wrote:
> We will enable the RDMA FLUSH on rxe device later.

Then this is the wrong patch order

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2021-12-28  8:07 ` [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side Li Zhijian
  2021-12-30 22:18   ` Tom Talpey
  2022-01-04 16:40   ` Tom Talpey
@ 2022-01-06  0:28   ` Jason Gunthorpe
  2022-01-06  6:42     ` lizhijian
  2 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2022-01-06  0:28 UTC (permalink / raw)
  To: Li Zhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liweihang, liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On Tue, Dec 28, 2021 at 04:07:15PM +0800, Li Zhijian wrote:
> +	while (length > 0) {
> +		va	= (u8 *)(uintptr_t)buf->addr + offset;
> +		bytes	= buf->size - offset;
> +
> +		if (bytes > length)
> +			bytes = length;
> +
> +		arch_wb_cache_pmem(va, bytes);

So why did we need to check that the va was pmem to call this?

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2021-12-31  2:32       ` Tom Talpey
  2022-01-04  8:51         ` lizhijian
@ 2022-01-06  0:35         ` Jason Gunthorpe
  1 sibling, 0 replies; 49+ messages in thread
From: Jason Gunthorpe @ 2022-01-06  0:35 UTC (permalink / raw)
  To: Tom Talpey
  Cc: lizhijian, linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel,
	mbloch, liweihang, liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On Thu, Dec 30, 2021 at 09:32:06PM -0500, Tom Talpey wrote:

> The global visibility is oriented toward supporting distributed
> shared memory workloads, or publish/subscribe on high scale systems.
> It's mainly about ensuring that all devices and CPUs see the data,
> something ordinary RDMA Write does not guarantee. This often means
> flushing write pipelines, possibly followed by invalidating caches.

Isn't that what that new ATOMIC_WRITE does? Why do I need to flush if
ATOMIC WRITE was specified as a release? All I need to do is acquire
on the ATOMIC_WRITE site and I'm good?

So what does FLUSH do here, and how does a CPU 'acquire' against this
kind of flush? The flush doesn't imply any ordering right, so how is
it useful on its own?

The write to persistence aspect I understand, but this notion of
global visibility seems peculiar.

> Well, higher level wrappers may signal errors, for example if they're
> not available or unreliable, and you will need to handle them. Also,
> they may block. Is that ok in this context?

I'm not sure we have any other tools here beyond a release barrier
like wmb() or the special pmem cache flush. Neither are blocking or
can fail.

In pmem systems storage failure is handled via the memory failure
stuff asynchronously.

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
  2022-01-06  0:21   ` Jason Gunthorpe
@ 2022-01-06  6:12     ` lizhijian
  2022-01-14  8:10       ` Li, Zhijian
  2022-01-16 18:11       ` Dan Williams
  0 siblings, 2 replies; 49+ messages in thread
From: lizhijian @ 2022-01-06  6:12 UTC (permalink / raw)
  To: Jason Gunthorpe, lizhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto, dan.j.williams


Add Dan to the party :)

May I know whether there are any existing APIs to check whether
a va/page is backed by a nvdimm/pmem?



On 06/01/2022 08:21, Jason Gunthorpe wrote:
> On Tue, Dec 28, 2021 at 04:07:08PM +0800, Li Zhijian wrote:
>> We can use it to indicate whether the registering mr is associated with
>> a pmem/nvdimm or not.
>>
>> Currently, we only assign it in rxe driver, for other device/drivers,
>> they should implement it if needed.
>>
>> RDMA FLUSH will support the persistence feature for a pmem/nvdimm.
>>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>   drivers/infiniband/sw/rxe/rxe_mr.c | 47 ++++++++++++++++++++++++++++++
>>   include/rdma/ib_verbs.h            |  1 +
>>   2 files changed, 48 insertions(+)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
>> index 7c4cd19a9db2..bcd5e7afa475 100644
>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>> @@ -162,6 +162,50 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
>>   	mr->type = IB_MR_TYPE_DMA;
>>   }
>>   
>> +// XXX: the logic is similar with mm/memory-failure.c
>> +static bool page_in_dev_pagemap(struct page *page)
>> +{
>> +	unsigned long pfn;
>> +	struct page *p;
>> +	struct dev_pagemap *pgmap = NULL;
>> +
>> +	pfn = page_to_pfn(page);
>> +	if (!pfn) {
>> +		pr_err("no such pfn for page %p\n", page);
>> +		return false;
>> +	}
>> +
>> +	p = pfn_to_online_page(pfn);
>> +	if (!p) {
>> +		if (pfn_valid(pfn)) {
>> +			pgmap = get_dev_pagemap(pfn, NULL);
>> +			if (pgmap)
>> +				put_dev_pagemap(pgmap);
>> +		}
>> +	}
>> +
>> +	return !!pgmap;
> You need to get Dan to check this out, but I'm pretty sure this should
> be more like this:
>
> if (is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_FS_DAX)

Great, i have added him.



>
>
>> +static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
>> +{
>> +	struct page *page = NULL;
>> +	char *vaddr = iova_to_vaddr(mr, iova, length);
>> +
>> +	if (!vaddr) {
>> +		pr_err("not a valid iova %llu\n", iova);
>> +		return false;
>> +	}
>> +
>> +	page = virt_to_page(vaddr);
> And obviously this isn't uniform for the entire umem, so I don't even
> know what this is supposed to mean.

My intention is to check if a memory region belongs to a nvdimm/pmem.
The approach is like this:
iova(user space)-+                     +-> page -> page_in_dev_pagemap()
                  |                     |
                  +-> va(kernel space) -+
The current MR's va is associated with map_set, which records the relations
between iova, va and page. Do you mean we should walk map_set to get its
page, or is there another way?

  





>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 6e9ad656ecb7..822ebb3425dc 100644
>> +++ b/include/rdma/ib_verbs.h
>> @@ -1807,6 +1807,7 @@ struct ib_mr {
>>   	unsigned int	   page_size;
>>   	enum ib_mr_type	   type;
>>   	bool		   need_inval;
>> +	bool		   is_pmem;
> Or why it is being stored in the global struct?

Indeed, it's not strictly necessary, but I think is_pmem should belong to ib_mr
so that it can be checked by other generic code when needed, even though no
one does such checking so far.


Thanks
Zhijian



>
> Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only
  2022-01-06  0:22   ` Jason Gunthorpe
@ 2022-01-06  6:20     ` lizhijian
  0 siblings, 0 replies; 49+ messages in thread
From: lizhijian @ 2022-01-06  6:20 UTC (permalink / raw)
  To: Jason Gunthorpe, lizhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto



On 06/01/2022 08:22, Jason Gunthorpe wrote:
> On Tue, Dec 28, 2021 at 04:07:10PM +0800, Li Zhijian wrote:
>> Device should enable IB_DEVICE_RDMA_FLUSH capability if it want to
>> support RDMA FLUSH.
>>
>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>> ---
>>   include/rdma/ib_verbs.h | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index f04d66539879..51d58b641201 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -291,6 +291,7 @@ enum ib_device_cap_flags {
>>   	/* The device supports padding incoming writes to cacheline. */
>>   	IB_DEVICE_PCI_WRITE_END_PADDING		= (1ULL << 36),
>>   	IB_DEVICE_ALLOW_USER_UNREG		= (1ULL << 37),
>> +	IB_DEVICE_RDMA_FLUSH			= (1ULL << 38),
>>   };
>>   
>>   enum ib_atomic_cap {
>> @@ -4319,6 +4320,10 @@ static inline int ib_check_mr_access(struct ib_device *ib_dev,
>>   	if (flags & IB_ACCESS_ON_DEMAND &&
>>   	    !(ib_dev->attrs.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
>>   		return -EINVAL;
>> +
>> +	if (flags & IB_ACCESS_FLUSH &&
>> +	    !(ib_dev->attrs.device_cap_flags & IB_DEVICE_RDMA_FLUSH))
>> +		return -EINVAL;
>>   	return 0;
>>   }
> This needs to be combined with the previous patch
Makes sense.

Thanks
Zhijian

>
> Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 04/10] RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device
  2022-01-06  0:22   ` Jason Gunthorpe
@ 2022-01-06  6:26     ` lizhijian
  0 siblings, 0 replies; 49+ messages in thread
From: lizhijian @ 2022-01-06  6:26 UTC (permalink / raw)
  To: Jason Gunthorpe, lizhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto



On 06/01/2022 08:22, Jason Gunthorpe wrote:
> On Tue, Dec 28, 2021 at 04:07:11PM +0800, Li Zhijian wrote:
>> We will enable the RDMA FLUSH on rxe device later.
> Then this is the wrong patch order

Good catch. I will move it to the rear of this set.

Thanks
Zhijian
>
> Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-06  0:28   ` Jason Gunthorpe
@ 2022-01-06  6:42     ` lizhijian
  2022-01-06 17:33       ` Jason Gunthorpe
  0 siblings, 1 reply; 49+ messages in thread
From: lizhijian @ 2022-01-06  6:42 UTC (permalink / raw)
  To: Jason Gunthorpe, lizhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liweihang, liangwenpeng, yangx.jy, rpearsonhpe, y-goto



On 06/01/2022 08:28, Jason Gunthorpe wrote:
> On Tue, Dec 28, 2021 at 04:07:15PM +0800, Li Zhijian wrote:
>> +	while (length > 0) {
>> +		va	= (u8 *)(uintptr_t)buf->addr + offset;
>> +		bytes	= buf->size - offset;
>> +
>> +		if (bytes > length)
>> +			bytes = length;
>> +
>> +		arch_wb_cache_pmem(va, bytes);
> So why did we need to check that the va was pmem to call this?
Sorry, i didn't get you.

I didn't check whether va is pmem, since only an MR registered with PERSISTENCE (only
pmem can register this access flag) can reach here.

Thanks
Zhijian


>
> Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-06  6:42     ` lizhijian
@ 2022-01-06 17:33       ` Jason Gunthorpe
  2022-01-10  5:45         ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2022-01-06 17:33 UTC (permalink / raw)
  To: lizhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liweihang, liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On Thu, Jan 06, 2022 at 06:42:57AM +0000, lizhijian@fujitsu.com wrote:
> 
> 
> On 06/01/2022 08:28, Jason Gunthorpe wrote:
> > On Tue, Dec 28, 2021 at 04:07:15PM +0800, Li Zhijian wrote:
> >> +	while (length > 0) {
> >> +		va	= (u8 *)(uintptr_t)buf->addr + offset;
> >> +		bytes	= buf->size - offset;
> >> +
> >> +		if (bytes > length)
> >> +			bytes = length;
> >> +
> >> +		arch_wb_cache_pmem(va, bytes);
> > So why did we need to check that the va was pmem to call this?
> Sorry, i didn't get you.
> 
> I didn't check whether va is pmem, since only MR registered with PERSISTENCE(only pmem can
> register this access flag) can reach here.

Yes, that is what I mean, why did we need to check anything to call
this API - it should work on any CPU mapped address.

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-06 17:33       ` Jason Gunthorpe
@ 2022-01-10  5:45         ` lizhijian
  2022-01-10 14:34           ` Jason Gunthorpe
  0 siblings, 1 reply; 49+ messages in thread
From: lizhijian @ 2022-01-10  5:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto

Hi Jason


On 07/01/2022 01:33, Jason Gunthorpe wrote:
> On Thu, Jan 06, 2022 at 06:42:57AM +0000, lizhijian@fujitsu.com wrote:
>>
>> On 06/01/2022 08:28, Jason Gunthorpe wrote:
>>> On Tue, Dec 28, 2021 at 04:07:15PM +0800, Li Zhijian wrote:
>>>> +	while (length > 0) {
>>>> +		va	= (u8 *)(uintptr_t)buf->addr + offset;
>>>> +		bytes	= buf->size - offset;
>>>> +
>>>> +		if (bytes > length)
>>>> +			bytes = length;
>>>> +
>>>> +		arch_wb_cache_pmem(va, bytes);
>>> So why did we need to check that the va was pmem to call this?
>> Sorry, i didn't get you.
>>
>> I didn't check whether va is pmem, since only MR registered with PERSISTENCE(only pmem can
>> register this access flag) can reach here.
> Yes, that is what I mean,

I'm not sure I understand the *check* you mentioned above.

Current code just does something like:

if (!sanity_check())
     return;
if (requested_plt == PERSISTENCE) {
     va = iova_to_va(iova);
     arch_wb_cache_pmem(va, bytes);
     wmb();
} else if (requested_plt == GLOBAL_VISIBILITY) {
     wmb();
}


> why did we need to check anything to call
> this API
As the pseudo code above shows, it doesn't *check* anything in the sense you mean, I think.


> - it should work on any CPU mapped address.
Of course, arch_wb_cache_pmem(va, bytes) works on a CPU mapped address backed by either
a DIMM or an NVDIMM, but not on an iova, which may refer to a user space address.


Thanks
Zhijian



>
> Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-10  5:45         ` lizhijian
@ 2022-01-10 14:34           ` Jason Gunthorpe
  2022-01-11  5:34             ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2022-01-10 14:34 UTC (permalink / raw)
  To: lizhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On Mon, Jan 10, 2022 at 05:45:47AM +0000, lizhijian@fujitsu.com wrote:
> Hi Jason
> 
> 
> On 07/01/2022 01:33, Jason Gunthorpe wrote:
> > On Thu, Jan 06, 2022 at 06:42:57AM +0000, lizhijian@fujitsu.com wrote:
> >>
> >> On 06/01/2022 08:28, Jason Gunthorpe wrote:
> >>> On Tue, Dec 28, 2021 at 04:07:15PM +0800, Li Zhijian wrote:
> >>>> +	while (length > 0) {
> >>>> +		va	= (u8 *)(uintptr_t)buf->addr + offset;
> >>>> +		bytes	= buf->size - offset;
> >>>> +
> >>>> +		if (bytes > length)
> >>>> +			bytes = length;
> >>>> +
> >>>> +		arch_wb_cache_pmem(va, bytes);
> >>> So why did we need to check that the va was pmem to call this?
> >> Sorry, i didn't get you.
> >>
> >> I didn't check whether va is pmem, since only MR registered with PERSISTENCE(only pmem can
> >> register this access flag) can reach here.
> > Yes, that is what I mean,
> 
> I'm not sure I understand the *check* you mentioned above.
> 
> Current code just dose something like:
> 
> if (!sanity_check())
>      return;
> if (requested_plt == PERSISTENCE)
>      va = iova_to_va(iova);
>      arch_wb_cache_pmem(va, bytes);
>      wmb;
> else if (requested_plt == GLOBAL_VISIBILITY)
>      wmb();
> 
> 
> > why did we need to check anything to call
> > this API
> As above pseudo code,  it didn't *check* anything as what you said i think.

I mean when you created the MR in the first place you checked for pmem
before even allowing the persistent access flag.

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-10 14:34           ` Jason Gunthorpe
@ 2022-01-11  5:34             ` lizhijian
  2022-01-11 20:48               ` Jason Gunthorpe
  0 siblings, 1 reply; 49+ messages in thread
From: lizhijian @ 2022-01-11  5:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto



On 10/01/2022 22:34, Jason Gunthorpe wrote:
> On Mon, Jan 10, 2022 at 05:45:47AM +0000, lizhijian@fujitsu.com wrote:
>> Hi Jason
>>
>>
>> On 07/01/2022 01:33, Jason Gunthorpe wrote:
>>> On Thu, Jan 06, 2022 at 06:42:57AM +0000, lizhijian@fujitsu.com wrote:
>>>> On 06/01/2022 08:28, Jason Gunthorpe wrote:
>>>>> On Tue, Dec 28, 2021 at 04:07:15PM +0800, Li Zhijian wrote:
>>>>>> +	while (length > 0) {
>>>>>> +		va	= (u8 *)(uintptr_t)buf->addr + offset;
>>>>>> +		bytes	= buf->size - offset;
>>>>>> +
>>>>>> +		if (bytes > length)
>>>>>> +			bytes = length;
>>>>>> +
>>>>>> +		arch_wb_cache_pmem(va, bytes);
>>>>> So why did we need to check that the va was pmem to call this?
>>>> Sorry, i didn't get you.
>>>>
>>>> I didn't check whether va is pmem, since only MR registered with PERSISTENCE(only pmem can
>>>> register this access flag) can reach here.
>>> Yes, that is what I mean,
>> I'm not sure I understand the *check* you mentioned above.
>>
>> Current code just dose something like:
>>
>> if (!sanity_check())
>>       return;
>> if (requested_plt == PERSISTENCE)
>>       va = iova_to_va(iova);
>>       arch_wb_cache_pmem(va, bytes);
>>       wmb;
>> else if (requested_plt == GLOBAL_VISIBILITY)
>>       wmb();
>>
>>
>>> why did we need to check anything to call
>>> this API
>> As above pseudo code,  it didn't *check* anything as what you said i think.
> I mean when you created the MR in the first place you checked for pmem
> before even allowing the persitent access flag.

Yes, that's true; that's because only pmem has the ability to persist data.
So do you mean we don't need to prevent users from registering the persistent
access flag on a non-pmem MR? It would be a bit confusing if so.


>
> Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-11  5:34             ` lizhijian
@ 2022-01-11 20:48               ` Jason Gunthorpe
  2022-01-12  9:50                 ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2022-01-11 20:48 UTC (permalink / raw)
  To: lizhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On Tue, Jan 11, 2022 at 05:34:36AM +0000, lizhijian@fujitsu.com wrote:

> Yes, that's true. that's because only pmem has ability to persist data.
> So do you mean we don't need to prevent user to create/register a persistent
> access flag to a non-pmem MR? it would be a bit confusing if so.

Since these extensions seem to have a mode that is unrelated to
persistent memory, I'm not sure it makes sense to link the two things.

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-11 20:48               ` Jason Gunthorpe
@ 2022-01-12  9:50                 ` lizhijian
  2022-01-12 13:12                   ` Jason Gunthorpe
  0 siblings, 1 reply; 49+ messages in thread
From: lizhijian @ 2022-01-12  9:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto



On 12/01/2022 04:48, Jason Gunthorpe wrote:
> On Tue, Jan 11, 2022 at 05:34:36AM +0000, lizhijian@fujitsu.com wrote:
>
>> Yes, that's true. that's because only pmem has ability to persist data.
>> So do you mean we don't need to prevent user to create/register a persistent
>> access flag to a non-pmem MR? it would be a bit confusing if so.
> Since these extensions seem to have a mode that is unrelated to
> persistent memory,
I can only agree with that in part, since the extensions also say:

oA19-1: Responder shall successfully respond on FLUSH operation only
after providing the placement guarantees, as specified in the packet, of
preceding memory updates (for example: RDMA WRITE, Atomics and
ATOMIC WRITE) towards the memory region.

It says the responder *shall successfully respond on FLUSH operation only
after providing the placement guarantees*. If users request a persistent
placement to a non-pmem MR and get no error, then from the users' point of
view their request was *successfully responded* to, which doesn't reflect
the truth (that the data was persisted).

So I think we should respond with an error to the requesting side in such a case.


Furthermore, if we have a check during MR creation/registration, the user who
registers the MR can know whether the target MR supports the persistent access flag.
They can then pass this information to the requesting side so that it can request
a valid placement type later; that is similar to the current librpma behavior.


Thanks
Zhijian

>   I'm not sure it makes sense to link the two things.
>
> Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-12  9:50                 ` lizhijian
@ 2022-01-12 13:12                   ` Jason Gunthorpe
  2022-01-13  6:29                     ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Jason Gunthorpe @ 2022-01-12 13:12 UTC (permalink / raw)
  To: lizhijian
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto

On Wed, Jan 12, 2022 at 09:50:38AM +0000, lizhijian@fujitsu.com wrote:
> 
> 
> On 12/01/2022 04:48, Jason Gunthorpe wrote:
> > On Tue, Jan 11, 2022 at 05:34:36AM +0000, lizhijian@fujitsu.com wrote:
> >
> >> Yes, that's true. that's because only pmem has ability to persist data.
> >> So do you mean we don't need to prevent user to create/register a persistent
> >> access flag to a non-pmem MR? it would be a bit confusing if so.
> > Since these extensions seem to have a mode that is unrelated to
> > persistent memory,
> I can only agree with part of them, since the extensions also say that:
> 
> oA19-1: Responder shall successfully respond on FLUSH operation only
> after providing the placement guarantees, as specified in the packet, of
> preceding memory updates (for example: RDMA WRITE, Atomics and
> ATOMIC WRITE) towards the memory region.
> 
> it mentions *shall successfully respond on FLUSH operation only
> after providing the placement guarantees*. If users request a
> persistent placement to a non-pmem MR without errors,  from view
> of the users, they will think of their request had been *successfully responded*
> that doesn't reflect the true(data was persisted).

The "placement guarentees" should obviously be variable depending on
the type of memory being targeted.

> Further more, If we have a checking during the new MR creating/registering,
> user who registers this MR can know if the target MR supports persistent access flag.
> Then they can tell this information to the request side so that request side can
> request a valid placement type later. that is similar behavior with current librpma.

Then you can't use ATOMIC_WRITE with non-nvdimm memory, which is
nonsense

Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side
  2022-01-12 13:12                   ` Jason Gunthorpe
@ 2022-01-13  6:29                     ` lizhijian
  0 siblings, 0 replies; 49+ messages in thread
From: lizhijian @ 2022-01-13  6:29 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto



On 12/01/2022 21:12, Jason Gunthorpe wrote:
> On Wed, Jan 12, 2022 at 09:50:38AM +0000, lizhijian@fujitsu.com wrote:
>>
>> On 12/01/2022 04:48, Jason Gunthorpe wrote:
>>> On Tue, Jan 11, 2022 at 05:34:36AM +0000, lizhijian@fujitsu.com wrote:
>>>
>>>> Yes, that's true. that's because only pmem has ability to persist data.
>>>> So do you mean we don't need to prevent user to create/register a persistent
>>>> access flag to a non-pmem MR? it would be a bit confusing if so.
>>> Since these extensions seem to have a mode that is unrelated to
>>> persistent memory,
>> I can only agree with part of them, since the extensions also say that:
>>
>> oA19-1: Responder shall successfully respond on FLUSH operation only
>> after providing the placement guarantees, as specified in the packet, of
>> preceding memory updates (for example: RDMA WRITE, Atomics and
>> ATOMIC WRITE) towards the memory region.
>>
>> it mentions *shall successfully respond on FLUSH operation only
>> after providing the placement guarantees*. If users request a
>> persistent placement to a non-pmem MR without errors,  from view
>> of the users, they will think of their request had been *successfully responded*
>> that doesn't reflect the true(data was persisted).
> The "placement guarentees" should obviously be variable depending on
> the type of memory being targeted.

> Since these extensions seem to have a mode that is unrelated to
> persistent memory,


Although the SPEC doesn't link the extensions to persistent memory directly,
*Persistence* is linked to a pmem-like indicated memory region. See below:

A19.2 G LOSSARY

Global Visibility
   Ensuring the placement of the preceding data accesses in the indicated
   memory region is visible for reading by the responder platform.
Persistence
   Ensuring the placement of the preceding data accesses in the indicated
   memory region is persistent and the data will be preserved across power
   cycle or other failure of the responder platform.

Thanks
Zhijian


>
>> Further more, If we have a checking during the new MR creating/registering,
>> user who registers this MR can know if the target MR supports persistent access flag.
>> Then they can tell this information to the request side so that request side can
>> request a valid placement type later. that is similar behavior with current librpma.
> Then you can't use ATOMIC_WRITE with non-nvdimm memory, which is
> nonsense
>
> Jason

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only
  2021-12-28  8:07 ` [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only Li Zhijian
  2022-01-06  0:22   ` Jason Gunthorpe
@ 2022-01-13  6:43   ` lizhijian
  1 sibling, 0 replies; 49+ messages in thread
From: lizhijian @ 2022-01-13  6:43 UTC (permalink / raw)
  To: lizhijian, linux-rdma, zyjzyj2000, jgg, aharonl, leon,
	Tom Talpey, Gromadzki, Tomasz
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, rpearsonhpe, y-goto

Hey Tom, Tomasz, all

I recall that the SPEC says:
> A19.4.3.1 HCA RESOURCES
> This Annex introduces the following new HCA attributes:
> • Ability to support Memory Placement Extensions
> a) Ability to support FLUSH
> i) Ability to support FLUSH with PLT Global Visibility
> ii) Ability to support FLUSH with PLT Persistence

Do you have any idea whether an HCA can support just part of the FLUSH capabilities?
For example, HCA_foo supports only PLT Global Visibility, with no PLT Persistence support.

Thanks
Zhijian



On 28/12/2021 16:07, Li Zhijian wrote:
> Device should enable IB_DEVICE_RDMA_FLUSH capability if it want to
> support RDMA FLUSH.
>
> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> ---
>   include/rdma/ib_verbs.h | 5 +++++
>   1 file changed, 5 insertions(+)
>
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index f04d66539879..51d58b641201 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -291,6 +291,7 @@ enum ib_device_cap_flags {
>   	/* The device supports padding incoming writes to cacheline. */
>   	IB_DEVICE_PCI_WRITE_END_PADDING		= (1ULL << 36),
>   	IB_DEVICE_ALLOW_USER_UNREG		= (1ULL << 37),
> +	IB_DEVICE_RDMA_FLUSH			= (1ULL << 38),
>   };
>   
>   enum ib_atomic_cap {
> @@ -4319,6 +4320,10 @@ static inline int ib_check_mr_access(struct ib_device *ib_dev,
>   	if (flags & IB_ACCESS_ON_DEMAND &&
>   	    !(ib_dev->attrs.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
>   		return -EINVAL;
> +
> +	if (flags & IB_ACCESS_FLUSH &&
> +	    !(ib_dev->attrs.device_cap_flags & IB_DEVICE_RDMA_FLUSH))
> +		return -EINVAL;
>   	return 0;
>   }
>   

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
  2022-01-06  6:12     ` lizhijian
@ 2022-01-14  8:10       ` Li, Zhijian
  2022-01-27 22:30         ` Jeff Moyer
  2022-01-16 18:11       ` Dan Williams
  1 sibling, 1 reply; 49+ messages in thread
From: Li, Zhijian @ 2022-01-14  8:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, zyjzyj2000, aharonl, leon, linux-kernel, mbloch,
	liangwenpeng, yangx.jy, rpearsonhpe, y-goto, dan.j.williams,
	nvdimm

Copied to nvdimm list

Thanks

Zhijian


on 2022/1/6 14:12, Li Zhijian wrote:
>
> Add Dan to the party :)
>
> May i know whether there is any existing APIs to check whether
> a va/page backs to a nvdimm/pmem ?
>
>
>
> On 06/01/2022 08:21, Jason Gunthorpe wrote:
>> On Tue, Dec 28, 2021 at 04:07:08PM +0800, Li Zhijian wrote:
>>> We can use it to indicate whether the registering mr is associated with
>>> a pmem/nvdimm or not.
>>>
>>> Currently, we only assign it in rxe driver, for other device/drivers,
>>> they should implement it if needed.
>>>
>>> RDMA FLUSH will support the persistence feature for a pmem/nvdimm.
>>>
>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>>   drivers/infiniband/sw/rxe/rxe_mr.c | 47 
>>> ++++++++++++++++++++++++++++++
>>>   include/rdma/ib_verbs.h            |  1 +
>>>   2 files changed, 48 insertions(+)
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c 
>>> b/drivers/infiniband/sw/rxe/rxe_mr.c
>>> index 7c4cd19a9db2..bcd5e7afa475 100644
>>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>>> @@ -162,6 +162,50 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int 
>>> access, struct rxe_mr *mr)
>>>       mr->type = IB_MR_TYPE_DMA;
>>>   }
>>>   +// XXX: the logic is similar with mm/memory-failure.c
>>> +static bool page_in_dev_pagemap(struct page *page)
>>> +{
>>> +    unsigned long pfn;
>>> +    struct page *p;
>>> +    struct dev_pagemap *pgmap = NULL;
>>> +
>>> +    pfn = page_to_pfn(page);
>>> +    if (!pfn) {
>>> +        pr_err("no such pfn for page %p\n", page);
>>> +        return false;
>>> +    }
>>> +
>>> +    p = pfn_to_online_page(pfn);
>>> +    if (!p) {
>>> +        if (pfn_valid(pfn)) {
>>> +            pgmap = get_dev_pagemap(pfn, NULL);
>>> +            if (pgmap)
>>> +                put_dev_pagemap(pgmap);
>>> +        }
>>> +    }
>>> +
>>> +    return !!pgmap;
>> You need to get Dan to check this out, but I'm pretty sure this should
>> be more like this:
>>
>> if (is_zone_device_page(page) && page->pgmap->type == 
>> MEMORY_DEVICE_FS_DAX)
>
> Great, i have added him.
>
>
>
>>
>>
>>> +static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
>>> +{
>>> +    struct page *page = NULL;
>>> +    char *vaddr = iova_to_vaddr(mr, iova, length);
>>> +
>>> +    if (!vaddr) {
>>> +        pr_err("not a valid iova %llu\n", iova);
>>> +        return false;
>>> +    }
>>> +
>>> +    page = virt_to_page(vaddr);
>> And obviously this isn't uniform for the entire umem, so I don't even
>> know what this is supposed to mean.
>
> My intention is to check if a memory region belongs to a nvdimm/pmem.
> The approach is like that:
> iova(user space)-+                     +-> page -> page_in_dev_pagemap()
>                  |                     |
>                  +-> va(kernel space) -+
> Since current MR's va is associated with map_set where it record the 
> relations
> between iova and va and page. Do do you mean we should travel map_set to
> get its page ? or by any other ways.
>
>
>
>
>
>
>
>>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>>> index 6e9ad656ecb7..822ebb3425dc 100644
>>> +++ b/include/rdma/ib_verbs.h
>>> @@ -1807,6 +1807,7 @@ struct ib_mr {
>>>       unsigned int       page_size;
>>>       enum ib_mr_type       type;
>>>       bool           need_inval;
>>> +    bool           is_pmem;
>> Or why it is being stored in the global struct?
>
> Indeed, it's not strong necessary. but i think is_pmem should belongs 
> to a ib_mr
> so that it can be checked by other general code when they needed even 
> though no
> one does such checking so far.
>
>
> Thanks
> Zhijian
>
>
>
>>
>> Jason
>



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
  2022-01-06  6:12     ` lizhijian
  2022-01-14  8:10       ` Li, Zhijian
@ 2022-01-16 18:11       ` Dan Williams
  2022-01-18  8:55         ` lizhijian
  1 sibling, 1 reply; 49+ messages in thread
From: Dan Williams @ 2022-01-16 18:11 UTC (permalink / raw)
  To: lizhijian
  Cc: Jason Gunthorpe, linux-rdma, zyjzyj2000, aharonl, leon,
	linux-kernel, mbloch, liangwenpeng, yangx.jy, rpearsonhpe,
	y-goto

On Wed, Jan 5, 2022 at 10:13 PM lizhijian@fujitsu.com
<lizhijian@fujitsu.com> wrote:
>
>
> Add Dan to the party :)
>
> May i know whether there is any existing APIs to check whether
> a va/page backs to a nvdimm/pmem ?
>
>
>
> On 06/01/2022 08:21, Jason Gunthorpe wrote:
> > On Tue, Dec 28, 2021 at 04:07:08PM +0800, Li Zhijian wrote:
> >> We can use it to indicate whether the registering mr is associated with
> >> a pmem/nvdimm or not.
> >>
> >> Currently, we only assign it in rxe driver, for other device/drivers,
> >> they should implement it if needed.
> >>
> >> RDMA FLUSH will support the persistence feature for a pmem/nvdimm.
> >>
> >> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> >>   drivers/infiniband/sw/rxe/rxe_mr.c | 47 ++++++++++++++++++++++++++++++
> >>   include/rdma/ib_verbs.h            |  1 +
> >>   2 files changed, 48 insertions(+)
> >>
> >> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> >> index 7c4cd19a9db2..bcd5e7afa475 100644
> >> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> >> @@ -162,6 +162,50 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
> >>      mr->type = IB_MR_TYPE_DMA;
> >>   }
> >>
> >> +// XXX: the logic is similar with mm/memory-failure.c
> >> +static bool page_in_dev_pagemap(struct page *page)
> >> +{
> >> +    unsigned long pfn;
> >> +    struct page *p;
> >> +    struct dev_pagemap *pgmap = NULL;
> >> +
> >> +    pfn = page_to_pfn(page);
> >> +    if (!pfn) {
> >> +            pr_err("no such pfn for page %p\n", page);
> >> +            return false;
> >> +    }
> >> +
> >> +    p = pfn_to_online_page(pfn);
> >> +    if (!p) {
> >> +            if (pfn_valid(pfn)) {
> >> +                    pgmap = get_dev_pagemap(pfn, NULL);
> >> +                    if (pgmap)
> >> +                            put_dev_pagemap(pgmap);
> >> +            }
> >> +    }
> >> +
> >> +    return !!pgmap;
> > You need to get Dan to check this out, but I'm pretty sure this should
> > be more like this:
> >
> > if (is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_FS_DAX)
>
> Great, i have added him.
>
>
>
> >
> >
> >> +static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
> >> +{
> >> +    struct page *page = NULL;
> >> +    char *vaddr = iova_to_vaddr(mr, iova, length);
> >> +
> >> +    if (!vaddr) {
> >> +            pr_err("not a valid iova %llu\n", iova);
> >> +            return false;
> >> +    }
> >> +
> >> +    page = virt_to_page(vaddr);
> > And obviously this isn't uniform for the entire umem, so I don't even
> > know what this is supposed to mean.
>
> My intention is to check if a memory region belongs to a nvdimm/pmem.
> The approach is like that:
> iova(user space)-+                     +-> page -> page_in_dev_pagemap()
>                   |                     |
>                   +-> va(kernel space) -+
> Since current MR's va is associated with map_set where it record the relations
> between iova and va and page. Do do you mean we should travel map_set to
> get its page ? or by any other ways.

Apologies for the delay in responding.

The Subject line of this patch is confusing. If you want to know if a
pfn is in persistent memory, the only mechanism for that is:

region_intersects(addr, length, IORESOURCE_MEM, IORES_DESC_PERSISTENT_MEMORY)

...there is otherwise nothing pmem specific about the dev_pagemap
infrastructure. Yes, pmem is the primary user, but it is also used for
mapping "soft-reserved" memory (see the EFI_MEMORY_SP attribute), and by
other users.

Can you clarify the intent? I am missing some context.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
  2022-01-16 18:11       ` Dan Williams
@ 2022-01-18  8:55         ` lizhijian
  2022-01-18 15:28           ` Dan Williams
  0 siblings, 1 reply; 49+ messages in thread
From: lizhijian @ 2022-01-18  8:55 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, linux-rdma, zyjzyj2000, aharonl, leon,
	linux-kernel, mbloch, liangwenpeng, yangx.jy, rpearsonhpe,
	y-goto



On 17/01/2022 02:11, Dan Williams wrote:
> On Wed, Jan 5, 2022 at 10:13 PM lizhijian@fujitsu.com
> <lizhijian@fujitsu.com> wrote:
>>
>> Add Dan to the party :)
>>
>> May i know whether there is any existing APIs to check whether
>> a va/page backs to a nvdimm/pmem ?
>>
>>
>>
>> On 06/01/2022 08:21, Jason Gunthorpe wrote:
>>> On Tue, Dec 28, 2021 at 04:07:08PM +0800, Li Zhijian wrote:
>>>> We can use it to indicate whether the registering mr is associated with
>>>> a pmem/nvdimm or not.
>>>>
>>>> Currently, we only assign it in rxe driver, for other device/drivers,
>>>> they should implement it if needed.
>>>>
>>>> RDMA FLUSH will support the persistence feature for a pmem/nvdimm.
>>>>
>>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>>>    drivers/infiniband/sw/rxe/rxe_mr.c | 47 ++++++++++++++++++++++++++++++
>>>>    include/rdma/ib_verbs.h            |  1 +
>>>>    2 files changed, 48 insertions(+)
>>>>
>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
>>>> index 7c4cd19a9db2..bcd5e7afa475 100644
>>>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>>>> @@ -162,6 +162,50 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
>>>>       mr->type = IB_MR_TYPE_DMA;
>>>>    }
>>>>
>>>> +// XXX: the logic is similar with mm/memory-failure.c
>>>> +static bool page_in_dev_pagemap(struct page *page)
>>>> +{
>>>> +    unsigned long pfn;
>>>> +    struct page *p;
>>>> +    struct dev_pagemap *pgmap = NULL;
>>>> +
>>>> +    pfn = page_to_pfn(page);
>>>> +    if (!pfn) {
>>>> +            pr_err("no such pfn for page %p\n", page);
>>>> +            return false;
>>>> +    }
>>>> +
>>>> +    p = pfn_to_online_page(pfn);
>>>> +    if (!p) {
>>>> +            if (pfn_valid(pfn)) {
>>>> +                    pgmap = get_dev_pagemap(pfn, NULL);
>>>> +                    if (pgmap)
>>>> +                            put_dev_pagemap(pgmap);
>>>> +            }
>>>> +    }
>>>> +
>>>> +    return !!pgmap;
>>> You need to get Dan to check this out, but I'm pretty sure this should
>>> be more like this:
>>>
>>> if (is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_FS_DAX)
>> Great, i have added him.
>>
>>
>>
>>>
>>>> +static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
>>>> +{
>>>> +    struct page *page = NULL;
>>>> +    char *vaddr = iova_to_vaddr(mr, iova, length);
>>>> +
>>>> +    if (!vaddr) {
>>>> +            pr_err("not a valid iova %llu\n", iova);
>>>> +            return false;
>>>> +    }
>>>> +
>>>> +    page = virt_to_page(vaddr);
>>> And obviously this isn't uniform for the entire umem, so I don't even
>>> know what this is supposed to mean.
>> My intention is to check if a memory region belongs to a nvdimm/pmem.
>> The approach is like that:
>> iova(user space)-+                     +-> page -> page_in_dev_pagemap()
>>                    |                     |
>>                    +-> va(kernel space) -+
>> Since current MR's va is associated with map_set where it record the relations
>> between iova and va and page. Do do you mean we should travel map_set to
>> get its page ? or by any other ways.
> Apologies for the delay in responding.
>
> The Subject line of this patch is confusing, if you want to know if a
> pfn is in persistent memory the only mechanism for that is:
>
> region_intersects(addr, length, IORESOURCE_MEM, IORES_DESC_PERSISTENT_MEMORY)
>
> ...there is otherwise nothing pmem specific about the dev_pagemap
> infrastructure. Yes, pmem is the primary user, but it is also used for
> mapping "soft-reserved" memory (See: the EFI_MEMORY_SP) attribute, and
> other users.
>
> Can you clarify the intent? I am missing some context.

Thanks for your help, @Dan.

I'm going to implement a new ib verb called RDMA FLUSH; it will support the global
visibility and persistence placement types.

In my first design, only pmem can support the persistence placement type, so I need
to introduce a new is_pmem attribute on the RDMA memory region (which is associated
with a user space iova) so that I can reject a persistence placement type targeting
DRAM (non-pmem).
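
For illustration, a minimal sketch of that registration-time gate
(rxe_mr_range_is_pmem() is a made-up helper name, and it assumes the single
IB_ACCESS_FLUSH access flag from this series) could be:

	/* Hypothetical sketch, not from the posted series: refuse a FLUSH
	 * registration when the MR is not backed by persistent memory.
	 */
	static int rxe_check_flush_access(struct rxe_mr *mr, int access)
	{
		if ((access & IB_ACCESS_FLUSH) && !rxe_mr_range_is_pmem(mr))
			return -EINVAL;
		return 0;
	}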

Thanks
Zhijian

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
  2022-01-18  8:55         ` lizhijian
@ 2022-01-18 15:28           ` Dan Williams
  2022-01-19  2:01             ` lizhijian
  0 siblings, 1 reply; 49+ messages in thread
From: Dan Williams @ 2022-01-18 15:28 UTC (permalink / raw)
  To: lizhijian
  Cc: Jason Gunthorpe, linux-rdma, zyjzyj2000, aharonl, leon,
	linux-kernel, mbloch, liangwenpeng, yangx.jy, rpearsonhpe,
	y-goto

On Tue, Jan 18, 2022 at 12:55 AM lizhijian@fujitsu.com
<lizhijian@fujitsu.com> wrote:
>
>
>
> On 17/01/2022 02:11, Dan Williams wrote:
> > On Wed, Jan 5, 2022 at 10:13 PM lizhijian@fujitsu.com
> > <lizhijian@fujitsu.com> wrote:
> >>
> >> Add Dan to the party :)
> >>
> >> May i know whether there is any existing APIs to check whether
> >> a va/page backs to a nvdimm/pmem ?
> >>
> >>
> >>
> >> On 06/01/2022 08:21, Jason Gunthorpe wrote:
> >>> On Tue, Dec 28, 2021 at 04:07:08PM +0800, Li Zhijian wrote:
> >>>> We can use it to indicate whether the registering mr is associated with
> >>>> a pmem/nvdimm or not.
> >>>>
> >>>> Currently, we only assign it in rxe driver, for other device/drivers,
> >>>> they should implement it if needed.
> >>>>
> >>>> RDMA FLUSH will support the persistence feature for a pmem/nvdimm.
> >>>>
> >>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
> >>>>    drivers/infiniband/sw/rxe/rxe_mr.c | 47 ++++++++++++++++++++++++++++++
> >>>>    include/rdma/ib_verbs.h            |  1 +
> >>>>    2 files changed, 48 insertions(+)
> >>>>
> >>>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> >>>> index 7c4cd19a9db2..bcd5e7afa475 100644
> >>>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> >>>> @@ -162,6 +162,50 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
> >>>>       mr->type = IB_MR_TYPE_DMA;
> >>>>    }
> >>>>
> >>>> +// XXX: the logic is similar with mm/memory-failure.c
> >>>> +static bool page_in_dev_pagemap(struct page *page)
> >>>> +{
> >>>> +    unsigned long pfn;
> >>>> +    struct page *p;
> >>>> +    struct dev_pagemap *pgmap = NULL;
> >>>> +
> >>>> +    pfn = page_to_pfn(page);
> >>>> +    if (!pfn) {
> >>>> +            pr_err("no such pfn for page %p\n", page);
> >>>> +            return false;
> >>>> +    }
> >>>> +
> >>>> +    p = pfn_to_online_page(pfn);
> >>>> +    if (!p) {
> >>>> +            if (pfn_valid(pfn)) {
> >>>> +                    pgmap = get_dev_pagemap(pfn, NULL);
> >>>> +                    if (pgmap)
> >>>> +                            put_dev_pagemap(pgmap);
> >>>> +            }
> >>>> +    }
> >>>> +
> >>>> +    return !!pgmap;
> >>> You need to get Dan to check this out, but I'm pretty sure this should
> >>> be more like this:
> >>>
> >>> if (is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_FS_DAX)
> >> Great, i have added him.
> >>
> >>
> >>
> >>>
> >>>> +static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
> >>>> +{
> >>>> +    struct page *page = NULL;
> >>>> +    char *vaddr = iova_to_vaddr(mr, iova, length);
> >>>> +
> >>>> +    if (!vaddr) {
> >>>> +            pr_err("not a valid iova %llu\n", iova);
> >>>> +            return false;
> >>>> +    }
> >>>> +
> >>>> +    page = virt_to_page(vaddr);
> >>> And obviously this isn't uniform for the entire umem, so I don't even
> >>> know what this is supposed to mean.
> >> My intention is to check if a memory region belongs to a nvdimm/pmem.
> >> The approach is like that:
> >> iova(user space)-+                     +-> page -> page_in_dev_pagemap()
> >>                    |                     |
> >>                    +-> va(kernel space) -+
> >> Since current MR's va is associated with map_set where it record the relations
> >> between iova and va and page. Do do you mean we should travel map_set to
> >> get its page ? or by any other ways.
> > Apologies for the delay in responding.
> >
> > The Subject line of this patch is confusing, if you want to know if a
> > pfn is in persistent memory the only mechanism for that is:
> >
> > region_intersects(addr, length, IORESOURCE_MEM, IORES_DESC_PERSISTENT_MEMORY)
> >
> > ...there is otherwise nothing pmem specific about the dev_pagemap
> > infrastructure. Yes, pmem is the primary user, but it is also used for
> > mapping "soft-reserved" memory (See: the EFI_MEMORY_SP) attribute, and
> > other users.
> >
> > Can you clarify the intent? I am missing some context.
>
> thanks for your help @Dan
>
> I'm going to implement a new ibvers called RDMA FLUSH, it will support global visibility
> and persistence placement type.
>
> In my first design, only pmem can support persistence placement type, so i need to introduce
> new attribute is_pmem to RDMA memory region where it associates to a user space address iova
> so that i can reject a persistence placement type to DRAM(non-pmem).

Ok, I think for that case you are better served using the
IORES_DESC_PERSISTENT_MEMORY designation as the gate for attempting
the flush. The dev_pagemap is otherwise only indicating the presence
of device-backed memory pages, not persistence.
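
For illustration, a minimal sketch of such a gate over a single page backing
the MR, assuming the page has already been looked up from the MR's map_set;
the helper name is made up here and this is not the posted code:

#include <linux/ioport.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/* True only when the whole page lies in a Persistent Memory resource. */
static bool page_is_pmem(struct page *page)
{
	phys_addr_t paddr = PFN_PHYS(page_to_pfn(page));

	return region_intersects(paddr, PAGE_SIZE, IORESOURCE_MEM,
				 IORES_DESC_PERSISTENT_MEMORY) ==
	       REGION_INTERSECTS;
}

The registration path could then walk every page of the MR and set is_pmem
only if all of them pass this test, so a persistence placement type aimed at
DRAM can be rejected up front.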

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
  2022-01-18 15:28           ` Dan Williams
@ 2022-01-19  2:01             ` lizhijian
  0 siblings, 0 replies; 49+ messages in thread
From: lizhijian @ 2022-01-19  2:01 UTC (permalink / raw)
  To: Dan Williams
  Cc: Jason Gunthorpe, linux-rdma, zyjzyj2000, aharonl, leon,
	linux-kernel, mbloch, liangwenpeng, yangx.jy, rpearsonhpe,
	y-goto



On 18/01/2022 23:28, Dan Williams wrote:
> On Tue, Jan 18, 2022 at 12:55 AM lizhijian@fujitsu.com
> <lizhijian@fujitsu.com> wrote:
>>
>>
>> On 17/01/2022 02:11, Dan Williams wrote:
>>> On Wed, Jan 5, 2022 at 10:13 PM lizhijian@fujitsu.com
>>> <lizhijian@fujitsu.com> wrote:
>>>> Add Dan to the party :)
>>>>
>>>> May I know whether there are any existing APIs to check whether
>>>> a va/page is backed by an nvdimm/pmem?
>>>>
>>>>
>>>>
>>>> On 06/01/2022 08:21, Jason Gunthorpe wrote:
>>>>> On Tue, Dec 28, 2021 at 04:07:08PM +0800, Li Zhijian wrote:
>>>>>> We can use it to indicate whether the MR being registered is associated
>>>>>> with a pmem/nvdimm or not.
>>>>>>
>>>>>> Currently, we only assign it in the rxe driver; other devices/drivers
>>>>>> should implement it if needed.
>>>>>>
>>>>>> RDMA FLUSH will support the persistence feature for a pmem/nvdimm.
>>>>>>
>>>>>> Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
>>>>>>     drivers/infiniband/sw/rxe/rxe_mr.c | 47 ++++++++++++++++++++++++++++++
>>>>>>     include/rdma/ib_verbs.h            |  1 +
>>>>>>     2 files changed, 48 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
>>>>>> index 7c4cd19a9db2..bcd5e7afa475 100644
>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
>>>>>> @@ -162,6 +162,50 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
>>>>>>        mr->type = IB_MR_TYPE_DMA;
>>>>>>     }
>>>>>>
>>>>>> +// XXX: the logic is similar to mm/memory-failure.c
>>>>>> +static bool page_in_dev_pagemap(struct page *page)
>>>>>> +{
>>>>>> +    unsigned long pfn;
>>>>>> +    struct page *p;
>>>>>> +    struct dev_pagemap *pgmap = NULL;
>>>>>> +
>>>>>> +    pfn = page_to_pfn(page);
>>>>>> +    if (!pfn) {
>>>>>> +            pr_err("no such pfn for page %p\n", page);
>>>>>> +            return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    p = pfn_to_online_page(pfn);
>>>>>> +    if (!p) {
>>>>>> +            if (pfn_valid(pfn)) {
>>>>>> +                    pgmap = get_dev_pagemap(pfn, NULL);
>>>>>> +                    if (pgmap)
>>>>>> +                            put_dev_pagemap(pgmap);
>>>>>> +            }
>>>>>> +    }
>>>>>> +
>>>>>> +    return !!pgmap;
>>>>> You need to get Dan to check this out, but I'm pretty sure this should
>>>>> be more like this:
>>>>>
>>>>> if (is_zone_device_page(page) && page->pgmap->type == MEMORY_DEVICE_FS_DAX)
>>>> Great, i have added him.
>>>>
>>>>
>>>>
>>>>>> +static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
>>>>>> +{
>>>>>> +    struct page *page = NULL;
>>>>>> +    char *vaddr = iova_to_vaddr(mr, iova, length);
>>>>>> +
>>>>>> +    if (!vaddr) {
>>>>>> +            pr_err("not a valid iova %llu\n", iova);
>>>>>> +            return false;
>>>>>> +    }
>>>>>> +
>>>>>> +    page = virt_to_page(vaddr);
>>>>> And obviously this isn't uniform for the entire umem, so I don't even
>>>>> know what this is supposed to mean.
>>>> My intention is to check whether a memory region belongs to an nvdimm/pmem.
>>>> The approach is like this:
>>>> iova(user space)-+                     +-> page -> page_in_dev_pagemap()
>>>>                     |                     |
>>>>                     +-> va(kernel space) -+
>>>> The current MR's va is associated with a map_set, which records the
>>>> relations between iova, va and page. Do you mean we should traverse the
>>>> map_set to get its page? Or is there another way?
>>> Apologies for the delay in responding.
>>>
>>> The Subject line of this patch is confusing; if you want to know whether a
>>> pfn is in persistent memory, the only mechanism for that is:
>>>
>>> region_intersects(addr, length, IORESOURCE_MEM, IORES_DESC_PERSISTENT_MEMORY)
>>>
>>> ...there is otherwise nothing pmem-specific about the dev_pagemap
>>> infrastructure. Yes, pmem is the primary user, but it is also used for
>>> mapping "soft-reserved" memory (see the EFI_MEMORY_SP attribute) and by
>>> other users.
>>>
>>> Can you clarify the intent? I am missing some context.
>> Thanks for your help, @Dan.
>>
>> I'm going to implement a new ib verb called RDMA FLUSH; it will support the
>> global visibility and persistence placement types.
>>
>> In my first design, only pmem can support the persistence placement type, so I
>> need to introduce a new attribute, is_pmem, to the RDMA memory region (which is
>> associated with a user space address iova) so that I can reject a persistence
>> placement type targeting DRAM (non-pmem).
> Ok, I think for that case you are better served using the
> IORES_DESC_PERSISTENT_MEMORY designation as the gate for attempting
> the flush. The dev_pagemap is otherwise only indicating the presence
> of device-backed memory pages, not persistence.

Thanks a million for your kind suggestion.

Thanks
Zhijian



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem
  2022-01-14  8:10       ` Li, Zhijian
@ 2022-01-27 22:30         ` Jeff Moyer
  0 siblings, 0 replies; 49+ messages in thread
From: Jeff Moyer @ 2022-01-27 22:30 UTC (permalink / raw)
  To: Li, Zhijian
  Cc: Jason Gunthorpe, linux-rdma, zyjzyj2000, aharonl, leon,
	linux-kernel, mbloch, liangwenpeng, yangx.jy, rpearsonhpe,
	y-goto, dan.j.williams, nvdimm

"Li, Zhijian" <lizhijian@cn.fujitsu.com> writes:

> Copied to nvdimm list
>
> Thanks
>
> Zhijian
>
>
> on 2022/1/6 14:12, Li Zhijian wrote:
>>
>> Add Dan to the party :)
>>
>> May I know whether there are any existing APIs to check whether
>> a va/page is backed by an nvdimm/pmem?

I don't know of one.  You could try walk_system_ram_range looking for
IORES_DESC_PERSISTENT_MEMORY, but that's not very efficient.

>>> You need to get Dan to check this out, but I'm pretty sure this should
>>> be more like this:
>>>
>>> if (is_zone_device_page(page) && page->pgmap->type ==
>>> MEMORY_DEVICE_FS_DAX)

You forgot MEMORY_DEVICE_GENERIC.  However, this doesn't guarantee the
memory belongs to persistent memory, only that it is direct access
capable.
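
For illustration, the page-type test under discussion might look like the
sketch below, with both DAX types accepted; as noted above, passing it only
proves the page is DAX-capable device memory, not that it is persistent
(the helper name is made up here):

#include <linux/memremap.h>
#include <linux/mm.h>

/* DAX-capable device page?  Says nothing about persistence. */
static bool page_is_dax_capable(struct page *page)
{
	if (!is_zone_device_page(page))
		return false;

	return page->pgmap->type == MEMORY_DEVICE_FS_DAX ||
	       page->pgmap->type == MEMORY_DEVICE_GENERIC;
}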

Dan, any ideas?

-Jeff


^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2022-01-27 22:27 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-28  8:07 [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 01/10] RDMA: mr: Introduce is_pmem Li Zhijian
2022-01-06  0:21   ` Jason Gunthorpe
2022-01-06  6:12     ` lizhijian
2022-01-14  8:10       ` Li, Zhijian
2022-01-27 22:30         ` Jeff Moyer
2022-01-16 18:11       ` Dan Williams
2022-01-18  8:55         ` lizhijian
2022-01-18 15:28           ` Dan Williams
2022-01-19  2:01             ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 02/10] RDMA: Allow registering MR with flush access flags Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only Li Zhijian
2022-01-06  0:22   ` Jason Gunthorpe
2022-01-06  6:20     ` lizhijian
2022-01-13  6:43   ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 04/10] RDMA/rxe: Enable IB_DEVICE_RDMA_FLUSH for rxe device Li Zhijian
2022-01-06  0:22   ` Jason Gunthorpe
2022-01-06  6:26     ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 05/10] RDMA/rxe: Allow registering persistent flag for pmem MR only Li Zhijian
2021-12-30 22:25   ` Tom Talpey
2021-12-31  3:34     ` lizhijian
2021-12-31 14:40       ` Tom Talpey
2022-01-04  1:32         ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 06/10] RDMA/rxe: Implement RC RDMA FLUSH service in requester side Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 07/10] RDMA/rxe: Set BTH's SE to zero for FLUSH packet Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 08/10] RDMA/rxe: Implement flush execution in responder side Li Zhijian
2021-12-30 22:18   ` Tom Talpey
2021-12-31  1:37     ` lizhijian
2021-12-31  2:32       ` Tom Talpey
2022-01-04  8:51         ` lizhijian
2022-01-04 16:02           ` Tom Talpey
2022-01-06  0:35         ` Jason Gunthorpe
2022-01-04 16:40   ` Tom Talpey
2022-01-05  1:43     ` lizhijian
2022-01-06  0:28   ` Jason Gunthorpe
2022-01-06  6:42     ` lizhijian
2022-01-06 17:33       ` Jason Gunthorpe
2022-01-10  5:45         ` lizhijian
2022-01-10 14:34           ` Jason Gunthorpe
2022-01-11  5:34             ` lizhijian
2022-01-11 20:48               ` Jason Gunthorpe
2022-01-12  9:50                 ` lizhijian
2022-01-12 13:12                   ` Jason Gunthorpe
2022-01-13  6:29                     ` lizhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 09/10] RDMA/rxe: Implement flush completion Li Zhijian
2021-12-28  8:07 ` [RFC PATCH rdma-next 10/10] RDMA/rxe: Add RD FLUSH service support Li Zhijian
2021-12-29  8:49 ` [RFC PATCH rdma-next 00/10] RDMA/rxe: Add RDMA FLUSH operation Gromadzki, Tomasz
2021-12-29 14:35   ` Tom Talpey
2021-12-31  1:10     ` lizhijian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).