* [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation
@ 2022-01-25  8:50 Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 1/9] RDMA: mr: Introduce is_pmem Li Zhijian
                   ` (9 more replies)
  0 siblings, 10 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian, yangx.jy

Hey folks,

I want to thank all of you for the kind feedback on my previous RFC.
I have done my best to address your comments in this update. Not all
comments have been addressed yet, for various reasons, but I still
wish to post this new version to continue the discussion.

Outstanding issues:
- iova_to_vaddr() flows without any kmap()/kmap_local_page() mapping
  might not always work (a rough sketch follows below). # existing issue
- should the responder reply with an error to the requesting side when
  it requests a persistence placement type against DRAM?
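
A rough sketch of the direction the first issue could take. This is not
part of the series, and iova_to_page() is a hypothetical helper standing
in for however rxe would resolve an iova to a struct page:

        struct page *page;
        void *kaddr;

        /* illustrative only: map the page for the duration of the
         * access instead of assuming it lives in the direct map
         */
        page = iova_to_page(mr, iova);          /* hypothetical helper */
        kaddr = kmap_local_page(page);
        arch_wb_cache_pmem(kaddr + offset_in_page(iova), bytes);
        kunmap_local(kaddr);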
-------

These patches implement a *NEW* RDMA opcode, "RDMA FLUSH". In IB SPEC
1.5 [1][2], two new opcodes, ATOMIC WRITE and RDMA FLUSH, were added in
the MEMORY PLACEMENT EXTENSIONS section.

FLUSH is used by the requesting node to achieve guarantees on the
placement, within the memory subsystem, of data written by preceding
accesses to a single memory region, such as those performed by RDMA
WRITE, Atomic and ATOMIC WRITE requests.

The operation indicates the virtual address range of a destination node
to which the guarantees should apply. This range must be contiguous in
the virtual space of the memory key, but it is not necessarily a
contiguous range of physical memory.

FLUSH packets carry a FLUSH extended transport header (FETH, see below)
to specify the placement type and the selectivity level of the
operation, and an RDMA extended transport header (RETH, see the base
document's RETH definition) to specify the R_Key, VA and Length
associated with this request, following the BTH in RC, the RDETH in RD
and the XRCETH in XRC.

RC FLUSH:
+----+------+------+
|BTH | FETH | RETH |
+----+------+------+

RD FLUSH:
+----+------+------+------+
|BTH | RDETH| FETH | RETH |
+----+------+------+------+

XRC FLUSH:
+----+-------+------+------+
|BTH | XRCETH| FETH | RETH |
+----+-------+------+------+

Currently, we introduce the RC and RD services only, since XRC has not
been implemented by rxe yet.
NOTE: only the RC service is tested so far, and since other HCAs have
not implemented FLUSH yet, we can only test the FLUSH operation between
SoftRoCE/rxe devices.
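
For anyone wanting to reproduce the tests, a SoftRoCE device can be
created on top of an ordinary netdev with the rdma tool from iproute2
(eth0 is a placeholder for your interface):

  # modprobe rdma_rxe
  # rdma link add rxe0 type rxe netdev eth0
  # rdma link show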

The corresponding rdma-core and a FLUSH example are available at:
https://github.com/zhijianli88/rdma-core/tree/rfc
The kernel source is available at:
https://github.com/zhijianli88/linux/tree/rdma-flush

- We introduce an is_pmem attribute to the MR (memory region)
- We introduce FLUSH placement type attributes to the HCA
- We introduce FLUSH access flags that users are able to register with
The figure below shows the valid access flags users can register:
+------------------------+------------------+--------------+
| HCA attributes         |    register access flags        |
|        and             +------------------+--------------+
| MR attribute(is_pmem)  |global visibility |  persistence |
|------------------------+------------------+--------------+
| global visibility(DRAM)|        O         |      X       |
|------------------------+------------------+--------------+
| global visibility(PMEM)|        O         |      X       |
|------------------------+------------------+--------------+
| persistence(DRAM)      |        X         |      X       |
|------------------------+------------------+--------------+
| persistence(PMEM)      |        X         |      O       |
+------------------------+------------------+--------------+
O: allowed to register such an access flag
X: otherwise

In order to make placement guarantees, we currently reject requesting a
persistent flush to a non-pmem MR.
The responder checks the remotely requested placement types against the
registered access flags.
+------------------------+------------------+--------------+
|                        |     registered flags            |
| remote requested types +------------------+--------------+
|                        |global visibility |  persistence |
|------------------------+------------------+--------------+
| global visibility      |        O         |      X       |
+------------------------+------------------+--------------+
| persistence            |        X         |      O       |
+------------------------+------------------+--------------+
O: allowed to request such a placement type
X: otherwise
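
As a userspace illustration only, registering a pmem buffer with the
persistence flag could look like the sketch below. The two flag macros
are hypothetical, mirroring the IB_UVERBS_ACCESS_FLUSH_* bits added in
this series; the real definitions come from the rdma-core branch
linked above.

        /* values mirror IB_UVERBS_ACCESS_FLUSH_* from patch 2/9 */
        #define ACCESS_FLUSH_GLOBAL_VISIBILITY  (1 << 8)
        #define ACCESS_FLUSH_PERSISTENT         (1 << 9)

        struct ibv_mr *mr;

        /* buf must actually be backed by pmem, otherwise patch 3/9
         * rejects the persistence flag at registration time
         */
        mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE |
                                      IBV_ACCESS_REMOTE_WRITE |
                                      ACCESS_FLUSH_PERSISTENT);
        if (!mr)
                perror("ibv_reg_mr");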

Below are some details about the FLUSH transport packet:

A FLUSH message is built from a single FLUSH request packet and is
responded to successfully by an RDMA READ response of zero size.

oA19-2: FLUSH shall be a single packet message and shall have no payload.
oA19-5: FLUSH BTH shall hold the Opcode = 0x1C

FLUSH Extended Transport Header(FETH)
+-----+-----------+------------------------+----------------------+
|Bits |   31-6    |          5-4           |        3-0           |
+-----+-----------+------------------------+----------------------+
|     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
+-----+-----------+------------------------+----------------------+

Selectivity Level (SEL) – defines the memory region scope the FLUSH
should apply to. Values are as follows:
• b'00 - Memory Region Range: FLUSH applies to all preceding memory
         updates to the RETH range on this QP. All RETH fields shall be
         valid in this selectivity mode. The RETH:DMALen field shall be
         between zero and (2^31 - 1) bytes (inclusive).
• b'01 - Memory Region: FLUSH applies to all preceding memory updates
         to RETH.R_key on this QP. RETH:DMALen and RETH:VA shall be
         ignored in this mode.
• b'10 - Reserved.
• b'11 - Reserved.

Placement Type (PLT) – defines the memory placement guarantee of this
FLUSH. Multiple bits may be set in this field. Values are as follows:
• Bit 0, if set to '1', indicates that the FLUSH should guarantee
  Global Visibility.
• Bit 1, if set to '1', indicates that the FLUSH should guarantee
  Persistence.
• Bits 3:2 are reserved.
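
For reference, packing and unpacking a FETH per the layout above can be
sketched in plain C. This mirrors the feth_init()/feth_plt() helpers
added in patches 4/9 and 6/9; on the wire the 32-bit word is big-endian,
hence the cpu_to_be32()/be32_to_cpu() in the kernel code:

        #include <stdint.h>

        #define FETH_PLT_SHIFT  0
        #define FETH_SEL_SHIFT  4
        #define FETH_PLT_MASK   (0xfu << FETH_PLT_SHIFT)  /* bits 3-0 */
        #define FETH_SEL_MASK   (0x3u << FETH_SEL_SHIFT)  /* bits 5-4 */

        static inline uint32_t feth_pack(uint32_t sel, uint32_t plt)
        {
                return ((sel << FETH_SEL_SHIFT) & FETH_SEL_MASK) |
                       ((plt << FETH_PLT_SHIFT) & FETH_PLT_MASK);
        }

        static inline uint32_t feth_unpack_sel(uint32_t feth)
        {
                return (feth & FETH_SEL_MASK) >> FETH_SEL_SHIFT;
        }

        static inline uint32_t feth_unpack_plt(uint32_t feth)
        {
                return (feth & FETH_PLT_MASK) >> FETH_PLT_SHIFT;
        }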

[1]: https://www.infinibandta.org/ibta-specification/ # login required
[2]: https://www.infinibandta.org/wp-content/uploads/2021/08/IBTA-Overview-of-IBTA-Volume-1-Release-1.5-and-MPE-2021-08-17-Secure.pptx

CC: yangx.jy@cn.fujitsu.com
CC: y-goto@fujitsu.com
CC: Jason Gunthorpe <jgg@ziepe.ca>
CC: Zhu Yanjun <zyjzyj2000@gmail.com>
CC: Leon Romanovsky <leon@kernel.org>
CC: Bob Pearson <rpearsonhpe@gmail.com>
CC: Mark Bloch <mbloch@nvidia.com>
CC: Wenpeng Liang <liangwenpeng@huawei.com>
CC: Aharon Landau <aharonl@nvidia.com>
CC: Tom Talpey <tom@talpey.com>
CC: "Gromadzki, Tomasz" <tomasz.gromadzki@intel.com>
CC: Dan Williams <dan.j.williams@intel.com>
CC: linux-rdma@vger.kernel.org
CC: linux-kernel@vger.kernel.org

V1:
https://lore.kernel.org/lkml/050c3183-2fc6-03a1-eecd-258744750972@fujitsu.com/T/
or https://github.com/zhijianli88/linux/tree/rdma-flush-rfcv1

Change log
V2:
https://github.com/zhijianli88/linux/tree/rdma-flush
RDMA: mr: Introduce is_pmem
  check 1st byte to avoid crossing page boundary
  new scheme to check is_pmem # Dan

RDMA: Allow registering MR with flush access flags
  combine with [03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only to this patch # Jason
  split RDMA_FLUSH to 2 capabilities

RDMA/rxe: Allow registering persistent flag for pmem MR only
  update commit message, get rid of confusing ib_check_flush_access_flags() # Tom

RDMA/rxe: Implement RC RDMA FLUSH service in requester side
  extend flush to include length field. # Tom and Tomasz

RDMA/rxe: Implement flush execution in responder side
  adjust start for WHOLE MR level # Tom
  don't support DMA mr for flush # Tom
  check flush return value

RDMA/rxe: Enable RDMA FLUSH capability for rxe device
  adjust patch's order. move it here from [04/10]

Li Zhijian (9):
  RDMA: mr: Introduce is_pmem
  RDMA: Allow registering MR with flush access flags
  RDMA/rxe: Allow registering persistent flag for pmem MR only
  RDMA/rxe: Implement RC RDMA FLUSH service in requester side
  RDMA/rxe: Set BTH's SE to zero for FLUSH packet
  RDMA/rxe: Implement flush execution in responder side
  RDMA/rxe: Implement flush completion
  RDMA/rxe: Enable RDMA FLUSH capability for rxe device
  RDMA/rxe: Add RD FLUSH service support

 drivers/infiniband/core/uverbs_cmd.c    |  17 +++
 drivers/infiniband/sw/rxe/rxe_comp.c    |   4 +-
 drivers/infiniband/sw/rxe/rxe_hdr.h     |  52 +++++++++
 drivers/infiniband/sw/rxe/rxe_loc.h     |   2 +
 drivers/infiniband/sw/rxe/rxe_mr.c      |  37 ++++++-
 drivers/infiniband/sw/rxe/rxe_opcode.c  |  35 +++++++
 drivers/infiniband/sw/rxe/rxe_opcode.h  |   3 +
 drivers/infiniband/sw/rxe/rxe_param.h   |   4 +-
 drivers/infiniband/sw/rxe/rxe_req.c     |  19 +++-
 drivers/infiniband/sw/rxe/rxe_resp.c    | 133 +++++++++++++++++++++++-
 include/rdma/ib_pack.h                  |   3 +
 include/rdma/ib_verbs.h                 |  30 +++++-
 include/uapi/rdma/ib_user_ioctl_verbs.h |   2 +
 include/uapi/rdma/ib_user_verbs.h       |  19 ++++
 include/uapi/rdma/rdma_user_rxe.h       |   7 ++
 15 files changed, 355 insertions(+), 12 deletions(-)

-- 
2.31.1





* [RFC PATCH v2 1/9] RDMA: mr: Introduce is_pmem
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
@ 2022-01-25  8:50 ` Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 2/9] RDMA: Allow registering MR with flush access flags Li Zhijian
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian

We can use it to indicate whether the MR being registered is associated
with pmem/nvdimm or not.

Currently, we only update it in the rxe driver; other devices/drivers
should implement it if needed.

CC: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
V2: check 1st byte to avoid crossing page boundary
    new scheme to check is_pmem # Dan
---
 drivers/infiniband/sw/rxe/rxe_mr.c | 25 +++++++++++++++++++++++++
 include/rdma/ib_verbs.h            |  1 +
 2 files changed, 26 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 453ef3c9d535..0427baea8c06 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -161,6 +161,28 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
 	mr->type = IB_MR_TYPE_DMA;
 }
 
+static bool iova_in_pmem(struct rxe_mr *mr, u64 iova, int length)
+{
+	char *vaddr;
+	int is_pmem;
+
+	/* XXX: Shall we allow length == 0? */
+	if (length == 0) {
+		return false;
+	}
+	/* check the 1st byte only to avoid crossing page boundary */
+	vaddr = iova_to_vaddr(mr, iova, 1);
+	if (!vaddr) {
+		pr_warn("not a valid iova 0x%llx\n", iova);
+		return false;
+	}
+
+	is_pmem = region_intersects(virt_to_phys(vaddr), 1, IORESOURCE_MEM,
+				    IORES_DESC_PERSISTENT_MEMORY);
+
+	return is_pmem == REGION_INTERSECTS;
+}
+
 int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr)
 {
@@ -235,6 +257,9 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 	set->va = start;
 	set->offset = ib_umem_offset(umem);
 
+	/* iova_in_pmem() must be called after set is updated */
+	mr->ibmr.is_pmem = iova_in_pmem(mr, iova, length);
+
 	return 0;
 
 err_release_umem:
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 69d883f7fb41..4fa07b123c8d 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1807,6 +1807,7 @@ struct ib_mr {
 	unsigned int	   page_size;
 	enum ib_mr_type	   type;
 	bool		   need_inval;
+	bool		   is_pmem;
 	union {
 		struct ib_uobject	*uobject;	/* user */
 		struct list_head	qp_entry;	/* FR */
-- 
2.31.1





* [RFC PATCH v2 2/9] RDMA: Allow registering MR with flush access flags
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 1/9] RDMA: mr: Introduce is_pmem Li Zhijian
@ 2022-01-25  8:50 ` Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 3/9] RDMA/rxe: Allow registering persistent flag for pmem MR only Li Zhijian
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian

Users can use ibv_reg_mr(3) to register the flush access flags. NOTE
that we only allow registering access flags that are supported by the
HCA.

A device/HCA should enable the IB_DEVICE_PLT_GLOBAL_VISIBILITY or
IB_DEVICE_PLT_PERSISTENT capability, or both, if it wants to support
RDMA FLUSH.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
V2: combine [03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only to this patch # Jason
    split RDMA_FLUSH to 2 capabilities
    Fix typo
---
 include/rdma/ib_verbs.h                 | 18 ++++++++++++++++--
 include/uapi/rdma/ib_user_ioctl_verbs.h |  2 ++
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 4fa07b123c8d..7f5905180636 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -291,6 +291,9 @@ enum ib_device_cap_flags {
 	/* The device supports padding incoming writes to cacheline. */
 	IB_DEVICE_PCI_WRITE_END_PADDING		= (1ULL << 36),
 	IB_DEVICE_ALLOW_USER_UNREG		= (1ULL << 37),
+	/* Placement type attributes */
+	IB_DEVICE_PLT_GLOBAL_VISIBILITY		= (1ULL << 38),
+	IB_DEVICE_PLT_PERSISTENT		= (1ULL << 39),
 };
 
 enum ib_atomic_cap {
@@ -1444,10 +1447,14 @@ enum ib_access_flags {
 	IB_ACCESS_ON_DEMAND = IB_UVERBS_ACCESS_ON_DEMAND,
 	IB_ACCESS_HUGETLB = IB_UVERBS_ACCESS_HUGETLB,
 	IB_ACCESS_RELAXED_ORDERING = IB_UVERBS_ACCESS_RELAXED_ORDERING,
+	IB_ACCESS_FLUSH_GLOBAL_VISIBILITY = IB_UVERBS_ACCESS_FLUSH_GLOBAL_VISIBILITY,
+	IB_ACCESS_FLUSH_PERSISTENT = IB_UVERBS_ACCESS_FLUSH_PERSISTENT,
 
+	IB_ACCESS_FLUSHABLE = IB_ACCESS_FLUSH_GLOBAL_VISIBILITY |
+			      IB_ACCESS_FLUSH_PERSISTENT,
 	IB_ACCESS_OPTIONAL = IB_UVERBS_ACCESS_OPTIONAL_RANGE,
 	IB_ACCESS_SUPPORTED =
-		((IB_ACCESS_HUGETLB << 1) - 1) | IB_ACCESS_OPTIONAL,
+		((IB_ACCESS_FLUSH_PERSISTENT << 1) - 1) | IB_ACCESS_OPTIONAL,
 };
 
 /*
@@ -4301,6 +4308,7 @@ int ib_dealloc_xrcd_user(struct ib_xrcd *xrcd, struct ib_udata *udata);
 static inline int ib_check_mr_access(struct ib_device *ib_dev,
 				     unsigned int flags)
 {
+	u64 device_cap = ib_dev->attrs.device_cap_flags;
 	/*
 	 * Local write permission is required if remote write or
 	 * remote atomic permission is also requested.
@@ -4313,7 +4321,13 @@ static inline int ib_check_mr_access(struct ib_device *ib_dev,
 		return -EINVAL;
 
 	if (flags & IB_ACCESS_ON_DEMAND &&
-	    !(ib_dev->attrs.device_cap_flags & IB_DEVICE_ON_DEMAND_PAGING))
+	    !(device_cap & IB_DEVICE_ON_DEMAND_PAGING))
+		return -EINVAL;
+
+	if ((flags & IB_ACCESS_FLUSH_GLOBAL_VISIBILITY &&
+	    !(device_cap & IB_DEVICE_PLT_GLOBAL_VISIBILITY)) ||
+	    (flags & IB_ACCESS_FLUSH_PERSISTENT &&
+	    !(device_cap & IB_DEVICE_PLT_PERSISTENT)))
 		return -EINVAL;
 	return 0;
 }
diff --git a/include/uapi/rdma/ib_user_ioctl_verbs.h b/include/uapi/rdma/ib_user_ioctl_verbs.h
index 3072e5d6b692..2c28f90ec54c 100644
--- a/include/uapi/rdma/ib_user_ioctl_verbs.h
+++ b/include/uapi/rdma/ib_user_ioctl_verbs.h
@@ -57,6 +57,8 @@ enum ib_uverbs_access_flags {
 	IB_UVERBS_ACCESS_ZERO_BASED = 1 << 5,
 	IB_UVERBS_ACCESS_ON_DEMAND = 1 << 6,
 	IB_UVERBS_ACCESS_HUGETLB = 1 << 7,
+	IB_UVERBS_ACCESS_FLUSH_GLOBAL_VISIBILITY = 1 << 8,
+	IB_UVERBS_ACCESS_FLUSH_PERSISTENT = 1 << 9,
 
 	IB_UVERBS_ACCESS_RELAXED_ORDERING = IB_UVERBS_ACCESS_OPTIONAL_FIRST,
 	IB_UVERBS_ACCESS_OPTIONAL_RANGE =
-- 
2.31.1





* [RFC PATCH v2 3/9] RDMA/rxe: Allow registering persistent flag for pmem MR only
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 1/9] RDMA: mr: Introduce is_pmem Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 2/9] RDMA: Allow registering MR with flush access flags Li Zhijian
@ 2022-01-25  8:50 ` Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 4/9] RDMA/rxe: Implement RC RDMA FLUSH service in requester side Li Zhijian
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian

A memory region can support two placement types:
IB_ACCESS_FLUSH_PERSISTENT and IB_ACCESS_FLUSH_GLOBAL_VISIBILITY, but
only pmem/nvdimm has the ability to persist data
(IB_ACCESS_FLUSH_PERSISTENT).

This patch prevents a local user from registering the persistent access
flag on a non-pmem MR.

+------------------------+------------------+--------------+
| HCA attributes         |    register access flags        |
|        and             +------------------+--------------+
| MR attribute(is_pmem)  |global visibility |  persistence |
|------------------------+------------------+--------------+
| global visibility(DRAM)|        O         |      X       |
|------------------------+------------------+--------------+
| global visibility(PMEM)|        O         |      X       |
|------------------------+------------------+--------------+
| persistence(DRAM)      |        X         |      X       |
|------------------------+------------------+--------------+
| persistence(PMEM)      |        X         |      O       |
+------------------------+------------------+--------------+
PMEM: is_pmem is true
DRAM: is_pmem is false
O: allowed to register such an access flag
X: otherwise

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
V2: update commit message, get rid of confusing ib_check_flush_access_flags() # Tom
---
 drivers/infiniband/sw/rxe/rxe_mr.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 0427baea8c06..89a3bb4e8b71 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -258,7 +258,15 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 	set->offset = ib_umem_offset(umem);
 
 	/* iova_in_pmem() must be called after set is updated */
-	mr->ibmr.is_pmem = iova_in_pmem(mr, iova, length);
+	if (iova_in_pmem(mr, iova, length))
+		mr->ibmr.is_pmem = true;
+	else if (access & IB_ACCESS_FLUSH_PERSISTENT) {
+		pr_warn("Cannot register IB_ACCESS_FLUSH_PERSISTENT for non-pmem memory\n");
+		mr->state = RXE_MR_STATE_INVALID;
+		mr->umem = NULL;
+		err = -EINVAL;
+		goto err_release_umem;
+	}
 
 	return 0;
 
-- 
2.31.1





* [RFC PATCH v2 4/9] RDMA/rxe: Implement RC RDMA FLUSH service in requester side
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (2 preceding siblings ...)
  2022-01-25  8:50 ` [RFC PATCH v2 3/9] RDMA/rxe: Allow registering persistent flag for pmem MR only Li Zhijian
@ 2022-01-25  8:50 ` Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 5/9] RDMA/rxe: Set BTH's SE to zero for FLUSH packet Li Zhijian
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian

An RC FLUSH packet consists of:
+----+------+------+
|BTH | FETH | RETH |
+----+------+------+

oA19-2: FLUSH shall be a single packet message and shall have no payload.
oA19-5: FLUSH BTH shall hold the Opcode = 0x1C

FLUSH Extended Transport Header(FETH)
+-----+-----------+------------------------+----------------------+
|Bits |   31-6    |          5-4           |        3-0           |
+-----+-----------+------------------------+----------------------+
|     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
+-----+-----------+------------------------+----------------------+

Selectivity Level (SEL) – defines the memory region scope the FLUSH
should apply to. Values are as follows:
• b'00 - Memory Region Range: FLUSH applies to all preceding memory
         updates to the RETH range on this QP. All RETH fields shall be
         valid in this selectivity mode. The RETH:DMALen field shall be
         between zero and (2^31 - 1) bytes (inclusive).
• b'01 - Memory Region: FLUSH applies to all preceding memory updates
         to RETH.R_key on this QP. RETH:DMALen and RETH:VA shall be
         ignored in this mode.
• b'10 - Reserved.
• b'11 - Reserved.

Placement Type (PLT) – defines the memory placement guarantee of this
FLUSH. Multiple bits may be set in this field. Values are as follows:
• Bit 0, if set to '1', indicates that the FLUSH should guarantee
  Global Visibility.
• Bit 1, if set to '1', indicates that the FLUSH should guarantee
  Persistence.
• Bits 3:2 are reserved.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
V2: extend flush to include length field.
---
 drivers/infiniband/core/uverbs_cmd.c   | 17 +++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_hdr.h    | 24 ++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_opcode.c | 15 +++++++++++++++
 drivers/infiniband/sw/rxe/rxe_opcode.h |  3 +++
 drivers/infiniband/sw/rxe/rxe_req.c    | 15 ++++++++++++++-
 include/rdma/ib_pack.h                 |  2 ++
 include/rdma/ib_verbs.h                | 10 ++++++++++
 include/uapi/rdma/ib_user_verbs.h      |  8 ++++++++
 include/uapi/rdma/rdma_user_rxe.h      |  7 +++++++
 9 files changed, 100 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 6b6393176b3c..632e1747fb60 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -2080,6 +2080,23 @@ static int ib_uverbs_post_send(struct uverbs_attr_bundle *attrs)
 			rdma->rkey = user_wr->wr.rdma.rkey;
 
 			next = &rdma->wr;
+		} else if (user_wr->opcode == IB_WR_RDMA_FLUSH) {
+			struct ib_flush_wr *flush;
+
+			next_size = sizeof(*flush);
+			flush = alloc_wr(next_size, user_wr->num_sge);
+			if (!flush) {
+				ret = -ENOMEM;
+				goto out_put;
+			}
+
+			flush->remote_addr = user_wr->wr.flush.remote_addr;
+			flush->length = user_wr->wr.flush.length;
+			flush->rkey = user_wr->wr.flush.rkey;
+			flush->type = user_wr->wr.flush.type;
+			flush->level = user_wr->wr.flush.level;
+
+			next = &flush->wr;
 		} else if (user_wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP ||
 			   user_wr->opcode == IB_WR_ATOMIC_FETCH_AND_ADD) {
 			struct ib_atomic_wr *atomic;
diff --git a/drivers/infiniband/sw/rxe/rxe_hdr.h b/drivers/infiniband/sw/rxe/rxe_hdr.h
index e432f9e37795..e37aa1944b18 100644
--- a/drivers/infiniband/sw/rxe/rxe_hdr.h
+++ b/drivers/infiniband/sw/rxe/rxe_hdr.h
@@ -607,6 +607,29 @@ static inline void reth_set_len(struct rxe_pkt_info *pkt, u32 len)
 		rxe_opcode[pkt->opcode].offset[RXE_RETH], len);
 }
 
+/*
+ * FLUSH Extended Transport Header(FETH)
+ * +-----+-----------+------------------------+----------------------+
+ * |Bits |   31-6    |          5-4           |        3-0           |
+ * +-----+-----------+------------------------+----------------------+
+ *       | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
+ * +-----+-----------+------------------------+----------------------+
+ */
+#define FETH_PLT_SHIFT 0UL
+#define FETH_SEL_SHIFT 4UL
+#define FETH_RESERVED_SHIFT 6UL
+#define FETH_PLT_MASK ((1UL << FETH_SEL_SHIFT) - 1UL)
+#define FETH_SEL_MASK (~FETH_PLT_MASK & ((1UL << FETH_RESERVED_SHIFT) - 1UL))
+
+static inline void feth_init(struct rxe_pkt_info *pkt, u32 type, u32 level)
+{
+	u32 *p = (u32 *)(pkt->hdr + rxe_opcode[pkt->opcode].offset[RXE_FETH]);
+	u32 feth = ((level << FETH_SEL_SHIFT) & FETH_SEL_MASK) |
+		   ((type << FETH_PLT_SHIFT) & FETH_PLT_MASK);
+
+	*p = cpu_to_be32(feth);
+}
+
 /******************************************************************************
  * Atomic Extended Transport Header
  ******************************************************************************/
@@ -910,6 +933,7 @@ enum rxe_hdr_length {
 	RXE_ATMETH_BYTES	= sizeof(struct rxe_atmeth),
 	RXE_IETH_BYTES		= sizeof(struct rxe_ieth),
 	RXE_RDETH_BYTES		= sizeof(struct rxe_rdeth),
+	RXE_FETH_BYTES		= sizeof(u32),
 };
 
 static inline size_t header_size(struct rxe_pkt_info *pkt)
diff --git a/drivers/infiniband/sw/rxe/rxe_opcode.c b/drivers/infiniband/sw/rxe/rxe_opcode.c
index df596ba7527d..adea6c16dfb5 100644
--- a/drivers/infiniband/sw/rxe/rxe_opcode.c
+++ b/drivers/infiniband/sw/rxe/rxe_opcode.c
@@ -316,6 +316,21 @@ struct rxe_opcode_info rxe_opcode[RXE_NUM_OPCODE] = {
 					  RXE_AETH_BYTES,
 		}
 	},
+	[IB_OPCODE_RC_RDMA_FLUSH]			= {
+		.name	= "IB_OPCODE_RC_RDMA_FLUSH",
+		.mask	= RXE_FETH_MASK | RXE_RETH_MASK | RXE_FLUSH_MASK |
+			  RXE_START_MASK | RXE_END_MASK | RXE_REQ_MASK,
+		.length = RXE_BTH_BYTES + RXE_FETH_BYTES + RXE_RETH_BYTES,
+		.offset = {
+			[RXE_BTH]	= 0,
+			[RXE_FETH]	= RXE_BTH_BYTES,
+			[RXE_RETH]	= RXE_BTH_BYTES +
+					  RXE_FETH_BYTES,
+			[RXE_PAYLOAD]	= RXE_BTH_BYTES +
+					  RXE_FETH_BYTES +
+					  RXE_RETH_BYTES,
+		}
+	},
 	[IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE]			= {
 		.name	= "IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE",
 		.mask	= RXE_AETH_MASK | RXE_ATMACK_MASK | RXE_ACK_MASK |
diff --git a/drivers/infiniband/sw/rxe/rxe_opcode.h b/drivers/infiniband/sw/rxe/rxe_opcode.h
index 8f9aaaf260f2..dbc2eca8a92c 100644
--- a/drivers/infiniband/sw/rxe/rxe_opcode.h
+++ b/drivers/infiniband/sw/rxe/rxe_opcode.h
@@ -48,6 +48,7 @@ enum rxe_hdr_type {
 	RXE_DETH,
 	RXE_IMMDT,
 	RXE_PAYLOAD,
+	RXE_FETH,
 	NUM_HDR_TYPES
 };
 
@@ -63,6 +64,7 @@ enum rxe_hdr_mask {
 	RXE_IETH_MASK		= BIT(RXE_IETH),
 	RXE_RDETH_MASK		= BIT(RXE_RDETH),
 	RXE_DETH_MASK		= BIT(RXE_DETH),
+	RXE_FETH_MASK		= BIT(RXE_FETH),
 	RXE_PAYLOAD_MASK	= BIT(RXE_PAYLOAD),
 
 	RXE_REQ_MASK		= BIT(NUM_HDR_TYPES + 0),
@@ -80,6 +82,7 @@ enum rxe_hdr_mask {
 	RXE_END_MASK		= BIT(NUM_HDR_TYPES + 10),
 
 	RXE_LOOPBACK_MASK	= BIT(NUM_HDR_TYPES + 12),
+	RXE_FLUSH_MASK		= BIT(NUM_HDR_TYPES + 13),
 
 	RXE_READ_OR_ATOMIC_MASK	= (RXE_READ_MASK | RXE_ATOMIC_MASK),
 	RXE_WRITE_OR_SEND_MASK	= (RXE_WRITE_MASK | RXE_SEND_MASK),
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 5eb89052dd66..708138117136 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -220,6 +220,9 @@ static int next_opcode_rc(struct rxe_qp *qp, u32 opcode, int fits)
 				IB_OPCODE_RC_SEND_ONLY_WITH_IMMEDIATE :
 				IB_OPCODE_RC_SEND_FIRST;
 
+	case IB_WR_RDMA_FLUSH:
+		return IB_OPCODE_RC_RDMA_FLUSH;
+
 	case IB_WR_RDMA_READ:
 		return IB_OPCODE_RC_RDMA_READ_REQUEST;
 
@@ -413,11 +416,18 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 
 	/* init optional headers */
 	if (pkt->mask & RXE_RETH_MASK) {
-		reth_set_rkey(pkt, ibwr->wr.rdma.rkey);
+		if (pkt->mask & RXE_FETH_MASK)
+			reth_set_rkey(pkt, ibwr->wr.flush.rkey);
+		else
+			reth_set_rkey(pkt, ibwr->wr.rdma.rkey);
 		reth_set_va(pkt, wqe->iova);
 		reth_set_len(pkt, wqe->dma.resid);
 	}
 
+	/* fill FLUSH Extended Transport Header */
+	if (pkt->mask & RXE_FETH_MASK)
+		feth_init(pkt, ibwr->wr.flush.type, ibwr->wr.flush.level);
+
 	if (pkt->mask & RXE_IMMDT_MASK)
 		immdt_set_imm(pkt, ibwr->ex.imm_data);
 
@@ -477,6 +487,9 @@ static int finish_packet(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 
 			memset(pad, 0, bth_pad(pkt));
 		}
+	} else if (pkt->mask & RXE_FLUSH_MASK) {
+		/* oA19-2: shall have no payload. */
+		wqe->dma.resid = 0;
 	}
 
 	return 0;
diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h
index a9162f25beaf..d19edb502de6 100644
--- a/include/rdma/ib_pack.h
+++ b/include/rdma/ib_pack.h
@@ -84,6 +84,7 @@ enum {
 	/* opcode 0x15 is reserved */
 	IB_OPCODE_SEND_LAST_WITH_INVALIDATE         = 0x16,
 	IB_OPCODE_SEND_ONLY_WITH_INVALIDATE         = 0x17,
+	IB_OPCODE_RDMA_FLUSH                        = 0x1C,
 
 	/* real constants follow -- see comment about above IB_OPCODE()
 	   macro for more details */
@@ -112,6 +113,7 @@ enum {
 	IB_OPCODE(RC, FETCH_ADD),
 	IB_OPCODE(RC, SEND_LAST_WITH_INVALIDATE),
 	IB_OPCODE(RC, SEND_ONLY_WITH_INVALIDATE),
+	IB_OPCODE(RC, RDMA_FLUSH),
 
 	/* UC */
 	IB_OPCODE(UC, SEND_FIRST),
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 7f5905180636..d8555b6e4eba 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1299,6 +1299,7 @@ struct ib_qp_attr {
 enum ib_wr_opcode {
 	/* These are shared with userspace */
 	IB_WR_RDMA_WRITE = IB_UVERBS_WR_RDMA_WRITE,
+	IB_WR_RDMA_FLUSH = IB_UVERBS_WR_RDMA_FLUSH,
 	IB_WR_RDMA_WRITE_WITH_IMM = IB_UVERBS_WR_RDMA_WRITE_WITH_IMM,
 	IB_WR_SEND = IB_UVERBS_WR_SEND,
 	IB_WR_SEND_WITH_IMM = IB_UVERBS_WR_SEND_WITH_IMM,
@@ -1393,6 +1394,15 @@ struct ib_atomic_wr {
 	u32			rkey;
 };
 
+struct ib_flush_wr {
+	struct ib_send_wr	wr;
+	u64			remote_addr;
+	u32			length;
+	u32			rkey;
+	u8			type;
+	u8			level;
+};
+
 static inline const struct ib_atomic_wr *atomic_wr(const struct ib_send_wr *wr)
 {
 	return container_of(wr, struct ib_atomic_wr, wr);
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index 7ee73a0652f1..c4131913ef6a 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -784,6 +784,7 @@ enum ib_uverbs_wr_opcode {
 	IB_UVERBS_WR_RDMA_READ_WITH_INV = 11,
 	IB_UVERBS_WR_MASKED_ATOMIC_CMP_AND_SWP = 12,
 	IB_UVERBS_WR_MASKED_ATOMIC_FETCH_AND_ADD = 13,
+	IB_UVERBS_WR_RDMA_FLUSH = 14,
 	/* Review enum ib_wr_opcode before modifying this */
 };
 
@@ -797,6 +798,13 @@ struct ib_uverbs_send_wr {
 		__u32 invalidate_rkey;
 	} ex;
 	union {
+		struct {
+			__aligned_u64 remote_addr;
+			__u32 length;
+			__u32 rkey;
+			__u8 type;
+			__u8 level;
+		} flush;
 		struct {
 			__aligned_u64 remote_addr;
 			__u32 rkey;
diff --git a/include/uapi/rdma/rdma_user_rxe.h b/include/uapi/rdma/rdma_user_rxe.h
index f09c5c9e3dd5..3de56ed5c24f 100644
--- a/include/uapi/rdma/rdma_user_rxe.h
+++ b/include/uapi/rdma/rdma_user_rxe.h
@@ -82,6 +82,13 @@ struct rxe_send_wr {
 		__u32		invalidate_rkey;
 	} ex;
 	union {
+		struct {
+			__aligned_u64 remote_addr;
+			__u32	length;
+			__u32	rkey;
+			__u8	type;
+			__u8	level;
+		} flush;
 		struct {
 			__aligned_u64 remote_addr;
 			__u32	rkey;
-- 
2.31.1





* [RFC PATCH v2 5/9] RDMA/rxe: Set BTH's SE to zero for FLUSH packet
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (3 preceding siblings ...)
  2022-01-25  8:50 ` [RFC PATCH v2 4/9] RDMA/rxe: Implement RC RDMA FLUSH service in requester side Li Zhijian
@ 2022-01-25  8:50 ` Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 6/9] RDMA/rxe: Implement flush execution in responder side Li Zhijian
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian

The SPEC says:
oA19-6: FLUSH BTH header field solicited event (SE) indication shall be
set to zero.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
V2: said -> says
---
 drivers/infiniband/sw/rxe/rxe_req.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 708138117136..363a33b905bf 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -401,7 +401,9 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 			(pkt->mask & RXE_END_MASK) &&
 			((pkt->mask & (RXE_SEND_MASK)) ||
 			(pkt->mask & (RXE_WRITE_MASK | RXE_IMMDT_MASK)) ==
-			(RXE_WRITE_MASK | RXE_IMMDT_MASK));
+			(RXE_WRITE_MASK | RXE_IMMDT_MASK)) &&
+			/* oA19-6: always set SE to zero */
+			!(pkt->mask & RXE_FETH_MASK);
 
 	qp_num = (pkt->mask & RXE_DETH_MASK) ? ibwr->wr.ud.remote_qpn :
 					 qp->attr.dest_qp_num;
-- 
2.31.1





* [RFC PATCH v2 6/9] RDMA/rxe: Implement flush execution in responder side
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (4 preceding siblings ...)
  2022-01-25  8:50 ` [RFC PATCH v2 5/9] RDMA/rxe: Set BTH's SE to zero for FLUSH packet Li Zhijian
@ 2022-01-25  8:50 ` Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 7/9] RDMA/rxe: Implement flush completion Li Zhijian
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian

In contrast to other opcodes, after a series of sanity checks, the
FLUSH opcode does a placement type check before it actually performs
the FLUSH operation. The responder also replies NAK "Remote Access
Error" if it finds a placement type violation.

The responder checks the remotely requested placement types against the
registered access flags.
+------------------------+------------------+--------------+
|                        |     registered flags            |
| remote requested types +------------------+--------------+
|                        |global visibility |  persistence |
|------------------------+------------------+--------------+
| global visibility      |        O         |      X       |
+------------------------+------------------+--------------+
| persistence            |        X         |      O       |
+------------------------+------------------+--------------+
O: allow to request such placement type
X: otherwise

We persist data via arch_wb_cache_pmem(), which could be architecture
specific (e.g. CLWB-based cache writeback on x86).

After the execution, the responder replies successfully with an RDMA
READ response of zero size.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
V2:
 # from Tom
 - adjust start for WHOLE MR level
 - don't support DMA mr for flush
 - check flush return value
---
 drivers/infiniband/sw/rxe/rxe_hdr.h  |  28 ++++++
 drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
 drivers/infiniband/sw/rxe/rxe_mr.c   |   4 +-
 drivers/infiniband/sw/rxe/rxe_resp.c | 136 +++++++++++++++++++++++++--
 include/uapi/rdma/ib_user_verbs.h    |  10 ++
 5 files changed, 172 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_hdr.h b/drivers/infiniband/sw/rxe/rxe_hdr.h
index e37aa1944b18..cdfd393b8bd8 100644
--- a/drivers/infiniband/sw/rxe/rxe_hdr.h
+++ b/drivers/infiniband/sw/rxe/rxe_hdr.h
@@ -630,6 +630,34 @@ static inline void feth_init(struct rxe_pkt_info *pkt, u32 type, u32 level)
 	*p = cpu_to_be32(feth);
 }
 
+static inline u32 __feth_plt(void *arg)
+{
+	u32 *fethp = arg;
+	u32 feth = be32_to_cpu(*fethp);
+
+	return (feth & FETH_PLT_MASK) >> FETH_PLT_SHIFT;
+}
+
+static inline u32 __feth_sel(void *arg)
+{
+	u32 *fethp = arg;
+	u32 feth = be32_to_cpu(*fethp);
+
+	return (feth & FETH_SEL_MASK) >> FETH_SEL_SHIFT;
+}
+
+static inline u32 feth_plt(struct rxe_pkt_info *pkt)
+{
+	return __feth_plt(pkt->hdr +
+		rxe_opcode[pkt->opcode].offset[RXE_FETH]);
+}
+
+static inline u32 feth_sel(struct rxe_pkt_info *pkt)
+{
+	return __feth_sel(pkt->hdr +
+		rxe_opcode[pkt->opcode].offset[RXE_FETH]);
+}
+
 /******************************************************************************
  * Atomic Extended Transport Header
  ******************************************************************************/
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index b1e174afb1d4..73c39ff11e28 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -80,6 +80,8 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		enum rxe_mr_copy_dir dir);
 int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
 	      void *addr, int length, enum rxe_mr_copy_dir dir);
+void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
+		 size_t *offset_out);
 void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 89a3bb4e8b71..cd55fcc00e65 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -298,8 +298,8 @@ int rxe_mr_init_fast(struct rxe_pd *pd, int max_pages, struct rxe_mr *mr)
 	return err;
 }
 
-static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
-			size_t *offset_out)
+void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
+		 size_t *offset_out)
 {
 	struct rxe_map_set *set = mr->cur_map_set;
 	size_t offset = iova - set->iova + set->offset;
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index e0093fad4e0f..3277a36f506f 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -5,6 +5,7 @@
  */
 
 #include <linux/skbuff.h>
+#include <linux/libnvdimm.h>
 
 #include "rxe.h"
 #include "rxe_loc.h"
@@ -19,6 +20,7 @@ enum resp_states {
 	RESPST_CHK_RESOURCE,
 	RESPST_CHK_LENGTH,
 	RESPST_CHK_RKEY,
+	RESPST_CHK_PLT,
 	RESPST_EXECUTE,
 	RESPST_READ_REPLY,
 	RESPST_COMPLETE,
@@ -35,6 +37,7 @@ enum resp_states {
 	RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
 	RESPST_ERR_RNR,
 	RESPST_ERR_RKEY_VIOLATION,
+	RESPST_ERR_PLT_VIOLATION,
 	RESPST_ERR_INVALIDATE_RKEY,
 	RESPST_ERR_LENGTH,
 	RESPST_ERR_CQ_OVERFLOW,
@@ -53,6 +56,7 @@ static char *resp_state_name[] = {
 	[RESPST_CHK_RESOURCE]			= "CHK_RESOURCE",
 	[RESPST_CHK_LENGTH]			= "CHK_LENGTH",
 	[RESPST_CHK_RKEY]			= "CHK_RKEY",
+	[RESPST_CHK_PLT]			= "CHK_PLACEMENT_TYPE",
 	[RESPST_EXECUTE]			= "EXECUTE",
 	[RESPST_READ_REPLY]			= "READ_REPLY",
 	[RESPST_COMPLETE]			= "COMPLETE",
@@ -69,6 +73,7 @@ static char *resp_state_name[] = {
 	[RESPST_ERR_TOO_MANY_RDMA_ATM_REQ]	= "ERR_TOO_MANY_RDMA_ATM_REQ",
 	[RESPST_ERR_RNR]			= "ERR_RNR",
 	[RESPST_ERR_RKEY_VIOLATION]		= "ERR_RKEY_VIOLATION",
+	[RESPST_ERR_PLT_VIOLATION]		= "ERR_PLACEMENT_TYPE_VIOLATION",
 	[RESPST_ERR_INVALIDATE_RKEY]		= "ERR_INVALIDATE_RKEY_VIOLATION",
 	[RESPST_ERR_LENGTH]			= "ERR_LENGTH",
 	[RESPST_ERR_CQ_OVERFLOW]		= "ERR_CQ_OVERFLOW",
@@ -400,6 +405,25 @@ static enum resp_states check_length(struct rxe_qp *qp,
 	}
 }
 
+static enum resp_states check_placement_type(struct rxe_qp *qp,
+					     struct rxe_pkt_info *pkt)
+{
+	struct rxe_mr *mr = qp->resp.mr;
+	u32 plt = feth_plt(pkt);
+
+	if ((plt & IB_EXT_PLT_GLB_VIS &&
+	    !(mr->access & IB_ACCESS_FLUSH_GLOBAL_VISIBILITY)) ||
+	    (plt & IB_EXT_PLT_PERSIST &&
+	    !(mr->access & IB_ACCESS_FLUSH_PERSISTENT))) {
+		pr_info("Target MR doesn't support this placement type, is_pmem: %s, registered flag: %x, requested flag: %x\n",
+		        mr->ibmr.is_pmem ? "true" : "false",
+		        (mr->access & IB_ACCESS_FLUSHABLE) >> 8, plt);
+		return RESPST_ERR_PLT_VIOLATION;
+	}
+
+	return RESPST_EXECUTE;
+}
+
 static enum resp_states check_rkey(struct rxe_qp *qp,
 				   struct rxe_pkt_info *pkt)
 {
@@ -413,7 +437,7 @@ static enum resp_states check_rkey(struct rxe_qp *qp,
 	enum resp_states state;
 	int access;
 
-	if (pkt->mask & RXE_READ_OR_WRITE_MASK) {
+	if (pkt->mask & (RXE_READ_OR_WRITE_MASK | RXE_FLUSH_MASK)) {
 		if (pkt->mask & RXE_RETH_MASK) {
 			qp->resp.va = reth_va(pkt);
 			qp->resp.offset = 0;
@@ -421,8 +445,12 @@ static enum resp_states check_rkey(struct rxe_qp *qp,
 			qp->resp.resid = reth_len(pkt);
 			qp->resp.length = reth_len(pkt);
 		}
-		access = (pkt->mask & RXE_READ_MASK) ? IB_ACCESS_REMOTE_READ
-						     : IB_ACCESS_REMOTE_WRITE;
+		if (pkt->mask & RXE_FLUSH_MASK)
+			access = IB_ACCESS_FLUSHABLE;
+		else if (pkt->mask & RXE_READ_MASK)
+			access = IB_ACCESS_REMOTE_READ;
+		else
+			access = IB_ACCESS_REMOTE_WRITE;
 	} else if (pkt->mask & RXE_ATOMIC_MASK) {
 		qp->resp.va = atmeth_va(pkt);
 		qp->resp.offset = 0;
@@ -434,8 +462,10 @@ static enum resp_states check_rkey(struct rxe_qp *qp,
 	}
 
 	/* A zero-byte op is not required to set an addr or rkey. */
+	/* FLUSH (RXE_FETH_MASK) carries a zero-byte payload */
 	if ((pkt->mask & RXE_READ_OR_WRITE_MASK) &&
 	    (pkt->mask & RXE_RETH_MASK) &&
+	    !(pkt->mask & RXE_FETH_MASK) &&
 	    reth_len(pkt) == 0) {
 		return RESPST_EXECUTE;
 	}
@@ -503,7 +533,7 @@ static enum resp_states check_rkey(struct rxe_qp *qp,
 	WARN_ON_ONCE(qp->resp.mr);
 
 	qp->resp.mr = mr;
-	return RESPST_EXECUTE;
+	return pkt->mask & RXE_FETH_MASK ? RESPST_CHK_PLT : RESPST_EXECUTE;
 
 err:
 	if (mr)
@@ -549,6 +579,93 @@ static enum resp_states write_data_in(struct rxe_qp *qp,
 	return rc;
 }
 
+static int nvdimm_flush_iova(struct rxe_mr *mr, u64 iova, int length)
+{
+	int err;
+	int bytes;
+	u8 *va;
+	struct rxe_map **map;
+	struct rxe_phys_buf *buf;
+	int m;
+	int i;
+	size_t offset;
+
+	if (length == 0)
+		return 0;
+
+	if (mr->type == IB_MR_TYPE_DMA) {
+		err = -EFAULT;
+		goto err1;
+	}
+
+	err = mr_check_range(mr, iova, length);
+	if (err) {
+		err = -EFAULT;
+		goto err1;
+	}
+
+	lookup_iova(mr, iova, &m, &i, &offset);
+
+	map = mr->cur_map_set->map + m;
+	buf = map[0]->buf + i;
+
+	while (length > 0) {
+		va = (u8 *)(uintptr_t)buf->addr + offset;
+		bytes = buf->size - offset;
+
+		if (bytes > length)
+			bytes = length;
+
+		arch_wb_cache_pmem(va, bytes);
+
+		length -= bytes;
+
+		offset = 0;
+		buf++;
+		i++;
+
+		if (i == RXE_BUF_PER_MAP) {
+			i = 0;
+			map++;
+			buf = map[0]->buf;
+		}
+	}
+
+	return 0;
+
+err1:
+	return err;
+}
+
+static enum resp_states process_flush(struct rxe_qp *qp,
+				       struct rxe_pkt_info *pkt)
+{
+	u64 length, start;
+	u32 sel = feth_sel(pkt);
+	u32 plt = feth_plt(pkt);
+	struct rxe_mr *mr = qp->resp.mr;
+
+	if (sel == IB_EXT_SEL_MR_RANGE) {
+		start = qp->resp.va;
+		length = qp->resp.length;
+	} else { /* sel == IB_EXT_SEL_MR_WHOLE */
+		start = mr->cur_map_set->iova;
+		length = mr->cur_map_set->length;
+	}
+
+	if (plt & IB_EXT_PLT_PERSIST) {
+		if (nvdimm_flush_iova(mr, start, length))
+			return RESPST_ERR_RKEY_VIOLATION;
+		wmb();
+	} else if (plt & IB_EXT_PLT_GLB_VIS)
+		wmb();
+
+	/* Prepare RDMA READ response of zero */
+	qp->resp.resid = 0;
+
+	return RESPST_READ_REPLY;
+}
+
 /* Guarantee atomicity of atomic operations at the machine level. */
 static DEFINE_SPINLOCK(atomic_ops_lock);
 
@@ -801,6 +918,8 @@ static enum resp_states execute(struct rxe_qp *qp, struct rxe_pkt_info *pkt)
 		err = process_atomic(qp, pkt);
 		if (err)
 			return err;
+	} else if (pkt->mask & RXE_FLUSH_MASK) {
+		return process_flush(qp, pkt);
 	} else {
 		/* Unreachable */
 		WARN_ON_ONCE(1);
@@ -1061,7 +1180,7 @@ static enum resp_states duplicate_request(struct rxe_qp *qp,
 		/* SEND. Ack again and cleanup. C9-105. */
 		send_ack(qp, pkt, AETH_ACK_UNLIMITED, prev_psn);
 		return RESPST_CLEANUP;
-	} else if (pkt->mask & RXE_READ_MASK) {
+	} else if (pkt->mask & RXE_READ_MASK || pkt->mask & RXE_FLUSH_MASK) {
 		struct resp_res *res;
 
 		res = find_resource(qp, pkt->psn);
@@ -1100,7 +1219,7 @@ static enum resp_states duplicate_request(struct rxe_qp *qp,
 			/* Reset the resource, except length. */
 			res->read.va_org = iova;
 			res->read.va = iova;
-			res->read.resid = resid;
+			res->read.resid = pkt->mask & RXE_FLUSH_MASK ? 0 : resid;
 
 			/* Replay the RDMA read reply. */
 			qp->resp.res = res;
@@ -1247,6 +1366,9 @@ int rxe_responder(void *arg)
 		case RESPST_CHK_RKEY:
 			state = check_rkey(qp, pkt);
 			break;
+		case RESPST_CHK_PLT:
+			state = check_placement_type(qp, pkt);
+			break;
 		case RESPST_EXECUTE:
 			state = execute(qp, pkt);
 			break;
@@ -1301,6 +1423,8 @@ int rxe_responder(void *arg)
 			break;
 
 		case RESPST_ERR_RKEY_VIOLATION:
+		/* oA19-13 8 */
+		case RESPST_ERR_PLT_VIOLATION:
 			if (qp_type(qp) == IB_QPT_RC) {
 				/* Class C */
 				do_class_ac_error(qp, AETH_NAK_REM_ACC_ERR,
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index c4131913ef6a..be1f9dca08a8 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -105,6 +105,16 @@ enum {
 	IB_USER_VERBS_EX_CMD_MODIFY_CQ
 };
 
+enum ib_ext_placement_type {
+	IB_EXT_PLT_GLB_VIS = 1 << 0,
+	IB_EXT_PLT_PERSIST = 1 << 1,
+};
+
+enum ib_ext_selectivity_level {
+	IB_EXT_SEL_MR_RANGE = 0,
+	IB_EXT_SEL_MR_WHOLE,
+};
+
 /*
  * Make sure that all structs defined in this file remain laid out so
  * that they pack the same way on 32-bit and 64-bit architectures (to
-- 
2.31.1





* [RFC PATCH v2 7/9] RDMA/rxe: Implement flush completion
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (5 preceding siblings ...)
  2022-01-25  8:50 ` [RFC PATCH v2 6/9] RDMA/rxe: Implement flush execution in responder side Li Zhijian
@ 2022-01-25  8:50 ` Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 8/9] RDMA/rxe: Enable RDMA FLUSH capability for rxe device Li Zhijian
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian

Introduce a new IB_UVERBS_WC_FLUSH code to report a FLUSH completion
to userspace.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c | 4 +++-
 include/rdma/ib_verbs.h              | 1 +
 include/uapi/rdma/ib_user_verbs.h    | 1 +
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index f363fe3fa414..e5b9d07eba93 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -104,6 +104,7 @@ static enum ib_wc_opcode wr_to_wc_opcode(enum ib_wr_opcode opcode)
 	case IB_WR_LOCAL_INV:			return IB_WC_LOCAL_INV;
 	case IB_WR_REG_MR:			return IB_WC_REG_MR;
 	case IB_WR_BIND_MW:			return IB_WC_BIND_MW;
+	case IB_WR_RDMA_FLUSH:			return IB_WC_RDMA_FLUSH;
 
 	default:
 		return 0xff;
@@ -261,7 +262,8 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
 		 */
 	case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
 		if (wqe->wr.opcode != IB_WR_RDMA_READ &&
-		    wqe->wr.opcode != IB_WR_RDMA_READ_WITH_INV) {
+		    wqe->wr.opcode != IB_WR_RDMA_READ_WITH_INV &&
+		    wqe->wr.opcode != IB_WR_RDMA_FLUSH) {
 			wqe->status = IB_WC_FATAL_ERR;
 			return COMPST_ERROR;
 		}
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index d8555b6e4eba..5242acb73004 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -965,6 +965,7 @@ const char *__attribute_const__ ib_wc_status_msg(enum ib_wc_status status);
 enum ib_wc_opcode {
 	IB_WC_SEND = IB_UVERBS_WC_SEND,
 	IB_WC_RDMA_WRITE = IB_UVERBS_WC_RDMA_WRITE,
+	IB_WC_RDMA_FLUSH = IB_UVERBS_WC_FLUSH,
 	IB_WC_RDMA_READ = IB_UVERBS_WC_RDMA_READ,
 	IB_WC_COMP_SWAP = IB_UVERBS_WC_COMP_SWAP,
 	IB_WC_FETCH_ADD = IB_UVERBS_WC_FETCH_ADD,
diff --git a/include/uapi/rdma/ib_user_verbs.h b/include/uapi/rdma/ib_user_verbs.h
index be1f9dca08a8..d43671fef93e 100644
--- a/include/uapi/rdma/ib_user_verbs.h
+++ b/include/uapi/rdma/ib_user_verbs.h
@@ -476,6 +476,7 @@ enum ib_uverbs_wc_opcode {
 	IB_UVERBS_WC_BIND_MW = 5,
 	IB_UVERBS_WC_LOCAL_INV = 6,
 	IB_UVERBS_WC_TSO = 7,
+	IB_UVERBS_WC_FLUSH = 8,
 };
 
 struct ib_uverbs_wc {
-- 
2.31.1





* [RFC PATCH v2 8/9] RDMA/rxe: Enable RDMA FLUSH capability for rxe device
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (6 preceding siblings ...)
  2022-01-25  8:50 ` [RFC PATCH v2 7/9] RDMA/rxe: Implement flush completion Li Zhijian
@ 2022-01-25  8:50 ` Li Zhijian
  2022-01-25  8:50 ` [RFC PATCH v2 9/9] RDMA/rxe: Add RD FLUSH service support Li Zhijian
  2022-01-25  8:57 ` [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Zhu Yanjun
  9 siblings, 0 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian

A19.4.3.1 HCA RESOURCES
This Annex introduces the following new HCA attributes:
• Ability to support Memory Placement Extensions
a) Ability to support FLUSH
 i) Ability to support FLUSH with PLT Global Visibility
 ii) Ability to support FLUSH with PLT Persistence

Now we are ready to enable the RDMA FLUSH capability for rxe.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
V2: adjust patch's order. move it here from [04/10]
    update comments, add referring to SPEC
---
 drivers/infiniband/sw/rxe/rxe_param.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index 918270e34a35..281e1977b147 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -53,7 +53,9 @@ enum rxe_device_param {
 					| IB_DEVICE_ALLOW_USER_UNREG
 					| IB_DEVICE_MEM_WINDOW
 					| IB_DEVICE_MEM_WINDOW_TYPE_2A
-					| IB_DEVICE_MEM_WINDOW_TYPE_2B,
+					| IB_DEVICE_MEM_WINDOW_TYPE_2B
+					| IB_DEVICE_PLT_GLOBAL_VISIBILITY
+					| IB_DEVICE_PLT_PERSISTENT,
 	RXE_MAX_SGE			= 32,
 	RXE_MAX_WQE_SIZE		= sizeof(struct rxe_send_wqe) +
 					  sizeof(struct ib_sge) * RXE_MAX_SGE,
-- 
2.31.1





* [RFC PATCH v2 9/9] RDMA/rxe: Add RD FLUSH service support
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (7 preceding siblings ...)
  2022-01-25  8:50 ` [RFC PATCH v2 8/9] RDMA/rxe: Enable RDMA FLUSH capability for rxe device Li Zhijian
@ 2022-01-25  8:50 ` Li Zhijian
  2022-01-25  8:57 ` [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Zhu Yanjun
  9 siblings, 0 replies; 12+ messages in thread
From: Li Zhijian @ 2022-01-25  8:50 UTC (permalink / raw)
  To: linux-rdma, zyjzyj2000, jgg, aharonl, leon, tom, tomasz.gromadzki
  Cc: linux-kernel, mbloch, liangwenpeng, yangx.jy, y-goto,
	rpearsonhpe, dan.j.williams, Li Zhijian

Although the SPEC says FLUSH is supported by the RC/RD/XRC services,
XRC has not been implemented by rxe yet.

So XRC FLUSH will not be supported until rxe implements the XRC
service.

Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
---
I have not set up an RD environment to test this protocol
---
 drivers/infiniband/sw/rxe/rxe_opcode.c | 20 ++++++++++++++++++++
 include/rdma/ib_pack.h                 |  1 +
 2 files changed, 21 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe_opcode.c b/drivers/infiniband/sw/rxe/rxe_opcode.c
index adea6c16dfb5..3d86129558f7 100644
--- a/drivers/infiniband/sw/rxe/rxe_opcode.c
+++ b/drivers/infiniband/sw/rxe/rxe_opcode.c
@@ -922,6 +922,26 @@ struct rxe_opcode_info rxe_opcode[RXE_NUM_OPCODE] = {
 					  RXE_RDETH_BYTES,
 		}
 	},
+	[IB_OPCODE_RD_RDMA_FLUSH]			= {
+		.name	= "IB_OPCODE_RD_RDMA_FLUSH",
+		.mask	= RXE_RDETH_MASK | RXE_FETH_MASK | RXE_RETH_MASK |
+			  RXE_FLUSH_MASK | RXE_START_MASK |
+			  RXE_END_MASK | RXE_REQ_MASK,
+		.length = RXE_BTH_BYTES + RXE_FETH_BYTES + RXE_RETH_BYTES,
+		.offset = {
+			[RXE_BTH]	= 0,
+			[RXE_RDETH]	= RXE_BTH_BYTES,
+			[RXE_FETH]	= RXE_BTH_BYTES +
+					  RXE_RDETH_BYTES,
+			[RXE_RETH]	= RXE_BTH_BYTES +
+					  RXE_RDETH_BYTES +
+					  RXE_FETH_BYTES,
+			[RXE_PAYLOAD]	= RXE_BTH_BYTES +
+					  RXE_RDETH_BYTES +
+					  RXE_FETH_BYTES +
+					  RXE_RETH_BYTES,
+		}
+	},
 
 	/* UD */
 	[IB_OPCODE_UD_SEND_ONLY]			= {
diff --git a/include/rdma/ib_pack.h b/include/rdma/ib_pack.h
index d19edb502de6..40568a33ead8 100644
--- a/include/rdma/ib_pack.h
+++ b/include/rdma/ib_pack.h
@@ -151,6 +151,7 @@ enum {
 	IB_OPCODE(RD, ATOMIC_ACKNOWLEDGE),
 	IB_OPCODE(RD, COMPARE_SWAP),
 	IB_OPCODE(RD, FETCH_ADD),
+	IB_OPCODE(RD, RDMA_FLUSH),
 
 	/* UD */
 	IB_OPCODE(UD, SEND_ONLY),
-- 
2.31.1





* Re: [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation
  2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
                   ` (8 preceding siblings ...)
  2022-01-25  8:50 ` [RFC PATCH v2 9/9] RDMA/rxe: Add RD FLUSH service support Li Zhijian
@ 2022-01-25  8:57 ` Zhu Yanjun
  2022-01-25 10:32   ` lizhijian
  9 siblings, 1 reply; 12+ messages in thread
From: Zhu Yanjun @ 2022-01-25  8:57 UTC (permalink / raw)
  To: Li Zhijian
  Cc: RDMA mailing list, Jason Gunthorpe, aharonl, Leon Romanovsky,
	tom, tomasz.gromadzki, LKML, mbloch, liangwenpeng, Xiao Yang,
	y-goto, Bob Pearson, dan.j.williams, Xiao Yang

On Tue, Jan 25, 2022 at 4:45 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote:
>
> [...]
>
> FLUSH packets carry a FLUSH extended transport header (FETH, see
> below) to specify the placement type and the selectivity level of the
> operation, and an RDMA extended transport header (RETH, see the base
> document's RETH definition) to specify the R_Key, VA and Length
> associated with this request, following the BTH in RC, the RDETH in
> RD and the XRCETH in XRC.

Thanks. Would you like to add some test cases for this RDMA FLUSH
operation to the latest rdma-core?

Thanks a lot.
Zhu Yanjun

>
> RC FLUSH:
> +----+------+------+
> |BTH | FETH | RETH |
> +----+------+------+
>
> RD FLUSH:
> +----+------+------+------+
> |BTH | RDETH| FETH | RETH |
> +----+------+------+------+
>
> XRC FLUSH:
> +----+-------+------+------+
> |BTH | XRCETH| FETH | RETH |
> +----+-------+------+------+
>
> Currently, we introduce RC and RD services only, since XRC has not been
> implemented by rxe yet.
> NOTE: only RC service is tested now, and since other HCAs have not
> added/implemented FLUSH yet, we can only test FLUSH operation in both
> SoftRoCE/rxe devices.
>
> The corresponding rdma-core and FLUSH example are available on:
> https://github.com/zhijianli88/rdma-core/tree/rfc
> Can access the kernel source in:
> https://github.com/zhijianli88/linux/tree/rdma-flush
>
> - We introduce an is_pmem attribute to the MR (memory region)
> - We introduce FLUSH placement type attributes to the HCA
> - We introduce FLUSH access flags that users are able to register with
> The figure below shows the valid access flags users can register:
> +------------------------+------------------+--------------+
> | HCA attributes         |      register access flags      |
> |        and             +------------------+--------------+
> | MR attribute(is_pmem)  |global visibility | persistence  |
> +------------------------+------------------+--------------+
> | global visibility(DRAM)|        O         |      X       |
> +------------------------+------------------+--------------+
> | global visibility(PMEM)|        O         |      X       |
> +------------------------+------------------+--------------+
> | persistence(DRAM)      |        X         |      X       |
> +------------------------+------------------+--------------+
> | persistence(PMEM)      |        X         |      O       |
> +------------------------+------------------+--------------+
> O: allowed to register such an access flag; X: not allowed
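
As a sketch of the registration rule encoded in the table above (the
flag names and bit values here are placeholders, not the UAPI proposed
by the series):

#include <linux/types.h>
#include <linux/errno.h>

/* Placeholder flag values, for illustration only. */
#define ACCESS_FLUSH_GLOBAL	(1U << 20)	/* global visibility */
#define ACCESS_FLUSH_PERSISTENT	(1U << 21)	/* persistence */

/* Per the table: global visibility may be registered on any MR,
 * but persistence may only be registered on a pmem-backed MR.
 */
static int check_flush_access(unsigned int access, bool is_pmem)
{
	if ((access & ACCESS_FLUSH_PERSISTENT) && !is_pmem)
		return -EINVAL;
	return 0;
}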
>
> In order to honor the placement guarantees, we currently reject a
> persistence flush request to non-pmem memory.
> The responder validates the remotely requested placement types against
> the registered access flags:
> +------------------------+------------------+--------------+
> |                        |        registered flags         |
> | remote requested types +------------------+--------------+
> |                        |global visibility | persistence  |
> +------------------------+------------------+--------------+
> | global visibility      |        O         |      X       |
> +------------------------+------------------+--------------+
> | persistence            |        X         |      O       |
> +------------------------+------------------+--------------+
> O: allowed to request such a placement type; X: not allowed
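
A corresponding sketch of the responder-side check, reusing the
placeholder ACCESS_FLUSH_* flags from the previous sketch: the
requested placement type bits must be a subset of what the target MR
was registered with.

/* Placeholder PLT bit values matching the FETH description below. */
#define PLT_GLOBAL_VISIBILITY	(1U << 0)
#define PLT_PERSISTENCE		(1U << 1)

static bool flush_request_allowed(unsigned int mr_access, unsigned int plt)
{
	unsigned int allowed = 0;

	if (mr_access & ACCESS_FLUSH_GLOBAL)
		allowed |= PLT_GLOBAL_VISIBILITY;
	if (mr_access & ACCESS_FLUSH_PERSISTENT)
		allowed |= PLT_PERSISTENCE;

	/* Reject any requested placement type the MR was not registered for. */
	return (plt & ~allowed) == 0;
}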
>
> Below are some details about the FLUSH transport packet:
>
> A FLUSH message consists of a single FLUSH request packet and is
> acknowledged by an RDMA READ response of zero size.
>
> oA19-2: FLUSH shall be a single packet message and shall have no payload.
> oA19-5: FLUSH BTH shall hold the Opcode = 0x1C
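
The opcode value from oA19-5 could be captured as a constant; the name
below is a placeholder, only the value 0x1C comes from the quoted rule:

/* oA19-5: FLUSH BTH opcode (placeholder constant name). */
#define RC_FLUSH_OPCODE	0x1c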
>
> FLUSH Extended Transport Header (FETH)
> +-----+-----------+------------------------+----------------------+
> |Bits |   31-6    |          5-4           |        3-0           |
> +-----+-----------+------------------------+----------------------+
> |     | Reserved  | Selectivity Level(SEL) | Placement Type(PLT)  |
> +-----+-----------+------------------------+----------------------+
>
> Selectivity Level (SEL) – defines the memory region scope the FLUSH
> should apply to. Values are as follows:
> • b'00 - Memory Region Range: FLUSH applies to all preceding memory
>          updates to the RETH range on this QP. All RETH fields shall be
>          valid in this selectivity mode. The RETH:DMALen field shall be
>          between zero and (2^31 - 1) bytes (inclusive).
> • b'01 - Memory Region: FLUSH applies to all preceding memory updates
>          to RETH.R_key on this QP. RETH:DMALen and RETH:VA shall be
>          ignored in this mode.
> • b'10 - Reserved.
> • b'11 - Reserved.
>
> Placement Type (PLT) – defines the memory placement guarantee of this
> FLUSH. Multiple bits may be set in this field. Values are as follows:
> • Bit 0, if set to '1', indicates that the FLUSH should guarantee
>   Global Visibility.
> • Bit 1, if set to '1', indicates that the FLUSH should guarantee
>   Persistence.
> • Bits 3:2 are reserved.
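
For illustration, a small C sketch of packing the FETH word as defined
above (the macro and function names are assumptions; only the bit
positions come from the figure):

#include <linux/types.h>
#include <asm/byteorder.h>

#define FETH_PLT_MASK	0x0000000fU	/* bits 3:0: placement type */
#define FETH_SEL_SHIFT	4
#define FETH_SEL_MASK	0x00000030U	/* bits 5:4: selectivity level */

#define SEL_MR_RANGE	0x0U	/* flush the RETH VA/DMALen range */
#define SEL_MR_WHOLE	0x1U	/* flush the whole R_Key region */

/* Build the 32-bit FETH from SEL and PLT; bits 31:6 stay reserved (0). */
static __be32 feth_pack(u32 sel, u32 plt)
{
	u32 feth = ((sel << FETH_SEL_SHIFT) & FETH_SEL_MASK) |
		   (plt & FETH_PLT_MASK);

	return cpu_to_be32(feth);
}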
>
> [1]: https://www.infinibandta.org/ibta-specification/ # login required
> [2]: https://www.infinibandta.org/wp-content/uploads/2021/08/IBTA-Overview-of-IBTA-Volume-1-Release-1.5-and-MPE-2021-08-17-Secure.pptx
>
> CC: yangx.jy@cn.fujitsu.com
> CC: y-goto@fujitsu.com
> CC: Jason Gunthorpe <jgg@ziepe.ca>
> CC: Zhu Yanjun <zyjzyj2000@gmail.com>
> CC: Leon Romanovsky <leon@kernel.org>
> CC: Bob Pearson <rpearsonhpe@gmail.com>
> CC: Mark Bloch <mbloch@nvidia.com>
> CC: Wenpeng Liang <liangwenpeng@huawei.com>
> CC: Aharon Landau <aharonl@nvidia.com>
> CC: Tom Talpey <tom@talpey.com>
> CC: "Gromadzki, Tomasz" <tomasz.gromadzki@intel.com>
> CC: Dan Williams <dan.j.williams@intel.com>
> CC: linux-rdma@vger.kernel.org
> CC: linux-kernel@vger.kernel.org
>
> V1:
> https://lore.kernel.org/lkml/050c3183-2fc6-03a1-eecd-258744750972@fujitsu.com/T/
> or https://github.com/zhijianli88/linux/tree/rdma-flush-rfcv1
>
> Changelog
> V2:
> https://github.com/zhijianli88/linux/tree/rdma-flush
> RDMA: mr: Introduce is_pmem
>   check 1st byte to avoid crossing page boundary
>   new scheme to check is_pmem # Dan
>
> RDMA: Allow registering MR with flush access flags
>   combine with [03/10] RDMA/rxe: Allow registering FLUSH flags for supported device only to this patch # Jason
>   split RDMA_FLUSH to 2 capabilities
>
> RDMA/rxe: Allow registering persistent flag for pmem MR only
>   update commit message, get rid of confusing ib_check_flush_access_flags() # Tom
>
> RDMA/rxe: Implement RC RDMA FLUSH service in requester side
>   extend flush to include length field. # Tom and Tomasz
>
> RDMA/rxe: Implement flush execution in responder side
>   adjust start for WHOLE MR level # Tom
>   don't support DMA mr for flush # Tom
>   check flush return value
>
> RDMA/rxe: Enable RDMA FLUSH capability for rxe device
>   adjust patch's order. move it here from [04/10]
>
> Li Zhijian (9):
>   RDMA: mr: Introduce is_pmem
>   RDMA: Allow registering MR with flush access flags
>   RDMA/rxe: Allow registering persistent flag for pmem MR only
>   RDMA/rxe: Implement RC RDMA FLUSH service in requester side
>   RDMA/rxe: Set BTH's SE to zero for FLUSH packet
>   RDMA/rxe: Implement flush execution in responder side
>   RDMA/rxe: Implement flush completion
>   RDMA/rxe: Enable RDMA FLUSH capability for rxe device
>   RDMA/rxe: Add RD FLUSH service support
>
>  drivers/infiniband/core/uverbs_cmd.c    |  17 +++
>  drivers/infiniband/sw/rxe/rxe_comp.c    |   4 +-
>  drivers/infiniband/sw/rxe/rxe_hdr.h     |  52 +++++++++
>  drivers/infiniband/sw/rxe/rxe_loc.h     |   2 +
>  drivers/infiniband/sw/rxe/rxe_mr.c      |  37 ++++++-
>  drivers/infiniband/sw/rxe/rxe_opcode.c  |  35 +++++++
>  drivers/infiniband/sw/rxe/rxe_opcode.h  |   3 +
>  drivers/infiniband/sw/rxe/rxe_param.h   |   4 +-
>  drivers/infiniband/sw/rxe/rxe_req.c     |  19 +++-
>  drivers/infiniband/sw/rxe/rxe_resp.c    | 133 +++++++++++++++++++++++-
>  include/rdma/ib_pack.h                  |   3 +
>  include/rdma/ib_verbs.h                 |  30 +++++-
>  include/uapi/rdma/ib_user_ioctl_verbs.h |   2 +
>  include/uapi/rdma/ib_user_verbs.h       |  19 ++++
>  include/uapi/rdma/rdma_user_rxe.h       |   7 ++
>  15 files changed, 355 insertions(+), 12 deletions(-)
>
> --
> 2.31.1
>
>
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation
  2022-01-25  8:57 ` [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Zhu Yanjun
@ 2022-01-25 10:32   ` lizhijian
  0 siblings, 0 replies; 12+ messages in thread
From: lizhijian @ 2022-01-25 10:32 UTC (permalink / raw)
  To: Zhu Yanjun, lizhijian
  Cc: RDMA mailing list, Jason Gunthorpe, aharonl, Leon Romanovsky,
	tom, tomasz.gromadzki, LKML, mbloch, liangwenpeng, yangx.jy,
	y-goto, Bob Pearson, dan.j.williams, yangx.jy



On 25/01/2022 16:57, Zhu Yanjun wrote:
> On Tue, Jan 25, 2022 at 4:45 PM Li Zhijian <lizhijian@cn.fujitsu.com> wrote:
>> [...]
> Thanks. Would you like to add some test cases for this RDMA FLUSH
> operation to the latest rdma-core?

Of course, they are on the way. Actually, I have a WIP PR for that:
https://github.com/linux-rdma/rdma-core/pull/1119

But some of that work cannot start until we have a more stable proposal
and APIs.

Thanks
Zhijian


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-01-25 10:35 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-25  8:50 [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Li Zhijian
2022-01-25  8:50 ` [RFC PATCH v2 1/9] RDMA: mr: Introduce is_pmem Li Zhijian
2022-01-25  8:50 ` [RFC PATCH v2 2/9] RDMA: Allow registering MR with flush access flags Li Zhijian
2022-01-25  8:50 ` [RFC PATCH v2 3/9] RDMA/rxe: Allow registering persistent flag for pmem MR only Li Zhijian
2022-01-25  8:50 ` [RFC PATCH v2 4/9] RDMA/rxe: Implement RC RDMA FLUSH service in requester side Li Zhijian
2022-01-25  8:50 ` [RFC PATCH v2 5/9] RDMA/rxe: Set BTH's SE to zero for FLUSH packet Li Zhijian
2022-01-25  8:50 ` [RFC PATCH v2 6/9] RDMA/rxe: Implement flush execution in responder side Li Zhijian
2022-01-25  8:50 ` [RFC PATCH v2 7/9] RDMA/rxe: Implement flush completion Li Zhijian
2022-01-25  8:50 ` [RFC PATCH v2 8/9] RDMA/rxe: Enable RDMA FLUSH capability for rxe device Li Zhijian
2022-01-25  8:50 ` [RFC PATCH v2 9/9] RDMA/rxe: Add RD FLUSH service support Li Zhijian
2022-01-25  8:57 ` [RFC PATCH v2 0/9] RDMA/rxe: Add RDMA FLUSH operation Zhu Yanjun
2022-01-25 10:32   ` lizhijian

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).