* [RFC rdma-core 1/2] Registering non-contiguous memory
@ 2018-01-04 17:47 Alex Margolin
       [not found] ` <1515088046-26605-1-git-send-email-alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Margolin @ 2018-01-04 17:47 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Alex Margolin

Numerous applications exchange buffers with non-contiguous memory layouts.
For example, HPC applications often operate on a matrix and need to send a single row or column:

```
               M
          -----------
         | |X| | | | |
          -----------
         | |X| | | | |
          -----------   N
         | |X| | | | |
          -----------
         | |X| | | | |
          -----------
```

There are two alternatives to send the cells marked with an 'X' using contiguous buffers:

1. Create a list of N scatter-gather entries, each with the length of a single cell, and pass them inside an ibv_send_wr to ibv_post_send().
2. Create a temporary contiguous buffer to hold all N marked cells, copy each cell to its respective location in this buffer, and pass this buffer to ibv_post_send().

Both alternatives require additional memory resources, linear in N, in order to send the desired memory layout. Non-contiguous memory registration addresses this issue by allowing a compact description of a memory layout to be passed to send/recv operations. In this example, the registered memory description would include the base pointer of the first cell, the matrix dimensions (M and N) and the size of a single cell.
Another use-case for non-contiguous memory access is when more than one memory region holds the data and the request may span across multiple MRs:

```
                     ----------
                    |          |
                    |          |
                    | Memory   |
                   /| region #1|
     "Composite   / |          |
       region"   /  |          |
     ---------- /  -|          |
    |          |  / |          |
    |          | /   ----------
     ---------- <
    |          | \   ---------- 
    |          |  --|          |
    |          |    | Memory   |
    |          | ___| region #2|
     ---------- <   |          |
    |          | \   ----------
     ----------\  \_ ----------
                \   |          |
                 \  | Memory   |
                  \-| region #N|
                    |          |
                    |          |
                     ----------
```

Similarly, sending such a layout would otherwise require specifying all N memory keys in every ibv_post_send() invocation; with a composite registration, the keys are listed once in advance and each operation includes only a base pointer and a length.

The key to handling non-contiguous memory layouts at low latency is the ability to describe them in the data path. This means the API has to allow user-level registration of such layouts. To this end, this API extends the Memory Region API, letting the user dynamically assign a non-contiguous layout description to an MR.


Alex Margolin (1):
  verbs: Introduce non-contiguous memory registration

 libibverbs/man/ibv_rereg_mr.3             |   2 +
 libibverbs/man/ibv_rereg_mr_interleaved.3 | 260 ++++++++++++++++++++++++++++++
 libibverbs/man/ibv_rereg_mr_sg.3          | 181 +++++++++++++++++++++
 libibverbs/verbs.h                        |  75 ++++++++-
 4 files changed, 517 insertions(+), 1 deletion(-)
 create mode 100644 libibverbs/man/ibv_rereg_mr_interleaved.3
 create mode 100644 libibverbs/man/ibv_rereg_mr_sg.3

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found] ` <1515088046-26605-1-git-send-email-alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-01-04 17:47   ` Alex Margolin
       [not found]     ` <1515088046-26605-2-git-send-email-alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Margolin @ 2018-01-04 17:47 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Alex Margolin

Signed-off-by: Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 libibverbs/man/ibv_mr_set_layout_interleaved.3 | 232 +++++++++++++++++++++++++
 libibverbs/man/ibv_mr_set_layout_sg.3          | 153 ++++++++++++++++
 libibverbs/man/ibv_reg_mr.3                    |   2 +
 libibverbs/man/ibv_rereg_mr.3                  |   2 +
 libibverbs/verbs.h                             |  80 +++++++++
 5 files changed, 469 insertions(+)
 create mode 100644 libibverbs/man/ibv_mr_set_layout_interleaved.3
 create mode 100644 libibverbs/man/ibv_mr_set_layout_sg.3

diff --git a/libibverbs/man/ibv_mr_set_layout_interleaved.3 b/libibverbs/man/ibv_mr_set_layout_interleaved.3
new file mode 100644
index 0000000..93f5768
--- /dev/null
+++ b/libibverbs/man/ibv_mr_set_layout_interleaved.3
@@ -0,0 +1,232 @@
+.\" -*- nroff -*-
+.\" Licensed under the OpenIB.org BSD license (FreeBSD Variant) - See COPYING.md
+.\"
+.TH IBV_MR_SET_LAYOUT_INTERLEAVED 3 2016-03-13 libibverbs "Libibverbs Programmer's Manual"
+.SH "NAME"
+ibv_mr_set_layout_interleaved \- register an interleaved (non-contiguous) memory region (MR)
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/verbs.h>
+.sp
+.BI "int ibv_mr_set_layout_interleaved(struct ibv_mr " "*mr" ", int " "flags" ", int " "num_interleaved",
+.BI "                             struct ibv_mr_layout_interleaved * " "interleaved_list");
+.fi
+.SH "DESCRIPTION"
+The
+.B ibv_mr_set_layout_interleaved()
+function registers a non-contiguous memory layout to the given memory region (MR).
+Such memory layout is described by a repeating pattern of contiguous ranges
+within one or more MRs. Once this registration is valid, a send or receive operation
+can alternate between those MRs by using a single local or remote key.
+.PP
+.I mr\fR
+is the result of a successful call to ibv_reg_mr(), and will be bound to the new
+memory layout. Creating an MR strictly for non-contiguous registration can be
+expedited by requesting zero length in ibv_reg_mr(). The same MR can be reused
+for multiple calls, each overriding the previous layout.
+.PP
+.I flags\fR
+is a bit-mask of optional modifiers. Flags should be a combination (bit field) of:
+.PP
+.br
+.B IBV_MR_SET_LAYOUT_AVOID_INVALIDATION \fR Prevent MR key invalidation (see Notes).
+.PP
+.I num_interleaved\fR
+is the size of the array describing the memory layout.
+.PP
+The argument
+.I interleaved_list\fR
+is an array of struct ibv_mr_layout_interleaved entries, as defined in
+<infiniband/verbs.h>. Each entry refers to a pattern of items (datums), and the
+resulting MR takes one datum from each entry in a round-robin fashion. The MRs in
+interleaved_list may themselves be non-contiguous, as a result of previous calls
+to ibv_mr_set_layout_sg() or ibv_mr_set_layout_interleaved() on them. This creates
+a nested definition of a non-contiguous memory layout, supported up to the nesting
+level stated in max_mr_nesting_level inside struct ibv_mr_layout_caps.
+.PP
+Each entry describes a pattern as follows:
+.nf
+
+struct ibv_mr_layout_interleaved {
+.in +8
+struct ibv_sge                                first_datum;    /* description of the first single item */
+int                                           num_repeated;   /* number of times to repeat this struct */
+int                                           num_dimensions; /* size of the dimensions array */
+struct ibv_mr_layout_interleved_dimensions    *dims;
+.in -8
+}
+.fi
+.PP
+In case
+.I num_repeated\fR > 1
+(only supported if IBV_MR_SET_LAYOUT_INTERLEAVED_REPEAT appears in cap_flags
+inside struct ibv_mr_layout_caps), this entry is visited that many times
+consecutively on each round-robin cycle. This is equivalent to duplicating an
+entry in the array. If IBV_MR_SET_LAYOUT_INTERLEAVED_NONUNIFORM_REPEAT also
+appears,
+.I num_repeated\fR
+may vary between entries.
+.PP
+.I num_dimensions\fR
+determines the length of the following
+.I dims\fR
+array, and is intended for multi-dimensional data structures such as matrices.
+For example, a column in a 3D matrix could be described with num_dimensions=2.
+.nf
+
+struct ibv_mr_layout_interleved_dimensions {
+.in +8
+uint64_t                               offset_stride;  /* Distance between two consecutive item base pointers */
+uint64_t                               datum_count;    /* Number of consecutive items */
+.in -8
+}
+.fi
+.PP
+Each dimension contains
+.I offset_stride\fR
+, which is the distance between the starts of two consecutive datums, and
+.I datum_count\fR
+, which is the number of datums for this dimension.
+In the typical case,
+.I datum_count\fR
+would be the number of items in that MR, and
+.I offset_stride\fR
+would be the size of each item plus the gap to the next item. In multi-dimensional
+cases, the second dimension describes how many times the first dimension repeats,
+and how far apart two such repetitions are.
+.PP
+After a successful call, the new MR has to be bound before it can be used.
+A call to ibv_post_send() with the opcode IBV_WR_BIND_MR binds the MR
+(usable after WR completion, or in subsequent WRs on the same QP).
+.PP
+To clarify the pattern description, below is the pseudo-code for reading a pattern
+in a simple single-dimension case:
+.nf
+
+        foreach(entry in interleaved_list):
+                foreach(i from 0 to num_repeated):
+                        Read from entry.first_datum.addr (entry.first_datum.length Bytes)
+                        entry.first_datum.addr += entry.dims[0].offset_stride
+.fi
+.SH "RETURN VALUE"
+.B ibv_mr_set_layout_interleaved()
+returns 0 on success, otherwise an error has occurred,
+.I enum ibv_mr_set_layout_err_code\fR
+represents the error as listed below:
+.br
+IBV_MR_SET_LAYOUT_ERR_INPUT - Old MR is valid, an input error was detected by libibverbs.
+.br
+IBV_MR_SET_LAYOUT_ERR_WOULD_INVALIDATE - MR requires invalidation, but IBV_MR_SET_LAYOUT_AVOID_INVALIDATION was given.
+.br
+IBV_MR_SET_LAYOUT_ERR_UNSUPPORTED - Input requires a capability not supported (see
+.I struct ibv_mr_layout_caps\fR).
+.SH "EXAMPLES"
+The following code example demonstrates non-contiguous memory registration,
+along with the WR-based completion semantic. This example swaps the odd-indexed
+items with the even-indexed ones when sending (without actually changing
+memory contents):
+.PP
+.nf
+contig_mr = ibv_reg_mr(pd, addr, item_len * 100, 0);
+if (!contig_mr) {
+        fprintf(stderr, "Failed to create contiguous MR\en");
+        return 1;
+}
+
+noncontig_mr = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED);
+if (!noncontig_mr) {
+        fprintf(stderr, "Failed to create non-contiguous MR\en");
+        return 1;
+}
+
+struct ibv_mr_layout_interleved_dimensions mr_ilv_dim =
+{
+        .offset_stride = 2 * item_len, /* after item[x] take item[x+2] */
+        .datum_count = 50
+};
+
+struct ibv_mr_layout_interleaved mr_ilv[2] =
+{
+        {
+                .first_datum =
+                {
+                        .addr = (uint64_t)addr + item_len, /* start with item[1] */
+                        .length = item_len,
+                        .lkey = contig_mr->lkey
+                },
+                .num_repeated = 1,
+                .num_dimensions = 1,
+                .dims = &mr_ilv_dim
+        },
+        {
+                .first_datum =
+                {
+                        .addr = (uint64_t)addr, /* start with item[0] */
+                        .length = item_len,
+                        .lkey = contig_mr->lkey
+                },
+                .num_repeated = 1,
+                .num_dimensions = 1,
+                .dims = &mr_ilv_dim
+        },
+};
+
+ret = ibv_mr_set_layout_interleaved(noncontig_mr, 0, 2, mr_ilv);
+if (ret) {
+        fprintf(stderr, "Non-contiguous registration failed\en");
+        return 1;
+}
+
+struct ibv_sge interleaved =
+{
+        .addr = 0,
+        .length = item_len * 100,
+        .lkey = noncontig_mr->lkey
+};
+
+struct ibv_send_wr send_wr = {
+        .opcode = IBV_WR_SEND,
+        .num_sge = 1,
+        .sg_list = &interleaved,
+        .flags = 0
+};
+
+ret = ibv_post_send(qp, &send_wr, &bad_wr);
+if (ret) {
+        fprintf(stderr, "Non-contiguous send failed\en");
+        return 1;
+}
+
+.PP
+.SH "NOTES"
+There are two alternatives for completion semantics: registration is valid on
+function return (default), or upon completion of a user-initiated WR with the
+opcode IBV_WR_BIND_MR and the MR passed in struct bind_mr inside struct ibv_send_wr.
+In order to select the latter, flags should include IBV_MR_SET_LAYOUT_WITH_POST_WR.
+In this case, a user may post send/receive WRs on this MR right after the bind WR
+on the same QP, and it is guaranteed to be processed correctly.
+.PP
+Storing the layout may require additional space, causing an internal
+re-initialization of the MR (at some latency cost) and the invalidation of
+previous local and remote keys. Using the same
+.I num_interleaved\fR
+and the same
+.I num_repeated\fR
+would prevent resizing. Alternatively, passing IBV_MR_SET_LAYOUT_AVOID_INVALIDATION would
+cause the call to fail if a resize would be required.
+.PP
+Even upon failure, the user is still required to call ibv_dereg_mr() on this MR.
+Also, MRs must be deregistered in the reverse order of their registration.
+.SH "SEE ALSO"
+.BR ibv_reg_mr (3),
+.BR ibv_mr_set_layout_sg (3),
+.BR ibv_mr_set_layout_interleaved (3),
+.BR ibv_dereg_mr (3),
+.SH "AUTHORS"
+.TP
+Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
+.TP
+Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
+.TP
+Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
diff --git a/libibverbs/man/ibv_mr_set_layout_sg.3 b/libibverbs/man/ibv_mr_set_layout_sg.3
new file mode 100644
index 0000000..22fa03c
--- /dev/null
+++ b/libibverbs/man/ibv_mr_set_layout_sg.3
@@ -0,0 +1,153 @@
+.\" -*- nroff -*-
+.\" Licensed under the OpenIB.org BSD license (FreeBSD Variant) - See COPYING.md
+.\"
+.TH IBV_MR_SET_LAYOUT_SG 3 2016-03-13 libibverbs "Libibverbs Programmer's Manual"
+.SH "NAME"
+ibv_mr_set_layout_sg \- register a non-contiguous memory region (MR)
+.SH "SYNOPSIS"
+.nf
+.B #include <infiniband/verbs.h>
+.sp
+.BI "int ibv_mr_set_layout_sg(struct ibv_mr " "*mr" ", int " "flags" ",
+.BI "                    int " "num_sge" ", struct ibv_sge * " "sg_list");
+.fi
+.SH "DESCRIPTION"
+The
+.B ibv_mr_set_layout_sg()
+function registers a non-contiguous memory layout to the given memory region (MR).
+Such memory layout is described by a list of contiguous ranges
+within other MRs. Once this registration is valid, a send or receive operation
+can span across that list of MRs by using a single local or remote key.
+.PP
+.I mr\fR
+is the result of a successful call to ibv_reg_mr(), and will be bound to the new
+memory layout. Creating an MR strictly for non-contiguous registration can be
+expedited by requesting zero length in ibv_reg_mr(). The same MR can be reused
+for multiple calls, each overriding the previous layout.
+.PP
+.I flags\fR
+is a bit-mask of optional modifiers. Flags should be a combination (bit field) of:
+.PP
+.br
+.B IBV_MR_SET_LAYOUT_AVOID_INVALIDATION \fR Prevent MR key invalidation (see Notes).
+.PP
+.I num_sge\fR
+is the size of the s/g array describing the memory layout.
+.PP
+The argument
+.I sg_list\fR
+is an array of struct ibv_sge entries, as defined in <infiniband/verbs.h>. Each
+entry refers to a buffer, described by its MR (local key), length and either a
+pointer or an offset, depending on whether the MR is "zero-based". The MRs passed
+in sg_list may themselves be non-contiguous, as a result of previous calls to
+ibv_mr_set_layout_sg() or ibv_mr_set_layout_interleaved() on them.
+This creates a nested definition of a non-contiguous memory layout, supported
+up to the nesting level stated in max_mr_nesting_level inside struct
+ibv_mr_layout_caps.
+.PP
+.SH "RETURN VALUE"
+.B ibv_mr_set_layout_sg()
+returns 0 on success, otherwise an error has occurred,
+.I enum ibv_mr_set_layout_err_code\fR
+represents the error as listed below:
+.br
+IBV_MR_SET_LAYOUT_ERR_INPUT - Old MR is valid, an input error was detected by libibverbs.
+.br
+IBV_MR_SET_LAYOUT_ERR_WOULD_INVALIDATE - MR requires invalidation, but IBV_MR_SET_LAYOUT_AVOID_INVALIDATION was given.
+.br
+IBV_MR_SET_LAYOUT_ERR_UNSUPPORTED - Input requires a capability not supported (see
+.I struct ibv_mr_layout_caps\fR).
+.SH "EXAMPLES"
+The following code example demonstrates non-contiguous memory registration,
+by combining two contiguous regions, along with the WR-based completion semantic:
+.PP
+.nf
+mr1 = ibv_reg_mr(pd, addr1, len1, 0);
+if (!mr1) {
+        fprintf(stderr, "Failed to create MR #1\en");
+        return 1;
+}
+
+mr2 = ibv_reg_mr(pd, addr2, len2, 0);
+if (!mr2) {
+        fprintf(stderr, "Failed to create MR #2\en");
+        return 1;
+}
+
+mr3 = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED);
+if (!mr3) {
+        fprintf(stderr, "Failed to create result MR\en");
+        return 1;
+}
+
+struct ibv_sge composite[] =
+{
+        {
+                .addr = addr1,
+                .length = len1,
+                .lkey = mr1->lkey
+        },
+        {
+                .addr = addr2,
+                .length = len2,
+                .lkey = mr2->lkey
+        }
+};
+
+ret = ibv_mr_set_layout_sg(mr3, 0, 2, composite);
+if (ret) {
+        fprintf(stderr, "Non-contiguous registration failed\en");
+        return 1;
+}
+
+struct ibv_sge non_contig =
+{
+        .addr = 0,
+        .length = len1 + len2,
+        .lkey = mr3->lkey
+};
+
+struct ibv_send_wr send_wr = {
+        .opcode = IBV_WR_SEND,
+        .num_sge = 1,
+        .sg_list = &non_contig,
+        .flags = 0
+};
+
+ret = ibv_post_send(qp, &send_wr, &bad_wr);
+if (ret) {
+        fprintf(stderr, "Non-contiguous send failed\en");
+        return 1;
+}
+
+.PP
+.SH "NOTES"
+There are two alternatives for completion semantics: registration is valid on
+function return (default), or upon completion of a user-initiated WR with the
+opcode IBV_WR_BIND_MR and the MR passed in struct bind_mr inside struct ibv_send_wr.
+In order to select the latter, flags should include IBV_MR_SET_LAYOUT_WITH_POST_WR.
+In this case, a user may post send/receive WRs on this MR right after the bind WR
+on the same QP, and it is guaranteed to be processed correctly.
+.PP
+Storing the layout may require additional space, causing an internal
+re-initialization of the MR (at some latency cost) and the invalidation of
+previous local and remote keys. Using the same
+.I num_sge\fR
+would prevent resizing. Alternatively, passing IBV_MR_SET_LAYOUT_AVOID_INVALIDATION would
+cause the call to fail if a resize would be required.
+.PP
+Even upon failure, the user is still required to call ibv_dereg_mr() on this MR.
+Also, MRs must be deregistered in the reverse order of their registration.
+.SH "SEE ALSO"
+.BR ibv_reg_mr (3),
+.BR ibv_mr_set_layout_sg (3),
+.BR ibv_mr_set_layout_interleaved (3),
+.BR ibv_dereg_mr (3),
+.SH "AUTHORS"
+.TP
+Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
+.TP
+Yishai Hadas <yishaih-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
+.TP
+Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
diff --git a/libibverbs/man/ibv_reg_mr.3 b/libibverbs/man/ibv_reg_mr.3
index d3f09c0..506c3a1 100644
--- a/libibverbs/man/ibv_reg_mr.3
+++ b/libibverbs/man/ibv_reg_mr.3
@@ -74,6 +74,8 @@ fails if any memory window is still bound to this MR.
 .BR ibv_post_send (3),
 .BR ibv_post_recv (3),
 .BR ibv_post_srq_recv (3)
+.BR ibv_mr_set_layout_sg (3),
+.BR ibv_mr_set_layout_interleaved (3),
 .SH "AUTHORS"
 .TP
 Dotan Barak <dotanba-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
diff --git a/libibverbs/man/ibv_rereg_mr.3 b/libibverbs/man/ibv_rereg_mr.3
index 9fa567c..c21ef06 100644
--- a/libibverbs/man/ibv_rereg_mr.3
+++ b/libibverbs/man/ibv_rereg_mr.3
@@ -69,6 +69,8 @@ IBV_REREG_MR_ERR_CMD_AND_DO_FORK_NEW - MR shouldn't be used, command error, inva
 Even on a failure, the user still needs to call ibv_dereg_mr on this MR.
 .SH "SEE ALSO"
 .BR ibv_reg_mr (3),
+.BR ibv_mr_set_layout_sg (3),
+.BR ibv_mr_set_layout_interleaved (3),
 .BR ibv_dereg_mr (3),
 .SH "AUTHORS"
 .TP
diff --git a/libibverbs/verbs.h b/libibverbs/verbs.h
index 7b53a6f..8903db8 100644
--- a/libibverbs/verbs.h
+++ b/libibverbs/verbs.h
@@ -208,6 +208,24 @@ struct ibv_tso_caps {
 	uint32_t supported_qpts;
 };
 
+enum ibv_mr_layout_cap_flags {
+	IBV_MR_SET_LAYOUT_SG					= 1 << 0,
+	IBV_MR_SET_LAYOUT_INTERLEAVED			= 1 << 1,
+	IBV_MR_SET_LAYOUT_INTERLEAVED_REPEAT			= 1 << 2,
+	IBV_MR_SET_LAYOUT_INTERLEAVED_NONUNIFORM_REPEAT	= 1 << 3,
+	IBV_MR_SET_LAYOUT_INTERLEAVED_NONUNIFORM_DATUM_TOTAL = 1 << 4,
+};
+
+struct ibv_mr_layout_caps {
+	uint64_t cap_flags;
+	uint32_t max_num_sg;
+	uint32_t max_inline_num_sg;
+	uint32_t max_num_interleaved;
+	uint32_t max_inline_num_interleaved;
+	uint32_t max_mr_stride_dimenson;
+	uint32_t max_mr_nesting_level;
+};
+
 /* RX Hash function flags */
 enum ibv_rx_hash_function_flags {
 	IBV_RX_HASH_FUNC_TOEPLITZ	= 1 << 0,
@@ -290,6 +308,7 @@ struct ibv_device_attr_ex {
 	uint32_t		raw_packet_caps; /* Use ibv_raw_packet_caps */
 	struct ibv_tm_caps	tm_caps;
 	struct ibv_cq_moderation_caps  cq_mod_caps;
+	struct ibv_mr_layout_caps mr_layout_caps;
 };
 
 enum ibv_mtu {
@@ -564,6 +583,12 @@ enum ibv_rereg_mr_flags {
 	IBV_REREG_MR_FLAGS_SUPPORTED	= ((IBV_REREG_MR_KEEP_VALID << 1) - 1)
 };
 
+enum ibv_mr_set_layout_flags {
+	IBV_MR_SET_LAYOUT_WITH_POST_WR		= (1 << 0),
+	IBV_MR_SET_LAYOUT_AVOID_INVALIDATION		= (1 << 1),
+	IBV_MR_SET_LAYOUT_FLAGS_SUPPORTED	= ((IBV_MR_SET_LAYOUT_AVOID_INVALIDATION << 1) - 1)
+};
+
 struct ibv_mr {
 	struct ibv_context     *context;
 	struct ibv_pd	       *pd;
@@ -1033,6 +1058,9 @@ struct ibv_send_wr {
 			uint16_t		hdr_sz;
 			uint16_t		mss;
 		} tso;
+		struct {
+			struct ibv_mr *mr;
+		} mr_set_layout;
 	};
 };
 
@@ -1634,8 +1662,38 @@ struct ibv_values_ex {
 	struct timespec raw_clock;
 };
 
+struct ibv_mr_layout_interleved_dimensions {
+	uint64_t offset_stride;
+	uint64_t datum_count;
+};
+
+struct ibv_mr_layout_interleaved {
+	struct ibv_sge				   first_datum;
+	int					   num_repeated;
+	int 					   num_dimensions;
+	struct ibv_mr_layout_interleved_dimensions *dims;
+};
+
+enum verbs_context_mask {
+	VERBS_CONTEXT_XRCD	= 1 << 0,
+	VERBS_CONTEXT_SRQ	= 1 << 1,
+	VERBS_CONTEXT_QP	= 1 << 2,
+	VERBS_CONTEXT_CREATE_FLOW = 1 << 3,
+	VERBS_CONTEXT_DESTROY_FLOW = 1 << 4,
+	VERBS_CONTEXT_REREG_MR	= 1 << 5,
+	VERBS_CONTEXT_RESERVED	= 1 << 6
+};
+
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	int (*mr_set_layout_sg)(struct ibv_mr* mr,
+				int flags,
+				int num_sge,
+				struct ibv_sge *sg_list);
+	int (*mr_set_layout_interleaved)(struct ibv_mr* mr,
+					 int flags,
+					 int num_interleaved,
+					 struct ibv_mr_layout_interleaved *interleaved_list);
 	int (*modify_cq)(struct ibv_cq *cq, struct ibv_modify_cq_attr *attr);
 	int (*post_srq_ops)(struct ibv_srq *srq,
 			    struct ibv_ops_wr *op,
@@ -1878,6 +1936,28 @@ int ibv_rereg_mr(struct ibv_mr *mr, int flags,
  */
 int ibv_dereg_mr(struct ibv_mr *mr);
 
+enum ibv_mr_set_layout_err_code {
+	/* Old MR is valid, invalid input */
+	IBV_MR_SET_LAYOUT_ERR_INPUT = -1,
+	/* MR requires invalidation, but IBV_MR_SET_LAYOUT_AVOID_INVALIDATION is on */
+	IBV_MR_SET_LAYOUT_ERR_WOULD_INVALIDATE = -2,
+	/* Input valid, but the capability is unsupported (see ibv_mr_layout_caps) */
+	IBV_MR_SET_LAYOUT_ERR_UNSUPPORTED = -3,
+};
+
+/**
+ * ibv_mr_set_layout_sg - Register several memory regions as one.
+ */
+int ibv_mr_set_layout_sg(struct ibv_mr* mr, int flags,
+			 int num_sge,
+			 struct ibv_sge *sg_list);
+/**
+ * ibv_mr_set_layout_interleaved - Register several interleaving memory regions as one.
+ */
+int ibv_mr_set_layout_interleaved(struct ibv_mr* mr, int flags,
+				  int num_interleaved,
+				  struct ibv_mr_layout_interleaved *interleaved_list);
+
 /**
  * ibv_alloc_mw - Allocate a memory window
  */
-- 
1.8.3.1



* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]     ` <1515088046-26605-2-git-send-email-alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-01-11 12:22       ` Yuval Shaia
  2018-01-11 16:44         ` Jason Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Yuval Shaia @ 2018-01-11 12:22 UTC (permalink / raw)
  To: Alex Margolin; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

> +The following code example demonstrates non-contiguous memory registration,
> +by combining two contiguous regions, along with the WR-based completion semantic:
> +.PP
> +.nf
> +mr1 = ibv_reg_mr(pd, addr1, len1, 0);
> +if (!mr1) {
> +        fprintf(stderr, "Failed to create MR #1\en");
> +        return 1;
> +}
> +
> +mr2 = ibv_reg_mr(pd, addr2, len2, 0);
> +if (!mr2) {
> +        fprintf(stderr, "Failed to create MR #2\en");
> +        return 1;
> +}

So, to register non-contiguous 512 random buffers i would have to create
512 MRs?

> +
> +mr3 = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED);
> +if (!mr3) {
> +        fprintf(stderr, "Failed to create result MR\en");
> +        return 1;
> +}
> +
> +struct ibv_sge composite[] =
> +{
> +        {
> +                .addr = addr1,
> +                .length = len1,
> +                .lkey = mr1->lkey
> +        },
> +        {
> +                .addr = addr2,
> +                .length = len2,
> +                .lkey = mr2->lkey
> +        }
> +};
> +
> +ret = ibv_mr_set_layout_sg(mr3, 0, 2, composite);
> +if (ret) {
> +        fprintf(stderr, "Non-contiguous registration failed\en");
> +        return 1;
> +}
> +
> +struct ibv_sge non_contig =
> +{
> +        .addr = 0,
> +        .length = len1 + len2,
> +        .lkey = mr3->lkey
> +};
> +
> +struct ibv_send_wr send_wr = {
> +        .opcode = IBV_WR_SEND,
> +        .num_sge = 1,
> +        .sg_list = non_contig,
> +        .flags = 0
> +};
> +
> +ret = ibv_post_send(qp, send_wr, &bad_wr);
> +if (ret) {
> +        fprintf(stderr, "Non-contiguous send failed\en");
> +        return 1;
> +}


* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
  2018-01-11 12:22       ` Yuval Shaia
@ 2018-01-11 16:44         ` Jason Gunthorpe
       [not found]           ` <20180111164455.GA1309-uk2M96/98Pc@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2018-01-11 16:44 UTC (permalink / raw)
  To: Yuval Shaia; +Cc: Alex Margolin, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jan 11, 2018 at 02:22:07PM +0200, Yuval Shaia wrote:
> > +The following code example demonstrates non-contiguous memory registration,
> > +by combining two contiguous regions, along with the WR-based completion semantic:
> > +.PP
> > +.nf
> > +mr1 = ibv_reg_mr(pd, addr1, len1, 0);
> > +if (!mr1) {
> > +        fprintf(stderr, "Failed to create MR #1\en");
> > +        return 1;
> > +}
> > +
> > +mr2 = ibv_reg_mr(pd, addr2, len2, 0);
> > +if (!mr2) {
> > +        fprintf(stderr, "Failed to create MR #2\en");
> > +        return 1;
> > +}
> 
> So, to register non-contiguous 512 random buffers i would have to create
> 512 MRs?

That is a fair point - I wonder if some of these API should have an
option to accept a pointer directly? Maybe the driver requires an MR
but we don't need that in the API?

Particularly the _sg one..

Jason


* RE: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]           ` <20180111164455.GA1309-uk2M96/98Pc@public.gmane.org>
@ 2018-01-22 15:59             ` Alex Margolin
       [not found]               ` <VI1PR05MB1278C4C4FF78B4B1A551252EB9EC0-79XLn2atqDMOK6E67s+DINqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Alex Margolin @ 2018-01-22 15:59 UTC (permalink / raw)
  To: Jason Gunthorpe, Yuval Shaia; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

> -----Original Message-----
> From: Jason Gunthorpe
> Sent: Thursday, January 11, 2018 6:45 PM
> To: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> Cc: Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory
> registration
> 
> On Thu, Jan 11, 2018 at 02:22:07PM +0200, Yuval Shaia wrote:
> > > +The following code example demonstrates non-contiguous memory
> > > +registration, by combining two contiguous regions, along with the
> WR-based completion semantic:
> > > +.PP
> > > +.nf
> > > +mr1 = ibv_reg_mr(pd, addr1, len1, 0); if (!mr1) {
> > > +        fprintf(stderr, "Failed to create MR #1\en");
> > > +        return 1;
> > > +}
> > > +
> > > +mr2 = ibv_reg_mr(pd, addr2, len2, 0); if (!mr2) {
> > > +        fprintf(stderr, "Failed to create MR #2\en");
> > > +        return 1;
> > > +}
> >
> > So, to register non-contiguous 512 random buffers i would have to
> > create
> > 512 MRs?


I think typically, if you have a large number of buffers, they would be located in fairly close proximity, so you'd prefer one MR to cover all of them and the SGEs would only differ in base address.

Are you proposing the function also replaces ibv_reg_mr() if the user passes multiple unregistered regions?
I could see the benefit, but then we'd require additional parameters (i.e. send_flags) and those MRs couldn't be reused (otherwise need to add output pointers for resulting MRs).
The benefit will probably not be latency, though, since IIRC the MR creation can't really be parallelized.
Yuval - are you aware of a scenario involving a large number of ibv_reg_mr() calls?

> 
> That is a fair point - I wonder if some of these API should have an
> option to accept a pointer directly? Maybe the driver requires a MR but
> we don't need that as an the API?
> 
> Particularly the _sg one..
> 
> Jason


* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]               ` <VI1PR05MB1278C4C4FF78B4B1A551252EB9EC0-79XLn2atqDMOK6E67s+DINqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-01-23 20:29                 ` Yuval Shaia
  2018-01-25 12:43                   ` Alex Margolin
  2018-01-25 13:10                   ` Alex Margolin
  0 siblings, 2 replies; 18+ messages in thread
From: Yuval Shaia @ 2018-01-23 20:29 UTC (permalink / raw)
  To: Alex Margolin, Marcel Apfelbaum
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, Jan 22, 2018 at 03:59:51PM +0000, Alex Margolin wrote:
> > -----Original Message-----
> > From: Jason Gunthorpe
> > Sent: Thursday, January 11, 2018 6:45 PM
> > To: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > Cc: Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory
> > registration
> > 
> > On Thu, Jan 11, 2018 at 02:22:07PM +0200, Yuval Shaia wrote:
> > > > +The following code example demonstrates non-contiguous memory
> > > > +registration, by combining two contiguous regions, along with the
> > WR-based completion semantic:
> > > > +.PP
> > > > +.nf
> > > > +mr1 = ibv_reg_mr(pd, addr1, len1, 0); if (!mr1) {
> > > > +        fprintf(stderr, "Failed to create MR #1\en");
> > > > +        return 1;
> > > > +}
> > > > +
> > > > +mr2 = ibv_reg_mr(pd, addr2, len2, 0); if (!mr2) {
> > > > +        fprintf(stderr, "Failed to create MR #2\en");
> > > > +        return 1;
> > > > +}
> > >
> > > So, to register non-contiguous 512 random buffers i would have to
> > > create
> > > 512 MRs?
> 
> 
> I think typically if you have a large amount of buffers - it would be located in fairly close proximity, so you'd prefer one MR to cover all of them and the SGEs will only differ in base address.

Define "large amount".
I did several experiments with something like a hundred or a few hundred
buffers (Marcel, do you remember how many?) and they were scattered across a
range of about 3 GB, so one MR is not an option. Our application is QEMU, so
a 3 GB MR means no memory overcommit.

> 
> Are you proposing the function also replaces ibv_reg_mr() if the user passes multiple unregistered regions?
> I could see the benefit, but then we'd require additional parameters (i.e. send_flags) and those MRs couldn't be reused (otherwise need to add output pointers for resulting MRs).

Yeah, more or less the same as ib_reg_mr(), but one that gets a list of pages
instead of a virtual address, skips the "while (npages)" loop in
ib_umem_get(), and goes directly to dma_map_sg(). The idea is that the HW
already supports a scattered list of buffers, so why limit the API to a
contiguous virtual address range.

We dropped this idea as it turns out that we need extra help from the HW in
the post_send phase, where the virtual address received in the SGE refers to
the virtual address given at ib_reg_mr().
We believed that a zero-based MR might solve this, perhaps by allowing
addresses in the SGE to be something like an index into the page list
given to ib_reg_mr(), but apparently zero-based MRs are not yet functional
(at least not on CX3).
(We lack knowledge of what exactly a zero-based MR is.)
 
> The benefit will probably not be latency, though, since IIRC the MR creation can't really be parallelized.
> Yuval - are you aware of a scenario implementing a high amount of ibv_reg_mr() calls?

A high number of ibv_reg_mr() calls, no, but I have a scenario where my
application can potentially receive a request to create an MR for 262144
scattered pages.
By the way, even with the API Jason suggests below, the SG list would still
limit us; I am not sure how big an SG list can be, but surely not 262144
entries.
So what we were thinking is to give ib_reg_mr() a huge range, even 4 GB, but
then use a bitmap parameter that specifies only the pages in that range
that take part in the MR.

> 
> > 
> > That is a fair point - I wonder if some of these API should have an
> > option to accept a pointer directly? Maybe the driver requires a MR but
> > we don't need that as an the API?
> > 
> > Particularly the _sg one..
> > 
> > Jason


* RE: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
  2018-01-23 20:29                 ` Yuval Shaia
@ 2018-01-25 12:43                   ` Alex Margolin
       [not found]                     ` <VI1PR05MB12787572593F02F05AA20DE1B9E10-79XLn2atqDMOK6E67s+DINqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2018-01-25 13:10                   ` Alex Margolin
  1 sibling, 1 reply; 18+ messages in thread
From: Alex Margolin @ 2018-01-25 12:43 UTC (permalink / raw)
  To: Yuval Shaia, Marcel Apfelbaum
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA



> -----Original Message-----
> From: Yuval Shaia [mailto:yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
> Sent: Tuesday, January 23, 2018 10:30 PM
> To: Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Marcel Apfelbaum
> <marcel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Jason Gunthorpe <jgg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory
> registration
> 
> On Mon, Jan 22, 2018 at 03:59:51PM +0000, Alex Margolin wrote:
> > > -----Original Message-----
> > > From: Jason Gunthorpe
> > > Sent: Thursday, January 11, 2018 6:45 PM
> > > To: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > > Cc: Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous
> > > memory registration
> > >
> > > On Thu, Jan 11, 2018 at 02:22:07PM +0200, Yuval Shaia wrote:
> > > > > +The following code example demonstrates non-contiguous memory
> > > > > +registration, by combining two contiguous regions, along with
> > > > > +the
> > > WR-based completion semantic:
> > > > > +.PP
> > > > > +.nf
> > > > > +mr1 = ibv_reg_mr(pd, addr1, len1, 0); if (!mr1) {
> > > > > +        fprintf(stderr, "Failed to create MR #1\en");
> > > > > +        return 1;
> > > > > +}
> > > > > +
> > > > > +mr2 = ibv_reg_mr(pd, addr2, len2, 0); if (!mr2) {
> > > > > +        fprintf(stderr, "Failed to create MR #2\en");
> > > > > +        return 1;
> > > > > +}
> > > >
> > > > So, to register non-contiguous 512 random buffers i would have to
> > > > create
> > > > 512 MRs?
> >
> >
> > I think typically if you have a large amount of buffers - it would be
> located in fairly close proximity, so you'd prefer one MR to cover all
> of them and the SGEs will only differ in base address.
> 
> Define "large amount".
> I did several experiments with something like hundred or few hundred
> (Marcel, do you remember how many?) and they were scattered at the range
> of about 3G so one MR is not an option. Our application is QEMU so 3G
> for one MR means no memory overcommit.
> 
> >
> > Are you proposing the function also replaces ibv_reg_mr() if the user
> passes multiple unregistered regions?
> > I could see the benefit, but then we'd require additional parameters
> (i.e. send_flags) and those MRs couldn't be reused (otherwise need to
> add output pointers for resulting MRs).

Actually, I realized it can be implemented with the proposed API.
All that is missing is a capability bit and a flag for set_layout_*,
and the implementation could work as follows (changes relative to SG example):

+assert(caps & IBV_MR_SET_LAYOUT_INTERNAL_REGISTRATION);
-mr1 = ibv_reg_mr(pd, addr1, len1, 0);
-if (!mr1) {
-        fprintf(stderr, "Failed to create MR #1\en");
-        return 1;
-}
-
-mr2 = ibv_reg_mr(pd, addr2, len2, 0);
-if (!mr2) {
-        fprintf(stderr, "Failed to create MR #2\en");
-        return 1;
-}

mr3 = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED);
if (!mr3) {
        fprintf(stderr, "Failed to create result MR\en");
        return 1;
}

struct ibv_sge composite[] =
{
        {
                .addr = addr1,
                .length = len1,
-                .lkey = mr1->lkey
        },
        {
                .addr = addr2,
                .length = len2,
-                .lkey = mr2->lkey
        }
};

+ret = ibv_mr_set_layout_sg(mr3, IBV_MR_SET_LAYOUT_REGISTER_BUFFERS, 2, composite);
-ret = ibv_mr_set_layout_sg(mr3, 0, 2, composite);
if (ret) {
        fprintf(stderr, "Non-contiguous registration failed\en");
        return 1;
}

In this case calling ibv_mr_set_layout_sg() will cause an internal registration
replacing the ibv_reg_mr calls for mr1 and mr2, and the registration will be stored
in mr3.

Is this what you had in mind?

> 
> Yeah, more or less the same ib_reg_mr but one that gets list of pages
> instead of virtual address and will skip the "while (npages)" loop in
> ib_umem_get and just go directly to dma_map_sg. Idea here is that anyway
> the HW supports scattered list of buffers so why to limit the API to
> contiguous virtual address.
> 
> We dropped this idea as it turns out that we need extra help from the HW
> in post_send phase where the virtual address received in the SGE refers
> to the virtual address given at ib_reg_mr.
> We somehow believed that zero-based-mr will solve this by maybe allowing
> addresses in SGE to be something like an index to a entry in the page-
> list given to ib_reg_mr but apparently zero-based-mr is not yet
> functional (at least not in CX3).
> (We have lack of knowledge in what exactly zero-based-mr is).
> 
> > The benefit will probably not be latency, though, since IIRC the MR
> creation can't really be parallelized.
> > Yuval - are you aware of a scenario implementing a high amount of
> ibv_reg_mr() calls?
> 
> High amount of ibv_reg_mr calls no but i have a scenario where my
> application can potentially receive request to create MR for 262144
> scattered pages.
> By the way, using the suggested API from Jason below, SG list will still
> limits us, not sure how big SG list can be but sure not 262144.
> So what we were thinking is to give ib_reg_mr a huge range, even 4G but
> then use a bitmap parameter that will specify only the pages in that
> range that take part in the MR.
> 
> >
> > >
> > > That is a fair point - I wonder if some of these API should have an
> > > option to accept a pointer directly? Maybe the driver requires a MR
> > > but we don't need that as an the API?
> > >
> > > Particularly the _sg one..
> > >
> > > Jason


* RE: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
  2018-01-23 20:29                 ` Yuval Shaia
  2018-01-25 12:43                   ` Alex Margolin
@ 2018-01-25 13:10                   ` Alex Margolin
  1 sibling, 0 replies; 18+ messages in thread
From: Alex Margolin @ 2018-01-25 13:10 UTC (permalink / raw)
  To: Yuval Shaia, Marcel Apfelbaum
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA



> -----Original Message-----
> From: Alex Margolin
> Sent: Thursday, January 25, 2018 2:43 PM
> To: 'Yuval Shaia' <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>; Marcel Apfelbaum
> <marcel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Cc: Jason Gunthorpe <jgg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: RE: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory
> registration
> 
> 
> 
> > -----Original Message-----
> > From: Yuval Shaia [mailto:yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
> > Sent: Tuesday, January 23, 2018 10:30 PM
> > To: Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Marcel Apfelbaum
> > <marcel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Cc: Jason Gunthorpe <jgg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous
> > memory registration
> >
> > On Mon, Jan 22, 2018 at 03:59:51PM +0000, Alex Margolin wrote:
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, January 11, 2018 6:45 PM
> > > > To: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > > > Cc: Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>;
> > > > linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous
> > > > memory registration
> > > >
> > > > On Thu, Jan 11, 2018 at 02:22:07PM +0200, Yuval Shaia wrote:
> > > > > > +The following code example demonstrates non-contiguous memory
> > > > > > +registration, by combining two contiguous regions, along with
> > > > > > +the
> > > > WR-based completion semantic:
> > > > > > +.PP
> > > > > > +.nf
> > > > > > +mr1 = ibv_reg_mr(pd, addr1, len1, 0); if (!mr1) {
> > > > > > +        fprintf(stderr, "Failed to create MR #1\en");
> > > > > > +        return 1;
> > > > > > +}
> > > > > > +
> > > > > > +mr2 = ibv_reg_mr(pd, addr2, len2, 0); if (!mr2) {
> > > > > > +        fprintf(stderr, "Failed to create MR #2\en");
> > > > > > +        return 1;
> > > > > > +}
> > > > >
> > > > > So, to register non-contiguous 512 random buffers i would have
> > > > > to create
> > > > > 512 MRs?
> > >
> > >
> > > I think typically if you have a large amount of buffers - it would
> > > be
> > located in fairly close proximity, so you'd prefer one MR to cover all
> > of them and the SGEs will only differ in base address.
> >
> > Define "large amount".
> > I did several experiments with something like hundred or few hundred
> > (Marcel, do you remember how many?) and they were scattered at the
> > range of about 3G so one MR is not an option. Our application is QEMU
> > so 3G for one MR means no memory overcommit.
> >
> > >
> > > Are you proposing the function also replaces ibv_reg_mr() if the
> > > user
> > passes multiple unregistered regions?
> > > I could see the benefit, but then we'd require additional parameters
> > (i.e. send_flags) and those MRs couldn't be reused (otherwise need to
> > add output pointers for resulting MRs).
> 
> Actually, I realized it can be implemented with the proposed API.
> All that is missing is a capability bit and a flag for set_layout_*, and
> the implementation could work as follows (changes relative to SG
> example):
> 
> +assert(caps & IBV_MR_SET_LAYOUT_INTERNAL_REGISTRATION);
> -mr1 = ibv_reg_mr(pd, addr1, len1, 0);
> -if (!mr1) {
> -        fprintf(stderr, "Failed to create MR #1\en");
> -        return 1;
> -}
> -
> -mr2 = ibv_reg_mr(pd, addr2, len2, 0);
> -if (!mr2) {
> -        fprintf(stderr, "Failed to create MR #2\en");
> -        return 1;
> -}
> 
> mr3 = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED); if (!mr3) {
>         fprintf(stderr, "Failed to create result MR\en");
>         return 1;
> }
> 
> struct ibv_sge composite[] =
> {
>         {
>                 .addr = addr1,
>                 .length = len1,
> -                .lkey = mr1->lkey
>         },
>         {
>                 .addr = addr2,
>                 .length = len2,
> -                .lkey = mr2->lkey
>         }
> };
> 
> +ret = ibv_mr_set_layout_sg(mr3, IBV_MR_SET_LAYOUT_REGISTER_BUFFERS, 2,
> +composite);
> -ret = ibv_mr_set_layout_sg(mr3, 0, 2, composite); if (ret) {
>         fprintf(stderr, "Non-contiguous registration failed\en");
>         return 1;
> }
> 
> In this case calling ibv_mr_set_layout_sg() will cause an internal
> registration replacing the ibv_reg_mr calls for mr1 and mr2, and the
> registration will be stored in mr3.

Forgot to add: MR creation parameters, such as access flags, will be taken from the ibv_reg_mr() call that created mr3.

> 
> Is this what you had in mind?
> 
> >
> > Yeah, more or less the same ib_reg_mr but one that gets list of pages
> > instead of virtual address and will skip the "while (npages)" loop in
> > ib_umem_get and just go directly to dma_map_sg. Idea here is that
> > anyway the HW supports scattered list of buffers so why to limit the
> > API to contiguous virtual address.
> >
> > We dropped this idea as it turns out that we need extra help from the
> > HW in post_send phase where the virtual address received in the SGE
> > refers to the virtual address given at ib_reg_mr.
> > We somehow believed that zero-based-mr will solve this by maybe
> > allowing addresses in SGE to be something like an index to a entry in
> > the page- list given to ib_reg_mr but apparently zero-based-mr is not
> > yet functional (at least not in CX3).
> > (We have lack of knowledge in what exactly zero-based-mr is).
> >
> > > The benefit will probably not be latency, though, since IIRC the MR
> > creation can't really be parallelized.
> > > Yuval - are you aware of a scenario implementing a high amount of
> > ibv_reg_mr() calls?
> >
> > High amount of ibv_reg_mr calls no but i have a scenario where my
> > application can potentially receive request to create MR for 262144
> > scattered pages.
> > By the way, using the suggested API from Jason below, SG list will
> > still limits us, not sure how big SG list can be but sure not 262144.
> > So what we were thinking is to give ib_reg_mr a huge range, even 4G
> > but then use a bitmap parameter that will specify only the pages in
> > that range that take part in the MR.
> >
> > >
> > > >
> > > > That is a fair point - I wonder if some of these API should have
> > > > an option to accept a pointer directly? Maybe the driver requires
> > > > a MR but we don't need that as an the API?
> > > >
> > > > Particularly the _sg one..
> > > >
> > > > Jason


* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]                     ` <VI1PR05MB12787572593F02F05AA20DE1B9E10-79XLn2atqDMOK6E67s+DINqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-01-25 16:07                       ` Jason Gunthorpe
  2018-01-28 20:37                       ` Yuval Shaia
  1 sibling, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2018-01-25 16:07 UTC (permalink / raw)
  To: Alex Margolin
  Cc: Yuval Shaia, Marcel Apfelbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jan 25, 2018 at 12:43:05PM +0000, Alex Margolin wrote:

> Actually, I realized it can be implemented with the proposed API.
> All that is missing is a capability bit and a flag for set_layout_*,
> and the implementation could work as follows (changes relative to SG example):

This can work.

> In this case calling ibv_mr_set_layout_sg() will cause an internal registration
> replacing the ibv_reg_mr calls for mr1 and mr2, and the registration will be stored
> in mr3.

Yuval is right, though: in cases where the buffers are page-aligned, we
can ask the kernel to create a single normal MR supported by all HW.

Could be a useful feature.

Jason


* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]                     ` <VI1PR05MB12787572593F02F05AA20DE1B9E10-79XLn2atqDMOK6E67s+DINqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2018-01-25 16:07                       ` Jason Gunthorpe
@ 2018-01-28 20:37                       ` Yuval Shaia
  2018-01-29 17:27                         ` Jason Gunthorpe
  1 sibling, 1 reply; 18+ messages in thread
From: Yuval Shaia @ 2018-01-28 20:37 UTC (permalink / raw)
  To: Alex Margolin
  Cc: Marcel Apfelbaum, Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Jan 25, 2018 at 12:43:05PM +0000, Alex Margolin wrote:
> 
> 
> > -----Original Message-----
> > From: Yuval Shaia [mailto:yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
> > Sent: Tuesday, January 23, 2018 10:30 PM
> > To: Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; Marcel Apfelbaum
> > <marcel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > Cc: Jason Gunthorpe <jgg-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory
> > registration
> > 
> > On Mon, Jan 22, 2018 at 03:59:51PM +0000, Alex Margolin wrote:
> > > > -----Original Message-----
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, January 11, 2018 6:45 PM
> > > > To: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> > > > Cc: Alex Margolin <alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > > > Subject: Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous
> > > > memory registration
> > > >
> > > > On Thu, Jan 11, 2018 at 02:22:07PM +0200, Yuval Shaia wrote:
> > > > > > +The following code example demonstrates non-contiguous memory
> > > > > > +registration, by combining two contiguous regions, along with
> > > > > > +the
> > > > WR-based completion semantic:
> > > > > > +.PP
> > > > > > +.nf
> > > > > > +mr1 = ibv_reg_mr(pd, addr1, len1, 0); if (!mr1) {
> > > > > > +        fprintf(stderr, "Failed to create MR #1\en");
> > > > > > +        return 1;
> > > > > > +}
> > > > > > +
> > > > > > +mr2 = ibv_reg_mr(pd, addr2, len2, 0); if (!mr2) {
> > > > > > +        fprintf(stderr, "Failed to create MR #2\en");
> > > > > > +        return 1;
> > > > > > +}
> > > > >
> > > > > So, to register non-contiguous 512 random buffers i would have to
> > > > > create
> > > > > 512 MRs?
> > >
> > >
> > > I think typically if you have a large amount of buffers - it would be
> > located in fairly close proximity, so you'd prefer one MR to cover all
> > of them and the SGEs will only differ in base address.
> > 
> > Define "large amount".
> > I did several experiments with something like hundred or few hundred
> > (Marcel, do you remember how many?) and they were scattered at the range
> > of about 3G so one MR is not an option. Our application is QEMU so 3G
> > for one MR means no memory overcommit.
> > 
> > >
> > > Are you proposing the function also replaces ibv_reg_mr() if the user
> > passes multiple unregistered regions?
> > > I could see the benefit, but then we'd require additional parameters
> > (i.e. send_flags) and those MRs couldn't be reused (otherwise need to
> > add output pointers for resulting MRs).
> 
> Actually, I realized it can be implemented with the proposed API.
> All that is missing is a capability bit and a flag for set_layout_*,
> and the implementation could work as follows (changes relative to SG example):
> 
> +assert(caps & IBV_MR_SET_LAYOUT_INTERNAL_REGISTRATION);
> -mr1 = ibv_reg_mr(pd, addr1, len1, 0);
> -if (!mr1) {
> -        fprintf(stderr, "Failed to create MR #1\en");
> -        return 1;
> -}
> -
> -mr2 = ibv_reg_mr(pd, addr2, len2, 0);
> -if (!mr2) {
> -        fprintf(stderr, "Failed to create MR #2\en");
> -        return 1;
> -}
> 
> mr3 = ibv_reg_mr(pd, NULL, 0, IBV_ACCESS_ZERO_BASED);
> if (!mr3) {
>         fprintf(stderr, "Failed to create result MR\en");
>         return 1;
> }
> 
> struct ibv_sge composite[] =
> {
>         {
>                 .addr = addr1,
>                 .length = len1,
> -                .lkey = mr1->lkey
>         },
>         {
>                 .addr = addr2,
>                 .length = len2,
> -                .lkey = mr2->lkey
>         }
> };
> 
> +ret = ibv_mr_set_layout_sg(mr3, IBV_MR_SET_LAYOUT_REGISTER_BUFFERS, 2, composite);
> -ret = ibv_mr_set_layout_sg(mr3, 0, 2, composite);
> if (ret) {
>         fprintf(stderr, "Non-contiguous registration failed\en");
>         return 1;
> }
> 
> In this case calling ibv_mr_set_layout_sg() will cause an internal registration
> replacing the ibv_reg_mr calls for mr1 and mr2, and the registration will be stored
> in mr3.
> 
> Is this what you had in mind?

Yes.

But let's try to take it one step further: what if all my buffers are the
same size, or even better, all are PAGE_SIZE? Then in the case of a
"composite" array of, say, 262144 elements I would waste 262144 * 8 bytes
just on addresses.

This problem could be solved with a bitmap over a given range, where only
the pages whose bits are set compose the MR.

> 
> > 
> > Yeah, more or less the same ib_reg_mr but one that gets list of pages
> > instead of virtual address and will skip the "while (npages)" loop in
> > ib_umem_get and just go directly to dma_map_sg. Idea here is that anyway
> > the HW supports scattered list of buffers so why to limit the API to
> > contiguous virtual address.
> > 
> > We dropped this idea as it turns out that we need extra help from the HW
> > in post_send phase where the virtual address received in the SGE refers
> > to the virtual address given at ib_reg_mr.
> > We somehow believed that zero-based-mr will solve this by maybe allowing
> > addresses in SGE to be something like an index to a entry in the page-
> > list given to ib_reg_mr but apparently zero-based-mr is not yet
> > functional (at least not in CX3).
> > (We have lack of knowledge in what exactly zero-based-mr is).
> > 
> > > The benefit will probably not be latency, though, since IIRC the MR
> > creation can't really be parallelized.
> > > Yuval - are you aware of a scenario implementing a high amount of
> > ibv_reg_mr() calls?
> > 
> > High amount of ibv_reg_mr calls no but i have a scenario where my
> > application can potentially receive request to create MR for 262144
> > scattered pages.
> > By the way, using the suggested API from Jason below, SG list will still
> > limits us, not sure how big SG list can be but sure not 262144.
> > So what we were thinking is to give ib_reg_mr a huge range, even 4G but
> > then use a bitmap parameter that will specify only the pages in that
> > range that take part in the MR.
> > 
> > >
> > > >
> > > > That is a fair point - I wonder if some of these API should have an
> > > > option to accept a pointer directly? Maybe the driver requires a MR
> > > > but we don't need that as an the API?
> > > >
> > > > Particularly the _sg one..
> > > >
> > > > Jason


* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
  2018-01-28 20:37                       ` Yuval Shaia
@ 2018-01-29 17:27                         ` Jason Gunthorpe
       [not found]                           ` <20180129172717.GW23852-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2018-01-29 17:27 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: Alex Margolin, Marcel Apfelbaum, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Sun, Jan 28, 2018 at 10:37:47PM +0200, Yuval Shaia wrote:

> But let's try to take it one step further, what if all my buffers are the
> same size, of even better, all are PAGE_SIZE. So in case of "composite"
> array of let's say 262144 elements i would have wasteful 262144 * 8 bytes.
> 
> This problem could be solved with a bitmap to a given range where only the
> bits that are set composed the MR.

You want this for the host side of virtualization, right? Like we talked
about at Plumbers?

Is it really necessary to be so optimal? A list of SGLs is not good
enough?

Jason


* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]                           ` <20180129172717.GW23852-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-01-30 11:35                             ` Marcel Apfelbaum
       [not found]                               ` <12d04e1b-6024-0763-f5c5-46ca8b0823a6-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Marcel Apfelbaum @ 2018-01-30 11:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Yuval Shaia
  Cc: Alex Margolin, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 29/01/2018 19:27, Jason Gunthorpe wrote:
> On Sun, Jan 28, 2018 at 10:37:47PM +0200, Yuval Shaia wrote:
> 

Hi Jason,

>> But let's try to take it one step further, what if all my buffers are the
>> same size, of even better, all are PAGE_SIZE. So in case of "composite"
>> array of let's say 262144 elements i would have wasteful 262144 * 8 bytes.
>>
>> This problem could be solved with a bitmap to a given range where only the
>> bits that are set composed the MR.
> 
> You want this for the host on virtualization right?

Yes. (Actually, it is more about us needing it rather than wanting it :) )

> Like we talked
> about at plumbers?
>
> Is it really necessary to be so optimal? A list of SGLs is not good
> enough?

It is not. We think the list would need to be limited to a single page
(a system call limitation? maybe we are wrong?), so we would not be able
to pass it for a 1 GB MR, which would have 262144 elements.

The proposed interface seems cleaner. We pass a (huge) range and we mark
the pages we need using a bitmap.


By the way, doing that would only solve half of our problem.

The other problem is what happens on post-send. We don't have a virtually
contiguous range to pass to post-send, and breaking the work request into
several work requests along page boundaries again becomes a problem if we
want to send a big chunk (the HW has a rather limited maximum number of SG
elements). We could solve it by using zero-based MRs; do you know whether
the current HW supports them?

Thanks,
Marcel

> 
> Jason
> 



* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]                               ` <12d04e1b-6024-0763-f5c5-46ca8b0823a6-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-30 15:42                                 ` Jason Gunthorpe
       [not found]                                   ` <20180130154200.GD21679-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2018-01-30 15:42 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: Yuval Shaia, Alex Margolin, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, Jan 30, 2018 at 01:35:21PM +0200, Marcel Apfelbaum wrote:
> On 29/01/2018 19:27, Jason Gunthorpe wrote:
> >On Sun, Jan 28, 2018 at 10:37:47PM +0200, Yuval Shaia wrote:
> >
> 
> Hi Jason,
> 
> >>But let's try to take it one step further, what if all my buffers are the
> >>same size, or even better, all are PAGE_SIZE. So in case of a "composite"
> >>array of, let's say, 262144 elements I would have a wasteful 262144 * 8 bytes.
> >>
> >>This problem could be solved with a bitmap to a given range where only the
> >>bits that are set compose the MR.
> >
> >You want this for the host on virtualization right?
> 
> Yes. (actually it is more about us needing rather than wanting :) )
> 
> >Like we talked
> >about at plumbers?
> > > Is it really necessary to be so optimal? A list of SGLs is not good
> >enough?
> 
> It is not. We think the list would need to be limited to a single page
> (a system call limitation? maybe we are wrong?).

The new ioctl interface isn't really limited.
These new APIs will run over ioctl.

> By the way, doing that would only solve half of our problem.

Well, actually, only a 3rd :| The new MR would likely be 0-based, but
the VM guest doesn't know about this. So you'd need an API that can do
an arbitrary base to really solve your problem. I guess all HW should
be able to do this so maybe it is OK?

> The other problem is what is happening on post-send. We don't have a
> virtually contiguous range to pass to post-send, and breaking the
> Work Request into several work requests using pages as boundaries
> will become again a problem if we want to send a big chunk (the HW
> has a rather limited max sg elements).  We can solve it by using 0
> based MRs, do you know if the current HW supports it?

I think some does.

Jason


* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]                                   ` <20180130154200.GD21679-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-01-31 12:27                                     ` Marcel Apfelbaum
       [not found]                                       ` <76b5a8cf-b3ed-c76d-6157-91fc5f6f2b35-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Marcel Apfelbaum @ 2018-01-31 12:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yuval Shaia, Alex Margolin, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 30/01/2018 17:42, Jason Gunthorpe wrote:
> On Tue, Jan 30, 2018 at 01:35:21PM +0200, Marcel Apfelbaum wrote:
>> On 29/01/2018 19:27, Jason Gunthorpe wrote:
>>> On Sun, Jan 28, 2018 at 10:37:47PM +0200, Yuval Shaia wrote:
>>>
>>
>> Hi Jason,
>>
>>>> But let's try to take it one step further, what if all my buffers are the
>>>> same size, or even better, all are PAGE_SIZE. So in case of a "composite"
>>>> array of, let's say, 262144 elements I would have a wasteful 262144 * 8 bytes.
>>>>
>>>> This problem could be solved with a bitmap to a given range where only the
>>>> bits that are set compose the MR.
>>>
>>> You want this for the host on virtualization right?
>>
>> Yes. (actually it is more about us needing rather than wanting :) )
>>
>>> Like we talked
>>> about at plumbers?
>>>> Is it really necessary to be so optimal? A list of SGLs is not good
>>> enough?
>>
>> It is not. We think the list would need to be limited to a single page
>> (a system call limitation? maybe we are wrong?).
> 
> The new ioctl interface isn't really limited.
> These new APIs will run over ioctl.
> 

It is good to know, but still, passing so much information to the kernel
when we could rather "compress" it, maybe it is worth a second thought.


>> By the way, doing that would only solve half of our problem.
> 
> Well, actually, only a 3rd :| The new MR would likely be 0-based, but
> the VM guest doesn't know about this. So you'd need an API that can do
> an arbitrary base to really solve your problem. I guess all HW should
> be able to do this so maybe it is OK?

The way we solve "the other" half is by intercepting the post-send
requests in the hypervisor. At the hypervisor level we don't have contiguous virtual
addresses anymore, but we don't need them for 0-based MRs:
the guest still registers regular MRs, while the hypervisor
registers a 0-based MR and saves the guest virtual address of the MR.
At post-send we simply subtract the saved MR base address from the work request
buffers and we are back to a 0-based MR.
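The rebasing step above can be sketched as follows (a minimal sketch with hypothetical structure and function names; a real interception layer would rewrite the guest's ibv_sge entries before forwarding the work request):

```c
#include <stdint.h>

/* Hypothetical per-MR state kept by the hypervisor: the guest virtual
 * address the guest registered the MR at, and the lkey of the 0-based
 * MR the hypervisor registered on the host instead. */
struct hv_mr {
    uint64_t guest_base; /* guest VA of the start of the MR */
    uint32_t host_lkey;  /* lkey of the host's 0-based MR */
};

/* Minimal stand-in for a scatter/gather element in a work request. */
struct hv_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

/* At post-send interception: subtract the saved guest base so the
 * address becomes an offset into the 0-based MR, and swap in the
 * host lkey. */
static void hv_rebase_sge(struct hv_sge *sge, const struct hv_mr *mr)
{
    sge->addr -= mr->guest_base;
    sge->lkey = mr->host_lkey;
}
```

Note this only helps local access via lkeys; an rkey exposes the base address to the remote side, which the hypervisor cannot rewrite.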

> 
>> The other problem is what is happening on post-send. We don't have a
>> virtually contiguous range to pass to post-send, and breaking the
>> Work Request into several work requests using pages as boundaries
>> will become again a problem if we want to send a big chunk (the HW
>> has a rather limited max sg elements).  We can solve it by using 0
>> based MRs, do you know if the current HW supports it?
> 
> I think some does.
> 

Do you have a model in mind? We would really want to try it out.

By the way, I tried to search the kernel for vendors implementing it
and I saw maybe one vendor... so maybe 0-based MRs are a nice idea but nothing more.

Thanks,
Marcel

> Jason
> 



* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]                                       ` <76b5a8cf-b3ed-c76d-6157-91fc5f6f2b35-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-31 18:38                                         ` Jason Gunthorpe
       [not found]                                           ` <20180131183810.GA23352-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2018-01-31 18:38 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: Yuval Shaia, Alex Margolin, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 31, 2018 at 02:27:01PM +0200, Marcel Apfelbaum wrote:

> It is good to know, but still, passing so much information to the kernel
> when we could rather "compress" it, maybe it is worth a second thought.

Not sure. Have to see the whole thing..

> > Well, actually, only a 3rd :| The new MR would likely be 0-based, but
> > the VM guest doesn't know about this. So you'd need an API that can do
> > an arbitrary base to really solve your problem. I guess all HW should
> > be able to do this so maybe it is OK?
> 
> The way we solve "the other" half is by intercepting the post-send
> requests in the hypervisor. At the hypervisor level we don't have contiguous virtual
> addresses anymore, but we don't need them for 0-based MRs:
> the guest still registers regular MRs, while the hypervisor
> registers a 0-based MR and saves the guest virtual address of the MR.
> At post-send we simply subtract the saved MR base address from the work request
> buffers and we are back to a 0-based MR.

That only works for lkeys, the rkey exposes the base address to the
remote - the HV can't fix it.

> Do you have a model in mind? We would really want to try it out.
> 
> By the way, I tried to search in the kernel for vendors implementing
> it and I saw maybe one vendor... so maybe 0 based MR is a nice idea
> but nothing more.

Try the mlx drivers, I think at least one of them can do it today.

Jason


* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]                                           ` <20180131183810.GA23352-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-02-01 18:22                                             ` Marcel Apfelbaum
       [not found]                                               ` <dded9055-a329-9b9d-943a-7a60445e2ada-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Marcel Apfelbaum @ 2018-02-01 18:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yuval Shaia, Alex Margolin, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 31/01/2018 20:38, Jason Gunthorpe wrote:
> On Wed, Jan 31, 2018 at 02:27:01PM +0200, Marcel Apfelbaum wrote:
> 
>> It is good to know, but still, passing so much information to the kernel
>> when we could rather "compress" it, maybe it is worth a second thought.
> 
> Not sure. Have to see the whole thing..
> 
>>> Well, actually, only a 3rd :| The new MR would likely be 0-based, but
>>> the VM guest doesn't know about this. So you'd need an API that can do
>>> an arbitrary base to really solve your problem. I guess all HW should
>>> be able to do this so maybe it is OK?
>>
>> The way we solve "the other" half is by intercepting the post-send
>> requests in the hypervisor. At the hypervisor level we don't have contiguous virtual
>> addresses anymore, but we don't need them for 0-based MRs:
>> the guest still registers regular MRs, while the hypervisor
>> registers a 0-based MR and saves the guest virtual address of the MR.
>> At post-send we simply subtract the saved MR base address from the work request
>> buffers and we are back to a 0-based MR.
> 

Hi Jason,

> That only works for lkeys, the rkey exposes the base address to the
> remote - the HV can't fix it.
> 

Thanks for the clarification.

What we really need is to allow mapping a list of
pages to an IOVA different from the process address
space, e.g. a guest-supplied IOVA.

Something like reg_mr(list_of_process_va_pages, base_other_iova, len_other_iova)

Do you think the new API can support that?

Thanks,
Marcel


>> Do you have a model in mind? We would really want to try it out.
>>
>> By the way, I tried to search in the kernel for vendors implementing
>> it and I saw maybe one vendor... so maybe 0 based MR is a nice idea
>> but nothing more.
> 
> Try the mlx drivers, I think at least one of them can do it today.
> 
> Jason
> 



* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]                                               ` <dded9055-a329-9b9d-943a-7a60445e2ada-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-02-01 18:29                                                 ` Jason Gunthorpe
       [not found]                                                   ` <20180201182959.GN23352-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2018-02-01 18:29 UTC (permalink / raw)
  To: Marcel Apfelbaum
  Cc: Yuval Shaia, Alex Margolin, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Feb 01, 2018 at 08:22:01PM +0200, Marcel Apfelbaum wrote:
> On 31/01/2018 20:38, Jason Gunthorpe wrote:
> > On Wed, Jan 31, 2018 at 02:27:01PM +0200, Marcel Apfelbaum wrote:
> > 
> >> It is good to know, but still, passing so much information to the kernel
> >> when we could rather "compress" it, maybe it is worth a second thought.
> > 
> > Not sure. Have to see the whole thing..
> > 
> >>> Well, actually, only a 3rd :| The new MR would likely be 0-based, but
> >>> the VM guest doesn't know about this. So you'd need an API that can do
> >>> an arbitrary base to really solve your problem. I guess all HW should
> >>> be able to do this so maybe it is OK?
> >>
> >> The way we solve "the other" half is by intercepting the post-send
> >> requests in the hypervisor. At the hypervisor level we don't have contiguous virtual
> >> addresses anymore, but we don't need them for 0-based MRs:
> >> the guest still registers regular MRs, while the hypervisor
> >> registers a 0-based MR and saves the guest virtual address of the MR.
> >> At post-send we simply subtract the saved MR base address from the work request
> >> buffers and we are back to a 0-based MR.
> > 
> 
> Hi Jason,
> 
> > That only works for lkeys, the rkey exposes the base address to the
> > remote - the HV can't fix it.
> > 
> 
> Thanks for the clarification.
> 
> What we really need is to allow mapping a list of
> pages to an IOVA different from the process address
> space, e.g. a guest-supplied IOVA.
> 
> Something like reg_mr(list_of_process_va_pages, base_other_iova, len_other_iova)
> 
> Do you think the new API can support that?

Well, I think we should have something like this.

I actually can't see how it could need special HW support, since this
is basically exactly the same as creating a normal MR.

And the same goes for 0-based: 'base_other_iova == 0' is the same as
zero-based.

I think the difference from the proposed API here is that this requires
full OS pages, while Alex's version can do sub-pages too using HW
features.

I would urge you to pursue an API like you described:

struct ibv_mr *ib_reg_mr_sg(const void *pages[], size_t num_pages,
                            uint64_t mr_addr,
                            size_t mr_offset, // MR starts at pages[0] + mr_offset
                            size_t mr_length,
                            unsigned int flags);
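For illustration only, a sketch of how a hypervisor might call this verb to register guest pages at a guest-supplied IOVA. ib_reg_mr_sg is only a proposal in this thread, not an existing rdma-core API, so it is stubbed below purely to keep the sketch self-contained; all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Trimmed-down ibv_mr, and a stub of the *proposed* verb, so the
 * sketch compiles; a real implementation would live in the provider. */
struct ibv_mr {
    uint64_t iova;   /* address the MR is accessed at */
    size_t length;
    uint32_t lkey;
};

static struct ibv_mr *ib_reg_mr_sg(const void *pages[], size_t num_pages,
                                   uint64_t mr_addr, size_t mr_offset,
                                   size_t mr_length, unsigned int flags)
{
    (void)pages; (void)num_pages; (void)mr_offset; (void)flags;
    struct ibv_mr *mr = malloc(sizeof(*mr));
    if (!mr)
        return NULL;
    mr->iova = mr_addr;   /* caller-chosen IOVA, e.g. a guest VA */
    mr->length = mr_length;
    mr->lkey = 0;         /* a real provider would assign this */
    return mr;
}

/* Caller side: gather the host pages backing a guest buffer and
 * register them at the IOVA the guest expects. */
static struct ibv_mr *map_guest_buffer(const void *host_pages[], size_t npages,
                                       uint64_t guest_iova, size_t len)
{
    return ib_reg_mr_sg(host_pages, npages, guest_iova, 0, len, 0);
}
```

The key point of the design is that mr_addr decouples the MR's IOVA from the process address space: passing 0 gives a 0-based MR, and passing a guest VA gives the guest-visible addressing Marcel asked about.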

Jason


* Re: [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration
       [not found]                                                   ` <20180201182959.GN23352-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-02-01 18:45                                                     ` Marcel Apfelbaum
  0 siblings, 0 replies; 18+ messages in thread
From: Marcel Apfelbaum @ 2018-02-01 18:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Yuval Shaia, Alex Margolin, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 01/02/2018 20:29, Jason Gunthorpe wrote:
> On Thu, Feb 01, 2018 at 08:22:01PM +0200, Marcel Apfelbaum wrote:
>> On 31/01/2018 20:38, Jason Gunthorpe wrote:
>>> On Wed, Jan 31, 2018 at 02:27:01PM +0200, Marcel Apfelbaum wrote:
>>>
>>>> It is good to know, but still, passing so much information to the kernel
>>>> when we could rather "compress" it, maybe it is worth a second thought.
>>>
>>> Not sure. Have to see the whole thing..
>>>
>>>>> Well, actually, only a 3rd :| The new MR would likely be 0-based, but
>>>>> the VM guest doesn't know about this. So you'd need an API that can do
>>>>> an arbitrary base to really solve your problem. I guess all HW should
>>>>> be able to do this so maybe it is OK?
>>>>
>>>> The way we solve "the other" half is by intercepting the post-send
>>>> requests in the hypervisor. At the hypervisor level we don't have contiguous virtual
>>>> addresses anymore, but we don't need them for 0-based MRs:
>>>> the guest still registers regular MRs, while the hypervisor
>>>> registers a 0-based MR and saves the guest virtual address of the MR.
>>>> At post-send we simply subtract the saved MR base address from the work request
>>>> buffers and we are back to a 0-based MR.
>>>
>>
>> Hi Jason,
>>
>>> That only works for lkeys, the rkey exposes the base address to the
>>> remote - the HV can't fix it.
>>>
>>
>> Thanks for the clarification.
>>
>> What we really need is to allow mapping a list of
>> pages to an IOVA different from the process address
>> space, e.g. a guest-supplied IOVA.
>>
>> Something like reg_mr(list_of_process_va_pages, base_other_iova, len_other_iova)
>>
>> Do you think the new API can support that?
> 
> Well, I think we should have something like this.
> 
> I actually can't see how it could need special HW support, since this
> is basically exactly the same as creating a normal MR.
> 
> And same with 0 based, 'base_other_iova == 0' is the same as zero
> based.
> 

Agreed.

> I think the difference from the proposed API here is that this requires
> full OS pages, while Alex's version can do sub-pages too using HW
> features.
> 
> I would urge you to pursue an API like you described:
> 
> struct ibv_mr *ib_reg_mr_sg(const void *pages[], size_t num_pages,
>                             uint64_t mr_addr,
>                             size_t mr_offset, // MR starts at pages[0] + mr_offset
>                             size_t mr_length,
>                             unsigned int flags);
> 

Sounds right, thanks for the pointer.
Marcel

> Jason
> 



end of thread, other threads:[~2018-02-01 18:45 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-04 17:47 [RFC rdma-core 1/2] Registering non-contiguous memory Alex Margolin
     [not found] ` <1515088046-26605-1-git-send-email-alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2018-01-04 17:47   ` [RFC rdma-core 2/2] verbs: Introduce non-contiguous memory registration Alex Margolin
     [not found]     ` <1515088046-26605-2-git-send-email-alexma-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2018-01-11 12:22       ` Yuval Shaia
2018-01-11 16:44         ` Jason Gunthorpe
     [not found]           ` <20180111164455.GA1309-uk2M96/98Pc@public.gmane.org>
2018-01-22 15:59             ` Alex Margolin
     [not found]               ` <VI1PR05MB1278C4C4FF78B4B1A551252EB9EC0-79XLn2atqDMOK6E67s+DINqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-01-23 20:29                 ` Yuval Shaia
2018-01-25 12:43                   ` Alex Margolin
     [not found]                     ` <VI1PR05MB12787572593F02F05AA20DE1B9E10-79XLn2atqDMOK6E67s+DINqRiQSDpxhJvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-01-25 16:07                       ` Jason Gunthorpe
2018-01-28 20:37                       ` Yuval Shaia
2018-01-29 17:27                         ` Jason Gunthorpe
     [not found]                           ` <20180129172717.GW23852-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2018-01-30 11:35                             ` Marcel Apfelbaum
     [not found]                               ` <12d04e1b-6024-0763-f5c5-46ca8b0823a6-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-30 15:42                                 ` Jason Gunthorpe
     [not found]                                   ` <20180130154200.GD21679-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2018-01-31 12:27                                     ` Marcel Apfelbaum
     [not found]                                       ` <76b5a8cf-b3ed-c76d-6157-91fc5f6f2b35-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-31 18:38                                         ` Jason Gunthorpe
     [not found]                                           ` <20180131183810.GA23352-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2018-02-01 18:22                                             ` Marcel Apfelbaum
     [not found]                                               ` <dded9055-a329-9b9d-943a-7a60445e2ada-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-02-01 18:29                                                 ` Jason Gunthorpe
     [not found]                                                   ` <20180201182959.GN23352-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2018-02-01 18:45                                                     ` Marcel Apfelbaum
2018-01-25 13:10                   ` Alex Margolin
