linux-rdma.vger.kernel.org archive mirror
* [PATCH rdma-next 0/4] Support multiple path records
@ 2022-09-08 10:08 Leon Romanovsky
  2022-09-08 10:09 ` [PATCH rdma-next 1/4] RDMA/core: Rename rdma_route.num_paths field to num_pri_alt_paths Leon Romanovsky
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Leon Romanovsky @ 2022-09-08 10:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Leon Romanovsky, Bart Van Assche, linux-rdma, Mark Bloch, Mark Zhang

From: Leon Romanovsky <leonro@nvidia.com>

From Mark:

These patches allow IB core to receive multiple path records from
user-space rdma netlink service.

Currently only one GMP PR is supported when doing a PR query. This means
that in a fabric with multiple subnets, when a packet leaves the sender it
is assigned the LID of a single router as its destination LID. Because the
current solution selects one specific router per destination, there is no
advanced routing from the sender towards the network.

This patchset adds support for receiving an inbound PR and an outbound PR
along with the GMP PR. The LIDs in these 3 PRs are used as follows:
1. GMP PR: used as the standard local/remote LIDs;
2. DLID of outbound PR: used as the "dlid" field for outbound traffic;
3. DLID of inbound PR: used as the "dlid" field for outbound traffic on
   the responder side.

The inbound_PR.dlid is passed to the responder in the "Primary_Local_Port_LID"
field of the ConnectRequest message.

With this, the user-space rdma netlink service can set special DLIDs in the
inbound/outbound PRs to select specific routers for the datapath between the
two nodes, as in the sketch below.
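
To illustrate, a minimal requester-side sketch (not code from these patches;
"pr_gmp", "pr_out" and "pr_in" are hypothetical pointers to the records whose
flags carry IB_PATH_GMP, IB_PATH_OUTBOUND and IB_PATH_INBOUND):

#include <rdma/ib_verbs.h>	/* IB_LID_PERMISSIVE */
#include <rdma/ib_sa.h>		/* struct sa_path_rec */

static void sketch_pick_dlids(const struct sa_path_rec *pr_gmp,
			      const struct sa_path_rec *pr_out,
			      const struct sa_path_rec *pr_in,
			      __be16 *datapath_dlid, __be16 *req_local_lid)
{
	/* Requester datapath: prefer the outbound PR's DLID, else the GMP PR's */
	*datapath_dlid = pr_out ? pr_out->ib.dlid : pr_gmp->ib.dlid;

	/* The inbound PR's DLID is carried to the responder in the CM REQ
	 * "Primary Local Port LID" field, so the responder can use it as its
	 * own datapath DLID; otherwise the permissive LID is sent.
	 */
	*req_local_lid = pr_in ? pr_in->ib.dlid : IB_LID_PERMISSIVE;
}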

The following cases were tested:
- New kernel with new netlink service that supports multiple path
  records;
- New kernel with old netlink service;
- Old kernel with new netlink service;
- Client side new kernel with new netlink service, server side with
  old kernel.

Thanks.

Mark Zhang (4):
  RDMA/core: Rename rdma_route.num_paths field to num_pri_alt_paths
  RDMA/cma: Multiple path records support with netlink channel
  RDMA/cm: Use SLID in the work completion as the DLID in responder side
  RDMA/cm: Use DLID from inbound/outbound PathRecords as the datapath
    DLID

 drivers/infiniband/core/cm.c              |  39 +++-
 drivers/infiniband/core/cma.c             |  88 ++++++--
 drivers/infiniband/core/sa_query.c        | 235 +++++++++++++++-------
 drivers/infiniband/core/ucma.c            |  10 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   2 +-
 drivers/infiniband/ulp/srp/ib_srp.c       |   2 +-
 include/rdma/ib_cm.h                      |   2 +
 include/rdma/ib_sa.h                      |   3 +-
 include/rdma/rdma_cm.h                    |  13 +-
 9 files changed, 284 insertions(+), 110 deletions(-)

-- 
2.37.2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH rdma-next 1/4] RDMA/core: Rename rdma_route.num_paths field to num_pri_alt_paths
  2022-09-08 10:08 [PATCH rdma-next 0/4] Support multiple path records Leon Romanovsky
@ 2022-09-08 10:09 ` Leon Romanovsky
  2022-09-08 10:09 ` [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel Leon Romanovsky
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2022-09-08 10:09 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Mark Zhang, Bart Van Assche, linux-rdma, Mark Bloch

From: Mark Zhang <markzhang@nvidia.com>

This field holds the total number of primary and alternate paths, i.e.:
  0 - Neither primary nor alternate path is available;
  1 - Only the primary path is available;
  2 - Both primary and alternate paths are available.
Rename it to avoid confusion, as with the following patches the primary
path will support multiple path records.
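
For illustration only (not part of this patch), a hypothetical consumer of
struct rdma_route would read the renamed field like this:

#include <rdma/rdma_cm.h>	/* struct rdma_route */

/* Sketch: 0 = nothing resolved, 1 = primary only, 2 = primary + alternate */
static bool route_has_alternate_path(const struct rdma_route *route)
{
	return route->num_pri_alt_paths == 2;
}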

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/cma.c  | 18 +++++++++---------
 drivers/infiniband/core/ucma.c | 10 +++++-----
 include/rdma/rdma_cm.h         |  7 ++++++-
 3 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 46d06678dfbe..91e72a76d95e 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2241,14 +2241,14 @@ cma_ib_new_conn_id(const struct rdma_cm_id *listen_id,
 		goto err;
 
 	rt = &id->route;
-	rt->num_paths = ib_event->param.req_rcvd.alternate_path ? 2 : 1;
-	rt->path_rec = kmalloc_array(rt->num_paths, sizeof(*rt->path_rec),
-				     GFP_KERNEL);
+	rt->num_pri_alt_paths = ib_event->param.req_rcvd.alternate_path ? 2 : 1;
+	rt->path_rec = kmalloc_array(rt->num_pri_alt_paths,
+				     sizeof(*rt->path_rec), GFP_KERNEL);
 	if (!rt->path_rec)
 		goto err;
 
 	rt->path_rec[0] = *path;
-	if (rt->num_paths == 2)
+	if (rt->num_pri_alt_paths == 2)
 		rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path;
 
 	if (net_dev) {
@@ -2826,7 +2826,7 @@ static void cma_query_handler(int status, struct sa_path_rec *path_rec,
 	route = &work->id->id.route;
 
 	if (!status) {
-		route->num_paths = 1;
+		route->num_pri_alt_paths = 1;
 		*route->path_rec = *path_rec;
 	} else {
 		work->old_state = RDMA_CM_ROUTE_QUERY;
@@ -3081,7 +3081,7 @@ int rdma_set_ib_path(struct rdma_cm_id *id,
 		dev_put(ndev);
 	}
 
-	id->route.num_paths = 1;
+	id->route.num_pri_alt_paths = 1;
 	return 0;
 
 err_free:
@@ -3214,7 +3214,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 		goto err1;
 	}
 
-	route->num_paths = 1;
+	route->num_pri_alt_paths = 1;
 
 	ndev = cma_iboe_set_path_rec_l2_fields(id_priv);
 	if (!ndev) {
@@ -3274,7 +3274,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 err2:
 	kfree(route->path_rec);
 	route->path_rec = NULL;
-	route->num_paths = 0;
+	route->num_pri_alt_paths = 0;
 err1:
 	kfree(work);
 	return ret;
@@ -4265,7 +4265,7 @@ static int cma_connect_ib(struct rdma_id_private *id_priv,
 	}
 
 	req.primary_path = &route->path_rec[0];
-	if (route->num_paths == 2)
+	if (route->num_pri_alt_paths == 2)
 		req.alternate_path = &route->path_rec[1];
 
 	req.ppath_sgid_attr = id_priv->id.route.addr.dev_addr.sgid_attr;
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 9d6ac9dff39a..bf42650f125b 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -754,8 +754,8 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
 {
 	struct rdma_dev_addr *dev_addr;
 
-	resp->num_paths = route->num_paths;
-	switch (route->num_paths) {
+	resp->num_paths = route->num_pri_alt_paths;
+	switch (route->num_pri_alt_paths) {
 	case 0:
 		dev_addr = &route->addr.dev_addr;
 		rdma_addr_get_dgid(dev_addr,
@@ -781,8 +781,8 @@ static void ucma_copy_iboe_route(struct rdma_ucm_query_route_resp *resp,
 				 struct rdma_route *route)
 {
 
-	resp->num_paths = route->num_paths;
-	switch (route->num_paths) {
+	resp->num_paths = route->num_pri_alt_paths;
+	switch (route->num_pri_alt_paths) {
 	case 0:
 		rdma_ip2gid((struct sockaddr *)&route->addr.dst_addr,
 			    (union ib_gid *)&resp->ib_route[0].dgid);
@@ -921,7 +921,7 @@ static ssize_t ucma_query_path(struct ucma_context *ctx,
 	if (!resp)
 		return -ENOMEM;
 
-	resp->num_paths = ctx->cm_id->route.num_paths;
+	resp->num_paths = ctx->cm_id->route.num_pri_alt_paths;
 	for (i = 0, out_len -= sizeof(*resp);
 	     i < resp->num_paths && out_len > sizeof(struct ib_path_rec_data);
 	     i++, out_len -= sizeof(struct ib_path_rec_data)) {
diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h
index 5b18e2e36ee6..81916039ee24 100644
--- a/include/rdma/rdma_cm.h
+++ b/include/rdma/rdma_cm.h
@@ -52,7 +52,12 @@ struct rdma_addr {
 struct rdma_route {
 	struct rdma_addr addr;
 	struct sa_path_rec *path_rec;
-	int num_paths;
+	/*
+	 * 0 - No primary nor alternate path is available
+	 * 1 - Only primary path is available
+	 * 2 - Both primary and alternate path are available
+	 */
+	int num_pri_alt_paths;
 };
 
 struct rdma_conn_param {
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel
  2022-09-08 10:08 [PATCH rdma-next 0/4] Support multiple path records Leon Romanovsky
  2022-09-08 10:09 ` [PATCH rdma-next 1/4] RDMA/core: Rename rdma_route.num_paths field to num_pri_alt_paths Leon Romanovsky
@ 2022-09-08 10:09 ` Leon Romanovsky
  2022-09-22 13:58   ` Jason Gunthorpe
  2022-09-08 10:09 ` [PATCH rdma-next 3/4] RDMA/cm: Use SLID in the work completion as the DLID in responder side Leon Romanovsky
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Leon Romanovsky @ 2022-09-08 10:09 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Mark Zhang, Bart Van Assche, linux-rdma, Mark Bloch

From: Mark Zhang <markzhang@nvidia.com>

Support receiving inbound and outbound IB path records (along with the GMP
PathRecord) from the user-space service through the RDMA netlink channel.
The LIDs in these 3 PRs are used as follows:
1. GMP PR: used as the standard local/remote LIDs;
2. DLID of outbound PR: used as the "dlid" field for outbound traffic;
3. DLID of inbound PR: used as the "dlid" field for outbound traffic on
   the responder side.

This is aimed at supporting adaptive routing. With the current IB routing
solution, when a packet goes out it is assigned a fixed DLID per target,
meaning a fixed router will be used.
The LIDs in the inbound/outbound path records can be used to identify a
group of routers that allow communication with another subnet's entity.
With them, packets of an inter-subnet connection may travel through any
router in the set to reach the target.

As confirmed with Jason, when sending a netlink request the kernel uses
LS_RESOLVE_PATH_USE_ALL so that the service knows the kernel supports
multiple PRs.
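
For illustration, a rough user-space sketch (not the actual rdma netlink
service implementation; the helper name and exact flag combinations are
assumptions based on the kernel-side checks in this patch) of how the
service could tag the three records before attaching each one as an
LS_NLA_TYPE_PATH_RECORD attribute:

#include <string.h>
#include <rdma/ib_user_sa.h>	/* struct ib_path_rec_data, IB_PATH_* flags */

/* Sketch only: fill three LS_NLA_TYPE_PATH_RECORD payloads from packed
 * 64-byte SA PathRecord buffers (gmp/out/in).
 */
static void fill_multi_path_reply(struct ib_path_rec_data rec[3],
				  const void *gmp, const void *out,
				  const void *in)
{
	memcpy(rec[0].path_rec, gmp, sizeof(rec[0].path_rec));
	rec[0].flags = IB_PATH_GMP | IB_PATH_PRIMARY | IB_PATH_BIDIRECTIONAL;

	memcpy(rec[1].path_rec, out, sizeof(rec[1].path_rec));
	rec[1].flags = IB_PATH_PRIMARY | IB_PATH_OUTBOUND;

	memcpy(rec[2].path_rec, in, sizeof(rec[2].path_rec));
	rec[2].flags = IB_PATH_PRIMARY | IB_PATH_INBOUND;
}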

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/cma.c             |  70 ++++++-
 drivers/infiniband/core/sa_query.c        | 235 +++++++++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   2 +-
 drivers/infiniband/ulp/srp/ib_srp.c       |   2 +-
 include/rdma/ib_sa.h                      |   3 +-
 include/rdma/rdma_cm.h                    |   6 +
 6 files changed, 231 insertions(+), 87 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 91e72a76d95e..a3efc462305d 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2026,6 +2026,8 @@ static void _destroy_id(struct rdma_id_private *id_priv,
 		cma_id_put(id_priv->id.context);
 
 	kfree(id_priv->id.route.path_rec);
+	kfree(id_priv->id.route.path_rec_inbound);
+	kfree(id_priv->id.route.path_rec_outbound);
 
 	put_net(id_priv->id.route.addr.dev_addr.net);
 	kfree(id_priv);
@@ -2817,26 +2819,72 @@ int rdma_set_min_rnr_timer(struct rdma_cm_id *id, u8 min_rnr_timer)
 }
 EXPORT_SYMBOL(rdma_set_min_rnr_timer);
 
+static void route_set_path_rec_inbound(struct cma_work *work,
+				       struct sa_path_rec *path_rec)
+{
+	struct rdma_route *route = &work->id->id.route;
+
+	if (!route->path_rec_inbound) {
+		route->path_rec_inbound =
+			kzalloc(sizeof(*route->path_rec_inbound), GFP_KERNEL);
+		if (!route->path_rec_inbound)
+			return;
+	}
+
+	*route->path_rec_inbound = *path_rec;
+}
+
+static void route_set_path_rec_outbound(struct cma_work *work,
+					struct sa_path_rec *path_rec)
+{
+	struct rdma_route *route = &work->id->id.route;
+
+	if (!route->path_rec_outbound) {
+		route->path_rec_outbound =
+			kzalloc(sizeof(*route->path_rec_outbound), GFP_KERNEL);
+		if (!route->path_rec_outbound)
+			return;
+	}
+
+	*route->path_rec_outbound = *path_rec;
+}
+
 static void cma_query_handler(int status, struct sa_path_rec *path_rec,
-			      void *context)
+			      int num_prs, void *context)
 {
 	struct cma_work *work = context;
 	struct rdma_route *route;
+	int i;
 
 	route = &work->id->id.route;
 
-	if (!status) {
-		route->num_pri_alt_paths = 1;
-		*route->path_rec = *path_rec;
-	} else {
-		work->old_state = RDMA_CM_ROUTE_QUERY;
-		work->new_state = RDMA_CM_ADDR_RESOLVED;
-		work->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
-		work->event.status = status;
-		pr_debug_ratelimited("RDMA CM: ROUTE_ERROR: failed to query path. status %d\n",
-				     status);
+	if (status)
+		goto fail;
+
+	for (i = 0; i < num_prs; i++) {
+		if (!path_rec[i].flags || (path_rec[i].flags & IB_PATH_GMP))
+			*route->path_rec = path_rec[i];
+		else if (path_rec[i].flags & IB_PATH_INBOUND)
+			route_set_path_rec_inbound(work, &path_rec[i]);
+		else if (path_rec[i].flags & IB_PATH_OUTBOUND)
+			route_set_path_rec_outbound(work, &path_rec[i]);
+	}
+	if (!route->path_rec) {
+		status = -EINVAL;
+		goto fail;
 	}
 
+	route->num_pri_alt_paths = 1;
+	queue_work(cma_wq, &work->work);
+	return;
+
+fail:
+	work->old_state = RDMA_CM_ROUTE_QUERY;
+	work->new_state = RDMA_CM_ADDR_RESOLVED;
+	work->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
+	work->event.status = status;
+	pr_debug_ratelimited("RDMA CM: ROUTE_ERROR: failed to query path. status %d\n",
+			     status);
 	queue_work(cma_wq, &work->work);
 }
 
diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 003e504feca2..0de83d9a4985 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -50,6 +50,7 @@
 #include <rdma/ib_marshall.h>
 #include <rdma/ib_addr.h>
 #include <rdma/opa_addr.h>
+#include <rdma/rdma_cm.h>
 #include "sa.h"
 #include "core_priv.h"
 
@@ -104,7 +105,8 @@ struct ib_sa_device {
 };
 
 struct ib_sa_query {
-	void (*callback)(struct ib_sa_query *, int, struct ib_sa_mad *);
+	void (*callback)(struct ib_sa_query *sa_query, int status,
+			 int num_prs, struct ib_sa_mad *mad);
 	void (*release)(struct ib_sa_query *);
 	struct ib_sa_client    *client;
 	struct ib_sa_port      *port;
@@ -116,6 +118,12 @@ struct ib_sa_query {
 	u32			seq; /* Local svc request sequence number */
 	unsigned long		timeout; /* Local svc timeout */
 	u8			path_use; /* How will the pathrecord be used */
+
+	/* A separate buffer to save pathrecords of a response, as in cases
+	 * like IB/netlink, multiple pathrecords are supported, so that
+	 * mad->data is not large enough to hold them
+	 */
+	void			*resp_pr_data;
 };
 
 #define IB_SA_ENABLE_LOCAL_SERVICE	0x00000001
@@ -123,7 +131,8 @@ struct ib_sa_query {
 #define IB_SA_QUERY_OPA			0x00000004
 
 struct ib_sa_path_query {
-	void (*callback)(int, struct sa_path_rec *, void *);
+	void (*callback)(int status, struct sa_path_rec *rec,
+			 int num_paths, void *context);
 	void *context;
 	struct ib_sa_query sa_query;
 	struct sa_path_rec *conv_pr;
@@ -712,7 +721,7 @@ static void ib_nl_set_path_rec_attrs(struct sk_buff *skb,
 
 	if ((comp_mask & IB_SA_PATH_REC_REVERSIBLE) &&
 	    sa_rec->reversible != 0)
-		query->path_use = LS_RESOLVE_PATH_USE_GMP;
+		query->path_use = LS_RESOLVE_PATH_USE_ALL;
 	else
 		query->path_use = LS_RESOLVE_PATH_USE_UNIDIRECTIONAL;
 	header->path_use = query->path_use;
@@ -865,50 +874,81 @@ static void send_handler(struct ib_mad_agent *agent,
 static void ib_nl_process_good_resolve_rsp(struct ib_sa_query *query,
 					   const struct nlmsghdr *nlh)
 {
+	struct ib_path_rec_data *srec, *drec;
+	struct ib_sa_path_query *path_query;
 	struct ib_mad_send_wc mad_send_wc;
-	struct ib_sa_mad *mad = NULL;
 	const struct nlattr *head, *curr;
-	struct ib_path_rec_data  *rec;
-	int len, rem;
+	struct ib_sa_mad *mad = NULL;
+	int len, rem, num_prs = 0;
 	u32 mask = 0;
 	int status = -EIO;
 
-	if (query->callback) {
-		head = (const struct nlattr *) nlmsg_data(nlh);
-		len = nlmsg_len(nlh);
-		switch (query->path_use) {
-		case LS_RESOLVE_PATH_USE_UNIDIRECTIONAL:
-			mask = IB_PATH_PRIMARY | IB_PATH_OUTBOUND;
-			break;
+	if (!query->callback)
+		goto out;
 
-		case LS_RESOLVE_PATH_USE_ALL:
-		case LS_RESOLVE_PATH_USE_GMP:
-		default:
-			mask = IB_PATH_PRIMARY | IB_PATH_GMP |
-				IB_PATH_BIDIRECTIONAL;
-			break;
+	path_query = container_of(query, struct ib_sa_path_query, sa_query);
+	mad = query->mad_buf->mad;
+	if (!path_query->conv_pr &&
+	    (be16_to_cpu(mad->mad_hdr.attr_id) == IB_SA_ATTR_PATH_REC)) {
+		/* Need a larger buffer for possible multiple PRs */
+		query->resp_pr_data = kvcalloc(RDMA_PRIMARY_PATH_MAX_REC_NUM,
+					       sizeof(*drec), GFP_KERNEL);
+		if (!query->resp_pr_data) {
+			query->callback(query, -ENOMEM, 0, NULL);
+			return;
 		}
-		nla_for_each_attr(curr, head, len, rem) {
-			if (curr->nla_type == LS_NLA_TYPE_PATH_RECORD) {
-				rec = nla_data(curr);
-				/*
-				 * Get the first one. In the future, we may
-				 * need to get up to 6 pathrecords.
-				 */
-				if ((rec->flags & mask) == mask) {
-					mad = query->mad_buf->mad;
-					mad->mad_hdr.method |=
-						IB_MGMT_METHOD_RESP;
-					memcpy(mad->data, rec->path_rec,
-					       sizeof(rec->path_rec));
-					status = 0;
-					break;
-				}
-			}
+	}
+
+	head = (const struct nlattr *) nlmsg_data(nlh);
+	len = nlmsg_len(nlh);
+	switch (query->path_use) {
+	case LS_RESOLVE_PATH_USE_UNIDIRECTIONAL:
+		mask = IB_PATH_PRIMARY | IB_PATH_OUTBOUND;
+		break;
+
+	case LS_RESOLVE_PATH_USE_ALL:
+		mask = IB_PATH_PRIMARY;
+		break;
+
+	case LS_RESOLVE_PATH_USE_GMP:
+	default:
+		mask = IB_PATH_PRIMARY | IB_PATH_GMP |
+			IB_PATH_BIDIRECTIONAL;
+		break;
+	}
+
+	drec = (struct ib_path_rec_data *)query->resp_pr_data;
+	nla_for_each_attr(curr, head, len, rem) {
+		if (curr->nla_type != LS_NLA_TYPE_PATH_RECORD)
+			continue;
+
+		srec = nla_data(curr);
+		if ((srec->flags & mask) != mask)
+			continue;
+
+		status = 0;
+		if (!drec) {
+			memcpy(mad->data, srec->path_rec,
+			       sizeof(srec->path_rec));
+			num_prs = 1;
+			break;
 		}
-		query->callback(query, status, mad);
+
+		memcpy(drec, srec, sizeof(*drec));
+		drec++;
+		num_prs++;
+		if (num_prs >= RDMA_PRIMARY_PATH_MAX_REC_NUM)
+			break;
 	}
 
+	if (!status)
+		mad->mad_hdr.method |= IB_MGMT_METHOD_RESP;
+
+	query->callback(query, status, num_prs, mad);
+	kvfree(query->resp_pr_data);
+	query->resp_pr_data = NULL;
+
+out:
 	mad_send_wc.send_buf = query->mad_buf;
 	mad_send_wc.status = IB_WC_SUCCESS;
 	send_handler(query->mad_buf->mad_agent, &mad_send_wc);
@@ -1411,41 +1451,90 @@ static int opa_pr_query_possible(struct ib_sa_client *client,
 		return PR_IB_SUPPORTED;
 }
 
+static void ib_sa_pr_callback_single(struct ib_sa_path_query *query,
+				     int status, struct ib_sa_mad *mad)
+{
+	struct sa_path_rec rec = {};
+
+	ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table),
+		  mad->data, &rec);
+	rec.rec_type = SA_PATH_REC_TYPE_IB;
+	sa_path_set_dmac_zero(&rec);
+
+	if (query->conv_pr) {
+		struct sa_path_rec opa;
+
+		memset(&opa, 0, sizeof(struct sa_path_rec));
+		sa_convert_path_ib_to_opa(&opa, &rec);
+		query->callback(status, &opa, 1, query->context);
+	} else {
+		query->callback(status, &rec, 1, query->context);
+	}
+}
+
+/**
+ * ib_sa_pr_callback_multiple() - Parse path records then do callback.
+ *
+ * In a multiple-PR case the PRs are saved in "query->resp_pr_data"
+ * (instead of "mad->data") and with "ib_path_rec_data" structure format,
+ * so that rec->flags can be set to indicate the type of PR.
+ * This is valid only in IB fabric.
+ */
+static void ib_sa_pr_callback_multiple(struct ib_sa_path_query *query,
+				       int status, int num_prs,
+				       struct ib_path_rec_data *rec_data)
+{
+	struct sa_path_rec *rec;
+	int i;
+
+	rec = kvcalloc(num_prs, sizeof(*rec), GFP_KERNEL);
+	if (!rec) {
+		query->callback(-ENOMEM, NULL, 0, query->context);
+		return;
+	}
+
+	for (i = 0; i < num_prs; i++) {
+		ib_unpack(path_rec_table, ARRAY_SIZE(path_rec_table),
+			  rec_data[i].path_rec, rec + i);
+		rec[i].rec_type = SA_PATH_REC_TYPE_IB;
+		sa_path_set_dmac_zero(rec + i);
+		rec[i].flags = rec_data[i].flags;
+	}
+
+	query->callback(status, rec, num_prs, query->context);
+	kvfree(rec);
+}
+
 static void ib_sa_path_rec_callback(struct ib_sa_query *sa_query,
-				    int status,
+				    int status, int num_prs,
 				    struct ib_sa_mad *mad)
 {
 	struct ib_sa_path_query *query =
 		container_of(sa_query, struct ib_sa_path_query, sa_query);
+	struct sa_path_rec rec;
 
-	if (mad) {
-		struct sa_path_rec rec;
-
-		if (sa_query->flags & IB_SA_QUERY_OPA) {
-			ib_unpack(opa_path_rec_table,
-				  ARRAY_SIZE(opa_path_rec_table),
-				  mad->data, &rec);
-			rec.rec_type = SA_PATH_REC_TYPE_OPA;
-			query->callback(status, &rec, query->context);
-		} else {
-			ib_unpack(path_rec_table,
-				  ARRAY_SIZE(path_rec_table),
-				  mad->data, &rec);
-			rec.rec_type = SA_PATH_REC_TYPE_IB;
-			sa_path_set_dmac_zero(&rec);
-
-			if (query->conv_pr) {
-				struct sa_path_rec opa;
+	if (!mad || !num_prs) {
+		query->callback(status, NULL, 0, query->context);
+		return;
+	}
 
-				memset(&opa, 0, sizeof(struct sa_path_rec));
-				sa_convert_path_ib_to_opa(&opa, &rec);
-				query->callback(status, &opa, query->context);
-			} else {
-				query->callback(status, &rec, query->context);
-			}
+	if (sa_query->flags & IB_SA_QUERY_OPA) {
+		if (num_prs != 1) {
+			query->callback(-EINVAL, NULL, 0, query->context);
+			return;
 		}
-	} else
-		query->callback(status, NULL, query->context);
+
+		ib_unpack(opa_path_rec_table, ARRAY_SIZE(opa_path_rec_table),
+			  mad->data, &rec);
+		rec.rec_type = SA_PATH_REC_TYPE_OPA;
+		query->callback(status, &rec, num_prs, query->context);
+	} else {
+		if (!sa_query->resp_pr_data)
+			ib_sa_pr_callback_single(query, status, mad);
+		else
+			ib_sa_pr_callback_multiple(query, status, num_prs,
+						   sa_query->resp_pr_data);
+	}
 }
 
 static void ib_sa_path_rec_release(struct ib_sa_query *sa_query)
@@ -1489,7 +1578,7 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 		       unsigned long timeout_ms, gfp_t gfp_mask,
 		       void (*callback)(int status,
 					struct sa_path_rec *resp,
-					void *context),
+					int num_paths, void *context),
 		       void *context,
 		       struct ib_sa_query **sa_query)
 {
@@ -1588,7 +1677,7 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 EXPORT_SYMBOL(ib_sa_path_rec_get);
 
 static void ib_sa_mcmember_rec_callback(struct ib_sa_query *sa_query,
-					int status,
+					int status, int num_prs,
 					struct ib_sa_mad *mad)
 {
 	struct ib_sa_mcmember_query *query =
@@ -1680,7 +1769,7 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 
 /* Support GuidInfoRecord */
 static void ib_sa_guidinfo_rec_callback(struct ib_sa_query *sa_query,
-					int status,
+					int status, int num_paths,
 					struct ib_sa_mad *mad)
 {
 	struct ib_sa_guidinfo_query *query =
@@ -1790,7 +1879,7 @@ static void ib_classportinfo_cb(void *context)
 }
 
 static void ib_sa_classport_info_rec_callback(struct ib_sa_query *sa_query,
-					      int status,
+					      int status, int num_prs,
 					      struct ib_sa_mad *mad)
 {
 	unsigned long flags;
@@ -1966,13 +2055,13 @@ static void send_handler(struct ib_mad_agent *agent,
 			/* No callback -- already got recv */
 			break;
 		case IB_WC_RESP_TIMEOUT_ERR:
-			query->callback(query, -ETIMEDOUT, NULL);
+			query->callback(query, -ETIMEDOUT, 0, NULL);
 			break;
 		case IB_WC_WR_FLUSH_ERR:
-			query->callback(query, -EINTR, NULL);
+			query->callback(query, -EINTR, 0, NULL);
 			break;
 		default:
-			query->callback(query, -EIO, NULL);
+			query->callback(query, -EIO, 0, NULL);
 			break;
 		}
 
@@ -2000,10 +2089,10 @@ static void recv_handler(struct ib_mad_agent *mad_agent,
 		if (mad_recv_wc->wc->status == IB_WC_SUCCESS)
 			query->callback(query,
 					mad_recv_wc->recv_buf.mad->mad_hdr.status ?
-					-EINVAL : 0,
+					-EINVAL : 0, 1,
 					(struct ib_sa_mad *) mad_recv_wc->recv_buf.mad);
 		else
-			query->callback(query, -EIO, NULL);
+			query->callback(query, -EIO, 0, NULL);
 	}
 
 	ib_free_recv_mad(mad_recv_wc);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index a4904371e2db..ac25fc80fb33 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -742,7 +742,7 @@ void ipoib_flush_paths(struct net_device *dev)
 
 static void path_rec_completion(int status,
 				struct sa_path_rec *pathrec,
-				void *path_ptr)
+				int num_prs, void *path_ptr)
 {
 	struct ipoib_path *path = path_ptr;
 	struct net_device *dev = path->dev;
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 1e777b2043d6..e5ec5444b662 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -699,7 +699,7 @@ static void srp_free_ch_ib(struct srp_target_port *target,
 
 static void srp_path_rec_completion(int status,
 				    struct sa_path_rec *pathrec,
-				    void *ch_ptr)
+				    int num_paths, void *ch_ptr)
 {
 	struct srp_rdma_ch *ch = ch_ptr;
 	struct srp_target_port *target = ch->target;
diff --git a/include/rdma/ib_sa.h b/include/rdma/ib_sa.h
index 3634d4cc7a56..e930bec33b31 100644
--- a/include/rdma/ib_sa.h
+++ b/include/rdma/ib_sa.h
@@ -186,6 +186,7 @@ struct sa_path_rec {
 		struct sa_path_rec_opa opa;
 	};
 	enum sa_path_rec_type rec_type;
+	u32 flags;
 };
 
 static inline enum ib_gid_type
@@ -413,7 +414,7 @@ int ib_sa_path_rec_get(struct ib_sa_client *client, struct ib_device *device,
 		       ib_sa_comp_mask comp_mask, unsigned long timeout_ms,
 		       gfp_t gfp_mask,
 		       void (*callback)(int status, struct sa_path_rec *resp,
-					void *context),
+					int num_prs, void *context),
 		       void *context, struct ib_sa_query **query);
 
 struct ib_sa_multicast {
diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h
index 81916039ee24..cdc7cafab572 100644
--- a/include/rdma/rdma_cm.h
+++ b/include/rdma/rdma_cm.h
@@ -49,9 +49,15 @@ struct rdma_addr {
 	struct rdma_dev_addr dev_addr;
 };
 
+#define RDMA_PRIMARY_PATH_MAX_REC_NUM 3
 struct rdma_route {
 	struct rdma_addr addr;
 	struct sa_path_rec *path_rec;
+
+	/* Optional path records of primary path */
+	struct sa_path_rec *path_rec_inbound;
+	struct sa_path_rec *path_rec_outbound;
+
 	/*
 	 * 0 - No primary nor alternate path is available
 	 * 1 - Only primary path is available
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH rdma-next 3/4] RDMA/cm: Use SLID in the work completion as the DLID in responder side
  2022-09-08 10:08 [PATCH rdma-next 0/4] Support multiple path records Leon Romanovsky
  2022-09-08 10:09 ` [PATCH rdma-next 1/4] RDMA/core: Rename rdma_route.num_paths field to num_pri_alt_paths Leon Romanovsky
  2022-09-08 10:09 ` [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel Leon Romanovsky
@ 2022-09-08 10:09 ` Leon Romanovsky
  2022-09-08 10:09 ` [PATCH rdma-next 4/4] RDMA/cm: Use DLID from inbound/outbound PathRecords as the datapath DLID Leon Romanovsky
  2022-09-22  9:36 ` [PATCH rdma-next 0/4] Support multiple path records Leon Romanovsky
  4 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2022-09-08 10:09 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Mark Zhang, Bart Van Assche, linux-rdma, Mark Bloch

From: Mark Zhang <markzhang@nvidia.com>

The responder should always use the WC's SLID as the DLID, to follow the
IB spec section "13.5.4.2 COMMON RESPONSE ACTIONS":
A responder always takes the following actions in constructing a
response packet:
- The SLID of the received packet is used as the DLID in the response
  packet.

Fixes: ac3a949fb2ff ("IB/CM: Set appropriate slid and dlid when handling CM request")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/cm.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index d7410ee2ade7..ade82752f9f7 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1614,14 +1614,13 @@ static void cm_path_set_rec_type(struct ib_device *ib_device, u32 port_num,
 
 static void cm_format_path_lid_from_req(struct cm_req_msg *req_msg,
 					struct sa_path_rec *primary_path,
-					struct sa_path_rec *alt_path)
+					struct sa_path_rec *alt_path,
+					struct ib_wc *wc)
 {
 	u32 lid;
 
 	if (primary_path->rec_type != SA_PATH_REC_TYPE_OPA) {
-		sa_path_set_dlid(primary_path,
-				 IBA_GET(CM_REQ_PRIMARY_LOCAL_PORT_LID,
-					 req_msg));
+		sa_path_set_dlid(primary_path, wc->slid);
 		sa_path_set_slid(primary_path,
 				 IBA_GET(CM_REQ_PRIMARY_REMOTE_PORT_LID,
 					 req_msg));
@@ -1658,7 +1657,8 @@ static void cm_format_path_lid_from_req(struct cm_req_msg *req_msg,
 
 static void cm_format_paths_from_req(struct cm_req_msg *req_msg,
 				     struct sa_path_rec *primary_path,
-				     struct sa_path_rec *alt_path)
+				     struct sa_path_rec *alt_path,
+				     struct ib_wc *wc)
 {
 	primary_path->dgid =
 		*IBA_GET_MEM_PTR(CM_REQ_PRIMARY_LOCAL_PORT_GID, req_msg);
@@ -1716,7 +1716,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg,
 		if (sa_path_is_roce(alt_path))
 			alt_path->roce.route_resolved = false;
 	}
-	cm_format_path_lid_from_req(req_msg, primary_path, alt_path);
+	cm_format_path_lid_from_req(req_msg, primary_path, alt_path, wc);
 }
 
 static u16 cm_get_bth_pkey(struct cm_work *work)
@@ -2129,7 +2129,7 @@ static int cm_req_handler(struct cm_work *work)
 	if (cm_req_has_alt_path(req_msg))
 		work->path[1].rec_type = work->path[0].rec_type;
 	cm_format_paths_from_req(req_msg, &work->path[0],
-				 &work->path[1]);
+				 &work->path[1], work->mad_recv_wc->wc);
 	if (cm_id_priv->av.ah_attr.type == RDMA_AH_ATTR_TYPE_ROCE)
 		sa_path_set_dmac(&work->path[0],
 				 cm_id_priv->av.ah_attr.roce.dmac);
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH rdma-next 4/4] RDMA/cm: Use DLID from inbound/outbound PathRecords as the datapath DLID
  2022-09-08 10:08 [PATCH rdma-next 0/4] Support multiple path records Leon Romanovsky
                   ` (2 preceding siblings ...)
  2022-09-08 10:09 ` [PATCH rdma-next 3/4] RDMA/cm: Use SLID in the work completion as the DLID in responder side Leon Romanovsky
@ 2022-09-08 10:09 ` Leon Romanovsky
  2022-09-22  9:36 ` [PATCH rdma-next 0/4] Support multiple path records Leon Romanovsky
  4 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2022-09-08 10:09 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Mark Zhang, Bart Van Assche, linux-rdma, Mark Bloch

From: Mark Zhang <markzhang@nvidia.com>

In inter-subnet cases, when inbound/outbound PRs are available,
outbound_PR.dlid is used as the requestor's datapath DLID and
inbound_PR.dlid is used as the responder's DLID. The inbound_PR.dlid
is passed to the responder side in the "ConnectReq.Primary_Local_Port_LID"
field. With this solution the PERMISSIVE_LID is no longer used in the
Primary Local LID field.

Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
---
 drivers/infiniband/core/cm.c  | 25 +++++++++++++++++++++++--
 drivers/infiniband/core/cma.c |  2 ++
 include/rdma/ib_cm.h          |  2 ++
 3 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index ade82752f9f7..1f9938a2c475 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -175,6 +175,7 @@ struct cm_device {
 struct cm_av {
 	struct cm_port *port;
 	struct rdma_ah_attr ah_attr;
+	u16 dlid_datapath;
 	u16 pkey_index;
 	u8 timeout;
 };
@@ -1304,6 +1305,7 @@ static void cm_format_req(struct cm_req_msg *req_msg,
 	struct sa_path_rec *pri_path = param->primary_path;
 	struct sa_path_rec *alt_path = param->alternate_path;
 	bool pri_ext = false;
+	__be16 lid;
 
 	if (pri_path->rec_type == SA_PATH_REC_TYPE_OPA)
 		pri_ext = opa_is_extended_lid(pri_path->opa.dlid,
@@ -1363,9 +1365,16 @@ static void cm_format_req(struct cm_req_msg *req_msg,
 					      htons(ntohl(sa_path_get_dlid(
 						      pri_path)))));
 	} else {
+
+		if (param->primary_path_inbound) {
+			lid = param->primary_path_inbound->ib.dlid;
+			IBA_SET(CM_REQ_PRIMARY_LOCAL_PORT_LID, req_msg,
+				be16_to_cpu(lid));
+		} else
+			IBA_SET(CM_REQ_PRIMARY_LOCAL_PORT_LID, req_msg,
+				be16_to_cpu(IB_LID_PERMISSIVE));
+
 		/* Work-around until there's a way to obtain remote LID info */
-		IBA_SET(CM_REQ_PRIMARY_LOCAL_PORT_LID, req_msg,
-			be16_to_cpu(IB_LID_PERMISSIVE));
 		IBA_SET(CM_REQ_PRIMARY_REMOTE_PORT_LID, req_msg,
 			be16_to_cpu(IB_LID_PERMISSIVE));
 	}
@@ -1520,6 +1529,10 @@ int ib_send_cm_req(struct ib_cm_id *cm_id,
 	spin_lock_irqsave(&cm_id_priv->lock, flags);
 
 	cm_move_av_from_path(&cm_id_priv->av, &av);
+	if (param->primary_path_outbound)
+		cm_id_priv->av.dlid_datapath =
+			be16_to_cpu(param->primary_path_outbound->ib.dlid);
+
 	if (param->alternate_path)
 		cm_move_av_from_path(&cm_id_priv->alt_av, &alt_av);
 
@@ -2154,6 +2167,10 @@ static int cm_req_handler(struct cm_work *work)
 				       NULL, 0);
 		goto rejected;
 	}
+	if (cm_id_priv->av.ah_attr.type == RDMA_AH_ATTR_TYPE_IB)
+		cm_id_priv->av.dlid_datapath =
+			IBA_GET(CM_REQ_PRIMARY_LOCAL_PORT_LID, req_msg);
+
 	if (cm_req_has_alt_path(req_msg)) {
 		ret = cm_init_av_by_path(&work->path[1], NULL,
 					 &cm_id_priv->alt_av);
@@ -4113,6 +4130,10 @@ static int cm_init_qp_rtr_attr(struct cm_id_private *cm_id_priv,
 		*qp_attr_mask = IB_QP_STATE | IB_QP_AV | IB_QP_PATH_MTU |
 				IB_QP_DEST_QPN | IB_QP_RQ_PSN;
 		qp_attr->ah_attr = cm_id_priv->av.ah_attr;
+		if ((qp_attr->ah_attr.type == RDMA_AH_ATTR_TYPE_IB) &&
+		    cm_id_priv->av.dlid_datapath &&
+		    (cm_id_priv->av.dlid_datapath != 0xffff))
+			qp_attr->ah_attr.ib.dlid = cm_id_priv->av.dlid_datapath;
 		qp_attr->path_mtu = cm_id_priv->path_mtu;
 		qp_attr->dest_qp_num = be32_to_cpu(cm_id_priv->remote_qpn);
 		qp_attr->rq_psn = be32_to_cpu(cm_id_priv->rq_psn);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index a3efc462305d..7eacb23165fc 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -4313,6 +4313,8 @@ static int cma_connect_ib(struct rdma_id_private *id_priv,
 	}
 
 	req.primary_path = &route->path_rec[0];
+	req.primary_path_inbound = route->path_rec_inbound;
+	req.primary_path_outbound = route->path_rec_outbound;
 	if (route->num_pri_alt_paths == 2)
 		req.alternate_path = &route->path_rec[1];
 
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 8dae5847020a..a2ac62b4a6cf 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -348,6 +348,8 @@ struct ib_cm_id *ib_cm_insert_listen(struct ib_device *device,
 
 struct ib_cm_req_param {
 	struct sa_path_rec	*primary_path;
+	struct sa_path_rec	*primary_path_inbound;
+	struct sa_path_rec	*primary_path_outbound;
 	struct sa_path_rec	*alternate_path;
 	const struct ib_gid_attr *ppath_sgid_attr;
 	__be64			service_id;
-- 
2.37.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH rdma-next 0/4] Support multiple path records
  2022-09-08 10:08 [PATCH rdma-next 0/4] Support multiple path records Leon Romanovsky
                   ` (3 preceding siblings ...)
  2022-09-08 10:09 ` [PATCH rdma-next 4/4] RDMA/cm: Use DLID from inbound/outbound PathRecords as the datapath DLID Leon Romanovsky
@ 2022-09-22  9:36 ` Leon Romanovsky
  4 siblings, 0 replies; 12+ messages in thread
From: Leon Romanovsky @ 2022-09-22  9:36 UTC (permalink / raw)
  To: Leon Romanovsky, Jason Gunthorpe
  Cc: Bart Van Assche, Mark Zhang, Leon Romanovsky, Mark Bloch, linux-rdma

On Thu, 8 Sep 2022 13:08:59 +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@nvidia.com>
> 
> From Mark:
> 
> These patches allow IB core to receive multiple path records from
> user-space rdma netlink service.
> 
> [...]

Applied, thanks!

[1/4] RDMA/core: Rename rdma_route.num_paths field to num_pri_alt_paths
      https://git.kernel.org/rdma/rdma/c/bf9a9928510a03
[2/4] RDMA/cma: Multiple path records support with netlink channel
      https://git.kernel.org/rdma/rdma/c/5a374949339427
[3/4] RDMA/cm: Use SLID in the work completion as the DLID in responder side
      https://git.kernel.org/rdma/rdma/c/b7d95040c13f61
[4/4] RDMA/cm: Use DLID from inbound/outbound PathRecords as the datapath DLID
      https://git.kernel.org/rdma/rdma/c/eb8336dbe373ed

Best regards,
-- 
Leon Romanovsky <leon@kernel.org>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel
  2022-09-08 10:09 ` [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel Leon Romanovsky
@ 2022-09-22 13:58   ` Jason Gunthorpe
  2022-09-23  1:40     ` Mark Zhang
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2022-09-22 13:58 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Mark Zhang, Bart Van Assche, linux-rdma, Mark Bloch

On Thu, Sep 08, 2022 at 01:09:01PM +0300, Leon Romanovsky wrote:

> +static void route_set_path_rec_inbound(struct cma_work *work,
> +				       struct sa_path_rec *path_rec)
> +{
> +	struct rdma_route *route = &work->id->id.route;
> +
> +	if (!route->path_rec_inbound) {
> +		route->path_rec_inbound =
> +			kzalloc(sizeof(*route->path_rec_inbound), GFP_KERNEL);
> +		if (!route->path_rec_inbound)
> +			return;
> +	}
> +
> +	*route->path_rec_inbound = *path_rec;
> +}

We are just ignoring these memory allocation failures??

>  static void cma_query_handler(int status, struct sa_path_rec *path_rec,
> -			      void *context)
> +			      int num_prs, void *context)

This param should be "unsigned int num_prs"

>  {
>  	struct cma_work *work = context;
>  	struct rdma_route *route;
> +	int i;
>  
>  	route = &work->id->id.route;
>  
> -	if (!status) {
> -		route->num_pri_alt_paths = 1;
> -		*route->path_rec = *path_rec;
> -	} else {
> -		work->old_state = RDMA_CM_ROUTE_QUERY;
> -		work->new_state = RDMA_CM_ADDR_RESOLVED;
> -		work->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
> -		work->event.status = status;
> -		pr_debug_ratelimited("RDMA CM: ROUTE_ERROR: failed to query path. status %d\n",
> -				     status);
> +	if (status)
> +		goto fail;
> +
> +	for (i = 0; i < num_prs; i++) {
> +		if (!path_rec[i].flags || (path_rec[i].flags & IB_PATH_GMP))
> +			*route->path_rec = path_rec[i];
> +		else if (path_rec[i].flags & IB_PATH_INBOUND)
> +			route_set_path_rec_inbound(work, &path_rec[i]);
> +		else if (path_rec[i].flags & IB_PATH_OUTBOUND)
> +			route_set_path_rec_outbound(work, &path_rec[i]);
> +	}
> +	if (!route->path_rec) {

Why is this needed? The for loop above will have already oops'd.

> +/**
> + * ib_sa_pr_callback_multiple() - Parse path records then do callback.
> + *
> + * In a multiple-PR case the PRs are saved in "query->resp_pr_data"
> + * (instead of"mad->data") and with "ib_path_rec_data" structure format,
> + * so that rec->flags can be set to indicate the type of PR.
> + * This is valid only in IB fabric.
> + */
> +static void ib_sa_pr_callback_multiple(struct ib_sa_path_query *query,
> +				       int status, int num_prs,
> +				       struct ib_path_rec_data *rec_data)
> +{
> +	struct sa_path_rec *rec;
> +	int i;
> +
> +	rec = kvcalloc(num_prs, sizeof(*rec), GFP_KERNEL);
> +	if (!rec) {
> +		query->callback(-ENOMEM, NULL, 0, query->context);
> +		return;
> +	}

This all seems really wild, why are we allocating memory so many times
on this path? Have ib_nl_process_good_resolve_rsp unpack the mad
instead of storing the raw format

It would also be good to make resp_pr_data always valid so all these
special paths don't need to exist.

> diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h
> index 81916039ee24..cdc7cafab572 100644
> --- a/include/rdma/rdma_cm.h
> +++ b/include/rdma/rdma_cm.h
> @@ -49,9 +49,15 @@ struct rdma_addr {
>  	struct rdma_dev_addr dev_addr;
>  };
>  
> +#define RDMA_PRIMARY_PATH_MAX_REC_NUM 3

This is a strange place for this define, it should be in sa_query.c?

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel
  2022-09-22 13:58   ` Jason Gunthorpe
@ 2022-09-23  1:40     ` Mark Zhang
  2022-09-23 13:13       ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Mark Zhang @ 2022-09-23  1:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Leon Romanovsky; +Cc: Bart Van Assche, linux-rdma, Mark Bloch

On 9/22/2022 9:58 PM, Jason Gunthorpe wrote:
> On Thu, Sep 08, 2022 at 01:09:01PM +0300, Leon Romanovsky wrote:
> 
>> +static void route_set_path_rec_inbound(struct cma_work *work,
>> +				       struct sa_path_rec *path_rec)
>> +{
>> +	struct rdma_route *route = &work->id->id.route;
>> +
>> +	if (!route->path_rec_inbound) {
>> +		route->path_rec_inbound =
>> +			kzalloc(sizeof(*route->path_rec_inbound), GFP_KERNEL);
>> +		if (!route->path_rec_inbound)
>> +			return;
>> +	}
>> +
>> +	*route->path_rec_inbound = *path_rec;
>> +}
> 
> We are just ignoring these memory allocation failures??
> 
Inside "if" statement if kzalloc fails here then we don't set 
route->path_rec_inbound or outbound;

>>   static void cma_query_handler(int status, struct sa_path_rec *path_rec,
>> -			      void *context)
>> +			      int num_prs, void *context)
> 
> This param should be "unsigned int num_prs"

Ack

> 
>>   {
>>   	struct cma_work *work = context;
>>   	struct rdma_route *route;
>> +	int i;
>>   
>>   	route = &work->id->id.route;
>>   
>> -	if (!status) {
>> -		route->num_pri_alt_paths = 1;
>> -		*route->path_rec = *path_rec;
>> -	} else {
>> -		work->old_state = RDMA_CM_ROUTE_QUERY;
>> -		work->new_state = RDMA_CM_ADDR_RESOLVED;
>> -		work->event.event = RDMA_CM_EVENT_ROUTE_ERROR;
>> -		work->event.status = status;
>> -		pr_debug_ratelimited("RDMA CM: ROUTE_ERROR: failed to query path. status %d\n",
>> -				     status);
>> +	if (status)
>> +		goto fail;
>> +
>> +	for (i = 0; i < num_prs; i++) {
>> +		if (!path_rec[i].flags || (path_rec[i].flags & IB_PATH_GMP))
>> +			*route->path_rec = path_rec[i];
>> +		else if (path_rec[i].flags & IB_PATH_INBOUND)
>> +			route_set_path_rec_inbound(work, &path_rec[i]);
>> +		else if (path_rec[i].flags & IB_PATH_OUTBOUND)
>> +			route_set_path_rec_outbound(work, &path_rec[i]);
>> +	}
>> +	if (!route->path_rec) {
> 
> Why is this needed? The for loop above will have already oops'd.

Right, this "if" is no needed. We don't need to check if route->path_rec 
is valid in this function because it is allocated in cma_resolve_ib_route()

> 
>> +/**
>> + * ib_sa_pr_callback_multiple() - Parse path records then do callback.
>> + *
>> + * In a multiple-PR case the PRs are saved in "query->resp_pr_data"
>> + * (instead of"mad->data") and with "ib_path_rec_data" structure format,
>> + * so that rec->flags can be set to indicate the type of PR.
>> + * This is valid only in IB fabric.
>> + */
>> +static void ib_sa_pr_callback_multiple(struct ib_sa_path_query *query,
>> +				       int status, int num_prs,
>> +				       struct ib_path_rec_data *rec_data)
>> +{
>> +	struct sa_path_rec *rec;
>> +	int i;
>> +
>> +	rec = kvcalloc(num_prs, sizeof(*rec), GFP_KERNEL);
>> +	if (!rec) {
>> +		query->callback(-ENOMEM, NULL, 0, query->context);
>> +		return;
>> +	}
> 
> This all seems really wild, why are we allocating memory so many times
> on this path? Have ib_nl_process_good_resolve_rsp unpack the mad
> instead of storing the raw format
> 
> It would also be good to make resp_pr_data always valid so all these
> special paths don't need to exist.

The ib_sa_pr_callback_single() uses the stack variable "rec", but
ib_sa_pr_callback_multiple() uses malloc because there are multiple PRs.

ib_sa_path_rec_callback() is also used by ib_post_send_mad(), which always
has a single PR saved in mad->data, so always setting resp_pr_data in the
netlink case is not enough.

> >> diff --git a/include/rdma/rdma_cm.h b/include/rdma/rdma_cm.h
>> index 81916039ee24..cdc7cafab572 100644
>> --- a/include/rdma/rdma_cm.h
>> +++ b/include/rdma/rdma_cm.h
>> @@ -49,9 +49,15 @@ struct rdma_addr {
>>   	struct rdma_dev_addr dev_addr;
>>   };
>>   
>> +#define RDMA_PRIMARY_PATH_MAX_REC_NUM 3
> 
> This is a strange place for this define, it should be in sa_query.c?

That's because path_rec, path_rec_inbound and path_rec_outbound are 
defined here, but yes it is only used in sa_query.c, so maybe better 
move it to there.

Thanks Jason.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel
  2022-09-23  1:40     ` Mark Zhang
@ 2022-09-23 13:13       ` Jason Gunthorpe
  2022-09-23 14:24         ` Mark Zhang
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 13:13 UTC (permalink / raw)
  To: Mark Zhang; +Cc: Leon Romanovsky, Bart Van Assche, linux-rdma, Mark Bloch

On Fri, Sep 23, 2022 at 09:40:22AM +0800, Mark Zhang wrote:
> On 9/22/2022 9:58 PM, Jason Gunthorpe wrote:
> > On Thu, Sep 08, 2022 at 01:09:01PM +0300, Leon Romanovsky wrote:
> > 
> > > +static void route_set_path_rec_inbound(struct cma_work *work,
> > > +				       struct sa_path_rec *path_rec)
> > > +{
> > > +	struct rdma_route *route = &work->id->id.route;
> > > +
> > > +	if (!route->path_rec_inbound) {
> > > +		route->path_rec_inbound =
> > > +			kzalloc(sizeof(*route->path_rec_inbound), GFP_KERNEL);
> > > +		if (!route->path_rec_inbound)
> > > +			return;
> > > +	}
> > > +
> > > +	*route->path_rec_inbound = *path_rec;
> > > +}
> > 
> > We are just ignoring these memory allocation failures??
> > 
> Inside "if" statement if kzalloc fails here then we don't set
> route->path_rec_inbound or outbound;

But why don't we propagate an ENOMEM failure code?
> > > +static void ib_sa_pr_callback_multiple(struct ib_sa_path_query *query,
> > > +				       int status, int num_prs,
> > > +				       struct ib_path_rec_data *rec_data)
> > > +{
> > > +	struct sa_path_rec *rec;
> > > +	int i;
> > > +
> > > +	rec = kvcalloc(num_prs, sizeof(*rec), GFP_KERNEL);
> > > +	if (!rec) {
> > > +		query->callback(-ENOMEM, NULL, 0, query->context);
> > > +		return;
> > > +	}
> > 
> > This all seems really wild, why are we allocating memory so many times
> > on this path? Have ib_nl_process_good_resolve_rsp unpack the mad
> > instead of storing the raw format
> > 
> > It would also be good to make resp_pr_data always valid so all these
> > special paths don't need to exist.
> 
> The ib_sa_pr_callback_single() uses stack variable "rec" but
> ib_sa_pr_callback_multiple() uses malloc because there are multiple PRs.
> 
> ib_sa_path_rec_callback is also used by ib_post_send_mad(), which always
> have single PR and saved in mad->data, so always set resp_pr_data in netlink
> case is not enough.

We should always be able to point resp_pr_data to some kind of
storage, even if it is stack storage.

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel
  2022-09-23 13:13       ` Jason Gunthorpe
@ 2022-09-23 14:24         ` Mark Zhang
  2022-09-23 14:50           ` Mark Zhang
  2022-09-23 18:19           ` Jason Gunthorpe
  0 siblings, 2 replies; 12+ messages in thread
From: Mark Zhang @ 2022-09-23 14:24 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Leon Romanovsky, Bart Van Assche, linux-rdma, Mark Bloch

On 9/23/2022 9:13 PM, Jason Gunthorpe wrote:
> On Fri, Sep 23, 2022 at 09:40:22AM +0800, Mark Zhang wrote:
>> On 9/22/2022 9:58 PM, Jason Gunthorpe wrote:
>>> On Thu, Sep 08, 2022 at 01:09:01PM +0300, Leon Romanovsky wrote:
>>>
>>>> +static void route_set_path_rec_inbound(struct cma_work *work,
>>>> +				       struct sa_path_rec *path_rec)
>>>> +{
>>>> +	struct rdma_route *route = &work->id->id.route;
>>>> +
>>>> +	if (!route->path_rec_inbound) {
>>>> +		route->path_rec_inbound =
>>>> +			kzalloc(sizeof(*route->path_rec_inbound), GFP_KERNEL);
>>>> +		if (!route->path_rec_inbound)
>>>> +			return;
>>>> +	}
>>>> +
>>>> +	*route->path_rec_inbound = *path_rec;
>>>> +}
>>>
>>> We are just ignoring these memory allocation failures??
>>>
>> Inside "if" statement if kzalloc fails here then we don't set
>> route->path_rec_inbound or outbound;
> 
> But why don't we propogate a ENOMEM failure code?

Because inbound/outbound PRs are optional, even if they are provided they
can still be ignored if cma is not able to set them (e.g. a memory
allocation failure in this case).

>>>> +static void ib_sa_pr_callback_multiple(struct ib_sa_path_query *query,
>>>> +				       int status, int num_prs,
>>>> +				       struct ib_path_rec_data *rec_data)
>>>> +{
>>>> +	struct sa_path_rec *rec;
>>>> +	int i;
>>>> +
>>>> +	rec = kvcalloc(num_prs, sizeof(*rec), GFP_KERNEL);
>>>> +	if (!rec) {
>>>> +		query->callback(-ENOMEM, NULL, 0, query->context);
>>>> +		return;
>>>> +	}
>>>
>>> This all seems really wild, why are we allocating memory so many times
>>> on this path? Have ib_nl_process_good_resolve_rsp unpack the mad
>>> instead of storing the raw format
>>>
>>> It would also be good to make resp_pr_data always valid so all these
>>> special paths don't need to exist.
>>
>> The ib_sa_pr_callback_single() uses stack variable "rec" but
>> ib_sa_pr_callback_multiple() uses malloc because there are multiple PRs.
>>
>> ib_sa_path_rec_callback is also used by ib_post_send_mad(), which always
>> have single PR and saved in mad->data, so always set resp_pr_data in netlink
>> case is not enough.
> 
> We should always be able to point resp_pr_data to some kind of
> storage, even if it is stack storage.

The idea is:
- Single PR: the PR is in mad->data; used by both netlink and
  ib_post_send_mad();
- Multiple PRs: the PRs are in resp_pr_data, in "ib_path_rec_data"
  structure format; currently used by netlink only.

So if we want to always use resp_pr_data, then in the single-PR case we
need to copy mad->data to resp_pr_data in both netlink and
ib_post_send_mad(), and turn it into "ib_path_rec_data" structure format.
This adds complexity for the single-PR case, which should be the most
common situation.

We use malloc instead of the stack for resp_pr_data and the multiple-PR
unpack because sizeof(sa_path_rec) = 72 bytes; we support 3 PRs now and
there may be more in the future.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel
  2022-09-23 14:24         ` Mark Zhang
@ 2022-09-23 14:50           ` Mark Zhang
  2022-09-23 18:19           ` Jason Gunthorpe
  1 sibling, 0 replies; 12+ messages in thread
From: Mark Zhang @ 2022-09-23 14:50 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Leon Romanovsky, Bart Van Assche, linux-rdma, Mark Bloch

On 9/23/2022 10:24 PM, Mark Zhang wrote:
> 
> On 9/23/2022 9:13 PM, Jason Gunthorpe wrote:
>> On Fri, Sep 23, 2022 at 09:40:22AM +0800, Mark Zhang wrote:
>>> On 9/22/2022 9:58 PM, Jason Gunthorpe wrote:
>>>> On Thu, Sep 08, 2022 at 01:09:01PM +0300, Leon Romanovsky wrote:
>>>>
>>>>> +static void route_set_path_rec_inbound(struct cma_work *work,
>>>>> +                                 struct sa_path_rec *path_rec)
>>>>> +{
>>>>> +  struct rdma_route *route = &work->id->id.route;
>>>>> +
>>>>> +  if (!route->path_rec_inbound) {
>>>>> +          route->path_rec_inbound =
>>>>> +                  kzalloc(sizeof(*route->path_rec_inbound), 
>>>>> GFP_KERNEL);
>>>>> +          if (!route->path_rec_inbound)
>>>>> +                  return;
>>>>> +  }
>>>>> +
>>>>> +  *route->path_rec_inbound = *path_rec;
>>>>> +}
>>>>
>>>> We are just ignoring these memory allocation failures??
>>>>
>>> Inside "if" statement if kzalloc fails here then we don't set
>>> route->path_rec_inbound or outbound;
>>
>> But why don't we propogate a ENOMEM failure code?
> 
> Because inbound/outbound PRs are optional, so even they are provided
> they can still be ignored if cma is not able to set them (e.g. memory
> allocation failure in this case).
> 
>>>>> +static void ib_sa_pr_callback_multiple(struct ib_sa_path_query 
>>>>> *query,
>>>>> +                                 int status, int num_prs,
>>>>> +                                 struct ib_path_rec_data *rec_data)
>>>>> +{
>>>>> +  struct sa_path_rec *rec;
>>>>> +  int i;
>>>>> +
>>>>> +  rec = kvcalloc(num_prs, sizeof(*rec), GFP_KERNEL);
>>>>> +  if (!rec) {
>>>>> +          query->callback(-ENOMEM, NULL, 0, query->context);
>>>>> +          return;
>>>>> +  }
>>>>
>>>> This all seems really wild, why are we allocating memory so many times
>>>> on this path? Have ib_nl_process_good_resolve_rsp unpack the mad
>>>> instead of storing the raw format
>>>>
>>>> It would also be good to make resp_pr_data always valid so all these
>>>> special paths don't need to exist.
>>>
>>> The ib_sa_pr_callback_single() uses stack variable "rec" but
>>> ib_sa_pr_callback_multiple() uses malloc because there are multiple PRs.
>>>
>>> ib_sa_path_rec_callback is also used by ib_post_send_mad(), which always
>>> have single PR and saved in mad->data, so always set resp_pr_data in 
>>> netlink
>>> case is not enough.
>>
>> We should always be able to point resp_pr_data to some kind of
>> storage, even if it is stack storage.
> 
> The idea is:
> - Single PR: PR in mad->data; Used by both netlink and
>    ib_post_send_mad();
> - Multiple PRs: PRs in resp_pr_data, with "ib_path_rec_data" structure
>    format; Currently used by netlink only.
> So if we want to always use resp_pr_data then in single-PR case we need
> to copy mad->data to resp_pr_data in both netlink and
> ib_post_send_mad(), and turn it into "ib_path_rec_data" structure
> format. This adds complexity for single-PR, which should be most of
> situations?

Sorry, what I mean is that resp_pr_data is used by netlink while mad->data
is used by post_send_mad(), so if we want to always use resp_pr_data then
in post_send_mad() we need to malloc, copy and convert. Besides, if the
malloc fails then we still need to use mad->data, unless we always use the
stack no matter how many PRs there are.

> Use malloc instead of stack for resp_pr_data and multiple-PR unpack is
> because sizeof(sa_path_rec)=72B, now we supports 3 and there might be
> more in future..
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel
  2022-09-23 14:24         ` Mark Zhang
  2022-09-23 14:50           ` Mark Zhang
@ 2022-09-23 18:19           ` Jason Gunthorpe
  1 sibling, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2022-09-23 18:19 UTC (permalink / raw)
  To: Mark Zhang; +Cc: Leon Romanovsky, Bart Van Assche, linux-rdma, Mark Bloch

On Fri, Sep 23, 2022 at 10:24:35PM +0800, Mark Zhang wrote:
> On 9/23/2022 9:13 PM, Jason Gunthorpe wrote:
> > On Fri, Sep 23, 2022 at 09:40:22AM +0800, Mark Zhang wrote:
> > > On 9/22/2022 9:58 PM, Jason Gunthorpe wrote:
> > > > On Thu, Sep 08, 2022 at 01:09:01PM +0300, Leon Romanovsky wrote:
> > > > 
> > > > > +static void route_set_path_rec_inbound(struct cma_work *work,
> > > > > +				       struct sa_path_rec *path_rec)
> > > > > +{
> > > > > +	struct rdma_route *route = &work->id->id.route;
> > > > > +
> > > > > +	if (!route->path_rec_inbound) {
> > > > > +		route->path_rec_inbound =
> > > > > +			kzalloc(sizeof(*route->path_rec_inbound), GFP_KERNEL);
> > > > > +		if (!route->path_rec_inbound)
> > > > > +			return;
> > > > > +	}
> > > > > +
> > > > > +	*route->path_rec_inbound = *path_rec;
> > > > > +}
> > > > 
> > > > We are just ignoring these memory allocation failures??
> > > > 
> > > Inside "if" statement if kzalloc fails here then we don't set
> > > route->path_rec_inbound or outbound;
> > 
> > But why don't we propogate a ENOMEM failure code?
> 
> Because inbound/outbound PRs are optional, so even they are provided they
> can still be ignored if cma is not able to set them (e.g. memory allocation
> failure in this case).

This isn't how we do things: the netlink operation had a failure, and the
failure should propagate. If we've run out of memory, the CM connection
is going to blow up very quickly anyhow.

> > We should always be able to point resp_pr_data to some kind of
> > storage, even if it is stack storage.
> 
> The idea is:
> - Single PR: PR in mad->data; Used by both netlink and
>   ib_post_send_mad();
> - Multiple PRs: PRs in resp_pr_data, with "ib_path_rec_data" structure
>   format; Currently used by netlink only.
> 
> So if we want to always use resp_pr_data then in single-PR case we need to
> copy mad->data to resp_pr_data in both netlink and ib_post_send_mad(), and
> turn it into "ib_path_rec_data" structure format. This adds complexity for
> single-PR, which should be most of situations?

You don't copy it, you unpack it. We already have to unpack it, just
to a (large!) stack var - unpack it to resp_pr_data instead.

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2022-09-23 18:19 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-08 10:08 [PATCH rdma-next 0/4] Support multiple path records Leon Romanovsky
2022-09-08 10:09 ` [PATCH rdma-next 1/4] RDMA/core: Rename rdma_route.num_paths field to num_pri_alt_paths Leon Romanovsky
2022-09-08 10:09 ` [PATCH rdma-next 2/4] RDMA/cma: Multiple path records support with netlink channel Leon Romanovsky
2022-09-22 13:58   ` Jason Gunthorpe
2022-09-23  1:40     ` Mark Zhang
2022-09-23 13:13       ` Jason Gunthorpe
2022-09-23 14:24         ` Mark Zhang
2022-09-23 14:50           ` Mark Zhang
2022-09-23 18:19           ` Jason Gunthorpe
2022-09-08 10:09 ` [PATCH rdma-next 3/4] RDMA/cm: Use SLID in the work completion as the DLID in responder side Leon Romanovsky
2022-09-08 10:09 ` [PATCH rdma-next 4/4] RDMA/cm: Use DLID from inbound/outbound PathRecords as the datapath DLID Leon Romanovsky
2022-09-22  9:36 ` [PATCH rdma-next 0/4] Support multiple path records Leon Romanovsky
