linux-rdma.vger.kernel.org archive mirror
* [PATCH v9 0/4] Sending kernel pathrecord query to user cache server
@ 2015-08-14 12:52 kaike.wan-ral2JQCrhuEAvxtiuMwx3w
       [not found] ` <1439556729-27876-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 12+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-08-14 12:52 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan

From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

An SA cache is critical for fabric scalability and performance.
In user space, the ibacm application provides a good example of a
pathrecord cache for address and route resolution. With the recent
implementation of the provider architecture, ibacm is also more extensible
as an SA cache. In the kernel, ipoib implements its own small pathrecord
cache, but that cache is not available for general use. Furthermore,
implementing the SA cache in user space offers better flexibility, larger
capacity, and more robustness for the system.

This patch series implements a mechanism that allows ib_sa to send a
pathrecord query to a user-space application (e.g. ibacm) through netlink.
The same mechanism could potentially be extended to other SA queries.

With a customized test implemented in the rdma_cm module (not included in
this series), the time to retrieve 1 million pathrecords dropped from
47053 jiffies (47.053 seconds) to 10339 jiffies (10.339 seconds) on a
two-node system, a reduction of 78%.
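
For readers who want to check the 78% figure, the arithmetic can be
sketched in plain user-space C. The helper name reduction_percent() is
purely illustrative and is not part of this series; it rounds to the
nearest whole percent.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the series): percentage reduction
 * from `before` to `after`, rounded to the nearest whole percent.
 */
static unsigned int reduction_percent(unsigned long before,
				      unsigned long after)
{
	assert(before > 0 && after <= before);
	return (unsigned int)(((before - after) * 100 + before / 2) / before);
}
```

Calling reduction_percent(47053, 10339) evaluates (36714 * 100) / 47053,
which is roughly 78.03 and rounds to 78.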

This patch series is built against Doug's to-be-rebased/for-4.3 branch
after reverting the v8 series and adding Jason's patch:

https://patchwork.kernel.org/patch/6952841/

Some tests with namespaces have been performed:
1. An unprivileged user cannot bind to the RDMA_NL_GROUP_LS multicast
   group;
2. An unprivileged user cannot create a new network namespace. However,
   it can create a new user namespace together with a new network
   namespace by using clone() with CLONE_NEWUSER | CLONE_NEWNET flags;
3. In the user and network namespaces created by an unprivileged user,
   the user can be mapped to root and is thus able to bind to the
   RDMA_NL_GROUP_LS multicast group. However, it can neither send
   requests to the kernel RDMA netlink code nor receive requests from
   it, because the kernel RDMA netlink code is associated with the
   init_net network namespace, which in turn is associated with the
   init_user_ns user namespace.
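
A rough user-space probe for test 1 might look like the sketch below. It
is not the test program that was actually used. The ex_ names are
hypothetical, the group number 3 is the value implied by the enum in
patch 1, and the sketch only attempts the bind from the current context
(it does not create the namespaces from tests 2 and 3).

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Assumption: RDMA_NL_GROUP_LS is the third group in the patch 1 enum. */
#define EX_RDMA_NL_GROUP_LS 3

/* Legacy nl_groups bitmask for a 1-based multicast group number. */
static unsigned int ex_nl_group_mask(unsigned int group)
{
	assert(group >= 1 && group <= 32);
	return 1u << (group - 1);
}

/* Returns 0 if the bind to the LS group succeeded, -1 otherwise. */
static int ex_try_bind_ls_group(void)
{
	struct sockaddr_nl addr;
	int fd, ret;

	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_RDMA);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.nl_family = AF_NETLINK;
	addr.nl_groups = ex_nl_group_mask(EX_RDMA_NL_GROUP_LS);

	ret = bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	close(fd);
	return ret < 0 ? -1 : 0;
}
```

Per test 1, ex_try_bind_ls_group() would be expected to return -1 for an
unprivileged user in the initial namespaces; the exact errno is not
stated above and is left unchecked here.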

Changes since v8:
-Patch 1:
  - Remove status attribute;
-Patch 4:
  - Add an attribute policy to validate incoming netlink requests or
    responses with nla_parse();
  - Change the check for incoming pathrecord data flags;
  - Add a security check for incoming netlink requests or responses;
  - Add a cast in the ibnl_put_msg() call to avoid a 0-Day build warning.

Changes since v7:
-Patch 1:
  - Replace RDMA_NL_SA with RDMA_NL_LS;
  - Remove the defines for status attribute;
  - Remove RDMA_NL_LS_F_OK;
  - Remove a few structures for simple attribute data;
  - Add the family header for RESOLVE request;
  - Add comments about different attributes.
-Patch 2:
  - Add a helper function to receive netlink responses;
  - Modify ibnl_rcv_msg() to invoke the callback directly for netlink
    response and the SET_TIMEOUT request instead of netlink_dump_start.
-Patch 4:
  - Replace the netlink macros with static inline functions;
  - Simplify the request path with fewer and direct function calls;
  - Fold the netlink request structure into the ib_sa_query structure;
  - Drop the numb_path comparison when determining path_use;
  - Encode the RESOLVE family header when building the request;
  - Determine the anticipated pathrecord data flags by path_use;
  - Use nla_parse() to parse the SET_TIMEOUT request message.

Changes since v6:
- Patch 4:
  - Replace __u8/16/64 with u8/16/64;
  - Remove the pathrecord flags testing when checking a netlink response;
  - Remove a few error prints.

Changes since v5:
- Patch 1:
  - Replace reversible and numb_path attributes with path_use attribute.
  - Define Mandatory attribute flag.
  - Define attribute data types in cpu byte order.
- Patch 4:
  - Change the calculation of total attribute len;
  - Modify the setting of attributes.

Changes since v4:
- Patch 1: rename LS_NLA_TYPE_NUM_PATH as LS_NLA_TYPE_NUMB_PATH.
- Patch 4: remove the renaming of LS_NLA_TYPE_NUM_PATH as
           LS_NLA_TYPE_NUMB_PATH.

Changes since v3:
- Patch 1: add basic RESOLVE attribute types.
- Patch 4: change the encoding of the RESOLVE request message based on
  the new attribute types and the input comp_mask. Change the response
  handling by iterating all attributes.

Changes since v2:
- Redesign the communication protocol between the kernel and the user
  space application. Instead of the MAD packet format, the new protocol
  uses a netlink message header and attributes to exchange requests and
  responses between the kernel and user space. The design was described
  here:
  http://www.spinics.net/lists/linux-rdma/msg25621.html

Changes since v1:
- Move kzalloc changes into a separate patch (Patch 3).
- Remove redundant include line (Patch 4).
- Rename struct rdma_nl_resp_msg as struct ib_nl_resp_msg (Patch 4).

Kaike Wan (4):
  IB/netlink: Add defines for local service requests through netlink
  IB/core: Add rdma netlink helper functions
  IB/sa: Allocate SA query with kzalloc
  IB/sa: Route SA pathrecord query through netlink

 drivers/infiniband/core/netlink.c  |   55 ++++
 drivers/infiniband/core/sa_query.c |  509 +++++++++++++++++++++++++++++++++++-
 include/rdma/rdma_netlink.h        |    7 +
 include/uapi/rdma/rdma_netlink.h   |   82 ++++++
 4 files changed, 648 insertions(+), 5 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH v9 1/4] IB/netlink: Add defines for local service requests through netlink
       [not found] ` <1439556729-27876-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2015-08-14 12:52   ` kaike.wan-ral2JQCrhuEAvxtiuMwx3w
  2015-08-14 12:52   ` [PATCH v9 2/4] IB/core: Add rdma netlink helper functions kaike.wan-ral2JQCrhuEAvxtiuMwx3w
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-08-14 12:52 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan, John Fleck, Ira Weiny

From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

This patch adds netlink defines for the local service client, the local
service group, the local service operations, and related attributes.
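
As a hedged illustration of how a listener might interpret the attribute
type field defined by this patch, the sketch below mirrors the
RDMA_NLA_F_MANDATORY flag and the type mask in user-space C. The EX_ and
ex_ names are hypothetical stand-ins; the flag values for NLA_F_NESTED and
NLA_F_NET_BYTEORDER are the standard netlink ones, and the DGID type
number is the value implied by the attribute enum.

```c
#include <assert.h>
#include <stdint.h>

/* Local mirrors of the flag bits (illustrative names). */
#define EX_NLA_F_NESTED          (1 << 15)
#define EX_NLA_F_NET_BYTEORDER   (1 << 14)
#define EX_RDMA_NLA_F_MANDATORY  (1 << 13)
#define EX_RDMA_NLA_TYPE_MASK    (~(EX_NLA_F_NESTED | \
				    EX_NLA_F_NET_BYTEORDER | \
				    EX_RDMA_NLA_F_MANDATORY))

/* Assumption: DGID is the fifth enumerator (value 4) in the enum. */
enum { EX_LS_NLA_TYPE_DGID = 4 };

/* Combine a base attribute type with the optional mandatory flag. */
static uint16_t ex_make_attr_type(uint16_t base, int mandatory)
{
	assert((base & ~EX_RDMA_NLA_TYPE_MASK) == 0);
	return (uint16_t)(base | (mandatory ? EX_RDMA_NLA_F_MANDATORY : 0));
}

/* Strip the flag bits to recover the base attribute type. */
static uint16_t ex_attr_base_type(uint16_t nla_type)
{
	return (uint16_t)(nla_type & EX_RDMA_NLA_TYPE_MASK);
}

/* Report whether the sender marked the attribute as mandatory. */
static int ex_attr_is_mandatory(uint16_t nla_type)
{
	return (nla_type & EX_RDMA_NLA_F_MANDATORY) != 0;
}
```

A receiver would typically mask the type before comparing it against the
LS_NLA_TYPE_* values, since the flag bits share the same 16-bit field.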

Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 include/uapi/rdma/rdma_netlink.h |   82 ++++++++++++++++++++++++++++++++++++++
 1 files changed, 82 insertions(+), 0 deletions(-)

diff --git a/include/uapi/rdma/rdma_netlink.h b/include/uapi/rdma/rdma_netlink.h
index 6e4bb42..c19a5dc 100644
--- a/include/uapi/rdma/rdma_netlink.h
+++ b/include/uapi/rdma/rdma_netlink.h
@@ -7,12 +7,14 @@ enum {
 	RDMA_NL_RDMA_CM = 1,
 	RDMA_NL_NES,
 	RDMA_NL_C4IW,
+	RDMA_NL_LS,	/* RDMA Local Services */
 	RDMA_NL_NUM_CLIENTS
 };
 
 enum {
 	RDMA_NL_GROUP_CM = 1,
 	RDMA_NL_GROUP_IWPM,
+	RDMA_NL_GROUP_LS,
 	RDMA_NL_NUM_GROUPS
 };
 
@@ -128,5 +130,85 @@ enum {
 	IWPM_NLA_ERR_MAX
 };
 
+/*
+ * Local service operations:
+ *   RESOLVE - The client requests the local service to resolve a path.
+ *   SET_TIMEOUT - The local service requests the client to set the timeout.
+ */
+enum {
+	RDMA_NL_LS_OP_RESOLVE = 0,
+	RDMA_NL_LS_OP_SET_TIMEOUT,
+	RDMA_NL_LS_NUM_OPS
+};
+
+/* Local service netlink message flags */
+#define RDMA_NL_LS_F_ERR	0x0100	/* Failed response */
+
+/*
+ * Local service resolve operation family header.
+ * The layout for the resolve operation:
+ *    nlmsg header
+ *    family header
+ *    attributes
+ */
+
+/*
+ * Local service path use:
+ * Specify how the path(s) will be used.
+ *   ALL - For connected CM operation (6 pathrecords)
+ *   UNIDIRECTIONAL - For unidirectional UD (1 pathrecord)
+ *   GMP - For miscellaneous GMP like operation (at least 1 reversible
+ *         pathrecord)
+ */
+enum {
+	LS_RESOLVE_PATH_USE_ALL = 0,
+	LS_RESOLVE_PATH_USE_UNIDIRECTIONAL,
+	LS_RESOLVE_PATH_USE_GMP,
+	LS_RESOLVE_PATH_USE_MAX
+};
+
+#define LS_DEVICE_NAME_MAX 64
+
+struct rdma_ls_resolve_header {
+	__u8 device_name[LS_DEVICE_NAME_MAX];
+	__u8 port_num;
+	__u8 path_use;
+};
+
+/* Local service attribute type */
+#define RDMA_NLA_F_MANDATORY	(1 << 13)
+#define RDMA_NLA_TYPE_MASK	(~(NLA_F_NESTED | NLA_F_NET_BYTEORDER | \
+				  RDMA_NLA_F_MANDATORY))
+
+/*
+ * Local service attributes:
+ *   Attr Name       Size                       Byte order
+ *   -----------------------------------------------------
+ *   PATH_RECORD     struct ib_path_rec_data
+ *   TIMEOUT         u32                        cpu
+ *   SERVICE_ID      u64                        cpu
+ *   DGID            u8[16]                     BE
+ *   SGID            u8[16]                     BE
+ *   TCLASS          u8
+ *   PKEY            u16                        cpu
+ *   QOS_CLASS       u16                        cpu
+ */
+enum {
+	LS_NLA_TYPE_UNSPEC = 0,
+	LS_NLA_TYPE_PATH_RECORD,
+	LS_NLA_TYPE_TIMEOUT,
+	LS_NLA_TYPE_SERVICE_ID,
+	LS_NLA_TYPE_DGID,
+	LS_NLA_TYPE_SGID,
+	LS_NLA_TYPE_TCLASS,
+	LS_NLA_TYPE_PKEY,
+	LS_NLA_TYPE_QOS_CLASS,
+	LS_NLA_TYPE_MAX
+};
+
+/* Local service DGID/SGID attribute: big endian */
+struct rdma_nla_ls_gid {
+	__u8		gid[16];
+};
 
 #endif /* _UAPI_RDMA_NETLINK_H */
-- 
1.7.1


* [PATCH v9 2/4] IB/core: Add rdma netlink helper functions
       [not found] ` <1439556729-27876-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  2015-08-14 12:52   ` [PATCH v9 1/4] IB/netlink: Add defines for local service requests through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
@ 2015-08-14 12:52   ` kaike.wan-ral2JQCrhuEAvxtiuMwx3w
  2015-08-14 12:52   ` [PATCH v9 3/4] IB/sa: Allocate SA query with kzalloc kaike.wan-ral2JQCrhuEAvxtiuMwx3w
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 12+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-08-14 12:52 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan, John Fleck, Ira Weiny

From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

This patch adds a function to check if listeners for a netlink multicast
group are present. It also adds a function to receive netlink response
messages.
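
The response-receiving helper walks the buffer and handles messages until
it runs out of data or meets the first request. A user-space sketch of
that walk over a raw byte buffer is shown below, under the assumption
that counting the responses is enough to illustrate the loop. The
function name is hypothetical; the kernel code operates on an skb
instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/netlink.h>

/*
 * Count the response messages at the start of `buf`. Stop at the first
 * request, at a malformed header, or when the data runs out.
 */
static size_t ex_count_leading_responses(const unsigned char *buf,
					 size_t len)
{
	size_t count = 0;

	assert(buf != NULL);
	while (len >= (size_t)NLMSG_HDRLEN) {
		struct nlmsghdr nlh;
		size_t msglen;

		memcpy(&nlh, buf, sizeof(nlh));
		if (nlh.nlmsg_len < (size_t)NLMSG_HDRLEN ||
		    len < nlh.nlmsg_len)
			break;

		/* Responses only: a request ends the walk. */
		if (nlh.nlmsg_flags & NLM_F_REQUEST)
			break;

		count++;

		/* Advance by the aligned length, capped at what is left. */
		msglen = NLMSG_ALIGN(nlh.nlmsg_len);
		if (msglen > len)
			msglen = len;
		buf += msglen;
		len -= msglen;
	}
	return count;
}
```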

Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/core/netlink.c |   55 +++++++++++++++++++++++++++++++++++++
 include/rdma/rdma_netlink.h       |    7 +++++
 2 files changed, 62 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/core/netlink.c b/drivers/infiniband/core/netlink.c
index 23dd5a5..d47df93 100644
--- a/drivers/infiniband/core/netlink.c
+++ b/drivers/infiniband/core/netlink.c
@@ -49,6 +49,14 @@ static DEFINE_MUTEX(ibnl_mutex);
 static struct sock *nls;
 static LIST_HEAD(client_list);
 
+int ibnl_chk_listeners(unsigned int group)
+{
+	if (netlink_has_listeners(nls, group) == 0)
+		return -1;
+	return 0;
+}
+EXPORT_SYMBOL(ibnl_chk_listeners);
+
 int ibnl_add_client(int index, int nops,
 		    const struct ibnl_client_cbs cb_table[])
 {
@@ -151,6 +159,23 @@ static int ibnl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 			    !client->cb_table[op].dump)
 				return -EINVAL;
 
+			/*
+			 * For response or local service set_timeout request,
+			 * there is no need to use netlink_dump_start.
+			 */
+			if (!(nlh->nlmsg_flags & NLM_F_REQUEST) ||
+			    (index == RDMA_NL_LS &&
+			     op == RDMA_NL_LS_OP_SET_TIMEOUT)) {
+				struct netlink_callback cb = {
+					.skb = skb,
+					.nlh = nlh,
+					.dump = client->cb_table[op].dump,
+					.module = client->cb_table[op].module,
+				};
+
+				return cb.dump(skb, &cb);
+			}
+
 			{
 				struct netlink_dump_control c = {
 					.dump = client->cb_table[op].dump,
@@ -165,9 +190,39 @@ static int ibnl_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh)
 	return -EINVAL;
 }
 
+static void ibnl_rcv_reply_skb(struct sk_buff *skb)
+{
+	struct nlmsghdr *nlh;
+	int msglen;
+
+	/*
+	 * Process responses until there is no more message or the first
+	 * request. Generally speaking, it is not recommended to mix responses
+	 * with requests.
+	 */
+	while (skb->len >= nlmsg_total_size(0)) {
+		nlh = nlmsg_hdr(skb);
+
+		if (nlh->nlmsg_len < NLMSG_HDRLEN || skb->len < nlh->nlmsg_len)
+			return;
+
+		/* Handle response only */
+		if (nlh->nlmsg_flags & NLM_F_REQUEST)
+			return;
+
+		ibnl_rcv_msg(skb, nlh);
+
+		msglen = NLMSG_ALIGN(nlh->nlmsg_len);
+		if (msglen > skb->len)
+			msglen = skb->len;
+		skb_pull(skb, msglen);
+	}
+}
+
 static void ibnl_rcv(struct sk_buff *skb)
 {
 	mutex_lock(&ibnl_mutex);
+	ibnl_rcv_reply_skb(skb);
 	netlink_rcv_skb(skb, &ibnl_rcv_msg);
 	mutex_unlock(&ibnl_mutex);
 }
diff --git a/include/rdma/rdma_netlink.h b/include/rdma/rdma_netlink.h
index 0790882..5852661 100644
--- a/include/rdma/rdma_netlink.h
+++ b/include/rdma/rdma_netlink.h
@@ -77,4 +77,11 @@ int ibnl_unicast(struct sk_buff *skb, struct nlmsghdr *nlh,
 int ibnl_multicast(struct sk_buff *skb, struct nlmsghdr *nlh,
 			unsigned int group, gfp_t flags);
 
+/**
+ * Check if there are any listeners to the netlink group
+ * @group: the netlink group ID
+ * Returns 0 if listeners are present, or a negative value if there are none.
+ */
+int ibnl_chk_listeners(unsigned int group);
+
 #endif /* _RDMA_NETLINK_H */
-- 
1.7.1


* [PATCH v9 3/4] IB/sa: Allocate SA query with kzalloc
       [not found] ` <1439556729-27876-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  2015-08-14 12:52   ` [PATCH v9 1/4] IB/netlink: Add defines for local service requests through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
  2015-08-14 12:52   ` [PATCH v9 2/4] IB/core: Add rdma netlink helper functions kaike.wan-ral2JQCrhuEAvxtiuMwx3w
@ 2015-08-14 12:52   ` kaike.wan-ral2JQCrhuEAvxtiuMwx3w
  2015-08-14 12:52   ` [PATCH v9 4/4] IB/sa: Route SA pathrecord query through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
  2015-08-21 23:07   ` [PATCH v9 0/4] Sending kernel pathrecord query to user cache server Jason Gunthorpe
  4 siblings, 0 replies; 12+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-08-14 12:52 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan, John Fleck, Ira Weiny

From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Replace kmalloc with kzalloc so that all uninitialized fields in the SA
query are zeroed out, avoiding unintended consequences. This prepares the
SA query structure to accept new fields in the future.
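
The effect can be illustrated with a user-space analogue, using calloc()
in place of kzalloc(). The struct and function names below are
hypothetical and much simpler than the real ib_sa_query; the point is
only that a field added later starts out as zero even if older call
sites never set it.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for a query structure (illustrative only). */
struct ex_query {
	int id;
	unsigned int flags;	/* added later; older callers never set it */
};

/* Allocate a zero-filled query, analogous to using kzalloc(). */
static struct ex_query *ex_query_alloc(int id)
{
	struct ex_query *q = calloc(1, sizeof(*q));

	if (!q)
		return NULL;
	q->id = id;
	return q;
}

/* Usage: returns 0 on success, -1 if the allocation failed. */
static int ex_query_demo(void)
{
	struct ex_query *q = ex_query_alloc(7);

	if (!q)
		return -1;
	assert(q->id == 7);
	assert(q->flags == 0);	/* guaranteed by the zeroing allocator */
	free(q);
	return 0;
}
```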

Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/core/sa_query.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index d40be36..968c66f 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -740,7 +740,7 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
 	agent = port->agent;
 
-	query = kmalloc(sizeof *query, gfp_mask);
+	query = kzalloc(sizeof(*query), gfp_mask);
 	if (!query)
 		return -ENOMEM;
 
@@ -862,7 +862,7 @@ int ib_sa_service_rec_query(struct ib_sa_client *client,
 	    method != IB_SA_METHOD_DELETE)
 		return -EINVAL;
 
-	query = kmalloc(sizeof *query, gfp_mask);
+	query = kzalloc(sizeof(*query), gfp_mask);
 	if (!query)
 		return -ENOMEM;
 
@@ -954,7 +954,7 @@ int ib_sa_mcmember_rec_query(struct ib_sa_client *client,
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
 	agent = port->agent;
 
-	query = kmalloc(sizeof *query, gfp_mask);
+	query = kzalloc(sizeof(*query), gfp_mask);
 	if (!query)
 		return -ENOMEM;
 
@@ -1051,7 +1051,7 @@ int ib_sa_guid_info_rec_query(struct ib_sa_client *client,
 	port  = &sa_dev->port[port_num - sa_dev->start_port];
 	agent = port->agent;
 
-	query = kmalloc(sizeof *query, gfp_mask);
+	query = kzalloc(sizeof(*query), gfp_mask);
 	if (!query)
 		return -ENOMEM;
 
-- 
1.7.1


* [PATCH v9 4/4] IB/sa: Route SA pathrecord query through netlink
       [not found] ` <1439556729-27876-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-08-14 12:52   ` [PATCH v9 3/4] IB/sa: Allocate SA query with kzalloc kaike.wan-ral2JQCrhuEAvxtiuMwx3w
@ 2015-08-14 12:52   ` kaike.wan-ral2JQCrhuEAvxtiuMwx3w
       [not found]     ` <1439556729-27876-5-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
  2015-08-21 23:07   ` [PATCH v9 0/4] Sending kernel pathrecord query to user cache server Jason Gunthorpe
  4 siblings, 1 reply; 12+ messages in thread
From: kaike.wan-ral2JQCrhuEAvxtiuMwx3w @ 2015-08-14 12:52 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Kaike Wan, John Fleck, Ira Weiny

From: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

This patch routes an SA pathrecord query to netlink first and processes the
response appropriately. If a failure is returned, the request is sent
through IB. Whether the request is routed to netlink first is determined by
the presence of a listener for the local service netlink multicast group.
If no user-space listener for that group is present, the request is sent
through IB, as is done today.
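
The routing decision can be summarised by the small user-space sketch
below. It is a simplified model, not the code in this patch: the enum and
function names are hypothetical, and the timeout case is included because
the timeout handler in the diff also falls back to posting the MAD.

```c
#include <assert.h>

/* Which transport ends up carrying the query (illustrative names). */
enum ex_transport { EX_VIA_IB, EX_VIA_NETLINK };

/* Possible outcomes of the netlink attempt. */
enum ex_nl_result { EX_NL_GOOD, EX_NL_ERR, EX_NL_TIMEOUT };

/*
 * Decide which transport finally answers the query. Without a listener
 * on the local service group the query goes straight to IB; with a
 * listener, anything but a good netlink response falls back to IB.
 */
static enum ex_transport ex_route_query(int have_listener,
					enum ex_nl_result nl_result)
{
	if (!have_listener)
		return EX_VIA_IB;

	switch (nl_result) {
	case EX_NL_GOOD:
		return EX_VIA_NETLINK;
	case EX_NL_ERR:
	case EX_NL_TIMEOUT:
		return EX_VIA_IB;
	}

	assert(!"unknown netlink result");
	return EX_VIA_IB;
}
```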

Signed-off-by: Kaike Wan <kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: John Fleck <john.fleck-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/core/sa_query.c |  501 +++++++++++++++++++++++++++++++++++-
 1 files changed, 500 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/sa_query.c b/drivers/infiniband/core/sa_query.c
index 968c66f..edcf568 100644
--- a/drivers/infiniband/core/sa_query.c
+++ b/drivers/infiniband/core/sa_query.c
@@ -45,12 +45,21 @@
 #include <uapi/linux/if_ether.h>
 #include <rdma/ib_pack.h>
 #include <rdma/ib_cache.h>
+#include <rdma/rdma_netlink.h>
+#include <net/netlink.h>
+#include <uapi/rdma/ib_user_sa.h>
+#include <rdma/ib_marshall.h>
 #include "sa.h"
 
 MODULE_AUTHOR("Roland Dreier");
 MODULE_DESCRIPTION("InfiniBand subnet administration query support");
 MODULE_LICENSE("Dual BSD/GPL");
 
+#define IB_SA_LOCAL_SVC_TIMEOUT_MIN		100
+#define IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT		2000
+#define IB_SA_LOCAL_SVC_TIMEOUT_MAX		200000
+static int sa_local_svc_timeout_ms = IB_SA_LOCAL_SVC_TIMEOUT_DEFAULT;
+
 struct ib_sa_sm_ah {
 	struct ib_ah        *ah;
 	struct kref          ref;
@@ -80,8 +89,16 @@ struct ib_sa_query {
 	struct ib_mad_send_buf *mad_buf;
 	struct ib_sa_sm_ah     *sm_ah;
 	int			id;
+	u32			flags;
+	struct list_head	list; /* Local svc request list */
+	u32			seq; /* Local svc request sequence number */
+	unsigned long		timeout; /* Local svc timeout */
+	u8			path_use; /* How will the pathrecord be used */
 };
 
+#define IB_SA_ENABLE_LOCAL_SERVICE	0x00000001
+#define IB_SA_CANCEL			0x00000002
+
 struct ib_sa_service_query {
 	void (*callback)(int, struct ib_sa_service_rec *, void *);
 	void *context;
@@ -106,6 +123,26 @@ struct ib_sa_mcmember_query {
 	struct ib_sa_query sa_query;
 };
 
+static LIST_HEAD(ib_nl_request_list);
+static DEFINE_SPINLOCK(ib_nl_request_lock);
+static atomic_t ib_nl_sa_request_seq;
+static struct workqueue_struct *ib_nl_wq;
+static struct delayed_work ib_nl_timed_work;
+static const struct nla_policy ib_nl_policy[LS_NLA_TYPE_MAX] = {
+	[LS_NLA_TYPE_PATH_RECORD]	= {.type = NLA_BINARY,
+		.len = sizeof(struct ib_path_rec_data)},
+	[LS_NLA_TYPE_TIMEOUT]		= {.type = NLA_U32},
+	[LS_NLA_TYPE_SERVICE_ID]	= {.type = NLA_U64},
+	[LS_NLA_TYPE_DGID]		= {.type = NLA_BINARY,
+		.len = sizeof(struct rdma_nla_ls_gid)},
+	[LS_NLA_TYPE_SGID]		= {.type = NLA_BINARY,
+		.len = sizeof(struct rdma_nla_ls_gid)},
+	[LS_NLA_TYPE_TCLASS]		= {.type = NLA_U8},
+	[LS_NLA_TYPE_PKEY]		= {.type = NLA_U16},
+	[LS_NLA_TYPE_QOS_CLASS]		= {.type = NLA_U16},
+};
+
+
 static void ib_sa_add_one(struct ib_device *device);
 static void ib_sa_remove_one(struct ib_device *device, void *client_data);
 
@@ -381,6 +418,427 @@ static const struct ib_field guidinfo_rec_table[] = {
 	  .size_bits    = 512 },
 };
 
+static inline void ib_sa_disable_local_svc(struct ib_sa_query *query)
+{
+	query->flags &= ~IB_SA_ENABLE_LOCAL_SERVICE;
+}
+
+static inline int ib_sa_query_cancelled(struct ib_sa_query *query)
+{
+	return (query->flags & IB_SA_CANCEL);
+}
+
+static void ib_nl_set_path_rec_attrs(struct sk_buff *skb,
+				     struct ib_sa_query *query)
+{
+	struct ib_sa_path_rec *sa_rec = query->mad_buf->context[1];
+	struct ib_sa_mad *mad = query->mad_buf->mad;
+	ib_sa_comp_mask comp_mask = mad->sa_hdr.comp_mask;
+	u16 val16;
+	u64 val64;
+	struct rdma_ls_resolve_header *header;
+
+	query->mad_buf->context[1] = NULL;
+
+	/* Construct the family header first */
+	header = (struct rdma_ls_resolve_header *)
+		skb_put(skb, NLMSG_ALIGN(sizeof(*header)));
+	memcpy(header->device_name, query->port->agent->device->name,
+	       LS_DEVICE_NAME_MAX);
+	header->port_num = query->port->port_num;
+
+	if ((comp_mask & IB_SA_PATH_REC_REVERSIBLE) &&
+	    sa_rec->reversible != 0)
+		query->path_use = LS_RESOLVE_PATH_USE_GMP;
+	else
+		query->path_use = LS_RESOLVE_PATH_USE_UNIDIRECTIONAL;
+	header->path_use = query->path_use;
+
+	/* Now build the attributes */
+	if (comp_mask & IB_SA_PATH_REC_SERVICE_ID) {
+		val64 = be64_to_cpu(sa_rec->service_id);
+		nla_put(skb, RDMA_NLA_F_MANDATORY | LS_NLA_TYPE_SERVICE_ID,
+			sizeof(val64), &val64);
+	}
+	if (comp_mask & IB_SA_PATH_REC_DGID)
+		nla_put(skb, RDMA_NLA_F_MANDATORY | LS_NLA_TYPE_DGID,
+			sizeof(sa_rec->dgid), &sa_rec->dgid);
+	if (comp_mask & IB_SA_PATH_REC_SGID)
+		nla_put(skb, RDMA_NLA_F_MANDATORY | LS_NLA_TYPE_SGID,
+			sizeof(sa_rec->sgid), &sa_rec->sgid);
+	if (comp_mask & IB_SA_PATH_REC_TRAFFIC_CLASS)
+		nla_put(skb, RDMA_NLA_F_MANDATORY | LS_NLA_TYPE_TCLASS,
+			sizeof(sa_rec->traffic_class), &sa_rec->traffic_class);
+
+	if (comp_mask & IB_SA_PATH_REC_PKEY) {
+		val16 = be16_to_cpu(sa_rec->pkey);
+		nla_put(skb, RDMA_NLA_F_MANDATORY | LS_NLA_TYPE_PKEY,
+			sizeof(val16), &val16);
+	}
+	if (comp_mask & IB_SA_PATH_REC_QOS_CLASS) {
+		val16 = be16_to_cpu(sa_rec->qos_class);
+		nla_put(skb, RDMA_NLA_F_MANDATORY | LS_NLA_TYPE_QOS_CLASS,
+			sizeof(val16), &val16);
+	}
+}
+
+static int ib_nl_get_path_rec_attrs_len(ib_sa_comp_mask comp_mask)
+{
+	int len = 0;
+
+	if (comp_mask & IB_SA_PATH_REC_SERVICE_ID)
+		len += nla_total_size(sizeof(u64));
+	if (comp_mask & IB_SA_PATH_REC_DGID)
+		len += nla_total_size(sizeof(struct rdma_nla_ls_gid));
+	if (comp_mask & IB_SA_PATH_REC_SGID)
+		len += nla_total_size(sizeof(struct rdma_nla_ls_gid));
+	if (comp_mask & IB_SA_PATH_REC_TRAFFIC_CLASS)
+		len += nla_total_size(sizeof(u8));
+	if (comp_mask & IB_SA_PATH_REC_PKEY)
+		len += nla_total_size(sizeof(u16));
+	if (comp_mask & IB_SA_PATH_REC_QOS_CLASS)
+		len += nla_total_size(sizeof(u16));
+
+	/*
+	 * Make sure that at least some of the required comp_mask bits are
+	 * set.
+	 */
+	if (WARN_ON(len == 0))
+		return len;
+
+	/* Add the family header */
+	len += NLMSG_ALIGN(sizeof(struct rdma_ls_resolve_header));
+
+	return len;
+}
+
+static int ib_nl_send_msg(struct ib_sa_query *query)
+{
+	struct sk_buff *skb = NULL;
+	struct nlmsghdr *nlh;
+	void *data;
+	int ret = 0;
+	struct ib_sa_mad *mad;
+	int len;
+
+	mad = query->mad_buf->mad;
+	len = ib_nl_get_path_rec_attrs_len(mad->sa_hdr.comp_mask);
+	if (len <= 0)
+		return -EMSGSIZE;
+
+	skb = nlmsg_new(len, GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	/* Put nlmsg header only for now */
+	data = ibnl_put_msg(skb, &nlh, query->seq, 0, RDMA_NL_LS,
+			    RDMA_NL_LS_OP_RESOLVE, (int) GFP_KERNEL);
+	if (!data) {
+		kfree_skb(skb);
+		return -EMSGSIZE;
+	}
+
+	/* Add attributes */
+	ib_nl_set_path_rec_attrs(skb, query);
+
+	/* Repair the nlmsg header length */
+	nlmsg_end(skb, nlh);
+
+	ret = ibnl_multicast(skb, nlh, RDMA_NL_GROUP_LS, GFP_KERNEL);
+	if (!ret)
+		ret = len;
+	else
+		ret = 0;
+
+	return ret;
+}
+
+static int ib_nl_make_request(struct ib_sa_query *query)
+{
+	unsigned long flags;
+	unsigned long delay;
+	int ret;
+
+	INIT_LIST_HEAD(&query->list);
+	query->seq = (u32)atomic_inc_return(&ib_nl_sa_request_seq);
+
+	spin_lock_irqsave(&ib_nl_request_lock, flags);
+	ret = ib_nl_send_msg(query);
+	if (ret <= 0) {
+		ret = -EIO;
+		goto request_out;
+	} else {
+		ret = 0;
+	}
+
+	delay = msecs_to_jiffies(sa_local_svc_timeout_ms);
+	query->timeout = delay + jiffies;
+	list_add_tail(&query->list, &ib_nl_request_list);
+	/* Start the timeout if this is the only request */
+	if (ib_nl_request_list.next == &query->list)
+		queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
+
+request_out:
+	spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+
+	return ret;
+}
+
+static int ib_nl_cancel_request(struct ib_sa_query *query)
+{
+	unsigned long flags;
+	struct ib_sa_query *wait_query;
+	int found = 0;
+
+	spin_lock_irqsave(&ib_nl_request_lock, flags);
+	list_for_each_entry(wait_query, &ib_nl_request_list, list) {
+		/* Let the timeout routine take care of the callback */
+		if (query == wait_query) {
+			query->flags |= IB_SA_CANCEL;
+			query->timeout = jiffies;
+			list_move(&query->list, &ib_nl_request_list);
+			found = 1;
+			mod_delayed_work(ib_nl_wq, &ib_nl_timed_work, 1);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+
+	return found;
+}
+
+static void send_handler(struct ib_mad_agent *agent,
+			 struct ib_mad_send_wc *mad_send_wc);
+
+static void ib_nl_process_good_resolve_rsp(struct ib_sa_query *query,
+					   const struct nlmsghdr *nlh)
+{
+	struct ib_mad_send_wc mad_send_wc;
+	struct ib_sa_mad *mad = NULL;
+	const struct nlattr *head, *curr;
+	struct ib_path_rec_data  *rec;
+	int len, rem;
+	u32 mask = 0;
+	int status = -EIO;
+
+	if (query->callback) {
+		head = (const struct nlattr *) nlmsg_data(nlh);
+		len = nlmsg_len(nlh);
+		switch (query->path_use) {
+		case LS_RESOLVE_PATH_USE_UNIDIRECTIONAL:
+			mask = IB_PATH_PRIMARY | IB_PATH_OUTBOUND;
+			break;
+
+		case LS_RESOLVE_PATH_USE_ALL:
+		case LS_RESOLVE_PATH_USE_GMP:
+		default:
+			mask = IB_PATH_PRIMARY | IB_PATH_GMP |
+				IB_PATH_BIDIRECTIONAL;
+			break;
+		}
+		nla_for_each_attr(curr, head, len, rem) {
+			if (curr->nla_type == LS_NLA_TYPE_PATH_RECORD) {
+				rec = nla_data(curr);
+				/*
+				 * Get the first one. In the future, we may
+				 * need to get up to 6 pathrecords.
+				 */
+				if ((rec->flags & mask) == mask) {
+					mad = query->mad_buf->mad;
+					mad->mad_hdr.method |=
+						IB_MGMT_METHOD_RESP;
+					memcpy(mad->data, rec->path_rec,
+					       sizeof(rec->path_rec));
+					status = 0;
+					break;
+				}
+			}
+		}
+		query->callback(query, status, mad);
+	}
+
+	mad_send_wc.send_buf = query->mad_buf;
+	mad_send_wc.status = IB_WC_SUCCESS;
+	send_handler(query->mad_buf->mad_agent, &mad_send_wc);
+}
+
+static void ib_nl_request_timeout(struct work_struct *work)
+{
+	unsigned long flags;
+	struct ib_sa_query *query;
+	unsigned long delay;
+	struct ib_mad_send_wc mad_send_wc;
+	int ret;
+
+	spin_lock_irqsave(&ib_nl_request_lock, flags);
+	while (!list_empty(&ib_nl_request_list)) {
+		query = list_entry(ib_nl_request_list.next,
+				   struct ib_sa_query, list);
+
+		if (time_after(query->timeout, jiffies)) {
+			delay = query->timeout - jiffies;
+			if ((long)delay <= 0)
+				delay = 1;
+			queue_delayed_work(ib_nl_wq, &ib_nl_timed_work, delay);
+			break;
+		}
+
+		list_del(&query->list);
+		ib_sa_disable_local_svc(query);
+		/* Hold the lock to protect against query cancellation */
+		if (ib_sa_query_cancelled(query))
+			ret = -1;
+		else
+			ret = ib_post_send_mad(query->mad_buf, NULL);
+		if (ret) {
+			mad_send_wc.send_buf = query->mad_buf;
+			mad_send_wc.status = IB_WC_WR_FLUSH_ERR;
+			spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+			send_handler(query->port->agent, &mad_send_wc);
+			spin_lock_irqsave(&ib_nl_request_lock, flags);
+		}
+	}
+	spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+}
+
+static int ib_nl_handle_set_timeout(struct sk_buff *skb,
+				    struct netlink_callback *cb)
+{
+	const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
+	int timeout, delta, abs_delta;
+	const struct nlattr *attr;
+	unsigned long flags;
+	struct ib_sa_query *query;
+	long delay = 0;
+	struct nlattr *tb[LS_NLA_TYPE_MAX];
+	int ret;
+
+	if (!netlink_capable(skb, CAP_NET_ADMIN))
+		return -EPERM;
+
+	ret = nla_parse(tb, LS_NLA_TYPE_MAX - 1, nlmsg_data(nlh),
+			nlmsg_len(nlh), ib_nl_policy);
+	attr = (const struct nlattr *)tb[LS_NLA_TYPE_TIMEOUT];
+	if (ret || !attr)
+		goto settimeout_out;
+
+	timeout = *(int *) nla_data(attr);
+	if (timeout < IB_SA_LOCAL_SVC_TIMEOUT_MIN)
+		timeout = IB_SA_LOCAL_SVC_TIMEOUT_MIN;
+	if (timeout > IB_SA_LOCAL_SVC_TIMEOUT_MAX)
+		timeout = IB_SA_LOCAL_SVC_TIMEOUT_MAX;
+
+	delta = timeout - sa_local_svc_timeout_ms;
+	if (delta < 0)
+		abs_delta = -delta;
+	else
+		abs_delta = delta;
+
+	if (delta != 0) {
+		spin_lock_irqsave(&ib_nl_request_lock, flags);
+		sa_local_svc_timeout_ms = timeout;
+		list_for_each_entry(query, &ib_nl_request_list, list) {
+			if (delta < 0 && abs_delta > query->timeout)
+				query->timeout = 0;
+			else
+				query->timeout += delta;
+
+			/* Get the new delay from the first entry */
+			if (!delay) {
+				delay = query->timeout - jiffies;
+				if (delay <= 0)
+					delay = 1;
+			}
+		}
+		if (delay)
+			mod_delayed_work(ib_nl_wq, &ib_nl_timed_work,
+					 (unsigned long)delay);
+		spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+	}
+
+settimeout_out:
+	return skb->len;
+}
+
+static inline int ib_nl_is_good_resolve_resp(const struct nlmsghdr *nlh)
+{
+	struct nlattr *tb[LS_NLA_TYPE_MAX];
+	int ret;
+
+	if (nlh->nlmsg_flags & RDMA_NL_LS_F_ERR)
+		return 0;
+
+	ret = nla_parse(tb, LS_NLA_TYPE_MAX - 1, nlmsg_data(nlh),
+			nlmsg_len(nlh), ib_nl_policy);
+	if (ret)
+		return 0;
+
+	return 1;
+}
+
+static int ib_nl_handle_resolve_resp(struct sk_buff *skb,
+				     struct netlink_callback *cb)
+{
+	const struct nlmsghdr *nlh = (struct nlmsghdr *)cb->nlh;
+	unsigned long flags;
+	struct ib_sa_query *query;
+	struct ib_mad_send_buf *send_buf;
+	struct ib_mad_send_wc mad_send_wc;
+	int found = 0;
+	int ret;
+
+	if (!netlink_capable(skb, CAP_NET_ADMIN))
+		return -EPERM;
+
+	spin_lock_irqsave(&ib_nl_request_lock, flags);
+	list_for_each_entry(query, &ib_nl_request_list, list) {
+		/*
+		 * If the query is cancelled, let the timeout routine
+		 * take care of it.
+		 */
+		if (nlh->nlmsg_seq == query->seq) {
+			found = !ib_sa_query_cancelled(query);
+			if (found)
+				list_del(&query->list);
+			break;
+		}
+	}
+
+	if (!found) {
+		spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+		goto resp_out;
+	}
+
+	send_buf = query->mad_buf;
+
+	if (!ib_nl_is_good_resolve_resp(nlh)) {
+		/* if the result is a failure, send out the packet via IB */
+		ib_sa_disable_local_svc(query);
+		ret = ib_post_send_mad(query->mad_buf, NULL);
+		spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+		if (ret) {
+			mad_send_wc.send_buf = send_buf;
+			mad_send_wc.status = IB_WC_GENERAL_ERR;
+			send_handler(query->port->agent, &mad_send_wc);
+		}
+	} else {
+		spin_unlock_irqrestore(&ib_nl_request_lock, flags);
+		ib_nl_process_good_resolve_rsp(query, nlh);
+	}
+
+resp_out:
+	return skb->len;
+}
+
+static struct ibnl_client_cbs ib_sa_cb_table[] = {
+	[RDMA_NL_LS_OP_RESOLVE] = {
+		.dump = ib_nl_handle_resolve_resp,
+		.module = THIS_MODULE },
+	[RDMA_NL_LS_OP_SET_TIMEOUT] = {
+		.dump = ib_nl_handle_set_timeout,
+		.module = THIS_MODULE },
+};
+
 static void free_sm_ah(struct kref *kref)
 {
 	struct ib_sa_sm_ah *sm_ah = container_of(kref, struct ib_sa_sm_ah, ref);
@@ -502,7 +960,13 @@ void ib_sa_cancel_query(int id, struct ib_sa_query *query)
 	mad_buf = query->mad_buf;
 	spin_unlock_irqrestore(&idr_lock, flags);
 
-	ib_cancel_mad(agent, mad_buf);
+	/*
+	 * If the query is still on the netlink request list, schedule
+	 * it to be cancelled by the timeout routine. Otherwise, it has been
+	 * sent to the MAD layer and has to be cancelled from there.
+	 */
+	if (!ib_nl_cancel_request(query))
+		ib_cancel_mad(agent, mad_buf);
 }
 EXPORT_SYMBOL(ib_sa_cancel_query);
 
@@ -639,6 +1103,14 @@ static int send_mad(struct ib_sa_query *query, int timeout_ms, gfp_t gfp_mask)
 	query->mad_buf->context[0] = query;
 	query->id = id;
 
+	if (query->flags & IB_SA_ENABLE_LOCAL_SERVICE) {
+		if (!ibnl_chk_listeners(RDMA_NL_GROUP_LS)) {
+			if (!ib_nl_make_request(query))
+				return id;
+		}
+		ib_sa_disable_local_svc(query);
+	}
+
 	ret = ib_post_send_mad(query->mad_buf, NULL);
 	if (ret) {
 		spin_lock_irqsave(&idr_lock, flags);
@@ -767,6 +1239,9 @@ int ib_sa_path_rec_get(struct ib_sa_client *client,
 
 	*sa_query = &query->sa_query;
 
+	query->sa_query.flags |= IB_SA_ENABLE_LOCAL_SERVICE;
+	query->sa_query.mad_buf->context[1] = rec;
+
 	ret = send_mad(&query->sa_query, timeout_ms, gfp_mask);
 	if (ret < 0)
 		goto err2;
@@ -1251,6 +1726,8 @@ static int __init ib_sa_init(void)
 
 	get_random_bytes(&tid, sizeof tid);
 
+	atomic_set(&ib_nl_sa_request_seq, 0);
+
 	ret = ib_register_client(&sa_client);
 	if (ret) {
 		printk(KERN_ERR "Couldn't register ib_sa client\n");
@@ -1263,7 +1740,25 @@ static int __init ib_sa_init(void)
 		goto err2;
 	}
 
+	ib_nl_wq = create_singlethread_workqueue("ib_nl_sa_wq");
+	if (!ib_nl_wq) {
+		ret = -ENOMEM;
+		goto err3;
+	}
+
+	if (ibnl_add_client(RDMA_NL_LS, RDMA_NL_LS_NUM_OPS,
+			    ib_sa_cb_table)) {
+		pr_err("Failed to add netlink callback\n");
+		ret = -EINVAL;
+		goto err4;
+	}
+	INIT_DELAYED_WORK(&ib_nl_timed_work, ib_nl_request_timeout);
+
 	return 0;
+err4:
+	destroy_workqueue(ib_nl_wq);
+err3:
+	mcast_cleanup();
 err2:
 	ib_unregister_client(&sa_client);
 err1:
@@ -1272,6 +1767,10 @@ err1:
 
 static void __exit ib_sa_cleanup(void)
 {
+	ibnl_remove_client(RDMA_NL_LS);
+	cancel_delayed_work(&ib_nl_timed_work);
+	flush_workqueue(ib_nl_wq);
+	destroy_workqueue(ib_nl_wq);
 	mcast_cleanup();
 	ib_unregister_client(&sa_client);
 	idr_destroy(&query_idr);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v9 0/4] Sending kernel pathrecord query to user cache server
       [not found] ` <1439556729-27876-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-08-14 12:52   ` [PATCH v9 4/4] IB/sa: Route SA pathrecord query through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
@ 2015-08-21 23:07   ` Jason Gunthorpe
       [not found]     ` <20150821230734.GA16951-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  4 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2015-08-21 23:07 UTC (permalink / raw)
  To: kaike.wan-ral2JQCrhuEAvxtiuMwx3w, Haggai Eran
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Fri, Aug 14, 2015 at 08:52:05AM -0400, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:

> Some tests with namespace have been performed:
> 1. An unprivileged user cannot bind to the RDMA_NL_GROUP_LS multicast
>    group;
> 2. An unprivileged user cannot create a new network namespace. However,
>    it can create a new user namespace together with a new network
>    namespace by using clone() with CLONE_NEWUSER | CLONE_NEWNET flags;
> 3. In the user and network namespaces created by an unprivileged user,
>    the user can be mapped into root and thus be able to bind to the
>    RDMA_NL_GROUP_LS multicast group. However, it can neither send 
>    requests to the kernel RDMA netlink code nor receive requests from
>    it. This is because kernel RDMA netlink code associates itself with
>    the init_net network namespace, which in turn associates itself with
>    init_user_ns namespace. 

Haggai, how does this coverage match your expectations with your
namespace series?

Kaike, how does #3 work? If I create a user namespace and try to bind,
it succeeds from userspace's point of view, but ibnl_chk_listeners
still returns false in the kernel?

Jason

* Re: [PATCH v9 4/4] IB/sa: Route SA pathrecord query through netlink
       [not found]     ` <1439556729-27876-5-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
@ 2015-08-21 23:12       ` Jason Gunthorpe
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2015-08-21 23:12 UTC (permalink / raw)
  To: kaike.wan-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, John Fleck, Ira Weiny

On Fri, Aug 14, 2015 at 08:52:09AM -0400, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:

> +static LIST_HEAD(ib_nl_request_list);
> +static DEFINE_SPINLOCK(ib_nl_request_lock);
> +static atomic_t ib_nl_sa_request_seq;
> +static struct workqueue_struct *ib_nl_wq;
> +static struct delayed_work ib_nl_timed_work;
> +static const struct nla_policy ib_nl_policy[LS_NLA_TYPE_MAX] = {
> +	[LS_NLA_TYPE_PATH_RECORD]	= {.type = NLA_BINARY,
> +		.len = sizeof(struct ib_path_rec_data)},
> +	[LS_NLA_TYPE_TIMEOUT]		= {.type = NLA_U32},
> +	[LS_NLA_TYPE_SERVICE_ID]	= {.type = NLA_U64},
> +	[LS_NLA_TYPE_DGID]		= {.type = NLA_BINARY,
> +		.len = sizeof(struct rdma_nla_ls_gid)},
> +	[LS_NLA_TYPE_SGID]		= {.type = NLA_BINARY,
> +		.len = sizeof(struct rdma_nla_ls_gid)},
> +	[LS_NLA_TYPE_TCLASS]		= {.type = NLA_U8},
> +	[LS_NLA_TYPE_PKEY]		= {.type = NLA_U16},
> +	[LS_NLA_TYPE_QOS_CLASS]		= {.type = NLA_U16},
> +};

I'm not sure a single policy is the right way to use this API. The
majority of cases seem to use a policy per message type. However, I
can't think of a downside.
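
For reference, the difference between one shared table and one table per
message type can be modelled in userspace as below. This is a toy sketch,
not the kernel nla_policy API: the ATTR_*/OP_* numbers, the 64-byte record
size, struct policy_entry and check_attrs() are all made up for
illustration, and unlike nla_parse() the model rejects any attribute the
table does not list.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute and op numbers, for illustration only. */
enum { ATTR_UNSPEC, ATTR_PATH_RECORD, ATTR_TIMEOUT, ATTR_MAX };
enum { OP_RESOLVE, OP_SET_TIMEOUT, OP_MAX };

struct attr {			/* one decoded attribute */
	uint16_t type;
	uint16_t len;		/* payload length in bytes */
};

struct policy_entry {
	uint16_t len;		/* expected length; 0 means "not listed" */
};

/* Shared table: every known attribute is accepted in every message. */
static const struct policy_entry shared_policy[ATTR_MAX] = {
	[ATTR_PATH_RECORD] = { .len = 64 },
	[ATTR_TIMEOUT]     = { .len = 4 },
};

/* Per-op tables: each message type lists only what it expects. */
static const struct policy_entry per_op_policy[OP_MAX][ATTR_MAX] = {
	[OP_RESOLVE]     = { [ATTR_PATH_RECORD] = { .len = 64 } },
	[OP_SET_TIMEOUT] = { [ATTR_TIMEOUT]     = { .len = 4 } },
};

/* Return 0 if every attribute is listed in the table with the right size. */
int check_attrs(const struct policy_entry *policy,
		const struct attr *attrs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (attrs[i].type >= ATTR_MAX)
			return -1;
		if (policy[attrs[i].type].len == 0 ||
		    policy[attrs[i].type].len != attrs[i].len)
			return -1;
	}
	return 0;
}
```

In this model, a stray timeout attribute in a resolve response passes
check_attrs(shared_policy, ...) but is flagged by
check_attrs(per_op_policy[OP_RESOLVE], ...).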

I'd like to test this again here next week, but I think this is OK
now.

Jason

* Re: [PATCH v9 0/4] Sending kernel pathrecord query to user cache server
       [not found]     ` <20150821230734.GA16951-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2015-08-22  6:17       ` Haggai Eran
  2015-08-24 14:32       ` Wan, Kaike
  1 sibling, 0 replies; 12+ messages in thread
From: Haggai Eran @ 2015-08-22  6:17 UTC (permalink / raw)
  To: Jason Gunthorpe, kaike.wan-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Saturday, August 22, 2015 2:07 AM, Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> On Fri, Aug 14, 2015 at 08:52:05AM -0400, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> 
>> Some tests with namespace have been performed:
>> 1. An unprivileged user cannot bind to the RDMA_NL_GROUP_LS multicast
>>    group;
>> 2. An unprivileged user cannot create a new network namespace. However,
>>    it can create a new user namespace together with a new network
>>    namespace by using clone() with CLONE_NEWUSER | CLONE_NEWNET flags;
>> 3. In the user and network namespaces created by an unprivileged user,
>>    the user can be mapped into root and thus be able to bind to the
>>    RDMA_NL_GROUP_LS multicast group. However, it can neither send
>>    requests to the kernel RDMA netlink code nor receive requests from
>>    it. This is because kernel RDMA netlink code associates itself with
>>    the init_net network namespace, which in turn associates itself with
>>    init_user_ns namespace.
> 
> Haggai, how does this coverage match your expectations with your
> namespace series?

It sounds good. Our thinking was that the network namespace assigns netdevs to namespaces, and we don't want to assign an entire IB device to a namespace, so it isn't suitable for restricting applications that deal with IB directly. RDMA CM already used IP addressing, so it was suitable to be changed to work only with the devices in a process's network namespace.

We also discussed creating a cgroup for RDMA later on, that could make sure a container can only use certain P_Keys, for applications that don't use RDMA CM. Such a cgroup could also be used to limit the SA queries a process can issue.

Haggai

* RE: [PATCH v9 0/4] Sending kernel pathrecord query to user cache server
       [not found]     ` <20150821230734.GA16951-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2015-08-22  6:17       ` Haggai Eran
@ 2015-08-24 14:32       ` Wan, Kaike
       [not found]         ` <3F128C9216C9B84BB6ED23EF16290AFB18548AF0-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  1 sibling, 1 reply; 12+ messages in thread
From: Wan, Kaike @ 2015-08-24 14:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Haggai Eran; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA



> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Jason Gunthorpe
> Sent: Friday, August 21, 2015 7:08 PM
> To: Wan, Kaike; Haggai Eran
> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> Subject: Re: [PATCH v9 0/4] Sending kernel pathrecord query to user cache
> server
> 
> On Fri, Aug 14, 2015 at 08:52:05AM -0400, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
> 
> > Some tests with namespace have been performed:
> > 1. An unprivileged user cannot bind to the RDMA_NL_GROUP_LS multicast
> >    group;
> > 2. An unprivileged user cannot create a new network namespace. However,
> >    it can create a new user namespace together with a new network
> >    namespace by using clone() with CLONE_NEWUSER | CLONE_NEWNET flags;
> > 3. In the user and network namespaces created by an unprivileged user,
> >    the user can be mapped into root and thus be able to bind to the
> >    RDMA_NL_GROUP_LS multicast group. However, it can neither send
> >    requests to the kernel RDMA netlink code nor receive requests from
> >    it. This is because kernel RDMA netlink code associates itself with
> >    the init_net network namespace, which in turn associates itself with
> >    init_user_ns namespace.
> 
> Haggai, how does this coverage match your expectations with your
> namespace series?
> 
> Kaike, how does #3 work? 

I created a test app that used clone() with CLONE_NEWUSER | CLONE_NEWNET to create a child process (modeled after the user_namespaces man page example: http://man7.org/linux/man-pages/man7/user_namespaces.7.html). Once the child process was mapped to root (uid 0), it created the netlink socket, bound to the RDMA_NL_GROUP_LS multicast group, and waited to receive requests from the kernel.
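
The steps can be sketched roughly as follows. This is not the actual test
app: it uses unshare() instead of clone() to stay short, the helper names
(format_id_map, enter_namespaces, bind_ls_group) are hypothetical, and
EXAMPLE_NL_GROUP_LS is an assumed value that should be taken from
rdma_netlink.h once patch 1/4 is applied.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/netlink.h>

#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif

/* Assumed group number; take the real value from rdma_netlink.h. */
#define EXAMPLE_NL_GROUP_LS 3

/* Build one "inside outside count" line for /proc/self/{uid,gid}_map. */
int format_id_map(char *buf, size_t len, unsigned int inside,
		  unsigned int outside)
{
	int n = snprintf(buf, len, "%u %u 1\n", inside, outside);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

static int write_proc(const char *path, const char *data)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, data, strlen(data));
	close(fd);
	return n == (ssize_t)strlen(data) ? 0 : -1;
}

/* Enter new user + network namespaces and map the caller to uid/gid 0. */
int enter_namespaces(void)
{
	unsigned int uid = getuid(), gid = getgid();
	char map[32];

	if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0)
		return -1;
	if (format_id_map(map, sizeof(map), 0, uid) < 0 ||
	    write_proc("/proc/self/uid_map", map) < 0)
		return -1;
	/* Needed before an unprivileged gid_map write (Linux 3.19+). */
	write_proc("/proc/self/setgroups", "deny");
	if (format_id_map(map, sizeof(map), 0, gid) < 0 ||
	    write_proc("/proc/self/gid_map", map) < 0)
		return -1;
	return 0;
}

/* Open a NETLINK_RDMA socket and join the local-service group. */
int bind_ls_group(void)
{
	struct sockaddr_nl addr = { .nl_family = AF_NETLINK };
	int group = EXAMPLE_NL_GROUP_LS;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_RDMA);

	if (fd < 0)
		return -1;
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    setsockopt(fd, SOL_NETLINK, NETLINK_ADD_MEMBERSHIP,
		       &group, sizeof(group)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A caller would run enter_namespaces(), then bind_ls_group(), and block in
recv() on the returned descriptor to see whether any request arrives.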

> If I create a user namespace and try to bind it
> succeeds to userspace but ibnl_chk_listeners still returns false in the kernel?

ibnl_chk_listeners() actually returned 0 (success), indicating that there were listeners. However, ibnl_multicast() failed. From the code of netlink_has_listeners(), it is apparent that the check has nothing to do with namespaces (that's why it succeeded).
> 
> Jason

* Re: [PATCH v9 0/4] Sending kernel pathrecord query to user cache server
       [not found]         ` <3F128C9216C9B84BB6ED23EF16290AFB18548AF0-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2015-08-24 15:10           ` Haggai Eran
  2015-08-25  6:34           ` Haggai Eran
  2015-08-25  6:37           ` Haggai Eran
  2 siblings, 0 replies; 12+ messages in thread
From: Haggai Eran @ 2015-08-24 15:10 UTC (permalink / raw)
  To: Wan, Kaike, Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 24/08/2015 17:32, Wan, Kaike wrote:
>> On Fri, Aug 14, 2015 at 08:52:05AM -0400, kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org wrote:
>>
>>> Some tests with namespace have been performed:
>>> 1. An unprivileged user cannot bind to the RDMA_NL_GROUP_LS multicast
>>>    group;
>>> 2. An unprivileged user cannot create a new network namespace. However,
>>>    it can create a new user namespace together with a new network
>>>    namespace by using clone() with CLONE_NEWUSER | CLONE_NEWNET flags;
>>> 3. In the user and network namespaces created by an unprivileged user,
>>>    the user can be mapped into root and thus be able to bind to the
>>>    RDMA_NL_GROUP_LS multicast group. However, it can neither send
>>>    requests to the kernel RDMA netlink code nor receive requests from
>>>    it. This is because kernel RDMA netlink code associates itself with
>>>    the init_net network namespace, which in turn associates itself with
>>>    init_user_ns namespace.
>>
>> Haggai, how does this coverage match your expectations with your
>> namespace series?
>>
>> Kaike, how does #3 work?
>
> I created a test app that used clone() with CLONE_NEWUSER | CLONE_NEWNET to create a child process (modeled after the user_namespaces man page example: http://man7.org/linux/man-pages/man7/user_namespaces.7.html). Once the child process was mapped to root (uid 0), it created the netlink socket, bound to the RDMA_NL_GROUP_LS multicast group, and waited to receive requests from the kernel.
>
>> If I create a user namespace and try to bind it
>> succeeds to userspace but ibnl_chk_listeners still returns false in the kernel?
>
> ibnl_chk_listeners() actually returned 0 (success), indicating that there were listeners. However, ibnl_multicast() failed. From the code of netlink_has_listeners(), it is apparent that the check has nothing to do with namespaces (that's why it succeeded).

It looks like the ibnl socket (nls) is created with the &init_net 
network namespace, and netlink won't send multicasts to sockets on 
other namespaces (see [1]).
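
The mismatch between the two calls can be modelled in a few lines of
userspace C. This is only a toy model under stated assumptions, not the
af_netlink.c code: struct toy_sock, has_listeners() and broadcast() are
made-up names, namespaces are plain integer ids, and group numbers start
at 1. The listener check looks at group membership only, while delivery
also compares the namespace.

```c
#include <stddef.h>

struct toy_sock {
	int netns;		/* id of the namespace the socket lives in */
	unsigned int groups;	/* bitmask of joined multicast groups */
};

/* Membership-only check: true if any socket joined the group. */
int has_listeners(const struct toy_sock *socks, size_t n, unsigned int group)
{
	for (size_t i = 0; i < n; i++)
		if (socks[i].groups & (1u << (group - 1)))
			return 1;
	return 0;
}

/* Delivery: count sockets that joined the group AND share the sender's ns. */
int broadcast(const struct toy_sock *socks, size_t n, unsigned int group,
	      int sender_netns)
{
	int delivered = 0;

	for (size_t i = 0; i < n; i++) {
		if (!(socks[i].groups & (1u << (group - 1))))
			continue;	/* not a member of this group */
		if (socks[i].netns != sender_netns)
			continue;	/* member, but in another namespace */
		delivered++;
	}
	return delivered;
}
```

With a single listener in namespace 1 and a sender in namespace 0,
has_listeners() reports a listener while broadcast() delivers to nobody,
which matches the behaviour Kaike observed.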

Haggai

[1] http://lxr.free-electrons.com/source/net/netlink/af_netlink.c?v=4.1#L1935

end of thread, other threads:[~2015-08-25  6:37 UTC | newest]

Thread overview: 12+ messages
2015-08-14 12:52 [PATCH v9 0/4] Sending kernel pathrecord query to user cache server kaike.wan-ral2JQCrhuEAvxtiuMwx3w
     [not found] ` <1439556729-27876-1-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-08-14 12:52   ` [PATCH v9 1/4] IB/netlink: Add defines for local service requests through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
2015-08-14 12:52   ` [PATCH v9 2/4] IB/core: Add rdma netlink helper functions kaike.wan-ral2JQCrhuEAvxtiuMwx3w
2015-08-14 12:52   ` [PATCH v9 3/4] IB/sa: Allocate SA query with kzalloc kaike.wan-ral2JQCrhuEAvxtiuMwx3w
2015-08-14 12:52   ` [PATCH v9 4/4] IB/sa: Route SA pathrecord query through netlink kaike.wan-ral2JQCrhuEAvxtiuMwx3w
     [not found]     ` <1439556729-27876-5-git-send-email-kaike.wan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
2015-08-21 23:12       ` Jason Gunthorpe
2015-08-21 23:07   ` [PATCH v9 0/4] Sending kernel pathrecord query to user cache server Jason Gunthorpe
     [not found]     ` <20150821230734.GA16951-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2015-08-22  6:17       ` Haggai Eran
2015-08-24 14:32       ` Wan, Kaike
     [not found]         ` <3F128C9216C9B84BB6ED23EF16290AFB18548AF0-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2015-08-24 15:10           ` Haggai Eran
2015-08-25  6:34           ` Haggai Eran
2015-08-25  6:37           ` Haggai Eran
