All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH libibverbs V3 0/3] Add support for UD QPs under RoCE IP addressing
@ 2014-05-08  6:51 Or Gerlitz
       [not found] ` <1399531883-30599-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2014-05-08  6:51 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	Or Gerlitz

Hi Roland,

This series adds support for Ethernet L2 address resolution for RoCE UD QPs,
whose L2 address-handles, unlike RC/UC/XRC/etc QPs are set from user space
without going through uverbs and the kernel IB core. The code is also
compatible both with old kernels that don't run IP based addressing.

Matan and Or.

changes from V2:
  - when using libnl3, link libibverbs itself with libnl3 and not the utilities
  - fix memory leak which took place when nl_addr_cmp_prefix_msb returns non zero.
  - rebased against libibverbs-1.1.8

changes from V1, addressed feedback from Yann Droneaud:
  - ported the code to libnl3 with automatic fallback option to libnl1
  - better error handling (more checks to erroneous flows)
  - fixed typos and removed blank lines
  - set close-on-exec open flag for the relevant fds
  - remove wrong comment from the change-log of patch #1

changes from V0:
 - don't change struct ibv_port_attr flags field from uint32_t
   to enum (size could change)
 - ibv_ah_attr_ex is extendable and contains ibv_ah_attr

Matan Barak (3):
  Add ibv_port_cap_flags
  Use neighbour lookup for RoCE UD QPs Eth L2 resolution
  Add ibv_query_port_ex support

 Makefile.am                |   10 +-
 configure.ac               |   31 ++
 include/infiniband/verbs.h |  160 +++++++-
 src/device.c               |    2 +
 src/neigh.c                | 1001 ++++++++++++++++++++++++++++++++++++++++++++
 src/neigh.h                |   53 +++
 src/verbs.c                |  164 +++++++-
 7 files changed, 1415 insertions(+), 6 deletions(-)
 create mode 100644 src/neigh.c
 create mode 100644 src/neigh.h

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH libibverbs V3 1/3] Add ibv_port_cap_flags
       [not found] ` <1399531883-30599-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-05-08  6:51   ` Or Gerlitz
  2014-05-08  6:51   ` [PATCH libibverbs V3 2/3] Use neighbour lookup for RoCE UD QPs Eth L2 resolution Or Gerlitz
  2014-05-08  6:51   ` [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support Or Gerlitz
  2 siblings, 0 replies; 20+ messages in thread
From: Or Gerlitz @ 2014-05-08  6:51 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Add an enum that describes ibv_port_cap_flags that complies
with the respective kernel enum.

This value could be fetched when using ibv_query_port in
port_cap_flags.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/verbs.h |   22 ++++++++++++++++++++++
 1 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index cfa1156..2eb69ae 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -191,6 +191,28 @@ enum {
 	IBV_LINK_LAYER_ETHERNET,
 };
 
+enum ibv_port_cap_flags {
+	IBV_PORT_SM				= 1 <<  1,
+	IBV_PORT_NOTICE_SUP			= 1 <<  2,
+	IBV_PORT_TRAP_SUP			= 1 <<  3,
+	IBV_PORT_OPT_IPD_SUP			= 1 <<  4,
+	IBV_PORT_AUTO_MIGR_SUP			= 1 <<  5,
+	IBV_PORT_SL_MAP_SUP			= 1 <<  6,
+	IBV_PORT_MKEY_NVRAM			= 1 <<  7,
+	IBV_PORT_PKEY_NVRAM			= 1 <<  8,
+	IBV_PORT_LED_INFO_SUP			= 1 <<  9,
+	IBV_PORT_SYS_IMAGE_GUID_SUP		= 1 << 11,
+	IBV_PORT_PKEY_SW_EXT_PORT_TRAP_SUP	= 1 << 12,
+	IBV_PORT_EXTENDED_SPEEDS_SUP		= 1 << 14,
+	IBV_PORT_CM_SUP				= 1 << 16,
+	IBV_PORT_SNMP_TUNNEL_SUP		= 1 << 17,
+	IBV_PORT_REINIT_SUP			= 1 << 18,
+	IBV_PORT_DEVICE_MGMT_SUP		= 1 << 19,
+	IBV_PORT_VENDOR_CLASS			= 1 << 24,
+	IBV_PORT_CLIENT_REG_SUP			= 1 << 25,
+	IBV_PORT_IP_BASED_GIDS			= 1 << 26,
+};
+
 struct ibv_port_attr {
 	enum ibv_port_state	state;
 	enum ibv_mtu		max_mtu;
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH libibverbs V3 2/3] Use neighbour lookup for RoCE UD QPs Eth L2 resolution
       [not found] ` <1399531883-30599-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-05-08  6:51   ` [PATCH libibverbs V3 1/3] Add ibv_port_cap_flags Or Gerlitz
@ 2014-05-08  6:51   ` Or Gerlitz
       [not found]     ` <1399531883-30599-3-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-05-08  6:51   ` [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support Or Gerlitz
  2 siblings, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2014-05-08  6:51 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

In order to implement RoCE IP based addressing for UD QPs, without introducing
uverbs changes, we need a way to resolve the L2 Ethernet addresses from user-space.

This is done with netlink through libnl, and in libibverbs such that multiple
vendor provider libraries can use the code.

The Ethernet L2 params are passed to the provider driver using an extension
verb drv_ibv_create_ah_ex call. This resolution is done only for providers
that support this extension verb, otherwise, there's not real reason to do that.

The steps for resolution:

1. get sgid

2. from sgid, get the interface

3. query route from interface to the destination

4. query the neigh table, if the MAC for the destination is already known, done.

5. if not, loop until timeout:
    a. send a UDP packet to that IP targeted to the "discard" port
    b. listen to events from the neigh table
    c. query neigh table in order to know if the neigh has shown up between
       query until we started monitoring it

6. query vlan id from the interface

This solution depends on libnl-3 with backports to libnl-1.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 Makefile.am                |   10 +-
 configure.ac               |   31 ++
 include/infiniband/verbs.h |   35 ++-
 src/neigh.c                | 1001 ++++++++++++++++++++++++++++++++++++++++++++
 src/neigh.h                |   53 +++
 src/verbs.c                |  161 +++++++-
 6 files changed, 1285 insertions(+), 6 deletions(-)
 create mode 100644 src/neigh.c
 create mode 100644 src/neigh.h

diff --git a/Makefile.am b/Makefile.am
index a6767de..deb273d 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -5,14 +5,18 @@ lib_LTLIBRARIES = src/libibverbs.la
 ACLOCAL_AMFLAGS = -I config
 AM_CFLAGS = -g -Wall
 
-src_libibverbs_la_CFLAGS = $(AM_CFLAGS) -DIBV_CONFIG_DIR=\"$(sysconfdir)/libibverbs.d\"
-
+src_libibverbs_la_CFLAGS = $(AM_CFLAGS) -DIBV_CONFIG_DIR=\"$(sysconfdir)/libibverbs.d\" \
+			   $(LIBNL_CFLAGS)
 libibverbs_version_script = @LIBIBVERBS_VERSION_SCRIPT@
+src_libibverbs_la_LIBADD = $(LIBNL_LIBS)
 
 src_libibverbs_la_SOURCES = src/cmd.c src/compat-1_0.c src/device.c src/init.c \
 			    src/marshall.c src/memory.c src/sysfs.c src/verbs.c \
 			    src/enum_strs.c
-src_libibverbs_la_LDFLAGS = -version-info 1 -export-dynamic \
+if ! NO_RESOLVE_NEIGH
+src_libibverbs_la_SOURCES += src/neigh.c
+endif
+src_libibverbs_la_LDFLAGS = -version-info 1 -export-dynamic $(LIBNL_LIBS)\
     $(libibverbs_version_script)
 src_libibverbs_la_DEPENDENCIES = $(srcdir)/src/libibverbs.map
 
diff --git a/configure.ac b/configure.ac
index b5f6dfc..bef0e35 100644
--- a/configure.ac
+++ b/configure.ac
@@ -28,6 +28,37 @@ else
     fi
 fi
 
+AC_ARG_WITH([resolve-neigh],
+    AC_HELP_STRING([--with-resolve-neigh],
+        [Enable neighbour resolution in Ethernet (default YES)]))
+have_libnl=no
+if  test x$with_resolve_neigh = x || test x$with_resolve_neigh = xyes; then
+	PKG_CHECK_MODULES([LIBNL],[libnl-3.0],[
+			   have_libnl=yes
+			   AC_DEFINE([HAVE_LIBNL3], [1], [Use libnl-3.0])
+			   AC_DEFINE([HAVE_LIBNL], [1],  [Use libnl])
+			   PKG_CHECK_MODULES([LIBNL_ROUTE3], [libnl-route-3.0])
+			   LIBNL_CFLAGS="$LIBNL_CFLAGS $LIBNL_ROUTE3_CFLAGS"
+			   LIBNL_LIBS="$LIBNL_LIBS $LIBNL_ROUTE3_LIBS"], [:]
+			  );
+	if test "$have_libnl" = no; then
+		PKG_CHECK_MODULES([LIBNL], [libnl-1], [have_libnl=yes
+				  AC_DEFINE([HAVE_LIBNL1], [1], [Use libnl-1])
+				  AC_DEFINE([HAVE_LIBNL], [1],  [Use libnl])
+				  AC_CHECK_LIB(nl, rtnl_link_vlan_get_id, [],
+					AC_MSG_ERROR([rtnl_link_vlan_get_id not found.  libibverbs requires libnl.]))
+				 ],[
+					AC_MSG_ERROR([libibverbs requires libnl.])
+				 ])
+	fi
+else
+    AC_DEFINE([NRESOLVE_NEIGH], 1, [Define to 1 to disable resovle neigh annotations.])
+fi
+AM_CONDITIONAL([HAVE_LIBNL], [test "$have_libnl" = "yes"])
+AC_SUBST([LIBNL_CFLAGS])
+AC_SUBST([LIBNL_LIBS])
+AM_CONDITIONAL(NO_RESOLVE_NEIGH, test x$with_resolve_neigh = xno)
+
 dnl Checks for libraries
 AC_CHECK_LIB(dl, dlsym, [],
     AC_MSG_ERROR([dlsym() not found.  libibverbs requires libdl.]))
diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 2eb69ae..92d6a60 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -463,6 +463,36 @@ struct ibv_ah_attr {
 	uint8_t			port_num;
 };
 
+enum ibv_ah_attr_ex_attr_mask {
+	IBV_AH_ATTR_EX_LL = 1UL << 0,
+	IBV_AH_ATTR_EX_VID = 1UL << 1,
+};
+
+enum ll_address_type {
+	LL_ADDRESS_UNKNOWN,
+	LL_ADDRESS_IB,
+	LL_ADDRESS_ETH,
+	LL_ADDRESS_SIZE
+};
+
+struct ibv_ah_attr_ex {
+	struct ibv_global_route	grh;
+	uint16_t		dlid;
+	uint8_t			sl;
+	uint8_t			src_path_bits;
+	uint8_t			static_rate;
+	uint8_t			is_global;
+	uint8_t			port_num;
+	uint32_t		comp_mask;
+	struct {
+		enum ll_address_type type;
+		uint32_t	len;
+		char		*address;
+	} ll_address;
+	uint16_t		vid;
+};
+
+
 enum ibv_srq_attr_mask {
 	IBV_SRQ_MAX_WR	= 1 << 0,
 	IBV_SRQ_LIMIT	= 1 << 1
@@ -968,11 +998,14 @@ enum verbs_context_mask {
 	VERBS_CONTEXT_QP	= 1 << 2,
 	VERBS_CONTEXT_CREATE_FLOW = 1 << 3,
 	VERBS_CONTEXT_DESTROY_FLOW = 1 << 4,
-	VERBS_CONTEXT_RESERVED	= 1 << 5
+	VERBS_CONTEXT_CREATE_AH = 1 << 5,
+	VERBS_CONTEXT_RESERVED	= 1 << 6
 };
 
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	struct ibv_ah * (*drv_ibv_create_ah_ex)(struct ibv_pd *pd,
+						struct ibv_ah_attr_ex *attr);
 	int (*drv_ibv_destroy_flow) (struct ibv_flow *flow);
 	int (*lib_ibv_destroy_flow) (struct ibv_flow *flow);
 	struct ibv_flow * (*drv_ibv_create_flow) (struct ibv_qp *qp,
diff --git a/src/neigh.c b/src/neigh.c
new file mode 100644
index 0000000..f2f13ba
--- /dev/null
+++ b/src/neigh.c
@@ -0,0 +1,1001 @@
+
+#include "config.h"
+#include <net/if_packet.h>
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <netlink/route/rtnl.h>
+#include <netlink/route/link.h>
+#include <netlink/route/route.h>
+#include <netlink/route/neighbour.h>
+#ifndef HAVE_LIBNL1
+#include <netlink/route/link/vlan.h>
+#endif
+
+
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <sys/timerfd.h>
+#include <netinet/in.h>
+#include <errno.h>
+#include <unistd.h>
+#include <ifaddrs.h>
+#include <netdb.h>
+#ifndef _LINUX_IF_H
+#include <net/if.h>
+#else
+/*Workaround when there's a collision between the includes */
+extern unsigned int if_nametoindex(__const char *__ifname) __THROW;
+#endif
+
+/* for PFX */
+#include "ibverbs.h"
+
+#include "neigh.h"
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+#define MIN(a, b) ((a) < (b) ? (a) : (b))
+#define print_hdr PFX "resolver: "
+#define print_err(...) fprintf(stderr, print_hdr __VA_ARGS__)
+#ifdef _DEBUG_
+#define print_dbg_s(stream, ...) fprintf((stream), __VA_ARGS__)
+#define print_address(stream, msg, addr) \
+	{							      \
+		char buff[100];					      \
+		print_dbg_s((stream), (msg), nl_addr2str((addr), buff,\
+			    sizeof(buff)));			      \
+	}
+#else
+#define print_dbg_s(stream, args...)
+#define print_address(stream, msg, ...)
+#endif
+#define print_dbg(...) print_dbg_s(stderr, print_hdr __VA_ARGS__)
+
+#ifdef HAVE_LIBNL1
+#define nl_geterror(x) nl_geterror()
+#endif
+
+/* Workaround - declaration missing */
+extern int		rtnl_link_vlan_get_id(struct rtnl_link *);
+
+static pthread_once_t device_neigh_alloc = PTHREAD_ONCE_INIT;
+#ifdef HAVE_LIBNL1
+static struct nl_handle *zero_socket;
+#else
+static struct nl_sock *zero_socket;
+#endif
+
+union sktaddr {
+	struct sockaddr s;
+	struct sockaddr_in s4;
+	struct sockaddr_in6 s6;
+};
+
+struct skt {
+	union sktaddr sktaddr;
+	socklen_t len;
+};
+
+
+static int set_link_port(union sktaddr *s, int port, int oif)
+{
+	switch (s->s.sa_family) {
+	case AF_INET:
+		s->s4.sin_port = port;
+		break;
+	case AF_INET6:
+		s->s6.sin6_port = port;
+		s->s6.sin6_scope_id = oif;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int cmp_address(const struct sockaddr *s1,
+		       const struct sockaddr *s2) {
+	if (s1->sa_family != s2->sa_family)
+		return s1->sa_family ^ s2->sa_family;
+
+	switch (s1->sa_family) {
+	case AF_INET:
+		return ((struct sockaddr_in *)s1)->sin_addr.s_addr ^
+		       ((struct sockaddr_in *)s2)->sin_addr.s_addr;
+	case AF_INET6:
+		return memcmp(
+			((struct sockaddr_in6 *)s1)->sin6_addr.s6_addr,
+			((struct sockaddr_in6 *)s2)->sin6_addr.s6_addr,
+			sizeof(((struct sockaddr_in6 *)s1)->sin6_addr.s6_addr));
+	default:
+		return -ENOTSUP;
+	}
+}
+
+
+static int get_ifindex(const struct sockaddr *s)
+{
+	struct ifaddrs *ifaddr, *ifa;
+	int name2index = -ENODEV;
+
+	if (-1 == getifaddrs(&ifaddr))
+		return errno;
+
+	for (ifa = ifaddr; ifa != NULL; ifa = ifa->ifa_next) {
+		if (NULL == ifa->ifa_addr)
+			continue;
+
+		if (!cmp_address(ifa->ifa_addr, s)) {
+			name2index = if_nametoindex(ifa->ifa_name);
+			break;
+		}
+	}
+
+	freeifaddrs(ifaddr);
+
+	return name2index;
+}
+
+static struct nl_addr *get_neigh_mac(struct get_neigh_handler *neigh_handler)
+{
+	struct rtnl_neigh *neigh;
+	struct nl_addr *ll_addr = NULL;
+
+	/* future optimization - if link local address - parse address and
+	 * return mac */
+	neigh = rtnl_neigh_get(neigh_handler->neigh_cache,
+			       neigh_handler->oif,
+			       neigh_handler->dst);
+	if (NULL == neigh) {
+		print_dbg("Neigh isn't at cache\n");
+		return NULL;
+	}
+
+	ll_addr = rtnl_neigh_get_lladdr(neigh);
+	if (NULL == ll_addr)
+		print_err("Failed to get hw address from neighbour\n");
+	else
+		ll_addr = nl_addr_clone(ll_addr);
+
+	rtnl_neigh_put(neigh);
+	return ll_addr;
+}
+
+static void get_neigh_cb_event(struct nl_object *obj, void *arg)
+{
+	struct get_neigh_handler *neigh_handler =
+		(struct get_neigh_handler *)arg;
+	/* assumed serilized callback (no parallel execution of function) */
+	if (nl_object_match_filter(
+		obj,
+		(struct nl_object *)neigh_handler->filter_neigh)) {
+		struct rtnl_neigh *neigh = (struct rtnl_neigh *)obj;
+		print_dbg("Found a match for neighbour\n");
+		/* check that we didn't set it already */
+		if (NULL == neigh_handler->found_ll_addr) {
+			if (NULL == rtnl_neigh_get_lladdr(neigh)) {
+				print_err("Neighbour doesn't have a hw addr\n");
+				return;
+			}
+			neigh_handler->found_ll_addr =
+				nl_addr_clone(rtnl_neigh_get_lladdr(neigh));
+			if (NULL == neigh_handler->found_ll_addr)
+				print_err("Couldn't copy neighbour hw addr\n");
+		}
+	}
+}
+
+static int get_neigh_cb(struct nl_msg *msg, void *arg)
+{
+	struct get_neigh_handler *neigh_handler =
+		(struct get_neigh_handler *)arg;
+
+	if (nl_msg_parse(msg, &get_neigh_cb_event, neigh_handler) < 0)
+		print_err("Unknown event\n");
+
+	return NL_OK;
+}
+
+static void set_neigh_filter(struct get_neigh_handler *neigh_handler,
+			     struct rtnl_neigh *filter) {
+	neigh_handler->filter_neigh = filter;
+}
+
+static struct rtnl_neigh *create_filter_neigh_for_dst(struct nl_addr *dst_addr,
+						      int oif)
+{
+	struct rtnl_neigh *filter_neigh;
+
+	filter_neigh = rtnl_neigh_alloc();
+	if (NULL == filter_neigh) {
+		print_err("Couldn't allocate filter neigh\n");
+		return NULL;
+	}
+	rtnl_neigh_set_ifindex(filter_neigh, oif);
+	rtnl_neigh_set_dst(filter_neigh, dst_addr);
+
+	return filter_neigh;
+}
+
+#define PORT_DISCARD htons(9)
+#define SEND_PAYLOAD "H"
+
+static int create_socket(struct get_neigh_handler *neigh_handler,
+			 struct skt *addr_dst, int *psock_fd)
+{
+	int err;
+	struct skt addr_src;
+	int sock_fd;
+
+	memset(addr_dst, 0, sizeof(*addr_dst));
+	memset(&addr_src, 0, sizeof(addr_src));
+	addr_src.len = sizeof(addr_src.sktaddr);
+
+	err = nl_addr_fill_sockaddr(neigh_handler->src,
+				    &addr_src.sktaddr.s,
+				    &addr_src.len);
+	if (err) {
+		print_err("couldn't convert src to sockaddr\n");
+		return err;
+	}
+
+	addr_dst->len = sizeof(addr_dst->sktaddr);
+	err = nl_addr_fill_sockaddr(neigh_handler->dst,
+				    &addr_dst->sktaddr.s,
+				    &addr_dst->len);
+	if (err) {
+		print_err("couldn't convert dst to sockaddr\n");
+		return err;
+	}
+
+	err = set_link_port(&addr_dst->sktaddr, PORT_DISCARD,
+			    neigh_handler->oif);
+	if (err)
+		return err;
+
+	sock_fd = socket(addr_dst->sktaddr.s.sa_family,
+			SOCK_DGRAM | SOCK_CLOEXEC, 0);
+	if (sock_fd == -1)
+		return errno ? -errno : -1;
+	err = bind(sock_fd, &addr_src.sktaddr.s, addr_src.len);
+	if (err) {
+		int bind_err = -errno;
+		print_err("Couldn't bind socket\n");
+		close(sock_fd);
+		return bind_err ?: err;
+	}
+
+	*psock_fd = sock_fd;
+
+	return 0;
+}
+
+#define NUM_OF_RETRIES 10
+#define NUM_OF_TRIES ((NUM_OF_RETRIES) + 1)
+#if NUM_OF_TRIES < 1
+#error "neigh: invalid value of NUM_OF_RETRIES"
+#endif
+static int create_timer(struct get_neigh_handler *neigh_handler)
+{
+	int user_timeout = neigh_handler->timeout/NUM_OF_TRIES;
+	struct timespec timeout = {
+		.tv_sec = user_timeout / 1000,
+		.tv_nsec = (user_timeout % 1000) * 1000000
+	};
+	struct itimerspec timer_time = {.it_value = timeout};
+	int timer_fd;
+
+	timer_fd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
+	if (-1 == timer_fd) {
+		print_err("Couldn't create timer\n");
+		return timer_fd;
+	}
+
+	if (neigh_handler->timeout) {
+		if (NUM_OF_TRIES <= 1)
+			bzero(&timer_time.it_interval,
+			      sizeof(timer_time.it_interval));
+		else
+			timer_time.it_interval = timeout;
+		if (timerfd_settime(timer_fd, 0, &timer_time, NULL)) {
+			print_err("Couldn't set timer\n");
+			return -1;
+		}
+	}
+
+	return timer_fd;
+}
+
+#define UDP_SOCKET_MAX_SENDTO 100000ULL
+
+static struct nl_addr *process_get_neigh_mac(
+		struct get_neigh_handler *neigh_handler)
+{
+	int err;
+	struct nl_addr *ll_addr = get_neigh_mac(neigh_handler);
+	struct rtnl_neigh *neigh_filter;
+	fd_set fdset;
+	int sock_fd;
+	int fd;
+	int nfds;
+	int timer_fd;
+	int ret;
+	struct skt addr_dst;
+	char buff[sizeof(SEND_PAYLOAD)] = SEND_PAYLOAD;
+	int retries = 0;
+	uint64_t max_count = UDP_SOCKET_MAX_SENDTO;
+
+	if (NULL != ll_addr)
+		return ll_addr;
+
+	err = nl_socket_add_membership(neigh_handler->sock,
+				       RTNLGRP_NEIGH);
+	if (err < 0) {
+		print_err("%s\n", nl_geterror(err));
+		return NULL;
+	}
+
+	neigh_filter = create_filter_neigh_for_dst(neigh_handler->dst,
+						   neigh_handler->oif);
+	if (NULL == neigh_filter)
+		return NULL;
+
+	set_neigh_filter(neigh_handler, neigh_filter);
+
+#ifdef HAVE_LIBNL1
+	nl_disable_sequence_check(neigh_handler->sock);
+#else
+	nl_socket_disable_seq_check(neigh_handler->sock);
+#endif
+	nl_socket_modify_cb(neigh_handler->sock, NL_CB_VALID, NL_CB_CUSTOM,
+			    &get_neigh_cb, neigh_handler);
+
+	fd = nl_socket_get_fd(neigh_handler->sock);
+
+	err = create_socket(neigh_handler, &addr_dst, &sock_fd);
+
+	if (err)
+		return NULL;
+
+	do {
+		err = sendto(sock_fd, buff, sizeof(buff), 0,
+				 &addr_dst.sktaddr.s,
+				 addr_dst.len);
+		if (err > 0)
+			err = 0;
+	} while (-1 == err && EADDRNOTAVAIL == errno && --max_count);
+
+	if (err) {
+		print_err("Failed to send packet %d", err);
+		goto close_socket;
+	}
+	timer_fd = create_timer(neigh_handler);
+	if (timer_fd < 0)
+		goto close_socket;
+
+	nfds = MAX(fd, timer_fd) + 1;
+
+
+	while (1) {
+		FD_ZERO(&fdset);
+		FD_SET(fd, &fdset);
+		FD_SET(timer_fd, &fdset);
+
+		/* wait for an incoming message on the netlink socket */
+		ret = select(nfds, &fdset, NULL, NULL, NULL);
+
+		if (ret) {
+			if (FD_ISSET(fd, &fdset)) {
+				nl_recvmsgs_default(neigh_handler->sock);
+				if (neigh_handler->found_ll_addr)
+					break;
+			} else {
+				nl_cache_refill(neigh_handler->sock,
+						neigh_handler->neigh_cache);
+				ll_addr = get_neigh_mac(neigh_handler);
+				if (NULL != ll_addr) {
+					break;
+				} else if (FD_ISSET(timer_fd, &fdset) &&
+					   retries < NUM_OF_RETRIES) {
+					if (sendto(sock_fd, buff, sizeof(buff),
+						   0, &addr_dst.sktaddr.s,
+						   addr_dst.len) < 0)
+						print_err("Failed to send "
+							  "packet while waiting"
+							  " for events\n");
+				}
+			}
+
+			if (FD_ISSET(timer_fd, &fdset)) {
+				uint64_t read_val;
+				if (read(timer_fd, &read_val, sizeof(read_val))
+				    < 0)
+					print_dbg("Read timer_fd failed\n");
+				if (++retries >=  NUM_OF_TRIES) {
+					print_dbg("Timeout while trying to fetch "
+						  "neighbours\n");
+					break;
+				}
+			}
+		}
+	}
+	close(timer_fd);
+close_socket:
+	close(sock_fd);
+	return ll_addr ? ll_addr : neigh_handler->found_ll_addr;
+}
+
+
+static int get_mcast_mac_ipv4(struct nl_addr *dst, struct nl_addr **ll_addr)
+{
+	char mac_addr[6] = {0x01, 0x00, 0x5E};
+	uint32_t addr = ntohl(*(uint32_t *)nl_addr_get_binary_addr(dst));
+	mac_addr[5] = addr & 0xFF;
+	addr >>= 8;
+	mac_addr[4] = addr & 0xFF;
+	addr >>= 8;
+	mac_addr[3] = addr & 0x7F;
+
+	*ll_addr = nl_addr_build(AF_LLC, mac_addr, sizeof(mac_addr));
+
+	return *ll_addr == NULL ? -EINVAL : 0;
+}
+
+static int get_mcast_mac_ipv6(struct nl_addr *dst, struct nl_addr **ll_addr)
+{
+	char mac_addr[6] = {0x33, 0x33};
+	memcpy(mac_addr + 2, (char *)nl_addr_get_binary_addr(dst) + 12, 4);
+
+	*ll_addr = nl_addr_build(AF_LLC, mac_addr, sizeof(mac_addr));
+
+	return *ll_addr == NULL ? -EINVAL : 0;
+}
+
+static int get_link_local_mac_ipv6(struct nl_addr *dst,
+				   struct nl_addr **ll_addr)
+{
+	char mac_addr[6];
+
+	memcpy(mac_addr + 3, (char *)nl_addr_get_binary_addr(dst) + 13, 3);
+	memcpy(mac_addr, (char *)nl_addr_get_binary_addr(dst) + 8, 3);
+	mac_addr[0] ^= 2;
+
+	*ll_addr = nl_addr_build(AF_LLC, mac_addr, sizeof(mac_addr));
+	return *ll_addr == NULL ? -EINVAL : 0;
+}
+
+const struct encoded_l3_addr {
+	short family;
+	uint8_t prefix_bits;
+	const char data[16];
+	int (*getter)(struct nl_addr *dst, struct nl_addr **ll_addr);
+} encoded_prefixes[] = {
+	{.family = AF_INET,
+	 .prefix_bits = 4,
+	 .data = {0xe0},
+	 .getter = &get_mcast_mac_ipv4},
+	{.family = AF_INET6,
+	 .prefix_bits = 8,
+	 .data = {0xff},
+	 .getter = &get_mcast_mac_ipv6},
+	{.family = AF_INET6,
+	 .prefix_bits = 64,
+	 .data = {0xfe, 0x80},
+	 .getter = get_link_local_mac_ipv6},
+};
+
+int nl_addr_cmp_prefix_msb(void *addr1, int len1, void *addr2, int len2)
+{
+	int len = MIN(len1, len2);
+	int bytes = len / 8;
+	int d = memcmp(addr1, addr2, bytes);
+
+	if (d == 0) {
+		int mask = ((1UL << (len % 8)) - 1UL) << (8 - len);
+
+		d = (((char *)addr1)[bytes] & mask) -
+		    (((char *)addr2)[bytes] & mask);
+	}
+
+	return d;
+}
+static int handle_encoded_mac(struct nl_addr *dst, struct nl_addr **ll_addr)
+{
+	uint32_t family = nl_addr_get_family(dst);
+	struct nl_addr *prefix = NULL;
+	int i;
+	int ret = 1;
+
+	for (i = 0;
+	     i < sizeof(encoded_prefixes)/sizeof(encoded_prefixes[0]) &&
+	     ret; prefix = NULL, i++) {
+		if (encoded_prefixes[i].family != family)
+			continue;
+
+		prefix = nl_addr_build(
+				family,
+				(void *)encoded_prefixes[i].data,
+				MIN(encoded_prefixes[i].prefix_bits/8 +
+				    !!(encoded_prefixes[i].prefix_bits % 8),
+				    sizeof(encoded_prefixes[i].data)));
+
+		if (NULL == prefix)
+			return -ENOMEM;
+		nl_addr_set_prefixlen(prefix,
+				      encoded_prefixes[i].prefix_bits);
+
+		if (!nl_addr_cmp_prefix_msb(nl_addr_get_binary_addr(dst),
+					   nl_addr_get_prefixlen(dst),
+					   nl_addr_get_binary_addr(prefix),
+					   nl_addr_get_prefixlen(prefix)))
+			ret = encoded_prefixes[i].getter(dst, ll_addr);
+#ifdef HAVE_LIBNL1
+		nl_addr_destroy(prefix);
+#else
+		nl_addr_put(prefix);
+#endif
+	}
+
+	return ret;
+}
+
+static void get_route_cb_parser(struct nl_object *obj, void *arg)
+{
+	struct get_neigh_handler *neigh_handler =
+		(struct get_neigh_handler *)arg;
+
+	struct rtnl_route *route = (struct rtnl_route *)obj;
+	struct nl_addr *gateway;
+	struct nl_addr *src = rtnl_route_get_pref_src(route);
+	int oif;
+	int type = rtnl_route_get_type(route);
+	struct rtnl_link *link;
+
+#ifdef HAVE_LIBNL1
+	gateway = rtnl_route_get_gateway(route);
+	oif = rtnl_route_get_oif(route);
+#else
+	struct rtnl_nexthop *nh = rtnl_route_nexthop_n(route, 0);
+	if (!nh)
+		print_err("Out of memory\n");
+	gateway = rtnl_route_nh_get_gateway(nh);
+	oif = rtnl_route_nh_get_ifindex(nh);
+#endif
+
+	if (gateway) {
+#ifdef HAVE_LIBNL1
+		nl_addr_destroy(neigh_handler->dst);
+#else
+		nl_addr_put(neigh_handler->dst);
+#endif
+		neigh_handler->dst = nl_addr_clone(gateway);
+		print_dbg("Found gateway\n");
+	}
+
+	if (RTN_BLACKHOLE == type ||
+	    RTN_UNREACHABLE == type ||
+	    RTN_PROHIBIT == type ||
+	    RTN_THROW == type) {
+		print_err("Destination unrechable (type %d)\n", type);
+		goto err;
+	}
+
+	if (!neigh_handler->src && src)
+		neigh_handler->src = nl_addr_clone(src);
+
+	if (neigh_handler->oif < 0 && oif > 0)
+		neigh_handler->oif = oif;
+
+	/* Link Local */
+	if (RTN_LOCAL == type) {
+		struct nl_addr *lladdr;
+
+		link = rtnl_link_get(neigh_handler->link_cache,
+				     neigh_handler->oif);
+
+		if (NULL == link)
+			goto err;
+
+		lladdr = rtnl_link_get_addr(link);
+
+		if (NULL == lladdr)
+			goto err_link;
+
+		neigh_handler->found_ll_addr = nl_addr_clone(lladdr);
+		rtnl_link_put(link);
+	} else {
+		if (!handle_encoded_mac(
+				neigh_handler->dst,
+				&neigh_handler->found_ll_addr))
+			print_address(stderr, "calculated address %s\n",
+				      neigh_handler->found_ll_addr);
+	}
+
+	print_address(stderr, "Current dst %s\n", neigh_handler->dst);
+	return;
+
+err_link:
+	rtnl_link_put(link);
+err:
+	if (neigh_handler->src) {
+#ifdef HAVE_LIBNL1
+		nl_addr_put(neigh_handler->src);
+#else
+		nl_addr_put(neigh_handler->src);
+#endif
+		neigh_handler->src = NULL;
+	}
+}
+
+static int get_route_cb(struct nl_msg *msg, void *arg)
+{
+	struct get_neigh_handler *neigh_handler =
+		(struct get_neigh_handler *)arg;
+	int err;
+
+	err = nl_msg_parse(msg, &get_route_cb_parser, neigh_handler);
+	if (err < 0) {
+		print_err("Unable to parse object: %s", nl_geterror(err));
+		return err;
+	}
+
+	if (!neigh_handler->dst || !neigh_handler->src ||
+	    neigh_handler->oif <= 0) {
+		print_err("Missing params\n");
+		return -1;
+	}
+
+	if (NULL != neigh_handler->found_ll_addr)
+		goto found;
+
+	neigh_handler->found_ll_addr =
+		process_get_neigh_mac(neigh_handler);
+	if (neigh_handler->found_ll_addr)
+		print_address(stderr, "Neigh is %s\n",
+			      neigh_handler->found_ll_addr);
+
+found:
+	return neigh_handler->found_ll_addr ? 0 : -1;
+}
+
+int neigh_get_oif_from_src(struct get_neigh_handler *neigh_handler)
+{
+	int oif = -ENODEV;
+	struct addrinfo *src_info;
+
+#ifdef HAVE_LIBNL1
+	src_info = nl_addr_info(neigh_handler->src);
+	if (NULL == src_info) {
+#else
+	int err;
+	err = nl_addr_info(neigh_handler->src, &src_info);
+	if (err) {
+#endif
+		print_err("Unable to get address info %s\n",
+			  nl_geterror(err));
+		return oif;
+	}
+
+	oif = get_ifindex(src_info->ai_addr);
+	if (oif <= 0)
+		goto free;
+
+	print_dbg("IF index is %d\n", oif);
+
+free:
+	freeaddrinfo(src_info);
+	return oif;
+}
+
+static void destroy_zero_based_socket(void)
+{
+	if (NULL != zero_socket) {
+		print_dbg("destroying zero based socket\n");
+#ifdef HAVE_LIBNL1
+		nl_handle_destroy(zero_socket);
+#else
+		nl_socket_free(zero_socket);
+#endif
+	}
+}
+
+static void alloc_zero_based_socket(void)
+{
+#ifdef HAVE_LIBNL1
+	zero_socket = nl_handle_alloc();
+#else
+	zero_socket = nl_socket_alloc();
+#endif
+	print_dbg("creating zero based socket\n");
+	atexit(&destroy_zero_based_socket);
+}
+
+int neigh_init_resources(struct get_neigh_handler *neigh_handler, int timeout)
+{
+	int err;
+
+	pthread_once(&device_neigh_alloc, &alloc_zero_based_socket);
+#ifdef HAVE_LIBNL1
+	neigh_handler->sock = nl_handle_alloc();
+#else
+	neigh_handler->sock = nl_socket_alloc();
+#endif
+	if (NULL == neigh_handler->sock) {
+		print_err("Unable to allocate netlink socket\n");
+		return -ENOMEM;
+	}
+
+	err = nl_connect(neigh_handler->sock, NETLINK_ROUTE);
+	if (err < 0) {
+		print_err("Unable to connect netlink socket: %s\n",
+			  nl_geterror(err));
+		goto free_socket;
+	}
+
+#ifdef HAVE_LIBNL1
+	neigh_handler->link_cache =
+		rtnl_link_alloc_cache(neigh_handler->sock);
+	if (NULL == neigh_handler->link_cache) {
+		err = -ENOMEM;
+#else
+
+	err = rtnl_link_alloc_cache(neigh_handler->sock, AF_UNSPEC,
+				    &neigh_handler->link_cache);
+	if (err) {
+#endif
+		print_err("Unable to allocate link cache: %s\n",
+			  nl_geterror(err));
+		err = -ENOMEM;
+		goto close_connection;
+	}
+
+	nl_cache_mngt_provide(neigh_handler->link_cache);
+
+#ifdef HAVE_LIBNL1
+	neigh_handler->route_cache =
+		rtnl_route_alloc_cache(neigh_handler->sock);
+	if (NULL == neigh_handler->route_cache) {
+		err = -ENOMEM;
+#else
+	err = rtnl_route_alloc_cache(neigh_handler->sock, AF_UNSPEC, 0,
+				     &neigh_handler->route_cache);
+	if (err) {
+#endif
+		print_err("Unable to allocate route cache: %s\n",
+			  nl_geterror(err));
+		goto free_link_cache;
+	}
+
+	nl_cache_mngt_provide(neigh_handler->route_cache);
+
+#ifdef HAVE_LIBNL1
+	neigh_handler->neigh_cache =
+		rtnl_neigh_alloc_cache(neigh_handler->sock);
+	if (NULL == neigh_handler->neigh_cache) {
+		err = -ENOMEM;
+#else
+	err = rtnl_neigh_alloc_cache(neigh_handler->sock,
+				     &neigh_handler->neigh_cache);
+	if (err) {
+#endif
+		print_err("Couldn't allocate neigh cache %s\n",
+			  nl_geterror(err));
+		goto free_route_cache;
+	}
+
+	nl_cache_mngt_provide(neigh_handler->neigh_cache);
+
+	/* init structure */
+	neigh_handler->timeout = timeout;
+	neigh_handler->oif = -1;
+	neigh_handler->filter_neigh = NULL;
+	neigh_handler->found_ll_addr = NULL;
+	neigh_handler->dst = NULL;
+	neigh_handler->src = NULL;
+	neigh_handler->vid = -1;
+	neigh_handler->neigh_status = GET_NEIGH_STATUS_OK;
+
+	return 0;
+
+free_route_cache:
+	nl_cache_mngt_unprovide(neigh_handler->route_cache);
+	nl_cache_free(neigh_handler->route_cache);
+	neigh_handler->route_cache = NULL;
+free_link_cache:
+	nl_cache_mngt_unprovide(neigh_handler->link_cache);
+	nl_cache_free(neigh_handler->link_cache);
+	neigh_handler->link_cache = NULL;
+close_connection:
+	nl_close(neigh_handler->sock);
+free_socket:
+#ifdef HAVE_LIBNL1
+	nl_handle_destroy(neigh_handler->sock);
+		nl_handle_destroy(zero_socket);
+#else
+	nl_socket_free(neigh_handler->sock);
+#endif
+	neigh_handler->sock = NULL;
+	return err;
+}
+
+uint16_t neigh_get_vlan_id_from_dev(struct get_neigh_handler *neigh_handler)
+{
+	struct rtnl_link *link;
+	int vid = 0xffff;
+
+	link = rtnl_link_get(neigh_handler->link_cache, neigh_handler->oif);
+	if (NULL == link) {
+		print_err("Can't find link in cache\n");
+		return -EINVAL;
+	}
+
+#ifndef HAVE_LIBNL1
+	if (rtnl_link_is_vlan(link))
+#endif
+		vid = rtnl_link_vlan_get_id(link);
+	rtnl_link_put(link);
+	return vid >= 0 && vid <= 0xfff ? vid : 0xffff;
+}
+
+void neigh_set_vlan_id(struct get_neigh_handler *neigh_handler, uint16_t vid)
+{
+	if (vid >= 0 && vid <= 0xfff)
+		neigh_handler->vid = vid;
+}
+
+int neigh_set_dst(struct get_neigh_handler *neigh_handler,
+		  int family, void *buf, size_t size)
+{
+	neigh_handler->dst = nl_addr_build(family, buf, size);
+	return NULL == neigh_handler->dst;
+}
+
+int neigh_set_src(struct get_neigh_handler *neigh_handler,
+		  int family, void *buf, size_t size)
+{
+	neigh_handler->src = nl_addr_build(family, buf, size);
+	return NULL == neigh_handler->src;
+}
+
+void neigh_set_oif(struct get_neigh_handler *neigh_handler, int oif)
+{
+	neigh_handler->oif = oif;
+}
+
+int neigh_get_ll(struct get_neigh_handler *neigh_handler, void *addr_buff,
+		 int addr_size) {
+	int neigh_len;
+
+	if (NULL == neigh_handler->found_ll_addr)
+		return -EINVAL;
+
+	 neigh_len = nl_addr_get_len(neigh_handler->found_ll_addr);
+
+	if (neigh_len > addr_size)
+		return -EINVAL;
+
+	memcpy(addr_buff, nl_addr_get_binary_addr(neigh_handler->found_ll_addr),
+	       neigh_len);
+
+	return neigh_len;
+}
+
+void neigh_free_resources(struct get_neigh_handler *neigh_handler)
+{
+	/* Should be released first because it's holding a reference to dst */
+	if (NULL != neigh_handler->filter_neigh) {
+		rtnl_neigh_put(neigh_handler->filter_neigh);
+		neigh_handler->filter_neigh = NULL;
+	}
+
+	if (NULL != neigh_handler->src) {
+#ifdef HAVE_LIBNL1
+		nl_addr_put(neigh_handler->src);
+#else
+		nl_addr_put(neigh_handler->src);
+#endif
+		neigh_handler->src = NULL;
+	}
+
+	if (NULL != neigh_handler->dst) {
+#ifdef HAVE_LIBNL1
+		nl_addr_destroy(neigh_handler->dst);
+#else
+		nl_addr_put(neigh_handler->dst);
+#endif
+		neigh_handler->dst = NULL;
+	}
+
+	if (NULL != neigh_handler->found_ll_addr) {
+#ifdef HAVE_LIBNL1
+		nl_addr_destroy(neigh_handler->found_ll_addr);
+#else
+		nl_addr_put(neigh_handler->found_ll_addr);
+#endif
+		neigh_handler->found_ll_addr = NULL;
+	}
+
+	if (NULL != neigh_handler->neigh_cache) {
+		nl_cache_mngt_unprovide(neigh_handler->neigh_cache);
+		nl_cache_free(neigh_handler->neigh_cache);
+		neigh_handler->neigh_cache = NULL;
+	}
+
+	if (NULL != neigh_handler->route_cache) {
+		nl_cache_mngt_unprovide(neigh_handler->route_cache);
+		nl_cache_free(neigh_handler->route_cache);
+		neigh_handler->route_cache = NULL;
+	}
+
+	if (NULL != neigh_handler->link_cache) {
+		nl_cache_mngt_unprovide(neigh_handler->link_cache);
+		nl_cache_free(neigh_handler->link_cache);
+		neigh_handler->link_cache = NULL;
+	}
+
+	if (NULL != neigh_handler->sock) {
+		nl_close(neigh_handler->sock);
+#ifdef HAVE_LIBNL1
+		nl_handle_destroy(neigh_handler->sock);
+#else
+		nl_socket_free(neigh_handler->sock);
+#endif
+		neigh_handler->sock = NULL;
+	}
+}
+
+int process_get_neigh(struct get_neigh_handler *neigh_handler)
+{
+	struct nl_msg *m;
+	struct rtmsg rmsg = {
+		.rtm_family = nl_addr_get_family(neigh_handler->dst),
+		.rtm_dst_len = nl_addr_get_prefixlen(neigh_handler->dst),
+	};
+	int err;
+	int last_status;
+
+	last_status = __sync_fetch_and_or(
+			&neigh_handler->neigh_status,
+			GET_NEIGH_STATUS_IN_PROCESS);
+
+	if (last_status != GET_NEIGH_STATUS_OK)
+		return -EINPROGRESS;
+
+	m = nlmsg_alloc_simple(RTM_GETROUTE, 0);
+
+	if (NULL == m)
+		return -ENOMEM;
+
+	nlmsg_append(m, &rmsg, sizeof(rmsg), NLMSG_ALIGNTO);
+
+	nla_put_addr(m, RTA_DST, neigh_handler->dst);
+
+	if (neigh_handler->oif > 0)
+		nla_put_u32(m, RTA_OIF, neigh_handler->oif);
+
+	err = nl_send_auto_complete(neigh_handler->sock, m);
+	nlmsg_free(m);
+	if (err < 0) {
+		print_err("%s", nl_geterror(err));
+		return err;
+	}
+
+	nl_socket_modify_cb(neigh_handler->sock, NL_CB_VALID,
+			    NL_CB_CUSTOM, &get_route_cb, neigh_handler);
+
+	err = nl_recvmsgs_default(neigh_handler->sock);
+	if (err < 0) {
+		print_err("%s", nl_geterror(err));
+		__sync_fetch_and_or(
+			&neigh_handler->neigh_status,
+			GET_NEIGH_STATUS_ERR);
+	}
+
+	__sync_fetch_and_and(
+		&neigh_handler->neigh_status,
+		~GET_NEIGH_STATUS_IN_PROCESS);
+
+	return err;
+}
diff --git a/src/neigh.h b/src/neigh.h
new file mode 100644
index 0000000..09e01e1
--- /dev/null
+++ b/src/neigh.h
@@ -0,0 +1,53 @@
+#ifndef _GET_NEIGH_
+#define _GET_NEIGH_
+
+#include <stddef.h>
+#include <stdint.h>
+#include "config.h"
+#ifdef HAVE_LIBNL1
+#include <netlink/object.h>
+#else
+#include <netlink/object-api.h>
+#endif
+
+enum get_neigh_status {
+	GET_NEIGH_STATUS_OK = 0,
+	GET_NEIGH_STATUS_IN_PROCESS = 1 << 0,
+	GET_NEIGH_STATUS_ERR = 1 << 1,
+};
+
+struct get_neigh_handler {
+#ifdef HAVE_LIBNL1
+	struct nl_handle *sock;
+#else
+	struct nl_sock *sock;
+#endif
+	struct nl_cache *link_cache;
+	struct nl_cache	*neigh_cache;
+	struct nl_cache *route_cache;
+	int32_t oif;
+	int vid;
+	struct rtnl_neigh *filter_neigh;
+	struct nl_addr *found_ll_addr;
+	struct nl_addr *dst;
+	struct nl_addr *src;
+	enum get_neigh_status neigh_status;
+	uint64_t timeout;
+};
+
+int process_get_neigh(struct get_neigh_handler *neigh_handler);
+void neigh_free_resources(struct get_neigh_handler *neigh_handler);
+void neigh_set_vlan_id(struct get_neigh_handler *neigh_handler, uint16_t vid);
+uint16_t neigh_get_vlan_id_from_dev(struct get_neigh_handler *neigh_handler);
+int neigh_init_resources(struct get_neigh_handler *neigh_handler, int timeout);
+
+int neigh_set_src(struct get_neigh_handler *neigh_handler,
+		  int family, void *buf, size_t size);
+void neigh_set_oif(struct get_neigh_handler *neigh_handler, int oif);
+int neigh_set_dst(struct get_neigh_handler *neigh_handler,
+		  int family, void *buf, size_t size);
+int neigh_get_oif_from_src(struct get_neigh_handler *neigh_handler);
+int neigh_get_ll(struct get_neigh_handler *neigh_handler, void *addr_buf,
+		 int addr_size);
+
+#endif
diff --git a/src/verbs.c b/src/verbs.c
index a6aae70..e7343a9 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -43,6 +43,9 @@
 #include <string.h>
 
 #include "ibverbs.h"
+#ifndef NRESOLVE_NEIGH
+#include "neigh.h"
+#endif
 
 int ibv_rate_to_mult(enum ibv_rate rate)
 {
@@ -505,15 +508,169 @@ int __ibv_destroy_qp(struct ibv_qp *qp)
 }
 default_symver(__ibv_destroy_qp, ibv_destroy_qp);
 
+static inline int ipv6_addr_v4mapped(const struct in6_addr *a)
+{
+	return ((a->s6_addr32[0] | a->s6_addr32[1]) |
+		(a->s6_addr32[2] ^ htonl(0x0000ffff))) == 0UL ||
+		/* IPv4 encoded multicast addresses */
+	       (a->s6_addr32[0]  == htonl(0xff0e0000) &&
+		((a->s6_addr32[1] |
+		  (a->s6_addr32[2] ^ htonl(0x0000ffff))) == 0UL));
+}
+
+
+struct peer_address {
+	void *address;
+	uint32_t size;
+};
+
+static inline int create_peer_from_gid(int family, void *raw_gid,
+				       struct peer_address *peer_address)
+{
+	switch (family) {
+	case AF_INET:
+		peer_address->address = raw_gid + 12;
+		peer_address->size = 4;
+		break;
+	case AF_INET6:
+		peer_address->address = raw_gid;
+		peer_address->size = 16;
+		break;
+	default:
+		return -1;
+	}
+
+	return 0;
+}
+
+#define ETHERNET_LL_SIZE 6
+#define NEIGH_GET_DEFAULT_TIMEOUT_MS 3000
 struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
 {
-	struct ibv_ah *ah = pd->context->ops.create_ah(pd, attr);
+	int err;
+	struct ibv_ah *ah = NULL;
+#ifndef NRESOLVE_NEIGH
+	struct ibv_port_attr port_attr;
+	int dst_family;
+	int src_family;
+	int oif;
+	struct get_neigh_handler neigh_handler;
+	union ibv_gid sgid;
+	struct ibv_ah_attr_ex attr_ex;
+	char ethernet_ll[ETHERNET_LL_SIZE];
+	struct verbs_context *vctx = verbs_get_ctx_op(pd->context,
+						      drv_ibv_create_ah_ex);
+	struct peer_address src;
+	struct peer_address dst;
+
+	if (!vctx) {
+#endif
+		ah = pd->context->ops.create_ah(pd, attr);
+#ifndef NRESOLVE_NEIGH
+		goto return_ah;
+	}
+
+	err = ibv_query_port(pd->context, attr->port_num, &port_attr);
+
+	if (err) {
+		fprintf(stderr, PFX "ibv_create_ah failed to query port.\n");
+		return NULL;
+	}
+
+	if (IBV_LINK_LAYER_ETHERNET != port_attr.link_layer ||
+	    !(port_attr.port_cap_flags & IBV_PORT_IP_BASED_GIDS)) {
+		ah = pd->context->ops.create_ah(pd, attr);
+		goto return_ah;
+	}
+
+	memcpy(&attr_ex, attr, sizeof(*attr));
+	memset((void *)&attr_ex + sizeof(*attr), 0,
+	       sizeof(attr_ex) - sizeof(*attr));
+
+	err = ibv_query_gid(pd->context, attr->port_num,
+			    attr->grh.sgid_index, &sgid);
+
+	if (err) {
+		fprintf(stderr, PFX "ibv_create_ah failed to query sgid.\n");
+		return NULL;
+	}
+
+	if (neigh_init_resources(&neigh_handler, NEIGH_GET_DEFAULT_TIMEOUT_MS))
+		return NULL;
+
+	dst_family = ipv6_addr_v4mapped((struct in6_addr *)attr->grh.dgid.raw) ?
+			AF_INET : AF_INET6;
+	src_family = ipv6_addr_v4mapped((struct in6_addr *)sgid.raw) ?
+			AF_INET : AF_INET6;
+
+	if (create_peer_from_gid(dst_family, attr->grh.dgid.raw, &dst)) {
+		fprintf(stderr, PFX
+			"ibv_create_ah failed to create dst peer\n");
+		goto free_resources;
+	}
+	if (create_peer_from_gid(src_family, &sgid.raw, &src)) {
+		fprintf(stderr, PFX
+			"ibv_create_ah failed to create src peer\n");
+		goto free_resources;
+	}
+	if (neigh_set_dst(&neigh_handler, dst_family, dst.address,
+			  dst.size)) {
+		fprintf(stderr, PFX
+			"ibv_create_ah failed to create dst addr\n");
+		goto free_resources;
+	}
+
+	if (neigh_set_src(&neigh_handler, src_family, src.address,
+			  src.size)) {
+		fprintf(stderr, PFX
+			"ibv_create_ah failed to create src addr\n");
+		goto free_resources;
+	}
+
+	oif = neigh_get_oif_from_src(&neigh_handler);
+
+	if (oif > 0) {
+		neigh_set_oif(&neigh_handler, oif);
+	} else {
+		fprintf(stderr, PFX "ibv_create_ah failed to get output IF\n");
+		goto free_resources;
+	}
+
+	/* blocking call */
+	if (process_get_neigh(&neigh_handler)) {
+		fprintf(stderr, PFX "Neigh resolution process failed\n");
+		goto free_resources;
+	}
+
+	attr_ex.vid = neigh_get_vlan_id_from_dev(&neigh_handler);
+
+	if (attr_ex.vid <= 0xfff) {
+		neigh_set_vlan_id(&neigh_handler, attr_ex.vid);
+		attr_ex.comp_mask |= IBV_AH_ATTR_EX_VID;
+	}
+	/* We are using only Ethernet here */
+	attr_ex.ll_address.len = neigh_get_ll(&neigh_handler, ethernet_ll,
+					      sizeof(ethernet_ll));
 
+	if (attr_ex.ll_address.len <= 0)
+		goto free_resources;
+
+	attr_ex.comp_mask |= IBV_AH_ATTR_EX_LL;
+	attr_ex.ll_address.type = LL_ADDRESS_ETH;
+	attr_ex.ll_address.address = ethernet_ll;
+
+
+	ah = vctx->drv_ibv_create_ah_ex(pd, &attr_ex);
+
+free_resources:
+	neigh_free_resources(&neigh_handler);
+
+return_ah:
+#endif
 	if (ah) {
 		ah->context = pd->context;
 		ah->pd      = pd;
 	}
-
 	return ah;
 }
 default_symver(__ibv_create_ah, ibv_create_ah);
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found] ` <1399531883-30599-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2014-05-08  6:51   ` [PATCH libibverbs V3 1/3] Add ibv_port_cap_flags Or Gerlitz
  2014-05-08  6:51   ` [PATCH libibverbs V3 2/3] Use neighbour lookup for RoCE UD QPs Eth L2 resolution Or Gerlitz
@ 2014-05-08  6:51   ` Or Gerlitz
       [not found]     ` <1399531883-30599-4-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2014-05-08  6:51 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, dledford-H+wXaHxf7aLQT0dZR+AlfA,
	Or Gerlitz

From: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch adds extended support for ibv_query_port.

This allows to request fields that aren't availible by the current
ibv_query_port API and avoid fetching from vendor library fields that
the user doesn't need, which gives more room for optimizations.

Signed-off-by: Matan Barak <matanb-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 include/infiniband/verbs.h |  105 +++++++++++++++++++++++++++++++++++++++++++-
 src/device.c               |    2 +
 src/verbs.c                |    7 ++-
 3 files changed, 111 insertions(+), 3 deletions(-)

diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
index 92d6a60..abf0058 100644
--- a/include/infiniband/verbs.h
+++ b/include/infiniband/verbs.h
@@ -492,6 +492,69 @@ struct ibv_ah_attr_ex {
 	uint16_t		vid;
 };
 
+enum {
+	IBV_QUERY_PORT_EX_STATE			= 1 << 0,
+	IBV_QUERY_PORT_EX_MAX_MTU		= 1 << 1,
+	IBV_QUERY_PORT_EX_ACTIVE_MTU		= 1 << 2,
+	IBV_QUERY_PORT_EX_GID_TBL_LEN		= 1 << 3,
+	IBV_QUERY_PORT_EX_CAP_FLAGS		= 1 << 4,
+	IBV_QUERY_PORT_EX_MAX_MSG_SZ		= 1 << 5,
+	IBV_QUERY_PORT_EX_BAD_PKEY_CNTR		= 1 << 6,
+	IBV_QUERY_PORT_EX_QKEY_VIOL_CNTR	= 1 << 7,
+	IBV_QUERY_PORT_EX_PKEY_TBL_LEN		= 1 << 8,
+	IBV_QUERY_PORT_EX_LID			= 1 << 9,
+	IBV_QUERY_PORT_EX_SM_LID		= 1 << 10,
+	IBV_QUERY_PORT_EX_LMC			= 1 << 11,
+	IBV_QUERY_PORT_EX_MAX_VL_NUM		= 1 << 12,
+	IBV_QUERY_PORT_EX_SM_SL			= 1 << 13,
+	IBV_QUERY_PORT_EX_SUBNET_TIMEOUT	= 1 << 14,
+	IBV_QUERY_PORT_EX_INIT_TYPE_REPLY	= 1 << 15,
+	IBV_QUERY_PORT_EX_ACTIVE_WIDTH		= 1 << 16,
+	IBV_QUERY_PORT_EX_ACTIVE_SPEED		= 1 << 17,
+	IBV_QUERY_PORT_EX_PHYS_STATE		= 1 << 18,
+	IBV_QUERY_PORT_EX_LINK_LAYER		= 1 << 19,
+	/* mask of the fields that exists in the standard query_port_command */
+	IBV_QUERY_PORT_EX_STD_MASK		= (1 << 20) - 1,
+	/* mask of all supported fields */
+	IBV_QUERY_PORT_EX_MASK			= IBV_QUERY_PORT_EX_STD_MASK,
+};
+
+enum ibv_query_port_ex_attr_mask {
+	IBV_QUERY_PORT_EX_ATTR_MASK1		= 1 << 0,
+	IBV_QUERY_PORT_EX_ATTR_MASKS		= (1 << 1) - 1
+};
+
+struct ibv_port_attr_ex {
+	union {
+		struct {
+			enum ibv_port_state	state;
+			enum ibv_mtu		max_mtu;
+			enum ibv_mtu		active_mtu;
+			int			gid_tbl_len;
+			uint32_t		port_cap_flags;
+			uint32_t		max_msg_sz;
+			uint32_t		bad_pkey_cntr;
+			uint32_t		qkey_viol_cntr;
+			uint16_t		pkey_tbl_len;
+			uint16_t		lid;
+			uint16_t		sm_lid;
+			uint8_t			lmc;
+			uint8_t			max_vl_num;
+			uint8_t			sm_sl;
+			uint8_t			subnet_timeout;
+			uint8_t			init_type_reply;
+			uint8_t			active_width;
+			uint8_t			active_speed;
+			uint8_t			phys_state;
+			uint8_t			link_layer;
+			uint8_t			reserved;
+		};
+		struct ibv_port_attr		port_attr;
+	};
+	uint32_t		comp_mask;
+	uint32_t		mask1;
+};
+
 
 enum ibv_srq_attr_mask {
 	IBV_SRQ_MAX_WR	= 1 << 0,
@@ -999,11 +1062,16 @@ enum verbs_context_mask {
 	VERBS_CONTEXT_CREATE_FLOW = 1 << 3,
 	VERBS_CONTEXT_DESTROY_FLOW = 1 << 4,
 	VERBS_CONTEXT_CREATE_AH = 1 << 5,
-	VERBS_CONTEXT_RESERVED	= 1 << 6
+	VERBS_CONTEXT_QUERY_PORT = 1 << 6,
+	VERBS_CONTEXT_RESERVED	= 1 << 7
 };
 
 struct verbs_context {
 	/*  "grows up" - new fields go here */
+	int (*drv_query_port_ex)(struct ibv_context *context, uint8_t port_num,
+				 struct ibv_port_attr_ex *port_attr);
+	int (*lib_query_port_ex)(struct ibv_context *context, uint8_t port_num,
+				  struct ibv_port_attr_ex *port_attr);
 	struct ibv_ah * (*drv_ibv_create_ah_ex)(struct ibv_pd *pd,
 						struct ibv_ah_attr_ex *attr);
 	int (*drv_ibv_destroy_flow) (struct ibv_flow *flow);
@@ -1140,6 +1208,41 @@ static inline int ___ibv_query_port(struct ibv_context *context,
 #define ibv_query_port(context, port_num, port_attr) \
 	___ibv_query_port(context, port_num, port_attr)
 
+static inline int ibv_query_port_ex(struct ibv_context *context,
+				    uint8_t port_num,
+				    struct ibv_port_attr_ex *port_attr)
+{
+	struct verbs_context *vctx;
+
+	if (0 == port_attr->comp_mask)
+		return ibv_query_port(context, port_num,
+				      &port_attr->port_attr);
+
+	/* Check that only valid flags were given */
+	if ((!port_attr->comp_mask & IBV_QUERY_PORT_EX_ATTR_MASK1) ||
+	    (port_attr->comp_mask & ~IBV_QUERY_PORT_EX_ATTR_MASKS) ||
+	    (port_attr->mask1 & ~IBV_QUERY_PORT_EX_MASK)) {
+		errno = EINVAL;
+		return -errno;
+	}
+
+	vctx = verbs_get_ctx_op(context, lib_query_port_ex);
+
+	if (!vctx) {
+		/* Fallback to legacy mode */
+		if (port_attr->comp_mask == IBV_QUERY_PORT_EX_ATTR_MASK1 &&
+		    !(port_attr->mask1 & ~IBV_QUERY_PORT_EX_STD_MASK))
+			return ibv_query_port(context, port_num,
+					      &port_attr->port_attr);
+
+		/* Unsupported field was requested */
+		errno = ENOSYS;
+		return -errno;
+	}
+
+	return vctx->lib_query_port_ex(context, port_num, port_attr);
+}
+
 /**
  * ibv_query_gid - Get a GID table entry
  */
diff --git a/src/device.c b/src/device.c
index 9642ead..eb3f2cd 100644
--- a/src/device.c
+++ b/src/device.c
@@ -173,6 +173,8 @@ struct ibv_context *__ibv_open_device(struct ibv_device *device)
 			 context_ex->drv_ibv_create_flow;
 		 context_ex->lib_ibv_destroy_flow =
 			 context_ex->drv_ibv_destroy_flow;
+		 context_ex->lib_query_port_ex =
+			 context_ex->drv_query_port_ex;
 	}
 
 	context->device = device;
diff --git a/src/verbs.c b/src/verbs.c
index e7343a9..f8245b0 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -550,7 +550,7 @@ struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
 	int err;
 	struct ibv_ah *ah = NULL;
 #ifndef NRESOLVE_NEIGH
-	struct ibv_port_attr port_attr;
+	struct ibv_port_attr_ex port_attr;
 	int dst_family;
 	int src_family;
 	int oif;
@@ -570,7 +570,10 @@ struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
 		goto return_ah;
 	}
 
-	err = ibv_query_port(pd->context, attr->port_num, &port_attr);
+	port_attr.comp_mask = IBV_QUERY_PORT_EX_ATTR_MASK1;
+	port_attr.mask1 = IBV_QUERY_PORT_EX_LINK_LAYER |
+			  IBV_QUERY_PORT_EX_CAP_FLAGS;
+	err = ibv_query_port_ex(pd->context, attr->port_num, &port_attr);
 
 	if (err) {
 		fprintf(stderr, PFX "ibv_create_ah failed to query port.\n");
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]     ` <1399531883-30599-4-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-05-08 19:09       ` Jason Gunthorpe
       [not found]         ` <20140508190951.GB25849-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2014-05-08 19:09 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	matanb-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On Thu, May 08, 2014 at 09:51:23AM +0300, Or Gerlitz wrote:

> +struct ibv_port_attr_ex {
> +	union {
> +		struct {
> +			enum ibv_port_state	state;
> +			enum ibv_mtu		max_mtu;
> +			enum ibv_mtu		active_mtu;
> +			int			gid_tbl_len;
> +			uint32_t		port_cap_flags;
> +			uint32_t		max_msg_sz;
> +			uint32_t		bad_pkey_cntr;
> +			uint32_t		qkey_viol_cntr;
> +			uint16_t		pkey_tbl_len;
> +			uint16_t		lid;
> +			uint16_t		sm_lid;
> +			uint8_t			lmc;
> +			uint8_t			max_vl_num;
> +			uint8_t			sm_sl;
> +			uint8_t			subnet_timeout;
> +			uint8_t			init_type_reply;
> +			uint8_t			active_width;
> +			uint8_t			active_speed;
> +			uint8_t			phys_state;
> +			uint8_t			link_layer;
> +			uint8_t			reserved;
> +		};
> +		struct ibv_port_attr		port_attr;
> +	};
> +	uint32_t		comp_mask;
> +	uint32_t		mask1;
> +};

I really don't like this deviation from the standard _ex
pattern. 

Anonymous struct/union is a C11 feature and GNU extension. I don't
think is appropriate for a user library to rely on. We cannot assume
the user is has a compiler in C11 mode.

The cannonical layout should be:

struct ibv_port_attr_ex {
   uint64_t comp_mask;
   struct ibv_port_attr	port_attr;
   // New stuff goes here
}

It is fine to use comp_mask to indicate all the fields, no need for
the weird mask1, lets not over complicate things for users.

>  struct verbs_context {
>  	/*  "grows up" - new fields go here */
> +	int (*drv_query_port_ex)(struct ibv_context *context, uint8_t port_num,
> +				 struct ibv_port_attr_ex *port_attr);
> +	int (*lib_query_port_ex)(struct ibv_context *context, uint8_t port_num,
> +				  struct ibv_port_attr_ex *port_attr);
>  	struct ibv_ah * (*drv_ibv_create_ah_ex)(struct ibv_pd *pd,
>  						struct ibv_ah_attr_ex *attr);
>  	int (*drv_ibv_destroy_flow) (struct ibv_flow *flow);

I'm not sure what Roland decided to merge, but I am surprised to see
drv_ elements here. That was nix'd in the round of review of the first
patch set. eg create_qp_ex/etc don't have them.

The flow is such that the verbs code has a chance to capture and
override the function pointer after the driver sets it, there is no
purpose to the drv_ values.

I wouldn't like to see more added, and I think you should make a patch
to ensure they are not necessary before it propogates too far.

> diff --git a/src/verbs.c b/src/verbs.c
> index e7343a9..f8245b0 100644
> +++ b/src/verbs.c
> @@ -550,7 +550,7 @@ struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
>  	int err;
>  	struct ibv_ah *ah = NULL;
>  #ifndef NRESOLVE_NEIGH
> -	struct ibv_port_attr port_attr;
> +	struct ibv_port_attr_ex port_attr;
>  	int dst_family;
>  	int src_family;
>  	int oif;
> @@ -570,7 +570,10 @@ struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
>  		goto return_ah;
>  	}
>  
> -	err = ibv_query_port(pd->context, attr->port_num, &port_attr);
> +	port_attr.comp_mask = IBV_QUERY_PORT_EX_ATTR_MASK1;
> +	port_attr.mask1 = IBV_QUERY_PORT_EX_LINK_LAYER |
> +			  IBV_QUERY_PORT_EX_CAP_FLAGS;
> +	err = ibv_query_port_ex(pd->context, attr->port_num, &port_attr);
>  
>  	if (err) {
>  		fprintf(stderr, PFX "ibv_create_ah failed to query port.\n");

I wonder if this belongs in a seperate patch, I had to read it twice
to realize this change was to take advantage of the reduced query
performance optimization.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 2/3] Use neighbour lookup for RoCE UD QPs Eth L2 resolution
       [not found]     ` <1399531883-30599-3-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-05-08 19:29       ` Jason Gunthorpe
  0 siblings, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2014-05-08 19:29 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	matanb-VPRAkNaXOzVWk0Htik3J/w, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	dledford-H+wXaHxf7aLQT0dZR+AlfA

On Thu, May 08, 2014 at 09:51:22AM +0300, Or Gerlitz wrote:

> In order to implement RoCE IP based addressing for UD QPs, without
> introducing uverbs changes, we need a way to resolve the L2 Ethernet
> addresses from user-space.

Please put anything that changes the public verbs ABI in a seperate
patch.

Burying the create_ah_ex stuff in this giant thing is not condusive to
reviewing the ABI portion.

The ABI is *important* do everything you can to bring as many eyes as
possible to those changes!!

> +struct ibv_ah_attr_ex {
> +	struct ibv_global_route	grh;
> +	uint16_t		dlid;
> +	uint8_t			sl;
> +	uint8_t			src_path_bits;
> +	uint8_t			static_rate;
> +	uint8_t			is_global;
> +	uint8_t			port_num;
> +	uint32_t		comp_mask;
> +	struct {
> +		enum ll_address_type type;
> +		uint32_t	len;
> +		char		*address;
> +	} ll_address;
> +	uint16_t		vid;
> +};

The above looks sorta reasonable, but:

What is the 'char *address'? Should it be 'const void *' ?????

Who owns that memory? Who allocates it, who frees it? Seems sketchy
as heck.

What happens if a ibv_ah_attr_ex is returned in some kind of
ibv_qp_attr_ex structure from a future query_qp?

Can you just put in a 'uint64_t address[8];' or sockaddr_storage, or
something?

> +
>  enum ibv_srq_attr_mask {
>  	IBV_SRQ_MAX_WR	= 1 << 0,
>  	IBV_SRQ_LIMIT	= 1 << 1
> @@ -968,11 +998,14 @@ enum verbs_context_mask {
>  	VERBS_CONTEXT_QP	= 1 << 2,
>  	VERBS_CONTEXT_CREATE_FLOW = 1 << 3,
>  	VERBS_CONTEXT_DESTROY_FLOW = 1 << 4,
> -	VERBS_CONTEXT_RESERVED	= 1 << 5
> +	VERBS_CONTEXT_CREATE_AH = 1 << 5,
> +	VERBS_CONTEXT_RESERVED	= 1 << 6
>  };

What is going on here??

verbs_context_mask is messed up already!?

It is NOT supposed to be a mask if the function is available - we
can already test for NULL for that.

It is supposed to indicate if structures *allocated by the provider*
are extended. 

For instance if VERBS_CONTEXT_QP is set, then everyone knows that the
'struct ibv_qp' returned by *any* driver function (crate_qp, open_qp,
create_qp_ex) is convertible to 'verbs_qp'.

VERBS_CONTEXT_CREATE_FLOW/VERBS_CONTEXT_DESTROY_FLOW are screwed
up. They shouldn't even exist, there is no unextended ibv_flow
structure to worry about.

Fix this patch to not add VERBS_CONTEXT_CREATE_AH, ibv_ah_attr_ex is
not a provider allocated structure so it doesn't need an entry.

Please send a patch to remove the screwed up two above and replace
with RESERVED.

And BE MORE CAREFUL. THIS SHOULD NOT HAVE HAPPENED!

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]         ` <20140508190951.GB25849-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-05-08 19:40           ` Christoph Lameter
       [not found]             ` <alpine.DEB.2.10.1405081439370.26267-gkYfJU5Cukgdnm+yROfE0A@public.gmane.org>
  2014-05-09 14:01           ` Roland Dreier
  2014-05-12 12:22           ` Matan Barak
  2 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2014-05-08 19:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, dledford-H+wXaHxf7aLQT0dZR+AlfA

On Thu, 8 May 2014, Jason Gunthorpe wrote:

> Anonymous struct/union is a C11 feature and GNU extension. I don't
> think is appropriate for a user library to rely on. We cannot assume
> the user is has a compiler in C11 mode.

Anonymous structs/union are used in the kernel already. See
include/linux/mm_types.h f.e.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]             ` <alpine.DEB.2.10.1405081439370.26267-gkYfJU5Cukgdnm+yROfE0A@public.gmane.org>
@ 2014-05-08 19:47               ` Jason Gunthorpe
  0 siblings, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2014-05-08 19:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb-VPRAkNaXOzVWk0Htik3J/w,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, dledford-H+wXaHxf7aLQT0dZR+AlfA

On Thu, May 08, 2014 at 02:40:44PM -0500, Christoph Lameter wrote:
> On Thu, 8 May 2014, Jason Gunthorpe wrote:
> 
> > Anonymous struct/union is a C11 feature and GNU extension. I don't
> > think is appropriate for a user library to rely on. We cannot assume
> > the user is has a compiler in C11 mode.
> 
> Anonymous structs/union are used in the kernel already. See
> include/linux/mm_types.h f.e.

Re-read: This is a userspace header.

mm_types.h is never included by a userspace program, so it doesn't
matter.

Find something in include/uapi and we'll talk :)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]         ` <20140508190951.GB25849-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2014-05-08 19:40           ` Christoph Lameter
@ 2014-05-09 14:01           ` Roland Dreier
       [not found]             ` <CAL1RGDUiBtOaYKZVR3ghOZzG6J0p5EV5o4FTTW0E43mHz-BaVQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-05-12 12:22           ` Matan Barak
  2 siblings, 1 reply; 20+ messages in thread
From: Roland Dreier @ 2014-05-09 14:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb,
	Yishai Hadas, Doug Ledford

On Thu, May 8, 2014 at 12:09 PM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>>  struct verbs_context {
>>       /*  "grows up" - new fields go here */
>> +     int (*drv_query_port_ex)(struct ibv_context *context, uint8_t port_num,
>> +                              struct ibv_port_attr_ex *port_attr);
>> +     int (*lib_query_port_ex)(struct ibv_context *context, uint8_t port_num,
>> +                               struct ibv_port_attr_ex *port_attr);
>>       struct ibv_ah * (*drv_ibv_create_ah_ex)(struct ibv_pd *pd,
>>                                               struct ibv_ah_attr_ex *attr);
>>       int (*drv_ibv_destroy_flow) (struct ibv_flow *flow);
>
> I'm not sure what Roland decided to merge, but I am surprised to see
> drv_ elements here. That was nix'd in the round of review of the first
> patch set. eg create_qp_ex/etc don't have them.
>
> The flow is such that the verbs code has a chance to capture and
> override the function pointer after the driver sets it, there is no
> purpose to the drv_ values.
>
> I wouldn't like to see more added, and I think you should make a patch
> to ensure they are not necessary before it propogates too far.

Sorry, I was not aware of the feeling on drv_XXX members here and I
don't think I saw any review comments about them.  (Maybe they got
lost in the flood of "merge it merge it merge it" emails)

Anyway I'm wondering if there's a way to recover without breaking ABI
here (or by breaking ABI in a manageable way).  The only library using
this stuff (that I know of) is libmlx4, maybe we can spin quick
releases of both and pretend libibverbs 1.1.8 never happened?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]             ` <CAL1RGDUiBtOaYKZVR3ghOZzG6J0p5EV5o4FTTW0E43mHz-BaVQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-05-09 18:10               ` Jason Gunthorpe
       [not found]                 ` <20140509181043.GC18257-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2014-05-09 18:10 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb,
	Yishai Hadas, Doug Ledford

On Fri, May 09, 2014 at 07:01:36AM -0700, Roland Dreier wrote:
> On Thu, May 8, 2014 at 12:09 PM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> >>  struct verbs_context {
> >>       /*  "grows up" - new fields go here */
> >> +     int (*drv_query_port_ex)(struct ibv_context *context, uint8_t port_num,
> >> +                              struct ibv_port_attr_ex *port_attr);
> >> +     int (*lib_query_port_ex)(struct ibv_context *context, uint8_t port_num,
> >> +                               struct ibv_port_attr_ex *port_attr);
> >>       struct ibv_ah * (*drv_ibv_create_ah_ex)(struct ibv_pd *pd,
> >>                                               struct ibv_ah_attr_ex *attr);
> >>       int (*drv_ibv_destroy_flow) (struct ibv_flow *flow);
> >
> > I'm not sure what Roland decided to merge, but I am surprised to see
> > drv_ elements here. That was nix'd in the round of review of the first
> > patch set. eg create_qp_ex/etc don't have them.
> >
> > The flow is such that the verbs code has a chance to capture and
> > override the function pointer after the driver sets it, there is no
> > purpose to the drv_ values.
> >
> > I wouldn't like to see more added, and I think you should make a patch
> > to ensure they are not necessary before it propogates too far.
> 
> Sorry, I was not aware of the feeling on drv_XXX members here and I
> don't think I saw any review comments about them.  (Maybe they got
> lost in the flood of "merge it merge it merge it" emails)

To be honest, I never bothered looking at any patches beyond the core
extension patches. There was no indication from you that they would
ever be merged, so it didn't feel like good use of my time.

> Anyway I'm wondering if there's a way to recover without breaking ABI
> here (or by breaking ABI in a manageable way).  The only library using
> this stuff (that I know of) is libmlx4, maybe we can spin quick
> releases of both and pretend libibverbs 1.1.8 never happened?

I think the best approach is to rescind the last libmlx4 version so it
never gets used then:
  - Rename drv_* to _reserved_X
  - Drop all references to drv_* from libverbs
  - Have libmlx4 set the ib_ pointer only

For the other problem
  - Replace VERBS_CONTEXT_CREATE_FLOW and VERBS_CONTEXT_DESTROY_FLOW
    with _RESERVED_x
  - Drop the references from mlx4

These are not a big deal as long as things are fixed quickly before
the bad libmlx4 gets widely used. Then we could reclaim the reserved_X
entires someday.

The biggest issue I see is that this is creating an anti-pattern, so
we need to stamp that out in the verbs source.

I expect Or will get right on this..

Or, it would be helpful to me if you could go back to libibverbs
commit cbf2a35162afcc9e97870b7b18d6477133a8dfa2 and post the corrected
flow steering patches with the ABI/API change as a distinct patch. I
think I caught everything, but lets also correct that process error
and hopefully Sean/etc can comment too.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]                 ` <20140509181043.GC18257-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-05-09 18:20                   ` Roland Dreier
       [not found]                     ` <CAL1RGDVmeTKz1TXGLP+h4t9ffuaVeBCkWKPC2AuFyygcNweRWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-05-11 12:35                   ` Or Gerlitz
  1 sibling, 1 reply; 20+ messages in thread
From: Roland Dreier @ 2014-05-09 18:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb,
	Yishai Hadas, Doug Ledford

On Fri, May 9, 2014 at 11:10 AM, Jason Gunthorpe
<jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> To be honest, I never bothered looking at any patches beyond the core
> extension patches. There was no indication from you that they would
> ever be merged, so it didn't feel like good use of my time.

Haven't the verbs extensions patches been in libibverbs since last
summer?  It's only these flow steering patches that got merged in the
last couple of weeks.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* RE: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]                     ` <CAL1RGDVmeTKz1TXGLP+h4t9ffuaVeBCkWKPC2AuFyygcNweRWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-05-09 18:39                       ` Hefty, Sean
  2014-05-09 18:40                       ` Jason Gunthorpe
  1 sibling, 0 replies; 20+ messages in thread
From: Hefty, Sean @ 2014-05-09 18:39 UTC (permalink / raw)
  To: Roland Dreier, Jason Gunthorpe
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb,
	Yishai Hadas, Doug Ledford

> > To be honest, I never bothered looking at any patches beyond the core
> > extension patches. There was no indication from you that they would
> > ever be merged, so it didn't feel like good use of my time.
> 
> Haven't the verbs extensions patches been in libibverbs since last
> summer?  It's only these flow steering patches that got merged in the
> last couple of weeks.

Verb extensions weren't merged until the latter half of December.

Jason's proposal to revert the changes seem reasonable to me.

- Sean

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]                     ` <CAL1RGDVmeTKz1TXGLP+h4t9ffuaVeBCkWKPC2AuFyygcNweRWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-05-09 18:39                       ` Hefty, Sean
@ 2014-05-09 18:40                       ` Jason Gunthorpe
  1 sibling, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2014-05-09 18:40 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, matanb,
	Yishai Hadas, Doug Ledford

On Fri, May 09, 2014 at 11:20:46AM -0700, Roland Dreier wrote:
> On Fri, May 9, 2014 at 11:10 AM, Jason Gunthorpe
> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
> > To be honest, I never bothered looking at any patches beyond the core
> > extension patches. There was no indication from you that they would
> > ever be merged, so it didn't feel like good use of my time.
> 
> Haven't the verbs extensions patches been in libibverbs since last
> summer?  It's only these flow steering patches that got merged in the
> last couple of weeks.

Winter, I think:
 Committer: Roland Dreier <roland-BHEL68pLQRGGvPXPguhicg@public.gmane.org>  2013-12-16 12:06:50

and I missed your mailing list posting accepting the patch, wasn't
cc'd.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]                 ` <20140509181043.GC18257-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2014-05-09 18:20                   ` Roland Dreier
@ 2014-05-11 12:35                   ` Or Gerlitz
       [not found]                     ` <536F6EAC.8020109-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2014-05-11 12:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Roland Dreier, Matan Barak
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas, Doug Ledford,
	Tzahi Oved

On 09/05/2014 21:10, Jason Gunthorpe wrote:
> On Fri, May 09, 2014 at 07:01:36AM -0700, Roland Dreier wrote:
>> On Thu, May 8, 2014 at 12:09 PM, Jason Gunthorpe
>> <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:
>>>>   struct verbs_context {
>>>>        /*  "grows up" - new fields go here */
>>>> +     int (*drv_query_port_ex)(struct ibv_context *context, uint8_t port_num,
>>>> +                              struct ibv_port_attr_ex *port_attr);
>>>> +     int (*lib_query_port_ex)(struct ibv_context *context, uint8_t port_num,
>>>> +                               struct ibv_port_attr_ex *port_attr);
>>>>        struct ibv_ah * (*drv_ibv_create_ah_ex)(struct ibv_pd *pd,
>>>>                                                struct ibv_ah_attr_ex *attr);
>>>>        int (*drv_ibv_destroy_flow) (struct ibv_flow *flow);
>>> I'm not sure what Roland decided to merge, but I am surprised to see
>>> drv_ elements here. That was nix'd in the round of review of the first
>>> patch set. eg create_qp_ex/etc don't have them.
>>>
>>> The flow is such that the verbs code has a chance to capture and
>>> override the function pointer after the driver sets it, there is no
>>> purpose to the drv_ values.
>>>
>>> I wouldn't like to see more added, and I think you should make a patch
>>> to ensure they are not necessary before it propogates too far.
>> Sorry, I was not aware of the feeling on drv_XXX members here and I
>> don't think I saw any review comments about them.  (Maybe they got
>> lost in the flood of "merge it merge it merge it" emails)
> To be honest, I never bothered looking at any patches beyond the core extension patches. There was no indication from you that they would ever be merged, so it didn't feel like good use of my time.

Jason, I am clueless on what you are talking (please read below), I 
think you are mixing things (see the DD comment below too)

>
>> Anyway I'm wondering if there's a way to recover without breaking ABI here (or by breaking ABI in a manageable way).  The only library using this stuff (that I know of) is libmlx4, maybe we can spin quick releases of both and pretend libibverbs 1.1.8 never happened?

Roland, Jason,

To clarify, recall that there are few faces/pieces involved in the 
extensions toy, specifically, things that relate to libibverbs/uverbs 
interaction, those who relate to libibverbs/application  and the ones 
that deal with libibverbs/vendor plugin library.

So, the XRC patches that were around here for months if not years only 
dealt with the libibverbs/uverbs part of things.

The Flow-steering kernel patches have nothing to do with extensions, and 
here (e.g the UD related ones), is where you -- Roland got few 
reminders, after you left them untouched on the floor for long time.. No 
budge or nudge was sent to you on the user space (next item) flow 
steering patches  which were posted on Feb 6th, this year, so please 
take care with DD (Devil/Details) dealing.

> I think the best approach is to rescind the last libmlx4 version so it
> never gets used then:
>    - Rename drv_* to _reserved_X
>    - Drop all references to drv_* from libverbs
>    - Have libmlx4 set the ib_ pointer only
>
> For the other problem
>    - Replace VERBS_CONTEXT_CREATE_FLOW and VERBS_CONTEXT_DESTROY_FLOW
>      with _RESERVED_x
>    - Drop the references from mlx4
>
> These are not a big deal as long as things are fixed quickly before the bad libmlx4 gets widely used. Then we could reclaim the reserved_X entires someday.
>
> The biggest issue I see is that this is creating an anti-pattern, so we need to stamp that out in the verbs source.
>
> I expect Or will get right on this..
>
> Or, it would be helpful to me if you could go back to libibverbs commit cbf2a35162a [...] and post the corrected flow steering patches with the ABI/API change as a distinct patch. I think I caught everything, but lets also correct that process error and hopefully Sean/etc can comment too.
>

So I understand that we need to remove VERBS_CONTEXT_CREATE_FLOW and 
VERBS_CONTEXT_DESTROY_FLOW and it makes sense to follow your suggestion 
of replacing them with _RESERVED_x

As for the drv_ entries, that was mine/Matan's interpretation of the 
architecture, Matan is checking this with Tzahi Oved, the founder of 
this concept and will be sending a response early this week.

I'm not sure we need to roll back to  commit cbf2a35162a and start from 
there, but if people feel this is the right thing to do, let it be.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]                     ` <536F6EAC.8020109-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-05-12  6:35                       ` Jason Gunthorpe
  0 siblings, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2014-05-12  6:35 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, Matan Barak, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Yishai Hadas, Doug Ledford, Tzahi Oved

On Sun, May 11, 2014 at 03:35:56PM +0300, Or Gerlitz wrote:

> Jason, I am clueless on what you are talking (please read below), I
> think you are mixing things (see the DD comment below too)

Way back at the start, in 2011, when Liran posted this - the first
patch set of this concept did not reorganize the driver startup
process so it needed to have two function pointers to allow libverbs a
chance to hook things.

Instead we re-organized the startup process so that verbs allocates
the context entirely and calls into the driver to set it up. This
allows verbs to do whatever it wants to the function pointers,
including copying the driver version to verbs-private-memory and
overriding with it's own - should that be necessary.

For this reason the drv_/ib_ scheme was entirely removed.

The patches that Tzahi/Sean/etc posted to implement XRC did not use
it, it hasn't been part of the concept for 3 years. Not sure where
you+Matan got the idea from.

> >Or, it would be helpful to me if you could go back to libibverbs
> >commit cbf2a35162a [...] and post the corrected flow steering
> >patches with the ABI/API change as a distinct patch. I think I
> >caught everything, but lets also correct that process error and
> >hopefully Sean/etc can comment too.

> I'm not sure we need to roll back to  commit cbf2a35162a and start
> from there, but if people feel this is the right thing to do, let it
> be.

I would like to see at the very least a clean patch, introducing only
the API for flow steering, that follows the procedure I suggested.

This is to show we are all on the same page. I'm not suggesting
rolling back Roland's tree, just that you go back to that point and
re-issue the patches as a learning exercise for us all.

Follow up with a correction patch to Roland's tip.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]         ` <20140508190951.GB25849-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  2014-05-08 19:40           ` Christoph Lameter
  2014-05-09 14:01           ` Roland Dreier
@ 2014-05-12 12:22           ` Matan Barak
       [not found]             ` <5370BCF2.7050107-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2 siblings, 1 reply; 20+ messages in thread
From: Matan Barak @ 2014-05-12 12:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, dledford-H+wXaHxf7aLQT0dZR+AlfA

On 8/5/2014 10:09 PM, Jason Gunthorpe wrote:
> On Thu, May 08, 2014 at 09:51:23AM +0300, Or Gerlitz wrote:
>
>> +struct ibv_port_attr_ex {
>> +	union {
>> +		struct {
>> +			enum ibv_port_state	state;
>> +			enum ibv_mtu		max_mtu;
>> +			enum ibv_mtu		active_mtu;
>> +			int			gid_tbl_len;
>> +			uint32_t		port_cap_flags;
>> +			uint32_t		max_msg_sz;
>> +			uint32_t		bad_pkey_cntr;
>> +			uint32_t		qkey_viol_cntr;
>> +			uint16_t		pkey_tbl_len;
>> +			uint16_t		lid;
>> +			uint16_t		sm_lid;
>> +			uint8_t			lmc;
>> +			uint8_t			max_vl_num;
>> +			uint8_t			sm_sl;
>> +			uint8_t			subnet_timeout;
>> +			uint8_t			init_type_reply;
>> +			uint8_t			active_width;
>> +			uint8_t			active_speed;
>> +			uint8_t			phys_state;
>> +			uint8_t			link_layer;
>> +			uint8_t			reserved;
>> +		};
>> +		struct ibv_port_attr		port_attr;
>> +	};
>> +	uint32_t		comp_mask;
>> +	uint32_t		mask1;
>> +};
>
> I really don't like this deviation from the standard _ex
> pattern.
>
> Anonymous struct/union is a C11 feature and GNU extension. I don't
> think is appropriate for a user library to rely on. We cannot assume
> the user is has a compiler in C11 mode.
>
> The cannonical layout should be:
>
> struct ibv_port_attr_ex {
>     uint64_t comp_mask;
>     struct ibv_port_attr	port_attr;
>     // New stuff goes here
> }
>

It's not mandatory, but I think it could prevent future breakages and 
inconsistencies. Anyway, anonymous unions are supported in icc 9.2+, 
clang 3.1+ and of course gcc. However, I'll remove this in the next 
version of this series.

> It is fine to use comp_mask to indicate all the fields, no need for
> the weird mask1, lets not over complicate things for users.

I don't think that's the right thing to do. According to my 
understanding, the prefix of the extension verb is fully compatible with 
the old structure. Afterwards, comp_mask has a bit for every new field. 
mask1 is a new field that indicates which of the "old" values the user 
is interested in. If we want to be strict with the extension verbs 
definition, lets keep it all the way.

>
>>   struct verbs_context {
>>   	/*  "grows up" - new fields go here */
>> +	int (*drv_query_port_ex)(struct ibv_context *context, uint8_t port_num,
>> +				 struct ibv_port_attr_ex *port_attr);
>> +	int (*lib_query_port_ex)(struct ibv_context *context, uint8_t port_num,
>> +				  struct ibv_port_attr_ex *port_attr);
>>   	struct ibv_ah * (*drv_ibv_create_ah_ex)(struct ibv_pd *pd,
>>   						struct ibv_ah_attr_ex *attr);
>>   	int (*drv_ibv_destroy_flow) (struct ibv_flow *flow);
>
> I'm not sure what Roland decided to merge, but I am surprised to see
> drv_ elements here. That was nix'd in the round of review of the first
> patch set. eg create_qp_ex/etc don't have them.
>
> The flow is such that the verbs code has a chance to capture and
> override the function pointer after the driver sets it, there is no
> purpose to the drv_ values.
>
> I wouldn't like to see more added, and I think you should make a patch
> to ensure they are not necessary before it propogates too far.
>

I'll remove the drv_query_port_ex function and rename  lib_query_port_ex 
to query_port_ex and drv_ibv_create_ah_ex to create_ah_ex.

>> diff --git a/src/verbs.c b/src/verbs.c
>> index e7343a9..f8245b0 100644
>> +++ b/src/verbs.c
>> @@ -550,7 +550,7 @@ struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
>>   	int err;
>>   	struct ibv_ah *ah = NULL;
>>   #ifndef NRESOLVE_NEIGH
>> -	struct ibv_port_attr port_attr;
>> +	struct ibv_port_attr_ex port_attr;
>>   	int dst_family;
>>   	int src_family;
>>   	int oif;
>> @@ -570,7 +570,10 @@ struct ibv_ah *__ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
>>   		goto return_ah;
>>   	}
>>
>> -	err = ibv_query_port(pd->context, attr->port_num, &port_attr);
>> +	port_attr.comp_mask = IBV_QUERY_PORT_EX_ATTR_MASK1;
>> +	port_attr.mask1 = IBV_QUERY_PORT_EX_LINK_LAYER |
>> +			  IBV_QUERY_PORT_EX_CAP_FLAGS;
>> +	err = ibv_query_port_ex(pd->context, attr->port_num, &port_attr);
>>
>>   	if (err) {
>>   		fprintf(stderr, PFX "ibv_create_ah failed to query port.\n");
>
> I wonder if this belongs in a seperate patch, I had to read it twice
> to realize this change was to take advantage of the reduced query
> performance optimization.

Thanks, I'll move that to a different patch.

>
> Jason
>

Thanks for the comments,
Matan
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]             ` <5370BCF2.7050107-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-05-12 16:43               ` Jason Gunthorpe
       [not found]                 ` <20140512164323.GA20666-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Jason Gunthorpe @ 2014-05-12 16:43 UTC (permalink / raw)
  To: Matan Barak
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	yishaih-VPRAkNaXOzVWk0Htik3J/w, dledford-H+wXaHxf7aLQT0dZR+AlfA

On Mon, May 12, 2014 at 03:22:10PM +0300, Matan Barak wrote:

> >Anonymous struct/union is a C11 feature and GNU extension. I don't
> >think is appropriate for a user library to rely on. We cannot assume
> >the user is has a compiler in C11 mode.
> >
> >The cannonical layout should be:
> >
> >struct ibv_port_attr_ex {
> >    uint64_t comp_mask;
> >    struct ibv_port_attr	port_attr;
> >    // New stuff goes here
> >}
> >
> 
> It's not mandatory, but I think it could prevent future breakages
> and inconsistencies. Anyway, anonymous unions are supported in icc
> 9.2+, clang 3.1+ and of course gcc. However, I'll remove this in the
> next version of this series.

Many compilers support C11, but if the user chooses to compile with,
say, -std=c99 then the feature goes away. System level headers should
never require the user to use a specific -std.

You can also inline the legacy port_attr into the struct, which might
be the best option here.

> >It is fine to use comp_mask to indicate all the fields, no need for
> >the weird mask1, lets not over complicate things for users.
> 
> I don't think that's the right thing to do. According to my
> understanding, the prefix of the extension verb is fully compatible
> with the old structure.

This is not necessary in this case, it is convient that the old
structure exist within the extended structure so that it is easy for
the wrapper to call an old function without having to munge the
function pointers, but that copy can be in any place.

> Afterwards, comp_mask has a bit for every new field. mask1 is a new
> field that indicates which of the "old" values the user is
> interested in. If we want to be strict with the extension verbs
> definition, lets keep it all the way.

comp_mask has bits for fields, that is all it does, the distinction
between 'new' and 'old' is meaningless at this point. Just inline the
bits in comp_mask.

Generally, it would be smart to limit the number of comp_mask bits
used so we don't run out in the future, so in general, a new extension
can omit bits for fields that exist when the extension is introduced.

However, in this case there is a functional purpose to having the
fields have bits, so you should just go ahead and include them
directly.

Adding more masks is just going to be confusing to users. Remember,
there is no static type checking, so the user has to know that
IBV_QUERY_PORT_EX_LINK_LAYER is associated with mask1, while
IBV_QUERY_PORT_EX_ATTR_MASK1 is somehow associated with comp_mask -
and they sure do look similar, and it is counter-intuitive compared to
other cases in the library.

BTW, before I forget, the patch that introduces the API must also
include a man page for it.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]                 ` <20140512164323.GA20666-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2014-05-12 21:57                   ` Roland Dreier
       [not found]                     ` <CAL1RGDXYwPHrS7dSWg0ojyiPG7TF1Y0800q801kWvMxoyn3c8g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
       [not found]                     ` <23840589.5954.1399987081862.JavaMail."DougLedford"@Phenom>
  0 siblings, 2 replies; 20+ messages in thread
From: Roland Dreier @ 2014-05-12 21:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Matan Barak, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Yishai Hadas, Doug Ledford

> BTW, before I forget, the patch that introduces the API must also
> include a man page for it.

Jason, care to send a patch for the README or whatever with the rules? :)
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]                     ` <CAL1RGDXYwPHrS7dSWg0ojyiPG7TF1Y0800q801kWvMxoyn3c8g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-05-13 13:18                       ` Doug Ledford
  0 siblings, 0 replies; 20+ messages in thread
From: Doug Ledford @ 2014-05-13 13:18 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Matan Barak, Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Yishai Hadas, Jason Gunthorpe

----- Original Message -----
> > BTW, before I forget, the patch that introduces the API must also
> > include a man page for it.
> 
> Jason, care to send a patch for the README or whatever with the
> rules? :)

README.Extensions_API_and_Guidelines ?  I wouldn't bury it in the
generic README, people won't know to look there when working
specifically on an extension.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD
	      http://people.redhat.com/dledford

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support
       [not found]                     ` <23840589.5954.1399987081862.JavaMail."DougLedford"@Phenom>
@ 2014-05-13 20:02                       ` Jason Gunthorpe
  0 siblings, 0 replies; 20+ messages in thread
From: Jason Gunthorpe @ 2014-05-13 20:02 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Roland Dreier, Matan Barak, Or Gerlitz,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas

On Tue, May 13, 2014 at 09:18:01AM -0400, Doug Ledford wrote:
> > > BTW, before I forget, the patch that introduces the API must also
> > > include a man page for it.
> > 
> > Jason, care to send a patch for the README or whatever with the
> > rules? :)
> 
> README.Extensions_API_and_Guidelines ?  I wouldn't bury it in the
> generic README, people won't know to look there when working
> specifically on an extension.

Right, let me see what I can find time for this week..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2014-05-13 20:02 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-08  6:51 [PATCH libibverbs V3 0/3] Add support for UD QPs under RoCE IP addressing Or Gerlitz
     [not found] ` <1399531883-30599-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-05-08  6:51   ` [PATCH libibverbs V3 1/3] Add ibv_port_cap_flags Or Gerlitz
2014-05-08  6:51   ` [PATCH libibverbs V3 2/3] Use neighbour lookup for RoCE UD QPs Eth L2 resolution Or Gerlitz
     [not found]     ` <1399531883-30599-3-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-05-08 19:29       ` Jason Gunthorpe
2014-05-08  6:51   ` [PATCH libibverbs V3 3/3] Add ibv_query_port_ex support Or Gerlitz
     [not found]     ` <1399531883-30599-4-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-05-08 19:09       ` Jason Gunthorpe
     [not found]         ` <20140508190951.GB25849-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2014-05-08 19:40           ` Christoph Lameter
     [not found]             ` <alpine.DEB.2.10.1405081439370.26267-gkYfJU5Cukgdnm+yROfE0A@public.gmane.org>
2014-05-08 19:47               ` Jason Gunthorpe
2014-05-09 14:01           ` Roland Dreier
     [not found]             ` <CAL1RGDUiBtOaYKZVR3ghOZzG6J0p5EV5o4FTTW0E43mHz-BaVQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-05-09 18:10               ` Jason Gunthorpe
     [not found]                 ` <20140509181043.GC18257-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2014-05-09 18:20                   ` Roland Dreier
     [not found]                     ` <CAL1RGDVmeTKz1TXGLP+h4t9ffuaVeBCkWKPC2AuFyygcNweRWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-05-09 18:39                       ` Hefty, Sean
2014-05-09 18:40                       ` Jason Gunthorpe
2014-05-11 12:35                   ` Or Gerlitz
     [not found]                     ` <536F6EAC.8020109-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-05-12  6:35                       ` Jason Gunthorpe
2014-05-12 12:22           ` Matan Barak
     [not found]             ` <5370BCF2.7050107-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-05-12 16:43               ` Jason Gunthorpe
     [not found]                 ` <20140512164323.GA20666-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2014-05-12 21:57                   ` Roland Dreier
     [not found]                     ` <CAL1RGDXYwPHrS7dSWg0ojyiPG7TF1Y0800q801kWvMxoyn3c8g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-05-13 13:18                       ` Doug Ledford
     [not found]                     ` <23840589.5954.1399987081862.JavaMail."DougLedford"@Phenom>
2014-05-13 20:02                       ` Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.