All of lore.kernel.org
 help / color / mirror / Atom feed
* [Patch v2 0/3] Fix GID lookup performance regression
@ 2013-09-24 21:16 Doug Ledford
       [not found] ` <cover.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Doug Ledford @ 2013-09-24 21:16 UTC (permalink / raw)
  To: Sean Hefty, Roland Dreier, Or Gerlitz, Amir Vadai, Eli Cohen,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Doug Ledford

A performance issue was identified at a customer's site.  This issue resulted
in failures in their production environment.  It was eventually determined
that the excessive amount of time it took the IB stack to look up GIDs was
the primary culprit behind the problems observed.  This patch series resolves
the issue.  The first patch is the primary resolution and the one that we
can't live without, while the other two patches make incremental improvements
and could be skipped if need be (although I tend to think they make sense).

Roland, this was a showstopper for our customer.  As a simple matter of
arithmetic, mlx4 hardware (dual port) has a 128-entry GID table, and
without this patch, if the GID you are looking for is on a second port,
or God forbid a second card, then the slowdown searching for the right
GID is entirely unlivable.

Doug Ledford (3):
  IB/cma: use cached gids
  IB/cma: Check for GID on listening device first
  IB/cache: don't fill the cache with junk

 drivers/infiniband/core/cache.c | 132 +++++++++++++++++++++++++++++++---------
 drivers/infiniband/core/cma.c   |  65 ++++++++++----------
 2 files changed, 136 insertions(+), 61 deletions(-)

-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [Patch v2 1/3] IB/cma: use cached gids
       [not found] ` <cover.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-09-24 21:16   ` Doug Ledford
       [not found]     ` <0dae5249b1f09936a2976ef910c022eecaf9a7fa.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-09-24 21:16   ` [Patch v2 2/3] IB/cma: Check for GID on listening device first Doug Ledford
  2013-09-24 21:16   ` [Patch v2 3/3] IB/cache: don't fill the cache with junk Doug Ledford
  2 siblings, 1 reply; 8+ messages in thread
From: Doug Ledford @ 2013-09-24 21:16 UTC (permalink / raw)
  To: Sean Hefty, Roland Dreier, Or Gerlitz, Amir Vadai, Eli Cohen,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Doug Ledford

The cma_acquire_dev function was changed by commit 3c86aa70bf67
to use find_gid_port because multiport devices might have
either IB or IBoE formatted gids.  The old function assumed that
all ports on the same device used the same GID format.  However,
when it was changed to use find_gid_port, we inadvertently lost
usage of the GID cache.  This turned out to be a very costly
change.  In our testing, each iteration through each index of
the GID table takes roughly 35us.  When you have multiple
devices in a system, and the GID you are looking for is on one
of the later devices, the code loops through all of the GID
indexes on all of the early devices before it finally succeeds
on the target device.  This pathological search behavior combined
with 35us per GID table index retrieval produces numbers such
as the following from the cmtime application that's part of the
latest librdmacm git repo:

ib1:
step              total ms     max ms     min us  us / conn
create id    :       29.42       0.04       1.00       2.94
bind addr    :   186705.66      19.00   18556.00   18670.57
resolve addr :       41.93       9.68     619.00       4.19
resolve route:      486.93       0.48     101.00      48.69
create qp    :     4021.95       6.18     330.00     402.20
connect      :    68350.39   68588.17   24632.00    6835.04
disconnect   :     1460.43     252.65 -1862269.00     146.04
destroy      :       41.16       0.04       2.00       4.12

ib0:
step              total ms     max ms     min us  us / conn
create id    :       28.61       0.68       1.00       2.86
bind addr    :     2178.86       2.95     201.00     217.89
resolve addr :       51.26      16.85     845.00       5.13
resolve route:      620.08       0.43      92.00      62.01
create qp    :     3344.40       6.36     273.00     334.44
connect      :     6435.99    6368.53    7844.00     643.60
disconnect   :     5095.38     321.90     757.00     509.54
destroy      :       37.13       0.02       2.00       3.71

Clearly, both the bind address and connect operations suffer
a huge penalty for being anything other than the default
GID on the first port in the system.

After applying this patch, the numbers now look like this:

ib1:
step              total ms     max ms     min us  us / conn
create id    :       30.15       0.03       1.00       3.01
bind addr    :       80.27       0.04       7.00       8.03
resolve addr :       43.02      13.53     589.00       4.30
resolve route:      482.90       0.45     100.00      48.29
create qp    :     3986.55       5.80     330.00     398.66
connect      :     7141.53    7051.29    5005.00     714.15
disconnect   :     5038.85     193.63     918.00     503.88
destroy      :       37.02       0.04       2.00       3.70

ib0:
step              total ms     max ms     min us  us / conn
create id    :       34.27       0.05       1.00       3.43
bind addr    :       26.45       0.04       1.00       2.64
resolve addr :       38.25      10.54     760.00       3.82
resolve route:      604.79       0.43      97.00      60.48
create qp    :     3314.95       6.34     273.00     331.49
connect      :    12399.26   12351.10    8609.00    1239.93
disconnect   :     5096.76     270.72    1015.00     509.68
destroy      :       37.10       0.03       2.00       3.71

It's worth noting that we still suffer a bit of a penalty on
connect to the wrong device, but the penalty is much less than
it used to be.  Follow on patches deal with this penalty.

Many thanks to Neil Horman for helping to track down the slow
function, which allowed us to determine that the original patch
mentioned above had backed out the cache usage and to identify
just how much that impacted the system.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/core/cma.c | 35 ++++++-----------------------------
 1 file changed, 6 insertions(+), 29 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index dab4b41..c62ff9e 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -328,28 +328,6 @@ static int cma_set_qkey(struct rdma_id_private *id_priv, u32 qkey)
 	return ret;
 }
 
-static int find_gid_port(struct ib_device *device, union ib_gid *gid, u8 port_num)
-{
-	int i;
-	int err;
-	struct ib_port_attr props;
-	union ib_gid tmp;
-
-	err = ib_query_port(device, port_num, &props);
-	if (err)
-		return err;
-
-	for (i = 0; i < props.gid_tbl_len; ++i) {
-		err = ib_query_gid(device, port_num, i, &tmp);
-		if (err)
-			return err;
-		if (!memcmp(&tmp, gid, sizeof tmp))
-			return 0;
-	}
-
-	return -EADDRNOTAVAIL;
-}
-
 static void cma_translate_ib(struct sockaddr_ib *sib, struct rdma_dev_addr *dev_addr)
 {
 	dev_addr->dev_type = ARPHRD_INFINIBAND;
@@ -377,7 +355,7 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 	struct cma_device *cma_dev;
 	union ib_gid gid, iboe_gid;
 	int ret = -ENODEV;
-	u8 port;
+	u8 port, found_port;
 	enum rdma_link_layer dev_ll = dev_addr->dev_type == ARPHRD_INFINIBAND ?
 		IB_LINK_LAYER_INFINIBAND : IB_LINK_LAYER_ETHERNET;
 
@@ -390,20 +368,19 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 	memcpy(&gid, dev_addr->src_dev_addr +
 	       rdma_addr_gid_offset(dev_addr), sizeof gid);
 	list_for_each_entry(cma_dev, &dev_list, list) {
-		for (port = 1; port <= cma_dev->device->phys_port_cnt; ++port) {
+		for (port = 1; port <= cma_dev->device->phys_port_cnt; ++port)
 			if (rdma_port_get_link_layer(cma_dev->device, port) == dev_ll) {
 				if (rdma_node_get_transport(cma_dev->device->node_type) == RDMA_TRANSPORT_IB &&
 				    rdma_port_get_link_layer(cma_dev->device, port) == IB_LINK_LAYER_ETHERNET)
-					ret = find_gid_port(cma_dev->device, &iboe_gid, port);
+					ret = ib_find_cached_gid(cma_dev->device, &iboe_gid, &found_port, NULL);
 				else
-					ret = find_gid_port(cma_dev->device, &gid, port);
+					ret = ib_find_cached_gid(cma_dev->device, &gid, &found_port, NULL);
 
-				if (!ret) {
-					id_priv->id.port_num = port;
+				if (!ret && (port == found_port)) {
+					id_priv->id.port_num = found_port;
 					goto out;
 				}
 			}
-		}
 	}
 
 out:
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [Patch v2 2/3] IB/cma: Check for GID on listening device first
       [not found] ` <cover.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-09-24 21:16   ` [Patch v2 1/3] IB/cma: use cached gids Doug Ledford
@ 2013-09-24 21:16   ` Doug Ledford
  2013-09-24 21:16   ` [Patch v2 3/3] IB/cache: don't fill the cache with junk Doug Ledford
  2 siblings, 0 replies; 8+ messages in thread
From: Doug Ledford @ 2013-09-24 21:16 UTC (permalink / raw)
  To: Sean Hefty, Roland Dreier, Or Gerlitz, Amir Vadai, Eli Cohen,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Doug Ledford

As a simple optimization that should speed up the vast majority of
connect attempts on IB devices, when searching for the GID of an
incoming connection in the devices' cached GID lists, search the
device that received the incoming connection request first.  Only if
we don't find it there do we move on to other devices.

This reduces the time to perform 10,000 connections considerably.
Prior to this patch, a bad run of cmtime would look like this:

connect      :    12399.26   12351.10    8609.00    1239.93

With this patch, it looks more like this:

connect      :     5864.86    5799.80    8876.00     586.49

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/core/cma.c | 36 ++++++++++++++++++++++++++++++------
 1 file changed, 30 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index c62ff9e..dadc486 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -349,7 +349,8 @@ static int cma_translate_addr(struct sockaddr *addr, struct rdma_dev_addr *dev_a
 	return ret;
 }
 
-static int cma_acquire_dev(struct rdma_id_private *id_priv)
+static int cma_acquire_dev(struct rdma_id_private *id_priv,
+			   struct rdma_id_private *listen_id_priv)
 {
 	struct rdma_dev_addr *dev_addr = &id_priv->id.route.addr.dev_addr;
 	struct cma_device *cma_dev;
@@ -367,8 +368,30 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 	iboe_addr_get_sgid(dev_addr, &iboe_gid);
 	memcpy(&gid, dev_addr->src_dev_addr +
 	       rdma_addr_gid_offset(dev_addr), sizeof gid);
+	if (listen_id_priv &&
+	    rdma_port_get_link_layer(listen_id_priv->id.device,
+				     listen_id_priv->id.port_num) == dev_ll) {
+		cma_dev = listen_id_priv->cma_dev;
+		port = listen_id_priv->id.port_num;
+		if (rdma_node_get_transport(cma_dev->device->node_type) == RDMA_TRANSPORT_IB &&
+		    rdma_port_get_link_layer(cma_dev->device, port) == IB_LINK_LAYER_ETHERNET)
+			ret = ib_find_cached_gid(cma_dev->device, &iboe_gid,
+						 &found_port, NULL);
+		else
+			ret = ib_find_cached_gid(cma_dev->device, &gid,
+						 &found_port, NULL);
+
+		if (!ret && (port == found_port)) {
+			id_priv->id.port_num = found_port;
+			goto out;
+		}
+	}
 	list_for_each_entry(cma_dev, &dev_list, list) {
-		for (port = 1; port <= cma_dev->device->phys_port_cnt; ++port)
+		for (port = 1; port <= cma_dev->device->phys_port_cnt; ++port) {
+			if (listen_id_priv &&
+			    listen_id_priv->cma_dev == cma_dev &&
+			    listen_id_priv->id.port_num == port)
+				continue;
 			if (rdma_port_get_link_layer(cma_dev->device, port) == dev_ll) {
 				if (rdma_node_get_transport(cma_dev->device->node_type) == RDMA_TRANSPORT_IB &&
 				    rdma_port_get_link_layer(cma_dev->device, port) == IB_LINK_LAYER_ETHERNET)
@@ -381,6 +404,7 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 					goto out;
 				}
 			}
+		}
 	}
 
 out:
@@ -1269,7 +1293,7 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event)
 	}
 
 	mutex_lock_nested(&conn_id->handler_mutex, SINGLE_DEPTH_NESTING);
-	ret = cma_acquire_dev(conn_id);
+	ret = cma_acquire_dev(conn_id, listen_id);
 	if (ret)
 		goto err2;
 
@@ -1458,7 +1482,7 @@ static int iw_conn_req_handler(struct iw_cm_id *cm_id,
 		goto out;
 	}
 
-	ret = cma_acquire_dev(conn_id);
+	ret = cma_acquire_dev(conn_id, listen_id);
 	if (ret) {
 		mutex_unlock(&conn_id->handler_mutex);
 		rdma_destroy_id(new_cm_id);
@@ -2027,7 +2051,7 @@ static void addr_handler(int status, struct sockaddr *src_addr,
 		goto out;
 
 	if (!status && !id_priv->cma_dev)
-		status = cma_acquire_dev(id_priv);
+		status = cma_acquire_dev(id_priv, NULL);
 
 	if (status) {
 		if (!cma_comp_exch(id_priv, RDMA_CM_ADDR_RESOLVED,
@@ -2524,7 +2548,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 		if (ret)
 			goto err1;
 
-		ret = cma_acquire_dev(id_priv);
+		ret = cma_acquire_dev(id_priv, NULL);
 		if (ret)
 			goto err1;
 	}
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [Patch v2 3/3] IB/cache: don't fill the cache with junk
       [not found] ` <cover.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2013-09-24 21:16   ` [Patch v2 1/3] IB/cma: use cached gids Doug Ledford
  2013-09-24 21:16   ` [Patch v2 2/3] IB/cma: Check for GID on listening device first Doug Ledford
@ 2013-09-24 21:16   ` Doug Ledford
       [not found]     ` <4c88e00f5211787a98fa980a4d42c5c6374ab868.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2 siblings, 1 reply; 8+ messages in thread
From: Doug Ledford @ 2013-09-24 21:16 UTC (permalink / raw)
  To: Sean Hefty, Roland Dreier, Or Gerlitz, Amir Vadai, Eli Cohen,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Doug Ledford

We keep a cache of the GIDs and PKeys on an IB device, but when we get
the props for the card, the table length returned is the overall table
size and not related to how many valid entries there are in the table.
As a result, we end up with things like a 128-entry GID table that holds
3 valid GIDs and 125 empty ones.  Then when we call find_cached_gid, we
search through all 128 GID cache entries using memcmp.  This is
needlessly expensive.  So when we update the cache, check each item we
get from the card to see if it's valid and only save it into the cache
if it is.  We make sure to preserve the index from the card's table with
each cache item so the correct index in the card's table is returned
on any lookup that requests the index.

I have performance numbers for this change, but they aren't really
conclusive.  I made this change after the previous two patches, and
while it is conceptually obvious that searching 3 or 4 GIDs is better
than searching through 128 mostly empty GIDs using memcmp, in practice
we normally find our result quickly, so the benefit of this change is
not seen, at least not by the cmtime application.  In addition, the
immediately preceding patch optimized the connect routine's selection
of which device to search first, again hiding the benefit of this
change.  Demonstrating the benefit of this patch clearly would require
a very complex setup with lots of valid GIDs and connect attempts that
access GIDs late in the table, and the cmtime utility from librdmacm
does not do that.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/core/cache.c | 132 +++++++++++++++++++++++++++++++---------
 1 file changed, 103 insertions(+), 29 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 80f6cf2..d31972d 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -44,12 +44,18 @@
 
 struct ib_pkey_cache {
 	int             table_len;
-	u16             table[0];
+	struct pkey_cache_table_entry {
+		u16             pkey;
+		unsigned char	index;
+	}		entry[0];
 };
 
 struct ib_gid_cache {
 	int             table_len;
-	union ib_gid    table[0];
+	struct gid_cache_table_entry {
+		union ib_gid    gid;
+		unsigned char	index;
+	}		entry[0];
 };
 
 struct ib_update_work {
@@ -76,7 +82,7 @@ int ib_get_cached_gid(struct ib_device *device,
 {
 	struct ib_gid_cache *cache;
 	unsigned long flags;
-	int ret = 0;
+	int i, ret = 0;
 
 	if (port_num < start_port(device) || port_num > end_port(device))
 		return -EINVAL;
@@ -85,13 +91,26 @@ int ib_get_cached_gid(struct ib_device *device,
 
 	cache = device->cache.gid_cache[port_num - start_port(device)];
 
-	if (index < 0 || index >= cache->table_len)
+	if (index < 0 || index >= cache->table_len) {
 		ret = -EINVAL;
-	else
-		*gid = cache->table[index];
+		goto out_unlock;
+	}
 
-	read_unlock_irqrestore(&device->cache.lock, flags);
+	for (i = 0; i < cache->table_len; ++i)
+		if (cache->entry[i].index == index)
+			break;
+
+	if (i < cache->table_len)
+		*gid = cache->entry[i].gid;
+	else {
+		ret = ib_query_gid(device, port_num, index, gid);
+		if (ret)
+			printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
+			       ret, device->name, index);
+	}
 
+out_unlock:
+	read_unlock_irqrestore(&device->cache.lock, flags);
 	return ret;
 }
 EXPORT_SYMBOL(ib_get_cached_gid);
@@ -115,18 +134,18 @@ int ib_find_cached_gid(struct ib_device *device,
 	for (p = 0; p <= end_port(device) - start_port(device); ++p) {
 		cache = device->cache.gid_cache[p];
 		for (i = 0; i < cache->table_len; ++i) {
-			if (!memcmp(gid, &cache->table[i], sizeof *gid)) {
+			if (!memcmp(gid, &cache->entry[i].gid, sizeof *gid)) {
 				*port_num = p + start_port(device);
 				if (index)
-					*index = i;
+					*index = cache->entry[i].index;
 				ret = 0;
 				goto found;
 			}
 		}
 	}
+
 found:
 	read_unlock_irqrestore(&device->cache.lock, flags);
-
 	return ret;
 }
 EXPORT_SYMBOL(ib_find_cached_gid);
@@ -138,7 +157,7 @@ int ib_get_cached_pkey(struct ib_device *device,
 {
 	struct ib_pkey_cache *cache;
 	unsigned long flags;
-	int ret = 0;
+	int i, ret = 0;
 
 	if (port_num < start_port(device) || port_num > end_port(device))
 		return -EINVAL;
@@ -147,13 +166,22 @@ int ib_get_cached_pkey(struct ib_device *device,
 
 	cache = device->cache.pkey_cache[port_num - start_port(device)];
 
-	if (index < 0 || index >= cache->table_len)
+	if (index < 0 || index >= cache->table_len) {
 		ret = -EINVAL;
+		goto out_unlock;
+	}
+
+	for (i = 0; i < cache->table_len; ++i)
+		if (cache->entry[i].index == index)
+			break;
+
+	if (i < cache->table_len)
+		*pkey = cache->entry[i].pkey;
 	else
-		*pkey = cache->table[index];
+		*pkey = 0x0000;
 
+out_unlock:
 	read_unlock_irqrestore(&device->cache.lock, flags);
-
 	return ret;
 }
 EXPORT_SYMBOL(ib_get_cached_pkey);
@@ -179,13 +207,13 @@ int ib_find_cached_pkey(struct ib_device *device,
 	*index = -1;
 
 	for (i = 0; i < cache->table_len; ++i)
-		if ((cache->table[i] & 0x7fff) == (pkey & 0x7fff)) {
-			if (cache->table[i] & 0x8000) {
-				*index = i;
+		if ((cache->entry[i].pkey & 0x7fff) == (pkey & 0x7fff)) {
+			if (cache->entry[i].pkey & 0x8000) {
+				*index = cache->entry[i].index;
 				ret = 0;
 				break;
 			} else
-				partial_ix = i;
+				partial_ix = cache->entry[i].index;
 		}
 
 	if (ret && partial_ix >= 0) {
@@ -219,8 +247,8 @@ int ib_find_exact_cached_pkey(struct ib_device *device,
 	*index = -1;
 
 	for (i = 0; i < cache->table_len; ++i)
-		if (cache->table[i] == pkey) {
-			*index = i;
+		if (cache->entry[i].pkey == pkey) {
+			*index = cache->entry[i].index;
 			ret = 0;
 			break;
 		}
@@ -255,8 +283,10 @@ static void ib_cache_update(struct ib_device *device,
 	struct ib_port_attr       *tprops = NULL;
 	struct ib_pkey_cache      *pkey_cache = NULL, *old_pkey_cache;
 	struct ib_gid_cache       *gid_cache = NULL, *old_gid_cache;
-	int                        i;
+	int                        i, j;
 	int                        ret;
+	union ib_gid		   gid, empty_gid;
+	u16			   pkey;
 
 	tprops = kmalloc(sizeof *tprops, GFP_KERNEL);
 	if (!tprops)
@@ -270,35 +300,79 @@ static void ib_cache_update(struct ib_device *device,
 	}
 
 	pkey_cache = kmalloc(sizeof *pkey_cache + tprops->pkey_tbl_len *
-			     sizeof *pkey_cache->table, GFP_KERNEL);
+			     sizeof *pkey_cache->entry, GFP_KERNEL);
 	if (!pkey_cache)
 		goto err;
 
-	pkey_cache->table_len = tprops->pkey_tbl_len;
+	pkey_cache->table_len = 0;
 
 	gid_cache = kmalloc(sizeof *gid_cache + tprops->gid_tbl_len *
-			    sizeof *gid_cache->table, GFP_KERNEL);
+			    sizeof *gid_cache->entry, GFP_KERNEL);
 	if (!gid_cache)
 		goto err;
 
-	gid_cache->table_len = tprops->gid_tbl_len;
+	gid_cache->table_len = 0;
 
-	for (i = 0; i < pkey_cache->table_len; ++i) {
-		ret = ib_query_pkey(device, port, i, pkey_cache->table + i);
+	for (i = 0, j = 0; i < tprops->pkey_tbl_len; ++i) {
+		ret = ib_query_pkey(device, port, i, &pkey);
 		if (ret) {
 			printk(KERN_WARNING "ib_query_pkey failed (%d) for %s (index %d)\n",
 			       ret, device->name, i);
 			goto err;
 		}
+		/* pkey 0xffff must be the default pkey and 0x0000 must be
+		 * the invalid pkey per the IBTA spec */
+		if (pkey) {
+			pkey_cache->entry[j].index = i;
+			pkey_cache->entry[j++].pkey = pkey;
+		}
 	}
+	pkey_cache->table_len = j;
 
-	for (i = 0; i < gid_cache->table_len; ++i) {
-		ret = ib_query_gid(device, port, i, gid_cache->table + i);
+	memset(&empty_gid, 0, sizeof empty_gid);
+	for (i = 0, j = 0; i < tprops->gid_tbl_len; ++i) {
+		ret = ib_query_gid(device, port, i, &gid);
 		if (ret) {
 			printk(KERN_WARNING "ib_query_gid failed (%d) for %s (index %d)\n",
 			       ret, device->name, i);
 			goto err;
 		}
+		/* if the lower 8 bytes of the device GID entry are all 0,
+		 * our entry is a blank, invalid entry...
+		 * depending on device, the upper 8 bytes might or might
+		 * not be prefilled with a valid subnet prefix, so
+		 * don't rely on them for determining a valid gid
+		 * entry
+		 */
+		if (memcmp(gid.raw + 8, empty_gid.raw + 8, sizeof gid - 8)) {
+			gid_cache->entry[j].index = i;
+			gid_cache->entry[j++].gid = gid;
+		}
+	}
+	gid_cache->table_len = j;
+
+	old_pkey_cache = pkey_cache;
+	pkey_cache = kmalloc(sizeof *pkey_cache + old_pkey_cache->table_len *
+			     sizeof *pkey_cache->entry, GFP_KERNEL);
+	if (!pkey_cache)
+		pkey_cache = old_pkey_cache;
+	else {
+		pkey_cache->table_len = old_pkey_cache->table_len;
+		memcpy(&pkey_cache->entry[0], &old_pkey_cache->entry[0],
+		       pkey_cache->table_len * sizeof *pkey_cache->entry);
+		kfree(old_pkey_cache);
+	}
+
+	old_gid_cache = gid_cache;
+	gid_cache = kmalloc(sizeof *gid_cache + old_gid_cache->table_len *
+			    sizeof *gid_cache->entry, GFP_KERNEL);
+	if (!gid_cache)
+		gid_cache = old_gid_cache;
+	else {
+		gid_cache->table_len = old_gid_cache->table_len;
+		memcpy(&gid_cache->entry[0], &old_gid_cache->entry[0],
+		       gid_cache->table_len * sizeof *gid_cache->entry);
+		kfree(old_gid_cache);
 	}
 
 	write_lock_irq(&device->cache.lock);
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [Patch v2 3/3] IB/cache: don't fill the cache with junk
       [not found]     ` <4c88e00f5211787a98fa980a4d42c5c6374ab868.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-10-20  6:51       ` Jack Morgenstein
  2013-10-21  4:12         ` Doug Ledford
  0 siblings, 1 reply; 8+ messages in thread
From: Jack Morgenstein @ 2013-10-20  6:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Sean Hefty, Roland Dreier, Or Gerlitz, Amir Vadai, Eli Cohen,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Sep 2013 17:16:29 -0400
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> @@ -85,13 +91,26 @@ int ib_get_cached_gid(struct ib_device *device,
>  
>  	cache = device->cache.gid_cache[port_num -
> start_port(device)]; 
> -	if (index < 0 || index >= cache->table_len)
> +	if (index < 0 || index >= cache->table_len) {
>  		ret = -EINVAL;
> -	else
> -		*gid = cache->table[index];
> +		goto out_unlock;
> +	}
>  
> -	read_unlock_irqrestore(&device->cache.lock, flags);
> +	for (i = 0; i < cache->table_len; ++i)
> +		if (cache->entry[i].index == index)
> +			break;
> +
> +

Hi Doug,

I am a bit concerned about this patch, because where before
ib_get_cached_gid just returned the GID at the given index, with your
suggested change, ib_get_cached_gid() requires a search of the new gid
table (to find the entry with the requested index value).

ib_get_cached_gid is called by cm_req_handler, for the gid at index 0.
There is no guarantee that this will be the first entry in the new
scheme.

Furthermore, ib_get_cached_gid is also called in MAD packet handling,
with the specific gid index that is required.

Thus, the savings for ib_find_cached_gid might be offset by a
performance loss in ib_get_cached_gid.

A simpler optimization would be to simply keep a count of the number of
valid GIDS in the gid table -- and break off the search when the last
valid GID has been seen.  This would optimize cases where, for example,
you are searching for a GID that is not in the table, and only the
first 3 gids in the table are valid (so you would not needlessly access
125 invalid GIDs).  Clearly, such an optimization is only useful when
there are a lot of invalid gids bunched together at the end of the
table.  Still, something to think about.

-Jack
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Patch v2 1/3] IB/cma: use cached gids
       [not found]     ` <0dae5249b1f09936a2976ef910c022eecaf9a7fa.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-10-20  6:57       ` Jack Morgenstein
  0 siblings, 0 replies; 8+ messages in thread
From: Jack Morgenstein @ 2013-10-20  6:57 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Sean Hefty, Roland Dreier, Or Gerlitz, Amir Vadai, Eli Cohen,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

ACK.  Looks good!
Very nice catch, Doug!

-Jack

On Tue, 24 Sep 2013 17:16:27 -0400
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> The cma_acquire_dev function was changed by commit 3c86aa70bf67
> to use find_gid_port because multiport devices might have
> either IB or IBoE formatted gids.  The old function assumed that
> all ports on the same device used the same GID format.  However,
> when it was changed to use find_gid_port, we inadvertently lost
> usage of the GID cache.  This turned out to be a very costly
> change.  In our testing, each iteration through each index of
> the GID table takes roughly 35us.  When you have multiple
> devices in a system, and the GID you are looking for is on one
> of the later devices, the code loops through all of the GID
> indexes on all of the early devices before it finally succeeds
> on the target device.  This pathological search behavior combined
> with 35us per GID table index retrieval produces numbers such
> as the following from the cmtime application that's part of the
> latest librdmacm git repo:
> 
> ib1:
> step              total ms     max ms     min us  us / conn
> create id    :       29.42       0.04       1.00       2.94
> bind addr    :   186705.66      19.00   18556.00   18670.57
> resolve addr :       41.93       9.68     619.00       4.19
> resolve route:      486.93       0.48     101.00      48.69
> create qp    :     4021.95       6.18     330.00     402.20
> connect      :    68350.39   68588.17   24632.00    6835.04
> disconnect   :     1460.43     252.65 -1862269.00     146.04
> destroy      :       41.16       0.04       2.00       4.12
> 
> ib0:
> step              total ms     max ms     min us  us / conn
> create id    :       28.61       0.68       1.00       2.86
> bind addr    :     2178.86       2.95     201.00     217.89
> resolve addr :       51.26      16.85     845.00       5.13
> resolve route:      620.08       0.43      92.00      62.01
> create qp    :     3344.40       6.36     273.00     334.44
> connect      :     6435.99    6368.53    7844.00     643.60
> disconnect   :     5095.38     321.90     757.00     509.54
> destroy      :       37.13       0.02       2.00       3.71
> 
> Clearly, both the bind address and connect operations suffer
> a huge penalty for being anything other than the default
> GID on the first port in the system.
> 
> After applying this patch, the numbers now look like this:
> 
> ib1:
> step              total ms     max ms     min us  us / conn
> create id    :       30.15       0.03       1.00       3.01
> bind addr    :       80.27       0.04       7.00       8.03
> resolve addr :       43.02      13.53     589.00       4.30
> resolve route:      482.90       0.45     100.00      48.29
> create qp    :     3986.55       5.80     330.00     398.66
> connect      :     7141.53    7051.29    5005.00     714.15
> disconnect   :     5038.85     193.63     918.00     503.88
> destroy      :       37.02       0.04       2.00       3.70
> 
> ib0:
> step              total ms     max ms     min us  us / conn
> create id    :       34.27       0.05       1.00       3.43
> bind addr    :       26.45       0.04       1.00       2.64
> resolve addr :       38.25      10.54     760.00       3.82
> resolve route:      604.79       0.43      97.00      60.48
> create qp    :     3314.95       6.34     273.00     331.49
> connect      :    12399.26   12351.10    8609.00    1239.93
> disconnect   :     5096.76     270.72    1015.00     509.68
> destroy      :       37.10       0.03       2.00       3.71
> 
> It's worth noting that we still suffer a bit of a penalty on
> connect to the wrong device, but the penalty is much less than
> it used to be.  Follow on patches deal with this penalty.
> 
> Many thanks to Neil Horman for helping to track down the slow
> function, which allowed us to determine that the original patch
> mentioned above had backed out the cache usage and to identify
> just how much that impacted the system.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Patch v2 3/3] IB/cache: don't fill the cache with junk
  2013-10-20  6:51       ` Jack Morgenstein
@ 2013-10-21  4:12         ` Doug Ledford
       [not found]           ` <5264A9C6.6000807-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Doug Ledford @ 2013-10-21  4:12 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: Sean Hefty, Roland Dreier, Or Gerlitz, Amir Vadai, Eli Cohen,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 10/20/2013 02:51 AM, Jack Morgenstein wrote:
> On Tue, 24 Sep 2013 17:16:29 -0400
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
>> @@ -85,13 +91,26 @@ int ib_get_cached_gid(struct ib_device *device,
>>
>>   	cache = device->cache.gid_cache[port_num -
>> start_port(device)];
>> -	if (index < 0 || index >= cache->table_len)
>> +	if (index < 0 || index >= cache->table_len) {
>>   		ret = -EINVAL;
>> -	else
>> -		*gid = cache->table[index];
>> +		goto out_unlock;
>> +	}
>>
>> -	read_unlock_irqrestore(&device->cache.lock, flags);
>> +	for (i = 0; i < cache->table_len; ++i)
>> +		if (cache->entry[i].index == index)
>> +			break;
>> +
>> +
>
> Hi Doug,
>
> I am a bit concerned about this patch, because where before
> ib_get_cached_gid just returned the GID at the given index, with your
> suggested change, ib_get_cached_gid() requires a search of the new gid
> table (to find the entry with the requested index value).

Yes, I agree.  That was a consideration in the design.

> ib_get_cached_gid is called by cm_req_handler, for the gid at index 0.
> There is no guarantee that this will be the first entry in the new
> scheme.

If it's valid, then yes it is.  Only if GID index 0 is invalid will it 
not be the first entry (and then it won't be an entry at all).  This is 
due to how the update of the gid table works (it scans each entry, 
starting at 0 and going to table length, and puts them into a table in 
ascending order).

Which points out that a valid optimization not present in this patch 
would be to break out of the loop once the current table entry's index 
is greater than the requested index.
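For illustration, that early-break lookup over the compacted table might 
look something like this (a standalone sketch with made-up names, and an 
int standing in for the 16-byte GID -- not the actual kernel structures):

```c
#include <stddef.h>

/* Hypothetical compacted cache entry: only valid GIDs are stored,
 * kept in ascending order of their original table index. */
struct gid_entry {
	int index;	/* original GID table index */
	int gid;	/* stand-in for the real 16-byte GID value */
};

/* Return the position of the entry for the requested index, or -1.
 * Because entries are sorted by index, we can break out as soon as
 * we pass the requested index instead of scanning the whole table. */
static int find_entry(const struct gid_entry *tbl, size_t len, int index)
{
	size_t i;

	for (i = 0; i < len; ++i) {
		if (tbl[i].index == index)
			return (int)i;
		if (tbl[i].index > index)	/* sorted: can't appear later */
			break;
	}
	return -1;
}
```

So with valid GIDs only at indexes 0, 3 and 7, a lookup for index 5 
stops after inspecting three entries rather than walking a full 
128-slot table.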

I had also thought about using a bitmap for the valid entries. 
Realistically, most machines only have 1 or 2 ports of IB devices.  For 
common devices, the maximum GID table length is 128 entries.  So, 2 
ports at 128 entries per port is a pretty miserly 256 bytes of memory. 
We could just allocate a full table and use a bitmap to represent the 
valid entries, and then in the find_cached* cases use the bitmap to 
limit our search.  That would allow us to go back to O(1) for get_cached*.
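A standalone sketch of that bitmap scheme (again with made-up names and 
sizes, not actual kernel code) could look like:

```c
#include <stdint.h>

/* Hypothetical full-table cache with a validity bitmap.  Sizes and
 * names are illustrative; an int stands in for the real GID. */
#define TABLE_LEN 128

struct gid_cache {
	uint64_t valid[TABLE_LEN / 64];	/* one bit per slot */
	int gid[TABLE_LEN];
};

static void set_gid(struct gid_cache *c, int index, int gid)
{
	c->gid[index] = gid;
	c->valid[index / 64] |= 1ULL << (index % 64);
}

/* get stays O(1): direct array access, guarded by the validity bit. */
static int get_cached_gid(const struct gid_cache *c, int index, int *gid)
{
	if (index < 0 || index >= TABLE_LEN ||
	    !(c->valid[index / 64] & (1ULL << (index % 64))))
		return -1;
	*gid = c->gid[index];
	return 0;
}

/* find still scans, but only compares slots whose bit is set. */
static int find_cached_gid(const struct gid_cache *c, int gid)
{
	int i;

	for (i = 0; i < TABLE_LEN; ++i)
		if ((c->valid[i / 64] & (1ULL << (i % 64))) &&
		    c->gid[i] == gid)
			return i;
	return -1;
}
```

The find loop here still visits every slot, but skips the expensive 
GID comparison for invalid ones; a real implementation could iterate 
the bitmap find_next_bit-style to skip runs of invalid entries entirely.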

> Furthermore, ib_get_cached_gid is also called in MAD packet handling,
> with the specific gid index that is required.
>
> Thus, the savings for ib_find_cached_gid might possibly be offset by a
> performance loss in ib_get_cached_gid.

Doubtful.  In the common case we trade a single lookup for a search of 
a 3 to 5 entry chain, versus the old behavior of searching 128 entries 
of which 123 to 125 are invalid just to reach the same 3 to 5 valid 
GIDs; the obvious big gain is getting rid of those 123 to 125 wasted 
memcmp's.  And ib_find_cached_gid is used in cma_acquire_dev(), which in 
turn is called by cma_req_handler, iw_conn_req_handler, addr_handler, 
and rdma_bind_addr.  So it sees plenty of use as well.

Now, one thing I haven't tested yet and wish to is the situation where 
we have lots of SRIOV devices and a highly populated GID table on the 
PF.  In that case, further refinement will likely be necessary.  If so, 
that will be my next patch, but I'll leave these patches to stand as 
they do now.

> A simpler optimization would be to simply keep a count of the number of
> valid GIDS in the gid table -- and break off the search when the last
> valid GID has been seen.

My understanding, according to the PKey change flow patches that Or 
posted, is that the GID table can have invalid entries prior to valid 
entries, and so that optimization would break.

>  This would optimize cases where, for example,
> you are searching for a GID that is not in the table, and only the
> first 3 gids in the table are valid (so you would not needlessly access
> 125 invalid GIDs).  Clearly, such an optimization is only useful when
> there are a lot of invalid gids bunched together at the end of the
> table.  Still, something to think about.

As you point out, it only works if all the invalid entries are at the 
end, and we can't guarantee that.

I think I like my suggestion better: go back to having a full table, but 
use a bitmap to indicate valid entries, use the bitmap to limit our 
comparisons in the find_cached* functions, and put the get_* 
functions back to being O(1).  But I would still do that incrementally 
from here, I think.

But I'm not totally convinced of that either.  The exact situation I 
listed above, lots of GIDs on an SRIOV PF, makes me concerned that we 
can get back to a horrible situation in the find_cached* functions once 
we actually have lots of valid entries.  It makes me think we need 
something better than just a linear search of all valid entries when you 
take SRIOV into account.  Whether hash chains or ranges or something to 
make the lots-of-valid-GIDs case faster, I suspect something needs to be 
done, but because things simply aren't in common use yet, we don't know for sure.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [Patch v2 3/3] IB/cache: don't fill the cache with junk
       [not found]           ` <5264A9C6.6000807-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2013-10-22  6:35             ` Jack Morgenstein
  0 siblings, 0 replies; 8+ messages in thread
From: Jack Morgenstein @ 2013-10-22  6:35 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Sean Hefty, Roland Dreier, Or Gerlitz, Amir Vadai, Eli Cohen,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Mon, 21 Oct 2013 00:12:54 -0400
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> I think I like my suggestion better: go back to having a full table,
> but use a bitmap to indicate valid entries and then use the bitmap to
> limit our comparisons in the find_cached* functions, and put the
> get_* functions back to being O(1).  But I would still do that
> incrementally from here I think.
> 
> But I'm not totally convinced of that either.  The exact situation I 
> listed above, lots of GIDs on an SRIOV PF, makes me concerned that we 
> can get back to a horrible situation in the find_cached* functions
> once we actually have lots of valid entries.  It makes me think we
> need something better than just a linear search of all valid entries
> when you take SRIOV into account.  Whether hash chains or ranges or
> something to make the lots of valid GIDs case faster, I suspect
> something needs to be done, but because things simply aren't in
> common use yet, we don't know for sure.

Doug, I like your suggestion regarding bitmaps. I would rather hold off
on patch #3, though, because as you say, patches 1 and 2 do most of the
work and the patch #3 optimization won't do much if the GID table is
very populated (which will be the case under SRIOV).

I think what you say is correct: a linear search through a populated
table will be expensive, and we need to come up with a better
strategy here.

ACK for first 2 patches, please hold off on the third.

-Jack
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-10-22  6:35 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-09-24 21:16 [Patch v2 0/3] Fix GID lookup performance regression Doug Ledford
     [not found] ` <cover.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-09-24 21:16   ` [Patch v2 1/3] IB/cma: use cached gids Doug Ledford
     [not found]     ` <0dae5249b1f09936a2976ef910c022eecaf9a7fa.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-10-20  6:57       ` Jack Morgenstein
2013-09-24 21:16   ` [Patch v2 2/3] IB/cma: Check for GID on listening device first Doug Ledford
2013-09-24 21:16   ` [Patch v2 3/3] IB/cache: don't fill the cache with junk Doug Ledford
     [not found]     ` <4c88e00f5211787a98fa980a4d42c5c6374ab868.1380056994.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-10-20  6:51       ` Jack Morgenstein
2013-10-21  4:12         ` Doug Ledford
     [not found]           ` <5264A9C6.6000807-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2013-10-22  6:35             ` Jack Morgenstein
