* [PATCH 0/8] IPoIB: Fix multiple race conditions
@ 2014-08-12 23:38 Doug Ledford
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

Locking of multicast joins/leaves in the IPoIB layer has been problematic
for a while.  There have been recent changes to try to make things better,
including these commits:

bea1e22 IPoIB: Fix use-after-free of multicast object
a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects

Unfortunately, the following test still fails (miserably) on a plain
upstream kernel:

pass=0
ifdown ib0
while true; do
    ifconfig ib0 up
    ifconfig ib0 down
    echo "Pass $pass"
    let pass++
done

This usually fails within 10 to 20 passes, although I did have a lucky
run make it to 300 or so.  If you happen to have a P_Key child interface,
it fails even quicker.

In tracking down the rtnl deadlock in the multicast code, I ended up
re-designing the flag usage and cleaning up just the race conditions in
the various tasks.  Even that wasn't enough to resolve the issue entirely,
though: if you ran the test above on multiple interfaces simultaneously,
it could still deadlock.  So in the end I re-did the workqueue usage too,
so that we now use a workqueue per device (P_Key devices included, each
with its own dedicated workqueue) as well as one global workqueue that does
nothing but flush tasks.  This turns out to be a much more elegant way
of handling the workqueues, and in fact it enabled me to remove all of the
clunky passing around of flush parameters that told various functions not
to flush the workqueue when doing so would deadlock the workqueue we are
running from.

Here is my test setup:

2 InfiniBand physical fabrics: ib0 and ib1
2 P_Keys on each fabric: default and 0x8002 on ib0 and 0x8003 on ib1
4 total IPoIB devices that I have named mlx4_ib0, mlx4_ib0.8002,
  mlx4_ib1, and mlx4_ib1.8003

In order to test my final patch set, I logged into my test machine on
four different virtual terminals, I used ifdown on all of the above
interfaces to get things in a consistent state, and then I ran the above
loop, one per terminal per interface simultaneously.

It's worth noting here that when you ifconfig a base interface up, it
automatically brings up the child interface too, so the ifconfig mlx4_ib0
up is in fact racing with both ups and downs of mlx4_ib0.8002.  The same
is true for the mlx4_ib1 interface and its child.  With my patch set in
place, these loops are currently running without a problem and have passed
15,000 up/down cycles per interface.

Doug Ledford (8):
  IPoIB: Consolidate rtnl_lock tasks in workqueue
  IPoIB: Make the carrier_on_task race aware
  IPoIB: fix MCAST_FLAG_BUSY usage
  IPoIB: fix mcast_dev_flush/mcast_restart_task race
  IPoIB: change init sequence ordering
  IPoIB: Use dedicated workqueues per interface
  IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
  IPoIB: No longer use flush as a parameter

 drivers/infiniband/ulp/ipoib/ipoib.h           |  19 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  27 ++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  49 +++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 239 ++++++++++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  22 ++-
 6 files changed, 240 insertions(+), 134 deletions(-)

-- 
1.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-12 23:38   ` Doug Ledford
       [not found]     ` <2394730ce5ae1d46522dca04066293dd842edf16.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-12 23:38   ` [PATCH 2/8] IPoIB: Make the carrier_on_task race aware Doug Ledford
                     ` (8 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

Setting the mtu can safely be moved to the carrier_on_task, which keeps
us from needing to take the rtnl lock in the join_finish section.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index d4e005720d0..a0a42859f12 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		spin_unlock_irq(&priv->lock);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
 		set_qkey = 1;
-
-		if (!ipoib_cm_admin_enabled(dev)) {
-			rtnl_lock();
-			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
-			rtnl_unlock();
-		}
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 	}
 
 	rtnl_lock();
+	if (!ipoib_cm_admin_enabled(priv->dev))
+		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
 	rtnl_unlock();
 }
-- 
1.9.3


* [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-12 23:38   ` [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue Doug Ledford
@ 2014-08-12 23:38   ` Doug Ledford
       [not found]     ` <d05cdce70f1312f35f8be2d14bafd2a06809b137.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-12 23:38   ` [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage Doug Ledford
                     ` (7 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface.  Unfortunately, that's not
the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But since the carrier_on_task is placed on the same workqueue as the
join task, the flush_workqueue waits on it too, and that task is
deadlocked on the rtnl lock.  The better thing here is to use trylock
and loop on that until we either get the lock or we see that
FLAG_ADMIN_UP has been cleared, in which case we don't need to do
anything anyway and can just return.
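The trylock-and-recheck idea can be sketched in userspace with pthreads
(an illustrative model only: fake_rtnl and admin_up stand in for the
kernel's rtnl lock and the IPOIB_FLAG_ADMIN_UP bit, and carrier_on_trylock
is a made-up name, not driver code):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t fake_rtnl = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool admin_up = true;

/* Model of the carrier_on_task loop: keep trying the lock, but bail
 * out as soon as the interface is no longer administratively up,
 * because then the work is moot and blocking on the lock could
 * deadlock against a lock holder that is flushing this very
 * workqueue. */
static bool carrier_on_trylock(void)
{
	while (pthread_mutex_trylock(&fake_rtnl) != 0) {
		if (!atomic_load(&admin_up))
			return false;	/* interface going down: give up */
		usleep(20 * 1000);	/* msleep(20) analogue */
	}
	/* ...would set the mtu and turn the carrier on here... */
	pthread_mutex_unlock(&fake_rtnl);
	return true;
}
```

The key property is that the task never sleeps indefinitely on the lock
while a flush could be waiting on the task.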

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a0a42859f12..7e9cd39b5ef 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 						   carrier_on_task);
 	struct ib_port_attr attr;
 
-	/*
-	 * Take rtnl_lock to avoid racing with ipoib_stop() and
-	 * turning the carrier back on while a device is being
-	 * removed.
-	 */
 	if (ib_query_port(priv->ca, priv->port, &attr) ||
 	    attr.state != IB_PORT_ACTIVE) {
 		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
 		return;
 	}
 
-	rtnl_lock();
+	/*
+	 * Take rtnl_lock to avoid racing with ipoib_stop() and
+	 * turning the carrier back on while a device is being
+	 * removed.  However, ipoib_stop() will attempt to flush
+	 * the workqueue while holding the rtnl lock, so loop
+	 * on trylock until either we get the lock or we see
+	 * FLAG_ADMIN_UP go away as that signals that we are bailing
+	 * and can safely ignore the carrier on work
+	 */
+	while (!rtnl_trylock()) {
+		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			return;
+		else
+			msleep(20);
+	}
 	if (!ipoib_cm_admin_enabled(priv->dev))
 		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
-- 
1.9.3


* [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-12 23:38   ` [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue Doug Ledford
  2014-08-12 23:38   ` [PATCH 2/8] IPoIB: Make the carrier_on_task race aware Doug Ledford
@ 2014-08-12 23:38   ` Doug Ledford
       [not found]     ` <a410e80dc5ca7cfa64229bbbf50c1337317e3bd8.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-12 23:38   ` [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race Doug Ledford
                     ` (6 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects)
added a new flag MCAST_JOIN_STARTED, but was not very strict in how it
was used.  We didn't always initialize the completion struct before we
set the flag, and we didn't always call complete on the completion
struct from all paths that complete it.  This made it less than totally
effective, and certainly made its use confusing.  And in the flush
function we would use the presence of this flag to signal that we should
wait on the completion struct, but we never cleared this flag, ever.
This is further muddied by the fact that we overload the MCAST_FLAG_BUSY
flag to mean two different things: we have a join in flight, and we have
succeeded in getting an ib_sa_join_multicast.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

1) Remove the MCAST_JOIN_STARTED flag entirely
2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
3) Test on mcast->mc directly to see if we have completed
ib_sa_join_multicast (using IS_ERR_OR_NULL)
4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
the mcast->done completion struct
5) Make sure that before calling complete(&mcast->done), we always clear
the MCAST_FLAG_BUSY bit
6) Take the mcast_mutex before we call ib_sa_join_multicast and also
take the mutex in our join callback.  This forces ib_sa_join_multicast
to return and set mcast->mc before we process the callback.  This way,
our callback can safely clear mcast->mc if there is an error on the join
and we will do the right thing as a result in mcast_dev_flush.
7) Because we need the mutex to synchronize mcast->mc, we can no longer
call mcast_sendonly_join directly from mcast_send and instead must add
sendonly join processing to the mcast_join_task

A number of different races are resolved with these changes.  These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1                             Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
  alloc member
  call callback
                                     ifconfig ib0 down
				     wait_for_completion
    callback call completes
                                     wait_for_completion in
				     mcast_dev_flush completes
				       mcast->mc is an ERR_PTR or NULL
				       so we skip ib_sa_free_multicast
    return from callback
  return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_join_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another race looks like this:

Thread 1                   Thread 2         Thread 3
ib_sa_join_multicast
                                            ifconfig ib0 down
					    priv->broadcast = NULL
                           join_complete
			                    wait_for_completion
			   mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
			   complete
			   return -EAGAIN (making mcast->mc invalid)
					    call ib_sa_free_multicast
					    on invalid mcast->mc, hang
					    forever

By holding the mutex around ib_sa_join_multicast and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we run
the callback.  This allows us to clear mcast->mc if there is an error
and the join is going to fail.  We do this before we complete the mcast.
In this way, mcast_dev_flush always sees consistent state with regard to
mcast->mc membership at the time that the wait_for_completion() returns.
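The ordering argument can be modeled in userspace: the joiner holds a
mutex across the call that both arms the callback and returns the
handle, so the callback cannot inspect the handle before it has been
stored (a pthread sketch; mc, do_join, and join_callback are
illustrative stand-ins, not the driver's symbols):

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t mcast_mutex = PTHREAD_MUTEX_INITIALIZER;
static void *mc;			/* stands in for mcast->mc */
static int callback_saw_valid_mc;

/* The "join callback": it may become runnable the moment the join is
 * issued, but taking the mutex first forces it to wait until the
 * joiner has stored the handle, just as the patch forces
 * ib_sa_join_multicast to return before the callback proceeds. */
static void *join_callback(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&mcast_mutex);
	/* mc is guaranteed set by now; on a join error we could
	 * safely clear it here before completing */
	callback_saw_valid_mc = (mc != NULL);
	pthread_mutex_unlock(&mcast_mutex);
	return NULL;
}

/* The "joiner": hold the mutex across issuing the join and storing
 * the returned handle. */
static int do_join(void)
{
	pthread_t cb;

	pthread_mutex_lock(&mcast_mutex);
	pthread_create(&cb, NULL, join_callback, NULL);	/* join issued */
	mc = (void *)0x1;		/* "valid" handle from the join */
	pthread_mutex_unlock(&mcast_mutex);
	pthread_join(cb, NULL);
	return callback_saw_valid_mc;
}
```

Without the mutex, the callback could observe mc still NULL, skip the
cleanup, and leave the join/leave accounting unbalanced.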

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  10 +-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 148 ++++++++++++++++---------
 2 files changed, 101 insertions(+), 57 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 3edce617c31..43840971ad8 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -98,9 +98,15 @@ enum {
 
 	IPOIB_MCAST_FLAG_FOUND	  = 0,	/* used in set_multicast_list */
 	IPOIB_MCAST_FLAG_SENDONLY = 1,
-	IPOIB_MCAST_FLAG_BUSY	  = 2,	/* joining or already joined */
+	/*
+	 * For IPOIB_MCAST_FLAG_BUSY
+	 * When set, in flight join and mcast->mc is unreliable
+	 * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
+	 *   haven't started yet
+	 * When clear and mcast->mc is valid pointer, join was successful
+	 */
+	IPOIB_MCAST_FLAG_BUSY	  = 2,
 	IPOIB_MCAST_FLAG_ATTACHED = 3,
-	IPOIB_MCAST_JOIN_STARTED  = 4,
 
 	MAX_SEND_CQE		  = 16,
 	IPOIB_CM_COPYBREAK	  = 256,
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 7e9cd39b5ef..f5e8da530d9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status,
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
 
+	/*
+	 * We have to take the mutex to force mcast_sendonly_join to
+	 * return from ib_sa_multicast_join and set mcast->mc to a
+	 * valid value.  Otherwise we were racing with ourselves in
+	 * that we might fail here, but get a valid return from
+	 * ib_sa_multicast_join after we had cleared mcast->mc here,
+	 * resulting in mis-matched joins and leaves and a deadlock
+	 */
+	mutex_lock(&mcast_mutex);
+
 	/* We trap for port events ourselves. */
 	if (status == -ENETRESET)
-		return 0;
+		goto out;
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
 
 	if (status) {
 		if (mcast->logcount++ < 20)
-			ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for %pI6, status %d\n",
+			ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
+					"join failed for %pI6, status %d\n",
 					mcast->mcmember.mgid.raw, status);
 
 		/* Flush out any queued packets */
@@ -290,11 +301,15 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
-
-		/* Clear the busy flag so we try again */
-		status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY,
-					    &mcast->flags);
 	}
+out:
+	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	if (status)
+		mcast->mc = NULL;
+	complete(&mcast->done);
+	if (status == -ENETRESET)
+		status = 0;
+	mutex_unlock(&mcast_mutex);
 	return status;
 }
 
@@ -312,12 +327,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	int ret = 0;
 
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
-		ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n");
+		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
+				"multicast joins\n");
 		return -ENODEV;
 	}
 
-	if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
-		ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
+	if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
+		ipoib_dbg_mcast(priv, "multicast entry busy, skipping "
+				"sendonly join\n");
 		return -EBUSY;
 	}
 
@@ -325,6 +342,9 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	rec.port_gid = priv->local_gid;
 	rec.pkey     = cpu_to_be16(priv->pkey);
 
+	mutex_lock(&mcast_mutex);
+	init_completion(&mcast->done);
+	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
 					 priv->port, &rec,
 					 IB_SA_MCMEMBER_REC_MGID	|
@@ -337,12 +357,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	if (IS_ERR(mcast->mc)) {
 		ret = PTR_ERR(mcast->mc);
 		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		ipoib_warn(priv, "ib_sa_join_multicast failed (ret = %d)\n",
-			   ret);
+		complete(&mcast->done);
+		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
+			   "failed (ret = %d)\n", ret);
 	} else {
-		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting join\n",
-				mcast->mcmember.mgid.raw);
+		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
+				"sendonly join\n", mcast->mcmember.mgid.raw);
 	}
+	mutex_unlock(&mcast_mutex);
 
 	return ret;
 }
@@ -390,22 +412,28 @@ static int ipoib_mcast_join_complete(int status,
 	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
 			mcast->mcmember.mgid.raw, status);
 
+	/*
+	 * We have to take the mutex to force mcast_join to
+	 * return from ib_sa_multicast_join and set mcast->mc to a
+	 * valid value.  Otherwise we were racing with ourselves in
+	 * that we might fail here, but get a valid return from
+	 * ib_sa_multicast_join after we had cleared mcast->mc here,
+	 * resulting in mis-matched joins and leaves and a deadlock
+	 */
+	mutex_lock(&mcast_mutex);
+
 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET) {
-		status = 0;
+	if (status == -ENETRESET)
 		goto out;
-	}
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
 
 	if (!status) {
 		mcast->backoff = 1;
-		mutex_lock(&mcast_mutex);
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
 					   &priv->mcast_task, 0);
-		mutex_unlock(&mcast_mutex);
 
 		/*
 		 * Defer carrier on work to ipoib_workqueue to avoid a
@@ -413,37 +441,35 @@ static int ipoib_mcast_join_complete(int status,
 		 */
 		if (mcast == priv->broadcast)
 			queue_work(ipoib_workqueue, &priv->carrier_on_task);
-
-		status = 0;
-		goto out;
-	}
-
-	if (mcast->logcount++ < 20) {
-		if (status == -ETIMEDOUT || status == -EAGAIN) {
-			ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
-					mcast->mcmember.mgid.raw, status);
-		} else {
-			ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
-				   mcast->mcmember.mgid.raw, status);
+	} else {
+		if (mcast->logcount++ < 20) {
+			if (status == -ETIMEDOUT || status == -EAGAIN) {
+				ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
+						mcast->mcmember.mgid.raw, status);
+			} else {
+				ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
+					   mcast->mcmember.mgid.raw, status);
+			}
 		}
-	}
-
-	mcast->backoff *= 2;
-	if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-		mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-
-	/* Clear the busy flag so we try again */
-	status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 
-	mutex_lock(&mcast_mutex);
+		mcast->backoff *= 2;
+		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
+			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+	}
+out:
 	spin_lock_irq(&priv->lock);
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	if (status)
+		mcast->mc = NULL;
+	complete(&mcast->done);
+	if (status == -ENETRESET)
+		status = 0;
+	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
 				   mcast->backoff * HZ);
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
-out:
-	complete(&mcast->done);
+
 	return status;
 }
 
@@ -492,10 +518,9 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		rec.hop_limit	  = priv->broadcast->mcmember.hop_limit;
 	}
 
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	mutex_lock(&mcast_mutex);
 	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags);
-
+	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
 					 &rec, comp_mask, GFP_KERNEL,
 					 ipoib_mcast_join_complete, mcast);
@@ -509,13 +534,12 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 
-		mutex_lock(&mcast_mutex);
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
 			queue_delayed_work(ipoib_workqueue,
 					   &priv->mcast_task,
 					   mcast->backoff * HZ);
-		mutex_unlock(&mcast_mutex);
 	}
+	mutex_unlock(&mcast_mutex);
 }
 
 void ipoib_mcast_join_task(struct work_struct *work)
@@ -576,7 +600,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
-		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
+		if (IS_ERR_OR_NULL(priv->broadcast->mc) &&
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
 			ipoib_mcast_join(dev, priv->broadcast, 0);
 		return;
 	}
@@ -584,23 +609,33 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	while (1) {
 		struct ipoib_mcast *mcast = NULL;
 
+		/*
+		 * Need the mutex so our flags are consistent, need the
+		 * priv->lock so we don't race with list removals in either
+		 * mcast_dev_flush or mcast_restart_task
+		 */
+		mutex_lock(&mcast_mutex);
 		spin_lock_irq(&priv->lock);
 		list_for_each_entry(mcast, &priv->multicast_list, list) {
-			if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)
-			    && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)
-			    && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+			if (IS_ERR_OR_NULL(mcast->mc) &&
+			    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
+			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
 				/* Found the next unjoined group */
 				break;
 			}
 		}
 		spin_unlock_irq(&priv->lock);
+		mutex_unlock(&mcast_mutex);
 
 		if (&mcast->list == &priv->multicast_list) {
 			/* All done */
 			break;
 		}
 
-		ipoib_mcast_join(dev, mcast, 1);
+		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+			ipoib_mcast_sendonly_join(mcast);
+		else
+			ipoib_mcast_join(dev, mcast, 1);
 		return;
 	}
 
@@ -646,6 +681,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 	int ret = 0;
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+		ipoib_warn(priv, "ipoib_mcast_leave on an in-flight join\n");
+
+	if (!IS_ERR_OR_NULL(mcast->mc))
 		ib_sa_free_multicast(mcast->mc);
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
@@ -698,6 +736,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
+		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+			queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
 	}
 
 	if (!mcast->ah) {
@@ -711,8 +751,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			ipoib_dbg_mcast(priv, "no address vector, "
 					"but multicast join already started\n");
-		else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
-			ipoib_mcast_sendonly_join(mcast);
 
 		/*
 		 * If lookup completes between here and out:, don't
@@ -774,7 +812,7 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 
 	/* seperate between the wait to the leave*/
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
-		if (test_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags))
+		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
-- 
1.9.3


* [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2014-08-12 23:38   ` [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage Doug Ledford
@ 2014-08-12 23:38   ` Doug Ledford
       [not found]     ` <e1dbcfc25d8930b281aad12699ebf8fa82485b0e.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-12 23:38   ` [PATCH 5/8] IPoIB: change init sequence ordering Doug Ledford
                     ` (5 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

Our mcast_dev_flush routine and our mcast_restart_task can race against
each other.  In particular, they both hold the priv->lock while
manipulating the rbtree and while removing mcast entries from the
multicast_list and while adding entries to the remove_list, but they
also both drop their locks prior to doing the actual removes.  The
mcast_dev_flush routine is run entirely under the rtnl lock and so has
at least some locking.  The actual race condition is like this:

Thread 1                                Thread 2
ifconfig ib0 up
  start multicast join for broadcast
  multicast join completes for broadcast
  start to add more multicast joins
    call mcast_restart_task to add new entries
                                        ifconfig ib0 down
					  mcast_dev_flush
					    mcast_leave(mcast A)
    mcast_leave(mcast A)

As mcast_leave calls ib_sa_free_multicast, and as the member in
core/multicast.c is ref counted, we run into an unbalanced refcount
issue.  To avoid stomping on each other's removes, take the rtnl lock
specifically when we are deleting the entries from the remove list.
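The "wait for in-flight joins before leaving" step in this patch relies
on the kernel's completion API; its semantics can be approximated in
userspace with a mutex and condition variable (an illustrative
reimplementation of the idea, not the kernel's own code):

```c
#include <pthread.h>

/* Minimal userspace model of the completion used by the patch: the
 * flush/restart paths do the equivalent of wait_for_completion() on
 * every busy mcast entry before leaving, and the join callback calls
 * complete() once the join has settled. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	int             done;
};

static void init_completion(struct completion *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->cond, NULL);
	c->done = 0;
}

static void complete(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	c->done = 1;
	pthread_cond_signal(&c->cond);
	pthread_mutex_unlock(&c->lock);
}

static void wait_for_completion(struct completion *c)
{
	pthread_mutex_lock(&c->lock);
	while (!c->done)
		pthread_cond_wait(&c->cond, &c->lock);
	pthread_mutex_unlock(&c->lock);
}
```

Because both mcast_dev_flush and mcast_restart_task wait this way on
every BUSY entry before calling the leave path, a join can never still
be in flight when the group is freed.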

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index f5e8da530d9..19e3fe75ebf 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 
 	spin_unlock_irqrestore(&priv->lock, flags);
 
-	/* seperate between the wait to the leave*/
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
@@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	netif_addr_unlock(dev);
 	local_irq_restore(flags);
 
-	/* We have to cancel outside of the spinlock */
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+			wait_for_completion(&mcast->done);
+
+	/*
+	 * We have to cancel outside of the spinlock, but we have to
+	 * take the rtnl lock or else we race with the removal of
+	 * entries from the remove list in mcast_dev_flush as part
+	 * of ipoib_stop() which will call mcast_stop_thread with
+	 * flush == 1 while holding the rtnl lock, and the
+	 * flush_workqueue won't complete until this restart_mcast_task
+	 * completes.  So do like the carrier on task and attempt to
+	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
+	 * goes away, then just return and know that the remove list will
+	 * get flushed later by mcast_dev_flush.
+	 */
+	while (!rtnl_trylock()) {
+		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			return;
+		else
+			msleep(20);
+	}
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}
-
-	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		ipoib_mcast_start_thread(dev);
+	ipoib_mcast_start_thread(dev);
+	rtnl_unlock();
 }
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
-- 
1.9.3


* [PATCH 5/8] IPoIB: change init sequence ordering
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2014-08-12 23:38   ` [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race Doug Ledford
@ 2014-08-12 23:38   ` Doug Ledford
       [not found]     ` <ead9800512c1cb412b86cb1de3868c40f07c72be.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-12 23:38   ` [PATCH 6/8] IPoIB: Use dedicated workqueues per interface Doug Ledford
                     ` (4 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

In preparation for using per device work queues, we need to move the
start of the neighbor thread task to after ipoib_ib_dev_init and move
the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
Otherwise we will end up freeing our workqueue with work possibly still
on it.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 217cb77157d..949948a46d4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	if (ipoib_neigh_hash_init(priv) < 0)
-		goto out;
 	/* Allocate RX/TX "rings" to hold queued skbs */
 	priv->rx_ring =	kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
 				GFP_KERNEL);
 	if (!priv->rx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
 		       ca->name, ipoib_recvq_size);
-		goto out_neigh_hash_cleanup;
+		goto out;
 	}
 
 	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
@@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	if (ipoib_ib_dev_init(dev, ca, port))
 		goto out_tx_ring_cleanup;
 
+	/*
+	 * Must be after ipoib_ib_dev_init so we can allocate a per
+	 * device wq there and use it here
+	 */
+	if (ipoib_neigh_hash_init(priv) < 0)
+		goto out_dev_uninit;
+
 	return 0;
 
+out_dev_uninit:
+	ipoib_ib_dev_cleanup();
+
 out_tx_ring_cleanup:
 	vfree(priv->tx_ring);
 
 out_rx_ring_cleanup:
 	kfree(priv->rx_ring);
 
-out_neigh_hash_cleanup:
-	ipoib_neigh_hash_uninit(dev);
 out:
 	return -ENOMEM;
 }
@@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev)
 	}
 	unregister_netdevice_many(&head);
 
+	/*
+	 * Must be before ipoib_ib_dev_cleanup or we delete an in use
+	 * work queue
+	 */
+	ipoib_neigh_hash_uninit(dev);
+
 	ipoib_ib_dev_cleanup(dev);
 
 	kfree(priv->rx_ring);
@@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
-
-	ipoib_neigh_hash_uninit(dev);
 }
 
 static const struct header_ops ipoib_header_ops = {
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 6/8] IPoIB: Use dedicated workqueues per interface
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2014-08-12 23:38   ` [PATCH 5/8] IPoIB: change init sequence ordering Doug Ledford
@ 2014-08-12 23:38   ` Doug Ledford
       [not found]     ` <f7af9c251d722675a549e4a673f46c0f31dfa266.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-12 23:38   ` [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue Doug Ledford
                     ` (3 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up.  It turns out
that no matter how well we protect against races on a single device,
all devices share the same workqueue, and flush_workqueue() flushes
*everything* from that workqueue.  So we can have one device in the
middle of a down, holding the rtnl lock, and another totally unrelated
device needing to run mcast_restart_task, which wants the rtnl lock and
will loop trying to take it unless it sees its own FLAG_ADMIN_UP flag
go away.  Because the unrelated interface will never see its own
ADMIN_UP flag drop, the interface going down will deadlock trying to
flush the queue.  There are several possible solutions to this
problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl lock
for some set period of time and bail if they fail.  This runs the real
risk of dropping work on the floor, which can end up being its own
separate kind of deadlock.

Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail.  Again, this can
drop work on the floor.  If our own ADMIN_UP flag doesn't go away, we
could perhaps requeue our own task as delayed work after a few failed
tries on the rtnl lock and return, avoiding dropped work that way, but
I'm not 100% convinced that wouldn't cause other problems.

Or the method this patch uses: when we bring an interface up, create a
workqueue specifically for that interface, so that when we take it back
down we flush only the tasks associated with our interface.  In
addition, keep the global workqueue, but limit it to flush tasks only.
That way the flush tasks can always flush the device-specific
workqueues without deadlocking.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        | 18 +++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  6 +++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c      | 19 ++++++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 ++++++++++++--------------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     | 22 +++++++++++++++++++++-
 6 files changed, 58 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 43840971ad8..7bf7339eaef 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -317,6 +317,7 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
+	struct workqueue_struct *wq;
 	struct delayed_work mcast_task;
 	struct work_struct carrier_on_task;
 	struct work_struct flush_light;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 933efcea0d0..56959adb6c7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	}
 
 	spin_lock_irq(&priv->lock);
-	queue_delayed_work(ipoib_workqueue,
+	queue_delayed_work(priv->wq,
 			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	/* Add this entry to passive ids list head, but do not re-add it
 	 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
@@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 			spin_lock_irqsave(&priv->lock, flags);
 			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
 			ipoib_cm_start_rx_drain(priv);
-			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+			queue_work(priv->wq, &priv->cm.rx_reap_task);
 			spin_unlock_irqrestore(&priv->lock, flags);
 		} else
 			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
@@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 				spin_lock_irqsave(&priv->lock, flags);
 				list_move(&p->list, &priv->cm.rx_reap_list);
 				spin_unlock_irqrestore(&priv->lock, flags);
-				queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+				queue_work(priv->wq, &priv->cm.rx_reap_task);
 			}
 			return;
 		}
@@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 
 		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+			queue_work(priv->wq, &priv->cm.reap_task);
 		}
 
 		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
@@ -1255,7 +1255,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 
 		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+			queue_work(priv->wq, &priv->cm.reap_task);
 		}
 
 		spin_unlock_irqrestore(&priv->lock, flags);
@@ -1284,7 +1284,7 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path
 	tx->dev = dev;
 	list_add(&tx->list, &priv->cm.start_list);
 	set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags);
-	queue_work(ipoib_workqueue, &priv->cm.start_task);
+	queue_work(priv->wq, &priv->cm.start_task);
 	return tx;
 }
 
@@ -1295,7 +1295,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
 	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 		spin_lock_irqsave(&priv->lock, flags);
 		list_move(&tx->list, &priv->cm.reap_list);
-		queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		queue_work(priv->wq, &priv->cm.reap_task);
 		ipoib_dbg(priv, "Reap connection for gid %pI6\n",
 			  tx->neigh->daddr + 4);
 		tx->neigh = NULL;
@@ -1417,7 +1417,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
 
 	skb_queue_tail(&priv->cm.skb_queue, skb);
 	if (e)
-		queue_work(ipoib_workqueue, &priv->cm.skb_task);
+		queue_work(priv->wq, &priv->cm.skb_task);
 }
 
 static void ipoib_cm_rx_reap(struct work_struct *work)
@@ -1450,7 +1450,7 @@ static void ipoib_cm_stale_task(struct work_struct *work)
 	}
 
 	if (!list_empty(&priv->cm.passive_ids))
-		queue_delayed_work(ipoib_workqueue,
+		queue_delayed_work(priv->wq,
 				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	spin_unlock_irq(&priv->lock);
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 72626c34817..bfd17d41b5f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -655,7 +655,7 @@ void ipoib_reap_ah(struct work_struct *work)
 	__ipoib_reap_ah(dev);
 
 	if (!test_bit(IPOIB_STOP_REAPER, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+		queue_delayed_work(priv->wq, &priv->ah_reap_task,
 				   round_jiffies_relative(HZ));
 }
 
@@ -696,7 +696,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
 	}
 
 	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+	queue_delayed_work(priv->wq, &priv->ah_reap_task,
 			   round_jiffies_relative(HZ));
 
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
@@ -881,7 +881,7 @@ timeout:
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
 	if (flush)
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 	begin = jiffies;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 949948a46d4..255c8296566 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -839,7 +839,7 @@ static void ipoib_set_mcast_list(struct net_device *dev)
 		return;
 	}
 
-	queue_work(ipoib_workqueue, &priv->restart_task);
+	queue_work(priv->wq, &priv->restart_task);
 }
 
 static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr)
@@ -954,7 +954,7 @@ static void ipoib_reap_neigh(struct work_struct *work)
 	__ipoib_reap_neigh(priv);
 
 	if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+		queue_delayed_work(priv->wq, &priv->neigh_reap_task,
 				   arp_tbl.gc_interval);
 }
 
@@ -1133,7 +1133,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
 
 	/* start garbage collection */
 	clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+	queue_delayed_work(priv->wq, &priv->neigh_reap_task,
 			   arp_tbl.gc_interval);
 
 	return 0;
@@ -1293,7 +1293,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	return 0;
 
 out_dev_uninit:
-	ipoib_ib_dev_cleanup();
+	ipoib_ib_dev_cleanup(dev);
 
 out_tx_ring_cleanup:
 	vfree(priv->tx_ring);
@@ -1646,7 +1646,7 @@ register_failed:
 	/* Stop GC if started before flush */
 	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 	cancel_delayed_work(&priv->neigh_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	flush_workqueue(priv->wq);
 
 event_failed:
 	ipoib_dev_cleanup(priv->dev);
@@ -1717,7 +1717,7 @@ static void ipoib_remove_one(struct ib_device *device)
 		/* Stop GC */
 		set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 		cancel_delayed_work(&priv->neigh_reap_task);
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 		unregister_netdev(priv->dev);
 		free_netdev(priv->dev);
@@ -1758,8 +1758,13 @@ static int __init ipoib_init_module(void)
 	 * unregister_netdev() and linkwatch_event take the rtnl lock,
 	 * so flush_scheduled_work() can deadlock during device
 	 * removal.
+	 *
+	 * In addition, bringing one device up and another down at the
+	 * same time can deadlock a single workqueue, so we have this
+	 * global fallback workqueue, but we also attempt to open a
+	 * per device workqueue each time we bring an interface up
 	 */
-	ipoib_workqueue = create_singlethread_workqueue("ipoib");
+	ipoib_workqueue = create_singlethread_workqueue("ipoib_flush");
 	if (!ipoib_workqueue) {
 		ret = -ENOMEM;
 		goto err_fs;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 19e3fe75ebf..969ef420518 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -388,7 +388,7 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 	 * the workqueue while holding the rtnl lock, so loop
 	 * on trylock until either we get the lock or we see
 	 * FLAG_ADMIN_UP go away as that signals that we are bailing
-	 * and can safely ignore the carrier on work
+	 * and can safely ignore the carrier on work.
 	 */
 	while (!rtnl_trylock()) {
 		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
@@ -432,15 +432,14 @@ static int ipoib_mcast_join_complete(int status,
 	if (!status) {
 		mcast->backoff = 1;
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->mcast_task, 0);
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 
 		/*
-		 * Defer carrier on work to ipoib_workqueue to avoid a
+		 * Defer carrier on work to priv->wq to avoid a
 		 * deadlock on rtnl_lock here.
 		 */
 		if (mcast == priv->broadcast)
-			queue_work(ipoib_workqueue, &priv->carrier_on_task);
+			queue_work(priv->wq, &priv->carrier_on_task);
 	} else {
 		if (mcast->logcount++ < 20) {
 			if (status == -ETIMEDOUT || status == -EAGAIN) {
@@ -465,7 +464,7 @@ out:
 	if (status == -ENETRESET)
 		status = 0;
 	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
+		queue_delayed_work(priv->wq, &priv->mcast_task,
 				   mcast->backoff * HZ);
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
@@ -535,8 +534,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->mcast_task,
+			queue_delayed_work(priv->wq, &priv->mcast_task,
 					   mcast->backoff * HZ);
 	}
 	mutex_unlock(&mcast_mutex);
@@ -584,8 +582,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-				queue_delayed_work(ipoib_workqueue,
-						   &priv->mcast_task, HZ);
+				queue_delayed_work(priv->wq, &priv->mcast_task,
+						   HZ);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -652,7 +650,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 
 	mutex_lock(&mcast_mutex);
 	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	mutex_unlock(&mcast_mutex);
 
 	return 0;
@@ -670,7 +668,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
 	mutex_unlock(&mcast_mutex);
 
 	if (flush)
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 	return 0;
 }
@@ -737,7 +735,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
 		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 
 	if (!mcast->ah) {
@@ -952,7 +950,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	 * completes.  So do like the carrier on task and attempt to
 	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
 	 * goes away, then just return and know that the remove list will
-	 * get flushed later by mcast_dev_flush.
+	 * get flushed later by mcast_stop_thread.
 	 */
 	while (!rtnl_trylock()) {
 		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index c56d5d44c53..b72a753eb41 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -145,10 +145,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	int ret, size;
 	int i;
 
+	/*
+	 * the various IPoIB tasks assume they will never race against
+	 * themselves, so always use a single thread workqueue
+	 */
+	priv->wq = create_singlethread_workqueue("ipoib_wq");
+	if (!priv->wq) {
+		printk(KERN_WARNING "ipoib: failed to allocate device WQ\n");
+		return -ENODEV;
+	}
+
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
 		printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name);
-		return -ENODEV;
+		goto out_free_wq;
 	}
 
 	priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
@@ -242,6 +252,10 @@ out_free_mr:
 
 out_free_pd:
 	ib_dealloc_pd(priv->pd);
+
+out_free_wq:
+	destroy_workqueue(priv->wq);
+	priv->wq = NULL;
 	return -ENODEV;
 }
 
@@ -270,6 +284,12 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 
 	if (ib_dealloc_pd(priv->pd))
 		ipoib_warn(priv, "ib_dealloc_pd failed\n");
+
+	if (priv->wq) {
+		flush_workqueue(priv->wq);
+		destroy_workqueue(priv->wq);
+		priv->wq = NULL;
+	}
 }
 
 void ipoib_event(struct ib_event_handler *handler,
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2014-08-12 23:38   ` [PATCH 6/8] IPoIB: Use dedicated workqueues per interface Doug Ledford
@ 2014-08-12 23:38   ` Doug Ledford
       [not found]     ` <ae3912431eeacd81d920a405a6bdeb3853791b1a.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-12 23:38   ` [PATCH 8/8] IPoIB: No longer use flush as a parameter Doug Ledford
                     ` (2 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

We used to pass a flush variable to mcast_stop_thread to indicate
whether we should flush the workqueue.  This was needed because some
code tried to flush a workqueue that it was currently running on, which
is a no-no.  Now that we have per-device workqueues, and now that
ipoib_mcast_restart_task takes into account the fact that it is queued
on a single-thread workqueue alongside all of the ipoib_mcast_join_task
invocations (and therefore has no need to stop the join task while it
runs), we can do away with the flush parameter and always flush
unconditionally.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  4 ++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++------------
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 7bf7339eaef..2e6c0d91d66 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -493,7 +493,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
-int ipoib_mcast_stop_thread(struct net_device *dev, int flush);
+int ipoib_mcast_stop_thread(struct net_device *dev);
 
 void ipoib_mcast_dev_down(struct net_device *dev);
 void ipoib_mcast_dev_flush(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index bfd17d41b5f..66096787119 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -747,7 +747,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
 	clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
 	netif_carrier_off(dev);
 
-	ipoib_mcast_stop_thread(dev, flush);
+	ipoib_mcast_stop_thread(dev);
 	ipoib_mcast_dev_flush(dev);
 
 	ipoib_flush_paths(dev);
@@ -1097,7 +1097,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	 */
 	ipoib_flush_paths(dev);
 
-	ipoib_mcast_stop_thread(dev, 1);
+	ipoib_mcast_stop_thread(dev);
 	ipoib_mcast_dev_flush(dev);
 
 	ipoib_transport_dev_cleanup(dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 969ef420518..8a538c010b9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -656,7 +656,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	return 0;
 }
 
-int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
+int ipoib_mcast_stop_thread(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -667,8 +667,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
 	cancel_delayed_work(&priv->mcast_task);
 	mutex_unlock(&mcast_mutex);
 
-	if (flush)
-		flush_workqueue(priv->wq);
+	flush_workqueue(priv->wq);
 
 	return 0;
 }
@@ -846,8 +845,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 
 	ipoib_dbg_mcast(priv, "restarting multicast task\n");
 
-	ipoib_mcast_stop_thread(dev, 0);
-
 	local_irq_save(flags);
 	netif_addr_lock(dev);
 	spin_lock(&priv->lock);
@@ -944,13 +941,10 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	 * We have to cancel outside of the spinlock, but we have to
 	 * take the rtnl lock or else we race with the removal of
 	 * entries from the remove list in mcast_dev_flush as part
-	 * of ipoib_stop() which will call mcast_stop_thread with
-	 * flush == 1 while holding the rtnl lock, and the
-	 * flush_workqueue won't complete until this restart_mcast_task
-	 * completes.  So do like the carrier on task and attempt to
-	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
-	 * goes away, then just return and know that the remove list will
-	 * get flushed later by mcast_stop_thread.
+	 * of ipoib_stop().  We detect the drop of the ADMIN_UP flag
+	 * to signal that we have hit this particular race, and we
+	 * return since we know we don't need to do anything else
+	 * anyway.
 	 */
 	while (!rtnl_trylock()) {
 		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
@@ -962,6 +956,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}
+	/*
+	 * Restart our join task if needed
+	 */
 	ipoib_mcast_start_thread(dev);
 	rtnl_unlock();
 }
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 8/8] IPoIB: No longer use flush as a parameter
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2014-08-12 23:38   ` [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue Doug Ledford
@ 2014-08-12 23:38   ` Doug Ledford
       [not found]     ` <ad7bb2b8da52f187cf2978e6a1c77ead32b60de3.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-15 22:08   ` [PATCH 0/8] IPoIB: Fix multiple race conditions Wendy Cheng
  2014-09-03 13:52   ` Or Gerlitz
  9 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-12 23:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Doug Ledford

Various places in the IPoIB code had a deadlock related to flushing the
ipoib workqueue.  Now that we have per-device workqueues and a
dedicated flush workqueue, flushing the device-specific workqueues no
longer deadlocks, so we can do it unconditionally.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h      |  6 +++---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c   | 19 +++++++++----------
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  8 ++++----
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 2e6c0d91d66..71b18883ffe 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -478,10 +478,10 @@ void ipoib_ib_dev_flush_heavy(struct work_struct *work);
 void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush);
+int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
-int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev, int flush);
+int ipoib_ib_dev_down(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev);
 void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 66096787119..fe65abb5150 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -664,7 +664,7 @@ static void ipoib_ib_tx_timer_func(unsigned long ctx)
 	drain_tx_cq((struct net_device *)ctx);
 }
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush)
+int ipoib_ib_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
@@ -706,7 +706,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
 dev_stop:
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
 		napi_enable(&priv->napi);
-	ipoib_ib_dev_stop(dev, flush);
+	ipoib_ib_dev_stop(dev);
 	return -1;
 }
 
@@ -738,7 +738,7 @@ int ipoib_ib_dev_up(struct net_device *dev)
 	return ipoib_mcast_start_thread(dev);
 }
 
-int ipoib_ib_dev_down(struct net_device *dev, int flush)
+int ipoib_ib_dev_down(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -807,7 +807,7 @@ void ipoib_drain_cq(struct net_device *dev)
 	local_bh_enable();
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+int ipoib_ib_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -880,8 +880,7 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	if (flush)
-		flush_workqueue(priv->wq);
+	flush_workqueue(priv->wq);
 
 	begin = jiffies;
 
@@ -918,7 +917,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		    (unsigned long) dev);
 
 	if (dev->flags & IFF_UP) {
-		if (ipoib_ib_dev_open(dev, 1)) {
+		if (ipoib_ib_dev_open(dev)) {
 			ipoib_transport_dev_cleanup(dev);
 			return -ENODEV;
 		}
@@ -1040,12 +1039,12 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
 	}
 
 	if (level >= IPOIB_FLUSH_NORMAL)
-		ipoib_ib_dev_down(dev, 0);
+		ipoib_ib_dev_down(dev);
 
 	if (level == IPOIB_FLUSH_HEAVY) {
 		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-			ipoib_ib_dev_stop(dev, 0);
-		if (ipoib_ib_dev_open(dev, 0) != 0)
+			ipoib_ib_dev_stop(dev);
+		if (ipoib_ib_dev_open(dev) != 0)
 			return;
 		if (netif_queue_stopped(dev))
 			netif_start_queue(dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 255c8296566..4e4f6ec59e1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -108,7 +108,7 @@ int ipoib_open(struct net_device *dev)
 
 	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	if (ipoib_ib_dev_open(dev, 1)) {
+	if (ipoib_ib_dev_open(dev)) {
 		if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
 			return 0;
 		goto err_disable;
@@ -139,7 +139,7 @@ int ipoib_open(struct net_device *dev)
 	return 0;
 
 err_stop:
-	ipoib_ib_dev_stop(dev, 1);
+	ipoib_ib_dev_stop(dev);
 
 err_disable:
 	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
@@ -157,8 +157,8 @@ static int ipoib_stop(struct net_device *dev)
 
 	netif_stop_queue(dev);
 
-	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev, 0);
+	ipoib_ib_dev_down(dev);
+	ipoib_ib_dev_stop(dev);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2014-08-12 23:38   ` [PATCH 8/8] IPoIB: No longer use flush as a parameter Doug Ledford
@ 2014-08-15 22:08   ` Wendy Cheng
       [not found]     ` <CABgxfbEGfiNGUKT4NJi1GoDRouFznxpgogDt5yr47TLfwDB7hA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-09-03 13:52   ` Or Gerlitz
  9 siblings, 1 reply; 37+ messages in thread
From: Wendy Cheng @ 2014-08-15 22:08 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Cheng, Wendy

On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Locking of multicast joins/leaves in the IPoIB layer have been problematic
> for a while.  There have been recent changes to try and make things better,
> including these changes:
>
> bea1e22 IPoIB: Fix use-after-free of multicast object
> a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects
>
> Unfortunately, the following test still fails (miserably) on a plain
> upstream kernel:
>
> pass=0
> ifdown ib0
> while true; do
>     ifconfig ib0 up
>     ifconfig ib0 down
>     echo "Pass $pass"
>     let pass++
> done
>
> This usually fails within 10 to 20 passes, although I did have a lucky
> run make it to 300 or so.  If you happen to have a P_Key child interface,
> it fails even quicker.
>
[snip]
>
> Doug Ledford (8):
>   IPoIB: Consolidate rtnl_lock tasks in workqueue
>   IPoIB: Make the carrier_on_task race aware
>   IPoIB: fix MCAST_FLAG_BUSY usage
>   IPoIB: fix mcast_dev_flush/mcast_restart_task race
>   IPoIB: change init sequence ordering
>   IPoIB: Use dedicated workqueues per interface
>   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
>   IPoIB: No longer use flush as a parameter
>

IPoIB was recently added as a technology preview for Intel Xeon Phi
(currently a PCIe card), which runs embedded Linux (named MPSS) with
InfiniBand software stacks supported via emulation drivers. One piece
of early feedback from users with large cluster nodes concerns IPoIB's
power consumption. The root cause of the reported issue has more to do
with how MPSS handles its DMA buffers (vs. how the Linux IB stacks
work), so submitting the fix upstream is not planned at this moment
(unless folks are interested in the changes).

However, since this patch set happens to sit at the heart of the
reported power issue, we would like to take a closer look to keep the
MPSS code base from deviating too much from future upstream kernel(s).
Questions, comments, and/or acks will follow sometime next week.

-- Wendy
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue
       [not found]     ` <2394730ce5ae1d46522dca04066293dd842edf16.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-15 22:11       ` Wendy Cheng
  2014-09-04 14:28       ` Erez Shitrit
  1 sibling, 0 replies; 37+ messages in thread
From: Wendy Cheng @ 2014-08-15 22:11 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Setting the mtu can safely be moved to the carrier_on_task, which keeps
> us from needing to take the rtnl lock in the join_finish section.
>

Looks good !

Acked-by: Wendy Cheng <wendy.cheng-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++------
>  1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index d4e005720d0..a0a42859f12 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
>                 spin_unlock_irq(&priv->lock);
>                 priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
>                 set_qkey = 1;
> -
> -               if (!ipoib_cm_admin_enabled(dev)) {
> -                       rtnl_lock();
> -                       dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
> -                       rtnl_unlock();
> -               }
>         }
>
>         if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
> @@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>         }
>
>         rtnl_lock();
> +       if (!ipoib_cm_admin_enabled(priv->dev))
> +               dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>         netif_carrier_on(priv->dev);
>         rtnl_unlock();
>  }
> --
> 1.9.3
>


* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
       [not found]     ` <d05cdce70f1312f35f8be2d14bafd2a06809b137.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-18 23:26       ` Wendy Cheng
       [not found]         ` <CABgxfbE6edfZZ58=mTvhGqWSkCxsik0XuQPR0L-yayze=803cg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-09-04 12:13       ` Erez Shitrit
  1 sibling, 1 reply; 37+ messages in thread
From: Wendy Cheng @ 2014-08-18 23:26 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> We blindly assume that we can just take the rtnl lock and that will
> prevent races with downing this interface.  Unfortunately, that's not
> the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
> in an attempt to clear out all remaining instances of ipoib_join_task.
> But, since this task is put on the same workqueue as the join task, the
> flush_workqueue waits on this thread too.  But this thread is deadlocked
> on the rtnl lock.  The better thing here is to use trylock and loop on
> that until we either get the lock or we see that FLAG_ADMIN_UP has
> been cleared, in which case we don't need to do anything anyway and we
> just return.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index a0a42859f12..7e9cd39b5ef 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>                                                    carrier_on_task);
>         struct ib_port_attr attr;
>
> -       /*
> -        * Take rtnl_lock to avoid racing with ipoib_stop() and
> -        * turning the carrier back on while a device is being
> -        * removed.
> -        */
>         if (ib_query_port(priv->ca, priv->port, &attr) ||
>             attr.state != IB_PORT_ACTIVE) {
>                 ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
>                 return;
>         }
>
> -       rtnl_lock();
> +       /*
> +        * Take rtnl_lock to avoid racing with ipoib_stop() and
> +        * turning the carrier back on while a device is being
> +        * removed.  However, ipoib_stop() will attempt to flush
> +        * the workqueue while holding the rtnl lock, so loop
> +        * on trylock until either we get the lock or we see
> +        * FLAG_ADMIN_UP go away as that signals that we are bailing
> +        * and can safely ignore the carrier on work
> +        */
> +       while (!rtnl_trylock()) {
> +               if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> +                       return;
> +               else
> +                       msleep(20);
> +       }


I've always thought the rtnl lock is too big for this purpose... and
that 20 ms is not ideal either. Could we have a new IPoIB-private mutex
used by ipoib_stop() and this section of code? Something like:

ipoib_stop()
{    .....
     mutex_lock(&something_new);
     clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
     ...
     mutex_unlock(&something_new);
     return 0;
}

Then the loop would become:

// This while-loop will be very short - since we either get the mutex
// quickly or "return" quickly.
 while (!mutex_trylock(&something_new)) {
              if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
                      return;
 }


>         if (!ipoib_cm_admin_enabled(priv->dev))
>                 dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>         netif_carrier_on(priv->dev);


* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
       [not found]     ` <a410e80dc5ca7cfa64229bbbf50c1337317e3bd8.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-19 18:08       ` Wendy Cheng
       [not found]         ` <CABgxfbH-Dt3CpxJKwCAZeHTUyupaA9y_WXVXuxgiPMet26PTQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Wendy Cheng @ 2014-08-19 18:08 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects)
> added a new flag MCAST_JOIN_STARTED, but was not very strict in how it
> was used.  We didn't always initialize the completion struct before we
> set the flag, and we didn't always call complete on the completion
> struct from all paths that complete it.  This made it less than totally
> effective, and certainly made its use confusing.  And in the flush
> function we would use the presence of this flag to signal that we should
> wait on the completion struct, but we never cleared this flag, ever.
> This is further muddied by the fact that we overload the MCAST_FLAG_BUSY
> flag to mean two different things: we have a join in flight, and we have
> succeeded in getting an ib_sa_join_multicast.
>

[snip]

> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 7e9cd39b5ef..f5e8da530d9 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status,
>         struct ipoib_mcast *mcast = multicast->context;
>         struct net_device *dev = mcast->dev;
>
> +       /*
> +        * We have to take the mutex to force mcast_sendonly_join to
> +        * return from ib_sa_multicast_join and set mcast->mc to a
> +        * valid value.  Otherwise we were racing with ourselves in
> +        * that we might fail here, but get a valid return from
> +        * ib_sa_multicast_join after we had cleared mcast->mc here,
> +        * resulting in mis-matched joins and leaves and a deadlock
> +        */
> +       mutex_lock(&mcast_mutex);
> +
>         /* We trap for port events ourselves. */
>         if (status == -ENETRESET)
> -               return 0;
> +               goto out;
>

[snip]


> +out:
> +       clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> +       if (status)
> +               mcast->mc = NULL;
> +       complete(&mcast->done);
> +       if (status == -ENETRESET)
> +               status = 0;
> +       mutex_unlock(&mcast_mutex);
>         return status;
>  }
>

I have not gone through the whole patch yet; however, here is a quick
question.

Would the "return 0 if (status == -ENETRESET)" behavior be wrong in the
original code?  Say, when it returns to process_group_error() without a
proper error return code, it'll skip ib_sa_free_multicast().  Should we
worry about that?

-- Wendy


* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
       [not found]         ` <CABgxfbH-Dt3CpxJKwCAZeHTUyupaA9y_WXVXuxgiPMet26PTQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-19 20:28           ` Doug Ledford
       [not found]             ` <902D5BF2-159A-4B31-A87F-7491F3C8057F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-19 20:28 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

Excuse my top posting, I'm at a meeting and not using my normal mail client.

To answer your question, this is a confusing issue that I had to work through myself.

When we call ib_sa_multicast_join, it will create a record for us and return that.  That record is not valid for us to call ib_sa_multicast_leave until after the join has completed and our callback has acknowledged that join completion with a 0 return status from our callback.

At the time the record is created, ib_sa_multicast_join will also queue up the actual join.  When the join completes, they call our callback (and without my locking fixes, we can get the callback before ib_sa_multicast_join returns our record entry to us).

If our callback is called with the status already set, then the join has failed and will be handled that way when our callback returns.  In this case, the callback is informative for us, that's all.  When we trap -ENETRESET and set our return to 0, we are acknowledging that the join failed prior to our callback, that the join will not be completed, and that mcast->mc as we have recorded it is not valid.

If the status is 0 to start, and we set it to a non-0 value, then that also indicates that the join has failed, but in this case the core code must unroll the join that was completed up until the callback was done (and it does).  Any time this is the case, the member record returned from ib_sa_multicast_join is invalid and we can not call ib_sa_multicast_leave on the record or we will hang forever waiting for a completion that won't be coming.

Only if we get our callback entered with a 0 status and return with a 0 status does the member record that ib_sa_multicast_join returned to us become fully valid.  At this point, we must call ib_sa_multicast_leave on the record if we want it to go away.

So that's why in this patch we

1) take a mutex to force ib_sa_join_multicast to return and us to set mcast->mc to the proper return value before we process the join completion callback
2) always clear mcast->mc if there is any error since we can't call ib_sa_multicast_leave
3) always complete the mcast in case we are waiting on it
4) only if our status is -ENETRESET do we set our return to 0, so the ib core code knows we acknowledged the event

Sent from my iPad

> On Aug 19, 2014, at 11:08 AM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
>> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects)
>> added a new flag MCAST_JOIN_STARTED, but was not very strict in how it
>> was used.  We didn't always initialize the completion struct before we
>> set the flag, and we didn't always call complete on the completion
>> struct from all paths that complete it.  This made it less than totally
>> effective, and certainly made its use confusing.  And in the flush
>> function we would use the presence of this flag to signal that we should
>> wait on the completion struct, but we never cleared this flag, ever.
>> This is further muddied by the fact that we overload the MCAST_FLAG_BUSY
>> flag to mean two different things: we have a join in flight, and we have
>> succeeded in getting an ib_sa_join_multicast.
>> 
> 
> [snip]
> 
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> index 7e9cd39b5ef..f5e8da530d9 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> @@ -271,16 +271,27 @@ ipoib_mcast_sendonly_join_complete(int status,
>>        struct ipoib_mcast *mcast = multicast->context;
>>        struct net_device *dev = mcast->dev;
>> 
>> +       /*
>> +        * We have to take the mutex to force mcast_sendonly_join to
>> +        * return from ib_sa_multicast_join and set mcast->mc to a
>> +        * valid value.  Otherwise we were racing with ourselves in
>> +        * that we might fail here, but get a valid return from
>> +        * ib_sa_multicast_join after we had cleared mcast->mc here,
>> +        * resulting in mis-matched joins and leaves and a deadlock
>> +        */
>> +       mutex_lock(&mcast_mutex);
>> +
>>        /* We trap for port events ourselves. */
>>        if (status == -ENETRESET)
>> -               return 0;
>> +               goto out;
>> 
> 
> [snip]
> 
> 
>> +out:
>> +       clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>> +       if (status)
>> +               mcast->mc = NULL;
>> +       complete(&mcast->done);
>> +       if (status == -ENETRESET)
>> +               status = 0;
>> +       mutex_unlock(&mcast_mutex);
>>        return status;
>> }
>> 
> 
> I have not gone thru the whole patch yet. However, here is something quick ...
> 
> Would this "return 0 if (status == -ENETRESET)" be wrong in the
> original code ?  Say ..... when it returns to process_group_error(),
> without proper error return code, it'll skip ib_sa_free_multicast().
> Should we worry about it ?
> 
> -- Wendy


* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
       [not found]         ` <CABgxfbE6edfZZ58=mTvhGqWSkCxsik0XuQPR0L-yayze=803cg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-19 20:32           ` Doug Ledford
       [not found]             ` <2CC1794B-10CD-49A2-8F5D-0C66A6684DBC-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-19 20:32 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

Whether or not the core network code is OK with us dropping the rtnl lock while manipulating the interface is the issue here.  However, I did consider changing the mcast_mutex to a per-interface lock instead.  There are various optimizations that can be made now that the locking is correct and race free and we don't flush from the workqueue we are flushing, but I didn't go into those changes in this set.  As for the 20ms, I chose that because checkpatch complains about shorter waits and I don't like to busy loop, especially since this is not a performance-sensitive section of code.

Sent from my iPad

> On Aug 18, 2014, at 4:26 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
>> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> We blindly assume that we can just take the rtnl lock and that will
>> prevent races with downing this interface.  Unfortunately, that's not
>> the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
>> in an attempt to clear out all remaining instances of ipoib_join_task.
>> But, since this task is put on the same workqueue as the join task, the
>> flush_workqueue waits on this thread too.  But this thread is deadlocked
>> on the rtnl lock.  The better thing here is to use trylock and loop on
>> that until we either get the lock or we see that FLAG_ADMIN_UP has
>> been cleared, in which case we don't need to do anything anyway and we
>> just return.
>> 
>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> ---
>> drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
>> 1 file changed, 15 insertions(+), 6 deletions(-)
>> 
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> index a0a42859f12..7e9cd39b5ef 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>>                                                   carrier_on_task);
>>        struct ib_port_attr attr;
>> 
>> -       /*
>> -        * Take rtnl_lock to avoid racing with ipoib_stop() and
>> -        * turning the carrier back on while a device is being
>> -        * removed.
>> -        */
>>        if (ib_query_port(priv->ca, priv->port, &attr) ||
>>            attr.state != IB_PORT_ACTIVE) {
>>                ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
>>                return;
>>        }
>> 
>> -       rtnl_lock();
>> +       /*
>> +        * Take rtnl_lock to avoid racing with ipoib_stop() and
>> +        * turning the carrier back on while a device is being
>> +        * removed.  However, ipoib_stop() will attempt to flush
>> +        * the workqueue while holding the rtnl lock, so loop
>> +        * on trylock until either we get the lock or we see
>> +        * FLAG_ADMIN_UP go away as that signals that we are bailing
>> +        * and can safely ignore the carrier on work
>> +        */
>> +       while (!rtnl_trylock()) {
>> +               if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
>> +                       return;
>> +               else
>> +                       msleep(20);
>> +       }
> 
> 
> I always think rtnl lock is too big for this purpose... and that 20 ms
> is not ideal either. Could we have a new IPOIB private mutex used by
> ipoib_stop() and this section of code ? So something like:
> 
> ipoib_stop()
> {    .....
>     mutex_lock(&something_new);
>     clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
>     ...
>     mutex_unlock(&something_new);
>     return 0;
> }
> 
> Then the loop would become:
> 
> // this while-loop will be very short - since we either get the mutex
> quickly or "return" quickly.
> while (!mutex_trylock(&something_new)) {
>              if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
>                      return;
> }
> 
> 
>>        if (!ipoib_cm_admin_enabled(priv->dev))
>>                dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>>        netif_carrier_on(priv->dev);


* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
       [not found]             ` <2CC1794B-10CD-49A2-8F5D-0C66A6684DBC-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-19 22:05               ` Wendy Cheng
  0 siblings, 0 replies; 37+ messages in thread
From: Wendy Cheng @ 2014-08-19 22:05 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Aug 19, 2014 at 1:32 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Whether or not the core network code is OK with us dropping the rtnl lock while manipulating the interface is the issue here.  However, I did consider changing the mcast_mutex to a per interface lock instead.  There are various optimizations that can be made now that the locking is correct and race free and that we don't flush from the workqueue we are flushing.  I didn't go into those changes in this set.  As for 20ms, I did that because checkpatch complains about shorter waits and I don't like to busy loop, especially since it's not like this is a performance sensitive section of code.
>

I'm acking this patch for now. However, I want to remind everyone that
rtnl is a "BIG" global lock that is also used by other net devices such
as Ethernet. It is particularly troublesome for us, since IPoIB needs
to co-exist with Ethernet interfaces that carry a non-trivial amount of
traffic. Imagine everyone trying to bring their interfaces up/down all
together in a large cluster environment that needs to worry about power
consumption. It would be "cool" if we could see less of this lock!

Acked-by: Wendy Cheng <wendy.cheng-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>


* RE: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface
       [not found]     ` <f7af9c251d722675a549e4a673f46c0f31dfa266.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-20 15:01       ` Estrin, Alex
  2014-09-04  6:49       ` Erez Shitrit
  1 sibling, 0 replies; 37+ messages in thread
From: Estrin, Alex @ 2014-08-20 15:01 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A



> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Doug Ledford
> Sent: Tuesday, August 12, 2014 7:38 PM
> To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> Cc: Doug Ledford
> Subject: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface
> 
> During my recent work on the rtnl lock deadlock in the IPoIB driver, I
> saw that even once I fixed the apparent races for a single device, as
> soon as that device had any children, new races popped up.  It turns out
> that this is because no matter how well we protect against races on a
> single device, the fact that all devices use the same workqueue, and
> flush_workqueue() flushes *everything* from that workqueue, we can have
> one device in the middle of a down and holding the rtnl lock and another
> totally unrelated device needing to run mcast_restart_task, which wants
> the rtnl lock and will loop trying to take it unless is sees its own
> FLAG_ADMIN_UP flag go away.  Because the unrelated interface will never
> see its own ADMIN_UP flag drop, the interface going down will deadlock
> trying to flush the queue.  There are several possible solutions to this
> problem:
> 
> Make carrier_on_task and mcast_restart_task try to take the rtnl for
> some set period of time and if they fail, then bail.  This runs the real
> risk of dropping work on the floor, which can end up being its own
> separate kind of deadlock.
> 
> Set some global flag in the driver that says some device is in the
> middle of going down, letting all tasks know to bail.  Again, this can
> drop work on the floor.  I suppose if our own ADMIN_UP flag doesn't go
> away, then maybe after a few tries on the rtnl lock we can queue our own
> task back up as a delayed work and return and avoid dropping work on the
> floor that way.  But I'm not 100% convinced that we won't cause other
> problems.
> 
> Or the method this patch attempts to use, which is when we bring an
> interface up, create a workqueue specifically for that interface, so
> that when we take it back down, we are flushing only those tasks
> associated with our interface.  In addition, keep the global workqueue,
> but now limit it to only flush tasks.  In this way, the flush tasks can
> always flush the device specific work queues without having deadlock
> issues.
> 
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Workqueues per interface is a great idea. Thanks!
Acked-by: Alex Estrin <alex.estrin-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

> ---
>  drivers/infiniband/ulp/ipoib/ipoib.h           |  1 +
>  drivers/infiniband/ulp/ipoib/ipoib_cm.c        | 18 +++++++++---------
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  6 +++---
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      | 19 ++++++++++++-------
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 ++++++++++++--------------
>  drivers/infiniband/ulp/ipoib/ipoib_verbs.c     | 22 +++++++++++++++++++++-
>  6 files changed, 58 insertions(+), 34 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
> index 43840971ad8..7bf7339eaef 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib.h
> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> @@ -317,6 +317,7 @@ struct ipoib_dev_priv {
>  	struct list_head multicast_list;
>  	struct rb_root multicast_tree;
> 
> +	struct workqueue_struct *wq;
>  	struct delayed_work mcast_task;
>  	struct work_struct carrier_on_task;
>  	struct work_struct flush_light;
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index 933efcea0d0..56959adb6c7 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
>  	}
> 
>  	spin_lock_irq(&priv->lock);
> -	queue_delayed_work(ipoib_workqueue,
> +	queue_delayed_work(priv->wq,
>  			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	/* Add this entry to passive ids list head, but do not re-add it
>  	 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
> @@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
>  			spin_lock_irqsave(&priv->lock, flags);
>  			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
>  			ipoib_cm_start_rx_drain(priv);
> -			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
> +			queue_work(priv->wq, &priv->cm.rx_reap_task);
>  			spin_unlock_irqrestore(&priv->lock, flags);
>  		} else
> 			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
> @@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
>  				spin_lock_irqsave(&priv->lock, flags);
>  				list_move(&p->list, &priv->cm.rx_reap_list);
>  				spin_unlock_irqrestore(&priv->lock, flags);
> -				queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
> +				queue_work(priv->wq, &priv->cm.rx_reap_task);
>  			}
>  			return;
>  		}
> @@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
> 
>  		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
>  			list_move(&tx->list, &priv->cm.reap_list);
> -			queue_work(ipoib_workqueue, &priv->cm.reap_task);
> +			queue_work(priv->wq, &priv->cm.reap_task);
>  		}
> 
>  		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
> @@ -1255,7 +1255,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
> 
>  		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
>  			list_move(&tx->list, &priv->cm.reap_list);
> -			queue_work(ipoib_workqueue, &priv->cm.reap_task);
> +			queue_work(priv->wq, &priv->cm.reap_task);
>  		}
> 
>  		spin_unlock_irqrestore(&priv->lock, flags);
> @@ -1284,7 +1284,7 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path
>  	tx->dev = dev;
>  	list_add(&tx->list, &priv->cm.start_list);
>  	set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags);
> -	queue_work(ipoib_workqueue, &priv->cm.start_task);
> +	queue_work(priv->wq, &priv->cm.start_task);
>  	return tx;
>  }
> 
> @@ -1295,7 +1295,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
>  	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
>  		spin_lock_irqsave(&priv->lock, flags);
>  		list_move(&tx->list, &priv->cm.reap_list);
> -		queue_work(ipoib_workqueue, &priv->cm.reap_task);
> +		queue_work(priv->wq, &priv->cm.reap_task);
>  		ipoib_dbg(priv, "Reap connection for gid %pI6\n",
>  			  tx->neigh->daddr + 4);
>  		tx->neigh = NULL;
> @@ -1417,7 +1417,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
> 
>  	skb_queue_tail(&priv->cm.skb_queue, skb);
>  	if (e)
> -		queue_work(ipoib_workqueue, &priv->cm.skb_task);
> +		queue_work(priv->wq, &priv->cm.skb_task);
>  }
> 
>  static void ipoib_cm_rx_reap(struct work_struct *work)
> @@ -1450,7 +1450,7 @@ static void ipoib_cm_stale_task(struct work_struct *work)
>  	}
> 
>  	if (!list_empty(&priv->cm.passive_ids))
> -		queue_delayed_work(ipoib_workqueue,
> +		queue_delayed_work(priv->wq,
>  				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>  	spin_unlock_irq(&priv->lock);
>  }
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 72626c34817..bfd17d41b5f 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -655,7 +655,7 @@ void ipoib_reap_ah(struct work_struct *work)
>  	__ipoib_reap_ah(dev);
> 
>  	if (!test_bit(IPOIB_STOP_REAPER, &priv->flags))
> -		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
> +		queue_delayed_work(priv->wq, &priv->ah_reap_task,
>  				   round_jiffies_relative(HZ));
>  }
> 
> @@ -696,7 +696,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
>  	}
> 
>  	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
> -	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
> +	queue_delayed_work(priv->wq, &priv->ah_reap_task,
>  			   round_jiffies_relative(HZ));
> 
>  	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
> @@ -881,7 +881,7 @@ timeout:
>  	set_bit(IPOIB_STOP_REAPER, &priv->flags);
>  	cancel_delayed_work(&priv->ah_reap_task);
>  	if (flush)
> -		flush_workqueue(ipoib_workqueue);
> +		flush_workqueue(priv->wq);
> 
>  	begin = jiffies;
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 949948a46d4..255c8296566 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -839,7 +839,7 @@ static void ipoib_set_mcast_list(struct net_device *dev)
>  		return;
>  	}
> 
> -	queue_work(ipoib_workqueue, &priv->restart_task);
> +	queue_work(priv->wq, &priv->restart_task);
>  }
> 
>  static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr)
> @@ -954,7 +954,7 @@ static void ipoib_reap_neigh(struct work_struct *work)
>  	__ipoib_reap_neigh(priv);
> 
>  	if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
> -		queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
> +		queue_delayed_work(priv->wq, &priv->neigh_reap_task,
>  				   arp_tbl.gc_interval);
>  }
> 
> @@ -1133,7 +1133,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
> 
>  	/* start garbage collection */
>  	clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
> -	queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
> +	queue_delayed_work(priv->wq, &priv->neigh_reap_task,
>  			   arp_tbl.gc_interval);
> 
>  	return 0;
> @@ -1293,7 +1293,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device
> *ca, int port)
>  	return 0;
> 
>  out_dev_uninit:
> -	ipoib_ib_dev_cleanup();
> +	ipoib_ib_dev_cleanup(dev);
> 
>  out_tx_ring_cleanup:
>  	vfree(priv->tx_ring);
> @@ -1646,7 +1646,7 @@ register_failed:
>  	/* Stop GC if started before flush */
>  	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
>  	cancel_delayed_work(&priv->neigh_reap_task);
> -	flush_workqueue(ipoib_workqueue);
> +	flush_workqueue(priv->wq);
> 
>  event_failed:
>  	ipoib_dev_cleanup(priv->dev);
> @@ -1717,7 +1717,7 @@ static void ipoib_remove_one(struct ib_device *device)
>  		/* Stop GC */
>  		set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
>  		cancel_delayed_work(&priv->neigh_reap_task);
> -		flush_workqueue(ipoib_workqueue);
> +		flush_workqueue(priv->wq);
> 
>  		unregister_netdev(priv->dev);
>  		free_netdev(priv->dev);
> @@ -1758,8 +1758,13 @@ static int __init ipoib_init_module(void)
>  	 * unregister_netdev() and linkwatch_event take the rtnl lock,
>  	 * so flush_scheduled_work() can deadlock during device
>  	 * removal.
> +	 *
> +	 * In addition, bringing one device up and another down at the
> +	 * same time can deadlock a single workqueue, so we have this
> +	 * global fallback workqueue, but we also attempt to open a
> +	 * per device workqueue each time we bring an interface up
>  	 */
> -	ipoib_workqueue = create_singlethread_workqueue("ipoib");
> +	ipoib_workqueue = create_singlethread_workqueue("ipoib_flush");
>  	if (!ipoib_workqueue) {
>  		ret = -ENOMEM;
>  		goto err_fs;
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 19e3fe75ebf..969ef420518 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -388,7 +388,7 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>  	 * the workqueue while holding the rtnl lock, so loop
>  	 * on trylock until either we get the lock or we see
>  	 * FLAG_ADMIN_UP go away as that signals that we are bailing
> -	 * and can safely ignore the carrier on work
> +	 * and can safely ignore the carrier on work.
>  	 */
>  	while (!rtnl_trylock()) {
>  		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> @@ -432,15 +432,14 @@ static int ipoib_mcast_join_complete(int status,
>  	if (!status) {
>  		mcast->backoff = 1;
>  		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -			queue_delayed_work(ipoib_workqueue,
> -					   &priv->mcast_task, 0);
> +			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
> 
>  		/*
> -		 * Defer carrier on work to ipoib_workqueue to avoid a
> +		 * Defer carrier on work to priv->wq to avoid a
>  		 * deadlock on rtnl_lock here.
>  		 */
>  		if (mcast == priv->broadcast)
> -			queue_work(ipoib_workqueue, &priv->carrier_on_task);
> +			queue_work(priv->wq, &priv->carrier_on_task);
>  	} else {
>  		if (mcast->logcount++ < 20) {
>  			if (status == -ETIMEDOUT || status == -EAGAIN) {
> @@ -465,7 +464,7 @@ out:
>  	if (status == -ENETRESET)
>  		status = 0;
>  	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
> +		queue_delayed_work(priv->wq, &priv->mcast_task,
>  				   mcast->backoff * HZ);
>  	spin_unlock_irq(&priv->lock);
>  	mutex_unlock(&mcast_mutex);
> @@ -535,8 +534,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct
> ipoib_mcast *mcast,
>  			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
> 
>  		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -			queue_delayed_work(ipoib_workqueue,
> -					   &priv->mcast_task,
> +			queue_delayed_work(priv->wq, &priv->mcast_task,
>  					   mcast->backoff * HZ);
>  	}
>  	mutex_unlock(&mcast_mutex);
> @@ -584,8 +582,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
>  			ipoib_warn(priv, "failed to allocate broadcast group\n");
>  			mutex_lock(&mcast_mutex);
>  			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -				queue_delayed_work(ipoib_workqueue,
> -						   &priv->mcast_task, HZ);
> +				queue_delayed_work(priv->wq, &priv->mcast_task,
> +						   HZ);
>  			mutex_unlock(&mcast_mutex);
>  			return;
>  		}
> @@ -652,7 +650,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
> 
>  	mutex_lock(&mcast_mutex);
>  	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> -		queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
> +		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>  	mutex_unlock(&mcast_mutex);
> 
>  	return 0;
> @@ -670,7 +668,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
>  	mutex_unlock(&mcast_mutex);
> 
>  	if (flush)
> -		flush_workqueue(ipoib_workqueue);
> +		flush_workqueue(priv->wq);
> 
>  	return 0;
>  }
> @@ -737,7 +735,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct
> sk_buff *skb)
>  		__ipoib_mcast_add(dev, mcast);
>  		list_add_tail(&mcast->list, &priv->multicast_list);
>  		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> -			queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
> +			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>  	}
> 
>  	if (!mcast->ah) {
> @@ -952,7 +950,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>  	 * completes.  So do like the carrier on task and attempt to
>  	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
>  	 * goes away, then just return and know that the remove list will
> -	 * get flushed later by mcast_dev_flush.
> +	 * get flushed later by mcast_stop_thread.
>  	 */
>  	while (!rtnl_trylock()) {
>  		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
> b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
> index c56d5d44c53..b72a753eb41 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
> @@ -145,10 +145,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct
> ib_device *ca)
>  	int ret, size;
>  	int i;
> 
> +	/*
> +	 * the various IPoIB tasks assume they will never race against
> +	 * themselves, so always use a single thread workqueue
> +	 */
> +	priv->wq = create_singlethread_workqueue("ipoib_wq");
> +	if (!priv->wq) {
> +		printk(KERN_WARNING "ipoib: failed to allocate device WQ\n");
> +		return -ENODEV;
> +	}
> +
>  	priv->pd = ib_alloc_pd(priv->ca);
>  	if (IS_ERR(priv->pd)) {
>  		printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name);
> -		return -ENODEV;
> +		goto out_free_wq;
>  	}
> 
>  	priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
> @@ -242,6 +252,10 @@ out_free_mr:
> 
>  out_free_pd:
>  	ib_dealloc_pd(priv->pd);
> +
> +out_free_wq:
> +	destroy_workqueue(priv->wq);
> +	priv->wq = NULL;
>  	return -ENODEV;
>  }
> 
> @@ -270,6 +284,12 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
> 
>  	if (ib_dealloc_pd(priv->pd))
>  		ipoib_warn(priv, "ib_dealloc_pd failed\n");
> +
> +	if (priv->wq) {
> +		flush_workqueue(priv->wq);
> +		destroy_workqueue(priv->wq);
> +		priv->wq = NULL;
> +	}
>  }
> 
>  void ipoib_event(struct ib_event_handler *handler,
> --
> 1.9.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
       [not found]     ` <ae3912431eeacd81d920a405a6bdeb3853791b1a.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-20 15:03       ` Estrin, Alex
  0 siblings, 0 replies; 37+ messages in thread
From: Estrin, Alex @ 2014-08-20 15:03 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A



> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On
> Behalf Of Doug Ledford
> Sent: Tuesday, August 12, 2014 7:38 PM
> To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> Cc: Doug Ledford
> Subject: [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
> 
> We used to pass a flush variable to mcast_stop_thread to indicate
> whether we should flush the workqueue or not.  This was needed because
> some code could end up flushing the very workqueue it was currently
> running on, which is a no-no.  Now that we have per-device workqueues,
> and now that ipoib_mcast_restart_task takes advantage of being queued
> on a single-threaded workqueue alongside all of the
> ipoib_mcast_join_task instances (and therefore has no need to stop the
> join task while it runs), we can do away with the flush parameter and
> unilaterally flush always.
> 
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Alex Estrin <alex.estrin-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

> ---
>  drivers/infiniband/ulp/ipoib/ipoib.h           |  2 +-
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  4 ++--
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++------------
>  3 files changed, 12 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
> b/drivers/infiniband/ulp/ipoib/ipoib.h
> index 7bf7339eaef..2e6c0d91d66 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib.h
> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> @@ -493,7 +493,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct
> sk_buff *skb);
> 
>  void ipoib_mcast_restart_task(struct work_struct *work);
>  int ipoib_mcast_start_thread(struct net_device *dev);
> -int ipoib_mcast_stop_thread(struct net_device *dev, int flush);
> +int ipoib_mcast_stop_thread(struct net_device *dev);
> 
>  void ipoib_mcast_dev_down(struct net_device *dev);
>  void ipoib_mcast_dev_flush(struct net_device *dev);
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index bfd17d41b5f..66096787119 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -747,7 +747,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
>  	clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
>  	netif_carrier_off(dev);
> 
> -	ipoib_mcast_stop_thread(dev, flush);
> +	ipoib_mcast_stop_thread(dev);
>  	ipoib_mcast_dev_flush(dev);
> 
>  	ipoib_flush_paths(dev);
> @@ -1097,7 +1097,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
>  	 */
>  	ipoib_flush_paths(dev);
> 
> -	ipoib_mcast_stop_thread(dev, 1);
> +	ipoib_mcast_stop_thread(dev);
>  	ipoib_mcast_dev_flush(dev);
> 
>  	ipoib_transport_dev_cleanup(dev);
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 969ef420518..8a538c010b9 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -656,7 +656,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
>  	return 0;
>  }
> 
> -int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
> +int ipoib_mcast_stop_thread(struct net_device *dev)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
> 
> @@ -667,8 +667,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
>  	cancel_delayed_work(&priv->mcast_task);
>  	mutex_unlock(&mcast_mutex);
> 
> -	if (flush)
> -		flush_workqueue(priv->wq);
> +	flush_workqueue(priv->wq);
> 
>  	return 0;
>  }
> @@ -846,8 +845,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
> 
>  	ipoib_dbg_mcast(priv, "restarting multicast task\n");
> 
> -	ipoib_mcast_stop_thread(dev, 0);
> -
>  	local_irq_save(flags);
>  	netif_addr_lock(dev);
>  	spin_lock(&priv->lock);
> @@ -944,13 +941,10 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>  	 * We have to cancel outside of the spinlock, but we have to
>  	 * take the rtnl lock or else we race with the removal of
>  	 * entries from the remove list in mcast_dev_flush as part
> -	 * of ipoib_stop() which will call mcast_stop_thread with
> -	 * flush == 1 while holding the rtnl lock, and the
> -	 * flush_workqueue won't complete until this restart_mcast_task
> -	 * completes.  So do like the carrier on task and attempt to
> -	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
> -	 * goes away, then just return and know that the remove list will
> -	 * get flushed later by mcast_stop_thread.
> +	 * of ipoib_stop().  We detect the drop of the ADMIN_UP flag
> +	 * to signal that we have hit this particular race, and we
> +	 * return since we know we don't need to do anything else
> +	 * anyway.
>  	 */
>  	while (!rtnl_trylock()) {
>  		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> @@ -962,6 +956,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>  		ipoib_mcast_leave(mcast->dev, mcast);
>  		ipoib_mcast_free(mcast);
>  	}
> +	/*
> +	 * Restart our join task if needed
> +	 */
>  	ipoib_mcast_start_thread(dev);
>  	rtnl_unlock();
>  }
> --
> 1.9.3
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* RE: [PATCH 8/8] IPoIB: No longer use flush as a parameter
       [not found]     ` <ad7bb2b8da52f187cf2978e6a1c77ead32b60de3.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-20 15:04       ` Estrin, Alex
  0 siblings, 0 replies; 37+ messages in thread
From: Estrin, Alex @ 2014-08-20 15:04 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A



> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On
> Behalf Of Doug Ledford
> Sent: Tuesday, August 12, 2014 7:38 PM
> To: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> Cc: Doug Ledford
> Subject: [PATCH 8/8] IPoIB: No longer use flush as a parameter
> 
> Various places in the IPoIB code had a deadlock related to flushing
> the ipoib workqueue.  Now that we have per-device workqueues and a
> specific flush workqueue, there is no longer a deadlock issue with
> flushing the device-specific workqueues, and we can do so unilaterally.
> 
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Alex Estrin <alex.estrin-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

> ---
>  drivers/infiniband/ulp/ipoib/ipoib.h      |  6 +++---
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c   | 19 +++++++++----------
>  drivers/infiniband/ulp/ipoib/ipoib_main.c |  8 ++++----
>  3 files changed, 16 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h
> b/drivers/infiniband/ulp/ipoib/ipoib.h
> index 2e6c0d91d66..71b18883ffe 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib.h
> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> @@ -478,10 +478,10 @@ void ipoib_ib_dev_flush_heavy(struct work_struct *work);
>  void ipoib_pkey_event(struct work_struct *work);
>  void ipoib_ib_dev_cleanup(struct net_device *dev);
> 
> -int ipoib_ib_dev_open(struct net_device *dev, int flush);
> +int ipoib_ib_dev_open(struct net_device *dev);
>  int ipoib_ib_dev_up(struct net_device *dev);
> -int ipoib_ib_dev_down(struct net_device *dev, int flush);
> -int ipoib_ib_dev_stop(struct net_device *dev, int flush);
> +int ipoib_ib_dev_down(struct net_device *dev);
> +int ipoib_ib_dev_stop(struct net_device *dev);
>  void ipoib_pkey_dev_check_presence(struct net_device *dev);
> 
>  int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 66096787119..fe65abb5150 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -664,7 +664,7 @@ static void ipoib_ib_tx_timer_func(unsigned long ctx)
>  	drain_tx_cq((struct net_device *)ctx);
>  }
> 
> -int ipoib_ib_dev_open(struct net_device *dev, int flush)
> +int ipoib_ib_dev_open(struct net_device *dev)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	int ret;
> @@ -706,7 +706,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
>  dev_stop:
>  	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
>  		napi_enable(&priv->napi);
> -	ipoib_ib_dev_stop(dev, flush);
> +	ipoib_ib_dev_stop(dev);
>  	return -1;
>  }
> 
> @@ -738,7 +738,7 @@ int ipoib_ib_dev_up(struct net_device *dev)
>  	return ipoib_mcast_start_thread(dev);
>  }
> 
> -int ipoib_ib_dev_down(struct net_device *dev, int flush)
> +int ipoib_ib_dev_down(struct net_device *dev)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
> 
> @@ -807,7 +807,7 @@ void ipoib_drain_cq(struct net_device *dev)
>  	local_bh_enable();
>  }
> 
> -int ipoib_ib_dev_stop(struct net_device *dev, int flush)
> +int ipoib_ib_dev_stop(struct net_device *dev)
>  {
>  	struct ipoib_dev_priv *priv = netdev_priv(dev);
>  	struct ib_qp_attr qp_attr;
> @@ -880,8 +880,7 @@ timeout:
>  	/* Wait for all AHs to be reaped */
>  	set_bit(IPOIB_STOP_REAPER, &priv->flags);
>  	cancel_delayed_work(&priv->ah_reap_task);
> -	if (flush)
> -		flush_workqueue(priv->wq);
> +	flush_workqueue(priv->wq);
> 
>  	begin = jiffies;
> 
> @@ -918,7 +917,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device
> *ca, int port)
>  		    (unsigned long) dev);
> 
>  	if (dev->flags & IFF_UP) {
> -		if (ipoib_ib_dev_open(dev, 1)) {
> +		if (ipoib_ib_dev_open(dev)) {
>  			ipoib_transport_dev_cleanup(dev);
>  			return -ENODEV;
>  		}
> @@ -1040,12 +1039,12 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv
> *priv,
>  	}
> 
>  	if (level >= IPOIB_FLUSH_NORMAL)
> -		ipoib_ib_dev_down(dev, 0);
> +		ipoib_ib_dev_down(dev);
> 
>  	if (level == IPOIB_FLUSH_HEAVY) {
>  		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
> -			ipoib_ib_dev_stop(dev, 0);
> -		if (ipoib_ib_dev_open(dev, 0) != 0)
> +			ipoib_ib_dev_stop(dev);
> +		if (ipoib_ib_dev_open(dev) != 0)
>  			return;
>  		if (netif_queue_stopped(dev))
>  			netif_start_queue(dev);
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 255c8296566..4e4f6ec59e1 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -108,7 +108,7 @@ int ipoib_open(struct net_device *dev)
> 
>  	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
> 
> -	if (ipoib_ib_dev_open(dev, 1)) {
> +	if (ipoib_ib_dev_open(dev)) {
>  		if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
>  			return 0;
>  		goto err_disable;
> @@ -139,7 +139,7 @@ int ipoib_open(struct net_device *dev)
>  	return 0;
> 
>  err_stop:
> -	ipoib_ib_dev_stop(dev, 1);
> +	ipoib_ib_dev_stop(dev);
> 
>  err_disable:
>  	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
> @@ -157,8 +157,8 @@ static int ipoib_stop(struct net_device *dev)
> 
>  	netif_stop_queue(dev);
> 
> -	ipoib_ib_dev_down(dev, 1);
> -	ipoib_ib_dev_stop(dev, 0);
> +	ipoib_ib_dev_down(dev);
> +	ipoib_ib_dev_stop(dev);
> 
>  	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
>  		struct ipoib_dev_priv *cpriv;
> --
> 1.9.3
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
       [not found]             ` <902D5BF2-159A-4B31-A87F-7491F3C8057F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-25 19:51               ` Wendy Cheng
       [not found]                 ` <CABgxfbHOD75vLdZ0TtWZbk8ne3kHd_eWObxPHmoJ-D8DjE0bkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Wendy Cheng @ 2014-08-25 19:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> So that's why in this patch we
>
> 1) take a mutex to force ib_sa_join_multicast to return, and us to set mcast->mc to its proper return value, before we process the join completion callback
> 2) always clear mcast->mc if there is any error, since we can't call ib_sa_free_multicast
> 3) always complete the mcast in case we are waiting on it
> 4) only if our status is -ENETRESET do we set our return to 0, so the ib core code knows we acknowledged the event
>

We don't have IPOIB_MCAST_JOIN_STARTED (and the "done" completion
struct) in our code base (MPSS) yet ... I'm *not* nacking this patch,
but I find it hard to understand the ramifications.  That has nothing
to do with this patch itself - the patch actually looks pretty OK (by
eyeball).

The original IPoIB mcast flow, particularly its abnormal error path,
confuses me.  Is it really possible for ib_sa_join_multicast() to
return *after* its callbacks (ipoib_mcast_sendonly_join_complete and
ipoib_mcast_join_complete) have run?  The mcast->done completion
struct looks dangerous as well.

I'll let other, more capable people make the final call(s).

-- Wendy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
       [not found]                 ` <CABgxfbHOD75vLdZ0TtWZbk8ne3kHd_eWObxPHmoJ-D8DjE0bkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-25 20:03                   ` Doug Ledford
       [not found]                     ` <E3EFCBAC-2D6E-49D3-A556-DBD40701CC5F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-08-25 20:03 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A


> On Aug 25, 2014, at 2:51 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
>> On Tue, Aug 19, 2014 at 1:28 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> 
>> So that's why in this patch we
>> 
>> 1) take a mutex to force ib_sa_join_multicast to return, and us to set mcast->mc to its proper return value, before we process the join completion callback
>> 2) always clear mcast->mc if there is any error, since we can't call ib_sa_free_multicast
>> 3) always complete the mcast in case we are waiting on it
>> 4) only if our status is -ENETRESET do we set our return to 0, so the ib core code knows we acknowledged the event
> 
> confuses me. Is it really possible for ib_sa_join_multicast() to
> return *after* its callback (ipoib_mcast_sendonly_join_complete and
> ipoib_mcast_join_complete) ?

Yes.  They are both on workqueues, and ib_sa_join_multicast simply fires off another workqueue task.  The scheduler is free to start that task instantly if the workqueue isn't busy, and it often does (although not necessarily on the same CPU).  Then it is a race to see who finishes first.

> The mcast->done completion struct looks
> dangerous as well.

It *was* dangerous ;-).  My patches cleaned it up.  Before my patches there were failures to complete; then, during an early version of my patch, there were double completes and failures to clear mcast->mc.  Then I figured out the race where the callback can finish before ib_sa_join_multicast returns, added the mutex, and things suddenly started working much better ;-)

> I'll let other capable people to do the final call(s).
> 
> -- Wendy




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage
       [not found]                     ` <E3EFCBAC-2D6E-49D3-A556-DBD40701CC5F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-26 18:04                       ` Wendy Cheng
  0 siblings, 0 replies; 37+ messages in thread
From: Wendy Cheng @ 2014-08-26 18:04 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Mon, Aug 25, 2014 at 1:03 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
>> On Aug 25, 2014, at 2:51 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>> Is it really possible for ib_sa_join_multicast() to
>> return *after* its callback (ipoib_mcast_sendonly_join_complete and
>> ipoib_mcast_join_complete) ?
>
> Yes.  They are both on work queues and ib_sa_join_multicast simply fires off another workqueue task.  The scheduler is free to start that task instantly if the workqueue isn't busy, and it often does (although not necessarily on the same CPU).  Then it is a race to see who finishes first.
>

Ok, thanks for the explanation.  I also googled and found the original
patch where IPOIB_MCAST_JOIN_STARTED was added.  This patch now makes
sense.

Acked-by: Wendy Cheng <wendy.cheng-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On the other hand, I'm still puzzled why ib_sa_join_multicast() can't
be a blocking call (i.e. wait until the callback has executed) - why
should IPoIB pay the price of working around these nasty issues?  But
I guess that is too far off-topic ..

BTW, thanks for the work.  Our users will be doing if-up/if-down a lot
for power management; patches like these help!

-- Wendy

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race
       [not found]     ` <e1dbcfc25d8930b281aad12699ebf8fa82485b0e.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-08-29 21:53       ` Wendy Cheng
       [not found]         ` <CABgxfbHDuUrdHuLJT2oD07Cy3Ys4_rj-bJ6eR=9+uv0CuPH7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Wendy Cheng @ 2014-08-29 21:53 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Our mcast_dev_flush routine and our mcast_restart_task can race against
> each other.  In particular, they both hold the priv->lock while
> manipulating the rbtree and while removing mcast entries from the
> multicast_list and while adding entries to the remove_list, but they
> also both drop their locks prior to doing the actual removes.  The
> mcast_dev_flush routine is run entirely under the rtnl lock and so has
> at least some locking.  The actual race condition is like this:
>
> Thread 1                                Thread 2
> ifconfig ib0 up
>   start multicast join for broadcast
>   multicast join completes for broadcast
>   start to add more multicast joins
>     call mcast_restart_task to add new entries
>                                         ifconfig ib0 down
>                                           mcast_dev_flush
>                                             mcast_leave(mcast A)
>     mcast_leave(mcast A)
>
> As mcast_leave calls ib_sa_free_multicast, and as the mcast member in
> core/multicast.c is ref counted, we run into an unbalanced refcount
> issue.  To avoid stomping on each other's removes, take the rtnl lock
> specifically when we are deleting the entries from the remove list.

Isn't "test_and_clear_bit()" atomic, so it is unlikely that
ib_sa_free_multicast() can run multiple times?

    638 static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
    639 {
    640         struct ipoib_dev_priv *priv = netdev_priv(dev);
    641         int ret = 0;
    642
    643         if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
    644                 ib_sa_free_multicast(mcast->mc);
    645
    646         if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {

-- Wendy

>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index f5e8da530d9..19e3fe75ebf 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
>
>         spin_unlock_irqrestore(&priv->lock, flags);
>
> -       /* seperate between the wait to the leave*/
> +       /*
> +        * make sure the in-flight joins have finished before we attempt
> +        * to leave
> +        */
>         list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
>                 if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
>                         wait_for_completion(&mcast->done);
> @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>         netif_addr_unlock(dev);
>         local_irq_restore(flags);
>
> -       /* We have to cancel outside of the spinlock */
> +       /*
> +        * make sure the in-flight joins have finished before we attempt
> +        * to leave
> +        */
> +       list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
> +               if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> +                       wait_for_completion(&mcast->done);
> +
> +       /*
> +        * We have to cancel outside of the spinlock, but we have to
> +        * take the rtnl lock or else we race with the removal of
> +        * entries from the remove list in mcast_dev_flush as part
> +        * of ipoib_stop() which will call mcast_stop_thread with
> +        * flush == 1 while holding the rtnl lock, and the
> +        * flush_workqueue won't complete until this restart_mcast_task
> +        * completes.  So do like the carrier on task and attempt to
> +        * take the rtnl lock, but if we can't before the ADMIN_UP flag
> +        * goes away, then just return and know that the remove list will
> +        * get flushed later by mcast_dev_flush.
> +        */
> +       while (!rtnl_trylock()) {
> +               if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> +                       return;
> +               else
> +                       msleep(20);
> +       }
>         list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
>                 ipoib_mcast_leave(mcast->dev, mcast);
>                 ipoib_mcast_free(mcast);
>         }
> -
> -       if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> -               ipoib_mcast_start_thread(dev);
> +       ipoib_mcast_start_thread(dev);
> +       rtnl_unlock();
>  }
>
>  #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
> --
> 1.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race
       [not found]         ` <CABgxfbHDuUrdHuLJT2oD07Cy3Ys4_rj-bJ6eR=9+uv0CuPH7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-08-30 15:39           ` Wendy Cheng
       [not found]             ` <CABgxfbGJOdmAn1sokEtisDdnA=r_4mfP=PfqZVsP0cd_oL50dA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-09-03 17:49           ` Doug Ledford
  1 sibling, 1 reply; 37+ messages in thread
From: Wendy Cheng @ 2014-08-30 15:39 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> Our mcast_dev_flush routine and our mcast_restart_task can race against
>> each other.  In particular, they both hold the priv->lock while
>> manipulating the rbtree and while removing mcast entries from the
>> multicast_list and while adding entries to the remove_list, but they
>> also both drop their locks prior to doing the actual removes.  The
>> mcast_dev_flush routine is run entirely under the rtnl lock and so has
>> at least some locking.  The actual race condition is like this:
>>
>> Thread 1                                Thread 2
>> ifconfig ib0 up
>>   start multicast join for broadcast
>>   multicast join completes for broadcast
>>   start to add more multicast joins
>>     call mcast_restart_task to add new entries
>>                                         ifconfig ib0 down
>>                                           mcast_dev_flush
>>                                             mcast_leave(mcast A)
>>     mcast_leave(mcast A)
>>
>> As mcast_leave calls ib_sa_multicast_leave, and as the member in
>> core/multicast.c is ref-counted, we run into an unbalanced refcount
>> issue.  To avoid stomping on each other's removes, take the rtnl lock
>> specifically when we are deleting the entries from the remove list.
>
> Isn't "test_and_clear_bit()" atomic, making it unlikely that
> ib_sa_free_multicast() can run multiple times?

Oops... what if the structure itself gets freed? My bad!

However, isn't the remove_list a local list on the caller's stack?
And since the moving of the original list entries (to remove_list) is
protected by the spin lock (priv->lock), isn't it unlikely that
ib_sa_free_multicast() can operate on the same entry?

The patch itself is harmless, but adding the rtnl_lock is really not
ideal.

-- Wendy


* Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
       [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2014-08-15 22:08   ` [PATCH 0/8] IPoIB: Fix multiple race conditions Wendy Cheng
@ 2014-09-03 13:52   ` Or Gerlitz
       [not found]     ` <54071D14.9040404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  9 siblings, 1 reply; 37+ messages in thread
From: Or Gerlitz @ 2014-09-03 13:52 UTC (permalink / raw)
  To: Doug Ledford, Erez Shitrit, Saeed Mahameed
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Yevgeny Petrilin

On 8/13/2014 2:38 AM, Doug Ledford wrote:
> Locking of multicast joins/leaves in the IPoIB layer have been problematic
> for a while.  There have been recent changes to try and make things better,
> including these changes:
>
> bea1e22 IPoIB: Fix use-after-free of multicast object
> a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects
>
> Unfortunately, the following test still fails (miserably) on a plain
> upstream kernel:
>
> pass=0
> ifdown ib0
> while true; do
>      ifconfig ib0 up
>      ifconfig ib0 down
>      echo "Pass $pass"
>      let pass++
> done
>
> This usually fails within 10 to 20 passes, although I did have a lucky
> run make it to 300 or so.  If you happen to have a P_Key child interface,
> it fails even quicker.

Hi Doug,

Thanks for looking into that. I wasn't aware we were doing so badly... I
checked here, and the plan is that Erez Shitrit from Mellanox will also
be providing feedback on the series.

Anyway, just to make sure we're on the same page: people agree that
picking up such a series is too heavy for post-merge-window, right? So
we are talking about 3.18, agreed?

Or.

>
> In tracking down the rtnl deadlock in the multicast code, I ended up
> re-designing the flag usage and clean up just the race conditions in
> the various tasks.  Even that wasn't enough to resolve the issue entirely
> though, as if you ran the test above on multiple interfaces simultaneously,
> it could still deadlock.  So in the end I re-did the workqueue usage too
> so that we now use a workqueue per device (and that includes P_Key devices
> have dedicated workqueues) as well as one global workqueue that does
> nothing but flush tasks.  This turns out to be a much more elegant way
> of handling the workqueues and in fact enabled me to remove all of the
> klunky passing around of flush parameters to tell various functions not
> to flush the workqueue if it would deadlock the workqueue we are running
> from.
>
> Here is my test setup:
>
> 2 InfiniBand physical fabrics: ib0 and ib1
> 2 P_Keys on each fabric: default and 0x8002 on ib0 and 0x8003 on ib1
> 4 total IPoIB devices that I have named mlx4_ib0, mlx4_ib0.8002,
>    mlx4_ib1, and mlx4_ib1.8003
>
> In order to test my final patch set, I logged into my test machine on
> four different virtual terminals, I used ifdown on all of the above
> interfaces to get things in a consistent state, and then I ran the above
> loop, one per terminal per interface simultaneously.
>
> It's worth noting here that when you ifconfig a base interface up, it
> automatically brings up the child interface too, so the ifconfig mlx4_ib0
> up is in fact racing with both ups and downs of mlx4_ib0.8002.  The same
> is true for the mlx4_ib1 interface and its child.  With my patch set in
> place, these loops are currently running without a problem and have passed
> 15,000 up/down cycles per interface.
>
> Doug Ledford (8):
>    IPoIB: Consolidate rtnl_lock tasks in workqueue
>    IPoIB: Make the carrier_on_task race aware
>    IPoIB: fix MCAST_FLAG_BUSY usage
>    IPoIB: fix mcast_dev_flush/mcast_restart_task race
>    IPoIB: change init sequence ordering
>    IPoIB: Use dedicated workqueues per interface
>    IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
>    IPoIB: No longer use flush as a parameter
>
>   drivers/infiniband/ulp/ipoib/ipoib.h           |  19 +-
>   drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
>   drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  27 ++-
>   drivers/infiniband/ulp/ipoib/ipoib_main.c      |  49 +++--
>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 239 ++++++++++++++++---------
>   drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  22 ++-
>   6 files changed, 240 insertions(+), 134 deletions(-)
>



* Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
       [not found]     ` <CABgxfbEGfiNGUKT4NJi1GoDRouFznxpgogDt5yr47TLfwDB7hA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-09-03 16:26       ` Wendy Cheng
  0 siblings, 0 replies; 37+ messages in thread
From: Wendy Cheng @ 2014-09-03 16:26 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Cheng, Wendy

On Fri, Aug 15, 2014 at 3:08 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
[snip]
>>
>> Doug Ledford (8):
>>   IPoIB: Consolidate rtnl_lock tasks in workqueue
>>   IPoIB: Make the carrier_on_task race aware
>>   IPoIB: fix MCAST_FLAG_BUSY usage
>>   IPoIB: fix mcast_dev_flush/mcast_restart_task race
>>   IPoIB: change init sequence ordering
>>   IPoIB: Use dedicated workqueues per interface
>>   IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
>>   IPoIB: No longer use flush as a parameter
>>
>
> IPoIB was recently added as a technology preview for Intel Xeon Phi
> (currently a PCIe card), which runs embedded Linux (named MPSS) with
> InfiniBand software stacks supported via emulation drivers. One early
> piece of feedback from users with large cluster nodes concerns IPoIB's
> power consumption. The root cause of the reported issue has more to do
> with how MPSS handles its DMA buffers (vs. how the Linux IB stacks
> work), so submitting the fix upstream is not planned at this moment
> (unless folks are interested in the changes).
>
> However, since this patch set happens to be in the heart of the
> reported power issue, we would like to take a closer look to avoid
> MPSS code base deviating too much from future upstream kernel(s).
> Question, comment, and/or ack will follow sometime next week.
>

I've reviewed the patch set; the first half of the patches looks good.
Patches #5, #6, #7, and #8 are fine if we go for one WQ per device; I
will let others make the final call.

On our system (OFED 1.5.4 based), similar deadlocks were also observed
while the power management issues were being worked on. Constrained by
other issues specific to our platform, I took advantage of the single
IPoIB workqueue by queuing the if-up(s) and/or if-down(s) onto the work
queue if one was already in progress, which serialized the logic by
default. However, I would not mind the one-WQ-per-device approach and
will re-make the changes when this patch set is picked up by the
mainline kernel.

-- Wendy


* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race
       [not found]         ` <CABgxfbHDuUrdHuLJT2oD07Cy3Ys4_rj-bJ6eR=9+uv0CuPH7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2014-08-30 15:39           ` Wendy Cheng
@ 2014-09-03 17:49           ` Doug Ledford
  1 sibling, 0 replies; 37+ messages in thread
From: Doug Ledford @ 2014-09-03 17:49 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Fri, 2014-08-29 at 14:53 -0700, Wendy Cheng wrote:
> On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > Our mcast_dev_flush routine and our mcast_restart_task can race against
> > each other.  In particular, they both hold the priv->lock while
> > manipulating the rbtree and while removing mcast entries from the
> > multicast_list and while adding entries to the remove_list, but they
> > also both drop their locks prior to doing the actual removes.  The
> > mcast_dev_flush routine is run entirely under the rtnl lock and so has
> > at least some locking.  The actual race condition is like this:
> >
> > Thread 1                                Thread 2
> > ifconfig ib0 up
> >   start multicast join for broadcast
> >   multicast join completes for broadcast
> >   start to add more multicast joins
> >     call mcast_restart_task to add new entries
> >                                         ifconfig ib0 down
> >                                           mcast_dev_flush
> >                                             mcast_leave(mcast A)
> >     mcast_leave(mcast A)
> >
> > As mcast_leave calls ib_sa_multicast_leave, and as the member in
> > core/multicast.c is ref-counted, we run into an unbalanced refcount
> > issue.  To avoid stomping on each other's removes, take the rtnl lock
> > specifically when we are deleting the entries from the remove list.
> 
> Isn't "test_and_clear_bit()" atomic, making it unlikely that
> ib_sa_free_multicast() can run multiple times?
> 
>     static int ipoib_mcast_leave(struct net_device *dev,
>                                  struct ipoib_mcast *mcast)
>     {
>             struct ipoib_dev_priv *priv = netdev_priv(dev);
>             int ret = 0;
> 
>             if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
>                     ib_sa_free_multicast(mcast->mc);
> 
>             if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {

This is the original code, and you would be right, but it had other
problems.  Even though test_and_clear_bit is atomic, the original usage
of MCAST_FLAG_BUSY really meant only that the mcast join had started;
there was no guarantee it had completed (we would set the flag, call
ib_sa_multicast_join, and could then race with our leave, where it was
possible to leave prior to the join completing).

My patches change this around now, and the code only checks for
MCAST_FLAG_BUSY as a double check that we aren't leaving an in-process
join.

> 
> -- Wendy
> 
> >
> > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++++++++++++++++++++++----
> >  1 file changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > index f5e8da530d9..19e3fe75ebf 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > @@ -810,7 +810,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
> >
> >         spin_unlock_irqrestore(&priv->lock, flags);
> >
> > -       /* seperate between the wait to the leave*/
> > +       /*
> > +        * make sure the in-flight joins have finished before we attempt
> > +        * to leave
> > +        */
> >         list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
> >                 if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> >                         wait_for_completion(&mcast->done);
> > @@ -931,14 +934,38 @@ void ipoib_mcast_restart_task(struct work_struct *work)
> >         netif_addr_unlock(dev);
> >         local_irq_restore(flags);
> >
> > -       /* We have to cancel outside of the spinlock */
> > +       /*
> > +        * make sure the in-flight joins have finished before we attempt
> > +        * to leave
> > +        */
> > +       list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
> > +               if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> > +                       wait_for_completion(&mcast->done);
> > +
> > +       /*
> > +        * We have to cancel outside of the spinlock, but we have to
> > +        * take the rtnl lock or else we race with the removal of
> > +        * entries from the remove list in mcast_dev_flush as part
> > +        * of ipoib_stop() which will call mcast_stop_thread with
> > +        * flush == 1 while holding the rtnl lock, and the
> > +        * flush_workqueue won't complete until this restart_mcast_task
> > +        * completes.  So do like the carrier on task and attempt to
> > +        * take the rtnl lock, but if we can't before the ADMIN_UP flag
> > +        * goes away, then just return and know that the remove list will
> > +        * get flushed later by mcast_dev_flush.
> > +        */
> > +       while (!rtnl_trylock()) {
> > +               if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> > +                       return;
> > +               else
> > +                       msleep(20);
> > +       }
> >         list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
> >                 ipoib_mcast_leave(mcast->dev, mcast);
> >                 ipoib_mcast_free(mcast);
> >         }
> > -
> > -       if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> > -               ipoib_mcast_start_thread(dev);
> > +       ipoib_mcast_start_thread(dev);
> > +       rtnl_unlock();
> >  }
> >
> >  #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
> > --
> > 1.9.3
> >


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD





* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race
       [not found]             ` <CABgxfbGJOdmAn1sokEtisDdnA=r_4mfP=PfqZVsP0cd_oL50dA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-09-03 18:06               ` Doug Ledford
       [not found]                 ` <1409767614.26762.7.camel-v+aXH1h/sVwpzh8Nc7Vzg+562jBIR2Zt@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-09-03 18:06 UTC (permalink / raw)
  To: Wendy Cheng
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Sat, 2014-08-30 at 08:39 -0700, Wendy Cheng wrote:
> On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> Our mcast_dev_flush routine and our mcast_restart_task can race against
> >> each other.  In particular, they both hold the priv->lock while
> >> manipulating the rbtree and while removing mcast entries from the
> >> multicast_list and while adding entries to the remove_list, but they
> >> also both drop their locks prior to doing the actual removes.  The
> >> mcast_dev_flush routine is run entirely under the rtnl lock and so has
> >> at least some locking.  The actual race condition is like this:
> >>
> >> Thread 1                                Thread 2
> >> ifconfig ib0 up
> >>   start multicast join for broadcast
> >>   multicast join completes for broadcast
> >>   start to add more multicast joins
> >>     call mcast_restart_task to add new entries
> >>                                         ifconfig ib0 down
> >>                                           mcast_dev_flush
> >>                                             mcast_leave(mcast A)
> >>     mcast_leave(mcast A)
> >>
> >> As mcast_leave calls ib_sa_multicast_leave, and as the member in
> >> core/multicast.c is ref-counted, we run into an unbalanced refcount
> >> issue.  To avoid stomping on each other's removes, take the rtnl lock
> >> specifically when we are deleting the entries from the remove list.
> >
> > Isn't "test_and_clear_bit()" atomic, making it unlikely that
> > ib_sa_free_multicast() can run multiple times?
> 
> Oops... what if the structure itself gets freed? My bad!

Well, just like in the last email, the code you are referring to is the
original code, which had other issues.  After my patches it no longer
looks like that.

> However, isn't the remove_list a local list on the caller's stack?
> And since the moving of the original list entries (to remove_list) is
> protected by the spin lock (priv->lock), isn't it unlikely that
> ib_sa_free_multicast() can operate on the same entry?

Yes, you're right.  I had it in my mind that the remove_list was part of
the ipoib private dev, not local on the stack.  So you are right that we
could probably get away with removing the rtnl lock there (although it
would need to be in a later patch than the one you are reviewing here;
in this patch there would still be a race between the restart task and
the downing of the interface, because they all still share the same work
queue, but once we switch to the per-device work queues in patch #6,
this can happen in parallel safely with the flush task, I think).

> The patch itself is harmless, but adding the rtnl_lock is really not
> ideal.
> 
> -- Wendy


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD





* Re: [PATCH 0/8] IPoIB: Fix multiple race conditions
       [not found]     ` <54071D14.9040404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2014-09-03 18:12       ` Doug Ledford
  0 siblings, 0 replies; 37+ messages in thread
From: Doug Ledford @ 2014-09-03 18:12 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Erez Shitrit, Saeed Mahameed, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A, Yevgeny Petrilin

On Wed, 2014-09-03 at 16:52 +0300, Or Gerlitz wrote:
> On 8/13/2014 2:38 AM, Doug Ledford wrote:
> > Locking of multicast joins/leaves in the IPoIB layer have been problematic
> > for a while.  There have been recent changes to try and make things better,
> > including these changes:
> >
> > bea1e22 IPoIB: Fix use-after-free of multicast object
> > a9c8ba5 IPoIB: Fix usage of uninitialized multicast objects
> >
> > Unfortunately, the following test still fails (miserably) on a plain
> > upstream kernel:
> >
> > pass=0
> > ifdown ib0
> > while true; do
> >      ifconfig ib0 up
> >      ifconfig ib0 down
> >      echo "Pass $pass"
> >      let pass++
> > done
> >
> > This usually fails within 10 to 20 passes, although I did have a lucky
> > run make it to 300 or so.  If you happen to have a P_Key child interface,
> > it fails even quicker.
> 
> Hi Doug,
> 
> Thanks for looking into that. I wasn't aware we were doing so badly... I
> checked here, and the plan is that Erez Shitrit from Mellanox will also
> be providing feedback on the series.
> 
> Anyway, just to make sure we're on the same page: people agree that
> picking up such a series is too heavy for post-merge-window, right? So
> we are talking about 3.18, agreed?

Given how easy the problem is to reproduce, and how this patch set
positively solves the reproducer, I would have preferred 3.17, and I had
it in to Roland before the merge window closed, but Roland chose to put
it off.

/me boggles.

> Or.
> 
> >
> > In tracking down the rtnl deadlock in the multicast code, I ended up
> > re-designing the flag usage and clean up just the race conditions in
> > the various tasks.  Even that wasn't enough to resolve the issue entirely
> > though, as if you ran the test above on multiple interfaces simultaneously,
> > it could still deadlock.  So in the end I re-did the workqueue usage too
> > so that we now use a workqueue per device (and that includes P_Key devices
> > have dedicated workqueues) as well as one global workqueue that does
> > nothing but flush tasks.  This turns out to be a much more elegant way
> > of handling the workqueues and in fact enabled me to remove all of the
> > klunky passing around of flush parameters to tell various functions not
> > to flush the workqueue if it would deadlock the workqueue we are running
> > from.
> >
> > Here is my test setup:
> >
> > 2 InfiniBand physical fabrics: ib0 and ib1
> > 2 P_Keys on each fabric: default and 0x8002 on ib0 and 0x8003 on ib1
> > 4 total IPoIB devices that I have named mlx4_ib0, mlx4_ib0.8002,
> >    mlx4_ib1, and mlx4_ib1.8003
> >
> > In order to test my final patch set, I logged into my test machine on
> > four different virtual terminals, I used ifdown on all of the above
> > interfaces to get things in a consistent state, and then I ran the above
> > loop, one per terminal per interface simultaneously.
> >
> > It's worth noting here that when you ifconfig a base interface up, it
> > automatically brings up the child interface too, so the ifconfig mlx4_ib0
> > up is in fact racing with both ups and downs of mlx4_ib0.8002.  The same
> > is true for the mlx4_ib1 interface and its child.  With my patch set in
> > place, these loops are currently running without a problem and have passed
> > 15,000 up/down cycles per interface.
> >
> > Doug Ledford (8):
> >    IPoIB: Consolidate rtnl_lock tasks in workqueue
> >    IPoIB: Make the carrier_on_task race aware
> >    IPoIB: fix MCAST_FLAG_BUSY usage
> >    IPoIB: fix mcast_dev_flush/mcast_restart_task race
> >    IPoIB: change init sequence ordering
> >    IPoIB: Use dedicated workqueues per interface
> >    IPoIB: Make ipoib_mcast_stop_thread flush the workqueue
> >    IPoIB: No longer use flush as a parameter
> >
> >   drivers/infiniband/ulp/ipoib/ipoib.h           |  19 +-
> >   drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
> >   drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  27 ++-
> >   drivers/infiniband/ulp/ipoib/ipoib_main.c      |  49 +++--
> >   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 239 ++++++++++++++++---------
> >   drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  22 ++-
> >   6 files changed, 240 insertions(+), 134 deletions(-)
> >
> 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD





* Re: [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race
       [not found]                 ` <1409767614.26762.7.camel-v+aXH1h/sVwpzh8Nc7Vzg+562jBIR2Zt@public.gmane.org>
@ 2014-09-03 19:45                   ` Wendy Cheng
  0 siblings, 0 replies; 37+ messages in thread
From: Wendy Cheng @ 2014-09-03 19:45 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A

On Wed, Sep 3, 2014 at 11:06 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Sat, 2014-08-30 at 08:39 -0700, Wendy Cheng wrote:
>> On Fri, Aug 29, 2014 at 2:53 PM, Wendy Cheng <s.wendy.cheng-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> > On Tue, Aug 12, 2014 at 4:38 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> >> Our mcast_dev_flush routine and our mcast_restart_task can race against
>> >> each other.  In particular, they both hold the priv->lock while
>> >> manipulating the rbtree and while removing mcast entries from the
>> >> multicast_list and while adding entries to the remove_list, but they
>> >> also both drop their locks prior to doing the actual removes.  The
>> >> mcast_dev_flush routine is run entirely under the rtnl lock and so has
>> >> at least some locking.  The actual race condition is like this:
>> >>
>> >> Thread 1                                Thread 2
>> >> ifconfig ib0 up
>> >>   start multicast join for broadcast
>> >>   multicast join completes for broadcast
>> >>   start to add more multicast joins
>> >>     call mcast_restart_task to add new entries
>> >>                                         ifconfig ib0 down
>> >>                                           mcast_dev_flush
>> >>                                             mcast_leave(mcast A)
>> >>     mcast_leave(mcast A)
>> >>
>> >> As mcast_leave calls ib_sa_multicast_leave, and as the member in
>> >> core/multicast.c is ref-counted, we run into an unbalanced refcount
>> >> issue.  To avoid stomping on each other's removes, take the rtnl lock
>> >> specifically when we are deleting the entries from the remove list.
>> >

[snip] ..

>> However, isn't the remove_list a local list on the caller's stack?
>> And since the moving of the original list entries (to remove_list) is
>> protected by the spin lock (priv->lock), isn't it unlikely that
>> ib_sa_free_multicast() can operate on the same entry?
>
> Yes, you're right.  I had it in my mind that the remove_list was part of
> the ipoib private dev, not local on the stack.  So you are right that we
> could probably get away with removing the rtnl lock there (although it
> would need to be in a later patch than the one you are reviewing
> here...here there would still be a race between the restart task and the
> downing of the interface because they all still share the same work
> queue, but once we switch to the per device work queues in patch #6,
> this can happen in parallel safely with the flush task I think).
>

Yep, that local "remove_list" is easy to miss.

BTW, you might want to modify the text description of this patch to
make your point clear.

-- Wendy
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface
       [not found]     ` <f7af9c251d722675a549e4a673f46c0f31dfa266.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-20 15:01       ` Estrin, Alex
@ 2014-09-04  6:49       ` Erez Shitrit
       [not found]         ` <54080B6B.8050707-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 37+ messages in thread
From: Erez Shitrit @ 2014-09-04  6:49 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A

Hi Doug,

The idea that the "big" ipoib workqueue is only for the IB_EVENT_XXX
events looks good to me.

One comment here: I don't see where the driver flushes it at all.
IMHO, when the driver goes down you should cancel all pending tasks on
that global wq prior to the call to destroy_workqueue.
The current logic calls destroy_workqueue after remove_one; I
think you should cancel the pending work after the
ib_unregister_event_handler call in the ipoib_remove_one function,
otherwise any tasks still pending in that queue will be scheduled
after the dev/priv are gone.

Thanks, Erez
> During my recent work on the rtnl lock deadlock in the IPoIB driver, I
> saw that even once I fixed the apparent races for a single device, as
> soon as that device had any children, new races popped up.  It turns out
> that this is because, no matter how well we protect against races on a
> single device, all devices use the same workqueue, and
> flush_workqueue() flushes *everything* from that workqueue.  So we can
> have one device in the middle of a down and holding the rtnl lock while
> another totally unrelated device needs to run mcast_restart_task, which
> wants the rtnl lock and will loop trying to take it unless it sees its own
> FLAG_ADMIN_UP flag go away.  Because the unrelated interface will never
> see its own ADMIN_UP flag drop, the interface going down will deadlock
> trying to flush the queue.  There are several possible solutions to this
> problem:
>
> Make carrier_on_task and mcast_restart_task try to take the rtnl for
> some set period of time and if they fail, then bail.  This runs the real
> risk of dropping work on the floor, which can end up being its own
> separate kind of deadlock.
>
> Set some global flag in the driver that says some device is in the
> middle of going down, letting all tasks know to bail.  Again, this can
> drop work on the floor.  I suppose if our own ADMIN_UP flag doesn't go
> away, then maybe after a few tries on the rtnl lock we can queue our own
> task back up as a delayed work and return and avoid dropping work on the
> floor that way.  But I'm not 100% convinced that we won't cause other
> problems.
>
> Or use the method this patch attempts: when we bring an
> interface up, create a workqueue specifically for that interface, so
> that when we take it back down, we are flushing only those tasks
> associated with our interface.  In addition, keep the global workqueue,
> but now limit it to only flush tasks.  In this way, the flush tasks can
> always flush the device specific work queues without having deadlock
> issues.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>   drivers/infiniband/ulp/ipoib/ipoib.h           |  1 +
>   drivers/infiniband/ulp/ipoib/ipoib_cm.c        | 18 +++++++++---------
>   drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  6 +++---
>   drivers/infiniband/ulp/ipoib/ipoib_main.c      | 19 ++++++++++++-------
>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 ++++++++++++--------------
>   drivers/infiniband/ulp/ipoib/ipoib_verbs.c     | 22 +++++++++++++++++++++-
>   6 files changed, 58 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
> index 43840971ad8..7bf7339eaef 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib.h
> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> @@ -317,6 +317,7 @@ struct ipoib_dev_priv {
>   	struct list_head multicast_list;
>   	struct rb_root multicast_tree;
>   
> +	struct workqueue_struct *wq;
>   	struct delayed_work mcast_task;
>   	struct work_struct carrier_on_task;
>   	struct work_struct flush_light;
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index 933efcea0d0..56959adb6c7 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
>   	}
>   
>   	spin_lock_irq(&priv->lock);
> -	queue_delayed_work(ipoib_workqueue,
> +	queue_delayed_work(priv->wq,
>   			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>   	/* Add this entry to passive ids list head, but do not re-add it
>   	 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
> @@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
>   			spin_lock_irqsave(&priv->lock, flags);
>   			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
>   			ipoib_cm_start_rx_drain(priv);
> -			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
> +			queue_work(priv->wq, &priv->cm.rx_reap_task);
>   			spin_unlock_irqrestore(&priv->lock, flags);
>   		} else
>   			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
> @@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
>   				spin_lock_irqsave(&priv->lock, flags);
>   				list_move(&p->list, &priv->cm.rx_reap_list);
>   				spin_unlock_irqrestore(&priv->lock, flags);
> -				queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
> +				queue_work(priv->wq, &priv->cm.rx_reap_task);
>   			}
>   			return;
>   		}
> @@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
>   
>   		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
>   			list_move(&tx->list, &priv->cm.reap_list);
> -			queue_work(ipoib_workqueue, &priv->cm.reap_task);
> +			queue_work(priv->wq, &priv->cm.reap_task);
>   		}
>   
>   		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
> @@ -1255,7 +1255,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
>   
>   		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
>   			list_move(&tx->list, &priv->cm.reap_list);
> -			queue_work(ipoib_workqueue, &priv->cm.reap_task);
> +			queue_work(priv->wq, &priv->cm.reap_task);
>   		}
>   
>   		spin_unlock_irqrestore(&priv->lock, flags);
> @@ -1284,7 +1284,7 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path
>   	tx->dev = dev;
>   	list_add(&tx->list, &priv->cm.start_list);
>   	set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags);
> -	queue_work(ipoib_workqueue, &priv->cm.start_task);
> +	queue_work(priv->wq, &priv->cm.start_task);
>   	return tx;
>   }
>   
> @@ -1295,7 +1295,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
>   	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
>   		spin_lock_irqsave(&priv->lock, flags);
>   		list_move(&tx->list, &priv->cm.reap_list);
> -		queue_work(ipoib_workqueue, &priv->cm.reap_task);
> +		queue_work(priv->wq, &priv->cm.reap_task);
>   		ipoib_dbg(priv, "Reap connection for gid %pI6\n",
>   			  tx->neigh->daddr + 4);
>   		tx->neigh = NULL;
> @@ -1417,7 +1417,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
>   
>   	skb_queue_tail(&priv->cm.skb_queue, skb);
>   	if (e)
> -		queue_work(ipoib_workqueue, &priv->cm.skb_task);
> +		queue_work(priv->wq, &priv->cm.skb_task);
>   }
>   
>   static void ipoib_cm_rx_reap(struct work_struct *work)
> @@ -1450,7 +1450,7 @@ static void ipoib_cm_stale_task(struct work_struct *work)
>   	}
>   
>   	if (!list_empty(&priv->cm.passive_ids))
> -		queue_delayed_work(ipoib_workqueue,
> +		queue_delayed_work(priv->wq,
>   				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
>   	spin_unlock_irq(&priv->lock);
>   }
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 72626c34817..bfd17d41b5f 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -655,7 +655,7 @@ void ipoib_reap_ah(struct work_struct *work)
>   	__ipoib_reap_ah(dev);
>   
>   	if (!test_bit(IPOIB_STOP_REAPER, &priv->flags))
> -		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
> +		queue_delayed_work(priv->wq, &priv->ah_reap_task,
>   				   round_jiffies_relative(HZ));
>   }
>   
> @@ -696,7 +696,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
>   	}
>   
>   	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
> -	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
> +	queue_delayed_work(priv->wq, &priv->ah_reap_task,
>   			   round_jiffies_relative(HZ));
>   
>   	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
> @@ -881,7 +881,7 @@ timeout:
>   	set_bit(IPOIB_STOP_REAPER, &priv->flags);
>   	cancel_delayed_work(&priv->ah_reap_task);
>   	if (flush)
> -		flush_workqueue(ipoib_workqueue);
> +		flush_workqueue(priv->wq);
>   
>   	begin = jiffies;
>   
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 949948a46d4..255c8296566 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -839,7 +839,7 @@ static void ipoib_set_mcast_list(struct net_device *dev)
>   		return;
>   	}
>   
> -	queue_work(ipoib_workqueue, &priv->restart_task);
> +	queue_work(priv->wq, &priv->restart_task);
>   }
>   
>   static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr)
> @@ -954,7 +954,7 @@ static void ipoib_reap_neigh(struct work_struct *work)
>   	__ipoib_reap_neigh(priv);
>   
>   	if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
> -		queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
> +		queue_delayed_work(priv->wq, &priv->neigh_reap_task,
>   				   arp_tbl.gc_interval);
>   }
>   
> @@ -1133,7 +1133,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
>   
>   	/* start garbage collection */
>   	clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
> -	queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
> +	queue_delayed_work(priv->wq, &priv->neigh_reap_task,
>   			   arp_tbl.gc_interval);
>   
>   	return 0;
> @@ -1293,7 +1293,7 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
>   	return 0;
>   
>   out_dev_uninit:
> -	ipoib_ib_dev_cleanup();
> +	ipoib_ib_dev_cleanup(dev);
>   
>   out_tx_ring_cleanup:
>   	vfree(priv->tx_ring);
> @@ -1646,7 +1646,7 @@ register_failed:
>   	/* Stop GC if started before flush */
>   	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
>   	cancel_delayed_work(&priv->neigh_reap_task);
> -	flush_workqueue(ipoib_workqueue);
> +	flush_workqueue(priv->wq);
>   
>   event_failed:
>   	ipoib_dev_cleanup(priv->dev);
> @@ -1717,7 +1717,7 @@ static void ipoib_remove_one(struct ib_device *device)
>   		/* Stop GC */
>   		set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
>   		cancel_delayed_work(&priv->neigh_reap_task);
> -		flush_workqueue(ipoib_workqueue);
> +		flush_workqueue(priv->wq);
>   
>   		unregister_netdev(priv->dev);
>   		free_netdev(priv->dev);
> @@ -1758,8 +1758,13 @@ static int __init ipoib_init_module(void)
>   	 * unregister_netdev() and linkwatch_event take the rtnl lock,
>   	 * so flush_scheduled_work() can deadlock during device
>   	 * removal.
> +	 *
> +	 * In addition, bringing one device up and another down at the
> +	 * same time can deadlock a single workqueue, so we have this
> +	 * global fallback workqueue, but we also attempt to open a
> +	 * per device workqueue each time we bring an interface up
>   	 */
> -	ipoib_workqueue = create_singlethread_workqueue("ipoib");
> +	ipoib_workqueue = create_singlethread_workqueue("ipoib_flush");
>   	if (!ipoib_workqueue) {
>   		ret = -ENOMEM;
>   		goto err_fs;
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 19e3fe75ebf..969ef420518 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -388,7 +388,7 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>   	 * the workqueue while holding the rtnl lock, so loop
>   	 * on trylock until either we get the lock or we see
>   	 * FLAG_ADMIN_UP go away as that signals that we are bailing
> -	 * and can safely ignore the carrier on work
> +	 * and can safely ignore the carrier on work.
>   	 */
>   	while (!rtnl_trylock()) {
>   		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> @@ -432,15 +432,14 @@ static int ipoib_mcast_join_complete(int status,
>   	if (!status) {
>   		mcast->backoff = 1;
>   		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -			queue_delayed_work(ipoib_workqueue,
> -					   &priv->mcast_task, 0);
> +			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>   
>   		/*
> -		 * Defer carrier on work to ipoib_workqueue to avoid a
> +		 * Defer carrier on work to priv->wq to avoid a
>   		 * deadlock on rtnl_lock here.
>   		 */
>   		if (mcast == priv->broadcast)
> -			queue_work(ipoib_workqueue, &priv->carrier_on_task);
> +			queue_work(priv->wq, &priv->carrier_on_task);
>   	} else {
>   		if (mcast->logcount++ < 20) {
>   			if (status == -ETIMEDOUT || status == -EAGAIN) {
> @@ -465,7 +464,7 @@ out:
>   	if (status == -ENETRESET)
>   		status = 0;
>   	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
> +		queue_delayed_work(priv->wq, &priv->mcast_task,
>   				   mcast->backoff * HZ);
>   	spin_unlock_irq(&priv->lock);
>   	mutex_unlock(&mcast_mutex);
> @@ -535,8 +534,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
>   			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
>   
>   		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -			queue_delayed_work(ipoib_workqueue,
> -					   &priv->mcast_task,
> +			queue_delayed_work(priv->wq, &priv->mcast_task,
>   					   mcast->backoff * HZ);
>   	}
>   	mutex_unlock(&mcast_mutex);
> @@ -584,8 +582,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
>   			ipoib_warn(priv, "failed to allocate broadcast group\n");
>   			mutex_lock(&mcast_mutex);
>   			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -				queue_delayed_work(ipoib_workqueue,
> -						   &priv->mcast_task, HZ);
> +				queue_delayed_work(priv->wq, &priv->mcast_task,
> +						   HZ);
>   			mutex_unlock(&mcast_mutex);
>   			return;
>   		}
> @@ -652,7 +650,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
>   
>   	mutex_lock(&mcast_mutex);
>   	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> -		queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
> +		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>   	mutex_unlock(&mcast_mutex);
>   
>   	return 0;
> @@ -670,7 +668,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
>   	mutex_unlock(&mcast_mutex);
>   
>   	if (flush)
> -		flush_workqueue(ipoib_workqueue);
> +		flush_workqueue(priv->wq);
>   
>   	return 0;
>   }
> @@ -737,7 +735,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
>   		__ipoib_mcast_add(dev, mcast);
>   		list_add_tail(&mcast->list, &priv->multicast_list);
>   		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> -			queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
> +			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>   	}
>   
>   	if (!mcast->ah) {
> @@ -952,7 +950,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>   	 * completes.  So do like the carrier on task and attempt to
>   	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
>   	 * goes away, then just return and know that the remove list will
> -	 * get flushed later by mcast_dev_flush.
> +	 * get flushed later by mcast_stop_thread.
>   	 */
>   	while (!rtnl_trylock()) {
>   		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
> index c56d5d44c53..b72a753eb41 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
> @@ -145,10 +145,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
>   	int ret, size;
>   	int i;
>   
> +	/*
> +	 * the various IPoIB tasks assume they will never race against
> +	 * themselves, so always use a single thread workqueue
> +	 */
> +	priv->wq = create_singlethread_workqueue("ipoib_wq");
> +	if (!priv->wq) {
> +		printk(KERN_WARNING "ipoib: failed to allocate device WQ\n");
> +		return -ENODEV;
> +	}
> +
>   	priv->pd = ib_alloc_pd(priv->ca);
>   	if (IS_ERR(priv->pd)) {
>   		printk(KERN_WARNING "%s: failed to allocate PD\n", ca->name);
> -		return -ENODEV;
> +		goto out_free_wq;
>   	}
>   
>   	priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
> @@ -242,6 +252,10 @@ out_free_mr:
>   
>   out_free_pd:
>   	ib_dealloc_pd(priv->pd);
> +
> +out_free_wq:
> +	destroy_workqueue(priv->wq);
> +	priv->wq = NULL;
>   	return -ENODEV;
>   }
>   
> @@ -270,6 +284,12 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
>   
>   	if (ib_dealloc_pd(priv->pd))
>   		ipoib_warn(priv, "ib_dealloc_pd failed\n");
> +
> +	if (priv->wq) {
> +		flush_workqueue(priv->wq);
> +		destroy_workqueue(priv->wq);
> +		priv->wq = NULL;
> +	}
>   }
>   
>   void ipoib_event(struct ib_event_handler *handler,


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
       [not found]     ` <d05cdce70f1312f35f8be2d14bafd2a06809b137.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-18 23:26       ` Wendy Cheng
@ 2014-09-04 12:13       ` Erez Shitrit
       [not found]         ` <5408576C.7040609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 37+ messages in thread
From: Erez Shitrit @ 2014-09-04 12:13 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A

> We blindly assume that we can just take the rtnl lock and that will
> prevent races with downing this interface.  Unfortunately, that's not
> the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
> in an attempt to clear out all remaining instances of ipoib_join_task.
> But, since this task is put on the same workqueue as the join task, the
> flush_workqueue waits on this thread too.  But this thread is deadlocked
> on the rtnl lock.  The better thing here is to use trylock and loop on
> that until we either get the lock or we see that FLAG_ADMIN_UP has
> been cleared, in which case we don't need to do anything anyway and we
> just return.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++++++++++++++------
>   1 file changed, 15 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index a0a42859f12..7e9cd39b5ef 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>   						   carrier_on_task);
>   	struct ib_port_attr attr;
>   
> -	/*
> -	 * Take rtnl_lock to avoid racing with ipoib_stop() and
> -	 * turning the carrier back on while a device is being
> -	 * removed.
> -	 */
>   	if (ib_query_port(priv->ca, priv->port, &attr) ||
>   	    attr.state != IB_PORT_ACTIVE) {
>   		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
>   		return;
>   	}
>   
> -	rtnl_lock();
> +	/*
> +	 * Take rtnl_lock to avoid racing with ipoib_stop() and
> +	 * turning the carrier back on while a device is being
> +	 * removed.  However, ipoib_stop() will attempt to flush
> +	 * the workqueue while holding the rtnl lock, so loop
> +	 * on trylock until either we get the lock or we see
> +	 * FLAG_ADMIN_UP go away as that signals that we are bailing
> +	 * and can safely ignore the carrier on work
> +	 */
> +	while (!rtnl_trylock()) {
> +		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
> +			return;

I think this check shouldn't be tied to the rtnl_lock; if
IPOIB_FLAG_ADMIN_UP is clear there is no need to turn the carrier on, just
wait for the next cycle, when ipoib_open will be called.

> +		else
> +			msleep(20);
> +	}
>   	if (!ipoib_cm_admin_enabled(priv->dev))
>   		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>   	netif_carrier_on(priv->dev);


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 5/8] IPoIB: change init sequence ordering
       [not found]     ` <ead9800512c1cb412b86cb1de3868c40f07c72be.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-09-04 12:36       ` Erez Shitrit
  0 siblings, 0 replies; 37+ messages in thread
From: Erez Shitrit @ 2014-09-04 12:36 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A


> In preparation for using per device work queues, we need to move the
> start of the neighbor thread task to after ipoib_ib_dev_init and move
> the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
> Otherwise we will end up freeing our workqueue with work possibly still
> on it.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

Acked-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

> ---
>   drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++++-------
>   1 file changed, 17 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 217cb77157d..949948a46d4 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
>   {
>   	struct ipoib_dev_priv *priv = netdev_priv(dev);
>   
> -	if (ipoib_neigh_hash_init(priv) < 0)
> -		goto out;
>   	/* Allocate RX/TX "rings" to hold queued skbs */
>   	priv->rx_ring =	kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
>   				GFP_KERNEL);
>   	if (!priv->rx_ring) {
>   		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
>   		       ca->name, ipoib_recvq_size);
> -		goto out_neigh_hash_cleanup;
> +		goto out;
>   	}
>   
>   	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
> @@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
>   	if (ipoib_ib_dev_init(dev, ca, port))
>   		goto out_tx_ring_cleanup;
>   
> +	/*
> +	 * Must be after ipoib_ib_dev_init so we can allocate a per
> +	 * device wq there and use it here
> +	 */
> +	if (ipoib_neigh_hash_init(priv) < 0)
> +		goto out_dev_uninit;
> +
>   	return 0;
>   
> +out_dev_uninit:
> +	ipoib_ib_dev_cleanup();
> +
>   out_tx_ring_cleanup:
>   	vfree(priv->tx_ring);
>   
>   out_rx_ring_cleanup:
>   	kfree(priv->rx_ring);
>   
> -out_neigh_hash_cleanup:
> -	ipoib_neigh_hash_uninit(dev);
>   out:
>   	return -ENOMEM;
>   }
> @@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev)
>   	}
>   	unregister_netdevice_many(&head);
>   
> +	/*
> +	 * Must be before ipoib_ib_dev_cleanup or we delete an in use
> +	 * work queue
> +	 */
> +	ipoib_neigh_hash_uninit(dev);
> +
>   	ipoib_ib_dev_cleanup(dev);
>   
>   	kfree(priv->rx_ring);
> @@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev)
>   
>   	priv->rx_ring = NULL;
>   	priv->tx_ring = NULL;
> -
> -	ipoib_neigh_hash_uninit(dev);
>   }
>   
>   static const struct header_ops ipoib_header_ops = {


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue
       [not found]     ` <2394730ce5ae1d46522dca04066293dd842edf16.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2014-08-15 22:11       ` Wendy Cheng
@ 2014-09-04 14:28       ` Erez Shitrit
  1 sibling, 0 replies; 37+ messages in thread
From: Erez Shitrit @ 2014-09-04 14:28 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A


> Setting the mtu can safely be moved to the carrier_on_task, which keeps
> us from needing to take the rtnl lock in the join_finish section.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Acked-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++------
>   1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index d4e005720d0..a0a42859f12 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
>   		spin_unlock_irq(&priv->lock);
>   		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
>   		set_qkey = 1;
> -
> -		if (!ipoib_cm_admin_enabled(dev)) {
> -			rtnl_lock();
> -			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
> -			rtnl_unlock();
> -		}
>   	}
>   
>   	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
> @@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
>   	}
>   
>   	rtnl_lock();
> +	if (!ipoib_cm_admin_enabled(priv->dev))
> +		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>   	netif_carrier_on(priv->dev);
>   	rtnl_unlock();
>   }


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface
       [not found]         ` <54080B6B.8050707-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2014-09-09  7:09           ` Doug Ledford
       [not found]             ` <540EA7B2.2050805-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2014-09-09  7:09 UTC (permalink / raw)
  To: Erez Shitrit, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A

On 09/04/2014 02:49 AM, Erez Shitrit wrote:
> Hi Doug,
>
> The idea that the "big" ipoib workqueue is only for the IB_EVENT_XXX
> events looks good to me.
>
> One comment here: I don't see where the driver flushes it at all.
> IMHO, when the driver goes down you should cancel all pending tasks on
> that global wq prior to the call to destroy_workqueue.

I understand your thought, but I disagree with your conclusion.  If we 
ever get to the portion of the driver down sequence where we are 
removing the big flush work queue and there are still items on that work 
queue, then we have already failed and we are going to crash because 
there are outstanding flushes for a deleted device.  The real issue here 
is that we need to make sure it's not possible to delete a device that 
has outstanding flushes, and not possible to flush a device in the 
process of being deleted (Wendy, don't get mad at me, but the rtnl lock 
jumps out as appropriate for this purpose).

> The current logic calls destroy_workqueue after remove_one; I
> think you should cancel the pending work after the
> ib_unregister_event_handler call in the ipoib_remove_one function,

The ipoib_remove_one function is device specific while the big flush
workqueue is not.  As such, canceling there can drop work for devices
other than the device being removed, with no clear means of getting it
back.

> otherwise any tasks still pending in that queue will be scheduled
> after the dev/priv are gone.

I agree that there is a problem here.  I think what needs to happen is
that, now that we have a workqueue per device and all flushes come to us
via a workqueue that does not belong to the device, it is possible to
handle flushes synchronously.  This would allow us to lock around the
flush itself and prevent removal until after the flush is complete,
without getting into the hairy rat hole of trying to flush the flush
workqueue, which might result in either flushing or canceling the very
device we are working on.


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/8] IPoIB: Make the carrier_on_task race aware
       [not found]         ` <5408576C.7040609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2014-09-09  7:17           ` Doug Ledford
  0 siblings, 0 replies; 37+ messages in thread
From: Doug Ledford @ 2014-09-09  7:17 UTC (permalink / raw)
  To: Erez Shitrit, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A

On 09/04/2014 08:13 AM, Erez Shitrit wrote:
>> We blindly assume that we can just take the rtnl lock and that will
>> prevent races with downing this interface.  Unfortunately, that's not
>> the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
>> in an attempt to clear out all remaining instances of ipoib_join_task.
>> But, since this task is put on the same workqueue as the join task, the
>> flush_workqueue waits on this thread too.  But this thread is deadlocked
>> on the rtnl lock.  The better thing here is to use trylock and loop on
>> that until we either get the lock or we see that FLAG_ADMIN_UP has
>> been cleared, in which case we don't need to do anything anyway and we
>> just return.
>>
>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>> ---
>>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21
>> +++++++++++++++------
>>   1 file changed, 15 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> index a0a42859f12..7e9cd39b5ef 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct
>> work_struct *work)
>>                              carrier_on_task);
>>       struct ib_port_attr attr;
>> -    /*
>> -     * Take rtnl_lock to avoid racing with ipoib_stop() and
>> -     * turning the carrier back on while a device is being
>> -     * removed.
>> -     */
>>       if (ib_query_port(priv->ca, priv->port, &attr) ||
>>           attr.state != IB_PORT_ACTIVE) {
>>           ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
>>           return;
>>       }
>> -    rtnl_lock();
>> +    /*
>> +     * Take rtnl_lock to avoid racing with ipoib_stop() and
>> +     * turning the carrier back on while a device is being
>> +     * removed.  However, ipoib_stop() will attempt to flush
>> +     * the workqueue while holding the rtnl lock, so loop
>> +     * on trylock until either we get the lock or we see
>> +     * FLAG_ADMIN_UP go away as that signals that we are bailing
>> +     * and can safely ignore the carrier on work
>> +     */
>> +    while (!rtnl_trylock()) {
>> +        if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
>> +            return;
>
> I think that this check shouldn't be connected to the rtnl_lock, if
> IPOIB_FLAG_ADMIN_UP is clear no need to take the carrier on, just wait
> to the next loop when ipoib_open will be called.

Actually, I was optimizing for the common case.  The common case is that 
someone is *not* running a while true; do ifup/ifdown; done torture 
test.  In that situation, you will get the rtnl lock, which you have to 
have before doing the two config items below, and so you can totally skip 
the check for FLAG_ADMIN_UP because A) the code that drops this flag 
takes the rtnl lock first and B) that same code flushes this workqueue 
while holding the lock, so if we ever get the lock, we already know the 
state of the ADMIN_UP flag.  This is all moot, though, in the sense that 
once the devices are switched to per-device workqueues, the need for 
this loop goes away and we can simplify the code.  It was only needed 
because our flush task and carrier_on task shared a workqueue and we 
could deadlock without this little loop.

>> +        else
>> +            msleep(20);
>> +    }
>>       if (!ipoib_cm_admin_enabled(priv->dev))
>>           dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
>>       netif_carrier_on(priv->dev);
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 6/8] IPoIB: Use dedicated workqueues per interface
       [not found]             ` <540EA7B2.2050805-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2014-09-10 17:27               ` Erez Shitrit
  0 siblings, 0 replies; 37+ messages in thread
From: Erez Shitrit @ 2014-09-10 17:27 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A


> On 09/04/2014 02:49 AM, Erez Shitrit wrote:
>> Hi Doug,
>>
>> The idea that "big" ipoib workqueue is only for the IB_EVENT_XXX looks
>> good to me.
>>
>> one comment here: I don't see where the driver flushes it at all.
>> IMHO, when the driver goes down you should cancel all pending tasks over
>> that global wq prior the call for the destroy_workqueue.
>
> I understand your thought, but I disagree with your conclusion. If we 
> ever get to the portion of the driver down sequence where we are 
> removing the big flush work queue and there are still items on that 
> work queue, then we have already failed and we are going to crash 
> because there are outstanding flushes for a deleted device.
It is not a rare case: SM events (pkey, hand-over, etc.) while the 
driver restarts on devices in the cluster will get you there.
Removing one device of several on the same host is the rare condition; 
anyway, I agree that cancel_work is not the solution.
> The real issue here is that we need to make sure it's not possible to 
> delete a device that has outstanding flushes, and not possible to 
> flush a device in the process of being deleted (Wendy, don't get mad 
> at me, but the rtnl lock jumps out as appropriate for this purpose).
We can remove the device and drain the workqueue while checking, for 
each event, that it doesn't belong to the deleted device (we have the 
list of all existing priv objects in the system via ib_get_client_data); 
after that we can be sure that no more works remain for that deleted device.
>
>> The current logic calls the destroy_workqueue after the remove_one, i
>> think you should cancel the pending works after the
>> ib_unregister_event_handler call in the ipoib_remove_one function,
>
> The ipoib_remove_one is device specific while the big flush workqueue 
> is not.  As such, canceling there could cancel work for devices other 
> than the one being removed, with no clear means of getting it back.
>
>> otherwise if there are pending tasks in that queue they will schedule
>> after the dev/priv are gone.
>
> I agree that there is a problem here.  I think what needs to happen is 
> that now that we have a work queue per device, and all flushes come in 
> to us via a work queue that does not belong to the device, it is now 
> possible to handle flushes synchronously.  This would allow us to lock 
> around the flush itself and prevent removal until after the flush is 
> complete without getting into the hairy rat hole that is trying to 
> flush the flush workqueue that might result in either flushing or 
> canceling the very device we are working on.
>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2014-09-10 17:27 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-12 23:38 [PATCH 0/8] IPoIB: Fix multiple race conditions Doug Ledford
     [not found] ` <cover.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-12 23:38   ` [PATCH 1/8] IPoIB: Consolidate rtnl_lock tasks in workqueue Doug Ledford
     [not found]     ` <2394730ce5ae1d46522dca04066293dd842edf16.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-15 22:11       ` Wendy Cheng
2014-09-04 14:28       ` Erez Shitrit
2014-08-12 23:38   ` [PATCH 2/8] IPoIB: Make the carrier_on_task race aware Doug Ledford
     [not found]     ` <d05cdce70f1312f35f8be2d14bafd2a06809b137.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-18 23:26       ` Wendy Cheng
     [not found]         ` <CABgxfbE6edfZZ58=mTvhGqWSkCxsik0XuQPR0L-yayze=803cg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-19 20:32           ` Doug Ledford
     [not found]             ` <2CC1794B-10CD-49A2-8F5D-0C66A6684DBC-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-19 22:05               ` Wendy Cheng
2014-09-04 12:13       ` Erez Shitrit
     [not found]         ` <5408576C.7040609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-09-09  7:17           ` Doug Ledford
2014-08-12 23:38   ` [PATCH 3/8] IPoIB: fix MCAST_FLAG_BUSY usage Doug Ledford
     [not found]     ` <a410e80dc5ca7cfa64229bbbf50c1337317e3bd8.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-19 18:08       ` Wendy Cheng
     [not found]         ` <CABgxfbH-Dt3CpxJKwCAZeHTUyupaA9y_WXVXuxgiPMet26PTQw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-19 20:28           ` Doug Ledford
     [not found]             ` <902D5BF2-159A-4B31-A87F-7491F3C8057F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-25 19:51               ` Wendy Cheng
     [not found]                 ` <CABgxfbHOD75vLdZ0TtWZbk8ne3kHd_eWObxPHmoJ-D8DjE0bkw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-25 20:03                   ` Doug Ledford
     [not found]                     ` <E3EFCBAC-2D6E-49D3-A556-DBD40701CC5F-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-26 18:04                       ` Wendy Cheng
2014-08-12 23:38   ` [PATCH 4/8] IPoIB: fix mcast_dev_flush/mcast_restart_task race Doug Ledford
     [not found]     ` <e1dbcfc25d8930b281aad12699ebf8fa82485b0e.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-29 21:53       ` Wendy Cheng
     [not found]         ` <CABgxfbHDuUrdHuLJT2oD07Cy3Ys4_rj-bJ6eR=9+uv0CuPH7_Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-08-30 15:39           ` Wendy Cheng
     [not found]             ` <CABgxfbGJOdmAn1sokEtisDdnA=r_4mfP=PfqZVsP0cd_oL50dA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-09-03 18:06               ` Doug Ledford
     [not found]                 ` <1409767614.26762.7.camel-v+aXH1h/sVwpzh8Nc7Vzg+562jBIR2Zt@public.gmane.org>
2014-09-03 19:45                   ` Wendy Cheng
2014-09-03 17:49           ` Doug Ledford
2014-08-12 23:38   ` [PATCH 5/8] IPoIB: change init sequence ordering Doug Ledford
     [not found]     ` <ead9800512c1cb412b86cb1de3868c40f07c72be.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-09-04 12:36       ` Erez Shitrit
2014-08-12 23:38   ` [PATCH 6/8] IPoIB: Use dedicated workqueues per interface Doug Ledford
     [not found]     ` <f7af9c251d722675a549e4a673f46c0f31dfa266.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-20 15:01       ` Estrin, Alex
2014-09-04  6:49       ` Erez Shitrit
     [not found]         ` <54080B6B.8050707-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2014-09-09  7:09           ` Doug Ledford
     [not found]             ` <540EA7B2.2050805-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-09-10 17:27               ` Erez Shitrit
2014-08-12 23:38   ` [PATCH 7/8] IPoIB: Make ipoib_mcast_stop_thread flush the workqueue Doug Ledford
     [not found]     ` <ae3912431eeacd81d920a405a6bdeb3853791b1a.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-20 15:03       ` Estrin, Alex
2014-08-12 23:38   ` [PATCH 8/8] IPoIB: No longer use flush as a parameter Doug Ledford
     [not found]     ` <ad7bb2b8da52f187cf2978e6a1c77ead32b60de3.1407885724.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2014-08-20 15:04       ` Estrin, Alex
2014-08-15 22:08   ` [PATCH 0/8] IPoIB: Fix multiple race conditions Wendy Cheng
     [not found]     ` <CABgxfbEGfiNGUKT4NJi1GoDRouFznxpgogDt5yr47TLfwDB7hA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-09-03 16:26       ` Wendy Cheng
2014-09-03 13:52   ` Or Gerlitz
     [not found]     ` <54071D14.9040404-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2014-09-03 18:12       ` Doug Ledford
