* [PATCH 0/9] IB/ipoib: fixup multicast locking issues
@ 2015-02-22  0:26 Doug Ledford
       [not found] ` <cover.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-02-22  0:26 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

This is the re-ordered, squashed version of my 22 patch set that I
posted on Feb 11.  There are a few minor differences between that
set and this one.  They are:

1) Rename __ipoib_mcast_continue_join_thread to
   __ipoib_mcast_schedule_join_thread
2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to
   avoid us accidentally trying to queue the single work struct instance
   twice (which doesn't work)
3) Slightly alter the layout of __ipoib_mcast_schedule_join_thread.  The
   logic is the same modulo #2, but indentation is reduced and
   readability increased
4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP
5) Add a couple of missing spinlock acquisitions so that we always call
   the schedule helper with the spinlock held
6) Make sure that we only clear the BUSY flag once we have done all the
   other things we are going to do to the mcast entry, and if possible,
   only call complete after we have released the spinlock
7) Fix the usage of time_before_eq when we should have just used
   time_before in ipoib_mcast_join_task
8) Create/destroy priv->wq in a slightly different point of
   ipoib_transport_dev_init/ipoib_transport_dev_cleanup

This entire patchset was intended to address the issue of ipoib
interfaces being brought up/down in a tight loop, which will hardlock
a standard v3.19 kernel.  It succeeds at resolving that problem.  In
order to be sure this patchset does not introduce other problems,
and in order to ensure that this rework of the patches into a new
set does not break bisectability, this entire patchset has been
extensively tested, starting with the first patch and going through
the last.

I used a 12 machine group plus the subnet manager to test these
patches.

1 machine ran ifconfig up/ifconfig down tests in a tight loop
1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause
  between insmod and rmmod
1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1
  second pause between insmod and rmmod
9 machines kept their interfaces up and ran iperf servers; 6 of these
  also ran ping6 instances to the addresses of all 12 machines, and 3
  ran iperf clients that sent data to all 9 iperf servers in an
  infinite loop
1 subnet manager machine that otherwise did not participate, but
  during testing was set to restart opensm once every 30 seconds to
  force net re-register events on all 12 machines in the group

In addition to the configuration of various machines above to test
data transfers, the IPoIB infrastructure itself contained several
elements designed to test specific multicast capabilities.

The primary P_Key, the one with the ping6 instances running on it,
deliberately had some well known multicast groups left undefined in
order to cause failed sendonly multicast joins on the same device
that needed to work with IPv6 pings as well as IPv4 multicast.

One of the alternate P_Key interfaces was defined with a minimum
rate of 56GBit/s, so all machines without 56GBit/s capability
were never able to join the broadcast group on that P_Key.
This was done to make sure that when the broadcast group is not
joined, no other multicast joins, sendonly or otherwise, are ever
sent.  It also was done to make sure that failed attempts to join
the broadcast group honored the backoff delays properly.

Note: both machines that were doing the insmod/rmmod loops were
changed to not have any P_Key interfaces defined other than the
default P_Key interface.  It is known that repeated insmod/rmmod
of the ib_ipoib module is fragile and easily breaks in the
presence of child interfaces.  It was not my intent to address
that particular problem with this patch set, so to avoid false
issues, child interfaces were removed from the mix on these
machines.

A wide array of hardware was also tested with this 12 machine group,
covering mthca, mlx4, mlx5, and qib hardware.

Patches 1 through 6 were tested without the ifconfig/rmmod/opensm
loops, as those particular problems were not expected to be addressed
until patch 7.  Patches 7 through 9 were tested with all tests.

The final, complete patch set was left running with the various
tests until it had completed 257 opensm restarts, 12052
ifconfig up/ifconfig down loops, 765 10 second insmod/rmmod loops,
and 1971 .1 second insmod/rmmod loops.  The only observed problem
was that the fast insmod/rmmod loop eventually locked up the
network stack on the machine.  It was stuck on an rtnl_lock deadlock,
but not one related to the multicast code (and therefore outside
the scope of these patches to address).  There are several bits of
additional locking to be fixed in the overall ipoib code in relation
to insmod/rmmod races and this patch set does not attempt to address
those.  It merely attempts not to introduce any new issues while
resolving the mcast locking issues related to bringing the interface
up and down.  I feel confident that it does that.

Doug Ledford (9):
  IB/ipoib: factor out ah flushing
  IB/ipoib: change init sequence ordering
  IB/ipoib: Consolidate rtnl_lock tasks in workqueue
  IB/ipoib: Make the carrier_on_task race aware
  IB/ipoib: Use dedicated workqueues per interface
  IB/ipoib: No longer use flush as a parameter
  IB/ipoib: fix MCAST_FLAG_BUSY usage
  IB/ipoib: deserialize multicast joins
  IB/ipoib: drop mcast_mutex usage

 drivers/infiniband/ulp/ipoib/ipoib.h           |  20 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  69 ++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  60 +--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 500 +++++++++++++------------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  31 +-
 6 files changed, 389 insertions(+), 309 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCH 1/9] IB/ipoib: factor out ah flushing
From: Doug Ledford @ 2015-02-22  0:26 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Create ipoib_flush_ah and ipoib_stop_ah routines to use at
appropriate times to flush out all remaining ah entries before we shut
the device down.

Because neighbors and mcast entries can each hold a reference on any
given ah, we must make sure to free all of those first, before the ah
can actually reach a refcount of 0 and be reaped.

This factoring is needed in preparation for having per-device work
queues.  The original per-device workqueue code resulted in the following
error message:

<ibdev>: ib_dealloc_pd failed

That error was tracked down to this issue.  With the changes to which
workqueues were flushed when, there were no flushes of the per device
workqueue after the last ah's were freed, resulting in an attempt to
dealloc the pd with outstanding resources still allocated.  This code
puts the explicit flushes in the needed places to avoid that problem.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 ++++++++++++++++++++-------------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 72626c34817..cb02466a0eb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
 				   round_jiffies_relative(HZ));
 }
 
+static void ipoib_flush_ah(struct net_device *dev, int flush)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	cancel_delayed_work(&priv->ah_reap_task);
+	if (flush)
+		flush_workqueue(ipoib_workqueue);
+	ipoib_reap_ah(&priv->ah_reap_task.work);
+}
+
+static void ipoib_stop_ah(struct net_device *dev, int flush)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	set_bit(IPOIB_STOP_REAPER, &priv->flags);
+	ipoib_flush_ah(dev, flush);
+}
+
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
 {
 	drain_tx_cq((struct net_device *)ctx);
@@ -877,24 +895,7 @@ timeout:
 	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
 		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
-	/* Wait for all AHs to be reaped */
-	set_bit(IPOIB_STOP_REAPER, &priv->flags);
-	cancel_delayed_work(&priv->ah_reap_task);
-	if (flush)
-		flush_workqueue(ipoib_workqueue);
-
-	begin = jiffies;
-
-	while (!list_empty(&priv->dead_ahs)) {
-		__ipoib_reap_ah(dev);
-
-		if (time_after(jiffies, begin + HZ)) {
-			ipoib_warn(priv, "timing out; will leak address handles\n");
-			break;
-		}
-
-		msleep(1);
-	}
+	ipoib_flush_ah(dev, flush);
 
 	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
 
@@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
 	if (level == IPOIB_FLUSH_LIGHT) {
 		ipoib_mark_paths_invalid(dev);
 		ipoib_mcast_dev_flush(dev);
+		ipoib_flush_ah(dev, 0);
 	}
 
 	if (level >= IPOIB_FLUSH_NORMAL)
@@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	ipoib_mcast_stop_thread(dev, 1);
 	ipoib_mcast_dev_flush(dev);
 
+	/*
+	 * All of our ah references aren't free until after
+	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
+	 * the neighbor garbage collection is stopped and reaped.
+	 * That should all be done now, so make a final ah flush.
+	 */
+	ipoib_stop_ah(dev, 1);
+
 	ipoib_transport_dev_cleanup(dev);
 }
 
-- 
2.1.0


* [PATCH 2/9] IB/ipoib: change init sequence ordering
From: Doug Ledford @ 2015-02-22  0:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

In preparation for using per device work queues, we need to move the
start of the neighbor thread task to after ipoib_ib_dev_init and move
the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
Otherwise we will end up freeing our workqueue with work possibly
still on it.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 58b5aa3b6f2..002ff0da9fa 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	if (ipoib_neigh_hash_init(priv) < 0)
-		goto out;
 	/* Allocate RX/TX "rings" to hold queued skbs */
 	priv->rx_ring =	kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
 				GFP_KERNEL);
 	if (!priv->rx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
 		       ca->name, ipoib_recvq_size);
-		goto out_neigh_hash_cleanup;
+		goto out;
 	}
 
 	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
@@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	if (ipoib_ib_dev_init(dev, ca, port))
 		goto out_tx_ring_cleanup;
 
+	/*
+	 * Must be after ipoib_ib_dev_init so we can allocate a per
+	 * device wq there and use it here
+	 */
+	if (ipoib_neigh_hash_init(priv) < 0)
+		goto out_dev_uninit;
+
 	return 0;
 
+out_dev_uninit:
+	ipoib_ib_dev_cleanup(dev);
+
 out_tx_ring_cleanup:
 	vfree(priv->tx_ring);
 
 out_rx_ring_cleanup:
 	kfree(priv->rx_ring);
 
-out_neigh_hash_cleanup:
-	ipoib_neigh_hash_uninit(dev);
 out:
 	return -ENOMEM;
 }
@@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev)
 	}
 	unregister_netdevice_many(&head);
 
+	/*
+	 * Must be before ipoib_ib_dev_cleanup or we delete an in use
+	 * work queue
+	 */
+	ipoib_neigh_hash_uninit(dev);
+
 	ipoib_ib_dev_cleanup(dev);
 
 	kfree(priv->rx_ring);
@@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
-
-	ipoib_neigh_hash_uninit(dev);
 }
 
 static const struct header_ops ipoib_header_ops = {
-- 
2.1.0


* [PATCH 3/9] IB/ipoib: Consolidate rtnl_lock tasks in workqueue
From: Doug Ledford @ 2015-02-22  0:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and
needs to keep it held.  It also needs to call flush_workqueue() to flush
out any outstanding work.  In the past, we've had to try and make sure
that we didn't flush out any outstanding join completions because they
also wanted to grab rtnl_lock() and that would deadlock.  It turns out
that the only thing in the join completion handler that needs this lock
can be safely moved to our carrier_on_task, thereby reducing the
potential for the join completion code and the flush code to deadlock
against each other.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index ffb83b5f7e8..eee66d13e5b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		spin_unlock_irq(&priv->lock);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
 		set_qkey = 1;
-
-		if (!ipoib_cm_admin_enabled(dev)) {
-			rtnl_lock();
-			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
-			rtnl_unlock();
-		}
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 	}
 
 	rtnl_lock();
+	if (!ipoib_cm_admin_enabled(priv->dev))
+		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
 	rtnl_unlock();
 }
-- 
2.1.0


* [PATCH 4/9] IB/ipoib: Make the carrier_on_task race aware
From: Doug Ledford @ 2015-02-22  0:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface.  Unfortunately, that's not
the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But since the carrier on task is put on the same workqueue as the join
task, the flush_workqueue() waits on it too, and that task is
deadlocked on the rtnl lock.  The better approach here is to use
trylock and loop on that until we either get the lock or we see that
FLAG_OPER_UP has been cleared, in which case we don't need to do
anything anyway and can just return.

While investigating which flag should be used, FLAG_ADMIN_UP or
FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more
appropriate flag to use.  However, there was a mix of these two flags in
use in the existing code.  So while we check for that flag here as part
of this race fix, also cleanup the two places that had used the less
appropriate flag for their tests.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index eee66d13e5b..c63a598d0b4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 						   carrier_on_task);
 	struct ib_port_attr attr;
 
-	/*
-	 * Take rtnl_lock to avoid racing with ipoib_stop() and
-	 * turning the carrier back on while a device is being
-	 * removed.
-	 */
 	if (ib_query_port(priv->ca, priv->port, &attr) ||
 	    attr.state != IB_PORT_ACTIVE) {
 		ipoib_dbg(priv, "Keeping carrier off until IB port is active\n");
 		return;
 	}
 
-	rtnl_lock();
+	/*
+	 * Take rtnl_lock to avoid racing with ipoib_stop() and
+	 * turning the carrier back on while a device is being
+	 * removed.  However, ipoib_stop() will attempt to flush
+	 * the workqueue while holding the rtnl lock, so loop
+	 * on trylock until either we get the lock or we see
+	 * FLAG_OPER_UP go away as that signals that we are bailing
+	 * and can safely ignore the carrier on work.
+	 */
+	while (!rtnl_trylock()) {
+		if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
+			return;
+		else
+			msleep(20);
+	}
 	if (!ipoib_cm_admin_enabled(priv->dev))
 		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
@@ -535,7 +544,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	if (!priv->broadcast) {
 		struct ipoib_mcast *broadcast;
 
-		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
 			return;
 
 		broadcast = ipoib_mcast_alloc(dev, 1);
@@ -882,7 +891,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		ipoib_mcast_free(mcast);
 	}
 
-	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+	if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
 		ipoib_mcast_start_thread(dev);
 }
 
-- 
2.1.0


* [PATCH 5/9] IB/ipoib: Use dedicated workqueues per interface
From: Doug Ledford @ 2015-02-22  0:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up.  It turns
out that this is because no matter how well we protect against races
on a single device, the fact that all devices use the same workqueue,
and flush_workqueue() flushes *everything* from that workqueue means
that we would also have to prevent all races between different devices
(for instance, ipoib_mcast_restart_task on interface ib0 can race with
ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
the rtnl_lock).

There are several possible solutions to this problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl for
some set period of time and if they fail, then bail.  This runs the
real risk of dropping work on the floor, which can end up being its
own separate kind of deadlock.

Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail.  Again, this can
drop work on the floor.

Or the method this patch uses: when we bring an interface up, create
a workqueue specifically for that interface, so that when we take it
back down, we are flushing only those tasks associated with our
interface.  In addition, keep the global workqueue, but now limit it
to flush tasks only.  That way, the flush tasks can always flush the
device specific work queues without running into deadlock issues.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        | 18 +++++++--------
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  6 ++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c      | 28 +++++++++++++----------
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 20 ++++++++---------
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     | 31 +++++++++++++++++++++++---
 6 files changed, 66 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index d7562beb542..e940cd9f847 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -317,6 +317,7 @@ struct ipoib_dev_priv {
 	struct list_head multicast_list;
 	struct rb_root multicast_tree;
 
+	struct workqueue_struct *wq;
 	struct delayed_work mcast_task;
 	struct work_struct carrier_on_task;
 	struct work_struct flush_light;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 933efcea0d0..56959adb6c7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even
 	}
 
 	spin_lock_irq(&priv->lock);
-	queue_delayed_work(ipoib_workqueue,
+	queue_delayed_work(priv->wq,
 			   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	/* Add this entry to passive ids list head, but do not re-add it
 	 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
@@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 			spin_lock_irqsave(&priv->lock, flags);
 			list_splice_init(&priv->cm.rx_drain_list, &priv->cm.rx_reap_list);
 			ipoib_cm_start_rx_drain(priv);
-			queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+			queue_work(priv->wq, &priv->cm.rx_reap_task);
 			spin_unlock_irqrestore(&priv->lock, flags);
 		} else
 			ipoib_warn(priv, "cm recv completion event with wrid %d (> %d)\n",
@@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 				spin_lock_irqsave(&priv->lock, flags);
 				list_move(&p->list, &priv->cm.rx_reap_list);
 				spin_unlock_irqrestore(&priv->lock, flags);
-				queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+				queue_work(priv->wq, &priv->cm.rx_reap_task);
 			}
 			return;
 		}
@@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 
 		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+			queue_work(priv->wq, &priv->cm.reap_task);
 		}
 
 		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
@@ -1255,7 +1255,7 @@ static int ipoib_cm_tx_handler(struct ib_cm_id *cm_id,
 
 		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
+			queue_work(priv->wq, &priv->cm.reap_task);
 		}
 
 		spin_unlock_irqrestore(&priv->lock, flags);
@@ -1284,7 +1284,7 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path
 	tx->dev = dev;
 	list_add(&tx->list, &priv->cm.start_list);
 	set_bit(IPOIB_FLAG_INITIALIZED, &tx->flags);
-	queue_work(ipoib_workqueue, &priv->cm.start_task);
+	queue_work(priv->wq, &priv->cm.start_task);
 	return tx;
 }
 
@@ -1295,7 +1295,7 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
 	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
 		spin_lock_irqsave(&priv->lock, flags);
 		list_move(&tx->list, &priv->cm.reap_list);
-		queue_work(ipoib_workqueue, &priv->cm.reap_task);
+		queue_work(priv->wq, &priv->cm.reap_task);
 		ipoib_dbg(priv, "Reap connection for gid %pI6\n",
 			  tx->neigh->daddr + 4);
 		tx->neigh = NULL;
@@ -1417,7 +1417,7 @@ void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
 
 	skb_queue_tail(&priv->cm.skb_queue, skb);
 	if (e)
-		queue_work(ipoib_workqueue, &priv->cm.skb_task);
+		queue_work(priv->wq, &priv->cm.skb_task);
 }
 
 static void ipoib_cm_rx_reap(struct work_struct *work)
@@ -1450,7 +1450,7 @@ static void ipoib_cm_stale_task(struct work_struct *work)
 	}
 
 	if (!list_empty(&priv->cm.passive_ids))
-		queue_delayed_work(ipoib_workqueue,
+		queue_delayed_work(priv->wq,
 				   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
 	spin_unlock_irq(&priv->lock);
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index cb02466a0eb..2a56b7a11a9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -655,7 +655,7 @@ void ipoib_reap_ah(struct work_struct *work)
 	__ipoib_reap_ah(dev);
 
 	if (!test_bit(IPOIB_STOP_REAPER, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+		queue_delayed_work(priv->wq, &priv->ah_reap_task,
 				   round_jiffies_relative(HZ));
 }
 
@@ -665,7 +665,7 @@ static void ipoib_flush_ah(struct net_device *dev, int flush)
 
 	cancel_delayed_work(&priv->ah_reap_task);
 	if (flush)
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 	ipoib_reap_ah(&priv->ah_reap_task.work);
 }
 
@@ -714,7 +714,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
 	}
 
 	clear_bit(IPOIB_STOP_REAPER, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->ah_reap_task,
+	queue_delayed_work(priv->wq, &priv->ah_reap_task,
 			   round_jiffies_relative(HZ));
 
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 002ff0da9fa..64effe4faf7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -839,7 +839,7 @@ static void ipoib_set_mcast_list(struct net_device *dev)
 		return;
 	}
 
-	queue_work(ipoib_workqueue, &priv->restart_task);
+	queue_work(priv->wq, &priv->restart_task);
 }
 
 static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr)
@@ -954,7 +954,7 @@ static void ipoib_reap_neigh(struct work_struct *work)
 	__ipoib_reap_neigh(priv);
 
 	if (!test_bit(IPOIB_STOP_NEIGH_GC, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+		queue_delayed_work(priv->wq, &priv->neigh_reap_task,
 				   arp_tbl.gc_interval);
 }
 
@@ -1133,7 +1133,7 @@ static int ipoib_neigh_hash_init(struct ipoib_dev_priv *priv)
 
 	/* start garbage collection */
 	clear_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
-	queue_delayed_work(ipoib_workqueue, &priv->neigh_reap_task,
+	queue_delayed_work(priv->wq, &priv->neigh_reap_task,
 			   arp_tbl.gc_interval);
 
 	return 0;
@@ -1643,10 +1643,11 @@ sysfs_failed:
 
 register_failed:
 	ib_unregister_event_handler(&priv->event_handler);
+	flush_workqueue(ipoib_workqueue);
 	/* Stop GC if started before flush */
 	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 	cancel_delayed_work(&priv->neigh_reap_task);
-	flush_workqueue(ipoib_workqueue);
+	flush_workqueue(priv->wq);
 
 event_failed:
 	ipoib_dev_cleanup(priv->dev);
@@ -1709,6 +1710,7 @@ static void ipoib_remove_one(struct ib_device *device)
 
 	list_for_each_entry_safe(priv, tmp, dev_list, list) {
 		ib_unregister_event_handler(&priv->event_handler);
+		flush_workqueue(ipoib_workqueue);
 
 		rtnl_lock();
 		dev_change_flags(priv->dev, priv->dev->flags & ~IFF_UP);
@@ -1717,7 +1719,7 @@ static void ipoib_remove_one(struct ib_device *device)
 		/* Stop GC */
 		set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
 		cancel_delayed_work(&priv->neigh_reap_task);
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 		unregister_netdev(priv->dev);
 		free_netdev(priv->dev);
@@ -1752,14 +1754,16 @@ static int __init ipoib_init_module(void)
 		return ret;
 
 	/*
-	 * We create our own workqueue mainly because we want to be
-	 * able to flush it when devices are being removed.  We can't
-	 * use schedule_work()/flush_scheduled_work() because both
-	 * unregister_netdev() and linkwatch_event take the rtnl lock,
-	 * so flush_scheduled_work() can deadlock during device
-	 * removal.
+	 * We create a global workqueue here that is used for all flush
+	 * operations.  However, if you attempt to flush a workqueue
+	 * from a task on that same workqueue, it deadlocks the system.
+	 * We want to be able to flush the tasks associated with a
+	 * specific net device, so we also create a workqueue for each
+	 * netdevice.  We queue up the tasks for that device only on
+	 * its private workqueue, and we only queue up flush events
+	 * on our global flush workqueue.  This avoids the deadlocks.
 	 */
-	ipoib_workqueue = create_singlethread_workqueue("ipoib");
+	ipoib_workqueue = create_singlethread_workqueue("ipoib_flush");
 	if (!ipoib_workqueue) {
 		ret = -ENOMEM;
 		goto err_fs;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index c63a598d0b4..9d3c1ed576e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -403,16 +403,15 @@ static int ipoib_mcast_join_complete(int status,
 		mcast->backoff = 1;
 		mutex_lock(&mcast_mutex);
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->mcast_task, 0);
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 		mutex_unlock(&mcast_mutex);
 
 		/*
-		 * Defer carrier on work to ipoib_workqueue to avoid a
+		 * Defer carrier on work to priv->wq to avoid a
 		 * deadlock on rtnl_lock here.
 		 */
 		if (mcast == priv->broadcast)
-			queue_work(ipoib_workqueue, &priv->carrier_on_task);
+			queue_work(priv->wq, &priv->carrier_on_task);
 
 		status = 0;
 		goto out;
@@ -438,7 +437,7 @@ static int ipoib_mcast_join_complete(int status,
 	mutex_lock(&mcast_mutex);
 	spin_lock_irq(&priv->lock);
 	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task,
+		queue_delayed_work(priv->wq, &priv->mcast_task,
 				   mcast->backoff * HZ);
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
@@ -511,8 +510,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 
 		mutex_lock(&mcast_mutex);
 		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(ipoib_workqueue,
-					   &priv->mcast_task,
+			queue_delayed_work(priv->wq, &priv->mcast_task,
 					   mcast->backoff * HZ);
 		mutex_unlock(&mcast_mutex);
 	}
@@ -552,8 +550,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-				queue_delayed_work(ipoib_workqueue,
-						   &priv->mcast_task, HZ);
+				queue_delayed_work(priv->wq, &priv->mcast_task,
+						   HZ);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -609,7 +607,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 
 	mutex_lock(&mcast_mutex);
 	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(ipoib_workqueue, &priv->mcast_task, 0);
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	mutex_unlock(&mcast_mutex);
 
 	return 0;
@@ -627,7 +625,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
 	mutex_unlock(&mcast_mutex);
 
 	if (flush)
-		flush_workqueue(ipoib_workqueue);
+		flush_workqueue(priv->wq);
 
 	return 0;
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index c56d5d44c53..34628403fd8 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -157,6 +157,16 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 		goto out_free_pd;
 	}
 
+	/*
+	 * the various IPoIB tasks assume they will never race against
+	 * themselves, so always use a single thread workqueue
+	 */
+	priv->wq = create_singlethread_workqueue("ipoib_wq");
+	if (!priv->wq) {
+		printk(KERN_WARNING "ipoib: failed to allocate device WQ\n");
+		goto out_free_mr;
+	}
+
 	size = ipoib_recvq_size + 1;
 	ret = ipoib_cm_dev_init(dev);
 	if (!ret) {
@@ -165,12 +175,13 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 			size += ipoib_recvq_size + 1; /* 1 extra for rx_drain_qp */
 		else
 			size += ipoib_recvq_size * ipoib_max_conn_qp;
-	}
+	} else
+		goto out_free_wq;
 
 	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
 	if (IS_ERR(priv->recv_cq)) {
 		printk(KERN_WARNING "%s: failed to create receive CQ\n", ca->name);
-		goto out_free_mr;
+		goto out_cm_dev_cleanup;
 	}
 
 	priv->send_cq = ib_create_cq(priv->ca, ipoib_send_comp_handler, NULL,
@@ -236,12 +247,19 @@ out_free_send_cq:
 out_free_recv_cq:
 	ib_destroy_cq(priv->recv_cq);
 
+out_cm_dev_cleanup:
+	ipoib_cm_dev_cleanup(dev);
+
+out_free_wq:
+	destroy_workqueue(priv->wq);
+	priv->wq = NULL;
+
 out_free_mr:
 	ib_dereg_mr(priv->mr);
-	ipoib_cm_dev_cleanup(dev);
 
 out_free_pd:
 	ib_dealloc_pd(priv->pd);
+
 	return -ENODEV;
 }
 
@@ -265,11 +283,18 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 
 	ipoib_cm_dev_cleanup(dev);
 
+	if (priv->wq) {
+		flush_workqueue(priv->wq);
+		destroy_workqueue(priv->wq);
+		priv->wq = NULL;
+	}
+
 	if (ib_dereg_mr(priv->mr))
 		ipoib_warn(priv, "ib_dereg_mr failed\n");
 
 	if (ib_dealloc_pd(priv->pd))
 		ipoib_warn(priv, "ib_dealloc_pd failed\n");
+
 }
 
 void ipoib_event(struct ib_event_handler *handler,
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 6/9] IB/ipoib: No longer use flush as a parameter
       [not found] ` <cover.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-02-22  0:27   ` [PATCH 5/9] IB/ipoib: Use dedicated workqueues per interface Doug Ledford
@ 2015-02-22  0:27   ` Doug Ledford
  2015-02-22  0:27   ` [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage Doug Ledford
                     ` (4 subsequent siblings)
  10 siblings, 0 replies; 37+ messages in thread
From: Doug Ledford @ 2015-02-22  0:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Various places in the IPoIB code had a deadlock related to flushing
the ipoib workqueue.  Now that we have per-device workqueues and a
dedicated flush workqueue, there is no longer a deadlock issue with
flushing the device-specific workqueues, so we can do so
unconditionally and drop the flush parameter.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  8 +++---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        | 35 +++++++++++++-------------
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  8 +++---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 18 ++++++++++---
 4 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index e940cd9f847..9ef432ae72e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -478,10 +478,10 @@ void ipoib_ib_dev_flush_heavy(struct work_struct *work);
 void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush);
+int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
-int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev, int flush);
+int ipoib_ib_dev_down(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev);
 void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
@@ -493,7 +493,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
-int ipoib_mcast_stop_thread(struct net_device *dev, int flush);
+int ipoib_mcast_stop_thread(struct net_device *dev);
 
 void ipoib_mcast_dev_down(struct net_device *dev);
 void ipoib_mcast_dev_flush(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 2a56b7a11a9..e144d07d53c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -659,22 +659,21 @@ void ipoib_reap_ah(struct work_struct *work)
 				   round_jiffies_relative(HZ));
 }
 
-static void ipoib_flush_ah(struct net_device *dev, int flush)
+static void ipoib_flush_ah(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	cancel_delayed_work(&priv->ah_reap_task);
-	if (flush)
-		flush_workqueue(priv->wq);
+	flush_workqueue(priv->wq);
 	ipoib_reap_ah(&priv->ah_reap_task.work);
 }
 
-static void ipoib_stop_ah(struct net_device *dev, int flush)
+static void ipoib_stop_ah(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
-	ipoib_flush_ah(dev, flush);
+	ipoib_flush_ah(dev);
 }
 
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
@@ -682,7 +681,7 @@ static void ipoib_ib_tx_timer_func(unsigned long ctx)
 	drain_tx_cq((struct net_device *)ctx);
 }
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush)
+int ipoib_ib_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
@@ -724,7 +723,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
 dev_stop:
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
 		napi_enable(&priv->napi);
-	ipoib_ib_dev_stop(dev, flush);
+	ipoib_ib_dev_stop(dev);
 	return -1;
 }
 
@@ -756,7 +755,7 @@ int ipoib_ib_dev_up(struct net_device *dev)
 	return ipoib_mcast_start_thread(dev);
 }
 
-int ipoib_ib_dev_down(struct net_device *dev, int flush)
+int ipoib_ib_dev_down(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -765,7 +764,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
 	clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
 	netif_carrier_off(dev);
 
-	ipoib_mcast_stop_thread(dev, flush);
+	ipoib_mcast_stop_thread(dev);
 	ipoib_mcast_dev_flush(dev);
 
 	ipoib_flush_paths(dev);
@@ -825,7 +824,7 @@ void ipoib_drain_cq(struct net_device *dev)
 	local_bh_enable();
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+int ipoib_ib_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -895,7 +894,7 @@ timeout:
 	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
 		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
-	ipoib_flush_ah(dev, flush);
+	ipoib_flush_ah(dev);
 
 	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
 
@@ -919,7 +918,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		    (unsigned long) dev);
 
 	if (dev->flags & IFF_UP) {
-		if (ipoib_ib_dev_open(dev, 1)) {
+		if (ipoib_ib_dev_open(dev)) {
 			ipoib_transport_dev_cleanup(dev);
 			return -ENODEV;
 		}
@@ -1038,16 +1037,16 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
 	if (level == IPOIB_FLUSH_LIGHT) {
 		ipoib_mark_paths_invalid(dev);
 		ipoib_mcast_dev_flush(dev);
-		ipoib_flush_ah(dev, 0);
+		ipoib_flush_ah(dev);
 	}
 
 	if (level >= IPOIB_FLUSH_NORMAL)
-		ipoib_ib_dev_down(dev, 0);
+		ipoib_ib_dev_down(dev);
 
 	if (level == IPOIB_FLUSH_HEAVY) {
 		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-			ipoib_ib_dev_stop(dev, 0);
-		if (ipoib_ib_dev_open(dev, 0) != 0)
+			ipoib_ib_dev_stop(dev);
+		if (ipoib_ib_dev_open(dev) != 0)
 			return;
 		if (netif_queue_stopped(dev))
 			netif_start_queue(dev);
@@ -1099,7 +1098,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	 */
 	ipoib_flush_paths(dev);
 
-	ipoib_mcast_stop_thread(dev, 1);
+	ipoib_mcast_stop_thread(dev);
 	ipoib_mcast_dev_flush(dev);
 
 	/*
@@ -1108,7 +1107,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	 * the neighbor garbage collection is stopped and reaped.
 	 * That should all be done now, so make a final ah flush.
 	 */
-	ipoib_stop_ah(dev, 1);
+	ipoib_stop_ah(dev);
 
 	ipoib_transport_dev_cleanup(dev);
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 64effe4faf7..26e0eedc2c6 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -108,7 +108,7 @@ int ipoib_open(struct net_device *dev)
 
 	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	if (ipoib_ib_dev_open(dev, 1)) {
+	if (ipoib_ib_dev_open(dev)) {
 		if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
 			return 0;
 		goto err_disable;
@@ -139,7 +139,7 @@ int ipoib_open(struct net_device *dev)
 	return 0;
 
 err_stop:
-	ipoib_ib_dev_stop(dev, 1);
+	ipoib_ib_dev_stop(dev);
 
 err_disable:
 	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
@@ -157,8 +157,8 @@ static int ipoib_stop(struct net_device *dev)
 
 	netif_stop_queue(dev);
 
-	ipoib_ib_dev_down(dev, 1);
-	ipoib_ib_dev_stop(dev, 0);
+	ipoib_ib_dev_down(dev);
+	ipoib_ib_dev_stop(dev);
 
 	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
 		struct ipoib_dev_priv *cpriv;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 9d3c1ed576e..bb1b69904f9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -613,7 +613,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	return 0;
 }
 
-int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
+int ipoib_mcast_stop_thread(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -624,8 +624,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
 	cancel_delayed_work(&priv->mcast_task);
 	mutex_unlock(&mcast_mutex);
 
-	if (flush)
-		flush_workqueue(priv->wq);
+	flush_workqueue(priv->wq);
 
 	return 0;
 }
@@ -797,7 +796,18 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 
 	ipoib_dbg_mcast(priv, "restarting multicast task\n");
 
-	ipoib_mcast_stop_thread(dev, 0);
+	/*
+	 * We're running on the priv->wq right now, so we can't call
+	 * mcast_stop_thread as it wants to flush the wq and that
+	 * will deadlock.  We don't actually *need* to stop the
+	 * thread here anyway, so just clear the run flag, cancel
+	 * any delayed work, do our work, remove the old entries,
+	 * then restart the thread.
+	 */
+	mutex_lock(&mcast_mutex);
+	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
+	cancel_delayed_work(&priv->mcast_task);
+	mutex_unlock(&mcast_mutex);
 
 	local_irq_save(flags);
 	netif_addr_lock(dev);
-- 
2.1.0


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage
       [not found] ` <cover.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2015-02-22  0:27   ` [PATCH 6/9] IB/ipoib: No longer use flush as a parameter Doug Ledford
@ 2015-02-22  0:27   ` Doug Ledford
       [not found]     ` <9d657f64ee961ee3b3233520d8b499b234a42bcd.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-22  0:27   ` [PATCH 8/9] IB/ipoib: deserialize multicast joins Doug Ledford
                     ` (3 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-02-22  0:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag, MCAST_JOIN_STARTED, but was not very strict
in how it was used.  We didn't always initialize the completion struct
before setting the flag, and we didn't call complete on the completion
struct from every path that should have completed it.  And when we did
complete it, we sometimes continued to touch the mcast entry after the
completion, opening us up to possible use-after-free issues.

This made it less than totally effective, and certainly made its use
confusing.  And in the flush function we would use the presence of this
flag to signal that we should wait on the completion struct, yet we
never cleared the flag.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

 1) Remove the MCAST_JOIN_STARTED flag entirely
 2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
 3) Test mcast->mc directly to see if we have completed
    ib_sa_join_multicast (using IS_ERR_OR_NULL)
 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
    the mcast->done completion struct
 5) Make sure that before calling complete(&mcast->done), we always clear
    the MCAST_FLAG_BUSY bit
 6) Take the mcast_mutex before we call ib_sa_join_multicast and also
    take the mutex in our join callback.  This forces
    ib_sa_join_multicast to return and set mcast->mc before we process
    the callback.  This way, our callback can safely clear mcast->mc
    if there is an error on the join, and mcast_dev_flush will do the
    right thing as a result.
 7) Because we need the mutex to synchronize mcast->mc, we can no
    longer call mcast_sendonly_join directly from mcast_send and
    instead must add sendonly join processing to the mcast_join_task
 8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
    we have a running task.  We know when we need to reschedule our
    join task thread and don't need a flag to tell us.
 9) Add a helper for rescheduling the join task thread

A number of different races are resolved by these changes.  These
races existed with the old MCAST_FLAG_BUSY usage; the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1                             Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
  alloc member
  call callback
                                     ifconfig ib0 down
				     wait_for_completion
    callback call completes
                                     wait_for_completion in
				     mcast_dev_flush completes
				       mcast->mc is an ERR_PTR or NULL
				       so we skip ib_sa_free_multicast
    return from callback
  return from ib_sa_join_multicast
set mcast->mc = return value from ib_sa_join_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1                   Thread 2         Thread 3
ib_sa_join_multicast
                                            ifconfig ib0 down
					    priv->broadcast = NULL
                           join_complete
			                    wait_for_completion
			   mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
			   complete
			   return -EAGAIN (making mcast->mc invalid)
			   		    call ib_sa_free_multicast
					    on invalid mcast->mc, hang
					    forever

By holding the mutex around ib_sa_join_multicast and taking the mutex
early in the callback, we force mcast->mc to be valid by the time we
run the callback.  This allows us to clear mcast->mc if there is an
error and the join is going to fail, and we do so before we complete
the mcast.  In this way, mcast_dev_flush always sees a consistent
state with regard to mcast->mc membership at the time that
wait_for_completion() returns.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  11 +-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 355 ++++++++++++++++---------
 2 files changed, 238 insertions(+), 128 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 9ef432ae72e..c79dcd5ee8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -98,9 +98,15 @@ enum {
 
 	IPOIB_MCAST_FLAG_FOUND	  = 0,	/* used in set_multicast_list */
 	IPOIB_MCAST_FLAG_SENDONLY = 1,
-	IPOIB_MCAST_FLAG_BUSY	  = 2,	/* joining or already joined */
+	/*
+	 * For IPOIB_MCAST_FLAG_BUSY
+	 * When set, in flight join and mcast->mc is unreliable
+	 * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
+	 *   haven't started yet
+	 * When clear and mcast->mc is valid pointer, join was successful
+	 */
+	IPOIB_MCAST_FLAG_BUSY	  = 2,
 	IPOIB_MCAST_FLAG_ATTACHED = 3,
-	IPOIB_MCAST_JOIN_STARTED  = 4,
 
 	MAX_SEND_CQE		  = 16,
 	IPOIB_CM_COPYBREAK	  = 256,
@@ -148,6 +154,7 @@ struct ipoib_mcast {
 
 	unsigned long created;
 	unsigned long backoff;
+	unsigned long delay_until;
 
 	unsigned long flags;
 	unsigned char logcount;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index bb1b69904f9..277e7ac7c4d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -66,6 +66,48 @@ struct ipoib_mcast_iter {
 	unsigned int       send_only;
 };
 
+/*
+ * This should be called with the mcast_mutex held
+ */
+static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv,
+					       struct ipoib_mcast *mcast,
+					       bool delay)
+{
+	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
+		return;
+
+	/*
+	 * We will be scheduling *something*, so cancel whatever is
+	 * currently scheduled first
+	 */
+	cancel_delayed_work(&priv->mcast_task);
+	if (mcast && delay) {
+		/*
+		 * We had a failure and want to schedule a retry later
+		 */
+		mcast->backoff *= 2;
+		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
+			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+		mcast->delay_until = jiffies + (mcast->backoff * HZ);
+		/*
+		 * Mark this mcast for its delay, but restart the
+		 * task immediately.  The join task will make sure to
+		 * clear out all entries without delays, and then
+		 * schedule itself to run again when the earliest
+		 * delay expires
+		 */
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	} else if (delay) {
+		/*
+		 * Special case of retrying after a failure to
+		 * allocate the broadcast multicast group, wait
+		 * 1 second and try again
+		 */
+		queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
+	} else
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+}
+
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
 	struct net_device *dev = mcast->dev;
@@ -103,6 +145,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev,
 
 	mcast->dev = dev;
 	mcast->created = jiffies;
+	mcast->delay_until = jiffies;
 	mcast->backoff = 1;
 
 	INIT_LIST_HEAD(&mcast->list);
@@ -270,17 +313,31 @@ ipoib_mcast_sendonly_join_complete(int status,
 {
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	/*
+	 * We have to take the mutex to force mcast_sendonly_join to
+	 * return from ib_sa_multicast_join and set mcast->mc to a
+	 * valid value.  Otherwise we were racing with ourselves in
+	 * that we might fail here, but get a valid return from
+	 * ib_sa_multicast_join after we had cleared mcast->mc here,
+	 * resulting in mis-matched joins and leaves and a deadlock
+	 */
+	mutex_lock(&mcast_mutex);
 
 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
-		return 0;
+	if (status == -ENETRESET) {
+		status = 0;
+		goto out;
+	}
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
 
 	if (status) {
 		if (mcast->logcount++ < 20)
-			ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for %pI6, status %d\n",
+			ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
+					"join failed for %pI6, status %d\n",
 					mcast->mcmember.mgid.raw, status);
 
 		/* Flush out any queued packets */
@@ -290,11 +347,18 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
-
-		/* Clear the busy flag so we try again */
-		status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY,
-					    &mcast->flags);
+		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
+	} else {
+		mcast->backoff = 1;
+		mcast->delay_until = jiffies;
+		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
 	}
+out:
+	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	if (status)
+		mcast->mc = NULL;
+	complete(&mcast->done);
+	mutex_unlock(&mcast_mutex);
 	return status;
 }
 
@@ -312,19 +376,18 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	int ret = 0;
 
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
-		ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n");
+		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
+				"multicast joins\n");
+		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+		complete(&mcast->done);
 		return -ENODEV;
 	}
 
-	if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
-		ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
-		return -EBUSY;
-	}
-
 	rec.mgid     = mcast->mcmember.mgid;
 	rec.port_gid = priv->local_gid;
 	rec.pkey     = cpu_to_be16(priv->pkey);
 
+	mutex_lock(&mcast_mutex);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
 					 priv->port, &rec,
 					 IB_SA_MCMEMBER_REC_MGID	|
@@ -337,12 +400,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	if (IS_ERR(mcast->mc)) {
 		ret = PTR_ERR(mcast->mc);
 		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		ipoib_warn(priv, "ib_sa_join_multicast failed (ret = %d)\n",
-			   ret);
+		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
+			   "failed (ret = %d)\n", ret);
+		complete(&mcast->done);
 	} else {
-		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting join\n",
-				mcast->mcmember.mgid.raw);
+		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
+				"sendonly join\n", mcast->mcmember.mgid.raw);
 	}
+	mutex_unlock(&mcast_mutex);
 
 	return ret;
 }
@@ -390,6 +455,16 @@ static int ipoib_mcast_join_complete(int status,
 	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
 			mcast->mcmember.mgid.raw, status);
 
+	/*
+	 * We have to take the mutex to force mcast_join to
+	 * return from ib_sa_multicast_join and set mcast->mc to a
+	 * valid value.  Otherwise we were racing with ourselves in
+	 * that we might fail here, but get a valid return from
+	 * ib_sa_multicast_join after we had cleared mcast->mc here,
+	 * resulting in mis-matched joins and leaves and a deadlock
+	 */
+	mutex_lock(&mcast_mutex);
+
 	/* We trap for port events ourselves. */
 	if (status == -ENETRESET) {
 		status = 0;
@@ -401,10 +476,8 @@ static int ipoib_mcast_join_complete(int status,
 
 	if (!status) {
 		mcast->backoff = 1;
-		mutex_lock(&mcast_mutex);
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
-		mutex_unlock(&mcast_mutex);
+		mcast->delay_until = jiffies;
+		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
 
 		/*
 		 * Defer carrier on work to priv->wq to avoid a
@@ -412,37 +485,26 @@ static int ipoib_mcast_join_complete(int status,
 		 */
 		if (mcast == priv->broadcast)
 			queue_work(priv->wq, &priv->carrier_on_task);
-
-		status = 0;
-		goto out;
-	}
-
-	if (mcast->logcount++ < 20) {
-		if (status == -ETIMEDOUT || status == -EAGAIN) {
-			ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
-					mcast->mcmember.mgid.raw, status);
-		} else {
-			ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
-				   mcast->mcmember.mgid.raw, status);
+	} else {
+		if (mcast->logcount++ < 20) {
+			if (status == -ETIMEDOUT || status == -EAGAIN) {
+				ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
+						mcast->mcmember.mgid.raw, status);
+			} else {
+				ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
+					   mcast->mcmember.mgid.raw, status);
+			}
 		}
-	}
-
-	mcast->backoff *= 2;
-	if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-		mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 
-	/* Clear the busy flag so we try again */
-	status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-
-	mutex_lock(&mcast_mutex);
-	spin_lock_irq(&priv->lock);
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task,
-				   mcast->backoff * HZ);
-	spin_unlock_irq(&priv->lock);
-	mutex_unlock(&mcast_mutex);
+		/* Requeue this join task with a backoff delay */
+		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
+	}
 out:
+	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	if (status)
+		mcast->mc = NULL;
 	complete(&mcast->done);
+	mutex_unlock(&mcast_mutex);
 	return status;
 }
 
@@ -491,29 +553,18 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		rec.hop_limit	  = priv->broadcast->mcmember.hop_limit;
 	}
 
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags);
-
+	mutex_lock(&mcast_mutex);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
 					 &rec, comp_mask, GFP_KERNEL,
 					 ipoib_mcast_join_complete, mcast);
 	if (IS_ERR(mcast->mc)) {
 		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		complete(&mcast->done);
 		ret = PTR_ERR(mcast->mc);
 		ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
-
-		mcast->backoff *= 2;
-		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-
-		mutex_lock(&mcast_mutex);
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task,
-					   mcast->backoff * HZ);
-		mutex_unlock(&mcast_mutex);
+		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
+		complete(&mcast->done);
 	}
+	mutex_unlock(&mcast_mutex);
 }
 
 void ipoib_mcast_join_task(struct work_struct *work)
@@ -522,6 +573,9 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		container_of(work, struct ipoib_dev_priv, mcast_task.work);
 	struct net_device *dev = priv->dev;
 	struct ib_port_attr port_attr;
+	unsigned long delay_until = 0;
+	struct ipoib_mcast *mcast = NULL;
+	int create = 1;
 
 	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		return;
@@ -539,64 +593,102 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	else
 		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
 
+	/*
+	 * We have to hold the mutex to keep from racing with the join
+	 * completion threads on setting flags on mcasts, and we have
+	 * to hold the priv->lock because dev_flush will remove entries
+	 * out from underneath us, so at a minimum we need the lock
+	 * through the time that we do the for_each loop of the mcast
+	 * list or else dev_flush can make us oops.
+	 */
+	mutex_lock(&mcast_mutex);
+	spin_lock_irq(&priv->lock);
+	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
+		goto out;
+
 	if (!priv->broadcast) {
 		struct ipoib_mcast *broadcast;
 
-		if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
-			return;
-
-		broadcast = ipoib_mcast_alloc(dev, 1);
+		broadcast = ipoib_mcast_alloc(dev, 0);
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
-			mutex_lock(&mcast_mutex);
-			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-				queue_delayed_work(priv->wq, &priv->mcast_task,
-						   HZ);
-			mutex_unlock(&mcast_mutex);
-			return;
+			/*
+			 * Restart us after a 1 second delay to retry
+			 * creating our broadcast group and attaching to
+			 * it.  Until this succeeds, this ipoib dev is
+			 * completely stalled (multicast wise).
+			 */
+			__ipoib_mcast_schedule_join_thread(priv, NULL, 1);
+			goto out;
 		}
 
-		spin_lock_irq(&priv->lock);
 		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
 		priv->broadcast = broadcast;
 
 		__ipoib_mcast_add(dev, priv->broadcast);
-		spin_unlock_irq(&priv->lock);
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
-		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
-			ipoib_mcast_join(dev, priv->broadcast, 0);
-		return;
+		if (IS_ERR_OR_NULL(priv->broadcast->mc) &&
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) {
+			mcast = priv->broadcast;
+			create = 0;
+			if (mcast->backoff > 1 &&
+			    time_before(jiffies, mcast->delay_until)) {
+				delay_until = mcast->delay_until;
+				mcast = NULL;
+			}
+		}
+		goto out;
 	}
 
-	while (1) {
-		struct ipoib_mcast *mcast = NULL;
-
-		spin_lock_irq(&priv->lock);
-		list_for_each_entry(mcast, &priv->multicast_list, list) {
-			if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)
-			    && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)
-			    && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+	/*
+	 * We'll never get here until the broadcast group is both allocated
+	 * and attached
+	 */
+	list_for_each_entry(mcast, &priv->multicast_list, list) {
+		if (IS_ERR_OR_NULL(mcast->mc) &&
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
+		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+			if (mcast->backoff == 1 ||
+			    time_after_eq(jiffies, mcast->delay_until))
 				/* Found the next unjoined group */
 				break;
-			}
+			else if (!delay_until ||
+				 time_before(mcast->delay_until, delay_until))
+				delay_until = mcast->delay_until;
 		}
-		spin_unlock_irq(&priv->lock);
-
-		if (&mcast->list == &priv->multicast_list) {
-			/* All done */
-			break;
-		}
-
-		ipoib_mcast_join(dev, mcast, 1);
-		return;
 	}
 
-	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
+	if (&mcast->list == &priv->multicast_list) {
+		/*
+		 * All done, unless we have delayed work from
+		 * backoff retransmissions, but we will get
+		 * restarted when the time is right, so we are
+		 * done for now
+		 */
+		mcast = NULL;
+		ipoib_dbg_mcast(priv, "successfully joined all "
+				"multicast groups\n");
+	}
 
-	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
+out:
+	if (mcast) {
+		init_completion(&mcast->done);
+		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	}
+	spin_unlock_irq(&priv->lock);
+	mutex_unlock(&mcast_mutex);
+	if (mcast) {
+		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+			ipoib_mcast_sendonly_join(mcast);
+		else
+			ipoib_mcast_join(dev, mcast, create);
+	}
+	if (delay_until)
+		queue_delayed_work(priv->wq, &priv->mcast_task,
+				   delay_until - jiffies);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -606,8 +698,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	ipoib_dbg_mcast(priv, "starting multicast thread\n");
 
 	mutex_lock(&mcast_mutex);
-	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	set_bit(IPOIB_MCAST_RUN, &priv->flags);
+	__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 
 	return 0;
@@ -635,7 +727,12 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 	int ret = 0;
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+		ipoib_warn(priv, "ipoib_mcast_leave on an in-flight join\n");
+
+	if (!IS_ERR_OR_NULL(mcast->mc))
 		ib_sa_free_multicast(mcast->mc);
+	else
+		ipoib_dbg(priv, "ipoib_mcast_leave with mcast->mc invalid\n");
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
 		ipoib_dbg_mcast(priv, "leaving MGID %pI6\n",
@@ -646,7 +743,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 				      be16_to_cpu(mcast->mcmember.mlid));
 		if (ret)
 			ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret);
-	}
+	} else if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+		ipoib_dbg(priv, "leaving with no mcmember but not a "
+			  "SENDONLY join\n");
 
 	return 0;
 }
@@ -687,6 +786,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
+		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
 	}
 
 	if (!mcast->ah) {
@@ -696,13 +796,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 		}
-
-		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
-			ipoib_dbg_mcast(priv, "no address vector, "
-					"but multicast join already started\n");
-		else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
-			ipoib_mcast_sendonly_join(mcast);
-
 		/*
 		 * If lookup completes between here and out:, don't
 		 * want to send packet twice.
@@ -761,9 +854,12 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 
 	spin_unlock_irqrestore(&priv->lock, flags);
 
-	/* seperate between the wait to the leave*/
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
-		if (test_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags))
+		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
@@ -794,20 +890,14 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	unsigned long flags;
 	struct ib_sa_mcmember_rec rec;
 
-	ipoib_dbg_mcast(priv, "restarting multicast task\n");
+	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
+		/*
+		 * shortcut...on shutdown flush is called next, just
+		 * let it do all the work
+		 */
+		return;
 
-	/*
-	 * We're running on the priv->wq right now, so we can't call
-	 * mcast_stop_thread as it wants to flush the wq and that
-	 * will deadlock.  We don't actually *need* to stop the
-	 * thread here anyway, so just clear the run flag, cancel
-	 * any delayed work, do our work, remove the old entries,
-	 * then restart the thread.
-	 */
-	mutex_lock(&mcast_mutex);
-	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
-	cancel_delayed_work(&priv->mcast_task);
-	mutex_unlock(&mcast_mutex);
+	ipoib_dbg_mcast(priv, "restarting multicast task\n");
 
 	local_irq_save(flags);
 	netif_addr_lock(dev);
@@ -893,14 +983,27 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	netif_addr_unlock(dev);
 	local_irq_restore(flags);
 
-	/* We have to cancel outside of the spinlock */
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+			wait_for_completion(&mcast->done);
+
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}
 
-	if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
-		ipoib_mcast_start_thread(dev);
+	/*
+	 * Double check that we are still up
+	 */
+	if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
+		spin_lock_irqsave(&priv->lock, flags);
+		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
+		spin_unlock_irqrestore(&priv->lock, flags);
+	}
 }
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH 8/9] IB/ipoib: deserialize multicast joins
       [not found] ` <cover.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-02-22  0:27   ` [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage Doug Ledford
@ 2015-02-22  0:27   ` Doug Ledford
       [not found]     ` <a24ade295dfdd1369aac47a978003569ec190952.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-22  0:27   ` [PATCH 9/9] IB/ipoib: drop mcast_mutex usage Doug Ledford
                     ` (2 subsequent siblings)
  10 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-02-22  0:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

Allow the ipoib layer to attempt to join all outstanding multicast
groups at once.  The ib_sa layer will serialize multiple attempts to
join the same group, but will process attempts to join different groups
in parallel.  Take advantage of that.

In order to make this happen, change the mcast_join_thread to loop
through all needed joins, sending a join request for each one that we
still need to join.  There are a few special cases we handle though:

1) Don't attempt to join anything but the broadcast group until the join
of the broadcast group has succeeded.
2) No longer restart the join task at the end of completion handling.
If we completed successfully, we are done.  The join task now needs to
be kicked by mcast_send, mcast_restart_task, or mcast_start_thread, but
should not need to be started at any other time except when scheduling
a backoff attempt to rejoin.
3) No longer use separate join/completion routines for regular and
sendonly joins, pass them all through the same routine and just do the
right thing based on the SENDONLY join flag.
4) Only try a SENDONLY join twice, then drop the packets and quit
trying.  We leave the mcast group in the list so that if we get a
new packet, all we have to do is queue up the packet and restart
the join task; it will automatically try to join twice and then
either send or flush the queue again.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 250 ++++++++-----------------
 1 file changed, 82 insertions(+), 168 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 277e7ac7c4d..c670d9c2cda 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -307,111 +307,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 	return 0;
 }
 
-static int
-ipoib_mcast_sendonly_join_complete(int status,
-				   struct ib_sa_multicast *multicast)
-{
-	struct ipoib_mcast *mcast = multicast->context;
-	struct net_device *dev = mcast->dev;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-
-	/*
-	 * We have to take the mutex to force mcast_sendonly_join to
-	 * return from ib_sa_multicast_join and set mcast->mc to a
-	 * valid value.  Otherwise we were racing with ourselves in
-	 * that we might fail here, but get a valid return from
-	 * ib_sa_multicast_join after we had cleared mcast->mc here,
-	 * resulting in mis-matched joins and leaves and a deadlock
-	 */
-	mutex_lock(&mcast_mutex);
-
-	/* We trap for port events ourselves. */
-	if (status == -ENETRESET) {
-		status = 0;
-		goto out;
-	}
-
-	if (!status)
-		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
-
-	if (status) {
-		if (mcast->logcount++ < 20)
-			ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
-					"join failed for %pI6, status %d\n",
-					mcast->mcmember.mgid.raw, status);
-
-		/* Flush out any queued packets */
-		netif_tx_lock_bh(dev);
-		while (!skb_queue_empty(&mcast->pkt_queue)) {
-			++dev->stats.tx_dropped;
-			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
-		}
-		netif_tx_unlock_bh(dev);
-		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
-	} else {
-		mcast->backoff = 1;
-		mcast->delay_until = jiffies;
-		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
-	}
-out:
-	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-	if (status)
-		mcast->mc = NULL;
-	complete(&mcast->done);
-	mutex_unlock(&mcast_mutex);
-	return status;
-}
-
-static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
-{
-	struct net_device *dev = mcast->dev;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct ib_sa_mcmember_rec rec = {
-#if 0				/* Some SMs don't support send-only yet */
-		.join_state = 4
-#else
-		.join_state = 1
-#endif
-	};
-	int ret = 0;
-
-	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
-		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
-				"multicast joins\n");
-		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		complete(&mcast->done);
-		return -ENODEV;
-	}
-
-	rec.mgid     = mcast->mcmember.mgid;
-	rec.port_gid = priv->local_gid;
-	rec.pkey     = cpu_to_be16(priv->pkey);
-
-	mutex_lock(&mcast_mutex);
-	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
-					 priv->port, &rec,
-					 IB_SA_MCMEMBER_REC_MGID	|
-					 IB_SA_MCMEMBER_REC_PORT_GID	|
-					 IB_SA_MCMEMBER_REC_PKEY	|
-					 IB_SA_MCMEMBER_REC_JOIN_STATE,
-					 GFP_ATOMIC,
-					 ipoib_mcast_sendonly_join_complete,
-					 mcast);
-	if (IS_ERR(mcast->mc)) {
-		ret = PTR_ERR(mcast->mc);
-		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
-			   "failed (ret = %d)\n", ret);
-		complete(&mcast->done);
-	} else {
-		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
-				"sendonly join\n", mcast->mcmember.mgid.raw);
-	}
-	mutex_unlock(&mcast_mutex);
-
-	return ret;
-}
-
 void ipoib_mcast_carrier_on_task(struct work_struct *work)
 {
 	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
@@ -452,7 +347,9 @@ static int ipoib_mcast_join_complete(int status,
 	struct net_device *dev = mcast->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
+	ipoib_dbg_mcast(priv, "%sjoin completion for %pI6 (status %d)\n",
+			test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ?
+			"sendonly " : "",
 			mcast->mcmember.mgid.raw, status);
 
 	/*
@@ -477,27 +374,52 @@ static int ipoib_mcast_join_complete(int status,
 	if (!status) {
 		mcast->backoff = 1;
 		mcast->delay_until = jiffies;
-		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
 
 		/*
 		 * Defer carrier on work to priv->wq to avoid a
-		 * deadlock on rtnl_lock here.
+		 * deadlock on rtnl_lock here.  Requeue our multicast
+		 * work too, which will end up happening right after
+		 * our carrier on task work and will allow us to
+		 * send out all of the non-broadcast joins
 		 */
-		if (mcast == priv->broadcast)
+		if (mcast == priv->broadcast) {
 			queue_work(priv->wq, &priv->carrier_on_task);
+			__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
+		}
 	} else {
 		if (mcast->logcount++ < 20) {
 			if (status == -ETIMEDOUT || status == -EAGAIN) {
-				ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
+				ipoib_dbg_mcast(priv, "%smulticast join failed for %pI6, status %d\n",
+						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
 						mcast->mcmember.mgid.raw, status);
 			} else {
-				ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
+				ipoib_warn(priv, "%smulticast join failed for %pI6, status %d\n",
+						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
 					   mcast->mcmember.mgid.raw, status);
 			}
 		}
 
-		/* Requeue this join task with a backoff delay */
-		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
+		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) &&
+		    mcast->backoff >= 2) {
+			/*
+			 * We only retry sendonly joins once before we drop
+			 * the packet and quit trying to deal with the
+			 * group.  However, we leave the group in the
+			 * mcast list as an unjoined group.  If we want to
+			 * try joining again, we simply queue up a packet
+			 * and restart the join thread.  The empty queue
+			 * is why the join thread ignores this group.
+			 */
+			mcast->backoff = 1;
+			netif_tx_lock_bh(dev);
+			while (!skb_queue_empty(&mcast->pkt_queue)) {
+				++dev->stats.tx_dropped;
+				dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+			}
+			netif_tx_unlock_bh(dev);
+		} else
+			/* Requeue this join task with a backoff delay */
+			__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
 	}
 out:
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
@@ -650,45 +572,45 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	list_for_each_entry(mcast, &priv->multicast_list, list) {
 		if (IS_ERR_OR_NULL(mcast->mc) &&
 		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+		    (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ||
+		     !skb_queue_empty(&mcast->pkt_queue))) {
 			if (mcast->backoff == 1 ||
-			    time_after_eq(jiffies, mcast->delay_until))
+			    time_after_eq(jiffies, mcast->delay_until)) {
 				/* Found the next unjoined group */
-				break;
-			else if (!delay_until ||
+				init_completion(&mcast->done);
+				set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+				if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+					create = 0;
+				else
+					create = 1;
+				spin_unlock_irq(&priv->lock);
+				mutex_unlock(&mcast_mutex);
+				ipoib_mcast_join(dev, mcast, create);
+				mutex_lock(&mcast_mutex);
+				spin_lock_irq(&priv->lock);
+			} else if (!delay_until ||
 				 time_before(mcast->delay_until, delay_until))
 				delay_until = mcast->delay_until;
 		}
 	}
 
-	if (&mcast->list == &priv->multicast_list) {
-		/*
-		 * All done, unless we have delayed work from
-		 * backoff retransmissions, but we will get
-		 * restarted when the time is right, so we are
-		 * done for now
-		 */
-		mcast = NULL;
-		ipoib_dbg_mcast(priv, "successfully joined all "
-				"multicast groups\n");
-	}
+	mcast = NULL;
+	ipoib_dbg_mcast(priv, "successfully started all multicast joins\n");
 
 out:
+	if (delay_until) {
+		cancel_delayed_work(&priv->mcast_task);
+		queue_delayed_work(priv->wq, &priv->mcast_task,
+				   delay_until - jiffies);
+	}
 	if (mcast) {
 		init_completion(&mcast->done);
 		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	}
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
-	if (mcast) {
-		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
-			ipoib_mcast_sendonly_join(mcast);
-		else
-			ipoib_mcast_join(dev, mcast, create);
-	}
-	if (delay_until)
-		queue_delayed_work(priv->wq, &priv->mcast_task,
-				   delay_until - jiffies);
+	if (mcast)
+		ipoib_mcast_join(dev, mcast, create);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -731,8 +653,6 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 
 	if (!IS_ERR_OR_NULL(mcast->mc))
 		ib_sa_free_multicast(mcast->mc);
-	else
-		ipoib_dbg(priv, "ipoib_mcast_leave with mcast->mc invalid\n");
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
 		ipoib_dbg_mcast(priv, "leaving MGID %pI6\n",
@@ -768,43 +688,37 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 	}
 
 	mcast = __ipoib_mcast_find(dev, mgid);
-	if (!mcast) {
-		/* Let's create a new send only group now */
-		ipoib_dbg_mcast(priv, "setting up send only multicast group for %pI6\n",
-				mgid);
-
-		mcast = ipoib_mcast_alloc(dev, 0);
+	if (!mcast || !mcast->ah) {
 		if (!mcast) {
-			ipoib_warn(priv, "unable to allocate memory for "
-				   "multicast structure\n");
-			++dev->stats.tx_dropped;
-			dev_kfree_skb_any(skb);
-			goto out;
-		}
-
-		set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
-		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
-		__ipoib_mcast_add(dev, mcast);
-		list_add_tail(&mcast->list, &priv->multicast_list);
-		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
-	}
+			/* Let's create a new send only group now */
+			ipoib_dbg_mcast(priv, "setting up send only multicast group for %pI6\n",
+					mgid);
+
+			mcast = ipoib_mcast_alloc(dev, 0);
+			if (!mcast) {
+				ipoib_warn(priv, "unable to allocate memory "
+					   "for multicast structure\n");
+				++dev->stats.tx_dropped;
+				dev_kfree_skb_any(skb);
+				goto unlock;
+			}
 
-	if (!mcast->ah) {
+			set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
+			memcpy(mcast->mcmember.mgid.raw, mgid,
+			       sizeof (union ib_gid));
+			__ipoib_mcast_add(dev, mcast);
+			list_add_tail(&mcast->list, &priv->multicast_list);
+		}
 		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
 			skb_queue_tail(&mcast->pkt_queue, skb);
 		else {
 			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 		}
-		/*
-		 * If lookup completes between here and out:, don't
-		 * want to send packet twice.
-		 */
-		mcast = NULL;
-	}
-
-out:
-	if (mcast && mcast->ah) {
+		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
+			__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
+		}
+	} else {
 		struct ipoib_neigh *neigh;
 
 		spin_unlock_irqrestore(&priv->lock, flags);
-- 
2.1.0


* [PATCH 9/9] IB/ipoib: drop mcast_mutex usage
       [not found] ` <cover.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (7 preceding siblings ...)
  2015-02-22  0:27   ` [PATCH 8/9] IB/ipoib: deserialize multicast joins Doug Ledford
@ 2015-02-22  0:27   ` Doug Ledford
       [not found]     ` <767f4c41779db63ce8c6dbba04b21959aba70ef9.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-02-22 21:34   ` [PATCH 0/9] IB/ipoib: fixup multicast locking issues Or Gerlitz
  2015-03-13  8:41   ` Or Gerlitz
  10 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-02-22  0:27 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit, Doug Ledford

We needed the mcast_mutex when we had to prevent the join completion
callback from having the value it stored in mcast->mc overwritten
by a delayed return from ib_sa_join_multicast.  By storing the return
of ib_sa_join_multicast in an intermediate variable, we prevent a
delayed return from ib_sa_join_multicast overwriting the valid
contents of mcast->mc, and we no longer need a mutex to force the
join callback to run after the return of ib_sa_join_multicast.  This
allows us to do away with the mutex entirely and protect our critical
sections with just a spinlock instead.  This is highly desirable
as there were some places where we couldn't use a mutex because the
code was not allowed to sleep, and so we had been using a mix
of mutex and spinlock to protect what we needed to protect.  Now we
only have a spinlock and the locking complexity is greatly reduced.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 70 ++++++++++++--------------
 1 file changed, 32 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index c670d9c2cda..3203ebe9b10 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -55,8 +55,6 @@ MODULE_PARM_DESC(mcast_debug_level,
 		 "Enable multicast debug tracing if > 0");
 #endif
 
-static DEFINE_MUTEX(mcast_mutex);
-
 struct ipoib_mcast_iter {
 	struct net_device *dev;
 	union ib_gid       mgid;
@@ -67,7 +65,7 @@ struct ipoib_mcast_iter {
 };
 
 /*
- * This should be called with the mcast_mutex held
+ * This should be called with the priv->lock held
  */
 static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv,
 					       struct ipoib_mcast *mcast,
@@ -352,16 +350,6 @@ static int ipoib_mcast_join_complete(int status,
 			"sendonly " : "",
 			mcast->mcmember.mgid.raw, status);
 
-	/*
-	 * We have to take the mutex to force mcast_join to
-	 * return from ib_sa_multicast_join and set mcast->mc to a
-	 * valid value.  Otherwise we were racing with ourselves in
-	 * that we might fail here, but get a valid return from
-	 * ib_sa_multicast_join after we had cleared mcast->mc here,
-	 * resulting in mis-matched joins and leaves and a deadlock
-	 */
-	mutex_lock(&mcast_mutex);
-
 	/* We trap for port events ourselves. */
 	if (status == -ENETRESET) {
 		status = 0;
@@ -383,8 +371,10 @@ static int ipoib_mcast_join_complete(int status,
 		 * send out all of the non-broadcast joins
 		 */
 		if (mcast == priv->broadcast) {
+			spin_lock_irq(&priv->lock);
 			queue_work(priv->wq, &priv->carrier_on_task);
 			__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
+			goto out_locked;
 		}
 	} else {
 		if (mcast->logcount++ < 20) {
@@ -417,16 +407,28 @@ static int ipoib_mcast_join_complete(int status,
 				dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 			}
 			netif_tx_unlock_bh(dev);
-		} else
+		} else {
+			spin_lock_irq(&priv->lock);
 			/* Requeue this join task with a backoff delay */
 			__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
+			goto out_locked;
+		}
 	}
 out:
-	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	spin_lock_irq(&priv->lock);
+out_locked:
+	/*
+	 * Make sure to set mcast->mc before we clear the busy flag to avoid
+	 * racing with code that checks for BUSY before checking mcast->mc
+	 */
 	if (status)
 		mcast->mc = NULL;
+	else
+		mcast->mc = multicast;
+	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	spin_unlock_irq(&priv->lock);
 	complete(&mcast->done);
-	mutex_unlock(&mcast_mutex);
+
 	return status;
 }
 
@@ -434,6 +436,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 			     int create)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_sa_multicast *multicast;
 	struct ib_sa_mcmember_rec rec = {
 		.join_state = 1
 	};
@@ -475,18 +478,19 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		rec.hop_limit	  = priv->broadcast->mcmember.hop_limit;
 	}
 
-	mutex_lock(&mcast_mutex);
-	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
+	multicast = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
 					 &rec, comp_mask, GFP_KERNEL,
 					 ipoib_mcast_join_complete, mcast);
-	if (IS_ERR(mcast->mc)) {
-		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		ret = PTR_ERR(mcast->mc);
+	if (IS_ERR(multicast)) {
+		ret = PTR_ERR(multicast);
 		ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
+		spin_lock_irq(&priv->lock);
+		/* Requeue this join task with a backoff delay */
 		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
+		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+		spin_unlock_irq(&priv->lock);
 		complete(&mcast->done);
 	}
-	mutex_unlock(&mcast_mutex);
 }
 
 void ipoib_mcast_join_task(struct work_struct *work)
@@ -515,15 +519,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	else
 		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
 
-	/*
-	 * We have to hold the mutex to keep from racing with the join
-	 * completion threads on setting flags on mcasts, and we have
-	 * to hold the priv->lock because dev_flush will remove entries
-	 * out from underneath us, so at a minimum we need the lock
-	 * through the time that we do the for_each loop of the mcast
-	 * list or else dev_flush can make us oops.
-	 */
-	mutex_lock(&mcast_mutex);
 	spin_lock_irq(&priv->lock);
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
 		goto out;
@@ -584,9 +579,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 				else
 					create = 1;
 				spin_unlock_irq(&priv->lock);
-				mutex_unlock(&mcast_mutex);
 				ipoib_mcast_join(dev, mcast, create);
-				mutex_lock(&mcast_mutex);
 				spin_lock_irq(&priv->lock);
 			} else if (!delay_until ||
 				 time_before(mcast->delay_until, delay_until))
@@ -608,7 +601,6 @@ out:
 		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	}
 	spin_unlock_irq(&priv->lock);
-	mutex_unlock(&mcast_mutex);
 	if (mcast)
 		ipoib_mcast_join(dev, mcast, create);
 }
@@ -616,13 +608,14 @@ out:
 int ipoib_mcast_start_thread(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned long flags;
 
 	ipoib_dbg_mcast(priv, "starting multicast thread\n");
 
-	mutex_lock(&mcast_mutex);
+	spin_lock_irqsave(&priv->lock, flags);
 	set_bit(IPOIB_MCAST_RUN, &priv->flags);
 	__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
-	mutex_unlock(&mcast_mutex);
+	spin_unlock_irqrestore(&priv->lock, flags);
 
 	return 0;
 }
@@ -630,13 +623,14 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 int ipoib_mcast_stop_thread(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	unsigned long flags;
 
 	ipoib_dbg_mcast(priv, "stopping multicast thread\n");
 
-	mutex_lock(&mcast_mutex);
+	spin_lock_irqsave(&priv->lock, flags);
 	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 	cancel_delayed_work(&priv->mcast_task);
-	mutex_unlock(&mcast_mutex);
+	spin_unlock_irqrestore(&priv->lock, flags);
 
 	flush_workqueue(priv->wq);
 
-- 
2.1.0


* Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
       [not found] ` <cover.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (8 preceding siblings ...)
  2015-02-22  0:27   ` [PATCH 9/9] IB/ipoib: drop mcast_mutex usage Doug Ledford
@ 2015-02-22 21:34   ` Or Gerlitz
       [not found]     ` <CAJ3xEMgj=ATKLt0MA67c3WefCrG1hZ59eSrhpD-u_dxLJe2kfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-03-13  8:41   ` Or Gerlitz
  10 siblings, 1 reply; 37+ messages in thread
From: Or Gerlitz @ 2015-02-22 21:34 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit

On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> This is the re-ordered, squashed version of my 22 patch set that I
> posted on Feb 11.  There are a few minor differences between that
> set and this one.

Hi Doug,

I took a quick look at your git repo @
git://github.com/dledford/linux.git and it doesn't seem to contain this
series; can you please push it there and tell us which branch to pull?

Or.

> They are:
> 1) Rename __ipoib_mcast_continue_join_thread to
>    __ipoib_mcast_schedule_join_thread
> 2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to
>    avoid us accidentally trying to queue the single work struct instance
>    twice (which doesn't work)
> 3) Slight alter layout of __ipoib_mcast_schedule_join_thread.  Logic
>    is the same modulo #2, but indenting is reduced and readability
>    increased
> 4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP
> 5) Add a couple missing spinlocks so that we always call the schedule
>    helper with the spinlock held
> 6) Make sure that we only clear the BUSY flag once we have done all the
>    other things we are going to do to the mcast entry, and if possible,
>    only call complete after we have released the spinlock
> 7) Fix the usage of time_before_eq when we should have just used
>    time_before in ipoib_mcast_join_task
> 8) Create/destroy priv->wq in a slightly different point of
>    ipoib_transport_dev_init/ipoib_transport_dev_cleanup
>
> This entire patchset was intended to address the issue of ipoib
> interfaces being brought up/down in a tight loop, which will hardlock
> a standard v3.19 kernel.  It succeeds at resolving that problem.  In
> order to be sure this patchset does not introduce other problems,
> and in order to ensure that this rework of the patches into a new
> set does not break bisectability, this entire patchset has been
> extensively tested, starting with the first patch and going through
> the last.
>
> I used a 12 machine group plus the subnet manager to test these
> patches.
>
> 1 machine ran ifconfig up/ifconfig down tests in a tight loop
> 1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause
>   between insmod and rmmod
> 1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1
>   second pause between insmod and rmmod
> 9 machines that kept their interfaces up and ran iperf servers, 6 also
>   ran ping6 instances to the addresses of all 12 machines, 3 ran iperf
>   clients that sent data to all 9 iperf servers in an infinite loop
> 1 subnet manager machine that otherwise did not participate, but
>   during testing was set to restart opensm once every 30 seconds to
>   force net re-register events on all 12 machines in the group
>
> In addition to the configuration of various machines above to test
> data transfers, the IPoIB infrastructure itself contained several
> elements designed to test specific multicast capabilities.
>
> The primary P_Key, the one with the ping6 instances running on it,
> intentionally had some well-known multicast groups left undefined in
> order to cause failed sendonly multicast joins on
> the same device that needed to work with IPv6 pings as well as
> IPv4 multicast.
>
> One of the alternate P_Key interfaces was defined with a minimum
> rate of 56GBit/s, so all machines without 56GBit/s capability
> were unable to ever join the broadcast group on these P_Keys.
> This was done to make sure that when the broadcast group is not
> joined, no other multicast joins, sendonly or otherwise, are ever
> sent.  It also was done to make sure that failed attempts to join
> the broadcast group honored the backoff delays properly.
>
> Note: both machines that were doing the insmod/rmmod loops were
> changed to not have any P_Key interfaces defined other than the
> default P_Key interface.  It is known that repeated insmod/rmmod
> of the ib_ipoib interface is fragile and easily breaks in the
> presence of child interfaces.  It was not my intent to address
> that particular problem with this patch set and so to avoid false
> issues, child interfaces were removed from the mix on these
> machines.
>
> A wide array of hardware was also tested with this 12 machine group,
> covering mthca, mlx4, mlx5, and qib hardware.
>
> Patches 1 through 6 were tested without the ifconfig/rmmod/opensm
> loops as those particular problems were not expected to be addressed
> until patch 7.  Patches 7 through 9 were tested with all tests.
>
> The final, complete patch set was left running with the various
> tests until it had completed 257 opensm restarts, 12052
> ifconfig up/ifconfig down loops, 765 10 second insmod/rmmod loops,
> and 1971 .1 second insmod/rmmod loops.  The only observed problem
> was that the fast insmod/rmmod loop eventually locked up the
> network stack on the machine.  It was stuck on an rtnl_lock deadlock,
> but not one related to the multicast code (and therefore outside
> the scope of these patches to address).  There are several bits of
> additional locking to be fixed in the overall ipoib code in relation
> to insmod/rmmod races and this patch set does not attempt to address
> those.  It merely attempts not to introduce any new issues while
> resolving the mcast locking issues related to bringing the interface
> up and down.  I feel confident that it does that.
>
> Doug Ledford (9):
>   IB/ipoib: factor out ah flushing
>   IB/ipoib: change init sequence ordering
>   IB/ipoib: Consolidate rtnl_lock tasks in workqueue
>   IB/ipoib: Make the carrier_on_task race aware
>   IB/ipoib: Use dedicated workqueues per interface
>   IB/ipoib: No longer use flush as a parameter
>   IB/ipoib: fix MCAST_FLAG_BUSY usage
>   IB/ipoib: deserialize multicast joins
>   IB/ipoib: drop mcast_mutex usage
>
>  drivers/infiniband/ulp/ipoib/ipoib.h           |  20 +-
>  drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
>  drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  69 ++--
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |  60 +--
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 500 +++++++++++++------------
>  drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  31 +-
>  6 files changed, 389 insertions(+), 309 deletions(-)
>
> --
> 2.1.0
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
       [not found]     ` <CAJ3xEMgj=ATKLt0MA67c3WefCrG1hZ59eSrhpD-u_dxLJe2kfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-02-22 21:56       ` Doug Ledford
       [not found]         ` <1424642176.4847.2.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-02-22 21:56 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 6850 bytes --]

On Sun, 2015-02-22 at 23:34 +0200, Or Gerlitz wrote:
> On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > This is the re-ordered, squashed version of my 22 patch set that I
> > posted on Feb 11.  There are a few minor differences between that
> > set and this one.
> 
> Hi Doug,
> 
> I took a quick look at your git repo at
> git://github.com/dledford/linux.git and it does not seem to contain this
> series. Can you please push it there and tell us which branch to pull?

It's there now, branch for-3.20-squashed.

> Or.
> 
> > They are:
> > 1) Rename __ipoib_mcast_continue_join_thread to
> >    __ipoib_mcast_schedule_join_thread
> > 2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to
> >    avoid us accidentally trying to queue the single work struct instance
> >    twice (which doesn't work)
> > 3) Slightly alter the layout of __ipoib_mcast_schedule_join_thread.  The
> >    logic is the same modulo #2, but indentation is reduced and readability
> >    improved
> > 4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP
> > 5) Add a couple missing spinlocks so that we always call the schedule
> >    helper with the spinlock held
> > 6) Make sure that we only clear the BUSY flag once we have done all the
> >    other things we are going to do to the mcast entry, and if possible,
> >    only call complete after we have released the spinlock
> > 7) Fix the usage of time_before_eq when we should have just used
> >    time_before in ipoib_mcast_join_task
> > 8) Create/destroy priv->wq at a slightly different point in
> >    ipoib_transport_dev_init/ipoib_transport_dev_cleanup
> >
> > This entire patchset was intended to address the issue of ipoib
> > interfaces being brought up/down in a tight loop, which will hardlock
> > a standard v3.19 kernel.  It succeeds at resolving that problem.  In
> > order to be sure this patchset does not introduce other problems,
> > and in order to ensure that this rework of the patches into a new
> > set does not break bisectability, this entire patchset has been
> > extensively tested, starting with the first patch and going through
> > the last.
> >
> > I used a 12 machine group plus the subnet manager to test these
> > patches.
> >
> > 1 machine ran ifconfig up/ifconfig down tests in a tight loop
> > 1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause
> >   between insmod and rmmod
> > 1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1
> >   second pause between insmod and rmmod
> > 9 machines that kept their interfaces up and ran iperf servers, 6 also
> >   ran ping6 instances to the addresses of all 12 machines, 3 ran iperf
> >   clients that sent data to all 9 iperf servers in an infinite loop
> > 1 subnet manager machine that otherwise did not participate, but
> >   during testing was set to restart opensm once every 30 seconds to
> >   force net re-register events on all 12 machines in the group
> >
> > In addition to the configuration of various machines above to test
> > data transfers, the IPoIB infrastructure itself contained several
> > elements designed to test specific multicast capabilities.
> >
> > The primary P_Key, the one with the ping6 instances running on it,
> > intentionally had some well-known multicast groups left undefined in
> > order to cause failed sendonly multicast joins on
> > the same device that needed to work with IPv6 pings as well as
> > IPv4 multicast.
> >
> > One of the alternate P_Key interfaces was defined with a minimum
> > rate of 56GBit/s, so all machines without 56GBit/s capability
> > were unable to ever join the broadcast group on these P_Keys.
> > This was done to make sure that when the broadcast group is not
> > joined, no other multicast joins, sendonly or otherwise, are ever
> > sent.  It also was done to make sure that failed attempts to join
> > the broadcast group honored the backoff delays properly.
> >
> > Note: both machines that were doing the insmod/rmmod loops were
> > changed to not have any P_Key interfaces defined other than the
> > default P_Key interface.  It is known that repeated insmod/rmmod
> > of the ib_ipoib interface is fragile and easily breaks in the
> > presence of child interfaces.  It was not my intent to address
> > that particular problem with this patch set and so to avoid false
> > issues, child interfaces were removed from the mix on these
> > machines.
> >
> > A wide array of hardware was also tested with this 12 machine group,
> > covering mthca, mlx4, mlx5, and qib hardware.
> >
> > Patches 1 through 6 were tested without the ifconfig/rmmod/opensm
> > loops as those particular problems were not expected to be addressed
> > until patch 7.  Patches 7 through 9 were tested with all tests.
> >
> > The final, complete patch set was left running with the various
> > tests until it had completed 257 opensm restarts, 12052
> > ifconfig up/ifconfig down loops, 765 10 second insmod/rmmod loops,
> > and 1971 .1 second insmod/rmmod loops.  The only observed problem
> > was that the fast insmod/rmmod loop eventually locked up the
> > network stack on the machine.  It was stuck on an rtnl_lock deadlock,
> > but not one related to the multicast code (and therefore outside
> > the scope of these patches to address).  There are several bits of
> > additional locking to be fixed in the overall ipoib code in relation
> > to insmod/rmmod races and this patch set does not attempt to address
> > those.  It merely attempts not to introduce any new issues while
> > resolving the mcast locking issues related to bringing the interface
> > up and down.  I feel confident that it does that.
> >
> > Doug Ledford (9):
> >   IB/ipoib: factor out ah flushing
> >   IB/ipoib: change init sequence ordering
> >   IB/ipoib: Consolidate rtnl_lock tasks in workqueue
> >   IB/ipoib: Make the carrier_on_task race aware
> >   IB/ipoib: Use dedicated workqueues per interface
> >   IB/ipoib: No longer use flush as a parameter
> >   IB/ipoib: fix MCAST_FLAG_BUSY usage
> >   IB/ipoib: deserialize multicast joins
> >   IB/ipoib: drop mcast_mutex usage
> >
> >  drivers/infiniband/ulp/ipoib/ipoib.h           |  20 +-
> >  drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
> >  drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  69 ++--
> >  drivers/infiniband/ulp/ipoib/ipoib_main.c      |  60 +--
> >  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 500 +++++++++++++------------
> >  drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  31 +-
> >  6 files changed, 389 insertions(+), 309 deletions(-)
> >
> > --
> > 2.1.0
> >


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
       [not found]         ` <1424642176.4847.2.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-22 21:57           ` Doug Ledford
  0 siblings, 0 replies; 37+ messages in thread
From: Doug Ledford @ 2015-02-22 21:57 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 7307 bytes --]

On Sun, 2015-02-22 at 16:56 -0500, Doug Ledford wrote:
> On Sun, 2015-02-22 at 23:34 +0200, Or Gerlitz wrote:
> > On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > This is the re-ordered, squashed version of my 22 patch set that I
> > > posted on Feb 11.  There are a few minor differences between that
> > > set and this one.
> > 
> > Hi Doug,
> > 
> > I took a quick look at your git repo at
> > git://github.com/dledford/linux.git and it does not seem to contain this
> > series. Can you please push it there and tell us which branch to pull?
> 
> It's there now, branch for-3.20-squashed.

Also, git diff for-3.20..for-3.20-squashed will show the exact
differences between this 9 patch set and the previous 22 patch set.

> > Or.
> > 
> > > They are:
> > > 1) Rename __ipoib_mcast_continue_join_thread to
> > >    __ipoib_mcast_schedule_join_thread
> > > 2) Make __ipoib_mcast_schedule_join_thread cancel any delayed work to
> > >    avoid us accidentally trying to queue the single work struct instance
> > >    twice (which doesn't work)
> > > 3) Slightly alter the layout of __ipoib_mcast_schedule_join_thread.  The
> > >    logic is the same modulo #2, but indentation is reduced and readability
> > >    improved
> > > 4) Switch a few instances of FLAG_ADMIN_UP to FLAG_OPER_UP
> > > 5) Add a couple missing spinlocks so that we always call the schedule
> > >    helper with the spinlock held
> > > 6) Make sure that we only clear the BUSY flag once we have done all the
> > >    other things we are going to do to the mcast entry, and if possible,
> > >    only call complete after we have released the spinlock
> > > 7) Fix the usage of time_before_eq when we should have just used
> > >    time_before in ipoib_mcast_join_task
> > > 8) Create/destroy priv->wq at a slightly different point in
> > >    ipoib_transport_dev_init/ipoib_transport_dev_cleanup
> > >
> > > This entire patchset was intended to address the issue of ipoib
> > > interfaces being brought up/down in a tight loop, which will hardlock
> > > a standard v3.19 kernel.  It succeeds at resolving that problem.  In
> > > order to be sure this patchset does not introduce other problems,
> > > and in order to ensure that this rework of the patches into a new
> > > set does not break bisectability, this entire patchset has been
> > > extensively tested, starting with the first patch and going through
> > > the last.
> > >
> > > I used a 12 machine group plus the subnet manager to test these
> > > patches.
> > >
> > > 1 machine ran ifconfig up/ifconfig down tests in a tight loop
> > > 1 machine ran rmmod/insmod ib_ipoib in a loop with a 10 second pause
> > >   between insmod and rmmod
> > > 1 machine ran rmmod/insmod ib_ipoib in a tight loop with only a .1
> > >   second pause between insmod and rmmod
> > > 9 machines that kept their interfaces up and ran iperf servers, 6 also
> > >   ran ping6 instances to the addresses of all 12 machines, 3 ran iperf
> > >   clients that sent data to all 9 iperf servers in an infinite loop
> > > 1 subnet manager machine that otherwise did not participate, but
> > >   during testing was set to restart opensm once every 30 seconds to
> > >   force net re-register events on all 12 machines in the group
> > >
> > > In addition to the configuration of various machines above to test
> > > data transfers, the IPoIB infrastructure itself contained several
> > > elements designed to test specific multicast capabilities.
> > >
> > > The primary P_Key, the one with the ping6 instances running on it,
> > > intentionally had some well-known multicast groups left undefined in
> > > order to cause failed sendonly multicast joins on
> > > the same device that needed to work with IPv6 pings as well as
> > > IPv4 multicast.
> > >
> > > One of the alternate P_Key interfaces was defined with a minimum
> > > rate of 56GBit/s, so all machines without 56GBit/s capability
> > > were unable to ever join the broadcast group on these P_Keys.
> > > This was done to make sure that when the broadcast group is not
> > > joined, no other multicast joins, sendonly or otherwise, are ever
> > > sent.  It also was done to make sure that failed attempts to join
> > > the broadcast group honored the backoff delays properly.
> > >
> > > Note: both machines that were doing the insmod/rmmod loops were
> > > changed to not have any P_Key interfaces defined other than the
> > > default P_Key interface.  It is known that repeated insmod/rmmod
> > > of the ib_ipoib interface is fragile and easily breaks in the
> > > presence of child interfaces.  It was not my intent to address
> > > that particular problem with this patch set and so to avoid false
> > > issues, child interfaces were removed from the mix on these
> > > machines.
> > >
> > > A wide array of hardware was also tested with this 12 machine group,
> > > covering mthca, mlx4, mlx5, and qib hardware.
> > >
> > > Patches 1 through 6 were tested without the ifconfig/rmmod/opensm
> > > loops as those particular problems were not expected to be addressed
> > > until patch 7.  Patches 7 through 9 were tested with all tests.
> > >
> > > The final, complete patch set was left running with the various
> > > tests until it had completed 257 opensm restarts, 12052
> > > ifconfig up/ifconfig down loops, 765 10 second insmod/rmmod loops,
> > > and 1971 .1 second insmod/rmmod loops.  The only observed problem
> > > was that the fast insmod/rmmod loop eventually locked up the
> > > network stack on the machine.  It was stuck on an rtnl_lock deadlock,
> > > but not one related to the multicast code (and therefore outside
> > > the scope of these patches to address).  There are several bits of
> > > additional locking to be fixed in the overall ipoib code in relation
> > > to insmod/rmmod races and this patch set does not attempt to address
> > > those.  It merely attempts not to introduce any new issues while
> > > resolving the mcast locking issues related to bringing the interface
> > > up and down.  I feel confident that it does that.
> > >
> > > Doug Ledford (9):
> > >   IB/ipoib: factor out ah flushing
> > >   IB/ipoib: change init sequence ordering
> > >   IB/ipoib: Consolidate rtnl_lock tasks in workqueue
> > >   IB/ipoib: Make the carrier_on_task race aware
> > >   IB/ipoib: Use dedicated workqueues per interface
> > >   IB/ipoib: No longer use flush as a parameter
> > >   IB/ipoib: fix MCAST_FLAG_BUSY usage
> > >   IB/ipoib: deserialize multicast joins
> > >   IB/ipoib: drop mcast_mutex usage
> > >
> > >  drivers/infiniband/ulp/ipoib/ipoib.h           |  20 +-
> > >  drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
> > >  drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  69 ++--
> > >  drivers/infiniband/ulp/ipoib/ipoib_main.c      |  60 +--
> > >  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 500 +++++++++++++------------
> > >  drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  31 +-
> > >  6 files changed, 389 insertions(+), 309 deletions(-)
> > >
> > > --
> > > 2.1.0
> > >
> 
> 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD





* Re: [PATCH 5/9] IB/ipoib: Use dedicated workqueues per interface
       [not found]     ` <1cfdf15058cea312f07c2907490a1d7300603c40.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-23 16:48       ` Or Gerlitz
  0 siblings, 0 replies; 37+ messages in thread
From: Or Gerlitz @ 2015-02-23 16:48 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit

On Sun, Feb 22, 2015 at 2:27 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> @@ -236,12 +247,19 @@ out_free_send_cq:
>  out_free_recv_cq:
>         ib_destroy_cq(priv->recv_cq);
>
> +out_cm_dev_cleanup:
> +       ipoib_cm_dev_cleanup(dev);
> +
> +out_free_wq:
> +       destroy_workqueue(priv->wq);
> +       priv->wq = NULL;
> +
>  out_free_mr:
>         ib_dereg_mr(priv->mr);
> -       ipoib_cm_dev_cleanup(dev);
>
>  out_free_pd:
>         ib_dealloc_pd(priv->pd);
> +
>         return -ENODEV;
>  }

Just some quick initial feedback to be fixed in v1 of the reworked
series: please avoid the random addition/deletion of blank lines as part
of a patch that fixes bug X or introduces feature Y.

>
> @@ -265,11 +283,18 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
>
>         ipoib_cm_dev_cleanup(dev);
>
> +       if (priv->wq) {
> +               flush_workqueue(priv->wq);
> +               destroy_workqueue(priv->wq);
> +               priv->wq = NULL;
> +       }
> +
>         if (ib_dereg_mr(priv->mr))
>                 ipoib_warn(priv, "ib_dereg_mr failed\n");
>
>         if (ib_dealloc_pd(priv->pd))
>                 ipoib_warn(priv, "ib_dealloc_pd failed\n");
> +
>  }

here too


* Re: [PATCH 9/9] IB/ipoib: drop mcast_mutex usage
       [not found]     ` <767f4c41779db63ce8c6dbba04b21959aba70ef9.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-23 16:56       ` Or Gerlitz
       [not found]         ` <CAJ3xEMgLPF9pCwQDy9QyL9fAERJXJRXN2gBj3nhuXUCcbfCMPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Or Gerlitz @ 2015-02-23 16:56 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit

On Sun, Feb 22, 2015 at 2:27 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> We needed the mcast_mutex when we had to prevent the join completion
> callback from having the value it stored in mcast->mc overwritten

Downstream patches of this series (7/9 and 8/9) make fairly heavy use
of the mcast_mutex (e.g. they add/delete lines that use it), and patch
9/9 then removes it altogether, which would be very confusing for
maintenance purposes. Is there a sane way to avoid that?


* Re: [PATCH 9/9] IB/ipoib: drop mcast_mutex usage
       [not found]         ` <CAJ3xEMgLPF9pCwQDy9QyL9fAERJXJRXN2gBj3nhuXUCcbfCMPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-02-23 17:41           ` Doug Ledford
  0 siblings, 0 replies; 37+ messages in thread
From: Doug Ledford @ 2015-02-23 17:41 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 1279 bytes --]

On Mon, 2015-02-23 at 18:56 +0200, Or Gerlitz wrote:
> On Sun, Feb 22, 2015 at 2:27 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > We needed the mcast_mutex when we had to prevent the join completion
> > callback from having the value it stored in mcast->mc overwritten
> 
> downstream patches of this series (7/9 and 8/9) make pretty much heavy
> usage of the mcast_mutex (e.g add/delete lines that use it), and patch
> 9/9 removes it altogether.. which would be very confusing for
> maintaining purposes. Is there a sane way to avoid that?!

No.  The changes that make dropping the mutex possible are part of patch
7.  Patch 7 changes the semantics of the MCAST_FLAG_BUSY usage, and
fixes some locking bugs, but that's different than wholesale changing of
the locking type.  If you want to preserve bisectability and be able to
test the semantic changes to the FLAG_BUSY usage separate from the
changes to the locking type, then they have to be separated.  So, for
the sake of good engineering practices and separation of distinctly
different types of changes, that locking change should not be folded
into patch 7.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD





* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]     ` <b06eb720c2f654f5ecdb72c66f4e89149d1c24ec.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-02-26 13:28       ` Erez Shitrit
       [not found]         ` <54EF1F67.4000001-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Erez Shitrit @ 2015-02-26 13:28 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit

On 2/22/2015 2:26 AM, Doug Ledford wrote:
> Create ipoib_flush_ah and ipoib_stop_ah routines to use at
> appropriate times to flush out all remaining ah entries before we shut
> the device down.
>
> Because neighbors and mcast entries can each have a reference on any
> given ah, we must make sure to free all of those first before our ah
> will actually have a 0 refcount and be able to be reaped.
>
> This factoring is needed in preparation for having per-device work
> queues.  The original per-device workqueue code resulted in the following
> error message:
>
> <ibdev>: ib_dealloc_pd failed
>
> That error was tracked down to this issue.  With the changes to which
> workqueues were flushed when, there were no flushes of the per device
> workqueue after the last ah's were freed, resulting in an attempt to
> dealloc the pd with outstanding resources still allocated.  This code
> puts the explicit flushes in the needed places to avoid that problem.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>   drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 ++++++++++++++++++++-------------
>   1 file changed, 28 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> index 72626c34817..cb02466a0eb 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
>   				   round_jiffies_relative(HZ));
>   }
>   
> +static void ipoib_flush_ah(struct net_device *dev, int flush)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +
> +	cancel_delayed_work(&priv->ah_reap_task);
> +	if (flush)
> +		flush_workqueue(ipoib_workqueue);
> +	ipoib_reap_ah(&priv->ah_reap_task.work);
> +}
> +
> +static void ipoib_stop_ah(struct net_device *dev, int flush)
> +{
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +
> +	set_bit(IPOIB_STOP_REAPER, &priv->flags);
> +	ipoib_flush_ah(dev, flush);
> +}
> +
>   static void ipoib_ib_tx_timer_func(unsigned long ctx)
>   {
>   	drain_tx_cq((struct net_device *)ctx);
> @@ -877,24 +895,7 @@ timeout:
>   	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
>   		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
>   
> -	/* Wait for all AHs to be reaped */
> -	set_bit(IPOIB_STOP_REAPER, &priv->flags);
> -	cancel_delayed_work(&priv->ah_reap_task);
> -	if (flush)
> -		flush_workqueue(ipoib_workqueue);
> -
> -	begin = jiffies;
> -
> -	while (!list_empty(&priv->dead_ahs)) {
> -		__ipoib_reap_ah(dev);
> -
> -		if (time_after(jiffies, begin + HZ)) {
> -			ipoib_warn(priv, "timing out; will leak address handles\n");
> -			break;
> -		}
> -
> -		msleep(1);
> -	}
> +	ipoib_flush_ah(dev, flush);
>   
>   	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
>   
> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
>   	if (level == IPOIB_FLUSH_LIGHT) {
>   		ipoib_mark_paths_invalid(dev);
>   		ipoib_mcast_dev_flush(dev);
> +		ipoib_flush_ah(dev, 0);

Why do you need to call the flush function here?
I can't see a reason to call the flush outside of stop_ah; without
setting IPOIB_STOP_REAPER, the flush can queue the same work twice.

>   	}
>   
>   	if (level >= IPOIB_FLUSH_NORMAL)
> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
>   	ipoib_mcast_stop_thread(dev, 1);
>   	ipoib_mcast_dev_flush(dev);
>   
> +	/*
> +	 * All of our ah references aren't free until after
> +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
> +	 * the neighbor garbage collection is stopped and reaped.
> +	 * That should all be done now, so make a final ah flush.
> +	 */
> +	ipoib_stop_ah(dev, 1);
> +
>   	ipoib_transport_dev_cleanup(dev);
>   }
>   



* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]         ` <54EF1F67.4000001-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-02-26 16:27           ` Doug Ledford
       [not found]             ` <1424968046.2543.18.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-02-26 16:27 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 6245 bytes --]

On Thu, 2015-02-26 at 15:28 +0200, Erez Shitrit wrote:
> On 2/22/2015 2:26 AM, Doug Ledford wrote:
> > Create ipoib_flush_ah and ipoib_stop_ah routines to use at
> > appropriate times to flush out all remaining ah entries before we shut
> > the device down.
> >
> > Because neighbors and mcast entries can each have a reference on any
> > given ah, we must make sure to free all of those first before our ah
> > will actually have a 0 refcount and be able to be reaped.
> >
> > This factoring is needed in preparation for having per-device work
> > queues.  The original per-device workqueue code resulted in the following
> > error message:
> >
> > <ibdev>: ib_dealloc_pd failed
> >
> > That error was tracked down to this issue.  With the changes to which
> > workqueues were flushed when, there were no flushes of the per device
> > workqueue after the last ah's were freed, resulting in an attempt to
> > dealloc the pd with outstanding resources still allocated.  This code
> > puts the explicit flushes in the needed places to avoid that problem.
> >
> > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >   drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 ++++++++++++++++++++-------------
> >   1 file changed, 28 insertions(+), 18 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > index 72626c34817..cb02466a0eb 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
> > @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
> >   				   round_jiffies_relative(HZ));
> >   }
> >   
> > +static void ipoib_flush_ah(struct net_device *dev, int flush)
> > +{
> > +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +
> > +	cancel_delayed_work(&priv->ah_reap_task);
> > +	if (flush)
> > +		flush_workqueue(ipoib_workqueue);
> > +	ipoib_reap_ah(&priv->ah_reap_task.work);
> > +}
> > +
> > +static void ipoib_stop_ah(struct net_device *dev, int flush)
> > +{
> > +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > +
> > +	set_bit(IPOIB_STOP_REAPER, &priv->flags);
> > +	ipoib_flush_ah(dev, flush);
> > +}
> > +
> >   static void ipoib_ib_tx_timer_func(unsigned long ctx)
> >   {
> >   	drain_tx_cq((struct net_device *)ctx);
> > @@ -877,24 +895,7 @@ timeout:
> >   	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
> >   		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
> >   
> > -	/* Wait for all AHs to be reaped */
> > -	set_bit(IPOIB_STOP_REAPER, &priv->flags);
> > -	cancel_delayed_work(&priv->ah_reap_task);
> > -	if (flush)
> > -		flush_workqueue(ipoib_workqueue);
> > -
> > -	begin = jiffies;
> > -
> > -	while (!list_empty(&priv->dead_ahs)) {
> > -		__ipoib_reap_ah(dev);
> > -
> > -		if (time_after(jiffies, begin + HZ)) {
> > -			ipoib_warn(priv, "timing out; will leak address handles\n");
> > -			break;
> > -		}
> > -
> > -		msleep(1);
> > -	}
> > +	ipoib_flush_ah(dev, flush);
> >   
> >   	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
> >   
> > @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
> >   	if (level == IPOIB_FLUSH_LIGHT) {
> >   		ipoib_mark_paths_invalid(dev);
> >   		ipoib_mcast_dev_flush(dev);
> > +		ipoib_flush_ah(dev, 0);
> 
> Why do you need to call the flush function here?

To remove all of the ah's that were reduced to a 0 refcount by the
previous two functions prior to restarting operations.  When we remove
an ah, it calls ib_destroy_ah which calls all the way down into the low
level driver.  This was to make sure that old, stale data was removed
all the way down to the card level before we started new queries for
paths and ahs.

> I can't see a reason to call the flush outside of stop_ah; without
> setting IPOIB_STOP_REAPER, the flush can queue the same work twice.

No, it can't.  The ah flush routine does not search through ahs to find
ones to flush.  When you delete neighbors and mcasts, they release their
references to ahs.  When the refcount goes to 0, the put routine puts
the ah on the to-be-deleted ah list.  All the flush does is take that
list and delete the items.  If you run the flush twice, the first run
deletes all the items on the to-be-deleted list, the second run sees an
empty list and does nothing.

As for using flush versus stop: the flush function cancels any delayed
ah_flush work so that it isn't racing with the normally scheduled
ah_flush, then flushes the workqueue to make sure anything that might
result in an ah getting freed is done, then flushes, then schedules a
new delayed flush_ah work 1 second later.  So, it does exactly what a
flush should do: it removes what there is currently to remove, and in
the case of a periodically scheduled garbage collection, schedules a new
periodic flush at the maximum interval.

It is not appropriate to call stop_ah at this point because it will
cancel the delayed work, flush the ahs, then never reschedule the
garbage collection.  If we called stop here, we would have to call start
later.  But that's not really necessary as the flush cancels the
scheduled work and reschedules it for a second later.

> >   	}
> >   
> >   	if (level >= IPOIB_FLUSH_NORMAL)
> > @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
> >   	ipoib_mcast_stop_thread(dev, 1);
> >   	ipoib_mcast_dev_flush(dev);
> >   
> > +	/*
> > +	 * All of our ah references aren't free until after
> > +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
> > +	 * the neighbor garbage collection is stopped and reaped.
> > +	 * That should all be done now, so make a final ah flush.
> > +	 */
> > +	ipoib_stop_ah(dev, 1);
> > +
> >   	ipoib_transport_dev_cleanup(dev);
> >   }
> >   
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD




* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]             ` <1424968046.2543.18.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-03-01  6:47               ` Erez Shitrit
       [not found]                 ` <54F2B61C.9080308-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Erez Shitrit @ 2015-03-01  6:47 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

On 2/26/2015 6:27 PM, Doug Ledford wrote:
> On Thu, 2015-02-26 at 15:28 +0200, Erez Shitrit wrote:
>> On 2/22/2015 2:26 AM, Doug Ledford wrote:
>>> Create ipoib_flush_ah and ipoib_stop_ah routines to use at
>>> appropriate times to flush out all remaining ah entries before we shut
>>> the device down.
>>>
>>> Because neighbors and mcast entries can each have a reference on any
>>> given ah, we must make sure to free all of those first before our ah
>>> will actually have a 0 refcount and be able to be reaped.
>>>
>>> This factoring is needed in preparation for having per-device work
>>> queues.  The original per-device workqueue code resulted in the following
>>> error message:
>>>
>>> <ibdev>: ib_dealloc_pd failed
>>>
>>> That error was tracked down to this issue.  With the changes to which
>>> workqueues were flushed when, there were no flushes of the per device
>>> workqueue after the last ah's were freed, resulting in an attempt to
>>> dealloc the pd with outstanding resources still allocated.  This code
>>> puts the explicit flushes in the needed places to avoid that problem.
>>>
>>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>    drivers/infiniband/ulp/ipoib/ipoib_ib.c | 46 ++++++++++++++++++++-------------
>>>    1 file changed, 28 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
>>> index 72626c34817..cb02466a0eb 100644
>>> --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
>>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
>>> @@ -659,6 +659,24 @@ void ipoib_reap_ah(struct work_struct *work)
>>>    				   round_jiffies_relative(HZ));
>>>    }
>>>    
>>> +static void ipoib_flush_ah(struct net_device *dev, int flush)
>>> +{
>>> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
>>> +
>>> +	cancel_delayed_work(&priv->ah_reap_task);
>>> +	if (flush)
>>> +		flush_workqueue(ipoib_workqueue);
>>> +	ipoib_reap_ah(&priv->ah_reap_task.work);
>>> +}
>>> +
>>> +static void ipoib_stop_ah(struct net_device *dev, int flush)
>>> +{
>>> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
>>> +
>>> +	set_bit(IPOIB_STOP_REAPER, &priv->flags);
>>> +	ipoib_flush_ah(dev, flush);
>>> +}
>>> +
>>>    static void ipoib_ib_tx_timer_func(unsigned long ctx)
>>>    {
>>>    	drain_tx_cq((struct net_device *)ctx);
>>> @@ -877,24 +895,7 @@ timeout:
>>>    	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
>>>    		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
>>>    
>>> -	/* Wait for all AHs to be reaped */
>>> -	set_bit(IPOIB_STOP_REAPER, &priv->flags);
>>> -	cancel_delayed_work(&priv->ah_reap_task);
>>> -	if (flush)
>>> -		flush_workqueue(ipoib_workqueue);
>>> -
>>> -	begin = jiffies;
>>> -
>>> -	while (!list_empty(&priv->dead_ahs)) {
>>> -		__ipoib_reap_ah(dev);
>>> -
>>> -		if (time_after(jiffies, begin + HZ)) {
>>> -			ipoib_warn(priv, "timing out; will leak address handles\n");
>>> -			break;
>>> -		}
>>> -
>>> -		msleep(1);
>>> -	}
>>> +	ipoib_flush_ah(dev, flush);
>>>    
>>>    	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
>>>    
>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
>>>    	if (level == IPOIB_FLUSH_LIGHT) {
>>>    		ipoib_mark_paths_invalid(dev);
>>>    		ipoib_mcast_dev_flush(dev);
>>> +		ipoib_flush_ah(dev, 0);
>> Why do you need to call the flush function here?
> To remove all of the ah's that were reduced to a 0 refcount by the
> previous two functions prior to restarting operations.  When we remove
> an ah, it calls ib_destroy_ah which calls all the way down into the low
> level driver.  This was to make sure that old, stale data was removed
> all the way down to the card level before we started new queries for
> paths and ahs.

Yes, but it is not needed.
The bug happened when the driver was removed (via modprobe -r etc.) while
there were ah's on the dead_ah list; you already fixed that in
ipoib_ib_dev_cleanup. The call you added here is not relevant to that bug
(and IMHO is not needed at all).
The task that cleans the dead_ah list is already scheduled, so there is no
need to call it again; it will run anyway within 1 second at most.

You can try it: take out that call, and no harm or memory leak will happen.

>> I can't see a reason to call the flush outside of stop_ah; without
>> setting IPOIB_STOP_REAPER, the flush can queue the same work twice.
> No, it can't.  The ah flush routine does not search through ahs to find
> ones to flush.  When you delete neighbors and mcasts, they release their
> references to ahs.  When the refcount goes to 0, the put routine puts
> the ah on the to-be-deleted ah list.  All the flush does is take that
> list and delete the items.  If you run the flush twice, the first run
> deletes all the items on the to-be-deleted list, the second run sees an
> empty list and does nothing.
>
> As for using flush versus stop: the flush function cancels any delayed
> ah_flush work so that it isn't racing with the normally scheduled

Calling cancel_delayed_work() on a work item that can reschedule itself
does not help: the work can be in the middle of running and re-schedule
itself after the cancel...


> ah_flush, then flushes the workqueue to make sure anything that might
> result in an ah getting freed is done, then flushes, then schedules a
> new delayed flush_ah work 1 second later.  So, it does exactly what a
> flush should do: it removes what there is currently to remove, and in
> the case of a periodically scheduled garbage collection, schedules a new
> periodic flush at the maximum interval.
>
> It is not appropriate to call stop_ah at this point because it will
> cancel the delayed work, flush the ahs, then never reschedule the
> garbage collection.  If we called stop here, we would have to call start
> later.  But that's not really necessary as the flush cancels the
> scheduled work and reschedules it for a second later.
>
>>>    	}
>>>    
>>>    	if (level >= IPOIB_FLUSH_NORMAL)
>>> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
>>>    	ipoib_mcast_stop_thread(dev, 1);
>>>    	ipoib_mcast_dev_flush(dev);
>>>    
>>> +	/*
>>> +	 * All of our ah references aren't free until after
>>> +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
>>> +	 * the neighbor garbage collection is stopped and reaped.
>>> +	 * That should all be done now, so make a final ah flush.
>>> +	 */
>>> +	ipoib_stop_ah(dev, 1);
>>> +
>>>    	ipoib_transport_dev_cleanup(dev);
>>>    }
>>>    
>



* Re: [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage
       [not found]     ` <9d657f64ee961ee3b3233520d8b499b234a42bcd.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-03-01  9:31       ` Erez Shitrit
       [not found]         ` <54F2DC81.304-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Erez Shitrit @ 2015-03-01  9:31 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit

On 2/22/2015 2:27 AM, Doug Ledford wrote:
> Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
> objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
> in how it was used.  We didn't always initialize the completion struct
> before we set the flag, and we didn't always call complete on the
> completion struct from all paths that complete it.  And when we did
> complete it, sometimes we continued to touch the mcast entry after
> the completion, opening us up to possible use after free issues.
>
> This made it less than totally effective, and certainly made its use
> confusing.  And in the flush function we would use the presence of this
> flag to signal that we should wait on the completion struct, but we never
> cleared this flag, ever.
>
> In order to make things clearer and aid in resolving the rtnl deadlock
> bug I've been chasing, I cleaned this up a bit.
>
>   1) Remove the MCAST_JOIN_STARTED flag entirely
>   2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
>   3) Test mcast->mc directly to see if we have completed
>      ib_sa_join_multicast (using IS_ERR_OR_NULL)
>   4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
>      the mcast->done completion struct
>   5) Make sure that before calling complete(&mcast->done), we always clear
>      the MCAST_FLAG_BUSY bit
>   6) Take the mcast_mutex before we call ib_sa_multicast_join and also
>      take the mutex in our join callback.  This forces
>      ib_sa_multicast_join to return and set mcast->mc before we process
>      the callback.  This way, our callback can safely clear mcast->mc
>      if there is an error on the join and we will do the right thing as
>      a result in mcast_dev_flush.
>   7) Because we need the mutex to synchronize mcast->mc, we can no
>      longer call mcast_sendonly_join directly from mcast_send and
>      instead must add sendonly join processing to the mcast_join_task
>   8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
>      we have a running task.  We know when we need to reschedule our
>      join task thread and don't need a flag to tell us.
>   9) Add a helper for rescheduling the join task thread
>
> A number of different races are resolved with these changes.  These
> races existed with the old MCAST_FLAG_BUSY usage, the
> MCAST_JOIN_STARTED flag was an attempt to address them, and while it
> helped, a determined effort could still trip things up.
>
> One race looks something like this:
>
> Thread 1                             Thread 2
> ib_sa_join_multicast (as part of running restart mcast task)
>    alloc member
>    call callback
>                                       ifconfig ib0 down
> 				     wait_for_completion
>      callback call completes
>                                       wait_for_completion in
> 				     mcast_dev_flush completes
> 				       mcast->mc is PTR_ERR_OR_NULL
> 				       so we skip ib_sa_leave_multicast
>      return from callback
>    return from ib_sa_join_multicast
> set mcast->mc = return from ib_sa_multicast
>
> We now have a permanently unbalanced join/leave issue that trips up the
> refcounting in core/multicast.c
>
> Another like this:
>
> Thread 1                   Thread 2         Thread 3
> ib_sa_multicast_join
>                                              ifconfig ib0 down
> 					    priv->broadcast = NULL
>                             join_complete
> 			                    wait_for_completion
> 			   mcast->mc is not yet set, so don't clear
> return from ib_sa_join_multicast and set mcast->mc
> 			   complete
> 			   return -EAGAIN (making mcast->mc invalid)
> 			   		    call ib_sa_multicast_leave
> 					    on invalid mcast->mc, hang
> 					    forever
>
> By holding the mutex around ib_sa_multicast_join and taking the mutex
> early in the callback, we force mcast->mc to be valid at the time we
> run the callback.  This allows us to clear mcast->mc if there is an
> error and the join is going to fail.  We do this before we complete
> the mcast.  In this way, mcast_dev_flush always sees consistent state
> in regards to mcast->mc membership at the time that the
> wait_for_completion() returns.
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>   drivers/infiniband/ulp/ipoib/ipoib.h           |  11 +-
>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 355 ++++++++++++++++---------
>   2 files changed, 238 insertions(+), 128 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
> index 9ef432ae72e..c79dcd5ee8a 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib.h
> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> @@ -98,9 +98,15 @@ enum {
>   
>   	IPOIB_MCAST_FLAG_FOUND	  = 0,	/* used in set_multicast_list */
>   	IPOIB_MCAST_FLAG_SENDONLY = 1,
> -	IPOIB_MCAST_FLAG_BUSY	  = 2,	/* joining or already joined */
> +	/*
> +	 * For IPOIB_MCAST_FLAG_BUSY
> +	 * When set, in flight join and mcast->mc is unreliable
> +	 * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
> +	 *   haven't started yet
> +	 * When clear and mcast->mc is valid pointer, join was successful
> +	 */
> +	IPOIB_MCAST_FLAG_BUSY	  = 2,
>   	IPOIB_MCAST_FLAG_ATTACHED = 3,
> -	IPOIB_MCAST_JOIN_STARTED  = 4,
>   
>   	MAX_SEND_CQE		  = 16,
>   	IPOIB_CM_COPYBREAK	  = 256,
> @@ -148,6 +154,7 @@ struct ipoib_mcast {
>   
>   	unsigned long created;
>   	unsigned long backoff;
> +	unsigned long delay_until;
>   
>   	unsigned long flags;
>   	unsigned char logcount;
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index bb1b69904f9..277e7ac7c4d 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -66,6 +66,48 @@ struct ipoib_mcast_iter {
>   	unsigned int       send_only;
>   };
>   
> +/*
> + * This should be called with the mcast_mutex held
> + */
> +static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv,
> +					       struct ipoib_mcast *mcast,
> +					       bool delay)
> +{
> +	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))

You don't need the IPOIB_MCAST_RUN flag; it duplicates IPOIB_FLAG_OPER_UP.
It should probably be removed everywhere (including the ipoib.h file).

> +		return;
> +
> +	/*
> +	 * We will be scheduling *something*, so cancel whatever is
> +	 * currently scheduled first
> +	 */
> +	cancel_delayed_work(&priv->mcast_task);
> +	if (mcast && delay) {
> +		/*
> +		 * We had a failure and want to schedule a retry later
> +		 */
> +		mcast->backoff *= 2;
> +		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
> +			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
> +		mcast->delay_until = jiffies + (mcast->backoff * HZ);
> +		/*
> +		 * Mark this mcast for its delay, but restart the
> +		 * task immediately.  The join task will make sure to
> +		 * clear out all entries without delays, and then
> +		 * schedule itself to run again when the earliest
> +		 * delay expires
> +		 */
> +		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
> +	} else if (delay) {
> +		/*
> +		 * Special case of retrying after a failure to
> +		 * allocate the broadcast multicast group, wait
> +		 * 1 second and try again
> +		 */
> +		queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
> +	} else
> +		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
> +}
> +
>   static void ipoib_mcast_free(struct ipoib_mcast *mcast)
>   {
>   	struct net_device *dev = mcast->dev;
> @@ -103,6 +145,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev,
>   
>   	mcast->dev = dev;
>   	mcast->created = jiffies;
> +	mcast->delay_until = jiffies;
>   	mcast->backoff = 1;
>   
>   	INIT_LIST_HEAD(&mcast->list);
> @@ -270,17 +313,31 @@ ipoib_mcast_sendonly_join_complete(int status,
>   {
>   	struct ipoib_mcast *mcast = multicast->context;
>   	struct net_device *dev = mcast->dev;
> +	struct ipoib_dev_priv *priv = netdev_priv(dev);
> +
> +	/*
> +	 * We have to take the mutex to force mcast_sendonly_join to
> +	 * return from ib_sa_multicast_join and set mcast->mc to a
> +	 * valid value.  Otherwise we were racing with ourselves in
> +	 * that we might fail here, but get a valid return from
> +	 * ib_sa_multicast_join after we had cleared mcast->mc here,
> +	 * resulting in mis-matched joins and leaves and a deadlock
> +	 */
> +	mutex_lock(&mcast_mutex);
>   
>   	/* We trap for port events ourselves. */
> -	if (status == -ENETRESET)
> -		return 0;
> +	if (status == -ENETRESET) {
> +		status = 0;
> +		goto out;
> +	}
>   
>   	if (!status)
>   		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
>   
>   	if (status) {
>   		if (mcast->logcount++ < 20)
> -			ipoib_dbg_mcast(netdev_priv(dev), "multicast join failed for %pI6, status %d\n",
> +			ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
> +					"join failed for %pI6, status %d\n",
>   					mcast->mcmember.mgid.raw, status);
>   
>   		/* Flush out any queued packets */
> @@ -290,11 +347,18 @@ ipoib_mcast_sendonly_join_complete(int status,
>   			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
>   		}
>   		netif_tx_unlock_bh(dev);
> -
> -		/* Clear the busy flag so we try again */
> -		status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY,
> -					    &mcast->flags);
> +		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
> +	} else {
> +		mcast->backoff = 1;
> +		mcast->delay_until = jiffies;
> +		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
>   	}
> +out:
> +	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> +	if (status)
> +		mcast->mc = NULL;
> +	complete(&mcast->done);
> +	mutex_unlock(&mcast_mutex);
>   	return status;
>   }
>   
> @@ -312,19 +376,18 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
>   	int ret = 0;
>   
>   	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
> -		ipoib_dbg_mcast(priv, "device shutting down, no multicast joins\n");
> +		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
> +				"multicast joins\n");
> +		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> +		complete(&mcast->done);
>   		return -ENODEV;
>   	}
>   
> -	if (test_and_set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
> -		ipoib_dbg_mcast(priv, "multicast entry busy, skipping\n");
> -		return -EBUSY;
> -	}
> -
>   	rec.mgid     = mcast->mcmember.mgid;
>   	rec.port_gid = priv->local_gid;
>   	rec.pkey     = cpu_to_be16(priv->pkey);
>   
> +	mutex_lock(&mcast_mutex);
>   	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
>   					 priv->port, &rec,
>   					 IB_SA_MCMEMBER_REC_MGID	|
> @@ -337,12 +400,14 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
>   	if (IS_ERR(mcast->mc)) {
>   		ret = PTR_ERR(mcast->mc);
>   		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> -		ipoib_warn(priv, "ib_sa_join_multicast failed (ret = %d)\n",
> -			   ret);
> +		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
> +			   "failed (ret = %d)\n", ret);
> +		complete(&mcast->done);
>   	} else {
> -		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting join\n",
> -				mcast->mcmember.mgid.raw);
> +		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
> +				"sendonly join\n", mcast->mcmember.mgid.raw);
>   	}
> +	mutex_unlock(&mcast_mutex);
>   
>   	return ret;
>   }
> @@ -390,6 +455,16 @@ static int ipoib_mcast_join_complete(int status,
>   	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
>   			mcast->mcmember.mgid.raw, status);
>   
> +	/*
> +	 * We have to take the mutex to force mcast_join to
> +	 * return from ib_sa_multicast_join and set mcast->mc to a
> +	 * valid value.  Otherwise we were racing with ourselves in
> +	 * that we might fail here, but get a valid return from
> +	 * ib_sa_multicast_join after we had cleared mcast->mc here,
> +	 * resulting in mis-matched joins and leaves and a deadlock
> +	 */
> +	mutex_lock(&mcast_mutex);
> +
>   	/* We trap for port events ourselves. */
>   	if (status == -ENETRESET) {
>   		status = 0;
> @@ -401,10 +476,8 @@ static int ipoib_mcast_join_complete(int status,
>   
>   	if (!status) {
>   		mcast->backoff = 1;
> -		mutex_lock(&mcast_mutex);
> -		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
> -		mutex_unlock(&mcast_mutex);
> +		mcast->delay_until = jiffies;
> +		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
>   
>   		/*
>   		 * Defer carrier on work to priv->wq to avoid a
> @@ -412,37 +485,26 @@ static int ipoib_mcast_join_complete(int status,
>   		 */
>   		if (mcast == priv->broadcast)
>   			queue_work(priv->wq, &priv->carrier_on_task);
> -
> -		status = 0;
> -		goto out;
> -	}
> -
> -	if (mcast->logcount++ < 20) {
> -		if (status == -ETIMEDOUT || status == -EAGAIN) {
> -			ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
> -					mcast->mcmember.mgid.raw, status);
> -		} else {
> -			ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
> -				   mcast->mcmember.mgid.raw, status);
> +	} else {
> +		if (mcast->logcount++ < 20) {
> +			if (status == -ETIMEDOUT || status == -EAGAIN) {
> +				ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
> +						mcast->mcmember.mgid.raw, status);
> +			} else {
> +				ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
> +					   mcast->mcmember.mgid.raw, status);
> +			}
>   		}
> -	}
> -
> -	mcast->backoff *= 2;
> -	if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
> -		mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
>   
> -	/* Clear the busy flag so we try again */
> -	status = test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> -
> -	mutex_lock(&mcast_mutex);
> -	spin_lock_irq(&priv->lock);
> -	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -		queue_delayed_work(priv->wq, &priv->mcast_task,
> -				   mcast->backoff * HZ);
> -	spin_unlock_irq(&priv->lock);
> -	mutex_unlock(&mcast_mutex);
> +		/* Requeue this join task with a backoff delay */
> +		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
> +	}
>   out:
> +	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> +	if (status)
> +		mcast->mc = NULL;
>   	complete(&mcast->done);
> +	mutex_unlock(&mcast_mutex);
>   	return status;
>   }
>   
> @@ -491,29 +553,18 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
>   		rec.hop_limit	  = priv->broadcast->mcmember.hop_limit;
>   	}
>   
> -	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> -	init_completion(&mcast->done);
> -	set_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags);
> -
> +	mutex_lock(&mcast_mutex);
>   	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
>   					 &rec, comp_mask, GFP_KERNEL,
>   					 ipoib_mcast_join_complete, mcast);
>   	if (IS_ERR(mcast->mc)) {
>   		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> -		complete(&mcast->done);
>   		ret = PTR_ERR(mcast->mc);
>   		ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
> -
> -		mcast->backoff *= 2;
> -		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
> -			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
> -
> -		mutex_lock(&mcast_mutex);
> -		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -			queue_delayed_work(priv->wq, &priv->mcast_task,
> -					   mcast->backoff * HZ);
> -		mutex_unlock(&mcast_mutex);
> +		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
> +		complete(&mcast->done);
>   	}
> +	mutex_unlock(&mcast_mutex);
>   }
>   
>   void ipoib_mcast_join_task(struct work_struct *work)
> @@ -522,6 +573,9 @@ void ipoib_mcast_join_task(struct work_struct *work)
>   		container_of(work, struct ipoib_dev_priv, mcast_task.work);
>   	struct net_device *dev = priv->dev;
>   	struct ib_port_attr port_attr;
> +	unsigned long delay_until = 0;
> +	struct ipoib_mcast *mcast = NULL;
> +	int create = 1;
>   
>   	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
>   		return;
> @@ -539,64 +593,102 @@ void ipoib_mcast_join_task(struct work_struct *work)
>   	else
>   		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
>   
> +	/*
> +	 * We have to hold the mutex to keep from racing with the join
> +	 * completion threads on setting flags on mcasts, and we have
> +	 * to hold the priv->lock because dev_flush will remove entries
> +	 * out from underneath us, so at a minimum we need the lock
> +	 * through the time that we do the for_each loop of the mcast
> +	 * list or else dev_flush can make us oops.
> +	 */
> +	mutex_lock(&mcast_mutex);
> +	spin_lock_irq(&priv->lock);
> +	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
> +		goto out;
> +
>   	if (!priv->broadcast) {
>   		struct ipoib_mcast *broadcast;
>   
> -		if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
> -			return;
> -
> -		broadcast = ipoib_mcast_alloc(dev, 1);
> +		broadcast = ipoib_mcast_alloc(dev, 0);
>   		if (!broadcast) {
>   			ipoib_warn(priv, "failed to allocate broadcast group\n");
> -			mutex_lock(&mcast_mutex);
> -			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> -				queue_delayed_work(priv->wq, &priv->mcast_task,
> -						   HZ);
> -			mutex_unlock(&mcast_mutex);
> -			return;
> +			/*
> +			 * Restart us after a 1 second delay to retry
> +			 * creating our broadcast group and attaching to
> +			 * it.  Until this succeeds, this ipoib dev is
> +			 * completely stalled (multicast wise).
> +			 */
> +			__ipoib_mcast_schedule_join_thread(priv, NULL, 1);
> +			goto out;
>   		}
>   
> -		spin_lock_irq(&priv->lock);
>   		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
>   		       sizeof (union ib_gid));
>   		priv->broadcast = broadcast;
>   
>   		__ipoib_mcast_add(dev, priv->broadcast);
> -		spin_unlock_irq(&priv->lock);
>   	}
>   
>   	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
> -		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
> -			ipoib_mcast_join(dev, priv->broadcast, 0);
> -		return;
> +		if (IS_ERR_OR_NULL(priv->broadcast->mc) &&
> +		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) {
> +			mcast = priv->broadcast;
> +			create = 0;
> +			if (mcast->backoff > 1 &&
> +			    time_before(jiffies, mcast->delay_until)) {
> +				delay_until = mcast->delay_until;
> +				mcast = NULL;
> +			}
> +		}
> +		goto out;
>   	}
>   
> -	while (1) {
> -		struct ipoib_mcast *mcast = NULL;
> -
> -		spin_lock_irq(&priv->lock);
> -		list_for_each_entry(mcast, &priv->multicast_list, list) {
> -			if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)
> -			    && !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)
> -			    && !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
> +	/*
> +	 * We'll never get here until the broadcast group is both allocated
> +	 * and attached
> +	 */
> +	list_for_each_entry(mcast, &priv->multicast_list, list) {
> +		if (IS_ERR_OR_NULL(mcast->mc) &&
> +		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
> +		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
> +			if (mcast->backoff == 1 ||
> +			    time_after_eq(jiffies, mcast->delay_until))
>   				/* Found the next unjoined group */
>   				break;
> -			}
> +			else if (!delay_until ||
> +				 time_before(mcast->delay_until, delay_until))
> +				delay_until = mcast->delay_until;
>   		}
> -		spin_unlock_irq(&priv->lock);
> -
> -		if (&mcast->list == &priv->multicast_list) {
> -			/* All done */
> -			break;
> -		}
> -
> -		ipoib_mcast_join(dev, mcast, 1);
> -		return;
>   	}
>   
> -	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
> +	if (&mcast->list == &priv->multicast_list) {
> +		/*
> +		 * All done, unless we have delayed work from
> +		 * backoff retransmissions, but we will get
> +		 * restarted when the time is right, so we are
> +		 * done for now
> +		 */
> +		mcast = NULL;
> +		ipoib_dbg_mcast(priv, "successfully joined all "
> +				"multicast groups\n");
> +	}
>   
> -	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
> +out:
> +	if (mcast) {
> +		init_completion(&mcast->done);
> +		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> +	}
> +	spin_unlock_irq(&priv->lock);
> +	mutex_unlock(&mcast_mutex);
> +	if (mcast) {
> +		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> +			ipoib_mcast_sendonly_join(mcast);
> +		else
> +			ipoib_mcast_join(dev, mcast, create);
> +	}
> +	if (delay_until)
> +		queue_delayed_work(priv->wq, &priv->mcast_task,
> +				   delay_until - jiffies);
>   }
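
For readers following the control flow: the scan in the function above (take the first group that is joinable right now, otherwise remember the earliest delay_until so the task can be requeued) can be modeled in a few lines of Python. This is an illustrative sketch, not driver code; the dict fields only mirror the flag and field names used in the patch.

```python
def pick_next_join(groups, now):
    """Model of the join-task scan: return (group_to_join, earliest_delay).

    Either the first group that is joinable right now, or None plus the
    earliest delay_until so the caller can requeue the delayed work.
    """
    delay_until = None
    for g in groups:
        # mirrors: IS_ERR_OR_NULL(mcast->mc) && !BUSY && !ATTACHED
        if g['busy'] or g['attached']:
            continue
        if g['backoff'] == 1 or now >= g['delay_until']:
            return g, None              # found the next unjoined group
        if delay_until is None or g['delay_until'] < delay_until:
            delay_until = g['delay_until']
    return None, delay_until            # all done, or requeue at delay_until
```

Note how a group still in backoff never blocks the scan; it only contributes a candidate requeue time, which is exactly why time_before (not time_before_eq) matters in the comparison.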
>   
>   int ipoib_mcast_start_thread(struct net_device *dev)
> @@ -606,8 +698,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
>   	ipoib_dbg_mcast(priv, "starting multicast thread\n");
>   
>   	mutex_lock(&mcast_mutex);
> -	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> -		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
> +	set_bit(IPOIB_MCAST_RUN, &priv->flags);
> +	__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
>   	mutex_unlock(&mcast_mutex);
>   
>   	return 0;
> @@ -635,7 +727,12 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
>   	int ret = 0;
>   
>   	if (test_and_clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> +		ipoib_warn(priv, "ipoib_mcast_leave on an in-flight join\n");
> +
> +	if (!IS_ERR_OR_NULL(mcast->mc))
>   		ib_sa_free_multicast(mcast->mc);
> +	else
> +		ipoib_dbg(priv, "ipoib_mcast_leave with mcast->mc invalid\n");
>   
>   	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
>   		ipoib_dbg_mcast(priv, "leaving MGID %pI6\n",
> @@ -646,7 +743,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
>   				      be16_to_cpu(mcast->mcmember.mlid));
>   		if (ret)
>   			ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret);
> -	}
> +	} else if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> +		ipoib_dbg(priv, "leaving with no mcmember but not a "
> +			  "SENDONLY join\n");
>   
>   	return 0;
>   }
> @@ -687,6 +786,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
>   		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
>   		__ipoib_mcast_add(dev, mcast);
>   		list_add_tail(&mcast->list, &priv->multicast_list);
> +		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
>   	}
>   
>   	if (!mcast->ah) {
> @@ -696,13 +796,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
>   			++dev->stats.tx_dropped;
>   			dev_kfree_skb_any(skb);
>   		}
> -
> -		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> -			ipoib_dbg_mcast(priv, "no address vector, "
> -					"but multicast join already started\n");
> -		else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> -			ipoib_mcast_sendonly_join(mcast);
> -
>   		/*
>   		 * If lookup completes between here and out:, don't
>   		 * want to send packet twice.
> @@ -761,9 +854,12 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
>   
>   	spin_unlock_irqrestore(&priv->lock, flags);
>   
> -	/* seperate between the wait to the leave*/
> +	/*
> +	 * make sure the in-flight joins have finished before we attempt
> +	 * to leave
> +	 */
>   	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
> -		if (test_bit(IPOIB_MCAST_JOIN_STARTED, &mcast->flags))
> +		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
>   			wait_for_completion(&mcast->done);
>   
>   	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
> @@ -794,20 +890,14 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>   	unsigned long flags;
>   	struct ib_sa_mcmember_rec rec;
>   
> -	ipoib_dbg_mcast(priv, "restarting multicast task\n");
> +	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
> +		/*
> +		 * shortcut...on shutdown flush is called next, just
> +		 * let it do all the work
> +		 */
> +		return;
>   
> -	/*
> -	 * We're running on the priv->wq right now, so we can't call
> -	 * mcast_stop_thread as it wants to flush the wq and that
> -	 * will deadlock.  We don't actually *need* to stop the
> -	 * thread here anyway, so just clear the run flag, cancel
> -	 * any delayed work, do our work, remove the old entries,
> -	 * then restart the thread.
> -	 */
> -	mutex_lock(&mcast_mutex);
> -	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
> -	cancel_delayed_work(&priv->mcast_task);
> -	mutex_unlock(&mcast_mutex);
> +	ipoib_dbg_mcast(priv, "restarting multicast task\n");
>   
>   	local_irq_save(flags);
>   	netif_addr_lock(dev);
> @@ -893,14 +983,27 @@ void ipoib_mcast_restart_task(struct work_struct *work)
>   	netif_addr_unlock(dev);
>   	local_irq_restore(flags);
>   
> -	/* We have to cancel outside of the spinlock */
> +	/*
> +	 * make sure the in-flight joins have finished before we attempt
> +	 * to leave
> +	 */
> +	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
> +		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> +			wait_for_completion(&mcast->done);
> +
>   	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
>   		ipoib_mcast_leave(mcast->dev, mcast);
>   		ipoib_mcast_free(mcast);
>   	}
>   
> -	if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags))
> -		ipoib_mcast_start_thread(dev);
> +	/*
> +	 * Double check that we are still up
> +	 */
> +	if (test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
> +		spin_lock_irqsave(&priv->lock, flags);
> +		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
> +		spin_unlock_irqrestore(&priv->lock, flags);
> +	}
>   }
>   
>   #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 8/9] IB/ipoib: deserialize multicast joins
       [not found]     ` <a24ade295dfdd1369aac47a978003569ec190952.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-03-01 13:58       ` Erez Shitrit
       [not found]         ` <54F31AEC.3010001-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Erez Shitrit @ 2015-03-01 13:58 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Or Gerlitz, Erez Shitrit

On 2/22/2015 2:27 AM, Doug Ledford wrote:
> Allow the ipoib layer to attempt to join all outstanding multicast
> groups at once.  The ib_sa layer will serialize multiple attempts to
> join the same group, but will process attempts to join different groups
> in parallel.  Take advantage of that.
>
> In order to make this happen, change the mcast_join_thread to loop
> through all needed joins, sending a join request for each one that we
> still need to join.  There are a few special cases we handle though:
>
> 1) Don't attempt to join anything but the broadcast group until the join
> of the broadcast group has succeeded.
> 2) No longer restart the join task at the end of completion handling.
> If we completed successfully, we are done.  The join task now needs to be
> kicked by mcast_send, mcast_restart_task, or mcast_start_thread, but
> should not need to be started at any other time except when scheduling a
> backoff attempt to rejoin.
> 3) No longer use separate join/completion routines for regular and
> sendonly joins, pass them all through the same routine and just do the
> right thing based on the SENDONLY join flag.
> 4) Only try to join a SENDONLY join twice, then drop the packets and
> quit trying.  We leave the mcast group in the list so that if we get a
> new packet, all that we have to do is queue up the packet and restart
> the join task and it will automatically try to join twice and then
> either send or flush the queue again.
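
Point 4 above amounts to a small retry policy, which can be sketched as follows. The backoff doubling and cap shown here are assumptions for illustration (the real scheduling lives in __ipoib_mcast_schedule_join_thread, which is not shown in this hunk); the field names only mirror the patch.

```python
def handle_failed_join(mcast, max_backoff=16):
    """Model of point 4: sendonly groups get one retry, then the queue
    is flushed and the group is left in the list, dormant, until a new
    packet restarts the join task."""
    if mcast['sendonly'] and mcast['backoff'] >= 2:
        mcast['backoff'] = 1
        dropped = len(mcast['pkt_queue'])
        mcast['pkt_queue'].clear()
        return ('drop', dropped)        # give up until a new packet arrives
    # assumed backoff handling, for illustration only
    mcast['backoff'] = min(mcast['backoff'] * 2, max_backoff)
    return ('retry', mcast['backoff'])  # requeue the join with a delay
```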
>
> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 250 ++++++++-----------------
>   1 file changed, 82 insertions(+), 168 deletions(-)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 277e7ac7c4d..c670d9c2cda 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -307,111 +307,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
>   	return 0;
>   }
>   
> -static int
> -ipoib_mcast_sendonly_join_complete(int status,
> -				   struct ib_sa_multicast *multicast)
> -{
> -	struct ipoib_mcast *mcast = multicast->context;
> -	struct net_device *dev = mcast->dev;
> -	struct ipoib_dev_priv *priv = netdev_priv(dev);
> -
> -	/*
> -	 * We have to take the mutex to force mcast_sendonly_join to
> -	 * return from ib_sa_multicast_join and set mcast->mc to a
> -	 * valid value.  Otherwise we were racing with ourselves in
> -	 * that we might fail here, but get a valid return from
> -	 * ib_sa_multicast_join after we had cleared mcast->mc here,
> -	 * resulting in mis-matched joins and leaves and a deadlock
> -	 */
> -	mutex_lock(&mcast_mutex);
> -
> -	/* We trap for port events ourselves. */
> -	if (status == -ENETRESET) {
> -		status = 0;
> -		goto out;
> -	}
> -
> -	if (!status)
> -		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
> -
> -	if (status) {
> -		if (mcast->logcount++ < 20)
> -			ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
> -					"join failed for %pI6, status %d\n",
> -					mcast->mcmember.mgid.raw, status);
> -
> -		/* Flush out any queued packets */
> -		netif_tx_lock_bh(dev);
> -		while (!skb_queue_empty(&mcast->pkt_queue)) {
> -			++dev->stats.tx_dropped;
> -			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
> -		}
> -		netif_tx_unlock_bh(dev);
> -		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
> -	} else {
> -		mcast->backoff = 1;
> -		mcast->delay_until = jiffies;
> -		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
> -	}
> -out:
> -	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> -	if (status)
> -		mcast->mc = NULL;
> -	complete(&mcast->done);
> -	mutex_unlock(&mcast_mutex);
> -	return status;
> -}
> -
> -static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
> -{
> -	struct net_device *dev = mcast->dev;
> -	struct ipoib_dev_priv *priv = netdev_priv(dev);
> -	struct ib_sa_mcmember_rec rec = {
> -#if 0				/* Some SMs don't support send-only yet */
> -		.join_state = 4
> -#else
> -		.join_state = 1
> -#endif
> -	};
> -	int ret = 0;
> -
> -	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
> -		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
> -				"multicast joins\n");
> -		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> -		complete(&mcast->done);
> -		return -ENODEV;
> -	}
> -
> -	rec.mgid     = mcast->mcmember.mgid;
> -	rec.port_gid = priv->local_gid;
> -	rec.pkey     = cpu_to_be16(priv->pkey);
> -
> -	mutex_lock(&mcast_mutex);
> -	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
> -					 priv->port, &rec,
> -					 IB_SA_MCMEMBER_REC_MGID	|
> -					 IB_SA_MCMEMBER_REC_PORT_GID	|
> -					 IB_SA_MCMEMBER_REC_PKEY	|
> -					 IB_SA_MCMEMBER_REC_JOIN_STATE,
> -					 GFP_ATOMIC,
> -					 ipoib_mcast_sendonly_join_complete,
> -					 mcast);
> -	if (IS_ERR(mcast->mc)) {
> -		ret = PTR_ERR(mcast->mc);
> -		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> -		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
> -			   "failed (ret = %d)\n", ret);
> -		complete(&mcast->done);
> -	} else {
> -		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
> -				"sendonly join\n", mcast->mcmember.mgid.raw);
> -	}
> -	mutex_unlock(&mcast_mutex);
> -
> -	return ret;
> -}
> -
>   void ipoib_mcast_carrier_on_task(struct work_struct *work)
>   {
>   	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
> @@ -452,7 +347,9 @@ static int ipoib_mcast_join_complete(int status,
>   	struct net_device *dev = mcast->dev;
>   	struct ipoib_dev_priv *priv = netdev_priv(dev);
>   
> -	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
> +	ipoib_dbg_mcast(priv, "%sjoin completion for %pI6 (status %d)\n",
> +			test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ?
> +			"sendonly " : "",
>   			mcast->mcmember.mgid.raw, status);
>   
>   	/*
> @@ -477,27 +374,52 @@ static int ipoib_mcast_join_complete(int status,
>   	if (!status) {
>   		mcast->backoff = 1;
>   		mcast->delay_until = jiffies;
> -		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
>   
>   		/*
>   		 * Defer carrier on work to priv->wq to avoid a
> -		 * deadlock on rtnl_lock here.
> +		 * deadlock on rtnl_lock here.  Requeue our multicast
> +		 * work too, which will end up happening right after
> +		 * our carrier on task work and will allow us to
> +		 * send out all of the non-broadcast joins
>   		 */
> -		if (mcast == priv->broadcast)
> +		if (mcast == priv->broadcast) {
>   			queue_work(priv->wq, &priv->carrier_on_task);
> +			__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
> +		}
>   	} else {
>   		if (mcast->logcount++ < 20) {
>   			if (status == -ETIMEDOUT || status == -EAGAIN) {
> -				ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
> +				ipoib_dbg_mcast(priv, "%smulticast join failed for %pI6, status %d\n",
> +						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
>   						mcast->mcmember.mgid.raw, status);
>   			} else {
> -				ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
> +				ipoib_warn(priv, "%smulticast join failed for %pI6, status %d\n",
> +						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
>   					   mcast->mcmember.mgid.raw, status);
>   			}
>   		}
>   
> -		/* Requeue this join task with a backoff delay */
> -		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
> +		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) &&
> +		    mcast->backoff >= 2) {
> +			/*
> +			 * We only retry sendonly joins once before we drop
> +			 * the packet and quit trying to deal with the
> +			 * group.  However, we leave the group in the
> +			 * mcast list as an unjoined group.  If we want to
> +			 * try joining again, we simply queue up a packet
> +			 * and restart the join thread.  The empty queue
> +			 * is why the join thread ignores this group.
> +			 */

Question: does the sendonly group stay in the list forever? It looks like
it does, but that predates your patches, so it should probably be
addressed in a separate patch.

> +			mcast->backoff = 1;
> +			netif_tx_lock_bh(dev);
> +			while (!skb_queue_empty(&mcast->pkt_queue)) {
> +				++dev->stats.tx_dropped;
> +				dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
> +			}
> +			netif_tx_unlock_bh(dev);
> +		} else
> +			/* Requeue this join task with a backoff delay */
> +			__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
>   	}
>   out:
>   	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> @@ -650,45 +572,45 @@ void ipoib_mcast_join_task(struct work_struct *work)
>   	list_for_each_entry(mcast, &priv->multicast_list, list) {
>   		if (IS_ERR_OR_NULL(mcast->mc) &&
>   		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
> -		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
> +		    (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ||
> +		     !skb_queue_empty(&mcast->pkt_queue))) {
>   			if (mcast->backoff == 1 ||
> -			    time_after_eq(jiffies, mcast->delay_until))
> +			    time_after_eq(jiffies, mcast->delay_until)) {
>   				/* Found the next unjoined group */
> -				break;
> -			else if (!delay_until ||
> +				init_completion(&mcast->done);
> +				set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> +				if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> +					create = 0;
> +				else
> +					create = 1;
> +				spin_unlock_irq(&priv->lock);
> +				mutex_unlock(&mcast_mutex);
> +				ipoib_mcast_join(dev, mcast, create);
> +				mutex_lock(&mcast_mutex);
> +				spin_lock_irq(&priv->lock);
> +			} else if (!delay_until ||
>   				 time_before(mcast->delay_until, delay_until))
>   				delay_until = mcast->delay_until;
>   		}
>   	}
>   
> -	if (&mcast->list == &priv->multicast_list) {
> -		/*
> -		 * All done, unless we have delayed work from
> -		 * backoff retransmissions, but we will get
> -		 * restarted when the time is right, so we are
> -		 * done for now
> -		 */
> -		mcast = NULL;
> -		ipoib_dbg_mcast(priv, "successfully joined all "
> -				"multicast groups\n");
> -	}
> +	mcast = NULL;
> +	ipoib_dbg_mcast(priv, "successfully started all multicast joins\n");
>   
>   out:
> +	if (delay_until) {
> +		cancel_delayed_work(&priv->mcast_task);
> +		queue_delayed_work(priv->wq, &priv->mcast_task,
> +				   delay_until - jiffies);
> +	}
>   	if (mcast) {
>   		init_completion(&mcast->done);
>   		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>   	}
>   	spin_unlock_irq(&priv->lock);
>   	mutex_unlock(&mcast_mutex);
> -	if (mcast) {
> -		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> -			ipoib_mcast_sendonly_join(mcast);
> -		else
> -			ipoib_mcast_join(dev, mcast, create);
> -	}
> -	if (delay_until)
> -		queue_delayed_work(priv->wq, &priv->mcast_task,
> -				   delay_until - jiffies);
> +	if (mcast)
> +		ipoib_mcast_join(dev, mcast, create);
>   }
>   
>   int ipoib_mcast_start_thread(struct net_device *dev)
> @@ -731,8 +653,6 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
>   
>   	if (!IS_ERR_OR_NULL(mcast->mc))
>   		ib_sa_free_multicast(mcast->mc);
> -	else
> -		ipoib_dbg(priv, "ipoib_mcast_leave with mcast->mc invalid\n");
>   
>   	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
>   		ipoib_dbg_mcast(priv, "leaving MGID %pI6\n",
> @@ -768,43 +688,37 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
>   	}
>   
>   	mcast = __ipoib_mcast_find(dev, mgid);
> -	if (!mcast) {
> -		/* Let's create a new send only group now */
> -		ipoib_dbg_mcast(priv, "setting up send only multicast group for %pI6\n",
> -				mgid);
> -
> -		mcast = ipoib_mcast_alloc(dev, 0);
> +	if (!mcast || !mcast->ah) {
>   		if (!mcast) {
> -			ipoib_warn(priv, "unable to allocate memory for "
> -				   "multicast structure\n");
> -			++dev->stats.tx_dropped;
> -			dev_kfree_skb_any(skb);
> -			goto out;
> -		}
> -
> -		set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
> -		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
> -		__ipoib_mcast_add(dev, mcast);
> -		list_add_tail(&mcast->list, &priv->multicast_list);
> -		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
> -	}
> +			/* Let's create a new send only group now */
> +			ipoib_dbg_mcast(priv, "setting up send only multicast group for %pI6\n",
> +					mgid);
> +
> +			mcast = ipoib_mcast_alloc(dev, 0);
> +			if (!mcast) {
> +				ipoib_warn(priv, "unable to allocate memory "
> +					   "for multicast structure\n");
> +				++dev->stats.tx_dropped;
> +				dev_kfree_skb_any(skb);
> +				goto unlock;
> +			}
>   
> -	if (!mcast->ah) {
> +			set_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags);
> +			memcpy(mcast->mcmember.mgid.raw, mgid,
> +			       sizeof (union ib_gid));
> +			__ipoib_mcast_add(dev, mcast);
> +			list_add_tail(&mcast->list, &priv->multicast_list);
> +		}
>   		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
>   			skb_queue_tail(&mcast->pkt_queue, skb);
>   		else {
>   			++dev->stats.tx_dropped;
>   			dev_kfree_skb_any(skb);
>   		}
> -		/*
> -		 * If lookup completes between here and out:, don't
> -		 * want to send packet twice.
> -		 */
> -		mcast = NULL;
> -	}
> -
> -out:
> -	if (mcast && mcast->ah) {
> +		if (!test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
> +			__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
> +		}
> +	} else {
>   		struct ipoib_neigh *neigh;
>   
>   		spin_unlock_irqrestore(&priv->lock, flags);


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]                 ` <54F2B61C.9080308-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-03-02 15:09                   ` Doug Ledford
       [not found]                     ` <1425308967.2354.19.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-03-02 15:09 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 5247 bytes --]

On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
> On 2/26/2015 6:27 PM, Doug Ledford wrote:
> >
> >>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
> >>>    	if (level == IPOIB_FLUSH_LIGHT) {
> >>>    		ipoib_mark_paths_invalid(dev);
> >>>    		ipoib_mcast_dev_flush(dev);
> >>> +		ipoib_flush_ah(dev, 0);
> >> Why do you need to call the flush function here?
> > To remove all of the ah's that were reduced to a 0 refcount by the
> > previous two functions prior to restarting operations.  When we remove
> > an ah, it calls ib_destroy_ah which calls all the way down into the low
> > level driver.  This was to make sure that old, stale data was removed
> > all the way down to the card level before we started new queries for
> > paths and ahs.
> 
> Yes, but it is not needed.

That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
isn't needed but doesn't hurt either.  For older cards (in particular,
mthca), the driver actually frees up card resources at the time of the
call.

> The bug happened when the driver was removed (via modprobe -r etc.)
> while there were ah's on the dead_ah list. You fixed that in
> ipoib_ib_dev_cleanup; the call that you added here is not relevant to
> that bug (and IMHO is not needed at all).

I never said that this hunk was part of the original bug I saw before.

> So, the task of cleaning the dead_ah list is already there; no need to
> call it again, since it will run anyway within 1 sec at the most.
> 
> You can try it: take out that call, and no harm or memory leak will happen.

I have no doubt that it will get freed later.  As I said, I never
considered this particular hunk part of that original bug.  But, as I
point out above, there is no harm in it for any hardware, and depending
on hardware it can help to make sure there isn't a shortage of
resources.  Given that fact, I see no reason to remove it.

> >> I can't see the reason to use the flush rather than stop_ah; without
> >> setting IPOIB_STOP_REAPER, the flush can queue the same work twice.
> > No, it can't.  The ah flush routine does not search through ahs to find
> > ones to flush.  When you delete neighbors and mcasts, they release their
> > references to ahs.  When the refcount goes to 0, the put routine puts
> > the ah on the to-be-deleted ah list.  All the flush does is take that
> > list and delete the items.  If you run the flush twice, the first run
> > deletes all the items on the to-be-deleted list, the second run sees an
> > empty list and does nothing.
> >
> > As for using flush versus stop: the flush function cancels any delayed
> > ah_flush work so that it isn't racing with the normally scheduled
> 
> When you call cancel_delayed_work on work that can reschedule itself,
> it doesn't help; the work can be in the middle of a run and re-schedule
> itself...

If it is in the middle of a run and reschedules itself, then it will
schedule itself at precisely the same time we would have anyway, and we
will get flushed properly, so the net result of this particular race is
that we end up doing exactly what we wanted to do anyway.
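
The flush-twice idempotency described in the quoted text above (the put routine moves zero-refcount ahs onto a to-be-deleted list; the flush only drains that list) can be modeled in a few lines. This is a userspace sketch with illustrative names, not the ipoib symbols.

```python
dead_list = []  # stand-in for the reaper's to-be-deleted ah list

def put_ah(ah):
    """An ah whose refcount drops to zero lands on the dead list."""
    dead_list.append(ah)

def flush_ahs():
    """Delete everything currently on the dead list; return how many."""
    n = len(dead_list)
    dead_list.clear()
    return n
```

A second flush sees an empty list and does nothing, which is why running the flush from both the periodic reaper and the device-flush path is harmless.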

> 
> > ah_flush, then flushes the workqueue to make sure anything that might
> > result in an ah getting freed is done, then flushes, then schedules a
> > new delayed flush_ah work 1 second later.  So, it does exactly what a
> > flush should do: it removes what there is currently to remove, and in
> > the case of a periodically scheduled garbage collection, schedules a new
> > periodic flush at the maximum interval.
> >
> > It is not appropriate to call stop_ah at this point because it will
> > cancel the delayed work, flush the ahs, then never reschedule the
> > garbage collection.  If we called stop here, we would have to call start
> > later.  But that's not really necessary as the flush cancels the
> > scheduled work and reschedules it for a second later.
> >
> >>>    	}
> >>>    
> >>>    	if (level >= IPOIB_FLUSH_NORMAL)
> >>> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
> >>>    	ipoib_mcast_stop_thread(dev, 1);
> >>>    	ipoib_mcast_dev_flush(dev);
> >>>    
> >>> +	/*
> >>> +	 * All of our ah references aren't free until after
> >>> +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
> >>> +	 * the neighbor garbage collection is stopped and reaped.
> >>> +	 * That should all be done now, so make a final ah flush.
> >>> +	 */
> >>> +	ipoib_stop_ah(dev, 1);
> >>> +
> >>>    	ipoib_transport_dev_cleanup(dev);
> >>>    }
> >>>    
> >
> 


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage
       [not found]         ` <54F2DC81.304-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-03-02 15:27           ` Doug Ledford
       [not found]             ` <1425310036.2354.24.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-03-02 15:27 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 7404 bytes --]

On Sun, 2015-03-01 at 11:31 +0200, Erez Shitrit wrote:
> On 2/22/2015 2:27 AM, Doug Ledford wrote:
> > Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
> > objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
> > in how it was used.  We didn't always initialize the completion struct
> > before we set the flag, and we didn't always call complete on the
> > completion struct from all paths that complete it.  And when we did
> > complete it, sometimes we continued to touch the mcast entry after
> > the completion, opening us up to possible use after free issues.
> >
> > This made it less than totally effective, and certainly made its use
> > confusing.  And in the flush function we would use the presence of this
> > flag to signal that we should wait on the completion struct, but we never
> > cleared this flag, ever.
> >
> > In order to make things clearer and aid in resolving the rtnl deadlock
> > bug I've been chasing, I cleaned this up a bit.
> >
> >   1) Remove the MCAST_JOIN_STARTED flag entirely
> >   2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
> >   3) Test mcast->mc directly to see if we have completed
> >      ib_sa_join_multicast (using IS_ERR_OR_NULL)
> >   4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
> >      the mcast->done completion struct
> >   5) Make sure that before calling complete(&mcast->done), we always clear
> >      the MCAST_FLAG_BUSY bit
> >   6) Take the mcast_mutex before we call ib_sa_multicast_join and also
> >      take the mutex in our join callback.  This forces
> >      ib_sa_multicast_join to return and set mcast->mc before we process
> >      the callback.  This way, our callback can safely clear mcast->mc
> >      if there is an error on the join and we will do the right thing as
> >      a result in mcast_dev_flush.
> >   7) Because we need the mutex to synchronize mcast->mc, we can no
> >      longer call mcast_sendonly_join directly from mcast_send and
> >      instead must add sendonly join processing to the mcast_join_task
> >   8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
> >      we have a running task.  We know when we need to reschedule our
> >      join task thread and don't need a flag to tell us.
> >   9) Add a helper for rescheduling the join task thread
> >
> > A number of different races are resolved with these changes.  These
> > races existed with the old MCAST_FLAG_BUSY usage, the
> > MCAST_JOIN_STARTED flag was an attempt to address them, and while it
> > helped, a determined effort could still trip things up.
> >
> > One race looks something like this:
> >
> > Thread 1                             Thread 2
> > ib_sa_join_multicast (as part of running restart mcast task)
> >    alloc member
> >    call callback
> >                                       ifconfig ib0 down
> > 				     wait_for_completion
> >      callback call completes
> >                                       wait_for_completion in
> > 				     mcast_dev_flush completes
> > 				       mcast->mc is PTR_ERR_OR_NULL
> > 				       so we skip ib_sa_leave_multicast
> >      return from callback
> >    return from ib_sa_join_multicast
> > set mcast->mc = return from ib_sa_multicast
> >
> > We now have a permanently unbalanced join/leave issue that trips up the
> > refcounting in core/multicast.c
> >
> > Another like this:
> >
> > Thread 1                   Thread 2         Thread 3
> > ib_sa_multicast_join
> >                                              ifconfig ib0 down
> > 					    priv->broadcast = NULL
> >                             join_complete
> > 			                    wait_for_completion
> > 			   mcast->mc is not yet set, so don't clear
> > return from ib_sa_join_multicast and set mcast->mc
> > 			   complete
> > 			   return -EAGAIN (making mcast->mc invalid)
> > 			   		    call ib_sa_multicast_leave
> > 					    on invalid mcast->mc, hang
> > 					    forever
> >
> > By holding the mutex around ib_sa_multicast_join and taking the mutex
> > early in the callback, we force mcast->mc to be valid at the time we
> > run the callback.  This allows us to clear mcast->mc if there is an
> > error and the join is going to fail.  We do this before we complete
> > the mcast.  In this way, mcast_dev_flush always sees consistent state
> > in regards to mcast->mc membership at the time that the
> > wait_for_completion() returns.
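
The ordering the mutex enforces here can be demonstrated with a small userspace model, using Python threading as a stand-in for the kernel mutex and completion. All names are illustrative; the point is that taking the mutex at the top of the callback forces the join call to return, and mcast.mc to be assigned, before the callback examines it.

```python
import threading
import time

class Mcast:
    def __init__(self):
        self.mc = None
        self.done = threading.Event()

mcast_mutex = threading.Lock()
observed = []

def join_complete(mcast, status):
    # Taking the mutex first forces the joiner to finish assigning
    # mcast.mc before we look at it.
    with mcast_mutex:
        observed.append(mcast.mc is not None)
        if status:
            mcast.mc = None  # safe: the joiner's store already happened
        mcast.done.set()

def sa_join(mcast, status):
    # The SA layer may invoke the completion from another context
    # before the join call itself returns; model that with a thread.
    threading.Thread(target=join_complete, args=(mcast, status)).start()
    time.sleep(0.05)  # widen the race window
    return object()   # the handle the caller stores in mcast.mc

def ipoib_join(mcast, status=0):
    with mcast_mutex:
        mcast.mc = sa_join(mcast, status)
    mcast.done.wait()
```

Without the mutex, the callback could observe mcast.mc still NULL (the first race diagram above); with it, the callback always sees consistent state.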
> >
> > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >   drivers/infiniband/ulp/ipoib/ipoib.h           |  11 +-
> >   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 355 ++++++++++++++++---------
> >   2 files changed, 238 insertions(+), 128 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
> > index 9ef432ae72e..c79dcd5ee8a 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib.h
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> > @@ -98,9 +98,15 @@ enum {
> >   
> >   	IPOIB_MCAST_FLAG_FOUND	  = 0,	/* used in set_multicast_list */
> >   	IPOIB_MCAST_FLAG_SENDONLY = 1,
> > -	IPOIB_MCAST_FLAG_BUSY	  = 2,	/* joining or already joined */
> > +	/*
> > +	 * For IPOIB_MCAST_FLAG_BUSY
> > +	 * When set, in flight join and mcast->mc is unreliable
> > +	 * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
> > +	 *   haven't started yet
> > +	 * When clear and mcast->mc is valid pointer, join was successful
> > +	 */
> > +	IPOIB_MCAST_FLAG_BUSY	  = 2,
> >   	IPOIB_MCAST_FLAG_ATTACHED = 3,
> > -	IPOIB_MCAST_JOIN_STARTED  = 4,
> >   
> >   	MAX_SEND_CQE		  = 16,
> >   	IPOIB_CM_COPYBREAK	  = 256,
> > @@ -148,6 +154,7 @@ struct ipoib_mcast {
> >   
> >   	unsigned long created;
> >   	unsigned long backoff;
> > +	unsigned long delay_until;
> >   
> >   	unsigned long flags;
> >   	unsigned char logcount;
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > index bb1b69904f9..277e7ac7c4d 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > @@ -66,6 +66,48 @@ struct ipoib_mcast_iter {
> >   	unsigned int       send_only;
> >   };
> >   
> > +/*
> > + * This should be called with the mcast_mutex held
> > + */
> > +static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv,
> > +					       struct ipoib_mcast *mcast,
> > +					       bool delay)
> > +{
> > +	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
> 
> You don't need the IPOIB_MCAST_RUN flag; it duplicates
> IPOIB_FLAG_OPER_UP and should probably be removed from all places
> (including the ipoib.h file).

This is probably true, but I skipped it for this series of patches.
It wasn't a requirement for proper operation, and depending on where in
the patch series you tried to inject this change, it had unintended
negative consequences.  In particular, up until patch 7/9,
mcast_restart_task used to do a hand-rolled stop_thread and a matching
start_thread, and so couldn't use FLAG_OPER_UP because we needed
FLAG_OPER_UP to tell us whether or not to restart the thread.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 8/9] IB/ipoib: deserialize multicast joins
       [not found]         ` <54F31AEC.3010001-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-03-02 15:29           ` Doug Ledford
       [not found]             ` <1425310145.2354.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-03-02 15:29 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 9285 bytes --]

On Sun, 2015-03-01 at 15:58 +0200, Erez Shitrit wrote:
> On 2/22/2015 2:27 AM, Doug Ledford wrote:
> > Allow the ipoib layer to attempt to join all outstanding multicast
> > groups at once.  The ib_sa layer will serialize multiple attempts to
> > join the same group, but will process attempts to join different groups
> > in parallel.  Take advantage of that.
> >
> > In order to make this happen, change the mcast_join_thread to loop
> > through all needed joins, sending a join request for each one that we
> > still need to join.  There are a few special cases we handle though:
> >
> > 1) Don't attempt to join anything but the broadcast group until the join
> > of the broadcast group has succeeded.
> > 2) No longer restart the join task at the end of completion handling.
> > If we completed successfully, we are done.  The join task now needs to
> > be kicked by mcast_send, mcast_restart_task, or mcast_start_thread, but
> > should not need to be started at any other time except when scheduling
> > a backoff attempt to rejoin.
> > 3) No longer use separate join/completion routines for regular and
> > sendonly joins, pass them all through the same routine and just do the
> > right thing based on the SENDONLY join flag.
> > 4) Only try to join a SENDONLY join twice, then drop the packets and
> > quit trying.  We leave the mcast group in the list so that if we get a
> > new packet, all that we have to do is queue up the packet and restart
> > the join task and it will automatically try to join twice and then
> > either send or flush the queue again.
> >
> > Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> > ---
> >   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 250 ++++++++-----------------
> >   1 file changed, 82 insertions(+), 168 deletions(-)
> >
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > index 277e7ac7c4d..c670d9c2cda 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> > @@ -307,111 +307,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
> >   	return 0;
> >   }
> >   
> > -static int
> > -ipoib_mcast_sendonly_join_complete(int status,
> > -				   struct ib_sa_multicast *multicast)
> > -{
> > -	struct ipoib_mcast *mcast = multicast->context;
> > -	struct net_device *dev = mcast->dev;
> > -	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > -
> > -	/*
> > -	 * We have to take the mutex to force mcast_sendonly_join to
> > -	 * return from ib_sa_multicast_join and set mcast->mc to a
> > -	 * valid value.  Otherwise we were racing with ourselves in
> > -	 * that we might fail here, but get a valid return from
> > -	 * ib_sa_multicast_join after we had cleared mcast->mc here,
> > -	 * resulting in mis-matched joins and leaves and a deadlock
> > -	 */
> > -	mutex_lock(&mcast_mutex);
> > -
> > -	/* We trap for port events ourselves. */
> > -	if (status == -ENETRESET) {
> > -		status = 0;
> > -		goto out;
> > -	}
> > -
> > -	if (!status)
> > -		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
> > -
> > -	if (status) {
> > -		if (mcast->logcount++ < 20)
> > -			ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
> > -					"join failed for %pI6, status %d\n",
> > -					mcast->mcmember.mgid.raw, status);
> > -
> > -		/* Flush out any queued packets */
> > -		netif_tx_lock_bh(dev);
> > -		while (!skb_queue_empty(&mcast->pkt_queue)) {
> > -			++dev->stats.tx_dropped;
> > -			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
> > -		}
> > -		netif_tx_unlock_bh(dev);
> > -		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
> > -	} else {
> > -		mcast->backoff = 1;
> > -		mcast->delay_until = jiffies;
> > -		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
> > -	}
> > -out:
> > -	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> > -	if (status)
> > -		mcast->mc = NULL;
> > -	complete(&mcast->done);
> > -	mutex_unlock(&mcast_mutex);
> > -	return status;
> > -}
> > -
> > -static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
> > -{
> > -	struct net_device *dev = mcast->dev;
> > -	struct ipoib_dev_priv *priv = netdev_priv(dev);
> > -	struct ib_sa_mcmember_rec rec = {
> > -#if 0				/* Some SMs don't support send-only yet */
> > -		.join_state = 4
> > -#else
> > -		.join_state = 1
> > -#endif
> > -	};
> > -	int ret = 0;
> > -
> > -	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
> > -		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
> > -				"multicast joins\n");
> > -		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> > -		complete(&mcast->done);
> > -		return -ENODEV;
> > -	}
> > -
> > -	rec.mgid     = mcast->mcmember.mgid;
> > -	rec.port_gid = priv->local_gid;
> > -	rec.pkey     = cpu_to_be16(priv->pkey);
> > -
> > -	mutex_lock(&mcast_mutex);
> > -	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
> > -					 priv->port, &rec,
> > -					 IB_SA_MCMEMBER_REC_MGID	|
> > -					 IB_SA_MCMEMBER_REC_PORT_GID	|
> > -					 IB_SA_MCMEMBER_REC_PKEY	|
> > -					 IB_SA_MCMEMBER_REC_JOIN_STATE,
> > -					 GFP_ATOMIC,
> > -					 ipoib_mcast_sendonly_join_complete,
> > -					 mcast);
> > -	if (IS_ERR(mcast->mc)) {
> > -		ret = PTR_ERR(mcast->mc);
> > -		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> > -		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
> > -			   "failed (ret = %d)\n", ret);
> > -		complete(&mcast->done);
> > -	} else {
> > -		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
> > -				"sendonly join\n", mcast->mcmember.mgid.raw);
> > -	}
> > -	mutex_unlock(&mcast_mutex);
> > -
> > -	return ret;
> > -}
> > -
> >   void ipoib_mcast_carrier_on_task(struct work_struct *work)
> >   {
> >   	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
> > @@ -452,7 +347,9 @@ static int ipoib_mcast_join_complete(int status,
> >   	struct net_device *dev = mcast->dev;
> >   	struct ipoib_dev_priv *priv = netdev_priv(dev);
> >   
> > -	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
> > +	ipoib_dbg_mcast(priv, "%sjoin completion for %pI6 (status %d)\n",
> > +			test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ?
> > +			"sendonly " : "",
> >   			mcast->mcmember.mgid.raw, status);
> >   
> >   	/*
> > @@ -477,27 +374,52 @@ static int ipoib_mcast_join_complete(int status,
> >   	if (!status) {
> >   		mcast->backoff = 1;
> >   		mcast->delay_until = jiffies;
> > -		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
> >   
> >   		/*
> >   		 * Defer carrier on work to priv->wq to avoid a
> > -		 * deadlock on rtnl_lock here.
> > +		 * deadlock on rtnl_lock here.  Requeue our multicast
> > +		 * work too, which will end up happening right after
> > +		 * our carrier on task work and will allow us to
> > +		 * send out all of the non-broadcast joins
> >   		 */
> > -		if (mcast == priv->broadcast)
> > +		if (mcast == priv->broadcast) {
> >   			queue_work(priv->wq, &priv->carrier_on_task);
> > +			__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
> > +		}
> >   	} else {
> >   		if (mcast->logcount++ < 20) {
> >   			if (status == -ETIMEDOUT || status == -EAGAIN) {
> > -				ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
> > +				ipoib_dbg_mcast(priv, "%smulticast join failed for %pI6, status %d\n",
> > +						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
> >   						mcast->mcmember.mgid.raw, status);
> >   			} else {
> > -				ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
> > +				ipoib_warn(priv, "%smulticast join failed for %pI6, status %d\n",
> > +						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
> >   					   mcast->mcmember.mgid.raw, status);
> >   			}
> >   		}
> >   
> > -		/* Requeue this join task with a backoff delay */
> > -		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
> > +		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) &&
> > +		    mcast->backoff >= 2) {
> > +			/*
> > +			 * We only retry sendonly joins once before we drop
> > +			 * the packet and quit trying to deal with the
> > +			 * group.  However, we leave the group in the
> > +			 * mcast list as an unjoined group.  If we want to
> > +			 * try joining again, we simply queue up a packet
> > +			 * and restart the join thread.  The empty queue
> > +			 * is why the join thread ignores this group.
> > +			 */
> 
> Question: does the sendonly group stay in the list forever? It looks
> like it does, and that behavior predates your patches, so it should
> probably be addressed in a separate patch.

Correct.  That logic is unchanged.  It probably deserves some sort of
timeout such that after X seconds with no traffic on a sendonly group,
we leave that group and only rejoin if we get a new send.  I had thought
about making it something really long, like 20 minutes, but thinking
about it is all I've done.  I didn't code anything up or include that in
these patches.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage
       [not found]             ` <1425310036.2354.24.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-03-03  9:53               ` Erez Shitrit
  0 siblings, 0 replies; 37+ messages in thread
From: Erez Shitrit @ 2015-03-03  9:53 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

On 3/2/2015 5:27 PM, Doug Ledford wrote:
> On Sun, 2015-03-01 at 11:31 +0200, Erez Shitrit wrote:
>> On 2/22/2015 2:27 AM, Doug Ledford wrote:
>>> Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
>>> objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
>>> in how it was used.  We didn't always initialize the completion struct
>>> before we set the flag, and we didn't always call complete on the
>>> completion struct from all paths that complete it.  And when we did
>>> complete it, sometimes we continued to touch the mcast entry after
>>> the completion, opening us up to possible use after free issues.
>>>
>>> This made it less than totally effective, and certainly made its use
>>> confusing.  And in the flush function we would use the presence of this
>>> flag to signal that we should wait on the completion struct, but we never
>>> cleared this flag, ever.
>>>
>>> In order to make things clearer and aid in resolving the rtnl deadlock
>>> bug I've been chasing, I cleaned this up a bit.
>>>
>>>    1) Remove the MCAST_JOIN_STARTED flag entirely
>>>    2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
>>>    3) Test mcast->mc directly to see if we have completed
>>>       ib_sa_join_multicast (using IS_ERR_OR_NULL)
>>>    4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
>>>       the mcast->done completion struct
>>>    5) Make sure that before calling complete(&mcast->done), we always clear
>>>       the MCAST_FLAG_BUSY bit
>>>    6) Take the mcast_mutex before we call ib_sa_multicast_join and also
>>>       take the mutex in our join callback.  This forces
>>>       ib_sa_multicast_join to return and set mcast->mc before we process
>>>       the callback.  This way, our callback can safely clear mcast->mc
>>>       if there is an error on the join and we will do the right thing as
>>>       a result in mcast_dev_flush.
>>>    7) Because we need the mutex to synchronize mcast->mc, we can no
>>>       longer call mcast_sendonly_join directly from mcast_send and
>>>       instead must add sendonly join processing to the mcast_join_task
>>>    8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
>>>       we have a running task.  We know when we need to reschedule our
>>>       join task thread and don't need a flag to tell us.
>>>    9) Add a helper for rescheduling the join task thread
>>>
>>> A number of different races are resolved with these changes.  These
>>> races existed with the old MCAST_FLAG_BUSY usage, the
>>> MCAST_JOIN_STARTED flag was an attempt to address them, and while it
>>> helped, a determined effort could still trip things up.
>>>
>>> One race looks something like this:
>>>
>>> Thread 1                             Thread 2
>>> ib_sa_join_multicast (as part of running restart mcast task)
>>>     alloc member
>>>     call callback
>>>                                        ifconfig ib0 down
>>> 				     wait_for_completion
>>>       callback call completes
>>>                                        wait_for_completion in
>>> 				     mcast_dev_flush completes
>>> 				       mcast->mc is PTR_ERR_OR_NULL
>>> 				       so we skip ib_sa_leave_multicast
>>>       return from callback
>>>     return from ib_sa_join_multicast
>>> set mcast->mc = return from ib_sa_multicast
>>>
>>> We now have a permanently unbalanced join/leave issue that trips up the
>>> refcounting in core/multicast.c
>>>
>>> Another like this:
>>>
>>> Thread 1                   Thread 2         Thread 3
>>> ib_sa_multicast_join
>>>                                               ifconfig ib0 down
>>> 					    priv->broadcast = NULL
>>>                              join_complete
>>> 			                    wait_for_completion
>>> 			   mcast->mc is not yet set, so don't clear
>>> return from ib_sa_join_multicast and set mcast->mc
>>> 			   complete
>>> 			   return -EAGAIN (making mcast->mc invalid)
>>> 			   		    call ib_sa_multicast_leave
>>> 					    on invalid mcast->mc, hang
>>> 					    forever
>>>
>>> By holding the mutex around ib_sa_multicast_join and taking the mutex
>>> early in the callback, we force mcast->mc to be valid at the time we
>>> run the callback.  This allows us to clear mcast->mc if there is an
>>> error and the join is going to fail.  We do this before we complete
>>> the mcast.  In this way, mcast_dev_flush always sees consistent state
>>> in regards to mcast->mc membership at the time that the
>>> wait_for_completion() returns.
>>>
>>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>    drivers/infiniband/ulp/ipoib/ipoib.h           |  11 +-
>>>    drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 355 ++++++++++++++++---------
>>>    2 files changed, 238 insertions(+), 128 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
>>> index 9ef432ae72e..c79dcd5ee8a 100644
>>> --- a/drivers/infiniband/ulp/ipoib/ipoib.h
>>> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
>>> @@ -98,9 +98,15 @@ enum {
>>>    
>>>    	IPOIB_MCAST_FLAG_FOUND	  = 0,	/* used in set_multicast_list */
>>>    	IPOIB_MCAST_FLAG_SENDONLY = 1,
>>> -	IPOIB_MCAST_FLAG_BUSY	  = 2,	/* joining or already joined */
>>> +	/*
>>> +	 * For IPOIB_MCAST_FLAG_BUSY
>>> +	 * When set, in flight join and mcast->mc is unreliable
>>> +	 * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
>>> +	 *   haven't started yet
>>> +	 * When clear and mcast->mc is valid pointer, join was successful
>>> +	 */
>>> +	IPOIB_MCAST_FLAG_BUSY	  = 2,
>>>    	IPOIB_MCAST_FLAG_ATTACHED = 3,
>>> -	IPOIB_MCAST_JOIN_STARTED  = 4,
>>>    
>>>    	MAX_SEND_CQE		  = 16,
>>>    	IPOIB_CM_COPYBREAK	  = 256,
>>> @@ -148,6 +154,7 @@ struct ipoib_mcast {
>>>    
>>>    	unsigned long created;
>>>    	unsigned long backoff;
>>> +	unsigned long delay_until;
>>>    
>>>    	unsigned long flags;
>>>    	unsigned char logcount;
>>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>>> index bb1b69904f9..277e7ac7c4d 100644
>>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>>> @@ -66,6 +66,48 @@ struct ipoib_mcast_iter {
>>>    	unsigned int       send_only;
>>>    };
>>>    
>>> +/*
>>> + * This should be called with the mcast_mutex held
>>> + */
>>> +static void __ipoib_mcast_schedule_join_thread(struct ipoib_dev_priv *priv,
>>> +					       struct ipoib_mcast *mcast,
>>> +					       bool delay)
>>> +{
>>> +	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
>> You don't need the IPOIB_MCAST_RUN flag; it duplicates
>> IPOIB_FLAG_OPER_UP and should probably be removed from all places
>> (including the ipoib.h file).
> This is probably true, but I skipped it for this series of patches.
> It wasn't a requirement for proper operation, and depending on where in
> the patch series you tried to inject this change, it had unintended
> negative consequences.  In particular, up until patch 7/9,
> mcast_restart_task used to do a hand-rolled stop_thread and a matching
> start_thread, and so couldn't use FLAG_OPER_UP because we needed
> FLAG_OPER_UP to tell us whether or not to restart the thread.

OK, sounds reasonable. We can send a follow-up patch to fix that.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 8/9] IB/ipoib: deserialize multicast joins
       [not found]             ` <1425310145.2354.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-03-03  9:54               ` Erez Shitrit
  0 siblings, 0 replies; 37+ messages in thread
From: Erez Shitrit @ 2015-03-03  9:54 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

On 3/2/2015 5:29 PM, Doug Ledford wrote:
> On Sun, 2015-03-01 at 15:58 +0200, Erez Shitrit wrote:
>> On 2/22/2015 2:27 AM, Doug Ledford wrote:
>>> Allow the ipoib layer to attempt to join all outstanding multicast
>>> groups at once.  The ib_sa layer will serialize multiple attempts to
>>> join the same group, but will process attempts to join different groups
>>> in parallel.  Take advantage of that.
>>>
>>> In order to make this happen, change the mcast_join_thread to loop
>>> through all needed joins, sending a join request for each one that we
>>> still need to join.  There are a few special cases we handle though:
>>>
>>> 1) Don't attempt to join anything but the broadcast group until the join
>>> of the broadcast group has succeeded.
>>> 2) No longer restart the join task at the end of completion handling.
>>> If we completed successfully, we are done.  The join task now needs to
>>> be kicked by mcast_send, mcast_restart_task, or mcast_start_thread, but
>>> should not need to be started at any other time except when scheduling
>>> a backoff attempt to rejoin.
>>> 3) No longer use separate join/completion routines for regular and
>>> sendonly joins, pass them all through the same routine and just do the
>>> right thing based on the SENDONLY join flag.
>>> 4) Only try to join a SENDONLY join twice, then drop the packets and
>>> quit trying.  We leave the mcast group in the list so that if we get a
>>> new packet, all that we have to do is queue up the packet and restart
>>> the join task and it will automatically try to join twice and then
>>> either send or flush the queue again.
>>>
>>> Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>    drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 250 ++++++++-----------------
>>>    1 file changed, 82 insertions(+), 168 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>>> index 277e7ac7c4d..c670d9c2cda 100644
>>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>>> @@ -307,111 +307,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
>>>    	return 0;
>>>    }
>>>    
>>> -static int
>>> -ipoib_mcast_sendonly_join_complete(int status,
>>> -				   struct ib_sa_multicast *multicast)
>>> -{
>>> -	struct ipoib_mcast *mcast = multicast->context;
>>> -	struct net_device *dev = mcast->dev;
>>> -	struct ipoib_dev_priv *priv = netdev_priv(dev);
>>> -
>>> -	/*
>>> -	 * We have to take the mutex to force mcast_sendonly_join to
>>> -	 * return from ib_sa_multicast_join and set mcast->mc to a
>>> -	 * valid value.  Otherwise we were racing with ourselves in
>>> -	 * that we might fail here, but get a valid return from
>>> -	 * ib_sa_multicast_join after we had cleared mcast->mc here,
>>> -	 * resulting in mis-matched joins and leaves and a deadlock
>>> -	 */
>>> -	mutex_lock(&mcast_mutex);
>>> -
>>> -	/* We trap for port events ourselves. */
>>> -	if (status == -ENETRESET) {
>>> -		status = 0;
>>> -		goto out;
>>> -	}
>>> -
>>> -	if (!status)
>>> -		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
>>> -
>>> -	if (status) {
>>> -		if (mcast->logcount++ < 20)
>>> -			ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
>>> -					"join failed for %pI6, status %d\n",
>>> -					mcast->mcmember.mgid.raw, status);
>>> -
>>> -		/* Flush out any queued packets */
>>> -		netif_tx_lock_bh(dev);
>>> -		while (!skb_queue_empty(&mcast->pkt_queue)) {
>>> -			++dev->stats.tx_dropped;
>>> -			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
>>> -		}
>>> -		netif_tx_unlock_bh(dev);
>>> -		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
>>> -	} else {
>>> -		mcast->backoff = 1;
>>> -		mcast->delay_until = jiffies;
>>> -		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
>>> -	}
>>> -out:
>>> -	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>>> -	if (status)
>>> -		mcast->mc = NULL;
>>> -	complete(&mcast->done);
>>> -	mutex_unlock(&mcast_mutex);
>>> -	return status;
>>> -}
>>> -
>>> -static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
>>> -{
>>> -	struct net_device *dev = mcast->dev;
>>> -	struct ipoib_dev_priv *priv = netdev_priv(dev);
>>> -	struct ib_sa_mcmember_rec rec = {
>>> -#if 0				/* Some SMs don't support send-only yet */
>>> -		.join_state = 4
>>> -#else
>>> -		.join_state = 1
>>> -#endif
>>> -	};
>>> -	int ret = 0;
>>> -
>>> -	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
>>> -		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
>>> -				"multicast joins\n");
>>> -		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>>> -		complete(&mcast->done);
>>> -		return -ENODEV;
>>> -	}
>>> -
>>> -	rec.mgid     = mcast->mcmember.mgid;
>>> -	rec.port_gid = priv->local_gid;
>>> -	rec.pkey     = cpu_to_be16(priv->pkey);
>>> -
>>> -	mutex_lock(&mcast_mutex);
>>> -	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
>>> -					 priv->port, &rec,
>>> -					 IB_SA_MCMEMBER_REC_MGID	|
>>> -					 IB_SA_MCMEMBER_REC_PORT_GID	|
>>> -					 IB_SA_MCMEMBER_REC_PKEY	|
>>> -					 IB_SA_MCMEMBER_REC_JOIN_STATE,
>>> -					 GFP_ATOMIC,
>>> -					 ipoib_mcast_sendonly_join_complete,
>>> -					 mcast);
>>> -	if (IS_ERR(mcast->mc)) {
>>> -		ret = PTR_ERR(mcast->mc);
>>> -		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>>> -		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
>>> -			   "failed (ret = %d)\n", ret);
>>> -		complete(&mcast->done);
>>> -	} else {
>>> -		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
>>> -				"sendonly join\n", mcast->mcmember.mgid.raw);
>>> -	}
>>> -	mutex_unlock(&mcast_mutex);
>>> -
>>> -	return ret;
>>> -}
>>> -
>>>    void ipoib_mcast_carrier_on_task(struct work_struct *work)
>>>    {
>>>    	struct ipoib_dev_priv *priv = container_of(work, struct ipoib_dev_priv,
>>> @@ -452,7 +347,9 @@ static int ipoib_mcast_join_complete(int status,
>>>    	struct net_device *dev = mcast->dev;
>>>    	struct ipoib_dev_priv *priv = netdev_priv(dev);
>>>    
>>> -	ipoib_dbg_mcast(priv, "join completion for %pI6 (status %d)\n",
>>> +	ipoib_dbg_mcast(priv, "%sjoin completion for %pI6 (status %d)\n",
>>> +			test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ?
>>> +			"sendonly " : "",
>>>    			mcast->mcmember.mgid.raw, status);
>>>    
>>>    	/*
>>> @@ -477,27 +374,52 @@ static int ipoib_mcast_join_complete(int status,
>>>    	if (!status) {
>>>    		mcast->backoff = 1;
>>>    		mcast->delay_until = jiffies;
>>> -		__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
>>>    
>>>    		/*
>>>    		 * Defer carrier on work to priv->wq to avoid a
>>> -		 * deadlock on rtnl_lock here.
>>> +		 * deadlock on rtnl_lock here.  Requeue our multicast
>>> +		 * work too, which will end up happening right after
>>> +		 * our carrier on task work and will allow us to
>>> +		 * send out all of the non-broadcast joins
>>>    		 */
>>> -		if (mcast == priv->broadcast)
>>> +		if (mcast == priv->broadcast) {
>>>    			queue_work(priv->wq, &priv->carrier_on_task);
>>> +			__ipoib_mcast_schedule_join_thread(priv, NULL, 0);
>>> +		}
>>>    	} else {
>>>    		if (mcast->logcount++ < 20) {
>>>    			if (status == -ETIMEDOUT || status == -EAGAIN) {
>>> -				ipoib_dbg_mcast(priv, "multicast join failed for %pI6, status %d\n",
>>> +				ipoib_dbg_mcast(priv, "%smulticast join failed for %pI6, status %d\n",
>>> +						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
>>>    						mcast->mcmember.mgid.raw, status);
>>>    			} else {
>>> -				ipoib_warn(priv, "multicast join failed for %pI6, status %d\n",
>>> +				ipoib_warn(priv, "%smulticast join failed for %pI6, status %d\n",
>>> +						test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) ? "sendonly " : "",
>>>    					   mcast->mcmember.mgid.raw, status);
>>>    			}
>>>    		}
>>>    
>>> -		/* Requeue this join task with a backoff delay */
>>> -		__ipoib_mcast_schedule_join_thread(priv, mcast, 1);
>>> +		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags) &&
>>> +		    mcast->backoff >= 2) {
>>> +			/*
>>> +			 * We only retry sendonly joins once before we drop
>>> +			 * the packet and quit trying to deal with the
>>> +			 * group.  However, we leave the group in the
>>> +			 * mcast list as an unjoined group.  If we want to
>>> +			 * try joining again, we simply queue up a packet
>>> +			 * and restart the join thread.  The empty queue
>>> +			 * is why the join thread ignores this group.
>>> +			 */
>> Question: does the sendonly group stay in the list forever? It looks
>> like it does, and that behavior predates your patches, so it should
>> probably be addressed in a separate patch.
> Correct.  That logic is unchanged.  It probably deserves some sort of
> timeout such that after X seconds with no traffic on a sendonly group,
> we leave that group and only rejoin if we get a new send.  I had thought
> about making it something really long, like 20 minutes, but thinking
> about it is all I've done.  I didn't code anything up or include that in
> these patches.
>
OK, thanks. We will need to send a patch for that.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]                     ` <1425308967.2354.19.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-03-03  9:59                       ` Erez Shitrit
       [not found]                         ` <54F585E9.7070704-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Erez Shitrit @ 2015-03-03  9:59 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Or Gerlitz, Erez Shitrit

On 3/2/2015 5:09 PM, Doug Ledford wrote:
> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
>> On 2/26/2015 6:27 PM, Doug Ledford wrote:
>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
>>>>>     	if (level == IPOIB_FLUSH_LIGHT) {
>>>>>     		ipoib_mark_paths_invalid(dev);
>>>>>     		ipoib_mcast_dev_flush(dev);
>>>>> +		ipoib_flush_ah(dev, 0);
>>>> Why do you need to call the flush function here?
>>> To remove all of the ah's that were reduced to a 0 refcount by the
>>> previous two functions prior to restarting operations.  When we remove
>>> an ah, it calls ib_destroy_ah which calls all the way down into the low
>>> level driver.  This was to make sure that old, stale data was removed
>>> all the way down to the card level before we started new queries for
>>> paths and ahs.
>> Yes, but it is not needed.
> That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
> isn't needed but doesn't hurt either.  For older cards (in particular,
> mthca), the driver actually frees up card resources at the time of the
> call.

Can you please elaborate here? I took a look at the mthca driver and
didn't see that.
Anyway, what I don't understand is why you need to do that now: the ah
is already on the dead_ah_list, so it will be cleared in at most 1 sec,
and if the driver goes down, your other hunk already handles that.

>> The bug happened when the driver was removed (via modprobe -r etc.)
>> while there were ah's on the dead_ah list; you fixed that in
>> ipoib_ib_dev_cleanup. The call that you added here is not relevant to
>> that bug (and IMHO is not needed at all).
> I never said that this hunk was part of the original bug I saw before.
>
>> So the task of cleaning the dead ahs is already there; no need to
>> call it again, it will be called anyway in 1 sec at the most.
>>
>> You can try it: take out that call, and no harm or memory leak will happen.
> I have no doubt that it will get freed later.  As I said, I never
> considered this particular hunk part of that original bug.  But, as I
> point out above, there is no harm in it for any hardware, and depending
> on hardware it can help to make sure there isn't a shortage of
> resources.  Given that fact, I see no reason to remove it.
>
>>>> I can't see the reason to use the flush not from the stop_ah, meaning
>>>> without setting the IPOIB_STOP_REAPER, the flush can send twice the same
>>>> work.
>>> No, it can't.  The ah flush routine does not search through ahs to find
>>> ones to flush.  When you delete neighbors and mcasts, they release their
>>> references to ahs.  When the refcount goes to 0, the put routine puts
>>> the ah on the to-be-deleted ah list.  All the flush does is take that
>>> list and delete the items.  If you run the flush twice, the first run
>>> deletes all the items on the to-be-deleted list, the second run sees an
>>> empty list and does nothing.
>>>
>>> As for using flush versus stop: the flush function cancels any delayed
>>> ah_flush work so that it isn't racing with the normally scheduled
>> calling cancel_delayed_work on a work item that can re-schedule itself
>> does not help; the work can be in the middle of a run and re-schedule
>> itself...
> If it is in the middle of a run and reschedules itself, then it will
> schedule itself at precisely the same time we would have anyway, and we
> will get flushed properly, so the net result of this particular race is
> that we end up doing exactly what we wanted to do anyway.
>
>>> ah_flush, then flushes the workqueue to make sure anything that might
>>> result in an ah getting freed is done, then flushes, then schedules a
>>> new delayed flush_ah work 1 second later.  So, it does exactly what a
>>> flush should do: it removes what there is currently to remove, and in
>>> the case of a periodically scheduled garbage collection, schedules a new
>>> periodic flush at the maximum interval.
>>>
>>> It is not appropriate to call stop_ah at this point because it will
>>> cancel the delayed work, flush the ahs, then never reschedule the
>>> garbage collection.  If we called stop here, we would have to call start
>>> later.  But that's not really necessary as the flush cancels the
>>> scheduled work and reschedules it for a second later.
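The flush-versus-stop distinction described above can be sketched with a minimal, self-contained model. This is not the actual ipoib code: the `scheduled` flag stands in for the delayed ah_reap work, and the function names merely mirror the real helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: flush reaps now and re-arms the periodic GC;
 * stop reaps now and leaves the GC cancelled. */
struct reaper {
	int  dead_count;   /* ahs on the to-be-deleted list */
	bool scheduled;    /* delayed garbage-collection work pending */
};

static void reap(struct reaper *r)
{
	r->dead_count = 0;          /* destroy everything on the dead list */
}

/* flush: cancel any pending work, reap, reschedule the periodic GC */
static void flush_ah(struct reaper *r)
{
	r->scheduled = false;       /* cancel_delayed_work() */
	reap(r);
	r->scheduled = true;        /* queue_delayed_work(..., HZ) */
}

/* stop: cancel any pending work, reap, do NOT reschedule */
static void stop_ah(struct reaper *r)
{
	r->scheduled = false;
	reap(r);
}
```

After a flush the periodic collection is still armed; after a stop it is not, which is why stop is only appropriate on the teardown path.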
>>>
>>>>>     	}
>>>>>     
>>>>>     	if (level >= IPOIB_FLUSH_NORMAL)
>>>>> @@ -1100,6 +1102,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
>>>>>     	ipoib_mcast_stop_thread(dev, 1);
>>>>>     	ipoib_mcast_dev_flush(dev);
>>>>>     
>>>>> +	/*
>>>>> +	 * All of our ah references aren't free until after
>>>>> +	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
>>>>> +	 * the neighbor garbage collection is stopped and reaped.
>>>>> +	 * That should all be done now, so make a final ah flush.
>>>>> +	 */
>>>>> +	ipoib_stop_ah(dev, 1);
>>>>> +
>>>>>     	ipoib_transport_dev_cleanup(dev);
>>>>>     }
>>>>>     
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]                         ` <54F585E9.7070704-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-03-13  8:39                           ` Or Gerlitz
       [not found]                             ` <CAJ3xEMgxxHu5BQdADaRe-Grtf4rm1LMfsCRiDyF6ToPdV_62OA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Or Gerlitz @ 2015-03-13  8:39 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit,
	Erez Shitrit

On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> On 3/2/2015 5:09 PM, Doug Ledford wrote:
>>
>> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
>>>
>>> On 2/26/2015 6:27 PM, Doug Ledford wrote:
>>>>>>
>>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
>>>>>> ipoib_dev_priv *priv,
>>>>>>         if (level == IPOIB_FLUSH_LIGHT) {
>>>>>>                 ipoib_mark_paths_invalid(dev);
>>>>>>                 ipoib_mcast_dev_flush(dev);
>>>>>> +               ipoib_flush_ah(dev, 0);
>>>>>
>>>>> Why do you need to call the flush function here?
>>>>
>>>> To remove all of the ah's that were reduced to a 0 refcount by the
>>>> previous two functions prior to restarting operations.  When we remove
>>>> an ah, it calls ib_destroy_ah which calls all the way down into the low
>>>> level driver.  This was to make sure that old, stale data was removed
>>>> all the way down to the card level before we started new queries for
>>>> paths and ahs.
>>>
>>> Yes. but it is not needed.
>>
>> That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
>> isn't needed but doesn't hurt either.  For older cards (in particular,
>> mthca), the driver actually frees up card resources at the time of the
>> call.
>
>
> Can you please elaborate more here, I took a look in the mthca and didn't
> see that.
> anyway, what i don't understand is why you need to do that now, the ah is
> already in the dead_ah_list so, in at the most 1 sec will be cleared and if
> the driver goes down your other hunk fixed that.

Doug, ten days and no response from you... let's finalize the review on
this series so we have it safely done for 4.1 -- on top of it Erez is
preparing a set of IPoIB fixes from our internal tree and we want those
for 4.1 too. Please address.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
       [not found] ` <cover.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (9 preceding siblings ...)
  2015-02-22 21:34   ` [PATCH 0/9] IB/ipoib: fixup multicast locking issues Or Gerlitz
@ 2015-03-13  8:41   ` Or Gerlitz
       [not found]     ` <CAJ3xEMjHrTH_F=zPDsH9A9qRWo=AYN4sgbsdDKV62nzBkB5kXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  10 siblings, 1 reply; 37+ messages in thread
From: Or Gerlitz @ 2015-03-13  8:41 UTC (permalink / raw)
  To: Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit

On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> This is the re-ordered, squashed version of my 22 patch set that I
> posted on Feb 11.  There are a few minor differences between that
> set and this one.  They are:
[...]

Doug, you wrote here a very detailed listing of the changes from
earlier posts and the testing the patches went through, which is
excellent. It would be very good if you could also post a few lines
describing the changes done by the series at a high level, so we can
have that text as part of a "merge" commit that shows up in the kernel
logs.

Or.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]                             ` <CAJ3xEMgxxHu5BQdADaRe-Grtf4rm1LMfsCRiDyF6ToPdV_62OA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-15 18:42                               ` Doug Ledford
       [not found]                                 ` <3A0A417D-BFE4-475C-BAB3-C3FB1D313022-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-03-15 18:42 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit,
	Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 3022 bytes --]


> On Mar 13, 2015, at 1:39 AM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>> On 3/2/2015 5:09 PM, Doug Ledford wrote:
>>> 
>>> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
>>>> 
>>>> On 2/26/2015 6:27 PM, Doug Ledford wrote:
>>>>>>> 
>>>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
>>>>>>> ipoib_dev_priv *priv,
>>>>>>>        if (level == IPOIB_FLUSH_LIGHT) {
>>>>>>>                ipoib_mark_paths_invalid(dev);
>>>>>>>                ipoib_mcast_dev_flush(dev);
>>>>>>> +               ipoib_flush_ah(dev, 0);
>>>>>> 
>>>>>> Why do you need to call the flush function here?
>>>>> 
>>>>> To remove all of the ah's that were reduced to a 0 refcount by the
>>>>> previous two functions prior to restarting operations.  When we remove
>>>>> an ah, it calls ib_destroy_ah which calls all the way down into the low
>>>>> level driver.  This was to make sure that old, stale data was removed
>>>>> all the way down to the card level before we started new queries for
>>>>> paths and ahs.
>>>> 
>>>> Yes. but it is not needed.
>>> 
>>> That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
>>> isn't needed but doesn't hurt either.  For older cards (in particular,
>>> mthca), the driver actually frees up card resources at the time of the
>>> call.
>> 
>> 
>> Can you please elaborate more here, I took a look in the mthca and didn't
>> see that.
>> anyway, what i don't understand is why you need to do that now, the ah is
>> already in the dead_ah_list so, in at the most 1 sec will be cleared and if
>> the driver goes down your other hunk fixed that.
> 
> Doug, ten days and no response from you... lets finalize the review on
> this series so we have it safely done for 4.1 -- on top of it Erez
> prepares a set of IPoIB fixes from our internal tree and we want that
> for 4.1 too. Please address.

I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c

int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
{
        switch (ah->type) {
        case MTHCA_AH_ON_HCA:
                mthca_free(&dev->av_table.alloc,
                           (ah->avdma - dev->av_table.ddr_av_base) /
                           MTHCA_AV_SIZE);
                break;


I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.
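The life cycle being debated here (a put drops the refcount, a zero-refcount ah parks on a dead list, and a later reap destroys everything on it) can be modeled with a short sketch. The structures and helpers below are illustrative stand-ins under that assumption, not the real ipoib or mthca code.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of deferred ah destruction. */
struct ah {
	int refcount;
	struct ah *next_dead;
	int destroyed;
};

static struct ah *dead_list;

/* Dropping the last reference defers the destroy onto the dead list. */
static void ah_put(struct ah *a)
{
	if (--a->refcount == 0) {
		a->next_dead = dead_list;
		dead_list = a;
	}
}

/* The periodic GC, or an explicit flush, destroys the whole list.
 * On hardware like mthca this is where on-card AV resources would
 * actually be released, hence the benefit of reaping eagerly. */
static int reap_dead(void)
{
	int n = 0;

	while (dead_list) {
		struct ah *a = dead_list;

		dead_list = a->next_dead;
		a->destroyed = 1;   /* stands in for ib_destroy_ah() */
		n++;
	}
	return n;
}
```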

—
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
	GPG Key ID: 0E572FDD






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 842 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
       [not found]     ` <CAJ3xEMjHrTH_F=zPDsH9A9qRWo=AYN4sgbsdDKV62nzBkB5kXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-03-15 18:52       ` Doug Ledford
       [not found]         ` <F42024C5-60A5-4B92-B4AC-4D225E2C0FC3-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-03-15 18:52 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 1965 bytes --]


> On Mar 13, 2015, at 1:41 AM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> This is the re-ordered, squashed version of my 22 patch set that I
>> posted on Feb 11.  There are a few minor differences between that
>> set and this one.  They are:
> [...]
> 
> Doug, you wrote here a very detailed listing of the changes from
> earlier posts and the testing the patches went through, which is
> excellent. It would be very good if you can also post few liner
> telling the changes done by the series in high level, so we can have
> this test as part of a "merge" commit that says in the kernel logs.

OK.  I would take what I had in the original message and expand upon it then:

This entire patchset was intended to address the issue of ipoib
interfaces being brought up/down in a tight loop, which will hardlock
a standard v3.19 kernel.  It succeeds at resolving that problem.

In order to accomplish this goal, it reworks how the IPOIB_MCAST_FLAG_BUSY flag is used.  Conceptually, that flag used to be set when we started a multicast join, and would stay set once the join was complete.  This left no way to tell if the multicast join was complete or still in flight, which allowed race conditions to develop between joining multicast groups and taking an interface down.

A previous attempt to resolve these race conditions used the flag IPOIB_MCAST_JOIN_STARTED, but did not succeed at fully resolving them.  This patchset resolves this issue, plus a number of related issues discovered while working on it.

The primary fix itself is patch 6/9, and a more complete description of the changes to how the IPOIB_MCAST_FLAG_BUSY flag is now used can be found in that commit log.
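As a rough, assumed model of the reworked semantics (simplified booleans standing in for the real bit flags and struct completion; the helper names are illustrative, not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>

/* BUSY now means "a join is in flight": set before issuing the join,
 * cleared only once the join callback has finished all its updates. */
struct mcast {
	bool busy;     /* IPOIB_MCAST_FLAG_BUSY: join in flight */
	bool joined;   /* join completed successfully */
	bool done;     /* stands in for complete(&mcast->done) */
};

static void mcast_join_start(struct mcast *m)
{
	m->busy = true;    /* set_bit(IPOIB_MCAST_FLAG_BUSY, ...) */
	m->done = false;
}

static void mcast_join_finish(struct mcast *m, bool ok)
{
	m->joined = ok;
	m->busy = false;   /* clear BUSY only after all other updates */
	m->done = true;    /* waiters on the completion may now proceed */
}
```

With this convention, code taking the interface down can test BUSY (or wait on the completion) to know whether a join is still outstanding, which is the race the old single-state flag could not express.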

—
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
	GPG Key ID: 0E572FDD






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 842 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]                                 ` <3A0A417D-BFE4-475C-BAB3-C3FB1D313022-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-03-16 15:24                                   ` Erez Shitrit
       [not found]                                     ` <5506F5B2.1080900-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: Erez Shitrit @ 2015-03-16 15:24 UTC (permalink / raw)
  To: Doug Ledford, Or Gerlitz
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier, Erez Shitrit

On 3/15/2015 8:42 PM, Doug Ledford wrote:
>> On Mar 13, 2015, at 1:39 AM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>> On Tue, Mar 3, 2015 at 11:59 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>>> On 3/2/2015 5:09 PM, Doug Ledford wrote:
>>>> On Sun, 2015-03-01 at 08:47 +0200, Erez Shitrit wrote:
>>>>> On 2/26/2015 6:27 PM, Doug Ledford wrote:
>>>>>>>> @@ -1037,6 +1038,7 @@ static void __ipoib_ib_dev_flush(struct
>>>>>>>> ipoib_dev_priv *priv,
>>>>>>>>         if (level == IPOIB_FLUSH_LIGHT) {
>>>>>>>>                 ipoib_mark_paths_invalid(dev);
>>>>>>>>                 ipoib_mcast_dev_flush(dev);
>>>>>>>> +               ipoib_flush_ah(dev, 0);
>>>>>>> Why do you need to call the flush function here?
>>>>>> To remove all of the ah's that were reduced to a 0 refcount by the
>>>>>> previous two functions prior to restarting operations.  When we remove
>>>>>> an ah, it calls ib_destroy_ah which calls all the way down into the low
>>>>>> level driver.  This was to make sure that old, stale data was removed
>>>>>> all the way down to the card level before we started new queries for
>>>>>> paths and ahs.
>>>>> Yes. but it is not needed.
>>>> That depends on the card.  For the modern cards (mlx4, mlx5, qib), it
>>>> isn't needed but doesn't hurt either.  For older cards (in particular,
>>>> mthca), the driver actually frees up card resources at the time of the
>>>> call.
>>>
>>> Can you please elaborate more here, I took a look in the mthca and didn't
>>> see that.
>>> anyway, what i don't understand is why you need to do that now, the ah is
>>> already in the dead_ah_list so, in at the most 1 sec will be cleared and if
>>> the driver goes down your other hunk fixed that.
>> Doug, ten days and no response from you... lets finalize the review on
>> this series so we have it safely done for 4.1 -- on top of it Erez
>> prepares a set of IPoIB fixes from our internal tree and we want that
>> for 4.1 too. Please address.
> I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c
>
> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
> {
>          switch (ah->type) {
>          case MTHCA_AH_ON_HCA:
>                  mthca_free(&dev->av_table.alloc,
>                             (ah->avdma - dev->av_table.ddr_av_base) /
>                             MTHCA_AV_SIZE);
>                  break;
>
>
> I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.
Sorry, but I still don't see the connection to the device type.
It will be deleted/freed in the regular flow, like in the rest 
of the ah life cycle cases (in neigh_dtor, path_free, etc.), so 
why should it be done here directly after the event?

> —
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 	GPG Key ID: 0E572FDD
>
>
>
>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]                                     ` <5506F5B2.1080900-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-03-16 16:06                                       ` Doug Ledford
       [not found]                                         ` <ADC46FD9-3179-4182-949D-1884C9D31757-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-03-16 18:00                                       ` Doug Ledford
  1 sibling, 1 reply; 37+ messages in thread
From: Doug Ledford @ 2015-03-16 16:06 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier,
	Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 2210 bytes --]


> On Mar 16, 2015, at 8:24 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> 
> On 3/15/2015 8:42 PM, Doug Ledford wrote:
>> 
>>> Doug, ten days and no response from you... lets finalize the review on
>>> this series so we have it safely done for 4.1 -- on top of it Erez
>>> prepares a set of IPoIB fixes from our internal tree and we want that
>>> for 4.1 too. Please address.
>> I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c
>> 
>> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
>> {
>>         switch (ah->type) {
>>         case MTHCA_AH_ON_HCA:
>>                 mthca_free(&dev->av_table.alloc,
>>                            (ah->avdma - dev->av_table.ddr_av_base) /
>>                            MTHCA_AV_SIZE);
>>                 break;
>> 
>> 
>> I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.
> sorry, but i still don't see the connection to the device type.
> It will be deleted/freed with the regular flow, like it does in the rest of the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here it should be directly after the event?

Because it’s the right thing to do.  The only reason to do lazy deletion is when there is a performance benefit to batching.  There is no performance benefit to batching here.  And because on certain hardware the action frees resources on the card, which are limited, doing non-lazy deletion can be beneficial.  Given that there is no downside to doing the deletions in a non-lazy fashion, and that there can be an upside depending on hardware, there is no reason to stick with the lazy deletions.

—
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
	GPG Key ID: 0E572FDD






[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 842 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]                                         ` <ADC46FD9-3179-4182-949D-1884C9D31757-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-03-16 16:51                                           ` Erez Shitrit
  0 siblings, 0 replies; 37+ messages in thread
From: Erez Shitrit @ 2015-03-16 16:51 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier,
	Erez Shitrit

On 3/16/2015 6:06 PM, Doug Ledford wrote:
>> On Mar 16, 2015, at 8:24 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>>
>> On 3/15/2015 8:42 PM, Doug Ledford wrote:
>>>> Doug, ten days and no response from you... lets finalize the review on
>>>> this series so we have it safely done for 4.1 -- on top of it Erez
>>>> prepares a set of IPoIB fixes from our internal tree and we want that
>>>> for 4.1 too. Please address.
>>> I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c
>>>
>>> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
>>> {
>>>          switch (ah->type) {
>>>          case MTHCA_AH_ON_HCA:
>>>                  mthca_free(&dev->av_table.alloc,
>>>                             (ah->avdma - dev->av_table.ddr_av_base) /
>>>                             MTHCA_AV_SIZE);
>>>                  break;
>>>
>>>
>>> I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.
>> sorry, but i still don't see the connection to the device type.
>> It will be deleted/freed with the regular flow, like it does in the rest of the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here it should be directly after the event?
> Because it’s the right thing to do.  The only reason to do lazy deletion is when there is a performance benefit to batching.  There is no performance benefit to batching here.  And because on certain hardware the action frees resources on the card, which are limited, doing non-lazy deletion can be beneficial.  Given that there is no downside to doing the deletions in a non-lazy fashion, and that there can be an upside depending on hardware, there is no reason to stick with the lazy deletions.
OK, I understand your point. I am not sure why it is not always done 
with the ah deletion, but anyway it is harmless here.

> —
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 	GPG Key ID: 0E572FDD
>
>
>
>
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 1/9] IB/ipoib: factor out ah flushing
       [not found]                                     ` <5506F5B2.1080900-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-03-16 16:06                                       ` Doug Ledford
@ 2015-03-16 18:00                                       ` Doug Ledford
  1 sibling, 0 replies; 37+ messages in thread
From: Doug Ledford @ 2015-03-16 18:00 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier,
	Erez Shitrit


> On Mar 16, 2015, at 8:24 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> 
> On 3/15/2015 8:42 PM, Doug Ledford wrote:
>> 
>>> Doug, ten days and no response from you... lets finalize the review on
>>> this series so we have it safely done for 4.1 -- on top of it Erez
>>> prepares a set of IPoIB fixes from our internal tree and we want that
>>> for 4.1 too. Please address.
>> I didn’t have much to say here.  I said that mthca can have card resources freed by this call, which is backed up by this code in mthca_ah.c
>> 
>> int mthca_destroy_ah(struct mthca_dev *dev, struct mthca_ah *ah)
>> {
>>        switch (ah->type) {
>>        case MTHCA_AH_ON_HCA:
>>                mthca_free(&dev->av_table.alloc,
>>                           (ah->avdma - dev->av_table.ddr_av_base) /
>>                           MTHCA_AV_SIZE);
>>                break;
>> 
>> 
>> I’m not entirely sure how Erez missed that, but it’s there and it’s what gets called when we destroy an ah (depending on the card of course).  So, that represents one case where freeing the resources in a non-lazy fashion has a direct benefit.  And there is no cited drawback to freeing the resources in a non-lazy fashion on a net event, so I don’t see what there is to discuss further on the issue.
> sorry, but i still don't see the connection to the device type.
> It will be deleted/freed with the regular flow, like it does in the rest of the life cycle cases of the ah (in neigh_dtor, path_free, etc.), so why here it should be directly after the event?

Because it’s the right thing to do.  The only reason to do lazy deletion is when there is a performance benefit to batching.  There is no performance benefit to batching here.  And because on certain hardware the action frees resources on the card, which are limited, doing non-lazy deletion can be beneficial.  Given that there is no downside to doing the deletions in a non-lazy fashion, and that there can be an upside depending on hardware, there is no reason to stick with the lazy deletions.

—
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
	GPG Key ID: 0E572FDD






^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
       [not found]         ` <F42024C5-60A5-4B92-B4AC-4D225E2C0FC3-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-03-31 17:04           ` ira.weiny
       [not found]             ` <20150331170452.GA6261-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
  0 siblings, 1 reply; 37+ messages in thread
From: ira.weiny @ 2015-03-31 17:04 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Or Gerlitz, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Roland Dreier,
	Erez Shitrit

On Sun, Mar 15, 2015 at 11:52:44AM -0700, Doug Ledford wrote:
> 
> > On Mar 13, 2015, at 1:41 AM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> > 
> > On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> This is the re-ordered, squashed version of my 22 patch set that I
> >> posted on Feb 11.  There are a few minor differences between that
> >> set and this one.  They are:
> > [...]
> > 
> > Doug, you wrote here a very detailed listing of the changes from
> > earlier posts and the testing the patches went through, which is
> > excellent. It would be very good if you can also post few liner
> > telling the changes done by the series in high level, so we can have
> > this test as part of a "merge" commit that says in the kernel logs.
> 
> OK.  I would take what I had in the original message and expand upon it then:
> 
> This entire patchset was intended to address the issue of ipoib
> interfaces being brought up/down in a tight loop, which will hardlock
> a standard v3.19 kernel.  It succeeds at resolving that problem.


I pulled this series and did some medium-weight testing on 3.19 (module
reloads, insmod/rmmod, opensm restarts (client re-register)).  IPoIB recovered
without issue on each of the tests.

Tested-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

> 
> In order to accomplish this goal, it reworks how the IPOIB_MCAST_FLAG_BUSY flag is used.  Conceptually, that flag used to be set when we started a multicast join, and would stay set once the join was complete.  This left no way to tell if the multicast join was complete or still in flight.  This allowed race conditions to develop between joining multicast groups and taking an interface down.  A previous attempt to resolve these race conditions used the flag IPOIB_MCAST_JOIN_STARTED, but did not succeed at fully resolving the race conditions.  This patchset resolves this issue, plus a number of related issues discovered while working on this issue.  The primary fix itself is patch 6/9 and a more complete description of the changes to how the IPOIB_MCAST_FLAG_BUSY flag is now used can be found in that commit log.
> 
> —
> Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> 	GPG Key ID: 0E572FDD
> 
> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 0/9] IB/ipoib: fixup multicast locking issues
       [not found]             ` <20150331170452.GA6261-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
@ 2015-03-31 20:42               ` Or Gerlitz
  0 siblings, 0 replies; 37+ messages in thread
From: Or Gerlitz @ 2015-03-31 20:42 UTC (permalink / raw)
  To: ira.weiny, Roland Dreier
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Erez Shitrit

On Tue, Mar 31, 2015 at 8:04 PM, ira.weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> On Sun, Mar 15, 2015 at 11:52:44AM -0700, Doug Ledford wrote:
>>
>> > On Mar 13, 2015, at 1:41 AM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> >
>> > On Sun, Feb 22, 2015 at 2:26 AM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> >> This is the re-ordered, squashed version of my 22 patch set that I
>> >> posted on Feb 11.  There are a few minor differences between that
>> >> set and this one.  They are:
>> > [...]
>> >
>> > Doug, you wrote here a very detailed listing of the changes from
>> > earlier posts and the testing the patches went through, which is
>> > excellent.  It would also be very good if you could post a few lines
>> > describing, at a high level, the changes made by the series, so that
>> > text can become part of a "merge" commit message in the kernel logs.
>>
>> OK.  I would take what I had in the original message and expand upon it then:
>>
>> This entire patchset was intended to address the issue of ipoib
>> interfaces being brought up/down in a tight loop, which will hardlock
>> a standard v3.19 kernel.  It succeeds at resolving that problem.
>
>
> I pulled this series and did some medium-weight testing on 3.19 (module
> reloads, insmod/rmmod, opensm restarts (client re-register)).  IPoIB
> recovered without issue in each of the tests.
> Tested-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Yep, here too. We tested upstream + this series and it works well.


end of thread, other threads:[~2015-03-31 20:42 UTC | newest]

Thread overview: 37+ messages
2015-02-22  0:26 [PATCH 0/9] IB/ipoib: fixup multicast locking issues Doug Ledford
     [not found] ` <cover.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-22  0:26   ` [PATCH 1/9] IB/ipoib: factor out ah flushing Doug Ledford
     [not found]     ` <b06eb720c2f654f5ecdb72c66f4e89149d1c24ec.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-26 13:28       ` Erez Shitrit
     [not found]         ` <54EF1F67.4000001-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-02-26 16:27           ` Doug Ledford
     [not found]             ` <1424968046.2543.18.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-01  6:47               ` Erez Shitrit
     [not found]                 ` <54F2B61C.9080308-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-03-02 15:09                   ` Doug Ledford
     [not found]                     ` <1425308967.2354.19.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-03  9:59                       ` Erez Shitrit
     [not found]                         ` <54F585E9.7070704-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-03-13  8:39                           ` Or Gerlitz
     [not found]                             ` <CAJ3xEMgxxHu5BQdADaRe-Grtf4rm1LMfsCRiDyF6ToPdV_62OA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-15 18:42                               ` Doug Ledford
     [not found]                                 ` <3A0A417D-BFE4-475C-BAB3-C3FB1D313022-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-16 15:24                                   ` Erez Shitrit
     [not found]                                     ` <5506F5B2.1080900-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-03-16 16:06                                       ` Doug Ledford
     [not found]                                         ` <ADC46FD9-3179-4182-949D-1884C9D31757-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-16 16:51                                           ` Erez Shitrit
2015-03-16 18:00                                       ` Doug Ledford
2015-02-22  0:27   ` [PATCH 2/9] IB/ipoib: change init sequence ordering Doug Ledford
2015-02-22  0:27   ` [PATCH 3/9] IB/ipoib: Consolidate rtnl_lock tasks in workqueue Doug Ledford
2015-02-22  0:27   ` [PATCH 4/9] IB/ipoib: Make the carrier_on_task race aware Doug Ledford
2015-02-22  0:27   ` [PATCH 5/9] IB/ipoib: Use dedicated workqueues per interface Doug Ledford
     [not found]     ` <1cfdf15058cea312f07c2907490a1d7300603c40.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-23 16:48       ` Or Gerlitz
2015-02-22  0:27   ` [PATCH 6/9] IB/ipoib: No longer use flush as a parameter Doug Ledford
2015-02-22  0:27   ` [PATCH 7/9] IB/ipoib: fix MCAST_FLAG_BUSY usage Doug Ledford
     [not found]     ` <9d657f64ee961ee3b3233520d8b499b234a42bcd.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-01  9:31       ` Erez Shitrit
     [not found]         ` <54F2DC81.304-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-03-02 15:27           ` Doug Ledford
     [not found]             ` <1425310036.2354.24.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-03  9:53               ` Erez Shitrit
2015-02-22  0:27   ` [PATCH 8/9] IB/ipoib: deserialize multicast joins Doug Ledford
     [not found]     ` <a24ade295dfdd1369aac47a978003569ec190952.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-01 13:58       ` Erez Shitrit
     [not found]         ` <54F31AEC.3010001-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-03-02 15:29           ` Doug Ledford
     [not found]             ` <1425310145.2354.26.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-03  9:54               ` Erez Shitrit
2015-02-22  0:27   ` [PATCH 9/9] IB/ipoib: drop mcast_mutex usage Doug Ledford
     [not found]     ` <767f4c41779db63ce8c6dbba04b21959aba70ef9.1424562072.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-23 16:56       ` Or Gerlitz
     [not found]         ` <CAJ3xEMgLPF9pCwQDy9QyL9fAERJXJRXN2gBj3nhuXUCcbfCMPg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-02-23 17:41           ` Doug Ledford
2015-02-22 21:34   ` [PATCH 0/9] IB/ipoib: fixup multicast locking issues Or Gerlitz
     [not found]     ` <CAJ3xEMgj=ATKLt0MA67c3WefCrG1hZ59eSrhpD-u_dxLJe2kfg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-02-22 21:56       ` Doug Ledford
     [not found]         ` <1424642176.4847.2.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-02-22 21:57           ` Doug Ledford
2015-03-13  8:41   ` Or Gerlitz
     [not found]     ` <CAJ3xEMjHrTH_F=zPDsH9A9qRWo=AYN4sgbsdDKV62nzBkB5kXA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-03-15 18:52       ` Doug Ledford
     [not found]         ` <F42024C5-60A5-4B92-B4AC-4D225E2C0FC3-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-03-31 17:04           ` ira.weiny
     [not found]             ` <20150331170452.GA6261-W4f6Xiosr+yv7QzWx2u06xL4W9x8LtSr@public.gmane.org>
2015-03-31 20:42               ` Or Gerlitz
