[PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
@ 2015-01-20  3:58 Doug Ledford
       [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Doug Ledford @ 2015-01-20  3:58 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

These patches are to resolve issues created by my previous patch set.
While that set worked fine in my testing, there were problems with
multicast joins after the initial set of joins had completed.  Since my
testing relied upon the normal set of multicast joins that happen
when the interface is first brought up, I missed those problems.

Symptoms vary from failure to send packets due to a failed join, to
loss of connectivity after a subnet manager restart, to failure
to properly release multicast groups on shutdown resulting in hangs
when the mlx4 driver attempts to unload itself via its reboot
notifier handler.

This set of patches has passed a number of tests above and beyond my
original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
multicast tests.  I also added both subnet manager restarts and
manual shutdown/restart of individual ports at the switch in order to
ensure that the ENETRESET path was properly tested.  I included
testing, then a subnet manager restart, then a quiescent period for
caches to expire, then restarting testing to make sure that arp and
neighbor discovery work after the subnet manager restart.

All in all, I have not been able to trip the multicast joins up any
longer.

Additionally, the original impetus for my first 8 patch set was that
it was simply too easy to break the IPoIB subsystem with this simple
loop:

while true; do
    ifconfig ib0 up
    ifconfig ib0 down
done

Just to be safe, I made sure this problem did not resurface.

Roland, the 3.19-rc code is broken.  We either need to revert my
original patchset, or grab these, but I would not recommend leaving
it as it currently stands.

Doug Ledford (7):
  IB/ipoib: Fix failed multicast joins/sends
  IB/ipoib: Add a helper to restart the multicast task
  IB/ipoib: make delayed tasks not hold up everything
  IB/ipoib: Handle -ENETRESET properly in our callback
  IB/ipoib: don't restart our thread on ENETRESET
  IB/ipoib: remove unneeded locks
  IB/ipoib: fix race between mcast_dev_flush and mcast_join

 drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 204 +++++++++++++++----------
 2 files changed, 121 insertions(+), 84 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

[parent not found: <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]

* [PATCH FIX For-3.19 v4 1/7] IB/ipoib: Fix failed multicast joins/sends
       [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-20  3:58   ` Doug Ledford
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 2/7] IB/ipoib: Add a helper to restart the multicast task Doug Ledford
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-20  3:58 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
it is used to mean "our device is administratively allowed to send
multicast joins/leaves/packets" and in other places it means "our
multicast join task thread is currently running and will process your
request if you put it on the queue".  However, this latter meaning is in
fact flawed as there is a race condition between the join task testing
the mcast list and finding it empty of remaining work, dropping the
mcast mutex and also the priv->lock spinlock, and clearing the
IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
the flag in the former fashion, and when all tasks complete and the task
thread clears the RUN flag, all of those other locations will fail to
ever again queue any work.  This results in the interface coming up fine
initially, but having problems adding new multicast groups after the
first round of groups have all been added and the RUN flag is cleared by
the join task thread when it thinks it is done.  To resolve this issue,
convert all locations in the code to treat the RUN flag as an indicator
that the multicast portion of this interface is in fact administratively
up and joins/leaves/sends can be performed.  There is no harm (other
than a slight performance penalty) to never clearing this flag and using
it in this fashion as it simply means that a few places that used to
micro-optimize how often this task was queued on a work queue will now
queue the task a few extra times.  We can address that suboptimal
behavior in future patches.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index bc50dd0d0e4..91b8fe118ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -630,8 +630,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}

 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
-
-	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 }

 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -641,8 +639,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	ipoib_dbg_mcast(priv, "starting multicast thread\n");

 	mutex_lock(&mcast_mutex);
-	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	set_bit(IPOIB_MCAST_RUN, &priv->flags);
+	queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	mutex_unlock(&mcast_mutex);

 	return 0;
@@ -725,7 +723,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
 			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}

@@ -951,7 +949,8 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	/*
 	 * Restart our join task if needed
 	 */
-	ipoib_mcast_start_thread(dev);
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	rtnl_unlock();
 }

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH FIX For-3.19 v4 2/7] IB/ipoib: Add a helper to restart the multicast task
       [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 1/7] IB/ipoib: Fix failed multicast joins/sends Doug Ledford
@ 2015-01-20  3:58   ` Doug Ledford
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 3/7] IB/ipoib: make delayed tasks not hold up everything Doug Ledford
                     ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-20  3:58 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

Add a simple helper to do the right thing when restarting
the multicast thread.  Call that helper from all the places
that need to restart the thread.  Add two places that didn't
restart the thread, but should have.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 46 ++++++++++++++++----------
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 91b8fe118ec..cb1e495bd74 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -66,6 +66,13 @@ struct ipoib_mcast_iter {
 	unsigned int       send_only;
 };
 
+static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
+					       int delay)
+{
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+}
+
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
 	struct net_device *dev = mcast->dev;
@@ -270,6 +277,7 @@ ipoib_mcast_sendonly_join_complete(int status,
 {
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/*
 	 * We have to take the mutex to force mcast_sendonly_join to
@@ -309,6 +317,7 @@ out:
 	complete(&mcast->done);
 	if (status == -ENETRESET)
 		status = 0;
+	__ipoib_mcast_continue_join_thread(priv, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
@@ -360,6 +369,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 		complete(&mcast->done);
 		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
 			   "failed (ret = %d)\n", ret);
+		__ipoib_mcast_continue_join_thread(priv, 0);
 	} else {
 		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
 				"sendonly join\n", mcast->mcmember.mgid.raw);
@@ -431,8 +441,7 @@ static int ipoib_mcast_join_complete(int status,
 
 	if (!status) {
 		mcast->backoff = 1;
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+		__ipoib_mcast_continue_join_thread(priv, 0);
 
 		/*
 		 * Defer carrier on work to priv->wq to avoid a
@@ -454,6 +463,13 @@ static int ipoib_mcast_join_complete(int status,
 		mcast->backoff *= 2;
 		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+		/*
+		 * XXX - This is wrong.  *Our* join failed, but because the
+		 * join thread does the joins in a serial fashion, if there
+		 * are any joins behind ours waiting to complete, they should
+		 * not be subjected to our backoff delay.
+		 */
+		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
 	}
 out:
 	spin_lock_irq(&priv->lock);
@@ -463,9 +479,6 @@ out:
 	complete(&mcast->done);
 	if (status == -ENETRESET)
 		status = 0;
-	if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task,
-				   mcast->backoff * HZ);
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
@@ -532,10 +545,13 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		mcast->backoff *= 2;
 		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task,
-					   mcast->backoff * HZ);
+		/*
+		 * XXX - This is wrong.  *Our* join failed, but because the
+		 * join thread does the joins in a serial fashion, if there
+		 * are any joins behind ours waiting to complete, they should
+		 * not be subjected to our backoff delay.
+		 */
+		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
 	}
 	mutex_unlock(&mcast_mutex);
 }
@@ -573,9 +589,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
-			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-				queue_delayed_work(priv->wq, &priv->mcast_task,
-						   HZ);
+			__ipoib_mcast_continue_join_thread(priv, HZ);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -723,8 +737,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+		__ipoib_mcast_continue_join_thread(priv, 0);
 	}
 
 	if (!mcast->ah) {
@@ -947,10 +960,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		ipoib_mcast_free(mcast);
 	}
 	/*
-	 * Restart our join task if needed
+	 * Restart our join task thread if needed
 	 */
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	__ipoib_mcast_continue_join_thread(priv, 0);
 	rtnl_unlock();
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH FIX For-3.19 v4 3/7] IB/ipoib: make delayed tasks not hold up everything
       [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 1/7] IB/ipoib: Fix failed multicast joins/sends Doug Ledford
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 2/7] IB/ipoib: Add a helper to restart the multicast task Doug Ledford
@ 2015-01-20  3:58   ` Doug Ledford
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 4/7] IB/ipoib: Handle -ENETRESET properly in our callback Doug Ledford
                     ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-20  3:58 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

Currently, the multicast join thread only processes one task at a
time.  It does this by having itself be scheduled as a delayed work
item, when it runs, it finds one, and only one, piece of work to
process, it then kicks that off via either the normal join process
or the sendonly join process, and then it immediately exits.  Both
of those process chains are responsible for restarting the task
when they have completed their specific action.  This makes the entire
join process serial with only one join supposedly ever outstanding at
a time.

However, if we fail a join, and we need to initiate a backoff delay,
that delay holds up the entire join process for all joins, not just
the failed join.  So modify the design such that we can have joins
in delay, and the multicast thread will ignore them until their time
comes around, and then we can process the rest of the queue without
waiting for the delayed items.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 86 +++++++++++++++++---------
 2 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 8ba80a6d3a4..c79dcd5ee8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -154,6 +154,7 @@ struct ipoib_mcast {
 
 	unsigned long created;
 	unsigned long backoff;
+	unsigned long delay_until;
 
 	unsigned long flags;
 	unsigned char logcount;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index cb1e495bd74..957e7d2e80c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -67,10 +67,34 @@ struct ipoib_mcast_iter {
 };
 
 static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
+					       struct ipoib_mcast *mcast,
 					       int delay)
 {
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
+		/*
+		 * Mark this mcast for its delay and set a timer to kick the
+		 * thread when the delay completes
+		 */
+		if (mcast && delay) {
+			mcast->backoff *= 2;
+			if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
+				mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+			mcast->delay_until = jiffies + (mcast->backoff * HZ);
+			queue_delayed_work(priv->wq, &priv->mcast_task,
+					   mcast->backoff * HZ);
+		} else if (delay) {
+			/* Special case of retrying after a failure to
+			 * allocate the broadcast multicast group, wait
+			 * 1 second and try again
+			 */
+			queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
+		}
+		/*
+		 * But also rekick the thread immediately for any other
+		 * tasks in queue behind this one
+		 */
 		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+	}
 }
 
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
@@ -110,6 +134,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev,
 
 	mcast->dev = dev;
 	mcast->created = jiffies;
+	mcast->delay_until = jiffies;
 	mcast->backoff = 1;
 
 	INIT_LIST_HEAD(&mcast->list);
@@ -309,6 +334,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
+	} else {
+		/* Join completed, so reset any backoff parameters */
+		mcast->backoff = 1;
+		mcast->delay_until = jiffies;
 	}
 out:
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
@@ -317,7 +346,7 @@ out:
 	complete(&mcast->done);
 	if (status == -ENETRESET)
 		status = 0;
-	__ipoib_mcast_continue_join_thread(priv, 0);
+	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
@@ -369,7 +398,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 		complete(&mcast->done);
 		ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
 			   "failed (ret = %d)\n", ret);
-		__ipoib_mcast_continue_join_thread(priv, 0);
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	} else {
 		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
 				"sendonly join\n", mcast->mcmember.mgid.raw);
@@ -441,7 +470,8 @@ static int ipoib_mcast_join_complete(int status,
 
 	if (!status) {
 		mcast->backoff = 1;
-		__ipoib_mcast_continue_join_thread(priv, 0);
+		mcast->delay_until = jiffies;
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 
 		/*
 		 * Defer carrier on work to priv->wq to avoid a
@@ -460,16 +490,8 @@ static int ipoib_mcast_join_complete(int status,
 			}
 		}
 
-		mcast->backoff *= 2;
-		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-		/*
-		 * XXX - This is wrong.  *Our* join failed, but because the
-		 * join thread does the joins in a serial fashion, if there
-		 * are any joins behind ours waiting to complete, they should
-		 * not be subjected to our backoff delay.
-		 */
-		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
+		/* Requeue this join task with a backoff delay */
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
 	spin_lock_irq(&priv->lock);
@@ -541,17 +563,8 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		complete(&mcast->done);
 		ret = PTR_ERR(mcast->mc);
 		ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
-
-		mcast->backoff *= 2;
-		if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
-			mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-		/*
-		 * XXX - This is wrong.  *Our* join failed, but because the
-		 * join thread does the joins in a serial fashion, if there
-		 * are any joins behind ours waiting to complete, they should
-		 * not be subjected to our backoff delay.
-		 */
-		__ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
+		/* Requeue this join task with a backoff delay */
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 	mutex_unlock(&mcast_mutex);
 }
@@ -589,7 +602,13 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
-			__ipoib_mcast_continue_join_thread(priv, HZ);
+			/*
+			 * Restart us after a 1 second delay to retry
+			 * creating our broadcast group and attaching to
+			 * it.  Until this succeeds, this ipoib dev is
+			 * completely stalled (multicast wise).
+			 */
+			__ipoib_mcast_continue_join_thread(priv, NULL, 1);
 			mutex_unlock(&mcast_mutex);
 			return;
 		}
@@ -623,7 +642,10 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		list_for_each_entry(mcast, &priv->multicast_list, list) {
 			if (IS_ERR_OR_NULL(mcast->mc) &&
 			    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED,
+				      &mcast->flags) &&
+			    (mcast->backoff == 1 ||
+			     time_after_eq(jiffies, mcast->delay_until))) {
 				/* Found the next unjoined group */
 				break;
 			}
@@ -632,7 +654,11 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		mutex_unlock(&mcast_mutex);
 
 		if (&mcast->list == &priv->multicast_list) {
-			/* All done */
+			/* All done, unless we have delayed work from
+			 * backoff retransmissions, but we will get
+			 * restarted when the time is right, so we are
+			 * done for now
+			 */
 			break;
 		}
 
@@ -737,7 +763,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		__ipoib_mcast_continue_join_thread(priv, 0);
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	}
 
 	if (!mcast->ah) {
@@ -962,7 +988,7 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	/*
 	 * Restart our join task thread if needed
 	 */
-	__ipoib_mcast_continue_join_thread(priv, 0);
+	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	rtnl_unlock();
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH FIX For-3.19 v4 4/7] IB/ipoib: Handle -ENETRESET properly in our callback
       [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 3/7] IB/ipoib: make delayed tasks not hold up everything Doug Ledford
@ 2015-01-20  3:58   ` Doug Ledford
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 5/7] IB/ipoib: don't restart our thread on ENETRESET Doug Ledford
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-20  3:58 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

When the core layer calls our callback with ENETRESET as the error, we
clear the status to 0 before returning indicating that we are going to
handle the error ourselves.  This causes the core layer to not free the
mcast->mc structure, but we are releasing our reference to it by
clearing mcast->mc.  So, preserve our reference to the multicast
structure so when we next run ipoib_mcast_dev_flush, it will be able
to properly release the mcast->mc entry at the right time in the right
way.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 957e7d2e80c..50d2de2270a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -315,8 +315,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 	mutex_lock(&mcast_mutex);
 
 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
 		goto out;
+	}
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -344,8 +346,6 @@ out:
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
 	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
@@ -462,8 +462,10 @@ static int ipoib_mcast_join_complete(int status,
 	mutex_lock(&mcast_mutex);
 
 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
 		goto out;
+	}
 
 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -499,8 +501,6 @@ out:
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH FIX For-3.19 v4 5/7] IB/ipoib: don't restart our thread on ENETRESET
       [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 4/7] IB/ipoib: Handle -ENETRESET properly in our callback Doug Ledford
@ 2015-01-20  3:58   ` Doug Ledford
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 6/7] IB/ipoib: remove unneeded locks Doug Ledford
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-20  3:58 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

Make the send only join completion handler behave like the normal join
completion handler in that we should restart the multicast join thread
whenever our join completes with either an error or success, but don't
restart the thread if we got a net reset event, let ipoib_event queue
up a device flush, which will then call ipoib_mcast_dev_flush to remove
all of the current mcast entries, and ipoib_mcase_restart_task to check
out current mcast entry list against the netdev list of mcast entries
we are supposed to have and requeue the ones we need to get back.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 50d2de2270a..a03b6a04203 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -336,17 +336,18 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	} else {
 		/* Join completed, so reset any backoff parameters */
 		mcast->backoff = 1;
 		mcast->delay_until = jiffies;
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	}
 out:
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH FIX For-3.19 v4 6/7] IB/ipoib: remove unneeded locks
       [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 5/7] IB/ipoib: don't restart our thread on ENETRESET Doug Ledford
@ 2015-01-20  3:58   ` Doug Ledford
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 7/7] IB/ipoib: fix race between mcast_dev_flush and mcast_join Doug Ledford
  2015-01-20 16:16   ` [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling Erez Shitrit
  7 siblings, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-20  3:58 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

During the various work I've done on this, some extra locking has crept
in when things weren't working right.  This is one of those spots.
Remove the unneeded spinlocks.  The mutex is enough to protect against
what we need to protect against.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a03b6a04203..424c602d5a0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -497,12 +497,10 @@ static int ipoib_mcast_join_complete(int status,
 		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
-	spin_lock_irq(&priv->lock);
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
 	return status;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH FIX For-3.19 v4 7/7] IB/ipoib: fix race between mcast_dev_flush and mcast_join
       [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (5 preceding siblings ...)
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 6/7] IB/ipoib: remove unneeded locks Doug Ledford
@ 2015-01-20  3:58   ` Doug Ledford
  2015-01-20 16:16   ` [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling Erez Shitrit
  7 siblings, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-20  3:58 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit, Doug Ledford

There is a classic race between mcast_dev_flush and mcast_join_thread
in that the join thread is going to loop through all of the mcast
entries in the device mcast list, while dev_flush wants to remove
those entries from the list so that they can eventually be freed when
they are no longer busy.

As always, the list traversal and the list removal must be done under
the same lock to avoid races, so modify the join_task thread code to
hold onto the spinlock for the entirety of its run up until it has
saved the mcast entry that it's going to work on into a private pointer
and set the flags on that entry to mark it as busy.  Only then is it
safe to drop the lock and let mcast_dev_flush remove entries from the
list.

I also reworked the flow of the join_task thread to be more clear that
it is making a single pass at work to do and then exiting as soon as it
gets that single task queued off.

I then reworked both the regular and send only join routines to know
that they will be called with the BUSY flag already set on the mcast
entry they are passed, so any error condition means they need to
clear the flag and complete the task and they should not check the
busy status as it will now always be true.

Signed-off-by: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 112 ++++++++++++-------------
 1 file changed, 56 insertions(+), 56 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 424c602d5a0..972fa15a225 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -368,22 +368,16 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
 		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
 				"multicast joins\n");
+		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+		complete(&mcast->done);
 		return -ENODEV;
 	}
 
-	if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
-		ipoib_dbg_mcast(priv, "multicast entry busy, skipping "
-				"sendonly join\n");
-		return -EBUSY;
-	}
-
 	rec.mgid     = mcast->mcmember.mgid;
 	rec.port_gid = priv->local_gid;
 	rec.pkey     = cpu_to_be16(priv->pkey);
 
 	mutex_lock(&mcast_mutex);
-	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
 					 priv->port, &rec,
 					 IB_SA_MCMEMBER_REC_MGID	|
@@ -552,8 +546,6 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 	}
 
 	mutex_lock(&mcast_mutex);
-	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
 					 &rec, comp_mask, GFP_KERNEL,
 					 ipoib_mcast_join_complete, mcast);
@@ -574,6 +566,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		container_of(work, struct ipoib_dev_priv, mcast_task.work);
 	struct net_device *dev = priv->dev;
 	struct ib_port_attr port_attr;
+	struct ipoib_mcast *mcast = NULL;
+	int create = 1;
 
 	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		return;
@@ -591,16 +585,25 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	else
 		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
 
+	/*
+	 * We have to hold the mutex to keep from racing with the join
+	 * completion threads on setting flags on mcasts, and we have
+	 * to hold the priv->lock because dev_flush will remove entries
+	 * out from underneath us, so at a minimum we need the lock
+	 * through the time that we do the for_each loop of the mcast
+	 * list or else dev_flush can make us oops.
+	 */
+	mutex_lock(&mcast_mutex);
+	spin_lock_irq(&priv->lock);
+	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		goto out;
+
 	if (!priv->broadcast) {
 		struct ipoib_mcast *broadcast;
 
-		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			return;
-
-		broadcast = ipoib_mcast_alloc(dev, 1);
+		broadcast = ipoib_mcast_alloc(dev, 0);
 		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
-			mutex_lock(&mcast_mutex);
 			/*
 			 * Restart us after a 1 second delay to retry
 			 * creating our broadcast group and attaching to
@@ -608,67 +611,64 @@ void ipoib_mcast_join_task(struct work_struct *work)
 			 * completely stalled (multicast wise).
 			 */
 			__ipoib_mcast_continue_join_thread(priv, NULL, 1);
-			mutex_unlock(&mcast_mutex);
-			return;
+			goto out;
 		}
 
-		spin_lock_irq(&priv->lock);
 		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
 		priv->broadcast = broadcast;
 
 		__ipoib_mcast_add(dev, priv->broadcast);
-		spin_unlock_irq(&priv->lock);
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
 		if (IS_ERR_OR_NULL(priv->broadcast->mc) &&
-		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags))
-			ipoib_mcast_join(dev, priv->broadcast, 0);
-		return;
-	}
-
-	while (1) {
-		struct ipoib_mcast *mcast = NULL;
-
-		/*
-		 * Need the mutex so our flags are consistent, need the
-		 * priv->lock so we don't race with list removals in either
-		 * mcast_dev_flush or mcast_restart_task
-		 */
-		mutex_lock(&mcast_mutex);
-		spin_lock_irq(&priv->lock);
-		list_for_each_entry(mcast, &priv->multicast_list, list) {
-			if (IS_ERR_OR_NULL(mcast->mc) &&
-			    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-			    !test_bit(IPOIB_MCAST_FLAG_ATTACHED,
-				      &mcast->flags) &&
-			    (mcast->backoff == 1 ||
-			     time_after_eq(jiffies, mcast->delay_until))) {
-				/* Found the next unjoined group */
-				break;
-			}
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &priv->broadcast->flags)) {
+			mcast = priv->broadcast;
+			create = 0;
 		}
-		spin_unlock_irq(&priv->lock);
-		mutex_unlock(&mcast_mutex);
+		goto out;
+	}
 
-		if (&mcast->list == &priv->multicast_list) {
-			/* All done, unless we have delayed work from
-			 * backoff retransmissions, but we will get
-			 * restarted when the time is right, so we are
-			 * done for now
-			 */
+	/* We'll never get here until the broadcast group is both allocated
+	 * and attached
+	 */
+	list_for_each_entry(mcast, &priv->multicast_list, list) {
+		if (IS_ERR_OR_NULL(mcast->mc) &&
+		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
+		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags) &&
+		    (mcast->backoff == 1 ||
+		     time_after_eq(jiffies, mcast->delay_until))) {
+			/* Found the next unjoined group */
 			break;
 		}
+	}
 
+	if (&mcast->list == &priv->multicast_list) {
+		/* All done, unless we have delayed work from
+		 * backoff retransmissions, but we will get
+		 * restarted when the time is right, so we are
+		 * done for now
+		 */
+		mcast = NULL;
+		ipoib_dbg_mcast(priv, "successfully joined all "
+				"multicast groups\n");
+	}
+
+out:
+	if (mcast) {
+		init_completion(&mcast->done);
+		set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+	}
+	spin_unlock_irq(&priv->lock);
+	mutex_unlock(&mcast_mutex);
+	if (mcast) {
 		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
 			ipoib_mcast_sendonly_join(mcast);
 		else
-			ipoib_mcast_join(dev, mcast, 1);
-		return;
+			ipoib_mcast_join(dev, mcast, create);
 	}
-
-	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
+	return;
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 7/7] IB/ipoib: fix race between mcast_dev_flush and mcast_join Doug Ledford
@ 2015-01-20 16:16   ` Erez Shitrit
       [not found]     ` <54BE7F66.4070404-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  7 siblings, 1 reply; 18+ messages in thread
From: Erez Shitrit @ 2015-01-20 16:16 UTC (permalink / raw)
  To: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

On 1/20/2015 5:58 AM, Doug Ledford wrote:
> These patches are to resolve issues created by my previous patch set.
> While that set worked fine in my testing, there were problems with
> multicast joins after the initial set of joins had completed.  Since my
> testing relied upon the normal set of multicast joins that happen
> when the interface is first brought up, I missed those problems.
>
> Symptoms vary from failure to send packets due to a failed join, to
> loss of connectivity after a subnet manager restart, to failure
> to properly release multicast groups on shutdown resulting in hangs
> when the mlx4 driver attempts to unload itself via its reboot
> notifier handler.
>
> This set of patches has passed a number of tests above and beyond my
> original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
> multicast tests.  I also added both subnet manager restarts and
> manual shutdown/restart of individual ports at the switch in order to
> ensure that the ENETRESET path was properly tested.  I included
> testing, then a subnet manager restart, then a quiescent period for
> caches to expire, then restarting testing to make sure that arp and
> neighbor discovery work after the subnet manager restart.
>
> All in all, I have not been able to trip the multicast joins up any
> longer.
>
> Additionally, the original impetus for my first 8 patch set was that
> it was simply too easy to break the IPoIB subsystem with this simple
> loop:
>
> while true; do
>      ifconfig ib0 up
>      ifconfig ib0 down
> done
>
> Just to be safe, I made sure this problem did not resurface.
>
> Roland, the 3.19-rc code is broken.  We either need to revert my
> original patchset, or grab these, but I would not recommend leaving
> it as it currently stands.
>
> Doug Ledford (7):
>    IB/ipoib: Fix failed multicast joins/sends
>    IB/ipoib: Add a helper to restart the multicast task
>    IB/ipoib: make delayed tasks not hold up everything
>    IB/ipoib: Handle -ENETRESET properly in our callback
>    IB/ipoib: don't restart our thread on ENETRESET
>    IB/ipoib: remove unneeded locks
>    IB/ipoib: fix race between mcast_dev_flush and mcast_join
>
>   drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 204 +++++++++++++++----------
>   2 files changed, 121 insertions(+), 84 deletions(-)
>
Hi Doug,

After trying your V4 patch series, I can tell that first, the endless 
scheduling of
the mcast task is indeed over, but still, the multicast functionality in 
ipoib is unstable.

I see that there are times that ping6 works good, and sometimes it 
doesn't, to
make it clear I always use the link-local address assigned by the stack 
to the
IPoIB device, see [1] below for how I run it.

I also see that send-only mcast stops working from time to time, see [2] 
below
for how I run this. I can narrow the problem to be on the sender 
(client) side,
since I work with a peer node which has well functioning IPoIB multicast 
code.

One more phenomena, that in some cases I can see that the driver (after the
mcast_debug_level is set) prints endless message:
"ib0: no address vector, but multicast join already started"

One practical solution here would be to revert the offending commit 3.19-rc1
016d9fb "IPoIB: fix MCAST_FLAG_BUSY usage".

Thanks, Erez

1] IPv6 ping

$ ping6 fe80::202:c903:9f:3b0a -I ib0
where the IPv6 address is the one displayed by "ip addr show dev ib0" on 
the remote node

[2] IPv4 multicast

# server
$ route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
$ netserver

# client
$ route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
$ netperf -H 11.134.33.1 -t omni -- -H 225.5.5.4 -T udp -R 1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

[parent not found: <54BE7F66.4070404-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>]

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found]     ` <54BE7F66.4070404-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-01-20 17:37       ` Doug Ledford
  2015-01-21 17:19       ` Roland Dreier
  1 sibling, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-20 17:37 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, roland-DgEjT+Ai2ygdnm+yROfE0A,
	Amir Vadai, Eyal Perry, Or Gerlitz, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 8644 bytes --]

On Tue, 2015-01-20 at 18:16 +0200, Erez Shitrit wrote:
> On 1/20/2015 5:58 AM, Doug Ledford wrote:
> > These patches are to resolve issues created by my previous patch set.
> > While that set worked fine in my testing, there were problems with
> > multicast joins after the initial set of joins had completed.  Since my
> > testing relied upon the normal set of multicast joins that happen
> > when the interface is first brought up, I missed those problems.
> >
> > Symptoms vary from failure to send packets due to a failed join, to
> > loss of connectivity after a subnet manager restart, to failure
> > to properly release multicast groups on shutdown resulting in hangs
> > when the mlx4 driver attempts to unload itself via its reboot
> > notifier handler.
> >
> > This set of patches has passed a number of tests above and beyond my
> > original tests.  As suggested by Or Gerlitz I added IPv6 and IPv4
> > multicast tests.  I also added both subnet manager restarts and
> > manual shutdown/restart of individual ports at the switch in order to
> > ensure that the ENETRESET path was properly tested.  I included
> > testing, then a subnet manager restart, then a quiescent period for
> > caches to expire, then restarting testing to make sure that arp and
> > neighbor discovery work after the subnet manager restart.
> >
> > All in all, I have not been able to trip the multicast joins up any
> > longer.
> >
> > Additionally, the original impetus for my first 8 patch set was that
> > it was simply too easy to break the IPoIB subsystem with this simple
> > loop:
> >
> > while true; do
> >      ifconfig ib0 up
> >      ifconfig ib0 down
> > done
> >
> > Just to be safe, I made sure this problem did not resurface.
> >
> > Roland, the 3.19-rc code is broken.  We either need to revert my
> > original patchset, or grab these, but I would not recommend leaving
> > it as it currently stands.
> >
> > Doug Ledford (7):
> >    IB/ipoib: Fix failed multicast joins/sends
> >    IB/ipoib: Add a helper to restart the multicast task
> >    IB/ipoib: make delayed tasks not hold up everything
> >    IB/ipoib: Handle -ENETRESET properly in our callback
> >    IB/ipoib: don't restart our thread on ENETRESET
> >    IB/ipoib: remove unneeded locks
> >    IB/ipoib: fix race between mcast_dev_flush and mcast_join
> >
> >   drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
> >   drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 204 +++++++++++++++----------
> >   2 files changed, 121 insertions(+), 84 deletions(-)
> >
> Hi Doug,
> 
> After trying your V4 patch series, I can tell that first, the endless 
> scheduling of
> the mcast task is indeed over,

Good.

>  but still, the multicast functionality in 
> ipoib is unstable.

I'm not seeing that here.  Let's try to figure out what's different.

> I see that there are times that ping6 works good, and sometimes it 
> doesn't, to
> make it clear I always use the link-local address assigned by the stack 
> to the
> IPoIB device, see [1] below for how I run it.

As do I.  I'll attach the scripts I used to run it for your reference.

> I also see that send-only mcast stops working from time to time, see [2] 
> below
> for how I run this. I can narrow the problem to be on the sender 
> (client) side,
> since I work with a peer node which has well functioning IPoIB multicast 
> code.

I don't think the peer side really denotes a conclusive argument ;-)

> One more phenomena, that in some cases I can see that the driver (after the
> mcast_debug_level is set) prints endless message:
> "ib0: no address vector, but multicast join already started"

OK, this is to be expected from your tests I think.  In particular, this
is generated by mcast_send() if it's called by your program while the
send only join has not yet completed.  The flow goes like this:

First packet after interface comes up:
mcast_send() -> ipoib_mcast_alloc() -> ipoib_mcast_add() -> schedule join task thread

				In a different thread:
				mcast_join_task()
				  find unjoined mcast group
				  mark mcast->flags with IPOIB_MCAST_FLAG BUSY
				  -> mcast_join()
				     send join request over the wire

Back on original thread context:
mcast_send()
  this time we find a matching mcast entry but mcast->ah is NULL
  queue packet, unless backlog is full and then drop packet
  if mcast->flags & IPOIB_MCAST_FLAG_BUSY, emit notice that you see

				In a different thread:
				mcast_sendonly_join_complete() ->
					mcast_join_finish()
					  set mcast->ah
					  send skb backlog queue
				  clear IPOIB_MCAST_FLAG_BUSY

Back on original thread context:
mcast_send()
  now we find the mcast entry, and we find the mcast->ah entry, so
  sends now proceed as expected with no messages, and any lost packets
  while waiting on mcast->ah to be valid are simply gone

This looks entirely normal to me if your application is busy blasting
packets while the join is happening.  Actually, I think the message is
worthless to be honest.  I would be more interested in a message about
dropping packets than simply a message that denotes we are sending
packets while the join is still in process.

Unless we are sending so many packets out that we are starving the
join's ability to finish.  That would be interesting data to know.  Does
the join never finish in this case?  Also, I think you indicated that
you are running back to back and without a switch?  These joins have to
go to the subnet manager and back.  What is your subnet management like?

> 
> One practical solution here would be to revert the offending commit 3.19-rc1
> 016d9fb "IPoIB: fix MCAST_FLAG_BUSY usage".

It is not practical to revert that patch by itself.  That patch changes
semantics of the mcast->flag usage in such a way that all of my
subsequent patches are broken without it.  They go as a group or not at
all.

> Thanks, Erez
> 
> 1] IPv6 ping
> 
> $ ping6 fe80::202:c903:9f:3b0a -I ib0
> where the IPv6 address is the one displayed by "ip addr show dev ib0" on 
> the remote node

Mine is similar.  I use these two files:
[root@rdma-master testing]$ cat ip6-addresses.txt 
rdma-master	fe80::f652:1403:7b:cba1	mlx4_ib0
rdma-perf-00	fe80::202:c903:31:7791	mlx4_ib0
rdma-perf-01	fe80::f652:1403:7b:e1b1	mlx4_ib0
rdma-perf-02	fe80::211:7500:77:d3cc	qib_ib0
rdma-perf-03	fe80::211:7500:77:d81a	qib_ib0
rdma-storage-01	fe80::f652:1403:7b:e131	mlx4_ib0
rdma-vr6-master	fe80::601:1403:7b:cba1	mlx4_ib0
[root@rdma-master testing]$ cat ping_loop 
#!/bin/bash

trap_handler()
{
	exit 0
}

trap 'trap_handler' 2 15

ADDR_FILE=ip6-addresses.txt
ME=`hostname -s`
LOCAL=`awk '/'"$ME"'/ { print $3 }' $ADDR_FILE`
while true; do
	cat $ADDR_FILE | \
	while read host addr dev; do
		[ ${host} = `hostname -s` ] && continue
		ping6 ${addr}%$LOCAL -c 3
	done
done
[root@rdma-master testing]$ 


> 
> [2] IPv4 multicast
> 
> # server
> $ route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
> $ netserver
> 
> # client
> $ route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
> $ netperf -H 11.134.33.1 -t omni -- -H 225.5.5.4 -T udp -R 1

I've been using iperf with a slightly different setup:

Each machine is a server:
ip route add 224.0.0.0/4 dev <ib0 device>
iperf -usB 224.3.2.<server #> -i 1 > <host>-iperf-server.out &

Each machine rotates as a client:
iperf_loop &

[root@rdma-master testing]$ cat iperf-addresses.txt 
rdma-master	224.3.2.1
rdma-perf-00	224.3.2.2
rdma-perf-01	224.3.2.3
rdma-perf-02	224.3.2.4
rdma-perf-03	224.3.2.5
rdma-storage-01	224.3.2.6
[root@rdma-master testing]$ cat iperf_loop 
#!/bin/bash

ADDR_FILE=iperf-addresses.txt
ME=`hostname -s`
LOG=${ME}-iperf-client.out
> $LOG
while true; do
	cat $ADDR_FILE | \
	while read host addr ; do
		[ ${host} = $ME ] && continue
		iperf -uc ${addr} -i 1 >> $LOG
	done
done
[root@rdma-master testing]$ 

One of the differences between iperf and netperf is the speed with which
it is blasting the multicast packets out.  iperf sends them at a fairly
sane rate while netperf is balls to the wall.  So I don't see the kernel
messages you posted as a problem, they are simply telling you that
netperf is blasting away at the group while it is coming online.  Unless
they happen infinitely on a single sendonly group, which would indicate
that our join never completed.  If that's the case, we have to find out
why the join never completed.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found]     ` <54BE7F66.4070404-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2015-01-20 17:37       ` Doug Ledford
@ 2015-01-21 17:19       ` Roland Dreier
       [not found]         ` <CAL1RGDVPCH3h7nZMd9=4rFPqbKEGZiXQC+Atvfwj5kWcf2S3Qg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 18+ messages in thread
From: Roland Dreier @ 2015-01-21 17:19 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Or Gerlitz, Erez Shitrit

On Tue, Jan 20, 2015 at 8:16 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>
> After trying your V4 patch series, I can tell that first, the endless scheduling of
> the mcast task is indeed over, but still, the multicast functionality in ipoib is unstable.

Is this worse than 3.18?  (Have you tested that?)

Because Doug's changes fixed some bad, easy-to-reproduce issues.  On
the other hand we don't want to introduce new regressions to fix the
old issues.

I think we only have a few days to decide whether to revert back to
3.18 code, or push forward with these fixes.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

[parent not found: <CAL1RGDVPCH3h7nZMd9=4rFPqbKEGZiXQC+Atvfwj5kWcf2S3Qg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found]         ` <CAL1RGDVPCH3h7nZMd9=4rFPqbKEGZiXQC+Atvfwj5kWcf2S3Qg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-21 17:34           ` Doug Ledford
  2015-01-21 20:34           ` Or Gerlitz
  1 sibling, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-21 17:34 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Erez Shitrit, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Or Gerlitz, Erez Shitrit

It's worse than 3.18, but only if your test is rmmod ipoib based. For normal running conditions it's better. However, I'm going to try and fix the rmmod issue today if at all possible.

Sent from my iPhone

> On Jan 21, 2015, at 12:20 PM, Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> 
>> On Tue, Jan 20, 2015 at 8:16 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>> 
>> After trying your V4 patch series, I can tell that first, the endless scheduling of
>> the mcast task is indeed over, but still, the multicast functionality in ipoib is unstable.
> 
> Is this worse than 3.18?  (Have you tested that?)
> 
> Because Doug's changes fixed some bad, easy-to-reproduce issues.  On
> the other hand we don't want to introduce new regressions to fix the
> old issues.
> 
> I think we only have a few days to decide whether to revert back to
> 3.18 code, or push forward with these fixes.
> 
> - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found]         ` <CAL1RGDVPCH3h7nZMd9=4rFPqbKEGZiXQC+Atvfwj5kWcf2S3Qg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-21 17:34           ` Doug Ledford
@ 2015-01-21 20:34           ` Or Gerlitz
       [not found]             ` <CAJ3xEMi5vGc-Urv_7RSWCn9-ZAiq0WzVn3y0__jrTc48jM9gpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2015-01-21 20:34 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Erez Shitrit, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Amir Vadai, Eyal Perry, Erez Shitrit

On Wed, Jan 21, 2015 at 7:19 PM, Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Tue, Jan 20, 2015 at 8:16 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

>> After trying your V4 patch series, I can tell that first, the endless scheduling of
>> the mcast task is indeed over, but still, the multicast functionality in ipoib is unstable.

> Is this worse than 3.18?  (Have you tested that?)

Roland, Doug,

To be fully clear here by "this" we're talking on seven patches of
complexity and volume which I think go way beyond post -rc5 timeline:

  IB/ipoib: Fix failed multicast joins/sends
  IB/ipoib: Add a helper to restart the multicast task
  IB/ipoib: make delayed tasks not hold up everything
  IB/ipoib: Handle -ENETRESET properly in our callback
  IB/ipoib: don't restart our thread on ENETRESET
  IB/ipoib: remove unneeded locks
  IB/ipoib: fix race between mcast_dev_flush and mcast_join

 drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 204 +++++++++++++++----------
 2 files changed, 121 insertions(+), 84 deletions(-)

Doug, I understand your claim and frustration that with 3.18 and such
(older kernels) your ifdown/up loop manages to break the driver, but
fixing the driver such that this test works and in the same time
practically breaking IPv6 and IPv5 multicast introduces a deep
regression vs. 3.18 - which as you wrote here, would be wrong to fix
with such a further big change.

Are you really sure that reverting the offending patch 016d9fb25cd9
"IPoIB: fix MCAST_FLAG_BUSY usage" and maybe some more dependent
related hunks from downstream patches of that series isn't possible?

If this is the case, I would suggest that we either revive the review
on the fix we sent [1] or drop the whole 3.19-rc1 changes. I vote for
the former.

[1] http://marc.info/?l=linux-rdma&m=142064313123254&w=2

> Because Doug's changes fixed some bad, easy-to-reproduce issues.  On
> the other hand we don't want to introduce new regressions to fix the
> old issues.

See above, we did introduced regressions.

> I think we only have a few days to decide whether to revert back to
> 3.18 code, or push forward with these fixes.

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

[parent not found: <CAJ3xEMi5vGc-Urv_7RSWCn9-ZAiq0WzVn3y0__jrTc48jM9gpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found]             ` <CAJ3xEMi5vGc-Urv_7RSWCn9-ZAiq0WzVn3y0__jrTc48jM9gpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-21 20:37               ` Roland Dreier
       [not found]                 ` <CAG4TOxO3wTguFbffxPFANrVTQo7=gVDzEAWwoDhh_acNTixXpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-21 21:48               ` Doug Ledford
  1 sibling, 1 reply; 18+ messages in thread
From: Roland Dreier @ 2015-01-21 20:37 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Erez Shitrit, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Amir Vadai, Eyal Perry, Erez Shitrit

On Wed, Jan 21, 2015 at 12:34 PM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> Because Doug's changes fixed some bad, easy-to-reproduce issues.  On
>> the other hand we don't want to introduce new regressions to fix the
>> old issues.
>
> See above, we did introduced regressions.

Yes, I know, that's my whole point.

We need to fix the current 3.19-rc code, and the two choices are to
keep the fixes we added during 3.19 or revert back to 3.18.

Doug's opinion is that your proposed fix is broken, and we don't have
an alternate fix.

So I suggest we revert the whole series from 3.19 and get this right for 3.20.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

[parent not found: <CAG4TOxO3wTguFbffxPFANrVTQo7=gVDzEAWwoDhh_acNTixXpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found]                 ` <CAG4TOxO3wTguFbffxPFANrVTQo7=gVDzEAWwoDhh_acNTixXpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-21 20:49                   ` Or Gerlitz
  2015-01-21 21:48                   ` Doug Ledford
  2015-01-22 14:21                   ` Doug Ledford
  2 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2015-01-21 20:49 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Erez Shitrit, Doug Ledford, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Amir Vadai, Eyal Perry, Erez Shitrit

On Wed, Jan 21, 2015 at 10:37 PM, Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> On Wed, Jan 21, 2015 at 12:34 PM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> Because Doug's changes fixed some bad, easy-to-reproduce issues.  On
>>> the other hand we don't want to introduce new regressions to fix the
>>> old issues.
>>
>> See above, we did introduced regressions.
>
> Yes, I know, that's my whole point.
>
> We need to fix the current 3.19-rc code, and the two choices are to
> keep the fixes we added during 3.19 or revert back to 3.18.

> Doug's opinion is that your proposed fix is broken, and we don't have
> an alternate fix.

Before we go back to 3.18, there's the option I was asking Doug to
comment on, namely is it really impossible to revert the offending
patch? if this is the case, I'd like to get Erez few more days to
discuss his proposed patch + changes to it from code review.

> So I suggest we revert the whole series from 3.19 and get this right for 3.20.

Roland, I'm very happy to see that you're finally up to non -rc1 activity.

This didn't happen for 3.18 and maybe also for 3.17 and you know what,
it's terribly impossible to run a kernel sub-system like that.

 Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found]                 ` <CAG4TOxO3wTguFbffxPFANrVTQo7=gVDzEAWwoDhh_acNTixXpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-21 20:49                   ` Or Gerlitz
@ 2015-01-21 21:48                   ` Doug Ledford
  2015-01-22 14:21                   ` Doug Ledford
  2 siblings, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-21 21:48 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Or Gerlitz, Erez Shitrit, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Amir Vadai, Eyal Perry, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 941 bytes --]

On Wed, 2015-01-21 at 12:37 -0800, Roland Dreier wrote:
> On Wed, Jan 21, 2015 at 12:34 PM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> Because Doug's changes fixed some bad, easy-to-reproduce issues.  On
> >> the other hand we don't want to introduce new regressions to fix the
> >> old issues.
> >
> > See above, we did introduced regressions.
> 
> Yes, I know, that's my whole point.
> 
> We need to fix the current 3.19-rc code, and the two choices are to
> keep the fixes we added during 3.19 or revert back to 3.18.
> 
> Doug's opinion is that your proposed fix is broken, and we don't have
> an alternate fix.
> 
> So I suggest we revert the whole series from 3.19 and get this right for 3.20.
> 
>  - R.

Give me 24 hours before making a final decision on that please.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found]                 ` <CAG4TOxO3wTguFbffxPFANrVTQo7=gVDzEAWwoDhh_acNTixXpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-21 20:49                   ` Or Gerlitz
  2015-01-21 21:48                   ` Doug Ledford
@ 2015-01-22 14:21                   ` Doug Ledford
  2 siblings, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-22 14:21 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Or Gerlitz, Erez Shitrit, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Amir Vadai, Eyal Perry, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 1797 bytes --]

On Wed, 2015-01-21 at 12:37 -0800, Roland Dreier wrote:
> On Wed, Jan 21, 2015 at 12:34 PM, Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> >> Because Doug's changes fixed some bad, easy-to-reproduce issues.  On
> >> the other hand we don't want to introduce new regressions to fix the
> >> old issues.
> >
> > See above, we did introduced regressions.
> 
> Yes, I know, that's my whole point.
> 
> We need to fix the current 3.19-rc code, and the two choices are to
> keep the fixes we added during 3.19 or revert back to 3.18.
> 
> Doug's opinion is that your proposed fix is broken, and we don't have
> an alternate fix.

I will second that opinion.  Over night we ran a series of tests on some
new patches I made, and they resolved the rmmod/insmod failure case in
our testing.  There were two significant fixes.  One of them was related
to the switch to using a separate work queue per device.  The other was
an oversight in ipoib_mcast_restart_task().  Neither of these issues
were addressed by the alternate fix.  So, at best, the alternate fix is
paper machete that covers over two holes but leaves the holes in place.

> So I suggest we revert the whole series from 3.19 and get this right for 3.20.

Before you decide, please take a look at the final fix as I see it.
This was a 7 patch series, now it's 10 patches.  But the final three
patches are small, well understood, and obviously correct.

Regardless of whether you take these 10, I do *not* suggest leaving the
first 8 and using the alternate patch.  I suggest either an all or
nothing approach.  But, like I said, the rmmod issue is now fixed in my
testing.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling
       [not found]             ` <CAJ3xEMi5vGc-Urv_7RSWCn9-ZAiq0WzVn3y0__jrTc48jM9gpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-21 20:37               ` Roland Dreier
@ 2015-01-21 21:48               ` Doug Ledford
  1 sibling, 0 replies; 18+ messages in thread
From: Doug Ledford @ 2015-01-21 21:48 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, Erez Shitrit, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Amir Vadai, Eyal Perry, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 5604 bytes --]

On Wed, 2015-01-21 at 22:34 +0200, Or Gerlitz wrote:
> On Wed, Jan 21, 2015 at 7:19 PM, Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> > On Tue, Jan 20, 2015 at 8:16 AM, Erez Shitrit <erezsh-LDSdmyG8hGV8YrgS2mwiifUpdFzICT1y@public.gmane.orgl> wrote:
> 
> >> After trying your V4 patch series, I can tell that first, the endless scheduling of
> >> the mcast task is indeed over, but still, the multicast functionality in ipoib is unstable.
> 
> > Is this worse than 3.18?  (Have you tested that?)
> 
> Roland, Doug,
> 
> To be fully clear here by "this" we're talking on seven patches of
> complexity and volume which I think go way beyond post -rc5 timeline:
> 
>   IB/ipoib: Fix failed multicast joins/sends
>   IB/ipoib: Add a helper to restart the multicast task
>   IB/ipoib: make delayed tasks not hold up everything
>   IB/ipoib: Handle -ENETRESET properly in our callback
>   IB/ipoib: don't restart our thread on ENETRESET
>   IB/ipoib: remove unneeded locks
>   IB/ipoib: fix race between mcast_dev_flush and mcast_join
> 
>  drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 204 +++++++++++++++----------
>  2 files changed, 121 insertions(+), 84 deletions(-)

I don't usually look at the line count or the patch count, I look at the
changes.  Most of the 7 patches above are very self contained and very
small.  The exception is the final patch and even though it is large,
the net result is that it takes a needlessly complex function and makes
it easier to read, easier to understand, easier to verify that it does
what it's supposed to do, and then fixes a locking issue.  I also look
at testing.  While it's not obvious to the people on this list, I am
testing this rather profusely and so is our internal QE department and a
few others.  I'm really down to only the rmmod race (which I suspect the
7 patches above did not introduce, but the original 8 patches did).

> Doug, I understand your claim and frustration that with 3.18 and such
> (older kernels) your ifdown/up loop manages to break the driver,

Yes.  Quite easily.

>  but
> fixing the driver such that this test works and in the same time
> practically breaking IPv6 and IPv5 multicast introduces a deep
> regression vs. 3.18

You're right.  That's why there are these 7 patches.

>  - which as you wrote here, would be wrong to fix
> with such a further big change.
> 
> Are you really sure that reverting the offending patch 016d9fb25cd9
> "IPoIB: fix MCAST_FLAG_BUSY usage" and maybe some more dependent
> related hunks from downstream patches of that series isn't possible?

Positive.  The 8 patches change the semantics of the MCAST_BUSY_FLAG
usage throughout the code.  If you pick some and leave others, you can
end up with parts using the flag one way and parts another and then you
really blow your locking and usage to hell.

> If this is the case, I would suggest that we either revive the review
> on the fix we sent [1] or drop the whole 3.19-rc1 changes. I vote for
> the former.
> 
> [1] http://marc.info/?l=linux-rdma&m=142064313123254&w=2

So, my original 8 patches were broken.  That's easy enough to see.  My
"sin" in my testing of those patches was that my test methodology was
too myopic.  I focused on the ifup/ifdown loop scenario, and didn't test
other things that might be effected.  Similarly, if people are only
testing this latest set of code one way (for example, using rmmod/insmod
as the means to force the connection up and down and test the
joins/leaves), that can be similarly myopic.  This patch from Erez is
horribly broken for cases other than rmmod/insmod testing.  I'm guessing
you didn't see it because you never tested any other way, like using
ifup/ifdown or pulling a cable or turning a switch port off and back on.
If you had, there would not be a question of using it, at least as it
stands.

Look, smaller is not always better.  A fix that seems to work and is
smaller, but renders the code almost unintelligible because it makes the
code internally inconsistent between various areas is not a preferable
patch to a larger patch that fixes things right and makes the overall
code consistent and understandable.

> > Because Doug's changes fixed some bad, easy-to-reproduce issues.  On
> > the other hand we don't want to introduce new regressions to fix the
> > old issues.
> 
> See above, we did introduced regressions.

Which are already fixed by this 7 patch patchset.  The one remaining
issue isn't actually an ipv6 or ipv4 multicast issue, but specifically
an issue related to locking around rmmod/insmod and that we are allowing
ourselves to get into an inconsistent state during these cycles which
then effects later running (for instance, we might not clean out all of
our multicast joins before rmmod on ipoib succeeds, so on the next
insmod and attempt to join that same multicast group, the ib_sa layer
multicast membership state no longer agrees with us and we have
problems).  I've got to fix that.  But if you run your tests without
rmmod/insmod being how you bring the link up/down, I think you'll find
it now passes those tests with flying colors.  At least it does in my
testing and I would greatly value confirmation of that.

> > I think we only have a few days to decide whether to revert back to
> > 3.18 code, or push forward with these fixes.
> 
> Or.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-01-22 14:21 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-20  3:58 [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling Doug Ledford
     [not found] ` <cover.1421725318.git.dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 1/7] IB/ipoib: Fix failed multicast joins/sends Doug Ledford
2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 2/7] IB/ipoib: Add a helper to restart the multicast task Doug Ledford
2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 3/7] IB/ipoib: make delayed tasks not hold up everything Doug Ledford
2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 4/7] IB/ipoib: Handle -ENETRESET properly in our callback Doug Ledford
2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 5/7] IB/ipoib: don't restart our thread on ENETRESET Doug Ledford
2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 6/7] IB/ipoib: remove unneeded locks Doug Ledford
2015-01-20  3:58   ` [PATCH FIX For-3.19 v4 7/7] IB/ipoib: fix race between mcast_dev_flush and mcast_join Doug Ledford
2015-01-20 16:16   ` [PATCH FIX For-3.19 v4 0/7] IB/ipoib: follow fixes for multicast handling Erez Shitrit
     [not found]     ` <54BE7F66.4070404-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-01-20 17:37       ` Doug Ledford
2015-01-21 17:19       ` Roland Dreier
     [not found]         ` <CAL1RGDVPCH3h7nZMd9=4rFPqbKEGZiXQC+Atvfwj5kWcf2S3Qg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-21 17:34           ` Doug Ledford
2015-01-21 20:34           ` Or Gerlitz
     [not found]             ` <CAJ3xEMi5vGc-Urv_7RSWCn9-ZAiq0WzVn3y0__jrTc48jM9gpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-21 20:37               ` Roland Dreier
     [not found]                 ` <CAG4TOxO3wTguFbffxPFANrVTQo7=gVDzEAWwoDhh_acNTixXpQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-21 20:49                   ` Or Gerlitz
2015-01-21 21:48                   ` Doug Ledford
2015-01-22 14:21                   ` Doug Ledford
2015-01-21 21:48               ` Doug Ledford

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.