Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow

From: Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
To: Or Gerlitz <gerlitz.or-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
	"linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
	Amir Vadai <amirv-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Eyal Perry <eyalpe-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
Date: Tue, 13 Jan 2015 16:01:00 -0500	[thread overview]
Message-ID: <1421182860.43839.214.camel@redhat.com> (raw)
In-Reply-To: <CAJ3xEMgJ_+b1avw6vacagSw0bxNVzLW064rztvJ6w3Y9Th0GRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>


[-- Attachment #1.1: Type: text/plain, Size: 2614 bytes --]

On Tue, 2015-01-13 at 22:13 +0200, Or Gerlitz wrote:
> On Tue, Jan 13, 2015 at 6:45 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Fri, 2015-01-09 at 10:32 +0200, Or Gerlitz wrote:
> >> On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> >> > From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> >
> >> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> >> > both IPv6 traffic and for the most cases all IPv4 multicast traffic
> >> > aren't working.
> >>
> >> Doug, can you ack the breakage introduced by your commit and the fix?
> >
> > I haven't double checked the breakage, I'll take your word for it
> 
> just try ping6 or iperf multicast and see it for yourself, please.

I have.  I have them working now.

> 
> > (at the time I did my work, I had multicast debugging on and I verified the
> > join/leave process, but I had assumed that the process would work the
> > same for optional multicast groups as it does for the IPoIB broadcast
> > group and other default IPoIB groups, so I didn't specifically test
> > additional multicast groups above and beyond the broadcast/etc groups).
> >
> > However, the fix is not workable.  In particular, as soon as this patch
> > is added to the kernel, you will start getting messages like this:
> >
> > mlx4_ib0: ipoib_mcast_leave on an in-flight join
> 
> 
> I don't see it on my systems, is that upstream you're running? what entity does
> ,;x4_ib0: prefixed prints and under what settings, is that the IPoIB driver?

No, that's my internal rhel7 tree, but it's so close to bare upstream
when it comes to IPoIB that there really shouldn't be any difference.
And mlx4_ib0 is just the name that I renamed ib0 to (An internal
standard in my test cluster is that all ib interfaces are named based
upon the hardware they are tied to, so mlx4_ib0, mlx5_ib0, qib_ib0, etc.
Makes my life in verifying code coverage easier)  I can test again with
an upstream kernel later today.  But, for now, suffice it to say the
problem is not resolved with the patch in this thread, but with the much
simpler patch I've attached to this email (note, this is made against
rhel7, I'm only attaching it so you can try it yourself, assuming it
even applies, and once I've tested it on an upstream kernel myself and
verified that it works properly, I'll submit the final upstream version
under a new thread).

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #1.2: ipoib-mcast-test.patch --]
[-- Type: text/x-patch, Size: 3525 bytes --]

commit dc2c71896edc3b549e79a51cf83db813d6012a34
Author: Doug Ledford <dledford@redhat.com>
Date:   Tue Jan 13 13:35:40 2015 -0500

    IB/ipoib: Fix failed multicast joins/sends
    
    The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
    it is used to mean "our device is administratively allowed to send
    multicast joins/leaves/packets" and in other places it means "our
    multicast join task thread is currently running and will process your
    request if you put it on the queue".  However, this latter meaning is in
    fact flawed as there is a race condition between the join task testing
    the mcast list and finding it empty of remaining work, dropping the
    mcast mutex and also the priv->lock spinlock, and clearing the
    IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
    the flag in the former fashion, and when all tasks complete and the task
    thread clears the RUN flag, all of those other locations will fail to
    ever again queue any work.  This results in the interface coming up fine
    initially, but having problems adding new multicast groups after the
    first round of groups have all been added and the RUN flag is cleared by
    the join task thread when it thinks it is done.  To resolve this issue,
    convert all locations in the code to treat the RUN flag as an indicator
    that the multicast portion of this interface is in fact administratively
    up and joins/leaves/sends can be performed.  There is no harm (other
    than a slight performance penalty) to never clearing this flag and using
    it in this fashion as it simply means that a few places that used to
    micro-optimize how often this task was queued on a work queue will now
    queue the task a few extra times.  We can address that suboptimal
    behavior in future patches.
    
    Signed-off-by: Doug Ledford <dledford@redhat.com>

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 8a538c010b9..cba6e160df2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -638,8 +638,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}
 
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
-
-	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -649,8 +647,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	ipoib_dbg_mcast(priv, "starting multicast thread\n");
 
 	mutex_lock(&mcast_mutex);
-	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	set_bit(IPOIB_MCAST_RUN, &priv->flags);
+	queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	mutex_unlock(&mcast_mutex);
 
 	return 0;
@@ -733,7 +731,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
 			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 
@@ -959,7 +957,8 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	/*
 	 * Restart our join task if needed
 	 */
-	ipoib_mcast_start_thread(dev);
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	rtnl_unlock();
 }
 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]