All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
@ 2015-01-07 15:04 Or Gerlitz
       [not found] ` <1420643066-3599-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Or Gerlitz @ 2015-01-07 15:04 UTC (permalink / raw)
  To: Roland Dreier, Doug Ledford
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai, Eyal Perry,
	Erez Shitrit, Or Gerlitz

From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
both IPv6 traffic and for the most cases all IPv4 multicast traffic
aren't working.

After this change there is no mechanism to handle the work that does the
join process for the rest of the mcg's. For example, if in the list of
all the mcg's there is a send-only request, after its processing, the
code in ipoib_mcast_sendonly_join_complete() will not requeue the
mcast task, but leaves the bit that signals this task is running,
and hence the task will never run.

Also, whenever the kernel sends multicast packet (w.o joining to this
group), we don't call ipoib_send_only_join(), the code tries to start
the mcast task but it failed because the bit IPOIB_MCAST_RUN is always
set, As a result the multicast packet will never be sent.

The fix handles all the join requests via the same logic, and call
explicitly to sendonly join whenever there is a packet from sendonly type.

Since ipoib_mcast_sendonly_join() is now called from the driver TX flow,
we can't take mutex there. Locking isn't required there since the multicast
join callback will be called only after the SA agent initialized the relevant
multicast object.

Fixes: 016d9fb25cd9 ('IPoIB: fix MCAST_FLAG_BUSY usage')
Reported-by: Eyal Perry <eyalpe-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
V0 --> V1 changes: Added credits (...) and furnished the change-log abit.

 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   15 ++++++---------
 1 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index bc50dd0..0ea4b08 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -301,9 +301,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
+
+		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	}
 out:
-	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
@@ -342,7 +343,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 	rec.port_gid = priv->local_gid;
 	rec.pkey     = cpu_to_be16(priv->pkey);
 
-	mutex_lock(&mcast_mutex);
 	init_completion(&mcast->done);
 	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
@@ -364,7 +364,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
 		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
 				"sendonly join\n", mcast->mcmember.mgid.raw);
 	}
-	mutex_unlock(&mcast_mutex);
 
 	return ret;
 }
@@ -622,10 +621,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
 			break;
 		}
 
-		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
-			ipoib_mcast_sendonly_join(mcast);
-		else
-			ipoib_mcast_join(dev, mcast, 1);
+		ipoib_mcast_join(dev, mcast, 1);
+
 		return;
 	}
 
@@ -725,8 +722,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 
 	if (!mcast->ah) {
@@ -740,6 +735,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			ipoib_dbg_mcast(priv, "no address vector, "
 					"but multicast join already started\n");
+		else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+			ipoib_mcast_sendonly_join(mcast);
 
 		/*
 		 * If lookup completes between here and out:, don't
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
       [not found] ` <1420643066-3599-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2015-01-09  8:32   ` Or Gerlitz
       [not found]     ` <CAJ3xEMh60SGsK43b49MSozqmL8PDzTevs5CMGCvhPfmL3kHHVQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2015-01-13 18:07   ` Doug Ledford
  1 sibling, 1 reply; 10+ messages in thread
From: Or Gerlitz @ 2015-01-09  8:32 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Erez Shitrit

On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> both IPv6 traffic and for the most cases all IPv4 multicast traffic
> aren't working.

Doug, can you ack the breakage introduced by your commit and the fix?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
       [not found]     ` <CAJ3xEMh60SGsK43b49MSozqmL8PDzTevs5CMGCvhPfmL3kHHVQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-13 16:45       ` Doug Ledford
       [not found]         ` <1421167544.43839.190.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Doug Ledford @ 2015-01-13 16:45 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 1998 bytes --]

On Fri, 2015-01-09 at 10:32 +0200, Or Gerlitz wrote:
> On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> > From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >
> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> > both IPv6 traffic and for the most cases all IPv4 multicast traffic
> > aren't working.
> 
> Doug, can you ack the breakage introduced by your commit and the fix?

I haven't double checked the breakage, I'll take your word for it (at
the time I did my work, I had multicast debugging on and I verified the
join/leave process, but I had assumed that the process would work the
same for optional multicast groups as it does for the IPoIB broadcast
group and other default IPoIB groups, so I didn't specifically test
additional multicast groups above and beyond the broadcast/etc groups).

However, the fix is not workable.  In particular, as soon as this patch
is added to the kernel, you will start getting messages like this:

mlx4_ib0: ipoib_mcast_leave on an in-flight join

Every time you get this message, you've run into a "shouldn't ever
happen" situation.  If this happens, then we've lost track of the mcast
flags settings or we've genuinely tried to remove a mcast group where
the lower layer is still working on our join.  Either way, it means
we've screwed up.  Further, with this patch in place, I'm seeing random
acts of badness now with non-default IPoIB pkey joins.  Sometimes they
work, sometimes they don't.

So, no, this patch doesn't work.  I'll do some more investigating and
report back.

> Or.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
       [not found] ` <1420643066-3599-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2015-01-09  8:32   ` Or Gerlitz
@ 2015-01-13 18:07   ` Doug Ledford
       [not found]     ` <1421172470.43839.207.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Doug Ledford @ 2015-01-13 18:07 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 7057 bytes --]

On Wed, 2015-01-07 at 17:04 +0200, Or Gerlitz wrote:
> From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> 
> Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> both IPv6 traffic and for the most cases all IPv4 multicast traffic
> aren't working.
> 
> After this change there is no mechanism to handle the work that does the
> join process for the rest of the mcg's. For example, if in the list of
> all the mcg's there is a send-only request, after its processing, the
> code in ipoib_mcast_sendonly_join_complete() will not requeue the
> mcast task, but leaves the bit that signals this task is running,
> and hence the task will never run.
> 
> Also, whenever the kernel sends multicast packet (w.o joining to this
> group), we don't call ipoib_send_only_join(), the code tries to start
> the mcast task but it failed because the bit IPOIB_MCAST_RUN is always
> set, As a result the multicast packet will never be sent.
> 
> The fix handles all the join requests via the same logic, and call
> explicitly to sendonly join whenever there is a packet from sendonly type.
> 
> Since ipoib_mcast_sendonly_join() is now called from the driver TX flow,
> we can't take mutex there. Locking isn't required there since the multicast
> join callback will be called only after the SA agent initialized the relevant
> multicast object.
> 
> Fixes: 016d9fb25cd9 ('IPoIB: fix MCAST_FLAG_BUSY usage')
> Reported-by: Eyal Perry <eyalpe-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
> V0 --> V1 changes: Added credits (...) and furnished the change-log abit.
> 
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   15 ++++++---------
>  1 files changed, 6 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index bc50dd0..0ea4b08 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -301,9 +301,10 @@ ipoib_mcast_sendonly_join_complete(int status,
>  			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
>  		}
>  		netif_tx_unlock_bh(dev);
> +
> +		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>  	}
>  out:
> -	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>  	if (status)
>  		mcast->mc = NULL;
>  	complete(&mcast->done);

This chunk is wrong.  We are in our complete routine, which means
ib_sa_join_multicast is calling us for this mcast group, and we will
never see another return for this group.  We must clear the BUSY flag no
matter what as the BUSY flag now indicates that our mcast join is still
outstanding in the lower layer ib_sa_ area, not that we have joined the
group.  Please re-read my patches that re-worked the BUSY flag usage.
The BUSY flag was poorly named/used in the past, which is why a previous
patch introduced the JOINING or whatever flag it was called.  My
patchset reworks the flag usage to be more sane.  BUSY now means
*exactly* that: this mcast group is in the process of joining, aka it's
BUSY.  It doesn't mean we've joined the group and there are no more
outstanding join requests.  That's signified by mcast->mc !=
IS_ERR_OR_NULL.

> @@ -342,7 +343,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
>  	rec.port_gid = priv->local_gid;
>  	rec.pkey     = cpu_to_be16(priv->pkey);
>  
> -	mutex_lock(&mcast_mutex);
>  	init_completion(&mcast->done);
>  	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>  	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
> @@ -364,7 +364,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
>  		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
>  				"sendonly join\n", mcast->mcmember.mgid.raw);
>  	}
> -	mutex_unlock(&mcast_mutex);
>  
>  	return ret;
>  }

No!  You can not, under any circumstances, remove this locking!  One of
the things that frustrated me for a bit until I tracked it down was how
ib_sa_join_multicast returns errors to the ipoib layer.  When you call
ib_sa_join_multicast, the return value is either a valid mcast->mc
pointer or IS_ERR(err).  If it's a valid pointer, that does not mean we
have successfully joined, it means that we might join, but it isn't
until we have completed the callback that we know.  The callback will
clear out mcast->mc if we encounter an error during the callback and
know that by returning an error from the callback, the lower layer is
going to delete the mcast->mc context out from underneath us.  As it
turns out, we often get our callbacks called even before we get the
initial return from ib_sa_join_multicast.  If we don't have this
locking, and we get any error in the callback, the callback will clear
mcast->mc to indicate that we have no valid group, then we will return
from ib_sa_join_multicast and set mcast->mc to an invalid group.  To
prevent that, the callback grabs this mutex at the beginning of its
operation.  We *must* grab the mutex here and hold it until we are done
with mcast->mc or else we can't know for sure if the mcast->mc that we
just set as the return code from ib_sa_join_multicast is the right
return value or if we just overwrote the callbacks setting of that
field.

> @@ -622,10 +621,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
>  			break;
>  		}
>  
> -		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> -			ipoib_mcast_sendonly_join(mcast);
> -		else
> -			ipoib_mcast_join(dev, mcast, 1);
> +		ipoib_mcast_join(dev, mcast, 1);
> +
>  		return;
>  	}
>  
> @@ -725,8 +722,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
>  		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
>  		__ipoib_mcast_add(dev, mcast);
>  		list_add_tail(&mcast->list, &priv->multicast_list);
> -		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> -			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>  	}
>  
>  	if (!mcast->ah) {
> @@ -740,6 +735,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
>  		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
>  			ipoib_dbg_mcast(priv, "no address vector, "
>  					"but multicast join already started\n");
> +		else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> +			ipoib_mcast_sendonly_join(mcast);
>  
>  		/*
>  		 * If lookup completes between here and out:, don't

None of this looks right to me either.  But that's OK, I think I found
the bug while looking all of this over.  I'm going to do some testing
and I'll report back here when done.  You can't even do this without
removing the mutex_lock above that I pointed out has to stay, so this
really isn't right.


-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
       [not found]         ` <1421167544.43839.190.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-13 20:13           ` Or Gerlitz
       [not found]             ` <CAJ3xEMgJ_+b1avw6vacagSw0bxNVzLW064rztvJ6w3Y9Th0GRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Or Gerlitz @ 2015-01-13 20:13 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Erez Shitrit

On Tue, Jan 13, 2015 at 6:45 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, 2015-01-09 at 10:32 +0200, Or Gerlitz wrote:
>> On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> > From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> >
>> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
>> > both IPv6 traffic and for the most cases all IPv4 multicast traffic
>> > aren't working.
>>
>> Doug, can you ack the breakage introduced by your commit and the fix?
>
> I haven't double checked the breakage, I'll take your word for it

just try ping6 or iperf multicast and see it for yourself, please.


> (at the time I did my work, I had multicast debugging on and I verified the
> join/leave process, but I had assumed that the process would work the
> same for optional multicast groups as it does for the IPoIB broadcast
> group and other default IPoIB groups, so I didn't specifically test
> additional multicast groups above and beyond the broadcast/etc groups).
>
> However, the fix is not workable.  In particular, as soon as this patch
> is added to the kernel, you will start getting messages like this:
>
> mlx4_ib0: ipoib_mcast_leave on an in-flight join


I don't see it on my systems, is that upstream you're running? what entity does
,;x4_ib0: prefixed prints and under what settings, is that the IPoIB driver?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
       [not found]             ` <CAJ3xEMgJ_+b1avw6vacagSw0bxNVzLW064rztvJ6w3Y9Th0GRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2015-01-13 21:01               ` Doug Ledford
       [not found]                 ` <1421182860.43839.214.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Doug Ledford @ 2015-01-13 21:01 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Erez Shitrit


[-- Attachment #1.1: Type: text/plain, Size: 2614 bytes --]

On Tue, 2015-01-13 at 22:13 +0200, Or Gerlitz wrote:
> On Tue, Jan 13, 2015 at 6:45 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Fri, 2015-01-09 at 10:32 +0200, Or Gerlitz wrote:
> >> On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> >> > From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> >
> >> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> >> > both IPv6 traffic and for the most cases all IPv4 multicast traffic
> >> > aren't working.
> >>
> >> Doug, can you ack the breakage introduced by your commit and the fix?
> >
> > I haven't double checked the breakage, I'll take your word for it
> 
> just try ping6 or iperf multicast and see it for yourself, please.

I have.  I have them working now.

> 
> > (at the time I did my work, I had multicast debugging on and I verified the
> > join/leave process, but I had assumed that the process would work the
> > same for optional multicast groups as it does for the IPoIB broadcast
> > group and other default IPoIB groups, so I didn't specifically test
> > additional multicast groups above and beyond the broadcast/etc groups).
> >
> > However, the fix is not workable.  In particular, as soon as this patch
> > is added to the kernel, you will start getting messages like this:
> >
> > mlx4_ib0: ipoib_mcast_leave on an in-flight join
> 
> 
> I don't see it on my systems, is that upstream you're running? what entity does
> ,;x4_ib0: prefixed prints and under what settings, is that the IPoIB driver?

No, that's my internal rhel7 tree, but it's so close to bare upstream
when it comes to IPoIB that there really shouldn't be any difference.
And mlx4_ib0 is just the name that I renamed ib0 to (An internal
standard in my test cluster is that all ib interfaces are named based
upon the hardware they are tied to, so mlx4_ib0, mlx5_ib0, qib_ib0, etc.
Makes my life in verifying code coverage easier)  I can test again with
an upstream kernel later today.  But, for now, suffice it to say the
problem is not resolved with the patch in this thread, but with the much
simpler patch I've attached to this email (note, this is made against
rhel7, I'm only attaching it so you can try it yourself, assuming it
even applies, and once I've tested it on an upstream kernel myself and
verified that it works properly, I'll submit the final upstream version
under a new thread).

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #1.2: ipoib-mcast-test.patch --]
[-- Type: text/x-patch, Size: 3525 bytes --]

commit dc2c71896edc3b549e79a51cf83db813d6012a34
Author: Doug Ledford <dledford@redhat.com>
Date:   Tue Jan 13 13:35:40 2015 -0500

    IB/ipoib: Fix failed multicast joins/sends
    
    The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
    it is used to mean "our device is administratively allowed to send
    multicast joins/leaves/packets" and in other places it means "our
    multicast join task thread is currently running and will process your
    request if you put it on the queue".  However, this latter meaning is in
    fact flawed as there is a race condition between the join task testing
    the mcast list and finding it empty of remaining work, dropping the
    mcast mutex and also the priv->lock spinlock, and clearing the
    IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
    the flag in the former fashion, and when all tasks complete and the task
    thread clears the RUN flag, all of those other locations will fail to
    ever again queue any work.  This results in the interface coming up fine
    initially, but having problems adding new multicast groups after the
    first round of groups have all been added and the RUN flag is cleared by
    the join task thread when it thinks it is done.  To resolve this issue,
    convert all locations in the code to treat the RUN flag as an indicator
    that the multicast portion of this interface is in fact administratively
    up and joins/leaves/sends can be performed.  There is no harm (other
    than a slight performance penalty) to never clearing this flag and using
    it in this fashion as it simply means that a few places that used to
    micro-optimize how often this task was queued on a work queue will now
    queue the task a few extra times.  We can address that suboptimal
    behavior in future patches.
    
    Signed-off-by: Doug Ledford <dledford@redhat.com>

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 8a538c010b9..cba6e160df2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -638,8 +638,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	}
 
 	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
-
-	clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -649,8 +647,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
 	ipoib_dbg_mcast(priv, "starting multicast thread\n");
 
 	mutex_lock(&mcast_mutex);
-	if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+	set_bit(IPOIB_MCAST_RUN, &priv->flags);
+	queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	mutex_unlock(&mcast_mutex);
 
 	return 0;
@@ -733,7 +731,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
 		__ipoib_mcast_add(dev, mcast);
 		list_add_tail(&mcast->list, &priv->multicast_list);
-		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+		if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
 			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 
@@ -959,7 +957,8 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	/*
 	 * Restart our join task if needed
 	 */
-	ipoib_mcast_start_thread(dev);
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+		queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	rtnl_unlock();
 }
 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
       [not found]                 ` <1421182860.43839.214.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-14  6:01                   ` Doug Ledford
       [not found]                     ` <1421215262.43839.225.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Doug Ledford @ 2015-01-14  6:01 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 2267 bytes --]

On Tue, 2015-01-13 at 16:01 -0500, Doug Ledford wrote:
> On Tue, 2015-01-13 at 22:13 +0200, Or Gerlitz wrote:
> > On Tue, Jan 13, 2015 at 6:45 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > On Fri, 2015-01-09 at 10:32 +0200, Or Gerlitz wrote:
> > >> On Wed, Jan 7, 2015 at 5:04 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> > >> > From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> > >> >
> > >> > Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> > >> > both IPv6 traffic and for the most cases all IPv4 multicast traffic
> > >> > aren't working.
> > >>
> > >> Doug, can you ack the breakage introduced by your commit and the fix?
> > >
> > > I haven't double checked the breakage, I'll take your word for it
> > 
> > just try ping6 or iperf multicast and see it for yourself, please.
> 
> I have.  I have them working now.
> 
> > 
> > > (at the time I did my work, I had multicast debugging on and I verified the
> > > join/leave process, but I had assumed that the process would work the
> > > same for optional multicast groups as it does for the IPoIB broadcast
> > > group and other default IPoIB groups, so I didn't specifically test
> > > additional multicast groups above and beyond the broadcast/etc groups).
> > >
> > > However, the fix is not workable.  In particular, as soon as this patch
> > > is added to the kernel, you will start getting messages like this:
> > >
> > > mlx4_ib0: ipoib_mcast_leave on an in-flight join
> > 
> > 
> > I don't see it on my systems, is that upstream you're running? what entity does
> > ,;x4_ib0: prefixed prints and under what settings, is that the IPoIB driver?
> 
> No, that's my internal rhel7 tree, but it's so close to bare upstream
> when it comes to IPoIB that there really shouldn't be any difference.

I've tested my patch on a plain upstream kernel and it resolves the
problem there too.  So, I feel confident that the issue is resolved
properly with this patch.

Roland, do you need a different submission, or can you take the patch
from my last email?

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
       [not found]                     ` <1421215262.43839.225.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-14  6:06                       ` Or Gerlitz
  0 siblings, 0 replies; 10+ messages in thread
From: Or Gerlitz @ 2015-01-14  6:06 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Erez Shitrit

On 1/14/2015 8:01 AM, Doug Ledford wrote:
> I've tested my patch on a plain upstream kernel and it resolves the 
> problem there too. 

Good, please make a proper submission on the list so we can review it 
inline and send questions/feedback.

Or.

> So, I feel confident that the issue is resolved properly with this 
> patch. Roland, do you need a different submission, or can you take the 
> patch from my last email? 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
       [not found]     ` <1421172470.43839.207.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2015-01-22 11:39       ` Erez Shitrit
       [not found]         ` <54C0E187.2010100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Erez Shitrit @ 2015-01-22 11:39 UTC (permalink / raw)
  To: Doug Ledford, Or Gerlitz
  Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Amir Vadai,
	Eyal Perry, Erez Shitrit

On 1/13/2015 8:07 PM, Doug Ledford wrote:
> On Wed, 2015-01-07 at 17:04 +0200, Or Gerlitz wrote:
>> From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>
>> Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
>> both IPv6 traffic and for the most cases all IPv4 multicast traffic
>> aren't working.
>>
>> After this change there is no mechanism to handle the work that does the
>> join process for the rest of the mcg's. For example, if in the list of
>> all the mcg's there is a send-only request, after its processing, the
>> code in ipoib_mcast_sendonly_join_complete() will not requeue the
>> mcast task, but leaves the bit that signals this task is running,
>> and hence the task will never run.
>>
>> Also, whenever the kernel sends multicast packet (w.o joining to this
>> group), we don't call ipoib_send_only_join(), the code tries to start
>> the mcast task but it failed because the bit IPOIB_MCAST_RUN is always
>> set, As a result the multicast packet will never be sent.
>>
>> The fix handles all the join requests via the same logic, and call
>> explicitly to sendonly join whenever there is a packet from sendonly type.
>>
>> Since ipoib_mcast_sendonly_join() is now called from the driver TX flow,
>> we can't take mutex there. Locking isn't required there since the multicast
>> join callback will be called only after the SA agent initialized the relevant
>> multicast object.
>>
>> Fixes: 016d9fb25cd9 ('IPoIB: fix MCAST_FLAG_BUSY usage')
>> Reported-by: Eyal Perry <eyalpe-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Signed-off-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> ---
>> V0 --> V1 changes: Added credits (...) and furnished the change-log abit.
>>
>>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   15 ++++++---------
>>   1 files changed, 6 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> index bc50dd0..0ea4b08 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
>> @@ -301,9 +301,10 @@ ipoib_mcast_sendonly_join_complete(int status,
>>   			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
>>   		}
>>   		netif_tx_unlock_bh(dev);
>> +
>> +		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>>   	}
>>   out:
>> -	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>>   	if (status)
>>   		mcast->mc = NULL;
>>   	complete(&mcast->done);
> This chunk is wrong.  We are in our complete routine, which means
> ib_sa_join_multicast is calling us for this mcast group, and we will
> never see another return for this group.  We must clear the BUSY flag no
> matter what as the BUSY flag now indicates that our mcast join is still
> outstanding in the lower layer ib_sa_ area, not that we have joined the
> group.  Please re-read my patches that re-worked the BUSY flag usage.
> The BUSY flag was poorly named/used in the past, which is why a previous
> patch introduced the JOINING or whatever flag it was called.  My
> patchset reworks the flag usage to be more sane.  BUSY now means
> *exactly* that: this mcast group is in the process of joining, aka it's
> BUSY.  It doesn't mean we've joined the group and there are no more
> outstanding join requests.  That's signified by mcast->mc !=
> IS_ERR_OR_NULL.
agree, may mistake, the clear_bit(IPOIB_MCAST_FLAG_BUSY) should be in 
any case, i will move it 3 lines later,
BTW, this will solve the "in-flight" messages you saw.
>
>> @@ -342,7 +343,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
>>   	rec.port_gid = priv->local_gid;
>>   	rec.pkey     = cpu_to_be16(priv->pkey);
>>   
>> -	mutex_lock(&mcast_mutex);
>>   	init_completion(&mcast->done);
>>   	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
>>   	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
>> @@ -364,7 +364,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
>>   		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
>>   				"sendonly join\n", mcast->mcmember.mgid.raw);
>>   	}
>> -	mutex_unlock(&mcast_mutex);
>>   
>>   	return ret;
>>   }
> No!  You can not, under any circumstances, remove this locking!  One of
> the things that frustrated me for a bit until I tracked it down was how
> ib_sa_join_multicast returns errors to the ipoib layer.  When you call
> ib_sa_join_multicast, the return value is either a valid mcast->mc
> pointer or IS_ERR(err).  If it's a valid pointer, that does not mean we
> have successfully joined, it means that we might join, but it isn't
> until we have completed the callback that we know.  The callback will
> clear out mcast->mc if we encounter an error during the callback and
> know that by returning an error from the callback, the lower layer is
> going to delete the mcast->mc context out from underneath us.  As it
> turns out, we often get our callbacks called even before we get the
> initial return from ib_sa_join_multicast.  If we don't have this
> locking, and we get any error in the callback, the callback will clear
> mcast->mc to indicate that we have no valid group, then we will return
> from ib_sa_join_multicast and set mcast->mc to an invalid group.  To
> prevent that, the callback grabs this mutex at the beginning of its
> operation.  We *must* grab the mutex here and hold it until we are done
> with mcast->mc or else we can't know for sure if the mcast->mc that we
> just set as the return code from ib_sa_join_multicast is the right
> return value or if we just overwrote the callbacks setting of that
> field.
OK, i got your point now, you want to sync between the 
ipoib_mcas_sendonly_join to the sa callback 
(ipoib_mcast_sendonly_join_complete),
but for that you don't need to use mutex (if you are using mutes you are 
not able to call ipoib_mcast_sendonly_join from the TX flow).
the easy way is to add simple check in the function 
ipoib_mcast_sendonly_join instead of the IS_ERR(mcast->mc) use 
IS_ERR_OR_NULL(mcast->mc). IMHO, that solves the sync issue.

>> @@ -622,10 +621,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
>>   			break;
>>   		}
>>   
>> -		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
>> -			ipoib_mcast_sendonly_join(mcast);
>> -		else
>> -			ipoib_mcast_join(dev, mcast, 1);
>> +		ipoib_mcast_join(dev, mcast, 1);
>> +
>>   		return;
>>   	}
>>   
>> @@ -725,8 +722,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
>>   		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
>>   		__ipoib_mcast_add(dev, mcast);
>>   		list_add_tail(&mcast->list, &priv->multicast_list);
>> -		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
>> -			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
>>   	}
>>   
>>   	if (!mcast->ah) {
>> @@ -740,6 +735,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
>>   		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
>>   			ipoib_dbg_mcast(priv, "no address vector, "
>>   					"but multicast join already started\n");
>> +		else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
>> +			ipoib_mcast_sendonly_join(mcast);
>>   
>>   		/*
>>   		 * If lookup completes between here and out:, don't
> None of this looks right to me either.  But that's OK, I think I found
> the bug while looking all of this over.  I'm going to do some testing
> and I'll report back here when done.  You can't even do this without
> removing the mutex_lock above that I pointed out has to stay, so this
> really isn't right.
I think, that after i will add the changes i mentioned, it will be fine.
will send you the new patch, please take a look.
>
>
Thanks, Erez
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow
       [not found]         ` <54C0E187.2010100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2015-01-22 14:10           ` Doug Ledford
  0 siblings, 0 replies; 10+ messages in thread
From: Doug Ledford @ 2015-01-22 14:10 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: Or Gerlitz, Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Amir Vadai, Eyal Perry, Erez Shitrit

[-- Attachment #1: Type: text/plain, Size: 9479 bytes --]

On Thu, 2015-01-22 at 13:39 +0200, Erez Shitrit wrote:
> On 1/13/2015 8:07 PM, Doug Ledford wrote:
> > On Wed, 2015-01-07 at 17:04 +0200, Or Gerlitz wrote:
> >> From: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >>
> >> Following commit 016d9fb25cd9 "IPoIB: fix MCAST_FLAG_BUSY usage"
> >> both IPv6 traffic and for the most cases all IPv4 multicast traffic
> >> aren't working.
> >>
> >> After this change there is no mechanism to handle the work that does the
> >> join process for the rest of the mcg's. For example, if in the list of
> >> all the mcg's there is a send-only request, after its processing, the
> >> code in ipoib_mcast_sendonly_join_complete() will not requeue the
> >> mcast task, but leaves the bit that signals this task is running,
> >> and hence the task will never run.
> >>
> >> Also, whenever the kernel sends multicast packet (w.o joining to this
> >> group), we don't call ipoib_send_only_join(), the code tries to start
> >> the mcast task but it failed because the bit IPOIB_MCAST_RUN is always
> >> set, As a result the multicast packet will never be sent.
> >>
> >> The fix handles all the join requests via the same logic, and call
> >> explicitly to sendonly join whenever there is a packet from sendonly type.
> >>
> >> Since ipoib_mcast_sendonly_join() is now called from the driver TX flow,
> >> we can't take mutex there. Locking isn't required there since the multicast
> >> join callback will be called only after the SA agent initialized the relevant
> >> multicast object.
> >>
> >> Fixes: 016d9fb25cd9 ('IPoIB: fix MCAST_FLAG_BUSY usage')
> >> Reported-by: Eyal Perry <eyalpe-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> Signed-off-by: Erez Shitrit <erezsh-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> >> ---
> >> V0 --> V1 changes: Added credits (...) and furnished the change-log abit.
> >>
> >>   drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   15 ++++++---------
> >>   1 files changed, 6 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> >> index bc50dd0..0ea4b08 100644
> >> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> >> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> >> @@ -301,9 +301,10 @@ ipoib_mcast_sendonly_join_complete(int status,
> >>   			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
> >>   		}
> >>   		netif_tx_unlock_bh(dev);
> >> +
> >> +		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> >>   	}
> >>   out:
> >> -	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> >>   	if (status)
> >>   		mcast->mc = NULL;
> >>   	complete(&mcast->done);
> > This chunk is wrong.  We are in our complete routine, which means
> > ib_sa_join_multicast is calling us for this mcast group, and we will
> > never see another return for this group.  We must clear the BUSY flag no
> > matter what as the BUSY flag now indicates that our mcast join is still
> > outstanding in the lower layer ib_sa_ area, not that we have joined the
> > group.  Please re-read my patches that re-worked the BUSY flag usage.
> > The BUSY flag was poorly named/used in the past, which is why a previous
> > patch introduced the JOINING or whatever flag it was called.  My
> > patchset reworks the flag usage to be more sane.  BUSY now means
> > *exactly* that: this mcast group is in the process of joining, aka it's
> > BUSY.  It doesn't mean we've joined the group and there are no more
> > outstanding join requests.  That's signified by mcast->mc !=
> > IS_ERR_OR_NULL.
> agree, may mistake, the clear_bit(IPOIB_MCAST_FLAG_BUSY) should be in 
> any case, i will move it 3 lines later,
> BTW, this will solve the "in-flight" messages you saw.
> >
> >> @@ -342,7 +343,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
> >>   	rec.port_gid = priv->local_gid;
> >>   	rec.pkey     = cpu_to_be16(priv->pkey);
> >>   
> >> -	mutex_lock(&mcast_mutex);
> >>   	init_completion(&mcast->done);
> >>   	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
> >>   	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
> >> @@ -364,7 +364,6 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
> >>   		ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
> >>   				"sendonly join\n", mcast->mcmember.mgid.raw);
> >>   	}
> >> -	mutex_unlock(&mcast_mutex);
> >>   
> >>   	return ret;
> >>   }
> > No!  You can not, under any circumstances, remove this locking!  One of
> > the things that frustrated me for a bit until I tracked it down was how
> > ib_sa_join_multicast returns errors to the ipoib layer.  When you call
> > ib_sa_join_multicast, the return value is either a valid mcast->mc
> > pointer or IS_ERR(err).  If it's a valid pointer, that does not mean we
> > have successfully joined, it means that we might join, but it isn't
> > until we have completed the callback that we know.  The callback will
> > clear out mcast->mc if we encounter an error during the callback and
> > know that by returning an error from the callback, the lower layer is
> > going to delete the mcast->mc context out from underneath us.  As it
> > turns out, we often get our callbacks called even before we get the
> > initial return from ib_sa_join_multicast.  If we don't have this
> > locking, and we get any error in the callback, the callback will clear
> > mcast->mc to indicate that we have no valid group, then we will return
> > from ib_sa_join_multicast and set mcast->mc to an invalid group.  To
> > prevent that, the callback grabs this mutex at the beginning of its
> > operation.  We *must* grab the mutex here and hold it until we are done
> > with mcast->mc or else we can't know for sure if the mcast->mc that we
> > just set as the return code from ib_sa_join_multicast is the right
> > return value or if we just overwrote the callbacks setting of that
> > field.
> OK, i got your point now, you want to sync between the 
> ipoib_mcas_sendonly_join to the sa callback 
> (ipoib_mcast_sendonly_join_complete),
> but for that you don't need to use mutex (if you are using mutes you are 
> not able to call ipoib_mcast_sendonly_join from the TX flow).
> the easy way is to add simple check in the function 
> ipoib_mcast_sendonly_join instead of the IS_ERR(mcast->mc) use 
> IS_ERR_OR_NULL(mcast->mc). IMHO, that solves the sync issue.

No, it doesn't.  We must use something like the mutex or else it is
possible for the join completion to run *prior* to the return of
ib_sa_multicast_join.  We put the return value of ib_sa_join_multicast
into mcast->mc.  If the join completion runs prior to our return, we
overwrite whatever the join completion had put into mcast->mc with our
return from the ib_sa_multicast_join routine, possibly overwriting a
valid value.  It took me quite some time to track that race down, please
don't reintroduce it.

> >> @@ -622,10 +621,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
> >>   			break;
> >>   		}
> >>   
> >> -		if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> >> -			ipoib_mcast_sendonly_join(mcast);
> >> -		else
> >> -			ipoib_mcast_join(dev, mcast, 1);
> >> +		ipoib_mcast_join(dev, mcast, 1);
> >> +
> >>   		return;
> >>   	}
> >>   
> >> @@ -725,8 +722,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
> >>   		memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
> >>   		__ipoib_mcast_add(dev, mcast);
> >>   		list_add_tail(&mcast->list, &priv->multicast_list);
> >> -		if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
> >> -			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
> >>   	}
> >>   
> >>   	if (!mcast->ah) {
> >> @@ -740,6 +735,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
> >>   		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
> >>   			ipoib_dbg_mcast(priv, "no address vector, "
> >>   					"but multicast join already started\n");
> >> +		else if (test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
> >> +			ipoib_mcast_sendonly_join(mcast);
> >>   
> >>   		/*
> >>   		 * If lookup completes between here and out:, don't
> > None of this looks right to me either.  But that's OK, I think I found
> > the bug while looking all of this over.  I'm going to do some testing
> > and I'll report back here when done.  You can't even do this without
> > removing the mutex_lock above that I pointed out has to stay, so this
> > really isn't right.
> I think, that after i will add the changes i mentioned, it will be fine.
> will send you the new patch, please take a look.

I still disagree with the patch.  In order to work around the sendonly
join and sendonly join completion routines not restarting the task
thread properly, you have moved them over to being called directly.
This gives two different modes of operation for regular joins versus
send only joins.  That just makes the code more confusing and less
understandable.  That, to me, is not a good fix.  If it was a bandaid
that limped us home without making the code more obfuscated, that's one
thing.  This does not do that in my opinion.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2015-01-22 14:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-07 15:04 [PATCH V1 FIX for-3.19] IB/ipoib: Fix broken multicast flow Or Gerlitz
     [not found] ` <1420643066-3599-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2015-01-09  8:32   ` Or Gerlitz
     [not found]     ` <CAJ3xEMh60SGsK43b49MSozqmL8PDzTevs5CMGCvhPfmL3kHHVQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-13 16:45       ` Doug Ledford
     [not found]         ` <1421167544.43839.190.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-13 20:13           ` Or Gerlitz
     [not found]             ` <CAJ3xEMgJ_+b1avw6vacagSw0bxNVzLW064rztvJ6w3Y9Th0GRg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2015-01-13 21:01               ` Doug Ledford
     [not found]                 ` <1421182860.43839.214.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-14  6:01                   ` Doug Ledford
     [not found]                     ` <1421215262.43839.225.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-14  6:06                       ` Or Gerlitz
2015-01-13 18:07   ` Doug Ledford
     [not found]     ` <1421172470.43839.207.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2015-01-22 11:39       ` Erez Shitrit
     [not found]         ` <54C0E187.2010100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2015-01-22 14:10           ` Doug Ledford

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.