* IPoIB: Fix multicast packet drops before join is complete
@ 2009-06-02 14:49 Christoph Lameter
  2009-06-04  5:41 ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2009-06-02 14:49 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev, Yossi Etigin

Subject: IPoIB: Fix multicast packet drops before join is complete

The IPoIB layer drops multicast packets after queueing 3 as long as a
multicast group is not ready. The multicast group not being ready may
occur on first use or after a period of silence on a MC group.

What should happen is that packets are queued up until the maximum queue
size is reached (set by sk_sndbuf). Then the socket layer would put
the sending process to sleep until space in the queue becomes available
(after the multicast group becomes ready and after the initial messages
have been sent).

With the IPoIB layer dropping packets this does not occur. The process
can continue sending multicast packets and they are all dropped until
the multicast group becomes ready. The receiver will see the initial 3
multicast packets that have been significantly delayed, then a large
gap of missing packets before getting packets that have not been delayed.

After this patch the socket queue will build up and the sender will be
throttled until the MC group becomes ready.

If old behavior is desired then the application can configure
the send queue size to only allow 3 packets and specify MSG_DONTWAIT
for send operations.

Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

---
 drivers/infiniband/ulp/ipoib/ipoib.h           |    1 -
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |    7 +------
 2 files changed, 1 insertion(+), 7 deletions(-)

Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2009-05-28 15:50:56.000000000 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib_multicast.c	2009-06-02 09:27:40.000000000 -0500
@@ -685,12 +685,7 @@ void ipoib_mcast_send(struct net_device
 	}

 	if (!mcast->ah) {
-		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
-			skb_queue_tail(&mcast->pkt_queue, skb);
-		else {
-			++dev->stats.tx_dropped;
-			dev_kfree_skb_any(skb);
-		}
+		skb_queue_tail(&mcast->pkt_queue, skb);

 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			ipoib_dbg_mcast(priv, "no address vector, "
Index: linux-2.6/drivers/infiniband/ulp/ipoib/ipoib.h
===================================================================
--- linux-2.6.orig/drivers/infiniband/ulp/ipoib/ipoib.h	2009-05-28 15:50:56.000000000 -0500
+++ linux-2.6/drivers/infiniband/ulp/ipoib/ipoib.h	2009-06-02 09:27:40.000000000 -0500
@@ -79,7 +79,6 @@ enum {
 	IPOIB_NUM_WC		  = 4,

 	IPOIB_MAX_PATH_REC_QUEUE  = 3,
-	IPOIB_MAX_MCAST_QUEUE	  = 3,

 	IPOIB_FLAG_OPER_UP	  = 0,
 	IPOIB_FLAG_INITIALIZED	  = 1,

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-02 14:49 IPoIB: Fix multicast packet drops before join is complete Christoph Lameter
@ 2009-06-04  5:41 ` David Miller
  2009-06-04 14:28   ` Or Gerlitz
  2009-06-04 15:52   ` Christoph Lameter
  0 siblings, 2 replies; 29+ messages in thread
From: David Miller @ 2009-06-04  5:41 UTC (permalink / raw)
  To: cl; +Cc: rdreier, netdev, yosefe

From: Christoph Lameter <cl@linux-foundation.org>
Date: Tue, 2 Jun 2009 10:49:21 -0400 (EDT)

> IPoIB: Fix multicast packet drops before join is complete
> 
> The IPoIB layer drops multicast packets after queueing 3 as long as a
> multicast group is not ready. The multicast group not being ready may
> occur on first use or after a period of silence on a MC group.
> 
> What should happen is that packet are queued up until the maximum queue
> size is reached (set by sk_sndbuf). Then the socket layer would put
> the sending process to sleep until space in the queue becomes available
> (after the multicast group becomes ready and after the initial messages
> have been sent).
> 
> With the IPoIB layer dropping packets this does not occur. The process
> can continue sending multicast packets and they are all dropped until
> the multicast group becomes ready. The receiver will see the initial 3
> multicast packets that have been significantly delayed, then a large
> gap of missing packet before getting packets that have not been delayed.
> 
> After this patch the socket queue will build up and the sender will be
> throttled until the MC group becomes ready.
> 
> If old behavior is desired then the application can configure
> the send queue size to only allow 3 packet and specify MSG_DONTWAIT
> for send operations.
> 
> Signed-off-by: Christoph Lameter <cl@linux-foundation.org>

I don't think using an infinite backlog for this is wise.

We don't do this for ARP, for example.  We have a 3 packet limit just
like IPoIB implements for multicast here.

I'm all for increasing those limits, but eliminating them I cannot
agree with.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-04  5:41 ` David Miller
@ 2009-06-04 14:28   ` Or Gerlitz
  2009-06-04 15:52   ` Christoph Lameter
  1 sibling, 0 replies; 29+ messages in thread
From: Or Gerlitz @ 2009-06-04 14:28 UTC (permalink / raw)
  To: David Miller; +Cc: cl, rdreier, netdev, yosefe

On Thu, Jun 4, 2009 at 8:41 AM, David Miller <davem@davemloft.net> wrote:
> From: Christoph Lameter <cl@linux-foundation.org>

>> After this patch the socket queue will build up and the sender will be throttled

> I'm all for increasing those limits, but eliminating them I cannot agree with.

So the socket buffer reasoning isn't robust enough?

Or.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-04  5:41 ` David Miller
  2009-06-04 14:28   ` Or Gerlitz
@ 2009-06-04 15:52   ` Christoph Lameter
  2009-06-04 22:30     ` David Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2009-06-04 15:52 UTC (permalink / raw)
  To: David Miller; +Cc: rdreier, netdev, yosefe

On Wed, 3 Jun 2009, David Miller wrote:

> I don't think using an infinite backlog for this is wise.

The number of packets is limited by the size of the sendbuffer. It's not
unlimited but under user control.

> We don't do this for ARP, for example.  We have a 3 packet limit just
> like IPoIB implements for multicast here.

ARP is a management protocol with specific semantics. Socket protocols are
dealing with streams of datagrams.

> I'm all for increasing those limits, but eliminating them I cannot
> agree with.

If these limits are lower than the socket sendbuffer size then the packets
will be dropped without the app being stopped. If they are set higher then
the application will block after the sendbuffer size has been reached.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-04 15:52   ` Christoph Lameter
@ 2009-06-04 22:30     ` David Miller
  2009-06-05 14:18       ` Christoph Lameter
  2009-06-05 16:54       ` Roland Dreier
  0 siblings, 2 replies; 29+ messages in thread
From: David Miller @ 2009-06-04 22:30 UTC (permalink / raw)
  To: cl; +Cc: rdreier, netdev, yosefe

From: Christoph Lameter <cl@linux-foundation.org>
Date: Thu, 4 Jun 2009 11:52:48 -0400 (EDT)

> On Wed, 3 Jun 2009, David Miller wrote:
> 
>> We don't do this for ARP, for example.  We have a 3 packet limit just
>> like IPoIB implements for multicast here.
> 
> ARP is a management protocol with specific semantics. Socket protocols are
> dealing with streams of datagrams.

Go look at what the ARP backlog queue actually does, then come back to
this conversation (hint: it's not a backlog for ARP packets, it's
a backlog for packets waiting for ARP to resolve).

Thank you.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-04 22:30     ` David Miller
@ 2009-06-05 14:18       ` Christoph Lameter
  2009-06-05 16:56         ` Roland Dreier
  2009-06-06  1:16         ` David Miller
  2009-06-05 16:54       ` Roland Dreier
  1 sibling, 2 replies; 29+ messages in thread
From: Christoph Lameter @ 2009-06-05 14:18 UTC (permalink / raw)
  To: David Miller; +Cc: rdreier, netdev, yosefe

On Thu, 4 Jun 2009, David Miller wrote:

> From: Christoph Lameter <cl@linux-foundation.org>
> Date: Thu, 4 Jun 2009 11:52:48 -0400 (EDT)
>
> > On Wed, 3 Jun 2009, David Miller wrote:
> >
> >> We don't do this for ARP, for example.  We have a 3 packet limit just
> >> like IPoIB implements for multicast here.
> >
> > ARP is a management protocol with specific semantics. Socket protocols are
> > dealing with streams of datagrams.
>
> Go look at what the ARP backlog queue actually does, then come back to
> this conversation (hint: it's not a backlog for ARP packets, it's
> a backlog for packets waiting for ARP to resolve).

ARP is tied to managing small chunks of information about the network
infrastructure. Buffering the first few and throwing the rest away is
appropriate there for what the ARP protocol intends to do.

UDP multicasting can be used for streaming information. And right now the
IPoIB layer is dropping thousands of packets whenever there was a pause of
a few minutes or when a new multicast group is used and there is some
delay that the network needs to reestablish the multicast route.

On IPoIB the app can send without being throttled to the speed supported
by the hardware in these cases. The faster the cpu we get the more packets
will be dropped in these bursts. The socket layer has an option to not
wait using MSG_DONTWAIT but in these cases we are not honoring that the
flag is not set. We simply drop the packets.

If UDP multicasting is used for a purpose like ARP then this is
appropriate but UDP multicasting has a variety of uses. If you want ARP
style semantics then the buffer size can be limited in such a way that
only 3 packets are buffered by setting SO_SNDBUF and using MSG_DONTWAIT.

But without the patch this method is forced upon all uses of UDP
multicast for IPoIB layer.

And strangely if you use a 1G NIC everything is fine and the socket layer
properly throttles for packet bursts.
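
For illustration, the SO_SNDBUF plus MSG_DONTWAIT configuration mentioned
above might look roughly like this from user space. This is a minimal
sketch, not code from the patch: the helper names are hypothetical, and
Linux adjusts the requested SO_SNDBUF value (it doubles it and enforces a
minimum), so a limit of "3 packets" can only be approximated.

```c
#include <assert.h>
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open a UDP socket whose send buffer is capped at
 * roughly sndbuf_bytes. Returns the descriptor, or -1 on error. */
int open_limited_udp_socket(int sndbuf_bytes)
{
	int fd;

	assert(sndbuf_bytes > 0);
	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
		       &sndbuf_bytes, sizeof(sndbuf_bytes)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Hypothetical helper: report the send buffer size the kernel applied,
 * which may differ from the value that was requested. */
int effective_sndbuf(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) < 0)
		return -1;
	return val;
}

/* Hypothetical helper: send without sleeping. Returns 1 if the datagram
 * was handed to the stack, 0 if the send buffer was full and the datagram
 * was not queued, -1 on any other error. */
int send_or_skip(int fd, const void *buf, size_t len,
		 const struct sockaddr_in *dst)
{
	ssize_t n = sendto(fd, buf, len, MSG_DONTWAIT,
			   (const struct sockaddr *)dst, sizeof(*dst));

	if (n >= 0)
		return 1;
	if (errno == EAGAIN || errno == EWOULDBLOCK)
		return 0;
	return -1;
}
```

Without MSG_DONTWAIT the same sendto() call would put the process to sleep
once the send buffer is full, which is the throttling behavior the patch
relies on.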




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-04 22:30     ` David Miller
  2009-06-05 14:18       ` Christoph Lameter
@ 2009-06-05 16:54       ` Roland Dreier
  1 sibling, 0 replies; 29+ messages in thread
From: Roland Dreier @ 2009-06-05 16:54 UTC (permalink / raw)
  To: David Miller; +Cc: cl, netdev, yosefe


 > Go look at what the ARP backlog queue actually does, then come back to
 > this conversation (hint: it's not a backlog for ARP packets, it's
 > a backlog for packets waiting for ARP to resolve).

Yes, I see now.  I confused myself a few times already in this
discussion.  I'm going to pull Christoph's patch out of my pending queue
for now until we reach agreement on this.

 - R.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-05 14:18       ` Christoph Lameter
@ 2009-06-05 16:56         ` Roland Dreier
  2009-06-05 19:17           ` Christoph Lameter
  2009-06-06  1:16         ` David Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2009-06-05 16:56 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Miller, netdev, yosefe


 > ARP is tied to managing small chunks of information about the network
 > infrastructure. Buffering the first few and throwing the rest away is
 > appropriate there for what the ARP protocol intends to do.

Yes, but what the IP stack is doing is queueing a few packets while ARP
is pending and dropping all other packets until the destination ethernet
address is resolved.

 > UDP multicasting can be used for streaming information. And right now the
 > IPoIB layer is dropping thousands of packets whenever there was a pause of
 > a few minutes or when a new multicast group is used and there is some
 > delay that the network need to reestablish the multicast route.

Yes -- and the required IB multicast resolution seems like it is an L2
thing precisely analogous to ARP.  So I think the original IPoIB code
actually was doing the right thing.

 - R.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-05 16:56         ` Roland Dreier
@ 2009-06-05 19:17           ` Christoph Lameter
  2009-06-05 21:12             ` Roland Dreier
  2009-06-05 21:13             ` Roland Dreier
  0 siblings, 2 replies; 29+ messages in thread
From: Christoph Lameter @ 2009-06-05 19:17 UTC (permalink / raw)
  To: Roland Dreier; +Cc: David Miller, netdev, yosefe

On Fri, 5 Jun 2009, Roland Dreier wrote:

>  > ARP is tied to managing small chunks of information about the network
>  > infrastructure. Buffering the first few and throwing the rest away is
>  > appropriate there for what the ARP protocol intends to do.
>
> Yes, but what the IP stack is doing is queueing a few packets while ARP
> is pending and dropping all other packets until the destination ethernet
> address is resolved.

Usually the ARP resolution is already established so it's not that big an
issue. In order for ARP resolution to matter we have to send to an
IP address that is not in the current ARP cache.
We ran a test here sending UDP unicast and indeed we get udp packet loss.

However, *no* initial packets made it at all to the destination. We
had an effective outage for 3 milliseconds of packets being
discarded before anything is received. However, once it starts packets
flow continually without a problem.

This is actually better than IPoIB: there is nothing before
the initial traffic that arrives. In the IPoIB case 3 packets arrive
suggesting to the other side that the link is okay.

In the MC case an IGMP join must also be performed which likely takes much
longer and the switches will hold off traffic for a while. We have a
minimum here of 8 milliseconds of those outages. Then the first 3 packets
arrive.

>  > UDP multicasting can be used for streaming information. And right now the
>  > IPoIB layer is dropping thousands of packets whenever there was a pause of
>  > a few minutes or when a new multicast group is used and there is some
>  > delay that the network need to reestablish the multicast route.
>
> Yes -- and the required IB multicast resolution seems like it is an L2
> thing precisely analogous to ARP.  So I think the original IPoIB code
> actually was doing the right thing.

Then why don't the 1G NICs do the same?

Why are apps no longer working right when we move them from 1G to
IPoIB?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-05 19:17           ` Christoph Lameter
@ 2009-06-05 21:12             ` Roland Dreier
  2009-06-06  1:17               ` David Miller
  2009-06-05 21:13             ` Roland Dreier
  1 sibling, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2009-06-05 21:12 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Miller, netdev, yosefe


 > > Yes -- and the required IB multicast resolution seems like it is an L2
 > > thing precisely analogous to ARP.  So I think the original IPoIB code
 > > actually was doing the right thing.
 > 
 > Then why dont the 1G NICs do the same?

ethernet is a different L2 -- you don't have to resolve an IP multicast
address to an L2 address the way you do for IPoIB.

My point is just that multicast group join is the same sort of L2
resolution as ARP is for ethernet, and so having the same sort of
behavior makes sense.

 - R.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-05 19:17           ` Christoph Lameter
  2009-06-05 21:12             ` Roland Dreier
@ 2009-06-05 21:13             ` Roland Dreier
  2009-06-08 15:16               ` Christoph Lameter
  1 sibling, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2009-06-05 21:13 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Miller, netdev, yosefe


 > This is actually better than IPoIB there is nothing before
 > the initial traffic that arrives. In the IPoIB case 3 packets arrive
 > suggesting to the other side that the link is okay.

So maybe we should drop the *oldest* packet on queue overflow?  Would
that make things better?
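
A rough sketch of that drop-oldest policy, written as a small user-space
model rather than against the real skb queue API. The struct and function
names are hypothetical, plain ints stand in for skbs, and the limit of 3
only mirrors the old IPOIB_MAX_MCAST_QUEUE value.

```c
#include <assert.h>
#include <stddef.h>

#define PENDING_MAX 3	/* mirrors the old IPOIB_MAX_MCAST_QUEUE value */

/* Hypothetical model of the per-group pending queue. */
struct pending_queue {
	int pkt[PENDING_MAX];	/* stand-in for queued skbs */
	size_t head;		/* index of the oldest entry */
	size_t len;		/* number of entries currently queued */
	unsigned long dropped;	/* packets discarded on overflow */
};

void pending_init(struct pending_queue *q)
{
	assert(q != NULL);
	q->head = 0;
	q->len = 0;
	q->dropped = 0;
}

/* Queue a packet; when the queue is full, discard the oldest entry
 * instead of the new one. */
void pending_enqueue_drop_oldest(struct pending_queue *q, int pkt)
{
	if (q->len == PENDING_MAX) {
		q->head = (q->head + 1) % PENDING_MAX;
		q->len--;
		q->dropped++;
	}
	q->pkt[(q->head + q->len) % PENDING_MAX] = pkt;
	q->len++;
}

/* Remove the oldest packet. Returns 1 and stores it in *pkt, or returns
 * 0 if the queue is empty. */
int pending_dequeue(struct pending_queue *q, int *pkt)
{
	if (q->len == 0)
		return 0;
	*pkt = q->pkt[q->head];
	q->head = (q->head + 1) % PENDING_MAX;
	q->len--;
	return 1;
}
```

With this policy the packets sent once the join completes are the most
recent ones, not three stale ones followed by a gap.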

 - R.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-05 14:18       ` Christoph Lameter
  2009-06-05 16:56         ` Roland Dreier
@ 2009-06-06  1:16         ` David Miller
  2009-06-08 15:20           ` Christoph Lameter
  1 sibling, 1 reply; 29+ messages in thread
From: David Miller @ 2009-06-06  1:16 UTC (permalink / raw)
  To: cl; +Cc: rdreier, netdev, yosefe

From: Christoph Lameter <cl@linux-foundation.org>
Date: Fri, 5 Jun 2009 10:18:15 -0400 (EDT)

> ARP is tied to managing small chunks of information about the network
> infrastructure. Buffering the first few and throwing the rest away is
> appropriate there for what the ARP protocol intends to do.
> 
> UDP multicasting can be used for streaming information. And right now the
> IPoIB layer is dropping thousands of packets whenever there was a pause of
> a few minutes or when a new multicast group is used and there is some
> delay that the network need to reestablish the multicast route.
> 
> On IPoIB the app can send without being throttled to the speed supported
> by the hardware in these cases. The faster the cpu we get the more packets
> will be dropped in these bursts. The socket layer has an option to not
> wait using MSG_DONTWAIT but in these cases we are not honoring that the
> flag is not set. We simply drop the packets.
> 
> If UDP multicasting is used for a purpose like ARP then this is
> appropriate but UDP multicasting has a variety of uses. If you want ARP
> style semantics then the buffer size can be limited in such a way that
> only 3 packets are bufferd by setting SO_SNDBUF and using MSG_DONTWAIT.

I can guarantee you that this will break in the future, because
very soon we are going to skb_orphan() (disassociate the socket
from the SKB and thus detract the socket buffer allocation) right
before we give packets to the device layer.  What we're doing now
kills our TX performance.

And Rusty has done tests showing that, as far as fairness is
concerned, doing the early skb_orphan() does not create problems
where sockets can "take over" a device keeping other sockets out.

But it will allow non-limiting schemes like yours to consume tons
of memory.  Normally device TX queue lengths keep the system's
collection of active sockets in check, so there is a real limit
under normal operation.   Your multicast-resolution list has no
limits.

It's fundamentally wrong.  You need some limit there.
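
To make the accounting argument concrete, here is a toy user-space model.
It is only a sketch under simplifying assumptions: the names are
hypothetical, the admission check is a simplification of what the socket
layer does, and "orphaning" is reduced to releasing the socket's charge
before the packet is parked in the driver's pending list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one socket sending into a pending list. */
struct tx_model {
	size_t sndbuf;	/* bytes the socket may have charged at once */
	size_t charged;	/* bytes currently charged to the socket */
	size_t backlog;	/* packets parked while the join is pending */
};

/* Try to send one packet of pkt_bytes while the group is not ready.
 * Returns 1 if the packet was accepted and parked, 0 if the sender
 * would be put to sleep (or see EAGAIN) instead. */
int tx_model_send(struct tx_model *m, size_t pkt_bytes, int orphan_early)
{
	assert(m != NULL);
	if (m->charged + pkt_bytes > m->sndbuf)
		return 0;		/* send buffer full: sender is throttled */
	m->charged += pkt_bytes;
	if (orphan_early)
		m->charged -= pkt_bytes;	/* charge released before parking */
	m->backlog++;
	return 1;
}
```

In this model, a 3000-byte send buffer and 1000-byte packets admit three
packets when the charge is held, but every packet when the charge is
released early, so only an explicit limit on the pending list bounds its
growth.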

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-05 21:12             ` Roland Dreier
@ 2009-06-06  1:17               ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2009-06-06  1:17 UTC (permalink / raw)
  To: rdreier; +Cc: cl, netdev, yosefe

From: Roland Dreier <rdreier@cisco.com>
Date: Fri, 05 Jun 2009 14:12:35 -0700

> ethernet is a different L2 -- you don't have to resolve an IP multicast
> address to an L2 address the way you do for IPoIB.
> 
> My point is just that multicast group join is the same sort of L2
> resolution as ARP is for ethernet, and so having the same sort of
> behavior makes sense.

Then the IPoIB layer should block on group join, not have this
limitless packet backlog which as I have explained is ill designed
and definitely not futureproof.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-05 21:13             ` Roland Dreier
@ 2009-06-08 15:16               ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2009-06-08 15:16 UTC (permalink / raw)
  To: Roland Dreier; +Cc: David Miller, netdev, yosefe

On Fri, 5 Jun 2009, Roland Dreier wrote:

>
>  > This is actually better than IPoIB there is nothing before
>  > the initial traffic that arrives. In the IPoIB case 3 packets arrive
>  > suggesting to the other side that the link is okay.
>
> So maybe we should drop the *oldest* packet on queue overflow?  Would
> that make things better?

Definitely. That would avoid the apps thinking that there was a huge
outage. Also the join times on some fabrics can be quite long if there is
heavy reconfiguration of the switch fabric going on. In the
order of seconds it seems. The packet is definitely stale by that time.



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-06  1:16         ` David Miller
@ 2009-06-08 15:20           ` Christoph Lameter
  2009-06-08 21:29             ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2009-06-08 15:20 UTC (permalink / raw)
  To: David Miller; +Cc: rdreier, netdev, yosefe

On Fri, 5 Jun 2009, David Miller wrote:

> It's fundamentally wrong.  You need some limit there.

Maybe we need a different approach but (since we are at the point of
talking about categorical fundamentally wrong statements) it's
fundamentally wrong if the IPoIB layer works differently from a regular
Ethernet NIC. They should work in the same way.

Could the packet drop mechanism for multicast be removed from the device
drivers and made standard in the core net layer? So they all have the same
predictable behavior for the applications using multicast?


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-08 15:20           ` Christoph Lameter
@ 2009-06-08 21:29             ` David Miller
  2009-06-09 20:52               ` Roland Dreier
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2009-06-08 21:29 UTC (permalink / raw)
  To: cl; +Cc: rdreier, netdev, yosefe

From: Christoph Lameter <cl@linux-foundation.org>
Date: Mon, 8 Jun 2009 11:20:31 -0400 (EDT)

> On Fri, 5 Jun 2009, David Miller wrote:
> 
>> It's fundamentally wrong.  You need some limit there.
> 
> Maybe we need a different approach but (since we are at the point of
> talking about categorical fundamentally wrong statements) its
> fundamentally wrong if the IPoIB layer works differently from a regular
> Ethernet NIC. They should work in the same way.

This I agree on.  When a multicast group is joined, the IPoIB layer
should block in some way until the subscription is fully available
for use.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-08 21:29             ` David Miller
@ 2009-06-09 20:52               ` Roland Dreier
  2009-06-10  0:45                 ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2009-06-09 20:52 UTC (permalink / raw)
  To: David Miller; +Cc: cl, netdev, yosefe


 > > Maybe we need a different approach but (since we are at the point of
 > > talking about categorical fundamentally wrong statements) its
 > > fundamentally wrong if the IPoIB layer works differently from a regular
 > > Ethernet NIC. They should work in the same way.
 > 
 > This I agree on.  When a multicast group is joined, the IPoIB layer
 > should block in some way until the subscription is fully available
 > for use.

Not sure I understand.  What do you mean by block?  It is quite possible
that a single multicast group is waiting to be joined, while other
multicast groups and unicast traffic can be sent with no problem.  So
stopping the whole device queue and blocking everything doesn't seem
appropriate.

 - R.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-09 20:52               ` Roland Dreier
@ 2009-06-10  0:45                 ` David Miller
  2009-06-10  3:55                   ` Roland Dreier
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2009-06-10  0:45 UTC (permalink / raw)
  To: rdreier; +Cc: cl, netdev, yosefe

From: Roland Dreier <rdreier@cisco.com>
Date: Tue, 09 Jun 2009 13:52:45 -0700

> 
>  > > Maybe we need a different approach but (since we are at the point of
>  > > talking about categorical fundamentally wrong statements) its
>  > > fundamentally wrong if the IPoIB layer works differently from a regular
>  > > Ethernet NIC. They should work in the same way.
>  > 
>  > This I agree on.  When a multicast group is joined, the IPoIB layer
>  > should block in some way until the subscription is fully available
>  > for use.
> 
> Not sure I understand.  What do you mean by block?  It is quite possible
> that a single multicast group is waiting to be joined, while other
> multicast groups and unicast traffic can be sent with no problem.  So
> stopping the whole device queue and blocking everything doesn't seem
> appropriate.

Block the process joining the multicast group, not the entire
interface.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-10  0:45                 ` David Miller
@ 2009-06-10  3:55                   ` Roland Dreier
  2009-06-10  4:57                     ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2009-06-10  3:55 UTC (permalink / raw)
  To: David Miller; +Cc: cl, netdev, yosefe


 > Block the process joining the multicast group, not the entire
 > interface.

Hmm... how do I do that?  The interface gets an skb that it sees should
be sent to a multicast group that it is not a member of yet, and so it
fires off a request to join that group (as a send-only member).  How
does a netdev block the process that queued up a given skb to send?
Couldn't the packet have come through a local software bridge or
something like that, so the original process is long since lost to the
network stack?

 - R.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-10  3:55                   ` Roland Dreier
@ 2009-06-10  4:57                     ` David Miller
  2009-06-10  5:04                       ` Roland Dreier
  2009-06-11 15:07                       ` Christoph Lameter
  0 siblings, 2 replies; 29+ messages in thread
From: David Miller @ 2009-06-10  4:57 UTC (permalink / raw)
  To: rdreier; +Cc: cl, netdev, yosefe

From: Roland Dreier <rdreier@cisco.com>
Date: Tue, 09 Jun 2009 20:55:57 -0700

> Hmm... how do I do that?  The interface gets an skb that it sees should
> be sent to a multicast group that it is not a member of yet, and so it
> fires off a request to join that group (as a send-only member).  How
> does a netdev block the process that queued up a given skb to send?
> Couldn't the packet have come through a local software bridge or
> something like that, so the original process is long since lost to the
> network stack?

If a facility doesn't exist yet, we're going to have to create
one.  It would need to do a downcall to the device when the
user joins a multicast group on a socket, and then the device
can do whatever magic is necessary to speak multicast immediately
and sleep until it really is available for immediate use.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-10  4:57                     ` David Miller
@ 2009-06-10  5:04                       ` Roland Dreier
  2009-06-10  5:12                         ` David Miller
  2009-06-11 15:07                       ` Christoph Lameter
  1 sibling, 1 reply; 29+ messages in thread
From: Roland Dreier @ 2009-06-10  5:04 UTC (permalink / raw)
  To: David Miller; +Cc: cl, netdev, yosefe


 > If a facility doesn't exist yet, we're going to have to create
 > one.  It would need to do a downcall to the device when the
 > user joins a multicast group on a socket, and then the device
 > can do whatever magic is necessary to speak multicast immediately
 > and sleep until it really is available for immediate use.

But a send-only membership is created when a multicast packet is sent,
which an application can do just with sendto() -- do we want to plumb
all the way from sendto() down to the device the packet is ultimately
going to be sent on?

There isn't any corresponding code to block an app during ARP
resolution, is there?

 - R.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-10  5:04                       ` Roland Dreier
@ 2009-06-10  5:12                         ` David Miller
  2009-06-10 10:18                           ` Or Gerlitz
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2009-06-10  5:12 UTC (permalink / raw)
  To: rdreier; +Cc: cl, netdev, yosefe

From: Roland Dreier <rdreier@cisco.com>
Date: Tue, 09 Jun 2009 22:04:21 -0700

>  > If a facility doesn't exist yet, we're going to have to create
>  > one.  It would need to do a downcall to the device when the
>  > user joins a multicast group on a socket, and then the device
>  > can do whatever magic is necessary to speak multicast immediately
>  > and sleep until it really is available for immediate use.
> 
> But a send-only membership is created when a multicast packet is sent,
> which an application can do just with sendto() -- do we want to plumb
> all the way from sendto() down to the device the packet is ultimately
> going to be sent on?
> 
> There isn't any corresponding code to block an app during ARP
> resolution, is there?

Nope.

But the problem is that it's incredibly invasive to applications
how IPoIB behaves for multicast traffic.  It's much more biting
than ARP because usually ARP hits you when you do your connection
attempt or initial DNS request, then it's resolved and smooth
sailing.

Here the app can send a voluminous amount of application data
at the destination and it'll mostly drop on the floor currently.
This is bad.

ARP doesn't typically toss real application data, this does.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-10  5:12                         ` David Miller
@ 2009-06-10 10:18                           ` Or Gerlitz
  2009-06-10 12:01                             ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Or Gerlitz @ 2009-06-10 10:18 UTC (permalink / raw)
  To: David Miller; +Cc: rdreier, cl, netdev, yosefe

On Wed, Jun 10, 2009 at 8:12 AM, David Miller <davem@davemloft.net> wrote:
> From: Roland Dreier <rdreier@cisco.com>

>> There isn't any corresponding code to block an app during ARP resolution, is there?

> Here the app can send a voluminous amount of application data at the destination and it'll
> mostly drop on the floor currently. ARP doesn't typically toss real application data, this does.


Dave,

If there's no app level flow control around, UDP senders will just send...  so
for --unicast--  the same problem of injecting tons of packets which can get
dropped will come into play under both Ethernet and IB, so the user is limited
by the socket buffer len and the neigh unres_qlen sysctl.

Why not apply a similar sysctl mechanism for the IB/mcast case so the ipoib
driver level (under the user directive) will have control on how many
packets would be queued before starting to drop?

Or.
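
For reference, the neighbour queue limit mentioned above can be inspected from userspace. The following is a minimal sketch, assuming the IPv4 neighbour sysctls are exposed under /proc/sys/net/ipv4/neigh/<ifname>/unres_qlen; the helper name read_unres_qlen and the interface name "ib0" are illustrative only and not part of the thread.

```python
from pathlib import Path


def read_unres_qlen(ifname="default", proc_root="/proc/sys/net/ipv4/neigh"):
    """Return the unres_qlen value for ifname, or None if it cannot be read.

    proc_root is a parameter only so the helper can be pointed at a
    different directory (hypothetical convenience, e.g. for testing).
    """
    path = Path(proc_root) / ifname / "unres_qlen"
    try:
        # The file holds a single integer: how many packets may be queued
        # for a neighbour entry whose address is not yet resolved.
        return int(path.read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    # "ib0" is an assumed IPoIB interface name.
    for name in ("default", "ib0"):
        print(name, read_unres_qlen(name))
```

The helper returns None rather than raising, since the sysctl may be absent on a given system or interface.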


* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-10 10:18                           ` Or Gerlitz
@ 2009-06-10 12:01                             ` David Miller
  2009-06-11 11:45                               ` Or Gerlitz
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2009-06-10 12:01 UTC (permalink / raw)
  To: or.gerlitz; +Cc: rdreier, cl, netdev, yosefe

From: Or Gerlitz <or.gerlitz@gmail.com>
Date: Wed, 10 Jun 2009 13:18:29 +0300

> If there's no app level flow control around, UDP senders will just
> send...  so for --unicast-- the same problem of injecting tons of
> packets which can get dropped will come into play under both
> Ethernet and IB, so the user is limited by the socket buffer len and
> the neigh unres_qlen sysctl.

But this generally does not cause problems via ARP, why might that
be?  Because resolution is 1) rare 2) quick.  Both of which do not
apply to this IPoIB case.

> Why not apply a similar sysctl mechanism for the IB/mcast case so
> the ipoib driver level (under the user directive) will have control
> on how many packets would be queued before starting to drop.

Sure, but that's just pushing the problem around instead of
solving it.


* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-10 12:01                             ` David Miller
@ 2009-06-11 11:45                               ` Or Gerlitz
  2009-06-11 11:57                                 ` David Miller
  0 siblings, 1 reply; 29+ messages in thread
From: Or Gerlitz @ 2009-06-11 11:45 UTC (permalink / raw)
  To: David Miller; +Cc: rdreier, cl, netdev, yosefe

On Wed, Jun 10, 2009 at 3:01 PM, David Miller<davem@davemloft.net> wrote:
> From: Or Gerlitz <or.gerlitz@gmail.com>

>> for --unicast-- the same problem [...] will come into play under both Ethernet and IB,
>> so the user is limited by the socket buffer len and the neigh unres_qlen sysctl.

> But this generally does not cause problems via ARP, why might that be?
> Because resolution is 1) rare 2) quick.  Both of which do not apply to this IPoIB case.

To be precise, I don't see why ARP resolution is more rare than
joining a multicast group on behalf of a sender in IPoIB, they both
happen on the first send and if/when the kernel deletes the neighbour.

> Sure, but that's just pushing the problem around instead of solving it.

Let me think about it more.

Or.


* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-11 11:45                               ` Or Gerlitz
@ 2009-06-11 11:57                                 ` David Miller
  0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2009-06-11 11:57 UTC (permalink / raw)
  To: or.gerlitz; +Cc: rdreier, cl, netdev, yosefe

From: Or Gerlitz <or.gerlitz@gmail.com>
Date: Thu, 11 Jun 2009 14:45:28 +0300

> On Wed, Jun 10, 2009 at 3:01 PM, David Miller<davem@davemloft.net> wrote:
>> From: Or Gerlitz <or.gerlitz@gmail.com>
> 
>>> for --unicast-- the same problem [...] will come into play under both Ethernet and IB,
>>> so the user is limited by the socket buffer len and the neigh unres_qlen sysctl.
> 
>> But this generally does not cause problems via ARP, why might that
>> be?  Because resolution is 1) rare 2) quick.  Both of which do not
>> apply to this IPoIB case.
> 
> To be precise, I don't see why ARP resolution is more rare than
> joining a multicast group on behalf of a sender in IPoIB, they
> both happen on the first send and if/when the kernel deletes the
> neighbour.

There are two issues:

1) Ethernet does not have to "resolve" anything for multicast sends.
   Therefore this issue is IPoIB specific.

2) ARP's tend to resolve on the DNS request for the site, or the
   initial connection attempt.  Both of which are single packets,
   easily retransmitted, and the dropping of which does not result in
   losing significant application data.  The multicast case over IPoIB
   does drop tons of actual application data.


* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-10  4:57                     ` David Miller
  2009-06-10  5:04                       ` Roland Dreier
@ 2009-06-11 15:07                       ` Christoph Lameter
  2009-06-11 23:58                         ` David Miller
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Lameter @ 2009-06-11 15:07 UTC (permalink / raw)
  To: David Miller; +Cc: rdreier, netdev, yosefe

On Tue, 9 Jun 2009, David Miller wrote:

> From: Roland Dreier <rdreier@cisco.com>
> Date: Tue, 09 Jun 2009 20:55:57 -0700
>
> > Hmm... how do I do that?  The interface gets an skb that it sees should
> > be sent to a multicast group that it is not a member of yet, and so it
> > fires off a request to join that group (as a send-only member).  How
> > does a netdev block the process that queued up a given skb to send?
> > Couldn't the packet have come through a local software bridge or
> > something like that, so the original process is long since lost to the
> > network stack?
>
> If a facility doesn't exist yet, we're going to have to create
> one.  It would need to do a downcall to the device when the
> user joins a multicast group on a socket, and then the device
> can do whatever magic is necessary to speak multicast immediately
> and sleep until it really is available for immediate use.

Umm?? My patch does that....



* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-11 15:07                       ` Christoph Lameter
@ 2009-06-11 23:58                         ` David Miller
  2009-06-12 14:17                           ` Christoph Lameter
  0 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2009-06-11 23:58 UTC (permalink / raw)
  To: cl; +Cc: rdreier, netdev, yosefe

From: Christoph Lameter <cl@linux-foundation.org>
Date: Thu, 11 Jun 2009 11:07:52 -0400 (EDT)

> On Tue, 9 Jun 2009, David Miller wrote:
> 
>> From: Roland Dreier <rdreier@cisco.com>
>> Date: Tue, 09 Jun 2009 20:55:57 -0700
>>
>> > Hmm... how do I do that?  The interface gets an skb that it sees should
>> > be sent to a multicast group that it is not a member of yet, and so it
>> > fires off a request to join that group (as a send-only member).  How
>> > does a netdev block the process that queued up a given skb to send?
>> > Couldn't the packet have come through a local software bridge or
>> > something like that, so the original process is long since lost to the
>> > network stack?
>>
>> If a facility doesn't exist yet, we're going to have to create
>> one.  It would need to do a downcall to the device when the
>> user joins a multicast group on a socket, and then the device
>> can do whatever magic is necessary to speak multicast immediately
>> and sleep until it really is available for immediate use.
> 
> Umm?? My patch does that....

No it does not.  It eliminates a limit on the generic intermediate
queue.  It did no blocking, no sleeping, of the process when it
joined the group.

I was suggesting to block the process, at socket context time, when it
joins a multicast group.  That way it wouldn't be able to do any sends
until IPoIB could resolve the multicast join.
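
To make the distinction concrete, here is a rough userspace sketch of the two paths being discussed: an explicit join through setsockopt(IP_ADD_MEMBERSHIP), which is a socket-level call where a sleep could in principle be hooked in, and a plain sendto() to a group address, which involves no join call at all and which, as noted earlier in the thread, is enough to create a send-only membership on IPoIB. The helper names and the group address used in the usage note are illustrative assumptions.

```python
import socket
import struct


def make_mreq(group, local_addr="0.0.0.0"):
    """Pack a struct ip_mreq: group address, then local interface address."""
    return struct.pack("4s4s",
                       socket.inet_aton(group),
                       socket.inet_aton(local_addr))


def join_group(sock, group, local_addr="0.0.0.0"):
    """Explicit join: the kernel sees this call in socket context."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    make_mreq(group, local_addr))


def send_only(sock, group, port, payload):
    """No join call: the application just sends to the group address."""
    return sock.sendto(payload, (group, port))
```

For example, join_group(sock, "239.1.2.3") on a UDP socket takes the first path, while send_only(sock, "239.1.2.3", 5000, b"data") takes the second; a sender that only ever uses the second path never passes through the join call.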


* Re: IPoIB: Fix multicast packet drops before join is complete
  2009-06-11 23:58                         ` David Miller
@ 2009-06-12 14:17                           ` Christoph Lameter
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Lameter @ 2009-06-12 14:17 UTC (permalink / raw)
  To: David Miller; +Cc: rdreier, netdev, yosefe

On Thu, 11 Jun 2009, David Miller wrote:

> > Umm?? My patch does that....
>
> No it does not.  It eliminates a limit on the generic intermediate
> queue.  It did no blocking, no sleeping, of the process when it
> joined the group.

What intermediate queue? The check in the IPoIB layer was simply for
packets queued on a socket. Are you talking about the TX ring?

> I was suggesting to block the process, at socket context time, when it
> joins a multicast group.  That way it wouldn't be able to do any sends
> until IPoIB could resolve the multicast join.

My patch blocks the process when the queue has filled up. It blocks when
the queue has reached sk_sndbuf.

The join occurs and then the app continues sending packets that are
queued until the queue limits have been reached. Then the process blocks.

It would be better to block earlier but this also does the trick.
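
The difference between the two behaviors can be illustrated with a toy model. This is only a sketch under simplifying assumptions (one send attempt per tick, a queue limit counted in packets rather than bytes, a join that completes at a fixed tick); it is not kernel code, and simulate and its parameters are made-up names.

```python
from collections import deque


def simulate(n_packets, join_ready_at, queue_limit, drop_when_full):
    """Toy model: one send attempt per tick until all packets are handled.

    Returns (delivered, dropped, stalled_ticks).
    """
    queue = deque()           # packets held while the group is not ready
    delivered, dropped = [], []
    stalled_ticks = 0
    seq = 0                   # next sequence number the sender wants to send
    tick = 0
    while seq < n_packets:
        if tick >= join_ready_at:
            # Join finished: flush the backlog, then send directly.
            delivered.extend(queue)
            queue.clear()
            delivered.append(seq)
            seq += 1
        elif len(queue) < queue_limit:
            queue.append(seq)
            seq += 1
        elif drop_when_full:
            dropped.append(seq)   # sender keeps going, data is lost
            seq += 1
        else:
            stalled_ticks += 1    # sender sleeps, nothing is lost
        tick += 1
    delivered.extend(queue)   # backlog still goes out once the join completes
    return delivered, dropped, stalled_ticks
```

With 10 packets and the join completing at tick 6, simulate(10, 6, 3, True) delivers packets 0, 1, 2, 6, 7, 8, 9 and drops 3 to 5, which is the "initial 3, then a gap" pattern. simulate(10, 6, 5, False) delivers all 10 packets in order and stalls the sender for one tick instead.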


Thread overview: 29+ messages
2009-06-02 14:49 IPoIB: Fix multicast packet drops before join is complete Christoph Lameter
2009-06-04  5:41 ` David Miller
2009-06-04 14:28   ` Or Gerlitz
2009-06-04 15:52   ` Christoph Lameter
2009-06-04 22:30     ` David Miller
2009-06-05 14:18       ` Christoph Lameter
2009-06-05 16:56         ` Roland Dreier
2009-06-05 19:17           ` Christoph Lameter
2009-06-05 21:12             ` Roland Dreier
2009-06-06  1:17               ` David Miller
2009-06-05 21:13             ` Roland Dreier
2009-06-08 15:16               ` Christoph Lameter
2009-06-06  1:16         ` David Miller
2009-06-08 15:20           ` Christoph Lameter
2009-06-08 21:29             ` David Miller
2009-06-09 20:52               ` Roland Dreier
2009-06-10  0:45                 ` David Miller
2009-06-10  3:55                   ` Roland Dreier
2009-06-10  4:57                     ` David Miller
2009-06-10  5:04                       ` Roland Dreier
2009-06-10  5:12                         ` David Miller
2009-06-10 10:18                           ` Or Gerlitz
2009-06-10 12:01                             ` David Miller
2009-06-11 11:45                               ` Or Gerlitz
2009-06-11 11:57                                 ` David Miller
2009-06-11 15:07                       ` Christoph Lameter
2009-06-11 23:58                         ` David Miller
2009-06-12 14:17                           ` Christoph Lameter
2009-06-05 16:54       ` Roland Dreier
