linux-kernel.vger.kernel.org archive mirror
* [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
@ 2007-08-07 14:37 Steve Wise
  2007-08-07 14:54 ` Evgeniy Polyakov
  2007-08-09 18:49 ` Steve Wise
  0 siblings, 2 replies; 54+ messages in thread
From: Steve Wise @ 2007-08-07 14:37 UTC (permalink / raw)
  To: Roland Dreier, David S. Miller
  Cc: netdev, linux-kernel, Sean Hefty, OpenFabrics General

Networking experts,

I'd like input on the patch below, and help in solving this bug 
properly.  iWARP devices that support both native stack TCP and iWARP 
(aka RDMA over TCP/IP/Ethernet) connections on the same interface need 
the fix below or some similar fix to the RDMA connection manager.

This is a BUG in the Linux RDMA-CMA code as it stands today.

Here is the issue:

Consider an MPI cluster running mvapich2, where the cluster runs 
MPI/Sockets jobs concurrently with MPI/RDMA jobs.  It is possible, 
without the patch below, for MPI/Sockets processes to mistakenly get 
incoming RDMA connections and vice versa.  The way mvapich2 works is 
that the ranks all bind and listen on a random port (retrying new random 
ports if the bind fails with "in use").  Once they get a free port and
bind/listen, they advertise that port number to their peers for 
connection setup.  Currently, without the patch below, the MPI/RDMA 
processes can end up binding/listening to the _same_ port number as the 
MPI/Sockets processes running over the native TCP stack.  This is due to 
duplicate port spaces for native stack TCP and the RDMA CM's RDMA_PS_TCP 
port space.  If this happens, then the connections can get screwed up.
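
To make the collision concrete, here is a minimal userspace sketch 
(illustrative only -- it assumes the librdmacm API and omits all error 
handling) showing how a sockets listener and an RDMA_PS_TCP listener can 
both end up bound to the same port today:

#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <rdma/rdma_cma.h>

void port_space_collision(void)
{
	struct sockaddr_in addr;
	struct rdma_event_channel *ch;
	struct rdma_cm_id *id;
	int fd;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(5000);

	/* Native stack listener on TCP port 5000. */
	fd = socket(AF_INET, SOCK_STREAM, 0);
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	listen(fd, 1);

	/*
	 * RDMA CM listener on "TCP" port 5000.  Today this bind also
	 * succeeds, because RDMA_PS_TCP is a separate port space.
	 */
	ch = rdma_create_event_channel();
	rdma_create_id(ch, &id, NULL, RDMA_PS_TCP);
	rdma_bind_addr(id, (struct sockaddr *)&addr);
	rdma_listen(id, 1);
}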

The correct solution in my mind is to use the host stack's TCP port 
space for _all_ RDMA_PS_TCP port allocations.   The patch below is a 
minimal delta to unify the port spaces by using the kernel stack to bind 
ports.  This is done by allocating a kernel socket and binding to the 
appropriate local addr/port.  It also allows the kernel stack to pick 
ephemeral ports by virtue of just passing in port 0 on the kernel bind 
operation.

There has been a discussion already on the RDMA list if anyone is 
interested:

http://www.mail-archive.com/general@lists.openfabrics.org/msg05162.html


Thanks,

Steve.


---

RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.

This is needed for iwarp providers that support native and rdma
connections over the same interface.

Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---

 drivers/infiniband/core/cma.c |   27 ++++++++++++++++++++++++++-
 1 files changed, 26 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 9e0ab04..e4d2d7f 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -111,6 +111,7 @@ struct rdma_id_private {
 	struct rdma_cm_id	id;

 	struct rdma_bind_list	*bind_list;
+	struct socket		*sock;
 	struct hlist_node	node;
 	struct list_head	list;
 	struct list_head	listen_list;
@@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
 		kfree(bind_list);
 	}
 	mutex_unlock(&lock);
+	if (id_priv->sock)
+		sock_release(id_priv->sock);
 }

 void rdma_destroy_id(struct rdma_cm_id *id)
@@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
 	return 0;
 }

+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
+{
+	int ret;
+	struct socket *sock;
+
+	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
+	if (ret)
+		return ret;
+	ret = sock->ops->bind(sock,
+			  (struct sockaddr *)&id_priv->id.route.addr.src_addr,
+			  ip_addr_size(&id_priv->id.route.addr.src_addr));
+	if (ret) {
+		sock_release(sock);
+		return ret;
+	}
+	id_priv->sock = sock;
+	return 0;	
+}
+
 static int cma_get_port(struct rdma_id_private *id_priv)
 {
 	struct idr *ps;
@@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
 		break;
 	case RDMA_PS_TCP:
 		ps = &tcp_ps;
+		ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
+		if (ret)
+			goto out;
 		break;
 	case RDMA_PS_UDP:
 		ps = &udp_ps;
@@ -1815,7 +1840,7 @@ static int cma_get_port(struct rdma_id_p
 	else
 		ret = cma_use_port(ps, id_priv);
 	mutex_unlock(&lock);
-
+out:
 	return ret;
 }




* Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-07 14:37 [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space Steve Wise
@ 2007-08-07 14:54 ` Evgeniy Polyakov
  2007-08-07 15:06   ` Steve Wise
  2007-08-09 18:49 ` Steve Wise
  1 sibling, 1 reply; 54+ messages in thread
From: Evgeniy Polyakov @ 2007-08-07 14:54 UTC (permalink / raw)
  To: Steve Wise
  Cc: Roland Dreier, David S. Miller, netdev, linux-kernel, Sean Hefty,
	OpenFabrics General

Hi Steve.

On Tue, Aug 07, 2007 at 09:37:41AM -0500, Steve Wise (swise@opengridcomputing.com) wrote:
> +static int cma_get_tcp_port(struct rdma_id_private *id_priv)
> +{
> +	int ret;
> +	struct socket *sock;
> +
> +	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
> +	if (ret)
> +		return ret;
> +	ret = sock->ops->bind(sock,
> +			  (struct sockaddr *)&id_priv->id.route.addr.src_addr,
> +			  ip_addr_size(&id_priv->id.route.addr.src_addr));

Setting aside the talk about broken offloading, this will result in a
case where usual network dataflow can enter private RDMA land, i.e.
after the bind succeeds this socket is accessible via any other network
device. Is that intended?
And this is quite noticeable overhead per RDMA connection, btw.

-- 
	Evgeniy Polyakov


* Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-07 14:54 ` Evgeniy Polyakov
@ 2007-08-07 15:06   ` Steve Wise
  2007-08-07 15:39     ` Evgeniy Polyakov
  0 siblings, 1 reply; 54+ messages in thread
From: Steve Wise @ 2007-08-07 15:06 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Roland Dreier, David S. Miller, netdev, linux-kernel, Sean Hefty,
	OpenFabrics General



Evgeniy Polyakov wrote:
> Hi Steve.
> 
> On Tue, Aug 07, 2007 at 09:37:41AM -0500, Steve Wise (swise@opengridcomputing.com) wrote:
>> +static int cma_get_tcp_port(struct rdma_id_private *id_priv)
>> +{
>> +	int ret;
>> +	struct socket *sock;
>> +
>> +	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
>> +	if (ret)
>> +		return ret;
>> +	ret = sock->ops->bind(sock,
>> +			  (struct sockaddr *)&id_priv->id.route.addr.src_addr,
>> +			  ip_addr_size(&id_priv->id.route.addr.src_addr));
> 
> Setting aside the talk about broken offloading, this will result in a
> case where usual network dataflow can enter private RDMA land, i.e.
> after the bind succeeds this socket is accessible via any other network
> device. Is that intended?
> And this is quite noticeable overhead per RDMA connection, btw.
> 

I'm not sure I understand your question?  What do you mean by 
"accessible"?  The intention is to _just_ reserve the addr/port.  
The socket struct alloc and bind was a simple way to do this.  I 
assume we'll have to come up with a better way though.  
Namely provide a low level interface to the port space allocator 
allowing both rdma and the host tcp stack to share the space without
requiring a socket struct for rdma connections. 

Or maybe we'll come up with a different and better solution to this issue...

Steve.


* Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-07 15:06   ` Steve Wise
@ 2007-08-07 15:39     ` Evgeniy Polyakov
  0 siblings, 0 replies; 54+ messages in thread
From: Evgeniy Polyakov @ 2007-08-07 15:39 UTC (permalink / raw)
  To: Steve Wise
  Cc: Roland Dreier, David S. Miller, netdev, linux-kernel, Sean Hefty,
	OpenFabrics General

On Tue, Aug 07, 2007 at 10:06:29AM -0500, Steve Wise (swise@opengridcomputing.com) wrote:
> >On Tue, Aug 07, 2007 at 09:37:41AM -0500, Steve Wise 
> >(swise@opengridcomputing.com) wrote:
> >>+static int cma_get_tcp_port(struct rdma_id_private *id_priv)
> >>+{
> >>+	int ret;
> >>+	struct socket *sock;
> >>+
> >>+	ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
> >>+	if (ret)
> >>+		return ret;
> >>+	ret = sock->ops->bind(sock,
> >>+			  (struct sockaddr *)&id_priv->id.route.addr.src_addr,
> >>+			  ip_addr_size(&id_priv->id.route.addr.src_addr));
> >
> >Setting aside the talk about broken offloading, this will result in a
> >case where usual network dataflow can enter private RDMA land, i.e.
> >after the bind succeeds this socket is accessible via any other network
> >device. Is that intended?
> >And this is quite noticeable overhead per RDMA connection, btw.
> >
> 
> I'm not sure I understand your question?  What do you mean by 
> "accessible"?  The intention is to _just_ reserve the addr/port.  

The RDMA ->bind() above ends up in tcp_v4_get_port(), which only adds the
socket to the bind hash (bhash); that hash is consulted only when new
sockets are created for listening connections or do an explicit bind.
Incoming network traffic checks only the listening and established hashes,
which are not affected by the above change, so it was a false alarm on my
side.  It does allow one to 'grab' a port and forbid its possible reuse.

-- 
	Evgeniy Polyakov


* Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-07 14:37 [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space Steve Wise
  2007-08-07 14:54 ` Evgeniy Polyakov
@ 2007-08-09 18:49 ` Steve Wise
  2007-08-09 21:40   ` [ofa-general] " Sean Hefty
  1 sibling, 1 reply; 54+ messages in thread
From: Steve Wise @ 2007-08-09 18:49 UTC (permalink / raw)
  To: Roland Dreier, David S. Miller
  Cc: netdev, linux-kernel, Sean Hefty, OpenFabrics General

Any more comments?


Steve Wise wrote:
> Networking experts,
> 
> I'd like input on the patch below, and help in solving this bug 
> properly.  iWARP devices that support both native stack TCP and iWARP 
> (aka RDMA over TCP/IP/Ethernet) connections on the same interface need 
> the fix below or some similar fix to the RDMA connection manager.
> 
> This is a BUG in the Linux RDMA-CMA code as it stands today.
> 
> Here is the issue:
> 
> Consider an MPI cluster running mvapich2, where the cluster runs 
> MPI/Sockets jobs concurrently with MPI/RDMA jobs.  It is possible, 
> without the patch below, for MPI/Sockets processes to mistakenly get 
> incoming RDMA connections and vice versa.  The way mvapich2 works is 
> that the ranks all bind and listen on a random port (retrying new random 
> ports if the bind fails with "in use").  Once they get a free port and
> bind/listen, they advertise that port number to their peers for 
> connection setup.  Currently, without the patch below, the MPI/RDMA 
> processes can end up binding/listening to the _same_ port number as the 
> MPI/Sockets processes running over the native TCP stack.  This is due to 
> duplicate port spaces for native stack TCP and the RDMA CM's RDMA_PS_TCP 
> port space.  If this happens, then the connections can get screwed up.
> 
> The correct solution in my mind is to use the host stack's TCP port 
> space for _all_ RDMA_PS_TCP port allocations.   The patch below is a 
> minimal delta to unify the port spaces by using the kernel stack to bind 
> ports.  This is done by allocating a kernel socket and binding to the 
> appropriate local addr/port.  It also allows the kernel stack to pick 
> ephemeral ports by virtue of just passing in port 0 on the kernel bind 
> operation.
> 
> There has been a discussion already on the RDMA list if anyone is 
> interested:
> 
> http://www.mail-archive.com/general@lists.openfabrics.org/msg05162.html
> 
> 
> Thanks,
> 
> Steve.
> 
> 
> ---
> 
> RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
> 
> This is needed for iwarp providers that support native and rdma
> connections over the same interface.
> 
> Signed-off-by: Steve Wise <swise@opengridcomputing.com>
> ---
> 
> drivers/infiniband/core/cma.c |   27 ++++++++++++++++++++++++++-
> 1 files changed, 26 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
> index 9e0ab04..e4d2d7f 100644
> --- a/drivers/infiniband/core/cma.c
> +++ b/drivers/infiniband/core/cma.c
> @@ -111,6 +111,7 @@ struct rdma_id_private {
>     struct rdma_cm_id    id;
> 
>     struct rdma_bind_list    *bind_list;
> +    struct socket        *sock;
>     struct hlist_node    node;
>     struct list_head    list;
>     struct list_head    listen_list;
> @@ -695,6 +696,8 @@ static void cma_release_port(struct rdma
>         kfree(bind_list);
>     }
>     mutex_unlock(&lock);
> +    if (id_priv->sock)
> +        sock_release(id_priv->sock);
> }
> 
> void rdma_destroy_id(struct rdma_cm_id *id)
> @@ -1790,6 +1793,25 @@ static int cma_use_port(struct idr *ps,
>     return 0;
> }
> 
> +static int cma_get_tcp_port(struct rdma_id_private *id_priv)
> +{
> +    int ret;
> +    struct socket *sock;
> +
> +    ret = sock_create_kern(AF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
> +    if (ret)
> +        return ret;
> +    ret = sock->ops->bind(sock,
> +              (struct sockaddr *)&id_priv->id.route.addr.src_addr,
> +              ip_addr_size(&id_priv->id.route.addr.src_addr));
> +    if (ret) {
> +        sock_release(sock);
> +        return ret;
> +    }
> +    id_priv->sock = sock;
> +    return 0;   
> +}
> +
> static int cma_get_port(struct rdma_id_private *id_priv)
> {
>     struct idr *ps;
> @@ -1801,6 +1823,9 @@ static int cma_get_port(struct rdma_id_p
>         break;
>     case RDMA_PS_TCP:
>         ps = &tcp_ps;
> +        ret = cma_get_tcp_port(id_priv); /* Synch with native stack */
> +        if (ret)
> +            goto out;
>         break;
>     case RDMA_PS_UDP:
>         ps = &udp_ps;
> @@ -1815,7 +1840,7 @@ static int cma_get_port(struct rdma_id_p
>     else
>         ret = cma_use_port(ps, id_priv);
>     mutex_unlock(&lock);
> -
> +out:
>     return ret;
> }
> 
> 


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-09 18:49 ` Steve Wise
@ 2007-08-09 21:40   ` Sean Hefty
  2007-08-09 21:55     ` David Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Sean Hefty @ 2007-08-09 21:40 UTC (permalink / raw)
  To: Steve Wise
  Cc: Roland Dreier, David S. Miller, netdev, linux-kernel,
	OpenFabrics General

Steve Wise wrote:
> Any more comments?

Does anyone have ideas on how to reserve the port space without using a 
struct socket?

- Sean


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-09 21:40   ` [ofa-general] " Sean Hefty
@ 2007-08-09 21:55     ` David Miller
  2007-08-09 23:22       ` Sean Hefty
                         ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: David Miller @ 2007-08-09 21:55 UTC (permalink / raw)
  To: mshefty; +Cc: swise, rdreier, netdev, linux-kernel, general

From: Sean Hefty <mshefty@ichips.intel.com>
Date: Thu, 09 Aug 2007 14:40:16 -0700

> Steve Wise wrote:
> > Any more comments?
> 
> Does anyone have ideas on how to reserve the port space without using a 
> struct socket?

How about we just remove the RDMA stack altogether?  I am not at all
kidding.  If you guys can't stay in your sand box and need to cause
problems for the normal network stack, it's unacceptable.  We were
told all along that if RDMA went into the tree none of this kind of
stuff would be an issue.

These are exactly the kinds of problems that people like myself
were dreading.  These subsystems have no business using the TCP port
space of the Linux software stack, absolutely none.

After TCP port reservation, what's next?  It seems an at least
bi-monthly event that the RDMA folks need to put their fingers
into something else in the normal networking stack.  No more.

I will NACK any patch that opens up sockets to eat up ports or
anything stupid like that.


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-09 21:55     ` David Miller
@ 2007-08-09 23:22       ` Sean Hefty
  2007-08-15 14:42       ` Steve Wise
  2007-10-08 21:54       ` Steve Wise
  2 siblings, 0 replies; 54+ messages in thread
From: Sean Hefty @ 2007-08-09 23:22 UTC (permalink / raw)
  To: David Miller; +Cc: swise, rdreier, netdev, linux-kernel, general

> How about we just remove the RDMA stack altogether?  I am not at all
> kidding.  If you guys can't stay in your sand box and need to cause
> problems for the normal network stack, it's unacceptable.  We were
> told all along that if RDMA went into the tree none of this kind of
> stuff would be an issue.

There are currently two RDMA solutions available.  Each solution has 
different requirements and uses the normal network stack differently. 
Infiniband uses its own transport.  iWarp runs over TCP.

We have tried to leverage the existing infrastructure where it makes sense.

> After TCP port reservation, what's next?  It seems an at least
> bi-monthly event that the RDMA folks need to put their fingers
> into something else in the normal networking stack.  No more.

Currently, the RDMA stack uses its own port space.  This causes a 
problem for iWarp, which is what Steve is trying to solve.  I'm 
not an iWarp guru, so I don't know what options exist.  Can iWarp use 
its own address family?  Identify specific IP addresses for iWarp use? 
Restrict iWarp to specific port numbers?  Let the app control the 
correct operation?  I don't know.

Steve merely defined a problem and suggested a possible solution.  He's 
looking for constructive help trying to solve the problem.

- Sean


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-09 21:55     ` David Miller
  2007-08-09 23:22       ` Sean Hefty
@ 2007-08-15 14:42       ` Steve Wise
  2007-08-16  2:26         ` Jeff Garzik
  2007-10-08 21:54       ` Steve Wise
  2 siblings, 1 reply; 54+ messages in thread
From: Steve Wise @ 2007-08-15 14:42 UTC (permalink / raw)
  To: David Miller; +Cc: mshefty, rdreier, netdev, linux-kernel, general



David Miller wrote:
> From: Sean Hefty <mshefty@ichips.intel.com>
> Date: Thu, 09 Aug 2007 14:40:16 -0700
> 
>> Steve Wise wrote:
>>> Any more comments?
>> Does anyone have ideas on how to reserve the port space without using a 
>> struct socket?
> 
> How about we just remove the RDMA stack altogether?  I am not at all
> kidding.  If you guys can't stay in your sand box and need to cause
> problems for the normal network stack, it's unacceptable.  We were
> told all along that if RDMA went into the tree none of this kind of
> stuff would be an issue.

I think removing the RDMA stack is the wrong thing to do, and you 
shouldn't just threaten to yank entire subsystems because you don't like 
the technology.  Let's keep this constructive, can we?  RDMA should get 
the respect of any other technology in Linux.  Maybe it's a niche in your 
opinion, but come on, there's more RDMA users than say, the sparc64 
port.  Eh?

> 
> These are exactly the kinds of problems that people like myself
> were dreading.  These subsystems have no business using the TCP port
> space of the Linux software stack, absolutely none.
> 

Ok, although IMO it's the correct solution.  But I'll propose other 
solutions below.  I ask for your feedback (and everyone's!) on these 
alternate solutions.

> After TCP port reservation, what's next?  It seems an at least
> bi-monthly event that the RDMA folks need to put their fingers
> into something else in the normal networking stack.  No more.
>

The only other change requested and committed, if I recall correctly, was 
for netevents, and that enabled both Infiniband and iWARP to integrate 
with the neighbour subsystem.  I think that was a useful and needed 
change.  Prior to that, these subsystems were snooping ARP replies to 
trigger events.  That was back in 2.6.18 or 2.6.19 I think...

> I will NACK any patch that opens up sockets to eat up ports or
> anything stupid like that.

Got it.

Here are alternate solutions that avoid the need to share the port space:

Solution 1)

1) admins must set up an alias interface on the iwarp device for use with 
rdma.  This interface will have to be on a separate subnet from the "TCP 
used" interface, and have a canonical name that indicates it's "for rdma 
only", like eth2:iw or eth2:rdma.  There can be many of these per device 
(see the example below).

2) admins make sure their sockets/tcp services don't use the interface 
configured in #1, and that their rdma services do use said interface.

3) iwarp providers must translate binds to ipaddr 0.0.0.0 to the 
associated "for rdma only" ip addresses.  They can do this by searching 
for all aliases of the canonical name that are aliases of the TCP 
interface for their nic device.  Or: somehow not handle incoming 
connections to any address but the "for rdma use" addresses and instead 
pass them up and not offload them.

This will avoid the collisions as long as the above steps are followed.
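
For illustration, the alias interface in step 1 might be configured 
something like this (addresses made up; the exact commands depend on the 
distro's network scripts):

   ifconfig eth2 192.168.1.10 netmask 255.255.255.0 up      # native stack TCP
   ifconfig eth2:iw 192.168.2.10 netmask 255.255.255.0 up   # "for rdma only" alias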


Solution 2)

Another possibility would be for the driver to create two net devices 
(and hence two interface names) like "eth2" and "iw2", and artificially 
separate the RDMA stuff that way.

These two solutions are similar in that they create a "rdma only" interface.

Pros:
- is not intrusive into the core networking code
- very minimal changes needed, and only in the iwarp providers' code, who 
are the ones with this problem
- makes it clear which subnets are RDMA only

Cons:
- relies on system admin to set it up correctly.
- native stack can still "use" this rdma-only interface and the same 
port space issue will exist.


For the record, here are possible port-sharing solutions Dave sez he'll NAK:

Solution NAK-1)

The rdma-cma just allocates a socket and binds it to reserve TCP ports.

Pros:
- minimal changes needed to implement (always a plus in my mind :)
- simple, clean, and it works (KISS)
- if no RDMA is in use, there is no impact on the native stack
- no need for a separate RDMA interface

Cons:
- wastes memory
- puts a TCP socket in the "CLOSED" state in the pcb tables.
- Dave will NAK it :)

Solution NAK-2)

Create a low-level sockets-agnostic port allocation service that is 
shared by both TCP and RDMA.  This way, the rdma-cm can reserve ports in 
an efficient manner instead of doing it via kernel_bind() using a sock 
struct.

Pros:
- probably the correct solution (my opinion :) if we went down the path 
of sharing port space
- if no RDMA is in use, there is no impact on the native stack
- no need for a separate RDMA interface

Cons:

- very intrusive change because the port allocation stuff is tightly 
bound to the host stack and sock struct, etc.
- Dave will NAK it :)
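
Just to make NAK-2 concrete, a socket-agnostic reservation interface might 
look roughly like the sketch below.  The names and signatures are 
hypothetical -- nothing like this exists in the kernel today -- and it is 
only meant to show the shape of the change:

/* Hypothetical interface, for discussion only -- not existing kernel code. */
struct tcp_port_reservation;

/*
 * Reserve addr/port in the host TCP port space without a struct socket.
 * A port of 0 picks an ephemeral port, mirroring kernel bind behaviour.
 */
int tcp_port_reserve(struct sockaddr *addr, int addrlen,
		     struct tcp_port_reservation **res);

/* Release the reservation, e.g. from cma_release_port(). */
void tcp_port_unreserve(struct tcp_port_reservation *res);

Both tcp_v4_get_port() on the sockets side and cma_get_tcp_port() on the 
rdma-cm side would then draw from the same underlying table.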


Steve.


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-15 14:42       ` Steve Wise
@ 2007-08-16  2:26         ` Jeff Garzik
  2007-08-16  3:11           ` Roland Dreier
                             ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Jeff Garzik @ 2007-08-16  2:26 UTC (permalink / raw)
  To: Steve Wise; +Cc: David Miller, mshefty, rdreier, netdev, linux-kernel, general

Steve Wise wrote:
> 
> 
> David Miller wrote:
>> From: Sean Hefty <mshefty@ichips.intel.com>
>> Date: Thu, 09 Aug 2007 14:40:16 -0700
>>
>>> Steve Wise wrote:
>>>> Any more comments?
>>> Does anyone have ideas on how to reserve the port space without using 
>>> a struct socket?
>>
>> How about we just remove the RDMA stack altogether?  I am not at all
>> kidding.  If you guys can't stay in your sand box and need to cause
>> problems for the normal network stack, it's unacceptable.  We were
>> told all along that if RDMA went into the tree none of this kind of
>> stuff would be an issue.
> 
> I think removing the RDMA stack is the wrong thing to do, and you 
> shouldn't just threaten to yank entire subsystems because you don't like 
> the technology.  Let's keep this constructive, can we?  RDMA should get 
> the respect of any other technology in Linux.  Maybe it's a niche in your 
> opinion, but come on, there's more RDMA users than say, the sparc64 
> port.  Eh?

It's not about being a niche.  It's about creating a maintainable 
software net stack that has predictable behavior.

Needing to reach out of the RDMA sandbox and reserve net stack resources 
away from itself travels a path we've consistently avoided.


>> I will NACK any patch that opens up sockets to eat up ports or
>> anything stupid like that.
> 
> Got it.

Ditto for me as well.

	Jeff




* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-16  2:26         ` Jeff Garzik
@ 2007-08-16  3:11           ` Roland Dreier
  2007-08-16  3:27           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty
  2007-08-16 13:43           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Tom Tucker
  2 siblings, 0 replies; 54+ messages in thread
From: Roland Dreier @ 2007-08-16  3:11 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Steve Wise, David Miller, mshefty, netdev, linux-kernel, general

 > Needing to reach out of the RDMA sandbox and reserve net stack
 > resources away from itself travels a path we've consistently avoided.

Where did the idea of an "RDMA sandbox" come from?  Obviously no one
disagrees with keeping things clean and maintainable, but the idea
that RDMA is a second-class citizen that doesn't get any input into
the evolution of the networking code seems kind of offensive to me.

 - R.


* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space.
  2007-08-16  2:26         ` Jeff Garzik
  2007-08-16  3:11           ` Roland Dreier
@ 2007-08-16  3:27           ` Sean Hefty
  2007-08-16 13:43           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Tom Tucker
  2 siblings, 0 replies; 54+ messages in thread
From: Sean Hefty @ 2007-08-16  3:27 UTC (permalink / raw)
  To: 'Jeff Garzik', Steve Wise
  Cc: netdev, rdreier, linux-kernel, general, David Miller

>It's not about being a niche.  It's about creating a maintainable
>software net stack that has predictable behavior.
>
>Needing to reach out of the RDMA sandbox and reserve net stack resources
>away from itself travels a path we've consistently avoided.

We need to ensure that we're also creating a maintainable kernel.  RDMA doesn't
use sockets, but that doesn't mean it's not part of the networking support
provided by the Linux kernel.  Making blanket statements that RDMA should stay
within a sandbox is equivalent to saying that RDMA should duplicate any network
related functionality that it might need.

>>> I will NACK any patch that opens up sockets to eat up ports or
>>> anything stupid like that.
>
>Ditto for me as well.

I agree that using a socket is the wrong approach, but my guess is that it was
suggested as a possibility because of the attempt to keep RDMA in its 'sandbox'.
The iWarp architecture implements RDMA over TCP; it just doesn't use sockets.
The Linux network stack doesn't easily support this possibility.  Are there any
reasonable ways to enable this to the degree necessary for iWarp?

- Sean


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-16  2:26         ` Jeff Garzik
  2007-08-16  3:11           ` Roland Dreier
  2007-08-16  3:27           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty
@ 2007-08-16 13:43           ` Tom Tucker
  2007-08-16 21:17             ` David Miller
  2 siblings, 1 reply; 54+ messages in thread
From: Tom Tucker @ 2007-08-16 13:43 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Steve Wise, David Miller, mshefty, rdreier, netdev, linux-kernel,
	general

On Wed, 2007-08-15 at 22:26 -0400, Jeff Garzik wrote:

[...snip...]

> > I think removing the RDMA stack is the wrong thing to do, and you 
> > shouldn't just threaten to yank entire subsystems because you don't like 
> > the technology.  Let's keep this constructive, can we?  RDMA should get 
> > the respect of any other technology in Linux.  Maybe it's a niche in your 
> > opinion, but come on, there's more RDMA users than say, the sparc64 
> > port.  Eh?
> 
> It's not about being a niche.  It's about creating a maintainable 
> software net stack that has predictable behavior.

Isn't RDMA _part_ of the "software net stack" within Linux? Why isn't
making RDMA stable, supportable and maintainable as important as
any other subsystem? 

> 
> Needing to reach out of the RDMA sandbox and reserve net stack resources 
> away from itself travels a path we've consistently avoided.
> 
> 
> >> I will NACK any patch that opens up sockets to eat up ports or
> >> anything stupid like that.
> > 
> > Got it.
> 
> Ditto for me as well.
> 
> 	Jeff
> 
> 



* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-16 13:43           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Tom Tucker
@ 2007-08-16 21:17             ` David Miller
  2007-08-17 19:52               ` Roland Dreier
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2007-08-16 21:17 UTC (permalink / raw)
  To: tom; +Cc: jeff, swise, mshefty, rdreier, netdev, linux-kernel, general

From: Tom Tucker <tom@opengridcomputing.com>
Date: Thu, 16 Aug 2007 08:43:11 -0500

> Isn't RDMA _part_ of the "software net stack" within Linux?

It very much is not so.

When using RDMA you lose the capability to do packet shaping,
classification, and all the other wonderful networking facilities
you've grown to love and use over the years.

I'm glad this is a surprise to you, because it illustrates the
point some of us keep trying to make about technologies like
this.

Imagine if you didn't know any of this: you purchase and begin to
deploy a huge piece of RDMA infrastructure, then you get the mandate
from IT that you need to add firewalling on the RDMA connections at
the host level, and "oh shit" you can't.

This is why none of us core networking developers like RDMA at all.
It's totally not integrated with the rest of the Linux stack and on
top of that it even gets in the way.  It's an aberration, an eyesore,
and a constant source of consternation.


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-16 21:17             ` David Miller
@ 2007-08-17 19:52               ` Roland Dreier
  2007-08-17 21:27                 ` David Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Roland Dreier @ 2007-08-17 19:52 UTC (permalink / raw)
  To: David Miller; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

 > > Isn't RDMA _part_ of the "software net stack" within Linux?

 > It very much is not so.

This is just nit-picking.  You can draw the boundary of the "software
net stack" wherever you want, but I think Sean's point was just that
RDMA drivers already are part of Linux, and we all want them to get
better.

 > When using RDMA you lose the capability to do packet shaping,
 > classification, and all the other wonderful networking facilities
 > you've grown to love and use over the years.

Same thing with TSO and LRO and who knows what else.  I know you're
going to make a distinction between "stateless" and "stateful"
offloads, but really it's just an arbitrary distinction between things
you like and things you don't.

 > Imagine if you didn't know any of this, you purchase and begin to
 > deploy a huge piece of RDMA infrastructure, you then get the mandate
 > from IT that you need to add firewalling on the RDMA connections at
 > the host level, and "oh shit" you can't?

It's ironic that you bring up firewalling.  I've had vendors of iWARP
hardware tell me they would *love* to work with the community to make
firewalling work better for RDMA connections.  But instead we get the
catch-22 of your changing arguments -- first, you won't even consider
changes that might help RDMA work better in the name of
maintainability; then you have to protect poor, ignorant users from
accidentally using RDMA because of some problem or another; and then
when someone tries to fix some of the problems you mention, it's back
to step one.

Obviously some decisions have been prejudged here, so I guess this
moves to the realm of politics.  I have plenty of interesting
technical stuff to work on, so I'll leave it to the people with a horse
in the race to find ways to twist your arm.

 - R.


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-17 19:52               ` Roland Dreier
@ 2007-08-17 21:27                 ` David Miller
  2007-08-17 23:31                   ` Roland Dreier
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2007-08-17 21:27 UTC (permalink / raw)
  To: rdreier; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

From: Roland Dreier <rdreier@cisco.com>
Date: Fri, 17 Aug 2007 12:52:39 -0700

>  > When using RDMA you lose the capability to do packet shaping,
>  > classification, and all the other wonderful networking facilities
>  > you've grown to love and use over the years.
> 
> Same thing with TSO and LRO and who knows what else.

Not true at all.  Full classification and filtering still is usable
with TSO and LRO.


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-17 21:27                 ` David Miller
@ 2007-08-17 23:31                   ` Roland Dreier
  2007-08-18  0:00                     ` David Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Roland Dreier @ 2007-08-17 23:31 UTC (permalink / raw)
  To: David Miller; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

 > >  > When using RDMA you lose the capability to do packet shaping,
 > >  > classification, and all the other wonderful networking facilities
 > >  > you've grown to love and use over the years.
 > > 
 > > Same thing with TSO and LRO and who knows what else.
 > 
 > Not true at all.  Full classification and filtering still is usable
 > with TSO and LRO.

Well, obviously with TSO and LRO the packets that the stack sends or
receives are not the same as what's on the wire.  Whether that breaks
your wonderful networking facilities or not depends on the specifics
of the particular facility I guess -- for example shaping is clearly
broken by TSO.  (And people can wonder what the packet trains TSO
creates do to congestion control on the internet, but the netdev crowd
has already decided that TSO is "good" and RDMA is "bad")

 - R.


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-17 23:31                   ` Roland Dreier
@ 2007-08-18  0:00                     ` David Miller
  2007-08-18  5:23                       ` Roland Dreier
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2007-08-18  0:00 UTC (permalink / raw)
  To: rdreier; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

From: Roland Dreier <rdreier@cisco.com>
Date: Fri, 17 Aug 2007 16:31:07 -0700

>  > >  > When using RDMA you lose the capability to do packet shaping,
>  > >  > classification, and all the other wonderful networking facilities
>  > >  > you've grown to love and use over the years.
>  > > 
>  > > Same thing with TSO and LRO and who knows what else.
>  > 
>  > Not true at all.  Full classification and filtering still is usable
>  > with TSO and LRO.
> 
> Well, obviously with TSO and LRO the packets that the stack sends or
> receives are not the same as what's on the wire.  Whether that breaks
> your wonderful networking facilities or not depends on the specifics
> of the particular facility I guess -- for example shaping is clearly
> broken by TSO.  (And people can wonder what the packet trains TSO
> creates do to congestion control on the internet, but the netdev crowd
> has already decided that TSO is "good" and RDMA is "bad")

This is also a series of falsehoods.  All packet filtering,
queue management, and packet scheduling facilities work perfectly
fine and as designed with both LRO and TSO.

When problems come up, they are bugs, and we fix them.

Please stop spreading this FUD about TSO and LRO.

The fact is that RDMA bypasses the whole stack so that supporting
these facilities is not even _POSSIBLE_.  With stateless offloads it
is possible to support all of these facilities, and we do.


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-18  0:00                     ` David Miller
@ 2007-08-18  5:23                       ` Roland Dreier
  2007-08-18  6:44                         ` David Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Roland Dreier @ 2007-08-18  5:23 UTC (permalink / raw)
  To: David Miller; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

 > This is also a series of falsehoods.  All packet filtering,
 > queue management, and packet scheduling facilities work perfectly
 > fine and as designed with both LRO and TSO.

I'm not sure I follow.  Perhaps "broken" was too strong a word to use,
but if you pass a huge segment to a NIC with TSO, then you've given
the NIC control of scheduling the packets that end up getting put on
the wire.  If your software packet scheduling is operating at a bigger
scale, then things work fine, but I don't see how you can say that TSO
doesn't lead to head-of-line blocking etc at short time scales.  And
yes of course I agree you can make sure things work by using short
segments or not using TSO at all.

Similarly with LRO the packets that get passed to the stack are not
the packets that were actually on the wire.  Sure, most filtering will
work fine but eg are you sure your RTT estimates aren't going to get
screwed up and cause some subtle bug?  And I could trot out all the
same bugaboos that are brought up about RDMA and warn darkly about
security problems with bugs in NIC hardware that after all has to
parse and rewrite TCP and IP packets.

Also, looking at the complexity and bug-fixing effort that go into
making TSO work vs the really pretty small gain it gives also makes
part of me wonder whether the noble proclamations about
maintainability are always taken to heart.

Of course I know everything I just wrote is wrong because I forgot to
refer to the crucial axiom that stateless == good && RDMA == bad.
And sometimes it's unfortunate that in Linux when there's disagreement
about something, the default action is *not* to do something.

Sorry for prolonging this argument.  Dave, I should say that I
appreciate all the work you've done in helping build the most kick-ass
networking stack in history.  And as I said before, I have plenty of
interesting work to do however this turns out, so I'll try to leave
any further arguing to people who actually have a dog in this fight.

 - R.


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-18  5:23                       ` Roland Dreier
@ 2007-08-18  6:44                         ` David Miller
  2007-08-19  7:01                           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty
  2007-08-21  1:16                           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Roland Dreier
  0 siblings, 2 replies; 54+ messages in thread
From: David Miller @ 2007-08-18  6:44 UTC (permalink / raw)
  To: rdreier; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

From: Roland Dreier <rdreier@cisco.com>
Date: Fri, 17 Aug 2007 22:23:01 -0700

> Also, looking at the complexity and bug-fixing effort that go into
> making TSO work vs the really pretty small gain it gives also makes
> part of me wonder whether the noble proclamations about
> maintainability are always taken to heart.

The cpu and bus utilization improvements of TSO on the sender side are
more than significant.  Ask anyone who looks closely at this.

For example, as part of his batching work Krishna Kumar has been
posting lots of numbers lately on the netdev list; I'm sure he can
post more specific numbers comparing the current stack in the case of
TSO disabled vs. TSO enabled if that is what you need to see how
beneficial TSO in fact is.

If TSO is such a lose why does pretty much every ethernet chip vendor
implement it in hardware?  If you say it's just because Microsoft
defines TSO in their NDI, that's a total cop-out.  It really does help
performance a lot.  Why did the Xen folks bother making generic
software TSO infrastructure for the kernel for the benefit of their
virtualization network device?  Why would someone as bright as Herbert
Xu even bother to implement that stuff if TSO gives a "pretty small
gain"?

Similarly for LRO and this isn't defined in NDI at all.  Vendors are
going so far as to put full flow tables in their chips in order to do
LRO better.

Using the bugs and issues we've run into while implementing TSO as
evidence there is something wrong with it is a total straw man.  Look
how many times the filesystem page cache has been rewritten over the
years.

Use the TSO problems as more of an example of how shitty a programmer
I must be. :)

Just be realistic and accept that RDMA is a point in time solution,
and like any other such technology takes flexibility away from users.

Horizontal scaling of cpus up to huge arity cores, network devices
using large numbers of transmit and receive queues and classification
based queue selection, are all going to work to make things like RDMA
even more irrelevant than they already are.

If you can't see that this is the future, you have my condolences.
Because frankly, the signs are all around that this is where things
are going.

The work doesn't belong in these special purpose devices, they belong
in the far-end-node compute resources, and our computers are getting
more and more of these general purpose compute engines every day.
We will be constantly moving away from specialized solutions and
towards those which solve large classes of problems for large groups
of people.


* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space.
  2007-08-18  6:44                         ` David Miller
@ 2007-08-19  7:01                           ` Sean Hefty
  2007-08-19  7:23                             ` David Miller
  2007-08-21  1:16                           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Roland Dreier
  1 sibling, 1 reply; 54+ messages in thread
From: Sean Hefty @ 2007-08-19  7:01 UTC (permalink / raw)
  To: 'David Miller', rdreier; +Cc: jeff, netdev, linux-kernel, general

>Just be realistic and accept that RDMA is a point in time solution,
>and like any other such technology takes flexibility away from users.

All technologies are just point in time solutions.  While management is
important, shouldn't the customers decide how important it is relative to their
problems?  Whether some future technology will be better matters little if a
problem needs to be solved today.

>If you can't see that this is the future, you have my condolences.
>Because frankly, the signs are all around that this is where things
>are going.

Adding a bazillion cores to a processor doesn't do a thing to help memory
bandwidth.

Millions of Infiniband ports are in operation today.  Over 25% of the top 500
supercomputers use Infiniband.  The formation of the OpenFabrics Alliance was
pushed and has been continuously funded by an RDMA customer - the US National
Labs.  RDMA technologies are backed by Cisco, IBM, Intel, QLogic, Sun, Voltaire,
Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu,
LSI, SGI, Sandia, and at least two dozen other companies.  IDC expects
Infiniband adapter revenue to triple between 2006 and 2011, and switch revenue
to increase six-fold (combined revenues of 1 billion).

Customers see real benefits using channel based architectures.  Do all customers
need it?  Of course not.  Is it a niche?  Yes, but I would say that about any
10+ gig network.  That doesn't mean that it hasn't become essential for some
customers.

- Sean


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom the host TCP port space.
  2007-08-19  7:01                           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty
@ 2007-08-19  7:23                             ` David Miller
  2007-08-19 17:33                               ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom " Felix Marti
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2007-08-19  7:23 UTC (permalink / raw)
  To: sean.hefty; +Cc: rdreier, jeff, netdev, linux-kernel, general

From: "Sean Hefty" <sean.hefty@intel.com>
Date: Sun, 19 Aug 2007 00:01:07 -0700

> Millions of Infiniband ports are in operation today.  Over 25% of the top 500
> supercomputers use Infiniband.  The formation of the OpenFabrics Alliance was
> pushed and has been continuously funded by an RDMA customer - the US National
> Labs.  RDMA technologies are backed by Cisco, IBM, Intel, QLogic, Sun, Voltaire,
> Mellanox, NetApp, AMD, Dell, HP, Oracle, Unisys, Emulex, Hitachi, NEC, Fujitsu,
> LSI, SGI, Sandia, and at least two dozen other companies.  IDC expects
> Infiniband adapter revenue to triple between 2006 and 2011, and switch revenue
> to increase six-fold (combined revenues of 1 billion).

Scale these numbers with reality and usage.

These vendors pour in huge amounts of money into a relatively small
number of extremely large cluster installations.  Besides the folks
doing nuke and whole-earth simulations at some government lab, nobody
cares.  And part of the investment is not being done wholly for smart
economic reasons, but also largely for publicity purposes.

So present your great Infiniband numbers with that being admitted up
front, ok?

Its relevance to Linux as a general purpose operating system that
should be "good enough" for 99% of the world is close to NIL.

People have been pouring tons of money and research into doing stupid
things to make clusters go fast, and in such a way that make zero
sense for general purpose operating systems, for ages.  RDMA is just
one such example.

BTW, I find it ironic that you mention memory bandwidth as a retort,
as Roland's favorite stateless offload devil, TSO, deals explicitly
with lowering the per-packet BUS bandwidth usage of TCP.  LRO
offloading does likewise.


* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-19  7:23                             ` David Miller
@ 2007-08-19 17:33                               ` Felix Marti
  2007-08-19 19:32                                 ` David Miller
  2007-08-20  0:18                                 ` Herbert Xu
  0 siblings, 2 replies; 54+ messages in thread
From: Felix Marti @ 2007-08-19 17:33 UTC (permalink / raw)
  To: David Miller, sean.hefty; +Cc: netdev, rdreier, general, linux-kernel, jeff



> -----Original Message-----
> From: general-bounces@lists.openfabrics.org [mailto:general-
> bounces@lists.openfabrics.org] On Behalf Of David Miller
> Sent: Sunday, August 19, 2007 12:24 AM
> To: sean.hefty@intel.com
> Cc: netdev@vger.kernel.org; rdreier@cisco.com;
> general@lists.openfabrics.org; linux-kernel@vger.kernel.org;
> jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Sean Hefty" <sean.hefty@intel.com>
> Date: Sun, 19 Aug 2007 00:01:07 -0700
> 
> > Millions of Infiniband ports are in operation today.  Over 25% of the
> > top 500 supercomputers use Infiniband.  The formation of the OpenFabrics
> > Alliance was pushed and has been continuously funded by an RDMA customer
> > - the US National Labs.  RDMA technologies are backed by Cisco, IBM,
> > Intel, QLogic, Sun, Voltaire, Mellanox, NetApp, AMD, Dell, HP, Oracle,
> > Unisys, Emulex, Hitachi, NEC, Fujitsu, LSI, SGI, Sandia, and at least
> > two dozen other companies.  IDC expects Infiniband adapter revenue to
> > triple between 2006 and 2011, and switch revenue to increase six-fold
> > (combined revenues of 1 billion).
> 
> Scale these numbers with reality and usage.
> 
> These vendors pour in huge amounts of money into a relatively small
> number of extremely large cluster installations.  Besides the folks
> doing nuke and whole-earth simulations at some government lab, nobody
> cares.  And part of the investment is not being done wholly for smart
> economic reasons, but also largely for publicity purposes.
> 
> So present your great Infiniband numbers with that being admitted up
> front, ok?
> 
> Its relevance to Linux as a general purpose operating system that
> should be "good enough" for 99% of the world is close to NIL.
> 
> People have been pouring tons of money and research into doing stupid
> things to make clusters go fast, and in such a way that make zero
> sense for general purpose operating systems, for ages.  RDMA is just
> one such example.
[Felix Marti] Ouch, and I believed linux to be a leading edge OS, 
scaling from small embedded systems to hundreds of CPUs and hence
I assumed that the same 'scalability' applies to the network subsystem.

> 
> BTW, I find it ironic that you mention memory bandwidth as a retort,
> as Roland's favorite stateless offload devil, TSO, deals explicitly
> with lowering the per-packet BUS bandwidth usage of TCP.  LRO
> offloading does likewise.

[Felix Marti] Aren't you confusing memory and bus BW here? - RDMA 
enables DMA from/to application buffers removing the user-to-kernel/
kernel-to-user memory copy, which is a significant overhead at the 
rates we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps 
out) requires 60Gbps of BW on most common platforms. So, receiving and
transmitting at 10Gbps with LRO and TSO requires 80Gbps of system 
memory BW (which is beyond what most systems can do) whereas RDMA can 
do with 20Gbps!
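
(To spell out the accounting assumed here: a buffer copy costs roughly 3x 
the copied rate in memory bandwidth -- read the source, plus the 
write-allocate read and the write of the destination -- so 20Gbps of 
copying is about 60Gbps; add the ~20Gbps of NIC DMA and the copy-based 
path needs about 80Gbps of memory bandwidth, while a zero-copy RDMA path 
needs only the ~20Gbps of DMA.)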

In addition, BUS improvements are really not significant (nor are buses 
the bottleneck anymore with wide availability of PCI-E >= x8); TSO avoids 
the DMA of a bunch of network headers... a typical example of stateless 
offload - improving performance by a few percent while offload technologies 
provide system improvements of hundreds of percent.

I know that you don't agree that TSO has drawbacks, as outlined by Roland, 
but its history shows something else: the addition of TSO took a fair 
amount of time and network performance was erratic for multiple kernel 
revisions, and the TSO code is sprinkled across the network stack.  It is 
an example of an intrusive 'improvement', whereas Steve (who started this 
thread) is asking for a relatively small change (decoupling the 4-tuple 
allocation from the socket).  As Steve has outlined, your refusal of the 
change requires RDMA users to work around the issue, which pushes the issue 
to the end-users and thus slows down the acceptance of the technology, 
leading to a chicken-and-egg problem: you only care if there are lots of 
users but you make it hard to use the technology in the first place, 
clever ;)
 


* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-19 17:33                               ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom " Felix Marti
@ 2007-08-19 19:32                                 ` David Miller
  2007-08-19 19:49                                   ` Felix Marti
  2007-08-20  0:18                                 ` Herbert Xu
  1 sibling, 1 reply; 54+ messages in thread
From: David Miller @ 2007-08-19 19:32 UTC (permalink / raw)
  To: felix; +Cc: sean.hefty, netdev, rdreier, general, linux-kernel, jeff

From: "Felix Marti" <felix@chelsio.com>
Date: Sun, 19 Aug 2007 10:33:31 -0700

> I know that you don't agree that TSO has drawbacks, as outlined by
> Roland, but its history shows something else: the addition of TSO
> took a fair amount of time and network performance was erratic for
> multiple kernel revisions and the TSO code is sprinkled across the
> network stack.

This thing you call "sprinkled" is a necessity of any hardware
offload when it is possible for a packet to later get "steered"
to a device which cannot perform the offload.

Therefore we need a software implementation of TSO so that those
packets can still get output to the non-TSO-capable device.

We do the same thing for checksum offloading.

And for free we can use the software offloading mechanism to
get batching to arbitrary network devices, even those which cannot
do TSO.

What benefits does RDMA infrastructure give to non-RDMA capable
devices?  None?  I see, that's great.

And again the TSO bugs and issues are being overstated and, also for
the second time, these issues are more indicative of my bad
programming skills than they are of intrinsic issues of TSO.  The
TSO implementation was looking for a good design, and it took me
a while to find it because I personally suck.

Face it, stateless offloads are always going to be better in the long
term.  And this is proven.

You RDMA folks really do live in some kind of fantasy land.


* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-19 19:32                                 ` David Miller
@ 2007-08-19 19:49                                   ` Felix Marti
  2007-08-19 23:04                                     ` David Miller
  2007-08-19 23:27                                     ` Andi Kleen
  0 siblings, 2 replies; 54+ messages in thread
From: Felix Marti @ 2007-08-19 19:49 UTC (permalink / raw)
  To: David Miller; +Cc: sean.hefty, netdev, rdreier, general, linux-kernel, jeff



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Sunday, August 19, 2007 12:32 PM
> To: Felix Marti
> Cc: sean.hefty@intel.com; netdev@vger.kernel.org; rdreier@cisco.com;
> general@lists.openfabrics.org; linux-kernel@vger.kernel.org;
> jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Felix Marti" <felix@chelsio.com>
> Date: Sun, 19 Aug 2007 10:33:31 -0700
> 
> > I know that you don't agree that TSO has drawbacks, as outlined by
> > Roland, but its history showing something else: the addition of TSO
> > took a fair amount of time and network performance was erratic for
> > multiple kernel revisions and the TSO code is sprinkled across the
> > network stack.
> 
> This thing you call "sprinkled" is a necessity of any hardware
> offload when it is possible for a packet to later get "steered"
> to a device which cannot perform the offload.
> 
> Therefore we need a software implementation of TSO so that those
> packets can still get output to the non-TSO-capable device.
> 
> We do the same thing for checksum offloading.
> 
> And for free we can use the software offloading mechanism to
> get batching to arbitrary network devices, even those which cannot
> do TSO.
> 
> What benefits does RDMA infrastructure give to non-RDMA capable
> devices?  None?  I see, that's great.
> 
> And again the TSO bugs and issues are being overstated and, also for
> the second time, these issues are more indicative of my bad
> programming skills than they are of intrinsic issues of TSO.  The
> TSO implementation was looking for a good design, and it took me
> a while to find it because I personally suck.
> 
> Face it, stateless offloads are always going to be better in the long
> term.  And this is proven.
> 
> You RDMA folks really do live in some kind of fantasy land.
[Felix Marti] You're not at all addressing the fact that RDMA does solve
the memory BW problem and stateless offload doesn't. Apart from that, I
don't quite understand your argument with respect to the benefits of the
RDMA infrastructure; what benefits does the TSO infrastructure give to
non-TSO-capable devices? Isn't the answer none, and yet you added TSO
support?! I don't think the argument is stateless _versus_ stateful
offload; both have their advantages and disadvantages. Stateless offload
does help, e.g. TSO/LRO do improve performance in back-to-back
benchmarks. It seems to me that _you_ claim there is no benefit to
stateful offload, and that is where we're disagreeing; there is benefit,
and the much lower memory BW requirement is just one example, albeit an
important one. We'll probably never agree, but it seems to me that we're
asking only for small changes to the software stack, and then we can
give the choice to the end users: they can opt for stateless offload if
it fits their performance needs or for stateful offload if their apps
require the extra boost in performance.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-19 19:49                                   ` Felix Marti
@ 2007-08-19 23:04                                     ` David Miller
  2007-08-20  0:32                                       ` Felix Marti
  2007-08-19 23:27                                     ` Andi Kleen
  1 sibling, 1 reply; 54+ messages in thread
From: David Miller @ 2007-08-19 23:04 UTC (permalink / raw)
  To: felix; +Cc: sean.hefty, netdev, rdreier, general, linux-kernel, jeff

From: "Felix Marti" <felix@chelsio.com>
Date: Sun, 19 Aug 2007 12:49:05 -0700

> You're not at all addressing the fact that RDMA does solve the
> memory BW problem and stateless offload doesn't.

It does, I just didn't retort to your claims because they were
so blatantly wrong.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-19 23:27                                     ` Andi Kleen
@ 2007-08-19 23:12                                       ` David Miller
  2007-08-20  1:45                                       ` Felix Marti
  1 sibling, 0 replies; 54+ messages in thread
From: David Miller @ 2007-08-19 23:12 UTC (permalink / raw)
  To: andi; +Cc: felix, jeff, netdev, rdreier, linux-kernel, general

From: Andi Kleen <andi@firstfloor.org>
Date: 20 Aug 2007 01:27:35 +0200

> "Felix Marti" <felix@chelsio.com> writes:
> 
> > what benefits does the TSO infrastructure give the
> > non-TSO capable devices?
> 
> It improves performance on software queueing devices between guests
> and hypervisors. This is a more and more important application these
> days.  Even when the system running the Hypervisor has a non TSO
> capable device in the end it'll still save CPU cycles this way. Right now
> virtualized IO tends to be much more CPU intensive than direct IO so any
> help it can get is beneficial.
> 
> It also makes loopback faster, although given that's probably not that
> useful.
> 
> And a lot of the "TSO infrastructure" was needed for zero copy TX anyways,
> which benefits most reasonable modern NICs (anything with hardware 
> checksumming)

And also, you can enable TSO generation for a non-TSO-hw device and
get all of the segmentation overhead reduction gains which works out
as a pure win as long as the device can at a minimum do checksumming.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-19 19:49                                   ` Felix Marti
  2007-08-19 23:04                                     ` David Miller
@ 2007-08-19 23:27                                     ` Andi Kleen
  2007-08-19 23:12                                       ` David Miller
  2007-08-20  1:45                                       ` Felix Marti
  1 sibling, 2 replies; 54+ messages in thread
From: Andi Kleen @ 2007-08-19 23:27 UTC (permalink / raw)
  To: Felix Marti; +Cc: David Miller, jeff, netdev, rdreier, linux-kernel, general

"Felix Marti" <felix@chelsio.com> writes:

> what benefits does the TSO infrastructure give the
> non-TSO capable devices?

It improves performance on software queueing devices between guests
and hypervisors. This is a more and more important application these
days.  Even when the system running the Hypervisor has a non TSO
capable device in the end it'll still save CPU cycles this way. Right now
virtualized IO tends to be much more CPU intensive than direct IO so any
help it can get is beneficial.

It also makes loopback faster, although given that's probably not that
useful.

And a lot of the "TSO infrastructure" was needed for zero copy TX anyways,
which benefits most reasonable modern NICs (anything with hardware 
checksumming)

-Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-19 17:33                               ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom " Felix Marti
  2007-08-19 19:32                                 ` David Miller
@ 2007-08-20  0:18                                 ` Herbert Xu
  1 sibling, 0 replies; 54+ messages in thread
From: Herbert Xu @ 2007-08-20  0:18 UTC (permalink / raw)
  To: Felix Marti
  Cc: davem, sean.hefty, netdev, rdreier, general, linux-kernel, jeff

Felix Marti <felix@chelsio.com> wrote:
>
> [Felix Marti] Aren't you confusing memory and bus BW here? - RDMA 
> enables DMA from/to application buffers removing the user-to-kernel/
> kernel-to-user memory copy which is a significant overhead at the 
> rates we're talking about: memory copy at 20Gbps (10Gbps in and 10Gbps 
> out) requires 60Gbps of BW on most common platforms. So, receiving and
> transmitting at 10Gbps with LRO and TSO requires 80Gbps of system 
> memory BW (which is beyond what most systems can do) whereas RDMA can 
> do with 20Gbps!

Actually this is false.  TSO only requires a copy if the user
chooses to use the sendmsg interface instead of sendpage.  The
same is true for RDMA really.  Except that instead of having to
switch your application to sendfile/splice, you're switching it
to RDMA.
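
In user space that transmit path is just sendfile(); a minimal sketch
(descriptors and error handling are illustrative):

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sendfile.h>

/* Send a whole file over a connected socket without copying the
 * payload through a user-space buffer. */
static int send_file_zerocopy(int sockfd, int filefd)
{
        struct stat st;
        off_t off = 0;

        if (fstat(filefd, &st) < 0)
                return -1;
        while (off < st.st_size) {
                ssize_t n = sendfile(sockfd, filefd, &off,
                                     st.st_size - off);
                if (n <= 0)
                        return -1;
        }
        return 0;
}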

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-19 23:04                                     ` David Miller
@ 2007-08-20  0:32                                       ` Felix Marti
  2007-08-20  0:40                                         ` David Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Felix Marti @ 2007-08-20  0:32 UTC (permalink / raw)
  To: David Miller; +Cc: sean.hefty, netdev, rdreier, general, linux-kernel, jeff



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Sunday, August 19, 2007 4:04 PM
> To: Felix Marti
> Cc: sean.hefty@intel.com; netdev@vger.kernel.org; rdreier@cisco.com;
> general@lists.openfabrics.org; linux-kernel@vger.kernel.org;
> jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Felix Marti" <felix@chelsio.com>
> Date: Sun, 19 Aug 2007 12:49:05 -0700
> 
> > You're not at all addressing the fact that RDMA does solve the
> > memory BW problem and stateless offload doesn't.
> 
> It does, I just didn't retort to your claims because they were
> so blatantly wrong.
[Felix Marti] Hmmm, interesting... I guess it is impossible to even have
a discussion on the subject.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20  0:32                                       ` Felix Marti
@ 2007-08-20  0:40                                         ` David Miller
  2007-08-20  0:47                                           ` Felix Marti
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2007-08-20  0:40 UTC (permalink / raw)
  To: felix; +Cc: sean.hefty, netdev, rdreier, general, linux-kernel, jeff

From: "Felix Marti" <felix@chelsio.com>
Date: Sun, 19 Aug 2007 17:32:39 -0700

[ Why do you put that "[Felix Marti]" everywhere you say something?
  It's annoying and superfluous. The quoting done by your mail client
  makes clear who is saying what. ]

> Hmmm, interesting... I guess it is impossible to even have
> a discussion on the subject.

Nice try, Herbert Xu gave a great explanation.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20  0:40                                         ` David Miller
@ 2007-08-20  0:47                                           ` Felix Marti
  2007-08-20  1:05                                             ` David Miller
  2007-08-20  9:43                                             ` Evgeniy Polyakov
  0 siblings, 2 replies; 54+ messages in thread
From: Felix Marti @ 2007-08-20  0:47 UTC (permalink / raw)
  To: David Miller; +Cc: sean.hefty, netdev, rdreier, general, linux-kernel, jeff



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Sunday, August 19, 2007 5:40 PM
> To: Felix Marti
> Cc: sean.hefty@intel.com; netdev@vger.kernel.org; rdreier@cisco.com;
> general@lists.openfabrics.org; linux-kernel@vger.kernel.org;
> jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Felix Marti" <felix@chelsio.com>
> Date: Sun, 19 Aug 2007 17:32:39 -0700
> 
> [ Why do you put that "[Felix Marti]" everywhere you say something?
>   It's annoying and superfluous. The quoting done by your mail client
>   makes clear who is saying what. ]
> 
> > Hmmm, interesting... I guess it is impossible to even have
> > a discussion on the subject.
> 
> Nice try, Herbert Xu gave a great explanation.
[Felix Marti] David and Herbert, so you agree that the user<>kernel
space memory copy overhead is a significant overhead and we want to
enable zero-copy in both the receive and transmit path? - Yes, copy
avoidance is mainly an API issue and unfortunately the so widely used
(synchronous) sockets API doesn't make copy avoidance easy, which is one
area where protocol offload can help. Yes, some apps can resort to
sendfile() but there are many apps which seem to have trouble switching
to that API... and what about the receive path?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20  0:47                                           ` Felix Marti
@ 2007-08-20  1:05                                             ` David Miller
  2007-08-20  1:41                                               ` Felix Marti
  2007-08-20  9:43                                             ` Evgeniy Polyakov
  1 sibling, 1 reply; 54+ messages in thread
From: David Miller @ 2007-08-20  1:05 UTC (permalink / raw)
  To: felix; +Cc: sean.hefty, netdev, rdreier, general, linux-kernel, jeff

From: "Felix Marti" <felix@chelsio.com>
Date: Sun, 19 Aug 2007 17:47:59 -0700

> [Felix Marti]

Please stop using this to start your replies, thank you.

> David and Herbert, so you agree that the user<>kernel
> space memory copy overhead is a significant overhead and we want to
> enable zero-copy in both the receive and transmit path? - Yes, copy
> avoidance is mainly an API issue and unfortunately the so widely used
> (synchronous) sockets API doesn't make copy avoidance easy, which is one
> area where protocol offload can help. Yes, some apps can resort to
> sendfile() but there are many apps which seem to have trouble switching
> to that API... and what about the receive path?

On the send side none of this is an issue.  You either are sending
static content, in which case using sendfile() is trivial, or you're
generating data dynamically, in which case the data copy is in the
noise or too small to do zerocopy on; and if not, you can use a shared
mmap to generate your data into and then sendfile out from that file
to avoid the copy that way.

splice() helps a lot too.

Splice has the capability to do away with the receive side too, and
there are a few receivefile() implementations that could get cleaned
up and merged in.
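
A rough sketch of what such a user-space receive path could look like
once socket splice support is in place -- socket data moves through a
pipe into a file without being copied through a user buffer
(descriptors are hypothetical, error handling trimmed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static long receive_to_file(int sockfd, int filefd, size_t len)
{
        int p[2];
        long total = 0;

        if (pipe(p) < 0)
                return -1;
        while (len > 0) {
                /* socket -> pipe, then pipe -> file, no user-space copy */
                ssize_t n = splice(sockfd, NULL, p[1], NULL, len,
                                   SPLICE_F_MOVE);
                if (n <= 0)
                        break;
                splice(p[0], NULL, filefd, NULL, n, SPLICE_F_MOVE);
                len -= n;
                total += n;
        }
        close(p[0]);
        close(p[1]);
        return total;
}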

Also, the I/O bus and main memory bandwidth are still the more limiting
factors in all of this; the I/O bus is the smallest data pipe for
communications out to and from the network.  So the protocol header
avoidance gains of TSO and LRO are still a very worthwhile savings.

But even if RDMA increases performance 100 fold, it still doesn't
avoid the issue that it doesn't fit in with the rest of the networking
stack and feature set.

Any monkey can change the rules around ("ok I can make it go fast as
long as you don't need firewalling, packet scheduling, classification,
and you only need to talk to specific systems that speak this same
special protocol") to make things go faster.  On the other hand well
designed solutions can give performance gains within the constraints
of the full system design and without sactificing functionality.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20  1:05                                             ` David Miller
@ 2007-08-20  1:41                                               ` Felix Marti
  2007-08-20 11:07                                                 ` Andi Kleen
  0 siblings, 1 reply; 54+ messages in thread
From: Felix Marti @ 2007-08-20  1:41 UTC (permalink / raw)
  To: David Miller; +Cc: sean.hefty, netdev, rdreier, general, linux-kernel, jeff



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Sunday, August 19, 2007 6:06 PM
> To: Felix Marti
> Cc: sean.hefty@intel.com; netdev@vger.kernel.org; rdreier@cisco.com;
> general@lists.openfabrics.org; linux-kernel@vger.kernel.org;
> jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> From: "Felix Marti" <felix@chelsio.com>
> Date: Sun, 19 Aug 2007 17:47:59 -0700
> 
> > [Felix Marti]
> 
> Please stop using this to start your replies, thank you.
Better?

> 
> > David and Herbert, so you agree that the user<>kernel
> > space memory copy overhead is a significant overhead and we want to
> > enable zero-copy in both the receive and transmit path? - Yes, copy
> > avoidance is mainly an API issue and unfortunately the so widely
used
> > (synchronous) sockets API doesn't make copy avoidance easy, which is
> one
> > area where protocol offload can help. Yes, some apps can resort to
> > sendfile() but there are many apps which seem to have trouble
> switching
> > to that API... and what about the receive path?
> 
> On the send side none of this is an issue.  You either are sending
> static content, in which case using sendfile() is trivial, or you're
> generating data dynamically in which case the data copy is in the
> noise or too small to do zerocopy on and if not you can use a shared
> mmap to generate your data into, and then sendfile out from that file,
> to avoid the copy that way.
> 
> splice() helps a lot too.
> 
> Splice has the capability to do away with the receive side too, and
> there are a few receivefile() implementations that could get cleaned
> up and merged in.
I don't believe it is as simple as that. Many apps synthesize their
payload in user space buffers (i.e. malloced memory) and expect to
receive their data in user space buffers _and_ expect the received data
to have a certain alignment and to be contiguous - something not
addressed by these 'new' APIs. Look, people writing HPC apps tend to
take advantage of whatever they can to squeeze some extra performance
out of their apps and they are resorting to protocol offload technology
for a reason, wouldn't you agree? 

> 
> Also, the I/O bus and main memory bandwidth are still the more limiting
> factors in all of this; the I/O bus is the smallest data pipe for
> communications out to and from the network.  So the protocol header
> avoidance gains of TSO and LRO are still a very worthwhile savings.
So, e.g. with TSO, you're saving about 16 headers (let us say 14 + 20 +
20 bytes each), i.e. 864B, when moving ~64KB of payload - that looks
very much in the noise to me. And again, PCI-E provides more bandwidth
than the wire...

> 
> But even if RDMA increases performance 100 fold, it still doesn't
> avoid the issue that it doesn't fit in with the rest of the networking
> stack and feature set.
> 
> Any monkey can change the rules around ("ok I can make it go fast as
> long as you don't need firewalling, packet scheduling, classification,
> and you only need to talk to specific systems that speak this same
> special protocol") to make things go faster.  On the other hand well
> designed solutions can give performance gains within the constraints
> of the full system design and without sacrificing functionality.
While I believe that you should give people an option to get 'high
performance' _instead_ of other features and let them choose whatever
they care about, I really do agree with what you're saying and believe
that offload devices _should_ be integrated with the facilities that you
mention (in fact, offload can do a much better job at lots of things
that you mention ;) ... but you're not letting offload devices integrate
and you're slowing down innovation in this field.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-19 23:27                                     ` Andi Kleen
  2007-08-19 23:12                                       ` David Miller
@ 2007-08-20  1:45                                       ` Felix Marti
  1 sibling, 0 replies; 54+ messages in thread
From: Felix Marti @ 2007-08-20  1:45 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, jeff, netdev, rdreier, linux-kernel, general



> -----Original Message-----
> From: ak@suse.de [mailto:ak@suse.de] On Behalf Of Andi Kleen
> Sent: Sunday, August 19, 2007 4:28 PM
> To: Felix Marti
> Cc: David Miller; jeff@garzik.org; netdev@vger.kernel.org;
> rdreier@cisco.com; linux-kernel@vger.kernel.org;
> general@lists.openfabrics.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> "Felix Marti" <felix@chelsio.com> writes:
> 
> > what benefits does the TSO infrastructure give the
> > non-TSO capable devices?
> 
> It improves performance on software queueing devices between guests
> and hypervisors. This is a more and more important application these
> days.  Even when the system running the Hypervisor has a non TSO
> capable device in the end it'll still save CPU cycles this way. Right
> now
> virtualized IO tends to much more CPU intensive than direct IO so any
> help it can get is beneficial.
> 
> It also makes loopback faster, although given that's probably not that
> useful.
> 
> And a lot of the "TSO infrastructure" was needed for zero copy TX
> anyways,
> which benefits most reasonable modern NICs (anything with hardware
> checksumming)
Hi Andi, yes, you're right. I should have chosen my example more
carefully.

> 
> -Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20  0:47                                           ` Felix Marti
  2007-08-20  1:05                                             ` David Miller
@ 2007-08-20  9:43                                             ` Evgeniy Polyakov
  2007-08-20 16:53                                               ` Felix Marti
  1 sibling, 1 reply; 54+ messages in thread
From: Evgeniy Polyakov @ 2007-08-20  9:43 UTC (permalink / raw)
  To: Felix Marti
  Cc: David Miller, sean.hefty, netdev, rdreier, general, linux-kernel, jeff

On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti (felix@chelsio.com) wrote:
> [Felix Marti] David and Herbert, so you agree that the user<>kernel
> space memory copy overhead is a significant overhead and we want to
> enable zero-copy in both the receive and transmit path? - Yes, copy

It depends. If you need to access that data after it is received, you
will get a cache miss and performance will not be much better (if any)
than with a copy.

> avoidance is mainly an API issue and unfortunately the so widely used
> (synchronous) sockets API doesn't make copy avoidance easy, which is one
> area where protocol offload can help. Yes, some apps can resort to
> sendfile() but there are many apps which seem to have trouble switching
> to that API... and what about the receive path?

There are a number of implementations, and all they are suitable for is
a recvfile(), since that is likely the only case which can work without
the cache.

And actually the RDMA stack exists and no one said it should be thrown
away _until_ it messes with the main stack. It has started to steal
ports. What will happen when it gets all of the port space and no new
legitimate network connection can be opened, although there is no way to
show the user who got it? What will happen if a hardware RDMA connection
gets terminated and the software cannot free the port? Will RDMA request
that connection reset functions be exported out of the stack to drop
network connections which are on ports that are supposed to be used by
new RDMA connections?

RDMA is not a problem, but how it influences the network stack is.
Let's think instead about how to work correctly with the network stack
(since we already have that cr^Wdifferent hardware) instead of saying
that others do bad work and do not allow the shiny new feature to exist.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20  1:41                                               ` Felix Marti
@ 2007-08-20 11:07                                                 ` Andi Kleen
  2007-08-20 16:26                                                   ` Felix Marti
  2007-08-20 19:16                                                   ` Rick Jones
  0 siblings, 2 replies; 54+ messages in thread
From: Andi Kleen @ 2007-08-20 11:07 UTC (permalink / raw)
  To: Felix Marti
  Cc: David Miller, sean.hefty, netdev, rdreier, general, linux-kernel, jeff

"Felix Marti" <felix@chelsio.com> writes:
> > avoidance gains of TSO and LRO are still a very worthwhile savings.
> > So, e.g. with TSO, you're saving about 16 headers (let us say 14 + 20 +
> 20), 864B, when moving ~64KB of payload - looks like very much in the
> noise to me.

TSO is beneficial for the software again. The linux code currently
takes several locks and does quite a few function calls for each 
packet and using larger packets lowers this overhead. At least with
10GbE saving CPU cycles is still quite important.

> an option to get 'high performance' 

Shouldn't you qualify that?

It is unlikely you really duplicated in your hardware all the tuning
for corner cases that has gone into good software TCP stacks over many
years.  So e.g. for wide-area networks with occasional packet loss,
the software might well perform better.

-Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20 11:07                                                 ` Andi Kleen
@ 2007-08-20 16:26                                                   ` Felix Marti
  2007-08-20 19:16                                                   ` Rick Jones
  1 sibling, 0 replies; 54+ messages in thread
From: Felix Marti @ 2007-08-20 16:26 UTC (permalink / raw)
  To: Andi Kleen
  Cc: David Miller, sean.hefty, netdev, rdreier, general, linux-kernel, jeff



> -----Original Message-----
> From: ak@suse.de [mailto:ak@suse.de] On Behalf Of Andi Kleen
> Sent: Monday, August 20, 2007 4:07 AM
> To: Felix Marti
> Cc: David Miller; sean.hefty@intel.com; netdev@vger.kernel.org;
> rdreier@cisco.com; general@lists.openfabrics.org; linux-
> kernel@vger.kernel.org; jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> "Felix Marti" <felix@chelsio.com> writes:
> > > avoidance gains of TSO and LRO are still a very worthwhile
savings.
> > So, i.e. with TSO, your saving about 16 headers (let us say 14 + 20
+
> > 20), 864B, when moving ~64KB of payload - looks like very much in
the
> > noise to me.
> 
> TSO is beneficial for the software again. The linux code currently
> takes several locks and does quite a few function calls for each
> packet and using larger packets lowers this overhead. At least with
> 10GbE saving CPU cycles is still quite important.
> 
> > an option to get 'high performance'
> 
> Shouldn't you qualify that?
> 
> It is unlikely you really duplicated all the tuning for corner cases
> that went over many years into good software TCP stacks in your
> hardware.  So e.g. for wide area networks with occasional packet loss
> the software might well perform better.
Yes, it used to be sufficient to submit performance data to show that a
technology makes 'sense'. In fact, I believe it was Alan Cox who once
said that Linux will have a look at offload once an offload device holds
the land speed record (probably assuming that the day never comes ;).
For the last few years it has been Chelsio offload devices that have
been improving their own LSRs (as IO bus speeds have been increasing).
It is worthwhile to point out that OC-192 doesn't offer full 10Gbps BW,
and the fine-grained (per-packet and not per-TSO-burst) packet scheduler
in the offload device played a crucial part in pushing performance to
the limits of what OC-192 can do. Most other customers use our offload
products in low-latency cluster environments. - The problem with offload
devices is that they are not all born equal and there have been a lot of
poor implementations giving the technology a bad name. I can only speak
for Chelsio and do claim that we have a solid implementation that scales
from low-latency cluster environments to LFNs.

Andi, I could present performance numbers, i.e. throughput and CPU
utilization as a function of IO size, number of connections, ... in a
back-to-back environment and/or in a cluster environment... but what
will it get me? I'd still get hit by the 'not integrated' hammer :(

> 
> -Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20  9:43                                             ` Evgeniy Polyakov
@ 2007-08-20 16:53                                               ` Felix Marti
  2007-08-20 18:10                                                 ` Andi Kleen
  2007-08-20 20:33                                                 ` Patrick Geoffray
  0 siblings, 2 replies; 54+ messages in thread
From: Felix Marti @ 2007-08-20 16:53 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, sean.hefty, netdev, rdreier, general, linux-kernel, jeff



> -----Original Message-----
> From: Evgeniy Polyakov [mailto:johnpol@2ka.mipt.ru]
> Sent: Monday, August 20, 2007 2:43 AM
> To: Felix Marti
> Cc: David Miller; sean.hefty@intel.com; netdev@vger.kernel.org;
> rdreier@cisco.com; general@lists.openfabrics.org; linux-
> kernel@vger.kernel.org; jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> On Sun, Aug 19, 2007 at 05:47:59PM -0700, Felix Marti
> (felix@chelsio.com) wrote:
> > [Felix Marti] David and Herbert, so you agree that the user<>kernel
> > space memory copy overhead is a significant overhead and we want to
> > enable zero-copy in both the receive and transmit path? - Yes, copy
> 
> It depends. If you need to access that data after received, you will
> get
> cache miss and performance will not be much better (if any) that with
> copy.
Yes, the app will take the cache hits when accessing the data. However,
the fact remains that if there is a copy in the receive path, you
require an additional 3x memory BW: each received byte is DMA'd into
memory once and then read and written again by the copy, which is very
significant at these high rates and most likely the bottleneck for most
current systems... and somebody always has to take the cache miss, be it
copy_to_user() or the app.
> 
> > avoidance is mainly an API issue and unfortunately the so widely
used
> > (synchronous) sockets API doesn't make copy avoidance easy, which is
> one
> > area where protocol offload can help. Yes, some apps can resort to
> > sendfile() but there are many apps which seem to have trouble
> switching
> > to that API... and what about the receive path?
> 
> There is number of implementations, and all they are suitable for is
> to have recvfile(), since this is likely the only case, which can work
> without cache.
> 
> And actually RDMA stack exist and no one said it should be thrown away
> _until_ it messes with main stack. It started to speal ports. What
will
> happen when it gest all port space and no new legal network conection
> can be opened, although there is no way to show to user who got it?
> What will happen if hardware RDMA connection got terminated and
> software
> could not free the port? Will RDMA request to export connection reset
> functions out of stack to drop network connections which are on the
> ports
> which are supposed to be used by new RDMA connections?
Yes, RDMA support is there... but we could make it better and easier to
use. We have a problem today with port sharing and there was a proposal
to address the issue by tighter integration (see the beginning of the
thread), but the proposal got shot down immediately... because it is RDMA
and not for technical reasons. I believe this email thread shows in
detail how RDMA (a network technology) is treated as a bastard child by
the network folks, well at least by one of them.
> 
> RDMA is not a problem, but how it influence to the network stack is.
> Let's better think about how to work correctly with network stack
> (since
> we already have that cr^Wdifferent hardware) instead of saying that
> others do bad work and do not allow shiny new feature to exist.
By no means did I want to imply that others do bad work; are you
referring to me using TSO implementation issues as an example? - If so,
let me clarify: I understand that the TSO implementation took some time
to get right. What I was referring to is that TSO(/LRO) have their own
issues, some alluded to by Roland and me. In fact, customers working on
the LSR couldn't use TSO due to the burstiness it introduces and had to
fall back to our fine-grained packet scheduling done in the offload
device. I am for variety; let us support new technologies that solve
real problems (lots of folks are buying this stuff for a reason) instead
of the 'ah, it's brain-dead and has no future' attitude... there is
precedent for offloading the host CPUs: have a look at graphics.
Graphics used to be done by the host CPU and now we have dedicated
graphics adapters that do a much better job... so, why is it so
far-fetched that offload devices can do a better job at a data-flow
problem?
> 
> --
> 	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20 16:53                                               ` Felix Marti
@ 2007-08-20 18:10                                                 ` Andi Kleen
  2007-08-20 19:02                                                   ` Felix Marti
  2007-08-20 20:33                                                 ` Patrick Geoffray
  1 sibling, 1 reply; 54+ messages in thread
From: Andi Kleen @ 2007-08-20 18:10 UTC (permalink / raw)
  To: Felix Marti
  Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general,
	David Miller

"Felix Marti" <felix@chelsio.com> writes:

> What I was referring to is that TSO(/LRO) have their own
> issues, some alluded to by Roland and me. In fact, customers working on
> the LSR couldn't use TSO due to the burstiness it introduces

That was in old kernels where TSO didn't honor the initial cwnd correctly, 
right? I assume it's long fixed.

If not please clarify what the problem was.

> have a look at graphics.
> Graphics used to be done by the host CPU and now we have dedicated
> graphics adapters that do a much better job...

Is your off load device as programable as a modern GPU?

> farfetched that offload devices can do a better job at a data-flow
> problem?

One big difference is that there is no potentially adverse and
always varying internet between the graphics card and your monitor.

-Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20 18:10                                                 ` Andi Kleen
@ 2007-08-20 19:02                                                   ` Felix Marti
  2007-08-20 20:18                                                     ` Thomas Graf
  0 siblings, 1 reply; 54+ messages in thread
From: Felix Marti @ 2007-08-20 19:02 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Evgeniy Polyakov, jeff, netdev, rdreier, linux-kernel, general,
	David Miller



> -----Original Message-----
> From: ak@suse.de [mailto:ak@suse.de] On Behalf Of Andi Kleen
> Sent: Monday, August 20, 2007 11:11 AM
> To: Felix Marti
> Cc: Evgeniy Polyakov; jeff@garzik.org; netdev@vger.kernel.org;
> rdreier@cisco.com; linux-kernel@vger.kernel.org;
> general@lists.openfabrics.org; David Miller
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> "Felix Marti" <felix@chelsio.com> writes:
> 
> > What I was referring to is that TSO(/LRO) have their own
> > issues, some alluded to by Roland and me. In fact, customers working
> on
> > the LSR couldn't use TSO due to the burstiness it introduces
> 
> That was in old kernels where TSO didn't honor the initial cwnd
> correctly,
> right? I assume it's long fixed.
> 
> If not please clarify what the problem was.
The problem is that Ethernet is about the only technology that
discloses 'usable' throughput while everybody else talks about
signaling rates ;) - OC-192 can carry about 9.128Gbps (or close to that
number) and hence 10Gbps Ethernet was overwhelming the OC-192 network.
The customer needed to schedule packets at about 98% of OC-192
throughput in order to avoid packet drop, and the scheduling needed to
be done on a per-packet basis, not on a per-'burst of packets' basis.
 
> 
> > have a look at graphics.
> > Graphics used to be done by the host CPU and now we have dedicated
> > graphics adapters that do a much better job...
> 
> Is your off load device as programable as a modern GPU?
It has a lot of knobs to turn.

> 
> > farfetched that offload devices can do a better job at a data-flow
> > problem?
> 
> One big difference is that there is no potentially adverse and
> always varying internet between the graphics card and your monitor.
These graphics adapters provide a wealth of features that you can take
advantage of to bring these amazing graphics to life; general-purpose
CPUs cannot keep up. Chelsio offload devices do the same thing in the
realm of networking. - Will there be things you can't do? Probably yes,
but as I said, there are lots of knobs to turn (and the latest and
greatest feature that gets hyped up might not always be the best thing
since sliced bread anyway; what happened to BIC love? ;)

> 
> -Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20 11:07                                                 ` Andi Kleen
  2007-08-20 16:26                                                   ` Felix Marti
@ 2007-08-20 19:16                                                   ` Rick Jones
  1 sibling, 0 replies; 54+ messages in thread
From: Rick Jones @ 2007-08-20 19:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Felix Marti, David Miller, sean.hefty, netdev, rdreier, general,
	linux-kernel, jeff

Andi Kleen wrote:
> TSO is beneficial for the software again. The linux code currently
> takes several locks and does quite a few function calls for each 
> packet and using larger packets lowers this overhead. At least with
> 10GbE saving CPU cycles is still quite important.

Some quick netperf TCP_RR tests between a pair of dual-core rx6600's running 
2.6.23-rc3.  The NICs are dual-core e1000's connected back-to-back with the 
interrupt throttle disabled.  I like using TCP_RR to tickle path-length 
questions because it rarely runs into bandwidth limitations regardless of the 
link-type.

First, with TSO enabled on both sides, then with it disabled, netperf/netserver 
bound to the same CPU that takes interrupts, which is the "best" place to be for a 
TCP_RR test (although not always for a TCP_STREAM test...):

:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 
(192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf.  : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.3%
!!!                       Local CPU util  : 39.3%
!!!                       Remote CPU util : 40.6%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   18611.32  20.96  22.35  22.522  24.017
16384  87380
:~# ethtool -K eth2 tso off
e1000: eth2: e1000_set_tso: TSO is Disabled
:~# netperf -T 1 -t TCP_RR -H 192.168.2.105 -I 99,1 -c -C
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.2.105 
(192.168.2.105) port 0 AF_INET : +/-0.5% @ 99% conf.  : first burst 0 : cpu bind
!!! WARNING
!!! Desired confidence was not achieved within the specified iterations.
!!! This implies that there was variability in the test environment that
!!! must be investigated before going further.
!!! Confidence intervals: Throughput      :  0.4%
!!!                       Local CPU util  : 21.0%
!!!                       Remote CPU util : 25.2%

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

16384  87380  1       1      10.01   19812.51  17.81  17.19  17.983  17.358
16384  87380

While the confidence intervals for CPU util weren't hit, I suspect the 
differences in service demand were still real.  On throughput we are talking 
about +/- 0.2%; for CPU util we are talking about +/- 20% (percent, not 
percentage points) in the first test and 12.5% in the second.

So, in broad handwaving terms, TSO increased the per-transaction service demand 
(averaging the local and remote us/Tr figures: 23.27 with TSO vs 17.67 without) 
by something along the lines of (23.27 - 17.67)/17.67 or ~30%, and the 
transaction rate decreased by ~6%.

rick jones
bitrate blindless is a constant concern

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20 19:02                                                   ` Felix Marti
@ 2007-08-20 20:18                                                     ` Thomas Graf
  2007-08-20 20:33                                                       ` Andi Kleen
  0 siblings, 1 reply; 54+ messages in thread
From: Thomas Graf @ 2007-08-20 20:18 UTC (permalink / raw)
  To: Felix Marti
  Cc: Andi Kleen, Evgeniy Polyakov, jeff, netdev, rdreier,
	linux-kernel, general, David Miller

* Felix Marti <felix@chelsio.com> 2007-08-20 12:02
> These graphic adapters provide a wealth of features that you can take
> advantage of to bring these amazing graphics to life. General purpose
> CPUs cannot keep up. Chelsio offload devices do the same thing in the
> realm of networking. - Will there be things you can't do, probably yes,
> but as I said, there are lots of knobs to turn (and the latest and
> greatest feature that gets hyped up might not always be the best thing
> since sliced bread anyway; what happened to BIC love? ;)

GPUs have almost no influence on system security; the network stack OTOH
is probably the most vulnerable part of an operating system. It seems
unlikely that all vendors would properly implement all the features
collected over the last years. Having such an essential and critical
part depend on the vendor of my network card, without being able to even
verify it properly, is truly frightening.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20 20:18                                                     ` Thomas Graf
@ 2007-08-20 20:33                                                       ` Andi Kleen
  0 siblings, 0 replies; 54+ messages in thread
From: Andi Kleen @ 2007-08-20 20:33 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Felix Marti, Andi Kleen, Evgeniy Polyakov, jeff, netdev, rdreier,
	linux-kernel, general, David Miller

> GPUs have almost no influence on system security, 

Unless you use direct rendering from user space.

-Andi

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20 16:53                                               ` Felix Marti
  2007-08-20 18:10                                                 ` Andi Kleen
@ 2007-08-20 20:33                                                 ` Patrick Geoffray
  2007-08-21  4:21                                                   ` Felix Marti
  1 sibling, 1 reply; 54+ messages in thread
From: Patrick Geoffray @ 2007-08-20 20:33 UTC (permalink / raw)
  To: Felix Marti
  Cc: Evgeniy Polyakov, David Miller, sean.hefty, netdev, rdreier,
	general, linux-kernel, jeff

Felix Marti wrote:
> Yes, the app will take the cache hits when accessing the data. However,
> the fact remains that if there is a copy in the receive path, you
> require an additional 3x memory BW (which is very significant at these
> high rates and most likely the bottleneck for most current systems)...
> and somebody always has to take the cache miss be it the copy_to_user or
> the app.

The cache miss is going to cost you half the memory bandwidth of a full 
copy. If the data is already in cache, then the copy is cheaper.

However, removing the copy removes the kernel from the picture on the 
receive side, so you lose demultiplexing, asynchronism, security, 
accounting, flow-control, swapping, etc. If it's ok with you to not use 
the kernel stack, then why expect to fit in the existing infrastructure 
anyway?

> Yes, RDMA support is there... but we could make it better and easier to

What do you need from the kernel for RDMA support beyond HW drivers? A 
fast way to pin and translate user memory (i.e. registration). That is 
pretty much the sandbox that David referred to.
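
For reference, that registration step through the user-space verbs API
looks roughly like this (illustrative; 'pd' is an already-allocated
protection domain and error handling is omitted):

#include <infiniband/verbs.h>
#include <stdlib.h>

/* Pin a buffer and get back the lkey/rkey used in later work requests. */
static struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
        void *buf = malloc(len);

        if (!buf)
                return NULL;
        return ibv_reg_mr(pd, buf, len,
                          IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
}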

Eventually, it would be useful to be able to track the VM space to 
implement a registration cache instead of using ugly hacks in user-space 
to hijack malloc, but this is completely independent from the net stack.

> use. We have a problem today with port sharing and there was a proposal

The port spaces are either totally separate and there is no issue, or 
completely identical and you should then run your connection manager in 
user-space or fix your middlewares.

> and not for technical reasons. I believe this email threads shows in
> detail how RDMA (a network technology) is treated as bastard child by
> the network folks, well at least by one of them.

I don't think it's fair. This thread actually shows how pushy some RDMA 
folks are about not acknowledging that the current infrastructure is 
here for a reason, and about mistaking zero-copy for RDMA.

This is a similar argument to the TOE discussion, and it was 
definitely a good decision not to mess up the Linux stack with TOEs.

Patrick

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-18  6:44                         ` David Miller
  2007-08-19  7:01                           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty
@ 2007-08-21  1:16                           ` Roland Dreier
  2007-08-21  6:58                             ` David Miller
  1 sibling, 1 reply; 54+ messages in thread
From: Roland Dreier @ 2007-08-21  1:16 UTC (permalink / raw)
  To: David Miller; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

[TSO / LRO discussion snipped -- it's not the main point so no sense
spending energy arguing about it]

 > Just be realistic and accept that RDMA is a point in time solution,
 > and like any other such technology takes flexibility away from users.
 > 
 > Horizontal scaling of cpus up to huge arity cores, network devices
 > using large numbers of transmit and receive queues and classification
 > based queue selection, are all going to work to make things like RDMA
 > even more irrelevant than they already are.

To me there is a real fundamental difference between RDMA and
traditional SOCK_STREAM / SOCK_DATAGRAM networking, namely that
messages can carry the address where they're supposed to be
delivered (what the IETF calls "direct data placement").  And on top
of that you can build one-sided operations aka put/get aka RDMA.
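
A minimal sketch of what that looks like through the verbs API -- the
work request itself carries the remote address and key where the payload
must land, which is what lets the receiving NIC place the data directly
(qp, keys and addresses are assumed to be set up already; this is
illustrative, not complete):

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

static int post_rdma_write(struct ibv_qp *qp, void *buf, uint32_t len,
                           uint32_t lkey, uint64_t remote_addr, uint32_t rkey)
{
        struct ibv_sge sge = {
                .addr   = (uintptr_t)buf,
                .length = len,
                .lkey   = lkey,
        };
        struct ibv_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE;  /* one-sided "put" */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;        /* direct data placement */
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);
}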

And direct data placement really does give you a factor of two at
least, because otherwise you're stuck receiving the data in one
buffer, looking at some of the data at least, and then figuring out
where to copy it.  And memory bandwidth is if anything becoming more
valuable; maybe LRO + header splitting + page remapping tricks can get
you somewhere, but as NCPUS grows it seems the TLB shootdown cost
of page flipping is only going to get worse.

Don't get too hung up on the fact that current iWARP (RDMA over IP)
implementations are using TCP offload -- to me that is just a side
effect of doing enough processing on the NIC side of the PCI bus to be
able to do direct data placement.  InfiniBand with competely different
transport, link and physical layers is one way to implement RDMA
without TCP offload and I'm sure there will be others -- eg Intel's
IOAT stuff could probably evolve to the point where you could
implement iWARP with software TCP and the data placement offloaded to
some DMA engine.

 - R.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* RE: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom the host TCP port space.
  2007-08-20 20:33                                                 ` Patrick Geoffray
@ 2007-08-21  4:21                                                   ` Felix Marti
  0 siblings, 0 replies; 54+ messages in thread
From: Felix Marti @ 2007-08-21  4:21 UTC (permalink / raw)
  To: Patrick Geoffray
  Cc: Evgeniy Polyakov, David Miller, sean.hefty, netdev, rdreier,
	general, linux-kernel, jeff



> -----Original Message-----
> From: Patrick Geoffray [mailto:patrick@myri.com]
> Sent: Monday, August 20, 2007 1:34 PM
> To: Felix Marti
> Cc: Evgeniy Polyakov; David Miller; sean.hefty@intel.com;
> netdev@vger.kernel.org; rdreier@cisco.com;
> general@lists.openfabrics.org; linux-kernel@vger.kernel.org;
> jeff@garzik.org
> Subject: Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate
> PS_TCPportsfrom the host TCP port space.
> 
> Felix Marti wrote:
> > Yes, the app will take the cache hits when accessing the data.
> However,
> > the fact remains that if there is a copy in the receive path, you
> > require and additional 3x memory BW (which is very significant at
> these
> > high rates and most likely the bottleneck for most current
> systems)...
> > and somebody always has to take the cache miss be it the
copy_to_user
> or
> > the app.
> 
> The cache miss is going to cost you half the memory bandwidth of a
full
> copy. If the data is already in cache, then the copy is cheaper.
> 
> However, removing the copy removes the kernel from the picture on the
> receive side, so you lose demultiplexing, asynchronism, security,
> accounting, flow-control, swapping, etc. If it's ok with you to not
use
> the kernel stack, then why expect to fit in the existing
infrastructure
> anyway ?
Many of the things you're referring to are moved to the offload adapter,
but from an ease-of-use point of view, it would be great if the user
could still collect stats the same way, e.g. netstat reporting the
4-tuple in use and other network stats. In addition, security features
and packet scheduling could be integrated so that the user configures
them the same way as for the network stack.

> 
> > Yes, RDMA support is there... but we could make it better and easier
> to
> 
> What do you need from the kernel for RDMA support beyond HW drivers ?
A
> fast way to pin and translate user memory (ie registration). That is
> pretty much the sandbox that David referred to.
> 
> Eventually, it would be useful to be able to track the VM space to
> implement a registration cache instead of using ugly hacks in user-
> space
> to hijack malloc, but this is completely independent from the net
> stack.
> 
> > use. We have a problem today with port sharing and there was a
> proposal
> 
> The port spaces are either totally separate and there is no issue, or
> completely identical and you should then run your connection manager
in
> user-space or fix your middlewares.
When running on an iWARP device (and hence on top of TCP) I believe that
the port space should be shared and that, e.g., netstat should report
the 4-tuple in use.

> 
> > and not for technical reasons. I believe this email threads shows in
> > detail how RDMA (a network technology) is treated as bastard child
by
> > the network folks, well at least by one of them.
> 
> I don't think it's fair. This thread actually show how pushy some RDMA
> folks are about not acknowledging that the current infrastructure is
> here for a reason, and about mistaking zero-copy and RDMA.
Zero-copy and RDMA are not the same but in the context of this
discussion I referred to RDMA as a superset (zero-copy is implied).

> 
> This is a similar argument than the TOE discussion, and it was
> definitively a good decision to not mess up the Linux stack with TOEs.
> 
> Patrick

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-21  1:16                           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Roland Dreier
@ 2007-08-21  6:58                             ` David Miller
  2007-08-28 19:38                               ` Roland Dreier
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2007-08-21  6:58 UTC (permalink / raw)
  To: rdreier; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 20 Aug 2007 18:16:54 -0700

> And direct data placement really does give you a factor of two at
> least, because otherwise you're stuck receiving the data in one
> buffer, looking at some of the data at least, and then figuring out
> where to copy it.  And memory bandwidth is if anything becoming more
> valuable; maybe LRO + header splitting + page remapping tricks can get
> you somewhere but as NCPUS grows then it seems the TLB shootdown cost
> of page flipping is only going to get worse.

As Herbert has said already, people can code for this just like
they have to code for RDMA.

There is no fundamental difference from converting an application
to sendfile or similar.

The only thing this needs is a
"recvmsg_I_dont_care_where_the_data_is()" call.  There are no alignment
issues unless you are trying to push this data directly into the
page cache.

Couple this with a card that makes sure that on a per-page basis, only
data for a particular flow (or group of flows) will accumulate.

People already make cards that can do stuff like this; it can be done
statelessly with an on-chip, dynamically maintained flow table.

And best yet, it doesn't turn off every feature in the networking stack
nor bypass it for the actual protocol processing.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-21  6:58                             ` David Miller
@ 2007-08-28 19:38                               ` Roland Dreier
  2007-08-28 20:43                                 ` David Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Roland Dreier @ 2007-08-28 19:38 UTC (permalink / raw)
  To: David Miller; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

Sorry for the long latency, I was at the beach all last week.

 > > And direct data placement really does give you a factor of two at
 > > least, because otherwise you're stuck receiving the data in one
 > > buffer, looking at some of the data at least, and then figuring out
 > > where to copy it.  And memory bandwidth is if anything becoming more
 > > valuable; maybe LRO + header splitting + page remapping tricks can get
 > > you somewhere but as NCPUS grows then it seems the TLB shootdown cost
 > > of page flipping is only going to get worse.

 > As Herbert has said already, people can code for this just like
 > they have to code for RDMA.

No argument, you need to change the interface to take advantage of RDMA.

 > There is no fundamental difference from converting an application
 > to sendfile or similar.

Yes, on the transmit side, there's not much difference from sendfile
or splice, although RDMA may give a slightly nicer interface that also
gives basically the equivalent of AIO.

 > The only thing this needs is a
 > "recvmsg_I_dont_care_where_the_data_is()" call.  There are no alignment
 > issues unless you are trying to push this data directly into the
 > page cache.

I don't understand how this gives you the same thing as direct data
placement (DDP).  There are many situations where the sender knows
where the data has to go and if there's some way to pass that to the
receiver, so that info can be used in the receive path to put the data
in the right place, the receiver can save a copy.  This is
fundamentally the same "offload" that an FC HBA does -- the SCSI
midlayer queues up commands like "read block A and put the data at
address X" and "read block B and put the data at address Y" and the
HBA matches tags on incoming data to put the blocks at the right
addresses, even if block B is received before block A.

RFC 4297 has some discussion of the various approaches, and while you
might not agree with their conclusions, it is interesting reading.

 > Couple this with a card that makes sure that on a per-page basis, only
 > data for a particular flow (or group of flows) will accumulate.

It seems that the NIC would also have to look into the TCP stream (and
handle out-of-order segments, etc.) to find message boundaries for this
to be equivalent to what an RDMA NIC does.
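
To illustrate what "finding message boundaries" involves, here is a toy
in-order parser for messages carrying a 2-byte length prefix (roughly
the shape of iWARP's MPA framing, but this is illustrative code, not
MPA itself, and it ignores out-of-order arrival entirely):

#include <stddef.h>
#include <stdint.h>

struct stream_parser {
	uint8_t hdr[2];		/* big-endian length prefix being assembled */
	size_t  hdr_have;	/* prefix bytes seen so far */
	size_t  msg_len;	/* length of the current message */
	size_t  msg_have;	/* payload bytes consumed so far */
};

/* Feed one in-order chunk of the byte stream; deliver() gets each
 * message fragment plus a flag marking the end of a message. */
static void parser_feed(struct stream_parser *p,
			const uint8_t *data, size_t len,
			void (*deliver)(const uint8_t *frag, size_t n, int last))
{
	while (len) {
		if (p->hdr_have < 2) {		/* still reading the prefix */
			p->hdr[p->hdr_have++] = *data++;
			len--;
			if (p->hdr_have == 2) {
				p->msg_len  = ((size_t)p->hdr[0] << 8) | p->hdr[1];
				p->msg_have = 0;
			}
			continue;
		}
		size_t take = p->msg_len - p->msg_have;
		if (take > len)
			take = len;
		p->msg_have += take;
		deliver(data, take, p->msg_have == p->msg_len);
		data += take;
		len  -= take;
		if (p->msg_have == p->msg_len)
			p->hdr_have = 0;	/* next bytes start a new prefix */
	}
}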

 - R.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-28 19:38                               ` Roland Dreier
@ 2007-08-28 20:43                                 ` David Miller
  0 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2007-08-28 20:43 UTC (permalink / raw)
  To: rdreier; +Cc: tom, jeff, swise, mshefty, netdev, linux-kernel, general

From: Roland Dreier <rdreier@cisco.com>
Date: Tue, 28 Aug 2007 12:38:07 -0700

> It seems that the NIC would also have to look into a TCP stream (and
> handle out of order segments etc) to find message boundaries for this
> to be equivalent to what an RDMA NIC does.

It would work for data that accumulates in-order, give or take a small
window, just like LRO does.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-08-09 21:55     ` David Miller
  2007-08-09 23:22       ` Sean Hefty
  2007-08-15 14:42       ` Steve Wise
@ 2007-10-08 21:54       ` Steve Wise
  2007-10-09 13:44         ` James Lentini
  2007-10-10 21:01         ` Sean Hefty
  2 siblings, 2 replies; 54+ messages in thread
From: Steve Wise @ 2007-10-08 21:54 UTC (permalink / raw)
  To: David Miller; +Cc: mshefty, rdreier, netdev, linux-kernel, general



David Miller wrote:
> From: Sean Hefty <mshefty@ichips.intel.com>
> Date: Thu, 09 Aug 2007 14:40:16 -0700
> 
>> Steve Wise wrote:
>>> Any more comments?
>> Does anyone have ideas on how to reserve the port space without using a 
>> struct socket?
> 
> How about we just remove the RDMA stack altogether?  I am not at all
> kidding.  If you guys can't stay in your sand box and need to cause
> problems for the normal network stack, it's unacceptable.  We were
> told all along that if RDMA went into the tree none of this kind of
> stuff would be an issue.
> 
> These are exactly the kinds of problems that people like myself
> were dreading.  These subsystems have no business using the TCP port
> space of the Linux software stack, absolutely none.
> 
> After TCP port reservation, what's next?  It seems to be at least a
> bi-monthly event that the RDMA folks need to put their fingers
> into something else in the normal networking stack.  No more.
> 
> I will NACK any patch that opens up sockets to eat up ports or
> anything stupid like that.

Hey Dave,

The hack to use a socket and bind it to claim the port was just for
demonstrating the idea.  The correct solution, IMO, is to enhance the
core low-level 4-tuple allocation services to be more generic (e.g. not
be tied to a struct sock).  Then the host TCP stack and the host RDMA
stack can allocate TCP/iWARP ports/4-tuples from this common exported
service and share the port space.  This allocation service could also be
used by other deep adapters like iSCSI adapters if needed.
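
To make the idea a bit more concrete, here is a toy sketch of the
port-space half of such a shared service (purely illustrative; it is
not the kernel's real allocator, and it deliberately ignores full
4-tuples):

#include <stdint.h>

/* One bit per TCP port, shared by every consumer of the port space
 * (native TCP binds, RDMA CM PS_TCP binds, iSCSI offload, ...). */
static uint8_t port_bitmap[65536 / 8];

static int port_test_and_set(uint16_t port)
{
	uint8_t mask = 1 << (port & 7);

	if (port_bitmap[port >> 3] & mask)
		return -1;		/* already reserved */
	port_bitmap[port >> 3] |= mask;
	return 0;
}

/* port == 0 asks for an ephemeral port, mirroring bind() semantics. */
static int shared_port_reserve(uint16_t port)
{
	if (port)
		return port_test_and_set(port) ? -1 : port;

	for (uint32_t p = 49152; p <= 65535; p++)	/* ephemeral range */
		if (port_test_and_set((uint16_t)p) == 0)
			return (int)p;
	return -1;				/* port space exhausted */
}

static void shared_port_release(uint16_t port)
{
	port_bitmap[port >> 3] &= ~(uint8_t)(1 << (port & 7));
}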

Will you NAK such a solution if I go implement it and submit for review? 
  The dual IP subnet solution really sux, and I'm trying one more time
to see if you will entertain the common port space solution, if done 
correctly.

Thanks,

Steve.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-10-08 21:54       ` Steve Wise
@ 2007-10-09 13:44         ` James Lentini
  2007-10-10 21:01         ` Sean Hefty
  1 sibling, 0 replies; 54+ messages in thread
From: James Lentini @ 2007-10-09 13:44 UTC (permalink / raw)
  To: Steve Wise; +Cc: David Miller, rdreier, linux-kernel, general, netdev


On Mon, 8 Oct 2007, Steve Wise wrote:

> The correct solution, IMO, is to enhance the core low-level 4-tuple
> allocation services to be more generic (e.g. not be tied to a struct
> sock).  Then the host TCP stack and the host RDMA stack can allocate
> TCP/iWARP ports/4-tuples from this common exported service and share
> the port space.  This allocation service could also be used by other
> deep adapters like iSCSI adapters if needed.

As a developer of an RDMA ULP, NFS-RDMA, I like this approach because 
it will simplify the configuration of an RDMA device and the services 
that use it.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-10-08 21:54       ` Steve Wise
  2007-10-09 13:44         ` James Lentini
@ 2007-10-10 21:01         ` Sean Hefty
  2007-10-10 23:04           ` David Miller
  1 sibling, 1 reply; 54+ messages in thread
From: Sean Hefty @ 2007-10-10 21:01 UTC (permalink / raw)
  To: Steve Wise; +Cc: David Miller, rdreier, netdev, linux-kernel, general

> The hack to use a socket and bind it to claim the port was just for
> demonstrating the idea.  The correct solution, IMO, is to enhance the
> core low-level 4-tuple allocation services to be more generic (e.g. not
> be tied to a struct sock).  Then the host TCP stack and the host RDMA
> stack can allocate TCP/iWARP ports/4-tuples from this common exported
> service and share the port space.  This allocation service could also be
> used by other deep adapters like iSCSI adapters if needed.

Since iWARP runs on top of TCP, the port space is really the same.
FWIW, I agree that this proposal is the correct solution to support iWARP.

- Sean

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space.
  2007-10-10 21:01         ` Sean Hefty
@ 2007-10-10 23:04           ` David Miller
  0 siblings, 0 replies; 54+ messages in thread
From: David Miller @ 2007-10-10 23:04 UTC (permalink / raw)
  To: mshefty; +Cc: swise, rdreier, netdev, linux-kernel, general

From: Sean Hefty <mshefty@ichips.intel.com>
Date: Wed, 10 Oct 2007 14:01:07 -0700

> > The hack to use a socket and bind it to claim the port was just for
> > demonstrating the idea.  The correct solution, IMO, is to enhance the
> > core low-level 4-tuple allocation services to be more generic (e.g. not
> > be tied to a struct sock).  Then the host TCP stack and the host RDMA
> > stack can allocate TCP/iWARP ports/4-tuples from this common exported
> > service and share the port space.  This allocation service could also be
> > used by other deep adapters like iSCSI adapters if needed.
> 
> Since iWARP runs on top of TCP, the port space is really the same.
> FWIW, I agree that this proposal is the correct solution to support iWARP.

But you can be sure it's not going to happen, sorry.

It would mean that we'd need to export the entire TCP socket table so
that when iWARP connections are created you can search to make sure
there is not an existing full 4-tuple that is the same.

It is not just about local TCP ports.
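
Concretely, the check being described is a lookup keyed on the full
4-tuple, something like the toy sketch below (illustrative only; the
real established-connection tables involve proper hashing, locking and
TIME_WAIT handling):

#include <stdbool.h>
#include <stdint.h>

/* A full connection identifier; the local port alone is not enough. */
struct tuple4 {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

#define HASH_BUCKETS 4096

struct tuple4_node {
	struct tuple4		t;
	struct tuple4_node	*next;
};

static struct tuple4_node *established[HASH_BUCKETS];

static unsigned int tuple4_hash(const struct tuple4 *t)
{
	return (t->saddr ^ t->daddr ^
		((uint32_t)t->sport << 16 | t->dport)) % HASH_BUCKETS;
}

/*
 * Before an iWARP endpoint reuses a local port, every established
 * connection has to be checked for an identical 4-tuple; two
 * connections may share a local port as long as the remote side differs.
 */
static bool tuple4_in_use(const struct tuple4 *t)
{
	struct tuple4_node *n;

	for (n = established[tuple4_hash(t)]; n; n = n->next)
		if (n->t.saddr == t->saddr && n->t.daddr == t->daddr &&
		    n->t.sport == t->sport && n->t.dport == t->dport)
			return true;
	return false;
}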

iWARP needs to live in its separate little container and not
contaminate the rest of the networking stack; this is the deal.  Any
such suggested change that breaks that deal will be NACK'd by all of
the core networking developers.

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2007-10-10 23:05 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-08-07 14:37 [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from the host TCP port space Steve Wise
2007-08-07 14:54 ` Evgeniy Polyakov
2007-08-07 15:06   ` Steve Wise
2007-08-07 15:39     ` Evgeniy Polyakov
2007-08-09 18:49 ` Steve Wise
2007-08-09 21:40   ` [ofa-general] " Sean Hefty
2007-08-09 21:55     ` David Miller
2007-08-09 23:22       ` Sean Hefty
2007-08-15 14:42       ` Steve Wise
2007-08-16  2:26         ` Jeff Garzik
2007-08-16  3:11           ` Roland Dreier
2007-08-16  3:27           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty
2007-08-16 13:43           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Tom Tucker
2007-08-16 21:17             ` David Miller
2007-08-17 19:52               ` Roland Dreier
2007-08-17 21:27                 ` David Miller
2007-08-17 23:31                   ` Roland Dreier
2007-08-18  0:00                     ` David Miller
2007-08-18  5:23                       ` Roland Dreier
2007-08-18  6:44                         ` David Miller
2007-08-19  7:01                           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP portsfrom " Sean Hefty
2007-08-19  7:23                             ` David Miller
2007-08-19 17:33                               ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCPportsfrom " Felix Marti
2007-08-19 19:32                                 ` David Miller
2007-08-19 19:49                                   ` Felix Marti
2007-08-19 23:04                                     ` David Miller
2007-08-20  0:32                                       ` Felix Marti
2007-08-20  0:40                                         ` David Miller
2007-08-20  0:47                                           ` Felix Marti
2007-08-20  1:05                                             ` David Miller
2007-08-20  1:41                                               ` Felix Marti
2007-08-20 11:07                                                 ` Andi Kleen
2007-08-20 16:26                                                   ` Felix Marti
2007-08-20 19:16                                                   ` Rick Jones
2007-08-20  9:43                                             ` Evgeniy Polyakov
2007-08-20 16:53                                               ` Felix Marti
2007-08-20 18:10                                                 ` Andi Kleen
2007-08-20 19:02                                                   ` Felix Marti
2007-08-20 20:18                                                     ` Thomas Graf
2007-08-20 20:33                                                       ` Andi Kleen
2007-08-20 20:33                                                 ` Patrick Geoffray
2007-08-21  4:21                                                   ` Felix Marti
2007-08-19 23:27                                     ` Andi Kleen
2007-08-19 23:12                                       ` David Miller
2007-08-20  1:45                                       ` Felix Marti
2007-08-20  0:18                                 ` Herbert Xu
2007-08-21  1:16                           ` [ofa-general] Re: [PATCH RFC] RDMA/CMA: Allocate PS_TCP ports from " Roland Dreier
2007-08-21  6:58                             ` David Miller
2007-08-28 19:38                               ` Roland Dreier
2007-08-28 20:43                                 ` David Miller
2007-10-08 21:54       ` Steve Wise
2007-10-09 13:44         ` James Lentini
2007-10-10 21:01         ` Sean Hefty
2007-10-10 23:04           ` David Miller
