* [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
@ 2018-10-31 23:26 Christoph Paasch
  2018-10-31 23:26 ` [RFC 1/2] net: Add new socket-option SO_DELAYED_BIND Christoph Paasch
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Christoph Paasch @ 2018-10-31 23:26 UTC (permalink / raw)
  To: netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar

Implementations of Quic might want to create a separate socket for each
Quic-connection by creating a connected UDP-socket.

To achieve that on the server side, a "master-socket" needs to wait for
incoming new connections and then create a new socket that will be a
connected UDP-socket. To create that latter one, the server needs to
first bind() and then connect(). However, after the bind() the server
might already receive traffic on that new socket that is unrelated to the
Quic-connection at hand. Only after the connect() does a full 4-tuple
match happen. So, one cannot really build this kind of server with a
connected UDP-socket per Quic-connection.

What is needed is an "atomic bind & connect" that prevents any incoming
traffic from matching on the new socket until the connect() call has been
issued, at which point the full 4-tuple is known.


This patchset implements this functionality and exposes a socket-option
to do this.

Usage would be:

        int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

        int val = 1;
        setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));

        bind(fd, (struct sockaddr *)&src, sizeof(src));

	/* At this point, incoming traffic will never match on this socket */

        connect(fd, (struct sockaddr *)&dst, sizeof(dst));

	/* Only now incoming traffic will reach the socket */
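
For illustration, here is a rough sketch of the server-side flow this is
meant to enable (error handling omitted; sharing the local port between
the master and the per-connection socket via SO_REUSEADDR is an assumption
of this sketch, not something the patchset itself addresses):

        /* master socket, bound to the server's well-known port */
        int master = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
        int val = 1;
        setsockopt(master, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));
        bind(master, (struct sockaddr *)&src, sizeof(src));

        /* a new Quic-connection announces itself with its first packet */
        char buf[1500];
        struct sockaddr_in peer;
        socklen_t peerlen = sizeof(peer);
        recvfrom(master, buf, sizeof(buf), 0,
                 (struct sockaddr *)&peer, &peerlen);

        /* per-connection socket: with SO_DELAYED_BIND it never matches
         * incoming traffic until connect() establishes the full 4-tuple
         */
        int conn = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
        setsockopt(conn, SOL_SOCKET, SO_REUSEADDR, &val, sizeof(val));
        setsockopt(conn, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
        bind(conn, (struct sockaddr *)&src, sizeof(src));
        connect(conn, (struct sockaddr *)&peer, sizeof(peer));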



There are countless ways this could be implemented, which is why I am
first sending it out as an RFC. With the approach taken here I chose the
least invasive one, just preventing the match on the incoming path.


The reason for choosing a SOL_SOCKET socket-option rather than one at the
SOL_UDP level is that this functionality could actually be useful for
other protocols as well. E.g., TCP might want to better use the full
4-tuple space by binding to the source-IP and the destination-IP at the
same time.


Feedback is very welcome!


Christoph Paasch (2):
  net: Add new socket-option SO_DELAYED_BIND
  udp: Support SO_DELAYED_BIND

 arch/alpha/include/uapi/asm/socket.h  |  2 ++
 arch/ia64/include/uapi/asm/socket.h   |  2 ++
 arch/mips/include/uapi/asm/socket.h   |  2 ++
 arch/parisc/include/uapi/asm/socket.h |  2 ++
 arch/s390/include/uapi/asm/socket.h   |  2 ++
 arch/sparc/include/uapi/asm/socket.h  |  2 ++
 arch/xtensa/include/uapi/asm/socket.h |  2 ++
 include/net/sock.h                    |  1 +
 include/uapi/asm-generic/socket.h     |  2 ++
 net/core/sock.c                       | 21 +++++++++++++++++++++
 net/ipv4/datagram.c                   |  1 +
 net/ipv4/udp.c                        |  3 +++
 12 files changed, 42 insertions(+)

-- 
2.16.2


* [RFC 1/2] net: Add new socket-option SO_DELAYED_BIND
  2018-10-31 23:26 [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets Christoph Paasch
@ 2018-10-31 23:26 ` Christoph Paasch
  2018-10-31 23:26 ` [RFC 2/2] udp: Support SO_DELAYED_BIND Christoph Paasch
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Christoph Paasch @ 2018-10-31 23:26 UTC (permalink / raw)
  To: netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar

And store it as a flag in the sk_flags.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
---
 arch/alpha/include/uapi/asm/socket.h  |  2 ++
 arch/ia64/include/uapi/asm/socket.h   |  2 ++
 arch/mips/include/uapi/asm/socket.h   |  2 ++
 arch/parisc/include/uapi/asm/socket.h |  2 ++
 arch/s390/include/uapi/asm/socket.h   |  2 ++
 arch/sparc/include/uapi/asm/socket.h  |  2 ++
 arch/xtensa/include/uapi/asm/socket.h |  2 ++
 include/net/sock.h                    |  1 +
 include/uapi/asm-generic/socket.h     |  2 ++
 net/core/sock.c                       | 21 +++++++++++++++++++++
 10 files changed, 38 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h
index 065fb372e355..add6aca13b53 100644
--- a/arch/alpha/include/uapi/asm/socket.h
+++ b/arch/alpha/include/uapi/asm/socket.h
@@ -115,4 +115,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/ia64/include/uapi/asm/socket.h b/arch/ia64/include/uapi/asm/socket.h
index c872c4e6bafb..98a86f406601 100644
--- a/arch/ia64/include/uapi/asm/socket.h
+++ b/arch/ia64/include/uapi/asm/socket.h
@@ -117,4 +117,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* _ASM_IA64_SOCKET_H */
diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h
index 71370fb3ceef..f84bd74d58ee 100644
--- a/arch/mips/include/uapi/asm/socket.h
+++ b/arch/mips/include/uapi/asm/socket.h
@@ -126,4 +126,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h
index 061b9cf2a779..8fe20a7abf6e 100644
--- a/arch/parisc/include/uapi/asm/socket.h
+++ b/arch/parisc/include/uapi/asm/socket.h
@@ -107,4 +107,6 @@
 #define SO_TXTIME		0x4036
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		0x4037
+
 #endif /* _UAPI_ASM_SOCKET_H */
diff --git a/arch/s390/include/uapi/asm/socket.h b/arch/s390/include/uapi/asm/socket.h
index 39d901476ee5..c00b10909a72 100644
--- a/arch/s390/include/uapi/asm/socket.h
+++ b/arch/s390/include/uapi/asm/socket.h
@@ -114,4 +114,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* _ASM_SOCKET_H */
diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h
index 7ea35e5601b6..0825db0c9f46 100644
--- a/arch/sparc/include/uapi/asm/socket.h
+++ b/arch/sparc/include/uapi/asm/socket.h
@@ -104,6 +104,8 @@
 #define SO_TXTIME		0x003f
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		0x0040
+
 /* Security levels - as per NRL IPv6 - don't actually do anything */
 #define SO_SECURITY_AUTHENTICATION		0x5001
 #define SO_SECURITY_ENCRYPTION_TRANSPORT	0x5002
diff --git a/arch/xtensa/include/uapi/asm/socket.h b/arch/xtensa/include/uapi/asm/socket.h
index 1de07a7f7680..cd4d91e982d5 100644
--- a/arch/xtensa/include/uapi/asm/socket.h
+++ b/arch/xtensa/include/uapi/asm/socket.h
@@ -119,4 +119,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif	/* _XTENSA_SOCKET_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index f665d74ae509..16fbe54cf519 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -801,6 +801,7 @@ enum sock_flags {
 	SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
 	SOCK_TXTIME,
 	SOCK_XDP, /* XDP is attached */
+	SOCK_DELAYED_BIND,
 };
 
 #define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index a12692e5f7a8..653f1f65a311 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -110,4 +110,6 @@
 #define SO_TXTIME		61
 #define SCM_TXTIME		SO_TXTIME
 
+#define SO_DELAYED_BIND		62
+
 #endif /* __ASM_GENERIC_SOCKET_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index 6fcc4bc07d19..343baa820cf2 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1047,6 +1047,23 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		}
 		break;
 
+	case SO_DELAYED_BIND:
+		if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
+			if (sk->sk_protocol != IPPROTO_UDP)
+				ret = -ENOTSUPP;
+		} else {
+			ret = -ENOTSUPP;
+		}
+
+		if (!ret) {
+			if (val < 0 || val > 1)
+				ret = -EINVAL;
+			else
+				sock_valbool_flag(sk, SOCK_DELAYED_BIND, valbool);
+		}
+
+		break;
+
 	default:
 		ret = -ENOPROTOOPT;
 		break;
@@ -1391,6 +1408,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 				  SOF_TXTIME_REPORT_ERRORS : 0;
 		break;
 
+	case SO_DELAYED_BIND:
+		v.val = sock_flag(sk, SOCK_DELAYED_BIND);
+		break;
+
 	default:
 		/* We implement the SO_SNDLOWAT etc to not be settable
 		 * (1003.1g 7).
-- 
2.16.2


* [RFC 2/2] udp: Support SO_DELAYED_BIND
  2018-10-31 23:26 [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets Christoph Paasch
  2018-10-31 23:26 ` [RFC 1/2] net: Add new socket-option SO_DELAYED_BIND Christoph Paasch
@ 2018-10-31 23:26 ` Christoph Paasch
  2018-11-01  0:53 ` [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets Eric Dumazet
  2018-11-01 21:51 ` Willem de Bruijn
  3 siblings, 0 replies; 14+ messages in thread
From: Christoph Paasch @ 2018-10-31 23:26 UTC (permalink / raw)
  To: netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar

For UDP, there is only a single socket-hash table, the udptable.

We want to prevent incoming segments from matching on this socket when
SO_DELAYED_BIND is set. Thus, when computing the score for unconnected
sockets, we simply refuse the match as long as the flag is set.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
---
 net/ipv4/datagram.c | 1 +
 net/ipv4/udp.c      | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 300921417f89..9bf0e0d2ea33 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -78,6 +78,7 @@ int __ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len
 	inet->inet_id = jiffies;
 
 	sk_dst_set(sk, &rt->dst);
+	sock_reset_flag(sk, SOCK_DELAYED_BIND);
 	err = 0;
 out:
 	return err;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ca3ed931f2a9..fb55f925342b 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -408,6 +408,9 @@ static int compute_score(struct sock *sk, struct net *net,
 			score += 4;
 	}
 
+	if (sock_flag(sk, SOCK_DELAYED_BIND))
+		return -1;
+
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
 	return score;
-- 
2.16.2


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-10-31 23:26 [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets Christoph Paasch
  2018-10-31 23:26 ` [RFC 1/2] net: Add new socket-option SO_DELAYED_BIND Christoph Paasch
  2018-10-31 23:26 ` [RFC 2/2] udp: Support SO_DELAYED_BIND Christoph Paasch
@ 2018-11-01  0:53 ` Eric Dumazet
  2018-11-01  3:50   ` Christoph Paasch
  2018-11-01 17:58   ` Leif Hedstrom
  2018-11-01 21:51 ` Willem de Bruijn
  3 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2018-11-01  0:53 UTC (permalink / raw)
  To: Christoph Paasch, netdev; +Cc: Ian Swett, Leif Hedstrom, Jana Iyengar



On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> Implementations of Quic might want to create a separate socket for each
> Quic-connection by creating a connected UDP-socket.
> 

Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?

It would add a huge overhead in term of memory usage in the kernel,
and lots of epoll events to manage (say a QUIC server with one million flows, receiving
very few packets per second per flow)

Maybe you could elaborate on the need of having one UDP socket per connection.

> To achieve that on the server-side, a "master-socket" needs to wait for
> incoming new connections and then creates a new socket that will be a
> connected UDP-socket. To create that latter one, the server needs to
> first bind() and then connect(). However, after the bind() the server
> might already receive traffic on that new socket that is unrelated to the
> Quic-connection at hand. Only after the connect() a full 4-tuple match
> is happening. So, one can't really create this kind of a server that has
> a connected UDP-socket per Quic connection.
> 
> So, what is needed is an "atomic bind & connect" that basically
> prevents any incoming traffic until the connect() call has been issued
> at which point the full 4-tuple is known.
> 
> 
> This patchset implements this functionality and exposes a socket-option
> to do this.
> 
> Usage would be:
> 
>         int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> 
>         int val = 1;
>         setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
> 
>         bind(fd, (struct sockaddr *)&src, sizeof(src));
> 
> 	/* At this point, incoming traffic will never match on this socket */
> 
>         connect(fd, (struct sockaddr *)&dst, sizeof(dst));
> 
> 	/* Only now incoming traffic will reach the socket */
> 
> 
> 
> There is literally an infinite number of ways on how to implement it,
> which is why I first send it out as an RFC. With this approach here I
> chose the least invasive one, just preventing the match on the incoming
> path.
> 
> 
> The reason for choosing a SOL_SOCKET socket-option and not at the
> SOL_UDP-level is because that functionality actually could be useful for
> other protocols as well. E.g., TCP wants to better use the full 4-tuple space
> by binding to the source-IP and the destination-IP at the same time.

Passive TCP flows can not benefit from this idea.

Active TCP flows can already do that, I do not really understand what you are suggesting.


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-11-01  0:53 ` [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets Eric Dumazet
@ 2018-11-01  3:50   ` Christoph Paasch
  2018-11-01  5:04     ` Eric Dumazet
  2018-11-01  5:08     ` Eric Dumazet
  2018-11-01 17:58   ` Leif Hedstrom
  1 sibling, 2 replies; 14+ messages in thread
From: Christoph Paasch @ 2018-11-01  3:50 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar

On 31/10/18 - 17:53:22, Eric Dumazet wrote:
> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
> > Implementations of Quic might want to create a separate socket for each
> > Quic-connection by creating a connected UDP-socket.
> > 
> 
> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
> 
> It would add a huge overhead in term of memory usage in the kernel,
> and lots of epoll events to manage (say a QUIC server with one million flows, receiving
> very few packets per second per flow)
> 
> Maybe you could elaborate on the need of having one UDP socket per connection.

I let Leif chime in on that as the ask came from him. Leif & his team are
implementing Quic in the Apache Traffic Server.


One advantage I can see is that it would allow benefiting from fq_pacing,
as one could set sk_pacing_rate directly on the socket. That way there is
no need to implement the pacing in user-space anymore.
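
(For illustration, a minimal sketch of that, relying on the existing
SO_MAX_PACING_RATE option and assuming the fq qdisc is configured on the
egress interface; the rate is an arbitrary example value:)

        /* cap the pacing rate of this connected UDP-socket; fq then
         * spaces the packets of this flow out accordingly
         */
        unsigned int rate = 5 * 1000 * 1000;    /* bytes per second */
        setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &rate, sizeof(rate));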


> > To achieve that on the server-side, a "master-socket" needs to wait for
> > incoming new connections and then creates a new socket that will be a
> > connected UDP-socket. To create that latter one, the server needs to
> > first bind() and then connect(). However, after the bind() the server
> > might already receive traffic on that new socket that is unrelated to the
> > Quic-connection at hand. Only after the connect() a full 4-tuple match
> > is happening. So, one can't really create this kind of a server that has
> > a connected UDP-socket per Quic connection.
> > 
> > So, what is needed is an "atomic bind & connect" that basically
> > prevents any incoming traffic until the connect() call has been issued
> > at which point the full 4-tuple is known.
> > 
> > 
> > This patchset implements this functionality and exposes a socket-option
> > to do this.
> > 
> > Usage would be:
> > 
> >         int fd = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
> > 
> >         int val = 1;
> >         setsockopt(fd, SOL_SOCKET, SO_DELAYED_BIND, &val, sizeof(val));
> > 
> >         bind(fd, (struct sockaddr *)&src, sizeof(src));
> > 
> > 	/* At this point, incoming traffic will never match on this socket */
> > 
> >         connect(fd, (struct sockaddr *)&dst, sizeof(dst));
> > 
> > 	/* Only now incoming traffic will reach the socket */
> > 
> > 
> > 
> > There is literally an infinite number of ways on how to implement it,
> > which is why I first send it out as an RFC. With this approach here I
> > chose the least invasive one, just preventing the match on the incoming
> > path.
> > 
> > 
> > The reason for choosing a SOL_SOCKET socket-option and not at the
> > SOL_UDP-level is because that functionality actually could be useful for
> > other protocols as well. E.g., TCP wants to better use the full 4-tuple space
> > by binding to the source-IP and the destination-IP at the same time.
> 
> Passive TCP flows can not benefit from this idea.
> 
> Active TCP flows can already do that, I do not really understand what you are suggesting.

What we had here is that we wanted to let a server initiate more than 64K
connections *while* binding also to a source-IP.
With TCP the bind() would then pick a source-port and we ended up hitting the
64K limit. If we could do an atomic "bind + connect", source-port selection
could ensure that the 4-tuple is unique.

Or has something changed in recent times that allows to use the 4-tuple
matching when doing this with TCP?


Christoph


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-11-01  3:50   ` Christoph Paasch
@ 2018-11-01  5:04     ` Eric Dumazet
  2018-11-01  5:07       ` Christoph Paasch
  2018-11-01  5:08     ` Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2018-11-01  5:04 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar



On 10/31/2018 08:50 PM, Christoph Paasch wrote:

> What we had here is that we wanted to let a server initiate more than 64K
> connections *while* binding also to a source-IP.
> With TCP the bind() would then pick a source-port and we ended up hitting the
> 64K limit. If we could do an atomic "bind + connect", source-port selection
> could ensure that the 4-tuple is unique.
> 
> Or has something changed in recent times that allows to use the 4-tuple
> matching when doing this with TCP?


Well, yes, although it is not really recent (this came with linux-4.2)

You can now bind to an address only, and let the sport be automatically chosen at connect()

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
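
For reference, a rough sketch of that pattern using the
IP_BIND_ADDRESS_NO_PORT option added by the commit above (illustrative
only; error handling omitted, src and dst are placeholders):

        int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);

        /* bind to the source address only and leave the port at 0, so
         * that the source-port is picked at connect() time when the
         * full 4-tuple is known
         */
        int val = 1;
        setsockopt(fd, IPPROTO_IP, IP_BIND_ADDRESS_NO_PORT, &val, sizeof(val));

        src.sin_port = 0;
        bind(fd, (struct sockaddr *)&src, sizeof(src));
        connect(fd, (struct sockaddr *)&dst, sizeof(dst));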


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-11-01  5:04     ` Eric Dumazet
@ 2018-11-01  5:07       ` Christoph Paasch
  0 siblings, 0 replies; 14+ messages in thread
From: Christoph Paasch @ 2018-11-01  5:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar



> On Oct 31, 2018, at 10:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
>> On 10/31/2018 08:50 PM, Christoph Paasch wrote:
>> 
>> What we had here is that we wanted to let a server initiate more than 64K
>> connections *while* binding also to a source-IP.
>> With TCP the bind() would then pick a source-port and we ended up hitting the
>> 64K limit. If we could do an atomic "bind + connect", source-port selection
>> could ensure that the 4-tuple is unique.
>> 
>> Or has something changed in recent times that allows to use the 4-tuple
>> matching when doing this with TCP?
> 
> 
> Well, yes, although it is not really recent (this came with linux-4.2)
> 
> You can now bind to an address only, and let the sport being automatically chosen at connect()

Oh, I didn’t know about this socket option. Thanks :-)


Christoph

> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=90c337da1524863838658078ec34241f45d8394d
> 


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-11-01  3:50   ` Christoph Paasch
  2018-11-01  5:04     ` Eric Dumazet
@ 2018-11-01  5:08     ` Eric Dumazet
  2018-11-01  5:17       ` Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2018-11-01  5:08 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar



On 10/31/2018 08:50 PM, Christoph Paasch wrote:
> On 31/10/18 - 17:53:22, Eric Dumazet wrote:
>> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
>>> Implementations of Quic might want to create a separate socket for each
>>> Quic-connection by creating a connected UDP-socket.
>>>
>>
>> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
>>
>> It would add a huge overhead in term of memory usage in the kernel,
>> and lots of epoll events to manage (say a QUIC server with one million flows, receiving
>> very few packets per second per flow)
>>
>> Maybe you could elaborate on the need of having one UDP socket per connection.
> 
> I let Leif chime in on that as the ask came from him. Leif & his team are
> implementing Quic in the Apache Traffic Server.
> 
> 
> One advantage I can see is that it would allow to benefit from fq_pacing as
> one could set sk_pacing_rate simply on the socket. That way there is no need
> to implement the pacing in the user-space anymore.

Our plan is to use EDT model for UDP packets, so that we can
still use one (not connected) UDP socket per cpu/thread.

We added in linux-4.20 the EDT model for TCP, and I intend to add the remaining part for sch_fq for 4.21.

UDP can use an ancillary message (SCM_TXTIME) to attach a tstamp to the skb (which can be a GSO one, btw),
and pacing will happen just fine.
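
(For illustration, a rough sketch of that usage; struct sock_txtime and
SCM_TXTIME come from linux/net_tstamp.h, the qdisc on the egress device is
assumed to honour the departure times, and iov, dst and txtime_ns are
placeholders:)

        struct sock_txtime cfg = {
                .clockid = CLOCK_MONOTONIC,     /* must match the qdisc's clock */
                .flags   = 0,
        };
        setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));

        /* attach the earliest departure time (in ns) to each sendmsg() */
        char control[CMSG_SPACE(sizeof(__u64))] = {0};
        struct msghdr msg = {
                .msg_name       = &dst,
                .msg_namelen    = sizeof(dst),
                .msg_iov        = &iov,
                .msg_iovlen     = 1,
                .msg_control    = control,
                .msg_controllen = sizeof(control),
        };
        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type  = SCM_TXTIME;
        cmsg->cmsg_len   = CMSG_LEN(sizeof(__u64));
        *(__u64 *)CMSG_DATA(cmsg) = txtime_ns;

        sendmsg(fd, &msg, 0);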


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-11-01  5:08     ` Eric Dumazet
@ 2018-11-01  5:17       ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2018-11-01  5:17 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: netdev, Ian Swett, Leif Hedstrom, Jana Iyengar



On 10/31/2018 10:08 PM, Eric Dumazet wrote:

> Our plan is to use EDT model for UDP packets, so that we can
> still use one (not connected) UDP socket per cpu/thread.
> 
> We added in linux-4.20 the EDT model for TCP, and I intend to add the remaining part for sch_fq for 4.21.
> 
> UDP can use an ancillary message (SCM_TXTIME) to attach a tstamp to the skb (which can be a GSO one, btw),
> and pacing will happen just fine.
> 

List of EDT patches in reverse order if you want to take a look :

3f80e08f40cdb308589a49077c87632fa4508b21 tcp: add tcp_reset_xmit_timer() helper
4c16128b6271e70c8743178e90cccee147858503 net: loopback: clear skb->tstamp before netif_rx()
79861919b8896e14b8e5707242721f2312c57ae4 tcp: fix TCP_REPAIR xmit queue setup
825e1c523d5000f067a1614e4a66bb282a2d373c tcp: cdg: use tcp high resolution clock cache
864e5c090749448e879e86bec06ee396aa2c19c5 tcp: optimize tcp internal pacing
7baf33bdac37da65ddce3adf4daa8c7805802174 net_sched: sch_fq: no longer use skb_is_tcp_pure_ack()
a7a2563064e963bc5e3f39f533974f2730c0ff56 tcp: mitigate scheduling jitter in EDT pacing model
76a9ebe811fb3d0605cb084f1ae6be5610541865 net: extend sk_pacing_rate to unsigned long
5f6188a8003d080e3753b8f14f4a5a2325ae1ff6 tcp: do not change tcp_wstamp_ns in tcp_mstamp_refresh
fb420d5d91c1274d5966917725e71f27ed092a85 tcp/fq: move back to CLOCK_MONOTONIC
90caf67b01fabdd51b6cdeeb23b29bf73901df90 net_sched: sch_fq: remove dead code dealing with retransmits
c092dd5f4a7f4e4dbbcc8cf2e50b516bf07e432f tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
ab408b6dc7449c0f791e9e5f8de72fa7428584f2 tcp: switch tcp and sch_fq to new earliest departure time model
fd2bca2aa7893586887b2370e90e85bd0abc805e tcp: switch internal pacing timer to CLOCK_TAI
d3edd06ea8ea9e03de6567fda80b8be57e21a537 tcp: provide earliest departure time in skb->tstamp
9799ccb0e984a5c1311b22a212e7ff96e8b736de tcp: add tcp_wstamp_ns socket field
142537e419234c396890a22806b8644dce21b132 net_sched: sch_fq: switch to CLOCK_TAI
2fd66ffba50716fc5ab481c48db643af3bda2276 tcp: introduce tcp_skb_timestamp_us() helper
72b0094f918294e6cb8cf5c3b4520d928fbb1a57 tcp: switch tcp_clock_ns() to CLOCK_TAI base


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-11-01  0:53 ` [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets Eric Dumazet
  2018-11-01  3:50   ` Christoph Paasch
@ 2018-11-01 17:58   ` Leif Hedstrom
  2018-11-01 18:21     ` Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Leif Hedstrom @ 2018-11-01 17:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Paasch, netdev, Ian Swett, Jana Iyengar



> On Oct 31, 2018, at 6:53 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
>> Implementations of Quic might want to create a separate socket for each
>> Quic-connection by creating a connected UDP-socket.
>> 
> 
> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?

First thing: This is an idea we’ve been floating, and it’s not completed yet, so we don’t have any performance numbers etc. to share. The ideas for the implementation came up after a discussion with Ian and Jana re: their implementation of a QUIC server.

That much said, the general rationale for this is that having a socket for each QUIC connection could simplify integrating QUIC into existing software that already does epoll() over TCP sockets. This is how e.g. Apache Traffic Server works, which is our target implementation for QUIC.



> 
> It would add a huge overhead in term of memory usage in the kernel,
> and lots of epoll events to manage (say a QUIC server with one million flows, receiving
> very few packets per second per flow)

Our use case is not millions of sockets, rather, 10’s of thousands. There would be one socket for each QUIC Connection, not per stream (obviously). At ~80Gbps on a box, we definitely see much less than 100k TCP connections.

Question: is there additional memory overhead here for the UDP sockets vs a normal TCP socket for e.g. HTTP or HTTP/2 ?


> 
> Maybe you could elaborate on the need of having one UDP socket per connection.


We had a couple reasons:

1) Easier to integrate with existing epoll() based event processing

2) Possibly less CPU usage / faster handling, since scheduling is simplified with the epoll integration (untested)


Ian and Jana also had a couple of reasons why this delayed bind could be useful for their implementations, but I’ll leave it to them to go into details.

Cheers,

— leif


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-11-01 17:58   ` Leif Hedstrom
@ 2018-11-01 18:21     ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2018-11-01 18:21 UTC (permalink / raw)
  To: Leif Hedstrom; +Cc: Christoph Paasch, netdev, Ian Swett, Jana Iyengar



On 11/01/2018 10:58 AM, Leif Hedstrom wrote:
> 
> 
>> On Oct 31, 2018, at 6:53 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 10/31/2018 04:26 PM, Christoph Paasch wrote:
>>> Implementations of Quic might want to create a separate socket for each
>>> Quic-connection by creating a connected UDP-socket.
>>>
>>
>> Nice proposal, but I doubt a QUIC server can afford having one UDP socket per connection ?
> 
> First thing: This is an idea we’ve been floating, and it’s not completed yet, so we don’t have any performance numbers etc. to share. The ideas for the implementation came up after a discussion with Ian and Jana re: their implementation of a QUIC server.
> 
> That much said, the general rationale for this is that having a socket for each QUIC connection could simplify integrating QUIC into existing software that already does epoll() over TCP sockets. This is how e.g. Apache Traffic Server works, which is our target implementation for QUIC.
> 
> 
> 
>>
>> It would add a huge overhead in term of memory usage in the kernel,
>> and lots of epoll events to manage (say a QUIC server with one million flows, receiving
>> very few packets per second per flow)
> 
> Our use case is not millions of sockets, rather, 10’s of thousands. There would be one socket for each QUIC Connection, not per stream (obviously). At ~80Gbps on a box, we definitely see much less than 100k TCP connections.
> 
> Question: is there additional memory overhead here for the UDP sockets vs a normal TCP socket for e.g. HTTP or HTTP/2 ?

TCP sockets have a lot of state. We can understand spending 2 or 3 KB per socket.

UDP sockets really have no state. The receive queue anchor is only 24 bytes.
Still, the memory costs for one UDP socket are:

1344 bytes for UDP socket,
320 bytes for the "struct file"
192 bytes for the struct dentry
704 bytes for inode
512 bytes for the two dst (connected socket)
200 bytes for eventpoll structures
104 bytes for the fq flow

That is about 3.3KB per socket (but you can probably round this to 4KB due to kmalloc roundings)

One million sockets -> 4GB of memory.

This really does not scale.


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-10-31 23:26 [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets Christoph Paasch
                   ` (2 preceding siblings ...)
  2018-11-01  0:53 ` [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets Eric Dumazet
@ 2018-11-01 21:51 ` Willem de Bruijn
  2018-11-01 22:11   ` Christoph Paasch
  3 siblings, 1 reply; 14+ messages in thread
From: Willem de Bruijn @ 2018-11-01 21:51 UTC (permalink / raw)
  To: cpaasch; +Cc: Network Development, ianswett, lhedstrom, jri.ietf, Eric Dumazet

On Wed, Oct 31, 2018 at 7:30 PM Christoph Paasch <cpaasch@apple.com> wrote:
>
> Implementations of Quic might want to create a separate socket for each
> Quic-connection by creating a connected UDP-socket.
>
> To achieve that on the server-side, a "master-socket" needs to wait for
> incoming new connections and then creates a new socket that will be a
> connected UDP-socket. To create that latter one, the server needs to
> first bind() and then connect(). However, after the bind() the server
> might already receive traffic on that new socket that is unrelated to the
> Quic-connection at hand.

This can also be achieved with SO_REUSEPORT_BPF and a filter
that only selects the listener socket(s) in the group. The connect
call should call udp_lib_rehash and take the connected socket out
of the reuseport listener group. Though admittedly that is more
elaborate than setting a boolean socket option.
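
(For illustration, a rough sketch of that alternative; linux/filter.h
provides the cBPF structures, the listener is assumed to be bound first so
that it sits at index 0 of the reuseport group, and error handling is
omitted:)

        /* trivial cBPF program: always select socket index 0, i.e. the
         * listener, when the reuseport group is consulted
         */
        struct sock_filter code[] = {
                { BPF_RET | BPF_K, 0, 0, 0 },
        };
        struct sock_fprog prog = {
                .len    = 1,
                .filter = code,
        };
        int one = 1;

        setsockopt(listener, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        setsockopt(listener, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
                   &prog, sizeof(prog));
        bind(listener, (struct sockaddr *)&src, sizeof(src));

        /* per-connection socket: joins the same group, is never picked by
         * the program, and connect() then moves it to a 4-tuple match
         */
        setsockopt(conn, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
        bind(conn, (struct sockaddr *)&src, sizeof(src));
        connect(conn, (struct sockaddr *)&peer, sizeof(peer));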

> The ideas for the implementation came up after a discussion with Ian
> and Jana re: their implementation of a QUIC server.

That might have preceded SO_TXTIME? AFAIK traffic shaping was the
only real reason to prefer connected sockets.


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
  2018-11-01 21:51 ` Willem de Bruijn
@ 2018-11-01 22:11   ` Christoph Paasch
       [not found]     ` <CAKcm_gNZqgRGRj2J5yJDsavHsoaeXtozrbGp+TmAj_DRsCUOLQ@mail.gmail.com>
  0 siblings, 1 reply; 14+ messages in thread
From: Christoph Paasch @ 2018-11-01 22:11 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Network Development, ianswett, lhedstrom, jri.ietf, Eric Dumazet

On 01/11/18 - 17:51:39, Willem de Bruijn wrote:
> On Wed, Oct 31, 2018 at 7:30 PM Christoph Paasch <cpaasch@apple.com> wrote:
> >
> > Implementations of Quic might want to create a separate socket for each
> > Quic-connection by creating a connected UDP-socket.
> >
> > To achieve that on the server-side, a "master-socket" needs to wait for
> > incoming new connections and then creates a new socket that will be a
> > connected UDP-socket. To create that latter one, the server needs to
> > first bind() and then connect(). However, after the bind() the server
> > might already receive traffic on that new socket that is unrelated to the
> > Quic-connection at hand.
> 
> This can also be achieved with SO_REUSEPORT_BPF and a filter
> that only selects the listener socket(s) in the group. The connect
> call should call udp_lib_rehash and take the connected socket out
> of the reuseport listener group. Though admittedly that is more
> elaborate than setting a boolean socket option.

Yeah, that indeed would be quite a bit more elaborate ;-)

A simple socket-option is much easier.


Cheers,
Christoph

> 
> > The ideas for the implementation came up after a discussion with Ian
> > and Jana re: their implementation of a QUIC server.
> 
> That might have preceded SO_TXTIME? AFAIK traffic shaping was the
> only real reason to prefer connected sockets.


* Re: [RFC 0/2] Delayed binding of UDP sockets for Quic per-connection sockets
       [not found]       ` <CACpbDccs6WmLCknpu2GLMMBnkHwS4apsr3Z3sAKt4Ch_2HPwgg@mail.gmail.com>
@ 2018-11-04 18:58         ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2018-11-04 18:58 UTC (permalink / raw)
  To: Jana Iyengar, Ian Swett
  Cc: Christoph Paasch, willemdebruijn.kernel, netdev, lhedstrom



On 11/04/2018 02:45 AM, Jana Iyengar wrote:
> I think SO_TXTIME solves the most egregious problem I have with using sched_fq for QUIC. That's a great help, so thank you!
> 
> That said, as Ian says, SO_TXTIME does not allow for the flow isolation properties of sched_fq, which would be a nice secondary benefit. I suspect that can also be done in a similar manner to SO_TXTIME -- by attaching an opaque label to each sendmsg which is used by sched_fq to determine the flow. Is that feasible?

The plan is to have flow isolation, without having to change applications to provide a flow identifier,
since we can not trust user applications anyway.

FQ will perform a proper flow dissection for packets sent over unconnected UDP sockets.

FQ has two parts [1], one being used by locally generated packets (local TCP stack),
one being used in forwarding workloads.

The patch which I will send shortly (when net-next reopens)
will only make sure FQ does not use skb->sk as a flow identifier
if the socket is not a connected one.

[1] See commit 06eb395fa9856b5a87cf7d80baee2a0ed3cdb9d7 for some details.


