* GSO/GRO and UDP performance
@ 2013-09-04 10:07 James Yonan
2013-09-04 11:53 ` Eric Dumazet
0 siblings, 1 reply; 7+ messages in thread
From: James Yonan @ 2013-09-04 10:07 UTC (permalink / raw)
To: netdev
I'm looking at ways to improve UDP performance in the kernel.
Specifically I'd like to take some of the ideas in GSO/GRO for TCP and
apply them to UDP as well. Our use case is OpenVPN, but these methods
should apply to any UDP-based app.
As I understand GSO/GRO for TCP, there are essentially two central features:
(a) it's a way of batching packets with similar headers together so that
they can efficiently traverse the network stack as a single unit
(b) it explicitly maps the batching of packets to the L4 segmenting
features of TCP, so that batched packets can be coalesced into TCP segments.
This approach works great for TCP because of its built-in L4 segmenting
features, but it tends to break down for UDP because of (b) in
particular -- UDP doesn't have an L4 segmenting model, so the
gso_segment method for UDP resorts to segmenting the packets with L3 IP
fragmentation (i.e. UFO). The problem is that IP fragmentation is
broken on so many different levels that it can't be relied on for apps
that need to communicate over the open internet (*). Most UDP apps do
their own app-level fragmentation and wouldn't want to be forced to buy
into IP fragmentation in order to get the performance benefits of GSO/GRO.
So I would like to propose a GSO/GRO implementation for UDP that works
by batching together separate UDP packets with similar headers into a
single skb. There is no tie-in with L3 IP fragmentation -- the packets
are sent over the wire and received as individual UDP packets.
Here is an example of how this might work in practice:
When I call sendmmsg from userspace with a bunch of UDP packets having
the same header, the kernel would assemble these packets into a single
skb via shinfo(skb)->frag_list. There would need to be a new gso_type
indicating that frag_list is simply a list of UDP packets having the
same header that should be transmitted separately. No IP fragmentation
would be necessary as long as the app has correctly sized the packets
for the link MTU.
Once this skb is about to reach the driver, dev_hard_start_xmit could do
the usual GSO thing and separate out the packets in
shinfo(skb)->frag_list and pass them individually to the driver's
ndo_start_xmit method, if the driver doesn't support batched UDP
packets. There would need to be a new gso_type for this batching model,
e.g. "SKB_GSO_UDP_BUNDLE" that drivers could optionally support.
On the receive side, we would define a gro_receive method for UDP (none
currently exists) that does the same batching in reverse: UDP packets
with the same header would be collected into shinfo(skb)->frag_list and
gso_type would be set to SKB_GSO_UDP_BUNDLE.
The bundle of UDP packets would traverse the stack as a unit until it
reaches the socket layer, where recvmmsg could pass the whole bundle up
to userspace in a single transaction (or recvmsg could disaggregate the
bundle and pass each datagram individually).
This approach should also significantly speed up UDP apps running on VM
guests, because the skbs of bundled UDP packets could be passed across
the hypervisor/guest barrier in a single transaction.
Because this technique bundles UDP packets without coalescing or
modifying them, the approach should be lossless with respect to
bridging, hypervisor/guest communication, routing, etc. It also doesn't
interfere with existing hardware support for L4 checksum offloading
(unlike UFO).
Could this work? Are there problems with this that I'm not considering?
Are there better or existing ways of doing this?
Thanks,
James
---------------------
(*) Well-known issues of UDP/IP fragmentation:
1. Relies on PMTU discovery, which often doesn't work in the real world
because of inconsistent ICMP forwarding policies.
2. Breaks down on high-bandwidth links because the IPv4 16-bit packet ID
value can wrap around, causing data corruption.
3. One fragment lost in transit means that the whole superpacket is lost.
James
* Re: GSO/GRO and UDP performance
From: Eric Dumazet @ 2013-09-04 11:53 UTC (permalink / raw)
To: James Yonan; +Cc: netdev
On Wed, 2013-09-04 at 04:07 -0600, James Yonan wrote:
> The bundle of UDP packets would traverse the stack as a unit until it
> reaches the socket layer, where recvmmsg could pass the whole bundle up
> to userspace in a single transaction (or recvmsg could disaggregate the
> bundle and pass each datagram individually).
That would require a lot of work, say in netfilter, but also in the core
network stack's forwarding path, and in all in-kernel UDP users (L2TP, vxlan).
Very unlikely to happen IMHO.
I suspect the performance is coming from aggregation done in user space,
then re-injected into the kernel?
You could use a kernel module, using udp_encap_enable() and friends.
Check vxlan_socket_create() for an example.
* Re: GSO/GRO and UDP performance
From: James Yonan @ 2013-09-06 9:22 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev
On 04/09/2013 05:53, Eric Dumazet wrote:
> On Wed, 2013-09-04 at 04:07 -0600, James Yonan wrote:
>
>> The bundle of UDP packets would traverse the stack as a unit until it
>> reaches the socket layer, where recvmmsg could pass the whole bundle up
>> to userspace in a single transaction (or recvmsg could disaggregate the
>> bundle and pass each datagram individually).
>
> That would require a lot of work, say in netfilter, but also in the core
> network stack's forwarding path, and in all in-kernel UDP users (L2TP, vxlan).
>
> Very unlikely to happen IMHO.
I agree that aggregating packets by chaining multiple packets into a
single skb would be too disruptive.
However I believe GSO/GRO provides a potential solution here that would
be transparent to the core network stack and existing in-kernel UDP users.
GSO/GRO already allows any L4 protocol or lower to define their own
segmentation and aggregation algorithms, as long as the algorithms are
lossless.
There's no reason why GSO/GRO couldn't operate on L5 or higher protocols
if segmentation and aggregation algorithms are provided by a kernel
module that understands the specific app protocol.
It looks like this could be done with minimal changes to the GSO/GRO
core. There would need to be a hook where a kernel module could
register itself as a GSO/GRO provider for UDP. It could then perform
segmentation/aggregation on UDP packets that belong to it. The dispatch
to the UDP GSO/GRO providers would be done by the existing offload code
for UDP, so there would be zero added overhead for non-UDP protocols.
>
> I suspect the performance is coming from aggregation done in user space,
> then re-injected into the kernel ?
>
> You could use a kernel module, using udp_encap_enable() and friends.
>
> Check vxlan_socket_create() for an example
I actually put together a test kernel module using udp_encap_enable to
see if I could accelerate UDP performance that way. But even with the
boost of running in kernel space, the packet processing overhead of
dealing with 1500 byte packets negates most of the gain, while TCP gets
a 43x performance boost by being able to aggregate up to 64KB per
superpacket with GSO/GRO.
So I think that playing well with GSO/GRO is essential to get speedup in
UDP apps because of this 43x multiplier.
James
* Re: GSO/GRO and UDP performance
From: Eric Dumazet @ 2013-09-06 13:07 UTC (permalink / raw)
To: James Yonan; +Cc: netdev
On Fri, 2013-09-06 at 03:22 -0600, James Yonan wrote:
> So I think that playing well with GSO/GRO is essential to get speedup in
> UDP apps because of this 43x multiplier.
>
That's not true. GRO cannot aggregate more than 16+1 packets.
I think we cannot aggregate UDP packets, because UDP lacks sequence
numbers, so reorders would be a problem.
You really need something that is not UDP generic.
* Re: GSO/GRO and UDP performance
From: Rick Jones @ 2013-09-06 16:42 UTC (permalink / raw)
To: James Yonan; +Cc: Eric Dumazet, netdev
On 09/06/2013 06:07 AM, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 03:22 -0600, James Yonan wrote:
>
>> So I think that playing well with GSO/GRO is essential to get speedup in
>> UDP apps because of this 43x multiplier.
>>
>
> That's not true. GRO cannot aggregate more than 16+1 packets.
>
> I think we cannot aggregate UDP packets, because UDP lacks sequence
> numbers, so reorders would be a problem.
>
> You really need something that is not UDP generic.
It may not be as sexy, and it cannot get the 43x multiplier (just what
*is* the service demand change on a netperf TCP_STREAM test these days
between GSO/GRO on and off anyway?), but looking for basic path-length
reductions would be goodness.
rick jones
* Re: GSO/GRO and UDP performance
From: James Yonan @ 2013-09-06 19:26 UTC (permalink / raw)
To: Rick Jones; +Cc: Eric Dumazet, netdev
On 06/09/2013 10:42, Rick Jones wrote:
> On 09/06/2013 06:07 AM, Eric Dumazet wrote:
>> On Fri, 2013-09-06 at 03:22 -0600, James Yonan wrote:
>>
>>> So I think that playing well with GSO/GRO is essential to get speedup in
>>> UDP apps because of this 43x multiplier.
>>>
>>
>> That's not true. GRO cannot aggregate more than 16+1 packets.
Where does the 16+1 come from? I'm getting my 43x from the ratio of max
legal IP packet size (64KB) / internet MTU (1500). Are you saying that
GRO cannot aggregate up to 64 KB?
>> I think we cannot aggregate UDP packets, because UDP lacks sequence
>> numbers, so reorders would be a problem.
>> You really need something that is not UDP generic.
Right -- that's why I'm proposing a hook for UDP GSO/GRO providers that
know about specific app-layer protocols and can provide segmentation and
aggregation methods for them. Such a provider would be implemented in a
kernel module and would know about the specific app-layer protocol, so
it would be able to losslessly segment and aggregate it (i.e. it could
use a sequence number from the app-layer protocol).
> It may not be as sexy, and it cannot get the 43x multiplier (just what
> *is* the service demand change on a netperf TCP_STREAM test these days
> between GSO/GRO on and off anyway?)
That's something I haven't really looked too closely at yet. With
MAX_GRO_SKBS set to only 8, how well would this really scale?
> but looking for basic path-length reductions would be goodness.
Path is fairly optimized as-is.
Direction 1: udp_encap_recv -> tunnel decapsulation -> netif_rx
Direction 2: ndo_start_xmit -> tunnel encapsulation -> ip_local_out
I've also looked into getting closer to driver TX by using
dev_queue_xmit instead of ip_local_out.
Even though this is a virtual driver without interrupts, I'm also
looking at NAPI as a way of getting packet flows into GRO on the RX side.
Bottom line is that I want to saturate 10 GigE with UDP packets without
breaking a sweat. ixgbe or other drivers in that class can handle it if
the per-packet overhead in the network stack can be reduced enough.
James
* Re: GSO/GRO and UDP performance
From: Eric Dumazet @ 2013-09-06 19:32 UTC (permalink / raw)
To: James Yonan; +Cc: Rick Jones, netdev
On Fri, 2013-09-06 at 13:26 -0600, James Yonan wrote:
> Where does the 16+1 come from? I'm getting my 43x from the ratio of max
> legal IP packet size (64KB) / internet MTU (1500). Are you saying that
> GRO cannot aggregate up to 64 KB?
>
Yes, this is what I said.
Hint: MAX_SKB_FRAGS is the number of fragments per skb.
Each aggregated frame consumes at least one fragment.
Hint: some drivers use more than one fragment per datagram.
-> A fragment in an skb does not necessarily contain one and exactly one
datagram.
> >> I think we cannot aggregate UDP packets, because UDP lacks sequence
> >> numbers, so reorders would be a problem.
>
> >> You really need something that is not UDP generic.
>
> Right -- that's why I'm proposing a hook for UDP GSO/GRO providers that
> know about specific app-layer protocols and can provide segmentation and
> aggregation methods for them. Such a provider would be implemented in a
> kernel module and would know about the specific app-layer protocol, so
> it would be able to losslessly segment and aggregate it (i.e. it could
> use a sequence number from the app-layer protocol).
It's not a choice given to the application.
As I said, you'll have to make sure the whole stack understands the
meaning of datagram aggregation.