* [PATCH net-next 3/3] ipv4: gre: add GRO capability
@ 2012-09-27 12:48 Eric Dumazet
  2012-09-27 17:52 ` Jesse Gross
  2012-10-01 21:04 ` David Miller
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2012-09-27 12:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Add GRO capability to IPv4 GRE tunnels, using the gro_cells
infrastructure.

Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
checking GRO is building large packets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ipip.h |    3 +++
 net/ipv4/ip_gre.c  |   13 +++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/net/ipip.h b/include/net/ipip.h
index a93cf6d..ddc077c 100644
--- a/include/net/ipip.h
+++ b/include/net/ipip.h
@@ -2,6 +2,7 @@
 #define __NET_IPIP_H 1
 
 #include <linux/if_tunnel.h>
+#include <net/gro_cells.h>
 #include <net/ip.h>
 
 /* Keep error state on tunnel for 30 sec */
@@ -36,6 +37,8 @@ struct ip_tunnel {
 #endif
 	struct ip_tunnel_prl_entry __rcu *prl;		/* potential router list */
 	unsigned int			prl_count;	/* # of entries in PRL */
+
+	struct gro_cells		gro_cells;
 };
 
 struct ip_tunnel_prl_entry {
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index f233c1d..1f00b30 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -714,8 +714,7 @@ static int ipgre_rcv(struct sk_buff *skb)
 		skb_reset_network_header(skb);
 		ipgre_ecn_decapsulate(iph, skb);
 
-		netif_rx(skb);
-
+		gro_cells_receive(&tunnel->gro_cells, skb);
 		rcu_read_unlock();
 		return 0;
 	}
@@ -1296,6 +1295,9 @@ static const struct net_device_ops ipgre_netdev_ops = {
 
 static void ipgre_dev_free(struct net_device *dev)
 {
+	struct ip_tunnel *tunnel = netdev_priv(dev);
+
+	gro_cells_destroy(&tunnel->gro_cells);
 	free_percpu(dev->tstats);
 	free_netdev(dev);
 }
@@ -1327,6 +1329,7 @@ static int ipgre_tunnel_init(struct net_device *dev)
 {
 	struct ip_tunnel *tunnel;
 	struct iphdr *iph;
+	int err;
 
 	tunnel = netdev_priv(dev);
 	iph = &tunnel->parms.iph;
@@ -1353,6 +1356,12 @@ static int ipgre_tunnel_init(struct net_device *dev)
 	if (!dev->tstats)
 		return -ENOMEM;
 
+	err = gro_cells_init(&tunnel->gro_cells, dev);
+	if (err) {
+		free_percpu(dev->tstats);
+		return err;
+	}
+
 	return 0;
 }
 

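For reference, the gro_cells infrastructure used above boils down to a
small skb queue plus a dedicated NAPI context per cell. A simplified
sketch of the idea (locking and per-cpu details omitted; this is not the
exact net/gro_cells.h code):

struct gro_cell {
	struct sk_buff_head	napi_skbs;	/* skbs parked by gro_cells_receive() */
	struct napi_struct	napi;		/* per-cell NAPI context doing the GRO work */
};

/* tunnel rx path: park the skb and kick the cell's NAPI */
static void gro_cell_queue(struct gro_cell *cell, struct sk_buff *skb)
{
	skb_queue_tail(&cell->napi_skbs, skb);
	if (skb_queue_len(&cell->napi_skbs) == 1)
		napi_schedule(&cell->napi);
}

/* NAPI poll handler: feed parked skbs back through the regular GRO engine */
static int gro_cell_poll(struct napi_struct *napi, int budget)
{
	struct gro_cell *cell = container_of(napi, struct gro_cell, napi);
	struct sk_buff *skb;
	int work_done = 0;

	while (work_done < budget) {
		skb = skb_dequeue(&cell->napi_skbs);
		if (!skb)
			break;
		napi_gro_receive(napi, skb);
		work_done++;
	}

	if (work_done < budget)
		napi_complete(napi);
	return work_done;
}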

* Re: [PATCH net-next 3/3] ipv4: gre: add GRO capability
  2012-09-27 12:48 [PATCH net-next 3/3] ipv4: gre: add GRO capability Eric Dumazet
@ 2012-09-27 17:52 ` Jesse Gross
  2012-09-27 18:08   ` Eric Dumazet
  2012-10-01 21:04 ` David Miller
  1 sibling, 1 reply; 41+ messages in thread
From: Jesse Gross @ 2012-09-27 17:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Thu, Sep 27, 2012 at 5:48 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Add GRO capability to IPv4 GRE tunnels, using the gro_cells
> infrastructure.
>
> Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
> checking GRO is building large packets.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

When I was thinking about doing this, my original plan was to handle
GRO/GSO by extending the current handlers to be able to look inside
GRE and then loop around to process the inner packet (similar to what
is done today with skb_flow_dissect() for RPS).  Is there a reason to
do it in the device?

Pushing it earlier/later in the stack obviously increases the benefit
and it will also be more compatible with the forthcoming OVS tunneling
hooks, which will be flow based and therefore won't have a device.

Also, the next generation of NICs will support this type of thing in
hardware so putting the software versions very close to the NIC will
give us a more similar abstraction.


* Re: [PATCH net-next 3/3] ipv4: gre: add GRO capability
  2012-09-27 17:52 ` Jesse Gross
@ 2012-09-27 18:08   ` Eric Dumazet
  2012-09-27 18:19     ` Eric Dumazet
  2012-09-27 22:03     ` [PATCH net-next 3/3] ipv4: gre: add GRO capability Jesse Gross
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2012-09-27 18:08 UTC (permalink / raw)
  To: Jesse Gross; +Cc: David Miller, netdev

On Thu, 2012-09-27 at 10:52 -0700, Jesse Gross wrote:

> When I was thinking about doing this, my original plan was to handle
> GRO/GSO by extending the current handlers to be able to look inside
> GRE and then loop around to process the inner packet (similar to what
> is done today with skb_flow_dissect() for RPS).  Is there a reason to
> do it in the device?
> 
> Pushing it earlier/later in the stack obviously increases the benefit
> and it will also be more compatible with the forthcoming OVS tunneling
> hooks, which will be flow based and therefore won't have a device.
> 
> Also, the next generation of NICs will support this type of thing in
> hardware so putting the software versions very close to the NIC will
> give us a more similar abstraction.

This sounds not feasible with all kind of tunnels, for example IPIP
tunnels, or UDP encapsulation, at least with current stack (not OVS)

Also note that pushing earlier means forcing the checksumming earlier
and it consumes a lot of cpu cycles. Hopefully NIC will help us in the
future.

Using a napi_struct permits to eventually have separate cpus, and things
like RPS/RSS to split the load.


* Re: [PATCH net-next 3/3] ipv4: gre: add GRO capability
  2012-09-27 18:08   ` Eric Dumazet
@ 2012-09-27 18:19     ` Eric Dumazet
  2012-09-27 22:03       ` Jesse Gross
  2012-09-27 22:03     ` [PATCH net-next 3/3] ipv4: gre: add GRO capability Jesse Gross
  1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-09-27 18:19 UTC (permalink / raw)
  To: Jesse Gross; +Cc: David Miller, netdev

On Thu, 2012-09-27 at 20:08 +0200, Eric Dumazet wrote:

> 
> This sounds not feasible with all kind of tunnels, for example IPIP
> tunnels, or UDP encapsulation, at least with current stack (not OVS)
> 
> Also note that pushing earlier means forcing the checksumming earlier
> and it consumes a lot of cpu cycles. Hopefully NIC will help us in the
> future.
> 
> Using a napi_struct permits to eventually have separate cpus, and things
> like RPS/RSS to split the load.

Also please note that my implementation doesnt bypass first IP stack
traversal (and firewalling if any), so its changing nothing in term
of existing setups.

So packets that should be forwarded will stay as they are (no tunnels
decapsulation/recapsulation)

Doing this in the generic GRO layer sounds a bit difficult.


* Re: [PATCH net-next 3/3] ipv4: gre: add GRO capability
  2012-09-27 18:08   ` Eric Dumazet
  2012-09-27 18:19     ` Eric Dumazet
@ 2012-09-27 22:03     ` Jesse Gross
  1 sibling, 0 replies; 41+ messages in thread
From: Jesse Gross @ 2012-09-27 22:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Thu, Sep 27, 2012 at 11:08 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-27 at 10:52 -0700, Jesse Gross wrote:
>
>> When I was thinking about doing this, my original plan was to handle
>> GRO/GSO by extending the current handlers to be able to look inside
>> GRE and then loop around to process the inner packet (similar to what
>> is done today with skb_flow_dissect() for RPS).  Is there a reason to
>> do it in the device?
>>
>> Pushing it earlier/later in the stack obviously increases the benefit
>> and it will also be more compatible with the forthcoming OVS tunneling
>> hooks, which will be flow based and therefore won't have a device.
>>
>> Also, the next generation of NICs will support this type of thing in
>> hardware so putting the software versions very close to the NIC will
>> give us a more similar abstraction.
>
> This sounds not feasible with all kind of tunnels, for example IPIP
> tunnels, or UDP encapsulation, at least with current stack (not OVS)

Hmm, I think we might be talking about different things since I can't
think of why it wouldn't be feasible (and none of it should be
specific to OVS).  What I was planning would result in the creation of
large but still encapsulated packets.  The merging would be purely
based on the headers in each layer being the same (as GRO is today) so
the logic of the IP stack, UDP stack, etc. isn't processed until
later.

> Also note that pushing earlier means forcing the checksumming earlier
> and it consumes a lot of cpu cycles. Hopefully NIC will help us in the
> future.

It is a good point that if the packet isn't actually destined to us
then probably none of this is worth it (although I suspect that the
relative number of tunnel packets that are passed through vs.
terminated is fairly low).  Many NICs are capable of supplying
CHECKSUM_COMPLETE packets here, even if it is not exposed by the
drivers.

> Using a napi_struct permits to eventually have separate cpus, and things
> like RPS/RSS to split the load.

We should be able to split the load today using RPS since we can look
into the GRE flow once the packet comes off the NIC (assuming that it
is using NAPI).


* Re: [PATCH net-next 3/3] ipv4: gre: add GRO capability
  2012-09-27 18:19     ` Eric Dumazet
@ 2012-09-27 22:03       ` Jesse Gross
  2012-09-28 14:04         ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Jesse Gross @ 2012-09-27 22:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Thu, Sep 27, 2012 at 11:19 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-27 at 20:08 +0200, Eric Dumazet wrote:
>
>>
>> This sounds not feasible with all kind of tunnels, for example IPIP
>> tunnels, or UDP encapsulation, at least with current stack (not OVS)
>>
>> Also note that pushing earlier means forcing the checksumming earlier
>> and it consumes a lot of cpu cycles. Hopefully NIC will help us in the
>> future.
>>
>> Using a napi_struct permits to eventually have separate cpus, and things
>> like RPS/RSS to split the load.
>
> Also please note that my implementation doesnt bypass first IP stack
> traversal (and firewalling if any), so its changing nothing in term
> of existing setups.
>
> So packets that should be forwarded will stay as they are (no tunnels
> decapsulation/recapsulation)
>
> Doing this in the generic GRO layer sounds a bit difficult.

We wouldn't actually do the decapsulation at the point of GRO.  This
is actually pretty similar to what we do with TCP - we merge TCP
payloads even though we haven't done any real IP processing yet.
However, we do check firewall rules later if we actually hit the IP
stack.  GRE would work the same way in this case.

What I'm describing is pretty much exactly what NICs will be doing, so
if that doesn't work we'll have a problem...


* Re: [PATCH net-next 3/3] ipv4: gre: add GRO capability
  2012-09-27 22:03       ` Jesse Gross
@ 2012-09-28 14:04         ` Eric Dumazet
  2012-10-01 20:56           ` Jesse Gross
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-09-28 14:04 UTC (permalink / raw)
  To: Jesse Gross; +Cc: David Miller, netdev

On Thu, 2012-09-27 at 15:03 -0700, Jesse Gross wrote:

> We wouldn't actually do the decapsulation at the point of GRO.  This
> is actually pretty similar to what we do with TCP - we merge TCP
> payloads even though we haven't done any real IP processing yet.
> However, we do check firewall rules later if we actually hit the IP
> stack.  GRE would work the same way in this case.
> 
> What I'm describing is pretty much exactly what NICs will be doing, so
> if that doesn't work we'll have a problem...

GRO ability to truly aggregate data is kind of limited to some
workloads. How NICs will handle interleaved flows I dont really know.

What you describe needs a serious GRO preliminary work, because it
depends on napi_gro_flush() being called from time to time, while we
need something else, more fine grained.

(I am pretty sure GRO needs some love from us, it looks like some
packets can stay a long time in gro_list. It would be nice if it was
able to reorder packets (from same flow) as well)

Anyway, my changes are self-contained in a new file and non intrusive.

As soon as we can provide a better alternative we can revert them ?

Thanks


* Re: [PATCH net-next 3/3] ipv4: gre: add GRO capability
  2012-09-28 14:04         ` Eric Dumazet
@ 2012-10-01 20:56           ` Jesse Gross
  2012-10-05 14:52             ` [RFC] GRO scalability Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Jesse Gross @ 2012-10-01 20:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

On Fri, Sep 28, 2012 at 7:04 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2012-09-27 at 15:03 -0700, Jesse Gross wrote:
>
>> We wouldn't actually do the decapsulation at the point of GRO.  This
>> is actually pretty similar to what we do with TCP - we merge TCP
>> payloads even though we haven't done any real IP processing yet.
>> However, we do check firewall rules later if we actually hit the IP
>> stack.  GRE would work the same way in this case.
>>
>> What I'm describing is pretty much exactly what NICs will be doing, so
>> if that doesn't work we'll have a problem...
>
> GRO ability to truly aggregate data is kind of limited to some
> workloads. How NICs will handle interleaved flows I dont really know.
>
> What you describe needs a serious GRO preliminary work, because it
> depends on napi_gro_flush() being called from time to time, while we
> need something else, more fine grained.
>
> (I am pretty sure GRO needs some love from us, it looks like some
> packets can stay a long time in gro_list. It would be nice if it was
> able to reorder packets (from same flow) as well)

It's definitely possible to improve GRO in a couple of areas.  I'm not
quite sure why you say that these changes are related to tunnels
though, since they're not really different from say, a VLAN tag.

> Anyway, my changes are self-contained in a new file and non intrusive.
>
> As soon as we can provide a better alternative we can revert them ?

Sure, I don't have a problem with your patches for now.  I was just
trying to think about different approaches.


* Re: [PATCH net-next 3/3] ipv4: gre: add GRO capability
  2012-09-27 12:48 [PATCH net-next 3/3] ipv4: gre: add GRO capability Eric Dumazet
  2012-09-27 17:52 ` Jesse Gross
@ 2012-10-01 21:04 ` David Miller
  1 sibling, 0 replies; 41+ messages in thread
From: David Miller @ 2012-10-01 21:04 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 27 Sep 2012 14:48:50 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> Add GRO capability to IPv4 GRE tunnels, using the gro_cells
> infrastructure.
> 
> Tested using IPv4 and IPv6 TCP traffic inside this tunnel, and
> checking GRO is building large packets.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied.


* [RFC] GRO scalability
  2012-10-01 20:56           ` Jesse Gross
@ 2012-10-05 14:52             ` Eric Dumazet
  2012-10-05 18:16               ` Rick Jones
  2012-10-06  4:11               ` Herbert Xu
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2012-10-05 14:52 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Jesse Gross

Current GRO cell is somewhat limited :

- It uses a single list (napi->gro_list) of pending skbs

- This list has a limit of 8 skbs (MAX_GRO_SKBS)

- Workloads with lot of concurrent flows have small GRO hit rate but
  pay high overhead (in inet_gro_receive())

- Increasing MAX_GRO_SKBS is not an option, because GRO
  overhead becomes too high.

- Packets can stay a long time held in GRO cell (there is
  no flush if napi never completes on a stressed cpu)

  Some elephant flows can stall interactive ones (if we receive
  flood of non TCP frames, we dont flush tcp packets waiting in
gro_list)

What we could do :

1) Use a hash to avoid expensive gro_list management and allow
   much more concurrent flows.

Use skb_get_rxhash(skb) to compute rxhash

If l4_rxhash not set -> not a GRO candidate.

If l4_rxhash set, use a hash lookup to immediately finds a 'same flow'
candidates.

(tcp stack could eventually use rxhash instead of its custom hash
computation ...)

2) Use a LRU list to eventually be able to 'flush' too old packets,
   even if the napi never completes. Each time we process a new packet,
   being a GRO candidate or not, we increment a napi->sequence, and we
   flush the oldest packet in gro_lru_list if its own sequence is too
   old.

  That would give a latency guarantee.
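
To make 1) and 2) concrete, a rough sketch of the bookkeeping involved
(hypothetical structures and names, not an existing kernel API):

#define GRO_HASH_SIZE	128	/* buckets of held skbs, keyed by rxhash */

struct gro_unit {
	struct sk_buff_head	hash[GRO_HASH_SIZE];
	struct list_head	lru;		/* held skbs, oldest first */
	unsigned long		sequence;	/* bumped for every processed packet */
};

/* 1) candidate lookup: only l4-hashed packets are worth trying to merge */
static struct sk_buff_head *gro_unit_bucket(struct gro_unit *u,
					    struct sk_buff *skb)
{
	u32 hash = skb_get_rxhash(skb);

	if (!skb->l4_rxhash)
		return NULL;			/* not a GRO candidate */
	return &u->hash[hash & (GRO_HASH_SIZE - 1)];
}

/* 2) aging: on every packet, the head of u->lru would be flushed if the
 * sequence recorded when it was queued is too far behind u->sequence. */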


* Re: [RFC] GRO scalability
  2012-10-05 14:52             ` [RFC] GRO scalability Eric Dumazet
@ 2012-10-05 18:16               ` Rick Jones
  2012-10-05 19:00                 ` Eric Dumazet
  2012-10-06  4:11               ` Herbert Xu
  1 sibling, 1 reply; 41+ messages in thread
From: Rick Jones @ 2012-10-05 18:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On 10/05/2012 07:52 AM, Eric Dumazet wrote:
> What we could do :
>
> 1) Use a hash to avoid expensive gro_list management and allow
>     much more concurrent flows.
>
> Use skb_get_rxhash(skb) to compute rxhash
>
> If l4_rxhash not set -> not a GRO candidate.
>
> If l4_rxhash set, use a hash lookup to immediately finds a 'same flow'
> candidates.
>
> (tcp stack could eventually use rxhash instead of its custom hash
> computation ...)
>
> 2) Use a LRU list to eventually be able to 'flush' too old packets,
>     even if the napi never completes. Each time we process a new packet,
>     being a GRO candidate or not, we increment a napi->sequence, and we
>     flush the oldest packet in gro_lru_list if its own sequence is too
>     old.
>
>    That would give a latency guarantee.

Flushing things if N packets have come though sounds like goodness, and 
it reminds me a bit about what happens with IP fragment reassembly - 
another area where the stack is trying to guess just how long to 
hang-onto a packet before doing something else with it.  But the value 
of N to get a "decent" per-flow GRO aggregation rate will depend on the 
number of concurrent flows right?  If I want to have a good shot at 
getting 2 segments combined for 1000 active, concurrent flows entering 
my system via that interface, won't N have to approach 2000?

GRO (and HW LRO) has a fundamental limitation/disadvantage here.  GRO 
does provide a very nice "boost" on various situations (especially 
numbers of concurrent netperfs that don't blow-out the tracking limits) 
but since it won't really know anything about the flow(s) involved (*) 
or even their number (?), it will always be guessing.  That is why it is 
really only "poor man's JumboFrames" (or larger MTU - Sadly, the IEEE 
keeps us all beggars here).

A goodly portion of the benefit of GRO comes from the "incidental" ACK 
avoidance it causes yes?  That being the case, might that be a 
worthwhile avenue to explore?   It would then naturally scale as TCP et 
al do today.

When we go to 40 GbE will we have 4x as many flows, or the same number 
of 4x faster flows?

rick jones

* for example - does this TCP segment contain the last byte(s) of a 
pipelined http request/response and the first byte(s) of the next one 
and so should "flush" now?


* Re: [RFC] GRO scalability
  2012-10-05 18:16               ` Rick Jones
@ 2012-10-05 19:00                 ` Eric Dumazet
  2012-10-05 19:35                   ` Rick Jones
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-05 19:00 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On Fri, 2012-10-05 at 11:16 -0700, Rick Jones wrote:
> Flushing things if N packets have come though sounds like goodness, and 
> it reminds me a bit about what happens with IP fragment reassembly - 
> another area where the stack is trying to guess just how long to 
> hang-onto a packet before doing something else with it.  But the value 
> of N to get a "decent" per-flow GRO aggregation rate will depend on the 
> number of concurrent flows right?  If I want to have a good shot at 
> getting 2 segments combined for 1000 active, concurrent flows entering 
> my system via that interface, won't N have to approach 2000?
> 

It all depends on the max latency you can afford.

> GRO (and HW LRO) has a fundamental limitation/disadvantage here.  GRO 
> does provide a very nice "boost" on various situations (especially 
> numbers of concurrent netperfs that don't blow-out the tracking limits) 
> but since it won't really know anything about the flow(s) involved (*) 
> or even their number (?), it will always be guessing.  That is why it is 
> really only "poor man's JumboFrames" (or larger MTU - Sadly, the IEEE 
> keeps us all beggars here).
> 
> A goodly portion of the benefit of GRO comes from the "incidental" ACK 
> avoidance it causes yes?  That being the case, might that be a 
> worthwhile avenue to explore?   It would then naturally scale as TCP et 
> al do today.
> 
> When we go to 40 GbE will we have 4x as many flows, or the same number 
> of 4x faster flows?
> 
> rick jones
> 
> * for example - does this TCP segment contain the last byte(s) of a 
> pipelined http request/response and the first byte(s) of the next one 
> and so should "flush" now?

Some remarks :

1) I use some 40Gbe links, thats probably why I try to improve things ;)

2) benefit of GRO can be huge, and not only for the ACK avoidance
   (other tricks could be done for ACK avoidance in the stack)

3) High speeds probably need multiqueue device, and each queue has its
own GRO unit.

  For example on a 40Gbe, 8 queues -> 5Gbps per queue (about 400k
packets/sec)

Lets say we allow no more than 1ms of delay in GRO, this means we could
have about 400 packets in the GRO queue (assuming 1500 bytes packets)
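
Spelling out the arithmetic behind those figures (assuming 1500-byte
frames and ignoring header overhead):

  40 Gbit/s / 8 queues           = 5 Gbit/s per queue
  5e9 bit/s / (1500 * 8 bit)     ~ 416,000 packets/sec (about 400k)
  416,000 packets/sec * 1 ms     ~ 400 packets held in GRO at most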

Another idea to play with would be to extend GRO to allow packet
reorder.


* Re: [RFC] GRO scalability
  2012-10-05 19:00                 ` Eric Dumazet
@ 2012-10-05 19:35                   ` Rick Jones
  2012-10-05 20:06                     ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Rick Jones @ 2012-10-05 19:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On 10/05/2012 12:00 PM, Eric Dumazet wrote:
> On Fri, 2012-10-05 at 11:16 -0700, Rick Jones wrote:
>
> Some remarks :
>
> 1) I use some 40Gbe links, thats probably why I try to improve things ;)

Path length before workarounds :)

> 2) benefit of GRO can be huge, and not only for the ACK avoidance
>     (other tricks could be done for ACK avoidance in the stack)

Just how much code path is there between NAPI and the socket?? (And I 
guess just how much combining are you hoping for?)

> 3) High speeds probably need multiqueue device, and each queue has its
> own GRO unit.
>
>    For example on a 40Gbe, 8 queues -> 5Gbps per queue (about 400k
> packets/sec)
>
> Lets say we allow no more than 1ms of delay in GRO,

OK.  That means we can ignore HPC and FSI because they wouldn't tolerate 
that kind of added delay anyway.  I'm not sure if that also then 
eliminates the networked storage types.

> this means we could have about 400 packets in the GRO queue (assuming
> 1500 bytes packets)

How many flows are you going to have entering via that queue?  And just 
how well "shuffled" will the segments of those flows be?  That is what 
it all comes down to right?  How many (active) flows and how well 
shuffled they are.  If the flows aren't well shuffled, you can get away 
with a smallish coalescing context.  If they are perfectly shuffled and 
greater in number than your delay allowance you get right back to square 
with all the overhead of GRO attempts with none of the benefit.

If the flow count is < 400 to allow a decent shot at a non-zero 
combining rate on well shuffled flows with the 400 packet limit, then 
that means each flow is >= 12.5 Mbit/s on average at 5 Gbit/s 
aggregated.  And I think you then get two segments per flow aggregated 
at a time.  Is that consistent with what you expect to be the 
characteristics of the flows entering via that queue?

rick jones


* Re: [RFC] GRO scalability
  2012-10-05 19:35                   ` Rick Jones
@ 2012-10-05 20:06                     ` Eric Dumazet
  2012-10-08 16:40                       ` Rick Jones
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-05 20:06 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:

> Just how much code path is there between NAPI and the socket?? (And I 
> guess just how much combining are you hoping for?)
> 

When GRO correctly works, you can save about 30% of cpu cycles, it
depends...

Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
gives an improvement as well...

> > Lets say we allow no more than 1ms of delay in GRO,
> 
> OK.  That means we can ignore HPC and FSI because they wouldn't tolerate 
> that kind of added delay anyway.  I'm not sure if that also then 
> eliminates the networked storage types.
> 

I took this 1ms delay, but I never said it was a fixed value ;)

Also remember one thing, this is the _max_ delay in case your napi
handler is flooded. This almost never happen (tm)


> > this means we could have about 400 packets in the GRO queue (assuming
> > 1500 bytes packets)
> 
> How many flows are you going to have entering via that queue?  And just 
> how well "shuffled" will the segments of those flows be?  That is what 
> it all comes down to right?  How many (active) flows and how well 
> shuffled they are.  If the flows aren't well shuffled, you can get away 
> with a smallish coalescing context.  If they are perfectly shuffled and 
> greater in number than your delay allowance you get right back to square 
> with all the overhead of GRO attempts with none of the benefit.

Not sure what you mean by shuffle. We use a hash table to locate a flow,
but we also have a LRU list to get the packets ordered by their entry in
the 'GRO unit'.

If napi completes, all of the LRU list content is flushed to the IP stack
(napi_gro_flush()).

If napi doesn't complete, we would only flush 'too old' packets found in
the LRU.

Note: this selective flush can be called once per napi run from
net_rx_action(). The extra cost of getting a somewhat precise timestamp
would be acceptable (one call to ktime_get() or get_cycles() every 64
packets).

This timestamp could be stored in napi->timestamp and done once per
n->poll(n, weight) call.
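
Roughly, treating napi->timestamp as a new (hypothetical) field:

/* sketch only: one clock read per poll instead of per packet; the
 * "timestamp" below stands for the hypothetical napi->timestamp field */
struct napi_stamped {
	struct napi_struct	napi;
	ktime_t			timestamp;
};

static int napi_poll_and_stamp(struct napi_stamped *n, int weight)
{
	int work = n->napi.poll(&n->napi, weight);

	n->timestamp = ktime_get();
	return work;
}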

> 
> If the flow count is < 400 to allow a decent shot at a non-zero 
> combining rate on well shuffled flows with the 400 packet limit, then 
> that means each flow is >= 12.5 Mbit/s on average at 5 Gbit/s 
> aggregated.  And I think you then get two segments per flow aggregated 
> at a time.  Is that consistent with what you expect to be the 
> characteristics of the flows entering via that queue?

If a packet can't stay more than 1 ms, then a flow sending fewer than
1000 packets per second won't benefit from GRO.

So yes, 12.5 Mbit/s would be the threshold.

By the way, when TCP timestamps are used and the hosts are Linux machines
with HZ=1000, current GRO cannot coalesce the packets anyway because their
TCP options are different.

(So it would not be useful to try a sojourn time larger than 1 ms.)

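The check that gets in the way is the option comparison in
tcp_gro_receive(): paraphrased (not a verbatim quote), any 32-bit word
that differs between the held TCP header and the new one prevents the
merge, so a TSval ticking every ms is enough to defeat coalescing:

static u32 tcp_gro_options_differ(struct tcphdr *th, struct tcphdr *th2,
				  unsigned int thlen)
{
	u32 diff = 0;
	unsigned int i;

	/* compare the option area word by word; non-zero means "don't merge" */
	for (i = sizeof(*th); i < thlen; i += 4)
		diff |= *(u32 *)((u8 *)th + i) ^ *(u32 *)((u8 *)th2 + i);

	return diff;
}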

* Re: [RFC] GRO scalability
  2012-10-05 14:52             ` [RFC] GRO scalability Eric Dumazet
  2012-10-05 18:16               ` Rick Jones
@ 2012-10-06  4:11               ` Herbert Xu
  2012-10-06  5:08                 ` Eric Dumazet
  1 sibling, 1 reply; 41+ messages in thread
From: Herbert Xu @ 2012-10-06  4:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Jesse Gross

On Fri, Oct 05, 2012 at 04:52:27PM +0200, Eric Dumazet wrote:
> Current GRO cell is somewhat limited :
> 
> - It uses a single list (napi->gro_list) of pending skbs
> 
> - This list has a limit of 8 skbs (MAX_GRO_SKBS)
> 
> - Workloads with lot of concurrent flows have small GRO hit rate but
>   pay high overhead (in inet_gro_receive())
> 
> - Increasing MAX_GRO_SKBS is not an option, because GRO
>   overhead becomes too high.

Yeah these were all meant to be addressed at some point.

> - Packets can stay a long time held in GRO cell (there is
>   no flush if napi never completes on a stressed cpu)

This should never happen though.  NAPI runs must always be
punctuated just to guarantee one card never hogs a CPU.  Which
driver causes these behaviour?

>   Some elephant flows can stall interactive ones (if we receive
>   flood of non TCP frames, we dont flush tcp packets waiting in
> gro_list)

Again this should never be a problem given the natural limit
on backlog processing.

> What we could do :
> 
> 1) Use a hash to avoid expensive gro_list management and allow
>    much more concurrent flows.
> 
> Use skb_get_rxhash(skb) to compute rxhash
> 
> If l4_rxhash not set -> not a GRO candidate.
> 
> If l4_rxhash set, use a hash lookup to immediately finds a 'same flow'
> candidates.
> 
> (tcp stack could eventually use rxhash instead of its custom hash
> computation ...)

Sounds good to me.

> 2) Use a LRU list to eventually be able to 'flush' too old packets,
>    even if the napi never completes. Each time we process a new packet,
>    being a GRO candidate or not, we increment a napi->sequence, and we
>    flush the oldest packet in gro_lru_list if its own sequence is too
>    old.
> 
>   That would give a latency guarantee.

I don't think this should ever be necessary.  IOW, if we need this
for GRO, then it means that we also need it for NAPI for the exact
same reasons.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC] GRO scalability
  2012-10-06  4:11               ` Herbert Xu
@ 2012-10-06  5:08                 ` Eric Dumazet
  2012-10-06  5:14                   ` Herbert Xu
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-06  5:08 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Jesse Gross

Le samedi 06 octobre 2012 à 12:11 +0800, Herbert Xu a écrit :
> On Fri, Oct 05, 2012 at 04:52:27PM +0200, Eric Dumazet wrote:
> > Current GRO cell is somewhat limited :
> > 
> > - It uses a single list (napi->gro_list) of pending skbs
> > 
> > - This list has a limit of 8 skbs (MAX_GRO_SKBS)
> > 
> > - Workloads with lot of concurrent flows have small GRO hit rate but
> >   pay high overhead (in inet_gro_receive())
> > 
> > - Increasing MAX_GRO_SKBS is not an option, because GRO
> >   overhead becomes too high.
> 
> Yeah these were all meant to be addressed at some point.
> 
> > - Packets can stay a long time held in GRO cell (there is
> >   no flush if napi never completes on a stressed cpu)
> 
> This should never happen though.  NAPI runs must always be
> punctuated just to guarantee one card never hogs a CPU.  Which
> driver causes these behaviour?

I believe its a generic issue, not specific to a driver.

napi_gro_flush() is only called from napi_complete() 

Some drivers (marvell/skge.c & realtek/8139cp.c) calls it only because
they 'inline' napi_complete()


* Re: [RFC] GRO scalability
  2012-10-06  5:08                 ` Eric Dumazet
@ 2012-10-06  5:14                   ` Herbert Xu
  2012-10-06  6:22                     ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Herbert Xu @ 2012-10-06  5:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Jesse Gross

On Sat, Oct 06, 2012 at 07:08:46AM +0200, Eric Dumazet wrote:
> Le samedi 06 octobre 2012 à 12:11 +0800, Herbert Xu a écrit :
> > On Fri, Oct 05, 2012 at 04:52:27PM +0200, Eric Dumazet wrote:
> > > Current GRO cell is somewhat limited :
> > > 
> > > - It uses a single list (napi->gro_list) of pending skbs
> > > 
> > > - This list has a limit of 8 skbs (MAX_GRO_SKBS)
> > > 
> > > - Workloads with lot of concurrent flows have small GRO hit rate but
> > >   pay high overhead (in inet_gro_receive())
> > > 
> > > - Increasing MAX_GRO_SKBS is not an option, because GRO
> > >   overhead becomes too high.
> > 
> > Yeah these were all meant to be addressed at some point.
> > 
> > > - Packets can stay a long time held in GRO cell (there is
> > >   no flush if napi never completes on a stressed cpu)
> > 
> > This should never happen though.  NAPI runs must always be
> > punctuated just to guarantee one card never hogs a CPU.  Which
> > driver causes these behaviour?
> 
> I believe its a generic issue, not specific to a driver.
> 
> napi_gro_flush() is only called from napi_complete() 
> 
> Some drivers (marvell/skge.c & realtek/8139cp.c) calls it only because
> they 'inline' napi_complete()

So which driver has the potential of never doing napi_gro_flush?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC] GRO scalability
  2012-10-06  5:14                   ` Herbert Xu
@ 2012-10-06  6:22                     ` Eric Dumazet
  2012-10-06  7:00                       ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-06  6:22 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Jesse Gross

On Sat, 2012-10-06 at 13:14 +0800, Herbert Xu wrote:
> On Sat, Oct 06, 2012 at 07:08:46AM +0200, Eric Dumazet wrote:
> > Le samedi 06 octobre 2012 à 12:11 +0800, Herbert Xu a écrit :
> > > On Fri, Oct 05, 2012 at 04:52:27PM +0200, Eric Dumazet wrote:
> > > > Current GRO cell is somewhat limited :
> > > > 
> > > > - It uses a single list (napi->gro_list) of pending skbs
> > > > 
> > > > - This list has a limit of 8 skbs (MAX_GRO_SKBS)
> > > > 
> > > > - Workloads with lot of concurrent flows have small GRO hit rate but
> > > >   pay high overhead (in inet_gro_receive())
> > > > 
> > > > - Increasing MAX_GRO_SKBS is not an option, because GRO
> > > >   overhead becomes too high.
> > > 
> > > Yeah these were all meant to be addressed at some point.
> > > 
> > > > - Packets can stay a long time held in GRO cell (there is
> > > >   no flush if napi never completes on a stressed cpu)
> > > 
> > > This should never happen though.  NAPI runs must always be
> > > punctuated just to guarantee one card never hogs a CPU.  Which
> > > driver causes these behaviour?
> > 
> > I believe its a generic issue, not specific to a driver.
> > 
> > napi_gro_flush() is only called from napi_complete() 
> > 
> > Some drivers (marvell/skge.c & realtek/8139cp.c) calls it only because
> > they 'inline' napi_complete()
> 
> So which driver has the potential of never doing napi_gro_flush?

All drivers.

If the napi->poll() consumes all budget, we dont call napi_complete()


* Re: [RFC] GRO scalability
  2012-10-06  6:22                     ` Eric Dumazet
@ 2012-10-06  7:00                       ` Eric Dumazet
  2012-10-06 10:56                         ` Herbert Xu
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-06  7:00 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Jesse Gross

On Sat, 2012-10-06 at 08:22 +0200, Eric Dumazet wrote:

> All drivers.
> 
> If the napi->poll() consumes all budget, we dont call napi_complete()

Probably nobody noticed, because TCP stack retransmits packets
eventually.

But this adds unexpected latencies.

I'll send a patch.


* Re: [RFC] GRO scalability
  2012-10-06  7:00                       ` Eric Dumazet
@ 2012-10-06 10:56                         ` Herbert Xu
  2012-10-06 18:08                           ` [PATCH] net: gro: selective flush of packets Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Herbert Xu @ 2012-10-06 10:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Jesse Gross

On Sat, Oct 06, 2012 at 09:00:25AM +0200, Eric Dumazet wrote:
> On Sat, 2012-10-06 at 08:22 +0200, Eric Dumazet wrote:
> 
> > All drivers.
> > 
> > If the napi->poll() consumes all budget, we dont call napi_complete()
> 
> Probably nobody noticed, because TCP stack retransmits packets
> eventually.
> 
> But this adds unexpected latencies.
> 
> I'll send a patch.

Yeah we should definitely do a flush in that case.  Thanks!
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* [PATCH] net: gro: selective flush of packets
  2012-10-06 10:56                         ` Herbert Xu
@ 2012-10-06 18:08                           ` Eric Dumazet
  2012-10-07  0:32                             ` Herbert Xu
  2012-10-08 18:52                             ` David Miller
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2012-10-06 18:08 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Jesse Gross, Tom Herbert, Yuchung Cheng

From: Eric Dumazet <edumazet@google.com>

Current GRO can hold packets in gro_list for an almost unlimited
time, if the napi->poll() handler consumes its budget over and over.

In this case, napi_complete()/napi_gro_flush() are not called.

Another problem is that gro_list is flushed in an unfriendly way:
we scan the list and complete packets in reverse order
(youngest packets first, oldest packets last).
This defeats any prioritization the sender may have arranged.

Since GRO currently only stores TCP packets, we don't really notice the
bug because of retransmits, but this behavior can add unexpected
latencies, particularly for mice flows clamped by elephant flows.

This patch makes sure no packet can stay more than 1 ms in the queue,
a limit that only comes into play in stress situations.

It also completes packets in the right order to minimize latencies.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jesse Gross <jesse@nicira.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
---
 drivers/net/ethernet/marvell/skge.c   |    2 -
 drivers/net/ethernet/realtek/8139cp.c |    2 -
 include/linux/netdevice.h             |   15 +++++----
 net/core/dev.c                        |   38 +++++++++++++++++++-----
 4 files changed, 42 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/marvell/skge.c b/drivers/net/ethernet/marvell/skge.c
index 5a30bf8..eb3cba0 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -3189,7 +3189,7 @@ static int skge_poll(struct napi_struct *napi, int to_do)
 	if (work_done < to_do) {
 		unsigned long flags;
 
-		napi_gro_flush(napi);
+		napi_gro_flush(napi, false);
 		spin_lock_irqsave(&hw->hw_lock, flags);
 		__napi_complete(napi);
 		hw->intr_mask |= napimask[skge->port];
diff --git a/drivers/net/ethernet/realtek/8139cp.c b/drivers/net/ethernet/realtek/8139cp.c
index 995d0cf..1c81825 100644
--- a/drivers/net/ethernet/realtek/8139cp.c
+++ b/drivers/net/ethernet/realtek/8139cp.c
@@ -563,7 +563,7 @@ rx_next:
 		if (cpr16(IntrStatus) & cp_rx_intr_mask)
 			goto rx_status_loop;
 
-		napi_gro_flush(napi);
+		napi_gro_flush(napi, false);
 		spin_lock_irqsave(&cp->lock, flags);
 		__napi_complete(napi);
 		cpw16_f(IntrMask, cp_intr_mask);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 01646aa..63efc28 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1497,19 +1497,22 @@ struct napi_gro_cb {
 	/* This indicates where we are processing relative to skb->data. */
 	int data_offset;
 
-	/* This is non-zero if the packet may be of the same flow. */
-	int same_flow;
-
 	/* This is non-zero if the packet cannot be merged with the new skb. */
 	int flush;
 
 	/* Number of segments aggregated. */
-	int count;
+	u16	count;
+
+	/* This is non-zero if the packet may be of the same flow. */
+	u8	same_flow;
 
 	/* Free the skb? */
-	int free;
+	u8	free;
 #define NAPI_GRO_FREE		  1
 #define NAPI_GRO_FREE_STOLEN_HEAD 2
+
+	/* jiffies when first packet was created/queued */
+	unsigned long age;
 };
 
 #define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
@@ -2157,7 +2160,7 @@ extern gro_result_t	dev_gro_receive(struct napi_struct *napi,
 extern gro_result_t	napi_skb_finish(gro_result_t ret, struct sk_buff *skb);
 extern gro_result_t	napi_gro_receive(struct napi_struct *napi,
 					 struct sk_buff *skb);
-extern void		napi_gro_flush(struct napi_struct *napi);
+extern void		napi_gro_flush(struct napi_struct *napi, bool flush_old);
 extern struct sk_buff *	napi_get_frags(struct napi_struct *napi);
 extern gro_result_t	napi_frags_finish(struct napi_struct *napi,
 					  struct sk_buff *skb,
diff --git a/net/core/dev.c b/net/core/dev.c
index 1e0a184..8c960e8 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3471,17 +3471,31 @@ out:
 	return netif_receive_skb(skb);
 }
 
-inline void napi_gro_flush(struct napi_struct *napi)
+/* napi->gro_list contains packets ordered by age.
+ * youngest packets at the head of it.
+ * Complete skbs in reverse order to reduce latencies.
+ */
+void napi_gro_flush(struct napi_struct *napi, bool flush_old)
 {
-	struct sk_buff *skb, *next;
+	struct sk_buff *skb, *prev = NULL;
 
-	for (skb = napi->gro_list; skb; skb = next) {
-		next = skb->next;
+	/* scan list and build reverse chain */
+	for (skb = napi->gro_list; skb != NULL; skb = skb->next) {
+		skb->prev = prev;
+		prev = skb;
+	}
+
+	for (skb = prev; skb; skb = prev) {
 		skb->next = NULL;
+
+		if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
+			return;
+
+		prev = skb->prev;
 		napi_gro_complete(skb);
+		napi->gro_count--;
 	}
 
-	napi->gro_count = 0;
 	napi->gro_list = NULL;
 }
 EXPORT_SYMBOL(napi_gro_flush);
@@ -3542,6 +3556,7 @@ enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 
 	napi->gro_count++;
 	NAPI_GRO_CB(skb)->count = 1;
+	NAPI_GRO_CB(skb)->age = jiffies;
 	skb_shinfo(skb)->gso_size = skb_gro_len(skb);
 	skb->next = napi->gro_list;
 	napi->gro_list = skb;
@@ -3876,7 +3891,7 @@ void napi_complete(struct napi_struct *n)
 	if (unlikely(test_bit(NAPI_STATE_NPSVC, &n->state)))
 		return;
 
-	napi_gro_flush(n);
+	napi_gro_flush(n, false);
 	local_irq_save(flags);
 	__napi_complete(n);
 	local_irq_restore(flags);
@@ -3981,8 +3996,17 @@ static void net_rx_action(struct softirq_action *h)
 				local_irq_enable();
 				napi_complete(n);
 				local_irq_disable();
-			} else
+			} else {
+				if (n->gro_list) {
+					/* flush too old packets
+					 * If HZ < 1000, flush all packets.
+					 */
+					local_irq_enable();
+					napi_gro_flush(n, HZ >= 1000);
+					local_irq_disable();
+				}
 				list_move_tail(&n->poll_list, &sd->poll_list);
+			}
 		}
 
 		netpoll_poll_unlock(have);
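
To spell out the aging rule above: napi_gro_flush() now walks the held
packets oldest-first and stops at the first skb queued during the current
jiffy, so with HZ >= 1000 nothing sits in gro_list for much more than a
millisecond; with HZ < 1000 a single tick already exceeds that budget, so
net_rx_action() passes flush_old = false and flushes everything. As an
illustrative helper (not part of the patch):

static bool gro_skb_too_old(struct sk_buff *skb)
{
	/* queued during an earlier jiffy -> roughly 1 ms old or more */
	return NAPI_GRO_CB(skb)->age != jiffies;
}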


* Re: [PATCH] net: gro: selective flush of packets
  2012-10-06 18:08                           ` [PATCH] net: gro: selective flush of packets Eric Dumazet
@ 2012-10-07  0:32                             ` Herbert Xu
  2012-10-07  5:29                               ` Eric Dumazet
  2012-10-08 18:52                             ` David Miller
  1 sibling, 1 reply; 41+ messages in thread
From: Herbert Xu @ 2012-10-07  0:32 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Jesse Gross, Tom Herbert, Yuchung Cheng

On Sat, Oct 06, 2012 at 08:08:49PM +0200, Eric Dumazet wrote:
>
> @@ -3981,8 +3996,17 @@ static void net_rx_action(struct softirq_action *h)
>  				local_irq_enable();
>  				napi_complete(n);
>  				local_irq_disable();
> -			} else
> +			} else {
> +				if (n->gro_list) {
> +					/* flush too old packets
> +					 * If HZ < 1000, flush all packets.
> +					 */
> +					local_irq_enable();
> +					napi_gro_flush(n, HZ >= 1000);
> +					local_irq_disable();
> +				}
>  				list_move_tail(&n->poll_list, &sd->poll_list);
> +			}

Why don't we just always flush everything?

Thanks,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [PATCH] net: gro: selective flush of packets
  2012-10-07  0:32                             ` Herbert Xu
@ 2012-10-07  5:29                               ` Eric Dumazet
  2012-10-08  7:39                                 ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-07  5:29 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Jesse Gross, Tom Herbert, Yuchung Cheng

	On Sun, 2012-10-07 at 08:32 +0800, Herbert Xu wrote:
> On Sat, Oct 06, 2012 at 08:08:49PM +0200, Eric Dumazet wrote:
> >
> > @@ -3981,8 +3996,17 @@ static void net_rx_action(struct softirq_action *h)
> >  				local_irq_enable();
> >  				napi_complete(n);
> >  				local_irq_disable();
> > -			} else
> > +			} else {
> > +				if (n->gro_list) {
> > +					/* flush too old packets
> > +					 * If HZ < 1000, flush all packets.
> > +					 */
> > +					local_irq_enable();
> > +					napi_gro_flush(n, HZ >= 1000);
> > +					local_irq_disable();
> > +				}
> >  				list_move_tail(&n->poll_list, &sd->poll_list);
> > +			}
> 
> Why don't we just always flush everything?

This is what I tried first, but it lowered performance on several
typical workloads.

Using this simple heuristic increases performance.


* Re: [PATCH] net: gro: selective flush of packets
  2012-10-07  5:29                               ` Eric Dumazet
@ 2012-10-08  7:39                                 ` Eric Dumazet
  2012-10-08 16:42                                   ` Rick Jones
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-08  7:39 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, Jesse Gross, Tom Herbert, Yuchung Cheng

On Sun, 2012-10-07 at 07:29 +0200, Eric Dumazet wrote:
> 	On Sun, 2012-10-07 at 08:32 +0800, Herbert Xu wrote:

> > Why don't we just always flush everything?
> 

> This is what I tried first, but it lowered performance on several
> typical workloads.
> 
> Using this simple heuristic increases performance.
> 
> 

By the way, one of the beauty of GRO is it helps under load to aggregate
packets and reduce cpu load. People wanting very low latencies should
probably not use GRO, and if they use it or not, receiving a full 64
packets batch on a particular NIC makes latencies very unpredictable.

So if we consumed all budget in a napi->poll() handler, its because we
are under load and we dont really want to cancel GRO aggregation.

The next napi->poll() invocation will have more chances to coalesce frames.

If there is only one flow, it's OK because a 64 packet window allows ~4
GRO super packets to be built regardless of an unconditional flush, but
with 8 flows it would roughly give a 100% increase in GRO packets sent to
the upper layers.

The only safety measure needed is to make sure we don't hold packets for
too long in case we never complete napi; this is what this patch does,
with an average added latency of 0.5 ms (for slow flows).


* Re: [RFC] GRO scalability
  2012-10-05 20:06                     ` Eric Dumazet
@ 2012-10-08 16:40                       ` Rick Jones
  2012-10-08 16:59                         ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Rick Jones @ 2012-10-08 16:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On 10/05/2012 01:06 PM, Eric Dumazet wrote:
> On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:
>
>> Just how much code path is there between NAPI and the socket?? (And I
>> guess just how much combining are you hoping for?)
>>
>
> When GRO correctly works, you can save about 30% of cpu cycles, it
> depends...
>
> Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
> gives an improvement as well...

OK, but how much of that 30% come from where?  Each coalesced segment is 
saving the cycles between NAPI and the socket.  Each avoided ACK is 
saving the cycles from TCP to the bottom of the driver and a (share of) 
transmit completion.


> I took this 1ms delay, but I never said it was a fixed value ;)
>
> Also remember one thing, this is the _max_ delay in case your napi
> handler is flooded. This almost never happen (tm)

We can still ignore the FSI types and probably the HPC types because 
they will insist on never happens (tm) :)


>
> Not sure what you mean by shuffle. We use a hash table to locate a flow,
> but we also have a LRU list to get the packets ordered by their entry in
> the 'GRO unit'.

When I say shuffle I mean something along the lines of interleave.  So, 
if we have four flows, 1-4, a perfect shuffle of their segments would be 
something like:

1 2 3 4 1 2 3 4 1 2 3 4

but not well shuffled might look like

1 1 3 2 3 2 4 4 4 1 3 2

rick


* Re: [PATCH] net: gro: selective flush of packets
  2012-10-08  7:39                                 ` Eric Dumazet
@ 2012-10-08 16:42                                   ` Rick Jones
  2012-10-08 17:10                                     ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Rick Jones @ 2012-10-08 16:42 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, David Miller, netdev, Jesse Gross, Tom Herbert,
	Yuchung Cheng

On 10/08/2012 12:39 AM, Eric Dumazet wrote:
> On Sun, 2012-10-07 at 07:29 +0200, Eric Dumazet wrote:
>> 	On Sun, 2012-10-07 at 08:32 +0800, Herbert Xu wrote:
>
>>> Why don't we just always flush everything?
>>
>
>> This is what I tried first, but it lowered performance on several
>> typical workloads.
>>
>> Using this simple heuristic increases performance.
>>
>>
>
> By the way, one of the beauty of GRO is it helps under load to aggregate
> packets and reduce cpu load. People wanting very low latencies should
> probably not use GRO, and if they use it or not, receiving a full 64
> packets batch on a particular NIC makes latencies very unpredictable.
>
> So if we consumed all budget in a napi->poll() handler, its because we
> are under load and we dont really want to cancel GRO aggregation.

Is that actually absolute, or does it depend on GRO aggregation actually 
aggregating?  In your opening message you talked about how with enough 
flows GRO is defeated but its overhead remains.

rick jones


* Re: [RFC] GRO scalability
  2012-10-08 16:40                       ` Rick Jones
@ 2012-10-08 16:59                         ` Eric Dumazet
  2012-10-08 17:49                           ` Rick Jones
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-08 16:59 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On Mon, 2012-10-08 at 09:40 -0700, Rick Jones wrote:
> On 10/05/2012 01:06 PM, Eric Dumazet wrote:
> > On Fri, 2012-10-05 at 12:35 -0700, Rick Jones wrote:
> >
> >> Just how much code path is there between NAPI and the socket?? (And I
> >> guess just how much combining are you hoping for?)
> >>
> >
> > When GRO correctly works, you can save about 30% of cpu cycles, it
> > depends...
> >
> > Doubling MAX_SKB_FRAGS (allowing 32+1 MSS per GRO skb instead of 16+1)
> > gives an improvement as well...
> 
> OK, but how much of that 30% come from where?  Each coalesced segment is 
> saving the cycles between NAPI and the socket.  Each avoided ACK is 
> saving the cycles from TCP to the bottom of the driver and a (share of) 
> transmit completion.

It comes from the fact that you have less competition between the Bottom
Half handler and the application for the socket lock, not counting all the
layers that we have to cross (IP, netfilter ...).

Each time a TCP packet is delivered while the socket is owned by the user,
the packet is placed on a special 'backlog queue', and the application has
to process this packet right before releasing the socket lock. It sucks
because it adds latencies, and other frames keep being queued to the
backlog while the application processes it (very expensive because of
cache line misses).

So GRO really makes this kind of event less probable.
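
A condensed view of that path (sketch only, not verbatim kernel code):

static void tcp_deliver_sketch(struct sock *sk, struct sk_buff *skb)
{
	bh_lock_sock(sk);
	if (!sock_owned_by_user(sk)) {
		/* fast path: processed directly in softirq context */
		tcp_v4_do_rcv(sk, skb);
	} else {
		/* application holds the lock: park the skb; release_sock()
		 * will replay it later, touching cold cache lines */
		__sk_add_backlog(sk, skb);
	}
	bh_unlock_sock(sk);
}

Fewer, larger GRO packets mean fewer trips through this lock dance per
byte received.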

> 
> When I say shuffle I mean something along the lines of interleave.  So, 
> if we have four flows, 1-4, a perfect shuffle of their segments would be 
> something like:
> 
> 1 2 3 4 1 2 3 4 1 2 3 4
> 
> but not well shuffled might look like
> 
> 1 1 3 2 3 2 4 4 4 1 3 2
> 

If all these packets are delivered in the same NAPI run, and correctly
aggregated, their order doesnt matter.

In first case, we will deliver  B1, B2, B3, B4   (B being a GRO packet
with 3 MSS)

In second case we will deliver

B1 B3 B2 B4


* Re: [PATCH] net: gro: selective flush of packets
  2012-10-08 16:42                                   ` Rick Jones
@ 2012-10-08 17:10                                     ` Eric Dumazet
  0 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2012-10-08 17:10 UTC (permalink / raw)
  To: Rick Jones
  Cc: Herbert Xu, David Miller, netdev, Jesse Gross, Tom Herbert,
	Yuchung Cheng

On Mon, 2012-10-08 at 09:42 -0700, Rick Jones wrote:

> > By the way, one of the beauty of GRO is it helps under load to aggregate
> > packets and reduce cpu load. People wanting very low latencies should
> > probably not use GRO, and if they use it or not, receiving a full 64
> > packets batch on a particular NIC makes latencies very unpredictable.
> >
> > So if we consumed all budget in a napi->poll() handler, its because we
> > are under load and we dont really want to cancel GRO aggregation.
> 
> Is that actually absolute, or does it depend on GRO aggregation actually 
> aggregating?  In your opening message you talked about how with enough 
> flows GRO is defeated but its overhead remains.
> 

Sorry, I don't understand the question.

We consume all of the budget when 64 packets are fetched from the NIC.

This has nothing to do with GRO; it is NAPI behavior.

Sure, if these packets are UDP messages and cross the GRO stack for
nothing, it's pure overhead.

The current situation is:

You receive a burst of packets, with one (or a few) TCP message(s), while
the other frames are UDP only.

This TCP message is held in the GRO queue, and stays there as long as we
don't receive another packet for the same flow, or until the burst ends.

Note that I don't really care about these few TCP messages right now,
but when/if we use a hash table and allow XXX packets in the GRO stack,
things are different ;)


* Re: [RFC] GRO scalability
  2012-10-08 16:59                         ` Eric Dumazet
@ 2012-10-08 17:49                           ` Rick Jones
  2012-10-08 17:55                             ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Rick Jones @ 2012-10-08 17:49 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On 10/08/2012 09:59 AM, Eric Dumazet wrote:
> On Mon, 2012-10-08 at 09:40 -0700, Rick Jones wrote:
>> When I say shuffle I mean something along the lines of interleave.  So,
>> if we have four flows, 1-4, a perfect shuffle of their segments would be
>> something like:
>>
>> 1 2 3 4 1 2 3 4 1 2 3 4
>>
>> but not well shuffled might look like
>>
>> 1 1 3 2 3 2 4 4 4 1 3 2
>>
>
> If all these packets are delivered in the same NAPI run, and correctly
> aggregated, their order doesnt matter.
>
> In first case, we will deliver  B1, B2, B3, B4   (B being a GRO packet
> with 3 MSS)
>
> In second case we will deliver
>
> B1 B3 B2 B4

So, with my term shuffle better defined, let's circle back to your 
proposal to try to GRO-service a very much larger group of flows, with a 
"flush queued packets older than N packets" heuristic as part of the 
latency minimization.  If N were 2 there (half the number of flows), the 
"perfect" shuffle doesn't get aggregated at all, right? N would have to 
be 4, or the number of concurrent flows.  What I'm trying to get at is 
just how many concurrent flows you are trying to get GRO to scale to, 
and whether at that level you have asymptotically approached having 
hash/retained state that is, basically, a duplicate of what is happening 
in TCP.

rick

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GRO scalability
  2012-10-08 17:49                           ` Rick Jones
@ 2012-10-08 17:55                             ` Eric Dumazet
  2012-10-08 17:56                               ` Eric Dumazet
  2012-10-08 18:21                               ` [RFC] GRO scalability Rick Jones
  0 siblings, 2 replies; 41+ messages in thread
From: Eric Dumazet @ 2012-10-08 17:55 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On Mon, 2012-10-08 at 10:49 -0700, Rick Jones wrote:

> So, with my term shuffle better defined, let's circle back to your 
> proposal to try to GRO-service a very much larger group of flows, with a 
> "flush queued packets older than N packets" heuristic as part of the 
> latency minimization.  If N were 2 there (half the number of flows), the 
> "perfect" shuffle doesn't get aggregated at all, right? N would have to 
> be 4, or the number of concurrent flows.  What I'm trying to get at is 
> just how many concurrent flows you are trying to get GRO to scale to, 
> and whether at that level you have asymptotically approached having 
> hash/retained state that is, basically, a duplicate of what is happening 
> in TCP.
> 

I didn't say "flush queued packets older than N packets" but instead
suggested using a time limit, possibly a sysctl.

"flush packet if the oldest part of it is aged by xxx us"

Say the default would be 500 us

If your hardware is capable of receiving 2 packets per us, this would
allow 1000 packets in GRO queue. So using a hash table with 128 slots,
and keeping the current limit of 8 entries per bucket would allow this
kind of workload.

If your hardware is capable of receiving one packet per us, this would
allow 50 packets in GRO queue.
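
A minimal sketch of that test, assuming a hypothetical gro_flush_age_us
sysctl and a per-GRO-packet timestamp (first_seg_ns) taken when its first
segment was merged; neither exists as such in the kernel:

#include <linux/time.h>
#include <linux/types.h>

/* Capacity check from the numbers above: at 2 packets per us, a 500 us
 * limit keeps at most ~1000 packets queued, and a hash table with
 * 128 buckets * 8 entries = 1024 slots can hold them.
 */
static bool example_gro_pkt_too_old(u64 first_seg_ns, u64 now_ns,
				    unsigned int gro_flush_age_us)
{
	return now_ns - first_seg_ns > (u64)gro_flush_age_us * NSEC_PER_USEC;
}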

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GRO scalability
  2012-10-08 17:55                             ` Eric Dumazet
@ 2012-10-08 17:56                               ` Eric Dumazet
  2012-10-08 18:58                                 ` [RFC] napi: limit GRO latency Stephen Hemminger
  2012-10-08 18:21                               ` [RFC] GRO scalability Rick Jones
  1 sibling, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-08 17:56 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On Mon, 2012-10-08 at 19:55 +0200, Eric Dumazet wrote:

> If your hardware is capable of receiving one packet per us, this would
> allow 50 packets in the GRO queue.

Sorry, I meant: 10 us per packet.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GRO scalability
  2012-10-08 17:55                             ` Eric Dumazet
  2012-10-08 17:56                               ` Eric Dumazet
@ 2012-10-08 18:21                               ` Rick Jones
  2012-10-08 18:28                                 ` Eric Dumazet
  1 sibling, 1 reply; 41+ messages in thread
From: Rick Jones @ 2012-10-08 18:21 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On 10/08/2012 10:55 AM, Eric Dumazet wrote:
> On Mon, 2012-10-08 at 10:49 -0700, Rick Jones wrote:
>
>> So, with my term shuffle better defined, let's circle back to your
>> proposal to try to GRO-service a very much larger group of flows, with a
>> "flush queued packets older than N packets" heuristic as part of the
>> latency minimization.  If N were 2 there (half the number of flows), the
>> "perfect" shuffle doesn't get aggregated at all, right? N would have to
>> be 4, or the number of concurrent flows.  What I'm trying to get at is
>> just how many concurrent flows you are trying to get GRO to scale to,
>> and whether at that level you have asymptotically approached having
>> hash/retained state that is, basically, a duplicate of what is happening
>> in TCP.
>>
>
> I didn't say "flush queued packets older than N packets" but instead
> suggested using a time limit, possibly a sysctl.

Did I then mis-interpret:

> 2) Use a LRU list to eventually be able to 'flush' too old packets,
>    even if the napi never completes. Each time we process a new packet,
>    being a GRO candidate or not, we increment a napi->sequence, and we
>    flush the oldest packet in gro_lru_list if its own sequence is too
>    old.

in your initial RFC email?  Because I took that as flushing a given 
packet if N packets have come through since that packet was queued to 
await coalescing.

rick

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] GRO scalability
  2012-10-08 18:21                               ` [RFC] GRO scalability Rick Jones
@ 2012-10-08 18:28                                 ` Eric Dumazet
  0 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2012-10-08 18:28 UTC (permalink / raw)
  To: Rick Jones; +Cc: Herbert Xu, David Miller, netdev, Jesse Gross

On Mon, 2012-10-08 at 11:21 -0700, Rick Jones wrote:

> Did I then mis-interpret:
> 
> > 2) Use a LRU list to eventually be able to 'flush' too old packets,
> >    even if the napi never completes. Each time we process a new packet,
> >    being a GRO candidate or not, we increment a napi->sequence, and we
> >    flush the oldest packet in gro_lru_list if its own sequence is too
> >    old.
> 
> in your initial RFC email?  Because I took that as flushing a given 
> packet if N packets have come through since that packet was queued to 
> await coalescing.

Yes, this was refined in the patch I sent.

It currently uses jiffies, but my plan is to try get_cycles() or
ktime_get() after some experiments, allowing cycle/ns resolution.

(One ktime_get() per napi batch seems OK.)
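
That could look roughly like the sketch below: the clock is read once per
NAPI batch and the cached value is reused to age out held packets. The
example_ structures are made up; only the ktime helpers and list macros
are real kernel API.

#include <linux/ktime.h>
#include <linux/list.h>

struct example_gro_entry {		/* made-up stand-in for a held GRO packet */
	struct list_head node;		/* kept in arrival order, oldest first */
	ktime_t first_seg_time;		/* stamped from the cached batch time */
};

/* batch_now is read once with ktime_get() at the start of the NAPI batch
 * and passed down, instead of reading the clock for every packet.
 */
static void example_flush_aged(struct list_head *held, ktime_t batch_now,
			       s64 max_age_ns)
{
	struct example_gro_entry *e, *tmp;

	list_for_each_entry_safe(e, tmp, held, node) {
		if (ktime_to_ns(ktime_sub(batch_now, e->first_seg_time)) <= max_age_ns)
			break;		/* list is ordered, the rest is younger */
		list_del(&e->node);
		/* deliver the aggregated packet and free the entry here */
	}
}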

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] net: gro: selective flush of packets
  2012-10-06 18:08                           ` [PATCH] net: gro: selective flush of packets Eric Dumazet
  2012-10-07  0:32                             ` Herbert Xu
@ 2012-10-08 18:52                             ` David Miller
  1 sibling, 0 replies; 41+ messages in thread
From: David Miller @ 2012-10-08 18:52 UTC (permalink / raw)
  To: eric.dumazet; +Cc: herbert, netdev, jesse, therbert, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 06 Oct 2012 20:08:49 +0200

> From: Eric Dumazet <edumazet@google.com>
> 
> Current GRO can hold packets in gro_list for an almost unlimited
> time, in case the napi->poll() handler consumes its budget over and over.
> 
> In this case, napi_complete()/napi_gro_flush() are not called.
> 
> Another problem is that gro_list is flushed in a non-friendly way:
> we scan the list and complete packets in the reverse order
> (youngest packets first, oldest packets last).
> This defeats priorities that the sender could have cooked.
> 
> Since GRO currently only stores TCP packets, we don't really notice the
> bug because of retransmits, but this behavior can add unexpected
> latencies, particularly on mice flows clamped by elephant flows.
> 
> This patch makes sure no packet can stay more than 1 ms in the queue,
> and only in stress situations.
> 
> It also completes packets in the right order to minimize latencies.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks Eric.
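
As a rough illustration of the ordering point in the changelog quoted
above (not the applied patch): with a singly linked hold list that keeps
the newest packet at its head, a naive walk completes youngest first, so
one way to complete oldest first is to reverse the chain before walking
it. The example_ names below are made up.

#include <stdio.h>

struct example_pkt {			/* stand-in for a held GRO packet */
	int id;				/* 1 = oldest ... n = youngest */
	struct example_pkt *next;
};

/* The hold list is built by pushing at the head, so the youngest packet
 * comes first.  Reversing it before completion delivers oldest first.
 */
static struct example_pkt *example_reverse(struct example_pkt *head)
{
	struct example_pkt *prev = NULL;

	while (head) {
		struct example_pkt *next = head->next;

		head->next = prev;
		prev = head;
		head = next;
	}
	return prev;
}

int main(void)
{
	struct example_pkt p1 = { 1, NULL };
	struct example_pkt p2 = { 2, &p1 };
	struct example_pkt p3 = { 3, &p2 };	/* head: youngest (3) -> ... -> oldest (1) */
	struct example_pkt *p;

	for (p = example_reverse(&p3); p; p = p->next)
		printf("completing packet %d\n", p->id);	/* 1, 2, then 3 */
	return 0;
}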

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [RFC] napi: limit GRO latency
  2012-10-08 17:56                               ` Eric Dumazet
@ 2012-10-08 18:58                                 ` Stephen Hemminger
  2012-10-08 19:10                                   ` David Miller
  2012-10-08 19:21                                   ` Eric Dumazet
  0 siblings, 2 replies; 41+ messages in thread
From: Stephen Hemminger @ 2012-10-08 18:58 UTC (permalink / raw)
  To: Eric Dumazet, David Miller; +Cc: Rick Jones, Herbert Xu, netdev, Jesse Gross

Limit the latency of pending GRO in NAPI processing to two jiffies.
When the system is under heavy network load, NAPI will go into
poll mode via softirq, and only stay in the loop for
two jiffies. If this occurs, process the GRO pending list
to make sure we do not delay outstanding TCP frames for too long.

Rearrange the exit path to get rid of unnecessary goto logic.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


--- a/net/core/dev.c	2012-10-08 09:21:27.466049785 -0700
+++ b/net/core/dev.c	2012-10-08 11:56:41.714618745 -0700
@@ -3937,8 +3937,16 @@ static void net_rx_action(struct softirq
 		 * Allow this to run for 2 jiffies since which will allow
 		 * an average latency of 1.5/HZ.
 		 */
-		if (unlikely(budget <= 0 || time_after(jiffies, time_limit)))
-			goto softnet_break;
+		if (unlikely(budget <= 0 || time_after(jiffies, time_limit))) {
+			/* Cleanup all pending GRO */
+
+			list_for_each_entry(n, &sd->poll_list, poll_list)
+				napi_gro_flush(n);
+
+			sd->time_squeeze++;
+			__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+			break;
+		}
 
 		local_irq_enable();
 
@@ -3987,7 +3995,6 @@ static void net_rx_action(struct softirq
 
 		netpoll_poll_unlock(have);
 	}
-out:
 	net_rps_action_and_irq_enable(sd);
 
 #ifdef CONFIG_NET_DMA
@@ -3997,13 +4004,6 @@ out:
 	 */
 	dma_issue_pending_all();
 #endif
-
-	return;
-
-softnet_break:
-	sd->time_squeeze++;
-	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
-	goto out;
 }
 
 static gifconf_func_t *gifconf_list[NPROTO];

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] napi: limit GRO latency
  2012-10-08 18:58                                 ` [RFC] napi: limit GRO latency Stephen Hemminger
@ 2012-10-08 19:10                                   ` David Miller
  2012-10-08 19:12                                     ` Stephen Hemminger
  2012-10-08 19:21                                   ` Eric Dumazet
  1 sibling, 1 reply; 41+ messages in thread
From: David Miller @ 2012-10-08 19:10 UTC (permalink / raw)
  To: shemminger; +Cc: eric.dumazet, rick.jones2, herbert, netdev, jesse

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Mon, 8 Oct 2012 11:58:35 -0700

> Limit the latency of pending GRO in NAPI processing to two jiffies.
> When the system is under heavy network load, NAPI will go into
> poll mode via softirq, and only stay in the loop for
> two jiffies. If this occurs, process the GRO pending list
> to make sure we do not delay outstanding TCP frames for too long.
> 
> Rearrange the exit path to get rid of unnecessary goto logic.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Did you see Eric's patch I just applied which limits it to 1ms?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] napi: limit GRO latency
  2012-10-08 19:10                                   ` David Miller
@ 2012-10-08 19:12                                     ` Stephen Hemminger
  2012-10-08 19:30                                       ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Stephen Hemminger @ 2012-10-08 19:12 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, rick.jones2, herbert, netdev, jesse

On Mon, 08 Oct 2012 15:10:35 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Mon, 8 Oct 2012 11:58:35 -0700
> 
> > Limit the latency of pending GRO in NAPI processing to two jiffies.
> > When the system is under heavy network load, NAPI will go into
> > poll mode via softirq, and only stay in the loop for
> > two jiffies. If this occurs, process the GRO pending list
> > to make sure we do not delay outstanding TCP frames for too long.
> > 
> > Rearrange the exit path to get rid of unnecessary goto logic.
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> Did you see Eric's patch I just applied which limits it to 1ms?

I think this addresses a different problem: after leaving softirq, it
may be a long time until ksoftirqd is run.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] napi: limit GRO latency
  2012-10-08 18:58                                 ` [RFC] napi: limit GRO latency Stephen Hemminger
  2012-10-08 19:10                                   ` David Miller
@ 2012-10-08 19:21                                   ` Eric Dumazet
  1 sibling, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2012-10-08 19:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: David Miller, Rick Jones, Herbert Xu, netdev, Jesse Gross

On Mon, 2012-10-08 at 11:58 -0700, Stephen Hemminger wrote:
> Limit the latency of pending GRO in NAPI processing to two jiffies.
> When the system is under heavy network load, NAPI will go into
> poll mode via softirq, and only stay in the loop for
> two jiffies. If this occurs, process the GRO pending list
> to make sure we do not delay outstanding TCP frames for too long.
> 
> Rearrange the exit path to get rid of unnecessary goto logic.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> 
> --- a/net/core/dev.c	2012-10-08 09:21:27.466049785 -0700
> +++ b/net/core/dev.c	2012-10-08 11:56:41.714618745 -0700
> @@ -3937,8 +3937,16 @@ static void net_rx_action(struct softirq
>  		 * Allow this to run for 2 jiffies since which will allow
>  		 * an average latency of 1.5/HZ.
>  		 */
> -		if (unlikely(budget <= 0 || time_after(jiffies, time_limit)))
> -			goto softnet_break;
> +		if (unlikely(budget <= 0 || time_after(jiffies, time_limit))) {
> +			/* Cleanup all pending GRO */
> +
> +			list_for_each_entry(n, &sd->poll_list, poll_list)
> +				napi_gro_flush(n);
> +
> +			sd->time_squeeze++;
> +			__raise_softirq_irqoff(NET_RX_SOFTIRQ);
> +			break;
> +		}
>  

Hmm, I really don't think it's a good idea.

I am working on making GRO store more packets, so any kind of flush
like that would completely defeat GRO's intent.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] napi: limit GRO latency
  2012-10-08 19:12                                     ` Stephen Hemminger
@ 2012-10-08 19:30                                       ` Eric Dumazet
  2012-10-08 19:40                                         ` Stephen Hemminger
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2012-10-08 19:30 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, rick.jones2, herbert, netdev, jesse

On Mon, 2012-10-08 at 12:12 -0700, Stephen Hemminger wrote:
> On Mon, 08 Oct 2012 15:10:35 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Stephen Hemminger <shemminger@vyatta.com>
> > Date: Mon, 8 Oct 2012 11:58:35 -0700
> > 
> > > Limit the latency of pending GRO in NAPI processing to two jiffies.
> > > When the system is under heavy network load, NAPI will go into
> > > poll mode via softirq, and only stay in the loop for
> > > two jiffies. If this occurs, process the GRO pending list
> > > to make sure we do not delay outstanding TCP frames for too long.
> > > 
> > > Rearrange the exit path to get rid of unnecessary goto logic.
> > > 
> > > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> > 
> > Did you see Eric's patch I just applied which limits it to 1ms?
> 
> I think this addresses a different problem: after leaving softirq, it
> may be a long time until ksoftirqd is run.

The system is under stress; you don't want to increase load at this
point, but to increase bandwidth.

Really try this on a 40GbE link and see how bad it is.

On a server, we probably dedicate one or several CPUs only to run NAPI,
so yes, we hit the net_rx_action() break, but we'll re-enter it a few us
later.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] napi: limit GRO latency
  2012-10-08 19:30                                       ` Eric Dumazet
@ 2012-10-08 19:40                                         ` Stephen Hemminger
  2012-10-08 19:46                                           ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Stephen Hemminger @ 2012-10-08 19:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, rick.jones2, herbert, netdev, jesse

On Mon, 08 Oct 2012 21:30:12 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> The system is under stress; you don't want to increase load at this
> point, but to increase bandwidth.
> 
> Really try this on a 40GbE link and see how bad it is.

Not everybody has 40GbE hardware just lying around available for testing :-)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC] napi: limit GRO latency
  2012-10-08 19:40                                         ` Stephen Hemminger
@ 2012-10-08 19:46                                           ` Eric Dumazet
  0 siblings, 0 replies; 41+ messages in thread
From: Eric Dumazet @ 2012-10-08 19:46 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, rick.jones2, herbert, netdev, jesse

On Mon, 2012-10-08 at 12:40 -0700, Stephen Hemminger wrote:
> On Mon, 08 Oct 2012 21:30:12 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > The system is under stress; you don't want to increase load at this
> > point, but to increase bandwidth.
> > 
> > Really try this on a 40GbE link and see how bad it is.
> 
> Not everybody has 40GbE hardware just lying around available for testing :-)

;)

In fact, the current strategy of flushing the GRO list when the napi
completes might not be good for moderate traffic (that's 99% of the
Linux machines, I believe). We could have a deferred flush instead...

I don't know what strategy is used by hardware, but surely hardware uses
timers.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2012-10-08 19:46 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-27 12:48 [PATCH net-next 3/3] ipv4: gre: add GRO capability Eric Dumazet
2012-09-27 17:52 ` Jesse Gross
2012-09-27 18:08   ` Eric Dumazet
2012-09-27 18:19     ` Eric Dumazet
2012-09-27 22:03       ` Jesse Gross
2012-09-28 14:04         ` Eric Dumazet
2012-10-01 20:56           ` Jesse Gross
2012-10-05 14:52             ` [RFC] GRO scalability Eric Dumazet
2012-10-05 18:16               ` Rick Jones
2012-10-05 19:00                 ` Eric Dumazet
2012-10-05 19:35                   ` Rick Jones
2012-10-05 20:06                     ` Eric Dumazet
2012-10-08 16:40                       ` Rick Jones
2012-10-08 16:59                         ` Eric Dumazet
2012-10-08 17:49                           ` Rick Jones
2012-10-08 17:55                             ` Eric Dumazet
2012-10-08 17:56                               ` Eric Dumazet
2012-10-08 18:58                                 ` [RFC] napi: limit GRO latency Stephen Hemminger
2012-10-08 19:10                                   ` David Miller
2012-10-08 19:12                                     ` Stephen Hemminger
2012-10-08 19:30                                       ` Eric Dumazet
2012-10-08 19:40                                         ` Stephen Hemminger
2012-10-08 19:46                                           ` Eric Dumazet
2012-10-08 19:21                                   ` Eric Dumazet
2012-10-08 18:21                               ` [RFC] GRO scalability Rick Jones
2012-10-08 18:28                                 ` Eric Dumazet
2012-10-06  4:11               ` Herbert Xu
2012-10-06  5:08                 ` Eric Dumazet
2012-10-06  5:14                   ` Herbert Xu
2012-10-06  6:22                     ` Eric Dumazet
2012-10-06  7:00                       ` Eric Dumazet
2012-10-06 10:56                         ` Herbert Xu
2012-10-06 18:08                           ` [PATCH] net: gro: selective flush of packets Eric Dumazet
2012-10-07  0:32                             ` Herbert Xu
2012-10-07  5:29                               ` Eric Dumazet
2012-10-08  7:39                                 ` Eric Dumazet
2012-10-08 16:42                                   ` Rick Jones
2012-10-08 17:10                                     ` Eric Dumazet
2012-10-08 18:52                             ` David Miller
2012-09-27 22:03     ` [PATCH net-next 3/3] ipv4: gre: add GRO capability Jesse Gross
2012-10-01 21:04 ` David Miller
