* TCPBacklogDrops during aggressive bursts of traffic
@ 2012-05-15 14:38 Kieran Mansley
  2012-05-15 14:56 ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Kieran Mansley @ 2012-05-15 14:38 UTC (permalink / raw)
  To: netdev

I've been investigating an issue with TCPBacklogDrops being reported
(and relatively poor performance as a result).  The problem is most
easily observed on slightly older kernels (e.g. 3.0.13) but is still
present in 3.3.6, although harder to reproduce.  I've also seen it in
2.6 series kernels, so it's not a recent issue.

The problem occurs at the receiver when a TCP sender with a large
congestion window is sending at a high rate and the receiving
application has blocked in a recv() or similar call.  During the stream,
ACKs are returned to the sender, keeping the receive window open and so
allowing it to carry on sending.  The local socket receive buffer
gets dynamically increased, and the advertised receive window increases
similarly.

[As an aside, it appears as though the total bytes that the receiver
commits to receiving - i.e. the point at which it stops advertising new
sequence space - is around double the receive socket buffer.  I'm
guessing it is committing to receiving the current socket buffer
(perhaps as there is a pending recv() it knows it will be able to
immediately empty this) and the next one, but I've not looked into this
in detail]

As the socket buffer is approaching full the kernel decides to satisfy
the recv() call and wake the application.  It will have to copy the data
to application address space etc.  At this point there is a switch in
tcp_v4_rcv():

http://lxr.linux.no/#linux+v3.3.6/net/ipv4/tcp_ipv4.c#L1726

Up to this point, the "if (!sock_owned_by_user(sk))" test will evaluate
to true, but once it has decided to wake the application I think it will
evaluate to false and it will drop through to:

1739        else if (unlikely(sk_add_backlog(sk, skb))) {
1740                bh_unlock_sock(sk);
1741                NET_INC_STATS_BH(net, LINUX_MIB_TCPBACKLOGDROP);
1742                goto discard_and_relse;
1743        }

In sk_add_backlog() there is a test to see if the socket's receive
buffer is full, and if it is the kernel drops the packets, reporting
them through netstat as TCPBacklogDrop.  This is despite there being
potentially megabytes of unused advertised receive window space at this
point.
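
For reference, the check that triggers the drop lives in
include/net/sock.h and looks roughly like this (paraphrased from the
3.3-era source from memory, so treat it as a sketch rather than the
exact text):

/* paraphrased: the backlog and the receive queue are charged
 * against the single sk_rcvbuf limit */
static inline bool sk_rcvqueues_full(const struct sock *sk,
				     const struct sk_buff *skb)
{
	unsigned int qsize = sk->sk_backlog.len +
			     atomic_read(&sk->sk_rmem_alloc);

	return qsize > sk->sk_rcvbuf;
}

/* The per-socket spinlock must be held here. */
static inline __must_check int sk_add_backlog(struct sock *sk,
					      struct sk_buff *skb)
{
	if (sk_rcvqueues_full(sk, skb))
		return -ENOBUFS;	/* counted as TCPBacklogDrop */

	__sk_add_backlog(sk, skb);
	sk->sk_backlog.len += skb->truesize;
	return 0;
}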

Very shortly afterwards the socket buffer will be empty again (as its
contents will have been transferred to the user) so this is essentially
a race and depends on a fast sender to demonstrate it.  It shows up as
an acute period of drops that are quickly retransmitted and then
accepted.

There are two ways of thinking about this problem: either the receiver
should be more conservative about the receive window it advertises
(limiting it to the available receive socket buffer size); or the
receiver should be more generous with what it will accept on to the
backlog (matching it to the advertised receive window).  It is the
discrepancy between advertised receive window and what can be put on the
backlog that is the root of the problem.  I would be tempted by the
latter and say that as the backlog is likely to soon make it into the
receive buffer, it should be allowed to contain a full receive buffer of
bytes on top of what is currently being removed from the receive buffer
into the application.

It is harder to reproduce on recent kernels because the pending recv()
call gets satisfied very close to the start of a burst, and at this time
the receive buffer will be mostly empty and so it is less likely that
any packets in flight will overflow the backlog.  On earlier kernels it
is easier to reproduce because the pending recv() call didn't return
until the socket's receive buffer was nearly full, and so it would only
take a few extra packets to overflow the backlog.

I have a packet capture to illustrate the problem (taken on 3.0.13) if
that would be of help.  As I can easily reproduce it I'm also happy to
make changes and test to see if they improve matters.

Thanks

Kieran


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-15 14:38 TCPBacklogDrops during aggressive bursts of traffic Kieran Mansley
@ 2012-05-15 14:56 ` Eric Dumazet
  2012-05-15 15:00   ` Eric Dumazet
  2012-05-15 16:29   ` Kieran Mansley
  0 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-05-15 14:56 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: netdev

On Tue, 2012-05-15 at 15:38 +0100, Kieran Mansley wrote:
> [original report quoted in full - snipped]


Please try latest kernels, this is probably 'fixed'

What network driver are you using ?


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-15 14:56 ` Eric Dumazet
@ 2012-05-15 15:00   ` Eric Dumazet
  2012-05-15 16:29   ` Kieran Mansley
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-05-15 15:00 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: netdev

On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:

> Please try latest kernels, this is probably 'fixed'
> 
> What network driver are you using ?
> 
> 

commit b49960a05e32121d29316cfdf653894b88ac9190
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed May 2 02:28:41 2012 +0000

    tcp: change tcp_adv_win_scale and tcp_rmem[2]
    
    tcp_adv_win_scale default value is 2, meaning we expect a good citizen
    skb to have skb->len / skb->truesize ratio of 75% (3/4)
    
    In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
    1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
    So these skbs were considered as not bloated.
    
    With recent truesize fixes, a typical MSS=1460 frame truesize is now the
    more precise :
    2048 + 256 = 2304. But 2304 * 3/4 = 1728.
    So these skb are not good citizen anymore, because 1460 < 1728
    
    (GRO can escape this problem because it builds skbs with a too low
    truesize.)
    
    This also means tcp advertises a too optimistic window for a given
    allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
    sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
    especially when application is slow to drain its receive queue or in
    case of losses (netperf is fast, scp is slow). This is a major latency
    source.
    
    We should adjust the len/truesize ratio to 50% instead of 75%
    
    This patch :
    
    1) changes tcp_adv_win_scale default to 1 instead of 2
    
    2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account
    better truesize tracking and to allow autotuning the tcp receive window
    to reach the same value as before. Note that the same amount of kernel
    memory is consumed compared to 2.6 kernels.
    
    Signed-off-by: Eric Dumazet <edumazet@google.com>
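
For reference, the mapping from tcp_adv_win_scale to the advertised
window comes from tcp_win_from_space() in include/net/tcp.h (this era's
version, quoted from memory - check the exact text in your tree):

static inline int tcp_win_from_space(int space)
{
	/* positive scale: advertise space minus a 1/2^scale overhead
	 * reserve; negative scale: advertise only space >> -scale */
	return sysctl_tcp_adv_win_scale <= 0 ?
		(space >> (-sysctl_tcp_adv_win_scale)) :
		space - (space >> sysctl_tcp_adv_win_scale);
}

So the default change from 2 to 1 moves the advertised share of the
allocated rcvspace from 3/4 to 1/2, matching the 50% ratio above.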


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-15 14:56 ` Eric Dumazet
  2012-05-15 15:00   ` Eric Dumazet
@ 2012-05-15 16:29   ` Kieran Mansley
  2012-05-15 16:34     ` Eric Dumazet
  1 sibling, 1 reply; 37+ messages in thread
From: Kieran Mansley @ 2012-05-15 16:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> 
> Please try latest kernels, this is probably 'fixed'

I've just tried with 3.4.0-rc7 and the problem is still reproducible.
It's perhaps harder to reproduce than on 3.3.6 but still there.

> What network driver are you using ? 

The receiver is using the sfc driver that is included in the kernel
build, together with an SFC 9020 NIC. 

Kieran


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-15 16:29   ` Kieran Mansley
@ 2012-05-15 16:34     ` Eric Dumazet
  2012-05-15 16:47       ` Ben Hutchings
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-15 16:34 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: netdev

On Tue, 2012-05-15 at 17:29 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> > 
> > Please try latest kernels, this is probably 'fixed'
> 
> I've just tried with 3.4.0-rc7 and the problem is still reproducible.
> It's perhaps harder to reproduce than on 3.3.6 but still there.
> 
> > What network driver are you using ? 
> 
> The receiver is using the sfc driver that is included in the kernel
> build, together with an SFC 9020 NIC. 
> 
> Kieran
> 

MTU ?

What is typical skb->truesize of skb given to stack in RX path ?

If drivers use PAGE_SIZE fragments, then you are more likely to hit the
limit.
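
Rough numbers, assuming 4KB pages and one PAGE_SIZE fragment per
MTU-sized frame (my arithmetic, not a measurement):

	truesize ~= 256 (sk_buff) + 512 (skb->head) + 4096 (fragment) = 4864
	len/truesize ~= 1460 / 4864 ~= 30%

so sk_rmem_alloc reaches sk_rcvbuf with far less payload queued than the
advertised window suggested.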


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-15 16:34     ` Eric Dumazet
@ 2012-05-15 16:47       ` Ben Hutchings
  2012-05-15 17:01         ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Ben Hutchings @ 2012-05-15 16:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, netdev

On Tue, 2012-05-15 at 18:34 +0200, Eric Dumazet wrote:
> On Tue, 2012-05-15 at 17:29 +0100, Kieran Mansley wrote:
> > On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> > > 
> > > Please try latest kernels, this is probably 'fixed'
> > 
> > I've just tried with 3.4.0-rc7 and the problem is still reproducible.
> > It's perhaps harder to reproduce than on 3.3.6 but still there.
> > 
> > > What network driver are you using ? 
> > 
> > The receiver is using the sfc driver that is included in the kernel
> > build, together with an SFC 9020 NIC. 
> > 
> > Kieran
> > 
> 
> MTU ?

1500

> What is typical skb->truesize of skb given to stack in RX path ?
>
> If drivers use PAGE_SIZE fragments, then you are more likely to hit
> limit.

We're passing page fragments into GRO.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-15 16:47       ` Ben Hutchings
@ 2012-05-15 17:01         ` Eric Dumazet
  2012-05-15 17:23           ` Eric Dumazet
  2012-05-17 16:31           ` Kieran Mansley
  0 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-05-15 17:01 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Kieran Mansley, netdev

On Tue, 2012-05-15 at 17:47 +0100, Ben Hutchings wrote:
> On Tue, 2012-05-15 at 18:34 +0200, Eric Dumazet wrote:
> > On Tue, 2012-05-15 at 17:29 +0100, Kieran Mansley wrote:
> > > On Tue, 2012-05-15 at 16:56 +0200, Eric Dumazet wrote:
> > > > 
> > > > Please try latest kernels, this is probably 'fixed'
> > > 
> > > I've just tried with 3.4.0-rc7 and the problem is still reproducible.
> > > It's perhaps harder to reproduce than on 3.3.6 but still there.
> > > 
> > > > What network driver are you using ? 
> > > 
> > > The receiver is using the sfc driver that is included in the kernel
> > > build, together with an SFC 9020 NIC. 
> > > 
> > > Kieran
> > > 
> > 
> > MTU ?
> 
> 1500
> 
> > What is typical skb->truesize of skb given to stack in RX path ?
> >
> > If drivers use PAGE_SIZE fragments, then you are more likely to hit
> > limit.
> 
> We're passing page fragments into GRO.


Yes, I can see drivers/net/ethernet/sfc/rx.c is even lying about
truesize. That explains why you trigger the backlog drops even on 2.6
kernels.

skb->len = rx_buf->len;
skb->data_len = rx_buf->len;
skb->truesize += rx_buf->len; // instead of real frag size

So skb->truesize is rather small.
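
Presumably the driver wants to charge the real allocation instead,
something like this hypothetical one-liner (the actual per-fragment
size depends on the sfc receive path):

	skb->truesize += rx_buf_frag_size; /* hypothetical name: the real
					    * size of the allocation backing
					    * rx_buf, e.g. PAGE_SIZE */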

napi_get_frags() could probably be updated in net-next to use the first
frag as skb->head to save 512 bytes per skb.

You could try setting tcp_adv_win_scale to -2
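
(Experiment only, not a production setting; with -2, tcp_win_from_space()
returns space >> 2, i.e. the advertised window becomes a quarter of the
allocated receive space:)

	sysctl -w net.ipv4.tcp_adv_win_scale=-2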


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-15 17:01         ` Eric Dumazet
@ 2012-05-15 17:23           ` Eric Dumazet
  2012-05-17 16:31           ` Kieran Mansley
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-05-15 17:23 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Kieran Mansley, netdev

On Tue, 2012-05-15 at 19:01 +0200, Eric Dumazet wrote:

> 
> napi_get_frags() could probably be updated in net-next to use the first
> frag as skb->head to save 512 bytes per skb.

By the way, the GRO_MAX_HEAD definition is way too big: napi_get_frags()
allocates fat skbs (1280 bytes of overhead instead of 768 bytes)

This should be enough :

#define GRO_MAX_HEAD 128


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-15 17:01         ` Eric Dumazet
  2012-05-15 17:23           ` Eric Dumazet
@ 2012-05-17 16:31           ` Kieran Mansley
  2012-05-17 16:37             ` Eric Dumazet
  1 sibling, 1 reply; 37+ messages in thread
From: Kieran Mansley @ 2012-05-17 16:31 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev

On Tue, 2012-05-15 at 19:01 +0200, Eric Dumazet wrote:
> You could try setting tcp_adv_win_scale to -2

I've been unable to reproduce this at all this afternoon (not sure why)
so haven't been able to try that suggestion yet.

From looking into what it does it sounds like it could well help, but
only in the sense of tuning the system to make it harder to hit the race
rather than fixing the race altogether.  Similarly for fixing
skb->truesize (we spotted this last week when we first hit the problem,
and adjusting it didn't make much difference).

I'll take another look tomorrow and see if I can work out what's changed
to make the drops disappear.

Kieran 


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-17 16:31           ` Kieran Mansley
@ 2012-05-17 16:37             ` Eric Dumazet
  2012-05-18 15:45               ` Kieran Mansley
  2012-05-22  8:20               ` Kieran Mansley
  0 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-05-17 16:37 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev

On Thu, 2012-05-17 at 17:31 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-15 at 19:01 +0200, Eric Dumazet wrote:
> > You could try setting tcp_adv_win_scale to -2
> 
> I've been unable to reproduce this at all this afternoon (not sure why)
> so haven't been able to try that suggestion yet.
> 
> From looking in to what it does it sounds like it could well help, but
> only in the sense of tuning the system to make it harder to hit the race
> rather than fixing the race altogether.  Similarly for fixing
> skb->truesize (we spotted this last week when we first hit the problem,
> and adjusting it didn't make much difference).

Hmm, I was not suggesting running production servers with this setting,
only doing an experiment.

With net-next and tcp coalescing, I no longer have TCPBacklogDrops /
collapses, but I don't have an sfc card/driver.


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-17 16:37             ` Eric Dumazet
@ 2012-05-18 15:45               ` Kieran Mansley
  2012-05-18 15:49                 ` Eric Dumazet
  2012-05-18 18:40                 ` Eric Dumazet
  2012-05-22  8:20               ` Kieran Mansley
  1 sibling, 2 replies; 37+ messages in thread
From: Kieran Mansley @ 2012-05-18 15:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev

On Thu, 2012-05-17 at 18:37 +0200, Eric Dumazet wrote:
> 
> Hmm, I was not suggesting running production servers with this
> setting, only do an experiment.

OK.  I've found a more reliable way to reproduce it (use MSG_WAITALL to
make the recv() call wait until the socket buffer is more nearly full) and
tried with tcp_adv_win_scale set to -2.  It's hard to make any confident
assertions about the impact of it, but if it does change the behaviour
it is a small change.  There are certainly still plenty of drops with
that setting.
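
The receive side of the repro is essentially the following (a minimal
sketch; the buffer size is arbitrary and the real netserver code differs):

	char buf[1 << 20];
	ssize_t n;

	/* MSG_WAITALL: block until the whole buffer is filled, so
	 * data keeps arriving and queueing while this process sleeps */
	n = recv(fd, buf, sizeof(buf), MSG_WAITALL);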

> With net-next and tcp coalescing, I no longer have TCPBacklogDrops /
> collapses, but I don't have an sfc card/driver.

I'll try that.

Kieran


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-18 15:45               ` Kieran Mansley
@ 2012-05-18 15:49                 ` Eric Dumazet
  2012-05-18 15:53                   ` Kieran Mansley
  2012-05-18 18:40                 ` Eric Dumazet
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-18 15:49 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev

On Fri, 2012-05-18 at 16:45 +0100, Kieran Mansley wrote:
> On Thu, 2012-05-17 at 18:37 +0200, Eric Dumazet wrote:
> > 
> > Hmm, I was not suggesting running production servers with this
> > setting, only do an experiment.
> 
> OK.  I've found a more reliable way to reproduce it (use MSG_WAITALL to
> make the recv() call wait until the socket buffer is more nearly full) and
> tried with tcp_adv_win_scale set to -2.  It's hard to make any confident
> assertions about the impact of it, but if it does change the behaviour
> it is a small change.  There are certainly still plenty of drops with
> that setting.

Are you playing with SO_RCVBUF, or let the stack autotune it ?


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-18 15:49                 ` Eric Dumazet
@ 2012-05-18 15:53                   ` Kieran Mansley
  0 siblings, 0 replies; 37+ messages in thread
From: Kieran Mansley @ 2012-05-18 15:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev

On Fri, 2012-05-18 at 17:49 +0200, Eric Dumazet wrote:
> 
> Are you playing with SO_RCVBUF, or let the stack autotune it ? 

Just letting the stack autotune it.

Kieran


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-18 15:45               ` Kieran Mansley
  2012-05-18 15:49                 ` Eric Dumazet
@ 2012-05-18 18:40                 ` Eric Dumazet
  1 sibling, 0 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-05-18 18:40 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev

On Fri, 2012-05-18 at 16:45 +0100, Kieran Mansley wrote:

> > With net-next and tcp coalescing, I no longer have TCPBacklogDrops /
> > collapses, but I don't have an sfc card/driver.
> 
> I'll try that.

By the way, we should get rid of the napi_get_frags() & napi_gro_frags()
API and use plain build_skb() instead.

This would save 1024 bytes per skb, and remove a lot of GRO code (frag0,
frag0_len, skb_gro_header_fast(), skb_gro_header_hard()...), since by
definition skb->head would point to the frame.


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-17 16:37             ` Eric Dumazet
  2012-05-18 15:45               ` Kieran Mansley
@ 2012-05-22  8:20               ` Kieran Mansley
  2012-05-22  9:25                 ` Eric Dumazet
  1 sibling, 1 reply; 37+ messages in thread
From: Kieran Mansley @ 2012-05-22  8:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev

On Thu, 2012-05-17 at 18:37 +0200, Eric Dumazet wrote:
> 
> With net-next and tcp coalescing, I no longer have TCPBacklogDrops /
> collapses, but I don't have an sfc card/driver.

I can still reproduce them on net-next.

Kieran


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-22  8:20               ` Kieran Mansley
@ 2012-05-22  9:25                 ` Eric Dumazet
  2012-05-22  9:30                   ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-22  9:25 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev

On Tue, 2012-05-22 at 09:20 +0100, Kieran Mansley wrote:
> On Thu, 2012-05-17 at 18:37 +0200, Eric Dumazet wrote:
> > 
> > With net-next and tcp coalescing, I no longer have TCPBacklogDrops /
> > collapses, but I don't have an sfc card/driver.
> 
> I can still reproduce them on net-next.
> 
> Kieran

Can you reproduce this with tg3 NIC for example ?


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-22  9:25                 ` Eric Dumazet
@ 2012-05-22  9:30                   ` Eric Dumazet
  2012-05-22 15:09                     ` Kieran Mansley
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-22  9:30 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev

On Tue, 2012-05-22 at 11:26 +0200, Eric Dumazet wrote:
> On Tue, 2012-05-22 at 09:20 +0100, Kieran Mansley wrote:
> > On Thu, 2012-05-17 at 18:37 +0200, Eric Dumazet wrote:
> > > 
> > > With net-next and tcp coalescing, I no longer have TCPBacklogDrops /
> > > collapses, but I don't have an sfc card/driver.
> > 
> > I can still reproduce them on net-next.
> > 
> > Kieran
> 
> Can you reproduce this with tg3 NIC for example ?
> 

Also can you post a pcap capture of problematic flow ?

(maybe a link to pcap if too big, or private mail)


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-22  9:30                   ` Eric Dumazet
@ 2012-05-22 15:09                     ` Kieran Mansley
  2012-05-22 16:12                       ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Kieran Mansley @ 2012-05-22 15:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev

On Tue, 2012-05-22 at 11:30 +0200, Eric Dumazet wrote:
> Also can you post a pcap capture of problematic flow ?

I'll email this to you directly. The capture is generated with netserver
on the system under test, and NetPerf sending from a similar server.
I've only included the first 1000 frames to keep the capture size down.
There are 7 retransmissions in that capture, and the TCPBacklogDrops
counter incremented by 7 during the test, so I'm happy to say the backlog
drops are the cause of those retransmissions.

The system under test was running net-next.

I've not tried with another NIC (e.g. tg3) but will see if I can find
one to test.

I've got a feeling that the drops might be easier to reproduce if I
taskset the netserver process to a different package than the one that
is handling the network interrupt for that NIC.  This fits with my
earlier theory in that it is likely to increase the overhead of waking
the user-level process to satisfy the read and so increase the time
during which received packets could overflow the backlog.  Having a
relatively aggressive sending TCP also helps, e.g. one that is
configured to open its congestion window quickly, as this will produce
more intensive bursts.

Kieran


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-22 15:09                     ` Kieran Mansley
@ 2012-05-22 16:12                       ` Eric Dumazet
  2012-05-22 16:32                         ` Kieran Mansley
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-22 16:12 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev

On Tue, 2012-05-22 at 16:09 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-22 at 11:30 +0200, Eric Dumazet wrote:
> > Also can you post a pcap capture of problematic flow ?
> 
> I'll email this to you directly. The capture is generated with netserver
> on the system under test, and NetPerf sending from a similar server.
> I've only included the first 1000 frames to keep the capture size down.
> There are 7 retransmissions in that capture, and the TCPBacklogDrops
> counter incremented by 7 during the test, so I'm happy to say the backlog
> drops are the cause of those retransmissions.
> 
> The system under test was running net-next.
> 
> I've not tried with another NIC (e.g. tg3) but will see if I can find
> one to test.

Or you could change sfc to allow its frames to be coalesced.

> 
> I've got a feeling that the drops might be easier to reproduce if I
> taskset the netserver process to a different package than the one that
> is handling the network interrupt for that NIC.  This fits with my
> earlier theory in that it is likely to increase the overhead of waking
> the user-level process to satisfy the read and so increase the time
> during which received packets could overflow the backlog.  Having a
> relatively aggressive sending TCP also helps, e.g. one that is
> configured to open its congestion window quickly, as this will produce
> more intensive bursts.

__tcp_select_window() (more precisely tcp_space()) takes into account
memory used in the receive/ofo queues, but not frames in the backlog
queue.

So if you send bursts, it might explain why the TCP stack continues to
advertise too big a window, instead of anticipating the problem.

Please try the following patch :

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e79aa48..82382cb 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1042,8 +1042,9 @@ static inline int tcp_win_from_space(int space)
 /* Note: caller must be prepared to deal with negative returns */ 
 static inline int tcp_space(const struct sock *sk)
 {
-	return tcp_win_from_space(sk->sk_rcvbuf -
-				  atomic_read(&sk->sk_rmem_alloc));
+	int used = atomic_read(&sk->sk_rmem_alloc) + sk->sk_backlog.len;
+
+	return tcp_win_from_space(sk->sk_rcvbuf - used);
 } 
 
 static inline int tcp_full_space(const struct sock *sk)


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-22 16:12                       ` Eric Dumazet
@ 2012-05-22 16:32                         ` Kieran Mansley
  2012-05-22 16:45                           ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Kieran Mansley @ 2012-05-22 16:32 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Ben Hutchings, netdev

On Tue, 2012-05-22 at 18:12 +0200, Eric Dumazet wrote:
> 
> __tcp_select_window() (more precisely tcp_space()) takes into account
> memory used in the receive/ofo queues, but not frames in the backlog
> queue.
> 
> So if you send bursts, it might explain why the TCP stack continues to
> advertise too big a window, instead of anticipating the problem.
> 
> [patch snipped - quoted in full in the previous message]


I can give this a try (not sure when - probably later this week) but I
think it is back to front.  The patch above will reduce the advertised
window by sk_backlog.len, but at the time the window that allowed the
dropped packets to be sent was advertised, the backlog was empty.  It is
later, when the kernel is waking the application and takes the socket
lock, that the backlog starts to be used and the drop happens.  But
reducing the window advertised at this point is futile - the packets that
will be dropped are already in flight.

The problem exists because the backlog has a tighter limit on it than
the receive window does; I think the backlog should be able to accept
sk_rcvbuf bytes in addition to what is already in the receive buffer (or
up to the advertised receive window if that's smaller).  At the moment
it will only accept sk_rcvbuf bytes including what is already in the
receive buffer.  The logic being that in this case we're using the
backlog because it's in the process of emptying the receive buffer into
the application, and so the receive buffer will very soon be empty, and
so we will very soon be able to accept sk_rcvbuf bytes.  This is evident
from the packet capture as the kernel stack is quite happy to accept the
significant quantity of data that arrives as part of the same burst
immediately after it has dropped a couple of packets.

Perhaps it would be easier for me to write a patch to show this
suggested solution?
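
Something like this, perhaps (untested sketch of the idea against the
current sk_rcvqueues_full(), not a submitted patch):

	static inline bool sk_rcvqueues_full(const struct sock *sk,
					     const struct sk_buff *skb)
	{
		/* sketch: let the backlog hold a full sk_rcvbuf on top
		 * of whatever is still in the receive queue being
		 * drained into the application */
		return sk->sk_backlog.len > sk->sk_rcvbuf;
	}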

Kieran


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-22 16:32                         ` Kieran Mansley
@ 2012-05-22 16:45                           ` Eric Dumazet
  2012-05-22 20:54                             ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-22 16:45 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev

On Tue, 2012-05-22 at 17:32 +0100, Kieran Mansley wrote:
> On Tue, 2012-05-22 at 18:12 +0200, Eric Dumazet wrote:
> > 
> > [explanation and patch snipped]
> 
> 
> I can give this a try (not sure when - probably later this week) but I
> think it is back to front.  The patch above will reduce the advertised
> window by sk_backlog.len, but at the time the window that allowed the
> dropped packets to be sent was advertised, the backlog was empty.  It is
> later, when the kernel is waking the application and takes the socket
> lock, that the backlog starts to be used and the drop happens.  But
> reducing the window advertised at this point is futile - the packets that
> will be dropped are already in flight.
> 

Not really. If we receive these packets while the backlog is empty, then
the sender violates TCP rules.

We advertise the tcp window directly from the memory we are allowed to
consume.

(On the premise that the sender behaves correctly, and is not sending
bytes in small packets.)


> The problem exists because the backlog has a tighter limit on it than
> the receive window does; I think the backlog should be able to accept
> sk_rcvbuf bytes in addition to what is already in the receive buffer (or
> up to the advertised receive window if that's smaller).  At the moment
> it will only accept sk_rcvbuf bytes including what is already in the
> receive buffer.  The logic being that in this case we're using the
> backlog because it's in the process of emptying the receive buffer into
> the application, and so the receive buffer will very soon be empty, and
> so we will very soon be able to accept sk_rcvbuf bytes.  This is evident
> from the packet capture as the kernel stack is quite happy to accept the
> significant quantity of data that arrives as part of the same burst
> immediately after it has dropped a couple of packets.
> 

This is not evident from the capture, you are mistaken.

tcpdump captures packets before the tcp stack; it doesn't say if they are:

1) queued in the receive or ofo queue
2) queued in the socket backlog
3) dropped because we hit the socket rcvbuf limit

If the socket lock is held by the user, packets are queued to the backlog,
or dropped.

Then, when the socket lock is about to be released, we process the backlog.
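
(For reference, that drain is __release_sock() in net/core/sock.c,
roughly as below - paraphrased from memory; the real code also drops and
re-takes the socket spinlock around the inner loop:)

	static void __release_sock(struct sock *sk)
	{
		struct sk_buff *skb = sk->sk_backlog.head;

		do {
			sk->sk_backlog.head = sk->sk_backlog.tail = NULL;

			while (skb) {
				struct sk_buff *next = skb->next;

				skb->next = NULL;
				sk_backlog_rcv(sk, skb); /* -> tcp_v4_do_rcv() */
				skb = next;
			}
		} while ((skb = sk->sk_backlog.head) != NULL);
	}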


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-22 16:45                           ` Eric Dumazet
@ 2012-05-22 20:54                             ` Eric Dumazet
  2012-05-23  9:44                               ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-22 20:54 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev

On Tue, 2012-05-22 at 18:45 +0200, Eric Dumazet wrote:

> [previous explanation snipped]

BTW, the latest iproute2 ss util has nice information if you add -m:

misc/ss -m dst 192.168.99.2
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
ESTAB      3441896 0            192.168.99.1:44409         192.168.99.2:41197   
	 skmem:(r5035136,rb6291456,t0,tb23080,f1149824,w0,o0)

Here you can see that for 3441896 bytes in TCP queue (payload), 
we have 5035136 bytes in rmem_alloc,
and 6291456 'bytes' in sk_rcvbuf

It lacks the backlog len, I'll send a patch when net-next reopens.


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-22 20:54                             ` Eric Dumazet
@ 2012-05-23  9:44                               ` Eric Dumazet
  2012-05-23 12:09                                 ` Eric Dumazet
  2012-05-23 17:34                                 ` David Miller
  0 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-05-23  9:44 UTC (permalink / raw)
  To: Kieran Mansley; +Cc: Ben Hutchings, netdev

Update :

My tests show that sk_backlog.len can reach 1.5MB easily on a netperf
run, even with a fast machine, receiving a 10Gb flow (ixgbe adapter), if
LRO/GRO are off.

424 backlog drops for 193,182,546 incoming packets (with my tcp_space()
patch applied)

I believe that as soon as ixgbe can use build_skb() and avoid the 1024
bytes overhead per skb, it should go away.

Of course, another way to solve the problem would be to change
tcp_recvmsg() to use lock_sock_fast(), so that no frame is backlogged at
all.

Locking the socket for the whole operation (including the copyout to
user space) is not very good. It was good enough years ago with small
receive windows.

With a potentially huge backlog, it means the user process has to process
it, regardless of its latency constraints. CPU caches are also completely
destroyed because of the huge amount of data included in thousands of
skbs.


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23  9:44                               ` Eric Dumazet
@ 2012-05-23 12:09                                 ` Eric Dumazet
  2012-05-23 16:04                                   ` Alexander Duyck
  2012-05-23 17:34                                 ` David Miller
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-23 12:09 UTC (permalink / raw)
  To: Kieran Mansley, Jeff Kirsher, Alex Duyck; +Cc: Ben Hutchings, netdev

On Wed, 2012-05-23 at 11:44 +0200, Eric Dumazet wrote:

> I believe that as soon as ixgbe can use build_skb() and avoid the 1024
> bytes overhead per skb, it should go away.


Here is the patch for ixgbe, for reference.

My machine is now able to receive a netperf TCP_STREAM at full speed
(10Gb), even with GRO/LRO off. (TCPRcvCoalesce counter increasing very
fast too.)

It's not an official patch yet, because:

1) I need to properly align DMA buffers to reserve NET_SKB_PAD bytes
(not all workloads are like TCP, and some headroom is needed for
tunnels)

2) Must cope with MTU > 1500 cases

3) Should not be done if NET_IP_ALIGN is not null (I don't know if the
ixgbe hardware can do DMA to a non-aligned area on receive)

This patch saves 1024 bytes per incoming skb. (skb->head directly mapped
to the frag containing the frame, instead of a separate memory area)

 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   82 ++++++++--------
 1 file changed, 46 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index bf20457..d05693a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1511,39 +1511,41 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
 		return true;
 	}
 
-	/*
-	 * it is valid to use page_address instead of kmap since we are
-	 * working with pages allocated out of the lomem pool per
-	 * alloc_page(GFP_ATOMIC)
-	 */
-	va = skb_frag_address(frag);
+	if (!skb_headlen(skb)) {
+		/*
+		 * it is valid to use page_address instead of kmap since we are
+		 * working with pages allocated out of the lowmem pool per
+		 * alloc_page(GFP_ATOMIC)
+		 */
+		va = skb_frag_address(frag);
 
-	/*
-	 * we need the header to contain the greater of either ETH_HLEN or
-	 * 60 bytes if the skb->len is less than 60 for skb_pad.
-	 */
-	pull_len = skb_frag_size(frag);
-	if (pull_len > 256)
-		pull_len = ixgbe_get_headlen(va, pull_len);
+		/*
+		 * we need the header to contain the greater of either ETH_HLEN or
+		 * 60 bytes if the skb->len is less than 60 for skb_pad.
+		 */
+		pull_len = skb_frag_size(frag);
+		if (pull_len > 256)
+			pull_len = ixgbe_get_headlen(va, pull_len);
 
-	/* align pull length to size of long to optimize memcpy performance */
-	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
+		/* align pull length to size of long to optimize memcpy performance */
+		skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
 
-	/* update all of the pointers */
-	skb_frag_size_sub(frag, pull_len);
-	frag->page_offset += pull_len;
-	skb->data_len -= pull_len;
-	skb->tail += pull_len;
+		/* update all of the pointers */
+		skb_frag_size_sub(frag, pull_len);
+		frag->page_offset += pull_len;
+		skb->data_len -= pull_len;
+		skb->tail += pull_len;
 
-	/*
-	 * if we sucked the frag empty then we should free it,
-	 * if there are other frags here something is screwed up in hardware
-	 */
-	if (skb_frag_size(frag) == 0) {
-		BUG_ON(skb_shinfo(skb)->nr_frags != 1);
-		skb_shinfo(skb)->nr_frags = 0;
-		__skb_frag_unref(frag);
-		skb->truesize -= ixgbe_rx_bufsz(rx_ring);
+		/*
+		 * if we sucked the frag empty then we should free it,
+		 * if there are other frags here something is screwed up in hardware
+		 */
+		if (skb_frag_size(frag) == 0) {
+			BUG_ON(skb_shinfo(skb)->nr_frags != 1);
+			skb_shinfo(skb)->nr_frags = 0;
+			__skb_frag_unref(frag);
+			skb->truesize -= ixgbe_rx_bufsz(rx_ring);
+		}
 	}
 
 	/* if skb_pad returns an error the skb was freed */
@@ -1662,6 +1664,8 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		struct sk_buff *skb;
 		struct page *page;
 		u16 ntc;
+		unsigned int len;
+		bool addfrag = true;
 
 		/* return some buffers to hardware, one at a time is too slow */
 		if (cleaned_count >= IXGBE_RX_BUFFER_WRITE) {
@@ -1687,7 +1691,7 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 		prefetchw(page);
 
 		skb = rx_buffer->skb;
-
+		len = le16_to_cpu(rx_desc->wb.upper.length);
 		if (likely(!skb)) {
 			void *page_addr = page_address(page) +
 					  rx_buffer->page_offset;
@@ -1698,9 +1702,14 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			prefetch(page_addr + L1_CACHE_BYTES);
 #endif
 
-			/* allocate a skb to store the frags */
-			skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
-							IXGBE_RX_HDR_SIZE);
+			/* allocate a skb to store the frag */
+			if (len <= 256) {
+				skb = netdev_alloc_skb_ip_align(rx_ring->netdev,
+								256);
+			} else {
+				skb = build_skb(page_addr, ixgbe_rx_bufsz(rx_ring));
+				addfrag = false;
+			}
 			if (unlikely(!skb)) {
 				rx_ring->rx_stats.alloc_rx_buff_failed++;
 				break;
@@ -1729,9 +1738,10 @@ static bool ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 						      DMA_FROM_DEVICE);
 		}
 
-		/* pull page into skb */
-		ixgbe_add_rx_frag(rx_ring, rx_buffer, skb,
-				  le16_to_cpu(rx_desc->wb.upper.length));
+		if (addfrag)
+			ixgbe_add_rx_frag(rx_ring, rx_buffer, skb, len);
+		else
+			__skb_put(skb, len);
 
 		if (ixgbe_can_reuse_page(rx_buffer)) {
 			/* hand second half of page back to the ring */


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 12:09                                 ` Eric Dumazet
@ 2012-05-23 16:04                                   ` Alexander Duyck
  2012-05-23 16:12                                     ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Duyck @ 2012-05-23 16:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On 05/23/2012 05:09 AM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 11:44 +0200, Eric Dumazet wrote:
>
>> I believe that as soon as ixgbe can use build_skb() and avoid the 1024
>> bytes overhead per skb, it should go away.
>
> Here is the patch for ixgbe, for reference.
I'm confused as to what this is trying to accomplish.

Currently the way the ixgbe driver works is that we allocate the
skb->head using netdev_alloc_skb, which after your recent changes should
be using a head frag.  If the buffer is less than 256 bytes we have
pushed the entire buffer into the head frag, and if it is more we only
pull everything up to the end of the TCP header.  In either case if we
are merging TCP flows we should be able to drop one page or the other
along with the sk_buff giving us a total truesize addition after merge
of ~1K for less than 256 bytes or 2K for a full sized frame.

I'll try to take a look at this today as it is in our interest to have
TCP performing as well as possible on ixgbe.

Thanks,

Alex


 
> [remainder of the quoted message and patch snipped]


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 16:04                                   ` Alexander Duyck
@ 2012-05-23 16:12                                     ` Eric Dumazet
  2012-05-23 16:39                                       ` Eric Dumazet
  2012-05-23 16:58                                       ` Alexander Duyck
  0 siblings, 2 replies; 37+ messages in thread
From: Eric Dumazet @ 2012-05-23 16:12 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On Wed, 2012-05-23 at 09:04 -0700, Alexander Duyck wrote:
> On 05/23/2012 05:09 AM, Eric Dumazet wrote:
> > On Wed, 2012-05-23 at 11:44 +0200, Eric Dumazet wrote:
> >
> >> I believe that as soon as ixgbe can use build_skb() and avoid the 1024
> >> bytes overhead per skb, it should go away.
> >
> > Here is the patch for ixgbe, for reference.
> I'm confused as to what this is trying to accomplish.
> 
> Currently the way the ixgbe driver works is that we allocate the
> skb->head using netdev_alloc_skb, which after your recent changes should
> be using a head frag.  If the buffer is less than 256 bytes we have
> pushed the entire buffer into the head frag, and if it is more we only
> pull everything up to the end of the TCP header.  In either case if we
> are merging TCP flows we should be able to drop one page or the other
> along with the sk_buff giving us a total truesize addition after merge
> of ~1K for less than 256 bytes or 2K for a full sized frame.
> 
> I'll try to take a look at this today as it is in our interest to have
> TCP performing as well as possible on ixgbe.

With current driver, a MTU=1500 frame uses :

sk_buff (256 bytes)
skb->head : 1024 bytes  (or more exactly now : 512 + 384)
one fragment of 2048 bytes

At skb free time,  one kfree(sk_buff) and two put_page().

After this patch :

sk_buff (256 bytes)
skb->head : 2048 bytes 

At skb free time, one kfree(sk_buff) and only one put_page().

Note that my patch doesn't change the 256 byte threshold: small frames
won't have a fragment and their use is :

sk_buff (256 bytes)
skb->head : 512 + 384 bytes 


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 16:12                                     ` Eric Dumazet
@ 2012-05-23 16:39                                       ` Eric Dumazet
  2012-05-23 17:10                                         ` Alexander Duyck
  2012-05-23 16:58                                       ` Alexander Duyck
  1 sibling, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-23 16:39 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:

> With current driver, a MTU=1500 frame uses :
> 
> sk_buff (256 bytes)
> skb->head : 1024 bytes  (or more exactly now : 512 + 384)

By the way, NET_SKB_PAD adds 64 bytes so it's 64 + 512 + 384 = 960


* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 16:12                                     ` Eric Dumazet
  2012-05-23 16:39                                       ` Eric Dumazet
@ 2012-05-23 16:58                                       ` Alexander Duyck
  2012-05-23 17:24                                         ` Eric Dumazet
  1 sibling, 1 reply; 37+ messages in thread
From: Alexander Duyck @ 2012-05-23 16:58 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On 05/23/2012 09:12 AM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 09:04 -0700, Alexander Duyck wrote:
>> On 05/23/2012 05:09 AM, Eric Dumazet wrote:
>>> On Wed, 2012-05-23 at 11:44 +0200, Eric Dumazet wrote:
>>>
>>>> I believe that as soon as ixgbe can use build_skb() and avoid the 1024
>>>> bytes overhead per skb, it should go away.
>>> Here is the patch for ixgbe, for reference.
>> I'm confused as to what this is trying to accomplish.
>>
>> Currently the way the ixgbe driver works is that we allocate the
>> skb->head using netdev_alloc_skb, which after your recent changes should
>> be using a head frag.  If the buffer is less than 256 bytes we have
>> pushed the entire buffer into the head frag, and if it is more we only
>> pull everything up to the end of the TCP header.  In either case if we
>> are merging TCP flows we should be able to drop one page or the other
>> along with the sk_buff giving us a total truesize addition after merge
>> of ~1K for less than 256 bytes or 2K for a full sized frame.
>>
>> I'll try to take a look at this today as it is in our interest to have
>> TCP performing as well as possible on ixgbe.
> With current driver, a MTU=1500 frame uses :
>
> sk_buff (256 bytes)
> skb->head : 1024 bytes  (or more exactly now : 512 + 384)
> one fragment of 2048 bytes
>
> At skb free time,  one kfree(sk_buff) and two put_page().
>
> After this patch :
>
> sk_buff (256 bytes)
> skb->head : 2048 bytes 
>
> At skb free time, one kfree(sk_buff) and only one put_page().
>
> Note that my patch doesn't change the 256-byte threshold: small frames
> won't have one fragment and their use is :
>
> sk_buff (256 bytes)
> skb->head : 512 + 384 bytes 
>
>
Right, but the problem is that in order to make this work we are
dropping the padding for the head and hoping to have room for shared info.
This is going to kill performance for things like routing workloads
since the entire head is going to have to be copied over to make space
for NET_SKB_PAD.  Also, this assumes RSC is not enabled.  RSC is
normally enabled by default.  If it is turned on we are going to start
receiving full 2K buffers which will cause even more issues since there
wouldn't be any room for shared info in the 2K frame.
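
To put rough numbers on it (a sketch, assuming ~320 bytes for the
aligned skb_shared_info):

  non-RSC: ~1514 (max frame) + 320 (shinfo) = ~1834 <= 2048  -> shinfo fits
  RSC:      2048 (hardware fills the buffer) + 320 (shinfo)  -> no room left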

The way the driver is currently written probably provides the optimal
setup for truesize given the circumstances.  In order to support
receiving at least 1 full 1500 byte frame per fragment, and supporting
RSC I have to support receiving up to 2K of data.  If we try to make it
all part of one paged receive we would then have to either reduce the
receive buffer size to 1K in hardware and span multiple fragments for a
1.5K frame or allocate a 3K buffer so we would have room to add
NET_SKB_PAD and the shared info on the end.  At which point we are back
to the extra 1K again, only in that case we cannot trim it off later via
skb_try_coalesce.  In the 3K buffer case we would be over a 1/2 page
which means we can only get one buffer per page instead of 2 in which
case we might as well just round it up to 4K and be honest.
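
The page-packing arithmetic behind that (a sketch, for 4K pages):

  2048-byte buffers: 4096 / 2048 = 2 per page
  3072-byte buffers: 3072 > 4096 / 2  -> only 1 per page, 1K wasted
  4096-byte buffers: 1 per page, i.e. the same page cost as 3K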

The reason I am confused is that I thought the skb_try_coalesce function
was supposed to be what addressed these types of issues.  If these
packets go through that function they should be stripping the sk_buff
and possibly even the skb->head if we used the fragment since the only
thing that is going to end up in the head would be the TCP header which
should have been pulled prior to trying to coalesce.
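
Roughly, the path I have in mind looks like this (a sketch against the
current net-next helpers, with 'tail' standing in for the skb already
at the end of the receive queue; treat the details as approximate):

	bool fragstolen = false;
	int delta = 0;

	if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
		/* skb's frags (and possibly its stolen head frag) now
		 * belong to 'tail'; the queue is charged only 'delta'
		 * extra truesize instead of a whole new skb.
		 */
		kfree_skb_partial(skb, fragstolen);
	}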

I will need to investigate this further to understand what is going on. 
I realize that dealing with 3K of memory for buffer storage is not
ideal, but all of the alternatives lean more toward 4K when fully
implemented.  I'll try and see what alternative solutions we might have
available.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 16:39                                       ` Eric Dumazet
@ 2012-05-23 17:10                                         ` Alexander Duyck
  2012-05-23 21:19                                           ` Alexander Duyck
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Duyck @ 2012-05-23 17:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On 05/23/2012 09:39 AM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
>
>> With current driver, a MTU=1500 frame uses :
>>
>> sk_buff (256 bytes)
>> skb->head : 1024 bytes  (or more exactly now : 512 + 384)
> By the way, NET_SKB_PAD adds 64 bytes so it's 64 + 512 + 384 = 960
Actually pahole seems to be indicating to me the size of skb_shared_info
is 320, unless something has changed in the last few days.

When I get a chance I will try to remember to reduce the ixgbe header
size to 256 which should also help.  The only reason it is set to 512
was to deal with the fact that the old alloc_skb code wasn't aligning
the shared info with the end of whatever size was allocated and so the
512 was an approximation to make better use of the 1K slab allocation
back when we still were using hardware packet split.  That should help
to improve the page utilization for the headers since that would
increase the uses of a page from 4 to 6 for the skb head frag, and it
would drop truesize by another 256 bytes.
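
For reference, the packing arithmetic (a sketch, assuming 64 bytes of
NET_SKB_PAD and a 320-byte shared info):

  512-byte header: 64 + 512 + 320 = 896  -> 4096 / 896 = 4 uses per 4K page
  256-byte header: 64 + 256 + 320 = 640  -> 4096 / 640 = 6 uses per 4K page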

Thanks,

Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 16:58                                       ` Alexander Duyck
@ 2012-05-23 17:24                                         ` Eric Dumazet
  2012-05-23 17:57                                           ` Alexander Duyck
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-23 17:24 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On Wed, 2012-05-23 at 09:58 -0700, Alexander Duyck wrote:

> Right, but the problem is that in order to make this work we are
> dropping the padding for the head and hoping to have room for shared info.
> This is going to kill performance for things like routing workloads
> since the entire head is going to have to be copied over to make space
> for NET_SKB_PAD. 

Hey, I said that's one of the points I have to address in my patch.
Please read it again ;)

By the way, we can also add code doing the skb->head upgrade to a fragment
again in case we need to add a tunnel header, instead of a full copy.

So maybe the NET_SKB_PAD is not really needed anymore.

Anyway, a router host could use a different allocation strategy (going
back to the current one).

>  Also, this assumes RSC is not enabled.  RSC is
> normally enabled by default.  If it is turned on we are going to start
> receiving full 2K buffers which will cause even more issues since there
> wouldn't be any room for shared info in the 2K frame.
> 

Hey, this is one of the points I have to address, as also mentioned.

It's almost trivial to check len (if we have room for shared info, take
it; if not, allocate the head as before).


> The way the driver is currently written probably provides the optimal
> setup for truesize given the circumstances.

It's unfortunate the hardware has 1KB granularity.


>   In order to support
> receiving at least 1 full 1500 byte frame per fragment, and supporting
> RSC I have to support receiving up to 2K of data.  If we try to make it
> all part of one paged receive we would then have to either reduce the
> receive buffer size to 1K in hardware and span multiple fragments for a
> 1.5K frame or allocate a 3K buffer so we would have room to add
> NET_SKB_PAD and the shared info on the end.  At which point we are back
> to the extra 1K again, only in that case we cannot trim it off later via
> skb_try_coalesce.  In the 3K buffer case we would be over a 1/2 page
> which means we can only get one buffer per page instead of 2 in which
> case we might as well just round it up to 4K and be honest.
> 
> The reason I am confused is that I thought the skb_try_coalesce function
> was supposed to be what addressed these types of issues.  If these
> packets go through that function they should be stripping the sk_buff
> and possibly even the skb->head if we used the fragment since the only
> thing that is going to end up in the head would be the TCP header which
> should have been pulled prior to trying to coalesce.
> 
> I will need to investigate this further to understand what is going on. 
> I realize that dealing with 3K of memory for buffer storage is not
> ideal, but all of the alternatives lean more toward 4K when fully
> implemented.  I'll try and see what alternative solutions we might have
> available.

Problem is skb_try_coalesce() is not used when we store packets in the
socket backlog, and it is only used for TCP at the moment.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23  9:44                               ` Eric Dumazet
  2012-05-23 12:09                                 ` Eric Dumazet
@ 2012-05-23 17:34                                 ` David Miller
  2012-05-23 17:46                                   ` Eric Dumazet
  1 sibling, 1 reply; 37+ messages in thread
From: David Miller @ 2012-05-23 17:34 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kmansley, bhutchings, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 May 2012 11:44:06 +0200

> Locking the socket for the whole operation (including copyout to user)
> is not very good. It was good enough years ago with small receive
> window.
> 
> With a potentially huge backlog, it means user process has to process
> it, regardless of its latency constraints. CPU caches are also
> completely destroyed because of huge amount of data included in thousand
> of skbs.

But it is the only way we can have TCP processing scheduled and
accounted to user processes.  That does have value when you have lots
of flows active.

The scheduler's ability to give the process cpu time influences
TCP's behavior, and under load if the process can't get enough
cpu time then TCP will back off.  We want that.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 17:34                                 ` David Miller
@ 2012-05-23 17:46                                   ` Eric Dumazet
  2012-05-23 17:57                                     ` David Miller
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-23 17:46 UTC (permalink / raw)
  To: David Miller; +Cc: kmansley, bhutchings, netdev

On Wed, 2012-05-23 at 13:34 -0400, David Miller wrote:

> But it is the only way we can have TCP processing scheduled and
> accounted to user processes.  That does have value when you have lots
> of flows active.
> 
> The scheduler's ability to give the process cpu time influences
> TCP's behavior, and under load if the process can't get enough
> cpu time then TCP will back off.  We want that.

But TCP already backs off if the user process is not blocked on socket
input.

Modern applications use select()/poll()/epoll() on many sockets in parallel.

Only old ones still block on recv().

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 17:46                                   ` Eric Dumazet
@ 2012-05-23 17:57                                     ` David Miller
  0 siblings, 0 replies; 37+ messages in thread
From: David Miller @ 2012-05-23 17:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: kmansley, bhutchings, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 May 2012 19:46:50 +0200

> But TCP already backs off if the user process is not blocked on socket
> input.
> 
> Modern applications use select()/poll()/epoll() on many sockets in parallel.
> 
> Only old ones still block on recv().

These arguments seem circular.

Those modern applications still trigger enough TCP work during their
recv() calls that it's significant enough for scheduling purposes, and
to me being able to account that TCP work as process time is still
extremely beneficial.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 17:24                                         ` Eric Dumazet
@ 2012-05-23 17:57                                           ` Alexander Duyck
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Duyck @ 2012-05-23 17:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On 05/23/2012 10:24 AM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 09:58 -0700, Alexander Duyck wrote:
>
>> Right, but the problem is that in order to make this work we are
>> dropping the padding for the head and hoping to have room for shared info.
>> This is going to kill performance for things like routing workloads
>> since the entire head is going to have to be copied over to make space
>> for NET_SKB_PAD. 
> Hey, I said that's one of the points I have to address in my patch.
> Please read it again ;)
I'm aware of that, but still it seems like we are getting ahead of
ourselves.  This fix is so specific to just the socket backlog case that
I think we are missing the fact that it is going to have huge
performance repercussions elsewhere.

> By the way, we can also add code doing the skb->head upgrade to a fragment
> again in case we need to add a tunnel header, instead of a full copy.
>
> So maybe the NET_SKB_PAD is not really needed anymore.
>
> Anyway, a router host could use a different allocation strategy (going
> back to the current one).
The thing I don't like is that we are adding extra memcpy calls to all
of these paths.  We cannot change the head without having to copy the
shared info and there is going to be a cost for that.  I would prefer to
only generate the shared info once and not have to relocate it every
time I want to run a router or tunnel.

>>  Also, this assumes RSC is not enabled.  RSC is
>> normally enabled by default.  If it is turned on we are going to start
>> receiving full 2K buffers which will cause even more issues since there
>> wouldn't be any room for shared info in the 2K frame.
>>
> Hey, this is one of the points I have to address, as also mentioned.
>
> It's almost trivial to check len (if we have room for shared info, take
> it; if not, allocate the head as before).
I know you mentioned this as well.  The thing is I would prefer not to
add code where we are branching in so many different directions on what
the header actually looks like.  It tends to open a lot of opportunities
for bugs when someone makes a change and doesn't take one of the
possible head and fragment combinations into account.

>> The way the driver is currently written probably provides the optimal
>> setup for truesize given the circumstances.
> It's unfortunate the hardware has 1KB granularity.
Agreed, I would have preferred 512B granularity, but we are locked in at
1K for now..  :-/

> Problem is skb_try_coalesce() is not used when we store packets in the
> socket backlog, and it is only used for TCP at the moment.
One piece of low hanging fruit that is available to help with some of
this is to drop the Rx header size for ixgbe to 256.  That should at
least cut the total truesize for the buffer and sk_buff to 960 or so
which is at least a step in the right direction.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 17:10                                         ` Alexander Duyck
@ 2012-05-23 21:19                                           ` Alexander Duyck
  2012-05-23 21:37                                             ` Eric Dumazet
  0 siblings, 1 reply; 37+ messages in thread
From: Alexander Duyck @ 2012-05-23 21:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On 05/23/2012 10:10 AM, Alexander Duyck wrote:
> On 05/23/2012 09:39 AM, Eric Dumazet wrote:
>> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
>>
>>> With current driver, a MTU=1500 frame uses :
>>>
>>> sk_buff (256 bytes)
>>> skb->head : 1024 bytes  (or more exactly now : 512 + 384)
>> By the way, NET_SKB_PAD adds 64 bytes so it's 64 + 512 + 384 = 960
> Actually pahole seems to be indicating to me the size of skb_shared_info
> is 320, unless something has changed in the last few days.
>
> When I get a chance I will try to remember to reduce the ixgbe header
> size to 256 which should also help.  The only reason it is set to 512
> was to deal with the fact that the old alloc_skb code wasn't aligning
> the shared info with the end of whatever size was allocated and so the
> 512 was an approximation to make better use of the 1K slab allocation
> back when we still were using hardware packet split.  That should help
> to improve the page utilization for the headers since that would
> increase the uses of a page from 4 to 6 for the skb head frag, and it
> would drop truesize by another 256 bytes.
>
> Thanks,
>
> Alex
Here is the patch for review.  I have submitted the official patch to Jeff 
so that it can go through his tree for testing, validation, and submission 
once Dave's tree opens back up.

---

The recent changes to netdev_alloc_skb mean that the size of the buffer
now has a more direct impact on the truesize.  So in order to make the
best use of the piece of a page we are allocated, I am reducing
IXGBE_RX_HDR_SIZE to 256 so that our truesize will be reduced by 256
bytes as well.

This should result in performance improvements since the number of uses per
page should increase from 4 to 6 in the case of a 4K page.  In addition we
should see socket performance improvements due to the truesize dropping
to less than 1K for buffers less than 256 bytes.

Not-Yet-Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
 2 files changed, 10 insertions(+), 9 deletions(-)


diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 402dd66..468e4ab 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -77,17 +77,18 @@
 #define IXGBE_MAX_FCPAUSE		 0xFFFF
 
 /* Supported Rx Buffer Sizes */
-#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
+#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
 #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
 
 /*
- * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
- * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
- * this adds up to 512 bytes of extra data meaning the smallest allocation
- * we could have is 1K.
- * i.e. RXBUFFER_512 --> size-1024 slab
+ * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
+ * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
+ * this adds up to 448 bytes of extra data.
+ *
+ * Since netdev_alloc_skb now allocates a page fragment we can use a value
+ * of 256 and the resultant skb will have a truesize of 960 or less.
  */
-#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
+#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
 
 #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7f92e40..f92b31a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1520,8 +1520,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
 	 * 60 bytes if the skb->len is less than 60 for skb_pad.
 	 */
 	pull_len = skb_frag_size(frag);
-	if (pull_len > 256)
-		pull_len = ixgbe_get_headlen(va, pull_len);
+	if (pull_len > IXGBE_RX_HDR_SIZE)
+		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
 
 	/* align pull length to size of long to optimize memcpy performance */
 	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 21:19                                           ` Alexander Duyck
@ 2012-05-23 21:37                                             ` Eric Dumazet
  2012-05-23 22:03                                               ` Alexander Duyck
  0 siblings, 1 reply; 37+ messages in thread
From: Eric Dumazet @ 2012-05-23 21:37 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On Wed, 2012-05-23 at 14:19 -0700, Alexander Duyck wrote:
> On 05/23/2012 10:10 AM, Alexander Duyck wrote:
> > On 05/23/2012 09:39 AM, Eric Dumazet wrote:
> >> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
> >>
> >>> With current driver, a MTU=1500 frame uses :
> >>>
> >>> sk_buff (256 bytes)
> >>> skb->head : 1024 bytes  (or more exactly now : 512 + 384)
> >> By the way, NET_SKB_PAD adds 64 bytes so it's 64 + 512 + 384 = 960
> > Actually pahole seems to be indicating to me the size of skb_shared_info
> > is 320, unless something has changed in the last few days.
> >
> > When I get a chance I will try to remember to reduce the ixgbe header
> > size to 256 which should also help.  The only reason it is set to 512
> > was to deal with the fact that the old alloc_skb code wasn't aligning
> > the shared info with the end of whatever size was allocated and so the
> > 512 was an approximation to make better use of the 1K slab allocation
> > back when we still were using hardware packet split.  That should help
> > to improve the page utilization for the headers since that would
> > increase the uses of a page from 4 to 6 for the skb head frag, and it
> > would drop truesize by another 256 bytes.
> >
> > Thanks,
> >
> > Alex
> Here is the patch for review.  I have submitted the official patch to Jeff 
> so that it can go through his tree for testing, validation, and submission 
> once Dave's tree opens back up.
> 
> ---
> 
> The recent changes to netdev_alloc_skb mean that the size of the buffer
> now has a more direct impact on the truesize.  So in order to make the
> best use of the piece of a page we are allocated, I am reducing
> IXGBE_RX_HDR_SIZE to 256 so that our truesize will be reduced by 256
> bytes as well.
> 
> This should result in performance improvements since the number of uses per
> page should increase from 4 to 6 in the case of a 4K page.  In addition we
> should see socket performance improvements due to the truesize dropping
> to less than 1K for buffers less than 256 bytes.
> 
> Not-Yet-Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
>  2 files changed, 10 insertions(+), 9 deletions(-)
> 
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index 402dd66..468e4ab 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -77,17 +77,18 @@
>  #define IXGBE_MAX_FCPAUSE		 0xFFFF
>  
>  /* Supported Rx Buffer Sizes */
> -#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
> +#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
>  #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
>  
>  /*
> - * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
> - * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
> - * this adds up to 512 bytes of extra data meaning the smallest allocation
> - * we could have is 1K.
> - * i.e. RXBUFFER_512 --> size-1024 slab
> + * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
> + * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
> + * this adds up to 448 bytes of extra data.
> + *
> + * Since netdev_alloc_skb now allocates a page fragment we can use a value
> + * of 256 and the resultant skb will have a truesize of 960 or less.
>   */
> -#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
> +#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
>  
>  #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
>  
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 7f92e40..f92b31a 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1520,8 +1520,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
>  	 * 60 bytes if the skb->len is less than 60 for skb_pad.
>  	 */
>  	pull_len = skb_frag_size(frag);
> -	if (pull_len > 256)
> -		pull_len = ixgbe_get_headlen(va, pull_len);
> +	if (pull_len > IXGBE_RX_HDR_SIZE)
> +		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
>  
>  	/* align pull length to size of long to optimize memcpy performance */
>  	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
> 


By the way you should reword the comment about NET_IP_ALIGN

On x86 NET_IP_ALIGN is 0, so we don't 'reserve 64 bytes more'


-> 896 bytes

Also, are you sure :

srrctl |= (IXGBE_RX_HDR_SIZE << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &  
	IXGBE_SRRCTL_BSIZEHDR_MASK; 


is still needed in ixgbe_configure_srrctl(), since it uses
IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF (non packet split)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: TCPBacklogDrops during aggressive bursts of traffic
  2012-05-23 21:37                                             ` Eric Dumazet
@ 2012-05-23 22:03                                               ` Alexander Duyck
  0 siblings, 0 replies; 37+ messages in thread
From: Alexander Duyck @ 2012-05-23 22:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Kieran Mansley, Jeff Kirsher, Ben Hutchings, netdev

On 05/23/2012 02:37 PM, Eric Dumazet wrote:
> On Wed, 2012-05-23 at 14:19 -0700, Alexander Duyck wrote:
>> On 05/23/2012 10:10 AM, Alexander Duyck wrote:
>>> On 05/23/2012 09:39 AM, Eric Dumazet wrote:
>>>> On Wed, 2012-05-23 at 18:12 +0200, Eric Dumazet wrote:
>>>>
>>>>> With current driver, a MTU=1500 frame uses :
>>>>>
>>>>> sk_buff (256 bytes)
>>>>> skb->head : 1024 bytes  (or more exactly now : 512 + 384)
>>>> By the way, NET_SKB_PAD adds 64 bytes so it's 64 + 512 + 384 = 960
>>> Actually pahole seems to be indicating to me the size of skb_shared_info
>>> is 320, unless something has changed in the last few days.
>>>
>>> When I get a chance I will try to remember to reduce the ixgbe header
>>> size to 256 which should also help.  The only reason it is set to 512
>>> was to deal with the fact that the old alloc_skb code wasn't aligning
>>> the shared info with the end of whatever size was allocated and so the
>>> 512 was an approximation to make better use of the 1K slab allocation
>>> back when we still were using hardware packet split.  That should help
>>> to improve the page utilization for the headers since that would
>>> increase the uses of a page from 4 to 6 for the skb head frag, and it
>>> would drop truesize by another 256 bytes.
>>>
>>> Thanks,
>>>
>>> Alex
>> Here is the patch for review.  I have submitted the official patch to Jeff 
>> so that it can go through his tree for testing, validation, and submission 
>> once Dave's tree opens back up.
>>
>> ---
>>
>> The recent changes to netdev_alloc_skb mean that the size of the buffer
>> now has a more direct impact on the truesize.  So in order to make the
>> best use of the piece of a page we are allocated, I am reducing
>> IXGBE_RX_HDR_SIZE to 256 so that our truesize will be reduced by 256
>> bytes as well.
>>
>> This should result in performance improvements since the number of uses per
>> page should increase from 4 to 6 in the case of a 4K page.  In addition we
>> should see socket performance improvements due to the truesize dropping
>> to less than 1K for buffers less than 256 bytes.
>>
>> Not-Yet-Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
>>
>>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |   15 ++++++++-------
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    4 ++--
>>  2 files changed, 10 insertions(+), 9 deletions(-)
>>
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> index 402dd66..468e4ab 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
>> @@ -77,17 +77,18 @@
>>  #define IXGBE_MAX_FCPAUSE		 0xFFFF
>>  
>>  /* Supported Rx Buffer Sizes */
>> -#define IXGBE_RXBUFFER_512   512    /* Used for packet split */
>> +#define IXGBE_RXBUFFER_256    256  /* Used for skb receive header */
>>  #define IXGBE_MAX_RXBUFFER  16384  /* largest size for a single descriptor */
>>  
>>  /*
>> - * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN mans we
>> - * reserve 2 more, and skb_shared_info adds an additional 384 bytes more,
>> - * this adds up to 512 bytes of extra data meaning the smallest allocation
>> - * we could have is 1K.
>> - * i.e. RXBUFFER_512 --> size-1024 slab
>> + * NOTE: netdev_alloc_skb reserves up to 64 bytes, NET_IP_ALIGN means we
>> + * reserve 64 more, and skb_shared_info adds an additional 320 bytes more,
>> + * this adds up to 448 bytes of extra data.
>> + *
>> + * Since netdev_alloc_skb now allocates a page fragment we can use a value
>> + * of 256 and the resultant skb will have a truesize of 960 or less.
>>   */
>> -#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_512
>> +#define IXGBE_RX_HDR_SIZE IXGBE_RXBUFFER_256
>>  
>>  #define MAXIMUM_ETHERNET_VLAN_SIZE (ETH_FRAME_LEN + ETH_FCS_LEN + VLAN_HLEN)
>>  
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index 7f92e40..f92b31a 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -1520,8 +1520,8 @@ static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
>>  	 * 60 bytes if the skb->len is less than 60 for skb_pad.
>>  	 */
>>  	pull_len = skb_frag_size(frag);
>> -	if (pull_len > 256)
>> -		pull_len = ixgbe_get_headlen(va, pull_len);
>> +	if (pull_len > IXGBE_RX_HDR_SIZE)
>> +		pull_len = ixgbe_get_headlen(va, IXGBE_RX_HDR_SIZE);
>>  
>>  	/* align pull length to size of long to optimize memcpy performance */
>>  	skb_copy_to_linear_data(skb, va, ALIGN(pull_len, sizeof(long)));
>>
>
> By the way you should reword the comment about NET_IP_ALIGN
>
> On x86 NET_IP_ALIGN is 0, so we don't 'reserve 64 bytes more'
On x86 yes, on other platforms that don't clear the value it will add 2
more bytes, and after cache alignment it will likely add another 64.  I
am really just stating the worst case scenario here.  This is why I end
the comment with "960 or less".
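
Spelling out the two cases (a sketch):

  x86   (NET_IP_ALIGN == 0): 256 (sk_buff) + 64 (NET_SKB_PAD)
                             + 256 (header) + 320 (shinfo) = 896
  other (NET_IP_ALIGN == 2, padded to a cacheline): 896 + 64 = 960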

> -> 896 bytes
.. or 832 bytes if you really tweak settings and get the sk_buff down to
192 bytes.  :-)

> Also, are you sure :
>
> srrctl |= (IXGBE_RX_HDR_SIZE << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT) &  
> 	IXGBE_SRRCTL_BSIZEHDR_MASK; 
>
>
> is still needed in ixgbe_configure_srrctl(), since it uses
> IXGBE_SRRCTL_DESCTYPE_ADV_ONEBUF (non packet split)
That bit is needed for RSC.  Basically we have to specify the maximum
size of a header, even if we are not using packet split.  The value has
to be at least 128.  Since we are using the value of IXGBE_RX_HDR_SIZE
to limit how much header we pull anyway, I figure it is probably a good
idea just to leave this here.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2012-05-23 22:03 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-15 14:38 TCPBacklogDrops during aggressive bursts of traffic Kieran Mansley
2012-05-15 14:56 ` Eric Dumazet
2012-05-15 15:00   ` Eric Dumazet
2012-05-15 16:29   ` Kieran Mansley
2012-05-15 16:34     ` Eric Dumazet
2012-05-15 16:47       ` Ben Hutchings
2012-05-15 17:01         ` Eric Dumazet
2012-05-15 17:23           ` Eric Dumazet
2012-05-17 16:31           ` Kieran Mansley
2012-05-17 16:37             ` Eric Dumazet
2012-05-18 15:45               ` Kieran Mansley
2012-05-18 15:49                 ` Eric Dumazet
2012-05-18 15:53                   ` Kieran Mansley
2012-05-18 18:40                 ` Eric Dumazet
2012-05-22  8:20               ` Kieran Mansley
2012-05-22  9:25                 ` Eric Dumazet
2012-05-22  9:30                   ` Eric Dumazet
2012-05-22 15:09                     ` Kieran Mansley
2012-05-22 16:12                       ` Eric Dumazet
2012-05-22 16:32                         ` Kieran Mansley
2012-05-22 16:45                           ` Eric Dumazet
2012-05-22 20:54                             ` Eric Dumazet
2012-05-23  9:44                               ` Eric Dumazet
2012-05-23 12:09                                 ` Eric Dumazet
2012-05-23 16:04                                   ` Alexander Duyck
2012-05-23 16:12                                     ` Eric Dumazet
2012-05-23 16:39                                       ` Eric Dumazet
2012-05-23 17:10                                         ` Alexander Duyck
2012-05-23 21:19                                           ` Alexander Duyck
2012-05-23 21:37                                             ` Eric Dumazet
2012-05-23 22:03                                               ` Alexander Duyck
2012-05-23 16:58                                       ` Alexander Duyck
2012-05-23 17:24                                         ` Eric Dumazet
2012-05-23 17:57                                           ` Alexander Duyck
2012-05-23 17:34                                 ` David Miller
2012-05-23 17:46                                   ` Eric Dumazet
2012-05-23 17:57                                     ` David Miller
