* [PATCH net-next] tcp: remove a bogus TSO split
@ 2013-12-12 19:28 Eric Dumazet
  2013-12-13 14:15 ` Neal Cardwell
  2013-12-13 18:13 ` [PATCH v2 " Eric Dumazet
  0 siblings, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2013-12-12 19:28 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Yuchung Cheng, Neal Cardwell, Nandita Dukkipati, Van Jacobson

From: Eric Dumazet <edumazet@google.com>

While investigating performance problems on small RPC workloads,
I noticed the Linux TCP stack was always splitting the last TSO skb
into two parts (skbs): one a multiple of MSS, and a small one
carrying the Push flag. This split is done even if TCP_NODELAY is set.

Example with request/response of 4K/4K

IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>

We think this is not needed.

All the Nagle/window tests are done at this point.
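
For the record, here is a userspace toy model of the arithmetic (not the
kernel code itself), using the 1448-byte MSS visible in the trace above:

#include <stdio.h>

/* Toy model of tcp_mss_split_point(); values taken from the 4K/4K trace. */
static unsigned int split_old(unsigned int needed, unsigned int max_len,
                              unsigned int mss_now)
{
        if (max_len <= needed)
                return max_len;
        return needed - needed % mss_now;  /* chops off the partial segment */
}

static unsigned int split_new(unsigned int needed, unsigned int max_len)
{
        return needed < max_len ? needed : max_len;  /* min(max_len, needed) */
}

int main(void)
{
        unsigned int mss = 1448, needed = 4096, max_len = 64 * mss;

        printf("old: %u\n", split_old(needed, max_len, mss)); /* 2896, plus a 1200 byte skb */
        printf("new: %u\n", split_new(needed, max_len));      /* 4096 in one TSO skb */
        return 0;
}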

This patch tremendously improves performance, as the traffic now looks
like :

IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>


Before :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     205719.049006 task-clock                #    9.278 CPUs utilized          
         8,449,968 context-switches          #    0.041 M/sec                  
         1,935,997 CPU-migrations            #    0.009 M/sec                  
           160,541 page-faults               #    0.780 K/sec                  
   548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
   455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
   272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
   166,091,460,030 instructions              #    0.30  insns per cycle        
                                             #    2.74  stalled cycles per insn [83.39%]
    29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
     1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]

      22.173517844 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   16851063           0.0
IpExtOutOctets                  23878580777        0.0

After patch :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     107496.071918 task-clock                #    4.847 CPUs utilized          
         5,635,458 context-switches          #    0.052 M/sec                  
         1,374,707 CPU-migrations            #    0.013 M/sec                  
           160,920 page-faults               #    0.001 M/sec                  
   281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
   228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
   142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
    95,227,712,566 instructions              #    0.34  insns per cycle        
                                             #    2.40  stalled cycles per insn [83.43%]
    16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
       874,252,952 branch-misses             #    5.39% of all branches         [83.37%]

      22.175821286 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   11239428           0.0
IpExtOutOctets                  23595191035        0.0

Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 bytes instead of 1417, thus helping GRO to be more efficient when using
the FQ packet scheduler.
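
A quick sanity check of those two ratios from the nstat counters above
(throwaway userspace snippet):

#include <stdio.h>

/* Recompute tx skb occupancy = IpExtOutOctets / IpOutRequests */
int main(void)
{
        printf("before: %.0f bytes/skb\n", 23878580777.0 / 16851063.0); /* ~1417 */
        printf("after : %.0f bytes/skb\n", 23595191035.0 / 11239428.0); /* ~2099 */
        return 0;
}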


Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
---
 net/ipv4/tcp_output.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2a69f42e51ca..335e110e86ba 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1410,10 +1410,7 @@ static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_b
 
 	needed = min(skb->len, window);
 
-	if (max_len <= needed)
-		return max_len;
-
-	return needed - needed % mss_now;
+	return min(max_len, needed);
 }
 
 /* Can at least one segment of SKB be sent right now, according to the


* Re: [PATCH net-next] tcp: remove a bogus TSO split
  2013-12-12 19:28 [PATCH net-next] tcp: remove a bogus TSO split Eric Dumazet
@ 2013-12-13 14:15 ` Neal Cardwell
  2013-12-13 14:54   ` Eric Dumazet
  2013-12-13 16:58   ` David Laight
  2013-12-13 18:13 ` [PATCH v2 " Eric Dumazet
  1 sibling, 2 replies; 14+ messages in thread
From: Neal Cardwell @ 2013-12-13 14:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Van Jacobson

On Thu, Dec 12, 2013 at 2:28 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While investigating performance problems on small RPC workloads,
> I noticed the Linux TCP stack was always splitting the last TSO skb
> into two parts (skbs): one a multiple of MSS, and a small one
> carrying the Push flag. This split is done even if TCP_NODELAY is set.
>
> Example with request/response of 4K/4K
>
> IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
> IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
> IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
> IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
>
> We think this is not needed.
>
> All the Nagle/window tests are done at this point.
>
> This patch tremendously improves performance, as the traffic now looks
> like :
>
> IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
> IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
> IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
> IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
> IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
> IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
> IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
> IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
> IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
> IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
> IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
> IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
>
>
> Before :
>
> lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
> 280774
>
>  Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
>
>      205719.049006 task-clock                #    9.278 CPUs utilized
>          8,449,968 context-switches          #    0.041 M/sec
>          1,935,997 CPU-migrations            #    0.009 M/sec
>            160,541 page-faults               #    0.780 K/sec
>    548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
>    455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
>    272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
>    166,091,460,030 instructions              #    0.30  insns per cycle
>                                              #    2.74  stalled cycles per insn [83.39%]
>     29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
>      1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]
>
>       22.173517844 seconds time elapsed
>
> lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
> IpOutRequests                   16851063           0.0
> IpExtOutOctets                  23878580777        0.0
>
> After patch :
>
> lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
> 280877
>
>  Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
>
>      107496.071918 task-clock                #    4.847 CPUs utilized
>          5,635,458 context-switches          #    0.052 M/sec
>          1,374,707 CPU-migrations            #    0.013 M/sec
>            160,920 page-faults               #    0.001 M/sec
>    281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
>    228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
>    142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
>     95,227,712,566 instructions              #    0.34  insns per cycle
>                                              #    2.40  stalled cycles per insn [83.43%]
>     16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
>        874,252,952 branch-misses             #    5.39% of all branches         [83.37%]
>
>       22.175821286 seconds time elapsed
>
> lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
> IpOutRequests                   11239428           0.0
> IpExtOutOctets                  23595191035        0.0
>
> Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
> 2099 bytes instead of 1417, thus helping GRO to be more efficient when using
> the FQ packet scheduler.
>
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Nandita Dukkipati <nanditad@google.com>
> Cc: Van Jacobson <vanj@google.com>
> ---
>  net/ipv4/tcp_output.c |    5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 2a69f42e51ca..335e110e86ba 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1410,10 +1410,7 @@ static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_b
>
>         needed = min(skb->len, window);
>
> -       if (max_len <= needed)
> -               return max_len;
> -
> -       return needed - needed % mss_now;
> +       return min(max_len, needed);
>  }
>
>  /* Can at least one segment of SKB be sent right now, according to the
>

Seems like a nice improvement, but if we apply this patch then AFAICT
to get the Nagle-enabled case right we also have to update
tcp_minshall_update() to notice these new non-MSS-aligned segments
going out, and count those as non-full-size segments for the
minshall-nagle check (to ensure we have no more than one outstanding
un-ACKed sub-MSS packet). Maybe something like (please excuse the
formatting):

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70e55d2..a2ec237 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -980,7 +980,8 @@ bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
 static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
                                       const struct sk_buff *skb)
 {
-       if (skb->len < mss)
+       if (skb->len < mss ||
+           tcp_skb_pcount(skb) * tcp_skb_mss(skb) > skb->len)
                tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
 }
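
The idea is that tcp_skb_pcount() rounds skb->len up to whole segments, so
pcount * mss overshoots skb->len exactly when the last segment is partial.
A userspace sketch of that test (illustrative only, assuming pcount is
len/mss rounded up):

#include <assert.h>
#include <stdio.h>

static unsigned int pcount(unsigned int len, unsigned int mss)
{
        return (len + mss - 1) / mss;  /* segments, rounded up, as for a TSO skb */
}

int main(void)
{
        unsigned int mss = 1460;

        /* 2000 bytes -> 2 segments (1460 + 540): partial last segment */
        assert(pcount(2000, mss) * mss > 2000);
        /* 2920 bytes -> exactly 2 full segments: test stays false */
        assert(!(pcount(2920, mss) * mss > 2920));
        printf("sub-MSS tail detected only when present\n");
        return 0;
}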

neal


* Re: [PATCH net-next] tcp: remove a bogus TSO split
  2013-12-13 14:15 ` Neal Cardwell
@ 2013-12-13 14:54   ` Eric Dumazet
  2013-12-13 16:22     ` Neal Cardwell
  2013-12-13 16:58   ` David Laight
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2013-12-13 14:54 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Van Jacobson

On Fri, 2013-12-13 at 09:15 -0500, Neal Cardwell wrote:

> Seems like a nice improvement, but if we apply this patch then AFAICT
> to get the Nagle-enabled case right we also have to update
> tcp_minshall_update() to notice these new non-MSS-aligned segments
> going out, and count those as non-full-size segments for the
> minshall-nagle check (to ensure we have no more than one outstanding
> un-ACKed sub-MSS packet). Maybe something like (please excuse the
> formatting):
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 70e55d2..a2ec237 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -980,7 +980,8 @@ bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
>  static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
>                                        const struct sk_buff *skb)
>  {
> -       if (skb->len < mss)
> +       if (skb->len < mss ||
> +           tcp_skb_pcount(skb) * tcp_skb_mss(skb) > skb->len)
>                 tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
>  }

Very good point Neal, but don't you think tcp_skb_mss(skb) is equal to
mss at this point? (We normally have synced with tso_segs =
tcp_init_tso_segs(sk, skb, mss_now);)

(Just trying to make this code more understandable...)

Also I think we should move this helper out of include/net/tcp.h, since we
only use it from tcp_output.c.

I'll submit a v2, rewording the comment in front of
tcp_mss_split_point()

Thanks


* Re: [PATCH net-next] tcp: remove a bogus TSO split
  2013-12-13 14:54   ` Eric Dumazet
@ 2013-12-13 16:22     ` Neal Cardwell
  2013-12-13 17:54       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Neal Cardwell @ 2013-12-13 16:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Van Jacobson

On Fri, Dec 13, 2013 at 9:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-12-13 at 09:15 -0500, Neal Cardwell wrote:
>
>> Seems like a nice improvement, but if we apply this patch then AFAICT
>> to get the Nagle-enabled case right we also have to update
>> tcp_minshall_update() to notice these new non-MSS-aligned segments
>> going out, and count those as non-full-size segments for the
>> minshall-nagle check (to ensure we have no more than one outstanding
>> un-ACKed sub-MSS packet). Maybe something like (please excuse the
>> formatting):
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index 70e55d2..a2ec237 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -980,7 +980,8 @@ bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
>>  static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
>>                                        const struct sk_buff *skb)
>>  {
>> -       if (skb->len < mss)
>> +       if (skb->len < mss ||
>> +           tcp_skb_pcount(skb) * tcp_skb_mss(skb) > skb->len)
>>                 tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
>>  }
>
> Very good point Neal, but don't you think tcp_skb_mss(skb) is equal to
> mss at this point? (We normally have synced with tso_segs =
> tcp_init_tso_segs(sk, skb, mss_now);)
>
> (Just trying to make this code more understandable...)
>
> Also I think we should move this helper out of include/net/tcp.h, since we
> only use it from tcp_output.c.
>
> I'll submit a v2, rewording the comment in front of
> tcp_mss_split_point()

Yes, I like your ideas to use mss_now instead, move
tcp_minshall_update() to tcp_output.c (next to tcp_minshall_check()?),
and update the comment in front of tcp_mss_split_point().

And given that mss_now is more sane than tcp_skb_mss(skb) (which is
zero for one-MSS skbs) I think maybe we can make it something like:

 static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
                                       const struct sk_buff *skb)
 {
         if (skb->len < tcp_skb_pcount(skb) * mss_now)
                tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
 }
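
With pcount rounded up, this form is equivalent to testing
(skb->len % mss_now) != 0 while avoiding a second divide; a quick
userspace check of that equivalence (illustrative only):

#include <assert.h>
#include <stdio.h>

int main(void)
{
        unsigned int mss = 1460, len;

        for (len = 1; len <= 10 * mss; len++) {
                unsigned int pcount = (len + mss - 1) / mss;

                assert((len < pcount * mss) == (len % mss != 0));
        }
        printf("the two forms agree for all tested lengths\n");
        return 0;
}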

neal


* RE: [PATCH net-next] tcp: remove a bogus TSO split
  2013-12-13 14:15 ` Neal Cardwell
  2013-12-13 14:54   ` Eric Dumazet
@ 2013-12-13 16:58   ` David Laight
  2013-12-13 17:56     ` Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: David Laight @ 2013-12-13 16:58 UTC (permalink / raw)
  To: Neal Cardwell, Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Van Jacobson

> From: Neal Cardwell
...
> Seems like a nice improvement, but if we apply this patch then AFAICT
> to get the Nagle-enabled case right we also have to update
> tcp_minshall_update() to notice these new non-MSS-aligned segments
> going out, and count those as non-full-size segments for the
> minshall-nagle check (to ensure we have no more than one outstanding
> un-ACKed sub-MSS packet). Maybe something like (please excuse the
> formatting):

This sort of begs the question about how Nagle should work.
IIRC Nagle just suppresses short segments when there is unacked data? [1]

If you have sent a TSO packet then nagle will always be 'waiting for an ack',
so it should only send full segments. What is questionable is whether you should
send the final short segment, or buffer it waiting for further data from
the application to fill the segment (or an ack from the remote system).

If you split the data (as I think the code used to) then presumably
with nagle the final short segment won't actually be sent (until timeout
or an ack is received). So the transmitted segments are likely to all
be full.

OTOH with the change you'll send a partial segment.
If this only happens when the tx socket buffer (etc) is empty, it is probably
an improvement!
Run vi in a large window and page forwards: the data displayed is larger
than a segment, so you have to wait for the Nagle timeout before the entire
screen is updated.
Since the data is a single write() it would be a single TSO send - and
you want it all to get sent.

	David



[1] So that single characters typed into rlogin get sent together when
the remote system finally finishes processing the previous one(s).
While ftp can still send bulk data without waiting for responses.


* Re: [PATCH net-next] tcp: remove a bogus TSO split
  2013-12-13 16:22     ` Neal Cardwell
@ 2013-12-13 17:54       ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2013-12-13 17:54 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati

On Fri, 2013-12-13 at 11:22 -0500, Neal Cardwell wrote:

> Yes, I like your ideas to use mss_now instead, move
> tcp_minshall_update() to tcp_output.c (next to tcp_minshall_check()?),
> and update the comment in front of tcp_mss_split_point().
> 
> And given that mss_now is more sane than tcp_skb_mss(skb) (which is
> zero for one-MSS skbs) I think maybe we can make it something like:
> 
>  static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
>                                        const struct sk_buff *skb)
>  {
>          if (skb->len < tcp_skb_pcount(skb) * mss_now)
>                 tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
>  }

This was indeed what I got ;)


* RE: [PATCH net-next] tcp: remove a bogus TSO split
  2013-12-13 16:58   ` David Laight
@ 2013-12-13 17:56     ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2013-12-13 17:56 UTC (permalink / raw)
  To: David Laight
  Cc: Neal Cardwell, David Miller, netdev, Yuchung Cheng,
	Nandita Dukkipati, Van Jacobson

On Fri, 2013-12-13 at 16:58 +0000, David Laight wrote:
> > From: Neal Cardwell
> ...
> > Seems like a nice improvement, but if we apply this patch then AFAICT
> > to get the Nagle-enabled case right we also have to update
> > tcp_minshall_update() to notice these new non-MSS-aligned segments
> > going out, and count those as non-full-size segments for the
> > minshall-nagle check (to ensure we have no more than one outstanding
> > un-ACKed sub-MSS packet). Maybe something like (please excuse the
> > formatting):
> 
> This sort of begs the question about how Nagle should work.
> IIRC Nagle just suppresses short segments when there is unacked data? [1]
> 
> If you have sent a TSO packet then nagle will always be 'waiting for an ack',
> so it should only send full segments. What is questionable is whether you should
> send the final short segment, or buffer it waiting for further data from
> the application to fill the segment (or an ack from the remote system).

The point is that the current code does a grouped send of the two skbs,
without any wait, or even a release of the socket lock that would let
tcp_sendmsg() aggregate/coalesce more data into the last (partial) segment.

The Nagle test is properly done _before_ the call to tcp_mss_split_point().

I think the current behavior of tcp_mss_split_point() is a leftover from
the old days, when David Miller (and others) added TSO support to the
stack.


* [PATCH v2 net-next] tcp: remove a bogus TSO split
  2013-12-12 19:28 [PATCH net-next] tcp: remove a bogus TSO split Eric Dumazet
  2013-12-13 14:15 ` Neal Cardwell
@ 2013-12-13 18:13 ` Eric Dumazet
  2013-12-13 18:17   ` Neal Cardwell
  2013-12-13 21:51   ` [PATCH v3 net-next] tcp: refine TSO splits Eric Dumazet
  1 sibling, 2 replies; 14+ messages in thread
From: Eric Dumazet @ 2013-12-13 18:13 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Yuchung Cheng, Neal Cardwell, Nandita Dukkipati, Van Jacobson

From: Eric Dumazet <edumazet@google.com>

While investigating performance problems on small RPC workloads,
I noticed the Linux TCP stack was always splitting the last TSO skb
into two parts (skbs): one a multiple of MSS, and a small one
carrying the Push flag. This split is done even if TCP_NODELAY is set.

Example with request/response of 4K/4K

IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>

We think this is not needed; it is probably a leftover of the initial TSO
support. The TCP stack had issues with mss changes, but those have since
been fixed.

All the Nagle/window tests are done at this point.

Note : If some NIC has trouble sending TSO packets with a partial
last segment, we will find out and add a device feature or something.

tcp_minshall_update() is moved to tcp_output.c and is updated; thanks Neal!

This patch tremendously improves performance, as the traffic now looks
like :

IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>


Before :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     205719.049006 task-clock                #    9.278 CPUs utilized          
         8,449,968 context-switches          #    0.041 M/sec                  
         1,935,997 CPU-migrations            #    0.009 M/sec                  
           160,541 page-faults               #    0.780 K/sec                  
   548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
   455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
   272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
   166,091,460,030 instructions              #    0.30  insns per cycle        
                                             #    2.74  stalled cycles per insn [83.39%]
    29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
     1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]

      22.173517844 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   16851063           0.0
IpExtOutOctets                  23878580777        0.0

After patch :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     107496.071918 task-clock                #    4.847 CPUs utilized          
         5,635,458 context-switches          #    0.052 M/sec                  
         1,374,707 CPU-migrations            #    0.013 M/sec                  
           160,920 page-faults               #    0.001 M/sec                  
   281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
   228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
   142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
    95,227,712,566 instructions              #    0.34  insns per cycle        
                                             #    2.40  stalled cycles per insn [83.43%]
    16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
       874,252,952 branch-misses             #    5.39% of all branches         [83.37%]

      22.175821286 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   11239428           0.0
IpExtOutOctets                  23595191035        0.0

Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 bytes instead of 1417, thus helping GRO to be more efficient when using
the FQ packet scheduler.


Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
---
v2: changed tcp_minshall_update() as Neal pointed out.

 include/net/tcp.h     |    7 ------
 net/ipv4/tcp_output.c |   41 +++++++++++++++++++++-------------------
 2 files changed, 22 insertions(+), 26 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f7e1ab2139ef..9cd62bc09055 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -978,13 +978,6 @@ static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
 }
 bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
 
-static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
-				       const struct sk_buff *skb)
-{
-	if (skb->len < mss)
-		tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
-}
-
 static inline void tcp_check_probe_timer(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2a69f42e51ca..2c5c0029d1a0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1384,20 +1384,11 @@ static void tcp_cwnd_validate(struct sock *sk)
 	}
 }
 
-/* Returns the portion of skb which can be sent right away without
- * introducing MSS oddities to segment boundaries. In rare cases where
- * mss_now != mss_cache, we will request caller to create a small skb
- * per input skb which could be mostly avoided here (if desired).
- *
- * We explicitly want to create a request for splitting write queue tail
- * to a small skb for Nagle purposes while avoiding unnecessary modulos,
- * thus all the complexity (cwnd_len is always MSS multiple which we
- * return whenever allowed by the other factors). Basically we need the
- * modulo only when the receiver window alone is the limiting factor or
- * when we would be allowed to send the split-due-to-Nagle skb fully.
- */
-static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_buff *skb,
-					unsigned int mss_now, unsigned int max_segs)
+/* Returns the portion of skb which can be sent right away */
+static unsigned int tcp_mss_split_point(const struct sock *sk,
+					const struct sk_buff *skb,
+					unsigned int mss_now,
+					unsigned int max_segs)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	u32 needed, window, max_len;
@@ -1410,10 +1401,7 @@ static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_b
 
 	needed = min(skb->len, window);
 
-	if (max_len <= needed)
-		return max_len;
-
-	return needed - needed % mss_now;
+	return min(max_len, needed);
 }
 
 /* Can at least one segment of SKB be sent right now, according to the
@@ -1454,12 +1442,27 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
 }
 
 /* Minshall's variant of the Nagle send check. */
-static inline bool tcp_minshall_check(const struct tcp_sock *tp)
+static bool tcp_minshall_check(const struct tcp_sock *tp)
 {
 	return after(tp->snd_sml, tp->snd_una) &&
 		!after(tp->snd_sml, tp->snd_nxt);
 }
 
+/* Update snd_sml if this skb is under mss
+ * Note that a TSO packet might end with a sub-mss segment
+ * The test is really :
+ * if ((skb->len % mss_now) != 0)
+ *        tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+ * But we can avoid doing the divide again given we already have
+ *  skb_pcount = skb->len / mss_now
+ */
+static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
+				const struct sk_buff *skb)
+{
+	if (skb->len < tcp_skb_pcount(skb) * mss_now)
+		tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+}
+
 /* Return false, if packet can be sent now without violation Nagle's rules:
  * 1. It is full sized.
  * 2. Or it contains FIN. (already checked by caller)


* Re: [PATCH v2 net-next] tcp: remove a bogus TSO split
  2013-12-13 18:13 ` [PATCH v2 " Eric Dumazet
@ 2013-12-13 18:17   ` Neal Cardwell
  2013-12-13 18:38     ` Eric Dumazet
  2013-12-13 21:51   ` [PATCH v3 net-next] tcp: refine TSO splits Eric Dumazet
  1 sibling, 1 reply; 14+ messages in thread
From: Neal Cardwell @ 2013-12-13 18:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Van Jacobson

On Fri, Dec 13, 2013 at 1:13 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While investigating performance problems on small RPC workloads,
> I noticed the Linux TCP stack was always splitting the last TSO skb
> into two parts (skbs): one a multiple of MSS, and a small one
> carrying the Push flag. This split is done even if TCP_NODELAY is set.
>
...
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Nandita Dukkipati <nanditad@google.com>
> Cc: Van Jacobson <vanj@google.com>
> ---
> v2: changed tcp_minshall_update() as Neal pointed out.

It occurred to me after staring at this code a little longer that
although the Nagle test is already done before the call to
tcp_mss_split_point(), the tcp_nagle_check() is only run if
tso_segs==1 and is only checking whether skb->len < mss_now, so the
Nagle code currently is implicitly assuming that if there is an skb
that is not an exact multiple of MSS, then the tcp_mss_split_point()
will chop off the odd bytes at the end and we'll loop back to the top
of the tcp_write_xmit() loop to make a Nagle decision on an skb that
has tso_segs==1.

So if we just make this improvement to tcp_mss_split_point() and the
adjustment to tcp_minshall_update() (as in patch v2), we will still
have broken Nagle behavior in a scenario like:

- suppose MSS=1460 and Nagle is enabled (TCP_NODELAY is not set)

- suppose there is an outstanding un-ACKed small packet, so that
tcp_minshall_check() returns true

- user writes 2000 bytes

- tso_segs is 2, so we do not call tcp_nagle_test()

- new version of tcp_mss_split_point() sends out the full 2000 bytes,
  leading to having 2 packets smaller than an MSS un-ACKed in the network,
  violating the invariant the minshall/nagle code is trying to maintain
  of having at most one such packet un-ACKed in the network.

Previously, tcp_mss_split_point() would have split off the 2000-1460
bytes into a new skb; we would have looped back, executed the
tcp_nagle_test() code, and decided not to send that small sub-MSS
2000-1460 packet.
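
Putting numbers on that scenario (a toy model of the two limits, not
kernel code):

#include <stdio.h>

/* MSS 1460, Nagle enabled, one small packet already un-ACKed,
 * user writes 2000 bytes; model the split-point limit only.
 */
int main(void)
{
        unsigned int mss = 1460, len = 2000;
        unsigned int old_limit = len - len % mss; /* 1460 */
        unsigned int new_limit = len;             /* 2000 with patch v2 */

        /* old: 1460 goes out; the 540-byte tail becomes its own skb with
         * tso_segs == 1, so tcp_nagle_test() gets a chance to hold it
         */
        printf("old: send %u now, %u held back by Nagle\n",
               old_limit, len - old_limit);

        /* v2: all 2000 go out as one TSO packet ending in a 540-byte
         * segment, so a second sub-MSS packet is un-ACKed in the network
         */
        printf("v2 : send %u now, Nagle never consulted\n", new_limit);
        return 0;
}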

neal


* Re: [PATCH v2 net-next] tcp: remove a bogus TSO split
  2013-12-13 18:17   ` Neal Cardwell
@ 2013-12-13 18:38     ` Eric Dumazet
  0 siblings, 0 replies; 14+ messages in thread
From: Eric Dumazet @ 2013-12-13 18:38 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Van Jacobson

On Fri, 2013-12-13 at 13:17 -0500, Neal Cardwell wrote:

> It occurred to me after staring at this code a little longer that
> although the Nagle test is already done before the call to
> tcp_mss_split_point(), the tcp_nagle_check() is only run if
> tso_segs==1 and is only checking whether skb->len < mss_now, so the
> Nagle code currently is implicitly assuming that if there is an skb
> that is not an exact multiple of MSS, then the tcp_mss_split_point()
> will chop off the odd bytes at the end and we'll loop back to the top
> of the tcp_write_xmit() loop to make a Nagle decision on an skb that
> has tso_segs==1.
> 
> So if we just make this improvement to tcp_mss_split_point() and the
> adjustment to tcp_minshall_update() (as in patch v2), we will still
> have broken Nagle behavior in a scenario like:
> 
> - suppose MSS=1460 and Nagle is enabled (TCP_NODELAY is not set)
> 
> - suppose there is an outstanding un-ACKed small packet, so that
> tcp_minshall_check() returns true
> 
> - user writes 2000 bytes
> 
> - tso_segs is 2, so we do not call tcp_nagle_test()
> 
> - new version of tcp_mss_split_point() sends out the full 2000 bytes,
>   leading to having 2 packets smaller than an MSS un-ACKed in the network,
>   violating the invariant the minshall/nagle code is trying to maintain
>   of having at most one such packet un-ACKed in the network.
> 
> Previously, tcp_mss_split_point() would have split off the 2000-1460
> bytes into a new skb; we would have looped back, executed the
> tcp_nagle_test() code, and decided not to send that small sub-MSS
> 2000-1460 packet.
> 

Hmm, it seems we need to refactor a bit.

tcp_snd_test() seems not to care about TSO being partial.


* [PATCH v3 net-next] tcp: refine TSO splits
  2013-12-13 18:13 ` [PATCH v2 " Eric Dumazet
  2013-12-13 18:17   ` Neal Cardwell
@ 2013-12-13 21:51   ` Eric Dumazet
  2013-12-14  1:59     ` Neal Cardwell
  1 sibling, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2013-12-13 21:51 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Yuchung Cheng, Neal Cardwell, Nandita Dukkipati, Van Jacobson

From: Eric Dumazet <edumazet@google.com>

While investigating performance problems on small RPC workloads,
I noticed the Linux TCP stack was always splitting the last TSO skb
into two parts (skbs): one a multiple of MSS, and a small one
carrying the Push flag. This split is done even if TCP_NODELAY is set,
or if no small packet is in flight.

Example with request/response of 4K/4K

IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>

We can avoid this split by including the Nagle tests in the right place.

Note : If some NIC had trouble sending TSO packets with a partial
last segment, we would have hit the problem in GRO/forwarding workloads already.

tcp_minshall_update() is moved to tcp_output.c and is updated as we might
feed a TSO packet with a partial last segment.
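
The resulting decision can be summarized by this userspace sketch
(illustrative only; the real code consults tp->snd_sml, packets_out
and the nonagle flags):

#include <stdio.h>

/* Toy model of the v3 tcp_mss_split_point() tail handling: keep the
 * partial last segment unless the Nagle/Minshall rules would hold it.
 */
static unsigned int split_v3(unsigned int needed, unsigned int mss_now,
                             int nagle_holds_partial)
{
        unsigned int partial = needed % mss_now;

        if (partial && nagle_holds_partial)
                return needed - partial; /* split at the last MSS boundary */
        return needed;                   /* send the whole skb */
}

int main(void)
{
        /* TCP_NODELAY, or all small packets ACKed: the full 4096 goes out */
        printf("nodelay: %u\n", split_v3(4096, 1448, 0));
        /* Nagle engaged with a small packet in flight: only 2896 goes out */
        printf("nagle  : %u\n", split_v3(4096, 1448, 1));
        return 0;
}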

This patch tremendously improves performance, as the traffic now looks
like :

IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>


Before :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280774

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     205719.049006 task-clock                #    9.278 CPUs utilized          
         8,449,968 context-switches          #    0.041 M/sec                  
         1,935,997 CPU-migrations            #    0.009 M/sec                  
           160,541 page-faults               #    0.780 K/sec                  
   548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
   455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
   272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
   166,091,460,030 instructions              #    0.30  insns per cycle        
                                             #    2.74  stalled cycles per insn [83.39%]
    29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
     1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]

      22.173517844 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   16851063           0.0
IpExtOutOctets                  23878580777        0.0

After patch :

lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
280877

 Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':

     107496.071918 task-clock                #    4.847 CPUs utilized          
         5,635,458 context-switches          #    0.052 M/sec                  
         1,374,707 CPU-migrations            #    0.013 M/sec                  
           160,920 page-faults               #    0.001 M/sec                  
   281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
   228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
   142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
    95,227,712,566 instructions              #    0.34  insns per cycle        
                                             #    2.40  stalled cycles per insn [83.43%]
    16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
       874,252,952 branch-misses             #    5.39% of all branches         [83.37%]

      22.175821286 seconds time elapsed

lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
IpOutRequests                   11239428           0.0
IpExtOutOctets                  23595191035        0.0

Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
2099 bytes instead of 1417, thus helping GRO to be more efficient when using
the FQ packet scheduler.

Many thanks to Neal for review and ideas.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Van Jacobson <vanj@google.com>
---
v3: Add the Nagle tests to make sure we can send the whole skb
    according to Nagle rules. Factorize the code.

v2: changed tcp_minshall_update() as Neal pointed out.

 include/net/tcp.h     |    7 ----
 net/ipv4/tcp_output.c |   65 +++++++++++++++++++++++++---------------
 2 files changed, 42 insertions(+), 30 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index f7e1ab2139ef..9cd62bc09055 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -978,13 +978,6 @@ static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
 }
 bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
 
-static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
-				       const struct sk_buff *skb)
-{
-	if (skb->len < mss)
-		tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
-}
-
 static inline void tcp_check_probe_timer(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2a69f42e51ca..9e7aec7ee67e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1384,23 +1384,51 @@ static void tcp_cwnd_validate(struct sock *sk)
 	}
 }
 
-/* Returns the portion of skb which can be sent right away without
- * introducing MSS oddities to segment boundaries. In rare cases where
- * mss_now != mss_cache, we will request caller to create a small skb
- * per input skb which could be mostly avoided here (if desired).
- *
- * We explicitly want to create a request for splitting write queue tail
- * to a small skb for Nagle purposes while avoiding unnecessary modulos,
- * thus all the complexity (cwnd_len is always MSS multiple which we
- * return whenever allowed by the other factors). Basically we need the
- * modulo only when the receiver window alone is the limiting factor or
- * when we would be allowed to send the split-due-to-Nagle skb fully.
+/* Minshall's variant of the Nagle send check. */
+static bool tcp_minshall_check(const struct tcp_sock *tp)
+{
+	return after(tp->snd_sml, tp->snd_una) &&
+		!after(tp->snd_sml, tp->snd_nxt);
+}
+
+/* Update snd_sml if this skb is under mss
+ * Note that a TSO packet might end with a sub-mss segment
+ * The test is really :
+ * if ((skb->len % mss) != 0)
+ *        tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+ * But we can avoid doing the divide again given we already have
+ *  skb_pcount = skb->len / mss_now
  */
-static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_buff *skb,
-					unsigned int mss_now, unsigned int max_segs)
+static void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss_now,
+				const struct sk_buff *skb)
+{
+	if (skb->len < tcp_skb_pcount(skb) * mss_now)
+		tp->snd_sml = TCP_SKB_CB(skb)->end_seq;
+}
+
+/* Return false, if packet can be sent now without violation Nagle's rules:
+ * 1. It is full sized. (provided by caller in %partial bool)
+ * 2. Or it contains FIN. (already checked by caller)
+ * 3. Or TCP_CORK is not set, and TCP_NODELAY is set.
+ * 4. Or TCP_CORK is not set, and all sent packets are ACKed.
+ *    With Minshall's modification: all sent small packets are ACKed.
+ */
+static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
+			    unsigned int mss_now, int nonagle)
+{
+	return partial &&
+		((nonagle & TCP_NAGLE_CORK) ||
+		 (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
+}
+/* Returns the portion of skb which can be sent right away */
+static unsigned int tcp_mss_split_point(const struct sock *sk,
+					const struct sk_buff *skb,
+					unsigned int mss_now,
+					unsigned int max_segs,
+					int nonagle)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
-	u32 needed, window, max_len;
+	u32 partial, needed, window, max_len;
 
 	window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq;
 	max_len = mss_now * max_segs;
@@ -1413,7 +1441,15 @@ static unsigned int tcp_mss_split_point(const struct sock *sk, const struct sk_b
 	if (max_len <= needed)
 		return max_len;
 
-	return needed - needed % mss_now;
+	partial = needed % mss_now;
+	/* If last segment is not a full MSS, check if Nagle rules allow us
+	 * to include this last segment in this skb.
+	 * Otherwise, we'll split the skb at last MSS boundary
+	 */
+	if (tcp_nagle_check(partial != 0, tp, mss_now, nonagle))
+		return needed - partial;
+
+	return needed;
 }
 
 /* Can at least one segment of SKB be sent right now, according to the
@@ -1453,28 +1489,6 @@ static int tcp_init_tso_segs(const struct sock *sk, struct sk_buff *skb,
 	return tso_segs;
 }
 
-/* Minshall's variant of the Nagle send check. */
-static inline bool tcp_minshall_check(const struct tcp_sock *tp)
-{
-	return after(tp->snd_sml, tp->snd_una) &&
-		!after(tp->snd_sml, tp->snd_nxt);
-}
-
-/* Return false, if packet can be sent now without violation Nagle's rules:
- * 1. It is full sized.
- * 2. Or it contains FIN. (already checked by caller)
- * 3. Or TCP_CORK is not set, and TCP_NODELAY is set.
- * 4. Or TCP_CORK is not set, and all sent packets are ACKed.
- *    With Minshall's modification: all sent small packets are ACKed.
- */
-static inline bool tcp_nagle_check(const struct tcp_sock *tp,
-				  const struct sk_buff *skb,
-				  unsigned int mss_now, int nonagle)
-{
-	return skb->len < mss_now &&
-		((nonagle & TCP_NAGLE_CORK) ||
-		 (!nonagle && tp->packets_out && tcp_minshall_check(tp)));
-}
 
 /* Return true if the Nagle test allows this packet to be
  * sent now.
@@ -1495,7 +1509,7 @@ static inline bool tcp_nagle_test(const struct tcp_sock *tp, const struct sk_buf
 	if (tcp_urg_mode(tp) || (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
 		return true;
 
-	if (!tcp_nagle_check(tp, skb, cur_mss, nonagle))
+	if (!tcp_nagle_check(skb->len < cur_mss, tp, cur_mss, nonagle))
 		return true;
 
 	return false;
@@ -1898,7 +1912,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 			limit = tcp_mss_split_point(sk, skb, mss_now,
 						    min_t(unsigned int,
 							  cwnd_quota,
-							  sk->sk_gso_max_segs));
+							  sk->sk_gso_max_segs),
+						    nonagle);
 
 		if (skb->len > limit &&
 		    unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))


* Re: [PATCH v3 net-next] tcp: refine TSO splits
  2013-12-13 21:51   ` [PATCH v3 net-next] tcp: refine TSO splits Eric Dumazet
@ 2013-12-14  1:59     ` Neal Cardwell
  2013-12-14  2:05       ` Eric Dumazet
  0 siblings, 1 reply; 14+ messages in thread
From: Neal Cardwell @ 2013-12-14  1:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Van Jacobson

On Fri, Dec 13, 2013 at 4:51 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> While investigating performance problems on small RPC workloads,
> I noticed the Linux TCP stack was always splitting the last TSO skb
> into two parts (skbs): one a multiple of MSS, and a small one
> carrying the Push flag. This split is done even if TCP_NODELAY is set,
> or if no small packet is in flight.
>
> Example with request/response of 4K/4K
>
> IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
> IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
> IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
> IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
> IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
>
> We can avoid this split by including the Nagle tests in the right place.
>
> Note : If some NIC had trouble sending TSO packets with a partial
> last segment, we would have hit the problem in GRO/forwarding workloads already.
>
> tcp_minshall_update() is moved to tcp_output.c and is updated as we might
> feed a TSO packet with a partial last segment.
>
> This patch tremendously improves performance, as the traffic now looks
> like :
>
> IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
> IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
> IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
> IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
> IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
> IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
> IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
> IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
> IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
> IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
> IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
> IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
>
>
> Before :
>
> lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
> 280774
>
>  Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
>
>      205719.049006 task-clock                #    9.278 CPUs utilized
>          8,449,968 context-switches          #    0.041 M/sec
>          1,935,997 CPU-migrations            #    0.009 M/sec
>            160,541 page-faults               #    0.780 K/sec
>    548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
>    455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
>    272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
>    166,091,460,030 instructions              #    0.30  insns per cycle
>                                              #    2.74  stalled cycles per insn [83.39%]
>     29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
>      1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]
>
>       22.173517844 seconds time elapsed
>
> lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
> IpOutRequests                   16851063           0.0
> IpExtOutOctets                  23878580777        0.0
>
> After patch :
>
> lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
> 280877
>
>  Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
>
>      107496.071918 task-clock                #    4.847 CPUs utilized
>          5,635,458 context-switches          #    0.052 M/sec
>          1,374,707 CPU-migrations            #    0.013 M/sec
>            160,920 page-faults               #    0.001 M/sec
>    281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
>    228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
>    142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
>     95,227,712,566 instructions              #    0.34  insns per cycle
>                                              #    2.40  stalled cycles per insn [83.43%]
>     16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
>        874,252,952 branch-misses             #    5.39% of all branches         [83.37%]
>
>       22.175821286 seconds time elapsed
>
> lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
> IpOutRequests                   11239428           0.0
> IpExtOutOctets                  23595191035        0.0
>
> Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is higher :
> 2099 bytes instead of 1417, thus helping GRO to be more efficient when using
> the FQ packet scheduler.
>
> Many thanks to Neal for review and ideas.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Nandita Dukkipati <nanditad@google.com>
> Cc: Van Jacobson <vanj@google.com>
> ---
> v3: Add the Nagle tests to make sure we can send the whole skb
>     according to Nagle rules. Factorize the code.
>
> v2: changed tcp_minshall_update() as Neal pointed out.
>
>  include/net/tcp.h     |    7 ----
>  net/ipv4/tcp_output.c |   65 +++++++++++++++++++++++++---------------
>  2 files changed, 42 insertions(+), 30 deletions(-)

I love it! Thanks, Eric.

Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>


neal


* Re: [PATCH v3 net-next] tcp: refine TSO splits
  2013-12-14  1:59     ` Neal Cardwell
@ 2013-12-14  2:05       ` Eric Dumazet
  2013-12-17 20:15         ` David Miller
  0 siblings, 1 reply; 14+ messages in thread
From: Eric Dumazet @ 2013-12-14  2:05 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Yuchung Cheng, Nandita Dukkipati, Van Jacobson

On Fri, 2013-12-13 at 20:59 -0500, Neal Cardwell wrote:

> I love it! Thanks, Eric.
> 
> Acked-by: Neal Cardwell <ncardwell@google.com>
> Tested-by: Neal Cardwell <ncardwell@google.com>

Thanks a lot, Neal!

Yes I think we found a winner here ;)


* Re: [PATCH v3 net-next] tcp: refine TSO splits
  2013-12-14  2:05       ` Eric Dumazet
@ 2013-12-17 20:15         ` David Miller
  0 siblings, 0 replies; 14+ messages in thread
From: David Miller @ 2013-12-17 20:15 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ncardwell, netdev, ycheng, nanditad, vanj

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 13 Dec 2013 18:05:41 -0800

> On Fri, 2013-12-13 at 20:59 -0500, Neal Cardwell wrote:
> 
>> I love it! Thanks, Eric.
>> 
>> Acked-by: Neal Cardwell <ncardwell@google.com>
>> Tested-by: Neal Cardwell <ncardwell@google.com>
> 
> Thanks a lot, Neal!
> 
> Yes I think we found a winner here ;)

Applied, thanks guys :-)


Thread overview: 14+ messages
2013-12-12 19:28 [PATCH net-next] tcp: remove a bogus TSO split Eric Dumazet
2013-12-13 14:15 ` Neal Cardwell
2013-12-13 14:54   ` Eric Dumazet
2013-12-13 16:22     ` Neal Cardwell
2013-12-13 17:54       ` Eric Dumazet
2013-12-13 16:58   ` David Laight
2013-12-13 17:56     ` Eric Dumazet
2013-12-13 18:13 ` [PATCH v2 " Eric Dumazet
2013-12-13 18:17   ` Neal Cardwell
2013-12-13 18:38     ` Eric Dumazet
2013-12-13 21:51   ` [PATCH v3 net-next] tcp: refine TSO splits Eric Dumazet
2013-12-14  1:59     ` Neal Cardwell
2013-12-14  2:05       ` Eric Dumazet
2013-12-17 20:15         ` David Miller
