netdev.vger.kernel.org archive mirror
* [PATCH net-next] tcp: reduce out_of_order memory use
@ 2012-03-18 13:37 Eric Dumazet
  2012-03-18 16:55 ` Neal Cardwell
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2012-03-18 13:37 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Tom Herbert, Neal Cardwell, Ilpo Järvinen,
	H.K. Jerry Chu, Yuchung Cheng

With increasing receive window sizes, but the speed of light not
improving that much, the out-of-order queue can contain a huge number
of skbs, waiting to be moved to the receive queue once missing packets
fill the holes.

Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
sk_buff)) to store regular (MTU <= 1500) frames. This makes it highly
probable that sk_rmem_alloc hits the sk_rcvbuf limit, which can be
4 Mbytes in many cases.
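
(As a rough illustration, with numbers that are mine rather than from
the original mail: assuming sizeof(struct sk_buff) is on the order of
250 bytes, each 1448-byte payload is charged roughly 4350 bytes of
truesize, so a 4 Mbyte sk_rcvbuf fills after about 4194304 / 4350 ~=
960 out-of-order skbs, i.e. only about 1.4 Mbytes of actual payload,
a third of the nominal buffer.)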

When the limit is hit, the tcp stack calls tcp_collapse_ofo_queue(),
a true latency killer and cpu cache blower.

Attempting the coalescing each time we add a frame to the ofo queue
keeps memory use tight and in many cases avoids the tcp_collapse()
work later.

Tested on various wireless setups (b43, ath9k, ...) known to use big
skb truesize, this patch removed the "packets collapsed in receive
queue due to low socket buffer" counts I was seeing before.

This also reduced the average memory used by tcp sockets.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 include/linux/snmp.h |    1 +
 net/ipv4/proc.c      |    1 +
 net/ipv4/tcp_input.c |   19 +++++++++++++++++--
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index 8ee8af4..2e68f5b 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -233,6 +233,7 @@ enum
 	LINUX_MIB_TCPREQQFULLDOCOOKIES,		/* TCPReqQFullDoCookies */
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
+	LINUX_MIB_TCPRCVCOALESCE,			/* TCPRcvCoalesce */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 02d6107..8af0d44 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -257,6 +257,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPReqQFullDoCookies", LINUX_MIB_TCPREQQFULLDOCOOKIES),
 	SNMP_MIB_ITEM("TCPReqQFullDrop", LINUX_MIB_TCPREQQFULLDROP),
 	SNMP_MIB_ITEM("TCPRetransFail", LINUX_MIB_TCPRETRANSFAIL),
+	SNMP_MIB_ITEM("TCPRcvCoalesce", LINUX_MIB_TCPRCVCOALESCE),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 68d4057..51fe686 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4590,8 +4590,23 @@ drop:
 		u32 end_seq = TCP_SKB_CB(skb)->end_seq;
 
 		if (seq == TCP_SKB_CB(skb1)->end_seq) {
-			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
-
+			/* Packets in ofo can stay in queue a long time.
+			 * Better try to coalesce them right now
+			 * to avoid future tcp_collapse_ofo_queue(),
+			 * probably the most expensive function in tcp stack.
+			 */
+			if (skb->len <= skb_tailroom(skb1) && !th->fin) {
+				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
+				BUG_ON(skb_copy_bits(skb, 0,
+						     skb_put(skb1, skb->len),
+						     skb->len));
+				TCP_SKB_CB(skb1)->end_seq = end_seq;
+				TCP_SKB_CB(skb1)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+				__kfree_skb(skb);
+				skb = NULL;
+			} else {
+				__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+			}
 			if (!tp->rx_opt.num_sacks ||
 			    tp->selective_acks[0].end_seq != seq)
 				goto add_sack;


* Re: [PATCH net-next] tcp: reduce out_of_order memory use
  2012-03-18 13:37 [PATCH net-next] tcp: reduce out_of_order memory use Eric Dumazet
@ 2012-03-18 16:55 ` Neal Cardwell
  2012-03-18 19:40   ` Eric Dumazet
                     ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Neal Cardwell @ 2012-03-18 16:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tom Herbert, Ilpo Järvinen,
	H.K. Jerry Chu, Yuchung Cheng

On Sun, Mar 18, 2012 at 9:37 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> With increasing receive window sizes, but the speed of light not
> improving that much, the out-of-order queue can contain a huge number
> of skbs, waiting to be moved to the receive queue once missing packets
> fill the holes.
>
> Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
> sk_buff)) to store regular (MTU <= 1500) frames. This makes it highly
> probable that sk_rmem_alloc hits the sk_rcvbuf limit, which can be
> 4 Mbytes in many cases.
>
> When the limit is hit, the tcp stack calls tcp_collapse_ofo_queue(),
> a true latency killer and cpu cache blower.
>
> Attempting the coalescing each time we add a frame to the ofo queue
> keeps memory use tight and in many cases avoids the tcp_collapse()
> work later.
>
> Tested on various wireless setups (b43, ath9k, ...) known to use big
> skb truesize, this patch removed the "packets collapsed in receive
> queue due to low socket buffer" counts I was seeing before.
>
> This also reduced the average memory used by tcp sockets.


At this point in tcp_data_queue() we have already called
skb_set_owner_r() to charge the full truesize to the socket.  So
probably we need to adjust rmem accounting variables to account for
the fact that the memory corresponding to the overhead of the freed
skb is now gone?

The patch as written seems to coalesce only when the incoming skb
fits immediately after the tail skb in the ofo queue. If we're going
to add coalescing logic, it would be nice to also coalesce when the
skb falls immediately after some other skb in the ofo queue,
i.e. at some point after the "Find place to insert this segment"
loop, probably right before or after the "Do skb overlap to previous
one?" check. Perhaps the coalescing logic could be factored out into
its own little function, and called in either of the two places?
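
A minimal sketch of such a helper, lifted from the logic in the patch
above (the name and exact placement are hypothetical):

static bool tcp_ofo_try_coalesce(struct sock *sk, struct sk_buff *to,
				 struct sk_buff *from)
{
	/* Only coalesce when the new payload fits in the unused
	 * tailroom of 'to', and never merge a FIN segment.
	 */
	if (from->len > skb_tailroom(to) || tcp_hdr(from)->fin)
		return false;

	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOALESCE);
	/* skb_put() advances to->tail and returns the old tail pointer;
	 * skb_copy_bits() copies payload from linear data and frags.
	 */
	BUG_ON(skb_copy_bits(from, 0, skb_put(to, from->len), from->len));
	TCP_SKB_CB(to)->end_seq = TCP_SKB_CB(from)->end_seq;
	TCP_SKB_CB(to)->ack_seq = TCP_SKB_CB(from)->ack_seq;
	__kfree_skb(from);
	return true;
}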

(BTW, a few lines are longer than 80 characters.)

neal


* Re: [PATCH net-next] tcp: reduce out_of_order memory use
  2012-03-18 16:55 ` Neal Cardwell
@ 2012-03-18 19:40   ` Eric Dumazet
  2012-03-18 21:06   ` [PATCH v2 net-next 1/2] tcp: introduce tcp_data_queue_ofo Eric Dumazet
  2012-03-18 21:07   ` [PATCH v2 net-next 2/2] tcp: reduce out_of_order memory use Eric Dumazet
  2 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2012-03-18 19:40 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Tom Herbert, Ilpo Järvinen,
	H.K. Jerry Chu, Yuchung Cheng

On Sun, 2012-03-18 at 12:55 -0400, Neal Cardwell wrote:

> 
> At this point in tcp_data_queue() we have already called
> skb_set_owner_r() to charge the full truesize to the socket.  So
> probably we need to adjust rmem accounting variables to account for
> the fact that the memory corresponding to the overhead of the freed
> skb is now gone?

This is automatically done in the __kfree_skb(skb) call.
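
For reference, the accounting pair looks roughly like this (paraphrased
from the kernel sources of that era; details may differ):

static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
	skb_orphan(skb);
	skb->sk = sk;
	/* sock_rfree is the destructor that __kfree_skb() will run */
	skb->destructor = sock_rfree;
	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
	sk_mem_charge(sk, skb->truesize);
}

void sock_rfree(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;

	/* uncharge exactly what skb_set_owner_r() charged */
	atomic_sub(skb->truesize, &sk->sk_rmem_alloc);
	sk_mem_uncharge(sk, skb->truesize);
}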

I thought of avoiding the skb_set_owner_r() but it was simpler to
leave the code as is.

Thinking again, we might just defer it to the end of the function if
skb is not NULL, but that adds a new conditional, and a "return;" must
be changed to a "goto end;".

> 
> The patch as written seems to coalesce only when the incoming skb
> fits immediately after the tail skb in the ofo queue. If we're going
> to add coalescing logic, it would be nice to also coalesce when the
> skb falls immediately after some other skb in the ofo queue,

I tried this, but in my experiments that code path was almost never
taken.

Most of the time packets arrive in natural order (no OOO).

> i.e. at some point after the "Find place to insert this segment"
> loop, probably right before or after the "Do skb overlap to previous
> one?" check. Perhaps the coalescing logic could be factored out into
> its own little function, and called in either of the two places?
> 

Absolutely. I was already working on splitting tcp_data_queue() into
two functions, to reduce the indentation level by one tab and make it
more readable.

Yes, we can probably expand the coalescing logic to cope with OOO, so
that we not only coalesce this skb with the prior one in the queue,
but also with the next one if there is one.

> (BTW, a few lines are longer than 80 characters.)

This won't be the case once tcp_data_queue_ofo() is introduced.

I'll send a V2 with two patches.

1/2 tcp: introduce tcp_data_queue_ofo()
    (No change, only code movement to make smaller code units)

2/2 tcp: reduce out_of_order memory use
    (with the attempt to avoid the skb_set_owner_r()/__kfree_skb()
    pair, as you suggested)

Thanks !


* [PATCH v2 net-next 1/2] tcp: introduce tcp_data_queue_ofo
  2012-03-18 16:55 ` Neal Cardwell
  2012-03-18 19:40   ` Eric Dumazet
@ 2012-03-18 21:06   ` Eric Dumazet
  2012-03-19  3:49     ` Neal Cardwell
  2012-03-19 20:57     ` David Miller
  2012-03-18 21:07   ` [PATCH v2 net-next 2/2] tcp: reduce out_of_order memory use Eric Dumazet
  2 siblings, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2012-03-18 21:06 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Tom Herbert, Ilpo Järvinen,
	H.K. Jerry Chu, Yuchung Cheng

Split tcp_data_queue() into two parts for better readability.

tcp_data_queue_ofo() is responsible for queueing an incoming skb into
the out-of-order queue.

Change the code layout so that skb_set_owner_r() is performed only if
the skb is not dropped.

This is a preliminary patch for the following "reduce out_of_order
memory use" patch.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_input.c |  214 ++++++++++++++++++++++-------------------
 1 file changed, 115 insertions(+), 99 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 68d4057..fa7de12 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4446,6 +4446,120 @@ static inline int tcp_try_rmem_schedule(struct sock *sk, unsigned int size)
 	return 0;
 }
 
+static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb1;
+	u32 seq, end_seq;
+
+	TCP_ECN_check_ce(tp, skb);
+
+	if (tcp_try_rmem_schedule(sk, skb->truesize)) {
+		/* TODO: should increment a counter */
+		__kfree_skb(skb);
+		return;
+	}
+
+	/* Disable header prediction. */
+	tp->pred_flags = 0;
+	inet_csk_schedule_ack(sk);
+
+	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
+		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
+
+	skb1 = skb_peek_tail(&tp->out_of_order_queue);
+	if (!skb1) {
+		/* Initial out of order segment, build 1 SACK. */
+		if (tcp_is_sack(tp)) {
+			tp->rx_opt.num_sacks = 1;
+			tp->selective_acks[0].start_seq = TCP_SKB_CB(skb)->seq;
+			tp->selective_acks[0].end_seq =
+						TCP_SKB_CB(skb)->end_seq;
+		}
+		__skb_queue_head(&tp->out_of_order_queue, skb);
+		goto end;
+	}
+
+	seq = TCP_SKB_CB(skb)->seq;
+	end_seq = TCP_SKB_CB(skb)->end_seq;
+
+	if (seq == TCP_SKB_CB(skb1)->end_seq) {
+		__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+
+		if (!tp->rx_opt.num_sacks ||
+		    tp->selective_acks[0].end_seq != seq)
+			goto add_sack;
+
+		/* Common case: data arrive in order after hole. */
+		tp->selective_acks[0].end_seq = end_seq;
+		goto end;
+	}
+
+	/* Find place to insert this segment. */
+	while (1) {
+		if (!after(TCP_SKB_CB(skb1)->seq, seq))
+			break;
+		if (skb_queue_is_first(&tp->out_of_order_queue, skb1)) {
+			skb1 = NULL;
+			break;
+		}
+		skb1 = skb_queue_prev(&tp->out_of_order_queue, skb1);
+	}
+
+	/* Do skb overlap to previous one? */
+	if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+		if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
+			/* All the bits are present. Drop. */
+			__kfree_skb(skb);
+			skb = NULL;
+			tcp_dsack_set(sk, seq, end_seq);
+			goto add_sack;
+		}
+		if (after(seq, TCP_SKB_CB(skb1)->seq)) {
+			/* Partial overlap. */
+			tcp_dsack_set(sk, seq,
+				      TCP_SKB_CB(skb1)->end_seq);
+		} else {
+			if (skb_queue_is_first(&tp->out_of_order_queue,
+					       skb1))
+				skb1 = NULL;
+			else
+				skb1 = skb_queue_prev(
+					&tp->out_of_order_queue,
+					skb1);
+		}
+	}
+	if (!skb1)
+		__skb_queue_head(&tp->out_of_order_queue, skb);
+	else
+		__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+
+	/* And clean segments covered by new one as whole. */
+	while (!skb_queue_is_last(&tp->out_of_order_queue, skb)) {
+		skb1 = skb_queue_next(&tp->out_of_order_queue, skb);
+
+		if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
+			break;
+		if (before(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
+			tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
+					 end_seq);
+			break;
+		}
+		__skb_unlink(skb1, &tp->out_of_order_queue);
+		tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
+				 TCP_SKB_CB(skb1)->end_seq);
+		__kfree_skb(skb1);
+	}
+
+add_sack:
+	if (tcp_is_sack(tp))
+		tcp_sack_new_ofo_skb(sk, seq, end_seq);
+end:
+	if (skb)
+		skb_set_owner_r(skb, sk);
+}
+
+
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	const struct tcphdr *th = tcp_hdr(skb);
@@ -4561,105 +4675,7 @@ drop:
 		goto queue_and_out;
 	}
 
-	TCP_ECN_check_ce(tp, skb);
-
-	if (tcp_try_rmem_schedule(sk, skb->truesize))
-		goto drop;
-
-	/* Disable header prediction. */
-	tp->pred_flags = 0;
-	inet_csk_schedule_ack(sk);
-
-	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
-		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
-
-	skb_set_owner_r(skb, sk);
-
-	if (!skb_peek(&tp->out_of_order_queue)) {
-		/* Initial out of order segment, build 1 SACK. */
-		if (tcp_is_sack(tp)) {
-			tp->rx_opt.num_sacks = 1;
-			tp->selective_acks[0].start_seq = TCP_SKB_CB(skb)->seq;
-			tp->selective_acks[0].end_seq =
-						TCP_SKB_CB(skb)->end_seq;
-		}
-		__skb_queue_head(&tp->out_of_order_queue, skb);
-	} else {
-		struct sk_buff *skb1 = skb_peek_tail(&tp->out_of_order_queue);
-		u32 seq = TCP_SKB_CB(skb)->seq;
-		u32 end_seq = TCP_SKB_CB(skb)->end_seq;
-
-		if (seq == TCP_SKB_CB(skb1)->end_seq) {
-			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
-
-			if (!tp->rx_opt.num_sacks ||
-			    tp->selective_acks[0].end_seq != seq)
-				goto add_sack;
-
-			/* Common case: data arrive in order after hole. */
-			tp->selective_acks[0].end_seq = end_seq;
-			return;
-		}
-
-		/* Find place to insert this segment. */
-		while (1) {
-			if (!after(TCP_SKB_CB(skb1)->seq, seq))
-				break;
-			if (skb_queue_is_first(&tp->out_of_order_queue, skb1)) {
-				skb1 = NULL;
-				break;
-			}
-			skb1 = skb_queue_prev(&tp->out_of_order_queue, skb1);
-		}
-
-		/* Do skb overlap to previous one? */
-		if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-			if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-				/* All the bits are present. Drop. */
-				__kfree_skb(skb);
-				tcp_dsack_set(sk, seq, end_seq);
-				goto add_sack;
-			}
-			if (after(seq, TCP_SKB_CB(skb1)->seq)) {
-				/* Partial overlap. */
-				tcp_dsack_set(sk, seq,
-					      TCP_SKB_CB(skb1)->end_seq);
-			} else {
-				if (skb_queue_is_first(&tp->out_of_order_queue,
-						       skb1))
-					skb1 = NULL;
-				else
-					skb1 = skb_queue_prev(
-						&tp->out_of_order_queue,
-						skb1);
-			}
-		}
-		if (!skb1)
-			__skb_queue_head(&tp->out_of_order_queue, skb);
-		else
-			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
-
-		/* And clean segments covered by new one as whole. */
-		while (!skb_queue_is_last(&tp->out_of_order_queue, skb)) {
-			skb1 = skb_queue_next(&tp->out_of_order_queue, skb);
-
-			if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
-				break;
-			if (before(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-				tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
-						 end_seq);
-				break;
-			}
-			__skb_unlink(skb1, &tp->out_of_order_queue);
-			tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
-					 TCP_SKB_CB(skb1)->end_seq);
-			__kfree_skb(skb1);
-		}
-
-add_sack:
-		if (tcp_is_sack(tp))
-			tcp_sack_new_ofo_skb(sk, seq, end_seq);
-	}
+	tcp_data_queue_ofo(sk, skb);
 }
 
 static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,


* [PATCH v2 net-next 2/2] tcp: reduce out_of_order memory use
  2012-03-18 16:55 ` Neal Cardwell
  2012-03-18 19:40   ` Eric Dumazet
  2012-03-18 21:06   ` [PATCH v2 net-next 1/2] tcp: introduce tcp_data_queue_ofo Eric Dumazet
@ 2012-03-18 21:07   ` Eric Dumazet
  2012-03-19  3:53     ` Neal Cardwell
  2012-03-19 20:57     ` David Miller
  2 siblings, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2012-03-18 21:07 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: David Miller, netdev, Tom Herbert, Ilpo Järvinen,
	H.K. Jerry Chu, Yuchung Cheng

With increasing receive window sizes, but the speed of light not
improving that much, the out-of-order queue can contain a huge number
of skbs, waiting to be moved to the receive queue once missing packets
fill the holes.

Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
sk_buff)) to store regular (MTU <= 1500) frames. This makes it highly
probable that sk_rmem_alloc hits the sk_rcvbuf limit, which can be
4 Mbytes in many cases.

When the limit is hit, the tcp stack calls tcp_collapse_ofo_queue(),
a true latency killer and cpu cache blower.

Attempting the coalescing each time we add a frame to the ofo queue
keeps memory use tight and in many cases avoids the tcp_collapse()
work later.

Tested on various wireless setups (b43, ath9k, ...) known to use big
skb truesize, this patch removed the "packets collapsed in receive
queue due to low socket buffer" counts I was seeing before.

This also reduced the average memory used by tcp sockets.

With help from Neal Cardwell.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
V2: rebase after tcp_data_queue_ofo() introduction.

 include/linux/snmp.h |    1 +
 net/ipv4/proc.c      |    1 +
 net/ipv4/tcp_input.c |   19 ++++++++++++++++++-
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index 8ee8af4..2e68f5b 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -233,6 +233,7 @@ enum
 	LINUX_MIB_TCPREQQFULLDOCOOKIES,		/* TCPReqQFullDoCookies */
 	LINUX_MIB_TCPREQQFULLDROP,		/* TCPReqQFullDrop */
 	LINUX_MIB_TCPRETRANSFAIL,		/* TCPRetransFail */
+	LINUX_MIB_TCPRCVCOALESCE,			/* TCPRcvCoalesce */
 	__LINUX_MIB_MAX
 };
 
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 02d6107..8af0d44 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -257,6 +257,7 @@ static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPReqQFullDoCookies", LINUX_MIB_TCPREQQFULLDOCOOKIES),
 	SNMP_MIB_ITEM("TCPReqQFullDrop", LINUX_MIB_TCPREQQFULLDROP),
 	SNMP_MIB_ITEM("TCPRetransFail", LINUX_MIB_TCPRETRANSFAIL),
+	SNMP_MIB_ITEM("TCPRcvCoalesce", LINUX_MIB_TCPRCVCOALESCE),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fa7de12..e886e2f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4484,7 +4484,24 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	end_seq = TCP_SKB_CB(skb)->end_seq;
 
 	if (seq == TCP_SKB_CB(skb1)->end_seq) {
-		__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+		/* Packets in ofo can stay in queue a long time.
+		 * Better try to coalesce them right now
+		 * to avoid future tcp_collapse_ofo_queue(),
+		 * probably the most expensive function in tcp stack.
+		 */
+		if (skb->len <= skb_tailroom(skb1) && !tcp_hdr(skb)->fin) {
+			NET_INC_STATS_BH(sock_net(sk),
+					 LINUX_MIB_TCPRCVCOALESCE);
+			BUG_ON(skb_copy_bits(skb, 0,
+					     skb_put(skb1, skb->len),
+					     skb->len));
+			TCP_SKB_CB(skb1)->end_seq = end_seq;
+			TCP_SKB_CB(skb1)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+			__kfree_skb(skb);
+			skb = NULL;
+		} else {
+			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+		}
 
 		if (!tp->rx_opt.num_sacks ||
 		    tp->selective_acks[0].end_seq != seq)
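
Not part of the patch: a small userspace sketch to read the new
counter, which surfaces in the TcpExt: block of /proc/net/netstat
(the nstat tool from iproute2 reads the same file):

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>

int main(void)
{
	char names[4096], vals[4096];
	FILE *f = fopen("/proc/net/netstat", "r");

	if (!f)
		return 1;
	/* Each block is a pair of lines: header names, then values. */
	while (fgets(names, sizeof(names), f) &&
	       fgets(vals, sizeof(vals), f)) {
		char *sn, *sv;
		char *n = strtok_r(names, " \n", &sn);
		char *v = strtok_r(vals, " \n", &sv);

		while (n && v) {
			if (!strcmp(n, "TCPRcvCoalesce"))
				printf("TCPRcvCoalesce = %s\n", v);
			n = strtok_r(NULL, " \n", &sn);
			v = strtok_r(NULL, " \n", &sv);
		}
	}
	fclose(f);
	return 0;
}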


* Re: [PATCH v2 net-next 1/2] tcp: introduce tcp_data_queue_ofo
  2012-03-18 21:06   ` [PATCH v2 net-next 1/2] tcp: introduce tcp_data_queue_ofo Eric Dumazet
@ 2012-03-19  3:49     ` Neal Cardwell
  2012-03-19 20:57     ` David Miller
  1 sibling, 0 replies; 9+ messages in thread
From: Neal Cardwell @ 2012-03-19  3:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tom Herbert, Ilpo Järvinen,
	H.K. Jerry Chu, Yuchung Cheng

On Sun, Mar 18, 2012 at 5:06 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Split tcp_data_queue() into two parts for better readability.
>
> tcp_data_queue_ofo() is responsible for queueing an incoming skb into
> the out-of-order queue.
>
> Change the code layout so that skb_set_owner_r() is performed only if
> the skb is not dropped.
>
> This is a preliminary patch for the following "reduce out_of_order
> memory use" patch.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

Nice clean-up. Thanks for doing this!

neal


* Re: [PATCH v2 net-next 2/2] tcp: reduce out_of_order memory use
  2012-03-18 21:07   ` [PATCH v2 net-next 2/2] tcp: reduce out_of_order memory use Eric Dumazet
@ 2012-03-19  3:53     ` Neal Cardwell
  2012-03-19 20:57     ` David Miller
  1 sibling, 0 replies; 9+ messages in thread
From: Neal Cardwell @ 2012-03-19  3:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Tom Herbert, Ilpo Järvinen,
	H.K. Jerry Chu, Yuchung Cheng

On Sun, Mar 18, 2012 at 5:07 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> With increasing receive window sizes, but the speed of light not
> improving that much, the out-of-order queue can contain a huge number
> of skbs, waiting to be moved to the receive queue once missing packets
> fill the holes.
>
> Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
> sk_buff)) to store regular (MTU <= 1500) frames. This makes it highly
> probable that sk_rmem_alloc hits the sk_rcvbuf limit, which can be
> 4 Mbytes in many cases.
>
> When the limit is hit, the tcp stack calls tcp_collapse_ofo_queue(),
> a true latency killer and cpu cache blower.
>
> Attempting the coalescing each time we add a frame to the ofo queue
> keeps memory use tight and in many cases avoids the tcp_collapse()
> work later.
>
> Tested on various wireless setups (b43, ath9k, ...) known to use big
> skb truesize, this patch removed the "packets collapsed in receive
> queue due to low socket buffer" counts I was seeing before.
>
> This also reduced the average memory used by tcp sockets.
>
> With help from Neal Cardwell.
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal


* Re: [PATCH v2 net-next 1/2] tcp: introduce tcp_data_queue_ofo
  2012-03-18 21:06   ` [PATCH v2 net-next 1/2] tcp: introduce tcp_data_queue_ofo Eric Dumazet
  2012-03-19  3:49     ` Neal Cardwell
@ 2012-03-19 20:57     ` David Miller
  1 sibling, 0 replies; 9+ messages in thread
From: David Miller @ 2012-03-19 20:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ncardwell, netdev, therbert, ilpo.jarvinen, hkchu, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 18 Mar 2012 14:06:44 -0700

> Split tcp_data_queue() into two parts for better readability.
> 
> tcp_data_queue_ofo() is responsible for queueing an incoming skb into
> the out-of-order queue.
> 
> Change the code layout so that skb_set_owner_r() is performed only if
> the skb is not dropped.
> 
> This is a preliminary patch for the following "reduce out_of_order
> memory use" patch.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.


* Re: [PATCH v2 net-next 2/2] tcp: reduce out_of_order memory use
  2012-03-18 21:07   ` [PATCH v2 net-next 2/2] tcp: reduce out_of_order memory use Eric Dumazet
  2012-03-19  3:53     ` Neal Cardwell
@ 2012-03-19 20:57     ` David Miller
  1 sibling, 0 replies; 9+ messages in thread
From: David Miller @ 2012-03-19 20:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: ncardwell, netdev, therbert, ilpo.jarvinen, hkchu, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 18 Mar 2012 14:07:47 -0700

> With increasing receive window sizes, but the speed of light not
> improving that much, the out-of-order queue can contain a huge number
> of skbs, waiting to be moved to the receive queue once missing packets
> fill the holes.
> 
> Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
> sk_buff)) to store regular (MTU <= 1500) frames. This makes it highly
> probable that sk_rmem_alloc hits the sk_rcvbuf limit, which can be
> 4 Mbytes in many cases.
> 
> When the limit is hit, the tcp stack calls tcp_collapse_ofo_queue(),
> a true latency killer and cpu cache blower.
> 
> Attempting the coalescing each time we add a frame to the ofo queue
> keeps memory use tight and in many cases avoids the tcp_collapse()
> work later.
> 
> Tested on various wireless setups (b43, ath9k, ...) known to use big
> skb truesize, this patch removed the "packets collapsed in receive
> queue due to low socket buffer" counts I was seeing before.
> 
> This also reduced the average memory used by tcp sockets.
> 
> With help from Neal Cardwell.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied.


