All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] fix tcp fin memory accounting
@ 2015-03-19 23:19 Josh Hunt
  2015-03-20  2:17 ` Eric Dumazet
  2015-03-20 17:20 ` David Miller
  0 siblings, 2 replies; 25+ messages in thread
From: Josh Hunt @ 2015-03-19 23:19 UTC (permalink / raw)
  To: davem, netdev; +Cc: edumazet, Josh Hunt

tcp_send_fin() does not account for the memory it allocates properly, so
sk_forward_alloc can be negative in cases where we've sent a FIN:

ss example output (ss -amn | grep -B1 f4294):
tcp    FIN-WAIT-1 0      1            192.168.0.1:45520         192.0.2.1:8080
	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
--
tcp    FIN-WAIT-1 0      1            192.168.0.1:8080          192.0.2.1:59710
	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)

This can be resolved by using sk_stream_alloc_skb() instead of calling alloc_skb_fclone()
directly. The same fix was made in tcp_connect() in commit 355a901e6cf1 (tcp: make connect() mem charging friendly).

With this fix applied I no longer see the negative sk_forward_alloc values (or
very large values repoted by ss) in my tests.

Signed-off-by: Josh Hunt <johunt@akamai.com>
---
 net/ipv4/tcp_output.c |    6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5a73ad5..c2f0f60 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2820,15 +2820,11 @@ void tcp_send_fin(struct sock *sk)
 	} else {
 		/* Socket is locked, keep trying until memory is available. */
 		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER,
-					       sk->sk_allocation);
+			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation);
 			if (skb)
 				break;
 			yield();
 		}
-
-		/* Reserve space for headers and prepare control bits. */
-		skb_reserve(skb, MAX_TCP_HEADER);
 		/* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
 		tcp_init_nondata_skb(skb, tp->write_seq,
 				     TCPHDR_ACK | TCPHDR_FIN);
-- 
1.7.9.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-03-19 23:19 [PATCH] fix tcp fin memory accounting Josh Hunt
@ 2015-03-20  2:17 ` Eric Dumazet
  2015-03-20 17:20 ` David Miller
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2015-03-20  2:17 UTC (permalink / raw)
  To: Josh Hunt; +Cc: davem, netdev, edumazet

On Thu, 2015-03-19 at 19:19 -0400, Josh Hunt wrote:
> tcp_send_fin() does not account for the memory it allocates properly, so
> sk_forward_alloc can be negative in cases where we've sent a FIN:
> 
> ss example output (ss -amn | grep -B1 f4294):
> tcp    FIN-WAIT-1 0      1            192.168.0.1:45520         192.0.2.1:8080
> 	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
> --
> tcp    FIN-WAIT-1 0      1            192.168.0.1:8080          192.0.2.1:59710
> 	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
> 
> This can be resolved by using sk_stream_alloc_skb() instead of calling alloc_skb_fclone()
> directly. The same fix was made in tcp_connect() in commit 355a901e6cf1 (tcp: make connect() mem charging friendly).
> 
> With this fix applied I no longer see the negative sk_forward_alloc values (or
> very large values repoted by ss) in my tests.
> 
> Signed-off-by: Josh Hunt <johunt@akamai.com>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-03-19 23:19 [PATCH] fix tcp fin memory accounting Josh Hunt
  2015-03-20  2:17 ` Eric Dumazet
@ 2015-03-20 17:20 ` David Miller
  2015-03-20 17:36   ` Josh Hunt
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2015-03-20 17:20 UTC (permalink / raw)
  To: johunt; +Cc: netdev, edumazet

From: Josh Hunt <johunt@akamai.com>
Date: Thu, 19 Mar 2015 19:19:30 -0400

> tcp_send_fin() does not account for the memory it allocates properly, so
> sk_forward_alloc can be negative in cases where we've sent a FIN:
> 
> ss example output (ss -amn | grep -B1 f4294):
> tcp    FIN-WAIT-1 0      1            192.168.0.1:45520         192.0.2.1:8080
> 	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
> --
> tcp    FIN-WAIT-1 0      1            192.168.0.1:8080          192.0.2.1:59710
> 	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
> 
> This can be resolved by using sk_stream_alloc_skb() instead of calling alloc_skb_fclone()
> directly. The same fix was made in tcp_connect() in commit 355a901e6cf1 (tcp: make connect() mem charging friendly).
> 
> With this fix applied I no longer see the negative sk_forward_alloc values (or
> very large values repoted by ss) in my tests.
> 
> Signed-off-by: Josh Hunt <johunt@akamai.com>

Applied and queued up for -stable, thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-03-20 17:20 ` David Miller
@ 2015-03-20 17:36   ` Josh Hunt
  2015-03-24  6:10     ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Josh Hunt @ 2015-03-20 17:36 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, edumazet

On 03/20/2015 12:20 PM, David Miller wrote:
> From: Josh Hunt <johunt@akamai.com>
> Date: Thu, 19 Mar 2015 19:19:30 -0400
>
>> tcp_send_fin() does not account for the memory it allocates properly, so
>> sk_forward_alloc can be negative in cases where we've sent a FIN:
>>
>> ss example output (ss -amn | grep -B1 f4294):
>> tcp    FIN-WAIT-1 0      1            192.168.0.1:45520         192.0.2.1:8080
>> 	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
>> --
>> tcp    FIN-WAIT-1 0      1            192.168.0.1:8080          192.0.2.1:59710
>> 	skmem:(r0,rb87380,t0,tb87380,f4294966016,w1280,o0,bl0)
>>
>> This can be resolved by using sk_stream_alloc_skb() instead of calling alloc_skb_fclone()
>> directly. The same fix was made in tcp_connect() in commit 355a901e6cf1 (tcp: make connect() mem charging friendly).
>>
>> With this fix applied I no longer see the negative sk_forward_alloc values (or
>> very large values repoted by ss) in my tests.
>>
>> Signed-off-by: Josh Hunt <johunt@akamai.com>
>
> Applied and queued up for -stable, thanks.
>

Thanks David. Would it be possible to queue up 355a901e6cf1 (tcp: make 
connect() mem charging friendly) for stable as well? That is the commit 
that fixes this problem in the tcp_connect()/tcp_send_syn_data() cases.

Thanks
Josh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-03-20 17:36   ` Josh Hunt
@ 2015-03-24  6:10     ` David Miller
  2015-03-24  6:11       ` Josh Hunt
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2015-03-24  6:10 UTC (permalink / raw)
  To: johunt; +Cc: netdev, edumazet

From: Josh Hunt <johunt@akamai.com>
Date: Fri, 20 Mar 2015 12:36:24 -0500

> Would it be possible to queue up 355a901e6cf1 (tcp: make connect()
> mem charging friendly) for stable as well? That is the commit that
> fixes this problem in the tcp_connect()/tcp_send_syn_data() cases.

Done.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-03-24  6:10     ` David Miller
@ 2015-03-24  6:11       ` Josh Hunt
  2015-04-22  0:09         ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Josh Hunt @ 2015-03-24  6:11 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, edumazet

On 03/24/2015 01:10 AM, David Miller wrote:
> From: Josh Hunt <johunt@akamai.com>
> Date: Fri, 20 Mar 2015 12:36:24 -0500
>
>> Would it be possible to queue up 355a901e6cf1 (tcp: make connect()
>> mem charging friendly) for stable as well? That is the commit that
>> fixes this problem in the tcp_connect()/tcp_send_syn_data() cases.
>
> Done.
>

Thanks David.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-03-24  6:11       ` Josh Hunt
@ 2015-04-22  0:09         ` Eric Dumazet
  2015-04-22  1:32           ` [PATCH net] tcp: fix possible deadlock in tcp_send_fin() Eric Dumazet
  2015-04-23 13:48           ` [PATCH] fix tcp fin memory accounting Josh Hunt
  0 siblings, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2015-04-22  0:09 UTC (permalink / raw)
  To: Josh Hunt; +Cc: David Miller, netdev, edumazet

On Tue, 2015-03-24 at 01:11 -0500, Josh Hunt wrote:
> On 03/24/2015 01:10 AM, David Miller wrote:
> > From: Josh Hunt <johunt@akamai.com>
> > Date: Fri, 20 Mar 2015 12:36:24 -0500
> >
> >> Would it be possible to queue up 355a901e6cf1 (tcp: make connect()
> >> mem charging friendly) for stable as well? That is the commit that
> >> fixes this problem in the tcp_connect()/tcp_send_syn_data() cases.
> >
> > Done.
> >
> 
> Thanks David.

Note that this patch adds a deadlock possibility in some stress
situations.

If a process owning some tcp socket dies, and tcp_mem[2] is already hit,
all sk_stream_alloc_skb() can return NULL and we loop in tcp_send_fin(),
making no progress because we can not free any tcp memory.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH net] tcp: fix possible deadlock in tcp_send_fin()
  2015-04-22  0:09         ` Eric Dumazet
@ 2015-04-22  1:32           ` Eric Dumazet
  2015-04-22 14:33             ` [PATCH net] net: do not deplete pfmemalloc reserve Eric Dumazet
  2015-04-22 18:15             ` [PATCH net] tcp: fix possible deadlock in tcp_send_fin() David Miller
  2015-04-23 13:48           ` [PATCH] fix tcp fin memory accounting Josh Hunt
  1 sibling, 2 replies; 25+ messages in thread
From: Eric Dumazet @ 2015-04-22  1:32 UTC (permalink / raw)
  To: Josh Hunt; +Cc: David Miller, netdev, edumazet, Greg Thelen, David Rientjes

From: Eric Dumazet <edumazet@google.com>

Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
case a huge process is killed by OOM, and tcp_mem[2] is hit.

To be able to free memory we need to make progress, so this
patch allows FIN packets to not care about tcp_mem[2], if
skb allocation succeeded.

In a follow-up patch, we might abort tcp_send_fin() infinite loop
in case TIF_MEMDIE is set on this thread, as memory allocator
did its best getting extra memory already. 

This patch reverts d22e15371811 ("tcp: fix tcp fin memory accounting")

Fixes: d22e15371811 ("tcp: fix tcp fin memory accounting")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c |   20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8c8d7e06b72f..2ade67b7cdb0 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2812,6 +2812,21 @@ begin_fwd:
 	}
 }
 
+/* We allow to exceed memory limits for FIN packets to expedite
+ * connection tear down and (memory) recovery.
+ * Otherwise tcp_send_fin() could loop forever.
+ */
+static void sk_forced_wmem_schedule(struct sock *sk, int size)
+{
+	int amt, status;
+
+	if (size <= sk->sk_forward_alloc)
+		return;
+	amt = sk_mem_pages(size);
+	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
+	sk_memory_allocated_add(sk, amt, &status);
+}
+
 /* Send a fin.  The caller locks the socket for us.  This cannot be
  * allowed to fail queueing a FIN frame under any circumstances.
  */
@@ -2834,11 +2849,14 @@ void tcp_send_fin(struct sock *sk)
 	} else {
 		/* Socket is locked, keep trying until memory is available. */
 		for (;;) {
-			skb = sk_stream_alloc_skb(sk, 0, sk->sk_allocation);
+			skb = alloc_skb_fclone(MAX_TCP_HEADER,
+					       sk->sk_allocation);
 			if (skb)
 				break;
 			yield();
 		}
+		skb_reserve(skb, MAX_TCP_HEADER);
+		sk_forced_wmem_schedule(sk, skb->truesize);
 		/* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */
 		tcp_init_nondata_skb(skb, tp->write_seq,
 				     TCPHDR_ACK | TCPHDR_FIN);

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net] net: do not deplete pfmemalloc reserve
  2015-04-22  1:32           ` [PATCH net] tcp: fix possible deadlock in tcp_send_fin() Eric Dumazet
@ 2015-04-22 14:33             ` Eric Dumazet
  2015-04-22 19:35               ` David Miller
  2015-04-22 18:15             ` [PATCH net] tcp: fix possible deadlock in tcp_send_fin() David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2015-04-22 14:33 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, edumazet, Greg Thelen, David Rientjes, Mel Gorman

From: Eric Dumazet <edumazet@google.com>

build_skb() should look at the page pfmemalloc status.
If set, this means page allocator allocated this page in the
expectation it would help to free other pages. Networking
stack can do that only if skb->pfmemalloc is also set.

Also, we must refrain using high order pages from the pfmemalloc
reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
them. Under memory pressure, using order-0 pages is probably the best
strategy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/skbuff.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d1967dab9cc6..456ead534e10 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -311,7 +311,11 @@ struct sk_buff *build_skb(void *data, unsigned int frag_size)
 
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->truesize = SKB_TRUESIZE(size);
-	skb->head_frag = frag_size != 0;
+	if (frag_size) {
+		skb->head_frag = 1;
+		if (virt_to_head_page(data)->pfmemalloc)
+			skb->pfmemalloc = 1;
+	}
 	atomic_set(&skb->users, 1);
 	skb->head = data;
 	skb->data = data;
@@ -348,7 +352,8 @@ static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
 	gfp_t gfp = gfp_mask;
 
 	if (order) {
-		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
+		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
+			    __GFP_NOMEMALLOC;
 		page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
 		nc->frag.size = PAGE_SIZE << (page ? order : 0);
 	}

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net] tcp: fix possible deadlock in tcp_send_fin()
  2015-04-22  1:32           ` [PATCH net] tcp: fix possible deadlock in tcp_send_fin() Eric Dumazet
  2015-04-22 14:33             ` [PATCH net] net: do not deplete pfmemalloc reserve Eric Dumazet
@ 2015-04-22 18:15             ` David Miller
  2015-04-22 18:39               ` Eric Dumazet
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2015-04-22 18:15 UTC (permalink / raw)
  To: eric.dumazet; +Cc: johunt, netdev, edumazet, gthelen, rientjes

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 21 Apr 2015 18:32:24 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Using sk_stream_alloc_skb() in tcp_send_fin() is dangerous in
> case a huge process is killed by OOM, and tcp_mem[2] is hit.
> 
> To be able to free memory we need to make progress, so this
> patch allows FIN packets to not care about tcp_mem[2], if
> skb allocation succeeded.
> 
> In a follow-up patch, we might abort tcp_send_fin() infinite loop
> in case TIF_MEMDIE is set on this thread, as memory allocator
> did its best getting extra memory already. 
> 
> This patch reverts d22e15371811 ("tcp: fix tcp fin memory accounting")
> 
> Fixes: d22e15371811 ("tcp: fix tcp fin memory accounting")
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied and queued up for -stable, since the tcp fin memory accounting change
went to -stable already.

Wrt. longer term solutions, I would even be OK at this point aborting
the connection altogether if the skb allocation fails period.  Not
just if TIF_MEMDIE is set.  That endless loop has been nothing but
trouble.

Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net] tcp: fix possible deadlock in tcp_send_fin()
  2015-04-22 18:15             ` [PATCH net] tcp: fix possible deadlock in tcp_send_fin() David Miller
@ 2015-04-22 18:39               ` Eric Dumazet
  2015-04-22 18:50                 ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2015-04-22 18:39 UTC (permalink / raw)
  To: David Miller; +Cc: johunt, netdev, edumazet, gthelen, rientjes

On Wed, 2015-04-22 at 14:15 -0400, David Miller wrote:

> Wrt. longer term solutions, I would even be OK at this point aborting
> the connection altogether if the skb allocation fails period.  Not
> just if TIF_MEMDIE is set.  That endless loop has been nothing but
> trouble.

Agreed.

Note that in the unlikely case skb can not be allocated,
if an already transmitted packet is in the write queue, we also can OR
the FIN flag on it, and rely on normal rtx to deliver this FIN later.

I'll cook a patch when net-next reopens.

Thanks David.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net] tcp: fix possible deadlock in tcp_send_fin()
  2015-04-22 18:39               ` Eric Dumazet
@ 2015-04-22 18:50                 ` David Miller
  2015-04-22 19:12                   ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2015-04-22 18:50 UTC (permalink / raw)
  To: eric.dumazet; +Cc: johunt, netdev, edumazet, gthelen, rientjes

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 22 Apr 2015 11:39:27 -0700

> Note that in the unlikely case skb can not be allocated,
> if an already transmitted packet is in the write queue, we also can OR
> the FIN flag on it, and rely on normal rtx to deliver this FIN later.
> 
> I'll cook a patch when net-next reopens.

Hmmm, we already check for something like this at the beginning of
tcp_send_fin(), I know because I added that piece of code 15+ years
ago :-)

Or are you suggesting something slightly different?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net] tcp: fix possible deadlock in tcp_send_fin()
  2015-04-22 18:50                 ` David Miller
@ 2015-04-22 19:12                   ` Eric Dumazet
  2015-04-22 19:29                     ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2015-04-22 19:12 UTC (permalink / raw)
  To: David Miller; +Cc: johunt, netdev, edumazet, gthelen, rientjes

On Wed, 2015-04-22 at 14:50 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 22 Apr 2015 11:39:27 -0700
> 
> > Note that in the unlikely case skb can not be allocated,
> > if an already transmitted packet is in the write queue, we also can OR
> > the FIN flag on it, and rely on normal rtx to deliver this FIN later.
> > 
> > I'll cook a patch when net-next reopens.
> 
> Hmmm, we already check for something like this at the beginning of
> tcp_send_fin(), I know because I added that piece of code 15+ years
> ago :-)
> 
> Or are you suggesting something slightly different?


The existing test does the OR only if there is an unsent frame in the
write queue.

Because if the frame present in write queue was already sent, we prefer
cooking a fresh skb to be able to send it right now (disabling Nagle
checks)

So, in the case we cannot allocate the skb, we can still add the FIN
flag, but not push any frame right now.

FIN will be sent later, because incoming ACK or rtx will eventually send
it.

This sounds better than not sending FIN at all.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net] tcp: fix possible deadlock in tcp_send_fin()
  2015-04-22 19:12                   ` Eric Dumazet
@ 2015-04-22 19:29                     ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2015-04-22 19:29 UTC (permalink / raw)
  To: eric.dumazet; +Cc: johunt, netdev, edumazet, gthelen, rientjes

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 22 Apr 2015 12:12:20 -0700

> On Wed, 2015-04-22 at 14:50 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 22 Apr 2015 11:39:27 -0700
>> 
>> > Note that in the unlikely case skb can not be allocated,
>> > if an already transmitted packet is in the write queue, we also can OR
>> > the FIN flag on it, and rely on normal rtx to deliver this FIN later.
>> > 
>> > I'll cook a patch when net-next reopens.
>> 
>> Hmmm, we already check for something like this at the beginning of
>> tcp_send_fin(), I know because I added that piece of code 15+ years
>> ago :-)
>> 
>> Or are you suggesting something slightly different?
> 
> 
> The existing test does the OR only if there is an unsent frame in the
> write queue.
> 
> Because if the frame present in write queue was already sent, we prefer
> cooking a fresh skb to be able to send it right now (disabling Nagle
> checks)
> 
> So, in the case we cannot allocate the skb, we can still add the FIN
> flag, but not push any frame right now.
> 
> FIN will be sent later, because incoming ACK or rtx will eventually send
> it.
> 
> This sounds better than not sending FIN at all.

Ok, that makes a lot of sense.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net] net: do not deplete pfmemalloc reserve
  2015-04-22 14:33             ` [PATCH net] net: do not deplete pfmemalloc reserve Eric Dumazet
@ 2015-04-22 19:35               ` David Miller
  2015-04-22 20:20                 ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2015-04-22 19:35 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, edumazet, gthelen, rientjes, mgorman

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 22 Apr 2015 07:33:36 -0700

> Also, we must refrain using high order pages from the pfmemalloc
> reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
> them. Under memory pressure, using order-0 pages is probably the best
> strategy.
 ...
> @@ -348,7 +352,8 @@ static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
>  	gfp_t gfp = gfp_mask;
>  
>  	if (order) {
> -		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
> +		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> +			    __GFP_NOMEMALLOC;
>  		page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);

Hmmm, skbuff.h says:

 * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx

But that's exactly what this __page_frag_refill() code is primarily
being used for, network RX, right?

If we believe the comment, you should not be adding the
__GFP_NOMEMALLOC flag here.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net] net: do not deplete pfmemalloc reserve
  2015-04-22 19:35               ` David Miller
@ 2015-04-22 20:20                 ` Eric Dumazet
  2015-04-22 20:24                   ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2015-04-22 20:20 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, edumazet, gthelen, rientjes, mgorman

On Wed, 2015-04-22 at 15:35 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 22 Apr 2015 07:33:36 -0700
> 
> > Also, we must refrain using high order pages from the pfmemalloc
> > reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
> > them. Under memory pressure, using order-0 pages is probably the best
> > strategy.
>  ...
> > @@ -348,7 +352,8 @@ static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
> >  	gfp_t gfp = gfp_mask;
> >  
> >  	if (order) {
> > -		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
> > +		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> > +			    __GFP_NOMEMALLOC;
> >  		page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
> 
> Hmmm, skbuff.h says:
> 
>  * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
> 
> But that's exactly what this __page_frag_refill() code is primarily
> being used for, network RX, right?
> 
> If we believe the comment, you should not be adding the
> __GFP_NOMEMALLOC flag here.

I do not see a problem here.

We attempt high order pages allocations only if they bring a performance
gain. Under memory pressure, there is the high risk a whole order-3 page
is consumed by a tiny buffer using 200 bytes but not consumed.

Only high order pages allocation attempts should request this
__GFP_NOMEMALLOC flag set.

order-0 pages will have __GFP_MEMALLOC if requested by at least one
socket in the host.

About __dev_alloc_pages() comment / users, they have to know if they
need high order pages, or could fallback to order-0 ones.

We can update the comments in skbuff.h, but so far the only user has no
need to assert __GFP_NOMEMALLOC.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net] net: do not deplete pfmemalloc reserve
  2015-04-22 20:20                 ` Eric Dumazet
@ 2015-04-22 20:24                   ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2015-04-22 20:24 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, edumazet, gthelen, rientjes, mgorman

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 22 Apr 2015 13:20:48 -0700

> On Wed, 2015-04-22 at 15:35 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 22 Apr 2015 07:33:36 -0700
>> 
>> > Also, we must refrain using high order pages from the pfmemalloc
>> > reserve, so __page_frag_refill() must also use __GFP_NOMEMALLOC for
>> > them. Under memory pressure, using order-0 pages is probably the best
>> > strategy.
>>  ...
>> > @@ -348,7 +352,8 @@ static struct page *__page_frag_refill(struct netdev_alloc_cache *nc,
>> >  	gfp_t gfp = gfp_mask;
>> >  
>> >  	if (order) {
>> > -		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY;
>> > +		gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
>> > +			    __GFP_NOMEMALLOC;
>> >  		page = alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
>> 
>> Hmmm, skbuff.h says:
>> 
>>  * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
>> 
>> But that's exactly what this __page_frag_refill() code is primarily
>> being used for, network RX, right?
>> 
>> If we believe the comment, you should not be adding the
>> __GFP_NOMEMALLOC flag here.
> 
> I do not see a problem here.
> 
> We attempt high order pages allocations only if they bring a performance
> gain. Under memory pressure, there is the high risk a whole order-3 page
> is consumed by a tiny buffer using 200 bytes but not consumed.
> 
> Only high order pages allocation attempts should request this
> __GFP_NOMEMALLOC flag set.
> 
> order-0 pages will have __GFP_MEMALLOC if requested by at least one
> socket in the host.
> 
> About __dev_alloc_pages() comment / users, they have to know if they
> need high order pages, or could fallback to order-0 ones.
> 
> We can update the comments in skbuff.h, but so far the only user has no
> need to assert __GFP_NOMEMALLOC.

Ok, thanks for explaining.   The 'order' test guarding this change didn't
register fully when I first read your patch.

I'll apply this and queue up for -stable, thanks Eric.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-04-22  0:09         ` Eric Dumazet
  2015-04-22  1:32           ` [PATCH net] tcp: fix possible deadlock in tcp_send_fin() Eric Dumazet
@ 2015-04-23 13:48           ` Josh Hunt
  2015-04-23 14:02             ` Eric Dumazet
  1 sibling, 1 reply; 25+ messages in thread
From: Josh Hunt @ 2015-04-23 13:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, edumazet

On 04/21/2015 07:09 PM, Eric Dumazet wrote:
>
> Note that this patch adds a deadlock possibility in some stress
> situations.
>
> If a process owning some tcp socket dies, and tcp_mem[2] is already hit,
> all sk_stream_alloc_skb() can return NULL and we loop in tcp_send_fin(),
> making no progress because we can not free any tcp memory.

Ugh. Thanks for fixing this Eric!

Josh

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-04-23 13:48           ` [PATCH] fix tcp fin memory accounting Josh Hunt
@ 2015-04-23 14:02             ` Eric Dumazet
  2015-04-23 15:54               ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2015-04-23 14:02 UTC (permalink / raw)
  To: Josh Hunt; +Cc: David Miller, netdev, edumazet

On Thu, 2015-04-23 at 08:48 -0500, Josh Hunt wrote:
> On 04/21/2015 07:09 PM, Eric Dumazet wrote:
> >
> > Note that this patch adds a deadlock possibility in some stress
> > situations.
> >
> > If a process owning some tcp socket dies, and tcp_mem[2] is already hit,
> > all sk_stream_alloc_skb() can return NULL and we loop in tcp_send_fin(),
> > making no progress because we can not free any tcp memory.
> 
> Ugh. Thanks for fixing this Eric!

No problem ;)

For the record, I've tested this followup patch that I'll formally
submit when net-next reopens :

If there is one already sent skb in write queue
(we look at the tail of course), and :
- We are under TCP memory pressure, 
- or allocation of a fresh skb using GFP_KERNEL failed,

-> we add the FIN flag that will eventually be sent later after a
timeout.

 net/ipv4/tcp_output.c |   41 ++++++++++++++++++++--------------------
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2ade67b7cdb0..fe6558eb64f3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2827,33 +2827,34 @@ static void sk_forced_wmem_schedule(struct sock *sk, int size)
 	sk_memory_allocated_add(sk, amt, &status);
 }
 
-/* Send a fin.  The caller locks the socket for us.  This cannot be
- * allowed to fail queueing a FIN frame under any circumstances.
+/* Send a FIN. The caller locks the socket for us.
+ * We should try to send a FIN packet really hard, but eventually give up.
  */
 void tcp_send_fin(struct sock *sk)
 {
+	struct sk_buff *skb, *tskb = tcp_write_queue_tail(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb = tcp_write_queue_tail(sk);
-	int mss_now;
 
-	/* Optimization, tack on the FIN if we have a queue of
-	 * unsent frames.  But be careful about outgoing SACKS
-	 * and IP options.
+	/* Optimization, tack on the FIN if we have one skb in write queue and
+	 * this skb was not yet sent, or we are under memory pressure.
+	 * Note: in the latter case, FIN packet will be sent after a timeout,
+	 * as TCP stack thinks it has been transmitted once.
 	 */
-	mss_now = tcp_current_mss(sk);
-
-	if (tcp_send_head(sk)) {
-		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-		TCP_SKB_CB(skb)->end_seq++;
+	if (tskb && (tcp_send_head(sk) || sk_under_memory_pressure(sk))) {
+coalesce:
+		TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN;
+		TCP_SKB_CB(tskb)->end_seq++;
 		tp->write_seq++;
+		if (!tcp_send_head(sk)) {
+			tp->snd_nxt++;
+			return;
+		}
 	} else {
-		/* Socket is locked, keep trying until memory is available. */
-		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER,
-					       sk->sk_allocation);
-			if (skb)
-				break;
-			yield();
+		skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
+		if (unlikely(!skb)) {
+			if (tskb)
+				goto coalesce;
+			return;
 		}
 		skb_reserve(skb, MAX_TCP_HEADER);
 		sk_forced_wmem_schedule(sk, skb->truesize);
@@ -2862,7 +2863,7 @@ void tcp_send_fin(struct sock *sk)
 				     TCPHDR_ACK | TCPHDR_FIN);
 		tcp_queue_skb(sk, skb);
 	}
-	__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
+	__tcp_push_pending_frames(sk, tcp_current_mss(sk), TCP_NAGLE_OFF);
 }
 
 /* We get here when a process closes a file descriptor (either due to

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-04-23 14:02             ` Eric Dumazet
@ 2015-04-23 15:54               ` David Miller
  2015-04-23 16:16                 ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2015-04-23 15:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: johunt, netdev, edumazet

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 23 Apr 2015 07:02:47 -0700

> +		if (!tcp_send_head(sk)) {
> +			tp->snd_nxt++;
> +			return;
> +		}

I'm not so sure about this.  Why is this needed?

Otherwise patch looks fine to me.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-04-23 15:54               ` David Miller
@ 2015-04-23 16:16                 ` Eric Dumazet
  2015-04-23 16:48                   ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2015-04-23 16:16 UTC (permalink / raw)
  To: David Miller; +Cc: johunt, netdev, edumazet

On Thu, 2015-04-23 at 11:54 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 23 Apr 2015 07:02:47 -0700
> 
> > +		if (!tcp_send_head(sk)) {
> > +			tp->snd_nxt++;
> > +			return;
> > +		}
> 
> I'm not so sure about this.  Why is this needed?
> 
> Otherwise patch looks fine to me.
> 

I guess I need to add a comment then ;)

If we want to pretend FIN was sent, we also need to tweak tp->snd_nxt to
match new tskb->end_seq (or tp->write_seq).

I tested following packetdrill script and confirmed that if I do not
tweak snd_nxt, last packet sent is incorrect :

	> . 5001:5001(0) ack 2

This might be because our stack relies that we never coalesce something
on one already sent skb (we do this check in tcp_sendmsg() for example)

Also note the funny thing : The FIN that is finally sent also has a Push
flag. I believe its fine to leave this, as it is not incorrect at
protocol perspective, and removing the push would add some logic when we
split the packet to remove the ACKed part.


// Initialize a server socket.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0

+0    < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
+0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
+.010 < . 1:1(0) ack 1 win 257

+0 accept(3, ..., ...) = 4

+0.010 write(4, ..., 5000) = 5000
+0    > P. 1:5001(5000) ack 1

// this assumes kernel is not able to allocate fresh skb to 
// hold FIN
+.010 shutdown(4, SHUT_RDWR) = 0
+0    < . 1:1(0) ack 5001 win 257

// Yeah ! note the FIN that we 'retransmit' also gets a Push
+0.209    > FP. 5001:5001(0) ack 1

+0 read(4, ..., 1000) = 0
+0 write(4, ..., 1000) = -1 EPIPE (Broken pipe)

+.010 close(4) = 0

// TLP at 2*RTT after the original send of FIN.
+.201 > FP. 5001:5001(0) ack 1

// Receive an ACK for the remaining outstanding data.
+.010 < . 1:1(0) ack 5002 win 257

// Receive a FIN.
+.010 < F. 1:1(0) ack 5002 win 257
// Make sure final packet is correct
+0    > . 5002:5002(0) ack 2

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-04-23 16:16                 ` Eric Dumazet
@ 2015-04-23 16:48                   ` Eric Dumazet
  2015-04-23 17:12                     ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2015-04-23 16:48 UTC (permalink / raw)
  To: David Miller; +Cc: johunt, netdev, edumazet

On Thu, 2015-04-23 at 09:16 -0700, Eric Dumazet wrote:
> On Thu, 2015-04-23 at 11:54 -0400, David Miller wrote:
> > From: Eric Dumazet <eric.dumazet@gmail.com>
> > Date: Thu, 23 Apr 2015 07:02:47 -0700
> > 
> > > +		if (!tcp_send_head(sk)) {
> > > +			tp->snd_nxt++;
> > > +			return;
> > > +		}
> > 
> > I'm not so sure about this.  Why is this needed?
> > 
> > Otherwise patch looks fine to me.
> > 
> 
> I guess I need to add a comment then ;)
> 
> If we want to pretend FIN was sent, we also need to tweak tp->snd_nxt to
> match new tskb->end_seq (or tp->write_seq).
> 
> I tested following packetdrill script and confirmed that if I do not
> tweak snd_nxt, last packet sent is incorrect :
> 
> 	> . 5001:5001(0) ack 2
> 
> This might be because our stack relies that we never coalesce something
> on one already sent skb (we do this check in tcp_sendmsg() for example)

Well, real reason is that tp->snd_nxt is not touched in retransmit
paths, but when new data is sent (tcp_event_new_data_sent())

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] fix tcp fin memory accounting
  2015-04-23 16:48                   ` Eric Dumazet
@ 2015-04-23 17:12                     ` David Miller
  2015-04-23 17:42                       ` [PATCH net] tcp: avoid looping in tcp_send_fin() Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: David Miller @ 2015-04-23 17:12 UTC (permalink / raw)
  To: eric.dumazet; +Cc: johunt, netdev, edumazet

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 23 Apr 2015 09:48:12 -0700

> On Thu, 2015-04-23 at 09:16 -0700, Eric Dumazet wrote:
>> On Thu, 2015-04-23 at 11:54 -0400, David Miller wrote:
>> > From: Eric Dumazet <eric.dumazet@gmail.com>
>> > Date: Thu, 23 Apr 2015 07:02:47 -0700
>> > 
>> > > +		if (!tcp_send_head(sk)) {
>> > > +			tp->snd_nxt++;
>> > > +			return;
>> > > +		}
>> > 
>> > I'm not so sure about this.  Why is this needed?
>> > 
>> > Otherwise patch looks fine to me.
>> > 
>> 
>> I guess I need to add a comment then ;)
>> 
>> If we want to pretend FIN was sent, we also need to tweak tp->snd_nxt to
>> match new tskb->end_seq (or tp->write_seq).
>> 
>> I tested following packetdrill script and confirmed that if I do not
>> tweak snd_nxt, last packet sent is incorrect :
>> 
>> 	> . 5001:5001(0) ack 2
>> 
>> This might be because our stack relies that we never coalesce something
>> on one already sent skb (we do this check in tcp_sendmsg() for example)
> 
> Well, real reason is that tp->snd_nxt is not touched in retransmit
> paths, but when new data is sent (tcp_event_new_data_sent())

That makes sense, thanks for explaining.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH net] tcp: avoid looping in tcp_send_fin()
  2015-04-23 17:12                     ` David Miller
@ 2015-04-23 17:42                       ` Eric Dumazet
  2015-04-24 15:07                         ` David Miller
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2015-04-23 17:42 UTC (permalink / raw)
  To: David Miller; +Cc: johunt, netdev, edumazet, Neal Cardwell, Yuchung Cheng

From: Eric Dumazet <edumazet@google.com>

Presence of an unbound loop in tcp_send_fin() had always been hard
to explain when analyzing crash dumps involving gigantic dying processes
with millions of sockets.

Lets try a different strategy :

In case of memory pressure, try to add the FIN flag to last packet
in write queue, even if packet was already sent. TCP stack will
be able to deliver this FIN after a timeout event. Note that this
FIN being delivered by a retransmit, it also carries a Push flag
given our current implementation.

By checking sk_under_memory_pressure(), we anticipate that cooking
many FIN packets might deplete tcp memory.

In the case we could not allocate a packet, even with __GFP_WAIT
allocation, then not sending a FIN seems quite reasonable if it allows
to get rid of this socket, free memory, and not block the process from
eventually doing other useful work.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_output.c |   50 +++++++++++++++++++++++-----------------
 1 file changed, 29 insertions(+), 21 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2ade67b7cdb0..a369e8a70b2c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2814,7 +2814,8 @@ begin_fwd:
 
 /* We allow to exceed memory limits for FIN packets to expedite
  * connection tear down and (memory) recovery.
- * Otherwise tcp_send_fin() could loop forever.
+ * Otherwise tcp_send_fin() could be tempted to either delay FIN
+ * or even be forced to close flow without any FIN.
  */
 static void sk_forced_wmem_schedule(struct sock *sk, int size)
 {
@@ -2827,33 +2828,40 @@ static void sk_forced_wmem_schedule(struct sock *sk, int size)
 	sk_memory_allocated_add(sk, amt, &status);
 }
 
-/* Send a fin.  The caller locks the socket for us.  This cannot be
- * allowed to fail queueing a FIN frame under any circumstances.
+/* Send a FIN. The caller locks the socket for us.
+ * We should try to send a FIN packet really hard, but eventually give up.
  */
 void tcp_send_fin(struct sock *sk)
 {
+	struct sk_buff *skb, *tskb = tcp_write_queue_tail(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb = tcp_write_queue_tail(sk);
-	int mss_now;
 
-	/* Optimization, tack on the FIN if we have a queue of
-	 * unsent frames.  But be careful about outgoing SACKS
-	 * and IP options.
+	/* Optimization, tack on the FIN if we have one skb in write queue and
+	 * this skb was not yet sent, or we are under memory pressure.
+	 * Note: in the latter case, FIN packet will be sent after a timeout,
+	 * as TCP stack thinks it has already been transmitted.
 	 */
-	mss_now = tcp_current_mss(sk);
-
-	if (tcp_send_head(sk)) {
-		TCP_SKB_CB(skb)->tcp_flags |= TCPHDR_FIN;
-		TCP_SKB_CB(skb)->end_seq++;
+	if (tskb && (tcp_send_head(sk) || sk_under_memory_pressure(sk))) {
+coalesce:
+		TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN;
+		TCP_SKB_CB(tskb)->end_seq++;
 		tp->write_seq++;
+		if (!tcp_send_head(sk)) {
+			/* This means tskb was already sent.
+			 * Pretend we included the FIN on previous transmit.
+			 * We need to set tp->snd_nxt to the value it would have
+			 * if FIN had been sent. This is because retransmit path
+			 * does not change tp->snd_nxt.
+			 */
+			tp->snd_nxt++;
+			return;
+		}
 	} else {
-		/* Socket is locked, keep trying until memory is available. */
-		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER,
-					       sk->sk_allocation);
-			if (skb)
-				break;
-			yield();
+		skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
+		if (unlikely(!skb)) {
+			if (tskb)
+				goto coalesce;
+			return;
 		}
 		skb_reserve(skb, MAX_TCP_HEADER);
 		sk_forced_wmem_schedule(sk, skb->truesize);
@@ -2862,7 +2870,7 @@ void tcp_send_fin(struct sock *sk)
 				     TCPHDR_ACK | TCPHDR_FIN);
 		tcp_queue_skb(sk, skb);
 	}
-	__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF);
+	__tcp_push_pending_frames(sk, tcp_current_mss(sk), TCP_NAGLE_OFF);
 }
 
 /* We get here when a process closes a file descriptor (either due to

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net] tcp: avoid looping in tcp_send_fin()
  2015-04-23 17:42                       ` [PATCH net] tcp: avoid looping in tcp_send_fin() Eric Dumazet
@ 2015-04-24 15:07                         ` David Miller
  0 siblings, 0 replies; 25+ messages in thread
From: David Miller @ 2015-04-24 15:07 UTC (permalink / raw)
  To: eric.dumazet; +Cc: johunt, netdev, edumazet, ncardwell, ycheng

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 23 Apr 2015 10:42:39 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> Presence of an unbound loop in tcp_send_fin() had always been hard
> to explain when analyzing crash dumps involving gigantic dying processes
> with millions of sockets.
> 
> Lets try a different strategy :
> 
> In case of memory pressure, try to add the FIN flag to last packet
> in write queue, even if packet was already sent. TCP stack will
> be able to deliver this FIN after a timeout event. Note that this
> FIN being delivered by a retransmit, it also carries a Push flag
> given our current implementation.
> 
> By checking sk_under_memory_pressure(), we anticipate that cooking
> many FIN packets might deplete tcp memory.
> 
> In the case we could not allocate a packet, even with __GFP_WAIT
> allocation, then not sending a FIN seems quite reasonable if it allows
> to get rid of this socket, free memory, and not block the process from
> eventually doing other useful work.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

This is really nice, and long overdue, so I'm going to apply this now.

Thanks Eric.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-04-24 15:07 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-19 23:19 [PATCH] fix tcp fin memory accounting Josh Hunt
2015-03-20  2:17 ` Eric Dumazet
2015-03-20 17:20 ` David Miller
2015-03-20 17:36   ` Josh Hunt
2015-03-24  6:10     ` David Miller
2015-03-24  6:11       ` Josh Hunt
2015-04-22  0:09         ` Eric Dumazet
2015-04-22  1:32           ` [PATCH net] tcp: fix possible deadlock in tcp_send_fin() Eric Dumazet
2015-04-22 14:33             ` [PATCH net] net: do not deplete pfmemalloc reserve Eric Dumazet
2015-04-22 19:35               ` David Miller
2015-04-22 20:20                 ` Eric Dumazet
2015-04-22 20:24                   ` David Miller
2015-04-22 18:15             ` [PATCH net] tcp: fix possible deadlock in tcp_send_fin() David Miller
2015-04-22 18:39               ` Eric Dumazet
2015-04-22 18:50                 ` David Miller
2015-04-22 19:12                   ` Eric Dumazet
2015-04-22 19:29                     ` David Miller
2015-04-23 13:48           ` [PATCH] fix tcp fin memory accounting Josh Hunt
2015-04-23 14:02             ` Eric Dumazet
2015-04-23 15:54               ` David Miller
2015-04-23 16:16                 ` Eric Dumazet
2015-04-23 16:48                   ` Eric Dumazet
2015-04-23 17:12                     ` David Miller
2015-04-23 17:42                       ` [PATCH net] tcp: avoid looping in tcp_send_fin() Eric Dumazet
2015-04-24 15:07                         ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.