From: Eric Dumazet <eric.dumazet@gmail.com>
To: Nick Bowler <nbowler@elliptictech.com>
Cc: linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	"David S. Miller" <davem@davemloft.net>
Subject: Re: Regression, bisected: reference leak with IPSec since ~2.6.31
Date: Mon, 20 Sep 2010 23:31:12 +0200
Message-ID: <1285018272.2323.243.camel@edumazet-laptop>
In-Reply-To: <1285013853.2323.148.camel@edumazet-laptop>

On Monday, 20 September 2010 at 22:17 +0200, Eric Dumazet wrote:
> On Monday, 20 September 2010 at 15:52 -0400, Nick Bowler wrote:
> > On 2010-09-20 20:20 +0200, Eric Dumazet wrote:
> > > If you change your program to send small frames (so they are not
> > > fragmented), is the problem still present?
> > 
> > I changed MAX_DGRAM_SIZE in the test program to 1000 (mtu on the
> > interface is 1500).  The short answer is that the references are
> > not leaked, and things seem to get cleaned up.  So the rest of this
> > mail probably describes a separate issue.
> > 
> > The long answer, however, is interesting: With latest Linus' git, the
> > references are cleaned up much later than I would expect.  After running
> > the test program and flushing the SAD/SPD, the reference count is still
> > 1.  If I repeat the test immediately, the reference count will increase
> > further.  I can easily raise the reference count to, say, 100.  Now, if
> > I wait a while (10 minutes or so), the reference count will still be
> > 100.  However, when I run the setkey script after this delay, the
> > reference count drops immediately to 1.  If I then flush the SAD/SPD, it
> > drops to 0.
> > 
> > This behaviour is new: newer than the reported leak.  For example, with
> > 2.6.34, everything works perfectly with MAX_DGRAM_SIZE set to 1000 (the
> > SAs are destroyed immediately when the SAD/SPD are flushed), but the
> > leak occurs with MAX_DGRAM_SIZE set to 10000.
> > 
> 
> Thanks Nick
> 
> I suspect a skb->truesize bug somewhere.
> 
> I can see atomic_read(&sk->sk_wmem_alloc) becoming negative after a
> while...
> 
> I am investigating and let you know.
> 
> Thanks
> 

OK, I found a bug in ip_fragment() and ip6_fragment().

When the slow path is taken, we end up with a truesize mismatch.

Could you try the following patch?

Thanks!

[PATCH] ip : fix truesize mismatch in ip fragmentation

We should not set frag->destructor to sock_wfree() until we are sure we
don't hit the slow path in ip_fragment(). Otherwise we risk uncharging
frag->truesize twice and, in the end, having a negative socket
sk_wmem_alloc counter, or even freeing the socket sooner than expected.

Many thanks to Nick Bowler, who provided a very clean bug report and
test programs.

While Nick's bisection pointed to commit 2b85a34e911bf483 (net: No more
expensive sock_hold()/sock_put() on each tx), the underlying bug is older.

Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_output.c  |    8 ++++----
 net/ipv6/ip6_output.c |   10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04b6989..126d9b3 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -490,7 +490,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	if (skb_has_frags(skb)) {
 		struct sk_buff *frag;
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -510,11 +509,13 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 				goto slow_path;
 
 			BUG_ON(frag->sk);
-			if (skb->sk) {
+		}
+		if (skb->sk) {
+			skb_walk_frags(skb, frag) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
+				skb->truesize -= frag->truesize;
 			}
-			truesizes += frag->truesize;
 		}
 
 		/* Everything is OK. Generate! */
@@ -524,7 +525,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		frag = skb_shinfo(skb)->frag_list;
 		skb_frag_list_init(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		iph->tot_len = htons(first_len);
 		iph->frag_off = htons(IP_MF);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d40b330..10983ab 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -639,7 +639,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 	if (skb_has_frags(skb)) {
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -658,13 +657,15 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 				goto slow_path;
 
 			BUG_ON(frag->sk);
-			if (skb->sk) {
+		}
+		if (skb->sk) {
+			skb_walk_frags(skb, frag) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
-				truesizes += frag->truesize;
+				skb->truesize -= frag->truesize;
 			}
 		}
-
+				
 		err = 0;
 		offset = 0;
 		frag = skb_shinfo(skb)->frag_list;
@@ -693,7 +694,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 		first_len = skb_pagelen(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		ipv6_hdr(skb)->payload_len = htons(first_len -
 						   sizeof(struct ipv6hdr));




Thread overview: 32+ messages
2010-09-20 17:44 Regression, bisected: reference leak with IPSec since ~2.6.31 Nick Bowler
2010-09-20 18:20 ` Eric Dumazet
2010-09-20 19:52   ` Nick Bowler
2010-09-20 20:00     ` David Miller
2010-09-20 21:23       ` Nick Bowler
2010-09-20 20:17     ` Eric Dumazet
2010-09-20 21:31       ` Eric Dumazet [this message]
2010-09-21  6:16         ` [PATCH] ip : take care of last fragment in ip_append_data Eric Dumazet
2010-09-21 23:38           ` David Miller
2010-09-22  4:44             ` Eric Dumazet
2010-09-22  4:53               ` David Miller
2010-09-24 21:42           ` David Miller
2010-09-21  9:12         ` Regression, bisected: reference leak with IPSec since ~2.6.31 Jarek Poplawski
2010-09-21  9:21           ` Eric Dumazet
2010-09-21  9:38             ` Jarek Poplawski
2010-09-21  9:55               ` Eric Dumazet
2010-09-21 10:07                 ` Eric Dumazet
2010-09-21 10:48                   ` Jarek Poplawski
2010-09-21 11:58                     ` Eric Dumazet
2010-09-21 12:39                       ` Jarek Poplawski
2010-09-21 14:05         ` Nick Bowler
2010-09-21 14:16           ` [PATCH] ip : fix truesize mismatch in ip fragmentation Eric Dumazet
2010-09-21 15:58             ` [PATCH v3] ip: " Eric Dumazet
2010-09-21 16:26               ` Henrique de Moraes Holschuh
2010-09-21 16:31                 ` Eric Dumazet
2010-09-21 18:09                   ` Henrique de Moraes Holschuh
2010-09-21 19:24                     ` David Miller
2010-09-21 23:06                       ` Henrique de Moraes Holschuh
2010-09-21 17:50               ` Jarek Poplawski
2010-09-21 18:47                 ` Eric Dumazet
2010-09-21 19:21                   ` Jarek Poplawski
2010-09-21 22:15                     ` David Miller
