All of lore.kernel.org
 help / color / mirror / Atom feed
* Regression, bisected: reference leak with IPSec since ~2.6.31
@ 2010-09-20 17:44 Nick Bowler
  2010-09-20 18:20 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Bowler @ 2010-09-20 17:44 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Eric Dumazet, David S. Miller

[-- Attachment #1: Type: text/plain, Size: 6415 bytes --]

Since 2.6.31, one of our UDP test programs has resulted in an SA leak: after
running the test and flushing the SAD/SPD, the esp module is left with a
non-zero reference count.  This reference is never released: closer
inspection reveals that esp_destroy is never called on the SA.

I've attached a simplified version of the test program which reproduces the
issue.  To reproduce:

  (1) Create a Tx SA -- I used the following setkey script for my tests:

       add 192.168.42.2 192.168.42.1 esp 0x327B23C6  -f seq-pad
        -E rijndael-cbc 0x3D1B58BA507ED7AB2EB141F241B71EFB
        -A null;
       
       spdadd 192.168.42.2 192.168.42.1 any -P out ipsec
        esp/transport//require;

  (2) lsmod | grep esp4
      (note that the reference count is 1)

  (3) run the test program, e.g. udp_burst 192.168.42.1 for the above SA.

  (4) setkey -F; setkey -P -F

  (5) wait as long as you want.

  (6) lsmod | grep esp4
      (note that the reference count is still 1, not 0 as it should be)

You can repeat the test until the zombie SAs consume all available
memory.  Cursory tests show similar problems with AH, IPv6 and Rx SAs,
but I only really tested ESP Tx.

Bisection implicates the following:

  2b85a34e911bf483c27cfdd124aeb1605145dc80 is the first bad commit
  commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
  Author: Eric Dumazet <eric.dumazet@gmail.com>
  Date:   Thu Jun 11 02:55:43 2009 -0700
  
      net: No more expensive sock_hold()/sock_put() on each tx
      
      One of the problem with sock memory accounting is it uses
      a pair of sock_hold()/sock_put() for each transmitted packet.
      
      This slows down bidirectional flows because the receive path
      also needs to take a refcount on socket and might use a different
      cpu than transmit path or transmit completion path. So these
      two atomic operations also trigger cache line bounces.
      
      We can see this in tx or tx/rx workloads (media gateways for example),
      where sock_wfree() can be in top five functions in profiles.
      
      We use this sock_hold()/sock_put() so that sock freeing
      is delayed until all tx packets are completed.
      
      As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
      by one unit at init time, until sk_free() is called.
      Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
      to decrement initial offset and atomicaly check if any packets
      are in flight.
      
      skb_set_owner_w() doesnt call sock_hold() anymore
      
      sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
      reached 0 to perform the final freeing.
      
      Drawback is that a skb->truesize error could lead to unfreeable sockets, or
      even worse, prematurely calling __sk_free() on a live socket.
      
      Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
      on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
      contention point. 5 % speedup on a UDP transmit workload (depends
      on number of flows), lowering TX completion cpu usage.
      
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  
  :040000 040000 13fc48f3903764863486c4e557a50281e8d790e6 4801aa898e906ad61729a84e33f1afb114abdf47 M	include
  :040000 040000 042d86ad3f4d34cb96f137acb356c8251c2f8efc bd0698ec5bf8507c8ed1c5cf1e7791b4a5ed5596 M	net

  git bisect start
  # good: [4a6908a3a050aacc9c3a2f36b276b46c0629ad91] Linux 2.6.28
  git bisect good 4a6908a3a050aacc9c3a2f36b276b46c0629ad91
  # bad: [22763c5cf3690a681551162c15d34d935308c8d7] Linux 2.6.32
  git bisect bad 22763c5cf3690a681551162c15d34d935308c8d7
  # good: [37ecfd807b82bf547429fe1376e1fe7000ba7cff] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc
  git bisect good 37ecfd807b82bf547429fe1376e1fe7000ba7cff
  # bad: [2187550525d7bcb8c87689e4eca41b1955bf9ac3] xfs: rationalize xfs_inobt_lookup*
  git bisect bad 2187550525d7bcb8c87689e4eca41b1955bf9ac3
  # bad: [9cbc1cb8cd46ce1f7645b9de249b2ce8460129bb] Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
  git bisect bad 9cbc1cb8cd46ce1f7645b9de249b2ce8460129bb
  # good: [8a1ca8cedd108c8e76a6ab34079d0bbb4f244799] Merge branch 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
  git bisect good 8a1ca8cedd108c8e76a6ab34079d0bbb4f244799
  # good: [2cf4d4514d5b43c1f3b64bd0ec8b9853bde8f1dc] Merge branch 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm
  git bisect good 2cf4d4514d5b43c1f3b64bd0ec8b9853bde8f1dc
  # good: [267a90127472be70b02ab13cbd355b5013e2aa51] ath9k: Optimize TBTT/DTIM calculation for timers
  git bisect good 267a90127472be70b02ab13cbd355b5013e2aa51
  # good: [5b1a002ade68173f21b2126a778278df72202ba6] datagram: Use frag list abstraction interfaces.
  git bisect good 5b1a002ade68173f21b2126a778278df72202ba6
  # bad: [5b2c4b972c0226406361f83b747eb5cdab51e68e] net: fix network drivers ndo_start_xmit() return values (part 8)
  git bisect bad 5b2c4b972c0226406361f83b747eb5cdab51e68e
  # bad: [2b85a34e911bf483c27cfdd124aeb1605145dc80] net: No more expensive sock_hold()/sock_put() on each tx
  git bisect bad 2b85a34e911bf483c27cfdd124aeb1605145dc80
  # good: [558f6d3229ddb9f11ca4ffee0439046c283882ff] cfg80211: fix for duplicate response for driver reg request
  git bisect good 558f6d3229ddb9f11ca4ffee0439046c283882ff
  # good: [84503ddd65e804ccdeedee3f307b40d80ff793e6] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6
  git bisect good 84503ddd65e804ccdeedee3f307b40d80ff793e6
  # good: [87433bfc75f34599c38137e172b6bf8fd41971ba] r8169: use dev_kfree_skb() instead of dev_kfree_skb_irq()
  git bisect good 87433bfc75f34599c38137e172b6bf8fd41971ba
  # good: [6811086899f2740c08d0ade26f8b9d705708e0cc] be2net: fix netdev stats rx_errors and rx_dropped
  git bisect good 6811086899f2740c08d0ade26f8b9d705708e0cc
  # good: [a7a0ef31def6b6badd94fc96c8f17c2e18d91513] be2net: Fix early reset of rx-completion
  git bisect good a7a0ef31def6b6badd94fc96c8f17c2e18d91513
  # good: [f2333a014c1e13ac8e1b73a6fd77731c524eff78] netxen: No need to check vfree() pointer.
  git bisect good f2333a014c1e13ac8e1b73a6fd77731c524eff78

-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

[-- Attachment #2: udp_burst.c --]
[-- Type: text/x-c, Size: 1182 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#define MAX_DGRAM_SIZE 10000

static char buf[MAX_DGRAM_SIZE];

int main(int argc, char **argv)
{
	char *addr = NULL, *port = "9000";
	struct addrinfo *info, hints = {
		.ai_family   = AF_UNSPEC,
		.ai_socktype = SOCK_DGRAM,
		.ai_flags    = AI_PASSIVE,
	};
	int i, rc, sock;

	if (argc > 1)
		addr = argv[1];
	if (argc > 2)
		addr = argv[2];
	if (!addr) {
		fprintf(stderr, "usage: %s addr [port]\n", argv[0]);
		return EXIT_FAILURE;
	}

	rc = getaddrinfo(addr, port, &hints, &info);
	if (rc != 0) {
		fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
		return EXIT_FAILURE;
	}

	sock = socket(info->ai_family, info->ai_socktype, info->ai_protocol);
	if (sock == -1) {
		perror("socket");
		return EXIT_FAILURE;
	}

	if (connect(sock, info->ai_addr, info->ai_addrlen) == -1) {
		perror("connect");
		return EXIT_FAILURE;
	}

	for (i = 0; i < MAX_DGRAM_SIZE; i++) {
		if (send(sock, buf, i+1, MSG_DONTWAIT) == -1) {
			if (errno != EAGAIN && errno != EINTR) {
				perror("send");
				return EXIT_FAILURE;
			}
		}
	}

	return 0;
}

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-20 17:44 Regression, bisected: reference leak with IPSec since ~2.6.31 Nick Bowler
@ 2010-09-20 18:20 ` Eric Dumazet
  2010-09-20 19:52   ` Nick Bowler
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-20 18:20 UTC (permalink / raw)
  To: Nick Bowler; +Cc: linux-kernel, netdev, David S. Miller

Le lundi 20 septembre 2010 à 13:44 -0400, Nick Bowler a écrit :
> Since 2.6.31, one of our UDP test programs has resulted in an SA leak: after
> running the test and flushing the SAD/SPD, the esp module is left with a
> non-zero reference count.  This reference is never released: closer
> inspection reveals that esp_destroy is never called on the SA.
> 
> I've attached a simplified version of the test program which reproduces the
> issue.  To reproduce:
> 
>   (1) Create a Tx SA -- I used the following setkey script for my tests:
> 
>        add 192.168.42.2 192.168.42.1 esp 0x327B23C6  -f seq-pad
>         -E rijndael-cbc 0x3D1B58BA507ED7AB2EB141F241B71EFB
>         -A null;
>        
>        spdadd 192.168.42.2 192.168.42.1 any -P out ipsec
>         esp/transport//require;
> 
>   (2) lsmod | grep esp4
>       (note that the reference count is 1)
> 
>   (3) run the test program, e.g. udp_burst 192.168.42.1 for the above SA.
> 
>   (4) setkey -F; setkey -P -F
> 
>   (5) wait as long as you want.
> 
>   (6) lsmod | grep esp4
>       (note that the reference count is still 1, not 0 as it should be)
> 
> You can repeat the test until the zombie SAs consume all available
> memory.  Cursory tests show similar problems with AH, IPv6 and Rx SAs,
> but I only really tested ESP Tx.
> 
> Bisection implicates the following:
> 
>   2b85a34e911bf483c27cfdd124aeb1605145dc80 is the first bad commit
>   commit 2b85a34e911bf483c27cfdd124aeb1605145dc80
>   Author: Eric Dumazet <eric.dumazet@gmail.com>
>   Date:   Thu Jun 11 02:55:43 2009 -0700
>   
>       net: No more expensive sock_hold()/sock_put() on each tx
>       
>       One of the problem with sock memory accounting is it uses
>       a pair of sock_hold()/sock_put() for each transmitted packet.
>       
>       This slows down bidirectional flows because the receive path
>       also needs to take a refcount on socket and might use a different
>       cpu than transmit path or transmit completion path. So these
>       two atomic operations also trigger cache line bounces.
>       
>       We can see this in tx or tx/rx workloads (media gateways for example),
>       where sock_wfree() can be in top five functions in profiles.
>       
>       We use this sock_hold()/sock_put() so that sock freeing
>       is delayed until all tx packets are completed.
>       
>       As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
>       by one unit at init time, until sk_free() is called.
>       Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
>       to decrement initial offset and atomicaly check if any packets
>       are in flight.
>       
>       skb_set_owner_w() doesnt call sock_hold() anymore
>       
>       sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
>       reached 0 to perform the final freeing.
>       
>       Drawback is that a skb->truesize error could lead to unfreeable sockets, or
>       even worse, prematurely calling __sk_free() on a live socket.
>       
>       Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
>       on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
>       contention point. 5 % speedup on a UDP transmit workload (depends
>       on number of flows), lowering TX completion cpu usage.
>       
>       Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>       Signed-off-by: David S. Miller <davem@davemloft.net>
>   
>   :040000 040000 13fc48f3903764863486c4e557a50281e8d790e6 4801aa898e906ad61729a84e33f1afb114abdf47 M	include
>   :040000 040000 042d86ad3f4d34cb96f137acb356c8251c2f8efc bd0698ec5bf8507c8ed1c5cf1e7791b4a5ed5596 M	net
> 
>   git bisect start
>   # good: [4a6908a3a050aacc9c3a2f36b276b46c0629ad91] Linux 2.6.28
>   git bisect good 4a6908a3a050aacc9c3a2f36b276b46c0629ad91
>   # bad: [22763c5cf3690a681551162c15d34d935308c8d7] Linux 2.6.32
>   git bisect bad 22763c5cf3690a681551162c15d34d935308c8d7
>   # good: [37ecfd807b82bf547429fe1376e1fe7000ba7cff] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/drzeus/mmc
>   git bisect good 37ecfd807b82bf547429fe1376e1fe7000ba7cff
>   # bad: [2187550525d7bcb8c87689e4eca41b1955bf9ac3] xfs: rationalize xfs_inobt_lookup*
>   git bisect bad 2187550525d7bcb8c87689e4eca41b1955bf9ac3
>   # bad: [9cbc1cb8cd46ce1f7645b9de249b2ce8460129bb] Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6
>   git bisect bad 9cbc1cb8cd46ce1f7645b9de249b2ce8460129bb
>   # good: [8a1ca8cedd108c8e76a6ab34079d0bbb4f244799] Merge branch 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
>   git bisect good 8a1ca8cedd108c8e76a6ab34079d0bbb4f244799
>   # good: [2cf4d4514d5b43c1f3b64bd0ec8b9853bde8f1dc] Merge branch 'for-linus' of master.kernel.org:/home/rmk/linux-2.6-arm
>   git bisect good 2cf4d4514d5b43c1f3b64bd0ec8b9853bde8f1dc
>   # good: [267a90127472be70b02ab13cbd355b5013e2aa51] ath9k: Optimize TBTT/DTIM calculation for timers
>   git bisect good 267a90127472be70b02ab13cbd355b5013e2aa51
>   # good: [5b1a002ade68173f21b2126a778278df72202ba6] datagram: Use frag list abstraction interfaces.
>   git bisect good 5b1a002ade68173f21b2126a778278df72202ba6
>   # bad: [5b2c4b972c0226406361f83b747eb5cdab51e68e] net: fix network drivers ndo_start_xmit() return values (part 8)
>   git bisect bad 5b2c4b972c0226406361f83b747eb5cdab51e68e
>   # bad: [2b85a34e911bf483c27cfdd124aeb1605145dc80] net: No more expensive sock_hold()/sock_put() on each tx
>   git bisect bad 2b85a34e911bf483c27cfdd124aeb1605145dc80
>   # good: [558f6d3229ddb9f11ca4ffee0439046c283882ff] cfg80211: fix for duplicate response for driver reg request
>   git bisect good 558f6d3229ddb9f11ca4ffee0439046c283882ff
>   # good: [84503ddd65e804ccdeedee3f307b40d80ff793e6] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6
>   git bisect good 84503ddd65e804ccdeedee3f307b40d80ff793e6
>   # good: [87433bfc75f34599c38137e172b6bf8fd41971ba] r8169: use dev_kfree_skb() instead of dev_kfree_skb_irq()
>   git bisect good 87433bfc75f34599c38137e172b6bf8fd41971ba
>   # good: [6811086899f2740c08d0ade26f8b9d705708e0cc] be2net: fix netdev stats rx_errors and rx_dropped
>   git bisect good 6811086899f2740c08d0ade26f8b9d705708e0cc
>   # good: [a7a0ef31def6b6badd94fc96c8f17c2e18d91513] be2net: Fix early reset of rx-completion
>   git bisect good a7a0ef31def6b6badd94fc96c8f17c2e18d91513
>   # good: [f2333a014c1e13ac8e1b73a6fd77731c524eff78] netxen: No need to check vfree() pointer.
>   git bisect good f2333a014c1e13ac8e1b73a6fd77731c524eff78
> 


Hi Nick

If you change your program to send small frames (so they are not
fragmented), is the problem still present ?

The commit you mention was buggy and a fix was added later in
sock_wfree()

(commit d99927f4d93f36553699573b279e0ff98ad7dea6
net: Fix sock_wfree() race)




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-20 18:20 ` Eric Dumazet
@ 2010-09-20 19:52   ` Nick Bowler
  2010-09-20 20:00     ` David Miller
  2010-09-20 20:17     ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Nick Bowler @ 2010-09-20 19:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev, David S. Miller

On 2010-09-20 20:20 +0200, Eric Dumazet wrote:
> If you change your program to send small frames (so they are not
> fragmented), is the problem still present ?

I changed MAX_DGRAM_SIZE in the test program to 1000 (mtu on the
interface is 1500).  The short answer is that the references are
not leaked, and things seem to get cleaned up.  So the rest of this
mail probably describes a separate issue.

The long answer, however, is interesting: With latest Linus' git, the
references are cleaned up much later than I would expect.  After running
the test program and flushing the SAD/SPD, the reference count is still
1.  If I repeat the test immediately, the reference count will increase
further.  I can easily raise the reference count to, say, 100.  Now, if
I wait a while (10 minutes or so), the reference count will still be
100.  However, when I run the setkey script after this delay, the
reference count drops immediately to 1.  If I then flush the SAD/SPD, it
drops to 0.

This behaviour is new: newer than the reported leak.  For example, with
2.6.34, everything works perfectly with MAX_DGRAM_SIZE set to 1000 (the
SAs are destroyed immediately when the SAD/SPD are flushed), but the
leak occurs with MAX_DGRAM_SIZE set to 10000.

-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-20 19:52   ` Nick Bowler
@ 2010-09-20 20:00     ` David Miller
  2010-09-20 21:23       ` Nick Bowler
  2010-09-20 20:17     ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: David Miller @ 2010-09-20 20:00 UTC (permalink / raw)
  To: nbowler; +Cc: eric.dumazet, linux-kernel, netdev

From: Nick Bowler <nbowler@elliptictech.com>
Date: Mon, 20 Sep 2010 15:52:56 -0400

> On 2010-09-20 20:20 +0200, Eric Dumazet wrote:
>> If you change your program to send small frames (so they are not
>> fragmented), is the problem still present ?
> 
> I changed MAX_DGRAM_SIZE in the test program to 1000 (mtu on the
> interface is 1500).  The short answer is that the references are
> not leaked, and things seem to get cleaned up.  So the rest of this
> mail probably describes a separate issue.
> 
> The long answer, however, is interesting: With latest Linus' git, the
> references are cleaned up much later than I would expect.  After running
> the test program and flushing the SAD/SPD, the reference count is still
> 1.  If I repeat the test immediately, the reference count will increase
> further.  I can easily raise the reference count to, say, 100.  Now, if
> I wait a while (10 minutes or so), the reference count will still be
> 100.  However, when I run the setkey script after this delay, the
> reference count drops immediately to 1.  If I then flush the SAD/SPD, it
> drops to 0.

This is because we actually cache IPSEC routes correctly, previously
we'd create a new routing cache entry every time a lookup happened.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-20 19:52   ` Nick Bowler
  2010-09-20 20:00     ` David Miller
@ 2010-09-20 20:17     ` Eric Dumazet
  2010-09-20 21:31       ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-20 20:17 UTC (permalink / raw)
  To: Nick Bowler; +Cc: linux-kernel, netdev, David S. Miller

Le lundi 20 septembre 2010 à 15:52 -0400, Nick Bowler a écrit :
> On 2010-09-20 20:20 +0200, Eric Dumazet wrote:
> > If you change your program to send small frames (so they are not
> > fragmented), is the problem still present ?
> 
> I changed MAX_DGRAM_SIZE in the test program to 1000 (mtu on the
> interface is 1500).  The short answer is that the references are
> not leaked, and things seem to get cleaned up.  So the rest of this
> mail probably describes a separate issue.
> 
> The long answer, however, is interesting: With latest Linus' git, the
> references are cleaned up much later than I would expect.  After running
> the test program and flushing the SAD/SPD, the reference count is still
> 1.  If I repeat the test immediately, the reference count will increase
> further.  I can easily raise the reference count to, say, 100.  Now, if
> I wait a while (10 minutes or so), the reference count will still be
> 100.  However, when I run the setkey script after this delay, the
> reference count drops immediately to 1.  If I then flush the SAD/SPD, it
> drops to 0.
> 
> This behaviour is new: newer than the reported leak.  For example, with
> 2.6.34, everything works perfectly with MAX_DGRAM_SIZE set to 1000 (the
> SAs are destroyed immediately when the SAD/SPD are flushed), but the
> leak occurs with MAX_DGRAM_SIZE set to 10000.
> 

Thanks Nick

I suspect a skb->truesize bug somewhere.

I can see atomic_read(&sk->sk_wmem_alloc) becoming negative after a
while...

I am investigating and let you know.

Thanks



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-20 20:00     ` David Miller
@ 2010-09-20 21:23       ` Nick Bowler
  0 siblings, 0 replies; 32+ messages in thread
From: Nick Bowler @ 2010-09-20 21:23 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, linux-kernel, netdev

On 2010-09-20 13:00 -0700, David Miller wrote:
> From: Nick Bowler <nbowler@elliptictech.com>
> > The long answer, however, is interesting: With latest Linus' git, the
> > references are cleaned up much later than I would expect.
[...]
> This is because we actually cache IPSEC routes correctly, previously
> we'd create a new routing cache entry every time a lookup happened.

But this means that the SAs, including their cryptographic keys, are
kept in memory indefinitely after the SAD/SPD entries are destroyed.
Why aren't the cache entries invalidated when this occurs?

This also makes it extremely difficult to unload the xfrm modules,
something we often need to do during testing, as references to them
are held indefinitely.

-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-20 20:17     ` Eric Dumazet
@ 2010-09-20 21:31       ` Eric Dumazet
  2010-09-21  6:16         ` [PATCH] ip : take care of last fragment in ip_append_data Eric Dumazet
                           ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Eric Dumazet @ 2010-09-20 21:31 UTC (permalink / raw)
  To: Nick Bowler; +Cc: linux-kernel, netdev, David S. Miller

Le lundi 20 septembre 2010 à 22:17 +0200, Eric Dumazet a écrit :
> Le lundi 20 septembre 2010 à 15:52 -0400, Nick Bowler a écrit :
> > On 2010-09-20 20:20 +0200, Eric Dumazet wrote:
> > > If you change your program to send small frames (so they are not
> > > fragmented), is the problem still present ?
> > 
> > I changed MAX_DGRAM_SIZE in the test program to 1000 (mtu on the
> > interface is 1500).  The short answer is that the references are
> > not leaked, and things seem to get cleaned up.  So the rest of this
> > mail probably describes a separate issue.
> > 
> > The long answer, however, is interesting: With latest Linus' git, the
> > references are cleaned up much later than I would expect.  After running
> > the test program and flushing the SAD/SPD, the reference count is still
> > 1.  If I repeat the test immediately, the reference count will increase
> > further.  I can easily raise the reference count to, say, 100.  Now, if
> > I wait a while (10 minutes or so), the reference count will still be
> > 100.  However, when I run the setkey script after this delay, the
> > reference count drops immediately to 1.  If I then flush the SAD/SPD, it
> > drops to 0.
> > 
> > This behaviour is new: newer than the reported leak.  For example, with
> > 2.6.34, everything works perfectly with MAX_DGRAM_SIZE set to 1000 (the
> > SAs are destroyed immediately when the SAD/SPD are flushed), but the
> > leak occurs with MAX_DGRAM_SIZE set to 10000.
> > 
> 
> Thanks Nick
> 
> I suspect a skb->truesize bug somewhere.
> 
> I can see atomic_read(&sk->sk_wmem_alloc) becoming negative after a
> while...
> 
> I am investigating and let you know.
> 
> Thanks
> 

OK, I found a bug in ip_fragment() and ip6_fragment()

In case slow_path is hit, we have a truesize mismatch

Could you try following patch ?

Thanks !

[PATCH] ip : fix truesize mismatch in ip fragmentation

We should not set frag->destructor to sock_wkfree() until we are sure we
dont hit slow path in ip_fragment(). Or we risk uncharging
frag->truesize twice, and in the end, having negative socket
sk_wmem_alloc counter, or even freeing socket sooner than expected.

Many thanks to Nick Bowler, who provided a very clean bug report and
test programs.

While Nick bisection pointed to commit 2b85a34e911bf483 (net: No more
expensive sock_hold()/sock_put() on each tx), underlying bug is older.

Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_output.c  |    8 ++++----
 net/ipv6/ip6_output.c |   10 +++++-----
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04b6989..126d9b3 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -490,7 +490,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	if (skb_has_frags(skb)) {
 		struct sk_buff *frag;
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -510,11 +509,13 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 				goto slow_path;
 
 			BUG_ON(frag->sk);
-			if (skb->sk) {
+		}
+		if (skb->sk) {
+			skb_walk_frags(skb, frag) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
+				skb->truesize -= frag->truesize;
 			}
-			truesizes += frag->truesize;
 		}
 
 		/* Everything is OK. Generate! */
@@ -524,7 +525,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		frag = skb_shinfo(skb)->frag_list;
 		skb_frag_list_init(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		iph->tot_len = htons(first_len);
 		iph->frag_off = htons(IP_MF);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d40b330..10983ab 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -639,7 +639,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 	if (skb_has_frags(skb)) {
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -658,13 +657,15 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 				goto slow_path;
 
 			BUG_ON(frag->sk);
-			if (skb->sk) {
+		}
+		if (skb->sk) {
+			skb_walk_frags(skb, frag) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
-				truesizes += frag->truesize;
+				skb->truesize -= frag->truesize;
 			}
 		}
-
+				
 		err = 0;
 		offset = 0;
 		frag = skb_shinfo(skb)->frag_list;
@@ -693,7 +694,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 		first_len = skb_pagelen(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		ipv6_hdr(skb)->payload_len = htons(first_len -
 						   sizeof(struct ipv6hdr));



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH] ip : take care of last fragment in ip_append_data
  2010-09-20 21:31       ` Eric Dumazet
@ 2010-09-21  6:16         ` Eric Dumazet
  2010-09-21 23:38           ` David Miller
  2010-09-24 21:42           ` David Miller
  2010-09-21  9:12         ` Regression, bisected: reference leak with IPSec since ~2.6.31 Jarek Poplawski
  2010-09-21 14:05         ` Nick Bowler
  2 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2010-09-21  6:16 UTC (permalink / raw)
  To: Nick Bowler; +Cc: linux-kernel, netdev, David S. Miller

Le lundi 20 septembre 2010 à 23:31 +0200, Eric Dumazet a écrit :

> OK, I found a bug in ip_fragment() and ip6_fragment()
> 
> In case slow_path is hit, we have a truesize mismatch
> 
> Could you try following patch ?
> 
> Thanks !
> 
> [PATCH] ip : fix truesize mismatch in ip fragmentation
> 
> We should not set frag->destructor to sock_wkfree() until we are sure we
> dont hit slow path in ip_fragment(). Or we risk uncharging
> frag->truesize twice, and in the end, having negative socket
> sk_wmem_alloc counter, or even freeing socket sooner than expected.
> 
> Many thanks to Nick Bowler, who provided a very clean bug report and
> test programs.
> 
> While Nick bisection pointed to commit 2b85a34e911bf483 (net: No more
> expensive sock_hold()/sock_put() on each tx), underlying bug is older.
> 
> Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/ipv4/ip_output.c  |    8 ++++----
>  net/ipv6/ip6_output.c |   10 +++++-----
>  2 files changed, 9 insertions(+), 9 deletions(-)

This previous patch works for me. I also cooked following fix, to not
enter slow path on some lengthes (MTU + N*(MTU-20))

[PATCH] ip : take care of last fragment in ip_append_data

While investigating a bit, I found ip_fragment() slow path was taken
because ip_append_data() provides following layout for a send(MTU +
N*(MTU - 20)) syscall :

- one skb with 1500 (mtu) bytes
- N fragments of 1480 (mtu-20) bytes (before adding IP header)
last fragment gets 17 bytes of trail data because of following bit:

	if (datalen == length + fraggap)
		alloclen += rt->dst.trailer_len;

Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
another bug ?)

In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
so we take slow path, building another skb chain.

In order to avoid taking slow path, we should correct ip_append_data()
to make sure last fragment has real trail space, under mtu...

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_output.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04b6989..dfd9a0d 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -927,16 +927,19 @@ alloc_new_skb:
 			    !(rt->dst.dev->features&NETIF_F_SG))
 				alloclen = mtu;
 			else
-				alloclen = datalen + fragheaderlen;
+				alloclen = fraglen;
 
 			/* The last fragment gets additional space at tail.
 			 * Note, with MSG_MORE we overallocate on fragments,
 			 * because we have no idea what fragment will be
 			 * the last.
 			 */
-			if (datalen == length + fraggap)
+			if (datalen == length + fraggap) {
 				alloclen += rt->dst.trailer_len;
-
+				/* make sure mtu is not reached */
+				if (datalen > mtu - fragheaderlen - rt->dst.trailer_len)
+					datalen -= ALIGN(rt->dst.trailer_len, 8);
+			}
 			if (transhdrlen) {
 				skb = sock_alloc_send_skb(sk,
 						alloclen + hh_len + 15,



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-20 21:31       ` Eric Dumazet
  2010-09-21  6:16         ` [PATCH] ip : take care of last fragment in ip_append_data Eric Dumazet
@ 2010-09-21  9:12         ` Jarek Poplawski
  2010-09-21  9:21           ` Eric Dumazet
  2010-09-21 14:05         ` Nick Bowler
  2 siblings, 1 reply; 32+ messages in thread
From: Jarek Poplawski @ 2010-09-21  9:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Nick Bowler, linux-kernel, netdev, David S. Miller

On 2010-09-20 23:31, Eric Dumazet wrote:
...
> [PATCH] ip : fix truesize mismatch in ip fragmentation
> 
> We should not set frag->destructor to sock_wkfree() until we are sure we
> dont hit slow path in ip_fragment(). Or we risk uncharging
> frag->truesize twice, and in the end, having negative socket
> sk_wmem_alloc counter, or even freeing socket sooner than expected.
> 
> Many thanks to Nick Bowler, who provided a very clean bug report and
> test programs.
> 
> While Nick bisection pointed to commit 2b85a34e911bf483 (net: No more
> expensive sock_hold()/sock_put() on each tx), underlying bug is older.
> 
> Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  net/ipv4/ip_output.c  |    8 ++++----
>  net/ipv6/ip6_output.c |   10 +++++-----
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 04b6989..126d9b3 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -490,7 +490,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  	if (skb_has_frags(skb)) {
>  		struct sk_buff *frag;
>  		int first_len = skb_pagelen(skb);
> -		int truesizes = 0;
>  
>  		if (first_len - hlen > mtu ||
>  		    ((first_len - hlen) & 7) ||
> @@ -510,11 +509,13 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  				goto slow_path;
>  
>  			BUG_ON(frag->sk);
> -			if (skb->sk) {
> +		}
> +		if (skb->sk) {
> +			skb_walk_frags(skb, frag) {
>  				frag->sk = skb->sk;
>  				frag->destructor = sock_wfree;

Nice catch, but it seems doing it in the first loop as now, and
reverting changes before goto slow_path might be more optimal here.

Jarek P.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-21  9:12         ` Regression, bisected: reference leak with IPSec since ~2.6.31 Jarek Poplawski
@ 2010-09-21  9:21           ` Eric Dumazet
  2010-09-21  9:38             ` Jarek Poplawski
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-21  9:21 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Nick Bowler, linux-kernel, netdev, David S. Miller

Le mardi 21 septembre 2010 à 09:12 +0000, Jarek Poplawski a écrit :
> On 2010-09-20 23:31, Eric Dumazet wrote:
> ...
> > [PATCH] ip : fix truesize mismatch in ip fragmentation
> > 
> > We should not set frag->destructor to sock_wkfree() until we are sure we
> > dont hit slow path in ip_fragment(). Or we risk uncharging
> > frag->truesize twice, and in the end, having negative socket
> > sk_wmem_alloc counter, or even freeing socket sooner than expected.
> > 
> > Many thanks to Nick Bowler, who provided a very clean bug report and
> > test programs.
> > 
> > While Nick bisection pointed to commit 2b85a34e911bf483 (net: No more
> > expensive sock_hold()/sock_put() on each tx), underlying bug is older.
> > 
> > Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > ---
> >  net/ipv4/ip_output.c  |    8 ++++----
> >  net/ipv6/ip6_output.c |   10 +++++-----
> >  2 files changed, 9 insertions(+), 9 deletions(-)
> > 
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 04b6989..126d9b3 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -490,7 +490,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
> >  	if (skb_has_frags(skb)) {
> >  		struct sk_buff *frag;
> >  		int first_len = skb_pagelen(skb);
> > -		int truesizes = 0;
> >  
> >  		if (first_len - hlen > mtu ||
> >  		    ((first_len - hlen) & 7) ||
> > @@ -510,11 +509,13 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
> >  				goto slow_path;
> >  
> >  			BUG_ON(frag->sk);
> > -			if (skb->sk) {
> > +		}
> > +		if (skb->sk) {
> > +			skb_walk_frags(skb, frag) {
> >  				frag->sk = skb->sk;
> >  				frag->destructor = sock_wfree;
> 
> Nice catch, but it seems doing it in the first loop as now, and
> reverting changes before goto slow_path might be more optimal here.
> 

I thought of this, but found this function already very complex.

Once everything is in cpu caches, the added loop is very cheap.

I liked the :

<check everything without changing state>
if something wrong
	goto slow_path
else
	<OK, lets do destructive things>



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-21  9:21           ` Eric Dumazet
@ 2010-09-21  9:38             ` Jarek Poplawski
  2010-09-21  9:55               ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Jarek Poplawski @ 2010-09-21  9:38 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Nick Bowler, linux-kernel, netdev, David S. Miller

On Tue, Sep 21, 2010 at 11:21:00AM +0200, Eric Dumazet wrote:
> Le mardi 21 septembre 2010 ?? 09:12 +0000, Jarek Poplawski a écrit :
> > On 2010-09-20 23:31, Eric Dumazet wrote:
> > ...
> > > @@ -510,11 +509,13 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
> > >  				goto slow_path;
> > >  
> > >  			BUG_ON(frag->sk);
> > > -			if (skb->sk) {
> > > +		}
> > > +		if (skb->sk) {
> > > +			skb_walk_frags(skb, frag) {
> > >  				frag->sk = skb->sk;
> > >  				frag->destructor = sock_wfree;
> > 
> > Nice catch, but it seems doing it in the first loop as now, and
> > reverting changes before goto slow_path might be more optimal here.
> > 
> 
> I thought of this, but found this function already very complex.
> 
> Once everything is in cpu caches, the added loop is very cheap.

I hope you're right with this.

> 
> I liked the :
> 
> <check everything without changing state>
> if something wrong
> 	goto slow_path
> else
> 	<OK, lets do destructive things>
> 

But it's an optimization of the "unlikely" case btw. ;-)

Jarek P.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-21  9:38             ` Jarek Poplawski
@ 2010-09-21  9:55               ` Eric Dumazet
  2010-09-21 10:07                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-21  9:55 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Nick Bowler, linux-kernel, netdev, David S. Miller

Le mardi 21 septembre 2010 à 09:38 +0000, Jarek Poplawski a écrit :
> On Tue, Sep 21, 2010 at 11:21:00AM +0200, Eric Dumazet wrote:

> I hope you're right with this.
> 
> > 
> > I liked the :
> > 
> > <check everything without changing state>
> > if something wrong
> > 	goto slow_path
> > else
> > 	<OK, lets do destructive things>
> > 
> 
> But it's an optimization of the "unlikely" case btw. ;-)
> 

This is a bug fix, in a complex function.

in -next branch, we can try to be smart, adding more code in slow_path
to revert what was done in the hope fast path would be taken.
And pray we dont add another bug :)



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-21  9:55               ` Eric Dumazet
@ 2010-09-21 10:07                 ` Eric Dumazet
  2010-09-21 10:48                   ` Jarek Poplawski
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-21 10:07 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Nick Bowler, linux-kernel, netdev, David S. Miller

Le mardi 21 septembre 2010 à 11:55 +0200, Eric Dumazet a écrit :
> Le mardi 21 septembre 2010 à 09:38 +0000, Jarek Poplawski a écrit :
> > On Tue, Sep 21, 2010 at 11:21:00AM +0200, Eric Dumazet wrote:
> 
> > I hope you're right with this.
> > 
> > > 
> > > I liked the :
> > > 
> > > <check everything without changing state>
> > > if something wrong
> > > 	goto slow_path
> > > else
> > > 	<OK, lets do destructive things>
> > > 
> > 
> > But it's an optimization of the "unlikely" case btw. ;-)
> > 
> 
> This is a bug fix, in a complex function.
> 
> in -next branch, we can try to be smart, adding more code in slow_path
> to revert what was done in the hope fast path would be taken.
> And pray we dont add another bug :)
> 

My initial patch was somehing like :

<loop to check everything is ok>
  compute truesizes = sum (frag->truesize)
  not modifying frag->sk/frag->destructor

if problem -> goto slow_path
else {
	skb->truesize -= truesizes;
       send(skb)

	for each frag
		set frag->sk = original_sk
                set frag->destructor = sock_wfree
		send(frag)

Problem was if we got an error while sending skb or one frag :
It was leaking some truesize accounting.
(an extra loop was needed to set frag->sk/frag->destructor for remaining
frags before freeing them)

So I added the extra loop to :

for each frag {
	set frag->sk = skb->sk
	set frag->destructor = sock_wfree
	skb->truesize -= frag->truesize
}

It removed the need to compute sum(frag->truesize) in a temp variable
(truesizes)



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-21 10:07                 ` Eric Dumazet
@ 2010-09-21 10:48                   ` Jarek Poplawski
  2010-09-21 11:58                     ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Jarek Poplawski @ 2010-09-21 10:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Nick Bowler, linux-kernel, netdev, David S. Miller

On Tue, Sep 21, 2010 at 12:07:37PM +0200, Eric Dumazet wrote:
> Le mardi 21 septembre 2010 ?? 11:55 +0200, Eric Dumazet a écrit :
> > Le mardi 21 septembre 2010 ?? 09:38 +0000, Jarek Poplawski a écrit :
> > > On Tue, Sep 21, 2010 at 11:21:00AM +0200, Eric Dumazet wrote:
> > 
> > > I hope you're right with this.
> > > 
> > > > 
> > > > I liked the :
> > > > 
> > > > <check everything without changing state>
> > > > if something wrong
> > > > 	goto slow_path
> > > > else
> > > > 	<OK, lets do destructive things>
> > > > 
> > > 
> > > But it's an optimization of the "unlikely" case btw. ;-)
> > > 
> > 
> > This is a bug fix, in a complex function.
> > 
> > in -next branch, we can try to be smart, adding more code in slow_path
> > to revert what was done in the hope fast path would be taken.
> > And pray we dont add another bug :)

But if you were honest about caches there is no reason to be smart
no more. ;-)

Below is your patch mostly as an example only that I didn't mean
nothing complex (ipv4 only, not compiled).

Jarek P.
---

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04b6989..049d61f 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -490,7 +490,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	if (skb_has_frags(skb)) {
 		struct sk_buff *frag;
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -503,18 +502,18 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			if (frag->len > mtu ||
 			    ((frag->len & 7) && frag->next) ||
 			    skb_headroom(frag) < hlen)
-			    goto slow_path;
+			    goto slow_path_clean;
 
 			/* Partially cloned skb? */
 			if (skb_shared(frag))
-				goto slow_path;
+				goto slow_path_clean;
 
 			BUG_ON(frag->sk);
 			if (skb->sk) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
+				skb->truesize -= frag->truesize;
 			}
-			truesizes += frag->truesize;
 		}
 
 		/* Everything is OK. Generate! */
@@ -524,7 +523,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		frag = skb_shinfo(skb)->frag_list;
 		skb_frag_list_init(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		iph->tot_len = htons(first_len);
 		iph->frag_off = htons(IP_MF);
@@ -576,6 +574,17 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		}
 		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 		return err;
+
+slow_path_clean:
+		if (skb->sk) {
+			skb_walk_frags(skb, frag) {
+				if (!frag->sk)
+					break;
+				frag->sk = NULL;
+				frag->destructor = NULL;
+				skb->truesize += frag->truesize;
+			}
+		}
 	}
 
 slow_path:

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-21 10:48                   ` Jarek Poplawski
@ 2010-09-21 11:58                     ` Eric Dumazet
  2010-09-21 12:39                       ` Jarek Poplawski
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-21 11:58 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Nick Bowler, linux-kernel, netdev, David S. Miller

Le mardi 21 septembre 2010 à 10:48 +0000, Jarek Poplawski a écrit :

> But if you were honest about caches there is no reason to be smart
> no more. ;-)
> 
> Below is your patch mostly as an example only that I didn't mean
> nothing complex (ipv4 only, not compiled).
> 
> Jarek P.
> ---
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 04b6989..049d61f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -490,7 +490,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  	if (skb_has_frags(skb)) {
>  		struct sk_buff *frag;
>  		int first_len = skb_pagelen(skb);
> -		int truesizes = 0;
>  
>  		if (first_len - hlen > mtu ||
>  		    ((first_len - hlen) & 7) ||
> @@ -503,18 +502,18 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  			if (frag->len > mtu ||
>  			    ((frag->len & 7) && frag->next) ||
>  			    skb_headroom(frag) < hlen)
> -			    goto slow_path;
> +			    goto slow_path_clean;
>  
>  			/* Partially cloned skb? */
>  			if (skb_shared(frag))
> -				goto slow_path;
> +				goto slow_path_clean;
>  
>  			BUG_ON(frag->sk);
>  			if (skb->sk) {
>  				frag->sk = skb->sk;
>  				frag->destructor = sock_wfree;
> +				skb->truesize -= frag->truesize;
>  			}
> -			truesizes += frag->truesize;
>  		}
>  
>  		/* Everything is OK. Generate! */
> @@ -524,7 +523,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  		frag = skb_shinfo(skb)->frag_list;
>  		skb_frag_list_init(skb);
>  		skb->data_len = first_len - skb_headlen(skb);
> -		skb->truesize -= truesizes;
>  		skb->len = first_len;
>  		iph->tot_len = htons(first_len);
>  		iph->frag_off = htons(IP_MF);
> @@ -576,6 +574,17 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  		}
>  		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
>  		return err;
> +
> +slow_path_clean:
> +		if (skb->sk) {
> +			skb_walk_frags(skb, frag) {
> +				if (!frag->sk)
> +					break;
> +				frag->sk = NULL;
> +				frag->destructor = NULL;
> +				skb->truesize += frag->truesize;
> +			}
> +		}
>  	}
>  
>  slow_path:

Nice try, but this adds a bug unfortunately, or if you prefer makes the
"BUG_ON(frag->sk);" being not fully covered.

If we hit slow path, and one or more frags had frag->sk set, we silently
unset it, instead of a nice BUG_ON().

If you really want to have a slow_path_clean, you should use a different
frag2 iterator :

slow_path_clean:
       if (skb->sk) {
                skb_walk_frags(skb, frag2) {
                       if (frag2 == frag)
                               break;
                       frag2->sk = NULL;
                       frag2->destructor = NULL;
                       skb->truesize += frag2->truesize;
               }
       }

Face it, this is too complex for net-2.6, could we just fix the bug, and
'optimize' later ?




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-21 11:58                     ` Eric Dumazet
@ 2010-09-21 12:39                       ` Jarek Poplawski
  0 siblings, 0 replies; 32+ messages in thread
From: Jarek Poplawski @ 2010-09-21 12:39 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Nick Bowler, linux-kernel, netdev, David S. Miller

On Tue, Sep 21, 2010 at 01:58:30PM +0200, Eric Dumazet wrote:
> Le mardi 21 septembre 2010 ?? 10:48 +0000, Jarek Poplawski a écrit :
> 
> > But if you were honest about caches there is no reason to be smart
> > no more. ;-)
> > 
> > Below is your patch mostly as an example only that I didn't mean
> > nothing complex (ipv4 only, not compiled).
> > 
> > Jarek P.
> > ---
> > 
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 04b6989..049d61f 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -490,7 +490,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
> >  	if (skb_has_frags(skb)) {
> >  		struct sk_buff *frag;
> >  		int first_len = skb_pagelen(skb);
> > -		int truesizes = 0;
> >  
> >  		if (first_len - hlen > mtu ||
> >  		    ((first_len - hlen) & 7) ||
> > @@ -503,18 +502,18 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
> >  			if (frag->len > mtu ||
> >  			    ((frag->len & 7) && frag->next) ||
> >  			    skb_headroom(frag) < hlen)
> > -			    goto slow_path;
> > +			    goto slow_path_clean;
> >  
> >  			/* Partially cloned skb? */
> >  			if (skb_shared(frag))
> > -				goto slow_path;
> > +				goto slow_path_clean;
> >  
> >  			BUG_ON(frag->sk);
> >  			if (skb->sk) {
> >  				frag->sk = skb->sk;
> >  				frag->destructor = sock_wfree;
> > +				skb->truesize -= frag->truesize;
> >  			}
> > -			truesizes += frag->truesize;
> >  		}
> >  
> >  		/* Everything is OK. Generate! */
> > @@ -524,7 +523,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
> >  		frag = skb_shinfo(skb)->frag_list;
> >  		skb_frag_list_init(skb);
> >  		skb->data_len = first_len - skb_headlen(skb);
> > -		skb->truesize -= truesizes;
> >  		skb->len = first_len;
> >  		iph->tot_len = htons(first_len);
> >  		iph->frag_off = htons(IP_MF);
> > @@ -576,6 +574,17 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
> >  		}
> >  		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
> >  		return err;
> > +
> > +slow_path_clean:
> > +		if (skb->sk) {
> > +			skb_walk_frags(skb, frag) {
> > +				if (!frag->sk)
> > +					break;
> > +				frag->sk = NULL;
> > +				frag->destructor = NULL;
> > +				skb->truesize += frag->truesize;
> > +			}
> > +		}
> >  	}
> >  
> >  slow_path:
> 
> Nice try, but this adds a bug unfortunately, or if you prefer makes the
> "BUG_ON(frag->sk);" being not fully covered.
> 
> If we hit slow path, and one or more frags had frag->sk set, we silently
> unset it, instead of a nice BUG_ON().

Which exactly BUG_ON() on the slow path do you mean?

> 
> If you really want to have a slow_path_clean, you should use a different
> frag2 iterator :
> 
> slow_path_clean:
>        if (skb->sk) {
>                 skb_walk_frags(skb, frag2) {
>                        if (frag2 == frag)
>                                break;
>                        frag2->sk = NULL;
>                        frag2->destructor = NULL;
>                        skb->truesize += frag2->truesize;
>                }
>        }
> 
> Face it, this is too complex for net-2.6, could we just fix the bug, and
> 'optimize' later ?

I thought it's clear I didn't intend to change anything with this
example, but in any case I admit it: I'm 100% OK with your patch.

Jarek P.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Regression, bisected: reference leak with IPSec since ~2.6.31
  2010-09-20 21:31       ` Eric Dumazet
  2010-09-21  6:16         ` [PATCH] ip : take care of last fragment in ip_append_data Eric Dumazet
  2010-09-21  9:12         ` Regression, bisected: reference leak with IPSec since ~2.6.31 Jarek Poplawski
@ 2010-09-21 14:05         ` Nick Bowler
  2010-09-21 14:16           ` [PATCH] ip : fix truesize mismatch in ip fragmentation Eric Dumazet
  2 siblings, 1 reply; 32+ messages in thread
From: Nick Bowler @ 2010-09-21 14:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, netdev, David S. Miller

On 2010-09-20 23:31 +0200, Eric Dumazet wrote:
> Could you try following patch ?
>
> [PATCH] ip : fix truesize mismatch in ip fragmentation
[...]
> @@ -658,13 +657,15 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  				goto slow_path;
>  
>  			BUG_ON(frag->sk);
> -			if (skb->sk) {
> +		}
> +		if (skb->sk) {
> +			skb_walk_frags(skb, frag) {
>  				frag->sk = skb->sk;
>  				frag->destructor = sock_wfree;
> -				truesizes += frag->truesize;
> +				skb->truesize -= frag->truesize;
>  			}
>  		}
> -
> +				

This hunk introduces some whitespace damage.

Anyway, I tried this with ESP on both IPv4 and IPv6 and it appears to
correct the issue.  Thanks!

-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] ip : fix truesize mismatch in ip fragmentation
  2010-09-21 14:05         ` Nick Bowler
@ 2010-09-21 14:16           ` Eric Dumazet
  2010-09-21 15:58             ` [PATCH v3] ip: " Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-21 14:16 UTC (permalink / raw)
  To: Nick Bowler; +Cc: linux-kernel, netdev, David S. Miller

Le mardi 21 septembre 2010 à 10:05 -0400, Nick Bowler a écrit :

> This hunk introduces some whitespace damage.
> 
> Anyway, I tried this with ESP on both IPv4 and IPv6 and it appears to
> correct the issue.  Thanks!
> 

Indeed good catch.

Here is an updated patch, I added your Tested-by

Thanks for testing !

[PATCH] ip : fix truesize mismatch in ip fragmentation

We should not set frag->destructor to sock_wkfree() until we are sure we
dont hit slow path in ip_fragment(). Or we risk uncharging
frag->truesize twice, and in the end, having negative socket
sk_wmem_alloc counter, or even freeing socket sooner than expected.

Many thanks to Nick Bowler, who provided a very clean bug report and
test program.

While Nick bisection pointed to commit 2b85a34e911bf483 (net: No more
expensive sock_hold()/sock_put() on each tx), underlying bug is older.

Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_output.c  |    8 ++++----
 net/ipv6/ip6_output.c |    8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04b6989..126d9b3 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -490,7 +490,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	if (skb_has_frags(skb)) {
 		struct sk_buff *frag;
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -510,11 +509,13 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 				goto slow_path;
 
 			BUG_ON(frag->sk);
-			if (skb->sk) {
+		}
+		if (skb->sk) {
+			skb_walk_frags(skb, frag) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
+				skb->truesize -= frag->truesize;
 			}
-			truesizes += frag->truesize;
 		}
 
 		/* Everything is OK. Generate! */
@@ -524,7 +525,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		frag = skb_shinfo(skb)->frag_list;
 		skb_frag_list_init(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		iph->tot_len = htons(first_len);
 		iph->frag_off = htons(IP_MF);
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d40b330..633217d 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -639,7 +639,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 	if (skb_has_frags(skb)) {
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -658,10 +657,12 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 				goto slow_path;
 
 			BUG_ON(frag->sk);
-			if (skb->sk) {
+		}
+		if (skb->sk) {
+			skb_walk_frags(skb, frag) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
-				truesizes += frag->truesize;
+				skb->truesize -= frag->truesize;
 			}
 		}
 
@@ -693,7 +694,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 		first_len = skb_pagelen(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		ipv6_hdr(skb)->payload_len = htons(first_len -
 						   sizeof(struct ipv6hdr));



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 14:16           ` [PATCH] ip : fix truesize mismatch in ip fragmentation Eric Dumazet
@ 2010-09-21 15:58             ` Eric Dumazet
  2010-09-21 16:26               ` Henrique de Moraes Holschuh
  2010-09-21 17:50               ` Jarek Poplawski
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2010-09-21 15:58 UTC (permalink / raw)
  To: Nick Bowler
  Cc: linux-kernel, netdev, David S. Miller, Jarek Poplawski, Patrick McHardy

Le mardi 21 septembre 2010 à 16:16 +0200, Eric Dumazet a écrit :
> Le mardi 21 septembre 2010 à 10:05 -0400, Nick Bowler a écrit :
> 
> > This hunk introduces some whitespace damage.
> > 
> > Anyway, I tried this with ESP on both IPv4 and IPv6 and it appears to
> > correct the issue.  Thanks!
> > 
> 
> Indeed good catch.
> 
> Here is an updated patch, I added your Tested-by
> 
> Thanks for testing !
> 
> [PATCH] ip : fix truesize mismatch in ip fragmentation
> 
> We should not set frag->destructor to sock_wkfree() until we are sure we
> dont hit slow path in ip_fragment(). Or we risk uncharging
> frag->truesize twice, and in the end, having negative socket
> sk_wmem_alloc counter, or even freeing socket sooner than expected.
> 
> Many thanks to Nick Bowler, who provided a very clean bug report and
> test program.
> 
> While Nick bisection pointed to commit 2b85a34e911bf483 (net: No more
> expensive sock_hold()/sock_put() on each tx), underlying bug is older.
> 


Hmm, while looking at git history, I found commit from Patrick
b2722b1c3a893e (ip_fragment: also adjust skb->truesize for packets not
owned by a socket)
As my patch reverts it, we probably want a more polished patch.

(Also port Patrick work to ipv6)

So here is a V3

[PATCH v3] ip: fix truesize mismatch in ip fragmentation

Special care should be taken when slow path is hit in ip_fragment() :

When walking through frags, we transfert truesize ownership from skb to
frags. Then if we hit a slow_path condition, we must undo this or risk
uncharging frags->truesize twice, and in the end, having negative socket
sk_wmem_alloc counter, or even freeing socket sooner than expected.

Many thanks to Nick Bowler, who provided a very clean bug report and
test program.

Thanks to Jarek for reviewing my first patch and providing a V2

While Nick bisection pointed to commit 2b85a34e911 (net: No more
expensive sock_hold()/sock_put() on each tx), underlying bug is older
(2.6.12-rc5)

A side effect is to extend work done in commit b2722b1c3a893e
(ip_fragment: also adjust skb->truesize for packets not owned by a
socket) to ipv6 as well.

Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ip_output.c  |   19 +++++++++++++------
 net/ipv6/ip6_output.c |   18 +++++++++++++-----
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04b6989..a643f7a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -488,9 +488,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	 * we can switch to copy when see the first bad fragment.
 	 */
 	if (skb_has_frags(skb)) {
-		struct sk_buff *frag;
+		struct sk_buff *frag, *frag2;
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -503,18 +502,18 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			if (frag->len > mtu ||
 			    ((frag->len & 7) && frag->next) ||
 			    skb_headroom(frag) < hlen)
-			    goto slow_path;
+			    goto slow_path_undo;
 
 			/* Partially cloned skb? */
 			if (skb_shared(frag))
-				goto slow_path;
+				goto slow_path_undo;
 
 			BUG_ON(frag->sk);
 			if (skb->sk) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
 			}
-			truesizes += frag->truesize;
+			skb->truesize -= frag->truesize;
 		}
 
 		/* Everything is OK. Generate! */
@@ -524,7 +523,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		frag = skb_shinfo(skb)->frag_list;
 		skb_frag_list_init(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		iph->tot_len = htons(first_len);
 		iph->frag_off = htons(IP_MF);
@@ -576,6 +574,15 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		}
 		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 		return err;
+
+slow_path_undo:
+		skb_walk_frags(skb, frag2) {
+			if (frag2 == frag)
+				break;
+			frag2->sk = NULL;
+			frag2->destructor = NULL;
+			skb->truesize += frag2->truesize;
+		}
 	}
 
 slow_path:
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d40b330..ca7ba44 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -639,7 +639,7 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 	if (skb_has_frags(skb)) {
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
+		struct sk_buff *frag2;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -651,18 +651,18 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			if (frag->len > mtu ||
 			    ((frag->len & 7) && frag->next) ||
 			    skb_headroom(frag) < hlen)
-			    goto slow_path;
+			    goto slow_path_undo;
 
 			/* Partially cloned skb? */
 			if (skb_shared(frag))
-				goto slow_path;
+				goto slow_path_undo;
 
 			BUG_ON(frag->sk);
 			if (skb->sk) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
-				truesizes += frag->truesize;
 			}
+			skb->truesize -= frag->truesize;
 		}
 
 		err = 0;
@@ -693,7 +693,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 		first_len = skb_pagelen(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		ipv6_hdr(skb)->payload_len = htons(first_len -
 						   sizeof(struct ipv6hdr));
@@ -756,6 +755,15 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			      IPSTATS_MIB_FRAGFAILS);
 		dst_release(&rt->dst);
 		return err;
+
+slow_path_undo:
+		skb_walk_frags(skb, frag2) {
+			if (frag2 == frag)
+				break;
+			frag2->sk = NULL;
+			frag2->destructor = NULL;
+			skb->truesize += frag2->truesize;
+		}
 	}
 
 slow_path:



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 15:58             ` [PATCH v3] ip: " Eric Dumazet
@ 2010-09-21 16:26               ` Henrique de Moraes Holschuh
  2010-09-21 16:31                 ` Eric Dumazet
  2010-09-21 17:50               ` Jarek Poplawski
  1 sibling, 1 reply; 32+ messages in thread
From: Henrique de Moraes Holschuh @ 2010-09-21 16:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Bowler, linux-kernel, netdev, David S. Miller,
	Jarek Poplawski, Patrick McHardy

Should this be a candidate for -stable?

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 16:26               ` Henrique de Moraes Holschuh
@ 2010-09-21 16:31                 ` Eric Dumazet
  2010-09-21 18:09                   ` Henrique de Moraes Holschuh
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-21 16:31 UTC (permalink / raw)
  To: Henrique de Moraes Holschuh
  Cc: Nick Bowler, linux-kernel, netdev, David S. Miller,
	Jarek Poplawski, Patrick McHardy

Le mardi 21 septembre 2010 à 13:26 -0300, Henrique de Moraes Holschuh a
écrit :
> Should this be a candidate for -stable?
> 

Yes, of course, but David wants to handle stable submissions himself.

I am not sure we want to bug stable team with dozens of mails while
polishing patches ?

Thanks



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 15:58             ` [PATCH v3] ip: " Eric Dumazet
  2010-09-21 16:26               ` Henrique de Moraes Holschuh
@ 2010-09-21 17:50               ` Jarek Poplawski
  2010-09-21 18:47                 ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Jarek Poplawski @ 2010-09-21 17:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Bowler, linux-kernel, netdev, David S. Miller, Patrick McHardy

On Tue, Sep 21, 2010 at 05:58:25PM +0200, Eric Dumazet wrote:
> Hmm, while looking at git history, I found commit from Patrick
> b2722b1c3a893e (ip_fragment: also adjust skb->truesize for packets not
> owned by a socket)
> As my patch reverts it, we probably want a more polished patch.
> 
> (Also port Patrick work to ipv6)
> 
> So here is a V3
> 
> [PATCH v3] ip: fix truesize mismatch in ip fragmentation
> 
> Special care should be taken when slow path is hit in ip_fragment() :
> 
> When walking through frags, we transfert truesize ownership from skb to
> frags. Then if we hit a slow_path condition, we must undo this or risk
> uncharging frags->truesize twice, and in the end, having negative socket
> sk_wmem_alloc counter, or even freeing socket sooner than expected.
> 
> Many thanks to Nick Bowler, who provided a very clean bug report and
> test program.
> 
> Thanks to Jarek for reviewing my first patch and providing a V2
> 
> While Nick bisection pointed to commit 2b85a34e911 (net: No more
> expensive sock_hold()/sock_put() on each tx), underlying bug is older
> (2.6.12-rc5)
> 
> A side effect is to extend work done in commit b2722b1c3a893e
> (ip_fragment: also adjust skb->truesize for packets not owned by a
> socket) to ipv6 as well.
> 
> Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
> Tested-by: Nick Bowler <nbowler@elliptictech.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Jarek Poplawski <jarkao2@gmail.com>
> CC: Patrick McHardy <kaber@trash.net>
> ---
>  net/ipv4/ip_output.c  |   19 +++++++++++++------
>  net/ipv6/ip6_output.c |   18 +++++++++++++-----
>  2 files changed, 26 insertions(+), 11 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 04b6989..a643f7a 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -488,9 +488,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  	 * we can switch to copy when see the first bad fragment.
>  	 */
>  	if (skb_has_frags(skb)) {
> -		struct sk_buff *frag;
> +		struct sk_buff *frag, *frag2;
>  		int first_len = skb_pagelen(skb);
> -		int truesizes = 0;
>  
>  		if (first_len - hlen > mtu ||
>  		    ((first_len - hlen) & 7) ||
> @@ -503,18 +502,18 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
>  			if (frag->len > mtu ||
>  			    ((frag->len & 7) && frag->next) ||
>  			    skb_headroom(frag) < hlen)
> -			    goto slow_path;
> +			    goto slow_path_undo;

Looks better and better to me, except, checkpatch complains about the
(existing) indentation fault here (and later), but I guess you've seen
that?

Jarek P.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 16:31                 ` Eric Dumazet
@ 2010-09-21 18:09                   ` Henrique de Moraes Holschuh
  2010-09-21 19:24                     ` David Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Henrique de Moraes Holschuh @ 2010-09-21 18:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Bowler, linux-kernel, netdev, David S. Miller,
	Jarek Poplawski, Patrick McHardy

On Tue, 21 Sep 2010, Eric Dumazet wrote:
> Le mardi 21 septembre 2010 à 13:26 -0300, Henrique de Moraes Holschuh a
> écrit :
> > Should this be a candidate for -stable?
> > 
> 
> Yes, of course, but David wants to handle stable submissions himself.
> 
> I am not sure we want to bug stable team with dozens of mails while
> polishing patches ?

We don't.  But one often marks commits that should go to -stable using a Cc:
pseudo-header, and also includes relevant information (e.g. to which stable
kernels it should be applied to) to the commit message.

Since there wasn't one, and I didn't readly find any post in this thread
that mentioned it should also go to stable, AND it looked at first glance
like something that should go to stable, I asked about it.

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 17:50               ` Jarek Poplawski
@ 2010-09-21 18:47                 ` Eric Dumazet
  2010-09-21 19:21                   ` Jarek Poplawski
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-21 18:47 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Nick Bowler, linux-kernel, netdev, David S. Miller, Patrick McHardy


> Looks better and better to me, except, checkpatch complains about the
> (existing) indentation fault here (and later), but I guess you've seen
> that?
> 

Indeed, there is checkpatch somewhere ;)

Thanks !

[PATCH v4] ip: fix truesize mismatch in ip fragmentation

Special care should be taken when slow path is hit in ip_fragment() :

When walking through frags, we transfert truesize ownership from skb to
frags. Then if we hit a slow_path condition, we must undo this or risk
uncharging frags->truesize twice, and in the end, having negative socket
sk_wmem_alloc counter, or even freeing socket sooner than expected.

Many thanks to Nick Bowler, who provided a very clean bug report and
test program.

Thanks to Jarek for reviewing my first patch and providing a V2

While Nick bisection pointed to commit 2b85a34e911 (net: No more
expensive sock_hold()/sock_put() on each tx), underlying bug is older
(2.6.12-rc5)

A side effect is to extend work done in commit b2722b1c3a893e
(ip_fragment: also adjust skb->truesize for packets not owned by a
socket) to ipv6 as well.

Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
Tested-by: Nick Bowler <nbowler@elliptictech.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Jarek Poplawski <jarkao2@gmail.com>
CC: Patrick McHardy <kaber@trash.net>
---
 net/ipv4/ip_output.c  |   19 +++++++++++++------
 net/ipv6/ip6_output.c |   18 +++++++++++++-----
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 04b6989..7649d77 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -488,9 +488,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	 * we can switch to copy when see the first bad fragment.
 	 */
 	if (skb_has_frags(skb)) {
-		struct sk_buff *frag;
+		struct sk_buff *frag, *frag2;
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -503,18 +502,18 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			if (frag->len > mtu ||
 			    ((frag->len & 7) && frag->next) ||
 			    skb_headroom(frag) < hlen)
-			    goto slow_path;
+				goto slow_path_clean;
 
 			/* Partially cloned skb? */
 			if (skb_shared(frag))
-				goto slow_path;
+				goto slow_path_clean;
 
 			BUG_ON(frag->sk);
 			if (skb->sk) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
 			}
-			truesizes += frag->truesize;
+			skb->truesize -= frag->truesize;
 		}
 
 		/* Everything is OK. Generate! */
@@ -524,7 +523,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		frag = skb_shinfo(skb)->frag_list;
 		skb_frag_list_init(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		iph->tot_len = htons(first_len);
 		iph->frag_off = htons(IP_MF);
@@ -576,6 +574,15 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 		}
 		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 		return err;
+
+slow_path_clean:
+		skb_walk_frags(skb, frag2) {
+			if (frag2 == frag)
+				break;
+			frag2->sk = NULL;
+			frag2->destructor = NULL;
+			skb->truesize += frag2->truesize;
+		}
 	}
 
 slow_path:
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d40b330..980912e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -639,7 +639,7 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 	if (skb_has_frags(skb)) {
 		int first_len = skb_pagelen(skb);
-		int truesizes = 0;
+		struct sk_buff *frag2;
 
 		if (first_len - hlen > mtu ||
 		    ((first_len - hlen) & 7) ||
@@ -651,18 +651,18 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			if (frag->len > mtu ||
 			    ((frag->len & 7) && frag->next) ||
 			    skb_headroom(frag) < hlen)
-			    goto slow_path;
+				goto slow_path_clean;
 
 			/* Partially cloned skb? */
 			if (skb_shared(frag))
-				goto slow_path;
+				goto slow_path_clean;
 
 			BUG_ON(frag->sk);
 			if (skb->sk) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
-				truesizes += frag->truesize;
 			}
+			skb->truesize -= frag->truesize;
 		}
 
 		err = 0;
@@ -693,7 +693,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 		first_len = skb_pagelen(skb);
 		skb->data_len = first_len - skb_headlen(skb);
-		skb->truesize -= truesizes;
 		skb->len = first_len;
 		ipv6_hdr(skb)->payload_len = htons(first_len -
 						   sizeof(struct ipv6hdr));
@@ -756,6 +755,15 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			      IPSTATS_MIB_FRAGFAILS);
 		dst_release(&rt->dst);
 		return err;
+
+slow_path_clean:
+		skb_walk_frags(skb, frag2) {
+			if (frag2 == frag)
+				break;
+			frag2->sk = NULL;
+			frag2->destructor = NULL;
+			skb->truesize += frag2->truesize;
+		}
 	}
 
 slow_path:



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 18:47                 ` Eric Dumazet
@ 2010-09-21 19:21                   ` Jarek Poplawski
  2010-09-21 22:15                     ` David Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Jarek Poplawski @ 2010-09-21 19:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Bowler, linux-kernel, netdev, David S. Miller, Patrick McHardy

On Tue, Sep 21, 2010 at 08:47:45PM +0200, Eric Dumazet wrote:
> 
> > Looks better and better to me, except, checkpatch complains about the
> > (existing) indentation fault here (and later), but I guess you've seen
> > that?
> > 
> 
> Indeed, there is checkpatch somewhere ;)
> 
> Thanks !
> 
> [PATCH v4] ip: fix truesize mismatch in ip fragmentation
> 
> Special care should be taken when slow path is hit in ip_fragment() :
> 
> When walking through frags, we transfert truesize ownership from skb to
> frags. Then if we hit a slow_path condition, we must undo this or risk
> uncharging frags->truesize twice, and in the end, having negative socket
> sk_wmem_alloc counter, or even freeing socket sooner than expected.
> 
> Many thanks to Nick Bowler, who provided a very clean bug report and
> test program.
> 
> Thanks to Jarek for reviewing my first patch and providing a V2
> 
> While Nick bisection pointed to commit 2b85a34e911 (net: No more
> expensive sock_hold()/sock_put() on each tx), underlying bug is older
> (2.6.12-rc5)
> 
> A side effect is to extend work done in commit b2722b1c3a893e
> (ip_fragment: also adjust skb->truesize for packets not owned by a
> socket) to ipv6 as well.
> 
> Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
> Tested-by: Nick Bowler <nbowler@elliptictech.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Jarek Poplawski <jarkao2@gmail.com>
> CC: Patrick McHardy <kaber@trash.net>

Looks perfect to me.

Jarek P.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 18:09                   ` Henrique de Moraes Holschuh
@ 2010-09-21 19:24                     ` David Miller
  2010-09-21 23:06                       ` Henrique de Moraes Holschuh
  0 siblings, 1 reply; 32+ messages in thread
From: David Miller @ 2010-09-21 19:24 UTC (permalink / raw)
  To: hmh; +Cc: eric.dumazet, nbowler, linux-kernel, netdev, jarkao2, kaber

From: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Date: Tue, 21 Sep 2010 15:09:40 -0300

> On Tue, 21 Sep 2010, Eric Dumazet wrote:
>> Le mardi 21 septembre 2010 à 13:26 -0300, Henrique de Moraes Holschuh a
>> écrit :
>> > Should this be a candidate for -stable?
>> > 
>> 
>> Yes, of course, but David wants to handle stable submissions himself.
>> 
>> I am not sure we want to bug stable team with dozens of mails while
>> polishing patches ?
> 
> We don't.  But one often marks commits that should go to -stable using a Cc:
> pseudo-header, and also includes relevant information (e.g. to which stable
> kernels it should be applied to) to the commit message.
> 
> Since there wasn't one, and I didn't readly find any post in this thread
> that mentioned it should also go to stable, AND it looked at first glance
> like something that should go to stable, I asked about it.

Sorry, that is not how we typically handle things in the networking.

I queue up all appropriate -stable patches automatically.

I do this mainly because:

1) It isn't the submitter who gets to decide all by himself that
   something is -stable material, that's partly my job too.

   So if the submitter puts the CC: stable thing in the commit
   message, that takes me out of the decision making process.

2) I do not want -stable submissions to go in just because a patch
   made it into Linus's tree.

   I want fixes to sit and cook in Linus's tree for a while before
   they go to -stable unless it's an _incredibly_ obvious fix.
   This allows any bugs in the fix to be shaken out first.

   Again, the CC: stable tag subverts that.

I really think the "CC: stable" tag is only appropriate for a very
limited scope of bug fixes.  The incredibly obvious ones that need
almost no testing and time exposure in Linus's tree.

All of the rest should be carefully queued up for -stable and
submitted there after a week or two of the patch sitting in Linus's
tree.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 19:21                   ` Jarek Poplawski
@ 2010-09-21 22:15                     ` David Miller
  0 siblings, 0 replies; 32+ messages in thread
From: David Miller @ 2010-09-21 22:15 UTC (permalink / raw)
  To: jarkao2; +Cc: eric.dumazet, nbowler, linux-kernel, netdev, kaber

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Tue, 21 Sep 2010 21:21:54 +0200

> On Tue, Sep 21, 2010 at 08:47:45PM +0200, Eric Dumazet wrote:
>> [PATCH v4] ip: fix truesize mismatch in ip fragmentation
>> 
>> Special care should be taken when slow path is hit in ip_fragment() :
>> 
>> When walking through frags, we transfert truesize ownership from skb to
>> frags. Then if we hit a slow_path condition, we must undo this or risk
>> uncharging frags->truesize twice, and in the end, having negative socket
>> sk_wmem_alloc counter, or even freeing socket sooner than expected.
>> 
>> Many thanks to Nick Bowler, who provided a very clean bug report and
>> test program.
>> 
>> Thanks to Jarek for reviewing my first patch and providing a V2
>> 
>> While Nick bisection pointed to commit 2b85a34e911 (net: No more
>> expensive sock_hold()/sock_put() on each tx), underlying bug is older
>> (2.6.12-rc5)
>> 
>> A side effect is to extend work done in commit b2722b1c3a893e
>> (ip_fragment: also adjust skb->truesize for packets not owned by a
>> socket) to ipv6 as well.
>> 
>> Reported-and-bisected-by: Nick Bowler <nbowler@elliptictech.com>
>> Tested-by: Nick Bowler <nbowler@elliptictech.com>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> CC: Jarek Poplawski <jarkao2@gmail.com>
>> CC: Patrick McHardy <kaber@trash.net>
> 
> Looks perfect to me.

Great work everyone!

Applied, thanks.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v3] ip: fix truesize mismatch in ip fragmentation
  2010-09-21 19:24                     ` David Miller
@ 2010-09-21 23:06                       ` Henrique de Moraes Holschuh
  0 siblings, 0 replies; 32+ messages in thread
From: Henrique de Moraes Holschuh @ 2010-09-21 23:06 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, nbowler, linux-kernel, netdev, jarkao2, kaber

On Tue, 21 Sep 2010, David Miller wrote:
> > Since there wasn't one, and I didn't readly find any post in this thread
> > that mentioned it should also go to stable, AND it looked at first glance
> > like something that should go to stable, I asked about it.
> 
> Sorry, that is not how we typically handle things in the networking.
> 
> I queue up all appropriate -stable patches automatically.

I see.

>    I want fixes to sit and cook in Linus's tree for a while before
>    they go to -stable unless it's an _incredibly_ obvious fix.
>    This allows any bugs in the fix to be shaken out first.

...

> All of the rest should be carefully queued up for -stable and
> submitted there after a week or two of the patch sitting in Linus's
> tree.

That is a very good way of doing things.  Thanks for the hard work!

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] ip : take care of last fragment in ip_append_data
  2010-09-21  6:16         ` [PATCH] ip : take care of last fragment in ip_append_data Eric Dumazet
@ 2010-09-21 23:38           ` David Miller
  2010-09-22  4:44             ` Eric Dumazet
  2010-09-24 21:42           ` David Miller
  1 sibling, 1 reply; 32+ messages in thread
From: David Miller @ 2010-09-21 23:38 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nbowler, linux-kernel, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 21 Sep 2010 08:16:27 +0200

> [PATCH] ip : take care of last fragment in ip_append_data
> 
> While investigating a bit, I found ip_fragment() slow path was taken
> because ip_append_data() provides following layout for a send(MTU +
> N*(MTU - 20)) syscall :
> 
> - one skb with 1500 (mtu) bytes
> - N fragments of 1480 (mtu-20) bytes (before adding IP header)
> last fragment gets 17 bytes of trail data because of following bit:
> 
> 	if (datalen == length + fraggap)
> 		alloclen += rt->dst.trailer_len;
> 
> Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
> another bug ?)
> 
> In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
> so we take slow path, building another skb chain.
> 
> In order to avoid taking slow path, we should correct ip_append_data()
> to make sure last fragment has real trail space, under mtu...
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

This patch largely looks fine, but:

1) I want to find out where that "17" tailer_len comes from before
   applying this, that doesn't make any sense.

2) Even with #1 addressed, this function is tricky so I want to review
   this patch some more.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] ip : take care of last fragment in ip_append_data
  2010-09-21 23:38           ` David Miller
@ 2010-09-22  4:44             ` Eric Dumazet
  2010-09-22  4:53               ` David Miller
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2010-09-22  4:44 UTC (permalink / raw)
  To: David Miller; +Cc: nbowler, linux-kernel, netdev

Le mardi 21 septembre 2010 à 16:38 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 21 Sep 2010 08:16:27 +0200
> 
> > [PATCH] ip : take care of last fragment in ip_append_data
> > 
> > While investigating a bit, I found ip_fragment() slow path was taken
> > because ip_append_data() provides following layout for a send(MTU +
> > N*(MTU - 20)) syscall :
> > 
> > - one skb with 1500 (mtu) bytes
> > - N fragments of 1480 (mtu-20) bytes (before adding IP header)
> > last fragment gets 17 bytes of trail data because of following bit:
> > 
> > 	if (datalen == length + fraggap)
> > 		alloclen += rt->dst.trailer_len;
> > 
> > Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
> > another bug ?)
> > 
> > In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
> > so we take slow path, building another skb chain.
> > 
> > In order to avoid taking slow path, we should correct ip_append_data()
> > to make sure last fragment has real trail space, under mtu...
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> 
> This patch largely looks fine, but:
> 
> 1) I want to find out where that "17" tailer_len comes from before
>    applying this, that doesn't make any sense.
> 
> 2) Even with #1 addressed, this function is tricky so I want to review
>    this patch some more.



The "17" (instead of probable 16 need) comes from :

net/ipv4/esp4.c line 599 :

x->props.trailer_len = align + 1 + crypto_aead_authsize(esp->aead);

In my Nick ipsec script case, 
crypto_aead_blocksize(aead) = 16, 
crypto_aead_authsize(esp->aead) = 0

-> align = 16
trailer_len = 16 + 1 + 0;

I am not sure we need the "+ 1", but I know nothing about this stuff.

Same in net/ipv6/esp6.c ?


Anyway the last frag problem is for packets with lengths :
 MTU + N*(MTU - 20) + LAST

LAST being from [(MTU - trailer_len) ... MTU], not only MTU as I wrote
in changelog




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] ip : take care of last fragment in ip_append_data
  2010-09-22  4:44             ` Eric Dumazet
@ 2010-09-22  4:53               ` David Miller
  0 siblings, 0 replies; 32+ messages in thread
From: David Miller @ 2010-09-22  4:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nbowler, linux-kernel, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 22 Sep 2010 06:44:06 +0200

> The "17" (instead of probable 16 need) comes from :
> 
> net/ipv4/esp4.c line 599 :
> 
> x->props.trailer_len = align + 1 + crypto_aead_authsize(esp->aead);
> 
> In my Nick ipsec script case, 
> crypto_aead_blocksize(aead) = 16, 
> crypto_aead_authsize(esp->aead) = 0
> 
> -> align = 16
> trailer_len = 16 + 1 + 0;
> 
> I am not sure we need the "+ 1", but I know nothing about this stuff.
> 
> Same in net/ipv6/esp6.c ?

I think the author of this code thought that the alignment was
expressed as "N - 1" instead of plain "N".

I'll do some research. :-)

Thanks Eric.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] ip : take care of last fragment in ip_append_data
  2010-09-21  6:16         ` [PATCH] ip : take care of last fragment in ip_append_data Eric Dumazet
  2010-09-21 23:38           ` David Miller
@ 2010-09-24 21:42           ` David Miller
  1 sibling, 0 replies; 32+ messages in thread
From: David Miller @ 2010-09-24 21:42 UTC (permalink / raw)
  To: eric.dumazet; +Cc: nbowler, linux-kernel, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 21 Sep 2010 08:16:27 +0200

> [PATCH] ip : take care of last fragment in ip_append_data
> 
> While investigating a bit, I found ip_fragment() slow path was taken
> because ip_append_data() provides following layout for a send(MTU +
> N*(MTU - 20)) syscall :
> 
> - one skb with 1500 (mtu) bytes
> - N fragments of 1480 (mtu-20) bytes (before adding IP header)
> last fragment gets 17 bytes of trail data because of following bit:
> 
> 	if (datalen == length + fraggap)
> 		alloclen += rt->dst.trailer_len;
> 
> Then esp4 adds 16 bytes of data (while trailer_len is 17... hmm...
> another bug ?)
> 
> In ip_fragment(), we notice last fragment is too big (1496 + 20) > mtu,
> so we take slow path, building another skb chain.
> 
> In order to avoid taking slow path, we should correct ip_append_data()
> to make sure last fragment has real trail space, under mtu...
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2010-09-24 21:41 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-09-20 17:44 Regression, bisected: reference leak with IPSec since ~2.6.31 Nick Bowler
2010-09-20 18:20 ` Eric Dumazet
2010-09-20 19:52   ` Nick Bowler
2010-09-20 20:00     ` David Miller
2010-09-20 21:23       ` Nick Bowler
2010-09-20 20:17     ` Eric Dumazet
2010-09-20 21:31       ` Eric Dumazet
2010-09-21  6:16         ` [PATCH] ip : take care of last fragment in ip_append_data Eric Dumazet
2010-09-21 23:38           ` David Miller
2010-09-22  4:44             ` Eric Dumazet
2010-09-22  4:53               ` David Miller
2010-09-24 21:42           ` David Miller
2010-09-21  9:12         ` Regression, bisected: reference leak with IPSec since ~2.6.31 Jarek Poplawski
2010-09-21  9:21           ` Eric Dumazet
2010-09-21  9:38             ` Jarek Poplawski
2010-09-21  9:55               ` Eric Dumazet
2010-09-21 10:07                 ` Eric Dumazet
2010-09-21 10:48                   ` Jarek Poplawski
2010-09-21 11:58                     ` Eric Dumazet
2010-09-21 12:39                       ` Jarek Poplawski
2010-09-21 14:05         ` Nick Bowler
2010-09-21 14:16           ` [PATCH] ip : fix truesize mismatch in ip fragmentation Eric Dumazet
2010-09-21 15:58             ` [PATCH v3] ip: " Eric Dumazet
2010-09-21 16:26               ` Henrique de Moraes Holschuh
2010-09-21 16:31                 ` Eric Dumazet
2010-09-21 18:09                   ` Henrique de Moraes Holschuh
2010-09-21 19:24                     ` David Miller
2010-09-21 23:06                       ` Henrique de Moraes Holschuh
2010-09-21 17:50               ` Jarek Poplawski
2010-09-21 18:47                 ` Eric Dumazet
2010-09-21 19:21                   ` Jarek Poplawski
2010-09-21 22:15                     ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.