Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors

All of lore.kernel.org
 help / color / mirror / Atom feed

* Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors
@ 2009-11-10 16:09 ben
  2009-11-10 16:50 ` Patrick McHardy
  0 siblings, 1 reply; 8+ messages in thread
From: ben @ 2009-11-10 16:09 UTC (permalink / raw)
  To: netdev

We have observed significant reassembly errors when combining
routing/bridging with conntrack + nf_defrag_ipv4 loaded, and
skb_recycle_check - enabled interfaces.  For our test, we had a single
linux device with two interfaces (gianfars in this case) with SKB
recycling enabled.  We sent large, continuous pings across the bridge,
like this:
ping -s 64000 -A <dest IP>

Then, we ran netstat -s --raw, and noticed that IPSTATS_MIB_REASMFAILS
were happening for about 40% of the received datagrams.  Tracing the
code in ip_fragment.c, we instrumented each of the
IPSTATS_MIB_REASMFAILS locations, and found the culprit to be
ip_evictor.  Nothing looked unusual here, so we placed tracing in
ip_frag_queue, directly above:
	atomic_add(skb->truesize, &qp->q.net->mem);

We noticed that quite a few of the skb->truesize numbers were in the 67K
range, which quickly overwhelms the default 192K-ish ipfrag_low_thresh.
This means that the next time inet_frag_evictor is run:
 work = atomic_read(&nf->mem) - nf->low_thresh;

Will surely be positive, and it is likely that our huge-frag-containing
queue will be one of those evicted. 

Looking at the source of these huge skbs, it seems that during
re-fragmentation in br_nf_dev_queue_xmit (which calls ip_fragment with
CONFIG_NF_CONNTRACK_IPV4 enabled), the huge datagram that was allocated
to hold a successfully-reassembled skb may be getting reused?  In any
case, when skb_recycle_check(skb, min_rx_size) is called, the huge
(skb->truesize huge, not data huge) skb is recycled for use on RX, and
it eventually gets enqueued for reassembly, causing the
inet_frag_evictor to have a positive work value.

Our solution was to add an upper-bounds check to skb_recycle_check,
which prevents the large-ish SKBs from being used to create future
frags, and overwhelming ipfrag_low_thresh.  This seems quite clunky,
although I would be happy to submit this as a patch...

If this is not the right place...what is the "right" place?

Ben Menchaca
Bigfoot Networks

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors
  2009-11-10 16:09 Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors ben
@ 2009-11-10 16:50 ` Patrick McHardy
  2009-11-21 19:08   ` David Miller
  0 siblings, 1 reply; 8+ messages in thread
From: Patrick McHardy @ 2009-11-10 16:50 UTC (permalink / raw)
  To: ben; +Cc: netdev

ben@bigfootnetworks.com wrote:
> We have observed significant reassembly errors when combining
> routing/bridging with conntrack + nf_defrag_ipv4 loaded, and
> skb_recycle_check - enabled interfaces.  For our test, we had a single
> linux device with two interfaces (gianfars in this case) with SKB
> recycling enabled.  We sent large, continuous pings across the bridge,
> like this:
> ping -s 64000 -A <dest IP>
> 
> Then, we ran netstat -s --raw, and noticed that IPSTATS_MIB_REASMFAILS
> were happening for about 40% of the received datagrams.  Tracing the
> code in ip_fragment.c, we instrumented each of the
> IPSTATS_MIB_REASMFAILS locations, and found the culprit to be
> ip_evictor.  Nothing looked unusual here, so we placed tracing in
> ip_frag_queue, directly above:
> 	atomic_add(skb->truesize, &qp->q.net->mem);
> 
> We noticed that quite a few of the skb->truesize numbers were in the 67K
> range, which quickly overwhelms the default 192K-ish ipfrag_low_thresh.
> This means that the next time inet_frag_evictor is run:
>  work = atomic_read(&nf->mem) - nf->low_thresh;
> 
> Will surely be positive, and it is likely that our huge-frag-containing
> queue will be one of those evicted. 
> 
> Looking at the source of these huge skbs, it seems that during
> re-fragmentation in br_nf_dev_queue_xmit (which calls ip_fragment with
> CONFIG_NF_CONNTRACK_IPV4 enabled), the huge datagram that was allocated
> to hold a successfully-reassembled skb may be getting reused?  In any
> case, when skb_recycle_check(skb, min_rx_size) is called, the huge
> (skb->truesize huge, not data huge) skb is recycled for use on RX, and
> it eventually gets enqueued for reassembly, causing the
> inet_frag_evictor to have a positive work value.

Interesting problem. I wonder what the linear size of the skb was
and whether we're just not properly adjusting truesize of the head
during refragmentation.

This code in ip_fragment() looks suspicious:

	if (skb_has_frags(skb)) {
	...
		skb_walk_frags(skb, frag) {
			...
			if (skb->sk) {
				frag->sk = skb->sk;
				frag->destructor = sock_wfree;
				truesizes += frag->truesize;
			}

truesizes is later used to adjust truesize of the head skb.
For some reason this is only done when it originated from a
local socket.

> Our solution was to add an upper-bounds check to skb_recycle_check,
> which prevents the large-ish SKBs from being used to create future
> frags, and overwhelming ipfrag_low_thresh.  This seems quite clunky,
> although I would be happy to submit this as a patch...

This seems reasonable to me, there might be large skbs for
different reasons.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors
  2009-11-10 16:50 ` Patrick McHardy
@ 2009-11-21 19:08   ` David Miller
  2009-11-22  0:21     ` Patrick McHardy
  0 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2009-11-21 19:08 UTC (permalink / raw)
  To: kaber; +Cc: ben, netdev

From: Patrick McHardy <kaber@trash.net>
Date: Tue, 10 Nov 2009 17:50:38 +0100

> This code in ip_fragment() looks suspicious:
> 
> 	if (skb_has_frags(skb)) {
> 	...
> 		skb_walk_frags(skb, frag) {
> 			...
> 			if (skb->sk) {
> 				frag->sk = skb->sk;
> 				frag->destructor = sock_wfree;
> 				truesizes += frag->truesize;
> 			}
> 
> truesizes is later used to adjust truesize of the head skb.
> For some reason this is only done when it originated from a
> local socket.

Well, it shouldn't look _that_ suspicious.

What this code is doing is making sure that after we make all of these
changes, the truesize of the SKBs referrng to the socket do not
change.

It's simply making sure that the math works out when all the
sock_wfree() calls occur later.

If we don't have a socket involved, there is no reason to make
these adjustments.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors
  2009-11-21 19:08   ` David Miller
@ 2009-11-22  0:21     ` Patrick McHardy
  2009-11-22  0:29       ` Patrick McHardy
  0 siblings, 1 reply; 8+ messages in thread
From: Patrick McHardy @ 2009-11-22  0:21 UTC (permalink / raw)
  To: David Miller; +Cc: ben, netdev

David Miller wrote:
> From: Patrick McHardy <kaber@trash.net>
> Date: Tue, 10 Nov 2009 17:50:38 +0100
> 
>> This code in ip_fragment() looks suspicious:
>>
>> 	if (skb_has_frags(skb)) {
>> 	...
>> 		skb_walk_frags(skb, frag) {
>> 			...
>> 			if (skb->sk) {
>> 				frag->sk = skb->sk;
>> 				frag->destructor = sock_wfree;
>> 				truesizes += frag->truesize;
>> 			}
>>
>> truesizes is later used to adjust truesize of the head skb.
>> For some reason this is only done when it originated from a
>> local socket.
> 
> Well, it shouldn't look _that_ suspicious.
> 
> What this code is doing is making sure that after we make all of these
> changes, the truesize of the SKBs referrng to the socket do not
> change.
> 
> It's simply making sure that the math works out when all the
> sock_wfree() calls occur later.
> 
> If we don't have a socket involved, there is no reason to make
> these adjustments.

That seems to be the assumption. But ip_defrag() uses skb->truesize
as well to make sure the defragmentation memory limits are not exceeded.
In this case what seems to be happening is:

- gianfar with skb recycling enabled receives a number of fragments
  on a bridge (~64000b total)

- conntrack defragments the packet using ip_defrag(), which causes
  skb->truesize of the head fragment to account for all the fragments

- the packet is refragmented in the bridging code using ip_fragment().
  This doesn't re-adjust skb->truesize of the head fragment when the
  packet is not associated with a socket

- the head is recycled in gianfar

- another fragment is received and reuses the recycled skb with a
  huge truesize

- the defragmentation limits are exceeded due to the huge truesize

So it seems we need to adjust skb->truesize in ip_fragment() since
skb_recycle_check() assumes the skb is linear (and therefore
skb->truesize reflects the linear size). Ben's suggestions of adding
an upper limit based on the requested size to skb_recycle_check()
makes sense to me as well to avoid this problem when recycling large
linear skbs.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors
  2009-11-22  0:21     ` Patrick McHardy
@ 2009-11-22  0:29       ` Patrick McHardy
  2009-12-01 16:00         ` ben
  0 siblings, 1 reply; 8+ messages in thread
From: Patrick McHardy @ 2009-11-22  0:29 UTC (permalink / raw)
  To: ben; +Cc: David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 416 bytes --]

Patrick McHardy wrote:
> So it seems we need to adjust skb->truesize in ip_fragment() since
> skb_recycle_check() assumes the skb is linear (and therefore
> skb->truesize reflects the linear size). Ben's suggestions of adding
> an upper limit based on the requested size to skb_recycle_check()
> makes sense to me as well to avoid this problem when recycling large
> linear skbs.

Ben, please give this patch a try.

[-- Attachment #2: x --]
[-- Type: text/plain, Size: 428 bytes --]

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 322b408..031989d 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -501,8 +501,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			if (skb->sk) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
-				truesizes += frag->truesize;
 			}
+			truesizes += frag->truesize;
 		}
 
 		/* Everything is OK. Generate! */

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* RE: Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors
  2009-11-22  0:29       ` Patrick McHardy
@ 2009-12-01 16:00         ` ben
  2009-12-01 16:24           ` Patrick McHardy
  0 siblings, 1 reply; 8+ messages in thread
From: ben @ 2009-12-01 16:00 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David Miller, netdev

> Ben, please give this patch a try.

I have not been able to recreate the issue after applying the patch,
which is great.  Is this the only case in which large-ish SKBs might be
recycled and cause the reassembly overflow?

- Ben Menchaca

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors
  2009-12-01 16:00         ` ben
@ 2009-12-01 16:24           ` Patrick McHardy
  2009-12-01 23:54             ` David Miller
  0 siblings, 1 reply; 8+ messages in thread
From: Patrick McHardy @ 2009-12-01 16:24 UTC (permalink / raw)
  To: ben; +Cc: David Miller, netdev

[-- Attachment #1: Type: text/plain, Size: 391 bytes --]

ben@bigfootnetworks.com wrote:
>> Ben, please give this patch a try.
> 
> I have not been able to recreate the issue after applying the patch,
> which is great.

Thanks for testing.

> Is this the only case in which large-ish SKBs might be
> recycled and cause the reassembly overflow?

I'm not aware of any other cases at least.

Dave, attached is the patch again with a proper changelog.


[-- Attachment #2: 01.diff --]
[-- Type: text/x-patch, Size: 1421 bytes --]

commit d45f8b9ff2b7c1c5787348a39d3778931beca7e3
Author: Patrick McHardy <kaber@trash.net>
Date:   Tue Dec 1 17:19:14 2009 +0100

    ip_fragment: also adjust skb->truesize for packets not owned by a socket
    
    When a large packet gets reassembled by ip_defrag(), the head skb
    accounts for all the fragments in skb->truesize. If this packet is
    refragmented again, skb->truesize is not re-adjusted to reflect only
    the head size since its not owned by a socket. If the head fragment
    then gets recycled and reused for another received fragment, it might
    exceed the defragmentation limits due to its large truesize value.
    
    skb_recycle_check() explicitly checks for linear skbs, so any recycled
    skb should reflect its true size in skb->truesize. Change ip_fragment()
    to also adjust the truesize value of skbs not owned by a socket.
    
    Reported-and-tested-by: Ben Menchaca <ben@bigfootnetworks.com>
    Signed-off-by: Patrick McHardy <kaber@trash.net>

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index b78e615..e34013a 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -503,8 +503,8 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 			if (skb->sk) {
 				frag->sk = skb->sk;
 				frag->destructor = sock_wfree;
-				truesizes += frag->truesize;
 			}
+			truesizes += frag->truesize;
 		}
 
 		/* Everything is OK. Generate! */

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors
  2009-12-01 16:24           ` Patrick McHardy
@ 2009-12-01 23:54             ` David Miller
  0 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2009-12-01 23:54 UTC (permalink / raw)
  To: kaber; +Cc: ben, netdev

From: Patrick McHardy <kaber@trash.net>
Date: Tue, 01 Dec 2009 17:24:23 +0100

> Dave, attached is the patch again with a proper changelog.

Applied to net-2.6, thanks a lot everyone.

I'll queue this up for -stable as well.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2009-12-01 23:54 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-10 16:09 Bridge + Conntrack + SKB Recycle: Fragment Reassembly Errors ben
2009-11-10 16:50 ` Patrick McHardy
2009-11-21 19:08   ` David Miller
2009-11-22  0:21     ` Patrick McHardy
2009-11-22  0:29       ` Patrick McHardy
2009-12-01 16:00         ` ben
2009-12-01 16:24           ` Patrick McHardy
2009-12-01 23:54             ` David Miller

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.