netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -next] net: preserve geometry of fragment sizes when forwarding
@ 2015-05-07 21:04 Florian Westphal
  2015-05-18 19:39 ` David Miller
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2015-05-07 21:04 UTC (permalink / raw)
  To: netdev; +Cc: hannes, Florian Westphal, Eric Dumazet

There was interest in keeping geometry of original fragments on forward.

This (re)enables this feature.

on router with mtu 1500 on all interfaces and netfilter conntrack enabled:

incoming packet on router:
IP (ttl 64, offset 0, flags [+], ICMP, length 1276) 192.168.7.1 > 10.0.0.2: ICMP echo request, length 1256
IP (ttl 64, offset 1256, flags [+], ICMP, length 1276) 192.168.7.1 > 10.0.0.2: ip-proto-1
IP (ttl 64, offset 2512, flags [none], ICMP, length 516) 192.168.7.1 > 10.0.0.2: ip-proto-1

Without patch, refragmentation uses device mtu. incoming packet on destination host:
IP (ttl 63, offset 0, flags [+], ICMP, length 1500) 192.168.7.1 > 10.0.0.2: ICMP echo request, length 1480
IP (ttl 63, offset 1480, flags [+], ICMP, length 1500) 192.168.7.1 > 10.0.0.2: ip-proto-1
IP (ttl 63, offset 2960, flags [none], ICMP, length 68) 192.168.7.1 > 10.0.0.2: ip-proto-1

With patch, ip_fragment skb_has_frag_list fastpath gets used:
IP (ttl 63, offset 0, flags [+], ICMP, length 1276) 192.168.7.1 > 10.0.0.2: ICMP echo request, length 1256
IP (ttl 63, offset 1256, flags [+], ICMP, length 1276) 192.168.7.1 > 10.0.0.2: ip-proto-1
IP (ttl 63, offset 2512, flags [none], ICMP, length 516) 192.168.7.1 > 10.0.0.2: ip-proto-1

Caveat:
This disables the optimization made in commit
3cc4949269e01f39443d0 ("ipv4: use skb coalescing in defragmentation") for
everyone as soon as nf_defrag_ipv4 modules are loaded (conntrack defrag
hooks earlier than ipv4 stacks own defragmentation for local delivery),
and there is no way to easily determine if we will forward the skb at that
stage.

ip_fragment checks the size of the frag skbs vs. the outgoing device mtu
before using them so if device mtu is smaller than the frag skb length
the device mtu will be used instead for refragmentation.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/ip_fragment.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index cc1da6d..31fbb18 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -93,7 +93,7 @@ int ip_frag_mem(struct net *net)
 }
 
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-			 struct net_device *dev);
+			 struct net_device *dev, bool preserve_frags);
 
 struct ip4_create_arg {
 	struct iphdr *iph;
@@ -315,7 +315,8 @@ static int ip_frag_reinit(struct ipq *qp)
 }
 
 /* Add new segment to existing queue. */
-static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
+static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb,
+			 bool preserve_frags)
 {
 	struct sk_buff *prev, *next;
 	struct net_device *dev;
@@ -483,7 +484,7 @@ found:
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, prev, dev);
+		err = ip_frag_reasm(qp, prev, dev, preserve_frags);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -500,7 +501,7 @@ err:
 /* Build a new IP datagram from all its fragments. */
 
 static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
-			 struct net_device *dev)
+			 struct net_device *dev, bool preserve_frags)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
@@ -590,7 +591,8 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		else if (head->ip_summed == CHECKSUM_COMPLETE)
 			head->csum = csum_add(head->csum, fp->csum);
 
-		if (skb_try_coalesce(head, fp, &headstolen, &delta)) {
+		if (!preserve_frags &&
+		    skb_try_coalesce(head, fp, &headstolen, &delta)) {
 			kfree_skb_partial(fp, headstolen);
 		} else {
 			if (!skb_shinfo(head)->frag_list)
@@ -629,6 +631,11 @@ out_fail:
 	return err;
 }
 
+static bool preserve_fraglist(u32 user)
+{
+	return user != IP_DEFRAG_LOCAL_DELIVER;
+}
+
 /* Process an incoming IP datagram fragment. */
 int ip_defrag(struct sk_buff *skb, u32 user)
 {
@@ -645,7 +652,7 @@ int ip_defrag(struct sk_buff *skb, u32 user)
 
 		spin_lock(&qp->q.lock);
 
-		ret = ip_frag_queue(qp, skb);
+		ret = ip_frag_queue(qp, skb, preserve_fraglist(user));
 
 		spin_unlock(&qp->q.lock);
 		ipq_put(qp);
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-07 21:04 [PATCH -next] net: preserve geometry of fragment sizes when forwarding Florian Westphal
@ 2015-05-18 19:39 ` David Miller
  2015-05-18 20:06   ` Florian Westphal
  0 siblings, 1 reply; 13+ messages in thread
From: David Miller @ 2015-05-18 19:39 UTC (permalink / raw)
  To: fw; +Cc: netdev, hannes, edumazet

From: Florian Westphal <fw@strlen.de>
Date: Thu,  7 May 2015 23:04:24 +0200

> There was interest in keeping geometry of original fragments on forward.
> 
> This (re)enables this feature.
> 
> on router with mtu 1500 on all interfaces and netfilter conntrack enabled:
 ...
> Caveat:
> This disables the optimization made in commit
> 3cc4949269e01f39443d0 ("ipv4: use skb coalescing in defragmentation") for
> everyone as soon as nf_defrag_ipv4 modules are loaded (conntrack defrag
> hooks earlier than ipv4 stacks own defragmentation for local delivery),
> and there is no way to easily determine if we will forward the skb at that
> stage.
> 
> ip_fragment checks the size of the frag skbs vs. the outgoing device mtu
> before using them so if device mtu is smaller than the frag skb length
> the device mtu will be used instead for refragmentation.
> 
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>

Indeed, I agree that we should only modify the packet's geomtry if we
know it's to be locally delivered.

But paying the cost just because a netfilter module is loaded, that's
really heavy handed and shows really bad engineering on our part.

When I hear "happens when netfilter modules are loaded", it translates
into my head as "all the time".  And for you it should too, because
effectively that's how the world operates.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 19:39 ` David Miller
@ 2015-05-18 20:06   ` Florian Westphal
  2015-05-18 20:28     ` David Miller
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2015-05-18 20:06 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev, hannes, edumazet

David Miller <davem@davemloft.net> wrote:
> > There was interest in keeping geometry of original fragments on forward.
> > 
> > This (re)enables this feature.
> > 
> > on router with mtu 1500 on all interfaces and netfilter conntrack enabled:
>  ...
> > Caveat:
> > This disables the optimization made in commit
> > 3cc4949269e01f39443d0 ("ipv4: use skb coalescing in defragmentation") for
> > everyone as soon as nf_defrag_ipv4 modules are loaded (conntrack defrag
> > hooks earlier than ipv4 stacks own defragmentation for local delivery),
> > and there is no way to easily determine if we will forward the skb at that
> > stage.
> > 
> > ip_fragment checks the size of the frag skbs vs. the outgoing device mtu
> > before using them so if device mtu is smaller than the frag skb length
> > the device mtu will be used instead for refragmentation.
> > 
> > Cc: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Florian Westphal <fw@strlen.de>
> 
> Indeed, I agree that we should only modify the packet's geomtry if we
> know it's to be locally delivered.
> 
> But paying the cost just because a netfilter module is loaded, that's
> really heavy handed and shows really bad engineering on our part.
> 
> When I hear "happens when netfilter modules are loaded", it translates
> into my head as "all the time".  And for you it should too, because
> effectively that's how the world operates.

Yes.  This is why I don't like this patch either.

But, where do I go from here?

I'd like to get rid of the bridge netfilter specific hacks in
ip_fragment.  But all my previous attempts were NAKed.

solution #1: add mtu argument (most simple solution):
http://patchwork.ozlabs.org/patch/457420/

NAK: "Anything that adds
an 'mtu' argument to ip_fragment() I am not even going to look
at"

#2: store largest frag size in IPCB and use that:
http://patchwork.ozlabs.org/patch/467837/
("not enough, must preserve geometry of all fragments")

... and this patch was my attempt to do that.

I could even tolerate the br nf legacy crap in ip_fragment and
just pretend its not there.

BUT: ipv6 conntrack on top of bridge is completely broken
(bridge tosses all reassembled packets).

And I absolutely under no circumstances will send patches
to add the same br nf crap that we have in ipv4 to ipv6 stack.
[ First patches to do this were sent to nf-devel a while back,
  so this problem does hurt users ].

To find something that works for ipv4 will hopefully also allow
re-using that approach for ipv6 and fix this mess once and for all.

I've even entertainted copy-pasting all of ip_fragment into
bridge netfilter & make needed changes but thats insane too.

So, please please re-evaluate your stance on any of the previous
attempts or tell me how you would provide bridge netfilter with
the means to transparently forward (refrag) reassembled skbs, without
breaking PMTUD, in ipv4 and ipv6.

Thank you.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 20:06   ` Florian Westphal
@ 2015-05-18 20:28     ` David Miller
  2015-05-18 20:40       ` Florian Westphal
  0 siblings, 1 reply; 13+ messages in thread
From: David Miller @ 2015-05-18 20:28 UTC (permalink / raw)
  To: fw; +Cc: netdev, hannes, edumazet, herbert

From: Florian Westphal <fw@strlen.de>
Date: Mon, 18 May 2015 22:06:37 +0200

> So, please please re-evaluate your stance on any of the previous
> attempts or tell me how you would provide bridge netfilter with
> the means to transparently forward (refrag) reassembled skbs, without
> breaking PMTUD, in ipv4 and ipv6.

I know you really don't want to do it, but I really want to see
the "GRO'ish" idea implemented to solve all of these problems.

I know it's hard, and you're making it clear here that you'd
rather just pass an mtu argument around or duplicate the entire
ip fragmentation code base into br_netfilter to solve this problem.

But as networking maintainer I'm supposed to tell you "no" when
those kinds of proposals are being made.  Ok?

We have amazing infrastructure for dealing with oddly segmented
packets, such as skb_header_pointer().  So parsing things in
a fragmented SKB should be no problem regardless of where the
split points are.

We could have suitable interfaces for writing to packets as well,
which would be equally fast as direct access unless the SKB is
split in the middle of the object you want to write into.

The only real barrier left is overlapping fragments, and for that I'd
say just that for netfilter we can just trim the edges, just like the
ip_fragment code already does, and adjust the protocol headers
as-needed.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 20:28     ` David Miller
@ 2015-05-18 20:40       ` Florian Westphal
  2015-05-18 20:55         ` David Miller
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2015-05-18 20:40 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev, hannes, edumazet, herbert

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Mon, 18 May 2015 22:06:37 +0200
> 
> > So, please please re-evaluate your stance on any of the previous
> > attempts or tell me how you would provide bridge netfilter with
> > the means to transparently forward (refrag) reassembled skbs, without
> > breaking PMTUD, in ipv4 and ipv6.
> 
> I know you really don't want to do it, but I really want to see
> the "GRO'ish" idea implemented to solve all of these problems.
> 
> I know it's hard, and you're making it clear here that you'd
> rather just pass an mtu argument around or duplicate the entire
> ip fragmentation code base into br_netfilter to solve this problem.

Its not 'hard'.  I don't see how its possible to do this.

> But as networking maintainer I'm supposed to tell you "no" when
> those kinds of proposals are being made.  Ok?

Sure.

> We have amazing infrastructure for dealing with oddly segmented
> packets, such as skb_header_pointer().  So parsing things in
> a fragmented SKB should be no problem regardless of where the
> split points are.

Netfilter works fine with reassembled skbs that have frag lists.
Parsing is also not that much of a problem, modifying/growing is.

> We could have suitable interfaces for writing to packets as well,
> which would be equally fast as direct access unless the SKB is
> split in the middle of the object you want to write into.

When I send patches for inclusion in the kernel, I do this to fix
things, or I do it because I believe such patches improve kernel in some
way.

I try to imagine how e.g. the H264 or SIP nat helpers would look like
after such a change and it makes me cringe.

But, to the best of my understanding, what you ask will push a lot of
non-trivial code into the kernel for no functional gain over
what has been proposed.

But, even if we'd have magic solution that does what you want we've
gained nothing; there are (rare) cases where packets get completely modified
(e.g. payload is replaced from userspace/nfqueue; xfrm mangling, etc etc
 so we will not be able to escape this either).

Maybe I am just too incompetent.

I've tried the best I could do.  Perhaps someone else can pick this up.

Alas, I'll bring this up during NFWS 2015.  Maybe someone will know how
to do what you are asking.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 20:40       ` Florian Westphal
@ 2015-05-18 20:55         ` David Miller
  2015-05-18 21:33           ` Florian Westphal
  0 siblings, 1 reply; 13+ messages in thread
From: David Miller @ 2015-05-18 20:55 UTC (permalink / raw)
  To: fw; +Cc: netdev, hannes, edumazet, herbert

From: Florian Westphal <fw@strlen.de>
Date: Mon, 18 May 2015 22:40:49 +0200

> But, to the best of my understanding, what you ask will push a lot of
> non-trivial code into the kernel for no functional gain over
> what has been proposed.

The functional gain is that we stop linearizing the packet, which
involves memory allocation and copying the entire packet.

I am very confident that the performance gains would be non-trivial
and quite measurable.

You'd also be able to trivially respect the geometry of the original
incoming packet stream.

Every objection has been of the form "this special case" (this time
SIP) is not easy.

If I were doing this, I would implement something that handles the
normal cases properly.  And then take it from there.

If you try to imagine the totality of it and all the edge cases
and details from the beginning, yes it will look impossible.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 20:55         ` David Miller
@ 2015-05-18 21:33           ` Florian Westphal
  2015-05-18 22:50             ` Herbert Xu
  2015-05-18 23:51             ` David Miller
  0 siblings, 2 replies; 13+ messages in thread
From: Florian Westphal @ 2015-05-18 21:33 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev, hannes, edumazet, herbert

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Mon, 18 May 2015 22:40:49 +0200
> 
> > But, to the best of my understanding, what you ask will push a lot of
> > non-trivial code into the kernel for no functional gain over
> > what has been proposed.
> 
> The functional gain is that we stop linearizing the packet, which
> involves memory allocation and copying the entire packet.

AFAICS ipv4 and ipv6 defragmentations do not perform linearizations or
reallocations?

> I am very confident that the performance gains would be non-trivial
> and quite measurable.

Are fragmented packets that common?
I don't have any real data on this, the box sending this email has

55965898 incoming packets delivered
62 reassemblies required

... but it is just an end host.

TCP shouldn't be a problem thanks to pmtud, and for high-volume
fragmented ipv4 flows i'd expect poor performance due to the 16 bit ID space
limitations long before processing bottleneck.

> You'd also be able to trivially respect the geometry of the original
> incoming packet stream.

True.  OTOH, the patch proposed in this thread would have done the same
with a lot less code (I admit that removing the optimization from Eric
once nf_defrag is loaded is not desirable; but I did not find a solution
to this problem aside from doing route lookup or tentative 'forward is
off') check, which I did not like.

Another alternative might be to delay Erics 'coalescing' step and move
it into the ip stack, after 'local delivery' decision was taken.

I can investigate this if you think its worth it.

> Every objection has been of the form "this special case" (this time
> SIP) is not easy.

Yes, but these objections are not some random hand-waving gesture.
It presents us with certain dilemmas, e.g. single udp packet:

1280 1280 1280 542

sip nat helper has to do nat/pat and replaces 10.2.3.4 with 192.168.2.3
(lets assume we'd have helpers that deal with addresses split over 2
 fragmented skbs so we can deal with 10.2 appearing in fragment #2
  and .3.4 in fragment #3)

We can then end up with something like
1283 1281 1282 542

... and what should we do then?

shuffe payload via memcpy/memmove() to only grow last frag?
This will not be hot path or common by any means.
But nervertheless, this can happen, and we need to deal with it.

> If I were doing this, I would implement something that handles the
> normal cases properly.  And then take it from there.

What is a 'normal case'?
And how do you propose we deal with the 'non-normal' cases?

I assume you mean to e.g. linearize for edge cases + then refragment?

If thats true, then we'd still need one of the proposed solutions to handle
this to get packets we can send out without breaking geometry/growing
fragments to a larger mtu.

> If you try to imagine the totality of it and all the edge cases
> and details from the beginning, yes it will look impossible.

Hmm...  correct, but I still believe we're talking immense pain
for very little gain.

Thanks for spending time on this.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 21:33           ` Florian Westphal
@ 2015-05-18 22:50             ` Herbert Xu
  2015-05-18 23:02               ` Florian Westphal
  2015-05-18 23:51             ` David Miller
  1 sibling, 1 reply; 13+ messages in thread
From: Herbert Xu @ 2015-05-18 22:50 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev, hannes, edumazet

On Mon, May 18, 2015 at 11:33:29PM +0200, Florian Westphal wrote:
>
> > Every objection has been of the form "this special case" (this time
> > SIP) is not easy.
> 
> Yes, but these objections are not some random hand-waving gesture.
> It presents us with certain dilemmas, e.g. single udp packet:
> 
> 1280 1280 1280 542
> 
> sip nat helper has to do nat/pat and replaces 10.2.3.4 with 192.168.2.3
> (lets assume we'd have helpers that deal with addresses split over 2
>  fragmented skbs so we can deal with 10.2 appearing in fragment #2
>   and .3.4 in fragment #3)
> 
> We can then end up with something like
> 1283 1281 1282 542
> 
> ... and what should we do then?

I think what David is saying that you can apply your special
logic to these edge cases, e.g., linearise them and on output
use your MTU cap to refragment.

However, for the vast majority of fragments that you receive,
which would not be modified by NAT, they should retain their
original geometry.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 22:50             ` Herbert Xu
@ 2015-05-18 23:02               ` Florian Westphal
  2015-05-18 23:20                 ` Herbert Xu
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2015-05-18 23:02 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Florian Westphal, David Miller, netdev, hannes, edumazet

Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, May 18, 2015 at 11:33:29PM +0200, Florian Westphal wrote:
> >
> > > Every objection has been of the form "this special case" (this time
> > > SIP) is not easy.
> > 
> > Yes, but these objections are not some random hand-waving gesture.
> > It presents us with certain dilemmas, e.g. single udp packet:
> > 
> > 1280 1280 1280 542
> > 
> > sip nat helper has to do nat/pat and replaces 10.2.3.4 with 192.168.2.3
> > (lets assume we'd have helpers that deal with addresses split over 2
> >  fragmented skbs so we can deal with 10.2 appearing in fragment #2
> >   and .3.4 in fragment #3)
> > 
> > We can then end up with something like
> > 1283 1281 1282 542
> > 
> > ... and what should we do then?
> 
> I think what David is saying that you can apply your special
> logic to these edge cases, e.g., linearise them and on output
> use your MTU cap to refragment.
> 
> However, for the vast majority of fragments that you receive,
> which would not be modified by NAT, they should retain their
> original geometry.

That would be achived by this patch (perhaps altered to coalesce later
in ip stack when we're sure skb is locally delivered rather than doing
it depending on nf_defrag module presence) plus the earlier patches to:

1. track largest fragment size unconditionally in IPCB
2. refragment to at most this size.

This would keep full geometry of all fragments EXCEPT for
edge cases (overlapping fragments, xfrm, nfqueue payload rewrite from
userspace and anything else that destroys frag lists).

For those edge cases, we would rely on the
IPCB information to re-fragment according to the largest original fragment.

For the vast majority, we would just re-frag via the original
skbs from the skb frag list.  The nice thing is that this is transparent to
netfilter (just another non-linear skb) and ip_fragment can even cope with
reduced mtu on output path (when we eg. receive 1400 byte fragments but
packet is routed via outdevice with 1280 byte mtu).

What am I missing?

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 23:02               ` Florian Westphal
@ 2015-05-18 23:20                 ` Herbert Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Herbert Xu @ 2015-05-18 23:20 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev, hannes, edumazet

On Tue, May 19, 2015 at 01:02:55AM +0200, Florian Westphal wrote:
>
> That would be achived by this patch (perhaps altered to coalesce later
> in ip stack when we're sure skb is locally delivered rather than doing
> it depending on nf_defrag module presence) plus the earlier patches to:

I don't see anything wrong with this particular patch.  If anything
you should extend this patch to also cover IP_DEFRAG_CALL_RA_CHAIN
which AFAICS is not forwarded.

David?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 21:33           ` Florian Westphal
  2015-05-18 22:50             ` Herbert Xu
@ 2015-05-18 23:51             ` David Miller
  2015-05-19 12:34               ` Florian Westphal
  1 sibling, 1 reply; 13+ messages in thread
From: David Miller @ 2015-05-18 23:51 UTC (permalink / raw)
  To: fw; +Cc: netdev, hannes, edumazet, herbert

From: Florian Westphal <fw@strlen.de>
Date: Mon, 18 May 2015 23:33:29 +0200

> Thanks for spending time on this.

Ok, I've heard what you have to say.

Of the fixes you've proposed already, I prefer the device MTU one
because it doesn't penalize the ip_fragment.c optimizations just
because a netfilter module is loaded.

Which of your already proposed patches do you prefer, and why?

I'm happy to apply one or the other as an interim step to make
forward progress.  However, I really want to seriously consider
our long term handling of fragments in netfilter meanwhile.

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-18 23:51             ` David Miller
@ 2015-05-19 12:34               ` Florian Westphal
  2015-05-19 19:34                 ` Jay Vosburgh
  0 siblings, 1 reply; 13+ messages in thread
From: Florian Westphal @ 2015-05-19 12:34 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev, hannes, edumazet, herbert

David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Mon, 18 May 2015 23:33:29 +0200
> 
> > Thanks for spending time on this.
> 
> Ok, I've heard what you have to say.
> 
> Of the fixes you've proposed already, I prefer the device MTU one
> because it doesn't penalize the ip_fragment.c optimizations just
> because a netfilter module is loaded.

True, OTOH it only allows removing the br_nf hacks, it doesn't fix
a few corner cases (see below).

> Which of your already proposed patches do you prefer, and why?

The IPCB one, main change from that set:
http://patchwork.ozlabs.org/patch/460252/

> I'm happy to apply one or the other as an interim step to make
> forward progress.  However, I really want to seriously consider
> our long term handling of fragments in netfilter meanwhile.

Here is a summary of the two.

Both allow removal of bridge netfilter mtu kludge that we have in
ip_fragment, but they have different pros and cons.

A. Add mtu argument to ip_fragment.
 pro: actual change is small (most is refactoring)
 pro: obvious code flow for bridge netfilter, we can
  pass device_mtu - reserved_mtu (encap overhead for ppp, vlan, etc)
  as 'new' device mtu, effectively moving this hack from ip_fragment to
  br_netfilter.

 cons: doesn't resolve hypothetical(?) router edge cases:
   #1. if we'd receive e.g. fragment size 1000 and fragment size
   200, where both fragments have DF bit set, we will currently
   send a single 1200 byte DF packet, if outdevice has mtu >= 1200.

   This is because ip_finish_output() only fragments if skb->len > mtu.

   #2 iff we do refragment (e.g. out mtu is 1000), we re-create fragments
   of size 1000 and 200, but DF bit gets lost in the process.

Neither seems to be a big problem, #2 doesn't break end-to-end connectivity,
and #1 doesn't seem to happen in practice; else we should have seen
bug reports by now.  However, I do feel uneasy about it and think we should
fix it.

last A series diffstat:
 include/linux/netfilter_bridge.h |    7 -------
 include/net/ip.h                 |    2 +-
 net/bridge/br_netfilter.c        |   34 +++++++++++++++++++++++++++++++---
 net/ipv4/ip_output.c             |   29 ++++++++++++++++++-----------
 4 files changed, 50 insertions(+), 22 deletions(-)

B. store largest fragment size in IPCB and use that.
  pro: we already do this to reject refragmented skbs that had DF set if
  the largest original fragment is larger than the outdevice mtu, so this
  change is not as big as one might think.

  The main aspects of this change is that ip_fragment will do

  mtu = IPCB(skb)->largest_frag_size;
  if (!mtu)
	mtu = ip_skb_dst_mtu(skb);

  instead of just calling ip_skb_dst_mtu(), and that we will call
  ip_fragment on all skbs that have new IPSKB_FRAG_PMTU flag set, i.e.

  if ((IPCB(skb)->flags & IPSKB_FRAG_PMTU) || skb->len > mtu)
	ip_fragment()

  This will prevent us from sending packets that are larger than
  the biggest original fragment we originally got.

  bridge netfilter can make use of this; since we store largest
  size seen we can just remove the mtu encap overhead calculations.

  For ipv6, bridge netfilter will have to be fixed
  to preserve IP6CB largest size, just like we do for IPCB.
  This will be a bridge netfilter specific change only.

last B series had this diffstat:
 include/net/inet_frag.h |    2 +-
 include/net/ip.h        |    1 +
 net/ipv4/ip_forward.c   |   18 +++++++++++-------
 net/ipv4/ip_fragment.c  |   31 ++++++++++++++++++++++++++-----
 net/ipv4/ip_output.c    |   37 ++++++++++++++++++++++++++++---------
 net/ipv6/ip6_output.c   |   29 ++++++++++++++++++-----------
 6 files changed, 85 insertions(+), 33 deletions(-)

My suggestion would be to go with #B.  Its a more intrusive change compared to A,
but it allows to get rid of br netfilter hack while also resolving silly and/or
rare corner cases, such as e.g. reassembling two ipv6 fragments and then failing
to undo it on forward.

Neither approach prevents us from making skb_has_frag_list() fast-path
in ip_fragment work again at a later point.

I can rebase and resubmit both proposals if you prefer so you can review
them in more detail.

Thanks.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH -next] net: preserve geometry of fragment sizes when forwarding
  2015-05-19 12:34               ` Florian Westphal
@ 2015-05-19 19:34                 ` Jay Vosburgh
  0 siblings, 0 replies; 13+ messages in thread
From: Jay Vosburgh @ 2015-05-19 19:34 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev, hannes, edumazet, herbert

Florian Westphal <fw@strlen.de> wrote:
[...]
> cons: doesn't resolve hypothetical(?) router edge cases:
>   #1. if we'd receive e.g. fragment size 1000 and fragment size
>   200, where both fragments have DF bit set, we will currently
>   send a single 1200 byte DF packet, if outdevice has mtu >= 1200.
>
>   This is because ip_finish_output() only fragments if skb->len > mtu.
>
>   #2 iff we do refragment (e.g. out mtu is 1000), we re-create fragments
>   of size 1000 and 200, but DF bit gets lost in the process.
>
>Neither seems to be a big problem, #2 doesn't break end-to-end connectivity,
>and #1 doesn't seem to happen in practice; else we should have seen
>bug reports by now.  However, I do feel uneasy about it and think we should
>fix it.

	FWIW, I've seen Openstack-based network topologies with local
customizations to take advantage of #1 to deliberately avoid
refragmentation on bridge egress.  In this case, the bridge egress port
is a veth device, and the bridge-side veth mtu is set higher than the
veth peer's mtu, which is in a container.  The veth driver does not
check mtu on transmit, so this "works."

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-05-19 19:34 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-07 21:04 [PATCH -next] net: preserve geometry of fragment sizes when forwarding Florian Westphal
2015-05-18 19:39 ` David Miller
2015-05-18 20:06   ` Florian Westphal
2015-05-18 20:28     ` David Miller
2015-05-18 20:40       ` Florian Westphal
2015-05-18 20:55         ` David Miller
2015-05-18 21:33           ` Florian Westphal
2015-05-18 22:50             ` Herbert Xu
2015-05-18 23:02               ` Florian Westphal
2015-05-18 23:20                 ` Herbert Xu
2015-05-18 23:51             ` David Miller
2015-05-19 12:34               ` Florian Westphal
2015-05-19 19:34                 ` Jay Vosburgh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).