[PATCH -next 0/3] net: cap size to original frag size when refragmenting

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH -next 0/3] net: cap size to original frag size when refragmenting
@ 2015-04-10 22:16 Florian Westphal
  2015-04-10 22:16 ` [PATCH -next 1/3] ipv4: reject too-big defragmented DF-skb when forwarding Florian Westphal
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Florian Westphal @ 2015-04-10 22:16 UTC (permalink / raw)
  To: netdev; +Cc: kaber

This series alters ipv4 and ipv6 fragmentation to ensure that we do not
increase the size of the original fragments when refragmenting.

For IPv4, we only do this when DF bit was set on original fragments since
path mtu discovery doesn't happen otherwise.

tested via:
#!/usr/bin/python
from scapy.all import *
dip="10.23.42.2"
payload="A"*1400
packet=IP(dst=dip,id=12345,flags='DF')/UDP(sport=42,dport=42)/payload
frags=fragment(packet,fragsize=1200)
for fragment in frags:
    send(fragment)

Without this patch, we generate fragments without df bit set based
on the outgoing device mtu when fragmenting after forwarding, ie.

IP (ttl 64, id 12345, offset 0, flags [+, DF], proto UDP (17), length 1204)
    192.168.7.1.42 > 10.23.42.2.42: UDP, length 1400
IP (ttl 64, id 12345, offset 1184, flags [DF], proto UDP (17), length 244)
    192.168.7.1 > 10.23.42.2: ip-proto-17

on ingress will turn into

IP (ttl 63, id 12345, offset 0, flags [+], proto UDP (17), length 1396)
    192.168.7.1.42 > 10.23.42.2.42: UDP, length 1400
IP (ttl 63, id 12345, offset 1376, flags [none], proto UDP (17), length 52)

(mtu is 1400, we strip df and send larger fragment), or

IP (ttl 63, id 12345, offset 0, flags [DF], proto UDP (17), length 1428)
    192.168.7.1.42 > 10.23.42.2.42: [udp sum ok] UDP, length 1400

if mtu is 1500.

The latter is worse since we keep DF, so anything in the path that has smaller
mtu will send icmp error; but original sender never sent packets larger
than 1204 byte.

With patch, we keep the intent of such fragments and will emit DF-fragments that
won't exceed 1204 byte in size.

IPv6 has similar issue so respect the maximum seen fragment size there as well.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH -next 1/3] ipv4: reject too-big defragmented DF-skb when forwarding
  2015-04-10 22:16 [PATCH -next 0/3] net: cap size to original frag size when refragmenting Florian Westphal
@ 2015-04-10 22:16 ` Florian Westphal
  2015-04-10 22:51   ` Hannes Frederic Sowa
  2015-04-10 22:16 ` [PATCH -next 2/3] ipv6: don't increase size when refragmenting forwarded skbs Florian Westphal
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Florian Westphal @ 2015-04-10 22:16 UTC (permalink / raw)
  To: netdev; +Cc: kaber, Florian Westphal

Send icmp pmtu error if we find that the largest fragment of df-skb
exceeded the output path mtu.

The ip output path will still catch this later on but we can avoid the
forward/postrouting hook traversal by rejecting right away.

This is what ipv6 already does.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/ip_forward.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 939992c..494f62b 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -39,17 +39,22 @@
 #include <net/route.h>
 #include <net/xfrm.h>
 
-static bool ip_may_fragment(const struct sk_buff *skb)
-{
-	return unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0) ||
-		skb->ignore_df;
-}
-
 static bool ip_exceeds_mtu(const struct sk_buff *skb, unsigned int mtu)
 {
+	if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
+		return false;
+
 	if (skb->len <= mtu)
 		return false;
 
+	/* original fragment had DF set */
+	if (unlikely(IPCB(skb)->frag_max_size) &&
+	    IPCB(skb)->frag_max_size > mtu)
+		return true;
+
+	if (skb->ignore_df)
+		return false;
+
 	if (skb_is_gso(skb) && skb_gso_network_seglen(skb) <= mtu)
 		return false;
 
@@ -111,7 +116,7 @@ int ip_forward(struct sk_buff *skb)
 
 	IPCB(skb)->flags |= IPSKB_FORWARDED;
 	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
-	if (!ip_may_fragment(skb) && ip_exceeds_mtu(skb, mtu)) {
+	if (ip_exceeds_mtu(skb, mtu)) {
 		IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 			  htonl(mtu));
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH -next 2/3] ipv6: don't increase size when refragmenting forwarded skbs
  2015-04-10 22:16 [PATCH -next 0/3] net: cap size to original frag size when refragmenting Florian Westphal
  2015-04-10 22:16 ` [PATCH -next 1/3] ipv4: reject too-big defragmented DF-skb when forwarding Florian Westphal
@ 2015-04-10 22:16 ` Florian Westphal
  2015-04-10 22:16 ` [PATCH -next 3/3] ipv4: don't remove df bit when refragmenting Florian Westphal
  2015-04-13 17:53 ` [PATCH -next 0/3] net: cap size to original frag size " David Miller
  3 siblings, 0 replies; 25+ messages in thread
From: Florian Westphal @ 2015-04-10 22:16 UTC (permalink / raw)
  To: netdev; +Cc: kaber, Florian Westphal

since commit 6aafeef03b9d ("netfilter: push reasm skb through instead of
original frag skbs") we will end up sometimes re-fragmenting skbs
that we've reassembled.

In this case, we must not use the device mtu, we might increase the
fragment size, possibly breaking connectivity as later router
might send packet-too-big errors even if sender never sent fragments
exceeding the reported mtu:

mtu 1500 - 1500:1400 - 1400:1280 - 1280
     A         R1         R2        B

- A sends to B, fragment size 1500
- R1 sends pkttoobig error for 1400
- A sends to B, fragment size 1400
- R2 sends pkttoobig error for 1280
- A sends to B, fragment size 1280
- R2 sends pkttoobig error for 1280 again because it sees
fragments of size 1400.

This doesn't occur in practice because we currently use routing
information to fetch the mtu.

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv6/ip6_output.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 7fde1f2..604c99d 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -564,18 +564,17 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 	/* We must not fragment if the socket is set to force MTU discovery
 	 * or if the skb it not generated by a local socket.
 	 */
-	if (unlikely(!skb->ignore_df && skb->len > mtu) ||
-		     (IP6CB(skb)->frag_max_size &&
-		      IP6CB(skb)->frag_max_size > mtu)) {
-		if (skb->sk && dst_allfrag(skb_dst(skb)))
-			sk_nocaps_add(skb->sk, NETIF_F_GSO_MASK);
+	if (unlikely(!skb->ignore_df && skb->len > mtu))
+		goto fail_toobig;
 
-		skb->dev = skb_dst(skb)->dev;
-		icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
-		IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
-			      IPSTATS_MIB_FRAGFAILS);
-		kfree_skb(skb);
-		return -EMSGSIZE;
+	if (IP6CB(skb)->frag_max_size) {
+		if (IP6CB(skb)->frag_max_size > mtu)
+			goto fail_toobig;
+
+		/* don't send larger fragments than what we received */
+		mtu = IP6CB(skb)->frag_max_size;
+		if (mtu < IPV6_MIN_MTU)
+			mtu = IPV6_MIN_MTU;
 	}
 
 	if (np && np->frag_size < mtu) {
@@ -815,6 +814,14 @@ slow_path:
 	consume_skb(skb);
 	return err;
 
+fail_toobig:
+	if (skb->sk && dst_allfrag(skb_dst(skb)))
+		sk_nocaps_add(skb->sk, NETIF_F_GSO_MASK);
+
+	skb->dev = skb_dst(skb)->dev;
+	icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
+	err = -EMSGSIZE;
+
 fail:
 	IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
 		      IPSTATS_MIB_FRAGFAILS);
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH -next 3/3] ipv4: don't remove df bit when refragmenting
  2015-04-10 22:16 [PATCH -next 0/3] net: cap size to original frag size when refragmenting Florian Westphal
  2015-04-10 22:16 ` [PATCH -next 1/3] ipv4: reject too-big defragmented DF-skb when forwarding Florian Westphal
  2015-04-10 22:16 ` [PATCH -next 2/3] ipv6: don't increase size when refragmenting forwarded skbs Florian Westphal
@ 2015-04-10 22:16 ` Florian Westphal
  2015-04-12  8:51   ` Florian Westphal
  2015-04-13 17:53 ` [PATCH -next 0/3] net: cap size to original frag size " David Miller
  3 siblings, 1 reply; 25+ messages in thread
From: Florian Westphal @ 2015-04-10 22:16 UTC (permalink / raw)
  To: netdev; +Cc: kaber, Florian Westphal

We always send fragments without DF bit set.

Thus, given following setup:

mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
   A           R1               R2         B

Where R1 and R2 run linux with netfilter defragmentation/conntrack
enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
will respond with icmp too big error if one of these fragments exceeded
1400 bytes.  So far, so good.

However, the host A will never learn about the lower 1280 link.
The next packet presumably sent by A would use 1400 as the new fragment
size, but R1 will strip DF bit when refragmenting.

Whats worse: If R1 receives fragment sizes 1200 and 100, it would
forward the reassembled packet without refragmenting, i.e.
R2 would send an icmp error in response to a packet that was never sent,
citing mtu that the original sender never exceeded.

In order to 'replay' the original fragments to preserve their semantics,
one solution is to

 1. set DF bit on the new fragments if it was set on original ones.
 2. set the size of the new fragments generated by R1 during
    refragmentation to the largest size seen when defragmenting.

R2 will then notice the problem and will send the expected
'too big, use 1280' icmp error, and further fragments of this size
would not grow anymore to 1400 link mtu when R1 refragments.

There is however, one important caveat. We cannot just use existing
IPCB(skb)->frag_max_size as upper boundary for refragmentation.

We have to consider a case where we receive a large fragment without DF,
followed by a small fragment with DF set.

In such scenario we must not generate a large spew of small DF-fragments
(else we induce packet/traffic amplification).

This modifies ip_fragment so that we track largest fragment size seen
both for DF and non-DF packets.

Then, when we find that we had at least one DF fragment AND the largest
non-DF fragment did not exceed one with DF set, let ip_fragment know that
it should refragment using original frag size and also set DF bit on the
newly created fragments.

Joint work with Hannes Frederic Sowa.

Reported-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 include/net/inet_frag.h |  2 +-
 include/net/ip.h        |  1 +
 net/ipv4/ip_fragment.c  | 27 ++++++++++++++++++++++-----
 net/ipv4/ip_output.c    | 43 ++++++++++++++++++++++++++++++++++---------
 4 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 8d17655..e1300b3 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -43,7 +43,7 @@ enum {
  * @len: total length of the original datagram
  * @meat: length of received fragments so far
  * @flags: fragment queue flags
- * @max_size: (ipv4 only) maximum received fragment size with IP_DF set
+ * @max_size: maximum received fragment size
  * @net: namespace that this frag belongs to
  */
 struct inet_frag_queue {
diff --git a/include/net/ip.h b/include/net/ip.h
index d14af7e..428e694 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -45,6 +45,7 @@ struct inet_skb_parm {
 #define IPSKB_FRAG_COMPLETE	BIT(3)
 #define IPSKB_REROUTED		BIT(4)
 #define IPSKB_DOREDIRECT	BIT(5)
+#define IPSKB_FRAG_PMTU		BIT(6)
 
 	u16			frag_max_size;
 };
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index cc1da6d..863d10d 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -75,6 +75,7 @@ struct ipq {
 	__be16		id;
 	u8		protocol;
 	u8		ecn; /* RFC3168 support */
+	u16		max_df_size; /* largest frag with DF set seen */
 	int             iif;
 	unsigned int    rid;
 	struct inet_peer *peer;
@@ -319,6 +320,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct sk_buff *prev, *next;
 	struct net_device *dev;
+	unsigned int fragsize;
 	int flags, offset;
 	int ihl, end;
 	int err = -ENOENT;
@@ -474,9 +476,14 @@ found:
 	if (offset == 0)
 		qp->q.flags |= INET_FRAG_FIRST_IN;
 
+	fragsize = skb->len + ihl;
+
+	if (fragsize > qp->q.max_size)
+		qp->q.max_size = fragsize;
+
 	if (ip_hdr(skb)->frag_off & htons(IP_DF) &&
-	    skb->len + ihl > qp->q.max_size)
-		qp->q.max_size = skb->len + ihl;
+	    fragsize > qp->max_df_size)
+		qp->max_df_size = fragsize;
 
 	if (qp->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
 	    qp->q.meat == qp->q.len) {
@@ -606,13 +613,23 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	head->next = NULL;
 	head->dev = dev;
 	head->tstamp = qp->q.stamp;
-	IPCB(head)->frag_max_size = qp->q.max_size;
+	IPCB(head)->frag_max_size = qp->max_df_size;
 
 	iph = ip_hdr(head);
-	/* max_size != 0 implies at least one fragment had IP_DF set */
-	iph->frag_off = qp->q.max_size ? htons(IP_DF) : 0;
 	iph->tot_len = htons(len);
 	iph->tos |= ecn;
+
+	/* if largest size with df is also largest seen we should also
+	 * call ip_fragment & set DF on the new fragments when forwarding
+	 * to not break path mtu discovery.
+	 */
+	if (qp->max_df_size == qp->q.max_size) {
+		IPCB(head)->flags |= IPSKB_FRAG_PMTU;
+		iph->frag_off = htons(IP_DF);
+	} else {
+		iph->frag_off = 0;
+	}
+
 	IP_INC_STATS_BH(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
 	qp->q.fragments_tail = NULL;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c65b93a..1b7e16b 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -270,7 +270,8 @@ static int ip_finish_output(struct sock *sk, struct sk_buff *skb)
 	if (skb_is_gso(skb))
 		return ip_finish_output_gso(sk, skb);
 
-	if (skb->len > ip_skb_dst_mtu(skb))
+	if (skb->len > ip_skb_dst_mtu(skb) ||
+	    (IPCB(skb)->flags & IPSKB_FRAG_PMTU))
 		return ip_fragment(sk, skb, ip_finish_output2);
 
 	return ip_finish_output2(sk, skb);
@@ -507,14 +508,29 @@ int ip_fragment(struct sock *sk, struct sk_buff *skb,
 	iph = ip_hdr(skb);
 
 	mtu = ip_skb_dst_mtu(skb);
-	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
-		     (IPCB(skb)->frag_max_size &&
-		      IPCB(skb)->frag_max_size > mtu))) {
-		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
-		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-			  htonl(mtu));
-		kfree_skb(skb);
-		return -EMSGSIZE;
+
+	if (unlikely((iph->frag_off & htons(IP_DF)) && !skb->ignore_df))
+		goto fail_toobig;
+
+	/* if frag_max_size is set, then at least one fragment
+	 * had DF bit set when defragmentation took place.
+	 */
+	if (unlikely(IPCB(skb)->frag_max_size)) {
+		if (IPCB(skb)->frag_max_size > mtu)
+			goto fail_toobig;
+
+		/* PMTU discovery wanted?
+		 * Don't send larger fragments than what we received.
+		 *
+		 * We don't cap unconditionally because we don't
+		 * want to send tiny packets if skb was built from
+		 * small df-fragment and one huge fragment without df.
+		 *
+		 * IPSKB_FRAG_PMTU tells us that all fragments had same size
+		 * so we won't create more packets than we originally received.
+		 */
+		if (IPCB(skb)->flags & IPSKB_FRAG_PMTU)
+			mtu = IPCB(skb)->frag_max_size;
 	}
 
 	/*
@@ -711,6 +727,9 @@ slow_path:
 		iph = ip_hdr(skb2);
 		iph->frag_off = htons((offset >> 3));
 
+		if (IPCB(skb)->flags & IPSKB_FRAG_PMTU)
+			iph->frag_off |= htons(IP_DF);
+
 		/* ANK: dirty, but effective trick. Upgrade options only if
 		 * the segment to be fragmented was THE FIRST (otherwise,
 		 * options are already fixed) and make it ONCE
@@ -750,6 +769,12 @@ fail:
 	kfree_skb(skb);
 	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 	return err;
+fail_toobig:
+	IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
+	icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
+	kfree_skb(skb);
+	return -EMSGSIZE;
+
 }
 EXPORT_SYMBOL(ip_fragment);
 
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 1/3] ipv4: reject too-big defragmented DF-skb when forwarding
  2015-04-10 22:16 ` [PATCH -next 1/3] ipv4: reject too-big defragmented DF-skb when forwarding Florian Westphal
@ 2015-04-10 22:51   ` Hannes Frederic Sowa
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2015-04-10 22:51 UTC (permalink / raw)
  To: Florian Westphal, netdev; +Cc: kaber

On Sat, Apr 11, 2015, at 00:16, Florian Westphal wrote:
> Send icmp pmtu error if we find that the largest fragment of df-skb
> exceeded the output path mtu.
> 
> The ip output path will still catch this later on but we can avoid the
> forward/postrouting hook traversal by rejecting right away.
> 
> This is what ipv6 already does.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>

Also,
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

I think this is the right thing to do not create pmtu blackholes

Bye,
Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 3/3] ipv4: don't remove df bit when refragmenting
  2015-04-10 22:16 ` [PATCH -next 3/3] ipv4: don't remove df bit when refragmenting Florian Westphal
@ 2015-04-12  8:51   ` Florian Westphal
  0 siblings, 0 replies; 25+ messages in thread
From: Florian Westphal @ 2015-04-12  8:51 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, kaber, jesse

Florian Westphal <fw@strlen.de> wrote:
> We always send fragments without DF bit set.
> 
> Thus, given following setup:
> 
> mtu1500 - mtu1500:1400 - mtu1400:1280 - mtu1280
>    A           R1               R2         B
> 
> Where R1 and R2 run linux with netfilter defragmentation/conntrack
> enabled, then if Host A sent a fragmented packet _with_ DF set to B, R1
> will respond with icmp too big error if one of these fragments exceeded
> 1400 bytes.  So far, so good.
> 
> However, the host A will never learn about the lower 1280 link.
> The next packet presumably sent by A would use 1400 as the new fragment
> size, but R1 will strip DF bit when refragmenting.
> 
> Whats worse: If R1 receives fragment sizes 1200 and 100, it would
> forward the reassembled packet without refragmenting, i.e.
> R2 would send an icmp error in response to a packet that was never sent,
> citing mtu that the original sender never exceeded.
> 
> In order to 'replay' the original fragments to preserve their semantics,
> one solution is to
> 
>  1. set DF bit on the new fragments if it was set on original ones.
>  2. set the size of the new fragments generated by R1 during
>     refragmentation to the largest size seen when defragmenting.
> 
> R2 will then notice the problem and will send the expected
> 'too big, use 1280' icmp error, and further fragments of this size
> would not grow anymore to 1400 link mtu when R1 refragments.
> 
> There is however, one important caveat. We cannot just use existing
> IPCB(skb)->frag_max_size as upper boundary for refragmentation.
> 
> We have to consider a case where we receive a large fragment without DF,
> followed by a small fragment with DF set.
> 
> In such scenario we must not generate a large spew of small DF-fragments
> (else we induce packet/traffic amplification).
> 
> This modifies ip_fragment so that we track largest fragment size seen
> both for DF and non-DF packets.
> 
> Then, when we find that we had at least one DF fragment AND the largest
> non-DF fragment did not exceed one with DF set, let ip_fragment know that
> it should refragment using original frag size and also set DF bit on the
> newly created fragments.

Seems Jesse would prefer if we'd set max_frag_size unconditionally.

The advantage of doing that would be that we can easily get rid of the
nf_bridge_mtu_reduction() kludge we have right now since we'd just pick
the largest original fragment size.

Only caveat is that we'd have to check IPSKB_FRAG_PMTU flag too before
deciding to reject if out mtu is too small, which is easy to do.

So, before I respin patch #3: What do others think?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-10 22:16 [PATCH -next 0/3] net: cap size to original frag size when refragmenting Florian Westphal
                   ` (2 preceding siblings ...)
  2015-04-10 22:16 ` [PATCH -next 3/3] ipv4: don't remove df bit when refragmenting Florian Westphal
@ 2015-04-13 17:53 ` David Miller
  2015-04-13 18:40   ` Hannes Frederic Sowa
  2015-04-16  4:56   ` Herbert Xu
  3 siblings, 2 replies; 25+ messages in thread
From: David Miller @ 2015-04-13 17:53 UTC (permalink / raw)
  To: fw; +Cc: netdev, kaber

From: Florian Westphal <fw@strlen.de>
Date: Sat, 11 Apr 2015 00:16:26 +0200

> This series alters ipv4 and ipv6 fragmentation to ensure that we do not
> increase the size of the original fragments when refragmenting.
> 
> For IPv4, we only do this when DF bit was set on original fragments since
> path mtu discovery doesn't happen otherwise.

What really saddens me is that in my opinion the code was doing the
right thing prior to commit 6aafeef03b9d ("netfilter: push reasm skb
through instead of original frag skbs").

I would rather we investigate making the previous scheme work
properly.

Because then there is no ambiguity at all, you preserve on output
exactly what you had on input.  The same geometry, the same
everything.  No special checks, no max frag len, none of this crap.
Those are all hacks trying to work around the _fundamental_ issue
which is that we potentially change the thing when we refrag.

The commit in question claims that the "problem" is that if some of
the sub-frags don't match the same we have problems.  Well that is so
easy to test for, and if such a test triggers take a slow path.

Netfilter should not ever change the geometry of a fragmented frame.
And the only way to do that is to maintain the fraglist of the
individual fragments through the netfilter stack.

The end result should be much faster than what we have now too.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-13 17:53 ` [PATCH -next 0/3] net: cap size to original frag size " David Miller
@ 2015-04-13 18:40   ` Hannes Frederic Sowa
  2015-04-13 18:56     ` Hannes Frederic Sowa
  2015-04-13 20:16     ` David Miller
  2015-04-16  4:56   ` Herbert Xu
  1 sibling, 2 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2015-04-13 18:40 UTC (permalink / raw)
  To: David Miller, fw; +Cc: netdev, kaber

Hi David,

On Mon, Apr 13, 2015, at 19:53, David Miller wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Sat, 11 Apr 2015 00:16:26 +0200
> 
> > This series alters ipv4 and ipv6 fragmentation to ensure that we do not
> > increase the size of the original fragments when refragmenting.
> > 
> > For IPv4, we only do this when DF bit was set on original fragments since
> > path mtu discovery doesn't happen otherwise.
> 
> What really saddens me is that in my opinion the code was doing the
> right thing prior to commit 6aafeef03b9d ("netfilter: push reasm skb
> through instead of original frag skbs").

This commit aligns IPv6 code with what IPv4 did for years already. I
fear that changing that would lead to semantically backwards
incompatible changes (pushing the original skbs through the netfilter
hooks).

> I would rather we investigate making the previous scheme work
> properly.
> 
> Because then there is no ambiguity at all, you preserve on output
> exactly what you had on input.  The same geometry, the same
> everything.  No special checks, no max frag len, none of this crap.
> Those are all hacks trying to work around the _fundamental_ issue
> which is that we potentially change the thing when we refrag.

netfilter does alter the skbs destructiveley. So the only reason we
could use those unaltered skbs would be to save the lengths of the
individual fragments so we could reapply them to the altered
(constructed by reassembly) one in the output path. This seems much more
complex to me. :/

> The commit in question claims that the "problem" is that if some of
> the sub-frags don't match the same we have problems.  Well that is so
> easy to test for, and if such a test triggers take a slow path.
> 
> Netfilter should not ever change the geometry of a fragmented frame.
> And the only way to do that is to maintain the fraglist of the
> individual fragments through the netfilter stack.
> 
> The end result should be much faster than what we have now too.

In the end netfilter has to deal with fragments where the split is in
the transport layer (like a split directly in the tcp header to confuse
firewalls, fragment offset=1). To do this correctly, I think we have to
always pass in the original skb.

But I actually prefer Jesse's suggestion to set max_frag_size
unconditionally.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-13 18:40   ` Hannes Frederic Sowa
@ 2015-04-13 18:56     ` Hannes Frederic Sowa
  2015-04-13 20:16     ` David Miller
  1 sibling, 0 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2015-04-13 18:56 UTC (permalink / raw)
  To: David Miller, fw; +Cc: netdev, kaber

On Mon, Apr 13, 2015, at 20:40, Hannes Frederic Sowa wrote:
> In the end netfilter has to deal with fragments where the split is in
> the transport layer (like a split directly in the tcp header to confuse
> firewalls, fragment offset=1). To do this correctly, I think we have to
> always pass in the original skb.

Sorry, by 'original' I meant the fully reassembled skb.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-13 18:40   ` Hannes Frederic Sowa
  2015-04-13 18:56     ` Hannes Frederic Sowa
@ 2015-04-13 20:16     ` David Miller
  2015-04-13 20:33       ` Hannes Frederic Sowa
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2015-04-13 20:16 UTC (permalink / raw)
  To: hannes; +Cc: fw, netdev, kaber

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Mon, 13 Apr 2015 20:40:52 +0200

> netfilter does alter the skbs destructiveley. So the only reason we
> could use those unaltered skbs would be to save the lengths of the
> individual fragments so we could reapply them to the altered
> (constructed by reassembly) one in the output path. This seems much more
> complex to me. :/

It could process the individual fragments one by one then.  Then
at the end you could see what the final verdicts look like.

What happens right now is so damn expensive, has the geometry
issue, and is totally unnecessary %99.99999 of the time.

I therefore think there is massive value in trying to make it work to
keep the original frags around as a group and try to process them that
way.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-13 20:16     ` David Miller
@ 2015-04-13 20:33       ` Hannes Frederic Sowa
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2015-04-13 20:33 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev, kaber

Hi David,

On Mon, Apr 13, 2015, at 22:16, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Mon, 13 Apr 2015 20:40:52 +0200
> 
> > netfilter does alter the skbs destructiveley. So the only reason we
> > could use those unaltered skbs would be to save the lengths of the
> > individual fragments so we could reapply them to the altered
> > (constructed by reassembly) one in the output path. This seems much more
> > complex to me. :/
> 
> It could process the individual fragments one by one then.  Then
> at the end you could see what the final verdicts look like.

This exactly sounds like the reassembly engine we have now.

Consider we have a split in the TCP header, so we definitely have to
keep this skb in the input path as we got the final segments. We could
try to teach netfilter instead of just using tcp_hdr to be aware of that
the header could not be fully reassembled and peek into the next skb,
but this really seems to complex just to me. After that we have the
netfilter connection tracking helpers which suddenly need to handle all
those packets, too.

Before "netfilter: push reasm skb through instead of original frag skbs"
we simply dropped packets where the necessary header was not available,
which also was just a very seldom situation but this is still a
correctness issue.

> What happens right now is so damn expensive, has the geometry
> issue, and is totally unnecessary %99.99999 of the time.

Yes, totally.

The reason we try to pursue this way is that we found no way to actually
build something completely transparent, the state we would need to carry
around is just to immense. Consider fragmentation duplicates and such. I
think, the only think to worry about is correct PMTU behavior here, and
this patchset tries to keep this intact all the time.

> I therefore think there is massive value in trying to make it work to
> keep the original frags around as a group and try to process them that
> way.

I also do think that fragmented traffic is not a common traffic pattern
we should optimize for.

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-13 17:53 ` [PATCH -next 0/3] net: cap size to original frag size " David Miller
  2015-04-13 18:40   ` Hannes Frederic Sowa
@ 2015-04-16  4:56   ` Herbert Xu
  2015-04-16  5:24     ` Patrick McHardy
  1 sibling, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2015-04-16  4:56 UTC (permalink / raw)
  To: David Miller; +Cc: fw, netdev, kaber

David Miller <davem@davemloft.net> wrote:
> 
> Because then there is no ambiguity at all, you preserve on output
> exactly what you had on input.  The same geometry, the same
> everything.  No special checks, no max frag len, none of this crap.
> Those are all hacks trying to work around the _fundamental_ issue
> which is that we potentially change the thing when we refrag.

Agreed.  Doing anything other than preserving the original geometry
is simply wrong.

However, this doesn't mean that netfilter has to process each
fragment.  What we could do is to preserve the original fragments
in frag_list and then process the overall skb as a unit in netfilter.

On output we simply fragment according to the original frag_list.

The only thing to watch out for is to eliminate anything in the
middle that tries to linearise the skb.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16  4:56   ` Herbert Xu
@ 2015-04-16  5:24     ` Patrick McHardy
  2015-04-16  5:29       ` Herbert Xu
  2015-04-16 15:32       ` David Miller
  0 siblings, 2 replies; 25+ messages in thread
From: Patrick McHardy @ 2015-04-16  5:24 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, fw, netdev

On 16.04, Herbert Xu wrote:
> David Miller <davem@davemloft.net> wrote:
> > 
> > Because then there is no ambiguity at all, you preserve on output
> > exactly what you had on input.  The same geometry, the same
> > everything.  No special checks, no max frag len, none of this crap.
> > Those are all hacks trying to work around the _fundamental_ issue
> > which is that we potentially change the thing when we refrag.
> 
> Agreed.  Doing anything other than preserving the original geometry
> is simply wrong.
> 
> However, this doesn't mean that netfilter has to process each
> fragment.  What we could do is to preserve the original fragments
> in frag_list and then process the overall skb as a unit in netfilter.
> 
> On output we simply fragment according to the original frag_list.
> 
> The only thing to watch out for is to eliminate anything in the
> middle that tries to linearise the skb.

Netfilter may change the contents of the packet, even change its size.
It is *really* hard to do this while keeping the original fragments
intact.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16  5:24     ` Patrick McHardy
@ 2015-04-16  5:29       ` Herbert Xu
  2015-04-16  5:42         ` Patrick McHardy
  2015-04-16 12:11         ` Hannes Frederic Sowa
  2015-04-16 15:32       ` David Miller
  1 sibling, 2 replies; 25+ messages in thread
From: Herbert Xu @ 2015-04-16  5:29 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David Miller, fw, netdev

On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
>
> Netfilter may change the contents of the packet, even change its size.
> It is *really* hard to do this while keeping the original fragments
> intact.

Perhaps we should provide better helpers to facilitate this?

So instead of directly manipulating the content of the skb you
would so so through helpers and the helpers can then try to do
sensible things with the fragments.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16  5:29       ` Herbert Xu
@ 2015-04-16  5:42         ` Patrick McHardy
  2015-04-16 12:11         ` Hannes Frederic Sowa
  1 sibling, 0 replies; 25+ messages in thread
From: Patrick McHardy @ 2015-04-16  5:42 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, fw, netdev

On 16.04, Herbert Xu wrote:
> On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
> >
> > Netfilter may change the contents of the packet, even change its size.
> > It is *really* hard to do this while keeping the original fragments
> > intact.
> 
> Perhaps we should provide better helpers to facilitate this?
> 
> So instead of directly manipulating the content of the skb you
> would so so through helpers and the helpers can then try to do
> sensible things with the fragments.

Yeah, but its going to get pretty complicated. There can be multiple
size changes and modifications for every packet, so we would need to
keep track of the size modifications that have already occured to
map to the correct position in the frag_list. The modifications
would then have to be performed on both the reassembled skb since
they might again be matched against later on and on the frag_list,
potentially split over multiple skbs.

When enlarging the packet, I guess we would insert a new (small)
fragment to keep the geometry of the original fragments. In extreme
cases like H.323 NAT this might result in a huge amount of new very
small fragments. And it might still break the geometry when the
size difference isn't representable as a multiple of 8.

When reducing the skb size, we might have to choice but to change
the geometry of at least on fragment and would have to shift a lot
of data around.

This is very complicated, it we really want to do this, its a lot
easier to just keep note of the full original geometry and refragment
to those exact sizes while potentially adding or removing things
at the end.

My personal opinion is, why bother. The only thing that cares about
the fragment sizes is PMTUD, and that's what this patch is fixing in
a much simpler fashion.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16  5:29       ` Herbert Xu
  2015-04-16  5:42         ` Patrick McHardy
@ 2015-04-16 12:11         ` Hannes Frederic Sowa
  2015-04-16 15:43           ` David Miller
  1 sibling, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2015-04-16 12:11 UTC (permalink / raw)
  To: Herbert Xu, Patrick McHardy; +Cc: David Miller, fw, netdev

On Thu, Apr 16, 2015, at 07:29, Herbert Xu wrote:
> On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
> >
> > Netfilter may change the contents of the packet, even change its size.
> > It is *really* hard to do this while keeping the original fragments
> > intact.
> 
> Perhaps we should provide better helpers to facilitate this?
> 
> So instead of directly manipulating the content of the skb you
> would so so through helpers and the helpers can then try to do
> sensible things with the fragments.

When Florian and me started discussing how to solve this we also wanted
to be as transparent as possible. But looking at all possible
fragmentation scenarios, this seems to be too complicated.

Even imagine a fragment with overlapping offsets and some of the
fragments got duplicated. If we had to keep this in frag_list and now
netfilter has to change any of this contents, this will become a total
mess, like changing one port in multiple skbs at different offsets.

I doubt it is worth the effort.

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16  5:24     ` Patrick McHardy
  2015-04-16  5:29       ` Herbert Xu
@ 2015-04-16 15:32       ` David Miller
  2015-04-16 20:55         ` Patrick McHardy
  1 sibling, 1 reply; 25+ messages in thread
From: David Miller @ 2015-04-16 15:32 UTC (permalink / raw)
  To: kaber; +Cc: herbert, fw, netdev

From: Patrick McHardy <kaber@trash.net>
Date: Thu, 16 Apr 2015 06:24:00 +0100

> On 16.04, Herbert Xu wrote:
>> David Miller <davem@davemloft.net> wrote:
>> > 
>> > Because then there is no ambiguity at all, you preserve on output
>> > exactly what you had on input.  The same geometry, the same
>> > everything.  No special checks, no max frag len, none of this crap.
>> > Those are all hacks trying to work around the _fundamental_ issue
>> > which is that we potentially change the thing when we refrag.
>> 
>> Agreed.  Doing anything other than preserving the original geometry
>> is simply wrong.
>> 
>> However, this doesn't mean that netfilter has to process each
>> fragment.  What we could do is to preserve the original fragments
>> in frag_list and then process the overall skb as a unit in netfilter.
>> 
>> On output we simply fragment according to the original frag_list.
>> 
>> The only thing to watch out for is to eliminate anything in the
>> middle that tries to linearise the skb.
> 
> Netfilter may change the contents of the packet, even change its size.
> It is *really* hard to do this while keeping the original fragments
> intact.

I keep hearing a lot of "it's hard" as the only reason we shouldn't do
this properly, and that frankly sucks.  People aren't looking for a
solution and to be honest it's quite tiring.

The common case is that the rules processed are simple, the size of
the overall packet does _not_ change, and therefore the best thing
to do is pass the entire thing as a unit with the frags in tact.

That's the fundamental fact.  It's also the fastest way to process
these packets and avoids all of these stupid max frag garbage.

Only at the point where netfilter makes changes to the size of the
packet does it take "ownership" and get to take on the responsibility
of making sure the new resulting fragments are sane.

But only at that point.

It must not molest the geometry in any other set of circumstances,
and I absolutely will not relent on this.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16 12:11         ` Hannes Frederic Sowa
@ 2015-04-16 15:43           ` David Miller
  2015-04-16 16:13             ` Hannes Frederic Sowa
  2015-04-16 20:32             ` Patrick McHardy
  0 siblings, 2 replies; 25+ messages in thread
From: David Miller @ 2015-04-16 15:43 UTC (permalink / raw)
  To: hannes; +Cc: herbert, kaber, fw, netdev

From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Thu, 16 Apr 2015 14:11:42 +0200

> On Thu, Apr 16, 2015, at 07:29, Herbert Xu wrote:
>> On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
>> >
>> > Netfilter may change the contents of the packet, even change its size.
>> > It is *really* hard to do this while keeping the original fragments
>> > intact.
>> 
>> Perhaps we should provide better helpers to facilitate this?
>> 
>> So instead of directly manipulating the content of the skb you
>> would so so through helpers and the helpers can then try to do
>> sensible things with the fragments.
> 
> When Florian and me started discussing how to solve this we also wanted
> to be as transparent as possible. But looking at all possible
> fragmentation scenarios, this seems to be too complicated.
> 
> Even imagine a fragment with overlapping offsets and some of the
> fragments got duplicated. If we had to keep this in frag_list and now
> netfilter has to change any of this contents, this will become a total
> mess, like changing one port in multiple skbs at different offsets.
> 
> I doubt it is worth the effort.

You guys keep talking about exceptional cases rather than what is
unquestionable the common case, and the one worth handling in the
fast paths.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16 15:43           ` David Miller
@ 2015-04-16 16:13             ` Hannes Frederic Sowa
  2015-04-16 20:47               ` Herbert Xu
  2015-04-16 20:32             ` Patrick McHardy
  1 sibling, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2015-04-16 16:13 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, kaber, fw, netdev

Hi David,

On Thu, Apr 16, 2015, at 17:43, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Thu, 16 Apr 2015 14:11:42 +0200
> 
> > On Thu, Apr 16, 2015, at 07:29, Herbert Xu wrote:
> >> On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
> >> >
> >> > Netfilter may change the contents of the packet, even change its size.
> >> > It is *really* hard to do this while keeping the original fragments
> >> > intact.
> >> 
> >> Perhaps we should provide better helpers to facilitate this?
> >> 
> >> So instead of directly manipulating the content of the skb you
> >> would so so through helpers and the helpers can then try to do
> >> sensible things with the fragments.
> > 
> > When Florian and me started discussing how to solve this we also wanted
> > to be as transparent as possible. But looking at all possible
> > fragmentation scenarios, this seems to be too complicated.
> > 
> > Even imagine a fragment with overlapping offsets and some of the
> > fragments got duplicated. If we had to keep this in frag_list and now
> > netfilter has to change any of this contents, this will become a total
> > mess, like changing one port in multiple skbs at different offsets.
> > 
> > I doubt it is worth the effort.
> 
> You guys keep talking about exceptional cases rather than what is
> unquestionable the common case, and the one worth handling in the
> fast paths.

So currently we have one fast path, that is: we are not fragmented, we
get out non-fragmented, none of this code is ever touched and no
problem.

We don't want to mak this more complex, but

every other solution would need to first expand sk_buff with a new
member(!), then:
  1) push the original fragments through netfilter - I think this will
  definitely break user space compatibility
  2) add new wrappers into the fast path which now will make sure that
  we access the correct skb from the frag_list instead of just
      dereference the pointer - we will make fast path also more slowly
      and a lot more complex
  3) use the reassembled skb for netfilter and emit the original ones -
  seems like a security hazard to me, discrepancy what is 
      checked and what is used
  4) just use the frag_list skb->len's to restore the reassembled skb to
  the original sizes, this would be possible but we would still harm 
      fast-path with that (new skb member) we could switch to
      dynamically allocated array of int[] to store sizes and DF bit
      information
      In particular, even this would not achieve the goal to have
      perfect transparency.

I can give you a more fundamental problem:
We cannot even be transparent if asymmetric routing is in effect and we
don't see all fragments. In case we have contrack rules active, we clear
fragment queues after some time. So a bridge actually will toss
offset!=0 packets if it never sees the head. If we let the packet pass,
we might have a security problem. We can never become fully transparent.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16 15:43           ` David Miller
  2015-04-16 16:13             ` Hannes Frederic Sowa
@ 2015-04-16 20:32             ` Patrick McHardy
  1 sibling, 0 replies; 25+ messages in thread
From: Patrick McHardy @ 2015-04-16 20:32 UTC (permalink / raw)
  To: David Miller, hannes; +Cc: herbert, fw, netdev

Am 16. April 2015 17:43:23 MESZ, schrieb David Miller <davem@davemloft.net>:
>From: Hannes Frederic Sowa <hannes@stressinduktion.org>
>Date: Thu, 16 Apr 2015 14:11:42 +0200
>
>> On Thu, Apr 16, 2015, at 07:29, Herbert Xu wrote:
>>> On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
>>> >
>>> > Netfilter may change the contents of the packet, even change its
>size.
>>> > It is *really* hard to do this while keeping the original
>fragments
>>> > intact.
>>> 
>>> Perhaps we should provide better helpers to facilitate this?
>>> 
>>> So instead of directly manipulating the content of the skb you
>>> would so so through helpers and the helpers can then try to do
>>> sensible things with the fragments.
>> 
>> When Florian and me started discussing how to solve this we also
>wanted
>> to be as transparent as possible. But looking at all possible
>> fragmentation scenarios, this seems to be too complicated.
>> 
>> Even imagine a fragment with overlapping offsets and some of the
>> fragments got duplicated. If we had to keep this in frag_list and now
>> netfilter has to change any of this contents, this will become a
>total
>> mess, like changing one port in multiple skbs at different offsets.
>> 
>> I doubt it is worth the effort.
>
>You guys keep talking about exceptional cases rather than what is
>unquestionable the common case, and the one worth handling in the
>fast paths.

True. But its not like we haven't tried. What I keep hearing is lots of well meaning opinions, and the fact is, I've looked into this countless times, you can find my first attempt when googling from 11 years ago, and the resolution was always - it's not worth it. It's not "too hard", I wouldn't mind. And I fail to see the problem with that. We're not talking about a functional defect, it's something people (me included) think is "not nice".

I will try to recollect the problems the different approaches have in detail. I'm pretty sure you will come to the same conclusion as I have.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16 16:13             ` Hannes Frederic Sowa
@ 2015-04-16 20:47               ` Herbert Xu
  2015-04-16 20:56                 ` Patrick McHardy
  0 siblings, 1 reply; 25+ messages in thread
From: Herbert Xu @ 2015-04-16 20:47 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: David Miller, kaber, fw, netdev

On Thu, Apr 16, 2015 at 06:13:25PM +0200, Hannes Frederic Sowa wrote:
>
> So currently we have one fast path, that is: we are not fragmented, we
> get out non-fragmented, none of this code is ever touched and no
> problem.
> 
> We don't want to mak this more complex, but

You should read Dave's other email where he gives you an obvious
solution.  If you have to modify the skb then you don't have to
worry about the original fragments.

But if you only read the skb then don't linearise it completely
and keep the original fragments.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16 15:32       ` David Miller
@ 2015-04-16 20:55         ` Patrick McHardy
  0 siblings, 0 replies; 25+ messages in thread
From: Patrick McHardy @ 2015-04-16 20:55 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, fw, netdev

On 16.04, David Miller wrote:
> > Netfilter may change the contents of the packet, even change its size.
> > It is *really* hard to do this while keeping the original fragments
> > intact.
> 
> I keep hearing a lot of "it's hard" as the only reason we shouldn't do
> this properly, and that frankly sucks.  People aren't looking for a
> solution and to be honest it's quite tiring.
> 
> The common case is that the rules processed are simple, the size of
> the overall packet does _not_ change, and therefore the best thing
> to do is pass the entire thing as a unit with the frags in tact.
> 
> That's the fundamental fact.  It's also the fastest way to process
> these packets and avoids all of these stupid max frag garbage.
> 
> Only at the point where netfilter makes changes to the size of the
> packet does it take "ownership" and get to take on the responsibility
> of making sure the new resulting fragments are sane.
> 
> But only at that point.

Agreed, that part shouldn't be hard. We need to pass the defragmented
skb through the ruleset, meaning we need to pass it through the stack.
That's needed since the rules depend on this.

If we don't make changes, we can spit out the original fragments, but
for this we need to keep a reference to them from the skb. We still
need the max_frag_size thing, once a modification is made we drop the
frag list reference and just regulary refragment the modified skb
according to the limits.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16 20:47               ` Herbert Xu
@ 2015-04-16 20:56                 ` Patrick McHardy
  2015-04-16 23:16                   ` Hannes Frederic Sowa
  0 siblings, 1 reply; 25+ messages in thread
From: Patrick McHardy @ 2015-04-16 20:56 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Hannes Frederic Sowa, David Miller, fw, netdev

On 17.04, Herbert Xu wrote:
> On Thu, Apr 16, 2015 at 06:13:25PM +0200, Hannes Frederic Sowa wrote:
> >
> > So currently we have one fast path, that is: we are not fragmented, we
> > get out non-fragmented, none of this code is ever touched and no
> > problem.
> > 
> > We don't want to mak this more complex, but
> 
> You should read Dave's other email where he gives you an obvious
> solution.  If you have to modify the skb then you don't have to
> worry about the original fragments.
> 
> But if you only read the skb then don't linearise it completely
> and keep the original fragments.

Yes, I was just responding to that. We need an additional member
in the skb, but then this part is quite simple.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16 20:56                 ` Patrick McHardy
@ 2015-04-16 23:16                   ` Hannes Frederic Sowa
  2015-04-16 23:44                     ` Patrick McHardy
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2015-04-16 23:16 UTC (permalink / raw)
  To: Patrick McHardy, Herbert Xu; +Cc: David Miller, Florian Westphal, netdev



On Thu, Apr 16, 2015, at 22:56, Patrick McHardy wrote:
> On 17.04, Herbert Xu wrote:
> > On Thu, Apr 16, 2015 at 06:13:25PM +0200, Hannes Frederic Sowa wrote:
> > >
> > > So currently we have one fast path, that is: we are not fragmented, we
> > > get out non-fragmented, none of this code is ever touched and no
> > > problem.
> > > 
> > > We don't want to mak this more complex, but
> > 
> > You should read Dave's other email where he gives you an obvious
> > solution.  If you have to modify the skb then you don't have to
> > worry about the original fragments.
> > 
> > But if you only read the skb then don't linearise it completely
> > and keep the original fragments.
> 
> Yes, I was just responding to that. We need an additional member
> in the skb, but then this part is quite simple.

I guess you refer to the solution of using the fragmented list of
packets to do checks. I don't think it will be that simple to peek into
all the different skbs if the fragments are split in the header.
Splitting fragments in the header is the 1x1 of firewall by-passing
attacks, so we should certainly handle it correctly.

Logic off peeking into fragments and splitting fragments must be
logically consent all the time, with all netfilter helpers in the tree.

I don't think the additional checks in fast-path are in any way
justified.

Bye,
Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting
  2015-04-16 23:16                   ` Hannes Frederic Sowa
@ 2015-04-16 23:44                     ` Patrick McHardy
  0 siblings, 0 replies; 25+ messages in thread
From: Patrick McHardy @ 2015-04-16 23:44 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Herbert Xu, David Miller, Florian Westphal, netdev

On 17.04, Hannes Frederic Sowa wrote:
> 
> 
> On Thu, Apr 16, 2015, at 22:56, Patrick McHardy wrote:
> > On 17.04, Herbert Xu wrote:
> > > On Thu, Apr 16, 2015 at 06:13:25PM +0200, Hannes Frederic Sowa wrote:
> > > >
> > > > So currently we have one fast path, that is: we are not fragmented, we
> > > > get out non-fragmented, none of this code is ever touched and no
> > > > problem.
> > > > 
> > > > We don't want to mak this more complex, but
> > > 
> > > You should read Dave's other email where he gives you an obvious
> > > solution.  If you have to modify the skb then you don't have to
> > > worry about the original fragments.
> > > 
> > > But if you only read the skb then don't linearise it completely
> > > and keep the original fragments.
> > 
> > Yes, I was just responding to that. We need an additional member
> > in the skb, but then this part is quite simple.
> 
> I guess you refer to the solution of using the fragmented list of
> packets to do checks. I don't think it will be that simple to peek into
> all the different skbs if the fragments are split in the header.
> Splitting fragments in the header is the 1x1 of firewall by-passing
> attacks, so we should certainly handle it correctly.

Also correct, that argument hasn't come up so far. Its actually not
too hard to handle, you need to keep track of which parts have been
inspected to make a classification decision and make sure those
parts are not overwritten. The TCP match does this statically, but
we certainly could do this dynamically. The question again is how
to keep geometry if things are actually overwritten, especially
partially. Apparently all this troubly is being seen as worthwhile.

> Logic off peeking into fragments and splitting fragments must be
> logically consent all the time, with all netfilter helpers in the tree.
> 
> I don't think the additional checks in fast-path are in any way
> justified.

I fully agree.

> 
> Bye,
> Hannes
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-04-16 23:44 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-10 22:16 [PATCH -next 0/3] net: cap size to original frag size when refragmenting Florian Westphal
2015-04-10 22:16 ` [PATCH -next 1/3] ipv4: reject too-big defragmented DF-skb when forwarding Florian Westphal
2015-04-10 22:51   ` Hannes Frederic Sowa
2015-04-10 22:16 ` [PATCH -next 2/3] ipv6: don't increase size when refragmenting forwarded skbs Florian Westphal
2015-04-10 22:16 ` [PATCH -next 3/3] ipv4: don't remove df bit when refragmenting Florian Westphal
2015-04-12  8:51   ` Florian Westphal
2015-04-13 17:53 ` [PATCH -next 0/3] net: cap size to original frag size " David Miller
2015-04-13 18:40   ` Hannes Frederic Sowa
2015-04-13 18:56     ` Hannes Frederic Sowa
2015-04-13 20:16     ` David Miller
2015-04-13 20:33       ` Hannes Frederic Sowa
2015-04-16  4:56   ` Herbert Xu
2015-04-16  5:24     ` Patrick McHardy
2015-04-16  5:29       ` Herbert Xu
2015-04-16  5:42         ` Patrick McHardy
2015-04-16 12:11         ` Hannes Frederic Sowa
2015-04-16 15:43           ` David Miller
2015-04-16 16:13             ` Hannes Frederic Sowa
2015-04-16 20:47               ` Herbert Xu
2015-04-16 20:56                 ` Patrick McHardy
2015-04-16 23:16                   ` Hannes Frederic Sowa
2015-04-16 23:44                     ` Patrick McHardy
2015-04-16 20:32             ` Patrick McHardy
2015-04-16 15:32       ` David Miller
2015-04-16 20:55         ` Patrick McHardy

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.