All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next] mpls: Infer payload of packet from via address family.
@ 2015-03-11 12:58 Robert Shearman
  2015-03-11 17:29 ` Eric W. Biederman
  0 siblings, 1 reply; 11+ messages in thread
From: Robert Shearman @ 2015-03-11 12:58 UTC (permalink / raw)
  To: davem; +Cc: netdev, Robert Shearman, Eric W. Biederman

This ensures that if a routing protocol incorrectly advertises a label
for a prefix whose address-family is inconsistent with that of the
nexthop, then the traffic will be dropped, rather than the issue being
silently worked around.

The accessible skb length should also be validated prior to the IPv4
or IPv6 headers being accessed, since only the label header will have
previously been validated.

Rename mpls_egress to mpls_egress_to_ip to make it more obvious that
the function is used for traffic going out as IP, not for labeled
traffic (or for the not-yet-implemented pseudo-wires).

Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Robert Shearman <rshearma@brocade.com>
---
 net/mpls/af_mpls.c | 91 ++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 68 insertions(+), 23 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 0ad8f7141..d1074b8 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -81,35 +81,81 @@ static bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 	return true;
 }
 
-static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
-			struct mpls_entry_decoded dec)
+static int mpls_pkt_determine_af(struct sk_buff *skb)
 {
-	/* RFC4385 and RFC5586 encode other packets in mpls such that
-	 * they don't conflict with the ip version number, making
-	 * decoding by examining the ip version correct in everything
-	 * except for the strangest cases.
-	 *
-	 * The strange cases if we choose to support them will require
-	 * manual configuration.
+	if (!pskb_may_pull(skb, sizeof(struct iphdr))) {
+		return AF_PACKET;
+	}
+
+	/* At the moment, this is only used at the end of the LSP when
+	 * the payload is expected to be IP. More comprehensive checks
+	 * will be required if this is to be used where pseudo-wire
+	 * traffic not using RFC4385/RFC5586 encap could be present.
 	 */
-	struct iphdr *hdr4 = ip_hdr(skb);
+
+	switch (ip_hdr(skb)->version) {
+	case 4:
+		return AF_INET;
+	case 6:
+		return AF_INET6;
+	default:
+		return AF_PACKET;
+	}
+}
+
+static bool mpls_egress_to_ip(struct mpls_route *rt, struct sk_buff *skb,
+			struct mpls_entry_decoded dec)
+{
 	bool success = true;
+	int af;
+
+	switch (rt->rt_via_table) {
+	case NEIGH_ARP_TABLE:
+		af = AF_INET;
+		break;
+	case NEIGH_ND_TABLE:
+		af = AF_INET6;
+		break;
+	case NEIGH_LINK_TABLE:
+		af = mpls_pkt_determine_af(skb);
+		break;
+	default:
+		/* Unexpected rt_via_table value */
+		WARN_ON(true);
+		af = AF_PACKET;
+		break;
+	}
 
-	if (hdr4->version == 4) {
-		skb->protocol = htons(ETH_P_IP);
-		csum_replace2(&hdr4->check,
-			      htons(hdr4->ttl << 8),
-			      htons(dec.ttl << 8));
-		hdr4->ttl = dec.ttl;
+	switch (af) {
+	case AF_INET: {
+		struct iphdr *hdr4 = ip_hdr(skb);
+		if (pskb_may_pull(skb, sizeof(struct iphdr)) &&
+			hdr4->version == 4) {
+			skb->protocol = htons(ETH_P_IP);
+			csum_replace2(&hdr4->check,
+				htons(hdr4->ttl << 8),
+				htons(dec.ttl << 8));
+			hdr4->ttl = dec.ttl;
+		} else {
+			success = false;
+		}
+		break;
 	}
-	else if (hdr4->version == 6) {
+	case AF_INET6: {
 		struct ipv6hdr *hdr6 = ipv6_hdr(skb);
-		skb->protocol = htons(ETH_P_IPV6);
-		hdr6->hop_limit = dec.ttl;
+		if (pskb_may_pull(skb, sizeof(struct ipv6hdr)) &&
+			hdr6->version == 6) {
+			skb->protocol = htons(ETH_P_IPV6);
+			hdr6->hop_limit = dec.ttl;
+		} else {
+			success = false;
+		}
+		break;
 	}
-	else
-		/* version 0 and version 1 are used by pseudo wires */
+	default:
 		success = false;
+		break;
+	}
 	return success;
 }
 
@@ -184,8 +230,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev,
 	skb->protocol = htons(ETH_P_MPLS_UC);
 
 	if (unlikely(!new_header_size && dec.bos)) {
-		/* Penultimate hop popping */
-		if (!mpls_egress(rt, skb, dec))
+		if (!mpls_egress_to_ip(rt, skb, dec))
 			goto drop;
 	} else {
 		bool bos;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] mpls: Infer payload of packet from via address family.
  2015-03-11 12:58 [PATCH net-next] mpls: Infer payload of packet from via address family Robert Shearman
@ 2015-03-11 17:29 ` Eric W. Biederman
  2015-03-11 22:02   ` Robert Shearman
  0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2015-03-11 17:29 UTC (permalink / raw)
  To: Robert Shearman; +Cc: davem, netdev

Robert Shearman <rshearma@brocade.com> writes:

> This ensures that if a routing protocol incorrectly advertises a label
> for a prefix whose address-family is inconsistent with that of the
> nexthop, then the traffic will be dropped, rather than the issue being
> silently worked around.

The address family of the next hop need have no particular relationship
to the address families you can send to the next hop.  As such I
consider the behavior your are proposing here actively wrong.  It
appears to block valid use cases such as using a single mpls label
to carry both ipv4 and ipv6 traffic simply because we use an ipv4 next
hop.

I am not opposed to adding configurability to force the issue,
especially as unexpected traffic may be a problem but I don't think that
should be the default.  I think the default for a tunnel egress should
be assume everyone sticking packets in that tunnel are playing nice so
we should decode as much as possible.

> The accessible skb length should also be validated prior to the IPv4
> or IPv6 headers being accessed, since only the label header will have
> previously been validated.

I agree I goofed by not including the appropriate pskb_may_pull checks.
And that needs to be fixed.

> Rename mpls_egress to mpls_egress_to_ip to make it more obvious that
> the function is used for traffic going out as IP, not for labeled
> traffic (or for the not-yet-implemented pseudo-wires).

I disagree.

The name of the function needs to be mpls_egress, and it should be
eventually expanded to handle as many cases are reasonable.  With the
default being to start the decode of packets by looking at the first
nibble.

Without explicit configuration it seems entirely reasonable to assume
that if the first nibble is 4 it is an ipv4 packet.  If the first nibble
is 6 it is an ipv6 packet.  If the first nibble is 1 it is a generic
association channel.  If the first nibble is 0 it has a control word and
it is a pseudo wire where the output tunnel type matches the output
device.

A handful of pseudo wires do things differently.  SONET sets bits in the
first nibble, Ethernet has cases where it does not include the control
word and as such the first nible might not be zero.   And then we have
oddball cases that need configuration such as should the ethernet
control words sequence number be honored.

I admit that supporting ethernet and similiar pseudo wires will require
the arguments to mpls_egress to be changed a little so that we can skip
taking the next hop address link layer address from the mpls_route, but
that does not mean we should just through it under a bus.

Fundamentally mpls_egress is the function that we call when the bottom
of stack indicator is reached.  It should either figure out that the
packet can be forwarded or it should indicate that the packet should be
dropped.

Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] mpls: Infer payload of packet from via address family.
  2015-03-11 17:29 ` Eric W. Biederman
@ 2015-03-11 22:02   ` Robert Shearman
  2015-03-12 18:19     ` Eric W. Biederman
  0 siblings, 1 reply; 11+ messages in thread
From: Robert Shearman @ 2015-03-11 22:02 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: davem, netdev

On 11/03/15 17:29, Eric W. Biederman wrote:
> Robert Shearman <rshearma@brocade.com> writes:
>
>> This ensures that if a routing protocol incorrectly advertises a label
>> for a prefix whose address-family is inconsistent with that of the
>> nexthop, then the traffic will be dropped, rather than the issue being
>> silently worked around.
>
> The address family of the next hop need have no particular relationship
> to the address families you can send to the next hop.  As such I
> consider the behavior your are proposing here actively wrong.  It
> appears to block valid use cases such as using a single mpls label
> to carry both ipv4 and ipv6 traffic simply because we use an ipv4 next
> hop.
 >
 > I am not opposed to adding configurability to force the issue,
 > especially as unexpected traffic may be a problem but I don't think
 > that should be the default.

Having read RFC 3032 section 2.2, I realise that my original intention 
was misguided. However, as labels are allocated and advertised locally 
and so the implementation gets to decide what labels are used for, then 
I think the statement of this being actively wrong is a bit too strong. 
Allowing the carrying of IPv4 and IPv6 traffic using the same attached 
nexthop route means our implementation to won't be able to use a single 
cached hardware header associated with the neighbour (although 
admittedly it would require reworking for it to be usable here anyway).

In my experience of MPLS (which admittedly lacks TP) there is no MPLS 
application that will allocate a label to forward traffic to an attached 
nexthop that will carry both IPv4 and IPv6 traffic. The only exception I 
can think of is statically configured labels, but there is no reason why 
you can't use two labels with appropriate nexthops.

Since adding such a restriction after this has shipped will have 
implications for anybody relying on it, we should consider the 
implications of being too liberal with what the kernel accepts.

> I think the default for a tunnel egress should
> be assume everyone sticking packets in that tunnel are playing nice so
> we should decode as much as possible.

Naturally - the entire MPLS core is one big trust domain and the PEs 
have to be trusted to be doing the right thing. My point was more about 
restricting what the local control plane can do in terms of allocating 
labels while we have the freedom to do so.

That raises a related issue - as it stands today, with the kernel module 
loaded any interface can receive and process MPLS traffic. If this 
implementation is to act as a PE, then it's imperative we come up with a 
mechanism of ensuring that MPLS packets are only processed on interfaces 
that the operator has explicitly configured.

>
>> The accessible skb length should also be validated prior to the IPv4
>> or IPv6 headers being accessed, since only the label header will have
>> previously been validated.
>
> I agree I goofed by not including the appropriate pskb_may_pull checks.
> And that needs to be fixed.
>
>> Rename mpls_egress to mpls_egress_to_ip to make it more obvious that
>> the function is used for traffic going out as IP, not for labeled
>> traffic (or for the not-yet-implemented pseudo-wires).
>
> I disagree.
>
> The name of the function needs to be mpls_egress, and it should be
> eventually expanded to handle as many cases are reasonable.  With the
> default being to start the decode of packets by looking at the first
> nibble.
>
> Without explicit configuration it seems entirely reasonable to assume
> that if the first nibble is 4 it is an ipv4 packet.  If the first nibble
> is 6 it is an ipv6 packet.  If the first nibble is 1 it is a generic
> association channel.  If the first nibble is 0 it has a control word and
> it is a pseudo wire where the output tunnel type matches the output
> device.

The intention of RFC 4385 was that the control word be used to prevent 
the payload of an MPLS packet being wrongly as interpreted by 
intermediate routers (e.g. as the RFC states for ECMP, or for ICMP 
generation). IMHO the intention wasn't to allow the first nibble to be 
used as a protocol field with 0/1 meaning L2.

While I can't think of any concrete concerns, I do feel very uneasy 
about a payload being sent out directly as L2 without the control plane 
explicitly specifying this as the intention for the label route.

> A handful of pseudo wires do things differently.  SONET sets bits in the
> first nibble, Ethernet has cases where it does not include the control
> word and as such the first nible might not be zero.   And then we have
> oddball cases that need configuration such as should the ethernet
> control words sequence number be honored.

Just to clarify, the issue is more that in such cases without a control 
word the first nibble could be 0, 1, 4 or 6. I've seen PW deployments 
not using control words due to some devices not supporting them. I agree 
that we'll have to consider how the configuration of such cases will work.

> I admit that supporting ethernet and similiar pseudo wires will require
> the arguments to mpls_egress to be changed a little so that we can skip
> taking the next hop address link layer address from the mpls_route, but
> that does not mean we should just through it under a bus.
>
> Fundamentally mpls_egress is the function that we call when the bottom
> of stack indicator is reached.  It should either figure out that the
> packet can be forwarded or it should indicate that the packet should be
> dropped.

I don't want to get too drawn into a discussion on what the code will 
look like with PW implemented, but I had imagined that mpls_forward 
would be refactored such that it would be known from the route lookup 
that this was an PW route with or without a control word, and the 
parsing of the first nibble wouldn't be necessary.

Thanks,
Rob

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] mpls: Infer payload of packet from via address family.
  2015-03-11 22:02   ` Robert Shearman
@ 2015-03-12 18:19     ` Eric W. Biederman
  2015-03-12 20:54       ` [PATCH net-next ] mpls: In mpls_egress verify the packet length Eric W. Biederman
  2015-03-13 14:51       ` [PATCH net-next] mpls: Infer payload of packet from via address family Robert Shearman
  0 siblings, 2 replies; 11+ messages in thread
From: Eric W. Biederman @ 2015-03-12 18:19 UTC (permalink / raw)
  To: Robert Shearman; +Cc: davem, netdev

Robert Shearman <rshearma@brocade.com> writes:

> On 11/03/15 17:29, Eric W. Biederman wrote:
>> Robert Shearman <rshearma@brocade.com> writes:
>>
>>> This ensures that if a routing protocol incorrectly advertises a label
>>> for a prefix whose address-family is inconsistent with that of the
>>> nexthop, then the traffic will be dropped, rather than the issue being
>>> silently worked around.
>>
>> The address family of the next hop need have no particular relationship
>> to the address families you can send to the next hop.  As such I
>> consider the behavior your are proposing here actively wrong.  It
>> appears to block valid use cases such as using a single mpls label
>> to carry both ipv4 and ipv6 traffic simply because we use an ipv4 next
>> hop.
>>
>> I am not opposed to adding configurability to force the issue,
>> especially as unexpected traffic may be a problem but I don't think
>> that should be the default.
>
> Having read RFC 3032 section 2.2, I realise that my original intention was
> misguided. However, as labels are allocated and advertised locally and so the
> implementation gets to decide what labels are used for, then I think the
> statement of this being actively wrong is a bit too strong. Allowing the
> carrying of IPv4 and IPv6 traffic using the same attached nexthop route means
> our implementation to won't be able to use a single cached hardware header
> associated with the neighbour (although admittedly it would require reworking
> for it to be usable here anyway).

What I meant by actively wrong is that the abstraction with the
neighbour table entries is independent of the protocol used.   If that
was not the case we would not be able to use them for mpls traffic.
Imposing funny limits when it is just ipv4 traffic coming out of a
tunnel I think would be confusing.

As for the hardware headers.  If someone wants to benchmark things
I think we could support it comparitively easily in this interface
by having perhaps 3 cached hardware headers.  1 for ipv4, 1 for ipv6
and 1 for mpls in an array in the neighbour table.  Strangely enough
the cached hardware header is smaller than the hardware address
in the neighbour table entry.

> In my experience of MPLS (which admittedly lacks TP) there is no MPLS
> application that will allocate a label to forward traffic to an attached nexthop
> that will carry both IPv4 and IPv6 traffic. The only exception I can think of is
> statically configured labels, but there is no reason why you can't use two
> labels with appropriate nexthops.

I know the Forwrading-Equivalence-Class to label mapping would need to
be distinct mappings.  But I do find that surprising that no one has
implemented a tunnels that carry both v4 and v6 traffic.

The scenario that I keep imagining is a datacenter where there are mpls
tunnels to every machine, with the mpls network not carrying if you are
running v4 or v6 it just has a single tunnel to each machine.  That
seems like the sensible way to construct things.  Especially since
the a single MPLS lable overhead is the same as the VLAN header
overhead.  So you would not need to reduce your MTU below 1500 bytes.

Also reading RFC4364 BGP/MPLS and RFC4659 BGP/MPLS v6 seem describe
operating an all ipv4 core network with ipv6 and ipv4 at the edges.
I may have missed an extra label for tunnel somewhere but it definitely
works to have a minimum number of tunnels through the core network.

> Since adding such a restriction after this has shipped will have implications
> for anybody relying on it, we should consider the implications of being too
> liberal with what the kernel accepts.

I agree.  If there are good reasons not to do this I am game.  So far
minimizing the number of tunnels in the network and keeping the network
as simple as possible seems to strongly suggest that having both v4 and
v6 in a single tunnel is a good idea.

The only reason I can think of for having separate tunnels is so that
you someone can play anycast on the local subnet games with arp or
neighbour discovery.  But that should be rare.

>> I think the default for a tunnel egress should
>> be assume everyone sticking packets in that tunnel are playing nice so
>> we should decode as much as possible.
>
> Naturally - the entire MPLS core is one big trust domain and the PEs have to be
> trusted to be doing the right thing. My point was more about restricting what
> the local control plane can do in terms of allocating labels while we have the
> freedom to do so.
>
> That raises a related issue - as it stands today, with the kernel module loaded
> any interface can receive and process MPLS traffic. If this implementation is to
> act as a PE, then it's imperative we come up with a mechanism of ensuring that
> MPLS packets are only processed on interfaces that the operator has explicitly
> configured.

I started at minimal and simple.  Per interface forwarding and mtu knobs
are likely desirable.

But please note even with the module loaded mpls packets will be ignored
until someone sets the mpls platform label table size to non-zero.

We also have per interface controls if we want to place different
interfaces in different network namespaces, and get different mpls label
tables and different ip forwarding tables.

>>> The accessible skb length should also be validated prior to the IPv4
>>> or IPv6 headers being accessed, since only the label header will have
>>> previously been validated.
>>
>> I agree I goofed by not including the appropriate pskb_may_pull checks.
>> And that needs to be fixed.
>>
>>> Rename mpls_egress to mpls_egress_to_ip to make it more obvious that
>>> the function is used for traffic going out as IP, not for labeled
>>> traffic (or for the not-yet-implemented pseudo-wires).
>>
>> I disagree.
>>
>> The name of the function needs to be mpls_egress, and it should be
>> eventually expanded to handle as many cases are reasonable.  With the
>> default being to start the decode of packets by looking at the first
>> nibble.
>>
>> Without explicit configuration it seems entirely reasonable to assume
>> that if the first nibble is 4 it is an ipv4 packet.  If the first nibble
>> is 6 it is an ipv6 packet.  If the first nibble is 1 it is a generic
>> association channel.  If the first nibble is 0 it has a control word and
>> it is a pseudo wire where the output tunnel type matches the output
>> device.
>
> The intention of RFC 4385 was that the control word be used to prevent the
> payload of an MPLS packet being wrongly as interpreted by intermediate routers
> (e.g. as the RFC states for ECMP, or for ICMP generation). IMHO the intention
> wasn't to allow the first nibble to be used as a protocol field with 0/1 meaning
> L2.

Which in a lot of ways is a weirdness of MPLS there is no well defined
way of just looking at a packet and seeing what the contents should be.
Given that there are standards that at least first glance seem to be
widely deployed that actually allows us to decode what is a packet
without lots of state and context I think it useful to encourage the
use of those standards.  Especially when we get to define what the
labels are or can be used for on our end.

> While I can't think of any concrete concerns, I do feel very uneasy about a
> payload being sent out directly as L2 without the control plane explicitly
> specifying this as the intention for the label route.

Fair enough.  From a be liberal in what you accept and make the easy
things easy perspective it seems like the right thing to me.  At the
same time we haven't implemented any of that yet.  So we can examine it
all in detail when the time comes.

>> A handful of pseudo wires do things differently.  SONET sets bits in the
>> first nibble, Ethernet has cases where it does not include the control
>> word and as such the first nible might not be zero.   And then we have
>> oddball cases that need configuration such as should the ethernet
>> control words sequence number be honored.
>
> Just to clarify, the issue is more that in such cases without a control word the
> first nibble could be 0, 1, 4 or 6. I've seen PW deployments not using control
> words due to some devices not supporting them. I agree that we'll have to
> consider how the configuration of such cases will work.

Agreed.  For whatever it is worth RFC4448 says it is a must to support
ethernet without a control word.  So we don't have to get esoteric for
this issue to come up.

>> I admit that supporting ethernet and similiar pseudo wires will require
>> the arguments to mpls_egress to be changed a little so that we can skip
>> taking the next hop address link layer address from the mpls_route, but
>> that does not mean we should just through it under a bus.
>>
>> Fundamentally mpls_egress is the function that we call when the bottom
>> of stack indicator is reached.  It should either figure out that the
>> packet can be forwarded or it should indicate that the packet should be
>> dropped.
>
> I don't want to get too drawn into a discussion on what the code will look like
> with PW implemented, but I had imagined that mpls_forward would be refactored
> such that it would be known from the route lookup that this was an PW route with
> or without a control word, and the parsing of the first nibble wouldn't be
> necessary.

What I don't currently have in mpls_route is a distinction between
popping a label and popping a final label.  If we want to draw such a
distinction now would be a good time to have that conversation.

Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH net-next ] mpls:  In mpls_egress verify the packet length.
  2015-03-12 18:19     ` Eric W. Biederman
@ 2015-03-12 20:54       ` Eric W. Biederman
  2015-03-12 22:25         ` David Miller
  2015-03-13 14:51       ` [PATCH net-next] mpls: Infer payload of packet from via address family Robert Shearman
  1 sibling, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2015-03-12 20:54 UTC (permalink / raw)
  To: Robert Shearman; +Cc: davem, netdev


Robert Shearman noticed that mpls_egress is failing to verify that
the bytes to be examined are in fact present in the packet before
mpls_egress reads those bytes.

Reported-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/mpls/af_mpls.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 0ad8f7141be2..d5882c5c176b 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -92,17 +92,25 @@ static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
 	 * The strange cases if we choose to support them will require
 	 * manual configuration.
 	 */
-	struct iphdr *hdr4 = ip_hdr(skb);
+	struct iphdr *hdr4;
 	bool success = true;
 
-	if (hdr4->version == 4) {
+	/* If the packet is empty we can't do anything */
+	if (!pskb_may_pull(skb, 1))
+		return false;
+
+	/* Use ip_hdr to find the ip protocol version */
+	hdr4 = ip_hdr(skb);
+	if ((hdr4->version == 4) &&
+	    pskb_may_pull(skb, sizeof(struct iphdr))) {
 		skb->protocol = htons(ETH_P_IP);
 		csum_replace2(&hdr4->check,
 			      htons(hdr4->ttl << 8),
 			      htons(dec.ttl << 8));
 		hdr4->ttl = dec.ttl;
 	}
-	else if (hdr4->version == 6) {
+	else if ((hdr4->version == 6) &&
+		 pskb_may_pull(skb, sizeof(struct ipv6hdr))) {
 		struct ipv6hdr *hdr6 = ipv6_hdr(skb);
 		skb->protocol = htons(ETH_P_IPV6);
 		hdr6->hop_limit = dec.ttl;
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next ] mpls: In mpls_egress verify the packet length.
  2015-03-12 20:54       ` [PATCH net-next ] mpls: In mpls_egress verify the packet length Eric W. Biederman
@ 2015-03-12 22:25         ` David Miller
  2015-03-12 22:50           ` Eric W. Biederman
  0 siblings, 1 reply; 11+ messages in thread
From: David Miller @ 2015-03-12 22:25 UTC (permalink / raw)
  To: ebiederm; +Cc: rshearma, netdev

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Thu, 12 Mar 2015 15:54:52 -0500

> Robert Shearman noticed that mpls_egress is failing to verify that
> the bytes to be examined are in fact present in the packet before
> mpls_egress reads those bytes.
> 
> Reported-by: Robert Shearman <rshearma@brocade.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Let's see if we can consolidate these pskb_may_pull() calls into one,
in the fast path they are relatively cheap but not zero cost.

The caller (mpls_forward()) is doing one as well.

All you care about is the version field and the hop_limit/ttl.

For ipv4, version is in byte 0, and ttl is in byte 9.

For ipv6, version is also in byte zero and hop_limit is in byte 7.

Therefore if you do pskb_may_pull(skb, 10) it will cover all cases
whilst not exceeding the size of either protocol's total header.

Above the pskb_may_pull() call can add a big comment explaining what
you are doing and why this is sufficient.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next ] mpls: In mpls_egress verify the packet length.
  2015-03-12 22:25         ` David Miller
@ 2015-03-12 22:50           ` Eric W. Biederman
  2015-03-12 23:22             ` Eric W. Biederman
  0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2015-03-12 22:50 UTC (permalink / raw)
  To: David Miller; +Cc: rshearma, netdev

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Thu, 12 Mar 2015 15:54:52 -0500
>
>> Robert Shearman noticed that mpls_egress is failing to verify that
>> the bytes to be examined are in fact present in the packet before
>> mpls_egress reads those bytes.
>> 
>> Reported-by: Robert Shearman <rshearma@brocade.com>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> Let's see if we can consolidate these pskb_may_pull() calls into one,
> in the fast path they are relatively cheap but not zero cost.

> The caller (mpls_forward()) is doing one as well.
>
> All you care about is the version field and the hop_limit/ttl.
>
> For ipv4, version is in byte 0, and ttl is in byte 9.
By my count the ttl is in byte 8, and more importantly the
header checksum is in byte 10.

> For ipv6, version is also in byte zero and hop_limit is in byte 7.
>
> Therefore if you do pskb_may_pull(skb, 10) it will cover all cases
> whilst not exceeding the size of either protocol's total header.

Using the value of 12 that seems very reasonable.

> Above the pskb_may_pull() call can add a big comment explaining what
> you are doing and why this is sufficient.

Will do.

Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH net-next ] mpls: In mpls_egress verify the packet length.
  2015-03-12 22:50           ` Eric W. Biederman
@ 2015-03-12 23:22             ` Eric W. Biederman
  2015-03-13  3:05               ` David Miller
  0 siblings, 1 reply; 11+ messages in thread
From: Eric W. Biederman @ 2015-03-12 23:22 UTC (permalink / raw)
  To: David Miller; +Cc: rshearma, netdev


Reobert Shearman noticed that mpls_egress is failing to verify that
the bytes to be examined are in fact present in the packet before
mpls_egress reads those bytes.

As suggested by David Miller reduce this to a single pskb_may_pull
call so that we don't do unnecessary work in the fast path.

Reported-by: Robert Shearman <rshearma@brocade.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 net/mpls/af_mpls.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 0ad8f7141be2..db8a2ea6d4de 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -92,9 +92,24 @@ static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
 	 * The strange cases if we choose to support them will require
 	 * manual configuration.
 	 */
-	struct iphdr *hdr4 = ip_hdr(skb);
+	struct iphdr *hdr4;
 	bool success = true;
 
+	/* The IPv4 code below accesses through the IPv4 header
+	 * checksum, which is 12 bytes into the packet.
+	 * The IPv6 code below accesses through the IPv6 hop limit
+	 * which is 8 bytes into the packet.
+	 *
+	 * For all supported cases there should always be at least 12
+	 * bytes of packet data present.  The IPv4 header is 20 bytes
+	 * without options and the IPv6 header is always 40 bytes
+	 * long.
+	 */
+	if (!pskb_may_pull(skb, 12))
+		return false;
+
+	/* Use ip_hdr to find the ip protocol version */
+	hdr4 = ip_hdr(skb);
 	if (hdr4->version == 4) {
 		skb->protocol = htons(ETH_P_IP);
 		csum_replace2(&hdr4->check,
-- 
2.2.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next ] mpls: In mpls_egress verify the packet length.
  2015-03-12 23:22             ` Eric W. Biederman
@ 2015-03-13  3:05               ` David Miller
  0 siblings, 0 replies; 11+ messages in thread
From: David Miller @ 2015-03-13  3:05 UTC (permalink / raw)
  To: ebiederm; +Cc: rshearma, netdev

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Thu, 12 Mar 2015 18:22:59 -0500

> 
> Reobert Shearman noticed that mpls_egress is failing to verify that
> the bytes to be examined are in fact present in the packet before
> mpls_egress reads those bytes.
> 
> As suggested by David Miller reduce this to a single pskb_may_pull
> call so that we don't do unnecessary work in the fast path.
> 
> Reported-by: Robert Shearman <rshearma@brocade.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

Perfect, applied, thanks!

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] mpls: Infer payload of packet from via address family.
  2015-03-12 18:19     ` Eric W. Biederman
  2015-03-12 20:54       ` [PATCH net-next ] mpls: In mpls_egress verify the packet length Eric W. Biederman
@ 2015-03-13 14:51       ` Robert Shearman
  2015-03-13 17:08         ` Eric W. Biederman
  1 sibling, 1 reply; 11+ messages in thread
From: Robert Shearman @ 2015-03-13 14:51 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: davem, netdev

On 12/03/15 18:19, Eric W. Biederman wrote:
> Robert Shearman <rshearma@brocade.com> writes:
>
>> On 11/03/15 17:29, Eric W. Biederman wrote:
>>> Robert Shearman <rshearma@brocade.com> writes:
>>>
>>>> This ensures that if a routing protocol incorrectly advertises a label
>>>> for a prefix whose address-family is inconsistent with that of the
>>>> nexthop, then the traffic will be dropped, rather than the issue being
>>>> silently worked around.
>>>
>>> The address family of the next hop need have no particular relationship
>>> to the address families you can send to the next hop.  As such I
>>> consider the behavior your are proposing here actively wrong.  It
>>> appears to block valid use cases such as using a single mpls label
>>> to carry both ipv4 and ipv6 traffic simply because we use an ipv4 next
>>> hop.
>>>
>>> I am not opposed to adding configurability to force the issue,
>>> especially as unexpected traffic may be a problem but I don't think
>>> that should be the default.
>>
>> Having read RFC 3032 section 2.2, I realise that my original intention was
>> misguided. However, as labels are allocated and advertised locally and so the
>> implementation gets to decide what labels are used for, then I think the
>> statement of this being actively wrong is a bit too strong. Allowing the
>> carrying of IPv4 and IPv6 traffic using the same attached nexthop route means
>> our implementation to won't be able to use a single cached hardware header
>> associated with the neighbour (although admittedly it would require reworking
>> for it to be usable here anyway).
>
> What I meant by actively wrong is that the abstraction with the
> neighbour table entries is independent of the protocol used.   If that
> was not the case we would not be able to use them for mpls traffic.
> Imposing funny limits when it is just ipv4 traffic coming out of a
> tunnel I think would be confusing.

Agreed. How about adding a netlink attribute indicating the FEC, or at 
least the address family in the case of IP FECs? That would then give a 
an indicator of the traffic that the LSP is carrying, regardless of 
nexthop type. The same method could then be used to indicate whether the 
label route is for a PW and then the presence or lack or control word.

> As for the hardware headers.  If someone wants to benchmark things
> I think we could support it comparitively easily in this interface
> by having perhaps 3 cached hardware headers.  1 for ipv4, 1 for ipv6
> and 1 for mpls in an array in the neighbour table.  Strangely enough
> the cached hardware header is smaller than the hardware address
> in the neighbour table entry.
>
>> In my experience of MPLS (which admittedly lacks TP) there is no MPLS
>> application that will allocate a label to forward traffic to an attached nexthop
>> that will carry both IPv4 and IPv6 traffic. The only exception I can think of is
>> statically configured labels, but there is no reason why you can't use two
>> labels with appropriate nexthops.
>
> I know the Forwrading-Equivalence-Class to label mapping would need to
> be distinct mappings.  But I do find that surprising that no one has
> implemented a tunnels that carry both v4 and v6 traffic.

This is a use case for TE tunnels. However, the accepted practice is to 
add an IPv6 Explicit NULL label to differentiate IPv4 from IPv6 traffic 
over the tunnel.

Furthermore, in discussions with Stewart Bryant (author of a number of 
IETF RFCs, including 4385) and he wasn't aware of any significant 
existing implementations that carry IPv4 and IPv6 traffic using the same 
label. He was also of the opinion that by doing this we'd be "attempting 
to swim upstream". Furthermore, as the carrying of one type of traffic 
for a given label is considered status quo, then this could cause it to 
be treated as an invariant on which current (there are over 100 RFCs 
relating to MPLS and I certainly haven't read them all) or future 
changes to MPLS are based upon.

> The scenario that I keep imagining is a datacenter where there are mpls
> tunnels to every machine, with the mpls network not carrying if you are
> running v4 or v6 it just has a single tunnel to each machine.  That
> seems like the sensible way to construct things.  Especially since
> the a single MPLS lable overhead is the same as the VLAN header
> overhead.  So you would not need to reduce your MTU below 1500 bytes.

Using a label each for IPv4 and IPv6 is still possible if that is a 
requirement. Note that labels are local in nature (ignoring Segment 
Routing) so you would only hit the label limit if one router needed to 
communicate with ~2^19 machines at once, which seems a lot to me. Note 
that BGP has much higher scales especially when using VPNs and solves 
the issue using per-CE or per-VRF labels

> Also reading RFC4364 BGP/MPLS and RFC4659 BGP/MPLS v6 seem describe
> operating an all ipv4 core network with ipv6 and ipv4 at the edges.
> I may have missed an extra label for tunnel somewhere but it definitely
> works to have a minimum number of tunnels through the core network.

Yes, when using MPLS-VPN of either address family there will always be 
an extra label for the BGP FECs because otherwise you would lose the 
discrimination of which VPN the traffic belongs to, either when using 
PHP or when looking up the destination on the PE router after the final 
label pop.

>> Since adding such a restriction after this has shipped will have implications
>> for anybody relying on it, we should consider the implications of being too
>> liberal with what the kernel accepts.
>
> I agree.  If there are good reasons not to do this I am game.  So far
> minimizing the number of tunnels in the network and keeping the network
> as simple as possible seems to strongly suggest that having both v4 and
> v6 in a single tunnel is a good idea.

I'm fine with keeping that an option with the above suggestion of the 
netlink attribute, for users that want to take that gamble. However, I'd 
also like the option of a control plane that wants to only use a label 
for one traffic type at a time can specify this to guard against bugs 
and allow better observability of the MPLS label table to the operator.

>>>> Rename mpls_egress to mpls_egress_to_ip to make it more obvious that
>>>> the function is used for traffic going out as IP, not for labeled
>>>> traffic (or for the not-yet-implemented pseudo-wires).
>>>
>>> I disagree.
>>>
>>> The name of the function needs to be mpls_egress, and it should be
>>> eventually expanded to handle as many cases are reasonable.  With the
>>> default being to start the decode of packets by looking at the first
>>> nibble.
>>>
>>> Without explicit configuration it seems entirely reasonable to assume
>>> that if the first nibble is 4 it is an ipv4 packet.  If the first nibble
>>> is 6 it is an ipv6 packet.  If the first nibble is 1 it is a generic
>>> association channel.  If the first nibble is 0 it has a control word and
>>> it is a pseudo wire where the output tunnel type matches the output
>>> device.
>>
>> The intention of RFC 4385 was that the control word be used to prevent the
>> payload of an MPLS packet being wrongly as interpreted by intermediate routers
>> (e.g. as the RFC states for ECMP, or for ICMP generation). IMHO the intention
>> wasn't to allow the first nibble to be used as a protocol field with 0/1 meaning
>> L2.
>
> Which in a lot of ways is a weirdness of MPLS there is no well defined
> way of just looking at a packet and seeing what the contents should be.
> Given that there are standards that at least first glance seem to be
> widely deployed that actually allows us to decode what is a packet
> without lots of state and context I think it useful to encourage the
> use of those standards.  Especially when we get to define what the
> labels are or can be used for on our end.

The lack of a protocol type in the packet is just a fact of the MPLS 
protocol. RFC 3032 states that the label is the primary source of what 
is encapsulated.

In my discussion with one of the authors of RFC 4385 it was stated to me 
in no uncertain terms that the label is the indicator that this is a PW 
packet and that the introduction of the control word wasn't intended to 
distinguish an IP packet from a PW packet. Instead the type of the 
encapsulated traffic should be determined from the label, not from the 
payload. Doing the contrary would be diverging from how the rest of the 
industry has implemented it and. moreover, could lead to issues that 
authors of RFCs never considered, because the use was considered outside 
of accepted practice.

>>> I admit that supporting ethernet and similiar pseudo wires will require
>>> the arguments to mpls_egress to be changed a little so that we can skip
>>> taking the next hop address link layer address from the mpls_route, but
>>> that does not mean we should just through it under a bus.
>>>
>>> Fundamentally mpls_egress is the function that we call when the bottom
>>> of stack indicator is reached.  It should either figure out that the
>>> packet can be forwarded or it should indicate that the packet should be
>>> dropped.
>>
>> I don't want to get too drawn into a discussion on what the code will look like
>> with PW implemented, but I had imagined that mpls_forward would be refactored
>> such that it would be known from the route lookup that this was an PW route with
>> or without a control word, and the parsing of the first nibble wouldn't be
>> necessary.
>
> What I don't currently have in mpls_route is a distinction between
> popping a label and popping a final label.  If we want to draw such a
> distinction now would be a good time to have that conversation.

Yes, that would be highly desirable in the case of MPLS-VPN where a 
label route is added via CE. In order to prevent unexpected MPLS traffic 
being sent to the CE (perhaps MPLS-OAM), the control plane would want to 
ask the dataplane to install an "unlabeled" entry where packets with the 
BOS bit not set should be dropped.

I'd like to propose that we change the semantics of no labels being 
specified to be: pop and forward if BOS bit set, drop if BOS not set. 
Then we can allow the control plane to specify a label value with the 
reserved implicit-null value which wouldn't be put into a packet, but 
translated into pop and forward regardless of BOS bit (i.e. IP if BOS 
set, MPLS if BOS not set).

Thanks,
Rob

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net-next] mpls: Infer payload of packet from via address family.
  2015-03-13 14:51       ` [PATCH net-next] mpls: Infer payload of packet from via address family Robert Shearman
@ 2015-03-13 17:08         ` Eric W. Biederman
  0 siblings, 0 replies; 11+ messages in thread
From: Eric W. Biederman @ 2015-03-13 17:08 UTC (permalink / raw)
  To: Robert Shearman; +Cc: davem, netdev

Robert Shearman <rshearma@brocade.com> writes:

> On 12/03/15 18:19, Eric W. Biederman wrote:
>> Robert Shearman <rshearma@brocade.com> writes:

> Agreed. How about adding a netlink attribute indicating the FEC, or at
> least the address family in the case of IP FECs? That would then give
> a an indicator of the traffic that the LSP is carrying, regardless of
> nexthop type. The same method could then be used to indicate whether
> the label route is for a PW and then the presence or lack or control
> word.

Not a problem. I fully expect we will do that at some point.
My current code is just a starting point, and additions like that can be
made incrementally and easily.

I fully expect David Miller will merge any same well thought out patch
with a good description.

To be clear.  I was not proposing forcing MPLS to be something it is not
where there is a type in each packet identifying the data.  All I meant
was that in the common cases it can be inferred what the packet type is
I don't mind taking advantage of it as it makes the code simpler to
write and easier to use.

> I'm fine with keeping that an option with the above suggestion of the
> netlink attribute, for users that want to take that gamble. However,
> I'd also like the option of a control plane that wants to only use a
> label for one traffic type at a time can specify this to guard against
> bugs and allow better observability of the MPLS label table to the
> operator.

No objections from me.  I just picked a default that was easy and
useful.  Knobs to tighten things down assuming they don't make the fast
path complicated and slow are fine by me.

>> What I don't currently have in mpls_route is a distinction between
>> popping a label and popping a final label.  If we want to draw such a
>> distinction now would be a good time to have that conversation.
>
> Yes, that would be highly desirable in the case of MPLS-VPN where a
> label route is added via CE. In order to prevent unexpected MPLS
> traffic being sent to the CE (perhaps MPLS-OAM), the control plane
> would want to ask the dataplane to install an "unlabeled" entry where
> packets with the BOS bit not set should be dropped.
>
> I'd like to propose that we change the semantics of no labels being
> specified to be: pop and forward if BOS bit set, drop if BOS not
> set. Then we can allow the control plane to specify a label value with
> the reserved implicit-null value which wouldn't be put into a packet,
> but translated into pop and forward regardless of BOS bit (i.e. IP if
> BOS set, MPLS if BOS not set).

I am not certain about having two different ways to say pop a label,
that seems to create unnecessary complications.

This raises an interesting question.  Are there cases where we need to
support variable depth label stacks.  So that sometimes a label is
popped and the packet is then forwarded over mpls, and that sometimes a
label is popped the BOS bit is set and we treat the packet as a local IP
packet?

I think the answer is we don't need to support that case.  It seems an
even stranger case than having both IPv4 and IPv6 in the same label
stack.

So I think it may be sufficient to simply specify what is in an mpls
tunnel under the label that we are popping.   If we specify more MPLS
it is just a label pop and BOS of stack must not be set (or drop).  If
we specify IPv4, the BOS must be set or drop.

I am not particularly motivated to go after that aspect right now so I
will let you cook up a patch.

Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-03-13 17:12 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-11 12:58 [PATCH net-next] mpls: Infer payload of packet from via address family Robert Shearman
2015-03-11 17:29 ` Eric W. Biederman
2015-03-11 22:02   ` Robert Shearman
2015-03-12 18:19     ` Eric W. Biederman
2015-03-12 20:54       ` [PATCH net-next ] mpls: In mpls_egress verify the packet length Eric W. Biederman
2015-03-12 22:25         ` David Miller
2015-03-12 22:50           ` Eric W. Biederman
2015-03-12 23:22             ` Eric W. Biederman
2015-03-13  3:05               ` David Miller
2015-03-13 14:51       ` [PATCH net-next] mpls: Infer payload of packet from via address family Robert Shearman
2015-03-13 17:08         ` Eric W. Biederman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.