All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
@ 2022-02-04 13:58 Guillaume Nault
  2022-02-04 13:58 ` [PATCH net-next 1/4] ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules Guillaume Nault
                   ` (6 more replies)
  0 siblings, 7 replies; 16+ messages in thread
From: Guillaume Nault @ 2022-02-04 13:58 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern,
	Toke Høiland-Jørgensen, Shuah Khan, linux-kselftest,
	Russell Strong, Dave Taht

The networking stack currently doesn't clearly distinguish between DSCP
and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or
structure fields), and each part of the stack handles them in their own
way, using different macros. This has created several bugs in the past
and some uncommon code paths are still unfixed.

Such bugs generally manifest by selecting invalid routes because of ECN
bits interfering with FIB routes and rules lookups (more details in the
LPC 2021 talk[1] and in the RFC of this series[2]).

This patch series aims at preventing the introduction of such bugs (and
detecting existing ones), by introducing a dscp_t type, representing
"sanitised" DSCP values (that is, with no ECN information), as opposed
to plain u8 values that contain both DSCP and ECN information. dscp_t
makes it clear for the reader what we're working on, and Sparse can
flag invalid interactions between dscp_t and plain u8.

This series converts only a few variables and structures:

  * Patch 1 converts the tclass field of struct fib6_rule. It
    effectively forbids the use of ECN bits in the tos/dsfield option
    of ip -6 rule. Rules now match packets solely based on their DSCP
    bits, so ECN doesn't influence the result any more. This contrasts
    with the previous behaviour where all 8 bits of the Traffic Class
    field were used. It is believed that this change is acceptable as
    matching ECN bits wasn't usable for IPv4, so only IPv6-only
    deployments could be depending on it. Also the previous behaviour
    made DSCP-based ip6-rules fail for packets with both a DSCP and an
    ECN mark, which is another reason why any such deploy is unlikely.

  * Patch 2 converts the tos field of struct fib4_rule. This one too
    effectively forbids defining ECN bits, this time in ip -4 rule.
    Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
    rejected. But even when accepted, the rule would never match, as
    the packets would have their ECN bits cleared before doing the
    rule lookup.

  * Patch 3 converts the fc_tos field of struct fib_config. This is
    equivalent to patch 2, but for IPv4 routes. Routes using a
    tos/dsfield option with any ECN bit set is now rejected. Before
    this patch, they were accepted but, as with ip4 rules, these routes
    couldn't match any packet, since their ECN bits are cleared before
    the lookup.

  * Patch 4 converts the fa_tos field of struct fib_alias. This one is
    pure internal u8 to dscp_t conversion. While patches 1-3 had user
    facing consequences, this patch shouldn't have any side effect and
    is there to give an overview of what future conversion patches will
    look like. Conversions are quite mechanical, but imply some code
    churn, which is the price for the extra clarity a possibility of
    type checking.

To summarise, all the behaviour changes required for the dscp_t type
approach to work should be contained in patches 1-3. These changes are
edge cases of ip-route and ip-rule that don't currently work properly.
So they should be safe. Also, a kernel selftest is added for each of
them.

Finally, this work also paves the way for allowing the usage of the 3
high order DSCP bits in IPv4 (a few call paths already handle them, but
in general the stack clears them before IPv4 rule and route lookups).

References:
  [1] LPC 2021 talk:
        - https://linuxplumbersconf.org/event/11/contributions/943/
        - Direct link to slide deck:
            https://linuxplumbersconf.org/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf
  [2] RFC version of this series:
      - https://lore.kernel.org/netdev/cover.1638814614.git.gnault@redhat.com/

Changes since RFC:
  - Use simple mask instead of a bit shift to converting between u8
    and dscp_t (Toke).
  - Reword patch 4 to make it clear that no behaviour change is
    intended (Toke).
  - Add kernel selftests.
  - Rebase on latest net-next.

Guillaume Nault (4):
  ipv6: Define dscp_t and stop taking ECN bits into account in
    fib6-rules
  ipv4: Stop taking ECN bits into account in fib4-rules
  ipv4: Reject routes specifying ECN bits in rtm_tos
  ipv4: Use dscp_t in struct fib_alias

 include/net/inet_dscp.h                       | 57 ++++++++++++++
 include/net/ip_fib.h                          |  3 +-
 include/net/ipv6.h                            |  6 ++
 net/ipv4/fib_frontend.c                       | 11 ++-
 net/ipv4/fib_lookup.h                         |  3 +-
 net/ipv4/fib_rules.c                          | 18 +++--
 net/ipv4/fib_semantics.c                      | 14 ++--
 net/ipv4/fib_trie.c                           | 58 ++++++++------
 net/ipv4/route.c                              |  3 +-
 net/ipv6/fib6_rules.c                         | 19 +++--
 tools/testing/selftests/net/fib_rule_tests.sh | 60 ++++++++++++++-
 tools/testing/selftests/net/fib_tests.sh      | 76 +++++++++++++++++++
 12 files changed, 278 insertions(+), 50 deletions(-)
 create mode 100644 include/net/inet_dscp.h

-- 
2.21.3


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH net-next 1/4] ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules
  2022-02-04 13:58 [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type Guillaume Nault
@ 2022-02-04 13:58 ` Guillaume Nault
  2022-02-04 13:58 ` [PATCH net-next 2/4] ipv4: Stop taking ECN bits into account in fib4-rules Guillaume Nault
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Guillaume Nault @ 2022-02-04 13:58 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern,
	Toke Høiland-Jørgensen, Shuah Khan, linux-kselftest,
	Russell Strong, Dave Taht

Define a dscp_t type and its appropriate helpers that ensure ECN bits
are not taken into account when handling DSCP.

Use this new type to replace the tclass field of struct fib6_rule, so
that fib6-rules don't get influenced by ECN bits anymore.

Before this patch, fib6-rules didn't make any distinction between the
DSCP and ECN bits. Therefore, rules specifying a DSCP (tos or dsfield
options in iproute2) stopped working as soon a packets had at least one
of its ECN bits set (as a work around one could create four rules for
each DSCP value to match, one for each possible ECN value).

After this patch fib6-rules only compare the DSCP bits. ECN doesn't
influence the result anymore. Also, fib6-rules now must have the ECN
bits cleared or they will be rejected.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 include/net/inet_dscp.h                       | 57 +++++++++++++++++++
 include/net/ipv6.h                            |  6 ++
 net/ipv6/fib6_rules.c                         | 19 +++++--
 tools/testing/selftests/net/fib_rule_tests.sh | 30 +++++++++-
 4 files changed, 105 insertions(+), 7 deletions(-)
 create mode 100644 include/net/inet_dscp.h

diff --git a/include/net/inet_dscp.h b/include/net/inet_dscp.h
new file mode 100644
index 000000000000..72f250dffada
--- /dev/null
+++ b/include/net/inet_dscp.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * inet_dscp.h: helpers for handling differentiated services codepoints (DSCP)
+ *
+ * DSCP is defined in RFC 2474:
+ *
+ *        0   1   2   3   4   5   6   7
+ *      +---+---+---+---+---+---+---+---+
+ *      |         DSCP          |  CU   |
+ *      +---+---+---+---+---+---+---+---+
+ *
+ *        DSCP: differentiated services codepoint
+ *        CU:   currently unused
+ *
+ * The whole DSCP + CU bits form the DS field.
+ * The DS field is also commonly called TOS or Traffic Class (for IPv6).
+ *
+ * Note: the CU bits are now used for Explicit Congestion Notification
+ *       (RFC 3168).
+ */
+
+#ifndef _INET_DSCP_H
+#define _INET_DSCP_H
+
+#include <linux/types.h>
+
+/* Special type for storing DSCP values.
+ *
+ * A dscp_t variable stores a DS field with the CU (ECN) bits cleared.
+ * Using dscp_t allows to strictly separate DSCP and ECN bits, thus avoiding
+ * bugs where ECN bits are erroneously taken into account during FIB lookups
+ * or policy routing.
+ *
+ * Note: to get the real DSCP value contained in a dscp_t variable one would
+ * have to do a bit shift after calling inet_dscp_to_dsfield(). We could have
+ * a helper for that, but there's currently no users.
+ */
+typedef u8 __bitwise dscp_t;
+
+#define INET_DSCP_MASK 0xfc
+
+static inline dscp_t inet_dsfield_to_dscp(__u8 dsfield)
+{
+	return (__force dscp_t)(dsfield & INET_DSCP_MASK);
+}
+
+static inline __u8 inet_dscp_to_dsfield(dscp_t dscp)
+{
+	return (__force __u8)dscp;
+}
+
+static inline bool inet_validate_dscp(__u8 val)
+{
+	return !(val & ~INET_DSCP_MASK);
+}
+
+#endif /* _INET_DSCP_H */
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 082f30256f59..3d898eb6df9c 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -18,6 +18,7 @@
 #include <net/ndisc.h>
 #include <net/flow.h>
 #include <net/flow_dissector.h>
+#include <net/inet_dscp.h>
 #include <net/snmp.h>
 #include <net/netns/hash.h>
 
@@ -975,6 +976,11 @@ static inline u8 ip6_tclass(__be32 flowinfo)
 	return ntohl(flowinfo & IPV6_TCLASS_MASK) >> IPV6_TCLASS_SHIFT;
 }
 
+static inline dscp_t ip6_dscp(__be32 flowinfo)
+{
+	return inet_dsfield_to_dscp(ip6_tclass(flowinfo));
+}
+
 static inline __be32 ip6_make_flowinfo(unsigned int tclass, __be32 flowlabel)
 {
 	return htonl(tclass << IPV6_TCLASS_SHIFT) | flowlabel;
diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index ec029c86ae06..e2a7b0059669 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -16,6 +16,7 @@
 #include <linux/indirect_call_wrapper.h>
 
 #include <net/fib_rules.h>
+#include <net/inet_dscp.h>
 #include <net/ipv6.h>
 #include <net/addrconf.h>
 #include <net/ip6_route.h>
@@ -25,14 +26,14 @@ struct fib6_rule {
 	struct fib_rule		common;
 	struct rt6key		src;
 	struct rt6key		dst;
-	u8			tclass;
+	dscp_t			dscp;
 };
 
 static bool fib6_rule_matchall(const struct fib_rule *rule)
 {
 	struct fib6_rule *r = container_of(rule, struct fib6_rule, common);
 
-	if (r->dst.plen || r->src.plen || r->tclass)
+	if (r->dst.plen || r->src.plen || r->dscp)
 		return false;
 	return fib_rule_matchall(rule);
 }
@@ -323,7 +324,7 @@ INDIRECT_CALLABLE_SCOPE int fib6_rule_match(struct fib_rule *rule,
 			return 0;
 	}
 
-	if (r->tclass && r->tclass != ip6_tclass(fl6->flowlabel))
+	if (r->dscp && r->dscp != ip6_dscp(fl6->flowlabel))
 		return 0;
 
 	if (rule->ip_proto && (rule->ip_proto != fl6->flowi6_proto))
@@ -349,6 +350,13 @@ static int fib6_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 	struct net *net = sock_net(skb->sk);
 	struct fib6_rule *rule6 = (struct fib6_rule *) rule;
 
+	if (!inet_validate_dscp(frh->tos)) {
+		NL_SET_ERR_MSG(extack,
+			       "Invalid dsfield (tos): ECN bits must be 0");
+		goto errout;
+	}
+	rule6->dscp = inet_dsfield_to_dscp(frh->tos);
+
 	if (rule->action == FR_ACT_TO_TBL && !rule->l3mdev) {
 		if (rule->table == RT6_TABLE_UNSPEC) {
 			NL_SET_ERR_MSG(extack, "Invalid table");
@@ -369,7 +377,6 @@ static int fib6_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 
 	rule6->src.plen = frh->src_len;
 	rule6->dst.plen = frh->dst_len;
-	rule6->tclass = frh->tos;
 
 	if (fib_rule_requires_fldissect(rule))
 		net->ipv6.fib6_rules_require_fldissect++;
@@ -402,7 +409,7 @@ static int fib6_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
 	if (frh->dst_len && (rule6->dst.plen != frh->dst_len))
 		return 0;
 
-	if (frh->tos && (rule6->tclass != frh->tos))
+	if (frh->tos && inet_dscp_to_dsfield(rule6->dscp) != frh->tos)
 		return 0;
 
 	if (frh->src_len &&
@@ -423,7 +430,7 @@ static int fib6_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
 
 	frh->dst_len = rule6->dst.plen;
 	frh->src_len = rule6->src.plen;
-	frh->tos = rule6->tclass;
+	frh->tos = inet_dscp_to_dsfield(rule6->dscp);
 
 	if ((rule6->dst.plen &&
 	     nla_put_in6_addr(skb, FRA_DST, &rule6->dst.addr)) ||
diff --git a/tools/testing/selftests/net/fib_rule_tests.sh b/tools/testing/selftests/net/fib_rule_tests.sh
index 3b0489910422..d7a9ab3be1d3 100755
--- a/tools/testing/selftests/net/fib_rule_tests.sh
+++ b/tools/testing/selftests/net/fib_rule_tests.sh
@@ -114,10 +114,25 @@ fib_rule6_test_match_n_redirect()
 	log_test $? 0 "rule6 del by pref: $description"
 }
 
+fib_rule6_test_reject()
+{
+	local match="$1"
+	local rc
+
+	$IP -6 rule add $match table $RTABLE 2>/dev/null
+	rc=$?
+	log_test $rc 2 "rule6 check: $match"
+
+	if [ $rc -eq 0 ]; then
+		$IP -6 rule del $match table $RTABLE
+	fi
+}
+
 fib_rule6_test()
 {
 	local getmatch
 	local match
+	local cnt
 
 	# setup the fib rule redirect route
 	$IP -6 route add table $RTABLE default via $GW_IP6 dev $DEV onlink
@@ -128,8 +143,21 @@ fib_rule6_test()
 	match="from $SRC_IP6 iif $DEV"
 	fib_rule6_test_match_n_redirect "$match" "$match" "iif redirect to table"
 
+	# Reject dsfield (tos) options which have ECN bits set
+	for cnt in $(seq 1 3); do
+		match="dsfield $cnt"
+		fib_rule6_test_reject "$match"
+	done
+
+	# Don't take ECN bits into account when matching on dsfield
 	match="tos 0x10"
-	fib_rule6_test_match_n_redirect "$match" "$match" "tos redirect to table"
+	for cnt in "0x10" "0x11" "0x12" "0x13"; do
+		# Using option 'tos' instead of 'dsfield' as old iproute2
+		# versions don't support 'dsfield' in ip rule show.
+		getmatch="tos $cnt"
+		fib_rule6_test_match_n_redirect "$match" "$getmatch" \
+						"$getmatch redirect to table"
+	done
 
 	match="fwmark 0x64"
 	getmatch="mark 0x64"
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next 2/4] ipv4: Stop taking ECN bits into account in fib4-rules
  2022-02-04 13:58 [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type Guillaume Nault
  2022-02-04 13:58 ` [PATCH net-next 1/4] ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules Guillaume Nault
@ 2022-02-04 13:58 ` Guillaume Nault
  2022-02-04 13:58 ` [PATCH net-next 3/4] ipv4: Reject routes specifying ECN bits in rtm_tos Guillaume Nault
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Guillaume Nault @ 2022-02-04 13:58 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern,
	Toke Høiland-Jørgensen, Shuah Khan, linux-kselftest,
	Russell Strong, Dave Taht

Use the new dscp_t type to replace the tos field of struct fib4_rule,
so that fib4-rules consistently ignore ECN bits.

Before this patch, fib4-rules did accept rules with the high order ECN
bit set (but not the low order one). Also, it relied on its callers
masking the ECN bits of ->flowi4_tos to prevent those from influencing
the result. This was brittle and a few call paths still do the lookup
without masking the ECN bits first.

After this patch fib4-rules only compare the DSCP bits. ECN can't
influence the result anymore, even if the caller didn't mask these
bits. Also, fib4-rules now must have both ECN bits cleared or they will
be rejected.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 net/ipv4/fib_rules.c                          | 18 ++++++-----
 tools/testing/selftests/net/fib_rule_tests.sh | 30 ++++++++++++++++++-
 2 files changed, 39 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c
index e0b6c8b6de57..117c48571cf0 100644
--- a/net/ipv4/fib_rules.c
+++ b/net/ipv4/fib_rules.c
@@ -23,6 +23,7 @@
 #include <linux/list.h>
 #include <linux/rcupdate.h>
 #include <linux/export.h>
+#include <net/inet_dscp.h>
 #include <net/ip.h>
 #include <net/route.h>
 #include <net/tcp.h>
@@ -35,7 +36,7 @@ struct fib4_rule {
 	struct fib_rule		common;
 	u8			dst_len;
 	u8			src_len;
-	u8			tos;
+	dscp_t			dscp;
 	__be32			src;
 	__be32			srcmask;
 	__be32			dst;
@@ -49,7 +50,7 @@ static bool fib4_rule_matchall(const struct fib_rule *rule)
 {
 	struct fib4_rule *r = container_of(rule, struct fib4_rule, common);
 
-	if (r->dst_len || r->src_len || r->tos)
+	if (r->dst_len || r->src_len || r->dscp)
 		return false;
 	return fib_rule_matchall(rule);
 }
@@ -185,7 +186,7 @@ INDIRECT_CALLABLE_SCOPE int fib4_rule_match(struct fib_rule *rule,
 	    ((daddr ^ r->dst) & r->dstmask))
 		return 0;
 
-	if (r->tos && (r->tos != fl4->flowi4_tos))
+	if (r->dscp && r->dscp != inet_dsfield_to_dscp(fl4->flowi4_tos))
 		return 0;
 
 	if (rule->ip_proto && (rule->ip_proto != fl4->flowi4_proto))
@@ -225,10 +226,12 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 	int err = -EINVAL;
 	struct fib4_rule *rule4 = (struct fib4_rule *) rule;
 
-	if (frh->tos & ~IPTOS_TOS_MASK) {
-		NL_SET_ERR_MSG(extack, "Invalid tos");
+	if (!inet_validate_dscp(frh->tos)) {
+		NL_SET_ERR_MSG(extack,
+			       "Invalid dsfield (tos): ECN bits must be 0");
 		goto errout;
 	}
+	rule4->dscp = inet_dsfield_to_dscp(frh->tos);
 
 	/* split local/main if they are not already split */
 	err = fib_unmerge(net);
@@ -270,7 +273,6 @@ static int fib4_rule_configure(struct fib_rule *rule, struct sk_buff *skb,
 	rule4->srcmask = inet_make_mask(rule4->src_len);
 	rule4->dst_len = frh->dst_len;
 	rule4->dstmask = inet_make_mask(rule4->dst_len);
-	rule4->tos = frh->tos;
 
 	net->ipv4.fib_has_custom_rules = true;
 
@@ -313,7 +315,7 @@ static int fib4_rule_compare(struct fib_rule *rule, struct fib_rule_hdr *frh,
 	if (frh->dst_len && (rule4->dst_len != frh->dst_len))
 		return 0;
 
-	if (frh->tos && (rule4->tos != frh->tos))
+	if (frh->tos && inet_dscp_to_dsfield(rule4->dscp) != frh->tos)
 		return 0;
 
 #ifdef CONFIG_IP_ROUTE_CLASSID
@@ -337,7 +339,7 @@ static int fib4_rule_fill(struct fib_rule *rule, struct sk_buff *skb,
 
 	frh->dst_len = rule4->dst_len;
 	frh->src_len = rule4->src_len;
-	frh->tos = rule4->tos;
+	frh->tos = inet_dscp_to_dsfield(rule4->dscp);
 
 	if ((rule4->dst_len &&
 	     nla_put_in_addr(skb, FRA_DST, rule4->dst)) ||
diff --git a/tools/testing/selftests/net/fib_rule_tests.sh b/tools/testing/selftests/net/fib_rule_tests.sh
index d7a9ab3be1d3..4f70baad867d 100755
--- a/tools/testing/selftests/net/fib_rule_tests.sh
+++ b/tools/testing/selftests/net/fib_rule_tests.sh
@@ -215,10 +215,25 @@ fib_rule4_test_match_n_redirect()
 	log_test $? 0 "rule4 del by pref: $description"
 }
 
+fib_rule4_test_reject()
+{
+	local match="$1"
+	local rc
+
+	$IP rule add $match table $RTABLE 2>/dev/null
+	rc=$?
+	log_test $rc 2 "rule4 check: $match"
+
+	if [ $rc -eq 0 ]; then
+		$IP rule del $match table $RTABLE
+	fi
+}
+
 fib_rule4_test()
 {
 	local getmatch
 	local match
+	local cnt
 
 	# setup the fib rule redirect route
 	$IP route add table $RTABLE default via $GW_IP4 dev $DEV onlink
@@ -234,8 +249,21 @@ fib_rule4_test()
 	fib_rule4_test_match_n_redirect "$match" "$match" "iif redirect to table"
 	ip netns exec testns sysctl -qw net.ipv4.ip_forward=0
 
+	# Reject dsfield (tos) options which have ECN bits set
+	for cnt in $(seq 1 3); do
+		match="dsfield $cnt"
+		fib_rule4_test_reject "$match"
+	done
+
+	# Don't take ECN bits into account when matching on dsfield
 	match="tos 0x10"
-	fib_rule4_test_match_n_redirect "$match" "$match" "tos redirect to table"
+	for cnt in "0x10" "0x11" "0x12" "0x13"; do
+		# Using option 'tos' instead of 'dsfield' as old iproute2
+		# versions don't support 'dsfield' in ip rule show.
+		getmatch="tos $cnt"
+		fib_rule4_test_match_n_redirect "$match" "$getmatch" \
+						"$getmatch redirect to table"
+	done
 
 	match="fwmark 0x64"
 	getmatch="mark 0x64"
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next 3/4] ipv4: Reject routes specifying ECN bits in rtm_tos
  2022-02-04 13:58 [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type Guillaume Nault
  2022-02-04 13:58 ` [PATCH net-next 1/4] ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules Guillaume Nault
  2022-02-04 13:58 ` [PATCH net-next 2/4] ipv4: Stop taking ECN bits into account in fib4-rules Guillaume Nault
@ 2022-02-04 13:58 ` Guillaume Nault
  2022-02-04 18:19   ` Shuah Khan
  2022-02-04 13:58 ` [PATCH net-next 4/4] ipv4: Use dscp_t in struct fib_alias Guillaume Nault
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 16+ messages in thread
From: Guillaume Nault @ 2022-02-04 13:58 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern,
	Toke Høiland-Jørgensen, Shuah Khan, linux-kselftest,
	Russell Strong, Dave Taht

Use the new dscp_t type to replace the fc_tos field of fib_config, to
ensure IPv4 routes aren't influenced by ECN bits when configured with
non-zero rtm_tos.

Before this patch, IPv4 routes specifying an rtm_tos with some of the
ECN bits set were accepted. However they wouldn't work (never match) as
IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB
lookup (although a few buggy code paths don't).

After this patch, IPv4 routes specifying an rtm_tos with any ECN bit
set is rejected.

Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted,
but treated as if it were 0.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
Shuah, FYI, this is the patch I was refering to in our discussion about
testing invalid tos values:
https://lore.kernel.org/netdev/20220202232555.GC15826@pc-4.home/

 include/net/ip_fib.h                     |  3 +-
 net/ipv4/fib_frontend.c                  | 11 +++-
 net/ipv4/fib_trie.c                      |  7 ++-
 tools/testing/selftests/net/fib_tests.sh | 76 ++++++++++++++++++++++++
 4 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index c4297704bbcb..6a82bcb8813b 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -17,6 +17,7 @@
 #include <linux/rcupdate.h>
 #include <net/fib_notifier.h>
 #include <net/fib_rules.h>
+#include <net/inet_dscp.h>
 #include <net/inetpeer.h>
 #include <linux/percpu.h>
 #include <linux/notifier.h>
@@ -24,7 +25,7 @@
 
 struct fib_config {
 	u8			fc_dst_len;
-	u8			fc_tos;
+	dscp_t			fc_dscp;
 	u8			fc_protocol;
 	u8			fc_scope;
 	u8			fc_type;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 4d61ddd8a0ec..c60e1d1ed2b0 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -32,6 +32,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 
+#include <net/inet_dscp.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/route.h>
@@ -735,8 +736,16 @@ static int rtm_to_fib_config(struct net *net, struct sk_buff *skb,
 	memset(cfg, 0, sizeof(*cfg));
 
 	rtm = nlmsg_data(nlh);
+
+	if (!inet_validate_dscp(rtm->rtm_tos)) {
+		NL_SET_ERR_MSG(extack,
+			       "Invalid dsfield (tos): ECN bits must be 0");
+		err = -EINVAL;
+		goto errout;
+	}
+	cfg->fc_dscp = inet_dsfield_to_dscp(rtm->rtm_tos);
+
 	cfg->fc_dst_len = rtm->rtm_dst_len;
-	cfg->fc_tos = rtm->rtm_tos;
 	cfg->fc_table = rtm->rtm_table;
 	cfg->fc_protocol = rtm->rtm_protocol;
 	cfg->fc_scope = rtm->rtm_scope;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 8060524f4256..d937eeebb812 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -61,6 +61,7 @@
 #include <linux/vmalloc.h>
 #include <linux/notifier.h>
 #include <net/net_namespace.h>
+#include <net/inet_dscp.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/route.h>
@@ -1210,9 +1211,9 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
 	struct fib_info *fi;
 	u8 plen = cfg->fc_dst_len;
 	u8 slen = KEYLENGTH - plen;
-	u8 tos = cfg->fc_tos;
 	u32 key;
 	int err;
+	u8 tos;
 
 	key = ntohl(cfg->fc_dst);
 
@@ -1227,6 +1228,7 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
 		goto err;
 	}
 
+	tos = inet_dscp_to_dsfield(cfg->fc_dscp);
 	l = fib_find_node(t, &tp, key);
 	fa = l ? fib_find_alias(&l->leaf, slen, tos, fi->fib_priority,
 				tb->tb_id, false) : NULL;
@@ -1703,8 +1705,8 @@ int fib_table_delete(struct net *net, struct fib_table *tb,
 	struct key_vector *l, *tp;
 	u8 plen = cfg->fc_dst_len;
 	u8 slen = KEYLENGTH - plen;
-	u8 tos = cfg->fc_tos;
 	u32 key;
+	u8 tos;
 
 	key = ntohl(cfg->fc_dst);
 
@@ -1715,6 +1717,7 @@ int fib_table_delete(struct net *net, struct fib_table *tb,
 	if (!l)
 		return -ESRCH;
 
+	tos = inet_dscp_to_dsfield(cfg->fc_dscp);
 	fa = fib_find_alias(&l->leaf, slen, tos, 0, tb->tb_id, false);
 	if (!fa)
 		return -ESRCH;
diff --git a/tools/testing/selftests/net/fib_tests.sh b/tools/testing/selftests/net/fib_tests.sh
index 996af1ae3d3d..bb73235976b3 100755
--- a/tools/testing/selftests/net/fib_tests.sh
+++ b/tools/testing/selftests/net/fib_tests.sh
@@ -1447,6 +1447,81 @@ ipv4_local_rt_cache()
 	log_test $? 0 "Cached route removed from VRF port device"
 }
 
+ipv4_rt_dsfield()
+{
+	echo
+	echo "IPv4 route with dsfield tests"
+
+	run_cmd "$IP route flush 172.16.102.0/24"
+
+	# New routes should reject dsfield options that interfere with ECN
+	run_cmd "$IP route add 172.16.102.0/24 dsfield 0x01 via 172.16.101.2"
+	log_test $? 2 "Reject route with dsfield 0x01"
+
+	run_cmd "$IP route add 172.16.102.0/24 dsfield 0x02 via 172.16.101.2"
+	log_test $? 2 "Reject route with dsfield 0x02"
+
+	run_cmd "$IP route add 172.16.102.0/24 dsfield 0x03 via 172.16.101.2"
+	log_test $? 2 "Reject route with dsfield 0x03"
+
+	# A generic route that doesn't take DSCP into account
+	run_cmd "$IP route add 172.16.102.0/24 via 172.16.101.2"
+
+	# A more specific route for DSCP 0x10
+	run_cmd "$IP route add 172.16.102.0/24 dsfield 0x10 via 172.16.103.2"
+
+	# DSCP 0x10 should match the specific route, no matter the ECN bits
+	$IP route get fibmatch 172.16.102.1 dsfield 0x10 | \
+		grep -q "via 172.16.103.2"
+	log_test $? 0 "IPv4 route with DSCP and ECN:Not-ECT"
+
+	$IP route get fibmatch 172.16.102.1 dsfield 0x11 | \
+		grep -q "via 172.16.103.2"
+	log_test $? 0 "IPv4 route with DSCP and ECN:ECT(1)"
+
+	$IP route get fibmatch 172.16.102.1 dsfield 0x12 | \
+		grep -q "via 172.16.103.2"
+	log_test $? 0 "IPv4 route with DSCP and ECN:ECT(0)"
+
+	$IP route get fibmatch 172.16.102.1 dsfield 0x13 | \
+		grep -q "via 172.16.103.2"
+	log_test $? 0 "IPv4 route with DSCP and ECN:CE"
+
+	# Unknown DSCP should match the generic route, no matter the ECN bits
+	$IP route get fibmatch 172.16.102.1 dsfield 0x14 | \
+		grep -q "via 172.16.101.2"
+	log_test $? 0 "IPv4 route with unknown DSCP and ECN:Not-ECT"
+
+	$IP route get fibmatch 172.16.102.1 dsfield 0x15 | \
+		grep -q "via 172.16.101.2"
+	log_test $? 0 "IPv4 route with unknown DSCP and ECN:ECT(1)"
+
+	$IP route get fibmatch 172.16.102.1 dsfield 0x16 | \
+		grep -q "via 172.16.101.2"
+	log_test $? 0 "IPv4 route with unknown DSCP and ECN:ECT(0)"
+
+	$IP route get fibmatch 172.16.102.1 dsfield 0x17 | \
+		grep -q "via 172.16.101.2"
+	log_test $? 0 "IPv4 route with unknown DSCP and ECN:CE"
+
+	# Null DSCP should match the generic route, no matter the ECN bits
+	$IP route get fibmatch 172.16.102.1 dsfield 0x00 | \
+		grep -q "via 172.16.101.2"
+	log_test $? 0 "IPv4 route with no DSCP and ECN:Not-ECT"
+
+	$IP route get fibmatch 172.16.102.1 dsfield 0x01 | \
+		grep -q "via 172.16.101.2"
+	log_test $? 0 "IPv4 route with no DSCP and ECN:ECT(1)"
+
+	$IP route get fibmatch 172.16.102.1 dsfield 0x02 | \
+		grep -q "via 172.16.101.2"
+	log_test $? 0 "IPv4 route with no DSCP and ECN:ECT(0)"
+
+	$IP route get fibmatch 172.16.102.1 dsfield 0x03 | \
+		grep -q "via 172.16.101.2"
+	log_test $? 0 "IPv4 route with no DSCP and ECN:CE"
+}
+
 ipv4_route_test()
 {
 	route_setup
@@ -1454,6 +1529,7 @@ ipv4_route_test()
 	ipv4_rt_add
 	ipv4_rt_replace
 	ipv4_local_rt_cache
+	ipv4_rt_dsfield
 
 	route_cleanup
 }
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next 4/4] ipv4: Use dscp_t in struct fib_alias
  2022-02-04 13:58 [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type Guillaume Nault
                   ` (2 preceding siblings ...)
  2022-02-04 13:58 ` [PATCH net-next 3/4] ipv4: Reject routes specifying ECN bits in rtm_tos Guillaume Nault
@ 2022-02-04 13:58 ` Guillaume Nault
  2022-02-07  6:08 ` [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type David Ahern
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 16+ messages in thread
From: Guillaume Nault @ 2022-02-04 13:58 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern,
	Toke Høiland-Jørgensen, Russell Strong, Dave Taht

Use the new dscp_t type to replace the fa_tos field of fib_alias. This
ensures ECN bits are ignored and makes the field compatible with the
fc_dscp field of struct fib_config.

Converting old *tos variables and fields to dscp_t allows sparse to
flag incorrect uses of DSCP and ECN bits. This patch is entirely about
type annotation and shouldn't change any existing behaviour.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
---
 net/ipv4/fib_lookup.h    |  3 +-
 net/ipv4/fib_semantics.c | 14 ++++++----
 net/ipv4/fib_trie.c      | 59 ++++++++++++++++++++++------------------
 net/ipv4/route.c         |  3 +-
 4 files changed, 45 insertions(+), 34 deletions(-)

diff --git a/net/ipv4/fib_lookup.h b/net/ipv4/fib_lookup.h
index e184bcb19943..a63014b54809 100644
--- a/net/ipv4/fib_lookup.h
+++ b/net/ipv4/fib_lookup.h
@@ -4,13 +4,14 @@
 
 #include <linux/types.h>
 #include <linux/list.h>
+#include <net/inet_dscp.h>
 #include <net/ip_fib.h>
 #include <net/nexthop.h>
 
 struct fib_alias {
 	struct hlist_node	fa_list;
 	struct fib_info		*fa_info;
-	u8			fa_tos;
+	dscp_t			fa_dscp;
 	u8			fa_type;
 	u8			fa_state;
 	u8			fa_slen;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 4c5399450682..c9c4f2f66b38 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -32,6 +32,7 @@
 #include <linux/hash.h>
 
 #include <net/arp.h>
+#include <net/inet_dscp.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/route.h>
@@ -523,7 +524,7 @@ void rtmsg_fib(int event, __be32 key, struct fib_alias *fa,
 	fri.tb_id = tb_id;
 	fri.dst = key;
 	fri.dst_len = dst_len;
-	fri.tos = fa->fa_tos;
+	fri.tos = inet_dscp_to_dsfield(fa->fa_dscp);
 	fri.type = fa->fa_type;
 	fri.offload = fa->offload;
 	fri.trap = fa->trap;
@@ -2039,7 +2040,7 @@ static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
 	int order = -1, last_idx = -1;
 	struct fib_alias *fa, *fa1 = NULL;
 	u32 last_prio = res->fi->fib_priority;
-	u8 last_tos = 0;
+	dscp_t last_dscp = 0;
 
 	hlist_for_each_entry_rcu(fa, fa_head, fa_list) {
 		struct fib_info *next_fi = fa->fa_info;
@@ -2047,19 +2048,20 @@ static void fib_select_default(const struct flowi4 *flp, struct fib_result *res)
 
 		if (fa->fa_slen != slen)
 			continue;
-		if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
+		if (fa->fa_dscp &&
+		    fa->fa_dscp != inet_dsfield_to_dscp(flp->flowi4_tos))
 			continue;
 		if (fa->tb_id != tb->tb_id)
 			continue;
 		if (next_fi->fib_priority > last_prio &&
-		    fa->fa_tos == last_tos) {
-			if (last_tos)
+		    fa->fa_dscp == last_dscp) {
+			if (last_dscp)
 				continue;
 			break;
 		}
 		if (next_fi->fib_flags & RTNH_F_DEAD)
 			continue;
-		last_tos = fa->fa_tos;
+		last_dscp = fa->fa_dscp;
 		last_prio = next_fi->fib_priority;
 
 		if (next_fi->fib_scope != res->scope ||
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index d937eeebb812..c05cd105e95e 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -82,7 +82,7 @@ static int call_fib_entry_notifier(struct notifier_block *nb,
 		.dst = dst,
 		.dst_len = dst_len,
 		.fi = fa->fa_info,
-		.tos = fa->fa_tos,
+		.tos = inet_dscp_to_dsfield(fa->fa_dscp),
 		.type = fa->fa_type,
 		.tb_id = fa->tb_id,
 	};
@@ -99,7 +99,7 @@ static int call_fib_entry_notifiers(struct net *net,
 		.dst = dst,
 		.dst_len = dst_len,
 		.fi = fa->fa_info,
-		.tos = fa->fa_tos,
+		.tos = inet_dscp_to_dsfield(fa->fa_dscp),
 		.type = fa->fa_type,
 		.tb_id = fa->tb_id,
 	};
@@ -974,13 +974,13 @@ static struct key_vector *fib_find_node(struct trie *t,
 	return n;
 }
 
-/* Return the first fib alias matching TOS with
+/* Return the first fib alias matching DSCP with
  * priority less than or equal to PRIO.
  * If 'find_first' is set, return the first matching
- * fib alias, regardless of TOS and priority.
+ * fib alias, regardless of DSCP and priority.
  */
 static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
-					u8 tos, u32 prio, u32 tb_id,
+					dscp_t dscp, u32 prio, u32 tb_id,
 					bool find_first)
 {
 	struct fib_alias *fa;
@@ -989,6 +989,10 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
 		return NULL;
 
 	hlist_for_each_entry(fa, fah, fa_list) {
+		/* Avoid Sparse warning when using dscp_t in inequalities */
+		u8 __fa_dscp = inet_dscp_to_dsfield(fa->fa_dscp);
+		u8 __dscp = inet_dscp_to_dsfield(dscp);
+
 		if (fa->fa_slen < slen)
 			continue;
 		if (fa->fa_slen != slen)
@@ -999,9 +1003,9 @@ static struct fib_alias *fib_find_alias(struct hlist_head *fah, u8 slen,
 			break;
 		if (find_first)
 			return fa;
-		if (fa->fa_tos > tos)
+		if (__fa_dscp > __dscp)
 			continue;
-		if (fa->fa_info->fib_priority >= prio || fa->fa_tos < tos)
+		if (fa->fa_info->fib_priority >= prio || __fa_dscp < __dscp)
 			return fa;
 	}
 
@@ -1028,8 +1032,8 @@ fib_find_matching_alias(struct net *net, const struct fib_rt_info *fri)
 
 	hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
 		if (fa->fa_slen == slen && fa->tb_id == fri->tb_id &&
-		    fa->fa_tos == fri->tos && fa->fa_info == fri->fi &&
-		    fa->fa_type == fri->type)
+		    fa->fa_dscp == inet_dsfield_to_dscp(fri->tos) &&
+		    fa->fa_info == fri->fi && fa->fa_type == fri->type)
 			return fa;
 	}
 
@@ -1211,9 +1215,9 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
 	struct fib_info *fi;
 	u8 plen = cfg->fc_dst_len;
 	u8 slen = KEYLENGTH - plen;
+	dscp_t dscp;
 	u32 key;
 	int err;
-	u8 tos;
 
 	key = ntohl(cfg->fc_dst);
 
@@ -1228,13 +1232,13 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
 		goto err;
 	}
 
-	tos = inet_dscp_to_dsfield(cfg->fc_dscp);
+	dscp = cfg->fc_dscp;
 	l = fib_find_node(t, &tp, key);
-	fa = l ? fib_find_alias(&l->leaf, slen, tos, fi->fib_priority,
+	fa = l ? fib_find_alias(&l->leaf, slen, dscp, fi->fib_priority,
 				tb->tb_id, false) : NULL;
 
 	/* Now fa, if non-NULL, points to the first fib alias
-	 * with the same keys [prefix,tos,priority], if such key already
+	 * with the same keys [prefix,dscp,priority], if such key already
 	 * exists or to the node before which we will insert new one.
 	 *
 	 * If fa is NULL, we will need to allocate a new one and
@@ -1242,7 +1246,7 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
 	 * of the new alias.
 	 */
 
-	if (fa && fa->fa_tos == tos &&
+	if (fa && fa->fa_dscp == dscp &&
 	    fa->fa_info->fib_priority == fi->fib_priority) {
 		struct fib_alias *fa_first, *fa_match;
 
@@ -1262,7 +1266,7 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
 		hlist_for_each_entry_from(fa, fa_list) {
 			if ((fa->fa_slen != slen) ||
 			    (fa->tb_id != tb->tb_id) ||
-			    (fa->fa_tos != tos))
+			    (fa->fa_dscp != dscp))
 				break;
 			if (fa->fa_info->fib_priority != fi->fib_priority)
 				break;
@@ -1290,7 +1294,7 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
 				goto out;
 
 			fi_drop = fa->fa_info;
-			new_fa->fa_tos = fa->fa_tos;
+			new_fa->fa_dscp = fa->fa_dscp;
 			new_fa->fa_info = fi;
 			new_fa->fa_type = cfg->fc_type;
 			state = fa->fa_state;
@@ -1353,7 +1357,7 @@ int fib_table_insert(struct net *net, struct fib_table *tb,
 		goto out;
 
 	new_fa->fa_info = fi;
-	new_fa->fa_tos = tos;
+	new_fa->fa_dscp = dscp;
 	new_fa->fa_type = cfg->fc_type;
 	new_fa->fa_state = 0;
 	new_fa->fa_slen = slen;
@@ -1569,7 +1573,8 @@ int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 			if (index >= (1ul << fa->fa_slen))
 				continue;
 		}
-		if (fa->fa_tos && fa->fa_tos != flp->flowi4_tos)
+		if (fa->fa_dscp &&
+		    inet_dscp_to_dsfield(fa->fa_dscp) != flp->flowi4_tos)
 			continue;
 		if (fi->fib_dead)
 			continue;
@@ -1705,8 +1710,8 @@ int fib_table_delete(struct net *net, struct fib_table *tb,
 	struct key_vector *l, *tp;
 	u8 plen = cfg->fc_dst_len;
 	u8 slen = KEYLENGTH - plen;
+	dscp_t dscp;
 	u32 key;
-	u8 tos;
 
 	key = ntohl(cfg->fc_dst);
 
@@ -1717,12 +1722,13 @@ int fib_table_delete(struct net *net, struct fib_table *tb,
 	if (!l)
 		return -ESRCH;
 
-	tos = inet_dscp_to_dsfield(cfg->fc_dscp);
-	fa = fib_find_alias(&l->leaf, slen, tos, 0, tb->tb_id, false);
+	dscp = cfg->fc_dscp;
+	fa = fib_find_alias(&l->leaf, slen, dscp, 0, tb->tb_id, false);
 	if (!fa)
 		return -ESRCH;
 
-	pr_debug("Deleting %08x/%d tos=%d t=%p\n", key, plen, tos, t);
+	pr_debug("Deleting %08x/%d dsfield=0x%02x t=%p\n", key, plen,
+		 inet_dscp_to_dsfield(dscp), t);
 
 	fa_to_delete = NULL;
 	hlist_for_each_entry_from(fa, fa_list) {
@@ -1730,7 +1736,7 @@ int fib_table_delete(struct net *net, struct fib_table *tb,
 
 		if ((fa->fa_slen != slen) ||
 		    (fa->tb_id != tb->tb_id) ||
-		    (fa->fa_tos != tos))
+		    (fa->fa_dscp != dscp))
 			break;
 
 		if ((!cfg->fc_type || fa->fa_type == cfg->fc_type) &&
@@ -2298,7 +2304,7 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
 				fri.tb_id = tb->tb_id;
 				fri.dst = xkey;
 				fri.dst_len = KEYLENGTH - fa->fa_slen;
-				fri.tos = fa->fa_tos;
+				fri.tos = inet_dscp_to_dsfield(fa->fa_dscp);
 				fri.type = fa->fa_type;
 				fri.offload = fa->offload;
 				fri.trap = fa->trap;
@@ -2810,8 +2816,9 @@ static int fib_trie_seq_show(struct seq_file *seq, void *v)
 					     fa->fa_info->fib_scope),
 				   rtn_type(buf2, sizeof(buf2),
 					    fa->fa_type));
-			if (fa->fa_tos)
-				seq_printf(seq, " tos=%d", fa->fa_tos);
+			if (fa->fa_dscp)
+				seq_printf(seq, " tos=%d",
+					   inet_dscp_to_dsfield(fa->fa_dscp));
 			seq_putc(seq, '\n');
 		}
 	}
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8b35075088e1..634766e6c7cc 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -84,6 +84,7 @@
 #include <linux/jhash.h>
 #include <net/dst.h>
 #include <net/dst_metadata.h>
+#include <net/inet_dscp.h>
 #include <net/net_namespace.h>
 #include <net/ip.h>
 #include <net/route.h>
@@ -3391,7 +3392,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 
 				if (fa->fa_slen == slen &&
 				    fa->tb_id == fri.tb_id &&
-				    fa->fa_tos == fri.tos &&
+				    fa->fa_dscp == inet_dsfield_to_dscp(fri.tos) &&
 				    fa->fa_info == res.fi &&
 				    fa->fa_type == fri.type) {
 					fri.offload = fa->offload;
-- 
2.21.3


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 3/4] ipv4: Reject routes specifying ECN bits in rtm_tos
  2022-02-04 13:58 ` [PATCH net-next 3/4] ipv4: Reject routes specifying ECN bits in rtm_tos Guillaume Nault
@ 2022-02-04 18:19   ` Shuah Khan
  0 siblings, 0 replies; 16+ messages in thread
From: Shuah Khan @ 2022-02-04 18:19 UTC (permalink / raw)
  To: Guillaume Nault, David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern,
	Toke Høiland-Jørgensen, Shuah Khan, linux-kselftest,
	Russell Strong, Dave Taht, Shuah Khan

On 2/4/22 6:58 AM, Guillaume Nault wrote:
> Use the new dscp_t type to replace the fc_tos field of fib_config, to
> ensure IPv4 routes aren't influenced by ECN bits when configured with
> non-zero rtm_tos.
> 
> Before this patch, IPv4 routes specifying an rtm_tos with some of the
> ECN bits set were accepted. However they wouldn't work (never match) as
> IPv4 normally clears the ECN bits with IPTOS_RT_MASK before doing a FIB
> lookup (although a few buggy code paths don't).
> 
> After this patch, IPv4 routes specifying an rtm_tos with any ECN bit
> set is rejected.
> 
> Note: IPv6 routes ignore rtm_tos altogether, any rtm_tos is accepted,
> but treated as if it were 0.
> 
> Signed-off-by: Guillaume Nault <gnault@redhat.com>
> ---
> Shuah, FYI, this is the patch I was refering to in our discussion about
> testing invalid tos values:
> https://lore.kernel.org/netdev/20220202232555.GC15826@pc-4.home/
> 

This give me context. Thank you.

thanks,
-- Shuah

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
  2022-02-04 13:58 [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type Guillaume Nault
                   ` (3 preceding siblings ...)
  2022-02-04 13:58 ` [PATCH net-next 4/4] ipv4: Use dscp_t in struct fib_alias Guillaume Nault
@ 2022-02-07  6:08 ` David Ahern
  2022-02-07 19:03 ` Toke Høiland-Jørgensen
  2022-02-08  5:10 ` patchwork-bot+netdevbpf
  6 siblings, 0 replies; 16+ messages in thread
From: David Ahern @ 2022-02-07  6:08 UTC (permalink / raw)
  To: Guillaume Nault, David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern,
	Toke Høiland-Jørgensen, Shuah Khan, linux-kselftest,
	Russell Strong, Dave Taht

On 2/4/22 5:58 AM, Guillaume Nault wrote:
> The networking stack currently doesn't clearly distinguish between DSCP
> and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or
> structure fields), and each part of the stack handles them in their own
> way, using different macros. This has created several bugs in the past
> and some uncommon code paths are still unfixed.
> 
> Such bugs generally manifest by selecting invalid routes because of ECN
> bits interfering with FIB routes and rules lookups (more details in the
> LPC 2021 talk[1] and in the RFC of this series[2]).
> 
> This patch series aims at preventing the introduction of such bugs (and
> detecting existing ones), by introducing a dscp_t type, representing
> "sanitised" DSCP values (that is, with no ECN information), as opposed
> to plain u8 values that contain both DSCP and ECN information. dscp_t
> makes it clear for the reader what we're working on, and Sparse can
> flag invalid interactions between dscp_t and plain u8.
> 
> This series converts only a few variables and structures:
> 
>   * Patch 1 converts the tclass field of struct fib6_rule. It
>     effectively forbids the use of ECN bits in the tos/dsfield option
>     of ip -6 rule. Rules now match packets solely based on their DSCP
>     bits, so ECN doesn't influence the result any more. This contrasts
>     with the previous behaviour where all 8 bits of the Traffic Class
>     field were used. It is believed that this change is acceptable as
>     matching ECN bits wasn't usable for IPv4, so only IPv6-only
>     deployments could be depending on it. Also the previous behaviour
>     made DSCP-based ip6-rules fail for packets with both a DSCP and an
>     ECN mark, which is another reason why any such deploy is unlikely.
> 
>   * Patch 2 converts the tos field of struct fib4_rule. This one too
>     effectively forbids defining ECN bits, this time in ip -4 rule.
>     Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
>     rejected. But even when accepted, the rule would never match, as
>     the packets would have their ECN bits cleared before doing the
>     rule lookup.
> 
>   * Patch 3 converts the fc_tos field of struct fib_config. This is
>     equivalent to patch 2, but for IPv4 routes. Routes using a
>     tos/dsfield option with any ECN bit set is now rejected. Before
>     this patch, they were accepted but, as with ip4 rules, these routes
>     couldn't match any packet, since their ECN bits are cleared before
>     the lookup.
> 
>   * Patch 4 converts the fa_tos field of struct fib_alias. This one is
>     pure internal u8 to dscp_t conversion. While patches 1-3 had user
>     facing consequences, this patch shouldn't have any side effect and
>     is there to give an overview of what future conversion patches will
>     look like. Conversions are quite mechanical, but imply some code
>     churn, which is the price for the extra clarity a possibility of
>     type checking.
> 
> To summarise, all the behaviour changes required for the dscp_t type
> approach to work should be contained in patches 1-3. These changes are
> edge cases of ip-route and ip-rule that don't currently work properly.
> So they should be safe. Also, a kernel selftest is added for each of
> them.
> 

seems like the right directions to me.

Acked-by: David Ahern <dsahern@kernel.org>





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
  2022-02-04 13:58 [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type Guillaume Nault
                   ` (4 preceding siblings ...)
  2022-02-07  6:08 ` [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type David Ahern
@ 2022-02-07 19:03 ` Toke Høiland-Jørgensen
  2022-02-08  5:10 ` patchwork-bot+netdevbpf
  6 siblings, 0 replies; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-02-07 19:03 UTC (permalink / raw)
  To: Guillaume Nault, David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern, Shuah Khan,
	linux-kselftest, Russell Strong, Dave Taht

Guillaume Nault <gnault@redhat.com> writes:

> The networking stack currently doesn't clearly distinguish between DSCP
> and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or
> structure fields), and each part of the stack handles them in their own
> way, using different macros. This has created several bugs in the past
> and some uncommon code paths are still unfixed.
>
> Such bugs generally manifest by selecting invalid routes because of ECN
> bits interfering with FIB routes and rules lookups (more details in the
> LPC 2021 talk[1] and in the RFC of this series[2]).
>
> This patch series aims at preventing the introduction of such bugs (and
> detecting existing ones), by introducing a dscp_t type, representing
> "sanitised" DSCP values (that is, with no ECN information), as opposed
> to plain u8 values that contain both DSCP and ECN information. dscp_t
> makes it clear for the reader what we're working on, and Sparse can
> flag invalid interactions between dscp_t and plain u8.
>
> This series converts only a few variables and structures:
>
>   * Patch 1 converts the tclass field of struct fib6_rule. It
>     effectively forbids the use of ECN bits in the tos/dsfield option
>     of ip -6 rule. Rules now match packets solely based on their DSCP
>     bits, so ECN doesn't influence the result any more. This contrasts
>     with the previous behaviour where all 8 bits of the Traffic Class
>     field were used. It is believed that this change is acceptable as
>     matching ECN bits wasn't usable for IPv4, so only IPv6-only
>     deployments could be depending on it. Also the previous behaviour
>     made DSCP-based ip6-rules fail for packets with both a DSCP and an
>     ECN mark, which is another reason why any such deploy is unlikely.
>
>   * Patch 2 converts the tos field of struct fib4_rule. This one too
>     effectively forbids defining ECN bits, this time in ip -4 rule.
>     Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
>     rejected. But even when accepted, the rule would never match, as
>     the packets would have their ECN bits cleared before doing the
>     rule lookup.
>
>   * Patch 3 converts the fc_tos field of struct fib_config. This is
>     equivalent to patch 2, but for IPv4 routes. Routes using a
>     tos/dsfield option with any ECN bit set is now rejected. Before
>     this patch, they were accepted but, as with ip4 rules, these routes
>     couldn't match any packet, since their ECN bits are cleared before
>     the lookup.
>
>   * Patch 4 converts the fa_tos field of struct fib_alias. This one is
>     pure internal u8 to dscp_t conversion. While patches 1-3 had user
>     facing consequences, this patch shouldn't have any side effect and
>     is there to give an overview of what future conversion patches will
>     look like. Conversions are quite mechanical, but imply some code
>     churn, which is the price for the extra clarity a possibility of
>     type checking.
>
> To summarise, all the behaviour changes required for the dscp_t type
> approach to work should be contained in patches 1-3. These changes are
> edge cases of ip-route and ip-rule that don't currently work properly.
> So they should be safe. Also, a kernel selftest is added for each of
> them.
>
> Finally, this work also paves the way for allowing the usage of the 3
> high order DSCP bits in IPv4 (a few call paths already handle them, but
> in general the stack clears them before IPv4 rule and route lookups).

LGTM; thanks again for doing this!

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
  2022-02-04 13:58 [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type Guillaume Nault
                   ` (5 preceding siblings ...)
  2022-02-07 19:03 ` Toke Høiland-Jørgensen
@ 2022-02-08  5:10 ` patchwork-bot+netdevbpf
  6 siblings, 0 replies; 16+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-02-08  5:10 UTC (permalink / raw)
  To: Guillaume Nault
  Cc: davem, kuba, netdev, yoshfuji, dsahern, toke, shuah,
	linux-kselftest, russell, dave.taht

Hello:

This series was applied to netdev/net-next.git (master)
by Jakub Kicinski <kuba@kernel.org>:

On Fri, 4 Feb 2022 14:58:09 +0100 you wrote:
> The networking stack currently doesn't clearly distinguish between DSCP
> and ECN bits. The entire DSCP+ECN bits are stored in u8 variables (or
> structure fields), and each part of the stack handles them in their own
> way, using different macros. This has created several bugs in the past
> and some uncommon code paths are still unfixed.
> 
> Such bugs generally manifest by selecting invalid routes because of ECN
> bits interfering with FIB routes and rules lookups (more details in the
> LPC 2021 talk[1] and in the RFC of this series[2]).
> 
> [...]

Here is the summary with links:
  - [net-next,1/4] ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules
    (no matching commit)
  - [net-next,2/4] ipv4: Stop taking ECN bits into account in fib4-rules
    https://git.kernel.org/netdev/net-next/c/563f8e97e054
  - [net-next,3/4] ipv4: Reject routes specifying ECN bits in rtm_tos
    https://git.kernel.org/netdev/net-next/c/f55fbb6afb8d
  - [net-next,4/4] ipv4: Use dscp_t in struct fib_alias
    https://git.kernel.org/netdev/net-next/c/32ccf1107980

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
  2021-12-17 17:55     ` Toke Høiland-Jørgensen
@ 2021-12-17 22:52       ` Guillaume Nault
  0 siblings, 0 replies; 16+ messages in thread
From: Guillaume Nault @ 2021-12-17 22:52 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David Miller, Jakub Kicinski, netdev, Hideaki YOSHIFUJI,
	David Ahern, Russell Strong

On Fri, Dec 17, 2021 at 06:55:43PM +0100, Toke Høiland-Jørgensen wrote:
> >> > Note that there's no equivalent of patch 3 for IPv6 (ip route), since
> >> > the tos/dsfield option is silently ignored for IPv6 routes.
> >> 
> >> Shouldn't we just start rejecting them, like for v4?
> >
> > I had some thoughs about that, but didn't talk about them in the cover
> > letter since I felt there was already enough edge cases to discuss, and
> > this one wasn't directly related to this series (the problem is there
> > regardless of this RFC).
> >
> > So, on the one hand, we have this old policy of ignoring unknown
> > netlink attributes, so it looks consistent to also ignore unused
> > structure fields.
> >
> > On the other hand, ignoring rtm_tos leads to a different behaviour than
> > what was requested. So it certainly makes sense to at least warn the
> > user. But a hard fail may break existing programs that don't clear
> > rtm_tos by mistake.
> >
> > I'm not too sure which approach is better.
> 
> So I guess you could argue that those applications were broken in the
> first place, and so an explicit reject would only expose this? Do you
> know of any applications that actually *function* while doing what you
> describe?

I don't know of any existing application that actually does. But it's
easy to imagine a developer setting only parts of the rtmsg structure
and leaving the rest uninitialised. Exposing the problem might not help
the end user, who may have no way to modify the broken program.

Also, for people using ifupdown (/etc/network/interfaces on Debian and
derivatives), rejecting a command can cancel the configuration of an
entire device section. So a stray tos option on an ip -6 route command
would now leave the network interface entirely unconfigured.

I'm not saying these situations exist, just trying to anticipate all
possible side effects.

> One thought could be to add the rejection but be prepared to back it out
> if it does turn out (during the -rc phase) that it breaks something?

Given that it's something that'd be easy to revert, maybe we can try
this approach.

> -Toke
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
  2021-12-15 16:48   ` Guillaume Nault
  2021-12-15 20:40     ` Dave Taht
@ 2021-12-17 17:55     ` Toke Høiland-Jørgensen
  2021-12-17 22:52       ` Guillaume Nault
  1 sibling, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-12-17 17:55 UTC (permalink / raw)
  To: Guillaume Nault
  Cc: David Miller, Jakub Kicinski, netdev, Hideaki YOSHIFUJI,
	David Ahern, Russell Strong

>> > Note that there's no equivalent of patch 3 for IPv6 (ip route), since
>> > the tos/dsfield option is silently ignored for IPv6 routes.
>> 
>> Shouldn't we just start rejecting them, like for v4?
>
> I had some thoughs about that, but didn't talk about them in the cover
> letter since I felt there was already enough edge cases to discuss, and
> this one wasn't directly related to this series (the problem is there
> regardless of this RFC).
>
> So, on the one hand, we have this old policy of ignoring unknown
> netlink attributes, so it looks consistent to also ignore unused
> structure fields.
>
> On the other hand, ignoring rtm_tos leads to a different behaviour than
> what was requested. So it certainly makes sense to at least warn the
> user. But a hard fail may break existing programs that don't clear
> rtm_tos by mistake.
>
> I'm not too sure which approach is better.

So I guess you could argue that those applications were broken in the
first place, and so an explicit reject would only expose this? Do you
know of any applications that actually *function* while doing what you
describe?

One thought could be to add the rejection but be prepared to back it out
if it does turn out (during the -rc phase) that it breaks something?

-Toke


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
  2021-12-15 16:48   ` Guillaume Nault
@ 2021-12-15 20:40     ` Dave Taht
  2021-12-17 17:55     ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 16+ messages in thread
From: Dave Taht @ 2021-12-15 20:40 UTC (permalink / raw)
  To: Guillaume Nault
  Cc: Toke Høiland-Jørgensen, David Miller, Jakub Kicinski,
	Linux Kernel Network Developers, Hideaki YOSHIFUJI, David Ahern,
	Russell Strong

I look forward to a final toss of TOS.

Cheered-on-by: Dave Taht <dave.taht@gmail.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
  2021-12-14  0:16 ` Toke Høiland-Jørgensen
@ 2021-12-15 16:48   ` Guillaume Nault
  2021-12-15 20:40     ` Dave Taht
  2021-12-17 17:55     ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 16+ messages in thread
From: Guillaume Nault @ 2021-12-15 16:48 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David Miller, Jakub Kicinski, netdev, Hideaki YOSHIFUJI,
	David Ahern, Russell Strong

On Tue, Dec 14, 2021 at 01:16:43AM +0100, Toke Høiland-Jørgensen wrote:
> Guillaume Nault <gnault@redhat.com> writes:
> 
> > Following my talk at LPC 2021 [1], here's a patch series whose
> > objective is to start fixing the problems with how DSCP and ECN bits
> > are handled in the kernel. This approach seemed to make consensus among
> > the participants, although it implies a few behaviour changes for some
> > corner cases of ip rule and ip route. Let's see if this consensus can
> > survive a wider review :).
> 
> I like the approach, although I must admit to not being too familiar
> with the parts of the code you're touching in this series. But I think
> the typedefs make sense, and I (still) think it's a good idea to do the
> conversion. I think the main thing to ensure from a backwards
> compatibility PoV is that we don't silently change behaviour in a way
> that is hard to detect. I.e., rejecting invalid configuration is fine
> even if it was "allowed" before, but, say, changing the matching
> behaviour so an existing rule set will still run unchanged but behave
> differently is best avoided.
> 
> > Note, this patch series differs slightly from that of the original talk
> > (slide 14 [2]). For the talk, I just cleared the ECN bits, while in
> > this series, I do a bit-shift. This way dscp_t really represents DSCP
> > values, as defined in RFCs. Also I've renamed the helper functions to
> > replace "u8" by "dsfield", as I felt "u8" was ambiguous. Using
> > "dsfield" makes it clear that dscp_t to u8 conversion isn't just a
> > plain cast, but that a bit-shift happens and the result has the two ECN
> > bits.
> 
> I like the names, but why do the bitshift? I get that it's conceptually
> "cleaner", but AFAICT the shifted values are not actually used for
> anything other than being shifted back again? In which case you're just
> adding operations in the fast path for no reason...

That's right, the value is always shifted back again because all
current APIs work with the full dsfield (well all the ones I'm aware of
at least).

I switched to a bit shift when I tried writing down what dscp_t
was representing: I found it a bit clumsy to explain that it actually
wasn't exactly the DSCP. Also I didn't expect the bit shift to have any
mesurable impact.

Anyway, I don't mind reverting to a simple bit mask.

> > The new dscp_t type is then used to convert several field members:
> >
> >   * Patch 1 converts the tclass field of struct fib6_rule. It
> >     effectively forbids the use of ECN bits in the tos/dsfield option
> >     of ip -6 rule. Rules now match packets solely based on their DSCP
> >     bits, so ECN doesn't influence the result anymore. This contrasts
> >     with previous behaviour where all 8 bits of the Traffic Class field
> >     was used. It is believed this change is acceptable as matching ECN
> >     bits wasn't usable for IPv4, so only IPv6-only deployments could be
> >     depending on it (that is, it's unlikely enough that a someone uses
> >     ip6 rules matching ECN bits in production).
> 
> I think this is OK, cf the "break explicitly" thing I wrote above.
> 
> >   * Patch 2 converts the tos field of struct fib4_rule. This one too
> >     effectively forbids defining ECN bits, this time in ip -4 rule.
> >     Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
> >     rejected. But even when accepted, the rule wouldn't match as the
> >     packets would normally have their ECN bits cleared while doing the
> >     rule lookup.
> 
> As above.
> 
> >   * Patch 3 converts the fc_tos field of struct fib_config. This is
> >     like patch 2, but for ip4 routes. Routes using a tos/dsfield option
> >     with any ECN bit set is now rejected. Before this patch, they were
> >     accepted but, as with ip4 rules, these routes couldn't match any
> >     real packet, since callers were supposed to clear their ECN bits
> >     beforehand.
> 
> Didn't work at all, so also fine.
> 
> >   * Patch 4 converts the fa_tos field of struct fib_alias. This one is
> >     pure internal u8 to dscp_t conversion. While patches 1-3 dealed
> >     with user facing consequences, this patch shouldn't have any side
> >     effect and is just there to give an overview of what such
> >     conversion patches will look like. These are quite mechanical, but
> >     imply some code churn.
> 
> This is reasonable, and I think the code churn is worth the extra
> clarity.

Thanks for these feedbacks. These are the main points I wanted to
discuss with this RFC.

> You should probably spell out in the commit message that it's
> not intended to change behaviour, though.

Will do.

> > Note that there's no equivalent of patch 3 for IPv6 (ip route), since
> > the tos/dsfield option is silently ignored for IPv6 routes.
> 
> Shouldn't we just start rejecting them, like for v4?

I had some thoughs about that, but didn't talk about them in the cover
letter since I felt there was already enough edge cases to discuss, and
this one wasn't directly related to this series (the problem is there
regardless of this RFC).

So, on the one hand, we have this old policy of ignoring unknown
netlink attributes, so it looks consistent to also ignore unused
structure fields.

On the other hand, ignoring rtm_tos leads to a different behaviour than
what was requested. So it certainly makes sense to at least warn the
user. But a hard fail may break existing programs that don't clear
rtm_tos by mistake.

I'm not too sure which approach is better.

> -Toke
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
  2021-12-06 18:22 Guillaume Nault
  2021-12-06 19:57 ` Guillaume Nault
@ 2021-12-14  0:16 ` Toke Høiland-Jørgensen
  2021-12-15 16:48   ` Guillaume Nault
  1 sibling, 1 reply; 16+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-12-14  0:16 UTC (permalink / raw)
  To: Guillaume Nault, David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern, Russell Strong

Guillaume Nault <gnault@redhat.com> writes:

> Following my talk at LPC 2021 [1], here's a patch series whose
> objective is to start fixing the problems with how DSCP and ECN bits
> are handled in the kernel. This approach seemed to make consensus among
> the participants, although it implies a few behaviour changes for some
> corner cases of ip rule and ip route. Let's see if this consensus can
> survive a wider review :).

I like the approach, although I must admit to not being too familiar
with the parts of the code you're touching in this series. But I think
the typedefs make sense, and I (still) think it's a good idea to do the
conversion. I think the main thing to ensure from a backwards
compatibility PoV is that we don't silently change behaviour in a way
that is hard to detect. I.e., rejecting invalid configuration is fine
even if it was "allowed" before, but, say, changing the matching
behaviour so an existing rule set will still run unchanged but behave
differently is best avoided.

> Note, this patch series differs slightly from that of the original talk
> (slide 14 [2]). For the talk, I just cleared the ECN bits, while in
> this series, I do a bit-shift. This way dscp_t really represents DSCP
> values, as defined in RFCs. Also I've renamed the helper functions to
> replace "u8" by "dsfield", as I felt "u8" was ambiguous. Using
> "dsfield" makes it clear that dscp_t to u8 conversion isn't just a
> plain cast, but that a bit-shift happens and the result has the two ECN
> bits.

I like the names, but why do the bitshift? I get that it's conceptually
"cleaner", but AFAICT the shifted values are not actually used for
anything other than being shifted back again? In which case you're just
adding operations in the fast path for no reason...

> The new dscp_t type is then used to convert several field members:
>
>   * Patch 1 converts the tclass field of struct fib6_rule. It
>     effectively forbids the use of ECN bits in the tos/dsfield option
>     of ip -6 rule. Rules now match packets solely based on their DSCP
>     bits, so ECN doesn't influence the result anymore. This contrasts
>     with previous behaviour where all 8 bits of the Traffic Class field
>     was used. It is believed this change is acceptable as matching ECN
>     bits wasn't usable for IPv4, so only IPv6-only deployments could be
>     depending on it (that is, it's unlikely enough that a someone uses
>     ip6 rules matching ECN bits in production).

I think this is OK, cf the "break explicitly" thing I wrote above.

>   * Patch 2 converts the tos field of struct fib4_rule. This one too
>     effectively forbids defining ECN bits, this time in ip -4 rule.
>     Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
>     rejected. But even when accepted, the rule wouldn't match as the
>     packets would normally have their ECN bits cleared while doing the
>     rule lookup.

As above.

>   * Patch 3 converts the fc_tos field of struct fib_config. This is
>     like patch 2, but for ip4 routes. Routes using a tos/dsfield option
>     with any ECN bit set is now rejected. Before this patch, they were
>     accepted but, as with ip4 rules, these routes couldn't match any
>     real packet, since callers were supposed to clear their ECN bits
>     beforehand.

Didn't work at all, so also fine.

>   * Patch 4 converts the fa_tos field of struct fib_alias. This one is
>     pure internal u8 to dscp_t conversion. While patches 1-3 dealed
>     with user facing consequences, this patch shouldn't have any side
>     effect and is just there to give an overview of what such
>     conversion patches will look like. These are quite mechanical, but
>     imply some code churn.

This is reasonable, and I think the code churn is worth the extra
clarity. You should probably spell out in the commit message that it's
not intended to change behaviour, though.

> Note that there's no equivalent of patch 3 for IPv6 (ip route), since
> the tos/dsfield option is silently ignored for IPv6 routes.

Shouldn't we just start rejecting them, like for v4?

-Toke


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
  2021-12-06 18:22 Guillaume Nault
@ 2021-12-06 19:57 ` Guillaume Nault
  2021-12-14  0:16 ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 16+ messages in thread
From: Guillaume Nault @ 2021-12-06 19:57 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern,
	Toke Høiland-Jørgensen, Russell Strong

On Mon, Dec 06, 2021 at 07:22:02PM +0100, Guillaume Nault wrote:
> Following my talk at LPC 2021 [1], here's a patch series whose
> objective is to start fixing the problems with how DSCP and ECN bits
> are handled in the kernel.

This is of course an RFC patch series, not meant for merging. Sorry for
the missing RFC tag in the subject (used the wrong script :/).


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type
@ 2021-12-06 18:22 Guillaume Nault
  2021-12-06 19:57 ` Guillaume Nault
  2021-12-14  0:16 ` Toke Høiland-Jørgensen
  0 siblings, 2 replies; 16+ messages in thread
From: Guillaume Nault @ 2021-12-06 18:22 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski
  Cc: netdev, Hideaki YOSHIFUJI, David Ahern,
	Toke Høiland-Jørgensen, Russell Strong

Following my talk at LPC 2021 [1], here's a patch series whose
objective is to start fixing the problems with how DSCP and ECN bits
are handled in the kernel. This approach seemed to make consensus among
the participants, although it implies a few behaviour changes for some
corner cases of ip rule and ip route. Let's see if this consensus can
survive a wider review :).

The problem to solve
====================

Currently, the networking stack generally doesn't clearly distinguish
between DSCP and ECN bits. The entire DSCP+ECN bits are stored in a
single u8 variable (or structure field), and each part of the stack
handles them in their own way, using different macros.

This has been, and still is, the source of several bugs:

  * Invalid use of RT_TOS(), leading to regression in VXLAN:
      - a0dced17ad9d ("Revert "vxlan: fix tos value before xmit"")

  * Other invalid uses of RT_TOS() in IPv6 code, where it doesn't make
    sense (RT_TOS() follows the old RFC 1349, which was never meant for
    IPv6):
      - grep for 'ip6_make_flowinfo(RT_TOS(tos)' for several examples
        that need to be fixed.

  * Failure to properly mask the ECN bits before doing IPv4 route
    lookups, leading to returning a wrong route:
      - 2e5a6266fbb1 ("netfilter: rpfilter: mask ecn bits before fib lookup"),
      - 8d2b51b008c2 ("udp: mask TOS bits in udp_v4_early_demux()"),
      - 21fdca22eb7d ("ipv4: Ignore ECN bits for fib lookups in fib_compute_spec_dst()"),
      - 1ebf179037cb ("ipv4: Fix tos mask in inet_rtm_getroute()"),
      - some more, uncommon, code paths still need to be fixed.

Also, this creates inconsistencies between subsystems:

 * IPv4 routes accept tos/dsfield options that have ECN bits set, but
   no packet can actually match them, as their ECN bits are cleared
   before lookup.

 * IPv4 rules accept tos/dsfield options that have the high order ECN
   bit set (but rejects those using the low order ECN bit). Like IPv4
   routes, such rules can't match actual packets as the rule lookup is
   done after clearing the packets ECN bits.

 * IPv6 routes accept the tos/dsfield options without any restriction,
   but silently ignores them.

 * IPv6 rules also accept any value of tos/dsfield, but compares all
   8 bits, including ECN. Therefore, a rule matching packets with a
   particular DSCP value stops working as soon as ECN is used (a work
   around is to define 4 rules for each DSCP value to match, one for
   each possible ECN code point).

 * Modules that want to distinguish between the DSCP and ECN bits
   create their own local macros (Netfilter, SCTP).

What this RFC does
==================

This patch series creates a dscp_t type, meant to represent pure DSCP
values, excluding ECN bits. Doing so allows to clearly separate the
interpretation of DSCP and ECN bits and enables Sparse to detect
invalid combinations of dscp_t with the plain u8 variables generally
used to store the full TOS/Traffic Class/DS field.

Note, this patch series differs slightly from that of the original talk
(slide 14 [2]). For the talk, I just cleared the ECN bits, while in
this series, I do a bit-shift. This way dscp_t really represents DSCP
values, as defined in RFCs. Also I've renamed the helper functions to
replace "u8" by "dsfield", as I felt "u8" was ambiguous. Using
"dsfield" makes it clear that dscp_t to u8 conversion isn't just a
plain cast, but that a bit-shift happens and the result has the two ECN
bits.

The new dscp_t type is then used to convert several field members:

  * Patch 1 converts the tclass field of struct fib6_rule. It
    effectively forbids the use of ECN bits in the tos/dsfield option
    of ip -6 rule. Rules now match packets solely based on their DSCP
    bits, so ECN doesn't influence the result anymore. This contrasts
    with previous behaviour where all 8 bits of the Traffic Class field
    was used. It is believed this change is acceptable as matching ECN
    bits wasn't usable for IPv4, so only IPv6-only deployments could be
    depending on it (that is, it's unlikely enough that a someone uses
    ip6 rules matching ECN bits in production).

  * Patch 2 converts the tos field of struct fib4_rule. This one too
    effectively forbids defining ECN bits, this time in ip -4 rule.
    Before that, setting ECN bit 1 was accepted, while ECN bit 0 was
    rejected. But even when accepted, the rule wouldn't match as the
    packets would normally have their ECN bits cleared while doing the
    rule lookup.

  * Patch 3 converts the fc_tos field of struct fib_config. This is
    like patch 2, but for ip4 routes. Routes using a tos/dsfield option
    with any ECN bit set is now rejected. Before this patch, they were
    accepted but, as with ip4 rules, these routes couldn't match any
    real packet, since callers were supposed to clear their ECN bits
    beforehand.

  * Patch 4 converts the fa_tos field of struct fib_alias. This one is
    pure internal u8 to dscp_t conversion. While patches 1-3 dealed
    with user facing consequences, this patch shouldn't have any side
    effect and is just there to give an overview of what such
    conversion patches will look like. These are quite mechanical, but
    imply some code churn.

Note that there's no equivalent of patch 3 for IPv6 (ip route), since
the tos/dsfield option is silently ignored for IPv6 routes.

Future work
===========

This is a minimal series, the objective of this RFC is just to validate
the dscp_t approach. More work will be needed to fully iron out the
problem:

  * Convert more internal structures to dscp_t (in particular the
    flowi4_tos field of struct flowi4).

  * Slowly remove the uses of IPTOS_TOS_MASK, and hence RT_TOS(), in
    the kernel, as it clears only one of the ECN bits and the TOS is
    going to be masked again anyway by IPTOS_RT_MASK, which is
    stricter.

  * Finally, start allowing high DSCP values in IPv4 routes and rules
    (the IPTOS_RT_MASK used in IPv4 clears the 3 high order bits of the
    DS field).

  * Maybe find a way to warn users that use the ignored tos/dsfield
    option with ip -6 route.

Feedbacks welcome!

Thanks,

William

[1] LPC talk: https://linuxplumbersconf.org/event/11/contributions/943/
[2] LPC slides: https://linuxplumbersconf.org/event/11/contributions/943/attachments/901/1780/inet_tos_lpc2021.pdf


Guillaume Nault (4):
  ipv6: Define dscp_t and stop taking ECN bits into account in ip6-rules
  ipv4: Stop taking ECN bits into account in ip4-rules
  ipv4: Reject routes specifying ECN bits in rtm_tos
  ipv4: Use dscp_t in struct fib_alias

 include/net/inet_dscp.h  | 54 +++++++++++++++++++++++++++++++++++++
 include/net/ip_fib.h     |  3 ++-
 include/net/ipv6.h       |  6 +++++
 net/ipv4/fib_frontend.c  | 10 ++++++-
 net/ipv4/fib_lookup.h    |  3 ++-
 net/ipv4/fib_rules.c     | 17 ++++++------
 net/ipv4/fib_semantics.c | 14 +++++-----
 net/ipv4/fib_trie.c      | 58 +++++++++++++++++++++++-----------------
 net/ipv4/route.c         |  3 ++-
 net/ipv6/fib6_rules.c    | 18 ++++++++-----
 10 files changed, 138 insertions(+), 48 deletions(-)
 create mode 100644 include/net/inet_dscp.h

-- 
2.21.3


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-02-08  5:33 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-04 13:58 [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type Guillaume Nault
2022-02-04 13:58 ` [PATCH net-next 1/4] ipv6: Define dscp_t and stop taking ECN bits into account in fib6-rules Guillaume Nault
2022-02-04 13:58 ` [PATCH net-next 2/4] ipv4: Stop taking ECN bits into account in fib4-rules Guillaume Nault
2022-02-04 13:58 ` [PATCH net-next 3/4] ipv4: Reject routes specifying ECN bits in rtm_tos Guillaume Nault
2022-02-04 18:19   ` Shuah Khan
2022-02-04 13:58 ` [PATCH net-next 4/4] ipv4: Use dscp_t in struct fib_alias Guillaume Nault
2022-02-07  6:08 ` [PATCH net-next 0/4] inet: Separate DSCP from ECN bits using new dscp_t type David Ahern
2022-02-07 19:03 ` Toke Høiland-Jørgensen
2022-02-08  5:10 ` patchwork-bot+netdevbpf
  -- strict thread matches above, loose matches on Subject: below --
2021-12-06 18:22 Guillaume Nault
2021-12-06 19:57 ` Guillaume Nault
2021-12-14  0:16 ` Toke Høiland-Jørgensen
2021-12-15 16:48   ` Guillaume Nault
2021-12-15 20:40     ` Dave Taht
2021-12-17 17:55     ` Toke Høiland-Jørgensen
2021-12-17 22:52       ` Guillaume Nault

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.