[PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
@ 2019-05-31 17:26 Davide Caratti
  2019-05-31 17:26 ` [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming Davide Caratti
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Davide Caratti @ 2019-05-31 17:26 UTC (permalink / raw)
  To: Eric Dumazet, Cong Wang, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, netdev
  Cc: shuali, Eli Britstein, Stephen Hemminger

'act_csum' was recently fixed to mangle the IPv4/IPv6 header if a packet
having one or more VLAN headers was processed: patch #1 ensures that all
VLAN headers are in the linear area of the skb.
Other actions might read or mangle the IPv4/IPv6 header: patch #2 and #3
fix 'act_pedit' and 'act_skbedit' respectively.

Changes since v2:
 - don't inline tc_skb_pull_vlans(), thanks to Stephen Hemminger
 - remove unwanted whitespace in patch #3

Changes since v1:
 - add patch #1, thanks to Eric Dumazet
 - add patch #3

Davide Caratti (3):
  net/sched: act_csum: pull all VLAN headers before checksumming
  net/sched: act_pedit: fix 'ex munge' on network header in case of QinQ
    packet
  net/sched: act_skbedit: fix 'inheritdsfield' in case of QinQ packet

 include/net/pkt_cls.h   |  2 ++
 net/sched/act_csum.c    | 14 ++------------
 net/sched/act_pedit.c   | 26 ++++++++++++++++++++++----
 net/sched/act_skbedit.c | 24 ++++++++++++++++++++----
 net/sched/cls_api.c     | 22 ++++++++++++++++++++++
 5 files changed, 68 insertions(+), 20 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming
  2019-05-31 17:26 [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets Davide Caratti
@ 2019-05-31 17:26 ` Davide Caratti
  2019-05-31 18:38   ` Cong Wang
  2019-05-31 17:26 ` [PATCH net v3 2/3] net/sched: act_pedit: fix 'ex munge' on network header in case of QinQ packet Davide Caratti
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Davide Caratti @ 2019-05-31 17:26 UTC (permalink / raw)
  To: Eric Dumazet, Cong Wang, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, netdev
  Cc: shuali, Eli Britstein, Stephen Hemminger

TC 'csum' action needs to read the packets' ethertype. In case the value
is encoded in unstripped/nested VLAN headers, call pskb_may_pull() to be
sure that all tags are in the linear part of skb. While at it, move this
code in a helper function: other TC actions (like 'pedit' and 'skbedit')
might need to read the innermost ethertype.

Fixes: 2ecba2d1e45b ("net: sched: act_csum: Fix csum calc for tagged packets")
CC: Eli Britstein <elibr@mellanox.com>
CC: Li Shuang <shuali@redhat.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 include/net/pkt_cls.h |  2 ++
 net/sched/act_csum.c  | 14 ++------------
 net/sched/cls_api.c   | 22 ++++++++++++++++++++++
 3 files changed, 26 insertions(+), 12 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 514e3c80ecc1..344f51eee1a8 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -953,4 +953,6 @@ struct tc_root_qopt_offload {
 	bool ingress;
 };
 
+int tc_skb_pull_vlans(struct sk_buff *skb, unsigned int *hdr_count,
+		      __be16 *proto);
 #endif
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index 14bb525e355e..e8308ddcae9d 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -574,7 +574,6 @@ static int tcf_csum_act(struct sk_buff *skb, const struct tc_action *a,
 			struct tcf_result *res)
 {
 	struct tcf_csum *p = to_tcf_csum(a);
-	bool orig_vlan_tag_present = false;
 	unsigned int vlan_hdr_count = 0;
 	struct tcf_csum_params *params;
 	u32 update_flags;
@@ -604,17 +603,8 @@ static int tcf_csum_act(struct sk_buff *skb, const struct tc_action *a,
 		break;
 	case cpu_to_be16(ETH_P_8021AD): /* fall through */
 	case cpu_to_be16(ETH_P_8021Q):
-		if (skb_vlan_tag_present(skb) && !orig_vlan_tag_present) {
-			protocol = skb->protocol;
-			orig_vlan_tag_present = true;
-		} else {
-			struct vlan_hdr *vlan = (struct vlan_hdr *)skb->data;
-
-			protocol = vlan->h_vlan_encapsulated_proto;
-			skb_pull(skb, VLAN_HLEN);
-			skb_reset_network_header(skb);
-			vlan_hdr_count++;
-		}
+		if (tc_skb_pull_vlans(skb, &vlan_hdr_count, &protocol))
+			goto drop;
 		goto again;
 	}
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index d4699156974a..382ee69fb1a5 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -3300,6 +3300,28 @@ unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
 }
 EXPORT_SYMBOL(tcf_exts_num_actions);
 
+int tc_skb_pull_vlans(struct sk_buff *skb, unsigned int *hdr_count,
+		      __be16 *proto)
+{
+	if (skb_vlan_tag_present(skb))
+		*proto = skb->protocol;
+
+	while (eth_type_vlan(*proto)) {
+		struct vlan_hdr *vlan;
+
+		if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
+			return -ENOMEM;
+
+		vlan = (struct vlan_hdr *)skb->data;
+		*proto = vlan->h_vlan_encapsulated_proto;
+		skb_pull(skb, VLAN_HLEN);
+		skb_reset_network_header(skb);
+		(*hdr_count)++;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(tc_skb_pull_vlans);
+
 static __net_init int tcf_net_init(struct net *net)
 {
 	struct tcf_net *tn = net_generic(net, tcf_net_id);
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net v3 2/3] net/sched: act_pedit: fix 'ex munge' on network header in case of QinQ packet
  2019-05-31 17:26 [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets Davide Caratti
  2019-05-31 17:26 ` [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming Davide Caratti
@ 2019-05-31 17:26 ` Davide Caratti
  2019-05-31 17:26 ` [PATCH net v3 3/3] net/sched: act_skbedit: fix 'inheritdsfield' " Davide Caratti
  2019-05-31 18:42 ` [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets Cong Wang
  3 siblings, 0 replies; 21+ messages in thread
From: Davide Caratti @ 2019-05-31 17:26 UTC (permalink / raw)
  To: Eric Dumazet, Cong Wang, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, netdev
  Cc: shuali, Eli Britstein, Stephen Hemminger

like it has been done in commit 2ecba2d1e45b ("net: sched: act_csum: Fix
csum calc for tagged packets"), also 'pedit' needs to adjust the network
offset when multiple tags are present in the packets: otherwise wrong IP
headers (but good checksums) can be observed with the following command:

 # tc filter add dev test0 parent ffff: protocol 802.1Q flower \
   vlan_ethtype ipv4 action \
   pedit ex munge ip ttl set 10 pipe \
   csum ip and icmp pipe \
   mirred egress redirect dev test1

Fixes: d8b9605d2697 ("net: sched: fix skb->protocol use in case of accelerated vlan path")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_pedit.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_pedit.c b/net/sched/act_pedit.c
index d790c02b9c6c..26e43d300160 100644
--- a/net/sched/act_pedit.c
+++ b/net/sched/act_pedit.c
@@ -277,9 +277,11 @@ static bool offset_valid(struct sk_buff *skb, int offset)
 }
 
 static int pedit_skb_hdr_offset(struct sk_buff *skb,
-				enum pedit_header_type htype, int *hoffset)
+				enum pedit_header_type htype, int *hoffset,
+				unsigned int *vlan_hdr_count)
 {
 	int ret = -EINVAL;
+	__be16 protocol;
 
 	switch (htype) {
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_ETH:
@@ -291,8 +293,18 @@ static int pedit_skb_hdr_offset(struct sk_buff *skb,
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_NETWORK:
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_IP4:
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_IP6:
-		*hoffset = skb_network_offset(skb);
-		ret = 0;
+		protocol = tc_skb_protocol(skb);
+again:
+		switch (protocol) {
+		case cpu_to_be16(ETH_P_8021AD): /* fall through */
+		case cpu_to_be16(ETH_P_8021Q):
+			if (!tc_skb_pull_vlans(skb, vlan_hdr_count, &protocol))
+				goto again;
+			return ret;
+		default:
+			*hoffset = skb_network_offset(skb);
+			ret = 0;
+		}
 		break;
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_TCP:
 	case TCA_PEDIT_KEY_EX_HDR_TYPE_UDP:
@@ -313,6 +325,7 @@ static int tcf_pedit_act(struct sk_buff *skb, const struct tc_action *a,
 			 struct tcf_result *res)
 {
 	struct tcf_pedit *p = to_pedit(a);
+	unsigned int vlan_hdr_count = 0;
 	int i;
 
 	if (skb_unclone(skb, GFP_ATOMIC))
@@ -343,7 +356,8 @@ static int tcf_pedit_act(struct sk_buff *skb, const struct tc_action *a,
 				tkey_ex++;
 			}
 
-			rc = pedit_skb_hdr_offset(skb, htype, &hoffset);
+			rc = pedit_skb_hdr_offset(skb, htype, &hoffset,
+						  &vlan_hdr_count);
 			if (rc) {
 				pr_info("tc action pedit bad header type specified (0x%x)\n",
 					htype);
@@ -407,6 +421,10 @@ static int tcf_pedit_act(struct sk_buff *skb, const struct tc_action *a,
 bad:
 	p->tcf_qstats.overlimits++;
 done:
+	while (vlan_hdr_count--) {
+		skb_push(skb, VLAN_HLEN);
+		skb_reset_network_header(skb);
+	}
 	bstats_update(&p->tcf_bstats, skb);
 	spin_unlock(&p->tcf_lock);
 	return p->tcf_action;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH net v3 3/3] net/sched: act_skbedit: fix 'inheritdsfield' in case of QinQ packet
  2019-05-31 17:26 [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets Davide Caratti
  2019-05-31 17:26 ` [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming Davide Caratti
  2019-05-31 17:26 ` [PATCH net v3 2/3] net/sched: act_pedit: fix 'ex munge' on network header in case of QinQ packet Davide Caratti
@ 2019-05-31 17:26 ` Davide Caratti
  2019-05-31 18:42 ` [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets Cong Wang
  3 siblings, 0 replies; 21+ messages in thread
From: Davide Caratti @ 2019-05-31 17:26 UTC (permalink / raw)
  To: Eric Dumazet, Cong Wang, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, netdev
  Cc: shuali, Eli Britstein, Stephen Hemminger

like it was previously done on TC 'csum' (see commit 2ecba2d1e45b ("net:
sched: act_csum: Fix csum calc for tagged packets")), TC 'skbedit' might
need to adjust the network offset, if the packet has unstripped/multiple
tags: otherwise 'inheritdsfield' will just be skipped for QinQ packets.

Fixes: e7e3728bd776 ("net:sched: add action inheritdsfield to skbedit")
CC: Li Shuang <shuali@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
---
 net/sched/act_skbedit.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/net/sched/act_skbedit.c b/net/sched/act_skbedit.c
index 7ec159b95364..34a0c661c8c5 100644
--- a/net/sched/act_skbedit.c
+++ b/net/sched/act_skbedit.c
@@ -39,6 +39,7 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
 {
 	struct tcf_skbedit *d = to_skbedit(a);
 	struct tcf_skbedit_params *params;
+	unsigned int vlan_hdr_count = 0;
 	int action;
 
 	tcf_lastuse_update(&d->tcf_tm);
@@ -50,9 +51,17 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
 	if (params->flags & SKBEDIT_F_PRIORITY)
 		skb->priority = params->priority;
 	if (params->flags & SKBEDIT_F_INHERITDSFIELD) {
-		int wlen = skb_network_offset(skb);
-
-		switch (tc_skb_protocol(skb)) {
+		__be16 proto = tc_skb_protocol(skb);
+		int wlen;
+
+again:
+		wlen = skb_network_offset(skb);
+		switch (proto) {
+		case htons(ETH_P_8021AD): /* Fall Through */
+		case htons(ETH_P_8021Q):
+			if (tc_skb_pull_vlans(skb, &vlan_hdr_count, &proto))
+				goto err;
+			goto again;
 		case htons(ETH_P_IP):
 			wlen += sizeof(struct iphdr);
 			if (!pskb_may_pull(skb, wlen))
@@ -77,11 +86,18 @@ static int tcf_skbedit_act(struct sk_buff *skb, const struct tc_action *a,
 	}
 	if (params->flags & SKBEDIT_F_PTYPE)
 		skb->pkt_type = params->ptype;
+
+out:
+	while (vlan_hdr_count--) {
+		skb_push(skb, VLAN_HLEN);
+		skb_reset_network_header(skb);
+	}
 	return action;
 
 err:
 	qstats_drop_inc(this_cpu_ptr(d->common.cpu_qstats));
-	return TC_ACT_SHOT;
+	action = TC_ACT_SHOT;
+	goto out;
 }
 
 static const struct nla_policy skbedit_policy[TCA_SKBEDIT_MAX + 1] = {
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming
  2019-05-31 17:26 ` [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming Davide Caratti
@ 2019-05-31 18:38   ` Cong Wang
  2019-05-31 22:01     ` Davide Caratti
  0 siblings, 1 reply; 21+ messages in thread
From: Cong Wang @ 2019-05-31 18:38 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Eli Britstein,
	Stephen Hemminger

On Fri, May 31, 2019 at 10:26 AM Davide Caratti <dcaratti@redhat.com> wrote:
> diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
> index 14bb525e355e..e8308ddcae9d 100644
> --- a/net/sched/act_csum.c
> +++ b/net/sched/act_csum.c
> @@ -574,7 +574,6 @@ static int tcf_csum_act(struct sk_buff *skb, const struct tc_action *a,
>                         struct tcf_result *res)
>  {
>         struct tcf_csum *p = to_tcf_csum(a);
> -       bool orig_vlan_tag_present = false;
>         unsigned int vlan_hdr_count = 0;
>         struct tcf_csum_params *params;
>         u32 update_flags;
> @@ -604,17 +603,8 @@ static int tcf_csum_act(struct sk_buff *skb, const struct tc_action *a,
>                 break;
>         case cpu_to_be16(ETH_P_8021AD): /* fall through */
>         case cpu_to_be16(ETH_P_8021Q):
> -               if (skb_vlan_tag_present(skb) && !orig_vlan_tag_present) {
> -                       protocol = skb->protocol;
> -                       orig_vlan_tag_present = true;
> -               } else {
> -                       struct vlan_hdr *vlan = (struct vlan_hdr *)skb->data;
> -
> -                       protocol = vlan->h_vlan_encapsulated_proto;
> -                       skb_pull(skb, VLAN_HLEN);
> -                       skb_reset_network_header(skb);
> -                       vlan_hdr_count++;
> -               }
> +               if (tc_skb_pull_vlans(skb, &vlan_hdr_count, &protocol))
> +                       goto drop;
>                 goto again;

Why do you still need to loop here? tc_skb_pull_vlans() already
contains a loop to pop all vlan tags?



>         }
>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index d4699156974a..382ee69fb1a5 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -3300,6 +3300,28 @@ unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
>  }
>  EXPORT_SYMBOL(tcf_exts_num_actions);
>
> +int tc_skb_pull_vlans(struct sk_buff *skb, unsigned int *hdr_count,
> +                     __be16 *proto)


It looks like this function fits better in net/core/skbuff.c, because
I don't see anything TC specific.


> +{
> +       if (skb_vlan_tag_present(skb))
> +               *proto = skb->protocol;
> +
> +       while (eth_type_vlan(*proto)) {
> +               struct vlan_hdr *vlan;
> +
> +               if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
> +                       return -ENOMEM;
> +
> +               vlan = (struct vlan_hdr *)skb->data;
> +               *proto = vlan->h_vlan_encapsulated_proto;
> +               skb_pull(skb, VLAN_HLEN);
> +               skb_reset_network_header(skb);

Any reason not to call __skb_vlan_pop() directly?


> +               (*hdr_count)++;
> +       }
> +       return 0;
> +}


Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-05-31 17:26 [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets Davide Caratti
                   ` (2 preceding siblings ...)
  2019-05-31 17:26 ` [PATCH net v3 3/3] net/sched: act_skbedit: fix 'inheritdsfield' " Davide Caratti
@ 2019-05-31 18:42 ` Cong Wang
  2019-05-31 22:01   ` Davide Caratti
  3 siblings, 1 reply; 21+ messages in thread
From: Cong Wang @ 2019-05-31 18:42 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Eli Britstein,
	Stephen Hemminger

On Fri, May 31, 2019 at 10:26 AM Davide Caratti <dcaratti@redhat.com> wrote:
>
> 'act_csum' was recently fixed to mangle the IPv4/IPv6 header if a packet
> having one or more VLAN headers was processed: patch #1 ensures that all
> VLAN headers are in the linear area of the skb.
> Other actions might read or mangle the IPv4/IPv6 header: patch #2 and #3
> fix 'act_pedit' and 'act_skbedit' respectively.

Maybe, just maybe, vlan tags are supposed to be handled by act_vlan?
Which means maybe users have to pipe act_vlan to these actions.

From the code reuse perspective, you are adding TCA_VLAN_ACT_POP
to each of them.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-05-31 18:42 ` [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets Cong Wang
@ 2019-05-31 22:01   ` Davide Caratti
  2019-05-31 22:29     ` Cong Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Davide Caratti @ 2019-05-31 22:01 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Eli Britstein,
	Stephen Hemminger

On Fri, 2019-05-31 at 11:42 -0700, Cong Wang wrote:
> On Fri, May 31, 2019 at 10:26 AM Davide Caratti <dcaratti@redhat.com> wrote:
> > 'act_csum' was recently fixed to mangle the IPv4/IPv6 header if a packet
> > having one or more VLAN headers was processed: patch #1 ensures that all
> > VLAN headers are in the linear area of the skb.
> > Other actions might read or mangle the IPv4/IPv6 header: patch #2 and #3
> > fix 'act_pedit' and 'act_skbedit' respectively.
> 
> Maybe, just maybe, vlan tags are supposed to be handled by act_vlan?
> Which means maybe users have to pipe act_vlan to these actions.

but it's not possible with the current act_vlan code.
Each 'vlan' action pushes or pops a single tag, so:

1) we don't know how many vlan tags there are in each packet, so I should
put an (enough) high number of "pop" operations to ensure that a 'pedit'
rule correctly mangles the TTL in a IPv4 packet having 1 or more 802.1Q
tags in the L2 header.

2) after a vlan is popped with act_vlan, the kernel forgets about the VLAN
ID and the VLAN type. So, if I want to just mangle the TTL in a QinQ
packet, I need to reinject it in a place where both tags (including VLAN
type *and* VLAN id) are restored in the packet.

Clearly, act_vlan can't be used as is, because 'push' has hardcoded VLAN
ID and ethertype. Unless we change act_vlan code to enable rollback of
previous 'pop' operations, it's quite hard to pipe the correct sequence of
vlan 'pop' and 'push'.

> From the code reuse perspective, you are adding TCA_VLAN_ACT_POP
> to each of them.

No, these patches don't pop VLAN tags. All tags are restored after the
action completed his work, before returning a->tcfa_action.

May I ask you to read it as a followup of commit 2ecba2d1e45b ("net:
sched: act_csum: Fix csum calc for tagged packets"), where the 'csum'
action was modified to mangle the checksum of IPv4 headers even when
multiple 802.1Q tags were present?
With this series it becomes possible to mangle also the TTL field (with
pedit), and assign the diffserv bits to skb->priority (with skbedit).

> Thanks.

Thanks for reviewing, I look forward to see more comments from you.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming
  2019-05-31 18:38   ` Cong Wang
@ 2019-05-31 22:01     ` Davide Caratti
  2019-05-31 22:50       ` Cong Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Davide Caratti @ 2019-05-31 22:01 UTC (permalink / raw)
  To: Cong Wang
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Eli Britstein,
	Stephen Hemminger

On Fri, 2019-05-31 at 11:38 -0700, Cong Wang wrote:
> On Fri, May 31, 2019 at 10:26 AM Davide Caratti <dcaratti@redhat.com> wrote:
> > diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
> > index 14bb525e355e..e8308ddcae9d 100644
> > --- a/net/sched/act_csum.c
> > +++ b/net/sched/act_csum.c
> > @@ -574,7 +574,6 @@ static int tcf_csum_act(struct sk_buff *skb, const struct tc_action *a,
> >                         struct tcf_result *res)
> >  {
> >         struct tcf_csum *p = to_tcf_csum(a);
> > -       bool orig_vlan_tag_present = false;
> >         unsigned int vlan_hdr_count = 0;
> >         struct tcf_csum_params *params;
> >         u32 update_flags;
> > @@ -604,17 +603,8 @@ static int tcf_csum_act(struct sk_buff *skb, const struct tc_action *a,
> >                 break;
> >         case cpu_to_be16(ETH_P_8021AD): /* fall through */
> >         case cpu_to_be16(ETH_P_8021Q):
> > -               if (skb_vlan_tag_present(skb) && !orig_vlan_tag_present) {
> > -                       protocol = skb->protocol;
> > -                       orig_vlan_tag_present = true;
> > -               } else {
> > -                       struct vlan_hdr *vlan = (struct vlan_hdr *)skb->data;
> > -
> > -                       protocol = vlan->h_vlan_encapsulated_proto;
> > -                       skb_pull(skb, VLAN_HLEN);
> > -                       skb_reset_network_header(skb);
> > -                       vlan_hdr_count++;
> > -               }
> > +               if (tc_skb_pull_vlans(skb, &vlan_hdr_count, &protocol))
> > +                       goto drop;
> >                 goto again;

Please note: this loop was here also before this patch (the 'goto again;'
line is only patch context). It has been introduced with commit
2ecba2d1e45b ("net: sched: act_csum: Fix csum calc for tagged packets").

> Why do you still need to loop here? tc_skb_pull_vlans() already
> contains a loop to pop all vlan tags?

The reason why the loop is here is:
1) in case there is a stripped vlan tag, it replaces tc_skb_protocol(skb)
with the inner ethertype (i.e. skb->protocol)

2) in case there is one or more unstripped VLAN tags, it pulls them. At
the last iteration, when it does:

goto again;

'protocol' contains the innermost ethertype, read from
'h_vlan_encapsulated_proto'. 

If that value is 0x0800 (IPv4) or 0x86dd (IPv6), 'act_csum' will go on
computing the internet checksum for the IPv4 header or for (some of the)
IPv6 'nexthdrs'.

(yes, we could move the call to tc_skb_pull_vlans() before the switch
statement to make this code more understandable, also in the other 2
actions. Please let me know if you think it's better).  

> 
> >         }
> > 
> > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> > index d4699156974a..382ee69fb1a5 100644
> > --- a/net/sched/cls_api.c
> > +++ b/net/sched/cls_api.c
> > @@ -3300,6 +3300,28 @@ unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
> >  }
> >  EXPORT_SYMBOL(tcf_exts_num_actions);
> > 
> > +int tc_skb_pull_vlans(struct sk_buff *skb, unsigned int *hdr_count,
> > +                     __be16 *proto)
> 
> It looks like this function fits better in net/core/skbuff.c, because
> I don't see anything TC specific.

Ok, I don't know if other parts of the kernel really need it. Its use
should be combined with tc_skb_protocol(), which is in pkt_sched.h.

But i can move it to skbuff, or elsewhwere, unless somebody disagrees.

> 
> > +{
> > +       if (skb_vlan_tag_present(skb))
> > +               *proto = skb->protocol;
> > +
> > +       while (eth_type_vlan(*proto)) {
> > +               struct vlan_hdr *vlan;
> > +
> > +               if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
> > +                       return -ENOMEM;
> > +
> > +               vlan = (struct vlan_hdr *)skb->data;
> > +               *proto = vlan->h_vlan_encapsulated_proto;
> > +               skb_pull(skb, VLAN_HLEN);
> > +               skb_reset_network_header(skb);

Again, this code was in act_csum.c also before. The only intention of this
patch is to ensure that pskb_may_pull() is called before skb_pull(), as
per Eric suggestion, and move this code out of act_csum to use it with
other TC actions.

> Any reason not to call __skb_vlan_pop() directly?

I think we can't use __skb_vlan_pop(), because 'act_csum' needs to read
the innermost ethertype in the packet to understand if it's IPv4, IPv6 or
else (ARP, EAPOL, ...).

If I well read __skb_vlan_pop(), it returns the VLAN ID, which is useless
here. 

> 
> > +               (*hdr_count)++;
> > +       }
> > +       return 0;
> > +}
> 
> Thanks.

Thanks for reviewing, looking forward to read more comments from you!

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-05-31 22:01   ` Davide Caratti
@ 2019-05-31 22:29     ` Cong Wang
  2019-06-02  4:22       ` Eli Britstein
  0 siblings, 1 reply; 21+ messages in thread
From: Cong Wang @ 2019-05-31 22:29 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Eli Britstein,
	Stephen Hemminger

On Fri, May 31, 2019 at 3:01 PM Davide Caratti <dcaratti@redhat.com> wrote:
>
> On Fri, 2019-05-31 at 11:42 -0700, Cong Wang wrote:
> > On Fri, May 31, 2019 at 10:26 AM Davide Caratti <dcaratti@redhat.com> wrote:
> > > 'act_csum' was recently fixed to mangle the IPv4/IPv6 header if a packet
> > > having one or more VLAN headers was processed: patch #1 ensures that all
> > > VLAN headers are in the linear area of the skb.
> > > Other actions might read or mangle the IPv4/IPv6 header: patch #2 and #3
> > > fix 'act_pedit' and 'act_skbedit' respectively.
> >
> > Maybe, just maybe, vlan tags are supposed to be handled by act_vlan?
> > Which means maybe users have to pipe act_vlan to these actions.
>
> but it's not possible with the current act_vlan code.
> Each 'vlan' action pushes or pops a single tag, so:
>
> 1) we don't know how many vlan tags there are in each packet, so I should
> put an (enough) high number of "pop" operations to ensure that a 'pedit'
> rule correctly mangles the TTL in a IPv4 packet having 1 or more 802.1Q
> tags in the L2 header.

Not true, we do know whether the last vlan tag is pop'ed by checking
the protocol. There was already a use case in netdev before:

tc filter add dev veth1 egress prio 100  protocol 802.1Q  matchall
action vlan pop continue #reclassify
tc filter add dev veth1 egress prio 200  protocol ip      u32 match ip
src 192.168.1.0/24  action drop
tc filter add dev veth1 egress prio 201  protocol ip      u32 match ip
dst 192.168.100.0/24  action drop

which is from a bug report.

>
> 2) after a vlan is popped with act_vlan, the kernel forgets about the VLAN
> ID and the VLAN type. So, if I want to just mangle the TTL in a QinQ
> packet, I need to reinject it in a place where both tags (including VLAN
> type *and* VLAN id) are restored in the packet.

It is forgotten by act_vlan only, the vlan info is still inside the
packet header.
Perhaps we just need some action to push it back.

>
> Clearly, act_vlan can't be used as is, because 'push' has hardcoded VLAN
> ID and ethertype. Unless we change act_vlan code to enable rollback of
> previous 'pop' operations, it's quite hard to pipe the correct sequence of
> vlan 'pop' and 'push'.

What about other encapsulations like VXLAN? What if I just want to
mangle the inner TTL of a VXLAN packet? You know the answer is setting
up TC filters and actions on VXLAN device instead of ethernet device.

IOW, why QinQ is so special that we have to take care of inside TC action
not the encapsulation endpoint?


>
> > From the code reuse perspective, you are adding TCA_VLAN_ACT_POP
> > to each of them.
>
> No, these patches don't pop VLAN tags. All tags are restored after the
> action completed his work, before returning a->tcfa_action.
>
> May I ask you to read it as a followup of commit 2ecba2d1e45b ("net:
> sched: act_csum: Fix csum calc for tagged packets"), where the 'csum'
> action was modified to mangle the checksum of IPv4 headers even when
> multiple 802.1Q tags were present?

Yes, I already read it and I think that commit should be reverted for the
same reason as I already stated above.


> With this series it becomes possible to mangle also the TTL field (with
> pedit), and assign the diffserv bits to skb->priority (with skbedit).

Sorry, I am not yet convinced why we should do it in TC.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming
  2019-05-31 22:01     ` Davide Caratti
@ 2019-05-31 22:50       ` Cong Wang
  2019-06-02  4:22         ` Eli Britstein
  0 siblings, 1 reply; 21+ messages in thread
From: Cong Wang @ 2019-05-31 22:50 UTC (permalink / raw)
  To: Davide Caratti
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Eli Britstein,
	Stephen Hemminger

On Fri, May 31, 2019 at 3:01 PM Davide Caratti <dcaratti@redhat.com> wrote:
>
> Please note: this loop was here also before this patch (the 'goto again;'
> line is only patch context). It has been introduced with commit
> 2ecba2d1e45b ("net: sched: act_csum: Fix csum calc for tagged packets").
>

This is exactly why I ask...


> > Why do you still need to loop here? tc_skb_pull_vlans() already
> > contains a loop to pop all vlan tags?
>
> The reason why the loop is here is:
> 1) in case there is a stripped vlan tag, it replaces tc_skb_protocol(skb)
> with the inner ethertype (i.e. skb->protocol)
>
> 2) in case there is one or more unstripped VLAN tags, it pulls them. At
> the last iteration, when it does:

Let me ask it in another way:

The original code, without your patch, has a loop (the "goto again") to
pop all vlan tags.

The code with your patch adds yet another loop (the while loop inside your
tc_skb_pull_vlans()) to pop all vlan tags.

So, after your patch, we have both loops. So, I am confused why we need
these two nested loops to just pop all vlan tags? I think one is sufficient.


>
> >
> > >         }
> > >
> > > diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> > > index d4699156974a..382ee69fb1a5 100644
> > > --- a/net/sched/cls_api.c
> > > +++ b/net/sched/cls_api.c
> > > @@ -3300,6 +3300,28 @@ unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
> > >  }
> > >  EXPORT_SYMBOL(tcf_exts_num_actions);
> > >
> > > +int tc_skb_pull_vlans(struct sk_buff *skb, unsigned int *hdr_count,
> > > +                     __be16 *proto)
> >
> > It looks like this function fits better in net/core/skbuff.c, because
> > I don't see anything TC specific.
>
> Ok, I don't know if other parts of the kernel really need it. Its use
> should be combined with tc_skb_protocol(), which is in pkt_sched.h.
>
> But i can move it to skbuff, or elsewhwere, unless somebody disagrees.
>
> >
> > > +{
> > > +       if (skb_vlan_tag_present(skb))
> > > +               *proto = skb->protocol;
> > > +
> > > +       while (eth_type_vlan(*proto)) {
> > > +               struct vlan_hdr *vlan;
> > > +
> > > +               if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
> > > +                       return -ENOMEM;
> > > +
> > > +               vlan = (struct vlan_hdr *)skb->data;
> > > +               *proto = vlan->h_vlan_encapsulated_proto;
> > > +               skb_pull(skb, VLAN_HLEN);
> > > +               skb_reset_network_header(skb);
>
> Again, this code was in act_csum.c also before. The only intention of this
> patch is to ensure that pskb_may_pull() is called before skb_pull(), as
> per Eric suggestion, and move this code out of act_csum to use it with
> other TC actions.

Sure, no one says the code before yours is more correct, right? :)

>
> > Any reason not to call __skb_vlan_pop() directly?
>
> I think we can't use __skb_vlan_pop(), because 'act_csum' needs to read
> the innermost ethertype in the packet to understand if it's IPv4, IPv6 or
> else (ARP, EAPOL, ...).
>
> If I well read __skb_vlan_pop(), it returns the VLAN ID, which is useless
> here.
>

I am confused, this could be checked by eth_type_vlan(skb->protocol),
right? So why it stops you from considering __skb_vlan_pop() or
skb_vlan_pop()? They both should return error or zero, not vlan ID.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming
  2019-05-31 22:50       ` Cong Wang
@ 2019-06-02  4:22         ` Eli Britstein
  2019-06-04 17:41           ` Cong Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Eli Britstein @ 2019-06-02  4:22 UTC (permalink / raw)
  To: Cong Wang, Davide Caratti
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Stephen Hemminger


On 6/1/2019 1:50 AM, Cong Wang wrote:
> On Fri, May 31, 2019 at 3:01 PM Davide Caratti <dcaratti@redhat.com> wrote:
>> Please note: this loop was here also before this patch (the 'goto again;'
>> line is only patch context). It has been introduced with commit
>> 2ecba2d1e45b ("net: sched: act_csum: Fix csum calc for tagged packets").
>>
> This is exactly why I ask...
>
>
>>> Why do you still need to loop here? tc_skb_pull_vlans() already
>>> contains a loop to pop all vlan tags?
>> The reason why the loop is here is:
>> 1) in case there is a stripped vlan tag, it replaces tc_skb_protocol(skb)
>> with the inner ethertype (i.e. skb->protocol)
>>
>> 2) in case there is one or more unstripped VLAN tags, it pulls them. At
>> the last iteration, when it does:
> Let me ask it in another way:
>
> The original code, without your patch, has a loop (the "goto again") to
> pop all vlan tags.
>
> The code with your patch adds yet another loop (the while loop inside your
> tc_skb_pull_vlans()) to pop all vlan tags.
>
> So, after your patch, we have both loops. So, I am confused why we need
> these two nested loops to just pop all vlan tags? I think one is sufficient.
After Davide's patch, the "goto again" is needed to re-enter the switch 
case, and guaranteed to be done only once, as all the VLAN tags were 
already pulled. The alternative is having a dedicated if before the switch.
>
>
>>>>          }
>>>>
>>>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>>>> index d4699156974a..382ee69fb1a5 100644
>>>> --- a/net/sched/cls_api.c
>>>> +++ b/net/sched/cls_api.c
>>>> @@ -3300,6 +3300,28 @@ unsigned int tcf_exts_num_actions(struct tcf_exts *exts)
>>>>   }
>>>>   EXPORT_SYMBOL(tcf_exts_num_actions);
>>>>
>>>> +int tc_skb_pull_vlans(struct sk_buff *skb, unsigned int *hdr_count,
>>>> +                     __be16 *proto)
>>> It looks like this function fits better in net/core/skbuff.c, because
>>> I don't see anything TC specific.
>> Ok, I don't know if other parts of the kernel really need it. Its use
>> should be combined with tc_skb_protocol(), which is in pkt_sched.h.
>>
>> But i can move it to skbuff, or elsewhwere, unless somebody disagrees.
>>
>>>> +{
>>>> +       if (skb_vlan_tag_present(skb))
>>>> +               *proto = skb->protocol;
>>>> +
>>>> +       while (eth_type_vlan(*proto)) {
>>>> +               struct vlan_hdr *vlan;
>>>> +
>>>> +               if (unlikely(!pskb_may_pull(skb, VLAN_HLEN)))
>>>> +                       return -ENOMEM;
>>>> +
>>>> +               vlan = (struct vlan_hdr *)skb->data;
>>>> +               *proto = vlan->h_vlan_encapsulated_proto;
>>>> +               skb_pull(skb, VLAN_HLEN);
>>>> +               skb_reset_network_header(skb);
>> Again, this code was in act_csum.c also before. The only intention of this
>> patch is to ensure that pskb_may_pull() is called before skb_pull(), as
>> per Eric suggestion, and move this code out of act_csum to use it with
>> other TC actions.
> Sure, no one says the code before yours is more correct, right? :)
>
>>> Any reason not to call __skb_vlan_pop() directly?
>> I think we can't use __skb_vlan_pop(), because 'act_csum' needs to read
>> the innermost ethertype in the packet to understand if it's IPv4, IPv6 or
>> else (ARP, EAPOL, ...).
>>
>> If I well read __skb_vlan_pop(), it returns the VLAN ID, which is useless
>> here.
>>
> I am confused, this could be checked by eth_type_vlan(skb->protocol),
> right? So why it stops you from considering __skb_vlan_pop() or
> skb_vlan_pop()? They both should return error or zero, not vlan ID.
>
> Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-05-31 22:29     ` Cong Wang
@ 2019-06-02  4:22       ` Eli Britstein
  2019-06-04 17:55         ` Cong Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Eli Britstein @ 2019-06-02  4:22 UTC (permalink / raw)
  To: Cong Wang, Davide Caratti
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Stephen Hemminger


On 6/1/2019 1:29 AM, Cong Wang wrote:
> On Fri, May 31, 2019 at 3:01 PM Davide Caratti <dcaratti@redhat.com> wrote:
>> On Fri, 2019-05-31 at 11:42 -0700, Cong Wang wrote:
>>> On Fri, May 31, 2019 at 10:26 AM Davide Caratti <dcaratti@redhat.com> wrote:
>>>> 'act_csum' was recently fixed to mangle the IPv4/IPv6 header if a packet
>>>> having one or more VLAN headers was processed: patch #1 ensures that all
>>>> VLAN headers are in the linear area of the skb.
>>>> Other actions might read or mangle the IPv4/IPv6 header: patch #2 and #3
>>>> fix 'act_pedit' and 'act_skbedit' respectively.
>>> Maybe, just maybe, vlan tags are supposed to be handled by act_vlan?
>>> Which means maybe users have to pipe act_vlan to these actions.
>> but it's not possible with the current act_vlan code.
>> Each 'vlan' action pushes or pops a single tag, so:
>>
>> 1) we don't know how many vlan tags there are in each packet, so I should
>> put an (enough) high number of "pop" operations to ensure that a 'pedit'
>> rule correctly mangles the TTL in a IPv4 packet having 1 or more 802.1Q
>> tags in the L2 header.
> Not true, we do know whether the last vlan tag is pop'ed by checking
> the protocol. There was already a use case in netdev before:
>
> tc filter add dev veth1 egress prio 100  protocol 802.1Q  matchall
> action vlan pop continue #reclassify
> tc filter add dev veth1 egress prio 200  protocol ip      u32 match ip
> src 192.168.1.0/24  action drop
> tc filter add dev veth1 egress prio 201  protocol ip      u32 match ip
> dst 192.168.100.0/24  action drop
>
> which is from a bug report.
>
>> 2) after a vlan is popped with act_vlan, the kernel forgets about the VLAN
>> ID and the VLAN type. So, if I want to just mangle the TTL in a QinQ
>> packet, I need to reinject it in a place where both tags (including VLAN
>> type *and* VLAN id) are restored in the packet.
> It is forgotten by act_vlan only, the vlan info is still inside the
> packet header.
> Perhaps we just need some action to push it back.
There is memmove in those functions, so the VLAN is overwritten, and you 
will also need another memory to store the VLANs.
>
>> Clearly, act_vlan can't be used as is, because 'push' has hardcoded VLAN
>> ID and ethertype. Unless we change act_vlan code to enable rollback of
>> previous 'pop' operations, it's quite hard to pipe the correct sequence of
>> vlan 'pop' and 'push'.
> What about other encapsulations like VXLAN? What if I just want to
> mangle the inner TTL of a VXLAN packet? You know the answer is setting
> up TC filters and actions on VXLAN device instead of ethernet device.
>
> IOW, why QinQ is so special that we have to take care of inside TC action
> not the encapsulation endpoint?

I think that's because QinQ, or VLAN is not an encapsulation. There is 
no outer/inner packets, and if you want to mangle fields in the packet 
you can do it and the result is well-defined.

BTW, the motivation for my fix was a use case were 2 VGT VMs 
communicating by OVS failed. Since OVS sees the same VLAN tag, it 
doesn't add explicit VLAN pop/push actions (i.e pop, mangle, push). If 
you force explicit pop/mangle/push you will break such applications.

>
>
>>>  From the code reuse perspective, you are adding TCA_VLAN_ACT_POP
>>> to each of them.
>> No, these patches don't pop VLAN tags. All tags are restored after the
>> action completed his work, before returning a->tcfa_action.
>>
>> May I ask you to read it as a followup of commit 2ecba2d1e45b ("net:
>> sched: act_csum: Fix csum calc for tagged packets"), where the 'csum'
>> action was modified to mangle the checksum of IPv4 headers even when
>> multiple 802.1Q tags were present?
> Yes, I already read it and I think that commit should be reverted for the
> same reason as I already stated above.
>
>
>> With this series it becomes possible to mangle also the TTL field (with
>> pedit), and assign the diffserv bits to skb->priority (with skbedit).
> Sorry, I am not yet convinced why we should do it in TC.
>
> Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming
  2019-06-02  4:22         ` Eli Britstein
@ 2019-06-04 17:41           ` Cong Wang
  0 siblings, 0 replies; 21+ messages in thread
From: Cong Wang @ 2019-06-04 17:41 UTC (permalink / raw)
  To: Eli Britstein
  Cc: Davide Caratti, Eric Dumazet, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, Linux Kernel Network Developers, Shuang Li,
	Stephen Hemminger

On Sat, Jun 1, 2019 at 9:22 PM Eli Britstein <elibr@mellanox.com> wrote:
>
>
> On 6/1/2019 1:50 AM, Cong Wang wrote:
> > On Fri, May 31, 2019 at 3:01 PM Davide Caratti <dcaratti@redhat.com> wrote:
> >> Please note: this loop was here also before this patch (the 'goto again;'
> >> line is only patch context). It has been introduced with commit
> >> 2ecba2d1e45b ("net: sched: act_csum: Fix csum calc for tagged packets").
> >>
> > This is exactly why I ask...
> >
> >
> >>> Why do you still need to loop here? tc_skb_pull_vlans() already
> >>> contains a loop to pop all vlan tags?
> >> The reason why the loop is here is:
> >> 1) in case there is a stripped vlan tag, it replaces tc_skb_protocol(skb)
> >> with the inner ethertype (i.e. skb->protocol)
> >>
> >> 2) in case there is one or more unstripped VLAN tags, it pulls them. At
> >> the last iteration, when it does:
> > Let me ask it in another way:
> >
> > The original code, without your patch, has a loop (the "goto again") to
> > pop all vlan tags.
> >
> > The code with your patch adds yet another loop (the while loop inside your
> > tc_skb_pull_vlans()) to pop all vlan tags.
> >
> > So, after your patch, we have both loops. So, I am confused why we need
> > these two nested loops to just pop all vlan tags? I think one is sufficient.
> After Davide's patch, the "goto again" is needed to re-enter the switch
> case, and guaranteed to be done only once, as all the VLAN tags were
> already pulled. The alternative is having a dedicated if before the switch.

Yeah, I think that can be simply moved before the switch so
that we don't have to use two loops, which should be easier to
understand too.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-06-02  4:22       ` Eli Britstein
@ 2019-06-04 17:55         ` Cong Wang
  2019-06-04 18:19           ` Eli Britstein
  0 siblings, 1 reply; 21+ messages in thread
From: Cong Wang @ 2019-06-04 17:55 UTC (permalink / raw)
  To: Eli Britstein
  Cc: Davide Caratti, Eric Dumazet, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, Linux Kernel Network Developers, Shuang Li,
	Stephen Hemminger

On Sat, Jun 1, 2019 at 9:22 PM Eli Britstein <elibr@mellanox.com> wrote:
>
> I think that's because QinQ, or VLAN is not an encapsulation. There is
> no outer/inner packets, and if you want to mangle fields in the packet
> you can do it and the result is well-defined.

Sort of, perhaps VLAN tags are too short to be called as an
encapsulation, my point is that it still needs some endpoints to push
or pop the tags, in a similar way we do encap/decap.


>
> BTW, the motivation for my fix was a use case were 2 VGT VMs
> communicating by OVS failed. Since OVS sees the same VLAN tag, it
> doesn't add explicit VLAN pop/push actions (i.e pop, mangle, push). If
> you force explicit pop/mangle/push you will break such applications.

From what you said, it seems act_csum is in the middle of packet
receive/transmit path. So, which is the one pops the VLAN tags in
this scenario? If the VM's are the endpoints, why not use act_csum
there?

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-06-04 17:55         ` Cong Wang
@ 2019-06-04 18:19           ` Eli Britstein
  2019-06-06  1:42             ` Cong Wang
  0 siblings, 1 reply; 21+ messages in thread
From: Eli Britstein @ 2019-06-04 18:19 UTC (permalink / raw)
  To: Cong Wang
  Cc: Davide Caratti, Eric Dumazet, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, Linux Kernel Network Developers, Shuang Li,
	Stephen Hemminger


On 6/4/2019 8:55 PM, Cong Wang wrote:
> On Sat, Jun 1, 2019 at 9:22 PM Eli Britstein <elibr@mellanox.com> wrote:
>> I think that's because QinQ, or VLAN is not an encapsulation. There is
>> no outer/inner packets, and if you want to mangle fields in the packet
>> you can do it and the result is well-defined.
> Sort of, perhaps VLAN tags are too short to be called as an
> encapsulation, my point is that it still needs some endpoints to push
> or pop the tags, in a similar way we do encap/decap.
>
>
>> BTW, the motivation for my fix was a use case were 2 VGT VMs
>> communicating by OVS failed. Since OVS sees the same VLAN tag, it
>> doesn't add explicit VLAN pop/push actions (i.e pop, mangle, push). If
>> you force explicit pop/mangle/push you will break such applications.
>  From what you said, it seems act_csum is in the middle of packet
> receive/transmit path. So, which is the one pops the VLAN tags in
> this scenario? If the VM's are the endpoints, why not use act_csum
> there?

In a switchdev mode, we can passthru the VFs to VMs, and have their 
representors in the host, enabling us to manipulate the HW eswitch 
without knowledge of the VMs.

To simplify it, consider the following setup:

v1a <-> v1b and v2a <-> v2b are veth pairs.

Now, we configure v1a.20 and v2a.20 as VLAN devices over v1a/v2a 
respectively (and put the "a" devs in separate namespaces).

The TC rules are on the "b" devs, for example:

tc filter add dev v1b ... action pedit ... action csum ... action 
redirect dev v2b

Now, ping from v1a.20 to v1b.20. The namespaces transmit/receive tagged 
packets, and are not aware of the packet manipulation (and the required 
act_csum).
> Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-06-04 18:19           ` Eli Britstein
@ 2019-06-06  1:42             ` Cong Wang
  2019-06-06  5:37               ` Eli Britstein
  2019-06-07 18:20               ` Davide Caratti
  0 siblings, 2 replies; 21+ messages in thread
From: Cong Wang @ 2019-06-06  1:42 UTC (permalink / raw)
  To: Eli Britstein
  Cc: Davide Caratti, Eric Dumazet, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, Linux Kernel Network Developers, Shuang Li,
	Stephen Hemminger

On Tue, Jun 4, 2019 at 11:19 AM Eli Britstein <elibr@mellanox.com> wrote:
>
>
> On 6/4/2019 8:55 PM, Cong Wang wrote:
> > On Sat, Jun 1, 2019 at 9:22 PM Eli Britstein <elibr@mellanox.com> wrote:
> >> I think that's because QinQ, or VLAN is not an encapsulation. There is
> >> no outer/inner packets, and if you want to mangle fields in the packet
> >> you can do it and the result is well-defined.
> > Sort of, perhaps VLAN tags are too short to be called as an
> > encapsulation, my point is that it still needs some endpoints to push
> > or pop the tags, in a similar way we do encap/decap.
> >
> >
> >> BTW, the motivation for my fix was a use case were 2 VGT VMs
> >> communicating by OVS failed. Since OVS sees the same VLAN tag, it
> >> doesn't add explicit VLAN pop/push actions (i.e pop, mangle, push). If
> >> you force explicit pop/mangle/push you will break such applications.
> >  From what you said, it seems act_csum is in the middle of packet
> > receive/transmit path. So, which is the one pops the VLAN tags in
> > this scenario? If the VM's are the endpoints, why not use act_csum
> > there?
>
> In a switchdev mode, we can passthru the VFs to VMs, and have their
> representors in the host, enabling us to manipulate the HW eswitch
> without knowledge of the VMs.
>
> To simplify it, consider the following setup:
>
> v1a <-> v1b and v2a <-> v2b are veth pairs.
>
> Now, we configure v1a.20 and v2a.20 as VLAN devices over v1a/v2a
> respectively (and put the "a" devs in separate namespaces).
>
> The TC rules are on the "b" devs, for example:
>
> tc filter add dev v1b ... action pedit ... action csum ... action
> redirect dev v2b
>
> Now, ping from v1a.20 to v1b.20. The namespaces transmit/receive tagged
> packets, and are not aware of the packet manipulation (and the required
> act_csum).

This is what I said, v1b is not the endpoint which pops the vlan tag,
v1b.20 is. So, why not simply move at least the csum action to
v1b.20? With that, you can still filter and redirect packets on v1b,
you still even modify it too, just defer the checksum fixup to the
endpoint.

And to be fair, if this case is a valid concern, so is VXLAN case,
just replace v1a.20 and v2a.20 with VXLAN tunnels. If you modify
the inner header, you have to fixup the checksum in the outer
UDP header.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-06-06  1:42             ` Cong Wang
@ 2019-06-06  5:37               ` Eli Britstein
  2019-06-11  0:52                 ` Cong Wang
  2019-06-07 18:20               ` Davide Caratti
  1 sibling, 1 reply; 21+ messages in thread
From: Eli Britstein @ 2019-06-06  5:37 UTC (permalink / raw)
  To: Cong Wang
  Cc: Davide Caratti, Eric Dumazet, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, Linux Kernel Network Developers, Shuang Li,
	Stephen Hemminger


On 6/6/2019 4:42 AM, Cong Wang wrote:
> On Tue, Jun 4, 2019 at 11:19 AM Eli Britstein <elibr@mellanox.com> wrote:
>>
>> On 6/4/2019 8:55 PM, Cong Wang wrote:
>>> On Sat, Jun 1, 2019 at 9:22 PM Eli Britstein <elibr@mellanox.com> wrote:
>>>> I think that's because QinQ, or VLAN is not an encapsulation. There is
>>>> no outer/inner packets, and if you want to mangle fields in the packet
>>>> you can do it and the result is well-defined.
>>> Sort of, perhaps VLAN tags are too short to be called as an
>>> encapsulation, my point is that it still needs some endpoints to push
>>> or pop the tags, in a similar way we do encap/decap.
>>>
>>>
>>>> BTW, the motivation for my fix was a use case were 2 VGT VMs
>>>> communicating by OVS failed. Since OVS sees the same VLAN tag, it
>>>> doesn't add explicit VLAN pop/push actions (i.e pop, mangle, push). If
>>>> you force explicit pop/mangle/push you will break such applications.
>>>   From what you said, it seems act_csum is in the middle of packet
>>> receive/transmit path. So, which is the one pops the VLAN tags in
>>> this scenario? If the VM's are the endpoints, why not use act_csum
>>> there?
>> In a switchdev mode, we can passthru the VFs to VMs, and have their
>> representors in the host, enabling us to manipulate the HW eswitch
>> without knowledge of the VMs.
>>
>> To simplify it, consider the following setup:
>>
>> v1a <-> v1b and v2a <-> v2b are veth pairs.
>>
>> Now, we configure v1a.20 and v2a.20 as VLAN devices over v1a/v2a
>> respectively (and put the "a" devs in separate namespaces).
>>
>> The TC rules are on the "b" devs, for example:
>>
>> tc filter add dev v1b ... action pedit ... action csum ... action
>> redirect dev v2b
>>
>> Now, ping from v1a.20 to v1b.20. The namespaces transmit/receive tagged
>> packets, and are not aware of the packet manipulation (and the required
>> act_csum).
> This is what I said, v1b is not the endpoint which pops the vlan tag,
> v1b.20 is. So, why not simply move at least the csum action to
> v1b.20? With that, you can still filter and redirect packets on v1b,
> you still even modify it too, just defer the checksum fixup to the
> endpoint.

There are no vxb.20 ports:

ns0:     v1a.20 ----(VLAN)---- v1a ns1:    v2a ---- (VLAN) ---- v2a.20

|----(veth)---- v1b     <---- (TC) ---->    v2b ----(veth)----|

>
> And to be fair, if this case is a valid concern, so is VXLAN case,
> just replace v1a.20 and v2a.20 with VXLAN tunnels. If you modify
> the inner header, you have to fixup the checksum in the outer
> UDP header.
>
> Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-06-06  1:42             ` Cong Wang
  2019-06-06  5:37               ` Eli Britstein
@ 2019-06-07 18:20               ` Davide Caratti
  2019-06-08 11:19                 ` Eli Britstein
  1 sibling, 1 reply; 21+ messages in thread
From: Davide Caratti @ 2019-06-07 18:20 UTC (permalink / raw)
  To: Cong Wang, Eli Britstein
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Stephen Hemminger

On Wed, 2019-06-05 at 18:42 -0700, Cong Wang wrote:
> On Tue, Jun 4, 2019 at 11:19 AM Eli Britstein <elibr@mellanox.com> wrote:

hello Cong and Eli,

and thanks for all the thoughts.

> > On 6/4/2019 8:55 PM, Cong Wang wrote:
> > > On Sat, Jun 1, 2019 at 9:22 PM Eli Britstein <elibr@mellanox.com> wrote:
> > > > I think that's because QinQ, or VLAN is not an encapsulation. There is
> > > > no outer/inner packets, and if you want to mangle fields in the packet
> > > > you can do it and the result is well-defined.
> > > Sort of, perhaps VLAN tags are too short to be called as an
> > > encapsulation, my point is that it still needs some endpoints to push
> > > or pop the tags, in a similar way we do encap/decap.
> > > 
> > > 
> > > > BTW, the motivation for my fix was a use case were 2 VGT VMs
> > > > communicating by OVS failed. Since OVS sees the same VLAN tag, it
> > > > doesn't add explicit VLAN pop/push actions (i.e pop, mangle, push). If
> > > > you force explicit pop/mangle/push you will break such applications.
> > >  From what you said, it seems act_csum is in the middle of packet
> > > receive/transmit path. So, which is the one pops the VLAN tags in
> > > this scenario? If the VM's are the endpoints, why not use act_csum
> > > there?
> > 
> > In a switchdev mode, we can passthru the VFs to VMs, and have their
> > representors in the host, enabling us to manipulate the HW eswitch
> > without knowledge of the VMs.
> > 
> > To simplify it, consider the following setup:
> > 
> > v1a <-> v1b and v2a <-> v2b are veth pairs.
> > 
> > Now, we configure v1a.20 and v2a.20 as VLAN devices over v1a/v2a
> > respectively (and put the "a" devs in separate namespaces).
> > 
> > The TC rules are on the "b" devs, for example:
> > 
> > tc filter add dev v1b ... action pedit ... action csum ... action
> > redirect dev v2b
> > 
> > Now, ping from v1a.20 to v1b.20. The namespaces transmit/receive tagged
> > packets, and are not aware of the packet manipulation (and the required
> > act_csum).
> 
> This is what I said, v1b is not the endpoint which pops the vlan tag,
> v1b.20 is. So, why not simply move at least the csum action to
> v1b.20? With that, you can still filter and redirect packets on v1b,
> you still even modify it too, just defer the checksum fixup to the
> endpoint.
> 
> And to be fair, if this case is a valid concern, so is VXLAN case,
> just replace v1a.20 and v2a.20 with VXLAN tunnels. If you modify
> the inner header, you have to fixup the checksum in the outer
> UDP header.

at least with single "accelerated" vlan tags, I think that users of
tc_skb_protocol() that explicitly access the IP header should be fixed the
same way that Eli did to 'csum'.

(note that 'pedit' is *not* among them, and that's why Eli only needed to
fix 'csum' in his setup :) )
 
Now I'm no more sure about what to do with the QinQ case, whether we
should:

a) fix missing skb_may_pull() in csum, and fix (a couple of) other actions
in the same way, 

or

b) rework act_csum when it wants to read the IP header, in a way that only
ip header is mangled only when there are only zero or a single
"accelerated" tag, and do the same for (a couple of) other actions.

The pop/push approach built in the action ( option a) ) fixes my
environment - but like Cong says, it only would work with VLANs, and not
with other encapsulations. Probably it really makes sense to see if it's
possible to extend act_vlan in a way that it reconstructs the packet after
an iteration of 'vlan pop'. This might not be easy, because even the above
command assumes that inner and outer tag have the same ethertype (which is
not generally true for QinQ).

But until then, we should assume that read/writes of the IP header are not
feasible for QinQ traffic, just like it's not possible for tunneled
packets, and adjust configuration accordingly.

WDYT?

-- 
davide


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-06-07 18:20               ` Davide Caratti
@ 2019-06-08 11:19                 ` Eli Britstein
  0 siblings, 0 replies; 21+ messages in thread
From: Eli Britstein @ 2019-06-08 11:19 UTC (permalink / raw)
  To: Davide Caratti, Cong Wang
  Cc: Eric Dumazet, Jiri Pirko, Jamal Hadi Salim, David S . Miller,
	Linux Kernel Network Developers, Shuang Li, Stephen Hemminger


On 6/7/2019 9:20 PM, Davide Caratti wrote:
> On Wed, 2019-06-05 at 18:42 -0700, Cong Wang wrote:
>> On Tue, Jun 4, 2019 at 11:19 AM Eli Britstein <elibr@mellanox.com> wrote:
> hello Cong and Eli,
>
> and thanks for all the thoughts.
>
>>> On 6/4/2019 8:55 PM, Cong Wang wrote:
>>>> On Sat, Jun 1, 2019 at 9:22 PM Eli Britstein <elibr@mellanox.com> wrote:
>>>>> I think that's because QinQ, or VLAN is not an encapsulation. There is
>>>>> no outer/inner packets, and if you want to mangle fields in the packet
>>>>> you can do it and the result is well-defined.
>>>> Sort of, perhaps VLAN tags are too short to be called as an
>>>> encapsulation, my point is that it still needs some endpoints to push
>>>> or pop the tags, in a similar way we do encap/decap.
>>>>
>>>>
>>>>> BTW, the motivation for my fix was a use case were 2 VGT VMs
>>>>> communicating by OVS failed. Since OVS sees the same VLAN tag, it
>>>>> doesn't add explicit VLAN pop/push actions (i.e pop, mangle, push). If
>>>>> you force explicit pop/mangle/push you will break such applications.
>>>>   From what you said, it seems act_csum is in the middle of packet
>>>> receive/transmit path. So, which is the one pops the VLAN tags in
>>>> this scenario? If the VM's are the endpoints, why not use act_csum
>>>> there?
>>> In a switchdev mode, we can passthru the VFs to VMs, and have their
>>> representors in the host, enabling us to manipulate the HW eswitch
>>> without knowledge of the VMs.
>>>
>>> To simplify it, consider the following setup:
>>>
>>> v1a <-> v1b and v2a <-> v2b are veth pairs.
>>>
>>> Now, we configure v1a.20 and v2a.20 as VLAN devices over v1a/v2a
>>> respectively (and put the "a" devs in separate namespaces).
>>>
>>> The TC rules are on the "b" devs, for example:
>>>
>>> tc filter add dev v1b ... action pedit ... action csum ... action
>>> redirect dev v2b
>>>
>>> Now, ping from v1a.20 to v1b.20. The namespaces transmit/receive tagged
>>> packets, and are not aware of the packet manipulation (and the required
>>> act_csum).
>> This is what I said, v1b is not the endpoint which pops the vlan tag,
>> v1b.20 is. So, why not simply move at least the csum action to
>> v1b.20? With that, you can still filter and redirect packets on v1b,
>> you still even modify it too, just defer the checksum fixup to the
>> endpoint.
>>
>> And to be fair, if this case is a valid concern, so is VXLAN case,
>> just replace v1a.20 and v2a.20 with VXLAN tunnels. If you modify
>> the inner header, you have to fixup the checksum in the outer
>> UDP header.
> at least with single "accelerated" vlan tags, I think that users of
> tc_skb_protocol() that explicitly access the IP header should be fixed the
> same way that Eli did to 'csum'.
>
> (note that 'pedit' is *not* among them, and that's why Eli only needed to
> fix 'csum' in his setup :) )
>   
> Now I'm no more sure about what to do with the QinQ case, whether we
> should:
>
> a) fix missing skb_may_pull() in csum, and fix (a couple of) other actions
> in the same way,
I'm for that.
>
> or
>
> b) rework act_csum when it wants to read the IP header, in a way that only
> ip header is mangled only when there are only zero or a single
> "accelerated" tag, and do the same for (a couple of) other actions.

What is the conceptually difference between single VLAN and QinQ? Or, 
even between "accelerated" to non-accelerated?

Furthermore, doing that you would reduce functionality already exists. 
Don't do it.

>
> The pop/push approach built in the action ( option a) ) fixes my
> environment - but like Cong says, it only would work with VLANs, and not
> with other encapsulations. Probably it really makes sense to see if it's
> possible to extend act_vlan in a way that it reconstructs the packet after
> an iteration of 'vlan pop'. This might not be easy, because even the above
> command assumes that inner and outer tag have the same ethertype (which is
> not generally true for QinQ).
Again, I think VLANs and encapsulations are not the same. Actions for 
tagged packets are well defined, the same as they are on native packets. 
In encapsulations you would *HAVE* to do make decap/encap in order to 
act on inner packet.
>
> But until then, we should assume that read/writes of the IP header are not
> feasible for QinQ traffic, just like it's not possible for tunneled
> packets, and adjust configuration accordingly.
>
> WDYT?
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-06-06  5:37               ` Eli Britstein
@ 2019-06-11  0:52                 ` Cong Wang
  2019-06-11  4:43                   ` Eli Britstein
  0 siblings, 1 reply; 21+ messages in thread
From: Cong Wang @ 2019-06-11  0:52 UTC (permalink / raw)
  To: Eli Britstein
  Cc: Davide Caratti, Eric Dumazet, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, Linux Kernel Network Developers, Shuang Li,
	Stephen Hemminger

On Wed, Jun 5, 2019 at 10:37 PM Eli Britstein <elibr@mellanox.com> wrote:
>
>
> On 6/6/2019 4:42 AM, Cong Wang wrote:
> > On Tue, Jun 4, 2019 at 11:19 AM Eli Britstein <elibr@mellanox.com> wrote:
> >>
> >> On 6/4/2019 8:55 PM, Cong Wang wrote:
> >>> On Sat, Jun 1, 2019 at 9:22 PM Eli Britstein <elibr@mellanox.com> wrote:
> >>>> I think that's because QinQ, or VLAN is not an encapsulation. There is
> >>>> no outer/inner packets, and if you want to mangle fields in the packet
> >>>> you can do it and the result is well-defined.
> >>> Sort of, perhaps VLAN tags are too short to be called as an
> >>> encapsulation, my point is that it still needs some endpoints to push
> >>> or pop the tags, in a similar way we do encap/decap.
> >>>
> >>>
> >>>> BTW, the motivation for my fix was a use case were 2 VGT VMs
> >>>> communicating by OVS failed. Since OVS sees the same VLAN tag, it
> >>>> doesn't add explicit VLAN pop/push actions (i.e pop, mangle, push). If
> >>>> you force explicit pop/mangle/push you will break such applications.
> >>>   From what you said, it seems act_csum is in the middle of packet
> >>> receive/transmit path. So, which is the one pops the VLAN tags in
> >>> this scenario? If the VM's are the endpoints, why not use act_csum
> >>> there?
> >> In a switchdev mode, we can passthru the VFs to VMs, and have their
> >> representors in the host, enabling us to manipulate the HW eswitch
> >> without knowledge of the VMs.
> >>
> >> To simplify it, consider the following setup:
> >>
> >> v1a <-> v1b and v2a <-> v2b are veth pairs.
> >>
> >> Now, we configure v1a.20 and v2a.20 as VLAN devices over v1a/v2a
> >> respectively (and put the "a" devs in separate namespaces).
> >>
> >> The TC rules are on the "b" devs, for example:
> >>
> >> tc filter add dev v1b ... action pedit ... action csum ... action
> >> redirect dev v2b
> >>
> >> Now, ping from v1a.20 to v1b.20. The namespaces transmit/receive tagged
> >> packets, and are not aware of the packet manipulation (and the required
> >> act_csum).
> > This is what I said, v1b is not the endpoint which pops the vlan tag,
> > v1b.20 is. So, why not simply move at least the csum action to
> > v1b.20? With that, you can still filter and redirect packets on v1b,
> > you still even modify it too, just defer the checksum fixup to the
> > endpoint.
>
> There are no vxb.20 ports:
>
> ns0:     v1a.20 ----(VLAN)---- v1a ns1:    v2a ---- (VLAN) ---- v2a.20
>
> |----(veth)---- v1b     <---- (TC) ---->    v2b ----(veth)----|


This diagram makes me even more confusing...

Can you explicitly explain why there is no vxb.20? Is it a router or
something?

By the way, even if it is router and you really want to checksum the
packet at that point, you still don't have to move the skb->data
pointer, you just need to parse the header and calculate the offset
without touching skb->data. This could at least avoid restoring
skb->data after it.

Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets
  2019-06-11  0:52                 ` Cong Wang
@ 2019-06-11  4:43                   ` Eli Britstein
  0 siblings, 0 replies; 21+ messages in thread
From: Eli Britstein @ 2019-06-11  4:43 UTC (permalink / raw)
  To: Cong Wang
  Cc: Davide Caratti, Eric Dumazet, Jiri Pirko, Jamal Hadi Salim,
	David S . Miller, Linux Kernel Network Developers, Shuang Li,
	Stephen Hemminger


On 6/11/2019 3:52 AM, Cong Wang wrote:
> On Wed, Jun 5, 2019 at 10:37 PM Eli Britstein <elibr@mellanox.com> wrote:
>>
>> On 6/6/2019 4:42 AM, Cong Wang wrote:
>>> On Tue, Jun 4, 2019 at 11:19 AM Eli Britstein <elibr@mellanox.com> wrote:
>>>> On 6/4/2019 8:55 PM, Cong Wang wrote:
>>>>> On Sat, Jun 1, 2019 at 9:22 PM Eli Britstein <elibr@mellanox.com> wrote:
>>>>>> I think that's because QinQ, or VLAN is not an encapsulation. There is
>>>>>> no outer/inner packets, and if you want to mangle fields in the packet
>>>>>> you can do it and the result is well-defined.
>>>>> Sort of, perhaps VLAN tags are too short to be called as an
>>>>> encapsulation, my point is that it still needs some endpoints to push
>>>>> or pop the tags, in a similar way we do encap/decap.
>>>>>
>>>>>
>>>>>> BTW, the motivation for my fix was a use case were 2 VGT VMs
>>>>>> communicating by OVS failed. Since OVS sees the same VLAN tag, it
>>>>>> doesn't add explicit VLAN pop/push actions (i.e pop, mangle, push). If
>>>>>> you force explicit pop/mangle/push you will break such applications.
>>>>>    From what you said, it seems act_csum is in the middle of packet
>>>>> receive/transmit path. So, which is the one pops the VLAN tags in
>>>>> this scenario? If the VM's are the endpoints, why not use act_csum
>>>>> there?
>>>> In a switchdev mode, we can passthru the VFs to VMs, and have their
>>>> representors in the host, enabling us to manipulate the HW eswitch
>>>> without knowledge of the VMs.
>>>>
>>>> To simplify it, consider the following setup:
>>>>
>>>> v1a <-> v1b and v2a <-> v2b are veth pairs.
>>>>
>>>> Now, we configure v1a.20 and v2a.20 as VLAN devices over v1a/v2a
>>>> respectively (and put the "a" devs in separate namespaces).
>>>>
>>>> The TC rules are on the "b" devs, for example:
>>>>
>>>> tc filter add dev v1b ... action pedit ... action csum ... action
>>>> redirect dev v2b
>>>>
>>>> Now, ping from v1a.20 to v1b.20. The namespaces transmit/receive tagged
>>>> packets, and are not aware of the packet manipulation (and the required
>>>> act_csum).
>>> This is what I said, v1b is not the endpoint which pops the vlan tag,
>>> v1b.20 is. So, why not simply move at least the csum action to
>>> v1b.20? With that, you can still filter and redirect packets on v1b,
>>> you still even modify it too, just defer the checksum fixup to the
>>> endpoint.
>> There are no vxb.20 ports:
>>
>> ns0:     v1a.20 ----(VLAN)---- v1a ns1:    v2a ---- (VLAN) ---- v2a.20
>>
>> |----(veth)---- v1b     <---- (TC) ---->    v2b ----(veth)----|
>
> This diagram makes me even more confusing...
>
> Can you explicitly explain why there is no vxb.20? Is it a router or
> something?
Yes.
>
> By the way, even if it is router and you really want to checksum the
> packet at that point, you still don't have to move the skb->data
> pointer, you just need to parse the header and calculate the offset
> without touching skb->data. This could at least avoid restoring
> skb->data after it.
Sure, this is another implementation method. It doesn't change the 
essence. I just wanted to reuse the existing tcf_csum_ipv4/6.
>
> Thanks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2019-06-11  4:43 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-31 17:26 [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets Davide Caratti
2019-05-31 17:26 ` [PATCH net v3 1/3] net/sched: act_csum: pull all VLAN headers before checksumming Davide Caratti
2019-05-31 18:38   ` Cong Wang
2019-05-31 22:01     ` Davide Caratti
2019-05-31 22:50       ` Cong Wang
2019-06-02  4:22         ` Eli Britstein
2019-06-04 17:41           ` Cong Wang
2019-05-31 17:26 ` [PATCH net v3 2/3] net/sched: act_pedit: fix 'ex munge' on network header in case of QinQ packet Davide Caratti
2019-05-31 17:26 ` [PATCH net v3 3/3] net/sched: act_skbedit: fix 'inheritdsfield' " Davide Caratti
2019-05-31 18:42 ` [PATCH net v3 0/3] net/sched: fix actions reading the network header in case of QinQ packets Cong Wang
2019-05-31 22:01   ` Davide Caratti
2019-05-31 22:29     ` Cong Wang
2019-06-02  4:22       ` Eli Britstein
2019-06-04 17:55         ` Cong Wang
2019-06-04 18:19           ` Eli Britstein
2019-06-06  1:42             ` Cong Wang
2019-06-06  5:37               ` Eli Britstein
2019-06-11  0:52                 ` Cong Wang
2019-06-11  4:43                   ` Eli Britstein
2019-06-07 18:20               ` Davide Caratti
2019-06-08 11:19                 ` Eli Britstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.