netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* VLAN tags in mac_len
@ 2019-06-14 10:18 Johannes Berg
  2019-06-15 15:19 ` Alexei Starovoitov
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Berg @ 2019-06-14 10:18 UTC (permalink / raw)
  To: netdev
  Cc: bridge, nikolay, roopa, jhs, David Ahern, Zahari Doychev,
	Simon Horman, Toshiaki Makita, Cong Wang, Jiri Pirko,
	Alexei Starovoitov

Hi all,

Sorry for the long CC list - it's a bit unclear where the issue is.

Zahari has been looking at this forever, and we keep talking about it,
but every time we squint and look at it in a different way, the issue
appears in a different place.

Ultimately, the issue appears to be that we cannot make up our mind
whether we consider VLAN tags to be covered by mac_len or not.


Some history first.

For OVS datapath, Simon evidently considered the VLAN tags to be part of
the mac_len, since the OVS VLAN push/pop actions manipulate it, see
commit 25cd9ba0abc07 ("openvswitch: Add basic MPLS support to kernel").
It's not clear to me, however, why it previously wasn't considered and
only MPLS needed to consider it this way - perhaps nobody ever tried
double-VLAN tagging or something.

Anyway, then in commit 93515d53b133 ("net: move vlan pop/push functions
into common code") Jiri moved the code as is into common code, which
still kinda made sense since OVS was the only user. Maybe at this point
we should've asked whether or not the mac_len manipulations make sense
or not though.

Now clearly Jiri had an agenda (btw, sorry for misspelling your name all
the time), and followed up with commit c7e2b9689ef8 ("sched: introduce
vlan action"). I assume this was tested, but probably not in the
scenario we have now.

There's also commit 4e10df9a60d9 ("bpf: introduce
bpf_skb_vlan_push/pop() helpers") which then makes all of this get used
in BPF.

The bridge code, on the other hand, has basically always (I stopped
looking when I hit the end of current git history) assumed that VLAN
tags are not part of mac_len, and does skb_push(ETH_HLEN).


Next, the scenario.

Basically, what Zahari is trying to do is to use TC to push *two* VLAN
tags on ingress, before the packet then goes to the bridge.

This results in the following flow:

We start with an ingress skb, somewhere:
-> eth_type_trans()
 => skb->mac_len = 14,
    skb->data = ethhdr + 14

Push the first tag with TC:
-> tc vlan push (tag1)
 -> skb_push(mac_len)
  => skb->data = ethhdr

 -> skb_vlan_push(tag1)
  -> __vlan_hwaccel_put_tag(tag1)

 -> skb_pull(mac_len)
  => skb->data = ethhdr + 14
 => no changes in SKB other than recorded VLAN acceleration

Push the second tag with TC
-> tc vlan push (tag2)
 -> skb_push(mac_len)
   * skb->data = ethhdr
 -> skb_vlan_push(tag2)
  -> __vlan_insert_tag()
  -> skb->mac_len += 4
   => skb->mac_len = 18
 -> __vlan_hwaccel_put_tag(tag2)
 -> skb_pull(mac_len)
  => skb->data = ethhdr + 18

-> bridge
 -> br_dev_queue_push_xmit()
  -> skb_push(14)
   => skb->data = ethhdr + 4 (!!!)
  -> dev_queue_xmit()

=> as a result, now the first 4 bytes of the frame are lost, as
   on xmit drivers expect to start at skb->data

(There's some more complication here if we actually don't have HW
offload for the VLAN tag, but the end result is basically the same)



Possible solutions?

So far, Zahari tried three different ways of fixing this:

 1) Make the bridge code use skb->mac_len instead of ETH_HLEN. This
    works for this particular case, but breaks some other cases;
    evidently some places exist where skb->mac_len isn't even set to
    ETH_HLEN when a packet gets to the bridge. I don't know right now
    what that was, I think probably somebody who's CC'ed reported that.

 2) Let tc_act_vlan() just pull ETH_HLEN instead of skb->mac_len, but
    this is rather asymmetric and strange, and while it works for this
    case it may cause confusion elsewhere.

 2b) Toshiaki said it might be better to make that code *remember*
     mac_len and then use it to push and pull (so not caring about the
     change made by skb_vlan_push()), but that ultimately leads to
     confusion and if you have TC push/pop combinations things just get
     completely out of sync and die

 3) Make skb_vlan_push()/_pop() just not change mac_len at all. So far
    this also addresses the issue, but it's likely that this will break
    OVS, and I don't know how it'd affect BPF. Quite possibly like TC
    does and is broken, but perhaps not.


But now we're stuck. Depending on how you look at it, all of these seem
sort of reasonable, or not.

Ultimately, the issue seems to be that we couldn't really decide whether
VLAN tags (and probably MPLS tags, for that matter) are covered by
mac_len or not. At least not consistently on ingress and egress.
eth_type_trans() doesn't take them into account, so of course on simple
ingress mac_len will only cover the ETH_HLEN.

If you have an accelerated tag and then push it into the SKB, it will
*not* be taken into account in mac_len. OTOH, if you have a new tag and
use skb_vlan_push() then it *will* be taken into account.


I'm trending towards solution (3), because if we consider other
combinations of VLAN push/pop in TC, I think we can end up in a very
messy situation today. For example, POP/PUSH seems like it should be a
no-op, but it isn't due to the mac_len, *unless* it can use the HW accel
only (i.e. only a single tag).

I think then to propose such a patch though we'd have to figure out
where the BPF case is, and to keep OVS working probably either add an
argument ("bool adjust_mac_len") to the function signatures, or just do
the adjustments in OVS code after calling them?


Any other thoughts?

Thanks,
Zahari & Johannes


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: VLAN tags in mac_len
  2019-06-14 10:18 VLAN tags in mac_len Johannes Berg
@ 2019-06-15 15:19 ` Alexei Starovoitov
  2019-06-15 19:19   ` Johannes Berg
  0 siblings, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2019-06-15 15:19 UTC (permalink / raw)
  To: Johannes Berg
  Cc: netdev, bridge, nikolay, roopa, jhs, David Ahern, Zahari Doychev,
	Simon Horman, Toshiaki Makita, Cong Wang, Jiri Pirko,
	Alexei Starovoitov

On Fri, Jun 14, 2019 at 12:18:41PM +0200, Johannes Berg wrote:
> 
> Possible solutions?
> 
> So far, Zahari tried three different ways of fixing this:
> 
>  1) Make the bridge code use skb->mac_len instead of ETH_HLEN. This
>     works for this particular case, but breaks some other cases;
>     evidently some places exist where skb->mac_len isn't even set to
>     ETH_HLEN when a packet gets to the bridge. I don't know right now
>     what that was, I think probably somebody who's CC'ed reported that.
> 
>  2) Let tc_act_vlan() just pull ETH_HLEN instead of skb->mac_len, but
>     this is rather asymmetric and strange, and while it works for this
>     case it may cause confusion elsewhere.
> 
>  2b) Toshiaki said it might be better to make that code *remember*
>      mac_len and then use it to push and pull (so not caring about the
>      change made by skb_vlan_push()), but that ultimately leads to
>      confusion and if you have TC push/pop combinations things just get
>      completely out of sync and die
> 
>  3) Make skb_vlan_push()/_pop() just not change mac_len at all. So far
>     this also addresses the issue, but it's likely that this will break
>     OVS, and I don't know how it'd affect BPF. Quite possibly like TC
>     does and is broken, but perhaps not.
> 
> 
> But now we're stuck. Depending on how you look at it, all of these seem
> sort of reasonable, or not.
> 
> Ultimately, the issue seems to be that we couldn't really decide whether
> VLAN tags (and probably MPLS tags, for that matter) are covered by
> mac_len or not. At least not consistently on ingress and egress.
> eth_type_trans() doesn't take them into account, so of course on simple
> ingress mac_len will only cover the ETH_HLEN.
> 
> If you have an accelerated tag and then push it into the SKB, it will
> *not* be taken into account in mac_len. OTOH, if you have a new tag and
> use skb_vlan_push() then it *will* be taken into account.
> 
> 
> I'm trending towards solution (3), because if we consider other
> combinations of VLAN push/pop in TC, I think we can end up in a very
> messy situation today. For example, POP/PUSH seems like it should be a
> no-op, but it isn't due to the mac_len, *unless* it can use the HW accel
> only (i.e. only a single tag).
> 
> I think then to propose such a patch though we'd have to figure out
> where the BPF case is, and to keep OVS working probably either add an
> argument ("bool adjust_mac_len") to the function signatures, or just do
> the adjustments in OVS code after calling them?
> 
> 
> Any other thoughts?

imo skb_vlan_push() should still change mac_len.
tc, ovs, bpf use it and expect vlan to be part of L2.
There is nothing between L2 and L3 :)
Hence we cannot say that vlan is not part of L2.
Hence push/pop vlan must change mac_len, since skb->mac_len
is kernel's definition of the length of L2 header.

Now as far as bridge... I think it's unfortunate that linux
adopted 'vlan' as a netdevice model and that's where I think
the problem is.
Typical bridge in the networking industry is a device that
does forwarding based on L2. Which includes vlans.
And imo that's the most appropriate way of configuring and thinking
about bridge functionality.
Whereas in the kernel there is a 'vlan' netdevice that 'eats'
vlan tag and pretends that the rest is the same.
So linux bridge kinda doesn't need to be vlan aware.
CONFIG_BRIDGE_VLAN_FILTERING was the right step, but I haven't
seen it being used and I'm not sure about state of things there.

So your option 1 above is imo the best. The bridge needs to deal
with skb->mac_len and full L2 header.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: VLAN tags in mac_len
  2019-06-15 15:19 ` Alexei Starovoitov
@ 2019-06-15 19:19   ` Johannes Berg
  2019-06-16  8:51     ` Nikolay Aleksandrov
                       ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Johannes Berg @ 2019-06-15 19:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, bridge, nikolay, roopa, jhs, David Ahern, Zahari Doychev,
	Simon Horman, Toshiaki Makita, Cong Wang, Jiri Pirko,
	Nikolay Aleksandrov

Hi Alexei,

Sorry for messing up your address btw, not sure where I dug that one
up...

> >  1) Make the bridge code use skb->mac_len instead of ETH_HLEN. This
> >     works for this particular case, but breaks some other cases;
> >     evidently some places exist where skb->mac_len isn't even set to
> >     ETH_HLEN when a packet gets to the bridge. I don't know right now
> >     what that was, I think probably somebody who's CC'ed reported that.
> > 
> >  2) Let tc_act_vlan() just pull ETH_HLEN instead of skb->mac_len, but
> >     this is rather asymmetric and strange, and while it works for this
> >     case it may cause confusion elsewhere.
> > 
> >  2b) Toshiaki said it might be better to make that code *remember*
> >      mac_len and then use it to push and pull (so not caring about the
> >      change made by skb_vlan_push()), but that ultimately leads to
> >      confusion and if you have TC push/pop combinations things just get
> >      completely out of sync and die
> > 
> >  3) Make skb_vlan_push()/_pop() just not change mac_len at all. So far
> >     this also addresses the issue, but it's likely that this will break
> >     OVS, and I don't know how it'd affect BPF. Quite possibly like TC
> >     does and is broken, but perhaps not.
> > 
> > 
> > But now we're stuck. Depending on how you look at it, all of these seem
> > sort of reasonable, or not.
> > 
> > Ultimately, the issue seems to be that we couldn't really decide whether
> > VLAN tags (and probably MPLS tags, for that matter) are covered by
> > mac_len or not. At least not consistently on ingress and egress.
> > eth_type_trans() doesn't take them into account, so of course on simple
> > ingress mac_len will only cover the ETH_HLEN.
> > 
> > If you have an accelerated tag and then push it into the SKB, it will
> > *not* be taken into account in mac_len. OTOH, if you have a new tag and
> > use skb_vlan_push() then it *will* be taken into account.
> > 
> > 
> > I'm trending towards solution (3), because if we consider other
> > combinations of VLAN push/pop in TC, I think we can end up in a very
> > messy situation today. For example, POP/PUSH seems like it should be a
> > no-op, but it isn't due to the mac_len, *unless* it can use the HW accel
> > only (i.e. only a single tag).
> > 
> > I think then to propose such a patch though we'd have to figure out
> > where the BPF case is, and to keep OVS working probably either add an
> > argument ("bool adjust_mac_len") to the function signatures, or just do
> > the adjustments in OVS code after calling them?
> > 
> > 
> > Any other thoughts?
> 
> imo skb_vlan_push() should still change mac_len.
> tc, ovs, bpf use it and expect vlan to be part of L2.

I'm not sure tc really cares, but it *is* a reasonable argument to make.

Like I said, whichever way I look at the problem, a different solution
looks more reasonable ;-)

> There is nothing between L2 and L3 :)
> Hence we cannot say that vlan is not part of L2.
> Hence push/pop vlan must change mac_len, since skb->mac_len
> is kernel's definition of the length of L2 header.

I think we're getting to something here now. I actually thought about
this some more last night, and basically asked myself how I would design
it without all the legacy baggage.

I'm certainly not suggesting we should change anything here, but to me
it was a bit of a clarification to do this and then see where we differ
in our handling today.

Thinking along those lines, I sort of ended up with the following scheme
(just for the skb head, not the frags/fraglist):

          +------------------+----------------+---------------+
 headroom | eth | vlan | ... | IP  | TCP      | payload       | tailroom
          +------------------+----------------+---------------+
^ skb->head_ptr
          ^ skb->l2_ptr
                             ^ skb->l3_ptr == skb->l2_ptr + skb->l2_len
                                    ...
                                              ^ skb->payload_ptr
                                                              ^ skb->tail

Now, I deliberately didn't put any "skb->data" here, because what we do
today is sort of confusing.

By getting rid of the "multi-use" skb->data in this scheme I think it
becomes clearer to think about.



On *egress*, all we really care about is this:

          +------------------+----------------+---------------+
 headroom | eth | vlan | ... | IP / TCP       | payload       | tailroom
          +------------------+----------------+---------------+
         
^ skb->data
                                         skb->data + skb->len 
^

On *ingress*, however, we hide some of the data (temporarily):

|--------- headroom ---------|
          +------------------+----------------+---------------+
          | eth | vlan | ... | IP / TCP       | payload       | tailroom
          +------------------+----------------+---------------+
          ^ skb->data - skb->mac_len
                             ^ skb->data
                                         skb->data + skb->len ^

which is somewhat confusing to me, and sort of causes all these
problems.

(It also makes it harder to reason about what data is actually valid in
the skb, although if mac_len is non-zero then it must be, but it means
you actually have less headroom and all).

If instead we just made it like the hypothetical scheme I outlined
above, then on traversing a layer we'd set the next layer pointer
appropriately, and then each layer would just use the right pointer:

 * bridge/ethernet driver/... would use l2_ptr
 * IP would use the l3_ptr
 * TCP would use the l4_ptr (didn't put that into the picture)
 * ...

Now we wouldn't have a problem with the VLAN tags, because we'd just
appropriate set/keep all the pointers - bridge doesn't even care where
l3_ptr is pointing, but for IP it would of course point to beyond the
VLAN tags.

(Now, if you wanted to implement this, you probably wouldn't have l2_ptr
but l2_offset etc. but that's an implementation detail.)

Now, why am I writing all this? Because I think it points out that
you're absolutely right - we should treat mac_len as part of the frame
if we're in anything that cares about L2 like bridge.

> Now as far as bridge... I think it's unfortunate that linux
> adopted 'vlan' as a netdevice model and that's where I think
> the problem is.

I'm not sure. I don't exactly know where the problem is if we fix bridge
according to the patch (1) above, which, btw, was discussed before:
https://lore.kernel.org/netdev/20190113135939.8970-1-zahari.doychev@linux.com/

Back then, Nikolay (whom I forgot to CC, fixed now) said:

> It breaks connectivity between bridge and
> members when vlans are used. The host generated packets going out of the bridge
> have mac_len = 0.

Which probably indicates that we're not even consistent with the egress
scheme I pointed out above, probably because we *also* have
hard_header_len?

Maybe somewhere early in the egress path we should set skb->mac_len to
dev->hard_header_len, and then use skb->mac_len consistently, and
consider that part of the skb (and not arbitrarily consider ETH_HLEN to
be part of the skb in bridge).

(This almost tempts me to actually try to implement the hypothetical SKB
scheme I described above, just so it's easier to understand what part
does what ... and to find where the issues like this occur)

> Typical bridge in the networking industry is a device that
> does forwarding based on L2. Which includes vlans.
> And imo that's the most appropriate way of configuring and thinking
> about bridge functionality.
> Whereas in the kernel there is a 'vlan' netdevice that 'eats'
> vlan tag and pretends that the rest is the same.
> So linux bridge kinda doesn't need to be vlan aware.
> CONFIG_BRIDGE_VLAN_FILTERING was the right step, but I haven't
> seen it being used and I'm not sure about state of things there.

I think that ends up being a question of semantics. You can consider an
"industry bridge" that you describe to be a combination of VLAN and
bridge netdevs, and so it's just a question of what exactly you consider
a "bridge" - does it have to be a single netdev or not.

> So your option 1 above is imo the best. The bridge needs to deal
> with skb->mac_len and full L2 header.

Yeah, I guess. We're back to square 1 ;-)

I'm not even sure I understand the bug that Nikolay described, because
br_dev_xmit() does:

        skb_reset_mac_header(skb);
        eth = eth_hdr(skb);
        skb_pull(skb, ETH_HLEN);

so after this we *do* end up with an SKB that has mac_len == ETH_HLEN,
if it was transmitted out the bridge netdev itself, and thus how would
the bug happen?

Thanks,
johannes


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: VLAN tags in mac_len
  2019-06-15 19:19   ` Johannes Berg
@ 2019-06-16  8:51     ` Nikolay Aleksandrov
  2019-06-16 19:44       ` Johannes Berg
  2019-06-17  8:49     ` Daniel Borkmann
  2019-06-17 11:15     ` [Bridge] " Toshiaki Makita
  2 siblings, 1 reply; 10+ messages in thread
From: Nikolay Aleksandrov @ 2019-06-16  8:51 UTC (permalink / raw)
  To: Johannes Berg, Alexei Starovoitov
  Cc: netdev, bridge, roopa, jhs, David Ahern, Zahari Doychev,
	Simon Horman, Toshiaki Makita, Cong Wang, Jiri Pirko

On 15/06/2019 22:19, Johannes Berg wrote:
> Hi Alexei,
> 
> Sorry for messing up your address btw, not sure where I dug that one
> up...
> 
>>>  1) Make the bridge code use skb->mac_len instead of ETH_HLEN. This
>>>     works for this particular case, but breaks some other cases;
>>>     evidently some places exist where skb->mac_len isn't even set to
>>>     ETH_HLEN when a packet gets to the bridge. I don't know right now
>>>     what that was, I think probably somebody who's CC'ed reported that.
>>>
>>>  2) Let tc_act_vlan() just pull ETH_HLEN instead of skb->mac_len, but
>>>     this is rather asymmetric and strange, and while it works for this
>>>     case it may cause confusion elsewhere.
>>>
>>>  2b) Toshiaki said it might be better to make that code *remember*
>>>      mac_len and then use it to push and pull (so not caring about the
>>>      change made by skb_vlan_push()), but that ultimately leads to
>>>      confusion and if you have TC push/pop combinations things just get
>>>      completely out of sync and die
>>>
>>>  3) Make skb_vlan_push()/_pop() just not change mac_len at all. So far
>>>     this also addresses the issue, but it's likely that this will break
>>>     OVS, and I don't know how it'd affect BPF. Quite possibly like TC
>>>     does and is broken, but perhaps not.
>>>
>>>
>>> But now we're stuck. Depending on how you look at it, all of these seem
>>> sort of reasonable, or not.
>>>
>>> Ultimately, the issue seems to be that we couldn't really decide whether
>>> VLAN tags (and probably MPLS tags, for that matter) are covered by
>>> mac_len or not. At least not consistently on ingress and egress.
>>> eth_type_trans() doesn't take them into account, so of course on simple
>>> ingress mac_len will only cover the ETH_HLEN.
>>>
>>> If you have an accelerated tag and then push it into the SKB, it will
>>> *not* be taken into account in mac_len. OTOH, if you have a new tag and
>>> use skb_vlan_push() then it *will* be taken into account.
>>>
>>>
>>> I'm trending towards solution (3), because if we consider other
>>> combinations of VLAN push/pop in TC, I think we can end up in a very
>>> messy situation today. For example, POP/PUSH seems like it should be a
>>> no-op, but it isn't due to the mac_len, *unless* it can use the HW accel
>>> only (i.e. only a single tag).
>>>
>>> I think then to propose such a patch though we'd have to figure out
>>> where the BPF case is, and to keep OVS working probably either add an
>>> argument ("bool adjust_mac_len") to the function signatures, or just do
>>> the adjustments in OVS code after calling them?
>>>
>>>
>>> Any other thoughts?
>>
>> imo skb_vlan_push() should still change mac_len.
>> tc, ovs, bpf use it and expect vlan to be part of L2.
> 
> I'm not sure tc really cares, but it *is* a reasonable argument to make.
> 
> Like I said, whichever way I look at the problem, a different solution
> looks more reasonable ;-)
> 
>> There is nothing between L2 and L3 :)
>> Hence we cannot say that vlan is not part of L2.
>> Hence push/pop vlan must change mac_len, since skb->mac_len
>> is kernel's definition of the length of L2 header.
> 
> I think we're getting to something here now. I actually thought about
> this some more last night, and basically asked myself how I would design
> it without all the legacy baggage.
> 
> I'm certainly not suggesting we should change anything here, but to me
> it was a bit of a clarification to do this and then see where we differ
> in our handling today.
> 
> Thinking along those lines, I sort of ended up with the following scheme
> (just for the skb head, not the frags/fraglist):
> 
>           +------------------+----------------+---------------+
>  headroom | eth | vlan | ... | IP  | TCP      | payload       | tailroom
>           +------------------+----------------+---------------+
> ^ skb->head_ptr
>           ^ skb->l2_ptr
>                              ^ skb->l3_ptr == skb->l2_ptr + skb->l2_len
>                                     ...
>                                               ^ skb->payload_ptr
>                                                               ^ skb->tail
> 
> Now, I deliberately didn't put any "skb->data" here, because what we do
> today is sort of confusing.
> 
> By getting rid of the "multi-use" skb->data in this scheme I think it
> becomes clearer to think about.
> 
> 
> 
> On *egress*, all we really care about is this:
> 
>           +------------------+----------------+---------------+
>  headroom | eth | vlan | ... | IP / TCP       | payload       | tailroom
>           +------------------+----------------+---------------+
>          
> ^ skb->data
>                                          skb->data + skb->len 
> ^
> 
> On *ingress*, however, we hide some of the data (temporarily):
> 
> |--------- headroom ---------|
>           +------------------+----------------+---------------+
>           | eth | vlan | ... | IP / TCP       | payload       | tailroom
>           +------------------+----------------+---------------+
>           ^ skb->data - skb->mac_len
>                              ^ skb->data
>                                          skb->data + skb->len ^
> 
> which is somewhat confusing to me, and sort of causes all these
> problems.
> 
> (It also makes it harder to reason about what data is actually valid in
> the skb, although if mac_len is non-zero then it must be, but it means
> you actually have less headroom and all).
> 
> If instead we just made it like the hypothetical scheme I outlined
> above, then on traversing a layer we'd set the next layer pointer
> appropriately, and then each layer would just use the right pointer:
> 
>  * bridge/ethernet driver/... would use l2_ptr
>  * IP would use the l3_ptr
>  * TCP would use the l4_ptr (didn't put that into the picture)
>  * ...
> 
> Now we wouldn't have a problem with the VLAN tags, because we'd just
> appropriate set/keep all the pointers - bridge doesn't even care where
> l3_ptr is pointing, but for IP it would of course point to beyond the
> VLAN tags.
> 
> (Now, if you wanted to implement this, you probably wouldn't have l2_ptr
> but l2_offset etc. but that's an implementation detail.)
> 

I do like the scheme outlined above, it makes it easier to reason about
all of this, but obviously it'd require quite some changes.

> Now, why am I writing all this? Because I think it points out that
> you're absolutely right - we should treat mac_len as part of the frame
> if we're in anything that cares about L2 like bridge.
> 
>> Now as far as bridge... I think it's unfortunate that linux
>> adopted 'vlan' as a netdevice model and that's where I think
>> the problem is.
> 
> I'm not sure. I don't exactly know where the problem is if we fix bridge
> according to the patch (1) above, which, btw, was discussed before:
> https://lore.kernel.org/netdev/20190113135939.8970-1-zahari.doychev@linux.com/
> 
> Back then, Nikolay (whom I forgot to CC, fixed now) said:
> 
>> It breaks connectivity between bridge and
>> members when vlans are used. The host generated packets going out of the bridge
>> have mac_len = 0.
> 
> Which probably indicates that we're not even consistent with the egress
> scheme I pointed out above, probably because we *also* have
> hard_header_len?
> 

IIRC, mac_len is only set on Rx, while on Tx it usually isn't. More below.

> Maybe somewhere early in the egress path we should set skb->mac_len to
> dev->hard_header_len, and then use skb->mac_len consistently, and
> consider that part of the skb (and not arbitrarily consider ETH_HLEN to
> be part of the skb in bridge).
> 
> (This almost tempts me to actually try to implement the hypothetical SKB
> scheme I described above, just so it's easier to understand what part
> does what ... and to find where the issues like this occur)
> 

Interesting idea.

>> Typical bridge in the networking industry is a device that
>> does forwarding based on L2. Which includes vlans.
>> And imo that's the most appropriate way of configuring and thinking
>> about bridge functionality.
>> Whereas in the kernel there is a 'vlan' netdevice that 'eats'
>> vlan tag and pretends that the rest is the same.
>> So linux bridge kinda doesn't need to be vlan aware.
>> CONFIG_BRIDGE_VLAN_FILTERING was the right step, but I haven't
>> seen it being used and I'm not sure about state of things there.
> 
> I think that ends up being a question of semantics. You can consider an
> "industry bridge" that you describe to be a combination of VLAN and
> bridge netdevs, and so it's just a question of what exactly you consider
> a "bridge" - does it have to be a single netdev or not.
> 
>> So your option 1 above is imo the best. The bridge needs to deal
>> with skb->mac_len and full L2 header.
> 
> Yeah, I guess. We're back to square 1 ;-)
> 
> I'm not even sure I understand the bug that Nikolay described, because
> br_dev_xmit() does:
> 
>         skb_reset_mac_header(skb);
>         eth = eth_hdr(skb);
>         skb_pull(skb, ETH_HLEN);
> 
> so after this we *do* end up with an SKB that has mac_len == ETH_HLEN,
> if it was transmitted out the bridge netdev itself, and thus how would
> the bug happen?
> 

I said *mac_len*. :) The above sets mac_header, at that point you'll have
the following values: mac_len = 0, mac_header_len = 14 (skb_mac_header_len
uses network_header - mac_header which is set there), but that is easy
to overcome and if you do go down the path of consistently using and updating
mac_len it should work.

Cheers,
 Nik

> Thanks,
> johannes
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: VLAN tags in mac_len
  2019-06-16  8:51     ` Nikolay Aleksandrov
@ 2019-06-16 19:44       ` Johannes Berg
  0 siblings, 0 replies; 10+ messages in thread
From: Johannes Berg @ 2019-06-16 19:44 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Alexei Starovoitov
  Cc: netdev, bridge, roopa, jhs, David Ahern, Zahari Doychev,
	Simon Horman, Toshiaki Makita, Cong Wang, Jiri Pirko

On Sun, 2019-06-16 at 11:51 +0300, Nikolay Aleksandrov wrote:

> > Thinking along those lines, I sort of ended up with the following scheme
> > (just for the skb head, not the frags/fraglist):
> > 
> >           +------------------+----------------+---------------+
> >  headroom | eth | vlan | ... | IP  | TCP      | payload       | tailroom
> >           +------------------+----------------+---------------+
> > ^ skb->head_ptr
> >           ^ skb->l2_ptr
> >                              ^ skb->l3_ptr == skb->l2_ptr + skb->l2_len
> >                                     ...
> >                                               ^ skb->payload_ptr
> >                                                               ^ skb->tail
[...]

> > (Now, if you wanted to implement this, you probably wouldn't have l2_ptr
> > but l2_offset etc. but that's an implementation detail.)
> > 
> 
> I do like the scheme outlined above, it makes it easier to reason about
> all of this, but obviously it'd require quite some changes.

Yeah. I'm not really ready to suggest something as radical.

But as you found out below, I even got confused *again* while
*carefully* looking at this, and messed up mac_len vs. mac_header_len.

In fact, even looking at it now, I'm not entirely sure I see the
difference. Why do we need both? They have different implementation
semantics, but shouldn't they sort of be the same?

> > > It breaks connectivity between bridge and
> > > members when vlans are used. The host generated packets going out of the bridge
> > > have mac_len = 0.
> > 
> > Which probably indicates that we're not even consistent with the egress
> > scheme I pointed out above, probably because we *also* have
> > hard_header_len?
> > 
> 
> IIRC, mac_len is only set on Rx, while on Tx it usually isn't. More below.

Yes, looks like.

> > I'm not even sure I understand the bug that Nikolay described, because
> > br_dev_xmit() does:
> > 
> >         skb_reset_mac_header(skb);
> >         eth = eth_hdr(skb);
> >         skb_pull(skb, ETH_HLEN);
> > 
> > so after this we *do* end up with an SKB that has mac_len == ETH_HLEN,
> > if it was transmitted out the bridge netdev itself, and thus how would
> > the bug happen?
> > 
> 
> I said *mac_len*. :) 

Yes, I confused myself here.

> The above sets mac_header, at that point you'll have
> the following values: mac_len = 0, mac_header_len = 14 (skb_mac_header_len
> uses network_header - mac_header which is set there), but that is easy
> to overcome and if you do go down the path of consistently using and updating
> mac_len it should work.

Yeah, so basically all we really need is to actually call
skb_reset_mac_len() in addition to skb_reset_mac_header().

Which, is, "slightly" confusing (to say the least) - why are mac_len and
mac_header two completely separate concepts? It almost seems like they
should be two sides of the same coin (len/ptr) but we also have
mac_header_len...

Oh well.

So maybe we should go back to square 1 and resend the patches Zahari had
originally, but with the added skb_reset_mac_len()?

johannes


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: VLAN tags in mac_len
  2019-06-15 19:19   ` Johannes Berg
  2019-06-16  8:51     ` Nikolay Aleksandrov
@ 2019-06-17  8:49     ` Daniel Borkmann
  2019-06-28 10:55       ` Johannes Berg
  2019-06-17 11:15     ` [Bridge] " Toshiaki Makita
  2 siblings, 1 reply; 10+ messages in thread
From: Daniel Borkmann @ 2019-06-17  8:49 UTC (permalink / raw)
  To: Johannes Berg, Alexei Starovoitov
  Cc: netdev, bridge, nikolay, roopa, jhs, David Ahern, Zahari Doychev,
	Simon Horman, Toshiaki Makita, Cong Wang, Jiri Pirko

On 06/15/2019 09:19 PM, Johannes Berg wrote:
> Hi Alexei,
> 
> Sorry for messing up your address btw, not sure where I dug that one
> up...
> 
>>>  1) Make the bridge code use skb->mac_len instead of ETH_HLEN. This
>>>     works for this particular case, but breaks some other cases;
>>>     evidently some places exist where skb->mac_len isn't even set to
>>>     ETH_HLEN when a packet gets to the bridge. I don't know right now
>>>     what that was, I think probably somebody who's CC'ed reported that.
>>>
>>>  2) Let tc_act_vlan() just pull ETH_HLEN instead of skb->mac_len, but
>>>     this is rather asymmetric and strange, and while it works for this
>>>     case it may cause confusion elsewhere.
>>>
>>>  2b) Toshiaki said it might be better to make that code *remember*
>>>      mac_len and then use it to push and pull (so not caring about the
>>>      change made by skb_vlan_push()), but that ultimately leads to
>>>      confusion and if you have TC push/pop combinations things just get
>>>      completely out of sync and die
>>>
>>>  3) Make skb_vlan_push()/_pop() just not change mac_len at all. So far
>>>     this also addresses the issue, but it's likely that this will break
>>>     OVS, and I don't know how it'd affect BPF. Quite possibly like TC
>>>     does and is broken, but perhaps not.
>>>
>>>
>>> But now we're stuck. Depending on how you look at it, all of these seem
>>> sort of reasonable, or not.
>>>
>>> Ultimately, the issue seems to be that we couldn't really decide whether
>>> VLAN tags (and probably MPLS tags, for that matter) are covered by
>>> mac_len or not. At least not consistently on ingress and egress.
>>> eth_type_trans() doesn't take them into account, so of course on simple
>>> ingress mac_len will only cover the ETH_HLEN.
>>>
>>> If you have an accelerated tag and then push it into the SKB, it will
>>> *not* be taken into account in mac_len. OTOH, if you have a new tag and
>>> use skb_vlan_push() then it *will* be taken into account.
>>>
>>>
>>> I'm trending towards solution (3), because if we consider other
>>> combinations of VLAN push/pop in TC, I think we can end up in a very
>>> messy situation today. For example, POP/PUSH seems like it should be a
>>> no-op, but it isn't due to the mac_len, *unless* it can use the HW accel
>>> only (i.e. only a single tag).
>>>
>>> I think then to propose such a patch though we'd have to figure out
>>> where the BPF case is, and to keep OVS working probably either add an
>>> argument ("bool adjust_mac_len") to the function signatures, or just do
>>> the adjustments in OVS code after calling them?
>>>
>>>
>>> Any other thoughts?
>>
>> imo skb_vlan_push() should still change mac_len.
>> tc, ovs, bpf use it and expect vlan to be part of L2.
> 
> I'm not sure tc really cares, but it *is* a reasonable argument to make.
> 
> Like I said, whichever way I look at the problem, a different solution
> looks more reasonable ;-)

Agree with Alexei that the approach which would be least confusing and/or
potentially causing regressions might be to go for 1). tc *does* care at least
for the *BPF* case. In sch_clsact we have the ingress and egress hook where we
can attach to and programs don't need to care where they are being attached
since for both cases they see skb->data starting at eth header! In order to
do this, we do a __skb_push()/__skb_pull() dance based on skb->mac_len depending
from where we come. This also means that if a program pushed/popped a vlan tag,
this still must be correct wrt expectations for the receive side. It is essential
that there is consistent behavior on skb->mac_len given skbs can also be redirected
from TX->RX or RX->TX(->RX), so that we don't pull to a wrong offset next time.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bridge] VLAN tags in mac_len
  2019-06-15 19:19   ` Johannes Berg
  2019-06-16  8:51     ` Nikolay Aleksandrov
  2019-06-17  8:49     ` Daniel Borkmann
@ 2019-06-17 11:15     ` Toshiaki Makita
  2019-06-28 11:02       ` Johannes Berg
  2 siblings, 1 reply; 10+ messages in thread
From: Toshiaki Makita @ 2019-06-17 11:15 UTC (permalink / raw)
  To: Johannes Berg, Alexei Starovoitov
  Cc: Toshiaki Makita, Jiri Pirko, Nikolay Aleksandrov, netdev, roopa,
	bridge, Zahari Doychev, jhs, Simon Horman, David Ahern,
	Cong Wang

On 2019/06/16 4:19, Johannes Berg wrote:
> Hi Alexei,
> 
> Sorry for messing up your address btw, not sure where I dug that one
> up...
> 
>>>   1) Make the bridge code use skb->mac_len instead of ETH_HLEN. This
>>>      works for this particular case, but breaks some other cases;
>>>      evidently some places exist where skb->mac_len isn't even set to
>>>      ETH_HLEN when a packet gets to the bridge. I don't know right now
>>>      what that was, I think probably somebody who's CC'ed reported that.
>>>
>>>   2) Let tc_act_vlan() just pull ETH_HLEN instead of skb->mac_len, but
>>>      this is rather asymmetric and strange, and while it works for this
>>>      case it may cause confusion elsewhere.
>>>
>>>   2b) Toshiaki said it might be better to make that code *remember*
>>>       mac_len and then use it to push and pull (so not caring about the
>>>       change made by skb_vlan_push()), but that ultimately leads to
>>>       confusion and if you have TC push/pop combinations things just get
>>>       completely out of sync and die
>>>
>>>   3) Make skb_vlan_push()/_pop() just not change mac_len at all. So far
>>>      this also addresses the issue, but it's likely that this will break
>>>      OVS, and I don't know how it'd affect BPF. Quite possibly like TC
>>>      does and is broken, but perhaps not.
>>>
>>>
>>> But now we're stuck. Depending on how you look at it, all of these seem
>>> sort of reasonable, or not.
>>>
>>> Ultimately, the issue seems to be that we couldn't really decide whether
>>> VLAN tags (and probably MPLS tags, for that matter) are covered by
>>> mac_len or not. At least not consistently on ingress and egress.
>>> eth_type_trans() doesn't take them into account, so of course on simple
>>> ingress mac_len will only cover the ETH_HLEN.
>>>
>>> If you have an accelerated tag and then push it into the SKB, it will
>>> *not* be taken into account in mac_len. OTOH, if you have a new tag and
>>> use skb_vlan_push() then it *will* be taken into account.
>>>
>>>
>>> I'm trending towards solution (3), because if we consider other
>>> combinations of VLAN push/pop in TC, I think we can end up in a very
>>> messy situation today. For example, POP/PUSH seems like it should be a
>>> no-op, but it isn't due to the mac_len, *unless* it can use the HW accel
>>> only (i.e. only a single tag).
>>>
>>> I think then to propose such a patch though we'd have to figure out
>>> where the BPF case is, and to keep OVS working probably either add an
>>> argument ("bool adjust_mac_len") to the function signatures, or just do
>>> the adjustments in OVS code after calling them?
>>>
>>>
>>> Any other thoughts?
>>
>> imo skb_vlan_push() should still change mac_len.
>> tc, ovs, bpf use it and expect vlan to be part of L2.
> 
> I'm not sure tc really cares, but it *is* a reasonable argument to make.
> 
> Like I said, whichever way I look at the problem, a different solution
> looks more reasonable ;-)
> 
>> There is nothing between L2 and L3 :)
>> Hence we cannot say that vlan is not part of L2.
>> Hence push/pop vlan must change mac_len, since skb->mac_len
>> is kernel's definition of the length of L2 header.
> 
> I think we're getting to something here now. I actually thought about
> this some more last night, and basically asked myself how I would design
> it without all the legacy baggage.
> 
> I'm certainly not suggesting we should change anything here, but to me
> it was a bit of a clarification to do this and then see where we differ
> in our handling today.
> 
> Thinking along those lines, I sort of ended up with the following scheme
> (just for the skb head, not the frags/fraglist):
> 
>            +------------------+----------------+---------------+
>   headroom | eth | vlan | ... | IP  | TCP      | payload       | tailroom
>            +------------------+----------------+---------------+
> ^ skb->head_ptr
>            ^ skb->l2_ptr
>                               ^ skb->l3_ptr == skb->l2_ptr + skb->l2_len
>                                      ...
>                                                ^ skb->payload_ptr
>                                                                ^ skb->tail
> 
> Now, I deliberately didn't put any "skb->data" here, because what we do
> today is sort of confusing.
> 
> By getting rid of the "multi-use" skb->data in this scheme I think it
> becomes clearer to think about.
> 
> 
> 
> On *egress*, all we really care about is this:
> 
>            +------------------+----------------+---------------+
>   headroom | eth | vlan | ... | IP / TCP       | payload       | tailroom
>            +------------------+----------------+---------------+
>           
> ^ skb->data
>                                           skb->data + skb->len
> ^
> 
> On *ingress*, however, we hide some of the data (temporarily):
> 
> |--------- headroom ---------|
>            +------------------+----------------+---------------+
>            | eth | vlan | ... | IP / TCP       | payload       | tailroom
>            +------------------+----------------+---------------+
>            ^ skb->data - skb->mac_len
>                               ^ skb->data
>                                           skb->data + skb->len ^
> 
> which is somewhat confusing to me, and sort of causes all these
> problems.
> 
> (It also makes it harder to reason about what data is actually valid in
> the skb, although if mac_len is non-zero then it must be, but it means
> you actually have less headroom and all).
> 
> If instead we just made it like the hypothetical scheme I outlined
> above, then on traversing a layer we'd set the next layer pointer
> appropriately, and then each layer would just use the right pointer:
> 
>   * bridge/ethernet driver/... would use l2_ptr
>   * IP would use the l3_ptr
>   * TCP would use the l4_ptr (didn't put that into the picture)
>   * ...
> 
> Now we wouldn't have a problem with the VLAN tags, because we'd just
> appropriate set/keep all the pointers - bridge doesn't even care where
> l3_ptr is pointing, but for IP it would of course point to beyond the
> VLAN tags.
> 
> (Now, if you wanted to implement this, you probably wouldn't have l2_ptr
> but l2_offset etc. but that's an implementation detail.)
> 
> Now, why am I writing all this? Because I think it points out that
> you're absolutely right - we should treat mac_len as part of the frame
> if we're in anything that cares about L2 like bridge.
> 
>> Now as far as bridge... I think it's unfortunate that linux
>> adopted 'vlan' as a netdevice model and that's where I think
>> the problem is.
> 
> I'm not sure. I don't exactly know where the problem is if we fix bridge
> according to the patch (1) above, which, btw, was discussed before:
> https://lore.kernel.org/netdev/20190113135939.8970-1-zahari.doychev@linux.com/
> 
> Back then, Nikolay (whom I forgot to CC, fixed now) said:
> 
>> It breaks connectivity between bridge and
>> members when vlans are used. The host generated packets going out of the bridge
>> have mac_len = 0.
> 
> Which probably indicates that we're not even consistent with the egress
> scheme I pointed out above, probably because we *also* have
> hard_header_len?
> 
> Maybe somewhere early in the egress path we should set skb->mac_len to
> dev->hard_header_len, and then use skb->mac_len consistently, and
> consider that part of the skb (and not arbitrarily consider ETH_HLEN to
> be part of the skb in bridge).
> 
> (This almost tempts me to actually try to implement the hypothetical SKB
> scheme I described above, just so it's easier to understand what part
> does what ... and to find where the issues like this occur)
> 
>> Typical bridge in the networking industry is a device that
>> does forwarding based on L2. Which includes vlans.
>> And imo that's the most appropriate way of configuring and thinking
>> about bridge functionality.
>> Whereas in the kernel there is a 'vlan' netdevice that 'eats'
>> vlan tag and pretends that the rest is the same.
>> So linux bridge kinda doesn't need to be vlan aware.
>> CONFIG_BRIDGE_VLAN_FILTERING was the right step, but I haven't
>> seen it being used and I'm not sure about state of things there.
> 
> I think that ends up being a question of semantics. You can consider an
> "industry bridge" that you describe to be a combination of VLAN and
> bridge netdevs, and so it's just a question of what exactly you consider
> a "bridge" - does it have to be a single netdev or not.
> 
>> So your option 1 above is imo the best. The bridge needs to deal
>> with skb->mac_len and full L2 header.
> 
> Yeah, I guess. We're back to square 1 ;-)

I'll try to explain the problem I see, which cannot be fixed by option 1...
The bug is in tcf_vlan_act(), and mainly in skb->data, not in mac_len.

Consider about vlan packets from NIC, but non-hw-accelerated, where
vlan devices are configured to receive them.

When __netif_receive_skb_core() is called, skb is like this.

+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
       ^
      data

skb->data is at the beginning of the vlan header.
This is reasonable because we did not process the vlan tag at this point.

Then after vlan_do_receive() (receive the skb on a vlan device), the skb is like this.

+-----+--------
| eth | TCP/IP
+-----+--------
       ^
      data

Or if reorder_hdr is off (which does not remove vlan tags when receiving on vlan devices),

+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
              ^
             data

Relying on this mechanism, we are currently able to handle multiple vlan tags.

For example if we have 2 tags,

- On __netif_receive_skb_core() invocation

+-----+------+------+--------
| eth | vlan | vlan | TCP/IP
+-----+------+------+--------
       ^
      data

- After first vlan_do_receive()

+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
       ^
      data

Or if reorder_hdr is off,

+-----+------+------+--------
| eth | vlan | vlan | TCP/IP
+-----+------+------+--------
              ^
             data

When we process one tag, the data goes forward by one tag.

Now looking at TC vlan case...

After it inserts two tags, the skb looks like:

(The first tag is in vlan_tci)
+-----+------+--------
| eth | vlan | TCP/IP
+-----+------+--------
              ^
             data

The data pointer went forward before we process it.
This is apparently wrong. I think we don't want to (or cannot?) handle cases like this
after tcf_vlan_act(). This is why I said we should remember mac_len there.

So, my opinion is:
On ingress, data pointer can be at the end of vlan header and mac_len probably should
include vlan tag length, but only after the vlan tag is processed.

Bridge may need to handle mac_len that is not equal to ETH_HLEN but to me it's a
different problem.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: VLAN tags in mac_len
  2019-06-17  8:49     ` Daniel Borkmann
@ 2019-06-28 10:55       ` Johannes Berg
  0 siblings, 0 replies; 10+ messages in thread
From: Johannes Berg @ 2019-06-28 10:55 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: netdev, bridge, nikolay, roopa, jhs, David Ahern, Zahari Doychev,
	Simon Horman, Toshiaki Makita, Cong Wang, Jiri Pirko

On Mon, 2019-06-17 at 10:49 +0200, Daniel Borkmann wrote:

> > > > Any other thoughts?
> > > 
> > > imo skb_vlan_push() should still change mac_len.
> > > tc, ovs, bpf use it and expect vlan to be part of L2.
> > 
> > I'm not sure tc really cares, but it *is* a reasonable argument to make.
> > 
> > Like I said, whichever way I look at the problem, a different solution
> > looks more reasonable ;-)
> 
> Agree with Alexei that the approach which would be least confusing and/or
> potentially causing regressions might be to go for 1). 

Toshiaki explained already that (1) [changing the bridge code] isn't
sufficient, but Zahari is going to send patches to do (1) since that
lets use disentangle the bridge code from the rest of the discussion,
basically making it able to handle anything we throw at it.

> tc *does* care at least
> for the *BPF* case. In sch_clsact we have the ingress and egress hook where we
> can attach to and programs don't need to care where they are being attached
> since for both cases they see skb->data starting at eth header! In order to
> do this, we do a __skb_push()/__skb_pull() dance based on skb->mac_len depending
> from where we come. This also means that if a program pushed/popped a vlan tag,
> this still must be correct wrt expectations for the receive side. It is essential
> that there is consistent behavior on skb->mac_len given skbs can also be redirected
> from TX->RX or RX->TX(->RX), so that we don't pull to a wrong offset next time.

As somebody (also Toshiaki I think) explained, this is already not right
and tc mirred is broken.

So I still think we have a semantic problem here with mac_len and TX/RX,
but it's not something I feel I'm competent enough to really address.

johannes


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bridge] VLAN tags in mac_len
  2019-06-17 11:15     ` [Bridge] " Toshiaki Makita
@ 2019-06-28 11:02       ` Johannes Berg
  2019-07-02 10:00         ` Toshiaki Makita
  0 siblings, 1 reply; 10+ messages in thread
From: Johannes Berg @ 2019-06-28 11:02 UTC (permalink / raw)
  To: Toshiaki Makita, Alexei Starovoitov
  Cc: Toshiaki Makita, Jiri Pirko, Nikolay Aleksandrov, netdev, roopa,
	bridge, Zahari Doychev, jhs, Simon Horman, David Ahern,
	Cong Wang

On Mon, 2019-06-17 at 20:15 +0900, Toshiaki Makita wrote:

> I'll try to explain the problem I see, which cannot be fixed by option 1...
> The bug is in tcf_vlan_act(), and mainly in skb->data, not in mac_len.
> 
> Consider about vlan packets from NIC, but non-hw-accelerated, where
> vlan devices are configured to receive them.
> 
> When __netif_receive_skb_core() is called, skb is like this.
> 
> +-----+------+--------
> > eth | vlan | TCP/IP
> 
> +-----+------+--------
>        ^
>       data
> 
> skb->data is at the beginning of the vlan header.

Right.

> This is reasonable because we did not process the vlan tag at this point.

I think with this simple sentence you just threw a whole new semantic
issue into the mix, one that I at least hadn't considered.

However, it's not clear to me whether we should consider a tag as
processed or not when we push it.

In a sense, this means we should have two different VLAN tag push
options - considering it processed or unprocessed. Or maybe it should
always be considered unprocessed, but that's not what we do today.

> Then after vlan_do_receive() (receive the skb on a vlan device), the skb is like this.
> 
> +-----+--------
> > eth | TCP/IP
> 
> +-----+--------
>        ^
>       data
> 
> Or if reorder_hdr is off (which does not remove vlan tags when receiving on vlan devices),
> 
> +-----+------+--------
> > eth | vlan | TCP/IP
> 
> +-----+------+--------
>               ^
>              data
> 
> Relying on this mechanism, we are currently able to handle multiple vlan tags.
> 
> For example if we have 2 tags,
> 
> - On __netif_receive_skb_core() invocation
> 
> +-----+------+------+--------
> > eth | vlan | vlan | TCP/IP
> 
> +-----+------+------+--------
>        ^
>       data
> 
> - After first vlan_do_receive()
> 
> +-----+------+--------
> > eth | vlan | TCP/IP
> 
> +-----+------+--------
>        ^
>       data
> 
> Or if reorder_hdr is off,
> 
> +-----+------+------+--------
> > eth | vlan | vlan | TCP/IP
> 
> +-----+------+------+--------
>               ^
>              data
> 
> When we process one tag, the data goes forward by one tag.

Right, that's a very good point.

> Now looking at TC vlan case...
> 
> After it inserts two tags, the skb looks like:
> 
> (The first tag is in vlan_tci)
> +-----+------+--------
> > eth | vlan | TCP/IP
> 
> +-----+------+--------
>               ^
>              data
> 
> The data pointer went forward before we process it.
> This is apparently wrong. I think we don't want to (or cannot?) handle cases like this
> after tcf_vlan_act(). This is why I said we should remember mac_len there.

Right, makes a lot of sense.

If you consider a tc VLAN pop, you'd argue that it should pop the next
unprocessed tag I guess, since if it was processed then it doesn't
really exist any more (semantically, you still see it if reorder_hdr is
off), right?

> So, my opinion is:
> On ingress, data pointer can be at the end of vlan header and mac_len probably should
> include vlan tag length, but only after the vlan tag is processed.

You're basically arguing for option (3), I think, making VLAN push/pop
not manipulate mac_len since they can just push/pop *unprocessed* tags,
right?

I fear this will cause all kinds of trouble in other code. Perhaps we
need to make this processed/unprocessed state more explicit.

> Bridge may need to handle mac_len that is not equal to ETH_HLEN but to me it's a
> different problem.

Yes. Like I just said to Daniel, I think we should make bridge handle
mac_len so that we can just exclude it from this whole discussion.
Regardless of the mac_len and processed/unprocessed tags, it would just
work as expected.

johannes


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [Bridge] VLAN tags in mac_len
  2019-06-28 11:02       ` Johannes Berg
@ 2019-07-02 10:00         ` Toshiaki Makita
  0 siblings, 0 replies; 10+ messages in thread
From: Toshiaki Makita @ 2019-07-02 10:00 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Alexei Starovoitov, Toshiaki Makita, Jiri Pirko,
	Nikolay Aleksandrov, netdev, roopa, bridge, Zahari Doychev, jhs,
	Simon Horman, David Ahern, Cong Wang

On 2019/06/28 20:02, Johannes Berg wrote:
> On Mon, 2019-06-17 at 20:15 +0900, Toshiaki Makita wrote:
>> I'll try to explain the problem I see, which cannot be fixed by option 1...
>> The bug is in tcf_vlan_act(), and mainly in skb->data, not in mac_len.
>>
>> Consider about vlan packets from NIC, but non-hw-accelerated, where
>> vlan devices are configured to receive them.
>>
>> When __netif_receive_skb_core() is called, skb is like this.
>>
>> +-----+------+--------
>>> eth | vlan | TCP/IP
>>
>> +-----+------+--------
>>         ^
>>        data
>>
>> skb->data is at the beginning of the vlan header.
> 
> Right.
> 
>> This is reasonable because we did not process the vlan tag at this point.
> 
> I think with this simple sentence you just threw a whole new semantic
> issue into the mix, one that I at least hadn't considered.
>
> However, it's not clear to me whether we should consider a tag as
> processed or not when we push it.

It's clear that we always insert a tag as unprocessed in a single tag case.
The tag is inserted as hw-offloaded one, and hw-offloaded one is treated
as an unprocessed tag in __netif_receive_skb_core().
The single tag is the most common usage I think, so unprocessed should be
the expected behavior.

Also, tc vlan act was originally introduced to replace OVS vlan action.
It's in fact used as HW offload path of ovs-vswitchd, and should behave the same as
OVS push_vlan action. And push_vlan action inserts a tag as unprocessed one.

> In a sense, this means we should have two different VLAN tag push
> options - considering it processed or unprocessed. Or maybe it should
> always be considered unprocessed, but that's not what we do today.
>
>> Then after vlan_do_receive() (receive the skb on a vlan device), the skb is like this.
>>
>> +-----+--------
>>> eth | TCP/IP
>>
>> +-----+--------
>>         ^
>>        data
>>
>> Or if reorder_hdr is off (which does not remove vlan tags when receiving on vlan devices),
>>
>> +-----+------+--------
>>> eth | vlan | TCP/IP
>>
>> +-----+------+--------
>>                ^
>>               data
>>
>> Relying on this mechanism, we are currently able to handle multiple vlan tags.
>>
>> For example if we have 2 tags,
>>
>> - On __netif_receive_skb_core() invocation
>>
>> +-----+------+------+--------
>>> eth | vlan | vlan | TCP/IP
>>
>> +-----+------+------+--------
>>         ^
>>        data
>>
>> - After first vlan_do_receive()
>>
>> +-----+------+--------
>>> eth | vlan | TCP/IP
>>
>> +-----+------+--------
>>         ^
>>        data
>>
>> Or if reorder_hdr is off,
>>
>> +-----+------+------+--------
>>> eth | vlan | vlan | TCP/IP
>>
>> +-----+------+------+--------
>>                ^
>>               data
>>
>> When we process one tag, the data goes forward by one tag.
> 
> Right, that's a very good point.
> 
>> Now looking at TC vlan case...
>>
>> After it inserts two tags, the skb looks like:
>>
>> (The first tag is in vlan_tci)
>> +-----+------+--------
>>> eth | vlan | TCP/IP
>>
>> +-----+------+--------
>>                ^
>>               data
>>
>> The data pointer went forward before we process it.
>> This is apparently wrong. I think we don't want to (or cannot?) handle cases like this
>> after tcf_vlan_act(). This is why I said we should remember mac_len there.
> 
> Right, makes a lot of sense.
> 
> If you consider a tc VLAN pop, you'd argue that it should pop the next
> unprocessed tag I guess, since if it was processed then it doesn't
> really exist any more (semantically, you still see it if reorder_hdr is
> off), right?

Right.

>> So, my opinion is:
>> On ingress, data pointer can be at the end of vlan header and mac_len probably should
>> include vlan tag length, but only after the vlan tag is processed.
> 
> You're basically arguing for option (3), I think, making VLAN push/pop
> not manipulate mac_len since they can just push/pop *unprocessed* tags,
> right?

Ah, true, on the second thought (2b) is not an appropriate fix but (3) is.
(3) more correctly emulates already tagged packets from wire so it should
cause least confusion.

> I fear this will cause all kinds of trouble in other code. Perhaps we
> need to make this processed/unprocessed state more explicit.

But (3) makes mac_len the same as the already tagged packets from NICs.
If other code cannot handle the packets correctly, they need to be fixed anyway.

As you explained OVS MPLS seems to rely on mac_len adjustment by skb_vlan_push(), but
it means the OVS MPLS code cannot correctly handle double-tagged packets from NICs,
so it needs a fix anyway (use __vlan_get_protocol() to get the real mac_len?).

>> Bridge may need to handle mac_len that is not equal to ETH_HLEN but to me it's a
>> different problem.
> 
> Yes. Like I just said to Daniel, I think we should make bridge handle
> mac_len so that we can just exclude it from this whole discussion.
> Regardless of the mac_len and processed/unprocessed tags, it would just
> work as expected.

That's OK, it should help packets from vlan devices with reorder_hdr off.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-07-02 10:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-14 10:18 VLAN tags in mac_len Johannes Berg
2019-06-15 15:19 ` Alexei Starovoitov
2019-06-15 19:19   ` Johannes Berg
2019-06-16  8:51     ` Nikolay Aleksandrov
2019-06-16 19:44       ` Johannes Berg
2019-06-17  8:49     ` Daniel Borkmann
2019-06-28 10:55       ` Johannes Berg
2019-06-17 11:15     ` [Bridge] " Toshiaki Makita
2019-06-28 11:02       ` Johannes Berg
2019-07-02 10:00         ` Toshiaki Makita

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).