All of lore.kernel.org
 help / color / mirror / Atom feed
* DSA: some questions regarding TX forwarding offload
@ 2021-10-05  8:54 Alvin Šipraga
  2021-10-05 10:12 ` Vladimir Oltean
  0 siblings, 1 reply; 16+ messages in thread
From: Alvin Šipraga @ 2021-10-05  8:54 UTC (permalink / raw)
  To: netdev; +Cc: Vladimir Oltean, Florian Fainelli, Andrew Lunn

Hi,

I am trying to implement TX forwarding offload for my in-progress 
rtl8365mb DSA driver. I have some questions which I could use some 
clarification on. They might be specific to my hardware, which is also 
OK, but then some advice on how to proceed would be helpful.

Q1. Can the tagging driver somehow retrieve a port mask from the DSA 
switch driver in order to assemble the CPU->switch tag on xmit? Is there 
some infrastructure in place to share such data between the two drivers?

Q2. Is it expected by DSA that two isolated ports (e.g. two ports 
belonging to two separate bridges) can be members of the same VLAN 
without issue?

Background: The RTL8365MB's CPU tag includes an ALLOW field followed by 
a "port mask" field. If ALLOW=1 then - based on the VLAN tag in the 
frame and the port mask - the switch will automatically replicate the 
frame and egress it on all suitable ports, but only ports which are in 
the port mask.

If ALLOW=1, and if the port mask is all zeroes or all ones, then the 
switch will make its forwarding decision based only on the VLAN tag in 
the frame (if any). Now consider a configuration as follows:

         br0            br1
          +              +
          |              |
      +---+---+      +---+---+
      |       |      |       |
     swp0    swp1   swp2    swp3

... with both bridges containing switch port(s) belonging to the same 
VLAN n. How should I prevent - with TX forwarding offload - a packet 
with VID=n from being egressed on a port on the opposite bridge which 
belongs to the same VLAN n?

In the above scenario, either I must refine the CPU tag "port mask" 
(hence Q1), or I must restrict the hardware configuration in some way 
(hence Q2), or I must conclude that TX forwarding offload is not 
possible with these constraints, or there is some alternative solution 
or nuance that I have not thought of.

Thank you in advance,

	Alvin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-05  8:54 DSA: some questions regarding TX forwarding offload Alvin Šipraga
@ 2021-10-05 10:12 ` Vladimir Oltean
  2021-10-05 12:06   ` Alvin Šipraga
  2021-10-06  2:50   ` Florian Fainelli
  0 siblings, 2 replies; 16+ messages in thread
From: Vladimir Oltean @ 2021-10-05 10:12 UTC (permalink / raw)
  To: Alvin Šipraga; +Cc: netdev, Florian Fainelli, Andrew Lunn

On Tue, Oct 05, 2021 at 08:54:34AM +0000, Alvin Šipraga wrote:
> Hi,
>
> I am trying to implement TX forwarding offload for my in-progress
> rtl8365mb DSA driver. I have some questions which I could use some
> clarification on. They might be specific to my hardware, which is also
> OK, but then some advice on how to proceed would be helpful.
>
> Q1. Can the tagging driver somehow retrieve a port mask from the DSA
> switch driver in order to assemble the CPU->switch tag on xmit? Is there
> some infrastructure in place to share such data between the two drivers?

Nope. DSA does not maintain a cache of FDB entries retrieved from
hardware. So it cannot deduce the destination port mask from the {MAC DA, VLAN ID}
of the skb. The software FDB maintained by the bridge driver is all there is.
Based on the software FDB, which the bridge still looks up, an skb->dev
is selected. All that the TX forwarding offload feature is is a way to
remove software packet replication (skb_clone) for the case where the
packet should have been flooded, or multicast, by the software bridge
towards multiple skb->dev entities belonging to the same hardware domain.

To achieve the desired replication in hardware with DSA, the idea is to
look up the FDB once more, but this time let the switch do it in hardware.

I see it similar to the quote "life is like a box of chocolates, you
never know what you're going to get". Meaning that ok, you don't know
exactly from software on which egress ports your packet is going to
land, but the result shouldn't be too far off from the expectation in
any case:

(a) hardware FDB and software FDB are in sync for the given {MAC DA, VLAN ID}:
    packet will be forwarded in hardware towards the same port as it
    would have without the TX forwarding offload feature

(b) FDB entry exists in software, but not in hardware: packet will be
    sent once by the bridge, and will be flooded by the hardware towards
    all bridge ports belonging to the switch's hardware domain

(c) FDB entry exists in hardware, but not in software: packet will be
    "flooded" by the software bridge, but the switch will deliver it
    precisely. Flooding is therefore avoided.

(d) FDB entry does not exist in hardware or in software: see case (a)

> Q2. Is it expected by DSA that two isolated ports (e.g. two ports
> belonging to two separate bridges) can be members of the same VLAN
> without issue?

It depends.

If you mean to ask: "given the way in which the DSA core is structured,
what do you expect to happen?", the answer is that it won't work without leaks.

If you mean to ask: "what is the intention going forward?", the answer
is that it should be made to work, and you should employ hardware specific
mechanisms to avoid those leaks between VLAN N of br0 and VLAN N of br1,
or deny the simultaneous existence of a VLAN-aware br0 and a VLAN-aware br1.

For example, right now you should at least impose the latter restriction,
see for example sja1105_prechangeupper().

In the long term, you should get acquainted with your hardware's FDB
isolation mechanism, because there will exist an API through which DSA
will tell you "this switchdev object (FDB, MDB, VLAN) came from this
bridge, which I've associated for you with a unique integer, just so you
know when you program it to hardware, I might come back with an
identical switchdev object later but on a different port, and belonging
to a different bridge":
https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/

The most flexible FDB isolation mechanism I've seen so far is in
mv88e6xxx, you can freely associate a VID with a FID (of which there are
4K entries) and FDB lookup is performed by {FID, MAC DA}. This patch has
the details of where mv88e6xxx is right now and what can be done further:
https://patchwork.kernel.org/project/netdevbpf/patch/20211005001414.1234318-5-vladimir.oltean@nxp.com/

So with that hardware, you can have 2 VLAN-aware bridges, and both
bridges can use the full 4K VID space numerically, but in total you
cannot have more than 4K FIDs in the system, so 1000 VLANs on one bridge
and 3000 on the other, or distributions like that. Numerically, the VIDs
of one bridge can be identical to the VIDs of another as long as FIDs
are unique.

> Background: The RTL8365MB's CPU tag includes an ALLOW field followed by
> a "port mask" field. If ALLOW=1 then - based on the VLAN tag in the
> frame and the port mask - the switch will automatically replicate the
> frame and egress it on all suitable ports, but only ports which are in
> the port mask.
>
> If ALLOW=1, and if the port mask is all zeroes or all ones, then the
> switch will make its forwarding decision based only on the VLAN tag in
> the frame (if any). Now consider a configuration as follows:

When you say "based _only_ on the VLAN tag" do you mean that the MAC DA
is not taken into consideration? Are packets flooded towards the entire
set of ports in the allowance port mask that are members of VLAN N?
Do you have address learning properly set up, and can you confirm with
an FDB dump that the FDB is not in fact empty in the FID you are
injecting in (see below)?

>          br0            br1
>           +              +
>           |              |
>       +---+---+      +---+---+
>       |       |      |       |
>      swp0    swp1   swp2    swp3
>
> ... with both bridges containing switch port(s) belonging to the same
> VLAN n. How should I prevent - with TX forwarding offload - a packet
> with VID=n from being egressed on a port on the opposite bridge which
> belongs to the same VLAN n?
>
> In the above scenario, either I must refine the CPU tag "port mask"
> (hence Q1), or I must restrict the hardware configuration in some way
> (hence Q2), or I must conclude that TX forwarding offload is not
> possible with these constraints, or there is some alternative solution
> or nuance that I have not thought of.

I don't want to answer any of these questions until I understand how
does your hardware intend the FID and FID_EN bits from the DSA header to
be used. The FID only has 2 bits, so it is clear to me that it doesn't
have the same understanding of the term as mv88e6xxx, if the Realtek
switch has up to 4 FIDs while Marvell up to 4K.

You should ask yourself not only how to prevent leakage, but also the
flip side, how should I pass the packet to the switch in such a way that
it will learn its MAC SA in the right FID, assuming that you go with FDB
isolation first and figure that out. Once that question is answered, you
can in premise specify an allowance port mask which is larger than
needed (the entire mask of user ports) and the switch should only
forward it towards the ports belonging to the same FID, which are
roughly equivalent with the ports under a specific bridge. You can
create a mapping between a FID and dp->bridge_num. Makes sense or am I
completely off?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-05 10:12 ` Vladimir Oltean
@ 2021-10-05 12:06   ` Alvin Šipraga
  2021-10-05 13:29     ` Vladimir Oltean
  2021-10-06  2:50   ` Florian Fainelli
  1 sibling, 1 reply; 16+ messages in thread
From: Alvin Šipraga @ 2021-10-05 12:06 UTC (permalink / raw)
  To: Vladimir Oltean; +Cc: netdev, Florian Fainelli, Andrew Lunn

On 10/5/21 12:12 PM, Vladimir Oltean wrote:
> On Tue, Oct 05, 2021 at 08:54:34AM +0000, Alvin Šipraga wrote:
>> Hi,
>>
>> I am trying to implement TX forwarding offload for my in-progress
>> rtl8365mb DSA driver. I have some questions which I could use some
>> clarification on. They might be specific to my hardware, which is also
>> OK, but then some advice on how to proceed would be helpful.
>>
>> Q1. Can the tagging driver somehow retrieve a port mask from the DSA
>> switch driver in order to assemble the CPU->switch tag on xmit? Is there
>> some infrastructure in place to share such data between the two drivers?
> 
> Nope. DSA does not maintain a cache of FDB entries retrieved from
> hardware. So it cannot deduce the destination port mask from the {MAC DA, VLAN ID}
> of the skb. The software FDB maintained by the bridge driver is all there is.
> Based on the software FDB, which the bridge still looks up, an skb->dev
> is selected. All that the TX forwarding offload feature is is a way to
> remove software packet replication (skb_clone) for the case where the
> packet should have been flooded, or multicast, by the software bridge
> towards multiple skb->dev entities belonging to the same hardware domain.
> 
> To achieve the desired replication in hardware with DSA, the idea is to
> look up the FDB once more, but this time let the switch do it in hardware.

Right, yes, this part was more or less clear to me. I did not mean to 
imply that the port mask should be the exact forwarding port mask, but 
rather just an "allowance" port mask. But I think you understood the 
context given what you wrote below.

> 
> I see it similar to the quote "life is like a box of chocolates, you
> never know what you're going to get". Meaning that ok, you don't know
> exactly from software on which egress ports your packet is going to
> land, but the result shouldn't be too far off from the expectation in
> any case:
> 
> (a) hardware FDB and software FDB are in sync for the given {MAC DA, VLAN ID}:
>      packet will be forwarded in hardware towards the same port as it
>      would have without the TX forwarding offload feature
> 
> (b) FDB entry exists in software, but not in hardware: packet will be
>      sent once by the bridge, and will be flooded by the hardware towards
>      all bridge ports belonging to the switch's hardware domain
> 
> (c) FDB entry exists in hardware, but not in software: packet will be
>      "flooded" by the software bridge, but the switch will deliver it
>      precisely. Flooding is therefore avoided.
> 
> (d) FDB entry does not exist in hardware or in software: see case (a)
> 
>> Q2. Is it expected by DSA that two isolated ports (e.g. two ports
>> belonging to two separate bridges) can be members of the same VLAN
>> without issue?
> 
> It depends.
> 
> If you mean to ask: "given the way in which the DSA core is structured,
> what do you expect to happen?", the answer is that it won't work without leaks.
> 
> If you mean to ask: "what is the intention going forward?", the answer
> is that it should be made to work, and you should employ hardware specific
> mechanisms to avoid those leaks between VLAN N of br0 and VLAN N of br1,
> or deny the simultaneous existence of a VLAN-aware br0 and a VLAN-aware br1.
> 
> For example, right now you should at least impose the latter restriction,
> see for example sja1105_prechangeupper().

Thanks for the reference.

> 
> In the long term, you should get acquainted with your hardware's FDB
> isolation mechanism, because there will exist an API through which DSA
> will tell you "this switchdev object (FDB, MDB, VLAN) came from this
> bridge, which I've associated for you with a unique integer, just so you
> know when you program it to hardware, I might come back with an
> identical switchdev object later but on a different port, and belonging
> to a different bridge":
> https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/>> 

The FDB isolation mechanism in my switch seems to be pretty good. As 
long as I can pass along *some* information from the switch driver to 
the tagging driver - namely the "allowance port mask" for a given bridge 
- I think I should be able to achieve full isolation between up to 7 
VLAN-aware bridges and with no restrictions on the number of VLANs per 
bridge, nor on the sharing of VLANs per bridge.

Here is a quick summary of the relevant behaviour of the switch:

VLANs programmed on the switch can be set to either SVL or IVL, on a 
per-VLAN basis. This affects how learned MAC addresses are searched 
for/saved in the hardware FDB:

   - In SVL mode, the hardware FDB is keyed with {FID, MAC}.
   - In IVL mode, the hardware FDB is keyed with {VID, MAC, EFID}.

EFID stands for "Enhanced Filtering Identifier". The EFID is 3 bits.

Unlike the FID - which is programmed per-VLAN - the EFID is programmed 
per-port. When a port has learning enabled and it receives an ingress 
frame with a given VID and MAC SA, it will search in the hardware FDB 
with a key {VID, MAC SA, EFID} - where EFID is the port EFID - and if 
the entry is not found, it will create a new one. This allows the switch 
to learn the same {VID, MAC SA} pair on two separate ports, provided 
those ports have different EFIDs.

With that in mind, I intend to enable IVL by default for all VLANs 
programmed to the hardware, and to reserve a given EFID - say EFID=0 - 
for all standalone ports which by definition will never learn anything. 
Together with the port isolation mechanism which is analogous to Linus' 
recent RTL8366RB changes, this should ensure that all ingress frames on 
a standalone port are trivially flooded to the CPU port only. This 
leaves 7 more EFIDs to use, which can each be mapped to a given 
bridge_num, such that we can support 7 hardware bridges with TX 
forwarding offload and IVL.

Finally, I want to implement TX forward offloading, and to selectively 
enable learning on the CPU port for frames with skb->tx_fwd_offload == 
true. More on that below in this mail.

> The most flexible FDB isolation mechanism I've seen so far is in
> mv88e6xxx, you can freely associate a VID with a FID (of which there are
> 4K entries) and FDB lookup is performed by {FID, MAC DA}. This patch has
> the details of where mv88e6xxx is right now and what can be done further:
> https://patchwork.kernel.org/project/netdevbpf/patch/20211005001414.1234318-5-vladimir.oltean@nxp.com/>> 
> So with that hardware, you can have 2 VLAN-aware bridges, and both
> bridges can use the full 4K VID space numerically, but in total you
> cannot have more than 4K FIDs in the system, so 1000 VLANs on one bridge
> and 3000 on the other, or distributions like that. Numerically, the VIDs
> of one bridge can be identical to the VIDs of another as long as FIDs
> are unique.
> 
>> Background: The RTL8365MB's CPU tag includes an ALLOW field followed by
>> a "port mask" field. If ALLOW=1 then - based on the VLAN tag in the
>> frame and the port mask - the switch will automatically replicate the
>> frame and egress it on all suitable ports, but only ports which are in
>> the port mask.
>>
>> If ALLOW=1, and if the port mask is all zeroes or all ones, then the
>> switch will make its forwarding decision based only on the VLAN tag in
>> the frame (if any). Now consider a configuration as follows:
> 
> When you say "based _only_ on the VLAN tag" do you mean that the MAC DA
> is not taken into consideration? Are packets flooded towards the entire
> set of ports in the allowance port mask that are members of VLAN N?

My choice of words was imprecise. I do not mean to say that the MAC DA 
is ignored. If there exists a suitable entry in the hardware FDB for 
that MAC DA and VLAN n, the switch will _not_ flood the entire set of 
ports in the allowance port mask that are members of VLAN n. Instead, it 
_will_ respect the information contained the hardware FDB and forward 
the packet only on the given port(s) in the FDB. Note that the switch 
will still respect the allowance port mask, so this is something like: 
forwarding_portmask = (fdb_portmask & allowance_portmask).

> Do you have address learning properly set up, and can you confirm with
> an FDB dump that the FDB is not in fact empty in the FID you are
> injecting in (see below)?

I have it set up properly - or as much as I can, some things are still 
to be ironed out - and I can dump the FDB and see that learning is 
taking place according to my expectations which I described upstairs. 
Note I have only tested this for unicast so far, although I think the 
rules for multicast are not dissimilar.

> 
>>           br0            br1
>>            +              +
>>            |              |
>>        +---+---+      +---+---+
>>        |       |      |       |
>>       swp0    swp1   swp2    swp3
>>
>> ... with both bridges containing switch port(s) belonging to the same
>> VLAN n. How should I prevent - with TX forwarding offload - a packet
>> with VID=n from being egressed on a port on the opposite bridge which
>> belongs to the same VLAN n?
>>
>> In the above scenario, either I must refine the CPU tag "port mask"
>> (hence Q1), or I must restrict the hardware configuration in some way
>> (hence Q2), or I must conclude that TX forwarding offload is not
>> possible with these constraints, or there is some alternative solution
>> or nuance that I have not thought of.
> 
> I don't want to answer any of these questions until I understand how
> does your hardware intend the FID and FID_EN bits from the DSA header to
> be used. The FID only has 2 bits, so it is clear to me that it doesn't
> have the same understanding of the term as mv88e6xxx, if the Realtek
> switch has up to 4 FIDs while Marvell up to 4K.

I came to the same conclusion until I started to play around with this, 
only to discover that the Realtek documentation is wrong.

First, so that we are on the same page, here again is the relevant part 
of the CPU tag we are talking about:

0                                  7|8                                 15
|-----------------------------------+-----------------------------------|
|                               (16-bit)                                |
|                       Realtek EtherType [0x8899]                      |
|-----------------------------------+-----------------------------------|
|              (8-bit)              |              (8-bit)              |
|          Protocol [0x04]          |              REASON               |
|-----------------------------------+-----------------------------------|
|   (1)  | (1) | (2) |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
| FID_EN |  X  | FID | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
|-----------------------------------+-----------------------------------|
  ^^^^^^^^^^^^^^^^^^^^
      look here

What actually appears to be the case - at least in the IVL case I 
described above - is that the fields FID_EN and FID should rather be 
named EFID_EN and EFID. Moreover, the reserved bit X between the two 
fields is actually an extension of the newly-named EFID field. So things 
look more like this:

|-----------------------------------+-----------------------------------|
|   (1)   |    (3)   |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
| EFID_EN |   EFID   | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
|-----------------------------------+-----------------------------------|
  ^^^^^^^^^^^^^^^^^^^^
      look here

If EFID_EN=1 and LEARN_DIS=0 and learning is enabled on the CPU port, 
then the switch will learn the MAC SA of the frame and enter it into the 
FDB with the corresponding VID (according to the 802.1Q tag) and the 
corresponding EFID (according to the CPU tag and the EFID field). This 
is super useful because it enables the strategy I outlined above, and 
also avoids having to rely on the assisted_learning_on_cpu_port flag.

What this doesn't do is help me with the actual forwarding decision of 
the frame. I hoped that by setting EFID_EN=1, EFID=k, ALLOW=1, and a 
"catch all" allowance port mask of 0 or ~0 (I tested both), the switch 
would only consider forwarding the frame to ports with EFID == k. This 
is not the case however, and it seems that the EFID_EN/EFID fields of 
the CPU tag only affect how the switch learns from this frame. Hence my 
need to "tune" the allowance port mask.

In case you are suspicious when I say the documentation is wrong: I 
tested this behaviour quite heavily in order to come to this conclusion. 
The EFID field is indeed 3 bits - and this matches with the definition 
of EFID in the datasheet of the chip - and by setting it to some 3-bit 
value like 7, I see this reflected in the hardware FDB after dumping it.

> 
> You should ask yourself not only how to prevent leakage, but also the
> flip side, how should I pass the packet to the switch in such a way that
> it will learn its MAC SA in the right FID, assuming that you go with FDB
> isolation first and figure that out. Once that question is answered, you
> can in premise specify an allowance port mask which is larger than
> needed (the entire mask of user ports) and the switch should only
> forward it towards the ports belonging to the same FID, which are
> roughly equivalent with the ports under a specific bridge. You can
> create a mapping between a FID and dp->bridge_num. Makes sense or am I
> completely off?

Right! This was exactly my plan (save for s/FID/EFID/), but I wanted to 
discuss the situation with you because I know that you have some planned 
changes for net-next and I was not sure if this method is considered 
acceptable in DSA land. I hope the explanation above clarifies the 
situation a bit. I will go ahead now and try to implement this mapping 
between bridge_num and EFID, so that the tagging driver can look up the 
correct allowance port mask on xmit of a bridge frame.

Thanks for your detailed response.

	Alvin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-05 12:06   ` Alvin Šipraga
@ 2021-10-05 13:29     ` Vladimir Oltean
  2021-10-05 13:37       ` Vladimir Oltean
  2021-10-05 14:38       ` Alvin Šipraga
  0 siblings, 2 replies; 16+ messages in thread
From: Vladimir Oltean @ 2021-10-05 13:29 UTC (permalink / raw)
  To: Alvin Šipraga; +Cc: netdev, Florian Fainelli, Andrew Lunn

On Tue, Oct 05, 2021 at 12:06:38PM +0000, Alvin Šipraga wrote:
> The FDB isolation mechanism in my switch seems to be pretty good. As
> long as I can pass along *some* information from the switch driver to
> the tagging driver - namely the "allowance port mask" for a given bridge
> - I think I should be able to achieve full isolation between up to 7
> VLAN-aware bridges and with no restrictions on the number of VLANs per
> bridge, nor on the sharing of VLANs per bridge.
>
> Here is a quick summary of the relevant behaviour of the switch:
>
> VLANs programmed on the switch can be set to either SVL or IVL, on a
> per-VLAN basis. This affects how learned MAC addresses are searched
> for/saved in the hardware FDB:
>
>    - In SVL mode, the hardware FDB is keyed with {FID, MAC}.
>    - In IVL mode, the hardware FDB is keyed with {VID, MAC, EFID}.
>
> EFID stands for "Enhanced Filtering Identifier". The EFID is 3 bits.
>
> Unlike the FID - which is programmed per-VLAN - the EFID is programmed
> per-port. When a port has learning enabled and it receives an ingress
> frame with a given VID and MAC SA, it will search in the hardware FDB
> with a key {VID, MAC SA, EFID} - where EFID is the port EFID - and if
> the entry is not found, it will create a new one. This allows the switch
> to learn the same {VID, MAC SA} pair on two separate ports, provided
> those ports have different EFIDs.

So you are basically saying that for FDB lookups, the EFID is like a
3-bit extension of the 12-bit VID, practically emulating a 32K VLAN space?

> > When you say "based _only_ on the VLAN tag" do you mean that the MAC DA
> > is not taken into consideration? Are packets flooded towards the entire
> > set of ports in the allowance port mask that are members of VLAN N?
>
> My choice of words was imprecise. I do not mean to say that the MAC DA
> is ignored. If there exists a suitable entry in the hardware FDB for
> that MAC DA and VLAN n, the switch will _not_ flood the entire set of
> ports in the allowance port mask that are members of VLAN n. Instead, it
> _will_ respect the information contained the hardware FDB and forward
> the packet only on the given port(s) in the FDB. Note that the switch
> will still respect the allowance port mask, so this is something like:
> forwarding_portmask = (fdb_portmask & allowance_portmask).
                         ^^^^^^^^^^^^
                         and this portion should take the EFID into
                         consideration, should it not?

> > I don't want to answer any of these questions until I understand how
> > does your hardware intend the FID and FID_EN bits from the DSA header to
> > be used. The FID only has 2 bits, so it is clear to me that it doesn't
> > have the same understanding of the term as mv88e6xxx, if the Realtek
> > switch has up to 4 FIDs while Marvell up to 4K.
>
> I came to the same conclusion until I started to play around with this,
> only to discover that the Realtek documentation is wrong.
>
> First, so that we are on the same page, here again is the relevant part
> of the CPU tag we are talking about:
>
> 0                                  7|8                                 15
> |-----------------------------------+-----------------------------------|
> |                               (16-bit)                                |
> |                       Realtek EtherType [0x8899]                      |
> |-----------------------------------+-----------------------------------|
> |              (8-bit)              |              (8-bit)              |
> |          Protocol [0x04]          |              REASON               |
> |-----------------------------------+-----------------------------------|
> |   (1)  | (1) | (2) |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
> | FID_EN |  X  | FID | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
> |-----------------------------------+-----------------------------------|
>   ^^^^^^^^^^^^^^^^^^^^
>       look here
>
> What actually appears to be the case - at least in the IVL case I
> described above - is that the fields FID_EN and FID should rather be
> named EFID_EN and EFID. Moreover, the reserved bit X between the two
> fields is actually an extension of the newly-named EFID field. So things
> look more like this:
>
> |-----------------------------------+-----------------------------------|
> |   (1)   |    (3)   |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
> | EFID_EN |   EFID   | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
> |-----------------------------------+-----------------------------------|
>   ^^^^^^^^^^^^^^^^^^^^
>       look here
>
> If EFID_EN=1 and LEARN_DIS=0 and learning is enabled on the CPU port,
> then the switch will learn the MAC SA of the frame and enter it into the
> FDB with the corresponding VID (according to the 802.1Q tag) and the
> corresponding EFID (according to the CPU tag and the EFID field). This
> is super useful because it enables the strategy I outlined above, and
> also avoids having to rely on the assisted_learning_on_cpu_port flag.
>
> What this doesn't do is help me with the actual forwarding decision of
> the frame. I hoped that by setting EFID_EN=1, EFID=k, ALLOW=1, and a
> "catch all" allowance port mask of 0 or ~0 (I tested both), the switch
> would only consider forwarding the frame to ports with EFID == k. This
> is not the case however, and it seems that the EFID_EN/EFID fields of
> the CPU tag only affect how the switch learns from this frame. Hence my
> need to "tune" the allowance port mask.

So a frame looks up the FDB twice, once on ingress for learning and once
on egress for selecting the destination port mask.

Practically are you saying that the switch loses the EFID information
between the ingress and the egress stage, since the destination port
mask is selected based on a key constructed with "don't care" in the EFID bits?
Strange.

> In case you are suspicious when I say the documentation is wrong: I
> tested this behaviour quite heavily in order to come to this conclusion.
> The EFID field is indeed 3 bits - and this matches with the definition
> of EFID in the datasheet of the chip - and by setting it to some 3-bit
> value like 7, I see this reflected in the hardware FDB after dumping it.

Ok, this sounds reasonable.

> > You should ask yourself not only how to prevent leakage, but also the
> > flip side, how should I pass the packet to the switch in such a way that
> > it will learn its MAC SA in the right FID, assuming that you go with FDB
> > isolation first and figure that out. Once that question is answered, you
> > can in premise specify an allowance port mask which is larger than
> > needed (the entire mask of user ports) and the switch should only
> > forward it towards the ports belonging to the same FID, which are
> > roughly equivalent with the ports under a specific bridge. You can
> > create a mapping between a FID and dp->bridge_num. Makes sense or am I
> > completely off?
>
> Right! This was exactly my plan (save for s/FID/EFID/), but I wanted to
> discuss the situation with you because I know that you have some planned
> changes for net-next and I was not sure if this method is considered
> acceptable in DSA land. I hope the explanation above clarifies the
> situation a bit. I will go ahead now and try to implement this mapping
> between bridge_num and EFID, so that the tagging driver can look up the
> correct allowance port mask on xmit of a bridge frame.

So in DSA we have a feature which is not very well liked, and you might
understand the reason once I explain it.

DSA tagging protocols are constructed to be agnostic of the switch they
are paired with. The reasoning was that in the beginning, sending and
receiving a packet was a very simple operation, on receive you just
needed to extract the information present in the DSA tag and select a
net device based on simple info, and on xmit you just need to construct
a simple tag based on the info provided to you (which port must be
targeted). So there simply was no reason to couple them.

DSA also is quite the wild west in terms of hardware implementations.
We make almost everything optional and the result is that some switches
go on and implement super advanced features, and some lag behind.
Testing becomes very hard, sometimes different users don't even have the
same experience if they use hardware from different vendors.
From maintainers' point of view there is a desire to at least keep
tagging protocols testable on generic hardware, for more than one
reason. First you might want to analyze how a given Ethernet controller
will behave when coupled with a DSA switch you don't have (stuff like
hardware parsers come to mind, with implications for RFS/RSS), and you
might simply want to debug stuff (we've had issues with VLANs dropped by
the RX filter of the DSA master, when the switch used a tail tagging
protocol; lack of Ethernet frame padding in certain conditions; DSA
inheriting some master->vlan_features it shouldn't have, like TX TCP
checksum offloading with tail taggers). In these cases, we did not have
the hardware where the problem was reported, but nonetheless we could
simulate an environment with the problem. We did that by using the
mock-up dsa_loop driver, and hacking dsa_loop_get_protocol() to return
the tagging protocol we wanted to test.

So generally speaking, whatever a tagging protocol driver may do, it
should do something sensible when paired with dsa_loop too. We have come
to depend on that lack of coupling.

On the other hand, there are situations where decoupled drivers won't
really cut it. You can see this especially with the sja1105 driver and
its tagging protocol, there is even a state machine implemented for PTP
frames, with storage provided by the switch driver. I've no idea how
come exactly the switches I directly maintain are the most quirky in
this regard. Anyway.

So there exists a "void * dp->priv" pointer which you can use to share
information between the switch driver and the tagging protocol driver.
Be careful though, from the tagger's perspective, dp->priv may be
provided by the switch driver you expect, or not (true, you would need
to hack up a driver like dsa_loop to return DSA_TAG_PROTO_RTL8_4, but
some crazy people like maintainers might do that). The compromise I've
found (not ideal by any means) is to access dp->priv from code paths
that cannot be triggered with dsa_loop. For example, since dsa_loop does
not declare TX forwarding offload for the bridge, then you could
dereference dp->priv inside an "if (skb->offload_fwd_mark)" on xmit and
you could be fairly sure you are coupled with your switch.

Okay, so what can you put inside dp->priv? Mostly anything, as long as
you have a basic understanding of the data persistence rules.
The fast path races with the slow path and there is no locking between
them. However, the particular case of xmitting a packet on behalf of the
Linux bridge is protected by RCU:

- if the packet is sent through a socket opened on the bridge device
  itself, br_dev_xmit() holds rcu_read_lock() for the entire duration,
  including during br_dev_queue_push_xmit() which calls dev_queue_xmit
  for the DSA interface.

- if the packet is not terminated locally but forwarded, the execution
  context is that of the NET_RX softirq of the ingress interface, in
  which the bridge rx_handler runs (br_handle_frame). And the receive
  data path is under rcu_read_lock() too - see
  netif_receive_skb_list_internal().

So in short, I think dsa_slave_xmit() runs under rcu_read_lock() under
all circumstances of transmitting a frame on behalf of the bridge.

So you can provide RCU-protected data to the tagger in dp->priv.
For example, you can create a reference-counted bridge structure in
which you keep the allocation port mask as an unsigned long.
I'm not too good with arch and compiler stuff, but I think unsigned long
assignments are guaranteed to be atomic on any arch, so when the fast
path reads that unsigned long it is guaranteed to not see an in-between
value even if the slow path is updating it simultaneously.
You create this bridge structure for the first port that enters a
bridge, and you destroy it for the last port that exits it. All ports
under the same bridge have a dp->priv that points to the same bridge
structure. But you need to access it using rcu_dereference_pointer from
the fast path, of course. And you need to make sure that you free it
under RCU as well, otherwise existing readers might see it disappear
from under their feet when the last port leaves a bridge.

Here is where things get a bit fuzzy. DSA gets notified of leaving a
bridge through del_nbp -> netdev_upper_dev_unlink -> ... ->
dsa_port_bridge_leave. If you look through what else del_nbp does,
you'll see a call to nbp_vlan_flush() -> synchronize_rcu(). So it will
wait for at least one grace period for all existing RCU readers to
finish (including the xmit and rcv data paths) before that function
returns. So in turn, you will see that netdev_upper_dev_unlink is only
called after an RCU grace period, at that point there should be no xmit
from that bridge port anymore, so you shouldn't see any
skb->offload_fwd_mark = true during xmit racing with the actual
dsa_port_bridge_leave.

But! if CONFIG_BRIDGE_VLAN_FILTERING=n, the shim implementation of
nbp_vlan_flush() will kick in, and that doesn't call synchronize_rcu(),
so AFAICS nobody will wait for a grace period. So please don't rely on
the bridge waiting for you, and test with VLAN filtering compiled out
too. Just put a synchronize_net() somewhere before freeing the bridge
structure, or just use call_rcu()/kfree_rcu() to free it asynchronously,
and you should be fine.

As for why rely on the bridge holding the rcu_read_lock(), but not on
waiting? Well, because we kind of need to be atomic with the bridge
itself, otherwise if the bridge releases rcu_read_lock() before
dev_queue_xmit() we are no longer in the same atomic section, yet we see
skb->offload_fwd_mark set to true. In the worst case, RCU might even
report quiescent states on all CPUs in that meantime, and somebody might
decide to free your bridge private data. Not nice.

I would make this allowance port mask be part of the Realtek driver
first. If it proves to be useful more generally we could consider moving
it to the DSA core.

Hope this helps and is not overwhelming.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-05 13:29     ` Vladimir Oltean
@ 2021-10-05 13:37       ` Vladimir Oltean
  2021-10-05 14:38       ` Alvin Šipraga
  1 sibling, 0 replies; 16+ messages in thread
From: Vladimir Oltean @ 2021-10-05 13:37 UTC (permalink / raw)
  To: Alvin Šipraga; +Cc: netdev, Florian Fainelli, Andrew Lunn

On Tue, Oct 05, 2021 at 04:29:12PM +0300, Vladimir Oltean wrote:
> Hope this helps and is not overwhelming.

Ah, I forgot. You cannot use from the tagger driver symbols that are
exported by the switch driver, otherwise you create circular dependencies.
Been there, done that.
https://lore.kernel.org/netdev/20210908220834.d7gmtnwrorhharna@skbuf/

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-05 13:29     ` Vladimir Oltean
  2021-10-05 13:37       ` Vladimir Oltean
@ 2021-10-05 14:38       ` Alvin Šipraga
  2021-10-05 15:25         ` Vladimir Oltean
  1 sibling, 1 reply; 16+ messages in thread
From: Alvin Šipraga @ 2021-10-05 14:38 UTC (permalink / raw)
  To: Vladimir Oltean; +Cc: netdev, Florian Fainelli, Andrew Lunn

On 10/5/21 3:29 PM, Vladimir Oltean wrote:
> On Tue, Oct 05, 2021 at 12:06:38PM +0000, Alvin Šipraga wrote:
>> The FDB isolation mechanism in my switch seems to be pretty good. As
>> long as I can pass along *some* information from the switch driver to
>> the tagging driver - namely the "allowance port mask" for a given bridge
>> - I think I should be able to achieve full isolation between up to 7
>> VLAN-aware bridges and with no restrictions on the number of VLANs per
>> bridge, nor on the sharing of VLANs per bridge.
>>
>> Here is a quick summary of the relevant behaviour of the switch:
>>
>> VLANs programmed on the switch can be set to either SVL or IVL, on a
>> per-VLAN basis. This affects how learned MAC addresses are searched
>> for/saved in the hardware FDB:
>>
>>     - In SVL mode, the hardware FDB is keyed with {FID, MAC}.
>>     - In IVL mode, the hardware FDB is keyed with {VID, MAC, EFID}.
>>
>> EFID stands for "Enhanced Filtering Identifier". The EFID is 3 bits.
>>
>> Unlike the FID - which is programmed per-VLAN - the EFID is programmed
>> per-port. When a port has learning enabled and it receives an ingress
>> frame with a given VID and MAC SA, it will search in the hardware FDB
>> with a key {VID, MAC SA, EFID} - where EFID is the port EFID - and if
>> the entry is not found, it will create a new one. This allows the switch
>> to learn the same {VID, MAC SA} pair on two separate ports, provided
>> those ports have different EFIDs.
> 
> So you are basically saying that for FDB lookups, the EFID is like a
> 3-bit extension of the 12-bit VID, practically emulating a 32K VLAN space?

On a high level yes, you can look at it that way. In practice I think 
the picture is a bit more nuanced, and that enabling IVL on a particular 
VLAN just means that the ASIC will do the VLAN<->FID mapping 
automatically. However I was too eager to state "no restrictions on the 
number of VLANs [...]" because in fact the FIDs of this chip are only 
4-bit. So if I use IVL, the limit will be 16 VLANs. However, I think the 
EFID strategy for FDB isolation still holds for SVL. Most of my 
understanding of the IVL/SVL feature is empirical, based on pushing 
packets and observing how the switch forwards them, and dumping the 
hardware FDB.

It could be that my conclusions about "lookup by VID" as opposed to 
"lookup by FID" are wrong, but if it comes to that, I will just have to 
manually implement VID<->FID mapping in the driver.

> 
>>> When you say "based _only_ on the VLAN tag" do you mean that the MAC DA
>>> is not taken into consideration? Are packets flooded towards the entire
>>> set of ports in the allowance port mask that are members of VLAN N?
>>
>> My choice of words was imprecise. I do not mean to say that the MAC DA
>> is ignored. If there exists a suitable entry in the hardware FDB for
>> that MAC DA and VLAN n, the switch will _not_ flood the entire set of
>> ports in the allowance port mask that are members of VLAN n. Instead, it
>> _will_ respect the information contained the hardware FDB and forward
>> the packet only on the given port(s) in the FDB. Note that the switch
>> will still respect the allowance port mask, so this is something like:
>> forwarding_portmask = (fdb_portmask & allowance_portmask).
>                           ^^^^^^^^^^^^
>                           and this portion should take the EFID into
>                           consideration, should it not?

Correct. I have verified this behaviour.

> 
>>> I don't want to answer any of these questions until I understand how
>>> does your hardware intend the FID and FID_EN bits from the DSA header to
>>> be used. The FID only has 2 bits, so it is clear to me that it doesn't
>>> have the same understanding of the term as mv88e6xxx, if the Realtek
>>> switch has up to 4 FIDs while Marvell up to 4K.
>>
>> I came to the same conclusion until I started to play around with this,
>> only to discover that the Realtek documentation is wrong.
>>
>> First, so that we are on the same page, here again is the relevant part
>> of the CPU tag we are talking about:
>>
>> 0                                  7|8                                 15
>> |-----------------------------------+-----------------------------------|
>> |                               (16-bit)                                |
>> |                       Realtek EtherType [0x8899]                      |
>> |-----------------------------------+-----------------------------------|
>> |              (8-bit)              |              (8-bit)              |
>> |          Protocol [0x04]          |              REASON               |
>> |-----------------------------------+-----------------------------------|
>> |   (1)  | (1) | (2) |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
>> | FID_EN |  X  | FID | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
>> |-----------------------------------+-----------------------------------|
>>    ^^^^^^^^^^^^^^^^^^^^
>>        look here
>>
>> What actually appears to be the case - at least in the IVL case I
>> described above - is that the fields FID_EN and FID should rather be
>> named EFID_EN and EFID. Moreover, the reserved bit X between the two
>> fields is actually an extension of the newly-named EFID field. So things
>> look more like this:
>>
>> |-----------------------------------+-----------------------------------|
>> |   (1)   |    (3)   |   (1)  | (3) | (1)  | (1) |    (1)    |   (5)    |
>> | EFID_EN |   EFID   | PRI_EN | PRI | KEEP |  X  | LEARN_DIS |    X     |
>> |-----------------------------------+-----------------------------------|
>>    ^^^^^^^^^^^^^^^^^^^^
>>        look here
>>
>> If EFID_EN=1 and LEARN_DIS=0 and learning is enabled on the CPU port,
>> then the switch will learn the MAC SA of the frame and enter it into the
>> FDB with the corresponding VID (according to the 802.1Q tag) and the
>> corresponding EFID (according to the CPU tag and the EFID field). This
>> is super useful because it enables the strategy I outlined above, and
>> also avoids having to rely on the assisted_learning_on_cpu_port flag.
>>
>> What this doesn't do is help me with the actual forwarding decision of
>> the frame. I hoped that by setting EFID_EN=1, EFID=k, ALLOW=1, and a
>> "catch all" allowance port mask of 0 or ~0 (I tested both), the switch
>> would only consider forwarding the frame to ports with EFID == k. This
>> is not the case however, and it seems that the EFID_EN/EFID fields of
>> the CPU tag only affect how the switch learns from this frame. Hence my
>> need to "tune" the allowance port mask.
> 
> So a frame looks up the FDB twice, once on ingress for learning and once
> on egress for selecting the destination port mask.

This improves my mental model of the switch hardware, thanks.

> 
> Practically are you saying that the switch loses the EFID information
> between the ingress and the egress stage, since the destination port
> mask is selected based on a key constructed with "don't care" in the EFID bits?
> Strange.

Strange indeed - and wrong! I just checked this again. The switch 
actually _does_ preserve the EFID for the second lookup when selecting 
the destination port mask, and this behaves as you would expect. My 
observation to the contrary was specifically for the case where there is 
no hit for the destination address, in which case the switch will 
_flood_ according to the VLAN and MAC DA, without regard for the EFID. 
This kind of makes sense, since the EFID is just a searching/learning 
look-up-table concept and is not related to flooding. OTOH there are 
flooding port mask registers where one can set for 
{uni,multi,broad}cast, but this configuration is independent of VLAN.

This is OK, but I think it still puts me back in the same situation 
where I need to communicate a meaningful port allowance mask to avoid 
flooding the wrong ports.

> 
>> In case you are suspicious when I say the documentation is wrong: I
>> tested this behaviour quite heavily in order to come to this conclusion.
>> The EFID field is indeed 3 bits - and this matches with the definition
>> of EFID in the datasheet of the chip - and by setting it to some 3-bit
>> value like 7, I see this reflected in the hardware FDB after dumping it.
> 
> Ok, this sounds reasonable.
> 
>>> You should ask yourself not only how to prevent leakage, but also the
>>> flip side, how should I pass the packet to the switch in such a way that
>>> it will learn its MAC SA in the right FID, assuming that you go with FDB
>>> isolation first and figure that out. Once that question is answered, you
>>> can in premise specify an allowance port mask which is larger than
>>> needed (the entire mask of user ports) and the switch should only
>>> forward it towards the ports belonging to the same FID, which are
>>> roughly equivalent with the ports under a specific bridge. You can
>>> create a mapping between a FID and dp->bridge_num. Makes sense or am I
>>> completely off?
>>
>> Right! This was exactly my plan (save for s/FID/EFID/), but I wanted to
>> discuss the situation with you because I know that you have some planned
>> changes for net-next and I was not sure if this method is considered
>> acceptable in DSA land. I hope the explanation above clarifies the
>> situation a bit. I will go ahead now and try to implement this mapping
>> between bridge_num and EFID, so that the tagging driver can look up the
>> correct allowance port mask on xmit of a bridge frame.
> 
> So in DSA we have a feature which is not very well liked, and you might
> understand the reason once I explain it.
> 
> DSA tagging protocols are constructed to be agnostic of the switch they
> are paired with. The reasoning was that in the beginning, sending and
> receiving a packet was a very simple operation, on receive you just
> needed to extract the information present in the DSA tag and select a
> net device based on simple info, and on xmit you just need to construct
> a simple tag based on the info provided to you (which port must be
> targeted). So there simply was no reason to couple them.
> 
> DSA also is quite the wild west in terms of hardware implementations.
> We make almost everything optional and the result is that some switches
> go on and implement super advanced features, and some lag behind.
> Testing becomes very hard, sometimes different users don't even have the
> same experience if they use hardware from different vendors.
>  From maintainers' point of view there is a desire to at least keep
> tagging protocols testable on generic hardware, for more than one
> reason. First you might want to analyze how a given Ethernet controller
> will behave when coupled with a DSA switch you don't have (stuff like
> hardware parsers come to mind, with implications for RFS/RSS), and you
> might simply want to debug stuff (we've had issues with VLANs dropped by
> the RX filter of the DSA master, when the switch used a tail tagging
> protocol; lack of Ethernet frame padding in certain conditions; DSA
> inheriting some master->vlan_features it shouldn't have, like TX TCP
> checksum offloading with tail taggers). In these cases, we did not have
> the hardware where the problem was reported, but nonetheless we could
> simulate an environment with the problem. We did that by using the
> mock-up dsa_loop driver, and hacking dsa_loop_get_protocol() to return
> the tagging protocol we wanted to test.
> 
> So generally speaking, whatever a tagging protocol driver may do, it
> should do something sensible when paired with dsa_loop too. We have come
> to depend on that lack of coupling.
> 
> On the other hand, there are situations where decoupled drivers won't
> really cut it. You can see this especially with the sja1105 driver and
> its tagging protocol, there is even a state machine implemented for PTP
> frames, with storage provided by the switch driver. I've no idea how
> come exactly the switches I directly maintain are the most quirky in
> this regard. Anyway.
> 
> So there exists a "void * dp->priv" pointer which you can use to share
> information between the switch driver and the tagging protocol driver.
> Be careful though, from the tagger's perspective, dp->priv may be
> provided by the switch driver you expect, or not (true, you would need
> to hack up a driver like dsa_loop to return DSA_TAG_PROTO_RTL8_4, but
> some crazy people like maintainers might do that). The compromise I've
> found (not ideal by any means) is to access dp->priv from code paths
> that cannot be triggered with dsa_loop. For example, since dsa_loop does
> not declare TX forwarding offload for the bridge, then you could
> dereference dp->priv inside an "if (skb->offload_fwd_mark)" on xmit and
> you could be fairly sure you are coupled with your switch.

Okay, so far so good - this is what I was working towards already.

> 
> Okay, so what can you put inside dp->priv? Mostly anything, as long as
> you have a basic understanding of the data persistence rules.
> The fast path races with the slow path and there is no locking between
> them. However, the particular case of xmitting a packet on behalf of the
> Linux bridge is protected by RCU:
> 
> - if the packet is sent through a socket opened on the bridge device
>    itself, br_dev_xmit() holds rcu_read_lock() for the entire duration,
>    including during br_dev_queue_push_xmit() which calls dev_queue_xmit
>    for the DSA interface.
> 
> - if the packet is not terminated locally but forwarded, the execution
>    context is that of the NET_RX softirq of the ingress interface, in
>    which the bridge rx_handler runs (br_handle_frame). And the receive
>    data path is under rcu_read_lock() too - see
>    netif_receive_skb_list_internal().
> 
> So in short, I think dsa_slave_xmit() runs under rcu_read_lock() under
> all circumstances of transmitting a frame on behalf of the bridge.
> 
> So you can provide RCU-protected data to the tagger in dp->priv.
> For example, you can create a reference-counted bridge structure in
> which you keep the allocation port mask as an unsigned long.
> I'm not too good with arch and compiler stuff, but I think unsigned long
> assignments are guaranteed to be atomic on any arch, so when the fast
> path reads that unsigned long it is guaranteed to not see an in-between
> value even if the slow path is updating it simultaneously.
> You create this bridge structure for the first port that enters a
> bridge, and you destroy it for the last port that exits it. All ports
> under the same bridge have a dp->priv that points to the same bridge
> structure. But you need to access it using rcu_dereference_pointer from
> the fast path, of course. And you need to make sure that you free it
> under RCU as well, otherwise existing readers might see it disappear
> from under their feet when the last port leaves a bridge.
> 
> Here is where things get a bit fuzzy. DSA gets notified of leaving a
> bridge through del_nbp -> netdev_upper_dev_unlink -> ... ->
> dsa_port_bridge_leave. If you look through what else del_nbp does,
> you'll see a call to nbp_vlan_flush() -> synchronize_rcu(). So it will
> wait for at least one grace period for all existing RCU readers to
> finish (including the xmit and rcv data paths) before that function
> returns. So in turn, you will see that netdev_upper_dev_unlink is only
> called after an RCU grace period, at that point there should be no xmit
> from that bridge port anymore, so you shouldn't see any
> skb->offload_fwd_mark = true during xmit racing with the actual
> dsa_port_bridge_leave.
> 
> But! if CONFIG_BRIDGE_VLAN_FILTERING=n, the shim implementation of
> nbp_vlan_flush() will kick in, and that doesn't call synchronize_rcu(),
> so AFAICS nobody will wait for a grace period. So please don't rely on
> the bridge waiting for you, and test with VLAN filtering compiled out
> too. Just put a synchronize_net() somewhere before freeing the bridge
> structure, or just use call_rcu()/kfree_rcu() to free it asynchronously,
> and you should be fine.
> 
> As for why rely on the bridge holding the rcu_read_lock(), but not on
> waiting? Well, because we kind of need to be atomic with the bridge
> itself, otherwise if the bridge releases rcu_read_lock() before
> dev_queue_xmit() we are no longer in the same atomic section, yet we see
> skb->offload_fwd_mark set to true. In the worst case, RCU might even
> report quiescent states on all CPUs in that meantime, and somebody might
> decide to free your bridge private data. Not nice.
> 
> I would make this allowance port mask be part of the Realtek driver
> first. If it proves to be useful more generally we could consider moving
> it to the DSA core.
> 
> Hope this helps and is not overwhelming.
> 

It's not overwhelming but very helpful indeed. Not that I understand 
everything you wrote now, but I will check out the numerous references 
you have made and see if I can figure this out. As usual I will be back 
on the mailing list if I have some further questions. Thanks a lot!

	Alvin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-05 14:38       ` Alvin Šipraga
@ 2021-10-05 15:25         ` Vladimir Oltean
  2021-10-06 16:16           ` Alvin Šipraga
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Oltean @ 2021-10-05 15:25 UTC (permalink / raw)
  To: Alvin Šipraga; +Cc: netdev, Florian Fainelli, Andrew Lunn

On Tue, Oct 05, 2021 at 02:38:28PM +0000, Alvin Šipraga wrote:
> On 10/5/21 3:29 PM, Vladimir Oltean wrote:
> > On Tue, Oct 05, 2021 at 12:06:38PM +0000, Alvin Šipraga wrote:
> >> The FDB isolation mechanism in my switch seems to be pretty good. As
> >> long as I can pass along *some* information from the switch driver to
> >> the tagging driver - namely the "allowance port mask" for a given bridge
> >> - I think I should be able to achieve full isolation between up to 7
> >> VLAN-aware bridges and with no restrictions on the number of VLANs per
> >> bridge, nor on the sharing of VLANs per bridge.
> >>
> >> Here is a quick summary of the relevant behaviour of the switch:
> >>
> >> VLANs programmed on the switch can be set to either SVL or IVL, on a
> >> per-VLAN basis. This affects how learned MAC addresses are searched
> >> for/saved in the hardware FDB:
> >>
> >>     - In SVL mode, the hardware FDB is keyed with {FID, MAC}.
> >>     - In IVL mode, the hardware FDB is keyed with {VID, MAC, EFID}.
> >>
> >> EFID stands for "Enhanced Filtering Identifier". The EFID is 3 bits.
> >>
> >> Unlike the FID - which is programmed per-VLAN - the EFID is programmed
> >> per-port. When a port has learning enabled and it receives an ingress
> >> frame with a given VID and MAC SA, it will search in the hardware FDB
> >> with a key {VID, MAC SA, EFID} - where EFID is the port EFID - and if
> >> the entry is not found, it will create a new one. This allows the switch
> >> to learn the same {VID, MAC SA} pair on two separate ports, provided
> >> those ports have different EFIDs.
> >
> > So you are basically saying that for FDB lookups, the EFID is like a
> > 3-bit extension of the 12-bit VID, practically emulating a 32K VLAN space?
>
> On a high level yes, you can look at it that way. In practice I think
> the picture is a bit more nuanced, and that enabling IVL on a particular
> VLAN just means that the ASIC will do the VLAN<->FID mapping
> automatically. However I was too eager to state "no restrictions on the
> number of VLANs [...]" because in fact the FIDs of this chip are only
> 4-bit. So if I use IVL, the limit will be 16 VLANs. However, I think the
> EFID strategy for FDB isolation still holds for SVL. Most of my
> understanding of the IVL/SVL feature is empirical, based on pushing
> packets and observing how the switch forwards them, and dumping the
> hardware FDB.

So let me rephrase the facts which you've presented to make sure I get this right.

(a) The switch processes each frame in an internal 4-bit FID.

(b) Each VLAN (not {port, VLAN} pair) can be configured for SVL or IVL.
    When a packet is received, it is first classified to a VLAN, then
    the VLAN table is looked up, and the switch determines whether that
    VLAN is configured for SVL or IVL.

(c) If configured for SVL, the 4-bit internal FID is derived exclusively
    from the VLAN table entry.

(d) If configured for IVL, the ingress port's EFID is read, and the
    4-bit internal FID is derived from the {12-bit VID, 3-bit port EFID}
    squashed into a 4-bit number.

(e) The sum of internal FIDs in use does not exceed 16, regardless of
    whether SVL or IVL is used for a VID. Otherwise said, the FDB cannot
    be partitioned in more than 16 groups.

(f) The FDB is always looked up by {internal FID, MAC}.

How do you know that point (e) is true? If you add more than 16 VLANs
using IVL, is there any error? If the user can map a SVL VID to a FID
directly through the VLAN table, does that mean that the hardware
continuously remaps IVL {VID, EFID} VLANs to different FIDs, as FID
values keep getting used up by SVL? Can you make an IVL VID reuse the
same internal FID as an SVL VID? Can you make two IVL VIDs use the same
internal FID?

Anyway, this complicates things by quite a bit. The Linux bridge doesn't
really have an SVL/IVL knob. It assumes IVL. Where things will get
challenging is when you offload FDB entries with a given {VID, MAC DA},
what to do if you access the FDB by FID, but in fact there isn't a
bijective mapping between the VID and the FID? You keep reference counts
per FDB entry, such that when the user deletes a MAC DA from VID A, but
you also have that MAC DA in VID B, both of which map to the same FID,
you still keep the entry? And most importantly, do you see the FID bits
in the tagger in the receive path as well? Can you dump them for packets
classified to a FID in different ways, using IVL, SVL?

> It could be that my conclusions about "lookup by VID" as opposed to
> "lookup by FID" are wrong, but if it comes to that, I will just have to
> manually implement VID<->FID mapping in the driver.

And this is the second complication. Whatever VID<->FID mapping you make,
if it's not static, you'll need a lookup table in the tagging protocol
driver to translate the VID from the skb to a FID. Odd. Or maybe I'm wrong.

> > Practically are you saying that the switch loses the EFID information
> > between the ingress and the egress stage, since the destination port
> > mask is selected based on a key constructed with "don't care" in the EFID bits?
> > Strange.
>
> Strange indeed - and wrong! I just checked this again. The switch
> actually _does_ preserve the EFID for the second lookup when selecting
> the destination port mask, and this behaves as you would expect. My
> observation to the contrary was specifically for the case where there is
> no hit for the destination address, in which case the switch will
> _flood_ according to the VLAN and MAC DA, without regard for the EFID.
> This kind of makes sense, since the EFID is just a searching/learning
> look-up-table concept and is not related to flooding. OTOH there are
> flooding port mask registers where one can set for
> {uni,multi,broad}cast, but this configuration is independent of VLAN.

So flooding is indeed the miss action from the FDB, but I'm just
wondering, aren't the flood control registers replicated per FID in fact?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-05 10:12 ` Vladimir Oltean
  2021-10-05 12:06   ` Alvin Šipraga
@ 2021-10-06  2:50   ` Florian Fainelli
  2021-10-06 23:15     ` Tobias Waldekranz
  2021-10-07  1:08     ` Vladimir Oltean
  1 sibling, 2 replies; 16+ messages in thread
From: Florian Fainelli @ 2021-10-06  2:50 UTC (permalink / raw)
  To: Vladimir Oltean, Alvin Šipraga; +Cc: netdev, Andrew Lunn



On 10/5/2021 3:12 AM, Vladimir Oltean wrote:
> I don't want to answer any of these questions until I understand how
> does your hardware intend the FID and FID_EN bits from the DSA header to
> be used. The FID only has 2 bits, so it is clear to me that it doesn't
> have the same understanding of the term as mv88e6xxx, if the Realtek
> switch has up to 4 FIDs while Marvell up to 4K.
> 
> You should ask yourself not only how to prevent leakage, but also the
> flip side, how should I pass the packet to the switch in such a way that
> it will learn its MAC SA in the right FID, assuming that you go with FDB
> isolation first and figure that out. Once that question is answered, you
> can in premise specify an allowance port mask which is larger than
> needed (the entire mask of user ports) and the switch should only
> forward it towards the ports belonging to the same FID, which are
> roughly equivalent with the ports under a specific bridge. You can
> create a mapping between a FID and dp->bridge_num. Makes sense or am I
> completely off?
> 

Sorry for sort of hijacking this discussion.

Broadcom switches do not have FIDs however using Alvin's topology, and 
given the existing bridge support in b53, it also does not look like 
there is leaking from one bridge to other because of the port matrix 
configuration which is enforced in addition to the VLAN membership. 
However based on what I see in tag_dsa.c for the transmit case with 
skb->offload_fwd_mark, I would have to dig into the bridge's members in 
order to construct a bitmask of ports to provide to tag_brcm.c, so that 
does not really get me anywhere, does it?

Those switches also always do double VLAN tag (802.1ad) normalization 
within their data path whenever VLAN is globally enabled within the 
switch, so in premise you could achieve the same isolation by reserving 
one of the outer VLAN tags per bridge, enabling IVL and hope that the 
FDB is looked including the outer and inner VLAN tags and not just the 
inner VLAN tag.

If we don't have a FID concept, and not all switches do, how we can 
still support tx forwarding offload here?
-- 
Florian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-05 15:25         ` Vladimir Oltean
@ 2021-10-06 16:16           ` Alvin Šipraga
  2021-10-07  9:47             ` Vladimir Oltean
  0 siblings, 1 reply; 16+ messages in thread
From: Alvin Šipraga @ 2021-10-06 16:16 UTC (permalink / raw)
  To: Vladimir Oltean; +Cc: netdev, Florian Fainelli, Andrew Lunn

On 10/5/21 5:25 PM, Vladimir Oltean wrote:
> 
> So let me rephrase the facts which you've presented to make sure I get this right.
> 
> (a) The switch processes each frame in an internal 4-bit FID.
> 
> (b) Each VLAN (not {port, VLAN} pair) can be configured for SVL or IVL.
>      When a packet is received, it is first classified to a VLAN, then
>      the VLAN table is looked up, and the switch determines whether that
>      VLAN is configured for SVL or IVL.
> 
> (c) If configured for SVL, the 4-bit internal FID is derived exclusively
>      from the VLAN table entry.
> 
> (d) If configured for IVL, the ingress port's EFID is read, and the
>      4-bit internal FID is derived from the {12-bit VID, 3-bit port EFID}
>      squashed into a 4-bit number.
> 
> (e) The sum of internal FIDs in use does not exceed 16, regardless of
>      whether SVL or IVL is used for a VID. Otherwise said, the FDB cannot
>      be partitioned in more than 16 groups.
> 
> (f) The FDB is always looked up by {internal FID, MAC}.
> 

Hi Vladimir,

The idea that the chip maps everything to an (internal) 4-bit FID - even 
in IVL mode - was just conjecture based on what I read in the datasheet 
of the chip. I think you can see that I'm still a bit confused by this 
hardware. I'm sorry if you feel like you wasted your time, but hopefully 
this mail clarifies some things for you.

First, allow me to reproduce the relevant part of the datasheet here:

| == Search and Learning
|
| = Search
|
| When a packet is received, the RTL8365MB-VC uses the destination MAC
| address, Filtering Identifier (FID) and Enhanced Filtering Identifier
| (EFID) to search the 2K-entry look-up table. The 48-bit MAC address,
| 4-bit FID and 3-bit EFID use a hash algorithm, to calculate an 11-bit
| index value. The RTL8365MB-VC uses the index to compare the packet MAC
| address with the entries (MAC addresses) in the look-up table. This is
| the ‘Address Search’. If the destination MAC address is not found, the
| switch will broadcast the packet according to VLAN configuration.
|
| = Learning
|
| The RTL8365MB-VC uses the source MAC address, FID, and EFID of the
| incoming packet to hash into a 9-bit index. It then compares the source
| MAC address with the data (MAC addresses) in this index. If there is a
| match with one of the entries, the RTL8365MB-VC will update the entry
| with new information. If there is no match and the 2K entries are not
| all occupied by other MAC addresses, the RTL8365MB- VC will record the
| source MAC address and ingress port number into an empty entry. This
| process is called ‘Learning’.
|
| Address aging is used to keep the contents of the address table correct
| in a dynamic network topology. The look-up engine will update the time
| stamp information of an entry whenever the corresponding source MAC
| address appears. An entry will be invalid (aged out) if its time stamp
| information is not refreshed by the address learning process during the
| aging time period. The aging time of the RTL8365MB-VC is between 200 and
| 400 seconds (typical is 300 seconds).
|
| == SVL and IVL/SVL
|
| The RTL8365MB-VC supports a 16-group Filtering Identifier (FID) for L2
| search and learning. In default operation, all VLAN entries belong to
| the same FID. This is called Shared VLAN Learning (SVL). If VLAN entries
| are configured to different FIDs, then the same source MAC address with
| multiple FIDs can be learned into different look-up table entries. This
| is called Independent VLAN Learning and Shared VLAN Learning (IVL/SVL).

This "IVL/SVL" mode would appear to correspond to a field in the vendor 
driver sources called ivl_svl (ivl_svl=1 is what I have referred to as 
"IVL" all this time), which is part of each VLAN configuration in the 
VLAN table. But that field also comes with a /* IVL */ or /* IVL_EN */ 
comment next to it in some places. So I am unsure whether there is a 
third, "genuine" IVL mode which does not use the FID at all. At least, 
the description in the datasheet doesn't seem seem to correlate with the 
behaviour of this ivl_svl switch. But I could be parsing it wrong.

Now, rather than speculate further on the semantics, I went ahead and 
tested out the behaviour by:

- adding 32 VLANs 100..131 on a port, all with IVL (i.e. ivl_svl=1)
- cycling through the 8 possible port EFIDs (0..7) on that port
- for each EFID, sending one 802.1Q-tagged frame with VID=n for 
n=100..131 to the port from the network

Some notes:

- the chip supports up to 32 concurrent VLANs (globally); this is a 
general limitation of the hardware.
- in this scenario the MAC SA is the same for each frame I transmit from 
the network into the port.

By dumping the hardware FDB after the fact, I can see 32 * 8 FDB entries 
for the given MAC SA of my frames:

	cat /sys/kernel/debug/rtl8365mb/lut_dump
	addr    mac                     vid_fid spa     fid     efid
	0004    00:00:aa:aa:aa:aa       104     2       0       0
	0005    00:00:aa:aa:aa:aa       112     2       0       3
	0006    00:00:aa:aa:aa:aa       120     2       0       2
	0008    00:00:aa:aa:aa:aa       128     2       0       5
	0036    00:00:aa:aa:aa:aa       105     2       0       0
	0037    00:00:aa:aa:aa:aa       113     2       0       3
	0038    00:00:aa:aa:aa:aa       121     2       0       2
	0040    00:00:aa:aa:aa:aa       129     2       0       5
	0068    00:00:aa:aa:aa:aa       106     2       0       0
	... (table continues with an entry for each VID/EFID combo)

Legend:
	addr: look-up-table index
	mac: MAC address
	vid_fid: VID of the frame for both IVL and SVL
	spa: source port address, i.e. the port that learned
	fid: FID (of the VLAN)
	efid: EFID (of the port)

I also tried sending untagged frames from the network and cycling 
through one of the VLANs as PVID, in which case the port would learn and 
make an entry with vid_fid corresponding to the PVID.

This suggests to me that the IVL field of the VLAN configuration really 
does achieve Independent VLAN learning, and that there are not many 
constraints here besides the size of the look-up-table.
	

Now to address your questions...

> How do you know that point (e) is true?

Evidently it is not true, since I can partition the FDB into more than 
16 groups.

> If you add more than 16 VLANs using IVL, is there any error?

I added 32 and things seem to work OK.

> If the user can map a SVL VID to a FID directly through the VLAN table,
> does that mean that the hardware continuously remaps IVL {VID, EFID}
> VLANs to different FIDs, as FID values keep getting used up by SVL?

This would be quite some gymnastics on the part of the ASIC. Let's take 
a step back.

Could it be that the ivl_svl switch simply controls how this 
look-up-table index is computed? That is to say:

SVL: {FID, EFID, MAC} -> index
IVL: {VID, EFID, MAC} -> index

I tried the following scenario:

	# add VLAN 100/101
	bridge vlan add vid 100 dev swp2
	bridge vlan add vid 101 dev swp2

	# send VID 100 frame from another host on the network
	mausezahn eth2 -Q 100 -c 1 -a '00:00:aa:aa:aa:aa' -t tcp

	# dump HW FDB
	cat /sys/kernel/debug/rtl8365mb/lut_dump

	# send VID 101 frame this time
	mausezahn eth2 -Q 101 -c 1 -a '00:00:aa:aa:aa:aa' -t tcp

	# dump HW FDB
	cat /sys/kernel/debug/rtl8365mb/lut_dump

I then tested this out with:

- IVL, FID=0

	##### send frame on VLAN 100
	cat /sys/kernel/debug/rtl8365mb/lut_dump
	addr    mac                     vid_fid spa     fid     efid
	0388    00:00:aa:aa:aa:aa       100     2       0       0
	##### send frame on VLAN 101
	cat /sys/kernel/debug/rtl8365mb/lut_dump
	addr    mac                     vid_fid spa     fid     efid
	0388    00:00:aa:aa:aa:aa       100     2       0       0
	0420    00:00:aa:aa:aa:aa       101     2       0       0

- IVL, FID=9

	##### send frame on VLAN 100
	cat /sys/kernel/debug/rtl8365mb/lut_dump
	addr    mac                     vid_fid spa     fid     efid
	0388    00:00:aa:aa:aa:aa       100     2       9       0
	##### send frame on VLAN 101
	cat /sys/kernel/debug/rtl8365mb/lut_dump
	addr    mac                     vid_fid spa     fid     efid
	0388    00:00:aa:aa:aa:aa       100     2       9       0
	0420    00:00:aa:aa:aa:aa       101     2       9       0

- SVL, FID=0

	##### send frame on VLAN 100
	cat /sys/kernel/debug/rtl8365mb/lut_dump
	addr    mac                     vid_fid spa     fid     efid
	1280    00:00:aa:aa:aa:aa       100     2       0       0
	##### send frame on VLAN 101
	cat /sys/kernel/debug/rtl8365mb/lut_dump
	addr    mac                     vid_fid spa     fid     efid
	1280    00:00:aa:aa:aa:aa       101     2       0       0

- SVL, FID=9

	##### send frame on VLAN 100
	cat /sys/kernel/debug/rtl8365mb/lut_dump
	addr    mac                     vid_fid spa     fid     efid
	1056    00:00:aa:aa:aa:aa       100     2       9       0
	##### send frame on VLAN 101
	cat /sys/kernel/debug/rtl8365mb/lut_dump
	addr    mac                     vid_fid spa     fid     efid
	1056    00:00:aa:aa:aa:aa       101     2       9       0

Some observations:

- with IVL, index is the same for FID=0,9
- with SVL, index is different for FID=0,9
- with IVL, index is different for VID=100,101
- with SVL, index is the same for VID=100,101

In particular, with IVL, the FID is stored in the table but it does not 
seem to affect the index.

> Can you make an IVL VID reuse the
> same internal FID as an SVL VID? Can you make two IVL VIDs use the same
> internal FID?
> 
> Anyway, this complicates things by quite a bit. The Linux bridge doesn't
> really have an SVL/IVL knob. It assumes IVL. Where things will get
> challenging is when you offload FDB entries with a given {VID, MAC DA},
> what to do if you access the FDB by FID, but in fact there isn't a
> bijective mapping between the VID and the FID?

I _think_ I can look up the FDB by VID, not just FID - I still have to 
confirm that but I think it depends on whether the particular VLAN is in 
IVL or SVL mode.

But either way, there are bound to be collisions given the way the 
look-up-table works. If the driver is asked to offload two FDB entries 
which map to the same look-up-table entry (i.e. same index), can't it 
just error out on the second request? Something like "I see this entry 
is already occupied by a static (offloaded) FDB entry, so I can't 
satisfy this request".

> You keep reference counts
> per FDB entry, such that when the user deletes a MAC DA from VID A, but
> you also have that MAC DA in VID B, both of which map to the same FID,
> you still keep the entry?

> And most importantly, do you see the FID bits
> in the tagger in the receive path as well?

No, I don't, which is kind of strange. But is it a problem?

> Can you dump them for packets
> classified to a FID in different ways, using IVL, SVL?
> 
>> It could be that my conclusions about "lookup by VID" as opposed to
>> "lookup by FID" are wrong, but if it comes to that, I will just have to
>> manually implement VID<->FID mapping in the driver.
> 
> And this is the second complication. Whatever VID<->FID mapping you make,
> if it's not static, you'll need a lookup table in the tagging protocol
> driver to translate the VID from the skb to a FID. Odd. Or maybe I'm wrong.

OK, I think these last questions of yours are based on the premise of 
some kind of VID<->FID mapping. But I hope this email demonstrates to 
you that the switch behaves somewhat differently.

> 
>>> Practically are you saying that the switch loses the EFID information
>>> between the ingress and the egress stage, since the destination port
>>> mask is selected based on a key constructed with "don't care" in the EFID bits?
>>> Strange.
>>
>> Strange indeed - and wrong! I just checked this again. The switch
>> actually _does_ preserve the EFID for the second lookup when selecting
>> the destination port mask, and this behaves as you would expect. My
>> observation to the contrary was specifically for the case where there is
>> no hit for the destination address, in which case the switch will
>> _flood_ according to the VLAN and MAC DA, without regard for the EFID.
>> This kind of makes sense, since the EFID is just a searching/learning
>> look-up-table concept and is not related to flooding. OTOH there are
>> flooding port mask registers where one can set for
>> {uni,multi,broad}cast, but this configuration is independent of VLAN.
> 
> So flooding is indeed the miss action from the FDB, but I'm just
> wondering, aren't the flood control registers replicated per FID in fact?

No, they seem to be global. Here's what the register definitions look 
like in the vendor driver:

#define    RTL8367C_REG_UNDA_FLOODING_PMSK    0x0890
#define    RTL8367C_UNDA_FLOODING_PMSK_OFFSET    0
#define    RTL8367C_UNDA_FLOODING_PMSK_MASK    0x7FF

#define    RTL8367C_REG_UNMCAST_FLOADING_PMSK    0x0891
#define    RTL8367C_UNMCAST_FLOADING_PMSK_OFFSET    0
#define    RTL8367C_UNMCAST_FLOADING_PMSK_MASK    0x7FF

#define    RTL8367C_REG_BCAST_FLOADING_PMSK    0x0892
#define    RTL8367C_BCAST_FLOADING_PMSK_OFFSET    0
#define    RTL8367C_BCAST_FLOADING_PMSK_MASK    0x7FF

Thanks for your help.

	Alvin


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-06  2:50   ` Florian Fainelli
@ 2021-10-06 23:15     ` Tobias Waldekranz
  2021-10-07  1:08     ` Vladimir Oltean
  1 sibling, 0 replies; 16+ messages in thread
From: Tobias Waldekranz @ 2021-10-06 23:15 UTC (permalink / raw)
  To: Florian Fainelli, Vladimir Oltean, Alvin Šipraga; +Cc: netdev, Andrew Lunn

On Tue, Oct 05, 2021 at 19:50, Florian Fainelli <f.fainelli@gmail.com> wrote:
> On 10/5/2021 3:12 AM, Vladimir Oltean wrote:
>> I don't want to answer any of these questions until I understand how
>> does your hardware intend the FID and FID_EN bits from the DSA header to
>> be used. The FID only has 2 bits, so it is clear to me that it doesn't
>> have the same understanding of the term as mv88e6xxx, if the Realtek
>> switch has up to 4 FIDs while Marvell up to 4K.
>> 
>> You should ask yourself not only how to prevent leakage, but also the
>> flip side, how should I pass the packet to the switch in such a way that
>> it will learn its MAC SA in the right FID, assuming that you go with FDB
>> isolation first and figure that out. Once that question is answered, you
>> can in premise specify an allowance port mask which is larger than
>> needed (the entire mask of user ports) and the switch should only
>> forward it towards the ports belonging to the same FID, which are
>> roughly equivalent with the ports under a specific bridge. You can
>> create a mapping between a FID and dp->bridge_num. Makes sense or am I
>> completely off?
>> 
>
> Sorry for sort of hijacking this discussion.
>
> Broadcom switches do not have FIDs however using Alvin's topology, and 
> given the existing bridge support in b53, it also does not look like 
> there is leaking from one bridge to other because of the port matrix 
> configuration which is enforced in addition to the VLAN membership. 
> However based on what I see in tag_dsa.c for the transmit case with 
> skb->offload_fwd_mark, I would have to dig into the bridge's members in 
> order to construct a bitmask of ports to provide to tag_brcm.c, so that 
> does not really get me anywhere, does it?

Presumably this mask is generated in the driver somewhere whenever ports
joins/leaves bridges? Could you not just cache a copy of it in each
port's dp->priv? From there it is easy to get to it from the tagger on
xmit.

> Those switches also always do double VLAN tag (802.1ad) normalization 
> within their data path whenever VLAN is globally enabled within the 
> switch, so in premise you could achieve the same isolation by reserving 
> one of the outer VLAN tags per bridge, enabling IVL and hope that the 
> FDB is looked including the outer and inner VLAN tags and not just the 
> inner VLAN tag.
>
> If we don't have a FID concept, and not all switches do, how we can 
> still support tx forwarding offload here?

In principle, the deal you make with the bridge is basically:

   If you (bridge) give me (driver) a single skb, marked for offloading,
   to one of my ports, I promise to do the Right Thing™ and forward it
   to the same set of ports that you would have.

If your device needs a port mask instead of a fake DSA chip (what
mv88e6xxx does) in order to make that happen, that should be perfectly
fine.

So the FDB can either do lookups based on {DA} or {DA,VID}, but there is
no level of indirection there? I.e. you cannot have a set of VIDs map to
the same database?

In that case, my guess is that if you want to support multiple bridges
that operate completely independently wrt their FDBs, you will need to
allocate a VID per bridge like you say. That is the only field in the
lookup key that you can control. That VID could also be cached per-port
to keep the tagger overhead down.

> -- 
> Florian

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-06  2:50   ` Florian Fainelli
  2021-10-06 23:15     ` Tobias Waldekranz
@ 2021-10-07  1:08     ` Vladimir Oltean
  1 sibling, 0 replies; 16+ messages in thread
From: Vladimir Oltean @ 2021-10-07  1:08 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Alvin Šipraga, netdev, Andrew Lunn, Tobias Waldekranz

On Tue, Oct 05, 2021 at 07:50:40PM -0700, Florian Fainelli wrote:
> On 10/5/2021 3:12 AM, Vladimir Oltean wrote:
> > I don't want to answer any of these questions until I understand how
> > does your hardware intend the FID and FID_EN bits from the DSA header to
> > be used. The FID only has 2 bits, so it is clear to me that it doesn't
> > have the same understanding of the term as mv88e6xxx, if the Realtek
> > switch has up to 4 FIDs while Marvell up to 4K.
> > 
> > You should ask yourself not only how to prevent leakage, but also the
> > flip side, how should I pass the packet to the switch in such a way that
> > it will learn its MAC SA in the right FID, assuming that you go with FDB
> > isolation first and figure that out. Once that question is answered, you
> > can in premise specify an allowance port mask which is larger than
> > needed (the entire mask of user ports) and the switch should only
> > forward it towards the ports belonging to the same FID, which are
> > roughly equivalent with the ports under a specific bridge. You can
> > create a mapping between a FID and dp->bridge_num. Makes sense or am I
> > completely off?
> 
> Sorry for sort of hijacking this discussion.
> 
> Broadcom switches do not have FIDs however using Alvin's topology, and given
> the existing bridge support in b53, it also does not look like there is
> leaking from one bridge to other because of the port matrix configuration
> which is enforced in addition to the VLAN membership.

I don't think we're using the word "leaking" in quite the same way.
TX forwarding offload implies calling brcm_tag_xmit_ll(). How would the
port matrix configuration and VLAN membership help you? The CPU port
(ingress for this packet) should be in the forwarding matrix of all
other ports, and in all VLAN IDs. If you blindly put a port mask for
this packet that targets all user ports, it should reach all user ports
no questions asked, regardless of the bridge they're under.

Alvin's case was discussing the idea of allowing all user ports to be
candidates for destinations of this packet. Hoping that the FDB lookup
would come in and further restrict those candidates to only the ones
belonging to, say, br0. Leaking is possible if you don't have FDB
isolation between br0 and br1, because to the hardware, it's all a
single VLAN, just with a stick between one user port and the other
pretending to be a fence.

> However based on what I see in tag_dsa.c for the transmit case with
> skb->offload_fwd_mark, I would have to dig into the bridge's members
> in order to construct a bitmask of ports to provide to tag_brcm.c, so
> that does not really get me anywhere, does it?

And even then it wouldn't be correct. Not just the flooded packets come
with skb->offload_fwd_mark = true, all packets coming from the bridge
do, even the unicast ones. So the destination port mask needs to be a
subset of the ports under dp->bridge_dev.

But forget about using the destination port mask as a bitmap with more
than one bit set, and figuring out for each packet how to set it. This
mechanism wasn't created for that use case, that would require so much
rework in the network stack that it isn't even funny.

Simple question: what do Broadcom switches do if you throw a packet at
them which has no DSA tag? Forward it as usual, like sja1105, or flat
out drop, like Marvell?

> Those switches also always do double VLAN tag (802.1ad) normalization within
> their data path whenever VLAN is globally enabled within the switch,

I don't know what is double VLAN tag normalization, sorry. Is it
something like "if an outer tag TPID is present, classify the packet to
the outer VID, else if an inner tag TPID is present, classify to the
inner VID, else to the port-based default"? I'm not sure that is helpful
either in general or in this particular case.

> so in premise you could achieve the same isolation by reserving one of
> the outer VLAN tags per bridge, enabling IVL and hope that the FDB is
> looked

Hope? Well, is it or is it not? It's a bit of a pointless exercise if it isn't.

> including the outer and inner VLAN tags and not just the inner VLAN
> tag.

So I expect that if you encapsulate packets from the host in an outer
VLAN tag, the FDB will be looked up in that outer VLAN. That is exactly
what is needed in the case of VLAN-unaware bridging with proper FDB
isolation. The outer VLAN should have a value equal to the private pvid
configured on the ingress of the user ports that are under the
VLAN-unaware bridge, and all should in fact be well.
But in the case of VLAN-aware bridging, you want the switch to look up
the FDB in the same VLAN ID that the user port would classify it in,
were it to receive that same packet on ingress. So encapsulating it
wouldn't do it any good.

> If we don't have a FID concept, and not all switches do, how we can still
> support tx forwarding offload here?

Yes, sja1105 does not have a FID concept indeed. And it barely even has
a DSA tag.

If you don't have the concept of a FID, one thing I can tell you from
the get-go is that multiple VLAN-aware bridges will be broken, because
they don't have proper isolation at the level of FDB lookups among them.
So you should simply deny that configuration and operate with a single
VLAN-aware bridge, or multiple VLAN-unaware ones.

To isolate VLAN-unaware bridges between each other you can just crop
some VLANs from the 4K VID space (as many as the number of bridges you
want to support) and use them as private pvid on all ports, as well as
the encapsulating outer VLAN ID on xmit.

But anyway, here's something I don't understand: is there any field in
the Broadcom xmit DSA tag where you tell the switch in which VLAN should
the packet be processed? If there isn't, and the only mechanism by which
the switch classifies a packet to a VLAN on the IMP port is simply by
looking at the 802.1Q header; and yet it only looks at the 802.1Q header
if VLAN awareness is turned on, then bad luck, because VLAN awareness is
a global setting. So you turn it on for the IMP port => you turn it on
for all user ports as well => bye bye FDB isolation between VLAN-unaware
bridges, because bye-bye VLAN-unaware bridges.

So I just hope there is a way to inject a packet into a given VLAN
through your IMP port that does not involve turning VLAN awareness on
globally. If that is not the case, well, I don't know. Can we have a
more complete picture of the Broadcom tag other than what tag_brcm.c
sets today?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-06 16:16           ` Alvin Šipraga
@ 2021-10-07  9:47             ` Vladimir Oltean
  2021-10-07 11:22               ` Alvin Šipraga
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Oltean @ 2021-10-07  9:47 UTC (permalink / raw)
  To: Alvin Šipraga; +Cc: netdev, Florian Fainelli, Andrew Lunn

On Wed, Oct 06, 2021 at 04:16:34PM +0000, Alvin Šipraga wrote:
> First, allow me to reproduce the relevant part of the datasheet here:
>
> | == Search and Learning
> |
> | = Search
> |
> | When a packet is received, the RTL8365MB-VC uses the destination MAC
> | address, Filtering Identifier (FID) and Enhanced Filtering Identifier
> | (EFID) to search the 2K-entry look-up table. The 48-bit MAC address,
> | 4-bit FID and 3-bit EFID use a hash algorithm, to calculate an 11-bit
> | index value. The RTL8365MB-VC uses the index to compare the packet MAC
> | address with the entries (MAC addresses) in the look-up table. This is
> | the ‘Address Search’. If the destination MAC address is not found, the
> | switch will broadcast the packet according to VLAN configuration.

Unless the wording is plain incorrect or does not cover all cases except
for the "default operation", I think this says that the LUT is always
indexed based on a hash of {48-bit MAC, 4-bit FID, 3-bit EFID}.
No mention of VID.

> |
> | = Learning
> |
> | The RTL8365MB-VC uses the source MAC address, FID, and EFID of the
> | incoming packet to hash into a 9-bit index. It then compares the source
> | MAC address with the data (MAC addresses) in this index. If there is a
> | match with one of the entries, the RTL8365MB-VC will update the entry
> | with new information. If there is no match and the 2K entries are not
> | all occupied by other MAC addresses, the RTL8365MB- VC will record the
> | source MAC address and ingress port number into an empty entry. This
> | process is called ‘Learning’.
> | Address aging is used to keep the contents of the address table correct
> | in a dynamic network topology. The look-up engine will update the time
> | stamp information of an entry whenever the corresponding source MAC
> | address appears. An entry will be invalid (aged out) if its time stamp
> | information is not refreshed by the address learning process during the
> | aging time period. The aging time of the RTL8365MB-VC is between 200 and
> | 400 seconds (typical is 300 seconds).
> |
> | == SVL and IVL/SVL
> |
> | The RTL8365MB-VC supports a 16-group Filtering Identifier (FID) for L2
> | search and learning. In default operation, all VLAN entries belong to
> | the same FID. This is called Shared VLAN Learning (SVL). If VLAN entries
> | are configured to different FIDs, then the same source MAC address with
> | multiple FIDs can be learned into different look-up table entries. This
> | is called Independent VLAN Learning and Shared VLAN Learning (IVL/SVL).

I think I understand what they're trying to say, although I don't
understand what does "default operation" mean. Typical usage? No idea.

> This "IVL/SVL" mode would appear to correspond to a field in the vendor
> driver sources called ivl_svl (ivl_svl=1 is what I have referred to as
> "IVL" all this time), which is part of each VLAN configuration in the
> VLAN table. But that field also comes with a /* IVL */ or /* IVL_EN */
> comment next to it in some places. So I am unsure whether there is a
> third, "genuine" IVL mode which does not use the FID at all. At least,
> the description in the datasheet doesn't seem seem to correlate with the
> behaviour of this ivl_svl switch. But I could be parsing it wrong.

So you've said that a VLAN table entry contains an IVL_EN bit, and a FID.
It's this structure, right?

struct rtl8365mb_vlan_4k {
	u16 vid;
	u16 member;
	u16 untag;
	u8 fid;
	u8 priority;
	u8 priority_en : 1;
	u8 policing_en : 1;
	u8 ivl_en : 1;
	u8 meteridx;
};

What they say is: if you configure some of the VLAN table entries with
non-identical FIDs, you are operating in mixed IVL/SVL mode. Meaning:
you still haven't set the IVL bit in any of the VLAN table entries,
therefore you are still using SVL, where the VLAN table maps a VID to a
FID (and this is in line with the explanation given above).
But on the other hand, not all VLANs map to the same FID (as in the pure
SVL case). So it is a mixed SVL/IVL mode.

What I suspect is that if you set the IVL bit in the VLAN table entry,
the FID is completely ignored. Or maybe, with IVL, the VID _is_ the FID,
and in that case, the description above would actually be correct in
stating that the LUT is always looked up by {MAC, EFID, FID}.
What absolutely bugs me is the fact that they say the FID is 4-bit.
When using a 4K VLAN ID as FID, you can't use just 4 bits of it...

> Now, rather than speculate further on the semantics, I went ahead and
> tested out the behaviour by:
>
> - adding 32 VLANs 100..131 on a port, all with IVL (i.e. ivl_svl=1)
> - cycling through the 8 possible port EFIDs (0..7) on that port
> - for each EFID, sending one 802.1Q-tagged frame with VID=n for
> n=100..131 to the port from the network
>
> Some notes:
>
> - the chip supports up to 32 concurrent VLANs (globally); this is a
> general limitation of the hardware.
> - in this scenario the MAC SA is the same for each frame I transmit from
> the network into the port.
>
> By dumping the hardware FDB after the fact, I can see 32 * 8 FDB entries
> for the given MAC SA of my frames:
>
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
> 	addr    mac                     vid_fid spa     fid     efid
> 	0004    00:00:aa:aa:aa:aa       104     2       0       0
> 	0005    00:00:aa:aa:aa:aa       112     2       0       3
> 	0006    00:00:aa:aa:aa:aa       120     2       0       2
> 	0008    00:00:aa:aa:aa:aa       128     2       0       5
> 	0036    00:00:aa:aa:aa:aa       105     2       0       0
> 	0037    00:00:aa:aa:aa:aa       113     2       0       3
> 	0038    00:00:aa:aa:aa:aa       121     2       0       2
> 	0040    00:00:aa:aa:aa:aa       129     2       0       5
> 	0068    00:00:aa:aa:aa:aa       106     2       0       0
> 	... (table continues with an entry for each VID/EFID combo)
>
> Legend:
> 	addr: look-up-table index
> 	mac: MAC address
> 	vid_fid: VID of the frame for both IVL and SVL

Who gave it this "vid_fid" name?

> 	spa: source port address, i.e. the port that learned
> 	fid: FID (of the VLAN)
> 	efid: EFID (of the port)
>
> I also tried sending untagged frames from the network and cycling
> through one of the VLANs as PVID, in which case the port would learn and
> make an entry with vid_fid corresponding to the PVID.
>
> This suggests to me that the IVL field of the VLAN configuration really
> does achieve Independent VLAN learning, and that there are not many
> constraints here besides the size of the look-up-table.

Can you repeat the experiment sweeping through EFIDs, but with the VLANs
configured for SVL and having the same FID? I would expect that the LUT
indices will be different, but still as many. Just want to confirm my
theory that the EFID provides port-based isolation regardless of IVL_EN.

Also, can you please repeat the IVL experiment but with VIDs not having
consecutive values, but rather N, N+16, N+32, N+48, ... N+2048 etc?
I would like to get to the bottom of that 4-bit FID thing.

> Could it be that the ivl_svl switch simply controls how this
> look-up-table index is computed? That is to say:
>
> SVL: {FID, EFID, MAC} -> index
> IVL: {VID, EFID, MAC} -> index
>
> I tried the following scenario:
>
> 	# add VLAN 100/101
> 	bridge vlan add vid 100 dev swp2
> 	bridge vlan add vid 101 dev swp2
>
> 	# send VID 100 frame from another host on the network
> 	mausezahn eth2 -Q 100 -c 1 -a '00:00:aa:aa:aa:aa' -t tcp
>
> 	# dump HW FDB
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>
> 	# send VID 101 frame this time
> 	mausezahn eth2 -Q 101 -c 1 -a '00:00:aa:aa:aa:aa' -t tcp
>
> 	# dump HW FDB
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>
> I then tested this out with:
>
> - IVL, FID=0
>
> 	##### send frame on VLAN 100
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
> 	addr    mac                     vid_fid spa     fid     efid
> 	0388    00:00:aa:aa:aa:aa       100     2       0       0
> 	##### send frame on VLAN 101
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
> 	addr    mac                     vid_fid spa     fid     efid
> 	0388    00:00:aa:aa:aa:aa       100     2       0       0
> 	0420    00:00:aa:aa:aa:aa       101     2       0       0
>
> - IVL, FID=9
>
> 	##### send frame on VLAN 100
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
> 	addr    mac                     vid_fid spa     fid     efid
> 	0388    00:00:aa:aa:aa:aa       100     2       9       0
> 	##### send frame on VLAN 101
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
> 	addr    mac                     vid_fid spa     fid     efid
> 	0388    00:00:aa:aa:aa:aa       100     2       9       0
> 	0420    00:00:aa:aa:aa:aa       101     2       9       0
>
> - SVL, FID=0
>
> 	##### send frame on VLAN 100
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
> 	addr    mac                     vid_fid spa     fid     efid
> 	1280    00:00:aa:aa:aa:aa       100     2       0       0
> 	##### send frame on VLAN 101
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
> 	addr    mac                     vid_fid spa     fid     efid
> 	1280    00:00:aa:aa:aa:aa       101     2       0       0
>
> - SVL, FID=9
>
> 	##### send frame on VLAN 100
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
> 	addr    mac                     vid_fid spa     fid     efid
> 	1056    00:00:aa:aa:aa:aa       100     2       9       0
> 	##### send frame on VLAN 101
> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
> 	addr    mac                     vid_fid spa     fid     efid
> 	1056    00:00:aa:aa:aa:aa       101     2       9       0
>
> Some observations:
>
> - with IVL, index is the same for FID=0,9
> - with SVL, index is different for FID=0,9
> - with IVL, index is different for VID=100,101
> - with SVL, index is the same for VID=100,101

Yes, good job investigating, this seems to support the theory that when
a VLAN table entry is configured for IVL, the FID (actually vid_fid in
your dumps) is the VID, otherwise it's the FID from the VLAN table entry.

> In particular, with IVL, the FID is stored in the table but it does not
> seem to affect the index.

It's probably there so that you don't need to flush the LUT and
reinstall everything when you change a VLAN table entry from IVL to SVL/IVL.

> I _think_ I can look up the FDB by VID, not just FID - I still have to
> confirm that but I think it depends on whether the particular VLAN is in
> IVL or SVL mode.
>
> But either way, there are bound to be collisions given the way the
> look-up-table works. If the driver is asked to offload two FDB entries
> which map to the same look-up-table entry (i.e. same index), can't it
> just error out on the second request? Something like "I see this entry
> is already occupied by a static (offloaded) FDB entry, so I can't
> satisfy this request".

Yes, in case of hash collisions between unrelated entries on a full row,
returning -ENOSPC is clearly okay. This case is more interesting because
the LUT entries are not unrelated. I was commenting under the assumption
that you will need to give switchdev the impression that you are
offloading entries via IVL (so you should accept two FDB entries for the
same MAC DA in different VIDs, as long as they point towards the same
destination port) because that's how the hardware is going to treat them.
The only problematic case is when switchdev asks one FDB in one VLAN to
go one way, and another in another VLAN to go another way.

[ by the way you can't propagate errors from .port_fdb_add to switchdev,
  and to the bridge, sorry ]

Anyway, doesn't matter, it's clearer now that you don't have to care
about this, I don't think you should use the SVL or SVL/IVL modes for
anything, just program all VLAN table entries with IVL=true, and set the
EFID based on dp->bridge_num.

> > And most importantly, do you see the FID bits in the tagger in the
> > receive path as well?
>
> No, I don't, which is kind of strange. But is it a problem?

Not really, no.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-07  9:47             ` Vladimir Oltean
@ 2021-10-07 11:22               ` Alvin Šipraga
  2021-10-07 11:34                 ` Vladimir Oltean
  0 siblings, 1 reply; 16+ messages in thread
From: Alvin Šipraga @ 2021-10-07 11:22 UTC (permalink / raw)
  To: Vladimir Oltean; +Cc: netdev, Florian Fainelli, Andrew Lunn

On 10/7/21 11:47 AM, Vladimir Oltean wrote:
> On Wed, Oct 06, 2021 at 04:16:34PM +0000, Alvin Šipraga wrote:
>> First, allow me to reproduce the relevant part of the datasheet here:
>>
>> | == Search and Learning
>> |
>> | = Search
>> |
>> | When a packet is received, the RTL8365MB-VC uses the destination MAC
>> | address, Filtering Identifier (FID) and Enhanced Filtering Identifier
>> | (EFID) to search the 2K-entry look-up table. The 48-bit MAC address,
>> | 4-bit FID and 3-bit EFID use a hash algorithm, to calculate an 11-bit
>> | index value. The RTL8365MB-VC uses the index to compare the packet MAC
>> | address with the entries (MAC addresses) in the look-up table. This is
>> | the ‘Address Search’. If the destination MAC address is not found, the
>> | switch will broadcast the packet according to VLAN configuration.
> 
> Unless the wording is plain incorrect or does not cover all cases except
> for the "default operation", I think this says that the LUT is always
> indexed based on a hash of {48-bit MAC, 4-bit FID, 3-bit EFID}.
> No mention of VID.

Yes, I parsed this paragraph in the same way.

> 
>> |
>> | = Learning
>> |
>> | The RTL8365MB-VC uses the source MAC address, FID, and EFID of the
>> | incoming packet to hash into a 9-bit index. It then compares the source
>> | MAC address with the data (MAC addresses) in this index. If there is a
>> | match with one of the entries, the RTL8365MB-VC will update the entry
>> | with new information. If there is no match and the 2K entries are not
>> | all occupied by other MAC addresses, the RTL8365MB- VC will record the
>> | source MAC address and ingress port number into an empty entry. This
>> | process is called ‘Learning’.
>> | Address aging is used to keep the contents of the address table correct
>> | in a dynamic network topology. The look-up engine will update the time
>> | stamp information of an entry whenever the corresponding source MAC
>> | address appears. An entry will be invalid (aged out) if its time stamp
>> | information is not refreshed by the address learning process during the
>> | aging time period. The aging time of the RTL8365MB-VC is between 200 and
>> | 400 seconds (typical is 300 seconds).
>> |
>> | == SVL and IVL/SVL
>> |
>> | The RTL8365MB-VC supports a 16-group Filtering Identifier (FID) for L2
>> | search and learning. In default operation, all VLAN entries belong to
>> | the same FID. This is called Shared VLAN Learning (SVL). If VLAN entries
>> | are configured to different FIDs, then the same source MAC address with
>> | multiple FIDs can be learned into different look-up table entries. This
>> | is called Independent VLAN Learning and Shared VLAN Learning (IVL/SVL).
> 
> I think I understand what they're trying to say, although I don't
> understand what does "default operation" mean. Typical usage? No idea.

I think it's just saying that this is the state when the chip is reset. 
Put another way, all VLANs have FID=0 and are in SVL mode after one 
resets the chip. At least that's the state I find the chip when I reset 
it - which is consistent with this paragraph.

> 
>> This "IVL/SVL" mode would appear to correspond to a field in the vendor
>> driver sources called ivl_svl (ivl_svl=1 is what I have referred to as
>> "IVL" all this time), which is part of each VLAN configuration in the
>> VLAN table. But that field also comes with a /* IVL */ or /* IVL_EN */
>> comment next to it in some places. So I am unsure whether there is a
>> third, "genuine" IVL mode which does not use the FID at all. At least,
>> the description in the datasheet doesn't seem seem to correlate with the
>> behaviour of this ivl_svl switch. But I could be parsing it wrong.
> 
> So you've said that a VLAN table entry contains an IVL_EN bit, and a FID.
> It's this structure, right?
> 
> struct rtl8365mb_vlan_4k {
> 	u16 vid;
> 	u16 member;
> 	u16 untag;
> 	u8 fid;
> 	u8 priority;
> 	u8 priority_en : 1;
> 	u8 policing_en : 1;
> 	u8 ivl_en : 1;
> 	u8 meteridx;
> };

Correct.

> 
> What they say is: if you configure some of the VLAN table entries with
> non-identical FIDs, you are operating in mixed IVL/SVL mode. Meaning:
> you still haven't set the IVL bit in any of the VLAN table entries,
> therefore you are still using SVL, where the VLAN table maps a VID to a
> FID (and this is in line with the explanation given above).
> But on the other hand, not all VLANs map to the same FID (as in the pure
> SVL case). So it is a mixed SVL/IVL mode.

This is how I understand it too, yeah. In particular, the description in 
the datasheet only covers scenarios where the IVL_EN bit is 0.

> 
> What I suspect is that if you set the IVL bit in the VLAN table entry,
> the FID is completely ignored. Or maybe, with IVL, the VID _is_ the FID,
> and in that case, the description above would actually be correct in
> stating that the LUT is always looked up by {MAC, EFID, FID}.
> What absolutely bugs me is the fact that they say the FID is 4-bit.
> When using a 4K VLAN ID as FID, you can't use just 4 bits of it...

Yeah, it bugs me too... But I am now of the view that the datasheet is 
simply incomplete in its description of the LUT, and that it only talks 
about the SVL scenario.

> 
>> Now, rather than speculate further on the semantics, I went ahead and
>> tested out the behaviour by:
>>
>> - adding 32 VLANs 100..131 on a port, all with IVL (i.e. ivl_svl=1)
>> - cycling through the 8 possible port EFIDs (0..7) on that port
>> - for each EFID, sending one 802.1Q-tagged frame with VID=n for
>> n=100..131 to the port from the network
>>
>> Some notes:
>>
>> - the chip supports up to 32 concurrent VLANs (globally); this is a
>> general limitation of the hardware.
>> - in this scenario the MAC SA is the same for each frame I transmit from
>> the network into the port.
>>
>> By dumping the hardware FDB after the fact, I can see 32 * 8 FDB entries
>> for the given MAC SA of my frames:
>>
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>> 	addr    mac                     vid_fid spa     fid     efid
>> 	0004    00:00:aa:aa:aa:aa       104     2       0       0
>> 	0005    00:00:aa:aa:aa:aa       112     2       0       3
>> 	0006    00:00:aa:aa:aa:aa       120     2       0       2
>> 	0008    00:00:aa:aa:aa:aa       128     2       0       5
>> 	0036    00:00:aa:aa:aa:aa       105     2       0       0
>> 	0037    00:00:aa:aa:aa:aa       113     2       0       3
>> 	0038    00:00:aa:aa:aa:aa       121     2       0       2
>> 	0040    00:00:aa:aa:aa:aa       129     2       0       5
>> 	0068    00:00:aa:aa:aa:aa       106     2       0       0
>> 	... (table continues with an entry for each VID/EFID combo)
>>
>> Legend:
>> 	addr: look-up-table index
>> 	mac: MAC address
>> 	vid_fid: VID of the frame for both IVL and SVL
> 
> Who gave it this "vid_fid" name?

The name is borrowed from the vendor driver's data structure.

> 
>> 	spa: source port address, i.e. the port that learned
>> 	fid: FID (of the VLAN)
>> 	efid: EFID (of the port)
>>
>> I also tried sending untagged frames from the network and cycling
>> through one of the VLANs as PVID, in which case the port would learn and
>> make an entry with vid_fid corresponding to the PVID.
>>
>> This suggests to me that the IVL field of the VLAN configuration really
>> does achieve Independent VLAN learning, and that there are not many
>> constraints here besides the size of the look-up-table.
> 
> Can you repeat the experiment sweeping through EFIDs, but with the VLANs
> configured for SVL and having the same FID? I would expect that the LUT
> indices will be different, but still as many. Just want to confirm my
> theory that the EFID provides port-based isolation regardless of IVL_EN.

I was actually testing this just now.

For VLANs with SVL same FID and EFID, the same MAC is learned into the 
same index, irrespective of VID (no surprise).

However, cycling through the EFID, the same MAC is instead learned into 
8 different indices.

So yes, EFID provides port-based isolation regardless of IVL_EN. This is 
consistent with the description in the datasheet too.

> 
> Also, can you please repeat the IVL experiment but with VIDs not having
> consecutive values, but rather N, N+16, N+32, N+48, ... N+2048 etc?
> I would like to get to the bottom of that 4-bit FID thing.

Sure. I ran the test as you suggested with N=100 and the results are the 
same: for 32 VLANs and cycling through the 8 EFIDs for each, I end up 
with 256 entries in the LUT. If I keep adding VLANs (note the limit is 
32, but I can remove an old one and put a new one without losing the LUT 
entries of the old), then the LUT keeps just taking on entries.

Considering this, do you agree with the mapping I suggested in the 
previous email?

| SVL: {FID, EFID, MAC} -> index
| IVL: {VID, EFID, MAC} -> index

There doesn't seem to be any 4-bit resolution to the VID key when doing 
an IVL lookup.

> 
>> Could it be that the ivl_svl switch simply controls how this
>> look-up-table index is computed? That is to say:
>>
>> SVL: {FID, EFID, MAC} -> index
>> IVL: {VID, EFID, MAC} -> index
>>
>> I tried the following scenario:
>>
>> 	# add VLAN 100/101
>> 	bridge vlan add vid 100 dev swp2
>> 	bridge vlan add vid 101 dev swp2
>>
>> 	# send VID 100 frame from another host on the network
>> 	mausezahn eth2 -Q 100 -c 1 -a '00:00:aa:aa:aa:aa' -t tcp
>>
>> 	# dump HW FDB
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>>
>> 	# send VID 101 frame this time
>> 	mausezahn eth2 -Q 101 -c 1 -a '00:00:aa:aa:aa:aa' -t tcp
>>
>> 	# dump HW FDB
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>>
>> I then tested this out with:
>>
>> - IVL, FID=0
>>
>> 	##### send frame on VLAN 100
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>> 	addr    mac                     vid_fid spa     fid     efid
>> 	0388    00:00:aa:aa:aa:aa       100     2       0       0
>> 	##### send frame on VLAN 101
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>> 	addr    mac                     vid_fid spa     fid     efid
>> 	0388    00:00:aa:aa:aa:aa       100     2       0       0
>> 	0420    00:00:aa:aa:aa:aa       101     2       0       0
>>
>> - IVL, FID=9
>>
>> 	##### send frame on VLAN 100
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>> 	addr    mac                     vid_fid spa     fid     efid
>> 	0388    00:00:aa:aa:aa:aa       100     2       9       0
>> 	##### send frame on VLAN 101
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>> 	addr    mac                     vid_fid spa     fid     efid
>> 	0388    00:00:aa:aa:aa:aa       100     2       9       0
>> 	0420    00:00:aa:aa:aa:aa       101     2       9       0
>>
>> - SVL, FID=0
>>
>> 	##### send frame on VLAN 100
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>> 	addr    mac                     vid_fid spa     fid     efid
>> 	1280    00:00:aa:aa:aa:aa       100     2       0       0
>> 	##### send frame on VLAN 101
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>> 	addr    mac                     vid_fid spa     fid     efid
>> 	1280    00:00:aa:aa:aa:aa       101     2       0       0
>>
>> - SVL, FID=9
>>
>> 	##### send frame on VLAN 100
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>> 	addr    mac                     vid_fid spa     fid     efid
>> 	1056    00:00:aa:aa:aa:aa       100     2       9       0
>> 	##### send frame on VLAN 101
>> 	cat /sys/kernel/debug/rtl8365mb/lut_dump
>> 	addr    mac                     vid_fid spa     fid     efid
>> 	1056    00:00:aa:aa:aa:aa       101     2       9       0
>>
>> Some observations:
>>
>> - with IVL, index is the same for FID=0,9
>> - with SVL, index is different for FID=0,9
>> - with IVL, index is different for VID=100,101
>> - with SVL, index is the same for VID=100,101
> 
> Yes, good job investigating, this seems to support the theory that when
> a VLAN table entry is configured for IVL, the FID (actually vid_fid in
> your dumps) is the VID, otherwise it's the FID from the VLAN table entry.

Right. This also justifies the name vid_fid in the vendor driver.

> 
>> In particular, with IVL, the FID is stored in the table but it does not
>> seem to affect the index.
> 
> It's probably there so that you don't need to flush the LUT and
> reinstall everything when you change a VLAN table entry from IVL to SVL/IVL.
> 
>> I _think_ I can look up the FDB by VID, not just FID - I still have to
>> confirm that but I think it depends on whether the particular VLAN is in
>> IVL or SVL mode.
>>
>> But either way, there are bound to be collisions given the way the
>> look-up-table works. If the driver is asked to offload two FDB entries
>> which map to the same look-up-table entry (i.e. same index), can't it
>> just error out on the second request? Something like "I see this entry
>> is already occupied by a static (offloaded) FDB entry, so I can't
>> satisfy this request".
> 
> Yes, in case of hash collisions between unrelated entries on a full row,
> returning -ENOSPC is clearly okay. This case is more interesting because
> the LUT entries are not unrelated. I was commenting under the assumption
> that you will need to give switchdev the impression that you are
> offloading entries via IVL (so you should accept two FDB entries for the
> same MAC DA in different VIDs, as long as they point towards the same
> destination port) because that's how the hardware is going to treat them.
> The only problematic case is when switchdev asks one FDB in one VLAN to
> go one way, and another in another VLAN to go another way.
> 
> [ by the way you can't propagate errors from .port_fdb_add to switchdev,
>    and to the bridge, sorry ]

OK, but I guess returning -ENOSPC in .port_fdb_add is the best a DSA 
driver can do, right?

> 
> Anyway, doesn't matter, it's clearer now that you don't have to care
> about this, I don't think you should use the SVL or SVL/IVL modes for
> anything, just program all VLAN table entries with IVL=true, and set the
> EFID based on dp->bridge_num.

Cool, glad we're on the same page!

> 
>>> And most importantly, do you see the FID bits in the tagger in the
>>> receive path as well?
>>
>> No, I don't, which is kind of strange. But is it a problem?
> 
> Not really, no.
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-07 11:22               ` Alvin Šipraga
@ 2021-10-07 11:34                 ` Vladimir Oltean
  2021-10-07 14:15                   ` Alvin Šipraga
  0 siblings, 1 reply; 16+ messages in thread
From: Vladimir Oltean @ 2021-10-07 11:34 UTC (permalink / raw)
  To: Alvin Šipraga; +Cc: netdev, Florian Fainelli, Andrew Lunn

On Thu, Oct 07, 2021 at 11:22:32AM +0000, Alvin Šipraga wrote:
> >> 	spa: source port address, i.e. the port that learned
> >> 	fid: FID (of the VLAN)
> >> 	efid: EFID (of the port)
> >>
> >> I also tried sending untagged frames from the network and cycling
> >> through one of the VLANs as PVID, in which case the port would learn and
> >> make an entry with vid_fid corresponding to the PVID.
> >>
> >> This suggests to me that the IVL field of the VLAN configuration really
> >> does achieve Independent VLAN learning, and that there are not many
> >> constraints here besides the size of the look-up-table.
> >
> > Can you repeat the experiment sweeping through EFIDs, but with the VLANs
> > configured for SVL and having the same FID? I would expect that the LUT
> > indices will be different, but still as many. Just want to confirm my
> > theory that the EFID provides port-based isolation regardless of IVL_EN.
>
> I was actually testing this just now.
>
> For VLANs with SVL same FID and EFID, the same MAC is learned into the
> same index, irrespective of VID (no surprise).
>
> However, cycling through the EFID, the same MAC is instead learned into
> 8 different indices.
>
> So yes, EFID provides port-based isolation regardless of IVL_EN. This is
> consistent with the description in the datasheet too.

Ok, so the EFID will be the basis for FDB isolation then. An EFID for
all standalone ports, and an EFID for each bridge.

> > Also, can you please repeat the IVL experiment but with VIDs not having
> > consecutive values, but rather N, N+16, N+32, N+48, ... N+2048 etc?
> > I would like to get to the bottom of that 4-bit FID thing.
>
> Sure. I ran the test as you suggested with N=100 and the results are the
> same: for 32 VLANs and cycling through the 8 EFIDs for each, I end up
> with 256 entries in the LUT. If I keep adding VLANs (note the limit is
> 32, but I can remove an old one and put a new one without losing the LUT
> entries of the old), then the LUT keeps just taking on entries.
>
> Considering this, do you agree with the mapping I suggested in the
> previous email?
>
> | SVL: {FID, EFID, MAC} -> index
> | IVL: {VID, EFID, MAC} -> index
>
> There doesn't seem to be any 4-bit resolution to the VID key when doing
> an IVL lookup.

If the 32 VLAN IDs are incremented in steps of 16, then yes, so it would
seem.

> > Yes, in case of hash collisions between unrelated entries on a full row,
> > returning -ENOSPC is clearly okay. This case is more interesting because
> > the LUT entries are not unrelated. I was commenting under the assumption
> > that you will need to give switchdev the impression that you are
> > offloading entries via IVL (so you should accept two FDB entries for the
> > same MAC DA in different VIDs, as long as they point towards the same
> > destination port) because that's how the hardware is going to treat them.
> > The only problematic case is when switchdev asks one FDB in one VLAN to
> > go one way, and another in another VLAN to go another way.
> >
> > [ by the way you can't propagate errors from .port_fdb_add to switchdev,
> >    and to the bridge, sorry ]
>
> OK, but I guess returning -ENOSPC in .port_fdb_add is the best a DSA
> driver can do, right?

Yeah, that part needs some work. It isn't simple stuff.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-07 11:34                 ` Vladimir Oltean
@ 2021-10-07 14:15                   ` Alvin Šipraga
  2021-10-07 16:06                     ` Vladimir Oltean
  0 siblings, 1 reply; 16+ messages in thread
From: Alvin Šipraga @ 2021-10-07 14:15 UTC (permalink / raw)
  To: Vladimir Oltean; +Cc: netdev, Florian Fainelli, Andrew Lunn

On 10/7/21 1:34 PM, Vladimir Oltean wrote:
> On Thu, Oct 07, 2021 at 11:22:32AM +0000, Alvin Šipraga wrote:
>>>> 	spa: source port address, i.e. the port that learned
>>>> 	fid: FID (of the VLAN)
>>>> 	efid: EFID (of the port)
>>>>
>>>> I also tried sending untagged frames from the network and cycling
>>>> through one of the VLANs as PVID, in which case the port would learn and
>>>> make an entry with vid_fid corresponding to the PVID.
>>>>
>>>> This suggests to me that the IVL field of the VLAN configuration really
>>>> does achieve Independent VLAN learning, and that there are not many
>>>> constraints here besides the size of the look-up-table.
>>>
>>> Can you repeat the experiment sweeping through EFIDs, but with the VLANs
>>> configured for SVL and having the same FID? I would expect that the LUT
>>> indices will be different, but still as many. Just want to confirm my
>>> theory that the EFID provides port-based isolation regardless of IVL_EN.
>>
>> I was actually testing this just now.
>>
>> For VLANs with SVL same FID and EFID, the same MAC is learned into the
>> same index, irrespective of VID (no surprise).
>>
>> However, cycling through the EFID, the same MAC is instead learned into
>> 8 different indices.
>>
>> So yes, EFID provides port-based isolation regardless of IVL_EN. This is
>> consistent with the description in the datasheet too.
> 
> Ok, so the EFID will be the basis for FDB isolation then. An EFID for
> all standalone ports, and an EFID for each bridge.

OK, thanks for your patience thus far. I have just one more set of 
questions, this time regarding VLAN-unaware bridges.

First of all, I noticed that even with VLAN-unaware bridges, if I add 
VLANs to the bridge or ports, then my .port_vlan_add is called even 
though VLAN filtering is not enabled. That's OK and the switch will 
still receive untagged/priority-tagged frames without complaint. But if 
I may ask, what is the point? Even when programming a PVID on a given 
port of a VLAN-unaware bridge, the switch is not inserting 802.1Q tags 
on ingress frames that it forwards to the CPU port (tcpdump on the swpN 
interface still shows untagged frames). So I fail to see what capability 
is being offloaded to the switch hardware with that call to 
.port_vlan_add for VLAN-unaware bridges. Or _should_ the switch insert a 
tag for ports with a PVID?

The second concern I have is regarding learning. We agreed now that all 
of the VLANs we add to the switch should have IVL_EN=1, meaning that the 
switch will learn in a VLAN-aware manner. If the switch receives a frame 
with MAC SA=00:00:aa:aa:aa:aa, VID=1 on port N of a VLAN-unaware bridge, 
it will still have to flood a frame from the CPU with MAC 
DA=00:00:aa:aa:aa:aa if the frame doesn't have VID=1, rather than 
forwarding it directly to port N. So although the ports of this 
VLAN-unaware bridge are respecting the rules of VLAN-(un)awareness as I 
read them in the switchdev documentation, the hardware's learning 
process is still VLAN-aware and may cause more flooding than expected. 
My question to you is: is this acceptable behaviour?

To work around this, I guess I can set vlan_filtering_is_global=true. Or 
I can do some bookkeeping in the driver of VLAN settings and only 
program them when enabling VLAN filtering, which is a bit of a pain. 
Hoping you can help clear this up for me - I have a feeling I have 
misunderstood something.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: DSA: some questions regarding TX forwarding offload
  2021-10-07 14:15                   ` Alvin Šipraga
@ 2021-10-07 16:06                     ` Vladimir Oltean
  0 siblings, 0 replies; 16+ messages in thread
From: Vladimir Oltean @ 2021-10-07 16:06 UTC (permalink / raw)
  To: Alvin Šipraga; +Cc: netdev, Florian Fainelli, Andrew Lunn

On Thu, Oct 07, 2021 at 02:15:23PM +0000, Alvin Šipraga wrote:
> First of all, I noticed that even with VLAN-unaware bridges, if I add
> VLANs to the bridge or ports, then my .port_vlan_add is called even
> though VLAN filtering is not enabled. That's OK and the switch will
> still receive untagged/priority-tagged frames without complaint.

There's more to that. The switch should receive and transmit any frames
regardless of VLAN tag, not just untagged and priority-tagged.

> But if I may ask, what is the point?

The point is simple. VLANs are to be used only while the bridge has VLAN
awareness turned on, however they can be added and deleted at any time.
When the bridge has VLAN awareness turned off, those VLANs are just
supposed to sit there, inactive.

ip link add br0 type bridge
ip link set swp0 master br0
bridge vlan add dev swp0 vid 100 # inactive..
ip link set br0 type bridge vlan_filtering 1 # active!
ip link set br0 type bridge vlan_filtering 0 # inactive again

Forget about switchdev, try it out with any interface, like a veth pair.
This is how the software bridge behaves, so this is how offloading ports
should behave too.

In VLAN-unaware mode, you can delete all bridge VLANs on all ports and
on the bridge device itself, and traffic will still work. So an
offloading port should follow suit.

> Even when programming a PVID on a given port of a VLAN-unaware bridge,
> the switch is not inserting 802.1Q tags on ingress frames that it
> forwards to the CPU port (tcpdump on the swpN interface still shows
> untagged frames).

The point of a PVID is not to insert 802.1Q tags in frames, it is to
classify untagged or null-tagged frames to a certain VLAN during the
ingress stage. The frame is associated with that VLAN as the packet
traverses the switch, gets used for the address table lookup,
selection of destination ports, etc. Inserting 802.1Q tags is a concern
at the egress stage of a bridge port. A packet received as untagged can
still be transmitted as untagged if the pvid of the ingress port is
egress-untagged on the egress port. There's no requirement at the
ingress stage for that packet to be tagged or not. It may even be
flooded to two egress ports, and be sent as tagged in one, but untagged
in the other.

I still feel the need to mention that the hardware PVID may or may not
be equal to the bridge's PVID, it can be different especially when VLAN
awareness in the software bridge is disabled (see above). Typically,
hardware always wants to classify packets to a VLAN. But considering
that the bridge may not have any VLAN to give you through switchdev, and
that your hardware still needs to operate => the only logical
consequence is that you need to manage the PVID committed to hardware
when the port is under a VLAN unaware bridge on your own.

> So I fail to see what capability is being offloaded to the switch
> hardware with that call to .port_vlan_add for VLAN-unaware bridges.

The capability is that you don't lag behind the bridge's state when
VLAN awareness is finally turned on. See the commit messages in the git
blame surrounding the ds->configure_vlan_while_not_filtering.

> Or _should_ the switch insert a tag for ports with a PVID?

I don't understand this question, sorry. Insert a tag towards which port?
Above you talked about the CPU port. It is the switch driver's business
whether it configures the CPU port as a VLAN trunk or not. Several
drivers do. Those drivers also set ds->untag_bridge_pvid = true, to
ensure that DSA strips those VLANs before presenting the packets further
to the network stack (hint: the network stack would like to see the
packet as it was on the wire, not with VLANs getting added where they
didn't exist). Rule of thumb though: if you don't know that you need
this feature, you don't need it.

> The second concern I have is regarding learning. We agreed now that all
> of the VLANs we add to the switch should have IVL_EN=1, meaning that the
> switch will learn in a VLAN-aware manner.

Actually the discussion was about FDB isolation and TX forwarding
offload, not about how to make a port VLAN-unaware.

> If the switch receives a frame with MAC SA=00:00:aa:aa:aa:aa, VID=1 on
> port N of a VLAN-unaware bridge, it will still have to flood a frame
> from the CPU with MAC DA=00:00:aa:aa:aa:aa if the frame doesn't have
> VID=1, rather than forwarding it directly to port N. So although the
> ports of this VLAN-unaware bridge are respecting the rules of
> VLAN-(un)awareness as I read them in the switchdev documentation,

Are they? I don't think so. It sounds like the switch classifies the
packet in a different FID on ingress from the user port compared to the
ingress from the CPU port, depending on whether that packet has a VLAN
header or not. That's quite the opposite of "VLAN-unaware".

> the hardware's learning process is still VLAN-aware and may cause more
> flooding than expected.
> My question to you is: is this acceptable behaviour?

Nope.

> To work around this, I guess I can set vlan_filtering_is_global=true. Or
> I can do some bookkeeping in the driver of VLAN settings and only
> program them when enabling VLAN filtering, which is a bit of a pain.
> Hoping you can help clear this up for me - I have a feeling I have
> misunderstood something.

You're asking me as if I knew the hardware, yet I haven't seen the
documentation even once :)

Other switches have an actual VLAN _awareness_ setting. Something like
this (not the diagram of an actual switch, just an oversimplification/
generalization):

   +----------------------------------+
   |     packet ingresses a port      |
   +----------------------------------+
                   |
                   |
                   v
   +----------------------------------+
   | gets classified to the port pvid |
   +----------------------------------+
                   |
                   |
                   v                     / \
                 /   \                  /   \
                /     \                /     \            +--------------------------+
               /       \              / port  \           |                          |
              /  packet \  yes       /   is    \  yes     | packet gets reclassified |
             /    has    \--------->/   VLAN    \-------->+ to the VID from the      |
             \  802.1Q   /          \  aware?   /         | 802.1Q header            |
              \ header? /            \         /          |                          |
               \       /              \       /           +--------------------------+
                \     /                \     /                          |
                 \   /                  \   /                           |
                  \ /                    \ /                            |
                   | no                   | no                          |
                   |                      |                             |
                   +----------------------+-----------------------------+
                   |
                   |
                   v                             / \                / \
                 /   \                          /   \              /   \
                /     \                        /     \            /     \
               /  is   \                      /  is   \          /  is   \
              /classified     yes            / ingress \  no    / ingress \  no
             /   VLAN    \----------------->/  port a   \----->/  port in  \----> drop
             \  in the   /                  \ member of /      \ fallback  /
              \  VLAN   /                    \  this   /        \  mode?  /
               \table? /                      \ VLAN? /          \       /
                \     /                        \     /            \     /
                 \   /                          \   /              \   /
                  \ /                            \ /                \ /
                   | no                           | yes              | yes
                   |                              |                  |
                   |                              +<-----------------+
                   v                              |
                 /   \                            v
                /     \             +--------------------------+
               /  is   \            |                          |
       yes    /  VLAN   \           | read the VLAN table entry|
drop <-------/  ingress  \          | and select the FID       |
             \ filtering /          |                          |
              \ enabled?/           |                          |
               \       /            +--------------------------+
                \     /                           |
                 \   /                            |
                  \ /                             |
                   | no                           |
                   |                              |
                   v                              |
         +--------------------+                   |
         | use a port based   |                   |
         |   default FID      |                   |
         +--------------------+                   |
                   |                              |
                   +------------------------------+
                   |
                   v
       +----------------------+
       | FDB lookup using FID |
       +----------------------+

What the Linux bridge understands by "VLAN filtering" is actually a
mixture of many things, it doesn't really distinguish between ingress
filtering and awareness, it doesn't have support for fallback mode,
it has no idea about FIDs, things like that.
But one thing is certain: when in VLAN-unaware mode, it learns FDB
entries in VID 0, and that is its encoding for "match regardless of VLAN
while VLAN-unaware". But the logic above may still run in your hardware,
you have to configure your hardware to treat switchdev FDB objects in
that way. You may need to remap VID 0 from those entries when you
offload them, but that is what it means.

So you can see that certain switches allow you to say: even if a packet
has an 802.1Q header, I want to keep the classified VLAN equal to the
ingress port's PVID. There's a reason why switches are made this way, it
simplifies things. I've no idea if your switch can do that, I did look
at your rtl8365mb_port_vlan_filtering() implementation and I did notice
that you only disable ingress filtering.

If you cannot inactivate your VLANs present in the VLAN table, then
you'll have to live with the fact that you'll need, in VLAN-unaware
mode, to configure them all for SVL, and point them all towards the same FID.

So if a packet is untagged, it is classified to the PVID, which is
mapped in the VLAN table to, say, FID 1.

If a packet is tagged with VID 100, which is missing from the VLAN
table, it will get mapped to the port-based default FID, which is 1.

If a packet is tagged with VID 101, which is present in the VLAN table,
that will need to be mapped to FID 1.

So all I'm saying is that this is one way in which VLAN-unaware bridging
could be obtained, given the circumstances. It is the best? No, because
if you configure your VLAN table entries as SVL, you can't simply do
that only for the ports which are under a VLAN-unaware bridge. The VLAN
table is global for all ports. So, yes, if one bridge is VLAN-unaware,
all bridges will have to be. We haven't really had this case before, but
I suppose ds->vlan_filtering_is_global = true can deal with it.

Anyway, it is quite a suboptimal approach, but lacking a crystal ball,
I don't know what else to tell you, I hope that you find more switch
settings which you may have dismissed before.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-10-07 16:06 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-05  8:54 DSA: some questions regarding TX forwarding offload Alvin Šipraga
2021-10-05 10:12 ` Vladimir Oltean
2021-10-05 12:06   ` Alvin Šipraga
2021-10-05 13:29     ` Vladimir Oltean
2021-10-05 13:37       ` Vladimir Oltean
2021-10-05 14:38       ` Alvin Šipraga
2021-10-05 15:25         ` Vladimir Oltean
2021-10-06 16:16           ` Alvin Šipraga
2021-10-07  9:47             ` Vladimir Oltean
2021-10-07 11:22               ` Alvin Šipraga
2021-10-07 11:34                 ` Vladimir Oltean
2021-10-07 14:15                   ` Alvin Šipraga
2021-10-07 16:06                     ` Vladimir Oltean
2021-10-06  2:50   ` Florian Fainelli
2021-10-06 23:15     ` Tobias Waldekranz
2021-10-07  1:08     ` Vladimir Oltean

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.