* [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
From: Vladimir Oltean @ 2021-07-03 11:56 UTC
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

For this series I have taken Tobias' work from here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
and made the following changes:
- I collected and integrated (hopefully all of) Nikolay's, Ido's and my
  feedback on the bridge driver changes. Otherwise, the structure of the
  bridge changes is pretty much the same as Tobias left it.
- I basically rewrote the DSA infrastructure for the data plane
  forwarding offload, based on the commonalities with another switch
  driver for which I implemented this feature (not submitted here).
- I adapted mv88e6xxx to use the new infrastructure. Hopefully it still
  works, but I didn't test that.

The data plane of the software bridge can be partially offloaded to
switchdev, in the sense that we can trust the accelerator to:
(a) look up its FDB (which is more or less in sync with the software
    bridge FDB) for selecting the destination ports for a packet
(b) replicate the frame in hardware in case it's a multicast/broadcast,
    instead of the software bridge having to clone it and send the
    clones to each net device one at a time. This reduces the bandwidth
    needed between the CPU and the accelerator, as well as the CPU time
    spent.

The data path forwarding offload is managed per "hardware domain", a
concept introduced in this series that generalizes the existing
"offload_fwd_mark". Every packet is delivered only once to each
hardware domain.

In addition, Tobias said in the original cover letter:

====================
## Overview

   vlan1   vlan2
       \   /
   .-----------.
   |    br0    |
   '-----------'
   /   /   \   \
swp0 swp1 swp2 eth0
  :   :   :
  (hwdom 1)

Up to this point, switchdevs have been trusted with offloading
forwarding between bridge ports, e.g. forwarding a unicast from swp0
to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
series extends forward offloading to include some new classes of
traffic:

- Locally originating flows, i.e. packets that ingress on br0 that are
  to be forwarded to one or several of the ports swp{0,1,2}. Notably
  this also includes routed flows, e.g. a packet ingressing swp0 on
  VLAN 1 which is then routed over to VLAN 2 by the CPU and then
  forwarded to swp1 is "locally originating" from br0's point of view.

- Flows originating from "foreign" interfaces, i.e. an interface that
  is not offloaded by a particular switchdev instance. This includes
  ports belonging to other switchdev instances. A typical example
  would be flows from eth0 towards swp{0,1,2}.

The bridge still looks up its FDB/MDB as usual and then notifies the
switchdev driver that a particular skb should be offloaded if it
matches one of the classes above. It does so by using the _accel
version of dev_queue_xmit, supplying its own netdev as the
"subordinate" device. The driver can react to the presence of the
subordinate in its .ndo_select_queue in whatever way it needs to make
sure to forward the skb in much the same way that it would for packets
ingressing on regular ports.

Hardware domains to which a particular skb has been forwarded are
recorded so that duplicates are avoided.
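
The bookkeeping described above can be sketched in plain user-space C.
This is only an illustration under stated assumptions, not the code
from the series: the names fwd_state, should_xmit and count_xmits, the
MAX_HWDOMS limit, and the convention that hwdom 0 means "not
offloaded" are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limit for this sketch; not taken from the series. */
#define MAX_HWDOMS 64

/* Per-packet state: one bit per hardware domain already served. */
struct fwd_state {
	uint64_t fwd_hwdoms;
};

/* Return true if the caller must still transmit towards a port that
 * belongs to @hwdom. In this sketch hwdom 0 means "not offloaded", so
 * such ports always get their own copy.
 */
static bool should_xmit(struct fwd_state *st, int hwdom)
{
	assert(hwdom >= 0 && hwdom < MAX_HWDOMS);

	if (hwdom == 0)
		return true;

	/* The hardware replicates within the domain; skip repeats. */
	if (st->fwd_hwdoms & (UINT64_C(1) << hwdom))
		return false;

	st->fwd_hwdoms |= UINT64_C(1) << hwdom;
	return true;
}

/* Count how many skbs are handed to drivers when flooding to a set of
 * ports, given the hwdom of each port.
 */
static int count_xmits(const int *port_hwdoms, int nports)
{
	struct fwd_state st = { 0 };
	int i, n = 0;

	for (i = 0; i < nports; i++)
		if (should_xmit(&st, port_hwdoms[i]))
			n++;

	return n;
}
```

With the topology from the diagram (swp0, swp1 and swp2 in hwdom 1,
eth0 not offloaded), count_xmits() on {1, 1, 1, 0} yields two
transmissions instead of four: one for the whole hardware domain and
one for eth0.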

The main performance benefit is thus seen on multicast flows. Imagine
for example that:

- An IP camera is connected to swp0 (VLAN 1)

- The CPU is acting as a multicast router, routing the group from VLAN
  1 to VLAN 2.

- There are subscribers for the group in question behind both swp1 and
  swp2 (VLAN 2).

With this offloading in place, the bridge need only send a single skb
to the driver, which will send it to the hardware marked in such a way
that the switch will perform the multicast replication according to
the MDB configuration. Naturally, the number of saved skb_clones
increases linearly with the number of subscribed ports.

As an extra benefit, on mv88e6xxx, this also allows the switch to
perform source address learning on these flows, which avoids having to
sync dynamic FDB entries over slow configuration interfaces like MDIO
to avoid flows directed towards the CPU being flooded as unknown
unicast by the switch.


## RFC

- In general, what do you think about this idea?

- hwdom. What do you think about this terminology? Personally I feel
  that we had too many things called offload_fwd_mark, and that as the
  use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
  might be useful to have a separate term for it.

- .dfwd_{add,del}_station. Am I stretching this abstraction too far,
  and if so do you have any suggestion/preference on how to signal the
  offloading from the bridge down to the switchdev driver?

- The way that flooding is implemented in br_forward.c (lazily cloning
  skbs) means that you have to mark the forwarding as completed very
  early (right after should_deliver in maybe_deliver) in order to
  avoid duplicates. Is there some way to move this decision point to a
  later stage that I am missing?

- BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
  compatible with multicast-to-unicast being used on a port. Then
  again, I think that this would also be broken for regular switchdev
  bridge offloading as this flag is not offloaded to the switchdev
  port, so there is no way for the driver to refuse it. Any ideas on
  how to handle this?


## mv88e6xxx Specifics

Since we are now only receiving a single skb for both unicast and
multicast flows, we can tag the packets with the FORWARD command
instead of FROM_CPU. The switch(es) will then forward the packet in
accordance with their ATU, VTU, STU, and PVT configuration - just like
for packets ingressing on user ports.

Crucially, FROM_CPU is still used for:

- Ports in standalone mode.

- Flows that are trapped to the CPU and software-forwarded by a
  bridge. Note that these flows match neither of the classes discussed
  in the overview.

- Packets that are sent directly to a port netdev without going
  through the bridge, e.g. lldpd sending out PDUs via an AF_PACKET
  socket.

We thus have a pretty clean separation where the data plane uses
FORWARDs and the control plane uses TO_/FROM_CPU.
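
The split above can be summarized as a small decision function. This
is a sketch only: the enum tag_cmd, struct tx_ctx and pick_tag_cmd()
are hypothetical names made up for illustration, and the real tagger
derives this information differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; the real tagger has its own. */
enum tag_cmd { TAG_FROM_CPU, TAG_FORWARD };

struct tx_ctx {
	bool port_bridged;     /* false: port is in standalone mode */
	bool via_bridge;       /* false: sent straight to the port netdev */
	bool bridge_offloaded; /* bridge delegated forwarding to hardware */
};

static enum tag_cmd pick_tag_cmd(const struct tx_ctx *c)
{
	/* A packet cannot come through a bridge on a standalone port. */
	assert(!c->via_bridge || c->port_bridged);

	/* Only offloaded bridge traffic is data plane; everything else
	 * (standalone ports, software-forwarded flows, AF_PACKET
	 * senders) stays on FROM_CPU.
	 */
	if (c->port_bridged && c->via_bridge && c->bridge_offloaded)
		return TAG_FORWARD;

	return TAG_FROM_CPU;
}
```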

The barrier between different bridges is enforced by port-based VLANs
on mv88e6xxx, which in essence are a mapping from a source device/port
pair to an allowed set of egress ports. In order to have a FORWARD
frame (which carries a _source_ device/port) correctly mapped by the
PVT, we must use a unique pair for each bridge.

Fortunately, there is typically lots of unused address space in most
switch trees. When was the last time you saw an mv88e6xxx product
using more than 4 chips? Even if you found one with 16 (!) devices,
you would still have room to allocate 16*16 virtual ports to software
bridges.

Therefore, the mv88e6xxx driver will allocate a virtual device/port
pair to each bridge that it offloads. All members of the same bridge
are then configured to allow packets from this virtual port in their
PVTs.
====================
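
To make the address space argument concrete, here is one possible way
to map a bridge number to a virtual source device/port pair above the
physical switches. It is a sketch under assumptions, not necessarily
the mapping that the mv88e6xxx patch in this series uses: MAX_DEVS,
PORTS_PER_DEV, struct vport and bridge_to_vport() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_DEVS      32 /* assumed size of the device address space */
#define PORTS_PER_DEV 16 /* assumed ports addressable per device */

struct vport {
	uint8_t dev;
	uint8_t port;
};

/* Map a bridge number to a virtual (dev, port) pair located above the
 * device numbers used by the physical switches in the tree.
 * Returns 0 on success, -1 when the address space is exhausted.
 */
static int bridge_to_vport(int num_switches, int bridge_num,
			   struct vport *out)
{
	int dev, port;

	assert(num_switches > 0 && num_switches <= MAX_DEVS);
	assert(bridge_num >= 0);

	dev = num_switches + bridge_num / PORTS_PER_DEV;
	port = bridge_num % PORTS_PER_DEV;

	if (dev >= MAX_DEVS)
		return -1;

	out->dev = (uint8_t)dev;
	out->port = (uint8_t)port;
	return 0;
}
```

Under these assumed limits, a tree with 16 physical devices leaves
room for 16*16 = 256 bridges (bridge numbers 0 to 255), which matches
the estimate in the quoted text.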

Tobias Waldekranz (5):
  net: dfwd: constrain existing users to macvlan subordinates
  net: bridge: disambiguate offload_fwd_mark
  net: bridge: switchdev: recycle unused hwdoms
  net: bridge: switchdev: allow the data plane forwarding to be
    offloaded
  net: dsa: tag_dsa: offload the bridge forwarding process

Vladimir Oltean (5):
  net: extract helpers for binding a subordinate device to TX queues
  net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
  net: dsa: track the number of switches in a tree
  net: dsa: add support for bridge forwarding offload
  net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
    the PVT

 drivers/net/dsa/mv88e6xxx/chip.c              | 106 +++++++++++-
 .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
 include/linux/if_bridge.h                     |   1 +
 include/linux/netdevice.h                     |  13 +-
 include/net/dsa.h                             |  37 ++++
 net/bridge/br_forward.c                       |  18 +-
 net/bridge/br_if.c                            |   4 +-
 net/bridge/br_private.h                       |  49 +++++-
 net/bridge/br_switchdev.c                     | 163 +++++++++++++++---
 net/bridge/br_vlan.c                          |  10 +-
 net/core/dev.c                                |  31 +++-
 net/dsa/dsa2.c                                |   3 +
 net/dsa/dsa_priv.h                            |  28 +++
 net/dsa/port.c                                |  35 ++++
 net/dsa/slave.c                               | 134 +++++++++++++-
 net/dsa/switch.c                              |  58 +++++++
 net/dsa/tag_dsa.c                             |  60 ++++++-
 19 files changed, 700 insertions(+), 59 deletions(-)

-- 
2.25.1


* [RFC PATCH v2 net-next 01/10] net: dfwd: constrain existing users to macvlan subordinates
From: Vladimir Oltean @ 2021-07-03 11:56 UTC
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

From: Tobias Waldekranz <tobias@waldekranz.com>

The dfwd_add/del_station NDOs are currently only used by the macvlan
subsystem to request L2 forwarding offload from lower devices. In
order to add support for other types of devices (like bridges), we
constrain the current users to make sure that the subordinate
requesting the offload is in fact a macvlan.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 3 +++
 drivers/net/ethernet/intel/i40e/i40e_main.c     | 3 +++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 2fb52bd6fc0e..4dba6e6a282d 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1352,6 +1352,9 @@ static void *fm10k_dfwd_add_station(struct net_device *dev,
 	int size, i;
 	u16 vid, glort;
 
+	if (!netif_is_macvlan(sdev))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	/* The hardware supported by fm10k only filters on the destination MAC
 	 * address. In order to avoid issues we only support offloading modes
 	 * where the hardware can actually provide the functionality.
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 861e59a350bd..812ad241a049 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7629,6 +7629,9 @@ static void *i40e_fwd_add(struct net_device *netdev, struct net_device *vdev)
 	struct i40e_fwd_adapter *fwd;
 	int avail_macvlan, ret;
 
+	if (!netif_is_macvlan(vdev))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if ((pf->flags & I40E_FLAG_DCB_ENABLED)) {
 		netdev_info(netdev, "Macvlans are not supported when DCB is enabled\n");
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ffff69efd78a..1ecdb7dc9534 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9938,6 +9938,9 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 	int tcs = adapter->hw_tcs ? : 1;
 	int pool, err;
 
+	if (!netif_is_macvlan(vdev))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if (adapter->xdp_prog) {
 		e_warn(probe, "L2FW offload is not supported with XDP\n");
 		return ERR_PTR(-EINVAL);
-- 
2.25.1


* [RFC PATCH v2 net-next 02/10] net: bridge: disambiguate offload_fwd_mark
From: Vladimir Oltean @ 2021-07-03 11:56 UTC
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

From: Tobias Waldekranz <tobias@waldekranz.com>

Before this change, four related - but distinct - concepts were named
offload_fwd_mark:

- skb->offload_fwd_mark: Set by the switchdev driver if the underlying
  hardware has already forwarded this frame to the other ports in the
  same hardware domain.

- nbp->offload_fwd_mark: An identifier used to group ports that share
  the same hardware forwarding domain.

- br->offload_fwd_mark: Counter used to make sure that unique IDs are
  used in cases where a bridge contains ports from multiple hardware
  domains.

- skb->cb->offload_fwd_mark: The hardware domain on which the frame
  ingressed and was forwarded.

Introduce the term "hardware forwarding domain" ("hwdom") in the
bridge to denote a set of ports with the following property:

    If an skb with skb->offload_fwd_mark set is received on a port
    belonging to hwdom N, that frame has already been forwarded to all
    other ports in hwdom N.

By decoupling the name from "offload_fwd_mark", we can extend the
term's definition in the future - e.g. to add constraints that
describe expected egress behavior - without overloading the meaning of
"offload_fwd_mark".

- nbp->offload_fwd_mark thus becomes nbp->hwdom.

- br->offload_fwd_mark becomes br->last_hwdom.

- skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. The slight
  change in naming here mandates a slight change in behavior of the
  nbp_switchdev_frame_mark() function. Previously, it only set this
  value in skb->cb for packets with skb->offload_fwd_mark true (ones
  which were forwarded in hardware), whereas now we always track the
  incoming hwdom for all packets coming from a switchdev (even for the
  packets which weren't forwarded in hardware, such as STP BPDUs, IGMP
  reports etc.). As all uses of skb->cb->offload_fwd_mark were already
  gated behind checks of skb->offload_fwd_mark, this will not introduce
  any functional change, but it paves the way for future changes where
  the ingressing hwdom must be known for frames coming from a switchdev
  regardless of whether they were forwarded in hardware or not
  (basically, if the skb comes from a switchdev, skb->cb->src_hwdom now
  always tracks which one).

  A typical example where this is relevant: the switchdev has a fixed
  configuration to trap STP BPDUs, but STP is not running on the bridge
  and the group_fwd_mask allows them to be forwarded. Say we have this
  setup:

        br0
       / | \
      /  |  \
  swp0 swp1 swp2

  A BPDU comes in on swp0 and is trapped to the CPU; the driver does not
  set skb->offload_fwd_mark. The bridge determines that the frame should
  be forwarded to swp{1,2}. It is imperative that forward offloading is
  _not_ allowed in this case, as the source hwdom is already "poisoned".

  Recording the source hwdom allows this case to be handled properly.
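
  The BPDU case can be modelled in user space as follows. frame_mark()
  and allowed_egress() mirror the logic of nbp_switchdev_frame_mark()
  and nbp_switchdev_allowed_egress() after this patch, on simplified
  stand-in structures. may_offload_tx() is a hypothetical check that is
  _not_ part of this patch; it only illustrates how a later change
  could use src_hwdom.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for net_bridge_port and the skb state. */
struct port {
	int hwdom; /* 0: port is not offloaded */
};

struct frame {
	bool offload_fwd_mark; /* driver: hardware already forwarded */
	int src_hwdom;         /* ingress hwdom, recorded unconditionally */
};

/* Mirrors nbp_switchdev_frame_mark() after this patch. */
static void frame_mark(const struct port *p, struct frame *f)
{
	if (p->hwdom)
		f->src_hwdom = p->hwdom;
}

/* Mirrors nbp_switchdev_allowed_egress() after this patch. */
static bool allowed_egress(const struct port *p, const struct frame *f)
{
	return !f->offload_fwd_mark || f->src_hwdom != p->hwdom;
}

/* Hypothetical, not in this patch: TX forwarding offload must be
 * refused towards the hwdom that the frame came from.
 */
static bool may_offload_tx(const struct port *p, const struct frame *f)
{
	assert(p->hwdom >= 0);

	return p->hwdom && f->src_hwdom != p->hwdom;
}
```

  For a BPDU trapped on swp0 (offload_fwd_mark not set), software
  forwarding to swp1 is still allowed, but may_offload_tx() returns
  false for swp1 because both ports share the same hwdom.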

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_if.c        |  2 +-
 net/bridge/br_private.h   | 10 +++++-----
 net/bridge/br_switchdev.c | 16 ++++++++--------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index f7d2f472ae24..73fa703f8df5 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -643,7 +643,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	if (err)
 		goto err5;
 
-	err = nbp_switchdev_mark_set(p);
+	err = nbp_switchdev_hwdom_set(p);
 	if (err)
 		goto err6;
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 2b48b204205e..e16879caaaf3 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -329,7 +329,7 @@ struct net_bridge_port {
 	struct netpoll			*np;
 #endif
 #ifdef CONFIG_NET_SWITCHDEV
-	int				offload_fwd_mark;
+	int				hwdom;
 #endif
 	u16				group_fwd_mask;
 	u16				backup_redirected_cnt;
@@ -476,7 +476,7 @@ struct net_bridge {
 	u32				auto_cnt;
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int offload_fwd_mark;
+	int last_hwdom;
 #endif
 	struct hlist_head		fdb_list;
 
@@ -506,7 +506,7 @@ struct br_input_skb_cb {
 #endif
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int offload_fwd_mark;
+	int src_hwdom;
 #endif
 };
 
@@ -1645,7 +1645,7 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
-int nbp_switchdev_mark_set(struct net_bridge_port *p);
+int nbp_switchdev_hwdom_set(struct net_bridge_port *p);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1665,7 +1665,7 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 	skb->offload_fwd_mark = 0;
 }
 #else
-static inline int nbp_switchdev_mark_set(struct net_bridge_port *p)
+static inline int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
 {
 	return 0;
 }
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index d3adee0f91f9..833fd30482c2 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,20 +8,20 @@
 
 #include "br_private.h"
 
-static int br_switchdev_mark_get(struct net_bridge *br, struct net_device *dev)
+static int br_switchdev_hwdom_get(struct net_bridge *br, struct net_device *dev)
 {
 	struct net_bridge_port *p;
 
 	/* dev is yet to be added to the port list. */
 	list_for_each_entry(p, &br->port_list, list) {
 		if (netdev_port_same_parent_id(dev, p->dev))
-			return p->offload_fwd_mark;
+			return p->hwdom;
 	}
 
-	return ++br->offload_fwd_mark;
+	return ++br->last_hwdom;
 }
 
-int nbp_switchdev_mark_set(struct net_bridge_port *p)
+int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
 {
 	struct netdev_phys_item_id ppid = { };
 	int err;
@@ -35,7 +35,7 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
 		return err;
 	}
 
-	p->offload_fwd_mark = br_switchdev_mark_get(p->br, p->dev);
+	p->hwdom = br_switchdev_hwdom_get(p->br, p->dev);
 
 	return 0;
 }
@@ -43,15 +43,15 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
-	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
-		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
+	if (p->hwdom)
+		BR_INPUT_SKB_CB(skb)->src_hwdom = p->hwdom;
 }
 
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
 				  const struct sk_buff *skb)
 {
 	return !skb->offload_fwd_mark ||
-	       BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
+	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
 }
 
 /* Flags that can be offloaded to hardware */
-- 
2.25.1


* [Bridge] [RFC PATCH v2 net-next 02/10] net: bridge: disambiguate offload_fwd_mark
@ 2021-07-03 11:56   ` Vladimir Oltean
  0 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:56 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Jiri Pirko, bridge,
	Alexander Duyck, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
	Vivien Didelot, Tobias Waldekranz

From: Tobias Waldekranz <tobias@waldekranz.com>

Before this change, four related - but distinct - concepts were named
offload_fwd_mark:

- skb->offload_fwd_mark: Set by the switchdev driver if the underlying
  hardware has already forwarded this frame to the other ports in the
  same hardware domain.

- nbp->offload_fwd_mark: An identifier used to group ports that share
  the same hardware forwarding domain.

- br->offload_fwd_mark: Counter used to make sure that unique IDs are
  used in cases where a bridge contains ports from multiple hardware
  domains.

- skb->cb->offload_fwd_mark: The hardware domain on which the frame
  ingressed and was forwarded.

Introduce the term "hardware forwarding domain" ("hwdom") in the
bridge to denote a set of ports with the following property:

    If an skb with skb->offload_fwd_mark set is received on a port
    belonging to hwdom N, that frame has already been forwarded to all
    other ports in hwdom N.

By decoupling the name from "offload_fwd_mark", we can extend the
term's definition in the future - e.g. to add constraints that
describe expected egress behavior - without overloading the meaning of
"offload_fwd_mark".

- nbp->offload_fwd_mark thus becomes nbp->hwdom.

- br->offload_fwd_mark becomes br->last_hwdom.

- skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. The slight
  change in naming here mandates a slight change in behavior of the
  nbp_switchdev_frame_mark() function. Previously, it only set this
  value in skb->cb for packets with skb->offload_fwd_mark true (ones
  which were forwarded in hardware), whereas now we always track the
  incoming hwdom for all packets coming from a switchdev (even for the
  packets which weren't forwarded in hardware, such as STP BPDUs, IGMP
  reports etc). As all uses of skb->cb->offload_fwd_mark were already
  gated behind checks of skb->offload_fwd_mark, this will not introduce
  any functional change, but it paves the way for future changes where
  the ingressing hwdom must be known for frames coming from a switchdev
  regardless of whether they were forwarded in hardware or not
  (basically, if the skb comes from a switchdev, skb->cb->src_hwdom now
  always tracks which one).

  A typical example where this is relevant: the switchdev has a fixed
  configuration to trap STP BPDUs, but STP is not running on the bridge
  and the group_fwd_mask allows them to be forwarded. Say we have this
  setup:

        br0
       / | \
      /  |  \
  swp0 swp1 swp2

  A BPDU comes in on swp0 and is trapped to the CPU; the driver does not
  set skb->offload_fwd_mark. The bridge determines that the frame should
  be forwarded to swp{1,2}. It is imperative that forward offloading is
  _not_ allowed in this case, as the source hwdom is already "poisoned".

  Recording the source hwdom allows this case to be handled properly.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_if.c        |  2 +-
 net/bridge/br_private.h   | 10 +++++-----
 net/bridge/br_switchdev.c | 16 ++++++++--------
 3 files changed, 14 insertions(+), 14 deletions(-)
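
To illustrate the renamed fields, here is a minimal userspace sketch of
the ingress marking and the egress check as this patch leaves them. The
struct port and struct frame types and the frame_mark() and
allowed_egress() helpers are hypothetical stand-ins for net_bridge_port,
the skb and its control block; only the two conditions are taken from the
diff below, and none of this is kernel code.

```c
#include <assert.h>	/* for the checks suggested in the note below */
#include <stdbool.h>

/* Hypothetical stand-in for struct net_bridge_port. */
struct port {
	int hwdom;		/* models nbp->hwdom; 0 means "not a switchdev port" */
};

/* Hypothetical stand-in for the skb plus its bridge control block. */
struct frame {
	bool offload_fwd_mark;	/* models skb->offload_fwd_mark */
	int src_hwdom;		/* models skb->cb->src_hwdom */
};

/* Same condition as the new nbp_switchdev_frame_mark(): the ingress
 * hwdom is recorded whether or not hardware already forwarded the frame.
 */
void frame_mark(const struct port *in, struct frame *f)
{
	if (in->hwdom)
		f->src_hwdom = in->hwdom;
}

/* Same condition as nbp_switchdev_allowed_egress(): software forwarding
 * is suppressed only when hardware already forwarded the frame within
 * the hwdom of the egress port.
 */
bool allowed_egress(const struct port *out, const struct frame *f)
{
	return !f->offload_fwd_mark || f->src_hwdom != out->hwdom;
}
```

With swp0 and swp1 in hwdom 1, a frame that the hardware already
forwarded (offload_fwd_mark set) is not sent again to swp1, while a
trapped BPDU (offload_fwd_mark clear) still carries src_hwdom 1 and
remains eligible for software forwarding to swp1.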

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index f7d2f472ae24..73fa703f8df5 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -643,7 +643,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	if (err)
 		goto err5;
 
-	err = nbp_switchdev_mark_set(p);
+	err = nbp_switchdev_hwdom_set(p);
 	if (err)
 		goto err6;
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 2b48b204205e..e16879caaaf3 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -329,7 +329,7 @@ struct net_bridge_port {
 	struct netpoll			*np;
 #endif
 #ifdef CONFIG_NET_SWITCHDEV
-	int				offload_fwd_mark;
+	int				hwdom;
 #endif
 	u16				group_fwd_mask;
 	u16				backup_redirected_cnt;
@@ -476,7 +476,7 @@ struct net_bridge {
 	u32				auto_cnt;
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int offload_fwd_mark;
+	int last_hwdom;
 #endif
 	struct hlist_head		fdb_list;
 
@@ -506,7 +506,7 @@ struct br_input_skb_cb {
 #endif
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int offload_fwd_mark;
+	int src_hwdom;
 #endif
 };
 
@@ -1645,7 +1645,7 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
-int nbp_switchdev_mark_set(struct net_bridge_port *p);
+int nbp_switchdev_hwdom_set(struct net_bridge_port *p);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1665,7 +1665,7 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 	skb->offload_fwd_mark = 0;
 }
 #else
-static inline int nbp_switchdev_mark_set(struct net_bridge_port *p)
+static inline int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
 {
 	return 0;
 }
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index d3adee0f91f9..833fd30482c2 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,20 +8,20 @@
 
 #include "br_private.h"
 
-static int br_switchdev_mark_get(struct net_bridge *br, struct net_device *dev)
+static int br_switchdev_hwdom_get(struct net_bridge *br, struct net_device *dev)
 {
 	struct net_bridge_port *p;
 
 	/* dev is yet to be added to the port list. */
 	list_for_each_entry(p, &br->port_list, list) {
 		if (netdev_port_same_parent_id(dev, p->dev))
-			return p->offload_fwd_mark;
+			return p->hwdom;
 	}
 
-	return ++br->offload_fwd_mark;
+	return ++br->last_hwdom;
 }
 
-int nbp_switchdev_mark_set(struct net_bridge_port *p)
+int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
 {
 	struct netdev_phys_item_id ppid = { };
 	int err;
@@ -35,7 +35,7 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
 		return err;
 	}
 
-	p->offload_fwd_mark = br_switchdev_mark_get(p->br, p->dev);
+	p->hwdom = br_switchdev_hwdom_get(p->br, p->dev);
 
 	return 0;
 }
@@ -43,15 +43,15 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
-	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
-		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
+	if (p->hwdom)
+		BR_INPUT_SKB_CB(skb)->src_hwdom = p->hwdom;
 }
 
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
 				  const struct sk_buff *skb)
 {
 	return !skb->offload_fwd_mark ||
-	       BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
+	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
 }
 
 /* Flags that can be offloaded to hardware */
-- 
2.25.1


* [RFC PATCH v2 net-next 03/10] net: bridge: switchdev: recycle unused hwdoms
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-03 11:56   ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:56 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

From: Tobias Waldekranz <tobias@waldekranz.com>

Since hwdoms have only been used thus far for equality comparisons, the
bridge has used the simplest possible assignment policy: using a
counter to keep track of the last value handed out.

With the upcoming transmit offloading, we need to perform set
operations efficiently based on hwdoms, e.g. we want to answer
questions like "has this skb been forwarded to any port within this
hwdom?"

Move to a bitmap-based allocation scheme that recycles hwdoms once all
members leave the bridge. This means that we can use a single
unsigned long to keep track of the hwdoms that have received an skb.

v1->v2: convert the typedef DECLARE_BITMAP(br_hwdom_map_t, BR_HWDOM_MAX)
        into a plain unsigned long.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_if.c        |  4 +-
 net/bridge/br_private.h   | 27 ++++++++---
 net/bridge/br_switchdev.c | 94 ++++++++++++++++++++++++++-------------
 3 files changed, 85 insertions(+), 40 deletions(-)
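
The allocation scheme described above can be sketched in userspace
roughly as follows. This is only a model under stated assumptions: the
struct bridge and struct port types, the fixed-size port array, and the
port_join() and port_leave() helpers are hypothetical, and a plain loop
stands in for find_next_zero_bit(). What it keeps from the patch is the
policy: bit 0 is never handed out, ports with the same parent ID share a
hwdom, and a hwdom is released when its last member leaves.

```c
#include <assert.h>	/* for the checks suggested in the note below */
#include <limits.h>

#define HWDOM_MAX ((int)(sizeof(unsigned long) * CHAR_BIT))
#define MAX_PORTS 8	/* arbitrary size for this sketch */

/* Hypothetical stand-in for struct net_bridge_port. */
struct port {
	int in_use;
	int parent_id;	/* models the switchdev parent ID */
	int hwdom;
};

/* Hypothetical stand-in for struct net_bridge. */
struct bridge {
	unsigned long busy_hwdoms;	/* one bit per allocated hwdom */
	struct port ports[MAX_PORTS];
};

/* Put a port into the free slot `slot`; returns its hwdom, or -1 when
 * every hwdom is taken.
 */
int port_join(struct bridge *br, int slot, int parent_id)
{
	int i, hwdom;

	/* Reuse the hwdom of any existing port with the same parent ID. */
	for (i = 0; i < MAX_PORTS; i++) {
		if (br->ports[i].in_use &&
		    br->ports[i].parent_id == parent_id) {
			hwdom = br->ports[i].hwdom;
			goto assign;
		}
	}

	/* Otherwise take the lowest free bit, skipping bit 0. */
	for (hwdom = 1; hwdom < HWDOM_MAX; hwdom++)
		if (!(br->busy_hwdoms & (1UL << hwdom)))
			break;
	if (hwdom >= HWDOM_MAX)
		return -1;
	br->busy_hwdoms |= 1UL << hwdom;

assign:
	br->ports[slot].in_use = 1;
	br->ports[slot].parent_id = parent_id;
	br->ports[slot].hwdom = hwdom;
	return hwdom;
}

/* Remove the port in `slot`; recycle its hwdom if nobody else uses it. */
void port_leave(struct bridge *br, int slot)
{
	int i, hwdom = br->ports[slot].hwdom;

	br->ports[slot].in_use = 0;
	for (i = 0; i < MAX_PORTS; i++)
		if (br->ports[i].in_use && br->ports[i].hwdom == hwdom)
			return;
	br->busy_hwdoms &= ~(1UL << hwdom);
}
```

Two ports with the same parent ID end up in hwdom 1 and a port from
another parent gets hwdom 2. Once both members of hwdom 1 have left, the
next unrelated port to join receives hwdom 1 again rather than 3, which
is what keeps the set of live hwdoms within a single unsigned long.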

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 73fa703f8df5..adaf78e45c23 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -349,6 +349,7 @@ static void del_nbp(struct net_bridge_port *p)
 	nbp_backup_clear(p);
 
 	nbp_update_port_count(br);
+	nbp_switchdev_del(p);
 
 	netdev_upper_dev_unlink(dev, br->dev);
 
@@ -643,7 +644,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	if (err)
 		goto err5;
 
-	err = nbp_switchdev_hwdom_set(p);
+	err = nbp_switchdev_add(p);
 	if (err)
 		goto err6;
 
@@ -704,6 +705,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	list_del_rcu(&p->list);
 	br_fdb_delete_by_port(br, p, 0, 1);
 	nbp_update_port_count(br);
+	nbp_switchdev_del(p);
 err6:
 	netdev_upper_dev_unlink(dev, br->dev);
 err5:
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index e16879caaaf3..9ff09a32e3f8 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -29,6 +29,8 @@
 
 #define BR_MULTICAST_DEFAULT_HASH_MAX 4096
 
+#define BR_HWDOM_MAX BITS_PER_LONG
+
 #define BR_VERSION	"2.3"
 
 /* Control of forwarding link local multicast */
@@ -476,7 +478,7 @@ struct net_bridge {
 	u32				auto_cnt;
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int last_hwdom;
+	unsigned long			busy_hwdoms;
 #endif
 	struct hlist_head		fdb_list;
 
@@ -1645,7 +1647,6 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
-int nbp_switchdev_hwdom_set(struct net_bridge_port *p);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1659,17 +1660,15 @@ void br_switchdev_fdb_notify(struct net_bridge *br,
 int br_switchdev_port_vlan_add(struct net_device *dev, u16 vid, u16 flags,
 			       struct netlink_ext_ack *extack);
 int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid);
+int nbp_switchdev_add(struct net_bridge_port *p);
+void nbp_switchdev_del(struct net_bridge_port *p);
+void br_switchdev_init(struct net_bridge *br);
 
 static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 {
 	skb->offload_fwd_mark = 0;
 }
 #else
-static inline int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
-{
-	return 0;
-}
-
 static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 					    struct sk_buff *skb)
 {
@@ -1710,6 +1709,20 @@ br_switchdev_fdb_notify(struct net_bridge *br,
 static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 {
 }
+
+static inline int nbp_switchdev_add(struct net_bridge_port *p)
+{
+	return 0;
+}
+
+static inline void nbp_switchdev_del(struct net_bridge_port *p)
+{
+}
+
+static inline void br_switchdev_init(struct net_bridge *br)
+{
+}
+
 #endif /* CONFIG_NET_SWITCHDEV */
 
 /* br_arp_nd_proxy.c */
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 833fd30482c2..f3120f13c293 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,38 +8,6 @@
 
 #include "br_private.h"
 
-static int br_switchdev_hwdom_get(struct net_bridge *br, struct net_device *dev)
-{
-	struct net_bridge_port *p;
-
-	/* dev is yet to be added to the port list. */
-	list_for_each_entry(p, &br->port_list, list) {
-		if (netdev_port_same_parent_id(dev, p->dev))
-			return p->hwdom;
-	}
-
-	return ++br->last_hwdom;
-}
-
-int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
-{
-	struct netdev_phys_item_id ppid = { };
-	int err;
-
-	ASSERT_RTNL();
-
-	err = dev_get_port_parent_id(p->dev, &ppid, true);
-	if (err) {
-		if (err == -EOPNOTSUPP)
-			return 0;
-		return err;
-	}
-
-	p->hwdom = br_switchdev_hwdom_get(p->br, p->dev);
-
-	return 0;
-}
-
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
@@ -156,3 +124,65 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
 
 	return switchdev_port_obj_del(dev, &v.obj);
 }
+
+static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
+{
+	struct net_bridge *br = joining->br;
+	struct net_bridge_port *p;
+	int hwdom;
+
+	/* joining is yet to be added to the port list. */
+	list_for_each_entry(p, &br->port_list, list) {
+		if (netdev_port_same_parent_id(joining->dev, p->dev)) {
+			joining->hwdom = p->hwdom;
+			return 0;
+		}
+	}
+
+	hwdom = find_next_zero_bit(&br->busy_hwdoms, BR_HWDOM_MAX, 1);
+	if (hwdom >= BR_HWDOM_MAX)
+		return -EBUSY;
+
+	set_bit(hwdom, &br->busy_hwdoms);
+	joining->hwdom = hwdom;
+	return 0;
+}
+
+static void nbp_switchdev_hwdom_put(struct net_bridge_port *leaving)
+{
+	struct net_bridge *br = leaving->br;
+	struct net_bridge_port *p;
+
+	/* leaving is no longer in the port list. */
+	list_for_each_entry(p, &br->port_list, list) {
+		if (p->hwdom == leaving->hwdom)
+			return;
+	}
+
+	clear_bit(leaving->hwdom, &br->busy_hwdoms);
+}
+
+int nbp_switchdev_add(struct net_bridge_port *p)
+{
+	struct netdev_phys_item_id ppid = { };
+	int err;
+
+	ASSERT_RTNL();
+
+	err = dev_get_port_parent_id(p->dev, &ppid, true);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+
+	return nbp_switchdev_hwdom_set(p);
+}
+
+void nbp_switchdev_del(struct net_bridge_port *p)
+{
+	ASSERT_RTNL();
+
+	if (p->hwdom)
+		nbp_switchdev_hwdom_put(p);
+}
-- 
2.25.1


* [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-03 11:56   ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:56 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

From: Tobias Waldekranz <tobias@waldekranz.com>

Allow switchdevs to forward frames from the CPU in accordance with the
bridge configuration in the same way as is done between bridge
ports. This means that the bridge will only send a single skb towards
one of the ports under the switchdev's control, and expects the driver
to deliver the packet to all eligible ports in its domain.

Primarily this improves the performance of multicast flows with
multiple subscribers, as it allows the hardware to perform the frame
replication.

The basic flow between the driver and the bridge is as follows:

- The switchdev accepts the offload by returning a non-NULL pointer
  from .ndo_dfwd_add_station when the port is added to the bridge.

- The bridge sends offloadable skbs to one of the ports under the
  switchdev's control using dev_queue_xmit_accel.

- The switchdev notices the offload by checking for a non-NULL
  "sb_dev" in the core's call to .ndo_select_queue.

v1->v2:
- convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long
- introduce a static key "br_switchdev_fwd_offload_used" to minimize the
  impact of the newly introduced feature on all the setups which don't
  have hardware that can make use of it
- introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache
  line access
- reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in
  __br_forward()
- do not strip VLAN on egress if forwarding offload on VLAN-aware bridge
  is being used
- propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/linux/if_bridge.h |  1 +
 net/bridge/br_forward.c   | 18 +++++++-
 net/bridge/br_private.h   | 24 +++++++++++
 net/bridge/br_switchdev.c | 87 +++++++++++++++++++++++++++++++++++++--
 net/bridge/br_vlan.c      | 10 ++++-
 5 files changed, 135 insertions(+), 5 deletions(-)
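
As a rough illustration of the "single skb per hardware domain" behavior
described above, the following userspace sketch models the interaction
between the fwd_hwdoms bitmask and the egress check during a flood. The
struct port and struct frame types and the flood() helper are
hypothetical; the two conditions are modelled on
nbp_switchdev_can_accel() and nbp_switchdev_allowed_egress() from the
diff below, with the static key left out. The sketch assumes every hwdom
is smaller than the number of bits in an unsigned long.

```c
#include <assert.h>	/* for the checks suggested in the note below */
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_bridge_port. */
struct port {
	int hwdom;		/* 0 means "not a switchdev port" */
	bool fwd_offload;	/* models nbp->flags & BR_FWD_OFFLOAD */
};

/* Hypothetical stand-in for the skb plus its bridge control block. */
struct frame {
	bool offload_fwd_mark;
	int src_hwdom;
	unsigned long fwd_hwdoms;	/* hwdoms that already got this frame */
};

/* Modelled on nbp_switchdev_can_accel(), minus the static key. */
static bool can_accel(const struct port *p, const struct frame *f)
{
	return p->fwd_offload && p->hwdom != f->src_hwdom;
}

/* Modelled on the updated nbp_switchdev_allowed_egress(). */
static bool allowed_egress(const struct port *p, const struct frame *f)
{
	return !(f->fwd_hwdoms & (1UL << p->hwdom)) &&
	       (!f->offload_fwd_mark || f->src_hwdom != p->hwdom);
}

/* Flood a frame to all ports and return how many copies software
 * actually hands to a port. The first offloading port of a hwdom marks
 * that hwdom as served, so later ports of the same hwdom are skipped.
 */
size_t flood(const struct port *ports, size_t n, struct frame *f)
{
	size_t i, sent = 0;

	for (i = 0; i < n; i++) {
		if (!allowed_egress(&ports[i], f))
			continue;
		if (can_accel(&ports[i], f))
			f->fwd_hwdoms |= 1UL << ports[i].hwdom;
		sent++;
	}

	return sent;
}
```

For a locally originated frame flooded to three offloading ports in hwdom
1 plus two plain net devices, this model transmits three copies (one for
the whole hwdom and one per plain device) instead of five. Without
BR_FWD_OFFLOAD on the ports, every port still receives its own copy.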

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b651c5e32a28..a47b86ab7f96 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -57,6 +57,7 @@ struct br_ip_list {
 #define BR_MRP_AWARE		BIT(17)
 #define BR_MRP_LOST_CONT	BIT(18)
 #define BR_MRP_LOST_IN_CONT	BIT(19)
+#define BR_FWD_OFFLOAD		BIT(20)
 
 #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 07856362538f..919246a2c7eb 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
 
 int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	struct net_device *sb_dev = NULL;
+
 	skb_push(skb, ETH_HLEN);
 	if (!is_skb_forwardable(skb->dev, skb))
 		goto drop;
@@ -48,7 +50,14 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
 		skb_set_network_header(skb, depth);
 	}
 
-	dev_queue_xmit(skb);
+	if (br_switchdev_accels_skb(skb)) {
+		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
+
+		WARN_ON_ONCE(br_vlan_enabled(sb_dev) &&
+			     !skb_vlan_tag_present(skb));
+	}
+
+	dev_queue_xmit_accel(skb, sb_dev);
 
 	return 0;
 
@@ -76,6 +85,11 @@ static void __br_forward(const struct net_bridge_port *to,
 	struct net *net;
 	int br_hook;
 
+	/* Mark the skb for forwarding offload early so that br_handle_vlan()
+	 * can know whether to pop the VLAN header on egress or keep it.
+	 */
+	nbp_switchdev_frame_mark_accel(to, skb);
+
 	vg = nbp_vlan_group_rcu(to);
 	skb = br_handle_vlan(to->br, to, vg, skb);
 	if (!skb)
@@ -174,6 +188,8 @@ static struct net_bridge_port *maybe_deliver(
 	if (!should_deliver(p, skb))
 		return prev;
 
+	nbp_switchdev_frame_mark_fwd(p, skb);
+
 	if (!prev)
 		goto out;
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 9ff09a32e3f8..655212df57f7 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -332,6 +332,7 @@ struct net_bridge_port {
 #endif
 #ifdef CONFIG_NET_SWITCHDEV
 	int				hwdom;
+	void				*accel_priv;
 #endif
 	u16				group_fwd_mask;
 	u16				backup_redirected_cnt;
@@ -508,7 +509,9 @@ struct br_input_skb_cb {
 #endif
 
 #ifdef CONFIG_NET_SWITCHDEV
+	u8 fwd_accel:1;
 	int src_hwdom;
+	unsigned long fwd_hwdoms;
 #endif
 };
 
@@ -1647,6 +1650,12 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
+bool br_switchdev_accels_skb(struct sk_buff *skb);
+
+void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+				    struct sk_buff *skb);
+void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+				  struct sk_buff *skb);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1669,6 +1678,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 	skb->offload_fwd_mark = 0;
 }
 #else
+static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
+{
+	return false;
+}
+
+static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+						  struct sk_buff *skb)
+{
+}
+
+static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+						struct sk_buff *skb)
+{
+}
+
 static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 					    struct sk_buff *skb)
 {
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index f3120f13c293..8653d9a540a1 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,6 +8,40 @@
 
 #include "br_private.h"
 
+static struct static_key_false br_switchdev_fwd_offload_used;
+
+static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,
+				    const struct sk_buff *skb)
+{
+	if (!static_branch_unlikely(&br_switchdev_fwd_offload_used))
+		return false;
+
+	return (p->flags & BR_FWD_OFFLOAD) &&
+	       (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);
+}
+
+bool br_switchdev_accels_skb(struct sk_buff *skb)
+{
+	if (!static_branch_unlikely(&br_switchdev_fwd_offload_used))
+		return false;
+
+	return BR_INPUT_SKB_CB(skb)->fwd_accel;
+}
+
+void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+				    struct sk_buff *skb)
+{
+	if (nbp_switchdev_can_accel(p, skb))
+		BR_INPUT_SKB_CB(skb)->fwd_accel = true;
+}
+
+void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+				  struct sk_buff *skb)
+{
+	if (nbp_switchdev_can_accel(p, skb))
+		set_bit(p->hwdom, &BR_INPUT_SKB_CB(skb)->fwd_hwdoms);
+}
+
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
@@ -18,8 +52,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
 				  const struct sk_buff *skb)
 {
-	return !skb->offload_fwd_mark ||
-	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
+	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);
+
+	return !test_bit(p->hwdom, &cb->fwd_hwdoms) &&
+		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);
 }
 
 /* Flags that can be offloaded to hardware */
@@ -125,6 +161,39 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
 	return switchdev_port_obj_del(dev, &v.obj);
 }
 
+static int nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
+{
+	void *priv;
+
+	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
+		return 0;
+
+	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
+	if (IS_ERR(priv)) {
+		int err = PTR_ERR(priv);
+
+		return err == -EOPNOTSUPP ? 0 : err;
+	}
+
+	p->accel_priv = priv;
+	p->flags |= BR_FWD_OFFLOAD;
+	static_branch_inc(&br_switchdev_fwd_offload_used);
+
+	return 0;
+}
+
+static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
+{
+	if (!p->accel_priv)
+		return;
+
+	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
+
+	p->accel_priv = NULL;
+	p->flags &= ~BR_FWD_OFFLOAD;
+	static_branch_dec(&br_switchdev_fwd_offload_used);
+}
+
 static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
 {
 	struct net_bridge *br = joining->br;
@@ -176,13 +245,25 @@ int nbp_switchdev_add(struct net_bridge_port *p)
 		return err;
 	}
 
-	return nbp_switchdev_hwdom_set(p);
+	err = nbp_switchdev_hwdom_set(p);
+	if (err)
+		return err;
+
+	if (p->hwdom) {
+		err = nbp_switchdev_fwd_offload_add(p);
+		if (err)
+			return err;
+	}
+
+	return 0;
 }
 
 void nbp_switchdev_del(struct net_bridge_port *p)
 {
 	ASSERT_RTNL();
 
+	nbp_switchdev_fwd_offload_del(p);
+
 	if (p->hwdom)
 		nbp_switchdev_hwdom_put(p);
 }
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index a08e9f193009..bf014efa5851 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -457,7 +457,15 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
 		u64_stats_update_end(&stats->syncp);
 	}
 
-	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
+	/* If the skb will be sent using forwarding offload, the assumption is
+	 * that the switchdev will inject the packet into hardware together
+	 * with the bridge VLAN, so that it can be forwarded according to that
+	 * VLAN. The switchdev should deal with popping the VLAN header in
+	 * hardware on each egress port as appropriate. So only strip the VLAN
+	 * header if forwarding offload is not being used.
+	 */
+	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED &&
+	    !br_switchdev_accels_skb(skb))
 		__vlan_hwaccel_clear_tag(skb);
 
 	if (p && (p->flags & BR_VLAN_TUNNEL) &&
-- 
2.25.1


* [Bridge] [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded
@ 2021-07-03 11:56   ` Vladimir Oltean
  0 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:56 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Jiri Pirko, bridge,
	Alexander Duyck, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
	Vivien Didelot, Tobias Waldekranz

From: Tobias Waldekranz <tobias@waldekranz.com>

Allow switchdevs to forward frames from the CPU in accordance with the
bridge configuration in the same way as is done between bridge
ports. This means that the bridge will only send a single skb towards
one of the ports under the switchdev's control, and expects the driver
to deliver the packet to all eligible ports in its domain.

Primarily this improves the performance of multicast flows with
multiple subscribers, as it allows the hardware to perform the frame
replication.
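
The "one skb per hardware domain" behaviour can be modelled in user space.
The sketch below is only an illustration under stated assumptions: the
model_* types are simplified stand-ins for struct net_bridge_port and
struct br_input_skb_cb, the static key is left out, and model_flood() is a
hypothetical reduction of the bridge flooding loop to its egress checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct net_bridge_port. */
struct model_port {
	int hwdom;		/* 0 means "no hardware domain" */
	bool fwd_offload;	/* models p->flags & BR_FWD_OFFLOAD */
};

/* Simplified stand-in for struct br_input_skb_cb plus offload_fwd_mark. */
struct model_skb_cb {
	int src_hwdom;			/* domain the frame came from */
	bool offload_fwd_mark;		/* hardware already forwarded it */
	unsigned long fwd_hwdoms;	/* domains that already got a copy */
};

/* Models nbp_switchdev_can_accel(), minus the static key. */
bool model_can_accel(const struct model_port *p,
		     const struct model_skb_cb *cb)
{
	return p->fwd_offload && p->hwdom != cb->src_hwdom;
}

/* Models nbp_switchdev_allowed_egress() as changed by this patch. */
bool model_allowed_egress(const struct model_port *p,
			  const struct model_skb_cb *cb)
{
	assert(p->hwdom >= 0 && p->hwdom < 32);

	return !(cb->fwd_hwdoms & (1UL << p->hwdom)) &&
	       (!cb->offload_fwd_mark || cb->src_hwdom != p->hwdom);
}

/* Flood one frame to all ports; return how many skbs the CPU must send. */
int model_flood(const struct model_port *ports, size_t n,
		struct model_skb_cb *cb)
{
	int sent = 0;

	for (size_t i = 0; i < n; i++) {
		if (!model_allowed_egress(&ports[i], cb))
			continue;

		/* Models nbp_switchdev_frame_mark_fwd(). */
		if (model_can_accel(&ports[i], cb))
			cb->fwd_hwdoms |= 1UL << ports[i].hwdom;

		sent++;
	}

	return sent;
}
```

With four offloading ports in hardware domain 1 and one plain port, a
locally originated frame costs two transmissions in this model instead of
five: one for the whole domain and one for the plain port.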

The basic flow between the driver and the bridge is as follows:

- The switchdev accepts the offload by returning a non-NULL pointer
  from .ndo_dfwd_add_station when the port is added to the bridge.

- The bridge sends offloadable skbs to one of the ports under the
  switchdev's control using dev_queue_xmit_accel.

- The switchdev notices the offload by checking for a non-NULL
  "sb_dev" in the core's call to .ndo_select_queue.
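
The accept/decline handshake from the first step can be sketched in user
space as follows. This is a model, not the kernel code: the model_err_ptr()
family stands in for ERR_PTR()/IS_ERR()/PTR_ERR(), struct model_port is a
hypothetical reduction of struct net_bridge_port, and the driver's return
value is passed in directly instead of being obtained from
.ndo_dfwd_add_station.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-ins for the kernel's ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define MODEL_MAX_ERRNO 4095

void *model_err_ptr(intptr_t err)
{
	return (void *)err;
}

bool model_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MODEL_MAX_ERRNO;
}

intptr_t model_ptr_err(const void *ptr)
{
	return (intptr_t)ptr;
}

/* Hypothetical reduction of struct net_bridge_port. */
struct model_port {
	void *accel_priv;
	bool fwd_offload;	/* models p->flags & BR_FWD_OFFLOAD */
};

/*
 * Models the decision taken on the driver's answer: -EOPNOTSUPP means
 * "no offload, but not a failure", any other error is propagated, and a
 * regular pointer enables the offload for this port.
 */
int model_fwd_offload_add(struct model_port *p, void *priv)
{
	assert(p != NULL);

	if (model_is_err(priv)) {
		int err = (int)model_ptr_err(priv);

		return err == -EOPNOTSUPP ? 0 : err;
	}

	p->accel_priv = priv;
	p->fwd_offload = true;

	return 0;
}
```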

v1->v2:
- convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long
- introduce a static key "br_switchdev_fwd_offload_used" to minimize the
  impact of the newly introduced feature on all the setups which don't
  have hardware that can make use of it
- introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache
  line access
- reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in
  __br_forward()
- do not strip VLAN on egress if forwarding offload on VLAN-aware bridge
  is being used
- propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/linux/if_bridge.h |  1 +
 net/bridge/br_forward.c   | 18 +++++++-
 net/bridge/br_private.h   | 24 +++++++++++
 net/bridge/br_switchdev.c | 87 +++++++++++++++++++++++++++++++++++++--
 net/bridge/br_vlan.c      | 10 ++++-
 5 files changed, 135 insertions(+), 5 deletions(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index b651c5e32a28..a47b86ab7f96 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -57,6 +57,7 @@ struct br_ip_list {
 #define BR_MRP_AWARE		BIT(17)
 #define BR_MRP_LOST_CONT	BIT(18)
 #define BR_MRP_LOST_IN_CONT	BIT(19)
+#define BR_FWD_OFFLOAD		BIT(20)
 
 #define BR_DEFAULT_AGEING_TIME	(300 * HZ)
 
diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 07856362538f..919246a2c7eb 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
 
 int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	struct net_device *sb_dev = NULL;
+
 	skb_push(skb, ETH_HLEN);
 	if (!is_skb_forwardable(skb->dev, skb))
 		goto drop;
@@ -48,7 +50,14 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
 		skb_set_network_header(skb, depth);
 	}
 
-	dev_queue_xmit(skb);
+	if (br_switchdev_accels_skb(skb)) {
+		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
+
+		WARN_ON_ONCE(br_vlan_enabled(sb_dev) &&
+			     !skb_vlan_tag_present(skb));
+	}
+
+	dev_queue_xmit_accel(skb, sb_dev);
 
 	return 0;
 
@@ -76,6 +85,11 @@ static void __br_forward(const struct net_bridge_port *to,
 	struct net *net;
 	int br_hook;
 
+	/* Mark the skb for forwarding offload early so that br_handle_vlan()
+	 * can know whether to pop the VLAN header on egress or keep it.
+	 */
+	nbp_switchdev_frame_mark_accel(to, skb);
+
 	vg = nbp_vlan_group_rcu(to);
 	skb = br_handle_vlan(to->br, to, vg, skb);
 	if (!skb)
@@ -174,6 +188,8 @@ static struct net_bridge_port *maybe_deliver(
 	if (!should_deliver(p, skb))
 		return prev;
 
+	nbp_switchdev_frame_mark_fwd(p, skb);
+
 	if (!prev)
 		goto out;
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 9ff09a32e3f8..655212df57f7 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -332,6 +332,7 @@ struct net_bridge_port {
 #endif
 #ifdef CONFIG_NET_SWITCHDEV
 	int				hwdom;
+	void				*accel_priv;
 #endif
 	u16				group_fwd_mask;
 	u16				backup_redirected_cnt;
@@ -508,7 +509,9 @@ struct br_input_skb_cb {
 #endif
 
 #ifdef CONFIG_NET_SWITCHDEV
+	u8 fwd_accel:1;
 	int src_hwdom;
+	unsigned long fwd_hwdoms;
 #endif
 };
 
@@ -1647,6 +1650,12 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
+bool br_switchdev_accels_skb(struct sk_buff *skb);
+
+void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+				    struct sk_buff *skb);
+void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+				  struct sk_buff *skb);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1669,6 +1678,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 	skb->offload_fwd_mark = 0;
 }
 #else
+static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
+{
+	return false;
+}
+
+static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+						  struct sk_buff *skb)
+{
+}
+
+static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+						struct sk_buff *skb)
+{
+}
+
 static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 					    struct sk_buff *skb)
 {
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index f3120f13c293..8653d9a540a1 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,6 +8,40 @@
 
 #include "br_private.h"
 
+static struct static_key_false br_switchdev_fwd_offload_used;
+
+static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,
+				    const struct sk_buff *skb)
+{
+	if (!static_branch_unlikely(&br_switchdev_fwd_offload_used))
+		return false;
+
+	return (p->flags & BR_FWD_OFFLOAD) &&
+	       (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);
+}
+
+bool br_switchdev_accels_skb(struct sk_buff *skb)
+{
+	if (!static_branch_unlikely(&br_switchdev_fwd_offload_used))
+		return false;
+
+	return BR_INPUT_SKB_CB(skb)->fwd_accel;
+}
+
+void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+				    struct sk_buff *skb)
+{
+	if (nbp_switchdev_can_accel(p, skb))
+		BR_INPUT_SKB_CB(skb)->fwd_accel = true;
+}
+
+void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+				  struct sk_buff *skb)
+{
+	if (nbp_switchdev_can_accel(p, skb))
+		set_bit(p->hwdom, &BR_INPUT_SKB_CB(skb)->fwd_hwdoms);
+}
+
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
@@ -18,8 +52,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
 				  const struct sk_buff *skb)
 {
-	return !skb->offload_fwd_mark ||
-	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
+	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);
+
+	return !test_bit(p->hwdom, &cb->fwd_hwdoms) &&
+		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);
 }
 
 /* Flags that can be offloaded to hardware */
@@ -125,6 +161,39 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
 	return switchdev_port_obj_del(dev, &v.obj);
 }
 
+static int nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
+{
+	void *priv;
+
+	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
+		return 0;
+
+	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
+	if (IS_ERR(priv)) {
+		int err = PTR_ERR(priv);
+
+		return err == -EOPNOTSUPP ? 0 : err;
+	}
+
+	p->accel_priv = priv;
+	p->flags |= BR_FWD_OFFLOAD;
+	static_branch_inc(&br_switchdev_fwd_offload_used);
+
+	return 0;
+}
+
+static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
+{
+	if (!p->accel_priv)
+		return;
+
+	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
+
+	p->accel_priv = NULL;
+	p->flags &= ~BR_FWD_OFFLOAD;
+	static_branch_dec(&br_switchdev_fwd_offload_used);
+}
+
 static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
 {
 	struct net_bridge *br = joining->br;
@@ -176,13 +245,25 @@ int nbp_switchdev_add(struct net_bridge_port *p)
 		return err;
 	}
 
-	return nbp_switchdev_hwdom_set(p);
+	err = nbp_switchdev_hwdom_set(p);
+	if (err)
+		return err;
+
+	if (p->hwdom) {
+		err = nbp_switchdev_fwd_offload_add(p);
+		if (err)
+			return err;
+	}
+
+	return 0;
 }
 
 void nbp_switchdev_del(struct net_bridge_port *p)
 {
 	ASSERT_RTNL();
 
+	nbp_switchdev_fwd_offload_del(p);
+
 	if (p->hwdom)
 		nbp_switchdev_hwdom_put(p);
 }
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index a08e9f193009..bf014efa5851 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -457,7 +457,15 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
 		u64_stats_update_end(&stats->syncp);
 	}
 
-	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
+	/* If the skb will be sent using forwarding offload, the assumption is
+	 * that the switchdev will inject the packet into hardware together
+	 * with the bridge VLAN, so that it can be forwarded according to that
+	 * VLAN. The switchdev should deal with popping the VLAN header in
+	 * hardware on each egress port as appropriate. So only strip the VLAN
+	 * header if forwarding offload is not being used.
+	 */
+	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED &&
+	    !br_switchdev_accels_skb(skb))
 		__vlan_hwaccel_clear_tag(skb);
 
 	if (p && (p->flags & BR_VLAN_TUNNEL) &&
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 net-next 05/10] net: extract helpers for binding a subordinate device to TX queues
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-03 11:57   ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:57 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

Currently, the acceleration scheme for offloading the data plane of
upper devices to hardware is geared towards a single topology: that of
macvlan interfaces, where there is a lower interface with many uppers.

We would like to use the same acceleration framework for the bridge data
plane, but there we have a single upper interface with many lowers.

This matters because commit ffcfe25bb50f ("net: Add support for
subordinate device traffic classes") has pulled some logic out of
ixgbe_select_queue() and moved it into net/core/dev.c as if it was
generic enough to do so. In particular, it created a scheme where:

- ixgbe calls netdev_set_sb_channel() on the macvlan interface, which
  changes the macvlan's dev->num_tc to a negative value (-channel).
  The value itself is not used anywhere in any relevant manner, it only
  matters that it's negative, because:
- when ixgbe calls netdev_bind_sb_channel_queue(), the macvlan is
  checked for being configured as a subordinate channel (its num_tc must
  be smaller than zero) and its tc_to_txq guts are being scavenged to
  hold what ixgbe puts in it (for each traffic class, a mapping is
  recorded towards an ixgbe TX ring dedicated to that macvlan). This is
  safe because "we can pretty much guarantee that the tc_to_txq mappings
  and XPS maps for the upper device are unused".
- when a packet is to be transmitted on the ixgbe interface on behalf of
  a macvlan upper and a TX queue is to be selected, netdev_pick_tx() ->
  skb_tx_hash() looks at the tc_to_txq array of the macvlan sb_dev,
  which was populated by ixgbe. The packet reaches the dedicated TX ring.

Fun, but netdev hierarchies with one upper and many lowers cannot do
this, because if multiple lowers tried to lay their eggs into the same
tc_to_txq array of the same upper, they would have to coordinate somehow.
So it doesn't quite work.

But nonetheless, to make use of the subordinate device concept, we need
access to the sb_dev in the ndo_start_xmit() method, and the only place
we can retrieve it from is:

	netdev_get_tx_queue(dev, skb_get_queue_mapping(skb))->sb_dev

So we need that pointer populated and not much else.

Refactor the code which assigns the subordinate device pointer per lower
interface TX queue into a dedicated set of helpers and export it.
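
As a rough user-space illustration of what the extracted helpers do and how
a driver could use them, consider the sketch below. The model_* structures
are simplified stand-ins for struct net_device and struct netdev_queue, and
model_xmit_sb_dev() is a hypothetical name for the lookup a driver would
perform in its ndo_start_xmit() method.

```c
#include <assert.h>
#include <stddef.h>

struct model_net_device;

/* Stand-in for struct netdev_queue: only the sb_dev pointer matters. */
struct model_netdev_queue {
	struct model_net_device *sb_dev;
};

/* Stand-in for struct net_device; tx models dev->_tx. */
struct model_net_device {
	unsigned int num_tx_queues;
	struct model_netdev_queue *tx;
};

/* Point "count" TX queues, starting at "offset", at the upper device. */
void model_bind_tx_queues_to_sb_dev(struct model_net_device *dev,
				    struct model_net_device *sb_dev,
				    unsigned short count,
				    unsigned short offset)
{
	assert((unsigned int)offset + count <= dev->num_tx_queues);

	while (count--)
		dev->tx[count + offset].sb_dev = sb_dev;
}

/* Clear every TX queue that still points at the upper device. */
void model_unbind_tx_queues_from_sb_dev(struct model_net_device *dev,
					struct model_net_device *sb_dev)
{
	for (unsigned int i = 0; i < dev->num_tx_queues; i++) {
		if (dev->tx[i].sb_dev == sb_dev)
			dev->tx[i].sb_dev = NULL;
	}
}

/* What the xmit path would do: recover the upper from the chosen queue. */
struct model_net_device *
model_xmit_sb_dev(const struct model_net_device *dev,
		  unsigned int queue_mapping)
{
	assert(queue_mapping < dev->num_tx_queues);

	return dev->tx[queue_mapping].sb_dev;
}
```

Note that nothing in this model touches tc_to_txq or num_tc of the upper,
which is the point of the refactoring: several lowers can bind a queue to
the same upper without coordinating.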

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/linux/netdevice.h |  7 +++++++
 net/core/dev.c            | 31 +++++++++++++++++++++++--------
 2 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index eaf5bb008aa9..16c88e416693 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2301,6 +2301,13 @@ static inline void net_prefetchw(void *p)
 #endif
 }
 
+void netdev_bind_tx_queues_to_sb_dev(struct net_device *dev,
+				     struct net_device *sb_dev,
+				     u16 count, u16 offset);
+
+void netdev_unbind_tx_queues_from_sb_dev(struct net_device *dev,
+					 struct net_device *sb_dev);
+
 void netdev_unbind_sb_channel(struct net_device *dev,
 			      struct net_device *sb_dev);
 int netdev_bind_sb_channel_queue(struct net_device *dev,
diff --git a/net/core/dev.c b/net/core/dev.c
index c253c2aafe97..02e3a6941381 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2957,21 +2957,37 @@ int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
 }
 EXPORT_SYMBOL(netdev_set_num_tc);
 
-void netdev_unbind_sb_channel(struct net_device *dev,
-			      struct net_device *sb_dev)
+void netdev_bind_tx_queues_to_sb_dev(struct net_device *dev,
+				     struct net_device *sb_dev,
+				     u16 count, u16 offset)
+{
+	while (count--)
+		netdev_get_tx_queue(dev, count + offset)->sb_dev = sb_dev;
+}
+EXPORT_SYMBOL_GPL(netdev_bind_tx_queues_to_sb_dev);
+
+void netdev_unbind_tx_queues_from_sb_dev(struct net_device *dev,
+					 struct net_device *sb_dev)
 {
 	struct netdev_queue *txq = &dev->_tx[dev->num_tx_queues];
 
+	while (txq-- != &dev->_tx[0]) {
+		if (txq->sb_dev == sb_dev)
+			txq->sb_dev = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(netdev_unbind_tx_queues_from_sb_dev);
+
+void netdev_unbind_sb_channel(struct net_device *dev,
+			      struct net_device *sb_dev)
+{
 #ifdef CONFIG_XPS
 	netif_reset_xps_queues_gt(sb_dev, 0);
 #endif
 	memset(sb_dev->tc_to_txq, 0, sizeof(sb_dev->tc_to_txq));
 	memset(sb_dev->prio_tc_map, 0, sizeof(sb_dev->prio_tc_map));
 
-	while (txq-- != &dev->_tx[0]) {
-		if (txq->sb_dev == sb_dev)
-			txq->sb_dev = NULL;
-	}
+	netdev_unbind_tx_queues_from_sb_dev(dev, sb_dev);
 }
 EXPORT_SYMBOL(netdev_unbind_sb_channel);
 
@@ -2994,8 +3010,7 @@ int netdev_bind_sb_channel_queue(struct net_device *dev,
 	/* Provide a way for Tx queue to find the tc_to_txq map or
 	 * XPS map for itself.
 	 */
-	while (count--)
-		netdev_get_tx_queue(dev, count + offset)->sb_dev = sb_dev;
+	netdev_bind_tx_queues_to_sb_dev(dev, sb_dev, count, offset);
 
 	return 0;
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 net-next 06/10] net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-03 11:57   ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:57 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

When using a bridge upper as a subordinate device, switchdev interfaces
must allocate a TX queue for it, so that .ndo_start_xmit() can tell
whether an skb comes from the bridge or not.

The dedicated TX queue has its ->sb_dev pointer pointing to the bridge
device, and the only assumption that can be made is that any skb on that
queue comes from the bridge. So no other skbs may be sent on that
queue.

The default netdev_pick_tx() -> skb_tx_hash() policy hashes between TX
queues of the same priority.

To make the scheme work, switchdev drivers offloading a bridge need to
implement their own .ndo_select_queue() which selects the dedicated TX
queue for packets coming from the sb_dev, and lets netdev_pick_tx()
choose from the rest of the TX queues for the rest.

The implication is that the dedicated TX queue for the sb_dev must be
outside of the dev->real_num_tx_queues range, because otherwise,
netdev_pick_tx() might choose that TX queue for packets which aren't
actually coming from our sb_dev and therefore the assumption made in the
driver's .ndo_start_xmit() would be wrong.

This patch lifts the restriction in netdev_core_pick_tx(), enforced
through netdev_cap_txqueue(), which says that the selected TX queue
index must be smaller than dev->real_num_tx_queues. The index is now
checked against dev->num_tx_queues instead, so the dedicated TX queue
for the sb_dev is accepted, and netdev_pick_tx() can safely pick between
the non-dedicated TX queues.
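
The effect of the changed bound can be sketched in user space. In this
model, struct model_dev keeps only the two queue counters,
model_select_queue() is a hypothetical driver policy that places the
dedicated queue at index real_num_tx_queues, and the two model_cap_*()
functions mirror the check before and after this patch without the warning.

```c
#include <assert.h>
#include <stdbool.h>

/* Only the two counters that matter for the bound check. */
struct model_dev {
	unsigned int real_num_tx_queues;
	unsigned int num_tx_queues;
};

/*
 * Hypothetical driver policy: bridge traffic goes to the first queue
 * past the "real" range, everything else is hashed among the real ones.
 */
unsigned int model_select_queue(const struct model_dev *dev,
				bool from_bridge, unsigned int hash)
{
	assert(dev->real_num_tx_queues > 0);

	if (from_bridge)
		return dev->real_num_tx_queues;

	return hash % dev->real_num_tx_queues;
}

/* Bound check before the patch: falls back to queue 0 too early. */
unsigned int model_cap_txqueue_old(const struct model_dev *dev,
				   unsigned int queue_index)
{
	return queue_index >= dev->real_num_tx_queues ? 0 : queue_index;
}

/* Bound check after the patch: any allocated queue is accepted. */
unsigned int model_cap_txqueue_new(const struct model_dev *dev,
				   unsigned int queue_index)
{
	return queue_index >= dev->num_tx_queues ? 0 : queue_index;
}
```

For a device with 4 real queues out of 5 allocated, the dedicated index 4
is rewritten to 0 by the old check and preserved by the new one, while
hashed indices are unaffected by either.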

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/linux/netdevice.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 16c88e416693..d43f6ddd12a1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3697,10 +3697,10 @@ static inline void netdev_reset_queue(struct net_device *dev_queue)
  */
 static inline u16 netdev_cap_txqueue(struct net_device *dev, u16 queue_index)
 {
-	if (unlikely(queue_index >= dev->real_num_tx_queues)) {
-		net_warn_ratelimited("%s selects TX queue %d, but real number of TX queues is %d\n",
+	if (unlikely(queue_index >= dev->num_tx_queues)) {
+		net_warn_ratelimited("%s selects TX queue %d, but number of TX queues is %d\n",
 				     dev->name, queue_index,
-				     dev->real_num_tx_queues);
+				     dev->num_tx_queues);
 		return 0;
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 net-next 07/10] net: dsa: track the number of switches in a tree
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-03 11:57   ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:57 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

In preparation for supporting data plane forwarding on behalf of a
software bridge, some drivers might need to view bridges as virtual
switches behind the CPU port in a cross-chip topology.

Give them some help and let them know the largest physical switch index
in the tree (dst->last_switch), so that they can count the virtual
switches starting right after it.

Note that the first dsa_switch_ops method where this information is
reliably available is .setup(). This is because of how DSA works:
in a tree with 3 switches, each calling dsa_register_switch(), the first
2 will advance until dsa_tree_setup() -> dsa_tree_setup_routing_table()
and exit with error code 0 because the topology is not complete. Since
probing is parallel at this point, one switch does not know about the
existence of the other. Then the third switch comes, and for it,
dsa_tree_setup_routing_table() returns complete = true. This switch goes
ahead and calls dsa_tree_setup_switches() for everybody else, calling
their .setup() methods too. This acts as the synchronization point.
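
A minimal user-space sketch of how the largest switch index is tracked and
how a driver might consume it. struct model_tree stands in for struct
dsa_switch_tree; model_virtual_switch_index() and its "largest index plus
one" numbering are assumptions made for illustration, not something this
patch defines.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct dsa_switch_tree: only the new field is modelled. */
struct model_tree {
	unsigned int last_switch;	/* largest switch index seen so far */
};

/* Mirrors the update done while parsing each switch's tree membership. */
void model_tree_add_switch(struct model_tree *dst, unsigned int index)
{
	assert(dst != NULL);

	if (dst->last_switch < index)
		dst->last_switch = index;
}

/*
 * Hypothetical driver helper: number bridges as virtual switches placed
 * right after the physical ones. Only meaningful once every switch in
 * the tree has registered, i.e. from .setup() onwards.
 */
unsigned int model_virtual_switch_index(const struct model_tree *dst,
					unsigned int bridge_num)
{
	return dst->last_switch + 1 + bridge_num;
}
```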

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/dsa.h | 3 +++
 net/dsa/dsa2.c    | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 33f40c1ec379..89626eab92b9 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -159,6 +159,9 @@ struct dsa_switch_tree {
 	 */
 	struct net_device **lags;
 	unsigned int lags_len;
+
+	/* Track the largest switch index within a tree */
+	unsigned int last_switch;
 };
 
 #define dsa_lags_foreach_id(_id, _dst)				\
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 185629f27f80..de5e93ba2a9d 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -1265,6 +1265,9 @@ static int dsa_switch_parse_member_of(struct dsa_switch *ds,
 		return -EEXIST;
 	}
 
+	if (ds->dst->last_switch < ds->index)
+		ds->dst->last_switch = ds->index;
+
 	return 0;
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 net-next 08/10] net: dsa: add support for bridge forwarding offload
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-03 11:57   ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:57 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

For a DSA switch, to offload the forwarding process of a bridge device
means to send the packets coming from the software bridge as data plane
packets. This is contrary to everything that DSA has done so far,
because the current taggers only know how to send control packets (ones
that target a specific destination port), whereas data plane packets are
supposed to be forwarded according to the FDB lookup, much like packets
ingressing on any regular ingress port. If the FDB lookup process
returns multiple destination ports (flooding, multicast), then
replication is also handled by the switch hardware - the bridge only
sends a single packet and avoids the skb_clone().

DSA plays a substantial role in backing the forwarding offload, and
leaves relatively few things up to the switch driver. In particular, DSA
creates one accel_priv structure for each bridge upper that offloads
forwarding, and assigns each such bridge a zero-based index (the bridge
number). Multiple ports enslaved to the same bridge hold a pointer to
the same accel_priv structure.
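
As a rough illustration of the bridge numbering, the user-space sketch
below hands out the lowest free bridge number from a bitmap and releases
it again. The names tree_state, bridge_num_get() and bridge_num_put()
are made up for this example; the patch itself uses
find_first_zero_bit(), set_bit() and clear_bit() on
dst->fwd_offloading_bridges and compares the result against
ds->num_fwd_offloading_bridges.

```c
#include <assert.h>
#include <limits.h>

/* One bit per bridge, like DSA_MAX_NUM_OFFLOADING_BRIDGES in the patch */
#define MAX_BRIDGES ((int)(sizeof(unsigned long) * CHAR_BIT))

/* Hypothetical stand-in for the per-tree fwd_offloading_bridges bitmap */
struct tree_state {
	unsigned long fwd_offloading_bridges;
};

/* Reserve and return the lowest free bridge number below 'limit'
 * (the driver-advertised maximum), or -1 if none is available.
 */
static int bridge_num_get(struct tree_state *t, int limit)
{
	int i;

	assert(limit >= 0 && limit <= MAX_BRIDGES);

	for (i = 0; i < limit; i++) {
		if (!(t->fwd_offloading_bridges & (1UL << i))) {
			t->fwd_offloading_bridges |= 1UL << i;
			return i;
		}
	}

	return -1;
}

/* Release a bridge number once the last port has left the bridge */
static void bridge_num_put(struct tree_state *t, int num)
{
	assert(num >= 0 && num < MAX_BRIDGES);
	t->fwd_offloading_bridges &= ~(1UL << num);
}
```

With a limit of 2, the first two bridges get numbers 0 and 1, a third
request fails, and number 0 becomes available again once it is put back.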

The way this offloading scheme (borrowed from macvlan offloading on
Intel hardware) works is that lower interfaces are supposed to reserve a
netdev TX queue corresponding to each offloadable upper ("subordinate")
interface. DSA reserves a single TX queue per port, a queue outside the
real_num_tx_queues range. That special TX queue has a ->sb_dev pointer,
which is the reason why we use it in the first place (to have access to
the sb_dev from .ndo_start_xmit). DSA then implements a custom
.ndo_select_queue to direct packets sent on behalf of the bridge to that
special queue, and leaves netdev_pick_tx() to pick among the
real_num_tx_queues regular queues (excluding the sb_dev queue) using the
default policies.
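
The queue layout can be pictured with the following user-space sketch.
It is only a model: fake_port and select_queue() are hypothetical names,
and the modulo on a caller-supplied hash merely stands in for whatever
policy netdev_pick_tx() applies in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a DSA user port: queues 0..num_tx_queues-1 are
 * the regular ones, queue index num_tx_queues is the reserved extra one.
 */
struct fake_port {
	unsigned int num_tx_queues;
};

/* Packets sent on behalf of a subordinate device (the bridge) go to the
 * reserved queue; everything else is spread over the regular queues.
 */
static uint16_t select_queue(const struct fake_port *p, const void *sb_dev,
			     uint32_t hash)
{
	assert(p->num_tx_queues > 0);

	if (sb_dev)
		return (uint16_t)p->num_tx_queues;

	/* Placeholder for the default queue selection policy */
	return (uint16_t)(hash % p->num_tx_queues);
}
```

Because the reserved queue sits just past the regular range, the default
policy in this model can never return it by accident.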

Both the tagger and the switch driver must support forwarding offload.
The tagger must look up the subordinate device (the bridge) and
therefore set the ".bridge_fwd_offload = true" capability; the switch
driver must set ds->num_fwd_offloading_bridges to the maximum number of
bridges for which it can offload forwarding.

The tagger can check if the TX queue that the skb is being transmitted
on has a subordinate device (sb_dev) associated with it or not. If it
does, it can be sure that the subordinate device is a bridge, and it can
use the dp->accel_priv to get further information about that bridge,
such as the bridge number. It can then compose a DSA tag for injecting a
data plane packet into that bridge number.
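
A tagger following this scheme might make its decision roughly as in the
user-space sketch below. Everything here is hypothetical (the fake_*
types, the two tag kinds and compose_tag()); a real tagger would obtain
the TX queue from the skb's queue mapping and emit its own
hardware-specific tag format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirrors of the fields the tagger cares about */
struct fake_accel_priv {
	int bridge_num;
};

struct fake_txq {
	const void *sb_dev;	/* non-NULL only on the reserved queue */
};

struct fake_port {
	int index;
	struct fake_accel_priv *accel_priv;
};

enum tag_kind {
	TAG_CONTROL,	/* targets one specific destination port */
	TAG_FORWARD,	/* data plane: the switch performs the FDB lookup */
};

struct fake_tag {
	enum tag_kind kind;
	int id;		/* port index or bridge number */
};

/* Pick the tag kind based on whether the TX queue has a subordinate
 * device bound to it.
 */
static struct fake_tag compose_tag(const struct fake_port *dp,
				   const struct fake_txq *txq)
{
	struct fake_tag tag;

	assert(dp && txq);

	if (txq->sb_dev && dp->accel_priv) {
		tag.kind = TAG_FORWARD;
		tag.id = dp->accel_priv->bridge_num;
	} else {
		tag.kind = TAG_CONTROL;
		tag.id = dp->index;
	}

	return tag;
}
```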

For the switch driver side, we offer two new pairs of dsa_switch_ops
methods which are modeled after .port_bridge_{join,leave} and
.crosschip_bridge_{join,leave}: .port_bridge_fwd_offload_{add,del} and
the cross-chip equivalents. These methods are provided in case the
driver needs to configure the hardware to treat packets coming from that
bridge software interface as data plane packets. The bridge calls our
.ndo_dfwd_add_station immediately after netdev_master_upper_dev_link(),
so to switch drivers, the effect is that the
.port_bridge_fwd_offload_add() method is called immediately after
.port_bridge_join().
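
The split between the per-port and cross-chip methods can be modeled in
user space as below. The fake_* types, the demo callbacks and
dispatch_add() are invented for this sketch; it only mirrors the
decision made for each switch that receives the notifier, including the
-EOPNOTSUPP result when the applicable method is not implemented.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct fake_switch;

/* Hypothetical subset of the new dsa_switch_ops methods */
struct fake_ops {
	int (*port_add)(struct fake_switch *ds, int port, int bridge_num);
	int (*crosschip_add)(struct fake_switch *ds, int tree_index,
			     int sw_index, int port, int bridge_num);
};

struct fake_switch {
	int tree_index;
	int index;
	const struct fake_ops *ops;
	int local_calls;	/* bookkeeping for the demo callbacks */
	int crosschip_calls;
};

struct fake_info {
	int tree_index;
	int sw_index;
	int port;
	int bridge_num;
};

static int demo_port_add(struct fake_switch *ds, int port, int bridge_num)
{
	(void)port;
	(void)bridge_num;
	ds->local_calls++;
	return 0;
}

static int demo_crosschip_add(struct fake_switch *ds, int tree_index,
			      int sw_index, int port, int bridge_num)
{
	(void)tree_index;
	(void)sw_index;
	(void)port;
	(void)bridge_num;
	ds->crosschip_calls++;
	return 0;
}

static const struct fake_ops demo_ops = {
	.port_add = demo_port_add,
	.crosschip_add = demo_crosschip_add,
};

/* The switch owning the port gets the per-port method, every other
 * switch gets the cross-chip one.
 */
static int dispatch_add(struct fake_switch *ds, const struct fake_info *info)
{
	int is_local;

	assert(ds && ds->ops && info);

	is_local = ds->tree_index == info->tree_index &&
		   ds->index == info->sw_index;

	if (is_local && ds->ops->port_add)
		return ds->ops->port_add(ds, info->port, info->bridge_num);

	if (!is_local && ds->ops->crosschip_add)
		return ds->ops->crosschip_add(ds, info->tree_index,
					      info->sw_index, info->port,
					      info->bridge_num);

	return -EOPNOTSUPP;
}
```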

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/dsa.h  |  34 ++++++++++++
 net/dsa/dsa_priv.h |  17 ++++++
 net/dsa/port.c     |  35 ++++++++++++
 net/dsa/slave.c    | 134 ++++++++++++++++++++++++++++++++++++++++++++-
 net/dsa/switch.c   |  58 ++++++++++++++++++++
 5 files changed, 277 insertions(+), 1 deletion(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 89626eab92b9..5d111cc2e403 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -103,6 +103,7 @@ struct dsa_device_ops {
 	 * its RX filter.
 	 */
 	bool promisc_on_master;
+	bool bridge_fwd_offload;
 };
 
 /* This structure defines the control interfaces that are overlayed by the
@@ -162,6 +163,9 @@ struct dsa_switch_tree {
 
 	/* Track the largest switch index within a tree */
 	unsigned int last_switch;
+
+	/* Track the bridges with forwarding offload enabled */
+	unsigned long fwd_offloading_bridges;
 };
 
 #define dsa_lags_foreach_id(_id, _dst)				\
@@ -224,6 +228,10 @@ struct dsa_mall_tc_entry {
 	};
 };
 
+struct dsa_bridge_fwd_accel_priv {
+	struct net_device *sb_dev;
+	int bridge_num;
+};
 
 struct dsa_port {
 	/* A CPU port is physically connected to a master device.
@@ -294,6 +302,8 @@ struct dsa_port {
 	struct list_head	fdbs;
 	struct list_head	mdbs;
 
+	struct dsa_bridge_fwd_accel_priv *accel_priv;
+
 	bool setup;
 };
 
@@ -410,6 +420,12 @@ struct dsa_switch {
 	 */
 	unsigned int		num_lag_ids;
 
+	/* Drivers that support bridge forwarding offload should set this to
+	 * the maximum number of bridges spanning the same switch tree that can
+	 * be offloaded.
+	 */
+	unsigned int		num_fwd_offloading_bridges;
+
 	size_t num_ports;
 };
 
@@ -693,6 +709,14 @@ struct dsa_switch_ops {
 				    struct net_device *bridge);
 	void	(*port_bridge_leave)(struct dsa_switch *ds, int port,
 				     struct net_device *bridge);
+	/* Called right after .port_bridge_join() */
+	int	(*port_bridge_fwd_offload_add)(struct dsa_switch *ds, int port,
+					       struct net_device *bridge,
+					       int bridge_num);
+	/* Called right before .port_bridge_leave() */
+	void	(*port_bridge_fwd_offload_del)(struct dsa_switch *ds, int port,
+					       struct net_device *bridge,
+					       int bridge_num);
 	void	(*port_stp_state_set)(struct dsa_switch *ds, int port,
 				      u8 state);
 	void	(*port_fast_age)(struct dsa_switch *ds, int port);
@@ -777,6 +801,16 @@ struct dsa_switch_ops {
 				      struct netdev_lag_upper_info *info);
 	int	(*crosschip_lag_leave)(struct dsa_switch *ds, int sw_index,
 				       int port, struct net_device *lag);
+	int	(*crosschip_bridge_fwd_offload_add)(struct dsa_switch *ds,
+						    int tree_index,
+						    int sw_index, int port,
+						    struct net_device *br,
+						    int bridge_num);
+	void	(*crosschip_bridge_fwd_offload_del)(struct dsa_switch *ds,
+						    int tree_index,
+						    int sw_index, int port,
+						    struct net_device *br,
+						    int bridge_num);
 
 	/*
 	 * PTP functionality
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index f201c33980bf..c577338b5bb7 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -14,10 +14,14 @@
 #include <net/dsa.h>
 #include <net/gro_cells.h>
 
+#define DSA_MAX_NUM_OFFLOADING_BRIDGES		BITS_PER_LONG
+
 enum {
 	DSA_NOTIFIER_AGEING_TIME,
 	DSA_NOTIFIER_BRIDGE_JOIN,
 	DSA_NOTIFIER_BRIDGE_LEAVE,
+	DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_ADD,
+	DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_DEL,
 	DSA_NOTIFIER_FDB_ADD,
 	DSA_NOTIFIER_FDB_DEL,
 	DSA_NOTIFIER_HOST_FDB_ADD,
@@ -54,6 +58,15 @@ struct dsa_notifier_bridge_info {
 	int port;
 };
 
+/* DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_* */
+struct dsa_notifier_bridge_fwd_offload_info {
+	struct net_device *br;
+	int tree_index;
+	int sw_index;
+	int port;
+	int bridge_num;
+};
+
 /* DSA_NOTIFIER_FDB_* */
 struct dsa_notifier_fdb_info {
 	int sw_index;
@@ -197,6 +210,10 @@ int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
 int dsa_port_pre_bridge_leave(struct dsa_port *dp, struct net_device *br,
 			      struct netlink_ext_ack *extack);
 void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br);
+int dsa_port_bridge_fwd_offload_add(struct dsa_port *dp,
+				    struct net_device *br, int bridge_num);
+void dsa_port_bridge_fwd_offload_del(struct dsa_port *dp,
+				     struct net_device *br, int bridge_num);
 int dsa_port_lag_change(struct dsa_port *dp,
 			struct netdev_lag_lower_state_info *linfo);
 int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev,
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 28b45b7e66df..3c268d00908c 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -344,6 +344,41 @@ void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br)
 	dsa_port_switchdev_unsync_attrs(dp);
 }
 
+int dsa_port_bridge_fwd_offload_add(struct dsa_port *dp,
+				    struct net_device *br, int bridge_num)
+{
+	struct dsa_notifier_bridge_fwd_offload_info info = {
+		.tree_index = dp->ds->dst->index,
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.br = br,
+		.bridge_num = bridge_num,
+	};
+
+	return dsa_port_notify(dp, DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_ADD,
+			       &info);
+}
+
+void dsa_port_bridge_fwd_offload_del(struct dsa_port *dp,
+				     struct net_device *br, int bridge_num)
+{
+	struct dsa_notifier_bridge_fwd_offload_info info = {
+		.tree_index = dp->ds->dst->index,
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.br = br,
+		.bridge_num = bridge_num,
+	};
+	struct net_device *dev = dp->slave;
+	int err;
+
+	err = dsa_port_notify(dp, DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_DEL,
+			      &info);
+	if (err)
+		netdev_err(dev, "failed to notify fwd offload del: %pe\n",
+			   ERR_PTR(err));
+}
+
 int dsa_port_lag_change(struct dsa_port *dp,
 			struct netdev_lag_lower_state_info *linfo)
 {
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index ffbba1e71551..003f3bb9c51a 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1679,6 +1679,119 @@ static int dsa_slave_fill_forward_path(struct net_device_path_ctx *ctx,
 	return 0;
 }
 
+/* Direct packets coming from the data plane of the bridge to a dedicated TX
+ * queue, and let the generic netdev_pick_tx() handle the rest via hashing
+ * among TX queues of the same priority.
+ */
+static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb,
+				  struct net_device *sb_dev)
+{
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+
+	if (unlikely(sb_dev))
+		return ds->num_tx_queues;
+
+	return netdev_pick_tx(dev, skb, sb_dev);
+}
+
+static struct dsa_bridge_fwd_accel_priv *
+dsa_find_accel_priv_by_sb_dev(struct dsa_switch_tree *dst,
+			      struct net_device *sb_dev)
+{
+	struct dsa_port *dp;
+
+	list_for_each_entry(dp, &dst->ports, list)
+		if (dp->accel_priv && dp->accel_priv->sb_dev == sb_dev)
+			return dp->accel_priv;
+
+	return NULL;
+}
+
+static void dsa_slave_fwd_offload_del(struct net_device *dev, void *sb_dev)
+{
+	struct dsa_bridge_fwd_accel_priv *accel_priv;
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+	struct dsa_switch_tree *dst;
+	int bridge_num;
+
+	if (!netif_is_bridge_master(sb_dev))
+		return;
+
+	dst = ds->dst;
+
+	accel_priv = dp->accel_priv;
+	bridge_num = accel_priv->bridge_num;
+
+	dp->accel_priv = NULL;
+
+	/* accel_priv no longer in use, time to clean it up */
+	if (!dsa_find_accel_priv_by_sb_dev(dst, sb_dev)) {
+		clear_bit(accel_priv->bridge_num, &dst->fwd_offloading_bridges);
+		kfree(accel_priv);
+	}
+
+	netdev_unbind_tx_queues_from_sb_dev(dev, sb_dev);
+
+	/* Notify the chips only once the offload has been deactivated, so
+	 * that they can update their configuration accordingly.
+	 */
+	dsa_port_bridge_fwd_offload_del(dp, sb_dev, bridge_num);
+}
+
+static void *dsa_slave_fwd_offload_add(struct net_device *dev,
+				       struct net_device *sb_dev)
+{
+	struct dsa_bridge_fwd_accel_priv *accel_priv;
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+	struct dsa_switch_tree *dst;
+	int err;
+
+	if (!netif_is_bridge_master(sb_dev))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	dst = ds->dst;
+
+	accel_priv = dsa_find_accel_priv_by_sb_dev(dst, sb_dev);
+	if (!accel_priv) {
+		/* First port that offloads forwarding for this bridge */
+		int bridge_num;
+
+		bridge_num = find_first_zero_bit(&dst->fwd_offloading_bridges,
+						 DSA_MAX_NUM_OFFLOADING_BRIDGES);
+		if (bridge_num >= ds->num_fwd_offloading_bridges)
+			return ERR_PTR(-EOPNOTSUPP);
+
+		accel_priv = kzalloc(sizeof(*accel_priv), GFP_KERNEL);
+		if (!accel_priv)
+			return ERR_PTR(-ENOMEM);
+
+		accel_priv->sb_dev = sb_dev;
+		accel_priv->bridge_num = bridge_num;
+
+		set_bit(bridge_num, &dst->fwd_offloading_bridges);
+	}
+
+	dp->accel_priv = accel_priv;
+
+	/* There can be only one master upper interface for each port in the
+	 * case of bridge forwarding offload, so just bind a single TX queue to
+	 * that subordinate device, the last one.
+	 */
+	netdev_bind_tx_queues_to_sb_dev(dev, sb_dev, 1, ds->num_tx_queues);
+
+	err = dsa_port_bridge_fwd_offload_add(dp, sb_dev,
+					      accel_priv->bridge_num);
+	if (err) {
+		dsa_slave_fwd_offload_del(dev, sb_dev);
+		return ERR_PTR(err);
+	}
+
+	return accel_priv;
+}
+
 static const struct net_device_ops dsa_slave_netdev_ops = {
 	.ndo_open	 	= dsa_slave_open,
 	.ndo_stop		= dsa_slave_close,
@@ -1703,6 +1816,9 @@ static const struct net_device_ops dsa_slave_netdev_ops = {
 	.ndo_get_devlink_port	= dsa_slave_get_devlink_port,
 	.ndo_change_mtu		= dsa_slave_change_mtu,
 	.ndo_fill_forward_path	= dsa_slave_fill_forward_path,
+	.ndo_dfwd_add_station	= dsa_slave_fwd_offload_add,
+	.ndo_dfwd_del_station	= dsa_slave_fwd_offload_del,
+	.ndo_select_queue	= dsa_slave_select_queue,
 };
 
 static struct device_type dsa_type = {
@@ -1819,6 +1935,11 @@ void dsa_slave_setup_tagger(struct net_device *slave)
 	slave->needed_tailroom += master->needed_tailroom;
 
 	p->xmit = cpu_dp->tag_ops->xmit;
+
+	if (cpu_dp->tag_ops->bridge_fwd_offload)
+		slave->features |= NETIF_F_HW_L2FW_DOFFLOAD;
+	else
+		slave->features &= ~NETIF_F_HW_L2FW_DOFFLOAD;
 }
 
 static struct lock_class_key dsa_slave_netdev_xmit_lock_key;
@@ -1877,10 +1998,21 @@ int dsa_slave_create(struct dsa_port *port)
 
 	slave_dev = alloc_netdev_mqs(sizeof(struct dsa_slave_priv), name,
 				     NET_NAME_UNKNOWN, ether_setup,
-				     ds->num_tx_queues, 1);
+				     ds->num_tx_queues + 1, 1);
 	if (slave_dev == NULL)
 		return -ENOMEM;
 
+	/* To avoid changing the number of TX queues at runtime depending on
+	 * whether the tagging protocol in use supports bridge forwarding
+	 * offload or not, just assume that all tagging protocols do, and
+	 * unconditionally register one extra TX queue to back that offload.
+	 * Then set real_num_tx_queues such that the extra queue will never be
+	 * selected by netdev_pick_tx(), just by ourselves.
+	 */
+	ret = netif_set_real_num_tx_queues(slave_dev, ds->num_tx_queues);
+	if (ret)
+		goto out_free;
+
 	slave_dev->features = master->vlan_features | NETIF_F_HW_TC;
 	if (ds->ops->port_vlan_add && ds->ops->port_vlan_del)
 		slave_dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index 248455145982..f0033906f36b 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -154,6 +154,58 @@ static int dsa_switch_bridge_leave(struct dsa_switch *ds,
 	return 0;
 }
 
+static int
+dsa_switch_bridge_fwd_offload_add(struct dsa_switch *ds,
+				  struct dsa_notifier_bridge_fwd_offload_info *info)
+{
+	struct dsa_switch_tree *dst = ds->dst;
+	int tree_index = info->tree_index;
+	int bridge_num = info->bridge_num;
+	struct net_device *br = info->br;
+	int sw_index = info->sw_index;
+	int port = info->port;
+
+	if (dst->index == tree_index && ds->index == sw_index &&
+	    ds->ops->port_bridge_fwd_offload_add)
+		return ds->ops->port_bridge_fwd_offload_add(ds, port, br,
+							    bridge_num);
+
+	if ((dst->index != tree_index || ds->index != sw_index) &&
+	    ds->ops->crosschip_bridge_fwd_offload_add)
+		return ds->ops->crosschip_bridge_fwd_offload_add(ds,
+								 tree_index,
+								 sw_index,
+								 port, br,
+								 bridge_num);
+
+	return -EOPNOTSUPP;
+}
+
+static int
+dsa_switch_bridge_fwd_offload_del(struct dsa_switch *ds,
+				  struct dsa_notifier_bridge_fwd_offload_info *info)
+{
+	struct dsa_switch_tree *dst = ds->dst;
+	int tree_index = info->tree_index;
+	int bridge_num = info->bridge_num;
+	struct net_device *br = info->br;
+	int sw_index = info->sw_index;
+	int port = info->port;
+
+	if (dst->index == tree_index && ds->index == sw_index &&
+	    ds->ops->port_bridge_fwd_offload_del)
+		ds->ops->port_bridge_fwd_offload_del(ds, port, br,
+						     bridge_num);
+
+	if ((dst->index != info->tree_index || ds->index != info->sw_index) &&
+	    ds->ops->crosschip_bridge_fwd_offload_del)
+		ds->ops->crosschip_bridge_fwd_offload_del(ds, tree_index,
+							  sw_index, port, br,
+							  bridge_num);
+
+	return 0;
+}
+
 /* Matches for all upstream-facing ports (the CPU port and all upstream-facing
  * DSA links) that sit between the targeted port on which the notifier was
  * emitted and its dedicated CPU port.
@@ -663,6 +715,12 @@ static int dsa_switch_event(struct notifier_block *nb,
 	case DSA_NOTIFIER_BRIDGE_LEAVE:
 		err = dsa_switch_bridge_leave(ds, info);
 		break;
+	case DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_ADD:
+		err = dsa_switch_bridge_fwd_offload_add(ds, info);
+		break;
+	case DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_DEL:
+		err = dsa_switch_bridge_fwd_offload_del(ds, info);
+		break;
 	case DSA_NOTIFIER_FDB_ADD:
 		err = dsa_switch_fdb_add(ds, info);
 		break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [Bridge] [RFC PATCH v2 net-next 08/10] net: dsa: add support for bridge forwarding offload
@ 2021-07-03 11:57   ` Vladimir Oltean
  0 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:57 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Jiri Pirko, bridge,
	Alexander Duyck, Ido Schimmel, Nikolay Aleksandrov, Roopa Prabhu,
	Vivien Didelot, Tobias Waldekranz

For a DSA switch, to offload the forwarding process of a bridge device
means to send the packets coming from the software bridge as data plane
packets. This is contrary to everything that DSA has done so far,
because the current taggers only know to send control packets (ones that
target a specific destination port), whereas data plane packets are
supposed to be forwarded according to the FDB lookup, much like packets
ingressing on any regular ingress port. If the FDB lookup process
returns multiple destination ports (flooding, multicast), then
replication is also handled by the switch hardware - the bridge only
sends a single packet and avoids the skb_clone().

DSA plays a substantial role in backing the forwarding offload, and
leaves relatively few things up to the switch driver. In particular, DSA
creates an accel_priv structure per port associated with each possible
bridge upper, and for each bridge it keeps a zero-based index (the
number of the bridge). Multiple ports enslaved to the same bridge have
a pointer to the same accel_priv structure.

The way this offloading scheme (borrowed from macvlan offloading on
Intel hardware) works is that lower interfaces are supposed to reserve a
netdev TX queue corresponding to each offloadable upper ("subordinate")
interface. DSA reserves a single TX queue per port, a queue outside the
num_real_tx_queues range. That special TX queue has a ->sb_dev pointer,
which is the reason why we use it in the first place (to have access to
the sb_dev from .ndo_start_xmit). DSA then implements a custom
.ndo_select_queue to direct packets on behalf of the bridge to that
special queue, and leaves netdev_pick_tx to pick among the
num_real_tx_queues (excluding the sb_dev queue) using the default policies.

It is assumed that both the tagger must support forwarding offload (it
must search for the subordinate device - the bridge), and must therefore
set the ".bridge_fwd_offload = true" capability, as well as the switch
driver (this must set in ds->num_fwd_offloading_bridges the maximum
number of bridges for which it can offload forwarding).

The tagger can check if the TX queue that the skb is being transmitted
on has a subordinate device (sb_dev) associated with it or not. If it
does, it can be sure that the subordinate device is a bridge, and it can
use the dp->accel_priv to get further information about that bridge,
such as the bridge number. It can then compose a DSA tag for injecting a
data plane packet into that bridge number.

For the switch driver side, we offer two new pair of dsa_switch_ops
methods which are modeled after .port_bridge_{join,leave} and
.crosschip_bridge_{join,leave}.
These are .port_bridge_fwd_offload_{add,del} and the cross-chip
equivalents. These methods are provided in case the driver needs to
configure the hardware to treat packets coming from that bridge software
interface as data plane packets. The bridge calls our
.ndo_dfwd_add_station immediately after netdev_master_upper_dev_link(),
so to switch drivers, the effect is that the
.port_bridge_fwd_offload_add() method is called immediately after
.port_bridge_join().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/dsa.h  |  34 ++++++++++++
 net/dsa/dsa_priv.h |  17 ++++++
 net/dsa/port.c     |  35 ++++++++++++
 net/dsa/slave.c    | 134 ++++++++++++++++++++++++++++++++++++++++++++-
 net/dsa/switch.c   |  58 ++++++++++++++++++++
 5 files changed, 277 insertions(+), 1 deletion(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 89626eab92b9..5d111cc2e403 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -103,6 +103,7 @@ struct dsa_device_ops {
 	 * its RX filter.
 	 */
 	bool promisc_on_master;
+	bool bridge_fwd_offload;
 };
 
 /* This structure defines the control interfaces that are overlayed by the
@@ -162,6 +163,9 @@ struct dsa_switch_tree {
 
 	/* Track the largest switch index within a tree */
 	unsigned int last_switch;
+
+	/* Track the bridges with forwarding offload enabled */
+	unsigned long fwd_offloading_bridges;
 };
 
 #define dsa_lags_foreach_id(_id, _dst)				\
@@ -224,6 +228,10 @@ struct dsa_mall_tc_entry {
 	};
 };
 
+struct dsa_bridge_fwd_accel_priv {
+	struct net_device *sb_dev;
+	int bridge_num;
+};
 
 struct dsa_port {
 	/* A CPU port is physically connected to a master device.
@@ -294,6 +302,8 @@ struct dsa_port {
 	struct list_head	fdbs;
 	struct list_head	mdbs;
 
+	struct dsa_bridge_fwd_accel_priv *accel_priv;
+
 	bool setup;
 };
 
@@ -410,6 +420,12 @@ struct dsa_switch {
 	 */
 	unsigned int		num_lag_ids;
 
+	/* Drivers that support bridge forwarding offload should set this to
+	 * the maximum number of bridges spanning the same switch tree that can
+	 * be offloaded.
+	 */
+	unsigned int		num_fwd_offloading_bridges;
+
 	size_t num_ports;
 };
 
@@ -693,6 +709,14 @@ struct dsa_switch_ops {
 				    struct net_device *bridge);
 	void	(*port_bridge_leave)(struct dsa_switch *ds, int port,
 				     struct net_device *bridge);
+	/* Called right after .port_bridge_join() */
+	int	(*port_bridge_fwd_offload_add)(struct dsa_switch *ds, int port,
+					       struct net_device *bridge,
+					       int bridge_num);
+	/* Called right before .port_bridge_leave() */
+	void	(*port_bridge_fwd_offload_del)(struct dsa_switch *ds, int port,
+					       struct net_device *bridge,
+					       int bridge_num);
 	void	(*port_stp_state_set)(struct dsa_switch *ds, int port,
 				      u8 state);
 	void	(*port_fast_age)(struct dsa_switch *ds, int port);
@@ -777,6 +801,16 @@ struct dsa_switch_ops {
 				      struct netdev_lag_upper_info *info);
 	int	(*crosschip_lag_leave)(struct dsa_switch *ds, int sw_index,
 				       int port, struct net_device *lag);
+	int	(*crosschip_bridge_fwd_offload_add)(struct dsa_switch *ds,
+						    int tree_index,
+						    int sw_index, int port,
+						    struct net_device *br,
+						    int bridge_num);
+	void	(*crosschip_bridge_fwd_offload_del)(struct dsa_switch *ds,
+						    int tree_index,
+						    int sw_index, int port,
+						    struct net_device *br,
+						    int bridge_num);
 
 	/*
 	 * PTP functionality
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index f201c33980bf..c577338b5bb7 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -14,10 +14,14 @@
 #include <net/dsa.h>
 #include <net/gro_cells.h>
 
+#define DSA_MAX_NUM_OFFLOADING_BRIDGES		BITS_PER_LONG
+
 enum {
 	DSA_NOTIFIER_AGEING_TIME,
 	DSA_NOTIFIER_BRIDGE_JOIN,
 	DSA_NOTIFIER_BRIDGE_LEAVE,
+	DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_ADD,
+	DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_DEL,
 	DSA_NOTIFIER_FDB_ADD,
 	DSA_NOTIFIER_FDB_DEL,
 	DSA_NOTIFIER_HOST_FDB_ADD,
@@ -54,6 +58,15 @@ struct dsa_notifier_bridge_info {
 	int port;
 };
 
+/* DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_* */
+struct dsa_notifier_bridge_fwd_offload_info {
+	struct net_device *br;
+	int tree_index;
+	int sw_index;
+	int port;
+	int bridge_num;
+};
+
 /* DSA_NOTIFIER_FDB_* */
 struct dsa_notifier_fdb_info {
 	int sw_index;
@@ -197,6 +210,10 @@ int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
 int dsa_port_pre_bridge_leave(struct dsa_port *dp, struct net_device *br,
 			      struct netlink_ext_ack *extack);
 void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br);
+int dsa_port_bridge_fwd_offload_add(struct dsa_port *dp,
+				    struct net_device *br, int bridge_num);
+void dsa_port_bridge_fwd_offload_del(struct dsa_port *dp,
+				     struct net_device *br, int bridge_num);
 int dsa_port_lag_change(struct dsa_port *dp,
 			struct netdev_lag_lower_state_info *linfo);
 int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev,
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 28b45b7e66df..3c268d00908c 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -344,6 +344,41 @@ void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br)
 	dsa_port_switchdev_unsync_attrs(dp);
 }
 
+int dsa_port_bridge_fwd_offload_add(struct dsa_port *dp,
+				    struct net_device *br, int bridge_num)
+{
+	struct dsa_notifier_bridge_fwd_offload_info info = {
+		.tree_index = dp->ds->dst->index,
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.br = br,
+		.bridge_num = bridge_num,
+	};
+
+	return dsa_port_notify(dp, DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_ADD,
+			       &info);
+}
+
+void dsa_port_bridge_fwd_offload_del(struct dsa_port *dp,
+				     struct net_device *br, int bridge_num)
+{
+	struct dsa_notifier_bridge_fwd_offload_info info = {
+		.tree_index = dp->ds->dst->index,
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.br = br,
+		.bridge_num = bridge_num,
+	};
+	struct net_device *dev = dp->slave;
+	int err;
+
+	err = dsa_port_notify(dp, DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_DEL,
+			      &info);
+	if (err)
+		netdev_err(dev, "failed to notify fwd offload del: %pe\n",
+			   ERR_PTR(err));
+}
+
 int dsa_port_lag_change(struct dsa_port *dp,
 			struct netdev_lag_lower_state_info *linfo)
 {
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index ffbba1e71551..003f3bb9c51a 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1679,6 +1679,119 @@ static int dsa_slave_fill_forward_path(struct net_device_path_ctx *ctx,
 	return 0;
 }
 
+/* Direct packets coming from the data plane of the bridge to a dedicated TX
+ * queue, and let the generic netdev_pick_tx() handle the rest via hashing
+ * among TX queues of the same priority.
+ */
+static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb,
+				  struct net_device *sb_dev)
+{
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+
+	if (unlikely(sb_dev))
+		return ds->num_tx_queues;
+
+	return netdev_pick_tx(dev, skb, sb_dev);
+}
+
+static struct dsa_bridge_fwd_accel_priv *
+dsa_find_accel_priv_by_sb_dev(struct dsa_switch_tree *dst,
+			      struct net_device *sb_dev)
+{
+	struct dsa_port *dp;
+
+	list_for_each_entry(dp, &dst->ports, list)
+		if (dp->accel_priv && dp->accel_priv->sb_dev == sb_dev)
+			return dp->accel_priv;
+
+	return NULL;
+}
+
+static void dsa_slave_fwd_offload_del(struct net_device *dev, void *sb_dev)
+{
+	struct dsa_bridge_fwd_accel_priv *accel_priv;
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+	struct dsa_switch_tree *dst;
+	int bridge_num;
+
+	if (!netif_is_bridge_master(sb_dev))
+		return;
+
+	dst = ds->dst;
+
+	accel_priv = dp->accel_priv;
+	bridge_num = accel_priv->bridge_num;
+
+	dp->accel_priv = NULL;
+
+	/* accel_priv no longer in use, time to clean it up */
+	if (!dsa_find_accel_priv_by_sb_dev(dst, sb_dev)) {
+		clear_bit(accel_priv->bridge_num, &dst->fwd_offloading_bridges);
+		kfree(accel_priv);
+	}
+
+	netdev_unbind_tx_queues_from_sb_dev(dev, sb_dev);
+
+	/* Notify the chips only once the offload has been deactivated, so
+	 * that they can update their configuration accordingly.
+	 */
+	dsa_port_bridge_fwd_offload_del(dp, sb_dev, bridge_num);
+}
+
+static void *dsa_slave_fwd_offload_add(struct net_device *dev,
+				       struct net_device *sb_dev)
+{
+	struct dsa_bridge_fwd_accel_priv *accel_priv;
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+	struct dsa_switch_tree *dst;
+	int err;
+
+	if (!netif_is_bridge_master(sb_dev))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	dst = ds->dst;
+
+	accel_priv = dsa_find_accel_priv_by_sb_dev(dst, sb_dev);
+	if (!accel_priv) {
+		/* First port that offloads forwarding for this bridge */
+		int bridge_num;
+
+		bridge_num = find_first_zero_bit(&dst->fwd_offloading_bridges,
+						 DSA_MAX_NUM_OFFLOADING_BRIDGES);
+		if (bridge_num >= ds->num_fwd_offloading_bridges)
+			return ERR_PTR(-EOPNOTSUPP);
+
+		accel_priv = kzalloc(sizeof(*accel_priv), GFP_KERNEL);
+		if (!accel_priv)
+			return ERR_PTR(-ENOMEM);
+
+		accel_priv->sb_dev = sb_dev;
+		accel_priv->bridge_num = bridge_num;
+
+		set_bit(bridge_num, &dst->fwd_offloading_bridges);
+	}
+
+	dp->accel_priv = accel_priv;
+
+	/* There can be only one master upper interface for each port in the
+	 * case of bridge forwarding offload, so just bind a single TX queue to
+	 * that subordinate device, the last one.
+	 */
+	netdev_bind_tx_queues_to_sb_dev(dev, sb_dev, 1, ds->num_tx_queues);
+
+	err = dsa_port_bridge_fwd_offload_add(dp, sb_dev,
+					      accel_priv->bridge_num);
+	if (err) {
+		dsa_slave_fwd_offload_del(dev, sb_dev);
+		return ERR_PTR(err);
+	}
+
+	return accel_priv;
+}
+
 static const struct net_device_ops dsa_slave_netdev_ops = {
 	.ndo_open	 	= dsa_slave_open,
 	.ndo_stop		= dsa_slave_close,
@@ -1703,6 +1816,9 @@ static const struct net_device_ops dsa_slave_netdev_ops = {
 	.ndo_get_devlink_port	= dsa_slave_get_devlink_port,
 	.ndo_change_mtu		= dsa_slave_change_mtu,
 	.ndo_fill_forward_path	= dsa_slave_fill_forward_path,
+	.ndo_dfwd_add_station	= dsa_slave_fwd_offload_add,
+	.ndo_dfwd_del_station	= dsa_slave_fwd_offload_del,
+	.ndo_select_queue	= dsa_slave_select_queue,
 };
 
 static struct device_type dsa_type = {
@@ -1819,6 +1935,11 @@ void dsa_slave_setup_tagger(struct net_device *slave)
 	slave->needed_tailroom += master->needed_tailroom;
 
 	p->xmit = cpu_dp->tag_ops->xmit;
+
+	if (cpu_dp->tag_ops->bridge_fwd_offload)
+		slave->features |= NETIF_F_HW_L2FW_DOFFLOAD;
+	else
+		slave->features &= ~NETIF_F_HW_L2FW_DOFFLOAD;
 }
 
 static struct lock_class_key dsa_slave_netdev_xmit_lock_key;
@@ -1877,10 +1998,21 @@ int dsa_slave_create(struct dsa_port *port)
 
 	slave_dev = alloc_netdev_mqs(sizeof(struct dsa_slave_priv), name,
 				     NET_NAME_UNKNOWN, ether_setup,
-				     ds->num_tx_queues, 1);
+				     ds->num_tx_queues + 1, 1);
 	if (slave_dev == NULL)
 		return -ENOMEM;
 
+	/* To avoid changing the number of TX queues at runtime depending on
+	 * whether the tagging protocol in use supports bridge forwarding
+	 * offload or not, just assume that all tagging protocols do, and
+	 * unconditionally register one extra TX queue to back that offload.
+	 * Then set num_real_tx_queues such that it will never be selected by
+	 * netdev_pick_tx(), just by ourselves.
+	 */
+	ret = netif_set_real_num_tx_queues(slave_dev, ds->num_tx_queues);
+	if (ret)
+		goto out_free;
+
 	slave_dev->features = master->vlan_features | NETIF_F_HW_TC;
 	if (ds->ops->port_vlan_add && ds->ops->port_vlan_del)
 		slave_dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index 248455145982..f0033906f36b 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -154,6 +154,58 @@ static int dsa_switch_bridge_leave(struct dsa_switch *ds,
 	return 0;
 }
 
+static int
+dsa_switch_bridge_fwd_offload_add(struct dsa_switch *ds,
+				  struct dsa_notifier_bridge_fwd_offload_info *info)
+{
+	struct dsa_switch_tree *dst = ds->dst;
+	int tree_index = info->tree_index;
+	int bridge_num = info->bridge_num;
+	struct net_device *br = info->br;
+	int sw_index = info->sw_index;
+	int port = info->port;
+
+	if (dst->index == tree_index && ds->index == sw_index &&
+	    ds->ops->port_bridge_fwd_offload_add)
+		return ds->ops->port_bridge_fwd_offload_add(ds, port, br,
+							    bridge_num);
+
+	if ((dst->index != tree_index || ds->index != sw_index) &&
+	    ds->ops->crosschip_bridge_fwd_offload_add)
+		return ds->ops->crosschip_bridge_fwd_offload_add(ds,
+								 tree_index,
+								 sw_index,
+								 port, br,
+								 bridge_num);
+
+	return -EOPNOTSUPP;
+}
+
+static int
+dsa_switch_bridge_fwd_offload_del(struct dsa_switch *ds,
+				  struct dsa_notifier_bridge_fwd_offload_info *info)
+{
+	struct dsa_switch_tree *dst = ds->dst;
+	int tree_index = info->tree_index;
+	int bridge_num = info->bridge_num;
+	struct net_device *br = info->br;
+	int sw_index = info->sw_index;
+	int port = info->port;
+
+	if (dst->index == tree_index && ds->index == sw_index &&
+	    ds->ops->port_bridge_fwd_offload_del)
+		ds->ops->port_bridge_fwd_offload_del(ds, port, br,
+						     bridge_num);
+
+	if ((dst->index != tree_index || ds->index != sw_index) &&
+	    ds->ops->crosschip_bridge_fwd_offload_del)
+		ds->ops->crosschip_bridge_fwd_offload_del(ds, tree_index,
+							  sw_index, port, br,
+							  bridge_num);
+
+	return 0;
+}
+
 /* Matches for all upstream-facing ports (the CPU port and all upstream-facing
  * DSA links) that sit between the targeted port on which the notifier was
  * emitted and its dedicated CPU port.
@@ -663,6 +715,12 @@ static int dsa_switch_event(struct notifier_block *nb,
 	case DSA_NOTIFIER_BRIDGE_LEAVE:
 		err = dsa_switch_bridge_leave(ds, info);
 		break;
+	case DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_ADD:
+		err = dsa_switch_bridge_fwd_offload_add(ds, info);
+		break;
+	case DSA_NOTIFIER_BRIDGE_FWD_OFFLOAD_DEL:
+		err = dsa_switch_bridge_fwd_offload_del(ds, info);
+		break;
 	case DSA_NOTIFIER_FDB_ADD:
 		err = dsa_switch_fdb_add(ds, info);
 		break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 net-next 09/10] net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in the PVT
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-03 11:57   ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:57 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

The mv88e6xxx switches have the ability to receive FORWARD (data plane)
frames from the CPU port and route them according to the FDB. We can use
this to offload the forwarding process of packets sent by the software
bridge.

Because DSA supports bridge domain isolation between user ports, just
sending FORWARD frames is not enough, as they might leak outside the
intended broadcast domain of the bridge on behalf of which the packets
are sent.

It should be noted that FORWARD frames are also (and typically) used to
forward data plane packets on DSA links in cross-chip topologies. The
FORWARD frame header contains the source port and switch ID, and
switches receiving this frame header forward the packet according to
their cross-chip port-based VLAN table (PVT).

To address the bridging domain isolation in the context of offloading
the forwarding on TX, the idea is that we can reuse the parts of the PVT
that don't have any physical switch mapped to them, one entry for each
software bridge. The switches will therefore think that behind their
upstream port lie many switches, all in fact backed up by software
bridges through tag_dsa.c, which constructs FORWARD packets with the
right switch ID corresponding to each bridge.

The mapping we use is absolutely trivial: DSA gives us a unique bridge
number, and we add the number of the physical switches in the DSA switch
tree to that, to obtain a unique virtual bridge device number to use in
the PVT.
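
As a rough illustration of the numbering described above (not part of
the patch), the sketch below computes the virtual device number for a
bridge and the number of bridges that fit in the remaining PVT space.
The helper names are hypothetical, and MAX_PVT_SWITCHES = 32 is an
assumption about the value of MV88E6XXX_MAX_PVT_SWITCHES; last_switch
is taken to be the highest physical switch index in the tree, as in the
diff below.

```c
#include <assert.h>

/* Assumption: mirrors MV88E6XXX_MAX_PVT_SWITCHES; check against the driver. */
#define MAX_PVT_SWITCHES 32

/* Hypothetical helper: device number used in the PVT (and in the FORWARD
 * tag) for the software bridge with the given bridge_num. Physical
 * switches occupy indices 0..last_switch, so virtual bridges start right
 * after them.
 */
static int virtual_bridge_dev(int last_switch, int bridge_num)
{
	int dev = last_switch + 1 + bridge_num;

	/* The result must still be a valid PVT device index. */
	assert(dev < MAX_PVT_SWITCHES);
	return dev;
}

/* Hypothetical helper: how many bridges can be offloaded in this tree. */
static int max_offloaded_bridges(int last_switch)
{
	return MAX_PVT_SWITCHES - last_switch - 1;
}

/* Hypothetical helper: does a PVT device index name a virtual bridge? */
static int dev_is_virtual_bridge(int last_switch, int dev)
{
	return dev > last_switch;
}
```

Under these assumptions, a tree with three physical switches
(last_switch = 2) would map bridge 0 to device 3 and leave room for 29
offloaded bridges.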

Co-developed-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 106 +++++++++++++++++++++++++++++--
 1 file changed, 102 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index beb41572d04e..6b9c1a77d874 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1221,14 +1221,38 @@ static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
 	bool found = false;
 	u16 pvlan;
 
-	list_for_each_entry(dp, &dst->ports, list) {
-		if (dp->ds->index == dev && dp->index == port) {
+	/* dev is a physical switch */
+	if (dev <= dst->last_switch) {
+		list_for_each_entry(dp, &dst->ports, list) {
+			if (dp->ds->index == dev && dp->index == port) {
+				/* dp might be a DSA link or a user port, so it
+				 * might or might not have a bridge_dev
+				 * pointer. Use the "found" variable for both
+				 * cases.
+				 */
+				br = dp->bridge_dev;
+				found = true;
+				break;
+			}
+		}
+	/* dev is a virtual bridge */
+	} else {
+		list_for_each_entry(dp, &dst->ports, list) {
+			struct dsa_bridge_fwd_accel_priv *accel_priv = dp->accel_priv;
+
+			if (!accel_priv)
+				continue;
+
+			if (accel_priv->bridge_num + 1 + dst->last_switch != dev)
+				continue;
+
+			br = accel_priv->sb_dev;
 			found = true;
 			break;
 		}
 	}
 
-	/* Prevent frames from unknown switch or port */
+	/* Prevent frames from unknown switch or virtual bridge */
 	if (!found)
 		return 0;
 
@@ -1236,7 +1260,6 @@ static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
 	if (dp->type == DSA_PORT_TYPE_CPU || dp->type == DSA_PORT_TYPE_DSA)
 		return mv88e6xxx_port_mask(chip);
 
-	br = dp->bridge_dev;
 	pvlan = 0;
 
 	/* Frames from user ports can egress any local DSA links and CPU ports,
@@ -2422,6 +2445,68 @@ static void mv88e6xxx_crosschip_bridge_leave(struct dsa_switch *ds,
 	mv88e6xxx_reg_unlock(chip);
 }
 
+/* Treat the software bridge as a virtual single-port switch behind the
+ * CPU and map it in the PVT. The first dst->last_switch + 1 entries are
+ * taken by physical switches, so start from beyond that range.
+ */
+static int mv88e6xxx_map_virtual_bridge_to_pvt(struct dsa_switch *ds,
+					       int bridge_num)
+{
+	u8 dev = bridge_num + ds->dst->last_switch + 1;
+	struct mv88e6xxx_chip *chip = ds->priv;
+	int err;
+
+	mv88e6xxx_reg_lock(chip);
+	err = mv88e6xxx_pvt_map(chip, dev, 0);
+	mv88e6xxx_reg_unlock(chip);
+
+	return err;
+}
+
+static int mv88e6xxx_bridge_fwd_offload_add(struct dsa_switch *ds, int port,
+					    struct net_device *br,
+					    int bridge_num)
+{
+	return mv88e6xxx_map_virtual_bridge_to_pvt(ds, bridge_num);
+}
+
+static void mv88e6xxx_bridge_fwd_offload_del(struct dsa_switch *ds, int port,
+					     struct net_device *br,
+					     int bridge_num)
+{
+	int err;
+
+	err = mv88e6xxx_map_virtual_bridge_to_pvt(ds, bridge_num);
+	if (err) {
+		dev_err(ds->dev, "failed to remap cross-chip Port VLAN: %pe\n",
+			ERR_PTR(err));
+	}
+}
+
+static int
+mv88e6xxx_crosschip_bridge_fwd_offload_add(struct dsa_switch *ds,
+					   int tree_index, int sw_index,
+					   int port, struct net_device *br,
+					   int bridge_num)
+{
+	return mv88e6xxx_map_virtual_bridge_to_pvt(ds, bridge_num);
+}
+
+static void
+mv88e6xxx_crosschip_bridge_fwd_offload_del(struct dsa_switch *ds,
+					   int tree_index, int sw_index,
+					   int port, struct net_device *br,
+					   int bridge_num)
+{
+	int err;
+
+	err = mv88e6xxx_map_virtual_bridge_to_pvt(ds, bridge_num);
+	if (err) {
+		dev_err(ds->dev, "failed to remap cross-chip Port VLAN: %pe\n",
+			ERR_PTR(err));
+	}
+}
+
 static int mv88e6xxx_software_reset(struct mv88e6xxx_chip *chip)
 {
 	if (chip->info->ops->reset)
@@ -3025,6 +3110,15 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
 	chip->ds = ds;
 	ds->slave_mii_bus = mv88e6xxx_default_mdio_bus(chip);
 
+	/* Since virtual bridges are mapped in the PVT, the number we support
+	 * depends on the physical switch topology. We need to let DSA figure
+	 * that out and therefore we cannot set this at dsa_register_switch()
+	 * time.
+	 */
+	if (mv88e6xxx_has_pvt(chip))
+		ds->num_fwd_offloading_bridges = MV88E6XXX_MAX_PVT_SWITCHES -
+						 ds->dst->last_switch - 1;
+
 	mv88e6xxx_reg_lock(chip);
 
 	if (chip->info->ops->setup_errata) {
@@ -6128,6 +6222,10 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
 	.crosschip_lag_change	= mv88e6xxx_crosschip_lag_change,
 	.crosschip_lag_join	= mv88e6xxx_crosschip_lag_join,
 	.crosschip_lag_leave	= mv88e6xxx_crosschip_lag_leave,
+	.port_bridge_fwd_offload_add = mv88e6xxx_bridge_fwd_offload_add,
+	.port_bridge_fwd_offload_del = mv88e6xxx_bridge_fwd_offload_del,
+	.crosschip_bridge_fwd_offload_add = mv88e6xxx_crosschip_bridge_fwd_offload_add,
+	.crosschip_bridge_fwd_offload_del = mv88e6xxx_crosschip_bridge_fwd_offload_del,
 };
 
 static int mv88e6xxx_register_switch(struct mv88e6xxx_chip *chip)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 net-next 10/10] net: dsa: tag_dsa: offload the bridge forwarding process
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-03 11:57   ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-03 11:57 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

From: Tobias Waldekranz <tobias@waldekranz.com>

Allow the DSA tagger to generate FORWARD frames for offloaded skbs
sent from a bridge that we offload, allowing the switch to handle any
frame replication that may be required. This also means that source
address learning takes place on packets sent from the CPU, meaning
that return traffic no longer needs to be flooded as unknown unicast.
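
To make the tag layout concrete, here is a small standalone sketch (not
part of the patch) of how the untagged 4-byte DSA header is filled in
for the two cases handled by dsa_xmit_ll() in the diff below. The
sketch_* names are hypothetical, and the numeric command values
(FROM_CPU = 1, FORWARD = 3) are an assumption about enum dsa_cmd in
tag_dsa.c that should be checked against the tree.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: same numeric values as enum dsa_cmd in net/dsa/tag_dsa.c. */
enum sketch_dsa_cmd {
	SKETCH_CMD_FROM_CPU = 1,
	SKETCH_CMD_FORWARD = 3,
};

/* Hypothetical helper mirroring the untagged branch of dsa_xmit_ll():
 * the command goes in the top two bits of byte 0, the device in its low
 * five bits, the port in the top five bits of byte 1, and the VID in the
 * low 12 bits of bytes 2-3.
 */
static void sketch_build_untagged_dsa_tag(uint8_t hdr[4],
					  enum sketch_dsa_cmd cmd,
					  uint8_t tag_dev, uint8_t tag_port,
					  uint16_t pvid)
{
	/* Device and port fields are five bits wide in this layout. */
	assert(tag_dev < 32);
	assert(tag_port < 32);

	hdr[0] = (uint8_t)((cmd << 6) | tag_dev);
	hdr[1] = (uint8_t)(tag_port << 3);
	hdr[2] = (uint8_t)(pvid >> 8);
	hdr[3] = (uint8_t)(pvid & 0xff);
}
```

With these assumptions, a FORWARD frame injected on behalf of virtual
device 3, port 0, pvid 1 would start with the bytes c3 00 00 01, while a
FROM_CPU frame aimed at switch 1, port 2 would start with 41 10 00 00.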

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/dsa_priv.h | 11 +++++++++
 net/dsa/tag_dsa.c  | 60 +++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 63 insertions(+), 8 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index c577338b5bb7..c070157cd967 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -389,6 +389,17 @@ static inline struct sk_buff *dsa_untag_bridge_pvid(struct sk_buff *skb)
 	return skb;
 }
 
+static inline struct net_device *
+dsa_slave_get_sb_dev(const struct net_device *dev, struct sk_buff *skb)
+{
+	u16 queue_mapping = skb_get_queue_mapping(skb);
+	struct netdev_queue *txq;
+
+	txq = netdev_get_tx_queue(dev, queue_mapping);
+
+	return txq->sb_dev;
+}
+
 /* switch.c */
 int dsa_switch_register_notifier(struct dsa_switch *ds);
 void dsa_switch_unregister_notifier(struct dsa_switch *ds);
diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
index a822355afc90..9151ed141b3e 100644
--- a/net/dsa/tag_dsa.c
+++ b/net/dsa/tag_dsa.c
@@ -125,8 +125,49 @@ enum dsa_code {
 static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
 				   u8 extra)
 {
+	struct net_device *sb_dev = dsa_slave_get_sb_dev(dev, skb);
 	struct dsa_port *dp = dsa_slave_to_port(dev);
+	u8 tag_dev, tag_port;
+	enum dsa_cmd cmd;
 	u8 *dsa_header;
+	u16 pvid = 0;
+	int err;
+
+	if (sb_dev) {
+		/* Don't bother finding the accel_priv corresponding to this
+		 * subordinate device, we know it's the bridge because we can't
+		 * offload anything else, so just search for it under the port,
+		 * we know it's the same.
+		 */
+		struct dsa_bridge_fwd_accel_priv *accel_priv = dp->accel_priv;
+		struct dsa_switch_tree *dst = dp->ds->dst;
+
+		cmd = DSA_CMD_FORWARD;
+
+		/* When offloading forwarding for a bridge, inject FORWARD
+		 * packets on behalf of a virtual switch device with an index
+		 * past the physical switches.
+		 */
+		tag_dev = dst->last_switch + 1 + accel_priv->bridge_num;
+		tag_port = 0;
+
+		/* If we are offloading forwarding for a VLAN-unaware bridge,
+		 * inject packets to hardware using the bridge's pvid, since
+		 * that's where the packets ingressed from.
+		 */
+		if (!br_vlan_enabled(sb_dev)) {
+			/* Safe because __dev_queue_xmit() runs under
+			 * rcu_read_lock_bh()
+			 */
+			err = br_vlan_get_pvid_rcu(sb_dev, &pvid);
+			if (err)
+				return NULL;
+		}
+	} else {
+		cmd = DSA_CMD_FROM_CPU;
+		tag_dev = dp->ds->index;
+		tag_port = dp->index;
+	}
 
 	if (skb->protocol == htons(ETH_P_8021Q)) {
 		if (extra) {
@@ -134,10 +175,10 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
 			memmove(skb->data, skb->data + extra, 2 * ETH_ALEN);
 		}
 
-		/* Construct tagged FROM_CPU DSA tag from 802.1Q tag. */
+		/* Construct tagged DSA tag from 802.1Q tag. */
 		dsa_header = skb->data + 2 * ETH_ALEN + extra;
-		dsa_header[0] = (DSA_CMD_FROM_CPU << 6) | 0x20 | dp->ds->index;
-		dsa_header[1] = dp->index << 3;
+		dsa_header[0] = (cmd << 6) | 0x20 | tag_dev;
+		dsa_header[1] = tag_port << 3;
 
 		/* Move CFI field from byte 2 to byte 1. */
 		if (dsa_header[2] & 0x10) {
@@ -148,12 +189,13 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
 		skb_push(skb, DSA_HLEN + extra);
 		memmove(skb->data, skb->data + DSA_HLEN + extra, 2 * ETH_ALEN);
 
-		/* Construct untagged FROM_CPU DSA tag. */
+		/* Construct untagged DSA tag. */
 		dsa_header = skb->data + 2 * ETH_ALEN + extra;
-		dsa_header[0] = (DSA_CMD_FROM_CPU << 6) | dp->ds->index;
-		dsa_header[1] = dp->index << 3;
-		dsa_header[2] = 0x00;
-		dsa_header[3] = 0x00;
+
+		dsa_header[0] = (cmd << 6) | tag_dev;
+		dsa_header[1] = tag_port << 3;
+		dsa_header[2] = pvid >> 8;
+		dsa_header[3] = pvid & 0xff;
 	}
 
 	return skb;
@@ -304,6 +346,7 @@ static const struct dsa_device_ops dsa_netdev_ops = {
 	.xmit	  = dsa_xmit,
 	.rcv	  = dsa_rcv,
 	.needed_headroom = DSA_HLEN,
+	.bridge_fwd_offload = true,
 };
 
 DSA_TAG_DRIVER(dsa_netdev_ops);
@@ -347,6 +390,7 @@ static const struct dsa_device_ops edsa_netdev_ops = {
 	.xmit	  = edsa_xmit,
 	.rcv	  = edsa_rcv,
 	.needed_headroom = EDSA_HLEN,
+	.bridge_fwd_offload = true,
 };
 
 DSA_TAG_DRIVER(edsa_netdev_ops);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-03 22:04   ` Tobias Waldekranz
  -1 siblings, 0 replies; 44+ messages in thread
From: Tobias Waldekranz @ 2021-07-03 22:04 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Roopa Prabhu, Nikolay Aleksandrov,
	Stephen Hemminger, bridge, Alexander Duyck

On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> For this series I have taken Tobias' work from here:
> https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> and made the following changes:
> - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
>   feedback on the bridge driver changes. Otherwise, the structure of the
>   bridge changes is pretty much the same as Tobias left it.
> - I basically rewrote the DSA infrastructure for the data plane
>   forwarding offload, based on the commonalities with another switch
>   driver for which I implemented this feature (not submitted here)
> - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
>   works but I didn't test that

Hi Vladimir,

Sorry that I have dropped the ball on this series. I have actually had a
v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx
specific problems. (See below)

> The data plane of the software bridge can be partially offloaded to
> switchdev, in the sense that we can trust the accelerator to:
> (a) look up its FDB (which is more or less in sync with the software
>     bridge FDB) for selecting the destination ports for a packet
> (b) replicate the frame in hardware in case it's a multicast/broadcast,
>     instead of the software bridge having to clone it and send the
>     clones to each net device one at a time. This reduces the bandwidth
>     needed between the CPU and the accelerator, as well as the CPU time
>     spent.
>
> The data path forwarding offload is managed per "hardware domain" - a
> generalization of the "offload_fwd_mark" concept which is being
> introduced in this series. Every packet is delivered only once to each
> hardware domain.
>
> In addition, Tobias said in the original cover letter:
>
> ====================
> ## Overview
>
>    vlan1   vlan2
>        \   /
>    .-----------.
>    |    br0    |
>    '-----------'
>    /   /   \   \
> swp0 swp1 swp2 eth0
>   :   :   :
>   (hwdom 1)
>
> Up to this point, switchdevs have been trusted with offloading
> forwarding between bridge ports, e.g. forwarding a unicast from swp0
> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> series extends forward offloading to include some new classes of
> traffic:
>
> - Locally originating flows, i.e. packets that ingress on br0 that are
>   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>   this also includes routed flows, e.g. a packet ingressing swp0 on
>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>   forwarded to swp1 is "locally originating" from br0's point of view.
>
> - Flows originating from "foreign" interfaces, i.e. an interface that
>   is not offloaded by a particular switchdev instance. This includes
>   ports belonging to other switchdev instances. A typical example
>   would be flows from eth0 towards swp{0,1,2}.
>
> The bridge still looks up its FDB/MDB as usual and then notifies the
> switchdev driver that a particular skb should be offloaded if it
> matches one of the classes above. It does so by using the _accel
> version of dev_queue_xmit, supplying its own netdev as the
> "subordinate" device. The driver can react to the presence of the
> subordinate in its .ndo_select_queue in whatever way it needs to make
> sure to forward the skb in much the same way that it would for packets
> ingressing on regular ports.
>
> Hardware domains to which a particular skb has been forwarded are
> recorded so that duplicates are avoided.
>
> The main performance benefit is thus seen on multicast flows. Imagine
> for example that:
>
> - An IP camera is connected to swp0 (VLAN 1)
>
> - The CPU is acting as a multicast router, routing the group from VLAN
>   1 to VLAN 2.
>
> - There are subscribers for the group in question behind both swp1 and
>   swp2 (VLAN 2).
>
> With this offloading in place, the bridge need only send a single skb
> to the driver, which will send it to the hardware marked in such a way
> that the switch will perform the multicast replication according to
> the MDB configuration. Naturally, the number of saved skb_clones
> increases linearly with the number of subscribed ports.
>
> As an extra benefit, on mv88e6xxx, this also allows the switch to
> perform source address learning on these flows, which avoids having to
> sync dynamic FDB entries over slow configuration interfaces like MDIO
> to avoid flows directed towards the CPU being flooded as unknown
> unicast by the switch.
>
>
> ## RFC
>
> - In general, what do you think about this idea?
>
> - hwdom. What do you think about this terminology? Personally I feel
>   that we had too many things called offload_fwd_mark, and that as the
>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>   might be useful to have a separate term for it.
>
> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>   and if so do you have any suggestion/preference on how to signal the
>   offloading from the bridge down to the switchdev driver?
>
> - The way that flooding is implemented in br_forward.c (lazily cloning
>   skbs) means that you have to mark the forwarding as completed very
>   early (right after should_deliver in maybe_deliver) in order to
>   avoid duplicates. Is there some way to move this decision point to a
>   later stage that I am missing?
>
> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>   compatible with multicast-to-unicast being used on a port. Then
>   again, I think that this would also be broken for regular switchdev
>   bridge offloading as this flag is not offloaded to the switchdev
>   port, so there is no way for the driver to refuse it. Any ideas on
>   how to handle this?
>
>
> ## mv88e6xxx Specifics
>
> Since we are now only receiving a single skb for both unicast and
> multicast flows, we can tag the packets with the FORWARD command
> instead of FROM_CPU. The switch(es) will then forward the packet in
> accordance with its ATU, VTU, STU, and PVT configuration - just like
> for packets ingressing on user ports.
>
> Crucially, FROM_CPU is still used for:
>
> - Ports in standalone mode.
>
> - Flows that are trapped to the CPU and software-forwarded by a
>   bridge. Note that these flows match neither of the classes discussed
>   in the overview.
>
> - Packets that are sent directly to a port netdev without going
>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>   socket.
>
> We thus have a pretty clean separation where the data plane uses
> FORWARDs and the control plane uses TO_/FROM_CPU.
>
> The barrier between different bridges is enforced by port based VLANs
> on mv88e6xxx, which in essence is a mapping from a source device/port
> pair to an allowed set of egress ports.

Unless I am missing something, it turns out that the PVT is not enough
to support multiple (non-VLAN filtering) bridges in multi-chip
setups. While the isolation barrier works, there is no way of correctly
managing automatic learning.

> In order to have a FORWARD
> frame (which carries a _source_ device/port) correctly mapped by the
> PVT, we must use a unique pair for each bridge.
>
> Fortunately, there is typically lots of unused address space in most
> switch trees. When was the last time you saw an mv88e6xxx product
> using more than 4 chips? Even if you found one with 16 (!) devices,
> you would still have room to allocate 16*16 virtual ports to software
> bridges.
>
> Therefore, the mv88e6xxx driver will allocate a virtual device/port
> pair to each bridge that it offloads. All members of the same bridge
> are then configured to allow packets from this virtual port in their
> PVTs.

So while this solution is cute, it does not work in this example:

 CPU
  | .-----.
.-0-1-. .-0-1-.
| sw0 | | sw1 |
'-2-3-' '-2-3-'

- [sw0p2, sw1p2] are attached to one bridge
- [sw0p3, sw1p3] are attached to another bridge
- Neither bridge uses VLAN filtering

Since no VLAN information is available in the frames, the source
addresses of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot
possibly be separated into different FIDs. They will all be placed in
the respective port's default FID. Thus, the two bridges are not
isolated with respect to their FDBs.
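
To make the collision concrete, here is a toy model in C. It is only a
sketch: it is not the mv88e6xxx ATU, and every name in it (struct fdb,
fdb_learn, fdb_lookup) is made up for illustration. Entries are keyed on
(FID, MAC), so the same MAC address learned from two bridges overwrites
itself when both bridges land in one FID, and stays separate when each
bridge has its own FID.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6
#define FDB_SIZE 16

/* Hypothetical FDB entry: the lookup key is (fid, mac). */
struct fdb_entry {
	int used;
	uint16_t fid;
	uint8_t mac[ETH_ALEN];
	int port;
};

struct fdb {
	struct fdb_entry entries[FDB_SIZE];
};

/* Learn (fid, mac) -> port. An existing entry with the same key is
 * overwritten, which is how a "station move" looks to the hardware.
 * Returns 0 on success, -1 if the table is full.
 */
static int fdb_learn(struct fdb *fdb, uint16_t fid,
		     const uint8_t mac[ETH_ALEN], int port)
{
	struct fdb_entry *free_slot = NULL;
	int i;

	assert(port >= 0);

	for (i = 0; i < FDB_SIZE; i++) {
		struct fdb_entry *e = &fdb->entries[i];

		if (!e->used) {
			if (!free_slot)
				free_slot = e;
			continue;
		}
		if (e->fid == fid && !memcmp(e->mac, mac, ETH_ALEN)) {
			e->port = port;
			return 0;
		}
	}

	if (!free_slot)
		return -1;

	free_slot->used = 1;
	free_slot->fid = fid;
	memcpy(free_slot->mac, mac, ETH_ALEN);
	free_slot->port = port;
	return 0;
}

/* Returns the port for (fid, mac), or -1 if there is no entry. */
static int fdb_lookup(const struct fdb *fdb, uint16_t fid,
		      const uint8_t mac[ETH_ALEN])
{
	int i;

	for (i = 0; i < FDB_SIZE; i++) {
		const struct fdb_entry *e = &fdb->entries[i];

		if (e->used && e->fid == fid &&
		    !memcmp(e->mac, mac, ETH_ALEN))
			return e->port;
	}
	return -1;
}
```

Learning the same MAC on port 2 and then on port 3 with FID 0 leaves a
single entry pointing at port 3. Doing the same with FID 1 and FID 2
leaves two independent entries, which is the isolation the two bridges
would need.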

My current plan is therefore to start by reworking how bridges are
isolated on mv88e6xxx, roughly by allocating a reserved VID/FID pair
for each non-filtering bridge. Two of these can be easily managed since
both VID 0 and 4095 are illegal on the wire but allowed in the VTU.
After that it gets tricky. The best scheme I have come up with is to
just grab an unused VID when adding any subsequent non-filtering bridge;
if that VID is later requested by a filtering bridge or a VLAN upper,
you move the non-filtering bridge to another currently unused VID.
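
A minimal sketch of that allocation scheme, in plain C rather than
driver code. All names here (struct vid_table, vid_alloc_for_bridge,
vid_claim_for_user) are hypothetical, and searching downwards from VID
4094 for the fallback is just an assumption to keep the example short:

```c
#include <assert.h>
#include <stdint.h>

#define VID_COUNT 4096
#define VID_NONE  (-1)

enum vid_owner { VID_FREE = 0, VID_USER, VID_BRIDGE };

struct vid_table {
	uint8_t owner[VID_COUNT];	/* enum vid_owner */
	int bridge[VID_COUNT];		/* valid when owner == VID_BRIDGE */
};

/* Take @vid for @bridge_id if it is free. Returns 1 if taken. */
static int vid_take(struct vid_table *t, int vid, int bridge_id)
{
	if (t->owner[vid] != VID_FREE)
		return 0;
	t->owner[vid] = VID_BRIDGE;
	t->bridge[vid] = bridge_id;
	return 1;
}

/* Reserve a VID for a non-filtering bridge: VID 0 and 4095 first
 * (never seen on the wire), then any currently unused VID.
 */
static int vid_alloc_for_bridge(struct vid_table *t, int bridge_id)
{
	int vid;

	assert(bridge_id >= 0);

	if (vid_take(t, 0, bridge_id))
		return 0;
	if (vid_take(t, 4095, bridge_id))
		return 4095;
	for (vid = 4094; vid >= 1; vid--)
		if (vid_take(t, vid, bridge_id))
			return vid;
	return VID_NONE;
}

/* A filtering bridge or VLAN upper requests @vid. If a non-filtering
 * bridge is parked there, move it to another unused VID and report the
 * new VID through @moved_to (VID_NONE if nothing had to move).
 * Returns 0 on success, -1 if the request cannot be satisfied.
 */
static int vid_claim_for_user(struct vid_table *t, int vid, int *moved_to)
{
	int bridge_id, new_vid;

	*moved_to = VID_NONE;
	if (vid < 1 || vid > 4094 || t->owner[vid] == VID_USER)
		return -1;

	if (t->owner[vid] == VID_FREE) {
		t->owner[vid] = VID_USER;
		return 0;
	}

	bridge_id = t->bridge[vid];
	t->owner[vid] = VID_USER;	/* keep the allocator away from it */
	new_vid = vid_alloc_for_bridge(t, bridge_id);
	if (new_vid == VID_NONE) {
		t->owner[vid] = VID_BRIDGE;	/* roll back */
		return -1;
	}
	*moved_to = new_vid;
	return 0;
}
```

With an empty table, the first three non-filtering bridges end up on
VID 0, 4095 and 4094. If VID 4094 is then requested by a filtering
bridge, the third bridge is moved to 4093. A real implementation would
presumably also have to reprogram the hardware tables as part of the
move, which this sketch ignores.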

Does that sound reasonable?

> ====================
>
> Tobias Waldekranz (5):
>   net: dfwd: constrain existing users to macvlan subordinates
>   net: bridge: disambiguate offload_fwd_mark
>   net: bridge: switchdev: recycle unused hwdoms
>   net: bridge: switchdev: allow the data plane forwarding to be
>     offloaded
>   net: dsa: tag_dsa: offload the bridge forwarding process
>
> Vladimir Oltean (5):
>   net: extract helpers for binding a subordinate device to TX queues
>   net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
>   net: dsa: track the number of switches in a tree
>   net: dsa: add support for bridge forwarding offload
>   net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
>     the PVT
>
>  drivers/net/dsa/mv88e6xxx/chip.c              | 106 +++++++++++-
>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
>  include/linux/if_bridge.h                     |   1 +
>  include/linux/netdevice.h                     |  13 +-
>  include/net/dsa.h                             |  37 ++++
>  net/bridge/br_forward.c                       |  18 +-
>  net/bridge/br_if.c                            |   4 +-
>  net/bridge/br_private.h                       |  49 +++++-
>  net/bridge/br_switchdev.c                     | 163 +++++++++++++++---
>  net/bridge/br_vlan.c                          |  10 +-
>  net/core/dev.c                                |  31 +++-
>  net/dsa/dsa2.c                                |   3 +
>  net/dsa/dsa_priv.h                            |  28 +++
>  net/dsa/port.c                                |  35 ++++
>  net/dsa/slave.c                               | 134 +++++++++++++-
>  net/dsa/switch.c                              |  58 +++++++
>  net/dsa/tag_dsa.c                             |  60 ++++++-
>  19 files changed, 700 insertions(+), 59 deletions(-)
>
> -- 
> 2.25.1

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
  2021-07-03 22:04   ` [Bridge] " Tobias Waldekranz
@ 2021-07-04  8:11     ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-04  8:11 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Roopa Prabhu, Nikolay Aleksandrov,
	Stephen Hemminger, bridge, Alexander Duyck

Hi Tobias,

On Sun, Jul 04, 2021 at 12:04:26AM +0200, Tobias Waldekranz wrote:
> On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > For this series I have taken Tobias' work from here:
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> > and made the following changes:
> > - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
> >   feedback on the bridge driver changes. Otherwise, the structure of the
> >   bridge changes is pretty much the same as Tobias left it.
> > - I basically rewrote the DSA infrastructure for the data plane
> >   forwarding offload, based on the commonalities with another switch
> >   driver for which I implemented this feature (not submitted here)
> > - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
> >   works but I didn't test that
>
> Hi Vladimir,
>
> Sorry that I have dropped the ball on this series. I have actually had a
> v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx
> specific problems. (See below)
>
> > The data plane of the software bridge can be partially offloaded to
> > switchdev, in the sense that we can trust the accelerator to:
> > (a) look up its FDB (which is more or less in sync with the software
> >     bridge FDB) for selecting the destination ports for a packet
> > (b) replicate the frame in hardware in case it's a multicast/broadcast,
> >     instead of the software bridge having to clone it and send the
> >     clones to each net device one at a time. This reduces the bandwidth
> >     needed between the CPU and the accelerator, as well as the CPU time
> >     spent.
> >
> > The data path forwarding offload is managed per "hardware domain" - a
> > generalization of the "offload_fwd_mark" concept which is being
> > introduced in this series. Every packet is delivered only once to each
> > hardware domain.
> >
> > In addition, Tobias said in the original cover letter:
> >
> > ====================
> > ## Overview
> >
> >    vlan1   vlan2
> >        \   /
> >    .-----------.
> >    |    br0    |
> >    '-----------'
> >    /   /   \   \
> > swp0 swp1 swp2 eth0
> >   :   :   :
> >   (hwdom 1)
> >
> > Up to this point, switchdevs have been trusted with offloading
> > forwarding between bridge ports, e.g. forwarding a unicast from swp0
> > to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> > series extends forward offloading to include some new classes of
> > traffic:
> >
> > - Locally originating flows, i.e. packets that ingress on br0 that are
> >   to be forwarded to one or several of the ports swp{0,1,2}. Notably
> >   this also includes routed flows, e.g. a packet ingressing swp0 on
> >   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
> >   forwarded to swp1 is "locally originating" from br0's point of view.
> >
> > - Flows originating from "foreign" interfaces, i.e. an interface that
> >   is not offloaded by a particular switchdev instance. This includes
> >   ports belonging to other switchdev instances. A typical example
> >   would be flows from eth0 towards swp{0,1,2}.
> >
> > The bridge still looks up its FDB/MDB as usual and then notifies the
> > switchdev driver that a particular skb should be offloaded if it
> > matches one of the classes above. It does so by using the _accel
> > version of dev_queue_xmit, supplying its own netdev as the
> > "subordinate" device. The driver can react to the presence of the
> > subordinate in its .ndo_select_queue in whatever way it needs to make
> > sure to forward the skb in much the same way that it would for packets
> > ingressing on regular ports.
> >
> > Hardware domains to which a particular skb has been forwarded are
> > recorded so that duplicates are avoided.
> >
> > The main performance benefit is thus seen on multicast flows. Imagine
> > for example that:
> >
> > - An IP camera is connected to swp0 (VLAN 1)
> >
> > - The CPU is acting as a multicast router, routing the group from VLAN
> >   1 to VLAN 2.
> >
> > - There are subscribers for the group in question behind both swp1 and
> >   swp2 (VLAN 2).
> >
> > With this offloading in place, the bridge need only send a single skb
> > to the driver, which will send it to the hardware marked in such a way
> > that the switch will perform the multicast replication according to
> > the MDB configuration. Naturally, the number of saved skb_clones
> > increases linearly with the number of subscribed ports.
> >
> > As an extra benefit, on mv88e6xxx, this also allows the switch to
> > perform source address learning on these flows, which avoids having to
> > sync dynamic FDB entries over slow configuration interfaces like MDIO
> > to avoid flows directed towards the CPU being flooded as unknown
> > unicast by the switch.
> >
> >
> > ## RFC
> >
> > - In general, what do you think about this idea?
> >
> > - hwdom. What do you think about this terminology? Personally I feel
> >   that we had too many things called offload_fwd_mark, and that as the
> >   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
> >   might be useful to have a separate term for it.
> >
> > - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
> >   and if so do you have any suggestion/preference on how to signal the
> >   offloading from the bridge down to the switchdev driver?
> >
> > - The way that flooding is implemented in br_forward.c (lazily cloning
> >   skbs) means that you have to mark the forwarding as completed very
> >   early (right after should_deliver in maybe_deliver) in order to
> >   avoid duplicates. Is there some way to move this decision point to a
> >   later stage that I am missing?
> >
> > - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
> >   compatible with multicast-to-unicast being used on a port. Then
> >   again, I think that this would also be broken for regular switchdev
> >   bridge offloading as this flag is not offloaded to the switchdev
> >   port, so there is no way for the driver to refuse it. Any ideas on
> >   how to handle this?
> >
> >
> > ## mv88e6xxx Specifics
> >
> > Since we are now only receiving a single skb for both unicast and
> > multicast flows, we can tag the packets with the FORWARD command
> > instead of FROM_CPU. The switch(es) will then forward the packet in
> > accordance with its ATU, VTU, STU, and PVT configuration - just like
> > for packets ingressing on user ports.
> >
> > Crucially, FROM_CPU is still used for:
> >
> > - Ports in standalone mode.
> >
> > - Flows that are trapped to the CPU and software-forwarded by a
> >   bridge. Note that these flows match neither of the classes discussed
> >   in the overview.
> >
> > - Packets that are sent directly to a port netdev without going
> >   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
> >   socket.
> >
> > We thus have a pretty clean separation where the data plane uses
> > FORWARDs and the control plane uses TO_/FROM_CPU.
> >
> > The barrier between different bridges is enforced by port based VLANs
> > on mv88e6xxx, which in essence is a mapping from a source device/port
> > pair to an allowed set of egress ports.
>
> Unless I am missing something, it turns out that the PVT is not enough
> to support multiple (non-VLAN filtering) bridges in multi-chip
> setups. While the isolation barrier works, there is no way of correctly
> managing automatic learning.
>
> > In order to have a FORWARD
> > frame (which carries a _source_ device/port) correctly mapped by the
> > PVT, we must use a unique pair for each bridge.
> >
> > Fortunately, there is typically lots of unused address space in most
> > switch trees. When was the last time you saw an mv88e6xxx product
> > using more than 4 chips? Even if you found one with 16 (!) devices,
> > you would still have room to allocate 16*16 virtual ports to software
> > bridges.
> >
> > Therefore, the mv88e6xxx driver will allocate a virtual device/port
> > pair to each bridge that it offloads. All members of the same bridge
> > are then configured to allow packets from this virtual port in their
> > PVTs.
>
> So while this solution is cute, it does not work in this example:
>
>  CPU
>   | .-----.
> .-0-1-. .-0-1-.
> | sw0 | | sw1 |
> '-2-3-' '-2-3-'
>
> - [sw0p2, sw1p2] are attached to one bridge
> - [sw0p3, sw1p3] are attached to another bridge
> - Neither bridge uses VLAN filtering
>
> Since no VLAN information available in the frames, the source addresses
> of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot possibly be
> separated into different FIDs. They will all be placed in the respective
> port's default FID. Thus, the two bridges are not isolated with respect
> to their FDBs.
>
> My current plan is therefore to start by reworking how bridges are
> isolated on mv88e6xxx. Roughly by allocating a reserved VID/FID pair for
> each non-filtering bridge. Two of these can be easily managed since both
> VID 0 and 4095 are illegal on the wire but allowed in the VTU - after
> that it gets tricky. The best scheme I have come up with is to just grab
> an unused VID when adding any subsequent non-filtering bridge; in the
> event that that VID is requested by a filtering bridge or a VLAN upper,
> you move the non-filtering bridge to another currently unused VID.
>
> Does that sound reasonable?

I don't think this patch series makes the problem you are describing any
worse than it already is in mainline, does it?

I mean even with multiple VLAN-unaware bridges spanning the same single
switch chip today, it is still true that you cannot have two stations
with the same MAC address, one in one bridge and another in the other
bridge, right?

Do you have an example where this causes issues that need to be
addressed immediately?

I thought the only case where this is a real problem is when you have
multiple CPU ports or multiple DSA links between 2 switches, because
then, if learning is enabled, that same MAC address will bounce between
the 2 ports. For that case, the consensus was that you just can't enable
address learning on those ports, and you let the software manage the FDB
in a way that is compatible with multiple CPU ports / DSA links (install
the MAC DA as a sort of multicast address and let the port forwarding
matrix choose only one of the 2 destinations based on source port).
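
As a rough illustration of that last point (a toy model, not code from
any driver; struct sw_model and sw_egress_mask are made-up names): the
FDB entry for the MAC DA points at both CPU ports, and the per-source-
port forwarding matrix is what narrows it down to one of them.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PORTS 6
#define BIT(n) (1u << (n))

/* Hypothetical switch model: for each source port, the set of ports
 * it is allowed to forward to.
 */
struct sw_model {
	uint32_t fwd_matrix[NUM_PORTS];
};

/* Final egress set: the FDB destination mask, restricted by the
 * forwarding matrix of the source port, never reflecting back to the
 * source port itself.
 */
static uint32_t sw_egress_mask(const struct sw_model *sw, int src_port,
			       uint32_t fdb_mask)
{
	assert(src_port >= 0 && src_port < NUM_PORTS);

	return fdb_mask & sw->fwd_matrix[src_port] & ~BIT(src_port);
}
```

If ports 0 and 1 are the two CPU ports, user ports 2 and 3 are only
allowed to reach CPU port 0, and user ports 4 and 5 only CPU port 1,
then an entry installed towards BIT(0) | BIT(1) resolves to port 0 for
frames from port 2 and to port 1 for frames from port 4.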

Lack of FDB partitioning also used to be a problem when the standalone
ports were left to do address learning, but that changed too.

The hardware I am working with simply does not have any way to solve
this either: the FDB is not partitionable without VLAN filtering (we
have simple shared VLAN filtering, where the VID is ignored and the FDB
lookup is performed with VID 0, but not anything more complex). So the
simple solution I've been advising for people who want their MAC
addresses to be isolated is to create a single VLAN-aware bridge and
manage the VLAN broadcast domains themselves. That seems to work, and
it is simple to understand and flexible (note that I am going to send a
patch at some point to prevent the user from partitioning a sja1105
switch tree into multiple VLAN-aware bridges).

Basically, unless I'm misunderstanding something, I think what you're
proposing makes theoretical sense, but without a use case behind it, it
might just be too much work with no real-life benefit.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Bridge] [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
@ 2021-07-04  8:11     ` Vladimir Oltean
  0 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-04  8:11 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Andrew Lunn, Florian Fainelli, Jiri Pirko, bridge,
	Vladimir Oltean, Roopa Prabhu, Alexander Duyck, Vivien Didelot,
	Ido Schimmel, Nikolay Aleksandrov, netdev, Jakub Kicinski,
	David S. Miller

Hi Tobias,

On Sun, Jul 04, 2021 at 12:04:26AM +0200, Tobias Waldekranz wrote:
> On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > For this series I have taken Tobias' work from here:
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> > and made the following changes:
> > - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
> >   feedback on the bridge driver changes. Otherwise, the structure of the
> >   bridge changes is pretty much the same as Tobias left it.
> > - I basically rewrote the DSA infrastructure for the data plane
> >   forwarding offload, based on the commonalities with another switch
> >   driver for which I implemented this feature (not submitted here)
> > - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
> >   works but I didn't test that
>
> Hi Vladimir,
>
> Sorry that I have dropped the ball on this series. I have actually had a
> v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx
> specific problems. (See below)
>
> > The data plane of the software bridge can be partially offloaded to
> > switchdev, in the sense that we can trust the accelerator to:
> > (a) look up its FDB (which is more or less in sync with the software
> >     bridge FDB) for selecting the destination ports for a packet
> > (b) replicate the frame in hardware in case it's a multicast/broadcast,
> >     instead of the software bridge having to clone it and send the
> >     clones to each net device one at a time. This reduces the bandwidth
> >     needed between the CPU and the accelerator, as well as the CPU time
> >     spent.
> >
> > The data path forwarding offload is managed per "hardware domain" - a
> > generalization of the "offload_fwd_mark" concept which is being
> > introduced in this series. Every packet is delivered only once to each
> > hardware domain.
> >
> > In addition, Tobias said in the original cover letter:
> >
> > ====================
> > ## Overview
> >
> >    vlan1   vlan2
> >        \   /
> >    .-----------.
> >    |    br0    |
> >    '-----------'
> >    /   /   \   \
> > swp0 swp1 swp2 eth0
> >   :   :   :
> >   (hwdom 1)
> >
> > Up to this point, switchdevs have been trusted with offloading
> > forwarding between bridge ports, e.g. forwarding a unicast from swp0
> > to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> > series extends forward offloading to include some new classes of
> > traffic:
> >
> > - Locally originating flows, i.e. packets that ingress on br0 that are
> >   to be forwarded to one or several of the ports swp{0,1,2}. Notably
> >   this also includes routed flows, e.g. a packet ingressing swp0 on
> >   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
> >   forwarded to swp1 is "locally originating" from br0's point of view.
> >
> > - Flows originating from "foreign" interfaces, i.e. an interface that
> >   is not offloaded by a particular switchdev instance. This includes
> >   ports belonging to other switchdev instances. A typical example
> >   would be flows from eth0 towards swp{0,1,2}.
> >
> > The bridge still looks up its FDB/MDB as usual and then notifies the
> > switchdev driver that a particular skb should be offloaded if it
> > matches one of the classes above. It does so by using the _accel
> > version of dev_queue_xmit, supplying its own netdev as the
> > "subordinate" device. The driver can react to the presence of the
> > subordinate in its .ndo_select_queue in whatever way it needs to make
> > sure to forward the skb in much the same way that it would for packets
> > ingressing on regular ports.
> >
> > Hardware domains to which a particular skb has been forwarded are
> > recorded so that duplicates are avoided.
> >
> > The main performance benefit is thus seen on multicast flows. Imagine
> > for example that:
> >
> > - An IP camera is connected to swp0 (VLAN 1)
> >
> > - The CPU is acting as a multicast router, routing the group from VLAN
> >   1 to VLAN 2.
> >
> > - There are subscribers for the group in question behind both swp1 and
> >   swp2 (VLAN 2).
> >
> > With this offloading in place, the bridge need only send a single skb
> > to the driver, which will send it to the hardware marked in such a way
> > that the switch will perform the multicast replication according to
> > the MDB configuration. Naturally, the number of saved skb_clones
> > increases linearly with the number of subscribed ports.
> >
> > As an extra benefit, on mv88e6xxx, this also allows the switch to
> > perform source address learning on these flows, which avoids having to
> > sync dynamic FDB entries over slow configuration interfaces like MDIO
> > to avoid flows directed towards the CPU being flooded as unknown
> > unicast by the switch.
> >
> >
> > ## RFC
> >
> > - In general, what do you think about this idea?
> >
> > - hwdom. What do you think about this terminology? Personally I feel
> >   that we had too many things called offload_fwd_mark, and that as the
> >   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
> >   might be useful to have a separate term for it.
> >
> > - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
> >   and if so do you have any suggestion/preference on how to signal the
> >   offloading from the bridge down to the switchdev driver?
> >
> > - The way that flooding is implemented in br_forward.c (lazily cloning
> >   skbs) means that you have to mark the forwarding as completed very
> >   early (right after should_deliver in maybe_deliver) in order to
> >   avoid duplicates. Is there some way to move this decision point to a
> >   later stage that I am missing?
> >
> > - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
> >   compatible with multicast-to-unicast being used on a port. Then
> >   again, I think that this would also be broken for regular switchdev
> >   bridge offloading as this flag is not offloaded to the switchdev
> >   port, so there is no way for the driver to refuse it. Any ideas on
> >   how to handle this?
> >
> >
> > ## mv88e6xxx Specifics
> >
> > Since we are now only receiving a single skb for both unicast and
> > multicast flows, we can tag the packets with the FORWARD command
> > instead of FROM_CPU. The switch(es) will then forward the packet in
> > accordance with its ATU, VTU, STU, and PVT configuration - just like
> > for packets ingressing on user ports.
> >
> > Crucially, FROM_CPU is still used for:
> >
> > - Ports in standalone mode.
> >
> > - Flows that are trapped to the CPU and software-forwarded by a
> >   bridge. Note that these flows match neither of the classes discussed
> >   in the overview.
> >
> > - Packets that are sent directly to a port netdev without going
> >   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
> >   socket.
> >
> > We thus have a pretty clean separation where the data plane uses
> > FORWARDs and the control plane uses TO_/FROM_CPU.
> >
> > The barrier between different bridges is enforced by port based VLANs
> > on mv88e6xxx, which in essence is a mapping from a source device/port
> > pair to an allowed set of egress ports.
>
> Unless I am missing something, it turns out that the PVT is not enough
> to support multiple (non-VLAN filtering) bridges in multi-chip
> setups. While the isolation barrier works, there is no way of correctly
> managing automatic learning.
>
> > In order to have a FORWARD
> > frame (which carries a _source_ device/port) correctly mapped by the
> > PVT, we must use a unique pair for each bridge.
> >
> > Fortunately, there is typically lots of unused address space in most
> > switch trees. When was the last time you saw an mv88e6xxx product
> > using more than 4 chips? Even if you found one with 16 (!) devices,
> > you would still have room to allocate 16*16 virtual ports to software
> > bridges.
> >
> > Therefore, the mv88e6xxx driver will allocate a virtual device/port
> > pair to each bridge that it offloads. All members of the same bridge
> > are then configured to allow packets from this virtual port in their
> > PVTs.
>
> So while this solution is cute, it does not work in this example:
>
>  CPU
>   | .-----.
> .-0-1-. .-0-1-.
> | sw0 | | sw1 |
> '-2-3-' '-2-3-'
>
> - [sw0p2, sw1p2] are attached to one bridge
> - [sw0p3, sw1p3] are attached to another bridge
> - Neither bridge uses VLAN filtering
>
> Since no VLAN information is available in the frames, the source addresses
> of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot possibly be
> separated into different FIDs. They will all be placed in the respective
> port's default FID. Thus, the two bridges are not isolated with respect
> to their FDBs.
>
> My current plan is therefore to start by reworking how bridges are
> isolated on mv88e6xxx. Roughly by allocating a reserved VID/FID pair for
> each non-filtering bridge. Two of these can be easily managed since both
> VID 0 and 4095 are illegal on the wire but allowed in the VTU - after
> that it gets tricky. The best scheme I have come up with is to just grab
> an unused VID when adding any subsequent non-filtering bridge; in the
> event that that VID is requested by a filtering bridge or a VLAN upper,
> you move the non-filtering bridge to another currently unused VID.
>
> Does that sound reasonable?

I don't think this patch series makes the problem you are describing any
worse than it already is in mainline, does it?

I mean even with multiple VLAN-unaware bridges spanning the same single
switch chip today, it is still true that you cannot have two stations
with the same MAC address, one in one bridge and another in the other
bridge, right?

Do you have an example when this causes issues that need to be addressed
immediately?

I thought the only case where this is a real problem is when you have
multiple CPU ports or multiple DSA links between 2 switches, because
then, if learning is enabled, that same MAC address will bounce between
the 2 ports. For that case, the consensus was that you just can't enable
address learning on those ports, and you let the software manage the FDB
in a way that is compatible with multiple CPU ports / DSA links (install
the MAC DA as a sort of multicast address and let the port forwarding
matrix choose only one of the 2 destinations based on source port).
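
To make the parenthesis above concrete, here is a rough sketch of the idea
and not code from any driver: the FDB entry for the MAC DA names both CPU
ports as destinations, and a per-source-port forwarding matrix masks that
down to a single one. The names (example_egress_ports, EXAMPLE_NUM_PORTS)
and the 6-port layout are assumptions made up for illustration.

```c
#include <stdint.h>

#define EXAMPLE_NUM_PORTS 6

/*
 * Hypothetical model, not taken from any driver: fwd_matrix[src] is the set
 * of ports that a frame received on port src is allowed to egress.
 */
static inline uint32_t example_egress_ports(const uint32_t *fwd_matrix,
					    unsigned int src_port,
					    uint32_t fdb_dest_mask)
{
	if (src_port >= EXAMPLE_NUM_PORTS)
		return 0;

	/*
	 * The FDB may name both CPU ports; the matrix keeps only the one
	 * reachable from this source port.
	 */
	return fdb_dest_mask & fwd_matrix[src_port];
}
```

With ports 4 and 5 as CPU ports, a matrix that lets ports 0-1 reach only
port 4 and ports 2-3 reach only port 5 would deliver each frame to exactly
one CPU port, no matter how many are listed in the FDB entry.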

Lack of FDB partitioning also used to be a problem when the standalone
ports were left to do address learning, but that changed too.

The hardware I am working with simply does not have any way to solve
this either - the FDB is simply not partitionable without VLAN
filtering (we have simple shared VLAN filtering, where the VID is
ignored and the FDB lookup is performed with VID 0, but not anything
more complex). So the simple solution I've been advising for people who
want their MAC addresses to be isolated is to create a single VLAN-aware
bridge and manage the VLAN broadcast domains themselves - that seems to
work and is simple to understand and flexible (note that I am going to
send a patch at some point to prevent the user from partitioning a
sja1105 switch tree into multiple VLAN-aware bridges).
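
As a toy model of why the FDB cannot be partitioned here (a sketch under
my own assumptions, not how any real switch lays out its table): when the
lookup key is (MAC, VID) and the VID is forced to 0 outside VLAN-aware
mode, the same MAC address learned from two VLAN-unaware bridges lands in
the same entry. All names below (example_fdb, example_fdb_learn, ...) are
hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define EXAMPLE_FDB_SIZE 8

struct example_fdb_entry {
	uint8_t mac[6];
	uint16_t vid;
	int port;
	int valid;
};

struct example_fdb {
	struct example_fdb_entry e[EXAMPLE_FDB_SIZE];
};

/* Outside VLAN-aware mode the frame's VID is ignored and VID 0 is used. */
static inline uint16_t example_fdb_vid(uint16_t frame_vid, int vlan_aware)
{
	return vlan_aware ? frame_vid : 0;
}

/* Learn (mac, vid) -> port. Returns the slot used, or -1 if the table is full. */
static inline int example_fdb_learn(struct example_fdb *fdb,
				    const uint8_t *mac, uint16_t frame_vid,
				    int vlan_aware, int port)
{
	uint16_t vid = example_fdb_vid(frame_vid, vlan_aware);
	int i, free_slot = -1;

	for (i = 0; i < EXAMPLE_FDB_SIZE; i++) {
		struct example_fdb_entry *ent = &fdb->e[i];

		if (!ent->valid) {
			if (free_slot < 0)
				free_slot = i;
			continue;
		}
		if (ent->vid == vid && !memcmp(ent->mac, mac, 6)) {
			/* Same key: the station "moves" to the new port. */
			ent->port = port;
			return i;
		}
	}
	if (free_slot < 0)
		return -1;

	memcpy(fdb->e[free_slot].mac, mac, 6);
	fdb->e[free_slot].vid = vid;
	fdb->e[free_slot].port = port;
	fdb->e[free_slot].valid = 1;
	return free_slot;
}

/* Returns the learned port for (mac, vid), or -1 if unknown. */
static inline int example_fdb_lookup(const struct example_fdb *fdb,
				     const uint8_t *mac, uint16_t frame_vid,
				     int vlan_aware)
{
	uint16_t vid = example_fdb_vid(frame_vid, vlan_aware);
	int i;

	for (i = 0; i < EXAMPLE_FDB_SIZE; i++) {
		const struct example_fdb_entry *ent = &fdb->e[i];

		if (ent->valid && ent->vid == vid && !memcmp(ent->mac, mac, 6))
			return ent->port;
	}
	return -1;
}
```

In VLAN-unaware mode, learning the same MAC on a second port overwrites the
first entry, whereas in VLAN-aware mode each VID gets its own entry, which
is why a single VLAN-aware bridge is the workaround I have been suggesting.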

Basically, unless I'm misunderstanding something, I think what you're
proposing makes theoretical sense, but without a use case behind it, it
might just be too much work with no real-life benefit.

* Re: [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
  2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
@ 2021-07-05  4:20   ` DENG Qingfang
  -1 siblings, 0 replies; 44+ messages in thread
From: DENG Qingfang @ 2021-07-05  4:20 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Jakub Kicinski, David S. Miller, Andrew Lunn,
	Florian Fainelli, Vivien Didelot, Jiri Pirko, Ido Schimmel,
	Tobias Waldekranz, Roopa Prabhu, Nikolay Aleksandrov,
	Stephen Hemminger, bridge, Alexander Duyck

Hi Vladimir,

On Sat, Jul 03, 2021 at 02:56:55PM +0300, Vladimir Oltean wrote:
> For this series I have taken Tobias' work from here:
> https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> and made the following changes:
> - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
>   feedback on the bridge driver changes. Otherwise, the structure of the
>   bridge changes is pretty much the same as Tobias left it.
> - I basically rewrote the DSA infrastructure for the data plane
>   forwarding offload, based on the commonalities with another switch
>   driver for which I implemented this feature (not submitted here)
> - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
>   works but I didn't test that
> 
> The data plane of the software bridge can be partially offloaded to
> switchdev, in the sense that we can trust the accelerator to:
> (a) look up its FDB (which is more or less in sync with the software
>     bridge FDB) for selecting the destination ports for a packet
> (b) replicate the frame in hardware in case it's a multicast/broadcast,
>     instead of the software bridge having to clone it and send the
>     clones to each net device one at a time. This reduces the bandwidth
>     needed between the CPU and the accelerator, as well as the CPU time
>     spent.

Many DSA taggers use a port bit field in their TX tags, which allows
replication in hardware. (multiple bits set = send to multiple ports)
I wonder if the tagger API can be updated to support this.
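
To illustrate what I mean, here is a minimal sketch with a made-up tag
layout (the low 8 bits of a 32-bit TX tag form a destination port bitmap).
It does not correspond to any real tagger; example_port_mask and
example_tag_build are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical TX tag: the low 8 bits are a destination port bitmap. */
#define EXAMPLE_TAG_PORT_MASK	0xffu
#define EXAMPLE_TAG_MAX_PORTS	8

/* Turn a list of port indices into a bitmap; out-of-range ports are skipped. */
static inline uint8_t example_port_mask(const unsigned int *ports,
					unsigned int n)
{
	uint8_t mask = 0;
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (ports[i] < EXAMPLE_TAG_MAX_PORTS)
			mask |= (uint8_t)(1u << ports[i]);
	}
	return mask;
}

/* Build one tag that asks the switch to replicate the frame to all set ports. */
static inline uint32_t example_tag_build(uint8_t port_mask)
{
	return (uint32_t)port_mask & EXAMPLE_TAG_PORT_MASK;
}
```

With something like this, a flood to ports 0 and 2 would need a single
tagged frame with bitmap 0x05 instead of one clone per port.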

> 
> The data path forwarding offload is managed per "hardware domain" - a
> generalization of the "offload_fwd_mark" concept which is being
> introduced in this series. Every packet is delivered only once to each
> hardware domain.
> 
> In addition, Tobias said in the original cover letter:
> 
> ====================
> ## Overview
> 
>    vlan1   vlan2
>        \   /
>    .-----------.
>    |    br0    |
>    '-----------'
>    /   /   \   \
> swp0 swp1 swp2 eth0
>   :   :   :
>   (hwdom 1)
> 
> Up to this point, switchdevs have been trusted with offloading
> forwarding between bridge ports, e.g. forwarding a unicast from swp0
> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> series extends forward offloading to include some new classes of
> traffic:
> 
> - Locally originating flows, i.e. packets that ingress on br0 that are
>   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>   this also includes routed flows, e.g. a packet ingressing swp0 on
>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>   forwarded to swp1 is "locally originating" from br0's point of view.
> 
> - Flows originating from "foreign" interfaces, i.e. an interface that
>   is not offloaded by a particular switchdev instance. This includes
>   ports belonging to other switchdev instances. A typical example
>   would be flows from eth0 towards swp{0,1,2}.
> 
> The bridge still looks up its FDB/MDB as usual and then notifies the
> switchdev driver that a particular skb should be offloaded if it
> matches one of the classes above. It does so by using the _accel
> version of dev_queue_xmit, supplying its own netdev as the
> "subordinate" device. The driver can react to the presence of the
> subordinate in its .ndo_select_queue in whatever way it needs to make
> sure to forward the skb in much the same way that it would for packets
> ingressing on regular ports.
> 
> Hardware domains to which a particular skb has been forwarded are
> recorded so that duplicates are avoided.
> 
> The main performance benefit is thus seen on multicast flows. Imagine
> for example that:
> 
> - An IP camera is connected to swp0 (VLAN 1)
> 
> - The CPU is acting as a multicast router, routing the group from VLAN
>   1 to VLAN 2.
> 
> - There are subscribers for the group in question behind both swp1 and
>   swp2 (VLAN 2).
> 
> With this offloading in place, the bridge need only send a single skb
> to the driver, which will send it to the hardware marked in such a way
> that the switch will perform the multicast replication according to
> the MDB configuration. Naturally, the number of saved skb_clones
> increases linearly with the number of subscribed ports.
> 
> As an extra benefit, on mv88e6xxx, this also allows the switch to
> perform source address learning on these flows, which avoids having to
> sync dynamic FDB entries over slow configuration interfaces like MDIO
> to avoid flows directed towards the CPU being flooded as unknown
> unicast by the switch.
> 
> 
> ## RFC
> 
> - In general, what do you think about this idea?
> 
> - hwdom. What do you think about this terminology? Personally I feel
>   that we had too many things called offload_fwd_mark, and that as the
>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>   might be useful to have a separate term for it.
> 
> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>   and if so do you have any suggestion/preference on how to signal the
>   offloading from the bridge down to the switchdev driver?
> 
> - The way that flooding is implemented in br_forward.c (lazily cloning
>   skbs) means that you have to mark the forwarding as completed very
>   early (right after should_deliver in maybe_deliver) in order to
>   avoid duplicates. Is there some way to move this decision point to a
>   later stage that I am missing?
> 
> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>   compatible with multicast-to-unicast being used on a port. Then
>   again, I think that this would also be broken for regular switchdev
>   bridge offloading as this flag is not offloaded to the switchdev
>   port, so there is no way for the driver to refuse it. Any ideas on
>   how to handle this?
> 
> 
> ## mv88e6xxx Specifics
> 
> Since we are now only receiving a single skb for both unicast and
> multicast flows, we can tag the packets with the FORWARD command
> instead of FROM_CPU. The switch(es) will then forward the packet in
> accordance with its ATU, VTU, STU, and PVT configuration - just like
> for packets ingressing on user ports.
> 
> Crucially, FROM_CPU is still used for:
> 
> - Ports in standalone mode.
> 
> - Flows that are trapped to the CPU and software-forwarded by a
>   bridge. Note that these flows match neither of the classes discussed
>   in the overview.
> 
> - Packets that are sent directly to a port netdev without going
>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>   socket.
> 
> We thus have a pretty clean separation where the data plane uses
> FORWARDs and the control plane uses TO_/FROM_CPU.
> 
> The barrier between different bridges is enforced by port based VLANs
> on mv88e6xxx, which in essence is a mapping from a source device/port
> pair to an allowed set of egress ports. In order to have a FORWARD
> frame (which carries a _source_ device/port) correctly mapped by the
> PVT, we must use a unique pair for each bridge.
> 
> Fortunately, there is typically lots of unused address space in most
> switch trees. When was the last time you saw an mv88e6xxx product
> using more than 4 chips? Even if you found one with 16 (!) devices,
> you would still have room to allocate 16*16 virtual ports to software
> bridges.
> 
> Therefore, the mv88e6xxx driver will allocate a virtual device/port
> pair to each bridge that it offloads. All members of the same bridge
> are then configured to allow packets from this virtual port in their
> PVTs.
> ====================
> 
> Tobias Waldekranz (5):
>   net: dfwd: constrain existing users to macvlan subordinates
>   net: bridge: disambiguate offload_fwd_mark
>   net: bridge: switchdev: recycle unused hwdoms
>   net: bridge: switchdev: allow the data plane forwarding to be
>     offloaded
>   net: dsa: tag_dsa: offload the bridge forwarding process
> 
> Vladimir Oltean (5):
>   net: extract helpers for binding a subordinate device to TX queues
>   net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
>   net: dsa: track the number of switches in a tree
>   net: dsa: add support for bridge forwarding offload
>   net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
>     the PVT
> 
>  drivers/net/dsa/mv88e6xxx/chip.c              | 106 +++++++++++-
>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
>  include/linux/if_bridge.h                     |   1 +
>  include/linux/netdevice.h                     |  13 +-
>  include/net/dsa.h                             |  37 ++++
>  net/bridge/br_forward.c                       |  18 +-
>  net/bridge/br_if.c                            |   4 +-
>  net/bridge/br_private.h                       |  49 +++++-
>  net/bridge/br_switchdev.c                     | 163 +++++++++++++++---
>  net/bridge/br_vlan.c                          |  10 +-
>  net/core/dev.c                                |  31 +++-
>  net/dsa/dsa2.c                                |   3 +
>  net/dsa/dsa_priv.h                            |  28 +++
>  net/dsa/port.c                                |  35 ++++
>  net/dsa/slave.c                               | 134 +++++++++++++-
>  net/dsa/switch.c                              |  58 +++++++
>  net/dsa/tag_dsa.c                             |  60 ++++++-
>  19 files changed, 700 insertions(+), 59 deletions(-)
> 
> -- 
> 2.25.1
> 

* Re: [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
  2021-07-04  8:11     ` [Bridge] " Vladimir Oltean
@ 2021-07-05  8:09       ` Tobias Waldekranz
  -1 siblings, 0 replies; 44+ messages in thread
From: Tobias Waldekranz @ 2021-07-05  8:09 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Roopa Prabhu, Nikolay Aleksandrov,
	Stephen Hemminger, bridge, Alexander Duyck

On Sun, Jul 04, 2021 at 11:11, Vladimir Oltean <olteanv@gmail.com> wrote:
> Hi Tobias,
>
> On Sun, Jul 04, 2021 at 12:04:26AM +0200, Tobias Waldekranz wrote:
>> On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>> > For this series I have taken Tobias' work from here:
>> > https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
>> > and made the following changes:
>> > - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
>> >   feedback on the bridge driver changes. Otherwise, the structure of the
>> >   bridge changes is pretty much the same as Tobias left it.
>> > - I basically rewrote the DSA infrastructure for the data plane
>> >   forwarding offload, based on the commonalities with another switch
>> >   driver for which I implemented this feature (not submitted here)
>> > - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
>> >   works but I didn't test that
>>
>> Hi Vladimir,
>>
>> Sorry that I have dropped the ball on this series. I have actually had a
>> v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx
>> specific problems. (See below)
>>
>> > The data plane of the software bridge can be partially offloaded to
>> > switchdev, in the sense that we can trust the accelerator to:
>> > (a) look up its FDB (which is more or less in sync with the software
>> >     bridge FDB) for selecting the destination ports for a packet
>> > (b) replicate the frame in hardware in case it's a multicast/broadcast,
>> >     instead of the software bridge having to clone it and send the
>> >     clones to each net device one at a time. This reduces the bandwidth
>> >     needed between the CPU and the accelerator, as well as the CPU time
>> >     spent.
>> >
>> > The data path forwarding offload is managed per "hardware domain" - a
>> > generalization of the "offload_fwd_mark" concept which is being
>> > introduced in this series. Every packet is delivered only once to each
>> > hardware domain.
>> >
>> > In addition, Tobias said in the original cover letter:
>> >
>> > ====================
>> > ## Overview
>> >
>> >    vlan1   vlan2
>> >        \   /
>> >    .-----------.
>> >    |    br0    |
>> >    '-----------'
>> >    /   /   \   \
>> > swp0 swp1 swp2 eth0
>> >   :   :   :
>> >   (hwdom 1)
>> >
>> > Up to this point, switchdevs have been trusted with offloading
>> > forwarding between bridge ports, e.g. forwarding a unicast from swp0
>> > to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
>> > series extends forward offloading to include some new classes of
>> > traffic:
>> >
>> > - Locally originating flows, i.e. packets that ingress on br0 that are
>> >   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>> >   this also includes routed flows, e.g. a packet ingressing swp0 on
>> >   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>> >   forwarded to swp1 is "locally originating" from br0's point of view.
>> >
>> > - Flows originating from "foreign" interfaces, i.e. an interface that
>> >   is not offloaded by a particular switchdev instance. This includes
>> >   ports belonging to other switchdev instances. A typical example
>> >   would be flows from eth0 towards swp{0,1,2}.
>> >
>> > The bridge still looks up its FDB/MDB as usual and then notifies the
>> > switchdev driver that a particular skb should be offloaded if it
>> > matches one of the classes above. It does so by using the _accel
>> > version of dev_queue_xmit, supplying its own netdev as the
>> > "subordinate" device. The driver can react to the presence of the
>> > subordinate in its .ndo_select_queue in whatever way it needs to make
>> > sure to forward the skb in much the same way that it would for packets
>> > ingressing on regular ports.
>> >
>> > Hardware domains to which a particular skb has been forwarded are
>> > recorded so that duplicates are avoided.
>> >
>> > The main performance benefit is thus seen on multicast flows. Imagine
>> > for example that:
>> >
>> > - An IP camera is connected to swp0 (VLAN 1)
>> >
>> > - The CPU is acting as a multicast router, routing the group from VLAN
>> >   1 to VLAN 2.
>> >
>> > - There are subscribers for the group in question behind both swp1 and
>> >   swp2 (VLAN 2).
>> >
>> > With this offloading in place, the bridge need only send a single skb
>> > to the driver, which will send it to the hardware marked in such a way
>> > that the switch will perform the multicast replication according to
>> > the MDB configuration. Naturally, the number of saved skb_clones
>> > increases linearly with the number of subscribed ports.
>> >
>> > As an extra benefit, on mv88e6xxx, this also allows the switch to
>> > perform source address learning on these flows, which avoids having to
>> > sync dynamic FDB entries over slow configuration interfaces like MDIO
>> > to avoid flows directed towards the CPU being flooded as unknown
>> > unicast by the switch.
>> >
>> >
>> > ## RFC
>> >
>> > - In general, what do you think about this idea?
>> >
>> > - hwdom. What do you think about this terminology? Personally I feel
>> >   that we had too many things called offload_fwd_mark, and that as the
>> >   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>> >   might be useful to have a separate term for it.
>> >
>> > - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>> >   and if so do you have any suggestion/preference on how to signal the
>> >   offloading from the bridge down to the switchdev driver?
>> >
>> > - The way that flooding is implemented in br_forward.c (lazily cloning
>> >   skbs) means that you have to mark the forwarding as completed very
>> >   early (right after should_deliver in maybe_deliver) in order to
>> >   avoid duplicates. Is there some way to move this decision point to a
>> >   later stage that I am missing?
>> >
>> > - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>> >   compatible with multicast-to-unicast being used on a port. Then
>> >   again, I think that this would also be broken for regular switchdev
>> >   bridge offloading as this flag is not offloaded to the switchdev
>> >   port, so there is no way for the driver to refuse it. Any ideas on
>> >   how to handle this?
>> >
>> >
>> > ## mv88e6xxx Specifics
>> >
>> > Since we are now only receiving a single skb for both unicast and
>> > multicast flows, we can tag the packets with the FORWARD command
>> > instead of FROM_CPU. The switch(es) will then forward the packet in
>> > accordance with its ATU, VTU, STU, and PVT configuration - just like
>> > for packets ingressing on user ports.
>> >
>> > Crucially, FROM_CPU is still used for:
>> >
>> > - Ports in standalone mode.
>> >
>> > - Flows that are trapped to the CPU and software-forwarded by a
>> >   bridge. Note that these flows match neither of the classes discussed
>> >   in the overview.
>> >
>> > - Packets that are sent directly to a port netdev without going
>> >   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>> >   socket.
>> >
>> > We thus have a pretty clean separation where the data plane uses
>> > FORWARDs and the control plane uses TO_/FROM_CPU.
>> >
>> > The barrier between different bridges is enforced by port based VLANs
>> > on mv88e6xxx, which in essence is a mapping from a source device/port
>> > pair to an allowed set of egress ports.
>>
>> Unless I am missing something, it turns out that the PVT is not enough
>> to support multiple (non-VLAN filtering) bridges in multi-chip
>> setups. While the isolation barrier works, there is no way of correctly
>> managing automatic learning.
>>
>> > In order to have a FORWARD
>> > frame (which carries a _source_ device/port) correctly mapped by the
>> > PVT, we must use a unique pair for each bridge.
>> >
>> > Fortunately, there is typically lots of unused address space in most
>> > switch trees. When was the last time you saw an mv88e6xxx product
>> > using more than 4 chips? Even if you found one with 16 (!) devices,
>> > you would still have room to allocate 16*16 virtual ports to software
>> > bridges.
>> >
>> > Therefore, the mv88e6xxx driver will allocate a virtual device/port
>> > pair to each bridge that it offloads. All members of the same bridge
>> > are then configured to allow packets from this virtual port in their
>> > PVTs.
>>
>> So while this solution is cute, it does not work in this example:
>>
>>  CPU
>>   | .-----.
>> .-0-1-. .-0-1-.
>> | sw0 | | sw1 |
>> '-2-3-' '-2-3-'
>>
>> - [sw0p2, sw1p2] are attached to one bridge
>> - [sw0p3, sw1p3] are attached to another bridge
>> - Neither bridge uses VLAN filtering
>>
>> Since no VLAN information is available in the frames, the source addresses
>> of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot possibly be
>> separated into different FIDs. They will all be placed in the respective
>> port's default FID. Thus, the two bridges are not isolated with respect
>> to their FDBs.
>>
>> My current plan is therefore to start by reworking how bridges are
>> isolated on mv88e6xxx. Roughly by allocating a reserved VID/FID pair for
>> each non-filtering bridge. Two of these can be easily managed since both
>> VID 0 and 4095 are illegal on the wire but allowed in the VTU - after
>> that it gets tricky. The best scheme I have come up with is to just grab
>> an unused VID when adding any subsequent non-filtering bridge; in the
>> event that that VID is requested by a filtering bridge or a VLAN upper,
>> you move the non-filtering bridge to another currently unused VID.
>>
>> Does that sound reasonable?
>
> I don't think this patch series makes the problem you are describing any
> worse than it already is in mainline, does it?

It does not make it worse, no. But assuming that mv88e6xxx will handle
multi-bridge using the VTU in the future (i.e. my suggestion above),
there is no need for inventing virtual DSA dev/port tuples - we can just
use the physical port info as the source and use the VID to signal the
source bridge. So I am hesitant to merge the mv88e6xxx-specific changes.
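
To make the reserved-VID idea from above a bit more concrete, here is a
toy Python model of the bookkeeping. It is only a sketch: the class and
method names are hypothetical and nothing like this exists in mv88e6xxx
today. It hands out VID 0 and 4095 first, then falls back to any unused
VID, and moves a non-filtering bridge out of the way when a filtering
bridge or VLAN upper asks for the VID it was parked on.

```python
class ReservedVidPool:
    """Toy model: one private VID per non-filtering bridge (hypothetical)."""

    # VID 0 and 4095 are illegal on the wire but allowed in the VTU, so a
    # user can never request them for a real VLAN.
    PREFERRED = (0, 4095)

    def __init__(self):
        self.user_vids = set()   # VIDs used by filtering bridges / VLAN uppers
        self.bridge_vid = {}     # non-filtering bridge name -> reserved VID

    def _pick_free(self):
        in_use = self.user_vids | set(self.bridge_vid.values())
        for vid in self.PREFERRED:
            if vid not in in_use:
                return vid
        # Both "free" VIDs are taken: grab any currently unused VID.
        for vid in range(4094, 0, -1):
            if vid not in in_use:
                return vid
        raise RuntimeError("no unused VID left")

    def add_bridge(self, name):
        """Reserve a VID for a newly added non-filtering bridge."""
        vid = self._pick_free()
        self.bridge_vid[name] = vid
        return vid

    def request_user_vid(self, vid):
        """A filtering bridge or VLAN upper claims a VID in 1..4094."""
        if not 1 <= vid <= 4094:
            raise ValueError("not a valid on-the-wire VID")
        self.user_vids.add(vid)
        # Move any non-filtering bridge that was using this VID.
        for name, reserved in list(self.bridge_vid.items()):
            if reserved == vid:
                self.bridge_vid[name] = self._pick_free()


pool = ReservedVidPool()
for br in ("br0", "br1", "br2"):
    pool.add_bridge(br)
pool.request_user_vid(4094)      # collides with br2, which gets moved
print(pool.bridge_vid)
```

The move in request_user_vid() is the tricky part on real hardware,
since it means reprogramming the VTU and flushing or migrating the ATU
entries of the old FID; the model ignores all of that.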

> I mean even with multiple VLAN-unaware bridges spanning the same single
> switch chip today, it is still true that you can not have two stations
> with the same MAC address, one in one bridge and another in the other
> bridge, right?

That is correct.

> Do you have an example when this causes issues that need to be addressed
> immediately?
>
> I thought the only case where this is a real problem is when you have
> multiple CPU ports or multiple DSA links between 2 switches, because
> then, if learning is enabled, that same MAC address will bounce between
> the 2 ports. For that case, the consensus was that you just can't enable
> address learning on those ports, and you let the software manage the FDB
> in a way that is compatible with multiple CPU ports / DSA links (install
> the MAC DA as a sort of multicast address and let the port forwarding
> matrix choose only one of the 2 destinations based on source port).
>
> Lack of FDB partitioning also used to be a problem when the standalone
> ports were left to do address learning, but that changed too.

Funny you should mention that. The presence of standalone ports is
actually what first shone a light on this issue for me. I was running a
kselftest-like setup like this:

   br0
   / \
swp1 swp3  swp2  swp4

Physically, [swp1, swp2] and [swp3, swp4] were looped externally:

    CPU
     |
.----0----.
|   sw0   |
'-1-2-3-4-'
  '-' '-'

I was testing automatic learning by sending out a broadcast from br0 and
verifying that br0's MAC was learned on port 0 in the ATU - alas, it was
not. The MAC was nowhere to be found.

Moving back to a topology without these loops, I could see that learning
worked as expected.

Not sure how familiar you are with mv88e6xxx at this point, but the way
learning is disabled is by clearing a port's port association vector
(PAV). This does not, however, "disable" learning really. It just
updates the ATU with an all-zero vector, which means "invalidate the
entry".

So, in the example above, when the broadcast is looped back to the
standalone ports, the port's default FID (0) will be used to invalidate
the MAC for br0. Since bridged and standalone ports all use FID 0, the
standalone ports will nuke br0's FDB entry.
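
A toy Python model of that sequence, in case it helps. It encodes only
the behaviour as described in this mail (an all-zero PAV still results
in an ATU write, which invalidates the entry); it is not derived from
the datasheet, and all names are made up for the illustration.

```python
CPU_PORT = 0


class ToyAtu:
    """Toy ATU: maps (FID, MAC) to a destination port vector (bitmask)."""

    def __init__(self):
        self.entries = {}

    def learn(self, fid, src_mac, pav):
        """Source-address learning for one ingressing frame.

        pav is the ingress port's port association vector.
        """
        if pav == 0:
            # "Learning disabled": the all-zero vector is still written,
            # which invalidates whatever entry was there.
            self.entries.pop((fid, src_mac), None)
        else:
            self.entries[(fid, src_mac)] = pav


def br0_mac_survives(standalone_fid):
    """Broadcast from br0 enters on the CPU port and is then looped back
    into a standalone port whose PAV is cleared."""
    br0_mac = "02:00:00:00:00:01"
    atu = ToyAtu()
    atu.learn(fid=0, src_mac=br0_mac, pav=1 << CPU_PORT)    # from the CPU
    atu.learn(fid=standalone_fid, src_mac=br0_mac, pav=0)   # looped to swp2
    return (0, br0_mac) in atu.entries


print(br0_mac_survives(standalone_fid=0))   # shared FID, as today
print(br0_mac_survives(standalone_fid=1))   # standalone ports in their own FID
```

With a shared FID the entry is gone after the loopback; with the
standalone ports in a separate FID it stays in place.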

> The hardware I am working with simply does not have any way to solve
> this either - the FDB is simply not partitionable without VLAN
> filtering (we have simple shared VLAN filtering, where the VID is
> ignored and the FDB lookup is performed with VID 0, but not anything
> more complex). So the simple solution I've been advising for people who
> want their MAC addresses to be isolated is to create a single VLAN-aware
> bridge and manage the VLAN broadcast domains themselves - that seems to
> work and is simple to understand and flexible (note that I am going to
> send a patch at some point to prevent the user from partitioning a
> sja1105 switch tree into multiple VLAN-aware bridges).

Essentially what I am proposing is to always run mv88e6xxx VLAN-aware
internally. Then if you have bridges that disable VLAN filtering, you
change the ingress policy on the member ports to classify all incoming
traffic as untagged and assign them to the port's PVID. (Note: this is
different from "force PVID" in that you never pop any tags from the
frame)
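
As a sketch of the ingress policy I have in mind (Python, hypothetical
names, assuming one PVID per port): only the VID used internally for
the VTU/ATU lookup changes, the frame itself is left alone.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Frame:
    tag_vid: Optional[int]   # 802.1Q VID on the wire, None if untagged


@dataclass(frozen=True)
class Port:
    pvid: int                # for a non-filtering bridge: its reserved VID
    vlan_filtering: bool     # setting of the bridge the port is attached to


def internal_vid(port, frame):
    """VID used for the internal lookups. The frame is never modified,
    so any tag it carries is still there on egress."""
    if port.vlan_filtering and frame.tag_vid:
        return frame.tag_vid
    # Non-filtering bridge: treat everything as untagged. Untagged and
    # priority-tagged (VID 0) frames end up here as well.
    return port.pvid


tagged = Frame(tag_vid=10)
print(internal_vid(Port(pvid=4095, vlan_filtering=False), tagged))
print(internal_vid(Port(pvid=1, vlan_filtering=True), tagged))
```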

Is your device capable of operating in that mode?

> Basically unless I'm misunderstanding something, I think what you're
> proposing makes theoretical sense, but without a use case behind it it
> might just be too much work with no real life benefit.

I have not done the tests to prove it, but I am pretty sure that if you
have two bridges where the same MACs are used (which does happen in
redundant topologies, and is why IVL is a thing) you will add MACs to
the ATU with a destination that is not allowed by the PVT. This will
lead to sporadic drops while the address is on the "wrong" port from the
view of one bridge, until that station sends some return traffic. At
that point of course, it will be on the wrong port according to the
other bridge.
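
The failure mode I expect can be modelled in a few lines of Python.
Again, this is a toy model and not a measurement: the port numbers are
arbitrary, and "allowed" stands in for whatever the PVT permits for the
ingress port.

```python
def bit(port):
    return 1 << port


def egress(atu, fid, dst_mac, allowed):
    """Port vector a frame is sent to; 0 means it is dropped.

    allowed is the set of ports the source may reach (the PVT's job).
    """
    dest = atu.get((fid, dst_mac), allowed)   # miss: flood within the bridge
    return dest & allowed


def run(fid_a, fid_b):
    mac = "02:00:00:00:00:aa"     # same station MAC visible in both bridges
    atu = {}
    atu[(fid_a, mac)] = bit(2)    # learned on port 2, member of bridge A
    atu[(fid_b, mac)] = bit(4)    # learned later on port 4, member of bridge B
    # Frame for that MAC entering bridge A on port 1; bridge A is ports
    # {1, 2}, so port 2 is the only egress port the PVT allows.
    return egress(atu, fid_a, mac, allowed=bit(2))


print(run(fid_a=0, fid_b=0))   # shared FID: entry points at port 4, dropped
print(run(fid_a=1, fid_b=2))   # one FID per bridge: forwarded to port 2
```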

I guess I would just rather spend some more time up front to make sure
that we support full isolation between bridges, than spend that same
time debugging issues in production environments later on.


* Re: [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
  2021-07-05  4:20   ` [Bridge] " DENG Qingfang
@ 2021-07-05  8:32     ` Tobias Waldekranz
  -1 siblings, 0 replies; 44+ messages in thread
From: Tobias Waldekranz @ 2021-07-05  8:32 UTC (permalink / raw)
  To: DENG Qingfang, Vladimir Oltean
  Cc: netdev, Jakub Kicinski, David S. Miller, Andrew Lunn,
	Florian Fainelli, Vivien Didelot, Jiri Pirko, Ido Schimmel,
	Roopa Prabhu, Nikolay Aleksandrov, Stephen Hemminger, bridge,
	Alexander Duyck

On Mon, Jul 05, 2021 at 12:20, DENG Qingfang <dqfext@gmail.com> wrote:
> Hi Vladimir,
>
> On Sat, Jul 03, 2021 at 02:56:55PM +0300, Vladimir Oltean wrote:
>> For this series I have taken Tobias' work from here:
>> https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
>> and made the following changes:
>> - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
>>   feedback on the bridge driver changes. Otherwise, the structure of the
>>   bridge changes is pretty much the same as Tobias left it.
>> - I basically rewrote the DSA infrastructure for the data plane
>>   forwarding offload, based on the commonalities with another switch
>>   driver for which I implemented this feature (not submitted here)
>> - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
>>   works but I didn't test that
>> 
>> The data plane of the software bridge can be partially offloaded to
>> switchdev, in the sense that we can trust the accelerator to:
>> (a) look up its FDB (which is more or less in sync with the software
>>     bridge FDB) for selecting the destination ports for a packet
>> (b) replicate the frame in hardware in case it's a multicast/broadcast,
>>     instead of the software bridge having to clone it and send the
>>     clones to each net device one at a time. This reduces the bandwidth
>>     needed between the CPU and the accelerator, as well as the CPU time
>>     spent.
>
> Many DSA taggers use port bit field in their TX tags, which allows
> replication in hardware. (multiple bits set = send to multiple ports)
> I wonder if the tagger API can be updated to support this.

I think you could, but it would be tricky.

The bridge does not operate using vectors/bitfields, rather it is
procedural code that you have to loop through before knowing the set of
destination ports.

This series just sends the skb to the first port in the hardware domain
and trusts the HW to calculate the same port set as the code in
br_forward.c would have.

To do what you suggest, the bridge would have to translate each nbp into
a position in a bitfield (or call out to the underlying driver to do it)
as it is looping through ports, then send the aggregated mask along with
the skb. Knowing if a port is the first one you have come across for a
given domain is very easy (just maintain a bitfield), knowing if it is
the last one is harder. So you would likely end up having to queue up
the actual transmission until after the loop has been executed, which is
hard to combine with the "lazy cloning" that you really want to get
decent performance.
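
For reference, the "first port in the hardware domain" logic boils down
to something like the following. This is a Python sketch of the idea,
not the code from the series; treating hwdom 0 as "not offloaded" is an
assumption of the sketch, and the function name is made up.

```python
def flood(ports):
    """ports: iterable of (name, hwdom) in bridge port order.

    Returns the ports an skb is actually handed to. hwdom 0 is taken to
    mean "no forwarding offload" in this sketch.
    """
    fwd_hwdoms = 0      # bitfield of hardware domains already served
    tx = []
    for name, hwdom in ports:
        if hwdom:
            if fwd_hwdoms & (1 << hwdom):
                continue            # the hardware replicates to this port
            fwd_hwdoms |= 1 << hwdom
        tx.append(name)
    return tx


print(flood([("swp0", 1), ("swp1", 1), ("swp2", 1), ("eth0", 0)]))
```

Note that the decision is made when each port is visited, which is what
lets it coexist with the lazy cloning. Collecting a full port mask
first would mean deferring every transmission until the loop is done.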

>> 
>> The data path forwarding offload is managed per "hardware domain" - a
>> generalization of the "offload_fwd_mark" concept which is being
>> introduced in this series. Every packet is delivered only once to each
>> hardware domain.
>> 
>> In addition, Tobias said in the original cover letter:
>> 
>> ====================
>> ## Overview
>> 
>>    vlan1   vlan2
>>        \   /
>>    .-----------.
>>    |    br0    |
>>    '-----------'
>>    /   /   \   \
>> swp0 swp1 swp2 eth0
>>   :   :   :
>>   (hwdom 1)
>> 
>> Up to this point, switchdevs have been trusted with offloading
>> forwarding between bridge ports, e.g. forwarding a unicast from swp0
>> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
>> series extends forward offloading to include some new classes of
>> traffic:
>> 
>> - Locally originating flows, i.e. packets that ingress on br0 that are
>>   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>>   this also includes routed flows, e.g. a packet ingressing swp0 on
>>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>>   forwarded to swp1 is "locally originating" from br0's point of view.
>> 
>> - Flows originating from "foreign" interfaces, i.e. an interface that
>>   is not offloaded by a particular switchdev instance. This includes
>>   ports belonging to other switchdev instances. A typical example
>>   would be flows from eth0 towards swp{0,1,2}.
>> 
>> The bridge still looks up its FDB/MDB as usual and then notifies the
>> switchdev driver that a particular skb should be offloaded if it
>> matches one of the classes above. It does so by using the _accel
>> version of dev_queue_xmit, supplying its own netdev as the
>> "subordinate" device. The driver can react to the presence of the
>> subordinate in its .ndo_select_queue in whatever way it needs to make
>> sure to forward the skb in much the same way that it would for packets
>> ingressing on regular ports.
>> 
>> Hardware domains to which a particular skb has been forwarded are
>> recorded so that duplicates are avoided.
>> 
>> The main performance benefit is thus seen on multicast flows. Imagine
>> for example that:
>> 
>> - An IP camera is connected to swp0 (VLAN 1)
>> 
>> - The CPU is acting as a multicast router, routing the group from VLAN
>>   1 to VLAN 2.
>> 
>> - There are subscribers for the group in question behind both swp1 and
>>   swp2 (VLAN 2).
>> 
>> With this offloading in place, the bridge need only send a single skb
>> to the driver, which will send it to the hardware marked in such a way
>> that the switch will perform the multicast replication according to
>> the MDB configuration. Naturally, the number of saved skb_clones
>> increases linearly with the number of subscribed ports.
>> 
>> As an extra benefit, on mv88e6xxx, this also allows the switch to
>> perform source address learning on these flows, which avoids having to
>> sync dynamic FDB entries over slow configuration interfaces like MDIO
>> to avoid flows directed towards the CPU being flooded as unknown
>> unicast by the switch.
>> 
>> 
>> ## RFC
>> 
>> - In general, what do you think about this idea?
>> 
>> - hwdom. What do you think about this terminology? Personally I feel
>>   that we had too many things called offload_fwd_mark, and that as the
>>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>>   might be useful to have a separate term for it.
>> 
>> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>>   and if so do you have any suggestion/preference on how to signal the
>>   offloading from the bridge down to the switchdev driver?
>> 
>> - The way that flooding is implemented in br_forward.c (lazily cloning
>>   skbs) means that you have to mark the forwarding as completed very
>>   early (right after should_deliver in maybe_deliver) in order to
>>   avoid duplicates. Is there some way to move this decision point to a
>>   later stage that I am missing?
>> 
>> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>>   compatible with multicast-to-unicast being used on a port. Then
>>   again, I think that this would also be broken for regular switchdev
>>   bridge offloading as this flag is not offloaded to the switchdev
>>   port, so there is no way for the driver to refuse it. Any ideas on
>>   how to handle this?
>> 
>> 
>> ## mv88e6xxx Specifics
>> 
>> Since we are now only receiving a single skb for both unicast and
>> multicast flows, we can tag the packets with the FORWARD command
>> instead of FROM_CPU. The switch(es) will then forward the packet in
>> accordance with its ATU, VTU, STU, and PVT configuration - just like
>> for packets ingressing on user ports.
>> 
>> Crucially, FROM_CPU is still used for:
>> 
>> - Ports in standalone mode.
>> 
>> - Flows that are trapped to the CPU and software-forwarded by a
>>   bridge. Note that these flows match neither of the classes discussed
>>   in the overview.
>> 
>> - Packets that are sent directly to a port netdev without going
>>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>>   socket.
>> 
>> We thus have a pretty clean separation where the data plane uses
>> FORWARDs and the control plane uses TO_/FROM_CPU.
>> 
>> The barrier between different bridges is enforced by port based VLANs
>> on mv88e6xxx, which in essence is a mapping from a source device/port
>> pair to an allowed set of egress ports. In order to have a FORWARD
>> frame (which carries a _source_ device/port) correctly mapped by the
>> PVT, we must use a unique pair for each bridge.
>> 
>> Fortunately, there is typically lots of unused address space in most
>> switch trees. When was the last time you saw an mv88e6xxx product
>> using more than 4 chips? Even if you found one with 16 (!) devices,
>> you would still have room to allocate 16*16 virtual ports to software
>> bridges.
>> 
>> Therefore, the mv88e6xxx driver will allocate a virtual device/port
>> pair to each bridge that it offloads. All members of the same bridge
>> are then configured to allow packets from this virtual port in their
>> PVTs.
>> ====================
>> 
>> Tobias Waldekranz (5):
>>   net: dfwd: constrain existing users to macvlan subordinates
>>   net: bridge: disambiguate offload_fwd_mark
>>   net: bridge: switchdev: recycle unused hwdoms
>>   net: bridge: switchdev: allow the data plane forwarding to be
>>     offloaded
>>   net: dsa: tag_dsa: offload the bridge forwarding process
>> 
>> Vladimir Oltean (5):
>>   net: extract helpers for binding a subordinate device to TX queues
>>   net: allow ndo_select_queue to go beyond dev->num_real_tx_queues
>>   net: dsa: track the number of switches in a tree
>>   net: dsa: add support for bridge forwarding offload
>>   net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
>>     the PVT
>> 
>>  drivers/net/dsa/mv88e6xxx/chip.c              | 106 +++++++++++-
>>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
>>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
>>  include/linux/if_bridge.h                     |   1 +
>>  include/linux/netdevice.h                     |  13 +-
>>  include/net/dsa.h                             |  37 ++++
>>  net/bridge/br_forward.c                       |  18 +-
>>  net/bridge/br_if.c                            |   4 +-
>>  net/bridge/br_private.h                       |  49 +++++-
>>  net/bridge/br_switchdev.c                     | 163 +++++++++++++++---
>>  net/bridge/br_vlan.c                          |  10 +-
>>  net/core/dev.c                                |  31 +++-
>>  net/dsa/dsa2.c                                |   3 +
>>  net/dsa/dsa_priv.h                            |  28 +++
>>  net/dsa/port.c                                |  35 ++++
>>  net/dsa/slave.c                               | 134 +++++++++++++-
>>  net/dsa/switch.c                              |  58 +++++++
>>  net/dsa/tag_dsa.c                             |  60 ++++++-
>>  19 files changed, 700 insertions(+), 59 deletions(-)
>> 
>> -- 
>> 2.25.1
>> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
  2021-07-05  8:09       ` [Bridge] " Tobias Waldekranz
@ 2021-07-05  8:54         ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-05  8:54 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Roopa Prabhu, Nikolay Aleksandrov,
	Stephen Hemminger, bridge, Alexander Duyck

On Mon, Jul 05, 2021 at 10:09:16AM +0200, Tobias Waldekranz wrote:
> On Sun, Jul 04, 2021 at 11:11, Vladimir Oltean <olteanv@gmail.com> wrote:
> > Hi Tobias,
> >
> > On Sun, Jul 04, 2021 at 12:04:26AM +0200, Tobias Waldekranz wrote:
> >> On Sat, Jul 03, 2021 at 14:56, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> >> > For this series I have taken Tobias' work from here:
> >> > https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
> >> > and made the following changes:
> >> > - I collected and integrated (hopefully all of) Nikolay's, Ido's and my
> >> >   feedback on the bridge driver changes. Otherwise, the structure of the
> >> >   bridge changes is pretty much the same as Tobias left it.
> >> > - I basically rewrote the DSA infrastructure for the data plane
> >> >   forwarding offload, based on the commonalities with another switch
> >> >   driver for which I implemented this feature (not submitted here)
> >> > - I adapted mv88e6xxx to use the new infrastructure, hopefully it still
> >> >   works but I didn't test that
> >>
> >> Hi Vladimir,
> >>
> >> Sorry that I have dropped the ball on this series. I have actually had a
> >> v1 of this queued up for a while. Unfortunately I ran into mv88e6xxx
> >> specific problems. (See below)
> >>
> >> > The data plane of the software bridge can be partially offloaded to
> >> > switchdev, in the sense that we can trust the accelerator to:
> >> > (a) look up its FDB (which is more or less in sync with the software
> >> >     bridge FDB) for selecting the destination ports for a packet
> >> > (b) replicate the frame in hardware in case it's a multicast/broadcast,
> >> >     instead of the software bridge having to clone it and send the
> >> >     clones to each net device one at a time. This reduces the bandwidth
> >> >     needed between the CPU and the accelerator, as well as the CPU time
> >> >     spent.
> >> >
> >> > The data path forwarding offload is managed per "hardware domain" - a
> >> > generalization of the "offload_fwd_mark" concept which is being
> >> > introduced in this series. Every packet is delivered only once to each
> >> > hardware domain.
> >> >
> >> > In addition, Tobias said in the original cover letter:
> >> >
> >> > ====================
> >> > ## Overview
> >> >
> >> >    vlan1   vlan2
> >> >        \   /
> >> >    .-----------.
> >> >    |    br0    |
> >> >    '-----------'
> >> >    /   /   \   \
> >> > swp0 swp1 swp2 eth0
> >> >   :   :   :
> >> >   (hwdom 1)
> >> >
> >> > Up to this point, switchdevs have been trusted with offloading
> >> > forwarding between bridge ports, e.g. forwarding a unicast from swp0
> >> > to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> >> > series extends forward offloading to include some new classes of
> >> > traffic:
> >> >
> >> > - Locally originating flows, i.e. packets that ingress on br0 that are
> >> >   to be forwarded to one or several of the ports swp{0,1,2}. Notably
> >> >   this also includes routed flows, e.g. a packet ingressing swp0 on
> >> >   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
> >> >   forwarded to swp1 is "locally originating" from br0's point of view.
> >> >
> >> > - Flows originating from "foreign" interfaces, i.e. an interface that
> >> >   is not offloaded by a particular switchdev instance. This includes
> >> >   ports belonging to other switchdev instances. A typical example
> >> >   would be flows from eth0 towards swp{0,1,2}.
> >> >
> >> > The bridge still looks up its FDB/MDB as usual and then notifies the
> >> > switchdev driver that a particular skb should be offloaded if it
> >> > matches one of the classes above. It does so by using the _accel
> >> > version of dev_queue_xmit, supplying its own netdev as the
> >> > "subordinate" device. The driver can react to the presence of the
> >> > subordinate in its .ndo_select_queue in whatever way it needs to make
> >> > sure to forward the skb in much the same way that it would for packets
> >> > ingressing on regular ports.
> >> >
> >> > Hardware domains to which a particular skb has been forwarded are
> >> > recorded so that duplicates are avoided.
> >> >
> >> > The main performance benefit is thus seen on multicast flows. Imagine
> >> > for example that:
> >> >
> >> > - An IP camera is connected to swp0 (VLAN 1)
> >> >
> >> > - The CPU is acting as a multicast router, routing the group from VLAN
> >> >   1 to VLAN 2.
> >> >
> >> > - There are subscribers for the group in question behind both swp1 and
> >> >   swp2 (VLAN 2).
> >> >
> >> > With this offloading in place, the bridge need only send a single skb
> >> > to the driver, which will send it to the hardware marked in such a way
> >> > that the switch will perform the multicast replication according to
> >> > the MDB configuration. Naturally, the number of saved skb_clones
> >> > increases linearly with the number of subscribed ports.
> >> >
> >> > As an extra benefit, on mv88e6xxx, this also allows the switch to
> >> > perform source address learning on these flows, which avoids having to
> >> > sync dynamic FDB entries over slow configuration interfaces like MDIO
> >> > to avoid flows directed towards the CPU being flooded as unknown
> >> > unicast by the switch.
> >> >
> >> >
> >> > ## RFC
> >> >
> >> > - In general, what do you think about this idea?
> >> >
> >> > - hwdom. What do you think about this terminology? Personally I feel
> >> >   that we had too many things called offload_fwd_mark, and that as the
> >> >   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
> >> >   might be useful to have a separate term for it.
> >> >
> >> > - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
> >> >   and if so do you have any suggestion/preference on how to signal the
> >> >   offloading from the bridge down to the switchdev driver?
> >> >
> >> > - The way that flooding is implemented in br_forward.c (lazily cloning
> >> >   skbs) means that you have to mark the forwarding as completed very
> >> >   early (right after should_deliver in maybe_deliver) in order to
> >> >   avoid duplicates. Is there some way to move this decision point to a
> >> >   later stage that I am missing?
> >> >
> >> > - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
> >> >   compatible with multicast-to-unicast being used on a port. Then
> >> >   again, I think that this would also be broken for regular switchdev
> >> >   bridge offloading as this flag is not offloaded to the switchdev
> >> >   port, so there is no way for the driver to refuse it. Any ideas on
> >> >   how to handle this?
> >> >
> >> >
> >> > ## mv88e6xxx Specifics
> >> >
> >> > Since we are now only receiving a single skb for both unicast and
> >> > multicast flows, we can tag the packets with the FORWARD command
> >> > instead of FROM_CPU. The switch(es) will then forward the packet in
> >> > accordance with its ATU, VTU, STU, and PVT configuration - just like
> >> > for packets ingressing on user ports.
> >> >
> >> > Crucially, FROM_CPU is still used for:
> >> >
> >> > - Ports in standalone mode.
> >> >
> >> > - Flows that are trapped to the CPU and software-forwarded by a
> >> >   bridge. Note that these flows match neither of the classes discussed
> >> >   in the overview.
> >> >
> >> > - Packets that are sent directly to a port netdev without going
> >> >   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
> >> >   socket.
> >> >
> >> > We thus have a pretty clean separation where the data plane uses
> >> > FORWARDs and the control plane uses TO_/FROM_CPU.
> >> >
> >> > The barrier between different bridges is enforced by port based VLANs
> >> > on mv88e6xxx, which in essence is a mapping from a source device/port
> >> > pair to an allowed set of egress ports.
> >>
> >> Unless I am missing something, it turns out that the PVT is not enough
> >> to support multiple (non-VLAN filtering) bridges in multi-chip
> >> setups. While the isolation barrier works, there is no way of correctly
> >> managing automatic learning.
> >>
> >> > In order to have a FORWARD
> >> > frame (which carries a _source_ device/port) correctly mapped by the
> >> > PVT, we must use a unique pair for each bridge.
> >> >
> >> > Fortunately, there is typically lots of unused address space in most
> >> > switch trees. When was the last time you saw an mv88e6xxx product
> >> > using more than 4 chips? Even if you found one with 16 (!) devices,
> >> > you would still have room to allocate 16*16 virtual ports to software
> >> > bridges.
> >> >
> >> > Therefore, the mv88e6xxx driver will allocate a virtual device/port
> >> > pair to each bridge that it offloads. All members of the same bridge
> >> > are then configured to allow packets from this virtual port in their
> >> > PVTs.
> >>
> >> So while this solution is cute, it does not work in this example:
> >>
> >>  CPU
> >>   | .-----.
> >> .-0-1-. .-0-1-.
> >> | sw0 | | sw1 |
> >> '-2-3-' '-2-3-'
> >>
> >> - [sw0p2, sw1p2] are attached to one bridge
> >> - [sw0p3, sw1p3] are attached to another bridge
> >> - Neither bridge uses VLAN filtering
> >>
> >> Since no VLAN information is available in the frames, the source addresses
> >> of FORWARDs sent over the DSA link (sw0p1, sw1p0) cannot possibly be
> >> separated into different FIDs. They will all be placed in the respective
> >> port's default FID. Thus, the two bridges are not isolated with respect
> >> to their FDBs.
> >>
> >> My current plan is therefore to start by reworking how bridges are
> >> isolated on mv88e6xxx. Roughly by allocating a reserved VID/FID pair for
> >> each non-filtering bridge. Two of these can be easily managed since both
> >> VID 0 and 4095 are illegal on the wire but allowed in the VTU - after
> >> that it gets tricky. The best scheme I have come up with is to just grab
> >> an unused VID when adding any subsequent non-filtering bridge; in the
> >> event that that VID is requested by a filtering bridge or a VLAN upper,
> >> you move the non-filtering bridge to another currently unused VID.
> >>
> >> Does that sound reasonable?
> >
> > I don't think this patch series makes the problem you are describing any
> > worse than it already is in mainline, does it?
> 
> It does not make it worse, no. But assuming that mv88e6xxx will handle
> multi-bridge using the VTU in the future (i.e. my suggestion above),
> there is no need for inventing virtual DSA dev/port tuples - we can just
> use the physical port info as the source and use the VID to signal the
> source bridge. So I am hesitant to merge the mv88e6xxx-specific changes.
> 
> > I mean even with multiple VLAN-unaware bridges spanning the same single
> > switch chip today, it is still true that you can not have two stations
> > with the same MAC address, one in one bridge and another in the other
> > bridge, right?
> 
> That is correct.
> 
> > Do you have an example when this causes issues that need to be addressed
> > immediately?
> >
> > I thought the only case where this is a real problem is when you have
> > multiple CPU ports or multiple DSA links between 2 switches, because
> > then, if learning is enabled, that same MAC address will bounce between
> > the 2 ports. For that case, the consensus was that you just can't enable
> > address learning on those ports, and you let the software manage the FDB
> > in a way that is compatible with multiple CPU ports / DSA links (install
> > the MAC DA as a sort of multicast address and let the port forwarding
> > matrix choose only one of the 2 destinations based on source port).
> >
> > Lack of FDB partitioning also used to be a problem when the standalone
> > ports were left to do address learning, but that changed too.
> 
> Funny you should mention that. The presence of standalone ports is
> actually what first shone a light on this issue for me. I was running a
> kselftest-like setup like this:
> 
>    br0
>    / \
> swp1 swp3  swp2  swp4
> 
> Physically, [swp1, swp2] and [swp3, swp4] were looped externally:
> 
>     CPU
>      |
> .----0----.
> |   sw0   |
> '-1-2-3-4-'
>   '-' '-'
> 
> I was testing automatic learning by sending out a broadcast from br0 and
> verifying that br0's MAC was learned on port 0 in the ATU - alas, it was
> not. The MAC was nowhere to be found.
> 
> Moving back to a topology without these loops, I could see that learning
> worked as expected.
> 
> Not sure how familiar you are with mv88e6xxx at this point, but the way
> learning is disabled is by clearing a port's port association vector
> (PAV). This does not, however, "disable" learning really. It just
> updates the ATU with an all-zero vector, which means "invalidate the
> entry".
> 
> So, in the example above, when the broadcast is looped back to the
> standalone ports, the port's default FID (0) will be used to invalidate
> the MAC for br0. Since they all use FID 0, the standalone ports will
> nuke the br0's FDB.

That's broken IMO. The same kind of setup works with sja1105 and felix/ocelot
(with ocelot I use tools/testing/selftests/drivers/net/ocelot/tc_flower_chains.sh
as a test case for this, with the "DUT ports" (bridged ports) part of
the same switch as the "Generator ports" (standalone ports)).
"Don't learn" means just "don't learn", the FDB entry remains where it
was, in this case on the CPU port.
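
To make the difference between the two behaviors concrete, here is a
toy model of an address table. Everything in it is an assumption made
for illustration (struct atu, atu_load(), learn_pav_style() and
learn_skip_style() are hypothetical names, and the table layout has
nothing to do with the real ATU); it only encodes the semantics as
described in this thread: on mv88e6xxx an empty PAV still updates the
entry, with an all-zero vector, while on the other switches a port with
learning disabled leaves the table alone.

```c
#include <stdint.h>
#include <string.h>

#define ATU_SIZE 8

/* Toy address table entry; portvec == 0 marks an invalid/empty slot. */
struct atu_entry {
	uint8_t mac[6];
	uint16_t fid;
	uint16_t portvec;
};

struct atu {
	struct atu_entry e[ATU_SIZE];
};

/* Return the valid entry for (mac, fid), or NULL if there is none. */
static struct atu_entry *atu_find(struct atu *atu, const uint8_t *mac,
				  uint16_t fid)
{
	int i;

	for (i = 0; i < ATU_SIZE; i++) {
		struct atu_entry *e = &atu->e[i];

		if (e->portvec && e->fid == fid && !memcmp(e->mac, mac, 6))
			return e;
	}

	return NULL;
}

/* Load (mac, fid) with the given port vector; a zero vector invalidates
 * an existing entry. A full table silently drops the new address.
 */
static void atu_load(struct atu *atu, const uint8_t *mac, uint16_t fid,
		     uint16_t portvec)
{
	struct atu_entry *e = atu_find(atu, mac, fid);
	int i;

	if (e) {
		e->portvec = portvec;
		return;
	}
	if (!portvec)
		return;

	for (i = 0; i < ATU_SIZE; i++) {
		e = &atu->e[i];
		if (!e->portvec) {
			memcpy(e->mac, mac, 6);
			e->fid = fid;
			e->portvec = portvec;
			return;
		}
	}
}

/* Behavior described for mv88e6xxx: the source address is always written
 * with the ingress port's PAV, so an empty PAV invalidates the entry.
 */
static void learn_pav_style(struct atu *atu, const uint8_t *smac,
			    uint16_t fid, uint16_t pav)
{
	atu_load(atu, smac, fid, pav);
}

/* "Don't learn" meaning just "don't learn": the table is left untouched. */
static void learn_skip_style(struct atu *atu, const uint8_t *smac,
			     uint16_t fid, int learning, uint16_t portvec)
{
	if (!learning)
		return;
	atu_load(atu, smac, fid, portvec);
}
```

In this model, if br0's MAC sits in FID 0 pointing at the CPU port and a
standalone port in the same FID receives a frame with that source
address, learn_skip_style() with learning off keeps the entry, whereas
learn_pav_style() with pav 0 makes atu_find() return NULL afterwards.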

I imagine that if the "no learning" bit is being set on an access port
as a security measure (you don't trust that whoever connects the cable
won't attempt to spoof MAC addresses and fill the FDB), then it won't be
effective at all on mv88e6xxx, since the guy will at least manage to
invalidate the FDB entries pointing to other ports in the switch. File a
ticket with Marvell maybe?

With your RX filtering patches now merged, and the br0 MAC address
installed as a static entry, this is not really a problem for the setup
you described, is it? Maybe we should just keep using assisted_learning_on_cpu_port
for mv88e6xxx.

> > The hardware I am working with simply does not have any way to solve
> > this either - the FDB is simply not partitionable without VLAN
> > filtering (we have simple shared VLAN filtering, where the VID is
> > ignored and the FDB lookup is performed with VID 0, but not anything
> > more complex). So the simple solution I've been advising for people who
> > want their MAC addresses to be isolated is to create a single VLAN-aware
> > bridge and manage the VLAN broadcast domains themselves - that seems to
> > work and is simple to understand and flexible (note that I am going to
> > send a patch at some point to prevent the user from partitioning a
> > sja1105 switch tree into multiple VLAN-aware bridges).
> 
> Essentially what I am proposing is to always run mv88e6xxx VLAN-aware
> internally. Then if you have bridges that disable VLAN filtering, you
> change the ingress policy on the member ports to classify all incoming
> traffic as untagged and assign them to the port's PVID. (Note: this is
> different from "force PVID" in that you never pop any tags from the
> frame)
> 
> Is your device capable of operating in that mode?

Yes, this is how the switches I maintain work in VLAN-unaware mode.
The sja1105 is always VLAN-aware, but I change the TPID by which it
recognizes VLAN tags to a bogus value (0xdadb) and all packets get
classified to the port pvid.
The ocelot/felix switch also has the concept of a "classified VLAN",
which can be derived from the port-based default with an option to look
at the VLAN header in the frame, or TCAM rules can also change it, etc.
In VLAN-unaware mode I configure the switch to not look at the VLAN
header in the frame when setting the classified VLAN.
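
As a rough illustration of the bogus-TPID trick, the classification can be
thought of as below. This is a sketch under assumptions, not driver code:
struct port_cfg and classify_vid() are hypothetical names, and real
hardware has more inputs to this decision than shown here.

```c
#include <assert.h>
#include <stdint.h>

#define ETH_P_8021Q	0x8100	/* standard C-VLAN TPID */
#define BOGUS_TPID	0xdadb	/* value mentioned above for VLAN-unaware mode */

/* Hypothetical per-port configuration. */
struct port_cfg {
	uint16_t tpid;	/* EtherType that the port recognizes as a VLAN tag */
	uint16_t pvid;	/* port-based default VLAN */
};

/* Return the classified VLAN for a frame whose outer EtherType and TCI are
 * given. If the EtherType does not match the configured TPID, the frame is
 * treated as untagged and gets the pvid; the tag itself is left in place.
 */
static uint16_t classify_vid(const struct port_cfg *cfg,
			     uint16_t frame_ethertype, uint16_t frame_tci)
{
	uint16_t vid;

	if (frame_ethertype != cfg->tpid)
		return cfg->pvid;

	vid = frame_tci & 0x0fff;
	return vid ? vid : cfg->pvid;	/* priority-tagged frames use the pvid */
}
```

With tpid set to BOGUS_TPID, an ordinary 802.1Q frame never matches, so
every frame ends up in the port pvid regardless of the VID it carries.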

Note that due to external reasons, in VLAN-unaware mode the sja1105 and
ocelot/felix switches use different classified VLANs:
- sja1105 uses VID 1024 as pvid for switch ID 0, port 0; 1025 for switch
  ID 0, port 1, etc. See net/dsa/tag_8021q.c for details.
- ocelot uses VID 0 for all frames.
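
For reference, the per-port VIDs can be computed along these lines. This
is a sketch from memory of the bit layout used by tag_8021q at the time
(direction in bits 11:10, switch ID in bits 8:6, port in bits 3:0); treat
the layout as an assumption and check net/dsa/tag_8021q.c. The macro and
function names here are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, not taken from this thread. */
#define VID_DIR_RX	(1u << 10)	/* RX VLANs start at 1024 */
#define VID_DIR_TX	(2u << 10)	/* TX VLANs start at 2048 */
#define VID_SWITCH_ID(x)	(((x) & 0x7u) << 6)
#define VID_PORT(x)		((x) & 0xfu)

/* pvid used to identify the source port of a received frame */
static uint16_t rx_vid(unsigned int switch_id, unsigned int port)
{
	assert(switch_id < 8 && port < 16);
	return VID_DIR_RX | VID_SWITCH_ID(switch_id) | VID_PORT(port);
}

/* VLAN used by the CPU to target exactly one egress port */
static uint16_t tx_vid(unsigned int switch_id, unsigned int port)
{
	assert(switch_id < 8 && port < 16);
	return VID_DIR_TX | VID_SWITCH_ID(switch_id) | VID_PORT(port);
}
```

Under this layout rx_vid(0, 0) is 1024 and rx_vid(0, 1) is 1025, and all
RX and TX VIDs fall inside the 1024-3071 range.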

On sja1105, because every port has a unique pvid, I necessarily have to
configure it for shared address learning and ignore the VID during FDB
lookup (see commit 6d7c7d948a2e ("net: dsa: sja1105: Fix broken learning
with vlan_filtering disabled")). This also means that the VID is always
looked up as zero, which means I cannot do proper FDB partitioning.
Maybe if it had the option of a more complex shared VLAN learning, where
the VID can be 0 in the FDB lookup of some ports, 1 in others, and so on,
then it could be made to work. But it doesn't.

On ocelot/felix, I suppose that can be done: the classified VLAN in
VLAN-unaware mode can be derived from the "bridge ID", and 0 can be used
just for standalone ports. But I'm not really interested in adding
support for that, I don't see a use case where it will make a
significant difference.

> > Basically unless I'm misunderstanding something, I think what you're
> > proposing makes theoretical sense, but without a use case behind it it
> > might just be too much work with no real life benefit.
> 
> I have not done the tests to prove it, but I am pretty sure that if you
> have two bridges where the same MACs are used (which does happen in
> redundant topologies, and is why IVL is a thing) you will add MACs to
> the ATU with a destination that is not allowed by the PVT. This will
> lead to sporadic drops while the address is on the "wrong" port from the
> view of one bridge, until that station sends some return traffic. At
> that point of course, it will be on the wrong port according to the
> other bridge.

The redundant topologies I am familiar with (IEEE 802.1CB) will disable
address learning. HSR too, probably. There's no point in learning where
a packet came from if it is consistently going to come from multiple
sources.

Your VLAN-aware bridge should have independent VLAN learning enabled.
There is really no disadvantage to having a single VLAN-aware bridge as
opposed to multiple VLAN-unaware bridges. You can manage your forwarding
domains on a per-VLAN basis with even more flexibility.

Even if you add support for VLAN-unaware FDB partitioning via unique
classified VLANs per bridge domain, you will probably still have a
single ageing timer for your entire FDB. DSA chooses the
dsa_switch_fastest_ageing_time() for you, assuming that this will be the
case. You are unlikely to find hardware which is fully partitionable; you
will always find some corner cases.
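
The idea behind picking the fastest ageing time can be sketched as
follows. This is an illustration of the principle only, with a made-up
signature; it is not the body of dsa_switch_fastest_ageing_time().

```c
#include <assert.h>

/* Given the ageing time requested for one bridge and the ageing times
 * already requested on the other ports of the switch (0 meaning "not
 * set"), return the smallest one. With a single hardware timer, the
 * bridge with the shortest ageing time wins for the whole FDB.
 */
static unsigned int fastest_ageing_time(const unsigned int *port_ageing,
					int num_ports,
					unsigned int requested)
{
	unsigned int best = requested;

	assert(num_ports >= 0);
	for (int i = 0; i < num_ports; i++)
		if (port_ageing[i] && port_ageing[i] < best)
			best = port_ageing[i];

	return best;
}
```

So if one bridge asks for 300 s and another for 30 s, entries belonging to
both bridges age out after 30 s.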

> I guess I would just rather spend some more time up front to make sure
> that we support full isolation between bridges, than spend that same
> time debugging issues in production environments later on.

I have nothing against dropping the mv88e6xxx patches and replacing them
with sja1105 support (even if that will have the same "issues" that you
describe here).

On sja1105, there is actually no DSA tagging support for data plane
packets, only for control packets. This is both a good and a bad thing:
bad because it's hard to work with given the current control-only
infrastructure, but good because everything is a VLAN, really.

Currently, a packet that is not PTP or STP is sent using a tag_8021q
"TX VLAN", whose broadcast domain contains only 2 ports: the CPU port
and the desired egress port. The TX VLAN is popped on egress, so it acts
as a de facto DSA tag.

With bridge data plane offload, all that needs to be done is tag_8021q
sets up a bridge forwarding offload TX VLAN corresponding to each bridge
number, and this is a multicast VLAN: it contains the CPU port as well
as all ports that have joined that bridge. It will not leak outside of
the software bridge's broadcast domain, and the packet will be looked up
in the FDB.
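
The membership of such a per-bridge TX VLAN could be computed roughly like
this. It is a sketch with hypothetical names (bridge_num[], STANDALONE,
bridge_tx_vlan_members()), assuming each user port records the number of
the bridge it has joined.

```c
#include <assert.h>
#include <stdint.h>

#define STANDALONE	(-1)	/* port is not a member of any bridge */

/* bridge_num[port] holds the bridge number a port has joined, or
 * STANDALONE. The returned bitmask is the broadcast domain of the
 * forwarding offload TX VLAN for bridge "br": the CPU port plus every
 * port that joined that bridge.
 */
static uint32_t bridge_tx_vlan_members(const int *bridge_num, int num_ports,
				       int cpu_port, int br)
{
	uint32_t members;

	assert(num_ports > 0 && num_ports <= 32);
	assert(cpu_port >= 0 && cpu_port < num_ports);

	members = 1u << cpu_port;
	for (int port = 0; port < num_ports; port++)
		if (port != cpu_port && bridge_num[port] == br)
			members |= 1u << port;

	return members;
}
```

Standalone ports never appear in any of these masks, which is what keeps
a packet sent with the bridge's TX VLAN inside that bridge.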

Actually what I described above is only true for VLAN-unaware bridging.
When we offload a VLAN-aware bridge, we let the packet slide into the
switch precisely with the VLAN ID that came from the bridge (basically
the packet looks the same as if it was sent through a socket on the DSA
master, we don't pretend that we construct a DSA tag at all because we
don't, but now we have all the advantages of it coming from the bridge
device, like you can now bridge the sja1105 with foreign interfaces).
This is why we need to restrict the user to a single VLAN-aware bridge,
otherwise port separation could not be maintained. When some ports of a
sja1105 device are part of a VLAN-aware bridge, the standalone ports are
still targeted with the precise TX VLAN, and the tag_8021q VLANs (the
1024-3071 range) cannot be installed in the bridge VLAN database or as
8021q uppers.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices
  2021-07-05  8:32     ` [Bridge] " Tobias Waldekranz
@ 2021-07-05  9:57       ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-05  9:57 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: DENG Qingfang, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Jiri Pirko, Ido Schimmel, Roopa Prabhu, Nikolay Aleksandrov,
	Stephen Hemminger, bridge, Alexander Duyck

On Mon, Jul 05, 2021 at 10:32:04AM +0200, Tobias Waldekranz wrote:
> > Many DSA taggers use port bit field in their TX tags, which allows
> > replication in hardware. (multiple bits set = send to multiple ports)
> > I wonder if the tagger API can be updated to support this.
>
> I think you could, but it would be tricky.
>
> The bridge does not operate using vectors/bitfields, rather it is
> procedural code that you have to loop through before knowing the set of
> destination ports.
>
> This series just sends the skb to the first port in the hardware domain
> and trusts the HW to calculate the same port set as the code in
> br_forward.c would have.
>
> To do what you suggest, the bridge would have to translate each nbp into
> a position in a bitfield (or call out to the underlying driver to do it)
> as it is looping through ports, then send the aggregated mask along with
> the skb. Knowing if a port is the first one you have come across for a
> given domain is very easy (just maintain a bitfield), knowing if it is
> the last one is harder. So you would likely end up having to queue up
> the actual transmission until after the loop has been executed, which
> is hard to combine with the "lazy cloning" that you really want to get
> decent performance.

In addition to changing the bridge in order to get the entire bit mask,
one also has to somehow propagate that bit mask per skb down to the
driver, which might be tricky in itself. There is currently no
bridge-specific data structure passed between the bridge and the
switchdev driver; it is just the struct net_device *sb_dev. A hacky
solution I can imagine is for the bridge to kzalloc() a small data
structure like:

struct bridge_fwd_offload_accel_priv {
	struct net_device *sb_dev; /* Must be first! */
	unsigned long port_mask;
};

and call as follows:

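int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct bridge_fwd_offload_accel_priv *accel_priv = NULL;

	if (br_switchdev_accels_skb(skb)) {
		accel_priv = kzalloc(sizeof(*accel_priv), GFP_ATOMIC);
		if (!accel_priv) {
			/* Don't leak the skb on allocation failure */
			kfree_skb(skb);
			return -ENOMEM;
		}

		accel_priv->sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
		/* port_mask: assumed to be collected by the forwarding loop */
		accel_priv->port_mask = port_mask;
	}

	/* Disguise accel_priv as the sb_dev argument */
	return dev_queue_xmit_accel(skb, (struct net_device *)accel_priv);
}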

This way, the code in net/core/dev.c can be left unmodified. We give it
an accel_priv pointer but it can think it is only looking at a sb_dev
pointer, since that is the first element in the structure.

But then the switchdev driver must kfree(accel_priv) in the xmit function.

Not really nice, but for a cuter solution, I think we would need to extend struct sk_buff.
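
The "must be first" trick above can be tried out in userspace. The sketch
below uses a stand-in struct net_device and made-up helper names
(accel_priv_alloc(), accel_priv_consume()); it only demonstrates the C
rule being relied upon, not how the bridge or a switchdev driver would
really be wired up.

```c
#include <assert.h>
#include <stdlib.h>

struct net_device {		/* stand-in for the kernel type */
	int ifindex;
};

struct bridge_fwd_offload_accel_priv {
	struct net_device *sb_dev;	/* Must be first! */
	unsigned long port_mask;
};

/* Bridge side: allocate the wrapper and hand it out as an opaque handle. */
static void *accel_priv_alloc(struct net_device *brdev,
			      unsigned long port_mask)
{
	struct bridge_fwd_offload_accel_priv *priv;

	priv = calloc(1, sizeof(*priv));
	if (!priv)
		return NULL;

	priv->sb_dev = brdev;
	priv->port_mask = port_mask;
	return priv;
}

/* Driver side: recover both fields from the handle, then free it.
 * Returns the port mask and stores the bridge device in *brdev.
 */
static unsigned long accel_priv_consume(void *handle,
					struct net_device **brdev)
{
	struct bridge_fwd_offload_accel_priv *priv = handle;
	unsigned long port_mask = priv->port_mask;

	/* C guarantees that a struct pointer aliases its first member */
	assert((void *)&priv->sb_dev == handle);

	*brdev = priv->sb_dev;
	free(priv);
	return port_mask;
}
```

One caveat that this makes visible: what the handle aliases is
&priv->sb_dev, i.e. a pointer to the sb_dev pointer. So the scheme only
holds as long as everything between the bridge and the driver passes the
handle through without dereferencing it as a struct net_device.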

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded
  2021-07-03 11:56   ` [Bridge] " Vladimir Oltean
  (?)
@ 2021-07-09 13:16   ` Grygorii Strashko
  2021-07-09 14:09       ` [Bridge] " Vladimir Oltean
  -1 siblings, 1 reply; 44+ messages in thread
From: Grygorii Strashko @ 2021-07-09 13:16 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck



On 03/07/2021 14:56, Vladimir Oltean wrote:
> From: Tobias Waldekranz <tobias@waldekranz.com>
> 
> Allow switchdevs to forward frames from the CPU in accordance with the
> bridge configuration in the same way as is done between bridge
> ports. This means that the bridge will only send a single skb towards
> one of the ports under the switchdev's control, and expects the driver
> to deliver the packet to all eligible ports in its domain.
> 
> Primarily this improves the performance of multicast flows with
> multiple subscribers, as it allows the hardware to perform the frame
> replication.
> 
> The basic flow between the driver and the bridge is as follows:
> 
> - The switchdev accepts the offload by returning a non-null pointer
>    from .ndo_dfwd_add_station when the port is added to the bridge.
> 
> - The bridge sends offloadable skbs to one of the ports under the
>    switchdev's control using dev_queue_xmit_accel.
> 
> - The switchdev notices the offload by checking for a non-NULL
>    "sb_dev" in the core's call to .ndo_select_queue.

Sorry, I could be missing something.

Is there any possibility to just mark the skb itself as "fwd_offload" (or something similar), so the driver can
just check it and decide what to do? Following your series:
- BR itself will send the packet only once to one port if fwd offload is possible and supported
- the switchdev driver can check/negotiate the BR_FWD_OFFLOAD flag

In our case, TI CPSW can send a directed packet (the default now) by specifying port_id in the DMA desc,
or keep port_id == 0, which will allow the HW to process the packet internally, including MC duplication.

Sorry again, but the necessity to add 3 callbacks and manipulate a "virtual" queue to achieve
MC offload (which seems like one of the primary goals) from BR itself looks a bit over-complicated :(
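
The idea above could be sketched roughly as follows. This is a userspace mock, not the CPSW driver: `struct mock_skb`, its `fwd_offload` bit and `mock_cpsw_tx_port_id()` are hypothetical names. The only facts taken from the description are that a non-zero port_id in the DMA descriptor gives a directed packet, while port_id == 0 lets the hardware process the packet internally.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-skb "fwd_offload" mark suggested above. */
struct mock_skb {
	bool fwd_offload;
};

/*
 * Pick the port_id for the TX DMA descriptor.
 * egress_port is the non-zero port number the net_device maps to (assumption).
 * Returns 0 to let the hardware look up its tables and replicate the frame,
 * or the port number for a directed packet.
 */
static unsigned int mock_cpsw_tx_port_id(const struct mock_skb *skb,
					 unsigned int egress_port)
{
	assert(egress_port != 0); /* 0 is reserved for "let the HW decide" */

	if (skb->fwd_offload)
		return 0;

	return egress_port;
}
```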

> 
> v1->v2:
> - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long
> - introduce a static key "br_switchdev_fwd_offload_used" to minimize the
>    impact of the newly introduced feature on all the setups which don't
>    have hardware that can make use of it
> - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache
>    line access
> - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in
>    __br_forward()
> - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge
>    is being used
> - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP
> 
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>   include/linux/if_bridge.h |  1 +
>   net/bridge/br_forward.c   | 18 +++++++-
>   net/bridge/br_private.h   | 24 +++++++++++
>   net/bridge/br_switchdev.c | 87 +++++++++++++++++++++++++++++++++++++--

[...]
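
As an illustration of the "single skb per hardware domain" behaviour described in the quoted commit message, a userspace sketch of the bookkeeping might look like this. `struct mock_port`, `mock_should_xmit()` and the bit layout of the mask are assumptions made for illustration; only the idea of a plain unsigned long `fwd_hwdoms` bitmask comes from the changelog above.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

struct mock_port {
	int hwdom;        /* 0 means "no hardware domain" (assumption) */
	bool fwd_offload; /* port accepted the forwarding offload */
};

/*
 * Decide whether the bridge still has to transmit on this port.
 * fwd_hwdoms has one bit per hardware domain that was already handed the
 * frame for hardware replication.
 */
static bool mock_should_xmit(const struct mock_port *p,
			     unsigned long *fwd_hwdoms)
{
	const int max_hwdom = (int)(sizeof(unsigned long) * CHAR_BIT);
	unsigned long bit;

	/* Plain software forwarding: one transmission per port. */
	if (!p->fwd_offload || p->hwdom <= 0 || p->hwdom > max_hwdom)
		return true;

	bit = 1UL << (p->hwdom - 1);
	if (*fwd_hwdoms & bit)
		return false; /* the hardware already replicates to this port */

	*fwd_hwdoms |= bit; /* first port seen in this domain carries the skb */
	return true;
}
```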

-- 
Best regards,
grygorii

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 net-next 02/10] net: bridge: disambiguate offload_fwd_mark
  2021-07-03 11:56   ` [Bridge] " Vladimir Oltean
  (?)
@ 2021-07-09 13:23   ` Grygorii Strashko
  -1 siblings, 0 replies; 44+ messages in thread
From: Grygorii Strashko @ 2021-07-09 13:23 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller
  Cc: Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Tobias Waldekranz, Roopa Prabhu,
	Nikolay Aleksandrov, Stephen Hemminger, bridge, Alexander Duyck

Hi

On 03/07/2021 14:56, Vladimir Oltean wrote:
> From: Tobias Waldekranz <tobias@waldekranz.com>
> 
> Before this change, four related - but distinct - concepts were named
> offload_fwd_mark:
> 
> - skb->offload_fwd_mark: Set by the switchdev driver if the underlying
>    hardware has already forwarded this frame to the other ports in the
>    same hardware domain.
> 
> - nbp->offload_fwd_mark: An identifier used to group ports that share
>    the same hardware forwarding domain.
> 
> - br->offload_fwd_mark: Counter used to make sure that unique IDs are
>    used in cases where a bridge contains ports from multiple hardware
>    domains.
> 
> - skb->cb->offload_fwd_mark: The hardware domain on which the frame
>    ingressed and was forwarded.
> 
> Introduce the term "hardware forwarding domain" ("hwdom") in the
> bridge to denote a set of ports with the following property:
> 
>      If an skb with skb->offload_fwd_mark set, is received on a port
>      belonging to hwdom N, that frame has already been forwarded to all
>      other ports in hwdom N.
> 
> By decoupling the name from "offload_fwd_mark", we can extend the
> term's definition in the future - e.g. to add constraints that
> describe expected egress behavior - without overloading the meaning of
> "offload_fwd_mark".
> 
> - nbp->offload_fwd_mark thus becomes nbp->hwdom.
> 
> - br->offload_fwd_mark becomes br->last_hwdom.
> 
> - skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. The slight
>    change in naming here mandates a slight change in behavior of the
>    nbp_switchdev_frame_mark() function. Previously, it only set this
>    value in skb->cb for packets with skb->offload_fwd_mark true (ones
>    which were forwarded in hardware). Whereas now we always track the
>    incoming hwdom for all packets coming from a switchdev (even for the
>    packets which weren't forwarded in hardware, such as STP BPDUs, IGMP
>    reports etc). As all uses of skb->cb->offload_fwd_mark were already
>    gated behind checks of skb->offload_fwd_mark, this will not introduce
>    any functional change, but it paves the way for future changes where
>    the ingressing hwdom must be known for frames coming from a switchdev
>    regardless of whether they were forwarded in hardware or not
>    (basically, if the skb comes from a switchdev, skb->cb->src_hwdom now
>    always tracks which one).
> 
>    A typical example where this is relevant: the switchdev has a fixed
>    configuration to trap STP BPDUs, but STP is not running on the bridge
>    and the group_fwd_mask allows them to be forwarded. Say we have this
>    setup:
> 
>          br0
>         / | \
>        /  |  \
>    swp0 swp1 swp2
> 
>    A BPDU comes in on swp0 and is trapped to the CPU; the driver does not
>    set skb->offload_fwd_mark. The bridge determines that the frame should
>    be forwarded to swp{1,2}. It is imperative that forward offloading is
>    _not_ allowed in this case, as the source hwdom is already "poisoned".
> 
>    Recording the source hwdom allows this case to be handled properly.
> 
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---
>   net/bridge/br_if.c        |  2 +-
>   net/bridge/br_private.h   | 10 +++++-----
>   net/bridge/br_switchdev.c | 16 ++++++++--------
>   3 files changed, 14 insertions(+), 14 deletions(-)
> 
[...]

Thank you. I very much like this patch by itself, as it properly
clarifies things which caused much headache (at least for me).

I hope it can be moved forward regardless of the rest of the series.
Minor comment: it would be good to add in-code documentation for the added/renamed struct fields.
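
As a rough idea of what such in-code documentation could say, here is a userspace sketch with mocked structures. The comments paraphrase the commit message; the struct layouts and `mock_allowed_egress()` are simplified assumptions and not the actual definitions in net/bridge/br_private.h (for instance, src_hwdom lives in skb->cb in the patch, not in the skb itself).

```c
#include <assert.h>
#include <stdbool.h>

struct mock_bridge {
	/* Last hardware domain ID handed out; keeps IDs unique per bridge. */
	int last_hwdom;
};

struct mock_bridge_port {
	/* Hardware forwarding domain of this port; 0 if none (assumption). */
	int hwdom;
};

struct mock_skb {
	/* Set by the switchdev driver if the hardware already forwarded it. */
	bool offload_fwd_mark;
	/* Hardware domain the frame ingressed on, tracked for every frame
	 * coming from a switchdev, whether hardware forwarded it or not.
	 */
	int src_hwdom;
};

/*
 * Software forwarding to port p is needed unless the hardware already
 * forwarded the frame within the hwdom that p belongs to.
 */
static bool mock_allowed_egress(const struct mock_bridge_port *p,
				const struct mock_skb *skb)
{
	return !skb->offload_fwd_mark || skb->src_hwdom != p->hwdom;
}
```

With a trapped BPDU (offload_fwd_mark not set), the check still lets the bridge forward in software to ports in the source hwdom, which matches the swp0/swp1/swp2 example in the commit message.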

Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com>

-- 
Best regards,
grygorii

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded
  2021-07-09 13:16   ` Grygorii Strashko
@ 2021-07-09 14:09       ` Vladimir Oltean
  0 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-09 14:09 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: netdev, Jakub Kicinski, David S. Miller, Andrew Lunn,
	Florian Fainelli, Vivien Didelot, Jiri Pirko, Ido Schimmel,
	Tobias Waldekranz, Roopa Prabhu, Nikolay Aleksandrov,
	Stephen Hemminger, bridge, Alexander Duyck

Hi Grygorii,

On Fri, Jul 09, 2021 at 04:16:13PM +0300, Grygorii Strashko wrote:
> On 03/07/2021 14:56, Vladimir Oltean wrote:
> > From: Tobias Waldekranz <tobias@waldekranz.com>
> >
> > Allow switchdevs to forward frames from the CPU in accordance with the
> > bridge configuration in the same way as is done between bridge
> > ports. This means that the bridge will only send a single skb towards
> > one of the ports under the switchdev's control, and expects the driver
> > to deliver the packet to all eligible ports in its domain.
> >
> > Primarily this improves the performance of multicast flows with
> > multiple subscribers, as it allows the hardware to perform the frame
> > replication.
> >
> > The basic flow between the driver and the bridge is as follows:
> >
> > - The switchdev accepts the offload by returning a non-null pointer
> >    from .ndo_dfwd_add_station when the port is added to the bridge.
> >
> > - The bridge sends offloadable skbs to one of the ports under the
> >    switchdev's control using dev_queue_xmit_accel.
> >
> > - The switchdev notices the offload by checking for a non-NULL
> >    "sb_dev" in the core's call to .ndo_select_queue.
>
> Sorry, I could be missing something.
>
> Is there any possibility to just mark the skb itself as "fwd_offload" (or something similar), so the driver can
> just check it and decide what to do? Following your series:
> - BR itself will send the packet only once to one port if fwd offload is possible and supported
> - the switchdev driver can check/negotiate the BR_FWD_OFFLOAD flag
>
> In our case, TI CPSW can send a directed packet (the default now) by specifying port_id in the DMA desc,
> or keep port_id == 0, which will allow the HW to process the packet internally, including MC duplication.
>
> Sorry again, but the necessity to add 3 callbacks and manipulate a "virtual" queue to achieve
> MC offload (which seems like one of the primary goals) from BR itself looks a bit over-complicated :(

After cutting my teeth myself with Tobias' patches, I tend to agree with
the idea that the macvlan offload framework is not a great fit for the
software bridge data plane TX offloading. Some reasons:
- the sb_dev pointer is necessary for macvlan because you can have
  multiple macvlan uppers and you need to know which one this packet
  came from. Whereas in the case of a bridge, any given switchdev net
  device can have a single bridge upper. So a single bit per skb,
  possibly even skb->offload_fwd_mark, could be used to encode this bit
  of information: please look up your FDB for this packet and
  forward/replicate it accordingly.
- I am a bit on the fence about the "net: allow ndo_select_queue to go
  beyond dev->num_real_tx_queues" and "net: extract helpers for binding
  a subordinate device to TX queues" patches, they look like the wrong
  approach overall, just to shoehorn our use case into a framework that
  was not meant to cover it.
- most importantly: Ido asked about the possibility for a switchdev to
  accelerate the data plane for a bridge port that is a LAG upper. In the
  current design, where the bridge attempts to call the
  .ndo_dfwd_add_station method of the bond/team driver, this will not
  work. Traditionally, switchdev has migrated away from ndo's towards
  notifiers because of the ability for a switchdev to intercept the
  notifier emitted by the bridge for the bonding interface, and to treat
  it by itself. So, logically speaking, it would make more sense to
  introduce a new switchdev notifier for TX data plane offloading per
  port. Actually, now that I'm thinking even more about this, it would
  be great not only if we could migrate towards notifiers, but if the
  notification could be emitted by the switchdev driver itself, at
  bridge join time. Once upon a time I had an RFC patch that changed all
  switchdev drivers to inform the bridge that they are capable of
  offloading the RX data plane:
  https://patchwork.kernel.org/project/netdevbpf/patch/20210318231829.3892920-17-olteanv@gmail.com/
  That patch was necessary because the bridge, when it sees a bridge
  port that is a LAG, and the LAG is on top of a switchdev, will assign
  the port hwdom based on the devlink switch ID of the switchdev. This
  is wrong because it assumes that the switchdev offloads the LAG, but
  in the vast majority of cases this is false: only a handful of
  switchdev drivers have LAG offload right now. So the expectation is
  that the bridge can do software forwarding between such a LAG, comprised
  of two switchdev interfaces, and a third (standalone) switchdev
  interface, but it doesn't do that, because to the bridge, all ports
  have the same hwdom.
  Now it seems common sense that I pick up this patch again and make the
  switchdev drivers give 2 pieces of information:
  (a) can I offload the RX data path
  (b) can I offload the TX data path

I can try to draft another RFC with these changes.
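
A minimal sketch of how those two pieces of information could be used, under heavy assumptions: `struct mock_offload_caps`, `mock_assign_hwdom()` and `mock_tx_offload_enabled()` are invented for illustration and do not correspond to an existing switchdev API. The point is only that the hwdom would follow what the driver declares, rather than the switch ID alone.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port answers given by the switchdev driver at join time. */
struct mock_offload_caps {
	bool rx_fwd_offload; /* (a) hardware forwards between ports on RX */
	bool tx_fwd_offload; /* (b) hardware replicates frames sent by the CPU */
};

/*
 * Return the hwdom for a bridge port. switch_hwdom is the domain that would
 * be derived from the switch ID; 0 means "no hardware domain", so the bridge
 * forwards in software to and from that port.
 */
static int mock_assign_hwdom(int switch_hwdom,
			     const struct mock_offload_caps *caps)
{
	return caps->rx_fwd_offload ? switch_hwdom : 0;
}

/* TX forwarding offload only makes sense for ports inside a hardware domain. */
static bool mock_tx_offload_enabled(int hwdom,
				    const struct mock_offload_caps *caps)
{
	return hwdom != 0 && caps->tx_fwd_offload;
}
```

In this model, a LAG on top of a switchdev that cannot offload it would report neither capability, end up with hwdom 0, and be software-forwarded like any other foreign interface.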

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded
  2021-07-09 14:09       ` [Bridge] " Vladimir Oltean
@ 2021-07-12 12:28         ` Tobias Waldekranz
  -1 siblings, 0 replies; 44+ messages in thread
From: Tobias Waldekranz @ 2021-07-12 12:28 UTC (permalink / raw)
  To: Vladimir Oltean, Grygorii Strashko
  Cc: netdev, Jakub Kicinski, David S. Miller, Andrew Lunn,
	Florian Fainelli, Vivien Didelot, Jiri Pirko, Ido Schimmel,
	Roopa Prabhu, Nikolay Aleksandrov, Stephen Hemminger, bridge,
	Alexander Duyck

On Fri, Jul 09, 2021 at 14:09, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> Hi Grygorii,
>
> On Fri, Jul 09, 2021 at 04:16:13PM +0300, Grygorii Strashko wrote:
>> On 03/07/2021 14:56, Vladimir Oltean wrote:
>> > From: Tobias Waldekranz <tobias@waldekranz.com>
>> >
>> > Allow switchdevs to forward frames from the CPU in accordance with the
>> > bridge configuration in the same way as is done between bridge
>> > ports. This means that the bridge will only send a single skb towards
>> > one of the ports under the switchdev's control, and expects the driver
>> > to deliver the packet to all eligible ports in its domain.
>> >
>> > Primarily this improves the performance of multicast flows with
>> > multiple subscribers, as it allows the hardware to perform the frame
>> > replication.
>> >
>> > The basic flow between the driver and the bridge is as follows:
>> >
>> > - The switchdev accepts the offload by returning a non-null pointer
>> >    from .ndo_dfwd_add_station when the port is added to the bridge.
>> >
>> > - The bridge sends offloadable skbs to one of the ports under the
>> >    switchdev's control using dev_queue_xmit_accel.
>> >
>> > - The switchdev notices the offload by checking for a non-NULL
>> >    "sb_dev" in the core's call to .ndo_select_queue.
>>
>> Sorry, I could be missing something.
>>
>> Is there any possibility to just mark the skb itself as "fwd_offload" (or something similar), so the driver can
>> just check it and decide what to do? Following your series:
>> - BR itself will send the packet only once to one port if fwd offload is possible and supported
>> - the switchdev driver can check/negotiate the BR_FWD_OFFLOAD flag
>>
>> In our case, TI CPSW can send a directed packet (the default now) by specifying port_id in the DMA desc,
>> or keep port_id == 0, which will allow the HW to process the packet internally, including MC duplication.
>>
>> Sorry again, but the necessity to add 3 callbacks and manipulate a "virtual" queue to achieve
>> MC offload (which seems like one of the primary goals) from BR itself looks a bit over-complicated :(
>
> After cutting my teeth myself with Tobias' patches, I tend to agree with
> the idea that the macvlan offload framework is not a great fit for the
> software bridge data plane TX offloading. Some reasons:

I agree. I was trying to find an API that would not require adding new
.ndos or other infrastructure. You can see in my original RFC cover that
this was something I wrestled with. 

> - the sb_dev pointer is necessary for macvlan because you can have
>   multiple macvlan uppers and you need to know which one this packet
>   came from. Whereas in the case of a bridge, any given switchdev net
>   device can have a single bridge upper. So a single bit per skb,
>   possibly even skb->offload_fwd_mark, could be used to encode this bit
>   of information: please look up your FDB for this packet and
>   forward/replicate it accordingly.

In fact, in the version I was about to publish, I reused
skb->offload_fwd_mark to encode precisely this property. It works really
well. Maybe I should just publish it, even with the issues regarding
mv88e6xxx. Let me know if you want to take a look at it.

> - I am a bit on the fence about the "net: allow ndo_select_queue to go
>   beyond dev->num_real_tx_queues" and "net: extract helpers for binding
>   a subordinate device to TX queues" patches, they look like the wrong
>   approach overall, just to shoehorn our use case into a framework that
>   was not meant to cover it.

Yep.

> - most importantly: Ido asked about the possibility for a switchdev to
>   accelerate the data plane for a bridge port that is a LAG upper. In the
>   current design, where the bridge attempts to call the
>   .ndo_dfwd_add_station method of the bond/team driver, this will not
>   work. Traditionally, switchdev has migrated away from ndo's towards
>   notifiers because of the ability for a switchdev to intercept the
>   notifier emitted by the bridge for the bonding interface, and to treat
>   it by itself. So, logically speaking, it would make more sense to
>   introduce a new switchdev notifier for TX data plane offloading per
>   port. Actually, now that I'm thinking even more about this, it would
>   be great not only if we could migrate towards notifiers, but if the
>   notification could be emitted by the switchdev driver itself, at

I added pass-through implementations of these .ndos to make it work on
top of LAGs, but a notifier is much cleaner.

>   bridge join time. Once upon a time I had an RFC patch that changed all
>   switchdev drivers to inform the bridge that they are capable of
>   offloading the RX data plane:
>   https://patchwork.kernel.org/project/netdevbpf/patch/20210318231829.3892920-17-olteanv@gmail.com/

Really like this approach! It also opens up the possibility of disabling
it manually (something like `ethtool -K swp0 bridge-{rx, tx} off`). This
will allow you to run a DPI firewall on a specific port in a LAN, for
example.

>   That patch was necessary because the bridge, when it sees a bridge
>   port that is a LAG, and the LAG is on top of a switchdev, will assign
>   the port hwdom based on the devlink switch ID of the switchdev. This
>   is wrong because it assumes that the switchdev offloads the LAG, but
>   in the vast majority of cases this is false, only a handful of
>   switchdev drivers have LAG offload right now. So the expectation is
>   that the bridge can do software forwarding between such LAG comprised
>   of two switchdev interfaces, and a third (standalone) switchdev
>   interface, but it doesn't do that, because to the bridge, all ports
>   have the same hwdom.
>   Now it seems common sense that I pick up this patch again and make the
>   switchdev drivers give 2 pieces of information:
>   (a) can I offload the RX data path
>   (b) can I offload the TX data path
>
> I can try to draft another RFC with these changes.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [Bridge] [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded
@ 2021-07-12 12:28         ` Tobias Waldekranz
  0 siblings, 0 replies; 44+ messages in thread
From: Tobias Waldekranz @ 2021-07-12 12:28 UTC (permalink / raw)
  To: Vladimir Oltean, Grygorii Strashko
  Cc: Andrew Lunn, Florian Fainelli, Jiri Pirko, netdev, bridge,
	Alexander Duyck, Vivien Didelot, Ido Schimmel,
	Nikolay Aleksandrov, Roopa Prabhu, Jakub Kicinski,
	David S. Miller

On Fri, Jul 09, 2021 at 14:09, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> Hi Grygorii,
>
> On Fri, Jul 09, 2021 at 04:16:13PM +0300, Grygorii Strashko wrote:
>> On 03/07/2021 14:56, Vladimir Oltean wrote:
>> > From: Tobias Waldekranz <tobias@waldekranz.com>
>> >
>> > Allow switchdevs to forward frames from the CPU in accordance with the
>> > bridge configuration in the same way as is done between bridge
>> > ports. This means that the bridge will only send a single skb towards
>> > one of the ports under the switchdev's control, and expects the driver
>> > to deliver the packet to all eligible ports in its domain.
>> >
>> > Primarily this improves the performance of multicast flows with
>> > multiple subscribers, as it allows the hardware to perform the frame
>> > replication.
>> >
>> > The basic flow between the driver and the bridge is as follows:
>> >
>> > - The switchdev accepts the offload by returning a non-null pointer
>> >    from .ndo_dfwd_add_station when the port is added to the bridge.
>> >
>> > - The bridge sends offloadable skbs to one of the ports under the
>> >    switchdev's control using dev_queue_xmit_accel.
>> >
>> > - The switchdev notices the offload by checking for a non-NULL
>> >    "sb_dev" in the core's call to .ndo_select_queue.
>>
>> Sry, I could be missing smth.
>>
>> Is there any possibility to just mark skb itself as "fwd_offload" (or smth), so driver can
>> just check it and decide what to do. Following you series:
>> - BR itself will send packet only once to one port if fwd offload possible and supported
>> - switchdev driver can check/negotiate BR_FWD_OFFLOAD flag
>>
>> In our case, TI CPSW can send directed packet (default now), by specifying port_id if DMA desc
>> or keep port_id == 0 which will allow HW to process packet internally, including MC duplication.
>>
>> Sry, again, but necessity to add 3 callbacks and manipulate with "virtual" queue to achieve
>> MC offload (seems like one of the primary goals) from BR itself looks a bit over-complicated :(
>
> After cutting my teeth myself with Tobias' patches, I tend to agree with
> the idea that the macvlan offload framework is not a great fit for the
> software bridge data plane TX offloading. Some reasons:

I agree. I was trying to find an API that would not require adding new
.ndos or other infrastructure. You can see in my original RFC cover that
this was something I wrestled with. 

> - the sb_dev pointer is necessary for macvlan because you can have
>   multiple macvlan uppers and you need to know which one this packet
>   came from. Whereas in the case of a bridge, any given switchdev net
>   device can have a single bridge upper. So a single bit per skb,
>   possibly even skb->offload_fwd_mark, could be used to encode this bit
>   of information: please look up your FDB for this packet and
>   forward/replicate it accordingly.

In fact, in the version I was about to publish, I reused
skb->offload_fwd_mark to encode precisely this property. It works really
well. Maybe I should just publish it, even with the issues regarding
mv88e6xxx. Let me know if you want to take a look at it.

> - I am a bit on the fence about the "net: allow ndo_select_queue to go
>   beyond dev->num_real_tx_queues" and "net: extract helpers for binding
>   a subordinate device to TX queues" patches, they look like the wrong
>   approach overall, just to shoehorn our use case into a framework that
>   was not meant to cover it.

Yep.

> - most importantly: Ido asked about the possibility for a switchdev to
>   accelerate the data plane for a bridge port that is a LAG upper. In the
>   current design, where the bridge attempts to call the
>   .ndo_dfwd_add_station method of the bond/team driver, this will not
>   work. Traditionally, switchdev has migrated away from ndo's towards
>   notifiers because of the ability for a switchdev to intercept the
>   notifier emitted by the bridge for the bonding interface, and to treat
>   it by itself. So, logically speaking, it would make more sense to
>   introduce a new switchdev notifier for TX data plane offloading per
>   port. Actually, now that I'm thinking even more about this, it would
>   be great not only if we could migrate towards notifiers, but if the
>   notification could be emitted by the switchdev driver itself, at

I added pass-through implementations of these .ndos to make it work on
top of LAGs, but a notifier is much cleaner.

>   bridge join time. Once upon a time I had an RFC patch that changed all
>   switchdev drivers to inform the bridge that they are capable of
>   offloading the RX data plane:
>   https://patchwork.kernel.org/project/netdevbpf/patch/20210318231829.3892920-17-olteanv@gmail.com/

Really like this approach! It also opens up the possibility of disabling
it manually (something like `ethtool -K swp0 bridge-{rx, tx} off`). This
will allow you to run a DPI firewall on a specific port in a LAN, for
example.

>   That patch was necessary because the bridge, when it sees a bridge
>   port that is a LAG, and the LAG is on top of a switchdev, will assign
>   the port hwdom based on the devlink switch ID of the switchdev. This
>   is wrong because it assumes that the switchdev offloads the LAG, but
>   in the vast majority of cases this is false, only a handful of
>   switchdev drivers have LAG offload right now. So the expectation is
>   that the bridge can do software forwarding between such LAG comprised
>   of two switchdev interfaces, and a third (standalone) switchdev
>   interface, but it doesn't do that, because to the bridge, all ports
>   have the same hwdom.
>   Now it seems common sense that I pick up this patch again and make the
>   switchdev drivers give 2 pieces of information:
>   (a) can I offload the RX data path
>   (b) can I offload the TX data path
>
> I can try to draft another RFC with these changes.
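
The hwdom problem quoted above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions: struct port and all
three helpers are hypothetical names invented for illustration, not kernel
code, and the model assumes that a frame received on a port with a nonzero
hwdom has already been forwarded in hardware within that domain.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of a bridge port (not a kernel struct). */
struct port {
	int switch_id;   /* ID of the underlying switch, nonzero */
	bool offloaded;  /* does the switch driver really offload this port? */
	int hwdom;       /* hardware domain as seen by the bridge, 0 = none */
};

/* Policy described as problematic: trust the switch ID alone. */
static void assign_hwdom_by_switch_id(struct port *p)
{
	p->hwdom = p->switch_id;
}

/* Alternative: only ports the driver reports as offloaded get a hwdom. */
static void assign_hwdom_by_driver_report(struct port *p)
{
	p->hwdom = p->offloaded ? p->switch_id : 0;
}

/*
 * Simplifying assumption: a frame received on a port with a nonzero hwdom
 * counts as already forwarded in hardware inside that domain, so the
 * bridge must not forward it in software to a port of the same domain.
 */
static bool sw_forward_allowed(const struct port *src, const struct port *dst)
{
	return src->hwdom == 0 || src->hwdom != dst->hwdom;
}
```

For a non-offloaded LAG and a standalone port on the same switch, the first
policy gives both ports the same hwdom and software forwarding between them
is refused; the second policy leaves the LAG without a hwdom, so software
forwarding is allowed.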

* Re: [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded
  2021-07-12 12:28         ` [Bridge] " Tobias Waldekranz
@ 2021-07-12 13:03           ` Vladimir Oltean
  -1 siblings, 0 replies; 44+ messages in thread
From: Vladimir Oltean @ 2021-07-12 13:03 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Grygorii Strashko, netdev, Jakub Kicinski, David S. Miller,
	Andrew Lunn, Florian Fainelli, Vivien Didelot, Jiri Pirko,
	Ido Schimmel, Roopa Prabhu, Nikolay Aleksandrov,
	Stephen Hemminger, bridge, Alexander Duyck

On Mon, Jul 12, 2021 at 02:28:42PM +0200, Tobias Waldekranz wrote:
> > After cutting my teeth myself with Tobias' patches, I tend to agree with
> > the idea that the macvlan offload framework is not a great fit for the
> > software bridge data plane TX offloading. Some reasons:
>
> I agree. I was trying to find an API that would not require adding new
> .ndos or other infrastructure. You can see in my original RFC cover that
> this was something I wrestled with.
>
> > - the sb_dev pointer is necessary for macvlan because you can have
> >   multiple macvlan uppers and you need to know which one this packet
> >   came from. Whereas in the case of a bridge, any given switchdev net
> >   device can have a single bridge upper. So a single bit per skb,
> >   possibly even skb->offload_fwd_mark, could be used to encode this bit
> >   of information: please look up your FDB for this packet and
> >   forward/replicate it accordingly.
>
> In fact, in the version I was about to publish, I reused
> skb->offload_fwd_mark to encode precisely this property. It works really
> well. Maybe I should just publish it, even with the issues regarding
> mv88e6xxx. Let me know if you want to take a look at it.

I am on it already: I have a 25-patch series that is currently
undergoing testing. Yes, it changes all switchdev drivers to call
switchdev_bridge_port_offload() and switchdev_bridge_port_unoffload(),
it also moves the switchdev object replay helpers to push mode, and
only then does it hook a "bool tx_fwd_offload" argument to the
switchdev_bridge_port_offload() call.
If all goes well and I still have some time today, I will publish it
for review. Naturally, the final submissions, when net-next reopens,
will be in much smaller chunks.
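
To make the effect of a per-port TX forwarding offload flag concrete, here
is a rough user-space sketch of how many frames the bridge would hand to
drivers when flooding one packet. The names struct brport and
flood_tx_count() are hypothetical and exist only for this illustration; the
model assumes that one frame sent into a hardware domain with TX offload
enabled is replicated by the hardware to every port of that domain.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified bridge port state (not a kernel struct). */
struct brport {
	int hwdom;            /* hardware domain, 0 = none */
	bool tx_fwd_offload;  /* driver asked for TX forwarding offload */
};

/*
 * Count the frames passed down for one flooded packet.
 * Ports without TX offload (or without a hwdom) each get their own clone.
 * Ports with TX offload share a single frame per hardware domain.
 */
static size_t flood_tx_count(const struct brport *ports, size_t n)
{
	size_t frames = 0;

	for (size_t i = 0; i < n; i++) {
		bool already_sent = false;

		if (!ports[i].tx_fwd_offload || ports[i].hwdom == 0) {
			frames++;
			continue;
		}

		/* Was a frame already sent into this hardware domain? */
		for (size_t j = 0; j < i; j++) {
			if (ports[j].tx_fwd_offload &&
			    ports[j].hwdom == ports[i].hwdom) {
				already_sent = true;
				break;
			}
		}

		if (!already_sent)
			frames++;
	}

	return frames;
}
```

With four ports in the same hwdom, the model sends one frame when all of
them have the flag set and four frames when none do, which is the
CPU-to-switch bandwidth saving the cover letter describes.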

Thread overview: 44+ messages
2021-07-03 11:56 [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices Vladimir Oltean
2021-07-03 11:56 ` [Bridge] " Vladimir Oltean
2021-07-03 11:56 ` [RFC PATCH v2 net-next 01/10] net: dfwd: constrain existing users to macvlan subordinates Vladimir Oltean
2021-07-03 11:56   ` [Bridge] " Vladimir Oltean
2021-07-03 11:56 ` [RFC PATCH v2 net-next 02/10] net: bridge: disambiguate offload_fwd_mark Vladimir Oltean
2021-07-03 11:56   ` [Bridge] " Vladimir Oltean
2021-07-09 13:23   ` Grygorii Strashko
2021-07-03 11:56 ` [RFC PATCH v2 net-next 03/10] net: bridge: switchdev: recycle unused hwdoms Vladimir Oltean
2021-07-03 11:56   ` [Bridge] " Vladimir Oltean
2021-07-03 11:56 ` [RFC PATCH v2 net-next 04/10] net: bridge: switchdev: allow the data plane forwarding to be offloaded Vladimir Oltean
2021-07-03 11:56   ` [Bridge] " Vladimir Oltean
2021-07-09 13:16   ` Grygorii Strashko
2021-07-09 14:09     ` Vladimir Oltean
2021-07-09 14:09       ` [Bridge] " Vladimir Oltean
2021-07-12 12:28       ` Tobias Waldekranz
2021-07-12 12:28         ` [Bridge] " Tobias Waldekranz
2021-07-12 13:03         ` Vladimir Oltean
2021-07-12 13:03           ` [Bridge] " Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 05/10] net: extract helpers for binding a subordinate device to TX queues Vladimir Oltean
2021-07-03 11:57   ` [Bridge] " Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 06/10] net: allow ndo_select_queue to go beyond dev->num_real_tx_queues Vladimir Oltean
2021-07-03 11:57   ` [Bridge] " Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 07/10] net: dsa: track the number of switches in a tree Vladimir Oltean
2021-07-03 11:57   ` [Bridge] " Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 08/10] net: dsa: add support for bridge forwarding offload Vladimir Oltean
2021-07-03 11:57   ` [Bridge] " Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 09/10] net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in the PVT Vladimir Oltean
2021-07-03 11:57   ` [Bridge] " Vladimir Oltean
2021-07-03 11:57 ` [RFC PATCH v2 net-next 10/10] net: dsa: tag_dsa: offload the bridge forwarding process Vladimir Oltean
2021-07-03 11:57   ` [Bridge] " Vladimir Oltean
2021-07-03 22:04 ` [RFC PATCH v2 net-next 00/10] Allow forwarding for the software bridge data path to be offloaded to capable devices Tobias Waldekranz
2021-07-03 22:04   ` [Bridge] " Tobias Waldekranz
2021-07-04  8:11   ` Vladimir Oltean
2021-07-04  8:11     ` [Bridge] " Vladimir Oltean
2021-07-05  8:09     ` Tobias Waldekranz
2021-07-05  8:09       ` [Bridge] " Tobias Waldekranz
2021-07-05  8:54       ` Vladimir Oltean
2021-07-05  8:54         ` [Bridge] " Vladimir Oltean
2021-07-05  4:20 ` DENG Qingfang
2021-07-05  4:20   ` [Bridge] " DENG Qingfang
2021-07-05  8:32   ` Tobias Waldekranz
2021-07-05  8:32     ` [Bridge] " Tobias Waldekranz
2021-07-05  9:57     ` Vladimir Oltean
2021-07-05  9:57       ` [Bridge] " Vladimir Oltean