Netdev Archive on lore.kernel.org
 help / color / Atom feed
* [RFC net-next 0/9] net: bridge: Forward offloading
@ 2021-04-26 17:04 Tobias Waldekranz
  2021-04-26 17:04 ` [RFC net-next 1/9] net: dfwd: Constrain existing users to macvlan subordinates Tobias Waldekranz
                   ` (9 more replies)
  0 siblings, 10 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

## Overview

   vlan1   vlan2
       \   /
   .-----------.
   |    br0    |
   '-----------'
   /   /   \   \
swp0 swp1 swp2 eth0
  :   :   :
  (hwdom 1)

Up to this point, switchdevs have been trusted with offloading
forwarding between bridge ports, e.g. forwarding a unicast from swp0
to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
series extends forward offloading to include some new classes of
traffic:

- Locally originating flows, i.e. packets that ingress on br0 that are
  to be forwarded to one or several of the ports swp{0,1,2}. Notably
  this also includes routed flows, e.g. a packet ingressing swp0 on
  VLAN 1 which is then routed over to VLAN 2 by the CPU and then
  forwarded to swp1 is "locally originating" from br0's point of view.

- Flows originating from "foreign" interfaces, i.e. an interface that
  is not offloaded by a particular switchdev instance. This includes
  ports belonging to other switchdev instances. A typical example
  would be flows from eth0 towards swp{0,1,2}.

The bridge still looks up its FDB/MDB as usual and then notifies the
switchdev driver that a particular skb should be offloaded if it
matches one of the classes above. It does so by using the _accel
version of dev_queue_xmit, supplying its own netdev as the
"subordinate" device. The driver can react to the presence of the
subordinate in its .ndo_select_queue in what ever way it needs to make
sure to forward the skb in much the same way that it would for packets
ingressing on regular ports.

Hardware domains to which a particular skb has been forwarded are
recorded so that duplicates are avoided.

The main performance benefit is thus seen on multicast flows. Imagine
for example that:

- An IP camera is connected to swp0 (VLAN 1)

- The CPU is acting as a multicast router, routing the group from VLAN
  1 to VLAN 2.

- There are subscribers for the group in question behind both swp1 and
  swp2 (VLAN 2).

With this offloading in place, the bridge need only send a single skb
to the driver, which will send it to the hardware marked in such a way
that the switch will perform the multicast replication according to
the MDB configuration. Naturally, the number of saved skb_clones
increase linearly with the number of subscribed ports.

As an extra benefit, on mv88e6xxx, this also allows the switch to
perform source address learning on these flows, which avoids having to
sync dynamic FDB entries over slow configuration interfaces like MDIO
to avoid flows directed towards the CPU being flooded as unknown
unicast by the switch.


## RFC

- In general, what do you think about this idea?

- hwdom. What do you think about this terminology? Personally I feel
  that we had too many things called offload_fwd_mark, and that as the
  use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
  might be useful to have a separate term for it.

- .dfwd_{add,del}_station. Am I stretching this abstraction too far,
  and if so do you have any suggestion/preference on how to signal the
  offloading from the bridge down to the switchdev driver?

- The way that flooding is implemented in br_forward.c (lazily cloning
  skbs) means that you have to mark the forwarding as completed very
  early (right after should_deliver in maybe_deliver) in order to
  avoid duplicates. Is there some way to move this decision point to a
  later stage that I am missing?

- BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
  compatible with unicast-to-multicast being used on a port. Then
  again, I think that this would also be broken for regular switchdev
  bridge offloading as this flag is not offloaded to the switchdev
  port, so there is no way for the driver to refuse it. Any ideas on
  how to handle this?


## mv88e6xxx Specifics

Since we are now only receiving a single skb for both unicast and
multicast flows, we can tag the packets with the FORWARD command
instead of FROM_CPU. The swich(es) will then forward the packet in
accordance with its ATU, VTU, STU, and PVT configuration - just like
for packets ingressing on user ports.

Crucially, FROM_CPU is still used for:

- Ports in standalone mode.

- Flows that are trapped to the CPU and software-forwarded by a
  bridge. Note that these flows match neither of the classes discussed
  in the overview.

- Packets that are sent directly to a port netdev without going
  through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
  socket.

We thus have a pretty clean separation where the data plane uses
FORWARDs and the control plane uses TO_/FROM_CPU.

The barrier between different bridges is enforced by port based VLANs
on mv88e6xxx, which in essence is a mapping from a source device/port
pair to an allowed set of egress ports. In order to have a FORWARD
frame (which carries a _source_ device/port) correctly mapped by the
PVT, we must use a unique pair for each bridge.

Fortunately, there is typically lots of unused address space in most
switch trees. When was the last time you saw an mv88e6xxx product
using more than 4 chips? Even if you found one with 16 (!) devices,
you would still have room to allocate 16*16 virtual ports to software
bridges.

Therefore, the mv88e6xxx driver will allocate a virtual device/port
pair to each bridge that it offloads. All members of the same bridge
are then configured to allow packets from this virtual port in their
PVTs.

Tobias Waldekranz (9):
  net: dfwd: Constrain existing users to macvlan subordinates
  net: bridge: Disambiguate offload_fwd_mark
  net: bridge: switchdev: Recycle unused hwdoms
  net: bridge: switchdev: Forward offloading
  net: dsa: Track port PVIDs
  net: dsa: Forward offloading
  net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge
  net: dsa: mv88e6xxx: Map virtual bridge port in PVT
  net: dsa: mv88e6xxx: Forward offloading

 MAINTAINERS                                   |   1 +
 drivers/net/dsa/mv88e6xxx/Makefile            |   1 +
 drivers/net/dsa/mv88e6xxx/chip.c              |  61 ++++++-
 drivers/net/dsa/mv88e6xxx/dst.c               | 160 ++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/dst.h               |  14 ++
 .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
 include/linux/dsa/mv88e6xxx.h                 |  13 ++
 include/net/dsa.h                             |  13 ++
 net/bridge/br_forward.c                       |  11 +-
 net/bridge/br_if.c                            |   4 +-
 net/bridge/br_private.h                       |  54 +++++-
 net/bridge/br_switchdev.c                     | 141 +++++++++++----
 net/dsa/port.c                                |  16 +-
 net/dsa/slave.c                               |  36 +++-
 net/dsa/tag_dsa.c                             |  33 +++-
 17 files changed, 510 insertions(+), 57 deletions(-)
 create mode 100644 drivers/net/dsa/mv88e6xxx/dst.c
 create mode 100644 drivers/net/dsa/mv88e6xxx/dst.h
 create mode 100644 include/linux/dsa/mv88e6xxx.h

-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC net-next 1/9] net: dfwd: Constrain existing users to macvlan subordinates
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
@ 2021-04-26 17:04 ` Tobias Waldekranz
  2021-04-26 17:04 ` [RFC net-next 2/9] net: bridge: Disambiguate offload_fwd_mark Tobias Waldekranz
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

The dfwd_add/del_station NDOs are currently only used by the macvlan
subsystem to request L2 forwarding offload from lower devices. In
order add support for other types of devices (like bridges), we
constrain the current users to make sure that the subordinate
requesting the offload is in fact a macvlan.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 drivers/net/ethernet/intel/fm10k/fm10k_netdev.c | 3 +++
 drivers/net/ethernet/intel/i40e/i40e_main.c     | 3 +++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   | 3 +++
 3 files changed, 9 insertions(+)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
index 2fb52bd6fc0e..4dba6e6a282d 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_netdev.c
@@ -1352,6 +1352,9 @@ static void *fm10k_dfwd_add_station(struct net_device *dev,
 	int size, i;
 	u16 vid, glort;
 
+	if (!netif_is_macvlan(sdev))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	/* The hardware supported by fm10k only filters on the destination MAC
 	 * address. In order to avoid issues we only support offloading modes
 	 * where the hardware can actually provide the functionality.
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c2d145a56b5e..b90b79f7ee46 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7663,6 +7663,9 @@ static void *i40e_fwd_add(struct net_device *netdev, struct net_device *vdev)
 	struct i40e_fwd_adapter *fwd;
 	int avail_macvlan, ret;
 
+	if (!netif_is_macvlan(vdev))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if ((pf->flags & I40E_FLAG_DCB_ENABLED)) {
 		netdev_info(netdev, "Macvlans are not supported when DCB is enabled\n");
 		return ERR_PTR(-EINVAL);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index c5ec17d19c59..ff5334faf6c5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9940,6 +9940,9 @@ static void *ixgbe_fwd_add(struct net_device *pdev, struct net_device *vdev)
 	int tcs = adapter->hw_tcs ? : 1;
 	int pool, err;
 
+	if (!netif_is_macvlan(vdev))
+		return ERR_PTR(-EOPNOTSUPP);
+
 	if (adapter->xdp_prog) {
 		e_warn(probe, "L2FW offload is not supported with XDP\n");
 		return ERR_PTR(-EINVAL);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC net-next 2/9] net: bridge: Disambiguate offload_fwd_mark
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
  2021-04-26 17:04 ` [RFC net-next 1/9] net: dfwd: Constrain existing users to macvlan subordinates Tobias Waldekranz
@ 2021-04-26 17:04 ` Tobias Waldekranz
  2021-05-02 15:00   ` Ido Schimmel
  2021-04-26 17:04 ` [RFC net-next 3/9] net: bridge: switchdev: Recycle unused hwdoms Tobias Waldekranz
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

Before this change, four related - but distinct - concepts where named
offload_fwd_mark:

- skb->offload_fwd_mark: Set by the switchdev driver if the underlying
  hardware has already forwarded this frame to the other ports in the
  same hardware domain.

- nbp->offload_fwd_mark: An idetifier used to group ports that share
  the same hardware forwarding domain.

- br->offload_fwd_mark: Counter used to make sure that unique IDs are
  used in cases where a bridge contains ports from multiple hardware
  domains.

- skb->cb->offload_fwd_mark: The hardware domain on which the frame
  ingressed and was forwarded.

Introduce the term "hardware forwarding domain" ("hwdom") in the
bridge to denote a set of ports with the following property:

    If an skb with skb->offload_fwd_mark set, is received on a port
    belonging to hwdom N, that frame has already been forwarded to all
    other ports in hwdom N.

By decoupling the name from "offload_fwd_mark", we can extend the
term's definition in the future - e.g. to add constraints that
describe expected egress behavior - without overloading the meaning of
"offload_fwd_mark".

- nbp->offload_fwd_mark thus becomes nbp->hwdom.

- br->offload_fwd_mark becomes br->last_hwdom.

- skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. There is a
  slight change here: Whereas previously this was only set for
  offloaded packets, we now always track the incoming hwdom. As all
  uses where already gated behind checks of skb->offload_fwd_mark,
  this will not introduce any functional change, but it paves the way
  for future changes where the ingressing hwdom must be known both for
  offloaded and non-offloaded frames.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 net/bridge/br_if.c        |  2 +-
 net/bridge/br_private.h   | 10 +++++-----
 net/bridge/br_switchdev.c | 16 ++++++++--------
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index f7d2f472ae24..73fa703f8df5 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -643,7 +643,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	if (err)
 		goto err5;
 
-	err = nbp_switchdev_mark_set(p);
+	err = nbp_switchdev_hwdom_set(p);
 	if (err)
 		goto err6;
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 7ce8a77cc6b6..53248715f631 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -327,7 +327,7 @@ struct net_bridge_port {
 	struct netpoll			*np;
 #endif
 #ifdef CONFIG_NET_SWITCHDEV
-	int				offload_fwd_mark;
+	int				hwdom;
 #endif
 	u16				group_fwd_mask;
 	u16				backup_redirected_cnt;
@@ -472,7 +472,7 @@ struct net_bridge {
 	u32				auto_cnt;
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int offload_fwd_mark;
+	int last_hwdom;
 #endif
 	struct hlist_head		fdb_list;
 
@@ -502,7 +502,7 @@ struct br_input_skb_cb {
 #endif
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int offload_fwd_mark;
+	int src_hwdom;
 #endif
 };
 
@@ -1593,7 +1593,7 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
-int nbp_switchdev_mark_set(struct net_bridge_port *p);
+int nbp_switchdev_hwdom_set(struct net_bridge_port *p);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1613,7 +1613,7 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 	skb->offload_fwd_mark = 0;
 }
 #else
-static inline int nbp_switchdev_mark_set(struct net_bridge_port *p)
+static inline int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
 {
 	return 0;
 }
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index a5e601e41cb9..bc085077ae71 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,20 +8,20 @@
 
 #include "br_private.h"
 
-static int br_switchdev_mark_get(struct net_bridge *br, struct net_device *dev)
+static int br_switchdev_hwdom_get(struct net_bridge *br, struct net_device *dev)
 {
 	struct net_bridge_port *p;
 
 	/* dev is yet to be added to the port list. */
 	list_for_each_entry(p, &br->port_list, list) {
 		if (netdev_port_same_parent_id(dev, p->dev))
-			return p->offload_fwd_mark;
+			return p->hwdom;
 	}
 
-	return ++br->offload_fwd_mark;
+	return ++br->last_hwdom;
 }
 
-int nbp_switchdev_mark_set(struct net_bridge_port *p)
+int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
 {
 	struct netdev_phys_item_id ppid = { };
 	int err;
@@ -35,7 +35,7 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
 		return err;
 	}
 
-	p->offload_fwd_mark = br_switchdev_mark_get(p->br, p->dev);
+	p->hwdom = br_switchdev_hwdom_get(p->br, p->dev);
 
 	return 0;
 }
@@ -43,15 +43,15 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
-	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
-		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
+	if (p->hwdom)
+		BR_INPUT_SKB_CB(skb)->src_hwdom = p->hwdom;
 }
 
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
 				  const struct sk_buff *skb)
 {
 	return !skb->offload_fwd_mark ||
-	       BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
+	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
 }
 
 /* Flags that can be offloaded to hardware */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC net-next 3/9] net: bridge: switchdev: Recycle unused hwdoms
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
  2021-04-26 17:04 ` [RFC net-next 1/9] net: dfwd: Constrain existing users to macvlan subordinates Tobias Waldekranz
  2021-04-26 17:04 ` [RFC net-next 2/9] net: bridge: Disambiguate offload_fwd_mark Tobias Waldekranz
@ 2021-04-26 17:04 ` Tobias Waldekranz
  2021-04-27 10:42   ` Nikolay Aleksandrov
  2021-04-26 17:04 ` [RFC net-next 4/9] net: bridge: switchdev: Forward offloading Tobias Waldekranz
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

Since hwdoms has thus far only been used for equality comparisons, the
bridge has used the simplest possible assignment policy; using a
counter to keep track of the last value handed out.

With the upcoming transmit offloading, we need to perform set
operations efficiently based on hwdoms, e.g. we want to answer
questions like "has this skb been forwarded to any port within this
hwdom?"

Move to a bitmap-based allocation scheme that recycles hwdoms once all
members leaves the bridge. This means that we can use a single
unsigned long to keep track of the hwdoms that have received an skb.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 net/bridge/br_if.c        |  4 +-
 net/bridge/br_private.h   | 29 +++++++++---
 net/bridge/br_switchdev.c | 94 ++++++++++++++++++++++++++-------------
 3 files changed, 87 insertions(+), 40 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 73fa703f8df5..adaf78e45c23 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -349,6 +349,7 @@ static void del_nbp(struct net_bridge_port *p)
 	nbp_backup_clear(p);
 
 	nbp_update_port_count(br);
+	nbp_switchdev_del(p);
 
 	netdev_upper_dev_unlink(dev, br->dev);
 
@@ -643,7 +644,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	if (err)
 		goto err5;
 
-	err = nbp_switchdev_hwdom_set(p);
+	err = nbp_switchdev_add(p);
 	if (err)
 		goto err6;
 
@@ -704,6 +705,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	list_del_rcu(&p->list);
 	br_fdb_delete_by_port(br, p, 0, 1);
 	nbp_update_port_count(br);
+	nbp_switchdev_del(p);
 err6:
 	netdev_upper_dev_unlink(dev, br->dev);
 err5:
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 53248715f631..aba92864d285 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -29,6 +29,8 @@
 
 #define BR_MULTICAST_DEFAULT_HASH_MAX 4096
 
+#define BR_HWDOM_MAX BITS_PER_LONG
+
 #define BR_VERSION	"2.3"
 
 /* Control of forwarding link local multicast */
@@ -54,6 +56,8 @@ typedef struct bridge_id bridge_id;
 typedef struct mac_addr mac_addr;
 typedef __u16 port_id;
 
+typedef DECLARE_BITMAP(br_hwdom_map_t, BR_HWDOM_MAX);
+
 struct bridge_id {
 	unsigned char	prio[2];
 	unsigned char	addr[ETH_ALEN];
@@ -472,7 +476,7 @@ struct net_bridge {
 	u32				auto_cnt;
 
 #ifdef CONFIG_NET_SWITCHDEV
-	int last_hwdom;
+	br_hwdom_map_t			busy_hwdoms;
 #endif
 	struct hlist_head		fdb_list;
 
@@ -1593,7 +1597,6 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
-int nbp_switchdev_hwdom_set(struct net_bridge_port *p);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1607,17 +1610,15 @@ void br_switchdev_fdb_notify(const struct net_bridge_fdb_entry *fdb,
 int br_switchdev_port_vlan_add(struct net_device *dev, u16 vid, u16 flags,
 			       struct netlink_ext_ack *extack);
 int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid);
+int nbp_switchdev_add(struct net_bridge_port *p);
+void nbp_switchdev_del(struct net_bridge_port *p);
+void br_switchdev_init(struct net_bridge *br);
 
 static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 {
 	skb->offload_fwd_mark = 0;
 }
 #else
-static inline int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
-{
-	return 0;
-}
-
 static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 					    struct sk_buff *skb)
 {
@@ -1657,6 +1658,20 @@ br_switchdev_fdb_notify(const struct net_bridge_fdb_entry *fdb, int type)
 static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 {
 }
+
+static inline int nbp_switchdev_add(struct net_bridge_port *p)
+{
+	return 0;
+}
+
+static inline void nbp_switchdev_del(struct net_bridge_port *p)
+{
+}
+
+static inline void br_switchdev_init(struct net_bridge *br)
+{
+}
+
 #endif /* CONFIG_NET_SWITCHDEV */
 
 /* br_arp_nd_proxy.c */
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index bc085077ae71..54bd7205bfb5 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,38 +8,6 @@
 
 #include "br_private.h"
 
-static int br_switchdev_hwdom_get(struct net_bridge *br, struct net_device *dev)
-{
-	struct net_bridge_port *p;
-
-	/* dev is yet to be added to the port list. */
-	list_for_each_entry(p, &br->port_list, list) {
-		if (netdev_port_same_parent_id(dev, p->dev))
-			return p->hwdom;
-	}
-
-	return ++br->last_hwdom;
-}
-
-int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
-{
-	struct netdev_phys_item_id ppid = { };
-	int err;
-
-	ASSERT_RTNL();
-
-	err = dev_get_port_parent_id(p->dev, &ppid, true);
-	if (err) {
-		if (err == -EOPNOTSUPP)
-			return 0;
-		return err;
-	}
-
-	p->hwdom = br_switchdev_hwdom_get(p->br, p->dev);
-
-	return 0;
-}
-
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
@@ -156,3 +124,65 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
 
 	return switchdev_port_obj_del(dev, &v.obj);
 }
+
+static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
+{
+	struct net_bridge *br = joining->br;
+	struct net_bridge_port *p;
+	int hwdom;
+
+	/* joining is yet to be added to the port list. */
+	list_for_each_entry(p, &br->port_list, list) {
+		if (netdev_port_same_parent_id(joining->dev, p->dev)) {
+			joining->hwdom = p->hwdom;
+			return 0;
+		}
+	}
+
+	hwdom = find_next_zero_bit(br->busy_hwdoms, BR_HWDOM_MAX, 1);
+	if (hwdom >= BR_HWDOM_MAX)
+		return -EBUSY;
+
+	set_bit(hwdom, br->busy_hwdoms);
+	joining->hwdom = hwdom;
+	return 0;
+}
+
+static void nbp_switchdev_hwdom_put(struct net_bridge_port *leaving)
+{
+	struct net_bridge *br = leaving->br;
+	struct net_bridge_port *p;
+
+	/* leaving is no longer in the port list. */
+	list_for_each_entry(p, &br->port_list, list) {
+		if (p->hwdom == leaving->hwdom)
+			return;
+	}
+
+	clear_bit(leaving->hwdom, br->busy_hwdoms);
+}
+
+int nbp_switchdev_add(struct net_bridge_port *p)
+{
+	struct netdev_phys_item_id ppid = { };
+	int err;
+
+	ASSERT_RTNL();
+
+	err = dev_get_port_parent_id(p->dev, &ppid, true);
+	if (err) {
+		if (err == -EOPNOTSUPP)
+			return 0;
+		return err;
+	}
+
+	return nbp_switchdev_hwdom_set(p);
+}
+
+void nbp_switchdev_del(struct net_bridge_port *p)
+{
+	ASSERT_RTNL();
+
+	if (p->hwdom)
+		nbp_switchdev_hwdom_put(p);
+}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC net-next 4/9] net: bridge: switchdev: Forward offloading
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
                   ` (2 preceding siblings ...)
  2021-04-26 17:04 ` [RFC net-next 3/9] net: bridge: switchdev: Recycle unused hwdoms Tobias Waldekranz
@ 2021-04-26 17:04 ` Tobias Waldekranz
  2021-04-27 10:35   ` Nikolay Aleksandrov
  2021-05-02 15:04   ` Ido Schimmel
  2021-04-26 17:04 ` [RFC net-next 5/9] net: dsa: Track port PVIDs Tobias Waldekranz
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

Allow switchdevs to forward frames from the CPU in accordance with the
bridge configuration in the same way as is done between bridge
ports. This means that the bridge will only send a single skb towards
one of the ports under the switchdev's control, and expects the driver
to deliver the packet to all eligible ports in its domain.

Primarily this improves the performance of multicast flows with
multiple subscribers, as it allows the hardware to perform the frame
replication.

The basic flow between the driver and the bridge is as follows:

- The switchdev accepts the offload by returning a non-null pointer
  from .ndo_dfwd_add_station when the port is added to the bridge.

- The bridge sends offloadable skbs to one of the ports under the
  switchdev's control using dev_queue_xmit_accel.

- The switchdev notices the offload by checking for a non-NULL
  "sb_dev" in the core's call to .ndo_select_queue.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 net/bridge/br_forward.c   | 11 +++++++-
 net/bridge/br_private.h   | 27 ++++++++++++++++++
 net/bridge/br_switchdev.c | 59 +++++++++++++++++++++++++++++++++++++--
 3 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 6e9b049ae521..b4fb3b0bb1ec 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
 
 int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
 {
+	struct net_device *sb_dev = NULL;
+
 	skb_push(skb, ETH_HLEN);
 	if (!is_skb_forwardable(skb->dev, skb))
 		goto drop;
@@ -48,7 +50,10 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
 		skb_set_network_header(skb, depth);
 	}
 
-	dev_queue_xmit(skb);
+	if (br_switchdev_accels_skb(skb))
+		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
+
+	dev_queue_xmit_accel(skb, sb_dev);
 
 	return 0;
 
@@ -105,6 +110,8 @@ static void __br_forward(const struct net_bridge_port *to,
 		indev = NULL;
 	}
 
+	nbp_switchdev_frame_mark_accel(to, skb);
+
 	NF_HOOK(NFPROTO_BRIDGE, br_hook,
 		net, NULL, skb, indev, skb->dev,
 		br_forward_finish);
@@ -174,6 +181,8 @@ static struct net_bridge_port *maybe_deliver(
 	if (!should_deliver(p, skb))
 		return prev;
 
+	nbp_switchdev_frame_mark_fwd(p, skb);
+
 	if (!prev)
 		goto out;
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index aba92864d285..933e951b0d7a 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -332,6 +332,7 @@ struct net_bridge_port {
 #endif
 #ifdef CONFIG_NET_SWITCHDEV
 	int				hwdom;
+	void				*accel_priv;
 #endif
 	u16				group_fwd_mask;
 	u16				backup_redirected_cnt;
@@ -506,7 +507,9 @@ struct br_input_skb_cb {
 #endif
 
 #ifdef CONFIG_NET_SWITCHDEV
+	u8 fwd_accel:1;
 	int src_hwdom;
+	br_hwdom_map_t fwd_hwdoms;
 #endif
 };
 
@@ -1597,6 +1600,15 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 
 /* br_switchdev.c */
 #ifdef CONFIG_NET_SWITCHDEV
+static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
+{
+	return BR_INPUT_SKB_CB(skb)->fwd_accel;
+}
+
+void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+				    struct sk_buff *skb);
+void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+				  struct sk_buff *skb);
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb);
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
@@ -1619,6 +1631,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 	skb->offload_fwd_mark = 0;
 }
 #else
+static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
+{
+	return false;
+}
+
+static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+						  struct sk_buff *skb)
+{
+}
+
+static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+						struct sk_buff *skb)
+{
+}
+
 static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 					    struct sk_buff *skb)
 {
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 54bd7205bfb5..c903171ad291 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -8,6 +8,26 @@
 
 #include "br_private.h"
 
+static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,
+				    const struct sk_buff *skb)
+{
+	return p->accel_priv && (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);
+}
+
+void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
+				    struct sk_buff *skb)
+{
+	if (nbp_switchdev_can_accel(p, skb))
+		BR_INPUT_SKB_CB(skb)->fwd_accel = true;
+}
+
+void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
+				  struct sk_buff *skb)
+{
+	if (nbp_switchdev_can_accel(p, skb))
+		set_bit(p->hwdom, BR_INPUT_SKB_CB(skb)->fwd_hwdoms);
+}
+
 void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 			      struct sk_buff *skb)
 {
@@ -18,8 +38,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
 bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
 				  const struct sk_buff *skb)
 {
-	return !skb->offload_fwd_mark ||
-	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
+	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);
+
+	return !test_bit(p->hwdom, cb->fwd_hwdoms) &&
+		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);
 }
 
 /* Flags that can be offloaded to hardware */
@@ -125,6 +147,27 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
 	return switchdev_port_obj_del(dev, &v.obj);
 }
 
+static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
+{
+	void *priv;
+
+	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
+		return;
+
+	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
+	if (!IS_ERR_OR_NULL(priv))
+		p->accel_priv = priv;
+}
+
+static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
+{
+	if (!p->accel_priv)
+		return;
+
+	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
+	p->accel_priv = NULL;
+}
+
 static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
 {
 	struct net_bridge *br = joining->br;
@@ -176,13 +219,23 @@ int nbp_switchdev_add(struct net_bridge_port *p)
 		return err;
 	}
 
-	return nbp_switchdev_hwdom_set(p);
+	err = nbp_switchdev_hwdom_set(p);
+	if (err)
+		return err;
+
+	if (p->hwdom)
+		nbp_switchdev_fwd_offload_add(p);
+
+	return 0;
 }
 
 void nbp_switchdev_del(struct net_bridge_port *p)
 {
 	ASSERT_RTNL();
 
+	if (p->accel_priv)
+		nbp_switchdev_fwd_offload_del(p);
+
 	if (p->hwdom)
 		nbp_switchdev_hwdom_put(p);
 }
-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC net-next 5/9] net: dsa: Track port PVIDs
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
                   ` (3 preceding siblings ...)
  2021-04-26 17:04 ` [RFC net-next 4/9] net: bridge: switchdev: Forward offloading Tobias Waldekranz
@ 2021-04-26 17:04 ` Tobias Waldekranz
  2021-04-26 19:40   ` Vladimir Oltean
  2021-04-26 17:04 ` [RFC net-next 6/9] net: dsa: Forward offloading Tobias Waldekranz
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

In some scenarios a tagger must know which VLAN to assign to a packet,
even if the packet is set to egress untagged. Since the VLAN
information in the skb will be removed by the bridge in this case,
track each port's PVID such that the VID of an outgoing frame can
always be determined.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 include/net/dsa.h |  1 +
 net/dsa/port.c    | 16 ++++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 507082959aa4..1f9ba9889034 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -270,6 +270,7 @@ struct dsa_port {
 	unsigned int		ageing_time;
 	bool			vlan_filtering;
 	u8			stp_state;
+	u16			pvid;
 	struct net_device	*bridge_dev;
 	struct devlink_port	devlink_port;
 	bool			devlink_port_setup;
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 6379d66a6bb3..02d96aebfcc6 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -651,8 +651,14 @@ int dsa_port_vlan_add(struct dsa_port *dp,
 		.vlan = vlan,
 		.extack = extack,
 	};
+	int err;
+
+	err = dsa_port_notify(dp, DSA_NOTIFIER_VLAN_ADD, &info);
 
-	return dsa_port_notify(dp, DSA_NOTIFIER_VLAN_ADD, &info);
+	if (!err && (vlan->flags & BRIDGE_VLAN_INFO_PVID))
+		dp->pvid = vlan->vid;
+
+	return err;
 }
 
 int dsa_port_vlan_del(struct dsa_port *dp,
@@ -663,8 +669,14 @@ int dsa_port_vlan_del(struct dsa_port *dp,
 		.port = dp->index,
 		.vlan = vlan,
 	};
+	int err;
+
+	err = dsa_port_notify(dp, DSA_NOTIFIER_VLAN_DEL, &info);
 
-	return dsa_port_notify(dp, DSA_NOTIFIER_VLAN_DEL, &info);
+	if (!err && vlan->vid == dp->pvid)
+		dp->pvid = 0;
+
+	return err;
 }
 
 int dsa_port_mrp_add(const struct dsa_port *dp,
-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC net-next 6/9] net: dsa: Forward offloading
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
                   ` (4 preceding siblings ...)
  2021-04-26 17:04 ` [RFC net-next 5/9] net: dsa: Track port PVIDs Tobias Waldekranz
@ 2021-04-26 17:04 ` Tobias Waldekranz
  2021-04-27 10:17   ` Vladimir Oltean
  2021-04-26 17:04 ` [RFC net-next 7/9] net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge Tobias Waldekranz
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

Allow DSA drivers to support forward offloading from a bridge by:

- Passing calls to .ndo_dfwd_{add,del}_station to the drivers.

- Recording the subordinate device of offloaded skbs in the control
  buffer so that the tagger can take the appropriate action.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 include/net/dsa.h |  7 +++++++
 net/dsa/slave.c   | 36 ++++++++++++++++++++++++++++++++++--
 2 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 1f9ba9889034..77d4df819299 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -119,6 +119,7 @@ struct dsa_netdevice_ops {
 
 struct dsa_skb_cb {
 	struct sk_buff *clone;
+	struct net_device *sb_dev;
 };
 
 struct __dsa_skb_cb {
@@ -828,6 +829,12 @@ struct dsa_switch_ops {
 					  const struct switchdev_obj_ring_role_mrp *mrp);
 	int	(*port_mrp_del_ring_role)(struct dsa_switch *ds, int port,
 					  const struct switchdev_obj_ring_role_mrp *mrp);
+
+	/* L2 forward offloading */
+	void *	(*dfwd_add_station)(struct dsa_switch *ds, int port,
+				    struct net_device *sb_dev);
+	void	(*dfwd_del_station)(struct dsa_switch *ds, int port,
+				    struct net_device *sb_dev);
 };
 
 #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 77b33bd161b8..3689ffa2dbb8 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -657,6 +657,13 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
 	return dsa_enqueue_skb(nskb, dev);
 }
 
+static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb,
+				  struct net_device *sb_dev)
+{
+	DSA_SKB_CB(skb)->sb_dev = sb_dev;
+	return netdev_pick_tx(dev, skb, sb_dev);
+}
+
 /* ethtool operations *******************************************************/
 
 static void dsa_slave_get_drvinfo(struct net_device *dev,
@@ -1708,10 +1715,33 @@ static int dsa_slave_fill_forward_path(struct net_device_path_ctx *ctx,
 	return 0;
 }
 
+static void *dsa_slave_dfwd_add_station(struct net_device *dev,
+					struct net_device *sb_dev)
+{
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+
+	if (ds->ops->dfwd_add_station)
+		return ds->ops->dfwd_add_station(ds, dp->index, sb_dev);
+
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
+static void dsa_slave_dfwd_del_station(struct net_device *dev,
+				       void *sb_dev)
+{
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+
+	if (ds->ops->dfwd_del_station)
+		ds->ops->dfwd_del_station(ds, dp->index, sb_dev);
+}
+
 static const struct net_device_ops dsa_slave_netdev_ops = {
 	.ndo_open	 	= dsa_slave_open,
 	.ndo_stop		= dsa_slave_close,
 	.ndo_start_xmit		= dsa_slave_xmit,
+	.ndo_select_queue	= dsa_slave_select_queue,
 	.ndo_change_rx_flags	= dsa_slave_change_rx_flags,
 	.ndo_set_rx_mode	= dsa_slave_set_rx_mode,
 	.ndo_set_mac_address	= dsa_slave_set_mac_address,
@@ -1734,6 +1764,8 @@ static const struct net_device_ops dsa_slave_netdev_ops = {
 	.ndo_get_devlink_port	= dsa_slave_get_devlink_port,
 	.ndo_change_mtu		= dsa_slave_change_mtu,
 	.ndo_fill_forward_path	= dsa_slave_fill_forward_path,
+	.ndo_dfwd_add_station	= dsa_slave_dfwd_add_station,
+	.ndo_dfwd_del_station	= dsa_slave_dfwd_del_station,
 };
 
 static struct device_type dsa_type = {
@@ -1914,8 +1946,8 @@ int dsa_slave_create(struct dsa_port *port)
 	slave_dev->features = master->vlan_features | NETIF_F_HW_TC;
 	if (ds->ops->port_vlan_add && ds->ops->port_vlan_del)
 		slave_dev->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
-	slave_dev->hw_features |= NETIF_F_HW_TC;
-	slave_dev->features |= NETIF_F_LLTX;
+	slave_dev->hw_features |= NETIF_F_HW_TC | NETIF_F_HW_L2FW_DOFFLOAD;
+	slave_dev->features |= NETIF_F_LLTX | NETIF_F_HW_L2FW_DOFFLOAD;
 	slave_dev->ethtool_ops = &dsa_slave_ethtool_ops;
 	if (!is_zero_ether_addr(port->mac))
 		ether_addr_copy(slave_dev->dev_addr, port->mac);
-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC net-next 7/9] net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
                   ` (5 preceding siblings ...)
  2021-04-26 17:04 ` [RFC net-next 6/9] net: dsa: Forward offloading Tobias Waldekranz
@ 2021-04-26 17:04 ` Tobias Waldekranz
  2021-04-26 17:04 ` [RFC net-next 8/9] net: dsa: mv88e6xxx: Map virtual bridge port in PVT Tobias Waldekranz
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

In the near future we want to offload transmission of both unicasts
and multicasts from a bridge by sending a single FORWARD and use the
switches' config to determine the destination(s). Much in the same way
as we have already relied on them to do between user ports in the
past.

As isolation between bridges must still be maintained, we need to pass
an identifier in the DSA tag that the switches can use to determine
the set of physical ports that make up a particular flooding domain.

Therefore: allocate a DSA device/port tuple that is not used by any
physical device to each bridge we are offloading. We can then in
upcoming changes use this tuple to setup cross-chip port based VLANs
to restrict the set of valid egress ports to only contain the ports
that are offloading the same bridge.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 drivers/net/dsa/mv88e6xxx/Makefile |   1 +
 drivers/net/dsa/mv88e6xxx/chip.c   |  11 +++
 drivers/net/dsa/mv88e6xxx/dst.c    | 127 +++++++++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/dst.h    |  12 +++
 include/net/dsa.h                  |   5 ++
 5 files changed, 156 insertions(+)
 create mode 100644 drivers/net/dsa/mv88e6xxx/dst.c
 create mode 100644 drivers/net/dsa/mv88e6xxx/dst.h

diff --git a/drivers/net/dsa/mv88e6xxx/Makefile b/drivers/net/dsa/mv88e6xxx/Makefile
index c8eca2b6f959..20e00695b28d 100644
--- a/drivers/net/dsa/mv88e6xxx/Makefile
+++ b/drivers/net/dsa/mv88e6xxx/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_NET_DSA_MV88E6XXX) += mv88e6xxx.o
 mv88e6xxx-objs := chip.o
 mv88e6xxx-objs += devlink.o
+mv88e6xxx-objs += dst.o
 mv88e6xxx-objs += global1.o
 mv88e6xxx-objs += global1_atu.o
 mv88e6xxx-objs += global1_vtu.o
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index eca285aaf72f..06ef654472b7 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -33,6 +33,7 @@
 
 #include "chip.h"
 #include "devlink.h"
+#include "dst.h"
 #include "global1.h"
 #include "global2.h"
 #include "hwtstamp.h"
@@ -2371,6 +2372,10 @@ static int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int port,
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int err;
 
+	err = mv88e6xxx_dst_bridge_join(ds->dst, br);
+	if (err)
+		return err;
+
 	mv88e6xxx_reg_lock(chip);
 	err = mv88e6xxx_bridge_map(chip, br);
 	mv88e6xxx_reg_unlock(chip);
@@ -2388,6 +2393,8 @@ static void mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port,
 	    mv88e6xxx_port_vlan_map(chip, port))
 		dev_err(ds->dev, "failed to remap in-chip Port VLAN\n");
 	mv88e6xxx_reg_unlock(chip);
+
+	mv88e6xxx_dst_bridge_leave(ds->dst, br);
 }
 
 static int mv88e6xxx_crosschip_bridge_join(struct dsa_switch *ds,
@@ -3027,6 +3034,10 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
 
 	mv88e6xxx_reg_lock(chip);
 
+	err = mv88e6xxx_dst_add_chip(chip);
+	if (err)
+		goto unlock;
+
 	if (chip->info->ops->setup_errata) {
 		err = chip->info->ops->setup_errata(chip);
 		if (err)
diff --git a/drivers/net/dsa/mv88e6xxx/dst.c b/drivers/net/dsa/mv88e6xxx/dst.c
new file mode 100644
index 000000000000..399a818063bf
--- /dev/null
+++ b/drivers/net/dsa/mv88e6xxx/dst.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * mv88e6xxx global DSA switch tree state
+ */
+
+#include <linux/bitmap.h>
+#include <linux/dsa/mv88e6xxx.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <net/dsa.h>
+
+#include "chip.h"
+#include "dst.h"
+#include "global2.h"
+
+struct mv88e6xxx_br {
+	struct list_head list;
+
+	struct net_device *brdev;
+	u8 dev;
+	u8 port;
+};
+
+struct mv88e6xxx_dst {
+	struct list_head bridges;
+
+	DECLARE_BITMAP(busy_ports, MV88E6XXX_MAX_PVT_ENTRIES);
+
+#define DEV_PORT_TO_BIT(_dev, _port)			\
+	((_dev) * MV88E6XXX_MAX_PVT_PORTS + (_port))
+#define DEV_FROM_BIT(_bit) ((_bit) / MV88E6XXX_MAX_PVT_PORTS)
+#define PORT_FROM_BIT(_bit) ((_bit) % (MV88E6XXX_MAX_PVT_PORTS))
+};
+
+int mv88e6xxx_dst_bridge_join(struct dsa_switch_tree *dst,
+			      struct net_device *brdev)
+{
+	struct mv88e6xxx_dst *mvdst = dst->priv;
+	struct mv88e6xxx_br *mvbr;
+	unsigned int bit;
+
+	list_for_each_entry(mvbr, &mvdst->bridges, list) {
+		if (mvbr->brdev == brdev)
+			return 0;
+	}
+
+	bit = find_first_zero_bit(mvdst->busy_ports,
+				  MV88E6XXX_MAX_PVT_ENTRIES);
+
+	if (bit >= MV88E6XXX_MAX_PVT_ENTRIES) {
+		pr_err("Unable to allocate virtual port for %s in DSA tree %d\n",
+		       netdev_name(brdev), dst->index);
+		return -ENOSPC;
+	}
+
+	mvbr = kzalloc(sizeof(*mvbr), GFP_KERNEL);
+	if (!mvbr)
+		return -ENOMEM;
+
+	mvbr->brdev = brdev;
+	mvbr->dev = DEV_FROM_BIT(bit);
+	mvbr->port = PORT_FROM_BIT(bit);
+
+	INIT_LIST_HEAD(&mvbr->list);
+	list_add_tail(&mvbr->list, &mvdst->bridges);
+	set_bit(bit, mvdst->busy_ports);
+	return 0;
+}
+
+void mv88e6xxx_dst_bridge_leave(struct dsa_switch_tree *dst,
+				struct net_device *brdev)
+{
+	struct mv88e6xxx_dst *mvdst = dst->priv;
+	struct mv88e6xxx_br *mvbr;
+	struct dsa_port *dp;
+
+	list_for_each_entry(dp, &dst->ports, list) {
+		if (dp->bridge_dev == brdev)
+			return;
+	}
+
+	list_for_each_entry(mvbr, &mvdst->bridges, list) {
+		if (mvbr->brdev == brdev) {
+			clear_bit(DEV_PORT_TO_BIT(mvbr->dev, mvbr->port),
+				  mvdst->busy_ports);
+			list_del(&mvbr->list);
+			kfree(mvbr);
+			return;
+		}
+	}
+}
+
+static struct mv88e6xxx_dst *mv88e6xxx_dst_get(struct dsa_switch_tree *dst)
+{
+	struct mv88e6xxx_dst *mvdst;
+
+	if (dst->priv)
+		return dst->priv;
+
+	mvdst = kzalloc(sizeof(*mvdst), GFP_KERNEL);
+	if (!mvdst)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&mvdst->bridges);
+
+	bitmap_set(mvdst->busy_ports,
+		   DEV_PORT_TO_BIT(MV88E6XXX_G2_PVT_ADDR_DEV_TRUNK, 0),
+		   MV88E6XXX_MAX_PVT_PORTS);
+
+	dst->priv = mvdst;
+	return mvdst;
+}
+
+int mv88e6xxx_dst_add_chip(struct mv88e6xxx_chip *chip)
+{
+	struct dsa_switch_tree *dst = chip->ds->dst;
+	struct mv88e6xxx_dst *mvdst;
+
+	mvdst = mv88e6xxx_dst_get(dst);
+	if (IS_ERR(mvdst))
+		return PTR_ERR(mvdst);
+
+	bitmap_set(mvdst->busy_ports, DEV_PORT_TO_BIT(chip->ds->index, 0),
+		   MV88E6XXX_MAX_PVT_PORTS);
+	return 0;
+}
diff --git a/drivers/net/dsa/mv88e6xxx/dst.h b/drivers/net/dsa/mv88e6xxx/dst.h
new file mode 100644
index 000000000000..3845a19192ef
--- /dev/null
+++ b/drivers/net/dsa/mv88e6xxx/dst.h
@@ -0,0 +1,12 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _MV88E6XXX_DST_H
+#define _MV88E6XXX_DST_H
+
+int mv88e6xxx_dst_bridge_join(struct dsa_switch_tree *dst,
+			      struct net_device *brdev);
+void mv88e6xxx_dst_bridge_leave(struct dsa_switch_tree *dst,
+				struct net_device *brdev);
+int mv88e6xxx_dst_add_chip(struct mv88e6xxx_chip *chip);
+
+#endif /* _MV88E6XXX_DST_H */
diff --git a/include/net/dsa.h b/include/net/dsa.h
index 77d4df819299..c01e74d6e134 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -172,6 +172,11 @@ struct dsa_switch_tree {
 	 */
 	struct net_device **lags;
 	unsigned int lags_len;
+
+	/* Give the switch driver somewhere to hang its tree-wide
+	 * private data structure.
+	 */
+	void *priv;
 };
 
 #define dsa_lags_foreach_id(_id, _dst)				\
-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC net-next 8/9] net: dsa: mv88e6xxx: Map virtual bridge port in PVT
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
                   ` (6 preceding siblings ...)
  2021-04-26 17:04 ` [RFC net-next 7/9] net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge Tobias Waldekranz
@ 2021-04-26 17:04 ` Tobias Waldekranz
  2021-04-26 17:04 ` [RFC net-next 9/9] net: dsa: mv88e6xxx: Forward offloading Tobias Waldekranz
  2021-05-02 14:58 ` [RFC net-next 0/9] net: bridge: " Ido Schimmel
  9 siblings, 0 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

Now that each bridge has a unique DSA device/port tuple, make sure
that each chip limits forwarding from the bridge to only include
fabric ports and local ports that are members of the same bridge.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 MAINTAINERS                      |  1 +
 drivers/net/dsa/mv88e6xxx/chip.c | 33 ++++++++++++++++++++++++--------
 drivers/net/dsa/mv88e6xxx/dst.c  | 33 ++++++++++++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/dst.h  |  2 ++
 include/linux/dsa/mv88e6xxx.h    | 13 +++++++++++++
 5 files changed, 74 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/dsa/mv88e6xxx.h

diff --git a/MAINTAINERS b/MAINTAINERS
index c3c8fa572580..8794b05793b2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -10647,6 +10647,7 @@ S:	Maintained
 F:	Documentation/devicetree/bindings/net/dsa/marvell.txt
 F:	Documentation/networking/devlink/mv88e6xxx.rst
 F:	drivers/net/dsa/mv88e6xxx/
+F:	include/linux/dsa/mv88e6xxx.h
 F:	include/linux/platform_data/mv88e6xxx.h
 
 MARVELL ARMADA 3700 PHY DRIVERS
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 06ef654472b7..6975cf16da65 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -12,6 +12,7 @@
 
 #include <linux/bitfield.h>
 #include <linux/delay.h>
+#include <linux/dsa/mv88e6xxx.h>
 #include <linux/etherdevice.h>
 #include <linux/ethtool.h>
 #include <linux/if_bridge.h>
@@ -1229,15 +1230,25 @@ static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
 		}
 	}
 
-	/* Prevent frames from unknown switch or port */
-	if (!found)
-		return 0;
+	if (found) {
+		/* Frames from DSA links and CPU ports can egress any
+		 * local port.
+		 */
+		if (dp->type == DSA_PORT_TYPE_CPU ||
+		    dp->type == DSA_PORT_TYPE_DSA)
+			return mv88e6xxx_port_mask(chip);
 
-	/* Frames from DSA links and CPU ports can egress any local port */
-	if (dp->type == DSA_PORT_TYPE_CPU || dp->type == DSA_PORT_TYPE_DSA)
-		return mv88e6xxx_port_mask(chip);
+		br = dp->bridge_dev;
+	} else {
+		br = mv88e6xxx_dst_bridge_from_dsa(dst, dev, port);
+
+		/* Reject frames from ports that are neither physical
+		 * nor virtual bridge ports.
+		 */
+		if (!br)
+			return 0;
+	}
 
-	br = dp->bridge_dev;
 	pvlan = 0;
 
 	/* Frames from user ports can egress any local DSA links and CPU ports,
@@ -2340,6 +2351,7 @@ static int mv88e6xxx_bridge_map(struct mv88e6xxx_chip *chip,
 	struct dsa_switch *ds = chip->ds;
 	struct dsa_switch_tree *dst = ds->dst;
 	struct dsa_port *dp;
+	u8 dev, port;
 	int err;
 
 	list_for_each_entry(dp, &dst->ports, list) {
@@ -2363,7 +2375,12 @@ static int mv88e6xxx_bridge_map(struct mv88e6xxx_chip *chip,
 		}
 	}
 
-	return 0;
+	/* Map the virtual bridge port if one is assigned. */
+	err = mv88e6xxx_dst_bridge_to_dsa(dst, br, &dev, &port);
+	if (!err)
+		err = mv88e6xxx_pvt_map(chip, dev, port);
+
+	return err;
 }
 
 static int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int port,
diff --git a/drivers/net/dsa/mv88e6xxx/dst.c b/drivers/net/dsa/mv88e6xxx/dst.c
index 399a818063bf..a5f9077e5b3f 100644
--- a/drivers/net/dsa/mv88e6xxx/dst.c
+++ b/drivers/net/dsa/mv88e6xxx/dst.c
@@ -33,6 +33,39 @@ struct mv88e6xxx_dst {
 #define PORT_FROM_BIT(_bit) ((_bit) % (MV88E6XXX_MAX_PVT_PORTS))
 };
 
+struct net_device *mv88e6xxx_dst_bridge_from_dsa(struct dsa_switch_tree *dst,
+						 u8 dev, u8 port)
+{
+	struct mv88e6xxx_dst *mvdst = dst->priv;
+	struct mv88e6xxx_br *mvbr;
+
+	list_for_each_entry(mvbr, &mvdst->bridges, list) {
+		if (mvbr->dev == dev && mvbr->port == port)
+			return mvbr->brdev;
+	}
+
+	return NULL;
+}
+
+int mv88e6xxx_dst_bridge_to_dsa(const struct dsa_switch_tree *dst,
+				const struct net_device *brdev,
+				u8 *dev, u8 *port)
+{
+	struct mv88e6xxx_dst *mvdst = dst->priv;
+	struct mv88e6xxx_br *mvbr;
+
+	list_for_each_entry(mvbr, &mvdst->bridges, list) {
+		if (mvbr->brdev == brdev) {
+			*dev = mvbr->dev;
+			*port = mvbr->port;
+			return 0;
+		}
+	}
+
+	return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(mv88e6xxx_dst_bridge_to_dsa);
+
 int mv88e6xxx_dst_bridge_join(struct dsa_switch_tree *dst,
 			      struct net_device *brdev)
 {
diff --git a/drivers/net/dsa/mv88e6xxx/dst.h b/drivers/net/dsa/mv88e6xxx/dst.h
index 3845a19192ef..911890ec4792 100644
--- a/drivers/net/dsa/mv88e6xxx/dst.h
+++ b/drivers/net/dsa/mv88e6xxx/dst.h
@@ -3,6 +3,8 @@
 #ifndef _MV88E6XXX_DST_H
 #define _MV88E6XXX_DST_H
 
+struct net_device *mv88e6xxx_dst_bridge_from_dsa(struct dsa_switch_tree *dst,
+						 u8 dev, u8 port);
 int mv88e6xxx_dst_bridge_join(struct dsa_switch_tree *dst,
 			      struct net_device *brdev);
 void mv88e6xxx_dst_bridge_leave(struct dsa_switch_tree *dst,
diff --git a/include/linux/dsa/mv88e6xxx.h b/include/linux/dsa/mv88e6xxx.h
new file mode 100644
index 000000000000..fa486dfe9808
--- /dev/null
+++ b/include/linux/dsa/mv88e6xxx.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _NET_DSA_MV88E6XXX_H
+#define _NET_DSA_MV88E6XXX_H
+
+#include <linux/netdevice.h>
+#include <net/dsa.h>
+
+int mv88e6xxx_dst_bridge_to_dsa(const struct dsa_switch_tree *dst,
+				const struct net_device *brdev,
+				u8 *dev, u8 *port);
+
+#endif /* _NET_DSA_MV88E6XXX_H */
-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [RFC net-next 9/9] net: dsa: mv88e6xxx: Forward offloading
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
                   ` (7 preceding siblings ...)
  2021-04-26 17:04 ` [RFC net-next 8/9] net: dsa: mv88e6xxx: Map virtual bridge port in PVT Tobias Waldekranz
@ 2021-04-26 17:04 ` Tobias Waldekranz
  2021-05-02 14:58 ` [RFC net-next 0/9] net: bridge: " Ido Schimmel
  9 siblings, 0 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 17:04 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

Allow the DSA tagger to generate FORWARD frames for offloaded skbs
sent from a bridge that we offload, allowing the switch to handle any
frame replication that may be required. This also means that source
address learning takes place on packets sent from the CPU, meaning
that return traffic no longer needs to be flooded as unknown unicast.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 17 ++++++++++++++++
 net/dsa/tag_dsa.c                | 33 ++++++++++++++++++++++++--------
 2 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 6975cf16da65..00ed1aa2a55a 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -6077,6 +6077,21 @@ static int mv88e6xxx_crosschip_lag_leave(struct dsa_switch *ds, int sw_index,
 	return err_sync ? : err_pvt;
 }
 
+static void *mv88e6xxx_dfwd_add_station(struct dsa_switch *ds, int port,
+					struct net_device *sb_dev)
+{
+	struct dsa_port *dp = dsa_to_port(ds, port);
+	struct mv88e6xxx_chip *chip = ds->priv;
+
+	if (!mv88e6xxx_has_pvt(chip))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	if (sb_dev == dp->bridge_dev)
+		return sb_dev;
+
+	return ERR_PTR(-EOPNOTSUPP);
+}
+
 static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
 	.get_tag_protocol	= mv88e6xxx_get_tag_protocol,
 	.change_tag_protocol	= mv88e6xxx_change_tag_protocol,
@@ -6138,6 +6153,7 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
 	.crosschip_lag_change	= mv88e6xxx_crosschip_lag_change,
 	.crosschip_lag_join	= mv88e6xxx_crosschip_lag_join,
 	.crosschip_lag_leave	= mv88e6xxx_crosschip_lag_leave,
+	.dfwd_add_station	= mv88e6xxx_dfwd_add_station,
 };
 
 static int mv88e6xxx_register_switch(struct mv88e6xxx_chip *chip)
@@ -6156,6 +6172,7 @@ static int mv88e6xxx_register_switch(struct mv88e6xxx_chip *chip)
 	ds->ops = &mv88e6xxx_switch_ops;
 	ds->ageing_time_min = chip->info->age_time_coeff;
 	ds->ageing_time_max = chip->info->age_time_coeff * U8_MAX;
+	ds->num_tx_queues = 2;
 
 	/* Some chips support up to 32, but that requires enabling the
 	 * 5-bit port mode, which we do not support. 640k^W16 ought to
diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
index 7e7b7decdf39..09cdf77697b2 100644
--- a/net/dsa/tag_dsa.c
+++ b/net/dsa/tag_dsa.c
@@ -46,6 +46,7 @@
  */
 
 #include <linux/etherdevice.h>
+#include <linux/dsa/mv88e6xxx.h>
 #include <linux/list.h>
 #include <linux/slab.h>
 
@@ -126,7 +127,22 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
 				   u8 extra)
 {
 	struct dsa_port *dp = dsa_slave_to_port(dev);
+	u16 pvid = dp->pvid;
+	enum dsa_cmd cmd;
 	u8 *dsa_header;
+	u8 tag_dev, tag_port;
+
+	if (DSA_SKB_CB(skb)->sb_dev) {
+		cmd = DSA_CMD_FORWARD;
+		if (mv88e6xxx_dst_bridge_to_dsa(dp->ds->dst,
+						DSA_SKB_CB(skb)->sb_dev,
+						&tag_dev, &tag_port))
+			return NULL;
+	} else {
+		cmd = DSA_CMD_FROM_CPU;
+		tag_dev = dp->ds->index;
+		tag_port = dp->index;
+	}
 
 	if (skb->protocol == htons(ETH_P_8021Q)) {
 		if (extra) {
@@ -134,10 +150,10 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
 			memmove(skb->data, skb->data + extra, 2 * ETH_ALEN);
 		}
 
-		/* Construct tagged FROM_CPU DSA tag from 802.1Q tag. */
+		/* Construct tagged DSA tag from 802.1Q tag. */
 		dsa_header = skb->data + 2 * ETH_ALEN + extra;
-		dsa_header[0] = (DSA_CMD_FROM_CPU << 6) | 0x20 | dp->ds->index;
-		dsa_header[1] = dp->index << 3;
+		dsa_header[0] = (cmd << 6) | 0x20 | tag_dev;
+		dsa_header[1] = tag_port << 3;
 
 		/* Move CFI field from byte 2 to byte 1. */
 		if (dsa_header[2] & 0x10) {
@@ -148,12 +164,13 @@ static struct sk_buff *dsa_xmit_ll(struct sk_buff *skb, struct net_device *dev,
 		skb_push(skb, DSA_HLEN + extra);
 		memmove(skb->data, skb->data + DSA_HLEN + extra, 2 * ETH_ALEN);
 
-		/* Construct untagged FROM_CPU DSA tag. */
+		/* Construct untagged DSA tag. */
 		dsa_header = skb->data + 2 * ETH_ALEN + extra;
-		dsa_header[0] = (DSA_CMD_FROM_CPU << 6) | dp->ds->index;
-		dsa_header[1] = dp->index << 3;
-		dsa_header[2] = 0x00;
-		dsa_header[3] = 0x00;
+
+		dsa_header[0] = (cmd << 6) | tag_dev;
+		dsa_header[1] = tag_port << 3;
+		dsa_header[2] = pvid >> 8;
+		dsa_header[3] = pvid & 0xff;
 	}
 
 	return skb;
-- 
2.25.1


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 5/9] net: dsa: Track port PVIDs
  2021-04-26 17:04 ` [RFC net-next 5/9] net: dsa: Track port PVIDs Tobias Waldekranz
@ 2021-04-26 19:40   ` Vladimir Oltean
  2021-04-26 20:05     ` Tobias Waldekranz
  0 siblings, 1 reply; 42+ messages in thread
From: Vladimir Oltean @ 2021-04-26 19:40 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

Hi Tobias,

On Mon, Apr 26, 2021 at 07:04:07PM +0200, Tobias Waldekranz wrote:
> In some scenarios a tagger must know which VLAN to assign to a packet,
> even if the packet is set to egress untagged. Since the VLAN
> information in the skb will be removed by the bridge in this case,
> track each port's PVID such that the VID of an outgoing frame can
> always be determined.
> 
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---

Let me give you this real-life example:

#!/bin/bash

ip link add br0 type bridge vlan_filtering 1
for eth in eth0 eth1 swp2 swp3 swp4 swp5; do
	ip link set $eth up
	ip link set $eth master br0
done
ip link set br0 up

bridge vlan add dev eth0 vid 100 pvid untagged
bridge vlan del dev swp2 vid 1
bridge vlan del dev swp3 vid 1
bridge vlan add dev swp2 vid 100
bridge vlan add dev swp3 vid 100 untagged

reproducible on the NXP LS1021A-TSN board.
The bridge receives an untagged packet on eth0 and floods it.
It should reach swp2 and swp3, and be tagged on swp2, and untagged on
swp3 respectively.

With your idea of sending untagged frames towards the port's pvid,
wouldn't we be leaking this packet to VLAN 1, therefore towards ports
swp4 and swp5, and the real destination ports would not get this packet?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 5/9] net: dsa: Track port PVIDs
  2021-04-26 19:40   ` Vladimir Oltean
@ 2021-04-26 20:05     ` Tobias Waldekranz
  2021-04-26 20:28       ` Vladimir Oltean
  0 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-26 20:05 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Mon, Apr 26, 2021 at 22:40, Vladimir Oltean <olteanv@gmail.com> wrote:
> Hi Tobias,
>
> On Mon, Apr 26, 2021 at 07:04:07PM +0200, Tobias Waldekranz wrote:
>> In some scenarios a tagger must know which VLAN to assign to a packet,
>> even if the packet is set to egress untagged. Since the VLAN
>> information in the skb will be removed by the bridge in this case,
>> track each port's PVID such that the VID of an outgoing frame can
>> always be determined.
>> 
>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> ---
>
> Let me give you this real-life example:
>
> #!/bin/bash
>
> ip link add br0 type bridge vlan_filtering 1
> for eth in eth0 eth1 swp2 swp3 swp4 swp5; do
> 	ip link set $eth up
> 	ip link set $eth master br0
> done
> ip link set br0 up
>
> bridge vlan add dev eth0 vid 100 pvid untagged
> bridge vlan del dev swp2 vid 1
> bridge vlan del dev swp3 vid 1
> bridge vlan add dev swp2 vid 100
> bridge vlan add dev swp3 vid 100 untagged
>
> reproducible on the NXP LS1021A-TSN board.
> The bridge receives an untagged packet on eth0 and floods it.
> It should reach swp2 and swp3, and be tagged on swp2, and untagged on
> swp3 respectively.
>
> With your idea of sending untagged frames towards the port's pvid,
> wouldn't we be leaking this packet to VLAN 1, therefore towards ports
> swp4 and swp5, and the real destination ports would not get this packet?

I am not sure I follow. The bridge would never send the packet to
swp{4,5} because should_deliver() rejects them (as usual). So it could
only be sent either to swp2 or swp3. In the case that swp3 is first in
the bridge's port list, it would be sent untagged, but the PVID would be
100 and the flooding would thus be limited to swp{2,3}.

You did make me realize that there is a fatal flaw in the current design
though: Using this approach, it is not possible to have multiple VLANs
configured to egress untagged out of one port. Rare, but allowed.

So the VLAN information will have to remain in the skb somehow. My
initial plan was actually to always send offloaded skbs tagged. I went
this route because I thought we already had all the information we
needed in the driver. It seems reasonable that skb->vlan_tci could
always be set for offloaded frames from a filtering bridge, no?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 5/9] net: dsa: Track port PVIDs
  2021-04-26 20:05     ` Tobias Waldekranz
@ 2021-04-26 20:28       ` Vladimir Oltean
  2021-04-27  9:12         ` Tobias Waldekranz
  0 siblings, 1 reply; 42+ messages in thread
From: Vladimir Oltean @ 2021-04-26 20:28 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Mon, Apr 26, 2021 at 10:05:52PM +0200, Tobias Waldekranz wrote:
> On Mon, Apr 26, 2021 at 22:40, Vladimir Oltean <olteanv@gmail.com> wrote:
> > Hi Tobias,
> >
> > On Mon, Apr 26, 2021 at 07:04:07PM +0200, Tobias Waldekranz wrote:
> >> In some scenarios a tagger must know which VLAN to assign to a packet,
> >> even if the packet is set to egress untagged. Since the VLAN
> >> information in the skb will be removed by the bridge in this case,
> >> track each port's PVID such that the VID of an outgoing frame can
> >> always be determined.
> >> 
> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> >> ---
> >
> > Let me give you this real-life example:
> >
> > #!/bin/bash
> >
> > ip link add br0 type bridge vlan_filtering 1
> > for eth in eth0 eth1 swp2 swp3 swp4 swp5; do
> > 	ip link set $eth up
> > 	ip link set $eth master br0
> > done
> > ip link set br0 up
> >
> > bridge vlan add dev eth0 vid 100 pvid untagged
> > bridge vlan del dev swp2 vid 1
> > bridge vlan del dev swp3 vid 1
> > bridge vlan add dev swp2 vid 100
> > bridge vlan add dev swp3 vid 100 untagged
> >
> > reproducible on the NXP LS1021A-TSN board.
> > The bridge receives an untagged packet on eth0 and floods it.
> > It should reach swp2 and swp3, and be tagged on swp2, and untagged on
> > swp3 respectively.
> >
> > With your idea of sending untagged frames towards the port's pvid,
> > wouldn't we be leaking this packet to VLAN 1, therefore towards ports
> > swp4 and swp5, and the real destination ports would not get this packet?
> 
> I am not sure I follow. The bridge would never send the packet to
> swp{4,5} because should_deliver() rejects them (as usual). So it could
> only be sent either to swp2 or swp3. In the case that swp3 is first in
> the bridge's port list, it would be sent untagged, but the PVID would be
> 100 and the flooding would thus be limited to swp{2,3}.

Sorry, _I_ don't understand.

When you say that the PVID is 100, whose PVID is it, exactly? Is it the
pvid of the source port (aka eth0 in this example)? That's not what I
see, I see the pvid of the egress port (the Marvell device)...

So to reiterate: when you transmit a packet towards your hardware switch
which has br0 inside the sb_dev, how does the switch know in which VLAN
to forward that packet? As far as I am aware, when the bridge had
received the packet as untagged on eth0, it did not insert VLAN 100 into
the skb itself, so the bridge VLAN information is lost when delivering
the frame to the egress net device. Am I wrong?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 5/9] net: dsa: Track port PVIDs
  2021-04-26 20:28       ` Vladimir Oltean
@ 2021-04-27  9:12         ` Tobias Waldekranz
  2021-04-27  9:27           ` Vladimir Oltean
  2021-04-27 10:07           ` Vladimir Oltean
  0 siblings, 2 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-27  9:12 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Mon, Apr 26, 2021 at 23:28, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Apr 26, 2021 at 10:05:52PM +0200, Tobias Waldekranz wrote:
>> On Mon, Apr 26, 2021 at 22:40, Vladimir Oltean <olteanv@gmail.com> wrote:
>> > Hi Tobias,
>> >
>> > On Mon, Apr 26, 2021 at 07:04:07PM +0200, Tobias Waldekranz wrote:
>> >> In some scenarios a tagger must know which VLAN to assign to a packet,
>> >> even if the packet is set to egress untagged. Since the VLAN
>> >> information in the skb will be removed by the bridge in this case,
>> >> track each port's PVID such that the VID of an outgoing frame can
>> >> always be determined.
>> >> 
>> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> >> ---
>> >
>> > Let me give you this real-life example:
>> >
>> > #!/bin/bash
>> >
>> > ip link add br0 type bridge vlan_filtering 1
>> > for eth in eth0 eth1 swp2 swp3 swp4 swp5; do
>> > 	ip link set $eth up
>> > 	ip link set $eth master br0
>> > done
>> > ip link set br0 up
>> >
>> > bridge vlan add dev eth0 vid 100 pvid untagged
>> > bridge vlan del dev swp2 vid 1
>> > bridge vlan del dev swp3 vid 1
>> > bridge vlan add dev swp2 vid 100
>> > bridge vlan add dev swp3 vid 100 untagged
>> >
>> > reproducible on the NXP LS1021A-TSN board.
>> > The bridge receives an untagged packet on eth0 and floods it.
>> > It should reach swp2 and swp3, and be tagged on swp2, and untagged on
>> > swp3 respectively.
>> >
>> > With your idea of sending untagged frames towards the port's pvid,
>> > wouldn't we be leaking this packet to VLAN 1, therefore towards ports
>> > swp4 and swp5, and the real destination ports would not get this packet?
>> 
>> I am not sure I follow. The bridge would never send the packet to
>> swp{4,5} because should_deliver() rejects them (as usual). So it could
>> only be sent either to swp2 or swp3. In the case that swp3 is first in
>> the bridge's port list, it would be sent untagged, but the PVID would be
>> 100 and the flooding would thus be limited to swp{2,3}.
>
> Sorry, _I_ don't understand.
>
> When you say that the PVID is 100, whose PVID is it, exactly? Is it the
> pvid of the source port (aka eth0 in this example)? That's not what I
> see, I see the pvid of the egress port (the Marvell device)...

I meant the PVID of swp3.

In summary: This series incorrectly assumes that a port's PVID always
corresponds to the VID that should be assigned to untagged packets on
egress. This is wrong because PVID only specifies which VID to assign
packets to on ingress, it says nothing about policy on egress. Multiple
VIDs can also be configured to egress untagged on a given port. The VID
must thus be sent along with each packet in order for the driver to be
able to assign it to the correct VID.

> So to reiterate: when you transmit a packet towards your hardware switch
> which has br0 inside the sb_dev, how does the switch know in which VLAN
> to forward that packet? As far as I am aware, when the bridge had
> received the packet as untagged on eth0, it did not insert VLAN 100 into
> the skb itself, so the bridge VLAN information is lost when delivering
> the frame to the egress net device. Am I wrong?

VID 100 is inserted into skb->vlan_tci on ingress from eth0, in
br_vlan.c/__allowed_ingress. It is then cleared again in
br_vlan.c/br_handle_vlan if the egress port (swp3 in our example) is set
to egress the VID untagged.

The last step only clears skb->vlan_present though, the actual VID
information still resides in skb->vlan_tci. I tried just removing 5/9
from this series, and then sourced the VID from skb->vlan_tci for
untagged packets. It works like a charm! I think this is the way
forward.

The question is if we need another bit of information to signal that
skb->vlan_tci contains valid information, but the packet should still be
considered untagged? This could also be used on Rx to carry priority
(PCP) information to the bridge.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 5/9] net: dsa: Track port PVIDs
  2021-04-27  9:12         ` Tobias Waldekranz
@ 2021-04-27  9:27           ` Vladimir Oltean
  2021-04-27 10:07           ` Vladimir Oltean
  1 sibling, 0 replies; 42+ messages in thread
From: Vladimir Oltean @ 2021-04-27  9:27 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Tue, Apr 27, 2021 at 11:12:56AM +0200, Tobias Waldekranz wrote:
> On Mon, Apr 26, 2021 at 23:28, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Mon, Apr 26, 2021 at 10:05:52PM +0200, Tobias Waldekranz wrote:
> >> On Mon, Apr 26, 2021 at 22:40, Vladimir Oltean <olteanv@gmail.com> wrote:
> >> > Hi Tobias,
> >> >
> >> > On Mon, Apr 26, 2021 at 07:04:07PM +0200, Tobias Waldekranz wrote:
> >> >> In some scenarios a tagger must know which VLAN to assign to a packet,
> >> >> even if the packet is set to egress untagged. Since the VLAN
> >> >> information in the skb will be removed by the bridge in this case,
> >> >> track each port's PVID such that the VID of an outgoing frame can
> >> >> always be determined.
> >> >> 
> >> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> >> >> ---
> >> >
> >> > Let me give you this real-life example:
> >> >
> >> > #!/bin/bash
> >> >
> >> > ip link add br0 type bridge vlan_filtering 1
> >> > for eth in eth0 eth1 swp2 swp3 swp4 swp5; do
> >> > 	ip link set $eth up
> >> > 	ip link set $eth master br0
> >> > done
> >> > ip link set br0 up
> >> >
> >> > bridge vlan add dev eth0 vid 100 pvid untagged
> >> > bridge vlan del dev swp2 vid 1
> >> > bridge vlan del dev swp3 vid 1
> >> > bridge vlan add dev swp2 vid 100
> >> > bridge vlan add dev swp3 vid 100 untagged
> >> >
> >> > reproducible on the NXP LS1021A-TSN board.
> >> > The bridge receives an untagged packet on eth0 and floods it.
> >> > It should reach swp2 and swp3, and be tagged on swp2, and untagged on
> >> > swp3 respectively.
> >> >
> >> > With your idea of sending untagged frames towards the port's pvid,
> >> > wouldn't we be leaking this packet to VLAN 1, therefore towards ports
> >> > swp4 and swp5, and the real destination ports would not get this packet?
> >> 
> >> I am not sure I follow. The bridge would never send the packet to
> >> swp{4,5} because should_deliver() rejects them (as usual). So it could
> >> only be sent either to swp2 or swp3. In the case that swp3 is first in
> >> the bridge's port list, it would be sent untagged, but the PVID would be
> >> 100 and the flooding would thus be limited to swp{2,3}.
> >
> > Sorry, _I_ don't understand.
> >
> > When you say that the PVID is 100, whose PVID is it, exactly? Is it the
> > pvid of the source port (aka eth0 in this example)? That's not what I
> > see, I see the pvid of the egress port (the Marvell device)...
> 
> I meant the PVID of swp3.
> 
> In summary: This series incorrectly assumes that a port's PVID always
> corresponds to the VID that should be assigned to untagged packets on
> egress. This is wrong because PVID only specifies which VID to assign
> packets to on ingress, it says nothing about policy on egress. Multiple
> VIDs can also be configured to egress untagged on a given port. The VID
> must thus be sent along with each packet in order for the driver to be
> able to assign it to the correct VID.
> 
> > So to reiterate: when you transmit a packet towards your hardware switch
> > which has br0 inside the sb_dev, how does the switch know in which VLAN
> > to forward that packet? As far as I am aware, when the bridge had
> > received the packet as untagged on eth0, it did not insert VLAN 100 into
> > the skb itself, so the bridge VLAN information is lost when delivering
> > the frame to the egress net device. Am I wrong?
> 
> VID 100 is inserted into skb->vlan_tci on ingress from eth0, in
> br_vlan.c/__allowed_ingress. It is then cleared again in
> br_vlan.c/br_handle_vlan if the egress port (swp3 in our example) is set
> to egress the VID untagged.
> 
> The last step only clears skb->vlan_present though, the actual VID
> information still resides in skb->vlan_tci. I tried just removing 5/9
> from this series, and then sourced the VID from skb->vlan_tci for
> untagged packets. It works like a charm! I think this is the way
> forward.
> 
> The question is if we need another bit of information to signal that
> skb->vlan_tci contains valid information, but the packet should still be
> considered untagged? This could also be used on Rx to carry priority
> (PCP) information to the bridge.

My expectation is that when you do this forwarding offload thing, the
bridge passes the classified VLAN down to the port driver, encoded
inside the accel_priv alongside the sb_dev somehow.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 5/9] net: dsa: Track port PVIDs
  2021-04-27  9:12         ` Tobias Waldekranz
  2021-04-27  9:27           ` Vladimir Oltean
@ 2021-04-27 10:07           ` Vladimir Oltean
  2021-04-28 23:10             ` Tobias Waldekranz
  1 sibling, 1 reply; 42+ messages in thread
From: Vladimir Oltean @ 2021-04-27 10:07 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Tue, Apr 27, 2021 at 11:12:56AM +0200, Tobias Waldekranz wrote:
> On Mon, Apr 26, 2021 at 23:28, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Mon, Apr 26, 2021 at 10:05:52PM +0200, Tobias Waldekranz wrote:
> >> On Mon, Apr 26, 2021 at 22:40, Vladimir Oltean <olteanv@gmail.com> wrote:
> >> > Hi Tobias,
> >> >
> >> > On Mon, Apr 26, 2021 at 07:04:07PM +0200, Tobias Waldekranz wrote:
> >> >> In some scenarios a tagger must know which VLAN to assign to a packet,
> >> >> even if the packet is set to egress untagged. Since the VLAN
> >> >> information in the skb will be removed by the bridge in this case,
> >> >> track each port's PVID such that the VID of an outgoing frame can
> >> >> always be determined.
> >> >> 
> >> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> >> >> ---
> >> >
> >> > Let me give you this real-life example:
> >> >
> >> > #!/bin/bash
> >> >
> >> > ip link add br0 type bridge vlan_filtering 1
> >> > for eth in eth0 eth1 swp2 swp3 swp4 swp5; do
> >> > 	ip link set $eth up
> >> > 	ip link set $eth master br0
> >> > done
> >> > ip link set br0 up
> >> >
> >> > bridge vlan add dev eth0 vid 100 pvid untagged
> >> > bridge vlan del dev swp2 vid 1
> >> > bridge vlan del dev swp3 vid 1
> >> > bridge vlan add dev swp2 vid 100
> >> > bridge vlan add dev swp3 vid 100 untagged
> >> >
> >> > reproducible on the NXP LS1021A-TSN board.
> >> > The bridge receives an untagged packet on eth0 and floods it.
> >> > It should reach swp2 and swp3, and be tagged on swp2, and untagged on
> >> > swp3 respectively.
> >> >
> >> > With your idea of sending untagged frames towards the port's pvid,
> >> > wouldn't we be leaking this packet to VLAN 1, therefore towards ports
> >> > swp4 and swp5, and the real destination ports would not get this packet?
> >> 
> >> I am not sure I follow. The bridge would never send the packet to
> >> swp{4,5} because should_deliver() rejects them (as usual). So it could
> >> only be sent either to swp2 or swp3. In the case that swp3 is first in
> >> the bridge's port list, it would be sent untagged, but the PVID would be
> >> 100 and the flooding would thus be limited to swp{2,3}.
> >
> > Sorry, _I_ don't understand.
> >
> > When you say that the PVID is 100, whose PVID is it, exactly? Is it the
> > pvid of the source port (aka eth0 in this example)? That's not what I
> > see, I see the pvid of the egress port (the Marvell device)...
> 
> I meant the PVID of swp3.
> 
> In summary: This series incorrectly assumes that a port's PVID always
> corresponds to the VID that should be assigned to untagged packets on
> egress. This is wrong because PVID only specifies which VID to assign
> packets to on ingress, it says nothing about policy on egress. Multiple
> VIDs can also be configured to egress untagged on a given port. The VID
> must thus be sent along with each packet in order for the driver to be
> able to assign it to the correct VID.

So yes, I think you and I are on the same page now, in that the port
driver must not inject untagged packets into the port's PVID, since the
PVID is an ingress setting. Heck, the PVID might not even be installed
on the egress port, and that doesn't mean it shouldn't send untagged
packets, it only means it shouldn't receive them.

Just to be even more clear, this is what I think happens with your
change.

Untagged packets classified to VLAN 100 are reinterpreted by the port
driver as untagged, and sent to VLAN 1 (the PVID of the egress port).
What you said about should_deliver() doesn't matter/doesn't make sense,
because the offload forwarding domain contains all of swp2, swp3, swp4,
swp5. It is not per-VLAN. So the bridge has no idea that the port driver
will inject the packet with the wrong VLAN information. The packet
_will_ end up on the wrong ports, and it has hopped VLANs.

> > So to reiterate: when you transmit a packet towards your hardware switch
> > which has br0 inside the sb_dev, how does the switch know in which VLAN
> > to forward that packet? As far as I am aware, when the bridge had
> > received the packet as untagged on eth0, it did not insert VLAN 100 into
> > the skb itself, so the bridge VLAN information is lost when delivering
> > the frame to the egress net device. Am I wrong?
> 
> VID 100 is inserted into skb->vlan_tci on ingress from eth0, in
> br_vlan.c/__allowed_ingress. It is then cleared again in
> br_vlan.c/br_handle_vlan if the egress port (swp3 in our example) is set
> to egress the VID untagged.
> 
> The last step only clears skb->vlan_present though, the actual VID
> information still resides in skb->vlan_tci. I tried just removing 5/9
> from this series, and then sourced the VID from skb->vlan_tci for
> untagged packets. It works like a charm! I think this is the way
> forward.
> 
> The question is if we need another bit of information to signal that
> skb->vlan_tci contains valid information, but the packet should still be
> considered untagged? This could also be used on Rx to carry priority
> (PCP) information to the bridge.

Either we add another bit of information, or we don't clear the VLAN
in this bit of code, if the port supports fwd offload:

br_handle_vlan:

	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
		__vlan_hwaccel_clear_tag(skb);

The expectation that the hardware handles VLAN popping on the egress of
individual ports (as part of the replication procedure) should be valid,
I guess. In the case of DSA, all packets sent between the DSA master and
the CPU port using fwd offload should be VLAN-tagged.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-04-26 17:04 ` [RFC net-next 6/9] net: dsa: Forward offloading Tobias Waldekranz
@ 2021-04-27 10:17   ` Vladimir Oltean
  2021-05-04 14:44     ` Tobias Waldekranz
  0 siblings, 1 reply; 42+ messages in thread
From: Vladimir Oltean @ 2021-04-27 10:17 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Mon, Apr 26, 2021 at 07:04:08PM +0200, Tobias Waldekranz wrote:
> Allow DSA drivers to support forward offloading from a bridge by:
> 
> - Passing calls to .ndo_dfwd_{add,del}_station to the drivers.
> 
> - Recording the subordinate device of offloaded skbs in the control
>   buffer so that the tagger can take the appropriate action.
> 
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---
>  include/net/dsa.h |  7 +++++++
>  net/dsa/slave.c   | 36 ++++++++++++++++++++++++++++++++++--
>  2 files changed, 41 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/dsa.h b/include/net/dsa.h
> index 1f9ba9889034..77d4df819299 100644
> --- a/include/net/dsa.h
> +++ b/include/net/dsa.h
> @@ -119,6 +119,7 @@ struct dsa_netdevice_ops {
>  
>  struct dsa_skb_cb {
>  	struct sk_buff *clone;
> +	struct net_device *sb_dev;
>  };
>  
>  struct __dsa_skb_cb {
> @@ -828,6 +829,12 @@ struct dsa_switch_ops {
>  					  const struct switchdev_obj_ring_role_mrp *mrp);
>  	int	(*port_mrp_del_ring_role)(struct dsa_switch *ds, int port,
>  					  const struct switchdev_obj_ring_role_mrp *mrp);
> +
> +	/* L2 forward offloading */
> +	void *	(*dfwd_add_station)(struct dsa_switch *ds, int port,
> +				    struct net_device *sb_dev);
> +	void	(*dfwd_del_station)(struct dsa_switch *ds, int port,
> +				    struct net_device *sb_dev);
>  };
>  
>  #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> index 77b33bd161b8..3689ffa2dbb8 100644
> --- a/net/dsa/slave.c
> +++ b/net/dsa/slave.c
> @@ -657,6 +657,13 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
>  	return dsa_enqueue_skb(nskb, dev);
>  }
>  
> +static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb,
> +				  struct net_device *sb_dev)
> +{
> +	DSA_SKB_CB(skb)->sb_dev = sb_dev;
> +	return netdev_pick_tx(dev, skb, sb_dev);
> +}
> +

DSA_SKB_CB is going away:
https://patchwork.kernel.org/project/netdevbpf/patch/20210427042203.26258-5-yangbo.lu@nxp.com/

Let's either negotiate with Yangbo on keeping it, or make
.ndo_select_queue a bypass towards the tagger, where it can use its own
SKB_CB structure and be more flexible in general (I think I'm leaning
towards the latter).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 4/9] net: bridge: switchdev: Forward offloading
  2021-04-26 17:04 ` [RFC net-next 4/9] net: bridge: switchdev: Forward offloading Tobias Waldekranz
@ 2021-04-27 10:35   ` Nikolay Aleksandrov
  2021-04-28 22:47     ` Tobias Waldekranz
  2021-05-02 15:04   ` Ido Schimmel
  1 sibling, 1 reply; 42+ messages in thread
From: Nikolay Aleksandrov @ 2021-04-27 10:35 UTC (permalink / raw)
  To: Tobias Waldekranz, davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, jiri, idosch,
	stephen, netdev, bridge

On 26/04/2021 20:04, Tobias Waldekranz wrote:
> Allow switchdevs to forward frames from the CPU in accordance with the
> bridge configuration in the same way as is done between bridge
> ports. This means that the bridge will only send a single skb towards
> one of the ports under the switchdev's control, and expects the driver
> to deliver the packet to all eligible ports in its domain.
> 
> Primarily this improves the performance of multicast flows with
> multiple subscribers, as it allows the hardware to perform the frame
> replication.
> 
> The basic flow between the driver and the bridge is as follows:
> 
> - The switchdev accepts the offload by returning a non-null pointer
>   from .ndo_dfwd_add_station when the port is added to the bridge.
> 
> - The bridge sends offloadable skbs to one of the ports under the
>   switchdev's control using dev_queue_xmit_accel.
> 
> - The switchdev notices the offload by checking for a non-NULL
>   "sb_dev" in the core's call to .ndo_select_queue.
> 
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---
>  net/bridge/br_forward.c   | 11 +++++++-
>  net/bridge/br_private.h   | 27 ++++++++++++++++++
>  net/bridge/br_switchdev.c | 59 +++++++++++++++++++++++++++++++++++++--
>  3 files changed, 93 insertions(+), 4 deletions(-)
> 

Hi,
Please try to find a way to reduce the number of new tests in the fast path.
This specific feature might help these devices, but the new tests hurt everybody else.
I don't mind the control plane changes, but I'd like to minimize the fast-path impact.

Do you need "accel_priv" to be a pointer, I mean can't derive sb_dev from the port alone ?
Either way - you can mark the port via its internal flags if it can accelerate, those are
used frequently and are in a hot cache line (by the way that reminds me that the port
offload mark/hwdom should be moved in the first cache line).

For example you could possibly drop fwd_accel, add the bitmap in a union with sb_dev pointer
in br_input_skb_cb which can be set at __br_forward at the accel check and pass it down to
avoid the final test. Furthermore since the hwdoms are bits and if the port accel is a bit
you could probably reduce the nbp_switchdev_can_accel() helper to one test with a few bitops.

In nbp_switchdev_allowed_egress() I'd make the hwdom tests rely on skb's offload_fwd_mark
so in the software forwarding case we could avoid them.

I might be missing something above, but we have to try and reduce these tests as much as
possible, also the port's first cache line is quite crowded so avoiding any new fields
would be best, i.e. at some point we'll move the hwdom/offload mark there to avoid pulling
in the last cache line of net_bridge_port, so it'd be best to avoid having to move accel_priv
there too.

Cheers,
 Nik

> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
> index 6e9b049ae521..b4fb3b0bb1ec 100644
> --- a/net/bridge/br_forward.c
> +++ b/net/bridge/br_forward.c
> @@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
>  
>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
>  {
> +	struct net_device *sb_dev = NULL;
> +
>  	skb_push(skb, ETH_HLEN);
>  	if (!is_skb_forwardable(skb->dev, skb))
>  		goto drop;
> @@ -48,7 +50,10 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
>  		skb_set_network_header(skb, depth);
>  	}
>  
> -	dev_queue_xmit(skb);
> +	if (br_switchdev_accels_skb(skb))
> +		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
> +
> +	dev_queue_xmit_accel(skb, sb_dev);
>  
>  	return 0;
>  
> @@ -105,6 +110,8 @@ static void __br_forward(const struct net_bridge_port *to,
>  		indev = NULL;
>  	}
>  
> +	nbp_switchdev_frame_mark_accel(to, skb);
> +
>  	NF_HOOK(NFPROTO_BRIDGE, br_hook,
>  		net, NULL, skb, indev, skb->dev,
>  		br_forward_finish);
> @@ -174,6 +181,8 @@ static struct net_bridge_port *maybe_deliver(
>  	if (!should_deliver(p, skb))
>  		return prev;
>  
> +	nbp_switchdev_frame_mark_fwd(p, skb);
> +
>  	if (!prev)
>  		goto out;
>  
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index aba92864d285..933e951b0d7a 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -332,6 +332,7 @@ struct net_bridge_port {
>  #endif
>  #ifdef CONFIG_NET_SWITCHDEV
>  	int				hwdom;
> +	void				*accel_priv;
>  #endif
>  	u16				group_fwd_mask;
>  	u16				backup_redirected_cnt;
> @@ -506,7 +507,9 @@ struct br_input_skb_cb {
>  #endif
>  
>  #ifdef CONFIG_NET_SWITCHDEV
> +	u8 fwd_accel:1;
>  	int src_hwdom;
> +	br_hwdom_map_t fwd_hwdoms;
>  #endif
>  };
>  
> @@ -1597,6 +1600,15 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
>  
>  /* br_switchdev.c */
>  #ifdef CONFIG_NET_SWITCHDEV
> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
> +{
> +	return BR_INPUT_SKB_CB(skb)->fwd_accel;
> +}
> +
> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
> +				    struct sk_buff *skb);
> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
> +				  struct sk_buff *skb);
>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>  			      struct sk_buff *skb);
>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
> @@ -1619,6 +1631,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
>  	skb->offload_fwd_mark = 0;
>  }
>  #else
> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
> +{
> +	return false;
> +}
> +
> +static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
> +						  struct sk_buff *skb)
> +{
> +}
> +
> +static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
> +						struct sk_buff *skb)
> +{
> +}
> +
>  static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>  					    struct sk_buff *skb)
>  {
> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
> index 54bd7205bfb5..c903171ad291 100644
> --- a/net/bridge/br_switchdev.c
> +++ b/net/bridge/br_switchdev.c
> @@ -8,6 +8,26 @@
>  
>  #include "br_private.h"
>  
> +static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,
> +				    const struct sk_buff *skb)
> +{
> +	return p->accel_priv && (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);
> +}
> +
> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
> +				    struct sk_buff *skb)
> +{
> +	if (nbp_switchdev_can_accel(p, skb))
> +		BR_INPUT_SKB_CB(skb)->fwd_accel = true;
> +}
> +
> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
> +				  struct sk_buff *skb)
> +{
> +	if (nbp_switchdev_can_accel(p, skb))
> +		set_bit(p->hwdom, BR_INPUT_SKB_CB(skb)->fwd_hwdoms);
> +}
> +
>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>  			      struct sk_buff *skb)
>  {
> @@ -18,8 +38,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
>  				  const struct sk_buff *skb)
>  {
> -	return !skb->offload_fwd_mark ||
> -	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
> +	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);
> +
> +	return !test_bit(p->hwdom, cb->fwd_hwdoms) &&
> +		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);
>  }
>  
>  /* Flags that can be offloaded to hardware */
> @@ -125,6 +147,27 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
>  	return switchdev_port_obj_del(dev, &v.obj);
>  }
>  
> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
> +{
> +	void *priv;
> +
> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
> +		return;
> +
> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
> +	if (!IS_ERR_OR_NULL(priv))
> +		p->accel_priv = priv;
> +}
> +
> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
> +{
> +	if (!p->accel_priv)
> +		return;
> +
> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
> +	p->accel_priv = NULL;
> +}
> +
>  static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
>  {
>  	struct net_bridge *br = joining->br;
> @@ -176,13 +219,23 @@ int nbp_switchdev_add(struct net_bridge_port *p)
>  		return err;
>  	}
>  
> -	return nbp_switchdev_hwdom_set(p);
> +	err = nbp_switchdev_hwdom_set(p);
> +	if (err)
> +		return err;
> +
> +	if (p->hwdom)
> +		nbp_switchdev_fwd_offload_add(p);
> +
> +	return 0;
>  }
>  
>  void nbp_switchdev_del(struct net_bridge_port *p)
>  {
>  	ASSERT_RTNL();
>  
> +	if (p->accel_priv)
> +		nbp_switchdev_fwd_offload_del(p);
> +
>  	if (p->hwdom)
>  		nbp_switchdev_hwdom_put(p);
>  }
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 3/9] net: bridge: switchdev: Recycle unused hwdoms
  2021-04-26 17:04 ` [RFC net-next 3/9] net: bridge: switchdev: Recycle unused hwdoms Tobias Waldekranz
@ 2021-04-27 10:42   ` Nikolay Aleksandrov
  0 siblings, 0 replies; 42+ messages in thread
From: Nikolay Aleksandrov @ 2021-04-27 10:42 UTC (permalink / raw)
  To: Tobias Waldekranz, davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, jiri, idosch,
	stephen, netdev, bridge

On 26/04/2021 20:04, Tobias Waldekranz wrote:
> Since hwdoms has thus far only been used for equality comparisons, the
> bridge has used the simplest possible assignment policy; using a
> counter to keep track of the last value handed out.
> 
> With the upcoming transmit offloading, we need to perform set
> operations efficiently based on hwdoms, e.g. we want to answer
> questions like "has this skb been forwarded to any port within this
> hwdom?"
> 
> Move to a bitmap-based allocation scheme that recycles hwdoms once all
> members leaves the bridge. This means that we can use a single
> unsigned long to keep track of the hwdoms that have received an skb.
> 
> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> ---
>  net/bridge/br_if.c        |  4 +-
>  net/bridge/br_private.h   | 29 +++++++++---
>  net/bridge/br_switchdev.c | 94 ++++++++++++++++++++++++++-------------
>  3 files changed, 87 insertions(+), 40 deletions(-)
> 
> diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
> index 73fa703f8df5..adaf78e45c23 100644
> --- a/net/bridge/br_if.c
> +++ b/net/bridge/br_if.c
> @@ -349,6 +349,7 @@ static void del_nbp(struct net_bridge_port *p)
>  	nbp_backup_clear(p);
>  
>  	nbp_update_port_count(br);
> +	nbp_switchdev_del(p);
>  
>  	netdev_upper_dev_unlink(dev, br->dev);
>  
> @@ -643,7 +644,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
>  	if (err)
>  		goto err5;
>  
> -	err = nbp_switchdev_hwdom_set(p);
> +	err = nbp_switchdev_add(p);
>  	if (err)
>  		goto err6;
>  
> @@ -704,6 +705,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
>  	list_del_rcu(&p->list);
>  	br_fdb_delete_by_port(br, p, 0, 1);
>  	nbp_update_port_count(br);
> +	nbp_switchdev_del(p);
>  err6:
>  	netdev_upper_dev_unlink(dev, br->dev);
>  err5:
> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
> index 53248715f631..aba92864d285 100644
> --- a/net/bridge/br_private.h
> +++ b/net/bridge/br_private.h
> @@ -29,6 +29,8 @@
>  
>  #define BR_MULTICAST_DEFAULT_HASH_MAX 4096
>  
> +#define BR_HWDOM_MAX BITS_PER_LONG
> +
>  #define BR_VERSION	"2.3"
>  
>  /* Control of forwarding link local multicast */
> @@ -54,6 +56,8 @@ typedef struct bridge_id bridge_id;
>  typedef struct mac_addr mac_addr;
>  typedef __u16 port_id;
>  
> +typedef DECLARE_BITMAP(br_hwdom_map_t, BR_HWDOM_MAX);
> +

You can avoid the typedef and DECLARE_BITMAP() and just use an
unsigned long below. In general avoiding new typedefs is a good thing. :)

>  struct bridge_id {
>  	unsigned char	prio[2];
>  	unsigned char	addr[ETH_ALEN];
> @@ -472,7 +476,7 @@ struct net_bridge {
>  	u32				auto_cnt;
>  
>  #ifdef CONFIG_NET_SWITCHDEV
> -	int last_hwdom;
> +	br_hwdom_map_t			busy_hwdoms;
>  #endif
>  	struct hlist_head		fdb_list;
>  
> @@ -1593,7 +1597,6 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
>  
>  /* br_switchdev.c */
>  #ifdef CONFIG_NET_SWITCHDEV
> -int nbp_switchdev_hwdom_set(struct net_bridge_port *p);
>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>  			      struct sk_buff *skb);
>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
> @@ -1607,17 +1610,15 @@ void br_switchdev_fdb_notify(const struct net_bridge_fdb_entry *fdb,
>  int br_switchdev_port_vlan_add(struct net_device *dev, u16 vid, u16 flags,
>  			       struct netlink_ext_ack *extack);
>  int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid);
> +int nbp_switchdev_add(struct net_bridge_port *p);
> +void nbp_switchdev_del(struct net_bridge_port *p);
> +void br_switchdev_init(struct net_bridge *br);
>  
>  static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
>  {
>  	skb->offload_fwd_mark = 0;
>  }
>  #else
> -static inline int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
> -{
> -	return 0;
> -}
> -
>  static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>  					    struct sk_buff *skb)
>  {
> @@ -1657,6 +1658,20 @@ br_switchdev_fdb_notify(const struct net_bridge_fdb_entry *fdb, int type)
>  static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
>  {
>  }
> +
> +static inline int nbp_switchdev_add(struct net_bridge_port *p)
> +{
> +	return 0;
> +}
> +
> +static inline void nbp_switchdev_del(struct net_bridge_port *p)
> +{
> +}
> +
> +static inline void br_switchdev_init(struct net_bridge *br)
> +{
> +}
> +
>  #endif /* CONFIG_NET_SWITCHDEV */
>  
>  /* br_arp_nd_proxy.c */
> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
> index bc085077ae71..54bd7205bfb5 100644
> --- a/net/bridge/br_switchdev.c
> +++ b/net/bridge/br_switchdev.c
> @@ -8,38 +8,6 @@
>  
>  #include "br_private.h"
>  
> -static int br_switchdev_hwdom_get(struct net_bridge *br, struct net_device *dev)
> -{
> -	struct net_bridge_port *p;
> -
> -	/* dev is yet to be added to the port list. */
> -	list_for_each_entry(p, &br->port_list, list) {
> -		if (netdev_port_same_parent_id(dev, p->dev))
> -			return p->hwdom;
> -	}
> -
> -	return ++br->last_hwdom;
> -}
> -
> -int nbp_switchdev_hwdom_set(struct net_bridge_port *p)
> -{
> -	struct netdev_phys_item_id ppid = { };
> -	int err;
> -
> -	ASSERT_RTNL();
> -
> -	err = dev_get_port_parent_id(p->dev, &ppid, true);
> -	if (err) {
> -		if (err == -EOPNOTSUPP)
> -			return 0;
> -		return err;
> -	}
> -
> -	p->hwdom = br_switchdev_hwdom_get(p->br, p->dev);
> -
> -	return 0;
> -}
> -
>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>  			      struct sk_buff *skb)
>  {
> @@ -156,3 +124,65 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
>  
>  	return switchdev_port_obj_del(dev, &v.obj);
>  }
> +
> +static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
> +{
> +	struct net_bridge *br = joining->br;
> +	struct net_bridge_port *p;
> +	int hwdom;
> +
> +	/* joining is yet to be added to the port list. */
> +	list_for_each_entry(p, &br->port_list, list) {
> +		if (netdev_port_same_parent_id(joining->dev, p->dev)) {
> +			joining->hwdom = p->hwdom;
> +			return 0;
> +		}
> +	}
> +
> +	hwdom = find_next_zero_bit(br->busy_hwdoms, BR_HWDOM_MAX, 1);
> +	if (hwdom >= BR_HWDOM_MAX)
> +		return -EBUSY;
> +
> +	set_bit(hwdom, br->busy_hwdoms);
> +	joining->hwdom = hwdom;
> +	return 0;
> +}
> +
> +static void nbp_switchdev_hwdom_put(struct net_bridge_port *leaving)
> +{
> +	struct net_bridge *br = leaving->br;
> +	struct net_bridge_port *p;
> +
> +	/* leaving is no longer in the port list. */
> +	list_for_each_entry(p, &br->port_list, list) {
> +		if (p->hwdom == leaving->hwdom)
> +			return;
> +	}
> +
> +	clear_bit(leaving->hwdom, br->busy_hwdoms);
> +}
> +
> +int nbp_switchdev_add(struct net_bridge_port *p)
> +{
> +	struct netdev_phys_item_id ppid = { };
> +	int err;
> +
> +	ASSERT_RTNL();
> +
> +	err = dev_get_port_parent_id(p->dev, &ppid, true);
> +	if (err) {
> +		if (err == -EOPNOTSUPP)
> +			return 0;
> +		return err;
> +	}
> +
> +	return nbp_switchdev_hwdom_set(p);
> +}
> +
> +void nbp_switchdev_del(struct net_bridge_port *p)
> +{
> +	ASSERT_RTNL();
> +
> +	if (p->hwdom)
> +		nbp_switchdev_hwdom_put(p);
> +}
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 4/9] net: bridge: switchdev: Forward offloading
  2021-04-27 10:35   ` Nikolay Aleksandrov
@ 2021-04-28 22:47     ` Tobias Waldekranz
  2021-04-29  9:16       ` Nikolay Aleksandrov
  0 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-28 22:47 UTC (permalink / raw)
  To: Nikolay Aleksandrov, davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, jiri, idosch,
	stephen, netdev, bridge

On Tue, Apr 27, 2021 at 13:35, Nikolay Aleksandrov <nikolay@nvidia.com> wrote:
> On 26/04/2021 20:04, Tobias Waldekranz wrote:
>> Allow switchdevs to forward frames from the CPU in accordance with the
>> bridge configuration in the same way as is done between bridge
>> ports. This means that the bridge will only send a single skb towards
>> one of the ports under the switchdev's control, and expects the driver
>> to deliver the packet to all eligible ports in its domain.
>> 
>> Primarily this improves the performance of multicast flows with
>> multiple subscribers, as it allows the hardware to perform the frame
>> replication.
>> 
>> The basic flow between the driver and the bridge is as follows:
>> 
>> - The switchdev accepts the offload by returning a non-null pointer
>>   from .ndo_dfwd_add_station when the port is added to the bridge.
>> 
>> - The bridge sends offloadable skbs to one of the ports under the
>>   switchdev's control using dev_queue_xmit_accel.
>> 
>> - The switchdev notices the offload by checking for a non-NULL
>>   "sb_dev" in the core's call to .ndo_select_queue.
>> 
>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> ---
>>  net/bridge/br_forward.c   | 11 +++++++-
>>  net/bridge/br_private.h   | 27 ++++++++++++++++++
>>  net/bridge/br_switchdev.c | 59 +++++++++++++++++++++++++++++++++++++--
>>  3 files changed, 93 insertions(+), 4 deletions(-)
>> 
>
> Hi,
> Please try to find a way to reduce the number of new tests in the fast path.
> This specific feature might help these devices, but the new tests hurt everybody else.
> I don't mind the control plane changes, but I'd like to minimize the fast-path impact.

Wholeheartedly agree.

> Do you need "accel_priv" to be a pointer, I mean can't derive sb_dev from the port alone ?
> Either way - you can mark the port via its internal flags if it can accelerate, those are
> used frequently and are in a hot cache line (by the way that reminds me that the port
> offload mark/hwdom should be moved in the first cache line).

I need to stash accel_priv somewhere as .ndo_dfwd_del_station expects it
sent back when unregistering the offload. But there is no need for it to
be part of the fast-path. Would it be ok to add a BR_FORWARD_OFFLOAD to
p->flags, which would be used in the fast-path, while also keeping
accel_priv on the port, but on a colder line?

> For example you could possibly drop fwd_accel, add the bitmap in a union with sb_dev pointer
> in br_input_skb_cb which can be set at __br_forward at the accel check and pass it down to
> avoid the final test.

Great idea! I will add it in v1.

> Furthermore since the hwdoms are bits and if the port accel is a bit
> you could probably reduce the nbp_switchdev_can_accel() helper to one test with a few bitops.

Not sure I follow. The current code has two tests:

1. Is offloading enabled on the port. (To be done using p->flags in v1)
2. Is offloading allowed for this frame to this port.

The port can be part of a hwdom that does not support forward
offloading; indeed only one driver would support it initially. So how do
I avoid having to test the conditions individually?

> In nbp_switchdev_allowed_egress() I'd make the hwdom tests rely on skb's offload_fwd_mark
> so in the software forwarding case we could avoid them.

Flipping the left and right side of the expression is possible, but I
think that would only impact the case where the frame is _not_ allowed
to egress. Is that what you mean? Otherwise you still need to test for
the condition that we have forwarded to this port's hwdom already, to
avoid sending duplicates on the wire. This is independent of
skb->offload_fwd_mark as both Rx-offloaded and non-Rx-offloaded frames
can still be Tx-offloaded to other hwdoms.

A typical example would be a broadcast frame ingressing the bridge from
eth0 in the figure from the cover letter. skb->offload_fwd_mark would
always be 0, but you still need to test fwd_hwdoms to skip over swp{1,2}
after you have sent the skb to swp0.

> I might be missing something above, but we have to try and reduce these tests as much as
> possible, also the port's first cache line is quite crowded so avoiding any new fields
> would be best, i.e. at some point we'll move the hwdom/offload mark there to avoid pulling
> in the last cache line of net_bridge_port, so it'd be best to avoid having to move accel_priv
> there too.
>
> Cheers,
>  Nik
>
>> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
>> index 6e9b049ae521..b4fb3b0bb1ec 100644
>> --- a/net/bridge/br_forward.c
>> +++ b/net/bridge/br_forward.c
>> @@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
>>  
>>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
>>  {
>> +	struct net_device *sb_dev = NULL;
>> +
>>  	skb_push(skb, ETH_HLEN);
>>  	if (!is_skb_forwardable(skb->dev, skb))
>>  		goto drop;
>> @@ -48,7 +50,10 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
>>  		skb_set_network_header(skb, depth);
>>  	}
>>  
>> -	dev_queue_xmit(skb);
>> +	if (br_switchdev_accels_skb(skb))
>> +		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
>> +
>> +	dev_queue_xmit_accel(skb, sb_dev);
>>  
>>  	return 0;
>>  
>> @@ -105,6 +110,8 @@ static void __br_forward(const struct net_bridge_port *to,
>>  		indev = NULL;
>>  	}
>>  
>> +	nbp_switchdev_frame_mark_accel(to, skb);
>> +
>>  	NF_HOOK(NFPROTO_BRIDGE, br_hook,
>>  		net, NULL, skb, indev, skb->dev,
>>  		br_forward_finish);
>> @@ -174,6 +181,8 @@ static struct net_bridge_port *maybe_deliver(
>>  	if (!should_deliver(p, skb))
>>  		return prev;
>>  
>> +	nbp_switchdev_frame_mark_fwd(p, skb);
>> +
>>  	if (!prev)
>>  		goto out;
>>  
>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
>> index aba92864d285..933e951b0d7a 100644
>> --- a/net/bridge/br_private.h
>> +++ b/net/bridge/br_private.h
>> @@ -332,6 +332,7 @@ struct net_bridge_port {
>>  #endif
>>  #ifdef CONFIG_NET_SWITCHDEV
>>  	int				hwdom;
>> +	void				*accel_priv;
>>  #endif
>>  	u16				group_fwd_mask;
>>  	u16				backup_redirected_cnt;
>> @@ -506,7 +507,9 @@ struct br_input_skb_cb {
>>  #endif
>>  
>>  #ifdef CONFIG_NET_SWITCHDEV
>> +	u8 fwd_accel:1;
>>  	int src_hwdom;
>> +	br_hwdom_map_t fwd_hwdoms;
>>  #endif
>>  };
>>  
>> @@ -1597,6 +1600,15 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
>>  
>>  /* br_switchdev.c */
>>  #ifdef CONFIG_NET_SWITCHDEV
>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
>> +{
>> +	return BR_INPUT_SKB_CB(skb)->fwd_accel;
>> +}
>> +
>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
>> +				    struct sk_buff *skb);
>> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
>> +				  struct sk_buff *skb);
>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>  			      struct sk_buff *skb);
>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
>> @@ -1619,6 +1631,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
>>  	skb->offload_fwd_mark = 0;
>>  }
>>  #else
>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
>> +{
>> +	return false;
>> +}
>> +
>> +static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
>> +						  struct sk_buff *skb)
>> +{
>> +}
>> +
>> +static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
>> +						struct sk_buff *skb)
>> +{
>> +}
>> +
>>  static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>  					    struct sk_buff *skb)
>>  {
>> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
>> index 54bd7205bfb5..c903171ad291 100644
>> --- a/net/bridge/br_switchdev.c
>> +++ b/net/bridge/br_switchdev.c
>> @@ -8,6 +8,26 @@
>>  
>>  #include "br_private.h"
>>  
>> +static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,
>> +				    const struct sk_buff *skb)
>> +{
>> +	return p->accel_priv && (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);
>> +}
>> +
>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
>> +				    struct sk_buff *skb)
>> +{
>> +	if (nbp_switchdev_can_accel(p, skb))
>> +		BR_INPUT_SKB_CB(skb)->fwd_accel = true;
>> +}
>> +
>> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
>> +				  struct sk_buff *skb)
>> +{
>> +	if (nbp_switchdev_can_accel(p, skb))
>> +		set_bit(p->hwdom, BR_INPUT_SKB_CB(skb)->fwd_hwdoms);
>> +}
>> +
>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>  			      struct sk_buff *skb)
>>  {
>> @@ -18,8 +38,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
>>  				  const struct sk_buff *skb)
>>  {
>> -	return !skb->offload_fwd_mark ||
>> -	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
>> +	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);
>> +
>> +	return !test_bit(p->hwdom, cb->fwd_hwdoms) &&
>> +		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);
>>  }
>>  
>>  /* Flags that can be offloaded to hardware */
>> @@ -125,6 +147,27 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
>>  	return switchdev_port_obj_del(dev, &v.obj);
>>  }
>>  
>> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
>> +{
>> +	void *priv;
>> +
>> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
>> +		return;
>> +
>> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
>> +	if (!IS_ERR_OR_NULL(priv))
>> +		p->accel_priv = priv;
>> +}
>> +
>> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
>> +{
>> +	if (!p->accel_priv)
>> +		return;
>> +
>> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
>> +	p->accel_priv = NULL;
>> +}
>> +
>>  static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
>>  {
>>  	struct net_bridge *br = joining->br;
>> @@ -176,13 +219,23 @@ int nbp_switchdev_add(struct net_bridge_port *p)
>>  		return err;
>>  	}
>>  
>> -	return nbp_switchdev_hwdom_set(p);
>> +	err = nbp_switchdev_hwdom_set(p);
>> +	if (err)
>> +		return err;
>> +
>> +	if (p->hwdom)
>> +		nbp_switchdev_fwd_offload_add(p);
>> +
>> +	return 0;
>>  }
>>  
>>  void nbp_switchdev_del(struct net_bridge_port *p)
>>  {
>>  	ASSERT_RTNL();
>>  
>> +	if (p->accel_priv)
>> +		nbp_switchdev_fwd_offload_del(p);
>> +
>>  	if (p->hwdom)
>>  		nbp_switchdev_hwdom_put(p);
>>  }
>> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 5/9] net: dsa: Track port PVIDs
  2021-04-27 10:07           ` Vladimir Oltean
@ 2021-04-28 23:10             ` Tobias Waldekranz
  0 siblings, 0 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-28 23:10 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Tue, Apr 27, 2021 at 13:07, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Apr 27, 2021 at 11:12:56AM +0200, Tobias Waldekranz wrote:
>> On Mon, Apr 26, 2021 at 23:28, Vladimir Oltean <olteanv@gmail.com> wrote:
>> > On Mon, Apr 26, 2021 at 10:05:52PM +0200, Tobias Waldekranz wrote:
>> >> On Mon, Apr 26, 2021 at 22:40, Vladimir Oltean <olteanv@gmail.com> wrote:
>> >> > Hi Tobias,
>> >> >
>> >> > On Mon, Apr 26, 2021 at 07:04:07PM +0200, Tobias Waldekranz wrote:
>> >> >> In some scenarios a tagger must know which VLAN to assign to a packet,
>> >> >> even if the packet is set to egress untagged. Since the VLAN
>> >> >> information in the skb will be removed by the bridge in this case,
>> >> >> track each port's PVID such that the VID of an outgoing frame can
>> >> >> always be determined.
>> >> >> 
>> >> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> >> >> ---
>> >> >
>> >> > Let me give you this real-life example:
>> >> >
>> >> > #!/bin/bash
>> >> >
>> >> > ip link add br0 type bridge vlan_filtering 1
>> >> > for eth in eth0 eth1 swp2 swp3 swp4 swp5; do
>> >> > 	ip link set $eth up
>> >> > 	ip link set $eth master br0
>> >> > done
>> >> > ip link set br0 up
>> >> >
>> >> > bridge vlan add dev eth0 vid 100 pvid untagged
>> >> > bridge vlan del dev swp2 vid 1
>> >> > bridge vlan del dev swp3 vid 1
>> >> > bridge vlan add dev swp2 vid 100
>> >> > bridge vlan add dev swp3 vid 100 untagged
>> >> >
>> >> > reproducible on the NXP LS1021A-TSN board.
>> >> > The bridge receives an untagged packet on eth0 and floods it.
>> >> > It should reach swp2 and swp3, and be tagged on swp2, and untagged on
>> >> > swp3 respectively.
>> >> >
>> >> > With your idea of sending untagged frames towards the port's pvid,
>> >> > wouldn't we be leaking this packet to VLAN 1, therefore towards ports
>> >> > swp4 and swp5, and the real destination ports would not get this packet?
>> >> 
>> >> I am not sure I follow. The bridge would never send the packet to
>> >> swp{4,5} because should_deliver() rejects them (as usual). So it could
>> >> only be sent either to swp2 or swp3. In the case that swp3 is first in
>> >> the bridge's port list, it would be sent untagged, but the PVID would be
>> >> 100 and the flooding would thus be limited to swp{2,3}.
>> >
>> > Sorry, _I_ don't understand.
>> >
>> > When you say that the PVID is 100, whose PVID is it, exactly? Is it the
>> > pvid of the source port (aka eth0 in this example)? That's not what I
>> > see, I see the pvid of the egress port (the Marvell device)...
>> 
>> I meant the PVID of swp3.
>> 
>> In summary: This series incorrectly assumes that a port's PVID always
>> corresponds to the VID that should be assigned to untagged packets on
>> egress. This is wrong because PVID only specifies which VID to assign
>> packets to on ingress, it says nothing about policy on egress. Multiple
>> VIDs can also be configured to egress untagged on a given port. The VID
>> must thus be sent along with each packet in order for the driver to be
>> able to assign it to the correct VID.
>
> So yes, I think you and I are on the same page now, in that the port
> driver must not inject untagged packets into the port's PVID, since the
> PVID is an ingress setting. Heck, the PVID might not even be installed
> on the egress port, and that doesn't mean it shouldn't send untagged
> packets, it only means it shouldn't receive them.
>
> Just to be even more clear, this is what I think happens with your
> change.
>
> Untagged packets classified to VLAN 100 are reinterpreted by the port
> driver as untagged, and sent to VLAN 1 (the PVID of the egress port).
> What you said about should_deliver() doesn't matter/doesn't make sense,
> because the offload forwarding domain contains all of swp2, swp3, swp4,
> swp5. It is not per-VLAN. So the bridge has no idea that the port driver
> will inject the packet with the wrong VLAN information. The packet
> _will_ end up on the wrong ports, and it has hopped VLANs.

My brain's iproute2 simulator must have malfunctioned :) Anyway, we
agree that the current implementation only works for the common case
where there is a single untagged VID on a port that is also set as the
PVID.

>> > So to reiterate: when you transmit a packet towards your hardware switch
>> > which has br0 inside the sb_dev, how does the switch know in which VLAN
>> > to forward that packet? As far as I am aware, when the bridge had
>> > received the packet as untagged on eth0, it did not insert VLAN 100 into
>> > the skb itself, so the bridge VLAN information is lost when delivering
>> > the frame to the egress net device. Am I wrong?
>> 
>> VID 100 is inserted into skb->vlan_tci on ingress from eth0, in
>> br_vlan.c/__allowed_ingress. It is then cleared again in
>> br_vlan.c/br_handle_vlan if the egress port (swp3 in our example) is set
>> to egress the VID untagged.
>> 
>> The last step only clears skb->vlan_present though, the actual VID
>> information still resides in skb->vlan_tci. I tried just removing 5/9
>> from this series, and then sourced the VID from skb->vlan_tci for
>> untagged packets. It works like a charm! I think this is the way
>> forward.
>> 
>> The question is if we need another bit of information to signal that
>> skb->vlan_tci contains valid information, but the packet should still be
>> considered untagged? This could also be used on Rx to carry priority
>> (PCP) information to the bridge.
>
> Either we add another bit of information, or we don't clear the VLAN
> in this bit of code, if the port supports fwd offload:
>
> br_handle_vlan:
>
> 	if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
> 		__vlan_hwaccel_clear_tag(skb);
>
> The expectation that the hardware handles VLAN popping on the egress of
> individual ports (as part of the replication procedure) should be valid,
> I guess. In the case of DSA, all packets sent between the DSA master and
> the CPU port using fwd offload should be VLAN-tagged.

Yeah I agree that for this offload, it would be fine to always send
packets tagged. There are some things that might be helped by that extra
bit of info though:

- VLAN PCP. The switchdev and bridge could communicate the priority bits
  also for untagged packets, both on ingress and egress. This would
  maintain the priority up to a VLAN upper on top of the bridge, where
  you can use the standard {ingress,egress}-qos-map feature to map PCP
  to socket priority.

- TC. Right now, matching on VLANs is messy because there is no way to
  express "match VLAN1" in a filter that can be reused across a group of
  ports ("block" in TC parlance) where some may be untagged members and
  others are tagged. In hardware, the VLAN parser typically resides much
  earlier in the pipeline (way before reaching the bridge engine) so
  TCAMs can easily do these things.

But this is perhaps a separate job. Nothing stops us from going the
always-tagged-route now and adding "untagged awareness" to the stack
later on.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 4/9] net: bridge: switchdev: Forward offloading
  2021-04-28 22:47     ` Tobias Waldekranz
@ 2021-04-29  9:16       ` Nikolay Aleksandrov
  2021-04-29 14:55         ` Tobias Waldekranz
  0 siblings, 1 reply; 42+ messages in thread
From: Nikolay Aleksandrov @ 2021-04-29  9:16 UTC (permalink / raw)
  To: Tobias Waldekranz, davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, jiri, idosch,
	stephen, netdev, bridge

On 29/04/2021 01:47, Tobias Waldekranz wrote:
> On Tue, Apr 27, 2021 at 13:35, Nikolay Aleksandrov <nikolay@nvidia.com> wrote:
>> On 26/04/2021 20:04, Tobias Waldekranz wrote:
>>> Allow switchdevs to forward frames from the CPU in accordance with the
>>> bridge configuration in the same way as is done between bridge
>>> ports. This means that the bridge will only send a single skb towards
>>> one of the ports under the switchdev's control, and expects the driver
>>> to deliver the packet to all eligible ports in its domain.
>>>
>>> Primarily this improves the performance of multicast flows with
>>> multiple subscribers, as it allows the hardware to perform the frame
>>> replication.
>>>
>>> The basic flow between the driver and the bridge is as follows:
>>>
>>> - The switchdev accepts the offload by returning a non-null pointer
>>>   from .ndo_dfwd_add_station when the port is added to the bridge.
>>>
>>> - The bridge sends offloadable skbs to one of the ports under the
>>>   switchdev's control using dev_queue_xmit_accel.
>>>
>>> - The switchdev notices the offload by checking for a non-NULL
>>>   "sb_dev" in the core's call to .ndo_select_queue.
>>>
>>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>>> ---
>>>  net/bridge/br_forward.c   | 11 +++++++-
>>>  net/bridge/br_private.h   | 27 ++++++++++++++++++
>>>  net/bridge/br_switchdev.c | 59 +++++++++++++++++++++++++++++++++++++--
>>>  3 files changed, 93 insertions(+), 4 deletions(-)
>>>
>>
>> Hi,
>> Please try to find a way to reduce the number of new tests in the fast path.
>> This specific feature might help these devices, but the new tests hurt everybody else.
>> I don't mind the control plane changes, but I'd like to minimize the fast-path impact.
> 
> Wholeheartedly agree.
> 
>> Do you need "accel_priv" to be a pointer, I mean can't derive sb_dev from the port alone ?
>> Either way - you can mark the port via its internal flags if it can accelerate, those are
>> used frequently and are in a hot cache line (by the way that reminds me that the port
>> offload mark/hwdom should be moved in the first cache line).
> 
> I need to stash accel_priv somewhere as .ndo_dfwd_del_station expects it
> sent back when unregistering the offload. But there is no need for it to
> be part of the fast-path. Would it be ok to add a BR_FORWARD_OFFLOAD to
> p->flags, which would be used in the fast-path, while also keeping
> accel_priv on the port, but on a colder line?
> 

About the flag - yes, that is what I'm proposing. Use an internal port flag for it
and for the tests.

About the pointer - it is certainly not appropriate to use net_bridge_port for a void pointer
coming in from a driver or external place. I see that it's always the bridge device so can
we just compare the result of the add op to the bridge dev and set only the flag based on it?
Then on the del path if the flag is set we know it's the bridge and use it as sb_dev.

>> For example you could possibly drop fwd_accel, add the bitmap in a union with sb_dev pointer
>> in br_input_skb_cb which can be set at __br_forward at the accel check and pass it down to
>> avoid the final test.
> 
> Great idea! I will add it in v1.
> 
Actually you can also use a static key to avoid all checks and effects of this feature on
the bridge fast-path. You can enable it when the first device that can accelerate shows
up and disable it when the last one leaves.

By the way __br_forward() can take one more argument (sb_dev) and always set it because
that cache line is dirty anyway due to the tstamp zeroing, but that doesn't matter if
static keys are used. Just noting it. :)

>> Furthermore since the hwdoms are bits and if the port accel is a bit
>> you could probably reduce the nbp_switchdev_can_accel() helper to one test with a few bitops.
> 
> Not sure I follow. The current code has two tests:
> 
> 1. Is offloading enabled on the port. (To be done using p->flags in v1)
> 2. Is offloading allowed for this frame to this port.
> 
> The port can be part of a hwdom that does not support forward
> offloading; indeed only one driver would support it initially. So how do
> I avoid having to test the conditions individually?
> 

Coming to think of it with the port bit test first it should be fine.


For the sake of fun here's one way that can turn it into one test,
obviously I haven't tested anything:
 - reserve a few most significant bits of hwdom as "feature" bits, say 4
 - add a "feature" bit which encodes ability to accelerate
 => test becomes p->hwdom | (src_hwdom & hwdom_bitmask) > (accel feature bit) | p->hwdom
It's very hacky and _not_ to be used. :)

>> In nbp_switchdev_allowed_egress() I'd make the hwdom tests rely on skb's offload_fwd_mark
>> so in the software forwarding case we could avoid them.
> 
> Flipping the left and right side of the expression is possible, but I
> think that would only impact the case where the frame is _not_ allowed
> to egress. Is that what you mean? Otherwise you still need to test for
> the condition that we have forwarded to this port's hwdom already, to
> avoid sending duplicates on the wire. This is independent of
> skb->offload_fwd_mark as both Rx-offloaded and non-Rx-offloaded frames
> can still be Tx-offloaded to other hwdoms.
> 
> A typical example would be a broadcast frame ingressing the bridge from
> eth0 in the figure from the cover letter. skb->offload_fwd_mark would
> always be 0, but you still need to test fwd_hwdoms to skip over swp{1,2}
> after you have sent the skb to swp0.
> 

Yeah, you're right I was still thinking only of offloaded skbs, didn't consider
mixing the two.

>> I might be missing something above, but we have to try and reduce these tests as much as
>> possible, also the port's first cache line is quite crowded so avoiding any new fields
>> would be best, i.e. at some point we'll move the hwdom/offload mark there to avoid pulling
>> in the last cache line of net_bridge_port, so it'd be best to avoid having to move accel_priv
>> there too.
>>
>> Cheers,
>>  Nik
>>
>>> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
>>> index 6e9b049ae521..b4fb3b0bb1ec 100644
>>> --- a/net/bridge/br_forward.c
>>> +++ b/net/bridge/br_forward.c
>>> @@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
>>>  
>>>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
>>>  {
>>> +	struct net_device *sb_dev = NULL;
>>> +
>>>  	skb_push(skb, ETH_HLEN);
>>>  	if (!is_skb_forwardable(skb->dev, skb))
>>>  		goto drop;
>>> @@ -48,7 +50,10 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
>>>  		skb_set_network_header(skb, depth);
>>>  	}
>>>  
>>> -	dev_queue_xmit(skb);
>>> +	if (br_switchdev_accels_skb(skb))
>>> +		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
>>> +
>>> +	dev_queue_xmit_accel(skb, sb_dev);
>>>  
>>>  	return 0;
>>>  
>>> @@ -105,6 +110,8 @@ static void __br_forward(const struct net_bridge_port *to,
>>>  		indev = NULL;
>>>  	}
>>>  
>>> +	nbp_switchdev_frame_mark_accel(to, skb);
>>> +
>>>  	NF_HOOK(NFPROTO_BRIDGE, br_hook,
>>>  		net, NULL, skb, indev, skb->dev,
>>>  		br_forward_finish);
>>> @@ -174,6 +181,8 @@ static struct net_bridge_port *maybe_deliver(
>>>  	if (!should_deliver(p, skb))
>>>  		return prev;
>>>  
>>> +	nbp_switchdev_frame_mark_fwd(p, skb);
>>> +
>>>  	if (!prev)
>>>  		goto out;
>>>  
>>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
>>> index aba92864d285..933e951b0d7a 100644
>>> --- a/net/bridge/br_private.h
>>> +++ b/net/bridge/br_private.h
>>> @@ -332,6 +332,7 @@ struct net_bridge_port {
>>>  #endif
>>>  #ifdef CONFIG_NET_SWITCHDEV
>>>  	int				hwdom;
>>> +	void				*accel_priv;
>>>  #endif
>>>  	u16				group_fwd_mask;
>>>  	u16				backup_redirected_cnt;
>>> @@ -506,7 +507,9 @@ struct br_input_skb_cb {
>>>  #endif
>>>  
>>>  #ifdef CONFIG_NET_SWITCHDEV
>>> +	u8 fwd_accel:1;
>>>  	int src_hwdom;
>>> +	br_hwdom_map_t fwd_hwdoms;
>>>  #endif
>>>  };
>>>  
>>> @@ -1597,6 +1600,15 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
>>>  
>>>  /* br_switchdev.c */
>>>  #ifdef CONFIG_NET_SWITCHDEV
>>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
>>> +{
>>> +	return BR_INPUT_SKB_CB(skb)->fwd_accel;
>>> +}
>>> +
>>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
>>> +				    struct sk_buff *skb);
>>> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
>>> +				  struct sk_buff *skb);
>>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>>  			      struct sk_buff *skb);
>>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
>>> @@ -1619,6 +1631,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
>>>  	skb->offload_fwd_mark = 0;
>>>  }
>>>  #else
>>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
>>> +{
>>> +	return false;
>>> +}
>>> +
>>> +static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
>>> +						  struct sk_buff *skb)
>>> +{
>>> +}
>>> +
>>> +static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
>>> +						struct sk_buff *skb)
>>> +{
>>> +}
>>> +
>>>  static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>>  					    struct sk_buff *skb)
>>>  {
>>> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
>>> index 54bd7205bfb5..c903171ad291 100644
>>> --- a/net/bridge/br_switchdev.c
>>> +++ b/net/bridge/br_switchdev.c
>>> @@ -8,6 +8,26 @@
>>>  
>>>  #include "br_private.h"
>>>  
>>> +static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,
>>> +				    const struct sk_buff *skb)
>>> +{
>>> +	return p->accel_priv && (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);
>>> +}
>>> +
>>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
>>> +				    struct sk_buff *skb)
>>> +{
>>> +	if (nbp_switchdev_can_accel(p, skb))
>>> +		BR_INPUT_SKB_CB(skb)->fwd_accel = true;
>>> +}
>>> +
>>> +void  (const struct net_bridge_port *p,
>>> +				  struct sk_buff *skb)
>>> +{
>>> +	if (nbp_switchdev_can_accel(p, skb))
>>> +		set_bit(p->hwdom, BR_INPUT_SKB_CB(skb)->fwd_hwdoms);
>>> +}
>>> +
>>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>>  			      struct sk_buff *skb)
>>>  {
>>> @@ -18,8 +38,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
>>>  				  const struct sk_buff *skb)
>>>  {
>>> -	return !skb->offload_fwd_mark ||
>>> -	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
>>> +	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);
>>> +
>>> +	return !test_bit(p->hwdom, cb->fwd_hwdoms) &&
>>> +		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);
>>>  }
>>>  
>>>  /* Flags that can be offloaded to hardware */
>>> @@ -125,6 +147,27 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
>>>  	return switchdev_port_obj_del(dev, &v.obj);
>>>  }
>>>  
>>> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
>>> +{
>>> +	void *priv;
>>> +
>>> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
>>> +		return;
>>> +
>>> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
>>> +	if (!IS_ERR_OR_NULL(priv))
>>> +		p->accel_priv = priv;
>>> +}
>>> +
>>> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
>>> +{
>>> +	if (!p->accel_priv)
>>> +		return;
>>> +
>>> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
>>> +	p->accel_priv = NULL;
>>> +}
>>> +
>>>  static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
>>>  {
>>>  	struct net_bridge *br = joining->br;
>>> @@ -176,13 +219,23 @@ int nbp_switchdev_add(struct net_bridge_port *p)
>>>  		return err;
>>>  	}
>>>  
>>> -	return nbp_switchdev_hwdom_set(p);
>>> +	err = nbp_switchdev_hwdom_set(p);
>>> +	if (err)
>>> +		return err;
>>> +
>>> +	if (p->hwdom)
>>> +		nbp_switchdev_fwd_offload_add(p);
>>> +
>>> +	return 0;
>>>  }
>>>  
>>>  void nbp_switchdev_del(struct net_bridge_port *p)
>>>  {
>>>  	ASSERT_RTNL();
>>>  
>>> +	if (p->accel_priv)
>>> +		nbp_switchdev_fwd_offload_del(p);
>>> +
>>>  	if (p->hwdom)
>>>  		nbp_switchdev_hwdom_put(p);
>>>  }
>>>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 4/9] net: bridge: switchdev: Forward offloading
  2021-04-29  9:16       ` Nikolay Aleksandrov
@ 2021-04-29 14:55         ` Tobias Waldekranz
  0 siblings, 0 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-04-29 14:55 UTC (permalink / raw)
  To: Nikolay Aleksandrov, davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, roopa, jiri, idosch,
	stephen, netdev, bridge

On Thu, Apr 29, 2021 at 12:16, Nikolay Aleksandrov <nikolay@nvidia.com> wrote:
> On 29/04/2021 01:47, Tobias Waldekranz wrote:
>> On Tue, Apr 27, 2021 at 13:35, Nikolay Aleksandrov <nikolay@nvidia.com> wrote:
>>> On 26/04/2021 20:04, Tobias Waldekranz wrote:
>>>> Allow switchdevs to forward frames from the CPU in accordance with the
>>>> bridge configuration in the same way as is done between bridge
>>>> ports. This means that the bridge will only send a single skb towards
>>>> one of the ports under the switchdev's control, and expects the driver
>>>> to deliver the packet to all eligible ports in its domain.
>>>>
>>>> Primarily this improves the performance of multicast flows with
>>>> multiple subscribers, as it allows the hardware to perform the frame
>>>> replication.
>>>>
>>>> The basic flow between the driver and the bridge is as follows:
>>>>
>>>> - The switchdev accepts the offload by returning a non-null pointer
>>>>   from .ndo_dfwd_add_station when the port is added to the bridge.
>>>>
>>>> - The bridge sends offloadable skbs to one of the ports under the
>>>>   switchdev's control using dev_queue_xmit_accel.
>>>>
>>>> - The switchdev notices the offload by checking for a non-NULL
>>>>   "sb_dev" in the core's call to .ndo_select_queue.
>>>>
>>>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>>>> ---
>>>>  net/bridge/br_forward.c   | 11 +++++++-
>>>>  net/bridge/br_private.h   | 27 ++++++++++++++++++
>>>>  net/bridge/br_switchdev.c | 59 +++++++++++++++++++++++++++++++++++++--
>>>>  3 files changed, 93 insertions(+), 4 deletions(-)
>>>>
>>>
>>> Hi,
>>> Please try to find a way to reduce the number of new tests in the fast path.
>>> This specific feature might help these devices, but the new tests hurt everybody else.
>>> I don't mind the control plane changes, but I'd like to minimize the fast-path impact.
>> 
>> Wholeheartedly agree.
>> 
>>> Do you need "accel_priv" to be a pointer, I mean can't derive sb_dev from the port alone ?
>>> Either way - you can mark the port via its internal flags if it can accelerate, those are
>>> used frequently and are in a hot cache line (by the way that reminds me that the port
>>> offload mark/hwdom should be moved in the first cache line).
>> 
>> I need to stash accel_priv somewhere as .ndo_dfwd_del_station expects it
>> sent back when unregistering the offload. But there is no need for it to
>> be part of the fast-path. Would it be ok to add a BR_FORWARD_OFFLOAD to
>> p->flags, which would be used in the fast-path, while also keeping
>> accel_priv on the port, but on a colder line?
>> 
>
> About the flag - yes, that is what I'm proposing. Use an internal port flag for it
> and for the tests.
>
> About the pointer - it is certainly not appropriate to use net_bridge_port for a void pointer
> coming in from a driver or external place. I see that it's always the bridge device so can
> we just compare the result of the add op to the bridge dev and set only the flag based on it?
> Then on the del path if the flag is set we know it's the bridge and use it as sb_dev.

I am not an expert on this API, but I believe that the returned pointer
is supposed to be treated as an opaque cookie that the driver can use to
associate with anything it likes. On mv88e6xxx I saw no need for it, so
I just returned the device pointer to indicate success. But other
drivers might have use of it if we add a getter - similar to
macvlan_accel_priv.

Now that I think of it, the DSA tagger might also benefit from it. It
would avoid lots of pointer chasing on egress. I will probably try this
out in v1 and see what you think.

>>> For example you could possibly drop fwd_accel, add the bitmap in a union with sb_dev pointer
>>> in br_input_skb_cb which can be set at __br_forward at the accel check and pass it down to
>>> avoid the final test.
>> 
>> Great idea! I will add it in v1.
>> 
> Actually you can also use a static key to avoid all checks and effects of this feature on
> the bridge fast-path. You can enable it when the first device that can accelerate shows
> up and disable it when the last one leaves.

That would be great. Never used the static key stuff before, but it
looks pretty straightforward. These are always global, right?  Since you
rewrite the actual .text when you toggle the keys? So it would have to
be enabled as soon as an offloadable port joins any bridge in the
system. Anyway, I will read up on it and add it in v1.

> By the way __br_forward() can take one more argument (sb_dev) and always set it because
> that cache line is dirty anyway due to the tstamp zeroing, but that doesn't matter if
> static keys are used. Just noting it. :)
>
>>> Furthermore since the hwdoms are bits and if the port accel is a bit
>>> you could probably reduce the nbp_switchdev_can_accel() helper to one test with a few bitops.
>> 
>> Not sure I follow. The current code has two tests:
>> 
>> 1. Is offloading enabled on the port. (To be done using p->flags in v1)
>> 2. Is offloading allowed for this frame to this port.
>> 
>> The port can be part of a hwdom that does not support forward
>> offloading; indeed only one driver would support it initially. So how do
>> I avoid having to test the conditions individually?
>> 
>
> Coming to think of it with the port bit test first it should be fine.
>
>
> For the sake of fun here's one way that can turn it into one test,
> obviously I haven't tested anything:
>  - reserve a few most significant bits of hwdom as "feature" bits, say 4
>  - add a "feature" bit which encodes ability to accelerate
>  => test becomes p->hwdom | (src_hwdom & hwdom_bitmask) > (accel feature bit) | p->hwdom
> It's very hacky and _not_ to be used. :)

I stand corrected - and in awe :)

>>> In nbp_switchdev_allowed_egress() I'd make the hwdom tests rely on skb's offload_fwd_mark
>>> so in the software forwarding case we could avoid them.
>> 
>> Flipping the left and right side of the expression is possible, but I
>> think that would only impact the case where the frame is _not_ allowed
>> to egress. Is that what you mean? Otherwise you still need to test for
>> the condition that we have forwarded to this port's hwdom already, to
>> avoid sending duplicates on the wire. This is independent of
>> skb->offload_fwd_mark as both Rx-offloaded and non-Rx-offloaded frames
>> can still be Tx-offloaded to other hwdoms.
>> 
>> A typical example would be a broadcast frame ingressing the bridge from
>> eth0 in the figure from the cover letter. skb->offload_fwd_mark would
>> always be 0, but you still need to test fwd_hwdoms to skip over swp{1,2}
>> after you have sent the skb to swp0.
>> 
>
> Yeah, you're right I was still thinking only of offloaded skbs, didn't consider
> mixing the two.
>
>>> I might be missing something above, but we have to try and reduce these tests as much as
>>> possible, also the port's first cache line is quite crowded so avoiding any new fields
>>> would be best, i.e. at some point we'll move the hwdom/offload mark there to avoid pulling
>>> in the last cache line of net_bridge_port, so it'd be best to avoid having to move accel_priv
>>> there too.
>>>
>>> Cheers,
>>>  Nik
>>>
>>>> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
>>>> index 6e9b049ae521..b4fb3b0bb1ec 100644
>>>> --- a/net/bridge/br_forward.c
>>>> +++ b/net/bridge/br_forward.c
>>>> @@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,
>>>>  
>>>>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)
>>>>  {
>>>> +	struct net_device *sb_dev = NULL;
>>>> +
>>>>  	skb_push(skb, ETH_HLEN);
>>>>  	if (!is_skb_forwardable(skb->dev, skb))
>>>>  		goto drop;
>>>> @@ -48,7 +50,10 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb
>>>>  		skb_set_network_header(skb, depth);
>>>>  	}
>>>>  
>>>> -	dev_queue_xmit(skb);
>>>> +	if (br_switchdev_accels_skb(skb))
>>>> +		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;
>>>> +
>>>> +	dev_queue_xmit_accel(skb, sb_dev);
>>>>  
>>>>  	return 0;
>>>>  
>>>> @@ -105,6 +110,8 @@ static void __br_forward(const struct net_bridge_port *to,
>>>>  		indev = NULL;
>>>>  	}
>>>>  
>>>> +	nbp_switchdev_frame_mark_accel(to, skb);
>>>> +
>>>>  	NF_HOOK(NFPROTO_BRIDGE, br_hook,
>>>>  		net, NULL, skb, indev, skb->dev,
>>>>  		br_forward_finish);
>>>> @@ -174,6 +181,8 @@ static struct net_bridge_port *maybe_deliver(
>>>>  	if (!should_deliver(p, skb))
>>>>  		return prev;
>>>>  
>>>> +	nbp_switchdev_frame_mark_fwd(p, skb);
>>>> +
>>>>  	if (!prev)
>>>>  		goto out;
>>>>  
>>>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
>>>> index aba92864d285..933e951b0d7a 100644
>>>> --- a/net/bridge/br_private.h
>>>> +++ b/net/bridge/br_private.h
>>>> @@ -332,6 +332,7 @@ struct net_bridge_port {
>>>>  #endif
>>>>  #ifdef CONFIG_NET_SWITCHDEV
>>>>  	int				hwdom;
>>>> +	void				*accel_priv;
>>>>  #endif
>>>>  	u16				group_fwd_mask;
>>>>  	u16				backup_redirected_cnt;
>>>> @@ -506,7 +507,9 @@ struct br_input_skb_cb {
>>>>  #endif
>>>>  
>>>>  #ifdef CONFIG_NET_SWITCHDEV
>>>> +	u8 fwd_accel:1;
>>>>  	int src_hwdom;
>>>> +	br_hwdom_map_t fwd_hwdoms;
>>>>  #endif
>>>>  };
>>>>  
>>>> @@ -1597,6 +1600,15 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
>>>>  
>>>>  /* br_switchdev.c */
>>>>  #ifdef CONFIG_NET_SWITCHDEV
>>>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
>>>> +{
>>>> +	return BR_INPUT_SKB_CB(skb)->fwd_accel;
>>>> +}
>>>> +
>>>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
>>>> +				    struct sk_buff *skb);
>>>> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
>>>> +				  struct sk_buff *skb);
>>>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>>>  			      struct sk_buff *skb);
>>>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
>>>> @@ -1619,6 +1631,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
>>>>  	skb->offload_fwd_mark = 0;
>>>>  }
>>>>  #else
>>>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)
>>>> +{
>>>> +	return false;
>>>> +}
>>>> +
>>>> +static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
>>>> +						  struct sk_buff *skb)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,
>>>> +						struct sk_buff *skb)
>>>> +{
>>>> +}
>>>> +
>>>>  static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>>>  					    struct sk_buff *skb)
>>>>  {
>>>> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
>>>> index 54bd7205bfb5..c903171ad291 100644
>>>> --- a/net/bridge/br_switchdev.c
>>>> +++ b/net/bridge/br_switchdev.c
>>>> @@ -8,6 +8,26 @@
>>>>  
>>>>  #include "br_private.h"
>>>>  
>>>> +static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,
>>>> +				    const struct sk_buff *skb)
>>>> +{
>>>> +	return p->accel_priv && (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);
>>>> +}
>>>> +
>>>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,
>>>> +				    struct sk_buff *skb)
>>>> +{
>>>> +	if (nbp_switchdev_can_accel(p, skb))
>>>> +		BR_INPUT_SKB_CB(skb)->fwd_accel = true;
>>>> +}
>>>> +
>>>> +void  (const struct net_bridge_port *p,
>>>> +				  struct sk_buff *skb)
>>>> +{
>>>> +	if (nbp_switchdev_can_accel(p, skb))
>>>> +		set_bit(p->hwdom, BR_INPUT_SKB_CB(skb)->fwd_hwdoms);
>>>> +}
>>>> +
>>>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>>>  			      struct sk_buff *skb)
>>>>  {
>>>> @@ -18,8 +38,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
>>>>  				  const struct sk_buff *skb)
>>>>  {
>>>> -	return !skb->offload_fwd_mark ||
>>>> -	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
>>>> +	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);
>>>> +
>>>> +	return !test_bit(p->hwdom, cb->fwd_hwdoms) &&
>>>> +		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);
>>>>  }
>>>>  
>>>>  /* Flags that can be offloaded to hardware */
>>>> @@ -125,6 +147,27 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)
>>>>  	return switchdev_port_obj_del(dev, &v.obj);
>>>>  }
>>>>  
>>>> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
>>>> +{
>>>> +	void *priv;
>>>> +
>>>> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
>>>> +		return;
>>>> +
>>>> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
>>>> +	if (!IS_ERR_OR_NULL(priv))
>>>> +		p->accel_priv = priv;
>>>> +}
>>>> +
>>>> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
>>>> +{
>>>> +	if (!p->accel_priv)
>>>> +		return;
>>>> +
>>>> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
>>>> +	p->accel_priv = NULL;
>>>> +}
>>>> +
>>>>  static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)
>>>>  {
>>>>  	struct net_bridge *br = joining->br;
>>>> @@ -176,13 +219,23 @@ int nbp_switchdev_add(struct net_bridge_port *p)
>>>>  		return err;
>>>>  	}
>>>>  
>>>> -	return nbp_switchdev_hwdom_set(p);
>>>> +	err = nbp_switchdev_hwdom_set(p);
>>>> +	if (err)
>>>> +		return err;
>>>> +
>>>> +	if (p->hwdom)
>>>> +		nbp_switchdev_fwd_offload_add(p);
>>>> +
>>>> +	return 0;
>>>>  }
>>>>  
>>>>  void nbp_switchdev_del(struct net_bridge_port *p)
>>>>  {
>>>>  	ASSERT_RTNL();
>>>>  
>>>> +	if (p->accel_priv)
>>>> +		nbp_switchdev_fwd_offload_del(p);
>>>> +
>>>>  	if (p->hwdom)
>>>>  		nbp_switchdev_hwdom_put(p);
>>>>  }
>>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 0/9] net: bridge: Forward offloading
  2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
                   ` (8 preceding siblings ...)
  2021-04-26 17:04 ` [RFC net-next 9/9] net: dsa: mv88e6xxx: Forward offloading Tobias Waldekranz
@ 2021-05-02 14:58 ` Ido Schimmel
  2021-05-03  9:44   ` Tobias Waldekranz
  9 siblings, 1 reply; 42+ messages in thread
From: Ido Schimmel @ 2021-05-02 14:58 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, olteanv, roopa,
	nikolay, jiri, stephen, netdev, bridge

On Mon, Apr 26, 2021 at 07:04:02PM +0200, Tobias Waldekranz wrote:
> ## Overview
> 
>    vlan1   vlan2
>        \   /
>    .-----------.
>    |    br0    |
>    '-----------'
>    /   /   \   \
> swp0 swp1 swp2 eth0
>   :   :   :
>   (hwdom 1)
> 
> Up to this point, switchdevs have been trusted with offloading
> forwarding between bridge ports, e.g. forwarding a unicast from swp0
> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
> series extends forward offloading to include some new classes of
> traffic:
> 
> - Locally originating flows, i.e. packets that ingress on br0 that are
>   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>   this also includes routed flows, e.g. a packet ingressing swp0 on
>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>   forwarded to swp1 is "locally originating" from br0's point of view.
> 
> - Flows originating from "foreign" interfaces, i.e. an interface that
>   is not offloaded by a particular switchdev instance. This includes
>   ports belonging to other switchdev instances. A typical example
>   would be flows from eth0 towards swp{0,1,2}.
> 
> The bridge still looks up its FDB/MDB as usual and then notifies the
> switchdev driver that a particular skb should be offloaded if it
> matches one of the classes above. It does so by using the _accel
> version of dev_queue_xmit, supplying its own netdev as the
> "subordinate" device. The driver can react to the presence of the
> subordinate in its .ndo_select_queue in what ever way it needs to make
> sure to forward the skb in much the same way that it would for packets
> ingressing on regular ports.
> 
> Hardware domains to which a particular skb has been forwarded are
> recorded so that duplicates are avoided.
> 
> The main performance benefit is thus seen on multicast flows. Imagine
> for example that:
> 
> - An IP camera is connected to swp0 (VLAN 1)
> 
> - The CPU is acting as a multicast router, routing the group from VLAN
>   1 to VLAN 2.
> 
> - There are subscribers for the group in question behind both swp1 and
>   swp2 (VLAN 2).

IIUC, this falls under the first use case ("Locally originating flows").
Do you have a need for this optimization in the forwarding case? Asking
because it might allow us to avoid unnecessary modifications to the
forwarding path. I have yet to look at the code, so maybe it's not a big
deal.

> 
> With this offloading in place, the bridge need only send a single skb
> to the driver, which will send it to the hardware marked in such a way
> that the switch will perform the multicast replication according to
> the MDB configuration. Naturally, the number of saved skb_clones
> increase linearly with the number of subscribed ports.

Yes, this is clear. FWIW, Spectrum has something similar. You can send
packets as either "data" or "control". Data packets are injected via the
CPU port and forwarded according to the hardware database. Control
packets are sent as-is via the specified front panel port, bypassing the
hardware data path. mlxsw is always sending packets as "control".

> 
> As an extra benefit, on mv88e6xxx, this also allows the switch to
> perform source address learning on these flows, which avoids having to
> sync dynamic FDB entries over slow configuration interfaces like MDIO
> to avoid flows directed towards the CPU being flooded as unknown
> unicast by the switch.

Since you are not syncing FDBs, it is possible that you are needlessly
flooding locally generated packets. This optimization avoids it.

> 
> 
> ## RFC
> 
> - In general, what do you think about this idea?

Looks sane to me

> 
> - hwdom. What do you think about this terminology? Personally I feel
>   that we had too many things called offload_fwd_mark, and that as the
>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>   might be useful to have a separate term for it.

Sounds OK

> 
> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>   and if so do you have any suggestion/preference on how to signal the
>   offloading from the bridge down to the switchdev driver?

I was not aware of this interface before the RFC, but your use case
seems to fit the kdoc: "Called by upper layer devices to accelerate
switching or other station functionality into hardware".

Do you expect this optimization to only work when physical netdevs are
enslaved to the bridge? What about LAG/VLANs?

> 
> - The way that flooding is implemented in br_forward.c (lazily cloning
>   skbs) means that you have to mark the forwarding as completed very
>   early (right after should_deliver in maybe_deliver) in order to
>   avoid duplicates. Is there some way to move this decision point to a
>   later stage that I am missing?
> 
> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>   compatible with unicast-to-multicast being used on a port. Then
>   again, I think that this would also be broken for regular switchdev
>   bridge offloading as this flag is not offloaded to the switchdev
>   port, so there is no way for the driver to refuse it. Any ideas on
>   how to handle this?
> 
> 
> ## mv88e6xxx Specifics
> 
> Since we are now only receiving a single skb for both unicast and
> multicast flows, we can tag the packets with the FORWARD command
> instead of FROM_CPU. The swich(es) will then forward the packet in
> accordance with its ATU, VTU, STU, and PVT configuration - just like
> for packets ingressing on user ports.
> 
> Crucially, FROM_CPU is still used for:
> 
> - Ports in standalone mode.
> 
> - Flows that are trapped to the CPU and software-forwarded by a
>   bridge. Note that these flows match neither of the classes discussed
>   in the overview.
> 
> - Packets that are sent directly to a port netdev without going
>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>   socket.
> 
> We thus have a pretty clean separation where the data plane uses
> FORWARDs and the control plane uses TO_/FROM_CPU.
> 
> The barrier between different bridges is enforced by port based VLANs
> on mv88e6xxx, which in essence is a mapping from a source device/port
> pair to an allowed set of egress ports. In order to have a FORWARD
> frame (which carries a _source_ device/port) correctly mapped by the
> PVT, we must use a unique pair for each bridge.
> 
> Fortunately, there is typically lots of unused address space in most
> switch trees. When was the last time you saw an mv88e6xxx product
> using more than 4 chips? Even if you found one with 16 (!) devices,
> you would still have room to allocate 16*16 virtual ports to software
> bridges.
> 
> Therefore, the mv88e6xxx driver will allocate a virtual device/port
> pair to each bridge that it offloads. All members of the same bridge
> are then configured to allow packets from this virtual port in their
> PVTs.
> 
> Tobias Waldekranz (9):
>   net: dfwd: Constrain existing users to macvlan subordinates
>   net: bridge: Disambiguate offload_fwd_mark
>   net: bridge: switchdev: Recycle unused hwdoms
>   net: bridge: switchdev: Forward offloading
>   net: dsa: Track port PVIDs
>   net: dsa: Forward offloading
>   net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge
>   net: dsa: mv88e6xxx: Map virtual bridge port in PVT
>   net: dsa: mv88e6xxx: Forward offloading
> 
>  MAINTAINERS                                   |   1 +
>  drivers/net/dsa/mv88e6xxx/Makefile            |   1 +
>  drivers/net/dsa/mv88e6xxx/chip.c              |  61 ++++++-
>  drivers/net/dsa/mv88e6xxx/dst.c               | 160 ++++++++++++++++++
>  drivers/net/dsa/mv88e6xxx/dst.h               |  14 ++
>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
>  include/linux/dsa/mv88e6xxx.h                 |  13 ++
>  include/net/dsa.h                             |  13 ++
>  net/bridge/br_forward.c                       |  11 +-
>  net/bridge/br_if.c                            |   4 +-
>  net/bridge/br_private.h                       |  54 +++++-
>  net/bridge/br_switchdev.c                     | 141 +++++++++++----
>  net/dsa/port.c                                |  16 +-
>  net/dsa/slave.c                               |  36 +++-
>  net/dsa/tag_dsa.c                             |  33 +++-
>  17 files changed, 510 insertions(+), 57 deletions(-)
>  create mode 100644 drivers/net/dsa/mv88e6xxx/dst.c
>  create mode 100644 drivers/net/dsa/mv88e6xxx/dst.h
>  create mode 100644 include/linux/dsa/mv88e6xxx.h
> 
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 2/9] net: bridge: Disambiguate offload_fwd_mark
  2021-04-26 17:04 ` [RFC net-next 2/9] net: bridge: Disambiguate offload_fwd_mark Tobias Waldekranz
@ 2021-05-02 15:00   ` Ido Schimmel
  2021-05-03  8:49     ` Tobias Waldekranz
  0 siblings, 1 reply; 42+ messages in thread
From: Ido Schimmel @ 2021-05-02 15:00 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, olteanv, roopa,
	nikolay, jiri, stephen, netdev, bridge

On Mon, Apr 26, 2021 at 07:04:04PM +0200, Tobias Waldekranz wrote:
> - skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. There is a
>   slight change here: Whereas previously this was only set for
>   offloaded packets, we now always track the incoming hwdom. As all
>   uses where already gated behind checks of skb->offload_fwd_mark,
>   this will not introduce any functional change, but it paves the way
>   for future changes where the ingressing hwdom must be known both for
>   offloaded and non-offloaded frames.

[...]

> @@ -43,15 +43,15 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>  			      struct sk_buff *skb)
>  {
> -	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
> -		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
> +	if (p->hwdom)
> +		BR_INPUT_SKB_CB(skb)->src_hwdom = p->hwdom;
>  }

I assume you are referring to this change? "src_hwdom" sounds weird if
it's expected to be valid for non-offloaded frames.

Can you elaborate about "future changes where the ingressing hwdom must
be known both for offloaded and non-offloaded frames"?

Probably best to split this change to a different patch given the rest
of the changes are mechanical.

>  
>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
>  				  const struct sk_buff *skb)
>  {
>  	return !skb->offload_fwd_mark ||
> -	       BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
> +	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
>  }
>  
>  /* Flags that can be offloaded to hardware */
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 4/9] net: bridge: switchdev: Forward offloading
  2021-04-26 17:04 ` [RFC net-next 4/9] net: bridge: switchdev: Forward offloading Tobias Waldekranz
  2021-04-27 10:35   ` Nikolay Aleksandrov
@ 2021-05-02 15:04   ` Ido Schimmel
  2021-05-03  8:53     ` Tobias Waldekranz
  1 sibling, 1 reply; 42+ messages in thread
From: Ido Schimmel @ 2021-05-02 15:04 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, olteanv, roopa,
	nikolay, jiri, stephen, netdev, bridge

On Mon, Apr 26, 2021 at 07:04:06PM +0200, Tobias Waldekranz wrote:
> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
> +{
> +	void *priv;
> +
> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
> +		return;
> +
> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);

Some changes to team/bond/8021q will be needed in order to get this
optimization to work when they are enslaved to the bridge instead of the
front panel port itself?

> +	if (!IS_ERR_OR_NULL(priv))
> +		p->accel_priv = priv;
> +}
> +
> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
> +{
> +	if (!p->accel_priv)
> +		return;
> +
> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
> +	p->accel_priv = NULL;
> +}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 2/9] net: bridge: Disambiguate offload_fwd_mark
  2021-05-02 15:00   ` Ido Schimmel
@ 2021-05-03  8:49     ` Tobias Waldekranz
  2021-05-05  7:39       ` Ido Schimmel
  0 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-05-03  8:49 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, olteanv, roopa,
	nikolay, jiri, stephen, netdev, bridge

On Sun, May 02, 2021 at 18:00, Ido Schimmel <idosch@idosch.org> wrote:
> On Mon, Apr 26, 2021 at 07:04:04PM +0200, Tobias Waldekranz wrote:
>> - skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. There is a
>>   slight change here: Whereas previously this was only set for
>>   offloaded packets, we now always track the incoming hwdom. As all
>>   uses where already gated behind checks of skb->offload_fwd_mark,
>>   this will not introduce any functional change, but it paves the way
>>   for future changes where the ingressing hwdom must be known both for
>>   offloaded and non-offloaded frames.
>
> [...]
>
>> @@ -43,15 +43,15 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
>>  			      struct sk_buff *skb)
>>  {
>> -	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
>> -		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
>> +	if (p->hwdom)
>> +		BR_INPUT_SKB_CB(skb)->src_hwdom = p->hwdom;
>>  }
>
> I assume you are referring to this change? "src_hwdom" sounds weird if
> it's expected to be valid for non-offloaded frames.

Perhaps "non-offloaded" was a sloppy description on my part. I was
trying to describe frames that originate from a switchdev, but have not
been forwarded by hardware; e.g. STP BPDUs, IGMP reports, etc. So
nbp_switchdev_frame_mark now basically says: "If this skb came in from a
switchdev, make sure to note which one".

> Can you elaborate about "future changes where the ingressing hwdom must
> be known both for offloaded and non-offloaded frames"?

Typical example: The switchdev has a fixed configuration to trap STP
BPDUs, but STP is not running on the bridge and the group_fwd_mask
allows them to be forwarded. Say we have this setup:

      br0
    /  |  \
swp0 swp1 swp2

A BPDU comes in on swp0 and is trapped to the CPU; the driver does not
set skb->offload_fwd_mark. The bridge determines that the frame should
be forwarded to swp{1,2}. It is imperative that forward offloading is
_not_ allowed in this case, as the source hwdom is already "poisoned".

Recording the source hwdom allows this case to be handled properly.

> Probably best to split this change to a different patch given the rest
> of the changes are mechanical.

Right, but I think the change in name to warrants a change in
semantics. It is being renamed to src_hwdom because it now holds just
that information. Again, there is no functional change introduced by
this since nbp_switchdev_allowed_egress always checks for the presence
of skb->offload_fwd_mark anyway. But if you feel strongly about it, I
will split it up.

>>  
>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
>>  				  const struct sk_buff *skb)
>>  {
>>  	return !skb->offload_fwd_mark ||
>> -	       BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
>> +	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
>>  }
>>  
>>  /* Flags that can be offloaded to hardware */
>> -- 
>> 2.25.1
>> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 4/9] net: bridge: switchdev: Forward offloading
  2021-05-02 15:04   ` Ido Schimmel
@ 2021-05-03  8:53     ` Tobias Waldekranz
  2021-05-06 11:01       ` Vladimir Oltean
  0 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-05-03  8:53 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, olteanv, roopa,
	nikolay, jiri, stephen, netdev, bridge

On Sun, May 02, 2021 at 18:04, Ido Schimmel <idosch@idosch.org> wrote:
> On Mon, Apr 26, 2021 at 07:04:06PM +0200, Tobias Waldekranz wrote:
>> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
>> +{
>> +	void *priv;
>> +
>> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
>> +		return;
>> +
>> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
>
> Some changes to team/bond/8021q will be needed in order to get this
> optimization to work when they are enslaved to the bridge instead of the
> front panel port itself?

Right you are. We should probably do something similar to the
switchdev_handle_port_* family of helpers that could be reused in
stacked devices. I will look at it for v1.

>> +	if (!IS_ERR_OR_NULL(priv))
>> +		p->accel_priv = priv;
>> +}
>> +
>> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)
>> +{
>> +	if (!p->accel_priv)
>> +		return;
>> +
>> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);
>> +	p->accel_priv = NULL;
>> +}

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 0/9] net: bridge: Forward offloading
  2021-05-02 14:58 ` [RFC net-next 0/9] net: bridge: " Ido Schimmel
@ 2021-05-03  9:44   ` Tobias Waldekranz
  2021-05-06 10:59     ` Vladimir Oltean
  0 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-05-03  9:44 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, olteanv, roopa,
	nikolay, jiri, stephen, netdev, bridge

On Sun, May 02, 2021 at 17:58, Ido Schimmel <idosch@idosch.org> wrote:
> On Mon, Apr 26, 2021 at 07:04:02PM +0200, Tobias Waldekranz wrote:
>> ## Overview
>> 
>>    vlan1   vlan2
>>        \   /
>>    .-----------.
>>    |    br0    |
>>    '-----------'
>>    /   /   \   \
>> swp0 swp1 swp2 eth0
>>   :   :   :
>>   (hwdom 1)
>> 
>> Up to this point, switchdevs have been trusted with offloading
>> forwarding between bridge ports, e.g. forwarding a unicast from swp0
>> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
>> series extends forward offloading to include some new classes of
>> traffic:
>> 
>> - Locally originating flows, i.e. packets that ingress on br0 that are
>>   to be forwarded to one or several of the ports swp{0,1,2}. Notably
>>   this also includes routed flows, e.g. a packet ingressing swp0 on
>>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then
>>   forwarded to swp1 is "locally originating" from br0's point of view.
>> 
>> - Flows originating from "foreign" interfaces, i.e. an interface that
>>   is not offloaded by a particular switchdev instance. This includes
>>   ports belonging to other switchdev instances. A typical example
>>   would be flows from eth0 towards swp{0,1,2}.
>> 
>> The bridge still looks up its FDB/MDB as usual and then notifies the
>> switchdev driver that a particular skb should be offloaded if it
>> matches one of the classes above. It does so by using the _accel
>> version of dev_queue_xmit, supplying its own netdev as the
>> "subordinate" device. The driver can react to the presence of the
>> subordinate in its .ndo_select_queue in what ever way it needs to make
>> sure to forward the skb in much the same way that it would for packets
>> ingressing on regular ports.
>> 
>> Hardware domains to which a particular skb has been forwarded are
>> recorded so that duplicates are avoided.
>> 
>> The main performance benefit is thus seen on multicast flows. Imagine
>> for example that:
>> 
>> - An IP camera is connected to swp0 (VLAN 1)
>> 
>> - The CPU is acting as a multicast router, routing the group from VLAN
>>   1 to VLAN 2.
>> 
>> - There are subscribers for the group in question behind both swp1 and
>>   swp2 (VLAN 2).
>
> IIUC, this falls under the first use case ("Locally originating flows").
> Do you have a need for this optimization in the forwarding case? Asking
> because it might allow us to avoid unnecessary modifications to the
> forwarding path. I have yet to look at the code, so maybe it's not a big
> deal.

Routed multicast is the most pressing issue. But being able to avoid
issues with learning on flows from the CPU (locally originating and from
foreign interfaces) is a close second. Yes you can handle the second
issue by syncing FDBs but it means lots of extra traffic over an already
congested interface (MDIO).

The overhead is pretty small in this version, and with Nikolay's
suggestions about hiding it behind a static key, it should go down to 0
in v1.

>> With this offloading in place, the bridge need only send a single skb
>> to the driver, which will send it to the hardware marked in such a way
>> that the switch will perform the multicast replication according to
>> the MDB configuration. Naturally, the number of saved skb_clones
>> increase linearly with the number of subscribed ports.
>
> Yes, this is clear. FWIW, Spectrum has something similar. You can send
> packets as either "data" or "control". Data packets are injected via the
> CPU port and forwarded according to the hardware database. Control
> packets are sent as-is via the specified front panel port, bypassing the
> hardware data path. mlxsw is always sending packets as "control".

Marvell has the same concept, but they call "data" "forward" and
"control" "from CPU". mv88e6xxx has thus far also been limited to only
sending control frames. I imagine that many chips will be able to make
use of this acceleration.

>> As an extra benefit, on mv88e6xxx, this also allows the switch to
>> perform source address learning on these flows, which avoids having to
>> sync dynamic FDB entries over slow configuration interfaces like MDIO
>> to avoid flows directed towards the CPU being flooded as unknown
>> unicast by the switch.
>
> Since you are not syncing FDBs, it is possible that you are needlessly
> flooding locally generated packets. This optimization avoids it.

Since the switchdev driver muxes incoming frames to the right port
netdev, the bridge will know the correct port to use on egress. It is
more that return traffic towards the CPU will be flooded by the switch
as unknown unicast.

.-----.   .--------.
| CPU +---0 Switch 1--- A
'-----'   | (fdb)  2--- B
          '--------'

If a ping is sent from CPU to A, A's reply will also be flooded to B
because the CPU's SA has never been seen on a "forward"/"data" packet
and therefore has never been added to the FDB.

>> 
>> 
>> ## RFC
>> 
>> - In general, what do you think about this idea?
>
> Looks sane to me

Glad to hear it, thanks!

>> - hwdom. What do you think about this terminology? Personally I feel
>>   that we had too many things called offload_fwd_mark, and that as the
>>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
>>   might be useful to have a separate term for it.
>
> Sounds OK
>
>> 
>> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,
>>   and if so do you have any suggestion/preference on how to signal the
>>   offloading from the bridge down to the switchdev driver?
>
> I was not aware of this interface before the RFC, but your use case
> seems to fit the kdoc: "Called by upper layer devices to accelerate
> switching or other station functionality into hardware".
>
> Do you expect this optimization to only work when physical netdevs are
> enslaved to the bridge? What about LAG/VLANs?

LAGs should definitely work once the .ndo_dfwd_{add,del}_station helpers
are in place.

Stacked VLANs could also be made to work. But they may require some
extra work.

In v1, the bridge will always send offloaded frames with the VLAN
information intact, even if the port is configured to egress the VID
untagged. This is needed so that the driver can determine the correct
VID to use in cases where multiple VIDs are set to egress untagged.

If any kind of VID translation takes place I think things get very
sticky. Then again, that is maybe not all that defined without this
change applied either?

What is the typical use-case for using an "external" stacked VLAN device
over configuring the VLAN inside the bridge?

>> 
>> - The way that flooding is implemented in br_forward.c (lazily cloning
>>   skbs) means that you have to mark the forwarding as completed very
>>   early (right after should_deliver in maybe_deliver) in order to
>>   avoid duplicates. Is there some way to move this decision point to a
>>   later stage that I am missing?
>> 
>> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
>>   compatible with unicast-to-multicast being used on a port. Then
>>   again, I think that this would also be broken for regular switchdev
>>   bridge offloading as this flag is not offloaded to the switchdev
>>   port, so there is no way for the driver to refuse it. Any ideas on
>>   how to handle this?
>> 
>> 
>> ## mv88e6xxx Specifics
>> 
>> Since we are now only receiving a single skb for both unicast and
>> multicast flows, we can tag the packets with the FORWARD command
>> instead of FROM_CPU. The swich(es) will then forward the packet in
>> accordance with its ATU, VTU, STU, and PVT configuration - just like
>> for packets ingressing on user ports.
>> 
>> Crucially, FROM_CPU is still used for:
>> 
>> - Ports in standalone mode.
>> 
>> - Flows that are trapped to the CPU and software-forwarded by a
>>   bridge. Note that these flows match neither of the classes discussed
>>   in the overview.
>> 
>> - Packets that are sent directly to a port netdev without going
>>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
>>   socket.
>> 
>> We thus have a pretty clean separation where the data plane uses
>> FORWARDs and the control plane uses TO_/FROM_CPU.
>> 
>> The barrier between different bridges is enforced by port based VLANs
>> on mv88e6xxx, which in essence is a mapping from a source device/port
>> pair to an allowed set of egress ports. In order to have a FORWARD
>> frame (which carries a _source_ device/port) correctly mapped by the
>> PVT, we must use a unique pair for each bridge.
>> 
>> Fortunately, there is typically lots of unused address space in most
>> switch trees. When was the last time you saw an mv88e6xxx product
>> using more than 4 chips? Even if you found one with 16 (!) devices,
>> you would still have room to allocate 16*16 virtual ports to software
>> bridges.
>> 
>> Therefore, the mv88e6xxx driver will allocate a virtual device/port
>> pair to each bridge that it offloads. All members of the same bridge
>> are then configured to allow packets from this virtual port in their
>> PVTs.
>> 
>> Tobias Waldekranz (9):
>>   net: dfwd: Constrain existing users to macvlan subordinates
>>   net: bridge: Disambiguate offload_fwd_mark
>>   net: bridge: switchdev: Recycle unused hwdoms
>>   net: bridge: switchdev: Forward offloading
>>   net: dsa: Track port PVIDs
>>   net: dsa: Forward offloading
>>   net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge
>>   net: dsa: mv88e6xxx: Map virtual bridge port in PVT
>>   net: dsa: mv88e6xxx: Forward offloading
>> 
>>  MAINTAINERS                                   |   1 +
>>  drivers/net/dsa/mv88e6xxx/Makefile            |   1 +
>>  drivers/net/dsa/mv88e6xxx/chip.c              |  61 ++++++-
>>  drivers/net/dsa/mv88e6xxx/dst.c               | 160 ++++++++++++++++++
>>  drivers/net/dsa/mv88e6xxx/dst.h               |  14 ++
>>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
>>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
>>  include/linux/dsa/mv88e6xxx.h                 |  13 ++
>>  include/net/dsa.h                             |  13 ++
>>  net/bridge/br_forward.c                       |  11 +-
>>  net/bridge/br_if.c                            |   4 +-
>>  net/bridge/br_private.h                       |  54 +++++-
>>  net/bridge/br_switchdev.c                     | 141 +++++++++++----
>>  net/dsa/port.c                                |  16 +-
>>  net/dsa/slave.c                               |  36 +++-
>>  net/dsa/tag_dsa.c                             |  33 +++-
>>  17 files changed, 510 insertions(+), 57 deletions(-)
>>  create mode 100644 drivers/net/dsa/mv88e6xxx/dst.c
>>  create mode 100644 drivers/net/dsa/mv88e6xxx/dst.h
>>  create mode 100644 include/linux/dsa/mv88e6xxx.h
>> 
>> -- 
>> 2.25.1
>> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-04-27 10:17   ` Vladimir Oltean
@ 2021-05-04 14:44     ` Tobias Waldekranz
  2021-05-04 15:21       ` Vladimir Oltean
  0 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-05-04 14:44 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Tue, Apr 27, 2021 at 13:17, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Apr 26, 2021 at 07:04:08PM +0200, Tobias Waldekranz wrote:
>> Allow DSA drivers to support forward offloading from a bridge by:
>> 
>> - Passing calls to .ndo_dfwd_{add,del}_station to the drivers.
>> 
>> - Recording the subordinate device of offloaded skbs in the control
>>   buffer so that the tagger can take the appropriate action.
>> 
>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> ---
>>  include/net/dsa.h |  7 +++++++
>>  net/dsa/slave.c   | 36 ++++++++++++++++++++++++++++++++++--
>>  2 files changed, 41 insertions(+), 2 deletions(-)
>> 
>> diff --git a/include/net/dsa.h b/include/net/dsa.h
>> index 1f9ba9889034..77d4df819299 100644
>> --- a/include/net/dsa.h
>> +++ b/include/net/dsa.h
>> @@ -119,6 +119,7 @@ struct dsa_netdevice_ops {
>>  
>>  struct dsa_skb_cb {
>>  	struct sk_buff *clone;
>> +	struct net_device *sb_dev;
>>  };
>>  
>>  struct __dsa_skb_cb {
>> @@ -828,6 +829,12 @@ struct dsa_switch_ops {
>>  					  const struct switchdev_obj_ring_role_mrp *mrp);
>>  	int	(*port_mrp_del_ring_role)(struct dsa_switch *ds, int port,
>>  					  const struct switchdev_obj_ring_role_mrp *mrp);
>> +
>> +	/* L2 forward offloading */
>> +	void *	(*dfwd_add_station)(struct dsa_switch *ds, int port,
>> +				    struct net_device *sb_dev);
>> +	void	(*dfwd_del_station)(struct dsa_switch *ds, int port,
>> +				    struct net_device *sb_dev);
>>  };
>>  
>>  #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
>> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
>> index 77b33bd161b8..3689ffa2dbb8 100644
>> --- a/net/dsa/slave.c
>> +++ b/net/dsa/slave.c
>> @@ -657,6 +657,13 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
>>  	return dsa_enqueue_skb(nskb, dev);
>>  }
>>  
>> +static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb,
>> +				  struct net_device *sb_dev)
>> +{
>> +	DSA_SKB_CB(skb)->sb_dev = sb_dev;
>> +	return netdev_pick_tx(dev, skb, sb_dev);
>> +}
>> +
>
> DSA_SKB_CB is going away:
> https://patchwork.kernel.org/project/netdevbpf/patch/20210427042203.26258-5-yangbo.lu@nxp.com/
>
> Let's either negotiate with Yangbo on keeping it, or make
> .ndo_select_queue a bypass towards the tagger, where it can use its own
> SKB_CB structure and be more flexible in general (I think I'm leaning
> towards the latter).

Thus far, Yangbo is a tough negotiator, giving me the silent treatment:

https://lore.kernel.org/netdev/87y2d2noe5.fsf@waldekranz.com/

:)

That memset is giving me a hard time. I have just disabled it on my
branch at the moment. Any ideas on how to get rid of it without breaking
timestamping?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-05-04 14:44     ` Tobias Waldekranz
@ 2021-05-04 15:21       ` Vladimir Oltean
  2021-05-04 20:07         ` Tobias Waldekranz
  0 siblings, 1 reply; 42+ messages in thread
From: Vladimir Oltean @ 2021-05-04 15:21 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Tue, May 04, 2021 at 04:44:31PM +0200, Tobias Waldekranz wrote:
> On Tue, Apr 27, 2021 at 13:17, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Mon, Apr 26, 2021 at 07:04:08PM +0200, Tobias Waldekranz wrote:
> >> Allow DSA drivers to support forward offloading from a bridge by:
> >> 
> >> - Passing calls to .ndo_dfwd_{add,del}_station to the drivers.
> >> 
> >> - Recording the subordinate device of offloaded skbs in the control
> >>   buffer so that the tagger can take the appropriate action.
> >> 
> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> >> ---
> >>  include/net/dsa.h |  7 +++++++
> >>  net/dsa/slave.c   | 36 ++++++++++++++++++++++++++++++++++--
> >>  2 files changed, 41 insertions(+), 2 deletions(-)
> >> 
> >> diff --git a/include/net/dsa.h b/include/net/dsa.h
> >> index 1f9ba9889034..77d4df819299 100644
> >> --- a/include/net/dsa.h
> >> +++ b/include/net/dsa.h
> >> @@ -119,6 +119,7 @@ struct dsa_netdevice_ops {
> >>  
> >>  struct dsa_skb_cb {
> >>  	struct sk_buff *clone;
> >> +	struct net_device *sb_dev;
> >>  };
> >>  
> >>  struct __dsa_skb_cb {
> >> @@ -828,6 +829,12 @@ struct dsa_switch_ops {
> >>  					  const struct switchdev_obj_ring_role_mrp *mrp);
> >>  	int	(*port_mrp_del_ring_role)(struct dsa_switch *ds, int port,
> >>  					  const struct switchdev_obj_ring_role_mrp *mrp);
> >> +
> >> +	/* L2 forward offloading */
> >> +	void *	(*dfwd_add_station)(struct dsa_switch *ds, int port,
> >> +				    struct net_device *sb_dev);
> >> +	void	(*dfwd_del_station)(struct dsa_switch *ds, int port,
> >> +				    struct net_device *sb_dev);
> >>  };
> >>  
> >>  #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
> >> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> >> index 77b33bd161b8..3689ffa2dbb8 100644
> >> --- a/net/dsa/slave.c
> >> +++ b/net/dsa/slave.c
> >> @@ -657,6 +657,13 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
> >>  	return dsa_enqueue_skb(nskb, dev);
> >>  }
> >>  
> >> +static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb,
> >> +				  struct net_device *sb_dev)
> >> +{
> >> +	DSA_SKB_CB(skb)->sb_dev = sb_dev;
> >> +	return netdev_pick_tx(dev, skb, sb_dev);
> >> +}
> >> +
> >
> > DSA_SKB_CB is going away:
> > https://patchwork.kernel.org/project/netdevbpf/patch/20210427042203.26258-5-yangbo.lu@nxp.com/
> >
> > Let's either negotiate with Yangbo on keeping it, or make
> > .ndo_select_queue a bypass towards the tagger, where it can use its own
> > SKB_CB structure and be more flexible in general (I think I'm leaning
> > towards the latter).
> 
> Thus far, Yangbo is a tough negotiator, giving me the silent treatment:
> 
> https://lore.kernel.org/netdev/87y2d2noe5.fsf@waldekranz.com/
> 
> :)
> 
> That memset is giving me a hard time. I have just disabled it on my
> branch at the moment. Any ideas on how to get rid of it without breaking
> timestamping?

:)

Is there any guarantee written somewhere that the ownership of skb->cb
belongs to the NIC driver at the time of the ndo_select_queue call?

If there is, then the trivial solution is to just move the memset in
ndo_select_queue.

If there isn't, then we've got bigger issues (such as, for example, the
qdisc layer being able to overwrite your DSA_SKB_CB(skb)->sb_dev).

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-05-04 15:21       ` Vladimir Oltean
@ 2021-05-04 20:07         ` Tobias Waldekranz
  2021-05-04 20:33           ` Andrew Lunn
  2021-05-04 20:58           ` Vladimir Oltean
  0 siblings, 2 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-05-04 20:07 UTC (permalink / raw)
  To: Vladimir Oltean, andrew
  Cc: davem, kuba, vivien.didelot, f.fainelli, roopa, nikolay, jiri,
	idosch, stephen, netdev, bridge

On Tue, May 04, 2021 at 18:21, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, May 04, 2021 at 04:44:31PM +0200, Tobias Waldekranz wrote:
>> On Tue, Apr 27, 2021 at 13:17, Vladimir Oltean <olteanv@gmail.com> wrote:
>> > On Mon, Apr 26, 2021 at 07:04:08PM +0200, Tobias Waldekranz wrote:
>> >> Allow DSA drivers to support forward offloading from a bridge by:
>> >> 
>> >> - Passing calls to .ndo_dfwd_{add,del}_station to the drivers.
>> >> 
>> >> - Recording the subordinate device of offloaded skbs in the control
>> >>   buffer so that the tagger can take the appropriate action.
>> >> 
>> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> >> ---
>> >>  include/net/dsa.h |  7 +++++++
>> >>  net/dsa/slave.c   | 36 ++++++++++++++++++++++++++++++++++--
>> >>  2 files changed, 41 insertions(+), 2 deletions(-)
>> >> 
>> >> diff --git a/include/net/dsa.h b/include/net/dsa.h
>> >> index 1f9ba9889034..77d4df819299 100644
>> >> --- a/include/net/dsa.h
>> >> +++ b/include/net/dsa.h
>> >> @@ -119,6 +119,7 @@ struct dsa_netdevice_ops {
>> >>  
>> >>  struct dsa_skb_cb {
>> >>  	struct sk_buff *clone;
>> >> +	struct net_device *sb_dev;
>> >>  };
>> >>  
>> >>  struct __dsa_skb_cb {
>> >> @@ -828,6 +829,12 @@ struct dsa_switch_ops {
>> >>  					  const struct switchdev_obj_ring_role_mrp *mrp);
>> >>  	int	(*port_mrp_del_ring_role)(struct dsa_switch *ds, int port,
>> >>  					  const struct switchdev_obj_ring_role_mrp *mrp);
>> >> +
>> >> +	/* L2 forward offloading */
>> >> +	void *	(*dfwd_add_station)(struct dsa_switch *ds, int port,
>> >> +				    struct net_device *sb_dev);
>> >> +	void	(*dfwd_del_station)(struct dsa_switch *ds, int port,
>> >> +				    struct net_device *sb_dev);
>> >>  };
>> >>  
>> >>  #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
>> >> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
>> >> index 77b33bd161b8..3689ffa2dbb8 100644
>> >> --- a/net/dsa/slave.c
>> >> +++ b/net/dsa/slave.c
>> >> @@ -657,6 +657,13 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
>> >>  	return dsa_enqueue_skb(nskb, dev);
>> >>  }
>> >>  
>> >> +static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb,
>> >> +				  struct net_device *sb_dev)
>> >> +{
>> >> +	DSA_SKB_CB(skb)->sb_dev = sb_dev;
>> >> +	return netdev_pick_tx(dev, skb, sb_dev);
>> >> +}
>> >> +
>> >
>> > DSA_SKB_CB is going away:
>> > https://patchwork.kernel.org/project/netdevbpf/patch/20210427042203.26258-5-yangbo.lu@nxp.com/
>> >
>> > Let's either negotiate with Yangbo on keeping it, or make
>> > .ndo_select_queue a bypass towards the tagger, where it can use its own
>> > SKB_CB structure and be more flexible in general (I think I'm leaning
>> > towards the latter).
>> 
>> Thus far, Yangbo is a tough negotiator, giving me the silent treatment:
>> 
>> https://lore.kernel.org/netdev/87y2d2noe5.fsf@waldekranz.com/
>> 
>> :)
>> 
>> That memset is giving me a hard time. I have just disabled it on my
>> branch at the moment. Any ideas on how to get rid of it without breaking
>> timestamping?
>
> :)
>
> Is there any guarantee written somewhere that the ownership of skb->cb
> belongs to the NIC driver at the time of the ndo_select_queue call?
>
> If there is, then the trivial solution is to just move the memset in
> ndo_select_queue.
>
> If there isn't, then we've got bigger issues (such as, for example, the
> qdisc layer being able to overwrite your DSA_SKB_CB(skb)->sb_dev).

The comment says:

   "This is owned by whoever has the skb queued ATM."

But qdisc_skb_cb is a thing as it turns out - so I think I can kiss the
idea of stashing the pointer in the CB goodbye.

Looking at some of the other users of .ndo_select_queue, I get the
feeling that we should really:

- Pre-generate a FROM_CPU tag template and store it under "TxQ 0"
- Pre-generate a FORWARD tag template and store it under "TxQ 1"
- Redfine tag_dsa's .ndo_select_queue to be: `return sb_dev ? 1 : 0;`
- Fetch the template using skb_queue_mapping, fill in the VID, and send
  it.

There is really no need to recompute the static parts of the tags on
each skb. It would mean moving some knowledge of the tagging format to
the driver. But that boundary is pretty artificial for
mv88e6xxx. tag_dsa has no use outside of mv88e6xxx, and mv88e6xxx does
not work with any other tagger. I suppose you could even move the whole
tagger to drivers/net/dsa/mv88e6xxx/?

What do you think?

Andrew?

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-05-04 20:07         ` Tobias Waldekranz
@ 2021-05-04 20:33           ` Andrew Lunn
  2021-05-04 21:24             ` Tobias Waldekranz
  2021-05-04 20:58           ` Vladimir Oltean
  1 sibling, 1 reply; 42+ messages in thread
From: Andrew Lunn @ 2021-05-04 20:33 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, davem, kuba, vivien.didelot, f.fainelli, roopa,
	nikolay, jiri, idosch, stephen, netdev, bridge

> There is really no need to recompute the static parts of the tags on
> each skb. It would mean moving some knowledge of the tagging format to
> the driver. But that boundary is pretty artificial for
> mv88e6xxx. tag_dsa has no use outside of mv88e6xxx, and mv88e6xxx does
> not work with any other tagger. I suppose you could even move the whole
> tagger to drivers/net/dsa/mv88e6xxx/?
> 
> What do you think?
> 
> Andrew?

We have resisted this before.

What information do you actually need to share between the tagger and
the driver? Both tag_lan9303.c and tag_ocelot_8021q.c do reference
their switch driver data structures, so some sharing is allowed. But
please try to keep the surface areas down.

       Andrew

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-05-04 20:07         ` Tobias Waldekranz
  2021-05-04 20:33           ` Andrew Lunn
@ 2021-05-04 20:58           ` Vladimir Oltean
  2021-05-04 22:12             ` Tobias Waldekranz
  1 sibling, 1 reply; 42+ messages in thread
From: Vladimir Oltean @ 2021-05-04 20:58 UTC (permalink / raw)
  To: Tobias Waldekranz, Alexander Duyck
  Cc: andrew, davem, kuba, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Tue, May 04, 2021 at 10:07:14PM +0200, Tobias Waldekranz wrote:
> On Tue, May 04, 2021 at 18:21, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Tue, May 04, 2021 at 04:44:31PM +0200, Tobias Waldekranz wrote:
> >> On Tue, Apr 27, 2021 at 13:17, Vladimir Oltean <olteanv@gmail.com> wrote:
> >> > On Mon, Apr 26, 2021 at 07:04:08PM +0200, Tobias Waldekranz wrote:
> >> >> Allow DSA drivers to support forward offloading from a bridge by:
> >> >> 
> >> >> - Passing calls to .ndo_dfwd_{add,del}_station to the drivers.
> >> >> 
> >> >> - Recording the subordinate device of offloaded skbs in the control
> >> >>   buffer so that the tagger can take the appropriate action.
> >> >> 
> >> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> >> >> ---
> >> >>  include/net/dsa.h |  7 +++++++
> >> >>  net/dsa/slave.c   | 36 ++++++++++++++++++++++++++++++++++--
> >> >>  2 files changed, 41 insertions(+), 2 deletions(-)
> >> >> 
> >> >> diff --git a/include/net/dsa.h b/include/net/dsa.h
> >> >> index 1f9ba9889034..77d4df819299 100644
> >> >> --- a/include/net/dsa.h
> >> >> +++ b/include/net/dsa.h
> >> >> @@ -119,6 +119,7 @@ struct dsa_netdevice_ops {
> >> >>  
> >> >>  struct dsa_skb_cb {
> >> >>  	struct sk_buff *clone;
> >> >> +	struct net_device *sb_dev;
> >> >>  };
> >> >>  
> >> >>  struct __dsa_skb_cb {
> >> >> @@ -828,6 +829,12 @@ struct dsa_switch_ops {
> >> >>  					  const struct switchdev_obj_ring_role_mrp *mrp);
> >> >>  	int	(*port_mrp_del_ring_role)(struct dsa_switch *ds, int port,
> >> >>  					  const struct switchdev_obj_ring_role_mrp *mrp);
> >> >> +
> >> >> +	/* L2 forward offloading */
> >> >> +	void *	(*dfwd_add_station)(struct dsa_switch *ds, int port,
> >> >> +				    struct net_device *sb_dev);
> >> >> +	void	(*dfwd_del_station)(struct dsa_switch *ds, int port,
> >> >> +				    struct net_device *sb_dev);
> >> >>  };
> >> >>  
> >> >>  #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
> >> >> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> >> >> index 77b33bd161b8..3689ffa2dbb8 100644
> >> >> --- a/net/dsa/slave.c
> >> >> +++ b/net/dsa/slave.c
> >> >> @@ -657,6 +657,13 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
> >> >>  	return dsa_enqueue_skb(nskb, dev);
> >> >>  }
> >> >>  
> >> >> +static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb,
> >> >> +				  struct net_device *sb_dev)
> >> >> +{
> >> >> +	DSA_SKB_CB(skb)->sb_dev = sb_dev;
> >> >> +	return netdev_pick_tx(dev, skb, sb_dev);
> >> >> +}
> >> >> +
> >> >
> >> > DSA_SKB_CB is going away:
> >> > https://patchwork.kernel.org/project/netdevbpf/patch/20210427042203.26258-5-yangbo.lu@nxp.com/
> >> >
> >> > Let's either negotiate with Yangbo on keeping it, or make
> >> > .ndo_select_queue a bypass towards the tagger, where it can use its own
> >> > SKB_CB structure and be more flexible in general (I think I'm leaning
> >> > towards the latter).
> >> 
> >> Thus far, Yangbo is a tough negotiator, giving me the silent treatment:
> >> 
> >> https://lore.kernel.org/netdev/87y2d2noe5.fsf@waldekranz.com/
> >> 
> >> :)
> >> 
> >> That memset is giving me a hard time. I have just disabled it on my
> >> branch at the moment. Any ideas on how to get rid of it without breaking
> >> timestamping?
> >
> > :)
> >
> > Is there any guarantee written somewhere that the ownership of skb->cb
> > belongs to the NIC driver at the time of the ndo_select_queue call?
> >
> > If there is, then the trivial solution is to just move the memset in
> > ndo_select_queue.
> >
> > If there isn't, then we've got bigger issues (such as, for example, the
> > qdisc layer being able to overwrite your DSA_SKB_CB(skb)->sb_dev).
> 
> The comment says:
> 
>    "This is owned by whoever has the skb queued ATM."
> 
> But qdisc_skb_cb is a thing as it turns out - so I think I can kiss the
> idea of stashing the pointer in the CB goodbye.
> 
> Looking at some of the other users of .ndo_select_queue, I get the
> feeling that we should really:
> 
> - Pre-generate a FROM_CPU tag template and store it under "TxQ 0"
> - Pre-generate a FORWARD tag template and store it under "TxQ 1"
> - Redfine tag_dsa's .ndo_select_queue to be: `return sb_dev ? 1 : 0;`
> - Fetch the template using skb_queue_mapping, fill in the VID, and send
>   it.

Different drivers use TX queues in different ways. For example, for the
switches with TSN offloads, we set ds->num_tx_queues to a value equal to
the number of hardware traffic classes, so that the CPU can inject
packets with a specific QOS_CLASS field in the DSA header (think VLAN PCP).
This is really visible with tc-taprio where some traffic classes can be
completely turned off, so you can easily tell which TC was a packet
enqueued to. Other switches use TX queues in other ways. Some Broadcom
tagging protocols use the skb queue_mapping to direct the packets to one
of multiple TX queues of the DSA master, in order to apply backpressure
in case there is congestion on the front port.

Selecting a TX queue based on which upper netdev the packet is coming
form sounds to me like the oddest of the bunch. It really adds one more
dimension to the existing uses, I am not sure that this is how it was
intended to be done [ and why, for example, if the sb_dev was propagated
so deeply into dev_queue_xmit, why was it not propagated all the way to
.ndo_start_xmit ], but on the other hand, you have more working
experience with the dev_queue_xmit_accel API than the zero I have.

By the way (to show how little I know) what does "d" in "dfwd" stand for?
It almost sounds to me like a typo that was carried along from
NETIF_F_HW_L2FW_DOFFLOAD_BIT.

We might need to ask for the input of some people from Intel who worked
on this offload framework. For example, I just added Alexander Duyck
hoping he can provide some suggestions. We just want the sb_dev in
ndo_start_xmit, and abusing ndo_select_queue seems like a huge hack just
to obtain that.

> There is really no need to recompute the static parts of the tags on
> each skb. It would mean moving some knowledge of the tagging format to
> the driver. But that boundary is pretty artificial for
> mv88e6xxx. tag_dsa has no use outside of mv88e6xxx, and mv88e6xxx does
> not work with any other tagger. I suppose you could even move the whole
> tagger to drivers/net/dsa/mv88e6xxx/?
> 
> What do you think?
> 
> Andrew?

[ not Andrew, but ]

I made that mistake so that you don't have to. You don't actually gain
as much as you think (performance is about the same, what you win in
instruction count and conditionals you lose in the memcpy), and you
create a dependency between the tagger and the switch driver which was
supposed by design to not exist. For my drivers I tried to remove this
dependency - see commit 7c4bb540e917 ("net: dsa: tag_ocelot: create
separate tagger for Seville"). Also, in the case of Ocelot switches,
a template was used to mask out handling differences between switch
generations, and present them to user space as "the same tagger".
Another bad idea. In general, if a tagging protocol is testable with
dsa_loop this is a plus. People at NXP wanted to see how their drivers
perform with Marvell switches (what are their options for balancing with
RFS/RSS) and this is what they did, changed DSA_TAG_PROTO_NONE from what
dsa_loop advertises. If they need the actual switch driver to initialize
the tagger's template, suddenly it's not so fun anymore.

If it ever becomes important enough, I suppose dsa_loop could even gain
support for the new .change_tag_protocol API to advertise the
feasibility of the idea in general, although given how DYI dsa_loop is
in general, maybe changing the tag protocol at runtime isn't so
important.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-05-04 20:33           ` Andrew Lunn
@ 2021-05-04 21:24             ` Tobias Waldekranz
  0 siblings, 0 replies; 42+ messages in thread
From: Tobias Waldekranz @ 2021-05-04 21:24 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Vladimir Oltean, davem, kuba, vivien.didelot, f.fainelli, roopa,
	nikolay, jiri, idosch, stephen, netdev, bridge

On Tue, May 04, 2021 at 22:33, Andrew Lunn <andrew@lunn.ch> wrote:
>> There is really no need to recompute the static parts of the tags on
>> each skb. It would mean moving some knowledge of the tagging format to
>> the driver. But that boundary is pretty artificial for
>> mv88e6xxx. tag_dsa has no use outside of mv88e6xxx, and mv88e6xxx does
>> not work with any other tagger. I suppose you could even move the whole
>> tagger to drivers/net/dsa/mv88e6xxx/?
>> 
>> What do you think?
>> 
>> Andrew?
>
> We have resisted this before.
>
> What information do you actually need to share between the tagger and
> the driver?

So far:

- Trunk/LAG ID to netdev mappings (this is stored on the dst now, but I
  think I have seen the light and agree with Vladimir that it really has
  no business there).

- DSA dev/port to bridge netdev mappings for the forwarding offloading
  in this RFC (or preferably the actual tag templates to use on egress
  since that would probably give you better performance)

In the future:

- Completions for in-flight remote management operations.

- FlowID to TC rule mappings (from the "Switch Egress header" when we
  enable that)

- In-band signaling between firmware running on the IMP and the driver
  for things like MRP and CFM offloading.

> Both tag_lan9303.c and tag_ocelot_8021q.c do reference
> their switch driver data structures, so some sharing is allowed. But
> please try to keep the surface areas down.

If you have a surface area keep it small, yes, agreed. I guess my
question is more why we should have any surface area at all? What do we
gain by the tagger/driver separation in the case of mv88e6xxx?

>        Andrew

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-05-04 20:58           ` Vladimir Oltean
@ 2021-05-04 22:12             ` Tobias Waldekranz
  2021-05-04 23:04               ` Vladimir Oltean
  0 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-05-04 22:12 UTC (permalink / raw)
  To: Vladimir Oltean, Alexander Duyck
  Cc: andrew, davem, kuba, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Tue, May 04, 2021 at 23:58, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, May 04, 2021 at 10:07:14PM +0200, Tobias Waldekranz wrote:
>> On Tue, May 04, 2021 at 18:21, Vladimir Oltean <olteanv@gmail.com> wrote:
>> > On Tue, May 04, 2021 at 04:44:31PM +0200, Tobias Waldekranz wrote:
>> >> On Tue, Apr 27, 2021 at 13:17, Vladimir Oltean <olteanv@gmail.com> wrote:
>> >> > On Mon, Apr 26, 2021 at 07:04:08PM +0200, Tobias Waldekranz wrote:
>> >> >> Allow DSA drivers to support forward offloading from a bridge by:
>> >> >> 
>> >> >> - Passing calls to .ndo_dfwd_{add,del}_station to the drivers.
>> >> >> 
>> >> >> - Recording the subordinate device of offloaded skbs in the control
>> >> >>   buffer so that the tagger can take the appropriate action.
>> >> >> 
>> >> >> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>> >> >> ---
>> >> >>  include/net/dsa.h |  7 +++++++
>> >> >>  net/dsa/slave.c   | 36 ++++++++++++++++++++++++++++++++++--
>> >> >>  2 files changed, 41 insertions(+), 2 deletions(-)
>> >> >> 
>> >> >> diff --git a/include/net/dsa.h b/include/net/dsa.h
>> >> >> index 1f9ba9889034..77d4df819299 100644
>> >> >> --- a/include/net/dsa.h
>> >> >> +++ b/include/net/dsa.h
>> >> >> @@ -119,6 +119,7 @@ struct dsa_netdevice_ops {
>> >> >>  
>> >> >>  struct dsa_skb_cb {
>> >> >>  	struct sk_buff *clone;
>> >> >> +	struct net_device *sb_dev;
>> >> >>  };
>> >> >>  
>> >> >>  struct __dsa_skb_cb {
>> >> >> @@ -828,6 +829,12 @@ struct dsa_switch_ops {
>> >> >>  					  const struct switchdev_obj_ring_role_mrp *mrp);
>> >> >>  	int	(*port_mrp_del_ring_role)(struct dsa_switch *ds, int port,
>> >> >>  					  const struct switchdev_obj_ring_role_mrp *mrp);
>> >> >> +
>> >> >> +	/* L2 forward offloading */
>> >> >> +	void *	(*dfwd_add_station)(struct dsa_switch *ds, int port,
>> >> >> +				    struct net_device *sb_dev);
>> >> >> +	void	(*dfwd_del_station)(struct dsa_switch *ds, int port,
>> >> >> +				    struct net_device *sb_dev);
>> >> >>  };
>> >> >>  
>> >> >>  #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
>> >> >> diff --git a/net/dsa/slave.c b/net/dsa/slave.c
>> >> >> index 77b33bd161b8..3689ffa2dbb8 100644
>> >> >> --- a/net/dsa/slave.c
>> >> >> +++ b/net/dsa/slave.c
>> >> >> @@ -657,6 +657,13 @@ static netdev_tx_t dsa_slave_xmit(struct sk_buff *skb, struct net_device *dev)
>> >> >>  	return dsa_enqueue_skb(nskb, dev);
>> >> >>  }
>> >> >>  
>> >> >> +static u16 dsa_slave_select_queue(struct net_device *dev, struct sk_buff *skb,
>> >> >> +				  struct net_device *sb_dev)
>> >> >> +{
>> >> >> +	DSA_SKB_CB(skb)->sb_dev = sb_dev;
>> >> >> +	return netdev_pick_tx(dev, skb, sb_dev);
>> >> >> +}
>> >> >> +
>> >> >
>> >> > DSA_SKB_CB is going away:
>> >> > https://patchwork.kernel.org/project/netdevbpf/patch/20210427042203.26258-5-yangbo.lu@nxp.com/
>> >> >
>> >> > Let's either negotiate with Yangbo on keeping it, or make
>> >> > .ndo_select_queue a bypass towards the tagger, where it can use its own
>> >> > SKB_CB structure and be more flexible in general (I think I'm leaning
>> >> > towards the latter).
>> >> 
>> >> Thus far, Yangbo is a tough negotiator, giving me the silent treatment:
>> >> 
>> >> https://lore.kernel.org/netdev/87y2d2noe5.fsf@waldekranz.com/
>> >> 
>> >> :)
>> >> 
>> >> That memset is giving me a hard time. I have just disabled it on my
>> >> branch at the moment. Any ideas on how to get rid of it without breaking
>> >> timestamping?
>> >
>> > :)
>> >
>> > Is there any guarantee written somewhere that the ownership of skb->cb
>> > belongs to the NIC driver at the time of the ndo_select_queue call?
>> >
>> > If there is, then the trivial solution is to just move the memset in
>> > ndo_select_queue.
>> >
>> > If there isn't, then we've got bigger issues (such as, for example, the
>> > qdisc layer being able to overwrite your DSA_SKB_CB(skb)->sb_dev).
>> 
>> The comment says:
>> 
>>    "This is owned by whoever has the skb queued ATM."
>> 
>> But qdisc_skb_cb is a thing as it turns out - so I think I can kiss the
>> idea of stashing the pointer in the CB goodbye.
>> 
>> Looking at some of the other users of .ndo_select_queue, I get the
>> feeling that we should really:
>> 
>> - Pre-generate a FROM_CPU tag template and store it under "TxQ 0"
>> - Pre-generate a FORWARD tag template and store it under "TxQ 1"
>> - Redfine tag_dsa's .ndo_select_queue to be: `return sb_dev ? 1 : 0;`
>> - Fetch the template using skb_queue_mapping, fill in the VID, and send
>>   it.
>
> Different drivers use TX queues in different ways. For example, for the
> switches with TSN offloads, we set ds->num_tx_queues to a value equal to
> the number of hardware traffic classes, so that the CPU can inject
> packets with a specific QOS_CLASS field in the DSA header (think VLAN PCP).
> This is really visible with tc-taprio where some traffic classes can be
> completely turned off, so you can easily tell which TC was a packet
> enqueued to. Other switches use TX queues in other ways. Some Broadcom
> tagging protocols use the skb queue_mapping to direct the packets to one
> of multiple TX queues of the DSA master, in order to apply backpressure
> in case there is congestion on the front port.
>
> Selecting a TX queue based on which upper netdev the packet is coming
> form sounds to me like the oddest of the bunch. It really adds one more
> dimension to the existing uses, I am not sure that this is how it was
> intended to be done [ and why, for example, if the sb_dev was propagated
> so deeply into dev_queue_xmit, why was it not propagated all the way to
> .ndo_start_xmit ], but on the other hand, you have more working
> experience with the dev_queue_xmit_accel API than the zero I have.

Yeah it does not feel right. I expect mv88e6xxx will also want to expose
the real number of queues in the future. Some of the newer devices have
support for time aware shapers for example.

As for why sb_dev is not propagated to .ndo_start_xmit: I chalked it up
to the existing users managing the macvlan offloads by directing those
flows to a particular TxQ. I.e. they simply had no need for it.

Or perhaps they did not have the nerve to send the commit that changed
the signature of _every_ driver's .ndo_start_xmit :)

> By the way (to show how little I know) what does "d" in "dfwd" stand for?
> It almost sounds to me like a typo that was carried along from
> NETIF_F_HW_L2FW_DOFFLOAD_BIT.

That has been bugging me as well! I have no idea.

> We might need to ask for the input of some people from Intel who worked
> on this offload framework. For example, I just added Alexander Duyck
> hoping he can provide some suggestions. We just want the sb_dev in
> ndo_start_xmit, and abusing ndo_select_queue seems like a huge hack just
> to obtain that.

I think you are right.

>> There is really no need to recompute the static parts of the tags on
>> each skb. It would mean moving some knowledge of the tagging format to
>> the driver. But that boundary is pretty artificial for
>> mv88e6xxx. tag_dsa has no use outside of mv88e6xxx, and mv88e6xxx does
>> not work with any other tagger. I suppose you could even move the whole
>> tagger to drivers/net/dsa/mv88e6xxx/?
>> 
>> What do you think?
>> 
>> Andrew?
>
> [ not Andrew, but ]
>
> I made that mistake so that you don't have to. You don't actually gain
> as much as you think (performance is about the same, what you win in
> instruction count and conditionals you lose in the memcpy),

That is valuable info, thank you. But I think the most important
improvement I see would be the ability to couple the tagger tighter to
the driver when we add more complicated features.

> and you
> create a dependency between the tagger and the switch driver which was
> supposed by design to not exist.

Sure, but _why_ should it not exist? Many fields in the tag can only be
correctly generated/interpreted in combination with knowledge of the
current configuration, which is the driver's domain. The dependency is
already there, etched in silicon.

> For my drivers I tried to remove this
> dependency - see commit 7c4bb540e917 ("net: dsa: tag_ocelot: create
> separate tagger for Seville"). Also, in the case of Ocelot switches,
> a template was used to mask out handling differences between switch
> generations, and present them to user space as "the same tagger".
> Another bad idea. In general, if a tagging protocol is testable with
> dsa_loop this is a plus. People at NXP wanted to see how their drivers
> perform with Marvell switches (what are their options for balancing with
> RFS/RSS) and this is what they did, changed DSA_TAG_PROTO_NONE from what
> dsa_loop advertises. If they need the actual switch driver to initialize
> the tagger's template, suddenly it's not so fun anymore.

I shall have to look more closely at dsa_loop, so far I have just seen
the name float by on a few occasions.

> If it ever becomes important enough, I suppose dsa_loop could even gain
> support for the new .change_tag_protocol API to advertise the
> feasibility of the idea in general, although given how DYI dsa_loop is
> in general, maybe changing the tag protocol at runtime isn't so
> important.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-05-04 22:12             ` Tobias Waldekranz
@ 2021-05-04 23:04               ` Vladimir Oltean
  2021-05-05  9:01                 ` Tobias Waldekranz
  0 siblings, 1 reply; 42+ messages in thread
From: Vladimir Oltean @ 2021-05-04 23:04 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: andrew, davem, kuba, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Wed, May 05, 2021 at 12:12:15AM +0200, Tobias Waldekranz wrote:
> > and you create a dependency between the tagger and the switch driver
> > which was supposed by design to not exist.
> 
> Sure, but _why_ should it not exist? Many fields in the tag can only be
> correctly generated/interpreted in combination with knowledge of the
> current configuration, which is the driver's domain. The dependency is
> already there, etched in silicon.

I'm a bit more of a pragmatic person, it's not so much that I think that
Lennert Buytenhek's original DSA design from 2008 was the holy grail and
that we should do everything we can to preserve it intact. Far from it.
But I actually like having the option to inject a DSA-tagged packet
using Spirent TestCenter and measure IP forwarding between dsa_loop
"switch" ports (actually a one-armed router is what it is). I also like,
as a reviewer, to be able to test, if I want to, how a tail tagger
behaves even if I don't own a switch with tail tagging. And this
separation between the switch driver and the tag protocol driver makes
that possible, just see it as a nice perk which we don't want to lose.

As for more advanced features, like "the hardware requires me to invent
a unique number based on a rolling counter, call it a TX timestamp ID,
put it in the DSA header, then when transmission is done, an IRQ will be
raised, and I need to match that TX timestamp that just became available
to me, which is identifiable via the timestamp ID that I put in the DSA
header, with the original skb", of course you can't do that without
communication between the tagger and the driver itself, unless you make
the tagger handle interrupts (and then there's the whole issue that the
tagging protocol driver needs to be instantiated per switch, if it's
going to be stateful), or the switch driver send packets. As a general
rule of thumb, just don't break dsa_loop and we should be fine. For
example, yes, PTP requires driver <-> tagger communication, but PTP
timestamping is also not enabled by default, and guarded by an ioctl
which dsa_loop doesn't implement. So the tagger can never trigger faulty
code, dereferencing a ds->priv pointer which it thinks is "struct
mv88e6xxx_chip" but is actually "struct dsa_loop_priv".

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 2/9] net: bridge: Disambiguate offload_fwd_mark
  2021-05-03  8:49     ` Tobias Waldekranz
@ 2021-05-05  7:39       ` Ido Schimmel
  0 siblings, 0 replies; 42+ messages in thread
From: Ido Schimmel @ 2021-05-05  7:39 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, olteanv, roopa,
	nikolay, jiri, stephen, netdev, bridge

On Mon, May 03, 2021 at 10:49:12AM +0200, Tobias Waldekranz wrote:
> On Sun, May 02, 2021 at 18:00, Ido Schimmel <idosch@idosch.org> wrote:
> > On Mon, Apr 26, 2021 at 07:04:04PM +0200, Tobias Waldekranz wrote:
> >> - skb->cb->offload_fwd_mark becomes skb->cb->src_hwdom. There is a
> >>   slight change here: Whereas previously this was only set for
> >>   offloaded packets, we now always track the incoming hwdom. As all
> >>   uses where already gated behind checks of skb->offload_fwd_mark,
> >>   this will not introduce any functional change, but it paves the way
> >>   for future changes where the ingressing hwdom must be known both for
> >>   offloaded and non-offloaded frames.
> >
> > [...]
> >
> >> @@ -43,15 +43,15 @@ int nbp_switchdev_mark_set(struct net_bridge_port *p)
> >>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,
> >>  			      struct sk_buff *skb)
> >>  {
> >> -	if (skb->offload_fwd_mark && !WARN_ON_ONCE(!p->offload_fwd_mark))
> >> -		BR_INPUT_SKB_CB(skb)->offload_fwd_mark = p->offload_fwd_mark;
> >> +	if (p->hwdom)
> >> +		BR_INPUT_SKB_CB(skb)->src_hwdom = p->hwdom;
> >>  }
> >
> > I assume you are referring to this change? "src_hwdom" sounds weird if
> > it's expected to be valid for non-offloaded frames.
> 
> Perhaps "non-offloaded" was a sloppy description on my part. I was
> trying to describe frames that originate from a switchdev, but have not
> been forwarded by hardware; e.g. STP BPDUs, IGMP reports, etc. So
> nbp_switchdev_frame_mark now basically says: "If this skb came in from a
> switchdev, make sure to note which one".
> 
> > Can you elaborate about "future changes where the ingressing hwdom must
> > be known both for offloaded and non-offloaded frames"?
> 
> Typical example: The switchdev has a fixed configuration to trap STP
> BPDUs, but STP is not running on the bridge and the group_fwd_mask
> allows them to be forwarded. Say we have this setup:
> 
>       br0
>     /  |  \
> swp0 swp1 swp2
> 
> A BPDU comes in on swp0 and is trapped to the CPU; the driver does not
> set skb->offload_fwd_mark. The bridge determines that the frame should
> be forwarded to swp{1,2}. It is imperative that forward offloading is
> _not_ allowed in this case, as the source hwdom is already "poisoned".
> 
> Recording the source hwdom allows this case to be handled properly.

OK, thanks for the explanation. If it is allowed, then the packet will
be transmitted from swp0, from which it was received.

> 
> > Probably best to split this change to a different patch given the rest
> > of the changes are mechanical.
> 
> Right, but I think the change in name to warrants a change in
> semantics. It is being renamed to src_hwdom because it now holds just
> that information. Again, there is no functional change introduced by
> this since nbp_switchdev_allowed_egress always checks for the presence
> of skb->offload_fwd_mark anyway. But if you feel strongly about it, I
> will split it up.

If you put the explanation above in the changelog, then it should be
fine to keep it as one patch.

> 
> >>  
> >>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,
> >>  				  const struct sk_buff *skb)
> >>  {
> >>  	return !skb->offload_fwd_mark ||
> >> -	       BR_INPUT_SKB_CB(skb)->offload_fwd_mark != p->offload_fwd_mark;
> >> +	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;
> >>  }
> >>  
> >>  /* Flags that can be offloaded to hardware */
> >> -- 
> >> 2.25.1
> >> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-05-04 23:04               ` Vladimir Oltean
@ 2021-05-05  9:01                 ` Tobias Waldekranz
  2021-05-05 16:12                   ` Vladimir Oltean
  0 siblings, 1 reply; 42+ messages in thread
From: Tobias Waldekranz @ 2021-05-05  9:01 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: andrew, davem, kuba, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Wed, May 05, 2021 at 02:04, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Wed, May 05, 2021 at 12:12:15AM +0200, Tobias Waldekranz wrote:
>> > and you create a dependency between the tagger and the switch driver
>> > which was supposed by design to not exist.
>> 
>> Sure, but _why_ should it not exist? Many fields in the tag can only be
>> correctly generated/interpreted in combination with knowledge of the
>> current configuration, which is the driver's domain. The dependency is
>> already there, etched in silicon.
>
> I'm a bit more of a pragmatic person,

Excuse me sir, I believe you left your dagger IN MY HEART :)

> it's not so much that I think that
> Lennert Buytenhek's original DSA design from 2008 was the holy grail and
> that we should do everything we can to preserve it intact. Far from it.
> But I actually like having the option to inject a DSA-tagged packet
> using Spirent TestCenter and measure IP forwarding between dsa_loop
> "switch" ports (actually a one-armed router is what it is). I also like,
> as a reviewer, to be able to test, if I want to, how a tail tagger
> behaves even if I don't own a switch with tail tagging. And this
> separation between the switch driver and the tag protocol driver makes
> that possible, just see it as a nice perk which we don't want to lose.

Completely understandable. I was trying to extrapolate where we will end
up with this separation as we add more and more features and couple the
tagger closer to the driver, and see if the current architecture was
still the optimal one. Trying to be ...pragmatic, if you will.

> As for more advanced features, like "the hardware requires me to invent
> a unique number based on a rolling counter, call it a TX timestamp ID,
> put it in the DSA header, then when transmission is done, an IRQ will be
> raised, and I need to match that TX timestamp that just became available
> to me, which is identifiable via the timestamp ID that I put in the DSA
> header, with the original skb", of course you can't do that without
> communication between the tagger and the driver itself, unless you make
> the tagger handle interrupts (and then there's the whole issue that the
> tagging protocol driver needs to be instantiated per switch, if it's
> going to be stateful), or the switch driver send packets. As a general
> rule of thumb, just don't break dsa_loop and we should be fine. For
> example, yes, PTP requires driver <-> tagger communication, but PTP
> timestamping is also not enabled by default, and guarded by an ioctl
> which dsa_loop doesn't implement. So the tagger can never trigger faulty
> code, dereferencing a ds->priv pointer which it thinks is "struct
> mv88e6xxx_chip" but is actually "struct dsa_loop_priv".

This should also hold for forward offloading, since dsa_loop would not
implement .ndo_dfwd_{add,del}_station.

Alright, include/linux/dsa/mv88e6xxx.h here I come!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 6/9] net: dsa: Forward offloading
  2021-05-05  9:01                 ` Tobias Waldekranz
@ 2021-05-05 16:12                   ` Vladimir Oltean
  0 siblings, 0 replies; 42+ messages in thread
From: Vladimir Oltean @ 2021-05-05 16:12 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: andrew, davem, kuba, vivien.didelot, f.fainelli, roopa, nikolay,
	jiri, idosch, stephen, netdev, bridge

On Wed, May 05, 2021 at 11:01:09AM +0200, Tobias Waldekranz wrote:
> On Wed, May 05, 2021 at 02:04, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Wed, May 05, 2021 at 12:12:15AM +0200, Tobias Waldekranz wrote:
> >> > and you create a dependency between the tagger and the switch driver
> >> > which was supposed by design to not exist.
> >> 
> >> Sure, but _why_ should it not exist? Many fields in the tag can only be
> >> correctly generated/interpreted in combination with knowledge of the
> >> current configuration, which is the driver's domain. The dependency is
> >> already there, etched in silicon.
> >
> > I'm a bit more of a pragmatic person,
> 
> Excuse me sir, I believe you left your dagger IN MY HEART :)

You might have misinterpreted my words, I did not mean to say "look what
a good quality I have and you don't", in fact I don't view pragmatism as
much of a desirable quality at all. What I meant to say in the context
is that, even though in general I value functionality more than how it
is implemented, I would still like to keep the separation between
taggers and switch drivers at least at the most basic RX/TX level, for
the reasons explained.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 0/9] net: bridge: Forward offloading
  2021-05-03  9:44   ` Tobias Waldekranz
@ 2021-05-06 10:59     ` Vladimir Oltean
  0 siblings, 0 replies; 42+ messages in thread
From: Vladimir Oltean @ 2021-05-06 10:59 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Ido Schimmel, davem, kuba, andrew, vivien.didelot, f.fainelli,
	roopa, nikolay, jiri, stephen, netdev, bridge

On Mon, May 03, 2021 at 11:44:21AM +0200, Tobias Waldekranz wrote:
> > Do you expect this optimization to only work when physical netdevs are
> > enslaved to the bridge? What about LAG/VLANs?
> 
> LAGs should definitely work once the .ndo_dfwd_{add,del}_station helpers
> are in place.
> 
> Stacked VLANs could also be made to work. But they may require some
> extra work.
> 
> In v1, the bridge will always send offloaded frames with the VLAN
> information intact, even if the port is configured to egress the VID
> untagged. This is needed so that the driver can determine the correct
> VID to use in cases where multiple VIDs are set to egress untagged.
> 
> If any kind of VID translation takes place I think things get very
> sticky. Then again, that is maybe not all that defined without this
> change applied either?
> 
> What is the typical use-case for using an "external" stacked VLAN device
> over configuring the VLAN inside the bridge?

I think it will be very sticky to support forwarding offload for this
setup anyway:

            br0
swp0.100  swp0.101   swp1.100

Need to understand what the use case is. The correct behavior IMO is for
the physical switch port to remain standalone and for the bridge to know
that the bridge port (swp0.100) is not offloaded.
FWIW I had an attempt to handle situations like this here:
https://patchwork.kernel.org/project/netdevbpf/patch/20210224114350.2791260-17-olteanv@gmail.com/

I will let Tobias finish with his forwarding offload patch set before
rebasing and resending mine, his work looks a lot better at this point.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [RFC net-next 4/9] net: bridge: switchdev: Forward offloading
  2021-05-03  8:53     ` Tobias Waldekranz
@ 2021-05-06 11:01       ` Vladimir Oltean
  0 siblings, 0 replies; 42+ messages in thread
From: Vladimir Oltean @ 2021-05-06 11:01 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Ido Schimmel, davem, kuba, andrew, vivien.didelot, f.fainelli,
	roopa, nikolay, jiri, stephen, netdev, bridge

On Mon, May 03, 2021 at 10:53:36AM +0200, Tobias Waldekranz wrote:
> On Sun, May 02, 2021 at 18:04, Ido Schimmel <idosch@idosch.org> wrote:
> > On Mon, Apr 26, 2021 at 07:04:06PM +0200, Tobias Waldekranz wrote:
> >> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)
> >> +{
> >> +	void *priv;
> >> +
> >> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))
> >> +		return;
> >> +
> >> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);
> >
> > Some changes to team/bond/8021q will be needed in order to get this
> > optimization to work when they are enslaved to the bridge instead of the
> > front panel port itself?
> 
> Right you are. We should probably do something similar to the
> switchdev_handle_port_* family of helpers that could be reused in
> stacked devices. I will look at it for v1.

Makes sense, issuing a switchdev notifier of some sort will allow the
switchdev driver to act upon the bridge's request for a net device
belonging to a different driver.

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, back to index

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-26 17:04 [RFC net-next 0/9] net: bridge: Forward offloading Tobias Waldekranz
2021-04-26 17:04 ` [RFC net-next 1/9] net: dfwd: Constrain existing users to macvlan subordinates Tobias Waldekranz
2021-04-26 17:04 ` [RFC net-next 2/9] net: bridge: Disambiguate offload_fwd_mark Tobias Waldekranz
2021-05-02 15:00   ` Ido Schimmel
2021-05-03  8:49     ` Tobias Waldekranz
2021-05-05  7:39       ` Ido Schimmel
2021-04-26 17:04 ` [RFC net-next 3/9] net: bridge: switchdev: Recycle unused hwdoms Tobias Waldekranz
2021-04-27 10:42   ` Nikolay Aleksandrov
2021-04-26 17:04 ` [RFC net-next 4/9] net: bridge: switchdev: Forward offloading Tobias Waldekranz
2021-04-27 10:35   ` Nikolay Aleksandrov
2021-04-28 22:47     ` Tobias Waldekranz
2021-04-29  9:16       ` Nikolay Aleksandrov
2021-04-29 14:55         ` Tobias Waldekranz
2021-05-02 15:04   ` Ido Schimmel
2021-05-03  8:53     ` Tobias Waldekranz
2021-05-06 11:01       ` Vladimir Oltean
2021-04-26 17:04 ` [RFC net-next 5/9] net: dsa: Track port PVIDs Tobias Waldekranz
2021-04-26 19:40   ` Vladimir Oltean
2021-04-26 20:05     ` Tobias Waldekranz
2021-04-26 20:28       ` Vladimir Oltean
2021-04-27  9:12         ` Tobias Waldekranz
2021-04-27  9:27           ` Vladimir Oltean
2021-04-27 10:07           ` Vladimir Oltean
2021-04-28 23:10             ` Tobias Waldekranz
2021-04-26 17:04 ` [RFC net-next 6/9] net: dsa: Forward offloading Tobias Waldekranz
2021-04-27 10:17   ` Vladimir Oltean
2021-05-04 14:44     ` Tobias Waldekranz
2021-05-04 15:21       ` Vladimir Oltean
2021-05-04 20:07         ` Tobias Waldekranz
2021-05-04 20:33           ` Andrew Lunn
2021-05-04 21:24             ` Tobias Waldekranz
2021-05-04 20:58           ` Vladimir Oltean
2021-05-04 22:12             ` Tobias Waldekranz
2021-05-04 23:04               ` Vladimir Oltean
2021-05-05  9:01                 ` Tobias Waldekranz
2021-05-05 16:12                   ` Vladimir Oltean
2021-04-26 17:04 ` [RFC net-next 7/9] net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge Tobias Waldekranz
2021-04-26 17:04 ` [RFC net-next 8/9] net: dsa: mv88e6xxx: Map virtual bridge port in PVT Tobias Waldekranz
2021-04-26 17:04 ` [RFC net-next 9/9] net: dsa: mv88e6xxx: Forward offloading Tobias Waldekranz
2021-05-02 14:58 ` [RFC net-next 0/9] net: bridge: " Ido Schimmel
2021-05-03  9:44   ` Tobias Waldekranz
2021-05-06 10:59     ` Vladimir Oltean

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org
	public-inbox-index netdev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git