* [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

Hello, this is me bombarding the list with switchdev FDB changes again.

This series attempts to address one design limitation in the interaction
between the bridge and switchdev: error codes returned from the
SWITCHDEV_FDB_ADD_TO_DEVICE and SWITCHDEV_FDB_DEL_TO_DEVICE handlers are
completely ignored.

There are multiple aspects to that. First of all, drivers have a portion
that handles those switchdev events in atomic context, and a portion
that handles them in a private deferred work context. Errors reported
from both calling contexts are ignored by the bridge, and it is
desirable to actually propagate both to user space.
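
For reference, this is roughly the pattern that most drivers follow today
(a minimal sketch; the foo_* names are made up for illustration and do not
belong to any particular driver):

  struct foo_switchdev_event_work {
      struct work_struct work;
      struct switchdev_notifier_fdb_info fdb_info;
  };

  /* Sleepable context: programs the hardware, but its error is lost,
   * because the bridge stopped listening as soon as the atomic notifier
   * below returned.
   */
  static void foo_fdb_event_work(struct work_struct *work)
  {
      struct foo_switchdev_event_work *w =
          container_of(work, struct foo_switchdev_event_work, work);
      int err;

      rtnl_lock();
      err = foo_port_fdb_program(&w->fdb_info); /* err is discarded */
      rtnl_unlock();
      kfree(w);
  }

  /* Atomic context: only copies the FDB info and queues deferred work. */
  static int foo_switchdev_event(struct notifier_block *nb,
                                 unsigned long event, void *ptr)
  {
      struct foo_switchdev_event_work *w = kzalloc(sizeof(*w), GFP_ATOMIC);

      if (!w)
          return NOTIFY_BAD;

      INIT_WORK(&w->work, foo_fdb_event_work);
      /* ... copy the switchdev_notifier_fdb_info from "ptr" into "w" ... */
      queue_work(system_long_wq, &w->work);

      return NOTIFY_DONE;
  }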

Secondly, it is in fact okay that some switchdev errors are ignored.
The call graph for fdb_notify() is not simple; it looks something like
this (not complete):

IFLA_BRPORT_FLUSH                                                              RTM_NEWNEIGH
   |                                                                               |
   | {br,nbp}_vlan_delete                 br_fdb_change_mac_address                v
   |   |  |                                                  |     fast      __br_fdb_add
   |   |  |  del_nbp, br_dev_delete       br_fdb_changeaddr  |     path         /  |  \
   |   |  |      |                                        |  |    learning     /   |   \
   \   |   -------------------- br_fdb_find_delete_local  |  |       |        /    |    \     switchdev event
    \  |         |                                     |  |  |       |       /     |     \     listener
     -------------------------- br_fdb_delete_by_port  |  |  |       |      /      |      \       |
                                                 |  |  |  |  |       |     /       |       \      |
                                                 |  |  |  |  |       |    /        |        \     |
                                                 |  |  |  |  |    br_fdb_update    |        br_fdb_external_learn_add
           (RTM_DELNEIGH)  br_fdb_delete         |  |  |  |  |       |             |              |
                                     |           |  |  |  |  |       |             |              |    gc_work        netdevice
                                     |           |  |  |  |  |       |      fdb_add_entry         |     timer          event
                                     |           | fdb_delete_local  |             |              |        |          listener
                         __br_fdb_delete         |  |                |             |              /  br_fdb_cleanup      |
                                     |           |  |                |             |             /         |             |     br_stp_change_bridge_id
                                     |           |  |                \             |            /          | br_fdb_changeaddr      |
                                     |           |  |                 \            |           /           |     |                  |
                     fdb_delete_by_addr_and_port |  | fdb_insert       \           |          /       ----/      | br_fdb_change_mac_address
                                              |  |  |  |                \          |         /       /           |  |
                   br_fdb_external_learn_del  |  |  |  | br_fdb_cleanup  \         |        /       /            |  | br_fdb_insert
                                          |   |  |  |  |  |               \        |       /   ----/             |  | |
                                          |   |  |  |  |  |                \       |      /   /                 fdb_insert
                          br_fdb_flush    |   |  |  |  |  |                 \      |     /   /            --------/
                                 \----    |   |  |  |  |  |                  \     |    /   /      ------/
                                      \----------- fdb_delete --------------- fdb_notify ---------/

There is not a lot that fast path learning can do when switchdev returns
an error.

So this patch set mainly wants to deal with the 2 code paths that are
triggered by these regular commands:

bridge fdb add dev swp0 00:01:02:03:04:05 master static # __br_fdb_add
bridge fdb del dev swp0 00:01:02:03:04:05 master static # __br_fdb_delete

In some other, semi-related discussions, Ido Schimmel pointed out that
it would be nice if user space got some feedback from the actual driver,
and made some proposals about how that could be done.
https://patchwork.kernel.org/project/netdevbpf/cover/20210819160723.2186424-1-vladimir.oltean@nxp.com/
One of the proposals was to call fdb_notify() from sleepable context,
but Nikolay disliked the idea of introducing deferred work in the bridge
driver (seems like nobody wants to deal with it).

And since all proposals for dealing with the deferred work inside
switchdev were also shot down for valid reasons, the baseline we are left
with is the code that we have today, with the deferred work being private
to the driver, and somehow an err and an extack must be propagated from
there.

So the approach taken here is to reorganize the code a bit and add some
hooks in:
(a) some callers of fdb_notify(), to initialize a completion structure
(b) some drivers that catch SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE, to mark
    that completion structure as done
(c) some bridge logic that I believe is fairly safe (I'm open to being
    proven wrong), which temporarily drops &br->hash_lock in order to
    sleep until the completion is done (a minimal sketch follows below).
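
To make the intended flow more concrete, here is that sketch. The
completion/error field names and the exact call sites are illustrative
only; the real hooks are introduced in patches 11 and 13:

  /* (a) bridge, in the callers of interest: arm a completion before
   * notifying switchdev
   */
  init_completion(&fdb->switchdev_done);
  br_fdb_notify(br, fdb, RTM_NEWNEIGH, true);

  /* (c) bridge, currently under &br->hash_lock: drop it, sleep until the
   * driver responds, then re-take it and pick up the verdict
   */
  spin_unlock_bh(&br->hash_lock);
  wait_for_completion(&fdb->switchdev_done);
  spin_lock_bh(&br->hash_lock);
  err = fdb->switchdev_err;

  /* (b) the driver, at the end of its private deferred work item, reports
   * its error and signals the completion - in the series this goes through
   * a switchdev helper rather than by touching bridge internals:
   *
   *     fdb->switchdev_err = err;
   *     complete(&fdb->switchdev_done);
   */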

There are some further optimizations that can be made. For example, we
can avoid dropping the hash_lock if there is no switchdev response
pending, and we can move some of that completion logic into
br_switchdev.c such that it is compiled out on a CONFIG_NET_SWITCHDEV=n
build. I haven't done those here, since they aren't exactly trivial; I am
mainly looking for high-level feedback first.

The structure of the patch series is:
- patches 1-6 are me toying around with some code organization while I
  was trying to understand the various call paths better. I like not
  having forward declarations, but if they exist for a reason, I can
  drop these patches.
- patches 7-10 and 12 are some preparation work that can also be ignored.
- patches 11 and 13 are where the meat of the series is.
- patches 14 and 15 are DSA boilerplate so I could test what I'm doing.

Vladimir Oltean (15):
  net: bridge: remove fdb_notify forward declaration
  net: bridge: remove fdb_insert forward declaration
  net: bridge: rename fdb_insert to fdb_add_local
  net: bridge: rename br_fdb_insert to br_fdb_add_local
  net: bridge: move br_fdb_replay inside br_switchdev.c
  net: bridge: create a common function for populating switchdev FDB
    entries
  net: switchdev: keep the MAC address by value in struct
    switchdev_notifier_fdb_info
  net: bridge: take the hash_lock inside fdb_add_entry
  net: bridge: rename fdb_notify to br_fdb_notify
  net: switchdev: merge switchdev_handle_fdb_{add,del}_to_device
  net: bridge: make fdb_add_entry() wait for switchdev feedback
  net: rtnetlink: pass extack to .ndo_fdb_del
  net: bridge: wait for errors from switchdev when deleting FDB entries
  net: dsa: propagate feedback to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
  net: dsa: propagate extack to .port_fdb_{add,del}

 drivers/net/dsa/b53/b53_common.c              |   6 +-
 drivers/net/dsa/b53/b53_priv.h                |   6 +-
 drivers/net/dsa/hirschmann/hellcreek.c        |   6 +-
 drivers/net/dsa/lan9303-core.c                |   7 +-
 drivers/net/dsa/lantiq_gswip.c                |   6 +-
 drivers/net/dsa/microchip/ksz9477.c           |   6 +-
 drivers/net/dsa/mt7530.c                      |   6 +-
 drivers/net/dsa/mv88e6xxx/chip.c              |   6 +-
 drivers/net/dsa/ocelot/felix.c                |   6 +-
 drivers/net/dsa/qca8k.c                       |   6 +-
 drivers/net/dsa/sja1105/sja1105_main.c        |  12 +-
 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  14 +-
 drivers/net/ethernet/intel/ice/ice_main.c     |   4 +-
 .../marvell/prestera/prestera_switchdev.c     |  18 +-
 .../mellanox/mlx5/core/en/rep/bridge.c        |  10 -
 .../ethernet/mellanox/mlx5/core/esw/bridge.c  |   2 +-
 .../ethernet/mellanox/mlxsw/spectrum_router.c |   4 +-
 .../mellanox/mlxsw/spectrum_switchdev.c       |  11 +-
 .../microchip/sparx5/sparx5_mactable.c        |   2 +-
 .../microchip/sparx5/sparx5_switchdev.c       |  12 +-
 drivers/net/ethernet/mscc/ocelot_net.c        |   3 +-
 .../net/ethernet/qlogic/qlcnic/qlcnic_main.c  |   5 +-
 drivers/net/ethernet/rocker/rocker_main.c     |  13 +-
 drivers/net/ethernet/rocker/rocker_ofdpa.c    |   2 +-
 drivers/net/ethernet/ti/am65-cpsw-switchdev.c |  14 +-
 drivers/net/ethernet/ti/cpsw_switchdev.c      |  14 +-
 drivers/net/macvlan.c                         |   3 +-
 drivers/net/vxlan.c                           |   3 +-
 drivers/s390/net/qeth_l2_main.c               |  11 +-
 include/linux/netdevice.h                     |   6 +-
 include/net/dsa.h                             |   6 +-
 include/net/switchdev.h                       |  67 +-
 net/bridge/br_fdb.c                           | 657 ++++++++++--------
 net/bridge/br_if.c                            |   2 +-
 net/bridge/br_private.h                       |  25 +-
 net/bridge/br_switchdev.c                     | 103 ++-
 net/bridge/br_vlan.c                          |   5 +-
 net/core/rtnetlink.c                          |   4 +-
 net/dsa/dsa_priv.h                            |  15 +-
 net/dsa/port.c                                |  13 +-
 net/dsa/slave.c                               |  87 +--
 net/dsa/switch.c                              |  22 +-
 net/switchdev/switchdev.c                     | 156 +----
 43 files changed, 678 insertions(+), 708 deletions(-)

-- 
2.25.1



* [RFC PATCH net-next 01/15] net: bridge: remove fdb_notify forward declaration
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

fdb_notify() has a forward declaration because its first caller,
fdb_delete(), is declared before 3 functions that fdb_notify() needs:
fdb_to_nud(), fdb_fill_info() and fdb_nlmsg_size().

This patch moves fdb_notify() and those 3 functions above fdb_delete()
and deletes the forward declaration.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_fdb.c | 246 ++++++++++++++++++++++----------------------
 1 file changed, 122 insertions(+), 124 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index a6a68e18c70a..bfb28a24ea81 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -34,8 +34,6 @@ static const struct rhashtable_params br_fdb_rht_params = {
 static struct kmem_cache *br_fdb_cache __read_mostly;
 static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
 		      const unsigned char *addr, u16 vid);
-static void fdb_notify(struct net_bridge *br,
-		       const struct net_bridge_fdb_entry *, int, bool);
 
 int __init br_fdb_init(void)
 {
@@ -87,6 +85,128 @@ static void fdb_rcu_free(struct rcu_head *head)
 	kmem_cache_free(br_fdb_cache, ent);
 }
 
+static int fdb_to_nud(const struct net_bridge *br,
+		      const struct net_bridge_fdb_entry *fdb)
+{
+	if (test_bit(BR_FDB_LOCAL, &fdb->flags))
+		return NUD_PERMANENT;
+	else if (test_bit(BR_FDB_STATIC, &fdb->flags))
+		return NUD_NOARP;
+	else if (has_expired(br, fdb))
+		return NUD_STALE;
+	else
+		return NUD_REACHABLE;
+}
+
+static int fdb_fill_info(struct sk_buff *skb, const struct net_bridge *br,
+			 const struct net_bridge_fdb_entry *fdb,
+			 u32 portid, u32 seq, int type, unsigned int flags)
+{
+	const struct net_bridge_port *dst = READ_ONCE(fdb->dst);
+	unsigned long now = jiffies;
+	struct nda_cacheinfo ci;
+	struct nlmsghdr *nlh;
+	struct ndmsg *ndm;
+
+	nlh = nlmsg_put(skb, portid, seq, type, sizeof(*ndm), flags);
+	if (nlh == NULL)
+		return -EMSGSIZE;
+
+	ndm = nlmsg_data(nlh);
+	ndm->ndm_family	 = AF_BRIDGE;
+	ndm->ndm_pad1    = 0;
+	ndm->ndm_pad2    = 0;
+	ndm->ndm_flags	 = 0;
+	ndm->ndm_type	 = 0;
+	ndm->ndm_ifindex = dst ? dst->dev->ifindex : br->dev->ifindex;
+	ndm->ndm_state   = fdb_to_nud(br, fdb);
+
+	if (test_bit(BR_FDB_OFFLOADED, &fdb->flags))
+		ndm->ndm_flags |= NTF_OFFLOADED;
+	if (test_bit(BR_FDB_ADDED_BY_EXT_LEARN, &fdb->flags))
+		ndm->ndm_flags |= NTF_EXT_LEARNED;
+	if (test_bit(BR_FDB_STICKY, &fdb->flags))
+		ndm->ndm_flags |= NTF_STICKY;
+
+	if (nla_put(skb, NDA_LLADDR, ETH_ALEN, &fdb->key.addr))
+		goto nla_put_failure;
+	if (nla_put_u32(skb, NDA_MASTER, br->dev->ifindex))
+		goto nla_put_failure;
+	ci.ndm_used	 = jiffies_to_clock_t(now - fdb->used);
+	ci.ndm_confirmed = 0;
+	ci.ndm_updated	 = jiffies_to_clock_t(now - fdb->updated);
+	ci.ndm_refcnt	 = 0;
+	if (nla_put(skb, NDA_CACHEINFO, sizeof(ci), &ci))
+		goto nla_put_failure;
+
+	if (fdb->key.vlan_id && nla_put(skb, NDA_VLAN, sizeof(u16),
+					&fdb->key.vlan_id))
+		goto nla_put_failure;
+
+	if (test_bit(BR_FDB_NOTIFY, &fdb->flags)) {
+		struct nlattr *nest = nla_nest_start(skb, NDA_FDB_EXT_ATTRS);
+		u8 notify_bits = FDB_NOTIFY_BIT;
+
+		if (!nest)
+			goto nla_put_failure;
+		if (test_bit(BR_FDB_NOTIFY_INACTIVE, &fdb->flags))
+			notify_bits |= FDB_NOTIFY_INACTIVE_BIT;
+
+		if (nla_put_u8(skb, NFEA_ACTIVITY_NOTIFY, notify_bits)) {
+			nla_nest_cancel(skb, nest);
+			goto nla_put_failure;
+		}
+
+		nla_nest_end(skb, nest);
+	}
+
+	nlmsg_end(skb, nlh);
+	return 0;
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static inline size_t fdb_nlmsg_size(void)
+{
+	return NLMSG_ALIGN(sizeof(struct ndmsg))
+		+ nla_total_size(ETH_ALEN) /* NDA_LLADDR */
+		+ nla_total_size(sizeof(u32)) /* NDA_MASTER */
+		+ nla_total_size(sizeof(u16)) /* NDA_VLAN */
+		+ nla_total_size(sizeof(struct nda_cacheinfo))
+		+ nla_total_size(0) /* NDA_FDB_EXT_ATTRS */
+		+ nla_total_size(sizeof(u8)); /* NFEA_ACTIVITY_NOTIFY */
+}
+
+static void fdb_notify(struct net_bridge *br,
+		       const struct net_bridge_fdb_entry *fdb, int type,
+		       bool swdev_notify)
+{
+	struct net *net = dev_net(br->dev);
+	struct sk_buff *skb;
+	int err = -ENOBUFS;
+
+	if (swdev_notify)
+		br_switchdev_fdb_notify(br, fdb, type);
+
+	skb = nlmsg_new(fdb_nlmsg_size(), GFP_ATOMIC);
+	if (skb == NULL)
+		goto errout;
+
+	err = fdb_fill_info(skb, br, fdb, 0, 0, type, 0);
+	if (err < 0) {
+		/* -EMSGSIZE implies BUG in fdb_nlmsg_size() */
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(skb);
+		goto errout;
+	}
+	rtnl_notify(skb, net, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
+	return;
+errout:
+	rtnl_set_sk_err(net, RTNLGRP_NEIGH, err);
+}
+
 static struct net_bridge_fdb_entry *fdb_find_rcu(struct rhashtable *tbl,
 						 const unsigned char *addr,
 						 __u16 vid)
@@ -638,100 +758,6 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 	}
 }
 
-static int fdb_to_nud(const struct net_bridge *br,
-		      const struct net_bridge_fdb_entry *fdb)
-{
-	if (test_bit(BR_FDB_LOCAL, &fdb->flags))
-		return NUD_PERMANENT;
-	else if (test_bit(BR_FDB_STATIC, &fdb->flags))
-		return NUD_NOARP;
-	else if (has_expired(br, fdb))
-		return NUD_STALE;
-	else
-		return NUD_REACHABLE;
-}
-
-static int fdb_fill_info(struct sk_buff *skb, const struct net_bridge *br,
-			 const struct net_bridge_fdb_entry *fdb,
-			 u32 portid, u32 seq, int type, unsigned int flags)
-{
-	const struct net_bridge_port *dst = READ_ONCE(fdb->dst);
-	unsigned long now = jiffies;
-	struct nda_cacheinfo ci;
-	struct nlmsghdr *nlh;
-	struct ndmsg *ndm;
-
-	nlh = nlmsg_put(skb, portid, seq, type, sizeof(*ndm), flags);
-	if (nlh == NULL)
-		return -EMSGSIZE;
-
-	ndm = nlmsg_data(nlh);
-	ndm->ndm_family	 = AF_BRIDGE;
-	ndm->ndm_pad1    = 0;
-	ndm->ndm_pad2    = 0;
-	ndm->ndm_flags	 = 0;
-	ndm->ndm_type	 = 0;
-	ndm->ndm_ifindex = dst ? dst->dev->ifindex : br->dev->ifindex;
-	ndm->ndm_state   = fdb_to_nud(br, fdb);
-
-	if (test_bit(BR_FDB_OFFLOADED, &fdb->flags))
-		ndm->ndm_flags |= NTF_OFFLOADED;
-	if (test_bit(BR_FDB_ADDED_BY_EXT_LEARN, &fdb->flags))
-		ndm->ndm_flags |= NTF_EXT_LEARNED;
-	if (test_bit(BR_FDB_STICKY, &fdb->flags))
-		ndm->ndm_flags |= NTF_STICKY;
-
-	if (nla_put(skb, NDA_LLADDR, ETH_ALEN, &fdb->key.addr))
-		goto nla_put_failure;
-	if (nla_put_u32(skb, NDA_MASTER, br->dev->ifindex))
-		goto nla_put_failure;
-	ci.ndm_used	 = jiffies_to_clock_t(now - fdb->used);
-	ci.ndm_confirmed = 0;
-	ci.ndm_updated	 = jiffies_to_clock_t(now - fdb->updated);
-	ci.ndm_refcnt	 = 0;
-	if (nla_put(skb, NDA_CACHEINFO, sizeof(ci), &ci))
-		goto nla_put_failure;
-
-	if (fdb->key.vlan_id && nla_put(skb, NDA_VLAN, sizeof(u16),
-					&fdb->key.vlan_id))
-		goto nla_put_failure;
-
-	if (test_bit(BR_FDB_NOTIFY, &fdb->flags)) {
-		struct nlattr *nest = nla_nest_start(skb, NDA_FDB_EXT_ATTRS);
-		u8 notify_bits = FDB_NOTIFY_BIT;
-
-		if (!nest)
-			goto nla_put_failure;
-		if (test_bit(BR_FDB_NOTIFY_INACTIVE, &fdb->flags))
-			notify_bits |= FDB_NOTIFY_INACTIVE_BIT;
-
-		if (nla_put_u8(skb, NFEA_ACTIVITY_NOTIFY, notify_bits)) {
-			nla_nest_cancel(skb, nest);
-			goto nla_put_failure;
-		}
-
-		nla_nest_end(skb, nest);
-	}
-
-	nlmsg_end(skb, nlh);
-	return 0;
-
-nla_put_failure:
-	nlmsg_cancel(skb, nlh);
-	return -EMSGSIZE;
-}
-
-static inline size_t fdb_nlmsg_size(void)
-{
-	return NLMSG_ALIGN(sizeof(struct ndmsg))
-		+ nla_total_size(ETH_ALEN) /* NDA_LLADDR */
-		+ nla_total_size(sizeof(u32)) /* NDA_MASTER */
-		+ nla_total_size(sizeof(u16)) /* NDA_VLAN */
-		+ nla_total_size(sizeof(struct nda_cacheinfo))
-		+ nla_total_size(0) /* NDA_FDB_EXT_ATTRS */
-		+ nla_total_size(sizeof(u8)); /* NFEA_ACTIVITY_NOTIFY */
-}
-
 static int br_fdb_replay_one(struct net_bridge *br, struct notifier_block *nb,
 			     const struct net_bridge_fdb_entry *fdb,
 			     unsigned long action, const void *ctx)
@@ -786,34 +812,6 @@ int br_fdb_replay(const struct net_device *br_dev, const void *ctx, bool adding,
 	return err;
 }
 
-static void fdb_notify(struct net_bridge *br,
-		       const struct net_bridge_fdb_entry *fdb, int type,
-		       bool swdev_notify)
-{
-	struct net *net = dev_net(br->dev);
-	struct sk_buff *skb;
-	int err = -ENOBUFS;
-
-	if (swdev_notify)
-		br_switchdev_fdb_notify(br, fdb, type);
-
-	skb = nlmsg_new(fdb_nlmsg_size(), GFP_ATOMIC);
-	if (skb == NULL)
-		goto errout;
-
-	err = fdb_fill_info(skb, br, fdb, 0, 0, type, 0);
-	if (err < 0) {
-		/* -EMSGSIZE implies BUG in fdb_nlmsg_size() */
-		WARN_ON(err == -EMSGSIZE);
-		kfree_skb(skb);
-		goto errout;
-	}
-	rtnl_notify(skb, net, 0, RTNLGRP_NEIGH, NULL, GFP_ATOMIC);
-	return;
-errout:
-	rtnl_set_sk_err(net, RTNLGRP_NEIGH, err);
-}
-
 /* Dump information about entries, in response to GETNEIGH */
 int br_fdb_dump(struct sk_buff *skb,
 		struct netlink_callback *cb,
-- 
2.25.1



* [RFC PATCH net-next 02/15] net: bridge: remove fdb_insert forward declaration
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

fdb_insert() has a forward declaration because its first caller,
br_fdb_changeaddr(), is declared before fdb_create(), a function which
fdb_insert() needs.

This patch moves the 2 functions above br_fdb_changeaddr() and deletes
the forward declaration for fdb_insert().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_fdb.c | 116 ++++++++++++++++++++++----------------------
 1 file changed, 57 insertions(+), 59 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index bfb28a24ea81..4fe2e958573e 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -32,8 +32,6 @@ static const struct rhashtable_params br_fdb_rht_params = {
 };
 
 static struct kmem_cache *br_fdb_cache __read_mostly;
-static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
-		      const unsigned char *addr, u16 vid);
 
 int __init br_fdb_init(void)
 {
@@ -377,6 +375,63 @@ void br_fdb_find_delete_local(struct net_bridge *br,
 	spin_unlock_bh(&br->hash_lock);
 }
 
+static struct net_bridge_fdb_entry *fdb_create(struct net_bridge *br,
+					       struct net_bridge_port *source,
+					       const unsigned char *addr,
+					       __u16 vid,
+					       unsigned long flags)
+{
+	struct net_bridge_fdb_entry *fdb;
+
+	fdb = kmem_cache_alloc(br_fdb_cache, GFP_ATOMIC);
+	if (fdb) {
+		memcpy(fdb->key.addr.addr, addr, ETH_ALEN);
+		WRITE_ONCE(fdb->dst, source);
+		fdb->key.vlan_id = vid;
+		fdb->flags = flags;
+		fdb->updated = fdb->used = jiffies;
+		if (rhashtable_lookup_insert_fast(&br->fdb_hash_tbl,
+						  &fdb->rhnode,
+						  br_fdb_rht_params)) {
+			kmem_cache_free(br_fdb_cache, fdb);
+			fdb = NULL;
+		} else {
+			hlist_add_head_rcu(&fdb->fdb_node, &br->fdb_list);
+		}
+	}
+	return fdb;
+}
+
+static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
+		      const unsigned char *addr, u16 vid)
+{
+	struct net_bridge_fdb_entry *fdb;
+
+	if (!is_valid_ether_addr(addr))
+		return -EINVAL;
+
+	fdb = br_fdb_find(br, addr, vid);
+	if (fdb) {
+		/* it is okay to have multiple ports with same
+		 * address, just use the first one.
+		 */
+		if (test_bit(BR_FDB_LOCAL, &fdb->flags))
+			return 0;
+		br_warn(br, "adding interface %s with same address as a received packet (addr:%pM, vlan:%u)\n",
+		       source ? source->dev->name : br->dev->name, addr, vid);
+		fdb_delete(br, fdb, true);
+	}
+
+	fdb = fdb_create(br, source, addr, vid,
+			 BIT(BR_FDB_LOCAL) | BIT(BR_FDB_STATIC));
+	if (!fdb)
+		return -ENOMEM;
+
+	fdb_add_hw_addr(br, addr);
+	fdb_notify(br, fdb, RTM_NEWNEIGH, true);
+	return 0;
+}
+
 void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
 {
 	struct net_bridge_vlan_group *vg;
@@ -623,63 +678,6 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf,
 	return num;
 }
 
-static struct net_bridge_fdb_entry *fdb_create(struct net_bridge *br,
-					       struct net_bridge_port *source,
-					       const unsigned char *addr,
-					       __u16 vid,
-					       unsigned long flags)
-{
-	struct net_bridge_fdb_entry *fdb;
-
-	fdb = kmem_cache_alloc(br_fdb_cache, GFP_ATOMIC);
-	if (fdb) {
-		memcpy(fdb->key.addr.addr, addr, ETH_ALEN);
-		WRITE_ONCE(fdb->dst, source);
-		fdb->key.vlan_id = vid;
-		fdb->flags = flags;
-		fdb->updated = fdb->used = jiffies;
-		if (rhashtable_lookup_insert_fast(&br->fdb_hash_tbl,
-						  &fdb->rhnode,
-						  br_fdb_rht_params)) {
-			kmem_cache_free(br_fdb_cache, fdb);
-			fdb = NULL;
-		} else {
-			hlist_add_head_rcu(&fdb->fdb_node, &br->fdb_list);
-		}
-	}
-	return fdb;
-}
-
-static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
-		  const unsigned char *addr, u16 vid)
-{
-	struct net_bridge_fdb_entry *fdb;
-
-	if (!is_valid_ether_addr(addr))
-		return -EINVAL;
-
-	fdb = br_fdb_find(br, addr, vid);
-	if (fdb) {
-		/* it is okay to have multiple ports with same
-		 * address, just use the first one.
-		 */
-		if (test_bit(BR_FDB_LOCAL, &fdb->flags))
-			return 0;
-		br_warn(br, "adding interface %s with same address as a received packet (addr:%pM, vlan:%u)\n",
-		       source ? source->dev->name : br->dev->name, addr, vid);
-		fdb_delete(br, fdb, true);
-	}
-
-	fdb = fdb_create(br, source, addr, vid,
-			 BIT(BR_FDB_LOCAL) | BIT(BR_FDB_STATIC));
-	if (!fdb)
-		return -ENOMEM;
-
-	fdb_add_hw_addr(br, addr);
-	fdb_notify(br, fdb, RTM_NEWNEIGH, true);
-	return 0;
-}
-
 /* Add entry for local address of interface */
 int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
 		  const unsigned char *addr, u16 vid)
-- 
2.25.1



* [RFC PATCH net-next 03/15] net: bridge: rename fdb_insert to fdb_add_local
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

fdb_insert() is not a descriptive name for this function, and it is also
easy to confuse with __br_fdb_add(), fdb_add_entry() and br_fdb_update().
Even more confusingly, it is not related in any way to those functions:
none of them calls the other.

Since fdb_insert() basically deals with the creation of a BR_FDB_LOCAL
entry and is called only from functions where that is the intention:

- br_fdb_changeaddr
- br_fdb_change_mac_address
- br_fdb_insert

then rename it to fdb_add_local(), because its removal counterpart is
called fdb_delete_local().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_fdb.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 4fe2e958573e..0d6fb25c2ab2 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -402,8 +402,8 @@ static struct net_bridge_fdb_entry *fdb_create(struct net_bridge *br,
 	return fdb;
 }
 
-static int fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
-		      const unsigned char *addr, u16 vid)
+static int fdb_add_local(struct net_bridge *br, struct net_bridge_port *source,
+			 const unsigned char *addr, u16 vid)
 {
 	struct net_bridge_fdb_entry *fdb;
 
@@ -458,7 +458,7 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
 
 insert:
 	/* insert new address,  may fail if invalid address or dup. */
-	fdb_insert(br, p, newaddr, 0);
+	fdb_add_local(br, p, newaddr, 0);
 
 	if (!vg || !vg->num_vlans)
 		goto done;
@@ -468,7 +468,7 @@ void br_fdb_changeaddr(struct net_bridge_port *p, const unsigned char *newaddr)
 	 * from under us.
 	 */
 	list_for_each_entry(v, &vg->vlan_list, vlist)
-		fdb_insert(br, p, newaddr, v->vid);
+		fdb_add_local(br, p, newaddr, v->vid);
 
 done:
 	spin_unlock_bh(&br->hash_lock);
@@ -488,7 +488,7 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
 	    !f->dst && !test_bit(BR_FDB_ADDED_BY_USER, &f->flags))
 		fdb_delete_local(br, NULL, f);
 
-	fdb_insert(br, NULL, newaddr, 0);
+	fdb_add_local(br, NULL, newaddr, 0);
 	vg = br_vlan_group(br);
 	if (!vg || !vg->num_vlans)
 		goto out;
@@ -503,7 +503,7 @@ void br_fdb_change_mac_address(struct net_bridge *br, const u8 *newaddr)
 		if (f && test_bit(BR_FDB_LOCAL, &f->flags) &&
 		    !f->dst && !test_bit(BR_FDB_ADDED_BY_USER, &f->flags))
 			fdb_delete_local(br, NULL, f);
-		fdb_insert(br, NULL, newaddr, v->vid);
+		fdb_add_local(br, NULL, newaddr, v->vid);
 	}
 out:
 	spin_unlock_bh(&br->hash_lock);
@@ -685,7 +685,7 @@ int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
 	int ret;
 
 	spin_lock_bh(&br->hash_lock);
-	ret = fdb_insert(br, source, addr, vid);
+	ret = fdb_add_local(br, source, addr, vid);
 	spin_unlock_bh(&br->hash_lock);
 	return ret;
 }
-- 
2.25.1



* [RFC PATCH net-next 04/15] net: bridge: rename br_fdb_insert to br_fdb_add_local
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

br_fdb_insert() is a wrapper over fdb_insert() that also takes the
bridge hash_lock.

With fdb_insert() being renamed to fdb_add_local(), rename
br_fdb_insert() to br_fdb_add_local().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_fdb.c     | 4 ++--
 net/bridge/br_if.c      | 2 +-
 net/bridge/br_private.h | 4 ++--
 net/bridge/br_vlan.c    | 5 ++---
 4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 0d6fb25c2ab2..d955deea1b4d 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -679,8 +679,8 @@ int br_fdb_fillbuf(struct net_bridge *br, void *buf,
 }
 
 /* Add entry for local address of interface */
-int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
-		  const unsigned char *addr, u16 vid)
+int br_fdb_add_local(struct net_bridge *br, struct net_bridge_port *source,
+		     const unsigned char *addr, u16 vid)
 {
 	int ret;
 
diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index c11bba3e7ec0..c1183fef1f21 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -670,7 +670,7 @@ int br_add_if(struct net_bridge *br, struct net_device *dev,
 	else
 		netdev_set_rx_headroom(dev, br_hr);
 
-	if (br_fdb_insert(br, p, dev->dev_addr, 0))
+	if (br_fdb_add_local(br, p, dev->dev_addr, 0))
 		netdev_err(dev, "failed insert local address bridge forwarding table\n");
 
 	if (br->dev->addr_assign_type != NET_ADDR_SET) {
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 37ca76406f1e..705606fc2237 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -767,8 +767,8 @@ struct net_bridge_fdb_entry *br_fdb_find_rcu(struct net_bridge *br,
 int br_fdb_test_addr(struct net_device *dev, unsigned char *addr);
 int br_fdb_fillbuf(struct net_bridge *br, void *buf, unsigned long count,
 		   unsigned long off);
-int br_fdb_insert(struct net_bridge *br, struct net_bridge_port *source,
-		  const unsigned char *addr, u16 vid);
+int br_fdb_add_local(struct net_bridge *br, struct net_bridge_port *source,
+		     const unsigned char *addr, u16 vid);
 void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 		   const unsigned char *addr, u16 vid, unsigned long flags);
 
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 19f65ab91a02..57bd6ee72a07 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -293,7 +293,7 @@ static int __vlan_add(struct net_bridge_vlan *v, u16 flags,
 
 	/* Add the dev mac and count the vlan only if it's usable */
 	if (br_vlan_should_use(v)) {
-		err = br_fdb_insert(br, p, dev->dev_addr, v->vid);
+		err = br_fdb_add_local(br, p, dev->dev_addr, v->vid);
 		if (err) {
 			br_err(br, "failed insert local address into bridge forwarding table\n");
 			goto out_filt;
@@ -683,8 +683,7 @@ static int br_vlan_add_existing(struct net_bridge *br,
 			goto err_flags;
 		}
 		/* It was only kept for port vlans, now make it real */
-		err = br_fdb_insert(br, NULL, br->dev->dev_addr,
-				    vlan->vid);
+		err = br_fdb_add_local(br, NULL, br->dev->dev_addr, vlan->vid);
 		if (err) {
 			br_err(br, "failed to insert local address into bridge forwarding table\n");
 			goto err_fdb_insert;
-- 
2.25.1



* [RFC PATCH net-next 05/15] net: bridge: move br_fdb_replay inside br_switchdev.c
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

br_fdb_replay() is only called from switchdev code paths, so it makes
sense for it to be compiled out when switchdev is not enabled in the
first place.

As opposed to br_mdb_replay() and br_vlan_replay(), which may be compiled
out depending on bridge support for multicast and VLANs, FDB support is
always on. So moving br_mdb_replay() and br_vlan_replay() inside
br_switchdev.c would mean adding some #ifdefs to br_switchdev.c, and we
therefore keep those where they are.

The reason for the move is that future changes will introduce some code
reuse between br_switchdev_fdb_notify() and br_fdb_replay().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_fdb.c       | 54 ---------------------------------------
 net/bridge/br_private.h   |  2 --
 net/bridge/br_switchdev.c | 54 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 56 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index d955deea1b4d..9c57040d8341 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -756,60 +756,6 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 	}
 }
 
-static int br_fdb_replay_one(struct net_bridge *br, struct notifier_block *nb,
-			     const struct net_bridge_fdb_entry *fdb,
-			     unsigned long action, const void *ctx)
-{
-	const struct net_bridge_port *p = READ_ONCE(fdb->dst);
-	struct switchdev_notifier_fdb_info item;
-	int err;
-
-	item.addr = fdb->key.addr.addr;
-	item.vid = fdb->key.vlan_id;
-	item.added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
-	item.offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags);
-	item.is_local = test_bit(BR_FDB_LOCAL, &fdb->flags);
-	item.info.dev = (!p || item.is_local) ? br->dev : p->dev;
-	item.info.ctx = ctx;
-
-	err = nb->notifier_call(nb, action, &item);
-	return notifier_to_errno(err);
-}
-
-int br_fdb_replay(const struct net_device *br_dev, const void *ctx, bool adding,
-		  struct notifier_block *nb)
-{
-	struct net_bridge_fdb_entry *fdb;
-	struct net_bridge *br;
-	unsigned long action;
-	int err = 0;
-
-	if (!nb)
-		return 0;
-
-	if (!netif_is_bridge_master(br_dev))
-		return -EINVAL;
-
-	br = netdev_priv(br_dev);
-
-	if (adding)
-		action = SWITCHDEV_FDB_ADD_TO_DEVICE;
-	else
-		action = SWITCHDEV_FDB_DEL_TO_DEVICE;
-
-	rcu_read_lock();
-
-	hlist_for_each_entry_rcu(fdb, &br->fdb_list, fdb_node) {
-		err = br_fdb_replay_one(br, nb, fdb, action, ctx);
-		if (err)
-			break;
-	}
-
-	rcu_read_unlock();
-
-	return err;
-}
-
 /* Dump information about entries, in response to GETNEIGH */
 int br_fdb_dump(struct sk_buff *skb,
 		struct netlink_callback *cb,
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 705606fc2237..3c9327628060 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -792,8 +792,6 @@ int br_fdb_external_learn_del(struct net_bridge *br, struct net_bridge_port *p,
 			      bool swdev_notify);
 void br_fdb_offloaded_set(struct net_bridge *br, struct net_bridge_port *p,
 			  const unsigned char *addr, u16 vid, bool offloaded);
-int br_fdb_replay(const struct net_device *br_dev, const void *ctx, bool adding,
-		  struct notifier_block *nb);
 
 /* br_forward.c */
 enum br_pkt_type {
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 6bf518d78f02..8a45b1cfe06f 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -270,6 +270,60 @@ static void nbp_switchdev_del(struct net_bridge_port *p)
 	}
 }
 
+static int br_fdb_replay_one(struct net_bridge *br, struct notifier_block *nb,
+			     const struct net_bridge_fdb_entry *fdb,
+			     unsigned long action, const void *ctx)
+{
+	const struct net_bridge_port *p = READ_ONCE(fdb->dst);
+	struct switchdev_notifier_fdb_info item;
+	int err;
+
+	item.addr = fdb->key.addr.addr;
+	item.vid = fdb->key.vlan_id;
+	item.added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
+	item.offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags);
+	item.is_local = test_bit(BR_FDB_LOCAL, &fdb->flags);
+	item.info.dev = (!p || item.is_local) ? br->dev : p->dev;
+	item.info.ctx = ctx;
+
+	err = nb->notifier_call(nb, action, &item);
+	return notifier_to_errno(err);
+}
+
+static int br_fdb_replay(const struct net_device *br_dev, const void *ctx,
+			 bool adding, struct notifier_block *nb)
+{
+	struct net_bridge_fdb_entry *fdb;
+	struct net_bridge *br;
+	unsigned long action;
+	int err = 0;
+
+	if (!nb)
+		return 0;
+
+	if (!netif_is_bridge_master(br_dev))
+		return -EINVAL;
+
+	br = netdev_priv(br_dev);
+
+	if (adding)
+		action = SWITCHDEV_FDB_ADD_TO_DEVICE;
+	else
+		action = SWITCHDEV_FDB_DEL_TO_DEVICE;
+
+	rcu_read_lock();
+
+	hlist_for_each_entry_rcu(fdb, &br->fdb_list, fdb_node) {
+		err = br_fdb_replay_one(br, nb, fdb, action, ctx);
+		if (err)
+			break;
+	}
+
+	rcu_read_unlock();
+
+	return err;
+}
+
 static int nbp_switchdev_sync_objs(struct net_bridge_port *p, const void *ctx,
 				   struct notifier_block *atomic_nb,
 				   struct notifier_block *blocking_nb,
-- 
2.25.1



* [RFC PATCH net-next 06/15] net: bridge: create a common function for populating switchdev FDB entries
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

There are two places where a switchdev FDB entry is constructed: one is
br_switchdev_fdb_notify() and the other is br_fdb_replay(). One uses a
struct initializer, and the other declares the structure uninitialized
and populates the members one by one.

One problem when introducing new members to struct
switchdev_notifier_fdb_info is the risk that one of these functions runs
with an uninitialized value.

So centralize the logic of populating such a structure into a dedicated
function. Since that function becomes the single place where these
structures are created, using an uninitialized variable and populating
the members one by one should be fine, because it is supposed to assign
values to all of the members.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_switchdev.c | 41 +++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 8a45b1cfe06f..2fbe881cdfe2 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -122,28 +122,38 @@ int br_switchdev_set_port_flag(struct net_bridge_port *p,
 	return 0;
 }
 
+static void br_switchdev_fdb_populate(struct net_bridge *br,
+				      struct switchdev_notifier_fdb_info *item,
+				      const struct net_bridge_fdb_entry *fdb,
+				      const void *ctx)
+{
+	const struct net_bridge_port *p = READ_ONCE(fdb->dst);
+
+	item->addr = fdb->key.addr.addr;
+	item->vid = fdb->key.vlan_id;
+	item->added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
+	item->offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags);
+	item->is_local = test_bit(BR_FDB_LOCAL, &fdb->flags);
+	item->info.dev = (!p || item->is_local) ? br->dev : p->dev;
+	item->info.ctx = ctx;
+}
+
 void
 br_switchdev_fdb_notify(struct net_bridge *br,
 			const struct net_bridge_fdb_entry *fdb, int type)
 {
-	const struct net_bridge_port *dst = READ_ONCE(fdb->dst);
-	struct switchdev_notifier_fdb_info info = {
-		.addr = fdb->key.addr.addr,
-		.vid = fdb->key.vlan_id,
-		.added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags),
-		.is_local = test_bit(BR_FDB_LOCAL, &fdb->flags),
-		.offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags),
-	};
-	struct net_device *dev = (!dst || info.is_local) ? br->dev : dst->dev;
+	struct switchdev_notifier_fdb_info item;
+
+	br_switchdev_fdb_populate(br, &item, fdb, NULL);
 
 	switch (type) {
 	case RTM_DELNEIGH:
 		call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE,
-					 dev, &info.info, NULL);
+					 item.info.dev, &item.info, NULL);
 		break;
 	case RTM_NEWNEIGH:
 		call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE,
-					 dev, &info.info, NULL);
+					 item.info.dev, &item.info, NULL);
 		break;
 	}
 }
@@ -274,17 +284,10 @@ static int br_fdb_replay_one(struct net_bridge *br, struct notifier_block *nb,
 			     const struct net_bridge_fdb_entry *fdb,
 			     unsigned long action, const void *ctx)
 {
-	const struct net_bridge_port *p = READ_ONCE(fdb->dst);
 	struct switchdev_notifier_fdb_info item;
 	int err;
 
-	item.addr = fdb->key.addr.addr;
-	item.vid = fdb->key.vlan_id;
-	item.added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
-	item.offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags);
-	item.is_local = test_bit(BR_FDB_LOCAL, &fdb->flags);
-	item.info.dev = (!p || item.is_local) ? br->dev : p->dev;
-	item.info.ctx = ctx;
+	br_switchdev_fdb_populate(br, &item, fdb, ctx);
 
 	err = nb->notifier_call(nb, action, &item);
 	return notifier_to_errno(err);
-- 
2.25.1



* [RFC PATCH net-next 07/15] net: switchdev: keep the MAC address by value in struct switchdev_notifier_fdb_info
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

Currently, shallow copies of struct switchdev_notifier_fdb_info cannot
be carried around, since the "addr" member points to bridge memory.
This complicates driver implementations because they need to explicitly
allocate a separate piece of memory for the address.

Replace the pointer with a 6-byte array and simplify driver
handling. This makes the structure safely copyable, which in turn eases
some of the future changes (having a similar API between
switchdev_fdb_mark_pending and switchdev_fdb_mark_done).
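
Concretely, the "addr" member of struct switchdev_notifier_fdb_info goes
from

        const unsigned char *addr;

to something like

        unsigned char addr[ETH_ALEN];

so that callers fill it in with ether_addr_copy() instead of pointing it
at memory owned by the bridge.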

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 .../ethernet/freescale/dpaa2/dpaa2-switch.c    | 14 +-------------
 .../marvell/prestera/prestera_switchdev.c      | 18 +++---------------
 .../mellanox/mlx5/core/en/rep/bridge.c         | 10 ----------
 .../ethernet/mellanox/mlx5/core/esw/bridge.c   |  2 +-
 .../ethernet/mellanox/mlxsw/spectrum_router.c  |  4 ++--
 .../mellanox/mlxsw/spectrum_switchdev.c        | 11 ++---------
 .../microchip/sparx5/sparx5_mactable.c         |  2 +-
 .../microchip/sparx5/sparx5_switchdev.c        | 12 +-----------
 drivers/net/ethernet/rocker/rocker_main.c      | 13 ++-----------
 drivers/net/ethernet/rocker/rocker_ofdpa.c     |  2 +-
 drivers/net/ethernet/ti/am65-cpsw-switchdev.c  | 14 ++------------
 drivers/net/ethernet/ti/cpsw_switchdev.c       | 14 ++------------
 drivers/s390/net/qeth_l2_main.c                | 11 ++++-------
 include/net/switchdev.h                        |  2 +-
 net/bridge/br_switchdev.c                      |  2 +-
 net/dsa/slave.c                                |  2 +-
 16 files changed, 25 insertions(+), 108 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
index d039457928b0..6190feb44219 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
@@ -2246,7 +2246,6 @@ static void dpaa2_switch_event_work(struct work_struct *work)
 	}
 
 	rtnl_unlock();
-	kfree(switchdev_work->fdb_info.addr);
 	kfree(switchdev_work);
 	dev_put(dev);
 }
@@ -2278,15 +2277,8 @@ static int dpaa2_switch_port_event(struct notifier_block *nb,
 	switch (event) {
 	case SWITCHDEV_FDB_ADD_TO_DEVICE:
 	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		memcpy(&switchdev_work->fdb_info, ptr,
+		memcpy(&switchdev_work->fdb_info, fdb_info,
 		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!switchdev_work->fdb_info.addr)
-			goto err_addr_alloc;
-
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
-
 		/* Take a reference on the device to avoid being freed. */
 		dev_hold(dev);
 		break;
@@ -2298,10 +2290,6 @@ static int dpaa2_switch_port_event(struct notifier_block *nb,
 	queue_work(ethsw->workqueue, &switchdev_work->work);
 
 	return NOTIFY_DONE;
-
-err_addr_alloc:
-	kfree(switchdev_work);
-	return NOTIFY_BAD;
 }
 
 static int dpaa2_switch_port_obj_event(unsigned long event,
diff --git a/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c b/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
index 3ce6ccd0f539..236b07c42df0 100644
--- a/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
+++ b/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
@@ -760,7 +760,7 @@ prestera_fdb_offload_notify(struct prestera_port *port,
 {
 	struct switchdev_notifier_fdb_info send_info = {};
 
-	send_info.addr = info->addr;
+	ether_addr_copy(send_info.addr, info->addr);
 	send_info.vid = info->vid;
 	send_info.offloaded = true;
 
@@ -836,7 +836,6 @@ static void prestera_fdb_event_work(struct work_struct *work)
 out_unlock:
 	rtnl_unlock();
 
-	kfree(swdev_work->fdb_info.addr);
 	kfree(swdev_work);
 	dev_put(dev);
 }
@@ -883,15 +882,8 @@ static int prestera_switchdev_event(struct notifier_block *unused,
 					info);
 
 		INIT_WORK(&swdev_work->work, prestera_fdb_event_work);
-		memcpy(&swdev_work->fdb_info, ptr,
+		memcpy(&swdev_work->fdb_info, fdb_info,
 		       sizeof(swdev_work->fdb_info));
-
-		swdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!swdev_work->fdb_info.addr)
-			goto out_bad;
-
-		ether_addr_copy((u8 *)swdev_work->fdb_info.addr,
-				fdb_info->addr);
 		dev_hold(dev);
 		break;
 
@@ -902,10 +894,6 @@ static int prestera_switchdev_event(struct notifier_block *unused,
 
 	queue_work(swdev_wq, &swdev_work->work);
 	return NOTIFY_DONE;
-
-out_bad:
-	kfree(swdev_work);
-	return NOTIFY_BAD;
 }
 
 static int
@@ -1156,7 +1144,7 @@ static void prestera_fdb_event(struct prestera_switch *sw,
 	if (!dev)
 		return;
 
-	info.addr = evt->fdb_evt.data.mac;
+	ether_addr_copy(info.addr, evt->fdb_evt.data.mac);
 	info.vid = evt->fdb_evt.vid;
 	info.offloaded = true;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
index c6d2f8c78db7..d9735d817073 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
@@ -306,7 +306,6 @@ static void
 mlx5_esw_bridge_cleanup_switchdev_fdb_work(struct mlx5_bridge_switchdev_fdb_work *fdb_work)
 {
 	dev_put(fdb_work->dev);
-	kfree(fdb_work->fdb_info.addr);
 	kfree(fdb_work);
 }
 
@@ -345,7 +344,6 @@ mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add,
 					struct mlx5_esw_bridge_offloads *br_offloads)
 {
 	struct mlx5_bridge_switchdev_fdb_work *work;
-	u8 *addr;
 
 	work = kzalloc(sizeof(*work), GFP_ATOMIC);
 	if (!work)
@@ -354,14 +352,6 @@ mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add,
 	INIT_WORK(&work->work, mlx5_esw_bridge_switchdev_fdb_event_work);
 	memcpy(&work->fdb_info, fdb_info, sizeof(work->fdb_info));
 
-	addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-	if (!addr) {
-		kfree(work);
-		return ERR_PTR(-ENOMEM);
-	}
-	ether_addr_copy(addr, fdb_info->addr);
-	work->fdb_info.addr = addr;
-
 	dev_hold(dev);
 	work->dev = dev;
 	work->br_offloads = br_offloads;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c
index 588622ba38c1..30584aad3021 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c
@@ -77,7 +77,7 @@ mlx5_esw_bridge_fdb_offload_notify(struct net_device *dev, const unsigned char *
 {
 	struct switchdev_notifier_fdb_info send_info = {};
 
-	send_info.addr = addr;
+	ether_addr_copy(send_info.addr, addr);
 	send_info.vid = vid;
 	send_info.offloaded = true;
 	call_switchdev_notifiers(val, dev, &send_info.info, NULL);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 1e141b5944cd..54bd2b30eb8c 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -9216,7 +9216,7 @@ static void mlxsw_sp_rif_fid_fdb_del(struct mlxsw_sp_rif *rif, const char *mac)
 	if (!dev)
 		return;
 
-	info.addr = mac;
+	ether_addr_copy(info.addr, mac);
 	info.vid = 0;
 	call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_BRIDGE, dev, &info.info,
 				 NULL);
@@ -9267,7 +9267,7 @@ static void mlxsw_sp_rif_vlan_fdb_del(struct mlxsw_sp_rif *rif, const char *mac)
 	if (!dev)
 		return;
 
-	info.addr = mac;
+	ether_addr_copy(info.addr, mac);
 	info.vid = vid;
 	call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_BRIDGE, dev, &info.info,
 				 NULL);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 81c7e8a7fcf5..2b66a6d8d8a0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -2519,7 +2519,7 @@ mlxsw_sp_fdb_call_notifiers(enum switchdev_notifier_type type,
 {
 	struct switchdev_notifier_fdb_info info = {};
 
-	info.addr = mac;
+	ether_addr_copy(info.addr, mac);
 	info.vid = vid;
 	info.offloaded = offloaded;
 	call_switchdev_notifiers(type, dev, &info.info, NULL);
@@ -3010,7 +3010,6 @@ static void mlxsw_sp_switchdev_bridge_fdb_event_work(struct work_struct *work)
 
 out:
 	rtnl_unlock();
-	kfree(switchdev_work->fdb_info.addr);
 	kfree(switchdev_work);
 	dev_put(dev);
 }
@@ -3253,13 +3252,8 @@ static int mlxsw_sp_switchdev_event(struct notifier_block *unused,
 					info);
 		INIT_WORK(&switchdev_work->work,
 			  mlxsw_sp_switchdev_bridge_fdb_event_work);
-		memcpy(&switchdev_work->fdb_info, ptr,
+		memcpy(&switchdev_work->fdb_info, fdb_info,
 		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!switchdev_work->fdb_info.addr)
-			goto err_addr_alloc;
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
 		/* Take a reference on the device. This can be either
 		 * upper device containig mlxsw_sp_port or just a
 		 * mlxsw_sp_port
@@ -3286,7 +3280,6 @@ static int mlxsw_sp_switchdev_event(struct notifier_block *unused,
 	return NOTIFY_DONE;
 
 err_vxlan_work_prepare:
-err_addr_alloc:
 	kfree(switchdev_work);
 	return NOTIFY_BAD;
 }
diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_mactable.c b/drivers/net/ethernet/microchip/sparx5/sparx5_mactable.c
index 9a8e4f201eb1..3dcb6a887ea5 100644
--- a/drivers/net/ethernet/microchip/sparx5/sparx5_mactable.c
+++ b/drivers/net/ethernet/microchip/sparx5/sparx5_mactable.c
@@ -279,7 +279,7 @@ static void sparx5_fdb_call_notifiers(enum switchdev_notifier_type type,
 {
 	struct switchdev_notifier_fdb_info info = {};
 
-	info.addr = mac;
+	ether_addr_copy(info.addr, mac);
 	info.vid = vid;
 	info.offloaded = offloaded;
 	call_switchdev_notifiers(type, dev, &info.info, NULL);
diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c b/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
index 649ca609884a..5c5eb557a19c 100644
--- a/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
+++ b/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
@@ -254,7 +254,6 @@ static void sparx5_switchdev_bridge_fdb_event_work(struct work_struct *work)
 
 out:
 	rtnl_unlock();
-	kfree(switchdev_work->fdb_info.addr);
 	kfree(switchdev_work);
 	dev_put(dev);
 }
@@ -294,14 +293,8 @@ static int sparx5_switchdev_event(struct notifier_block *unused,
 					info);
 		INIT_WORK(&switchdev_work->work,
 			  sparx5_switchdev_bridge_fdb_event_work);
-		memcpy(&switchdev_work->fdb_info, ptr,
+		memcpy(&switchdev_work->fdb_info, fdb_info,
 		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!switchdev_work->fdb_info.addr)
-			goto err_addr_alloc;
-
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
 		dev_hold(dev);
 
 		sparx5_schedule_work(&switchdev_work->work);
@@ -309,9 +302,6 @@ static int sparx5_switchdev_event(struct notifier_block *unused,
 	}
 
 	return NOTIFY_DONE;
-err_addr_alloc:
-	kfree(switchdev_work);
-	return NOTIFY_BAD;
 }
 
 static void sparx5_sync_port_dev_addr(struct sparx5 *sparx5,
diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index ba4062881eed..e5ab9f577808 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2720,7 +2720,7 @@ rocker_fdb_offload_notify(struct rocker_port *rocker_port,
 {
 	struct switchdev_notifier_fdb_info info = {};
 
-	info.addr = recv_info->addr;
+	ether_addr_copy(info.addr, recv_info->addr);
 	info.vid = recv_info->vid;
 	info.offloaded = true;
 	call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED,
@@ -2759,7 +2759,6 @@ static void rocker_switchdev_event_work(struct work_struct *work)
 	}
 	rtnl_unlock();
 
-	kfree(switchdev_work->fdb_info.addr);
 	kfree(switchdev_work);
 	dev_put(rocker_port->dev);
 }
@@ -2791,16 +2790,8 @@ static int rocker_switchdev_event(struct notifier_block *unused,
 	switch (event) {
 	case SWITCHDEV_FDB_ADD_TO_DEVICE:
 	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		memcpy(&switchdev_work->fdb_info, ptr,
+		memcpy(&switchdev_work->fdb_info, fdb_info,
 		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (unlikely(!switchdev_work->fdb_info.addr)) {
-			kfree(switchdev_work);
-			return NOTIFY_BAD;
-		}
-
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
 		/* Take a reference on the rocker device */
 		dev_hold(dev);
 		break;
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 3e1ca7a8d029..abdb4b49add1 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -1824,7 +1824,7 @@ static void ofdpa_port_fdb_learn_work(struct work_struct *work)
 	bool learned = (lw->flags & OFDPA_OP_FLAG_LEARNED);
 	struct switchdev_notifier_fdb_info info = {};
 
-	info.addr = lw->addr;
+	ether_addr_copy(info.addr, lw->addr);
 	info.vid = lw->vid;
 
 	rtnl_lock();
diff --git a/drivers/net/ethernet/ti/am65-cpsw-switchdev.c b/drivers/net/ethernet/ti/am65-cpsw-switchdev.c
index 599708a3e81d..860214e1a8ca 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-switchdev.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-switchdev.c
@@ -360,7 +360,7 @@ static void am65_cpsw_fdb_offload_notify(struct net_device *ndev,
 {
 	struct switchdev_notifier_fdb_info info = {};
 
-	info.addr = rcv->addr;
+	ether_addr_copy(info.addr, rcv->addr);
 	info.vid = rcv->vid;
 	info.offloaded = true;
 	call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED,
@@ -414,7 +414,6 @@ static void am65_cpsw_switchdev_event_work(struct work_struct *work)
 	}
 	rtnl_unlock();
 
-	kfree(switchdev_work->fdb_info.addr);
 	kfree(switchdev_work);
 	dev_put(port->ndev);
 }
@@ -450,13 +449,8 @@ static int am65_cpsw_switchdev_event(struct notifier_block *unused,
 	switch (event) {
 	case SWITCHDEV_FDB_ADD_TO_DEVICE:
 	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		memcpy(&switchdev_work->fdb_info, ptr,
+		memcpy(&switchdev_work->fdb_info, fdb_info,
 		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!switchdev_work->fdb_info.addr)
-			goto err_addr_alloc;
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
 		dev_hold(ndev);
 		break;
 	default:
@@ -467,10 +461,6 @@ static int am65_cpsw_switchdev_event(struct notifier_block *unused,
 	queue_work(system_long_wq, &switchdev_work->work);
 
 	return NOTIFY_DONE;
-
-err_addr_alloc:
-	kfree(switchdev_work);
-	return NOTIFY_BAD;
 }
 
 static struct notifier_block cpsw_switchdev_notifier = {
diff --git a/drivers/net/ethernet/ti/cpsw_switchdev.c b/drivers/net/ethernet/ti/cpsw_switchdev.c
index a7d97d429e06..786bb848ddeb 100644
--- a/drivers/net/ethernet/ti/cpsw_switchdev.c
+++ b/drivers/net/ethernet/ti/cpsw_switchdev.c
@@ -370,7 +370,7 @@ static void cpsw_fdb_offload_notify(struct net_device *ndev,
 {
 	struct switchdev_notifier_fdb_info info = {};
 
-	info.addr = rcv->addr;
+	ether_addr_copy(info.addr, rcv->addr);
 	info.vid = rcv->vid;
 	info.offloaded = true;
 	call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED,
@@ -424,7 +424,6 @@ static void cpsw_switchdev_event_work(struct work_struct *work)
 	}
 	rtnl_unlock();
 
-	kfree(switchdev_work->fdb_info.addr);
 	kfree(switchdev_work);
 	dev_put(priv->ndev);
 }
@@ -460,13 +459,8 @@ static int cpsw_switchdev_event(struct notifier_block *unused,
 	switch (event) {
 	case SWITCHDEV_FDB_ADD_TO_DEVICE:
 	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		memcpy(&switchdev_work->fdb_info, ptr,
+		memcpy(&switchdev_work->fdb_info, fdb_info,
 		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!switchdev_work->fdb_info.addr)
-			goto err_addr_alloc;
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
 		dev_hold(ndev);
 		break;
 	default:
@@ -477,10 +471,6 @@ static int cpsw_switchdev_event(struct notifier_block *unused,
 	queue_work(system_long_wq, &switchdev_work->work);
 
 	return NOTIFY_DONE;
-
-err_addr_alloc:
-	kfree(switchdev_work);
-	return NOTIFY_BAD;
 }
 
 static struct notifier_block cpsw_switchdev_notifier = {
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 0347fc184786..deb8e3889f7e 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -283,7 +283,6 @@ static void qeth_l2_dev2br_fdb_flush(struct qeth_card *card)
 
 	QETH_CARD_TEXT(card, 2, "fdbflush");
 
-	info.addr = NULL;
 	/* flush all VLANs: */
 	info.vid = 0;
 	info.added_by_user = false;
@@ -637,20 +636,18 @@ static void qeth_l2_dev2br_fdb_notify(struct qeth_card *card, u8 code,
 				      struct mac_addr_lnid *addr_lnid)
 {
 	struct switchdev_notifier_fdb_info info = {};
-	u8 ntfy_mac[ETH_ALEN];
 
-	ether_addr_copy(ntfy_mac, addr_lnid->mac);
 	/* Ignore VLAN only changes */
 	if (!(code & IPA_ADDR_CHANGE_CODE_MACADDR))
 		return;
 	/* Ignore mcast entries */
-	if (is_multicast_ether_addr(ntfy_mac))
+	if (is_multicast_ether_addr(addr_lnid->mac))
 		return;
 	/* Ignore my own addresses */
 	if (qeth_is_my_net_if_token(card, token))
 		return;
 
-	info.addr = ntfy_mac;
+	ether_addr_copy(info.addr, addr_lnid->mac);
 	/* don't report VLAN IDs */
 	info.vid = 0;
 	info.added_by_user = false;
@@ -661,13 +658,13 @@ static void qeth_l2_dev2br_fdb_notify(struct qeth_card *card, u8 code,
 					 card->dev, &info.info, NULL);
 		QETH_CARD_TEXT(card, 4, "andelmac");
 		QETH_CARD_TEXT_(card, 4,
-				"mc%012llx", ether_addr_to_u64(ntfy_mac));
+				"mc%012llx", ether_addr_to_u64(info.addr));
 	} else {
 		call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_BRIDGE,
 					 card->dev, &info.info, NULL);
 		QETH_CARD_TEXT(card, 4, "anaddmac");
 		QETH_CARD_TEXT_(card, 4,
-				"mc%012llx", ether_addr_to_u64(ntfy_mac));
+				"mc%012llx", ether_addr_to_u64(info.addr));
 	}
 }
 
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 60d806b6a5ae..6764fb7692e2 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -218,7 +218,7 @@ struct switchdev_notifier_info {
 
 struct switchdev_notifier_fdb_info {
 	struct switchdev_notifier_info info; /* must be first */
-	const unsigned char *addr;
+	unsigned char addr[ETH_ALEN];
 	u16 vid;
 	u8 added_by_user:1,
 	   is_local:1,
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 2fbe881cdfe2..f58fb06ae641 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -129,7 +129,7 @@ static void br_switchdev_fdb_populate(struct net_bridge *br,
 {
 	const struct net_bridge_port *p = READ_ONCE(fdb->dst);
 
-	item->addr = fdb->key.addr.addr;
+	ether_addr_copy(item->addr, fdb->key.addr.addr);
 	item->vid = fdb->key.vlan_id;
 	item->added_by_user = test_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
 	item->offloaded = test_bit(BR_FDB_OFFLOADED, &fdb->flags);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index adcfb2cb4e61..54d53e18a211 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2395,7 +2395,7 @@ dsa_fdb_offload_notify(struct dsa_switchdev_event_work *switchdev_work)
 	if (!dsa_is_user_port(ds, switchdev_work->port))
 		return;
 
-	info.addr = switchdev_work->addr;
+	ether_addr_copy(info.addr, switchdev_work->addr);
 	info.vid = switchdev_work->vid;
 	info.offloaded = true;
 	dp = dsa_to_port(ds, switchdev_work->port);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH net-next 08/15] net: bridge: take the hash_lock inside fdb_add_entry
  2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
                   ` (6 preceding siblings ...)
  2021-10-25 22:24 ` [RFC PATCH net-next 07/15] net: switchdev: keep the MAC address by value in struct switchdev_notifier_fdb_info Vladimir Oltean
@ 2021-10-25 22:24 ` Vladimir Oltean
  2021-10-25 22:24 ` [RFC PATCH net-next 09/15] net: bridge: rename fdb_notify to br_fdb_notify Vladimir Oltean
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

In preparation for a future change where fdb_add_entry() will do some of
its work without &br->hash_lock held, make the noisy change of taking
that lock inside the function as a separate patch.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_fdb.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 9c57040d8341..421b8960945a 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -893,19 +893,27 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
 			return -EINVAL;
 	}
 
+	spin_lock_bh(&br->hash_lock);
+
 	fdb = br_fdb_find(br, addr, vid);
 	if (fdb == NULL) {
-		if (!(flags & NLM_F_CREATE))
+		if (!(flags & NLM_F_CREATE)) {
+			spin_unlock_bh(&br->hash_lock);
 			return -ENOENT;
+		}
 
 		fdb = fdb_create(br, source, addr, vid, 0);
-		if (!fdb)
+		if (!fdb) {
+			spin_unlock_bh(&br->hash_lock);
 			return -ENOMEM;
+		}
 
 		modified = true;
 	} else {
-		if (flags & NLM_F_EXCL)
+		if (flags & NLM_F_EXCL) {
+			spin_unlock_bh(&br->hash_lock);
 			return -EEXIST;
+		}
 
 		if (READ_ONCE(fdb->dst) != source) {
 			WRITE_ONCE(fdb->dst, source);
@@ -948,6 +956,8 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
 		fdb_notify(br, fdb, RTM_NEWNEIGH, true);
 	}
 
+	spin_unlock_bh(&br->hash_lock);
+
 	return 0;
 }
 
@@ -980,9 +990,7 @@ static int __br_fdb_add(struct ndmsg *ndm, struct net_bridge *br,
 		}
 		err = br_fdb_external_learn_add(br, p, addr, vid, true);
 	} else {
-		spin_lock_bh(&br->hash_lock);
 		err = fdb_add_entry(br, p, addr, ndm, nlh_flags, vid, nfea_tb);
-		spin_unlock_bh(&br->hash_lock);
 	}
 
 	return err;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH net-next 09/15] net: bridge: rename fdb_notify to br_fdb_notify
  2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
                   ` (7 preceding siblings ...)
  2021-10-25 22:24 ` [RFC PATCH net-next 08/15] net: bridge: take the hash_lock inside fdb_add_entry Vladimir Oltean
@ 2021-10-25 22:24 ` Vladimir Oltean
  2021-10-25 22:24 ` [RFC PATCH net-next 10/15] net: switchdev: merge switchdev_handle_fdb_{add,del}_to_device Vladimir Oltean
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

This change creates a wrapper over fdb_notify() called br_fdb_notify(),
with the same behavior.

A future change will introduce another variant, br_fdb_notify_async(),
that actually cares about the deferred return code of
br_switchdev_fdb_notify().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_fdb.c | 32 +++++++++++++++++++-------------
 1 file changed, 19 insertions(+), 13 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 421b8960945a..2095bdf24e42 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -178,16 +178,12 @@ static inline size_t fdb_nlmsg_size(void)
 }
 
 static void fdb_notify(struct net_bridge *br,
-		       const struct net_bridge_fdb_entry *fdb, int type,
-		       bool swdev_notify)
+		       const struct net_bridge_fdb_entry *fdb, int type)
 {
 	struct net *net = dev_net(br->dev);
 	struct sk_buff *skb;
 	int err = -ENOBUFS;
 
-	if (swdev_notify)
-		br_switchdev_fdb_notify(br, fdb, type);
-
 	skb = nlmsg_new(fdb_nlmsg_size(), GFP_ATOMIC);
 	if (skb == NULL)
 		goto errout;
@@ -205,6 +201,16 @@ static void fdb_notify(struct net_bridge *br,
 	rtnl_set_sk_err(net, RTNLGRP_NEIGH, err);
 }
 
+static void br_fdb_notify(struct net_bridge *br,
+			  const struct net_bridge_fdb_entry *fdb, int type,
+			  bool swdev_notify)
+{
+	if (swdev_notify)
+		br_switchdev_fdb_notify(br, fdb, type);
+
+	fdb_notify(br, fdb, type);
+}
+
 static struct net_bridge_fdb_entry *fdb_find_rcu(struct rhashtable *tbl,
 						 const unsigned char *addr,
 						 __u16 vid)
@@ -322,7 +328,7 @@ static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f,
 	hlist_del_init_rcu(&f->fdb_node);
 	rhashtable_remove_fast(&br->fdb_hash_tbl, &f->rhnode,
 			       br_fdb_rht_params);
-	fdb_notify(br, f, RTM_DELNEIGH, swdev_notify);
+	br_fdb_notify(br, f, RTM_DELNEIGH, swdev_notify);
 	call_rcu(&f->rcu, fdb_rcu_free);
 }
 
@@ -428,7 +434,7 @@ static int fdb_add_local(struct net_bridge *br, struct net_bridge_port *source,
 		return -ENOMEM;
 
 	fdb_add_hw_addr(br, addr);
-	fdb_notify(br, fdb, RTM_NEWNEIGH, true);
+	br_fdb_notify(br, fdb, RTM_NEWNEIGH, true);
 	return 0;
 }
 
@@ -534,7 +540,7 @@ void br_fdb_cleanup(struct work_struct *work)
 							 this_timer - now);
 				else if (!test_and_set_bit(BR_FDB_NOTIFY_INACTIVE,
 							   &f->flags))
-					fdb_notify(br, f, RTM_NEWNEIGH, false);
+					br_fdb_notify(br, f, RTM_NEWNEIGH, false);
 			}
 			continue;
 		}
@@ -739,7 +745,7 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 				set_bit(BR_FDB_ADDED_BY_USER, &fdb->flags);
 			if (unlikely(fdb_modified)) {
 				trace_br_fdb_update(br, source, addr, vid, flags);
-				fdb_notify(br, fdb, RTM_NEWNEIGH, true);
+				br_fdb_notify(br, fdb, RTM_NEWNEIGH, true);
 			}
 		}
 	} else {
@@ -747,7 +753,7 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 		fdb = fdb_create(br, source, addr, vid, flags);
 		if (fdb) {
 			trace_br_fdb_update(br, source, addr, vid, flags);
-			fdb_notify(br, fdb, RTM_NEWNEIGH, true);
+			br_fdb_notify(br, fdb, RTM_NEWNEIGH, true);
 		}
 		/* else  we lose race and someone else inserts
 		 * it first, don't bother updating
@@ -953,7 +959,7 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
 	if (modified) {
 		if (refresh)
 			fdb->updated = jiffies;
-		fdb_notify(br, fdb, RTM_NEWNEIGH, true);
+		br_fdb_notify(br, fdb, RTM_NEWNEIGH, true);
 	}
 
 	spin_unlock_bh(&br->hash_lock);
@@ -1240,7 +1246,7 @@ int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
 			err = -ENOMEM;
 			goto err_unlock;
 		}
-		fdb_notify(br, fdb, RTM_NEWNEIGH, swdev_notify);
+		br_fdb_notify(br, fdb, RTM_NEWNEIGH, swdev_notify);
 	} else {
 		fdb->updated = jiffies;
 
@@ -1265,7 +1271,7 @@ int br_fdb_external_learn_add(struct net_bridge *br, struct net_bridge_port *p,
 			set_bit(BR_FDB_LOCAL, &fdb->flags);
 
 		if (modified)
-			fdb_notify(br, fdb, RTM_NEWNEIGH, swdev_notify);
+			br_fdb_notify(br, fdb, RTM_NEWNEIGH, swdev_notify);
 	}
 
 err_unlock:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH net-next 10/15] net: switchdev: merge switchdev_handle_fdb_{add,del}_to_device
  2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
                   ` (8 preceding siblings ...)
  2021-10-25 22:24 ` [RFC PATCH net-next 09/15] net: bridge: rename fdb_notify to br_fdb_notify Vladimir Oltean
@ 2021-10-25 22:24 ` Vladimir Oltean
  2021-10-25 22:24 ` [RFC PATCH net-next 11/15] net: bridge: make fdb_add_entry() wait for switchdev feedback Vladimir Oltean
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

To reduce code churn, the same patch makes multiple changes, since they
all touch the same lines:

1. The implementations for these two are identical, just with different
   function pointers. Reduce the duplication and rename the function
   pointers from "add_cb"/"del_cb" to "mod_cb". Pass the event as an
   argument.

2. Drop the "const" attribute from "orig_dev". If the driver needs to
   check whether orig_dev belongs to itself and then
   call_switchdev_notifiers(orig_dev, SWITCHDEV_FDB_OFFLOADED), it
   can't, because call_switchdev_notifiers takes a non-const struct
   net_device *.
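
As an illustration of what the merged helper looks like from a driver's
perspective, here is a rough sketch (hypothetical foo_* names, not code
from this series). The single "mod_cb" switches on the event it is passed:

    static int foo_port_fdb_event(struct net_device *dev,
                                  struct net_device *orig_dev,
                                  unsigned long event, const void *ctx,
                                  const struct switchdev_notifier_fdb_info *fdb_info)
    {
            /* One callback now serves both the add and the del direction */
            switch (event) {
            case SWITCHDEV_FDB_ADD_TO_DEVICE:
                    return foo_port_fdb_add(dev, fdb_info->addr, fdb_info->vid);
            case SWITCHDEV_FDB_DEL_TO_DEVICE:
                    return foo_port_fdb_del(dev, fdb_info->addr, fdb_info->vid);
            default:
                    return -EOPNOTSUPP;
            }
    }

and in the driver's switchdev notifier handler, the two events collapse
into a single call:

    case SWITCHDEV_FDB_ADD_TO_DEVICE:
    case SWITCHDEV_FDB_DEL_TO_DEVICE:
            err = switchdev_handle_fdb_event_to_device(dev, event, ptr,
                                                       foo_dev_check,
                                                       foo_foreign_dev_check,
                                                       foo_port_fdb_event,
                                                       NULL);
            return notifier_from_errno(err);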

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/switchdev.h   |  48 +++---------
 net/dsa/slave.c           |  41 ++--------
 net/switchdev/switchdev.c | 156 ++++++--------------------------------
 3 files changed, 43 insertions(+), 202 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 6764fb7692e2..559f63abc15b 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -299,28 +299,16 @@ void switchdev_port_fwd_mark_set(struct net_device *dev,
 				 struct net_device *group_dev,
 				 bool joining);
 
-int switchdev_handle_fdb_add_to_device(struct net_device *dev,
+int switchdev_handle_fdb_event_to_device(struct net_device *dev, unsigned long event,
 		const struct switchdev_notifier_fdb_info *fdb_info,
 		bool (*check_cb)(const struct net_device *dev),
 		bool (*foreign_dev_check_cb)(const struct net_device *dev,
 					     const struct net_device *foreign_dev),
-		int (*add_cb)(struct net_device *dev,
-			      const struct net_device *orig_dev, const void *ctx,
+		int (*mod_cb)(struct net_device *dev, struct net_device *orig_dev,
+			      unsigned long event, const void *ctx,
 			      const struct switchdev_notifier_fdb_info *fdb_info),
-		int (*lag_add_cb)(struct net_device *dev,
-				  const struct net_device *orig_dev, const void *ctx,
-				  const struct switchdev_notifier_fdb_info *fdb_info));
-
-int switchdev_handle_fdb_del_to_device(struct net_device *dev,
-		const struct switchdev_notifier_fdb_info *fdb_info,
-		bool (*check_cb)(const struct net_device *dev),
-		bool (*foreign_dev_check_cb)(const struct net_device *dev,
-					     const struct net_device *foreign_dev),
-		int (*del_cb)(struct net_device *dev,
-			      const struct net_device *orig_dev, const void *ctx,
-			      const struct switchdev_notifier_fdb_info *fdb_info),
-		int (*lag_del_cb)(struct net_device *dev,
-				  const struct net_device *orig_dev, const void *ctx,
+		int (*lag_mod_cb)(struct net_device *dev, struct net_device *orig_dev,
+				  unsigned long event, const void *ctx,
 				  const struct switchdev_notifier_fdb_info *fdb_info));
 
 int switchdev_handle_port_obj_add(struct net_device *dev,
@@ -426,32 +414,16 @@ call_switchdev_blocking_notifiers(unsigned long val,
 }
 
 static inline int
-switchdev_handle_fdb_add_to_device(struct net_device *dev,
-		const struct switchdev_notifier_fdb_info *fdb_info,
-		bool (*check_cb)(const struct net_device *dev),
-		bool (*foreign_dev_check_cb)(const struct net_device *dev,
-					     const struct net_device *foreign_dev),
-		int (*add_cb)(struct net_device *dev,
-			      const struct net_device *orig_dev, const void *ctx,
-			      const struct switchdev_notifier_fdb_info *fdb_info),
-		int (*lag_add_cb)(struct net_device *dev,
-				  const struct net_device *orig_dev, const void *ctx,
-				  const struct switchdev_notifier_fdb_info *fdb_info))
-{
-	return 0;
-}
-
-static inline int
-switchdev_handle_fdb_del_to_device(struct net_device *dev,
+switchdev_handle_fdb_event_to_device(struct net_device *dev, unsigned long event,
 		const struct switchdev_notifier_fdb_info *fdb_info,
 		bool (*check_cb)(const struct net_device *dev),
 		bool (*foreign_dev_check_cb)(const struct net_device *dev,
 					     const struct net_device *foreign_dev),
-		int (*del_cb)(struct net_device *dev,
-			      const struct net_device *orig_dev, const void *ctx,
+		int (*mod_cb)(struct net_device *dev, struct net_device *orig_dev,
+			      unsigned long event, const void *ctx,
 			      const struct switchdev_notifier_fdb_info *fdb_info),
-		int (*lag_del_cb)(struct net_device *dev,
-				  const struct net_device *orig_dev, const void *ctx,
+		int (*lag_mod_cb)(struct net_device *dev, struct net_device *orig_dev,
+				  unsigned long event, const void *ctx,
 				  const struct switchdev_notifier_fdb_info *fdb_info))
 {
 	return 0;
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 54d53e18a211..af573d16dff5 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2469,10 +2469,9 @@ static bool dsa_foreign_dev_check(const struct net_device *dev,
 }
 
 static int dsa_slave_fdb_event(struct net_device *dev,
-			       const struct net_device *orig_dev,
-			       const void *ctx,
-			       const struct switchdev_notifier_fdb_info *fdb_info,
-			       unsigned long event)
+			       struct net_device *orig_dev,
+			       unsigned long event, const void *ctx,
+			       const struct switchdev_notifier_fdb_info *fdb_info)
 {
 	struct dsa_switchdev_event_work *switchdev_work;
 	struct dsa_port *dp = dsa_slave_to_port(dev);
@@ -2528,24 +2527,6 @@ static int dsa_slave_fdb_event(struct net_device *dev,
 	return 0;
 }
 
-static int
-dsa_slave_fdb_add_to_device(struct net_device *dev,
-			    const struct net_device *orig_dev, const void *ctx,
-			    const struct switchdev_notifier_fdb_info *fdb_info)
-{
-	return dsa_slave_fdb_event(dev, orig_dev, ctx, fdb_info,
-				   SWITCHDEV_FDB_ADD_TO_DEVICE);
-}
-
-static int
-dsa_slave_fdb_del_to_device(struct net_device *dev,
-			    const struct net_device *orig_dev, const void *ctx,
-			    const struct switchdev_notifier_fdb_info *fdb_info)
-{
-	return dsa_slave_fdb_event(dev, orig_dev, ctx, fdb_info,
-				   SWITCHDEV_FDB_DEL_TO_DEVICE);
-}
-
 /* Called under rcu_read_lock() */
 static int dsa_slave_switchdev_event(struct notifier_block *unused,
 				     unsigned long event, void *ptr)
@@ -2560,18 +2541,12 @@ static int dsa_slave_switchdev_event(struct notifier_block *unused,
 						     dsa_slave_port_attr_set);
 		return notifier_from_errno(err);
 	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-		err = switchdev_handle_fdb_add_to_device(dev, ptr,
-							 dsa_slave_dev_check,
-							 dsa_foreign_dev_check,
-							 dsa_slave_fdb_add_to_device,
-							 NULL);
-		return notifier_from_errno(err);
 	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		err = switchdev_handle_fdb_del_to_device(dev, ptr,
-							 dsa_slave_dev_check,
-							 dsa_foreign_dev_check,
-							 dsa_slave_fdb_del_to_device,
-							 NULL);
+		err = switchdev_handle_fdb_event_to_device(dev, event, ptr,
+							   dsa_slave_dev_check,
+							   dsa_foreign_dev_check,
+							   dsa_slave_fdb_event,
+							   NULL);
 		return notifier_from_errno(err);
 	default:
 		return NOTIFY_DONE;
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 0b2c18efc079..83460470e883 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -428,17 +428,17 @@ switchdev_lower_dev_find(struct net_device *dev,
 	return switchdev_priv.lower_dev;
 }
 
-static int __switchdev_handle_fdb_add_to_device(struct net_device *dev,
-		const struct net_device *orig_dev,
+static int __switchdev_handle_fdb_event_to_device(struct net_device *dev,
+		struct net_device *orig_dev, unsigned long event,
 		const struct switchdev_notifier_fdb_info *fdb_info,
 		bool (*check_cb)(const struct net_device *dev),
 		bool (*foreign_dev_check_cb)(const struct net_device *dev,
 					     const struct net_device *foreign_dev),
-		int (*add_cb)(struct net_device *dev,
-			      const struct net_device *orig_dev, const void *ctx,
+		int (*mod_cb)(struct net_device *dev, struct net_device *orig_dev,
+			      unsigned long event, const void *ctx,
 			      const struct switchdev_notifier_fdb_info *fdb_info),
-		int (*lag_add_cb)(struct net_device *dev,
-				  const struct net_device *orig_dev, const void *ctx,
+		int (*lag_mod_cb)(struct net_device *dev, struct net_device *orig_dev,
+				  unsigned long event, const void *ctx,
 				  const struct switchdev_notifier_fdb_info *fdb_info))
 {
 	const struct switchdev_notifier_info *info = &fdb_info->info;
@@ -447,17 +447,17 @@ static int __switchdev_handle_fdb_add_to_device(struct net_device *dev,
 	int err = -EOPNOTSUPP;
 
 	if (check_cb(dev))
-		return add_cb(dev, orig_dev, info->ctx, fdb_info);
+		return mod_cb(dev, orig_dev, event, info->ctx, fdb_info);
 
 	if (netif_is_lag_master(dev)) {
 		if (!switchdev_lower_dev_find(dev, check_cb, foreign_dev_check_cb))
 			goto maybe_bridged_with_us;
 
 		/* This is a LAG interface that we offload */
-		if (!lag_add_cb)
+		if (!lag_mod_cb)
 			return -EOPNOTSUPP;
 
-		return lag_add_cb(dev, orig_dev, info->ctx, fdb_info);
+		return lag_mod_cb(dev, orig_dev, event, info->ctx, fdb_info);
 	}
 
 	/* Recurse through lower interfaces in case the FDB entry is pointing
@@ -481,10 +481,10 @@ static int __switchdev_handle_fdb_add_to_device(struct net_device *dev,
 						      foreign_dev_check_cb))
 				continue;
 
-			err = __switchdev_handle_fdb_add_to_device(lower_dev, orig_dev,
-								   fdb_info, check_cb,
-								   foreign_dev_check_cb,
-								   add_cb, lag_add_cb);
+			err = __switchdev_handle_fdb_event_to_device(lower_dev, orig_dev,
+								     event, fdb_info, check_cb,
+								     foreign_dev_check_cb,
+								     mod_cb, lag_mod_cb);
 			if (err && err != -EOPNOTSUPP)
 				return err;
 		}
@@ -503,140 +503,34 @@ static int __switchdev_handle_fdb_add_to_device(struct net_device *dev,
 	if (!switchdev_lower_dev_find(br, check_cb, foreign_dev_check_cb))
 		return 0;
 
-	return __switchdev_handle_fdb_add_to_device(br, orig_dev, fdb_info,
-						    check_cb, foreign_dev_check_cb,
-						    add_cb, lag_add_cb);
+	return __switchdev_handle_fdb_event_to_device(br, orig_dev, event, fdb_info,
+						      check_cb, foreign_dev_check_cb,
+						      mod_cb, lag_mod_cb);
 }
 
-int switchdev_handle_fdb_add_to_device(struct net_device *dev,
+int switchdev_handle_fdb_event_to_device(struct net_device *dev, unsigned long event,
 		const struct switchdev_notifier_fdb_info *fdb_info,
 		bool (*check_cb)(const struct net_device *dev),
 		bool (*foreign_dev_check_cb)(const struct net_device *dev,
 					     const struct net_device *foreign_dev),
-		int (*add_cb)(struct net_device *dev,
-			      const struct net_device *orig_dev, const void *ctx,
+		int (*mod_cb)(struct net_device *dev, struct net_device *orig_dev,
+			      unsigned long event, const void *ctx,
 			      const struct switchdev_notifier_fdb_info *fdb_info),
-		int (*lag_add_cb)(struct net_device *dev,
-				  const struct net_device *orig_dev, const void *ctx,
+		int (*lag_mod_cb)(struct net_device *dev, struct net_device *orig_dev,
+				  unsigned long event, const void *ctx,
 				  const struct switchdev_notifier_fdb_info *fdb_info))
 {
 	int err;
 
-	err = __switchdev_handle_fdb_add_to_device(dev, dev, fdb_info,
-						   check_cb,
-						   foreign_dev_check_cb,
-						   add_cb, lag_add_cb);
+	err = __switchdev_handle_fdb_event_to_device(dev, dev, event, fdb_info,
+						     check_cb, foreign_dev_check_cb,
+						     mod_cb, lag_mod_cb);
 	if (err == -EOPNOTSUPP)
 		err = 0;
 
 	return err;
 }
-EXPORT_SYMBOL_GPL(switchdev_handle_fdb_add_to_device);
-
-static int __switchdev_handle_fdb_del_to_device(struct net_device *dev,
-		const struct net_device *orig_dev,
-		const struct switchdev_notifier_fdb_info *fdb_info,
-		bool (*check_cb)(const struct net_device *dev),
-		bool (*foreign_dev_check_cb)(const struct net_device *dev,
-					     const struct net_device *foreign_dev),
-		int (*del_cb)(struct net_device *dev,
-			      const struct net_device *orig_dev, const void *ctx,
-			      const struct switchdev_notifier_fdb_info *fdb_info),
-		int (*lag_del_cb)(struct net_device *dev,
-				  const struct net_device *orig_dev, const void *ctx,
-				  const struct switchdev_notifier_fdb_info *fdb_info))
-{
-	const struct switchdev_notifier_info *info = &fdb_info->info;
-	struct net_device *br, *lower_dev;
-	struct list_head *iter;
-	int err = -EOPNOTSUPP;
-
-	if (check_cb(dev))
-		return del_cb(dev, orig_dev, info->ctx, fdb_info);
-
-	if (netif_is_lag_master(dev)) {
-		if (!switchdev_lower_dev_find(dev, check_cb, foreign_dev_check_cb))
-			goto maybe_bridged_with_us;
-
-		/* This is a LAG interface that we offload */
-		if (!lag_del_cb)
-			return -EOPNOTSUPP;
-
-		return lag_del_cb(dev, orig_dev, info->ctx, fdb_info);
-	}
-
-	/* Recurse through lower interfaces in case the FDB entry is pointing
-	 * towards a bridge device.
-	 */
-	if (netif_is_bridge_master(dev)) {
-		if (!switchdev_lower_dev_find(dev, check_cb, foreign_dev_check_cb))
-			return 0;
-
-		/* This is a bridge interface that we offload */
-		netdev_for_each_lower_dev(dev, lower_dev, iter) {
-			/* Do not propagate FDB entries across bridges */
-			if (netif_is_bridge_master(lower_dev))
-				continue;
-
-			/* Bridge ports might be either us, or LAG interfaces
-			 * that we offload.
-			 */
-			if (!check_cb(lower_dev) &&
-			    !switchdev_lower_dev_find(lower_dev, check_cb,
-						      foreign_dev_check_cb))
-				continue;
-
-			err = __switchdev_handle_fdb_del_to_device(lower_dev, orig_dev,
-								   fdb_info, check_cb,
-								   foreign_dev_check_cb,
-								   del_cb, lag_del_cb);
-			if (err && err != -EOPNOTSUPP)
-				return err;
-		}
-
-		return 0;
-	}
-
-maybe_bridged_with_us:
-	/* Event is neither on a bridge nor a LAG. Check whether it is on an
-	 * interface that is in a bridge with us.
-	 */
-	br = netdev_master_upper_dev_get_rcu(dev);
-	if (!br || !netif_is_bridge_master(br))
-		return 0;
-
-	if (!switchdev_lower_dev_find(br, check_cb, foreign_dev_check_cb))
-		return 0;
-
-	return __switchdev_handle_fdb_del_to_device(br, orig_dev, fdb_info,
-						    check_cb, foreign_dev_check_cb,
-						    del_cb, lag_del_cb);
-}
-
-int switchdev_handle_fdb_del_to_device(struct net_device *dev,
-		const struct switchdev_notifier_fdb_info *fdb_info,
-		bool (*check_cb)(const struct net_device *dev),
-		bool (*foreign_dev_check_cb)(const struct net_device *dev,
-					     const struct net_device *foreign_dev),
-		int (*del_cb)(struct net_device *dev,
-			      const struct net_device *orig_dev, const void *ctx,
-			      const struct switchdev_notifier_fdb_info *fdb_info),
-		int (*lag_del_cb)(struct net_device *dev,
-				  const struct net_device *orig_dev, const void *ctx,
-				  const struct switchdev_notifier_fdb_info *fdb_info))
-{
-	int err;
-
-	err = __switchdev_handle_fdb_del_to_device(dev, dev, fdb_info,
-						   check_cb,
-						   foreign_dev_check_cb,
-						   del_cb, lag_del_cb);
-	if (err == -EOPNOTSUPP)
-		err = 0;
-
-	return err;
-}
-EXPORT_SYMBOL_GPL(switchdev_handle_fdb_del_to_device);
+EXPORT_SYMBOL_GPL(switchdev_handle_fdb_event_to_device);
 
 static int __switchdev_handle_port_obj_add(struct net_device *dev,
 			struct switchdev_notifier_port_obj_info *port_obj_info,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH net-next 11/15] net: bridge: make fdb_add_entry() wait for switchdev feedback
  2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
                   ` (9 preceding siblings ...)
  2021-10-25 22:24 ` [RFC PATCH net-next 10/15] net: switchdev: merge switchdev_handle_fdb_{add,del}_to_device Vladimir Oltean
@ 2021-10-25 22:24 ` Vladimir Oltean
  2021-10-25 22:24 ` [RFC PATCH net-next 12/15] net: rtnetlink: pass extack to .ndo_fdb_del Vladimir Oltean
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

At the moment, switchdev gets notified of FDB entries via the
br_switchdev_fdb_notify() function, which is called from fdb_notify().
In turn, fdb_notify is called from a wide variety of places, mixing data
path learning, STP timers, sysfs handlers, netlink IFLA_BRPORT_FLUSH
handlers, RTM_NEWNEIGH/RTM_DELNEIGH handlers, FDB garbage collection
timers, and others.

The common denominator is that FDB entries are created and added to the
bridge hash table under the br->hash_lock. And so is the switchdev
notification.

Because SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events are notified on the
switchdev atomic notifier chain and all drivers need sleepable context
to offload those FDB entries, the current situation is that all drivers
register a private workqueue on which they schedule work items from the
atomic switchdev notifier chain. The practical implication is that they
can't return error codes and extack messages from their private
workqueue, and even if they could (or if they would return an error
directly from the atomic notifier), the bridge would still ignore those.

We're not structurally changing anything, because there are reasons why
things are the way they are (the theoretical possibility of better
performance, basically). Instead, this patch just adds a mechanism, based
on a completion structure, by which the bridge (and therefore user space)
can wait for the driver's deferred work item to finish and retrieve its
error code.

It works as follows:

- br_switchdev_fdb_notify() behaves as before, but we introduce a new
  br_switchdev_fdb_notify_async() which carries some extra bridge-provided
  context. The functions which don't care what switchdev has to say keep
  calling br_switchdev_fdb_notify(); the others (for now just
  fdb_add_entry(), the function that processes RTM_NEWNEIGH) call
  br_switchdev_fdb_notify_async().

- every function that calls br_switchdev_fdb_notify_async() must declare
  a struct br_switchdev_fdb_wait_ctx on the stack. This is the storage for
  the completion structure. Then br_switchdev_fdb_notify_async() will
  create a special struct switchdev_notifier_fdb_info that contains some
  function pointers that wake up the bridge process waiting for this
  completion.

- every function that calls br_switchdev_fdb_notify_async() under
  br->hash_lock must release this lock before it can sleep waiting for
  the completion; then it has to take the lock again and look up the
  FDB entry once more.

- in the case of fdb_add_entry(), we have nothing to do if switchdev was
  happy; otherwise we need to take the hash_lock again and delete the
  FDB entry we've just created. We may not find the FDB entry we were
  trying to delete, due to factors such as an ultra-short ageing time.
  In those cases we do nothing. The rollback logic for fdb_add_entry()
  is copied from fdb_delete_by_addr_and_port(), except that here we don't
  notify switchdev, since we know it doesn't contain that entry and
  doesn't want it. I've renamed the existing fdb_delete_by_addr_and_port()
  function to fdb_delete_by_addr_and_port_switchdev().

The API exposed to switchdev drivers is comprised of two functions:

* switchdev_fdb_mark_pending() must be called from atomic context, to
  tell the bridge to wait
* switchdev_fdb_mark_done() can be called from any context; it tells
  the bridge to stop waiting

When nobody has called switchdev_fdb_mark_pending(), the bridge doesn't
wait. There is no race condition here, because the atomic notifiers have
finished by the time br_switchdev_fdb_notify_async() finishes.
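
To make the intended driver-side usage concrete, here is a rough sketch of
how an atomic notifier handler and its deferred work item would cooperate
with this mechanism. The foo_* names and the work struct layout are made
up for illustration (modeled on the existing drivers); this is not the
actual conversion done later in the series:

    /* Atomic context: the switchdev notifier handler */
    case SWITCHDEV_FDB_ADD_TO_DEVICE:
    case SWITCHDEV_FDB_DEL_TO_DEVICE:
            /* The copy carries the bridge's pending_fn/done_fn/cookie along */
            memcpy(&switchdev_work->fdb_info, fdb_info,
                   sizeof(switchdev_work->fdb_info));
            /* Tell the bridge that a deferred answer will follow */
            switchdev_fdb_mark_pending(fdb_info);
            dev_hold(dev);
            queue_work(foo_owq, &switchdev_work->work);
            break;

    /* Sleepable context: the deferred work item */
    static void foo_switchdev_event_work(struct work_struct *work)
    {
            struct foo_switchdev_event_work *switchdev_work =
                    container_of(work, struct foo_switchdev_event_work, work);
            int err;

            rtnl_lock();
            err = foo_port_fdb_program(switchdev_work);
            rtnl_unlock();

            /* Wake up the waiting bridge and report the outcome */
            switchdev_fdb_mark_done(&switchdev_work->fdb_info, err);

            dev_put(switchdev_work->dev);
            kfree(switchdev_work);
    }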

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/switchdev.h   |  17 +++++
 net/bridge/br_fdb.c       | 138 +++++++++++++++++++++++++++++++++++---
 net/bridge/br_private.h   |  16 +++++
 net/bridge/br_switchdev.c |  34 ++++++++--
 4 files changed, 190 insertions(+), 15 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 559f63abc15b..67f7b22e5332 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -218,6 +218,9 @@ struct switchdev_notifier_info {
 
 struct switchdev_notifier_fdb_info {
 	struct switchdev_notifier_info info; /* must be first */
+	void (*pending_fn)(unsigned long cookie);
+	void (*done_fn)(unsigned long cookie, int err);
+	unsigned long cookie;
 	unsigned char addr[ETH_ALEN];
 	u16 vid;
 	u8 added_by_user:1,
@@ -260,6 +263,20 @@ switchdev_fdb_is_dynamically_learned(const struct switchdev_notifier_fdb_info *f
 	return !fdb_info->added_by_user && !fdb_info->is_local;
 }
 
+static inline void
+switchdev_fdb_mark_pending(const struct switchdev_notifier_fdb_info *fdb_info)
+{
+	if (fdb_info->pending_fn)
+		fdb_info->pending_fn(fdb_info->cookie);
+}
+
+static inline void
+switchdev_fdb_mark_done(const struct switchdev_notifier_fdb_info *fdb_info, int err)
+{
+	if (fdb_info->done_fn)
+		fdb_info->done_fn(fdb_info->cookie, err);
+}
+
 #ifdef CONFIG_NET_SWITCHDEV
 
 int switchdev_bridge_port_offload(struct net_device *brport_dev,
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 2095bdf24e42..e8afe64dadcc 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -201,6 +201,83 @@ static void fdb_notify(struct net_bridge *br,
 	rtnl_set_sk_err(net, RTNLGRP_NEIGH, err);
 }
 
+struct br_switchdev_fdb_wait_ctx {
+	/* Serializes switchdev driver calls to switchdev_fdb_mark_done */
+	struct mutex lock;
+	struct completion done;
+	bool pending;
+	int pending_count;
+	int err;
+};
+
+static void
+br_switchdev_fdb_wait_ctx_init(struct br_switchdev_fdb_wait_ctx *wait_ctx)
+{
+	wait_ctx->pending = false;
+	wait_ctx->pending_count = 0;
+	wait_ctx->err = 0;
+	mutex_init(&wait_ctx->lock);
+	init_completion(&wait_ctx->done);
+}
+
+static void br_switchdev_fdb_pending(unsigned long cookie)
+{
+	struct br_switchdev_fdb_wait_ctx *wait_ctx;
+
+	wait_ctx = (struct br_switchdev_fdb_wait_ctx *)cookie;
+
+	wait_ctx->pending = true;
+	/* Drivers serialize on the switchdev atomic notifier chain when they
+	 * call switchdev_fdb_mark_pending, so no locking is necessary.
+	 */
+	wait_ctx->pending_count++;
+}
+
+static void br_switchdev_fdb_done(unsigned long cookie, int err)
+{
+	struct br_switchdev_fdb_wait_ctx *wait_ctx;
+	bool done;
+
+	wait_ctx = (struct br_switchdev_fdb_wait_ctx *)cookie;
+
+	/* Potentially multiple drivers might call switchdev_fdb_mark_done,
+	 * each from its own deferred context. So we need to serialize here.
+	 */
+	mutex_lock(&wait_ctx->lock);
+
+	/* Do not overwrite errors with success stories. This preserves the
+	 * last non-zero error code, which may or may not coincide with the
+	 * last extack.
+	 */
+	if (err)
+		wait_ctx->err = err;
+
+	wait_ctx->pending_count--;
+	done = wait_ctx->pending_count == 0;
+
+	mutex_unlock(&wait_ctx->lock);
+
+	if (done)
+		complete(&wait_ctx->done);
+}
+
+static int br_switchdev_fdb_wait(struct br_switchdev_fdb_wait_ctx *wait_ctx)
+{
+	/* If the pending flag isn't set, there's nothing to wait for
+	 * (switchdev not compiled, no driver interested, driver with the
+	 * legacy silent behavior, etc).
+	 * We need a dedicated pending flag as opposed to looking at the
+	 * pending_count, because we'd need a lock to look at the
+	 * pending_count (it's decremented concurrently with us), whereas we
+	 * need no locking to look at the pending flag: it was set (if it was)
+	 * during the atomic notifier call.
+	 */
+	if (wait_ctx->pending)
+		wait_for_completion(&wait_ctx->done);
+
+	return wait_ctx->err;
+}
+
 static void br_fdb_notify(struct net_bridge *br,
 			  const struct net_bridge_fdb_entry *fdb, int type,
 			  bool swdev_notify)
@@ -211,6 +288,18 @@ static void br_fdb_notify(struct net_bridge *br,
 	fdb_notify(br, fdb, type);
 }
 
+static void br_fdb_notify_async(struct net_bridge *br,
+				const struct net_bridge_fdb_entry *fdb,
+				int type, struct netlink_ext_ack *extack,
+				struct br_switchdev_fdb_wait_ctx *wait_ctx)
+{
+	br_switchdev_fdb_notify_async(br, fdb, type, br_switchdev_fdb_pending,
+				      br_switchdev_fdb_done,
+				      (unsigned long)wait_ctx, extack);
+
+	fdb_notify(br, fdb, type);
+}
+
 static struct net_bridge_fdb_entry *fdb_find_rcu(struct rhashtable *tbl,
 						 const unsigned char *addr,
 						 __u16 vid)
@@ -841,6 +930,26 @@ int br_fdb_get(struct sk_buff *skb,
 	return err;
 }
 
+/* Delete an FDB entry and don't notify switchdev */
+static void fdb_delete_by_addr_and_port(struct net_bridge *br,
+					const struct net_bridge_port *p,
+					const u8 *addr, u16 vlan)
+{
+	struct net_bridge_fdb_entry *fdb;
+
+	spin_lock_bh(&br->hash_lock);
+
+	fdb = br_fdb_find(br, addr, vlan);
+	if (!fdb || READ_ONCE(fdb->dst) != p) {
+		spin_unlock_bh(&br->hash_lock);
+		return;
+	}
+
+	fdb_delete(br, fdb, false);
+
+	spin_unlock_bh(&br->hash_lock);
+}
+
 /* returns true if the fdb is modified */
 static bool fdb_handle_notify(struct net_bridge_fdb_entry *fdb, u8 notify)
 {
@@ -868,14 +977,16 @@ static bool fdb_handle_notify(struct net_bridge_fdb_entry *fdb, u8 notify)
 /* Update (create or replace) forwarding database entry */
 static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
 			 const u8 *addr, struct ndmsg *ndm, u16 flags, u16 vid,
-			 struct nlattr *nfea_tb[])
+			 struct nlattr *nfea_tb[], struct netlink_ext_ack *extack)
 {
 	bool is_sticky = !!(ndm->ndm_flags & NTF_STICKY);
 	bool refresh = !nfea_tb[NFEA_DONT_REFRESH];
+	struct br_switchdev_fdb_wait_ctx wait_ctx;
 	struct net_bridge_fdb_entry *fdb;
 	u16 state = ndm->ndm_state;
 	bool modified = false;
 	u8 notify = 0;
+	int err;
 
 	/* If the port cannot learn allow only local and static entries */
 	if (source && !(state & NUD_PERMANENT) && !(state & NUD_NOARP) &&
@@ -899,6 +1010,8 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
 			return -EINVAL;
 	}
 
+	br_switchdev_fdb_wait_ctx_init(&wait_ctx);
+
 	spin_lock_bh(&br->hash_lock);
 
 	fdb = br_fdb_find(br, addr, vid);
@@ -959,12 +1072,16 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
 	if (modified) {
 		if (refresh)
 			fdb->updated = jiffies;
-		br_fdb_notify(br, fdb, RTM_NEWNEIGH, true);
+		br_fdb_notify_async(br, fdb, RTM_NEWNEIGH, extack, &wait_ctx);
 	}
 
 	spin_unlock_bh(&br->hash_lock);
 
-	return 0;
+	err = br_switchdev_fdb_wait(&wait_ctx);
+	if (err)
+		fdb_delete_by_addr_and_port(br, source, addr, vid);
+
+	return err;
 }
 
 static int __br_fdb_add(struct ndmsg *ndm, struct net_bridge *br,
@@ -996,7 +1113,8 @@ static int __br_fdb_add(struct ndmsg *ndm, struct net_bridge *br,
 		}
 		err = br_fdb_external_learn_add(br, p, addr, vid, true);
 	} else {
-		err = fdb_add_entry(br, p, addr, ndm, nlh_flags, vid, nfea_tb);
+		err = fdb_add_entry(br, p, addr, ndm, nlh_flags, vid, nfea_tb,
+				    extack);
 	}
 
 	return err;
@@ -1090,9 +1208,13 @@ int br_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 	return err;
 }
 
-static int fdb_delete_by_addr_and_port(struct net_bridge *br,
-				       const struct net_bridge_port *p,
-				       const u8 *addr, u16 vlan)
+/* Delete an FDB entry and notify switchdev.
+ * Caller must hold &br->hash_lock.
+ */
+static int
+fdb_delete_by_addr_and_port_switchdev(struct net_bridge *br,
+				      const struct net_bridge_port *p,
+				      const u8 *addr, u16 vlan)
 {
 	struct net_bridge_fdb_entry *fdb;
 
@@ -1112,7 +1234,7 @@ static int __br_fdb_delete(struct net_bridge *br,
 	int err;
 
 	spin_lock_bh(&br->hash_lock);
-	err = fdb_delete_by_addr_and_port(br, p, addr, vid);
+	err = fdb_delete_by_addr_and_port_switchdev(br, p, addr, vid);
 	spin_unlock_bh(&br->hash_lock);
 
 	return err;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 3c9327628060..f5f7501dad7d 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -1987,6 +1987,12 @@ int br_switchdev_set_port_flag(struct net_bridge_port *p,
 			       struct netlink_ext_ack *extack);
 void br_switchdev_fdb_notify(struct net_bridge *br,
 			     const struct net_bridge_fdb_entry *fdb, int type);
+void br_switchdev_fdb_notify_async(struct net_bridge *br,
+				   const struct net_bridge_fdb_entry *fdb, int type,
+				   void (*pending_fn)(unsigned long cookie),
+				   void (*done_fn)(unsigned long cookie, int err),
+				   unsigned long cookie,
+				   struct netlink_ext_ack *extack);
 int br_switchdev_port_vlan_add(struct net_device *dev, u16 vid, u16 flags,
 			       struct netlink_ext_ack *extack);
 int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid);
@@ -2073,6 +2079,16 @@ br_switchdev_fdb_notify(struct net_bridge *br,
 {
 }
 
+static inline void
+br_switchdev_fdb_notify_async(struct net_bridge *br,
+			      const struct net_bridge_fdb_entry *fdb, int type,
+			      void (*pending_fn)(unsigned long cookie),
+			      void (*done_fn)(unsigned long cookie, int err),
+			      unsigned long cookie,
+			      struct netlink_ext_ack *extack)
+{
+}
+
 static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 {
 }
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index f58fb06ae641..6e3040f6f636 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -125,7 +125,10 @@ int br_switchdev_set_port_flag(struct net_bridge_port *p,
 static void br_switchdev_fdb_populate(struct net_bridge *br,
 				      struct switchdev_notifier_fdb_info *item,
 				      const struct net_bridge_fdb_entry *fdb,
-				      const void *ctx)
+				      const void *ctx,
+				      void (*pending_fn)(unsigned long cookie),
+				      void (*done_fn)(unsigned long cookie, int err),
+				      unsigned long cookie)
 {
 	const struct net_bridge_port *p = READ_ONCE(fdb->dst);
 
@@ -136,28 +139,44 @@ static void br_switchdev_fdb_populate(struct net_bridge *br,
 	item->is_local = test_bit(BR_FDB_LOCAL, &fdb->flags);
 	item->info.dev = (!p || item->is_local) ? br->dev : p->dev;
 	item->info.ctx = ctx;
+	item->pending_fn = pending_fn;
+	item->done_fn = done_fn;
+	item->cookie = cookie;
 }
 
 void
-br_switchdev_fdb_notify(struct net_bridge *br,
-			const struct net_bridge_fdb_entry *fdb, int type)
+br_switchdev_fdb_notify_async(struct net_bridge *br,
+			      const struct net_bridge_fdb_entry *fdb, int type,
+			      void (*pending_fn)(unsigned long cookie),
+			      void (*done_fn)(unsigned long cookie, int err),
+			      unsigned long cookie,
+			      struct netlink_ext_ack *extack)
 {
 	struct switchdev_notifier_fdb_info item;
 
-	br_switchdev_fdb_populate(br, &item, fdb, NULL);
+	br_switchdev_fdb_populate(br, &item, fdb, NULL, pending_fn,
+				  done_fn, cookie);
 
 	switch (type) {
 	case RTM_DELNEIGH:
 		call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE,
-					 item.info.dev, &item.info, NULL);
+					 item.info.dev, &item.info, extack);
 		break;
 	case RTM_NEWNEIGH:
 		call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE,
-					 item.info.dev, &item.info, NULL);
+					 item.info.dev, &item.info, extack);
 		break;
 	}
 }
 
+void
+br_switchdev_fdb_notify(struct net_bridge *br,
+			const struct net_bridge_fdb_entry *fdb, int type)
+{
+	br_switchdev_fdb_notify_async(br, fdb, type, NULL, NULL,
+				      (unsigned long)NULL, NULL);
+}
+
 int br_switchdev_port_vlan_add(struct net_device *dev, u16 vid, u16 flags,
 			       struct netlink_ext_ack *extack)
 {
@@ -287,7 +306,8 @@ static int br_fdb_replay_one(struct net_bridge *br, struct notifier_block *nb,
 	struct switchdev_notifier_fdb_info item;
 	int err;
 
-	br_switchdev_fdb_populate(br, &item, fdb, ctx);
+	br_switchdev_fdb_populate(br, &item, fdb, ctx, NULL,
+				  NULL, (unsigned long)NULL);
 
 	err = nb->notifier_call(nb, action, &item);
 	return notifier_to_errno(err);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH net-next 12/15] net: rtnetlink: pass extack to .ndo_fdb_del
  2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
                   ` (10 preceding siblings ...)
  2021-10-25 22:24 ` [RFC PATCH net-next 11/15] net: bridge: make fdb_add_entry() wait for switchdev feedback Vladimir Oltean
@ 2021-10-25 22:24 ` Vladimir Oltean
  2021-10-25 22:24 ` [RFC PATCH net-next 13/15] net: bridge: wait for errors from switchdev when deleting FDB entries Vladimir Oltean
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

The .ndo_fdb_del method can already return an int error code; let's also
pass it the netlink extack, so that drivers can report detailed error
messages to user space.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/ethernet/intel/ice/ice_main.c        | 4 ++--
 drivers/net/ethernet/mscc/ocelot_net.c           | 3 ++-
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c | 5 +++--
 drivers/net/macvlan.c                            | 3 ++-
 drivers/net/vxlan.c                              | 3 ++-
 include/linux/netdevice.h                        | 6 ++++--
 net/bridge/br_fdb.c                              | 3 ++-
 net/bridge/br_private.h                          | 3 ++-
 net/core/rtnetlink.c                             | 4 ++--
 9 files changed, 21 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 9ba22778011d..7e0741d62aae 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -5434,12 +5434,12 @@ ice_fdb_add(struct ndmsg *ndm, struct nlattr __always_unused *tb[],
 static int
 ice_fdb_del(struct ndmsg *ndm, __always_unused struct nlattr *tb[],
 	    struct net_device *dev, const unsigned char *addr,
-	    __always_unused u16 vid)
+	    __always_unused u16 vid, struct netlink_ext_ack *extack)
 {
 	int err;
 
 	if (ndm->ndm_state & NUD_PERMANENT) {
-		netdev_err(dev, "FDB only supports static addresses\n");
+		NL_SET_ERR_MSG_MOD(extack, "FDB only supports static addresses");
 		return -EINVAL;
 	}
 
diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index eaeba60b1bba..8f53c9858cc1 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -664,7 +664,8 @@ static int ocelot_port_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 
 static int ocelot_port_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
 			       struct net_device *dev,
-			       const unsigned char *addr, u16 vid)
+			       const unsigned char *addr, u16 vid,
+			       struct netlink_ext_ack *extack)
 {
 	struct ocelot_port_private *priv = netdev_priv(dev);
 	struct ocelot *ocelot = priv->port.ocelot;
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index ed84f0f97623..f276cc2c4351 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -367,8 +367,9 @@ static int qlcnic_set_mac(struct net_device *netdev, void *p)
 }
 
 static int qlcnic_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
-			struct net_device *netdev,
-			const unsigned char *addr, u16 vid)
+			  struct net_device *netdev,
+			  const unsigned char *addr, u16 vid,
+			  struct netlink_ext_ack *extack)
 {
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
 	int err = -EOPNOTSUPP;
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index d2f830ec2969..49377ef174c0 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1014,7 +1014,8 @@ static int macvlan_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 
 static int macvlan_fdb_del(struct ndmsg *ndm, struct nlattr *tb[],
 			   struct net_device *dev,
-			   const unsigned char *addr, u16 vid)
+			   const unsigned char *addr, u16 vid,
+			   struct netlink_ext_ack *extack)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	int err = -EINVAL;
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 141635a35c28..45e872d4e052 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1342,7 +1342,8 @@ static int __vxlan_fdb_delete(struct vxlan_dev *vxlan,
 /* Delete entry (via netlink) */
 static int vxlan_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
 			    struct net_device *dev,
-			    const unsigned char *addr, u16 vid)
+			    const unsigned char *addr, u16 vid,
+			    struct netlink_ext_ack *extack)
 {
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	union vxlan_addr ip;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec42495a43a..79284bdd4b6f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1245,7 +1245,8 @@ struct netdev_net_notifier {
  *	Adds an FDB entry to dev for addr.
  * int (*ndo_fdb_del)(struct ndmsg *ndm, struct nlattr *tb[],
  *		      struct net_device *dev,
- *		      const unsigned char *addr, u16 vid)
+ *		      const unsigned char *addr, u16 vid,
+ *		      struct netlink_ext_ack *extack);
  *	Deletes the FDB entry from dev coresponding to addr.
  * int (*ndo_fdb_dump)(struct sk_buff *skb, struct netlink_callback *cb,
  *		       struct net_device *dev, struct net_device *filter_dev,
@@ -1501,7 +1502,8 @@ struct net_device_ops {
 					       struct nlattr *tb[],
 					       struct net_device *dev,
 					       const unsigned char *addr,
-					       u16 vid);
+					       u16 vid,
+					       struct netlink_ext_ack *extack);
 	int			(*ndo_fdb_dump)(struct sk_buff *skb,
 						struct netlink_callback *cb,
 						struct net_device *dev,
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index e8afe64dadcc..ce49e5f914b1 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -1243,7 +1243,8 @@ static int __br_fdb_delete(struct net_bridge *br,
 /* Remove neighbor entry with RTM_DELNEIGH */
 int br_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
 		  struct net_device *dev,
-		  const unsigned char *addr, u16 vid)
+		  const unsigned char *addr, u16 vid,
+		  struct netlink_ext_ack *extack)
 {
 	struct net_bridge_vlan_group *vg;
 	struct net_bridge_port *p = NULL;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index f5f7501dad7d..6c663ccc346c 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -773,7 +773,8 @@ void br_fdb_update(struct net_bridge *br, struct net_bridge_port *source,
 		   const unsigned char *addr, u16 vid, unsigned long flags);
 
 int br_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
-		  struct net_device *dev, const unsigned char *addr, u16 vid);
+		  struct net_device *dev, const unsigned char *addr, u16 vid,
+		  struct netlink_ext_ack *extack);
 int br_fdb_add(struct ndmsg *nlh, struct nlattr *tb[], struct net_device *dev,
 	       const unsigned char *addr, u16 vid, u16 nlh_flags,
 	       struct netlink_ext_ack *extack);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 2af8aeeadadf..eed5eefe2bcd 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4152,7 +4152,7 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 		const struct net_device_ops *ops = br_dev->netdev_ops;
 
 		if (ops->ndo_fdb_del)
-			err = ops->ndo_fdb_del(ndm, tb, dev, addr, vid);
+			err = ops->ndo_fdb_del(ndm, tb, dev, addr, vid, extack);
 
 		if (err)
 			goto out;
@@ -4164,7 +4164,7 @@ static int rtnl_fdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
 	if (ndm->ndm_flags & NTF_SELF) {
 		if (dev->netdev_ops->ndo_fdb_del)
 			err = dev->netdev_ops->ndo_fdb_del(ndm, tb, dev, addr,
-							   vid);
+							   vid, extack);
 		else
 			err = ndo_dflt_fdb_del(ndm, tb, dev, addr, vid);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH net-next 13/15] net: bridge: wait for errors from switchdev when deleting FDB entries
  2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
                   ` (11 preceding siblings ...)
  2021-10-25 22:24 ` [RFC PATCH net-next 12/15] net: rtnetlink: pass extack to .ndo_fdb_del Vladimir Oltean
@ 2021-10-25 22:24 ` Vladimir Oltean
  2021-10-25 22:24 ` [RFC PATCH net-next 14/15] net: dsa: propagate feedback to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE Vladimir Oltean
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

Similar to the logic added for RTM_NEWNEIGH, we can make the bridge
RTM_DELNEIGH handler let switchdev veto an FDB entry deletion.

The strategy is:

- get hold of the FDB entry under &br->hash_lock
- notify switchdev of its deletion via the atomic notifier chain
- release the &br->hash_lock, wait for the response
- switchdev vetoes => error out
- switchdev agrees => attempt to get hold of the FDB entry again under
  &br->hash_lock, and this time delete it without notifying switchdev _or_
  rtnetlink (we've notified both already).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_fdb.c | 82 ++++++++++++++++++++++++---------------------
 1 file changed, 44 insertions(+), 38 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index ce49e5f914b1..5f9bef6e4472 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -407,7 +407,7 @@ static void fdb_del_hw_addr(struct net_bridge *br, const unsigned char *addr)
 }
 
 static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f,
-		       bool swdev_notify)
+		       bool notify_rtnl, bool swdev_notify)
 {
 	trace_fdb_delete(br, f);
 
@@ -417,7 +417,8 @@ static void fdb_delete(struct net_bridge *br, struct net_bridge_fdb_entry *f,
 	hlist_del_init_rcu(&f->fdb_node);
 	rhashtable_remove_fast(&br->fdb_hash_tbl, &f->rhnode,
 			       br_fdb_rht_params);
-	br_fdb_notify(br, f, RTM_DELNEIGH, swdev_notify);
+	if (notify_rtnl)
+		br_fdb_notify(br, f, RTM_DELNEIGH, swdev_notify);
 	call_rcu(&f->rcu, fdb_rcu_free);
 }
 
@@ -453,7 +454,7 @@ static void fdb_delete_local(struct net_bridge *br,
 		return;
 	}
 
-	fdb_delete(br, f, true);
+	fdb_delete(br, f, true, true);
 }
 
 void br_fdb_find_delete_local(struct net_bridge *br,
@@ -514,7 +515,7 @@ static int fdb_add_local(struct net_bridge *br, struct net_bridge_port *source,
 			return 0;
 		br_warn(br, "adding interface %s with same address as a received packet (addr:%pM, vlan:%u)\n",
 		       source ? source->dev->name : br->dev->name, addr, vid);
-		fdb_delete(br, fdb, true);
+		fdb_delete(br, fdb, true, true);
 	}
 
 	fdb = fdb_create(br, source, addr, vid,
@@ -639,7 +640,7 @@ void br_fdb_cleanup(struct work_struct *work)
 		} else {
 			spin_lock_bh(&br->hash_lock);
 			if (!hlist_unhashed(&f->fdb_node))
-				fdb_delete(br, f, true);
+				fdb_delete(br, f, true, true);
 			spin_unlock_bh(&br->hash_lock);
 		}
 	}
@@ -659,7 +660,7 @@ void br_fdb_flush(struct net_bridge *br)
 	spin_lock_bh(&br->hash_lock);
 	hlist_for_each_entry_safe(f, tmp, &br->fdb_list, fdb_node) {
 		if (!test_bit(BR_FDB_STATIC, &f->flags))
-			fdb_delete(br, f, true);
+			fdb_delete(br, f, true, true);
 	}
 	spin_unlock_bh(&br->hash_lock);
 }
@@ -691,7 +692,7 @@ void br_fdb_delete_by_port(struct net_bridge *br,
 		if (test_bit(BR_FDB_LOCAL, &f->flags))
 			fdb_delete_local(br, p, f);
 		else
-			fdb_delete(br, f, true);
+			fdb_delete(br, f, true, true);
 	}
 	spin_unlock_bh(&br->hash_lock);
 }
@@ -931,9 +932,10 @@ int br_fdb_get(struct sk_buff *skb,
 }
 
 /* Delete an FDB entry and don't notify switchdev */
-static void fdb_delete_by_addr_and_port(struct net_bridge *br,
-					const struct net_bridge_port *p,
-					const u8 *addr, u16 vlan)
+static int fdb_delete_by_addr_and_port(struct net_bridge *br,
+				       const struct net_bridge_port *p,
+				       const u8 *addr, u16 vlan,
+				       bool notify_rtnl)
 {
 	struct net_bridge_fdb_entry *fdb;
 
@@ -942,12 +944,14 @@ static void fdb_delete_by_addr_and_port(struct net_bridge *br,
 	fdb = br_fdb_find(br, addr, vlan);
 	if (!fdb || READ_ONCE(fdb->dst) != p) {
 		spin_unlock_bh(&br->hash_lock);
-		return;
+		return -ENOENT;
 	}
 
-	fdb_delete(br, fdb, false);
+	fdb_delete(br, fdb, notify_rtnl, false);
 
 	spin_unlock_bh(&br->hash_lock);
+
+	return 0;
 }
 
 /* returns true if the fdb is modified */
@@ -1079,7 +1083,7 @@ static int fdb_add_entry(struct net_bridge *br, struct net_bridge_port *source,
 
 	err = br_switchdev_fdb_wait(&wait_ctx);
 	if (err)
-		fdb_delete_by_addr_and_port(br, source, addr, vid);
+		fdb_delete_by_addr_and_port(br, source, addr, vid, true);
 
 	return err;
 }
@@ -1208,36 +1212,38 @@ int br_fdb_add(struct ndmsg *ndm, struct nlattr *tb[],
 	return err;
 }
 
-/* Delete an FDB entry and notify switchdev.
- * Caller must hold &br->hash_lock.
- */
-static int
-fdb_delete_by_addr_and_port_switchdev(struct net_bridge *br,
-				      const struct net_bridge_port *p,
-				      const u8 *addr, u16 vlan)
+/* Delete an FDB entry and notify switchdev. */
+static int __br_fdb_delete(struct net_bridge *br,
+			   const struct net_bridge_port *p,
+			   const u8 *addr, u16 vlan,
+			   struct netlink_ext_ack *extack)
 {
+	struct br_switchdev_fdb_wait_ctx wait_ctx;
 	struct net_bridge_fdb_entry *fdb;
+	int err;
 
-	fdb = br_fdb_find(br, addr, vlan);
-	if (!fdb || READ_ONCE(fdb->dst) != p)
-		return -ENOENT;
+	br_switchdev_fdb_wait_ctx_init(&wait_ctx);
 
-	fdb_delete(br, fdb, true);
+	spin_lock_bh(&br->hash_lock);
 
-	return 0;
-}
+	fdb = br_fdb_find(br, addr, vlan);
+	if (!fdb || READ_ONCE(fdb->dst) != p) {
+		spin_unlock_bh(&br->hash_lock);
+		return -ENOENT;
+	}
 
-static int __br_fdb_delete(struct net_bridge *br,
-			   const struct net_bridge_port *p,
-			   const unsigned char *addr, u16 vid)
-{
-	int err;
+	br_fdb_notify_async(br, fdb, RTM_DELNEIGH, extack, &wait_ctx);
 
-	spin_lock_bh(&br->hash_lock);
-	err = fdb_delete_by_addr_and_port_switchdev(br, p, addr, vid);
 	spin_unlock_bh(&br->hash_lock);
 
-	return err;
+	err = br_switchdev_fdb_wait(&wait_ctx);
+	if (err)
+		return err;
+
+	/* We've notified rtnl and switchdev once, don't do it again,
+	 * just delete.
+	 */
+	return fdb_delete_by_addr_and_port(br, p, addr, vlan, false);
 }
 
 /* Remove neighbor entry with RTM_DELNEIGH */
@@ -1273,17 +1279,17 @@ int br_fdb_delete(struct ndmsg *ndm, struct nlattr *tb[],
 			return -EINVAL;
 		}
 
-		err = __br_fdb_delete(br, p, addr, vid);
+		err = __br_fdb_delete(br, p, addr, vid, extack);
 	} else {
 		err = -ENOENT;
-		err &= __br_fdb_delete(br, p, addr, 0);
+		err &= __br_fdb_delete(br, p, addr, 0, extack);
 		if (!vg || !vg->num_vlans)
 			return err;
 
 		list_for_each_entry(v, &vg->vlan_list, vlist) {
 			if (!br_vlan_should_use(v))
 				continue;
-			err &= __br_fdb_delete(br, p, addr, v->vid);
+			err &= __br_fdb_delete(br, p, addr, v->vid, extack);
 		}
 	}
 
@@ -1414,7 +1420,7 @@ int br_fdb_external_learn_del(struct net_bridge *br, struct net_bridge_port *p,
 
 	fdb = br_fdb_find(br, addr, vid);
 	if (fdb && test_bit(BR_FDB_ADDED_BY_EXT_LEARN, &fdb->flags))
-		fdb_delete(br, fdb, swdev_notify);
+		fdb_delete(br, fdb, true, swdev_notify);
 	else
 		err = -ENOENT;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH net-next 14/15] net: dsa: propagate feedback to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
  2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
                   ` (12 preceding siblings ...)
  2021-10-25 22:24 ` [RFC PATCH net-next 13/15] net: bridge: wait for errors from switchdev when deleting FDB entries Vladimir Oltean
@ 2021-10-25 22:24 ` Vladimir Oltean
  2021-10-25 22:24 ` [RFC PATCH net-next 15/15] net: dsa: propagate extack to .port_fdb_{add,del} Vladimir Oltean
  2021-10-26 10:40 ` [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Nikolay Aleksandrov
  15 siblings, 0 replies; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

The strategy is to copy the entire struct switchdev_notifier_fdb_info to
our deferred work, since it contains pointers to the bridge's cookies,
which wait for us to report how offloading the FDB entry went.

Now that we have a full struct switchdev_notifier_fdb_info in the
deferred work, we can use that just fine to call dsa_fdb_offload_notify,
although that is not the primary purpose of this patch.

The cookies remain valid only until the point where we call
switchdev_fdb_mark_done(). That call wakes up the completion, and
therefore the process that called into the bridge, which may then free
its on-stack data structures, so drivers cannot safely touch their
struct switchdev_notifier_fdb_info copy after that point.
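
To illustrate the contract, here is a condensed sketch of the pattern a
driver is expected to follow ("foo" and all of its helpers are
hypothetical; the actual DSA conversion is in the diff below):

struct foo_fdb_work {
	struct work_struct work;
	struct switchdev_notifier_fdb_info fdb_info;
	struct foo_priv *priv;
};

static void foo_fdb_work_fn(struct work_struct *work)
{
	struct foo_fdb_work *w = container_of(work, struct foo_fdb_work, work);
	int err;

	/* w->fdb_info is a full copy of the notifier info, including the
	 * bridge's wait cookies.
	 */
	err = foo_program_fdb(w->priv, w->fdb_info.addr, w->fdb_info.vid);

	/* Report the verdict and wake up the waiter in the bridge;
	 * w->fdb_info must not be touched after this call.
	 */
	switchdev_fdb_mark_done(&w->fdb_info, err);
	kfree(w);
}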

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/dsa_priv.h |  6 +-----
 net/dsa/slave.c    | 42 +++++++++++++-----------------------------
 2 files changed, 14 insertions(+), 34 deletions(-)

diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index a5c9bc7b66c6..5d3f8291ec7f 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -131,11 +131,7 @@ struct dsa_switchdev_event_work {
 	struct net_device *dev;
 	struct work_struct work;
 	unsigned long event;
-	/* Specific for SWITCHDEV_FDB_ADD_TO_DEVICE and
-	 * SWITCHDEV_FDB_DEL_TO_DEVICE
-	 */
-	unsigned char addr[ETH_ALEN];
-	u16 vid;
+	struct switchdev_notifier_fdb_info fdb_info;
 	bool host_addr;
 };
 
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index af573d16dff5..1329e56e22ca 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2388,66 +2388,50 @@ static int dsa_slave_netdevice_event(struct notifier_block *nb,
 static void
 dsa_fdb_offload_notify(struct dsa_switchdev_event_work *switchdev_work)
 {
-	struct switchdev_notifier_fdb_info info = {};
+	struct switchdev_notifier_fdb_info *fdb_info = &switchdev_work->fdb_info;
 	struct dsa_switch *ds = switchdev_work->ds;
 	struct dsa_port *dp;
 
 	if (!dsa_is_user_port(ds, switchdev_work->port))
 		return;
 
-	ether_addr_copy(info.addr, switchdev_work->addr);
-	info.vid = switchdev_work->vid;
-	info.offloaded = true;
+	fdb_info->offloaded = true;
 	dp = dsa_to_port(ds, switchdev_work->port);
 	call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED,
-				 dp->slave, &info.info, NULL);
+				 dp->slave, &fdb_info->info, NULL);
 }
 
 static void dsa_slave_switchdev_event_work(struct work_struct *work)
 {
+	struct netlink_ext_ack *extack;
 	struct dsa_switchdev_event_work *switchdev_work =
 		container_of(work, struct dsa_switchdev_event_work, work);
+	struct switchdev_notifier_fdb_info *fdb_info = &switchdev_work->fdb_info;
 	struct dsa_switch *ds = switchdev_work->ds;
 	struct dsa_port *dp;
 	int err;
 
 	dp = dsa_to_port(ds, switchdev_work->port);
+	extack = switchdev_notifier_info_to_extack(&fdb_info->info);
 
 	switch (switchdev_work->event) {
 	case SWITCHDEV_FDB_ADD_TO_DEVICE:
 		if (switchdev_work->host_addr)
-			err = dsa_port_host_fdb_add(dp, switchdev_work->addr,
-						    switchdev_work->vid);
+			err = dsa_port_host_fdb_add(dp, fdb_info->addr, fdb_info->vid);
 		else
-			err = dsa_port_fdb_add(dp, switchdev_work->addr,
-					       switchdev_work->vid);
-		if (err) {
-			dev_err(ds->dev,
-				"port %d failed to add %pM vid %d to fdb: %d\n",
-				dp->index, switchdev_work->addr,
-				switchdev_work->vid, err);
-			break;
-		}
+			err = dsa_port_fdb_add(dp, fdb_info->addr, fdb_info->vid);
 		dsa_fdb_offload_notify(switchdev_work);
 		break;
 
 	case SWITCHDEV_FDB_DEL_TO_DEVICE:
 		if (switchdev_work->host_addr)
-			err = dsa_port_host_fdb_del(dp, switchdev_work->addr,
-						    switchdev_work->vid);
+			err = dsa_port_host_fdb_del(dp, fdb_info->addr, fdb_info->vid);
 		else
-			err = dsa_port_fdb_del(dp, switchdev_work->addr,
-					       switchdev_work->vid);
-		if (err) {
-			dev_err(ds->dev,
-				"port %d failed to delete %pM vid %d from fdb: %d\n",
-				dp->index, switchdev_work->addr,
-				switchdev_work->vid, err);
-		}
-
+			err = dsa_port_fdb_del(dp, fdb_info->addr, fdb_info->vid);
 		break;
 	}
 
+	switchdev_fdb_mark_done(fdb_info, err);
 	dev_put(switchdev_work->dev);
 	kfree(switchdev_work);
 }
@@ -2516,12 +2500,12 @@ static int dsa_slave_fdb_event(struct net_device *dev,
 	switchdev_work->event = event;
 	switchdev_work->dev = dev;
 
-	ether_addr_copy(switchdev_work->addr, fdb_info->addr);
-	switchdev_work->vid = fdb_info->vid;
+	memcpy(&switchdev_work->fdb_info, fdb_info, sizeof(*fdb_info));
 	switchdev_work->host_addr = host_addr;
 
 	/* Hold a reference for dsa_fdb_offload_notify */
 	dev_hold(dev);
+	switchdev_fdb_mark_pending(fdb_info);
 	dsa_schedule_work(&switchdev_work->work);
 
 	return 0;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH net-next 15/15] net: dsa: propagate extack to .port_fdb_{add,del}
  2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
                   ` (13 preceding siblings ...)
  2021-10-25 22:24 ` [RFC PATCH net-next 14/15] net: dsa: propagate feedback to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE Vladimir Oltean
@ 2021-10-25 22:24 ` Vladimir Oltean
  2021-10-26 10:40 ` [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Nikolay Aleksandrov
  15 siblings, 0 replies; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-25 22:24 UTC (permalink / raw)
  To: netdev, Roopa Prabhu, Nikolay Aleksandrov, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

Some drivers will need to veto FDB entries in certain circumstances.
Modify the driver-facing API to expose the netlink extack for extended
error message reporting to drivers.
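
As an illustration only (none of the drivers converted in this patch
emits an extack message yet; "foo" and its helpers are hypothetical),
a driver could now veto an entry like this:

static int foo_port_fdb_add(struct dsa_switch *ds, int port,
			    const unsigned char *addr, u16 vid,
			    struct netlink_ext_ack *extack)
{
	struct foo_priv *priv = ds->priv;

	if (foo_fdb_table_full(priv)) {
		/* Reported synchronously to "bridge fdb add" */
		NL_SET_ERR_MSG_MOD(extack, "FDB table is full");
		return -ENOSPC;
	}

	return foo_fdb_program(priv, port, addr, vid);
}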

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/dsa/b53/b53_common.c       |  6 ++++--
 drivers/net/dsa/b53/b53_priv.h         |  6 ++++--
 drivers/net/dsa/hirschmann/hellcreek.c |  6 ++++--
 drivers/net/dsa/lan9303-core.c         |  7 ++++---
 drivers/net/dsa/lantiq_gswip.c         |  6 ++++--
 drivers/net/dsa/microchip/ksz9477.c    |  6 ++++--
 drivers/net/dsa/mt7530.c               |  6 ++++--
 drivers/net/dsa/mv88e6xxx/chip.c       |  6 ++++--
 drivers/net/dsa/ocelot/felix.c         |  6 ++++--
 drivers/net/dsa/qca8k.c                |  6 ++++--
 drivers/net/dsa/sja1105/sja1105_main.c | 12 +++++++-----
 include/net/dsa.h                      |  6 ++++--
 net/dsa/dsa_priv.h                     |  9 +++++----
 net/dsa/port.c                         | 13 ++++++++-----
 net/dsa/slave.c                        | 12 ++++++++----
 net/dsa/switch.c                       | 22 ++++++++++++----------
 16 files changed, 84 insertions(+), 51 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index af4761968733..8657afd18791 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -1704,7 +1704,8 @@ static int b53_arl_op(struct b53_device *dev, int op, int port,
 }
 
 int b53_fdb_add(struct dsa_switch *ds, int port,
-		const unsigned char *addr, u16 vid)
+		const unsigned char *addr, u16 vid,
+		struct netlink_ext_ack *extack)
 {
 	struct b53_device *priv = ds->priv;
 	int ret;
@@ -1724,7 +1725,8 @@ int b53_fdb_add(struct dsa_switch *ds, int port,
 EXPORT_SYMBOL(b53_fdb_add);
 
 int b53_fdb_del(struct dsa_switch *ds, int port,
-		const unsigned char *addr, u16 vid)
+		const unsigned char *addr, u16 vid,
+		struct netlink_ext_ack *extack)
 {
 	struct b53_device *priv = ds->priv;
 	int ret;
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index 579da74ada64..8c4741f7a837 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -362,9 +362,11 @@ int b53_vlan_add(struct dsa_switch *ds, int port,
 int b53_vlan_del(struct dsa_switch *ds, int port,
 		 const struct switchdev_obj_port_vlan *vlan);
 int b53_fdb_add(struct dsa_switch *ds, int port,
-		const unsigned char *addr, u16 vid);
+		const unsigned char *addr, u16 vid,
+		struct netlink_ext_ack *extack);
 int b53_fdb_del(struct dsa_switch *ds, int port,
-		const unsigned char *addr, u16 vid);
+		const unsigned char *addr, u16 vid,
+		struct netlink_ext_ack *extack);
 int b53_fdb_dump(struct dsa_switch *ds, int port,
 		 dsa_fdb_dump_cb_t *cb, void *data);
 int b53_mdb_add(struct dsa_switch *ds, int port,
diff --git a/drivers/net/dsa/hirschmann/hellcreek.c b/drivers/net/dsa/hirschmann/hellcreek.c
index 4e0b53d94b52..3e1fcc7e23f1 100644
--- a/drivers/net/dsa/hirschmann/hellcreek.c
+++ b/drivers/net/dsa/hirschmann/hellcreek.c
@@ -823,7 +823,8 @@ static int hellcreek_fdb_get(struct hellcreek *hellcreek,
 }
 
 static int hellcreek_fdb_add(struct dsa_switch *ds, int port,
-			     const unsigned char *addr, u16 vid)
+			     const unsigned char *addr, u16 vid,
+			     struct netlink_ext_ack *extack)
 {
 	struct hellcreek_fdb_entry entry = { 0 };
 	struct hellcreek *hellcreek = ds->priv;
@@ -868,7 +869,8 @@ static int hellcreek_fdb_add(struct dsa_switch *ds, int port,
 }
 
 static int hellcreek_fdb_del(struct dsa_switch *ds, int port,
-			     const unsigned char *addr, u16 vid)
+			     const unsigned char *addr, u16 vid,
+			     struct netlink_ext_ack *extack)
 {
 	struct hellcreek_fdb_entry entry = { 0 };
 	struct hellcreek *hellcreek = ds->priv;
diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index 89f920289ae2..1255a28a12e4 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -1180,7 +1180,8 @@ static void lan9303_port_fast_age(struct dsa_switch *ds, int port)
 }
 
 static int lan9303_port_fdb_add(struct dsa_switch *ds, int port,
-				const unsigned char *addr, u16 vid)
+				const unsigned char *addr, u16 vid,
+				struct netlink_ext_ack *extack)
 {
 	struct lan9303 *chip = ds->priv;
 
@@ -1192,8 +1193,8 @@ static int lan9303_port_fdb_add(struct dsa_switch *ds, int port,
 }
 
 static int lan9303_port_fdb_del(struct dsa_switch *ds, int port,
-				const unsigned char *addr, u16 vid)
-
+				const unsigned char *addr, u16 vid,
+				struct netlink_ext_ack *extack)
 {
 	struct lan9303 *chip = ds->priv;
 
diff --git a/drivers/net/dsa/lantiq_gswip.c b/drivers/net/dsa/lantiq_gswip.c
index 7056d98d8177..83d125259897 100644
--- a/drivers/net/dsa/lantiq_gswip.c
+++ b/drivers/net/dsa/lantiq_gswip.c
@@ -1381,13 +1381,15 @@ static int gswip_port_fdb(struct dsa_switch *ds, int port,
 }
 
 static int gswip_port_fdb_add(struct dsa_switch *ds, int port,
-			      const unsigned char *addr, u16 vid)
+			      const unsigned char *addr, u16 vid,
+			      struct netlink_ext_ack *extack)
 {
 	return gswip_port_fdb(ds, port, addr, vid, true);
 }
 
 static int gswip_port_fdb_del(struct dsa_switch *ds, int port,
-			      const unsigned char *addr, u16 vid)
+			      const unsigned char *addr, u16 vid,
+			      struct netlink_ext_ack *extack)
 {
 	return gswip_port_fdb(ds, port, addr, vid, false);
 }
diff --git a/drivers/net/dsa/microchip/ksz9477.c b/drivers/net/dsa/microchip/ksz9477.c
index 854e25f43fa7..e4cd36c25d76 100644
--- a/drivers/net/dsa/microchip/ksz9477.c
+++ b/drivers/net/dsa/microchip/ksz9477.c
@@ -583,7 +583,8 @@ static int ksz9477_port_vlan_del(struct dsa_switch *ds, int port,
 }
 
 static int ksz9477_port_fdb_add(struct dsa_switch *ds, int port,
-				const unsigned char *addr, u16 vid)
+				const unsigned char *addr, u16 vid,
+				struct netlink_ext_ack *extack)
 {
 	struct ksz_device *dev = ds->priv;
 	u32 alu_table[4];
@@ -640,7 +641,8 @@ static int ksz9477_port_fdb_add(struct dsa_switch *ds, int port,
 }
 
 static int ksz9477_port_fdb_del(struct dsa_switch *ds, int port,
-				const unsigned char *addr, u16 vid)
+				const unsigned char *addr, u16 vid,
+				struct netlink_ext_ack *extack)
 {
 	struct ksz_device *dev = ds->priv;
 	u32 alu_table[4];
diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index 9890672a206d..e6176078f5ad 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -1341,7 +1341,8 @@ mt7530_port_bridge_leave(struct dsa_switch *ds, int port,
 
 static int
 mt7530_port_fdb_add(struct dsa_switch *ds, int port,
-		    const unsigned char *addr, u16 vid)
+		    const unsigned char *addr, u16 vid,
+		    struct netlink_ext_ack *extack)
 {
 	struct mt7530_priv *priv = ds->priv;
 	int ret;
@@ -1357,7 +1358,8 @@ mt7530_port_fdb_add(struct dsa_switch *ds, int port,
 
 static int
 mt7530_port_fdb_del(struct dsa_switch *ds, int port,
-		    const unsigned char *addr, u16 vid)
+		    const unsigned char *addr, u16 vid,
+		    struct netlink_ext_ack *extack)
 {
 	struct mt7530_priv *priv = ds->priv;
 	int ret;
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 14c678a9e41b..fbd22d1c479e 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2299,7 +2299,8 @@ static int mv88e6xxx_port_vlan_del(struct dsa_switch *ds, int port,
 }
 
 static int mv88e6xxx_port_fdb_add(struct dsa_switch *ds, int port,
-				  const unsigned char *addr, u16 vid)
+				  const unsigned char *addr, u16 vid,
+				  struct netlink_ext_ack *extack)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int err;
@@ -2313,7 +2314,8 @@ static int mv88e6xxx_port_fdb_add(struct dsa_switch *ds, int port,
 }
 
 static int mv88e6xxx_port_fdb_del(struct dsa_switch *ds, int port,
-				  const unsigned char *addr, u16 vid)
+				  const unsigned char *addr, u16 vid,
+				  struct netlink_ext_ack *extack)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	int err;
diff --git a/drivers/net/dsa/ocelot/felix.c b/drivers/net/dsa/ocelot/felix.c
index 83808e7dbdda..0a3d0cbd25e2 100644
--- a/drivers/net/dsa/ocelot/felix.c
+++ b/drivers/net/dsa/ocelot/felix.c
@@ -638,7 +638,8 @@ static int felix_fdb_dump(struct dsa_switch *ds, int port,
 }
 
 static int felix_fdb_add(struct dsa_switch *ds, int port,
-			 const unsigned char *addr, u16 vid)
+			 const unsigned char *addr, u16 vid,
+			 struct netlink_ext_ack *extack)
 {
 	struct ocelot *ocelot = ds->priv;
 
@@ -646,7 +647,8 @@ static int felix_fdb_add(struct dsa_switch *ds, int port,
 }
 
 static int felix_fdb_del(struct dsa_switch *ds, int port,
-			 const unsigned char *addr, u16 vid)
+			 const unsigned char *addr, u16 vid,
+			 struct netlink_ext_ack *extack)
 {
 	struct ocelot *ocelot = ds->priv;
 
diff --git a/drivers/net/dsa/qca8k.c b/drivers/net/dsa/qca8k.c
index ea7f12778922..26cf4b583c74 100644
--- a/drivers/net/dsa/qca8k.c
+++ b/drivers/net/dsa/qca8k.c
@@ -1850,7 +1850,8 @@ qca8k_port_fdb_insert(struct qca8k_priv *priv, const u8 *addr,
 
 static int
 qca8k_port_fdb_add(struct dsa_switch *ds, int port,
-		   const unsigned char *addr, u16 vid)
+		   const unsigned char *addr, u16 vid,
+		   struct netlink_ext_ack *extack)
 {
 	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
 	u16 port_mask = BIT(port);
@@ -1860,7 +1861,8 @@ qca8k_port_fdb_add(struct dsa_switch *ds, int port,
 
 static int
 qca8k_port_fdb_del(struct dsa_switch *ds, int port,
-		   const unsigned char *addr, u16 vid)
+		   const unsigned char *addr, u16 vid,
+		   struct netlink_ext_ack *extack)
 {
 	struct qca8k_priv *priv = (struct qca8k_priv *)ds->priv;
 	u16 port_mask = BIT(port);
diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c
index c343effe2e96..40a970874409 100644
--- a/drivers/net/dsa/sja1105/sja1105_main.c
+++ b/drivers/net/dsa/sja1105/sja1105_main.c
@@ -1818,7 +1818,8 @@ int sja1105pqrs_fdb_del(struct dsa_switch *ds, int port,
 }
 
 static int sja1105_fdb_add(struct dsa_switch *ds, int port,
-			   const unsigned char *addr, u16 vid)
+			   const unsigned char *addr, u16 vid,
+			   struct netlink_ext_ack *extack)
 {
 	struct sja1105_private *priv = ds->priv;
 
@@ -1826,7 +1827,8 @@ static int sja1105_fdb_add(struct dsa_switch *ds, int port,
 }
 
 static int sja1105_fdb_del(struct dsa_switch *ds, int port,
-			   const unsigned char *addr, u16 vid)
+			   const unsigned char *addr, u16 vid,
+			   struct netlink_ext_ack *extack)
 {
 	struct sja1105_private *priv = ds->priv;
 
@@ -1912,7 +1914,7 @@ static void sja1105_fast_age(struct dsa_switch *ds, int port)
 
 		u64_to_ether_addr(l2_lookup.macaddr, macaddr);
 
-		rc = sja1105_fdb_del(ds, port, macaddr, l2_lookup.vlanid);
+		rc = sja1105_fdb_del(ds, port, macaddr, l2_lookup.vlanid, NULL);
 		if (rc) {
 			dev_err(ds->dev,
 				"Failed to delete FDB entry %pM vid %lld: %pe\n",
@@ -1925,13 +1927,13 @@ static void sja1105_fast_age(struct dsa_switch *ds, int port)
 static int sja1105_mdb_add(struct dsa_switch *ds, int port,
 			   const struct switchdev_obj_port_mdb *mdb)
 {
-	return sja1105_fdb_add(ds, port, mdb->addr, mdb->vid);
+	return sja1105_fdb_add(ds, port, mdb->addr, mdb->vid, NULL);
 }
 
 static int sja1105_mdb_del(struct dsa_switch *ds, int port,
 			   const struct switchdev_obj_port_mdb *mdb)
 {
-	return sja1105_fdb_del(ds, port, mdb->addr, mdb->vid);
+	return sja1105_fdb_del(ds, port, mdb->addr, mdb->vid, NULL);
 }
 
 /* Common function for unicast and broadcast flood configuration.
diff --git a/include/net/dsa.h b/include/net/dsa.h
index badd214f7470..ca333c11c7f4 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -782,9 +782,11 @@ struct dsa_switch_ops {
 	 * Forwarding database
 	 */
 	int	(*port_fdb_add)(struct dsa_switch *ds, int port,
-				const unsigned char *addr, u16 vid);
+				const unsigned char *addr, u16 vid,
+				struct netlink_ext_ack *extack);
 	int	(*port_fdb_del)(struct dsa_switch *ds, int port,
-				const unsigned char *addr, u16 vid);
+				const unsigned char *addr, u16 vid,
+				struct netlink_ext_ack *extack);
 	int	(*port_fdb_dump)(struct dsa_switch *ds, int port,
 				 dsa_fdb_dump_cb_t *cb, void *data);
 
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 5d3f8291ec7f..253a875f54cd 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -64,6 +64,7 @@ struct dsa_notifier_fdb_info {
 	int port;
 	const unsigned char *addr;
 	u16 vid;
+	struct netlink_ext_ack *extack;
 };
 
 /* DSA_NOTIFIER_MDB_* */
@@ -219,13 +220,13 @@ int dsa_port_ageing_time(struct dsa_port *dp, clock_t ageing_clock);
 int dsa_port_mtu_change(struct dsa_port *dp, int new_mtu,
 			bool targeted_match);
 int dsa_port_fdb_add(struct dsa_port *dp, const unsigned char *addr,
-		     u16 vid);
+		     u16 vid, struct netlink_ext_ack *extack);
 int dsa_port_fdb_del(struct dsa_port *dp, const unsigned char *addr,
-		     u16 vid);
+		     u16 vid, struct netlink_ext_ack *extack);
 int dsa_port_host_fdb_add(struct dsa_port *dp, const unsigned char *addr,
-			  u16 vid);
+			  u16 vid, struct netlink_ext_ack *extack);
 int dsa_port_host_fdb_del(struct dsa_port *dp, const unsigned char *addr,
-			  u16 vid);
+			  u16 vid, struct netlink_ext_ack *extack);
 int dsa_port_fdb_dump(struct dsa_port *dp, dsa_fdb_dump_cb_t *cb, void *data);
 int dsa_port_mdb_add(const struct dsa_port *dp,
 		     const struct switchdev_obj_port_mdb *mdb);
diff --git a/net/dsa/port.c b/net/dsa/port.c
index bf671306b560..444c7539c826 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -737,40 +737,42 @@ int dsa_port_mtu_change(struct dsa_port *dp, int new_mtu,
 }
 
 int dsa_port_fdb_add(struct dsa_port *dp, const unsigned char *addr,
-		     u16 vid)
+		     u16 vid, struct netlink_ext_ack *extack)
 {
 	struct dsa_notifier_fdb_info info = {
 		.sw_index = dp->ds->index,
 		.port = dp->index,
 		.addr = addr,
 		.vid = vid,
+		.extack = extack,
 	};
 
 	return dsa_port_notify(dp, DSA_NOTIFIER_FDB_ADD, &info);
 }
 
 int dsa_port_fdb_del(struct dsa_port *dp, const unsigned char *addr,
-		     u16 vid)
+		     u16 vid, struct netlink_ext_ack *extack)
 {
 	struct dsa_notifier_fdb_info info = {
 		.sw_index = dp->ds->index,
 		.port = dp->index,
 		.addr = addr,
 		.vid = vid,
-
+		.extack = extack,
 	};
 
 	return dsa_port_notify(dp, DSA_NOTIFIER_FDB_DEL, &info);
 }
 
 int dsa_port_host_fdb_add(struct dsa_port *dp, const unsigned char *addr,
-			  u16 vid)
+			  u16 vid, struct netlink_ext_ack *extack)
 {
 	struct dsa_notifier_fdb_info info = {
 		.sw_index = dp->ds->index,
 		.port = dp->index,
 		.addr = addr,
 		.vid = vid,
+		.extack = extack,
 	};
 	struct dsa_port *cpu_dp = dp->cpu_dp;
 	int err;
@@ -783,13 +785,14 @@ int dsa_port_host_fdb_add(struct dsa_port *dp, const unsigned char *addr,
 }
 
 int dsa_port_host_fdb_del(struct dsa_port *dp, const unsigned char *addr,
-			  u16 vid)
+			  u16 vid, struct netlink_ext_ack *extack)
 {
 	struct dsa_notifier_fdb_info info = {
 		.sw_index = dp->ds->index,
 		.port = dp->index,
 		.addr = addr,
 		.vid = vid,
+		.extack = extack,
 	};
 	struct dsa_port *cpu_dp = dp->cpu_dp;
 	int err;
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 1329e56e22ca..3e49feb81261 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2417,17 +2417,21 @@ static void dsa_slave_switchdev_event_work(struct work_struct *work)
 	switch (switchdev_work->event) {
 	case SWITCHDEV_FDB_ADD_TO_DEVICE:
 		if (switchdev_work->host_addr)
-			err = dsa_port_host_fdb_add(dp, fdb_info->addr, fdb_info->vid);
+			err = dsa_port_host_fdb_add(dp, fdb_info->addr,
+						    fdb_info->vid, extack);
 		else
-			err = dsa_port_fdb_add(dp, fdb_info->addr, fdb_info->vid);
+			err = dsa_port_fdb_add(dp, fdb_info->addr,
+					       fdb_info->vid, extack);
 		dsa_fdb_offload_notify(switchdev_work);
 		break;
 
 	case SWITCHDEV_FDB_DEL_TO_DEVICE:
 		if (switchdev_work->host_addr)
-			err = dsa_port_host_fdb_del(dp, fdb_info->addr, fdb_info->vid);
+			err = dsa_port_host_fdb_del(dp, fdb_info->addr,
+						    fdb_info->vid, extack);
 		else
-			err = dsa_port_fdb_del(dp, fdb_info->addr, fdb_info->vid);
+			err = dsa_port_fdb_del(dp, fdb_info->addr,
+					       fdb_info->vid, extack);
 		break;
 	}
 
diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index bb155a16d454..9058882a282a 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -291,7 +291,7 @@ static int dsa_port_do_mdb_del(struct dsa_port *dp,
 }
 
 static int dsa_port_do_fdb_add(struct dsa_port *dp, const unsigned char *addr,
-			       u16 vid)
+			       u16 vid, struct netlink_ext_ack *extack)
 {
 	struct dsa_switch *ds = dp->ds;
 	struct dsa_mac_addr *a;
@@ -300,7 +300,7 @@ static int dsa_port_do_fdb_add(struct dsa_port *dp, const unsigned char *addr,
 
 	/* No need to bother with refcounting for user ports */
 	if (!(dsa_port_is_cpu(dp) || dsa_port_is_dsa(dp)))
-		return ds->ops->port_fdb_add(ds, port, addr, vid);
+		return ds->ops->port_fdb_add(ds, port, addr, vid, extack);
 
 	mutex_lock(&dp->addr_lists_lock);
 
@@ -316,7 +316,7 @@ static int dsa_port_do_fdb_add(struct dsa_port *dp, const unsigned char *addr,
 		goto out;
 	}
 
-	err = ds->ops->port_fdb_add(ds, port, addr, vid);
+	err = ds->ops->port_fdb_add(ds, port, addr, vid, extack);
 	if (err) {
 		kfree(a);
 		goto out;
@@ -334,7 +334,7 @@ static int dsa_port_do_fdb_add(struct dsa_port *dp, const unsigned char *addr,
 }
 
 static int dsa_port_do_fdb_del(struct dsa_port *dp, const unsigned char *addr,
-			       u16 vid)
+			       u16 vid, struct netlink_ext_ack *extack)
 {
 	struct dsa_switch *ds = dp->ds;
 	struct dsa_mac_addr *a;
@@ -343,7 +343,7 @@ static int dsa_port_do_fdb_del(struct dsa_port *dp, const unsigned char *addr,
 
 	/* No need to bother with refcounting for user ports */
 	if (!(dsa_port_is_cpu(dp) || dsa_port_is_dsa(dp)))
-		return ds->ops->port_fdb_del(ds, port, addr, vid);
+		return ds->ops->port_fdb_del(ds, port, addr, vid, extack);
 
 	mutex_lock(&dp->addr_lists_lock);
 
@@ -356,7 +356,7 @@ static int dsa_port_do_fdb_del(struct dsa_port *dp, const unsigned char *addr,
 	if (!refcount_dec_and_test(&a->refcount))
 		goto out;
 
-	err = ds->ops->port_fdb_del(ds, port, addr, vid);
+	err = ds->ops->port_fdb_del(ds, port, addr, vid, extack);
 	if (err) {
 		refcount_set(&a->refcount, 1);
 		goto out;
@@ -383,7 +383,8 @@ static int dsa_switch_host_fdb_add(struct dsa_switch *ds,
 	dsa_switch_for_each_port(dp, ds) {
 		if (dsa_port_host_address_match(dp, info->sw_index,
 						info->port)) {
-			err = dsa_port_do_fdb_add(dp, info->addr, info->vid);
+			err = dsa_port_do_fdb_add(dp, info->addr, info->vid,
+						  info->extack);
 			if (err)
 				break;
 		}
@@ -404,7 +405,8 @@ static int dsa_switch_host_fdb_del(struct dsa_switch *ds,
 	dsa_switch_for_each_port(dp, ds) {
 		if (dsa_port_host_address_match(dp, info->sw_index,
 						info->port)) {
-			err = dsa_port_do_fdb_del(dp, info->addr, info->vid);
+			err = dsa_port_do_fdb_del(dp, info->addr, info->vid,
+						  info->extack);
 			if (err)
 				break;
 		}
@@ -422,7 +424,7 @@ static int dsa_switch_fdb_add(struct dsa_switch *ds,
 	if (!ds->ops->port_fdb_add)
 		return -EOPNOTSUPP;
 
-	return dsa_port_do_fdb_add(dp, info->addr, info->vid);
+	return dsa_port_do_fdb_add(dp, info->addr, info->vid, info->extack);
 }
 
 static int dsa_switch_fdb_del(struct dsa_switch *ds,
@@ -434,7 +436,7 @@ static int dsa_switch_fdb_del(struct dsa_switch *ds,
 	if (!ds->ops->port_fdb_del)
 		return -EOPNOTSUPP;
 
-	return dsa_port_do_fdb_del(dp, info->addr, info->vid);
+	return dsa_port_do_fdb_del(dp, info->addr, info->vid, info->extack);
 }
 
 static int dsa_switch_hsr_join(struct dsa_switch *ds,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
                   ` (14 preceding siblings ...)
  2021-10-25 22:24 ` [RFC PATCH net-next 15/15] net: dsa: propagate extack to .port_fdb_{add,del} Vladimir Oltean
@ 2021-10-26 10:40 ` Nikolay Aleksandrov
  2021-10-26 11:25   ` Vladimir Oltean
  15 siblings, 1 reply; 29+ messages in thread
From: Nikolay Aleksandrov @ 2021-10-26 10:40 UTC (permalink / raw)
  To: Vladimir Oltean, netdev, Roopa Prabhu, Ido Schimmel
  Cc: Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

On 26/10/2021 01:24, Vladimir Oltean wrote:
> Hello, this is me bombarding the list with switchdev FDB changes again.
> 
> This series attempts to address one design limitation in the interaction
> between the bridge and switchdev: error codes returned from the
> SWITCHDEV_FDB_ADD_TO_DEVICE and SWITCHDEV_FDB_DEL_TO_DEVICE handlers are
> completely ignored.
> 
> There are multiple aspects to that. First of all, drivers have a portion
> that handles those switchdev events in atomic context, and a portion
> that handles them in a private deferred work context. Errors reported
> from both calling contexts are ignored by the bridge, and it is
> desirable to actually propagate both to user space.
> 
> Secondly, it is in fact okay that some switchdev errors are ignored.
> The call graph for fdb_notify() is not simple, it looks something like
> this (not complete):
> 
> IFLA_BRPORT_FLUSH                                                              RTM_NEWNEIGH
>    |                                                                               |
>    | {br,nbp}_vlan_delete                 br_fdb_change_mac_address                v
>    |   |  |                                                  |     fast      __br_fdb_add
>    |   |  |  del_nbp, br_dev_delete       br_fdb_changeaddr  |     path         /  |  \
>    |   |  |      |                                        |  |    learning     /   |   \
>    \   |   -------------------- br_fdb_find_delete_local  |  |       |        /    |    \     switchdev event
>     \  |         |                                     |  |  |       |       /     |     \     listener
>      -------------------------- br_fdb_delete_by_port  |  |  |       |      /      |      \       |
>                                                  |  |  |  |  |       |     /       |       \      |
>                                                  |  |  |  |  |       |    /        |        \     |
>                                                  |  |  |  |  |    br_fdb_update    |        br_fdb_external_learn_add
>            (RTM_DELNEIGH)  br_fdb_delete         |  |  |  |  |       |             |              |
>                                      |           |  |  |  |  |       |             |              |    gc_work        netdevice
>                                      |           |  |  |  |  |       |      fdb_add_entry         |     timer          event
>                                      |           | fdb_delete_local  |             |              |        |          listener
>                          __br_fdb_delete         |  |                |             |              /  br_fdb_cleanup      |
>                                      |           |  |                |             |             /         |             |     br_stp_change_bridge_id
>                                      |           |  |                \             |            /          | br_fdb_changeaddr      |
>                                      |           |  |                 \            |           /           |     |                  |
>                      fdb_delete_by_addr_and_port |  | fdb_insert       \           |          /       ----/      | br_fdb_change_mac_address
>                                               |  |  |  |                \          |         /       /           |  |
>                    br_fdb_external_learn_del  |  |  |  | br_fdb_cleanup  \         |        /       /            |  | br_fdb_insert
>                                           |   |  |  |  |  |               \        |       /   ----/             |  | |
>                                           |   |  |  |  |  |                \       |      /   /                 fdb_insert
>                           br_fdb_flush    |   |  |  |  |  |                 \      |     /   /            --------/
>                                  \----    |   |  |  |  |  |                  \     |    /   /      ------/
>                                       \----------- fdb_delete --------------- fdb_notify ---------/
> 
> There's not a lot that the fast path learning can do about switchdev
> when that returns an error.
> 
> So this patch set mainly wants to deal with the 2 code paths that are
> triggered by these regular commands:
> 
> bridge fdb add dev swp0 00:01:02:03:04:05 master static # __br_fdb_add
> bridge fdb del dev swp0 00:01:02:03:04:05 master static # __br_fdb_delete
> 
> In some other, semi-related discussions, Ido Schimmel pointed out that
> it would be nice if user space got some feedback from the actual driver,
> and made some proposals about how that could be done.
> https://patchwork.kernel.org/project/netdevbpf/cover/20210819160723.2186424-1-vladimir.oltean@nxp.com/
> One of the proposals was to call fdb_notify() from sleepable context,
> but Nikolay disliked the idea of introducing deferred work in the bridge
> driver (seems like nobody wants to deal with it).
> 
> And since all proposals of dealing with the deferred work inside
> switchdev were also shot down for valid reasons, we are basically left
> as a baseline with the code that we have today, with the deferred work
> being private to the driver, and somehow we must propagate an err and an
> extack from there.
> 
> So the approach taken here is to reorganize the code a bit and add some
> hooks in:
> (a) some callers of the fdb_notify() function to initialize a completion
>     structure
> (b) some drivers that catch SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE and mark
>     that completion structure as done
> (c) some bridge logic that I believe is fairly safe (I'm open to being
>     proven wrong) that temporarily drops the &br->hash_lock in order to
>     sleep until the completion is done.
> 
> There are some further optimizations that can be made. For example, we
> can avoid dropping the hash_lock if there is no switchdev response pending.
> And we can move some of that completion logic in br_switchdev.c such
> that it is compiled out on a CONFIG_NET_SWITCHDEV=n build. I haven't
> done those here, since they aren't exactly trivial. Mainly searching for
> high-level feedback first and foremost.
> 
> The structure of the patch series is:
> - patches 1-6 are me toying around with some code organization while I
>   was trying to understand the various call paths better. I like not
>   having forward declarations, but if they exist for a reason, I can
>   drop these patches.
> - patches 7-10 and 12 are some preparation work that can also be ignored.
> - patches 11 and 13 are where the meat of the series is.
> - patches 14 and 15 are DSA boilerplate so I could test what I'm doing.
> 

Hi,
Interesting way to work around the asynchronous notifiers. :) I went over
the patch-set and given that we'll have to support and maintain this fragile
solution (e.g. playing with locking, possible races with fdb changes etc) I'm
inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
with delayed learning from the fast-path to get a sleepable context where we can
use synchronous switchdev calls and get feedback immediately. That would be the
cleanest and most straight-forward solution, it'd be less error-prone and easier
to maintain long term. I plan to convert the bridge hash_lock to a mutex and then
you can do the synchronous switchdev change if you don't mind and agree of course.

By the way patches 1-6 can stand on their own, feel free to send them separately. 

Thanks,
 Nik


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 10:40 ` [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Nikolay Aleksandrov
@ 2021-10-26 11:25   ` Vladimir Oltean
  2021-10-26 12:20     ` Nikolay Aleksandrov
  0 siblings, 1 reply; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-26 11:25 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
> Hi,
> Interesting way to work around the asynchronous notifiers. :) I went over
> the patch-set and given that we'll have to support and maintain this fragile
> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
> with delayed learning from the fast-path to get a sleepable context where we can
> use synchronous switchdev calls and get feedback immediately.

Delayed learning means that we'll receive a sequence of packets like this:

            br0--------\
          /    \        \
         /      \        \
        /        \        \
     swp0         swp1    swp2
      |            |        |
   station A   station B  station C

station A sends request to B, station B sends reply to A.
Since the learning of station A's MAC SA races with the reply sent by
station B, it now becomes theoretically possible for the reply packet to
be flooded to station C as well, right? And that was not possible before
(at least assuming an ageing time longer than the round-trip time of these packets).

And that will happen regardless of whether switchdev is used or not.
I don't want to outright dismiss this (maybe I don't fully understand
this either), but it seems like a pretty heavy-handed change.

> That would be the
> cleanest and most straight-forward solution, it'd be less error-prone and easier
> to maintain long term. I plan to convert the bridge hash_lock to a mutex and then
> you can do the synchronous switchdev change if you don't mind and agree of course.

I agree that there are races and implications I haven't fully thought of,
with this temporary dropping of the br->hash_lock. It doesn't appear ideal.

For example,

/* Delete an FDB entry and notify switchdev. */
static int __br_fdb_delete(struct net_bridge *br,
			   const struct net_bridge_port *p,
			   const u8 *addr, u16 vlan,
			   struct netlink_ext_ack *extack)
{
	struct br_switchdev_fdb_wait_ctx wait_ctx;
	struct net_bridge_fdb_entry *fdb;
	int err;

	br_switchdev_fdb_wait_ctx_init(&wait_ctx);

	spin_lock_bh(&br->hash_lock);

	fdb = br_fdb_find(br, addr, vlan);
	if (!fdb || READ_ONCE(fdb->dst) != p) {
		spin_unlock_bh(&br->hash_lock);
		return -ENOENT;
	}

	br_fdb_notify_async(br, fdb, RTM_DELNEIGH, extack, &wait_ctx);

	spin_unlock_bh(&br->hash_lock);

	err = br_switchdev_fdb_wait(&wait_ctx); <- at this stage (more comments below)
	if (err)
		return err;

	/* We've notified rtnl and switchdev once, don't do it again,
	 * just delete.
	 */
	return fdb_delete_by_addr_and_port(br, p, addr, vlan, false);
}

the software FDB still contains the entry, while the hardware doesn't.
And we are no longer holding the lock, so somebody can either add or
delete that entry.

If somebody else tries to concurrently add that entry, it should not
notify switchdev again because it will see that the FDB entry exists,
and we should finally end up deleting it and result in a consistent
state.

If somebody else tries to concurrently delete that entry, it will
probably be from a code path that ignores errors (because the code paths
that don't are serialized by the rtnl_mutex). Switchdev will say "hey,
but I don't have this FDB entry, you've just deleted it", but that will
again be fine.

There seems to be a problem if somebody concurrently deletes that entry,
_and_then_ it gets added back again, all that before we call
fdb_delete_by_addr_and_port(). Because we don't notify switchdev the
second time around, we'll end up with an address in hardware but no
software counterpart.

I don't really know how to cleanly deal with that.

> By the way patches 1-6 can stand on their own, feel free to send them separately. 

Thanks, I will.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 11:25   ` Vladimir Oltean
@ 2021-10-26 12:20     ` Nikolay Aleksandrov
  2021-10-26 12:38       ` Ido Schimmel
  2021-10-26 16:54       ` Vladimir Oltean
  0 siblings, 2 replies; 29+ messages in thread
From: Nikolay Aleksandrov @ 2021-10-26 12:20 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

On 26/10/2021 14:25, Vladimir Oltean wrote:
> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
>> Hi,
>> Interesting way to work around the asynchronous notifiers. :) I went over
>> the patch-set and given that we'll have to support and maintain this fragile
>> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
>> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
>> with delayed learning from the fast-path to get a sleepable context where we can
>> use synchronous switchdev calls and get feedback immediately.
> 
> Delayed learning means that we'll receive a sequence of packets like this:
> 
>             br0--------\
>           /    \        \
>          /      \        \
>         /        \        \
>      swp0         swp1    swp2
>       |            |        |
>    station A   station B  station C
> 
> station A sends request to B, station B sends reply to A.
> Since the learning of station A's MAC SA races with the reply sent by
> station B, it now becomes theoretically possible for the reply packet to
> be flooded to station C as well, right? And that was not possible before
> (at least assuming an ageing time longer than the round-trip time of these packets).
> 
> And that will happen regardless of whether switchdev is used or not.
> I don't want to outright dismiss this (maybe I don't fully understand
> this either), but it seems like a pretty heavy-handed change.
> 

It will depend on lock contention; I plan to add a fast/uncontended case with
trylock from fast-path and if that fails then queue the fdb, but yes - in general
you are correct that the traffic could get flooded in the queue case before the delayed
learning processes the entry, it's a trade off if we want sleepable learning context.
Ido noted privately that's usually how hw acts anyway, also if people want guarantees
that the reply won't get flooded there are other methods to achieve that (ucast flood
disable, firewall rules etc). Today the reply could get flooded if the entry can't be programmed
as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely
but still there haven't been any such guarantees. I think it's generally a good improvement and
will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
infrastructure correct once now, then the locking rules and other use cases would be easier to enforce
and reason about in the future.
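
Roughly along these lines for the deferred part (just a sketch to show
the idea; br_deferred_fdb, br->deferred_fdbs, br->deferred_learn_work
and br_fdb_learn_one are made-up names, not part of this series, and
the uncontended trylock shortcut, error handling and port refcounting
are omitted):

struct br_deferred_fdb {
	struct llist_node node;
	struct net_bridge_port *p;
	unsigned char addr[ETH_ALEN];
	u16 vid;
};

/* Fast path: never block in softirq context, just queue the entry */
static void br_fdb_defer_learn(struct net_bridge *br,
			       struct net_bridge_port *p,
			       const unsigned char *addr, u16 vid)
{
	struct br_deferred_fdb *d = kmalloc(sizeof(*d), GFP_ATOMIC);

	if (!d)
		return; /* best effort, like a failed fdb_create() today */

	ether_addr_copy(d->addr, addr);
	d->vid = vid;
	d->p = p;
	llist_add(&d->node, &br->deferred_fdbs);
	queue_work(system_long_wq, &br->deferred_learn_work);
}

static void br_deferred_learn_work(struct work_struct *work)
{
	struct net_bridge *br = container_of(work, struct net_bridge,
					     deferred_learn_work);
	struct llist_node *entries = llist_del_all(&br->deferred_fdbs);
	struct br_deferred_fdb *d, *tmp;

	/* Sleepable context: the (future) hash mutex can be taken here and
	 * switchdev can be called synchronously, with immediate feedback.
	 */
	llist_for_each_entry_safe(d, tmp, entries, node) {
		br_fdb_learn_one(br, d->p, d->addr, d->vid);
		kfree(d);
	}
}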

>> That would be the
>> cleanest and most straight-forward solution, it'd be less error-prone and easier
>> to maintain long term. I plan to convert the bridge hash_lock to a mutex and then
>> you can do the synchronous switchdev change if you don't mind and agree of course.
> 
> I agree that there are races and implications I haven't fully thought of,
> with this temporary dropping of the br->hash_lock. It doesn't appear ideal.
> 
> For example,
> 
> /* Delete an FDB entry and notify switchdev. */
> static int __br_fdb_delete(struct net_bridge *br,
> 			   const struct net_bridge_port *p,
> 			   const u8 *addr, u16 vlan,
> 			   struct netlink_ext_ack *extack)
> {
> 	struct br_switchdev_fdb_wait_ctx wait_ctx;
> 	struct net_bridge_fdb_entry *fdb;
> 	int err;
> 
> 	br_switchdev_fdb_wait_ctx_init(&wait_ctx);
> 
> 	spin_lock_bh(&br->hash_lock);
> 
> 	fdb = br_fdb_find(br, addr, vlan);
> 	if (!fdb || READ_ONCE(fdb->dst) != p) {
> 		spin_unlock_bh(&br->hash_lock);
> 		return -ENOENT;
> 	}
> 
> 	br_fdb_notify_async(br, fdb, RTM_DELNEIGH, extack, &wait_ctx);
> 
> 	spin_unlock_bh(&br->hash_lock);
> 
> 	err = br_switchdev_fdb_wait(&wait_ctx); <- at this stage (more comments below)
> 	if (err)
> 		return err;
> 
> 	/* We've notified rtnl and switchdev once, don't do it again,
> 	 * just delete.
> 	 */
> 	return fdb_delete_by_addr_and_port(br, p, addr, vlan, false);
> }
> 
> the software FDB still contains the entry, while the hardware doesn't.
> And we are no longer holding the lock, so somebody can either add or
> delete that entry.
> 
> If somebody else tries to concurrently add that entry, it should not
> notify switchdev again because it will see that the FDB entry exists,
> and we should finally end up deleting it and result in a consistent
> state.
> 
> If somebody else tries to concurrently delete that entry, it will
> probably be from a code path that ignores errors (because the code paths
> that don't are serialized by the rtnl_mutex). Switchdev will say "hey,
> but I don't have this FDB entry, you've just deleted it", but that will
> again be fine.
> 
> There seems to be a problem if somebody concurrently deletes that entry,
> _and_then_ it gets added back again, all that before we call
> fdb_delete_by_addr_and_port(). Because we don't notify switchdev the
> second time around, we'll end up with an address in hardware but no
> software counterpart.
> 
> I don't really know how to cleanly deal with that.
> 

Right, I'm sure new cases will come up and it won't be easy to reason about these
races when changes need to be made. I'd rather stick to a more straight-forward
and simpler approach.

>> By the way patches 1-6 can stand on their own, feel free to send them separately. 
> 
> Thanks, I will.
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 12:20     ` Nikolay Aleksandrov
@ 2021-10-26 12:38       ` Ido Schimmel
  2021-10-26 16:54       ` Vladimir Oltean
  1 sibling, 0 replies; 29+ messages in thread
From: Ido Schimmel @ 2021-10-26 12:38 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Vladimir Oltean, netdev, Roopa Prabhu, Ido Schimmel,
	Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
> On 26/10/2021 14:25, Vladimir Oltean wrote:
> > On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
> >> Hi,
> >> Interesting way to work around the asynchronous notifiers. :) I went over
> >> the patch-set and given that we'll have to support and maintain this fragile
> >> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
> >> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
> >> with delayed learning from the fast-path to get a sleepable context where we can
> >> use synchronous switchdev calls and get feedback immediately.
> > 
> > Delayed learning means that we'll receive a sequence of packets like this:
> > 
> >             br0--------\
> >           /    \        \
> >          /      \        \
> >         /        \        \
> >      swp0         swp1    swp2
> >       |            |        |
> >    station A   station B  station C
> > 
> > station A sends request to B, station B sends reply to A.
> > Since the learning of station A's MAC SA races with the reply sent by
> > station B, it now becomes theoretically possible for the reply packet to
> > be flooded to station C as well, right? And that was not possible before
> > (at least assuming an ageing time longer than the round-trip time of these packets).
> > 
> > And that will happen regardless of whether switchdev is used or not.
> > I don't want to outright dismiss this (maybe I don't fully understand
> > this either), but it seems like a pretty heavy-handed change.
> > 
> 
> It will depend on lock contention; I plan to add a fast/uncontended case with
> trylock from fast-path and if that fails then queue the fdb, but yes - in general
> you are correct that the traffic could get flooded in the queue case before the delayed
> learning processes the entry, it's a trade off if we want sleepable learning context.
> Ido noted privately that's usually how hw acts anyway, also if people want guarantees

To be clear, I was referring to Spectrum, where the hardware doesn't
learn automatically, but instead notifies the CPU about entries that can
be learned / aged. It is then up to the software to program the entries.

I don't know how it works in other devices, but I assume Spectrum is not
special in this regard.

> that the reply won't get flooded there are other methods to achieve that (ucast flood
> disable, firewall rules etc). Today the reply could get flooded if the entry can't be programmed
> as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely
> but still there haven't been any such guarantees. I think it's generally a good improvement and
> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
> infrastructure correct once now, then the locking rules and other use cases would be easier to enforce
> and reason about in the future.

Yes, I agree. Moving processing away from the data path has
disadvantages (Vladimir's example), but it also has advantages, such as
synchronous feedback, improved performance (e.g., netlink notifications
no longer sent from the data path), etc.

> 
> >> That would be the
> >> cleanest and most straight-forward solution, it'd be less error-prone and easier
> >> to maintain long term. I plan to convert the bridge hash_lock to a mutex and then
> >> you can do the synchronous switchdev change if you don't mind and agree of course.
> > 
> > I agree that there are races and implications I haven't fully thought of,
> > with this temporary dropping of the br->hash_lock. It doesn't appear ideal.
> > 
> > For example,
> > 
> > /* Delete an FDB entry and notify switchdev. */
> > static int __br_fdb_delete(struct net_bridge *br,
> > 			   const struct net_bridge_port *p,
> > 			   const u8 *addr, u16 vlan,
> > 			   struct netlink_ext_ack *extack)
> > {
> > 	struct br_switchdev_fdb_wait_ctx wait_ctx;
> > 	struct net_bridge_fdb_entry *fdb;
> > 	int err;
> > 
> > 	br_switchdev_fdb_wait_ctx_init(&wait_ctx);
> > 
> > 	spin_lock_bh(&br->hash_lock);
> > 
> > 	fdb = br_fdb_find(br, addr, vlan);
> > 	if (!fdb || READ_ONCE(fdb->dst) != p) {
> > 		spin_unlock_bh(&br->hash_lock);
> > 		return -ENOENT;
> > 	}
> > 
> > 	br_fdb_notify_async(br, fdb, RTM_DELNEIGH, extack, &wait_ctx);
> > 
> > 	spin_unlock_bh(&br->hash_lock);
> > 
> > 	err = br_switchdev_fdb_wait(&wait_ctx); <- at this stage (more comments below)
> > 	if (err)
> > 		return err;
> > 
> > 	/* We've notified rtnl and switchdev once, don't do it again,
> > 	 * just delete.
> > 	 */
> > 	return fdb_delete_by_addr_and_port(br, p, addr, vlan, false);
> > }
> > 
> > the software FDB still contains the entry, while the hardware doesn't.
> > And we are no longer holding the lock, so somebody can either add or
> > delete that entry.
> > 
> > If somebody else tries to concurrently add that entry, it should not
> > notify switchdev again because it will see that the FDB entry exists,
> > and we should finally end up deleting it and result in a consistent
> > state.
> > 
> > If somebody else tries to concurrently delete that entry, it will
> > probably be from a code path that ignores errors (because the code paths
> > that don't are serialized by the rtnl_mutex). Switchdev will say "hey,
> > but I don't have this FDB entry, you've just deleted it", but that will
> > again be fine.
> > 
> > There seems to be a problem if somebody concurrently deletes that entry,
> > _and_then_ it gets added back again, all that before we call
> > fdb_delete_by_addr_and_port(). Because we don't notify switchdev the
> > second time around, we'll end up with an address in hardware but no
> > software counterpart.
> > 
> > I don't really know how to cleanly deal with that.
> > 
> 
> Right, I'm sure new cases will come up and it won't be easy to reason about these
> races when changes need to be made. I'd rather stick to a more straight-forward
> and simpler approach.

I think we should either convert to a mutex or stay with the current
behavior. Otherwise, the bridge is going to be really hard to maintain.

> 
> >> By the way patches 1-6 can stand on their own, feel free to send them separately. 
> > 
> > Thanks, I will.
> > 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 12:20     ` Nikolay Aleksandrov
  2021-10-26 12:38       ` Ido Schimmel
@ 2021-10-26 16:54       ` Vladimir Oltean
  2021-10-26 17:10         ` Nikolay Aleksandrov
  1 sibling, 1 reply; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-26 16:54 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
> On 26/10/2021 14:25, Vladimir Oltean wrote:
> > On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
> >> Hi,
> >> Interesting way to work around the asynchronous notifiers. :) I went over
> >> the patch-set and given that we'll have to support and maintain this fragile
> >> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
> >> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
> >> with delayed learning from the fast-path to get a sleepable context where we can
> >> use synchronous switchdev calls and get feedback immediately.
> > 
> > Delayed learning means that we'll receive a sequence of packets like this:
> > 
> >             br0--------\
> >           /    \        \
> >          /      \        \
> >         /        \        \
> >      swp0         swp1    swp2
> >       |            |        |
> >    station A   station B  station C
> > 
> > station A sends request to B, station B sends reply to A.
> > Since the learning of station A's MAC SA races with the reply sent by
> > station B, it now becomes theoretically possible for the reply packet to
> > be flooded to station C as well, right? And that was not possible before
> > (at least assuming an ageing time longer than the round-trip time of these packets).
> > 
> > And that will happen regardless of whether switchdev is used or not.
> > I don't want to outright dismiss this (maybe I don't fully understand
> > this either), but it seems like a pretty heavy-handed change.
> > 
> 
> It will depending on lock contention, I plan to add a fast/uncontended case with
> trylock from fast-path and if that fails then queue the fdb, but yes - in general

I wonder why mutex_trylock has this comment?

 * This function must not be used in interrupt context. The
 * mutex must be released by the same task that acquired it.

> you are correct that the traffic could get flooded in the queue case before the delayed
> learning processes the entry, it's a trade off if we want sleepable learning context.
> Ido noted privately that's usually how hw acts anyway, also if people want guarantees
> that the reply won't get flooded there are other methods to achieve that (ucast flood
> disable, firewall rules etc).

Not all hardware is like that, the switches I'm working with, which
perform autonomous learning, all complete the learning process for a
frame strictly before they start the forwarding process. The software
bridge also behaves like that. My only concern is that we might start
building on top of some fundamental bridge changes like these, which
might risk a revert a few months down the line, when somebody notices
and comes with a use case where that is not acceptable.

> Today the reply could get flooded if the entry can't be programmed
> as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely
> but still there haven't been any such guarantees. I think it's generally a good improvement and
> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
> infrastructure correct once now, then the locking rules and other use cases would be easier to enforce
> and reason about in the future.

You're the maintainer, I certainly won't complain if we go down this path.
It would be nice if br->lock can also be transformed into a mutex, it
would make all of switchdev much simpler.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 16:54       ` Vladimir Oltean
@ 2021-10-26 17:10         ` Nikolay Aleksandrov
  2021-10-26 19:01           ` Vladimir Oltean
  0 siblings, 1 reply; 29+ messages in thread
From: Nikolay Aleksandrov @ 2021-10-26 17:10 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

On 26/10/2021 19:54, Vladimir Oltean wrote:
> On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
>> On 26/10/2021 14:25, Vladimir Oltean wrote:
>>> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
>>>> Hi,
>>>> Interesting way to work around the asynchronous notifiers. :) I went over
>>>> the patch-set and given that we'll have to support and maintain this fragile
>>>> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
>>>> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
>>>> with delayed learning from the fast-path to get a sleepable context where we can
>>>> use synchronous switchdev calls and get feedback immediately.
>>>
>>> Delayed learning means that we'll receive a sequence of packets like this:
>>>
>>>             br0--------\
>>>           /    \        \
>>>          /      \        \
>>>         /        \        \
>>>      swp0         swp1    swp2
>>>       |            |        |
>>>    station A   station B  station C
>>>
>>> station A sends request to B, station B sends reply to A.
>>> Since the learning of station A's MAC SA races with the reply sent by
>>> station B, it now becomes theoretically possible for the reply packet to
>>> be flooded to station C as well, right? And that was not possible before
>>> (at least assuming an ageing time longer than the round-trip time of these packets).
>>>
>>> And that will happen regardless of whether switchdev is used or not.
>>> I don't want to outright dismiss this (maybe I don't fully understand
>>> this either), but it seems like a pretty heavy-handed change.
>>>
>>
>> It will depending on lock contention, I plan to add a fast/uncontended case with
>> trylock from fast-path and if that fails then queue the fdb, but yes - in general
> 
> I wonder why mutex_trylock has this comment?
> 
>  * This function must not be used in interrupt context. The
>  * mutex must be released by the same task that acquired it.
> 
>> you are correct that the traffic could get flooded in the queue case before the delayed
>> learning processes the entry, it's a trade off if we want sleepable learning context.
>> Ido noted privately that's usually how hw acts anyway, also if people want guarantees
>> that the reply won't get flooded there are other methods to achieve that (ucast flood
>> disable, firewall rules etc).
> 
> Not all hardware is like that, the switches I'm working with, which
> perform autonomous learning, all complete the learning process for a
> frame strictly before they start the forwarding process. The software
> bridge also behaves like that. My only concern is that we might start
> building on top of some fundamental bridge changes like these, which
> might risk a revert a few months down the line, when somebody notices
> and comes with a use case where that is not acceptable.
> 

I should've clarified I was talking about Spectrum as Ido did in a reply.

>> Today the reply could get flooded if the entry can't be programmed
>> as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely
>> but still there haven't been any such guarantees. I think it's generally a good improvement and
>> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
>> infrastructure correct once now, then the locking rules and other use cases would be easier to enforce
>> and reason about in the future.
> 
> You're the maintainer, I certainly won't complain if we go down this path.
> It would be nice if br->lock can also be transformed into a mutex, it
> would make all of switchdev much simpler.
> 

That is why we are discussing possible solutions, I don't want to force anything
but rather reach a consensus about the way forward. There are complexities involved with
moving to delayed learning too, one would be that the queue won't be a simple linked list
but probably a structure that allows fast lookups (e.g. rbtree) to avoid duplicating entries,
we also may end up doing two stage lookup if the main hash table doesn't find an entry
to close the above scenario's window as much as possible. But as I said I think that we can get
these correct and well defined, after that we'll be able to reason about future changes and
use cases easier. I'm still thinking about the various corner cases with delayed learning, so
any feedback is welcome. I'll start working on it in a few days and will get a clearer
view of the issues once I start converting the bridge, but having straight-forward locking
rules and known deterministic behaviour sounds like a better long term plan.
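
To make the rbtree idea a bit more concrete, a minimal sketch of what a
pending-learn entry and its insertion could look like (all names are invented,
nothing here is final):

/* One queued learn event, keyed by (MAC, VLAN) so that a burst of
 * packets from the same station doesn't queue duplicates.
 */
struct br_fdb_pending {
	struct rb_node		rbnode;
	struct rcu_head		rcu;	/* freed via kfree_rcu() by the work */
	struct net_bridge_port	*port;
	unsigned char		addr[ETH_ALEN];
	u16			vid;
};

static int br_fdb_pending_cmp(const struct br_fdb_pending *p,
			      const unsigned char *addr, u16 vid)
{
	int ret = memcmp(p->addr, addr, ETH_ALEN);

	return ret ? ret : (int)p->vid - (int)vid;
}

/* Called from the fast path under a dedicated spinlock; the deferred
 * work drains the tree later, in sleepable context, and does the real
 * fdb_create()/fdb_notify() under the hash mutex.
 */
static bool br_fdb_pending_insert(struct rb_root *root,
				  struct br_fdb_pending *new)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;

	while (*link) {
		struct br_fdb_pending *p;
		int cmp;

		parent = *link;
		p = rb_entry(parent, struct br_fdb_pending, rbnode);
		cmp = br_fdb_pending_cmp(p, new->addr, new->vid);
		if (!cmp)
			return false;	/* already queued */
		link = cmp < 0 ? &parent->rb_right : &parent->rb_left;
	}

	rb_link_node(&new->rbnode, parent, link);
	rb_insert_color(&new->rbnode, root);
	return true;
}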


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 17:10         ` Nikolay Aleksandrov
@ 2021-10-26 19:01           ` Vladimir Oltean
  2021-10-26 19:56             ` Nikolay Aleksandrov
  0 siblings, 1 reply; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-26 19:01 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

On Tue, Oct 26, 2021 at 08:10:45PM +0300, Nikolay Aleksandrov wrote:
> On 26/10/2021 19:54, Vladimir Oltean wrote:
> > On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
> >> On 26/10/2021 14:25, Vladimir Oltean wrote:
> >>> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
> >>>> Hi,
> >>>> Interesting way to work around the asynchronous notifiers. :) I went over
> >>>> the patch-set and given that we'll have to support and maintain this fragile
> >>>> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
> >>>> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
> >>>> with delayed learning from the fast-path to get a sleepable context where we can
> >>>> use synchronous switchdev calls and get feedback immediately.
> >>>
> >>> Delayed learning means that we'll receive a sequence of packets like this:
> >>>
> >>>             br0--------\
> >>>           /    \        \
> >>>          /      \        \
> >>>         /        \        \
> >>>      swp0         swp1    swp2
> >>>       |            |        |
> >>>    station A   station B  station C
> >>>
> >>> station A sends request to B, station B sends reply to A.
> >>> Since the learning of station A's MAC SA races with the reply sent by
> >>> station B, it now becomes theoretically possible for the reply packet to
> >>> be flooded to station C as well, right? And that was not possible before
> >>> (at least assuming an ageing time longer than the round-trip time of these packets).
> >>>
> >>> And that will happen regardless of whether switchdev is used or not.
> >>> I don't want to outright dismiss this (maybe I don't fully understand
> >>> this either), but it seems like a pretty heavy-handed change.
> >>>
> >>
> >> It will depending on lock contention, I plan to add a fast/uncontended case with
> >> trylock from fast-path and if that fails then queue the fdb, but yes - in general
> > 
> > I wonder why mutex_trylock has this comment?
> > 
> >  * This function must not be used in interrupt context. The
> >  * mutex must be released by the same task that acquired it.
> > 
> >> you are correct that the traffic could get flooded in the queue case before the delayed
> >> learning processes the entry, it's a trade off if we want sleepable learning context.
> >> Ido noted privately that's usually how hw acts anyway, also if people want guarantees
> >> that the reply won't get flooded there are other methods to achieve that (ucast flood
> >> disable, firewall rules etc).
> > 
> > Not all hardware is like that, the switches I'm working with, which
> > perform autonomous learning, all complete the learning process for a
> > frame strictly before they start the forwarding process. The software
> > bridge also behaves like that. My only concern is that we might start
> > building on top of some fundamental bridge changes like these, which
> > might risk a revert a few months down the line, when somebody notices
> > and comes with a use case where that is not acceptable.
> > 
> 
> I should've clarified I was talking about Spectrum as Ido did in a reply.

Ido also said "I assume Spectrum is not special in this regard" and I
just wanted to say this such that we don't end with the wrong impression.
Special or not, to the software bridge it would be new behavior, which
I can only hope will be well received.

> >> Today the reply could get flooded if the entry can't be programmed
> >> as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely
> >> but still there haven't been any such guarantees. I think it's generally a good improvement and
> >> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
> >> infrastructure correct once now, then the locking rules and other use cases would be easier to enforce
> >> and reason about in the future.
> > 
> > You're the maintainer, I certainly won't complain if we go down this path.
> > It would be nice if br->lock can also be transformed into a mutex, it
> > would make all of switchdev much simpler.
> > 
> 
> That is why we are discussing possible solutions, I don't want to force anything
> but rather reach a consensus about the way forward. There are complexities involved with
> moving to delayed learning too, one would be that the queue won't be a simple linked list
> but probably a structure that allows fast lookups (e.g. rbtree) to avoid duplicating entries,
> we also may end up doing two stage lookup if the main hash table doesn't find an entry
> to close the above scenario's window as much as possible. But as I said I think that we can get
> these correct and well defined, after that we'll be able to reason about future changes and
> use cases easier. I'm still thinking about the various corner cases with delayed learning, so
> any feedback is welcome. I'll start working on it in a few days and will get a clearer
> view of the issues once I start converting the bridge, but having straight-forward locking
> rules and known deterministic behaviour sounds like a better long term plan.

Sorry, I might have to read the code, I don't want to misinterpret your
idea. With what you're describing here, I think that what you are trying
to achieve is to both have it our way, and preserve the current behavior
for the common case, where learning still happens from the fast path.
But I'm not sure that this is the correct goal, especially if you also
add "straightforward locking rules" to the mix.

I think you'd have to explain what is the purpose of the fast path trylock
logic you've mentioned above. So in the uncontended br->hash_lock case,
br_fdb_update() could take that mutex and then what? It would create and
add the FDB entry, and call fdb_notify(), but that still can't sleep.
So if switchdev drivers still need to privately defer the offloading
work, we still need some crazy completion-based mechanism to report
errors like the one I posted here, because in the general sense,
SWITCHDEV_FDB_ADD_TO_DEVICE will still be atomic.

And how do you queue a deletion, do you delete the FDB from the pending
and the main hash table, or do you just create a deletion command to be
processed in deferred context?

And since you'd be operating on the hash table concurrently from the
deferred work and from the fast path, doesn't this mean you'll need to
use some sort of spin_lock_bh from the deferred work, which again would
incur atomic context when you want to notify switchdev? Because with the
mutex_trylock described above, you'd gain exclusivity to the main hash
table, but if you lose, what you need is exclusivity to the pending hash
table. So the deferred context also needs to be atomic when it dequeues
the pending FDB entries and notifies them. I just don't see what we're
winning, how the rtnetlink functions will be any different for the better.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 19:01           ` Vladimir Oltean
@ 2021-10-26 19:56             ` Nikolay Aleksandrov
  2021-10-26 21:51               ` Vladimir Oltean
  0 siblings, 1 reply; 29+ messages in thread
From: Nikolay Aleksandrov @ 2021-10-26 19:56 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

On 26/10/2021 22:01, Vladimir Oltean wrote:
> On Tue, Oct 26, 2021 at 08:10:45PM +0300, Nikolay Aleksandrov wrote:
>> On 26/10/2021 19:54, Vladimir Oltean wrote:
>>> On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
>>>> On 26/10/2021 14:25, Vladimir Oltean wrote:
>>>>> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
>>>>>> Hi,
>>>>>> Interesting way to work around the asynchronous notifiers. :) I went over
>>>>>> the patch-set and given that we'll have to support and maintain this fragile
>>>>>> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
>>>>>> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
>>>>>> with delayed learning from the fast-path to get a sleepable context where we can
>>>>>> use synchronous switchdev calls and get feedback immediately.
>>>>>
>>>>> Delayed learning means that we'll receive a sequence of packets like this:
>>>>>
>>>>>             br0--------\
>>>>>           /    \        \
>>>>>          /      \        \
>>>>>         /        \        \
>>>>>      swp0         swp1    swp2
>>>>>       |            |        |
>>>>>    station A   station B  station C
>>>>>
>>>>> station A sends request to B, station B sends reply to A.
>>>>> Since the learning of station A's MAC SA races with the reply sent by
>>>>> station B, it now becomes theoretically possible for the reply packet to
>>>>> be flooded to station C as well, right? And that was not possible before
>>>>> (at least assuming an ageing time longer than the round-trip time of these packets).
>>>>>
>>>>> And that will happen regardless of whether switchdev is used or not.
>>>>> I don't want to outright dismiss this (maybe I don't fully understand
>>>>> this either), but it seems like a pretty heavy-handed change.
>>>>>
>>>>
>>>> It will depending on lock contention, I plan to add a fast/uncontended case with
>>>> trylock from fast-path and if that fails then queue the fdb, but yes - in general
>>>
>>> I wonder why mutex_trylock has this comment?
>>>
>>>  * This function must not be used in interrupt context. The
>>>  * mutex must be released by the same task that acquired it.
>>>
>>>> you are correct that the traffic could get flooded in the queue case before the delayed
>>>> learning processes the entry, it's a trade off if we want sleepable learning context.
>>>> Ido noted privately that's usually how hw acts anyway, also if people want guarantees
>>>> that the reply won't get flooded there are other methods to achieve that (ucast flood
>>>> disable, firewall rules etc).
>>>
>>> Not all hardware is like that, the switches I'm working with, which
>>> perform autonomous learning, all complete the learning process for a
>>> frame strictly before they start the forwarding process. The software
>>> bridge also behaves like that. My only concern is that we might start
>>> building on top of some fundamental bridge changes like these, which
>>> might risk a revert a few months down the line, when somebody notices
>>> and comes with a use case where that is not acceptable.
>>>
>>
>> I should've clarified I was talking about Spectrum as Ido did in a reply.
> 
> Ido also said "I assume Spectrum is not special in this regard" and I
> just wanted to say this such that we don't end with the wrong impression.
> Special or not, to the software bridge it would be new behavior, which
> I can only hope will be well received.
> 
>>>> Today the reply could get flooded if the entry can't be programmed
>>>> as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely
>>>> but still there haven't been any such guarantees. I think it's generally a good improvement and
>>>> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
>>>> infrastructure correct once now, then the locking rules and other use cases would be easier to enforce
>>>> and reason about in the future.
>>>
>>> You're the maintainer, I certainly won't complain if we go down this path.
>>> It would be nice if br->lock can also be transformed into a mutex, it
>>> would make all of switchdev much simpler.
>>>
>>
>> That is why we are discussing possible solutions, I don't want to force anything
>> but rather reach a consensus about the way forward. There are complexities involved with
>> moving to delayed learning too, one would be that the queue won't be a simple linked list
>> but probably a structure that allows fast lookups (e.g. rbtree) to avoid duplicating entries,
>> we also may end up doing two stage lookup if the main hash table doesn't find an entry
>> to close the above scenario's window as much as possible. But as I said I think that we can get
>> these correct and well defined, after that we'll be able to reason about future changes and
>> use cases easier. I'm still thinking about the various corner cases with delayed learning, so
>> any feedback is welcome. I'll start working on it in a few days and will get a clearer
>> view of the issues once I start converting the bridge, but having straight-forward locking
>> rules and known deterministic behaviour sounds like a better long term plan.
> 
> Sorry, I might have to read the code, I don't want to misinterpret your
> idea. With what you're describing here, I think that what you are trying
> to achieve is to both have it our way, and preserve the current behavior
> for the common case, where learning still happens from the fast path.
> But I'm not sure that this is the correct goal, especially if you also
> add "straightforward locking rules" to the mix.
> 

The trylock was only an optimization idea, but yes you'd need both synchronous
and asynchronous notifiers. I don't think you need sleepable context when
learning from softirq, who would you propagate the error to? It is important
when entries are being added from user-space, but if you'd like to have veto
option from softirq then we can scratch the trylock idea altogether.
Let's not focus on the trylock, it's a minor detail.

> I think you'd have to explain what is the purpose of the fast path trylock
> logic you've mentioned above. So in the uncontended br->hash_lock case,
> br_fdb_update() could take that mutex and then what? It would create and
> add the FDB entry, and call fdb_notify(), but that still can't sleep.
> So if switchdev drivers still need to privately defer the offloading
> work, we still need some crazy completion-based mechanism to report
> errors like the one I posted here, because in the general sense,
> SWITCHDEV_FDB_ADD_TO_DEVICE will still be atomic.

We do not if we have ADD_TO_DEVICE and ADD_TO_DEVICE_SYNC, but that optimization
is probably not worth the complexity of maintaining both so we can just drop the
trylock.

> 
> And how do you queue a deletion, do you delete the FDB from the pending
> and the main hash table, or do you just create a deletion command to be
> processed in deferred context?
> 

All commands which cannot take the mutex directly will be executed from a delayed
queue. We also need a delayed flush operation because we need to flush from STP code
and we can't sleep there due to the STP spinlock. The pending table can be considered
only if we decide to do a 2-stage lookup, it won't be used or consulted in any other
case, so user-space adds and deletes go only the main table.
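
For example, the STP flush could be queued as a deferred operation along these
lines (again just a sketch; the deferred-op structures and the fdb_deferred_*
fields are invented for illustration):

/* One deferred FDB operation, queued from contexts that cannot sleep
 * (STP under its spinlock, softirq learning) and executed later from a
 * work item that is allowed to take the hash mutex and call switchdev
 * synchronously.
 */
enum br_fdb_deferred_op_type {
	BR_FDB_DEFERRED_ADD,
	BR_FDB_DEFERRED_DEL,
	BR_FDB_DEFERRED_FLUSH_PORT,
};

struct br_fdb_deferred_op {
	struct list_head	list;
	enum br_fdb_deferred_op_type type;
	struct net_bridge_port	*port;		/* used by FLUSH_PORT */
	unsigned char		addr[ETH_ALEN];	/* used by ADD/DEL */
	u16			vid;
};

/* e.g. called from the STP code, which already runs with BH disabled */
static void br_fdb_queue_flush_port(struct net_bridge *br,
				    struct net_bridge_port *p)
{
	struct br_fdb_deferred_op *op = kzalloc(sizeof(*op), GFP_ATOMIC);

	if (!op)
		return;	/* best effort, same as today's atomic allocations */

	op->type = BR_FDB_DEFERRED_FLUSH_PORT;
	op->port = p;

	spin_lock(&br->fdb_deferred_lock);
	list_add_tail(&op->list, &br->fdb_deferred_list);
	spin_unlock(&br->fdb_deferred_lock);

	queue_work(system_long_wq, &br->fdb_deferred_work);
}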

> And since you'd be operating on the hash table concurrently from the
> deferred work and from the fast path, doesn't this mean you'll need to
> use some sort of spin_lock_bh from the deferred work, which again would
> incur atomic context when you want to notify switchdev? Because with the
> mutex_trylock described above, you'd gain exclusivity to the main hash
> table, but if you lose, what you need is exclusivity to the pending hash
> table. So the deferred context also needs to be atomic when it dequeues
> the pending FDB entries and notifies them. I just don't see what we're
> winning, how the rtnetlink functions will be any different for the better.
> 

The rbtree can be fully taken by the delayed action and swapped with a new one,
so no exclusivity needed for the notifications. It will take the spinlock, get
the whole tree and release the lock, same if it was a simple queue.
The spinlock for the rbtree for the delayed learning is necessary in all cases,
if we used the trylock fast learn then we could directly insert the entry if we win, but
again lets just always use delayed ops from atomic contexts as a start.

All entries from atomic contexts will be added to an rbtree which will be processed
from a delayed work, it will be freed by RCU so lookups can be done if we decide to
do a 2-stage lookup for the fast Rx path to reduce the flooding case window that you
described above.

We win having sleepable context for user-space calls and being able to do synchronous
calls to the drivers to wait for the errors.
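
So the end goal for the user-space path would be roughly (sketch only, assuming
hash_lock has become a mutex; the two helpers are placeholders, not an existing
API):

static int br_fdb_add_sync(struct net_bridge *br, struct net_bridge_port *p,
			   const unsigned char *addr, u16 vid,
			   struct netlink_ext_ack *extack)
{
	int err;

	mutex_lock(&br->hash_lock);

	/* may sleep: blocking switchdev call, drivers can veto with extack */
	err = br_switchdev_fdb_add_sync(p, addr, vid, extack);
	if (!err)
		err = br_fdb_add_sw_entry(br, p, addr, vid); /* software FDB */

	mutex_unlock(&br->hash_lock);

	return err;
}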


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 19:56             ` Nikolay Aleksandrov
@ 2021-10-26 21:51               ` Vladimir Oltean
  2021-10-26 22:27                 ` Nikolay Aleksandrov
  2021-10-27  7:24                 ` Ido Schimmel
  0 siblings, 2 replies; 29+ messages in thread
From: Vladimir Oltean @ 2021-10-26 21:51 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

On Tue, Oct 26, 2021 at 10:56:59PM +0300, Nikolay Aleksandrov wrote:
> On 26/10/2021 22:01, Vladimir Oltean wrote:
> > On Tue, Oct 26, 2021 at 08:10:45PM +0300, Nikolay Aleksandrov wrote:
> >> On 26/10/2021 19:54, Vladimir Oltean wrote:
> >>> On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
> >>>> On 26/10/2021 14:25, Vladimir Oltean wrote:
> >>>>> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
> >>>>>> Hi,
> >>>>>> Interesting way to work around the asynchronous notifiers. :) I went over
> >>>>>> the patch-set and given that we'll have to support and maintain this fragile
> >>>>>> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
> >>>>>> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
> >>>>>> with delayed learning from the fast-path to get a sleepable context where we can
> >>>>>> use synchronous switchdev calls and get feedback immediately.
> >>>>>
> >>>>> Delayed learning means that we'll receive a sequence of packets like this:
> >>>>>
> >>>>>             br0--------\
> >>>>>           /    \        \
> >>>>>          /      \        \
> >>>>>         /        \        \
> >>>>>      swp0         swp1    swp2
> >>>>>       |            |        |
> >>>>>    station A   station B  station C
> >>>>>
> >>>>> station A sends request to B, station B sends reply to A.
> >>>>> Since the learning of station A's MAC SA races with the reply sent by
> >>>>> station B, it now becomes theoretically possible for the reply packet to
> >>>>> be flooded to station C as well, right? And that was not possible before
> >>>>> (at least assuming an ageing time longer than the round-trip time of these packets).
> >>>>>
> >>>>> And that will happen regardless of whether switchdev is used or not.
> >>>>> I don't want to outright dismiss this (maybe I don't fully understand
> >>>>> this either), but it seems like a pretty heavy-handed change.
> >>>>>
> >>>>
> >>>> It will depending on lock contention, I plan to add a fast/uncontended case with
> >>>> trylock from fast-path and if that fails then queue the fdb, but yes - in general
> >>>
> >>> I wonder why mutex_trylock has this comment?
> >>>
> >>>  * This function must not be used in interrupt context. The
> >>>  * mutex must be released by the same task that acquired it.
> >>>
> >>>> you are correct that the traffic could get flooded in the queue case before the delayed
> >>>> learning processes the entry, it's a trade off if we want sleepable learning context.
> >>>> Ido noted privately that's usually how hw acts anyway, also if people want guarantees
> >>>> that the reply won't get flooded there are other methods to achieve that (ucast flood
> >>>> disable, firewall rules etc).
> >>>
> >>> Not all hardware is like that, the switches I'm working with, which
> >>> perform autonomous learning, all complete the learning process for a
> >>> frame strictly before they start the forwarding process. The software
> >>> bridge also behaves like that. My only concern is that we might start
> >>> building on top of some fundamental bridge changes like these, which
> >>> might risk a revert a few months down the line, when somebody notices
> >>> and comes with a use case where that is not acceptable.
> >>>
> >>
> >> I should've clarified I was talking about Spectrum as Ido did in a reply.
> > 
> > Ido also said "I assume Spectrum is not special in this regard" and I
> > just wanted to say this such that we don't end with the wrong impression.
> > Special or not, to the software bridge it would be new behavior, which
> > I can only hope will be well received.
> > 
> >>>> Today the reply could get flooded if the entry can't be programmed
> >>>> as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely
> >>>> but still there haven't been any such guarantees. I think it's generally a good improvement and
> >>>> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
> >>>> infrastructure correct once now, then the locking rules and other use cases would be easier to enforce
> >>>> and reason about in the future.
> >>>
> >>> You're the maintainer, I certainly won't complain if we go down this path.
> >>> It would be nice if br->lock can also be transformed into a mutex, it
> >>> would make all of switchdev much simpler.
> >>>
> >>
> >> That is why we are discussing possible solutions, I don't want to force anything
> >> but rather reach a consensus about the way forward. There are complexities involved with
> >> moving to delayed learning too, one would be that the queue won't be a simple linked list
> >> but probably a structure that allows fast lookups (e.g. rbtree) to avoid duplicating entries,
> >> we also may end up doing two stage lookup if the main hash table doesn't find an entry
> >> to close the above scenario's window as much as possible. But as I said I think that we can get
> >> these correct and well defined, after that we'll be able to reason about future changes and
> >> use cases easier. I'm still thinking about the various corner cases with delayed learning, so
> >> any feedback is welcome. I'll start working on it in a few days and will get a clearer
> >> view of the issues once I start converting the bridge, but having straight-forward locking
> >> rules and known deterministic behaviour sounds like a better long term plan.
> > 
> > Sorry, I might have to read the code, I don't want to misinterpret your
> > idea. With what you're describing here, I think that what you are trying
> > to achieve is to both have it our way, and preserve the current behavior
> > for the common case, where learning still happens from the fast path.
> > But I'm not sure that this is the correct goal, especially if you also
> > add "straightforward locking rules" to the mix.
> > 
> 
> The trylock was only an optimization idea, but yes you'd need both synchronous
> and asynchronous notifiers. I don't think you need sleepable context when
> learning from softirq, who would you propagate the error to? It is important
> when entries are being added from user-space, but if you'd like to have veto
> option from softirq then we can scratch the trylock idea altogether.

I'll let Ido answer here. As I said, the model I'm working with is that
of autonomous learning, so for me, no. Whereas the Spectrum model is
that of secure learning. I expect that it'd be pretty useless to set up
software assisted secure learning if you're just going to say yes and
learn all addresses anyway. I've never seen Spectrum documentation, but
I would be shocked if it wouldn't be able to be configured to operate in
the bare-bones autonomous learning mode too.

> Let's not focus on the trylock, it's a minor detail.
> 
> > I think you'd have to explain what is the purpose of the fast path trylock
> > logic you've mentioned above. So in the uncontended br->hash_lock case,
> > br_fdb_update() could take that mutex and then what? It would create and
> > add the FDB entry, and call fdb_notify(), but that still can't sleep.
> > So if switchdev drivers still need to privately defer the offloading
> > work, we still need some crazy completion-based mechanism to report
> > errors like the one I posted here, because in the general sense,
> > SWITCHDEV_FDB_ADD_TO_DEVICE will still be atomic.
> 
> We do not if we have ADD_TO_DEVICE and ADD_TO_DEVICE_SYNC,

Just when I was about to say that I'm happy to get rid of some of those
workqueues, and now you're telling me not only we might not get rid of
them, but we might also end up with a second, separate implementation :)

Anyway, let's not put the cart before the horses, let's see what the
realities of the bridge data path learning and STP flushing will teach
us about what can and can't be done.

> but that optimization is probably not worth the complexity of
> maintaining both so we can just drop the trylock.
> 
> > 
> > And how do you queue a deletion, do you delete the FDB from the pending
> > and the main hash table, or do you just create a deletion command to be
> > processed in deferred context?
> > 
> 
> All commands which cannot take the mutex directly will be executed from a delayed
> queue. We also need a delayed flush operation because we need to flush from STP code
> and we can't sleep there due to the STP spinlock. The pending table can be considered
> only if we decide to do a 2-stage lookup, it won't be used or consulted in any other
> case, so user-space adds and deletes go only the main table.
> 
> > And since you'd be operating on the hash table concurrently from the
> > deferred work and from the fast path, doesn't this mean you'll need to
> > use some sort of spin_lock_bh from the deferred work, which again would
> > incur atomic context when you want to notify switchdev? Because with the
> > mutex_trylock described above, you'd gain exclusivity to the main hash
> > table, but if you lose, what you need is exclusivity to the pending hash
> > table. So the deferred context also needs to be atomic when it dequeues
> > the pending FDB entries and notifies them. I just don't see what we're
> > winning, how the rtnetlink functions will be any different for the better.
> > 
> 
> The rbtree can be fully taken by the delayed action and swapped with a new one,
> so no exclusivity needed for the notifications. It will take the spinlock, get
> the whole tree and release the lock, same if it was a simple queue.

I'm quite unfamiliar with this technique, atomically swapping a queue
pointer with a new empty one, and emptying that queue while unlocked.
Do you have any reference implementation for this kind of technique?

> The spinlock for the rbtree for the delayed learning is necessary in all cases,

"in all cases" means "regardless of whether we try from the fast path to
make fdb_create() insert directly into &br->fdb_hash_tbl, or if we
insert into a temporary rbtree for pending entries"? I don't understand
this: why would you take the rbtree spinlock if you've inserted into the
main hash table already?

I'm only concerned about spin locks we'd have to hold while calling
fdb_notify().

> if we used the trylock fast learn then we could directly insert the entry if we win, but
> again lets just always use delayed ops from atomic contexts as a start.
> 
> All entries from atomic contexts will be added to an rbtree which will be processed
> from a delayed work, it will be freed by RCU so lookups can be done if we decide to
> do a 2-stage lookup for the fast Rx path to reduce the flooding case window that you
> described above.
> 
> We win having sleepable context for user-space calls and being able to do synchronous
> calls to the drivers to wait for the errors.

I think I'd really have to see the code at this point, sorry.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 21:51               ` Vladimir Oltean
@ 2021-10-26 22:27                 ` Nikolay Aleksandrov
  2021-10-27  9:20                   ` Nikolay Aleksandrov
  2021-10-27  7:24                 ` Ido Schimmel
  1 sibling, 1 reply; 29+ messages in thread
From: Nikolay Aleksandrov @ 2021-10-26 22:27 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

On 27/10/2021 00:51, Vladimir Oltean wrote:
> On Tue, Oct 26, 2021 at 10:56:59PM +0300, Nikolay Aleksandrov wrote:
>> On 26/10/2021 22:01, Vladimir Oltean wrote:
>>> On Tue, Oct 26, 2021 at 08:10:45PM +0300, Nikolay Aleksandrov wrote:
>>>> On 26/10/2021 19:54, Vladimir Oltean wrote:
>>>>> On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
>>>>>> On 26/10/2021 14:25, Vladimir Oltean wrote:
>>>>>>> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
>>>>>>>> Hi,
>>>>>>>> Interesting way to work around the asynchronous notifiers. :) I went over
>>>>>>>> the patch-set and given that we'll have to support and maintain this fragile
>>>>>>>> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
>>>>>>>> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
>>>>>>>> with delayed learning from the fast-path to get a sleepable context where we can
>>>>>>>> use synchronous switchdev calls and get feedback immediately.
>>>>>>>
>>>>>>> Delayed learning means that we'll receive a sequence of packets like this:
>>>>>>>
>>>>>>>             br0--------\
>>>>>>>           /    \        \
>>>>>>>          /      \        \
>>>>>>>         /        \        \
>>>>>>>      swp0         swp1    swp2
>>>>>>>       |            |        |
>>>>>>>    station A   station B  station C
>>>>>>>
>>>>>>> station A sends request to B, station B sends reply to A.
>>>>>>> Since the learning of station A's MAC SA races with the reply sent by
>>>>>>> station B, it now becomes theoretically possible for the reply packet to
>>>>>>> be flooded to station C as well, right? And that was not possible before
>>>>>>> (at least assuming an ageing time longer than the round-trip time of these packets).
>>>>>>>
>>>>>>> And that will happen regardless of whether switchdev is used or not.
>>>>>>> I don't want to outright dismiss this (maybe I don't fully understand
>>>>>>> this either), but it seems like a pretty heavy-handed change.
>>>>>>>
>>>>>>
>>>>>> It will depending on lock contention, I plan to add a fast/uncontended case with
>>>>>> trylock from fast-path and if that fails then queue the fdb, but yes - in general
>>>>>
>>>>> I wonder why mutex_trylock has this comment?
>>>>>
>>>>>  * This function must not be used in interrupt context. The
>>>>>  * mutex must be released by the same task that acquired it.
>>>>>
>>>>>> you are correct that the traffic could get flooded in the queue case before the delayed
>>>>>> learning processes the entry, it's a trade off if we want sleepable learning context.
>>>>>> Ido noted privately that's usually how hw acts anyway, also if people want guarantees
>>>>>> that the reply won't get flooded there are other methods to achieve that (ucast flood
>>>>>> disable, firewall rules etc).
>>>>>
>>>>> Not all hardware is like that, the switches I'm working with, which
>>>>> perform autonomous learning, all complete the learning process for a
>>>>> frame strictly before they start the forwarding process. The software
>>>>> bridge also behaves like that. My only concern is that we might start
>>>>> building on top of some fundamental bridge changes like these, which
>>>>> might risk a revert a few months down the line, when somebody notices
>>>>> and comes with a use case where that is not acceptable.
>>>>>
>>>>
>>>> I should've clarified I was talking about Spectrum as Ido did in a reply.
>>>
>>> Ido also said "I assume Spectrum is not special in this regard" and I
>>> just wanted to say this such that we don't end with the wrong impression.
>>> Special or not, to the software bridge it would be new behavior, which
>>> I can only hope will be well received.
>>>
>>>>>> Today the reply could get flooded if the entry can't be programmed
>>>>>> as well, e.g. the atomic allocation might fail and we'll flood it again, granted it's much less likely
>>>>>> but still there haven't been any such guarantees. I think it's generally a good improvement and
>>>>>> will simplify a lot of processing complexity. We can bite the bullet and get the underlying delayed
>>>>>> infrastructure correct once now, then the locking rules and other use cases would be easier to enforce
>>>>>> and reason about in the future.
>>>>>
>>>>> You're the maintainer, I certainly won't complain if we go down this path.
>>>>> It would be nice if br->lock can also be transformed into a mutex, it
>>>>> would make all of switchdev much simpler.
>>>>>
>>>>
>>>> That is why we are discussing possible solutions, I don't want to force anything
>>>> but rather reach a consensus about the way forward. There are complexities involved with
>>>> moving to delayed learning too, one would be that the queue won't be a simple linked list
>>>> but probably a structure that allows fast lookups (e.g. rbtree) to avoid duplicating entries,
>>>> we also may end up doing two stage lookup if the main hash table doesn't find an entry
>>>> to close the above scenario's window as much as possible. But as I said I think that we can get
>>>> these correct and well defined, after that we'll be able to reason about future changes and
>>>> use cases easier. I'm still thinking about the various corner cases with delayed learning, so
>>>> any feedback is welcome. I'll start working on it in a few days and will get a clearer
>>>> view of the issues once I start converting the bridge, but having straight-forward locking
>>>> rules and known deterministic behaviour sounds like a better long term plan.
>>>
>>> Sorry, I might have to read the code, I don't want to misinterpret your
>>> idea. With what you're describing here, I think that what you are trying
>>> to achieve is to both have it our way, and preserve the current behavior
>>> for the common case, where learning still happens from the fast path.
>>> But I'm not sure that this is the correct goal, especially if you also
>>> add "straightforward locking rules" to the mix.
>>>
>>
>> The trylock was only an optimization idea, but yes you'd need both synchronous
>> and asynchronous notifiers. I don't think you need sleepable context when
>> learning from softirq, who would you propagate the error to? It is important
>> when entries are being added from user-space, but if you'd like to have veto
>> option from softirq then we can scratch the trylock idea altogether.
> 
> I'll let Ido answer here. As I said, the model I'm working with is that
> of autonomous learning, so for me, no. Whereas the Spectrum model is
> that of secure learning. I expect that it'd be pretty useless to set up
> software assisted secure learning if you're just going to say yes and
> learn all addresses anyway. I've never seen Spectrum documentation, but
> I would be shocked if it wouldn't be able to be configured to operate in
> the bare-bones autonomous learning mode too.
> 

Ack, got it.

>> Let's not focus on the trylock, it's a minor detail.
>>
>>> I think you'd have to explain what is the purpose of the fast path trylock
>>> logic you've mentioned above. So in the uncontended br->hash_lock case,
>>> br_fdb_update() could take that mutex and then what? It would create and
>>> add the FDB entry, and call fdb_notify(), but that still can't sleep.
>>> So if switchdev drivers still need to privately defer the offloading
>>> work, we still need some crazy completion-based mechanism to report
>>> errors like the one I posted here, because in the general sense,
>>> SWITCHDEV_FDB_ADD_TO_DEVICE will still be atomic.
>>
>> We do not if we have ADD_TO_DEVICE and ADD_TO_DEVICE_SYNC,
> 
> Just when I was about to say that I'm happy to get rid of some of those
> workqueues, and now you're telling me not only we might not get rid of
> them, but we might also end up with a second, separate implementation :)
> 
> Anyway, let's not put the cart before the horses, let's see what the
> realities of the bridge data path learning and STP flushing will teach
> us about what can and can't be done.
> 

We will get rid of some workqueues (I hope). I was saying that if we do the trylock we
might have to add both sync and async variants; otherwise we just need a single sync one.

>> but that optimization is probably not worth the complexity of
>> maintaining both so we can just drop the trylock.
>>
>>>
>>> And how do you queue a deletion, do you delete the FDB from the pending
>>> and the main hash table, or do you just create a deletion command to be
>>> processed in deferred context?
>>>
>>
>> All commands which cannot take the mutex directly will be executed from a delayed
>> queue. We also need a delayed flush operation because we need to flush from STP code
>> and we can't sleep there due to the STP spinlock. The pending table can be considered
>> only if we decide to do a 2-stage lookup, it won't be used or consulted in any other
>> case, so user-space adds and deletes go only the main table.
>>
>>> And since you'd be operating on the hash table concurrently from the
>>> deferred work and from the fast path, doesn't this mean you'll need to
>>> use some sort of spin_lock_bh from the deferred work, which again would
>>> incur atomic context when you want to notify switchdev? Because with the
>>> mutex_trylock described above, you'd gain exclusivity to the main hash
>>> table, but if you lose, what you need is exclusivity to the pending hash
>>> table. So the deferred context also needs to be atomic when it dequeues
>>> the pending FDB entries and notifies them. I just don't see what we're
>>> winning, how the rtnetlink functions will be any different for the better.
>>>
>>
>> The rbtree can be fully taken by the delayed action and swapped with a new one,
>> so no exclusivity needed for the notifications. It will take the spinlock, get
>> the whole tree and release the lock, same if it was a simple queue.
> 
> I'm quite unfamiliar with this technique, atomically swapping a queue
> pointer with a new empty one, and emptying that queue while unlocked.
> Do you have any reference implementation for this kind of technique?
> 

the delayed work would be doing something similar to:

spin_lock(delay_lock);
record tree root
rcu change tree root with empty
spin_unlock(delay_lock);

mutex_lock(br->hash_lock);
process recorded (old) tree and free items via rcu
mutex_unlock(br->hash_lock);

We have the same pattern with queues all around the kernel.
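
In kernel C that would look roughly like the sketch below, reusing the
pending-entry structure sketched earlier in the thread (the fdb_pending_*
fields are invented and the actual fdb_create()/fdb_notify() work is elided):

static void br_fdb_pending_work_fn(struct work_struct *work)
{
	struct net_bridge *br = container_of(work, struct net_bridge,
					     fdb_pending_work);
	struct br_fdb_pending *p, *tmp;
	struct rb_root pending;

	/* grab the whole tree and give the fast path a fresh, empty one */
	spin_lock_bh(&br->fdb_pending_lock);
	pending = br->fdb_pending;
	br->fdb_pending = RB_ROOT;
	spin_unlock_bh(&br->fdb_pending_lock);

	/* sleepable from here on: synchronous switchdev calls are fine */
	mutex_lock(&br->hash_lock);
	rbtree_postorder_for_each_entry_safe(p, tmp, &pending, rbnode) {
		/* ... create the real fdb entry, notify switchdev ... */
		kfree_rcu(p, rcu);	/* RCU-delayed free, in case a 2-stage
					 * lookup is still walking the old tree
					 */
	}
	mutex_unlock(&br->hash_lock);
}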

>> The spinlock for the rbtree for the delayed learning is necessary in all cases,
> 
> "in all cases" means "regardless of whether we try from the fast path to
> make fdb_create() insert directly into &br->fdb_hash_tbl, or if we
> insert into a temporary rbtree for pending entries"? I don't understand
> this: why would you take the rbtree spinlock if you've inserted into the
> main hash table already?
> 

No, it means that we need the spinlock to protect the delayed queue.

> I'm only concerned about spin locks we'd have to hold while calling
> fdb_notify().
> 

There won't be any spinlocks held around fdb_notify(); we'll be doing all notifications
from sleepable context with the mutex held. That's the goal, at least.

>> if we used the trylock fast learn then we could directly insert the entry if we win, but
>> again lets just always use delayed ops from atomic contexts as a start.
>>
>> All entries from atomic contexts will be added to an rbtree which will be processed
>> from a delayed work, it will be freed by RCU so lookups can be done if we decide to
>> do a 2-stage lookup for the fast Rx path to reduce the flooding case window that you
>> described above.
>>
>> We win having sleepable context for user-space calls and being able to do synchronous
>> calls to the drivers to wait for the errors.
> 
> I think I'd really have to see the code at this point, sorry.
> 

Sure, I'll prepare an RFC version this week.

Thanks for the comments.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 21:51               ` Vladimir Oltean
  2021-10-26 22:27                 ` Nikolay Aleksandrov
@ 2021-10-27  7:24                 ` Ido Schimmel
  1 sibling, 0 replies; 29+ messages in thread
From: Ido Schimmel @ 2021-10-27  7:24 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Nikolay Aleksandrov, netdev, Roopa Prabhu, Ido Schimmel,
	Jakub Kicinski, David S. Miller, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Jiri Pirko

On Tue, Oct 26, 2021 at 09:51:54PM +0000, Vladimir Oltean wrote:
> I'll let Ido answer here. As I said, the model I'm working with is that
> of autonomous learning, so for me, no. Whereas the Spectrum model is
> that of secure learning. I expect that it'd be pretty useless to set up
> software assisted secure learning if you're just going to say yes and
> learn all addresses anyway. I've never seen Spectrum documentation, but
> I would be shocked if it wouldn't be able to be configured to operate in
> the bare-bones autonomous learning mode too.

Hi,

Yes, you are correct. It can support automatic learning, but it was
never enabled. We update the software bridge about learned FDB entries
(unlike DSA I think?), so secure learning makes sense in our case.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-26 22:27                 ` Nikolay Aleksandrov
@ 2021-10-27  9:20                   ` Nikolay Aleksandrov
  2021-10-27  9:36                     ` Nikolay Aleksandrov
  0 siblings, 1 reply; 29+ messages in thread
From: Nikolay Aleksandrov @ 2021-10-27  9:20 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

On 27/10/2021 01:27, Nikolay Aleksandrov wrote:
> On 27/10/2021 00:51, Vladimir Oltean wrote:
>> On Tue, Oct 26, 2021 at 10:56:59PM +0300, Nikolay Aleksandrov wrote:
>>> On 26/10/2021 22:01, Vladimir Oltean wrote:
>>>> On Tue, Oct 26, 2021 at 08:10:45PM +0300, Nikolay Aleksandrov wrote:
>>>>> On 26/10/2021 19:54, Vladimir Oltean wrote:
>>>>>> On Tue, Oct 26, 2021 at 03:20:03PM +0300, Nikolay Aleksandrov wrote:
>>>>>>> On 26/10/2021 14:25, Vladimir Oltean wrote:
>>>>>>>> On Tue, Oct 26, 2021 at 01:40:15PM +0300, Nikolay Aleksandrov wrote:
>>>>>>>>> Hi,
>>>>>>>>> Interesting way to work around the asynchronous notifiers. :) I went over
>>>>>>>>> the patch-set and given that we'll have to support and maintain this fragile
>>>>>>>>> solution (e.g. playing with locking, possible races with fdb changes etc) I'm
>>>>>>>>> inclined to go with Ido's previous proposition to convert the hash_lock into a mutex
>>>>>>>>> with delayed learning from the fast-path to get a sleepable context where we can
>>>>>>>>> use synchronous switchdev calls and get feedback immediately.
>>>>>>>>
>>>>>>>> Delayed learning means that we'll receive a sequence of packets like this:
>>>>>>>>
>>>>>>>>             br0--------\
>>>>>>>>           /    \        \
>>>>>>>>          /      \        \
>>>>>>>>         /        \        \
>>>>>>>>      swp0         swp1    swp2
>>>>>>>>       |            |        |
>>>>>>>>    station A   station B  station C
>>>>>>>>
>>>>>>>> station A sends request to B, station B sends reply to A.
>>>>>>>> Since the learning of station A's MAC SA races with the reply sent by
>>>>>>>> station B, it now becomes theoretically possible for the reply packet to
>>>>>>>> be flooded to station C as well, right? And that was not possible before
>>>>>>>> (at least assuming an ageing time longer than the round-trip time of these packets).
>>>>>>>>
>>>>>>>> And that will happen regardless of whether switchdev is used or not.
>>>>>>>> I don't want to outright dismiss this (maybe I don't fully understand
>>>>>>>> this either), but it seems like a pretty heavy-handed change.
>>>>>>>>
>>>>>>>
>>>>>>> It will depending on lock contention, I plan to add a fast/uncontended case with
>>>>>>> trylock from fast-path and if that fails then queue the fdb, but yes - in general
>>>>>>
>>>>>> I wonder why mutex_trylock has this comment?
>>>>>>
>>>>>>  * This function must not be used in interrupt context. The
>>>>>>  * mutex must be released by the same task that acquired it.
>>>>>>
>>>>>>> you are correct that the traffic could get flooded in the queue case before the delayed
>>>>>>> learning processes the entry, it's a trade off if we want sleepable learning context.
>>>>>>> Ido noted privately that's usually how hw acts anyway, also if people want guarantees
>>>>>>> that the reply won't get flooded there are other methods to achieve that (ucast flood
>>>>>>> disable, firewall rules etc).
>>>>>>
>>>>>> Not all hardware is like that, the switches I'm working with, which
>>>>>> perform autonomous learning, all complete the learning process for a
>>>>>> frame strictly before they start the forwarding process. The software
>>>>>> bridge also behaves like that. My only concern is that we might start
>>>>>> building on top of some fundamental bridge changes like these, which
>>>>>> might risk a revert a few months down the line, when somebody notices
>>>>>> and comes up with a use case where that is not acceptable.
>>>>>>
>>>>>
>>>>> I should've clarified that I was talking about Spectrum, as Ido did in a reply.
>>>>
>>>> Ido also said "I assume Spectrum is not special in this regard" and I
>>>> just wanted to say this so that we don't end up with the wrong impression.
>>>> Special or not, to the software bridge it would be new behavior, which
>>>> I can only hope will be well received.
>>>>
>>>>>>> Today the reply could get flooded if the entry can't be programmed
>>>>>>> as well, e.g. the atomic allocation might fail and we'll flood it again; granted, that's much less likely,
>>>>>>> but still there have never been any such guarantees. I think it's generally a good improvement and
>>>>>>> will remove a lot of processing complexity. We can bite the bullet and get the underlying delayed
>>>>>>> infrastructure correct once, now; then the locking rules and other use cases would be easier to enforce
>>>>>>> and reason about in the future.
>>>>>>
>>>>>> You're the maintainer, so I certainly won't complain if we go down this path.
>>>>>> It would be nice if br->lock could also be transformed into a mutex; it
>>>>>> would make all of switchdev much simpler.
>>>>>>
>>>>>
>>>>> That is why we are discussing possible solutions; I don't want to force anything
>>>>> but rather reach a consensus about the way forward. There are complexities involved with
>>>>> moving to delayed learning too. One is that the queue won't be a simple linked list
>>>>> but probably a structure that allows fast lookups (e.g. an rbtree) to avoid duplicating entries;
>>>>> we may also end up doing a two-stage lookup if the main hash table doesn't find an entry,
>>>>> to close the above scenario's window as much as possible. But as I said, I think we can get
>>>>> these correct and well defined; after that we'll be able to reason about future changes and
>>>>> use cases more easily. I'm still thinking about the various corner cases with delayed learning, so
>>>>> any feedback is welcome. I'll start working on it in a few days and will get a clearer
>>>>> view of the issues once I start converting the bridge, but having straightforward locking
>>>>> rules and known deterministic behaviour sounds like a better long-term plan.
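
For illustration, the pending structure being talked about could look roughly like the sketch below. Everything here is hypothetical (the structure name, its fields, the helper); it is only meant to show how keying the rbtree on {MAC, VLAN} avoids queueing duplicates, with a later learn event simply refreshing the node that is already pending.

/* Hypothetical sketch only - not from a posted patch. */
#include <linux/rbtree.h>
#include <linux/if_ether.h>
#include <linux/string.h>
#include <linux/types.h>

struct net_bridge_port;

struct br_fdb_pending {
	struct rb_node		rbnode;		/* keyed by addr + vid */
	struct rcu_head		rcu;		/* if fast-path lookups walk the tree */
	unsigned char		addr[ETH_ALEN];
	u16			vid;
	struct net_bridge_port	*port;		/* source port of the learn event */
	unsigned long		updated;
};

/* Insert a pending learn event, or return the node already queued for the
 * same {addr, vid} so the caller can simply refresh it.  The caller holds
 * the spinlock protecting the pending tree.
 */
static struct br_fdb_pending *br_fdb_pending_insert(struct rb_root *root,
						    struct br_fdb_pending *new)
{
	struct rb_node **link = &root->rb_node, *parent = NULL;

	while (*link) {
		struct br_fdb_pending *p;
		int cmp;

		parent = *link;
		p = rb_entry(parent, struct br_fdb_pending, rbnode);
		cmp = memcmp(new->addr, p->addr, ETH_ALEN);
		if (!cmp)
			cmp = (int)new->vid - (int)p->vid;

		if (cmp < 0)
			link = &parent->rb_left;
		else if (cmp > 0)
			link = &parent->rb_right;
		else
			return p;	/* already queued, just refresh it */
	}

	rb_link_node(&new->rbnode, parent, link);
	rb_insert_color(&new->rbnode, root);
	return new;
}

Whether a roam just updates ->port in place or needs its own event is one of the details the conversion would have to settle.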
>>>>
>>>> Sorry, I might have to read the code; I don't want to misinterpret your
>>>> idea. With what you're describing here, I think that what you are trying
>>>> to achieve is to both have it our way, and preserve the current behavior
>>>> for the common case, where learning still happens from the fast path.
>>>> But I'm not sure that this is the correct goal, especially if you also
>>>> add "straightforward locking rules" to the mix.
>>>>
>>>
>>> The trylock was only an optimization idea, but yes, you'd need both synchronous
>>> and asynchronous notifiers. I don't think you need a sleepable context when
>>> learning from softirq; who would you propagate the error to? It is important
>>> when entries are being added from user-space, but if you'd like to have a veto
>>> option from softirq then we can scratch the trylock idea altogether.
>>
>> I'll let Ido answer here. As I said, the model I'm working with is that
>> of autonomous learning, so for me, no. Whereas the Spectrum model is
>> that of secure learning. I expect that it'd be pretty useless to set up
>> software-assisted secure learning if you're just going to say yes and
>> learn all addresses anyway. I've never seen Spectrum documentation, but
>> I would be shocked if it couldn't be configured to operate in
>> the bare-bones autonomous learning mode too.
>>
> 
> Ack, got it.
> 
>>> Let's not focus on the trylock, it's a minor detail.
>>>
>>>> I think you'd have to explain what is the purpose of the fast path trylock
>>>> logic you've mentioned above. So in the uncontended br->hash_lock case,
>>>> br_fdb_update() could take that mutex and then what? It would create and
>>>> add the FDB entry, and call fdb_notify(), but that still can't sleep.
>>>> So if switchdev drivers still need to privately defer the offloading
>>>> work, we still need some crazy completion-based mechanism to report
>>>> errors like the one I posted here, because in the general sense,
>>>> SWITCHDEV_FDB_ADD_TO_DEVICE will still be atomic.
>>>
>>> We do not need that if we have ADD_TO_DEVICE and ADD_TO_DEVICE_SYNC,
>>
>> Just when I was about to say that I'm happy to get rid of some of those
>> workqueues, and now you're telling me that not only might we not get rid of
>> them, but we might also end up with a second, separate implementation :)
>>
>> Anyway, let's not put the cart before the horse; let's see what the
>> realities of the bridge data path learning and STP flushing will teach
>> us about what can and can't be done.
>>
> 
> We will get rid of some workqueues (I hope). I was saying that if we do the trylock we
> might have to add both sync and async notifiers; otherwise we just need a single sync one.
> 
>>> but that optimization is probably not worth the complexity of
>>> maintaining both so we can just drop the trylock.
>>>
>>>>
>>>> And how do you queue a deletion: do you delete the FDB entry from the pending
>>>> and the main hash tables, or do you just create a deletion command to be
>>>> processed in deferred context?
>>>>
>>>
>>> All commands which cannot take the mutex directly will be executed from a delayed
>>> queue. We also need a delayed flush operation because we need to flush from the STP code
>>> and we can't sleep there due to the STP spinlock. The pending table can be considered
>>> only if we decide to do a 2-stage lookup; it won't be used or consulted in any other
>>> case, so user-space adds and deletes go only to the main table.
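
Purely as an illustration of that shape (all names below, including the fdb_deferred_* members assumed on struct net_bridge, are made up for the sketch; the real conversion may slice this differently): a caller in atomic context, such as the STP code, would only allocate a small descriptor, append it to a spinlock-protected list and kick the deferred work, which later executes the operation under the mutex.

/* Hypothetical sketch of a deferred fdb operation queue. */
enum br_fdb_deferred_op {
	BR_FDB_OP_ADD,
	BR_FDB_OP_DEL,
	BR_FDB_OP_FLUSH_PORT,	/* only ->port is used for this one */
};

struct br_fdb_deferred {
	struct list_head	list;
	enum br_fdb_deferred_op	op;
	unsigned char		addr[ETH_ALEN];
	u16			vid;
	struct net_bridge_port	*port;
};

/* May be called from atomic context (e.g. under the STP spinlock). */
static int br_fdb_queue_deferred(struct net_bridge *br,
				 const struct br_fdb_deferred *tmpl)
{
	struct br_fdb_deferred *d;

	d = kmemdup(tmpl, sizeof(*d), GFP_ATOMIC);
	if (!d)
		return -ENOMEM;

	spin_lock_bh(&br->fdb_deferred_lock);
	list_add_tail(&d->list, &br->fdb_deferred_list);
	spin_unlock_bh(&br->fdb_deferred_lock);

	queue_work(system_long_wq, &br->fdb_deferred_work);
	return 0;
}

A plain list keeps the operations ordered; deduplicating learn events (the rbtree mentioned earlier) would be an optimization layered on top of this.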
>>>
>>>> And since you'd be operating on the hash table concurrently from the
>>>> deferred work and from the fast path, doesn't this mean you'll need to
>>>> use some sort of spin_lock_bh from the deferred work, which again would
>>>> incur atomic context when you want to notify switchdev? Because with the
>>>> mutex_trylock described above, you'd gain exclusivity to the main hash
>>>> table, but if you lose, what you need is exclusivity to the pending hash
>>>> table. So the deferred context also needs to be atomic when it dequeues
>>>> the pending FDB entries and notifies them. I just don't see what we're
>>>> winning, or how the rtnetlink functions will be any better off.
>>>>
>>>
>>> The rbtree can be fully taken by the delayed action and swapped with a new one,
>>> so no exclusivity is needed for the notifications. It will take the spinlock, grab
>>> the whole tree and release the lock, the same as if it were a simple queue.
>>
>> I'm quite unfamiliar with this technique, atomically swapping a queue
>> pointer with a new empty one, and emptying that queue while unlocked.
>> Do you have any reference implementation for this kind of technique?
>>
> 
> the delayed work would be doing something similar to:
> 
> spin_lock(delay_lock);
> record tree root
> rcu change tree root with empty
> spin_unlock(delay_lock);
> 
> mutex_lock(br->hash_lock);
> process recorded (old) tree and free items via rcu
> mutex_unlock(br->hash_lock);
> 
> We have the same pattern with queues all around the kernel.
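
Spelled out a bit more, a minimal C sketch of that swap-and-drain could look like the below, assuming hash_lock has become a mutex as proposed, a hypothetical spinlock-protected fdb_pending_root/fdb_pending_lock/fdb_delayed_work on struct net_bridge, and the br_fdb_pending structure sketched earlier; none of this is code from a posted patch.

/* Hypothetical sketch of the swap-and-drain pattern described above. */
static void br_fdb_delayed_work(struct work_struct *work)
{
	struct net_bridge *br = container_of(work, struct net_bridge,
					     fdb_delayed_work);
	struct br_fdb_pending *p, *tmp;
	struct rb_root pending;

	/* Detach the whole pending tree and leave an empty one behind;
	 * the spinlock is held only for the swap itself.
	 */
	spin_lock_bh(&br->fdb_pending_lock);
	pending = br->fdb_pending_root;
	br->fdb_pending_root = RB_ROOT;
	spin_unlock_bh(&br->fdb_pending_lock);

	/* Process the detached tree in sleepable context. */
	mutex_lock(&br->hash_lock);
	rbtree_postorder_for_each_entry_safe(p, tmp, &pending, rbnode) {
		/* create/refresh the real fdb entry and call the blocking
		 * switchdev notifier here, with room to report errors
		 */
		kfree_rcu(p, rcu);
	}
	mutex_unlock(&br->hash_lock);
}

If the fast path is also allowed to walk the pending tree under RCU for the 2-stage lookup discussed below, the root swap would have to be published with the usual RCU primitives, which the sketch glosses over.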
> 
>>> The spinlock for the delayed-learning rbtree is necessary in all cases,
>>
>> "in all cases" means "regardless of whether we try from the fast path to
>> make fdb_create() insert directly into &br->fdb_hash_tbl, or if we
>> insert into a temporary rbtree for pending entries"? I don't understand
>> this: why would you take the rbtree spinlock if you've inserted into the
>> main hash table already?
>>
> 
> No, it means that we need the spinlock to protect the delayed queue.
> 
>> I'm only concerned about spin locks we'd have to hold while calling
>> fdb_notify().
>>
> 
> There won't be any spinlocks held around fdb_notify(); we'll be doing all notifications from
> sleepable context with the mutex held. That's the goal, at least.
> 
>>> if we used the trylock fast learn then we could directly insert the entry if we win, but
>>> again, let's just always use delayed ops from atomic contexts as a start.
>>>
>>> All entries from atomic contexts will be added to an rbtree which will be processed
>>> from a delayed work; entries will be freed via RCU so lookups can be done if we decide to
>>> do a 2-stage lookup for the fast Rx path, to reduce the flooding-case window that you
>>> described above.
>>>
>>> What we win is a sleepable context for user-space calls and the ability to make synchronous
>>> calls to the drivers and wait for their errors.
>>
>> I think I'd really have to see the code at this point, sorry.
>>
> 
> Sure, I'll prepare an RFC version this week.
> 
> Thanks for the comments.
> 

OK, I've started working on converting the bridge, but there are a few more downsides to
delaying notifications that I really don't like, and there is no good, clean way around them.

a) after the conversion we need to queue an event (read: take a spinlock) for each fdb roam and flag change,
   which are currently lockless

b) we either lose transient fdb changes and send only the last notification for that fdb, or we have to snapshot
   the whole fdb entry on every change

c) we need to serialize and order actions, so in addition to the rbtree for fdb lookup, we'll need an ordered list of events

Now b) is the most worrisome and really problematic: anything in user-space that is following that fdb's changes
would miss events in the first case (not an option), or we'll have to deal with the complexity of snapshotting
entries, which in itself has a few problems and doesn't really extend well to other objects (snapshotting an mdb
could be a nightmare). I don't see a clean way forward to fix this, as I don't want to do the async completion
waiting either; that would impose future race and locking issues and could make future changes harder to get right.
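
For concreteness, "snapshotting" in b) would mean something along these lines (a hypothetical sketch, only to show the cost): every roam or flag change would have to allocate and queue a copy of the entry's state at that moment, and per c) the copies would need to sit on an ordered list as well as the lookup rbtree.

/* Hypothetical sketch of a snapshotted fdb event. */
struct br_fdb_event {
	struct list_head	list;		/* preserves event order, see c) */
	int			rtm_type;	/* RTM_NEWNEIGH or RTM_DELNEIGH */
	unsigned char		addr[ETH_ALEN];
	u16			vid;
	int			port_no;	/* port at the time of the change */
	unsigned long		flags;		/* is_local/is_static/... at that time */
};

Doing that allocation and copy for every fast-path update is exactly the kind of overhead and complexity being objected to here.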

We have to explore alternative options; until we have something viable which doesn't play locking games,
we'll have to live with the current situation.

I'll keep thinking about this and will try a few ideas.

Thanks,
 Nik

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge
  2021-10-27  9:20                   ` Nikolay Aleksandrov
@ 2021-10-27  9:36                     ` Nikolay Aleksandrov
  0 siblings, 0 replies; 29+ messages in thread
From: Nikolay Aleksandrov @ 2021-10-27  9:36 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Roopa Prabhu, Ido Schimmel, Jakub Kicinski,
	David S. Miller, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vladimir Oltean, Jiri Pirko

To add - I'll be experimenting with turning hash_lock into a semaphore so we can use while (down_trylock()) from
the fast path and down() from process context to be able to sleep. For this we'll obviously still need to support
two notifier flavors - atomic and blocking - but the locking rules are kept simple. We won't
have error feedback for the atomic cases, but I think that's the best we can do given the current situation.
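
A bare-bones sketch of that idea, assuming br->hash_lock literally becomes a struct semaphore (function names and the notifier plumbing are placeholders, not code from a posted patch):

#include <linux/semaphore.h>

/* Fast path (softirq): cannot sleep, so spin on the trylock.
 * down_trylock() returns 0 once the semaphore has been acquired.
 */
static void br_fdb_learn_atomic(struct net_bridge *br)
{
	while (down_trylock(&br->hash_lock))
		cpu_relax();

	/* update the hash table and fire the atomic notifier chain;
	 * no error feedback is possible here
	 */

	up(&br->hash_lock);
}

/* Process context (e.g. RTM_NEWNEIGH): may sleep, so the return value of
 * the blocking switchdev notifier can be propagated to user space.
 */
static int br_fdb_add_from_user(struct net_bridge *br)
{
	int err = 0;

	down(&br->hash_lock);
	/* update the hash table, call the blocking notifier chain and
	 * collect its result into err
	 */
	up(&br->hash_lock);

	return err;
}

How long the softirq side might end up spinning while a process-context holder sleeps inside a driver callback is presumably one of the things the experiment has to answer.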


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2021-10-27  9:36 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-25 22:24 [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 01/15] net: bridge: remove fdb_notify forward declaration Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 02/15] net: bridge: remove fdb_insert " Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 03/15] net: bridge: rename fdb_insert to fdb_add_local Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 04/15] net: bridge: rename br_fdb_insert to br_fdb_add_local Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 05/15] net: bridge: move br_fdb_replay inside br_switchdev.c Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 06/15] net: bridge: create a common function for populating switchdev FDB entries Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 07/15] net: switchdev: keep the MAC address by value in struct switchdev_notifier_fdb_info Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 08/15] net: bridge: take the hash_lock inside fdb_add_entry Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 09/15] net: bridge: rename fdb_notify to br_fdb_notify Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 10/15] net: switchdev: merge switchdev_handle_fdb_{add,del}_to_device Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 11/15] net: bridge: make fdb_add_entry() wait for switchdev feedback Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 12/15] net: rtnetlink: pass extack to .ndo_fdb_del Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 13/15] net: bridge: wait for errors from switchdev when deleting FDB entries Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 14/15] net: dsa: propagate feedback to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE Vladimir Oltean
2021-10-25 22:24 ` [RFC PATCH net-next 15/15] net: dsa: propagate extack to .port_fdb_{add,del} Vladimir Oltean
2021-10-26 10:40 ` [RFC PATCH net-next 00/15] Synchronous feedback on FDB add/del from switchdev to the bridge Nikolay Aleksandrov
2021-10-26 11:25   ` Vladimir Oltean
2021-10-26 12:20     ` Nikolay Aleksandrov
2021-10-26 12:38       ` Ido Schimmel
2021-10-26 16:54       ` Vladimir Oltean
2021-10-26 17:10         ` Nikolay Aleksandrov
2021-10-26 19:01           ` Vladimir Oltean
2021-10-26 19:56             ` Nikolay Aleksandrov
2021-10-26 21:51               ` Vladimir Oltean
2021-10-26 22:27                 ` Nikolay Aleksandrov
2021-10-27  9:20                   ` Nikolay Aleksandrov
2021-10-27  9:36                     ` Nikolay Aleksandrov
2021-10-27  7:24                 ` Ido Schimmel
