* [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
@ 2021-08-19 16:07 Vladimir Oltean
From: Vladimir Oltean @ 2021-08-19 16:07 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Vadym Kochan, Taras Chornyi,
	Jiri Pirko, Ido Schimmel, UNGLinuxDriver, Grygorii Strashko,
	Marek Behun, DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens,
	Woojung Huh, Sean Wang, Landen Chao, Claudiu Manoil,
	Alexandre Belloni, George McCollister, Ioana Ciornei,
	Saeed Mahameed, Leon Romanovsky, Lars Povlsen, Steen Hegelund,
	Julian Wiedmann, Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

Problem statement:

Any time a driver needs to create a private association with a bridge
upper interface and then use that association within its
SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, there is a problem with the
FDB entries deleted by the bridge when a port leaves that bridge. All
switchdev drivers schedule a work item in order to get sleepable
context, and that work item can actually run after the port has left
the bridge, which means the association might have already been broken
by the time the scheduled FDB work item attempts to use it.

The solution is to make switchdev use its existing SWITCHDEV_F_DEFER
mechanism so that the FDB notifiers emitted from the fast path are
delivered to drivers in sleepable context. All drivers are converted to
handle SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier
block handler (or to register a blocking switchdev notifier handler if
they didn't have one). This solves the problem described above because
the bridge waits for the deferred switchdev work items to finish before
a port leaves (del_nbp calls switchdev_deferred_process), whereas a work
item scheduled privately by the driver is obviously not waited upon by
the bridge, which is what makes the race possible.
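
For example, here is a minimal sketch of what a driver that no longer
needs its private workqueue can look like after the conversion (roughly
how DSA ends up handling the events synchronously in patch 5; the foo_*
helpers are hypothetical, not taken from any driver in this series):

static int foo_switchdev_blocking_event(struct notifier_block *nb,
					unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info;
	int err;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		if (!foo_port_dev_check(dev))
			return NOTIFY_DONE;

		fdb_info = ptr;
		/* Blocking chain: sleepable context, so the SPI/I2C/MDIO
		 * or firmware I/O can be done right here.
		 */
		if (event == SWITCHDEV_FDB_ADD_TO_DEVICE)
			err = foo_fdb_add(dev, fdb_info->addr, fdb_info->vid);
		else
			err = foo_fdb_del(dev, fdb_info->addr, fdb_info->vid);
		return notifier_from_errno(err);
	}

	return NOTIFY_DONE;
}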

This is a dependency for the "DSA FDB isolation" series posted here; it
was split out of that series, hence the numbering starts directly at v2.

https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/

Vladimir Oltean (5):
  net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking
    notifier chain
  net: bridge: switchdev: make br_fdb_replay offer sleepable context to
    consumers
  net: switchdev: drop the atomic notifier block from
    switchdev_bridge_port_{,un}offload
  net: switchdev: don't assume RCU context in
    switchdev_handle_fdb_{add,del}_to_device
  net: dsa: handle SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE synchronously

 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  86 +++++------
 .../marvell/prestera/prestera_switchdev.c     | 110 +++++++-------
 .../mellanox/mlx5/core/en/rep/bridge.c        |  59 +++++++-
 .../mellanox/mlxsw/spectrum_switchdev.c       |  61 +++++++-
 .../microchip/sparx5/sparx5_switchdev.c       |  78 +++++-----
 drivers/net/ethernet/mscc/ocelot_net.c        |   3 -
 drivers/net/ethernet/rocker/rocker_main.c     |  73 ++++-----
 drivers/net/ethernet/rocker/rocker_ofdpa.c    |   4 +-
 drivers/net/ethernet/ti/am65-cpsw-nuss.c      |   4 +-
 drivers/net/ethernet/ti/am65-cpsw-switchdev.c |  57 ++++----
 drivers/net/ethernet/ti/cpsw_new.c            |   4 +-
 drivers/net/ethernet/ti/cpsw_switchdev.c      |  60 ++++----
 drivers/s390/net/qeth_l2_main.c               |  10 +-
 include/net/switchdev.h                       |  30 +++-
 net/bridge/br.c                               |   5 +-
 net/bridge/br_fdb.c                           |  40 ++++-
 net/bridge/br_private.h                       |   4 -
 net/bridge/br_switchdev.c                     |  18 +--
 net/dsa/dsa.c                                 |  15 --
 net/dsa/dsa_priv.h                            |  15 --
 net/dsa/port.c                                |   3 -
 net/dsa/slave.c                               | 138 ++++++------------
 net/switchdev/switchdev.c                     |  61 +++++++-
 23 files changed, 529 insertions(+), 409 deletions(-)

-- 
2.25.1



* [PATCH v2 net-next 1/5] net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking notifier chain
@ 2021-08-19 16:07 ` Vladimir Oltean
From: Vladimir Oltean @ 2021-08-19 16:07 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller

Currently, br_switchdev_fdb_notify() uses call_switchdev_notifiers()
(and br_fdb_replay() open-codes the same thing). This means that drivers
handle the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events in atomic context,
from notifier blocks registered on the atomic switchdev notifier chain.

Most existing switchdev drivers either talk to firmware, or to a device
over a bus where the I/O is sleepable (SPI, I2C, MDIO etc.). So an
(anti)pattern has emerged where drivers obtain sleepable context for
offloading the given FDB entry by registering an ordered workqueue,
scheduling work items on it and doing all the work from there, roughly
as sketched below.
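
The boilerplate that nearly every driver duplicates for this looks more
or less like the following (condensed; the foo_* names and the
foo_ordered_wq workqueue are placeholders, not taken from any real
driver):

static struct workqueue_struct *foo_ordered_wq;

struct foo_switchdev_event_work {
	struct work_struct work;
	struct switchdev_notifier_fdb_info fdb_info;
	struct net_device *dev;
	unsigned long event;
};

/* Deferred handler doing the actual (sleepable) hardware I/O */
static void foo_fdb_event_work(struct work_struct *work);

/* Atomic switchdev notifier handler: copy the event, including a
 * duplicate of the MAC address, and defer the I/O to the driver's
 * private ordered workqueue.
 */
static int foo_switchdev_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;
	struct foo_switchdev_event_work *switchdev_work;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
		if (!switchdev_work)
			return NOTIFY_BAD;

		INIT_WORK(&switchdev_work->work, foo_fdb_event_work);
		switchdev_work->dev = dev;
		switchdev_work->event = event;
		memcpy(&switchdev_work->fdb_info, fdb_info,
		       sizeof(switchdev_work->fdb_info));
		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
		if (!switchdev_work->fdb_info.addr) {
			kfree(switchdev_work);
			return NOTIFY_BAD;
		}
		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
				fdb_info->addr);
		dev_hold(dev);

		/* Nothing guarantees that, by the time this runs, the
		 * port is still a member of the bridge.
		 */
		queue_work(foo_ordered_wq, &switchdev_work->work);
		break;
	}

	return NOTIFY_DONE;
}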

The problem is the inherent limitation that this design imposes upon
what a switchdev driver can do with those FDB entries.

For example, a switchdev driver might want to perform FDB isolation,
i.e. associate each FDB entry with the bridge it belongs to. Maybe the
driver associates each bridge with a number, allocating that number when
the first port of the driver joins that bridge, and freeing it when the
last port leaves it.

And this is where the problem is. When user space deletes a bridge and
all the ports leave, the bridge will notify us of the deletion of all
FDB entries in atomic context, and switchdev drivers will schedule their
private work items on their private workqueue.

The FDB entry deletion notifications will succeed, the bridge will then
finish deleting itself, but the switchdev work items have not run yet.
By the time they eventually run, the aforementioned association between
the bridge device and a number will have already been broken by the
switchdev driver: all ports are standalone now, and the bridge is a
foreign interface!

One might say "why don't you cache all your associations while you're
still in the atomic context and they're still valid, pass them by value
through your switchdev_work and work with the cached values as opposed
to the current ones?"

This option smells of poor design, because instead of fixing a central
problem, we add tens of lateral workarounds to avoid it. It should be
easier to use switchdev, not harder, and we should look at the common
patterns which lead to code duplication and eliminate them.

In this case, we must notice that
(a) switchdev already has the concept of notifiers emitted from the fast
    path that are still processed by drivers from blocking context. This
    is accomplished through the SWITCHDEV_F_DEFER flag which is used by
    e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
(b) the bridge del_nbp() function already calls switchdev_deferred_process().
    So if we hook into that, the bridge will simply wait for our FDB
    entry offloading procedure to finish before it calls
    netdev_upper_dev_unlink() - which happens almost immediately
    afterwards, and is also the point where switchdev drivers typically
    break their stateful associations between the bridge upper and
    their private data (see the abridged del_nbp() sketch below).

So it is in fact possible to use switchdev's generic
switchdev_deferred_enqueue mechanism to get a sleepable callback, and
from there to call call_switchdev_blocking_notifiers().
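
For reference, the bridge-side teardown ordering that makes this work
looks roughly like this (a heavily abridged sketch of del_nbp(), not
verbatim kernel code):

static void del_nbp(struct net_bridge_port *p)
{
	struct net_bridge *br = p->br;
	struct net_device *dev = p->dev;

	/* ... */
	/* Emits SWITCHDEV_FDB_DEL_TO_DEVICE for the port's entries;
	 * with this series they are enqueued on the switchdev
	 * deferred list instead of being delivered atomically.
	 */
	br_fdb_delete_by_port(br, p, 0, 1);
	/* Runs the deferred notifications under rtnl_mutex, i.e. in
	 * sleepable context, and waits for them to finish.
	 */
	switchdev_deferred_process();
	/* Only now do drivers see the unlinking that typically breaks
	 * their private association with the bridge upper.
	 */
	netdev_upper_dev_unlink(dev, br->dev);
	/* ... */
}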

In the case of br_fdb_replay(), the only code path is from
switchdev_bridge_port_offload(), which is already in blocking context.
So we don't need to go through switchdev_deferred_enqueue, and we can
just call the blocking notifier block directly.

To preserve the same behavior as before, all drivers need to have their
SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handlers moved from their switchdev
atomic notifier blocks to the blocking ones. This patch attempts to make
that trivial movement. Note that now they might schedule a work item for
nothing (since they are now called from a work item themselves), but I
don't have the energy or hardware to test all of them, so this will have
to do.

Note that previously, we were under rcu_read_lock(), but now we're not.
I have eyeballed the drivers that make any sort of RCU assumption and,
for the most part, enclosed their handlers in a private pair of
rcu_read_lock() and rcu_read_unlock(). The exception is
qeth_l2_switchdev_event(), for which adding the rcu_read_lock() and
properly calling rcu_read_unlock() from all return paths would result
in more churn than it is worth. That function had an apparently bogus
comment "Called under rtnl_lock", which does not seem quite possible,
since it is the handler of a notifier block registered with
register_switchdev_notifier(), i.e. on an atomic chain. In any case,
after this rework we _are_ under rtnl_mutex, so just drop the _rcu
suffix from the helpers used by the qeth driver.

The RCU protection can be dropped from the other drivers once they are
reworked to stop scheduling work items of their own.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  84 +++++++-------
 .../marvell/prestera/prestera_switchdev.c     | 104 +++++++++---------
 .../mellanox/mlx5/core/en/rep/bridge.c        |  59 +++++++++-
 .../mellanox/mlxsw/spectrum_switchdev.c       |  57 +++++++++-
 .../microchip/sparx5/sparx5_switchdev.c       |  74 +++++++------
 drivers/net/ethernet/rocker/rocker_main.c     |  73 ++++++------
 drivers/net/ethernet/ti/am65-cpsw-switchdev.c |  57 +++++-----
 drivers/net/ethernet/ti/cpsw_switchdev.c      |  60 +++++-----
 drivers/s390/net/qeth_l2_main.c               |  10 +-
 include/net/switchdev.h                       |  25 ++++-
 net/bridge/br_fdb.c                           |   2 +
 net/bridge/br_switchdev.c                     |  10 +-
 net/dsa/slave.c                               |  32 +++---
 net/switchdev/switchdev.c                     |  47 ++++++++
 14 files changed, 447 insertions(+), 247 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
index d260993ab2dc..5de475927958 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
@@ -2254,52 +2254,11 @@ static int dpaa2_switch_port_event(struct notifier_block *nb,
 				   unsigned long event, void *ptr)
 {
 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
-	struct ethsw_port_priv *port_priv = netdev_priv(dev);
-	struct ethsw_switchdev_event_work *switchdev_work;
-	struct switchdev_notifier_fdb_info *fdb_info = ptr;
-	struct ethsw_core *ethsw = port_priv->ethsw_data;
 
 	if (event == SWITCHDEV_PORT_ATTR_SET)
 		return dpaa2_switch_port_attr_set_event(dev, ptr);
 
-	if (!dpaa2_switch_port_dev_check(dev))
-		return NOTIFY_DONE;
-
-	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
-	if (!switchdev_work)
-		return NOTIFY_BAD;
-
-	INIT_WORK(&switchdev_work->work, dpaa2_switch_event_work);
-	switchdev_work->dev = dev;
-	switchdev_work->event = event;
-
-	switch (event) {
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		memcpy(&switchdev_work->fdb_info, ptr,
-		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!switchdev_work->fdb_info.addr)
-			goto err_addr_alloc;
-
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
-
-		/* Take a reference on the device to avoid being freed. */
-		dev_hold(dev);
-		break;
-	default:
-		kfree(switchdev_work);
-		return NOTIFY_DONE;
-	}
-
-	queue_work(ethsw->workqueue, &switchdev_work->work);
-
 	return NOTIFY_DONE;
-
-err_addr_alloc:
-	kfree(switchdev_work);
-	return NOTIFY_BAD;
 }
 
 static int dpaa2_switch_port_obj_event(unsigned long event,
@@ -2324,6 +2283,46 @@ static int dpaa2_switch_port_obj_event(unsigned long event,
 	return notifier_from_errno(err);
 }
 
+static int dpaa2_switch_fdb_event(unsigned long event,
+				  struct net_device *dev,
+				  struct switchdev_notifier_fdb_info *fdb_info)
+{
+	struct ethsw_port_priv *port_priv = netdev_priv(dev);
+	struct ethsw_switchdev_event_work *switchdev_work;
+	struct ethsw_core *ethsw = port_priv->ethsw_data;
+
+	if (!dpaa2_switch_port_dev_check(dev))
+		return NOTIFY_DONE;
+
+	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
+	if (!switchdev_work)
+		return NOTIFY_BAD;
+
+	INIT_WORK(&switchdev_work->work, dpaa2_switch_event_work);
+	switchdev_work->dev = dev;
+	switchdev_work->event = event;
+
+	memcpy(&switchdev_work->fdb_info, fdb_info,
+	       sizeof(switchdev_work->fdb_info));
+	switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
+	if (!switchdev_work->fdb_info.addr)
+		goto err_addr_alloc;
+
+	ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
+			fdb_info->addr);
+
+	/* Take a reference on the device to avoid being freed. */
+	dev_hold(dev);
+
+	queue_work(ethsw->workqueue, &switchdev_work->work);
+
+	return NOTIFY_DONE;
+
+err_addr_alloc:
+	kfree(switchdev_work);
+	return NOTIFY_BAD;
+}
+
 static int dpaa2_switch_port_blocking_event(struct notifier_block *nb,
 					    unsigned long event, void *ptr)
 {
@@ -2335,6 +2334,9 @@ static int dpaa2_switch_port_blocking_event(struct notifier_block *nb,
 		return dpaa2_switch_port_obj_event(event, dev, ptr);
 	case SWITCHDEV_PORT_ATTR_SET:
 		return dpaa2_switch_port_attr_set_event(dev, ptr);
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		return dpaa2_switch_fdb_event(event, dev, ptr);
 	}
 
 	return NOTIFY_DONE;
diff --git a/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c b/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
index 3ce6ccd0f539..9b8847aa3b92 100644
--- a/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
+++ b/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
@@ -845,10 +845,6 @@ static int prestera_switchdev_event(struct notifier_block *unused,
 				    unsigned long event, void *ptr)
 {
 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
-	struct switchdev_notifier_fdb_info *fdb_info;
-	struct switchdev_notifier_info *info = ptr;
-	struct prestera_fdb_event_work *swdev_work;
-	struct net_device *upper;
 	int err;
 
 	if (event == SWITCHDEV_PORT_ATTR_SET) {
@@ -858,54 +854,7 @@ static int prestera_switchdev_event(struct notifier_block *unused,
 		return notifier_from_errno(err);
 	}
 
-	if (!prestera_netdev_check(dev))
-		return NOTIFY_DONE;
-
-	upper = netdev_master_upper_dev_get_rcu(dev);
-	if (!upper)
-		return NOTIFY_DONE;
-
-	if (!netif_is_bridge_master(upper))
-		return NOTIFY_DONE;
-
-	swdev_work = kzalloc(sizeof(*swdev_work), GFP_ATOMIC);
-	if (!swdev_work)
-		return NOTIFY_BAD;
-
-	swdev_work->event = event;
-	swdev_work->dev = dev;
-
-	switch (event) {
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		fdb_info = container_of(info,
-					struct switchdev_notifier_fdb_info,
-					info);
-
-		INIT_WORK(&swdev_work->work, prestera_fdb_event_work);
-		memcpy(&swdev_work->fdb_info, ptr,
-		       sizeof(swdev_work->fdb_info));
-
-		swdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!swdev_work->fdb_info.addr)
-			goto out_bad;
-
-		ether_addr_copy((u8 *)swdev_work->fdb_info.addr,
-				fdb_info->addr);
-		dev_hold(dev);
-		break;
-
-	default:
-		kfree(swdev_work);
-		return NOTIFY_DONE;
-	}
-
-	queue_work(swdev_wq, &swdev_work->work);
 	return NOTIFY_DONE;
-
-out_bad:
-	kfree(swdev_work);
-	return NOTIFY_BAD;
 }
 
 static int
@@ -1101,6 +1050,53 @@ static int prestera_port_obj_del(struct net_device *dev, const void *ctx,
 	}
 }
 
+static int prestera_switchdev_fdb_event(struct net_device *dev,
+					unsigned long event,
+					struct switchdev_notifier_info *info)
+{
+	struct switchdev_notifier_fdb_info *fdb_info;
+	struct prestera_fdb_event_work *swdev_work;
+	struct net_device *upper;
+
+	if (!prestera_netdev_check(dev))
+		return 0;
+
+	upper = netdev_master_upper_dev_get_rcu(dev);
+	if (!upper)
+		return 0;
+
+	if (!netif_is_bridge_master(upper))
+		return 0;
+
+	swdev_work = kzalloc(sizeof(*swdev_work), GFP_ATOMIC);
+	if (!swdev_work)
+		return -ENOMEM;
+
+	swdev_work->event = event;
+	swdev_work->dev = dev;
+
+	fdb_info = container_of(info, struct switchdev_notifier_fdb_info,
+				info);
+
+	INIT_WORK(&swdev_work->work, prestera_fdb_event_work);
+	memcpy(&swdev_work->fdb_info, fdb_info, sizeof(swdev_work->fdb_info));
+
+	swdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
+	if (!swdev_work->fdb_info.addr)
+		goto out_bad;
+
+	ether_addr_copy((u8 *)swdev_work->fdb_info.addr,
+			fdb_info->addr);
+	dev_hold(dev);
+
+	queue_work(swdev_wq, &swdev_work->work);
+	return 0;
+
+out_bad:
+	kfree(swdev_work);
+	return -ENOMEM;
+}
+
 static int prestera_switchdev_blk_event(struct notifier_block *unused,
 					unsigned long event, void *ptr)
 {
@@ -1123,6 +1119,12 @@ static int prestera_switchdev_blk_event(struct notifier_block *unused,
 						     prestera_netdev_check,
 						     prestera_port_obj_attr_set);
 		break;
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		rcu_read_lock();
+		err = prestera_switchdev_fdb_event(dev, event, ptr);
+		rcu_read_unlock();
+		break;
 	default:
 		err = -EOPNOTSUPP;
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
index 0c38c2e319be..ea7c3f07f6fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
@@ -276,6 +276,55 @@ mlx5_esw_bridge_port_obj_attr_set(struct net_device *dev,
 	return err;
 }
 
+static struct mlx5_bridge_switchdev_fdb_work *
+mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add,
+					struct switchdev_notifier_fdb_info *fdb_info,
+					struct mlx5_esw_bridge_offloads *br_offloads);
+
+static int
+mlx5_esw_bridge_fdb_event(struct net_device *dev, unsigned long event,
+			  struct switchdev_notifier_info *info,
+			  struct mlx5_esw_bridge_offloads *br_offloads)
+{
+	struct switchdev_notifier_fdb_info *fdb_info;
+	struct mlx5_bridge_switchdev_fdb_work *work;
+	struct mlx5_eswitch *esw = br_offloads->esw;
+	u16 vport_num, esw_owner_vhca_id;
+	struct net_device *upper, *rep;
+
+	upper = netdev_master_upper_dev_get_rcu(dev);
+	if (!upper)
+		return 0;
+	if (!netif_is_bridge_master(upper))
+		return 0;
+
+	rep = mlx5_esw_bridge_rep_vport_num_vhca_id_get(dev, esw,
+							&vport_num,
+							&esw_owner_vhca_id);
+	if (!rep)
+		return 0;
+
+	/* only handle the event on peers */
+	if (mlx5_esw_bridge_is_local(dev, rep, esw))
+		return 0;
+
+	fdb_info = container_of(info, struct switchdev_notifier_fdb_info, info);
+
+	work = mlx5_esw_bridge_init_switchdev_fdb_work(dev,
+						       event == SWITCHDEV_FDB_ADD_TO_DEVICE,
+						       fdb_info,
+						       br_offloads);
+	if (IS_ERR(work)) {
+		WARN_ONCE(1, "Failed to init switchdev work, err=%ld",
+			  PTR_ERR(work));
+		return PTR_ERR(work);
+	}
+
+	queue_work(br_offloads->wq, &work->work);
+
+	return 0;
+}
+
 static int mlx5_esw_bridge_event_blocking(struct notifier_block *nb,
 					  unsigned long event, void *ptr)
 {
@@ -295,6 +344,12 @@ static int mlx5_esw_bridge_event_blocking(struct notifier_block *nb,
 	case SWITCHDEV_PORT_ATTR_SET:
 		err = mlx5_esw_bridge_port_obj_attr_set(dev, ptr, br_offloads);
 		break;
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		rcu_read_lock();
+		err = mlx5_esw_bridge_fdb_event(dev, event, ptr, br_offloads);
+		rcu_read_unlock();
+		break;
 	default:
 		err = 0;
 	}
@@ -415,9 +470,7 @@ static int mlx5_esw_bridge_switchdev_event(struct notifier_block *nb,
 		/* only handle the event on peers */
 		if (mlx5_esw_bridge_is_local(dev, rep, esw))
 			break;
-		fallthrough;
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+
 		fdb_info = container_of(info,
 					struct switchdev_notifier_fdb_info,
 					info);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 22fede5cb32c..791a165fe3aa 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -3247,8 +3247,6 @@ static int mlxsw_sp_switchdev_event(struct notifier_block *unused,
 	switchdev_work->event = event;
 
 	switch (event) {
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
 	case SWITCHDEV_FDB_ADD_TO_BRIDGE:
 	case SWITCHDEV_FDB_DEL_TO_BRIDGE:
 		fdb_info = container_of(info,
@@ -3513,6 +3511,55 @@ mlxsw_sp_switchdev_handle_vxlan_obj_del(struct net_device *vxlan_dev,
 	}
 }
 
+static int mlxsw_sp_switchdev_fdb_event(struct net_device *dev, unsigned long event,
+					struct switchdev_notifier_info *info)
+{
+	struct mlxsw_sp_switchdev_event_work *switchdev_work;
+	struct switchdev_notifier_fdb_info *fdb_info;
+	struct net_device *br_dev;
+
+	/* Tunnel devices are not our uppers, so check their master instead */
+	br_dev = netdev_master_upper_dev_get_rcu(dev);
+	if (!br_dev)
+		return 0;
+	if (!netif_is_bridge_master(br_dev))
+		return 0;
+	if (!mlxsw_sp_port_dev_lower_find_rcu(br_dev))
+		return 0;
+
+	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
+	if (!switchdev_work)
+		return -ENOMEM;
+
+	switchdev_work->dev = dev;
+	switchdev_work->event = event;
+
+	fdb_info = container_of(info, struct switchdev_notifier_fdb_info,
+				info);
+	INIT_WORK(&switchdev_work->work,
+		  mlxsw_sp_switchdev_bridge_fdb_event_work);
+	memcpy(&switchdev_work->fdb_info, fdb_info,
+	       sizeof(switchdev_work->fdb_info));
+	switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
+	if (!switchdev_work->fdb_info.addr)
+		goto err_addr_alloc;
+	ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
+			fdb_info->addr);
+	/* Take a reference on the device. This can be either
+	 * upper device containig mlxsw_sp_port or just a
+	 * mlxsw_sp_port
+	 */
+	dev_hold(dev);
+
+	mlxsw_core_schedule_work(&switchdev_work->work);
+
+	return 0;
+
+err_addr_alloc:
+	kfree(switchdev_work);
+	return -ENOMEM;
+}
+
 static int mlxsw_sp_switchdev_blocking_event(struct notifier_block *unused,
 					     unsigned long event, void *ptr)
 {
@@ -3541,6 +3588,12 @@ static int mlxsw_sp_switchdev_blocking_event(struct notifier_block *unused,
 						     mlxsw_sp_port_dev_check,
 						     mlxsw_sp_port_attr_set);
 		return notifier_from_errno(err);
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		rcu_read_lock();
+		err = mlxsw_sp_switchdev_fdb_event(dev, event, ptr);
+		rcu_read_unlock();
+		return notifier_from_errno(err);
 	}
 
 	return NOTIFY_DONE;
diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c b/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
index 649ca609884a..7fb9f59d43e0 100644
--- a/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
+++ b/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
@@ -268,9 +268,6 @@ static int sparx5_switchdev_event(struct notifier_block *unused,
 				  unsigned long event, void *ptr)
 {
 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
-	struct sparx5_switchdev_event_work *switchdev_work;
-	struct switchdev_notifier_fdb_info *fdb_info;
-	struct switchdev_notifier_info *info = ptr;
 	int err;
 
 	switch (event) {
@@ -279,39 +276,9 @@ static int sparx5_switchdev_event(struct notifier_block *unused,
 						     sparx5_netdevice_check,
 						     sparx5_port_attr_set);
 		return notifier_from_errno(err);
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-		fallthrough;
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
-		if (!switchdev_work)
-			return NOTIFY_BAD;
-
-		switchdev_work->dev = dev;
-		switchdev_work->event = event;
-
-		fdb_info = container_of(info,
-					struct switchdev_notifier_fdb_info,
-					info);
-		INIT_WORK(&switchdev_work->work,
-			  sparx5_switchdev_bridge_fdb_event_work);
-		memcpy(&switchdev_work->fdb_info, ptr,
-		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!switchdev_work->fdb_info.addr)
-			goto err_addr_alloc;
-
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
-		dev_hold(dev);
-
-		sparx5_schedule_work(&switchdev_work->work);
-		break;
 	}
 
 	return NOTIFY_DONE;
-err_addr_alloc:
-	kfree(switchdev_work);
-	return NOTIFY_BAD;
 }
 
 static void sparx5_sync_port_dev_addr(struct sparx5 *sparx5,
@@ -459,6 +426,43 @@ static int sparx5_handle_port_obj_del(struct net_device *dev,
 	return err;
 }
 
+static int sparx5_switchdev_fdb_event(struct net_device *dev, unsigned long event,
+				      struct switchdev_notifier_info *info)
+{
+	struct sparx5_switchdev_event_work *switchdev_work;
+	struct switchdev_notifier_fdb_info *fdb_info;
+
+	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
+	if (!switchdev_work)
+		return -ENOMEM;
+
+	switchdev_work->dev = dev;
+	switchdev_work->event = event;
+
+	fdb_info = container_of(info,
+				struct switchdev_notifier_fdb_info,
+				info);
+	INIT_WORK(&switchdev_work->work,
+		  sparx5_switchdev_bridge_fdb_event_work);
+	memcpy(&switchdev_work->fdb_info, fdb_info,
+	       sizeof(switchdev_work->fdb_info));
+	switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
+	if (!switchdev_work->fdb_info.addr)
+		goto err_addr_alloc;
+
+	ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
+			fdb_info->addr);
+	dev_hold(dev);
+
+	sparx5_schedule_work(&switchdev_work->work);
+
+	return 0;
+
+err_addr_alloc:
+	kfree(switchdev_work);
+	return -ENOMEM;
+}
+
 static int sparx5_switchdev_blocking_event(struct notifier_block *nb,
 					   unsigned long event,
 					   void *ptr)
@@ -478,6 +482,10 @@ static int sparx5_switchdev_blocking_event(struct notifier_block *nb,
 						     sparx5_netdevice_check,
 						     sparx5_port_attr_set);
 		return notifier_from_errno(err);
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		err = sparx5_switchdev_fdb_event(dev, event, ptr);
+		return notifier_from_errno(err);
 	}
 
 	return NOTIFY_DONE;
diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 3364b6a56bd1..0d998b28bb90 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2767,9 +2767,6 @@ static int rocker_switchdev_event(struct notifier_block *unused,
 				  unsigned long event, void *ptr)
 {
 	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
-	struct rocker_switchdev_event_work *switchdev_work;
-	struct switchdev_notifier_fdb_info *fdb_info = ptr;
-	struct rocker_port *rocker_port;
 
 	if (!rocker_port_dev_check(dev))
 		return NOTIFY_DONE;
@@ -2777,38 +2774,6 @@ static int rocker_switchdev_event(struct notifier_block *unused,
 	if (event == SWITCHDEV_PORT_ATTR_SET)
 		return rocker_switchdev_port_attr_set_event(dev, ptr);
 
-	rocker_port = netdev_priv(dev);
-	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
-	if (WARN_ON(!switchdev_work))
-		return NOTIFY_BAD;
-
-	INIT_WORK(&switchdev_work->work, rocker_switchdev_event_work);
-	switchdev_work->rocker_port = rocker_port;
-	switchdev_work->event = event;
-
-	switch (event) {
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		memcpy(&switchdev_work->fdb_info, ptr,
-		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (unlikely(!switchdev_work->fdb_info.addr)) {
-			kfree(switchdev_work);
-			return NOTIFY_BAD;
-		}
-
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
-		/* Take a reference on the rocker device */
-		dev_hold(dev);
-		break;
-	default:
-		kfree(switchdev_work);
-		return NOTIFY_DONE;
-	}
-
-	queue_work(rocker_port->rocker->rocker_owq,
-		   &switchdev_work->work);
 	return NOTIFY_DONE;
 }
 
@@ -2831,6 +2796,41 @@ rocker_switchdev_port_obj_event(unsigned long event, struct net_device *netdev,
 	return notifier_from_errno(err);
 }
 
+static int
+rocker_switchdev_fdb_event(unsigned long event, struct net_device *dev,
+			   struct switchdev_notifier_fdb_info *fdb_info)
+{
+	struct rocker_switchdev_event_work *switchdev_work;
+	struct rocker_port *rocker_port;
+
+	rocker_port = netdev_priv(dev);
+	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
+	if (WARN_ON(!switchdev_work))
+		return NOTIFY_BAD;
+
+	INIT_WORK(&switchdev_work->work, rocker_switchdev_event_work);
+	switchdev_work->rocker_port = rocker_port;
+	switchdev_work->event = event;
+
+	memcpy(&switchdev_work->fdb_info, fdb_info,
+	       sizeof(switchdev_work->fdb_info));
+	switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
+	if (unlikely(!switchdev_work->fdb_info.addr)) {
+		kfree(switchdev_work);
+		return NOTIFY_BAD;
+	}
+
+	ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
+			fdb_info->addr);
+	/* Take a reference on the rocker device */
+	dev_hold(dev);
+
+	queue_work(rocker_port->rocker->rocker_owq,
+		   &switchdev_work->work);
+
+	return NOTIFY_DONE;
+}
+
 static int rocker_switchdev_blocking_event(struct notifier_block *unused,
 					   unsigned long event, void *ptr)
 {
@@ -2845,6 +2845,9 @@ static int rocker_switchdev_blocking_event(struct notifier_block *unused,
 		return rocker_switchdev_port_obj_event(event, dev, ptr);
 	case SWITCHDEV_PORT_ATTR_SET:
 		return rocker_switchdev_port_attr_set_event(dev, ptr);
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		return rocker_switchdev_fdb_event(event, dev, ptr);
 	}
 
 	return NOTIFY_DONE;
diff --git a/drivers/net/ethernet/ti/am65-cpsw-switchdev.c b/drivers/net/ethernet/ti/am65-cpsw-switchdev.c
index 599708a3e81d..7282001c85e6 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-switchdev.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-switchdev.c
@@ -424,9 +424,6 @@ static int am65_cpsw_switchdev_event(struct notifier_block *unused,
 				     unsigned long event, void *ptr)
 {
 	struct net_device *ndev = switchdev_notifier_info_to_dev(ptr);
-	struct am65_cpsw_switchdev_event_work *switchdev_work;
-	struct am65_cpsw_port *port = am65_ndev_to_port(ndev);
-	struct switchdev_notifier_fdb_info *fdb_info = ptr;
 	int err;
 
 	if (event == SWITCHDEV_PORT_ATTR_SET) {
@@ -436,47 +433,49 @@ static int am65_cpsw_switchdev_event(struct notifier_block *unused,
 		return notifier_from_errno(err);
 	}
 
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block cpsw_switchdev_notifier = {
+	.notifier_call = am65_cpsw_switchdev_event,
+};
+
+static int am65_cpsw_switchdev_fdb_event(struct net_device *ndev,
+					 unsigned long event,
+					 struct switchdev_notifier_fdb_info *fdb_info)
+{
+	struct am65_cpsw_switchdev_event_work *switchdev_work;
+	struct am65_cpsw_port *port = am65_ndev_to_port(ndev);
+
 	if (!am65_cpsw_port_dev_check(ndev))
-		return NOTIFY_DONE;
+		return 0;
 
 	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
 	if (WARN_ON(!switchdev_work))
-		return NOTIFY_BAD;
+		return -ENOMEM;
 
 	INIT_WORK(&switchdev_work->work, am65_cpsw_switchdev_event_work);
 	switchdev_work->port = port;
 	switchdev_work->event = event;
 
-	switch (event) {
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		memcpy(&switchdev_work->fdb_info, ptr,
-		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!switchdev_work->fdb_info.addr)
-			goto err_addr_alloc;
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
-		dev_hold(ndev);
-		break;
-	default:
-		kfree(switchdev_work);
-		return NOTIFY_DONE;
-	}
+	memcpy(&switchdev_work->fdb_info, fdb_info,
+	       sizeof(switchdev_work->fdb_info));
+	switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
+	if (!switchdev_work->fdb_info.addr)
+		goto err_addr_alloc;
+	ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
+			fdb_info->addr);
+	dev_hold(ndev);
 
 	queue_work(system_long_wq, &switchdev_work->work);
 
-	return NOTIFY_DONE;
+	return 0;
 
 err_addr_alloc:
 	kfree(switchdev_work);
-	return NOTIFY_BAD;
+	return -ENOMEM;
 }
 
-static struct notifier_block cpsw_switchdev_notifier = {
-	.notifier_call = am65_cpsw_switchdev_event,
-};
-
 static int am65_cpsw_switchdev_blocking_event(struct notifier_block *unused,
 					      unsigned long event, void *ptr)
 {
@@ -499,6 +498,10 @@ static int am65_cpsw_switchdev_blocking_event(struct notifier_block *unused,
 						     am65_cpsw_port_dev_check,
 						     am65_cpsw_port_attr_set);
 		return notifier_from_errno(err);
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		err = am65_cpsw_switchdev_fdb_event(dev, event, ptr);
+		return notifier_from_errno(err);
 	default:
 		break;
 	}
diff --git a/drivers/net/ethernet/ti/cpsw_switchdev.c b/drivers/net/ethernet/ti/cpsw_switchdev.c
index a7d97d429e06..bc3ddca2750d 100644
--- a/drivers/net/ethernet/ti/cpsw_switchdev.c
+++ b/drivers/net/ethernet/ti/cpsw_switchdev.c
@@ -434,9 +434,6 @@ static int cpsw_switchdev_event(struct notifier_block *unused,
 				unsigned long event, void *ptr)
 {
 	struct net_device *ndev = switchdev_notifier_info_to_dev(ptr);
-	struct switchdev_notifier_fdb_info *fdb_info = ptr;
-	struct cpsw_switchdev_event_work *switchdev_work;
-	struct cpsw_priv *priv = netdev_priv(ndev);
 	int err;
 
 	if (event == SWITCHDEV_PORT_ATTR_SET) {
@@ -446,47 +443,50 @@ static int cpsw_switchdev_event(struct notifier_block *unused,
 		return notifier_from_errno(err);
 	}
 
-	if (!cpsw_port_dev_check(ndev))
-		return NOTIFY_DONE;
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block cpsw_switchdev_notifier = {
+	.notifier_call = cpsw_switchdev_event,
+};
+
+static int cpsw_switchdev_fdb_event(struct net_device *dev, unsigned long event,
+				    struct switchdev_notifier_fdb_info *fdb_info)
+{
+	struct cpsw_switchdev_event_work *switchdev_work;
+	struct cpsw_priv *priv;
+
+	if (!cpsw_port_dev_check(dev))
+		return 0;
+
+	priv = netdev_priv(dev);
 
 	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
 	if (WARN_ON(!switchdev_work))
-		return NOTIFY_BAD;
+		return -ENOMEM;
 
 	INIT_WORK(&switchdev_work->work, cpsw_switchdev_event_work);
 	switchdev_work->priv = priv;
 	switchdev_work->event = event;
 
-	switch (event) {
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		memcpy(&switchdev_work->fdb_info, ptr,
-		       sizeof(switchdev_work->fdb_info));
-		switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
-		if (!switchdev_work->fdb_info.addr)
-			goto err_addr_alloc;
-		ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
-				fdb_info->addr);
-		dev_hold(ndev);
-		break;
-	default:
-		kfree(switchdev_work);
-		return NOTIFY_DONE;
-	}
+	memcpy(&switchdev_work->fdb_info, fdb_info,
+	       sizeof(switchdev_work->fdb_info));
+	switchdev_work->fdb_info.addr = kzalloc(ETH_ALEN, GFP_ATOMIC);
+	if (!switchdev_work->fdb_info.addr)
+		goto err_addr_alloc;
+	ether_addr_copy((u8 *)switchdev_work->fdb_info.addr,
+			fdb_info->addr);
+	dev_hold(dev);
 
 	queue_work(system_long_wq, &switchdev_work->work);
 
-	return NOTIFY_DONE;
+	return 0;
 
 err_addr_alloc:
 	kfree(switchdev_work);
-	return NOTIFY_BAD;
+	return -ENOMEM;
 }
 
-static struct notifier_block cpsw_switchdev_notifier = {
-	.notifier_call = cpsw_switchdev_event,
-};
-
 static int cpsw_switchdev_blocking_event(struct notifier_block *unused,
 					 unsigned long event, void *ptr)
 {
@@ -509,6 +509,10 @@ static int cpsw_switchdev_blocking_event(struct notifier_block *unused,
 						     cpsw_port_dev_check,
 						     cpsw_port_attr_set);
 		return notifier_from_errno(err);
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		err = cpsw_switchdev_fdb_event(dev, event, ptr);
+		return notifier_from_errno(err);
 	default:
 		break;
 	}
diff --git a/drivers/s390/net/qeth_l2_main.c b/drivers/s390/net/qeth_l2_main.c
index 72e84ff9fea5..aa94f57e1d7c 100644
--- a/drivers/s390/net/qeth_l2_main.c
+++ b/drivers/s390/net/qeth_l2_main.c
@@ -867,14 +867,14 @@ static int qeth_l2_switchdev_event(struct notifier_block *unused,
 		return NOTIFY_DONE;
 
 	dstdev = switchdev_notifier_info_to_dev(info);
-	brdev = netdev_master_upper_dev_get_rcu(dstdev);
+	brdev = netdev_master_upper_dev_get(dstdev);
 	if (!brdev || !netif_is_bridge_master(brdev))
 		return NOTIFY_DONE;
 	fdb_info = container_of(info,
 				struct switchdev_notifier_fdb_info,
 				info);
 	iter = &brdev->adj_list.lower;
-	lowerdev = netdev_next_lower_dev_rcu(brdev, &iter);
+	lowerdev = netdev_next_lower_dev(brdev, &iter);
 	while (lowerdev) {
 		if (qeth_l2_must_learn(lowerdev, dstdev)) {
 			card = lowerdev->ml_priv;
@@ -887,7 +887,7 @@ static int qeth_l2_switchdev_event(struct notifier_block *unused,
 				return NOTIFY_BAD;
 			}
 		}
-		lowerdev = netdev_next_lower_dev_rcu(brdev, &iter);
+		lowerdev = netdev_next_lower_dev(brdev, &iter);
 	}
 	return NOTIFY_DONE;
 }
@@ -904,7 +904,7 @@ static void qeth_l2_br2dev_get(void)
 	int rc;
 
 	if (!refcount_inc_not_zero(&qeth_l2_switchdev_notify_refcnt)) {
-		rc = register_switchdev_notifier(&qeth_l2_sw_notifier);
+		rc = register_switchdev_blocking_notifier(&qeth_l2_sw_notifier);
 		if (rc) {
 			QETH_DBF_MESSAGE(2,
 					 "failed to register qeth_l2_sw_notifier: %d\n",
@@ -924,7 +924,7 @@ static void qeth_l2_br2dev_put(void)
 	int rc;
 
 	if (refcount_dec_and_test(&qeth_l2_switchdev_notify_refcnt)) {
-		rc = unregister_switchdev_notifier(&qeth_l2_sw_notifier);
+		rc = unregister_switchdev_blocking_notifier(&qeth_l2_sw_notifier);
 		if (rc) {
 			QETH_DBF_MESSAGE(2,
 					 "failed to unregister qeth_l2_sw_notifier: %d\n",
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 60d806b6a5ae..5d9ae1ec85b3 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -191,8 +191,8 @@ struct switchdev_brport {
 enum switchdev_notifier_type {
 	SWITCHDEV_FDB_ADD_TO_BRIDGE = 1,
 	SWITCHDEV_FDB_DEL_TO_BRIDGE,
-	SWITCHDEV_FDB_ADD_TO_DEVICE,
-	SWITCHDEV_FDB_DEL_TO_DEVICE,
+	SWITCHDEV_FDB_ADD_TO_DEVICE, /* Blocking. */
+	SWITCHDEV_FDB_DEL_TO_DEVICE, /* Blocking. */
 	SWITCHDEV_FDB_OFFLOADED,
 	SWITCHDEV_FDB_FLUSH_TO_BRIDGE,
 
@@ -283,6 +283,13 @@ int switchdev_port_obj_add(struct net_device *dev,
 int switchdev_port_obj_del(struct net_device *dev,
 			   const struct switchdev_obj *obj);
 
+int
+switchdev_fdb_add_to_device(struct net_device *dev,
+			    const struct switchdev_notifier_fdb_info *fdb_info);
+int
+switchdev_fdb_del_to_device(struct net_device *dev,
+			    const struct switchdev_notifier_fdb_info *fdb_info);
+
 int register_switchdev_notifier(struct notifier_block *nb);
 int unregister_switchdev_notifier(struct notifier_block *nb);
 int call_switchdev_notifiers(unsigned long val, struct net_device *dev,
@@ -386,6 +393,20 @@ static inline int switchdev_port_obj_del(struct net_device *dev,
 	return -EOPNOTSUPP;
 }
 
+static inline int
+switchdev_fdb_add_to_device(struct net_device *dev,
+			    const struct switchdev_notifier_fdb_info *fdb_info)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int
+switchdev_fdb_del_to_device(struct net_device *dev,
+			    const struct switchdev_notifier_fdb_info *fdb_info)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int register_switchdev_notifier(struct notifier_block *nb)
 {
 	return 0;
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 46812b659710..0bdbcfc53914 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -763,6 +763,8 @@ int br_fdb_replay(const struct net_device *br_dev, const void *ctx, bool adding,
 	if (!nb)
 		return 0;
 
+	ASSERT_RTNL();
+
 	if (!netif_is_bridge_master(br_dev))
 		return -EINVAL;
 
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 6bf518d78f02..cd413b010567 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -138,12 +138,10 @@ br_switchdev_fdb_notify(struct net_bridge *br,
 
 	switch (type) {
 	case RTM_DELNEIGH:
-		call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE,
-					 dev, &info.info, NULL);
+		switchdev_fdb_del_to_device(dev, &info);
 		break;
 	case RTM_NEWNEIGH:
-		call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE,
-					 dev, &info.info, NULL);
+		switchdev_fdb_add_to_device(dev, &info);
 		break;
 	}
 }
@@ -287,7 +285,7 @@ static int nbp_switchdev_sync_objs(struct net_bridge_port *p, const void *ctx,
 	if (err && err != -EOPNOTSUPP)
 		return err;
 
-	err = br_fdb_replay(br_dev, ctx, true, atomic_nb);
+	err = br_fdb_replay(br_dev, ctx, true, blocking_nb);
 	if (err && err != -EOPNOTSUPP)
 		return err;
 
@@ -306,7 +304,7 @@ static void nbp_switchdev_unsync_objs(struct net_bridge_port *p,
 
 	br_mdb_replay(br_dev, dev, ctx, false, blocking_nb, NULL);
 
-	br_fdb_replay(br_dev, ctx, false, atomic_nb);
+	br_fdb_replay(br_dev, ctx, false, blocking_nb);
 }
 
 /* Let the bridge know that this port is offloaded, so that it can assign a
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index eb9d9e53c536..249303ac3c3c 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2454,20 +2454,6 @@ static int dsa_slave_switchdev_event(struct notifier_block *unused,
 						     dsa_slave_dev_check,
 						     dsa_slave_port_attr_set);
 		return notifier_from_errno(err);
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-		err = switchdev_handle_fdb_add_to_device(dev, ptr,
-							 dsa_slave_dev_check,
-							 dsa_foreign_dev_check,
-							 dsa_slave_fdb_add_to_device,
-							 NULL);
-		return notifier_from_errno(err);
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		err = switchdev_handle_fdb_del_to_device(dev, ptr,
-							 dsa_slave_dev_check,
-							 dsa_foreign_dev_check,
-							 dsa_slave_fdb_del_to_device,
-							 NULL);
-		return notifier_from_errno(err);
 	default:
 		return NOTIFY_DONE;
 	}
@@ -2497,6 +2483,24 @@ static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused,
 						     dsa_slave_dev_check,
 						     dsa_slave_port_attr_set);
 		return notifier_from_errno(err);
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+		rcu_read_lock();
+		err = switchdev_handle_fdb_add_to_device(dev, ptr,
+							 dsa_slave_dev_check,
+							 dsa_foreign_dev_check,
+							 dsa_slave_fdb_add_to_device,
+							 NULL);
+		rcu_read_unlock();
+		return notifier_from_errno(err);
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		rcu_read_lock();
+		err = switchdev_handle_fdb_del_to_device(dev, ptr,
+							 dsa_slave_dev_check,
+							 dsa_foreign_dev_check,
+							 dsa_slave_fdb_del_to_device,
+							 NULL);
+		rcu_read_unlock();
+		return notifier_from_errno(err);
 	}
 
 	return NOTIFY_DONE;
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 0b2c18efc079..c34c6abceec6 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -378,6 +378,53 @@ int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(call_switchdev_blocking_notifiers);
 
+static void switchdev_fdb_add_deferred(struct net_device *dev, const void *data)
+{
+	const struct switchdev_notifier_fdb_info *fdb_info = data;
+	struct switchdev_notifier_fdb_info tmp = *fdb_info;
+	int err;
+
+	ASSERT_RTNL();
+	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE,
+						dev, &tmp.info, NULL);
+	err = notifier_to_errno(err);
+	if (err && err != -EOPNOTSUPP)
+		netdev_err(dev, "failed to add FDB entry: %pe\n", ERR_PTR(err));
+}
+
+static void switchdev_fdb_del_deferred(struct net_device *dev, const void *data)
+{
+	const struct switchdev_notifier_fdb_info *fdb_info = data;
+	struct switchdev_notifier_fdb_info tmp = *fdb_info;
+	int err;
+
+	ASSERT_RTNL();
+	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE,
+						dev, &tmp.info, NULL);
+	err = notifier_to_errno(err);
+	if (err && err != -EOPNOTSUPP)
+		netdev_err(dev, "failed to delete FDB entry: %pe\n",
+			   ERR_PTR(err));
+}
+
+int
+switchdev_fdb_add_to_device(struct net_device *dev,
+			    const struct switchdev_notifier_fdb_info *fdb_info)
+{
+	return switchdev_deferred_enqueue(dev, fdb_info, sizeof(*fdb_info),
+					  switchdev_fdb_add_deferred);
+}
+EXPORT_SYMBOL_GPL(switchdev_fdb_add_to_device);
+
+int
+switchdev_fdb_del_to_device(struct net_device *dev,
+			    const struct switchdev_notifier_fdb_info *fdb_info)
+{
+	return switchdev_deferred_enqueue(dev, fdb_info, sizeof(*fdb_info),
+					  switchdev_fdb_del_deferred);
+}
+EXPORT_SYMBOL_GPL(switchdev_fdb_del_to_device);
+
 struct switchdev_nested_priv {
 	bool (*check_cb)(const struct net_device *dev);
 	bool (*foreign_dev_check_cb)(const struct net_device *dev,
-- 
2.25.1



* [PATCH v2 net-next 2/5] net: bridge: switchdev: make br_fdb_replay offer sleepable context to consumers
@ 2021-08-19 16:07 ` Vladimir Oltean
From: Vladimir Oltean @ 2021-08-19 16:07 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller

Now that the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events are notified on
the blocking chain, it would be nice if we could also drop the
rcu_read_lock() atomic context from br_fdb_replay() so that drivers can
actually benefit from the blocking context and simplify their logic.

Do something similar to what is done in
br_mdb_queue_one()/br_mdb_replay_one(), except that the FDB entries are
held in a hash list.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/bridge/br_fdb.c | 38 ++++++++++++++++++++++++++++++++++----
 1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 0bdbcfc53914..36f4e3b8d21b 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -752,12 +752,28 @@ static int br_fdb_replay_one(struct net_bridge *br, struct notifier_block *nb,
 	return notifier_to_errno(err);
 }
 
+static int br_fdb_queue_one(struct hlist_head *fdb_list,
+			    const struct net_bridge_fdb_entry *fdb)
+{
+	struct net_bridge_fdb_entry *fdb_new;
+
+	fdb_new = kmemdup(fdb, sizeof(*fdb), GFP_ATOMIC);
+	if (!fdb_new)
+		return -ENOMEM;
+
+	hlist_add_head_rcu(&fdb_new->fdb_node, fdb_list);
+
+	return 0;
+}
+
 int br_fdb_replay(const struct net_device *br_dev, const void *ctx, bool adding,
 		  struct notifier_block *nb)
 {
 	struct net_bridge_fdb_entry *fdb;
+	struct hlist_node *tmp;
 	struct net_bridge *br;
 	unsigned long action;
+	HLIST_HEAD(fdb_list);
 	int err = 0;
 
 	if (!nb)
@@ -770,20 +786,34 @@ int br_fdb_replay(const struct net_device *br_dev, const void *ctx, bool adding,
 
 	br = netdev_priv(br_dev);
 
+	rcu_read_lock();
+
+	hlist_for_each_entry_rcu(fdb, &br->fdb_list, fdb_node) {
+		err = br_fdb_queue_one(&fdb_list, fdb);
+		if (err) {
+			rcu_read_unlock();
+			goto out_free_fdb;
+		}
+	}
+
+	rcu_read_unlock();
+
 	if (adding)
 		action = SWITCHDEV_FDB_ADD_TO_DEVICE;
 	else
 		action = SWITCHDEV_FDB_DEL_TO_DEVICE;
 
-	rcu_read_lock();
-
-	hlist_for_each_entry_rcu(fdb, &br->fdb_list, fdb_node) {
+	hlist_for_each_entry(fdb, &fdb_list, fdb_node) {
 		err = br_fdb_replay_one(br, nb, fdb, action, ctx);
 		if (err)
 			break;
 	}
 
-	rcu_read_unlock();
+out_free_fdb:
+	hlist_for_each_entry_safe(fdb, tmp, &fdb_list, fdb_node) {
+		hlist_del_rcu(&fdb->fdb_node);
+		kfree(fdb);
+	}
 
 	return err;
 }
-- 
2.25.1



* [PATCH v2 net-next 3/5] net: switchdev: drop the atomic notifier block from switchdev_bridge_port_{,un}offload
@ 2021-08-19 16:07 ` Vladimir Oltean
From: Vladimir Oltean @ 2021-08-19 16:07 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller

Now that br_fdb_replay() uses the blocking_nb, there is no point in
passing the atomic_nb to switchdev_bridge_port_{,un}offload() anymore.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c       | 2 --
 .../net/ethernet/marvell/prestera/prestera_switchdev.c    | 6 +++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c  | 4 ++--
 drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c  | 4 ++--
 drivers/net/ethernet/mscc/ocelot_net.c                    | 3 ---
 drivers/net/ethernet/rocker/rocker_ofdpa.c                | 4 ++--
 drivers/net/ethernet/ti/am65-cpsw-nuss.c                  | 4 ++--
 drivers/net/ethernet/ti/cpsw_new.c                        | 4 ++--
 include/net/switchdev.h                                   | 5 -----
 net/bridge/br.c                                           | 5 ++---
 net/bridge/br_private.h                                   | 4 ----
 net/bridge/br_switchdev.c                                 | 8 ++------
 net/dsa/port.c                                            | 3 ---
 net/switchdev/switchdev.c                                 | 4 ----
 14 files changed, 17 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
index 5de475927958..82f31e9f41a9 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
@@ -2016,7 +2016,6 @@ static int dpaa2_switch_port_bridge_join(struct net_device *netdev,
 		goto err_egress_flood;
 
 	err = switchdev_bridge_port_offload(netdev, netdev, NULL,
-					    &dpaa2_switch_port_switchdev_nb,
 					    &dpaa2_switch_port_switchdev_blocking_nb,
 					    false, extack);
 	if (err)
@@ -2053,7 +2052,6 @@ static int dpaa2_switch_port_restore_rxvlan(struct net_device *vdev, int vid, vo
 static void dpaa2_switch_port_pre_bridge_leave(struct net_device *netdev)
 {
 	switchdev_bridge_port_unoffload(netdev, NULL,
-					&dpaa2_switch_port_switchdev_nb,
 					&dpaa2_switch_port_switchdev_blocking_nb);
 }
 
diff --git a/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c b/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
index 9b8847aa3b92..fb0fa782a5ff 100644
--- a/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
+++ b/drivers/net/ethernet/marvell/prestera/prestera_switchdev.c
@@ -502,7 +502,7 @@ int prestera_bridge_port_join(struct net_device *br_dev,
 	}
 
 	err = switchdev_bridge_port_offload(br_port->dev, port->dev, NULL,
-					    NULL, NULL, false, extack);
+					    NULL, false, extack);
 	if (err)
 		goto err_switchdev_offload;
 
@@ -516,7 +516,7 @@ int prestera_bridge_port_join(struct net_device *br_dev,
 	return 0;
 
 err_port_join:
-	switchdev_bridge_port_unoffload(br_port->dev, NULL, NULL, NULL);
+	switchdev_bridge_port_unoffload(br_port->dev, NULL, NULL);
 err_switchdev_offload:
 	prestera_bridge_port_put(br_port);
 err_brport_create:
@@ -592,7 +592,7 @@ void prestera_bridge_port_leave(struct net_device *br_dev,
 	else
 		prestera_bridge_1d_port_leave(br_port);
 
-	switchdev_bridge_port_unoffload(br_port->dev, NULL, NULL, NULL);
+	switchdev_bridge_port_unoffload(br_port->dev, NULL, NULL);
 
 	prestera_hw_port_learning_set(port, false);
 	prestera_hw_port_flood_set(port, BR_FLOOD | BR_MCAST_FLOOD, 0);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
index 791a165fe3aa..1a2fa8b2fa58 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_switchdev.c
@@ -362,7 +362,7 @@ mlxsw_sp_bridge_port_create(struct mlxsw_sp_bridge_device *bridge_device,
 	bridge_port->ref_count = 1;
 
 	err = switchdev_bridge_port_offload(brport_dev, mlxsw_sp_port->dev,
-					    NULL, NULL, NULL, false, extack);
+					    NULL, NULL, false, extack);
 	if (err)
 		goto err_switchdev_offload;
 
@@ -377,7 +377,7 @@ mlxsw_sp_bridge_port_create(struct mlxsw_sp_bridge_device *bridge_device,
 static void
 mlxsw_sp_bridge_port_destroy(struct mlxsw_sp_bridge_port *bridge_port)
 {
-	switchdev_bridge_port_unoffload(bridge_port->dev, NULL, NULL, NULL);
+	switchdev_bridge_port_unoffload(bridge_port->dev, NULL, NULL);
 	list_del(&bridge_port->list);
 	WARN_ON(!list_empty(&bridge_port->vlans_list));
 	kfree(bridge_port);
diff --git a/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c b/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
index 7fb9f59d43e0..eb957c323669 100644
--- a/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
+++ b/drivers/net/ethernet/microchip/sparx5/sparx5_switchdev.c
@@ -112,7 +112,7 @@ static int sparx5_port_bridge_join(struct sparx5_port *port,
 
 	set_bit(port->portno, sparx5->bridge_mask);
 
-	err = switchdev_bridge_port_offload(ndev, ndev, NULL, NULL, NULL,
+	err = switchdev_bridge_port_offload(ndev, ndev, NULL, NULL,
 					    false, extack);
 	if (err)
 		goto err_switchdev_offload;
@@ -134,7 +134,7 @@ static void sparx5_port_bridge_leave(struct sparx5_port *port,
 {
 	struct sparx5 *sparx5 = port->sparx5;
 
-	switchdev_bridge_port_unoffload(port->ndev, NULL, NULL, NULL);
+	switchdev_bridge_port_unoffload(port->ndev, NULL, NULL);
 
 	clear_bit(port->portno, sparx5->bridge_mask);
 	if (bitmap_empty(sparx5->bridge_mask, SPX5_PORTS))
diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index 5e8965be968a..04ca55ff0fd0 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -1162,7 +1162,6 @@ static int ocelot_netdevice_bridge_join(struct net_device *dev,
 	ocelot_port_bridge_join(ocelot, port, bridge);
 
 	err = switchdev_bridge_port_offload(brport_dev, dev, priv,
-					    &ocelot_netdevice_nb,
 					    &ocelot_switchdev_blocking_nb,
 					    false, extack);
 	if (err)
@@ -1176,7 +1175,6 @@ static int ocelot_netdevice_bridge_join(struct net_device *dev,
 
 err_switchdev_sync:
 	switchdev_bridge_port_unoffload(brport_dev, priv,
-					&ocelot_netdevice_nb,
 					&ocelot_switchdev_blocking_nb);
 err_switchdev_offload:
 	ocelot_port_bridge_leave(ocelot, port, bridge);
@@ -1189,7 +1187,6 @@ static void ocelot_netdevice_pre_bridge_leave(struct net_device *dev,
 	struct ocelot_port_private *priv = netdev_priv(dev);
 
 	switchdev_bridge_port_unoffload(brport_dev, priv,
-					&ocelot_netdevice_nb,
 					&ocelot_switchdev_blocking_nb);
 }
 
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 3e1ca7a8d029..c09f2a93337c 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -2598,7 +2598,7 @@ static int ofdpa_port_bridge_join(struct ofdpa_port *ofdpa_port,
 	if (err)
 		return err;
 
-	return switchdev_bridge_port_offload(dev, dev, NULL, NULL, NULL,
+	return switchdev_bridge_port_offload(dev, dev, NULL, NULL,
 					     false, extack);
 }
 
@@ -2607,7 +2607,7 @@ static int ofdpa_port_bridge_leave(struct ofdpa_port *ofdpa_port)
 	struct net_device *dev = ofdpa_port->dev;
 	int err;
 
-	switchdev_bridge_port_unoffload(dev, NULL, NULL, NULL);
+	switchdev_bridge_port_unoffload(dev, NULL, NULL);
 
 	err = ofdpa_port_vlan_del(ofdpa_port, OFDPA_UNTAGGED_VID, 0);
 	if (err)
diff --git a/drivers/net/ethernet/ti/am65-cpsw-nuss.c b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
index 130346f74ee8..3a7fde2bf861 100644
--- a/drivers/net/ethernet/ti/am65-cpsw-nuss.c
+++ b/drivers/net/ethernet/ti/am65-cpsw-nuss.c
@@ -2109,7 +2109,7 @@ static int am65_cpsw_netdevice_port_link(struct net_device *ndev,
 			return -EOPNOTSUPP;
 	}
 
-	err = switchdev_bridge_port_offload(ndev, ndev, NULL, NULL, NULL,
+	err = switchdev_bridge_port_offload(ndev, ndev, NULL, NULL,
 					    false, extack);
 	if (err)
 		return err;
@@ -2126,7 +2126,7 @@ static void am65_cpsw_netdevice_port_unlink(struct net_device *ndev)
 	struct am65_cpsw_common *common = am65_ndev_to_common(ndev);
 	struct am65_cpsw_ndev_priv *priv = am65_ndev_to_priv(ndev);
 
-	switchdev_bridge_port_unoffload(ndev, NULL, NULL, NULL);
+	switchdev_bridge_port_unoffload(ndev, NULL, NULL);
 
 	common->br_members &= ~BIT(priv->port->port_id);
 
diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c
index 85d05b9be2b8..239ccdd6bc48 100644
--- a/drivers/net/ethernet/ti/cpsw_new.c
+++ b/drivers/net/ethernet/ti/cpsw_new.c
@@ -1518,7 +1518,7 @@ static int cpsw_netdevice_port_link(struct net_device *ndev,
 			return -EOPNOTSUPP;
 	}
 
-	err = switchdev_bridge_port_offload(ndev, ndev, NULL, NULL, NULL,
+	err = switchdev_bridge_port_offload(ndev, ndev, NULL, NULL,
 					    false, extack);
 	if (err)
 		return err;
@@ -1535,7 +1535,7 @@ static void cpsw_netdevice_port_unlink(struct net_device *ndev)
 	struct cpsw_priv *priv = netdev_priv(ndev);
 	struct cpsw_common *cpsw = priv->cpsw;
 
-	switchdev_bridge_port_unoffload(ndev, NULL, NULL, NULL);
+	switchdev_bridge_port_unoffload(ndev, NULL, NULL);
 
 	cpsw->br_members &= ~BIT(priv->emac_port);
 
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 5d9ae1ec85b3..b433432c4ef8 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -183,7 +183,6 @@ typedef int switchdev_obj_dump_cb_t(struct switchdev_obj *obj);
 struct switchdev_brport {
 	struct net_device *dev;
 	const void *ctx;
-	struct notifier_block *atomic_nb;
 	struct notifier_block *blocking_nb;
 	bool tx_fwd_offload;
 };
@@ -264,13 +263,11 @@ switchdev_fdb_is_dynamically_learned(const struct switchdev_notifier_fdb_info *f
 
 int switchdev_bridge_port_offload(struct net_device *brport_dev,
 				  struct net_device *dev, const void *ctx,
-				  struct notifier_block *atomic_nb,
 				  struct notifier_block *blocking_nb,
 				  bool tx_fwd_offload,
 				  struct netlink_ext_ack *extack);
 void switchdev_bridge_port_unoffload(struct net_device *brport_dev,
 				     const void *ctx,
-				     struct notifier_block *atomic_nb,
 				     struct notifier_block *blocking_nb);
 
 void switchdev_deferred_process(void);
@@ -353,7 +350,6 @@ int switchdev_handle_port_attr_set(struct net_device *dev,
 static inline int
 switchdev_bridge_port_offload(struct net_device *brport_dev,
 			      struct net_device *dev, const void *ctx,
-			      struct notifier_block *atomic_nb,
 			      struct notifier_block *blocking_nb,
 			      bool tx_fwd_offload,
 			      struct netlink_ext_ack *extack)
@@ -364,7 +360,6 @@ switchdev_bridge_port_offload(struct net_device *brport_dev,
 static inline void
 switchdev_bridge_port_unoffload(struct net_device *brport_dev,
 				const void *ctx,
-				struct notifier_block *atomic_nb,
 				struct notifier_block *blocking_nb)
 {
 }
diff --git a/net/bridge/br.c b/net/bridge/br.c
index d3a32c6813e0..ef92f57b14e6 100644
--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -222,7 +222,7 @@ static int br_switchdev_blocking_event(struct notifier_block *nb,
 		b = &brport_info->brport;
 
 		err = br_switchdev_port_offload(p, b->dev, b->ctx,
-						b->atomic_nb, b->blocking_nb,
+						b->blocking_nb,
 						b->tx_fwd_offload, extack);
 		err = notifier_from_errno(err);
 		break;
@@ -230,8 +230,7 @@ static int br_switchdev_blocking_event(struct notifier_block *nb,
 		brport_info = ptr;
 		b = &brport_info->brport;
 
-		br_switchdev_port_unoffload(p, b->ctx, b->atomic_nb,
-					    b->blocking_nb);
+		br_switchdev_port_unoffload(p, b->ctx, b->blocking_nb);
 		break;
 	}
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 21b292eb2b3e..a7ea4ef0d9e3 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -1948,13 +1948,11 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }
 #ifdef CONFIG_NET_SWITCHDEV
 int br_switchdev_port_offload(struct net_bridge_port *p,
 			      struct net_device *dev, const void *ctx,
-			      struct notifier_block *atomic_nb,
 			      struct notifier_block *blocking_nb,
 			      bool tx_fwd_offload,
 			      struct netlink_ext_ack *extack);
 
 void br_switchdev_port_unoffload(struct net_bridge_port *p, const void *ctx,
-				 struct notifier_block *atomic_nb,
 				 struct notifier_block *blocking_nb);
 
 bool br_switchdev_frame_uses_tx_fwd_offload(struct sk_buff *skb);
@@ -1988,7 +1986,6 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)
 static inline int
 br_switchdev_port_offload(struct net_bridge_port *p,
 			  struct net_device *dev, const void *ctx,
-			  struct notifier_block *atomic_nb,
 			  struct notifier_block *blocking_nb,
 			  bool tx_fwd_offload,
 			  struct netlink_ext_ack *extack)
@@ -1998,7 +1995,6 @@ br_switchdev_port_offload(struct net_bridge_port *p,
 
 static inline void
 br_switchdev_port_unoffload(struct net_bridge_port *p, const void *ctx,
-			    struct notifier_block *atomic_nb,
 			    struct notifier_block *blocking_nb)
 {
 }
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index cd413b010567..8ff0d2d341d7 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -269,7 +269,6 @@ static void nbp_switchdev_del(struct net_bridge_port *p)
 }
 
 static int nbp_switchdev_sync_objs(struct net_bridge_port *p, const void *ctx,
-				   struct notifier_block *atomic_nb,
 				   struct notifier_block *blocking_nb,
 				   struct netlink_ext_ack *extack)
 {
@@ -294,7 +293,6 @@ static int nbp_switchdev_sync_objs(struct net_bridge_port *p, const void *ctx,
 
 static void nbp_switchdev_unsync_objs(struct net_bridge_port *p,
 				      const void *ctx,
-				      struct notifier_block *atomic_nb,
 				      struct notifier_block *blocking_nb)
 {
 	struct net_device *br_dev = p->br->dev;
@@ -312,7 +310,6 @@ static void nbp_switchdev_unsync_objs(struct net_bridge_port *p,
  */
 int br_switchdev_port_offload(struct net_bridge_port *p,
 			      struct net_device *dev, const void *ctx,
-			      struct notifier_block *atomic_nb,
 			      struct notifier_block *blocking_nb,
 			      bool tx_fwd_offload,
 			      struct netlink_ext_ack *extack)
@@ -328,7 +325,7 @@ int br_switchdev_port_offload(struct net_bridge_port *p,
 	if (err)
 		return err;
 
-	err = nbp_switchdev_sync_objs(p, ctx, atomic_nb, blocking_nb, extack);
+	err = nbp_switchdev_sync_objs(p, ctx, blocking_nb, extack);
 	if (err)
 		goto out_switchdev_del;
 
@@ -341,10 +338,9 @@ int br_switchdev_port_offload(struct net_bridge_port *p,
 }
 
 void br_switchdev_port_unoffload(struct net_bridge_port *p, const void *ctx,
-				 struct notifier_block *atomic_nb,
 				 struct notifier_block *blocking_nb)
 {
-	nbp_switchdev_unsync_objs(p, ctx, atomic_nb, blocking_nb);
+	nbp_switchdev_unsync_objs(p, ctx, blocking_nb);
 
 	nbp_switchdev_del(p);
 }
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 979042a64d1a..30071da45403 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -375,7 +375,6 @@ int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
 	tx_fwd_offload = dsa_port_bridge_tx_fwd_offload(dp, br);
 
 	err = switchdev_bridge_port_offload(brport_dev, dev, dp,
-					    &dsa_slave_switchdev_notifier,
 					    &dsa_slave_switchdev_blocking_notifier,
 					    tx_fwd_offload, extack);
 	if (err)
@@ -389,7 +388,6 @@ int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br,
 
 out_rollback_unoffload:
 	switchdev_bridge_port_unoffload(brport_dev, dp,
-					&dsa_slave_switchdev_notifier,
 					&dsa_slave_switchdev_blocking_notifier);
 out_rollback_unbridge:
 	dsa_broadcast(DSA_NOTIFIER_BRIDGE_LEAVE, &info);
@@ -403,7 +401,6 @@ void dsa_port_pre_bridge_leave(struct dsa_port *dp, struct net_device *br)
 	struct net_device *brport_dev = dsa_port_to_bridge_port(dp);
 
 	switchdev_bridge_port_unoffload(brport_dev, dp,
-					&dsa_slave_switchdev_notifier,
 					&dsa_slave_switchdev_blocking_notifier);
 }
 
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index c34c6abceec6..d09e8e9df5b6 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -859,7 +859,6 @@ EXPORT_SYMBOL_GPL(switchdev_handle_port_attr_set);
 
 int switchdev_bridge_port_offload(struct net_device *brport_dev,
 				  struct net_device *dev, const void *ctx,
-				  struct notifier_block *atomic_nb,
 				  struct notifier_block *blocking_nb,
 				  bool tx_fwd_offload,
 				  struct netlink_ext_ack *extack)
@@ -868,7 +867,6 @@ int switchdev_bridge_port_offload(struct net_device *brport_dev,
 		.brport = {
 			.dev = dev,
 			.ctx = ctx,
-			.atomic_nb = atomic_nb,
 			.blocking_nb = blocking_nb,
 			.tx_fwd_offload = tx_fwd_offload,
 		},
@@ -886,13 +884,11 @@ EXPORT_SYMBOL_GPL(switchdev_bridge_port_offload);
 
 void switchdev_bridge_port_unoffload(struct net_device *brport_dev,
 				     const void *ctx,
-				     struct notifier_block *atomic_nb,
 				     struct notifier_block *blocking_nb)
 {
 	struct switchdev_notifier_brport_info brport_info = {
 		.brport = {
 			.ctx = ctx,
-			.atomic_nb = atomic_nb,
 			.blocking_nb = blocking_nb,
 		},
 	};
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 net-next 4/5] net: switchdev: don't assume RCU context in switchdev_handle_fdb_{add,del}_to_device
  2021-08-19 16:07 [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Vladimir Oltean
                   ` (2 preceding siblings ...)
  2021-08-19 16:07 ` [PATCH v2 net-next 3/5] net: switchdev: drop the atomic notifier block from switchdev_bridge_port_{,un}offload Vladimir Oltean
@ 2021-08-19 16:07 ` Vladimir Oltean
  2021-08-19 16:07 ` [PATCH v2 net-next 5/5] net: dsa: handle SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE synchronously Vladimir Oltean
  2021-08-20  9:16 ` [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Ido Schimmel
  5 siblings, 0 replies; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-19 16:07 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Vadym Kochan, Taras Chornyi,
	Jiri Pirko, Ido Schimmel, UNGLinuxDriver, Grygorii Strashko,
	Marek Behun, DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens,
	Woojung Huh, Sean Wang, Landen Chao, Claudiu Manoil,
	Alexandre Belloni, George McCollister, Ioana Ciornei,
	Saeed Mahameed, Leon Romanovsky, Lars Povlsen, Steen Hegelund,
	Julian Wiedmann, Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

Now that the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events are blocking, it
would be nice if the callers of the fan-out helper functions (i.e. DSA)
could benefit from that blocking context.

At the moment, however, switchdev_handle_fdb_{add,del}_to_device use
netdev adjacency list helpers that assume RCU protection. Switch over
to their rtnl_mutex equivalents, since we are now running with the
rtnl_mutex held anyway, and drop the surrounding rcu_read_lock from the
callers.
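
For reference, the conversion boils down to the following shape (a
minimal sketch with a made-up function name, not one of the hunks
below; the real changes touch switchdev_lower_dev_find and
__switchdev_handle_fdb_{add,del}_to_device):

#include <linux/netdevice.h>
#include <linux/rtnetlink.h>

/* Sketch only: the handler now runs off the deferred/blocking path,
 * i.e. under rtnl_mutex, so the RTNL-protected adjacency helpers can
 * replace their _rcu variants.
 */
static int example_fdb_to_device(struct net_device *dev)
{
	struct net_device *br;

	ASSERT_RTNL();

	/* previously netdev_master_upper_dev_get_rcu() under rcu_read_lock() */
	br = netdev_master_upper_dev_get(dev);
	if (!br || !netif_is_bridge_master(br))
		return 0;

	/* ... fan out towards the bridge ports as before ... */
	return 0;
}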

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/slave.c           |  4 ----
 net/switchdev/switchdev.c | 10 +++++++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 249303ac3c3c..b6a94861cddd 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2484,22 +2484,18 @@ static int dsa_slave_switchdev_blocking_event(struct notifier_block *unused,
 						     dsa_slave_port_attr_set);
 		return notifier_from_errno(err);
 	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-		rcu_read_lock();
 		err = switchdev_handle_fdb_add_to_device(dev, ptr,
 							 dsa_slave_dev_check,
 							 dsa_foreign_dev_check,
 							 dsa_slave_fdb_add_to_device,
 							 NULL);
-		rcu_read_unlock();
 		return notifier_from_errno(err);
 	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		rcu_read_lock();
 		err = switchdev_handle_fdb_del_to_device(dev, ptr,
 							 dsa_slave_dev_check,
 							 dsa_foreign_dev_check,
 							 dsa_slave_fdb_del_to_device,
 							 NULL);
-		rcu_read_unlock();
 		return notifier_from_errno(err);
 	}
 
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index d09e8e9df5b6..fdbb73439f37 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -470,7 +470,7 @@ switchdev_lower_dev_find(struct net_device *dev,
 		.data = &switchdev_priv,
 	};
 
-	netdev_walk_all_lower_dev_rcu(dev, switchdev_lower_dev_walk, &priv);
+	netdev_walk_all_lower_dev(dev, switchdev_lower_dev_walk, &priv);
 
 	return switchdev_priv.lower_dev;
 }
@@ -543,7 +543,7 @@ static int __switchdev_handle_fdb_add_to_device(struct net_device *dev,
 	/* Event is neither on a bridge nor a LAG. Check whether it is on an
 	 * interface that is in a bridge with us.
 	 */
-	br = netdev_master_upper_dev_get_rcu(dev);
+	br = netdev_master_upper_dev_get(dev);
 	if (!br || !netif_is_bridge_master(br))
 		return 0;
 
@@ -569,6 +569,8 @@ int switchdev_handle_fdb_add_to_device(struct net_device *dev,
 {
 	int err;
 
+	ASSERT_RTNL();
+
 	err = __switchdev_handle_fdb_add_to_device(dev, dev, fdb_info,
 						   check_cb,
 						   foreign_dev_check_cb,
@@ -648,7 +650,7 @@ static int __switchdev_handle_fdb_del_to_device(struct net_device *dev,
 	/* Event is neither on a bridge nor a LAG. Check whether it is on an
 	 * interface that is in a bridge with us.
 	 */
-	br = netdev_master_upper_dev_get_rcu(dev);
+	br = netdev_master_upper_dev_get(dev);
 	if (!br || !netif_is_bridge_master(br))
 		return 0;
 
@@ -674,6 +676,8 @@ int switchdev_handle_fdb_del_to_device(struct net_device *dev,
 {
 	int err;
 
+	ASSERT_RTNL();
+
 	err = __switchdev_handle_fdb_del_to_device(dev, dev, fdb_info,
 						   check_cb,
 						   foreign_dev_check_cb,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 net-next 5/5] net: dsa: handle SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE synchronously
  2021-08-19 16:07 [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Vladimir Oltean
                   ` (3 preceding siblings ...)
  2021-08-19 16:07 ` [PATCH v2 net-next 4/5] net: switchdev: don't assume RCU context in switchdev_handle_fdb_{add,del}_to_device Vladimir Oltean
@ 2021-08-19 16:07 ` Vladimir Oltean
  2021-08-20  9:16 ` [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Ido Schimmel
  5 siblings, 0 replies; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-19 16:07 UTC (permalink / raw)
  To: netdev, Jakub Kicinski, David S. Miller
  Cc: Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Vadym Kochan, Taras Chornyi,
	Jiri Pirko, Ido Schimmel, UNGLinuxDriver, Grygorii Strashko,
	Marek Behun, DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens,
	Woojung Huh, Sean Wang, Landen Chao, Claudiu Manoil,
	Alexandre Belloni, George McCollister, Ioana Ciornei,
	Saeed Mahameed, Leon Romanovsky, Lars Povlsen, Steen Hegelund,
	Julian Wiedmann, Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

Since the switchdev FDB entry notifications are now blocking and
deferred by switchdev and not by us, switchdev will also wait for us to
finish, which means we can proceed with our FDB isolation mechanism
based on dp->bridge_num.

It also means that the ordered workqueue is no longer needed; drop it
and simply call the driver directly.
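
The handler thus collapses to a direct, synchronous call (a rough
sketch of the resulting shape, with error reporting omitted; the
complete version is in dsa_slave_fdb_event below):

/* Sketch: with sleepable context provided by the blocking notifier,
 * the FDB entry is programmed immediately and the error is returned
 * straight to switchdev via notifier_from_errno().
 */
static int example_dsa_fdb_event(struct dsa_port *dp,
				 const unsigned char *addr, u16 vid,
				 bool host_addr, unsigned long event)
{
	if (event == SWITCHDEV_FDB_ADD_TO_DEVICE)
		return host_addr ? dsa_port_host_fdb_add(dp, addr, vid) :
				   dsa_port_fdb_add(dp, addr, vid);

	return host_addr ? dsa_port_host_fdb_del(dp, addr, vid) :
			   dsa_port_fdb_del(dp, addr, vid);
}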

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 net/dsa/dsa.c      |  15 -------
 net/dsa/dsa_priv.h |  15 -------
 net/dsa/slave.c    | 110 ++++++++++++---------------------------------
 3 files changed, 28 insertions(+), 112 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index 1dc45e40f961..b2126334387f 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -338,13 +338,6 @@ static struct packet_type dsa_pack_type __read_mostly = {
 	.func	= dsa_switch_rcv,
 };
 
-static struct workqueue_struct *dsa_owq;
-
-bool dsa_schedule_work(struct work_struct *work)
-{
-	return queue_work(dsa_owq, work);
-}
-
 int dsa_devlink_param_get(struct devlink *dl, u32 id,
 			  struct devlink_param_gset_ctx *ctx)
 {
@@ -465,11 +458,6 @@ static int __init dsa_init_module(void)
 {
 	int rc;
 
-	dsa_owq = alloc_ordered_workqueue("dsa_ordered",
-					  WQ_MEM_RECLAIM);
-	if (!dsa_owq)
-		return -ENOMEM;
-
 	rc = dsa_slave_register_notifier();
 	if (rc)
 		goto register_notifier_fail;
@@ -482,8 +470,6 @@ static int __init dsa_init_module(void)
 	return 0;
 
 register_notifier_fail:
-	destroy_workqueue(dsa_owq);
-
 	return rc;
 }
 module_init(dsa_init_module);
@@ -494,7 +480,6 @@ static void __exit dsa_cleanup_module(void)
 
 	dsa_slave_unregister_notifier();
 	dev_remove_pack(&dsa_pack_type);
-	destroy_workqueue(dsa_owq);
 }
 module_exit(dsa_cleanup_module);
 
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index b7a269e0513f..f759abceeb18 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -125,20 +125,6 @@ struct dsa_notifier_tag_8021q_vlan_info {
 	u16 vid;
 };
 
-struct dsa_switchdev_event_work {
-	struct dsa_switch *ds;
-	int port;
-	struct net_device *dev;
-	struct work_struct work;
-	unsigned long event;
-	/* Specific for SWITCHDEV_FDB_ADD_TO_DEVICE and
-	 * SWITCHDEV_FDB_DEL_TO_DEVICE
-	 */
-	unsigned char addr[ETH_ALEN];
-	u16 vid;
-	bool host_addr;
-};
-
 /* DSA_NOTIFIER_HSR_* */
 struct dsa_notifier_hsr_info {
 	struct net_device *hsr;
@@ -169,7 +155,6 @@ const struct dsa_device_ops *dsa_tag_driver_get(int tag_protocol);
 void dsa_tag_driver_put(const struct dsa_device_ops *ops);
 const struct dsa_device_ops *dsa_find_tagger_by_name(const char *buf);
 
-bool dsa_schedule_work(struct work_struct *work);
 const char *dsa_tag_protocol_to_str(const struct dsa_device_ops *ops);
 
 static inline int dsa_tag_protocol_overhead(const struct dsa_device_ops *ops)
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index b6a94861cddd..faa08e6d8651 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2278,73 +2278,18 @@ static int dsa_slave_netdevice_event(struct notifier_block *nb,
 	return NOTIFY_DONE;
 }
 
-static void
-dsa_fdb_offload_notify(struct dsa_switchdev_event_work *switchdev_work)
+static void dsa_fdb_offload_notify(struct net_device *dev,
+				   const unsigned char *addr,
+				   u16 vid)
 {
 	struct switchdev_notifier_fdb_info info = {};
-	struct dsa_switch *ds = switchdev_work->ds;
-	struct dsa_port *dp;
-
-	if (!dsa_is_user_port(ds, switchdev_work->port))
-		return;
 
-	info.addr = switchdev_work->addr;
-	info.vid = switchdev_work->vid;
+	info.addr = addr;
+	info.vid = vid;
 	info.offloaded = true;
-	dp = dsa_to_port(ds, switchdev_work->port);
-	call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED,
-				 dp->slave, &info.info, NULL);
-}
-
-static void dsa_slave_switchdev_event_work(struct work_struct *work)
-{
-	struct dsa_switchdev_event_work *switchdev_work =
-		container_of(work, struct dsa_switchdev_event_work, work);
-	struct dsa_switch *ds = switchdev_work->ds;
-	struct dsa_port *dp;
-	int err;
-
-	dp = dsa_to_port(ds, switchdev_work->port);
-
-	rtnl_lock();
-	switch (switchdev_work->event) {
-	case SWITCHDEV_FDB_ADD_TO_DEVICE:
-		if (switchdev_work->host_addr)
-			err = dsa_port_host_fdb_add(dp, switchdev_work->addr,
-						    switchdev_work->vid);
-		else
-			err = dsa_port_fdb_add(dp, switchdev_work->addr,
-					       switchdev_work->vid);
-		if (err) {
-			dev_err(ds->dev,
-				"port %d failed to add %pM vid %d to fdb: %d\n",
-				dp->index, switchdev_work->addr,
-				switchdev_work->vid, err);
-			break;
-		}
-		dsa_fdb_offload_notify(switchdev_work);
-		break;
-
-	case SWITCHDEV_FDB_DEL_TO_DEVICE:
-		if (switchdev_work->host_addr)
-			err = dsa_port_host_fdb_del(dp, switchdev_work->addr,
-						    switchdev_work->vid);
-		else
-			err = dsa_port_fdb_del(dp, switchdev_work->addr,
-					       switchdev_work->vid);
-		if (err) {
-			dev_err(ds->dev,
-				"port %d failed to delete %pM vid %d from fdb: %d\n",
-				dp->index, switchdev_work->addr,
-				switchdev_work->vid, err);
-		}
-
-		break;
-	}
-	rtnl_unlock();
 
-	dev_put(switchdev_work->dev);
-	kfree(switchdev_work);
+	call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED, dev, &info.info,
+				 NULL);
 }
 
 static bool dsa_foreign_dev_check(const struct net_device *dev,
@@ -2369,10 +2314,12 @@ static int dsa_slave_fdb_event(struct net_device *dev,
 			       const struct switchdev_notifier_fdb_info *fdb_info,
 			       unsigned long event)
 {
-	struct dsa_switchdev_event_work *switchdev_work;
 	struct dsa_port *dp = dsa_slave_to_port(dev);
+	const unsigned char *addr = fdb_info->addr;
 	bool host_addr = fdb_info->is_local;
 	struct dsa_switch *ds = dp->ds;
+	u16 vid = fdb_info->vid;
+	int err;
 
 	if (ctx && ctx != dp)
 		return 0;
@@ -2397,30 +2344,29 @@ static int dsa_slave_fdb_event(struct net_device *dev,
 	if (dsa_foreign_dev_check(dev, orig_dev))
 		host_addr = true;
 
-	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
-	if (!switchdev_work)
-		return -ENOMEM;
-
 	netdev_dbg(dev, "%s FDB entry towards %s, addr %pM vid %d%s\n",
 		   event == SWITCHDEV_FDB_ADD_TO_DEVICE ? "Adding" : "Deleting",
-		   orig_dev->name, fdb_info->addr, fdb_info->vid,
-		   host_addr ? " as host address" : "");
-
-	INIT_WORK(&switchdev_work->work, dsa_slave_switchdev_event_work);
-	switchdev_work->ds = ds;
-	switchdev_work->port = dp->index;
-	switchdev_work->event = event;
-	switchdev_work->dev = dev;
+		   orig_dev->name, addr, vid, host_addr ? " as host address" : "");
 
-	ether_addr_copy(switchdev_work->addr, fdb_info->addr);
-	switchdev_work->vid = fdb_info->vid;
-	switchdev_work->host_addr = host_addr;
+	switch (event) {
+	case SWITCHDEV_FDB_ADD_TO_DEVICE:
+		if (host_addr)
+			err = dsa_port_host_fdb_add(dp, addr, vid);
+		else
+			err = dsa_port_fdb_add(dp, addr, vid);
+		if (!err)
+			dsa_fdb_offload_notify(dev, addr, vid);
+		break;
 
-	/* Hold a reference for dsa_fdb_offload_notify */
-	dev_hold(dev);
-	dsa_schedule_work(&switchdev_work->work);
+	case SWITCHDEV_FDB_DEL_TO_DEVICE:
+		if (host_addr)
+			err = dsa_port_host_fdb_del(dp, addr, vid);
+		else
+			err = dsa_port_fdb_del(dp, addr, vid);
+		break;
+	}
 
-	return 0;
+	return err;
 }
 
 static int
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 1/5] net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking notifier chain
  2021-08-19 16:07 ` [PATCH v2 net-next 1/5] net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking notifier chain Vladimir Oltean
@ 2021-08-19 18:15   ` Vlad Buslov
  2021-08-19 23:18     ` Vladimir Oltean
  0 siblings, 1 reply; 34+ messages in thread
From: Vlad Buslov @ 2021-08-19 18:15 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Jakub Kicinski, David S. Miller, Roopa Prabhu,
	Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Vadym Kochan, Taras Chornyi,
	Jiri Pirko, Ido Schimmel, UNGLinuxDriver, Grygorii Strashko,
	Marek Behun, DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens,
	Woojung Huh, Sean Wang, Landen Chao, Claudiu Manoil,
	Alexandre Belloni, George McCollister, Ioana Ciornei,
	Saeed Mahameed, Leon Romanovsky, Lars Povlsen, Steen Hegelund,
	Julian Wiedmann, Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Jianbo Liu, Mark Bloch,
	Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Thu 19 Aug 2021 at 19:07, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> Currently, br_switchdev_fdb_notify() uses call_switchdev_notifiers (and
> br_fdb_replay() open-codes the same thing). This means that drivers
> handle the SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE events on the atomic
> switchdev notifier block.
>
> Most existing switchdev drivers either talk to firmware, or to a device
> over a bus where the I/O is sleepable (SPI, I2C, MDIO etc). So there
> exists an (anti)pattern where drivers make a sleepable context for
> offloading the given FDB entry by registering an ordered workqueue and
> scheduling work items on it, and doing all the work from there.
>
> The problem is the inherent limitation that this design imposes upon
> what a switchdev driver can do with those FDB entries.
>
> For example, a switchdev driver might want to perform FDB isolation,
> i.e. associate each FDB entry with the bridge it belongs to. Maybe the
> driver associates each bridge with a number, allocating that number when
> the first port of the driver joins that bridge, and freeing it when the
> last port leaves it.
>
> And this is where the problem is. When user space deletes a bridge and
> all the ports leave, the bridge will notify us of the deletion of all
> FDB entries in atomic context, and switchdev drivers will schedule their
> private work items on their private workqueue.
>
> The FDB entry deletion notifications will succeed, the bridge will then
> finish deleting itself, but the switchdev work items have not run yet.
> When they will eventually get scheduled, the aforementioned association
> between the bridge_dev and a number will have already been broken by the
> switchdev driver. All ports are standalone now, the bridge is a foreign
> interface!
>
> One might say "why don't you cache all your associations while you're
> still in the atomic context and they're still valid, pass them by value
> through your switchdev_work and work with the cached values as opposed
> to the current ones?"
>
> This option smells of poor design, because instead of fixing a central
> problem, we add tens of lateral workarounds to avoid it. It should be
> easier to use switchdev, not harder, and we should look at the common
> patterns which lead to code duplication and eliminate them.
>
> In this case, we must notice that
> (a) switchdev already has the concept of notifiers emitted from the fast
>     path that are still processed by drivers from blocking context. This
>     is accomplished through the SWITCHDEV_F_DEFER flag which is used by
>     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> (b) the bridge del_nbp() function already calls switchdev_deferred_process().
>     So if we could hook into that, we could have a chance that the
>     bridge simply waits for our FDB entry offloading procedure to finish
>     before it calls netdev_upper_dev_unlink() - which is almost
>     immediately afterwards, and also when switchdev drivers typically
>     break their stateful associations between the bridge upper and
>     private data.
>
> So it is in fact possible to use switchdev's generic
> switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> from there we can call_switchdev_blocking_notifiers().
>
> In the case of br_fdb_replay(), the only code path is from
> switchdev_bridge_port_offload(), which is already in blocking context.
> So we don't need to go through switchdev_deferred_enqueue, and we can
> just call the blocking notifier block directly.
>
> To preserve the same behavior as before, all drivers need to have their
> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handlers moved from their switchdev
> atomic notifier blocks to the blocking ones. This patch attempts to make
> that trivial movement. Note that now they might schedule a work item for
> nothing (since they are now called from a work item themselves), but I
> don't have the energy or hardware to test all of them, so this will have
> to do.
>
> Note that previously, we were under rcu_read_lock() but now we're not.
> I have eyeballed the drivers that make any sort of RCU assumption and
> for the most part, enclosed them between a private pair of
> rcu_read_lock() and rcu_read_unlock(). The exception is
> qeth_l2_switchdev_event, for which adding the rcu_read_lock and properly
> calling rcu_read_unlock from all places that return would result in more
> churn than what I am about to do. This function had an apparently bogus
> comment "Called under rtnl_lock", but to me this is not quite possible,
> since this is the handler function from register_switchdev_notifier
> which is on an atomic chain. But anyway, after the rework we _are_ under
> rtnl_mutex, so just drop the _rcu from the functions used by the qeth
> driver.
>
> The RCU protection can be dropped from the other drivers when they are
> reworked to stop scheduling.
>
> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
> ---

[...]

> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
> index 0c38c2e319be..ea7c3f07f6fe 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
> @@ -276,6 +276,55 @@ mlx5_esw_bridge_port_obj_attr_set(struct net_device *dev,
>  	return err;
>  }
>  
> +static struct mlx5_bridge_switchdev_fdb_work *
> +mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add,
> +					struct switchdev_notifier_fdb_info *fdb_info,
> +					struct mlx5_esw_bridge_offloads *br_offloads);
> +
> +static int
> +mlx5_esw_bridge_fdb_event(struct net_device *dev, unsigned long event,
> +			  struct switchdev_notifier_info *info,
> +			  struct mlx5_esw_bridge_offloads *br_offloads)
> +{
> +	struct switchdev_notifier_fdb_info *fdb_info;
> +	struct mlx5_bridge_switchdev_fdb_work *work;
> +	struct mlx5_eswitch *esw = br_offloads->esw;
> +	u16 vport_num, esw_owner_vhca_id;
> +	struct net_device *upper, *rep;
> +
> +	upper = netdev_master_upper_dev_get_rcu(dev);
> +	if (!upper)
> +		return 0;
> +	if (!netif_is_bridge_master(upper))
> +		return 0;
> +
> +	rep = mlx5_esw_bridge_rep_vport_num_vhca_id_get(dev, esw,
> +							&vport_num,
> +							&esw_owner_vhca_id);
> +	if (!rep)
> +		return 0;
> +
> +	/* only handle the event on peers */
> +	if (mlx5_esw_bridge_is_local(dev, rep, esw))
> +		return 0;

This check is only needed for the SWITCHDEV_FDB_DEL_TO_BRIDGE case.
Here it breaks the offload.

> +
> +	fdb_info = container_of(info, struct switchdev_notifier_fdb_info, info);
> +
> +	work = mlx5_esw_bridge_init_switchdev_fdb_work(dev,
> +						       event == SWITCHDEV_FDB_ADD_TO_DEVICE,
> +						       fdb_info,

Here the FDB info can already be deallocated[1], since this now
executes asynchronously and races with fdb_rcu_free(), which
fdb_delete() schedules to run after an RCU grace period.
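
Just to illustrate the race (a hypothetical sketch, not tested): the
fields would have to be copied by value at the time the event is
enqueued for deferral, while the bridge FDB entry is still alive,
rather than carrying the fdb_info->addr pointer into the deferred
call, e.g.:

	struct example_fdb_snapshot {
		unsigned char addr[ETH_ALEN];
		u16 vid;
	};

	static void
	example_snapshot_fdb(struct example_fdb_snapshot *s,
			     const struct switchdev_notifier_fdb_info *fdb_info)
	{
		/* by-value copy, still valid after fdb_rcu_free() runs */
		ether_addr_copy(s->addr, fdb_info->addr);
		s->vid = fdb_info->vid;
	}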

> +						       br_offloads);
> +	if (IS_ERR(work)) {
> +		WARN_ONCE(1, "Failed to init switchdev work, err=%ld",
> +			  PTR_ERR(work));
> +		return PTR_ERR(work);
> +	}
> +
> +	queue_work(br_offloads->wq, &work->work);
> +
> +	return 0;
> +}
> +
>  static int mlx5_esw_bridge_event_blocking(struct notifier_block *nb,
>  					  unsigned long event, void *ptr)
>  {
> @@ -295,6 +344,12 @@ static int mlx5_esw_bridge_event_blocking(struct notifier_block *nb,
>  	case SWITCHDEV_PORT_ATTR_SET:
>  		err = mlx5_esw_bridge_port_obj_attr_set(dev, ptr, br_offloads);
>  		break;
> +	case SWITCHDEV_FDB_ADD_TO_DEVICE:
> +	case SWITCHDEV_FDB_DEL_TO_DEVICE:
> +		rcu_read_lock();
> +		err = mlx5_esw_bridge_fdb_event(dev, event, ptr, br_offloads);
> +		rcu_read_unlock();
> +		break;
>  	default:
>  		err = 0;
>  	}
> @@ -415,9 +470,7 @@ static int mlx5_esw_bridge_switchdev_event(struct notifier_block *nb,
>  		/* only handle the event on peers */
>  		if (mlx5_esw_bridge_is_local(dev, rep, esw))
>  			break;

I really like the idea of completely removing the driver wq from the
FDB handling code, but I'm not yet familiar enough with the bridge
internals to easily determine whether the same approach can be applied
to the SWITCHDEV_FDB_{ADD|DEL}_TO_BRIDGE events after this series is
accepted. It seems that all current users already generate these events
from blocking context, so in your opinion, would it be a trivial change
for me to make? That would allow me to get rid of
mlx5_esw_bridge_offloads->wq in our driver.

> -		fallthrough;
> -	case SWITCHDEV_FDB_ADD_TO_DEVICE:
> -	case SWITCHDEV_FDB_DEL_TO_DEVICE:
> +
>  		fdb_info = container_of(info,
>  					struct switchdev_notifier_fdb_info,
>  					info);

[...]

[1]:
[  579.633363] ==================================================================                                              
[  579.634922] BUG: KASAN: use-after-free in mlx5_esw_bridge_init_switchdev_fdb_work+0x363/0x400 [mlx5_core]            
[  579.636969] Read of size 4 at addr ffff888130175d90 by task ip/7454                                                         
                                                                                                                               
[  579.638898] CPU: 0 PID: 7454 Comm: ip Not tainted 5.14.0-rc5+ #7                                                            
[  579.640549] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  579.643617] Call Trace:                                                                                                     
[  579.644444]  dump_stack_lvl+0x46/0x5a                                                                                
[  579.645568]  print_address_description.constprop.0+0x1f/0x140                                                               
[  579.647195]  ? mlx5_esw_bridge_init_switchdev_fdb_work+0x363/0x400 [mlx5_core]                                              
[  579.649365]  ? mlx5_esw_bridge_init_switchdev_fdb_work+0x363/0x400 [mlx5_core]                                              
[  579.651203]  kasan_report.cold+0x83/0xdf                                                                             
[  579.652035]  ? mlx5_esw_bridge_init_switchdev_fdb_work+0x363/0x400 [mlx5_core]                                              
[  579.653570]  mlx5_esw_bridge_init_switchdev_fdb_work+0x363/0x400 [mlx5_core]                                                
[  579.655005]  mlx5_esw_bridge_event_blocking+0x346/0x610 [mlx5_core]                                                         
[  579.656328]  ? mlx5_esw_bridge_port_obj_attr_set+0x320/0x320 [mlx5_core]                                                    
[  579.657708]  ? rwsem_mark_wake+0x7e0/0x7e0                                                                                  
[  579.658599]  ? rwsem_down_read_slowpath+0x142/0xad0                                                                         
[  579.659653]  blocking_notifier_call_chain+0xdb/0x130                                                                 
[  579.660724]  ? switchdev_fdb_add_deferred+0x1b0/0x1b0                                                                       
[  579.661813]  switchdev_fdb_del_deferred+0x10c/0x1b0                                                                         
[  579.662871]  ? switchdev_fdb_add_deferred+0x1b0/0x1b0                                                                       
[  579.663964]  ? _raw_spin_lock+0xd0/0xd0                                                                              
[  579.664825]  ? switchdev_deferred_process+0x175/0x290                                                                
[  579.665912]  ? kfree+0xa8/0x420                                                                                      
[  579.666656]  switchdev_deferred_process+0x12f/0x290                                                                         
[  579.667715]  del_nbp+0x35c/0xcb0 [bridge]                                                                            
[  579.668623]  br_dev_delete+0x8d/0x190 [bridge]                                                                       
[  579.669609]  rtnl_dellink+0x2cb/0x9b0                                                                                
[  579.670456]  ? unwind_next_frame+0x11fb/0x1a40                                                                       
[  579.671431]  ? rtnl_bridge_getlink+0x650/0x650                                                                              
[  579.672403]  ? deref_stack_reg+0xe6/0x160                                                                            
[  579.673291]  ? unwind_next_frame+0x11fb/0x1a40                                                                              
[  579.674273]  ? arch_stack_walk+0x9e/0xf0                                                                             
[  579.675152]  ? mutex_lock+0xa1/0xf0                                                                                         
[  579.675947]  ? __mutex_lock_slowpath+0x10/0x10                                                                              
[  579.676922]  rtnetlink_rcv_msg+0x359/0x9a0                                                                                  
[  579.677838]  ? rtnl_calcit.isra.0+0x2b0/0x2b0                                                                        
[  579.678795]  ? ___sys_sendmsg+0xd8/0x160                                                                                    
[  579.679669]  ? __sys_sendmsg+0xb7/0x140                                                                              
[  579.680532]  ? do_syscall_64+0x3b/0x90                                                                               
[  579.681426]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae                                                              
[  579.682540]  ? kasan_save_stack+0x32/0x40                                                                            
[  579.683429]  ? kasan_save_stack+0x1b/0x40                                                                            
[  579.684321]  ? kasan_record_aux_stack+0xa3/0xb0                                                                      
[  579.685308]  ? task_work_add+0x3a/0x130                                                                              
[  579.686168]  ? fput_many.part.0+0x8c/0x110                                                                           
[  579.687071]  ? path_openat+0x1e02/0x3960                                                                                    
[  579.687944]  ? do_filp_open+0x19e/0x3e0
[  579.699734]  ? do_sys_openat2+0x122/0x360                                                                            
[  579.700627]  ? __x64_sys_openat+0x120/0x1d0                                                                          
[  579.701548]  ? do_syscall_64+0x3b/0x90                                                                               
[  579.702350]  netlink_rcv_skb+0x120/0x350                                                                             
[  579.703180]  ? rtnl_calcit.isra.0+0x2b0/0x2b0                                                                        
[  579.704084]  ? netlink_ack+0x9c0/0x9c0                                                                               
[  579.704880]  ? netlink_deliver_tap+0x7f/0x8f0                                                                        
[  579.705777]  ? _copy_from_iter+0x277/0xdb0                                                                           
[  579.706641]  netlink_unicast+0x4c6/0x7a0                                                                             
[  579.707470]  ? netlink_attachskb+0x750/0x750                                                                         
[  579.708352]  ? __build_skb_around+0x1f9/0x2b0                                                                        
[  579.709250]  ? __check_object_size+0x23e/0x300                                                                       
[  579.710171]  netlink_sendmsg+0x70a/0xbf0                                                                             
[  579.711045]  ? netlink_unicast+0x7a0/0x7a0                                                                           
[  579.711951]  ? __import_iovec+0x51/0x610                                                                             
[  579.712825]  ? netlink_unicast+0x7a0/0x7a0                                                                           
[  579.713736]  sock_sendmsg+0xe4/0x110                                                                                 
[  579.714555]  ____sys_sendmsg+0x5cf/0x7d0                                                                             
[  579.715429]  ? kernel_sendmsg+0x30/0x30                                                                              
[  579.716292]  ? __ia32_sys_recvmmsg+0x210/0x210                                                                       
[  579.717265]  ? trace_event_raw_event_mmap_lock_released+0x240/0x240                                                  
[  579.718566]  ? lru_cache_add+0x17d/0x2a0                                                                             
[  579.719440]  ? wp_page_copy+0x87c/0x1370                                                                             
[  579.720315]  ___sys_sendmsg+0xd8/0x160                                                                               
[  579.721156]  ? sendmsg_copy_msghdr+0x110/0x110                                                                       
[  579.722142]  ? do_wp_page+0x1d1/0xf50                                                                                
[  579.722970]  ? __handle_mm_fault+0x1c96/0x3390                                                                       
[  579.723943]  ? vm_iomap_memory+0x170/0x170                                                                           
[  579.724855]  ? __fget_light+0x51/0x220                                                                               
[  579.725696]  __sys_sendmsg+0xb7/0x140                                                                                
[  579.726526]  ? __sys_sendmsg_sock+0x20/0x20                                                                          
[  579.727450]  ? copy_page_range+0x14c0/0x2a40                                                                         
[  579.728389]  do_syscall_64+0x3b/0x90                                                                                 
[  579.729199]  entry_SYSCALL_64_after_hwframe+0x44/0xae                                                                
[  579.730285] RIP: 0033:0x7feb5f746c17                                                                                 
[  579.731099] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 
[  579.734799] RSP: 002b:00007fff12a9e948 EFLAGS: 00000246 ORIG_RAX: 000000000000002e                                   
[  579.736403] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007feb5f746c17                                        
[  579.737857] RDX: 0000000000000000 RSI: 00007fff12a9e9b0 RDI: 0000000000000003                                        
[  579.739316] RBP: 00000000611e94b8 R08: 0000000000000001 R09: 0000000000403578                                        
[  579.740770] R10: 00007feb5f8948b0 R11: 0000000000000246 R12: 0000000000000001                                        
[  579.742226] R13: 00007fff12a9f060 R14: 0000000000000000 R15: 000000000048e520                                        
                                                                                                                        
[  579.744115] Allocated by task 0:                                                                                     
[  579.744872]  kasan_save_stack+0x1b/0x40                                                                              
[  579.745730]  __kasan_slab_alloc+0x61/0x80                                                                            
[  579.746623]  kmem_cache_alloc+0x14c/0x2f0
[  579.747515]  fdb_create+0x32/0xc30 [bridge]                                                                           
[  579.748450]  br_fdb_update+0x301/0x730 [bridge]                                                                      
[  579.749444]  br_handle_frame_finish+0x5f7/0x1690 [bridge]                                                             
[  579.750611]  br_handle_frame+0x55f/0x910 [bridge]                                                                     
[  579.751647]  __netif_receive_skb_core+0xfc3/0x2a10                                                                    
[  579.752680]  __netif_receive_skb_list_core+0x2ef/0x900                                                                
[  579.753777]  netif_receive_skb_list_internal+0x5f4/0xc60                                                              
[  579.754933]  napi_complete_done+0x188/0x5d0                                                                          
[  579.755856]  mlx5e_napi_poll+0x2bc/0x1680 [mlx5_core]                                                                 
[  579.757014]  __napi_poll+0xa1/0x420                                                                                   
[  579.757808]  net_rx_action+0x2c4/0x950                                                                                
[  579.758655]  __do_softirq+0x1a0/0x57f                                                                                
                                                                                                                         
[  579.759918] Freed by task 0:                                                                                          
[  579.760600]  kasan_save_stack+0x1b/0x40                                                                               
[  579.761464]  kasan_set_track+0x1c/0x30                                                                                
[  579.762321]  kasan_set_free_info+0x20/0x30                                                                            
[  579.763225]  __kasan_slab_free+0xeb/0x120                                                                             
[  579.764115]  kmem_cache_free+0x82/0x3f0                                                                              
[  579.764978]  rcu_do_batch+0x32f/0xba0                                                                                 
[  579.765802]  rcu_core+0x4c4/0x910                                                                                     
[  579.766560]  __do_softirq+0x1a0/0x57f                                                                                 
                                                                                                                        
[  579.767804] Last potentially related work creation:                                                                  
[  579.768820]  kasan_save_stack+0x1b/0x40                                                                              
[  579.769660]  kasan_record_aux_stack+0xa3/0xb0                                                                         
[  579.770599]  call_rcu+0xe3/0x1230                                                                                    
[  579.771367]  br_fdb_delete_by_port+0x1d7/0x270 [bridge]                                                              
[  579.772468]  br_stp_disable_port+0x150/0x180 [bridge]                                                                
[  579.773541]  del_nbp+0x11e/0xcb0 [bridge]                                                                            
[  579.774435]  br_dev_delete+0x8d/0x190 [bridge]                                                                        
[  579.775391]  rtnl_dellink+0x2cb/0x9b0                                                                                
[  579.776218]  rtnetlink_rcv_msg+0x359/0x9a0                                                                            
[  579.777123]  netlink_rcv_skb+0x120/0x350                                                                             
[  579.778035]  netlink_unicast+0x4c6/0x7a0                                                                              
[  579.778904]  netlink_sendmsg+0x70a/0xbf0                                                                              
[  579.779777]  sock_sendmsg+0xe4/0x110                                                                                  
[  579.780589]  ____sys_sendmsg+0x5cf/0x7d0                                                                             
[  579.781462]  ___sys_sendmsg+0xd8/0x160                                                                                
[  579.782315]  __sys_sendmsg+0xb7/0x140                                                                                
[  579.783144]  do_syscall_64+0x3b/0x90                                                                                 
[  579.783959]  entry_SYSCALL_64_after_hwframe+0x44/0xae                                                                
                                                                                                                        
[  579.785467] The buggy address belongs to the object at ffff888130175d80                                              
                which belongs to the cache bridge_fdb_cache of size 128                                                 
[  579.788085] The buggy address is located 16 bytes inside of                                                          
                128-byte region [ffff888130175d80, ffff888130175e00)                                                    
[  579.790432] The buggy address belongs to the page:                                                                    
[  579.791461] page:0000000044cdd676 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888130175cc0 pfn:0x130175
[  579.795093] raw: 0017ffffc0000200 0000000000000000 dead000000000122 ffff88811ea56140                                        
[  579.796733] raw: ffff888130175cc0 0000000080150009 00000001ffffffff 0000000000000000                                        
[  579.798380] page dumped because: kasan: bad access detected                                                                 
                                                                                                                               
[  579.799984] Memory state around the buggy address:                                                                          
[  579.801019]  ffff888130175c80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb                                       
[  579.802566]  ffff888130175d00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc                                              
[  579.804107] >ffff888130175d80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb                                              
[  579.805654]                          ^                                                                                      
[  579.806488]  ffff888130175e00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb                                       
[  579.807945]  ffff888130175e80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc                                              
[  579.809398] ==================================================================                                              
[  579.810865] Disabling lock debugging due to kernel taint                                                                    
[  579.811956] ==================================================================                                              
[  579.813429] BUG: KASAN: use-after-free in mlx5_esw_bridge_init_switchdev_fdb_work+0x339/0x400 [mlx5_core]                   
[  579.815432] Read of size 2 at addr ffff888130175d94 by task ip/7454                                                         
                                                                                                                        
[  579.817174] CPU: 0 PID: 7454 Comm: ip Tainted: G    B             5.14.0-rc5+ #7                                            
[  579.818758] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[  579.821035] Call Trace:                                                                                                     
[  579.821649]  dump_stack_lvl+0x46/0x5a                                                                                
[  579.822492]  print_address_description.constprop.0+0x1f/0x140                                                        
[  579.823706]  ? mlx5_esw_bridge_init_switchdev_fdb_work+0x339/0x400 [mlx5_core]                                       
[  579.825317]  ? mlx5_esw_bridge_init_switchdev_fdb_work+0x339/0x400 [mlx5_core]                                              
[  579.826845]  kasan_report.cold+0x83/0xdf                                                                             
[  579.827674]  ? mlx5_esw_bridge_init_switchdev_fdb_work+0x339/0x400 [mlx5_core]                                       
[  579.829202]  mlx5_esw_bridge_init_switchdev_fdb_work+0x339/0x400 [mlx5_core]                                         
[  579.830638]  mlx5_esw_bridge_event_blocking+0x346/0x610 [mlx5_core]                                                  
[  579.831933]  ? mlx5_esw_bridge_port_obj_attr_set+0x320/0x320 [mlx5_core]                                                    
[  579.833308]  ? rwsem_mark_wake+0x7e0/0x7e0                                                                           
[  579.834219]  ? rwsem_down_read_slowpath+0x142/0xad0                                                                         
[  579.835271]  blocking_notifier_call_chain+0xdb/0x130                                                                 
[  579.836345]  ? switchdev_fdb_add_deferred+0x1b0/0x1b0                                                                       
[  579.837427]  switchdev_fdb_del_deferred+0x10c/0x1b0                                                                         
[  579.838484]  ? switchdev_fdb_add_deferred+0x1b0/0x1b0                                                                       
[  579.839577]  ? _raw_spin_lock+0xd0/0xd0                                                                              
[  579.840439]  ? switchdev_deferred_process+0x175/0x290                                                                       
[  579.841518]  ? kfree+0xa8/0x420                                                                                      
[  579.842256]  switchdev_deferred_process+0x12f/0x290                                                                  
[  579.843317]  del_nbp+0x35c/0xcb0 [bridge]                                                                            
[  579.844228]  br_dev_delete+0x8d/0x190 [bridge]                                                                       
[  579.845212]  rtnl_dellink+0x2cb/0x9b0                                                                                
[  579.846045]  ? unwind_next_frame+0x11fb/0x1a40                                                                       
[  579.847023]  ? rtnl_bridge_getlink+0x650/0x650                                                                       
[  579.847994]  ? deref_stack_reg+0xe6/0x160                                                                            
[  579.848879]  ? unwind_next_frame+0x11fb/0x1a40                                                                              
[  579.849850]  ? arch_stack_walk+0x9e/0xf0
[  579.850731]  ? mutex_lock+0xa1/0xf0                                  
[  579.851530]  ? __mutex_lock_slowpath+0x10/0x10                       
[  579.852499]  rtnetlink_rcv_msg+0x359/0x9a0                           
[  579.853406]  ? rtnl_calcit.isra.0+0x2b0/0x2b0                        
[  579.854399]  ? ___sys_sendmsg+0xd8/0x160                             
[  579.855275]  ? __sys_sendmsg+0xb7/0x140                              
[  579.856135]  ? do_syscall_64+0x3b/0x90                               
[  579.856984]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae              
[  579.858101]  ? kasan_save_stack+0x32/0x40                            
[  579.858995]  ? kasan_save_stack+0x1b/0x40                            
[  579.859887]  ? kasan_record_aux_stack+0xa3/0xb0                      
[  579.860878]  ? task_work_add+0x3a/0x130                              
[  579.861751]  ? fput_many.part.0+0x8c/0x110                           
[  579.862677]  ? path_openat+0x1e02/0x3960                             
[  579.863551]  ? do_filp_open+0x19e/0x3e0                              
[  579.864413]  ? do_sys_openat2+0x122/0x360                            
[  579.865303]  ? __x64_sys_openat+0x120/0x1d0                          
[  579.877159]  ? do_syscall_64+0x3b/0x90                               
[  579.878007]  netlink_rcv_skb+0x120/0x350                             
[  579.878834]  ? rtnl_calcit.isra.0+0x2b0/0x2b0                        
[  579.879733]  ? netlink_ack+0x9c0/0x9c0                               
[  579.880534]  ? netlink_deliver_tap+0x7f/0x8f0                        
[  579.881429]  ? _copy_from_iter+0x277/0xdb0                           
[  579.882291]  netlink_unicast+0x4c6/0x7a0                             
[  579.883122]  ? netlink_attachskb+0x750/0x750                         
[  579.884010]  ? __build_skb_around+0x1f9/0x2b0                        
[  579.884906]  ? __check_object_size+0x23e/0x300                       
[  579.885819]  netlink_sendmsg+0x70a/0xbf0                             
[  579.886654]  ? netlink_unicast+0x7a0/0x7a0                           
[  579.887565]  ? __import_iovec+0x51/0x610                             
[  579.888440]  ? netlink_unicast+0x7a0/0x7a0                           
[  579.889344]  sock_sendmsg+0xe4/0x110                                 
[  579.890163]  ____sys_sendmsg+0x5cf/0x7d0                             
[  579.891047]  ? kernel_sendmsg+0x30/0x30                              
[  579.891908]  ? __ia32_sys_recvmmsg+0x210/0x210                       
[  579.892884]  ? trace_event_raw_event_mmap_lock_released+0x240/0x240  
[  579.894198]  ? lru_cache_add+0x17d/0x2a0                             
[  579.895084]  ? wp_page_copy+0x87c/0x1370                             
[  579.895960]  ___sys_sendmsg+0xd8/0x160                               
[  579.896803]  ? sendmsg_copy_msghdr+0x110/0x110                       
[  579.897744]  ? do_wp_page+0x1d1/0xf50                                
[  579.898537]  ? __handle_mm_fault+0x1c96/0x3390                       
[  579.899450]  ? vm_iomap_memory+0x170/0x170                           
[  579.900313]  ? __fget_light+0x51/0x220                               
[  579.901114]  __sys_sendmsg+0xb7/0x140                                
[  579.901898]  ? __sys_sendmsg_sock+0x20/0x20                          
[  579.902774]  ? copy_page_range+0x14c0/0x2a40                         
[  579.903666]  do_syscall_64+0x3b/0x90
[  579.904440]  entry_SYSCALL_64_after_hwframe+0x44/0xae                                                                
[  579.905461] RIP: 0033:0x7feb5f746c17                                                                                  
[  579.906237] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 7
[  579.909720] RSP: 002b:00007fff12a9e948 EFLAGS: 00000246 ORIG_RAX: 000000000000002e                                    
[  579.911339] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007feb5f746c17                                         
[  579.912803] RDX: 0000000000000000 RSI: 00007fff12a9e9b0 RDI: 0000000000000003                                         
[  579.914271] RBP: 00000000611e94b8 R08: 0000000000000001 R09: 0000000000403578                                        
[  579.915737] R10: 00007feb5f8948b0 R11: 0000000000000246 R12: 0000000000000001                                         
[  579.917200] R13: 00007fff12a9f060 R14: 0000000000000000 R15: 000000000048e520                                         
                                                                                                                         
[  579.919112] Allocated by task 0:                                                                                     
[  579.919861]  kasan_save_stack+0x1b/0x40                                                                               
[  579.920740]  __kasan_slab_alloc+0x61/0x80                                                                             
[  579.921636]  kmem_cache_alloc+0x14c/0x2f0                                                                             
[  579.922537]  fdb_create+0x32/0xc30 [bridge]                                                                           
[  579.923470]  br_fdb_update+0x301/0x730 [bridge]                                                                       
[  579.924463]  br_handle_frame_finish+0x5f7/0x1690 [bridge]                                                             
[  579.925613]  br_handle_frame+0x55f/0x910 [bridge]                                                                    
[  579.926644]  __netif_receive_skb_core+0xfc3/0x2a10                                                                    
[  579.927675]  __netif_receive_skb_list_core+0x2ef/0x900                                                                
[  579.928773]  netif_receive_skb_list_internal+0x5f4/0xc60                                                              
[  579.929896]  napi_complete_done+0x188/0x5d0                                                                          
[  579.930828]  mlx5e_napi_poll+0x2bc/0x1680 [mlx5_core]                                                                
[  579.931985]  __napi_poll+0xa1/0x420                                                                                  
[  579.932785]  net_rx_action+0x2c4/0x950                                                                                
[  579.933633]  __do_softirq+0x1a0/0x57f                                                                                
                                                                                                                        
[  579.934896] Freed by task 0:                                                                                         
[  579.935586]  kasan_save_stack+0x1b/0x40                                                                              
[  579.936443]  kasan_set_track+0x1c/0x30                                                                                
[  579.937285]  kasan_set_free_info+0x20/0x30                                                                           
[  579.938199]  __kasan_slab_free+0xeb/0x120                                                                             
[  579.939087]  kmem_cache_free+0x82/0x3f0                                                                              
[  579.939945]  rcu_do_batch+0x32f/0xba0                                                                                 
[  579.940777]  rcu_core+0x4c4/0x910                                                                                     
[  579.941542]  __do_softirq+0x1a0/0x57f                                                                                 
                                                                                                                        
[  579.942806] Last potentially related work creation:                                                                   
[  579.943855]  kasan_save_stack+0x1b/0x40                                                                              
[  579.944710]  kasan_record_aux_stack+0xa3/0xb0                                                                        
[  579.945664]  call_rcu+0xe3/0x1230                                                                                    
[  579.946430]  br_fdb_delete_by_port+0x1d7/0x270 [bridge]                                                              
[  579.947557]  br_stp_disable_port+0x150/0x180 [bridge]                                                                
[  579.948649]  del_nbp+0x11e/0xcb0 [bridge]                                                                            
[  579.949552]  br_dev_delete+0x8d/0x190 [bridge]                                                                       
[  579.950536]  rtnl_dellink+0x2cb/0x9b0                                                                                
[  579.951365]  rtnetlink_rcv_msg+0x359/0x9a0                                                                            
[  579.952267]  netlink_rcv_skb+0x120/0x350
[  579.953145]  netlink_unicast+0x4c6/0x7a0                                                                             
[  579.954029]  netlink_sendmsg+0x70a/0xbf0                                                                              
[  579.954897]  sock_sendmsg+0xe4/0x110                                                                                  
[  579.955707]  ____sys_sendmsg+0x5cf/0x7d0                                                                              
[  579.956584]  ___sys_sendmsg+0xd8/0x160                                                                                
[  579.957427]  __sys_sendmsg+0xb7/0x140                                                                                 
[  579.958254]  do_syscall_64+0x3b/0x90                                                                                 
[  579.959069]  entry_SYSCALL_64_after_hwframe+0x44/0xae                                                                 
                                                                                                                         
[  579.960584] The buggy address belongs to the object at ffff888130175d80                                               
                which belongs to the cache bridge_fdb_cache of size 128                                                 
[  579.963194] The buggy address is located 20 bytes inside of                                                           
                128-byte region [ffff888130175d80, ffff888130175e00)                                                     
[  579.965550] The buggy address belongs to the page:                                                                    
[  579.966596] page:0000000044cdd676 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888130175cc0 pfn:0x130175
[  579.968774] flags: 0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)                                           
[  579.970212] raw: 0017ffffc0000200 0000000000000000 dead000000000122 ffff88811ea56140                                  
[  579.971860] raw: ffff888130175cc0 0000000080150009 00000001ffffffff 0000000000000000                                 
[  579.973495] page dumped because: kasan: bad access detected                                                           
                                                                                                                         
[  579.975112] Memory state around the buggy address:                                                                    
[  579.976140]  ffff888130175c80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb                                       
[  579.977693]  ffff888130175d00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc                                       
[  579.979241] >ffff888130175d80: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb                                       
[  579.980789]                          ^                                                                                
[  579.981632]  ffff888130175e00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb                                       
[  579.983183]  ffff888130175e80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc                                       
[  579.984724] ==================================================================                                       

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 1/5] net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking notifier chain
  2021-08-19 18:15   ` Vlad Buslov
@ 2021-08-19 23:18     ` Vladimir Oltean
  2021-08-20  7:36       ` Vlad Buslov
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-19 23:18 UTC (permalink / raw)
  To: Vlad Buslov
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Jianbo Liu, Mark Bloch,
	Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

Hi Vlad,

On Thu, Aug 19, 2021 at 09:15:17PM +0300, Vlad Buslov wrote:
> On Thu 19 Aug 2021 at 19:07, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
> > index 0c38c2e319be..ea7c3f07f6fe 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
> > @@ -276,6 +276,55 @@ mlx5_esw_bridge_port_obj_attr_set(struct net_device *dev,
> >  	return err;
> >  }
> >
> > +static struct mlx5_bridge_switchdev_fdb_work *
> > +mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add,
> > +					struct switchdev_notifier_fdb_info *fdb_info,
> > +					struct mlx5_esw_bridge_offloads *br_offloads);
> > +
> > +static int
> > +mlx5_esw_bridge_fdb_event(struct net_device *dev, unsigned long event,
> > +			  struct switchdev_notifier_info *info,
> > +			  struct mlx5_esw_bridge_offloads *br_offloads)
> > +{
> > +	struct switchdev_notifier_fdb_info *fdb_info;
> > +	struct mlx5_bridge_switchdev_fdb_work *work;
> > +	struct mlx5_eswitch *esw = br_offloads->esw;
> > +	u16 vport_num, esw_owner_vhca_id;
> > +	struct net_device *upper, *rep;
> > +
> > +	upper = netdev_master_upper_dev_get_rcu(dev);
> > +	if (!upper)
> > +		return 0;
> > +	if (!netif_is_bridge_master(upper))
> > +		return 0;
> > +
> > +	rep = mlx5_esw_bridge_rep_vport_num_vhca_id_get(dev, esw,
> > +							&vport_num,
> > +							&esw_owner_vhca_id);
> > +	if (!rep)
> > +		return 0;
> > +
> > +	/* only handle the event on peers */
> > +	if (mlx5_esw_bridge_is_local(dev, rep, esw))
> > +		return 0;
>
> This check is only needed for the SWITCHDEV_FDB_DEL_TO_BRIDGE case. Here it
> breaks the offload.

Very good point, thanks for looking. I copied the entire atomic notifier
handler and deleted the code which wasn't needed, but I actually took a
break while converting mlx5, and so I forgot to delete this part when I
came back.

> > +
> > +	fdb_info = container_of(info, struct switchdev_notifier_fdb_info, info);
> > +
> > +	work = mlx5_esw_bridge_init_switchdev_fdb_work(dev,
> > +						       event == SWITCHDEV_FDB_ADD_TO_DEVICE,
> > +						       fdb_info,
>
> Here the FDB info can already be deallocated[1], since this now executes
> asynchronously and races with fdb_rcu_free(), which fdb_delete() schedules
> to run after an RCU grace period.

I am incredibly lucky that you caught this, apparently I needed to add
an msleep(1000) to see it as well.

It is not the struct switchdev_notifier_fdb_info *fdb_info that gets
freed under RCU. It is fdb_info->addr (the MAC address), since
switchdev_deferred_enqueue only performs a shallow copy. I will address
that in v3.
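
Concretely, what I have in mind for v3 is to make the deferral keep its own
copy of the MAC instead of the pointer into the bridge FDB entry. Untested
sketch, and the wrapper struct is made up purely for illustration; only the
deep copy matters:

	struct deferred_fdb_item {
		struct switchdev_notifier_fdb_info fdb_info;
		unsigned char addr[ETH_ALEN];
	};

	/* in the deferral path, on top of the existing shallow copy: */
	item->fdb_info = *fdb_info;			/* shallow copy */
	ether_addr_copy(item->addr, fdb_info->addr);	/* deep copy the MAC */
	item->fdb_info.addr = item->addr;		/* repoint at our copy */

That way, by the time the deferred notifier runs in blocking context,
nothing dereferences bridge memory that may have been freed under RCU.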

> > @@ -415,9 +470,7 @@ static int mlx5_esw_bridge_switchdev_event(struct notifier_block *nb,
> >  		/* only handle the event on peers */
> >  		if (mlx5_esw_bridge_is_local(dev, rep, esw))
> >  			break;
>
> I really like the idea of completely removing the driver wq from the FDB
> handling code, but I'm not yet familiar enough with bridge internals to
> easily determine whether the same approach can be applied to the
> SWITCHDEV_FDB_{ADD|DEL}_TO_BRIDGE events after this series is accepted.
> It seems that all current users already generate these events from
> blocking context, so would it be a trivial change for me to do, in your
> opinion? That would allow me to get rid of mlx5_esw_bridge_offloads->wq
> in our driver.

If all callers really are in blocking context (and they do appear to be),
you can even forgo the switchdev_deferred_enqueue that switchdev_fdb_add_to_device
does, and just call call_switchdev_blocking_notifiers() directly. Then you
move the bridge handler from br_switchdev_event() to br_switchdev_blocking_event().
It should be even simpler than this conversion.
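
In code, the driver side would boil down to something like this (rough
sketch, not tested; I'm assuming you already have the bridge port netdev
and the MAC/VID at hand, in sleepable context with rtnl held):

	struct switchdev_notifier_fdb_info fdb_info = {
		.addr = mac,
		.vid = vid,
		.offloaded = true,
	};
	int err;

	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_ADD_TO_BRIDGE,
						brport_dev, &fdb_info.info,
						NULL);
	err = notifier_to_errno(err);

and on the bridge side, a handler doing what br_switchdev_event() does
today for these events, but registered through
register_switchdev_blocking_notifier().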

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 1/5] net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking notifier chain
  2021-08-19 23:18     ` Vladimir Oltean
@ 2021-08-20  7:36       ` Vlad Buslov
  0 siblings, 0 replies; 34+ messages in thread
From: Vlad Buslov @ 2021-08-20  7:36 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Jianbo Liu, Mark Bloch,
	Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Fri 20 Aug 2021 at 02:18, Vladimir Oltean <olteanv@gmail.com> wrote:
> Hi Vlad,
>
> On Thu, Aug 19, 2021 at 09:15:17PM +0300, Vlad Buslov wrote:
>> On Thu 19 Aug 2021 at 19:07, Vladimir Oltean <vladimir.oltean@nxp.com> wrote:
>> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
>> > index 0c38c2e319be..ea7c3f07f6fe 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c
>> > @@ -276,6 +276,55 @@ mlx5_esw_bridge_port_obj_attr_set(struct net_device *dev,
>> >  	return err;
>> >  }
>> >
>> > +static struct mlx5_bridge_switchdev_fdb_work *
>> > +mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add,
>> > +					struct switchdev_notifier_fdb_info *fdb_info,
>> > +					struct mlx5_esw_bridge_offloads *br_offloads);
>> > +
>> > +static int
>> > +mlx5_esw_bridge_fdb_event(struct net_device *dev, unsigned long event,
>> > +			  struct switchdev_notifier_info *info,
>> > +			  struct mlx5_esw_bridge_offloads *br_offloads)
>> > +{
>> > +	struct switchdev_notifier_fdb_info *fdb_info;
>> > +	struct mlx5_bridge_switchdev_fdb_work *work;
>> > +	struct mlx5_eswitch *esw = br_offloads->esw;
>> > +	u16 vport_num, esw_owner_vhca_id;
>> > +	struct net_device *upper, *rep;
>> > +
>> > +	upper = netdev_master_upper_dev_get_rcu(dev);
>> > +	if (!upper)
>> > +		return 0;
>> > +	if (!netif_is_bridge_master(upper))
>> > +		return 0;
>> > +
>> > +	rep = mlx5_esw_bridge_rep_vport_num_vhca_id_get(dev, esw,
>> > +							&vport_num,
>> > +							&esw_owner_vhca_id);
>> > +	if (!rep)
>> > +		return 0;
>> > +
>> > +	/* only handle the event on peers */
>> > +	if (mlx5_esw_bridge_is_local(dev, rep, esw))
>> > +		return 0;
>>
>> This check is only needed for the SWITCHDEV_FDB_DEL_TO_BRIDGE case. Here it
>> breaks the offload.
>
> Very good point, thanks for looking. I copied the entire atomic notifier
> handler and deleted the code which wasn't needed, but I actually took a
> break while converting mlx5, and so I forgot to delete this part when I
> came back.
>
>> > +
>> > +	fdb_info = container_of(info, struct switchdev_notifier_fdb_info, info);
>> > +
>> > +	work = mlx5_esw_bridge_init_switchdev_fdb_work(dev,
>> > +						       event == SWITCHDEV_FDB_ADD_TO_DEVICE,
>> > +						       fdb_info,
>>
>> Here the FDB info can already be deallocated[1], since this now executes
>> asynchronously and races with fdb_rcu_free(), which fdb_delete() schedules
>> to run after an RCU grace period.
>
> I am incredibly lucky that you caught this, apparently I needed to add
> an msleep(1000) to see it as well.
>
> It is not the struct switchdev_notifier_fdb_info *fdb_info that gets
> freed under RCU. It is fdb_info->addr (the MAC address), since
> switchdev_deferred_enqueue only performs a shallow copy. I will address
> that in v3.
>
>> > @@ -415,9 +470,7 @@ static int mlx5_esw_bridge_switchdev_event(struct notifier_block *nb,
>> >  		/* only handle the event on peers */
>> >  		if (mlx5_esw_bridge_is_local(dev, rep, esw))
>> >  			break;
>>
>> I really like the idea of completely removing the driver wq from the FDB
>> handling code, but I'm not yet familiar enough with bridge internals to
>> easily determine whether the same approach can be applied to the
>> SWITCHDEV_FDB_{ADD|DEL}_TO_BRIDGE events after this series is accepted.
>> It seems that all current users already generate these events from
>> blocking context, so would it be a trivial change for me to do, in your
>> opinion? That would allow me to get rid of mlx5_esw_bridge_offloads->wq
>> in our driver.
>
> If all callers really are in blocking context (and they do appear to be),
> you can even forgo the switchdev_deferred_enqueue that switchdev_fdb_add_to_device
> does, and just call call_switchdev_blocking_notifiers() directly. Then you
> move the bridge handler from br_switchdev_event() to br_switchdev_blocking_event().
> It should be even simpler than this conversion.

Thanks for your advice! I'll start looking into it as soon as this
series is accepted.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-19 16:07 [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Vladimir Oltean
                   ` (4 preceding siblings ...)
  2021-08-19 16:07 ` [PATCH v2 net-next 5/5] net: dsa: handle SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE synchronously Vladimir Oltean
@ 2021-08-20  9:16 ` Ido Schimmel
  2021-08-20  9:37   ` Vladimir Oltean
  2021-08-20 10:49   ` Vladimir Oltean
  5 siblings, 2 replies; 34+ messages in thread
From: Ido Schimmel @ 2021-08-20  9:16 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Jakub Kicinski, David S. Miller, Roopa Prabhu,
	Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vladimir Oltean, Vadym Kochan, Taras Chornyi,
	Jiri Pirko, Ido Schimmel, UNGLinuxDriver, Grygorii Strashko,
	Marek Behun, DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens,
	Woojung Huh, Sean Wang, Landen Chao, Claudiu Manoil,
	Alexandre Belloni, George McCollister, Ioana Ciornei,
	Saeed Mahameed, Leon Romanovsky, Lars Povlsen, Steen Hegelund,
	Julian Wiedmann, Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> Problem statement:
> 
> Any time a driver needs to create a private association between a bridge
> upper interface and use that association within its
> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> entries deleted by the bridge when the port leaves. The issue is that
> all switchdev drivers schedule a work item to have sleepable context,
> and that work item can be actually scheduled after the port has left the
> bridge, which means the association might have already been broken by
> the time the scheduled FDB work item attempts to use it.

This is handled in mlxsw by telling the device to flush the FDB entries
pointing to the {port, FID} when the VLAN is deleted (synchronously).

> 
> The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER
> mechanism to make the FDB notifiers emitted from the fastpath be
> scheduled in sleepable context. All drivers are converted to handle
> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block
> handler (or register a blocking switchdev notifier handler if they
> didn't have one). This solves the aforementioned problem because the
> bridge waits for the switchdev deferred work items to finish before a
> port leaves (del_nbp calls switchdev_deferred_process), whereas a work
> item privately scheduled by the driver will obviously not be waited upon
> by the bridge, leading to the possibility of having the race.

How is the problem solved if, after this patchset, drivers still queue a
work item?

DSA supports learning, but does not report the entries to the bridge.
How are these entries deleted when a port leaves the bridge?

> 
> This is a dependency for the "DSA FDB isolation" posted here. It was
> split out of that series hence the numbering starts directly at v2.
> 
> https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/

What is FDB isolation? Cover letter says: "There are use cases which
need FDB isolation between standalone ports and bridged ports, as well
as isolation between ports of different bridges".

Does it mean that DSA currently forwards packets between ports even if
they are member in different bridges or standalone?

> 
> Vladimir Oltean (5):
>   net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking
>     notifier chain
>   net: bridge: switchdev: make br_fdb_replay offer sleepable context to
>     consumers
>   net: switchdev: drop the atomic notifier block from
>     switchdev_bridge_port_{,un}offload
>   net: switchdev: don't assume RCU context in
>     switchdev_handle_fdb_{add,del}_to_device
>   net: dsa: handle SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE synchronously
> 
>  .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  86 +++++------
>  .../marvell/prestera/prestera_switchdev.c     | 110 +++++++-------
>  .../mellanox/mlx5/core/en/rep/bridge.c        |  59 +++++++-
>  .../mellanox/mlxsw/spectrum_switchdev.c       |  61 +++++++-
>  .../microchip/sparx5/sparx5_switchdev.c       |  78 +++++-----
>  drivers/net/ethernet/mscc/ocelot_net.c        |   3 -
>  drivers/net/ethernet/rocker/rocker_main.c     |  73 ++++-----
>  drivers/net/ethernet/rocker/rocker_ofdpa.c    |   4 +-
>  drivers/net/ethernet/ti/am65-cpsw-nuss.c      |   4 +-
>  drivers/net/ethernet/ti/am65-cpsw-switchdev.c |  57 ++++----
>  drivers/net/ethernet/ti/cpsw_new.c            |   4 +-
>  drivers/net/ethernet/ti/cpsw_switchdev.c      |  60 ++++----
>  drivers/s390/net/qeth_l2_main.c               |  10 +-
>  include/net/switchdev.h                       |  30 +++-
>  net/bridge/br.c                               |   5 +-
>  net/bridge/br_fdb.c                           |  40 ++++-
>  net/bridge/br_private.h                       |   4 -
>  net/bridge/br_switchdev.c                     |  18 +--
>  net/dsa/dsa.c                                 |  15 --
>  net/dsa/dsa_priv.h                            |  15 --
>  net/dsa/port.c                                |   3 -
>  net/dsa/slave.c                               | 138 ++++++------------
>  net/switchdev/switchdev.c                     |  61 +++++++-
>  23 files changed, 529 insertions(+), 409 deletions(-)
> 
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-20  9:16 ` [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Ido Schimmel
@ 2021-08-20  9:37   ` Vladimir Oltean
  2021-08-20 16:09     ` Ido Schimmel
  2021-08-20 10:49   ` Vladimir Oltean
  1 sibling, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-20  9:37 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > Problem statement:
> > 
> > Any time a driver needs to create a private association between a bridge
> > upper interface and use that association within its
> > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > entries deleted by the bridge when the port leaves. The issue is that
> > all switchdev drivers schedule a work item to have sleepable context,
> > and that work item can be actually scheduled after the port has left the
> > bridge, which means the association might have already been broken by
> > the time the scheduled FDB work item attempts to use it.
> 
> This is handled in mlxsw by telling the device to flush the FDB entries
> pointing to the {port, FID} when the VLAN is deleted (synchronously).

Again, central solution vs mlxsw solution.

If a port leaves a LAG that is offloaded but the LAG does not leave the
bridge, the driver still needs to initiate the VLAN deletion. I really
don't like that, it makes switchdev drivers bloated.

As long as you call switchdev_bridge_port_unoffload and you populate the
blocking notifier pointer, you will get replays of item deletion from
the bridge.

> > The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER
> > mechanism to make the FDB notifiers emitted from the fastpath be
> > scheduled in sleepable context. All drivers are converted to handle
> > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block
> > handler (or register a blocking switchdev notifier handler if they
> > didn't have one). This solves the aforementioned problem because the
> > bridge waits for the switchdev deferred work items to finish before a
> > port leaves (del_nbp calls switchdev_deferred_process), whereas a work
> > item privately scheduled by the driver will obviously not be waited upon
> > by the bridge, leading to the possibility of having the race.
> 
> How is the problem solved if, after this patchset, drivers still queue a
> work item?

It's only a problem if you bank on any stateful association between FDB
entries and your ports (aka you expect that port->bridge_dev still holds
the same value in the atomic handler as in the deferred work item). I
think drivers don't do this at the moment, otherwise they would be
broken.

When they need that, they will convert to synchronous handling and all
will be fine.

> DSA supports learning, but does not report the entries to the bridge.

Why is this relevant exactly?

> How are these entries deleted when a port leaves the bridge?

dsa_port_fast_age does the following
(a) deletes the hardware learned entries on a port, in all VLANs
(b) notifies the bridge to also flush its software FDB on that port

It is called
(a) when the STP state changes from a learning-capable state (LEARNING,
    FORWARDING) to a non-learning capable state (BLOCKING, LISTENING)
(b) when learning is turned off by the user
(c) when learning is turned off by the port becoming standalone after
    leaving a bridge (actually same code path as b)

So the FDB of a port is also flushed when a single switch port leaves a
LAG that is the actual bridge port (maybe not ideal, but I don't know
any better).
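
For reference, the flow looks roughly like this (quoting from memory, so
take the name of the bridge-notification helper with a grain of salt;
ds->ops->port_fast_age is the actual per-driver hook):

	void dsa_port_fast_age(const struct dsa_port *dp)
	{
		struct dsa_switch *ds = dp->ds;

		if (!ds->ops->port_fast_age)
			return;

		/* (a) flush the hardware-learned entries on this port */
		ds->ops->port_fast_age(ds, dp->index);

		/* (b) tell the bridge to flush its software FDB on this
		 * port too (SWITCHDEV_FDB_FLUSH_TO_BRIDGE)
		 */
		dsa_port_notify_bridge_fdb_flush(dp);
	}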

> > This is a dependency for the "DSA FDB isolation" posted here. It was
> > split out of that series hence the numbering starts directly at v2.
> > 
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/
> 
> What is FDB isolation? Cover letter says: "There are use cases which
> need FDB isolation between standalone ports and bridged ports, as well
> as isolation between ports of different bridges".

FDB isolation means exactly what it says: that the hardware FDB lookup
of ports that are standalone, or under one bridge, is unable to find FDB entries
(same MAC address, same VID) learned on another port from another bridge.

> Does it mean that DSA currently forwards packets between ports even if
> they are member in different bridges or standalone?

No, that is plain forwarding isolation in my understanding of terms, and
we have had that for many years now.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-20  9:16 ` [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Ido Schimmel
  2021-08-20  9:37   ` Vladimir Oltean
@ 2021-08-20 10:49   ` Vladimir Oltean
  2021-08-20 16:11     ` Ido Schimmel
  1 sibling, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-20 10:49 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > Problem statement:
> >
> > Any time a driver needs to create a private association between a bridge
> > upper interface and use that association within its
> > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > entries deleted by the bridge when the port leaves. The issue is that
> > all switchdev drivers schedule a work item to have sleepable context,
> > and that work item can be actually scheduled after the port has left the
> > bridge, which means the association might have already been broken by
> > the time the scheduled FDB work item attempts to use it.
>
> This is handled in mlxsw by telling the device to flush the FDB entries
> pointing to the {port, FID} when the VLAN is deleted (synchronously).

If you have FDB entries pointing to bridge ports that are foreign
interfaces and you offload them, do you catch the VLAN deletion on the
foreign port and flush your entries towards it at that time?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-20  9:37   ` Vladimir Oltean
@ 2021-08-20 16:09     ` Ido Schimmel
  2021-08-20 17:06       ` Vladimir Oltean
  0 siblings, 1 reply; 34+ messages in thread
From: Ido Schimmel @ 2021-08-20 16:09 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > Problem statement:
> > > 
> > > Any time a driver needs to create a private association between a bridge
> > > upper interface and use that association within its
> > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > entries deleted by the bridge when the port leaves. The issue is that
> > > all switchdev drivers schedule a work item to have sleepable context,
> > > and that work item can be actually scheduled after the port has left the
> > > bridge, which means the association might have already been broken by
> > > the time the scheduled FDB work item attempts to use it.
> > 
> > This is handled in mlxsw by telling the device to flush the FDB entries
> > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> 
> Again, central solution vs mlxsw solution.

Again, a solution is forced on everyone regardless if it benefits them
or not. List is bombarded with version after version until patches are
applied. *EXHAUSTING*.

With these patches, except DSA, everyone gets another queue_work() for
each FDB entry. In some cases, it completely misses the purpose of the
patchset.

Want a central solution? Make sure it is properly integrated. "Don't
have the energy"? Ask for help. Do not try to force a solution on
everyone and motivate them to change the code by doing a poor conversion
yourself.

I don't accept "this will have to do".

> 
> If a port leaves a LAG that is offloaded but the LAG does not leave the
> bridge, the driver still needs to initiate the VLAN deletion. I really
> don't like that, it makes switchdev drivers bloated.
> 
> As long as you call switchdev_bridge_port_unoffload and you populate the
> blocking notifier pointer, you will get replays of item deletion from
> the bridge.
> 
> > > The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER
> > > mechanism to make the FDB notifiers emitted from the fastpath be
> > > scheduled in sleepable context. All drivers are converted to handle
> > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block
> > > handler (or register a blocking switchdev notifier handler if they
> > > didn't have one). This solves the aforementioned problem because the
> > > bridge waits for the switchdev deferred work items to finish before a
> > > port leaves (del_nbp calls switchdev_deferred_process), whereas a work
> > > item privately scheduled by the driver will obviously not be waited upon
> > > by the bridge, leading to the possibility of having the race.
> > 
> > How is the problem solved if, after this patchset, drivers still queue a
> > work item?
> 
> It's only a problem if you bank on any stateful association between FDB
> entries and your ports (aka you expect that port->bridge_dev still holds
> the same value in the atomic handler as in the deferred work item). I
> think drivers don't do this at the moment, otherwise they would be
> broken.
> 
> When they need that, they will convert to synchronous handling and all
> will be fine.
> 
> > DSA supports learning, but does not report the entries to the bridge.
> 
> Why is this relevant exactly?

Because I wanted to make sure that FDB entries that are not present in
the bridge are also flushed.

> 
> > How are these entries deleted when a port leaves the bridge?
> 
> dsa_port_fast_age does the following
> (a) deletes the hardware learned entries on a port, in all VLANs
> (b) notifies the bridge to also flush its software FDB on that port
> 
> It is called
> (a) when the STP state changes from a learning-capable state (LEARNING,
>     FORWARDING) to a non-learning capable state (BLOCKING, LISTENING)
> (b) when learning is turned off by the user
> (c) when learning is turned off by the port becoming standalone after
>     leaving a bridge (actually same code path as b)
> 
> So the FDB of a port is also flushed when a single switch port leaves a
> LAG that is the actual bridge port (maybe not ideal, but I don't know
> any better).
> 
> > > This is a dependency for the "DSA FDB isolation" posted here. It was
> > > split out of that series hence the numbering starts directly at v2.
> > > 
> > > https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/
> > 
> > What is FDB isolation? Cover letter says: "There are use cases which
> > need FDB isolation between standalone ports and bridged ports, as well
> > as isolation between ports of different bridges".
> 
> FDB isolation means exactly what it says: that the hardware FDB lookup
> of ports that are standalone, or under one bridge, is unable to find FDB entries
> (same MAC address, same VID) learned on another port from another bridge.
> 
> > Does it mean that DSA currently forwards packets between ports even if
> > they are member in different bridges or standalone?
> 
> No, that is plain forwarding isolation in my understanding of terms, and
> we have had that for many years now.

So if I have {00:01:02:03:04:05, 5} in br0, but not in br1 and now a
packet with this DMAC/VID needs to be forwarded in br1 it will be
dropped instead of being flooded?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-20 10:49   ` Vladimir Oltean
@ 2021-08-20 16:11     ` Ido Schimmel
  2021-08-21 19:09       ` Vladimir Oltean
  0 siblings, 1 reply; 34+ messages in thread
From: Ido Schimmel @ 2021-08-20 16:11 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Fri, Aug 20, 2021 at 01:49:48PM +0300, Vladimir Oltean wrote:
> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > Problem statement:
> > >
> > > Any time a driver needs to create a private association between a bridge
> > > upper interface and use that association within its
> > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > entries deleted by the bridge when the port leaves. The issue is that
> > > all switchdev drivers schedule a work item to have sleepable context,
> > > and that work item can be actually scheduled after the port has left the
> > > bridge, which means the association might have already been broken by
> > > the time the scheduled FDB work item attempts to use it.
> >
> > This is handled in mlxsw by telling the device to flush the FDB entries
> > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> 
> If you have FDB entries pointing to bridge ports that are foreign
> interfaces and you offload them, do you catch the VLAN deletion on the
> foreign port and flush your entries towards it at that time?

Yes, that's how VXLAN offload works. VLAN addition is used to determine
the mapping between VNI and VLAN.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-20 16:09     ` Ido Schimmel
@ 2021-08-20 17:06       ` Vladimir Oltean
  2021-08-20 23:36         ` Nikolay Aleksandrov
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-20 17:06 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
> > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > > Problem statement:
> > > >
> > > > Any time a driver needs to create a private association between a bridge
> > > > upper interface and use that association within its
> > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > > entries deleted by the bridge when the port leaves. The issue is that
> > > > all switchdev drivers schedule a work item to have sleepable context,
> > > > and that work item can be actually scheduled after the port has left the
> > > > bridge, which means the association might have already been broken by
> > > > the time the scheduled FDB work item attempts to use it.
> > >
> > > This is handled in mlxsw by telling the device to flush the FDB entries
> > > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> >
> > Again, central solution vs mlxsw solution.
>
> Again, a solution is forced on everyone regardless if it benefits them
> or not. List is bombarded with version after version until patches are
> applied. *EXHAUSTING*.

So if I replace "bombarded" with a more neutral word, isn't that how
it's done though? What would you do if you wanted to achieve something
but the framework stood in your way? Would you work around it to avoid
bombarding the list?

> With these patches, except DSA, everyone gets another queue_work() for
> each FDB entry. In some cases, it completely misses the purpose of the
> patchset.

I also fail to see the point. Patch 3 will have to make things worse
before they get better. It is like that in DSA too, and made more
reasonable only in the last patch from the series.

If I saw any middle-ground way, like keeping the notifiers on the atomic
chain for unconverted drivers, I would have done it. But what do you do
if more than one driver listens for the same event, and one wants it
blocking while the other wants it atomic? Do you make the bridge emit it
twice? That's even worse than having one useless queue_work() in some
drivers.

So if you think I can avoid that please tell me how.

> Want a central solution? Make sure it is properly integrated. "Don't
> have the energy"? Ask for help. Do not try to force a solution on
> everyone and motivate them to change the code by doing a poor conversion
> yourself.
>
> I don't accept "this will have to do".

So I can make many suppositions about what I did wrong, but I would
prefer that you tell me.

Is it the timing, as we're late in the development cycle? Maybe, and
that would make a lot of sense, but I don't want to assume anything that
has not been said.

Is it that I converted too few drivers? You said I'm bombarding the
list. Can I convert more drivers with less code? I would be absolutely
glad to. I have more driver conversions unsubmitted, some tested on
hardware.

Is it that I didn't ask for help? I still believe that it is best I
leave the driver maintainers to do the rest of the conversion, at their
own pace and with hardware to test and find issues I can not using just
code analysis and non-expert knowledge. After all, with all due respect
to the net-next tree, I sent these patches to a development git tree,
not to a production facility.

> > > What is FDB isolation? Cover letter says: "There are use cases which
> > > need FDB isolation between standalone ports and bridged ports, as well
> > > as isolation between ports of different bridges".
> >
> > FDB isolation means exactly what it says: that the hardware FDB lookup
> > of ports that are standalone, or under one bridge, is unable to find FDB entries
> > (same MAC address, same VID) learned on another port from another bridge.
> >
> > > Does it mean that DSA currently forwards packets between ports even if
> > > they are member in different bridges or standalone?
> >
> > No, that is plain forwarding isolation in my understanding of terms, and
> > we have had that for many years now.
>
> So if I have {00:01:02:03:04:05, 5} in br0, but not in br1 and now a
> packet with this DMAC/VID needs to be forwarded in br1 it will be
> dropped instead of being flooded?

Yes.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-20 17:06       ` Vladimir Oltean
@ 2021-08-20 23:36         ` Nikolay Aleksandrov
  2021-08-21  0:22           ` Vladimir Oltean
  2021-08-22  6:48           ` Ido Schimmel
  0 siblings, 2 replies; 34+ messages in thread
From: Nikolay Aleksandrov @ 2021-08-20 23:36 UTC (permalink / raw)
  To: Vladimir Oltean, Ido Schimmel
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vadym Kochan, Taras Chornyi, Jiri Pirko, Ido Schimmel,
	UNGLinuxDriver, Grygorii Strashko, Marek Behun, DENG Qingfang,
	Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh, Sean Wang,
	Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On 20/08/2021 20:06, Vladimir Oltean wrote:
> On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
>> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
>>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
>>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
>>>>> Problem statement:
>>>>>
>>>>> Any time a driver needs to create a private association between a bridge
>>>>> upper interface and use that association within its
>>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
>>>>> entries deleted by the bridge when the port leaves. The issue is that
>>>>> all switchdev drivers schedule a work item to have sleepable context,
>>>>> and that work item can be actually scheduled after the port has left the
>>>>> bridge, which means the association might have already been broken by
>>>>> the time the scheduled FDB work item attempts to use it.
>>>>
>>>> This is handled in mlxsw by telling the device to flush the FDB entries
>>>> pointing to the {port, FID} when the VLAN is deleted (synchronously).
>>>
>>> Again, central solution vs mlxsw solution.
>>
>> Again, a solution is forced on everyone regardless if it benefits them
>> or not. List is bombarded with version after version until patches are
>> applied. *EXHAUSTING*.
> 
> So if I replace "bombarded" with a more neutral word, isn't that how
> it's done though? What would you do if you wanted to achieve something
> but the framework stood in your way? Would you work around it to avoid
> bombarding the list?
> 
>> With these patches, except DSA, everyone gets another queue_work() for
>> each FDB entry. In some cases, it completely misses the purpose of the
>> patchset.
> 
> I also fail to see the point. Patch 3 will have to make things worse
> before they get better. It is like that in DSA too, and made more
> reasonable only in the last patch from the series.
> 
> If I saw any middle-ground way, like keeping the notifiers on the atomic
> chain for unconverted drivers, I would have done it. But what do you do
> if more than one driver listens for the same event, and one wants it
> blocking while the other wants it atomic? Do you make the bridge emit it
> twice? That's even worse than having one useless queue_work() in some
> drivers.
> 
> So if you think I can avoid that please tell me how.
> 

Hi,
I don't like the double-queuing for each fdb for everyone either; it forces them
to rework it asap due to the inefficiency, even though that shouldn't be necessary. In the
long run I hope everyone migrates to such a scheme, but perhaps we can do it gradually.
For most drivers this introduces more work (as in processing) rather than helping
them right now. Either give them the option to convert to it of their own accord, or bite
the bullet and convert everyone so that the change won't affect them: it holds rtnl and it is blocking,
so I don't see why we can't convert everyone to just execute their otherwise queued work.
I'm sure driver maintainers would appreciate such help and would test and review it. You're
halfway there already.

Cheers,
 Nik

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-20 23:36         ` Nikolay Aleksandrov
@ 2021-08-21  0:22           ` Vladimir Oltean
  2021-08-22  6:48           ` Ido Schimmel
  1 sibling, 0 replies; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-21  0:22 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Ido Schimmel, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote:
> Hi,
> I don't like the double-queuing for each fdb for everyone either; it forces them
> to rework it asap due to the inefficiency, even though that shouldn't be necessary.

Let's be honest, with the vast majority of drivers having absurdities such as the
"if (!fdb_info->added_by_user || fdb_info->is_local) => nothing to do here, bye"
check placed _inside_ the actual work item (and therefore scheduling for nothing
for entries dynamically learned by the bridge), it's hard to believe that driver
authors cared too much about inefficiency when mindlessly copy-pasting that snippet
from mlxsw

[ which for the record does call mlxsw_sp_span_respin for dynamically learned FDB
  entries, so that driver doesn't schedule for nothing like the rest - although
  maybe even mlxsw could call mlxsw_sp_port_dev_lower_find_rcu instead of
  mlxsw_sp_port_dev_lower_find, and could save a queue_work for FDB entries on
  foreign && non-VXLAN ports. Who knows?! ]

Now I get to care for them.

But I can see how a partial conversion could leave things in an even more absurd position.
I don't want to contribute to the absurdity.
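
For reference, a condensed sketch of that pattern (the "foo" names are
made up, it is not lifted from any single driver):

/*
 * Condensed sketch of the pattern in question. The atomic notifier
 * handler pays for an allocation and a queue_work() on every FDB event,
 * and only the work item itself (with rtnl_lock() already taken)
 * notices that a dynamically learned entry is of no interest.
 */
#include <linux/etherdevice.h>
#include <linux/notifier.h>
#include <linux/rtnetlink.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <net/switchdev.h>

static struct workqueue_struct *foo_owq; /* ordered, WQ_MEM_RECLAIM, "because" */

struct foo_fdb_work {
	struct work_struct work;
	struct net_device *dev;
	unsigned char addr[ETH_ALEN];
	u16 vid;
	bool added_by_user;
	bool is_local;
};

static void foo_fdb_work_fn(struct work_struct *work)
{
	struct foo_fdb_work *fw = container_of(work, struct foo_fdb_work, work);

	rtnl_lock();
	if (!fw->added_by_user || fw->is_local)
		goto out; /* scheduled and serialized for nothing */
	/* ... program fw->addr / fw->vid into hardware, notify FDB_OFFLOADED ... */
out:
	rtnl_unlock();
	dev_put(fw->dev);
	kfree(fw);
}

static int foo_switchdev_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;
	struct foo_fdb_work *fw;

	if (event != SWITCHDEV_FDB_ADD_TO_DEVICE &&
	    event != SWITCHDEV_FDB_DEL_TO_DEVICE)
		return NOTIFY_DONE;

	/* (the usual foo_dev_check(dev) ownership test omitted for brevity) */

	fw = kzalloc(sizeof(*fw), GFP_ATOMIC);
	if (!fw)
		return NOTIFY_BAD;

	INIT_WORK(&fw->work, foo_fdb_work_fn);
	dev_hold(dev);
	fw->dev = dev;
	ether_addr_copy(fw->addr, fdb_info->addr);
	fw->vid = fdb_info->vid;
	fw->added_by_user = fdb_info->added_by_user;
	fw->is_local = fdb_info->is_local;

	queue_work(foo_owq, &fw->work); /* one queue_work() per FDB event */

	return NOTIFY_DONE;
}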

> In the
> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually.
> For most drivers this is introducing more work (as in processing) rather than helping
> them right now, give them the option to convert to it on their own accord or bite
> the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking
> I don't see why not convert everyone to just execute their otherwise queued work.
> I'm sure driver maintainers would appreciate such help and would test and review it. You're
> halfway there already..

Agree, this needs more work. Thanks for looking.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-20 16:11     ` Ido Schimmel
@ 2021-08-21 19:09       ` Vladimir Oltean
  2021-08-22  7:19         ` Ido Schimmel
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-21 19:09 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Fri, Aug 20, 2021 at 07:11:15PM +0300, Ido Schimmel wrote:
> On Fri, Aug 20, 2021 at 01:49:48PM +0300, Vladimir Oltean wrote:
> > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > > Problem statement:
> > > >
> > > > Any time a driver needs to create a private association between a bridge
> > > > upper interface and use that association within its
> > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > > entries deleted by the bridge when the port leaves. The issue is that
> > > > all switchdev drivers schedule a work item to have sleepable context,
> > > > and that work item can be actually scheduled after the port has left the
> > > > bridge, which means the association might have already been broken by
> > > > the time the scheduled FDB work item attempts to use it.
> > >
> > > This is handled in mlxsw by telling the device to flush the FDB entries
> > > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> > 
> > If you have FDB entries pointing to bridge ports that are foreign
> > interfaces and you offload them, do you catch the VLAN deletion on the
> > foreign port and flush your entries towards it at that time?
> 
> Yes, that's how VXLAN offload works. VLAN addition is used to determine
> the mapping between VNI and VLAN.

I was only able to follow as far as:

mlxsw_sp_switchdev_blocking_event
-> mlxsw_sp_switchdev_handle_vxlan_obj_del
   -> mlxsw_sp_switchdev_vxlan_vlans_del
      -> mlxsw_sp_switchdev_vxlan_vlan_del
         -> ??? where are the FDB entries flushed?

I was expecting to see something along the lines of

mlxsw_sp_switchdev_blocking_event
-> mlxsw_sp_port_vlans_del
   -> mlxsw_sp_bridge_port_vlan_del
      -> mlxsw_sp_port_vlan_bridge_leave
         -> mlxsw_sp_bridge_port_fdb_flush

but that is exactly on the other branch of the "if (netif_is_vxlan(dev))"
condition (and also, mlxsw_sp_bridge_port_fdb_flush flushes an externally-facing
port, not really what I needed to know, see below).

Anyway, it also seems to me that we are referring to slightly different
things by "foreign" interfaces. To me, a "foreign" interface is one
towards which there is no hardware data path. Like for example if you
have a mlxsw port in a plain L2 bridge with an Intel card. The data path
is the CPU and that was my question: do you track FDB entries towards
those interfaces (implicitly: towards the CPU)? You've answered about
VXLAN, which is not quite "foreign" in the sense I am thinking about,
because mlxsw does have a hardware data path towards a VXLAN interface
(as you've mentioned, it associates a VID with each VNI).

I've been searching through the mlxsw driver and I don't see that this
is being done, so I'm guessing you might wonder/ask why you would want
to do that in the first place. If you bridge a mlxsw port with an Intel
card, then (from another thread where you've said that mlxsw always
injects control packets where hardware learning is not performed) my
guess is that the MAC addresses learned on the Intel bridge port will
never be learned on the mlxsw device. So every packet that ingresses the
mlxsw and must egress the Intel card will reach the CPU through flooding
(and will consequently be flooded in the entire broadcast domain of the
mlxsw side of the bridge). Right?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-20 23:36         ` Nikolay Aleksandrov
  2021-08-21  0:22           ` Vladimir Oltean
@ 2021-08-22  6:48           ` Ido Schimmel
  2021-08-22  9:12             ` Nikolay Aleksandrov
  1 sibling, 1 reply; 34+ messages in thread
From: Ido Schimmel @ 2021-08-22  6:48 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Vladimir Oltean, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote:
> On 20/08/2021 20:06, Vladimir Oltean wrote:
> > On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
> >> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
> >>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> >>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> >>>>> Problem statement:
> >>>>>
> >>>>> Any time a driver needs to create a private association between a bridge
> >>>>> upper interface and use that association within its
> >>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> >>>>> entries deleted by the bridge when the port leaves. The issue is that
> >>>>> all switchdev drivers schedule a work item to have sleepable context,
> >>>>> and that work item can be actually scheduled after the port has left the
> >>>>> bridge, which means the association might have already been broken by
> >>>>> the time the scheduled FDB work item attempts to use it.
> >>>>
> >>>> This is handled in mlxsw by telling the device to flush the FDB entries
> >>>> pointing to the {port, FID} when the VLAN is deleted (synchronously).
> >>>
> >>> Again, central solution vs mlxsw solution.
> >>
> >> Again, a solution is forced on everyone regardless if it benefits them
> >> or not. List is bombarded with version after version until patches are
> >> applied. *EXHAUSTING*.
> > 
> > So if I replace "bombarded" with a more neutral word, isn't that how
> > it's done though? What would you do if you wanted to achieve something
> > but the framework stood in your way? Would you work around it to avoid
> > bombarding the list?
> > 
> >> With these patches, except DSA, everyone gets another queue_work() for
> >> each FDB entry. In some cases, it completely misses the purpose of the
> >> patchset.
> > 
> > I also fail to see the point. Patch 3 will have to make things worse
> > before they get better. It is like that in DSA too, and made more
> > reasonable only in the last patch from the series.
> > 
> > If I saw any middle-ground way, like keeping the notifiers on the atomic
> > chain for unconverted drivers, I would have done it. But what do you do
> > if more than one driver listens for one event, one driver wants it
> > blocking, the other wants it atomic. Do you make the bridge emit it
> > twice? That's even worse than having one useless queue_work() in some
> > drivers.
> > 
> > So if you think I can avoid that please tell me how.
> > 
> 
> Hi,
> I don't like the double-queuing for each fdb for everyone either, it's forcing them
> to rework it asap due to inefficiency even though that shouldn't be necessary. In the
> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually.

The fundamental problem is that these operations need to be deferred in
the first place. It would have been much better if user space could get
a synchronous feedback.

It all stems from the fact that control plane operations need to be done
under a spin lock because the shared databases (e.g., FDB, MDB) or
states (e.g., STP) that they are updating can also be updated from the
data plane in softIRQ.

I don't have a clean solution to this problem without doing a surgery in
the bridge driver. Deferring updates from the data plane using a work
queue and converting the spin locks to mutexes. This will also allow us
to emit netlink notifications from process context and convert
GFP_ATOMIC to GFP_KERNEL.
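
To make the shape of that concrete, a very rough sketch, with made-up
"foo" names and no claim of being an actual bridge patch:

/*
 * Hypothetical sketch only: defer data-plane FDB learning to a work
 * item, so that the FDB can be protected by a mutex instead of a
 * spinlock and notifications can be emitted from process context with
 * GFP_KERNEL.
 */
#include <linux/etherdevice.h>
#include <linux/mutex.h>
#include <linux/netdevice.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct foo_learn_work {
	struct work_struct work;
	struct net_device *port_dev;
	unsigned char addr[ETH_ALEN];
	u16 vid;
};

static DEFINE_MUTEX(foo_fdb_mutex);		/* would replace hash_lock */
static struct workqueue_struct *foo_learn_wq;	/* alloc_ordered_workqueue() at init */

static void foo_learn_work_fn(struct work_struct *work)
{
	struct foo_learn_work *lw = container_of(work, struct foo_learn_work, work);

	mutex_lock(&foo_fdb_mutex);
	/* insert/refresh the FDB entry; allocations can be GFP_KERNEL and
	 * netlink/switchdev notifications can be blocking from here
	 */
	mutex_unlock(&foo_fdb_mutex);

	dev_put(lw->port_dev);
	kfree(lw);
}

/* called from the data path (softIRQ), must not sleep */
static void foo_fdb_learn(struct net_device *port_dev,
			  const unsigned char *addr, u16 vid)
{
	struct foo_learn_work *lw;

	lw = kzalloc(sizeof(*lw), GFP_ATOMIC);
	if (!lw)
		return;

	INIT_WORK(&lw->work, foo_learn_work_fn);
	dev_hold(port_dev);
	lw->port_dev = port_dev;
	ether_addr_copy(lw->addr, addr);
	lw->vid = vid;

	queue_work(foo_learn_wq, &lw->work);
}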

Is that something you consider as acceptable? Does anybody have a better
idea?

> For most drivers this is introducing more work (as in processing) rather than helping
> them right now, give them the option to convert to it on their own accord or bite
> the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking
> I don't see why not convert everyone to just execute their otherwise queued work.
> I'm sure driver maintainers would appreciate such help and would test and review it. You're
> halfway there already..
> 
> Cheers,
>  Nik
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-21 19:09       ` Vladimir Oltean
@ 2021-08-22  7:19         ` Ido Schimmel
  0 siblings, 0 replies; 34+ messages in thread
From: Ido Schimmel @ 2021-08-22  7:19 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Nikolay Aleksandrov, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Sat, Aug 21, 2021 at 10:09:14PM +0300, Vladimir Oltean wrote:
> On Fri, Aug 20, 2021 at 07:11:15PM +0300, Ido Schimmel wrote:
> > On Fri, Aug 20, 2021 at 01:49:48PM +0300, Vladimir Oltean wrote:
> > > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> > > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> > > > > Problem statement:
> > > > >
> > > > > Any time a driver needs to create a private association between a bridge
> > > > > upper interface and use that association within its
> > > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> > > > > entries deleted by the bridge when the port leaves. The issue is that
> > > > > all switchdev drivers schedule a work item to have sleepable context,
> > > > > and that work item can be actually scheduled after the port has left the
> > > > > bridge, which means the association might have already been broken by
> > > > > the time the scheduled FDB work item attempts to use it.
> > > >
> > > > This is handled in mlxsw by telling the device to flush the FDB entries
> > > > pointing to the {port, FID} when the VLAN is deleted (synchronously).
> > > 
> > > If you have FDB entries pointing to bridge ports that are foreign
> > > interfaces and you offload them, do you catch the VLAN deletion on the
> > > foreign port and flush your entries towards it at that time?
> > 
> > Yes, that's how VXLAN offload works. VLAN addition is used to determine
> > the mapping between VNI and VLAN.
> 
> I was only able to follow as far as:
> 
> mlxsw_sp_switchdev_blocking_event
> -> mlxsw_sp_switchdev_handle_vxlan_obj_del
>    -> mlxsw_sp_switchdev_vxlan_vlans_del
>       -> mlxsw_sp_switchdev_vxlan_vlan_del
>          -> ??? where are the FDB entries flushed?

 mlxsw_sp_switchdev_blocking_event
 -> mlxsw_sp_switchdev_handle_vxlan_obj_del
    -> mlxsw_sp_switchdev_vxlan_vlans_del
       -> mlxsw_sp_switchdev_vxlan_vlan_del
          -> mlxsw_sp_bridge_vxlan_leave
	     -> mlxsw_sp_nve_fid_disable
	        -> mlxsw_sp_nve_fdb_flush_by_fid

> 
> I was expecting to see something along the lines of
> 
> mlxsw_sp_switchdev_blocking_event
> -> mlxsw_sp_port_vlans_del
>    -> mlxsw_sp_bridge_port_vlan_del
>       -> mlxsw_sp_port_vlan_bridge_leave
>          -> mlxsw_sp_bridge_port_fdb_flush
> 
> but that is exactly on the other branch of the "if (netif_is_vxlan(dev))"
> condition (and also, mlxsw_sp_bridge_port_fdb_flush flushes an externally-facing
> port, not really what I needed to know, see below).
> 
> Anyway, it also seems to me that we are referring to slightly different
> things by "foreign" interfaces. To me, a "foreign" interface is one
> towards which there is no hardware data path. Like for example if you
> have a mlxsw port in a plain L2 bridge with an Intel card. The data path
> is the CPU and that was my question: do you track FDB entries towards
> those interfaces (implicitly: towards the CPU)? You've answered about
> VXLAN, which is not quite "foreign" in the sense I am thinking about,
> because mlxsw does have a hardware data path towards a VXLAN interface
> (as you've mentioned, it associates a VID with each VNI).
> 
> I've been searching through the mlxsw driver and I don't see that this
> is being done, so I'm guessing you might wonder/ask why you would want
> to do that in the first place. If you bridge a mlxsw port with an Intel
> card, then (from another thread where you've said that mlxsw always
> injects control packets where hardware learning is not performed) my
> guess is that the MAC addresses learned on the Intel bridge port will
> never be learned on the mlxsw device. So every packet that ingresses the
> mlxsw and must egress the Intel card will reach the CPU through flooding
> (and will consequently be flooded in the entire broadcast domain of the
> mlxsw side of the bridge). Right?

I can see how this use case makes sense on systems where the difference
in performance between the ASIC and the CPU is not huge, but it doesn't
make much sense with Spectrum and I have yet to get requests to support
it (might change). Keep in mind that Spectrum is able to forward several
Bpps with a switching capacity of several Tbps. It is usually connected
to a weak CPU (e.g., low-end ARM, Intel Atom) through a PCI bus with a
bandwidth of several Gbps. There is usually one "Intel card" on such
systems which is connected to the management network that is separated
from the data plane network.

If we were to support it, FDB entries towards "foreign" interfaces would
be programmed to trap packets to the CPU. For now, for correctness /
rigor purposes, I would prefer simply returning an error / warning via
extack when such topologies are configured.
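
A minimal sketch of what such a veto could look like from a
NETDEV_PRECHANGEUPPER handler (the foo_* helpers are hypothetical):

/*
 * Sketch only: veto the topology from NETDEV_PRECHANGEUPPER and report
 * why through extack, instead of silently relying on flooding through
 * the CPU.
 */
#include <linux/netdevice.h>
#include <linux/netlink.h>
#include <linux/notifier.h>

bool foo_port_dev_check(const struct net_device *dev);	/* driver-specific */

static bool foo_bridge_has_foreign_port(struct net_device *br_dev)
{
	struct net_device *lower;
	struct list_head *iter;

	netdev_for_each_lower_dev(br_dev, lower, iter)
		if (!foo_port_dev_check(lower))
			return true;

	return false;
}

static int foo_netdevice_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct netdev_notifier_changeupper_info *info = ptr;
	struct net_device *dev = netdev_notifier_info_to_dev(&info->info);
	struct netlink_ext_ack *extack;

	if (event != NETDEV_PRECHANGEUPPER || !info->linking ||
	    !netif_is_bridge_master(info->upper_dev) || !foo_port_dev_check(dev))
		return NOTIFY_DONE;

	extack = netdev_notifier_info_to_extack(&info->info);

	if (foo_bridge_has_foreign_port(info->upper_dev)) {
		NL_SET_ERR_MSG_MOD(extack,
				   "Bridging with foreign interfaces is not offloaded");
		return notifier_from_errno(-EOPNOTSUPP);
	}

	return NOTIFY_DONE;
}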

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-22  6:48           ` Ido Schimmel
@ 2021-08-22  9:12             ` Nikolay Aleksandrov
  2021-08-22 13:31               ` Vladimir Oltean
  0 siblings, 1 reply; 34+ messages in thread
From: Nikolay Aleksandrov @ 2021-08-22  9:12 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Vladimir Oltean, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On 22/08/2021 09:48, Ido Schimmel wrote:
> On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote:
>> On 20/08/2021 20:06, Vladimir Oltean wrote:
>>> On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
>>>> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
>>>>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
>>>>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
>>>>>>> Problem statement:
>>>>>>>
>>>>>>> Any time a driver needs to create a private association between a bridge
>>>>>>> upper interface and use that association within its
>>>>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
>>>>>>> entries deleted by the bridge when the port leaves. The issue is that
>>>>>>> all switchdev drivers schedule a work item to have sleepable context,
>>>>>>> and that work item can be actually scheduled after the port has left the
>>>>>>> bridge, which means the association might have already been broken by
>>>>>>> the time the scheduled FDB work item attempts to use it.
>>>>>>
>>>>>> This is handled in mlxsw by telling the device to flush the FDB entries
>>>>>> pointing to the {port, FID} when the VLAN is deleted (synchronously).
>>>>>
>>>>> Again, central solution vs mlxsw solution.
>>>>
>>>> Again, a solution is forced on everyone regardless if it benefits them
>>>> or not. List is bombarded with version after version until patches are
>>>> applied. *EXHAUSTING*.
>>>
>>> So if I replace "bombarded" with a more neutral word, isn't that how
>>> it's done though? What would you do if you wanted to achieve something
>>> but the framework stood in your way? Would you work around it to avoid
>>> bombarding the list?
>>>
>>>> With these patches, except DSA, everyone gets another queue_work() for
>>>> each FDB entry. In some cases, it completely misses the purpose of the
>>>> patchset.
>>>
>>> I also fail to see the point. Patch 3 will have to make things worse
>>> before they get better. It is like that in DSA too, and made more
>>> reasonable only in the last patch from the series.
>>>
>>> If I saw any middle-ground way, like keeping the notifiers on the atomic
>>> chain for unconverted drivers, I would have done it. But what do you do
>>> if more than one driver listens for one event, one driver wants it
>>> blocking, the other wants it atomic. Do you make the bridge emit it
>>> twice? That's even worse than having one useless queue_work() in some
>>> drivers.
>>>
>>> So if you think I can avoid that please tell me how.
>>>
>>
>> Hi,
>> I don't like the double-queuing for each fdb for everyone either, it's forcing them
>> to rework it asap due to inefficiency even though that shouldn't be necessary. In the
>> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually.
> 
> The fundamental problem is that these operations need to be deferred in
> the first place. It would have been much better if user space could get
> a synchronous feedback.
> 
> It all stems from the fact that control plane operations need to be done
> under a spin lock because the shared databases (e.g., FDB, MDB) or
> states (e.g., STP) that they are updating can also be updated from the
> data plane in softIRQ.
> 

Right, but changing that, as you've noted below, would require moving
the delaying to the bridge, I'd like to avoid that.

> I don't have a clean solution to this problem without doing a surgery in
> the bridge driver. Deferring updates from the data plane using a work
> queue and converting the spin locks to mutexes. This will also allow us
> to emit netlink notifications from process context and convert
> GFP_ATOMIC to GFP_KERNEL.
> 
> Is that something you consider as acceptable? Does anybody have a better
> idea?
> 

Moving the delays to the bridge for this purpose does not sound like a good solution,
I'd prefer the delaying to be done by the interested third party as in this case rather
than the bridge. If there's a solution that avoids delaying and doesn't hurt the software
fast-path then of course I'll be ok with that.
 
>> For most drivers this is introducing more work (as in processing) rather than helping
>> them right now, give them the option to convert to it on their own accord or bite
>> the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking
>> I don't see why not convert everyone to just execute their otherwise queued work.
>> I'm sure driver maintainers would appreciate such help and would test and review it. You're
>> halfway there already..
>>
>> Cheers,
>>  Nik
>>


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-22  9:12             ` Nikolay Aleksandrov
@ 2021-08-22 13:31               ` Vladimir Oltean
  2021-08-22 17:06                 ` Ido Schimmel
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-22 13:31 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Ido Schimmel, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Sun, Aug 22, 2021 at 12:12:02PM +0300, Nikolay Aleksandrov wrote:
> On 22/08/2021 09:48, Ido Schimmel wrote:
> > On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote:
> >> On 20/08/2021 20:06, Vladimir Oltean wrote:
> >>> On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote:
> >>>> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote:
> >>>>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote:
> >>>>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote:
> >>>>>>> Problem statement:
> >>>>>>>
> >>>>>>> Any time a driver needs to create a private association between a bridge
> >>>>>>> upper interface and use that association within its
> >>>>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB
> >>>>>>> entries deleted by the bridge when the port leaves. The issue is that
> >>>>>>> all switchdev drivers schedule a work item to have sleepable context,
> >>>>>>> and that work item can be actually scheduled after the port has left the
> >>>>>>> bridge, which means the association might have already been broken by
> >>>>>>> the time the scheduled FDB work item attempts to use it.
> >>>>>>
> >>>>>> This is handled in mlxsw by telling the device to flush the FDB entries
> >>>>>> pointing to the {port, FID} when the VLAN is deleted (synchronously).
> >>>>>
> >>>>> Again, central solution vs mlxsw solution.
> >>>>
> >>>> Again, a solution is forced on everyone regardless if it benefits them
> >>>> or not. List is bombarded with version after version until patches are
> >>>> applied. *EXHAUSTING*.
> >>>
> >>> So if I replace "bombarded" with a more neutral word, isn't that how
> >>> it's done though? What would you do if you wanted to achieve something
> >>> but the framework stood in your way? Would you work around it to avoid
> >>> bombarding the list?
> >>>
> >>>> With these patches, except DSA, everyone gets another queue_work() for
> >>>> each FDB entry. In some cases, it completely misses the purpose of the
> >>>> patchset.
> >>>
> >>> I also fail to see the point. Patch 3 will have to make things worse
> >>> before they get better. It is like that in DSA too, and made more
> >>> reasonable only in the last patch from the series.
> >>>
> >>> If I saw any middle-ground way, like keeping the notifiers on the atomic
> >>> chain for unconverted drivers, I would have done it. But what do you do
> >>> if more than one driver listens for one event, one driver wants it
> >>> blocking, the other wants it atomic. Do you make the bridge emit it
> >>> twice? That's even worse than having one useless queue_work() in some
> >>> drivers.
> >>>
> >>> So if you think I can avoid that please tell me how.
> >>>
> >>
> >> Hi,
> >> I don't like the double-queuing for each fdb for everyone either, it's forcing them
> >> to rework it asap due to inefficiency even though that shouldn't be necessary. In the
> >> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually.
> > 
> > The fundamental problem is that these operations need to be deferred in
> > the first place. It would have been much better if user space could get
> > a synchronous feedback.
> > 
> > It all stems from the fact that control plane operations need to be done
> > under a spin lock because the shared databases (e.g., FDB, MDB) or
> > states (e.g., STP) that they are updating can also be updated from the
> > data plane in softIRQ.
> > 
> 
> Right, but changing that, as you've noted below, would require moving
> the delaying to the bridge, I'd like to avoid that.
> 
> > I don't have a clean solution to this problem without doing a surgery in
> > the bridge driver. Deferring updates from the data plane using a work
> > queue and converting the spin locks to mutexes. This will also allow us
> > to emit netlink notifications from process context and convert
> > GFP_ATOMIC to GFP_KERNEL.
> > 
> > Is that something you consider as acceptable? Does anybody have a better
> > idea?
> > 
> 
> Moving the delays to the bridge for this purpose does not sound like a good solution,
> I'd prefer the delaying to be done by the interested third party as in this case rather
> than the bridge. If there's a solution that avoids delaying and doesn't hurt the software
> fast-path then of course I'll be ok with that.

Maybe emitting two notifiers, one atomic and one blocking, per FDB
add/del event is not such a stupid idea after all.

Here's an alternative I've been cooking. Obviously it still has pros and
cons. Hopefully by reading the commit message you get the basic idea and
I don't need to post the full series.

-----------------------------[ cut here ]-----------------------------
From 9870699f0fafeb6175af3462173a957ece551322 Mon Sep 17 00:00:00 2001
From: Vladimir Oltean <vladimir.oltean@nxp.com>
Date: Sat, 21 Aug 2021 15:57:40 +0300
Subject: [PATCH] net: switchdev: add an option for
 SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to be deferred

Most existing switchdev drivers either talk to firmware, or to a device
over a bus where the I/O is sleepable (SPI, I2C, MDIO etc). So there
exists a pattern where drivers make a sleepable context for offloading
the given FDB entry by registering an ordered workqueue and scheduling
work items on it, and doing all the work from there.

This solution works, but there are some issues with it:

1. It creates large amounts of duplication between switchdev drivers,
   and they don't even copy all the right patterns from each other.
   For example:

   * DSA, dpaa2-switch, rocker allocate an ordered workqueue with the
     WQ_MEM_RECLAIM flag and no one knows why.

   * dpaa2-switch, sparx5, am65_cpsw, cpsw, rocker, prestera, all have
     this check, or one very similar to it:

		if (!fdb_info->added_by_user || fdb_info->is_local)
			break; /* do nothing and exit */

     within the actually scheduled workqueue item. That is to say, they
     schedule and take the rtnl_mutex for nothing - every single time
     that an FDB entry is dynamically learned by the software bridge and
     they are not interested in it. Same thing for the *_dev_check
     function - the function which checks if an FDB entry was learned on
     a network interface owned by the driver.

2. The work items scheduled privately by the driver are not synchronous
   with bridge events (i.e. the bridge will not wait for the driver to
   finish deleting an FDB entry before e.g. calling del_nbp and deleting
   that interface as a bridge port). This might matter for middle layers
   like DSA which construct their own API to their downstream consumers
   on top of the switchdev primitives. With the current switchdev API
   design, it is not possible to guarantee that the bridge which
   generated an FDB entry deletion is still the upper interface by the
   time that the work item is scheduled and the FDB deletion is actually
   executed. To obtain this guarantee it would be necessary to introduce
   a refcounting system where the reference to the bridge is kept by DSA
   for as long as there are pending bridge FDB entry additions/deletions.
   Not really ideal if we look at the big picture.

3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
   deferred by drivers even from code paths that are initially blocking
   (are running in process context):

br_fdb_add
-> __br_fdb_add
   -> fdb_add_entry
      -> fdb_notify
         -> br_switchdev_fdb_notify

    It seems fairly trivial to move the fdb_notify call outside of the
    atomic section of fdb_add_entry, but with switchdev offering only an
    API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
    still have to defer these events and are unable to provide
    synchronous feedback to user space (error codes, extack).

The above issues would warrant an attempt to fix a central problem, and
make switchdev expose an API that is easier to consume rather than
having drivers implement lateral workarounds.

In this case, we must notice that

(a) switchdev already has the concept of notifiers emitted from the fast
    path that are still processed by drivers from blocking context. This
    is accomplished through the SWITCHDEV_F_DEFER flag which is used by
    e.g. SWITCHDEV_OBJ_ID_HOST_MDB.

(b) the bridge del_nbp() function already calls switchdev_deferred_process().
    So if we could hook into that, we could have a chance that the
    bridge simply waits for our FDB entry offloading procedure to finish
    before it calls netdev_upper_dev_unlink() - which is almost
    immediately afterwards, and also when switchdev drivers typically
    break their stateful associations between the bridge upper and
    private data.

So it is in fact possible to use switchdev's generic
switchdev_deferred_enqueue mechanism to get a sleepable callback, and
from there we can call_switchdev_blocking_notifiers().

To address all requirements:

- drivers that are unconverted from atomic to blocking still work
- drivers that currently have a private workqueue are not worse off
- drivers that want the bridge to wait for their deferred work can use
  the bridge's defer mechanism
- a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
  parties does not get deferred for no reason, because this takes the
  rtnl_mutex and schedules a worker thread for nothing

it looks like we can in fact start off by emitting
SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
struct switchdev_notifier_fdb_info called "needs_defer", and any
interested party can set this to true.

This way:

- unconverted drivers do their work (i.e. schedule their private work
  item) based on the atomic notifier, and do not set "needs_defer"
- converted drivers only mark "needs_defer" and treat a separate
  notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
- SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
  generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

Additionally, code paths that are blocking right now, like br_fdb_replay,
could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
consumers of the replayed FDB events support that (right now, that is
DSA and dpaa2-switch).

Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
needs_defer as appropriate, then the notifiers emitted from process
context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
directly, and we would also have fully blocking context all the way
down, with the opportunity for error propagation and extack.

Some disadvantages of this solution though:

- A driver now needs to check whether it is interested in an event
  twice: first on the atomic call chain, then again on the blocking call
  chain (because it is a notifier chain, it is potentially not the only
  driver subscribed to it, it may be listening to another driver's
  needs_defer request). The flip side: on systems with mixed switchdev
  setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB
  entries on foreign interfaces), there are some "synergies", and the
  SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as
  opposed to what would happen if each driver scheduled its own private
  work item.

- Right now drivers take rtnl_lock() as soon as their private work item
  runs. They need the rtnl_lock for the call to
  call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). But it doesn't
  really seem necessary for them to perform the actual hardware
  manipulation (adding the FDB entry) with the rtnl_lock held (anyway
  most do that). But with the new option of servicing
  SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken top-level
  by switchdev, so even if these drivers wanted to be more self-conscious,
  they couldn't.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/switchdev.h   | 26 ++++++++++++++-
 net/bridge/br_switchdev.c |  6 ++--
 net/switchdev/switchdev.c | 69 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 6764fb7692e2..67ddb80c828f 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -193,6 +193,8 @@ enum switchdev_notifier_type {
 	SWITCHDEV_FDB_DEL_TO_BRIDGE,
 	SWITCHDEV_FDB_ADD_TO_DEVICE,
 	SWITCHDEV_FDB_DEL_TO_DEVICE,
+	SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, /* Blocking. */
+	SWITCHDEV_FDB_DEL_TO_DEVICE_DEFERRED, /* Blocking. */
 	SWITCHDEV_FDB_OFFLOADED,
 	SWITCHDEV_FDB_FLUSH_TO_BRIDGE,
 
@@ -222,7 +224,8 @@ struct switchdev_notifier_fdb_info {
 	u16 vid;
 	u8 added_by_user:1,
 	   is_local:1,
-	   offloaded:1;
+	   offloaded:1,
+	   needs_defer:1;
 };
 
 struct switchdev_notifier_port_obj_info {
@@ -283,6 +286,13 @@ int switchdev_port_obj_add(struct net_device *dev,
 int switchdev_port_obj_del(struct net_device *dev,
 			   const struct switchdev_obj *obj);
 
+int
+switchdev_fdb_add_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info);
+int
+switchdev_fdb_del_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info);
+
 int register_switchdev_notifier(struct notifier_block *nb);
 int unregister_switchdev_notifier(struct notifier_block *nb);
 int call_switchdev_notifiers(unsigned long val, struct net_device *dev,
@@ -386,6 +396,20 @@ static inline int switchdev_port_obj_del(struct net_device *dev,
 	return -EOPNOTSUPP;
 }
 
+static inline int
+switchdev_fdb_add_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline int
+switchdev_fdb_del_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info)
+{
+	return -EOPNOTSUPP;
+}
+
 static inline int register_switchdev_notifier(struct notifier_block *nb)
 {
 	return 0;
diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c
index 7e62904089c8..687100ca7088 100644
--- a/net/bridge/br_switchdev.c
+++ b/net/bridge/br_switchdev.c
@@ -140,12 +140,10 @@ br_switchdev_fdb_notify(struct net_bridge *br,
 
 	switch (type) {
 	case RTM_DELNEIGH:
-		call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE,
-					 dev, &info.info, NULL);
+		switchdev_fdb_del_to_device(dev, &info);
 		break;
 	case RTM_NEWNEIGH:
-		call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE,
-					 dev, &info.info, NULL);
+		switchdev_fdb_add_to_device(dev, &info);
 		break;
 	}
 }
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index 0b2c18efc079..d2f0bfc8a0b4 100644
--- a/net/switchdev/switchdev.c
+++ b/net/switchdev/switchdev.c
@@ -378,6 +378,75 @@ int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev,
 }
 EXPORT_SYMBOL_GPL(call_switchdev_blocking_notifiers);
 
+static void switchdev_fdb_add_deferred(struct net_device *dev, const void *data)
+{
+	const struct switchdev_notifier_fdb_info *fdb_info = data;
+	struct switchdev_notifier_fdb_info tmp = *fdb_info;
+	int err;
+
+	ASSERT_RTNL();
+	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED,
+						dev, &tmp.info, NULL);
+	err = notifier_to_errno(err);
+	if (err && err != -EOPNOTSUPP)
+		netdev_err(dev, "failed to add FDB entry: %pe\n", ERR_PTR(err));
+}
+
+static void switchdev_fdb_del_deferred(struct net_device *dev, const void *data)
+{
+	const struct switchdev_notifier_fdb_info *fdb_info = data;
+	struct switchdev_notifier_fdb_info tmp = *fdb_info;
+	int err;
+
+	ASSERT_RTNL();
+	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE_DEFERRED,
+						dev, &tmp.info, NULL);
+	err = notifier_to_errno(err);
+	if (err && err != -EOPNOTSUPP)
+		netdev_err(dev, "failed to delete FDB entry: %pe\n",
+			   ERR_PTR(err));
+}
+
+int
+switchdev_fdb_add_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info)
+{
+	int err;
+
+	err = call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE, dev,
+				       &fdb_info->info, NULL);
+	err = notifier_to_errno(err);
+	if (err)
+		return err;
+
+	if (!fdb_info->needs_defer)
+		return 0;
+
+	return switchdev_deferred_enqueue(dev, fdb_info, sizeof(*fdb_info),
+					  switchdev_fdb_add_deferred);
+}
+EXPORT_SYMBOL_GPL(switchdev_fdb_add_to_device);
+
+int
+switchdev_fdb_del_to_device(struct net_device *dev,
+			    struct switchdev_notifier_fdb_info *fdb_info)
+{
+	int err;
+
+	err = call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE, dev,
+				       &fdb_info->info, NULL);
+	err = notifier_to_errno(err);
+	if (err)
+		return err;
+
+	if (!fdb_info->needs_defer)
+		return 0;
+
+	return switchdev_deferred_enqueue(dev, fdb_info, sizeof(*fdb_info),
+					  switchdev_fdb_del_deferred);
+}
+EXPORT_SYMBOL_GPL(switchdev_fdb_del_to_device);
+
 struct switchdev_nested_priv {
 	bool (*check_cb)(const struct net_device *dev);
 	bool (*foreign_dev_check_cb)(const struct net_device *dev,
-----------------------------[ cut here ]-----------------------------

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-22 13:31               ` Vladimir Oltean
@ 2021-08-22 17:06                 ` Ido Schimmel
  2021-08-22 17:44                   ` Vladimir Oltean
  0 siblings, 1 reply; 34+ messages in thread
From: Ido Schimmel @ 2021-08-22 17:06 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
>    deferred by drivers even from code paths that are initially blocking
>    (are running in process context):
> 
> br_fdb_add
> -> __br_fdb_add
>    -> fdb_add_entry
>       -> fdb_notify
>          -> br_switchdev_fdb_notify
> 
>     It seems fairly trivial to move the fdb_notify call outside of the
>     atomic section of fdb_add_entry, but with switchdev offering only an
>     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
>     still have to defer these events and are unable to provide
>     synchronous feedback to user space (error codes, extack).
> 
> The above issues would warrant an attempt to fix a central problem, and
> make switchdev expose an API that is easier to consume rather than
> having drivers implement lateral workarounds.
> 
> In this case, we must notice that
> 
> (a) switchdev already has the concept of notifiers emitted from the fast
>     path that are still processed by drivers from blocking context. This
>     is accomplished through the SWITCHDEV_F_DEFER flag which is used by
>     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> 
> (b) the bridge del_nbp() function already calls switchdev_deferred_process().
>     So if we could hook into that, we could have a chance that the
>     bridge simply waits for our FDB entry offloading procedure to finish
>     before it calls netdev_upper_dev_unlink() - which is almost
>     immediately afterwards, and also when switchdev drivers typically
>     break their stateful associations between the bridge upper and
>     private data.
> 
> So it is in fact possible to use switchdev's generic
> switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> from there we can call_switchdev_blocking_notifiers().
> 
> To address all requirements:
> 
> - drivers that are unconverted from atomic to blocking still work
> - drivers that currently have a private workqueue are not worse off
> - drivers that want the bridge to wait for their deferred work can use
>   the bridge's defer mechanism
> - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
>   parties does not get deferred for no reason, because this takes the
>   rtnl_mutex and schedules a worker thread for nothing
> 
> it looks like we can in fact start off by emitting
> SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
> struct switchdev_notifier_fdb_info called "needs_defer", and any
> interested party can set this to true.
> 
> This way:
> 
> - unconverted drivers do their work (i.e. schedule their private work
>   item) based on the atomic notifier, and do not set "needs_defer"
> - converted drivers only mark "needs_defer" and treat a separate
>   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
>   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> 
> Additionally, code paths that are blocking right now, like br_fdb_replay,
> could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
> consumers of the replayed FDB events support that (right now, that is
> DSA and dpaa2-switch).
> 
> Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> needs_defer as appropriate, then the notifiers emitted from process
> context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> directly, and we would also have fully blocking context all the way
> down, with the opportunity for error propagation and extack.

IIUC, at this stage all the FDB notifications drivers get are blocking,
either from the work queue (because they were deferred) or directly from
process context. If so, how do we synchronize the two and ensure drivers
get the notifications in the correct order?

I was thinking of adding all the notifications to the 'deferred' list
when 'hash_lock' is held and then calling switchdev_deferred_process()
directly in process context. It's not very pretty (do we return an error
only for the entry the user added or for any other entry we flushed from
the list?), but I don't have a better idea right now.

> 
> Some disadvantages of this solution though:
> 
> - A driver now needs to check whether it is interested in an event
>   twice: first on the atomic call chain, then again on the blocking call
>   chain (because it is a notifier chain, it is potentially not the only
>   driver subscribed to it, it may be listening to another driver's
>   needs_defer request). The flip side: on systems with mixed switchdev
>   setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB
>   entries on foreign interfaces), there are some "synergies", and the
>   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as
>   opposed to what would happen if each driver scheduled its own private
>   work item.
> 
> - Right now drivers take rtnl_lock() as soon as their private work item
>   runs. They need the rtnl_lock for the call to
>   call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). 

I think RCU is enough?

>   But it doesn't really seem necessary for them to perform the actual
>   hardware manipulation (adding the FDB entry) with the rtnl_lock held
>   (anyway most do that). But with the new option of servicing
>   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken
>   top-level by switchdev, so even if these drivers wanted to be more
>   self-conscious, they couldn't.

Yes, I want to remove this dependency in mlxsw assuming notifications
remain atomic. The more pressing issue is actually removing it from the
learning path.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-22 17:06                 ` Ido Schimmel
@ 2021-08-22 17:44                   ` Vladimir Oltean
  2021-08-23 10:47                     ` Ido Schimmel
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-22 17:44 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:
> On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
> >    deferred by drivers even from code paths that are initially blocking
> >    (are running in process context):
> > 
> > br_fdb_add
> > -> __br_fdb_add
> >    -> fdb_add_entry
> >       -> fdb_notify
> >          -> br_switchdev_fdb_notify
> > 
> >     It seems fairly trivial to move the fdb_notify call outside of the
> >     atomic section of fdb_add_entry, but with switchdev offering only an
> >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
> >     still have to defer these events and are unable to provide
> >     synchronous feedback to user space (error codes, extack).
> > 
> > The above issues would warrant an attempt to fix a central problem, and
> > make switchdev expose an API that is easier to consume rather than
> > having drivers implement lateral workarounds.
> > 
> > In this case, we must notice that
> > 
> > (a) switchdev already has the concept of notifiers emitted from the fast
> >     path that are still processed by drivers from blocking context. This
> >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by
> >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> > 
> > (b) the bridge del_nbp() function already calls switchdev_deferred_process().
> >     So if we could hook into that, we could have a chance that the
> >     bridge simply waits for our FDB entry offloading procedure to finish
> >     before it calls netdev_upper_dev_unlink() - which is almost
> >     immediately afterwards, and also when switchdev drivers typically
> >     break their stateful associations between the bridge upper and
> >     private data.
> > 
> > So it is in fact possible to use switchdev's generic
> > switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> > from there we can call_switchdev_blocking_notifiers().
> > 
> > To address all requirements:
> > 
> > - drivers that are unconverted from atomic to blocking still work
> > - drivers that currently have a private workqueue are not worse off
> > - drivers that want the bridge to wait for their deferred work can use
> >   the bridge's defer mechanism
> > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
> >   parties does not get deferred for no reason, because this takes the
> >   rtnl_mutex and schedules a worker thread for nothing
> > 
> > it looks like we can in fact start off by emitting
> > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
> > struct switchdev_notifier_fdb_info called "needs_defer", and any
> > interested party can set this to true.
> > 
> > This way:
> > 
> > - unconverted drivers do their work (i.e. schedule their private work
> >   item) based on the atomic notifier, and do not set "needs_defer"
> > - converted drivers only mark "needs_defer" and treat a separate
> >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
> >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > 
> > Additionally, code paths that are blocking right now, like br_fdb_replay,
> > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
> > consumers of the replayed FDB events support that (right now, that is
> > DSA and dpaa2-switch).
> > 
> > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > needs_defer as appropriate, then the notifiers emitted from process
> > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > directly, and we would also have fully blocking context all the way
> > down, with the opportunity for error propagation and extack.
> 
> IIUC, at this stage all the FDB notifications drivers get are blocking,
> either from the work queue (because they were deferred) or directly from
> process context. If so, how do we synchronize the two and ensure drivers
> get the notifications in the correct order?

What does 'at this stage' mean? Does it mean 'assuming the patch we're
discussing now gets accepted'? If that's what it means, then 'at this
stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,
then would set needs_defer, then would receive the blocking
FDB_ADD_TO_DEVICE.

Thinking a bit more - this two-stage notification process ends up being
more efficient for br_fdb_replay too. We don't queue up FDB entries
except if the driver tells us that it wants to process them in blocking
context.
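
For illustration, the driver side of that two-stage scheme, written
against the RFC patch above (the foo_* helpers are hypothetical), would
be roughly:

/*
 * Illustration only, on top of the needs_defer RFC: the atomic handler
 * just claims interest in the entry, the blocking handler does the
 * actual work.
 */
#include <linux/notifier.h>
#include <net/switchdev.h>

bool foo_port_dev_check(const struct net_device *dev);		/* hypothetical */
int foo_port_fdb_add(struct net_device *dev, const unsigned char *addr,
		     u16 vid);					/* hypothetical */

static int foo_switchdev_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
		/* SWITCHDEV_FDB_DEL_TO_DEVICE would be handled the same way */
		if (!foo_port_dev_check(dev))
			return NOTIFY_DONE;
		if (!fdb_info->added_by_user || fdb_info->is_local)
			return NOTIFY_DONE;
		/* ask switchdev to replay this event on the blocking chain */
		fdb_info->needs_defer = true;
		return NOTIFY_OK;
	}

	return NOTIFY_DONE;
}

static int foo_switchdev_blocking_event(struct notifier_block *nb,
					unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;
	int err;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED:
		if (!foo_port_dev_check(dev))
			return NOTIFY_DONE;
		/* sleepable; rtnl is held by switchdev's deferred processing */
		err = foo_port_fdb_add(dev, fdb_info->addr, fdb_info->vid);
		return notifier_from_errno(err);
	}

	return NOTIFY_DONE;
}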

> I was thinking of adding all the notifications to the 'deferred' list
> when 'hash_lock' is held and then calling switchdev_deferred_process()
> directly in process context. It's not very pretty (do we return an error
> only for the entry the user added or for any other entry we flushed from
> the list?), but I don't have a better idea right now.

I was thinking to add a switchdev_fdb_add_to_device_now(). As opposed to
the switchdev_fdb_add_to_device() which defers, this does not defer at
all but just calls call_switchdev_blocking_notifiers(). So it would not
go through switchdev_deferred_enqueue. For the code path I talked about
above, we would temporarily drop the spin_lock, then call
switchdev_fdb_add_to_device_now(), then if that fails, take the
spin_lock again and delete the software fdb entry we've just added.
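
i.e. something of this shape (sketch only; switchdev_fdb_add_to_device_now()
and the foo_fdb_* helpers are placeholders):

/*
 * Sketch of the call site shape, bridge-internal pseudo-code: add the
 * software entry under hash_lock, notify synchronously outside of it,
 * and roll the software entry back if the offload fails.
 */
static int foo_fdb_add_and_notify(struct net_bridge *br,
				  struct net_bridge_port *p,
				  const unsigned char *addr, u16 vid)
{
	struct switchdev_notifier_fdb_info fdb_info = {
		.addr = addr,
		.vid = vid,
		.added_by_user = true,
	};
	int err;

	spin_lock_bh(&br->hash_lock);
	err = foo_fdb_insert(br, p, addr, vid);		/* software FDB */
	spin_unlock_bh(&br->hash_lock);
	if (err)
		return err;

	/* blocking notifier, synchronous feedback towards user space */
	err = switchdev_fdb_add_to_device_now(p->dev, &fdb_info);
	if (err && err != -EOPNOTSUPP) {
		spin_lock_bh(&br->hash_lock);
		foo_fdb_delete(br, p, addr, vid);	/* roll back */
		spin_unlock_bh(&br->hash_lock);
		return err;
	}

	return 0;
}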

So as long as we use a _now() variant and don't resynchronize with the
deferred work, we shouldn't have any ordering issues, or am I
misunderstanding your question?

> 
> > 
> > Some disadvantages of this solution though:
> > 
> > - A driver now needs to check whether it is interested in an event
> >   twice: first on the atomic call chain, then again on the blocking call
> >   chain (because it is a notifier chain, it is potentially not the only
> >   driver subscribed to it, it may be listening to another driver's
> >   needs_defer request). The flip side: on systems with mixed switchdev
> >   setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB
> >   entries on foreign interfaces), there are some "synergies", and the
> >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as
> >   opposed to what would happen if each driver scheduled its own private
> >   work item.
> > 
> > - Right now drivers take rtnl_lock() as soon as their private work item
> >   runs. They need the rtnl_lock for the call to
> >   call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). 
> 
> I think RCU is enough?

Maybe, I haven't experimented with it. I thought br_fdb_offloaded_set
would notify back rtnetlink, but it looks like it doesn't.

> >   But it doesn't really seem necessary for them to perform the actual
> >   hardware manipulation (adding the FDB entry) with the rtnl_lock held
> >   (anyway most do that). But with the new option of servicing
> >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken
> >   top-level by switchdev, so even if these drivers wanted to be more
> >   self-conscious, they couldn't.
> 
> Yes, I want to remove this dependency in mlxsw assuming notifications
> remain atomic. The more pressing issue is actually removing it from the
> learning path.

Bah, I understand where you're coming from, but it would be tricky to
remove the rtnl_lock from switchdev_deferred_process_work (that's what
it boils down to). My switchdev_handle_fdb_add_to_device helper currently
assumes rcu_read_lock(). With the blocking variant of SWITCHDEV_FDB_ADD_TO_DEVICE,
it would still need to traverse the netdev adjacency lists, so it would
need the rtnl_mutex for that. If we remove the rtnl_lock from
switchdev_deferred_process_work we'd have to add it back in DSA and to
any other callers of switchdev_handle_fdb_add_to_device.
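
(For reference, the work item in question is roughly just this, so
dropping rtnl there means pushing it into every consumer of the
deferred items:)

/* Roughly, from net/switchdev/switchdev.c: all deferred items are run
 * under one top-level rtnl_lock().
 */
static void switchdev_deferred_process_work(struct work_struct *work)
{
	rtnl_lock();
	switchdev_deferred_process();
	rtnl_unlock();
}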

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-22 17:44                   ` Vladimir Oltean
@ 2021-08-23 10:47                     ` Ido Schimmel
  2021-08-23 11:00                       ` Vladimir Oltean
  0 siblings, 1 reply; 34+ messages in thread
From: Ido Schimmel @ 2021-08-23 10:47 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote:
> On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:
> > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
> > >    deferred by drivers even from code paths that are initially blocking
> > >    (are running in process context):
> > > 
> > > br_fdb_add
> > > -> __br_fdb_add
> > >    -> fdb_add_entry
> > >       -> fdb_notify
> > >          -> br_switchdev_fdb_notify
> > > 
> > >     It seems fairly trivial to move the fdb_notify call outside of the
> > >     atomic section of fdb_add_entry, but with switchdev offering only an
> > >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
> > >     still have to defer these events and are unable to provide
> > >     synchronous feedback to user space (error codes, extack).
> > > 
> > > The above issues would warrant an attempt to fix a central problem, and
> > > make switchdev expose an API that is easier to consume rather than
> > > having drivers implement lateral workarounds.
> > > 
> > > In this case, we must notice that
> > > 
> > > (a) switchdev already has the concept of notifiers emitted from the fast
> > >     path that are still processed by drivers from blocking context. This
> > >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by
> > >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> > > 
> > > (b) the bridge del_nbp() function already calls switchdev_deferred_process().
> > >     So if we could hook into that, we could have a chance that the
> > >     bridge simply waits for our FDB entry offloading procedure to finish
> > >     before it calls netdev_upper_dev_unlink() - which is almost
> > >     immediately afterwards, and also when switchdev drivers typically
> > >     break their stateful associations between the bridge upper and
> > >     private data.
> > > 
> > > So it is in fact possible to use switchdev's generic
> > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> > > from there we can call_switchdev_blocking_notifiers().
> > > 
> > > To address all requirements:
> > > 
> > > - drivers that are unconverted from atomic to blocking still work
> > > - drivers that currently have a private workqueue are not worse off
> > > - drivers that want the bridge to wait for their deferred work can use
> > >   the bridge's defer mechanism
> > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
> > >   parties does not get deferred for no reason, because this takes the
> > >   rtnl_mutex and schedules a worker thread for nothing
> > > 
> > > it looks like we can in fact start off by emitting
> > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
> > > struct switchdev_notifier_fdb_info called "needs_defer", and any
> > > interested party can set this to true.
> > > 
> > > This way:
> > > 
> > > - unconverted drivers do their work (i.e. schedule their private work
> > >   item) based on the atomic notifier, and do not set "needs_defer"
> > > - converted drivers only mark "needs_defer" and treat a separate
> > >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
> > >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > 
> > > Additionally, code paths that are blocking right now, like br_fdb_replay,
> > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
> > > consumers of the replayed FDB events support that (right now, that is
> > > DSA and dpaa2-switch).
> > > 
> > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > > needs_defer as appropriate, then the notifiers emitted from process
> > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > directly, and we would also have fully blocking context all the way
> > > down, with the opportunity for error propagation and extack.
> > 
> > IIUC, at this stage all the FDB notifications drivers get are blocking,
> > either from the work queue (because they were deferred) or directly from
> > process context. If so, how do we synchronize the two and ensure drivers
> > get the notifications in the correct order?
> 
> What does 'at this stage' mean? Does it mean 'assuming the patch we're
> discussing now gets accepted'? If that's what it means, then 'at this
> stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,
> then would set needs_defer, then would receive the blocking
> FDB_ADD_TO_DEVICE.

I meant after:

"Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
needs_defer as appropriate, then the notifiers emitted from process
context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
directly, and we would also have fully blocking context all the way
down, with the opportunity for error propagation and extack."

IIUC, after the conversion the 'needs_defer' is gone and all the FDB
events are blocking? Either from syscall context or the workqueue.

If so, I'm not sure how we synchronize the two. That is, making sure
that an event from syscall context does not reach drivers before an
earlier event that was added to the 'deferred' list.

I mean, in syscall context we are holding RTNL so whatever is already on
the 'deferred' list cannot be dequeued and processed.


> 
> Thinking a bit more - this two-stage notification process ends up being
> more efficient for br_fdb_replay too. We don't queue up FDB entries
> except if the driver tells us that it wants to process them in blocking
> context.
> 
> > I was thinking of adding all the notifications to the 'deferred' list
> > when 'hash_lock' is held and then calling switchdev_deferred_process()
> > directly in process context. It's not very pretty (do we return an error
> > only for the entry the user added or for any other entry we flushed from
> > the list?), but I don't have a better idea right now.
> 
> I was thinking of adding a switchdev_fdb_add_to_device_now(). As opposed to
> switchdev_fdb_add_to_device(), which defers, this does not defer at
> all but just calls call_blocking_switchdev_notifiers(). So it would not go
> through switchdev_deferred_enqueue. For the code path I talked about above,
> we would temporarily drop the spin_lock, then call
> switchdev_fdb_add_to_device_now(), then if that fails, take the
> spin_lock again and delete the software fdb entry we've just added.
> 
> So as long as we use a _now() variant and don't resynchronize with the
> deferred work, we shouldn't have any ordering issues, or am I
> misunderstanding your question?

Not sure I'm following. I tried to explain above.

> 
> > 
> > > 
> > > Some disadvantages of this solution though:
> > > 
> > > - A driver now needs to check whether it is interested in an event
> > >   twice: first on the atomic call chain, then again on the blocking call
> > >   chain (because it is a notifier chain, it is potentially not the only
> > >   driver subscribed to it, it may be listening to another driver's
> > >   needs_defer request). The flip side: on systems with mixed switchdev
> > >   setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB
> > >   entries on foreign interfaces), there are some "synergies", and the
> > >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as
> > >   opposed to what would happen if each driver scheduled its own private
> > >   work item.
> > > 
> > > - Right now drivers take rtnl_lock() as soon as their private work item
> > >   runs. They need the rtnl_lock for the call to
> > >   call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). 
> > 
> > I think RCU is enough?
> 
> Maybe, I haven't experimented with it. I thought br_fdb_offloaded_set
> would notify back rtnetlink, but it looks like it doesn't.

You mean emit a RTM_NEWNEIGH? This can be done even without RTNL (from
the data path, for example)

> 
> > >   But it doesn't really seem necessary for them to perform the actual
> > >   hardware manipulation (adding the FDB entry) with the rtnl_lock held
> > >   (anyway most do that). But with the new option of servicing
> > >   SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken
> > >   top-level by switchdev, so even if these drivers wanted to be more
> > >   self-conscious, they couldn't.
> > 
> > Yes, I want to remove this dependency in mlxsw assuming notifications
> > remain atomic. The more pressing issue is actually removing it from the
> > learning path.
> 
> Bah, I understand where you're coming from, but it would be tricky to
> remove the rtnl_lock from switchdev_deferred_process_work (that's what
> it boils down to). My switchdev_handle_fdb_add_to_device helper currently
> assumes rcu_read_lock(). With the blocking variant of SWITCHDEV_FDB_ADD_TO_DEVICE,
> it would still need to traverse the netdev adjacency lists, so it would
> need the rtnl_mutex for that. If we remove the rtnl_lock from
> switchdev_deferred_process_work we'd have to add it back in DSA and to
> any other callers of switchdev_handle_fdb_add_to_device.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-23 10:47                     ` Ido Schimmel
@ 2021-08-23 11:00                       ` Vladimir Oltean
  2021-08-23 12:16                         ` Ido Schimmel
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-23 11:00 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote:
> On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote:
> > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:
> > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
> > > >    deferred by drivers even from code paths that are initially blocking
> > > >    (are running in process context):
> > > >
> > > > br_fdb_add
> > > > -> __br_fdb_add
> > > >    -> fdb_add_entry
> > > >       -> fdb_notify
> > > >          -> br_switchdev_fdb_notify
> > > >
> > > >     It seems fairly trivial to move the fdb_notify call outside of the
> > > >     atomic section of fdb_add_entry, but with switchdev offering only an
> > > >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
> > > >     still have to defer these events and are unable to provide
> > > >     synchronous feedback to user space (error codes, extack).
> > > >
> > > > The above issues would warrant an attempt to fix a central problem, and
> > > > make switchdev expose an API that is easier to consume rather than
> > > > having drivers implement lateral workarounds.
> > > >
> > > > In this case, we must notice that
> > > >
> > > > (a) switchdev already has the concept of notifiers emitted from the fast
> > > >     path that are still processed by drivers from blocking context. This
> > > >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by
> > > >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> > > >
> > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process().
> > > >     So if we could hook into that, we could have a chance that the
> > > >     bridge simply waits for our FDB entry offloading procedure to finish
> > > >     before it calls netdev_upper_dev_unlink() - which is almost
> > > >     immediately afterwards, and also when switchdev drivers typically
> > > >     break their stateful associations between the bridge upper and
> > > >     private data.
> > > >
> > > > So it is in fact possible to use switchdev's generic
> > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> > > > from there we can call_switchdev_blocking_notifiers().
> > > >
> > > > To address all requirements:
> > > >
> > > > - drivers that are unconverted from atomic to blocking still work
> > > > - drivers that currently have a private workqueue are not worse off
> > > > - drivers that want the bridge to wait for their deferred work can use
> > > >   the bridge's defer mechanism
> > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
> > > >   parties does not get deferred for no reason, because this takes the
> > > >   rtnl_mutex and schedules a worker thread for nothing
> > > >
> > > > it looks like we can in fact start off by emitting
> > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
> > > > struct switchdev_notifier_fdb_info called "needs_defer", and any
> > > > interested party can set this to true.
> > > >
> > > > This way:
> > > >
> > > > - unconverted drivers do their work (i.e. schedule their private work
> > > >   item) based on the atomic notifier, and do not set "needs_defer"
> > > > - converted drivers only mark "needs_defer" and treat a separate
> > > >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
> > > >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > >
> > > > Additionally, code paths that are blocking right now, like br_fdb_replay,
> > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
> > > > consumers of the replayed FDB events support that (right now, that is
> > > > DSA and dpaa2-switch).
> > > >
> > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > > > needs_defer as appropriate, then the notifiers emitted from process
> > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > directly, and we would also have fully blocking context all the way
> > > > down, with the opportunity for error propagation and extack.
> > >
> > > IIUC, at this stage all the FDB notifications drivers get are blocking,
> > > either from the work queue (because they were deferred) or directly from
> > > process context. If so, how do we synchronize the two and ensure drivers
> > > get the notifications in the correct order?
> >
> > What does 'at this stage' mean? Does it mean 'assuming the patch we're
> > discussing now gets accepted'? If that's what it means, then 'at this
> > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,
> > then would set needs_defer, then would receive the blocking
> > FDB_ADD_TO_DEVICE.
>
> I meant after:
>
> "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> needs_defer as appropriate, then the notifiers emitted from process
> context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> directly, and we would also have fully blocking context all the way
> down, with the opportunity for error propagation and extack."
>
> IIUC, after the conversion the 'needs_defer' is gone and all the FDB
> events are blocking? Either from syscall context or the workqueue.

We would not delete 'needs_defer'. It still offers a useful preliminary
filtering mechanism for the fast path (and for br_fdb_replay). In
retrospect, the SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer'
instead of jumping to blocking context (if we care so much about performance).

If a FDB event does not need to be processed by anyone (dynamically
learned entry on a switchdev port), the bridge notifies the atomic call
chain for the sake of it, but not the blocking chain.
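
Roughly like this on the bridge side (sketch only; 'needs_defer',
SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED and switchdev_fdb_defer() are all
part of the proposal, none of it exists today):

static void br_switchdev_fdb_notify_sketch(struct net_device *dev,
					   const unsigned char *addr, u16 vid)
{
	struct switchdev_notifier_fdb_info info = {
		.addr = addr,
		.vid = vid,
	};

	/* Atomic chain: unconverted drivers keep scheduling their private
	 * work items here, converted drivers only set info.needs_defer.
	 */
	call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE, dev,
				 &info.info, NULL);

	/* Nobody interested (e.g. a dynamically learned entry that no
	 * converted driver cares about): stop here, no rtnl, no deferral.
	 */
	if (!info.needs_defer)
		return;

	/* Queue the blocking follow-up through switchdev's own deferred
	 * work, which del_nbp() waits for via switchdev_deferred_process().
	 */
	switchdev_fdb_defer(dev, SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, &info);
}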

> If so, I'm not sure how we synchronize the two. That is, making sure
> that an event from syscall context does not reach drivers before an
> earlier event that was added to the 'deferred' list.
>
> I mean, in syscall context we are holding RTNL so whatever is already on
> the 'deferred' list cannot be dequeued and processed.

So switchdev_deferred_process() has ASSERT_RTNL. If we call
switchdev_deferred_process() right before adding the blocking FDB entry
in process context (and we already hold rtnl_mutex), I thought that would
be enough to ensure we have a synchronization point: Everything that was
scheduled before is flushed now, everything that is scheduled while we
are running will run after we unlock the rtnl_mutex. Is that not the
order we expect? I mean, if there is a fast path FDB entry being learned
/ deleted while user space, say, adds that same FDB entry as static, how
is the relative ordering ensured between the two?
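
In code, the ordering I have in mind would be something like this
(sketch only, reusing the hypothetical _now() helper from earlier):

static int br_fdb_notify_sync(struct net_bridge_port *p,
			      struct switchdev_notifier_fdb_info *fdb_info)
{
	/* Process context, rtnl_mutex already held by the caller. */
	ASSERT_RTNL();

	/* Drain whatever the fast path already put on the 'deferred' list,
	 * so those events reach drivers before the one we are about to send.
	 */
	switchdev_deferred_process();

	/* Notify our entry directly on the blocking chain. Anything learned
	 * while we run gets queued and processed after we release rtnl_mutex.
	 */
	return switchdev_fdb_add_to_device_now(p->dev, fdb_info);
}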

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-23 11:00                       ` Vladimir Oltean
@ 2021-08-23 12:16                         ` Ido Schimmel
  2021-08-23 14:29                           ` Vladimir Oltean
  0 siblings, 1 reply; 34+ messages in thread
From: Ido Schimmel @ 2021-08-23 12:16 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Mon, Aug 23, 2021 at 02:00:46PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote:
> > On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote:
> > > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote:
> > > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote:
> > > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are
> > > > >    deferred by drivers even from code paths that are initially blocking
> > > > >    (are running in process context):
> > > > >
> > > > > br_fdb_add
> > > > > -> __br_fdb_add
> > > > >    -> fdb_add_entry
> > > > >       -> fdb_notify
> > > > >          -> br_switchdev_fdb_notify
> > > > >
> > > > >     It seems fairly trivial to move the fdb_notify call outside of the
> > > > >     atomic section of fdb_add_entry, but with switchdev offering only an
> > > > >     API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would
> > > > >     still have to defer these events and are unable to provide
> > > > >     synchronous feedback to user space (error codes, extack).
> > > > >
> > > > > The above issues would warrant an attempt to fix a central problem, and
> > > > > make switchdev expose an API that is easier to consume rather than
> > > > > having drivers implement lateral workarounds.
> > > > >
> > > > > In this case, we must notice that
> > > > >
> > > > > (a) switchdev already has the concept of notifiers emitted from the fast
> > > > >     path that are still processed by drivers from blocking context. This
> > > > >     is accomplished through the SWITCHDEV_F_DEFER flag which is used by
> > > > >     e.g. SWITCHDEV_OBJ_ID_HOST_MDB.
> > > > >
> > > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process().
> > > > >     So if we could hook into that, we could have a chance that the
> > > > >     bridge simply waits for our FDB entry offloading procedure to finish
> > > > >     before it calls netdev_upper_dev_unlink() - which is almost
> > > > >     immediately afterwards, and also when switchdev drivers typically
> > > > >     break their stateful associations between the bridge upper and
> > > > >     private data.
> > > > >
> > > > > So it is in fact possible to use switchdev's generic
> > > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and
> > > > > from there we can call_switchdev_blocking_notifiers().
> > > > >
> > > > > To address all requirements:
> > > > >
> > > > > - drivers that are unconverted from atomic to blocking still work
> > > > > - drivers that currently have a private workqueue are not worse off
> > > > > - drivers that want the bridge to wait for their deferred work can use
> > > > >   the bridge's defer mechanism
> > > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested
> > > > >   parties does not get deferred for no reason, because this takes the
> > > > >   rtnl_mutex and schedules a worker thread for nothing
> > > > >
> > > > > it looks like we can in fact start off by emitting
> > > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in
> > > > > struct switchdev_notifier_fdb_info called "needs_defer", and any
> > > > > interested party can set this to true.
> > > > >
> > > > > This way:
> > > > >
> > > > > - unconverted drivers do their work (i.e. schedule their private work
> > > > >   item) based on the atomic notifier, and do not set "needs_defer"
> > > > > - converted drivers only mark "needs_defer" and treat a separate
> > > > >   notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not
> > > > >   generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > >
> > > > > Additionally, code paths that are blocking right now, like br_fdb_replay,
> > > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all
> > > > > consumers of the replayed FDB events support that (right now, that is
> > > > > DSA and dpaa2-switch).
> > > > >
> > > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > > > > needs_defer as appropriate, then the notifiers emitted from process
> > > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > > > > directly, and we would also have fully blocking context all the way
> > > > > down, with the opportunity for error propagation and extack.
> > > >
> > > > IIUC, at this stage all the FDB notifications drivers get are blocking,
> > > > either from the work queue (because they were deferred) or directly from
> > > > process context. If so, how do we synchronize the two and ensure drivers
> > > > get the notifications in the correct order?
> > >
> > > What does 'at this stage' mean? Does it mean 'assuming the patch we're
> > > discussing now gets accepted'? If that's what it means, then 'at this
> > > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE,
> > > then would set needs_defer, then would receive the blocking
> > > FDB_ADD_TO_DEVICE.
> >
> > I meant after:
> >
> > "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set
> > needs_defer as appropriate, then the notifiers emitted from process
> > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
> > directly, and we would also have fully blocking context all the way
> > down, with the opportunity for error propagation and extack."
> >
> > IIUC, after the conversion the 'needs_defer' is gone and all the FDB
> > events are blocking? Either from syscall context or the workqueue.
> 
> We would not delete 'needs_defer'. It still offers a useful preliminary
> filtering mechanism for the fast path (and for br_fdb_replay). In
> retrospect, the SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer'
> instead of jumping to blocking context (if we care so much about performance).
> 
> If a FDB event does not need to be processed by anyone (dynamically
> learned entry on a switchdev port), the bridge notifies the atomic call
> chain for the sake of it, but not the blocking chain.
> 
> > If so, I'm not sure how we synchronize the two. That is, making sure
> > that an event from syscall context does not reach drivers before an
> > earlier event that was added to the 'deferred' list.
> >
> > I mean, in syscall context we are holding RTNL so whatever is already on
> > the 'deferred' list cannot be dequeued and processed.
> 
> So switchdev_deferred_process() has ASSERT_RTNL. If we call
> switchdev_deferred_process() right before adding the blocking FDB entry
> in process context (and we already hold rtnl_mutex), I thought that would
> be enough to ensure we have a synchronization point: Everything that was
> scheduled before is flushed now, everything that is scheduled while we
> are running will run after we unlock the rtnl_mutex. Is that not the
> order we expect? I mean, if there is a fast path FDB entry being learned
> / deleted while user space, say, adds that same FDB entry as static, how
> is the relative ordering ensured between the two?

I was thinking about the following case:

t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
     response to STP state. Notifications are added to 'deferred' list
t2 - switchdev_deferred_process() is called in syscall context
t3 - <MAC1,VID1,P1> is notified as blocking

Updates to the SW FDB are protected by 'hash_lock', but updates to the
HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
it will exist in HW.

Another case assuming switchdev_deferred_process() is called first:

t0 - switchdev_deferred_process() is called in syscall context
t1 - <MAC1,VID1,P1> is learned under 'hash_lock'. Notification is added
     to 'deferred' list
t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
     <MAC1,VID1,P2>
t3 - <MAC1,VID1,P2> is notified as blocking
t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
     list is processed)

In this case, the HW will have <MAC1,VID1,P1>, but SW will have
<MAC1,VID1,P2>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-23 12:16                         ` Ido Schimmel
@ 2021-08-23 14:29                           ` Vladimir Oltean
  2021-08-23 15:18                             ` Ido Schimmel
  0 siblings, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-23 14:29 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
> I was thinking about the following case:
>
> t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
> t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
>      response to STP state. Notifications are added to 'deferred' list
> t2 - switchdev_deferred_process() is called in syscall context
> t3 - <MAC1,VID1,P1> is notified as blocking
>
> Updates to the SW FDB are protected by 'hash_lock', but updates to the
> HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
> it will exist in HW.
>
> Another case assuming switchdev_deferred_process() is called first:
>
> t0 - switchdev_deferred_process() is called in syscall context
> t1 - <MAC1,VID1,P1> is learned under 'hash_lock'. Notification is added
>      to 'deferred' list
> t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
>      <MAC1,VID1,P2>
> t3 - <MAC1,VID1,P2> is notified as blocking
> t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
>      list is processed)
>
> In this case, the HW will have <MAC1,VID1,P1>, but SW will have
> <MAC1,VID1,P2>

Ok, so if the hardware FDB entry needs to be updated under the same
hash_lock as the software FDB entry, then it seems that the goal of
updating the hardware FDB synchronously and in a sleepable manner is only met if
the data path defers the learning to sleepable context too. That in turn
means that there will be 'dead time' between the reception of a packet
from a given {MAC SA, VID} flow and the learning of that address. So I
don't think that is really desirable. So I don't know if it is actually
realistic to do this.

Can we drop it from the requirements of this change, or do you feel like
it's not worth it to make my change if this problem is not solved?

There is of course the option of going half-way too, just like for
SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
atomic chain, the switchdev throws as many errors as it reasonably
can, then you defer the actual installation which means a hardware access.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-23 14:29                           ` Vladimir Oltean
@ 2021-08-23 15:18                             ` Ido Schimmel
  2021-08-23 15:42                               ` Nikolay Aleksandrov
  2021-08-23 15:42                               ` Vladimir Oltean
  0 siblings, 2 replies; 34+ messages in thread
From: Ido Schimmel @ 2021-08-23 15:18 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
> > I was thinking about the following case:
> >
> > t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
> > t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
> >      response to STP state. Notifications are added to 'deferred' list
> > t2 - switchdev_deferred_process() is called in syscall context
> > t3 - <MAC1,VID1,P1> is notified as blocking
> >
> > Updates to the SW FDB are protected by 'hash_lock', but updates to the
> > HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
> > it will exist in HW.
> >
> > Another case assuming switchdev_deferred_process() is called first:
> >
> > t0 - switchdev_deferred_process() is called in syscall context
> > t1 - <MAC1,VID1,P1> is learned under 'hash_lock'. Notification is added
> >      to 'deferred' list
> > t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
> >      <MAC1,VID1,P2>
> > t3 - <MAC1,VID1,P2> is notified as blocking
> > t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
> >      list is processed)
> >
> > In this case, the HW will have <MAC1,VID1,P1>, but SW will have
> > <MAC1,VID1,P2>
> 
> Ok, so if the hardware FDB entry needs to be updated under the same
> hash_lock as the software FDB entry, then it seems that the goal of
> updating the hardware FDB synchronously and in a sleepable manner is only met if
> the data path defers the learning to sleepable context too. That in turn
> means that there will be 'dead time' between the reception of a packet
> from a given {MAC SA, VID} flow and the learning of that address. So I
> don't think that is really desirable. So I don't know if it is actually
> realistic to do this.
> 
> Can we drop it from the requirements of this change, or do you feel like
> it's not worth it to make my change if this problem is not solved?

I didn't pose it as a requirement, but as a desirable goal that I don't
know how to achieve w/o a surgery in the bridge driver that Nik and you
(understandably) don't like.

Regarding a more practical solution, earlier versions (not what you
posted yesterday) have the undesirable properties of being both
asynchronous (current state) and mandating RTNL to be held. If we are
going with the asynchronous model, then I think we should have a model
that doesn't force RTNL and allows batching.

I have the following proposal, which I believe solves your problem and
allows for batching without RTNL:

The pattern of enqueuing a work item per-entry is not very smart.
Instead, it is better to add the notification info to a list
(protected by a spin lock) and schedule a single work item whose
purpose is to dequeue entries from this list and batch process them.

Inside the work item you would do something like:

spin_lock_bh()
list_splice_init()
spin_unlock_bh()

mutex_lock() // rtnl or preferably private lock
list_for_each_entry_safe() 
	// process entry
	cond_resched()
mutex_unlock()
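
Fleshed out a bit (untested sketch, the 'foo_*' names are made up, the
locking pattern is what matters):

#include <linux/etherdevice.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

/* Driver-private bookkeeping, one entry per pending FDB notification */
struct foo_fdb_event {
	struct list_head list;
	unsigned char addr[ETH_ALEN];
	u16 vid;
	bool add;			/* add vs. delete */
};

struct foo_switch {
	spinlock_t fdb_lock;		/* protects fdb_list */
	struct list_head fdb_list;	/* pending foo_fdb_event entries */
	struct work_struct fdb_work;
	struct mutex lock;		/* private lock instead of rtnl */
};

static void foo_fdb_work(struct work_struct *work)
{
	struct foo_switch *foo = container_of(work, struct foo_switch,
					      fdb_work);
	struct foo_fdb_event *ev, *tmp;
	LIST_HEAD(list);

	spin_lock_bh(&foo->fdb_lock);
	list_splice_init(&foo->fdb_list, &list);
	spin_unlock_bh(&foo->fdb_lock);

	mutex_lock(&foo->lock);
	list_for_each_entry_safe(ev, tmp, &list, list) {
		/* program ev->addr / ev->vid into the device; this is the
		 * place where up to N entries could be packed into a single
		 * transaction instead of one access per entry
		 */
		list_del(&ev->list);
		kfree(ev);
		cond_resched();
	}
	mutex_unlock(&foo->lock);
}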

In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
instruct the driver to flush all its pending FDB notifications. You
don't strictly need this notification because of the
netdev_upper_dev_unlink() that follows, but it helps in making things
more structured.

Pros:

1. Solves your problem?
2. Pattern is not worse than what we currently have
3. Does not force RTNL
4. Allows for batching. For example, mlxsw has the ability to program up
to 64 entries in one transaction with the device. I assume other devices
in the same grade have similar capabilities

Cons:

1. Asynchronous
2. Pattern we will see in multiple drivers? Can consider migrating it
into switchdev itself at some point
3. Something I missed / overlooked

> There is of course the option of going half-way too, just like for
> SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
> atomic chain, the switchdev throws as many errors as it reasonably
> can, then you defer the actual installation which means a hardware access.

Yes, the above proposal has the same property. You can throw errors
before enqueueing the notification info on your list.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-23 15:18                             ` Ido Schimmel
@ 2021-08-23 15:42                               ` Nikolay Aleksandrov
  2021-08-23 15:42                               ` Vladimir Oltean
  1 sibling, 0 replies; 34+ messages in thread
From: Nikolay Aleksandrov @ 2021-08-23 15:42 UTC (permalink / raw)
  To: Ido Schimmel, Vladimir Oltean
  Cc: Vladimir Oltean, netdev, Jakub Kicinski, David S. Miller,
	Roopa Prabhu, Andrew Lunn, Florian Fainelli, Vivien Didelot,
	Vadym Kochan, Taras Chornyi, Jiri Pirko, Ido Schimmel,
	UNGLinuxDriver, Grygorii Strashko, Marek Behun, DENG Qingfang,
	Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh, Sean Wang,
	Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On 23/08/2021 18:18, Ido Schimmel wrote:
> On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote:
>> On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
>>> I was thinking about the following case:
>>>
>>> t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
>>> t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
>>>      response to STP state. Notifications are added to 'deferred' list
>>> t2 - switchdev_deferred_process() is called in syscall context
>>> t3 - <MAC1,VID1,P1> is notified as blocking
>>>
>>> Updates to the SW FDB are protected by 'hash_lock', but updates to the
>>> HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
>>> it will exist in HW.
>>>
>>> Another case assuming switchdev_deferred_process() is called first:
>>>
>>> t0 - switchdev_deferred_process() is called in syscall context
>>> t1 - <MAC1,VID1,P1> is learned under 'hash_lock'. Notification is added
>>>      to 'deferred' list
>>> t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
>>>      <MAC1,VID1,P2>
>>> t3 - <MAC1,VID1,P2> is notified as blocking
>>> t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
>>>      list is processed)
>>>
>>> In this case, the HW will have <MAC1,VID1,P1>, but SW will have
>>> <MAC1,VID1,P2>
>>
>> Ok, so if the hardware FDB entry needs to be updated under the same
>> hash_lock as the software FDB entry, then it seems that the goal of
>> updating the hardware FDB synchronously and in a sleepable manner is only met if
>> the data path defers the learning to sleepable context too. That in turn
>> means that there will be 'dead time' between the reception of a packet
>> from a given {MAC SA, VID} flow and the learning of that address. So I
>> don't think that is really desirable. So I don't know if it is actually
>> realistic to do this.
>>
>> Can we drop it from the requirements of this change, or do you feel like
>> it's not worth it to make my change if this problem is not solved?
> 
> I didn't pose it as a requirement, but as a desirable goal that I don't
> know how to achieve w/o a surgery in the bridge driver that Nik and you
> (understandably) don't like.
> 
> Regarding a more practical solution, earlier versions (not what you
> posted yesterday) have the undesirable properties of being both
> asynchronous (current state) and mandating RTNL to be held. If we are
> going with the asynchronous model, then I think we should have a model
> that doesn't force RTNL and allows batching.
> 
> I have the following proposal, which I believe solves your problem and
> allows for batching without RTNL:
> 
> The pattern of enqueuing a work item per-entry is not very smart.
> Instead, it is better to add the notification info to a list
> (protected by a spin lock) and schedule a single work item whose
> purpose is to dequeue entries from this list and batch process them.
> 
> Inside the work item you would do something like:
> 
> spin_lock_bh()
> list_splice_init()
> spin_unlock_bh()
> 
> mutex_lock() // rtnl or preferably private lock
> list_for_each_entry_safe() 
> 	// process entry
> 	cond_resched()
> mutex_unlock()
> 
> In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> instruct the driver to flush all its pending FDB notifications. You
> don't strictly need this notification because of the
> netdev_upper_dev_unlink() that follows, but it helps in making things
> more structured.
> 

I was also thinking about a solution along these lines, I like this proposition.

> Pros:
> 
> 1. Solves your problem?
> 2. Pattern is not worse than what we currently have
> 3. Does not force RTNL
> 4. Allows for batching. For example, mlxsw has the ability to program up
> to 64 entries in one transaction with the device. I assume other devices
> in the same grade have similar capabilities

Batching would help a lot even if we don't remove rtnl; on loaded systems rtnl itself
is a bottleneck and we've seen crazy delays in commands because of contention. That,
coupled with the ability to program multiple entries, would be a nice win.

> 
> Cons:
> 
> 1. Asynchronous
> 2. Pattern we will see in multiple drivers? Can consider migrating it
> into switchdev itself at some point
> 3. Something I missed / overlooked
> 
>> There is of course the option of going half-way too, just like for
>> SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
>> atomic chain, the switchdev throws as many errors as it reasonably
>> can, then you defer the actual installation which means a hardware access.
> 
> Yes, the above proposal has the same property. You can throw errors
> before enqueueing the notification info on your list.
> 

Thanks,
 Nik

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-23 15:18                             ` Ido Schimmel
  2021-08-23 15:42                               ` Nikolay Aleksandrov
@ 2021-08-23 15:42                               ` Vladimir Oltean
  2021-08-23 16:02                                 ` Ido Schimmel
  1 sibling, 1 reply; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-23 15:42 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Mon, Aug 23, 2021 at 06:18:08PM +0300, Ido Schimmel wrote:
> On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote:
> > On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
> > > I was thinking about the following case:
> > >
> > > t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
> > > t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
> > >      response to STP state. Notifications are added to 'deferred' list
> > > t2 - switchdev_deferred_process() is called in syscall context
> > > t3 - <MAC1,VID1,P1> is notified as blocking
> > >
> > > Updates to the SW FDB are protected by 'hash_lock', but updates to the
> > > HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
> > > it will exist in HW.
> > >
> > > Another case assuming switchdev_deferred_process() is called first:
> > >
> > > t0 - switchdev_deferred_process() is called in syscall context
> > > t1 - <MAC1,VID1,P1> is learned under 'hash_lock'. Notification is added
> > >      to 'deferred' list
> > > t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
> > >      <MAC1,VID1,P2>
> > > t3 - <MAC1,VID1,P2> is notified as blocking
> > > t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
> > >      list is processed)
> > >
> > > In this case, the HW will have <MAC1,VID1,P1>, but SW will have
> > > <MAC1,VID1,P2>
> >
> > Ok, so if the hardware FDB entry needs to be updated under the same
> > hash_lock as the software FDB entry, then it seems that the goal of
> > updating the hardware FDB synchronously and in a sleepable manner is only met if
> > the data path defers the learning to sleepable context too. That in turn
> > means that there will be 'dead time' between the reception of a packet
> > from a given {MAC SA, VID} flow and the learning of that address. So I
> > don't think that is really desirable. So I don't know if it is actually
> > realistic to do this.
> >
> > Can we drop it from the requirements of this change, or do you feel like
> > it's not worth it to make my change if this problem is not solved?
>
> I didn't pose it as a requirement, but as a desirable goal that I don't
> know how to achieve w/o a surgery in the bridge driver that Nik and you
> (understandably) don't like.
>
> Regarding a more practical solution, earlier versions (not what you
> posted yesterday) have the undesirable properties of being both
> asynchronous (current state) and mandating RTNL to be held. If we are
> going with the asynchronous model, then I think we should have a model
> that doesn't force RTNL and allows batching.
>
> I have the following proposal, which I believe solves your problem and
> allows for batching without RTNL:
>
> The pattern of enqueuing a work item per-entry is not very smart.
> > Instead, it is better to add the notification info to a list
> > (protected by a spin lock) and schedule a single work item whose
> purpose is to dequeue entries from this list and batch process them.

I don't have hardware where FDB entries can be installed in bulk, so
this is new to me. It might make sense, though, where you are in fact talking
to firmware: even if the firmware is still committing to hardware entries
one by one, you are still reducing the number of round trips.

> Inside the work item you would do something like:
>
> spin_lock_bh()
> list_splice_init()
> spin_unlock_bh()
>
> mutex_lock() // rtnl or preferably private lock
> list_for_each_entry_safe()
> 	// process entry
> 	cond_resched()
> mutex_unlock()

When is the work item scheduled in your proposal? I assume not only when
SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is there some sort of timer to
allow for some batching to occur?

>
> In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> instruct the driver to flush all its pending FDB notifications. You
> don't strictly need this notification because of the
> netdev_upper_dev_unlink() that follows, but it helps in making things
> more structured.
>
> Pros:
>
> 1. Solves your problem?
> 2. Pattern is not worse than what we currently have
> 3. Does not force RTNL
> 4. Allows for batching. For example, mlxsw has the ability to program up
> to 64 entries in one transaction with the device. I assume other devices
> in the same grade have similar capabilities
>
> Cons:
>
> 1. Asynchronous
> 2. Pattern we will see in multiple drivers? Can consider migrating it
> into switchdev itself at some point

I can already flush_workqueue(dsa_owq) in dsa_port_pre_bridge_leave()
and this will solve the problem in the same way, will it not?

It's not that I don't have driver-level solutions and hook points.
My concern is that there are way too many moving parts and the entrance
barrier for a new switchdev driver is getting higher and higher to
achieve even basic stuff.

For example, I need to maintain a DSA driver and a switchdev driver for
the exact same class of hardware (ocelot is switchdev, felix is DSA, but
the hardware is the same) and it is just so annoying that the interaction
with switchdev is so verbose and open-coded, it just leads to so much
duplication of basic patterns.
When I add support for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE in ocelot I
really don't want to add a boatload of code, all copied from DSA.

> 3. Something I missed / overlooked
>
> > There is of course the option of going half-way too, just like for
> > SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
> > atomic chain, the switchdev throws as many errors as it reasonably
> > can, then you defer the actual installation which means a hardware access.
>
> Yes, the above proposal has the same property. You can throw errors
> before enqueueing the notification info on your list.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-23 15:42                               ` Vladimir Oltean
@ 2021-08-23 16:02                                 ` Ido Schimmel
  2021-08-23 16:11                                   ` Vladimir Oltean
  2021-08-23 16:23                                   ` Vladimir Oltean
  0 siblings, 2 replies; 34+ messages in thread
From: Ido Schimmel @ 2021-08-23 16:02 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Mon, Aug 23, 2021 at 06:42:44PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 23, 2021 at 06:18:08PM +0300, Ido Schimmel wrote:
> > On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote:
> > > On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote:
> > > > I was thinking about the following case:
> > > >
> > > > t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
> > > > t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in
> > > >      response to STP state. Notifications are added to 'deferred' list
> > > > t2 - switchdev_deferred_process() is called in syscall context
> > > > t3 - <MAC1,VID1,P1> is notified as blocking
> > > >
> > > > Updates to the SW FDB are protected by 'hash_lock', but updates to the
> > > > HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but
> > > > it will exist in HW.
> > > >
> > > > Another case assuming switchdev_deferred_process() is called first:
> > > >
> > > > t0 - switchdev_deferred_process() is called in syscall context
> > > > t1 - <MAC1,VID1,P1> is learned under 'hash_lock'. Notification is added
> > > >      to 'deferred' list
> > > > t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to
> > > >      <MAC1,VID1,P2>
> > > > t3 - <MAC1,VID1,P2> is notified as blocking
> > > > t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred'
> > > >      list is processed)
> > > >
> > > > In this case, the HW will have <MAC1,VID1,P1>, but SW will have
> > > > <MAC1,VID1,P2>
> > >
> > > Ok, so if the hardware FDB entry needs to be updated under the same
> > > hash_lock as the software FDB entry, then it seems that the goal of
> > > updating the hardware FDB synchronously and in a sleepable manner is only met if
> > > the data path defers the learning to sleepable context too. That in turn
> > > means that there will be 'dead time' between the reception of a packet
> > > from a given {MAC SA, VID} flow and the learning of that address. So I
> > > don't think that is really desirable. So I don't know if it is actually
> > > realistic to do this.
> > >
> > > Can we drop it from the requirements of this change, or do you feel like
> > > it's not worth it to make my change if this problem is not solved?
> >
> > I didn't pose it as a requirement, but as a desirable goal that I don't
> > know how to achieve w/o a surgery in the bridge driver that Nik and you
> > (understandably) don't like.
> >
> > Regarding a more practical solution, earlier versions (not what you
> > posted yesterday) have the undesirable properties of being both
> > asynchronous (current state) and mandating RTNL to be held. If we are
> > going with the asynchronous model, then I think we should have a model
> > that doesn't force RTNL and allows batching.
> >
> > I have the following proposal, which I believe solves your problem and
> > allows for batching without RTNL:
> >
> > The pattern of enqueuing a work item per-entry is not very smart.
> > Instead, it is better to add the notification info to a list
> > (protected by a spin lock) and schedule a single work item whose
> > purpose is to dequeue entries from this list and batch process them.
> 
> I don't have hardware where FDB entries can be installed in bulk, so
> this is new to me. Might make sense though where you are in fact talking
> to firmware, and the firmware is in fact still committing to hardware
> one by one, you are still reducing the number of round trips.

Yes

> 
> > Inside the work item you would do something like:
> >
> > spin_lock_bh()
> > list_splice_init()
> > spin_unlock_bh()
> >
> > mutex_lock() // rtnl or preferably private lock
> > list_for_each_entry_safe()
> > 	// process entry
> > 	cond_resched()
> > mutex_unlock()
> 
> When is the work item scheduled in your proposal?

Calling queue_work() whenever you get a notification. The work item
might already be queued, which is fine.
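
i.e. the atomic notifier handler only validates and enqueues (sketch,
reusing the made-up 'foo_*' structures from before; foo_port_dev_check()
and foo_switch_from_nb() are placeholders too):

static int foo_switchdev_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;
	struct foo_switch *foo = foo_switch_from_nb(nb);
	struct foo_fdb_event *ev;

	if (!foo_port_dev_check(dev))
		return NOTIFY_DONE;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		ev = kzalloc(sizeof(*ev), GFP_ATOMIC);
		if (!ev)
			return notifier_from_errno(-ENOMEM);

		ether_addr_copy(ev->addr, fdb_info->addr);
		ev->vid = fdb_info->vid;
		ev->add = event == SWITCHDEV_FDB_ADD_TO_DEVICE;

		spin_lock_bh(&foo->fdb_lock);
		list_add_tail(&ev->list, &foo->fdb_list);
		spin_unlock_bh(&foo->fdb_lock);

		/* a no-op if the work item is already pending */
		queue_work(system_long_wq, &foo->fdb_work);
		break;
	default:
		break;
	}

	return NOTIFY_DONE;
}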

> I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is
> there some sort of timer to allow for some batching to occur?

You can add a hysteresis timer if you want, but I don't think it's
necessary. Assuming user space is programming entries at a high rate,
then by the time you finish a batch, you will have a new one enqueued.

> 
> >
> > In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> > new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> > instruct the driver to flush all its pending FDB notifications. You
> > don't strictly need this notification because of the
> > netdev_upper_dev_unlink() that follows, but it helps in making things
> > more structured.
> >
> > Pros:
> >
> > 1. Solves your problem?
> > 2. Pattern is not worse than what we currently have
> > 3. Does not force RTNL
> > 4. Allows for batching. For example, mlxsw has the ability to program up
> > to 64 entries in one transaction with the device. I assume other devices
> > in the same grade have similar capabilities
> >
> > Cons:
> >
> > 1. Asynchronous
> > 2. Pattern we will see in multiple drivers? Can consider migrating it
> > into switchdev itself at some point
> 
> I can already flush_workqueue(dsa_owq) in dsa_port_pre_bridge_leave()
> and this will solve the problem in the same way, will it not?

Problem is that you will deadlock if your work item tries to take RTNL.

> 
> It's not that I don't have driver-level solutions and hook points.
> My concern is that there are way too many moving parts and the entrance
> barrier for a new switchdev driver is getting higher and higher to
> achieve even basic stuff.

I understand the frustration, but that's my best proposal at the moment.
IMO, it doesn't make things worse and has some nice advantages.

> 
> For example, I need to maintain a DSA driver and a switchdev driver for
> the exact same class of hardware (ocelot is switchdev, felix is DSA, but
> the hardware is the same) and it is just so annoying that the interaction
> with switchdev is so verbose and open-coded, it just leads to so much
> duplication of basic patterns.
> When I add support for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE in ocelot I
> really don't want to add a boatload of code, all copied from DSA.
> 
> > 3. Something I missed / overlooked
> >
> > > There is of course the option of going half-way too, just like for
> > > SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
> > > atomic chain, the switchdev throws as many errors as it reasonably
> > > can, then you defer the actual installation which means a hardware access.
> >
> > Yes, the above proposal has the same property. You can throw errors
> > before enqueueing the notification info on your list.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-23 16:02                                 ` Ido Schimmel
@ 2021-08-23 16:11                                   ` Vladimir Oltean
  2021-08-23 16:23                                   ` Vladimir Oltean
  1 sibling, 0 replies; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-23 16:11 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Mon, Aug 23, 2021 at 07:02:15PM +0300, Ido Schimmel wrote:
> > > Inside the work item you would do something like:
> > >
> > > spin_lock_bh()
> > > list_splice_init()
> > > spin_unlock_bh()
> > >
> > > mutex_lock() // rtnl or preferably private lock
> > > list_for_each_entry_safe()
> > > 	// process entry
> > > 	cond_resched()
> > > mutex_unlock()
> >
> > When is the work item scheduled in your proposal?
>
> Calling queue_work() whenever you get a notification. The work item
> might already be queued, which is fine.
>
> > I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is
> > there some sort of timer to allow for some batching to occur?
>
> You can add a hysteresis timer if you want, but I don't think it's
> necessary. Assuming user space is programming entries at a high rate,
> then by the time you finish a batch, you will have a new one enqueued.

With the current model, nothing really stops any driver from doing that
if it so wishes; no switchdev or bridge changes are needed. We already
have maximum flexibility with this async model. Yet it just so happens
that nobody is exploiting it, and the existing options are poorly
utilized by most drivers.

> > > In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> > > new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> > > instruct the driver to flush all its pending FDB notifications. You
> > > don't strictly need this notification because of the
> > > netdev_upper_dev_unlink() that follows, but it helps in making things
> > > more structured.
> > >
> > > Pros:
> > >
> > > 1. Solves your problem?
> > > 2. Pattern is not worse than what we currently have
> > > 3. Does not force RTNL
> > > 4. Allows for batching. For example, mlxsw has the ability to program up
> > > to 64 entries in one transaction with the device. I assume other devices
> > > in the same grade have similar capabilities
> > >
> > > Cons:
> > >
> > > 1. Asynchronous
> > > 2. Pattern we will see in multiple drivers? Can consider migrating it
> > > into switchdev itself at some point
> >
> > I can already flush_workqueue(dsa_owq) in dsa_port_pre_bridge_leave()
> > and this will solve the problem in the same way, will it not?
>
> Problem is that you will deadlock if your work item tries to take RTNL.

I think we agreed that the rtnl_lock could be dropped from driver FDB work items.
I have not tried that yet though.

> > It's not that I don't have driver-level solutions and hook points.
> > My concern is that there are way too many moving parts, and the barrier
> > to entry for a new switchdev driver keeps getting higher even for
> > basic functionality.
>
> I understand the frustration, but that's my best proposal at the moment.
> IMO, it doesn't make things worse and has some nice advantages.

Reconsidering my options, I don't want to reduce the optimizations
available to other switchdev drivers in the name of a simpler baseline.
I am also not smart enough to rework the bridge data path.
I will probably do something like flush_workqueue() in the PRECHANGEUPPER
handler (see the sketch below), look for other common patterns, and try to
synthesize them into library code (a la switchdev_handle_*) that can be
used by the drivers that want it and ignored by the drivers that don't.
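
Roughly what I mean, sketched against a hypothetical driver rather than
actual DSA code (foo_fdb_wq and foo_port_dev_check() are made up):

#include <linux/netdevice.h>
#include <linux/notifier.h>
#include <linux/workqueue.h>

static struct workqueue_struct *foo_fdb_wq;	/* hosts the deferred FDB work */

/* Hypothetical "is this netdev one of our ports" check */
static bool foo_port_dev_check(const struct net_device *dev);

static int foo_netdevice_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
	struct netdev_notifier_changeupper_info *info = ptr;

	if (event != NETDEV_PRECHANGEUPPER)
		return NOTIFY_DONE;

	if (!foo_port_dev_check(dev))
		return NOTIFY_DONE;

	if (netif_is_bridge_master(info->upper_dev) && !info->linking) {
		/* Runs under rtnl_lock, so the flushed work items must only
		 * take driver-private locks, never rtnl_lock, otherwise we
		 * hit the deadlock you pointed out earlier in the thread.
		 */
		flush_workqueue(foo_fdb_wq);
	}

	return NOTIFY_DONE;
}

The library helper would basically wrap this PRECHANGEUPPER boilerplate so
that a driver only has to supply its workqueue (or a callback).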

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
  2021-08-23 16:02                                 ` Ido Schimmel
  2021-08-23 16:11                                   ` Vladimir Oltean
@ 2021-08-23 16:23                                   ` Vladimir Oltean
  1 sibling, 0 replies; 34+ messages in thread
From: Vladimir Oltean @ 2021-08-23 16:23 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Nikolay Aleksandrov, Vladimir Oltean, netdev, Jakub Kicinski,
	David S. Miller, Roopa Prabhu, Andrew Lunn, Florian Fainelli,
	Vivien Didelot, Vadym Kochan, Taras Chornyi, Jiri Pirko,
	Ido Schimmel, UNGLinuxDriver, Grygorii Strashko, Marek Behun,
	DENG Qingfang, Kurt Kanzenbach, Hauke Mehrtens, Woojung Huh,
	Sean Wang, Landen Chao, Claudiu Manoil, Alexandre Belloni,
	George McCollister, Ioana Ciornei, Saeed Mahameed,
	Leon Romanovsky, Lars Povlsen, Steen Hegelund, Julian Wiedmann,
	Karsten Graul, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Ivan Vecera, Vlad Buslov, Jianbo Liu,
	Mark Bloch, Roi Dayan, Tobias Waldekranz, Vignesh Raghavendra,
	Jesse Brandeburg

On Mon, Aug 23, 2021 at 07:02:15PM +0300, Ido Schimmel wrote:
> > When is the work item scheduled in your proposal?
> 
> Calling queue_work() whenever you get a notification. The work item
> might already be queued, which is fine.
> 
> > I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is
> > there some sort of timer to allow for some batching to occur?
> 
> You can add a hysteresis timer if you want, but I don't think it's
> necessary. Assuming user space is programming entries at a high rate,
> then by the time you finish a batch, you will have a new one enqueued.

I tried to do something similar in DSA. There we have .ndo_fdb_dump
because we don't sync the hardware FDB to the bridge. We also have some
drivers where the FDB flush on a port is a very slow procedure, because
the FDB needs to be walked element by element to see what needs to be
deleted. So I wanted to defer the FDB flush to a background workqueue,
and let the port leave the bridge quickly without blocking the rtnl_mutex.
But it gets really nasty really quickly. The FDB flush workqueue cannot
run concurrently with .ndo_fdb_dump, for reasons that have to do with
hardware access. Also, any fdb_add or fdb_del would need to flush the
FDB flush workqueue, for the same reasons. All of these are implicitly
serialized by the rtnl_mutex today. Your hardware/firmware might be
smarter, but I think that if you drop the rtnl_mutex requirement, you
will be seriously surprised by the amount of extra concurrency you need
to handle.
In the end I scrapped everything and I'm happy with a synchronous FDB
flush even if it's slow. YMMV of course.
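
For reference, this is roughly the shape the scrapped attempt took (all
names made up, not actual DSA code); every entry point that touches the
hardware FDB had to wait for the deferred flush first:

#include <linux/netdevice.h>
#include <linux/netlink.h>
#include <linux/skbuff.h>
#include <linux/workqueue.h>

struct foo_port {
	struct net_device *dev;
	struct work_struct fdb_flush_work;	/* deferred per-port FDB flush */
};

/* Hypothetical element-by-element walk of the hardware FDB */
static int foo_hw_fdb_dump(struct foo_port *port, struct sk_buff *skb,
			   struct netlink_callback *cb, int *idx);

static int foo_ndo_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
			    struct net_device *dev,
			    struct net_device *filter_dev, int *idx)
{
	struct foo_port *port = netdev_priv(dev);

	/* The deferred flush walks the same hardware tables we are about to
	 * dump, so it must complete first. The same wait was needed in
	 * .ndo_fdb_add / .ndo_fdb_del. Today the rtnl_mutex serializes all
	 * of this for free.
	 */
	flush_work(&port->fdb_flush_work);

	return foo_hw_fdb_dump(port, skb, cb, idx);
}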

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-08-23 16:23 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-19 16:07 [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Vladimir Oltean
2021-08-19 16:07 ` [PATCH v2 net-next 1/5] net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking notifier chain Vladimir Oltean
2021-08-19 18:15   ` Vlad Buslov
2021-08-19 23:18     ` Vladimir Oltean
2021-08-20  7:36       ` Vlad Buslov
2021-08-19 16:07 ` [PATCH v2 net-next 2/5] net: bridge: switchdev: make br_fdb_replay offer sleepable context to consumers Vladimir Oltean
2021-08-19 16:07 ` [PATCH v2 net-next 3/5] net: switchdev: drop the atomic notifier block from switchdev_bridge_port_{,un}offload Vladimir Oltean
2021-08-19 16:07 ` [PATCH v2 net-next 4/5] net: switchdev: don't assume RCU context in switchdev_handle_fdb_{add,del}_to_device Vladimir Oltean
2021-08-19 16:07 ` [PATCH v2 net-next 5/5] net: dsa: handle SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE synchronously Vladimir Oltean
2021-08-20  9:16 ` [PATCH v2 net-next 0/5] Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking Ido Schimmel
2021-08-20  9:37   ` Vladimir Oltean
2021-08-20 16:09     ` Ido Schimmel
2021-08-20 17:06       ` Vladimir Oltean
2021-08-20 23:36         ` Nikolay Aleksandrov
2021-08-21  0:22           ` Vladimir Oltean
2021-08-22  6:48           ` Ido Schimmel
2021-08-22  9:12             ` Nikolay Aleksandrov
2021-08-22 13:31               ` Vladimir Oltean
2021-08-22 17:06                 ` Ido Schimmel
2021-08-22 17:44                   ` Vladimir Oltean
2021-08-23 10:47                     ` Ido Schimmel
2021-08-23 11:00                       ` Vladimir Oltean
2021-08-23 12:16                         ` Ido Schimmel
2021-08-23 14:29                           ` Vladimir Oltean
2021-08-23 15:18                             ` Ido Schimmel
2021-08-23 15:42                               ` Nikolay Aleksandrov
2021-08-23 15:42                               ` Vladimir Oltean
2021-08-23 16:02                                 ` Ido Schimmel
2021-08-23 16:11                                   ` Vladimir Oltean
2021-08-23 16:23                                   ` Vladimir Oltean
2021-08-20 10:49   ` Vladimir Oltean
2021-08-20 16:11     ` Ido Schimmel
2021-08-21 19:09       ` Vladimir Oltean
2021-08-22  7:19         ` Ido Schimmel
