* [PATCH net-next 0/4] net: dsa: Link aggregation support
@ 2020-11-19 14:45 Tobias Waldekranz
  2020-11-19 14:45 ` [PATCH net-next 1/4] net: bonding: Notify ports about their initial state Tobias Waldekranz
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-19 14:45 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, j.vosburgh, vfalico,
	andy, netdev

Start off by adding an extra notification when a port is added to a
bond; this allows static LAGs to be offloaded using the bonding driver.

Then add the generic support required to offload link aggregates to
drivers built on top of the DSA subsystem.

Finally, implement offloading for the mv88e6xxx driver, i.e. Marvell's
LinkStreet family.

Supported LAG implementations:
- Bonding
- Team

Supported modes:
- Isolated. The LAG may be used as a regular interface outside of any
  bridge.
- Bridged. The LAG may be added to a bridge, in which case switching
  is offloaded between the LAG and any other switch ports. I.e. the
  LAG behaves just like a port from this perspective.

In bridged mode, the following is supported:
- STP filtering.
- VLAN filtering.
- Multicast filtering. The bridge correctly snoops IGMP and configures
  the proper groups if snooping is enabled. Static groups can also be
  configured. MLD seems to work, but has not been extensively tested.
- Unicast filtering. Automatic learning works. Static entries are
  _not_ supported. This will be added in a later series as it requires
  some more general refactoring in mv88e6xxx before I can test it.
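As a usage sketch, the two modes above can be exercised with iproute2.
The interface names (lan0/lan1/lan2) and addresses are examples, not
taken from the series:

```shell
# Create a static (balance-xor) bond and enslave two DSA user ports.
ip link add name bond0 type bond mode balance-xor
ip link set dev lan0 master bond0
ip link set dev lan1 master bond0

# Isolated mode: use the LAG as a regular interface.
ip addr add 192.0.2.1/24 dev bond0
ip link set dev bond0 up

# Bridged mode: attach the LAG to a bridge alongside other ports;
# switching between bond0 and lan2 is then offloaded.
ip link add name br0 type bridge vlan_filtering 1
ip link set dev bond0 master br0
ip link set dev lan2 master br0
ip link set dev br0 up
```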

RFC -> v1:
- Properly propagate MDB operations.
- Support for bonding in addition to team.
- Fixed a locking bug in mv88e6xxx.
- Make sure ports are disabled-by-default in mv88e6xxx.
- Support for both DSA and EDSA tagging.

Tobias Waldekranz (4):
  net: bonding: Notify ports about their initial state
  net: dsa: Link aggregation support
  net: dsa: mv88e6xxx: Link aggregation support
  net: dsa: tag_dsa: Support reception of packets from LAG devices

 drivers/net/bonding/bond_main.c     |   2 +
 drivers/net/dsa/mv88e6xxx/chip.c    | 226 +++++++++++++++++++++++++++-
 drivers/net/dsa/mv88e6xxx/chip.h    |   4 +
 drivers/net/dsa/mv88e6xxx/global2.c |   8 +-
 drivers/net/dsa/mv88e6xxx/global2.h |   5 +
 drivers/net/dsa/mv88e6xxx/port.c    |  21 +++
 drivers/net/dsa/mv88e6xxx/port.h    |   5 +
 include/net/dsa.h                   |  66 ++++++++
 net/dsa/dsa.c                       |  12 +-
 net/dsa/dsa2.c                      |   3 +
 net/dsa/dsa_priv.h                  |  31 ++++
 net/dsa/port.c                      | 143 ++++++++++++++++++
 net/dsa/slave.c                     |  55 ++++++-
 net/dsa/switch.c                    |  49 ++++++
 net/dsa/tag_dsa.c                   |  17 ++-
 15 files changed, 631 insertions(+), 16 deletions(-)

-- 
2.17.1



* [PATCH net-next 1/4] net: bonding: Notify ports about their initial state
  2020-11-19 14:45 [PATCH net-next 0/4] net: dsa: Link aggregation support Tobias Waldekranz
@ 2020-11-19 14:45 ` Tobias Waldekranz
  2020-11-19 18:18   ` Jay Vosburgh
  2020-11-19 14:45 ` [PATCH net-next 2/4] net: dsa: Link aggregation support Tobias Waldekranz
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-19 14:45 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, j.vosburgh, vfalico,
	andy, netdev

When creating a static bond (e.g. balance-xor), all ports will always
be enabled. This is set, and the corresponding notification is sent
out, before the port is linked to the bond upper.

In the offloaded case, this ordering is hard to deal with.

The lower will first see a notification that it cannot associate with
any bond. Then the bond is joined. After that point, no more
notifications are sent, so all ports remain disabled.

This change simply sends an extra notification once the port has been
linked to the upper to synchronize the initial state.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 drivers/net/bonding/bond_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 71c9677d135f..80c164198dcf 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1897,6 +1897,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
 		goto err_unregister;
 	}
 
+	bond_lower_state_changed(new_slave);
+
 	res = bond_sysfs_slave_add(new_slave);
 	if (res) {
 		slave_dbg(bond_dev, slave_dev, "Error %d calling bond_sysfs_slave_add\n", res);
-- 
2.17.1



* [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-19 14:45 [PATCH net-next 0/4] net: dsa: Link aggregation support Tobias Waldekranz
  2020-11-19 14:45 ` [PATCH net-next 1/4] net: bonding: Notify ports about their initial state Tobias Waldekranz
@ 2020-11-19 14:45 ` Tobias Waldekranz
  2020-11-20  0:30   ` Andrew Lunn
  2020-11-19 14:45 ` [PATCH net-next 3/4] net: dsa: mv88e6xxx: " Tobias Waldekranz
  2020-11-19 14:45 ` [PATCH net-next 4/4] net: dsa: tag_dsa: Support reception of packets from LAG devices Tobias Waldekranz
  3 siblings, 1 reply; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-19 14:45 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, j.vosburgh, vfalico,
	andy, netdev

Monitor the following events and notify the driver when:

- A DSA port joins/leaves a LAG.
- A LAG, made up of DSA ports, joins/leaves a bridge.
- A DSA port in a LAG is enabled/disabled (enabled meaning
  "distributing" in 802.3ad LACP terms).

Each LAG interface to which a DSA port is attached is represented by a
`struct dsa_lag` which is globally reachable from the switch tree and
from each associated port.

When a LAG joins a bridge, the DSA subsystem will treat that as each
individual port joining the bridge. The driver may look at the port's
LAG pointer to see if it is associated with any LAG, if that is
required. This is analogous to how switchdev events are replicated out
to all lower devices when reaching e.g. a LAG.
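The tree-global bookkeeping below hands out a small integer ID per LAG
device, reusing the existing entry (with a bumped refcount) when the
device is already known. A standalone sketch of that allocation scheme;
the types are simplified stand-ins (the real code uses a kref and an
RCU-protected list):

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LAGS 16

struct lag {
	void *dev;	/* stand-in for struct net_device * */
	int id;
	int refcount;
	int in_use;
};

static struct lag lags[MAX_LAGS];

/* Return the LAG already tracking dev, or allocate the lowest free
 * ID, mirroring the busy-bitmap logic of dsa_lag_get().
 */
static struct lag *lag_get(void *dev)
{
	unsigned long busy = 0;
	int i, id;

	for (i = 0; i < MAX_LAGS; i++) {
		if (!lags[i].in_use)
			continue;
		busy |= 1UL << lags[i].id;
		if (lags[i].dev == dev) {
			lags[i].refcount++;
			return &lags[i];
		}
	}

	for (id = 0; id < MAX_LAGS; id++)
		if (!(busy & (1UL << id)))
			break;
	if (id == MAX_LAGS)
		return NULL;	/* -ENOSPC in the kernel */

	for (i = 0; i < MAX_LAGS; i++)
		if (!lags[i].in_use)
			break;
	lags[i] = (struct lag){ .dev = dev, .id = id,
				.refcount = 1, .in_use = 1 };
	return &lags[i];
}
```

Calling lag_get() twice for the same device returns the same entry, so
all ports of one bond share one hardware LAG ID across the tree.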

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 include/net/dsa.h  |  66 +++++++++++++++++++++
 net/dsa/dsa2.c     |   3 +
 net/dsa/dsa_priv.h |  31 ++++++++++
 net/dsa/port.c     | 143 +++++++++++++++++++++++++++++++++++++++++++++
 net/dsa/slave.c    |  55 +++++++++++++++--
 net/dsa/switch.c   |  49 ++++++++++++++++
 6 files changed, 341 insertions(+), 6 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 4e60d2610f20..3fd8f041ddbe 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -147,6 +147,9 @@ struct dsa_switch_tree {
 	/* List of switch ports */
 	struct list_head ports;
 
+	/* List of configured LAGs */
+	struct list_head lags;
+
 	/* List of DSA links composing the routing table */
 	struct list_head rtable;
 };
@@ -180,6 +183,49 @@ struct dsa_mall_tc_entry {
 	};
 };
 
+struct dsa_lag {
+	struct net_device *dev;
+	int id;
+
+	struct list_head ports;
+
+	/* For multichip systems, we must ensure that each hash bucket
+	 * is only enabled on a single egress port throughout the
+	 * whole tree, lest we send duplicates. Therefore we must
+	 * maintain a global list of active tx ports, so that each
+	 * switch can figure out which buckets to enable on which
+	 * ports.
+	 */
+	struct list_head tx_ports;
+	int num_tx;
+
+	struct kref refcount;
+	struct list_head list;
+};
+
+static inline struct dsa_lag *dsa_lag_by_dev(struct dsa_switch_tree *dst,
+					     struct net_device *dev)
+{
+	struct dsa_lag *lag;
+
+	list_for_each_entry(lag, &dst->lags, list)
+		if (lag->dev == dev)
+			return lag;
+
+	return NULL;
+}
+
+static inline struct net_device *dsa_lag_dev_by_id(struct dsa_switch_tree *dst,
+						   int id)
+{
+	struct dsa_lag *lag;
+
+	list_for_each_entry_rcu(lag, &dst->lags, list)
+		if (lag->id == id)
+			return lag->dev;
+
+	return NULL;
+}
 
 struct dsa_port {
 	/* A CPU port is physically connected to a master device.
@@ -220,6 +266,9 @@ struct dsa_port {
 	bool			devlink_port_setup;
 	struct phylink		*pl;
 	struct phylink_config	pl_config;
+	struct dsa_lag		*lag;
+	struct list_head	lag_list;
+	struct list_head	lag_tx_list;
 
 	struct list_head list;
 
@@ -624,6 +673,13 @@ struct dsa_switch_ops {
 	void	(*crosschip_bridge_leave)(struct dsa_switch *ds, int tree_index,
 					  int sw_index, int port,
 					  struct net_device *br);
+	int	(*crosschip_lag_change)(struct dsa_switch *ds, int sw_index,
+					int port, struct net_device *lag_dev,
+					struct netdev_lag_lower_state_info *info);
+	int	(*crosschip_lag_join)(struct dsa_switch *ds, int sw_index,
+				      int port, struct net_device *lag_dev);
+	void	(*crosschip_lag_leave)(struct dsa_switch *ds, int sw_index,
+				       int port, struct net_device *lag_dev);
 
 	/*
 	 * PTP functionality
@@ -655,6 +711,16 @@ struct dsa_switch_ops {
 	int	(*port_change_mtu)(struct dsa_switch *ds, int port,
 				   int new_mtu);
 	int	(*port_max_mtu)(struct dsa_switch *ds, int port);
+
+	/*
+	 * LAG integration
+	 */
+	int	(*port_lag_change)(struct dsa_switch *ds, int port,
+				   struct netdev_lag_lower_state_info *info);
+	int	(*port_lag_join)(struct dsa_switch *ds, int port,
+				 struct net_device *lag_dev);
+	void	(*port_lag_leave)(struct dsa_switch *ds, int port,
+				  struct net_device *lag_dev);
 };
 
 #define DSA_DEVLINK_PARAM_DRIVER(_id, _name, _type, _cmodes)		\
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index 183003e45762..708d5a34e150 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -66,6 +66,7 @@ static struct dsa_switch_tree *dsa_tree_alloc(int index)
 	INIT_LIST_HEAD(&dst->rtable);
 
 	INIT_LIST_HEAD(&dst->ports);
+	INIT_LIST_HEAD(&dst->lags);
 
 	INIT_LIST_HEAD(&dst->list);
 	list_add_tail(&dst->list, &dsa_tree_list);
@@ -659,6 +660,8 @@ static struct dsa_port *dsa_port_touch(struct dsa_switch *ds, int index)
 	dp->index = index;
 
 	INIT_LIST_HEAD(&dp->list);
+	INIT_LIST_HEAD(&dp->lag_list);
+	INIT_LIST_HEAD(&dp->lag_tx_list);
 	list_add_tail(&dp->list, &dst->ports);
 
 	return dp;
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index 7c96aae9062c..214051f3ced0 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -20,6 +20,9 @@ enum {
 	DSA_NOTIFIER_BRIDGE_LEAVE,
 	DSA_NOTIFIER_FDB_ADD,
 	DSA_NOTIFIER_FDB_DEL,
+	DSA_NOTIFIER_LAG_CHANGE,
+	DSA_NOTIFIER_LAG_JOIN,
+	DSA_NOTIFIER_LAG_LEAVE,
 	DSA_NOTIFIER_MDB_ADD,
 	DSA_NOTIFIER_MDB_DEL,
 	DSA_NOTIFIER_VLAN_ADD,
@@ -57,6 +60,14 @@ struct dsa_notifier_mdb_info {
 	int port;
 };
 
+/* DSA_NOTIFIER_LAG_* */
+struct dsa_notifier_lag_info {
+	struct netdev_lag_lower_state_info *info;
+	struct net_device *lag;
+	int sw_index;
+	int port;
+};
+
 /* DSA_NOTIFIER_VLAN_* */
 struct dsa_notifier_vlan_info {
 	const struct switchdev_obj_port_vlan *vlan;
@@ -135,6 +146,10 @@ void dsa_port_disable_rt(struct dsa_port *dp);
 void dsa_port_disable(struct dsa_port *dp);
 int dsa_port_bridge_join(struct dsa_port *dp, struct net_device *br);
 void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br);
+int dsa_port_lag_change(struct dsa_port *dp,
+			struct netdev_lag_lower_state_info *linfo);
+int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev);
+void dsa_port_lag_leave(struct dsa_port *dp, struct net_device *lag_dev);
 int dsa_port_vlan_filtering(struct dsa_port *dp, bool vlan_filtering,
 			    struct switchdev_trans *trans);
 bool dsa_port_skip_vlan_configuration(struct dsa_port *dp);
@@ -167,6 +182,22 @@ int dsa_port_link_register_of(struct dsa_port *dp);
 void dsa_port_link_unregister_of(struct dsa_port *dp);
 extern const struct phylink_mac_ops dsa_port_phylink_mac_ops;
 
+static inline bool dsa_port_can_offload(struct dsa_port *dp,
+					struct net_device *dev)
+{
+	/* Switchdev offloading can be configured on: */
+
+	if (dev == dp->slave)
+		/* DSA ports directly connected to a bridge. */
+		return true;
+
+	if (dp->lag && dev == dp->lag->dev)
+		/* DSA ports connected to a bridge via a LAG */
+		return true;
+
+	return false;
+}
+
 /* slave.c */
 extern const struct dsa_device_ops notag_netdev_ops;
 void dsa_slave_mii_bus_init(struct dsa_switch *ds);
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 73569c9af3cc..4bb8a69d7ec2 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -193,6 +193,149 @@ void dsa_port_bridge_leave(struct dsa_port *dp, struct net_device *br)
 	dsa_port_set_state_now(dp, BR_STATE_FORWARDING);
 }
 
+static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst,
+				   struct net_device *dev)
+{
+	unsigned long busy = 0;
+	struct dsa_lag *lag;
+	int id;
+
+	list_for_each_entry(lag, &dst->lags, list) {
+		set_bit(lag->id, &busy);
+
+		if (lag->dev == dev) {
+			kref_get(&lag->refcount);
+			return lag;
+		}
+	}
+
+	id = find_first_zero_bit(&busy, BITS_PER_LONG);
+	if (id >= BITS_PER_LONG)
+		return ERR_PTR(-ENOSPC);
+
+	lag = kzalloc(sizeof(*lag), GFP_KERNEL);
+	if (!lag)
+		return ERR_PTR(-ENOMEM);
+
+	kref_init(&lag->refcount);
+	lag->id = id;
+	lag->dev = dev;
+	INIT_LIST_HEAD(&lag->ports);
+	INIT_LIST_HEAD(&lag->tx_ports);
+
+	INIT_LIST_HEAD(&lag->list);
+	list_add_tail_rcu(&lag->list, &dst->lags);
+	return lag;
+}
+
+static void dsa_lag_release(struct kref *refcount)
+{
+	struct dsa_lag *lag = container_of(refcount, struct dsa_lag, refcount);
+
+	list_del_rcu(&lag->list);
+	synchronize_rcu();
+	kfree(lag);
+}
+
+static void dsa_lag_put(struct dsa_lag *lag)
+{
+	kref_put(&lag->refcount, dsa_lag_release);
+}
+
+int dsa_port_lag_change(struct dsa_port *dp,
+			struct netdev_lag_lower_state_info *linfo)
+{
+	struct dsa_notifier_lag_info info = {
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.info = linfo,
+	};
+	bool old, new;
+
+	if (!dp->lag)
+		return 0;
+
+	info.lag = dp->lag->dev;
+
+	/* If this port is on the tx list, it is already enabled. */
+	old = !list_empty(&dp->lag_tx_list);
+
+	/* On statically configured aggregates (e.g. loadbalance
+	 * without LACP) ports will always be tx_enabled, even if the
+	 * link is down. Thus we require both link_up and tx_enabled
+	 * in order to include it in the tx set.
+	 */
+	new = linfo->link_up && linfo->tx_enabled;
+
+	if (new == old)
+		return 0;
+
+	if (new) {
+		dp->lag->num_tx++;
+		list_add_tail(&dp->lag_tx_list, &dp->lag->tx_ports);
+	} else {
+		list_del_init(&dp->lag_tx_list);
+		dp->lag->num_tx--;
+	}
+
+	return dsa_port_notify(dp, DSA_NOTIFIER_LAG_CHANGE, &info);
+}
+
+int dsa_port_lag_join(struct dsa_port *dp, struct net_device *lag_dev)
+{
+	struct dsa_notifier_lag_info info = {
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.lag = lag_dev,
+	};
+	struct dsa_lag *lag;
+	int err;
+
+	lag = dsa_lag_get(dp->ds->dst, lag_dev);
+	if (IS_ERR(lag))
+		return PTR_ERR(lag);
+
+	dp->lag = lag;
+	list_add_tail(&dp->lag_list, &lag->ports);
+
+	err = dsa_port_notify(dp, DSA_NOTIFIER_LAG_JOIN, &info);
+	if (err) {
+		dp->lag = NULL;
+		list_del_init(&dp->lag_list);
+		dsa_lag_put(lag);
+	}
+
+	return err;
+}
+
+void dsa_port_lag_leave(struct dsa_port *dp, struct net_device *lag_dev)
+{
+	struct dsa_notifier_lag_info info = {
+		.sw_index = dp->ds->index,
+		.port = dp->index,
+		.lag = lag_dev,
+	};
+	int err;
+
+	/* Port might have been part of a LAG that in turn was
+	 * attached to a bridge.
+	 */
+	if (dp->bridge_dev)
+		dsa_port_bridge_leave(dp, dp->bridge_dev);
+
+	list_del_init(&dp->lag_list);
+	list_del_init(&dp->lag_tx_list);
+
+	err = dsa_port_notify(dp, DSA_NOTIFIER_LAG_LEAVE, &info);
+	if (err)
+		pr_err("DSA: failed to notify DSA_NOTIFIER_LAG_LEAVE: %d\n",
+		       err);
+
+	dsa_lag_put(dp->lag);
+
+	dp->lag = NULL;
+}
+
 /* Must be called under rcu_read_lock() */
 static bool dsa_port_can_apply_vlan_filtering(struct dsa_port *dp,
 					      bool vlan_filtering)
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index ff2266d2b998..ca61349886a4 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -334,7 +334,7 @@ static int dsa_slave_vlan_add(struct net_device *dev,
 	struct switchdev_obj_port_vlan vlan;
 	int vid, err;
 
-	if (obj->orig_dev != dev)
+	if (!dsa_port_can_offload(dp, obj->orig_dev))
 		return -EOPNOTSUPP;
 
 	if (dsa_port_skip_vlan_configuration(dp))
@@ -391,7 +391,7 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
 
 	switch (obj->id) {
 	case SWITCHDEV_OBJ_ID_PORT_MDB:
-		if (obj->orig_dev != dev)
+		if (!dsa_port_can_offload(dp, obj->orig_dev))
 			return -EOPNOTSUPP;
 		err = dsa_port_mdb_add(dp, SWITCHDEV_OBJ_PORT_MDB(obj), trans);
 		break;
@@ -421,7 +421,7 @@ static int dsa_slave_vlan_del(struct net_device *dev,
 	struct switchdev_obj_port_vlan *vlan;
 	int vid, err;
 
-	if (obj->orig_dev != dev)
+	if (!dsa_port_can_offload(dp, obj->orig_dev))
 		return -EOPNOTSUPP;
 
 	if (dsa_port_skip_vlan_configuration(dp))
@@ -450,7 +450,7 @@ static int dsa_slave_port_obj_del(struct net_device *dev,
 
 	switch (obj->id) {
 	case SWITCHDEV_OBJ_ID_PORT_MDB:
-		if (obj->orig_dev != dev)
+		if (!dsa_port_can_offload(dp, obj->orig_dev))
 			return -EOPNOTSUPP;
 		err = dsa_port_mdb_del(dp, SWITCHDEV_OBJ_PORT_MDB(obj));
 		break;
@@ -1941,6 +1941,33 @@ static int dsa_slave_changeupper(struct net_device *dev,
 			dsa_port_bridge_leave(dp, info->upper_dev);
 			err = NOTIFY_OK;
 		}
+	} else if (netif_is_lag_master(info->upper_dev)) {
+		if (info->linking) {
+			err = dsa_port_lag_join(dp, info->upper_dev);
+			err = notifier_from_errno(err);
+		} else {
+			dsa_port_lag_leave(dp, info->upper_dev);
+			err = NOTIFY_OK;
+		}
+	}
+
+	return err;
+}
+
+static int dsa_slave_lag_changeupper(struct net_device *dev,
+				     struct netdev_notifier_changeupper_info *info)
+{
+	struct net_device *lower;
+	struct list_head *iter;
+	int err = NOTIFY_DONE;
+
+	netdev_for_each_lower_dev(dev, lower, iter) {
+		if (!dsa_slave_dev_check(lower))
+			continue;
+
+		err = dsa_slave_changeupper(lower, info);
+		if (notifier_to_errno(err))
+			break;
 	}
 
 	return err;
@@ -2038,10 +2065,26 @@ static int dsa_slave_netdevice_event(struct notifier_block *nb,
 		break;
 	}
 	case NETDEV_CHANGEUPPER:
+		if (dsa_slave_dev_check(dev))
+			return dsa_slave_changeupper(dev, ptr);
+
+		if (netif_is_lag_master(dev))
+			return dsa_slave_lag_changeupper(dev, ptr);
+
+		break;
+	case NETDEV_CHANGELOWERSTATE: {
+		struct netdev_notifier_changelowerstate_info *info = ptr;
+		struct dsa_port *dp;
+		int err;
+
 		if (!dsa_slave_dev_check(dev))
-			return NOTIFY_DONE;
+			break;
 
-		return dsa_slave_changeupper(dev, ptr);
+		dp = dsa_slave_to_port(dev);
+
+		err = dsa_port_lag_change(dp, info->lower_state_info);
+		return notifier_from_errno(err);
+	}
 	}
 
 	return NOTIFY_DONE;
diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index 3fb362b6874e..3e518df7cd1f 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -178,6 +178,46 @@ static int dsa_switch_fdb_del(struct dsa_switch *ds,
 	return ds->ops->port_fdb_del(ds, port, info->addr, info->vid);
 }
 
+static int dsa_switch_lag_change(struct dsa_switch *ds,
+				 struct dsa_notifier_lag_info *info)
+{
+	if (ds->index == info->sw_index && ds->ops->port_lag_change)
+		return ds->ops->port_lag_change(ds, info->port, info->info);
+
+	if (ds->index != info->sw_index && ds->ops->crosschip_lag_change)
+		return ds->ops->crosschip_lag_change(ds, info->sw_index,
+						     info->port, info->lag,
+						     info->info);
+
+	return 0;
+}
+
+static int dsa_switch_lag_join(struct dsa_switch *ds,
+			       struct dsa_notifier_lag_info *info)
+{
+	if (ds->index == info->sw_index && ds->ops->port_lag_join)
+		return ds->ops->port_lag_join(ds, info->port, info->lag);
+
+	if (ds->index != info->sw_index && ds->ops->crosschip_lag_join)
+		return ds->ops->crosschip_lag_join(ds, info->sw_index,
+						   info->port, info->lag);
+
+	return 0;
+}
+
+static int dsa_switch_lag_leave(struct dsa_switch *ds,
+				struct dsa_notifier_lag_info *info)
+{
+	if (ds->index == info->sw_index && ds->ops->port_lag_leave)
+		ds->ops->port_lag_leave(ds, info->port, info->lag);
+
+	if (ds->index != info->sw_index && ds->ops->crosschip_lag_leave)
+		ds->ops->crosschip_lag_leave(ds, info->sw_index,
+					     info->port, info->lag);
+
+	return 0;
+}
+
 static bool dsa_switch_mdb_match(struct dsa_switch *ds, int port,
 				 struct dsa_notifier_mdb_info *info)
 {
@@ -325,6 +365,15 @@ static int dsa_switch_event(struct notifier_block *nb,
 	case DSA_NOTIFIER_FDB_DEL:
 		err = dsa_switch_fdb_del(ds, info);
 		break;
+	case DSA_NOTIFIER_LAG_CHANGE:
+		err = dsa_switch_lag_change(ds, info);
+		break;
+	case DSA_NOTIFIER_LAG_JOIN:
+		err = dsa_switch_lag_join(ds, info);
+		break;
+	case DSA_NOTIFIER_LAG_LEAVE:
+		err = dsa_switch_lag_leave(ds, info);
+		break;
 	case DSA_NOTIFIER_MDB_ADD:
 		err = dsa_switch_mdb_add(ds, info);
 		break;
-- 
2.17.1



* [PATCH net-next 3/4] net: dsa: mv88e6xxx: Link aggregation support
  2020-11-19 14:45 [PATCH net-next 0/4] net: dsa: Link aggregation support Tobias Waldekranz
  2020-11-19 14:45 ` [PATCH net-next 1/4] net: bonding: Notify ports about their initial state Tobias Waldekranz
  2020-11-19 14:45 ` [PATCH net-next 2/4] net: dsa: Link aggregation support Tobias Waldekranz
@ 2020-11-19 14:45 ` Tobias Waldekranz
  2020-11-19 14:45 ` [PATCH net-next 4/4] net: dsa: tag_dsa: Support reception of packets from LAG devices Tobias Waldekranz
  3 siblings, 0 replies; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-19 14:45 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, j.vosburgh, vfalico,
	andy, netdev

Support offloading of LAGs to hardware. LAGs may be attached to a
bridge in which case VLANs, multicast groups, etc. are also offloaded
as usual.
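The hash-bucket distribution added in this patch
(mv88e6xxx_lag_mask_table) must assign each of the eight hash buckets
to exactly one active member, or frames would be duplicated. A
standalone check of that property; the table values are copied from
the patch, the helper is illustrative only:

```c
#include <assert.h>

/* Row n-1 holds the bucket masks for the n active members of a LAG,
 * copied from mv88e6xxx_lag_mask_table in the patch below.
 */
static const unsigned char mask_table[8][8] = {
	{ 0xff,    0,    0,    0,    0,    0,    0,    0 },
	{ 0x0f, 0xf0,    0,    0,    0,    0,    0,    0 },
	{ 0x07, 0x38, 0xc0,    0,    0,    0,    0,    0 },
	{ 0x03, 0x0c, 0x30, 0xc0,    0,    0,    0,    0 },
	{ 0x03, 0x0c, 0x30, 0x40, 0x80,    0,    0,    0 },
	{ 0x03, 0x0c, 0x10, 0x20, 0x40, 0x80,    0,    0 },
	{ 0x03, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,    0 },
	{ 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 },
};

/* Which member (column) owns a given hash bucket when num_tx
 * members are active? Returns -1 if the bucket is unowned.
 */
static int bucket_owner(int num_tx, int bucket)
{
	int nth;

	for (nth = 0; nth < num_tx; nth++)
		if (mask_table[num_tx - 1][nth] & (1 << bucket))
			return nth;
	return -1;
}
```

E.g. with three active members, buckets 3, 4 and 5 (mask 0x38) all map
to the second member, matching the example in the table's comment.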

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c    | 226 +++++++++++++++++++++++++++-
 drivers/net/dsa/mv88e6xxx/chip.h    |   4 +
 drivers/net/dsa/mv88e6xxx/global2.c |   8 +-
 drivers/net/dsa/mv88e6xxx/global2.h |   5 +
 drivers/net/dsa/mv88e6xxx/port.c    |  21 +++
 drivers/net/dsa/mv88e6xxx/port.h    |   5 +
 6 files changed, 261 insertions(+), 8 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index ea466f8913d3..5f2bdbb94f5e 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1189,7 +1189,8 @@ static int mv88e6xxx_set_mac_eee(struct dsa_switch *ds, int port,
 }
 
 /* Mask of the local ports allowed to receive frames from a given fabric port */
-static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
+static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port,
+			       struct dsa_lag **lag)
 {
 	struct dsa_switch *ds = chip->ds;
 	struct dsa_switch_tree *dst = ds->dst;
@@ -1201,6 +1202,9 @@ static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
 	list_for_each_entry(dp, &dst->ports, list) {
 		if (dp->ds->index == dev && dp->index == port) {
 			found = true;
+
+			if (dp->lag && lag)
+				*lag = dp->lag;
 			break;
 		}
 	}
@@ -1231,7 +1235,9 @@ static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
 
 static int mv88e6xxx_port_vlan_map(struct mv88e6xxx_chip *chip, int port)
 {
-	u16 output_ports = mv88e6xxx_port_vlan(chip, chip->ds->index, port);
+	u16 output_ports;
+
+	output_ports = mv88e6xxx_port_vlan(chip, chip->ds->index, port, NULL);
 
 	/* prevent frames from going back out of the port they came in on */
 	output_ports &= ~BIT(port);
@@ -1389,14 +1395,21 @@ static int mv88e6xxx_mac_setup(struct mv88e6xxx_chip *chip)
 
 static int mv88e6xxx_pvt_map(struct mv88e6xxx_chip *chip, int dev, int port)
 {
+	struct dsa_lag *lag = NULL;
 	u16 pvlan = 0;
 
 	if (!mv88e6xxx_has_pvt(chip))
 		return 0;
 
 	/* Skip the local source device, which uses in-chip port VLAN */
-	if (dev != chip->ds->index)
-		pvlan = mv88e6xxx_port_vlan(chip, dev, port);
+	if (dev != chip->ds->index) {
+		pvlan = mv88e6xxx_port_vlan(chip, dev, port, &lag);
+
+		if (lag) {
+			dev = MV88E6XXX_G2_PVT_ADRR_DEV_TRUNK;
+			port = lag->id;
+		}
+	}
 
 	return mv88e6xxx_g2_pvt_write(chip, dev, port, pvlan);
 }
@@ -5327,6 +5340,205 @@ static int mv88e6xxx_port_egress_floods(struct dsa_switch *ds, int port,
 	return err;
 }
 
+static int mv88e6xxx_lag_sync_map(struct dsa_switch *ds, struct dsa_lag *lag)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+	struct dsa_port *dp;
+	u16 map = 0;
+
+	/* Build the map of all ports to distribute flows destined for
+	 * this LAG. This can be either a local user port, or a DSA
+	 * port if the LAG port is on a remote chip.
+	 */
+	list_for_each_entry(dp, &lag->ports, lag_list) {
+		map |= BIT(dsa_towards_port(ds, dp->ds->index, dp->index));
+	}
+
+	return mv88e6xxx_g2_trunk_mapping_write(chip, lag->id, map);
+}
+
+static const u8 mv88e6xxx_lag_mask_table[8][8] = {
+	/* Row number corresponds to the number of active members in a
+	 * LAG. Each column states which of the eight hash buckets are
+	 * mapped to the column:th port in the LAG.
+	 *
+	 * Example: In a LAG with three active ports, the second port
+	 * ([2][1]) would be selected for traffic mapped to buckets
+	 * 3,4,5 (0x38).
+	 */
+	{ 0xff,    0,    0,    0,    0,    0,    0,    0 },
+	{ 0x0f, 0xf0,    0,    0,    0,    0,    0,    0 },
+	{ 0x07, 0x38, 0xc0,    0,    0,    0,    0,    0 },
+	{ 0x03, 0x0c, 0x30, 0xc0,    0,    0,    0,    0 },
+	{ 0x03, 0x0c, 0x30, 0x40, 0x80,    0,    0,    0 },
+	{ 0x03, 0x0c, 0x10, 0x20, 0x40, 0x80,    0,    0 },
+	{ 0x03, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,    0 },
+	{ 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 },
+};
+
+static void mv88e6xxx_lag_set_port_mask(u16 *mask, int port,
+					int num_tx, int nth)
+{
+	u8 active = 0;
+	int i;
+
+	num_tx = num_tx <= 8 ? num_tx : 8;
+	if (nth < num_tx)
+		active = mv88e6xxx_lag_mask_table[num_tx - 1][nth];
+
+	for (i = 0; i < 8; i++) {
+		if (BIT(i) & active)
+			mask[i] |= BIT(port);
+	}
+}
+
+static int mv88e6xxx_lag_sync_masks(struct dsa_switch *ds)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+	struct dsa_port *dp;
+	struct dsa_lag *lag;
+	int i, err, nth;
+	u16 mask[8] = { 0 };
+	u16 ivec;
+
+	/* Assume no port is a member of any LAG. */
+	ivec = BIT(mv88e6xxx_num_ports(chip)) - 1;
+
+	/* Disable all masks for ports that _are_ members of a LAG. */
+	list_for_each_entry(lag, &ds->dst->lags, list) {
+		list_for_each_entry(dp, &lag->ports, lag_list) {
+			if (dp->ds != ds)
+				continue;
+
+			ivec &= ~BIT(dp->index);
+		}
+	}
+
+	for (i = 0; i < 8; i++)
+		mask[i] = ivec;
+
+	/* Enable the correct subset of masks for all LAG ports that
+	 * are in the Tx set.
+	 */
+	list_for_each_entry(lag, &ds->dst->lags, list) {
+		if (!lag->num_tx)
+			continue;
+
+		nth = 0;
+		list_for_each_entry(dp, &lag->tx_ports, lag_tx_list) {
+			if (dp->ds == ds)
+				mv88e6xxx_lag_set_port_mask(mask, dp->index,
+							    lag->num_tx, nth);
+
+			nth++;
+		}
+	}
+
+	for (i = 0; i < 8; i++) {
+		err = mv88e6xxx_g2_trunk_mask_write(chip, i, true, mask[i]);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int mv88e6xxx_lag_sync_masks_map(struct dsa_switch *ds,
+					struct dsa_lag *lag)
+{
+	int err;
+
+	err = mv88e6xxx_lag_sync_masks(ds);
+
+	if (!err)
+		err = mv88e6xxx_lag_sync_map(ds, lag);
+
+	return err;
+}
+
+static int mv88e6xxx_port_lag_change(struct dsa_switch *ds, int port,
+				     struct netdev_lag_lower_state_info *info)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+	int err;
+
+	mv88e6xxx_reg_lock(chip);
+	err = mv88e6xxx_lag_sync_masks(ds);
+	mv88e6xxx_reg_unlock(chip);
+	return err;
+}
+
+static int mv88e6xxx_port_lag_join(struct dsa_switch *ds, int port,
+				   struct net_device *lag_dev)
+{
+	struct dsa_lag *lag = dsa_to_port(ds, port)->lag;
+	struct mv88e6xxx_chip *chip = ds->priv;
+	int err;
+
+	mv88e6xxx_reg_lock(chip);
+
+	err = mv88e6xxx_port_set_trunk(chip, port, true, lag->id);
+	if (err)
+		goto unlock;
+
+	err = mv88e6xxx_lag_sync_masks_map(ds, lag);
+	if (err)
+		mv88e6xxx_port_set_trunk(chip, port, false, 0);
+
+unlock:
+	mv88e6xxx_reg_unlock(chip);
+	return err;
+}
+
+static void mv88e6xxx_port_lag_leave(struct dsa_switch *ds, int port,
+				     struct net_device *lag_dev)
+{
+	struct dsa_lag *lag = dsa_to_port(ds, port)->lag;
+	struct mv88e6xxx_chip *chip = ds->priv;
+
+	mv88e6xxx_reg_lock(chip);
+	mv88e6xxx_lag_sync_masks_map(ds, lag);
+	mv88e6xxx_port_set_trunk(chip, port, false, 0);
+	mv88e6xxx_reg_unlock(chip);
+}
+
+static int mv88e6xxx_crosschip_lag_change(struct dsa_switch *ds, int sw_index,
+					  int port, struct net_device *lag_dev,
+					  struct netdev_lag_lower_state_info *info)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+	int err;
+
+	mv88e6xxx_reg_lock(chip);
+	err = mv88e6xxx_lag_sync_masks(ds);
+	mv88e6xxx_reg_unlock(chip);
+	return err;
+}
+
+static int mv88e6xxx_crosschip_lag_join(struct dsa_switch *ds, int sw_index,
+					int port, struct net_device *lag_dev)
+{
+	struct dsa_lag *lag = dsa_lag_by_dev(ds->dst, lag_dev);
+	struct mv88e6xxx_chip *chip = ds->priv;
+	int err;
+
+	mv88e6xxx_reg_lock(chip);
+	err = mv88e6xxx_lag_sync_masks_map(ds, lag);
+	mv88e6xxx_reg_unlock(chip);
+	return err;
+}
+
+static void mv88e6xxx_crosschip_lag_leave(struct dsa_switch *ds, int sw_index,
+					  int port, struct net_device *lag_dev)
+{
+	struct dsa_lag *lag = dsa_lag_by_dev(ds->dst, lag_dev);
+	struct mv88e6xxx_chip *chip = ds->priv;
+
+	mv88e6xxx_reg_lock(chip);
+	mv88e6xxx_lag_sync_masks_map(ds, lag);
+	mv88e6xxx_reg_unlock(chip);
+}
+
 static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
 	.get_tag_protocol	= mv88e6xxx_get_tag_protocol,
 	.setup			= mv88e6xxx_setup,
@@ -5381,6 +5593,12 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
 	.devlink_param_get	= mv88e6xxx_devlink_param_get,
 	.devlink_param_set	= mv88e6xxx_devlink_param_set,
 	.devlink_info_get	= mv88e6xxx_devlink_info_get,
+	.port_lag_change	= mv88e6xxx_port_lag_change,
+	.port_lag_join		= mv88e6xxx_port_lag_join,
+	.port_lag_leave		= mv88e6xxx_port_lag_leave,
+	.crosschip_lag_change	= mv88e6xxx_crosschip_lag_change,
+	.crosschip_lag_join	= mv88e6xxx_crosschip_lag_join,
+	.crosschip_lag_leave	= mv88e6xxx_crosschip_lag_leave,
 };
 
 static int mv88e6xxx_register_switch(struct mv88e6xxx_chip *chip)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 7e137bd0fd06..ed3a88018947 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -576,6 +576,10 @@ struct mv88e6xxx_ops {
 
 	/* Max Frame Size */
 	int (*set_max_frame_size)(struct mv88e6xxx_chip *chip, int mtu);
+
+	/* Link aggregation */
+	int (*lag_set_map)(struct mv88e6xxx_chip *chip, struct dsa_lag *lag);
+	int (*lag_set_masks)(struct mv88e6xxx_chip *chip, struct dsa_lag *lag);
 };
 
 struct mv88e6xxx_irq_ops {
diff --git a/drivers/net/dsa/mv88e6xxx/global2.c b/drivers/net/dsa/mv88e6xxx/global2.c
index 8dbb1ae723ae..fa65ecd9cb85 100644
--- a/drivers/net/dsa/mv88e6xxx/global2.c
+++ b/drivers/net/dsa/mv88e6xxx/global2.c
@@ -126,8 +126,8 @@ int mv88e6xxx_g2_device_mapping_write(struct mv88e6xxx_chip *chip, int target,
 
 /* Offset 0x07: Trunk Mask Table register */
 
-static int mv88e6xxx_g2_trunk_mask_write(struct mv88e6xxx_chip *chip, int num,
-					 bool hash, u16 mask)
+int mv88e6xxx_g2_trunk_mask_write(struct mv88e6xxx_chip *chip, int num,
+				  bool hash, u16 mask)
 {
 	u16 val = (num << 12) | (mask & mv88e6xxx_port_mask(chip));
 
@@ -140,8 +140,8 @@ static int mv88e6xxx_g2_trunk_mask_write(struct mv88e6xxx_chip *chip, int num,
 
 /* Offset 0x08: Trunk Mapping Table register */
 
-static int mv88e6xxx_g2_trunk_mapping_write(struct mv88e6xxx_chip *chip, int id,
-					    u16 map)
+int mv88e6xxx_g2_trunk_mapping_write(struct mv88e6xxx_chip *chip, int id,
+				     u16 map)
 {
 	const u16 port_mask = BIT(mv88e6xxx_num_ports(chip)) - 1;
 	u16 val = (id << 11) | (map & port_mask);
diff --git a/drivers/net/dsa/mv88e6xxx/global2.h b/drivers/net/dsa/mv88e6xxx/global2.h
index 1510a1c810c8..621bbc4a1d80 100644
--- a/drivers/net/dsa/mv88e6xxx/global2.h
+++ b/drivers/net/dsa/mv88e6xxx/global2.h
@@ -101,6 +101,7 @@
 #define MV88E6XXX_G2_PVT_ADDR_OP_WRITE_PVLAN	0x3000
 #define MV88E6XXX_G2_PVT_ADDR_OP_READ		0x4000
 #define MV88E6XXX_G2_PVT_ADDR_PTR_MASK		0x01ff
+#define MV88E6XXX_G2_PVT_ADRR_DEV_TRUNK		0x1f
 
 /* Offset 0x0C: Cross-chip Port VLAN Data Register */
 #define MV88E6XXX_G2_PVT_DATA		0x0c
@@ -347,6 +348,10 @@ int mv88e6352_g2_mgmt_rsvd2cpu(struct mv88e6xxx_chip *chip);
 
 int mv88e6xxx_g2_pot_clear(struct mv88e6xxx_chip *chip);
 
+int mv88e6xxx_g2_trunk_mask_write(struct mv88e6xxx_chip *chip, int num,
+				  bool hash, u16 mask);
+int mv88e6xxx_g2_trunk_mapping_write(struct mv88e6xxx_chip *chip, int id,
+				     u16 map);
 int mv88e6xxx_g2_trunk_clear(struct mv88e6xxx_chip *chip);
 
 int mv88e6xxx_g2_device_mapping_write(struct mv88e6xxx_chip *chip, int target,
diff --git a/drivers/net/dsa/mv88e6xxx/port.c b/drivers/net/dsa/mv88e6xxx/port.c
index 8128dc607cf4..7bf5ba55bf81 100644
--- a/drivers/net/dsa/mv88e6xxx/port.c
+++ b/drivers/net/dsa/mv88e6xxx/port.c
@@ -815,6 +815,27 @@ int mv88e6xxx_port_set_message_port(struct mv88e6xxx_chip *chip, int port,
 	return mv88e6xxx_port_write(chip, port, MV88E6XXX_PORT_CTL1, val);
 }
 
+int mv88e6xxx_port_set_trunk(struct mv88e6xxx_chip *chip, int port,
+			     bool trunk, u8 id)
+{
+	u16 val;
+	int err;
+
+	err = mv88e6xxx_port_read(chip, port, MV88E6XXX_PORT_CTL1, &val);
+	if (err)
+		return err;
+
+	val &= ~MV88E6XXX_PORT_CTL1_TRUNK_ID_MASK;
+
+	if (trunk)
+		val |= MV88E6XXX_PORT_CTL1_TRUNK_PORT |
+			(id << MV88E6XXX_PORT_CTL1_TRUNK_ID_SHIFT);
+	else
+		val &= ~MV88E6XXX_PORT_CTL1_TRUNK_PORT;
+
+	return mv88e6xxx_port_write(chip, port, MV88E6XXX_PORT_CTL1, val);
+}
+
 /* Offset 0x06: Port Based VLAN Map */
 
 int mv88e6xxx_port_set_vlan_map(struct mv88e6xxx_chip *chip, int port, u16 map)
diff --git a/drivers/net/dsa/mv88e6xxx/port.h b/drivers/net/dsa/mv88e6xxx/port.h
index 44d76ac973f6..e6a61be7dff9 100644
--- a/drivers/net/dsa/mv88e6xxx/port.h
+++ b/drivers/net/dsa/mv88e6xxx/port.h
@@ -168,6 +168,9 @@
 /* Offset 0x05: Port Control 1 */
 #define MV88E6XXX_PORT_CTL1			0x05
 #define MV88E6XXX_PORT_CTL1_MESSAGE_PORT	0x8000
+#define MV88E6XXX_PORT_CTL1_TRUNK_PORT		0x4000
+#define MV88E6XXX_PORT_CTL1_TRUNK_ID_MASK	0x0f00
+#define MV88E6XXX_PORT_CTL1_TRUNK_ID_SHIFT	8
 #define MV88E6XXX_PORT_CTL1_FID_11_4_MASK	0x00ff
 
 /* Offset 0x06: Port Based VLAN Map */
@@ -348,6 +351,8 @@ int mv88e6351_port_set_ether_type(struct mv88e6xxx_chip *chip, int port,
 				  u16 etype);
 int mv88e6xxx_port_set_message_port(struct mv88e6xxx_chip *chip, int port,
 				    bool message_port);
+int mv88e6xxx_port_set_trunk(struct mv88e6xxx_chip *chip, int port,
+			     bool trunk, u8 id);
 int mv88e6165_port_set_jumbo_size(struct mv88e6xxx_chip *chip, int port,
 				  size_t size);
 int mv88e6095_port_egress_rate_limiting(struct mv88e6xxx_chip *chip, int port);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH net-next 4/4] net: dsa: tag_dsa: Support reception of packets from LAG devices
  2020-11-19 14:45 [PATCH net-next 0/4] net: dsa: Link aggregation support Tobias Waldekranz
                   ` (2 preceding siblings ...)
  2020-11-19 14:45 ` [PATCH net-next 3/4] net: dsa: mv88e6xxx: " Tobias Waldekranz
@ 2020-11-19 14:45 ` Tobias Waldekranz
  3 siblings, 0 replies; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-19 14:45 UTC (permalink / raw)
  To: davem, kuba
  Cc: andrew, vivien.didelot, f.fainelli, olteanv, j.vosburgh, vfalico,
	andy, netdev

Packets ingressing on a LAG that egress on the CPU port, which are not
classified as management, will have a FORWARD tag that does not
contain the normal source device/port tuple. Instead the trunk bit
will be set, and the port field holds the LAG id.

Since the exact source port information is not available in the tag,
frames are injected directly on the LAG interface and thus never
pass through any DSA port interface on ingress.

Management frames (TO_CPU) are not affected and will pass through the
DSA port interface as usual.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
---
 net/dsa/dsa.c     | 12 +++++++++++-
 net/dsa/tag_dsa.c | 17 ++++++++++++++++-
 2 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c
index a1b1dc8a4d87..7325bf4608e9 100644
--- a/net/dsa/dsa.c
+++ b/net/dsa/dsa.c
@@ -219,11 +219,21 @@ static int dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
 	}
 
 	skb = nskb;
-	p = netdev_priv(skb->dev);
 	skb_push(skb, ETH_HLEN);
 	skb->pkt_type = PACKET_HOST;
 	skb->protocol = eth_type_trans(skb, skb->dev);
 
+	if (unlikely(!dsa_slave_dev_check(skb->dev))) {
+		/* Packet is to be injected directly on an upper
+		 * device, e.g. a team/bond, so skip all DSA-port
+		 * specific actions.
+		 */
+		netif_rx(skb);
+		return 0;
+	}
+
+	p = netdev_priv(skb->dev);
+
 	if (unlikely(cpu_dp->ds->untag_bridge_pvid)) {
 		nskb = dsa_untag_bridge_pvid(skb);
 		if (!nskb) {
diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
index 112c7c6dd568..be7271de8d0b 100644
--- a/net/dsa/tag_dsa.c
+++ b/net/dsa/tag_dsa.c
@@ -163,6 +163,7 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
 				  u8 extra)
 {
 	int source_device, source_port;
+	bool trunk = false;
 	enum dsa_code code;
 	enum dsa_cmd cmd;
 	u8 *dsa_header;
@@ -174,6 +175,8 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
 	switch (cmd) {
 	case DSA_CMD_FORWARD:
 		skb->offload_fwd_mark = 1;
+
+		trunk = !!(dsa_header[1] & 7);
 		break;
 
 	case DSA_CMD_TO_CPU:
@@ -216,7 +219,19 @@ static struct sk_buff *dsa_rcv_ll(struct sk_buff *skb, struct net_device *dev,
 	source_device = dsa_header[0] & 0x1f;
 	source_port = (dsa_header[1] >> 3) & 0x1f;
 
-	skb->dev = dsa_master_find_slave(dev, source_device, source_port);
+	if (trunk) {
+		struct dsa_port *cpu_dp = dev->dsa_ptr;
+
+		/* The exact source port is not available in the tag,
+		 * so we inject the frame directly on the upper
+		 * team/bond.
+		 */
+		skb->dev = dsa_lag_dev_by_id(cpu_dp->dst, source_port);
+	} else {
+		skb->dev = dsa_master_find_slave(dev, source_device,
+						 source_port);
+	}
+
 	if (!skb->dev)
 		return NULL;
 
-- 
2.17.1



* Re: [PATCH net-next 1/4] net: bonding: Notify ports about their initial state
  2020-11-19 14:45 ` [PATCH net-next 1/4] net: bonding: Notify ports about their initial state Tobias Waldekranz
@ 2020-11-19 18:18   ` Jay Vosburgh
  2020-11-19 21:20     ` Tobias Waldekranz
  0 siblings, 1 reply; 18+ messages in thread
From: Jay Vosburgh @ 2020-11-19 18:18 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, olteanv,
	vfalico, andy, netdev

Tobias Waldekranz <tobias@waldekranz.com> wrote:

>When creating a static bond (e.g. balance-xor), all ports will always
>be enabled. This is set, and the corresponding notification is sent
>out, before the port is linked to the bond upper.
>
>In the offloaded case, this ordering is hard to deal with.
>
>The lower will first see a notification that it can not associate with
>any bond. Then the bond is joined. After that point no more
>notifications are sent, so all ports remain disabled.

	Why are the notifications generated in __netdev_upper_dev_link
(via bond_master_upper_dev_link) not sufficient?

>This change simply sends an extra notification once the port has been
>linked to the upper to synchronize the initial state.
>
>Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
>---
> drivers/net/bonding/bond_main.c | 2 ++
> 1 file changed, 2 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 71c9677d135f..80c164198dcf 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1897,6 +1897,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
> 		goto err_unregister;
> 	}
> 
>+	bond_lower_state_changed(new_slave);
>+
> 	res = bond_sysfs_slave_add(new_slave);
> 	if (res) {
> 		slave_dbg(bond_dev, slave_dev, "Error %d calling bond_sysfs_slave_add\n", res);

	Would it be better to add this call further down, after all
possible failures have been checked?  I.e., if this new call to
bond_lower_state_changed() completes, and then very soon afterwards the
upper is unlinked, could that cause any issues?

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com


* Re: [PATCH net-next 1/4] net: bonding: Notify ports about their initial state
  2020-11-19 18:18   ` Jay Vosburgh
@ 2020-11-19 21:20     ` Tobias Waldekranz
  0 siblings, 0 replies; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-19 21:20 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: davem, kuba, andrew, vivien.didelot, f.fainelli, olteanv,
	vfalico, andy, netdev

On Thu Nov 19, 2020 at 11:18 AM CET, Jay Vosburgh wrote:
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
> >When creating a static bond (e.g. balance-xor), all ports will always
> >be enabled. This is set, and the corresponding notification is sent
> >out, before the port is linked to the bond upper.
> >
> >In the offloaded case, this ordering is hard to deal with.
> >
> >The lower will first see a notification that it can not associate with
> >any bond. Then the bond is joined. After that point no more
> >notifications are sent, so all ports remain disabled.
>
> Why are the notifications generated in __netdev_upper_dev_link
> (via bond_master_upper_dev_link) not sufficient?

That notification only lets the switchdev driver know that the port is
now a part of the bond; it says nothing about whether the port is in the
active transmit set or not.

That notification actually is sent. Unfortunately this happens before
the event you're referencing, in the `switch (BOND_MODE(bond))` section
just a few lines above. Essentially the conversation goes like this:

bond0: swp0 has link and is in the active tx set.
dsa:   Cool, but swp0 is not a part of any LAG afaik; ignore.
bond0: swp0 is now a part of bond0.
dsa:   OK, I'll set up the hardware, setting swp0 as inactive initially.

This change just repeats the initial message at the end when the
driver can make sense of it. Without it, modes that default ports to
be inactive (e.g. LACP) still work, as the driver and the bond agree
on the initial state in those cases. But for a static LAG, there will
never be another event (until the link fails).

> >This change simply sends an extra notification once the port has been
> >linked to the upper to synchronize the initial state.
> >
> >Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
> >---
> > drivers/net/bonding/bond_main.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> >index 71c9677d135f..80c164198dcf 100644
> >--- a/drivers/net/bonding/bond_main.c
> >+++ b/drivers/net/bonding/bond_main.c
> >@@ -1897,6 +1897,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
> > 		goto err_unregister;
> > 	}
> > 
> >+	bond_lower_state_changed(new_slave);
> >+
> > 	res = bond_sysfs_slave_add(new_slave);
> > 	if (res) {
> > 		slave_dbg(bond_dev, slave_dev, "Error %d calling bond_sysfs_slave_add\n", res);
>
> Would it be better to add this call further down, after all
> possible failures have been checked? I.e., if this new call to
> bond_lower_state_changed() completes, and then very soon afterwards the
> upper is unlinked, could that cause any issues?

All the work of configuring the LAG offload is done in the
notification sent out by netdev_upper_dev_link. So that will all have
to be torn down in that case no matter where we place this call.

So from the DSA/switchdev point-of-view, I would say no, and I believe
these are the only consumers of the events.

Additionally, I think it makes sense to place the call as early as
possible as that means you have a smaller window of time where the
bond and the switchdev driver may disagree on the port's state.


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-19 14:45 ` [PATCH net-next 2/4] net: dsa: Link aggregation support Tobias Waldekranz
@ 2020-11-20  0:30   ` Andrew Lunn
  2020-11-20  2:43     ` Florian Fainelli
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2020-11-20  0:30 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: davem, kuba, vivien.didelot, f.fainelli, olteanv, j.vosburgh,
	vfalico, andy, netdev

> +static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst,
> +				   struct net_device *dev)
> +{
> +	unsigned long busy = 0;
> +	struct dsa_lag *lag;
> +	int id;
> +
> +	list_for_each_entry(lag, &dst->lags, list) {
> +		set_bit(lag->id, &busy);
> +
> +		if (lag->dev == dev) {
> +			kref_get(&lag->refcount);
> +			return lag;
> +		}
> +	}
> +
> +	id = find_first_zero_bit(&busy, BITS_PER_LONG);
> +	if (id >= BITS_PER_LONG)
> +		return ERR_PTR(-ENOSPC);
> +
> +	lag = kzalloc(sizeof(*lag), GFP_KERNEL);
> +	if (!lag)
> +		return ERR_PTR(-ENOMEM);

Hi Tobias

My comment last time was to statically allocate them at probe
time. Worst case scenario is that each port is alone in a LAG. Pointless,
but somebody could configure it. In dsa_tree_setup_switches() you can
count the number of ports and then allocate an array, or while setting
up a port, add one more lag to the list of lags.

   Andrew


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-20  0:30   ` Andrew Lunn
@ 2020-11-20  2:43     ` Florian Fainelli
  2020-11-20 13:30       ` Andrew Lunn
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Fainelli @ 2020-11-20  2:43 UTC (permalink / raw)
  To: Andrew Lunn, Tobias Waldekranz
  Cc: davem, kuba, vivien.didelot, olteanv, j.vosburgh, vfalico, andy, netdev



On 11/19/2020 4:30 PM, Andrew Lunn wrote:
>> +static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst,
>> +				   struct net_device *dev)
>> +{
>> +	unsigned long busy = 0;
>> +	struct dsa_lag *lag;
>> +	int id;
>> +
>> +	list_for_each_entry(lag, &dst->lags, list) {
>> +		set_bit(lag->id, &busy);
>> +
>> +		if (lag->dev == dev) {
>> +			kref_get(&lag->refcount);
>> +			return lag;
>> +		}
>> +	}
>> +
>> +	id = find_first_zero_bit(&busy, BITS_PER_LONG);
>> +	if (id >= BITS_PER_LONG)
>> +		return ERR_PTR(-ENOSPC);
>> +
>> +	lag = kzalloc(sizeof(*lag), GFP_KERNEL);
>> +	if (!lag)
>> +		return ERR_PTR(-ENOMEM);
> 
> Hi Tobias
> 
> My comment last time was to statically allocate them at probe
> time. Worst case scenario is that each port is alone in a LAG. Pointless,
> but somebody could configure it. In dsa_tree_setup_switches() you can
> count the number of ports and then allocate an array, or while setting
> up a port, add one more lag to the list of lags.

The allocation is allowed to sleep (have not checked the calling context
of dsa_lag_get() whether this is OK) so what would be the upside of
doing upfront dsa_lag allocation which could be wasteful?
-- 
Florian


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-20  2:43     ` Florian Fainelli
@ 2020-11-20 13:30       ` Andrew Lunn
  2020-11-26 22:36         ` Tobias Waldekranz
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2020-11-20 13:30 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Tobias Waldekranz, davem, kuba, vivien.didelot, olteanv,
	j.vosburgh, vfalico, andy, netdev

On Thu, Nov 19, 2020 at 06:43:38PM -0800, Florian Fainelli wrote:
> 
> 
> On 11/19/2020 4:30 PM, Andrew Lunn wrote:
> >> +static struct dsa_lag *dsa_lag_get(struct dsa_switch_tree *dst,
> >> +				   struct net_device *dev)
> >> +{
> >> +	unsigned long busy = 0;
> >> +	struct dsa_lag *lag;
> >> +	int id;
> >> +
> >> +	list_for_each_entry(lag, &dst->lags, list) {
> >> +		set_bit(lag->id, &busy);
> >> +
> >> +		if (lag->dev == dev) {
> >> +			kref_get(&lag->refcount);
> >> +			return lag;
> >> +		}
> >> +	}
> >> +
> >> +	id = find_first_zero_bit(&busy, BITS_PER_LONG);
> >> +	if (id >= BITS_PER_LONG)
> >> +		return ERR_PTR(-ENOSPC);
> >> +
> >> +	lag = kzalloc(sizeof(*lag), GFP_KERNEL);
> >> +	if (!lag)
> >> +		return ERR_PTR(-ENOMEM);
> > 
> > Hi Tobias
> > 
> > My comment last time was to statically allocate them at probe
> > time. Worst case scenario is that each port is alone in a LAG. Pointless,
> > but somebody could configure it. In dsa_tree_setup_switches() you can
> > count the number of ports and then allocate an array, or while setting
> > up a port, add one more lag to the list of lags.
> 
> The allocation is allowed to sleep (have not checked the calling context
> of dsa_lag_get() whether this is OK) so what would be the upside of
> doing upfront dsa_lag allocation which could be wasteful?

Hi Florian

It fits the pattern for the rest of the DSA core. We never allocate
anything at runtime. That keeps the error handling simple, we don't
need to deal with ENOMEM errors, undoing whatever we might have done,
implementing transactions etc.

And the memory involved here is small. I guess around 80 bytes per
lag? So even for a 10 port switch, we are only talking about 800
bytes. That is not a lot.

       Andrew


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-20 13:30       ` Andrew Lunn
@ 2020-11-26 22:36         ` Tobias Waldekranz
  2020-11-26 22:57           ` Andrew Lunn
  0 siblings, 1 reply; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-26 22:36 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli
  Cc: davem, kuba, vivien.didelot, olteanv, j.vosburgh, vfalico, andy, netdev

On Fri, Nov 20, 2020 at 14:30, Andrew Lunn <andrew@lunn.ch> wrote:
> On Thu, Nov 19, 2020 at 06:43:38PM -0800, Florian Fainelli wrote:
>> > 
>> > Hi Tobias
>> > 
>> > My comment last time was to statically allocate them at probe
>> > time. Worst case scenario is that each port is alone in a LAG. Pointless,
>> > but somebody could configure it. In dsa_tree_setup_switches() you can
>> > count the number of ports and then allocate an array, or while setting
>> > up a port, add one more lag to the list of lags.

Right, yes I forgot about that, sorry.

Permit me a final plea for the current implementation. If it is a hard
no, say the word and I will re-spin a v2 with your suggestion.

>> The allocation is allowed to sleep (have not checked the calling context
>> of dsa_lag_get() whether this is OK) so what would be the upside of

These events always originate from a sendmsg (netlink message); AFAIK
there is no other way for an interface to move to a new upper, so I do
not see any issues there.

>> doing upfront dsa_lag allocation which could be wasteful?
>
> Hi Florian
>
> It fits the pattern for the rest of the DSA core. We never allocate
> anything at runtime. That keeps the error handling simple, we don't
> need to deal with ENOMEM errors, undoing whatever we might have done,
> implementing transactions etc.

I understand that argument in principle. The reason that I think it
carries less weight in this case has to do with the dynamic nature of
the LAGs themselves.

In the cases of `struct dsa_port`s or `struct dsa_switch`es there is no
way for them to be dynamically removed, so it makes a lot of sense to
statically allocate everything.

Contrast that with a LAG that is dynamically created and dynamically
modified, i.e. more ports can join/leave the LAG during its
lifecycle. This means you get all the headache of figuring out if the
LAG is in use or not anyway, i.e. you need to refcount them.

So essentially we _will_ be dynamically allocating/freeing them. We just
get to choose if we do it through the SLAB allocator, or through the
static array.

If you go with the static array, you theoretically can not get the
equivalent of an ENOMEM. Practically though you have to iterate through
the array and look for a free entry, but you still have to put a return
statement at the bottom of that function, right? Or panic I suppose. My
guess is you end up with:

    struct dsa_lag *dsa_lag_get(dst)
    {
        for (lag in dst->lag_array) {
            if (lag->dev == NULL)
                return lag;
        }

        return NULL;
    }

So now we have just traded dealing with an ENOMEM for a NULL pointer;
pretty much the same thing.


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-26 22:36         ` Tobias Waldekranz
@ 2020-11-26 22:57           ` Andrew Lunn
  2020-11-27  9:19             ` Tobias Waldekranz
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2020-11-26 22:57 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Florian Fainelli, davem, kuba, vivien.didelot, olteanv,
	j.vosburgh, vfalico, andy, netdev

> If you go with the static array, you theoretically can not get the
> equivalent of an ENOMEM. Practically though you have to iterate through
> the array and look for a free entry, but you still have to put a return
> statement at the bottom of that function, right? Or panic I suppose. My
> guess is you end up with:
> 
>     struct dsa_lag *dsa_lag_get(dst)
>     {
>         for (lag in dst->lag_array) {
>             if (lag->dev == NULL)
>                 return lag;
>         }
> 
>         return NULL;
>     }

I would put a WARN() in here, and return the NULL pointer. That will
then likely oops soon afterwards and kill off the user space tool
configuring the LAG, maybe leaving rtnl locked, and so all network
configuration dies. But that is all fine, since you cannot have more
LAGs than ports. This can never happen. If it does happen, something
is badly wrong and we want to know about it. And so a very obvious
explosion is good.

> So now we have just traded dealing with an ENOMEM for a NULL pointer;
> pretty much the same thing.

ENOMEM you have to handle correctly, unwind everything and leaving the
stack in the same state as before. Being out of memory is not a reason
to explode. Have you tested this? It is the sort of thing which does
not happen very often, so I expect it is broken.

   Andrew


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-26 22:57           ` Andrew Lunn
@ 2020-11-27  9:19             ` Tobias Waldekranz
  2020-11-27 16:28               ` Andrew Lunn
  0 siblings, 1 reply; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-27  9:19 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, davem, kuba, vivien.didelot, olteanv,
	j.vosburgh, vfalico, andy, netdev

On Thu, Nov 26, 2020 at 23:57, Andrew Lunn <andrew@lunn.ch> wrote:
>> If you go with the static array, you theoretically can not get the
>> equivalent of an ENOMEM. Practically though you have to iterate through
>> the array and look for a free entry, but you still have to put a return
>> statement at the bottom of that function, right? Or panic I suppose. My
>> guess is you end up with:
>> 
>>     struct dsa_lag *dsa_lag_get(dst)
>>     {
>>         for (lag in dst->lag_array) {
>>             if (lag->dev == NULL)
>>                 return lag;
>>         }
>> 
>>         return NULL;
>>     }
>
> I would put a WARN() in here, and return the NULL pointer. That will
> then likely oops soon afterwards and kill off the user space tool
> configuring the LAG, maybe leaving rtnl locked, and so all network
> configuration dies. But that is all fine, since you cannot have more

This is a digression, but I really do not get this shift from using
BUG()s to WARN()s in the kernel when you detect a violated invariant. It
smells of "On Error Resume Next" to me.

> LAGs than ports. This can never happen. If it does happen, something
> is badly wrong and we want to know about it. And so a very obvious
> explosion is good.

Indeed. That is why I think it should be BUG(). Leaving the system to
limp along can easily go undetected.

>> So now we have just traded dealing with an ENOMEM for a NULL pointer;
>> pretty much the same thing.
>
> ENOMEM you have to handle correctly, unwind everything and leaving the
> stack in the same state as before. Being out of memory is not a reason

We have to handle EWHATEVER correctly, no? I do not get what is so
special about ENOMEM. If we hit some other error after the allocation we
have to remember to free the memory of course, but that is just normal
cleanup. Is this what you mean by "unwind everything"? In that case we
need to also "free" the element we allocated from the array. Again, it
is fundamentally the same problem.

How would a call to kmalloc have any impact on the stack? (Barring
exotic bugs in the allocator that would allow the heap to intrude on
stack memory) Or am I misunderstanding what you mean by "the stack"?

> to explode. Have you tested this? It is the sort of thing which does
> not happen very often, so I expect it is broken.

I have not run my system under memory pressure or done any error
injection testing. Is that customary to do whenever using kmalloc?  If
it would make a difference I could look into setting that up.

I have certainly tried my best to audit all the possible error paths to
make sure that no memory is ever leaked, and I feel confident in that
analysis.


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-27  9:19             ` Tobias Waldekranz
@ 2020-11-27 16:28               ` Andrew Lunn
  2020-11-27 23:19                 ` Tobias Waldekranz
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2020-11-27 16:28 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Florian Fainelli, davem, kuba, vivien.didelot, olteanv,
	j.vosburgh, vfalico, andy, netdev

> This is a digression, but I really do not get this shift from using
> BUG()s to WARN()s in the kernel when you detect a violated invariant. It
> smells of "On Error Resume Next" to me.

A WARN() gives you a chance to actually use the machine to collect
logs, the kernel dump, etc. You might be able to sync the filesystems,
reducing the damage to the disks.  With BUG(), the machine is
dead. You cannot interact with it; all you have to go on is what you
can see on the disk, or what you might have logged on the serial
console.

> We have to handle EWHATEVER correctly, no? I do not get what is so
> special about ENOMEM.

Nothing is particularly special about it. But looking at the current
code base the only other error we can get is probably ETIMEDOUT from
an MDIO read/write. But if that happens, there is probably no real
recovery. You have no idea what state the switch is in, and all other
MDIO calls are likely to fail in the same way.

> How would a call to kmalloc have any impact on the stack? (Barring
> exotic bugs in the allocator that would allow the heap to intrude on
> stack memory) Or am I misunderstanding what you mean by "the stack"?

I mean the network stack, top to bottom. Say we have a few vlan
interfaces, on top of the bridge, on top of a LAG, on top of DSA, on
top of IP over avian carriers. When setting up a LAG, what else has
happened by the time you get your ENOMEM? What notifications have
already happened, which some other layer has acted upon? It is not
just LAG inside DSA which needs to unwind, it is all the actions which
have been triggered so far.

The initial design of switchdev was transactions. First there was a
prepare call, where you validated the requested action is possible,
and allocate resources needed, but don't actually do it. This prepare
call is allowed to fail. Then there is a second call to actually do
it, and that call is not allowed to fail. This structure avoids most
of the complexity of the unwind, just free up some resources. If you
never had to allocate the resources in the first place, better still.

      Andrew


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-27 16:28               ` Andrew Lunn
@ 2020-11-27 23:19                 ` Tobias Waldekranz
  2020-11-28  5:19                   ` Florian Fainelli
  0 siblings, 1 reply; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-27 23:19 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Florian Fainelli, davem, kuba, vivien.didelot, olteanv,
	j.vosburgh, vfalico, andy, netdev

On Fri, Nov 27, 2020 at 17:28, Andrew Lunn <andrew@lunn.ch> wrote:
>> This is a digression, but I really do not get this shift from using
>> BUG()s to WARN()s in the kernel when you detect a violated invariant. It
>> smells of "On Error Resume Next" to me.
>
> A WARN() gives you a chance to actually use the machine to collect
> logs, the kernel dump, etc. You might be able to sync the filesystems,

I completely agree that collecting the logs is very important. But I
think there are ways of solving that problem that works for both BUG()
and WARN(), e.g. pstore or mtdoops.

> reducing the damage to the disks.  With BUG(), the machine is

Well, you are betting that whatever bug you just discovered has not
affected the VFS in any way, causing you to sync back bad data to the
disk. I would rather take a hard reboot and rely on the filesystem's
journal to bring it up in a consistent state.

> dead. You cannot interact with it; all you have to go on is what you
> can see on the disk, or what you might have logged on the serial
> console.

I guess we are coming at this problem from different angles. For me it
is not OK to just leave the device in limp-mode until an administrator
can inspect it and perform a reboot, as that can take weeks in some
cases. I would much rather have a hard reboot, bring the system back to
a fully operational state, detect the panic during the next boot,
prepare a debug tarball with the pstore data, and signal an alarm of
some kind.

Agree to disagree? :)

>> We have to handle EWHATEVER correctly, no? I do not get what is so
>> special about ENOMEM.
>
> Nothing is particularly special about it. But looking at the current
> code base the only other error we can get is probably ETIMEDOUT from
> an MDIO read/write. But if that happens, there is probably no real
> recovery. You have no idea what state the switch is in, and all other
> MDIO calls are likely to fail in the same way.
>
>> How would a call to kmalloc have any impact on the stack? (Barring
>> exotic bugs in the allocator that would allow the heap to intrude on
>> stack memory) Or am I misunderstanding what you mean by "the stack"?
>
> I mean the network stack, top to bottom. Say we have a few vlan
> interfaces, on top of the bridge, on top of a LAG, on top of DSA, on
> top of IP over avian carriers. When setting up a LAG, what else has
> happened by the time you get your ENOMEM? What notifications have
> already happened, which some other layer has acted upon? It is not
> just LAG inside DSA which needs to unwind, it is all the actions which
> have been triggered so far.
>
> The initial design of switchdev was transactions. First there was a
> prepare call, where you validated the requested action is possible,
> and allocate resources needed, but don't actually do it. This prepare
> call is allowed to fail. Then there is a second call to actually do
> it, and that call is not allowed to fail. This structure avoids most
> of the complexity of the unwind, just free up some resources. If you
> never had to allocate the resources in the first place, better still.

OK I think I finally see what you are saying. Sorry it took me this
long. I do not mean to be difficult, I just want to understand.

How about this:

- Add a `lags_max` field to `struct dsa_switch` to let each driver
  define the maximum number supported by the hardware. By default this
  would be zero, which would mean that LAG offloading is not supported.

- In dsa_tree_setup we allocate a static array of the minimum supported
  number across the entire tree.

- When joining a new LAG, we ensure that a slot is available in
  NETDEV_PRECHANGEUPPER, avoiding the issue you are describing.

- In NETDEV_CHANGEUPPER, we actually mark it as busy and start using it.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-27 23:19                 ` Tobias Waldekranz
@ 2020-11-28  5:19                   ` Florian Fainelli
  2020-11-28 16:38                     ` Andrew Lunn
  0 siblings, 1 reply; 18+ messages in thread
From: Florian Fainelli @ 2020-11-28  5:19 UTC (permalink / raw)
  To: Tobias Waldekranz, Andrew Lunn
  Cc: davem, kuba, vivien.didelot, olteanv, j.vosburgh, vfalico, andy, netdev



On 11/27/2020 3:19 PM, Tobias Waldekranz wrote:
>> The initial design of switchdev was transactions. First there was a
>> prepare call, where you validated that the requested action was
>> possible and allocated the resources needed, but did not actually do
>> it. This prepare call was allowed to fail. Then there was a second
>> call to actually do it, and that call was not allowed to fail. This
>> structure avoids most of the complexity of the unwind: you just free
>> up some resources. If you never had to allocate the resources in the
>> first place, better still.
> 
> OK I think I finally see what you are saying. Sorry it took me this
> long. I do not mean to be difficult, I just want to understand.
> 
> How about this:
> 
> - Add a `lags_max` field to `struct dsa_switch` to let each driver
>   define the maximum number supported by the hardware. By default this
>   would be zero, which would mean that LAG offloading is not supported.
> 
> - In dsa_tree_setup we allocate a static array of the minimum supported
>   number across the entire tree.
> 
> - When joining a new LAG, we ensure that a slot is available in
>   NETDEV_PRECHANGEUPPER, avoiding the issue you are describing.
> 
> - In NETDEV_CHANGEUPPER, we actually mark it as busy and start using it.
> 

Sounds reasonable to me.
-- 
Florian


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-28  5:19                   ` Florian Fainelli
@ 2020-11-28 16:38                     ` Andrew Lunn
  2020-11-28 19:48                       ` Tobias Waldekranz
  0 siblings, 1 reply; 18+ messages in thread
From: Andrew Lunn @ 2020-11-28 16:38 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Tobias Waldekranz, davem, kuba, vivien.didelot, olteanv,
	j.vosburgh, vfalico, andy, netdev

> > OK I think I finally see what you are saying. Sorry it took me this
> > long. I do not mean to be difficult, I just want to understand.

Not a problem. This is a bit different from normal: the complexity of
the stack means you need to handle this differently from most drivers.
If you have done any deeply embedded work, e.g. on an RTOS, allocating
everything up front is normal; it eliminates a whole class of problems.

> > 
> > How about this:
> > 
> > - Add a `lags_max` field to `struct dsa_switch` to let each driver
> >   define the maximum number supported by the hardware.

This is all reasonable. I just wonder what that number is for the
mv88e6xxx case, especially for the D in DSA. I guess LAGs are global
in scope across a set of switches, so it does not matter if there is
one switch or lots of switches; the lags_max stays the same.

      Andrew


* Re: [PATCH net-next 2/4] net: dsa: Link aggregation support
  2020-11-28 16:38                     ` Andrew Lunn
@ 2020-11-28 19:48                       ` Tobias Waldekranz
  0 siblings, 0 replies; 18+ messages in thread
From: Tobias Waldekranz @ 2020-11-28 19:48 UTC (permalink / raw)
  To: Andrew Lunn, Florian Fainelli
  Cc: Tobias Waldekranz, davem, kuba, vivien.didelot, olteanv,
	j.vosburgh, vfalico, andy, netdev

On Sat Nov 28, 2020 at 6:38 PM CET, Andrew Lunn wrote:
> > > OK I think I finally see what you are saying. Sorry it took me this
> > > long. I do not mean to be difficult, I just want to understand.
>
> Not a problem. This is a bit different from normal: the complexity of
> the stack means you need to handle this differently from most drivers.
> If you have done any deeply embedded work, e.g. on an RTOS, allocating
> everything up front is normal; it eliminates a whole class of problems.

Yeah, I am well aware; Linux is pretty far from that kind of embedded,
though :)

The problem here, IMHO, is not really the allocation. Rather, we want
to shift as much work as possible from CHANGEUPPER to PRECHANGEUPPER
so that errors are signaled early; this was the part that was not
clicking for me.

And you cannot allocate anything in PRECHANGEUPPER, because there may
never be a corresponding CHANGEUPPER if another subscriber on the
chain throws an error. _That_ is what forces you to use the static
array.

> This is all reasonable. I just wonder what that number is for the
> mv88e6xxx case, especially for the D in DSA. I guess LAGs are global
> in scope across a set of switches, so it does not matter if there is
> one switch or lots of switches; the lags_max stays the same.

For everything up to Agate, the max is 16. Peridot (and I guess Topaz)
can potentially do 32, but mv88e6xxx never sets the "5-bit port" mode
AFAIK, so in practice the max is 16 across the board.

Yes, the LAG IDs are global: they must be configured on all switches,
even those that have no local member ports. So the pool will hold 16
entries on a mv88e6xxx tree.


end of thread, other threads:[~2020-11-28 22:19 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-19 14:45 [PATCH net-next 0/4] net: dsa: Link aggregation support Tobias Waldekranz
2020-11-19 14:45 ` [PATCH net-next 1/4] net: bonding: Notify ports about their initial state Tobias Waldekranz
2020-11-19 18:18   ` Jay Vosburgh
2020-11-19 21:20     ` Tobias Waldekranz
2020-11-19 14:45 ` [PATCH net-next 2/4] net: dsa: Link aggregation support Tobias Waldekranz
2020-11-20  0:30   ` Andrew Lunn
2020-11-20  2:43     ` Florian Fainelli
2020-11-20 13:30       ` Andrew Lunn
2020-11-26 22:36         ` Tobias Waldekranz
2020-11-26 22:57           ` Andrew Lunn
2020-11-27  9:19             ` Tobias Waldekranz
2020-11-27 16:28               ` Andrew Lunn
2020-11-27 23:19                 ` Tobias Waldekranz
2020-11-28  5:19                   ` Florian Fainelli
2020-11-28 16:38                     ` Andrew Lunn
2020-11-28 19:48                       ` Tobias Waldekranz
2020-11-19 14:45 ` [PATCH net-next 3/4] net: dsa: mv88e6xxx: " Tobias Waldekranz
2020-11-19 14:45 ` [PATCH net-next 4/4] net: dsa: tag_dsa: Support reception of packets from LAG devices Tobias Waldekranz
