* [PATCH RFC net-next 0/3] Multi-CPU DSA support
@ 2021-04-10 13:34 Ansuel Smith
  2021-04-10 13:34 ` [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports Ansuel Smith
                   ` (4 more replies)
  0 siblings, 5 replies; 68+ messages in thread
From: Ansuel Smith @ 2021-04-10 13:34 UTC (permalink / raw)
  To: netdev
  Cc: Ansuel Smith, David S. Miller, Jakub Kicinski, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Marek Behún,
	Roopa Prabhu, Di Zhu, Francis Laniel, linux-kernel

Hi,
this is a respin of Marek's series, in the hope that this time we can
finally make some progress with DSA support for multiple CPU ports.

This implementation is similar to Marek's series, but with some tweaks.
It adds support for multiple CPU ports, but leaves to the driver the
decision about which logic to use when assigning a CPU port to the
various user ports. The driver can also express no preference, in which
case the CPU port is chosen in a round-robin fashion.

(copying Marek's cover letter)
Patch 2 adds a new operation to the net device operations structure.
Currently we use the iflink property of a net device to report to which
CPU port a given switch port is connected. The ip link utility from
iproute2 reports this as "lan1@eth0". We add a new net device operation,
ndo_set_iflink, which can be used to set this property. We call this
function from the netlink handlers.
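
With the companion iproute2 patch at the end of this series, userspace
can then change this property with something like the following
(interface names are only examples):

  ip link set lan1 link eth1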

Patch 3 implements this new ndo_set_iflink operation for DSA slave
devices. Thus userspace can request a change of the CPU port of a given
port. The MAC address is changed accordingly, following the new CPU
port's MAC address.

There are at least 2 drivers that would benefit from this and that
currently work using only half of their capabilities (qca8k and
mv88e6xxx). This series has been tested with qca8k.

Ansuel Smith (1):
  net: dsa: allow for multiple CPU ports

Marek Behún (2):
  net: add ndo for setting the iflink property
  net: dsa: implement ndo_set_iflink for changing a port's CPU port

 include/linux/netdevice.h |  5 +++
 include/net/dsa.h         |  7 +++++
 net/core/dev.c            | 15 +++++++++
 net/core/rtnetlink.c      |  7 +++++
 net/dsa/dsa2.c            | 66 ++++++++++++++++++++++++++++-----------
 net/dsa/slave.c           | 38 ++++++++++++++++++++
 6 files changed, 119 insertions(+), 19 deletions(-)

-- 
2.30.2



* [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports
  2021-04-10 13:34 [PATCH RFC net-next 0/3] Multi-CPU DSA support Ansuel Smith
@ 2021-04-10 13:34 ` Ansuel Smith
  2021-04-12  3:35   ` DENG Qingfang
  2021-04-10 13:34 ` [PATCH RFC net-next 2/3] net: add ndo for setting the iflink property Ansuel Smith
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 68+ messages in thread
From: Ansuel Smith @ 2021-04-10 13:34 UTC (permalink / raw)
  To: netdev
  Cc: Ansuel Smith, Marek Behún, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Roopa Prabhu, Di Zhu,
	Nikolay Aleksandrov, Francis Laniel, linux-kernel

Allow for multiple CPU ports in a DSA switch tree. By default the first
CPU port is assigned, mimicking the original assignment logic. A DSA
driver can define a function to declare a preferred CPU port for a
given port. If the function doesn't return a preferred port, the CPU
port is assigned in a round-robin fashion, starting from the last
assigned CPU port.
Examples:
There are two CPU ports but no port_get_preferred_cpu is provided:
- The old logic is used. Every port is assigned to the first CPU port.
There are two CPU ports and port_get_preferred_cpu returns -1:
- The port is assigned in a round-robin fashion, since no preference is
  provided.
There are two CPU ports and port_get_preferred_cpu defines only one
port and leaves the rest with -1: (wan port with CPU1 and the rest no
preference)
  lan1 <-> eth0
  lan2 <-> eth1
  lan3 <-> eth0
  lan4 <-> eth1
  wan  <-> eth1
There are two CPU ports and port_get_preferred_cpu assigns a preference
for every port: (wan port with CPU1, everything else CPU0)
  lan1 <-> eth0
  lan2 <-> eth0
  lan3 <-> eth0
  lan4 <-> eth0
  wan  <-> eth1
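
A minimal sketch of what a driver implementation of the new hook could
look like (the port numbers here are hypothetical):

	static int foo_port_get_preferred_cpu(struct dsa_switch *ds, int port)
	{
		/* Steer the wan port (assumed to be port 5 here) to the CPU
		 * port at index 6; return -1 for every other port so that
		 * they are assigned round-robin.
		 */
		if (port == 5)
			return 6;

		return -1;
	}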

Signed-off-by: Marek Behún <marek.behun@nic.cz>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
---
 include/net/dsa.h |  7 +++++
 net/dsa/dsa2.c    | 66 +++++++++++++++++++++++++++++++++--------------
 2 files changed, 54 insertions(+), 19 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index d71b1acd9c3e..3d3e936bda4c 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -523,6 +523,13 @@ struct dsa_switch_ops {
 	void	(*teardown)(struct dsa_switch *ds);
 	u32	(*get_phy_flags)(struct dsa_switch *ds, int port);
 
+	/*
+	 * Get the preferred CPU port for the provided port.
+	 * Return -1 if there is no preferred CPU port; round-robin logic
+	 * is then used to choose the CPU port to link to the provided port.
+	 */
+	int	(*port_get_preferred_cpu)(struct dsa_switch *ds, int port);
+
 	/*
 	 * Access to the switch's PHY registers.
 	 */
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index d142eb2b288b..d3b92499f006 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -315,10 +315,17 @@ static bool dsa_tree_setup_routing_table(struct dsa_switch_tree *dst)
 	return complete;
 }
 
-static struct dsa_port *dsa_tree_find_first_cpu(struct dsa_switch_tree *dst)
+static struct dsa_port *dsa_tree_find_next_cpu_port(struct dsa_switch_tree *dst,
+						    struct dsa_port *cpu_dp)
 {
-	struct dsa_port *dp;
+	struct dsa_port *dp = cpu_dp;
+
+	if (dp)
+		list_for_each_entry_continue(dp, &dst->ports, list)
+			if (dsa_port_is_cpu(dp))
+				return dp;
 
+	/* If another CPU port can't be found, try from the start */
 	list_for_each_entry(dp, &dst->ports, list)
 		if (dsa_port_is_cpu(dp))
 			return dp;
@@ -326,25 +333,40 @@ static struct dsa_port *dsa_tree_find_first_cpu(struct dsa_switch_tree *dst)
 	return NULL;
 }
 
-static int dsa_tree_setup_default_cpu(struct dsa_switch_tree *dst)
+static int dsa_tree_setup_default_cpus(struct dsa_switch_tree *dst)
 {
-	struct dsa_port *cpu_dp, *dp;
+	struct dsa_port *dp, *cpu_dp = NULL, *first_cpu_dp;
+	int cpu_port;
 
-	cpu_dp = dsa_tree_find_first_cpu(dst);
-	if (!cpu_dp) {
+	first_cpu_dp = dsa_tree_find_next_cpu_port(dst, NULL);
+	if (!first_cpu_dp) {
 		pr_err("DSA: tree %d has no CPU port\n", dst->index);
 		return -EINVAL;
 	}
 
-	/* Assign the default CPU port to all ports of the fabric */
 	list_for_each_entry(dp, &dst->ports, list)
-		if (dsa_port_is_user(dp) || dsa_port_is_dsa(dp))
-			dp->cpu_dp = cpu_dp;
+		if (dsa_port_is_user(dp) || dsa_port_is_dsa(dp)) {
+			/* Check if the driver advises a CPU port for the current port */
+			if (dp->ds->ops->port_get_preferred_cpu) {
+				cpu_port = dp->ds->ops->port_get_preferred_cpu(dp->ds, dp->index);
+
+				/* If the driver doesn't have a preferred port,
+				 * assign one in a round-robin fashion.
+				 */
+				if (cpu_port < 0)
+					cpu_dp = dsa_tree_find_next_cpu_port(dst, cpu_dp);
+				else
+					cpu_dp = dsa_to_port(dp->ds, cpu_port);
+			}
+
+			/* If a CPU port is not chosen, assign the first one by default */
+			dp->cpu_dp = cpu_dp ? cpu_dp : first_cpu_dp;
+		}
 
 	return 0;
 }
 
-static void dsa_tree_teardown_default_cpu(struct dsa_switch_tree *dst)
+static void dsa_tree_teardown_default_cpus(struct dsa_switch_tree *dst)
 {
 	struct dsa_port *dp;
 
@@ -822,7 +844,7 @@ static void dsa_tree_teardown_switches(struct dsa_switch_tree *dst)
 		dsa_switch_teardown(dp->ds);
 }
 
-static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
+static int dsa_tree_setup_masters(struct dsa_switch_tree *dst)
 {
 	struct dsa_port *dp;
 	int err;
@@ -831,14 +853,20 @@ static int dsa_tree_setup_master(struct dsa_switch_tree *dst)
 		if (dsa_port_is_cpu(dp)) {
 			err = dsa_master_setup(dp->master, dp);
 			if (err)
-				return err;
+				goto teardown;
 		}
 	}
 
 	return 0;
+teardown:
+	list_for_each_entry_continue_reverse(dp, &dst->ports, list)
+		if (dsa_port_is_cpu(dp))
+			dsa_master_teardown(dp->master);
+
+	return err;
 }
 
-static void dsa_tree_teardown_master(struct dsa_switch_tree *dst)
+static void dsa_tree_teardown_masters(struct dsa_switch_tree *dst)
 {
 	struct dsa_port *dp;
 
@@ -888,7 +916,7 @@ static int dsa_tree_setup(struct dsa_switch_tree *dst)
 	if (!complete)
 		return 0;
 
-	err = dsa_tree_setup_default_cpu(dst);
+	err = dsa_tree_setup_default_cpus(dst);
 	if (err)
 		return err;
 
@@ -896,7 +924,7 @@ static int dsa_tree_setup(struct dsa_switch_tree *dst)
 	if (err)
 		goto teardown_default_cpu;
 
-	err = dsa_tree_setup_master(dst);
+	err = dsa_tree_setup_masters(dst);
 	if (err)
 		goto teardown_switches;
 
@@ -911,11 +939,11 @@ static int dsa_tree_setup(struct dsa_switch_tree *dst)
 	return 0;
 
 teardown_master:
-	dsa_tree_teardown_master(dst);
+	dsa_tree_teardown_masters(dst);
 teardown_switches:
 	dsa_tree_teardown_switches(dst);
 teardown_default_cpu:
-	dsa_tree_teardown_default_cpu(dst);
+	dsa_tree_teardown_default_cpus(dst);
 
 	return err;
 }
@@ -929,11 +957,11 @@ static void dsa_tree_teardown(struct dsa_switch_tree *dst)
 
 	dsa_tree_teardown_lags(dst);
 
-	dsa_tree_teardown_master(dst);
+	dsa_tree_teardown_masters(dst);
 
 	dsa_tree_teardown_switches(dst);
 
-	dsa_tree_teardown_default_cpu(dst);
+	dsa_tree_teardown_default_cpus(dst);
 
 	list_for_each_entry_safe(dl, next, &dst->rtable, list) {
 		list_del(&dl->list);
-- 
2.30.2



* [PATCH RFC net-next 2/3] net: add ndo for setting the iflink property
  2021-04-10 13:34 [PATCH RFC net-next 0/3] Multi-CPU DSA support Ansuel Smith
  2021-04-10 13:34 ` [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports Ansuel Smith
@ 2021-04-10 13:34 ` Ansuel Smith
  2021-04-10 13:34 ` [PATCH RFC net-next 3/3] net: dsa: implement ndo_set_iflink for changing a port's CPU port Ansuel Smith
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 68+ messages in thread
From: Ansuel Smith @ 2021-04-10 13:34 UTC (permalink / raw)
  To: netdev
  Cc: Marek Behún, David S. Miller, Jakub Kicinski, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, Nikolay Aleksandrov, Roopa Prabhu, Di Zhu,
	Francis Laniel, linux-kernel

From: Marek Behún <marek.behun@nic.cz>

In DSA the iflink value is used to report to which CPU port a given
switch port is connected. Since we want to support multi-CPU DSA, we
want the user to be able to change this value.

Add an ndo_set_iflink method to the net device operations structure as
a counterpart to ndo_get_iflink. Also create dev_set_iflink and call it
from the netlink code, so that userspace can change the iflink value.
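
With this in place, a request such as "ip link set lan1 link eth1"
carries an IFLA_LINK attribute that now reaches the driver roughly
through this call chain (simplified):

  do_setlink() -> dev_set_iflink() -> dev->netdev_ops->ndo_set_iflink()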

Signed-off-by: Marek Behún <marek.behun@nic.cz>
---
 include/linux/netdevice.h |  5 +++++
 net/core/dev.c            | 15 +++++++++++++++
 net/core/rtnetlink.c      |  7 +++++++
 3 files changed, 27 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 87a5d186faff..76054182c288 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1225,6 +1225,8 @@ struct netdev_net_notifier {
  *	TX queue.
  * int (*ndo_get_iflink)(const struct net_device *dev);
  *	Called to get the iflink value of this device.
+ * int (*ndo_set_iflink)(struct net_device *dev, int iflink);
+ *	Called to set the iflink value of this device.
  * void (*ndo_change_proto_down)(struct net_device *dev,
  *				 bool proto_down);
  *	This function is used to pass protocol port error state information
@@ -1456,6 +1458,8 @@ struct net_device_ops {
 						      int queue_index,
 						      u32 maxrate);
 	int			(*ndo_get_iflink)(const struct net_device *dev);
+	int			(*ndo_set_iflink)(struct net_device *dev,
+						  int iflink);
 	int			(*ndo_change_proto_down)(struct net_device *dev,
 							 bool proto_down);
 	int			(*ndo_fill_metadata_dst)(struct net_device *dev,
@@ -2845,6 +2849,7 @@ void dev_add_offload(struct packet_offload *po);
 void dev_remove_offload(struct packet_offload *po);
 
 int dev_get_iflink(const struct net_device *dev);
+int dev_set_iflink(struct net_device *dev, int iflink);
 int dev_fill_metadata_dst(struct net_device *dev, struct sk_buff *skb);
 struct net_device *__dev_get_by_flags(struct net *net, unsigned short flags,
 				      unsigned short mask);
diff --git a/net/core/dev.c b/net/core/dev.c
index 0f72ff5d34ba..9e5ddcf6d401 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -822,6 +822,21 @@ int dev_get_iflink(const struct net_device *dev)
 }
 EXPORT_SYMBOL(dev_get_iflink);
 
+/**
+ *	dev_set_iflink - set 'iflink' value of an interface
+ *	@dev: target interface
+ *	@iflink: new value
+ *
+ *	Change the interface to which this interface is linked.
+ */
+int dev_set_iflink(struct net_device *dev, int iflink)
+{
+	if (dev->netdev_ops && dev->netdev_ops->ndo_set_iflink)
+		return dev->netdev_ops->ndo_set_iflink(dev, iflink);
+
+	return -EOPNOTSUPP;
+}
+
 /**
  *	dev_fill_metadata_dst - Retrieve tunnel egress information.
  *	@dev: targeted interface
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 1bdcb33fb561..629b7685f942 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2718,6 +2718,13 @@ static int do_setlink(const struct sk_buff *skb,
 		status |= DO_SETLINK_MODIFIED;
 	}
 
+	if (tb[IFLA_LINK]) {
+		err = dev_set_iflink(dev, nla_get_u32(tb[IFLA_LINK]));
+		if (err)
+			goto errout;
+		status |= DO_SETLINK_MODIFIED;
+	}
+
 	if (tb[IFLA_CARRIER]) {
 		err = dev_change_carrier(dev, nla_get_u8(tb[IFLA_CARRIER]));
 		if (err)
-- 
2.30.2



* [PATCH RFC net-next 3/3] net: dsa: implement ndo_set_iflink for changing a port's CPU port
  2021-04-10 13:34 [PATCH RFC net-next 0/3] Multi-CPU DSA support Ansuel Smith
  2021-04-10 13:34 ` [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports Ansuel Smith
  2021-04-10 13:34 ` [PATCH RFC net-next 2/3] net: add ndo for setting the iflink property Ansuel Smith
@ 2021-04-10 13:34 ` Ansuel Smith
  2021-04-10 13:34 ` [PATCH RFC iproute2-next] iplink: allow changing the iflink value Ansuel Smith
  2021-04-11 18:01 ` [PATCH RFC net-next 0/3] Multi-CPU DSA support Marek Behun
  4 siblings, 0 replies; 68+ messages in thread
From: Ansuel Smith @ 2021-04-10 13:34 UTC (permalink / raw)
  To: netdev
  Cc: Ansuel Smith, Marek Behún, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, Francis Laniel, zhang kai, Di Zhu,
	Roopa Prabhu, linux-kernel

Implement ndo_set_iflink for DSA slave devices. In a multi-CPU port
setup this can be used to change the CPU destination port to which a
given user port is connected. On a CPU port change, the MAC address is
updated to follow the new CPU port, unless it was set to a custom value.
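
For example, for a user port that "ip link" currently reports as
lan1@eth0, the CPU port can be changed with:

  ip link set lan1 link eth1

after which the port is reported as lan1@eth1.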

Signed-off-by: Marek Behún <marek.behun@nic.cz>
Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>
---
 net/dsa/slave.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 992fcab4b552..c68dbd3ab21a 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -63,6 +63,43 @@ static int dsa_slave_get_iflink(const struct net_device *dev)
 	return dsa_slave_to_master(dev)->ifindex;
 }
 
+static int dsa_slave_set_iflink(struct net_device *dev, int iflink)
+{
+	struct net_device *master = dsa_slave_to_master(dev);
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_slave_priv *p = netdev_priv(dev);
+	struct net_device *cpu_dev;
+	struct dsa_port *cpu_dp;
+
+	cpu_dev = dev_get_by_index(dev_net(dev), iflink);
+	if (!cpu_dev)
+		return -ENODEV;
+
+	cpu_dp = cpu_dev->dsa_ptr;
+	if (!cpu_dp) {
+		dev_put(cpu_dev);
+		return -EINVAL;
+	}
+
+	/* new CPU port has to be on the same switch tree */
+	if (cpu_dp->dst != dp->cpu_dp->dst) {
+		dev_put(cpu_dev);
+		return -EINVAL;
+	}
+
+	if (ether_addr_equal(dev->dev_addr, master->dev_addr))
+		eth_hw_addr_inherit(dev, cpu_dev);
+
+	/* should this be done atomically? */
+	dp->cpu_dp = cpu_dp;
+	p->xmit = cpu_dp->tag_ops->xmit;
+
+	/* drop the reference taken by dev_get_by_index() */
+	dev_put(cpu_dev);
+
+	return 0;
+}
+
 static int dsa_slave_open(struct net_device *dev)
 {
 	struct net_device *master = dsa_slave_to_master(dev);
@@ -1666,6 +1703,7 @@ static const struct net_device_ops dsa_slave_netdev_ops = {
 	.ndo_fdb_dump		= dsa_slave_fdb_dump,
 	.ndo_do_ioctl		= dsa_slave_ioctl,
 	.ndo_get_iflink		= dsa_slave_get_iflink,
+	.ndo_set_iflink		= dsa_slave_set_iflink,
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_netpoll_setup	= dsa_slave_netpoll_setup,
 	.ndo_netpoll_cleanup	= dsa_slave_netpoll_cleanup,
-- 
2.30.2



* [PATCH RFC iproute2-next] iplink: allow changing the iflink value
  2021-04-10 13:34 [PATCH RFC net-next 0/3] Multi-CPU DSA support Ansuel Smith
                   ` (2 preceding siblings ...)
  2021-04-10 13:34 ` [PATCH RFC net-next 3/3] net: dsa: implement ndo_set_iflink for changing a port's CPU port Ansuel Smith
@ 2021-04-10 13:34 ` Ansuel Smith
  2021-04-11 17:04   ` Stephen Hemminger
  2021-04-11 18:01 ` [PATCH RFC net-next 0/3] Multi-CPU DSA support Marek Behun
  4 siblings, 1 reply; 68+ messages in thread
From: Ansuel Smith @ 2021-04-10 13:34 UTC (permalink / raw)
  To: netdev
  Cc: Marek Behún, David Ahern, Stephen Hemminger,
	David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Vladimir Oltean, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eric Dumazet, Wei Wang,
	Cong Wang, Taehee Yoo, Björn Töpel, Di Zhu,
	Weilong Chen, Nikolay Aleksandrov, Colin Ian King, Roopa Prabhu,
	Francis Laniel, linux-kernel

Allow changing the interface to which a given interface is linked.
This is useful in the case of multi-CPU port DSA, for changing the CPU
port of a given user port.

Signed-off-by: Marek Behún <marek.behun@nic.cz>
Cc: David Ahern <dsahern@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
---
 ip/iplink.c           | 16 +++++-----------
 man/man8/ip-link.8.in |  7 +++++++
 2 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/ip/iplink.c b/ip/iplink.c
index 212a0885..d52c0aaf 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -579,7 +579,6 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 {
 	char *name = NULL;
 	char *dev = NULL;
-	char *link = NULL;
 	int ret, len;
 	char abuf[32];
 	int qlen = -1;
@@ -590,6 +589,7 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 	int numrxqueues = -1;
 	int link_netnsid = -1;
 	int index = 0;
+	int link = -1;
 	int group = -1;
 	int addr_len = 0;
 
@@ -620,7 +620,10 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 				invarg("Invalid \"index\" value", *argv);
 		} else if (matches(*argv, "link") == 0) {
 			NEXT_ARG();
-			link = *argv;
+			link = ll_name_to_index(*argv);
+			if (!link)
+				return nodev(*argv);
+			addattr32(&req->n, sizeof(*req), IFLA_LINK, link);
 		} else if (matches(*argv, "address") == 0) {
 			NEXT_ARG();
 			addr_len = ll_addr_a2n(abuf, sizeof(abuf), *argv);
@@ -1004,15 +1007,6 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, char **type)
 			exit(-1);
 		}
 
-		if (link) {
-			int ifindex;
-
-			ifindex = ll_name_to_index(link);
-			if (!ifindex)
-				return nodev(link);
-			addattr32(&req->n, sizeof(*req), IFLA_LINK, ifindex);
-		}
-
 		req->i.ifi_index = index;
 	}
 
diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index a8ae72d2..800aed05 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -149,6 +149,9 @@ ip-link \- network device configuration
 .br
 .RB "[ " nomaster " ]"
 .br
+.RB "[ " link
+.IR DEVICE " ]"
+.br
 .RB "[ " vrf
 .IR NAME " ]"
 .br
@@ -2131,6 +2134,10 @@ set master device of the device (enslave device).
 .BI nomaster
 unset master device of the device (release device).
 
+.TP
+.BI link " DEVICE"
+set the device to which this device is linked.
+
 .TP
 .BI addrgenmode " eui64|none|stable_secret|random"
 set the IPv6 address generation mode
-- 
2.21.0




* Re: [PATCH RFC iproute2-next] iplink: allow changing the iflink value
  2021-04-10 13:34 ` [PATCH RFC iproute2-next] iplink: allow changing the iflink value Ansuel Smith
@ 2021-04-11 17:04   ` Stephen Hemminger
  2021-04-11 17:09     ` Vladimir Oltean
  0 siblings, 1 reply; 68+ messages in thread
From: Stephen Hemminger @ 2021-04-11 17:04 UTC (permalink / raw)
  To: Ansuel Smith
  Cc: netdev, Marek Behún, David Ahern, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Vladimir Oltean, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, Di Zhu, Weilong Chen, Nikolay Aleksandrov,
	Colin Ian King, Roopa Prabhu, Francis Laniel, linux-kernel

On Sat, 10 Apr 2021 15:34:50 +0200
Ansuel Smith <ansuelsmth@gmail.com> wrote:

> Allow changing the interface to which a given interface is linked.
> This is useful in the case of multi-CPU port DSA, for changing the CPU
> port of a given user port.
> 
> Signed-off-by: Marek Behún <marek.behun@nic.cz>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Stephen Hemminger <stephen@networkplumber.org>

This may work for DSA but it won't work for all the device types vlan/macsec/... that
now use the link attribute.  It looks like the change link handling for those
device types just ignores the link attribute (maybe ok). But before supporting this
as an API, it would be better if all the other drivers that use IFLA_LINK
had error checks in their change link handling.

Please add error checks in kernel first.
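
For reference, the kind of error check being asked for would look
roughly like this in a driver's changelink handler (just a sketch, not
any driver's current code):

	static int foo_changelink(struct net_device *dev, struct nlattr *tb[],
				  struct nlattr *data[],
				  struct netlink_ext_ack *extack)
	{
		/* reject attempts to change the lower device after creation */
		if (tb[IFLA_LINK]) {
			NL_SET_ERR_MSG(extack, "link device cannot be changed");
			return -EOPNOTSUPP;
		}

		/* ... handle the attributes that may change ... */
		return 0;
	}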




* Re: [PATCH RFC iproute2-next] iplink: allow changing the iflink value
  2021-04-11 17:04   ` Stephen Hemminger
@ 2021-04-11 17:09     ` Vladimir Oltean
  0 siblings, 0 replies; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-11 17:09 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Ansuel Smith, netdev, Marek Behún, David Ahern,
	David S. Miller, Jakub Kicinski, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, Di Zhu, Weilong Chen, Nikolay Aleksandrov,
	Colin Ian King, Roopa Prabhu, Francis Laniel, linux-kernel

On Sun, Apr 11, 2021 at 10:04:11AM -0700, Stephen Hemminger wrote:
> On Sat, 10 Apr 2021 15:34:50 +0200
> Ansuel Smith <ansuelsmth@gmail.com> wrote:
> 
> > Allow changing the interface to which a given interface is linked.
> > This is useful in the case of multi-CPU port DSA, for changing the CPU
> > port of a given user port.
> > 
> > Signed-off-by: Marek Behún <marek.behun@nic.cz>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Stephen Hemminger <stephen@networkplumber.org>
> 
> This may work for DSA but it won't work for all the device types vlan/macsec/... that
> now use the link attribute.  It looks like the change link handling for those
> device types just ignores the link attribute (maybe ok). But before supporting this
> as an API, it would be better if all the other drivers that use IFLA_LINK
> had error checks in their change link handling.
> 
> Please add error checks in kernel first.

Would it be better to expose this as a netlink attribute specific to
DSA, instead of iflink which as you point out has uses for other virtual
interfaces like veth, and the semantics there are not quite the same?


* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-10 13:34 [PATCH RFC net-next 0/3] Multi-CPU DSA support Ansuel Smith
                   ` (3 preceding siblings ...)
  2021-04-10 13:34 ` [PATCH RFC iproute2-next] iplink: allow changing the iflink value Ansuel Smith
@ 2021-04-11 18:01 ` Marek Behun
  2021-04-11 18:08   ` Ansuel Smith
                     ` (2 more replies)
  4 siblings, 3 replies; 68+ messages in thread
From: Marek Behun @ 2021-04-11 18:01 UTC (permalink / raw)
  To: Ansuel Smith
  Cc: netdev, David S. Miller, Jakub Kicinski, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Sat, 10 Apr 2021 15:34:46 +0200
Ansuel Smith <ansuelsmth@gmail.com> wrote:

> Hi,
> this is a respin of Marek's series, in the hope that this time we can
> finally make some progress with DSA support for multiple CPU ports.
> 
> This implementation is similar to Marek's series, but with some tweaks.
> It adds support for multiple CPU ports, but leaves to the driver the
> decision about which logic to use when assigning a CPU port to the
> various user ports. The driver can also express no preference, in which
> case the CPU port is chosen in a round-robin fashion.

In the last couple of months I have been giving some thought to this
problem, and came up with one important thing: if there are multiple
upstream ports, it would make a lot of sense to dynamically reallocate
them to each user port, based on which user port is actually used, and
at what speed.

For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
ports support at most 1 Gbps. Round-robin would assign:
  CPU port 0 - Port 0
  CPU port 1 - Port 1
  CPU port 0 - Port 2
  CPU port 1 - Port 3
  CPU port 0 - Port 4

Now suppose that the user plugs ethernet cables only into ports 0 and 2,
with 1, 3 and 4 free:
  CPU port 0 - Port 0 (plugged)
  CPU port 1 - Port 1 (free)
  CPU port 0 - Port 2 (plugged)
  CPU port 1 - Port 3 (free)
  CPU port 0 - Port 4 (free)

We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
CPU, and the second CPU port is not used at all.

A mechanism for automatic reassignment of CPU ports would be ideal here.

What do you guys think?

Marek


* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 18:01 ` [PATCH RFC net-next 0/3] Multi-CPU DSA support Marek Behun
@ 2021-04-11 18:08   ` Ansuel Smith
  2021-04-11 18:39   ` Andrew Lunn
  2021-04-11 18:50   ` Vladimir Oltean
  2 siblings, 0 replies; 68+ messages in thread
From: Ansuel Smith @ 2021-04-11 18:08 UTC (permalink / raw)
  To: Marek Behun
  Cc: netdev, David S. Miller, Jakub Kicinski, Andrew Lunn,
	Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> On Sat, 10 Apr 2021 15:34:46 +0200
> Ansuel Smith <ansuelsmth@gmail.com> wrote:
> 
> > Hi,
> > this is a respin of Marek's series, in the hope that this time we can
> > finally make some progress with DSA support for multiple CPU ports.
> > 
> > This implementation is similar to Marek's series, but with some tweaks.
> > It adds support for multiple CPU ports, but leaves to the driver the
> > decision about which logic to use when assigning a CPU port to the
> > various user ports. The driver can also express no preference, in which
> > case the CPU port is chosen in a round-robin fashion.
> 
> In the last couple of months I have been giving some thought to this
> problem, and came up with one important thing: if there are multiple
> upstream ports, it would make a lot of sense to dynamically reallocate
> them to each user port, based on which user port is actually used, and
> at what speed.
> 
> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> ports support at most 1 Gbps. Round-robin would assign:
>   CPU port 0 - Port 0
>   CPU port 1 - Port 1
>   CPU port 0 - Port 2
>   CPU port 1 - Port 3
>   CPU port 0 - Port 4
> 
> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> with 1, 3 and 4 free:
>   CPU port 0 - Port 0 (plugged)
>   CPU port 1 - Port 1 (free)
>   CPU port 0 - Port 2 (plugged)
>   CPU port 1 - Port 3 (free)
>   CPU port 0 - Port 4 (free)
> 
> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> CPU, and the second CPU port is not used at all.
> 
> A mechanism for automatic reassignment of CPU ports would be ideal here.
> 
> What do you guys think?
> 
> Marek

A function called on every port change could check the connected ports
and reassign the CPU ports based on that. The fact is that most of the
time devices have at least 2 Ethernet ports connected, one for the wan
traffic and another for some LAN device, so some type of preference from
the switch driver is needed, also to work around problematic switches
that have CPU ports with different supported features. A good idea, but
it could be overkill, since so far we have seen devices with at most 2
CPU ports.


* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 18:01 ` [PATCH RFC net-next 0/3] Multi-CPU DSA support Marek Behun
  2021-04-11 18:08   ` Ansuel Smith
@ 2021-04-11 18:39   ` Andrew Lunn
  2021-04-12  2:07     ` Florian Fainelli
  2021-04-12  4:53     ` Ansuel Smith
  2021-04-11 18:50   ` Vladimir Oltean
  2 siblings, 2 replies; 68+ messages in thread
From: Andrew Lunn @ 2021-04-11 18:39 UTC (permalink / raw)
  To: Marek Behun
  Cc: Ansuel Smith, netdev, David S. Miller, Jakub Kicinski,
	Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> On Sat, 10 Apr 2021 15:34:46 +0200
> Ansuel Smith <ansuelsmth@gmail.com> wrote:
> 
> > Hi,
> > this is a respin of Marek's series, in the hope that this time we can
> > finally make some progress with DSA support for multiple CPU ports.
> > 
> > This implementation is similar to Marek's series, but with some tweaks.
> > It adds support for multiple CPU ports, but leaves to the driver the
> > decision about which logic to use when assigning a CPU port to the
> > various user ports. The driver can also express no preference, in which
> > case the CPU port is chosen in a round-robin fashion.
> 
> In the last couple of months I have been giving some thought to this
> problem, and came up with one important thing: if there are multiple
> upstream ports, it would make a lot of sense to dynamically reallocate
> them to each user port, based on which user port is actually used, and
> at what speed.
> 
> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> ports support at most 1 Gbps. Round-robin would assign:
>   CPU port 0 - Port 0
>   CPU port 1 - Port 1
>   CPU port 0 - Port 2
>   CPU port 1 - Port 3
>   CPU port 0 - Port 4
> 
> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> with 1, 3 and 4 free:
>   CPU port 0 - Port 0 (plugged)
>   CPU port 1 - Port 1 (free)
>   CPU port 0 - Port 2 (plugged)
>   CPU port 1 - Port 3 (free)
>   CPU port 0 - Port 4 (free)
> 
> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> CPU, and the second CPU port is not used at all.
> 
> A mechanism for automatic reassignment of CPU ports would be ideal here.

One thing you need to watch out for here is source MAC addresses. I've
not looked at the details, so this is more a heads up, it needs to be
thought about.

DSA slaves get their MAC address from the master interface. For a
single CPU port, all the slaves have the same MAC address. What
happens when you have multiple CPU ports? Does the slave interface get
the MAC address from its CPU port? What happens when a slave moves
from one CPU interface to another CPU interface? Does its MAC address
change? ARP is going to be unhappy for a while. Also, how is the
switch deciding on which CPU port to use? Some switches are probably
learning the MAC address being used by the interface and doing
forwarding based on that. So you might need unique slave MAC
addresses, and when a slave moves, it takes its MAC address with it,
and you hope the switch learns about the move. But consider trapped
frames as opposed to forwarded frames. So BPDU, IGMP, etc. Generally,
you only have the choice to send such trapped frames to one CPU
port. So potentially, such frames are going to ingress on the wrong
port. Does this matter? What about multicast? How do you control what
port that ingresses on? What about RX filters on the master
interfaces? Could it be we added it to the wrong master?

For this series to make progress, we need to know what has been
tested, and if all the more complex functionality works, not just
basic pings.

      Andrew


* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 18:01 ` [PATCH RFC net-next 0/3] Multi-CPU DSA support Marek Behun
  2021-04-11 18:08   ` Ansuel Smith
  2021-04-11 18:39   ` Andrew Lunn
@ 2021-04-11 18:50   ` Vladimir Oltean
  2021-04-11 23:53     ` Vladimir Oltean
                       ` (3 more replies)
  2 siblings, 4 replies; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-11 18:50 UTC (permalink / raw)
  To: Marek Behun
  Cc: Ansuel Smith, netdev, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> On Sat, 10 Apr 2021 15:34:46 +0200
> Ansuel Smith <ansuelsmth@gmail.com> wrote:
> 
> > Hi,
> > this is a respin of Marek's series, in the hope that this time we can
> > finally make some progress with DSA support for multiple CPU ports.
> > 
> > This implementation is similar to Marek's series, but with some tweaks.
> > It adds support for multiple CPU ports, but leaves to the driver the
> > decision about which logic to use when assigning a CPU port to the
> > various user ports. The driver can also express no preference, in which
> > case the CPU port is chosen in a round-robin fashion.
> 
> In the last couple of months I have been giving some thought to this
> problem, and came up with one important thing: if there are multiple
> upstream ports, it would make a lot of sense to dynamically reallocate
> them to each user port, based on which user port is actually used, and
> at what speed.
> 
> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> ports support at most 1 Gbps. Round-robin would assign:
>   CPU port 0 - Port 0
>   CPU port 1 - Port 1
>   CPU port 0 - Port 2
>   CPU port 1 - Port 3
>   CPU port 0 - Port 4
> 
> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> with 1, 3 and 4 free:
>   CPU port 0 - Port 0 (plugged)
>   CPU port 1 - Port 1 (free)
>   CPU port 0 - Port 2 (plugged)
>   CPU port 1 - Port 3 (free)
>   CPU port 0 - Port 4 (free)
> 
> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> CPU, and the second CPU port is not used at all.
> 
> A mechanism for automatic reassignment of CPU ports would be ideal here.
> 
> What do you guys think?

The reason why I don't think this is such a great idea is because the
CPU port assignment is a major reconfiguration step which should at the
very least be done while the network is down, to avoid races with the
data path (something which this series does not appear to handle).
And if you allow the static user-port-to-CPU-port assignment to change
every time a link goes up/down, I don't think you really want to force
the network down through the entire switch basically.

So I'd be tempted to say 'tough luck' if all your ports are not up, and
the ones that are are assigned statically to the same CPU port. It's a
compromise between flexibility and simplicity, and I would go for
simplicity here. That's the most you can achieve with static assignment,
just put the CPU ports in a LAG if you want better dynamic load balancing
(for details read on below).

But this brings us to another topic, which I've been discussing with
Florian. I am also interested in the multi CPU ports topic for the
NXP LS1028A SoC, which uses the felix driver for its embedded switch.
I need to explain some of the complexities there, in order to lay out
what are the aspects which should ideally be supported.

The Ocelot switch family (which Felix is a part of) doesn't actually
support more than one "NPI" port as it's called (when the CPU port
module's queues are linked to an Ethernet port, which is what DSA calls
the "CPU port"). So you'd be tempted to say that a DSA setup with
multiple CPU ports is not realizable for this SoC.

But in fact, there are 2 Ethernet ports connecting the embedded switch
and the CPU, one port is at 2.5Gbps and the other is at 1Gbps. We can
dynamically choose which one is the NPI port through device tree
(arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi), and at the moment, we
choose the 2.5Gbps port as DSA CPU port, and we disable the 1Gbps
internal port. If we wanted to, we could enable the second internal port
as an internally-facing user port, but that's a bit awkward due to
multi-homing. Nonetheless, this is all that's achievable using the NPI
port functionality.

However, due to some unrelated issues, the Felix switch has ended up
supporting two tagging protocols in fact. So there is now an option
through which the user can run this command:

  echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging

(where eno2 is the DSA master)
and the switch will disable the NPI port and set up some VLAN
pushing/popping rules through which DSA gets everything it needs to
offer absolutely the same services towards the upper network stack
layer, but without enabling the hardware functionality for a CPU port
(as far as the switch hardware is aware, it is unmanaged).

This opens up some possibilities because we no longer have the
limitation that there can be only 1 NPI port in the system. As you'd
have it, I believe that any DSA switch with a fully programmable "port
forwarding matrix" (aka a bitmap which answers the question "can port i
send packets to port j?") is capable to some degree of supporting
multiple DSA CPU ports, in the statically-assigned fashion that this
patch series attempts to achieve. Namely, you just configure the port
forwarding matrix to allow user port i to flood traffic to one CPU port
but not to the other, and you disable communication between the CPU
ports.
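
To make that concrete: for a switch with user ports 0-3 and CPU ports
4-5, a static split could be expressed like this (purely illustrative,
not any particular driver's register layout):

	/* fwd_mask[i] = bitmap of ports that port i may send packets to */
	u32 fwd_mask[6] = {};

	fwd_mask[0] = fwd_mask[1] = BIT(4);	/* users 0, 1 -> CPU port 4 */
	fwd_mask[2] = fwd_mask[3] = BIT(5);	/* users 2, 3 -> CPU port 5 */
	fwd_mask[4] = BIT(0) | BIT(1);		/* CPU port 4 -> its users */
	fwd_mask[5] = BIT(2) | BIT(3);		/* CPU port 5 -> its users;
						 * note no CPU-to-CPU bits
						 */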

But there is something which is even more interesting about Felix with
the ocelot-8021q tagger. Since Marek posted his RFC and until Ansuel
posted the follow-up, things have happened, and now both Felix and the
Marvell driver support LAG offload via the bonding and/or team drivers.
At least for Felix, when using the ocelot-8021q tagger, it should be
possible to put the two CPU ports in a hardware LAG, and the two DSA
masters in a software LAG, and let the bond/team upper of the DSA
masters be the CPU port.

I would like us to keep the door open for both alternatives, and to have
a way to switch between static user-to-CPU port assignment, and LAG.
I think that if there are multiple 'ethernet = ' phandles present in the
device tree, DSA should populate a list of valid DSA masters, and then
call into the driver to allow it to select which master it prefers for
each user port. This is similar to what Ansuel added with 'port_get_preferred_cpu',
except that I chose "DSA master" and not "CPU port" for a specific reason.
For LAG, the DSA master would be bond0.

In fact, in my case, this CPU port election procedure should also be
repeated when the tagging protocol changes, because Felix will be forced
to choose the same DSA master for all user ports at probe time, since
it boots up with the standard NPI-based "ocelot" tagger. So it can only
make use of the second CPU port when the tagging protocol changes.

Changing the DSA tagging protocol has to be done with the network down
(ip link set eno2 down && ip link set eno3 down). If we were to bring
eno2 and eno3 back up now, DSA or the driver would choose one of the DSA
masters for every port, round robin or what not. But we don't bring
the masters up yet, we create a bonding interface and we enslave eno2
and eno3 to it. DSA should detect this and add bond0 to its list of
candidates for a DSA master. Only now we bring up the masters, and the
port_get_preferred_cpu() function (or however it might end up being
named) should give the driver the option to select bond0 as a valid DSA
master.
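
In shell terms, the sequence described above would look roughly like
this (just a sketch; the bond mode is an assumption):

  ip link set eno2 down && ip link set eno3 down
  echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
  ip link add bond0 type bond mode balance-xor
  ip link set eno2 master bond0 && ip link set eno3 master bond0
  ip link set eno2 up && ip link set eno3 up && ip link set bond0 up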

Using bond0 as a DSA master would need some changes to DSA, because
currently it won't work. Namely, the RX data path needs the netdev->dsa_ptr
populated on the DSA master, whose type is a struct dsa_port *cpu_dp.
So, logically, we need a *cpu_dp corresponding to bond0.

One idea to solve this is to reuse something we already have: the
current struct dsa_switch_tree :: lags array of net devices. These hold
pointers to bonding interfaces now, but we could turn them into an array
of struct dsa_port "logical ports". The idea is that through this
change, every LAG offloaded by DSA will have a corresponding "logical
dp" which isn't present in the dst->ports list. Since every struct
dsa_port already has a lag_dev pointer, transforming the "dst->lags"
array from an array of LAG net devices into an array of logical DSA
ports will cover the existing use case as well: a logical port will
always have the dp->lag_dev pointer populated with the bonding/team
interface it offloads.

So if we make this change, we can populate bond0->dsa_ptr with this
"logical dp". This way we need to make no other changes to the RX data
path, and from the PoV of the data path itself, it isn't even a 'multi
CPU port' setup.
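
A rough sketch of the data structure change being described here (based
on the dst->lags array that the LAG offload work already introduced;
the exact layout is of course up for discussion):

	struct dsa_switch_tree {
		/* ... other members unchanged ... */

		/*
		 * was: struct net_device **lags;
		 * Now an array of "logical ports", one per offloaded LAG,
		 * each with dp->lag_dev pointing to the bonding/team
		 * interface it offloads.
		 */
		struct dsa_port **lags;
		unsigned int lags_len;
	};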

As for TX, that should already be fine: we call dev_queue_xmit() towards
bond0 because that's our master, and bond0 deals with xmit hashing
towards eno2 or eno3 on a per-packet basis, and that's what we want.

To destroy this setup, the user just needs to run 'ip link del bond0'
(with the link down I think) and DSA should call 'port_get_preferred_cpu'
again, but without bond0 in the list this time. The setup should revert
to its state with static assignment of user to CPU ports.

[ of course, I haven't tried any of this, I am just imagining things ]

I deliberately gave this very convoluted example because I would like
the progress that we make to be in this general direction. If I
understand your use cases, I believe they should be covered.


* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 18:50   ` Vladimir Oltean
@ 2021-04-11 23:53     ` Vladimir Oltean
  2021-04-12  2:10       ` Florian Fainelli
  2021-04-12  5:04     ` Ansuel Smith
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-11 23:53 UTC (permalink / raw)
  To: Marek Behun
  Cc: Ansuel Smith, netdev, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
> On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> > On Sat, 10 Apr 2021 15:34:46 +0200
> > Ansuel Smith <ansuelsmth@gmail.com> wrote:
> > 
> > > Hi,
> > > this is a respin of Marek's series, in the hope that this time we can
> > > finally make some progress with DSA support for multiple CPU ports.
> > > 
> > > This implementation is similar to Marek's series, but with some tweaks.
> > > It adds support for multiple CPU ports, but leaves to the driver the
> > > decision about which logic to use when assigning a CPU port to the
> > > various user ports. The driver can also express no preference, in which
> > > case the CPU port is chosen in a round-robin fashion.
> > 
> > In the last couple of months I have been giving some thought to this
> > problem, and came up with one important thing: if there are multiple
> > upstream ports, it would make a lot of sense to dynamically reallocate
> > them to each user port, based on which user port is actually used, and
> > at what speed.
> > 
> > For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> > ports support at most 1 Gbps. Round-robin would assign:
> >   CPU port 0 - Port 0
> >   CPU port 1 - Port 1
> >   CPU port 0 - Port 2
> >   CPU port 1 - Port 3
> >   CPU port 0 - Port 4
> > 
> > Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> > with 1, 3 and 4 free:
> >   CPU port 0 - Port 0 (plugged)
> >   CPU port 1 - Port 1 (free)
> >   CPU port 0 - Port 2 (plugged)
> >   CPU port 1 - Port 3 (free)
> >   CPU port 0 - Port 4 (free)
> > 
> > We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> > CPU, and the second CPU port is not used at all.
> > 
> > A mechanism for automatic reassignment of CPU ports would be ideal here.
> > 
> > What do you guys think?
> 
> The reason why I don't think this is such a great idea is because the
> CPU port assignment is a major reconfiguration step which should at the
> very least be done while the network is down, to avoid races with the
> data path (something which this series does not appear to handle).
> And if you allow the static user-port-to-CPU-port assignment to change
> every time a link goes up/down, I don't think you really want to force
> the network down through the entire switch basically.
> 
> So I'd be tempted to say 'tough luck' if all your ports are not up, and
> the ones that are are assigned statically to the same CPU port. It's a
> compromise between flexibility and simplicity, and I would go for
> simplicity here. That's the most you can achieve with static assignment,
> just put the CPU ports in a LAG if you want better dynamic load balancing
> (for details read on below).

Just one more small comment, because I got so carried away with
describing what I already had in mind, that I forgot to completely
address your idea.

I think that DSA should provide the means to do what you want but not
the policy. Meaning that you can always write a user space program that
monitors the NETLINK_ROUTE rtnetlink through a socket and listens for
link state change events on it with poll(), then does whatever (like
moves the static user-to-CPU port mapping in the way that is adequate to
your network's requirements). The link up/down events are already
emitted, and the patch set here gives user space the rope to hang itself.

If you need inspiration, one user of the rtnetlink socket that I know of
is ptp4l:
https://github.com/richardcochran/linuxptp/blob/master/rtnl.c
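
A minimal sketch of such a monitor in C (error handling omitted, and
the actual remapping policy left as a comment):

	#include <stdio.h>
	#include <sys/socket.h>
	#include <linux/netlink.h>
	#include <linux/rtnetlink.h>
	#include <net/if.h>

	int main(void)
	{
		struct sockaddr_nl sa = {
			.nl_family = AF_NETLINK,
			.nl_groups = RTMGRP_LINK,	/* link change events */
		};
		char buf[8192], ifname[IF_NAMESIZE];
		int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

		bind(fd, (struct sockaddr *)&sa, sizeof(sa));

		for (;;) {
			int len = recv(fd, buf, sizeof(buf), 0);
			struct nlmsghdr *nh;

			for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
			     nh = NLMSG_NEXT(nh, len)) {
				struct ifinfomsg *ifi;

				if (nh->nlmsg_type != RTM_NEWLINK)
					continue;
				ifi = NLMSG_DATA(nh);
				if (!if_indextoname(ifi->ifi_index, ifname))
					continue;
				printf("%s: %s\n", ifname,
				       ifi->ifi_flags & IFF_RUNNING ?
				       "up" : "down");
				/*
				 * A daemon could remap user ports to CPU
				 * ports here, e.g. by setting IFLA_LINK the
				 * same way "ip link set ... link ..." does.
				 */
			}
		}
	}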


* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 18:39   ` Andrew Lunn
@ 2021-04-12  2:07     ` Florian Fainelli
  2021-04-12  4:53     ` Ansuel Smith
  1 sibling, 0 replies; 68+ messages in thread
From: Florian Fainelli @ 2021-04-12  2:07 UTC (permalink / raw)
  To: Andrew Lunn, Marek Behun
  Cc: Ansuel Smith, netdev, David S. Miller, Jakub Kicinski,
	Vivien Didelot, Vladimir Oltean, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Eric Dumazet, Wei Wang,
	Cong Wang, Taehee Yoo, Björn Töpel, zhang kai,
	Weilong Chen, Roopa Prabhu, Di Zhu, Francis Laniel, linux-kernel



On 4/11/2021 11:39 AM, Andrew Lunn wrote:
> On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
>> On Sat, 10 Apr 2021 15:34:46 +0200
>> Ansuel Smith <ansuelsmth@gmail.com> wrote:
>>
>>> Hi,
>>> this is a respin of Marek's series, in the hope that this time we can
>>> finally make some progress with DSA support for multiple CPU ports.
>>>
>>> This implementation is similar to Marek's series, but with some tweaks.
>>> It adds support for multiple CPU ports, but leaves to the driver the
>>> decision about which logic to use when assigning a CPU port to the
>>> various user ports. The driver can also express no preference, in which
>>> case the CPU port is chosen in a round-robin fashion.
>>
>> In the last couple of months I have been giving some thought to this
>> problem, and came up with one important thing: if there are multiple
>> upstream ports, it would make a lot of sense to dynamically reallocate
>> them to each user port, based on which user port is actually used, and
>> at what speed.
>>
>> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
>> ports support at most 1 Gbps. Round-robin would assign:
>>   CPU port 0 - Port 0
>>   CPU port 1 - Port 1
>>   CPU port 0 - Port 2
>>   CPU port 1 - Port 3
>>   CPU port 0 - Port 4
>>
>> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
>> with 1, 3 and 4 free:
>>   CPU port 0 - Port 0 (plugged)
>>   CPU port 1 - Port 1 (free)
>>   CPU port 0 - Port 2 (plugged)
>>   CPU port 1 - Port 3 (free)
>>   CPU port 0 - Port 4 (free)
>>
>> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
>> CPU, and the second CPU port is not used at all.
>>
>> A mechanism for automatic reassignment of CPU ports would be ideal here.
> 
> One thing you need to watch out for here is source MAC addresses. I've
> not looked at the details, so this is more a heads up, it needs to be
> thought about.
> 
> DSA slaves get their MAC address from the master interface. For a
> single CPU port, all the slaves have the same MAC address. What
> happens when you have multiple CPU ports? Does the slave interface get
> the MAC address from its CPU port?

It seems to be addressed by this part of patch 3:

+	if (ether_addr_equal(dev->dev_addr, master->dev_addr))
+		eth_hw_addr_inherit(dev, cpu_dev);

although this could create an interesting set of issues if done fully
dynamically while the data path is active.

> What happens when a slave moves
> from one CPU interface to another CPU interface? Does its MAC address
> change? ARP is going to be unhappy for a while. Also, how is the
> switch deciding on which CPU port to use? Some switches are probably
> learning the MAC address being used by the interface and doing
> forwarding based on that. So you might need unique slave MAC
> addresses, and when a slave moves, it takes its MAC address with it,
> and you hope the switch learns about the move. But consider trapped
> frames as opposed to forwarded frames. So BPDU, IGMP, etc. Generally,
> you only have the choice to send such trapped frames to one CPU
> port. So potentially, such frames are going to ingress on the wrong
> port. Does this matter? What about multicast? How do you control what
> port that ingresses on? What about RX filters on the master
> interfaces? Could it be we added it to the wrong master?
> 
> For this series to make progress, we need to know what has been
> tested, and if all the more complex functionality works, not just
> basic pings.

Agreed.
-- 
Florian


* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 23:53     ` Vladimir Oltean
@ 2021-04-12  2:10       ` Florian Fainelli
  0 siblings, 0 replies; 68+ messages in thread
From: Florian Fainelli @ 2021-04-12  2:10 UTC (permalink / raw)
  To: Vladimir Oltean, Marek Behun
  Cc: Ansuel Smith, netdev, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel



On 4/11/2021 4:53 PM, Vladimir Oltean wrote:
> On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
>> On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
>>> On Sat, 10 Apr 2021 15:34:46 +0200
>>> Ansuel Smith <ansuelsmth@gmail.com> wrote:
>>>
>>>> Hi,
>>>> this is a respin of Marek's series, in the hope that this time we can
>>>> finally make some progress with DSA support for multiple CPU ports.
>>>>
>>>> This implementation is similar to Marek's series, but with some tweaks.
>>>> It adds support for multiple CPU ports, but leaves to the driver the
>>>> decision about which logic to use when assigning a CPU port to the
>>>> various user ports. The driver can also express no preference, in which
>>>> case the CPU port is chosen in a round-robin fashion.
>>>
>>> In the last couple of months I have been giving some thought to this
>>> problem, and came up with one important thing: if there are multiple
>>> upstream ports, it would make a lot of sense to dynamically reallocate
>>> them to each user port, based on which user port is actually used, and
>>> at what speed.
>>>
>>> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
>>> ports support at most 1 Gbps. Round-robin would assign:
>>>   CPU port 0 - Port 0
>>>   CPU port 1 - Port 1
>>>   CPU port 0 - Port 2
>>>   CPU port 1 - Port 3
>>>   CPU port 0 - Port 4
>>>
>>> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
>>> with 1, 3 and 4 free:
>>>   CPU port 0 - Port 0 (plugged)
>>>   CPU port 1 - Port 1 (free)
>>>   CPU port 0 - Port 2 (plugged)
>>>   CPU port 1 - Port 3 (free)
>>>   CPU port 0 - Port 4 (free)
>>>
>>> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
>>> CPU, and the second CPU port is not used at all.
>>>
>>> A mechanism for automatic reassignment of CPU ports would be ideal here.
>>>
>>> What do you guys think?
>>
>> The reason why I don't think this is such a great idea is because the
>> CPU port assignment is a major reconfiguration step which should at the
>> very least be done while the network is down, to avoid races with the
>> data path (something which this series does not appear to handle).
>> And if you allow the static user-port-to-CPU-port assignment to change
>> every time a link goes up/down, I don't think you really want to force
>> the network down through the entire switch basically.
>>
>> So I'd be tempted to say 'tough luck' if all your ports are not up, and
>> the ones that are are assigned statically to the same CPU port. It's a
>> compromise between flexibility and simplicity, and I would go for
>> simplicity here. That's the most you can achieve with static assignment,
>> just put the CPU ports in a LAG if you want better dynamic load balancing
>> (for details read on below).
> 
> Just one more small comment, because I got so carried away with
> describing what I already had in mind, that I forgot to completely
> address your idea.
> 
> I think that DSA should provide the means to do what you want but not
> the policy.

Could not agree more; this point is what has historically prevented any
multi-CPU port patch series from landing, because what everyone seems to
have wanted so far is along these lines:

- assign LAN ports 0-3 to CPU port #0
- assign WAN port 4 to CPU port #1

and do that from Device Tree, problem solved? Not entirely, unfortunately.

Being able to change the mapping via iproute2 is definitely an
improvement, and to echo your comment on the iproute2 change proper,
we can try to agree on a more specialized syntax.

> Meaning that you can always write a user space program that
> monitors the NETLINK_ROUTE rtnetlink through a socket and listens for
> link state change events on it with poll(), then does whatever (like
> moves the static user-to-CPU port mapping in the way that is adequate to
> your network's requirements). The link up/down events are already
> emitted, and the patch set here gives user space the rope to hang itself.

That seems like an entirely reasonable approach to me, and solving how
to map a given user-port to a particular CPU port definitely belongs
in user-space, within the constraints expressed by what the switch
driver can do of course.

> 
> If you need inspiration, one user of the rtnetlink socket that I know of
> is ptp4l:
> https://github.com/richardcochran/linuxptp/blob/master/rtnl.c
> 

-- 
Florian


* Re: [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports
  2021-04-10 13:34 ` [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports Ansuel Smith
@ 2021-04-12  3:35   ` DENG Qingfang
  2021-04-12  4:41     ` Ansuel Smith
  0 siblings, 1 reply; 68+ messages in thread
From: DENG Qingfang @ 2021-04-12  3:35 UTC (permalink / raw)
  To: Ansuel Smith
  Cc: netdev, Marek Behún, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Roopa Prabhu, Di Zhu,
	Nikolay Aleksandrov, Francis Laniel, linux-kernel

On Sat, Apr 10, 2021 at 03:34:47PM +0200, Ansuel Smith wrote:
> Allow for multiple CPU ports in a DSA switch tree. By default the first
> CPU port is assigned, mimicking the original assignment logic. A DSA
> driver can define a function to declare a preferred CPU port based on
> the provided port. If the function doesn't return a preferred port, the
> CPU port is assigned in a round-robin way starting from the last
> assigned CPU port.
> Examples:
> There are two CPU ports but no port_get_preferred_cpu is provided:
> - The old logic is used. Every port is assigned to the first CPU port.
> There are two CPU ports and port_get_preferred_cpu returns -1:
> - The port is assigned in a round-robin way since no preference is
>   provided.
> There are two CPU ports and port_get_preferred_cpu defines only one
> port, with -1 for the rest: (wan port on CPU1 and no preference for
> the rest)
>   lan1 <-> eth0
>   lan2 <-> eth1
>   lan3 <-> eth0
>   lan4 <-> eth1
>   wan  <-> eth1
> There are two CPU ports and port_get_preferred_cpu assigns a preference
> for every port: (wan port on CPU1, everything else on CPU0)
>   lan1 <-> eth0
>   lan2 <-> eth0
>   lan3 <-> eth0
>   lan4 <-> eth0
>   wan  <-> eth1

So, drivers will read the name of every port and decide which CPU port
to use?

> 
> Signed-off-by: Marek Behún <marek.behun@nic.cz>
> Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports
  2021-04-12  3:35   ` DENG Qingfang
@ 2021-04-12  4:41     ` Ansuel Smith
  2021-04-12 15:30       ` DENG Qingfang
  0 siblings, 1 reply; 68+ messages in thread
From: Ansuel Smith @ 2021-04-12  4:41 UTC (permalink / raw)
  To: DENG Qingfang
  Cc: netdev, Marek Behún, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Roopa Prabhu, Di Zhu,
	Nikolay Aleksandrov, Francis Laniel, linux-kernel

On Mon, Apr 12, 2021 at 11:35:25AM +0800, DENG Qingfang wrote:
> On Sat, Apr 10, 2021 at 03:34:47PM +0200, Ansuel Smith wrote:
> > Allow for multiple CPU ports in a DSA switch tree. By default the first
> > CPU port is assigned, mimicking the original assignment logic. A DSA
> > driver can define a function to declare a preferred CPU port based on
> > the provided port. If the function doesn't return a preferred port, the
> > CPU port is assigned in a round-robin way starting from the last
> > assigned CPU port.
> > Examples:
> > There are two CPU ports but no port_get_preferred_cpu is provided:
> > - The old logic is used. Every port is assigned to the first CPU port.
> > There are two CPU ports and port_get_preferred_cpu returns -1:
> > - The port is assigned in a round-robin way since no preference is
> >   provided.
> > There are two CPU ports and port_get_preferred_cpu defines only one
> > port, with -1 for the rest: (wan port on CPU1 and no preference for
> > the rest)
> >   lan1 <-> eth0
> >   lan2 <-> eth1
> >   lan3 <-> eth0
> >   lan4 <-> eth1
> >   wan  <-> eth1
> > There are two CPU ports and port_get_preferred_cpu assigns a preference
> > for every port: (wan port on CPU1, everything else on CPU0)
> >   lan1 <-> eth0
> >   lan2 <-> eth0
> >   lan3 <-> eth0
> >   lan4 <-> eth0
> >   wan  <-> eth1
> 
> So, drivers will read the name of every port and decide which CPU port
> to use?
>

Yes, this seems to be an acceptable path to follow. The driver can
provide a preferred CPU port, or just tell DSA that every CPU port is
equal and have them assigned in a round-robin fashion.
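
As a concrete (and purely hypothetical) sketch of such a driver hook,
matching the third example in the commit message; the exact signature
of port_get_preferred_cpu and the port numbering are assumptions:

/* "wan" prefers the second CPU port, everything else expresses no
 * preference and is round-robined by the DSA core.
 */
static int foo_port_get_preferred_cpu(struct dsa_switch *ds, int port)
{
        /* assume ports 0 and 6 are the CPU ports, port 5 is wan */
        if (port == 5)
                return 6;

        return -1;      /* no preference */
}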

> > 
> > Signed-off-by: Marek Behún <marek.behun@nic.cz>
> > Signed-off-by: Ansuel Smith <ansuelsmth@gmail.com>

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 18:39   ` Andrew Lunn
  2021-04-12  2:07     ` Florian Fainelli
@ 2021-04-12  4:53     ` Ansuel Smith
  1 sibling, 0 replies; 68+ messages in thread
From: Ansuel Smith @ 2021-04-12  4:53 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Marek Behun, netdev, David S. Miller, Jakub Kicinski,
	Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Sun, Apr 11, 2021 at 08:39:12PM +0200, Andrew Lunn wrote:
> On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> > On Sat, 10 Apr 2021 15:34:46 +0200
> > Ansuel Smith <ansuelsmth@gmail.com> wrote:
> > 
> > > Hi,
> > > this is a respin of the Marek series in hope that this time we can
> > > finally make some progress with dsa supporting multi-cpu port.
> > > 
> > > This implementation is similar to the Marek series but with some tweaks.
> > > This adds support for multiple-cpu port but leave the driver the
> > > decision of the type of logic to use about assigning a CPU port to the
> > > various port. The driver can also provide no preference and the CPU port
> > > is decided using a round-robin way.
> > 
> > In the last couple of months I have been giving some thought to this
> > problem, and came up with one important thing: if there are multiple
> > upstream ports, it would make a lot of sense to dynamically reallocate
> > them to each user port, based on which user port is actually used, and
> > at what speed.
> > 
> > For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> > ports support at most 1 Gbps. Round-robin would assign:
> >   CPU port 0 - Port 0
> >   CPU port 1 - Port 1
> >   CPU port 0 - Port 2
> >   CPU port 1 - Port 3
> >   CPU port 0 - Port 4
> > 
> > Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> > with 1, 3 and 4 free:
> >   CPU port 0 - Port 0 (plugged)
> >   CPU port 1 - Port 1 (free)
> >   CPU port 0 - Port 2 (plugged)
> >   CPU port 1 - Port 3 (free)
> >   CPU port 0 - Port 4 (free)
> > 
> > We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> > CPU, and the second CPU port is not used at all.
> > 
> > A mechanism for automatic reassignment of CPU ports would be ideal here.
> 
> One thing you need to watch out for here is source MAC addresses. I've
> not looked at the details, so this is more of a heads up; it needs to be
> thought about.
>
> DSA slaves get their MAC address from the master interface. For a
> single CPU port, all the slaves have the same MAC address. What
> happens when you have multiple CPU ports? Does the slave interface get
> the MAC address from its CPU port? What happens when a slave moves
> from one CPU interface to another CPU interface? Does its MAC address
> change? ARP is going to be unhappy for a while. Also, how is the
> switch deciding on which CPU port to use? Some switches are probably
> learning the MAC address being used by the interface and doing
> forwarding based on that. So you might need unique slave MAC
> addresses, and when a slave moves, it takes it MAC address with it,
> and you hope the switch learns about the move. But consider trapped
> frames as opposed to forwarded frames. So BPDU, IGMP, etc. Generally,
> you only have the choice to send such trapped frames to one CPU
> port. So potentially, such frames are going to ingress on the wrong
> port. Does this matter? What about multicast? How do you control what
> port that ingresses on? What about RX filters on the master
> interfaces? Could it be we added it to the wrong master?
> 

I think the MAC address should be changed accordingly. If the port
doesn't have a custom MAC set, it should follow the master MAC and be
changed on a port change. (Since this is an RFC I didn't add any locking,
but I think it's also needed around the cpu_dp change and the xmit path.)
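
A minimal sketch of that idea, assuming a hypothetical helper called
while a user port is moved between CPU ports; cpu_dp->master is the
existing field, the "custom MAC" test is approximated as "differs from
the old master", and locking is ignored as said above:

#include <linux/etherdevice.h>
#include <net/dsa.h>

static void dsa_slave_follow_master_mac(struct net_device *slave,
                                        struct dsa_port *old_cpu_dp,
                                        struct dsa_port *new_cpu_dp)
{
        /* a custom MAC no longer matches the old master: keep it */
        if (!ether_addr_equal(slave->dev_addr, old_cpu_dp->master->dev_addr))
                return;

        /* copy the new master's address and its assign type */
        eth_hw_addr_inherit(slave, new_cpu_dp->master);
}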

> For this series to make progress, we need to know what has been
> tested, and if all the more complex functionality works, not just
> basic pings.

About testing, I'm currently using this with a qca8k switch. It looks
like everything works correctly, but I agree that this needs better
testing of the more complex functionality. Anyway, this series just adds
the possibility of supporting and changing the CPU port; by default it
keeps the old behaviour (so in theory no regression in that part).

> 
>       Andrew

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 18:50   ` Vladimir Oltean
  2021-04-11 23:53     ` Vladimir Oltean
@ 2021-04-12  5:04     ` Ansuel Smith
  2021-04-12 12:46     ` Tobias Waldekranz
  2021-04-12 15:00     ` DENG Qingfang
  3 siblings, 0 replies; 68+ messages in thread
From: Ansuel Smith @ 2021-04-12  5:04 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Marek Behun, netdev, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
> On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> > On Sat, 10 Apr 2021 15:34:46 +0200
> > Ansuel Smith <ansuelsmth@gmail.com> wrote:
> > 
> > > Hi,
> > > this is a respin of the Marek series in hope that this time we can
> > > finally make some progress with dsa supporting multi-cpu port.
> > > 
> > > This implementation is similar to the Marek series but with some tweaks.
> > > This adds support for multiple-cpu port but leave the driver the
> > > decision of the type of logic to use about assigning a CPU port to the
> > > various port. The driver can also provide no preference and the CPU port
> > > is decided using a round-robin way.
> > 
> > In the last couple of months I have been giving some thought to this
> > problem, and came up with one important thing: if there are multiple
> > upstream ports, it would make a lot of sense to dynamically reallocate
> > them to each user port, based on which user port is actually used, and
> > at what speed.
> > 
> > For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> > ports support at most 1 Gbps. Round-robin would assign:
> >   CPU port 0 - Port 0
> >   CPU port 1 - Port 1
> >   CPU port 0 - Port 2
> >   CPU port 1 - Port 3
> >   CPU port 0 - Port 4
> > 
> > Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> > with 1, 3 and 4 free:
> >   CPU port 0 - Port 0 (plugged)
> >   CPU port 1 - Port 1 (free)
> >   CPU port 0 - Port 2 (plugged)
> >   CPU port 1 - Port 3 (free)
> >   CPU port 0 - Port 4 (free)
> > 
> > We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> > CPU, and the second CPU port is not used at all.
> > 
> > A mechanism for automatic reassignment of CPU ports would be ideal here.
> > 
> > What do you guys think?
> 
> The reason why I don't think this is such a great idea is because the
> CPU port assignment is a major reconfiguration step which should at the
> very least be done while the network is down, to avoid races with the
> data path (something which this series does not appear to handle).
> And if you allow the static user-port-to-CPU-port assignment to change
> every time a link goes up/down, I don't think you really want to force
> the network down through the entire switch basically.
> 
> So I'd be tempted to say 'tough luck' if all your ports are not up, and
> the ones that are are assigned statically to the same CPU port. It's a
> compromise between flexibility and simplicity, and I would go for
> simplicity here. That's the most you can achieve with static assignment,
> just put the CPU ports in a LAG if you want better dynamic load balancing
> (for details read on below).
> 
> But this brings us to another topic, which I've been discussing with
> Florian. I am also interested in the multi CPU ports topic for the
> NXP LS1028A SoC, which uses the felix driver for its embedded switch.
> I need to explain some of the complexities there, in order to lay out
> what are the aspects which should ideally be supported.
> 
> The Ocelot switch family (which Felix is a part of) doesn't actually
> support more than one "NPI" port as it's called (when the CPU port
> module's queues are linked to an Ethernet port, which is what DSA calls
> the "CPU port"). So you'd be tempted to say that a DSA setup with
> multiple CPU ports is not realizable for this SoC.
> 
> But in fact, there are 2 Ethernet ports connecting the embedded switch
> and the CPU, one port is at 2.5Gbps and the other is at 1Gbps. We can
> dynamically choose which one is the NPI port through device tree
> (arch/arm64/boot/dts/freescale/fsl-ls1028a.dtsi), and at the moment, we
> choose the 2.5Gbps port as DSA CPU port, and we disable the 1Gbps
> internal port. If we wanted to, we could enable the second internal port
> as an internally-facing user port, but that's a bit awkward due to
> multi-homing. Nonetheless, this is all that's achievable using the NPI
> port functionality.
> 
> However, due to some unrelated issues, the Felix switch has ended up
> supporting two tagging protocols in fact. So there is now an option
> through which the user can run this command:
> 
>   echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
> 
> (where eno2 is the DSA master)
> and the switch will disable the NPI port and set up some VLAN
> pushing/popping rules through which DSA gets everything it needs to
> offer absolutely the same services towards the upper network stack
> layer, but without enabling the hardware functionality for a CPU port
> (as far as the switch hardware is aware, it is unmanaged).
> 
> This opens up some possibilities because we no longer have the
> limitation that there can be only 1 NPI port in the system. As you'd
> have it, I believe that any DSA switch with a fully programmable "port
> forwarding matrix" (aka a bitmap which answers the question "can port i
> send packets to port j?") is capable to some degree of supporting
> multiple DSA CPU ports, in the statically-assigned fashion that this
> patch series attempts to achieve. Namely, you just configure the port
> forwarding matrix to allow user port i to flood traffic to one CPU port
> but not to the other, and you disable communication between the CPU
> ports.
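
As a toy illustration of that bitmap (not any real switch's register
layout, all names made up): fwd[i] is the set of ports that port i may
send to, cpu_for[] a static user-to-CPU assignment, cpu_mask the bitmap
of CPU ports, for a switch with up to 16 ports:

#include <linux/bitops.h>

static void build_fwd_matrix(u16 fwd[], const int cpu_for[],
                             unsigned long cpu_mask, int nports)
{
        int i;

        for (i = 0; i < nports; i++) {
                if (test_bit(i, &cpu_mask))
                        /* CPU ports reach every user port, not each other */
                        fwd[i] = GENMASK(nports - 1, 0) & ~cpu_mask;
                else
                        /* standalone user ports only reach their CPU port */
                        fwd[i] = BIT(cpu_for[i]);
        }
}
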
> 
> But there is something which is even more interesting about Felix with
> the ocelot-8021q tagger. Since Marek posted his RFC and until Ansuel
> posted the follow-up, things have happened, and now both Felix and the
> Marvell driver support LAG offload via the bonding and/or team drivers.
> At least for Felix, when using the ocelot-8021q tagger, it should be
> possible to put the two CPU ports in a hardware LAG, and the two DSA
> masters in a software LAG, and let the bond/team upper of the DSA
> masters be the CPU port.
> 
> I would like us to keep the door open for both alternatives, and to have
> a way to switch between static user-to-CPU port assignment, and LAG.
> I think that if there are multiple 'ethernet = ' phandles present in the
> device tree, DSA should populate a list of valid DSA masters, and then
> call into the driver to allow it to select which master it prefers for
> each user port. This is similar to what Ansuel added with 'port_get_preferred_cpu',
> except that I chose "DSA master" and not "CPU port" for a specific reason.
> For LAG, the DSA master would be bond0.
> 
> In fact, in my case, this CPU port election procedure should also be
> repeated when the tagging protocol changes, because Felix will be forced
> to choose the same DSA master for all user ports at probe time, because
> it boots up with the standard NPI-based "ocelot" tagger. So it can only
> make use of the second CPU port when the tagging protocol changes.
> 
> Changing the DSA tagging protocol has to be done with the network down
> (ip link set eno2 down && ip link set eno3 down). If we were to bring
> eno2 and eno3 back up now, DSA or the driver would choose one of the DSA
> masters for every port, round robin or what not. But we don't bring
> the masters up yet, we create a bonding interface and we enslave eno2
> and eno3 to it. DSA should detect this and add bond0 to its list of
> candidates for a DSA master. Only now we bring up the masters, and the
> port_get_preferred_cpu() function (or however it might end up being
> named) should give the driver the option to select bond0 as a valid DSA
> master.
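
The sequence above, spelled out as commands (illustrative only: the
re-election of the DSA master on enslavement is exactly what is being
proposed here, not something that works today):

  ip link set eno2 down
  ip link set eno3 down
  echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
  ip link add bond0 type bond mode balance-xor
  ip link set eno2 master bond0
  ip link set eno3 master bond0
  ip link set eno2 up
  ip link set eno3 up
  ip link set bond0 up
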
> 
> Using bond0 as a DSA master would need some changes to DSA, because
> currently it won't work. Namely, the RX data path needs the netdev->dsa_ptr
> populated on the DSA master, whose type is a struct dsa_port *cpu_dp.
> So, logically, we need a *cpu_dp corresponding to bond0.
> 
> One idea to solve this is to reuse something we already have: the
> current struct dsa_switch_tree :: lags array of net devices. These hold
> pointers to bonding interfaces now, but we could turn them into an array
> of struct dsa_port "logical ports". The idea is that through this
> change, every LAG offloaded by DSA will have a corresponding "logical
> dp" which isn't present in the dst->ports list. Since every struct
> dsa_port already has a lag_dev pointer, transforming the "dst->lags"
> array from an array of LAG net devices into an array of logical DSA
> ports will cover the existing use case as well: a logical port will
> always have the dp->lag_dev pointer populated with the bonding/team
> interface it offloads.
> 
> So if we make this change, we can populate bond0->dsa_ptr with this
> "logical dp". This way we need to make no other changes to the RX data
> path, and from the PoV of the data path itself, it isn't even a 'multi
> CPU port' setup.
> 
> As for TX, that should already be fine: we call dev_queue_xmit() towards
> bond0 because that's our master, and bond0 deals with xmit hashing
> towards eno2 or eno3 on a per-packet basis, and that's what we want.
> 
> To destroy this setup, the user just needs to run 'ip link del bond0'
> (with the link down I think) and DSA should call 'port_get_preferred_cpu'
> again, but without bond0 in the list this time. The setup should revert
> to its state with static assignment of user to CPU ports.
> 
> [ of course, I haven't tried any of this, I am just imagining things ]
> 
> I deliberately gave this very convoluted example because I would like
> the progress that we make to be in this general direction. If I
> understand your use cases, I believe they should be covered.

To sum up all these comments, we all really want the same thing:
- a switch driver that can communicate some type of preference to the
  CPU assignment logic;
- userspace that can change the CPU port.
I really think that to make some progress, we should try to add support
for some very basic implementation of this. I think the series I posted
achieves that: to actually enable multi-CPU support, a switch driver has
to declare the get_preferred_port function, so every driver must be
changed to support this, otherwise the old implementation is used (and
this is not a bad thing, since some tweaks are always needed, or 99% of
the time the second CPU port will cause some problem).


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 18:50   ` Vladimir Oltean
  2021-04-11 23:53     ` Vladimir Oltean
  2021-04-12  5:04     ` Ansuel Smith
@ 2021-04-12 12:46     ` Tobias Waldekranz
  2021-04-12 14:35       ` Vladimir Oltean
  2021-04-12 19:30       ` Marek Behun
  2021-04-12 15:00     ` DENG Qingfang
  3 siblings, 2 replies; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-12 12:46 UTC (permalink / raw)
  To: Vladimir Oltean, Marek Behun
  Cc: Ansuel Smith, netdev, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Sun, Apr 11, 2021 at 21:50, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
>> On Sat, 10 Apr 2021 15:34:46 +0200
>> Ansuel Smith <ansuelsmth@gmail.com> wrote:
>> 
>> > Hi,
>> > this is a respin of the Marek series in hope that this time we can
>> > finally make some progress with dsa supporting multi-cpu port.
>> > 
>> > This implementation is similar to the Marek series but with some tweaks.
>> > This adds support for multiple-cpu port but leave the driver the
>> > decision of the type of logic to use about assigning a CPU port to the
>> > various port. The driver can also provide no preference and the CPU port
>> > is decided using a round-robin way.
>> 
>> In the last couple of months I have been giving some thought to this
>> problem, and came up with one important thing: if there are multiple
>> upstream ports, it would make a lot of sense to dynamically reallocate
>> them to each user port, based on which user port is actually used, and
>> at what speed.
>> 
>> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
>> ports support at most 1 Gbps. Round-robin would assign:
>>   CPU port 0 - Port 0
>>   CPU port 1 - Port 1
>>   CPU port 0 - Port 2
>>   CPU port 1 - Port 3
>>   CPU port 0 - Port 4
>> 
>> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
>> with 1, 3 and 4 free:
>>   CPU port 0 - Port 0 (plugged)
>>   CPU port 1 - Port 1 (free)
>>   CPU port 0 - Port 2 (plugged)
>>   CPU port 1 - Port 3 (free)
>>   CPU port 0 - Port 4 (free)
>> 
>> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
>> CPU, and the second CPU port is not used at all.
>> 
>> A mechanism for automatic reassignment of CPU ports would be ideal here.
>> 
>> What do you guys think?
>
> The reason why I don't think this is such a great idea is because the
> CPU port assignment is a major reconfiguration step which should at the
> very least be done while the network is down, to avoid races with the
> data path (something which this series does not appear to handle).
> And if you allow the static user-port-to-CPU-port assignment to change
> every time a link goes up/down, I don't think you really want to force
> the network down through the entire switch basically.
>
> So I'd be tempted to say 'tough luck' if all your ports are not up, and
> the ones that are are assigned statically to the same CPU port. It's a
> compromise between flexibility and simplicity, and I would go for
> simplicity here. That's the most you can achieve with static assignment,
> just put the CPU ports in a LAG if you want better dynamic load balancing
> (for details read on below).

I agree. Unless you only have a few really wideband flows, a LAG will
typically do a great job with balancing. This will happen without the
user having to do any configuration at all. It would also perform well
in "router-on-a-stick"-setups where the incoming and outgoing port is
the same.

...

> But there is something which is even more interesting about Felix with
> the ocelot-8021q tagger. Since Marek posted his RFC and until Ansuel
> posted the follow-up, things have happened, and now both Felix and the
> Marvell driver support LAG offload via the bonding and/or team drivers.
> At least for Felix, when using the ocelot-8021q tagger, it should be
> possible to put the two CPU ports in a hardware LAG, and the two DSA
> masters in a software LAG, and let the bond/team upper of the DSA
> masters be the CPU port.
>
> I would like us to keep the door open for both alternatives, and to have
> a way to switch between static user-to-CPU port assignment, and LAG.
> I think that if there are multiple 'ethernet = ' phandles present in the
> device tree, DSA should populate a list of valid DSA masters, and then
> call into the driver to allow it to select which master it prefers for
> each user port. This is similar to what Ansuel added with 'port_get_preferred_cpu',
> except that I chose "DSA master" and not "CPU port" for a specific reason.
> For LAG, the DSA master would be bond0.

I do not see why we would go through the trouble of creating a
user-visible bond/team for this. As you detail below, it would mean
jumping through a lot of hoops. I am not sure there is that much we can
use from those drivers.

- We know that the CPU ports are statically up, so there is no "active
  transmit set" to manage, it always consists of all ports.

- The LAG members are statically known at boot time via the DT, so we do
  not need (or want, in fact) any management of that from userspace.

We could just let the drivers set up the LAG internally, and then do the
load balancing in dsa_slave_xmit, or provide a generic helper that the
taggers could use.
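
For instance, a hedged sketch of such a generic helper; only
skb_get_hash() is existing API, the two dst fields are hypothetical:

static inline struct dsa_port *
dsa_lag_pick_cpu_port(struct dsa_switch_tree *dst, struct sk_buff *skb)
{
        /* spread flows over the CPU ports using the stack's flow hash */
        u32 hash = skb_get_hash(skb);

        return dst->cpu_ports[hash % dst->num_cpu_ports];
}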

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 12:46     ` Tobias Waldekranz
@ 2021-04-12 14:35       ` Vladimir Oltean
  2021-04-12 21:06         ` Tobias Waldekranz
  2021-04-12 19:30       ` Marek Behun
  1 sibling, 1 reply; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-12 14:35 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, Apr 12, 2021 at 02:46:11PM +0200, Tobias Waldekranz wrote:
> On Sun, Apr 11, 2021 at 21:50, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
> >> On Sat, 10 Apr 2021 15:34:46 +0200
> >> Ansuel Smith <ansuelsmth@gmail.com> wrote:
> >> 
> >> > Hi,
> >> > this is a respin of the Marek series in hope that this time we can
> >> > finally make some progress with dsa supporting multi-cpu port.
> >> > 
> >> > This implementation is similar to the Marek series but with some tweaks.
> >> > This adds support for multiple-cpu port but leave the driver the
> >> > decision of the type of logic to use about assigning a CPU port to the
> >> > various port. The driver can also provide no preference and the CPU port
> >> > is decided using a round-robin way.
> >> 
> >> In the last couple of months I have been giving some thought to this
> >> problem, and came up with one important thing: if there are multiple
> >> upstream ports, it would make a lot of sense to dynamically reallocate
> >> them to each user port, based on which user port is actually used, and
> >> at what speed.
> >> 
> >> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
> >> ports support at most 1 Gbps. Round-robin would assign:
> >>   CPU port 0 - Port 0
> >>   CPU port 1 - Port 1
> >>   CPU port 0 - Port 2
> >>   CPU port 1 - Port 3
> >>   CPU port 0 - Port 4
> >> 
> >> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
> >> with 1, 3 and 4 free:
> >>   CPU port 0 - Port 0 (plugged)
> >>   CPU port 1 - Port 1 (free)
> >>   CPU port 0 - Port 2 (plugged)
> >>   CPU port 1 - Port 3 (free)
> >>   CPU port 0 - Port 4 (free)
> >> 
> >> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
> >> CPU, and the second CPU port is not used at all.
> >> 
> >> A mechanism for automatic reassignment of CPU ports would be ideal here.
> >> 
> >> What do you guys think?
> >
> > The reason why I don't think this is such a great idea is because the
> > CPU port assignment is a major reconfiguration step which should at the
> > very least be done while the network is down, to avoid races with the
> > data path (something which this series does not appear to handle).
> > And if you allow the static user-port-to-CPU-port assignment to change
> > every time a link goes up/down, I don't think you really want to force
> > the network down through the entire switch basically.
> >
> > So I'd be tempted to say 'tough luck' if all your ports are not up, and
> > the ones that are are assigned statically to the same CPU port. It's a
> > compromise between flexibility and simplicity, and I would go for
> > simplicity here. That's the most you can achieve with static assignment,
> > just put the CPU ports in a LAG if you want better dynamic load balancing
> > (for details read on below).
> 
> I agree. Unless you only have a few really wideband flows, a LAG will
> typically do a great job with balancing. This will happen without the
> user having to do any configuration at all. It would also perform well
> in "router-on-a-stick"-setups where the incoming and outgoing port is
> the same.
> 
> ...
> 
> > But there is something which is even more interesting about Felix with
> > the ocelot-8021q tagger. Since Marek posted his RFC and until Ansuel
> > posted the follow-up, things have happened, and now both Felix and the
> > Marvell driver support LAG offload via the bonding and/or team drivers.
> > At least for Felix, when using the ocelot-8021q tagger, it should be
> > possible to put the two CPU ports in a hardware LAG, and the two DSA
> > masters in a software LAG, and let the bond/team upper of the DSA
> > masters be the CPU port.
> >
> > I would like us to keep the door open for both alternatives, and to have
> > a way to switch between static user-to-CPU port assignment, and LAG.
> > I think that if there are multiple 'ethernet = ' phandles present in the
> > device tree, DSA should populate a list of valid DSA masters, and then
> > call into the driver to allow it to select which master it prefers for
> > each user port. This is similar to what Ansuel added with 'port_get_preferred_cpu',
> > except that I chose "DSA master" and not "CPU port" for a specific reason.
> > For LAG, the DSA master would be bond0.
> 
> I do not see why we would go through the trouble of creating a
> user-visible bond/team for this. As you detail below, it would mean
> jumping through a lot of hoops. I am not sure there is that much we can
> use from those drivers.
> 
> - We know that the CPU ports are statically up, so there is no "active
>   transmit set" to manage, it always consists of all ports.
> 
> - The LAG members are statically known at boot time via the DT, so we do
>   not need (or want, in fact) any management of that from userspace.
> 
> We could just let the drivers set up the LAG internally, and then do the
> load balancing in dsa_slave_xmit, or provide a generic helper that the
> taggers could use.

It's natural of you to lobby for LAG defined in device tree, since I
know you want to cover the problem for DSA links as well, not only for
CPU ports. That is about the only merit of this solution, however, and I
think we should thoroughly discuss DSA links in the first place, before
claiming that it is even the best solution for them.

First of all, DSA drivers can now look at a struct net_device *bond when
they establish their link aggregation domains. If we have no struct
net_device *bond we will have to invent a new and parallel API for LAG
on CPU ports and DSA links. If we have to modify DSA anyway, I wonder
why we don't strive to converge towards a unified driver API at least.

Also, the CPU ports might be statically up, but their corresponding DSA
masters may not. You are concentrating only on the switch side, but the
DSA master side is what's more finicky. For example, I really don't want
to see DSA implement its own xmit policies; that will get old really
quickly. I really appreciate being able to externalize the TX hashing
to a separate driver, for which the user already has enough knobs,
rather than shoving that in DT.

Defining a LAG between the CPU ports in the device tree also goes
against the idea that we should not define policy in the kernel.
For that matter, I slept on the overall design and I think that if we
were really purists, we should even drop the whole idea with 'round
robin by default' in the static user-to-CPU port assignment scenario,
and let user space manage _everything_. Meaning that even if there are
multiple 'ethernet = ' phandles in the device tree, DSA will consider
them only to the point of registering CPU ports for all of them. But we
will keep the same dsa_tree_setup_default_cpu -> dsa_tree_find_first_cpu
logic, and there will be only one CPU port / DSA master until user space
requests otherwise. In this model, the crucial aspect of DSA support for
multiple CPU ports will be the ability to have that netlink reconfiguration
while the network is down (or has not even been set up yet). You can see
how, in this world view, your proposal to define a LAG in the device
tree does not smell great.

Of course, I will let Andrew be the supreme judge when it comes to
system design. Simplicity done right is an acquired taste, and I'm
absolutely open to admit that I'm not quite there yet.



As for LAG on DSA links, yes that is going to be interesting. I see two
approaches:

- Similar to how mv88e6xxx just decides all by itself to use DSA instead
  of EDSA tags on the DSA links, maybe there simply are some things that
  the software data path and control path just don't need to care about.
  So I wonder if there isn't any known case in which it wouldn't 'just
  work' for mv88e6xxx to detect that it has multiple routing ports
  towards the same destination switch, to just go ahead and create a LAG
  between them, no DSA involvement at all.

- I come from a background where I am working with boards with disjoint
  DSA trees:

      +---------------------------------------------------------------+
      | LS1028A                                                       |
      |               +------------------------------+                |
      |               |      DSA master for Felix    |                |
      |               | (internal ENETC port 2: eno2)|                |
      |  +------------+------------------------------+-------------+  |
      |  | Felix embedded L2 switch                                |  |
      |  |                                                         |  |
      |  | +--------------+   +--------------+   +--------------+  |  |
      |  | |DSA master for|   |DSA master for|   |DSA master for|  |  |
      |  | |  SJA1105 1   |   |  SJA1105 2   |   |  SJA1105 3   |  |  |
      |  | |   (swp1)     |   |   (swp2)     |   |   (swp3)     |  |  |
      +--+-+--------------+---+--------------+---+--------------+--+--+

+-----------------------+ +-----------------------+ +-----------------------+
|   SJA1105 switch 1    | |   SJA1105 switch 2    | |   SJA1105 switch 3    |
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
|sw1p0|sw1p1|sw1p2|sw1p3| |sw2p0|sw2p1|sw2p2|sw2p3| |sw3p0|sw3p1|sw3p2|sw3p3|
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+

You may find this setup to be a little bit odd, but the network
interfaces in this system are:

eno2, swp1, swp2, swp3, sw1p0-sw1p3, sw2p0-sw2p3, sw3p0-sw3p3.

This is because the Felix switch has a dsa,member property of <0 0>,
SJA1105 switch 1 is <1 0>, SJA1105 switch 2 is <2 0> and SJA1105 switch
3 is <3 0>. So each switch is the only switch within its tree. And that
makes Felix the DSA master for 3 DSA switches, while being a DSA switch
itself. This was done this way so that tag stacking 'just works': every
packet is decapsulated of the Felix tag on eno2, then of the SJA1105 tag
on swp1/swp2/swp3.

This setup gives me a lot of visibility into Ethernet port counters on
all ports. Because swp1, swp2, swp3 are DSA masters, I see not only
their port counters, but also the port counters of the SJA1105 CPU
ports. Great.

But with the ocelot-8021q tagger, imagine a scenario where I make Felix
grok the DSA headers added by SJA1105 (which are also VLAN-based). Then
tag stacking would no longer be necessary. I could convert the swp1,
swp2, swp3 ports from being DSA masters into being out-facing DSA links,
and their net devices would disappear. But then I would lose all
visibility into them! And the strange part in my view is that this is a
100% software implementation-defined layout: if they're DSA masters
they're net devices, if they're DSA links they aren't, when in fact it
is the same hardware just configured differently.

So my idea here is maybe we could unify DSA links and disjoint DSA trees
by exposing net devices for the out-facing DSA links, just for the sake
of:
- port counters both ways
- having a hook point for LAGs

Now don't get me wrong, there are downsides too. For example, the net
device would not be useful for the actual data path. DSA will not use it
for packet processing coming from / going towards the leaves. You _could_
still xmit packets towards an out-facing DSA link, and maybe it would
even be less useless than xmitting them through a DSA master: if, when
you send a packet through the DSA master, that will be untagged, the
same isn't necessarily the case with a DSA link. You can maybe inject a
FORWARD packet encapsulated in a FROM_CPU tag, and the semantics would
be just 'do your thing'. I don't know.

So I would prefer exhausting the first approach, with private LAG setup
on DSA links done in the driver, before even considering the second one.
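
A hedged sketch of that first approach, built on the dsa_link routing
table the core already keeps (dst->rtable and struct dsa_link exist;
the foo_hw_lag_join() hook is hypothetical, and a real implementation
would deduplicate the pairs this visits twice):

static void foo_autolag_dsa_links(struct dsa_switch *ds)
{
        struct dsa_link *dl, *other;

        list_for_each_entry(dl, &ds->dst->rtable, list) {
                if (dl->dp->ds != ds)
                        continue;

                list_for_each_entry(other, &ds->dst->rtable, list) {
                        /* another routing port of ours towards the
                         * same destination switch?
                         */
                        if (other == dl || other->dp->ds != ds ||
                            other->link_dp->ds != dl->link_dp->ds)
                                continue;

                        /* aggregate privately, no DSA core involvement */
                        foo_hw_lag_join(ds, dl->dp->index,
                                        other->dp->index);
                }
        }
}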

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-11 18:50   ` Vladimir Oltean
                       ` (2 preceding siblings ...)
  2021-04-12 12:46     ` Tobias Waldekranz
@ 2021-04-12 15:00     ` DENG Qingfang
  2021-04-12 16:32       ` Vladimir Oltean
  3 siblings, 1 reply; 68+ messages in thread
From: DENG Qingfang @ 2021-04-12 15:00 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
> 
> So I'd be tempted to say 'tough luck' if all your ports are not up, and
> the ones that are are assigned statically to the same CPU port. It's a
> compromise between flexibility and simplicity, and I would go for
> simplicity here. That's the most you can achieve with static assignment,
> just put the CPU ports in a LAG if you want better dynamic load balancing
> (for details read on below).
> 

Many switches such as mv88e6xxx only support MAC DA/SA load balancing,
which makes it not ideal in router applications (router WAN <--> ISP
BRAS traffic will always have the same DA/SA and thus use only one
port).

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports
  2021-04-12  4:41     ` Ansuel Smith
@ 2021-04-12 15:30       ` DENG Qingfang
  2021-04-12 16:17         ` Frank Wunderlich
  0 siblings, 1 reply; 68+ messages in thread
From: DENG Qingfang @ 2021-04-12 15:30 UTC (permalink / raw)
  To: Ansuel Smith
  Cc: netdev, Marek Behún, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Roopa Prabhu, Di Zhu,
	Nikolay Aleksandrov, Francis Laniel, linux-kernel,
	Frank Wunderlich

On Mon, Apr 12, 2021 at 06:41:58AM +0200, Ansuel Smith wrote:
> > So, drivers will read the name of every port and decide which CPU port
> > to use?
> >
> 
> Yes, this seems to be an acceptable path to follow. The driver can
> provide a preferred CPU port or just tell DSA that every cpu is equal
> and assign them in a round-robin.
> 

So we somehow configure the default CPU port in the DTS (by port name).
In my opinion we can just add a default CPU property in the DTS to
specify it (like Frank Wunderlich did earlier), and fall back to
round-robin if the property is not present, while still allowing users
to change it in userspace.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports
  2021-04-12 15:30       ` DENG Qingfang
@ 2021-04-12 16:17         ` Frank Wunderlich
  0 siblings, 0 replies; 68+ messages in thread
From: Frank Wunderlich @ 2021-04-12 16:17 UTC (permalink / raw)
  To: DENG Qingfang, Ansuel Smith
  Cc: netdev, Marek Behún, David S. Miller, Jakub Kicinski,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Vladimir Oltean,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Roopa Prabhu, Di Zhu,
	Nikolay Aleksandrov, Francis Laniel, linux-kernel

On 12 April 2021 17:30:58 MESZ, DENG Qingfang <dqfext@gmail.com> wrote:

>So we somehow configured default CPU port in dts (by port name). In
>my opinion we can just add a default CPU property in dts to specify
>it (like Frank Wunderlich did earlier), and fall back to round-robin
>if the property is not present, while still allow users to change it
>in userspace.

My series was an up-ported version of patches (from Linux 4.9), AFAIR from Felix. The DTS version was one reason why it was rejected, because DT describes hardware and not software preferences.


regards Frank

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 15:00     ` DENG Qingfang
@ 2021-04-12 16:32       ` Vladimir Oltean
  2021-04-12 22:04         ` Marek Behun
  0 siblings, 1 reply; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-12 16:32 UTC (permalink / raw)
  To: DENG Qingfang
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, Apr 12, 2021 at 11:00:45PM +0800, DENG Qingfang wrote:
> On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
> >
> > So I'd be tempted to say 'tough luck' if all your ports are not up, and
> > the ones that are are assigned statically to the same CPU port. It's a
> > compromise between flexibility and simplicity, and I would go for
> > simplicity here. That's the most you can achieve with static assignment,
> > just put the CPU ports in a LAG if you want better dynamic load balancing
> > (for details read on below).
> >
>
> Many switches such as mv88e6xxx only support MAC DA/SA load balancing,
> which makes it not ideal in router applications (router WAN <--> ISP
> BRAS traffic will always have the same DA/SA and thus use only one
> port).

Is this supposed to make a difference? Choose a better switch vendor!

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 12:46     ` Tobias Waldekranz
  2021-04-12 14:35       ` Vladimir Oltean
@ 2021-04-12 19:30       ` Marek Behun
  2021-04-12 21:22         ` Tobias Waldekranz
  1 sibling, 1 reply; 68+ messages in thread
From: Marek Behun @ 2021-04-12 19:30 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, 12 Apr 2021 14:46:11 +0200
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> I agree. Unless you only have a few really wideband flows, a LAG will
> typically do a great job with balancing. This will happen without the
> user having to do any configuration at all. It would also perform well
> in "router-on-a-stick"-setups where the incoming and outgoing port is
> the same.

TL;DR: The problem with LAGs as they are currently implemented is that
on Turris Omnia, in basically 1/16 of configurations the traffic would
go via one CPU port anyway.



One potential problem that I see with using LAGs for aggregating CPU
ports on mv88e6xxx is how these switches determine the port for a
packet: only the src and dst MAC addresses are used for the hash that
chooses the port.

The most common scenario for Turris Omnia, for example, where we have 2
CPU ports and 5 user ports, is that into these 5 user ports the user
plugs 5 simple devices (no switches, so only one peer MAC address per
port). So we have only 5 pairs of src + dst MAC addresses. If we simply
fill the LAG table as it is done now, each pair independently hashes to
one of the 2 CPU ports, so there is a 2 * 0.5^5 = 1/16 chance that all
packets would go through one CPU port.

In order to have real load balancing in this scenario, we would either
have to recompute the LAG mask table depending on the MAC addresses, or
rewrite the LAG mask table somewhat randomly periodically. (This could
in theory be offloaded onto the Z80 internal CPU for some of the
switches of the mv88e6xxx family, but not for Omnia.)
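
A toy sketch of the periodic rewrite idea (no mv88e6xxx register code;
an 8-bucket LAG hash table is assumed as a stand-in): re-deal the
buckets over the CPU ports with a rotating offset, so a flow stuck in
one bucket still alternates CPU ports over time.

static void rotate_lag_buckets(u8 bucket_to_cpu[8], int ncpu, int step)
{
        int i;

        for (i = 0; i < 8; i++)
                bucket_to_cpu[i] = (i + step) % ncpu;
}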

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 14:35       ` Vladimir Oltean
@ 2021-04-12 21:06         ` Tobias Waldekranz
  0 siblings, 0 replies; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-12 21:06 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, Apr 12, 2021 at 17:35, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Apr 12, 2021 at 02:46:11PM +0200, Tobias Waldekranz wrote:
>> On Sun, Apr 11, 2021 at 21:50, Vladimir Oltean <olteanv@gmail.com> wrote:
>> > On Sun, Apr 11, 2021 at 08:01:35PM +0200, Marek Behun wrote:
>> >> On Sat, 10 Apr 2021 15:34:46 +0200
>> >> Ansuel Smith <ansuelsmth@gmail.com> wrote:
>> >> 
>> >> > Hi,
>> >> > this is a respin of the Marek series in hope that this time we can
>> >> > finally make some progress with dsa supporting multi-cpu port.
>> >> > 
>> >> > This implementation is similar to the Marek series but with some tweaks.
>> >> > This adds support for multiple-cpu port but leave the driver the
>> >> > decision of the type of logic to use about assigning a CPU port to the
>> >> > various port. The driver can also provide no preference and the CPU port
>> >> > is decided using a round-robin way.
>> >> 
>> >> In the last couple of months I have been giving some thought to this
>> >> problem, and came up with one important thing: if there are multiple
>> >> upstream ports, it would make a lot of sense to dynamically reallocate
>> >> them to each user port, based on which user port is actually used, and
>> >> at what speed.
>> >> 
>> >> For example on Turris Omnia we have 2 CPU ports and 5 user ports. All
>> >> ports support at most 1 Gbps. Round-robin would assign:
>> >>   CPU port 0 - Port 0
>> >>   CPU port 1 - Port 1
>> >>   CPU port 0 - Port 2
>> >>   CPU port 1 - Port 3
>> >>   CPU port 0 - Port 4
>> >> 
>> >> Now suppose that the user plugs ethernet cables only into ports 0 and 2,
>> >> with 1, 3 and 4 free:
>> >>   CPU port 0 - Port 0 (plugged)
>> >>   CPU port 1 - Port 1 (free)
>> >>   CPU port 0 - Port 2 (plugged)
>> >>   CPU port 1 - Port 3 (free)
>> >>   CPU port 0 - Port 4 (free)
>> >> 
>> >> We end up in a situation where ports 0 and 2 share 1 Gbps bandwidth to
>> >> CPU, and the second CPU port is not used at all.
>> >> 
>> >> A mechanism for automatic reassignment of CPU ports would be ideal here.
>> >> 
>> >> What do you guys think?
>> >
>> > The reason why I don't think this is such a great idea is because the
>> > CPU port assignment is a major reconfiguration step which should at the
>> > very least be done while the network is down, to avoid races with the
>> > data path (something which this series does not appear to handle).
>> > And if you allow the static user-port-to-CPU-port assignment to change
>> > every time a link goes up/down, I don't think you really want to force
>> > the network down through the entire switch basically.
>> >
>> > So I'd be tempted to say 'tough luck' if all your ports are not up, and
>> > the ones that are are assigned statically to the same CPU port. It's a
>> > compromise between flexibility and simplicity, and I would go for
>> > simplicity here. That's the most you can achieve with static assignment,
>> > just put the CPU ports in a LAG if you want better dynamic load balancing
>> > (for details read on below).
>> 
>> I agree. Unless you only have a few really wideband flows, a LAG will
>> typically do a great job with balancing. This will happen without the
>> user having to do any configuration at all. It would also perform well
>> in "router-on-a-stick"-setups where the incoming and outgoing port is
>> the same.
>> 
>> ...
>> 
>> > But there is something which is even more interesting about Felix with
>> > the ocelot-8021q tagger. Since Marek posted his RFC and until Ansuel
>> > posted the follow-up, things have happened, and now both Felix and the
>> > Marvell driver support LAG offload via the bonding and/or team drivers.
>> > At least for Felix, when using the ocelot-8021q tagger, it should be
>> > possible to put the two CPU ports in a hardware LAG, and the two DSA
>> > masters in a software LAG, and let the bond/team upper of the DSA
>> > masters be the CPU port.
>> >
>> > I would like us to keep the door open for both alternatives, and to have
>> > a way to switch between static user-to-CPU port assignment, and LAG.
>> > I think that if there are multiple 'ethernet = ' phandles present in the
>> > device tree, DSA should populate a list of valid DSA masters, and then
>> > call into the driver to allow it to select which master it prefers for
>> > each user port. This is similar to what Ansuel added with 'port_get_preferred_cpu',
>> > except that I chose "DSA master" and not "CPU port" for a specific reason.
>> > For LAG, the DSA master would be bond0.
>> 
>> I do not see why we would go through the trouble of creating a
>> user-visible bond/team for this. As you detail below, it would mean
>> jumping through a lot of hoops. I am not sure there is that much we can
>> use from those drivers.
>> 
>> - We know that the CPU ports are statically up, so there is no "active
>>   transmit set" to manage, it always consists of all ports.
>> 
>> - The LAG members are statically known at boot time via the DT, so we do
>>   not need (or want, in fact) any management of that from userspace.
>> 
>> We could just let the drivers set up the LAG internally, and then do the
>> load balancing in dsa_slave_xmit, or provide a generic helper that the
>> taggers could use.
>
> It's natural of you to lobby for LAG defined in device tree, since I
> know you want to cover the problem for DSA links as well, not only for
> CPU ports. That is about the only merit of this solution, however, and I
> think we should thoroughly discuss DSA links in the first place, before
> even claiming that it is even the best solution for them.

I am quite sure I never said anything about having to define it in the
DT. The information ("port x, y and z are connected to the CPU") is
already there. What more would we need?

> First of all, DSA drivers can now look at a struct net_device *bond when
> they establish their link aggregation domains. If we have no struct
> net_device *bond we will have to invent a new and parallel API for LAG
> on CPU ports and DSA links. If we have to modify DSA anyway, I wonder
> why we don't strive to converge towards a unified driver API at least.

OK, I was thinking more along the lines of these LAGs being a
driver-internal matter. I.e. when a driver sees two links between two
chips, it could just set up the LAGs without having to bother the DSA
layer at all.

DSA would just have N rx handlers set up, and no matter which incoming
master port was used, the same action would be taken (untag and mux it
to the right slave netdev as usual).

> Also, the CPU ports might be statically up, but their corresponding DSA
> masters may not. You are concentrating only on the switch side, but the
> DSA master side is what's more finicky. For example, I really don't want
> to see DSA implement its own xmit policies, that will get old really
> quick, I really appreciate being able to externalize the TX hashing to a
> separate driver, for which the user already has enough knobs to
> customize, than to shove that in DT.

Yeah if we need customizable hashing then I agree, we should go through
an existing LAG driver.

> Defining a LAG between the CPU ports in the device tree also goes
> against the idea that we should not define policy in the kernel.
> For that matter, I slept on the overall design and I think that if we
> were really purists, we should even drop the whole idea with 'round
> robin by default' in the static user-to-CPU port assignment scenario,
> and let user space manage _everything_. Meaning that even if there are
> multiple 'ethernet = ' phandles in the device tree, DSA will consider
> them only to the point of registering CPU ports for all of them. But we
> will keep the same dsa_tree_setup_default_cpu -> dsa_tree_find_first_cpu
> logic, and there will be only one CPU port / DSA master until user space
> requests otherwise. In this model, the crucial aspect of DSA support for
> multiple CPU ports will be the ability to have that netlink reconfiguration
> while the network is down (or has not even been set up yet). You can see
> how, in this world view, your proposal to define a LAG in the device
> tree does not smell great.

Again, I am suggesting no such thing.

> Of course, I will let Andrew be the supreme judge when it comes to
> system design. Simplicity done right is an acquired taste, and I'm
> absolutely open to admit that I'm not quite there yet.
>
>
>
> As for LAG on DSA links, yes that is going to be interesting. I see two
> approaches:
>
> - Similar to how mv88e6xxx just decides all by itself to use DSA instead
>   of EDSA tags on the DSA links, maybe there simply are some things that
>   the software data path and control path just don't need to care about.
>   So I wonder if there isn't any known case in which it wouldn't 'just
>   work' for mv88e6xxx to detect that it has multiple routing ports
>   towards the same destination switch, to just go ahead and create a LAG
>   between them, no DSA involvement at all.

Right, this is similar to how I saw the CPU ports being managed as well.

> - I come from a background where I am working with boards with disjoint
>   DSA trees:
>
>       +---------------------------------------------------------------+
>       | LS1028A                                                       |
>       |               +------------------------------+                |
>       |               |      DSA master for Felix    |                |
>       |               | (internal ENETC port 2: eno2)|                |
>       |  +------------+------------------------------+-------------+  |
>       |  | Felix embedded L2 switch                                |  |
>       |  |                                                         |  |
>       |  | +--------------+   +--------------+   +--------------+  |  |
>       |  | |DSA master for|   |DSA master for|   |DSA master for|  |  |
>       |  | |  SJA1105 1   |   |  SJA1105 2   |   |  SJA1105 3   |  |  |
>       |  | |   (swp1)     |   |   (swp2)     |   |   (swp3)     |  |  |
>       +--+-+--------------+---+--------------+---+--------------+--+--+
>
> +-----------------------+ +-----------------------+ +-----------------------+
> |   SJA1105 switch 1    | |   SJA1105 switch 2    | |   SJA1105 switch 3    |
> +-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
> |sw1p0|sw1p1|sw1p2|sw1p3| |sw2p0|sw2p1|sw2p2|sw2p3| |sw3p0|sw3p1|sw3p2|sw3p3|
> +-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
>
> You may find this setup to be a little bit odd, but the network
> interfaces in this system are:
>
> eno2, swp1, swp2, swp3, sw1p0-sw1p3, sw2p0-sw2p3, sw3p0-sw3p3.
>
> This is because the Felix switch has a dsa,member property of <0 0>,
> SJA1105 switch 1 is <1 0>, SJA1105 switch 2 is <2 0> and SJA1105 switch
> 3 is <3 0>. So each switch is the only switch within its tree. And that
> makes Felix the DSA master for 3 DSA switches, while being a DSA switch
> itself. This was done this way so that tag stacking 'just works': every
> packet is decapsulated of the Felix tag on eno2, then of the SJA1105 tag
> on swp1/swp2/swp3.
>
> This setup gives me a lot of visibility into Ethernet port counters on
> all ports. Because swp1, swp2, swp3 are DSA masters, I see not
> only their port counters, but also the port counters of the SJA1105 CPU
> ports. Great.
>
> But with the ocelot-8021q tagger, imagine a scenario where I make Felix
> grok the DSA headers added by SJA1105 (which are also VLAN-based). Then
> tag stacking would no longer be necessary. I could convert the swp1,
> swp2, swp3 ports from being DSA masters into being out-facing DSA links,
> and their net devices would disappear. But then I would lose all
> visibility into them! And the strange part in my view is that this is a
> 100% software implementation-defined layout: if they're DSA masters
> they're net devices, if they're DSA links they aren't, when in fact it
> is the same hardware just configured differently.
>
> So my idea here is maybe we could unify DSA links and disjoint DSA trees
> by exposing net devices for the out-facing DSA links, just for the sake
> of:
> - port counters both ways

We have previously experimented with netdevs for dsa (and cpu) ports,
just to get to the counters. It seems like something that might be
better suited for devlink though. That way, regular users would not have
to be confused by netdevs that they can do very little with, but if you
know about the fabric underneath you can still get to them easily.

> - having a hook point for LAGs
>
> Now don't get me wrong, there are downsides too. For example, the net
> device would not be useful for the actual data path. DSA will not use it
> for packet processing coming from / going towards the leaves. You _could_
> still xmit packets towards an out-facing DSA link, and maybe it would
> even be less useless than xmitting them through a DSA master: if, when
> you send a packet through the DSA master, that will be untagged, the
> same isn't necessarily the case with a DSA link. You can maybe inject a
> FORWARD packet encapsulated in a FROM_CPU tag, and the semantics would
> be just 'do your thing'. I don't know.

There could be devices like that, I suppose; mv88e6xxx is not one of them
though. There is no nesting of DSA tags (unless you set them up as
disjoint trees) AFAIK.

> So I would prefer exhausting the first approach, with private LAG setup
> on DSA links done in the driver, before even considering the second one.

What is the second approach?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 19:30       ` Marek Behun
@ 2021-04-12 21:22         ` Tobias Waldekranz
  2021-04-12 21:34           ` Vladimir Oltean
  2021-04-12 21:50           ` Marek Behun
  0 siblings, 2 replies; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-12 21:22 UTC (permalink / raw)
  To: Marek Behun
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:
> On Mon, 12 Apr 2021 14:46:11 +0200
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> I agree. Unless you only have a few really wideband flows, a LAG will
>> typically do a great job with balancing. This will happen without the
>> user having to do any configuration at all. It would also perform well
>> in "router-on-a-stick"-setups where the incoming and outgoing port is
>> the same.
>
> TLDR: The problem with LAGs as they are currently implemented is that
> for Turris Omnia, basically in 1/16 of configurations the traffic would
> go via one CPU port anyway.
>
>
>
> One potential problem that I see with using LAGs for aggregating CPU
> ports on mv88e6xxx is how these switches determine the port for a
> packet: only the src and dst MAC address is used for the hash that
> chooses the port.
>
> The most common scenario for Turris Omnia, for example, where we have 2
> CPU ports and 5 user ports, is that into these 5 user ports the user
> plugs 5 simple devices (no switches, so only one peer MAC address per
> port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
> chance that all packets would go through one CPU port.
>
> In order to have real load balancing in this scenario, we would either
> have to recompute the LAG mask table depending on the MAC addresses, or
> rewrite the LAG mask table somewhat randomly periodically. (This could
> be in theory offloaded onto the Z80 internal CPU for some of the
> switches of the mv88e6xxx family, but not for Omnia.)

I thought that the option to associate each port netdev with a DSA
master would only be used on transmit. Are you saying that there is a
way to configure an mv88e6xxx chip to steer packets to different CPU
ports depending on the incoming port?

The reason that the traffic is directed towards the CPU is that some
kind of entry in the ATU says so, and the destination of that entry will
either be a port vector or a LAG. Of those two, only the LAG will offer
any kind of balancing. What am I missing?

Transmit is easy; you are already in the CPU, so you can use an
arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
in that case.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 21:22         ` Tobias Waldekranz
@ 2021-04-12 21:34           ` Vladimir Oltean
  2021-04-12 21:49             ` Tobias Waldekranz
  2021-04-12 21:50           ` Marek Behun
  1 sibling, 1 reply; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-12 21:34 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:
> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:
> > On Mon, 12 Apr 2021 14:46:11 +0200
> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >
> >> I agree. Unless you only have a few really wideband flows, a LAG will
> >> typically do a great job with balancing. This will happen without the
> >> user having to do any configuration at all. It would also perform well
> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> >> the same.
> >
> > TLDR: The problem with LAGs as they are currently implemented is that
> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> > go via one CPU port anyway.
> >
> >
> >
> > One potential problem that I see with using LAGs for aggregating CPU
> > ports on mv88e6xxx is how these switches determine the port for a
> > packet: only the src and dst MAC address is used for the hash that
> > chooses the port.
> >
> > The most common scenario for Turris Omnia, for example, where we have 2
> > CPU ports and 5 user ports, is that into these 5 user ports the user
> > plugs 5 simple devices (no switches, so only one peer MAC address per
> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
> > chance that all packets would go through one CPU port.
> >
> > In order to have real load balancing in this scenario, we would either
> > have to recompute the LAG mask table depending on the MAC addresses, or
> > rewrite the LAG mask table somewhat randomly periodically. (This could
> > be in theory offloaded onto the Z80 internal CPU for some of the
> > switches of the mv88e6xxx family, but not for Omnia.)
> 
> I thought that the option to associate each port netdev with a DSA
> master would only be used on transmit. Are you saying that there is a
> way to configure an mv88e6xxx chip to steer packets to different CPU
> ports depending on the incoming port?
> 
> The reason that the traffic is directed towards the CPU is that some
> kind of entry in the ATU says so, and the destination of that entry will
> either be a port vector or a LAG. Of those two, only the LAG will offer
> any kind of balancing. What am I missing?
> 
> Transmit is easy; you are already in the CPU, so you can use an
> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
> in that case.

Say a user port receives a broadcast frame. Based on your understanding
where user-to-CPU port assignments are used only for TX, which CPU port
should be selected by the switch for this broadcast packet, and by which
mechanism?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 21:34           ` Vladimir Oltean
@ 2021-04-12 21:49             ` Tobias Waldekranz
  2021-04-12 21:56               ` Marek Behun
  2021-04-12 22:06               ` Vladimir Oltean
  0 siblings, 2 replies; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-12 21:49 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, Apr 13, 2021 at 00:34, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:
>> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:
>> > On Mon, 12 Apr 2021 14:46:11 +0200
>> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>> >
>> >> I agree. Unless you only have a few really wideband flows, a LAG will
>> >> typically do a great job with balancing. This will happen without the
>> >> user having to do any configuration at all. It would also perform well
>> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
>> >> the same.
>> >
>> > TLDR: The problem with LAGs as they are currently implemented is that
>> > for Turris Omnia, basically in 1/16 of configurations the traffic would
>> > go via one CPU port anyway.
>> >
>> >
>> >
>> > One potential problem that I see with using LAGs for aggregating CPU
>> > ports on mv88e6xxx is how these switches determine the port for a
>> > packet: only the src and dst MAC address is used for the hash that
>> > chooses the port.
>> >
>> > The most common scenario for Turris Omnia, for example, where we have 2
>> > CPU ports and 5 user ports, is that into these 5 user ports the user
>> > plugs 5 simple devices (no switches, so only one peer MAC address per
>> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
>> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
>> > chance that all packets would go through one CPU port.
>> >
>> > In order to have real load balancing in this scenario, we would either
>> > have to recompute the LAG mask table depending on the MAC addresses, or
>> > rewrite the LAG mask table somewhat randomly periodically. (This could
>> > be in theory offloaded onto the Z80 internal CPU for some of the
>> > switches of the mv88e6xxx family, but not for Omnia.)
>> 
>> I thought that the option to associate each port netdev with a DSA
>> master would only be used on transmit. Are you saying that there is a
>> way to configure an mv88e6xxx chip to steer packets to different CPU
>> ports depending on the incoming port?
>> 
>> The reason that the traffic is directed towards the CPU is that some
>> kind of entry in the ATU says so, and the destination of that entry will
>> either be a port vector or a LAG. Of those two, only the LAG will offer
>> any kind of balancing. What am I missing?
>> 
>> Transmit is easy; you are already in the CPU, so you can use an
>> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
>> in that case.
>
> Say a user port receives a broadcast frame. Based on your understanding
> where user-to-CPU port assignments are used only for TX, which CPU port
> should be selected by the switch for this broadcast packet, and by which
> mechanism?

AFAIK, the only option available to you (again, if there is no LAG set
up) is to statically choose one CPU port per entry. But hopefully Marek
can teach me some new tricks!

So for any known (since the broadcast address is loaded in the ATU it is
known) destination (b/m/u-cast), you can only "load balance" based on
the DA. You would also have to make sure that unknown unicast and
unknown multicast is only allowed to egress one of the CPU ports.

If you have a LAG OTOH, you could include all CPU ports in the port
vectors of those same entries. The LAG mask would then do the final
filtering so that you only send a single copy to the CPU.
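
To illustrate the non-LAG case, this is roughly what pinning broadcast
to a single CPU port could look like, as a hedged sketch on top of the
existing mv88e6xxx ATU helpers (not code from the driver, and error
handling omitted):

static int mv88e6xxx_bcast_to_one_cpu(struct mv88e6xxx_chip *chip,
                                      u16 fid, int cpu_port)
{
        struct mv88e6xxx_atu_entry entry = {
                .state   = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC,
                .trunk   = false,
                .portvec = BIT(cpu_port),
        };

        /* ff:ff:ff:ff:ff:ff, i.e. the broadcast DA */
        eth_broadcast_addr(entry.mac);

        return mv88e6xxx_g1_atu_loadpurge(chip, fid, &entry);
}

With .trunk = true the portvec field selects a trunk (LAG) instead, if
I read the ATU layout right, and that is where the hash-based member
filtering enters the picture.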

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 21:22         ` Tobias Waldekranz
  2021-04-12 21:34           ` Vladimir Oltean
@ 2021-04-12 21:50           ` Marek Behun
  2021-04-12 22:05             ` Tobias Waldekranz
  1 sibling, 1 reply; 68+ messages in thread
From: Marek Behun @ 2021-04-12 21:50 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, 12 Apr 2021 23:22:45 +0200
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:
> > On Mon, 12 Apr 2021 14:46:11 +0200
> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >  
> >> I agree. Unless you only have a few really wideband flows, a LAG will
> >> typically do a great job with balancing. This will happen without the
> >> user having to do any configuration at all. It would also perform well
> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> >> the same.  
> >
> > TLDR: The problem with LAGs as they are currently implemented is that
> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> > go via one CPU port anyway.
> >
> >
> >
> > One potential problem that I see with using LAGs for aggregating CPU
> > ports on mv88e6xxx is how these switches determine the port for a
> > packet: only the src and dst MAC address is used for the hash that
> > chooses the port.
> >
> > The most common scenario for Turris Omnia, for example, where we have 2
> > CPU ports and 5 user ports, is that into these 5 user ports the user
> > plugs 5 simple devices (no switches, so only one peer MAC address per
> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
> > chance that all packets would go through one CPU port.
> >
> > In order to have real load balancing in this scenario, we would either
> > have to recompute the LAG mask table depending on the MAC addresses, or
> > rewrite the LAG mask table somewhat randomly periodically. (This could
> > be in theory offloaded onto the Z80 internal CPU for some of the
> > switches of the mv88e6xxx family, but not for Omnia.)  
> 
> I thought that the option to associate each port netdev with a DSA
> master would only be used on transmit. Are you saying that there is a
> way to configure an mv88e6xxx chip to steer packets to different CPU
> ports depending on the incoming port?
>
> The reason that the traffic is directed towards the CPU is that some
> kind of entry in the ATU says so, and the destination of that entry will
> either be a port vector or a LAG. Of those two, only the LAG will offer
> any kind of balancing. What am I missing?

Via port vectors you can "load balance" by ports only, i.e. input port X
-> transmit via CPU port Y.

When using LAGs, you are load balancing via hash(src MAC | dst MAC)
only. This is better in some ways. But what I am saying is that if the
LAG mask table is static, as it is now implemented in mv88e6xxx code,
then for many scenarios there is a big probability of no load balancing
at all. For Turris Omnia for example there is 6.25% probability that
the switch chip will send all traffic to the CPU via one CPU port.
This is because the switch chooses the LAG port only from the hash of
dst+src MAC address. (By the 1/16 = 6.25% probability I mean that for
about 1 in 16 customers, the switch would only use one port when sending
data to the CPU).

The round robin solution here is therefore better in this case.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 21:49             ` Tobias Waldekranz
@ 2021-04-12 21:56               ` Marek Behun
  2021-04-12 22:06               ` Vladimir Oltean
  1 sibling, 0 replies; 68+ messages in thread
From: Marek Behun @ 2021-04-12 21:56 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, 12 Apr 2021 23:49:22 +0200
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> On Tue, Apr 13, 2021 at 00:34, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:  
> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:  
> >> > On Mon, 12 Apr 2021 14:46:11 +0200
> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >> >  
> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
> >> >> typically do a great job with balancing. This will happen without the
> >> >> user having to do any configuration at all. It would also perform well
> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> >> >> the same.  
> >> >
> >> > TLDR: The problem with LAGs as they are currently implemented is that
> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> >> > go via one CPU port anyway.
> >> >
> >> >
> >> >
> >> > One potential problem that I see with using LAGs for aggregating CPU
> >> > ports on mv88e6xxx is how these switches determine the port for a
> >> > packet: only the src and dst MAC address is used for the hash that
> >> > chooses the port.
> >> >
> >> > The most common scenario for Turris Omnia, for example, where we have 2
> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
> >> > plugs 5 simple devices (no switches, so only one peer MAC address per
> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> >> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
> >> > chance that all packets would go through one CPU port.
> >> >
> >> > In order to have real load balancing in this scenario, we would either
> >> > have to recompute the LAG mask table depending on the MAC addresses, or
> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
> >> > be in theory offloaded onto the Z80 internal CPU for some of the
> >> > switches of the mv88e6xxx family, but not for Omnia.)  
> >> 
> >> I thought that the option to associate each port netdev with a DSA
> >> master would only be used on transmit. Are you saying that there is a
> >> way to configure an mv88e6xxx chip to steer packets to different CPU
> >> ports depending on the incoming port?
> >> 
> >> The reason that the traffic is directed towards the CPU is that some
> >> kind of entry in the ATU says so, and the destination of that entry will
> >> either be a port vector or a LAG. Of those two, only the LAG will offer
> >> any kind of balancing. What am I missing?
> >> 
> >> Transmit is easy; you are already in the CPU, so you can use an
> >> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
> >> in that case.  
> >
> > Say a user port receives a broadcast frame. Based on your understanding
> > where user-to-CPU port assignments are used only for TX, which CPU port
> > should be selected by the switch for this broadcast packet, and by which
> > mechanism?  
> 
> AFAIK, the only option available to you (again, if there is no LAG set
> up) is to statically choose one CPU port per entry. But hopefully Marek
> can teach me some new tricks!
> 
> So for any known (since the broadcast address is loaded in the ATU it is
> known) destination (b/m/u-cast), you can only "load balance" based on
> the DA. You would also have to make sure that unknown unicast and
> unknown multicast is only allowed to egress one of the CPU ports.
> 
> If you have a LAG OTOH, you could include all CPU ports in the port
> vectors of those same entries. The LAG mask would then do the final
> filtering so that you only send a single copy to the CPU.

The problem is that when the mv88e6xxx switch chooses the LAG entry, it
takes into account only hash(src MAC | dst MAC). There is no other
option; Marvell switches are unable to take more information into this
decision (for example hash of the IP + TCP/UDP header).

And in many use cases, there are only a couple of these (src,dst) MAC
pairs. On Turris Omnia in most cases there are only 5 such pairs,
because the user plugs into the router only 5 devices.

So for each of these 5 (src,dst) MAC pairs, there is probability 1/2
that the packet will be sent via CPU port 0. So 1/32 probability that
all packets will be sent via CPU port 0, and the same probability that
all packets will be sent via CPU port 1.

This means that in 1/16 of cases the LAG is useless in this scenario.
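
The 1/16 figure is easy to double-check by enumerating all assignments
in plain userspace C (standalone, nothing kernel-specific):

#include <stdio.h>

int main(void)
{
        int all_same = 0;

        /* One bit per (src,dst) MAC pair: 0 = CPU port 0, 1 = CPU port 1.
         * Count the assignments where every pair lands on the same port. */
        for (int mask = 0; mask < 32; mask++)
                if (mask == 0 || mask == 31)
                        all_same++;

        printf("%d/32 = 1/%d\n", all_same, 32 / all_same); /* 2/32 = 1/16 */
        return 0;
}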

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 16:32       ` Vladimir Oltean
@ 2021-04-12 22:04         ` Marek Behun
  2021-04-12 22:17           ` Vladimir Oltean
  0 siblings, 1 reply; 68+ messages in thread
From: Marek Behun @ 2021-04-12 22:04 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: DENG Qingfang, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, 12 Apr 2021 19:32:11 +0300
Vladimir Oltean <olteanv@gmail.com> wrote:

> On Mon, Apr 12, 2021 at 11:00:45PM +0800, DENG Qingfang wrote:
> > On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:  
> > >
> > > So I'd be tempted to say 'tough luck' if all your ports are not up, and
> > > the ones that are are assigned statically to the same CPU port. It's a
> > > compromise between flexibility and simplicity, and I would go for
> > > simplicity here. That's the most you can achieve with static assignment,
> > > just put the CPU ports in a LAG if you want better dynamic load balancing
> > > (for details read on below).
> > >  
> >
> > Many switches such as mv88e6xxx only support MAC DA/SA load balancing,
> > which make it not ideal in router application (Router WAN <--> ISP BRAS
> > traffic will always have the same DA/SA and thus use only one port).  
> 
> Is this supposed to make a difference? Choose a better switch vendor!

:-) Are you saying that we shall abandon trying to make the DSA
subsystem work with better performance for our routers, in order to
punish ourselves for our bad decision to use Marvell switches?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 21:50           ` Marek Behun
@ 2021-04-12 22:05             ` Tobias Waldekranz
  2021-04-12 22:55               ` Marek Behun
  0 siblings, 1 reply; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-12 22:05 UTC (permalink / raw)
  To: Marek Behun
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, Apr 12, 2021 at 23:50, Marek Behun <marek.behun@nic.cz> wrote:
> On Mon, 12 Apr 2021 23:22:45 +0200
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:
>> > On Mon, 12 Apr 2021 14:46:11 +0200
>> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>> >  
>> >> I agree. Unless you only have a few really wideband flows, a LAG will
>> >> typically do a great job with balancing. This will happen without the
>> >> user having to do any configuration at all. It would also perform well
>> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
>> >> the same.  
>> >
>> > TLDR: The problem with LAGs as they are currently implemented is that
>> > for Turris Omnia, basically in 1/16 of configurations the traffic would
>> > go via one CPU port anyway.
>> >
>> >
>> >
>> > One potential problem that I see with using LAGs for aggregating CPU
>> > ports on mv88e6xxx is how these switches determine the port for a
>> > packet: only the src and dst MAC address is used for the hash that
>> > chooses the port.
>> >
>> > The most common scenario for Turris Omnia, for example, where we have 2
>> > CPU ports and 5 user ports, is that into these 5 user ports the user
>> > plugs 5 simple devices (no switches, so only one peer MAC address per
>> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
>> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
>> > chance that all packets would go through one CPU port.
>> >
>> > In order to have real load balancing in this scenario, we would either
>> > have to recompute the LAG mask table depending on the MAC addresses, or
>> > rewrite the LAG mask table somewhat randomly periodically. (This could
>> > be in theory offloaded onto the Z80 internal CPU for some of the
>> > switches of the mv88e6xxx family, but not for Omnia.)  
>> 
>> I thought that the option to associate each port netdev with a DSA
>> master would only be used on transmit. Are you saying that there is a
>> way to configure an mv88e6xxx chip to steer packets to different CPU
>> ports depending on the incoming port?
>>
>> The reason that the traffic is directed towards the CPU is that some
>> kind of entry in the ATU says so, and the destination of that entry will
>> either be a port vector or a LAG. Of those two, only the LAG will offer
>> any kind of balancing. What am I missing?
>
> Via port vectors you can "load balance" by ports only, i.e. input port X
> -> transmit via CPU port Y.

How is this done? In a case where there is no bridging between the
ports, then I understand. Each port could have its own FID. But if you
have this setup...

   br0    wan
   / \
lan0 lan1

lan0 and lan1 would use the same FID. So how could you say that frames
from lan0 should go to cpu0 and frames from lan1 should go to cpu1 if
the DA is the same? What would be the content of the ATU in a setup
like that?

> When using LAGs, you are load balancing via hash(src MAC | dst MAC)
> only. This is better in some ways. But what I am saying is that if the
> LAG mask table is static, as it is now implemented in mv88e6xxx code,
> then for many scenarios there is a big probability of no load balancing
> at all. For Turris Omnia for example there is 6.25% probability that
> the switch chip will send all traffic to the CPU via one CPU port.
> This is because the switch chooses the LAG port only from the hash of
> dst+src MAC address. (By the 1/16 = 6.25% probability I mean that for
> about 1 in 16 customers, the switch would only use one port when sending
> data to the CPU).
>
> The round robin solution here is therefore better in this case.

I agree that it would be better in that case. I just do not get how you
get the switch to do it for you.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 21:49             ` Tobias Waldekranz
  2021-04-12 21:56               ` Marek Behun
@ 2021-04-12 22:06               ` Vladimir Oltean
  2021-04-12 22:26                 ` Tobias Waldekranz
  1 sibling, 1 reply; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-12 22:06 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Mon, Apr 12, 2021 at 11:49:22PM +0200, Tobias Waldekranz wrote:
> On Tue, Apr 13, 2021 at 00:34, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:
> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:
> >> > On Mon, 12 Apr 2021 14:46:11 +0200
> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >> >
> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
> >> >> typically do a great job with balancing. This will happen without the
> >> >> user having to do any configuration at all. It would also perform well
> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> >> >> the same.
> >> >
> >> > TLDR: The problem with LAGs as they are currently implemented is that
> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> >> > go via one CPU port anyway.
> >> >
> >> >
> >> >
> >> > One potential problem that I see with using LAGs for aggregating CPU
> >> > ports on mv88e6xxx is how these switches determine the port for a
> >> > packet: only the src and dst MAC address is used for the hash that
> >> > chooses the port.
> >> >
> >> > The most common scenario for Turris Omnia, for example, where we have 2
> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
> >> > plugs 5 simple devices (no switches, so only one peer MAC address per
> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> >> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
> >> > chance that all packets would go through one CPU port.
> >> >
> >> > In order to have real load balancing in this scenario, we would either
> >> > have to recompute the LAG mask table depending on the MAC addresses, or
> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
> >> > be in theory offloaded onto the Z80 internal CPU for some of the
> >> > switches of the mv88e6xxx family, but not for Omnia.)
> >> 
> >> I thought that the option to associate each port netdev with a DSA
> >> master would only be used on transmit. Are you saying that there is a
> >> way to configure an mv88e6xxx chip to steer packets to different CPU
> >> ports depending on the incoming port?
> >> 
> >> The reason that the traffic is directed towards the CPU is that some
> >> kind of entry in the ATU says so, and the destination of that entry will
> >> either be a port vector or a LAG. Of those two, only the LAG will offer
> >> any kind of balancing. What am I missing?
> >> 
> >> Transmit is easy; you are already in the CPU, so you can use an
> >> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
> >> in that case.
> >
> > Say a user port receives a broadcast frame. Based on your understanding
> > where user-to-CPU port assignments are used only for TX, which CPU port
> > should be selected by the switch for this broadcast packet, and by which
> > mechanism?
> 
> AFAIK, the only option available to you (again, if there is no LAG set
> up) is to statically choose one CPU port per entry. But hopefully Marek
> can teach me some new tricks!
> 
> So for any known (since the broadcast address is loaded in the ATU it is
> known) destination (b/m/u-cast), you can only "load balance" based on
> the DA. You would also have to make sure that unknown unicast and
> unknown multicast is only allowed to egress one of the CPU ports.
> 
> If you have a LAG OTOH, you could include all CPU ports in the port
> vectors of those same entries. The LAG mask would then do the final
> filtering so that you only send a single copy to the CPU.

I forgot that mv88e6xxx keeps the broadcast address in the ATU. I wanted
to know what is done in the flooding case, so I should have asked
about unknown destination traffic. It is sent to one CPU but not the
other based on what information?

And for destinations loaded into the ATU, how is user port isolation
performed? Say lan0 and lan1 have the same MAC address of 00:01:02:03:04:05,
but lan0 goes to the eth0 DSA master and lan1 goes to eth1. How many ATU
entries would there be for host addresses, and towards which port would
they point? Are they isolated by a port private VLAN or something along
those lines?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 22:04         ` Marek Behun
@ 2021-04-12 22:17           ` Vladimir Oltean
  2021-04-12 22:47             ` Marek Behun
  0 siblings, 1 reply; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-12 22:17 UTC (permalink / raw)
  To: Marek Behun
  Cc: DENG Qingfang, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, Apr 13, 2021 at 12:04:57AM +0200, Marek Behun wrote:
> On Mon, 12 Apr 2021 19:32:11 +0300
> Vladimir Oltean <olteanv@gmail.com> wrote:
>
> > On Mon, Apr 12, 2021 at 11:00:45PM +0800, DENG Qingfang wrote:
> > > On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:
> > > >
> > > > So I'd be tempted to say 'tough luck' if all your ports are not up, and
> > > > the ones that are are assigned statically to the same CPU port. It's a
> > > > compromise between flexibility and simplicity, and I would go for
> > > > simplicity here. That's the most you can achieve with static assignment,
> > > > just put the CPU ports in a LAG if you want better dynamic load balancing
> > > > (for details read on below).
> > > >
> > >
> > > Many switches such as mv88e6xxx only support MAC DA/SA load balancing,
> > > which make it not ideal in router application (Router WAN <--> ISP BRAS
> > > traffic will always have the same DA/SA and thus use only one port).
> >
> > Is this supposed to make a difference? Choose a better switch vendor!
>
> :-) Are you saying that we shall abandon trying to make the DSA
> subsystem work with better performance for our routers, in order to
> punish ourselves for our bad decision to use Marvell switches?

No, not at all, I just don't understand what point you and
Qingfang are trying to make. LAG is useful in general for load balancing.
With the particular case of point-to-point links with Marvell Linkstreet,
not so much. Okay. With a different workload, maybe it is useful with
Marvell Linkstreet too. Again okay. Same for static assignment,
sometimes it is what is needed and sometimes it just isn't.
It was proposed that you write up a user space program that picks the
CPU port assignment based on your favorite metric and just tells DSA to
reconfigure itself, either using a custom fancy static assignment based
on traffic rate (read MIB counters every minute) or simply based on LAG.
All the data laid out so far would indicate that this would give you the
flexibility you need, however you didn't leave any comment on that,
either acknowledging or explaining why it wouldn't be what you want.
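
For the record, the user space side of that would be small. A rough
sketch, assuming the writable iflink from this RFC can be reached with
a plain RTM_SETLINK carrying IFLA_LINK (whether that is the exact
attribute path depends on how patch 2 wires up the netlink handler;
ACK processing omitted):

#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Point user port `port_ifindex` at CPU port `cpu_ifindex`. */
static int set_cpu_port(int port_ifindex, int cpu_ifindex)
{
        struct {
                struct nlmsghdr  nh;
                struct ifinfomsg ifi;
                struct rtattr    rta;
                int              link;
        } req = {
                .nh.nlmsg_len   = sizeof(req),
                .nh.nlmsg_type  = RTM_SETLINK,
                .nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
                .ifi.ifi_family = AF_UNSPEC,
                .ifi.ifi_index  = port_ifindex,
                .rta.rta_type   = IFLA_LINK,
                .rta.rta_len    = RTA_LENGTH(sizeof(int)),
                .link           = cpu_ifindex,
        };
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
        ssize_t ret;

        if (fd < 0)
                return -1;

        ret = send(fd, &req, sizeof(req), 0);
        close(fd);

        return ret == sizeof(req) ? 0 : -1;
}

A daemon could then read the MIB counters periodically and call this
with whatever assignment its metric prefers.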

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 22:06               ` Vladimir Oltean
@ 2021-04-12 22:26                 ` Tobias Waldekranz
  2021-04-12 22:48                   ` Vladimir Oltean
  0 siblings, 1 reply; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-12 22:26 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, Apr 13, 2021 at 01:06, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Apr 12, 2021 at 11:49:22PM +0200, Tobias Waldekranz wrote:
>> On Tue, Apr 13, 2021 at 00:34, Vladimir Oltean <olteanv@gmail.com> wrote:
>> > On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:
>> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:
>> >> > On Mon, 12 Apr 2021 14:46:11 +0200
>> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>> >> >
>> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
>> >> >> typically do a great job with balancing. This will happen without the
>> >> >> user having to do any configuration at all. It would also perform well
>> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
>> >> >> the same.
>> >> >
>> >> > TLDR: The problem with LAGs as they are currently implemented is that
>> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
>> >> > go via one CPU port anyway.
>> >> >
>> >> >
>> >> >
>> >> > One potential problem that I see with using LAGs for aggregating CPU
>> >> > ports on mv88e6xxx is how these switches determine the port for a
>> >> > packet: only the src and dst MAC address is used for the hash that
>> >> > chooses the port.
>> >> >
>> >> > The most common scenario for Turris Omnia, for example, where we have 2
>> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
>> >> > plugs 5 simple devices (no switches, so only one peer MAC address per
>> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
>> >> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
>> >> > chance that all packets would go through one CPU port.
>> >> >
>> >> > In order to have real load balancing in this scenario, we would either
>> >> > have to recompute the LAG mask table depending on the MAC addresses, or
>> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
>> >> > be in theory offloaded onto the Z80 internal CPU for some of the
>> >> > switches of the mv88e6xxx family, but not for Omnia.)
>> >> 
>> >> I thought that the option to associate each port netdev with a DSA
>> >> master would only be used on transmit. Are you saying that there is a
>> >> way to configure an mv88e6xxx chip to steer packets to different CPU
>> >> ports depending on the incoming port?
>> >> 
>> >> The reason that the traffic is directed towards the CPU is that some
>> >> kind of entry in the ATU says so, and the destination of that entry will
>> >> either be a port vector or a LAG. Of those two, only the LAG will offer
>> >> any kind of balancing. What am I missing?
>> >> 
>> >> Transmit is easy; you are already in the CPU, so you can use an
>> >> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
>> >> in that case.
>> >
>> > Say a user port receives a broadcast frame. Based on your understanding
>> > where user-to-CPU port assignments are used only for TX, which CPU port
>> > should be selected by the switch for this broadcast packet, and by which
>> > mechanism?
>> 
>> AFAIK, the only option available to you (again, if there is no LAG set
>> up) is to statically choose one CPU port per entry. But hopefully Marek
>> can teach me some new tricks!
>> 
>> So for any known (since the broadcast address is loaded in the ATU it is
>> known) destination (b/m/u-cast), you can only "load balance" based on
>> the DA. You would also have to make sure that unknown unicast and
>> unknown multicast is only allowed to egress one of the CPU ports.
>> 
>> If you have a LAG OTOH, you could include all CPU ports in the port
>> vectors of those same entries. The LAG mask would then do the final
>> filtering so that you only send a single copy to the CPU.
>
> I forgot that mv88e6xxx keeps the broadcast address in the ATU. I wanted
> to know what is done in the flooding case, therefore I should have asked
> about unknown destination traffic. It is sent to one CPU but not the
> other based on what information?
>
> And for destinations loaded into the ATU, how is user port isolation
> performed? Say lan0 and lan1 have the same MAC address of 00:01:02:03:04:05,
> but lan0 goes to the eth0 DSA master and lan1 goes to eth1. How many ATU
> entries would there be for host addresses, and towards which port would
> they point? Are they isolated by a port private VLAN or something along
> those lines?

This is what I do not understand. This is what I hope that Marek can
tell me. To my knowledge, there is no way to do per-port load balancing
from the switch to the CPU. In any given FID, there can be only one
entry per address, and that entry can only point to either a vector or a
LAG.

So my theory is that the only way of getting any load balancing, however
flawed, on receive (from switch to CPU) is by setting up a
LAG. Hopefully there is some trick that I do not know about which means
we have another option available to us.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 22:17           ` Vladimir Oltean
@ 2021-04-12 22:47             ` Marek Behun
  0 siblings, 0 replies; 68+ messages in thread
From: Marek Behun @ 2021-04-12 22:47 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: DENG Qingfang, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, 13 Apr 2021 01:17:21 +0300
Vladimir Oltean <olteanv@gmail.com> wrote:

> On Tue, Apr 13, 2021 at 12:04:57AM +0200, Marek Behun wrote:
> > On Mon, 12 Apr 2021 19:32:11 +0300
> > Vladimir Oltean <olteanv@gmail.com> wrote:
> >  
> > > On Mon, Apr 12, 2021 at 11:00:45PM +0800, DENG Qingfang wrote:  
> > > > On Sun, Apr 11, 2021 at 09:50:17PM +0300, Vladimir Oltean wrote:  
> > > > >
> > > > > So I'd be tempted to say 'tough luck' if all your ports are not up, and
> > > > > the ones that are are assigned statically to the same CPU port. It's a
> > > > > compromise between flexibility and simplicity, and I would go for
> > > > > simplicity here. That's the most you can achieve with static assignment,
> > > > > just put the CPU ports in a LAG if you want better dynamic load balancing
> > > > > (for details read on below).
> > > > >  
> > > >
> > > > Many switches such as mv88e6xxx only support MAC DA/SA load balancing,
> > > > which make it not ideal in router application (Router WAN <--> ISP BRAS
> > > > traffic will always have the same DA/SA and thus use only one port).  
> > >
> > > Is this supposed to make a difference? Choose a better switch vendor!  
> >
> > :-) Are you saying that we shall abandon trying to make the DSA
> > subsystem work with better performace for our routers, in order to
> > punish ourselves for our bad decision to use Marvell switches?  
> 
> No, not at all, I just don't understand what point you and
> Qingfang are trying to make.

I am not trying to make a point for this patch series. I have not touched
it since the last time I sent it. Ansuel just took over this series and
I am just contributing my thoughts to the RFC :)

I agree with you that this patch series still needs a lot of work.

> LAG is useful in general for load balancing.
> With the particular case of point-to-point links with Marvell Linkstreet,
> not so much. Okay. With a different workload, maybe it is useful with
> Marvell Linkstreet too. Again okay. Same for static assignment,
> sometimes it is what is needed and sometimes it just isn't.
> It was proposed that you write up a user space program that picks the
> CPU port assignment based on your favorite metric and just tells DSA to
> reconfigure itself, either using a custom fancy static assignment based
> on traffic rate (read MIB counters every minute) or simply based on LAG.
> All the data laid out so far would indicate that this would give you the
> flexibility you need, however you didn't leave any comment on that,
> either acknowledging or explaining why it wouldn't be what you want.

Yes, you are right. A custom userspace utility for assigning CPU ports
would be better here than adding a lot of complexity to the kernel
abstraction.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 22:26                 ` Tobias Waldekranz
@ 2021-04-12 22:48                   ` Vladimir Oltean
  2021-04-12 23:04                     ` Marek Behun
  0 siblings, 1 reply; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-12 22:48 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, Apr 13, 2021 at 12:26:52AM +0200, Tobias Waldekranz wrote:
> On Tue, Apr 13, 2021 at 01:06, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Mon, Apr 12, 2021 at 11:49:22PM +0200, Tobias Waldekranz wrote:
> >> On Tue, Apr 13, 2021 at 00:34, Vladimir Oltean <olteanv@gmail.com> wrote:
> >> > On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:
> >> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:
> >> >> > On Mon, 12 Apr 2021 14:46:11 +0200
> >> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >> >> >
> >> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
> >> >> >> typically do a great job with balancing. This will happen without the
> >> >> >> user having to do any configuration at all. It would also perform well
> >> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> >> >> >> the same.
> >> >> >
> >> >> > TLDR: The problem with LAGs as they are currently implemented is that
> >> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> >> >> > go via one CPU port anyway.
> >> >> >
> >> >> >
> >> >> >
> >> >> > One potential problem that I see with using LAGs for aggregating CPU
> >> >> > ports on mv88e6xxx is how these switches determine the port for a
> >> >> > packet: only the src and dst MAC address is used for the hash that
> >> >> > chooses the port.
> >> >> >
> >> >> > The most common scenario for Turris Omnia, for example, where we have 2
> >> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
> >> >> > plugs 5 simple devices (no switches, so only one peer MAC address per
> >> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> >> >> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
> >> >> > chance that all packets would go through one CPU port.
> >> >> >
> >> >> > In order to have real load balancing in this scenario, we would either
> >> >> > have to recompute the LAG mask table depending on the MAC addresses, or
> >> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
> >> >> > be in theory offloaded onto the Z80 internal CPU for some of the
> >> >> > switches of the mv88e6xxx family, but not for Omnia.)
> >> >> 
> >> >> I thought that the option to associate each port netdev with a DSA
> >> >> master would only be used on transmit. Are you saying that there is a
> >> >> way to configure an mv88e6xxx chip to steer packets to different CPU
> >> >> ports depending on the incoming port?
> >> >> 
> >> >> The reason that the traffic is directed towards the CPU is that some
> >> >> kind of entry in the ATU says so, and the destination of that entry will
> >> >> either be a port vector or a LAG. Of those two, only the LAG will offer
> >> >> any kind of balancing. What am I missing?
> >> >> 
> >> >> Transmit is easy; you are already in the CPU, so you can use an
> >> >> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
> >> >> in that case.
> >> >
> >> > Say a user port receives a broadcast frame. Based on your understanding
> >> > where user-to-CPU port assignments are used only for TX, which CPU port
> >> > should be selected by the switch for this broadcast packet, and by which
> >> > mechanism?
> >> 
> >> AFAIK, the only option available to you (again, if there is no LAG set
> >> up) is to statically choose one CPU port per entry. But hopefully Marek
> >> can teach me some new tricks!
> >> 
> >> So for any known (since the broadcast address is loaded in the ATU it is
> >> known) destination (b/m/u-cast), you can only "load balance" based on
> >> the DA. You would also have to make sure that unknown unicast and
> >> unknown multicast is only allowed to egress one of the CPU ports.
> >> 
> >> If you have a LAG OTOH, you could include all CPU ports in the port
> >> vectors of those same entries. The LAG mask would then do the final
> >> filtering so that you only send a single copy to the CPU.
> >
> > I forgot that mv88e6xxx keeps the broadcast address in the ATU. I wanted
> > to know what is done in the flooding case, therefore I should have asked
> > about unknown destination traffic. It is sent to one CPU but not the
> > other based on what information?
> >
> > And for destinations loaded into the ATU, how is user port isolation
> > performed? Say lan0 and lan1 have the same MAC address of 00:01:02:03:04:05,
> > but lan0 goes to the eth0 DSA master and lan1 goes to eth1. How many ATU
> > entries would there be for host addresses, and towards which port would
> > they point? Are they isolated by a port private VLAN or something along
> > those lines?
> 
> This is what I do not understand. This is what I hope that Marek can
> tell me. To my knowledge, there is no way to do per-port load balancing
> from the switch to the CPU. In any given FID, there can be only one
> entry per address, and that entry can only point to either a vector or a
> LAG.
> 
> So my theory is that the only way of getting any load balancing, however
> flawed, on receive (from switch to CPU) is by setting up a
> LAG. Hopefully there is some trick that I do not know about which means
> we have another option available to us.

Understood. So, as far as you know, given the Marvell Linkstreet hardware
capabilities, it isn't possible to do a clean-cut "all traffic from port
X goes to CPU port A and none to B", but instead it's more of a mushy
mess like "unknown unicast is flooded to CPU port A, unknown multicast
to CPU port B, MAC address 00:01:02:03:04:05 may go to CPU port A, MAC
address 00:01:02:03:04:06 to CPU port B". Basically an open-coded mess
of a LAG handled by some logic like DSA, once the RX filtering series
gets merged. Until then, all traffic to the CPU is unknown-destination
traffic as long as I know the mv88e6xxx (due to that limitation where it
doesn't learn from the MAC SA of FROM_CPU packets, and DSA does not
install into the ATU any of the host addresses, nor does it send any
FORWARD frames). But if this is the case and everything towards the CPU
is currently flooded, what sort of load balancing do we even have?
Between unknown unicast and unknown multicast? :)

So excuse me for believing that the hardware is capable of doing what
these 3 patches pretend without seeing the driver-side code!

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 22:05             ` Tobias Waldekranz
@ 2021-04-12 22:55               ` Marek Behun
  2021-04-12 23:09                 ` Tobias Waldekranz
  0 siblings, 1 reply; 68+ messages in thread
From: Marek Behun @ 2021-04-12 22:55 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, 13 Apr 2021 00:05:51 +0200
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> On Mon, Apr 12, 2021 at 23:50, Marek Behun <marek.behun@nic.cz> wrote:
> > On Mon, 12 Apr 2021 23:22:45 +0200
> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >  
> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:  
> >> > On Mon, 12 Apr 2021 14:46:11 +0200
> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> >> >    
> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
> >> >> typically do a great job with balancing. This will happen without the
> >> >> user having to do any configuration at all. It would also perform well
> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> >> >> the same.    
> >> >
> >> > TLDR: The problem with LAGs as they are currently implemented is that
> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> >> > go via one CPU port anyway.
> >> >
> >> >
> >> >
> >> > One potential problem that I see with using LAGs for aggregating CPU
> >> > ports on mv88e6xxx is how these switches determine the port for a
> >> > packet: only the src and dst MAC address is used for the hash that
> >> > chooses the port.
> >> >
> >> > The most common scenario for Turris Omnia, for example, where we have 2
> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
> >> > plugs 5 simple devices (no switches, so only one peer MAC address per
> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> >> > fill the LAG table as it is done now, then there is 2 * 0.5^5 = 1/16
> >> > chance that all packets would go through one CPU port.
> >> >
> >> > In order to have real load balancing in this scenario, we would either
> >> > have to recompute the LAG mask table depending on the MAC addresses, or
> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
> >> > be in theory offloaded onto the Z80 internal CPU for some of the
> >> > switches of the mv88e6xxx family, but not for Omnia.)    
> >> 
> >> I thought that the option to associate each port netdev with a DSA
> >> master would only be used on transmit. Are you saying that there is a
> >> way to configure an mv88e6xxx chip to steer packets to different CPU
> >> ports depending on the incoming port?
> >>
> >> The reason that the traffic is directed towards the CPU is that some
> >> kind of entry in the ATU says so, and the destination of that entry will
> >> either be a port vector or a LAG. Of those two, only the LAG will offer
> >> any kind of balancing. What am I missing?  
> >
> > Via port vectors you can "load balance" by ports only, i.e. input port X  
> > -> transmit via CPU port Y.
> 
> How is this done? In a case where there is no bridging between the
> ports, then I understand. Each port could have its own FID. But if you
> have this setup...
> 
>    br0    wan
>    / \
> lan0 lan1
> 
> lan0 and lan1 would use the same FID. So how could you say that frames
> from lan0 should go to cpu0 and frames from lan1 should go to cpu1 if
> the DA is the same? What would be the content of the ATU in a setup
> like that?
> 
> > When using LAGs, you are load balancing via hash(src MAC | dst mac)
> > only. This is better in some ways. But what I am saying is that if the
> > LAG mask table is static, as it is now implemented in mv88e6xxx code,
> > then for many scenarios there is a big probability of no load balancing
> > at all. For Turris Omnia for example there is 6.25% probability that
> > the switch chip will send all traffic to the CPU via one CPU port.
> > This is because the switch chooses the LAG port only from the hash of
> > dst+src MAC address. (By the 1/16 = 6.25% probability I mean that for
> > circa 1 in 16 customers, the switch would only use one port when sending
> > data to the CPU).
> >
> > The round robin solution here is therefore better in this case.  
> 
> I agree that it would be better in that case. I just do not get how you
> get the switch to do it for you.

I thought that this is configured in the mv88e6xxx_port_vlan() function.
For each port, you specify via which ports data can egress.
So for ports 0, 2, 4 you can enable CPU port 0, and for ports 1 and 3
CPU port 1.

Am I wrong? I confess that I did not understand this down to the finest
details, so it is entirely possible that I am missing something
important and am completely wrong. Maybe this cannot be done.

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 22:48                   ` Vladimir Oltean
@ 2021-04-12 23:04                     ` Marek Behun
  0 siblings, 0 replies; 68+ messages in thread
From: Marek Behun @ 2021-04-12 23:04 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Tobias Waldekranz, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, 13 Apr 2021 01:48:05 +0300
Vladimir Oltean <olteanv@gmail.com> wrote:

> On Tue, Apr 13, 2021 at 12:26:52AM +0200, Tobias Waldekranz wrote:
> > On Tue, Apr 13, 2021 at 01:06, Vladimir Oltean <olteanv@gmail.com> wrote:  
> > > On Mon, Apr 12, 2021 at 11:49:22PM +0200, Tobias Waldekranz wrote:  
> > >> On Tue, Apr 13, 2021 at 00:34, Vladimir Oltean <olteanv@gmail.com> wrote:  
> > >> > On Mon, Apr 12, 2021 at 11:22:45PM +0200, Tobias Waldekranz wrote:  
> > >> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:  
> > >> >> > On Mon, 12 Apr 2021 14:46:11 +0200
> > >> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
> > >> >> >  
> > >> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
> > >> >> >> typically do a great job with balancing. This will happen without the
> > >> >> >> user having to do any configuration at all. It would also perform well
> > >> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
> > >> >> >> the same.  
> > >> >> >
> > >> >> > TLDR: The problem with LAGs as they are currently implemented is that
> > >> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
> > >> >> > go via one CPU port anyway.
> > >> >> >
> > >> >> >
> > >> >> >
> > >> >> > One potential problem that I see with using LAGs for aggregating CPU
> > >> >> > ports on mv88e6xxx is how these switches determine the port for a
> > >> >> > packet: only the src and dst MAC address is used for the hash that
> > >> >> > chooses the port.
> > >> >> >
> > >> >> > The most common scenario for Turris Omnia, for example, where we have 2
> > >> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
> > >> >> > plugs 5 simple devices (no switches, so only one peer MAC address per
> > >> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
> > >> >> > fill the LAG table as it is done now, then there is a 2 * 0.5^5 = 1/16
> > >> >> > chance that all packets would go through one CPU port.
> > >> >> >
> > >> >> > In order to have real load balancing in this scenario, we would either
> > >> >> > have to recompute the LAG mask table depending on the MAC addresses, or
> > >> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
> > >> >> > be in theory offloaded onto the Z80 internal CPU for some of the
> > >> >> > switches of the mv88e6xxx family, but not for Omnia.)  
> > >> >> 
> > >> >> I thought that the option to associate each port netdev with a DSA
> > >> >> master would only be used on transmit. Are you saying that there is a
> > >> >> way to configure an mv88e6xxx chip to steer packets to different CPU
> > >> >> ports depending on the incoming port?
> > >> >> 
> > >> >> The reason that the traffic is directed towards the CPU is that some
> > >> >> kind of entry in the ATU says so, and the destination of that entry will
> > >> >> either be a port vector or a LAG. Of those two, only the LAG will offer
> > >> >> any kind of balancing. What am I missing?
> > >> >> 
> > >> >> Transmit is easy; you are already in the CPU, so you can use an
> > >> >> arbitrarily fancy hashing algo/ebpf classifier/whatever to load balance
> > >> >> in that case.  
> > >> >
> > >> > Say a user port receives a broadcast frame. Based on your understanding
> > >> > where user-to-CPU port assignments are used only for TX, which CPU port
> > >> > should be selected by the switch for this broadcast packet, and by which
> > >> > mechanism?  
> > >> 
> > >> AFAIK, the only option available to you (again, if there is no LAG set
> > >> up) is to statically choose one CPU port per entry. But hopefully Marek
> > >> can teach me some new tricks!
> > >> 
> > >> So for any known (since the broadcast address is loaded in the ATU it is
> > >> known) destination (b/m/u-cast), you can only "load balance" based on
> > >> the DA. You would also have to make sure that unknown unicast and
> > >> unknown multicast is only allowed to egress one of the CPU ports.
> > >> 
> > >> If you have a LAG OTOH, you could include all CPU ports in the port
> > >> vectors of those same entries. The LAG mask would then do the final
> > >> filtering so that you only send a single copy to the CPU.  
> > >
> > > I forgot that mv88e6xxx keeps the broadcast address in the ATU. I wanted
> > > to know what is done in the flooding case, therefore I should have asked
> > > about unknown destination traffic. It is sent to one CPU but not the
> > > other based on what information?
> > >
> > > And for destinations loaded into the ATU, how is user port isolation
> > > performed? Say lan0 and lan1 have the same MAC address of 00:01:02:03:04:05,
> > > but lan0 goes to the eth0 DSA master and lan1 goes to eth1. How many ATU
> > > entries would there be for host addresses, and towards which port would
> > > they point? Are they isolated by a port private VLAN or something along
> > > those lines?  
> > 
> > This is what I do not understand. This is what I hope that Marek can
> > tell me. To my knowledge, there is no way to per-port load balancing
> > from the switch to the CPU. In any given FID, there can be only one
> > entry per address, and that entry can only point to either a vector or a
> > LAG.
> > 
> > So my theory is that the only way of getting any load balancing, however
> > flawed, on receive (from switch to CPU) is by setting up a
> > LAG. Hopefully there is some trick that I do not know about which means
> > we have another option available to us.  
> 
> Understood. So as far as you know the Marvell Linkstreet hardware
> capabilities, it isn't possible to do a clean-cut "all traffic from port
> X goes to CPU port A and none to B", but instead it's more of a mushy
> mess like "unknown unicast is flooded to CPU port A, unknown multicast
> to CPU port B, MAC address 00:01:02:03:04:05 may go to CPU port A, MAC
> address 00:01:02:03:04:06 to CPU port B". Basically an open-coded mess
> of a LAG handled by some logic like DSA, once the RX filtering series
> gets merged. Until then, all traffic to the CPU is unknown-destination
> traffic as far as I know on mv88e6xxx (due to that limitation where it
> doesn't learn from the MAC SA of FROM_CPU packets, and DSA does not
> install into the ATU any of the host addresses, nor does it send any
> FORWARD frames). But if this is the case and everything towards the CPU
> is currently flooded, what sort of load balancing do we even have?
> Between unknown unicast and unknown multicast? :)
> 
> So excuse me for believing that the hardware is capable of doing what
> these 3 patches pretend without seeing the driver-side code!

I just now noticed that this series does not include the proposed code
change for mv88e6xxx.

I am attaching below a patch we use for our TurrisOS 5.4 kernel that
uses this API for Omnia in the mv88e6xxx driver.

Subject: [PATCH] net: dsa: mv88e6xxx: support multi-CPU DSA
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add support for multi-CPU DSA for mv88e6xxx.
This currently only works with multiple CPU ports when there is only
one switch in the switch tree.

Signed-off-by: Marek Behún <marek.behun@nic.cz>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 48 ++++++++++++++++++++++++++++++--
 1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 33b391376352..804ba563540e 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -1080,6 +1080,7 @@ static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
 {
 	struct dsa_switch *ds = NULL;
 	struct net_device *br;
+	u8 upstream;
 	u16 pvlan;
 	int i;
 
@@ -1091,17 +1092,36 @@ static u16 mv88e6xxx_port_vlan(struct mv88e6xxx_chip *chip, int dev, int port)
 		return 0;
 
 	/* Frames from DSA links and CPU ports can egress any local port */
-	if (dsa_is_cpu_port(ds, port) || dsa_is_dsa_port(ds, port))
+	if (dsa_is_dsa_port(ds, port))
 		return mv88e6xxx_port_mask(chip);
 
+	if (dsa_is_cpu_port(ds, port)) {
+		u16 pmask = mv88e6xxx_port_mask(chip);
+		pvlan = 0;
+
+		for (i = 0; i < mv88e6xxx_num_ports(chip); ++i) {
+			if (dsa_is_cpu_port(ds, i)) {
+				if (i == port)
+					pvlan |= BIT(i);
+				continue;
+			}
+			if ((pmask & BIT(i)) &&
+			     dsa_upstream_port(chip->ds, i) == port)
+				pvlan |= BIT(i);
+		}
+
+		return pvlan;
+	}
+
 	br = ds->ports[port].bridge_dev;
 	pvlan = 0;
 
 	/* Frames from user ports can egress any local DSA links and CPU ports,
 	  * as well as any local member of their bridge group.
 	  */
+	upstream = dsa_upstream_port(chip->ds, port);
 	for (i = 0; i < mv88e6xxx_num_ports(chip); ++i)
-		if (dsa_is_cpu_port(chip->ds, i) ||
+		if ((dsa_is_cpu_port(chip->ds, i) && i == upstream) ||
 		     dsa_is_dsa_port(chip->ds, i) ||
 		     (br && dsa_to_port(chip->ds, i)->bridge_dev == br))
 			pvlan |= BIT(i);
@@ -2388,6 +2408,7 @@ static int mv88e6xxx_setup_upstream_port(struct mv88e6xxx_chip *chip, int port)
 	}
 
 	if (port == upstream_port) {
+		dev_info(chip->dev, "Setting CPU port as port %i\n", port);
 		if (chip->info->ops->set_cpu_port) {
 			err = chip->info->ops->set_cpu_port(chip,
 							     upstream_port);
@@ -2406,6 +2427,28 @@ static int mv88e6xxx_setup_upstream_port(struct mv88e6xxx_chip *chip, int port)
 	return 0;
 }
 
+static int mv88e6xxx_port_change_cpu_port(struct dsa_switch *ds, int port,
+					   struct dsa_port *new_cpu_dp)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+	int err;
+
+	mv88e6xxx_reg_lock(chip);
+
+	err = mv88e6xxx_setup_upstream_port(chip, port);
+	if (err)
+		goto unlock;
+
+	err = mv88e6xxx_port_vlan_map(chip, port);
+	if (err)
+		goto unlock;
+
+unlock:
+	mv88e6xxx_reg_unlock(chip);
+
+	return err;
+}
+
 static int mv88e6xxx_setup_port(struct mv88e6xxx_chip *chip, int port)
 {
 	struct dsa_switch *ds = chip->ds;
@@ -4996,6 +5039,7 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
 	.port_hwtstamp_get	= mv88e6xxx_port_hwtstamp_get,
 	.port_txtstamp		= mv88e6xxx_port_txtstamp,
 	.port_rxtstamp		= mv88e6xxx_port_rxtstamp,
+	.port_change_cpu_port	= mv88e6xxx_port_change_cpu_port,
 	.get_ts_info		= mv88e6xxx_get_ts_info,
 };
 
-- 
2.24.1

^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 22:55               ` Marek Behun
@ 2021-04-12 23:09                 ` Tobias Waldekranz
  2021-04-12 23:13                   ` Tobias Waldekranz
  0 siblings, 1 reply; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-12 23:09 UTC (permalink / raw)
  To: Marek Behun
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, Apr 13, 2021 at 00:55, Marek Behun <marek.behun@nic.cz> wrote:
> On Tue, 13 Apr 2021 00:05:51 +0200
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> On Mon, Apr 12, 2021 at 23:50, Marek Behun <marek.behun@nic.cz> wrote:
>> > On Mon, 12 Apr 2021 23:22:45 +0200
>> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>> >  
>> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:  
>> >> > On Mon, 12 Apr 2021 14:46:11 +0200
>> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>> >> >    
>> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
>> >> >> typically do a great job with balancing. This will happen without the
>> >> >> user having to do any configuration at all. It would also perform well
>> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
>> >> >> the same.    
>> >> >
>> >> > TLDR: The problem with LAGs as they are currently implemented is that
>> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
>> >> > go via one CPU port anyway.
>> >> >
>> >> >
>> >> >
>> >> > One potential problem that I see with using LAGs for aggregating CPU
>> >> > ports on mv88e6xxx is how these switches determine the port for a
>> >> > packet: only the src and dst MAC address is used for the hash that
>> >> > chooses the port.
>> >> >
>> >> > The most common scenario for Turris Omnia, for example, where we have 2
>> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
>> >> > plugs 5 simple devices (no switches, so only one peer MAC address per
>> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
>> >> > fill the LAG table as it is done now, then there is a 2 * 0.5^5 = 1/16
>> >> > chance that all packets would go through one CPU port.
>> >> >
>> >> > In order to have real load balancing in this scenario, we would either
>> >> > have to recompute the LAG mask table depending on the MAC addresses, or
>> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
>> >> > be in theory offloaded onto the Z80 internal CPU for some of the
>> >> > switches of the mv88e6xxx family, but not for Omnia.)    
>> >> 
>> >> I thought that the option to associate each port netdev with a DSA
>> >> master would only be used on transmit. Are you saying that there is a
>> >> way to configure an mv88e6xxx chip to steer packets to different CPU
>> >> ports depending on the incoming port?
>> >>
>> >> The reason that the traffic is directed towards the CPU is that some
>> >> kind of entry in the ATU says so, and the destination of that entry will
>> >> either be a port vector or a LAG. Of those two, only the LAG will offer
>> >> any kind of balancing. What am I missing?  
>> >
>> > Via port vectors you can "load balance" by ports only, i.e. input port X  
>> > -> transmit via CPU port Y.
>> 
>> How is this done? In a case where there is no bridging between the
>> ports, then I understand. Each port could have its own FID. But if you
>> have this setup...
>> 
>>    br0    wan
>>    / \
>> lan0 lan1
>> 
>> lan0 and lan1 would use the same FID. So how could you say that frames
>> from lan0 should go to cpu0 and frames from lan1 should go to cpu1 if
>> the DA is the same? What would be the content of the ATU in a setup
>> like that?
>> 
>> > When using LAGs, you are load balancing via hash(src MAC | dst mac)
>> > only. This is better in some ways. But what I am saying is that if the
>> > LAG mask table is static, as it is now implemented in mv88e6xxx code,
>> > then for many scenarios there is a big probability of no load balancing
>> > at all. For Turris Omnia for example there is 6.25% probability that
>> > the switch chip will send all traffic to the CPU via one CPU port.
>> > This is because the switch chooses the LAG port only from the hash of
>> > dst+src MAC address. (By the 1/16 = 6.25% probability I mean that for
>> > circa 1 in 16 customers, the switch would only use one port when sending
>> > data to the CPU).
>> >
>> > The round robin solution here is therefore better in this case.  
>> 
>> I agree that it would be better in that case. I just do not get how you
>> get the switch to do it for you.
>
> I thought that this is configured in the mv88e6xxx_port_vlan() function.
> For each port, you specify via which ports data can egress.
> So for ports 0, 2, 4 you can enable CPU port 0, and for ports 1 and 3
> CPU port 1.

Ahh I see. Well the port based VLANs are just an isolation mechanism. So
with a setup like this...

    cpu0 cpu1 lan0 lan1
cpu0           x
cpu1                x
lan0 x              x
lan1      x    x

...you could get the isolation in place. But you will still look up the
DA in the ATU, and there you will find a destination of either cpu0 or
cpu1. So for one of the ports, the destination will be outside of its
port based VLAN. Once the vectors are ANDed together, it is left with no
valid port to egress through, and the packet is dropped.
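
In pseudo-C, the model I have of the egress decision is roughly this
(a sketch of my mental model, not actual RTL or driver code; the names
are made up):

u16 egress_vec(u16 atu_dest, const u16 *port_vlan, int src_port)
{
	/* ATU lookup result; for a unicast hit this has exactly one
	 * bit set, e.g. BIT(cpu0) _or_ BIT(cpu1), never both.
	 */
	u16 vec = atu_dest;

	/* Port based VLAN: the ports src_port may egress through. */
	vec &= port_vlan[src_port];

	/* If the ATU said cpu0 but src_port's port based VLAN only
	 * contains cpu1, vec is now 0 and the frame has nowhere to go.
	 */
	return vec;
}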

>> Am I wrong? I confess that I did not understand this down to the finest
> details, so it is entirely possible that I am missing something
> important and am completely wrong. Maybe this cannot be done.

I really doubt that it can be done. Not in any robust way at
least. Happy to be proven wrong though! :)

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 23:09                 ` Tobias Waldekranz
@ 2021-04-12 23:13                   ` Tobias Waldekranz
  2021-04-12 23:54                     ` Marek Behun
  0 siblings, 1 reply; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-12 23:13 UTC (permalink / raw)
  To: Marek Behun
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, Apr 13, 2021 at 01:09, Tobias Waldekranz <tobias@waldekranz.com> wrote:
> On Tue, Apr 13, 2021 at 00:55, Marek Behun <marek.behun@nic.cz> wrote:
>> On Tue, 13 Apr 2021 00:05:51 +0200
>> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>>
>>> On Mon, Apr 12, 2021 at 23:50, Marek Behun <marek.behun@nic.cz> wrote:
>>> > On Mon, 12 Apr 2021 23:22:45 +0200
>>> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>>> >  
>>> >> On Mon, Apr 12, 2021 at 21:30, Marek Behun <marek.behun@nic.cz> wrote:  
>>> >> > On Mon, 12 Apr 2021 14:46:11 +0200
>>> >> > Tobias Waldekranz <tobias@waldekranz.com> wrote:
>>> >> >    
>>> >> >> I agree. Unless you only have a few really wideband flows, a LAG will
>>> >> >> typically do a great job with balancing. This will happen without the
>>> >> >> user having to do any configuration at all. It would also perform well
>>> >> >> in "router-on-a-stick"-setups where the incoming and outgoing port is
>>> >> >> the same.    
>>> >> >
>>> >> > TLDR: The problem with LAGs as they are currently implemented is that
>>> >> > for Turris Omnia, basically in 1/16 of configurations the traffic would
>>> >> > go via one CPU port anyway.
>>> >> >
>>> >> >
>>> >> >
>>> >> > One potential problem that I see with using LAGs for aggregating CPU
>>> >> > ports on mv88e6xxx is how these switches determine the port for a
>>> >> > packet: only the src and dst MAC address is used for the hash that
>>> >> > chooses the port.
>>> >> >
>>> >> > The most common scenario for Turris Omnia, for example, where we have 2
>>> >> > CPU ports and 5 user ports, is that into these 5 user ports the user
>>> >> > plugs 5 simple devices (no switches, so only one peer MAC address per
>>> >> > port). So we have only 5 pairs of src + dst MAC addresses. If we simply
>>> >> > fill the LAG table as it is done now, then there is a 2 * 0.5^5 = 1/16
>>> >> > chance that all packets would go through one CPU port.
>>> >> >
>>> >> > In order to have real load balancing in this scenario, we would either
>>> >> > have to recompute the LAG mask table depending on the MAC addresses, or
>>> >> > rewrite the LAG mask table somewhat randomly periodically. (This could
>>> >> > be in theory offloaded onto the Z80 internal CPU for some of the
>>> >> > switches of the mv88e6xxx family, but not for Omnia.)    
>>> >> 
>>> >> I thought that the option to associate each port netdev with a DSA
>>> >> master would only be used on transmit. Are you saying that there is a
>>> >> way to configure an mv88e6xxx chip to steer packets to different CPU
>>> >> ports depending on the incoming port?
>>> >>
>>> >> The reason that the traffic is directed towards the CPU is that some
>>> >> kind of entry in the ATU says so, and the destination of that entry will
>>> >> either be a port vector or a LAG. Of those two, only the LAG will offer
>>> >> any kind of balancing. What am I missing?  
>>> >
>>> > Via port vectors you can "load balance" by ports only, i.e. input port X  
>>> > -> transmit via CPU port Y.
>>> 
>>> How is this done? In a case where there is no bridging between the
>>> ports, then I understand. Each port could have its own FID. But if you
>>> have this setup...
>>> 
>>>    br0    wan
>>>    / \
>>> lan0 lan1
>>> 
>>> lan0 and lan1 would use the same FID. So how could you say that frames
>>> from lan0 should go to cpu0 and frames from lan1 should go to cpu1 if
>>> the DA is the same? What would be the content of the ATU in a setup
>>> like that?
>>> 
>>> > When using LAGs, you are load balancing via hash(src MAC | dst mac)
>>> > only. This is better in some ways. But what I am saying is that if the
>>> > LAG mask table is static, as it is now implemented in mv88e6xxx code,
>>> > then for many scenarios there is a big probability of no load balancing
>>> > at all. For Turris Omnia for example there is 6.25% probability that
>>> > the switch chip will send all traffic to the CPU via one CPU port.
>>> > This is because the switch chooses the LAG port only from the hash of
>>> > dst+src MAC address. (By the 1/16 = 6.25% probability I mean that for
>>> > circa 1 in 16 customers, the switch would only use one port when sending
>>> > data to the CPU).
>>> >
>>> > The round robin solution here is therefore better in this case.  
>>> 
>>> I agree that it would be better in that case. I just do not get how you
>>> get the switch to do it for you.
>>
>> I thought that this is configured in the mv88e6xxx_port_vlan() function.
>> For each port, you specify via which ports data can egress.
>> So for ports 0, 2, 4 you can enable CPU port 0, and for ports 1 and 3
>> CPU port 1.
>
> Ahh I see. Well the port based VLANs are just an isolation mechanism. So
> with a setup like this...
>
>     cpu0 cpu1 lan0 lan1
> cpu0           x
> cpu1                x
> lan0 x              x
> lan1      x    x
>
> ...you could get the isolation in place. But you will still look up the
> DA in the ATU, and there you will find a destination of either cpu0 or
> cpu1. So for one of the ports, the destination will be outside of its
> port based VLAN. Once the vectors are ANDed together, it is left with no
> valid port to egress through, and the packet is dropped.
>
>> Am I wrong? I confess that I did not understand this down to the finest
>> details, so it is entirely possible that I am missing something
>> important and am completely wrong. Maybe this cannot be done.
>
> I really doubt that it can be done. Not in any robust way at
> least. Happy to be proven wrong though! :)

I think I figured out why it "works" for you. Since the CPU address is
never added to the ATU, traffic for it is treated as unknown. Thanks to
that, it is flooded and the isolation brings it together. As soon as
mv88e6xxx starts making use of Vladimir's offloading of host addresses
though, I suspect this will fall apart.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 23:13                   ` Tobias Waldekranz
@ 2021-04-12 23:54                     ` Marek Behun
  2021-04-13  0:27                       ` Marek Behun
  2021-04-13 14:40                       ` Tobias Waldekranz
  0 siblings, 2 replies; 68+ messages in thread
From: Marek Behun @ 2021-04-12 23:54 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, 13 Apr 2021 01:13:53 +0200
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> > ...you could get the isolation in place. But you will still look up the
> > DA in the ATU, and there you will find a destination of either cpu0 or
> > cpu1. So for one of the ports, the destination will be outside of its
> > port based VLAN. Once the vectors are ANDed together, it is left with no
> > valid port to egress through, and the packet is dropped.
> >  
> >> Am I wrong? I confess that I did not understand this down to the finest
> >> details, so it is entirely possible that I am missing something
> >> important and am completely wrong. Maybe this cannot be done.  
> >
> > I really doubt that it can be done. Not in any robust way at
> > least. Happy to be proven wrong though! :)  
> 
> I think I figured out why it "works" for you. Since the CPU address is
> never added to the ATU, traffic for it is treated as unknown. Thanks to
> that, it is flooded and the isolation brings it together. As soon as
> mv88e6xxx starts making use of Vladimir's offloading of host addresses
> though, I suspect this will fall apart.

Hmm :( This is bad news. I would really like to make it balance via
input ports. The LAG balancing for this use case is simply unacceptable,
since the switch puts so little information into the hash function.

I will look into this, maybe ask some follow-up questions.

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 23:54                     ` Marek Behun
@ 2021-04-13  0:27                       ` Marek Behun
  2021-04-13  0:31                         ` Marek Behun
  2021-04-13 14:46                         ` Tobias Waldekranz
  2021-04-13 14:40                       ` Tobias Waldekranz
  1 sibling, 2 replies; 68+ messages in thread
From: Marek Behun @ 2021-04-13  0:27 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, 13 Apr 2021 01:54:50 +0200
Marek Behun <marek.behun@nic.cz> wrote:

> I will look into this, maybe ask some follow-up questions.

Tobias,

it seems that currently the LAGs in mv88e6xxx driver do not use the
HashTrunk feature (which can be enabled via bit 11 of the
MV88E6XXX_G2_TRUNK_MAPPING register).

If we used this feature and if we knew what hash function it uses, we
could write a userspace tool that could recompute new MAC
addresses for the CPU ports in order to avoid the problem I explained
previously...

Or the tool can simply inject frames into the switch and try different
MAC addresses for the CPU ports until desired load-balancing is reached.

What do you think?

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-13  0:27                       ` Marek Behun
@ 2021-04-13  0:31                         ` Marek Behun
  2021-04-13 14:46                         ` Tobias Waldekranz
  1 sibling, 0 replies; 68+ messages in thread
From: Marek Behun @ 2021-04-13  0:31 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, 13 Apr 2021 02:27:30 +0200
Marek Behun <marek.behun@nic.cz> wrote:

> On Tue, 13 Apr 2021 01:54:50 +0200
> Marek Behun <marek.behun@nic.cz> wrote:
> 
> > I will look into this, maybe ask some follow-up questions.
> 
> Tobias,
> 
> it seems that currently the LAGs in mv88e6xxx driver do not use the
> HashTrunk feature (which can be enabled via bit 11 of the
> MV88E6XXX_G2_TRUNK_MAPPING register).
> 
> If we used this feature and if we knew what hash function it uses, we
> could write a userspace tool that could recompute new MAC
> addresses for the CPU ports in order to avoid the problem I explained
> previously...
> 
> Or the tool can simply inject frames into the switch and try different
> MAC addresses for the CPU ports until desired load-balancing is reached.
> 
> What do you think?
> 
> Marek

Although changing MAC addresses of the CPU ports each time some new
device comes into the network doesn't seem like a good idea, now that I
think about it. Hmm.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-12 23:54                     ` Marek Behun
  2021-04-13  0:27                       ` Marek Behun
@ 2021-04-13 14:40                       ` Tobias Waldekranz
  1 sibling, 0 replies; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-13 14:40 UTC (permalink / raw)
  To: Marek Behun
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, Apr 13, 2021 at 01:54, Marek Behun <marek.behun@nic.cz> wrote:
> On Tue, 13 Apr 2021 01:13:53 +0200
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> > ...you could get the isolation in place. But you will still lookup the
>> > DA in the ATU, and there you will find a destination of either cpu0 or
>> > cpu1. So for one of the ports, the destination will be outside of its
>> > port based VLAN. Once the vectors are ANDed together, it is left with no
>> > valid port to egress through, and the packet is dropped.
>> >  
>> >> Am I wrong? I confess that I did not understand this into the most fine
>> >> details, so it is entirely possible that I am missing something
>> >> important and am completely wrong. Maybe this cannot be done.  
>> >
>> > I really doubt that it can be done. Not in any robust way at
>> > least. Happy to be proven wrong though! :)  
>> 
>> I think I figured out why it "works" for you. Since the CPU address is
>> never added to the ATU, traffic for it is treated as unknown. Thanks to
>> that, it is flooded and the isolation brings it together. As soon as
>> mv88e6xxx starts making use of Vladimir's offloading of host addresses
>> though, I suspect this will fall apart.
>
> Hmm :( This is bad news. I would really like to make it balance via
> input ports. The LAG balancing for this use case is simply unacceptable,
> since the switch puts so little information into the hash function.

If you have the ports in standalone mode, you could imagine having each
port use its own FID. But then you cannot do L2 forwarding between the
LAN ports in hardware.

If you have a chip with a TCAM, you could theoretically use that to get
policy switching to the preferred CPU port. But since that would likely
run on top of TC flower or something, it is not obvious to me how you
would describe that kind of action.

Barring something like that, I think you will have to accept the
unacceptable :)

> I will look into this, maybe ask some follow-up questions.
>
> Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-13  0:27                       ` Marek Behun
  2021-04-13  0:31                         ` Marek Behun
@ 2021-04-13 14:46                         ` Tobias Waldekranz
  2021-04-13 15:14                           ` Marek Behun
  1 sibling, 1 reply; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-13 14:46 UTC (permalink / raw)
  To: Marek Behun
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, Apr 13, 2021 at 02:27, Marek Behun <marek.behun@nic.cz> wrote:
> On Tue, 13 Apr 2021 01:54:50 +0200
> Marek Behun <marek.behun@nic.cz> wrote:
>
>> I will look into this, maybe ask some follow-up questions.
>
> Tobias,
>
> it seems that currently the LAGs in mv88e6xxx driver do not use the
> HashTrunk feature (which can be enabled via bit 11 of the
> MV88E6XXX_G2_TRUNK_MAPPING register).

This should be set at the bottom of mv88e6xxx_lag_sync_masks.

> If we used this feature and if we knew what hash function it uses, we
> could write a userspace tool that could recompute new MAC
> addresses for the CPU ports in order to avoid the problem I explained
> previously...
>
> Or the tool can simply inject frames into the switch and try different
> MAC addresses for the CPU ports until desired load-balancing is reached.
>
> What do you think?

As you concluded in your followup, not being able to have a fixed MAC
for the CPU seems weird.

Maybe you could do the inverse? Allow userspace to set the masks for an
individual bond/team port in a hash-based LAG, then you can offload that
to DSA. Here there be dragons though, you need to ensure that there is
no intermediate config in which any buckets are enabled on multiple
ports.
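
To make the intermediate-config point concrete, a bucket move would
have to be ordered like this (a sketch; mv88e6xxx_lag_set_bucket() is
invented for illustration, nothing like it exists today):

/* Move hash bucket 'bucket' from LAG member 'from' to member 'to'
 * without it ever being enabled on both ports at once.
 */
static int lag_move_bucket(struct mv88e6xxx_chip *chip, int lag,
			   int bucket, int from, int to)
{
	int err;

	/* Clear first: briefly black-holing the bucket is safe... */
	err = mv88e6xxx_lag_set_bucket(chip, lag, bucket, from, false);
	if (err)
		return err;

	/* ...whereas enabling 'to' first would send duplicates. */
	return mv88e6xxx_lag_set_bucket(chip, lag, bucket, to, true);
}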


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-13 14:46                         ` Tobias Waldekranz
@ 2021-04-13 15:14                           ` Marek Behun
  2021-04-13 18:16                             ` Tobias Waldekranz
  0 siblings, 1 reply; 68+ messages in thread
From: Marek Behun @ 2021-04-13 15:14 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, 13 Apr 2021 16:46:32 +0200
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> On Tue, Apr 13, 2021 at 02:27, Marek Behun <marek.behun@nic.cz> wrote:
> > On Tue, 13 Apr 2021 01:54:50 +0200
> > Marek Behun <marek.behun@nic.cz> wrote:
> >  
> >> I will look into this, maybe ask some follow-up questions.  
> >
> > Tobias,
> >
> > it seems that currently the LAGs in mv88e6xxx driver do not use the
> > HashTrunk feature (which can be enabled via bit 11 of the
> > MV88E6XXX_G2_TRUNK_MAPPING register).  
> 
> This should be set at the bottom of mv88e6xxx_lag_sync_masks.
> 
> > If we used this feature and if we knew what hash function it uses, we
> > could write a userspace tool that could recompute new MAC
> > addresses for the CPU ports in order to avoid the problem I explained
> > previously...
> >
> > Or the tool can simply inject frames into the switch and try different
> > MAC addresses for the CPU ports until desired load-balancing is reached.
> >
> > What do you think?  
> 
> As you concluded in your followup, not being able to have a fixed MAC
> for the CPU seems weird.
> 
> Maybe you could do the inverse? Allow userspace to set the masks for an
> individual bond/team port in a hash-based LAG, then you can offload that
> to DSA. 

What masks?

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-13 15:14                           ` Marek Behun
@ 2021-04-13 18:16                             ` Tobias Waldekranz
  2021-04-14 15:14                               ` Marek Behun
  0 siblings, 1 reply; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-13 18:16 UTC (permalink / raw)
  To: Marek Behun
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, Apr 13, 2021 at 17:14, Marek Behun <marek.behun@nic.cz> wrote:
> On Tue, 13 Apr 2021 16:46:32 +0200
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> On Tue, Apr 13, 2021 at 02:27, Marek Behun <marek.behun@nic.cz> wrote:
>> > On Tue, 13 Apr 2021 01:54:50 +0200
>> > Marek Behun <marek.behun@nic.cz> wrote:
>> >  
>> >> I will look into this, maybe ask some follow-up questions.  
>> >
>> > Tobias,
>> >
>> > it seems that currently the LAGs in mv88e6xxx driver do not use the
>> > HashTrunk feature (which can be enabled via bit 11 of the
>> > MV88E6XXX_G2_TRUNK_MAPPING register).  
>> 
>> This should be set at the bottom of mv88e6xxx_lag_sync_masks.
>> 
>> > If we used this feature and if we knew what hash function it uses, we
>> > could write a userspace tool that could recompute new MAC
>> > addresses for the CPU ports in order to avoid the problem I explained
>> > previously...
>> >
>> > Or the tool can simply inject frames into the switch and try different
>> > MAC addresses for the CPU ports until desired load-balancing is reached.
>> >
>> > What do you think?  
>> 
>> As you concluded in your followup, not being able to have a fixed MAC
>> for the CPU seems weird.
>> 
>> Maybe you could do the inverse? Allow userspace to set the masks for an
>> individual bond/team port in a hash-based LAG, then you can offload that
>> to DSA. 
>
> What masks?

The table defined in Global2/Register7.

When a frame is mapped to a LAG (e.g. by an ATU lookup), all member
ports will be added to the frame's destination vector. The mask table is
the block that then filters the vector to only include a single
member.

By modifying that table, you can choose which buckets are assigned to
which member ports. This includes assigning 7 buckets to one member and
1 to the other for example.

At the moment, mv88e6xxx will statically determine this mapping (in
mv88e6xxx_lag_set_port_mask), by trying to spread the buckets as evenly
as possible. It will also rebalance the assignments whenever a link goes
down, or is "detached" in LACP terms.

You could imagine a different mode in which the DSA driver would receive
the bucket allocation from the bond/team driver (which in turn could
come all the way from userspace). Userspace could then implement
whatever strategy it wants to maximize utilization, though still bound
by the limitations of the hardware in terms of fields considered during
hashing of course.
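
As a sketch, such a mode could boil down to something like this
(set_trunk_mask_bucket() is invented, but the 8 buckets match the
Global2/Register7 layout):

/* Program an explicit bucket->member assignment instead of the even
 * spread mv88e6xxx_lag_sync_masks() computes today. 'weight' is how
 * many of the 8 buckets each member gets, e.g. { 7, 1 } for a 7:1
 * split between two ports.
 */
static int lag_set_weights(struct mv88e6xxx_chip *chip, int lag,
			   const int *member, const int *weight, int num)
{
	int m, i, bucket = 0, err;

	for (m = 0; m < num; m++) {
		for (i = 0; i < weight[m]; i++, bucket++) {
			err = set_trunk_mask_bucket(chip, lag, bucket,
						    member[m]);
			if (err)
				return err;
		}
	}

	/* Every bucket must end up owned by exactly one member. */
	return bucket == 8 ? 0 : -EINVAL;
}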

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-13 18:16                             ` Tobias Waldekranz
@ 2021-04-14 15:14                               ` Marek Behun
  2021-04-14 18:39                                 ` Tobias Waldekranz
  0 siblings, 1 reply; 68+ messages in thread
From: Marek Behun @ 2021-04-14 15:14 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Tue, 13 Apr 2021 20:16:24 +0200
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> You could imagine a different mode in which the DSA driver would receive
> the bucket allocation from the bond/team driver (which in turn could
> come all the way from userspace). Userspace could then implement
> whatever strategy it wants to maximize utilization, though still bound
> by the limitations of the hardware in terms of fields considered during
> hashing of course.

The problem is that even with the ability to change the bucket
configuration however we want, it still can happen with non-trivial
probability that all (src,dst) pairs on the network will hash to one
bucket.

The probability of that happening is 1/(8^(n-1)) for n (src,dst) pairs.
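
(Derivation, assuming the hash spreads each (src,dst) pair uniformly
and independently over the 8 buckets: the first pair lands in some
bucket, and each of the remaining n-1 pairs hits that same bucket with
probability 1/8, so P = (1/8)^(n-1). The n = 2 case below is then
1/8 = 12.5%.)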

On Turris Omnia the most common configuration is that the switch ports
are bridged.

If the user plugs only two devices into the lan ports, one would expect
that both devices could utilize 1 gbps each. In this case there is
1/8 probability that both devices would hash to the same bucket. It is
quite bad if multi-CPU upload won't work for 12.5% of our customers that
are using our device in this way.

So if there is some reasonable solution how to implement multi-CPU via
the port vlan mask, I will try to pursue this.

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-14 15:14                               ` Marek Behun
@ 2021-04-14 18:39                                 ` Tobias Waldekranz
  2021-04-14 23:39                                   ` Vladimir Oltean
  0 siblings, 1 reply; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-14 18:39 UTC (permalink / raw)
  To: Marek Behun
  Cc: Vladimir Oltean, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Wed, Apr 14, 2021 at 17:14, Marek Behun <marek.behun@nic.cz> wrote:
> On Tue, 13 Apr 2021 20:16:24 +0200
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
>> You could imagine a different mode in which the DSA driver would receive
>> the bucket allocation from the bond/team driver (which in turn could
>> come all the way from userspace). Userspace could then implement
>> whatever strategy it wants to maximize utilization, though still bound
>> by the limitations of the hardware in terms of fields considered during
>> hashing of course.
>
> The problem is that even with the ability to change the bucket
> configuration however we want, it still can happen with non-trivial
> probability that all (src,dst) pairs on the network will hash to one
> bucket.
>
> The probability of that happening is 1/(8^(n-1)) for n (src,dst) pairs.

Yes I understand all that, hence "though still bound by the limitations
of the hardware in terms of fields considered during hashing of course."

> On Turris Omnia the most common configuration is that the switch ports
> are bridged.
>
> If the user plugs only two devices into the lan ports, one would expect
> that both devices could utilize 1 gbps each. In this case there is
> 1/8 probability that both devices would hash to the same bucket. It is
> quite bad if multi-CPU upload won't work for 12.5% of our customers that
> are using our device in this way.

Agreed, but it is a category error to talk in terms of expectations and
desires here. I am pretty sure the silicon just does not have the gates
required to do per-port steering in combination with bridging. (Except
by using the TCAM).

So if there is some reasonable solution for implementing multi-CPU via
the port VLAN mask, I will try to pursue this.

I hope whatever solution you come up with does not depend on the
destination being unknown. If the current patch works for the reason I
suspect, you will effectively limit the downstream bandwidth of all
connected stations to 1G minus the aggregated upstream rate. Example:

     .------.
 A --+ lan0 |
 B --+ lan1 |
 C --+ lan2 |
 D --+ lan3 |
     |      |
     + wan  |
     '------'

If you run with this series applied, in this setup, and have A,B,C each
send a 10 kpps flow to the CPU, what is the observed rate on D?  My
guess would be 30 kpps, as all traffic is being flooded as unknown
unicast. This is true also for net-next at the moment. To solve that you
have to load the CPU address in the ATU, at which point you have to
decide between cpu0 and cpu1.

In order to have two entries for the same destination, they must belong
to different FIDs. But that FID is also used for automatic learning. So
if all ports use their own FID, all the switched traffic will have to be
flooded instead, since any address learned on lan0 will be invisible to
lan1,2,3 and vice versa.
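
Concretely, with one FID per user port, A attached to lan0 and B to
lan1, the ATU would end up looking something like this (a sketch, not
real syntax):

  FID(lan0): cpu-mac -> cpu0,  A -> lan0
  FID(lan1): cpu-mac -> cpu1,  B -> lan1

A frame from A to B is only looked up in FID(lan0), where B was never
learned, so it can never be forwarded as known unicast; it has to be
flooded.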

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-14 18:39                                 ` Tobias Waldekranz
@ 2021-04-14 23:39                                   ` Vladimir Oltean
  2021-04-15  9:20                                     ` Tobias Waldekranz
  0 siblings, 1 reply; 68+ messages in thread
From: Vladimir Oltean @ 2021-04-14 23:39 UTC (permalink / raw)
  To: Tobias Waldekranz
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Wed, Apr 14, 2021 at 08:39:53PM +0200, Tobias Waldekranz wrote:
> In order to have two entries for the same destination, they must belong
> to different FIDs. But that FID is also used for automatic learning. So
> if all ports use their own FID, all the switched traffic will have to be
> flooded instead, since any address learned on lan0 will be invisible to
> lan1,2,3 and vice versa.

Can you explain a bit more what you mean when you say that the FID
for the CPU port is also used for automatic learning? Since when does
mv88e6xxx learn frames sent by tag_dsa.c?

The way Ocelot switches work, and this is also the mechanism that I plan
to build on top of, is explained in include/soc/mscc/ocelot.h (copied
here for your convenience):

/* Port Group IDs (PGID) are masks of destination ports.
 *
 * For L2 forwarding, the switch performs 3 lookups in the PGID table for each
 * frame, and forwards the frame to the ports that are present in the logical
 * AND of all 3 PGIDs.
 *
 * These PGID lookups are:
 * - In one of PGID[0-63]: for the destination masks. There are 2 paths by
 *   which the switch selects a destination PGID:
 *     - The {DMAC, VID} is present in the MAC table. In that case, the
 *       destination PGID is given by the DEST_IDX field of the MAC table entry
 *       that matched.
 *     - The {DMAC, VID} is not present in the MAC table (it is unknown). The
 *       frame is disseminated as being either unicast, multicast or broadcast,
 *       and according to that, the destination PGID is chosen as being the
 *       value contained by ANA_FLOODING_FLD_UNICAST,
 *       ANA_FLOODING_FLD_MULTICAST or ANA_FLOODING_FLD_BROADCAST.
 *   The destination PGID can be an unicast set: the first PGIDs, 0 to
 *   ocelot->num_phys_ports - 1, or a multicast set: the PGIDs from
 *   ocelot->num_phys_ports to 63. By convention, a unicast PGID corresponds to
 *   a physical port and has a single bit set in the destination ports mask:
 *   that corresponding to the port number itself. In contrast, a multicast
 *   PGID will have potentially more than one single bit set in the destination
 *   ports mask.
 * - In one of PGID[64-79]: for the aggregation mask. The switch classifier
 *   dissects each frame and generates a 4-bit Link Aggregation Code which is
 *   used for this second PGID table lookup. The goal of link aggregation is to
 *   hash multiple flows within the same LAG on to different destination ports.
 *   The first lookup will result in a PGID with all the LAG members present in
 *   the destination ports mask, and the second lookup, by Link Aggregation
 *   Code, will ensure that each flow gets forwarded only to a single port out
 *   of that mask (there are no duplicates).
 * - In one of PGID[80-90]: for the source mask. The third time, the PGID table
 *   is indexed with the ingress port (plus 80). These PGIDs answer the
 *   question "is port i allowed to forward traffic to port j?" If yes, then
 *   BIT(j) of PGID 80+i will be found set. The third PGID lookup can be used
 *   to enforce the L2 forwarding matrix imposed by e.g. a Linux bridge.
 */

/* Reserve some destination PGIDs at the end of the range:
 * PGID_BLACKHOLE: used for not forwarding the frames
 * PGID_CPU: used for whitelisting certain MAC addresses, such as the addresses
 *           of the switch port net devices, towards the CPU port module.
 * PGID_UC: the flooding destinations for unknown unicast traffic.
 * PGID_MC: the flooding destinations for non-IP multicast traffic.
 * PGID_MCIPV4: the flooding destinations for IPv4 multicast traffic.
 * PGID_MCIPV6: the flooding destinations for IPv6 multicast traffic.
 * PGID_BC: the flooding destinations for broadcast traffic.
 */

Basically the frame is forwarded towards:

PGID_DST[MAC table -> destination] & PGID_AGGR[aggregation code] & PGID_SRC[source port]

This is also how we set up LAGs in ocelot_set_aggr_pgids: as far as
PGID_DST is concerned, all traffic towards a LAG is 'sort of multicast'
(even for unicast traffic), in the sense that the destination port mask
is all ones for the physical ports in that LAG. We then reduce the
destination port mask through PGID_AGGR, in the sense that every
aggregation code (of which there can be 16) has a single bit set,
corresponding to either one of the physical ports in the LAG. So every
packet does indeed see no more than one destination port in the end.

For multiple CPU ports with static assignment to user ports, it would be
sufficient, given the Ocelot architecture, to install a single 'multicast'
entry per address in the MAC table, with a DEST_IDX having two bits set,
one for each CPU port. Then, we would let the third lookup (PGID_SRC,
equivalent to the Marvell's port VLANs, AFAIU) enforce the bounding box
for every packet such that it goes to one CPU port or to another.
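
Roughly like this (untested sketch; 'pgid' is assumed to be a free
multicast PGID, and its allocation is hand-waved here):

/* Make host address 'addr' reachable through either CPU port A or B,
 * leaving the per-port choice to the PGID_SRC masks.
 */
static int host_addr_add(struct ocelot *ocelot, int pgid, int cpu_a,
			 int cpu_b, const unsigned char *addr, u16 vid)
{
	/* Destination mask with both CPU ports set... */
	ocelot_write_rix(ocelot, BIT(cpu_a) | BIT(cpu_b),
			 ANA_PGID_PGID, pgid);

	/* ...and a MAC table entry whose DEST_IDX points at it. The
	 * PGID_SRC of each user port then narrows those two bits down
	 * to exactly one CPU port per ingress port.
	 */
	return ocelot_mact_learn(ocelot, pgid, addr, vid,
				 ENTRYTYPE_LOCKED);
}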

This, however, has implications upon the DSA API. In my current attempts
for the 'RX filtering in DSA' series, host addresses are reference-counted
by DSA, and passed on to the driver through .port_fdb_add and .port_mdb_add
calls, where the "port" parameter is the CPU port. Which CPU port? Yes.
It is clear now that DSA should take its hands off of these addresses,
and we should define a new API for .port_host_uc_add and .port_host_mc_add,
which is per user port. If the driver doesn't have anything better to do
or it doesn't support multiple CPU ports for whatever reason, it can go
ahead and implement .port_host_uc_add(ds, port) as
.port_fdb_add(ds, dsa_to_port(ds, port)->cpu_dp->index). But it also has
the option of doing smarter tricks like adding a TCAM trapping entry
(unknown to tc, why would it be exposed to tc?) or doing this 'multicast'
MAC table entry thing. But it must also manage potentially duplicate
entries all by itself. For example, DSA will call .port_host_uc_add at
probe time with the MAC address of every port. Then, the bridge will
also notify of static FDB entries, and at least some of those have the
same MAC address as the port itself. Then, the bridge will be deleted,
and the expectation is that the driver is smart enough to not remove the
entry for the port, because that's still needed for standalone traffic
termination.

It won't be ideal from a driver writer's perspective, but it will be
flexible.
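
To illustrate the fallback I have in mind (a sketch only: neither
.port_host_uc_add nor the 'foo' driver exists yet, it just reuses the
driver's existing .port_fdb_add logic):

/* A driver with nothing smarter to do installs the host FDB entry on
 * whatever CPU port is currently assigned to 'port'.
 */
static int foo_port_host_uc_add(struct dsa_switch *ds, int port,
				const unsigned char *addr, u16 vid)
{
	struct dsa_port *cpu_dp = dsa_to_port(ds, port)->cpu_dp;

	return foo_port_fdb_add(ds, cpu_dp->index, addr, vid);
}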

I've already started to send some patches for RX filtering, little by
little. Don't get your hopes up, it's been almost a year since I started
working on them with no end in sight, so one thing that's clear is that
nothing spectacular is going to happen at least until the upcoming merge
window closes. It also has some dependencies it seems, like for example
the fact that br_fdb_replay and br_mdb_replay are still far from perfect,
and the refcounting is still impossible to do without leaks. I have yet
another non-trivial and insufficiently tested patch series for that,
which I've been delaying due to the need to work on some other stuff.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2021-04-14 23:39                                   ` Vladimir Oltean
@ 2021-04-15  9:20                                     ` Tobias Waldekranz
  0 siblings, 0 replies; 68+ messages in thread
From: Tobias Waldekranz @ 2021-04-15  9:20 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Marek Behun, Ansuel Smith, netdev, David S. Miller,
	Jakub Kicinski, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Eric Dumazet, Wei Wang, Cong Wang, Taehee Yoo,
	Björn Töpel, zhang kai, Weilong Chen, Roopa Prabhu,
	Di Zhu, Francis Laniel, linux-kernel

On Thu, Apr 15, 2021 at 02:39, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Wed, Apr 14, 2021 at 08:39:53PM +0200, Tobias Waldekranz wrote:
>> In order to have two entries for the same destination, they must belong
>> to different FIDs. But that FID is also used for automatic learning. So
>> if all ports use their own FID, all the switched traffic will have to be
>> flooded instead, since any address learned on lan0 will be invisible to
>> lan1,2,3 and vice versa.
>
> Can you explain a bit more what you mean when you say that the FID
> for the CPU port is also used for automatic learning? Since when does
> mv88e6xxx learn frames sent by tag_dsa.c?

I was thinking about the incoming traffic on the LAN ports, not the CPU
port. We are still exclusively sending FROM_CPUs from tag_dsa.c, nothing
has changed there.

> The way Ocelot switches work, and this is also the mechanism that I plan
> to build on top of, is explained in include/soc/mscc/ocelot.h (copied
> here for your convenience):
>
> /* Port Group IDs (PGID) are masks of destination ports.
>  *
>  * For L2 forwarding, the switch performs 3 lookups in the PGID table for each
>  * frame, and forwards the frame to the ports that are present in the logical
>  * AND of all 3 PGIDs.
>  *
>  * These PGID lookups are:
>  * - In one of PGID[0-63]: for the destination masks. There are 2 paths by
>  *   which the switch selects a destination PGID:
>  *     - The {DMAC, VID} is present in the MAC table. In that case, the
>  *       destination PGID is given by the DEST_IDX field of the MAC table entry
>  *       that matched.
>  *     - The {DMAC, VID} is not present in the MAC table (it is unknown). The
>  *       frame is disseminated as being either unicast, multicast or broadcast,
>  *       and according to that, the destination PGID is chosen as being the
>  *       value contained by ANA_FLOODING_FLD_UNICAST,
>  *       ANA_FLOODING_FLD_MULTICAST or ANA_FLOODING_FLD_BROADCAST.
>  *   The destination PGID can be an unicast set: the first PGIDs, 0 to
>  *   ocelot->num_phys_ports - 1, or a multicast set: the PGIDs from
>  *   ocelot->num_phys_ports to 63. By convention, a unicast PGID corresponds to
>  *   a physical port and has a single bit set in the destination ports mask:
>  *   that corresponding to the port number itself. In contrast, a multicast
>  *   PGID will have potentially more than one single bit set in the destination
>  *   ports mask.
>  * - In one of PGID[64-79]: for the aggregation mask. The switch classifier
>  *   dissects each frame and generates a 4-bit Link Aggregation Code which is
>  *   used for this second PGID table lookup. The goal of link aggregation is to
>  *   hash multiple flows within the same LAG on to different destination ports.
>  *   The first lookup will result in a PGID with all the LAG members present in
>  *   the destination ports mask, and the second lookup, by Link Aggregation
>  *   Code, will ensure that each flow gets forwarded only to a single port out
>  *   of that mask (there are no duplicates).
>  * - In one of PGID[80-90]: for the source mask. The third time, the PGID table
>  *   is indexed with the ingress port (plus 80). These PGIDs answer the
>  *   question "is port i allowed to forward traffic to port j?" If yes, then
>  *   BIT(j) of PGID 80+i will be found set. The third PGID lookup can be used
>  *   to enforce the L2 forwarding matrix imposed by e.g. a Linux bridge.
>  */
>
> /* Reserve some destination PGIDs at the end of the range:
>  * PGID_BLACKHOLE: used for not forwarding the frames
>  * PGID_CPU: used for whitelisting certain MAC addresses, such as the addresses
>  *           of the switch port net devices, towards the CPU port module.
>  * PGID_UC: the flooding destinations for unknown unicast traffic.
>  * PGID_MC: the flooding destinations for non-IP multicast traffic.
>  * PGID_MCIPV4: the flooding destinations for IPv4 multicast traffic.
>  * PGID_MCIPV6: the flooding destinations for IPv6 multicast traffic.
>  * PGID_BC: the flooding destinations for broadcast traffic.
>  */
>
> Basically the frame is forwarded towards:
>
> PGID_DST[MAC table -> destination] & PGID_AGGR[aggregation code] & PGID_SRC[source port]
>
> This is also how we set up LAGs in ocelot_set_aggr_pgids: as far as
> PGID_DST is concerned, all traffic towards a LAG is 'sort of multicast'
> (even for unicast traffic), in the sense that the destination port mask
> is all ones for the physical ports in that LAG. We then reduce the
> destination port mask through PGID_AGGR, in the sense that every
> aggregation code (of which there can be 16) has a single bit set,
> corresponding to either one of the physical ports in the LAG. So every
> packet does indeed see no more than one destination port in the end.

This is all very similar to how mv88e6xxx works. The only minor
difference is that the MAC table is wide enough to include the vector
directly. Unicast entries mapped to LAGs are then handled as a special
case, most likely to support cross-chip configurations correctly.
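
To make the three-stage lookup above concrete, here is a kernel-style
sketch of the combination rule (the helper name is made up, and in
reality the hardware performs these lookups itself; this only restates
the math from the comment above):

#include <linux/types.h>

static inline u64 ocelot_resolve_dest(const u64 *pgid, int dest_idx,
				      int aggr_code, int src_port)
{
	u64 dst  = pgid[dest_idx];	   /* PGID[0-63]: MAC table hit or flood set */
	u64 aggr = pgid[64 + aggr_code];   /* PGID[64-79]: LAG aggregation mask */
	u64 src  = pgid[80 + src_port];	   /* PGID[80-90]: source/isolation mask */

	/* the frame egresses exactly the ports set in all three masks */
	return dst & aggr & src;
}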

> For multiple CPU ports with static assignment to user ports, it would be
> sufficient, given the Ocelot architecture, to install a single 'multicast'
> entry per address in the MAC table, with a DEST_IDX having two bits set,
> one for each CPU port. Then, we would let the third lookup (PGID_SRC,
> equivalent to the Marvell's port VLANs, AFAIU) enforce the bounding box
> for every packet such that it goes to one CPU port or to another.

Yeah, I also considered this approach. Basically you create a broadcast
LAG and then rely on the third component in your expression above (or
port-based VLANs on mv88e6xxx) to avoid duplicates?

I dismissed it because I thought that it would break down once you need
to support multiple chips, as the LAGs are managed separately in the
PVT. But I now realize that the PVT is indexed based on the FORWARD tag,
which contains the _source_ information. So that might work for
mv88e6xxx as well!

Marek, what do you think? If this works, it would be great if we could
also allow the hash-based approach, since that would work better in cases
where you have many flows coming in on a single port that you would like
to spread over multiple cores.
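
To illustrate the static variant concretely, here is a small standalone
demo (made-up port numbers and mask widths, not driver code; in real
hardware these masks live in switch registers, this only demonstrates
the arithmetic) of how a host address can point at both CPU ports at
once while the per-ingress-port source masks leave exactly one of them:

#include <stdint.h>
#include <stdio.h>

#define CPU_A		8	/* first CPU port */
#define CPU_B		5	/* second CPU port */
#define ALL_PORTS	0x1ff	/* ports 0-8 */

int main(void)
{
	/* host FDB entry: destination mask contains both CPU ports */
	uint16_t host_dst = (1 << CPU_A) | (1 << CPU_B);

	for (int lan = 0; lan < 4; lan++) {
		/* static assignment: even user ports keep CPU_A in
		 * their source mask, odd ones keep CPU_B
		 */
		uint16_t src = ALL_PORTS & ~(1 << (lan % 2 ? CPU_A : CPU_B));

		printf("lan%d -> CPU port %d\n", lan,
		       (host_dst & src) & (1 << CPU_A) ? CPU_A : CPU_B);
	}
	return 0;
}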

> This, however, has implications upon the DSA API. In my current attempts
> for the 'RX filtering in DSA' series, host addresses are reference-counted
> by DSA, and passed on to the driver through .port_fdb_add and .port_mdb_add
> calls, where the "port" parameter is the CPU port. Which CPU port? Yes.
> It is clear now that DSA should take its hands off of these addresses,
> and we should define a new API for .port_host_uc_add and .port_host_mc_add,
> which is per user port. If the driver doesn't have anything better to do
> or it doesn't support multiple CPU ports for whatever reason, it can go
> ahead and implement .port_host_uc_add(ds, port) as
> .port_fdb_add(ds, dsa_to_port(ds, port)->cpu_dp->index). But it also has
> the option of doing smarter tricks like adding a TCAM trapping entry
> (unknown to tc, why would it be exposed to tc?) or doing this 'multicast'
> MAC table entry thing. But it must also manage potentially duplicate
> entries all by itself. For example, DSA will call .port_host_uc_add at
> probe time with the MAC address of every port. Then, the bridge will
> also notify of static FDB entries, and at least some of those have the
> same MAC address as the port itself. Then, the bridge will be deleted,
> and the expectation is that the driver is smart enough to not remove the
> entry for the port, because that's still needed for standalone traffic
> termination.
>
> It won't be ideal from a driver writer's perspective, but it will be
> flexible.

Yeah, I think it is the right approach. We could also supply a default
implementation to handle the common single-CPU-port case to make it
easier.
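
Such a default could be as small as this sketch, assuming the
.port_host_uc_add(ds, port, addr, vid) signature floated above (no such
op exists in mainline at this point; "foo" is a placeholder driver
prefix and foo_port_fdb_add stands for the driver's existing
.port_fdb_add implementation):

static int foo_port_host_uc_add(struct dsa_switch *ds, int port,
				const unsigned char *addr, u16 vid)
{
	/* fall back to installing the address towards the single CPU
	 * port this user port is currently assigned to
	 */
	struct dsa_port *cpu_dp = dsa_to_port(ds, port)->cpu_dp;

	return foo_port_fdb_add(ds, cpu_dp->index, addr, vid);
}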

> I've started already to send some patches for RX filtering, little by
> little. Don't get your hopes up, it's been almost a year since I started
> working on them with no end in sight, so one thing that's clear is that
> nothing spectacular is going to happen at least until the upcoming merge
> window closes. It also has some dependencies it seems, like for example
> the fact that br_fdb_replay and br_mdb_replay are still far from perfect,
> and the refcounting is still impossible to do without leaks. I have yet
> another non-trivial and insufficiently tested patch series for that,
> which I've been delaying due to the need to work on some other stuff.

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-25  7:13   ` Marek Behun
@ 2019-08-25 15:00     ` Florian Fainelli
  0 siblings, 0 replies; 68+ messages in thread
From: Florian Fainelli @ 2019-08-25 15:00 UTC (permalink / raw)
  To: Marek Behun
  Cc: netdev, Andrew Lunn, Vivien Didelot, David Ahern,
	Stephen Hemminger, Chris Healy, Vladimir Oltean



On 8/25/2019 12:13 AM, Marek Behun wrote:
> On Sat, 24 Aug 2019 13:04:04 -0700
> Florian Fainelli <f.fainelli@gmail.com> wrote:
> 
>> Now, the 4.9 kernel behavior actually works just fine because eth1 is
>> not a special interface, so no tagging is expected, and "wifi", although
>> it supports DSA tagging, represents another side of the CPU/host network
>> stack, so you never have to inject frames into the switch, because you
>> can use eth1 to do that and let MAC learning do its job to forward to
>> the correct port of the switch.
> 
> Hi Florian,
> 
> Sorry, I am having trouble understanding what you mean in the
> paragraph I quoted above (and paragraphs afterwards).
> 
> eth0 and eth1 are interfaces created by an ethernet driver.
> wlan0 is an interface created by a wireless driver.
> wifi is a slave interface created by DSA for port 5 on the switch.
> Is eth1 a DSA slave or a DSA master connected to port 5?

"wifi" is a DSA slave network interface, and as you understood, eth1 is
a standard Ethernet driver (which happens to be the same as eth0, since
on that system there are two identical controllers, driver is
bcmsysport.c). The relevant part of the DTS for that system looks like this:

port@5 {
	reg = <5>;
	label = "wifi";
	ethernet = <&enet_1>;
};

port@8 {
	reg = <8>;
	label = "cpu";
	ethernet = <&enet_0>;
};

So as you can see, the devices are all described correctly; simply the
behavior on the Linux side changes depending on whether you have commit
6d4e5c570c2d66c806ecc6bd851fcf881fe8a38e ("net: dsa: get port type at
parse time") included or not. With this commit, the criteria for
determining what is a DSA/CPU port changed: it used to be based on the
label (e.g.: "cpu", "dsa") and then it changed to be based on whether a
phandle property is either "ethernet" (port is CPU) or "link" (port is
DSA), which is the right thing to do, but it no longer allows us to be
specific about which port is going to be elected as the CPU port.
Fortunately the switch driver is coded with the assumption that either
port 5 or 8 could be used.
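
In code, the criteria from that commit boil down to roughly the
following (a simplified sketch, not the literal dsa2.c code; the helper
name is made up and OF reference handling is omitted):

static enum dsa_port_type dsa_port_type_of(struct device_node *dn)
{
	/* an "ethernet" phandle marks a CPU port, a "link" phandle a
	 * DSA (switch-to-switch) port; anything else is a user port,
	 * regardless of its label
	 */
	if (of_parse_phandle(dn, "ethernet", 0))
		return DSA_PORT_TYPE_CPU;
	if (of_parse_phandle(dn, "link", 0))
		return DSA_PORT_TYPE_DSA;

	return DSA_PORT_TYPE_USER;
}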

Note that when we do not have a network device that represents the
switch"side (e.g.: the CPU port or the DSA port), we had to jump through
many hoops in order to make some information visible to the user, or
even overlay specific operations onto the DSA master, that code is under
net/dsa/master.c.

> 
> How does DSA handle two interfaces with the same reg property?

This is not happening; see the DTS example above.
-- 
Florian

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 20:04 ` Florian Fainelli
  2019-08-24 21:01   ` Marek Behun
@ 2019-08-25  7:13   ` Marek Behun
  2019-08-25 15:00     ` Florian Fainelli
  1 sibling, 1 reply; 68+ messages in thread
From: Marek Behun @ 2019-08-25  7:13 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, Andrew Lunn, Vivien Didelot, David Ahern,
	Stephen Hemminger, Chris Healy, Vladimir Oltean

On Sat, 24 Aug 2019 13:04:04 -0700
Florian Fainelli <f.fainelli@gmail.com> wrote:

> Now, the 4.9 kernel behavior actually works just fine because eth1 is
> not a special interface, so no tagging is expected, and "wifi", although
> it supports DSA tagging, represents another side of the CPU/host network
> stack, so you never have to inject frames into the switch, because you
> can use eth1 to do that and let MAC learning do its job to forward to
> the correct port of the switch.

Hi Florian,

Sorry, I am having trouble understanding what you mean in the
paragraph I quoted above (and paragraphs afterwards).

eth0 and eth1 are interfaces created by an ethernet driver.
wlan0 is an interface created by a wireless driver.
wifi is a slave interface created by DSA for port 5 on the switch.
Is eth1 a DSA slave or a DSA master connected to port 5?

How does DSA handle two interfaces with the same reg property?

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 15:24 ` Andrew Lunn
  2019-08-24 17:45   ` Marek Behun
@ 2019-08-25  4:19   ` Marek Behun
  1 sibling, 0 replies; 68+ messages in thread
From: Marek Behun @ 2019-08-25  4:19 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Vivien Didelot, Florian Fainelli, David Ahern, Stephen Hemminger

On Sat, 24 Aug 2019 17:24:07 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> That is a new idea. Interesting.
> 
> I would like to look around and see what else uses this "lan1@eth0"
> concept. We need to ensure it is not counterintuitive in general,
> when you consider all possible users.

There are not many users of ndo_get_iflink besides the DSA slave:
  ip6_gre, ip6_vti, sit, ip6_tunnel
  ip_gre, ip_vti, ipmr, ipip,
  macsec, macvlan, veth
  ipvlan
  ipoib
and a few others. What these have in common is that all these interfaces
are linked somehow to another interface, i.e. a macvlan interface eth0.1
is linked to its master interface eth0. All of these are virtual
interfaces.

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 21:01   ` Marek Behun
@ 2019-08-25  4:08     ` Marek Behun
  0 siblings, 0 replies; 68+ messages in thread
From: Marek Behun @ 2019-08-25  4:08 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, Andrew Lunn, Vivien Didelot, David Ahern,
	Stephen Hemminger, Chris Healy, Vladimir Oltean

On Sat, 24 Aug 2019 23:01:21 +0200
Marek Behun <marek.behun@nic.cz> wrote:

> the documentation would became weird to users.
... would become weird ...
> 
> We are *already* using the iflink property to report which CPU device
> is used as CPU destination port for a given switch slave interface. So
> why to use that for changing this, also?
... why NOT to use that for changing this also?
> 
> If you think that iflink should not be used for this, and other agree,
... and others agree with you,

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 20:04 ` Florian Fainelli
@ 2019-08-24 21:01   ` Marek Behun
  2019-08-25  4:08     ` Marek Behun
  2019-08-25  7:13   ` Marek Behun
  1 sibling, 1 reply; 68+ messages in thread
From: Marek Behun @ 2019-08-24 21:01 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, Andrew Lunn, Vivien Didelot, David Ahern,
	Stephen Hemminger, Chris Healy, Vladimir Oltean

On Sat, 24 Aug 2019 13:04:04 -0700
Florian Fainelli <f.fainelli@gmail.com> wrote:

> 1) Your approach is kind of interesting here, not sure if it is the best
> but it is not outright wrong. In the past, we had been talking about
> different approaches, some of which seemed too simplistic or too narrow
> on the use case, and some of which are up in the air and were not
> worked on.
> 
> - John Crispin submitted a patch series for the MTK switch driver a
> while back that was picked up by Frank Wunderlich more recently. This
> approach uses a Device Tree based configuration in order to statically
> assign ports, or groups of ports to a specific DSA master device. This
> is IMHO wrong because a) DT is not to impose a policy but strictly
> describe HW, and b) there was no way to change that static assignment at
> runtime.
> 
> - Based on that patch series, Andrew, Vivien, Frank and myself discussed
> two possible options:
> 	- allowing the enslaving of DSA master devices in the bridge, so as to
> provide a hint that specific DSA slave network devices should be
> "bound"/"linked" to a specific DSA master device. This requires
> modifications in the bridge layer to avoid undoing what commit
> 8db0a2ee2c6302a1dcbcdb93cb731dfc6c0cdb5e ("net: bridge: reject
> DSA-enabled master netdevices as bridge members") did. This would also
> require a bridge to be set up.
> 
> 	- enhancing the iproute2 command and backing kernel code in order to
> allow defining that a DSA slave device may be enslaved into a specific
> DSA master, similarly to how you currently enslave a device into a
> bridge, you could "enslave" a DSA slave to a DSA master with something
> that could look like this:
> 
> 	ip link set dev sw0p0 master eth0	# Associate port 0 with eth0
> 	ip link set dev sw0p1 master eth1	# Associate port 1 with eth1
> 
> To date, this may be the best way to do what we want here, rather than
> using the iflink, which would be re-purposing something that is not
> exactly meant for that.

We cannot use "set master" to set the CPU port, since that is used for
enslaving interfaces to bridges. There are use cases where these would
conflict with each other. The semantics would become complicated and
the documentation would become weird to users.

We are *already* using the iflink property to report which CPU device
is used as CPU destination port for a given switch slave interface. So
why not use that for changing this also?

If you think that iflink should not be used for this, and others agree,
then we should create a new property, something like dsa-upstream (e.g.
ip link set dev sw0p0 dsa-upstream eth0). Using the "master" property
is not right, IMO.

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24  2:42 Marek Behún
  2019-08-24 15:24 ` Andrew Lunn
  2019-08-24 15:40 ` Vladimir Oltean
@ 2019-08-24 20:04 ` Florian Fainelli
  2019-08-24 21:01   ` Marek Behun
  2019-08-25  7:13   ` Marek Behun
  2 siblings, 2 replies; 68+ messages in thread
From: Florian Fainelli @ 2019-08-24 20:04 UTC (permalink / raw)
  To: Marek Behún, netdev
  Cc: Andrew Lunn, Vivien Didelot, David Ahern, Stephen Hemminger,
	Chris Healy, Vladimir Oltean



On 8/23/2019 7:42 PM, Marek Behún wrote:
> Hi,
> this is my attempt to solve the multi-CPU port issue for DSA.
> 
> Patch 1 adds code for handling multiple CPU ports in a DSA switch tree.
> If more than one CPU port is found in a tree, the code assigns CPU ports
> to user/DSA ports in a round robin way. So for the simplest case where
> we have one switch with N ports, 2 of them of type CPU connected to eth0
> and eth1, and the other ports labels being lan1, lan2, ..., the code
> assigns them to CPU ports this way:
>   lan1 <-> eth0
>   lan2 <-> eth1
>   lan3 <-> eth0
>   lan4 <-> eth1
>   lan5 <-> eth0
>   ...
> 
> Patch 2 adds a new operation to the net device operations structure.
> Currently we use the iflink property of a net device to report to which
> CPU port a given switch port is connected to. The ip link utility from
> iproute2 reports this as "lan1@eth0". We add a new net device operation,
> ndo_set_iflink, which can be used to set this property. We call this
> function from the netlink handlers.
> 
> Patch 3 implements this new ndo_set_iflink operation for DSA slave
> device. Thus the userspace can request a change of CPU port of a given
> port.
> 
> I am also sending a patch for iproute2-next, to add support for setting
> this iflink value.

This is going to be a long email, broken into several parts; feel free
to skip or reply to the parts you would like.

- review/comments on your approach here
- history of multiple CPU ports within Broadcom switches
- specific use case for a device that uses upstream drivers and a
Broadcom switch (BCM7278)

1) Your approach is kind of interesting here, not sure if it is the best
but it is not outright wrong. In the past, we had been talking about
different approaches, some of which seemed too simplistic or too narrow
on the use case, and some of which are up in the air and were not
worked on.

- John Crispin submitted a patch series for the MTK switch driver a
while back that was picked up by Frank Wunderlich more recently. This
approach uses a Device Tree based configuration in order to statically
assign ports, or groups of ports to a specific DSA master device. This
is IMHO wrong because a) DT is not to impose a policy but strictly
describe HW, and b) there was no way to change that static assignment at
runtime.

- Based on that patch series, Andrew, Vivien, Frank and myself discussed
two possible options:
	- allowing the enslaving of DSA master devices in the bridge, so as to
provide a hint that specific DSA slave network devices should be
"bound"/"linked" to a specific DSA master device. This requires
modifications in the bridge layer to avoid undoing what commit
8db0a2ee2c6302a1dcbcdb93cb731dfc6c0cdb5e ("net: bridge: reject
DSA-enabled master netdevices as bridge members") did. This would also
require a bridge to be set up.

	- enhancing the iproute2 command and backing kernel code in order to
allow defining that a DSA slave device may be enslaved into a specific
DSA master, similarly to how you currently enslave a device into a
bridge, you could "enslave" a DSA slave to a DSA master with something
that could look like this:

	ip link set dev sw0p0 master eth0	# Associate port 0 with eth0
	ip link set dev sw0p1 master eth1	# Associate port 1 with eth1

To date, this may be the best way to do what we want here, rather than
using the iflink, which would be re-purposing something that is not
exactly meant for that.

2) With Broadcom Ethernet switches there has been historically few
designs that allowed the following:

- 6 port(s) switches: port 5 was the CPU port, always and supports
tagging (EthType + 4bytes Broadcom tag). This is the 5325, 5365 class of
switches, they are nearly 20 years old now.

- 9 port(s) switches: both port 5 and port 8 could be defined as In-band
Managemement Port(s) (IMP) and port 5 is IMP1 and port 8 is IMP0 by
default, preference is to use IMP0. Tagging is only supported on those
two ports. These are the 5395, 53125 and similar switches. Port 5 is
typically meant to be connected to a WAN interface where it might be
necessary to run some specific management protocol like 802.1x and such
that would require a managed switch to assist with those tasks. Port 5
can do most of what port 8 does, except when it comes to classification
and specific remapping rules, that limitation is carried forward with
all switches described below.

- 9 port(s) switches: port 5, 7 or 8 support tagging, with port 5 and 8
being possible management ports. Tagging is permitted on port 7 to
provide in-band information about packets (e.g.: classification ID, QoS,
etc.) to an "accelerator" (whatever it is), or a MoCA interface behind
port 7. This is the BCM5301X class, the NorthStar Plus (58xxx), and the
BCM7445 switches. Tagging can be enabled on RX, or TX, or both.

- 9 port(s) switches: all ports support tagging, with RX only, TX only,
or both directions supporting tagging. Again, port 8 remains the default
and preferred management port. This is the BCM7278 class of switches.

What needs to be considered here is that while multiple CPU ports may be
defined, if, say, both port 8 and port 5 are present, it is preferable to
continue using port 8 as the default management port because it is the
most capable, therefore having a switch driver callback that allows us
to elect the most suitable CPU/management port might be a good idea. If
other switches treat all CPU ports as equal, there is no need to provide
that callback.
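
Such a callback could be as small as this sketch (purely illustrative;
no such operation exists in dsa_switch_ops, and the function name is
made up):

static int b53_preferred_cpu_port(struct dsa_switch *ds)
{
	/* port 8 (IMP0) is the most capable management port on these
	 * switches, so prefer it whenever it is wired up as a CPU port
	 */
	if (dsa_is_cpu_port(ds, 8))
		return 8;

	return 5;	/* fall back to IMP1 */
}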

3) On the BCM7278 system we have 3 external ports wired, 1 internal port
to an audio/video streaming accelerator and 2 ports wired to Ethernet
MACs, on port 8 and port 5; an ASCII diagram looks like this:

-------------------	-------------------
|SYSTEMPORT Lite 0|	|SYSTEMPORT Lite 1|
-------------------	-------------------
	|			|
	|			|
-------------------------------------------------
| 	Port 8			Port 5		|
|						|     ----------------
|					Port 7	|-----| A/V streaming|
|						|     ----------------
| Port 0   Port 1   Port 2			|
------------------------------------------------|
   GPHY    RGMII_1  RGMII_2

GPHY is an integrated Gigabit PHY, RGMII_1 and 2 connect to external
MII/RevMII/GMII/RGMII external PHYs.

The Device Tree for that system declares both CPU ports by providing a
phandle to the respective Ethernet MAC controllers. Depending on the
kernel version, though, you may have different behaviors:

4.9 is our downstream production kernel ATM and on such a system you have:

DSA master:
eth0 (port 8)

DSA slaves:
gphy (port 0)
rgmii_1 (port 1)
rgmii_2 (port 2)
asp (port 7)
wifi (port 5)

Standard interface:
eth1 (port 5)

And here you should be like: hold on Florian, you have two interfaces
that each represent one side of the pipe (wifi, eth1), isn't that
counter to the very DSA principles?

On an upstream kernel though, eth0 is still present, but eth1, because
it is connected to port 5 and thus has a lower number, gets chosen as
the DSA master, and then the "wifi" interface is not created at all. All
of that is expected.

Now, the 4.9 kernel behavior actually works just fine because eth1 is
not a special interface, so no tagging is expected, and "wifi", although
it supports DSA tagging, represents another side of the CPU/host network
stack, so you never have to inject frames into the switch, because you
can use eth1 to do that and let MAC learning do its job to forward to
the correct port of the switch.

Likewise, a frame ingressing port 0 for a MAC address that is behind
port 5 works too. The "wifi" interface here acts as a control
interface that allows us to have a configuration end-point for port
number 5.

The typical set-up we have involves two bridge devices:

br-lan spans port 0, port 1, port 2, port 7 and port 5 and takes care of
putting all of these ports in the same broadcast domain

br-wifi spans eth1 and another device, e.g. wlan0, and takes care of
bridging LAN to WLAN

Now, let's say your use case involves doubling the bandwidth for
routing/NAT and you have two CPU ports for that purpose. You could use
exactly that same set-up as described, and create a LAN bridge that does
not span the switch port to which your second Ethernet MAC is connected.
Leave that WAN port as a standalone DSA device. Although you do need
a way to indicate somehow that port X connects to Ethernet MAC Y, which
Device Tree already provides.

Giving configuration control such that you can arbitrarily assign DSA
slaves to a given DSA master is fine, but what problems does it
potentially create?
-- 
Florian

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 15:56   ` Andrew Lunn
@ 2019-08-24 17:58     ` Marek Behun
  0 siblings, 0 replies; 68+ messages in thread
From: Marek Behun @ 2019-08-24 17:58 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Vladimir Oltean, netdev, Vivien Didelot, Florian Fainelli,
	David Ahern, Stephen Hemminger

On Sat, 24 Aug 2019 17:56:36 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> I expect bad things will happen if frames are flooded to multiple CPU
> ports. For this to work, the whole switch design needs to support
> multiple CPU ports. I doubt this will work on any old switch.
> 
> Having a host interface connected to a user port of the switch is a
> completely different use case, and not what this patchset is about.

In the next proposal I shall also add a guard to all DSA drivers so
that if more than one CPU port is set, the driver will not probe.

After that the next patch will try to add multi-CPU support to
mv88e6xxx (while removing the guard for that driver).

qca8k should also be possible to do, since we used it in such a way in
OpenWrt. I shall look into that afterwards.

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 15:44   ` Vladimir Oltean
@ 2019-08-24 17:55     ` Marek Behun
  0 siblings, 0 replies; 68+ messages in thread
From: Marek Behun @ 2019-08-24 17:55 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: netdev, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David Ahern, Stephen Hemminger

On Sat, 24 Aug 2019 18:44:44 +0300
Vladimir Oltean <olteanv@gmail.com> wrote:

> Just to be clear. You can argue that such switches are weird, and
> that's ok. Just want to understand the general type of hardware for
> which such a patch is intended.

Vladimir,

the general part should solve for devices like Turris 1.x (qca8k) and
Turris Omnia (mv88e6xxx). In these devices the switch is connected to
the CPU via 2 ports, and 5 ports are connected to RJ-45s.

I answered Andrew's question about the receive path in a previous mail.
Your other question I still have to think about, but the general idea
is that for other types of frames the switch driver should only use one
CPU port, so that no frame reaches the CPU twice.

I shall send the proposed implementation for mv88e6xxx in the next
version, perhaps tonight.

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 17:45   ` Marek Behun
@ 2019-08-24 17:54     ` Andrew Lunn
  0 siblings, 0 replies; 68+ messages in thread
From: Andrew Lunn @ 2019-08-24 17:54 UTC (permalink / raw)
  To: Marek Behun
  Cc: netdev, Vivien Didelot, Florian Fainelli, David Ahern, Stephen Hemminger

On Sat, Aug 24, 2019 at 07:45:46PM +0200, Marek Behun wrote:
> On Sat, 24 Aug 2019 17:24:07 +0200
> Andrew Lunn <andrew@lunn.ch> wrote:
> 
> > So this is all about transmit from the host out the switch. What about
> > receive? How do you tell the switch which CPU interface it should use
> > for a port?
> 
> Andrew, we use the same. The DSA slave implementation of ndo_set_iflink
> will also tell the switch driver to change the CPU port for that port.
> Patch 3 also adds operation port_change_cpu_port to the DSA switch
> operations. This is called from dsa_slave_set_iflink (at least in this
> first proposal).

Yes, I noticed this later. The cover letter did not include a change
to a driver, so it was not clear you had considered receive, which is
very much in the hands of the switch driver, not the DSA core.

     Andrew

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 15:24 ` Andrew Lunn
@ 2019-08-24 17:45   ` Marek Behun
  2019-08-24 17:54     ` Andrew Lunn
  2019-08-25  4:19   ` Marek Behun
  1 sibling, 1 reply; 68+ messages in thread
From: Marek Behun @ 2019-08-24 17:45 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, Vivien Didelot, Florian Fainelli, David Ahern, Stephen Hemminger

On Sat, 24 Aug 2019 17:24:07 +0200
Andrew Lunn <andrew@lunn.ch> wrote:

> So this is all about transmit from the host out the switch. What about
> receive? How do you tell the switch which CPU interface it should use
> for a port?

Andrew, we use the same. The DSA slave implementation of ndo_set_iflink
will also tell the switch driver to change the CPU port for that port.
Patch 3 also adds operation port_change_cpu_port to the DSA switch
operations. This is called from dsa_slave_set_iflink (at least in this
first proposal).
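
Reconstructed from that description, the slave-side handler might look
roughly like this (a sketch only; the .port_change_cpu_port signature
is an assumption and the actual patch may differ):

static int dsa_slave_set_iflink(struct net_device *dev, int iflink)
{
	struct dsa_port *dp = dsa_slave_to_port(dev);
	struct dsa_switch *ds = dp->ds;
	struct net_device *master;
	int err;

	if (!ds->ops->port_change_cpu_port)
		return -EOPNOTSUPP;

	/* the new iflink must name a DSA master, i.e. the host side
	 * of one of the tree's CPU ports
	 */
	master = dev_get_by_index(dev_net(dev), iflink);
	if (!master)
		return -ENODEV;
	if (!netdev_uses_dsa(master)) {
		dev_put(master);
		return -EINVAL;
	}

	err = ds->ops->port_change_cpu_port(ds, dp->index, master);
	dev_put(master);

	return err;
}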

Marek

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 15:40 ` Vladimir Oltean
  2019-08-24 15:44   ` Vladimir Oltean
@ 2019-08-24 15:56   ` Andrew Lunn
  2019-08-24 17:58     ` Marek Behun
  1 sibling, 1 reply; 68+ messages in thread
From: Andrew Lunn @ 2019-08-24 15:56 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Marek Behún, netdev, Vivien Didelot, Florian Fainelli,
	David Ahern, Stephen Hemminger

> Will DSA assume that all CPU ports are equal in terms of tagging
> protocol abilities? There are switches where one of the CPU ports can
> do tagging and the other can't.

Hi Vladimir

Given the current definition of what a CPU port is, we have to assume
the port is using tags. Frames have to be directed out a specific
egress port, otherwise things like BPDU, PTP will break. You cannot
rely on MAC address learning.

> Is the static assignment between slave and CPU ports going to be the
> only use case? What about link aggregation? Flow steering perhaps?
> And like Andrew pointed out, how do you handle the receive case? What
> happens to flooded frames, will the switch send them to both CPU
> interfaces, and get received twice in Linux? How do you prevent that?

I expect bad things will happen if frames are flooded to multiple CPU
ports. For this to work, the whole switch design needs to support
multiple CPU ports. I doubt this will work on any old switch.

Having a host interface connected to a user port of the switch is a
completely different use case, and not what this patchset is about.

	   Andrew

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24 15:40 ` Vladimir Oltean
@ 2019-08-24 15:44   ` Vladimir Oltean
  2019-08-24 17:55     ` Marek Behun
  2019-08-24 15:56   ` Andrew Lunn
  1 sibling, 1 reply; 68+ messages in thread
From: Vladimir Oltean @ 2019-08-24 15:44 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David Ahern, Stephen Hemminger

On Sat, 24 Aug 2019 at 18:40, Vladimir Oltean <olteanv@gmail.com> wrote:
>
> Hi Marek,
>
> On Sat, 24 Aug 2019 at 05:43, Marek Behún <marek.behun@nic.cz> wrote:
> >
> > Hi,
> > this is my attempt to solve the multi-CPU port issue for DSA.
> >
> > Patch 1 adds code for handling multiple CPU ports in a DSA switch tree.
> > If more than one CPU port is found in a tree, the code assigns CPU ports
> > to user/DSA ports in a round robin way. So for the simplest case where
> > we have one switch with N ports, 2 of them of type CPU connected to eth0
> > and eth1, and the other ports' labels being lan1, lan2, ..., the code
> > assigns them to CPU ports this way:
> >   lan1 <-> eth0
> >   lan2 <-> eth1
> >   lan3 <-> eth0
> >   lan4 <-> eth1
> >   lan5 <-> eth0
> >   ...
> >
> > Patch 2 adds a new operation to the net device operations structure.
> > Currently we use the iflink property of a net device to report to which
> > CPU port a given switch port is connected to. The ip link utility from
> > iproute2 reports this as "lan1@eth0". We add a new net device operation,
> > ndo_set_iflink, which can be used to set this property. We call this
> > function from the netlink handlers.
> >
> > Patch 3 implements this new ndo_set_iflink operation for DSA slave
> > device. Thus the userspace can request a change of CPU port of a given
> > port.
> >
> > I am also sending a patch for iproute2-next, to add support for setting
> > this iflink value.
> >
> > Marek
> >
>
> The topic is interesting.
> This changeset leaves the reader wanting to see a driver
> implementation of .port_change_cpu_port. (mostly to understand what
> your hardware is capable of)
> Will DSA assume that all CPU ports are equal in terms of tagging
> protocol abilities? There are switches where one of the CPU ports can
> do tagging and the other can't.

Just to be clear. You can argue that such switches are weird, and
that's ok. Just want to understand the general type of hardware for
which such a patch is intended.

> Is the static assignment between slave and CPU ports going to be the
> only use case? What about link aggregation? Flow steering perhaps?
> And like Andrew pointed out, how do you handle the receive case? What
> happens to flooded frames, will the switch send them to both CPU
> interfaces, and get received twice in Linux? How do you prevent that?
>
> > Marek Behún (3):
> >   net: dsa: allow for multiple CPU ports
> >   net: add ndo for setting the iflink property
> >   net: dsa: implement ndo_set_netlink for chaning port's CPU port
> >
> >  include/linux/netdevice.h |  5 +++
> >  include/net/dsa.h         | 11 ++++-
> >  net/core/dev.c            | 15 +++++++
> >  net/core/rtnetlink.c      |  7 ++++
> >  net/dsa/dsa2.c            | 84 +++++++++++++++++++++++++--------------
> >  net/dsa/slave.c           | 35 ++++++++++++++++
> >  6 files changed, 126 insertions(+), 31 deletions(-)
> >
> > --
> > 2.21.0
> >
>
> Regards,
> -Vladimir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24  2:42 Marek Behún
  2019-08-24 15:24 ` Andrew Lunn
@ 2019-08-24 15:40 ` Vladimir Oltean
  2019-08-24 15:44   ` Vladimir Oltean
  2019-08-24 15:56   ` Andrew Lunn
  2019-08-24 20:04 ` Florian Fainelli
  2 siblings, 2 replies; 68+ messages in thread
From: Vladimir Oltean @ 2019-08-24 15:40 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David Ahern, Stephen Hemminger

Hi Marek,

On Sat, 24 Aug 2019 at 05:43, Marek Behún <marek.behun@nic.cz> wrote:
>
> Hi,
> this is my attempt to solve the multi-CPU port issue for DSA.
>
> Patch 1 adds code for handling multiple CPU ports in a DSA switch tree.
> If more than one CPU port is found in a tree, the code assigns CPU ports
> to user/DSA ports in a round robin way. So for the simplest case where
> we have one switch with N ports, 2 of them of type CPU connected to eth0
> and eth1, and the other ports' labels being lan1, lan2, ..., the code
> assigns them to CPU ports this way:
>   lan1 <-> eth0
>   lan2 <-> eth1
>   lan3 <-> eth0
>   lan4 <-> eth1
>   lan5 <-> eth0
>   ...
>
> Patch 2 adds a new operation to the net device operations structure.
> Currently we use the iflink property of a net device to report to which
> CPU port a given switch port is connected to. The ip link utility from
> iproute2 reports this as "lan1@eth0". We add a new net device operation,
> ndo_set_iflink, which can be used to set this property. We call this
> function from the netlink handlers.
>
> Patch 3 implements this new ndo_set_iflink operation for DSA slave
> device. Thus the userspace can request a change of CPU port of a given
> port.
>
> I am also sending a patch for iproute2-next, to add support for setting
> this iflink value.
>
> Marek
>

The topic is interesting.
This changeset leaves the reader wanting to see a driver
implementation of .port_change_cpu_port. (mostly to understand what
your hardware is capable of)
Will DSA assume that all CPU ports are equal in terms of tagging
protocol abilities? There are switches where one of the CPU ports can
do tagging and the other can't.
Is the static assignment between slave and CPU ports going to be the
only use case? What about link aggregation? Flow steering perhaps?
And like Andrew pointed out, how do you handle the receive case? What
happens to flooded frames, will the switch send them to both CPU
interfaces, and get received twice in Linux? How do you prevent that?

> Marek Behún (3):
>   net: dsa: allow for multiple CPU ports
>   net: add ndo for setting the iflink property
>   net: dsa: implement ndo_set_netlink for chaning port's CPU port
>
>  include/linux/netdevice.h |  5 +++
>  include/net/dsa.h         | 11 ++++-
>  net/core/dev.c            | 15 +++++++
>  net/core/rtnetlink.c      |  7 ++++
>  net/dsa/dsa2.c            | 84 +++++++++++++++++++++++++--------------
>  net/dsa/slave.c           | 35 ++++++++++++++++
>  6 files changed, 126 insertions(+), 31 deletions(-)
>
> --
> 2.21.0
>

Regards,
-Vladimir

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH RFC net-next 0/3] Multi-CPU DSA support
  2019-08-24  2:42 Marek Behún
@ 2019-08-24 15:24 ` Andrew Lunn
  2019-08-24 17:45   ` Marek Behun
  2019-08-25  4:19   ` Marek Behun
  2019-08-24 15:40 ` Vladimir Oltean
  2019-08-24 20:04 ` Florian Fainelli
  2 siblings, 2 replies; 68+ messages in thread
From: Andrew Lunn @ 2019-08-24 15:24 UTC (permalink / raw)
  To: Marek Behún
  Cc: netdev, Vivien Didelot, Florian Fainelli, David Ahern, Stephen Hemminger

On Sat, Aug 24, 2019 at 04:42:47AM +0200, Marek Behún wrote:
> Hi,
> this is my attempt to solve the multi-CPU port issue for DSA.
> 
> Patch 1 adds code for handling multiple CPU ports in a DSA switch tree.
> If more than one CPU port is found in a tree, the code assigns CPU ports
> to user/DSA ports in a round robin way. So for the simplest case where
> we have one switch with N ports, 2 of them of type CPU connected to eth0
> and eth1, and the other ports labels being lan1, lan2, ..., the code
> assigns them to CPU ports this way:
>   lan1 <-> eth0
>   lan2 <-> eth1
>   lan3 <-> eth0
>   lan4 <-> eth1
>   lan5 <-> eth0

Hi Marek

That is what I've always argued is a good default. So I'm happy with
this.

> Patch 2 adds a new operation to the net device operations structure.
> Currently we use the iflink property of a net device to report to which
> CPU port a given switch port is connected to. The ip link utility from
> iproute2 reports this as "lan1@eth0". We add a new net device operation,
> ndo_set_iflink, which can be used to set this property. We call this
> function from the netlink handlers.

That is a new idea. Interesting.

I would like to look around and see what else uses this "lan1@eth0"
concept. We need to ensure it is not counterintuitive in general,
when you consider all possible users.

> Patch 3 implements this new ndo_set_iflink operation for DSA slave
> device. Thus the userspace can request a change of CPU port of a given
> port.

So this is all about transmit from the host out the switch. What about
receive? How do you tell the switch which CPU interface it should use
for a port?

    Thanks
	Andrew

^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH RFC net-next 0/3] Multi-CPU DSA support
@ 2019-08-24  2:42 Marek Behún
  2019-08-24 15:24 ` Andrew Lunn
                   ` (2 more replies)
  0 siblings, 3 replies; 68+ messages in thread
From: Marek Behún @ 2019-08-24  2:42 UTC (permalink / raw)
  To: netdev
  Cc: Andrew Lunn, Vivien Didelot, Florian Fainelli, David Ahern,
	Stephen Hemminger, Marek Behún

Hi,
this is my attempt to solve the multi-CPU port issue for DSA.

Patch 1 adds code for handling multiple CPU ports in a DSA switch tree.
If more than one CPU port is found in a tree, the code assigns CPU ports
to user/DSA ports in a round robin way. So for the simplest case where
we have one switch with N ports, 2 of them of type CPU connected to eth0
and eth1, and the other ports' labels being lan1, lan2, ..., the code
assigns them to CPU ports this way:
  lan1 <-> eth0
  lan2 <-> eth1
  lan3 <-> eth0
  lan4 <-> eth1
  lan5 <-> eth0
  ...
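
A sketch of that assignment loop (simplified; dsa_next_cpu_port() is a
made-up helper that cycles through the tree's CPU ports, wrapping
around, and the flat list of ports on the tree is an assumption):

static void dsa_tree_assign_cpu_ports(struct dsa_switch_tree *dst)
{
	struct dsa_port *cpu_dp = NULL, *dp;

	list_for_each_entry(dp, &dst->ports, list) {
		if (dp->type != DSA_PORT_TYPE_USER)
			continue;

		/* round-robin: take the next CPU port in the tree */
		cpu_dp = dsa_next_cpu_port(dst, cpu_dp);
		dp->cpu_dp = cpu_dp;
	}
}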

Patch 2 adds a new operation to the net device operations structure.
Currently we use the iflink property of a net device to report to which
CPU port a given switch port is connected to. The ip link utility from
iproute2 reports this as "lan1@eth0". We add a new net device operation,
ndo_set_iflink, which can be used to set this property. We call this
function from the netlink handlers.

Patch 3 implements this new ndo_set_iflink operation for DSA slave
device. Thus the userspace can request a change of CPU port of a given
port.

I am also sending a patch for iproute2-next, to add support for setting
this iflink value.

Marek

Marek Behún (3):
  net: dsa: allow for multiple CPU ports
  net: add ndo for setting the iflink property
  net: dsa: implement ndo_set_netlink for chaning port's CPU port

 include/linux/netdevice.h |  5 +++
 include/net/dsa.h         | 11 ++++-
 net/core/dev.c            | 15 +++++++
 net/core/rtnetlink.c      |  7 ++++
 net/dsa/dsa2.c            | 84 +++++++++++++++++++++++++--------------
 net/dsa/slave.c           | 35 ++++++++++++++++
 6 files changed, 126 insertions(+), 31 deletions(-)

-- 
2.21.0


^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2021-04-15  9:21 UTC | newest]

Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-10 13:34 [PATCH RFC net-next 0/3] Multi-CPU DSA support Ansuel Smith
2021-04-10 13:34 ` [PATCH RFC net-next 1/3] net: dsa: allow for multiple CPU ports Ansuel Smith
2021-04-12  3:35   ` DENG Qingfang
2021-04-12  4:41     ` Ansuel Smith
2021-04-12 15:30       ` DENG Qingfang
2021-04-12 16:17         ` Frank Wunderlich
2021-04-10 13:34 ` [PATCH RFC net-next 2/3] net: add ndo for setting the iflink property Ansuel Smith
2021-04-10 13:34 ` [PATCH RFC net-next 3/3] net: dsa: implement ndo_set_netlink for chaning port's CPU port Ansuel Smith
2021-04-10 13:34 ` [PATCH RFC iproute2-next] iplink: allow to change iplink value Ansuel Smith
2021-04-11 17:04   ` Stephen Hemminger
2021-04-11 17:09     ` Vladimir Oltean
2021-04-11 18:01 ` [PATCH RFC net-next 0/3] Multi-CPU DSA support Marek Behun
2021-04-11 18:08   ` Ansuel Smith
2021-04-11 18:39   ` Andrew Lunn
2021-04-12  2:07     ` Florian Fainelli
2021-04-12  4:53     ` Ansuel Smith
2021-04-11 18:50   ` Vladimir Oltean
2021-04-11 23:53     ` Vladimir Oltean
2021-04-12  2:10       ` Florian Fainelli
2021-04-12  5:04     ` Ansuel Smith
2021-04-12 12:46     ` Tobias Waldekranz
2021-04-12 14:35       ` Vladimir Oltean
2021-04-12 21:06         ` Tobias Waldekranz
2021-04-12 19:30       ` Marek Behun
2021-04-12 21:22         ` Tobias Waldekranz
2021-04-12 21:34           ` Vladimir Oltean
2021-04-12 21:49             ` Tobias Waldekranz
2021-04-12 21:56               ` Marek Behun
2021-04-12 22:06               ` Vladimir Oltean
2021-04-12 22:26                 ` Tobias Waldekranz
2021-04-12 22:48                   ` Vladimir Oltean
2021-04-12 23:04                     ` Marek Behun
2021-04-12 21:50           ` Marek Behun
2021-04-12 22:05             ` Tobias Waldekranz
2021-04-12 22:55               ` Marek Behun
2021-04-12 23:09                 ` Tobias Waldekranz
2021-04-12 23:13                   ` Tobias Waldekranz
2021-04-12 23:54                     ` Marek Behun
2021-04-13  0:27                       ` Marek Behun
2021-04-13  0:31                         ` Marek Behun
2021-04-13 14:46                         ` Tobias Waldekranz
2021-04-13 15:14                           ` Marek Behun
2021-04-13 18:16                             ` Tobias Waldekranz
2021-04-14 15:14                               ` Marek Behun
2021-04-14 18:39                                 ` Tobias Waldekranz
2021-04-14 23:39                                   ` Vladimir Oltean
2021-04-15  9:20                                     ` Tobias Waldekranz
2021-04-13 14:40                       ` Tobias Waldekranz
2021-04-12 15:00     ` DENG Qingfang
2021-04-12 16:32       ` Vladimir Oltean
2021-04-12 22:04         ` Marek Behun
2021-04-12 22:17           ` Vladimir Oltean
2021-04-12 22:47             ` Marek Behun
  -- strict thread matches above, loose matches on Subject: below --
2019-08-24  2:42 Marek Behún
2019-08-24 15:24 ` Andrew Lunn
2019-08-24 17:45   ` Marek Behun
2019-08-24 17:54     ` Andrew Lunn
2019-08-25  4:19   ` Marek Behun
2019-08-24 15:40 ` Vladimir Oltean
2019-08-24 15:44   ` Vladimir Oltean
2019-08-24 17:55     ` Marek Behun
2019-08-24 15:56   ` Andrew Lunn
2019-08-24 17:58     ` Marek Behun
2019-08-24 20:04 ` Florian Fainelli
2019-08-24 21:01   ` Marek Behun
2019-08-25  4:08     ` Marek Behun
2019-08-25  7:13   ` Marek Behun
2019-08-25 15:00     ` Florian Fainelli
