* [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB
@ 2021-08-21 21:00 Vladimir Oltean
  2021-08-21 21:00 ` [RFC PATCH 1/4] net: rtnetlink: create a netlink cb context struct for fdb dump Vladimir Oltean
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Vladimir Oltean @ 2021-08-21 21:00 UTC (permalink / raw)
  To: netdev

I have a board where it is painfully slow to run "bridge fdb". It has 16
switch ports which are accessed over an I2C controller -> I2C mux 1 ->
I2C mux 2 -> I2C-to-SPI bridge.

It doesn't really help either that we traverse the hardware FDB of each
switch for every netdev, even though we already know all there is to
know the first time we traversed it. In fact, I hacked up some rtnetlink
and DSA changes, and with those, the time to run 'bridge fdb' on this
board decreases from 207 seconds to 26 seconds (2 FDB traversals instead
of 16), turning something intolerable into 'tolerable'.

I don't know how much we care about .ndo_fdb_dump implemented directly
by drivers (and that's where I expect this to be most useful), because
of SWITCHDEV_FDB_ADD_TO_BRIDGE and all that. So this is RFC in case it
is helpful for somebody, at least during debugging.

Vladimir Oltean (4):
  net: rtnetlink: create a netlink cb context struct for fdb dump
  net: rtnetlink: add a minimal state machine for dumping shared FDBs
  net: dsa: implement a shared FDB dump procedure
  net: dsa: sja1105: implement shared FDB dump

 drivers/net/dsa/sja1105/sja1105_main.c        |  50 +++--
 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |   9 +-
 drivers/net/ethernet/mscc/ocelot.c            |   5 +-
 drivers/net/ethernet/mscc/ocelot_net.c        |   4 +
 drivers/net/vxlan.c                           |   8 +-
 include/linux/rtnetlink.h                     |  25 +++
 include/net/dsa.h                             |  17 ++
 net/bridge/br_fdb.c                           |   6 +-
 net/core/rtnetlink.c                          | 105 +++++++---
 net/dsa/dsa2.c                                |   2 +
 net/dsa/dsa_priv.h                            |   1 +
 net/dsa/slave.c                               | 189 ++++++++++++++++--
 net/dsa/switch.c                              |   8 +
 13 files changed, 368 insertions(+), 61 deletions(-)

-- 
2.25.1



* [RFC PATCH 1/4] net: rtnetlink: create a netlink cb context struct for fdb dump
  2021-08-21 21:00 [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB Vladimir Oltean
@ 2021-08-21 21:00 ` Vladimir Oltean
  2021-08-21 21:00 ` [RFC PATCH 2/4] net: rtnetlink: add a minimal state machine for dumping shared FDBs Vladimir Oltean
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Vladimir Oltean @ 2021-08-21 21:00 UTC (permalink / raw)
  To: netdev

If for no other reason than the ability to grep for proper
structure/variable names, use the more modern struct
netlink_callback::ctx instead of args to hold the stateful data over
the course of an FDB dump operation.
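
For illustration only - the lines below show the general idiom, not an
excerpt from this patch - the cast plus the usual compile-time size
check at the top of rtnl_fdb_dump() would look roughly like:

        struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;

        /* cb->ctx is a 48-byte scratch area; keep the struct within it */
        BUILD_BUG_ON(sizeof(struct rtnl_fdb_dump_ctx) > sizeof(cb->ctx));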

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 .../ethernet/freescale/dpaa2/dpaa2-switch.c    |  5 ++++-
 drivers/net/ethernet/mscc/ocelot.c             |  5 ++++-
 drivers/net/vxlan.c                            |  5 +++--
 include/linux/rtnetlink.h                      | 18 ++++++++++++++++++
 net/bridge/br_fdb.c                            |  3 ++-
 net/core/rtnetlink.c                           | 16 +++++++++-------
 net/dsa/slave.c                                |  5 ++++-
 7 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
index d260993ab2dc..dd018dfb25ee 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
@@ -771,10 +771,13 @@ static int dpaa2_switch_fdb_dump_nl(struct fdb_dump_entry *entry,
 	int is_dynamic = entry->type & DPSW_FDB_ENTRY_DINAMIC;
 	u32 portid = NETLINK_CB(dump->cb->skb).portid;
 	u32 seq = dump->cb->nlh->nlmsg_seq;
+	struct rtnl_fdb_dump_ctx *ctx;
 	struct nlmsghdr *nlh;
 	struct ndmsg *ndm;
 
-	if (dump->idx < dump->cb->args[2])
+	ctx = (struct rtnl_fdb_dump_ctx *)dump->cb->ctx;
+
+	if (dump->idx < ctx->fidx)
 		goto skip;
 
 	nlh = nlmsg_put(dump->skb, portid, seq, RTM_NEWNEIGH,
diff --git a/drivers/net/ethernet/mscc/ocelot.c b/drivers/net/ethernet/mscc/ocelot.c
index 5209650fd25f..44a56f9cda07 100644
--- a/drivers/net/ethernet/mscc/ocelot.c
+++ b/drivers/net/ethernet/mscc/ocelot.c
@@ -971,10 +971,13 @@ int ocelot_port_fdb_do_dump(const unsigned char *addr, u16 vid,
 	struct ocelot_dump_ctx *dump = data;
 	u32 portid = NETLINK_CB(dump->cb->skb).portid;
 	u32 seq = dump->cb->nlh->nlmsg_seq;
+	struct rtnl_fdb_dump_ctx *ctx;
 	struct nlmsghdr *nlh;
 	struct ndmsg *ndm;
 
-	if (dump->idx < dump->cb->args[2])
+	ctx = (struct rtnl_fdb_dump_ctx *)dump->cb->ctx;
+
+	if (dump->idx < ctx->fidx)
 		goto skip;
 
 	nlh = nlmsg_put(dump->skb, portid, seq, RTM_NEWNEIGH,
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 5a8df5a195cb..8c9371bf8195 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1371,6 +1371,7 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			  struct net_device *dev,
 			  struct net_device *filter_dev, int *idx)
 {
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	unsigned int h;
 	int err = 0;
@@ -1383,7 +1384,7 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			struct vxlan_rdst *rd;
 
 			if (rcu_access_pointer(f->nh)) {
-				if (*idx < cb->args[2])
+				if (*idx < ctx->fidx)
 					goto skip_nh;
 				err = vxlan_fdb_info(skb, vxlan, f,
 						     NETLINK_CB(cb->skb).portid,
@@ -1400,7 +1401,7 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 			}
 
 			list_for_each_entry_rcu(rd, &f->remotes, list) {
-				if (*idx < cb->args[2])
+				if (*idx < ctx->fidx)
 					goto skip;
 
 				err = vxlan_fdb_info(skb, vxlan, f,
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index bb9cb84114c1..f14cda6939c6 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -110,6 +110,24 @@ void rtnl_kfree_skbs(struct sk_buff *head, struct sk_buff *tail);
 	WARN_ONCE(!rtnl_is_locked(), \
 		  "RTNL: assertion failed at %s (%d)\n", __FILE__,  __LINE__)
 
+struct rtnl_fdb_dump_ctx {
+	/* Last bucket in the dev_index_head hash list that was checked.
+	 * Used by rtnl_fdb_dump to resume in case the procedure is
+	 * interrupted.
+	 */
+	int pos_hash;
+	/* Last interface within bucket @pos_hash that was checked.
+	 * Used by rtnl_fdb_dump to resume in case the procedure is
+	 * interrupted.
+	 */
+	int pos_idx;
+	/* Last FDB entry number that was dumped for the current interface.
+	 * Updated by implementers of .ndo_fdb_dump and used to resume in case
+	 * the dump procedure is interrupted.
+	 */
+	int fidx;
+};
+
 extern int ndo_dflt_fdb_dump(struct sk_buff *skb,
 			     struct netlink_callback *cb,
 			     struct net_device *dev,
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 46812b659710..2f6527d1df27 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -821,6 +821,7 @@ int br_fdb_dump(struct sk_buff *skb,
 		struct net_device *filter_dev,
 		int *idx)
 {
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
 	struct net_bridge *br = netdev_priv(dev);
 	struct net_bridge_fdb_entry *f;
 	int err = 0;
@@ -836,7 +837,7 @@ int br_fdb_dump(struct sk_buff *skb,
 
 	rcu_read_lock();
 	hlist_for_each_entry_rcu(f, &br->fdb_list, fdb_node) {
-		if (*idx < cb->args[2])
+		if (*idx < ctx->fidx)
 			goto skip;
 		if (filter_dev && (!f->dst || f->dst->dev != filter_dev)) {
 			if (filter_dev != dev)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 2dcf1c084b20..06cd59b6260a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4184,6 +4184,7 @@ static int nlmsg_populate_fdb(struct sk_buff *skb,
 			      int *idx,
 			      struct netdev_hw_addr_list *list)
 {
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
 	struct netdev_hw_addr *ha;
 	int err;
 	u32 portid, seq;
@@ -4192,7 +4193,7 @@ static int nlmsg_populate_fdb(struct sk_buff *skb,
 	seq = cb->nlh->nlmsg_seq;
 
 	list_for_each_entry(ha, &list->list, list) {
-		if (*idx < cb->args[2])
+		if (*idx < ctx->fidx)
 			goto skip;
 
 		err = nlmsg_populate_fdb_fill(skb, dev, ha->addr, 0,
@@ -4331,6 +4332,7 @@ static int valid_fdb_dump_legacy(const struct nlmsghdr *nlh,
 
 static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
 	struct net_device *dev;
 	struct net_device *br_dev = NULL;
 	const struct net_device_ops *ops = NULL;
@@ -4361,8 +4363,8 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 		ops = br_dev->netdev_ops;
 	}
 
-	s_h = cb->args[0];
-	s_idx = cb->args[1];
+	s_h = ctx->pos_hash;
+	s_idx = ctx->pos_idx;
 
 	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
 		idx = 0;
@@ -4414,7 +4416,7 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 			cops = NULL;
 
 			/* reset fdb offset to 0 for rest of the interfaces */
-			cb->args[2] = 0;
+			ctx->fidx = 0;
 			fidx = 0;
 cont:
 			idx++;
@@ -4422,9 +4424,9 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 
 out:
-	cb->args[0] = h;
-	cb->args[1] = idx;
-	cb->args[2] = fidx;
+	ctx->pos_hash = h;
+	ctx->pos_idx = idx;
+	ctx->fidx = fidx;
 
 	return skb->len;
 }
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index eb9d9e53c536..f25cd48a75ee 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -193,10 +193,13 @@ dsa_slave_port_fdb_do_dump(const unsigned char *addr, u16 vid,
 	struct dsa_slave_dump_ctx *dump = data;
 	u32 portid = NETLINK_CB(dump->cb->skb).portid;
 	u32 seq = dump->cb->nlh->nlmsg_seq;
+	struct rtnl_fdb_dump_ctx *ctx;
 	struct nlmsghdr *nlh;
 	struct ndmsg *ndm;
 
-	if (dump->idx < dump->cb->args[2])
+	ctx = (struct rtnl_fdb_dump_ctx *)dump->cb->ctx;
+
+	if (dump->idx < ctx->fidx)
 		goto skip;
 
 	nlh = nlmsg_put(dump->skb, portid, seq, RTM_NEWNEIGH,
-- 
2.25.1



* [RFC PATCH 2/4] net: rtnetlink: add a minimal state machine for dumping shared FDBs
  2021-08-21 21:00 [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB Vladimir Oltean
  2021-08-21 21:00 ` [RFC PATCH 1/4] net: rtnetlink: create a netlink cb context struct for fdb dump Vladimir Oltean
@ 2021-08-21 21:00 ` Vladimir Oltean
  2021-08-21 21:00 ` [RFC PATCH 3/4] net: dsa: implement a shared FDB dump procedure Vladimir Oltean
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Vladimir Oltean @ 2021-08-21 21:00 UTC (permalink / raw)
  To: netdev

Some drivers which offload the bridge do not have the ability to
synchronize their hardware FDB with the bridge's software FDB, but they
perform autonomous address learning nonetheless. These drivers implement
.ndo_fdb_dump and report their hardware learned entries as 'self' to
netlink.

The FDB dump procedure for these drivers is wasteful, since the FDB is
often shared across all ports of a bridge (or even globally shared
across all ports of a switch in some cases). Since an FDB dump means
walking the entire FDB, and .ndo_fdb_dump is per netdev, we end up
walking an FDB shared by 10 netdevices 10 times, once for each port.

If on top of that, the access to the FDB is also slow (which is actually
not all that uncommon, since this is one of the reasons these drivers do
not bother to synchronize their hardware FDB in the first place), it
means that an FDB dump is a very inefficient and slow operation - it may
take minutes or more.

This change keeps the .ndo_fdb_dump function prototype as is, but:

- introduces a "prepare" and a "finish" phase. The phase that exists in
  the code base right now is retroactively named "commit" phase.

- if the rtnl_fdb_dump request is specific to a single port, nothing
  changes. We jump straight to the commit phase of that specific port.

- if the rtnl_fdb_dump request is imprecise (no brport_idx or br_idx
  specified), that is when there is an opportunity for improvement.
  rtnl_fdb_dump first enters the "prepare" phase, where it notifies
  _all_ netdev drivers that have the .ndo_fdb_dump method implemented.
  It only enters the "commit" phase once all netdevs have been prepared.
  The "commit" phase may be interrupted by lack of space in the netlink
  skb. No problem, when user space comes back with a new buffer we
  return to the commit phase, just like in the code that exists now.
  After the commit phase ends for all netdevs, rtnl_fdb_dump proceeds to
  call the "finish" phase for all drivers.

In the envisioned use case, a multi-port switch driver will dump its
shared FDB in the "prepare" phase: for .ndo_fdb_dump(dev), it checks
which FDB corresponds to "dev"; if that FDB has already been dumped it
does nothing, otherwise it dumps it and saves the collected FDB entries
(in a list, array, whatever), no matter which port they correspond to.

Then, in the "commit" phase, the FDB entries collected above are
filtered by the "dev" in .ndo_fdb_dump(dev). Only those are reported
inside the netlink skb.

Then, in the "finish" phase, any allocated memory can be freed.

All drivers are modified to ignore any phase other than the "commit"
phase, to preserve existing functionality.
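
Purely as an illustration (the foo_* names below are made up and do not
exist in this series), a driver that wants to take advantage of the
phases could shape its .ndo_fdb_dump roughly like this:

        static int foo_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
                                struct net_device *dev,
                                struct net_device *filter_dev, int *idx)
        {
                struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
                struct foo_priv *priv = netdev_priv(dev);

                switch (ctx->state) {
                case RTNL_FDB_DUMP_PREPARE:
                        /* walk the (slow) hardware FDB once and cache the
                         * entries for all ports that share it
                         */
                        return foo_cache_shared_fdb(priv);
                case RTNL_FDB_DUMP_COMMIT:
                        /* filter the cached entries by "dev" and fill the
                         * netlink skb, resuming from ctx->fidx if interrupted
                         */
                        return foo_report_cached_entries(priv, skb, cb, dev, idx);
                case RTNL_FDB_DUMP_FINISH:
                        /* the commit phase ended for all netdevs, drop the cache */
                        foo_free_cached_fdb(priv);
                        return 0;
                }

                return 0;
        }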

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  4 +
 drivers/net/ethernet/mscc/ocelot_net.c        |  4 +
 drivers/net/vxlan.c                           |  3 +
 include/linux/rtnetlink.h                     |  7 ++
 net/bridge/br_fdb.c                           |  3 +
 net/core/rtnetlink.c                          | 91 +++++++++++++++----
 net/dsa/slave.c                               |  4 +
 7 files changed, 99 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
index dd018dfb25ee..bca3a9c05b18 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-switch.c
@@ -895,6 +895,7 @@ static int dpaa2_switch_port_fdb_dump(struct sk_buff *skb, struct netlink_callba
 				      struct net_device *net_dev,
 				      struct net_device *filter_dev, int *idx)
 {
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
 	struct ethsw_port_priv *port_priv = netdev_priv(net_dev);
 	struct ethsw_dump_ctx dump = {
 		.dev = net_dev,
@@ -904,6 +905,9 @@ static int dpaa2_switch_port_fdb_dump(struct sk_buff *skb, struct netlink_callba
 	};
 	int err;
 
+	if (ctx->state != RTNL_FDB_DUMP_COMMIT)
+		return 0;
+
 	err = dpaa2_switch_fdb_iterate(port_priv, dpaa2_switch_fdb_entry_dump, &dump);
 	*idx = dump.idx;
 
diff --git a/drivers/net/ethernet/mscc/ocelot_net.c b/drivers/net/ethernet/mscc/ocelot_net.c
index 5e8965be968a..02efe452106f 100644
--- a/drivers/net/ethernet/mscc/ocelot_net.c
+++ b/drivers/net/ethernet/mscc/ocelot_net.c
@@ -687,6 +687,7 @@ static int ocelot_port_fdb_dump(struct sk_buff *skb,
 				struct net_device *dev,
 				struct net_device *filter_dev, int *idx)
 {
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
 	struct ocelot_port_private *priv = netdev_priv(dev);
 	struct ocelot *ocelot = priv->port.ocelot;
 	struct ocelot_dump_ctx dump = {
@@ -698,6 +699,9 @@ static int ocelot_port_fdb_dump(struct sk_buff *skb,
 	int port = priv->chip_port;
 	int ret;
 
+	if (ctx->state != RTNL_FDB_DUMP_COMMIT)
+		return 0;
+
 	ret = ocelot_fdb_dump(ocelot, port, ocelot_port_fdb_do_dump, &dump);
 
 	*idx = dump.idx;
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 8c9371bf8195..09f5d796c26b 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -1376,6 +1376,9 @@ static int vxlan_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	unsigned int h;
 	int err = 0;
 
+	if (ctx->state != RTNL_FDB_DUMP_COMMIT)
+		return 0;
+
 	for (h = 0; h < FDB_HASH_SIZE; ++h) {
 		struct vxlan_fdb *f;
 
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index f14cda6939c6..e4773ebde8fc 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -110,6 +110,12 @@ void rtnl_kfree_skbs(struct sk_buff *head, struct sk_buff *tail);
 	WARN_ONCE(!rtnl_is_locked(), \
 		  "RTNL: assertion failed at %s (%d)\n", __FILE__,  __LINE__)
 
+enum rtnl_fdb_dump_state {
+	RTNL_FDB_DUMP_PREPARE,
+	RTNL_FDB_DUMP_COMMIT,
+	RTNL_FDB_DUMP_FINISH,
+};
+
 struct rtnl_fdb_dump_ctx {
 	/* Last bucket in the dev_index_head hash list that was checked.
 	 * Used by rtnl_fdb_dump to resume in case the procedure is
@@ -126,6 +132,7 @@ struct rtnl_fdb_dump_ctx {
 	 * the dump procedure is interrupted.
 	 */
 	int fidx;
+	enum rtnl_fdb_dump_state state;
 };
 
 extern int ndo_dflt_fdb_dump(struct sk_buff *skb,
diff --git a/net/bridge/br_fdb.c b/net/bridge/br_fdb.c
index 2f6527d1df27..cbbd291edb66 100644
--- a/net/bridge/br_fdb.c
+++ b/net/bridge/br_fdb.c
@@ -826,6 +826,9 @@ int br_fdb_dump(struct sk_buff *skb,
 	struct net_bridge_fdb_entry *f;
 	int err = 0;
 
+	if (ctx->state != RTNL_FDB_DUMP_COMMIT)
+		return 0;
+
 	if (!(dev->priv_flags & IFF_EBRIDGE))
 		return err;
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 06cd59b6260a..57d58f3824b0 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4225,8 +4225,12 @@ int ndo_dflt_fdb_dump(struct sk_buff *skb,
 		      struct net_device *filter_dev,
 		      int *idx)
 {
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
 	int err;
 
+	if (ctx->state != RTNL_FDB_DUMP_COMMIT)
+		return 0;
+
 	if (dev->type != ARPHRD_ETHER)
 		return -EINVAL;
 
@@ -4330,30 +4334,40 @@ static int valid_fdb_dump_legacy(const struct nlmsghdr *nlh,
 	return 0;
 }
 
-static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
+static void rtnl_fdb_dump_prepare_finish(struct sk_buff *skb,
+					 struct netlink_callback *cb)
 {
-	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
+	struct net *net = sock_net(skb->sk);
+	struct hlist_head *head;
 	struct net_device *dev;
-	struct net_device *br_dev = NULL;
-	const struct net_device_ops *ops = NULL;
+	int h, fidx = 0;
+
+	for (h = 0; h < NETDEV_HASHENTRIES; h++) {
+		head = &net->dev_index_head[h];
+		hlist_for_each_entry(dev, head, index_hlist) {
+			if (!dev->netdev_ops->ndo_fdb_dump)
+				continue;
+
+			dev->netdev_ops->ndo_fdb_dump(skb, cb, dev,
+						      NULL, &fidx);
+		}
+	}
+}
+
+static int rtnl_fdb_dump_commit(struct sk_buff *skb, struct netlink_callback *cb,
+				int br_idx, int brport_idx)
+{
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
 	const struct net_device_ops *cops = NULL;
+	const struct net_device_ops *ops = NULL;
 	struct net *net = sock_net(skb->sk);
+	struct net_device *br_dev = NULL;
 	struct hlist_head *head;
-	int brport_idx = 0;
-	int br_idx = 0;
-	int h, s_h;
+	struct net_device *dev;
 	int idx = 0, s_idx;
-	int err = 0;
 	int fidx = 0;
-
-	if (cb->strict_check)
-		err = valid_fdb_dump_strict(cb->nlh, &br_idx, &brport_idx,
-					    cb->extack);
-	else
-		err = valid_fdb_dump_legacy(cb->nlh, &br_idx, &brport_idx,
-					    cb->extack);
-	if (err < 0)
-		return err;
+	int err = 0;
+	int h, s_h;
 
 	if (br_idx) {
 		br_dev = __dev_get_by_index(net, br_idx);
@@ -4431,6 +4445,49 @@ static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
+static int rtnl_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
+	int brport_idx = 0;
+	int br_idx = 0;
+	int err;
+
+	if (cb->strict_check)
+		err = valid_fdb_dump_strict(cb->nlh, &br_idx, &brport_idx,
+					    cb->extack);
+	else
+		err = valid_fdb_dump_legacy(cb->nlh, &br_idx, &brport_idx,
+					    cb->extack);
+	if (err < 0)
+		return err;
+
+	/* user did not specify a bridge or a bridge port */
+	if (!brport_idx && !br_idx) {
+		switch (ctx->state) {
+		case RTNL_FDB_DUMP_PREPARE:
+			rtnl_fdb_dump_prepare_finish(skb, cb);
+			ctx->state = RTNL_FDB_DUMP_COMMIT;
+			fallthrough;
+		case RTNL_FDB_DUMP_COMMIT:
+			err = rtnl_fdb_dump_commit(skb, cb, br_idx, brport_idx);
+			if (err)
+				return err;
+			ctx->state = RTNL_FDB_DUMP_FINISH;
+			fallthrough;
+		case RTNL_FDB_DUMP_FINISH:
+			rtnl_fdb_dump_prepare_finish(skb, cb);
+			break;
+		}
+	} else {
+		ctx->state = RTNL_FDB_DUMP_COMMIT;
+		err = rtnl_fdb_dump_commit(skb, cb, br_idx, brport_idx);
+		if (err)
+			return err;
+	}
+
+	return err;
+}
+
 static int valid_fdb_get_strict(const struct nlmsghdr *nlh,
 				struct nlattr **tb, u8 *ndm_flags,
 				int *br_idx, int *brport_idx, u8 **addr,
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index f25cd48a75ee..9331093a84dd 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -238,6 +238,7 @@ dsa_slave_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 		   struct net_device *dev, struct net_device *filter_dev,
 		   int *idx)
 {
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
 	struct dsa_port *dp = dsa_slave_to_port(dev);
 	struct dsa_slave_dump_ctx dump = {
 		.dev = dev,
@@ -247,6 +248,9 @@ dsa_slave_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	};
 	int err;
 
+	if (ctx->state != RTNL_FDB_DUMP_COMMIT)
+		return 0;
+
 	err = dsa_port_fdb_dump(dp, dsa_slave_port_fdb_do_dump, &dump);
 	*idx = dump.idx;
 
-- 
2.25.1



* [RFC PATCH 3/4] net: dsa: implement a shared FDB dump procedure
  2021-08-21 21:00 [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB Vladimir Oltean
  2021-08-21 21:00 ` [RFC PATCH 1/4] net: rtnetlink: create a netlink cb context struct for fdb dump Vladimir Oltean
  2021-08-21 21:00 ` [RFC PATCH 2/4] net: rtnetlink: add a minimal state machine for dumping shared FDBs Vladimir Oltean
@ 2021-08-21 21:00 ` Vladimir Oltean
  2021-08-21 21:00 ` [RFC PATCH 4/4] net: dsa: sja1105: implement shared FDB dump Vladimir Oltean
  2021-09-23 15:25 ` [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB Vladimir Oltean
  4 siblings, 0 replies; 10+ messages in thread
From: Vladimir Oltean @ 2021-08-21 21:00 UTC (permalink / raw)
  To: netdev

Create a list of FDB entries per switch that will be:

- populated during the ndo_fdb_dump prepare phase
- looked up during the ndo_fdb_dump commit phase
- freed during the ndo_fdb_dump finish phase

Also add a bool ds->shared_fdb_dump_in_progress to denote whether we should
perform the shared FDB dump or the normal FDB dump procedure (since the
shared FDB dump needs more memory, we prefer the per-port procedure for
dumps that target a single port).

Introduce a new dsa_switch_ops method for the shared FDB dump. This is
"switch_fdb_dump" and lacks a "port" argument - instead, the switch is
supposed to provide the port for each FDB entry it finds.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 include/net/dsa.h  |  17 ++++
 net/dsa/dsa2.c     |   2 +
 net/dsa/dsa_priv.h |   1 +
 net/dsa/slave.c    | 194 ++++++++++++++++++++++++++++++++++++++-------
 net/dsa/switch.c   |   8 ++
 5 files changed, 195 insertions(+), 27 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 0c2cba45fa79..23b675f843f4 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -312,8 +312,17 @@ struct dsa_mac_addr {
 	struct list_head list;
 };
 
+struct dsa_fdb_entry {
+	unsigned char addr[ETH_ALEN];
+	u16 vid;
+	bool is_static;
+	struct net_device *dev;
+	struct list_head list;
+};
+
 struct dsa_switch {
 	bool setup;
+	bool shared_fdb_dump_in_progress;
 
 	struct device *dev;
 
@@ -355,6 +364,9 @@ struct dsa_switch {
 	/* Storage for drivers using tag_8021q */
 	struct dsa_8021q_context *tag_8021q_ctx;
 
+	/* Storage for shared FDB dumps */
+	struct list_head	fdb_list;
+
 	/* devlink used to represent this switch device */
 	struct devlink		*devlink;
 
@@ -565,6 +577,9 @@ struct net_device *dsa_port_to_bridge_port(const struct dsa_port *dp)
 
 typedef int dsa_fdb_dump_cb_t(const unsigned char *addr, u16 vid,
 			      bool is_static, void *data);
+typedef int dsa_switch_fdb_dump_cb_t(struct dsa_switch *ds, int port,
+				     const unsigned char *addr, u16 vid,
+				     bool is_static);
 struct dsa_switch_ops {
 	/*
 	 * Tagging protocol helpers called for the CPU ports and DSA links.
@@ -737,6 +752,8 @@ struct dsa_switch_ops {
 				const unsigned char *addr, u16 vid);
 	int	(*port_fdb_dump)(struct dsa_switch *ds, int port,
 				 dsa_fdb_dump_cb_t *cb, void *data);
+	int	(*switch_fdb_dump)(struct dsa_switch *ds,
+				   dsa_switch_fdb_dump_cb_t *cb);
 
 	/*
 	 * Multicast database
diff --git a/net/dsa/dsa2.c b/net/dsa/dsa2.c
index dcd67801eca4..99b5aad46b02 100644
--- a/net/dsa/dsa2.c
+++ b/net/dsa/dsa2.c
@@ -801,6 +801,8 @@ static int dsa_switch_setup(struct dsa_switch *ds)
 			goto teardown;
 	}
 
+	INIT_LIST_HEAD(&ds->fdb_list);
+
 	ds->setup = true;
 
 	return 0;
diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h
index b7a269e0513f..c8306b1f1c11 100644
--- a/net/dsa/dsa_priv.h
+++ b/net/dsa/dsa_priv.h
@@ -533,6 +533,7 @@ static inline void *dsa_etype_header_pos_tx(struct sk_buff *skb)
 /* switch.c */
 int dsa_switch_register_notifier(struct dsa_switch *ds);
 void dsa_switch_unregister_notifier(struct dsa_switch *ds);
+int dsa_switch_fdb_dump(struct dsa_switch *ds, dsa_switch_fdb_dump_cb_t *cb);
 
 /* dsa2.c */
 void dsa_lag_map(struct dsa_switch_tree *dst, struct net_device *lag);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 9331093a84dd..ba864c5d1350 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -186,23 +186,18 @@ struct dsa_slave_dump_ctx {
 	int idx;
 };
 
-static int
-dsa_slave_port_fdb_do_dump(const unsigned char *addr, u16 vid,
-			   bool is_static, void *data)
-{
-	struct dsa_slave_dump_ctx *dump = data;
-	u32 portid = NETLINK_CB(dump->cb->skb).portid;
-	u32 seq = dump->cb->nlh->nlmsg_seq;
-	struct rtnl_fdb_dump_ctx *ctx;
+static int dsa_nlmsg_populate_fdb(struct sk_buff *skb,
+				  struct netlink_callback *cb,
+				  struct net_device *dev,
+				  const unsigned char *addr, u16 vid,
+				  bool is_static)
+{
+	u32 portid = NETLINK_CB(cb->skb).portid;
+	u32 seq = cb->nlh->nlmsg_seq;
 	struct nlmsghdr *nlh;
 	struct ndmsg *ndm;
 
-	ctx = (struct rtnl_fdb_dump_ctx *)dump->cb->ctx;
-
-	if (dump->idx < ctx->fidx)
-		goto skip;
-
-	nlh = nlmsg_put(dump->skb, portid, seq, RTM_NEWNEIGH,
+	nlh = nlmsg_put(skb, portid, seq, RTM_NEWNEIGH,
 			sizeof(*ndm), NLM_F_MULTI);
 	if (!nlh)
 		return -EMSGSIZE;
@@ -213,32 +208,152 @@ dsa_slave_port_fdb_do_dump(const unsigned char *addr, u16 vid,
 	ndm->ndm_pad2    = 0;
 	ndm->ndm_flags   = NTF_SELF;
 	ndm->ndm_type    = 0;
-	ndm->ndm_ifindex = dump->dev->ifindex;
+	ndm->ndm_ifindex = dev->ifindex;
 	ndm->ndm_state   = is_static ? NUD_NOARP : NUD_REACHABLE;
 
-	if (nla_put(dump->skb, NDA_LLADDR, ETH_ALEN, addr))
+	if (nla_put(skb, NDA_LLADDR, ETH_ALEN, addr))
 		goto nla_put_failure;
 
-	if (vid && nla_put_u16(dump->skb, NDA_VLAN, vid))
+	if (vid && nla_put_u16(skb, NDA_VLAN, vid))
 		goto nla_put_failure;
 
-	nlmsg_end(dump->skb, nlh);
+	nlmsg_end(skb, nlh);
 
-skip:
-	dump->idx++;
 	return 0;
 
 nla_put_failure:
-	nlmsg_cancel(dump->skb, nlh);
+	nlmsg_cancel(skb, nlh);
 	return -EMSGSIZE;
 }
 
+static int dsa_switch_shared_fdb_save_one(struct dsa_switch *ds, int port,
+					  const unsigned char *addr, u16 vid,
+					  bool is_static)
+{
+	struct dsa_port *dp = dsa_to_port(ds, port);
+	struct dsa_fdb_entry *fdb;
+
+	if (!dsa_port_is_user(dp))
+		return 0;
+
+	/* Will be freed during the finish phase */
+	fdb = kzalloc(sizeof(*fdb), GFP_KERNEL);
+	if (!fdb)
+		return -ENOMEM;
+
+	ether_addr_copy(fdb->addr, addr);
+	fdb->vid = vid;
+	fdb->is_static = is_static;
+	fdb->dev = dp->slave;
+	list_add_tail(&fdb->list, &ds->fdb_list);
+
+	return 0;
+}
+
+/* If the switch does not support shared FDB dump, do nothing and do the work
+ * in the commit phase.
+ */
+static int dsa_shared_fdb_dump_prepare(struct net_device *dev)
+{
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+	int err;
+
+	if (!ds->ops->switch_fdb_dump)
+		return 0;
+
+	if (ds->shared_fdb_dump_in_progress)
+		return 0;
+
+	/* If this switch's FDB has not been dumped before during this
+	 * prepare/commit/finish cycle, dump it now and save the results.
+	 */
+	err = dsa_switch_fdb_dump(ds, dsa_switch_shared_fdb_save_one);
+	if (err)
+		return err;
+
+	ds->shared_fdb_dump_in_progress = true;
+
+	return 0;
+}
+
 static int
-dsa_slave_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
-		   struct net_device *dev, struct net_device *filter_dev,
-		   int *idx)
+dsa_shared_fdb_dump_commit(struct sk_buff *skb, struct netlink_callback *cb,
+			   struct net_device *dev, int *idx)
 {
 	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+	struct dsa_fdb_entry *fdb;
+	int err;
+
+	/* Dump the FDB entries corresponding to the requested port from the
+	 * saved results.
+	 */
+	list_for_each_entry(fdb, &ds->fdb_list, list) {
+		if (fdb->dev != dev)
+			continue;
+
+		if (*idx < ctx->fidx)
+			goto skip;
+
+		err = dsa_nlmsg_populate_fdb(skb, cb, dev, fdb->addr, fdb->vid,
+					     fdb->is_static);
+		if (err)
+			return err;
+
+skip:
+		*idx += 1;
+	}
+
+	return 0;
+}
+
+/* Tear down the context stored during the shared FDB dump */
+static void dsa_shared_fdb_dump_finish(struct net_device *dev)
+{
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_fdb_entry *fdb, *tmp;
+	struct dsa_switch *ds = dp->ds;
+
+	if (!ds->shared_fdb_dump_in_progress)
+		return;
+
+	list_for_each_entry_safe(fdb, tmp, &ds->fdb_list, list) {
+		list_del(&fdb->list);
+		kfree(fdb);
+	}
+
+	ds->shared_fdb_dump_in_progress = false;
+}
+
+static int
+dsa_slave_port_fdb_do_dump(const unsigned char *addr, u16 vid,
+			   bool is_static, void *data)
+{
+	struct dsa_slave_dump_ctx *dump = data;
+	struct rtnl_fdb_dump_ctx *ctx;
+	int err;
+
+	ctx = (struct rtnl_fdb_dump_ctx *)dump->cb->ctx;
+
+	if (dump->idx < ctx->fidx)
+		goto skip;
+
+	err = dsa_nlmsg_populate_fdb(dump->skb, dump->cb, dump->dev, addr, vid,
+				     is_static);
+	if (err)
+		return err;
+
+skip:
+	dump->idx++;
+	return 0;
+}
+
+static int
+dsa_slave_fdb_dump_single(struct sk_buff *skb, struct netlink_callback *cb,
+			  struct net_device *dev, int *idx)
+{
 	struct dsa_port *dp = dsa_slave_to_port(dev);
 	struct dsa_slave_dump_ctx dump = {
 		.dev = dev,
@@ -248,15 +363,40 @@ dsa_slave_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	};
 	int err;
 
-	if (ctx->state != RTNL_FDB_DUMP_COMMIT)
-		return 0;
-
 	err = dsa_port_fdb_dump(dp, dsa_slave_port_fdb_do_dump, &dump);
 	*idx = dump.idx;
 
 	return err;
 }
 
+static int
+dsa_slave_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
+		   struct net_device *dev, struct net_device *filter_dev,
+		   int *idx)
+{
+	struct rtnl_fdb_dump_ctx *ctx = (struct rtnl_fdb_dump_ctx *)cb->ctx;
+	struct dsa_port *dp = dsa_slave_to_port(dev);
+	struct dsa_switch *ds = dp->ds;
+	int err = 0;
+
+	switch (ctx->state) {
+	case RTNL_FDB_DUMP_PREPARE:
+		err = dsa_shared_fdb_dump_prepare(dev);
+		break;
+	case RTNL_FDB_DUMP_COMMIT:
+		if (ds->shared_fdb_dump_in_progress)
+			err = dsa_shared_fdb_dump_commit(skb, cb, dev, idx);
+		else
+			err = dsa_slave_fdb_dump_single(skb, cb, dev, idx);
+		break;
+	case RTNL_FDB_DUMP_FINISH:
+		dsa_shared_fdb_dump_finish(dev);
+		break;
+	}
+
+	return err;
+}
+
 static int dsa_slave_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
 {
 	struct dsa_slave_priv *p = netdev_priv(dev);
diff --git a/net/dsa/switch.c b/net/dsa/switch.c
index fd1a1c6bf9cf..a64613e1f99e 100644
--- a/net/dsa/switch.c
+++ b/net/dsa/switch.c
@@ -655,6 +655,14 @@ dsa_switch_mrp_del_ring_role(struct dsa_switch *ds,
 	return 0;
 }
 
+int dsa_switch_fdb_dump(struct dsa_switch *ds, dsa_switch_fdb_dump_cb_t *cb)
+{
+	if (!ds->ops->switch_fdb_dump)
+		return -EOPNOTSUPP;
+
+	return ds->ops->switch_fdb_dump(ds, cb);
+}
+
 static int dsa_switch_event(struct notifier_block *nb,
 			    unsigned long event, void *info)
 {
-- 
2.25.1



* [RFC PATCH 4/4] net: dsa: sja1105: implement shared FDB dump
  2021-08-21 21:00 [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB Vladimir Oltean
                   ` (2 preceding siblings ...)
  2021-08-21 21:00 ` [RFC PATCH 3/4] net: dsa: implement a shared FDB dump procedure Vladimir Oltean
@ 2021-08-21 21:00 ` Vladimir Oltean
  2021-09-23 15:25 ` [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB Vladimir Oltean
  4 siblings, 0 replies; 10+ messages in thread
From: Vladimir Oltean @ 2021-08-21 21:00 UTC (permalink / raw)
  To: netdev

This driver already walks linearly over the FDB in the .port_fdb_dump
method, so .switch_fdb_dump can reuse the same logic and simply call a
different DSA callback when it finds something.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
---
 drivers/net/dsa/sja1105/sja1105_main.c | 50 +++++++++++++++++++-------
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/net/dsa/sja1105/sja1105_main.c b/drivers/net/dsa/sja1105/sja1105_main.c
index 05ba65042b5f..e1e9e514814e 100644
--- a/drivers/net/dsa/sja1105/sja1105_main.c
+++ b/drivers/net/dsa/sja1105/sja1105_main.c
@@ -1747,14 +1747,17 @@ static int sja1105_fdb_del(struct dsa_switch *ds, int port,
 }
 
 static int sja1105_fdb_dump(struct dsa_switch *ds, int port,
-			    dsa_fdb_dump_cb_t *cb, void *data)
+			    dsa_fdb_dump_cb_t *port_cb,
+			    dsa_switch_fdb_dump_cb_t *switch_cb,
+			    void *data)
 {
 	struct sja1105_private *priv = ds->priv;
 	struct device *dev = ds->dev;
-	int i;
+	int i, p;
 
 	for (i = 0; i < SJA1105_MAX_L2_LOOKUP_COUNT; i++) {
 		struct sja1105_l2_lookup_entry l2_lookup = {0};
+		unsigned long destports;
 		u8 macaddr[ETH_ALEN];
 		int rc;
 
@@ -1768,13 +1771,12 @@ static int sja1105_fdb_dump(struct dsa_switch *ds, int port,
 			return rc;
 		}
 
-		/* FDB dump callback is per port. This means we have to
-		 * disregard a valid entry if it's not for this port, even if
-		 * only to revisit it later. This is inefficient because the
-		 * 1024-sized FDB table needs to be traversed 4 times through
-		 * SPI during a 'bridge fdb show' command.
+		destports = l2_lookup.destports;
+
+		/* If the FDB dump callback is per port, ignore the entries
+		 * belonging to a different one.
 		 */
-		if (!(l2_lookup.destports & BIT(port)))
+		if (port >= 0 && !(destports & BIT(port)))
 			continue;
 
 		/* We need to hide the FDB entry for unknown multicast */
@@ -1787,13 +1789,36 @@ static int sja1105_fdb_dump(struct dsa_switch *ds, int port,
 		/* We need to hide the dsa_8021q VLANs from the user. */
 		if (!priv->vlan_aware)
 			l2_lookup.vlanid = 0;
-		rc = cb(macaddr, l2_lookup.vlanid, l2_lookup.lockeds, data);
-		if (rc)
-			return rc;
+
+		if (port_cb) {
+			rc = port_cb(macaddr, l2_lookup.vlanid,
+				     l2_lookup.lockeds, data);
+			if (rc)
+				return rc;
+		} else {
+			for_each_set_bit(p, &destports, ds->num_ports) {
+				rc = switch_cb(ds, p, macaddr, l2_lookup.vlanid,
+					       l2_lookup.lockeds);
+				if (rc)
+					return rc;
+			}
+		}
 	}
 	return 0;
 }
 
+static int sja1105_port_fdb_dump(struct dsa_switch *ds, int port,
+				 dsa_fdb_dump_cb_t *cb, void *data)
+{
+	return sja1105_fdb_dump(ds, port, cb, NULL, data);
+}
+
+static int sja1105_switch_fdb_dump(struct dsa_switch *ds,
+				   dsa_switch_fdb_dump_cb_t *cb)
+{
+	return sja1105_fdb_dump(ds, -1, NULL, cb, NULL);
+}
+
 static void sja1105_fast_age(struct dsa_switch *ds, int port)
 {
 	struct sja1105_private *priv = ds->priv;
@@ -3114,7 +3139,8 @@ const struct dsa_switch_ops sja1105_switch_ops = {
 	.get_sset_count		= sja1105_get_sset_count,
 	.get_ts_info		= sja1105_get_ts_info,
 	.port_disable		= sja1105_port_disable,
-	.port_fdb_dump		= sja1105_fdb_dump,
+	.port_fdb_dump		= sja1105_port_fdb_dump,
+	.switch_fdb_dump	= sja1105_switch_fdb_dump,
 	.port_fdb_add		= sja1105_fdb_add,
 	.port_fdb_del		= sja1105_fdb_del,
 	.port_fast_age		= sja1105_fast_age,
-- 
2.25.1



* Re: [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB
  2021-08-21 21:00 [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB Vladimir Oltean
                   ` (3 preceding siblings ...)
  2021-08-21 21:00 ` [RFC PATCH 4/4] net: dsa: sja1105: implement shared FDB dump Vladimir Oltean
@ 2021-09-23 15:25 ` Vladimir Oltean
  2021-09-23 22:29   ` Florian Fainelli
  4 siblings, 1 reply; 10+ messages in thread
From: Vladimir Oltean @ 2021-09-23 15:25 UTC (permalink / raw)
  To: Roopa Prabhu, Andrew Lunn, Florian Fainelli; +Cc: netdev

Roopa, Andrew, Florian,

On Sun, Aug 22, 2021 at 12:00:14AM +0300, Vladimir Oltean wrote:
> I have a board where it is painfully slow to run "bridge fdb". It has 16
> switch ports which are accessed over an I2C controller -> I2C mux 1 ->
> I2C mux 2 -> I2C-to-SPI bridge.
> 
> It doesn't really help either that we traverse the hardware FDB of each
> switch for every netdev, even though we already know all there is to
> know the first time we traversed it. In fact, I hacked up some rtnetlink
> and DSA changes, and with those, the time to run 'bridge fdb' on this
> board decreases from 207 seconds to 26 seconds (2 FDB traversals instead
> of 16), turning something intolerable into 'tolerable'.
> 
> I don't know how much we care about .ndo_fdb_dump implemented directly
> by drivers (and that's where I expect this to be most useful), because
> of SWITCHDEV_FDB_ADD_TO_BRIDGE and all that. So this is RFC in case it
> is helpful for somebody, at least during debugging.
> 
> Vladimir Oltean (4):
>   net: rtnetlink: create a netlink cb context struct for fdb dump
>   net: rtnetlink: add a minimal state machine for dumping shared FDBs
>   net: dsa: implement a shared FDB dump procedure
>   net: dsa: sja1105: implement shared FDB dump
> 
>  drivers/net/dsa/sja1105/sja1105_main.c        |  50 +++--
>  .../ethernet/freescale/dpaa2/dpaa2-switch.c   |   9 +-
>  drivers/net/ethernet/mscc/ocelot.c            |   5 +-
>  drivers/net/ethernet/mscc/ocelot_net.c        |   4 +
>  drivers/net/vxlan.c                           |   8 +-
>  include/linux/rtnetlink.h                     |  25 +++
>  include/net/dsa.h                             |  17 ++
>  net/bridge/br_fdb.c                           |   6 +-
>  net/core/rtnetlink.c                          | 105 +++++++---
>  net/dsa/dsa2.c                                |   2 +
>  net/dsa/dsa_priv.h                            |   1 +
>  net/dsa/slave.c                               | 189 ++++++++++++++++--
>  net/dsa/switch.c                              |   8 +
>  13 files changed, 368 insertions(+), 61 deletions(-)
> 
> -- 
> 2.25.1
> 

Does something like this have any chance of being accepted?
https://patchwork.kernel.org/project/netdevbpf/cover/20210821210018.1314952-1-vladimir.oltean@nxp.com/


* Re: [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB
  2021-09-23 15:25 ` [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB Vladimir Oltean
@ 2021-09-23 22:29   ` Florian Fainelli
  2021-09-23 22:49     ` Vladimir Oltean
  0 siblings, 1 reply; 10+ messages in thread
From: Florian Fainelli @ 2021-09-23 22:29 UTC (permalink / raw)
  To: Vladimir Oltean, Roopa Prabhu, Andrew Lunn; +Cc: netdev

On 9/23/21 8:25 AM, Vladimir Oltean wrote:
> Roopa, Andrew, Florian,
> 
> On Sun, Aug 22, 2021 at 12:00:14AM +0300, Vladimir Oltean wrote:
>> I have a board where it is painfully slow to run "bridge fdb". It has 16
>> switch ports which are accessed over an I2C controller -> I2C mux 1 ->
>> I2C mux 2 -> I2C-to-SPI bridge.
>>
>> It doesn't really help either that we traverse the hardware FDB of each
>> switch for every netdev, even though we already know all there is to
>> know the first time we traversed it. In fact, I hacked up some rtnetlink
>> and DSA changes, and with those, the time to run 'bridge fdb' on this
>> board decreases from 207 seconds to 26 seconds (2 FDB traversals instead
>> of 16), turning something intolerable into 'tolerable'.
>>
>> I don't know how much we care about .ndo_fdb_dump implemented directly
>> by drivers (and that's where I expect this to be most useful), because
>> of SWITCHDEV_FDB_ADD_TO_BRIDGE and all that. So this is RFC in case it
>> is helpful for somebody, at least during debugging.
>>
>> Vladimir Oltean (4):
>>   net: rtnetlink: create a netlink cb context struct for fdb dump
>>   net: rtnetlink: add a minimal state machine for dumping shared FDBs
>>   net: dsa: implement a shared FDB dump procedure
>>   net: dsa: sja1105: implement shared FDB dump
>>
>>  drivers/net/dsa/sja1105/sja1105_main.c        |  50 +++--
>>  .../ethernet/freescale/dpaa2/dpaa2-switch.c   |   9 +-
>>  drivers/net/ethernet/mscc/ocelot.c            |   5 +-
>>  drivers/net/ethernet/mscc/ocelot_net.c        |   4 +
>>  drivers/net/vxlan.c                           |   8 +-
>>  include/linux/rtnetlink.h                     |  25 +++
>>  include/net/dsa.h                             |  17 ++
>>  net/bridge/br_fdb.c                           |   6 +-
>>  net/core/rtnetlink.c                          | 105 +++++++---
>>  net/dsa/dsa2.c                                |   2 +
>>  net/dsa/dsa_priv.h                            |   1 +
>>  net/dsa/slave.c                               | 189 ++++++++++++++++--
>>  net/dsa/switch.c                              |   8 +
>>  13 files changed, 368 insertions(+), 61 deletions(-)
>>
>> -- 
>> 2.25.1
>>
> 
> Does something like this have any chance of being accepted?
> https://patchwork.kernel.org/project/netdevbpf/cover/20210821210018.1314952-1-vladimir.oltean@nxp.com/

Had not seen the link you just posted; in principle, speeding up the FDB
dump sounds good to me, especially given that we typically have these
slow buses to work with.

These questions are probably super stupid and trivial and I really
missed reviewing properly your latest work, how do we manage to keep the
bridge's FDB and hardware FDB in sync given that switches don't
typically tell us when they learn new addresses?
--
Florian


* Re: [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB
  2021-09-23 22:29   ` Florian Fainelli
@ 2021-09-23 22:49     ` Vladimir Oltean
  2021-09-24 10:03       ` Nikolay Aleksandrov
  0 siblings, 1 reply; 10+ messages in thread
From: Vladimir Oltean @ 2021-09-23 22:49 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: Roopa Prabhu, Andrew Lunn, netdev

On Thu, Sep 23, 2021 at 03:29:07PM -0700, Florian Fainelli wrote:
> > Does something like this have any chance of being accepted?
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210821210018.1314952-1-vladimir.oltean@nxp.com/
> 
> Had not seen the link you just posted; in principle, speeding up the FDB
> dump sounds good to me, especially given that we typically have these
> slow buses to work with.

I didn't copy you because you were on vacation.

> These questions are probably super stupid and trivial and I really
> missed reviewing properly your latest work, how do we manage to keep the
> bridge's FDB and hardware FDB in sync given that switches don't
> typically tell us when they learn new addresses?

We don't, that's the premise, you didn't miss anything there. I mean in
the switchdev world, I see that an interrupt is the primary mechanism,
and we do have DSA switches which are theoretically capable of notifying
of new addresses, but not through interrupts but over Ethernet, of
course. Ocelot for example supports sending "learn frames" to the CPU -
these are just like flooded frames, except that a "flooded" frame is one
with unknown MAC DA, and a "learn" frame is one with unknown MAC SA.
I've been thinking whether it's worth adding support for learn frames
coming from Ocelot/Felix in DSA, and it doesn't really look like it.
Anyway, when the DSA tagging protocol receives a "learn" frame, it needs
to consume_skb() it, then trigger some sort of switchdev atomic notifier
for that MAC SA (SWITCHDEV_FDB_ADD_TO_BRIDGE), but sadly that is only
the beginning of a long series of complications, because we also need to
know when the hardware has forgotten it, and it doesn't look like
"forget" frames are a thing. So because the risk of having an address
expire in hardware while it is still present in software is kind of
high, the only option left is to make the hardware entry static, and
(a) delete it manually when the software entry expires
(b) set up a second alert which triggers when a MAC SA has been received on a
    port other than the destination port which is pointed towards by an
    existing FDB entry. In short, "station migration alert". Because the
    FDB entry is static, we need to migrate it by hand, in software.
So it all looks kind of clunky. Whereas what we do now is basically
assume that the amount of frames with unknown MAC DA reaching the CPU
via flooding is more or less equal and equally distributed with the
frames with unknown MAC SA reaching the CPU. I have not yet encountered
a use case where the two are tragically different, in a way that could
be solved only with SWITCHDEV_FDB_ADD_TO_BRIDGE events and in no other way.
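
(For reference, the notification itself would be the easy part - a rough
sketch is below, where foo_report_learned_addr() is an invented helper
and only the switchdev/DSA calls are existing API; it is everything
around it that gets clunky.)

        static int foo_report_learned_addr(struct dsa_switch *ds, int port,
                                           const unsigned char *addr, u16 vid)
        {
                struct dsa_port *dp = dsa_to_port(ds, port);
                struct switchdev_notifier_fdb_info info = {
                        .addr = addr,
                        .vid = vid,
                        .offloaded = true,
                };

                /* let the bridge know the hardware learned this address */
                return call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_BRIDGE,
                                                dp->slave, &info.info, NULL);
        }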


Anyway, huge digression, the idea was that DSA doesn't synchronize FDBs
and that is the reason in the first place why we have an .ndo_fdb_dump
implementation. Theoretically if all hardware drivers were perfect,
you'd only have the .ndo_fdb_dump implementation done for the bridge,
vxlan, things like that. So that is why I asked Roopa whether hacking up
rtnl_fdb_dump in this way, transforming it into a state machine even
more complicated than it already was, is acceptable. We aren't the
primary use case of it, I think.


* Re: [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB
  2021-09-23 22:49     ` Vladimir Oltean
@ 2021-09-24 10:03       ` Nikolay Aleksandrov
  2021-09-25 14:31         ` Roopa Prabhu
  0 siblings, 1 reply; 10+ messages in thread
From: Nikolay Aleksandrov @ 2021-09-24 10:03 UTC (permalink / raw)
  To: Vladimir Oltean, Florian Fainelli; +Cc: Roopa Prabhu, Andrew Lunn, netdev

On 24/09/2021 01:49, Vladimir Oltean wrote:
> On Thu, Sep 23, 2021 at 03:29:07PM -0700, Florian Fainelli wrote:
>>> Does something like this have any chance of being accepted?
>>> https://patchwork.kernel.org/project/netdevbpf/cover/20210821210018.1314952-1-vladimir.oltean@nxp.com/
>>
>> Had not seen the link you just posted; in principle, speeding up the FDB
>> dump sounds good to me, especially given that we typically have these
>> slow buses to work with.
> 
> I didn't copy you because you were on vacation.
> 
>> These questions are probably super stupid and trivial and I really
>> missed reviewing properly your latest work, how do we manage to keep the
>> bridge's FDB and hardware FDB in sync given that switches don't
>> typically tell us when they learn new addresses?
> 
> We don't, that's the premise, you didn't miss anything there. I mean in
> the switchdev world, I see that an interrupt is the primary mechanism,
> and we do have DSA switches which are theoretically capable of notifying
> of new addresses, but not through interrupts but over Ethernet, of
> course. Ocelot for example supports sending "learn frames" to the CPU -
> these are just like flooded frames, except that a "flooded" frame is one
> with unknown MAC DA, and a "learn" frame is one with unknown MAC SA.
> I've been thinking whether it's worth adding support for learn frames
> coming from Ocelot/Felix in DSA, and it doesn't really look like it.
> Anyway, when the DSA tagging protocol receives a "learn" frame, it needs
> to consume_skb() it, then trigger some sort of switchdev atomic notifier
> for that MAC SA (SWITCHDEV_FDB_ADD_TO_BRIDGE), but sadly that is only
> the beginning of a long series of complications, because we also need to
> know when the hardware has forgotten it, and it doesn't look like
> "forget" frames are a thing. So because the risk of having an address
> expire in hardware while it is still present in software is kind of
> high, the only option left is to make the hardware entry static, and
> (a) delete it manually when the software entry expires
> (b) set up a second alert which triggers a MAC SA has been received on a
>     port other than the destination port which is pointed towards by an
>     existing FDB entry. In short, "station migration alert". Because the
>     FDB entry is static, we need to migrate it by hand, in software.
> So it all looks kind of clunky. Whereas what we do now is basically
> assume that the amount of frames with unknown MAC DA reaching the CPU
> via flooding is more or less equal and equally distributed with the
> frames with unknown MAC SA reaching the CPU. I have not yet encountered
> a use case where the two are tragically different, in a way that could
> be solved only with SWITCHDEV_FDB_ADD_TO_BRIDGE events and in no other way.
> 
> 
> Anyway, huge digression, the idea was that DSA doesn't synchronize FDBs
> and that is the reason in the first place why we have an .ndo_fdb_dump
> implementation. Theoretically if all hardware drivers were perfect,
> you'd only have the .ndo_fdb_dump implementation done for the bridge,
> vxlan, things like that. So that is why I asked Roopa whether hacking up
> rtnl_fdb_dump in this way, transforming it into a state machine even
> more complicated than it already was, is acceptable. We aren't the
> primary use case of it, I think.
> 

Hi Vladimir,
I glanced over the patches and the obvious question that comes first is have you
tried pushing all of this complexity closer to the driver which needs it?
I mean rtnl_fdb_dump() can already "resume" and passes all the necessary resume indices
to ndo_fdb_dump(), so it seems to me that all of this can be hidden away. I doubt
there will be a many users overall, so it would be nice to avoid adding the complexity
as you put it and supporting it in the core. I'd imagine a hacked driver would simply cache
the dump for some time while needed (that's important to define well, more below) and just
return it for the next couple of devices which share it upon request, basically you'd have the
same type of solution you've done here, just have the details hidden in the layer which needs it.

Now the hard part seems to be figuring out when to finish in this case. Prepare should be a simple
check if a shared fdb list is populated, finish would need to be inferred. One way to do that is
based on a transaction/dump id which is tracked for each shared device and the last dump. Another
is if you just pass a new argument to ndo_fdb_dump() if it's dumping a single device or doing a
full dump, since that's the case that's difficult to differentiate. If you're doing a single
dump - obviously you do a normal fdb dump without caching, if you're doing a full dump (all devices)
then you need to check if the list is populated, if it is and this is the last device you need to free
the cached shared list (finish phase). That would make the core change very small and would push the
complexity to be maintained where it's needed. Actually you have the netlink_callback struct passed
down to ndo_fdb_dump() so probably that can be used to infer if you're doing a single or multi- device
dump already and there can be 0 core changes.
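
Roughly - and this is just a sketch, the helper name is made up and it
assumes the strict-check ndmsg request header - the driver could peek at
the request like this:

        static bool foo_fdb_dump_targets_single_port(const struct netlink_callback *cb)
        {
                const struct ndmsg *ndm = nlmsg_data(cb->nlh);

                /* a non-zero ifindex means user space asked for a single
                 * bridge port, so there is no point caching the shared FDB
                 */
                return ndm->ndm_ifindex != 0;
        }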

Also one current downside with this set is the multiple walks over the net device list for a single dump,
especially for setups with thousands of net devices. Actually, I might be missing something, but what
happens if a dump is terminated before it's finished? Looks like the finish phase will never be reached.
That would be a problem for both solutions, you might have to add a netlink ->done() callback anyway.

Cheers,
 Nik








* Re: [RFC PATCH 0/4] Faster ndo_fdb_dump for drivers with shared FDB
  2021-09-24 10:03       ` Nikolay Aleksandrov
@ 2021-09-25 14:31         ` Roopa Prabhu
  0 siblings, 0 replies; 10+ messages in thread
From: Roopa Prabhu @ 2021-09-25 14:31 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Vladimir Oltean, Florian Fainelli
  Cc: Andrew Lunn, netdev


On 9/24/21 3:03 AM, Nikolay Aleksandrov wrote:
> On 24/09/2021 01:49, Vladimir Oltean wrote:
>>
[snip]
>> We don't, that's the premise, you didn't miss anything there. I mean in
>> the switchdev world, I see that an interrupt is the primary mechanism,
>> and we do have DSA switches which are theoretically capable of notifying
>> of new addresses, but not through interrupts but over Ethernet, of
>> course. Ocelot for example supports sending "learn frames" to the CPU -
>> these are just like flooded frames, except that a "flooded" frame is one
>> with unknown MAC DA, and a "learn" frame is one with unknown MAC SA.
>> I've been thinking whether it's worth adding support for learn frames
>> coming from Ocelot/Felix in DSA, and it doesn't really look like it.
>> Anyway, when the DSA tagging protocol receives a "learn" frame, it needs
>> to consume_skb() it, then trigger some sort of switchdev atomic notifier
>> for that MAC SA (SWITCHDEV_FDB_ADD_TO_BRIDGE), but sadly that is only
>> the beginning of a long series of complications, because we also need to
>> know when the hardware has forgotten it, and it doesn't look like
>> "forget" frames are a thing. So because the risk of having an address
>> expire in hardware while it is still present in software is kind of
>> high, the only option left is to make the hardware entry static, and
>> (a) delete it manually when the software entry expires
>> (b) set up a second alert which triggers when a MAC SA has been received on a
>>      port other than the destination port which is pointed towards by an
>>      existing FDB entry. In short, "station migration alert". Because the
>>      FDB entry is static, we need to migrate it by hand, in software.
>> So it all looks kind of clunky. Whereas what we do now is basically
>> assume that the amount of frames with unknown MAC DA reaching the CPU
>> via flooding is more or less equal and equally distributed with the
>> frames with unknown MAC SA reaching the CPU. I have not yet encountered
>> a use case where the two are tragically different, in a way that could
>> be solved only with SWITCHDEV_FDB_ADD_TO_BRIDGE events and in no other way.
>>
>>
>> Anyway, huge digression, the idea was that DSA doesn't synchronize FDBs
>> and that is the reason in the first place why we have an .ndo_fdb_dump
>> implementation. Theoretically if all hardware drivers were perfect,
>> you'd only have the .ndo_fdb_dump implementation done for the bridge,
>> vxlan, things like that. So that is why I asked Roopa whether hacking up
>> rtnl_fdb_dump in this way, transforming it into a state machine even
>> more complicated than it already was, is acceptable. We aren't the
>> primary use case of it, I think.
>>
> Hi Vladimir,
> I glanced over the patches and the obvious question that comes first is have you
> tried pushing all of this complexity closer to the driver which needs it?
> I mean rtnl_fdb_dump() can already "resume" and passes all the necessary resume indices
> to ndo_fdb_dump(), so it seems to me that all of this can be hidden away. I doubt
> there will be many users overall, so it would be nice to avoid adding the complexity
> as you put it and supporting it in the core. I'd imagine a hacked driver would simply cache
> the dump for some time while needed (that's important to define well, more below) and just
> return it for the next couple of devices which share it upon request, basically you'd have the
> same type of solution you've done here, just have the details hidden in the layer which needs it.
>
> Now the hard part seems to be figuring out when to finish in this case. Prepare should be a simple
> check if a shared fdb list is populated, finish would need to be inferred. One way to do that is
> based on a transaction/dump id which is tracked for each shared device and the last dump. Another
> is to pass a new argument to ndo_fdb_dump() indicating whether it's dumping a single device or doing a
> full dump, since that's the case that's difficult to differentiate. If you're doing a single
> dump - obviously you do a normal fdb dump without caching, if you're doing a full dump (all devices)
> then you need to check if the list is populated, if it is and this is the last device you need to free
> the cached shared list (finish phase). That would make the core change very small and would push the
> complexity to be maintained where it's needed. Actually you have the netlink_callback struct passed
> down to ndo_fdb_dump() so probably that can be used to infer if you're doing a single or multi- device
> dump already and there can be 0 core changes.
>
> Also one current downside with this set is the multiple walks over the net device list for a single dump,
> especially for setups with thousands of net devices. Actually, I might be missing something, but what
> happens if a dump is terminated before it's finished? Looks like the finish phase will never be reached.
> That would be a problem for both solutions, you might have to add a netlink ->done() callback anyway.

+1, the series conceptually looks good, but core fdb dump is already
unnecessarily complex. Moving this to the driver with indicators passed
as flags is preferred (you can possibly also mark a port as designated
for shared fdb operations, again in the driver).



