bridge.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC net-next 00/10] MC Flood disable and snooping
@ 2024-04-02  0:10 Joseph Huang
  2024-04-02  0:11 ` [PATCH RFC net-next 01/10] net: bridge: Flood Queries even when mc flood is disabled Joseph Huang
                   ` (11 more replies)
  0 siblings, 12 replies; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:10 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

There is a use case where one would like to enable multicast snooping
on a bridge but disable multicast flooding on all bridge ports so that
registered multicast traffic will only reach the intended recipients and
unregistered multicast traffic will be dropped. However, with existing
bridge ports' mcast_flood flag implementation, it doesn't work as desired.

This patchset aims to make multicast snooping work even when multicast
flooding is disabled on the bridge ports, without changing the semantic of
the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.

Also, in a network where more than one multicast snooping capable bridges
are interconnected without multicast routers being present, multicast
snooping fails if:

  1. The source is not directly attached to the Querier
  2. The listener is beyond the mrouter port of the bridge where the
     source is directly attached
  3. A hardware offloading switch is involved

When all of the conditions are met, the listener will not receive any
multicast packets from the source. Patches 5 to 10 attempt to address this
issue. Specifically, patches 5 to 8 set up the infrastructure, patch 9
handles unregistered multicast packets forwarding, and patch 10 handles
registered multicast packets forwarding to the mrouter port.

The patches were developed against 5.15, and forward-ported to 6.8.
Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
switch chip with no VLAN.

V1 -> V2:
- Moved the bulk of the change from the bridge to the mv88e6xxx driver.
- Added more patches (specifically 3 and 4) to workaround some more
  issues with multicast flooding being disabled.

v1 here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210504182259.5042-1-Joseph.Huang@garmin.com/


Joseph Huang (10):
  net: bridge: Flood Queries even when mc flood is disabled
  net: bridge: Always multicast_flood Reports
  net: bridge: Always flood local subnet mc packets
  net: dsa: mv88e6xxx: Add all hosts mc addr to ATU
  net: dsa: Add support for PORT_MROUTER attribute
  net: dsa: mv88e6xxx: Track soft bridge objects
  net: dsa: mv88e6xxx: Track bridge mdb objects
  net: dsa: mv88e6xxx: Convert MAB to use bit flags
  net: dsa: mv88e6xxx: Enable mc flood for mrouter port
  net: dsa: mv88e6xxx: Offload mrouter port

 drivers/net/dsa/mv88e6xxx/Kconfig       |  12 +
 drivers/net/dsa/mv88e6xxx/chip.c        | 439 +++++++++++++++++++++++-
 drivers/net/dsa/mv88e6xxx/chip.h        |  16 +-
 drivers/net/dsa/mv88e6xxx/global1_atu.c |   3 +-
 include/net/dsa.h                       |   2 +
 net/bridge/br_device.c                  |   5 +-
 net/bridge/br_forward.c                 |   3 +-
 net/bridge/br_input.c                   |   5 +-
 net/bridge/br_multicast.c               |  21 +-
 net/bridge/br_private.h                 |   6 +
 net/dsa/port.c                          |  11 +
 net/dsa/port.h                          |   2 +
 net/dsa/user.c                          |   6 +
 13 files changed, 502 insertions(+), 29 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 01/10] net: bridge: Flood Queries even when mc flood is disabled
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-02  0:11 ` [PATCH RFC net-next 02/10] net: bridge: Always multicast_flood Reports Joseph Huang
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Modify the forwarding path so that received Queries are always flooded
even when multicast flooding is disabled on a bridge port.

In current implementation, when multicast flooding is disabled on a bridge
port, Queries received from other Querier will not be forwarded out of
that bridge port. This unfortunately breaks multicast snooping.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 net/bridge/br_forward.c   | 3 ++-
 net/bridge/br_multicast.c | 3 +++
 net/bridge/br_private.h   | 3 +++
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 7431f89e897b..6c18ea37b5f5 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -216,7 +216,8 @@ void br_flood(struct net_bridge *br, struct sk_buff *skb,
 				continue;
 			break;
 		case BR_PKT_MULTICAST:
-			if (!(p->flags & BR_MCAST_FLOOD) && skb->dev != br->dev)
+			if (!(p->flags & BR_MCAST_FLOOD) && skb->dev != br->dev &&
+			    !BR_INPUT_SKB_CB_FORCE_FLOOD(skb))
 				continue;
 			break;
 		case BR_PKT_BROADCAST:
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 9a1cb5079a7a..42d900549227 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -3851,6 +3851,7 @@ static int br_multicast_ipv4_rcv(struct net_bridge_mcast *brmctx,
 		err = br_ip4_multicast_igmp3_report(brmctx, pmctx, skb, vid);
 		break;
 	case IGMP_HOST_MEMBERSHIP_QUERY:
+		BR_INPUT_SKB_CB(skb)->force_flood = 1;
 		br_ip4_multicast_query(brmctx, pmctx, skb, vid);
 		break;
 	case IGMP_HOST_LEAVE_MESSAGE:
@@ -3916,6 +3917,7 @@ static int br_multicast_ipv6_rcv(struct net_bridge_mcast *brmctx,
 		err = br_ip6_multicast_mld2_report(brmctx, pmctx, skb, vid);
 		break;
 	case ICMPV6_MGM_QUERY:
+		BR_INPUT_SKB_CB(skb)->force_flood = 1;
 		err = br_ip6_multicast_query(brmctx, pmctx, skb, vid);
 		break;
 	case ICMPV6_MGM_REDUCTION:
@@ -3941,6 +3943,7 @@ int br_multicast_rcv(struct net_bridge_mcast **brmctx,
 
 	BR_INPUT_SKB_CB(skb)->igmp = 0;
 	BR_INPUT_SKB_CB(skb)->mrouters_only = 0;
+	BR_INPUT_SKB_CB(skb)->force_flood = 0;
 
 	if (!br_opt_get((*brmctx)->br, BROPT_MULTICAST_ENABLED))
 		return 0;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 86ea5e6689b5..c28e0cd0855c 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -586,6 +586,7 @@ struct br_input_skb_cb {
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 	u8 igmp;
 	u8 mrouters_only:1;
+	u8 force_flood:1;
 #endif
 	u8 proxyarp_replied:1;
 	u8 src_port_isolated:1;
@@ -620,8 +621,10 @@ struct br_input_skb_cb {
 
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 # define BR_INPUT_SKB_CB_MROUTERS_ONLY(__skb)	(BR_INPUT_SKB_CB(__skb)->mrouters_only)
+# define BR_INPUT_SKB_CB_FORCE_FLOOD(__skb)	(BR_INPUT_SKB_CB(__skb)->force_flood)
 #else
 # define BR_INPUT_SKB_CB_MROUTERS_ONLY(__skb)	(0)
+# define BR_INPUT_SKB_CB_FORCE_FLOOD(__skb)	(0)
 #endif
 
 #define br_printk(level, br, format, args...)	\
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 02/10] net: bridge: Always multicast_flood Reports
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
  2024-04-02  0:11 ` [PATCH RFC net-next 01/10] net: bridge: Flood Queries even when mc flood is disabled Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-03 15:52   ` Simon Horman
  2024-04-02  0:11 ` [PATCH RFC net-next 03/10] net: bridge: Always flood local subnet mc packets Joseph Huang
                   ` (9 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Modify the forwarding path so that IGMPv1/v2/MLDv1 Reports are always
flooded by br_multicast_flood(), regardless of the check done
by br_multicast_querier_exists().

This patch fixes the problems where shortly after a system boots up,
the first couple of Reports are not handled properly in that:

1) The Report from the Host is being flooded (via br_flood) to all
   bridge ports, and
2) If the mrouter port's multicast flooding is disabled, the Reports
   received from other hosts will not be forwarded to the Querier.

Fixes: b00589af3b04 ("bridge: disable snooping if there is no querier")

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 net/bridge/br_device.c    | 5 +++--
 net/bridge/br_input.c     | 5 +++--
 net/bridge/br_multicast.c | 5 +++++
 net/bridge/br_private.h   | 3 +++
 4 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/net/bridge/br_device.c b/net/bridge/br_device.c
index c366ccc8b3db..5c09b9dd61dc 100644
--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -93,8 +93,9 @@ netdev_tx_t br_dev_xmit(struct sk_buff *skb, struct net_device *dev)
 		}
 
 		mdst = br_mdb_entry_skb_get(brmctx, skb, vid);
-		if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
-		    br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst))
+		if (((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
+		     br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst)) ||
+		    BR_INPUT_SKB_CB_FORCE_MC_FLOOD(skb))
 			br_multicast_flood(mdst, skb, brmctx, false, true);
 		else
 			br_flood(br, skb, BR_PKT_MULTICAST, false, true, vid);
diff --git a/net/bridge/br_input.c b/net/bridge/br_input.c
index f21097e73482..8e614ab20966 100644
--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
@@ -176,8 +176,9 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
 	switch (pkt_type) {
 	case BR_PKT_MULTICAST:
 		mdst = br_mdb_entry_skb_get(brmctx, skb, vid);
-		if ((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
-		    br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst)) {
+		if (((mdst || BR_INPUT_SKB_CB_MROUTERS_ONLY(skb)) &&
+		     br_multicast_querier_exists(brmctx, eth_hdr(skb), mdst)) ||
+		    BR_INPUT_SKB_CB_FORCE_MC_FLOOD(skb)) {
 			if ((mdst && mdst->host_joined) ||
 			    br_multicast_is_router(brmctx, skb)) {
 				local_rcv = true;
diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 42d900549227..8531f0e03f41 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -3844,6 +3844,7 @@ static int br_multicast_ipv4_rcv(struct net_bridge_mcast *brmctx,
 	case IGMP_HOST_MEMBERSHIP_REPORT:
 	case IGMPV2_HOST_MEMBERSHIP_REPORT:
 		BR_INPUT_SKB_CB(skb)->mrouters_only = 1;
+		BR_INPUT_SKB_CB(skb)->force_mc_flood = 1;
 		err = br_ip4_multicast_add_group(brmctx, pmctx, ih->group, vid,
 						 src, true);
 		break;
@@ -3855,6 +3856,7 @@ static int br_multicast_ipv4_rcv(struct net_bridge_mcast *brmctx,
 		br_ip4_multicast_query(brmctx, pmctx, skb, vid);
 		break;
 	case IGMP_HOST_LEAVE_MESSAGE:
+		BR_INPUT_SKB_CB(skb)->force_mc_flood = 1;
 		br_ip4_multicast_leave_group(brmctx, pmctx, ih->group, vid, src);
 		break;
 	}
@@ -3910,6 +3912,7 @@ static int br_multicast_ipv6_rcv(struct net_bridge_mcast *brmctx,
 	case ICMPV6_MGM_REPORT:
 		src = eth_hdr(skb)->h_source;
 		BR_INPUT_SKB_CB(skb)->mrouters_only = 1;
+		BR_INPUT_SKB_CB(skb)->force_mc_flood = 1;
 		err = br_ip6_multicast_add_group(brmctx, pmctx, &mld->mld_mca,
 						 vid, src, true);
 		break;
@@ -3922,6 +3925,7 @@ static int br_multicast_ipv6_rcv(struct net_bridge_mcast *brmctx,
 		break;
 	case ICMPV6_MGM_REDUCTION:
 		src = eth_hdr(skb)->h_source;
+		BR_INPUT_SKB_CB(skb)->force_mc_flood = 1;
 		br_ip6_multicast_leave_group(brmctx, pmctx, &mld->mld_mca, vid,
 					     src);
 		break;
@@ -3944,6 +3948,7 @@ int br_multicast_rcv(struct net_bridge_mcast **brmctx,
 	BR_INPUT_SKB_CB(skb)->igmp = 0;
 	BR_INPUT_SKB_CB(skb)->mrouters_only = 0;
 	BR_INPUT_SKB_CB(skb)->force_flood = 0;
+	BR_INPUT_SKB_CB(skb)->force_mc_flood = 0;
 
 	if (!br_opt_get((*brmctx)->br, BROPT_MULTICAST_ENABLED))
 		return 0;
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index c28e0cd0855c..d72a632a1ad2 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -587,6 +587,7 @@ struct br_input_skb_cb {
 	u8 igmp;
 	u8 mrouters_only:1;
 	u8 force_flood:1;
+	u8 force_mc_flood:1;
 #endif
 	u8 proxyarp_replied:1;
 	u8 src_port_isolated:1;
@@ -622,9 +623,11 @@ struct br_input_skb_cb {
 #ifdef CONFIG_BRIDGE_IGMP_SNOOPING
 # define BR_INPUT_SKB_CB_MROUTERS_ONLY(__skb)	(BR_INPUT_SKB_CB(__skb)->mrouters_only)
 # define BR_INPUT_SKB_CB_FORCE_FLOOD(__skb)	(BR_INPUT_SKB_CB(__skb)->force_flood)
+# define BR_INPUT_SKB_CB_FORCE_MC_FLOOD(__skb)	(BR_INPUT_SKB_CB(__skb)->force_mc_flood)
 #else
 # define BR_INPUT_SKB_CB_MROUTERS_ONLY(__skb)	(0)
 # define BR_INPUT_SKB_CB_FORCE_FLOOD(__skb)	(0)
+# define BR_INPUT_SKB_CB_FORCE_MC_FLOOD(__skb)	(0)
 #endif
 
 #define br_printk(level, br, format, args...)	\
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 03/10] net: bridge: Always flood local subnet mc packets
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
  2024-04-02  0:11 ` [PATCH RFC net-next 01/10] net: bridge: Flood Queries even when mc flood is disabled Joseph Huang
  2024-04-02  0:11 ` [PATCH RFC net-next 02/10] net: bridge: Always multicast_flood Reports Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-02  0:11 ` [PATCH RFC net-next 04/10] net: dsa: mv88e6xxx: Add all hosts mc addr to ATU Joseph Huang
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Always flood packets with local multicast destination address.

If multicast flooding is disabled on a bridge port, local subnet multicast
packets from the bridge will not be forwarded out of that port, even if
IGMP snooping is running and the hosts beyond the bridge port are sending
Reports to join these groups (e.g., 224.0.0.251). This is because the bridge
blocks the creation of an mdb entry if the group is a local subnet multicast
address, which will cause these packets to be flooded via br_flood(),
but blocked by the mcast_flood flag check.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 net/bridge/br_multicast.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/bridge/br_multicast.c b/net/bridge/br_multicast.c
index 8531f0e03f41..02a5209afab8 100644
--- a/net/bridge/br_multicast.c
+++ b/net/bridge/br_multicast.c
@@ -3823,11 +3823,14 @@ static int br_multicast_ipv4_rcv(struct net_bridge_mcast *brmctx,
 	if (err == -ENOMSG) {
 		if (!ipv4_is_local_multicast(ip_hdr(skb)->daddr)) {
 			BR_INPUT_SKB_CB(skb)->mrouters_only = 1;
-		} else if (pim_ipv4_all_pim_routers(ip_hdr(skb)->daddr)) {
-			if (ip_hdr(skb)->protocol == IPPROTO_PIM)
-				br_multicast_pim(brmctx, pmctx, skb);
-		} else if (ipv4_is_all_snoopers(ip_hdr(skb)->daddr)) {
-			br_ip4_multicast_mrd_rcv(brmctx, pmctx, skb);
+		} else {
+			BR_INPUT_SKB_CB(skb)->force_flood = 1;
+			if (pim_ipv4_all_pim_routers(ip_hdr(skb)->daddr)) {
+				if (ip_hdr(skb)->protocol == IPPROTO_PIM)
+					br_multicast_pim(brmctx, pmctx, skb);
+			} else if (ipv4_is_all_snoopers(ip_hdr(skb)->daddr)) {
+				br_ip4_multicast_mrd_rcv(brmctx, pmctx, skb);
+			}
 		}
 
 		return 0;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 04/10] net: dsa: mv88e6xxx: Add all hosts mc addr to ATU
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
                   ` (2 preceding siblings ...)
  2024-04-02  0:11 ` [PATCH RFC net-next 03/10] net: bridge: Always flood local subnet mc packets Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-02 18:08   ` Vladimir Oltean
  2024-04-02  0:11 ` [PATCH RFC net-next 05/10] net: dsa: Add support for PORT_MROUTER attribute Joseph Huang
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Add local network all hosts multicast address (224.0.0.1) and link-local
all nodes multicast address (ff02::1) to the ATU so that IGMP/MLD
Queries can be forwarded even when multicast flooding is disabled on a
port.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 drivers/net/dsa/mv88e6xxx/Kconfig | 12 ++++++++
 drivers/net/dsa/mv88e6xxx/chip.c  | 47 +++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/Kconfig b/drivers/net/dsa/mv88e6xxx/Kconfig
index e3181d5471df..ef7798bf50d7 100644
--- a/drivers/net/dsa/mv88e6xxx/Kconfig
+++ b/drivers/net/dsa/mv88e6xxx/Kconfig
@@ -17,3 +17,15 @@ config NET_DSA_MV88E6XXX_PTP
 	help
 	  Say Y to enable PTP hardware timestamping on Marvell 88E6xxx switch
 	  chips that support it.
+
+config NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS
+	bool "Always flood local all hosts multicast packets"
+	depends on NET_DSA_MV88E6XXX
+	help
+	  When set to Y, always flood multicast packets destined for
+	  224.0.0.1 (Local Network All Hosts multicast address) and
+	  ff02::1 (Link-Local All Nodes multicast address), even when
+	  multicast flooding is disabled for a port.  This is so that
+	  multicast snooping can continue to function even when
+	  multicast flooding is disabled.
+	  If in doubt, say N.
diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 9ed1821184ec..fddcb596c421 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2488,6 +2488,41 @@ static int mv88e6xxx_broadcast_setup(struct mv88e6xxx_chip *chip, u16 vid)
 	return 0;
 }
 
+#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
+static int mv88e6xxx_port_add_multicast(struct mv88e6xxx_chip *chip, int port,
+					u16 vid)
+{
+	u8 state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC;
+	const char multicast[][ETH_ALEN] = {
+		{ 0x01, 0x00, 0x5e, 0x00, 0x00, 0x01 },
+		{ 0x33, 0x33, 0x00, 0x00, 0x00, 0x01 }
+	};
+	int i, err;
+
+	for (i = 0; i < ARRAY_SIZE(multicast); i++) {
+		err = mv88e6xxx_port_db_load_purge(chip, port, multicast[i], vid, state);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+
+static int mv88e6xxx_multicast_setup(struct mv88e6xxx_chip *chip, u16 vid)
+{
+	int port;
+	int err;
+
+	for (port = 0; port < mv88e6xxx_num_ports(chip); port++) {
+		err = mv88e6xxx_port_add_multicast(chip, port, vid);
+		if (err)
+			return err;
+	}
+
+	return 0;
+}
+#endif
+
 struct mv88e6xxx_port_broadcast_sync_ctx {
 	int port;
 	bool flood;
@@ -2572,6 +2607,12 @@ static int mv88e6xxx_port_vlan_join(struct mv88e6xxx_chip *chip, int port,
 		err = mv88e6xxx_broadcast_setup(chip, vlan.vid);
 		if (err)
 			return err;
+
+#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
+		err = mv88e6xxx_multicast_setup(chip, vlan.vid);
+		if (err)
+			return err;
+#endif
 	} else if (vlan.member[port] != member) {
 		vlan.member[port] = member;
 
@@ -3930,6 +3971,12 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
 	if (err)
 		goto unlock;
 
+#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
+	err = mv88e6xxx_multicast_setup(chip, 0);
+	if (err)
+		goto unlock;
+#endif
+
 	err = mv88e6xxx_pot_setup(chip);
 	if (err)
 		goto unlock;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 05/10] net: dsa: Add support for PORT_MROUTER attribute
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
                   ` (3 preceding siblings ...)
  2024-04-02  0:11 ` [PATCH RFC net-next 04/10] net: dsa: mv88e6xxx: Add all hosts mc addr to ATU Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-02  0:11 ` [PATCH RFC net-next 06/10] net: dsa: mv88e6xxx: Track soft bridge objects Joseph Huang
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Add support for delivering SWITCHDEV_ATTR_ID_PORT_MROUTER event to DSA
subsystem. This is essentially 08cc83c ("net: dsa: add support for
BRIDGE_MROUTER attribute") repurposed for PORT_MROUTER.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 include/net/dsa.h |  2 ++
 net/dsa/port.c    | 11 +++++++++++
 net/dsa/port.h    |  2 ++
 net/dsa/user.c    |  6 ++++++
 4 files changed, 21 insertions(+)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 7c0da9effe4e..5dc5838caa9c 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -1037,6 +1037,8 @@ struct dsa_switch_ops {
 	int	(*port_bridge_flags)(struct dsa_switch *ds, int port,
 				     struct switchdev_brport_flags flags,
 				     struct netlink_ext_ack *extack);
+	int	(*port_mrouter)(struct dsa_switch *ds, int port, bool mrouter,
+				struct netlink_ext_ack *extack);
 	void	(*port_set_host_flood)(struct dsa_switch *ds, int port,
 				       bool uc, bool mc);
 
diff --git a/net/dsa/port.c b/net/dsa/port.c
index c42dac87671b..3f689cb713aa 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -948,6 +948,17 @@ int dsa_port_bridge_flags(struct dsa_port *dp,
 	return 0;
 }
 
+int dsa_port_mrouter(struct dsa_port *dp, bool mrouter,
+		     struct netlink_ext_ack *extack)
+{
+	struct dsa_switch *ds = dp->ds;
+
+	if (!ds->ops->port_mrouter)
+		return -EOPNOTSUPP;
+
+	return ds->ops->port_mrouter(ds, dp->index, mrouter, extack);
+}
+
 void dsa_port_set_host_flood(struct dsa_port *dp, bool uc, bool mc)
 {
 	struct dsa_switch *ds = dp->ds;
diff --git a/net/dsa/port.h b/net/dsa/port.h
index 6bc3291573c0..85102e1659ae 100644
--- a/net/dsa/port.h
+++ b/net/dsa/port.h
@@ -81,6 +81,8 @@ int dsa_port_pre_bridge_flags(const struct dsa_port *dp,
 int dsa_port_bridge_flags(struct dsa_port *dp,
 			  struct switchdev_brport_flags flags,
 			  struct netlink_ext_ack *extack);
+int dsa_port_mrouter(struct dsa_port *dp, bool mrouter,
+		     struct netlink_ext_ack *extack);
 int dsa_port_vlan_add(struct dsa_port *dp,
 		      const struct switchdev_obj_port_vlan *vlan,
 		      struct netlink_ext_ack *extack);
diff --git a/net/dsa/user.c b/net/dsa/user.c
index 16d395bb1a1f..f69c4df143f7 100644
--- a/net/dsa/user.c
+++ b/net/dsa/user.c
@@ -647,6 +647,12 @@ static int dsa_user_port_attr_set(struct net_device *dev, const void *ctx,
 
 		ret = dsa_port_bridge_flags(dp, attr->u.brport_flags, extack);
 		break;
+	case SWITCHDEV_ATTR_ID_PORT_MROUTER:
+		if (!dsa_port_offloads_bridge_port(dp, attr->orig_dev))
+			return -EOPNOTSUPP;
+
+		ret = dsa_port_mrouter(dp, attr->u.mrouter, extack);
+		break;
 	case SWITCHDEV_ATTR_ID_VLAN_MSTI:
 		if (!dsa_port_offloads_bridge_dev(dp, attr->orig_dev))
 			return -EOPNOTSUPP;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 06/10] net: dsa: mv88e6xxx: Track soft bridge objects
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
                   ` (4 preceding siblings ...)
  2024-04-02  0:11 ` [PATCH RFC net-next 05/10] net: dsa: Add support for PORT_MROUTER attribute Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-02  0:11 ` [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects Joseph Huang
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Keep track of soft bridge objects in the driver.

Since the driver doesn't get explicit notifications about bridge creation
or destruction, just create the bridge data structure when the first port
joins the bridge via mv88e6xxx_port_bridge_join(). Similarly, destroy
the bridge after the last port left the bridge via
mv88e6xxx_port_bridge_leave().

Use the bridge's net_device pointer as the key to the list. Port
information is stored in a bitmask.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 85 ++++++++++++++++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/chip.h |  3 ++
 2 files changed, 88 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index fddcb596c421..f66ddde484dc 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -43,6 +43,12 @@
 #include "serdes.h"
 #include "smi.h"
 
+struct mv88e6xxx_bridge {
+	struct list_head head;
+	struct net_device *br_dev;
+	u16 ports;
+};
+
 static void assert_reg_lock(struct mv88e6xxx_chip *chip)
 {
 	if (unlikely(!mutex_is_locked(&chip->reg_lock))) {
@@ -2958,6 +2964,60 @@ static int mv88e6xxx_port_fdb_dump(struct dsa_switch *ds, int port,
 	return err;
 }
 
+static struct mv88e6xxx_bridge *
+mv88e6xxx_bridge_create(struct mv88e6xxx_chip *chip, struct net_device *br_dev)
+{
+	struct mv88e6xxx_bridge *mv_bridge;
+
+	mv_bridge = kzalloc(sizeof(*mv_bridge), GFP_KERNEL);
+	if (!mv_bridge)
+		return ERR_PTR(-ENOMEM);
+
+	mv_bridge->br_dev = br_dev;
+	list_add(&mv_bridge->head, &chip->bridge_list);
+
+	return mv_bridge;
+}
+
+static void mv88e6xxx_bridge_destroy(struct mv88e6xxx_bridge *mv_bridge)
+{
+	list_del(&mv_bridge->head);
+
+	WARN_ON(mv_bridge->ports);
+	kfree(mv_bridge);
+}
+
+static
+struct mv88e6xxx_bridge *mv88e6xxx_bridge_by_dev(struct mv88e6xxx_chip *chip,
+						 const struct net_device *br_dev)
+{
+	struct mv88e6xxx_bridge *mv_bridge;
+
+	list_for_each_entry(mv_bridge, &chip->bridge_list, head)
+		if (mv_bridge->br_dev == br_dev)
+			return mv_bridge;
+
+	return NULL;
+}
+
+static struct mv88e6xxx_bridge *
+mv88e6xxx_bridge_get(struct mv88e6xxx_chip *chip, struct net_device *br_dev)
+{
+	struct mv88e6xxx_bridge *mv_bridge;
+
+	mv_bridge = mv88e6xxx_bridge_by_dev(chip, br_dev);
+	if (!mv_bridge)
+		mv_bridge = mv88e6xxx_bridge_create(chip, br_dev);
+
+	return mv_bridge;
+}
+
+static void mv88e6xxx_bridge_put(struct mv88e6xxx_bridge *mv_bridge)
+{
+	if (!mv_bridge->ports)
+		mv88e6xxx_bridge_destroy(mv_bridge);
+}
+
 static int mv88e6xxx_bridge_map(struct mv88e6xxx_chip *chip,
 				struct dsa_bridge bridge)
 {
@@ -3009,8 +3069,16 @@ static int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int port,
 				      struct netlink_ext_ack *extack)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
+	struct mv88e6xxx_bridge *mv_bridge;
 	int err;
 
+	mv_bridge = mv88e6xxx_bridge_get(chip, bridge.dev);
+	if (IS_ERR(mv_bridge))
+		return PTR_ERR(mv_bridge);
+
+	if (mv_bridge->ports & BIT(port))
+		return -EEXIST;
+
 	mv88e6xxx_reg_lock(chip);
 
 	err = mv88e6xxx_bridge_map(chip, bridge);
@@ -3033,6 +3101,8 @@ static int mv88e6xxx_port_bridge_join(struct dsa_switch *ds, int port,
 		*tx_fwd_offload = true;
 	}
 
+	mv_bridge->ports |= BIT(port);
+
 unlock:
 	mv88e6xxx_reg_unlock(chip);
 
@@ -3043,8 +3113,19 @@ static void mv88e6xxx_port_bridge_leave(struct dsa_switch *ds, int port,
 					struct dsa_bridge bridge)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
+	struct mv88e6xxx_bridge *mv_bridge;
 	int err;
 
+	mv_bridge = mv88e6xxx_bridge_by_dev(chip, bridge.dev);
+	if (!mv_bridge)
+		return;
+
+	if (!(mv_bridge->ports & BIT(port)))
+		return;
+
+	mv_bridge->ports &= ~BIT(port);
+	mv88e6xxx_bridge_put(mv_bridge);
+
 	mv88e6xxx_reg_lock(chip);
 
 	if (bridge.tx_fwd_offload &&
@@ -6436,6 +6517,7 @@ static struct mv88e6xxx_chip *mv88e6xxx_alloc_chip(struct device *dev)
 	INIT_LIST_HEAD(&chip->mdios);
 	idr_init(&chip->policies);
 	INIT_LIST_HEAD(&chip->msts);
+	INIT_LIST_HEAD(&chip->bridge_list);
 
 	return chip;
 }
@@ -7272,6 +7354,9 @@ static void mv88e6xxx_remove(struct mdio_device *mdiodev)
 		mv88e6xxx_g1_irq_free(chip);
 	else
 		mv88e6xxx_irq_poll_free(chip);
+
+	WARN_ON(!list_empty(&chip->bridge_list));
+
 }
 
 static void mv88e6xxx_shutdown(struct mdio_device *mdiodev)
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 85eb293381a7..a32e4564eb3d 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -432,6 +432,9 @@ struct mv88e6xxx_chip {
 
 	/* Bridge MST to SID mappings */
 	struct list_head msts;
+
+	/* software bridges */
+	struct list_head bridge_list;
 };
 
 struct mv88e6xxx_bus_ops {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
                   ` (5 preceding siblings ...)
  2024-04-02  0:11 ` [PATCH RFC net-next 06/10] net: dsa: mv88e6xxx: Track soft bridge objects Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-02 12:23   ` Vladimir Oltean
  2024-04-02  0:11 ` [PATCH RFC net-next 08/10] net: dsa: mv88e6xxx: Convert MAB to use bit flags Joseph Huang
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Keep track of bridge mdb objects in the driver.

Similar to the previous patch, since the driver doesn't get explicit
notifications about mdb group creation or destruction, just create
the mdb group when the first port joins the group via
mv88e6xxx_port_mdb_add(), and destroys the group when the last port left
the group via mv88e6xxx_port_mdb_del().

Use the group's L2 address together with the VLAN ID as the key to the list.
Port membership is again stored in a bitmask.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 117 +++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index f66ddde484dc..32a613c965b1 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -47,6 +47,14 @@ struct mv88e6xxx_bridge {
 	struct list_head head;
 	struct net_device *br_dev;
 	u16 ports;
+	struct list_head br_mdb_list;
+};
+
+struct mv88e6xxx_br_mdb {
+	struct list_head head;
+	unsigned char addr[ETH_ALEN];
+	u16 vid;
+	u16 portvec;
 };
 
 static void assert_reg_lock(struct mv88e6xxx_chip *chip)
@@ -2974,6 +2982,7 @@ mv88e6xxx_bridge_create(struct mv88e6xxx_chip *chip, struct net_device *br_dev)
 		return ERR_PTR(-ENOMEM);
 
 	mv_bridge->br_dev = br_dev;
+	INIT_LIST_HEAD(&mv_bridge->br_mdb_list);
 	list_add(&mv_bridge->head, &chip->bridge_list);
 
 	return mv_bridge;
@@ -2984,6 +2993,7 @@ static void mv88e6xxx_bridge_destroy(struct mv88e6xxx_bridge *mv_bridge)
 	list_del(&mv_bridge->head);
 
 	WARN_ON(mv_bridge->ports);
+	WARN_ON(!list_empty(&mv_bridge->br_mdb_list));
 	kfree(mv_bridge);
 }
 
@@ -6583,16 +6593,101 @@ static int mv88e6xxx_change_tag_protocol(struct dsa_switch *ds,
 	return err;
 }
 
+static struct mv88e6xxx_br_mdb *
+mv88e6xxx_br_mdb_create(struct mv88e6xxx_bridge *mv_bridge,
+			const struct switchdev_obj_port_mdb *mdb)
+{
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+
+	mv_br_mdb = kzalloc(sizeof(*mv_br_mdb), GFP_KERNEL);
+	if (!mv_br_mdb)
+		return ERR_PTR(-ENOMEM);
+
+	ether_addr_copy(mv_br_mdb->addr, mdb->addr);
+	mv_br_mdb->vid = mdb->vid;
+	list_add(&mv_br_mdb->head, &mv_bridge->br_mdb_list);
+
+	return mv_br_mdb;
+}
+
+static void mv88e6xxx_br_mdb_destroy(struct mv88e6xxx_br_mdb *mv_br_mdb)
+{
+	list_del(&mv_br_mdb->head);
+
+	WARN_ON(mv_br_mdb->portvec);
+	kfree(mv_br_mdb);
+}
+
+static struct mv88e6xxx_br_mdb *
+mv88e6xxx_br_mdb_find(struct mv88e6xxx_bridge *mv_bridge,
+		      const struct switchdev_obj_port_mdb *mdb)
+{
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+
+	list_for_each_entry(mv_br_mdb, &mv_bridge->br_mdb_list, head)
+		if (ether_addr_equal(mv_br_mdb->addr, mdb->addr) &&
+		    mv_br_mdb->vid == mdb->vid)
+			return mv_br_mdb;
+
+	return NULL;
+}
+
+static struct mv88e6xxx_br_mdb *
+mv88e6xxx_br_mdb_get(struct mv88e6xxx_bridge *mv_bridge,
+		     const struct switchdev_obj_port_mdb *mdb)
+{
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+
+	mv_br_mdb = mv88e6xxx_br_mdb_find(mv_bridge, mdb);
+	if (!mv_br_mdb)
+		mv_br_mdb = mv88e6xxx_br_mdb_create(mv_bridge, mdb);
+
+	return mv_br_mdb;
+}
+
+static void mv88e6xxx_br_mdb_put(struct mv88e6xxx_br_mdb *mv_br_mdb)
+{
+	if (!mv_br_mdb->portvec)
+		mv88e6xxx_br_mdb_destroy(mv_br_mdb);
+}
+
 static int mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
 				  const struct switchdev_obj_port_mdb *mdb,
 				  struct dsa_db db)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
+	struct mv88e6xxx_bridge *mv_bridge;
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+	struct net_device *orig_dev;
+	struct net_device *br_dev;
 	int err;
 
+	orig_dev = mdb->obj.orig_dev;
+	br_dev = netdev_master_upper_dev_get(orig_dev);
+	if (!br_dev)
+		br_dev = orig_dev;
+
+	mv_bridge = mv88e6xxx_bridge_by_dev(chip, br_dev);
+	if (!mv_bridge)
+		return -EINVAL;
+
+	mv_br_mdb = mv88e6xxx_br_mdb_get(mv_bridge, mdb);
+	if (IS_ERR(mv_br_mdb))
+		return PTR_ERR(mv_br_mdb);
+
+	if (mv_br_mdb->portvec & BIT(port))
+		return -EEXIST;
+
 	mv88e6xxx_reg_lock(chip);
 	err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid,
 					   MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC);
+
+	if (err)
+		goto out;
+
+	mv_br_mdb->portvec |= BIT(port);
+
+out:
 	mv88e6xxx_reg_unlock(chip);
 
 	return err;
@@ -6603,10 +6698,32 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
 				  struct dsa_db db)
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
+	struct mv88e6xxx_bridge *mv_bridge;
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
+	struct net_device *orig_dev;
+	struct net_device *br_dev;
 	int err;
 
+	orig_dev = mdb->obj.orig_dev;
+	br_dev = netdev_master_upper_dev_get(orig_dev);
+	if (!br_dev)
+		br_dev = orig_dev;
+
+	mv_bridge = mv88e6xxx_bridge_by_dev(chip, br_dev);
+	if (!mv_bridge)
+		return -EINVAL;
+
+	mv_br_mdb = mv88e6xxx_br_mdb_find(mv_bridge, mdb);
+	if (!mv_br_mdb)
+		return -ENOENT;
+
+	if (!(mv_br_mdb->portvec & BIT(port)))
+		return -ENOENT;
+
 	mv88e6xxx_reg_lock(chip);
 	err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid, 0);
+	mv_br_mdb->portvec &= ~BIT(port);
+	mv88e6xxx_br_mdb_put(mv_br_mdb);
 	mv88e6xxx_reg_unlock(chip);
 
 	return err;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 08/10] net: dsa: mv88e6xxx: Convert MAB to use bit flags
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
                   ` (6 preceding siblings ...)
  2024-04-02  0:11 ` [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-02  0:11 ` [PATCH RFC net-next 09/10] net: dsa: mv88e6xxx: Enable mc flood for mrouter port Joseph Huang
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Convert MAB (MacAuth Bypass) flag to use bitmap.

A new port flag will be added with a subsequent patch. Convert the
'mab' member to a bit in a bitmask to save space.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 drivers/net/dsa/mv88e6xxx/chip.h        | 12 ++++++++----
 drivers/net/dsa/mv88e6xxx/global1_atu.c |  3 ++-
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index a32e4564eb3d..205f6777c2ac 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -272,6 +272,9 @@ struct mv88e6xxx_vlan {
 	bool	valid;
 };
 
+/* MacAuth Bypass Control Flag */
+#define MV88E6XXX_PORT_FLAG_MAB		BIT(0)
+
 struct mv88e6xxx_port {
 	struct mv88e6xxx_chip *chip;
 	int port;
@@ -288,9 +291,7 @@ struct mv88e6xxx_port {
 	bool mirror_egress;
 	struct devlink_region *region;
 	void *pcs_private;
-
-	/* MacAuth Bypass control flag */
-	bool mab;
+	unsigned int flags;
 };
 
 enum mv88e6xxx_region_id {
@@ -802,7 +803,10 @@ static inline bool mv88e6xxx_is_invalid_port(struct mv88e6xxx_chip *chip, int po
 static inline void mv88e6xxx_port_set_mab(struct mv88e6xxx_chip *chip,
 					  int port, bool mab)
 {
-	chip->ports[port].mab = mab;
+	if (mab)
+		chip->ports[port].flags |= MV88E6XXX_PORT_FLAG_MAB;
+	else
+		chip->ports[port].flags &= ~MV88E6XXX_PORT_FLAG_MAB;
 }
 
 int mv88e6xxx_read(struct mv88e6xxx_chip *chip, int addr, int reg, u16 *val);
diff --git a/drivers/net/dsa/mv88e6xxx/global1_atu.c b/drivers/net/dsa/mv88e6xxx/global1_atu.c
index ce3b3690c3c0..06d0c526e33d 100644
--- a/drivers/net/dsa/mv88e6xxx/global1_atu.c
+++ b/drivers/net/dsa/mv88e6xxx/global1_atu.c
@@ -445,7 +445,8 @@ static irqreturn_t mv88e6xxx_g1_atu_prob_irq_thread_fn(int irq, void *dev_id)
 						   fid);
 		chip->ports[spid].atu_miss_violation++;
 
-		if (fid != MV88E6XXX_FID_STANDALONE && chip->ports[spid].mab) {
+		if (fid != MV88E6XXX_FID_STANDALONE &&
+		    !!(chip->ports[spid].flags & MV88E6XXX_PORT_FLAG_MAB)) {
 			err = mv88e6xxx_handle_miss_violation(chip, spid,
 							      &entry, fid);
 			if (err)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 09/10] net: dsa: mv88e6xxx: Enable mc flood for mrouter port
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
                   ` (7 preceding siblings ...)
  2024-04-02  0:11 ` [PATCH RFC net-next 08/10] net: dsa: mv88e6xxx: Convert MAB to use bit flags Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-03 15:49   ` Simon Horman
  2024-04-02  0:11 ` [PATCH RFC net-next 10/10] net: dsa: mv88e6xxx: Offload " Joseph Huang
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

When a port turns into an mrouter port, enable multicast flooding
on that port even if multicast flooding is disabled by user config. This
is necessary so that in a distributed system, the multicast packets
can be forwarded to the Querier when the multicast source is attached
to a Non-Querier bridge.

Consider the following scenario:

                 +--------------------+
                 |                    |
                 |      Snooping      |    +------------+
                 |      Bridge 1      |----| Listener 1 |
                 |     (Querier)      |    +------------+
                 |                    |
                 +--------------------+
                           |
                           |
                 +--------------------+
                 |    | mrouter |     |
+-----------+    |    +---------+     |
| MC Source |----|      Snooping      |
+-----------|    |      Bridge 2      |
                 |    (Non-Querier)   |
                 +--------------------+

In this scenario, Listener 1 will never receive multicast traffic
from MC Source if multicast flooding is disabled on the mrouter port on
Snooping Bridge 2.

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 86 ++++++++++++++++++++++++++++++--
 drivers/net/dsa/mv88e6xxx/chip.h |  1 +
 2 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 32a613c965b1..9831aa370921 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -47,6 +47,7 @@ struct mv88e6xxx_bridge {
 	struct list_head head;
 	struct net_device *br_dev;
 	u16 ports;
+	u16 mrouter_ports;
 	struct list_head br_mdb_list;
 };
 
@@ -2993,6 +2994,7 @@ static void mv88e6xxx_bridge_destroy(struct mv88e6xxx_bridge *mv_bridge)
 	list_del(&mv_bridge->head);
 
 	WARN_ON(mv_bridge->ports);
+	WARN_ON(mv_bridge->mrouter_ports);
 	WARN_ON(!list_empty(&mv_bridge->br_mdb_list));
 	kfree(mv_bridge);
 }
@@ -3010,6 +3012,19 @@ struct mv88e6xxx_bridge *mv88e6xxx_bridge_by_dev(struct mv88e6xxx_chip *chip,
 	return NULL;
 }
 
+static
+struct mv88e6xxx_bridge *mv88e6xxx_bridge_by_port(struct mv88e6xxx_chip *chip,
+						  int port)
+{
+	struct mv88e6xxx_bridge *mv_bridge;
+
+	list_for_each_entry(mv_bridge, &chip->bridge_list, head)
+		if (mv_bridge->ports & BIT(port))
+			return mv_bridge;
+
+	return NULL;
+}
+
 static struct mv88e6xxx_bridge *
 mv88e6xxx_bridge_get(struct mv88e6xxx_chip *chip, struct net_device *br_dev)
 {
@@ -6849,11 +6864,28 @@ static int mv88e6xxx_port_bridge_flags(struct dsa_switch *ds, int port,
 
 	if (flags.mask & BR_MCAST_FLOOD) {
 		bool multicast = !!(flags.val & BR_MCAST_FLOOD);
+		struct mv88e6xxx_bridge *mv_bridge;
+		struct mv88e6xxx_port *p;
+		bool mrouter;
 
-		err = chip->info->ops->port_set_mcast_flood(chip, port,
-							    multicast);
-		if (err)
-			goto out;
+		mv_bridge = mv88e6xxx_bridge_by_port(chip, port);
+		if (!mv_bridge)
+			return -EINVAL;
+
+		p = &chip->ports[port];
+		mrouter = !!(mv_bridge->mrouter_ports & BIT(port));
+
+		if (!mrouter) {
+			err = chip->info->ops->port_set_mcast_flood(chip, port,
+								    multicast);
+			if (err)
+				goto out;
+		}
+
+		if (multicast)
+			p->flags |= MV88E6XXX_PORT_FLAG_MC_FLOOD;
+		else
+			p->flags &= ~MV88E6XXX_PORT_FLAG_MC_FLOOD;
 	}
 
 	if (flags.mask & BR_BCAST_FLOOD) {
@@ -6883,6 +6915,51 @@ static int mv88e6xxx_port_bridge_flags(struct dsa_switch *ds, int port,
 	return err;
 }
 
+static int mv88e6xxx_port_mrouter(struct dsa_switch *ds, int port,
+				  bool mrouter,
+				  struct netlink_ext_ack *extack)
+{
+	struct mv88e6xxx_chip *chip = ds->priv;
+	struct mv88e6xxx_bridge *mv_bridge;
+	struct mv88e6xxx_port *p;
+	bool old_mrouter;
+	bool mc_flood;
+	int err;
+
+	if (!chip->info->ops->port_set_mcast_flood)
+		return -EOPNOTSUPP;
+
+	mv_bridge = mv88e6xxx_bridge_by_port(chip, port);
+	if (!mv_bridge)
+		return -EINVAL;
+
+	old_mrouter = !!(mv_bridge->mrouter_ports & BIT(port));
+	if (mrouter == old_mrouter)
+		return 0;
+
+	p = &chip->ports[port];
+	mc_flood = !!(p->flags & MV88E6XXX_PORT_FLAG_MC_FLOOD);
+
+	mv88e6xxx_reg_lock(chip);
+
+	if (!mc_flood) {
+		err = chip->info->ops->port_set_mcast_flood(chip, port,
+							    mrouter);
+		if (err)
+			goto out;
+	}
+
+	if (mrouter)
+		mv_bridge->mrouter_ports |= BIT(port);
+	else
+		mv_bridge->mrouter_ports &= ~BIT(port);
+
+out:
+	mv88e6xxx_reg_unlock(chip);
+
+	return err;
+}
+
 static bool mv88e6xxx_lag_can_offload(struct dsa_switch *ds,
 				      struct dsa_lag lag,
 				      struct netdev_lag_upper_info *info,
@@ -7199,6 +7276,7 @@ static const struct dsa_switch_ops mv88e6xxx_switch_ops = {
 	.port_bridge_leave	= mv88e6xxx_port_bridge_leave,
 	.port_pre_bridge_flags	= mv88e6xxx_port_pre_bridge_flags,
 	.port_bridge_flags	= mv88e6xxx_port_bridge_flags,
+	.port_mrouter		= mv88e6xxx_port_mrouter,
 	.port_stp_state_set	= mv88e6xxx_port_stp_state_set,
 	.port_mst_state_set	= mv88e6xxx_port_mst_state_set,
 	.port_fast_age		= mv88e6xxx_port_fast_age,
diff --git a/drivers/net/dsa/mv88e6xxx/chip.h b/drivers/net/dsa/mv88e6xxx/chip.h
index 205f6777c2ac..47e056dc7925 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.h
+++ b/drivers/net/dsa/mv88e6xxx/chip.h
@@ -274,6 +274,7 @@ struct mv88e6xxx_vlan {
 
 /* MacAuth Bypass Control Flag */
 #define MV88E6XXX_PORT_FLAG_MAB		BIT(0)
+#define MV88E6XXX_PORT_FLAG_MC_FLOOD	BIT(1)
 
 struct mv88e6xxx_port {
 	struct mv88e6xxx_chip *chip;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH RFC net-next 10/10] net: dsa: mv88e6xxx: Offload mrouter port
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
                   ` (8 preceding siblings ...)
  2024-04-02  0:11 ` [PATCH RFC net-next 09/10] net: dsa: mv88e6xxx: Enable mc flood for mrouter port Joseph Huang
@ 2024-04-02  0:11 ` Joseph Huang
  2024-04-02  9:28 ` [PATCH RFC net-next 00/10] MC Flood disable and snooping Nikolay Aleksandrov
  2024-04-02 12:43 ` Andrew Lunn
  11 siblings, 0 replies; 41+ messages in thread
From: Joseph Huang @ 2024-04-02  0:11 UTC (permalink / raw)
  To: netdev
  Cc: Joseph Huang, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Offload mrouter port forwarding to mv88e6xxx.

Currently multicast snooping fails to forward traffic in some cases
where there are multiple hardware-offloading bridges involved.

Consider the following scenario:

                 +--------------------+
                 |                    |
                 |      Snooping   +--|    +------------+
                 |      Bridge 1   |P1|----| Listener 1 |
                 |     (Querier)   +--|    +------------+
                 |                    |
                 +--------------------+
                           |
                           |
                 +--------------------+
                 |    | mrouter |     |
+-----------+    |    +---------+  +--|    +------------+
| MC Source |----|      Snooping   |P2|----| Listener 2 |
+-----------|    |      Bridge 2   +--|    +------------+
                 |    (Non-Querier)   |
                 +--------------------+

In this scenario, Listener 2 is able to receive multicast traffic from
MC Source while Listener 1 is not. The reason is that on Snooping
Bridge 2, when the (soft) bridge attempts to forward a packet to
the mrouter port via br_multicast_flood(), the effort is blocked by
nbp_switchdev_allowed_egress(), since offload_fwd_mark indicates that
the packet should have been handled by the hardware already. Listener 2
would receive the packets without any problem since P2 is programmed unto
the switch chip as a member of the group; however, the mrouter port
would not since the mrouter port would normally not be a member of any
group, and thus will not be added to the address database on the switch
chip of Snooping Bridge 2.

Even if nbp_switchdev_allowed_egress() did not block the forwarding,
it would still be better to offload the forwarding to the switch
rather than letting the bridge handle the forwarding in software.

Before this patch, mv88e6xxx programming matches exactly with mdb:

 +-----+
 | mdb |
 +-----+
    |
 +----------------------------------------------+
 |  |        +--------------------------------+ |
 |  |        | both in mdb and mv88e6xxx      | |
 |  |        | +------+   +------+   +------+ | |
 |  +--------|-| port |---| port |---| port | | |
 |           | +------+   +------+   +------+ | |
 | mv88e6xxx +--------------------------------+ |
 +----------------------------------------------+

After this patch, some entries will only exist in mv88e6xxx and not
in mdb:

 +-----+
 | mdb |
 +-----+
    |
 +---------------------------------------------------------------------+
 |  |        +--------------------------------++---------------------+ |
 |  |        |  both in mdb and mv88e6xxx     ||  only in mv88e6xxx  | |
 |  |        | +------+   +------+   +------+ || +------+   +------+ | |
 |  +--------|-| port |---| port |---| port |-||-|  mr  |---|  mr  | | |
 |           | +------+   +------+   +------+ || +------+   +------+ | |
 | mv88e6xxx +--------------------------------++---------------------+ |
 +---------------------------------------------------------------------+

Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
---
 drivers/net/dsa/mv88e6xxx/chip.c | 104 ++++++++++++++++++++++++++++---
 1 file changed, 94 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 9831aa370921..ab519e4d9e4f 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2194,13 +2194,10 @@ mv88e6xxx_port_vlan_prepare(struct dsa_switch *ds, int port,
 	return err;
 }
 
-static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
-					const unsigned char *addr, u16 vid,
-					u8 state)
+static int mv88e6xxx_fid_from_vid(struct mv88e6xxx_chip *chip, u16 vid,
+				  u16 *fid)
 {
-	struct mv88e6xxx_atu_entry entry;
 	struct mv88e6xxx_vtu_entry vlan;
-	u16 fid;
 	int err;
 
 	/* Ports have two private address databases: one for when the port is
@@ -2211,7 +2208,7 @@ static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
 	 * VLAN ID into the port's database used for VLAN-unaware bridging.
 	 */
 	if (vid == 0) {
-		fid = MV88E6XXX_FID_BRIDGED;
+		*fid = MV88E6XXX_FID_BRIDGED;
 	} else {
 		err = mv88e6xxx_vtu_get(chip, vid, &vlan);
 		if (err)
@@ -2221,9 +2218,24 @@ static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
 		if (!vlan.valid)
 			return -EOPNOTSUPP;
 
-		fid = vlan.fid;
+		*fid = vlan.fid;
 	}
 
+	return 0;
+}
+
+static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
+					const unsigned char *addr, u16 vid,
+					u8 state)
+{
+	struct mv88e6xxx_atu_entry entry;
+	u16 fid;
+	int err;
+
+	err = mv88e6xxx_fid_from_vid(chip, vid, &fid);
+	if (err)
+		return err;
+
 	entry.state = 0;
 	ether_addr_copy(entry.mac, addr);
 	eth_addr_dec(entry.mac);
@@ -2255,6 +2267,30 @@ static int mv88e6xxx_port_db_load_purge(struct mv88e6xxx_chip *chip, int port,
 	return mv88e6xxx_g1_atu_loadpurge(chip, fid, &entry);
 }
 
+static int mv88e6xxx_port_mdb_load_purge(struct mv88e6xxx_chip *chip,
+					 int portvec,
+					 const unsigned char *addr,
+					 u16 vid)
+{
+	struct mv88e6xxx_atu_entry entry;
+	u16 fid;
+	int err;
+
+	err = mv88e6xxx_fid_from_vid(chip, vid, &fid);
+	if (err)
+		return err;
+
+	memset(&entry, 0, sizeof(entry));
+	ether_addr_copy(entry.mac, addr);
+	entry.portvec = portvec;
+	if (!entry.portvec)
+		entry.state = 0;
+	else
+		entry.state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC;
+
+	return mv88e6xxx_g1_atu_loadpurge(chip, fid, &entry);
+}
+
 static int mv88e6xxx_policy_apply(struct mv88e6xxx_chip *chip, int port,
 				  const struct mv88e6xxx_policy *policy)
 {
@@ -6666,6 +6702,12 @@ static void mv88e6xxx_br_mdb_put(struct mv88e6xxx_br_mdb *mv_br_mdb)
 		mv88e6xxx_br_mdb_destroy(mv_br_mdb);
 }
 
+static u16 mv88e6xxx_br_mdb_portvec_fixup(struct mv88e6xxx_bridge *mv_bridge,
+					  u16 portvec)
+{
+	return portvec ? (portvec | mv_bridge->mrouter_ports) : portvec;
+}
+
 static int mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
 				  const struct switchdev_obj_port_mdb *mdb,
 				  struct dsa_db db)
@@ -6675,6 +6717,7 @@ static int mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
 	struct mv88e6xxx_br_mdb *mv_br_mdb;
 	struct net_device *orig_dev;
 	struct net_device *br_dev;
+	u16 portvec;
 	int err;
 
 	orig_dev = mdb->obj.orig_dev;
@@ -6693,9 +6736,11 @@ static int mv88e6xxx_port_mdb_add(struct dsa_switch *ds, int port,
 	if (mv_br_mdb->portvec & BIT(port))
 		return -EEXIST;
 
+	portvec = mv_br_mdb->portvec | BIT(port);
+	portvec = mv88e6xxx_br_mdb_portvec_fixup(mv_bridge, portvec);
+
 	mv88e6xxx_reg_lock(chip);
-	err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid,
-					   MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC);
+	err = mv88e6xxx_port_mdb_load_purge(chip, portvec, mdb->addr, mdb->vid);
 
 	if (err)
 		goto out;
@@ -6717,6 +6762,7 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
 	struct mv88e6xxx_br_mdb *mv_br_mdb;
 	struct net_device *orig_dev;
 	struct net_device *br_dev;
+	u16 portvec;
 	int err;
 
 	orig_dev = mdb->obj.orig_dev;
@@ -6735,8 +6781,11 @@ static int mv88e6xxx_port_mdb_del(struct dsa_switch *ds, int port,
 	if (!(mv_br_mdb->portvec & BIT(port)))
 		return -ENOENT;
 
+	portvec = mv_br_mdb->portvec & ~BIT(port);
+	portvec = mv88e6xxx_br_mdb_portvec_fixup(mv_bridge, portvec);
+
 	mv88e6xxx_reg_lock(chip);
-	err = mv88e6xxx_port_db_load_purge(chip, port, mdb->addr, mdb->vid, 0);
+	err = mv88e6xxx_port_mdb_load_purge(chip, portvec, mdb->addr, mdb->vid);
 	mv_br_mdb->portvec &= ~BIT(port);
 	mv88e6xxx_br_mdb_put(mv_br_mdb);
 	mv88e6xxx_reg_unlock(chip);
@@ -6921,9 +6970,11 @@ static int mv88e6xxx_port_mrouter(struct dsa_switch *ds, int port,
 {
 	struct mv88e6xxx_chip *chip = ds->priv;
 	struct mv88e6xxx_bridge *mv_bridge;
+	struct mv88e6xxx_br_mdb *mv_br_mdb;
 	struct mv88e6xxx_port *p;
 	bool old_mrouter;
 	bool mc_flood;
+	u16 portvec;
 	int err;
 
 	if (!chip->info->ops->port_set_mcast_flood)
@@ -6949,11 +7000,44 @@ static int mv88e6xxx_port_mrouter(struct dsa_switch *ds, int port,
 			goto out;
 	}
 
+	list_for_each_entry(mv_br_mdb, &mv_bridge->br_mdb_list, head) {
+		portvec = mv_br_mdb->portvec;
+		portvec = mv88e6xxx_br_mdb_portvec_fixup(mv_bridge, portvec);
+
+		if (mrouter)
+			portvec |= BIT(port);
+		else
+			portvec &= ~BIT(port);
+
+		err = mv88e6xxx_port_mdb_load_purge(chip, portvec,
+						    mv_br_mdb->addr,
+						    mv_br_mdb->vid);
+		if (err)
+			goto out_port_mdb_load_purge;
+	}
+
 	if (mrouter)
 		mv_bridge->mrouter_ports |= BIT(port);
 	else
 		mv_bridge->mrouter_ports &= ~BIT(port);
 
+	mv88e6xxx_reg_unlock(chip);
+
+	return 0;
+
+out_port_mdb_load_purge:
+	list_for_each_entry_continue_reverse(mv_br_mdb,
+					     &mv_bridge->br_mdb_list,
+					     head) {
+		portvec = mv_br_mdb->portvec;
+		portvec = mv88e6xxx_br_mdb_portvec_fixup(mv_bridge, portvec);
+		mv88e6xxx_port_mdb_load_purge(chip, portvec,
+					      mv_br_mdb->addr,
+					      mv_br_mdb->vid);
+	}
+
+	if (!mc_flood)
+		chip->info->ops->port_set_mcast_flood(chip, port, old_mrouter);
 out:
 	mv88e6xxx_reg_unlock(chip);
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
                   ` (9 preceding siblings ...)
  2024-04-02  0:11 ` [PATCH RFC net-next 10/10] net: dsa: mv88e6xxx: Offload " Joseph Huang
@ 2024-04-02  9:28 ` Nikolay Aleksandrov
  2024-04-02 17:43   ` Vladimir Oltean
  2024-04-02 12:43 ` Andrew Lunn
  11 siblings, 1 reply; 41+ messages in thread
From: Nikolay Aleksandrov @ 2024-04-02  9:28 UTC (permalink / raw)
  To: Joseph Huang, netdev
  Cc: Andrew Lunn, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Roopa Prabhu,
	Linus Lüssing, linux-kernel, bridge

On 4/2/24 03:10, Joseph Huang wrote:
> There is a use case where one would like to enable multicast snooping
> on a bridge but disable multicast flooding on all bridge ports so that
> registered multicast traffic will only reach the intended recipients and
> unregistered multicast traffic will be dropped. However, with existing
> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
> 
> This patchset aims to make multicast snooping work even when multicast
> flooding is disabled on the bridge ports, without changing the semantic of
> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
> 
> Also, in a network where more than one multicast snooping capable bridges
> are interconnected without multicast routers being present, multicast
> snooping fails if:
> 
>    1. The source is not directly attached to the Querier
>    2. The listener is beyond the mrouter port of the bridge where the
>       source is directly attached
>    3. A hardware offloading switch is involved
> 
> When all of the conditions are met, the listener will not receive any
> multicast packets from the source. Patches 5 to 10 attempt to address this
> issue. Specifically, patches 5 to 8 set up the infrastructure, patch 9
> handles unregistered multicast packets forwarding, and patch 10 handles
> registered multicast packets forwarding to the mrouter port.
> 
> The patches were developed against 5.15, and forward-ported to 6.8.
> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
> switch chip with no VLAN.
> 
> V1 -> V2:
> - Moved the bulk of the change from the bridge to the mv88e6xxx driver.
> - Added more patches (specifically 3 and 4) to workaround some more
>    issues with multicast flooding being disabled.
> 
> v1 here:
> https://patchwork.kernel.org/project/netdevbpf/cover/20210504182259.5042-1-Joseph.Huang@garmin.com/
> 

For the bridge patches:
Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>

You cannot break the multicast flood flag to add support for a custom
use-case. This is unacceptable. The current bridge behaviour is correct
your patch 02 doesn't fix anything, you should configure the bridge
properly to avoid all those problems, not break protocols.

Your special use case can easily be solved by a user-space helper or
eBPF and nftables. You can set the mcast flood flag and bypass the
bridge for these packets. I basically said the same in 2021, if this is
going to be in the bridge it should be hidden behind an option that is
default off. But in my opinion adding an option to solve such special
cases is undesirable, they can be easily solved with what's currently
available.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects
  2024-04-02  0:11 ` [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects Joseph Huang
@ 2024-04-02 12:23   ` Vladimir Oltean
  2024-04-04 20:43     ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Vladimir Oltean @ 2024-04-02 12:23 UTC (permalink / raw)
  To: Joseph Huang
  Cc: netdev, Andrew Lunn, Florian Fainelli, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Roopa Prabhu,
	Nikolay Aleksandrov, Linus Lüssing, linux-kernel, bridge

On Mon, Apr 01, 2024 at 08:11:06PM -0400, Joseph Huang wrote:
> Keep track of bridge mdb objects in the driver.
> 
> Similar to the previous patch, since the driver doesn't get explicit
> notifications about mdb group creation or destruction, just create
> the mdb group when the first port joins the group via
> mv88e6xxx_port_mdb_add(), and destroys the group when the last port left
> the group via mv88e6xxx_port_mdb_del().
> 
> Use the group's L2 address together with the VLAN ID as the key to the list.
> Port membership is again stored in a bitmask.
> 
> Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
> ---

Can you comment on the feasibility/infeasibility of Tobias' proposal of:
"The bridge could just provide some MDB iterator to save us from having
to cache all the configured groups."?
https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/

What is done here will have to be scaled to many drivers - potentially
all existing DSA ones, as far as I'm aware.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
                   ` (10 preceding siblings ...)
  2024-04-02  9:28 ` [PATCH RFC net-next 00/10] MC Flood disable and snooping Nikolay Aleksandrov
@ 2024-04-02 12:43 ` Andrew Lunn
  2024-04-04 21:35   ` Joseph Huang
  11 siblings, 1 reply; 41+ messages in thread
From: Andrew Lunn @ 2024-04-02 12:43 UTC (permalink / raw)
  To: Joseph Huang
  Cc: netdev, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Roopa Prabhu,
	Nikolay Aleksandrov, Linus Lüssing, linux-kernel, bridge

On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
> There is a use case where one would like to enable multicast snooping
> on a bridge but disable multicast flooding on all bridge ports so that
> registered multicast traffic will only reach the intended recipients and
> unregistered multicast traffic will be dropped. However, with existing
> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
> 
> This patchset aims to make multicast snooping work even when multicast
> flooding is disabled on the bridge ports, without changing the semantic of
> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
> 
> Also, in a network where more than one multicast snooping capable bridges
> are interconnected without multicast routers being present, multicast
> snooping fails if:
> 
>   1. The source is not directly attached to the Querier
>   2. The listener is beyond the mrouter port of the bridge where the
>      source is directly attached
>   3. A hardware offloading switch is involved

I've not studied the details here, but that last point makes me think
the offload driver is broken. There should not be any difference
between software bridging and hardware bridging. The whole idea is
that you take what Linux can do in software and accelerate it by
offloading to hardware. Doing acceleration should not change the
behaviour.

> The patches were developed against 5.15, and forward-ported to 6.8.
> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
> switch chip with no VLAN.

So what is the mv88e6xxx driver doing wrong?

	Andrew


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-02  9:28 ` [PATCH RFC net-next 00/10] MC Flood disable and snooping Nikolay Aleksandrov
@ 2024-04-02 17:43   ` Vladimir Oltean
  2024-04-02 18:50     ` Nikolay Aleksandrov
  0 siblings, 1 reply; 41+ messages in thread
From: Vladimir Oltean @ 2024-04-02 17:43 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Linus Lüssing, linux-kernel, bridge

Hi Nikolai,

On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
> For the bridge patches:
> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
> 
> You cannot break the multicast flood flag to add support for a custom
> use-case. This is unacceptable. The current bridge behaviour is correct
> your patch 02 doesn't fix anything, you should configure the bridge
> properly to avoid all those problems, not break protocols.
> 
> Your special use case can easily be solved by a user-space helper or
> eBPF and nftables. You can set the mcast flood flag and bypass the
> bridge for these packets. I basically said the same in 2021, if this is
> going to be in the bridge it should be hidden behind an option that is
> default off. But in my opinion adding an option to solve such special
> cases is undesirable, they can be easily solved with what's currently
> available.

I appreciate your time is limited, but could you please translate your
suggestion, and detail your proposed alternative a bit, for those of us
who are not very familiar with IP multicast snooping?

Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
that break snooping? And then do what with the packets, forward them in
another software layer than the bridge?

I also don't quite understand the suggestion of turning on mcast flooding:
isn't Joseph saying that he wants it off for the unregistered multicast
data traffic?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 04/10] net: dsa: mv88e6xxx: Add all hosts mc addr to ATU
  2024-04-02  0:11 ` [PATCH RFC net-next 04/10] net: dsa: mv88e6xxx: Add all hosts mc addr to ATU Joseph Huang
@ 2024-04-02 18:08   ` Vladimir Oltean
  0 siblings, 0 replies; 41+ messages in thread
From: Vladimir Oltean @ 2024-04-02 18:08 UTC (permalink / raw)
  To: Joseph Huang
  Cc: netdev, Andrew Lunn, Florian Fainelli, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Roopa Prabhu,
	Nikolay Aleksandrov, Linus Lüssing, linux-kernel, bridge

On Mon, Apr 01, 2024 at 08:11:03PM -0400, Joseph Huang wrote:
> Add local network all hosts multicast address (224.0.0.1) and link-local
> all nodes multicast address (ff02::1) to the ATU so that IGMP/MLD
> Queries can be forwarded even when multicast flooding is disabled on a
> port.
> 
> Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>
> ---
>  drivers/net/dsa/mv88e6xxx/Kconfig | 12 ++++++++
>  drivers/net/dsa/mv88e6xxx/chip.c  | 47 +++++++++++++++++++++++++++++++
>  2 files changed, 59 insertions(+)
> 
> diff --git a/drivers/net/dsa/mv88e6xxx/Kconfig b/drivers/net/dsa/mv88e6xxx/Kconfig
> index e3181d5471df..ef7798bf50d7 100644
> --- a/drivers/net/dsa/mv88e6xxx/Kconfig
> +++ b/drivers/net/dsa/mv88e6xxx/Kconfig
> @@ -17,3 +17,15 @@ config NET_DSA_MV88E6XXX_PTP
>  	help
>  	  Say Y to enable PTP hardware timestamping on Marvell 88E6xxx switch
>  	  chips that support it.
> +
> +config NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS
> +	bool "Always flood local all hosts multicast packets"
> +	depends on NET_DSA_MV88E6XXX
> +	help
> +	  When set to Y, always flood multicast packets destined for
> +	  224.0.0.1 (Local Network All Hosts multicast address) and
> +	  ff02::1 (Link-Local All Nodes multicast address), even when
> +	  multicast flooding is disabled for a port.  This is so that
> +	  multicast snooping can continue to function even when
> +	  multicast flooding is disabled.
> +	  If in doubt, say N.

In its current form, this will never be accepted. The mainline Linux
kernel is not a purpose-built project like a bootloader (and also like
some other uses of the Linux kernel). It gets picked up by
distributions, and the same kernel image is supposed to be used on
multiple platforms. So, customizing behavior at compilation time is not
a viable option. If there is any behavior that needs to be different on
some platforms than on others for some reason (this needs a
justification in its own right), it is handled through dedicated user
space API. When others say "hide custom behavior X behind an option", a
compile-time option is not what they mean.

As for the substance of the change itself, I am far from an authority
on multicast, I think you should try to push for something else, which
should be more palatable for everybody.

Some switches I've been working with have explicit flooding controls for:
- L2 multicast
- IPv4 control multicast (224.0.0.x)
- IPv4 data multicast
- IPv6 control multicast (FF02::/16)
- IPv6 data multicast

Whereas the bridge has a single "mcast_flood" control.

It seems adequate to attempt adding more netlink attributes to control
all of the above, individually. Then you could describe your use case,
in a standard way to the kernel, as "ip link set swp0 type bridge_slave
mcast_ipv4_data_flood off mcast_ipv4_ctrl_flood on". For compatibility,
"mcast_flood" could still be interpreted as a global enable for all
multicast.

The trouble seems to be actually offloading this config to Marvell DSA,
they don't seem to have a classifier that distinguishes between kinds of
multicast traffic.

How many link-local IPv4 and IPv6 addresses are there in actual use?
Would it be feasible to actually add ATU addresses for all of them, like
you did for 224.0.0.1 and ff02::1, and say that the port destinations of
_those_ are the mcast_ipv4_ctrl_flood ports?

> diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> index 9ed1821184ec..fddcb596c421 100644
> --- a/drivers/net/dsa/mv88e6xxx/chip.c
> +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> @@ -2488,6 +2488,41 @@ static int mv88e6xxx_broadcast_setup(struct mv88e6xxx_chip *chip, u16 vid)
>  	return 0;
>  }
>  
> +#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
> +static int mv88e6xxx_port_add_multicast(struct mv88e6xxx_chip *chip, int port,
> +					u16 vid)
> +{
> +	u8 state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC;
> +	const char multicast[][ETH_ALEN] = {
> +		{ 0x01, 0x00, 0x5e, 0x00, 0x00, 0x01 },
> +		{ 0x33, 0x33, 0x00, 0x00, 0x00, 0x01 }
> +	};
> +	int i, err;
> +
> +	for (i = 0; i < ARRAY_SIZE(multicast); i++) {
> +		err = mv88e6xxx_port_db_load_purge(chip, port, multicast[i], vid, state);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +
> +static int mv88e6xxx_multicast_setup(struct mv88e6xxx_chip *chip, u16 vid)
> +{
> +	int port;
> +	int err;
> +
> +	for (port = 0; port < mv88e6xxx_num_ports(chip); port++) {
> +		err = mv88e6xxx_port_add_multicast(chip, port, vid);
> +		if (err)
> +			return err;
> +	}
> +
> +	return 0;
> +}
> +#endif
> +
>  struct mv88e6xxx_port_broadcast_sync_ctx {
>  	int port;
>  	bool flood;
> @@ -2572,6 +2607,12 @@ static int mv88e6xxx_port_vlan_join(struct mv88e6xxx_chip *chip, int port,
>  		err = mv88e6xxx_broadcast_setup(chip, vlan.vid);
>  		if (err)
>  			return err;
> +
> +#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
> +		err = mv88e6xxx_multicast_setup(chip, vlan.vid);
> +		if (err)
> +			return err;
> +#endif
>  	} else if (vlan.member[port] != member) {
>  		vlan.member[port] = member;
>  
> @@ -3930,6 +3971,12 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
>  	if (err)
>  		goto unlock;
>  
> +#if IS_ENABLED(CONFIG_NET_DSA_MV88E6XXX_ALWAYS_FLOOD_LOCAL_ALL_HOSTS_ADDRESS)
> +	err = mv88e6xxx_multicast_setup(chip, 0);

This is the danger of developing on an old kernel and then just blindly
forward-porting. It will compile and smile and look pretty, but it won't
work. You need to request your board provider to do something and give
you access to mainline code.

In newer kernels, VID 0 is MV88E6XXX_VID_STANDALONE, aka a completely
separate address database compared to what the bridge is in charge of.
So, applied to the mainline kernel as of today, your change does nothing
useful, because when under a VLAN-unaware bridge, packets now get
classified to MV88E6XXX_VID_BRIDGED (4095).

For that matter, the port database (MV88E6XXX_VID_STANDALONE) should be
controlled through dev_mc_add(), and "ip maddr" shows you the link-local
multicast addresses offloaded to the device's RX filter.

mv88e6xxx today does not pass the requirements for dsa_switch_supports_mc_filtering(),
so dev_mc_add() does not actually program anything into hardware, but if
it did, it would have been the port database (VID 0), and this is what
makes your change wrong. You can read more about address databases in
Documentation/networking/dsa/dsa.rst.

> +	if (err)
> +		goto unlock;
> +#endif
> +
>  	err = mv88e6xxx_pot_setup(chip);
>  	if (err)
>  		goto unlock;
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-02 17:43   ` Vladimir Oltean
@ 2024-04-02 18:50     ` Nikolay Aleksandrov
  2024-04-02 20:46       ` Vladimir Oltean
  0 siblings, 1 reply; 41+ messages in thread
From: Nikolay Aleksandrov @ 2024-04-02 18:50 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Linus Lüssing, linux-kernel, bridge

On 4/2/24 20:43, Vladimir Oltean wrote:
> Hi Nikolai,
> 
> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
>> For the bridge patches:
>> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
>>
>> You cannot break the multicast flood flag to add support for a custom
>> use-case. This is unacceptable. The current bridge behaviour is correct
>> your patch 02 doesn't fix anything, you should configure the bridge
>> properly to avoid all those problems, not break protocols.
>>
>> Your special use case can easily be solved by a user-space helper or
>> eBPF and nftables. You can set the mcast flood flag and bypass the
>> bridge for these packets. I basically said the same in 2021, if this is
>> going to be in the bridge it should be hidden behind an option that is
>> default off. But in my opinion adding an option to solve such special
>> cases is undesirable, they can be easily solved with what's currently
>> available.
> 
> I appreciate your time is limited, but could you please translate your
> suggestion, and detail your proposed alternative a bit, for those of us
> who are not very familiar with IP multicast snooping?
> 

My suggestion is not related to snooping really, but to the goal of
patches 01-03. The bridge patches in this set are trying to forward
traffic that is not supposed to be forwarded with the proposed
configuration, so that can be done by a user-space helper that installs
rules to bypass the bridge specifically for those packets while
monitoring the bridge state to implement a policy and manage these rules
in order to keep snooping working.

> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
> that break snooping? And then do what with the packets, forward them in
> another software layer than the bridge?
> 

The ones that are not supposed to be forwarded in the proposed config
and are needed for this use case (control traffic and link-local). 
Obviously to have proper snooping you'd need to manage these bypass
rules and use them only while needed.

> I also don't quite understand the suggestion of turning on mcast flooding:
> isn't Joseph saying that he wants it off for the unregistered multicast
> data traffic?

Ah my bad, I meant to turn off flooding and bypass the bridge for those
packets and ports while necessary, under necessary can be any policy
that the user-space helper wants to implement.

In any case, if this is going to be yet another kernel solution then it
must be a new option that is default off, and doesn't break current 
mcast flood flag behaviour.

In general my opinion is that the whole snooping control must be in
user-space and only have the dataplane in the kernel, but that is beyond
the scope of this set.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-02 18:50     ` Nikolay Aleksandrov
@ 2024-04-02 20:46       ` Vladimir Oltean
  2024-04-02 21:59         ` Nikolay Aleksandrov
  0 siblings, 1 reply; 41+ messages in thread
From: Vladimir Oltean @ 2024-04-02 20:46 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Linus Lüssing, linux-kernel, bridge

On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote:
> On 4/2/24 20:43, Vladimir Oltean wrote:
> > Hi Nikolai,
> > 
> > On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
> > > For the bridge patches:
> > > Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
> > > 
> > > You cannot break the multicast flood flag to add support for a custom
> > > use-case. This is unacceptable. The current bridge behaviour is correct
> > > your patch 02 doesn't fix anything, you should configure the bridge
> > > properly to avoid all those problems, not break protocols.
> > > 
> > > Your special use case can easily be solved by a user-space helper or
> > > eBPF and nftables. You can set the mcast flood flag and bypass the
> > > bridge for these packets. I basically said the same in 2021, if this is
> > > going to be in the bridge it should be hidden behind an option that is
> > > default off. But in my opinion adding an option to solve such special
> > > cases is undesirable, they can be easily solved with what's currently
> > > available.
> > 
> > I appreciate your time is limited, but could you please translate your
> > suggestion, and detail your proposed alternative a bit, for those of us
> > who are not very familiar with IP multicast snooping?
> > 
> 
> My suggestion is not related to snooping really, but to the goal of
> patches 01-03. The bridge patches in this set are trying to forward
> traffic that is not supposed to be forwarded with the proposed
> configuration,

Correct up to a point. Reinterpreting the given user space configuration
and trying to make it do something else seems like a mistake, but in
principle one could also look at alternative bridge configurations like
the one I described here:
https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/

> so that can be done by a user-space helper that installs
> rules to bypass the bridge specifically for those packets while
> monitoring the bridge state to implement a policy and manage these rules
> in order to keep snooping working.
> 
> > Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
> > that break snooping? And then do what with the packets, forward them in
> > another software layer than the bridge?
> > 
> 
> The ones that are not supposed to be forwarded in the proposed config
> and are needed for this use case (control traffic and link-local). Obviously
> to have proper snooping you'd need to manage these bypass
> rules and use them only while needed.

I think Joseph will end up in a situation where he needs IGMP control
messages both in the bridge data path and outside of it :)

Also, your proposal eliminates the possibility of cooperating with a
hardware accelerator which can forward the IGMP messages where they need
to go.

As far as I understand, I don't think Joseph has a very "special" use case.
Disabling flooding of unregistered multicast in the data plane sounds
reasonable. There seems to be a gap in the bridge API, in that this
operation also affects the control plane, which he is trying to fix with
this "force flooding", because of insufficiently fine grained control.

> > I also don't quite understand the suggestion of turning on mcast flooding:
> > isn't Joseph saying that he wants it off for the unregistered multicast
> > data traffic?
> 
> Ah my bad, I meant to turn off flooding and bypass the bridge for those
> packets and ports while necessary, under necessary can be any policy
> that the user-space helper wants to implement.
> 
> In any case, if this is going to be yet another kernel solution then it
> must be a new option that is default off, and doesn't break current mcast
> flood flag behaviour.

Yeah, maybe something like this, simple and with clear offload
semantics, as seen in existing hardware (not Marvell though):

mcast_flood == off:
- mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
- mcast_ipv4_data_flood: don't care
- mcast_ipv6_ctrl_flood: don't care
- mcast_ipv6_data_flood: don't care
- mcast_l2_flood: don't care
mcast_flood == on:
- Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
- Flood all other IPv4 multicast according to mcast_ipv4_data_flood
- Flood ff02::/16 according to mcast_ipv6_ctrl_flood
- Flood all other IPv6 multicast according to mcast_ipv6_data_flood
- Flood L2 according to mcast_l2_flood

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-02 20:46       ` Vladimir Oltean
@ 2024-04-02 21:59         ` Nikolay Aleksandrov
  2024-04-04 22:16           ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Nikolay Aleksandrov @ 2024-04-02 21:59 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Linus Lüssing, linux-kernel, bridge

On 4/2/24 23:46, Vladimir Oltean wrote:
> On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote:
>> On 4/2/24 20:43, Vladimir Oltean wrote:
>>> Hi Nikolai,
>>>
>>> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
>>>> For the bridge patches:
>>>> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>
>>>> You cannot break the multicast flood flag to add support for a custom
>>>> use-case. This is unacceptable. The current bridge behaviour is correct
>>>> your patch 02 doesn't fix anything, you should configure the bridge
>>>> properly to avoid all those problems, not break protocols.
>>>>
>>>> Your special use case can easily be solved by a user-space helper or
>>>> eBPF and nftables. You can set the mcast flood flag and bypass the
>>>> bridge for these packets. I basically said the same in 2021, if this is
>>>> going to be in the bridge it should be hidden behind an option that is
>>>> default off. But in my opinion adding an option to solve such special
>>>> cases is undesirable, they can be easily solved with what's currently
>>>> available.
>>>
>>> I appreciate your time is limited, but could you please translate your
>>> suggestion, and detail your proposed alternative a bit, for those of us
>>> who are not very familiar with IP multicast snooping?
>>>
>>
>> My suggestion is not related to snooping really, but to the goal of
>> patches 01-03. The bridge patches in this set are trying to forward
>> traffic that is not supposed to be forwarded with the proposed
>> configuration,
> 
> Correct up to a point. Reinterpreting the given user space configuration
> and trying to make it do something else seems like a mistake, but in
> principle one could also look at alternative bridge configurations like
> the one I described here:
> https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/
> 
>> so that can be done by a user-space helper that installs
>> rules to bypass the bridge specifically for those packets while
>> monitoring the bridge state to implement a policy and manage these rules
>> in order to keep snooping working.
>>
>>> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
>>> that break snooping? And then do what with the packets, forward them in
>>> another software layer than the bridge?
>>>
>>
>> The ones that are not supposed to be forwarded in the proposed config
>> and are needed for this use case (control traffic and link-local). Obviously
>> to have proper snooping you'd need to manage these bypass
>> rules and use them only while needed.
> 
> I think Joseph will end up in a situation where he needs IGMP control
> messages both in the bridge data path and outside of it :)
> 

My solution does not exclude such scenario. With all unregistered mcast
disabled it will be handled the same as with this proposed solution.

> Also, your proposal eliminates the possibility of cooperating with a
> hardware accelerator which can forward the IGMP messages where they need
> to go.
> 
> As far as I understand, I don't think Joseph has a very "special" use case.
> Disabling flooding of unregistered multicast in the data plane sounds
> reasonable. There seems to be a gap in the bridge API, in that this

This we already have, but..

> operation also affects the control plane, which he is trying to fix with
> this "force flooding", because of insufficiently fine grained control.
> 

Right, this is the part that is more special, I'm not saying it's
unreasonable. The proposition to make it optional and break it down to
type of traffic sounds good to me.

>>> I also don't quite understand the suggestion of turning on mcast flooding:
>>> isn't Joseph saying that he wants it off for the unregistered multicast
>>> data traffic?
>>
>> Ah my bad, I meant to turn off flooding and bypass the bridge for those
>> packets and ports while necessary, under necessary can be any policy
>> that the user-space helper wants to implement.
>>
>> In any case, if this is going to be yet another kernel solution then it
>> must be a new option that is default off, and doesn't break current mcast
>> flood flag behaviour.
> 
> Yeah, maybe something like this, simple and with clear offload
> semantics, as seen in existing hardware (not Marvell though):
> 
> mcast_flood == off:
> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
> - mcast_ipv4_data_flood: don't care
> - mcast_ipv6_ctrl_flood: don't care
> - mcast_ipv6_data_flood: don't care
> - mcast_l2_flood: don't care
> mcast_flood == on:
> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
> - Flood L2 according to mcast_l2_flood

Yep, sounds good to me. I was thinking about something in these lines
as well if doing a kernel solution in order to make it simpler and more
generic. The ctrl flood bits need to be handled more carefully to make
sure they match only control traffic and not link-local data.
I think the old option can be converted to use this fine-grained
control, that is if anyone sets the old flood on/off then the flood
mask gets set properly so we can do just 1 & in the fast path and avoid
adding more tests. It will also make it symmetric - if it can override
the on case, then it will be able to override the off case. And to be
more explicit you can pass a mask variable to br_multicast_rcv() which
will get populated and then you can pass it down to br_flood(). That
will also avoid adding new bits to the skb's bridge CB.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 09/10] net: dsa: mv88e6xxx: Enable mc flood for mrouter port
  2024-04-02  0:11 ` [PATCH RFC net-next 09/10] net: dsa: mv88e6xxx: Enable mc flood for mrouter port Joseph Huang
@ 2024-04-03 15:49   ` Simon Horman
  0 siblings, 0 replies; 41+ messages in thread
From: Simon Horman @ 2024-04-03 15:49 UTC (permalink / raw)
  To: Joseph Huang
  Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

On Mon, Apr 01, 2024 at 08:11:08PM -0400, Joseph Huang wrote:
> When a port turns into an mrouter port, enable multicast flooding
> on that port even if multicast flooding is disabled by user config. This
> is necessary so that in a distributed system, the multicast packets
> can be forwarded to the Querier when the multicast source is attached
> to a Non-Querier bridge.
> 
> Consider the following scenario:
> 
>                  +--------------------+
>                  |                    |
>                  |      Snooping      |    +------------+
>                  |      Bridge 1      |----| Listener 1 |
>                  |     (Querier)      |    +------------+
>                  |                    |
>                  +--------------------+
>                            |
>                            |
>                  +--------------------+
>                  |    | mrouter |     |
> +-----------+    |    +---------+     |
> | MC Source |----|      Snooping      |
> +-----------|    |      Bridge 2      |
>                  |    (Non-Querier)   |
>                  +--------------------+
> 
> In this scenario, Listener 1 will never receive multicast traffic
> from MC Source if multicast flooding is disabled on the mrouter port on
> Snooping Bridge 2.
> 
> Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>

...

> @@ -6849,11 +6864,28 @@ static int mv88e6xxx_port_bridge_flags(struct dsa_switch *ds, int port,
>  
>  	if (flags.mask & BR_MCAST_FLOOD) {
>  		bool multicast = !!(flags.val & BR_MCAST_FLOOD);
> +		struct mv88e6xxx_bridge *mv_bridge;
> +		struct mv88e6xxx_port *p;
> +		bool mrouter;
>  
> -		err = chip->info->ops->port_set_mcast_flood(chip, port,
> -							    multicast);
> -		if (err)
> -			goto out;
> +		mv_bridge = mv88e6xxx_bridge_by_port(chip, port);
> +		if (!mv_bridge)
> +			return -EINVAL;

I think that mv88e6xxx_reg_unlock(chip) is needed here.
So perhaps (completely untested!):

		if (!mv_bridge) {
			err = -EINVAL;
			goto out;
		}

Flagged by Smatch

> +
> +		p = &chip->ports[port];
> +		mrouter = !!(mv_bridge->mrouter_ports & BIT(port));
> +
> +		if (!mrouter) {
> +			err = chip->info->ops->port_set_mcast_flood(chip, port,
> +								    multicast);
> +			if (err)
> +				goto out;
> +		}
> +
> +		if (multicast)
> +			p->flags |= MV88E6XXX_PORT_FLAG_MC_FLOOD;
> +		else
> +			p->flags &= ~MV88E6XXX_PORT_FLAG_MC_FLOOD;
>  	}
>  
>  	if (flags.mask & BR_BCAST_FLOOD) {
> @@ -6883,6 +6915,51 @@ static int mv88e6xxx_port_bridge_flags(struct dsa_switch *ds, int port,
>  	return err;
>  }
>  
> +static int mv88e6xxx_port_mrouter(struct dsa_switch *ds, int port,
> +				  bool mrouter,
> +				  struct netlink_ext_ack *extack)
> +{
> +	struct mv88e6xxx_chip *chip = ds->priv;
> +	struct mv88e6xxx_bridge *mv_bridge;
> +	struct mv88e6xxx_port *p;
> +	bool old_mrouter;
> +	bool mc_flood;
> +	int err;
> +
> +	if (!chip->info->ops->port_set_mcast_flood)
> +		return -EOPNOTSUPP;
> +
> +	mv_bridge = mv88e6xxx_bridge_by_port(chip, port);
> +	if (!mv_bridge)
> +		return -EINVAL;
> +
> +	old_mrouter = !!(mv_bridge->mrouter_ports & BIT(port));
> +	if (mrouter == old_mrouter)
> +		return 0;
> +
> +	p = &chip->ports[port];
> +	mc_flood = !!(p->flags & MV88E6XXX_PORT_FLAG_MC_FLOOD);
> +
> +	mv88e6xxx_reg_lock(chip);
> +
> +	if (!mc_flood) {
> +		err = chip->info->ops->port_set_mcast_flood(chip, port,
> +							    mrouter);
> +		if (err)
> +			goto out;
> +	}
> +
> +	if (mrouter)
> +		mv_bridge->mrouter_ports |= BIT(port);
> +	else
> +		mv_bridge->mrouter_ports &= ~BIT(port);
> +
> +out:
> +	mv88e6xxx_reg_unlock(chip);

If mc_flood is true then err is uninitialised here.

Flagged by clang-17 W=1 build, and Smatch.

> +
> +	return err;
> +}
> +
>  static bool mv88e6xxx_lag_can_offload(struct dsa_switch *ds,
>  				      struct dsa_lag lag,
>  				      struct netdev_lag_upper_info *info,

...

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 02/10] net: bridge: Always multicast_flood Reports
  2024-04-02  0:11 ` [PATCH RFC net-next 02/10] net: bridge: Always multicast_flood Reports Joseph Huang
@ 2024-04-03 15:52   ` Simon Horman
  0 siblings, 0 replies; 41+ messages in thread
From: Simon Horman @ 2024-04-03 15:52 UTC (permalink / raw)
  To: Joseph Huang
  Cc: netdev, Andrew Lunn, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

On Mon, Apr 01, 2024 at 08:11:01PM -0400, Joseph Huang wrote:
> Modify the forwarding path so that IGMPv1/v2/MLDv1 Reports are always
> flooded by br_multicast_flood(), regardless of the check done
> by br_multicast_querier_exists().
> 
> This patch fixes the problems where shortly after a system boots up,
> the first couple of Reports are not handled properly in that:
> 
> 1) The Report from the Host is being flooded (via br_flood) to all
>    bridge ports, and
> 2) If the mrouter port's multicast flooding is disabled, the Reports
>    received from other hosts will not be forwarded to the Querier.
> 
> Fixes: b00589af3b04 ("bridge: disable snooping if there is no querier")
> 

There should be no blank line between Fixes and other tags.

If this is a fix, it should be targeted against net,
and likely broken out of this series.

If it is not a fix, then it should not have a Fixes tag.
Rather, you can reference a commit in the commit message text
(not tags) using something like.

Introduced by commit b00589af3b04 ("bridge: disable snooping if there is no querier")

Link: https://www.kernel.org/doc/html/next/process/maintainer-netdev.html

> Signed-off-by: Joseph Huang <Joseph.Huang@garmin.com>

...

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects
  2024-04-02 12:23   ` Vladimir Oltean
@ 2024-04-04 20:43     ` Joseph Huang
  2024-04-05 11:07       ` Vladimir Oltean
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-04 20:43 UTC (permalink / raw)
  To: Vladimir Oltean, Joseph Huang
  Cc: netdev, Andrew Lunn, Florian Fainelli, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Roopa Prabhu,
	Nikolay Aleksandrov, Linus Lüssing, linux-kernel, bridge

Hi Vladimir,

On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
> Can you comment on the feasibility/infeasibility of Tobias' proposal of:
> "The bridge could just provide some MDB iterator to save us from having
> to cache all the configured groups."?
> https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/
> 
> What is done here will have to be scaled to many drivers - potentially
> all existing DSA ones, as far as I'm aware.
> 

I thought about implementing an MDB iterator as suggested by Tobias, but 
I'm a bit concerned about the coherence of these MDB objects. In theory, 
when the device driver is trying to act on an event, the source of the 
trigger may have changed its state in the bridge already. If, upon 
receiving an event in the device driver, we iterate over what the bridge 
has at that instant, the differences between the worlds as seen by the 
bridge and the device driver might lead to some unexpected results. 
However, if we cache the MDB objects in the device driver, at least the 
order in which the events took place will be coherent and at any give 
time the state of the MDB objects in the device driver can be guaranteed 
to be sane. This is also the approach the prestera device driver took.

Thanks,
Joseph

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-02 12:43 ` Andrew Lunn
@ 2024-04-04 21:35   ` Joseph Huang
  2024-04-04 22:11     ` Andrew Lunn
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-04 21:35 UTC (permalink / raw)
  To: Andrew Lunn, Joseph Huang
  Cc: netdev, Florian Fainelli, Vladimir Oltean, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Roopa Prabhu,
	Nikolay Aleksandrov, Linus Lüssing, linux-kernel, bridge

Hi Andrew,

On 4/2/2024 8:43 AM, Andrew Lunn wrote:
> On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
>> There is a use case where one would like to enable multicast snooping
>> on a bridge but disable multicast flooding on all bridge ports so that
>> registered multicast traffic will only reach the intended recipients and
>> unregistered multicast traffic will be dropped. However, with existing
>> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
>> 
>> This patchset aims to make multicast snooping work even when multicast
>> flooding is disabled on the bridge ports, without changing the semantic of
>> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
>> 
>> Also, in a network where more than one multicast snooping capable bridges
>> are interconnected without multicast routers being present, multicast
>> snooping fails if:
>> 
>>   1. The source is not directly attached to the Querier
>>   2. The listener is beyond the mrouter port of the bridge where the
>>      source is directly attached
>>   3. A hardware offloading switch is involved
> 
> I've not studied the details here, but that last point makes me think
> the offload driver is broken. There should not be any difference
> between software bridging and hardware bridging. The whole idea is
> that you take what Linux can do in software and accelerate it by
> offloading to hardware. Doing acceleration should not change the
> behaviour.
> 

In patch 10 I gave a little more detail about the fix, but basically 
this is what happened.

Assuming we have a soft bridge like the following:

             bp1 +------------+
   Querier <---- |   bridge   |
                 +------------+
                bp2 |      | bp3
                    |      |
                    v      v
             MC Source    MC Listener

Here bp1 is the mrouter port, bp2 is connected to the multicast source, 
and bp3 is connected to the multicast listener who wishes to receive 
multicast traffic for that group.

After some Query/Report exchange, the snooping code in the bridge is 
going to learn about the Listener from bp3, and is going to create an 
MDB group which includes bp3 as the sole member. When the bridge 
receives a multicast packet for that group from bp2, the bridge will do 
a look up to find the members of that group (in this case, bp3) and 
forward the packet to every single member in that group. At the same 
time, the bridge will also forward the packet to every mrouter port so 
that listeners beyond mrouter ports can receive that multicast packet as 
well.

Now consider the same scenario, but with a hardware-offloaded switch:

                 +------------+
                 |   bridge   |
                 +------------+
                       ^
                       |
                       | p6 (Host CPU port)
          p1/bp1 +------------+
   Querier <---- |     sw     |
                 +------------+
             p2/bp2 |      | p3/bp3
                    |      |
                    v      v
             MC Source    MC Listener

Same Query/Report exchange, same MDB group, except that this time around 
the MDB group will be offloaded to the switch as well. So in the 
switch's ATU we will now have an entry for the multicast group and with 
p3 being the only member of that ATU. When the multicast packet arrives 
at the switch from p2, the switch will do an ATU lookup, and forward the 
packet to p3 only. This means that the Host CPU (p6) will not get a copy 
of the packet, and so the soft bridge will not have the opportunity to 
forward that packet to the mrouter port. This is what patch 10 attempts 
to address.

One possible solution of course is to add p6 to every MDB group, however 
that's probably not very desirable. Besides, it will be more efficient 
if the packet is forwarded to the mrouter port by the switch in hardware 
(i.e., offload mrouter forwarding), vs. being forwarded in the bridge by 
software.

>> The patches were developed against 5.15, and forward-ported to 6.8.
>> Tests were done on a Pi 4B + Marvell 6393X Eval board with a single
>> switch chip with no VLAN.
> 
> So what is the mv88e6xxx driver doing wrong?
> 
> 	Andrew
> 

The mv88e6xxx driver did nothing differently than the other DSA drivers. 
I chose the mv88e6xxx driver to implement the fix simply because that's 
the only platform I have at hand to verify the solution.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-04 21:35   ` Joseph Huang
@ 2024-04-04 22:11     ` Andrew Lunn
  2024-04-04 22:40       ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Andrew Lunn @ 2024-04-04 22:11 UTC (permalink / raw)
  To: Joseph Huang
  Cc: Joseph Huang, netdev, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

On Thu, Apr 04, 2024 at 05:35:41PM -0400, Joseph Huang wrote:
> Hi Andrew,
> 
> On 4/2/2024 8:43 AM, Andrew Lunn wrote:
> > On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
> > > There is a use case where one would like to enable multicast snooping
> > > on a bridge but disable multicast flooding on all bridge ports so that
> > > registered multicast traffic will only reach the intended recipients and
> > > unregistered multicast traffic will be dropped. However, with existing
> > > bridge ports' mcast_flood flag implementation, it doesn't work as desired.
> > > 
> > > This patchset aims to make multicast snooping work even when multicast
> > > flooding is disabled on the bridge ports, without changing the semantic of
> > > the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
> > > 
> > > Also, in a network where more than one multicast snooping capable bridges
> > > are interconnected without multicast routers being present, multicast
> > > snooping fails if:
> > > 
> > >   1. The source is not directly attached to the Querier
> > >   2. The listener is beyond the mrouter port of the bridge where the
> > >      source is directly attached
> > >   3. A hardware offloading switch is involved
> > 
> > I've not studied the details here, but that last point makes me think
> > the offload driver is broken. There should not be any difference
> > between software bridging and hardware bridging. The whole idea is
> > that you take what Linux can do in software and accelerate it by
> > offloading to hardware. Doing acceleration should not change the
> > behaviour.
> > 
> 
> In patch 10 I gave a little more detail about the fix, but basically this is
> what happened.
> 
> Assuming we have a soft bridge like the following:
> 
>             bp1 +------------+
>   Querier <---- |   bridge   |
>                 +------------+
>                bp2 |      | bp3
>                    |      |
>                    v      v
>             MC Source    MC Listener
> 
> Here bp1 is the mrouter port, bp2 is connected to the multicast source, and
> bp3 is connected to the multicast listener who wishes to receive multicast
> traffic for that group.
> 
> After some Query/Report exchange, the snooping code in the bridge is going
> to learn about the Listener from bp3, and is going to create an MDB group
> which includes bp3 as the sole member. When the bridge receives a multicast
> packet for that group from bp2, the bridge will do a look up to find the
> members of that group (in this case, bp3) and forward the packet to every
> single member in that group. At the same time, the bridge will also forward
> the packet to every mrouter port so that listeners beyond mrouter ports can
> receive that multicast packet as well.
> 
> Now consider the same scenario, but with a hardware-offloaded switch:
> 
>                 +------------+
>                 |   bridge   |
>                 +------------+
>                       ^
>                       |
>                       | p6 (Host CPU port)
>          p1/bp1 +------------+
>   Querier <---- |     sw     |
>                 +------------+
>             p2/bp2 |      | p3/bp3
>                    |      |
>                    v      v
>             MC Source    MC Listener
> 
> Same Query/Report exchange, same MDB group, except that this time around the
> MDB group will be offloaded to the switch as well. So in the switch's ATU we
> will now have an entry for the multicast group and with p3 being the only
> member of that ATU. When the multicast packet arrives at the switch from p2,
> the switch will do an ATU lookup, and forward the packet to p3 only. This
> means that the Host CPU (p6) will not get a copy of the packet, and so the
> soft bridge will not have the opportunity to forward that packet to the
> mrouter port. This is what patch 10 attempts to address.
> 
> One possible solution of course is to add p6 to every MDB group, however
> that's probably not very desirable. Besides, it will be more efficient if
> the packet is forwarded to the mrouter port by the switch in hardware (i.e.,
> offload mrouter forwarding), vs. being forwarded in the bridge by software.

Thanks for the explanation. So i think the key part which you said
above is:

  At the same time, the bridge will also forward the packet to every
  mrouter port so that listeners beyond mrouter ports can receive that
  multicast packet as well.

How does the bridge know about the mrouter port? It seems like the
bridge needs to pass that information down to the switch. Is the
mrouter itself performing a join on the group when it has members of
the group on the other side of it?

	Andrew

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-02 21:59         ` Nikolay Aleksandrov
@ 2024-04-04 22:16           ` Joseph Huang
  2024-04-05 10:20             ` Vladimir Oltean
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-04 22:16 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Vladimir Oltean
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Linus Lüssing, linux-kernel, bridge

On 4/2/2024 5:59 PM, Nikolay Aleksandrov wrote:
> On 4/2/24 23:46, Vladimir Oltean wrote:
>> On Tue, Apr 02, 2024 at 09:50:51PM +0300, Nikolay Aleksandrov wrote:
>>> On 4/2/24 20:43, Vladimir Oltean wrote:
>>>> Hi Nikolai,
>>>>
>>>> On Tue, Apr 02, 2024 at 12:28:38PM +0300, Nikolay Aleksandrov wrote:
>>>>> For the bridge patches:
>>>>> Nacked-by: Nikolay Aleksandrov <razor@blackwall.org>
>>>>>
>>>>> You cannot break the multicast flood flag to add support for a custom
>>>>> use-case. This is unacceptable. The current bridge behaviour is correct
>>>>> your patch 02 doesn't fix anything, you should configure the bridge
>>>>> properly to avoid all those problems, not break protocols.
>>>>>
>>>>> Your special use case can easily be solved by a user-space helper or
>>>>> eBPF and nftables. You can set the mcast flood flag and bypass the
>>>>> bridge for these packets. I basically said the same in 2021, if this is
>>>>> going to be in the bridge it should be hidden behind an option that is
>>>>> default off. But in my opinion adding an option to solve such special
>>>>> cases is undesirable, they can be easily solved with what's currently
>>>>> available.
>>>>
>>>> I appreciate your time is limited, but could you please translate your
>>>> suggestion, and detail your proposed alternative a bit, for those of us
>>>> who are not very familiar with IP multicast snooping?
>>>>
>>>
>>> My suggestion is not related to snooping really, but to the goal of
>>> patches 01-03. The bridge patches in this set are trying to forward
>>> traffic that is not supposed to be forwarded with the proposed
>>> configuration,
>> 
>> Correct up to a point. Reinterpreting the given user space configuration
>> and trying to make it do something else seems like a mistake, but in
>> principle one could also look at alternative bridge configurations like
>> the one I described here:
>> https://lore.kernel.org/netdev/20240402180805.yhhwj2f52sdc4dl2@skbuf/
>> 
>>> so that can be done by a user-space helper that installs
>>> rules to bypass the bridge specifically for those packets while
>>> monitoring the bridge state to implement a policy and manage these rules
>>> in order to keep snooping working.
>>>
>>>> Bypass the bridge for which packets? General IGMP/MLD queries? Wouldn't
>>>> that break snooping? And then do what with the packets, forward them in
>>>> another software layer than the bridge?
>>>>
>>>
>>> The ones that are not supposed to be forwarded in the proposed config
>>> and are needed for this use case (control traffic and link-local). Obviously
>>> to have proper snooping you'd need to manage these bypass
>>> rules and use them only while needed.
>> 
>> I think Joseph will end up in a situation where he needs IGMP control
>> messages both in the bridge data path and outside of it :)
>> 
> 
> My solution does not exclude such scenario. With all unregistered mcast
> disabled it will be handled the same as with this proposed solution.
> 
>> Also, your proposal eliminates the possibility of cooperating with a
>> hardware accelerator which can forward the IGMP messages where they need
>> to go.
>> 
>> As far as I understand, I don't think Joseph has a very "special" use case.
>> Disabling flooding of unregistered multicast in the data plane sounds
>> reasonable. There seems to be a gap in the bridge API, in that this
> 
> This we already have, but..
> 
>> operation also affects the control plane, which he is trying to fix with
>> this "force flooding", because of insufficiently fine grained control.
>> 
> 
> Right, this is the part that is more special, I'm not saying it's
> unreasonable. The proposition to make it optional and break it down to
> type of traffic sounds good to me.
> 
>>>> I also don't quite understand the suggestion of turning on mcast flooding:
>>>> isn't Joseph saying that he wants it off for the unregistered multicast
>>>> data traffic?
>>>
>>> Ah my bad, I meant to turn off flooding and bypass the bridge for those
>>> packets and ports while necessary, under necessary can be any policy
>>> that the user-space helper wants to implement.
>>>
>>> In any case, if this is going to be yet another kernel solution then it
>>> must be a new option that is default off, and doesn't break current mcast
>>> flood flag behaviour.
>> 
>> Yeah, maybe something like this, simple and with clear offload
>> semantics, as seen in existing hardware (not Marvell though):
>> 
>> mcast_flood == off:
>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
>> - mcast_ipv4_data_flood: don't care
>> - mcast_ipv6_ctrl_flood: don't care
>> - mcast_ipv6_data_flood: don't care
>> - mcast_l2_flood: don't care
>> mcast_flood == on:
>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
>> - Flood L2 according to mcast_l2_flood

Did you mean

if mcast_flood == on (meaning mcast_flood is ENABLED)
- mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway)
...

if mcast_flood == off (meaning mcast_flood is DISABLED)
- Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
...

? Otherwise the problem is still not solved when mcast_flood is disabled.

> 
> Yep, sounds good to me. I was thinking about something in these lines
> as well if doing a kernel solution in order to make it simpler and more
> generic. The ctrl flood bits need to be handled more carefully to make
> sure they match only control traffic and not link-local data.

Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies 
as control I guess that's my question.

> I think the old option can be converted to use this fine-grained
> control, that is if anyone sets the old flood on/off then the flood
> mask gets set properly so we can do just 1 & in the fast path and avoid
> adding more tests. It will also make it symmetric - if it can override
> the on case, then it will be able to override the off case. And to be
> more explicit you can pass a mask variable to br_multicast_rcv() which
> will get populated and then you can pass it down to br_flood(). That
> will also avoid adding new bits to the skb's bridge CB.
> 
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-04 22:11     ` Andrew Lunn
@ 2024-04-04 22:40       ` Joseph Huang
  2024-04-05 13:09         ` Andrew Lunn
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-04 22:40 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Joseph Huang, netdev, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Hi Andrew,

On 4/4/2024 6:11 PM, Andrew Lunn wrote:
> On Thu, Apr 04, 2024 at 05:35:41PM -0400, Joseph Huang wrote:
>> Hi Andrew,
>>
>> On 4/2/2024 8:43 AM, Andrew Lunn wrote:
>>> On Mon, Apr 01, 2024 at 08:10:59PM -0400, Joseph Huang wrote:
>>>> There is a use case where one would like to enable multicast snooping
>>>> on a bridge but disable multicast flooding on all bridge ports so that
>>>> registered multicast traffic will only reach the intended recipients and
>>>> unregistered multicast traffic will be dropped. However, with existing
>>>> bridge ports' mcast_flood flag implementation, it doesn't work as desired.
>>>>
>>>> This patchset aims to make multicast snooping work even when multicast
>>>> flooding is disabled on the bridge ports, without changing the semantic of
>>>> the mcast_flood flag too much. Patches 1 to 4 attempt to address this issue.
>>>>
>>>> Also, in a network where more than one multicast snooping capable bridges
>>>> are interconnected without multicast routers being present, multicast
>>>> snooping fails if:
>>>>
>>>>    1. The source is not directly attached to the Querier
>>>>    2. The listener is beyond the mrouter port of the bridge where the
>>>>       source is directly attached
>>>>    3. A hardware offloading switch is involved
>>>
>>> I've not studied the details here, but that last point makes me think
>>> the offload driver is broken. There should not be any difference
>>> between software bridging and hardware bridging. The whole idea is
>>> that you take what Linux can do in software and accelerate it by
>>> offloading to hardware. Doing acceleration should not change the
>>> behaviour.
>>>
>>
>> In patch 10 I gave a little more detail about the fix, but basically this is
>> what happened.
>>
>> Assuming we have a soft bridge like the following:
>>
>>              bp1 +------------+
>>    Querier <---- |   bridge   |
>>                  +------------+
>>                 bp2 |      | bp3
>>                     |      |
>>                     v      v
>>              MC Source    MC Listener
>>
>> Here bp1 is the mrouter port, bp2 is connected to the multicast source, and
>> bp3 is connected to the multicast listener who wishes to receive multicast
>> traffic for that group.
>>
>> After some Query/Report exchange, the snooping code in the bridge is going
>> to learn about the Listener from bp3, and is going to create an MDB group
>> which includes bp3 as the sole member. When the bridge receives a multicast
>> packet for that group from bp2, the bridge will do a look up to find the
>> members of that group (in this case, bp3) and forward the packet to every
>> single member in that group. At the same time, the bridge will also forward
>> the packet to every mrouter port so that listeners beyond mrouter ports can
>> receive that multicast packet as well.
>>
>> Now consider the same scenario, but with a hardware-offloaded switch:
>>
>>                  +------------+
>>                  |   bridge   |
>>                  +------------+
>>                        ^
>>                        |
>>                        | p6 (Host CPU port)
>>           p1/bp1 +------------+
>>    Querier <---- |     sw     |
>>                  +------------+
>>              p2/bp2 |      | p3/bp3
>>                     |      |
>>                     v      v
>>              MC Source    MC Listener
>>
>> Same Query/Report exchange, same MDB group, except that this time around the
>> MDB group will be offloaded to the switch as well. So in the switch's ATU we
>> will now have an entry for the multicast group and with p3 being the only
>> member of that ATU. When the multicast packet arrives at the switch from p2,
>> the switch will do an ATU lookup, and forward the packet to p3 only. This
>> means that the Host CPU (p6) will not get a copy of the packet, and so the
>> soft bridge will not have the opportunity to forward that packet to the
>> mrouter port. This is what patch 10 attempts to address.
>>
>> One possible solution of course is to add p6 to every MDB group, however
>> that's probably not very desirable. Besides, it will be more efficient if
>> the packet is forwarded to the mrouter port by the switch in hardware (i.e.,
>> offload mrouter forwarding), vs. being forwarded in the bridge by software.
> 
> Thanks for the explanation. So i think the key part which you said
> above is:
> 
>    At the same time, the bridge will also forward the packet to every
>    mrouter port so that listeners beyond mrouter ports can receive that
>    multicast packet as well.
> 
> How does the bridge know about the mrouter port? It seems like the

The bridge learns about the existence of the Querier by the reception of 
Queries. The bridge will mark the port which it received Queries from as 
the mrouter port.

> bridge needs to pass that information down to the switch. Is the

The bridge does pass that information down to switchdev. Patch 5 adds 
DSA handling of that event as well. Patches 9 and 10 adds the support in 
the mv88e6xxx driver.

> mrouter itself performing a join on the group when it has members of
> the group on the other side of it?

You hit a key point here. The Querier does not send Report (not with 
IGMPv2 anyway), so the bridge will never add the mrouter port to any MDB 
group. That's why patch 10 is needed. Prestera driver does something 
similar (which is what patches 6,7, and 10 are modeled after).

> 
> 	Andrew



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-04 22:16           ` Joseph Huang
@ 2024-04-05 10:20             ` Vladimir Oltean
  2024-04-05 11:00               ` Nikolay Aleksandrov
  0 siblings, 1 reply; 41+ messages in thread
From: Vladimir Oltean @ 2024-04-05 10:20 UTC (permalink / raw)
  To: Joseph Huang
  Cc: Nikolay Aleksandrov, Joseph Huang, netdev, Andrew Lunn,
	Florian Fainelli, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Roopa Prabhu, Linus Lüssing, linux-kernel,
	bridge

On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote:
> > > mcast_flood == off:
> > > - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
> > > - mcast_ipv4_data_flood: don't care
> > > - mcast_ipv6_ctrl_flood: don't care
> > > - mcast_ipv6_data_flood: don't care
> > > - mcast_l2_flood: don't care
> > > mcast_flood == on:
> > > - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
> > > - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
> > > - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
> > > - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
> > > - Flood L2 according to mcast_l2_flood
> 
> Did you mean
> 
> if mcast_flood == on (meaning mcast_flood is ENABLED)
> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway)
> ...
> 
> if mcast_flood == off (meaning mcast_flood is DISABLED)
> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
> ...
> 
> ? Otherwise the problem is still not solved when mcast_flood is disabled.

No, I mean exactly as I said. My goal was not to "solve the problem"
when mcast_flood is disabled, but to give you an option to configure the
bridge to achieve what you want, in a way which I think is more acceptable.

AFAIU, there is not really any "problem" - the bridge behaves exactly as
instructed given the limited language available to instruct it ("mcast_flood"
covers all multicast). So the other knobs have the role of fine-tuning
what gets flooded when mcast_flood is on. Like "yes, but..."

You can't "solve the problem" when it involves changing an established
behavior that somebody probably depended on to be just like that.

> > Yep, sounds good to me. I was thinking about something in these lines
> > as well if doing a kernel solution in order to make it simpler and more
> > generic. The ctrl flood bits need to be handled more carefully to make
> > sure they match only control traffic and not link-local data.
> 
> Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies as
> control I guess that's my question.

Well, as I said, I'm proposing that 224.0.0.x qualifies as control and
the rest of IPv4 multicast as data. Which means that, applied to your
case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off"
will "force flood" mDNS just like the IGMP traffic from your patches.
I'm not aware if this could be considered problematic (I don't think so).

The reason behind this proposal is that, AFAIU, endpoints may choose to
join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that
switches shouldn't prune the destinations towards endpoints that don't
join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6

Whereas for IP multicast traffic towards an address outside 224.0.0.x,
pruning will happen as per the IGMP join tracking mechanism.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-05 10:20             ` Vladimir Oltean
@ 2024-04-05 11:00               ` Nikolay Aleksandrov
  2024-04-05 20:22                 ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Nikolay Aleksandrov @ 2024-04-05 11:00 UTC (permalink / raw)
  To: Vladimir Oltean, Joseph Huang
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Linus Lüssing, linux-kernel, bridge

On 4/5/24 13:20, Vladimir Oltean wrote:
> On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote:
>>>> mcast_flood == off:
>>>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
>>>> - mcast_ipv4_data_flood: don't care
>>>> - mcast_ipv6_ctrl_flood: don't care
>>>> - mcast_ipv6_data_flood: don't care
>>>> - mcast_l2_flood: don't care
>>>> mcast_flood == on:
>>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>>>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
>>>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
>>>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
>>>> - Flood L2 according to mcast_l2_flood
>>
>> Did you mean
>>
>> if mcast_flood == on (meaning mcast_flood is ENABLED)
>> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded anyway)
>> ...
>>
>> if mcast_flood == off (meaning mcast_flood is DISABLED)
>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>> ...
>>
>> ? Otherwise the problem is still not solved when mcast_flood is disabled.
> 
> No, I mean exactly as I said. My goal was not to "solve the problem"
> when mcast_flood is disabled, but to give you an option to configure the
> bridge to achieve what you want, in a way which I think is more acceptable.
> 
> AFAIU, there is not really any "problem" - the bridge behaves exactly as
> instructed given the limited language available to instruct it ("mcast_flood"
> covers all multicast). So the other knobs have the role of fine-tuning
> what gets flooded when mcast_flood is on. Like "yes, but..."
> 
> You can't "solve the problem" when it involves changing an established
> behavior that somebody probably depended on to be just like that.
> 
>>> Yep, sounds good to me. I was thinking about something in these lines
>>> as well if doing a kernel solution in order to make it simpler and more
>>> generic. The ctrl flood bits need to be handled more carefully to make
>>> sure they match only control traffic and not link-local data.
>>
>> Do we consider 224.0.0.251 (mDNS) to be control or data? What qualifies as
>> control I guess that's my question.
> 
> Well, as I said, I'm proposing that 224.0.0.x qualifies as control and
> the rest of IPv4 multicast as data. Which means that, applied to your
> case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off"
> will "force flood" mDNS just like the IGMP traffic from your patches.
> I'm not aware if this could be considered problematic (I don't think so).
> 
> The reason behind this proposal is that, AFAIU, endpoints may choose to
> join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that
> switches shouldn't prune the destinations towards endpoints that don't
> join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6
> 
> Whereas for IP multicast traffic towards an address outside 224.0.0.x,
> pruning will happen as per the IGMP join tracking mechanism.

+1, non-IGMP traffic to 224.0.0.x must be flooded to all anyway
so this should allow for a better control over it, but perhaps
the naming should be link_local instead because control usually
means IGMP, maybe something like mcast_ip_link_local_flood


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects
  2024-04-04 20:43     ` Joseph Huang
@ 2024-04-05 11:07       ` Vladimir Oltean
  2024-04-05 18:58         ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Vladimir Oltean @ 2024-04-05 11:07 UTC (permalink / raw)
  To: Joseph Huang
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
> Hi Vladimir,
> 
> On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
> > Can you comment on the feasibility/infeasibility of Tobias' proposal of:
> > "The bridge could just provide some MDB iterator to save us from having
> > to cache all the configured groups."?
> > https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/
> > 
> > What is done here will have to be scaled to many drivers - potentially
> > all existing DSA ones, as far as I'm aware.
> > 
> 
> I thought about implementing an MDB iterator as suggested by Tobias, but I'm
> a bit concerned about the coherence of these MDB objects. In theory, when
> the device driver is trying to act on an event, the source of the trigger
> may have changed its state in the bridge already.

Yes, this is the result of SWITCHDEV_F_DEFER, used by both
SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.

> If, upon receiving an event in the device driver, we iterate over what
> the bridge has at that instant, the differences between the worlds as
> seen by the bridge and the device driver might lead to some unexpected
> results.

Translated: iterating over bridge MDB objects needs to be serialized
with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
events are temporarily blocked, the pending ones need to be flushed
using switchdev_deferred_process(), so resync the bridge state with the
driver state. Once the resync is done, the iteration is safe until
rtnl_unlock().

Applied to our case, the MDB iterator is needed in mv88e6xxx_port_mrouter().
This is already called with rtnl_lock() acquired. The resync procedure
will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
through switchdev_deferred_process(), and then the walk is consistent
for the remainder of the mv88e6xxx_port_mrouter() function.

A helper which does this is what would be required - an iterator
function which calls an int (*cb)(struct net_device *brport, const struct switchdev_obj_port_mdb *mdb)
for each MDB entry. The DSA core could then offer some post-processing
services over this API, to recover the struct dsa_port associated with
the bridge port (in the LAG case they aren't the same) and the address
database associated with the bridge.

Do you think there would be unexpected results even if we did this?
br_switchdev_mdb_replay() needs to handle a similarly complicated
situation of synchronizing with deferred MDB events.

> However, if we cache the MDB objects in the device driver, at least
> the order in which the events took place will be coherent and at any
> give time the state of the MDB objects in the device driver can be
> guaranteed to be sane. This is also the approach the prestera device
> driver took.

Not contesting this, but I wouldn't like to see MDBs cached in each
device driver just for this. Switchdev is not very high on the list of
APIs which are easy to use, and making MDB caching a requirement
(for the common case that MDB entry destinations need software fixups
with the mrouter ports) isn't exactly going to make that any better.
Others' opinion may differ, but mine is that core offload APIs need to
consider what hardware is available in the real world, make the common
case easy, and the advanced cases possible. Rather than make every case
"advanced" :)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-04 22:40       ` Joseph Huang
@ 2024-04-05 13:09         ` Andrew Lunn
  0 siblings, 0 replies; 41+ messages in thread
From: Andrew Lunn @ 2024-04-05 13:09 UTC (permalink / raw)
  To: Joseph Huang
  Cc: Joseph Huang, netdev, Florian Fainelli, Vladimir Oltean,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

> > Thanks for the explanation. So i think the key part which you said
> > above is:
> > 
> >    At the same time, the bridge will also forward the packet to every
> >    mrouter port so that listeners beyond mrouter ports can receive that
> >    multicast packet as well.
> > 
> > How does the bridge know about the mrouter port? It seems like the
> 
> The bridge learns about the existence of the Querier by the reception of
> Queries. The bridge will mark the port which it received Queries from as the
> mrouter port.
> 
> > bridge needs to pass that information down to the switch. Is the
> 
> The bridge does pass that information down to switchdev. Patch 5 adds DSA
> handling of that event as well. Patches 9 and 10 adds the support in the
> mv88e6xxx driver.

I've not been looking at the details too much for this patchset. It
does however seem that some parts are controversial, and others just
seem like a bug fix. Does it make sense to split this into two
patchsets?

	Andrew

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects
  2024-04-05 11:07       ` Vladimir Oltean
@ 2024-04-05 18:58         ` Joseph Huang
  2024-04-29 22:07           ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-05 18:58 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

Hi Vladimir,

On 4/5/2024 7:07 AM, Vladimir Oltean wrote:
> On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
>> Hi Vladimir,
>>
>> On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
>>> Can you comment on the feasibility/infeasibility of Tobias' proposal of:
>>> "The bridge could just provide some MDB iterator to save us from having
>>> to cache all the configured groups."?
>>> https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/
>>>
>>> What is done here will have to be scaled to many drivers - potentially
>>> all existing DSA ones, as far as I'm aware.
>>>
>>
>> I thought about implementing an MDB iterator as suggested by Tobias, but I'm
>> a bit concerned about the coherence of these MDB objects. In theory, when
>> the device driver is trying to act on an event, the source of the trigger
>> may have changed its state in the bridge already.
> 
> Yes, this is the result of SWITCHDEV_F_DEFER, used by both
> SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.
> 
>> If, upon receiving an event in the device driver, we iterate over what
>> the bridge has at that instant, the differences between the worlds as
>> seen by the bridge and the device driver might lead to some unexpected
>> results.
> 
> Translated: iterating over bridge MDB objects needs to be serialized
> with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
> events are temporarily blocked, the pending ones need to be flushed
> using switchdev_deferred_process(), so resync the bridge state with the
> driver state. Once the resync is done, the iteration is safe until
> rtnl_unlock().
> 
> Applied to our case, the MDB iterator is needed in mv88e6xxx_port_mrouter().
> This is already called with rtnl_lock() acquired. The resync procedure
> will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
> through switchdev_deferred_process(), and then the walk is consistent
> for the remainder of the mv88e6xxx_port_mrouter() function.
> 
> A helper which does this is what would be required - an iterator
> function which calls an int (*cb)(struct net_device *brport, const struct switchdev_obj_port_mdb *mdb)
> for each MDB entry. The DSA core could then offer some post-processing
> services over this API, to recover the struct dsa_port associated with
> the bridge port (in the LAG case they aren't the same) and the address
> database associated with the bridge.
> 
> Do you think there would be unexpected results even if we did this?
> br_switchdev_mdb_replay() needs to handle a similarly complicated
> situation of synchronizing with deferred MDB events.
>  >> However, if we cache the MDB objects in the device driver, at least
>> the order in which the events took place will be coherent and at any
>> give time the state of the MDB objects in the device driver can be
>> guaranteed to be sane. This is also the approach the prestera device
>> driver took.
> 
> Not contesting this, but I wouldn't like to see MDBs cached in each
> device driver just for this. Switchdev is not very high on the list of
> APIs which are easy to use, and making MDB caching a requirement
> (for the common case that MDB entry destinations need software fixups
> with the mrouter ports) isn't exactly going to make that any better.
> Others' opinion may differ, but mine is that core offload APIs need to
> consider what hardware is available in the real world, make the common
> case easy, and the advanced cases possible. Rather than make every case
> "advanced" :)

Just throwing some random ideas out there. Do you think it would make 
more sense if this whole solution (rtnl_lock, iterator cb,...etc.) is 
moved up to DSA so that other DSA drivers could benefit from it? I 
thought about implement it (not the iterator, the current form) in DSA 
at first, but I'm not sure how other drivers would behave, so I did it 
with mv instead.

I guess the question is, is the current limitation (mrouter not properly 
offloaded) an issue specific to mv or is it a limitation of hardware 
offloading in general? I tend to think it's the latter.

But then again, if we move it to DSA, we would lose the benefit of the 
optimization of consolidating multiple register writes into just one (as 
done in patch 10 currently), unless we add a new switch op which takes a 
portvec instead of a port when modifying mdb's.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-05 11:00               ` Nikolay Aleksandrov
@ 2024-04-05 20:22                 ` Joseph Huang
  2024-04-05 21:15                   ` Vladimir Oltean
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-05 20:22 UTC (permalink / raw)
  To: Nikolay Aleksandrov, Vladimir Oltean
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Linus Lüssing, linux-kernel, bridge

On 4/5/2024 7:00 AM, Nikolay Aleksandrov wrote:
> On 4/5/24 13:20, Vladimir Oltean wrote:
>> On Thu, Apr 04, 2024 at 06:16:12PM -0400, Joseph Huang wrote:
>>>>> mcast_flood == off:
>>>>> - mcast_ipv4_ctrl_flood: don't care (maybe can force to "off")
>>>>> - mcast_ipv4_data_flood: don't care
>>>>> - mcast_ipv6_ctrl_flood: don't care
>>>>> - mcast_ipv6_data_flood: don't care
>>>>> - mcast_l2_flood: don't care
>>>>> mcast_flood == on:
>>>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>>>>> - Flood all other IPv4 multicast according to mcast_ipv4_data_flood
>>>>> - Flood ff02::/16 according to mcast_ipv6_ctrl_flood
>>>>> - Flood all other IPv6 multicast according to mcast_ipv6_data_flood
>>>>> - Flood L2 according to mcast_l2_flood
>>>
>>> Did you mean
>>>
>>> if mcast_flood == on (meaning mcast_flood is ENABLED)
>>> - mcast_ipv4_ctrl_flood: don't care (since 224.0.0.x will be flooded 
>>> anyway)
>>> ...
>>>
>>> if mcast_flood == off (meaning mcast_flood is DISABLED)
>>> - Flood 224.0.0.x according to mcast_ipv4_ctrl_flood
>>> ...
>>>
>>> ? Otherwise the problem is still not solved when mcast_flood is 
>>> disabled.
>>
>> No, I mean exactly as I said. My goal was not to "solve the problem"
>> when mcast_flood is disabled, but to give you an option to configure the
>> bridge to achieve what you want, in a way which I think is more 
>> acceptable.
>>
>> AFAIU, there is not really any "problem" - the bridge behaves exactly as
>> instructed given the limited language available to instruct it 
>> ("mcast_flood"
>> covers all multicast). So the other knobs have the role of fine-tuning
>> what gets flooded when mcast_flood is on. Like "yes, but..."
>>
>> You can't "solve the problem" when it involves changing an established
>> behavior that somebody probably depended on to be just like that.
>>
>>>> Yep, sounds good to me. I was thinking about something in these lines
>>>> as well if doing a kernel solution in order to make it simpler and more
>>>> generic. The ctrl flood bits need to be handled more carefully to make
>>>> sure they match only control traffic and not link-local data.
>>>
>>> Do we consider 224.0.0.251 (mDNS) to be control or data? What 
>>> qualifies as
>>> control I guess that's my question.
>>
>> Well, as I said, I'm proposing that 224.0.0.x qualifies as control and
>> the rest of IPv4 multicast as data. Which means that, applied to your
>> case, "mcast_flood on mcast_ipv4_ctrl_flood on mcast_ipv4_data_flood off"
>> will "force flood" mDNS just like the IGMP traffic from your patches.
>> I'm not aware if this could be considered problematic (I don't think so).
>>
>> The reason behind this proposal is that, AFAIU, endpoints may choose to
>> join IGMP groups in the 224.0.0.x range or not, but RFC4541 says that
>> switches shouldn't prune the destinations towards endpoints that don't
>> join this range anyway: https://www.rfc-editor.org/rfc/rfc4541#page-6
>>
>> Whereas for IP multicast traffic towards an address outside 224.0.0.x,
>> pruning will happen as per the IGMP join tracking mechanism.
> 
> +1, non-IGMP traffic to 224.0.0.x must be flooded to all anyway
> so this should allow for a better control over it, but perhaps
> the naming should be link_local instead because control usually
> means IGMP, maybe something like mcast_ip_link_local_flood
> 

Like this?

bridge link set dev swp0 mcast_flood off
   - all flooding disabled

bridge link set dev swp0 mcast_flood on
   - all flooding enabled

bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off 
mcast_ipv6_data_flood off
   - IPv4 data packets flooding disabled, IPv6 data packets flooding 
disabled, everything else floods (that is to say, only allow IPv4 local 
subnet and IPv6 link-local to flood)

?

The syntax seems to be counterintuitive.

Or like this?

bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on
   - only allow IPv4 local subnet to flood, everything else off

?

So basically the question is, what should the behavior be when something 
is omitted from the command line?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-05 20:22                 ` Joseph Huang
@ 2024-04-05 21:15                   ` Vladimir Oltean
  2024-04-29 20:14                     ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Vladimir Oltean @ 2024-04-05 21:15 UTC (permalink / raw)
  To: Joseph Huang
  Cc: Nikolay Aleksandrov, Joseph Huang, netdev, Andrew Lunn,
	Florian Fainelli, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Roopa Prabhu, Linus Lüssing, linux-kernel,
	bridge

On Fri, Apr 05, 2024 at 04:22:43PM -0400, Joseph Huang wrote:
> Like this?
> 
> bridge link set dev swp0 mcast_flood off
>   - all flooding disabled
> 
> bridge link set dev swp0 mcast_flood on
>   - all flooding enabled
> 
> bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off
> mcast_ipv6_data_flood off
>   - IPv4 data packets flooding disabled, IPv6 data packets flooding
> disabled, everything else floods (that is to say, only allow IPv4 local
> subnet and IPv6 link-local to flood)
> 
> ?

Yeah.

> The syntax seems to be counterintuitive.
> 
> Or like this?
> 
> bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on
>   - only allow IPv4 local subnet to flood, everything else off
> 
> ?

Nope.

> So basically the question is, what should the behavior be when something is
> omitted from the command line?

The answer is always: "new options should default to behaving exactly
like before". It's not just about the command line arguments, but also
about the actual netlink attributes that iproute2 (and other tooling)
creates when communicating with the kernel. Old user space has no idea
about the existence of mcast_ipv4_ctrl_flood et. al. So, if netlink
attributes specifying their value are not sent by user space, their
value in the kernel must mimic the value of mcast_flood.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-05 21:15                   ` Vladimir Oltean
@ 2024-04-29 20:14                     ` Joseph Huang
  2024-04-30  1:21                       ` Vladimir Oltean
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-29 20:14 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Nikolay Aleksandrov, Joseph Huang, netdev, Andrew Lunn,
	Florian Fainelli, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Roopa Prabhu, Linus Lüssing, linux-kernel,
	bridge

On 4/5/2024 5:15 PM, Vladimir Oltean wrote:
> On Fri, Apr 05, 2024 at 04:22:43PM -0400, Joseph Huang wrote:
>> Like this?
>>
>> bridge link set dev swp0 mcast_flood off
>>    - all flooding disabled
>>
>> bridge link set dev swp0 mcast_flood on
>>    - all flooding enabled
>>
>> bridge link set dev swp0 mcast_flood on mcast_ipv4_data_flood off
>> mcast_ipv6_data_flood off
>>    - IPv4 data packets flooding disabled, IPv6 data packets flooding
>> disabled, everything else floods (that is to say, only allow IPv4 local
>> subnet and IPv6 link-local to flood)
>>
>> ?
> 
> Yeah.
> 
>> The syntax seems to be counterintuitive.
>>
>> Or like this?
>>
>> bridge link set dev swp0 mcast_flood on mcast_ipv4_ctrl_flood on
>>    - only allow IPv4 local subnet to flood, everything else off
>>
>> ?
> 
> Nope.
> 
>> So basically the question is, what should the behavior be when something is
>> omitted from the command line?
> 
> The answer is always: "new options should default to behaving exactly
> like before". It's not just about the command line arguments, but also
> about the actual netlink attributes that iproute2 (and other tooling)
> creates when communicating with the kernel. Old user space has no idea
> about the existence of mcast_ipv4_ctrl_flood et. al. So, if netlink
> attributes specifying their value are not sent by user space, their
> value in the kernel must mimic the value of mcast_flood.

How about the following syntax? I think it satisfies all the "not 
breaking existing behavior" requirements (new option defaults to off, 
and missing user space netlink attributes does not change the existing 
behavior):

mcast_flood off
   all off
mcast_flood off mcast_flood_rfc4541 off
   all off
mcast_flood off mcast_flood_rfc4541 on
   224.0.0.X and ff02::1 on, the rest off
mcast_flood on
   all on
mcast_flood on mcast_flood_rfc4541 off
   all on (mcast_flood on overrides mcast_flood_rfc4541)
mcast_flood on mcast_flood_rfc4541 on
   all on
mcast_flood_rfc4541 off
   invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] 
is specified first)
mcast_flood_rfc4541 on
   invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] 
is specified first)

Think of mcast_flood_rfc4541 like a pet door if you will.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects
  2024-04-05 18:58         ` Joseph Huang
@ 2024-04-29 22:07           ` Joseph Huang
  2024-04-30  0:59             ` Vladimir Oltean
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-29 22:07 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

On 4/5/2024 2:58 PM, Joseph Huang wrote:
> Hi Vladimir,
> 
> On 4/5/2024 7:07 AM, Vladimir Oltean wrote:
>> On Thu, Apr 04, 2024 at 04:43:38PM -0400, Joseph Huang wrote:
>>> Hi Vladimir,
>>>
>>> On 4/2/2024 8:23 AM, Vladimir Oltean wrote:
>>>> Can you comment on the feasibility/infeasibility of Tobias' proposal 
>>>> of:
>>>> "The bridge could just provide some MDB iterator to save us from having
>>>> to cache all the configured groups."?
>>>> https://lore.kernel.org/netdev/87sg31n04a.fsf@waldekranz.com/
>>>>
>>>> What is done here will have to be scaled to many drivers - potentially
>>>> all existing DSA ones, as far as I'm aware.
>>>>
>>>
>>> I thought about implementing an MDB iterator as suggested by Tobias, 
>>> but I'm
>>> a bit concerned about the coherence of these MDB objects. In theory, 
>>> when
>>> the device driver is trying to act on an event, the source of the 
>>> trigger
>>> may have changed its state in the bridge already.
>>
>> Yes, this is the result of SWITCHDEV_F_DEFER, used by both
>> SWITCHDEV_ATTR_ID_PORT_MROUTER and SWITCHDEV_OBJ_ID_PORT_MDB.
>>
>>> If, upon receiving an event in the device driver, we iterate over what
>>> the bridge has at that instant, the differences between the worlds as
>>> seen by the bridge and the device driver might lead to some unexpected
>>> results.
>>
>> Translated: iterating over bridge MDB objects needs to be serialized
>> with new switchdev events by acquiring rtnl_lock(). Then, once switchdev
>> events are temporarily blocked, the pending ones need to be flushed
>> using switchdev_deferred_process(), so resync the bridge state with the
>> driver state. Once the resync is done, the iteration is safe until
>> rtnl_unlock().
>>
>> Applied to our case, the MDB iterator is needed in 
>> mv88e6xxx_port_mrouter().
>> This is already called with rtnl_lock() acquired. The resync procedure
>> will indirectly call mv88e6xxx_port_mdb_add()/mv88e6xxx_port_mdb_del()
>> through switchdev_deferred_process(), and then the walk is consistent
>> for the remainder of the mv88e6xxx_port_mrouter() function.
>>
>> A helper which does this is what would be required - an iterator
>> function which calls an int (*cb)(struct net_device *brport, const 
>> struct switchdev_obj_port_mdb *mdb)
>> for each MDB entry. The DSA core could then offer some post-processing
>> services over this API, to recover the struct dsa_port associated with
>> the bridge port (in the LAG case they aren't the same) and the address
>> database associated with the bridge.

Something like this (some layers omitted for brevity)?

                                       +br_iterator
                                       |  for each mdb
                                       |    _br_switchdev_mdb_notify
rtnl_lock                             |      without F_DEFER flag
  |                                    |      |
  +switchdev_port_attr_set_deferred    |      +switchdev_port_obj_notify
    |                                  |        |
    +dsa_port_mrouter                  |        +dsa_user_port_obj_a/d
      |                                |          |
      +mv88e6xxx_port_mrouter----------+          +mv88e6xxx_port_obj_a/d
                                         |
  +--------------------------------------+
  |
rtnl_unlock

Note that on the system I tested, each register read/write takes about 
100us to complete. For 100's of mdb groups, this would mean that we will 
be holding rtnl lock for 10's of ms. I don't know if it's considered too 
long.


>>
>> Do you think there would be unexpected results even if we did this?
>> br_switchdev_mdb_replay() needs to handle a similarly complicated
>> situation of synchronizing with deferred MDB events.
>>  >> However, if we cache the MDB objects in the device driver, at least
>>> the order in which the events took place will be coherent and at any
>>> give time the state of the MDB objects in the device driver can be
>>> guaranteed to be sane. This is also the approach the prestera device
>>> driver took.
>>
>> Not contesting this, but I wouldn't like to see MDBs cached in each
>> device driver just for this. Switchdev is not very high on the list of
>> APIs which are easy to use, and making MDB caching a requirement
>> (for the common case that MDB entry destinations need software fixups
>> with the mrouter ports) isn't exactly going to make that any better.
>> Others' opinion may differ, but mine is that core offload APIs need to
>> consider what hardware is available in the real world, make the common
>> case easy, and the advanced cases possible. Rather than make every case
>> "advanced" :)
> 
> Just throwing some random ideas out there. Do you think it would make 
> more sense if this whole solution (rtnl_lock, iterator cb,...etc.) is 
> moved up to DSA so that other DSA drivers could benefit from it? I 
> thought about implement it (not the iterator, the current form) in DSA 
> at first, but I'm not sure how other drivers would behave, so I did it 
> with mv instead.
> 
> I guess the question is, is the current limitation (mrouter not properly 
> offloaded) an issue specific to mv or is it a limitation of hardware 
> offloading in general? I tend to think it's the latter.
> 
> But then again, if we move it to DSA, we would lose the benefit of the 
> optimization of consolidating multiple register writes into just one (as 
> done in patch 10 currently), unless we add a new switch op which takes a 
> portvec instead of a port when modifying mdb's.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects
  2024-04-29 22:07           ` Joseph Huang
@ 2024-04-30  0:59             ` Vladimir Oltean
  2024-04-30 16:27               ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Vladimir Oltean @ 2024-04-30  0:59 UTC (permalink / raw)
  To: Joseph Huang
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

On Mon, Apr 29, 2024 at 06:07:25PM -0400, Joseph Huang wrote:
> Something like this (some layers omitted for brevity)?
> 
>                                       +br_iterator
>                                       |  for each mdb
>                                       |    _br_switchdev_mdb_notify
> rtnl_lock                             |      without F_DEFER flag
>  |                                    |      |
>  +switchdev_port_attr_set_deferred    |      +switchdev_port_obj_notify
>    |                                  |        |
>    +dsa_port_mrouter                  |        +dsa_user_port_obj_a/d
>      |                                |          |
>      +mv88e6xxx_port_mrouter----------+          +mv88e6xxx_port_obj_a/d
>                                         |
>  +--------------------------------------+
>  |
> rtnl_unlock

At a _very_ superficial glance, I don't think you are properly
accounting for the fact that even with rtnl_lock() held, there are still
SWITCHDEV_OBJ_ID_PORT_MDB events which may be pending on the switchdev
chain. Without a switchdev_deferred_process() flush call, you won't be
getting rid of them, so when you rtnl_unlock(), they will still run.

Even worse, holding rtnl_lock() will not stop the bridge multicast layer
from modifying its br->mdb_list; only br->multicast_lock will.

So you may be better off also acquiring br->multicast_lock, and
notifying the MDB entries to the switchdev chain _with_the F_DEFER flag.

> Note that on the system I tested, each register read/write takes about 100us
> to complete. For 100's of mdb groups, this would mean that we will be
> holding rtnl lock for 10's of ms. I don't know if it's considered too long.

Not sure how this is going to be any better if the iteration over MDB
entries is done 100% in the driver, though - since its hook,
dsa_port_mrouter(), runs entirely under rtnl_lock().

Anyway, with the SWITCHDEV_F_DEFER flag, maybe the mdb object
notifications can be made to run by switchdev only a few at a time, to
give the network stack time to do other things as well.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-29 20:14                     ` Joseph Huang
@ 2024-04-30  1:21                       ` Vladimir Oltean
  2024-04-30 17:01                         ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Vladimir Oltean @ 2024-04-30  1:21 UTC (permalink / raw)
  To: Joseph Huang
  Cc: Nikolay Aleksandrov, Joseph Huang, netdev, Andrew Lunn,
	Florian Fainelli, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Roopa Prabhu, Linus Lüssing, linux-kernel,
	bridge

On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote:
> How about the following syntax? I think it satisfies all the "not breaking
> existing behavior" requirements (new option defaults to off, and missing
> user space netlink attributes does not change the existing behavior):
> 
> mcast_flood off
>   all off
> mcast_flood off mcast_flood_rfc4541 off
>   all off
> mcast_flood off mcast_flood_rfc4541 on
>   224.0.0.X and ff02::1 on, the rest off
> mcast_flood on
>   all on
> mcast_flood on mcast_flood_rfc4541 off
>   all on (mcast_flood on overrides mcast_flood_rfc4541)
> mcast_flood on mcast_flood_rfc4541 on
>   all on
> mcast_flood_rfc4541 off
>   invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
> specified first)
> mcast_flood_rfc4541 on
>   invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
> specified first)

A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp().
Netlink attributes are only there to _change_ the state of properties in
the kernel. They don't need to be specified by user space if there's
nothing to be changed. "Only valid if another netlink attribute comes
first" makes no sense. You can alter 2 bridge port flags as part of the
same netlink message, or as part of different netlink messages (sent
over sockets of other processes).

> 
> Think of mcast_flood_rfc4541 like a pet door if you will.

Ultimately, as far as I see it, both the OR-based and the AND-based UAPI
addition could be made to work in a way that's perhaps similarly backwards
compatible. It needs to be worked out with the bridge maintainers. Given
that I'm not doing great with my spare time, I will take a back seat on
that.

However, what I don't quite understand in your proposal is how many IPv4
addresses lie beyond the "224.0.0.X" notation? 256? Explain why there is
such a large discrepancy in the number of IPv4 addresses you are in
control of (256), vs only 1 IPv6 address with the new rfc4541 flag?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects
  2024-04-30  0:59             ` Vladimir Oltean
@ 2024-04-30 16:27               ` Joseph Huang
  2024-05-02 20:37                 ` Joseph Huang
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-30 16:27 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

On 4/29/2024 8:59 PM, Vladimir Oltean wrote:
> On Mon, Apr 29, 2024 at 06:07:25PM -0400, Joseph Huang wrote:
>> Something like this (some layers omitted for brevity)?
>>
>>                                        +br_iterator
>>                                        |  for each mdb
>>                                        |    _br_switchdev_mdb_notify
>> rtnl_lock                             |      without F_DEFER flag
>>   |                                    |      |
>>   +switchdev_port_attr_set_deferred    |      +switchdev_port_obj_notify
>>     |                                  |        |
>>     +dsa_port_mrouter                  |        +dsa_user_port_obj_a/d
>>       |                                |          |
>>       +mv88e6xxx_port_mrouter----------+          +mv88e6xxx_port_obj_a/d
>>                                          |
>>   +--------------------------------------+
>>   |
>> rtnl_unlock
> 
> At a _very_ superficial glance, I don't think you are properly
> accounting for the fact that even with rtnl_lock() held, there are still
> SWITCHDEV_OBJ_ID_PORT_MDB events which may be pending on the switchdev
> chain. Without a switchdev_deferred_process() flush call, you won't be
> getting rid of them, so when you rtnl_unlock(), they will still run.
> 
> Even worse, holding rtnl_lock() will not stop the bridge multicast layer
> from modifying its br->mdb_list; only br->multicast_lock will.
> 
> So you may be better off also acquiring br->multicast_lock, and
> notifying the MDB entries to the switchdev chain _with_the F_DEFER flag.

Like this?

                                       +br_iterator(dsa_cb)
                                       |  lock br->multicask_lock
                                       |  for each mdb
                                       |    br_switchdev_mdb_notify
rtnl_lock                             |      |
  |                                    |      +switchdev_port_obj_._defer
  +switchdev_port_attr_set_deferred    |  unlock br->multicast_lock
    |                                  |
    +dsa_port_mrouter                  |
      |                                |
      +mv88e6xxx_port_mrouter----------+
                                         |
  +--------------------------------------+
  |
rtnl_unlock

(potential task change)

rtnl_lock
  |
  +switchdev_deferred_process
  | flush all queued dfitems in queuing order
  |
rtnl_unlock

I'm not that familiar with the bridge code, but is there any concern 
with potential deadlock here (bewteen rtnl_lock and br->multicast_lock)?

> 
>> Note that on the system I tested, each register read/write takes about 100us
>> to complete. For 100's of mdb groups, this would mean that we will be
>> holding rtnl lock for 10's of ms. I don't know if it's considered too long.
> 
> Not sure how this is going to be any better if the iteration over MDB
> entries is done 100% in the driver, though - since its hook,
> dsa_port_mrouter(), runs entirely under rtnl_lock(). >
> Anyway, with the SWITCHDEV_F_DEFER flag, maybe the mdb object
> notifications can be made to run by switchdev only a few at a time, to
> give the network stack time to do other things as well.



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-30  1:21                       ` Vladimir Oltean
@ 2024-04-30 17:01                         ` Joseph Huang
  2024-05-02 12:12                           ` Nikolay Aleksandrov
  0 siblings, 1 reply; 41+ messages in thread
From: Joseph Huang @ 2024-04-30 17:01 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Nikolay Aleksandrov, Joseph Huang, netdev, Andrew Lunn,
	Florian Fainelli, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Roopa Prabhu, Linus Lüssing, linux-kernel,
	bridge

On 4/29/2024 9:21 PM, Vladimir Oltean wrote:
> On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote:
>> How about the following syntax? I think it satisfies all the "not breaking
>> existing behavior" requirements (new option defaults to off, and missing
>> user space netlink attributes does not change the existing behavior):
>>
>> mcast_flood off
>>    all off
>> mcast_flood off mcast_flood_rfc4541 off
>>    all off
>> mcast_flood off mcast_flood_rfc4541 on
>>    224.0.0.X and ff02::1 on, the rest off
>> mcast_flood on
>>    all on
>> mcast_flood on mcast_flood_rfc4541 off
>>    all on (mcast_flood on overrides mcast_flood_rfc4541)
>> mcast_flood on mcast_flood_rfc4541 on
>>    all on
>> mcast_flood_rfc4541 off
>>    invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>> specified first)
>> mcast_flood_rfc4541 on
>>    invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>> specified first)
> 
> A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp().
> Netlink attributes are only there to _change_ the state of properties in
> the kernel. They don't need to be specified by user space if there's
> nothing to be changed. "Only valid if another netlink attribute comes
> first" makes no sense. You can alter 2 bridge port flags as part of the
> same netlink message, or as part of different netlink messages (sent
> over sockets of other processes).
> 
>>
>> Think of mcast_flood_rfc4541 like a pet door if you will.
> 
> Ultimately, as far as I see it, both the OR-based and the AND-based UAPI
> addition could be made to work in a way that's perhaps similarly backwards
> compatible. It needs to be worked out with the bridge maintainers. Given
> that I'm not doing great with my spare time, I will take a back seat on
> that.

Nik, do you have any objection to the following proposal?

mcast_flood ->          default/    off         on
(existing flag)         missing     (specified/ (specified/
                         (on)        nlmsg)      nlmsg)

mcast_flood_rfc4541
(proposed new flag)
      |
      v
default/                flood all   no flood    flood all
missing
(off)

off                     flood all   no flood    flood all
(specified/nlmsg)

on                      flood all   flood 4541  flood all
(specified/nlmsg)                   ^^^^^^^^^^
                                     only behavior change


Basically the attributes are OR'ed together to form the final flooding 
decision.


> 
> However, what I don't quite understand in your proposal is how many IPv4
> addresses lie beyond the "224.0.0.X" notation? 256? Explain why there is
> such a large discrepancy in the number of IPv4 addresses you are in
> control of (256), vs only 1 IPv6 address with the new rfc4541 flag?

That's straight out of RFC-4541 ("coincidentally", these are also the IP 
addresses for which the bridge will not create mdb's):

2.1.2.

    2) Packets with a destination IP (DIP) address in the 224.0.0.X range
       which are not IGMP must be forwarded on all ports.

       This recommendation is based on the fact that many host systems do
       not send Join IP multicast addresses in this range before sending
       or listening to IP multicast packets.  Furthermore, since the
       224.0.0.X address range is defined as link-local (not to be
       routed), it seems unnecessary to keep the state for each address
       in this range.  Additionally, some routers operate in the
       224.0.0.X address range without issuing IGMP Joins, and these
       applications would break if the switch were to prune them due to
       not having seen a Join Group message from the router.

and

3.

In IPv6, the data forwarding rules are more straight forward because
    MLD is mandated for addresses with scope 2 (link-scope) or greater.
    The only exception is the address FF02::1 which is the all hosts
    link-scope address for which MLD messages are never sent.  Packets
    with the all hosts link-scope address should be forwarded on all
    ports.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 00/10] MC Flood disable and snooping
  2024-04-30 17:01                         ` Joseph Huang
@ 2024-05-02 12:12                           ` Nikolay Aleksandrov
  0 siblings, 0 replies; 41+ messages in thread
From: Nikolay Aleksandrov @ 2024-05-02 12:12 UTC (permalink / raw)
  To: Joseph Huang, Vladimir Oltean
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Linus Lüssing, linux-kernel, bridge

On 30/04/2024 20:01, Joseph Huang wrote:
> On 4/29/2024 9:21 PM, Vladimir Oltean wrote:
>> On Mon, Apr 29, 2024 at 04:14:03PM -0400, Joseph Huang wrote:
>>> How about the following syntax? I think it satisfies all the "not breaking
>>> existing behavior" requirements (new option defaults to off, and missing
>>> user space netlink attributes does not change the existing behavior):
>>>
>>> mcast_flood off
>>>    all off
>>> mcast_flood off mcast_flood_rfc4541 off
>>>    all off
>>> mcast_flood off mcast_flood_rfc4541 on
>>>    224.0.0.X and ff02::1 on, the rest off
>>> mcast_flood on
>>>    all on
>>> mcast_flood on mcast_flood_rfc4541 off
>>>    all on (mcast_flood on overrides mcast_flood_rfc4541)
>>> mcast_flood on mcast_flood_rfc4541 on
>>>    all on
>>> mcast_flood_rfc4541 off
>>>    invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>>> specified first)
>>> mcast_flood_rfc4541 on
>>>    invalid (mcast_flood_rfc4541 is only valid if mcast_flood [on | off] is
>>> specified first)
>>
>> A bridge port defaults to having BR_MCAST_FLOOD set - see new_nbp().
>> Netlink attributes are only there to _change_ the state of properties in
>> the kernel. They don't need to be specified by user space if there's
>> nothing to be changed. "Only valid if another netlink attribute comes
>> first" makes no sense. You can alter 2 bridge port flags as part of the
>> same netlink message, or as part of different netlink messages (sent
>> over sockets of other processes).
>>
>>>
>>> Think of mcast_flood_rfc4541 like a pet door if you will.
>>
>> Ultimately, as far as I see it, both the OR-based and the AND-based UAPI
>> addition could be made to work in a way that's perhaps similarly backwards
>> compatible. It needs to be worked out with the bridge maintainers. Given
>> that I'm not doing great with my spare time, I will take a back seat on
>> that.
> 
> Nik, do you have any objection to the following proposal?
> 
> mcast_flood ->          default/    off         on
> (existing flag)         missing     (specified/ (specified/
>                         (on)        nlmsg)      nlmsg)
> 
> mcast_flood_rfc4541
> (proposed new flag)
>      |
>      v
> default/                flood all   no flood    flood all
> missing
> (off)
> 
> off                     flood all   no flood    flood all
> (specified/nlmsg)
> 
> on                      flood all   flood 4541  flood all
> (specified/nlmsg)                   ^^^^^^^^^^
>                                     only behavior change
> 
> 
> Basically the attributes are OR'ed together to form the final flooding decision.
> 
> 

Looks good to me. Please make use of the boolopt uapi to avoid adding new
nl attributes.

Thanks,
 Nik



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects
  2024-04-30 16:27               ` Joseph Huang
@ 2024-05-02 20:37                 ` Joseph Huang
  0 siblings, 0 replies; 41+ messages in thread
From: Joseph Huang @ 2024-05-02 20:37 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Joseph Huang, netdev, Andrew Lunn, Florian Fainelli,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Roopa Prabhu, Nikolay Aleksandrov, Linus Lüssing,
	linux-kernel, bridge

On 4/30/2024 12:27 PM, Joseph Huang wrote:
> On 4/29/2024 8:59 PM, Vladimir Oltean wrote:
>> On Mon, Apr 29, 2024 at 06:07:25PM -0400, Joseph Huang wrote:
>>> Something like this (some layers omitted for brevity)?
>>>
>>>                                        +br_iterator
>>>                                        |  for each mdb
>>>                                        |    _br_switchdev_mdb_notify
>>>  rtnl_lock                             |      without F_DEFER flag
>>>   |                                    |      |
>>>   +switchdev_port_attr_set_deferred    |      +switchdev_port_obj_notify
>>>     |                                  |        |
>>>     +dsa_port_mrouter                  |        +dsa_user_port_obj_a/d
>>>       |                                |          |
>>>       +mv88e6xxx_port_mrouter----------+          
>>> +mv88e6xxx_port_obj_a/d
>>>                                          |
>>>   +--------------------------------------+
>>>   |
>>> rtnl_unlock
>>
>> At a _very_ superficial glance, I don't think you are properly
>> accounting for the fact that even with rtnl_lock() held, there are still
>> SWITCHDEV_OBJ_ID_PORT_MDB events which may be pending on the switchdev
>> chain. Without a switchdev_deferred_process() flush call, you won't be
>> getting rid of them, so when you rtnl_unlock(), they will still run.
>>
>> Even worse, holding rtnl_lock() will not stop the bridge multicast layer
>> from modifying its br->mdb_list; only br->multicast_lock will.
>>
>> So you may be better off also acquiring br->multicast_lock, and
>> notifying the MDB entries to the switchdev chain _with_the F_DEFER flag.
> 
> Like this?
> 
>                                        +br_iterator(dsa_cb)
>                                        |  lock br->multicask_lock
>                                        |  for each mdb
>                                        |    br_switchdev_mdb_notify
>  rtnl_lock                             |      |
>   |                                    |      +switchdev_port_obj_._defer
>   +switchdev_port_attr_set_deferred    |  unlock br->multicast_lock
>     |                                  |
>     +dsa_port_mrouter                  |
>       |                                |
>       +mv88e6xxx_port_mrouter----------+
>                                          |
>   +--------------------------------------+
>   |
> rtnl_unlock
> 
> (potential task change)
> 
> rtnl_lock
>   |
>   +switchdev_deferred_process
>   | flush all queued dfitems in queuing order
>   |
> rtnl_unlock
> 
> I'm not that familiar with the bridge code, but is there any concern 
> with potential deadlock here (between rtnl_lock and br->multicast_lock)?

Hi Nik,

Do you know if it's safe to acquire rtnl_lock and br->multicast_lock in 
the following sequence? Is there any potential possibility for a deadlock?

rtnl_lock
spin_lock(br->multicast_lock)
spin_unlock(br->multicast_lock)
rtnl_unlock

If the operation is safe, the next question is should it be spin_lock or 
spin_lock_bh?

Thanks,
Joseph


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2024-05-02 20:37 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-02  0:10 [PATCH RFC net-next 00/10] MC Flood disable and snooping Joseph Huang
2024-04-02  0:11 ` [PATCH RFC net-next 01/10] net: bridge: Flood Queries even when mc flood is disabled Joseph Huang
2024-04-02  0:11 ` [PATCH RFC net-next 02/10] net: bridge: Always multicast_flood Reports Joseph Huang
2024-04-03 15:52   ` Simon Horman
2024-04-02  0:11 ` [PATCH RFC net-next 03/10] net: bridge: Always flood local subnet mc packets Joseph Huang
2024-04-02  0:11 ` [PATCH RFC net-next 04/10] net: dsa: mv88e6xxx: Add all hosts mc addr to ATU Joseph Huang
2024-04-02 18:08   ` Vladimir Oltean
2024-04-02  0:11 ` [PATCH RFC net-next 05/10] net: dsa: Add support for PORT_MROUTER attribute Joseph Huang
2024-04-02  0:11 ` [PATCH RFC net-next 06/10] net: dsa: mv88e6xxx: Track soft bridge objects Joseph Huang
2024-04-02  0:11 ` [PATCH RFC net-next 07/10] net: dsa: mv88e6xxx: Track bridge mdb objects Joseph Huang
2024-04-02 12:23   ` Vladimir Oltean
2024-04-04 20:43     ` Joseph Huang
2024-04-05 11:07       ` Vladimir Oltean
2024-04-05 18:58         ` Joseph Huang
2024-04-29 22:07           ` Joseph Huang
2024-04-30  0:59             ` Vladimir Oltean
2024-04-30 16:27               ` Joseph Huang
2024-05-02 20:37                 ` Joseph Huang
2024-04-02  0:11 ` [PATCH RFC net-next 08/10] net: dsa: mv88e6xxx: Convert MAB to use bit flags Joseph Huang
2024-04-02  0:11 ` [PATCH RFC net-next 09/10] net: dsa: mv88e6xxx: Enable mc flood for mrouter port Joseph Huang
2024-04-03 15:49   ` Simon Horman
2024-04-02  0:11 ` [PATCH RFC net-next 10/10] net: dsa: mv88e6xxx: Offload " Joseph Huang
2024-04-02  9:28 ` [PATCH RFC net-next 00/10] MC Flood disable and snooping Nikolay Aleksandrov
2024-04-02 17:43   ` Vladimir Oltean
2024-04-02 18:50     ` Nikolay Aleksandrov
2024-04-02 20:46       ` Vladimir Oltean
2024-04-02 21:59         ` Nikolay Aleksandrov
2024-04-04 22:16           ` Joseph Huang
2024-04-05 10:20             ` Vladimir Oltean
2024-04-05 11:00               ` Nikolay Aleksandrov
2024-04-05 20:22                 ` Joseph Huang
2024-04-05 21:15                   ` Vladimir Oltean
2024-04-29 20:14                     ` Joseph Huang
2024-04-30  1:21                       ` Vladimir Oltean
2024-04-30 17:01                         ` Joseph Huang
2024-05-02 12:12                           ` Nikolay Aleksandrov
2024-04-02 12:43 ` Andrew Lunn
2024-04-04 21:35   ` Joseph Huang
2024-04-04 22:11     ` Andrew Lunn
2024-04-04 22:40       ` Joseph Huang
2024-04-05 13:09         ` Andrew Lunn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).