Subject: [PATCH net-next v6 23/23] switchdev: bring documentation up-to-date
Date: Sat,  9 May 2015 10:40:25 -0700
Message-ID: <> (raw)
In-Reply-To: <>

From: Scott Feldman <>

Much-needed update of the switchdev documentation to cover what's been
implemented to date.  There are some XXX comments in the text for
unimplemented or broken items.  I'd like to keep these in there (poor-man's
TODO list) and update the document once each issue is resolved.

Signed-off-by: Scott Feldman <>
 Documentation/networking/switchdev.txt |  414 +++++++++++++++++++++++++++-----
 1 file changed, 355 insertions(+), 59 deletions(-)

diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
index f981a92..b3e18c8 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.txt
@@ -1,59 +1,355 @@
-Switch (and switch-ish) device drivers HOWTO
-Please note that the word "switch" is here used in very generic meaning.
-This include devices supporting L2/L3 but also various flow offloading chips,
-including switches embedded into SR-IOV NICs.
-Lets describe a topology a bit. Imagine the following example:
-       +----------------------------+    +---------------+
-       |     SOME switch chip       |    |      CPU      |
-       +----------------------------+    +---------------+
-       port1 port2 port3 port4 MNGMNT    |     PCI-E     |
-         |     |     |     |     |       +---------------+
-        PHY   PHY    |     |     |         |  NIC0 NIC1
-                     |     |     |         |   |    |
-                     |     |     +- PCI-E -+   |    |
-                     |     +------- MII -------+    |
-                     +------------- MII ------------+
-In this example, there are two independent lines between the switch silicon
-and CPU. NIC0 and NIC1 drivers are not aware of a switch presence. They are
-separate from the switch driver. SOME switch chip is by managed by a driver
-via PCI-E device MNGMNT. Note that MNGMNT device, NIC0 and NIC1 may be
-connected to some other type of bus.
-Now, for the previous example show the representation in kernel:
-       +----------------------------+    +---------------+
-       |     SOME switch chip       |    |      CPU      |
-       +----------------------------+    +---------------+
-       sw0p0 sw0p1 sw0p2 sw0p3 MNGMNT    |     PCI-E     |
-         |     |     |     |     |       +---------------+
-        PHY   PHY    |     |     |         |  eth0 eth1
-                     |     |     |         |   |    |
-                     |     |     +- PCI-E -+   |    |
-                     |     +------- MII -------+    |
-                     +------------- MII ------------+
-Lets call the example switch driver for SOME switch chip "SOMEswitch". This
-driver takes care of PCI-E device MNGMNT. There is a netdevice instance sw0pX
-created for each port of a switch. These netdevices are instances
-of "SOMEswitch" driver. sw0pX netdevices serve as a "representation"
-of the switch chip. eth0 and eth1 are instances of some other existing driver.
-The only difference of the switch-port netdevice from the ordinary netdevice
-is that is implements couple more NDOs:
-  ndo_switch_parent_id_get - This returns the same ID for two port netdevices
-			     of the same physical switch chip. This is
-			     mandatory to be implemented by all switch drivers
-			     and serves the caller for recognition of a port
-			     netdevice.
-  ndo_switch_parent_* - Functions that serve for a manipulation of the switch
-			chip itself (it can be though of as a "parent" of the
-			port, therefore the name). They are not port-specific.
-			Caller might use arbitrary port netdevice of the same
-			switch and it will make no difference.
-  ndo_switch_port_* - Functions that serve for a port-specific manipulation.
+Ethernet switch device driver model (switchdev)
+Copyright (c) 2014 Jiri Pirko <>
+Copyright (c) 2014-2015 Scott Feldman <>
+The Ethernet switch device driver model (switchdev) is an in-kernel driver
+model for switch devices which offload the forwarding (data) plane from the
+kernel.
+Figure 1 is a block diagram showing the components of the switchdev model for
+an example setup using a data-center-class switch ASIC chip.  Other setups
+with SR-IOV or soft switches, such as OVS, are possible.
+                             User-space tools                                 
+       user space                   |                                         
+      +-------------------------------------------------------------------+   
+       kernel                       | Netlink                                 
+                                    |                                         
+                     +--------------+-------------------------------+         
+                     |         Network stack                        |         
+                     |           (Linux)                            |         
+                     |                                              |         
+                     +----------------------------------------------+         
+                           sw1p2     sw1p4     sw1p6
+                      sw1p1  +  sw1p3  +  sw1p5  +          eth1             
+                        +    |    +    |    +    |            +               
+                        |    |    |    |    |    |            |               
+                     +--+----+----+----+-+--+----+---+  +-----+-----+         
+                     |         Switch driver         |  |    mgmt   |         
+                     |        (this document)        |  |   driver  |         
+                     |                               |  |           |         
+                     +--------------+----------------+  +-----------+         
+                                    |                                         
+       kernel                       | HW bus (eg PCI)                         
+      +-------------------------------------------------------------------+   
+       hardware                     |                                         
+                     +--------------+---+------------+                        
+                     |         Switch device (sw1)   |                        
+                     |  +----+                       +--------+               
+                     |  |    v offloaded data path   | mgmt port              
+                     |  |    |                       |                        
+                     +--|----|----+----+----+----+---+                        
+                        |    |    |    |    |    |                            
+                        +    +    +    +    +    +                            
+                       p1   p2   p3   p4   p5   p6
+                             front-panel ports                                
+                                    Fig 1.
+Include Files
+#include <linux/netdevice.h>
+#include <net/switchdev.h>
+Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure switchdev model
+support is built for the driver.
+Switch Ports
+On switchdev driver initialization, the driver will allocate and register a
+struct net_device (using register_netdev()) for each enumerated physical switch
+port, called the port netdev.  A port netdev is the software representation of
+the physical port and provides a conduit for control traffic to/from the
+controller (the kernel) and the network, as well as an anchor point for higher
+level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers.  Using
+standard netdev tools (iproute2, ethtool, etc), the port netdev can also
+provide to the user access to the physical properties of the switch port such
+as PHY link state and I/O statistics.
+There is (currently) no higher-level kernel object for the switch beyond the
+port netdevs.  All of the switchdev driver ops are netdev ops or switchdev ops.
+A switch management port is outside the scope of the switchdev driver model.
+Typically, the management port does not participate in the offloaded data plane
+and is loaded with a different driver, such as a NIC driver, on the management
+port device.
+Port Netdev Naming
+Udev rules should be used for port netdev naming, using some unique attribute
+of the port as a key, for example the port MAC address or the port PHYS name.
+Hard-coding of kernel netdev names within the driver is discouraged; let the
+kernel pick the default netdev name, and let udev set the final name based on a
+port attribute.
+Using port PHYS name (ndo_get_phys_port_name) for the key is particularly
+useful for dynamically-named ports where the device names its ports based on
+external configuration.  For example, if a physical 40G port is split logically
+into 4 10G ports, resulting in 4 port netdevs, the device can give a unique
+name for each port using port PHYS name.  The udev rule would be:
+SUBSYSTEM=="net", ACTION=="add", DRIVER=="<driver>", ATTR{phys_port_name}!="", \
+	NAME="$attr{phys_port_name}"
+Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y
+is the port name or ID, and Z is the sub-port name or ID.  For example, sw1p1s0
+would be sub-port 0 on port 1 on switch 1.
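+As a small illustration of the convention, the userspace sketch below formats
+such a name.  The helper format_port_name() is hypothetical and is not a kernel
+or switchdev API; omitting the "sZ" suffix for a port that is not split is an
+assumption based on the names shown in Fig 1.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel API): build a "swXpYsZ" name.
 * Assumption: a negative sub_port means the port is not split, so the
 * "sZ" suffix is left out.
 */
int format_port_name(char *buf, size_t len, int sw, int port, int sub_port)
{
	if (sub_port < 0)
		return snprintf(buf, len, "sw%dp%d", sw, port);
	return snprintf(buf, len, "sw%dp%ds%d", sw, port, sub_port);
}
```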
+Switch ID
+The switchdev driver must implement the switchdev op switchdev_port_attr_get for
+SWITCHDEV_ATTR_PORT_PARENT_ID for each port netdev, returning the same physical ID
+for each port of a switch.  The ID must be unique between switches on the same
+system.  The ID does not need to be unique between switches on different
+systems.
+The switch ID is used to locate ports on a switch and to know if aggregated
+ports belong to the same switch.
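+The "same switch" test amounts to a byte-wise comparison of two IDs.  A minimal
+userspace sketch follows; struct phys_id, PHYS_ID_MAX_LEN and same_switch() are
+illustrative names chosen here, not the kernel's definitions.

```c
#include <stdbool.h>
#include <string.h>

#define PHYS_ID_MAX_LEN 32	/* illustrative bound, an assumption */

/* Illustrative stand-in for a switch ID: opaque bytes plus a length. */
struct phys_id {
	unsigned char id[PHYS_ID_MAX_LEN];
	unsigned char id_len;
};

/* Two ports sit on the same switch when their IDs match byte for byte. */
bool same_switch(const struct phys_id *a, const struct phys_id *b)
{
	return a->id_len == b->id_len &&
	       memcmp(a->id, b->id, a->id_len) == 0;
}
```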
+Port Features
+If the switchdev driver (and device) only supports offloading of the default
+network namespace (netns), the driver should set this feature flag to prevent
+the port netdev from being moved out of the default netns.  A netns-aware
+driver/device would not set this flag and would be responsible for partitioning
+hardware to preserve netns containment.  This means hardware cannot forward
+traffic from a port in one namespace to another port in another namespace.
+Port Topology
+The port netdevs representing the physical switch ports can be organized into
+higher-level switching constructs.  The default construct is a standalone
+router port, used to offload L3 forwarding.  Two or more ports can be bonded
+together to form a LAG.  Two or more ports (or LAGs) can be bridged to bridge
+L2 networks.  VLANs can be applied to sub-divide L2 networks.  L2-over-L3
+tunnels can be built on ports.  These constructs are built using standard Linux
+tools such as the bridge driver, the bonding/team drivers, and netlink-based
+tools such as iproute2.
+The switchdev driver can know a particular port's position in the topology by
+monitoring NETDEV_CHANGEUPPER notifications.  For example, a port moved into a
+bond will see its upper master change.  If that bond is moved into a bridge,
+the bond's upper master will change.  And so on.  The driver will track such
+movements to know a port's position in the overall topology by
+registering for netdevice events and acting on NETDEV_CHANGEUPPER.
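+The bookkeeping this implies can be modeled in userspace as a chain of "upper
+master" pointers.  The sketch below is such a model only: struct dev,
+change_upper() and find_bridge() are hypothetical names, and a real driver
+would do the equivalent from its netdevice notifier callback.

```c
#include <stddef.h>

enum dev_kind { DEV_PORT, DEV_BOND, DEV_BRIDGE };

/* Illustrative model of a netdev and its upper master. */
struct dev {
	const char *name;
	enum dev_kind kind;
	struct dev *master;	/* NULL while the device is standalone */
};

/* Model of acting on a CHANGEUPPER-style event: record link or unlink. */
void change_upper(struct dev *lower, struct dev *upper, int linking)
{
	lower->master = linking ? upper : NULL;
}

/* Walk the chain of masters to find the bridge a port ends up in, if any. */
struct dev *find_bridge(struct dev *port)
{
	struct dev *d;

	for (d = port->master; d; d = d->master)
		if (d->kind == DEV_BRIDGE)
			return d;
	return NULL;
}
```

+With a port enslaved to a bond and the bond enslaved to a bridge, find_bridge()
+on the port returns the bridge; once the port leaves the bond it returns NULL.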
+L2 Forwarding Offload
+The idea is to offload the L2 data forwarding (switching) path from the kernel
+to the switchdev device by mirroring bridge FDB entries down to the device.  An
+FDB entry is the {port, MAC, VLAN} tuple forwarding destination.
+To offload L2 bridging, the switchdev driver/device should support:
+	- Static FDB entries installed on a bridge port
+	- Notification of learned/forgotten src mac/vlans from device
+	- STP state changes on the port
+	- VLAN flooding of multicast/broadcast and unknown unicast packets
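+To make the {port, MAC, VLAN} tuple concrete, here is a minimal userspace
+sketch of an FDB keyed on {MAC, VLAN} that yields a destination port.  The
+table layout, FDB_MAX and the fdb_*() helpers are assumptions made for
+illustration; they are unrelated to the bridge driver's implementation.

```c
#include <stdint.h>
#include <string.h>

#define FDB_MAX 64	/* illustrative table size, an assumption */

struct fdb_entry {
	uint8_t mac[6];
	uint16_t vid;
	int port;	/* forwarding destination */
	int used;
};

struct fdb {
	struct fdb_entry e[FDB_MAX];
};

static struct fdb_entry *fdb_find(struct fdb *t, const uint8_t *mac,
				  uint16_t vid)
{
	for (int i = 0; i < FDB_MAX; i++)
		if (t->e[i].used && t->e[i].vid == vid &&
		    memcmp(t->e[i].mac, mac, 6) == 0)
			return &t->e[i];
	return NULL;
}

/*
 * Install an entry, or move an existing {MAC, VLAN} to a new port.
 * Returns 0 on success, -1 if the table is full.
 */
int fdb_add(struct fdb *t, const uint8_t *mac, uint16_t vid, int port)
{
	struct fdb_entry *e = fdb_find(t, mac, vid);

	/* No existing entry: take the first free slot. */
	for (int i = 0; !e && i < FDB_MAX; i++)
		if (!t->e[i].used)
			e = &t->e[i];
	if (!e)
		return -1;
	memcpy(e->mac, mac, 6);
	e->vid = vid;
	e->port = port;
	e->used = 1;
	return 0;
}

/* Remove an entry; returns 0 if it existed, -1 otherwise. */
int fdb_del(struct fdb *t, const uint8_t *mac, uint16_t vid)
{
	struct fdb_entry *e = fdb_find(t, mac, vid);

	if (!e)
		return -1;
	e->used = 0;
	return 0;
}

/* Returns the destination port, or -1 for an unknown destination. */
int fdb_lookup(struct fdb *t, const uint8_t *mac, uint16_t vid)
{
	struct fdb_entry *e = fdb_find(t, mac, vid);

	return e ? e->port : -1;
}
```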
+Static FDB Entries
+The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump
+to support static FDB entries installed to the device.  Static bridge FDB
+entries are installed, for example, using iproute2 bridge cmd:
+	bridge fdb add ADDR dev DEV [vlan VID] [self]
+Note: by default, the bridge does not filter on VLAN and only bridges untagged
+traffic.  To enable VLAN support, turn on VLAN filtering:
+	echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering
+Notification of Learned/Forgotten Source MAC/VLANs
+The switch device will learn/forget source MAC address/VLAN on ingress packets
+and notify the switch driver of the mac/vlan/port tuples.  The switch driver,
+in turn, will notify the bridge driver using the switchdev notifier call:
+	err = call_switchdev_notifiers(val, dev, info);
+Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when forgetting, and
+info points to a struct switchdev_notifier_fdb_info.  On SWITCHDEV_FDB_ADD, the bridge
+driver will install the FDB entry into the bridge's FDB and mark the entry as
+NTF_EXT_LEARNED.  The iproute2 bridge command will label these entries
+"offload":
+	$ bridge fdb
+	52:54:00:12:35:01 dev sw1p1 master br0 permanent
+	00:02:00:00:02:00 dev sw1p1 master br0 offload
+	00:02:00:00:02:00 dev sw1p1 self
+	52:54:00:12:35:02 dev sw1p2 master br0 permanent
+	00:02:00:00:03:00 dev sw1p2 master br0 offload
+	00:02:00:00:03:00 dev sw1p2 self
+	33:33:00:00:00:01 dev eth0 self permanent
+	01:00:5e:00:00:01 dev eth0 self permanent
+	33:33:ff:00:00:00 dev eth0 self permanent
+	01:80:c2:00:00:0e dev eth0 self permanent
+	33:33:00:00:00:01 dev br0 self permanent
+	01:00:5e:00:00:01 dev br0 self permanent
+	33:33:ff:12:35:01 dev br0 self permanent
+Learning on the port should be disabled on the bridge using the bridge command:
+	bridge link set dev DEV learning off
+Learning on the device port should be enabled, as well as learning_sync:
+	bridge link set dev DEV learning on self
+	bridge link set dev DEV learning_sync on self
+Learning_sync attribute enables syncing of the learned/forgotten FDB entry to
+the bridge's FDB.  It's possible, but not optimal, to enable learning on the
+device port and on the bridge port, and disable learning_sync.
+To support learning and learning_sync port attributes, the driver implements
+switchdev op switchdev_port_attr_get/set for SWITCHDEV_ATTR_PORT_BRIDGE_FLAGS.  The driver
+should initialize the attributes to the hardware defaults.
+FDB Ageing
+There are two FDB ageing models supported: 1) ageing by the device, and 2)
+ageing by the kernel.  Ageing by the device is preferred if many FDB entries
+are supported.  The driver calls call_switchdev_notifiers(SWITCHDEV_FDB_DEL, ...) to
+age out the FDB entry.  In this model, ageing by the kernel should be turned
+off.  XXX: how to turn off ageing in kernel on a per-port basis or otherwise
+prevent the kernel from ageing out the FDB entry?
+In the kernel ageing model, the standard bridge ageing mechanism is used to age
+out stale FDB entries.  To keep an FDB entry "alive", the driver should refresh
+the FDB entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  The
+notification will reset the FDB entry's last-used time to now.  The driver
+should rate limit refresh notifications, for example, no more than once a
+second.  If the FDB entry expires, ndo_fdb_del is called to remove the entry
+from the device.  XXX: this last part isn't currently correct: ndo_fdb_del isn't
+called, so the stale entry remains in the device...this needs to get fixed.
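+The rate limit on refresh notifications in the kernel ageing model could be
+kept per FDB entry, as in the userspace sketch below.  struct fdb_refresh,
+fdb_should_refresh() and the millisecond clock argument are hypothetical; the
+one-second interval is the example figure from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define REFRESH_INTERVAL_MS 1000	/* "no more than once a second" */

/* Illustrative per-entry refresh state. */
struct fdb_refresh {
	uint64_t last_notify_ms;
	bool notified;
};

/*
 * Returns true when a refresh notification should be sent now, and
 * records the time so that later calls inside the interval return false.
 */
bool fdb_should_refresh(struct fdb_refresh *r, uint64_t now_ms)
{
	if (r->notified && now_ms - r->last_notify_ms < REFRESH_INTERVAL_MS)
		return false;
	r->last_notify_ms = now_ms;
	r->notified = true;
	return true;
}
```

+A driver following this model would send the SWITCHDEV_FDB_ADD notification
+only when the helper returns true.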
+FDB Flush
+XXX: Unimplemented.  Need to support FDB flush by bridge driver for port and
+remove both static and learned FDB entries.
+STP State Change on Port
+Internally or with a third-party STP protocol implementation (e.g. mstpd), the
+bridge driver maintains the STP state for ports, and will notify the switch
+driver of STP state change on a port using the switchdev op switchdev_port_attr_set for
+State is one of BR_STATE_*.  The switch driver can use STP state updates to
+update the ingress packet filter list for the port.  For example, if the port is
+DISABLED, no packets should pass, but if the port moves to BLOCKED, then STP BPDUs
+and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass.
+Note that STP BPDUs are untagged and STP state applies to all VLANs on the port
+so packet filters should be applied consistently across untagged and tagged
+VLANs on the port.
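+The filter decision described above can be sketched as a pure function of the
+port's STP state and the destination MAC.  This is a userspace model with
+local, hypothetical names (enum stp_state, ingress_allowed()).  Only the two
+cases from the text are taken from this document; passing everything in the
+forwarding state is an assumption, and the remaining states are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the states discussed; not the kernel's BR_STATE_*. */
enum stp_state { STP_DISABLED, STP_BLOCKING, STP_FORWARDING };

/* True for the IEEE 01:80:c2:xx:xx:xx link-local multicast range. */
bool is_link_local(const uint8_t *dmac)
{
	return dmac[0] == 0x01 && dmac[1] == 0x80 && dmac[2] == 0xc2;
}

/*
 * Decide whether an ingress packet may pass.  The VLAN tag is deliberately
 * not an input: the STP state applies to all VLANs on the port.
 */
bool ingress_allowed(enum stp_state state, const uint8_t *dmac)
{
	switch (state) {
	case STP_DISABLED:
		return false;
	case STP_BLOCKING:
		return is_link_local(dmac);
	case STP_FORWARDING:
		return true;	/* assumption, not stated in the text */
	}
	return false;
}
```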
+Flooding L2 domain
+For a given L2 VLAN domain, the switch device should flood multicast/broadcast
+and unknown unicast packets to all ports in the domain, if allowed by the port's
+current STP state.  The switch driver, knowing which ports are within which
+VLAN L2 domain, can program the switch device for flooding.  The packet should
+also be sent to the port netdev for processing by the bridge driver.  The
+bridge should not reflood the packet to the same ports the device flooded.
+XXX: the mechanism to avoid duplicate flood packets is being discussed.
+It is possible for the switch device to not handle flooding and push the
+packets up to the bridge driver for flooding.  This is not ideal as the number
+of ports in the L2 domain scales, because the device is much more efficient at
+flooding packets than software.
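+As an illustration of how a driver might derive the flood set for one VLAN,
+the userspace sketch below builds a bitmask of egress ports.  struct port_cfg,
+MAX_VIDS and flood_mask() are hypothetical.  Skipping the ingress port is
+ordinary bridging behaviour that is assumed here, not something stated above.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_VIDS 4	/* illustrative per-port VLAN limit, an assumption */

struct port_cfg {
	uint16_t vids[MAX_VIDS];	/* VLANs the port is a member of */
	int nvids;
	bool forwarding;		/* STP state allows egress */
};

static bool port_in_vlan(const struct port_cfg *p, uint16_t vid)
{
	for (int i = 0; i < p->nvids; i++)
		if (p->vids[i] == vid)
			return true;
	return false;
}

/* Bit i of the result is set when port i should get the flooded packet. */
uint32_t flood_mask(const struct port_cfg *ports, int nports,
		    int ingress, uint16_t vid)
{
	uint32_t mask = 0;

	for (int i = 0; i < nports && i < 32; i++) {
		if (i == ingress)
			continue;	/* assumption: never flood back out */
		if (ports[i].forwarding && port_in_vlan(&ports[i], vid))
			mask |= UINT32_C(1) << i;
	}
	return mask;
}
```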
+IGMP Snooping
+XXX: complete this section
+L3 routing
+Offloading L3 routing requires that the device be programmed with FIB entries from
+the kernel, with the device doing the FIB lookup and forwarding.  The device
+does a longest prefix match (LPM) on FIB entries matching route prefix and
+forwards the packet to the matching FIB entry's nexthop(s) egress ports.  To
+program the device, the switchdev driver is called with add/delete ops for IPv4
+and IPv6 FIB entries.  For IPv4, the driver implements switchdev ops:
+	int (*switchdev_fib_ipv4_add)(struct net_device *dev,
+				  __be32 dst, int dst_len,
+				  struct fib_info *fi,
+				  u8 tos, u8 type,
+				  u32 nlflags, u32 tb_id);
+	int (*switchdev_fib_ipv4_del)(struct net_device *dev,
+				  __be32 dst, int dst_len,
+				  struct fib_info *fi,
+				  u8 tos, u8 type,
+				  u32 tb_id);
+to add/delete IPv4 dst/dst_len prefix on table tb_id.  The *fi structure holds
+details on the route and the route's nexthops.  *dev is one of the port netdevs
+mentioned in the route's nexthop list.  If the output port netdevs referenced
+in the route's nexthop list don't all have the same switch ID, the driver is
+not called to add/delete the FIB entry.
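+For readers unfamiliar with the lookup the device performs on the programmed
+entries, a longest prefix match can be sketched in userspace as a linear scan.
+This is only a model: struct fib_entry here is a simplified stand-in that
+reuses the dst/dst_len names from the ops above, addresses are kept in host
+byte order for simplicity, and real devices use dedicated lookup hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a programmed FIB entry. */
struct fib_entry {
	uint32_t dst;		/* prefix, host byte order in this sketch */
	int dst_len;		/* prefix length, 0..32 */
	int egress_port;
};

static uint32_t prefix_mask(int len)
{
	/* A shift by 32 would be undefined, so handle /0 separately. */
	return len == 0 ? 0 : 0xffffffffu << (32 - len);
}

/* Return the matching entry with the longest prefix, or NULL if none. */
const struct fib_entry *fib_lpm(const struct fib_entry *fib, size_t n,
				uint32_t addr)
{
	const struct fib_entry *best = NULL;

	for (size_t i = 0; i < n; i++) {
		uint32_t m = prefix_mask(fib[i].dst_len);

		if ((addr & m) != (fib[i].dst & m))
			continue;
		if (!best || fib[i].dst_len > best->dst_len)
			best = &fib[i];
	}
	return best;
}
```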
+Routes offloaded to the device are labeled with "offload" in the ip route
+listing:
+	$ ip route show
+	default via dev eth0
+ dev sw1p1  proto kernel  scope link  src offload
+ via dev sw1p1  proto zebra  metric 20 offload
+ dev sw1p2  proto kernel  scope link  src offload
+ via dev sw1p2  proto zebra  metric 20 offload
+  proto zebra  metric 30 offload
+		nexthop via  dev sw1p1 weight 1
+		nexthop via  dev sw1p2 weight 1
+ via dev sw1p1  proto zebra  metric 20 offload
+ via dev sw1p2  proto zebra  metric 20 offload
+ dev eth0  proto kernel  scope link  src
+XXX: add/del IPv6 FIB API
+Nexthop Resolution
+The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for
+the switch device to forward the packet with the correct dst mac address, the
+nexthop gateways must be resolved to the neighbor's mac address.  Neighbor mac
+address discovery comes via the ARP (or ND) process and is available via the
+arp_tbl neighbor table.  To resolve the route's nexthop gateways, the driver
+should trigger the kernel's neighbor resolution process.  See the rocker
+driver's rocker_port_ipv4_resolve() for an example.
+The driver can monitor for updates to arp_tbl using the netevent notifier
+NETEVENT_NEIGH_UPDATE.  The device can be programmed with resolved nexthops
+for the routes as arp_tbl updates.
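+The effect of a neighbor update on pending nexthops can be modeled in
+userspace as below.  struct nexthop and neigh_update() are hypothetical names
+for illustration; a real driver would react to NETEVENT_NEIGH_UPDATE and then
+program the device, which this sketch does not attempt.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative nexthop tuple (gateway, dev) plus its resolution state. */
struct nexthop {
	uint32_t gw;		/* gateway IPv4 address */
	int dev;		/* egress port index */
	uint8_t mac[6];		/* neighbor MAC, valid once resolved */
	bool resolved;
};

/*
 * Apply one neighbor update to every nexthop using that gateway on that
 * port.  Returns the number of nexthops that were (re)resolved.
 */
int neigh_update(struct nexthop *nh, size_t n, uint32_t ip, int dev,
		 const uint8_t *mac)
{
	int updated = 0;

	for (size_t i = 0; i < n; i++) {
		if (nh[i].gw != ip || nh[i].dev != dev)
			continue;
		memcpy(nh[i].mac, mac, 6);
		nh[i].resolved = true;
		updated++;
	}
	return updated;
}
```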

