* [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
@ 2014-03-19 15:33 Jiri Pirko
  2014-03-19 15:33 ` [patch net-next RFC 1/4] openvswitch: split flow structures into ovs specific and generic ones Jiri Pirko
                   ` (4 more replies)
  0 siblings, 5 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-19 15:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet

This is just an early RFC draft. I wanted to post it early to get
feedback as soon as possible.

The basic idea is to introduce a generic infrastructure to support various
switch chips in the kernel. The idea is also to benefit from the existing
Open vSwitch userspace infrastructure.


The first patch splits the flow structures which are not specific to OVS
out into more generic ones that can be reused.
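
To illustrate the split, here is a minimal userspace sketch of the embedding
pattern patch 1 uses: the generic struct sw_flow and struct sw_flow_mask are
embedded in OVS-specific wrappers, and container_of() recovers the wrapper
from a pointer to the generic part. The flow key is reduced to a placeholder
integer, the wrappers carry only a subset of their real fields, and the helper
name ovs_mask_of() is hypothetical; none of this is code from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Generic part: simplified stand-ins for include/linux/sw_flow.h. */
struct sw_flow_key_range {
	unsigned short start;
	unsigned short end;
};

struct sw_flow_mask {
	struct sw_flow_key_range range;
	unsigned int key;		/* placeholder for struct sw_flow_key */
};

struct sw_flow {
	unsigned int key;		/* placeholder for struct sw_flow_key */
	unsigned int unmasked_key;	/* placeholder for struct sw_flow_key */
	struct sw_flow_mask *mask;
};

/* OVS-specific wrappers: bookkeeping plus the embedded generic part. */
struct ovs_flow_mask {
	int ref_count;
	struct sw_flow_mask mask;
};

struct ovs_flow {
	unsigned int hash;
	struct sw_flow flow;
};

/* Hypothetical helper: map the generic mask pointer stored in the flow
 * back to the OVS wrapper that contains it. */
static struct ovs_flow_mask *ovs_mask_of(struct ovs_flow *flow)
{
	assert(flow != NULL);
	if (!flow->flow.mask)
		return NULL;
	return container_of(flow->flow.mask, struct ovs_flow_mask, mask);
}
```

Code that only needs the key and mask can take a struct sw_flow pointer,
while OVS keeps its refcounting and hash-table fields in the wrapper.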


The second patch introduces the "switchdev" API itself. It should serve as
glue between chip drivers on one side and the "linked users" on the other.
Note that even though the only "linked user" in this patchset is OVS, it
is certainly possible to add others (sysfs, netlink, whatever).

The infrastructure is designed to be similar to, for example, the Linux bridge.
There is one netdevice representing a switch chip and one netdevice for each
port. These are bound together in the classic slave-master way. The reason
to reuse netdevices for port representation is that userspace can use
standard tools to get information about the ports, statistics, tcpdump, etc.
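
As a rough picture of that layout, the following toy userspace model has one
device object for the switch and one per port, with each port pointing at its
master. All names here (toy_netdev, toy_switch, toy_switch_add_port,
TOY_MAX_PORTS) are hypothetical and are not the switchdev API from patch 2;
the sketch only shows the slave-master binding.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PORTS 4

/* Hypothetical stand-in for a netdevice; not the kernel's struct net_device. */
struct toy_netdev {
	const char *name;
	struct toy_netdev *master;	/* NULL for the switch device itself */
};

/* One device for the switch chip plus one device per port. */
struct toy_switch {
	struct toy_netdev dev;
	struct toy_netdev *ports[TOY_MAX_PORTS];
	size_t n_ports;
};

/* Bind a port device to the switch device. Returns 0 on success, -1 if the
 * port already has a master or the switch has no free slot. */
static int toy_switch_add_port(struct toy_switch *sw, struct toy_netdev *port)
{
	assert(sw != NULL && port != NULL);
	if (port->master || sw->n_ports >= TOY_MAX_PORTS)
		return -1;
	port->master = &sw->dev;
	sw->ports[sw->n_ports++] = port;
	return 0;
}
```

Because every port is an ordinary device object in this model, anything that
works on a device (statistics, capture) works on a port without knowing about
the switch.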


The third patch introduces support for switchdev vports into the OVS datapath.
After that, userspace will be able to create a switchdev DP for a switch chip,
add switchdev ports to it, and use it in the same way as the OVS SW
datapath.


The fourth patch adds a dummy switch module. It is just an example of
a switchdev driver implementation.


Any feedback is very welcome!

Thanks.

Jiri

Jiri Pirko (4):
  openvswitch: split flow structures into ovs specific and generic ones
  net: introduce switchdev API
  openvswitch: Introduce support for switchdev based datapath
  net: introduce dummy switch

 drivers/net/Kconfig                        |   7 +
 drivers/net/Makefile                       |   1 +
 drivers/net/dummyswitch.c                  | 142 +++++++++++++
 include/linux/sw_flow.h                    | 105 +++++++++
 include/linux/switchdev.h                  |  62 ++++++
 include/uapi/linux/openvswitch.h           |   4 +
 net/Kconfig                                |  10 +
 net/core/Makefile                          |   1 +
 net/core/switchdev.c                       | 330 +++++++++++++++++++++++++++++
 net/openvswitch/Makefile                   |   4 +
 net/openvswitch/datapath.c                 |  90 +++++---
 net/openvswitch/datapath.h                 |  12 +-
 net/openvswitch/dp_notify.c                |   3 +-
 net/openvswitch/flow.c                     |  12 +-
 net/openvswitch/flow.h                     | 123 +++--------
 net/openvswitch/flow_netlink.c             |  24 +--
 net/openvswitch/flow_netlink.h             |   4 +-
 net/openvswitch/flow_table.c               | 100 ++++-----
 net/openvswitch/flow_table.h               |  18 +-
 net/openvswitch/vport-gre.c                |   4 +-
 net/openvswitch/vport-internal_switchdev.c | 148 +++++++++++++
 net/openvswitch/vport-internal_switchdev.h |  26 +++
 net/openvswitch/vport-netdev.c             |   4 +-
 net/openvswitch/vport-switchportdev.c      | 158 ++++++++++++++
 net/openvswitch/vport-switchportdev.h      |  24 +++
 net/openvswitch/vport-vxlan.c              |   2 +-
 net/openvswitch/vport.c                    |   6 +-
 net/openvswitch/vport.h                    |   4 +-
 28 files changed, 1218 insertions(+), 210 deletions(-)
 create mode 100644 drivers/net/dummyswitch.c
 create mode 100644 include/linux/sw_flow.h
 create mode 100644 include/linux/switchdev.h
 create mode 100644 net/core/switchdev.c
 create mode 100644 net/openvswitch/vport-internal_switchdev.c
 create mode 100644 net/openvswitch/vport-internal_switchdev.h
 create mode 100644 net/openvswitch/vport-switchportdev.c
 create mode 100644 net/openvswitch/vport-switchportdev.h

-- 
1.8.5.3


* [patch net-next RFC 1/4] openvswitch: split flow structures into ovs specific and generic ones
  2014-03-19 15:33 [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jiri Pirko
@ 2014-03-19 15:33 ` Jiri Pirko
  2014-03-20 13:04   ` Thomas Graf
  2014-03-19 15:33 ` [patch net-next RFC 2/4] net: introduce switchdev API Jiri Pirko
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-19 15:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet

After this, flow related structures can be used in other code.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/sw_flow.h        | 105 +++++++++++++++++++++++++++++++++++
 net/openvswitch/datapath.c     |  45 +++++++--------
 net/openvswitch/datapath.h     |   4 +-
 net/openvswitch/flow.c         |  12 ++--
 net/openvswitch/flow.h         | 123 +++++++++--------------------------------
 net/openvswitch/flow_netlink.c |  24 ++++----
 net/openvswitch/flow_netlink.h |   4 +-
 net/openvswitch/flow_table.c   | 100 +++++++++++++++++----------------
 net/openvswitch/flow_table.h   |  18 +++---
 net/openvswitch/vport-gre.c    |   4 +-
 net/openvswitch/vport-vxlan.c  |   2 +-
 net/openvswitch/vport.c        |   2 +-
 net/openvswitch/vport.h        |   2 +-
 13 files changed, 241 insertions(+), 204 deletions(-)
 create mode 100644 include/linux/sw_flow.h

diff --git a/include/linux/sw_flow.h b/include/linux/sw_flow.h
new file mode 100644
index 0000000..e7b1ef9
--- /dev/null
+++ b/include/linux/sw_flow.h
@@ -0,0 +1,105 @@
+/*
+ * Copyright (c) 2007-2012 Nicira, Inc.
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#ifndef _LINUX_SW_FLOW_H_
+#define _LINUX_SW_FLOW_H_
+
+struct sw_flow_key_ipv4_tunnel {
+	__be64 tun_id;
+	__be32 ipv4_src;
+	__be32 ipv4_dst;
+	__be16 tun_flags;
+	u8   ipv4_tos;
+	u8   ipv4_ttl;
+};
+
+struct sw_flow_key {
+	struct sw_flow_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
+	struct {
+		u32	priority;	/* Packet QoS priority. */
+		u32	skb_mark;	/* SKB mark. */
+		u16	in_port;	/* Input switch port (or DP_MAX_PORTS). */
+	} phy;
+	struct {
+		u8     src[ETH_ALEN];	/* Ethernet source address. */
+		u8     dst[ETH_ALEN];	/* Ethernet destination address. */
+		__be16 tci;		/* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
+		__be16 type;		/* Ethernet frame type. */
+	} eth;
+	struct {
+		u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
+		u8     tos;		/* IP ToS. */
+		u8     ttl;		/* IP TTL/hop limit. */
+		u8     frag;		/* One of OVS_FRAG_TYPE_*. */
+	} ip;
+	union {
+		struct {
+			struct {
+				__be32 src;	/* IP source address. */
+				__be32 dst;	/* IP destination address. */
+			} addr;
+			union {
+				struct {
+					__be16 src;		/* TCP/UDP/SCTP source port. */
+					__be16 dst;		/* TCP/UDP/SCTP destination port. */
+					__be16 flags;		/* TCP flags. */
+				} tp;
+				struct {
+					u8 sha[ETH_ALEN];	/* ARP source hardware address. */
+					u8 tha[ETH_ALEN];	/* ARP target hardware address. */
+				} arp;
+			};
+		} ipv4;
+		struct {
+			struct {
+				struct in6_addr src;	/* IPv6 source address. */
+				struct in6_addr dst;	/* IPv6 destination address. */
+			} addr;
+			__be32 label;			/* IPv6 flow label. */
+			struct {
+				__be16 src;		/* TCP/UDP/SCTP source port. */
+				__be16 dst;		/* TCP/UDP/SCTP destination port. */
+				__be16 flags;		/* TCP flags. */
+			} tp;
+			struct {
+				struct in6_addr target;	/* ND target address. */
+				u8 sll[ETH_ALEN];	/* ND source link layer address. */
+				u8 tll[ETH_ALEN];	/* ND target link layer address. */
+			} nd;
+		} ipv6;
+	};
+} __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */
+
+struct sw_flow_key_range {
+	unsigned short int start;
+	unsigned short int end;
+};
+
+struct sw_flow_mask {
+	struct sw_flow_key_range range;
+	struct sw_flow_key key;
+};
+
+struct sw_flow {
+	struct sw_flow_key key;
+	struct sw_flow_key unmasked_key;
+	struct sw_flow_mask *mask;
+};
+
+#endif /* _LINUX_SW_FLOW_H_ */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index c53fe0c..7906fe0 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -216,7 +216,7 @@ void ovs_dp_detach_port(struct vport *p)
 void ovs_dp_process_received_packet(struct vport *p, struct sk_buff *skb)
 {
 	struct datapath *dp = p->dp;
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct dp_stats_percpu *stats;
 	struct sw_flow_key key;
 	u64 *stats_counter;
@@ -488,7 +488,7 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	struct nlattr **a = info->attrs;
 	struct sw_flow_actions *acts;
 	struct sk_buff *packet;
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct datapath *dp;
 	struct ethhdr *eth;
 	int len;
@@ -525,11 +525,11 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 	if (IS_ERR(flow))
 		goto err_kfree_skb;
 
-	err = ovs_flow_extract(packet, -1, &flow->key);
+	err = ovs_flow_extract(packet, -1, &flow->flow.key);
 	if (err)
 		goto err_flow_free;
 
-	err = ovs_nla_get_flow_metadata(flow, a[OVS_PACKET_ATTR_KEY]);
+	err = ovs_nla_get_flow_metadata(&flow->flow, a[OVS_PACKET_ATTR_KEY]);
 	if (err)
 		goto err_flow_free;
 	acts = ovs_nla_alloc_flow_actions(nla_len(a[OVS_PACKET_ATTR_ACTIONS]));
@@ -538,15 +538,15 @@ static int ovs_packet_cmd_execute(struct sk_buff *skb, struct genl_info *info)
 		goto err_flow_free;
 
 	err = ovs_nla_copy_actions(a[OVS_PACKET_ATTR_ACTIONS],
-				   &flow->key, 0, &acts);
+				   &flow->flow.key, 0, &acts);
 	rcu_assign_pointer(flow->sf_acts, acts);
 	if (err)
 		goto err_flow_free;
 
 	OVS_CB(packet)->flow = flow;
-	OVS_CB(packet)->pkt_key = &flow->key;
-	packet->priority = flow->key.phy.priority;
-	packet->mark = flow->key.phy.skb_mark;
+	OVS_CB(packet)->pkt_key = &flow->flow.key;
+	packet->priority = flow->flow.key.phy.priority;
+	packet->mark = flow->flow.key.phy.skb_mark;
 
 	rcu_read_lock();
 	dp = get_dp(sock_net(skb->sk), ovs_header->dp_ifindex);
@@ -649,7 +649,7 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts)
 }
 
 /* Called with ovs_mutex. */
-static int ovs_flow_cmd_fill_info(struct sw_flow *flow, struct datapath *dp,
+static int ovs_flow_cmd_fill_info(struct ovs_flow *flow, struct datapath *dp,
 				  struct sk_buff *skb, u32 portid,
 				  u32 seq, u32 flags, u8 cmd)
 {
@@ -673,7 +673,8 @@ static int ovs_flow_cmd_fill_info(struct sw_flow *flow, struct datapath *dp,
 	if (!nla)
 		goto nla_put_failure;
 
-	err = ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key, skb);
+	err = ovs_nla_put_flow(&flow->flow.unmasked_key,
+			       &flow->flow.unmasked_key, skb);
 	if (err)
 		goto error;
 	nla_nest_end(skb, nla);
@@ -682,7 +683,7 @@ static int ovs_flow_cmd_fill_info(struct sw_flow *flow, struct datapath *dp,
 	if (!nla)
 		goto nla_put_failure;
 
-	err = ovs_nla_put_flow(&flow->key, &flow->mask->key, skb);
+	err = ovs_nla_put_flow(&flow->flow.key, &flow->flow.mask->key, skb);
 	if (err)
 		goto error;
 
@@ -739,7 +740,7 @@ error:
 	return err;
 }
 
-static struct sk_buff *ovs_flow_cmd_alloc_info(struct sw_flow *flow,
+static struct sk_buff *ovs_flow_cmd_alloc_info(struct ovs_flow *flow,
 					       struct genl_info *info)
 {
 	size_t len;
@@ -749,7 +750,7 @@ static struct sk_buff *ovs_flow_cmd_alloc_info(struct sw_flow *flow,
 	return genlmsg_new_unicast(len, info, GFP_KERNEL);
 }
 
-static struct sk_buff *ovs_flow_cmd_build_info(struct sw_flow *flow,
+static struct sk_buff *ovs_flow_cmd_build_info(struct ovs_flow *flow,
 					       struct datapath *dp,
 					       struct genl_info *info,
 					       u8 cmd)
@@ -772,12 +773,12 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
 	struct nlattr **a = info->attrs;
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key, masked_key;
-	struct sw_flow *flow = NULL;
+	struct ovs_flow *flow = NULL;
 	struct sw_flow_mask mask;
 	struct sk_buff *reply;
 	struct datapath *dp;
 	struct sw_flow_actions *acts = NULL;
-	struct sw_flow_match match;
+	struct ovs_flow_match match;
 	bool exact_5tuple;
 	int error;
 
@@ -832,8 +833,8 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
 			goto err_unlock_ovs;
 		}
 
-		flow->key = masked_key;
-		flow->unmasked_key = key;
+		flow->flow.key = masked_key;
+		flow->flow.unmasked_key = key;
 		rcu_assign_pointer(flow->sf_acts, acts);
 
 		/* Put flow in bucket. */
@@ -899,9 +900,9 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
 	struct sk_buff *reply;
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct datapath *dp;
-	struct sw_flow_match match;
+	struct ovs_flow_match match;
 	int err;
 
 	if (!a[OVS_FLOW_ATTR_KEY]) {
@@ -946,9 +947,9 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	struct ovs_header *ovs_header = info->userhdr;
 	struct sw_flow_key key;
 	struct sk_buff *reply;
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct datapath *dp;
-	struct sw_flow_match match;
+	struct ovs_flow_match match;
 	int err;
 
 	ovs_lock();
@@ -1011,7 +1012,7 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
 
 	ti = rcu_dereference(dp->table.ti);
 	for (;;) {
-		struct sw_flow *flow;
+		struct ovs_flow *flow;
 		u32 bucket, obj;
 
 		bucket = cb->args[0];
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 0531738..5388cac 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -100,9 +100,9 @@ struct datapath {
  * packet is not being tunneled.
  */
 struct ovs_skb_cb {
-	struct sw_flow		*flow;
+	struct ovs_flow		*flow;
 	struct sw_flow_key	*pkt_key;
-	struct ovs_key_ipv4_tunnel  *tun_key;
+	struct sw_flow_key_ipv4_tunnel  *tun_key;
 };
 #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)
 
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 16f4b46..52e4466 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -61,7 +61,7 @@ u64 ovs_flow_used_time(unsigned long flow_jiffies)
 
 #define TCP_FLAGS_BE16(tp) (*(__be16 *)&tcp_flag_word(tp) & htons(0x0FFF))
 
-void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
+void ovs_flow_stats_update(struct ovs_flow *flow, struct sk_buff *skb)
 {
 	struct flow_stats *stats;
 	__be16 tcp_flags = 0;
@@ -71,9 +71,9 @@ void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb)
 	else
 		stats = this_cpu_ptr(flow->stats.cpu_stats);
 
-	if ((flow->key.eth.type == htons(ETH_P_IP) ||
-	     flow->key.eth.type == htons(ETH_P_IPV6)) &&
-	    flow->key.ip.proto == IPPROTO_TCP &&
+	if ((flow->flow.key.eth.type == htons(ETH_P_IP) ||
+	     flow->flow.key.eth.type == htons(ETH_P_IPV6)) &&
+	    flow->flow.key.ip.proto == IPPROTO_TCP &&
 	    likely(skb->len >= skb_transport_offset(skb) + sizeof(struct tcphdr))) {
 		tcp_flags = TCP_FLAGS_BE16(tcp_hdr(skb));
 	}
@@ -99,7 +99,7 @@ static void stats_read(struct flow_stats *stats,
 	spin_unlock(&stats->lock);
 }
 
-void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats *ovs_stats,
+void ovs_flow_stats_get(struct ovs_flow *flow, struct ovs_flow_stats *ovs_stats,
 			unsigned long *used, __be16 *tcp_flags)
 {
 	int cpu, cur_cpu;
@@ -138,7 +138,7 @@ static void stats_reset(struct flow_stats *stats)
 	spin_unlock(&stats->lock);
 }
 
-void ovs_flow_stats_clear(struct sw_flow *flow)
+void ovs_flow_stats_clear(struct ovs_flow *flow)
 {
 	int cpu, cur_cpu;
 
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 2d770e2..1ece896 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -32,24 +32,16 @@
 #include <linux/time.h>
 #include <linux/flex_array.h>
 #include <net/inet_ecn.h>
+#include <linux/sw_flow.h>
 
 struct sk_buff;
 
-/* Used to memset ovs_key_ipv4_tunnel padding. */
+/* Used to memset sw_flow_key_ipv4_tunnel padding. */
 #define OVS_TUNNEL_KEY_SIZE					\
-	(offsetof(struct ovs_key_ipv4_tunnel, ipv4_ttl) +	\
-	FIELD_SIZEOF(struct ovs_key_ipv4_tunnel, ipv4_ttl))
-
-struct ovs_key_ipv4_tunnel {
-	__be64 tun_id;
-	__be32 ipv4_src;
-	__be32 ipv4_dst;
-	__be16 tun_flags;
-	u8   ipv4_tos;
-	u8   ipv4_ttl;
-};
+	(offsetof(struct sw_flow_key_ipv4_tunnel, ipv4_ttl) +	\
+	FIELD_SIZEOF(struct sw_flow_key_ipv4_tunnel, ipv4_ttl))
 
-static inline void ovs_flow_tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
+static inline void ovs_flow_tun_key_init(struct sw_flow_key_ipv4_tunnel *tun_key,
 					 const struct iphdr *iph, __be64 tun_id,
 					 __be16 tun_flags)
 {
@@ -65,77 +57,28 @@ static inline void ovs_flow_tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
 	       sizeof(*tun_key) - OVS_TUNNEL_KEY_SIZE);
 }
 
-struct sw_flow_key {
-	struct ovs_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
-	struct {
-		u32	priority;	/* Packet QoS priority. */
-		u32	skb_mark;	/* SKB mark. */
-		u16	in_port;	/* Input switch port (or DP_MAX_PORTS). */
-	} phy;
-	struct {
-		u8     src[ETH_ALEN];	/* Ethernet source address. */
-		u8     dst[ETH_ALEN];	/* Ethernet destination address. */
-		__be16 tci;		/* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
-		__be16 type;		/* Ethernet frame type. */
-	} eth;
-	struct {
-		u8     proto;		/* IP protocol or lower 8 bits of ARP opcode. */
-		u8     tos;		/* IP ToS. */
-		u8     ttl;		/* IP TTL/hop limit. */
-		u8     frag;		/* One of OVS_FRAG_TYPE_*. */
-	} ip;
-	union {
-		struct {
-			struct {
-				__be32 src;	/* IP source address. */
-				__be32 dst;	/* IP destination address. */
-			} addr;
-			union {
-				struct {
-					__be16 src;		/* TCP/UDP/SCTP source port. */
-					__be16 dst;		/* TCP/UDP/SCTP destination port. */
-					__be16 flags;		/* TCP flags. */
-				} tp;
-				struct {
-					u8 sha[ETH_ALEN];	/* ARP source hardware address. */
-					u8 tha[ETH_ALEN];	/* ARP target hardware address. */
-				} arp;
-			};
-		} ipv4;
-		struct {
-			struct {
-				struct in6_addr src;	/* IPv6 source address. */
-				struct in6_addr dst;	/* IPv6 destination address. */
-			} addr;
-			__be32 label;			/* IPv6 flow label. */
-			struct {
-				__be16 src;		/* TCP/UDP/SCTP source port. */
-				__be16 dst;		/* TCP/UDP/SCTP destination port. */
-				__be16 flags;		/* TCP flags. */
-			} tp;
-			struct {
-				struct in6_addr target;	/* ND target address. */
-				u8 sll[ETH_ALEN];	/* ND source link layer address. */
-				u8 tll[ETH_ALEN];	/* ND target link layer address. */
-			} nd;
-		} ipv6;
-	};
-} __aligned(BITS_PER_LONG/8); /* Ensure that we can do comparisons as longs. */
+struct arp_eth_header {
+	__be16      ar_hrd;	/* format of hardware address   */
+	__be16      ar_pro;	/* format of protocol address   */
+	unsigned char   ar_hln;	/* length of hardware address   */
+	unsigned char   ar_pln;	/* length of protocol address   */
+	__be16      ar_op;	/* ARP opcode (command)     */
 
-struct sw_flow_key_range {
-	unsigned short int start;
-	unsigned short int end;
-};
+	/* Ethernet+IPv4 specific members. */
+	unsigned char       ar_sha[ETH_ALEN];	/* sender hardware address  */
+	unsigned char       ar_sip[4];		/* sender IP address        */
+	unsigned char       ar_tha[ETH_ALEN];	/* target hardware address  */
+	unsigned char       ar_tip[4];		/* target IP address        */
+} __packed;
 
-struct sw_flow_mask {
+struct ovs_flow_mask {
 	int ref_count;
 	struct rcu_head rcu;
 	struct list_head list;
-	struct sw_flow_key_range range;
-	struct sw_flow_key key;
+	struct sw_flow_mask mask;
 };
 
-struct sw_flow_match {
+struct ovs_flow_match {
 	struct sw_flow_key *key;
 	struct sw_flow_key_range range;
 	struct sw_flow_mask *mask;
@@ -163,36 +106,20 @@ struct sw_flow_stats {
 	};
 };
 
-struct sw_flow {
+struct ovs_flow {
 	struct rcu_head rcu;
 	struct hlist_node hash_node[2];
 	u32 hash;
 
-	struct sw_flow_key key;
-	struct sw_flow_key unmasked_key;
-	struct sw_flow_mask *mask;
+	struct sw_flow flow;
 	struct sw_flow_actions __rcu *sf_acts;
 	struct sw_flow_stats stats;
 };
 
-struct arp_eth_header {
-	__be16      ar_hrd;	/* format of hardware address   */
-	__be16      ar_pro;	/* format of protocol address   */
-	unsigned char   ar_hln;	/* length of hardware address   */
-	unsigned char   ar_pln;	/* length of protocol address   */
-	__be16      ar_op;	/* ARP opcode (command)     */
-
-	/* Ethernet+IPv4 specific members. */
-	unsigned char       ar_sha[ETH_ALEN];	/* sender hardware address  */
-	unsigned char       ar_sip[4];		/* sender IP address        */
-	unsigned char       ar_tha[ETH_ALEN];	/* target hardware address  */
-	unsigned char       ar_tip[4];		/* target IP address        */
-} __packed;
-
-void ovs_flow_stats_update(struct sw_flow *flow, struct sk_buff *skb);
-void ovs_flow_stats_get(struct sw_flow *flow, struct ovs_flow_stats *stats,
+void ovs_flow_stats_update(struct ovs_flow *flow, struct sk_buff *skb);
+void ovs_flow_stats_get(struct ovs_flow *flow, struct ovs_flow_stats *stats,
 			unsigned long *used, __be16 *tcp_flags);
-void ovs_flow_stats_clear(struct sw_flow *flow);
+void ovs_flow_stats_clear(struct ovs_flow *flow);
 u64 ovs_flow_used_time(unsigned long flow_jiffies);
 
 int ovs_flow_extract(struct sk_buff *, u16 in_port, struct sw_flow_key *);
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 4d000ac..179ab98 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -46,7 +46,7 @@
 
 #include "flow_netlink.h"
 
-static void update_range__(struct sw_flow_match *match,
+static void update_range__(struct ovs_flow_match *match,
 			   size_t offset, size_t size, bool is_mask)
 {
 	struct sw_flow_key_range *range = NULL;
@@ -103,7 +103,7 @@ static u16 range_n_bytes(const struct sw_flow_key_range *range)
 	return range->end - range->start;
 }
 
-static bool match_validate(const struct sw_flow_match *match,
+static bool match_validate(const struct ovs_flow_match *match,
 			   u64 key_attrs, u64 mask_attrs)
 {
 	u64 key_expected = 1 << OVS_KEY_ATTR_ETHERNET;
@@ -339,7 +339,7 @@ static int parse_flow_nlattrs(const struct nlattr *attr,
 }
 
 static int ipv4_tun_from_nlattr(const struct nlattr *attr,
-				struct sw_flow_match *match, bool is_mask)
+				struct ovs_flow_match *match, bool is_mask)
 {
 	struct nlattr *a;
 	int rem;
@@ -428,8 +428,8 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 }
 
 static int ipv4_tun_to_nlattr(struct sk_buff *skb,
-			      const struct ovs_key_ipv4_tunnel *tun_key,
-			      const struct ovs_key_ipv4_tunnel *output)
+			      const struct sw_flow_key_ipv4_tunnel *tun_key,
+			      const struct sw_flow_key_ipv4_tunnel *output)
 {
 	struct nlattr *nla;
 
@@ -463,7 +463,7 @@ static int ipv4_tun_to_nlattr(struct sk_buff *skb,
 }
 
 
-static int metadata_from_nlattrs(struct sw_flow_match *match,  u64 *attrs,
+static int metadata_from_nlattrs(struct ovs_flow_match *match,  u64 *attrs,
 				 const struct nlattr **a, bool is_mask)
 {
 	if (*attrs & (1 << OVS_KEY_ATTR_PRIORITY)) {
@@ -501,7 +501,7 @@ static int metadata_from_nlattrs(struct sw_flow_match *match,  u64 *attrs,
 	return 0;
 }
 
-static int ovs_key_from_nlattrs(struct sw_flow_match *match,  bool *exact_5tuple,
+static int ovs_key_from_nlattrs(struct ovs_flow_match *match,  bool *exact_5tuple,
 				u64 attrs, const struct nlattr **a,
 				bool is_mask)
 {
@@ -799,7 +799,7 @@ static void sw_flow_mask_set(struct sw_flow_mask *mask,
  * @mask: Optional. Netlink attribute holding nested %OVS_KEY_ATTR_* Netlink
  * attribute specifies the mask field of the wildcarded flow.
  */
-int ovs_nla_get_match(struct sw_flow_match *match,
+int ovs_nla_get_match(struct ovs_flow_match *match,
 		      bool *exact_5tuple,
 		      const struct nlattr *key,
 		      const struct nlattr *mask)
@@ -922,11 +922,11 @@ int ovs_nla_get_match(struct sw_flow_match *match,
 int ovs_nla_get_flow_metadata(struct sw_flow *flow,
 			      const struct nlattr *attr)
 {
-	struct ovs_key_ipv4_tunnel *tun_key = &flow->key.tun_key;
+	struct sw_flow_key_ipv4_tunnel *tun_key = &flow->key.tun_key;
 	const struct nlattr *a[OVS_KEY_ATTR_MAX + 1];
 	u64 attrs = 0;
 	int err;
-	struct sw_flow_match match;
+	struct ovs_flow_match match;
 
 	flow->key.phy.in_port = DP_MAX_PORTS;
 	flow->key.phy.priority = 0;
@@ -1320,7 +1320,7 @@ static int validate_tp_port(const struct sw_flow_key *flow_key)
 	return -EINVAL;
 }
 
-void ovs_match_init(struct sw_flow_match *match,
+void ovs_match_init(struct ovs_flow_match *match,
 		    struct sw_flow_key *key,
 		    struct sw_flow_mask *mask)
 {
@@ -1339,7 +1339,7 @@ void ovs_match_init(struct sw_flow_match *match,
 static int validate_and_copy_set_tun(const struct nlattr *attr,
 				     struct sw_flow_actions **sfa)
 {
-	struct sw_flow_match match;
+	struct ovs_flow_match match;
 	struct sw_flow_key key;
 	int err, start;
 
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index b31fbe2..f223929 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -37,14 +37,14 @@
 
 #include "flow.h"
 
-void ovs_match_init(struct sw_flow_match *match,
+void ovs_match_init(struct ovs_flow_match *match,
 		    struct sw_flow_key *key, struct sw_flow_mask *mask);
 
 int ovs_nla_put_flow(const struct sw_flow_key *,
 		     const struct sw_flow_key *, struct sk_buff *);
 int ovs_nla_get_flow_metadata(struct sw_flow *flow,
 			      const struct nlattr *attr);
-int ovs_nla_get_match(struct sw_flow_match *match,
+int ovs_nla_get_match(struct ovs_flow_match *match,
 		      bool *exact_5tuple,
 		      const struct nlattr *,
 		      const struct nlattr *);
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index 3c268b3..053ece9 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -70,9 +70,9 @@ void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
 		*d++ = *s++ & *m++;
 }
 
-struct sw_flow *ovs_flow_alloc(bool percpu_stats)
+struct ovs_flow *ovs_flow_alloc(bool percpu_stats)
 {
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	int cpu;
 
 	flow = kmem_cache_alloc(flow_cache, GFP_KERNEL);
@@ -80,7 +80,7 @@ struct sw_flow *ovs_flow_alloc(bool percpu_stats)
 		return ERR_PTR(-ENOMEM);
 
 	flow->sf_acts = NULL;
-	flow->mask = NULL;
+	flow->flow.mask = NULL;
 
 	flow->stats.is_percpu = percpu_stats;
 
@@ -136,7 +136,7 @@ static struct flex_array *alloc_buckets(unsigned int n_buckets)
 	return buckets;
 }
 
-static void flow_free(struct sw_flow *flow)
+static void flow_free(struct ovs_flow *flow)
 {
 	kfree((struct sf_flow_acts __force *)flow->sf_acts);
 	if (flow->stats.is_percpu)
@@ -148,18 +148,20 @@ static void flow_free(struct sw_flow *flow)
 
 static void rcu_free_flow_callback(struct rcu_head *rcu)
 {
-	struct sw_flow *flow = container_of(rcu, struct sw_flow, rcu);
+	struct ovs_flow *flow = container_of(rcu, struct ovs_flow, rcu);
 
 	flow_free(flow);
 }
 
-void ovs_flow_free(struct sw_flow *flow, bool deferred)
+void ovs_flow_free(struct ovs_flow *flow, bool deferred)
 {
 	if (!flow)
 		return;
 
-	if (flow->mask) {
-		struct sw_flow_mask *mask = flow->mask;
+	if (flow->flow.mask) {
+		struct ovs_flow_mask *mask = container_of(flow->flow.mask,
+							  struct ovs_flow_mask,
+							  mask);
 
 		/* ovs-lock is required to protect mask-refcount and
 		 * mask list.
@@ -250,7 +252,7 @@ static void table_instance_destroy(struct table_instance *ti, bool deferred)
 		goto skip_flows;
 
 	for (i = 0; i < ti->n_buckets; i++) {
-		struct sw_flow *flow;
+		struct ovs_flow *flow;
 		struct hlist_head *head = flex_array_get(ti->buckets, i);
 		struct hlist_node *n;
 		int ver = ti->node_ver;
@@ -275,10 +277,10 @@ void ovs_flow_tbl_destroy(struct flow_table *table, bool deferred)
 	table_instance_destroy(ti, deferred);
 }
 
-struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
+struct ovs_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,
 				       u32 *bucket, u32 *last)
 {
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct hlist_head *head;
 	int ver;
 	int i;
@@ -309,7 +311,8 @@ static struct hlist_head *find_bucket(struct table_instance *ti, u32 hash)
 				(hash & (ti->n_buckets - 1)));
 }
 
-static void table_instance_insert(struct table_instance *ti, struct sw_flow *flow)
+static void table_instance_insert(struct table_instance *ti,
+				  struct ovs_flow *flow)
 {
 	struct hlist_head *head;
 
@@ -328,7 +331,7 @@ static void flow_table_copy_flows(struct table_instance *old,
 
 	/* Insert in new table. */
 	for (i = 0; i < old->n_buckets; i++) {
-		struct sw_flow *flow;
+		struct ovs_flow *flow;
 		struct hlist_head *head;
 
 		head = flex_array_get(old->buckets, i);
@@ -415,21 +418,21 @@ static bool flow_cmp_masked_key(const struct sw_flow *flow,
 	return cmp_key(&flow->key, key, key_start, key_end);
 }
 
-bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
-			       struct sw_flow_match *match)
+bool ovs_flow_cmp_unmasked_key(const struct ovs_flow *flow,
+			       struct ovs_flow_match *match)
 {
 	struct sw_flow_key *key = match->key;
 	int key_start = flow_key_start(key);
 	int key_end = match->range.end;
 
-	return cmp_key(&flow->unmasked_key, key, key_start, key_end);
+	return cmp_key(&flow->flow.unmasked_key, key, key_start, key_end);
 }
 
-static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
-					  const struct sw_flow_key *unmasked,
-					  struct sw_flow_mask *mask)
+static struct ovs_flow *masked_flow_lookup(struct table_instance *ti,
+					   const struct sw_flow_key *unmasked,
+					   struct sw_flow_mask *mask)
 {
-	struct sw_flow *flow;
+	struct ovs_flow *flow;
 	struct hlist_head *head;
 	int key_start = mask->range.start;
 	int key_end = mask->range.end;
@@ -440,34 +443,34 @@ static struct sw_flow *masked_flow_lookup(struct table_instance *ti,
 	hash = flow_hash(&masked_key, key_start, key_end);
 	head = find_bucket(ti, hash);
 	hlist_for_each_entry_rcu(flow, head, hash_node[ti->node_ver]) {
-		if (flow->mask == mask && flow->hash == hash &&
-		    flow_cmp_masked_key(flow, &masked_key,
+		if (flow->flow.mask == mask && flow->hash == hash &&
+		    flow_cmp_masked_key(&flow->flow, &masked_key,
 					  key_start, key_end))
 			return flow;
 	}
 	return NULL;
 }
 
-struct sw_flow *ovs_flow_tbl_lookup_stats(struct flow_table *tbl,
-				    const struct sw_flow_key *key,
-				    u32 *n_mask_hit)
+struct ovs_flow *ovs_flow_tbl_lookup_stats(struct flow_table *tbl,
+					   const struct sw_flow_key *key,
+					   u32 *n_mask_hit)
 {
 	struct table_instance *ti = rcu_dereference_ovsl(tbl->ti);
-	struct sw_flow_mask *mask;
-	struct sw_flow *flow;
+	struct ovs_flow_mask *mask;
+	struct ovs_flow *flow;
 
 	*n_mask_hit = 0;
 	list_for_each_entry_rcu(mask, &tbl->mask_list, list) {
 		(*n_mask_hit)++;
-		flow = masked_flow_lookup(ti, key, mask);
+		flow = masked_flow_lookup(ti, key, &mask->mask);
 		if (flow)  /* Found */
 			return flow;
 	}
 	return NULL;
 }
 
-struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *tbl,
-				    const struct sw_flow_key *key)
+struct ovs_flow *ovs_flow_tbl_lookup(struct flow_table *tbl,
+				     const struct sw_flow_key *key)
 {
 	u32 __always_unused n_mask_hit;
 
@@ -476,7 +479,7 @@ struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *tbl,
 
 int ovs_flow_tbl_num_masks(const struct flow_table *table)
 {
-	struct sw_flow_mask *mask;
+	struct ovs_flow_mask *mask;
 	int num = 0;
 
 	list_for_each_entry(mask, &table->mask_list, list)
@@ -490,7 +493,7 @@ static struct table_instance *table_instance_expand(struct table_instance *ti)
 	return table_instance_rehash(ti, ti->n_buckets * 2);
 }
 
-void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow)
+void ovs_flow_tbl_remove(struct flow_table *table, struct ovs_flow *flow)
 {
 	struct table_instance *ti = ovsl_dereference(table->ti);
 
@@ -499,9 +502,9 @@ void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow)
 	table->count--;
 }
 
-static struct sw_flow_mask *mask_alloc(void)
+static struct ovs_flow_mask *mask_alloc(void)
 {
-	struct sw_flow_mask *mask;
+	struct ovs_flow_mask *mask;
 
 	mask = kmalloc(sizeof(*mask), GFP_KERNEL);
 	if (mask)
@@ -521,15 +524,15 @@ static bool mask_equal(const struct sw_flow_mask *a,
 		&& (memcmp(a_, b_, range_n_bytes(&a->range)) == 0);
 }
 
-static struct sw_flow_mask *flow_mask_find(const struct flow_table *tbl,
-					   const struct sw_flow_mask *mask)
+static struct ovs_flow_mask *flow_mask_find(const struct flow_table *tbl,
+					    const struct sw_flow_mask *mask)
 {
 	struct list_head *ml;
 
 	list_for_each(ml, &tbl->mask_list) {
-		struct sw_flow_mask *m;
-		m = container_of(ml, struct sw_flow_mask, list);
-		if (mask_equal(mask, m))
+		struct ovs_flow_mask *m;
+		m = container_of(ml, struct ovs_flow_mask, list);
+		if (mask_equal(mask, &m->mask))
 			return m;
 	}
 
@@ -537,29 +540,30 @@ static struct sw_flow_mask *flow_mask_find(const struct flow_table *tbl,
 }
 
 /* Add 'mask' into the mask list, if it is not already there. */
-static int flow_mask_insert(struct flow_table *tbl, struct sw_flow *flow,
+static int flow_mask_insert(struct flow_table *tbl, struct ovs_flow *flow,
 			    struct sw_flow_mask *new)
 {
-	struct sw_flow_mask *mask;
+	struct ovs_flow_mask *mask;
+
 	mask = flow_mask_find(tbl, new);
 	if (!mask) {
 		/* Allocate a new mask if none exsits. */
 		mask = mask_alloc();
 		if (!mask)
 			return -ENOMEM;
-		mask->key = new->key;
-		mask->range = new->range;
+		mask->mask.key = new->key;
+		mask->mask.range = new->range;
 		list_add_rcu(&mask->list, &tbl->mask_list);
 	} else {
 		BUG_ON(!mask->ref_count);
 		mask->ref_count++;
 	}
 
-	flow->mask = mask;
+	flow->flow.mask = &mask->mask;
 	return 0;
 }
 
-int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
+int ovs_flow_tbl_insert(struct flow_table *table, struct ovs_flow *flow,
 			struct sw_flow_mask *mask)
 {
 	struct table_instance *new_ti = NULL;
@@ -570,8 +574,8 @@ int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
 	if (err)
 		return err;
 
-	flow->hash = flow_hash(&flow->key, flow->mask->range.start,
-			flow->mask->range.end);
+	flow->hash = flow_hash(&flow->flow.key, flow->flow.mask->range.start,
+			       flow->flow.mask->range.end);
 	ti = ovsl_dereference(table->ti);
 	table_instance_insert(ti, flow);
 	table->count++;
@@ -597,7 +601,7 @@ int ovs_flow_init(void)
 	BUILD_BUG_ON(__alignof__(struct sw_flow_key) % __alignof__(long));
 	BUILD_BUG_ON(sizeof(struct sw_flow_key) % sizeof(long));
 
-	flow_cache = kmem_cache_create("sw_flow", sizeof(struct sw_flow), 0,
+	flow_cache = kmem_cache_create("ovs_flow", sizeof(struct ovs_flow), 0,
 					0, NULL);
 	if (flow_cache == NULL)
 		return -ENOMEM;
diff --git a/net/openvswitch/flow_table.h b/net/openvswitch/flow_table.h
index baaeb10..c6abd84 100644
--- a/net/openvswitch/flow_table.h
+++ b/net/openvswitch/flow_table.h
@@ -55,28 +55,28 @@ struct flow_table {
 int ovs_flow_init(void);
 void ovs_flow_exit(void);
 
-struct sw_flow *ovs_flow_alloc(bool percpu_stats);
-void ovs_flow_free(struct sw_flow *, bool deferred);
+struct ovs_flow *ovs_flow_alloc(bool percpu_stats);
+void ovs_flow_free(struct ovs_flow *, bool deferred);
 
 int ovs_flow_tbl_init(struct flow_table *);
 int ovs_flow_tbl_count(struct flow_table *table);
 void ovs_flow_tbl_destroy(struct flow_table *table, bool deferred);
 int ovs_flow_tbl_flush(struct flow_table *flow_table);
 
-int ovs_flow_tbl_insert(struct flow_table *table, struct sw_flow *flow,
+int ovs_flow_tbl_insert(struct flow_table *table, struct ovs_flow *flow,
 			struct sw_flow_mask *mask);
-void ovs_flow_tbl_remove(struct flow_table *table, struct sw_flow *flow);
+void ovs_flow_tbl_remove(struct flow_table *table, struct ovs_flow *flow);
 int  ovs_flow_tbl_num_masks(const struct flow_table *table);
-struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *table,
+struct ovs_flow *ovs_flow_tbl_dump_next(struct table_instance *table,
 				       u32 *bucket, u32 *idx);
-struct sw_flow *ovs_flow_tbl_lookup_stats(struct flow_table *,
+struct ovs_flow *ovs_flow_tbl_lookup_stats(struct flow_table *,
 				    const struct sw_flow_key *,
 				    u32 *n_mask_hit);
-struct sw_flow *ovs_flow_tbl_lookup(struct flow_table *,
+struct ovs_flow *ovs_flow_tbl_lookup(struct flow_table *,
 				    const struct sw_flow_key *);
 
-bool ovs_flow_cmp_unmasked_key(const struct sw_flow *flow,
-			       struct sw_flow_match *match);
+bool ovs_flow_cmp_unmasked_key(const struct ovs_flow *flow,
+			       struct ovs_flow_match *match);
 
 void ovs_flow_mask_key(struct sw_flow_key *dst, const struct sw_flow_key *src,
 		       const struct sw_flow_mask *mask);
diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
index a3d6951..f940cbd 100644
--- a/net/openvswitch/vport-gre.c
+++ b/net/openvswitch/vport-gre.c
@@ -63,7 +63,7 @@ static __be16 filter_tnl_flags(__be16 flags)
 static struct sk_buff *__build_header(struct sk_buff *skb,
 				      int tunnel_hlen)
 {
-	const struct ovs_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key;
+	const struct sw_flow_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key;
 	struct tnl_ptk_info tpi;
 
 	skb = gre_handle_offloads(skb, !!(tun_key->tun_flags & TUNNEL_CSUM));
@@ -92,7 +92,7 @@ static __be64 key_to_tunnel_id(__be32 key, __be32 seq)
 static int gre_rcv(struct sk_buff *skb,
 		   const struct tnl_ptk_info *tpi)
 {
-	struct ovs_key_ipv4_tunnel tun_key;
+	struct sw_flow_key_ipv4_tunnel tun_key;
 	struct ovs_net *ovs_net;
 	struct vport *vport;
 	__be64 key;
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index e797a50..e0be18e 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -58,7 +58,7 @@ static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
 /* Called with rcu_read_lock and BH disabled. */
 static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
 {
-	struct ovs_key_ipv4_tunnel tun_key;
+	struct sw_flow_key_ipv4_tunnel tun_key;
 	struct vport *vport = vs->data;
 	struct iphdr *iph;
 	__be64 key;
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 42c0f4a..81b083c 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -337,7 +337,7 @@ int ovs_vport_get_options(const struct vport *vport, struct sk_buff *skb)
  * skb->data should point to the Ethernet header.
  */
 void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
-		       struct ovs_key_ipv4_tunnel *tun_key)
+		       struct sw_flow_key_ipv4_tunnel *tun_key)
 {
 	struct pcpu_sw_netstats *stats;
 
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index d7e50a1..0979304 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -191,7 +191,7 @@ static inline struct vport *vport_from_priv(const void *priv)
 }
 
 void ovs_vport_receive(struct vport *, struct sk_buff *,
-		       struct ovs_key_ipv4_tunnel *);
+		       struct sw_flow_key_ipv4_tunnel *);
 
 /* List of statically compiled vport implementations.  Don't forget to also
  * add yours to the list at the top of vport.c. */
-- 
1.8.5.3
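
For readers following the split: the pattern used throughout this patch is
that the generic struct sw_flow is embedded as the member "flow" of the
OVS-specific struct ovs_flow, which is why call sites change from flow->key
to flow->flow.key. What follows is only a minimal userspace sketch of that
embedding, not code from the patch. The structures are heavily trimmed
stand-ins, container_of() is redefined locally, and the helpers
ovs_flow_from_sw_flow() and generic_in_port() are hypothetical names used
for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Trimmed stand-ins; the real structures carry many more fields. */
struct sw_flow_key {
	uint32_t in_port;
};

struct sw_flow {			/* generic part, reusable outside OVS */
	struct sw_flow_key key;
	struct sw_flow_key unmasked_key;
};

struct ovs_flow {			/* OVS-specific wrapper */
	uint32_t hash;
	struct sw_flow flow;		/* embedded generic flow */
};

/* Hypothetical helper: recover the wrapper from the embedded generic flow. */
static struct ovs_flow *ovs_flow_from_sw_flow(struct sw_flow *flow)
{
	return container_of(flow, struct ovs_flow, flow);
}

/* Hypothetical generic consumer: it only ever sees the embedded part. */
static uint32_t generic_in_port(const struct sw_flow *flow)
{
	return flow->key.in_port;
}
```

Generic code is handed &flow->flow and never needs to know about the OVS
fields, while OVS code can get back to its wrapper when it is given a
struct sw_flow pointer.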

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [patch net-next RFC 2/4] net: introduce switchdev API
  2014-03-19 15:33 [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jiri Pirko
  2014-03-19 15:33 ` [patch net-next RFC 1/4] openvswitch: split flow structures into ovs specific and generic ones Jiri Pirko
@ 2014-03-19 15:33 ` Jiri Pirko
  2014-03-20 13:59   ` Thomas Graf
  2014-03-20 14:43   ` Nikolay Aleksandrov
  2014-03-19 15:33 ` [patch net-next RFC 3/4] openvswitch: Introduce support for switchdev based datapath Jiri Pirko
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-19 15:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet

switchdev API is designed to allow kernel support for various switch
chips.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/linux/switchdev.h |  62 +++++++++
 net/Kconfig               |  10 ++
 net/core/Makefile         |   1 +
 net/core/switchdev.c      | 330 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 403 insertions(+)
 create mode 100644 include/linux/switchdev.h
 create mode 100644 net/core/switchdev.c

diff --git a/include/linux/switchdev.h b/include/linux/switchdev.h
new file mode 100644
index 0000000..ac6db2d
--- /dev/null
+++ b/include/linux/switchdev.h
@@ -0,0 +1,62 @@
+/*
+ * include/linux/switchdev.h - Switch device API
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+#ifndef _LINUX_SWITCHDEV_H_
+#define _LINUX_SWITCHDEV_H_
+
+#include <linux/netdevice.h>
+#include <linux/sw_flow.h>
+
+struct swdev_linked_ops {
+};
+
+bool swdev_dev_check(const struct net_device *dev);
+void swdev_link(struct net_device *dev,
+		const struct swdev_linked_ops *linked_ops,
+		void *linked_priv);
+void swdev_unlink(struct net_device *dev);
+void *swdev_linked_priv(const struct net_device *dev);
+bool swdev_is_linked(const struct net_device *dev);
+int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow);
+int swdev_flow_remove(struct net_device *dev, struct sw_flow *flow);
+int swdev_packet_upcall(struct net_device *dev, struct sk_buff *skb);
+
+struct swdev_ops {
+	const char *kind;
+	int (*flow_insert)(struct net_device *dev, struct sw_flow *flow);
+	int (*flow_remove)(struct net_device *dev, struct sw_flow *flow);
+};
+
+struct net_device *swdev_create(const struct swdev_ops *ops);
+void swdev_destroy(struct net_device *dev);
+
+struct swportdev_linked_ops {
+	void (*skb_upcall)(struct net_device *dev, struct sk_buff *skb,
+			   struct sw_flow_key *key, void *linked_priv);
+};
+
+bool swportdev_dev_check(const struct net_device *dev);
+void swportdev_link(struct net_device *dev,
+		    const struct swportdev_linked_ops *linked_ops,
+		    void *linked_priv);
+void swportdev_unlink(struct net_device *dev);
+void *swportdev_linked_priv(const struct net_device *dev);
+bool swportdev_is_linked(const struct net_device *dev);
+
+struct swportdev_ops {
+	const char *kind;
+	netdev_tx_t (*skb_xmit)(struct sk_buff *skb,
+				struct net_device *port_dev);
+};
+
+struct net_device *swportdev_create(struct net_device *dev,
+				    const struct swportdev_ops *ops);
+void swportdev_destroy(struct net_device *port_dev);
+
+#endif /* _LINUX_SWITCHDEV_H_ */
diff --git a/net/Kconfig b/net/Kconfig
index e411046..e02ab8d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -285,6 +285,16 @@ config NET_FLOW_LIMIT
 	  with many clients some protection against DoS by a single (spoofed)
 	  flow that greatly exceeds average workload.
 
+config NET_SWITCHDEV
+	tristate "Switch device"
+	depends on INET
+	---help---
+	  This module provides glue for hardware switch chips so they can be
+	  accessed from userspace via the Open vSwitch Netlink API.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called switchdev.
+
 menu "Network testing"
 
 config NET_PKTGEN
diff --git a/net/core/Makefile b/net/core/Makefile
index 9628c20..426a619 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -23,3 +23,4 @@ obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
 obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
 obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
 obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
+obj-$(CONFIG_NET_SWITCHDEV) += switchdev.o
diff --git a/net/core/switchdev.c b/net/core/switchdev.c
new file mode 100644
index 0000000..3b8daaf
--- /dev/null
+++ b/net/core/switchdev.c
@@ -0,0 +1,330 @@
+/*
+ * net/core/switchdev.c - Switch device API
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/netdevice.h>
+#include <linux/ethtool.h>
+#include <linux/switchdev.h>
+
+#include <net/rtnetlink.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+
+#include <generated/utsrelease.h>
+
+struct swdev {
+	const struct swdev_ops *ops;
+	const struct swdev_linked_ops *linked_ops;
+	void *linked_priv;
+};
+
+static netdev_tx_t swdev_ndo_start_xmit(struct sk_buff *skb,
+					struct net_device *dev)
+{
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static const struct net_device_ops swdev_netdev_ops = {
+	.ndo_start_xmit = swdev_ndo_start_xmit,
+};
+
+static void swdev_ethtool_get_drvinfo(struct net_device *dev,
+				      struct ethtool_drvinfo *drvinfo)
+{
+	struct swdev *sw = netdev_priv(dev);
+
+	strlcpy(drvinfo->driver, sw->ops->kind, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops swdev_ethtool_ops = {
+	.get_drvinfo		= swdev_ethtool_get_drvinfo,
+	.get_link		= ethtool_op_get_link,
+};
+
+static void swdev_setup(struct net_device *dev)
+{
+	ether_setup(dev);
+	dev->netdev_ops = &swdev_netdev_ops;
+	dev->ethtool_ops = &swdev_ethtool_ops;
+}
+
+bool swdev_dev_check(const struct net_device *dev)
+{
+	return dev->netdev_ops == &swdev_netdev_ops;
+}
+EXPORT_SYMBOL(swdev_dev_check);
+
+void swdev_link(struct net_device *dev,
+		const struct swdev_linked_ops *linked_ops,
+		void *linked_priv)
+{
+	struct swdev *sw = netdev_priv(dev);
+
+	sw->linked_ops = linked_ops;
+	sw->linked_priv = linked_priv;
+	netdev_info(dev, "Switch device linked\n");
+}
+EXPORT_SYMBOL(swdev_link);
+
+void swdev_unlink(struct net_device *dev)
+{
+	struct swdev *sw = netdev_priv(dev);
+
+	sw->linked_ops = NULL;
+	sw->linked_priv = NULL;
+	netdev_info(dev, "Switch device unlinked\n");
+}
+EXPORT_SYMBOL(swdev_unlink);
+
+void *swdev_linked_priv(const struct net_device *dev)
+{
+	struct swdev *sw = netdev_priv(dev);
+
+	return sw->linked_priv;
+}
+EXPORT_SYMBOL(swdev_linked_priv);
+
+bool swdev_is_linked(const struct net_device *dev)
+{
+	return swdev_linked_priv(dev);
+}
+EXPORT_SYMBOL(swdev_is_linked);
+
+int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow)
+{
+	struct swdev *sw = netdev_priv(dev);
+
+	BUG_ON(!swdev_dev_check(dev));
+	if (!sw->ops->flow_insert)
+		return 0;
+	return sw->ops->flow_insert(dev, flow);
+}
+EXPORT_SYMBOL(swdev_flow_insert);
+
+int swdev_flow_remove(struct net_device *dev, struct sw_flow *flow)
+{
+	struct swdev *sw = netdev_priv(dev);
+
+	BUG_ON(!swdev_dev_check(dev));
+	if (!sw->ops->flow_remove)
+		return 0;
+	return sw->ops->flow_remove(dev, flow);
+}
+EXPORT_SYMBOL(swdev_flow_remove);
+
+struct net_device *swdev_create(const struct swdev_ops *ops)
+{
+	struct net_device *dev;
+	struct swdev *sw;
+	int err;
+
+	dev = alloc_netdev(sizeof(struct swdev), "swdev%d", swdev_setup);
+	if (!dev)
+		return ERR_PTR(-ENOMEM);
+
+	sw = netdev_priv(dev);
+	sw->ops = ops;
+	err = register_netdevice(dev);
+	if (err)
+		goto err_register_netdevice;
+	netif_carrier_off(dev);
+	netdev_info(dev, "Switch device created (%s)\n", sw->ops->kind);
+	return dev;
+
+err_register_netdevice:
+	free_netdev(dev);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL(swdev_create);
+
+void swdev_destroy(struct net_device *dev)
+{
+	netdev_info(dev, "Switch device destroyed\n");
+	unregister_netdevice(dev);
+	free_netdev(dev);
+}
+EXPORT_SYMBOL(swdev_destroy);
+
+
+struct swportdev {
+	const struct swportdev_ops *ops;
+	const struct swportdev_linked_ops *linked_ops;
+	void *linked_priv;
+};
+
+static netdev_tx_t swportdev_ndo_start_xmit(struct sk_buff *skb,
+					    struct net_device *port_dev)
+{
+	struct swportdev *swp = netdev_priv(port_dev);
+
+	return swp->ops->skb_xmit(skb, port_dev);
+}
+
+static const struct net_device_ops swportdev_netdev_ops = {
+	.ndo_start_xmit = swportdev_ndo_start_xmit,
+};
+
+static void swportdev_ethtool_get_drvinfo(struct net_device *port_dev,
+					  struct ethtool_drvinfo *drvinfo)
+{
+	struct swportdev *swp = netdev_priv(port_dev);
+
+	strlcpy(drvinfo->driver, swp->ops->kind, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
+}
+
+static const struct ethtool_ops swportdev_ethtool_ops = {
+	.get_drvinfo		= swportdev_ethtool_get_drvinfo,
+	.get_link		= ethtool_op_get_link,
+};
+static void swportdev_setup(struct net_device *port_dev)
+{
+	ether_setup(port_dev);
+	port_dev->netdev_ops = &swportdev_netdev_ops;
+	port_dev->ethtool_ops = &swportdev_ethtool_ops;
+}
+
+bool swportdev_dev_check(const struct net_device *port_dev)
+{
+	return port_dev->netdev_ops == &swportdev_netdev_ops;
+}
+EXPORT_SYMBOL(swportdev_dev_check);
+
+void swportdev_link(struct net_device *port_dev,
+		    const struct swportdev_linked_ops *linked_ops,
+		    void *linked_priv)
+{
+	struct swportdev *swp = netdev_priv(port_dev);
+
+	swp->linked_ops = linked_ops;
+	swp->linked_priv = linked_priv;
+	netdev_info(port_dev, "Switch port device linked\n");
+}
+EXPORT_SYMBOL(swportdev_link);
+
+void swportdev_unlink(struct net_device *port_dev)
+{
+	struct swportdev *swp = netdev_priv(port_dev);
+
+	swp->linked_ops = NULL;
+	swp->linked_priv = NULL;
+	netdev_info(port_dev, "Switch port device unlinked\n");
+}
+EXPORT_SYMBOL(swportdev_unlink);
+
+void *swportdev_linked_priv(const struct net_device *port_dev)
+{
+	struct swportdev *swp = netdev_priv(port_dev);
+
+	return swp->linked_priv;
+}
+EXPORT_SYMBOL(swportdev_linked_priv);
+
+bool swportdev_is_linked(const struct net_device *port_dev)
+{
+	return swportdev_linked_priv(port_dev);
+}
+EXPORT_SYMBOL(swportdev_is_linked);
+
+void swportdev_skb_upcall(struct net_device *dev, struct sk_buff *skb,
+			  struct sw_flow_key *key, void *linked_priv)
+{
+	struct swportdev *swp = netdev_priv(dev);
+
+	BUG_ON(!swportdev_dev_check(dev));
+	if (!swp->linked_ops->skb_upcall)
+		return;
+	swp->linked_ops->skb_upcall(dev, skb, key, swp->linked_priv);
+}
+EXPORT_SYMBOL(swportdev_skb_upcall);
+
+static rx_handler_result_t swportdev_handle_frame(struct sk_buff **pskb)
+{
+	struct sk_buff *skb = *pskb;
+
+	/* We don't care what comes from port device into rx path.
+	 * If there's something there, it is destined to ETH_P_ALL
+	 * handlers. So just consume it.
+	 */
+	dev_kfree_skb(skb);
+	return RX_HANDLER_CONSUMED;
+}
+
+struct net_device *swportdev_create(struct net_device *dev,
+				    const struct swportdev_ops *ops)
+{
+	struct net_device *port_dev;
+	char name[IFNAMSIZ];
+	struct swportdev *swp;
+	int err;
+
+	err = snprintf(name, IFNAMSIZ, "%sp%%d", dev->name);
+	if (err >= IFNAMSIZ)
+		return ERR_PTR(-EINVAL);
+
+	port_dev = alloc_netdev(sizeof(struct swportdev), name, swportdev_setup);
+	if (!port_dev)
+		return ERR_PTR(-ENOMEM);
+
+	err = register_netdevice(port_dev);
+	if (err)
+		goto err_register_netdevice;
+
+	err = netdev_master_upper_dev_link(port_dev, dev);
+	if (err) {
+		netdev_err(dev, "Device %s failed to set upper link\n",
+			   port_dev->name);
+		goto err_set_upper_link;
+	}
+	swp = netdev_priv(port_dev);
+	err = netdev_rx_handler_register(port_dev, swportdev_handle_frame, swp);
+	if (err) {
+		netdev_err(dev, "Device %s failed to register rx_handler\n",
+			   port_dev->name);
+		goto err_handler_register;
+	}
+
+	swp = netdev_priv(port_dev);
+	swp->ops = ops;
+	netif_carrier_off(port_dev);
+	netdev_info(port_dev, "Switch port device created (%s)\n", swp->ops->kind);
+	return port_dev;
+
+err_handler_register:
+	netdev_upper_dev_unlink(port_dev, dev);
+err_set_upper_link:
+	unregister_netdevice(port_dev);
+err_register_netdevice:
+	free_netdev(port_dev);
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL(swportdev_create);
+
+void swportdev_destroy(struct net_device *port_dev)
+{
+	struct net_device *dev;
+
+	dev = netdev_master_upper_dev_get(port_dev);
+	BUG_ON(!dev);
+	netdev_info(port_dev, "Switch port device destroyed\n");
+	netdev_rx_handler_unregister(port_dev);
+	netdev_upper_dev_unlink(port_dev, dev);
+	unregister_netdevice(port_dev);
+	free_netdev(port_dev);
+}
+EXPORT_SYMBOL(swportdev_destroy);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
+MODULE_DESCRIPTION("Switch device API");
-- 
1.8.5.3
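
To illustrate how the API above is meant to be consumed: a chip driver
supplies a struct swdev_ops table, and the core wrappers swdev_flow_insert()
and swdev_flow_remove() dispatch to it, treating a missing callback as
success. The following is a userspace sketch of that dispatch under stated
assumptions, not kernel code. The callbacks take the private struct directly
instead of a struct net_device, and "fakechip", "passive", n_flows and
FAKECHIP_MAX_FLOWS are hypothetical names invented for the example.

```c
#include <errno.h>
#include <stddef.h>

struct sw_flow {			/* stand-in; contents do not matter here */
	int id;
};

struct swdev;

struct swdev_ops {			/* same shape as the patch's swdev_ops */
	const char *kind;
	int (*flow_insert)(struct swdev *sw, struct sw_flow *flow);
	int (*flow_remove)(struct swdev *sw, struct sw_flow *flow);
};

struct swdev {				/* stand-in for the netdev private area */
	const struct swdev_ops *ops;
	int n_flows;			/* hypothetical hardware table fill level */
};

/* Core-side wrappers: a missing callback is treated as success. */
static int swdev_flow_insert(struct swdev *sw, struct sw_flow *flow)
{
	if (!sw->ops->flow_insert)
		return 0;
	return sw->ops->flow_insert(sw, flow);
}

static int swdev_flow_remove(struct swdev *sw, struct sw_flow *flow)
{
	if (!sw->ops->flow_remove)
		return 0;
	return sw->ops->flow_remove(sw, flow);
}

/* Hypothetical chip driver with room for two flows. */
#define FAKECHIP_MAX_FLOWS 2

static int fakechip_flow_insert(struct swdev *sw, struct sw_flow *flow)
{
	(void)flow;
	if (sw->n_flows >= FAKECHIP_MAX_FLOWS)
		return -ENOSPC;		/* table full */
	sw->n_flows++;
	return 0;
}

static int fakechip_flow_remove(struct swdev *sw, struct sw_flow *flow)
{
	(void)flow;
	if (sw->n_flows == 0)
		return -ENOENT;		/* nothing to remove */
	sw->n_flows--;
	return 0;
}

static const struct swdev_ops fakechip_ops = {
	.kind		= "fakechip",
	.flow_insert	= fakechip_flow_insert,
	.flow_remove	= fakechip_flow_remove,
};

/* A driver that offloads nothing: both callbacks are left NULL. */
static const struct swdev_ops passive_ops = {
	.kind		= "passive",
};
```

One consequence of the "missing callback means success" choice is that a
caller cannot tell from the return value whether a flow was actually
programmed into hardware.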

^ permalink raw reply related	[flat|nested] 125+ messages in thread

* [patch net-next RFC 3/4] openvswitch: Introduce support for switchdev based datapath
  2014-03-19 15:33 [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jiri Pirko
  2014-03-19 15:33 ` [patch net-next RFC 1/4] openvswitch: split flow structures into ovs specific and generic ones Jiri Pirko
  2014-03-19 15:33 ` [patch net-next RFC 2/4] net: introduce switchdev API Jiri Pirko
@ 2014-03-19 15:33 ` Jiri Pirko
  2014-03-19 15:33 ` [patch net-next RFC 4/4] net: introduce dummy switch Jiri Pirko
  2014-03-20 11:49 ` [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jamal Hadi Salim
  4 siblings, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-19 15:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 include/uapi/linux/openvswitch.h           |   4 +
 net/openvswitch/Makefile                   |   4 +
 net/openvswitch/datapath.c                 |  45 +++++++-
 net/openvswitch/datapath.h                 |   8 ++
 net/openvswitch/dp_notify.c                |   3 +-
 net/openvswitch/vport-internal_switchdev.c | 148 +++++++++++++++++++++++++++
 net/openvswitch/vport-internal_switchdev.h |  26 +++++
 net/openvswitch/vport-netdev.c             |   4 +-
 net/openvswitch/vport-switchportdev.c      | 158 +++++++++++++++++++++++++++++
 net/openvswitch/vport-switchportdev.h      |  24 +++++
 net/openvswitch/vport.c                    |   4 +
 net/openvswitch/vport.h                    |   2 +
 12 files changed, 424 insertions(+), 6 deletions(-)
 create mode 100644 net/openvswitch/vport-internal_switchdev.c
 create mode 100644 net/openvswitch/vport-internal_switchdev.h
 create mode 100644 net/openvswitch/vport-switchportdev.c
 create mode 100644 net/openvswitch/vport-switchportdev.h

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 970553c..8df1a49 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -189,6 +189,10 @@ enum ovs_vport_type {
 	OVS_VPORT_TYPE_INTERNAL, /* network device implemented by datapath */
 	OVS_VPORT_TYPE_GRE,      /* GRE tunnel. */
 	OVS_VPORT_TYPE_VXLAN,	 /* VXLAN tunnel. */
+	OVS_VPORT_TYPE_INTERNAL_SWITCHDEV, /* network device which represents
+					      a hardware switch */
+	OVS_VPORT_TYPE_SWITCHPORTDEV, /* network device which represents
+					 a port of a hardware switch */
 	__OVS_VPORT_TYPE_MAX
 };
 
diff --git a/net/openvswitch/Makefile b/net/openvswitch/Makefile
index 3591cb5..6e9fb2a 100644
--- a/net/openvswitch/Makefile
+++ b/net/openvswitch/Makefile
@@ -22,3 +22,7 @@ endif
 ifneq ($(CONFIG_OPENVSWITCH_GRE),)
 openvswitch-y += vport-gre.o
 endif
+
+ifneq ($(CONFIG_NET_SWITCHDEV),)
+openvswitch-y += vport-internal_switchdev.o vport-switchportdev.o
+endif
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 7906fe0..d8d5e1f 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -58,7 +58,9 @@
 #include "flow_table.h"
 #include "flow_netlink.h"
 #include "vport-internal_dev.h"
+#include "vport-internal_switchdev.h"
 #include "vport-netdev.h"
+#include "vport-switchportdev.h"
 
 int ovs_net_id __read_mostly;
 
@@ -124,6 +126,9 @@ static struct datapath *get_dp(struct net *net, int dp_ifindex)
 	dev = dev_get_by_index_rcu(net, dp_ifindex);
 	if (dev) {
 		struct vport *vport = ovs_internal_dev_get_vport(dev);
+
+		if (!vport)
+			vport = ovs_internal_swdev_get_vport(dev);
 		if (vport)
 			dp = vport->dp;
 	}
@@ -768,6 +773,19 @@ static struct sk_buff *ovs_flow_cmd_build_info(struct ovs_flow *flow,
 	return skb;
 }
 
+static int ovs_dp_flow_insert(struct datapath *dp, struct sw_flow *flow)
+{
+	if (dp->ops && dp->ops->flow_insert)
+		return dp->ops->flow_insert(dp, flow);
+	return 0;
+}
+
+static void ovs_dp_flow_remove(struct datapath *dp, struct sw_flow *flow)
+{
+	if (dp->ops && dp->ops->flow_remove)
+		dp->ops->flow_remove(dp, flow);
+}
+
 static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
 {
 	struct nlattr **a = info->attrs;
@@ -836,13 +854,15 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
 		flow->flow.key = masked_key;
 		flow->flow.unmasked_key = key;
 		rcu_assign_pointer(flow->sf_acts, acts);
+		acts = NULL;
 
 		/* Put flow in bucket. */
 		error = ovs_flow_tbl_insert(&dp->table, flow, &mask);
-		if (error) {
-			acts = NULL;
+		if (error)
 			goto err_flow_free;
-		}
+		error = ovs_dp_flow_insert(dp, &flow->flow);
+		if (error)
+			goto err_flow_tbl_remove;
 
 		reply = ovs_flow_cmd_build_info(flow, dp, info, OVS_FLOW_CMD_NEW);
 	} else {
@@ -884,6 +904,8 @@ static int ovs_flow_cmd_new_or_set(struct sk_buff *skb, struct genl_info *info)
 			     0, PTR_ERR(reply));
 	return 0;
 
+err_flow_tbl_remove:
+	ovs_flow_tbl_remove(&dp->table, flow);
 err_flow_free:
 	ovs_flow_free(flow, false);
 err_unlock_ovs:
@@ -981,6 +1003,7 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 		goto unlock;
 	}
 
+	ovs_dp_flow_remove(dp, &flow->flow);
 	ovs_flow_tbl_remove(&dp->table, flow);
 
 	err = ovs_flow_cmd_fill_info(flow, dp, reply, info->snd_portid,
@@ -1234,7 +1257,10 @@ static int ovs_dp_cmd_new(struct sk_buff *skb, struct genl_info *info)
 
 	/* Set up our datapath device. */
 	parms.name = nla_data(a[OVS_DP_ATTR_NAME]);
-	parms.type = OVS_VPORT_TYPE_INTERNAL;
+	if (ovs_is_suitable_for_internal_swdev(sock_net(skb->sk), parms.name))
+		parms.type = OVS_VPORT_TYPE_INTERNAL_SWITCHDEV;
+	else
+		parms.type = OVS_VPORT_TYPE_INTERNAL;
 	parms.options = NULL;
 	parms.dp = dp;
 	parms.port_no = OVSP_LOCAL;
@@ -1572,6 +1598,7 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 	struct sk_buff *reply;
 	struct vport *vport;
 	struct datapath *dp;
+	struct vport *local_vport;
 	u32 port_no;
 	int err;
 
@@ -1611,6 +1638,16 @@ static int ovs_vport_cmd_new(struct sk_buff *skb, struct genl_info *info)
 
 	parms.name = nla_data(a[OVS_VPORT_ATTR_NAME]);
 	parms.type = nla_get_u32(a[OVS_VPORT_ATTR_TYPE]);
+
+	if (parms.type == OVS_VPORT_TYPE_NETDEV &&
+	    ovs_is_suitable_for_switchportdev(sock_net(skb->sk), parms.name))
+		parms.type = OVS_VPORT_TYPE_SWITCHPORTDEV;
+
+	local_vport = ovs_vport_ovsl(dp, OVSP_LOCAL);
+	if (local_vport->ops->type == OVS_VPORT_TYPE_INTERNAL_SWITCHDEV &&
+	    parms.type != OVS_VPORT_TYPE_SWITCHPORTDEV)
+		return -EOPNOTSUPP;
+
 	parms.options = a[OVS_VPORT_ATTR_OPTIONS];
 	parms.dp = dp;
 	parms.port_no = port_no;
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 5388cac..584999b 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -58,6 +58,8 @@ struct dp_stats_percpu {
 	struct u64_stats_sync syncp;
 };
 
+struct dp_ops;
+
 /**
  * struct datapath - datapath for flow-based packet switching
  * @rcu: RCU callback head for deferred destruction.
@@ -90,6 +92,12 @@ struct datapath {
 #endif
 
 	u32 user_features;
+	const struct dp_ops *ops;
+};
+
+struct dp_ops {
+	int (*flow_insert)(struct datapath *dp, struct sw_flow *flow);
+	void (*flow_remove)(struct datapath *dp, struct sw_flow *flow);
 };
 
 /**
diff --git a/net/openvswitch/dp_notify.c b/net/openvswitch/dp_notify.c
index 2c631fe..7f9b6ae 100644
--- a/net/openvswitch/dp_notify.c
+++ b/net/openvswitch/dp_notify.c
@@ -22,6 +22,7 @@
 
 #include "datapath.h"
 #include "vport-internal_dev.h"
+#include "vport-internal_switchdev.h"
 #include "vport-netdev.h"
 
 static void dp_detach_port_notify(struct vport *vport)
@@ -79,7 +80,7 @@ static int dp_device_event(struct notifier_block *unused, unsigned long event,
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct vport *vport = NULL;
 
-	if (!ovs_is_internal_dev(dev))
+	if (!ovs_is_internal_dev(dev) && !ovs_is_internal_swdev(dev))
 		vport = ovs_netdev_get_vport(dev);
 
 	if (!vport)
diff --git a/net/openvswitch/vport-internal_switchdev.c b/net/openvswitch/vport-internal_switchdev.c
new file mode 100644
index 0000000..5d40123
--- /dev/null
+++ b/net/openvswitch/vport-internal_switchdev.c
@@ -0,0 +1,148 @@
+/*
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/rcupdate.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/switchdev.h>
+
+#include <net/net_namespace.h>
+
+#include "datapath.h"
+#include "vport-netdev.h"
+#include "vport-internal_switchdev.h"
+
+static const struct swdev_linked_ops internal_swdev_linked_ops = {
+};
+
+static int internal_swdev_flow_insert(struct datapath *dp, struct sw_flow *flow)
+{
+	struct vport *vport;
+	struct netdev_vport *netdev_vport;
+
+	vport = ovs_vport_ovsl(dp, OVSP_LOCAL);
+	netdev_vport = netdev_vport_priv(vport);
+	return swdev_flow_insert(netdev_vport->dev, flow);
+}
+
+static void internal_swdev_flow_remove(struct datapath *dp, struct sw_flow *flow)
+{
+	struct vport *vport;
+	struct netdev_vport *netdev_vport;
+
+	vport = ovs_vport_ovsl(dp, OVSP_LOCAL);
+	netdev_vport = netdev_vport_priv(vport);
+	swdev_flow_remove(netdev_vport->dev, flow);
+}
+
+static const struct dp_ops internal_swdev_dp_ops = {
+	.flow_insert = internal_swdev_flow_insert,
+	.flow_remove = internal_swdev_flow_remove,
+};
+
+static struct vport *internal_swdev_create(const struct vport_parms *parms)
+{
+	struct vport *vport;
+	struct netdev_vport *netdev_vport;
+	struct net_device *dev;
+	int err;
+
+	vport = ovs_vport_alloc(sizeof(struct netdev_vport),
+				&ovs_internal_swdev_vport_ops, parms);
+	if (IS_ERR(vport))
+		return vport;
+
+	netdev_vport = netdev_vport_priv(vport);
+
+	rtnl_lock();
+	dev = __dev_get_by_name(ovs_dp_get_net(vport->dp), parms->name);
+	if (!dev) {
+		err = -ENODEV;
+		goto error_free_vport;
+	}
+	if (!swdev_dev_check(dev)) {
+		err = -EINVAL;
+		goto error_free_vport;
+	}
+	if (swdev_is_linked(dev)) {
+		err = -EBUSY;
+		goto error_free_vport;
+	}
+	swdev_link(dev, &internal_swdev_linked_ops, vport);
+	netdev_vport->dev = dev;
+	vport->dp->ops = &internal_swdev_dp_ops;
+	rtnl_unlock();
+
+	return vport;
+
+error_free_vport:
+	rtnl_unlock();
+	ovs_vport_free(vport);
+	return ERR_PTR(err);
+}
+
+static void internal_swdev_destroy(struct vport *vport)
+{
+	struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
+
+	rtnl_lock();
+	swdev_unlink(netdev_vport->dev);
+	rtnl_unlock();
+}
+
+static int internal_swdev_send(struct vport *vport, struct sk_buff *skb)
+{
+	int len;
+
+	len = skb->len;
+	consume_skb(skb);
+	return len;
+}
+
+const struct vport_ops ovs_internal_swdev_vport_ops = {
+	.type		= OVS_VPORT_TYPE_INTERNAL_SWITCHDEV,
+	.create		= internal_swdev_create,
+	.destroy	= internal_swdev_destroy,
+	.get_name	= ovs_netdev_get_name,
+	.send		= internal_swdev_send,
+};
+
+bool ovs_is_internal_swdev(const struct net_device *dev)
+{
+	return swdev_dev_check(dev) && swdev_is_linked(dev);
+}
+
+bool ovs_is_suitable_for_internal_swdev(struct net *net, const char *name)
+{
+	struct net_device *dev;
+	bool ret;
+
+	rcu_read_lock();
+	dev = dev_get_by_name_rcu(net, name);
+	ret = dev ? swdev_dev_check(dev) : false;
+	rcu_read_unlock();
+	return ret;
+}
+
+struct vport *ovs_internal_swdev_get_vport(struct net_device *dev)
+{
+	if (!ovs_is_internal_swdev(dev))
+		return NULL;
+	return swdev_linked_priv(dev);
+}
diff --git a/net/openvswitch/vport-internal_switchdev.h b/net/openvswitch/vport-internal_switchdev.h
new file mode 100644
index 0000000..063ab0c
--- /dev/null
+++ b/net/openvswitch/vport-internal_switchdev.h
@@ -0,0 +1,26 @@
+/*
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#ifndef VPORT_INTERNAL_SWITCHDEV_H
+#define VPORT_INTERNAL_SWITCHDEV_H 1
+
+bool ovs_is_internal_swdev(const struct net_device *dev);
+bool ovs_is_suitable_for_internal_swdev(struct net *net, const char *name);
+struct vport *ovs_internal_swdev_get_vport(struct net_device *dev);
+
+#endif /* vport-internal_switchdev.h */
diff --git a/net/openvswitch/vport-netdev.c b/net/openvswitch/vport-netdev.c
index d21f77d..3121b59 100644
--- a/net/openvswitch/vport-netdev.c
+++ b/net/openvswitch/vport-netdev.c
@@ -31,6 +31,7 @@
 
 #include "datapath.h"
 #include "vport-internal_dev.h"
+#include "vport-internal_switchdev.h"
 #include "vport-netdev.h"
 
 /* Must be called with rcu_read_lock. */
@@ -107,7 +108,8 @@ static struct vport *netdev_create(const struct vport_parms *parms)
 
 	if (netdev_vport->dev->flags & IFF_LOOPBACK ||
 	    netdev_vport->dev->type != ARPHRD_ETHER ||
-	    ovs_is_internal_dev(netdev_vport->dev)) {
+	    ovs_is_internal_dev(netdev_vport->dev) ||
+	    ovs_is_internal_swdev(netdev_vport->dev)) {
 		err = -EINVAL;
 		goto error_put;
 	}
diff --git a/net/openvswitch/vport-switchportdev.c b/net/openvswitch/vport-switchportdev.c
new file mode 100644
index 0000000..bcb77bf
--- /dev/null
+++ b/net/openvswitch/vport-switchportdev.c
@@ -0,0 +1,158 @@
+/*
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#include <linux/kernel.h>
+#include <linux/rcupdate.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/if_vlan.h>
+#include <linux/switchdev.h>
+
+#include <net/net_namespace.h>
+
+#include "datapath.h"
+#include "vport-netdev.h"
+
+static void vport_swportdev_skb_upcall(struct net_device *dev,
+				       struct sk_buff *skb,
+				       struct sw_flow_key *key,
+				       void *linked_priv)
+{
+	struct vport *vport = linked_priv;
+	struct dp_upcall_info upcall;
+
+	upcall.cmd = OVS_PACKET_CMD_MISS;
+	upcall.key = key;
+	upcall.userdata = NULL;
+	upcall.portid = vport->upcall_portid;
+	ovs_dp_upcall(vport->dp, skb, &upcall);
+	consume_skb(skb);
+}
+
+static const struct swportdev_linked_ops vport_swportdev_linked_ops = {
+	.skb_upcall = vport_swportdev_skb_upcall,
+};
+
+static struct vport *vport_swportdev_create(const struct vport_parms *parms)
+{
+	struct vport *vport;
+	struct vport *local_vport;
+	struct netdev_vport *netdev_vport;
+	struct net_device *dev;
+	int err;
+
+	local_vport = ovs_vport_ovsl(parms->dp, OVSP_LOCAL);
+	if (local_vport->ops->type != OVS_VPORT_TYPE_INTERNAL_SWITCHDEV)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	vport = ovs_vport_alloc(sizeof(struct netdev_vport),
+				&ovs_swportdev_vport_ops, parms);
+	if (IS_ERR(vport))
+		return vport;
+
+	netdev_vport = netdev_vport_priv(vport);
+
+	rtnl_lock();
+	dev = __dev_get_by_name(ovs_dp_get_net(vport->dp), parms->name);
+	if (!dev) {
+		err = -ENODEV;
+		goto error_free_vport;
+	}
+	if (!swportdev_dev_check(dev)) {
+		err = -EINVAL;
+		goto error_free_vport;
+	}
+	if (swportdev_is_linked(dev)) {
+		err = -EBUSY;
+		goto error_free_vport;
+	}
+	swportdev_link(dev, &vport_swportdev_linked_ops, vport);
+	netdev_vport->dev = dev;
+	rtnl_unlock();
+
+	return vport;
+
+error_free_vport:
+	rtnl_unlock();
+	ovs_vport_free(vport);
+	return ERR_PTR(err);
+}
+
+static void vport_swportdev_destroy(struct vport *vport)
+{
+	struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
+
+	rtnl_lock();
+	swportdev_unlink(netdev_vport->dev);
+	rtnl_unlock();
+}
+
+static unsigned int packet_length(const struct sk_buff *skb)
+{
+	unsigned int length = skb->len - ETH_HLEN;
+
+	if (skb->protocol == htons(ETH_P_8021Q))
+		length -= VLAN_HLEN;
+
+	return length;
+}
+
+static int vport_swportdev_send(struct vport *vport, struct sk_buff *skb)
+{
+	struct netdev_vport *netdev_vport = netdev_vport_priv(vport);
+	int mtu = netdev_vport->dev->mtu;
+	int len;
+
+	if (unlikely(packet_length(skb) > mtu && !skb_is_gso(skb))) {
+		net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
+				     netdev_vport->dev->name,
+				     packet_length(skb), mtu);
+		goto drop;
+	}
+
+	skb->dev = netdev_vport->dev;
+	len = skb->len;
+	dev_queue_xmit(skb);
+
+	return len;
+
+drop:
+	kfree_skb(skb);
+	return 0;
+}
+
+const struct vport_ops ovs_swportdev_vport_ops = {
+	.type		= OVS_VPORT_TYPE_SWITCHPORTDEV,
+	.create		= vport_swportdev_create,
+	.destroy	= vport_swportdev_destroy,
+	.get_name	= ovs_netdev_get_name,
+	.send		= vport_swportdev_send,
+};
+
+bool ovs_is_suitable_for_switchportdev(struct net *net, const char *name)
+{
+	struct net_device *dev;
+	bool ret;
+
+	rcu_read_lock();
+	dev = dev_get_by_name_rcu(net, name);
+	ret = dev ? swportdev_dev_check(dev) : false;
+	rcu_read_unlock();
+	return ret;
+}
+
diff --git a/net/openvswitch/vport-switchportdev.h b/net/openvswitch/vport-switchportdev.h
new file mode 100644
index 0000000..b578794
--- /dev/null
+++ b/net/openvswitch/vport-switchportdev.h
@@ -0,0 +1,24 @@
+/*
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#ifndef VPORT_SWITCHPORTDEV_H
+#define VPORT_SWITCHPORTDEV_H 1
+
+bool ovs_is_suitable_for_switchportdev(struct net *net, const char *name);
+
+#endif /* vport-switchportdev.h */
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 81b083c..eb8932e 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -48,6 +48,10 @@ static const struct vport_ops *vport_ops_list[] = {
 #ifdef CONFIG_OPENVSWITCH_VXLAN
 	&ovs_vxlan_vport_ops,
 #endif
+#if defined(CONFIG_NET_SWITCHDEV) || defined(CONFIG_NET_SWITCHDEV_MODULE)
+	&ovs_internal_swdev_vport_ops,
+	&ovs_swportdev_vport_ops,
+#endif
 };
 
 /* Protected by RCU read lock for reading, ovs_mutex for writing. */
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 0979304..100277f 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -199,6 +199,8 @@ extern const struct vport_ops ovs_netdev_vport_ops;
 extern const struct vport_ops ovs_internal_vport_ops;
 extern const struct vport_ops ovs_gre_vport_ops;
 extern const struct vport_ops ovs_vxlan_vport_ops;
+extern const struct vport_ops ovs_internal_swdev_vport_ops;
+extern const struct vport_ops ovs_swportdev_vport_ops;
 
 static inline void ovs_skb_postpush_rcsum(struct sk_buff *skb,
 				      const void *start, unsigned int len)
-- 
1.8.5.3


* [patch net-next RFC 4/4] net: introduce dummy switch
  2014-03-19 15:33 [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jiri Pirko
                   ` (2 preceding siblings ...)
  2014-03-19 15:33 ` [patch net-next RFC 3/4] openvswitch: Introduce support for switchdev based datapath Jiri Pirko
@ 2014-03-19 15:33 ` Jiri Pirko
  2014-03-20 11:49 ` [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jamal Hadi Salim
  4 siblings, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-19 15:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet

Dummy switch implementation using switchdev API

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
---
 drivers/net/Kconfig       |   7 +++
 drivers/net/Makefile      |   1 +
 drivers/net/dummyswitch.c | 142 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 150 insertions(+)
 create mode 100644 drivers/net/dummyswitch.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 89402c3..a9629a7 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -71,6 +71,13 @@ config DUMMY
 	  To compile this driver as a module, choose M here: the module
 	  will be called dummy.
 
+config NET_DUMMY_SWITCH
+	tristate "Dummy switch net driver support"
+	depends on NET_SWITCHDEV
+	---help---
+	  To compile this driver as a module, choose M here: the module
+	  will be called dummyswitch.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 3fef8a8..d5d4ce6 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -7,6 +7,7 @@
 #
 obj-$(CONFIG_BONDING) += bonding/
 obj-$(CONFIG_DUMMY) += dummy.o
+obj-$(CONFIG_NET_DUMMY_SWITCH) += dummyswitch.o
 obj-$(CONFIG_EQUALIZER) += eql.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
diff --git a/drivers/net/dummyswitch.c b/drivers/net/dummyswitch.c
new file mode 100644
index 0000000..8c315cb6
--- /dev/null
+++ b/drivers/net/dummyswitch.c
@@ -0,0 +1,142 @@
+/*
+ * drivers/net/dummyswitch.c - Dummy switch device
+ * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/moduleparam.h>
+#include <linux/switchdev.h>
+
+#include <net/rtnetlink.h>
+
+static int numswitches = 1;
+static int numports = 8;
+
+module_param(numswitches, int, 0);
+MODULE_PARM_DESC(numswitches, "Number of dummy switch pseudo devices");
+module_param(numports, int, 0);
+MODULE_PARM_DESC(numports, "Number of ports per dummy switch pseudo device");
+
+static const struct swdev_ops dummysw_swdev_ops = {
+	.kind = "dummyswitch",
+};
+
+static netdev_tx_t dummyswport_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	dev_kfree_skb(skb);
+	return NETDEV_TX_OK;
+}
+
+static const struct swportdev_ops dummysw_swportdev_ops = {
+	.kind = "dummyswitchport",
+	.skb_xmit = dummyswport_xmit,
+};
+
+static struct net_device **dummyswdevs;
+
+static void dummysw_exit_ports(struct net_device *dev)
+{
+	struct net_device *tmp, *next;
+
+	for_each_netdev_safe(dev_net(dev), tmp, next) {
+		if (netdev_master_upper_dev_get(tmp) == dev)
+			swportdev_destroy(tmp);
+	}
+}
+
+static int dummysw_init_ports(struct net_device *dev)
+{
+	struct net_device *tmp;
+	int err;
+	int i;
+
+	for (i = 0; i < numports; i++) {
+		tmp = swportdev_create(dev, &dummysw_swportdev_ops);
+		if (IS_ERR(tmp)) {
+			err = PTR_ERR(tmp);
+			goto exit_ports;
+		}
+	}
+	return 0;
+
+exit_ports:
+	dummysw_exit_ports(dev);
+	return err;
+}
+
+static void dummysw_exit_one(struct net_device **pdev)
+{
+	dummysw_exit_ports(*pdev);
+	swdev_destroy(*pdev);
+}
+
+static int dummysw_init_one(struct net_device **pdev)
+{
+	struct net_device *dev;
+	int err;
+
+	dev = swdev_create(&dummysw_swdev_ops);
+	if (IS_ERR(dev))
+		return PTR_ERR(dev);
+	err = dummysw_init_ports(dev);
+	if (err)
+		goto err_swdev_destroy;
+	*pdev = dev;
+	return 0;
+
+err_swdev_destroy:
+	swdev_destroy(dev);
+	return err;
+}
+
+static int __init dummysw_module_init(void)
+{
+	int err;
+	int i;
+
+	dummyswdevs = kmalloc(sizeof(*dummyswdevs) * numswitches,
+			      GFP_KERNEL);
+	if (!dummyswdevs)
+		return -ENOMEM;
+
+	rtnl_lock();
+	for (i = 0; i < numswitches; i++) {
+		err = dummysw_init_one(&dummyswdevs[i]);
+		if (err)
+			goto rollback;
+	}
+	rtnl_unlock();
+	return 0;
+
+rollback:
+	for (i--; i >= 0; i--)
+		dummysw_exit_one(&dummyswdevs[i]);
+	rtnl_unlock();
+	kfree(dummyswdevs);
+	return err;
+}
+
+static void __exit dummysw_module_exit(void)
+{
+	int i;
+
+	rtnl_lock();
+	for (i = 0; i < numswitches; i++)
+		dummysw_exit_one(&dummyswdevs[i]);
+	rtnl_unlock();
+	kfree(dummyswdevs);
+}
+
+module_init(dummysw_module_init);
+module_exit(dummysw_module_exit);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
+MODULE_DESCRIPTION("Dummy switch device");
-- 
1.8.5.3


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-19 15:33 [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jiri Pirko
                   ` (3 preceding siblings ...)
  2014-03-19 15:33 ` [patch net-next RFC 4/4] net: introduce dummy switch Jiri Pirko
@ 2014-03-20 11:49 ` Jamal Hadi Salim
  2014-03-20 12:40   ` Jiri Pirko
  4 siblings, 1 reply; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-20 11:49 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet, Scott Feldman, Lennert Buytenhek

Hi Jiri,

On 03/19/14 11:33, Jiri Pirko wrote:
> This is just an early draft, RFC. I wanted to post this early to get the
> feedback as soon as possible.
>
> The basic idea is to introduce a generic infrastructure to support various
> switch chips in kernel. Also the idea is to benefit from the existing
> Open vSwitch userspace infrastructure.
>


I think the abstraction should be a netdev, and to be specific the
bridge, not openvswitch. Our current tools like ifconfig, iproute2,
bridge, etc. should continue to work.
In my experience, it is sufficient to model a switch after the Linux
bridge at the basic level if the starting point is
L2 (which is the lowest common denominator).
You then add the capabilities that different chips expose.
Not every chip can do vxlan, flows, etc., and we already know how
to abstract those out.
In my experience on top of Broadcom chips, the approach I described
works rather well.
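
To make the "L2 baseline plus per-chip capabilities" idea concrete, here
is a rough userspace sketch in plain C. Everything in it (the model_*
names, the capability bits) is made up for illustration and is not taken
from the patchset or from any kernel header. It only shows how a caller
could ask a chip what it can do and fall back to software otherwise.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical capability bits; these names are illustrative only and do
 * not come from the patchset or from any kernel header.
 */
enum model_sw_cap {
	MODEL_SW_CAP_L2    = 1u << 0,	/* basic bridging, the common baseline */
	MODEL_SW_CAP_VXLAN = 1u << 1,	/* tunnel offload */
	MODEL_SW_CAP_FLOWS = 1u << 2,	/* flow-table offload */
};

enum model_placement {
	MODEL_IN_SOFTWARE,
	MODEL_IN_HARDWARE,
};

/* Hypothetical description of one switch chip. */
struct model_switch_chip {
	const char *name;
	unsigned int caps;	/* bitwise OR of enum model_sw_cap values */
};

/* True when the chip advertises every capability bit in @wanted. */
static bool model_chip_supports(const struct model_switch_chip *chip,
				unsigned int wanted)
{
	return (chip->caps & wanted) == wanted;
}

/*
 * Decide where a feature runs: in hardware when the chip advertises it,
 * otherwise in the software path.
 */
static enum model_placement
model_place_feature(const struct model_switch_chip *chip, unsigned int feature)
{
	assert(feature != 0);
	return model_chip_supports(chip, feature) ? MODEL_IN_HARDWARE
						  : MODEL_IN_SOFTWARE;
}
```

With a chip that only sets MODEL_SW_CAP_L2, asking for
MODEL_SW_CAP_VXLAN yields MODEL_IN_SOFTWARE, so the existing tools keep
working and only the placement changes.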

Additionally, note:
We already have L2 devices that offload in the kernel
(refer to DSA, the earlier posting from the OpenWrt guys, and
the Intel devices which do VMDq, etc.). By my count we will have 5
different approaches if we add yours.

cheers,
jamal


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-20 11:49 ` [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jamal Hadi Salim
@ 2014-03-20 12:40   ` Jiri Pirko
  2014-03-20 17:21     ` Florian Fainelli
  0 siblings, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-20 12:40 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet, Scott Feldman,
	Lennert Buytenhek

Thu, Mar 20, 2014 at 12:49:07PM CET, jhs@mojatatu.com wrote:
>Hi Jiri,
>
>On 03/19/14 11:33, Jiri Pirko wrote:
>>This is just an early draft, RFC. I wanted to post this early to get the
>>feedback as soon as possible.
>>
>>The basic idea is to introduce a generic infrastructure to support various
>>switch chips in kernel. Also the idea is to benefit from the existing
>>Open vSwitch userspace infrastructure.
>>
>
>
>I think the abstraction should be a netdev, and to be specific the
>bridge, not openvswitch. Our current tools like ifconfig, iproute2,
>bridge, etc. should continue to work.

That is exactly the case. Nothing is specific to OVS. OVS is just one
method of accessing the switchdev API.

The abstraction is a netdev: one netdev per switch port and one netdev as
a master on top of them representing the switch itself.
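
As a rough userspace illustration of that layout (not kernel code), the
sketch below models one master device plus one device per port, with
ports named after the master in the same "<switch>p<N>" pattern that
swportdev_create() uses. struct model_netdev and the model_* helpers are
hypothetical stand-ins, not the real struct net_device API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MAX_PORTS	8
#define MODEL_NAMSIZ	16

/* Hypothetical stand-in for struct net_device; not the kernel type. */
struct model_netdev {
	char name[MODEL_NAMSIZ];
	struct model_netdev *master;	/* NULL for the switch device itself */
};

struct model_switch {
	struct model_netdev dev;			/* represents the chip */
	struct model_netdev ports[MODEL_MAX_PORTS];	/* one netdev per port */
	int nports;
};

static void model_switch_init(struct model_switch *sw, const char *name)
{
	snprintf(sw->dev.name, sizeof(sw->dev.name), "%s", name);
	sw->dev.master = NULL;
	sw->nports = 0;
}

/* Returns the new port, or NULL when the table is full or the name is too long. */
static struct model_netdev *model_switch_add_port(struct model_switch *sw)
{
	struct model_netdev *port;
	int n;

	if (sw->nports >= MODEL_MAX_PORTS)
		return NULL;
	port = &sw->ports[sw->nports];
	n = snprintf(port->name, sizeof(port->name), "%sp%d",
		     sw->dev.name, sw->nports);
	if (n < 0 || (size_t)n >= sizeof(port->name))
		return NULL;
	port->master = &sw->dev;	/* slave-to-master link */
	sw->nports++;
	return port;
}

/* Look up a port by name, the way a userspace tool would address it. */
static struct model_netdev *model_switch_find_port(struct model_switch *sw,
						   const char *name)
{
	int i;

	for (i = 0; i < sw->nports; i++)
		if (strcmp(sw->ports[i].name, name) == 0)
			return &sw->ports[i];
	return NULL;
}
```

Because every port is an ordinary device with its own name, a tool that
only knows about devices can still address "sw0p1" directly, and the
master pointer tells it which switch the port belongs to.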


>In my experience, it is sufficient to model a switch after the Linux
>bridge at the basic level if the starting point is
>L2 (which is the lowest common denominator).
>You then add the capabilities that different chips expose.
>Not every chip can do vxlan, flows, etc., and we already know how
>to abstract those out.
>In my experience on top of Broadcom chips, the approach I described
>works rather well.
>
>Additionally, note:
>We already have L2 devices that offload in the kernel
>(refer to DSA, the earlier posting from the OpenWrt guys, and
>the Intel devices which do VMDq, etc.). By my count we will have 5
>different approaches if we add yours.

I think the problem is that each solution serves a different purpose.
For example, DSA is for switches connected as a PHY to a MAC. That is a
completely different case from what my switchdev API is trying to handle.


>
>cheers,
>jamal


* Re: [patch net-next RFC 1/4] openvswitch: split flow structures into ovs specific and generic ones
  2014-03-19 15:33 ` [patch net-next RFC 1/4] openvswitch: split flow structures into ovs specific and generic ones Jiri Pirko
@ 2014-03-20 13:04   ` Thomas Graf
  0 siblings, 0 replies; 125+ messages in thread
From: Thomas Graf @ 2014-03-20 13:04 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet

On 03/19/14 at 04:33pm, Jiri Pirko wrote:
> After this, flow related structures can be used in other code.

LGTM. It definitely makes sense to share the flow definition
between OVS and a possible HW switch API.

> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> +	/* Ethernet+IPv4 specific members. */
> +	unsigned char       ar_sha[ETH_ALEN];	/* sender hardware address  */
> +	unsigned char       ar_sip[4];		/* sender IP address        */
> +	unsigned char       ar_tha[ETH_ALEN];	/* target hardware address  */
> +	unsigned char       ar_tip[4];		/* target IP address        */
> +} __packed;
>  
> -struct sw_flow_mask {
> +struct ovs_flow_mask {
>  	int ref_count;
>  	struct rcu_head rcu;

Perhaps move rcu_head to the end if you are touching it anyway.

>  	struct list_head list;
> -	struct sw_flow_key_range range;
> -	struct sw_flow_key key;
> +	struct sw_flow_mask mask;
>  };
>  
> -struct sw_flow_match {
> +struct ovs_flow_match {
>  	struct sw_flow_key *key;
>  	struct sw_flow_key_range range;
>  	struct sw_flow_mask *mask;
> @@ -163,36 +106,20 @@ struct sw_flow_stats {
>  	};
>  };
>  
> -struct sw_flow {
> +struct ovs_flow {
>  	struct rcu_head rcu;

Same here.

>  	struct hlist_node hash_node[2];
>  	u32 hash;
>  
> -	struct sw_flow_key key;
> -	struct sw_flow_key unmasked_key;
> -	struct sw_flow_mask *mask;
> +	struct sw_flow flow;
>  	struct sw_flow_actions __rcu *sf_acts;
>  	struct sw_flow_stats stats;
>  };


* Re: [patch net-next RFC 2/4] net: introduce switchdev API
  2014-03-19 15:33 ` [patch net-next RFC 2/4] net: introduce switchdev API Jiri Pirko
@ 2014-03-20 13:59   ` Thomas Graf
  2014-03-20 14:18     ` Jiri Pirko
  2014-03-20 14:43   ` Nikolay Aleksandrov
  1 sibling, 1 reply; 125+ messages in thread
From: Thomas Graf @ 2014-03-20 13:59 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, davem, nhorman, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet

On 03/19/14 at 04:33pm, Jiri Pirko wrote:
> +struct swdev_linked_ops {
> +};

I've been trying to think of better names for this to make
it absolutely clear which is which (linked ops vs. ops).

What do you think about the following?

    sw_api       -> sw_api_ops / sw_api_port_ops 
       | 
   sw_device
       |
   sw_driver     -> sw_driver_ops / sw_driver_port_ops

> +bool swdev_dev_check(const struct net_device *dev);
> +void swdev_link(struct net_device *dev,
> +		const struct swdev_linked_ops *linked_ops,
> +		void *linked_priv);
> +void swdev_unlink(struct net_device *dev);
> +void *swdev_linked_priv(const struct net_device *dev);
> +bool swdev_is_linked(const struct net_device *dev);
> +int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow);
> +int swdev_flow_remove(struct net_device *dev, struct sw_flow *flow);
> +int swdev_packet_upcall(struct net_device *dev, struct sk_buff *skb);
> +
> +struct swdev_ops {
> +	const char *kind;
> +	int (*flow_insert)(struct net_device *dev, struct sw_flow *flow);
> +	int (*flow_remove)(struct net_device *dev, struct sw_flow *flow);

I think this API should be made more extensible. Flags might be
needed at some point, or even switch-specific configuration
blobs. How about adding a struct sw_flow_opts early on to avoid
cluttering the function parameter list later on?
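
Something along these lines, as a userspace sketch only. Apart from the
name sw_flow_opts, everything here is an assumption: the fields, the flag
bit and the model_* types are hypothetical and exist just to show the
shape of the call.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sw_flow. */
struct model_flow {
	unsigned int id;
	int offloaded;
};

/* Hypothetical flag; no such flag exists in the posted patches. */
#define MODEL_FLOW_F_NO_STATS	0x1u
#define MODEL_FLOW_F_KNOWN	(MODEL_FLOW_F_NO_STATS)

/* Only the struct name comes from this thread; the fields are guesses. */
struct sw_flow_opts {
	unsigned int flags;	/* MODEL_FLOW_F_* bits */
	const void *blob;	/* optional switch-specific configuration */
	size_t blob_len;
};

struct model_swdev_ops {
	int (*flow_insert)(struct model_flow *flow,
			   const struct sw_flow_opts *opts);
};

/* Example driver callback: NULL opts means "defaults". */
static int model_driver_flow_insert(struct model_flow *flow,
				    const struct sw_flow_opts *opts)
{
	/* Reject flags this driver does not understand. */
	if (opts && (opts->flags & ~MODEL_FLOW_F_KNOWN))
		return -EINVAL;
	flow->offloaded = 1;
	return 0;
}

/* Core-side wrapper, in the spirit of swdev_flow_insert(). */
static int model_flow_insert(const struct model_swdev_ops *ops,
			     struct model_flow *flow,
			     const struct sw_flow_opts *opts)
{
	assert(ops != NULL && flow != NULL);
	if (!ops->flow_insert)
		return -EOPNOTSUPP;
	return ops->flow_insert(flow, opts);
}
```

New options then become new fields or flag bits rather than new
parameters, and a driver that rejects unknown flags lets the caller
detect what is not supported.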

> +int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow)
> +{
> +	struct swdev *sw = netdev_priv(dev);
> +
> +	BUG_ON(!swdev_dev_check(dev));

How about taking the swdev struct instead to make it clear that
all these swdev_ functions are only supposed to be used with
swdev instances? We can translate the swdev pointer to a net_device.
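
Roughly what I have in mind, as a userspace sketch. The model_* types
and the back-pointer layout are assumptions for illustration; the posted
code keeps struct swdev in netdev_priv(), so the real translation would
look different.

```c
#include <assert.h>
#include <stddef.h>

struct model_swdev;

/* Hypothetical stand-in for struct net_device. */
struct model_netdev {
	const char *name;
	struct model_swdev *sw;	/* non-NULL only for switch devices */
};

/* Hypothetical stand-in for struct swdev. */
struct model_swdev {
	struct model_netdev *dev;	/* back-pointer to the netdev */
	unsigned int nflows;
};

static void model_swdev_bind(struct model_swdev *sw, struct model_netdev *dev)
{
	sw->dev = dev;
	sw->nflows = 0;
	dev->sw = sw;
}

/*
 * Checked conversion, done once at the boundary. Returns NULL for a
 * device that is not a switch, instead of a runtime BUG_ON() in every
 * API function.
 */
static struct model_swdev *model_netdev_to_swdev(struct model_netdev *dev)
{
	return dev->sw;
}

/* The reverse translation is always valid. */
static struct model_netdev *model_swdev_to_netdev(struct model_swdev *sw)
{
	return sw->dev;
}

/* API functions take the typed handle, so misuse fails at compile time. */
static int model_swdev_flow_insert(struct model_swdev *sw)
{
	assert(sw != NULL);
	sw->nflows++;
	return 0;
}
```

Passing a plain model_netdev pointer to model_swdev_flow_insert() is
then a type error, which is the point of the suggestion.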

> +bool swportdev_dev_check(const struct net_device *port_dev)
> +{
> +	return port_dev->netdev_ops == &swportdev_netdev_ops;
> +}
> +EXPORT_SYMBOL(swportdev_dev_check);

Same as above

> +struct net_device *swportdev_create(struct net_device *dev,
> +				    const struct swportdev_ops *ops)
> +{
> +	struct net_device *port_dev;
> +	char name[IFNAMSIZ];
> +	struct swportdev *swp;
> +	int err;


Needs a check that dev is of the same family as the provided
port ops.
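
One possible shape for such a check, sketched in userspace C. The
"parent" member is purely hypothetical (the posted swportdev_ops has no
such field), and the model_* types are stand-ins; this only shows the
comparison, not how the real structures should be laid out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct swdev_ops. */
struct model_swdev_ops {
	const char *kind;
};

/*
 * Hypothetical stand-in for struct swportdev_ops. The parent pointer is
 * an assumption made for this sketch.
 */
struct model_swportdev_ops {
	const char *kind;
	const struct model_swdev_ops *parent;	/* family these ports belong to */
};

struct model_swdev {
	const struct model_swdev_ops *ops;
	int nports;
};

/* Refuse to attach a port whose ops belong to a different switch family. */
static int model_swportdev_create(struct model_swdev *sw,
				  const struct model_swportdev_ops *port_ops)
{
	assert(sw != NULL);
	if (!port_ops || port_ops->parent != sw->ops)
		return -EINVAL;
	sw->nports++;
	return 0;
}
```

Comparing ops pointers mirrors how swportdev_dev_check() already
identifies a port device by its netdev_ops pointer.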

> +	err = snprintf(name, IFNAMSIZ, "%sp%%d", dev->name);
> +	if (err >= IFNAMSIZ)
> +		return ERR_PTR(-EINVAL);
> +
> +	port_dev = alloc_netdev(sizeof(struct swportdev), name, swportdev_setup);
> +	if (!port_dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	err = register_netdevice(port_dev);
> +	if (err)
> +		goto err_register_netdevice;
> +
> +	err = netdev_master_upper_dev_link(port_dev, dev);
> +	if (err) {
> +		netdev_err(dev, "Device %s failed to set upper link\n",
> +			   port_dev->name);
> +		goto err_set_upper_link;
> +	}
> +	swp = netdev_priv(port_dev);
> +	err = netdev_rx_handler_register(port_dev, swportdev_handle_frame, swp);
> +	if (err) {
> +		netdev_err(dev, "Device %s failed to register rx_handler\n",
> +			   port_dev->name);
> +		goto err_handler_register;
> +	}
> +
> +	swp = netdev_priv(port_dev);
> +	swp->ops = ops;
> +	netif_carrier_off(port_dev);
> +	netdev_info(port_dev, "Switch port device created (%s)\n", swp->ops->kind);
> +	return port_dev;
> +
> +err_handler_register:
> +	netdev_upper_dev_unlink(port_dev, dev);
> +err_set_upper_link:
> +	unregister_netdevice(port_dev);
> +err_register_netdevice:
> +	free_netdev(port_dev);
> +	return ERR_PTR(err);
> +}
> +EXPORT_SYMBOL(swportdev_create);


* Re: [patch net-next RFC 2/4] net: introduce switchdev API
  2014-03-20 13:59   ` Thomas Graf
@ 2014-03-20 14:18     ` Jiri Pirko
  0 siblings, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-20 14:18 UTC (permalink / raw)
  To: Thomas Graf
  Cc: netdev, davem, nhorman, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet

Thanks for the review, Thomas.

Thu, Mar 20, 2014 at 02:59:01PM CET, tgraf@suug.ch wrote:
>On 03/19/14 at 04:33pm, Jiri Pirko wrote:
>> +struct swdev_linked_ops {
>> +};
>
>I've been trying to think of better names for this to make
>it absolutely clear which is which (linked ops vs. ops).
>
>What do you think about the following?
>
>    sw_api       -> sw_api_ops / sw_api_port_ops 
>       | 
>   sw_device
>       |
>   sw_driver     -> sw_driver_ops / sw_driver_port_ops

Sure. Makes sense. Will change that.


>
>> +bool swdev_dev_check(const struct net_device *dev);
>> +void swdev_link(struct net_device *dev,
>> +		const struct swdev_linked_ops *linked_ops,
>> +		void *linked_priv);
>> +void swdev_unlink(struct net_device *dev);
>> +void *swdev_linked_priv(const struct net_device *dev);
>> +bool swdev_is_linked(const struct net_device *dev);
>> +int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow);
>> +int swdev_flow_remove(struct net_device *dev, struct sw_flow *flow);
>> +int swdev_packet_upcall(struct net_device *dev, struct sk_buff *skb);
>> +
>> +struct swdev_ops {
>> +	const char *kind;
>> +	int (*flow_insert)(struct net_device *dev, struct sw_flow *flow);
>> +	int (*flow_remove)(struct net_device *dev, struct sw_flow *flow);
>
>I think this API should be made more extensible. Flags might be
>needed at some point, or even switch-specific configuration
>blobs. How about adding a struct sw_flow_opts early on to avoid
>cluttering the function parameter list later on?

Hmm. I'm not in favor of adding things just because "they might be needed".
I think it would be better to wait for the need and change the API then.
No need to do it now IMO.


>
>> +int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow)
>> +{
>> +	struct swdev *sw = netdev_priv(dev);
>> +
>> +	BUG_ON(!swdev_dev_check(dev));
>
>How about taking the swdev struct instead to make it clear that
>all these swdev_ functions are only supposed to be used with
>swdev instances? We can translate the swdev pointer to a net_device.

Valid point. Will look into this.


>
>> +bool swportdev_dev_check(const struct net_device *port_dev)
>> +{
>> +	return port_dev->netdev_ops == &swportdev_netdev_ops;
>> +}
>> +EXPORT_SYMBOL(swportdev_dev_check);
>
>Same as above
>
>> +struct net_device *swportdev_create(struct net_device *dev,
>> +				    const struct swportdev_ops *ops)
>> +{
>> +	struct net_device *port_dev;
>> +	char name[IFNAMSIZ];
>> +	struct swportdev *swp;
>> +	int err;
>
>
>Needs a check that dev is of the same family as the provided
>port ops.

Noted, will fix that.


>
>> +	err = snprintf(name, IFNAMSIZ, "%sp%%d", dev->name);
>> +	if (err >= IFNAMSIZ)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	port_dev = alloc_netdev(sizeof(struct swportdev), name, swportdev_setup);
>> +	if (!port_dev)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	err = register_netdevice(port_dev);
>> +	if (err)
>> +		goto err_register_netdevice;
>> +
>> +	err = netdev_master_upper_dev_link(port_dev, dev);
>> +	if (err) {
>> +		netdev_err(dev, "Device %s failed to set upper link\n",
>> +			   port_dev->name);
>> +		goto err_set_upper_link;
>> +	}
>> +	swp = netdev_priv(port_dev);
>> +	err = netdev_rx_handler_register(port_dev, swportdev_handle_frame, swp);
>> +	if (err) {
>> +		netdev_err(dev, "Device %s failed to register rx_handler\n",
>> +			   port_dev->name);
>> +		goto err_handler_register;
>> +	}
>> +
>> +	swp = netdev_priv(port_dev);
>> +	swp->ops = ops;
>> +	netif_carrier_off(port_dev);
>> +	netdev_info(port_dev, "Switch port device created (%s)\n", swp->ops->kind);
>> +	return port_dev;
>> +
>> +err_handler_register:
>> +	netdev_upper_dev_unlink(port_dev, dev);
>> +err_set_upper_link:
>> +	unregister_netdevice(port_dev);
>> +err_register_netdevice:
>> +	free_netdev(port_dev);
>> +	return ERR_PTR(err);
>> +}
>> +EXPORT_SYMBOL(swportdev_create);


* Re: [patch net-next RFC 2/4] net: introduce switchdev API
  2014-03-19 15:33 ` [patch net-next RFC 2/4] net: introduce switchdev API Jiri Pirko
  2014-03-20 13:59   ` Thomas Graf
@ 2014-03-20 14:43   ` Nikolay Aleksandrov
  2014-03-20 15:42     ` Jiri Pirko
  1 sibling, 1 reply; 125+ messages in thread
From: Nikolay Aleksandrov @ 2014-03-20 14:43 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse, pshelar,
	azhou, ben, stephen, jeffrey.t.kirsher, vyasevic, xiyou.wangcong,
	john.r.fastabend, edumazet

On 03/19/2014 04:33 PM, Jiri Pirko wrote:
> switchdev API is designed to allow kernel support for various switch
> chips.
> 
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---
>  include/linux/switchdev.h |  62 +++++++++
>  net/Kconfig               |  10 ++
>  net/core/Makefile         |   1 +
>  net/core/switchdev.c      | 330 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 403 insertions(+)
>  create mode 100644 include/linux/switchdev.h
>  create mode 100644 net/core/switchdev.c
> 
> diff --git a/include/linux/switchdev.h b/include/linux/switchdev.h
> new file mode 100644
> index 0000000..ac6db2d
> --- /dev/null
> +++ b/include/linux/switchdev.h
> @@ -0,0 +1,62 @@
> +/*
> + * include/linux/switchdev.h - Switch device API
> + * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +#ifndef _LINUX_SWITCHDEV_H_
> +#define _LINUX_SWITCHDEV_H_
> +
> +#include <linux/netdevice.h>
> +#include <linux/sw_flow.h>
> +
> +struct swdev_linked_ops {
> +};
> +
> +bool swdev_dev_check(const struct net_device *dev);
> +void swdev_link(struct net_device *dev,
> +		const struct swdev_linked_ops *linked_ops,
> +		void *linked_priv);
> +void swdev_unlink(struct net_device *dev);
> +void *swdev_linked_priv(const struct net_device *dev);
> +bool swdev_is_linked(const struct net_device *dev);
> +int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow);
> +int swdev_flow_remove(struct net_device *dev, struct sw_flow *flow);
> +int swdev_packet_upcall(struct net_device *dev, struct sk_buff *skb);
> +
> +struct swdev_ops {
> +	const char *kind;
> +	int (*flow_insert)(struct net_device *dev, struct sw_flow *flow);
> +	int (*flow_remove)(struct net_device *dev, struct sw_flow *flow);
> +};
> +
> +struct net_device *swdev_create(const struct swdev_ops *ops);
> +void swdev_destroy(struct net_device *dev);
> +
> +struct swportdev_linked_ops {
> +	void (*skb_upcall)(struct net_device *dev, struct sk_buff *skb,
> +			   struct sw_flow_key *key, void *linked_priv);
> +};
> +
> +bool swportdev_dev_check(const struct net_device *dev);
> +void swportdev_link(struct net_device *dev,
> +		    const struct swportdev_linked_ops *linked_ops,
> +		    void *linked_priv);
> +void swportdev_unlink(struct net_device *dev);
> +void *swportdev_linked_priv(const struct net_device *dev);
> +bool swportdev_is_linked(const struct net_device *dev);
> +
> +struct swportdev_ops {
> +	const char *kind;
> +	netdev_tx_t (*skb_xmit)(struct sk_buff *skb,
> +				struct net_device *port_dev);
> +};
> +
> +struct net_device *swportdev_create(struct net_device *dev,
> +				    const struct swportdev_ops *ops);
> +void swportdev_destroy(struct net_device *port_dev);
> +
> +#endif /* _LINUX_SWITCHDEV_H_ */
> diff --git a/net/Kconfig b/net/Kconfig
> index e411046..e02ab8d 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -285,6 +285,16 @@ config NET_FLOW_LIMIT
>  	  with many clients some protection against DoS by a single (spoofed)
>  	  flow that greatly exceeds average workload.
>  
> +config NET_SWITCHDEV
> +	tristate "Switch device"
> +	depends on INET
> +	---help---
> +	  This module provides glue for hardware switch chips so they can be
> +	  accessed from userspace via Open vSwitch Netlink API. 
> +
> +	  To compile this code as a module, choose M here: the
> +	  module will be called pktgen.
> +
>  menu "Network testing"
>  
>  config NET_PKTGEN
> diff --git a/net/core/Makefile b/net/core/Makefile
> index 9628c20..426a619 100644
> --- a/net/core/Makefile
> +++ b/net/core/Makefile
> @@ -23,3 +23,4 @@ obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
>  obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
>  obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
>  obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
> +obj-$(CONFIG_NET_SWITCHDEV) += switchdev.o
> diff --git a/net/core/switchdev.c b/net/core/switchdev.c
> new file mode 100644
> index 0000000..3b8daaf
> --- /dev/null
> +++ b/net/core/switchdev.c
> @@ -0,0 +1,330 @@
> +/*
> + * net/core/switchdev.c - Switch device API
> + * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/netdevice.h>
> +#include <linux/ethtool.h>
> +#include <linux/switchdev.h>
> +
> +#include <net/rtnetlink.h>
> +#include <net/net_namespace.h>
> +#include <net/netns/generic.h>
> +
> +#include <generated/utsrelease.h>
> +
> +struct swdev {
> +	const struct swdev_ops *ops;
> +	const struct swdev_linked_ops *linked_ops;
> +	void *linked_priv;
> +};
> +
> +static netdev_tx_t swdev_ndo_start_xmit(struct sk_buff *skb,
> +					struct net_device *dev)
> +{
> +	dev_kfree_skb(skb);
> +	return NETDEV_TX_OK;
> +}
> +
> +static const struct net_device_ops swdev_netdev_ops = {
> +	.ndo_start_xmit = swdev_ndo_start_xmit,
> +};
> +
> +static void swdev_ethtool_get_drvinfo(struct net_device *dev,
> +				      struct ethtool_drvinfo *drvinfo)
> +{
> +	struct swdev *sw = netdev_priv(dev);
> +
> +	strlcpy(drvinfo->driver, sw->ops->kind, sizeof(drvinfo->driver));
> +	strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
> +}
> +
> +static const struct ethtool_ops swdev_ethtool_ops = {
> +	.get_drvinfo		= swdev_ethtool_get_drvinfo,
> +	.get_link		= ethtool_op_get_link,
> +};
> +
> +static void swdev_setup(struct net_device *dev)
> +{
> +	ether_setup(dev);
> +	dev->netdev_ops = &swdev_netdev_ops;
> +	dev->ethtool_ops = &swdev_ethtool_ops;
> +}
> +
> +bool swdev_dev_check(const struct net_device *dev)
> +{
> +	return dev->netdev_ops == &swdev_netdev_ops;
> +}
> +EXPORT_SYMBOL(swdev_dev_check);
> +
> +void swdev_link(struct net_device *dev,
> +		const struct swdev_linked_ops *linked_ops,
> +		void *linked_priv)
> +{
> +	struct swdev *sw = netdev_priv(dev);
> +
> +	sw->linked_ops = linked_ops;
> +	sw->linked_priv = linked_priv;
> +	netdev_info(dev, "Switch device linked\n");
> +}
> +EXPORT_SYMBOL(swdev_link);
> +
> +void swdev_unlink(struct net_device *dev)
> +{
> +	struct swdev *sw = netdev_priv(dev);
> +
> +	sw->linked_ops = NULL;
> +	sw->linked_priv = NULL;
> +	netdev_info(dev, "Switch device unlinked\n");
> +}
> +EXPORT_SYMBOL(swdev_unlink);
> +
> +void *swdev_linked_priv(const struct net_device *dev)
> +{
> +	struct swdev *sw = netdev_priv(dev);
> +
> +	return sw->linked_priv;
> +}
> +EXPORT_SYMBOL(swdev_linked_priv);
> +
> +bool swdev_is_linked(const struct net_device *dev)
> +{
> +	return swdev_linked_priv(dev);
> +}
> +EXPORT_SYMBOL(swdev_is_linked);
> +
> +int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow)
> +{
> +	struct swdev *sw = netdev_priv(dev);
> +
> +	BUG_ON(!swdev_dev_check(dev));
> +	if (!sw->ops->flow_insert)
> +		return 0;
> +	return sw->ops->flow_insert(dev, flow);
> +}
> +EXPORT_SYMBOL(swdev_flow_insert);
> +
> +int swdev_flow_remove(struct net_device *dev, struct sw_flow *flow)
> +{
> +	struct swdev *sw = netdev_priv(dev);
> +
> +	BUG_ON(!swdev_dev_check(dev));
> +	if (!sw->ops->flow_remove)
> +		return 0;
> +	return sw->ops->flow_remove(dev, flow);
> +}
> +EXPORT_SYMBOL(swdev_flow_remove);
> +
> +struct net_device *swdev_create(const struct swdev_ops *ops)
> +{
> +	struct net_device *dev;
> +	struct swdev *sw;
> +	int err;
> +
> +	dev = alloc_netdev(sizeof(struct swdev), "swdev%d", swdev_setup);
> +	if (!dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	err = register_netdevice(dev);
> +	if (err)
> +		goto err_register_netdevice;
> +	sw = netdev_priv(dev);
> +	sw->ops = ops;
> +	netif_carrier_off(dev);
> +	netdev_info(dev, "Switch device created (%s)\n", sw->ops->kind);
> +	return dev;
> +
> +err_register_netdevice:
> +	free_netdev(dev);
> +	return ERR_PTR(err);
> +}
> +EXPORT_SYMBOL(swdev_create);
> +
> +void swdev_destroy(struct net_device *dev)
> +{
> +	unregister_netdevice(dev);
> +	free_netdev(dev);
> +	netdev_info(dev, "Switch device destroyed\n");
> +}
> +EXPORT_SYMBOL(swdev_destroy);
> +
> +
> +struct swportdev {
> +	const struct swportdev_ops *ops;
> +	const struct swportdev_linked_ops *linked_ops;
> +	void *linked_priv;
> +};
> +
> +static netdev_tx_t swportdev_ndo_start_xmit(struct sk_buff *skb,
> +					    struct net_device *port_dev)
> +{
> +	struct swportdev *swp = netdev_priv(port_dev);
> +
> +	return swp->ops->skb_xmit(skb, port_dev);
> +}
> +
> +static const struct net_device_ops swportdev_netdev_ops = {
> +	.ndo_start_xmit = swportdev_ndo_start_xmit,
> +};
> +
> +static void swportdev_ethtool_get_drvinfo(struct net_device *port_dev,
> +					  struct ethtool_drvinfo *drvinfo)
> +{
> +	struct swportdev *swp = netdev_priv(port_dev);
> +
> +	strlcpy(drvinfo->driver, swp->ops->kind, sizeof(drvinfo->driver));
> +	strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
> +}
> +
> +static const struct ethtool_ops swportdev_ethtool_ops = {
> +	.get_drvinfo		= swportdev_ethtool_get_drvinfo,
> +	.get_link		= ethtool_op_get_link,
> +};
> +static void swportdev_setup(struct net_device *port_dev)
> +{
> +	ether_setup(port_dev);
> +	port_dev->netdev_ops = &swportdev_netdev_ops;
> +	port_dev->ethtool_ops = &swportdev_ethtool_ops;
> +}
> +
> +bool swportdev_dev_check(const struct net_device *port_dev)
> +{
> +	return port_dev->netdev_ops == &swportdev_netdev_ops;
> +}
> +EXPORT_SYMBOL(swportdev_dev_check);
> +
> +void swportdev_link(struct net_device *port_dev,
> +		    const struct swportdev_linked_ops *linked_ops,
> +		    void *linked_priv)
> +{
> +	struct swportdev *swp = netdev_priv(port_dev);
> +
> +	swp->linked_priv = linked_priv;
> +	netdev_info(port_dev, "Switch port device linked\n");
> +}
> +EXPORT_SYMBOL(swportdev_link);
> +
> +void swportdev_unlink(struct net_device *port_dev)
> +{
> +	struct swportdev *swp = netdev_priv(port_dev);
> +
> +	swp->linked_ops = NULL;
> +	swp->linked_priv = NULL;
> +	netdev_info(port_dev, "Switch port device unlinked\n");
> +}
> +EXPORT_SYMBOL(swportdev_unlink);
> +
> +void *swportdev_linked_priv(const struct net_device *port_dev)
> +{
> +	struct swportdev *swp = netdev_priv(port_dev);
> +
> +	return swp->linked_priv;
> +}
> +EXPORT_SYMBOL(swportdev_linked_priv);
> +
> +bool swportdev_is_linked(const struct net_device *port_dev)
> +{
> +	return swportdev_linked_priv(port_dev);
> +}
> +EXPORT_SYMBOL(swportdev_is_linked);
> +
> +void swportdev_skb_upcall(struct net_device *dev, struct sk_buff *skb,
> +			  struct sw_flow_key *key, void *linked_priv)
> +{
> +	struct swportdev *swp = netdev_priv(dev);
> +
> +	BUG_ON(!swportdev_dev_check(dev));
> +	if (!swp->linked_ops->skb_upcall)
> +		return;
> +	swp->linked_ops->skb_upcall(dev, skb, key, swp->linked_priv);
> +}
> +EXPORT_SYMBOL(swportdev_skb_upcall);
> +
> +static rx_handler_result_t swportdev_handle_frame(struct sk_buff **pskb)
> +{
> +	struct sk_buff *skb = *pskb;
> +
> +	/* We don't care what comes from port device into rx path.
> +	 * If there's something there, it is destined to ETH_P_ALL
> +	 * handlers. So just consume it.
> +	 */
> +	dev_kfree_skb(skb);
> +	return RX_HANDLER_CONSUMED;
> +}
> +
> +struct net_device *swportdev_create(struct net_device *dev,
> +				    const struct swportdev_ops *ops)
> +{
> +	struct net_device *port_dev;
> +	char name[IFNAMSIZ];
> +	struct swportdev *swp;
> +	int err;
> +
> +	err = snprintf(name, IFNAMSIZ, "%sp%%d", dev->name);
> +	if (err >= IFNAMSIZ)
> +		return ERR_PTR(-EINVAL);
> +
> +	port_dev = alloc_netdev(sizeof(struct swportdev), name, swportdev_setup);
> +	if (!port_dev)
> +		return ERR_PTR(-ENOMEM);
> +
> +	err = register_netdevice(port_dev);
> +	if (err)
> +		goto err_register_netdevice;
> +
> +	err = netdev_master_upper_dev_link(port_dev, dev);
> +	if (err) {
> +		netdev_err(dev, "Device %s failed to set upper link\n",
> +			   port_dev->name);
> +		goto err_set_upper_link;
> +	}
> +	swp = netdev_priv(port_dev);
> +	err = netdev_rx_handler_register(port_dev, swportdev_handle_frame, swp);
> +	if (err) {
> +		netdev_err(dev, "Device %s failed to register rx_handler\n",
> +			   port_dev->name);
> +		goto err_handler_register;
> +	}
> +
> +	swp = netdev_priv(port_dev);
> +	swp->ops = ops;
> +	netif_carrier_off(port_dev);
> +	netdev_info(port_dev, "Switch port device created (%s)\n", swp->ops->kind);
> +	return port_dev;
> +
> +err_handler_register:
> +	netdev_upper_dev_unlink(port_dev, dev);
> +err_set_upper_link:
> +	unregister_netdevice(port_dev);
Hi Jiri,
Sorry to cut in on the discussion, and I might be missing something, but
wouldn't this trigger the BUG_ON(dev->reg_state != NETREG_UNREGISTERED) in
free_netdev()? unregister_netdevice() leaves reg_state at
NETREG_UNREGISTERING, so unless netdev_run_todo() has run in between, the
free_netdev() call that follows will hit that BUG_ON.
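
To illustrate the state transitions involved, here is a minimal userspace
sketch. It is a model only, not kernel code: the model_* names and the enum
are made up for illustration, and only the ordering of states mirrors what
register_netdevice(), unregister_netdevice(), netdev_run_todo() and
free_netdev() do.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the NETREG_* registration states. */
enum model_reg_state {
	MODEL_UNINITIALIZED,	/* allocated, never registered */
	MODEL_REGISTERED,	/* register step succeeded */
	MODEL_UNREGISTERING,	/* unregister step called, todo work pending */
	MODEL_UNREGISTERED,	/* todo work has completed */
};

struct model_netdev {
	enum model_reg_state reg_state;
};

/* Models a successful registration. */
static void model_register(struct model_netdev *d)
{
	d->reg_state = MODEL_REGISTERED;
}

/* Models unregistration: the device is only queued for final teardown. */
static void model_unregister(struct model_netdev *d)
{
	d->reg_state = MODEL_UNREGISTERING;
}

/* Models the deferred todo processing that finishes the unregistration. */
static void model_run_todo(struct model_netdev *d)
{
	if (d->reg_state == MODEL_UNREGISTERING)
		d->reg_state = MODEL_UNREGISTERED;
}

/*
 * Returns true when freeing would be accepted: either the device was never
 * registered (the failed-register error path) or the unregistration has
 * fully completed. Any other state is where the BUG_ON would fire.
 */
static bool model_free_is_safe(const struct model_netdev *d)
{
	return d->reg_state == MODEL_UNINITIALIZED ||
	       d->reg_state == MODEL_UNREGISTERED;
}
```

In this model, freeing right after model_unregister() is rejected, while
freeing after a failed registration (state still uninitialized) is fine,
which is why only the err_set_upper_link path is affected.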

> +err_register_netdevice:
> +	free_netdev(port_dev);
> +	return ERR_PTR(err);
> +}
> +EXPORT_SYMBOL(swportdev_create);
> +
> +void swportdev_destroy(struct net_device *port_dev)
> +{
> +	struct net_device *dev;
> +
> +	dev = netdev_master_upper_dev_get(port_dev);
> +	BUG_ON(!dev);
> +	netdev_rx_handler_unregister(port_dev);
> +	netdev_upper_dev_unlink(port_dev, dev);
> +	unregister_netdevice(port_dev);
> +	free_netdev(port_dev);
Ditto: free_netdev() directly after unregister_netdevice() here as well.

> +	netdev_info(port_dev, "Switch port device destroyed\n");
> +}
> +EXPORT_SYMBOL(swportdev_destroy);
> +
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
> +MODULE_DESCRIPTION("Switch device API");
> 


* Re: [patch net-next RFC 2/4] net: introduce switchdev API
  2014-03-20 14:43   ` Nikolay Aleksandrov
@ 2014-03-20 15:42     ` Jiri Pirko
  0 siblings, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-20 15:42 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: netdev, davem, nhorman, andy, tgraf, dborkman, ogerlitz, jesse,
	pshelar, azhou, ben, stephen, jeffrey.t.kirsher, vyasevic,
	xiyou.wangcong, john.r.fastabend, edumazet

Thu, Mar 20, 2014 at 03:43:42PM CET, nikolay@redhat.com wrote:
>On 03/19/2014 04:33 PM, Jiri Pirko wrote:
>> switchdev API is designed to allow kernel support for various switch
>> chips.
>> 
>> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>> ---
>>  include/linux/switchdev.h |  62 +++++++++
>>  net/Kconfig               |  10 ++
>>  net/core/Makefile         |   1 +
>>  net/core/switchdev.c      | 330 ++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 403 insertions(+)
>>  create mode 100644 include/linux/switchdev.h
>>  create mode 100644 net/core/switchdev.c
>> 
>> diff --git a/include/linux/switchdev.h b/include/linux/switchdev.h
>> new file mode 100644
>> index 0000000..ac6db2d
>> --- /dev/null
>> +++ b/include/linux/switchdev.h
>> @@ -0,0 +1,62 @@
>> +/*
>> + * include/linux/switchdev.h - Switch device API
>> + * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + */
>> +#ifndef _LINUX_SWITCHDEV_H_
>> +#define _LINUX_SWITCHDEV_H_
>> +
>> +#include <linux/netdevice.h>
>> +#include <linux/sw_flow.h>
>> +
>> +struct swdev_linked_ops {
>> +};
>> +
>> +bool swdev_dev_check(const struct net_device *dev);
>> +void swdev_link(struct net_device *dev,
>> +		const struct swdev_linked_ops *linked_ops,
>> +		void *linked_priv);
>> +void swdev_unlink(struct net_device *dev);
>> +void *swdev_linked_priv(const struct net_device *dev);
>> +bool swdev_is_linked(const struct net_device *dev);
>> +int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow);
>> +int swdev_flow_remove(struct net_device *dev, struct sw_flow *flow);
>> +int swdev_packet_upcall(struct net_device *dev, struct sk_buff *skb);
>> +
>> +struct swdev_ops {
>> +	const char *kind;
>> +	int (*flow_insert)(struct net_device *dev, struct sw_flow *flow);
>> +	int (*flow_remove)(struct net_device *dev, struct sw_flow *flow);
>> +};
>> +
>> +struct net_device *swdev_create(const struct swdev_ops *ops);
>> +void swdev_destroy(struct net_device *dev);
>> +
>> +struct swportdev_linked_ops {
>> +	void (*skb_upcall)(struct net_device *dev, struct sk_buff *skb,
>> +			   struct sw_flow_key *key, void *linked_priv);
>> +};
>> +
>> +bool swportdev_dev_check(const struct net_device *dev);
>> +void swportdev_link(struct net_device *dev,
>> +		    const struct swportdev_linked_ops *linked_ops,
>> +		    void *linked_priv);
>> +void swportdev_unlink(struct net_device *dev);
>> +void *swportdev_linked_priv(const struct net_device *dev);
>> +bool swportdev_is_linked(const struct net_device *dev);
>> +
>> +struct swportdev_ops {
>> +	const char *kind;
>> +	netdev_tx_t (*skb_xmit)(struct sk_buff *skb,
>> +				struct net_device *port_dev);
>> +};
>> +
>> +struct net_device *swportdev_create(struct net_device *dev,
>> +				    const struct swportdev_ops *ops);
>> +void swportdev_destroy(struct net_device *port_dev);
>> +
>> +#endif /* _LINUX_SWITCHDEV_H_ */
>> diff --git a/net/Kconfig b/net/Kconfig
>> index e411046..e02ab8d 100644
>> --- a/net/Kconfig
>> +++ b/net/Kconfig
>> @@ -285,6 +285,16 @@ config NET_FLOW_LIMIT
>>  	  with many clients some protection against DoS by a single (spoofed)
>>  	  flow that greatly exceeds average workload.
>>  
>> +config NET_SWITCHDEV
>> +	tristate "Switch device"
>> +	depends on INET
>> +	---help---
>> +	  This module provides glue for hardware switch chips so they can be
>> +	  accessed from userspace via Open vSwitch Netlink API. 
>> +
>> +	  To compile this code as a module, choose M here: the
>> +	  module will be called pktgen.
>> +
>>  menu "Network testing"
>>  
>>  config NET_PKTGEN
>> diff --git a/net/core/Makefile b/net/core/Makefile
>> index 9628c20..426a619 100644
>> --- a/net/core/Makefile
>> +++ b/net/core/Makefile
>> @@ -23,3 +23,4 @@ obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
>>  obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
>>  obj-$(CONFIG_CGROUP_NET_PRIO) += netprio_cgroup.o
>>  obj-$(CONFIG_CGROUP_NET_CLASSID) += netclassid_cgroup.o
>> +obj-$(CONFIG_NET_SWITCHDEV) += switchdev.o
>> diff --git a/net/core/switchdev.c b/net/core/switchdev.c
>> new file mode 100644
>> index 0000000..3b8daaf
>> --- /dev/null
>> +++ b/net/core/switchdev.c
>> @@ -0,0 +1,330 @@
>> +/*
>> + * net/core/switchdev.c - Switch device API
>> + * Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; either version 2 of the License, or
>> + * (at your option) any later version.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/types.h>
>> +#include <linux/module.h>
>> +#include <linux/init.h>
>> +#include <linux/netdevice.h>
>> +#include <linux/ethtool.h>
>> +#include <linux/switchdev.h>
>> +
>> +#include <net/rtnetlink.h>
>> +#include <net/net_namespace.h>
>> +#include <net/netns/generic.h>
>> +
>> +#include <generated/utsrelease.h>
>> +
>> +struct swdev {
>> +	const struct swdev_ops *ops;
>> +	const struct swdev_linked_ops *linked_ops;
>> +	void *linked_priv;
>> +};
>> +
>> +static netdev_tx_t swdev_ndo_start_xmit(struct sk_buff *skb,
>> +					struct net_device *dev)
>> +{
>> +	dev_kfree_skb(skb);
>> +	return NETDEV_TX_OK;
>> +}
>> +
>> +static const struct net_device_ops swdev_netdev_ops = {
>> +	.ndo_start_xmit = swdev_ndo_start_xmit,
>> +};
>> +
>> +static void swdev_ethtool_get_drvinfo(struct net_device *dev,
>> +				      struct ethtool_drvinfo *drvinfo)
>> +{
>> +	struct swdev *sw = netdev_priv(dev);
>> +
>> +	strlcpy(drvinfo->driver, sw->ops->kind, sizeof(drvinfo->driver));
>> +	strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
>> +}
>> +
>> +static const struct ethtool_ops swdev_ethtool_ops = {
>> +	.get_drvinfo		= swdev_ethtool_get_drvinfo,
>> +	.get_link		= ethtool_op_get_link,
>> +};
>> +
>> +static void swdev_setup(struct net_device *dev)
>> +{
>> +	ether_setup(dev);
>> +	dev->netdev_ops = &swdev_netdev_ops;
>> +	dev->ethtool_ops = &swdev_ethtool_ops;
>> +}
>> +
>> +bool swdev_dev_check(const struct net_device *dev)
>> +{
>> +	return dev->netdev_ops == &swdev_netdev_ops;
>> +}
>> +EXPORT_SYMBOL(swdev_dev_check);
>> +
>> +void swdev_link(struct net_device *dev,
>> +		const struct swdev_linked_ops *linked_ops,
>> +		void *linked_priv)
>> +{
>> +	struct swdev *sw = netdev_priv(dev);
>> +
>> +	sw->linked_ops = linked_ops;
>> +	sw->linked_priv = linked_priv;
>> +	netdev_info(dev, "Switch device linked\n");
>> +}
>> +EXPORT_SYMBOL(swdev_link);
>> +
>> +void swdev_unlink(struct net_device *dev)
>> +{
>> +	struct swdev *sw = netdev_priv(dev);
>> +
>> +	sw->linked_ops = NULL;
>> +	sw->linked_priv = NULL;
>> +	netdev_info(dev, "Switch device unlinked\n");
>> +}
>> +EXPORT_SYMBOL(swdev_unlink);
>> +
>> +void *swdev_linked_priv(const struct net_device *dev)
>> +{
>> +	struct swdev *sw = netdev_priv(dev);
>> +
>> +	return sw->linked_priv;
>> +}
>> +EXPORT_SYMBOL(swdev_linked_priv);
>> +
>> +bool swdev_is_linked(const struct net_device *dev)
>> +{
>> +	return swdev_linked_priv(dev);
>> +}
>> +EXPORT_SYMBOL(swdev_is_linked);
>> +
>> +int swdev_flow_insert(struct net_device *dev, struct sw_flow *flow)
>> +{
>> +	struct swdev *sw = netdev_priv(dev);
>> +
>> +	BUG_ON(!swdev_dev_check(dev));
>> +	if (!sw->ops->flow_insert)
>> +		return 0;
>> +	return sw->ops->flow_insert(dev, flow);
>> +}
>> +EXPORT_SYMBOL(swdev_flow_insert);
>> +
>> +int swdev_flow_remove(struct net_device *dev, struct sw_flow *flow)
>> +{
>> +	struct swdev *sw = netdev_priv(dev);
>> +
>> +	BUG_ON(!swdev_dev_check(dev));
>> +	if (!sw->ops->flow_remove)
>> +		return 0;
>> +	return sw->ops->flow_remove(dev, flow);
>> +}
>> +EXPORT_SYMBOL(swdev_flow_remove);
>> +
>> +struct net_device *swdev_create(const struct swdev_ops *ops)
>> +{
>> +	struct net_device *dev;
>> +	struct swdev *sw;
>> +	int err;
>> +
>> +	dev = alloc_netdev(sizeof(struct swdev), "swdev%d", swdev_setup);
>> +	if (!dev)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	err = register_netdevice(dev);
>> +	if (err)
>> +		goto err_register_netdevice;
>> +	sw = netdev_priv(dev);
>> +	sw->ops = ops;
>> +	netif_carrier_off(dev);
>> +	netdev_info(dev, "Switch device created (%s)\n", sw->ops->kind);
>> +	return dev;
>> +
>> +err_register_netdevice:
>> +	free_netdev(dev);
>> +	return ERR_PTR(err);
>> +}
>> +EXPORT_SYMBOL(swdev_create);
>> +
>> +void swdev_destroy(struct net_device *dev)
>> +{
>> +	unregister_netdevice(dev);
>> +	free_netdev(dev);
>> +	netdev_info(dev, "Switch device destroyed\n");
>> +}
>> +EXPORT_SYMBOL(swdev_destroy);
>> +
>> +
>> +struct swportdev {
>> +	const struct swportdev_ops *ops;
>> +	const struct swportdev_linked_ops *linked_ops;
>> +	void *linked_priv;
>> +};
>> +
>> +static netdev_tx_t swportdev_ndo_start_xmit(struct sk_buff *skb,
>> +					    struct net_device *port_dev)
>> +{
>> +	struct swportdev *swp = netdev_priv(port_dev);
>> +
>> +	return swp->ops->skb_xmit(skb, port_dev);
>> +}
>> +
>> +static const struct net_device_ops swportdev_netdev_ops = {
>> +	.ndo_start_xmit = swportdev_ndo_start_xmit,
>> +};
>> +
>> +static void swportdev_ethtool_get_drvinfo(struct net_device *port_dev,
>> +					  struct ethtool_drvinfo *drvinfo)
>> +{
>> +	struct swportdev *swp = netdev_priv(port_dev);
>> +
>> +	strlcpy(drvinfo->driver, swp->ops->kind, sizeof(drvinfo->driver));
>> +	strlcpy(drvinfo->version, UTS_RELEASE, sizeof(drvinfo->version));
>> +}
>> +
>> +static const struct ethtool_ops swportdev_ethtool_ops = {
>> +	.get_drvinfo		= swportdev_ethtool_get_drvinfo,
>> +	.get_link		= ethtool_op_get_link,
>> +};
>> +static void swportdev_setup(struct net_device *port_dev)
>> +{
>> +	ether_setup(port_dev);
>> +	port_dev->netdev_ops = &swportdev_netdev_ops;
>> +	port_dev->ethtool_ops = &swportdev_ethtool_ops;
>> +}
>> +
>> +bool swportdev_dev_check(const struct net_device *port_dev)
>> +{
>> +	return port_dev->netdev_ops == &swportdev_netdev_ops;
>> +}
>> +EXPORT_SYMBOL(swportdev_dev_check);
>> +
>> +void swportdev_link(struct net_device *port_dev,
>> +		    const struct swportdev_linked_ops *linked_ops,
>> +		    void *linked_priv)
>> +{
>> +	struct swportdev *swp = netdev_priv(port_dev);
>> +
>> +	swp->linked_priv = linked_priv;
>> +	netdev_info(port_dev, "Switch port device linked\n");
>> +}
>> +EXPORT_SYMBOL(swportdev_link);
>> +
>> +void swportdev_unlink(struct net_device *port_dev)
>> +{
>> +	struct swportdev *swp = netdev_priv(port_dev);
>> +
>> +	swp->linked_ops = NULL;
>> +	swp->linked_priv = NULL;
>> +	netdev_info(port_dev, "Switch port device unlinked\n");
>> +}
>> +EXPORT_SYMBOL(swportdev_unlink);
>> +
>> +void *swportdev_linked_priv(const struct net_device *port_dev)
>> +{
>> +	struct swportdev *swp = netdev_priv(port_dev);
>> +
>> +	return swp->linked_priv;
>> +}
>> +EXPORT_SYMBOL(swportdev_linked_priv);
>> +
>> +bool swportdev_is_linked(const struct net_device *port_dev)
>> +{
>> +	return swportdev_linked_priv(port_dev);
>> +}
>> +EXPORT_SYMBOL(swportdev_is_linked);
>> +
>> +void swportdev_skb_upcall(struct net_device *dev, struct sk_buff *skb,
>> +			  struct sw_flow_key *key, void *linked_priv)
>> +{
>> +	struct swportdev *swp = netdev_priv(dev);
>> +
>> +	BUG_ON(!swportdev_dev_check(dev));
>> +	if (!swp->linked_ops->skb_upcall)
>> +		return;
>> +	swp->linked_ops->skb_upcall(dev, skb, key, swp->linked_priv);
>> +}
>> +EXPORT_SYMBOL(swportdev_skb_upcall);
>> +
>> +static rx_handler_result_t swportdev_handle_frame(struct sk_buff **pskb)
>> +{
>> +	struct sk_buff *skb = *pskb;
>> +
>> +	/* We don't care what comes from port device into rx path.
>> +	 * If there's something there, it is destined to ETH_P_ALL
>> +	 * handlers. So just consume it.
>> +	 */
>> +	dev_kfree_skb(skb);
>> +	return RX_HANDLER_CONSUMED;
>> +}
>> +
>> +struct net_device *swportdev_create(struct net_device *dev,
>> +				    const struct swportdev_ops *ops)
>> +{
>> +	struct net_device *port_dev;
>> +	char name[IFNAMSIZ];
>> +	struct swportdev *swp;
>> +	int err;
>> +
>> +	err = snprintf(name, IFNAMSIZ, "%sp%%d", dev->name);
>> +	if (err >= IFNAMSIZ)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	port_dev = alloc_netdev(sizeof(struct swportdev), name, swportdev_setup);
>> +	if (!port_dev)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	err = register_netdevice(port_dev);
>> +	if (err)
>> +		goto err_register_netdevice;
>> +
>> +	err = netdev_master_upper_dev_link(port_dev, dev);
>> +	if (err) {
>> +		netdev_err(dev, "Device %s failed to set upper link\n",
>> +			   port_dev->name);
>> +		goto err_set_upper_link;
>> +	}
>> +	swp = netdev_priv(port_dev);
>> +	err = netdev_rx_handler_register(port_dev, swportdev_handle_frame, swp);
>> +	if (err) {
>> +		netdev_err(dev, "Device %s failed to register rx_handler\n",
>> +			   port_dev->name);
>> +		goto err_handler_register;
>> +	}
>> +
>> +	swp = netdev_priv(port_dev);
>> +	swp->ops = ops;
>> +	netif_carrier_off(port_dev);
>> +	netdev_info(port_dev, "Switch port device created (%s)\n", swp->ops->kind);
>> +	return port_dev;
>> +
>> +err_handler_register:
>> +	netdev_upper_dev_unlink(port_dev, dev);
>> +err_set_upper_link:
>> +	unregister_netdevice(port_dev);
>Hi Jiri,
>Sorry to cut in on the discussion, and I might be missing something, but
>wouldn't this trigger the BUG_ON(dev->reg_state != NETREG_UNREGISTERED) in
>free_netdev()? unregister_netdevice() leaves reg_state at
>NETREG_UNREGISTERING, so unless netdev_run_todo() has run in between, the
>free_netdev() call that follows will hit that BUG_ON.


You are right. I'll fix this. Thanks!

>
>> +err_register_netdevice:
>> +	free_netdev(port_dev);
>> +	return ERR_PTR(err);
>> +}
>> +EXPORT_SYMBOL(swportdev_create);
>> +
>> +void swportdev_destroy(struct net_device *port_dev)
>> +{
>> +	struct net_device *dev;
>> +
>> +	dev = netdev_master_upper_dev_get(port_dev);
>> +	BUG_ON(!dev);
>> +	netdev_rx_handler_unregister(port_dev);
>> +	netdev_upper_dev_unlink(port_dev, dev);
>> +	unregister_netdevice(port_dev);
>> +	free_netdev(port_dev);
>Ditto.
>
>> +	netdev_info(port_dev, "Switch port device destroyed\n");
>> +}
>> +EXPORT_SYMBOL(swportdev_destroy);
>> +
>> +MODULE_LICENSE("GPL v2");
>> +MODULE_AUTHOR("Jiri Pirko <jiri@resnulli.us>");
>> +MODULE_DESCRIPTION("Switch device API");
>> 
>


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-20 12:40   ` Jiri Pirko
@ 2014-03-20 17:21     ` Florian Fainelli
  2014-03-21 12:04       ` Jamal Hadi Salim
  2014-03-22  9:40       ` Jiri Pirko
  0 siblings, 2 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-20 17:21 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jamal Hadi Salim, netdev, David Miller, Neil Horman, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

2014-03-20 5:40 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
> Thu, Mar 20, 2014 at 12:49:07PM CET, jhs@mojatatu.com wrote:
>>Hi Jiri,
>>
>>On 03/19/14 11:33, Jiri Pirko wrote:
>>>This is just an early draft, RFC. I wanted to post this early to get the
>>>feedback as soon as possible.
>>>
>>>The basic idea is to introduce a generic infractructure to support various
>>>switch chips in kernel. Also the idea is to benefit of currently existing
>>>Open vSwitch userspace infrastructure.
>>>
>>
>>
>>I think the abstraction should be a netdev and to be specific the
>>bridge - not openvswitch. Our current tools like ifconfig, iproute2,
>>bridge etc should continue to work.
>
> That is exactly the case. Nothing is specific to OVS. OVS is just a one
> method to access the switchdev api.
>
> Abstraction is netdev. One netdev per each switch port and one netdev as
> a master on the top of that representing the switch itself.
>
>
>>In my experience, it is sufficient to model a switch after the linux
>>bridge at the basic level if the starting point is
>>L2 (which is the lowest common denominator).
>>And then you add capabilities that different chips expose.
>>Not every chip can do vxlan, flows etc. And we already know how
>>to abstract those out.
>>My  experience on top of broadcom chips is the approach i described
>>works rather well.
>>
>>Additionally, note:
>>We do have L2 devices that offload in the kernel
>>(refer to DSA, posting earlier from the openwrt guys, and
>>the intel devices which do VDMQ etc). I am now counting we have 5
>>different approaches if we add yours.
>
> I think that the problem is that each solution serves different purpose.
> For example DSA is for switches connected as a PHY to a MAC. That is
> completely different case to what my switchdev API is trying to handle.

I agree with Jamal here: we should try to find a solution that fits
most users. It seems to me that there are three categories of switches:

- enterprise switches built into NICs that support VF/PF
- embedded/enterprise switches that support tagging (Marvell eDSA/DSA,
Broadcom tags)
- embedded switches that only support 802.1q VLANs

The first category is more flow-oriented than control-oriented, whereas
the last two are more "event and control" oriented: the switch is
usually configured not to flood the CPU port if possible, and when
frames do reach the CPU port, it is to perform specific configuration
(address learning, port protection, snooping, authorization...).

DSA is not designed specifically for switches which are connected to a
MAC and appear as a regular PHY. That is how it first started, but
nothing prevents you from using DSA with a switch that is, e.g.,
memory-mapped into your CPU register space; MDIO is just the transport
for the control part.

For instance, if my switches support a N-bytes tag that will give me a
reason code for receiving this frame, and a bitmap representing the
originating port, how would you imagine this fitting into the
openvswitch/switchdev model, aside from the netdev per-port? Do you
think we could easily migrate existing DSA users to
openvswitch/switchdev by handling the custom switch tag?
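
To make the question concrete, here is a minimal userspace sketch of what
handling such a tag could look like. The layout (one reason-code byte
followed by a 16-bit big-endian source-port bitmap) and every name in it
(struct sw_tag, sw_tag_parse, sw_tag_src_port) are hypothetical and do not
describe any real Marvell or Broadcom tag format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 3-byte switch tag: [reason][port bitmap hi][port bitmap lo]. */
#define SW_TAG_LEN 3

struct sw_tag {
	uint8_t  reason;      /* why the switch sent the frame to the CPU */
	uint16_t port_bitmap; /* bit n set => frame originated on port n */
};

/* Parse the tag at the start of buf; returns 0 on success, -1 if too short. */
static int sw_tag_parse(const uint8_t *buf, size_t len, struct sw_tag *tag)
{
	if (len < SW_TAG_LEN)
		return -1;
	tag->reason = buf[0];
	tag->port_bitmap = (uint16_t)((buf[1] << 8) | buf[2]);
	return 0;
}

/* Index of the lowest set bit, i.e. the originating port; -1 if none. */
static int sw_tag_src_port(const struct sw_tag *tag)
{
	for (int i = 0; i < 16; i++)
		if (tag->port_bitmap & (1u << i))
			return i;
	return -1;
}
```

The port index returned here is what a driver would presumably use to pick
the per-port netdev the frame is delivered to; what to do with the reason
code is exactly the open question above.
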
-- 
Florian


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-20 17:21     ` Florian Fainelli
@ 2014-03-21 12:04       ` Jamal Hadi Salim
  2014-03-22  9:48         ` Jiri Pirko
  2014-03-22  9:40       ` Jiri Pirko
  1 sibling, 1 reply; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-21 12:04 UTC (permalink / raw)
  To: Florian Fainelli, Jiri Pirko
  Cc: netdev, David Miller, Neil Horman, andy, tgraf, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/20/14 13:21, Florian Fainelli wrote:
> 2014-03-20 5:40 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>> Thu, Mar 20, 2014 at 12:49:07PM CET, jhs@mojatatu.com wrote:
>>> Hi Jiri,
>>>

>>>
>>> I think the abstraction should be a netdev and to be specific the
>>> bridge - not openvswitch. Our current tools like ifconfig, iproute2,
>>> bridge etc should continue to work.
>>
>> That is exactly the case. Nothing is specific to OVS. OVS is just a one
>> method to access the switchdev api.
>>
>> Abstraction is netdev. One netdev per each switch port and one netdev as
>> a master on the top of that representing the switch itself.
>>

Ok, so that is what a bridge is.

>> I think that the problem is that each solution serves different purpose.
>> For example DSA is for switches connected as a PHY to a MAC. That is
>> completely different case to what my switchdev API is trying to handle.
>
> I agree with Jamal here, we should try to find a solution that fits
> most users here,

Indeed. We have too many splinters already and each has its own way
of being addressed. [Did you know MacVLAN is now also a L2 device that
does bridging and a crap load of other things? A long way off from
what the original intent was.]

I think we are saying the same thing, but:
This means we need a consistent interface and abstraction.
My favorite abstraction in the kernel, one that I consider to be immortal,
is the netdev. I can have a netdev that is implemented as a physical
ethernet port or as a tuntap or as a tunnel etc. They mostly use the
same abstraction with small differences depending on the type; e.g.
a tuntap with uid, gid etc is mostly no different from my laptop's
realtek ethernet port. I can control any of those the same way I
control a CAN device on a vehicle with iproute2 and the same way I
control a dummy device, ifb, veth, etc.
In other words, how packet processing happens (whether the netdev is
used to toast bread) or what tables or constructs a specific kind of
netdev needs (to slice bread) is only relevant to the implementation.
From user space I don't need to have 15 different APIs to manage/control
things (ok, there is ethtool - but that is just one more interface; and
we have matured enough that if you try to use /proc or /sysfs
people will yell at you).

In my view: that (immortal) device for L2/bridging is the bridge or
maybe a more barebone version of the bridge (since it has gained a
little more weight in recent times).

>it seems to me like there are 3 switches categories:
>
> - entreprise built-in switches in NICs that support VF/PF
> - embedded/entreprise switches that support tagging (Marvell eDSA/DSA,
> Broadcom tags)
> - embedded switches that only support 802.1q VLANs
>

I had started documenting this stuff to provide some context for an
abstraction, but i had too many pre-emptions, so the document is not
complete. Both John and Vlad had provided inputs to shape it. I
could post it and take patches to it.

> The first category is more flow-oriented than control-oriented,
> whereas the last two are more "event and control" oriented where you
> usually have a system where the switch will be configured not to flood
> the CPU port if possible, but when it does, this is to perform
> specific configuration (address learning, port protection, snooping,
> authorization...).
>
>
> DSA is not designed specifically for switches which are connected to a
> MAC and appear as a regular PHY, this is how it first started, but
> nothing prevents you from using DSA with a switch that is e.g: memory
> mapped into your CPU register space, MDIO is just the transport for
> the control part.

Your view is more detail oriented than mine. My focus is more on the
control/management abstraction level. From that perspective this
is a healthy discussion - thank you.

> For instance, if my switches support a N-bytes tag that will give me a
> reason code for receiving this frame, and a bitmap representing the
> originating port, how would you imagine this fitting into the
> openvswitch/switchdev model, aside from the netdev per-port? Do you
> think we could easily migrate existing DSA users to
> openvswitch/switchdev by handling the custom switch tag?
>

I don't think so. I think we need to have this discussion to come
up with a reasonable conclusion.

cheers,
jamal


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-20 17:21     ` Florian Fainelli
  2014-03-21 12:04       ` Jamal Hadi Salim
@ 2014-03-22  9:40       ` Jiri Pirko
  1 sibling, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-22  9:40 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Jamal Hadi Salim, netdev, David Miller, Neil Horman, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

Thu, Mar 20, 2014 at 06:21:10PM CET, f.fainelli@gmail.com wrote:
>2014-03-20 5:40 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>> Thu, Mar 20, 2014 at 12:49:07PM CET, jhs@mojatatu.com wrote:
>>>Hi Jiri,
>>>
>>>On 03/19/14 11:33, Jiri Pirko wrote:
>>>>This is just an early draft, RFC. I wanted to post this early to get the
>>>>feedback as soon as possible.
>>>>
>>>>The basic idea is to introduce a generic infractructure to support various
>>>>switch chips in kernel. Also the idea is to benefit of currently existing
>>>>Open vSwitch userspace infrastructure.
>>>>
>>>
>>>
>>>I think the abstraction should be a netdev and to be specific the
>>>bridge - not openvswitch. Our current tools like ifconfig, iproute2,
>>>bridge etc should continue to work.
>>
>> That is exactly the case. Nothing is specific to OVS. OVS is just a one
>> method to access the switchdev api.
>>
>> Abstraction is netdev. One netdev per each switch port and one netdev as
>> a master on the top of that representing the switch itself.
>>
>>
>>>In my experience, it is sufficient to model a switch after the linux
>>>bridge at the basic level if the starting point is
>>>L2 (which is the lowest common denominator).
>>>And then you add capabilities that different chips expose.
>>>Not every chip can do vxlan, flows etc. And we already know how
>>>to abstract those out.
>>>My  experience on top of broadcom chips is the approach i described
>>>works rather well.
>>>
>>>Additionally, note:
>>>We do have L2 devices that offload in the kernel
>>>(refer to DSA, posting earlier from the openwrt guys, and
>>>the intel devices which do VDMQ etc). I am now counting we have 5
>>>different approaches if we add yours.
>>
>> I think that the problem is that each solution serves different purpose.
>> For example DSA is for switches connected as a PHY to a MAC. That is
>> completely different case to what my switchdev API is trying to handle.
>
>I agree with Jamal here, we should try to find a solution that fits
>most users here, it seems to me like there are 3 switches categories:
>
>- entreprise built-in switches in NICs that support VF/PF
>- embedded/entreprise switches that support tagging (Marvell eDSA/DSA,
>Broadcom tags)
>- embedded switches that only support 802.1q VLANs

One case which you maybe forgot:

        switch chip
   ------------------------
    |  |  |  |  |  |   |               CPU
   p1 p2 ...pn px py  MNGMNT       -----------
                |  |   |              pcie
                |  |   |         ---------------
                |  |   |          |  NIC0 NIC1
                |  |   ---pcie-----   |   |
                |  ------someMII-------   |
                ---------someMII-----------

	NIC0 and NIC1 are ordinary NICs, like 8139too for example, with no
	notion that they are connected to a switch. They are completely
	independent of the mngmnt iface.

>
>The first category is more flow-oriented than control-oriented,
>whereas the last two are more "event and control" oriented where you
>usually have a system where the switch will be configured not to flood
>the CPU port if possible, but when it does, this is to perform
>specific configuration (address learning, port protection, snooping,
>authorization...).
>
>DSA is not designed specifically for switches which are connected to a
>MAC and appear as a regular PHY, this is how it first started, but
>nothing prevents you from using DSA with a switch that is e.g: memory
>mapped into your CPU register space, MDIO is just the transport for
>the control part.


I see that DSA now is *very* MII-oriented. I'm not sure how hard it would
be to rewrite it to be more generic and if it would make sense at all.


>
>For instance, if my switches support a N-bytes tag that will give me a
>reason code for receiving this frame, and a bitmap representing the
>originating port, how would you imagine this fitting into the
>openvswitch/switchdev model, aside from the netdev per-port? Do you
>think we could easily migrate existing DSA users to
>openvswitch/switchdev by handling the custom switch tag?

I do not think so either.


>-- 
>Florian


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-21 12:04       ` Jamal Hadi Salim
@ 2014-03-22  9:48         ` Jiri Pirko
  2014-03-24 23:07           ` Jamal Hadi Salim
  0 siblings, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-22  9:48 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Florian Fainelli, netdev, David Miller, Neil Horman, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

Fri, Mar 21, 2014 at 01:04:20PM CET, jhs@mojatatu.com wrote:
>On 03/20/14 13:21, Florian Fainelli wrote:
>>2014-03-20 5:40 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>Thu, Mar 20, 2014 at 12:49:07PM CET, jhs@mojatatu.com wrote:
>>>>Hi Jiri,
>>>>
>
>>>>
>>>>I think the abstraction should be a netdev and to be specific the
>>>>bridge - not openvswitch. Our current tools like ifconfig, iproute2,
>>>>bridge etc should continue to work.
>>>
>>>That is exactly the case. Nothing is specific to OVS. OVS is just a one
>>>method to access the switchdev api.
>>>
>>>Abstraction is netdev. One netdev per each switch port and one netdev as
>>>a master on the top of that representing the switch itself.
>>>
>
>Ok, so that is what a bridge is.
>
>>>I think that the problem is that each solution serves different purpose.
>>>For example DSA is for switches connected as a PHY to a MAC. That is
>>>completely different case to what my switchdev API is trying to handle.
>>
>>I agree with Jamal here, we should try to find a solution that fits
>>most users here,
>
>Indeed. We have too many splinters already and each has its own way
>of being addressed. [Did you know MacVLAN is now also a L2 device that
>does bridging and a crap load of other things? A long way off from
>what the original intent was.]
>
>I think we are saying the same thing, but:
>This means need for a consistent interface and abstraction.
>My favorite abstraction in the kernel that i consider to be immortal
>is the netdev. I can have a netdev that is implemented as a physical
>ethernet port or as a tuntap or as a tunnel etc. They mostly use the
>same abstraction with small differences depending on the type, f.e
>a tuntap  with uid, gid etc is mostly no different than my laptop
>realtek ethernet port. I can control any of those the same way I
>control a CAN device on a vehicle with iproute2 and the same way i
>control  a dummy device, ifb, veth, etc.
>In otherwords, how packet processing happens (whether the netdev is
>used to toast bread) or what tables or constructs a specific kind of
>netdev needs (to slice bread) is only relevant to the implementation.
>From user space i dont need to have 15 different APIs to manage/control
>things (ok, there is ethtool - but that is just one more interface; but
>we have matured enough such that if you try to use /proc or /sysfs
>people will yell at you).

Hmm. This got me thinking about netdev and switches as well, and perhaps the
switchdev api could be mostly implemented by a couple more ndos and
feature flags. That way we could stick to your immortal netdev :)
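
A rough userspace sketch of that idea, assuming a hypothetical ops table
and feature bit. None of these names (swport_ops, flow_insert,
SWPORT_F_FLOW_OFFLOAD) exist in the kernel; real ndos live in
struct net_device_ops. The point is only that an optional callback plus a
feature flag lets the core fall back to -EOPNOTSUPP for ports that cannot
offload.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bit a port could advertise. */
#define SWPORT_F_FLOW_OFFLOAD (1u << 0)

struct swport;

/* Hypothetical extra callback, in the spirit of ndo_* operations. */
struct swport_ops {
	int (*flow_insert)(struct swport *port, uint32_t flow_id);
};

struct swport {
	uint32_t features;
	const struct swport_ops *ops;
	uint32_t nflows;
};

/* Core helper: only call into the driver if it advertises the feature. */
static int swport_flow_insert(struct swport *port, uint32_t flow_id)
{
	if (!(port->features & SWPORT_F_FLOW_OFFLOAD) ||
	    !port->ops || !port->ops->flow_insert)
		return -EOPNOTSUPP;
	return port->ops->flow_insert(port, flow_id);
}

/* Example driver callback: just counts inserted flows. */
static int dummy_flow_insert(struct swport *port, uint32_t flow_id)
{
	(void)flow_id;
	port->nflows++;
	return 0;
}

static const struct swport_ops dummy_ops = {
	.flow_insert = dummy_flow_insert,
};
```
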


>
>In my view: that (immortal) device for L2/bridging is the bridge or
>maybe a more barebone version of the bridge (since it has gained a
>little more weight in recent times).

Well, I do not think that bridge is an ideal abstraction for modern switch
chips. Bridge is very limited.

But I don't necessarily think it is needed to "mask" as a bridge or mimic a
bridge in any way. DSA does not do that either.

switchdev tries to provide an API. Who takes it and uses it is up to us.
OVS, bridge or whatever.

>
>>it seems to me like there are 3 switches categories:
>>
>>- entreprise built-in switches in NICs that support VF/PF
>>- embedded/entreprise switches that support tagging (Marvell eDSA/DSA,
>>Broadcom tags)
>>- embedded switches that only support 802.1q VLANs
>>
>
>I had started documenting this stuff to provide some context for an
>abstraction, but i had too many pre-emptions, so the document is not
>complete. Both John and Vlad had provided inputs to shape it. I
>could post it and take patches to it.

Sure, send us a link please.


>
>>The first category is more flow-oriented than control-oriented,
>>whereas the last two are more "event and control" oriented where you
>>usually have a system where the switch will be configured not to flood
>>the CPU port if possible, but when it does, this is to perform
>>specific configuration (address learning, port protection, snooping,
>>authorization...).
>>
>>
>>DSA is not designed specifically for switches which are connected to a
>>MAC and appear as a regular PHY, this is how it first started, but
>>nothing prevents you from using DSA with a switch that is e.g: memory
>>mapped into your CPU register space, MDIO is just the transport for
>>the control part.
>
>Your view is more detail oriented than mine. My focus is to more from
>a control/management abstraction level. From that perspective this
>is a healthy discussion - thank you.
>
>>For instance, if my switches support a N-bytes tag that will give me a
>>reason code for receiving this frame, and a bitmap representing the
>>originating port, how would you imagine this fitting into the
>>openvswitch/switchdev model, aside from the netdev per-port? Do you
>>think we could easily migrate existing DSA users to
>>openvswitch/switchdev by handling the custom switch tag?
>>
>
>I dont think so. I think we need to have this discussion to come
>up with a reasonable conclusion.
>
>cheers,
>jamal


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-22  9:48         ` Jiri Pirko
@ 2014-03-24 23:07           ` Jamal Hadi Salim
  2014-03-25 17:39             ` Neil Horman
  0 siblings, 1 reply; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-24 23:07 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Florian Fainelli, netdev, David Miller, Neil Horman, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/22/14 05:48, Jiri Pirko wrote:
> Fri, Mar 21, 2014 at 01:04:20PM CET, jhs@mojatatu.com wrote:

> Hmm. This got me thinking about netdev and switches well and perhaps the
> switchdev api could be mostly implemented by couple of more ndos and
> feature flags. That way we could stick to your immortal netdev :)
>
>

Perhaps ;->

>>
>> In my view: that (immortal) device for L2/bridging is the bridge or
>> maybe a more barebone version of the bridge (since it has gained a
>> little more weight in recent times).
>
> Well, I do not think that bridge is ideal abstraction for modern switch
> chips. Bridge is very limited.
>

True - but I was thinking more of being inclusive of the smaller
devices. They are mostly L2 only and very limited in scope. And that's
probably 95% of the population. The things you are talking about
are very high end and they can do more. Florian's taxonomy was useful.

> But I don't necessary think it is needed to "mask" as a bride or mimic a
> bridge in any way. DSA does not do that either.
>

I am open to the idea of exposing ports instead of a bridge.
Such ports could be aggregated together to form a bridge when the
hardware is capable.

> switchdev tries to provide an API. Who takes it and use it is up to us.
> OVS, bridge or whatever.
>

As long as you maintain the current user tools I am happy.
Can i run all my iproute2 tools?


>
> Sure, send us a link please.
>

I will post it somewhere. The starting point was L2; if we decide to
go a different direction it may require a different approach.


cheers,
jamal


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-24 23:07           ` Jamal Hadi Salim
@ 2014-03-25 17:39             ` Neil Horman
  2014-03-25 18:00               ` Thomas Graf
                                 ` (3 more replies)
  0 siblings, 4 replies; 125+ messages in thread
From: Neil Horman @ 2014-03-25 17:39 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jiri Pirko, Florian Fainelli, netdev, David Miller, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On Mon, Mar 24, 2014 at 07:07:35PM -0400, Jamal Hadi Salim wrote:
> On 03/22/14 05:48, Jiri Pirko wrote:
> >Fri, Mar 21, 2014 at 01:04:20PM CET, jhs@mojatatu.com wrote:
> 
> >Hmm. This got me thinking about netdev and switches well and perhaps the
> >switchdev api could be mostly implemented by couple of more ndos and
> >feature flags. That way we could stick to your immortal netdev :)
> >
> >
> 
> Perhaps ;->
> 
> >>
> >>In my view: that (immortal) device for L2/bridging is the bridge or
> >>maybe a more barebone version of the bridge (since it has gained a
> >>little more weight in recent times).
> >
> >Well, I do not think that bridge is ideal abstraction for modern switch
> >chips. Bridge is very limited.
> >
> 
> True - but i was more thinking of being inclusive of the smaller
> devices. They are mostly L2 only and in very limited scope. And thats
> probably 95% of the population. The things you are talking about
> are very high end and they can do more. Florian's taxanomy was useful.
> 
> >But I don't necessary think it is needed to "mask" as a bride or mimic a
> >bridge in any way. DSA does not do that either.
No, but it would be really nice if these smaller devices could take advantage of
this infrastructure.  Looking at it, I don't see why that's not possible.  The
big trick (as we've discussed in the past) is using a net_device structure to
take advantage of all the features that net_devices offer while not enabling the
device specific features that some hardware doesn't allow.

For instance the broadcom chips that live in many wireless routers would be well
served by the model jiri has here as far as Media level interface control is
concerned (i.e. ifup/down/speed/duplex/etc), but it's a bit lacking in that
net_devices are assumed to support L3 protocol configuration (i.e. they can have
ip addresses assigned to them), which you can't IIRC do on these chips.

Would it be worth considering a private interface model?  That is to say:

1) Ports on a switch chip are accessed using net_device structures, but
registered to a private list contained within the switch device, rather than to
the net namespace's device list.

2) Access to the switch ports via user space is done through the master switch
interface with additional netlink attributes specifying the port on the switch
to access (or omitted, to access the master switch device directly)


Such a model I think might fit well with Jiri's code here and provide greater
flexibility for a wider range of devices.  It would of course require
augmentation for user space, but the changes would be additive, so I think they
would be reasonable.  This would also allow the switch device to have a hook in
the control path to block or allow features that the hardware may or may not
support while still being able to use the existing net_device infrastructure to
support these operations as they are normally carried out.
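
A minimal userspace sketch of this private interface model, under stated
assumptions: struct sw_master, struct sw_port, sw_port_op and the allow_op
hook are hypothetical stand-ins (for the master switch device, the
privately held per-port net_devices, and the control path hook), not
existing kernel interfaces. The port index plays the role of the extra
netlink attribute.

```c
#include <errno.h>
#include <stddef.h>

#define SW_MAX_PORTS 8

/* Stand-in for a per-port net_device kept off the global device list. */
struct sw_port {
	int in_use;
	int admin_up;
};

/* Master switch device owning a private table of ports. */
struct sw_master {
	struct sw_port ports[SW_MAX_PORTS];
	/* Hook letting the driver veto operations the hardware lacks;
	 * returns 0 to allow, negative errno to block. NULL allows all. */
	int (*allow_op)(const struct sw_master *sw, unsigned int port, int op);
};

enum { SW_OP_SET_UP = 1, SW_OP_SET_ADDR = 2 };

/* All access goes through the master plus a port index. */
static int sw_port_op(struct sw_master *sw, unsigned int port, int op)
{
	int err;

	if (port >= SW_MAX_PORTS || !sw->ports[port].in_use)
		return -ENODEV;
	if (sw->allow_op) {
		err = sw->allow_op(sw, port, op);
		if (err)
			return err;
	}
	if (op == SW_OP_SET_UP)
		sw->ports[port].admin_up = 1;
	return 0;
}

/* Example hook for an L2-only chip: refuse L3 address configuration. */
static int l2_only_allow(const struct sw_master *sw, unsigned int port, int op)
{
	(void)sw;
	(void)port;
	return op == SW_OP_SET_ADDR ? -EOPNOTSUPP : 0;
}
```
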

Best
Neil


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 17:39             ` Neil Horman
@ 2014-03-25 18:00               ` Thomas Graf
  2014-03-25 19:35                 ` Neil Horman
  2014-03-25 18:33               ` Florian Fainelli
                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 125+ messages in thread
From: Thomas Graf @ 2014-03-25 18:00 UTC (permalink / raw)
  To: Neil Horman
  Cc: Jamal Hadi Salim, Jiri Pirko, Florian Fainelli, netdev,
	David Miller, andy, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On 03/25/14 at 01:39pm, Neil Horman wrote:
> No, but it would be really nice if these smaller devices could take advantage of
> this infrastructure.  Looking at it, I don't see why thats not possible.  The
> big trick (as we've discussed in the past), is using a net_device structure to
> take advantage of all the features that net_devices offer while not enabling the
> device specific features that some hardware doesn't allow.
> 
> For instance the broadcom chips that live in many wireless routers would be well
> served by the model jiri has here as far as Media level interface control is
> concerned (i.e. ifup/down/speed/duplex/etc), but its a bit lacking in that
> net_devices are assumed to support L3 protocol configuration (i.e. they can have
> ip addresses assigned to them), which you can't IIRC do on these chips.

How about a new device flag indicating pure L2 mode? Any L3 address
configuration would fail with EAFNOSUPPORT.
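
As a minimal userspace sketch of that behaviour, assuming a hypothetical
flag DEVF_L2_ONLY and a stand-in l2dev_addr_add() for the address
configuration path (neither exists in the kernel); only the errno value,
EAFNOSUPPORT from <errno.h>, is real:

```c
#include <errno.h>

/* Hypothetical device flag: port can only do L2, no L3 addressing. */
#define DEVF_L2_ONLY (1u << 0)

struct l2dev {
	unsigned int flags;
	int naddrs; /* number of L3 addresses configured */
};

/* Stand-in for the address-add path: reject L3 config on pure L2 ports. */
static int l2dev_addr_add(struct l2dev *dev)
{
	if (dev->flags & DEVF_L2_ONLY)
		return -EAFNOSUPPORT;
	dev->naddrs++;
	return 0;
}
```
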

> Would it be worth considering a private interface model?  That is to say:
> 
> 1) Ports on a switch chip are accessed using net_device structures, but
> registered to a private list contained within the switch device, rather than to
> the net namespaces device list.

> 2) Access to the switch ports via user space is done through the master switch
> interface with additional netlink attributes specifying the port on the switch
> to access (or not to access the master switch device directly)

> Such a model I think might fit well with Jiri's code here and provide greater
> flexibility for a wider range of devices.  It would of course require
> augmentation for user space, but the changes would be additive, so I think they
> would be reasonable.  This would also allow the switch device to have a hook in
> the control path to block or allow features that the hardware may or may not
> support while still being able to use the existing net_device infrastructure to
> support these operations as they are normally carried out.

I believe this would defeat the main advantage of reusing the net_device
model, which is compatibility with the well established standard toolset.

In an ideal world, we represent what is possible using the existing
net_device model.

On top of that, like for VFs, we provide extended nested attributes or
alternate control paths such as via OVS that provide the additional
flexibility and control required by the more advanced devices.


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 17:39             ` Neil Horman
  2014-03-25 18:00               ` Thomas Graf
@ 2014-03-25 18:33               ` Florian Fainelli
  2014-03-25 19:40                 ` Neil Horman
  2014-03-25 20:46               ` Jamal Hadi Salim
  2014-03-26  7:24               ` Jiri Pirko
  3 siblings, 1 reply; 125+ messages in thread
From: Florian Fainelli @ 2014-03-25 18:33 UTC (permalink / raw)
  To: Neil Horman
  Cc: Jamal Hadi Salim, Jiri Pirko, netdev, David Miller, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Felix Fietkau

2014-03-25 10:39 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> On Mon, Mar 24, 2014 at 07:07:35PM -0400, Jamal Hadi Salim wrote:
>> On 03/22/14 05:48, Jiri Pirko wrote:
>> >Fri, Mar 21, 2014 at 01:04:20PM CET, jhs@mojatatu.com wrote:
>>
>> >Hmm. This got me thinking about netdev and switches well and perhaps the
>> >switchdev api could be mostly implemented by couple of more ndos and
>> >feature flags. That way we could stick to your immortal netdev :)
>> >
>> >
>>
>> Perhaps ;->
>>
>> >>
>> >>In my view: that (immortal) device for L2/bridging is the bridge or
>> >>maybe a more barebone version of the bridge (since it has gained a
>> >>little more weight in recent times).
>> >
>> >Well, I do not think that bridge is ideal abstraction for modern switch
>> >chips. Bridge is very limited.
>> >
>>
>> True - but i was more thinking of being inclusive of the smaller
>> devices. They are mostly L2 only and in very limited scope. And thats
>> probably 95% of the population. The things you are talking about
>> are very high end and they can do more. Florian's taxanomy was useful.
>>
>> >But I don't necessary think it is needed to "mask" as a bride or mimic a
>> >bridge in any way. DSA does not do that either.
> No, but it would be really nice if these smaller devices could take advantage of
> this infrastructure.  Looking at it, I don't see why thats not possible.  The
> big trick (as we've discussed in the past), is using a net_device structure to
> take advantage of all the features that net_devices offer while not enabling the
> device specific features that some hardware doesn't allow.
>
> For instance the broadcom chips that live in many wireless routers would be well
> served by the model jiri has here as far as Media level interface control is
> concerned (i.e. ifup/down/speed/duplex/etc), but its a bit lacking in that
> net_devices are assumed to support L3 protocol configuration (i.e. they can have
> ip addresses assigned to them), which you can't IIRC do on these chips.

In fact, some switches could have valid L3 configurations. What you
usually do is have the switch configured such that it selectively
inserts Broadcom tags for a given set of physical ports, so that
your CPU port (In-band Management port in Broadcom terminology) gets
flooded with such packets and can dispatch them to the
per-port netdevices. Then you can take any decision based on those
received packets, such as bridging this per-port netdevice with
another one for instance, or any switch topology change.

>
> Would it be worth considering a private interface model?  That is to say:
>
> 1) Ports on a switch chip are accessed using net_device structures, but
> registered to a private list contained within the switch device, rather than to
> the net namespaces device list.

I think this would be a good model for simple embedded switches that
only support 802.1q VLANs for instance, since we won't be able to
send or receive any actual data on a per-port netdevice; those
per-port netdevices would only be effective for control at the L2
level.

For switches that do support tags, I think we do want per-port
netdevices to appear in the regular netdevices namespace, as those
might be able to send and receive actual data by using these
tags, at least momentarily until a higher-level entity decides
otherwise (e.g: by bridging, disabling interfaces...).

>
> 2) Access to the switch ports via user space is done through the master switch
> interface with additional netlink attributes specifying the port on the switch
> to access (or not to access the master switch device directly)

OpenWrt's swconfig model was to use the CPU Ethernet interface as the
master device for accessing the switch, although it did not use
RTNETLINK but a separate Genl family. I think this is a good model.

One thing that was easy with OpenWrt's swconfig model was to promote
what started as a switch specific feature into something that could
become a generic operation that all switch drivers could implement. It
was also very easy to add custom switch driver features without
bloating the switch configuration API.

>
>
> Such a model I think might fit well with Jiri's code here and provide greater
> flexibility for a wider range of devices.  It would of course require
> augmentation for user space, but the changes would be additive, so I think they
> would be reasonable.  This would also allow the switch device to have a hook in
> the control path to block or allow features that the hardware may or may not
> support while still being able to use the existing net_device infrastructure to
> support these operations as they are normally carried out.

Sounds good. There might be a bunch of new NETIF_F_* flags to add to
help advertise switch-specific features such as: hardware/software
switch tag insertion, support for classification, support for
influencing switch queueing etc... but this does not have to be
available from day 1.
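
A small sketch of how such flags could be checked, assuming made-up bit
names (SW_F_*) and a helper sw_has_features(); these are illustrative only
and are not actual NETIF_F_* values:

```c
#include <stdint.h>

/* Hypothetical switch feature bits; not real NETIF_F_* values. */
#define SW_F_HW_TAG    (1u << 0) /* hardware switch tag insertion */
#define SW_F_SW_TAG    (1u << 1) /* software switch tag insertion */
#define SW_F_CLASSIFY  (1u << 2) /* classification support */
#define SW_F_QUEUEING  (1u << 3) /* can influence switch queueing */

/* True if the device advertises every feature in wanted. */
static int sw_has_features(uint32_t advertised, uint32_t wanted)
{
	return (advertised & wanted) == wanted;
}
```
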
-- 
Florian


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 18:00               ` Thomas Graf
@ 2014-03-25 19:35                 ` Neil Horman
  2014-03-25 20:11                   ` Florian Fainelli
  2014-03-25 20:56                   ` Jamal Hadi Salim
  0 siblings, 2 replies; 125+ messages in thread
From: Neil Horman @ 2014-03-25 19:35 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Jamal Hadi Salim, Jiri Pirko, Florian Fainelli, netdev,
	David Miller, andy, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:
> On 03/25/14 at 01:39pm, Neil Horman wrote:
> > No, but it would be really nice if these smaller devices could take advantage of
> > this infrastructure.  Looking at it, I don't see why thats not possible.  The
> > big trick (as we've discussed in the past), is using a net_device structure to
> > take advantage of all the features that net_devices offer while not enabling the
> > device specific features that some hardware doesn't allow.
> > 
> > For instance the broadcom chips that live in many wireless routers would be well
> > served by the model jiri has here as far as Media level interface control is
> > concerned (i.e. ifup/down/speed/duplex/etc), but its a bit lacking in that
> > net_devices are assumed to support L3 protocol configuration (i.e. they can have
> > ip addresses assigned to them), which you can't IIRC do on these chips.
> 
> How about a new device flag indicating pure L2 mode? Any L3 address
> configuration would fail with EAFNOSUPP.
> 
Yeah, we've discussed that before, and it seems like a good idea, though I'm not
sure that it's flexible enough.  It clearly prevents L3 operations on devices
that can only do L2, which is great, but that may not be sufficient for some
devices.  For example, what if you wanted to use ebtables on an L2 port where
the hardware can't mirror the actions of a given table rule?  Do we need to
expand out those capabilities?
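
To make the question concrete, here is a rough userspace sketch (not kernel
code, and not part of Jiri's patches) of what finer-grained per-port
capabilities could look like next to a single "pure L2" flag. Every name in it
(l2port, L2PORT_CAP_*) is made up for illustration, and the errno spelled
EAFNOSUPP above is assumed to mean EAFNOSUPPORT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical capability bits; these are not existing kernel flags. */
enum l2port_caps {
	L2PORT_CAP_L3_ADDR  = 1u << 0,	/* port may carry L3 addresses */
	L2PORT_CAP_EBT_DROP = 1u << 1,	/* hw can mirror an ebtables DROP rule */
};

/* Minimal stand-in for a per-port device; not struct net_device. */
struct l2port {
	const char *name;
	unsigned int caps;
};

/* L3 address configuration: refused on a pure L2 port. */
int l2port_add_addr(const struct l2port *port)
{
	assert(port != NULL);
	if (!(port->caps & L2PORT_CAP_L3_ADDR))
		return -EAFNOSUPPORT;
	return 0;
}

/* An L2 operation that a single "L2 only" flag cannot express:
 * the port is L2 capable, but the hardware cannot mirror this rule. */
int l2port_add_ebt_drop(const struct l2port *port)
{
	assert(port != NULL);
	if (!(port->caps & L2PORT_CAP_EBT_DROP))
		return -EOPNOTSUPP;
	return 0;
}
```

With only one flag, the second check has nowhere to live, which is the gap
being asked about. Whether such bits belong in netdev features or somewhere
else is left open here.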

> > Would it be worth considering a private interface model?  That is to say:
> > 
> > 1) Ports on a switch chip are accessed using net_device structures, but
> > registered to a private list contained within the switch device, rather than to
> > the net namespaces device list.
> 
> > 2) Access to the switch ports via user space is done through the master switch
> > interface with additional netlink attributes specifying the port on the switch
> > to access (or not to access the master switch device directly)
> 
> > Such a model I think might fit well with Jiri's code here and provide greater
> > flexibility for a wider range of devices.  It would of course require
> > augmentation for user space, but the changes would be additive, so I think they
> > would be reasonable.  This would also allow the switch device to have a hook in
> > the control path to block or allow features that the hardware may or may not
> > support while still being able to use the existing net_device infrastructure to
> > support these operations as they are normally carried out.
> 
> I believe this would defeat the main advantage of reusing net_device
> model which is compatibility with the well established standard toolset.
> 
> In an ideal world, we represent what is possible using the existing
> net_device model.
> 

Maybe I'm not being clear. I'm not suggesting that we abandon the use of a
net_device to do any of this work, only that we add a layer of indirection to
get to it.  By augmenting the existing network device stack to allow
registration of net_devices to arbitrary lists, rather than to a fixed
per-net-namespace global device list, we can operate net_devices that are only
visible within the scope of a given switch fabric.  User space still works the
same way, it just requires the specification of additional information when
speaking to ports on a switch device that may not be directly accessible via the
cpu.  For example, if a system has a directly connected nic (em1), and a switch
fabric with a master bridge port (sw1), and 10 external ports (sw1pX), we could
access them all from user space via ip link show, like so:

1) ip link show:
em1
sw1

2) ip link show sw1
sw1

3) ip link show -p sw1
sw1p0
sw1p1
sw1p2...


The idea is to augment user space to allow the visibility of ports through the
switch device, not directly, but using the same existing mechanisms.  We can
reuse all the existing infrastructure, but with this model, control must pass
through the switch device driver, allowing it to tailor available features by
passing the netlink request on to the appropriate netdevice, or sending back an
error itself.
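
As a rough illustration of that control flow, here is a small userspace model
(plain C, not kernel code). The port list is private to the switch object, and
every request goes through the switch, which either forwards it to the port or
returns an error itself. All names (swmodel, SWMODEL_OP_*) are hypothetical,
and the request is a plain function call standing in for a netlink message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define SWMODEL_MAX_PORTS 16

/* Hypothetical request types standing in for netlink operations. */
enum swmodel_op { SWMODEL_OP_SET_LINK, SWMODEL_OP_SET_ADDR };

struct swmodel_port {
	char name[16];
	int up;
};

/* The switch owns its ports; they appear on no global list. */
struct swmodel {
	char name[16];
	unsigned int allowed_ops;	/* bitmask of (1u << op) */
	size_t nports;
	struct swmodel_port ports[SWMODEL_MAX_PORTS];
};

/* Register a port on the switch's private list. */
int swmodel_add_port(struct swmodel *sw, const char *name)
{
	struct swmodel_port *p;

	assert(sw != NULL && name != NULL);
	if (sw->nports >= SWMODEL_MAX_PORTS)
		return -ENOSPC;
	p = &sw->ports[sw->nports++];
	snprintf(p->name, sizeof(p->name), "%s", name);
	p->up = 0;
	return 0;
}

/* Look a port up by name; only reachable through the switch. */
struct swmodel_port *swmodel_find_port(struct swmodel *sw, const char *name)
{
	size_t i;

	for (i = 0; i < sw->nports; i++)
		if (strcmp(sw->ports[i].name, name) == 0)
			return &sw->ports[i];
	return NULL;
}

/* Control path: the switch decides, then forwards or rejects. */
int swmodel_request(struct swmodel *sw, const char *port_name,
		    enum swmodel_op op, int arg)
{
	struct swmodel_port *p;

	if (!(sw->allowed_ops & (1u << op)))
		return -EOPNOTSUPP;	/* rejected by the switch driver */
	p = swmodel_find_port(sw, port_name);
	if (!p)
		return -ENODEV;
	switch (op) {
	case SWMODEL_OP_SET_LINK:
		p->up = !!arg;		/* forwarded to the port */
		return 0;
	default:
		return -EOPNOTSUPP;
	}
}
```

A switch that only allows SWMODEL_OP_SET_LINK would accept link up/down on
sw1p0 and refuse address configuration with -EOPNOTSUPP before the port ever
sees the request.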

> On top of that, like for VFs, we provide extended nested attributes or
> alternate control paths such as via OVS that provide the additional
> flexibility and control required by the more advanced devices.
I'm sorry, I don't understand the relevance here.  Are you suggesting that to
make this modification, we would need to augment more than a single set of
netlink control paths?

Neil


^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 18:33               ` Florian Fainelli
@ 2014-03-25 19:40                 ` Neil Horman
  2014-03-25 20:00                   ` Florian Fainelli
  0 siblings, 1 reply; 125+ messages in thread
From: Neil Horman @ 2014-03-25 19:40 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Jamal Hadi Salim, Jiri Pirko, netdev, David Miller, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Felix Fietkau

On Tue, Mar 25, 2014 at 11:33:22AM -0700, Florian Fainelli wrote:
> 2014-03-25 10:39 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> > On Mon, Mar 24, 2014 at 07:07:35PM -0400, Jamal Hadi Salim wrote:
> >> On 03/22/14 05:48, Jiri Pirko wrote:
> >> >Fri, Mar 21, 2014 at 01:04:20PM CET, jhs@mojatatu.com wrote:
> >>
> >> >Hmm. This got me thinking about netdev and switches well and perhaps the
> >> >switchdev api could be mostly implemented by couple of more ndos and
> >> >feature flags. That way we could stick to your immortal netdev :)
> >> >
> >> >
> >>
> >> Perhaps ;->
> >>
> >> >>
> >> >>In my view: that (immortal) device for L2/bridging is the bridge or
> >> >>maybe a more barebone version of the bridge (since it has gained a
> >> >>little more weight in recent times).
> >> >
> >> >Well, I do not think that bridge is ideal abstraction for modern switch
> >> >chips. Bridge is very limited.
> >> >
> >>
> >> True - but i was more thinking of being inclusive of the smaller
> >> devices. They are mostly L2 only and in very limited scope. And thats
> >> probably 95% of the population. The things you are talking about
> >> are very high end and they can do more. Florian's taxanomy was useful.
> >>
> >> >But I don't necessary think it is needed to "mask" as a bride or mimic a
> >> >bridge in any way. DSA does not do that either.
> > No, but it would be really nice if these smaller devices could take advantage of
> > this infrastructure.  Looking at it, I don't see why thats not possible.  The
> > big trick (as we've discussed in the past), is using a net_device structure to
> > take advantage of all the features that net_devices offer while not enabling the
> > device specific features that some hardware doesn't allow.
> >
> > For instance the broadcom chips that live in many wireless routers would be well
> > served by the model jiri has here as far as Media level interface control is
> > concerned (i.e. ifup/down/speed/duplex/etc), but its a bit lacking in that
> > net_devices are assumed to support L3 protocol configuration (i.e. they can have
> > ip addresses assigned to them), which you can't IIRC do on these chips.
> 
> In fact, some switches could have valid L3 configurations, what you
> usually do is have the switch be configured such that it selectively
> inserts Broadcom tags for a given set of physical ports, such that
> your CPU port (In-band Management port in Broadcom terminology), gets
> flooded with such packets, and can dispatch those packets to the
> per-port netdevices. Then you can take any decisions based on those
> received packets, such as bridging this per-port netdevice with
> another one for instance, or any switch topology change.
> 
> >
> > Would it be worth considering a private interface model?  That is to say:
> >
> > 1) Ports on a switch chip are accessed using net_device structures, but
> > registered to a private list contained within the switch device, rather than to
> > the net namespaces device list.
> 
> I think this would be a good model for simple embedded switches that
> only support 802.1q VLANs for instance, since we won't be able to get
> any actual data to be sent/received to any per-port netdevice, those
> per-port netdevices would only be effective for control at the L2
> level.
> 
> For switches that do support tags, I think we do want per-port
> netdevices to appear in the regular netdevices namespace as those
> might be able to get actual data sent to/received from by using these
> tags, at least momentarily until a higher-level entity decides
> otherwise (e.g: by bridging, disabling interfaces...).
> 
Well, perhaps that's the answer then - augment the model to allow for the
registration of net_devices to private lists within a switch device, but don't
require it.  If a given chip supports the assignment of L3 data by the cpu, the
use of iptables etc, let the switch driver do so; it's not like we can't do that
already.  But for the smaller devices, keep them tightly controlled via the
switch driver, in such a way that user space can only access them with permission
from the switch driver.

Does that seem reasonable?

Neil

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 19:40                 ` Neil Horman
@ 2014-03-25 20:00                   ` Florian Fainelli
  2014-03-25 21:39                     ` tgraf
  0 siblings, 1 reply; 125+ messages in thread
From: Florian Fainelli @ 2014-03-25 20:00 UTC (permalink / raw)
  To: Neil Horman
  Cc: Jamal Hadi Salim, Jiri Pirko, netdev, David Miller, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Felix Fietkau

2014-03-25 12:40 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> On Tue, Mar 25, 2014 at 11:33:22AM -0700, Florian Fainelli wrote:
>> 2014-03-25 10:39 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>> > On Mon, Mar 24, 2014 at 07:07:35PM -0400, Jamal Hadi Salim wrote:
>> >> On 03/22/14 05:48, Jiri Pirko wrote:
>> >> >Fri, Mar 21, 2014 at 01:04:20PM CET, jhs@mojatatu.com wrote:
>> >>
>> >> >Hmm. This got me thinking about netdev and switches well and perhaps the
>> >> >switchdev api could be mostly implemented by couple of more ndos and
>> >> >feature flags. That way we could stick to your immortal netdev :)
>> >> >
>> >> >
>> >>
>> >> Perhaps ;->
>> >>
>> >> >>
>> >> >>In my view: that (immortal) device for L2/bridging is the bridge or
>> >> >>maybe a more barebone version of the bridge (since it has gained a
>> >> >>little more weight in recent times).
>> >> >
>> >> >Well, I do not think that bridge is ideal abstraction for modern switch
>> >> >chips. Bridge is very limited.
>> >> >
>> >>
>> >> True - but i was more thinking of being inclusive of the smaller
>> >> devices. They are mostly L2 only and in very limited scope. And thats
>> >> probably 95% of the population. The things you are talking about
>> >> are very high end and they can do more. Florian's taxanomy was useful.
>> >>
>> >> >But I don't necessary think it is needed to "mask" as a bride or mimic a
>> >> >bridge in any way. DSA does not do that either.
>> > No, but it would be really nice if these smaller devices could take advantage of
>> > this infrastructure.  Looking at it, I don't see why thats not possible.  The
>> > big trick (as we've discussed in the past), is using a net_device structure to
>> > take advantage of all the features that net_devices offer while not enabling the
>> > device specific features that some hardware doesn't allow.
>> >
>> > For instance the broadcom chips that live in many wireless routers would be well
>> > served by the model jiri has here as far as Media level interface control is
>> > concerned (i.e. ifup/down/speed/duplex/etc), but its a bit lacking in that
>> > net_devices are assumed to support L3 protocol configuration (i.e. they can have
>> > ip addresses assigned to them), which you can't IIRC do on these chips.
>>
>> In fact, some switches could have valid L3 configurations, what you
>> usually do is have the switch be configured such that it selectively
>> inserts Broadcom tags for a given set of physical ports, such that
>> your CPU port (In-band Management port in Broadcom terminology), gets
>> flooded with such packets, and can dispatch those packets to the
>> per-port netdevices. Then you can take any decisions based on those
>> received packets, such as bridging this per-port netdevice with
>> another one for instance, or any switch topology change.
>>
>> >
>> > Would it be worth considering a private interface model?  That is to say:
>> >
>> > 1) Ports on a switch chip are accessed using net_device structures, but
>> > registered to a private list contained within the switch device, rather than to
>> > the net namespaces device list.
>>
>> I think this would be a good model for simple embedded switches that
>> only support 802.1q VLANs for instance, since we won't be able to get
>> any actual data to be sent/received to any per-port netdevice, those
>> per-port netdevices would only be effective for control at the L2
>> level.
>>
>> For switches that do support tags, I think we do want per-port
>> netdevices to appear in the regular netdevices namespace as those
>> might be able to get actual data sent to/received from by using these
>> tags, at least momentarily until a higher-level entity decides
>> otherwise (e.g: by bridging, disabling interfaces...).
>>
> Well, perhaps thats the answer then  - Augment the model to allow for the
> registration of net_devices to private lists within a switch device, but don't
> require it.  If a given chip supports the assignment of L3 data by the cpu, the
> use of iptables etc, let the switch driver do so, its not like we can't do that
> already, but for the smaller devices, keeping them tightly controled via the
> switch driver in such a way that user space can only access them with permission
> from the switch driver.
>
> Does that seem reasonable?

Sure that looks good, the switch driver will know what L2/L3 features
it has, and the higher levels will know how to utilize that
information to construct the net_devices stacking and namespacing.
-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 19:35                 ` Neil Horman
@ 2014-03-25 20:11                   ` Florian Fainelli
  2014-03-25 20:31                     ` Neil Horman
                                       ` (2 more replies)
  2014-03-25 20:56                   ` Jamal Hadi Salim
  1 sibling, 3 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-25 20:11 UTC (permalink / raw)
  To: Neil Horman
  Cc: Thomas Graf, Jamal Hadi Salim, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:
>> On 03/25/14 at 01:39pm, Neil Horman wrote:
>> > No, but it would be really nice if these smaller devices could take advantage of
>> > this infrastructure.  Looking at it, I don't see why thats not possible.  The
>> > big trick (as we've discussed in the past), is using a net_device structure to
>> > take advantage of all the features that net_devices offer while not enabling the
>> > device specific features that some hardware doesn't allow.
>> >
>> > For instance the broadcom chips that live in many wireless routers would be well
>> > served by the model jiri has here as far as Media level interface control is
>> > concerned (i.e. ifup/down/speed/duplex/etc), but its a bit lacking in that
>> > net_devices are assumed to support L3 protocol configuration (i.e. they can have
>> > ip addresses assigned to them), which you can't IIRC do on these chips.
>>
>> How about a new device flag indicating pure L2 mode? Any L3 address
>> configuration would fail with EAFNOSUPP.
>>
> Yeah, we've discussed that before, and it seems like a good idea, though I'm not
> sure that its flexible enough.  It clearly prevents L3 operations on devices
> that can only do L2, which is great, but that may not be sufficient for some
> devices.  For example, what if you wanted to use ebtables on an L2 port where
> the hardware can't mirror the actions of a given table rule?  Do we need to
> expand out those capabilities?
>
>> > Would it be worth considering a private interface model?  That is to say:
>> >
>> > 1) Ports on a switch chip are accessed using net_device structures, but
>> > registered to a private list contained within the switch device, rather than to
>> > the net namespaces device list.
>>
>> > 2) Access to the switch ports via user space is done through the master switch
>> > interface with additional netlink attributes specifying the port on the switch
>> > to access (or not to access the master switch device directly)
>>
>> > Such a model I think might fit well with Jiri's code here and provide greater
>> > flexibility for a wider range of devices.  It would of course require
>> > augmentation for user space, but the changes would be additive, so I think they
>> > would be reasonable.  This would also allow the switch device to have a hook in
>> > the control path to block or allow features that the hardware may or may not
>> > support while still being able to use the existing net_device infrastructure to
>> > support these operations as they are normally carried out.
>>
>> I believe this would defeat the main advantage of reusing net_device
>> model which is compatibility with the well established standard toolset.
>>
>> In an ideal world, we represent what is possible using the existing
>> net_device model.
>>
>
> Maybe I'm not being clear. I'm not suggesting that we abandon the use of a
> net_device to do any of this work, only that we add a layer of indirection to
> get to it.  By Augmenting the existing network device stack to allow
> registration of net_devices to arbitrary lists, rather than to a fixes
> per-net-namespace global device list, we can operate net_devices that are only
> visible within the scope of a given switch fabric.  User space still works the
> same way, it just requires the specification of additional information when
> speaking to ports on a switch device that may not be directly accessible via the
> cpu.  For example, if a systems has a directly connected nic (em1), and a switch
> fabric with a master bridge port (sw1), and 10 external ports (sw1pX), we could
> access them all from user space via ip link show.  for example:
>
> 1) ip link show:
> em1
> sw1
>
> 2) ip link show sw1
> sw1
>
> 3) ip link show -p sw1
> sw1p0
> sw1p1
> sw1p2...

I was scratching my head about why we might want to expose sw1 as a
separate net_device, but I think this is a good model as it allows for
a "seamless" switch awareness to be constructed, and allows for
controlling the CPU/management port(s) of a given Ethernet switch
separately, which is valuable. It also makes it possible to expose the
multiple CPU/management ports of a given switch when that exists, and
finally, there might be special firmware running on the Ethernet
switch, and that specific 'sw1' net_device could be the one to use to
talk to this via sockets, ioctls, whatever.

>
>
> The idea is to augment user space to allow the visibiliy of ports through the
> switch device, not directly, but using the same existing mechanisms.  We can
> reuse all the existing infrastruture, but with this model, control must pass
> through the switch device driver, allowing it to taylor available features by
> passing the netlink request on to the appropriate netdevice, or sending back an
> error itself.
>
>> On top of that, like for VFs, we provide extended nested attributes or
>> alternate control paths such as via OVS that provide the additional
>> flexibility and control required by the more advanced devices.
> I'm sorry, I don't understand the relevance here.  Are you suggesting that to
> make this modification, we would need to augment more than a single set of
> netlink control paths?

Not sure if I got this right, but there might be additional control
knobs required for specific Ethernet switch features that do not map
nicely, if at all, with existing interfaces provided by ip/tc,
ethtool... although I guess one would say, well, then go add these
APIs instead of creating "extended" ones?
-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 20:11                   ` Florian Fainelli
@ 2014-03-25 20:31                     ` Neil Horman
  2014-03-25 21:22                       ` Jamal Hadi Salim
  2014-03-25 21:26                     ` Thomas Graf
  2014-03-26  5:37                     ` Roopa Prabhu
  2 siblings, 1 reply; 125+ messages in thread
From: Neil Horman @ 2014-03-25 20:31 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Thomas Graf, Jamal Hadi Salim, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On Tue, Mar 25, 2014 at 01:11:55PM -0700, Florian Fainelli wrote:
> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> > On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:
> >> On 03/25/14 at 01:39pm, Neil Horman wrote:
> >> > No, but it would be really nice if these smaller devices could take advantage of
> >> > this infrastructure.  Looking at it, I don't see why thats not possible.  The
> >> > big trick (as we've discussed in the past), is using a net_device structure to
> >> > take advantage of all the features that net_devices offer while not enabling the
> >> > device specific features that some hardware doesn't allow.
> >> >
> >> > For instance the broadcom chips that live in many wireless routers would be well
> >> > served by the model jiri has here as far as Media level interface control is
> >> > concerned (i.e. ifup/down/speed/duplex/etc), but its a bit lacking in that
> >> > net_devices are assumed to support L3 protocol configuration (i.e. they can have
> >> > ip addresses assigned to them), which you can't IIRC do on these chips.
> >>
> >> How about a new device flag indicating pure L2 mode? Any L3 address
> >> configuration would fail with EAFNOSUPP.
> >>
> > Yeah, we've discussed that before, and it seems like a good idea, though I'm not
> > sure that its flexible enough.  It clearly prevents L3 operations on devices
> > that can only do L2, which is great, but that may not be sufficient for some
> > devices.  For example, what if you wanted to use ebtables on an L2 port where
> > the hardware can't mirror the actions of a given table rule?  Do we need to
> > expand out those capabilities?
> >
> >> > Would it be worth considering a private interface model?  That is to say:
> >> >
> >> > 1) Ports on a switch chip are accessed using net_device structures, but
> >> > registered to a private list contained within the switch device, rather than to
> >> > the net namespaces device list.
> >>
> >> > 2) Access to the switch ports via user space is done through the master switch
> >> > interface with additional netlink attributes specifying the port on the switch
> >> > to access (or not to access the master switch device directly)
> >>
> >> > Such a model I think might fit well with Jiri's code here and provide greater
> >> > flexibility for a wider range of devices.  It would of course require
> >> > augmentation for user space, but the changes would be additive, so I think they
> >> > would be reasonable.  This would also allow the switch device to have a hook in
> >> > the control path to block or allow features that the hardware may or may not
> >> > support while still being able to use the existing net_device infrastructure to
> >> > support these operations as they are normally carried out.
> >>
> >> I believe this would defeat the main advantage of reusing net_device
> >> model which is compatibility with the well established standard toolset.
> >>
> >> In an ideal world, we represent what is possible using the existing
> >> net_device model.
> >>
> >
> > Maybe I'm not being clear. I'm not suggesting that we abandon the use of a
> > net_device to do any of this work, only that we add a layer of indirection to
> > get to it.  By Augmenting the existing network device stack to allow
> > registration of net_devices to arbitrary lists, rather than to a fixes
> > per-net-namespace global device list, we can operate net_devices that are only
> > visible within the scope of a given switch fabric.  User space still works the
> > same way, it just requires the specification of additional information when
> > speaking to ports on a switch device that may not be directly accessible via the
> > cpu.  For example, if a systems has a directly connected nic (em1), and a switch
> > fabric with a master bridge port (sw1), and 10 external ports (sw1pX), we could
> > access them all from user space via ip link show.  for example:
> >
> > 1) ip link show:
> > em1
> > sw1
> >
> > 2) ip link show sw1
> > sw1
> >
> > 3) ip link show -p sw1
> > sw1p0
> > sw1p1
> > sw1p2...
> 
> I was scratching my head about why we might want to expose sw1 as a
> separate net_device, but I think this is a good model as it allows for
> a "seamless" switch awareness to be constructed, and allows for
> controlling the CPU/management port(s) of a given Ethernet switch
> separately, which is valuable. It also makes it possible to expose the
> multiple CPU/management ports of a given switch when that exists, and
> finally, there might be special firmware running on the Ethernet
> switch, and that specific 'sw1' net_device could be the one to use to
> talk to this via sockets, ioctls, whatever.
> 
> >
> >
> > The idea is to augment user space to allow the visibiliy of ports through the
> > switch device, not directly, but using the same existing mechanisms.  We can
> > reuse all the existing infrastruture, but with this model, control must pass
> > through the switch device driver, allowing it to taylor available features by
> > passing the netlink request on to the appropriate netdevice, or sending back an
> > error itself.
> >
> >> On top of that, like for VFs, we provide extended nested attributes or
> >> alternate control paths such as via OVS that provide the additional
> >> flexibility and control required by the more advanced devices.
> > I'm sorry, I don't understand the relevance here.  Are you suggesting that to
> > make this modification, we would need to augment more than a single set of
> > netlink control paths?
> 
> Not sure if I got this right, but there might be additional control
> knobs required for specific Ethernet switch features that do not map
> nicely, if at all with existing interfaces provided by ip/tc,
> ethtool... although I guess one would say, well, then go add these
> APIs instead of creating "extended" ones?
Ostensibly yes, but I'm not well versed enough in what those interfaces are to
know for certain.  I definitely agree, however, that if a given interface outside
the scope of network device control is required (say, for example, direct access
to a switch fabric's CAM lookup table), then you are correct: we should develop
those APIs rather than shoehorn them into a net_device model.

Neil

> -- 
> Florian
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 17:39             ` Neil Horman
  2014-03-25 18:00               ` Thomas Graf
  2014-03-25 18:33               ` Florian Fainelli
@ 2014-03-25 20:46               ` Jamal Hadi Salim
  2014-03-26  7:24               ` Jiri Pirko
  3 siblings, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-25 20:46 UTC (permalink / raw)
  To: Neil Horman
  Cc: Jiri Pirko, Florian Fainelli, netdev, David Miller, andy, tgraf,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/25/14 13:39, Neil Horman wrote:

> No, but it would be really nice if these smaller devices could take advantage of
> this infrastructure.

Indeed.

> Looking at it, I don't see why thats not possible.  The
> big trick (as we've discussed in the past), is using a net_device structure to
> take advantage of all the features that net_devices offer while not enabling the
> device specific features that some hardware doesn't allow.
>

Exactly. And I don't think that's hard to do. I do think for capabilities,
netdev->features is insufficient (for example, I can't export to user space
the size of my h/w fdb table etc). But those things can be easily
ironed out.

> For instance the broadcom chips that live in many wireless routers would be well
> served by the model jiri has here as far as Media level interface control is
> concerned (i.e. ifup/down/speed/duplex/etc), but its a bit lacking in that
> net_devices are assumed to support L3 protocol configuration (i.e. they can have
> ip addresses assigned to them), which you can't IIRC do on these chips.
>

This is part of the challenge I was talking about, and why the lowest
common denominator is just ports and L2 bridging.

> Would it be worth considering a private interface model?  That is to say:
>
> 1) Ports on a switch chip are accessed using net_device structures, but
> registered to a private list contained within the switch device, rather than to
> the net namespaces device list.
>
> 2) Access to the switch ports via user space is done through the master switch
> interface with additional netlink attributes specifying the port on the switch
> to access (or not to access the master switch device directly)
>
>
> Such a model I think might fit well with Jiri's code here and provide greater
> flexibility for a wider range of devices.  It would of course require
> augmentation for user space, but the changes would be additive, so I think they
> would be reasonable.  This would also allow the switch device to have a hook in
> the control path to block or allow features that the hardware may or may not
> support while still being able to use the existing net_device infrastructure to
> support these operations as they are normally carried out.
>

I think Jiri's model is upside down (yes, I was on that boat as well
earlier).
What needs to be exposed are ports. Something like #1 above which is not
a netdev but rather the conduit to the chip.
Note: We already have a working model like the above with bridging today.
If I attach a port to a bridge I can in fact get/set the fdb entries from/to
the bridge as well as the ones offloaded on the chip/hardware.
I should be able to do the same with stats etc.
It seems to make sense to extend it to other features.
The litmus test is: Can I have my iproute2 please? If you can do that
then you are allowing me to do bridges, routes, ports, vxlan, tunnels,
qos etc., whatever the chip's capabilities allow for; otherwise I am
terminating at the CPU level.

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 19:35                 ` Neil Horman
  2014-03-25 20:11                   ` Florian Fainelli
@ 2014-03-25 20:56                   ` Jamal Hadi Salim
  2014-03-25 21:19                     ` Thomas Graf
  2014-03-26 11:10                     ` Neil Horman
  1 sibling, 2 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-25 20:56 UTC (permalink / raw)
  To: Neil Horman, Thomas Graf
  Cc: Jiri Pirko, Florian Fainelli, netdev, David Miller, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/25/14 15:35, Neil Horman wrote:
> On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:

>> How about a new device flag indicating pure L2 mode? Any L3 address
>> configuration would fail with EAFNOSUPP.
>>
> Yeah, we've discussed that before, and it seems like a good idea, though I'm not
> sure that its flexible enough.  It clearly prevents L3 operations on devices
> that can only do L2, which is great, but that may not be sufficient for some
> devices.  For example, what if you wanted to use ebtables on an L2 port where
> the hardware can't mirror the actions of a given table rule?  Do we need to
> expand out those capabilities?

There are two capability approaches:
a) You do things and let the kernel reject.
b) You discover the capabilities and do something more interesting.
We already do this kind of stuff in user tools today (a simple example
is querying the name->ifindex mapping).

What is missing is the ability to store richer capabilities which are not
just boolean in nature.
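
To illustrate the two approaches and the point about non-boolean
capabilities, here is a minimal sketch in Python. Everything in it (the
PortCaps record, its fields, set_fdb_entry, offload_or_fallback) is
hypothetical and not an existing kernel or iproute2 interface; it only
models the idea that a capability can carry a value, such as a table size,
rather than a yes/no bit.

```python
import errno
from dataclasses import dataclass


@dataclass(frozen=True)
class PortCaps:
    """Hypothetical per-port capability record (not a real kernel structure)."""
    l3_addressing: bool   # plain boolean capability
    fdb_table_size: int   # richer capability: how many FDB entries fit
    max_vlans: int        # richer capability: number of VLANs supported


class Port:
    def __init__(self, name, caps):
        self.name = name
        self.caps = caps
        self.fdb = set()

    def set_fdb_entry(self, mac):
        # Approach (a): just try it and let the "kernel" reject.
        if mac not in self.fdb and len(self.fdb) >= self.caps.fdb_table_size:
            raise OSError(errno.ENOSPC, "hardware FDB table full")
        self.fdb.add(mac)


def offload_or_fallback(port, macs):
    """Approach (b): look at the capabilities first, then decide."""
    room = max(port.caps.fdb_table_size - len(port.fdb), 0)
    to_hw, to_sw = macs[:room], macs[room:]
    for mac in to_hw:
        port.set_fdb_entry(mac)
    # Entries in to_sw would stay in the software path.
    return to_hw, to_sw
```

With a boolean "supports FDB offload" bit, the caller in (b) could not know
how many entries to push to hardware before hitting the rejection in (a).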



>
> Maybe I'm not being clear. I'm not suggesting that we abandon the use of a
> net_device to do any of this work, only that we add a layer of indirection to
> get to it.  By Augmenting the existing network device stack to allow
> registration of net_devices to arbitrary lists, rather than to a fixes
> per-net-namespace global device list, we can operate net_devices that are only
> visible within the scope of a given switch fabric.  User space still works the
> same way, it just requires the specification of additional information when
> speaking to ports on a switch device that may not be directly accessible via the
> cpu.  For example, if a systems has a directly connected nic (em1), and a switch
> fabric with a master bridge port (sw1), and 10 external ports (sw1pX), we could
> access them all from user space via ip link show.  for example:
>
> 1) ip link show:
> em1
> sw1
>
> 2) ip link show sw1
> sw1
>
> 3) ip link show -p sw1
> sw1p0
> sw1p1
> sw1p2...
>
>
> The idea is to augment user space to allow the visibiliy of ports through the
> switch device, not directly, but using the same existing mechanisms.  We can
> reuse all the existing infrastruture, but with this model, control must pass
> through the switch device driver, allowing it to taylor available features by
> passing the netlink request on to the appropriate netdevice, or sending back an
> error itself.
>

I think I am with you mostly, just not on the visibility of a "master"
device.
Expose the ports. Users create bridges and bonds, and if the hardware is
capable it does the hard work to ensure consistency. No change in tools.

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 20:56                   ` Jamal Hadi Salim
@ 2014-03-25 21:19                     ` Thomas Graf
  2014-03-25 21:24                       ` Jamal Hadi Salim
  2014-03-26  7:21                       ` Jiri Pirko
  2014-03-26 11:10                     ` Neil Horman
  1 sibling, 2 replies; 125+ messages in thread
From: Thomas Graf @ 2014-03-25 21:19 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Neil Horman, Jiri Pirko, Florian Fainelli, netdev, David Miller,
	andy, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/25/14 at 04:56pm, Jamal Hadi Salim wrote:
> On 03/25/14 15:35, Neil Horman wrote:
> >1) ip link show:
> >em1
> >sw1
> >
> >2) ip link show sw1
> >sw1
> >
> >3) ip link show -p sw1
> >sw1p0
> >sw1p1
> >sw1p2...
> >
> >
> >The idea is to augment user space to allow the visibiliy of ports through the
> >switch device, not directly, but using the same existing mechanisms.  We can
> >reuse all the existing infrastruture, but with this model, control must pass
> >through the switch device driver, allowing it to taylor available features by
> >passing the netlink request on to the appropriate netdevice, or sending back an
> >error itself.
> >
> 
> I think i am with you mostly - just not on the visibility of a "master"
> device.
> Expose the ports. Users create bridges bonds and if the hardware is
> capable it does the hard work to ensure consistency. No change in tools.

Exactly. This is what I meant as well. No change in tools.

It's not just about changing ip link. We have tons of existing
applications out there using Netlink and they will expect all ports
visible if they issue RTM_GETLINK with NLM_F_DUMP.

What speaks against exposing it by default? To me, the model should
not differ from a multi-port NIC, for which we also expose all ports
without any indirection.
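
For reference, the request such applications send is small. The sketch
below builds the RTM_GETLINK dump message in Python with struct. The
numeric constants are the values from linux/netlink.h and
linux/rtnetlink.h; the helper name build_getlink_dump is made up for
illustration. It only constructs the bytes and does not open a socket.

```python
import struct

# Values from linux/netlink.h and linux/rtnetlink.h.
RTM_GETLINK = 18
NLM_F_REQUEST = 0x01
NLM_F_ROOT = 0x100
NLM_F_MATCH = 0x200
NLM_F_DUMP = NLM_F_ROOT | NLM_F_MATCH
AF_UNSPEC = 0

# struct nlmsghdr: len, type, flags, seq, pid
NLMSGHDR = struct.Struct("=IHHII")
# struct ifinfomsg: family, pad byte, type, index, flags, change
IFINFOMSG = struct.Struct("=BxHiII")


def build_getlink_dump(seq=1):
    """Return the bytes of an RTM_GETLINK request with NLM_F_DUMP set."""
    body = IFINFOMSG.pack(AF_UNSPEC, 0, 0, 0, 0)
    total = NLMSGHDR.size + len(body)
    header = NLMSGHDR.pack(total, RTM_GETLINK,
                           NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return header + body
```

Nothing in this request names a switch or a parent device, so ports that
are only reachable through an extra level of indirection would not show up
for callers like this.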

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 20:31                     ` Neil Horman
@ 2014-03-25 21:22                       ` Jamal Hadi Salim
  0 siblings, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-25 21:22 UTC (permalink / raw)
  To: Neil Horman, Florian Fainelli
  Cc: Thomas Graf, Jiri Pirko, netdev, David Miller, Andy Gospodarek,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/25/14 16:31, Neil Horman wrote:
> On Tue, Mar 25, 2014 at 01:11:55PM -0700, Florian Fainelli wrote:
>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>> On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:

>> Not sure if I got this right, but there might be additional control
>> knobs required for specific Ethernet switch features that do not map
>> nicely, if at all with existing interfaces provided by ip/tc,
>> ethtool... although I guess one would say, well, then go add these
>> APIs instead of creating "extended" ones?
> Ostensibly yes, but I'm not well versed enough in what those interfaces are, to
> know for certain.  I definately agree however, that if a given interface outside
> the scope of network device control is required (say for example, direct access
> to a switch fabrics cam lookup table), then you are correct, we should develop
> those api's rather than shoehorn them into a net_device model
>

Sorry, I should have started from the last email and read backwards.
So things like stats that are only available via some chips but not
others come to mind. But we know how to carry those things to user
space via netlink. Give me an IFLA_KIND and I can give you access to
set/get extended features.
I don't see much disagreement on the end goals from anyone:
Thou shalt make current tools work.
The challenge perhaps may be the different implementation approaches.
It seems we are also agreeing that there will be some conduit driver
which will expose ports to the kernel. It will likely own the ASIC
feature set and can be queried for capabilities indirectly.
It can talk DSA, PCI etc.

For when things don't fit: I would expect this "driver" to
also be the conduit to the chip resources and to do any
translation that needs to happen between the kernel view of the
abstraction and the chip view.

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 21:19                     ` Thomas Graf
@ 2014-03-25 21:24                       ` Jamal Hadi Salim
  2014-03-26  7:21                       ` Jiri Pirko
  1 sibling, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-25 21:24 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Neil Horman, Jiri Pirko, Florian Fainelli, netdev, David Miller,
	andy, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/25/14 17:19, Thomas Graf wrote:

>
> Exactly. This is what I meant as well. No change in tools.
>
> It's not just about changing ip link. We have tons of existing
> applications out there using Netlink and they will expect all ports
> visible if they issue RTM_GETLINK with NLM_F_DUMP.
>

Amen Brother.
Linux is the API.
(Ok, I know i am preaching to the choir ..)

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 20:11                   ` Florian Fainelli
  2014-03-25 20:31                     ` Neil Horman
@ 2014-03-25 21:26                     ` Thomas Graf
  2014-03-25 21:42                       ` Florian Fainelli
  2014-03-26  5:37                     ` Roopa Prabhu
  2 siblings, 1 reply; 125+ messages in thread
From: Thomas Graf @ 2014-03-25 21:26 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Neil Horman, Jamal Hadi Salim, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On 03/25/14 at 01:11pm, Florian Fainelli wrote:
> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> > On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:
> >> On top of that, like for VFs, we provide extended nested attributes or
> >> alternate control paths such as via OVS that provide the additional
> >> flexibility and control required by the more advanced devices.
> > I'm sorry, I don't understand the relevance here.  Are you suggesting that to
> > make this modification, we would need to augment more than a single set of
> > netlink control paths?
> 
> Not sure if I got this right, but there might be additional control
> knobs required for specific Ethernet switch features that do not map
> nicely, if at all with existing interfaces provided by ip/tc,
> ethtool... although I guess one would say, well, then go add these
> APIs instead of creating "extended" ones?

Exactly. Some of the logic and configuration structure will not
fit the existing model and is too switch-specific to justify
extending the generic link model. It also seems likely that some
knobs will be switch-specific. Not an issue as long as they are
tunneled through the standard API and every effort is made
to generalize where it makes sense.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 20:00                   ` Florian Fainelli
@ 2014-03-25 21:39                     ` tgraf
  2014-03-25 22:08                       ` Jamal Hadi Salim
  0 siblings, 1 reply; 125+ messages in thread
From: tgraf @ 2014-03-25 21:39 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Neil Horman, Jamal Hadi Salim, Jiri Pirko, netdev, David Miller,
	andy, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Felix Fietkau

On 03/25/14 at 01:00pm, Florian Fainelli wrote:
> 2014-03-25 12:40 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> > On Tue, Mar 25, 2014 at 11:33:22AM -0700, Florian Fainelli wrote:
> >> 2014-03-25 10:39 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> >> > 1) Ports on a switch chip are accessed using net_device structures, but
> >> > registered to a private list contained within the switch device, rather than to
> >> > the net namespaces device list.
> >>
> >> I think this would be a good model for simple embedded switches that
> >> only support 802.1q VLANs for instance, since we won't be able to get
> >> any actual data to be sent/received to any per-port netdevice, those
> >> per-port netdevices would only be effective for control at the L2
> >> level.
> >>
> >> For switches that do support tags, I think we do want per-port
> >> netdevices to appear in the regular netdevices namespace as those
> >> might be able to get actual data sent to/received from by using these
> >> tags, at least momentarily until a higher-level entity decides
> >> otherwise (e.g: by bridging, disabling interfaces...).
> >>
> > Well, perhaps thats the answer then  - Augment the model to allow for the
> > registration of net_devices to private lists within a switch device, but don't
> > require it.  If a given chip supports the assignment of L3 data by the cpu, the
> > use of iptables etc, let the switch driver do so, its not like we can't do that
> > already, but for the smaller devices, keeping them tightly controled via the
> > switch driver in such a way that user space can only access them with permission
> > from the switch driver.
> >
> > Does that seem reasonable?
> 
> Sure that looks good, the switch driver will know what L2/L3 features
> it has, and the higher levels will know how to utilize that
> information to construct the net_devices stacking and namespacing.


I think all it takes is to correctly apply the existing separation,
which is already available but not applied right now.

We already have the L2/L3 separation in place:

net_device vs in_device/inet6_dev/....

A pure L2 device that will never do L3 on the CPU would only
need to set a flag which we check before allocating an in_device,
thereby preventing all the L3 configuration from being exposed.
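
As a rough model of that check, here is a sketch in Python. The flag name
IFF_L2_ONLY and the NetDevice class are hypothetical (no such flag exists
in the kernel today); the dictionary stands in for the in_device. The
errno used is EAFNOSUPPORT, which is the error suggested earlier in the
thread for rejected L3 address configuration.

```python
import errno

IFF_L2_ONLY = 0x1  # hypothetical flag, for illustration only


class NetDevice:
    def __init__(self, name, flags=0):
        self.name = name
        self.flags = flags
        self.in_device = None  # stands in for the per-device IPv4 state

    def get_in_device(self):
        # The flag is checked before any L3 state is allocated.
        if self.flags & IFF_L2_ONLY:
            return None
        if self.in_device is None:
            self.in_device = {"addrs": []}
        return self.in_device


def add_ipv4_address(dev, addr):
    in_dev = dev.get_in_device()
    if in_dev is None:
        raise OSError(errno.EAFNOSUPPORT, f"{dev.name}: L3 not supported")
    in_dev["addrs"].append(addr)
```

Because the L3 state is never allocated for a flagged device, every
configuration path that depends on it fails in one place instead of each
path needing its own check.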

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 21:26                     ` Thomas Graf
@ 2014-03-25 21:42                       ` Florian Fainelli
  2014-03-25 21:54                         ` Thomas Graf
  0 siblings, 1 reply; 125+ messages in thread
From: Florian Fainelli @ 2014-03-25 21:42 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Neil Horman, Jamal Hadi Salim, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

2014-03-25 14:26 GMT-07:00 Thomas Graf <tgraf@suug.ch>:
> On 03/25/14 at 01:11pm, Florian Fainelli wrote:
>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>> > On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:
>> >> On top of that, like for VFs, we provide extended nested attributes or
>> >> alternate control paths such as via OVS that provide the additional
>> >> flexibility and control required by the more advanced devices.
>> > I'm sorry, I don't understand the relevance here.  Are you suggesting that to
>> > make this modification, we would need to augment more than a single set of
>> > netlink control paths?
>>
>> Not sure if I got this right, but there might be additional control
>> knobs required for specific Ethernet switch features that do not map
>> nicely, if at all with existing interfaces provided by ip/tc,
>> ethtool... although I guess one would say, well, then go add these
>> APIs instead of creating "extended" ones?
>
> Exactly. Some of the logic and configuration structure will not
> fit the existing model and is too switch specific to justify
> extending the generic link model. It also seems likely that some
> knobs will be switch specific. Not an issue as long as they are
> tunneled through the standard API and any effort is undertaken
> to generalize where it makes sense.

The question is how you would imagine conveying these switch-specific
features that do not (yet) map into a general feature. Shall we go for
a separate netlink family, just like what Felix did in OpenWrt with
swconfig, without much stability from one kernel release to another,
as we migrate what was once a switch-specific feature into a general
Ethernet switch feature?
-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 21:42                       ` Florian Fainelli
@ 2014-03-25 21:54                         ` Thomas Graf
  2014-03-26 10:55                           ` Neil Horman
  0 siblings, 1 reply; 125+ messages in thread
From: Thomas Graf @ 2014-03-25 21:54 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Neil Horman, Jamal Hadi Salim, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On 03/25/14 at 02:42pm, Florian Fainelli wrote:
> 2014-03-25 14:26 GMT-07:00 Thomas Graf <tgraf@suug.ch>:
> > On 03/25/14 at 01:11pm, Florian Fainelli wrote:
> >> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> >> > On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:
> >> >> On top of that, like for VFs, we provide extended nested attributes or
> >> >> alternate control paths such as via OVS that provide the additional
> >> >> flexibility and control required by the more advanced devices.
> >> > I'm sorry, I don't understand the relevance here.  Are you suggesting that to
> >> > make this modification, we would need to augment more than a single set of
> >> > netlink control paths?
> >>
> >> Not sure if I got this right, but there might be additional control
> >> knobs required for specific Ethernet switch features that do not map
> >> nicely, if at all with existing interfaces provided by ip/tc,
> >> ethtool... although I guess one would say, well, then go add these
> >> APIs instead of creating "extended" ones?
> >
> > Exactly. Some of the logic and configuration structure will not
> > fit the existing model and is too switch specific to justify
> > extending the generic link model. It also seems likely that some
> > knobs will be switch specific. Not an issue as long as they are
> > tunneled through the standard API and any effort is undertaken
> > to generalize where it makes sense.
> 
> The question is how you would imagine conveying these switch-specific
> features that do not (yet) map into a general feature, shall we go for
> a separate netlink family, just like what Felix did in OpenWrt with
> swconfig, without much stability from one kernel release to another,
> as we migrate what was once a switch specific feature into a general
> Ethernet switch feature?

I believe it is essential to transport them as part of the standard
Netlink API and have a single channel for all configuration. It also
eases message synchronization.

We also want to enforce strict ABI compatibility rules just like
for all other Netlink users. As we know, it's not difficult to design
the message format in a way that allows for extensibility and backwards
compatibility.
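
The usual way Netlink gets this property is its type-length-value
attribute layout: a parser that does not know an attribute type can skip
it using the length field. The Python sketch below packs and parses
attributes in the struct nlattr layout (16-bit length, 16-bit type,
payload padded to 4 bytes). The attribute type numbers used with it are
arbitrary placeholders, and the helper names are made up for illustration.

```python
import struct

NLA_HDR = struct.Struct("=HH")  # nla_len, nla_type


def nla_align(n):
    """Round up to the 4-byte Netlink attribute alignment."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    length = NLA_HDR.size + len(payload)  # length excludes trailing padding
    pad = b"\0" * (nla_align(length) - length)
    return NLA_HDR.pack(length, attr_type) + payload + pad


def parse_attrs(buf, known_types):
    """Return {type: payload} for known types, skipping unknown ones."""
    out, off = {}, 0
    while off + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, off)
        if length < NLA_HDR.size or off + length > len(buf):
            break  # malformed attribute, stop parsing
        if attr_type in known_types:
            out[attr_type] = buf[off + NLA_HDR.size:off + length]
        off += nla_align(length)
    return out
```

An older tool that only knows types 1 and 2 can still parse a message in
which a newer kernel added a switch-specific type 99; it simply never
looks at the extra attribute.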

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 21:39                     ` tgraf
@ 2014-03-25 22:08                       ` Jamal Hadi Salim
  2014-03-26  5:48                         ` Roopa Prabhu
  0 siblings, 1 reply; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-25 22:08 UTC (permalink / raw)
  To: tgraf, Florian Fainelli
  Cc: Neil Horman, Jiri Pirko, netdev, David Miller, andy, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Felix Fietkau

On 03/25/14 17:39, tgraf wrote:
> On 03/25/14 at 01:00pm, Florian Fainelli wrote:

>
> I think all it takes is to correctly apply the existing separation
> which is already available but not applied right now.
>
> We already have the L2/L3 separation in place:
>
> net_device vs in_device/inet6_dev/....
>
> A pure L2 device that will never do L3 on the CPU would only
> need to set a flag which we check before allocating a in_device
> and therefore prevent from all the L3 configs to be exposed.


I think we need much deeper discussion on the topic of other
functions that may not be directly connected to netdevs
(v4/6 forwarding, ACL, etc).
In my opinion - if a chip knows how to do L3, then i have
a choice to just send a FIB add via netlink and specify
where it goes (hardware vs software or both).
The bridge ports with underlying hardware FDB entries as
an example already work this way (although i am not fond
of the naming convention used).

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 20:11                   ` Florian Fainelli
  2014-03-25 20:31                     ` Neil Horman
  2014-03-25 21:26                     ` Thomas Graf
@ 2014-03-26  5:37                     ` Roopa Prabhu
  2014-03-26 10:54                       ` Jamal Hadi Salim
  2 siblings, 1 reply; 125+ messages in thread
From: Roopa Prabhu @ 2014-03-26  5:37 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Neil Horman, Thomas Graf, Jamal Hadi Salim, Jiri Pirko, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

On 3/25/14, 1:11 PM, Florian Fainelli wrote:
> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>> On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:
>>> On 03/25/14 at 01:39pm, Neil Horman wrote:
>>>> No, but it would be really nice if these smaller devices could take advantage of
>>>> this infrastructure.  Looking at it, I don't see why thats not possible.  The
>>>> big trick (as we've discussed in the past), is using a net_device structure to
>>>> take advantage of all the features that net_devices offer while not enabling the
>>>> device specific features that some hardware doesn't allow.
>>>>
>>>> For instance the broadcom chips that live in many wireless routers would be well
>>>> served by the model jiri has here as far as Media level interface control is
>>>> concerned (i.e. ifup/down/speed/duplex/etc), but its a bit lacking in that
>>>> net_devices are assumed to support L3 protocol configuration (i.e. they can have
>>>> ip addresses assigned to them), which you can't IIRC do on these chips.
>>> How about a new device flag indicating pure L2 mode? Any L3 address
>>> configuration would fail with EAFNOSUPP.
>>>
>> Yeah, we've discussed that before, and it seems like a good idea, though I'm not
>> sure that its flexible enough.  It clearly prevents L3 operations on devices
>> that can only do L2, which is great, but that may not be sufficient for some
>> devices.  For example, what if you wanted to use ebtables on an L2 port where
>> the hardware can't mirror the actions of a given table rule?  Do we need to
>> expand out those capabilities?
>>
>>>> Would it be worth considering a private interface model?  That is to say:
>>>>
>>>> 1) Ports on a switch chip are accessed using net_device structures, but
>>>> registered to a private list contained within the switch device, rather than to
>>>> the net namespaces device list.
>>>> 2) Access to the switch ports via user space is done through the master switch
>>>> interface with additional netlink attributes specifying the port on the switch
>>>> to access (or not to access the master switch device directly)
>>>> Such a model I think might fit well with Jiri's code here and provide greater
>>>> flexibility for a wider range of devices.  It would of course require
>>>> augmentation for user space, but the changes would be additive, so I think they
>>>> would be reasonable.  This would also allow the switch device to have a hook in
>>>> the control path to block or allow features that the hardware may or may not
>>>> support while still being able to use the existing net_device infrastructure to
>>>> support these operations as they are normally carried out.
>>> I believe this would defeat the main advantage of reusing net_device
>>> model which is compatibility with the well established standard toolset.
>>>
>>> In an ideal world, we represent what is possible using the existing
>>> net_device model.
>>>
>> Maybe I'm not being clear. I'm not suggesting that we abandon the use of a
>> net_device to do any of this work, only that we add a layer of indirection to
>> get to it.  By Augmenting the existing network device stack to allow
>> registration of net_devices to arbitrary lists, rather than to a fixes
>> per-net-namespace global device list, we can operate net_devices that are only
>> visible within the scope of a given switch fabric.  User space still works the
>> same way, it just requires the specification of additional information when
>> speaking to ports on a switch device that may not be directly accessible via the
>> cpu.  For example, if a systems has a directly connected nic (em1), and a switch
>> fabric with a master bridge port (sw1), and 10 external ports (sw1pX), we could
>> access them all from user space via ip link show.  for example:
>>
>> 1) ip link show:
>> em1
>> sw1
>>
>> 2) ip link show sw1
>> sw1
>>
>> 3) ip link show -p sw1
>> sw1p0
>> sw1p1
>> sw1p2...
> I was scratching my head about why we might want to expose sw1 as a
> separate net_device, but I think this is a good model as it allows for
> a "seamless" switch awareness to be constructed, and allows for
> controlling the CPU/management port(s) of a given Ethernet switch
> separately, which is valuable. It also makes it possible to expose the
> multiple CPU/management ports of a given switch when that exists, and
> finally, there might be special firmware running on the Ethernet
> switch, and that specific 'sw1' net_device could be the one to use to
> talk to this via sockets, ioctls, whatever.
>
>
Sorry about getting on this thread late and possibly in the middle.
I agree on the idea of keeping the ports linked to the master switch dev
(or the 'conduit' to the switch chip) via a private list instead of the
master-slave relationship proposed earlier.
By private I mean the netdev->priv linkage to the master switch dev,
not keeping the ports from being exposed to the user.

We think it's better to keep the switch ports exposed like any other netdev
on Linux.
This approach will make the switch ports look exactly like a NIC port,
and all tools will continue to work seamlessly. The switch port
operations could internally be forwarded to the switch netdev (sw1 in
the above case).

Example:
$ ip link set dev sw1p0 up
$ ethtool -S sw1p0


Whether sw1 is needed as a separate netdev existing on the system is
debatable.
In most cases the switch port driver (API) can talk to the switch chip
driver without a switch netdev in between.
But there are cases where a switch netdev might become necessary for
switch-chip-specific operations (which have probably been discussed on
this thread). An example could be a global ACL rule that applies to all
switch ports. One can argue that this can be applied on individual
switch ports and the switch driver can take care of consolidating or
optimally programming the ACL rule in the switch chip.
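
A minimal sketch of that consolidation step, in Python: rules configured
on every port are hoisted into a single chip-wide rule, and the rest stay
per-port. The function name, the rule strings and the data layout are all
made up for illustration; a real driver would work on its own rule
representation.

```python
def consolidate_acl(port_rules, all_ports):
    """Split per-port rules into chip-wide rules and per-port leftovers.

    port_rules: {port_name: iterable of rules}
    Returns (global_rules, per_port_rules).
    """
    ports = set(all_ports)
    if not ports:
        return set(), {}
    # A rule is global only if every port of the switch carries it.
    global_rules = set.intersection(
        *(set(port_rules.get(p, ())) for p in ports))
    per_port = {p: set(port_rules.get(p, ())) - global_rules for p in ports}
    # Drop ports that have nothing left after hoisting.
    return global_rules, {p: r for p, r in per_port.items() if r}
```

With this, the user keeps configuring sw1p0, sw1p1, ... individually with
existing tools, and the decision to use one global hardware entry stays
inside the driver.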

Thanks,
Roopa

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 22:08                       ` Jamal Hadi Salim
@ 2014-03-26  5:48                         ` Roopa Prabhu
  0 siblings, 0 replies; 125+ messages in thread
From: Roopa Prabhu @ 2014-03-26  5:48 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: tgraf, Florian Fainelli, Neil Horman, Jiri Pirko, netdev,
	David Miller, andy, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Felix Fietkau, Shrijeet Mukherjee

On 3/25/14, 3:08 PM, Jamal Hadi Salim wrote:
> On 03/25/14 17:39, tgraf wrote:
>> On 03/25/14 at 01:00pm, Florian Fainelli wrote:
>
>>
>> I think all it takes is to correctly apply the existing separation
>> which is already available but not applied right now.
>>
>> We already have the L2/L3 separation in place:
>>
>> net_device vs in_device/inet6_dev/....
>>
>> A pure L2 device that will never do L3 on the CPU would only
>> need to set a flag which we check before allocating a in_device
>> and therefore prevent from all the L3 configs to be exposed.
>
>
> I think we need much deeper discussion on the topic of other
> functions that may not be directly connected to netdevs
> (v4/6 forwarding, ACL, etc).

Agreed.  Some of our generic switch api challenges have been in these areas.

thanks,
Roopa

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 21:19                     ` Thomas Graf
  2014-03-25 21:24                       ` Jamal Hadi Salim
@ 2014-03-26  7:21                       ` Jiri Pirko
  2014-03-26 11:00                         ` Jamal Hadi Salim
  1 sibling, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-26  7:21 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Jamal Hadi Salim, Neil Horman, Florian Fainelli, netdev,
	David Miller, andy, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

Tue, Mar 25, 2014 at 10:19:45PM CET, tgraf@suug.ch wrote:
>On 03/25/14 at 04:56pm, Jamal Hadi Salim wrote:
>> On 03/25/14 15:35, Neil Horman wrote:
>> >1) ip link show:
>> >em1
>> >sw1
>> >
>> >2) ip link show sw1
>> >sw1
>> >
>> >3) ip link show -p sw1
>> >sw1p0
>> >sw1p1
>> >sw1p2...
>> >
>> >
>> >The idea is to augment user space to allow the visibiliy of ports through the
>> >switch device, not directly, but using the same existing mechanisms.  We can
>> >reuse all the existing infrastruture, but with this model, control must pass
>> >through the switch device driver, allowing it to taylor available features by
>> >passing the netlink request on to the appropriate netdevice, or sending back an
>> >error itself.
>> >
>> 
>> I think i am with you mostly - just not on the visibility of a "master"
>> device.
>> Expose the ports. Users create bridges bonds and if the hardware is
>> capable it does the hard work to ensure consistency. No change in tools.

Creating a bond of the switch ports does not fit into the picture at
all. These port netdevices are just a representation of a port, not
an actual netdevice that the data goes through.

Please consider the case I already gave in this thread:

        switch chip
   ------------------------
    |  |  |  |  |  |   |               CPU
   p1 p2 ...pn px py  MNGMNT       -----------
                |  |   |              pcie
                |  |   |         ---------------
                |  |   |          |  NIC0 NIC1
                |  |   ---pcie-----   |   |
                |  ------someMII-------   |
                ---------someMII-----------

        NIC0 and NIC1 are ordinary NICs, like 8139too for example, with no
        notion that they are connected to a switch. They are completely
        independent of the mngmnt iface.


There, actual data is coming through NIC0 and NIC1 which is completely separated
from the p1...pn,px.px port representations.

And if you understand it this way, it makes perfect sense to have a master device
for these port representations.

Btw note this model fits into existing DSA as well I believe. The actual DSA
devices would act as NIC0, NIC1 and what would be added is the switch
representation (a couple more netdevices to represent actual HW ports and
their master)

>
>Exactly. This is what I meant as well. No change in tools.

I agree.

>
>It's not just about changing ip link. We have tons of existing
>applications out there using Netlink and they will expect all ports
>visible if they issue RTM_GETLINK with NLM_F_DUMP.
>
>What speaks against exposing it by default? To me, the model should
>not differ from a multi port NIC, for which we also expose all ports
>without any indirection.

Note that you won't get actual data through these ports (visible to
CPU). That is where it differs from multiport NIC.


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 17:39             ` Neil Horman
                                 ` (2 preceding siblings ...)
  2014-03-25 20:46               ` Jamal Hadi Salim
@ 2014-03-26  7:24               ` Jiri Pirko
  3 siblings, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-26  7:24 UTC (permalink / raw)
  To: Neil Horman
  Cc: Jamal Hadi Salim, Florian Fainelli, netdev, David Miller, andy,
	tgraf, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

Tue, Mar 25, 2014 at 06:39:27PM CET, nhorman@tuxdriver.com wrote:
>On Mon, Mar 24, 2014 at 07:07:35PM -0400, Jamal Hadi Salim wrote:
>> On 03/22/14 05:48, Jiri Pirko wrote:
>> >Fri, Mar 21, 2014 at 01:04:20PM CET, jhs@mojatatu.com wrote:
>> 
>> >Hmm. This got me thinking about netdev and switches well and perhaps the
>> >switchdev api could be mostly implemented by couple of more ndos and
>> >feature flags. That way we could stick to your immortal netdev :)
>> >
>> >
>> 
>> Perhaps ;->
>> 
>> >>
>> >>In my view: that (immortal) device for L2/bridging is the bridge or
>> >>maybe a more barebone version of the bridge (since it has gained a
>> >>little more weight in recent times).
>> >
>> >Well, I do not think that bridge is ideal abstraction for modern switch
>> >chips. Bridge is very limited.
>> >
>> 
>> True - but i was more thinking of being inclusive of the smaller
>> devices. They are mostly L2 only and in very limited scope. And that's
>> probably 95% of the population. The things you are talking about
>> are very high end and they can do more. Florian's taxonomy was useful.
>> 
>> >But I don't necessarily think it is needed to "mask" as a bridge or mimic a
>> >bridge in any way. DSA does not do that either.
>No, but it would be really nice if these smaller devices could take advantage of
>this infrastructure.  Looking at it, I don't see why that's not possible.  The
>big trick (as we've discussed in the past), is using a net_device structure to
>take advantage of all the features that net_devices offer while not enabling the
>device specific features that some hardware doesn't allow.
>
>For instance the broadcom chips that live in many wireless routers would be well
>served by the model Jiri has here as far as media-level interface control is
>concerned (i.e. ifup/down/speed/duplex/etc), but it's a bit lacking in that
>net_devices are assumed to support L3 protocol configuration (i.e. they can have
>ip addresses assigned to them), which you can't IIRC do on these chips.
>
>Would it be worth considering a private interface model?  That is to say:

Personally, I'm strongly against this. All netdevices should stay in the net
namespace list. If you break this, I expect many unexpected issues.
Besides, there is not really a reason for this breakage.


>
>1) Ports on a switch chip are accessed using net_device structures, but
>registered to a private list contained within the switch device, rather than to
>the net namespaces device list.
>
>2) Access to the switch ports via user space is done through the master switch
>interface with additional netlink attributes specifying the port on the switch
>to access (or none, to access the master switch device directly)
>
>
>Such a model I think might fit well with Jiri's code here and provide greater
>flexibility for a wider range of devices.  It would of course require
>augmentation for user space, but the changes would be additive, so I think they
>would be reasonable.  This would also allow the switch device to have a hook in
>the control path to block or allow features that the hardware may or may not
>support while still being able to use the existing net_device infrastructure to
>support these operations as they are normally carried out.
>
>Best
>Neil
>


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26  5:37                     ` Roopa Prabhu
@ 2014-03-26 10:54                       ` Jamal Hadi Salim
  2014-03-26 15:31                         ` John W. Linville
  2014-03-26 16:54                         ` Roopa Prabhu
  0 siblings, 2 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 10:54 UTC (permalink / raw)
  To: Roopa Prabhu, Florian Fainelli
  Cc: Neil Horman, Thomas Graf, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Shrijeet Mukherjee

On 03/26/14 01:37, Roopa Prabhu wrote:
> On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:

> Sorry about getting on this thread late and possibly in the middle.
> Agree on the idea of keeping the ports linked to the master switch dev
> (or the 'conduit' to the switch chip) via private list instead of the
> master-slave relationship proposed earlier.
> By private i mean the netdev->priv linkage to the master switch dev and
> not really keeping the ports from being exposed to the user.
>
> We think it's better to keep the switch ports exposed as any other netdev
> on linux.
>   This approach will make the switch ports look exactly like a nic port
> and all tools will continue to work seamlessly. The switch port
> operations could internally be forwarded to the switch netdev (sw1 in
> the above case).
>
> example:
> $ip link set dev sw1p0 up
> $ethtool -S sw1p0
>

I like the approach. I know the above is a simple version, but i am
assuming you also mean i can do things like
ip route add ...
bridge fdb add ... (and if you like your brctl go ahead)
bonding ...
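
For illustration only, the forwarding Roopa describes could be modelled
roughly as below. This is a user-space toy, not kernel code: SwitchDriver,
PortNetdev and their methods are made-up names, and `priv` merely stands in
for the netdev->priv linkage mentioned above. The point is that the port
looks like any other netdev to the tools, while each operation ends up at
the chip driver without a visible master device in between.

```python
class SwitchDriver:
    """Hypothetical chip driver; records what it would program into the ASIC."""

    def __init__(self, nports):
        self.admin_up = [False] * nports
        self.tx_packets = [0] * nports

    def port_set_admin(self, index, up):
        self.admin_up[index] = up

    def port_stats(self, index):
        return {"tx_packets": self.tx_packets[index]}


class PortNetdev:
    """Stand-in for a per-port netdev; `priv` plays the role of netdev->priv."""

    def __init__(self, name, driver, index):
        self.name = name
        self.priv = (driver, index)   # private link to the switch driver

    def set_up(self):                 # roughly: ip link set dev sw1p0 up
        driver, index = self.priv
        driver.port_set_admin(index, True)

    def get_stats(self):              # roughly: ethtool -S sw1p0
        driver, index = self.priv
        return driver.port_stats(index)


driver = SwitchDriver(nports=4)
ports = [PortNetdev(f"sw1p{i}", driver, i) for i in range(4)]
ports[0].set_up()                     # only port 0 is brought up in the chip
```

Bonding or bridging these ports would then be one more operation the
driver gets to see and, if the chip is capable, program into hardware.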



>
> whether sw1 is needed as a separate netdev existing on the system is
> debatable.

I don't see a need to expose it. For one, it will be confusing to have this
netdev whose only task is to control the chip.

> Most cases the switch port driver (API) can talk to the switch chip
> driver without a switch netdev in between.
> But there are cases where a switch netdev might become necessary for
> switch chip specific operations (which probably has been discussed on
> this thread). An example could be a global acl rule that applies to all
> switch ports. One can argue that this can be applied on individual
> switch ports and the switch driver can take care of consolidating or
> optimally programming the acl rule in the switch chip.
>

There are a lot of things which don't tie to a specific port.
I think these should be transparent to the chip. If i add a route
and the decision is for that route to go to the chip, then it
shows up at the driver and it goes to the ASIC.
If i am dumping a fib table and some parts of it sit in the chip,
then whatever interfaces would end up querying the chip.

cheers,
jamal


> Thanks,
> Roopa
>
>
>
>
>


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 21:54                         ` Thomas Graf
@ 2014-03-26 10:55                           ` Neil Horman
  0 siblings, 0 replies; 125+ messages in thread
From: Neil Horman @ 2014-03-26 10:55 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Florian Fainelli, Jamal Hadi Salim, Jiri Pirko, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek

On Tue, Mar 25, 2014 at 09:54:53PM +0000, Thomas Graf wrote:
> On 03/25/14 at 02:42pm, Florian Fainelli wrote:
> > 2014-03-25 14:26 GMT-07:00 Thomas Graf <tgraf@suug.ch>:
> > > On 03/25/14 at 01:11pm, Florian Fainelli wrote:
> > >> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> > >> > On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:
> > >> >> On top of that, like for VFs, we provide extended nested attributes or
> > >> >> alternate control paths such as via OVS that provide the additional
> > >> >> flexibility and control required by the more advanced devices.
> > >> > I'm sorry, I don't understand the relevance here.  Are you suggesting that to
> > >> > make this modification, we would need to augment more than a single set of
> > >> > netlink control paths?
> > >>
> > >> Not sure if I got this right, but there might be additional control
> > >> knobs required for specific Ethernet switch features that do not map
> > >> nicely, if at all with existing interfaces provided by ip/tc,
> > >> ethtool... although I guess one would say, well, then go add these
> > >> APIs instead of creating "extended" ones?
> > >
> > > Exactly. Some of the logic and configuration structure will not
> > > fit the existing model and is too switch specific to justify
> > > extending the generic link model. It also seems likely that some
> > > knobs will be switch specific. Not an issue as long as they are
> > > tunneled through the standard API and any effort is undertaken
> > > to generalize where it makes sense.
> > 
> > The question is how you would imagine conveying these switch-specific
> > features that do not (yet) map into a general feature, shall we go for
> > a separate netlink family, just like what Felix did in OpenWrt with
> > swconfig, without much stability from one kernel release to another,
> > as we migrate what was once a switch specific feature into a general
> > Ethernet switch feature?
> 
> I believe it is essential to transport them as part of the standard
> Netlink API and have a single channel for all configuration. It also
> eases message synchronization.
> 
> We also want to enforce strict ABI compatibility rules just like
> for all other Netlink users. As we know, it's not difficult to design
> the message format in a way to allow for extendability and backwards
> compatibility.
> 

+1, I'm not sure what the extent of any additional feature set is, for any
particular piece of hardware, but whatever it is almost certainly needs to be
baked into the same netlink protocol(s), so as to implicitly enforce ordering of
configuration.

Neil


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26  7:21                       ` Jiri Pirko
@ 2014-03-26 11:00                         ` Jamal Hadi Salim
  2014-03-26 11:06                           ` Jamal Hadi Salim
  2014-03-26 13:17                           ` Jiri Pirko
  0 siblings, 2 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 11:00 UTC (permalink / raw)
  To: Jiri Pirko, Thomas Graf
  Cc: Neil Horman, Florian Fainelli, netdev, David Miller, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/26/14 03:21, Jiri Pirko wrote:

>
> Creating bonding of the switch ports does not fit into the picture at
> all. These port netdevices are just a representation of a port. Not
> actual netdevice where the data goes through.
>
> Please consider the case I gave already to this thread:
>
>          switch chip
>     ------------------------
>      |  |  |  |  |  |   |               CPU
>     p1 p2 ...pn px py  MNGMNT       -----------
>                  |  |   |              pcie
>                  |  |   |         ---------------
>                  |  |   |          |  NIC0 NIC1
>                  |  |   ---pcie-----   |   |
>                  |  ------someMII-------   |
>                  ---------someMII-----------
>
>          NIC0 and NIC1 are ordinary NICs like 8139too for example with no
>          notion they are connected to a switch. They are completely
>          independent of the mngmnt iface.
>
> There, actual data is coming through NIC0 and NIC1, which are completely
> separate from the p1...pn,px,py port representations.
>
> And if you understand it this way, it makes perfect sense to have a
> master device for these port representations.
>

I think you may be looking at some specific board design which has those
two NICs; there are typically many variations of such boards and they
each have to be dealt with slightly differently by whoever is porting.
The important detail is:
we already know how to deal with NICs - remove them from the diagram
and then the discussion is about the switch chip. I am assuming
the MNGMNT interface is where the control is going to be, i.e.
I can send table updates there, control the different port
characteristics etc.
So Neil's option #1 is to have a driver controlling that interface
(->priv).
There are probably some DMA engines for the datapath for one or more
of the ports this driver exposes...
Replace PCIE with DSA, a simulation chip, or whichever of the gazillion
crazy interfaces the openwrt guys have to deal with, and we have
ourselves a consistent interface.


> Btw note this model fits into existing DSA as well I believe. The actual DSA
> devices would act as NIC0, NIC1 and what would be added is the switch
> representation (a couple more netdevices to represent actual HW ports and
> their master)
>

Refer to my comments above.


cheers,
jamal


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:00                         ` Jamal Hadi Salim
@ 2014-03-26 11:06                           ` Jamal Hadi Salim
  2014-03-26 11:31                             ` Jamal Hadi Salim
  2014-03-26 13:20                             ` Jiri Pirko
  2014-03-26 13:17                           ` Jiri Pirko
  1 sibling, 2 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 11:06 UTC (permalink / raw)
  To: Jiri Pirko, Thomas Graf
  Cc: Neil Horman, Florian Fainelli, netdev, David Miller, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/26/14 07:00, Jamal Hadi Salim wrote:
> On 03/26/14 03:21, Jiri Pirko wrote:
>
>>
>> Creating bonding of the switch ports does not fit into the picture at
>> all.

Sorry wanted to respond to the bonding part but my fingers kept typing;->

If it can't do bonding and the chip is capable of LAGging, it is simply
the wrong approach. I don't think what has been described so far will
have a problem doing bonding.



cheers,
jamal


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-25 20:56                   ` Jamal Hadi Salim
  2014-03-25 21:19                     ` Thomas Graf
@ 2014-03-26 11:10                     ` Neil Horman
  2014-03-26 11:29                       ` Thomas Graf
                                         ` (2 more replies)
  1 sibling, 3 replies; 125+ messages in thread
From: Neil Horman @ 2014-03-26 11:10 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Thomas Graf, Jiri Pirko, Florian Fainelli, netdev, David Miller,
	andy, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On Tue, Mar 25, 2014 at 04:56:38PM -0400, Jamal Hadi Salim wrote:
> On 03/25/14 15:35, Neil Horman wrote:
> >On Tue, Mar 25, 2014 at 06:00:09PM +0000, Thomas Graf wrote:
> 
> >>How about a new device flag indicating pure L2 mode? Any L3 address
> >>configuration would fail with EAFNOSUPPORT.
> >>
> >Yeah, we've discussed that before, and it seems like a good idea, though I'm not
> >sure that it's flexible enough.  It clearly prevents L3 operations on devices
> >that can only do L2, which is great, but that may not be sufficient for some
> >devices.  For example, what if you wanted to use ebtables on an L2 port where
> >the hardware can't mirror the actions of a given table rule?  Do we need to
> >expand out those capabilities?
> 
> There are two capability approaches.
> a) you do things and let the kernel reject
> b) You discover the capabilities and do something more interesting.
> We already do this kind of stuff in user tools today (simple example
> is name->ifindex mapping querying).
> 
> What is missing is ability to store richer capabilities which are not
> just boolean in nature.
> 
> 
> 
> >
> >Maybe I'm not being clear. I'm not suggesting that we abandon the use of a
> >net_device to do any of this work, only that we add a layer of indirection to
> >get to it.  By augmenting the existing network device stack to allow
> >registration of net_devices to arbitrary lists, rather than to a fixed
> >per-net-namespace global device list, we can operate net_devices that are only
> >visible within the scope of a given switch fabric.  User space still works the
> >same way, it just requires the specification of additional information when
> >speaking to ports on a switch device that may not be directly accessible via the
> >cpu.  For example, if a system has a directly connected nic (em1), and a switch
> >fabric with a master bridge port (sw1), and 10 external ports (sw1pX), we could
> >access them all from user space via ip link show.  For example:
> >
> >1) ip link show:
> >em1
> >sw1
> >
> >2) ip link show sw1
> >sw1
> >
> >3) ip link show -p sw1
> >sw1p0
> >sw1p1
> >sw1p2...
> >
> >
> >The idea is to augment user space to allow the visibility of ports through the
> >switch device, not directly, but using the same existing mechanisms.  We can
> >reuse all the existing infrastructure, but with this model, control must pass
> >through the switch device driver, allowing it to tailor available features by
> >passing the netlink request on to the appropriate netdevice, or sending back an
> >error itself.
> >
> 
> I think i am with you mostly - just not on the visibility of a "master"
> device.
> Expose the ports. Users create bridges bonds and if the hardware is
> capable it does the hard work to ensure consistency. No change in tools.
> 
But by creating net_devices that are registered in the current fashion we
implicitly agree to levels of functionality that are assumed to be available and
as such are not within the purview of a net_device to reject.  E.g. it is
assumed that a netdevice can filter frames using iptables/ebtables, limit
traffic using tc, etc. And if a switch fabric is short-cutting traffic so that
the cpu doesn't see it, those bits of functionality won't work.  I agree we
can likely work around that with richer feature capabilities, but such an
infrastructure would both require extensive kernel changes to fully cover the
set of existing features at a sufficient granularity, and require user space
changes to grok the feature set of a given device.  Not saying it's impossible
or even undesirable mind you, just that it's not any less invasive than what
I'm proposing.

Neil

> cheers,
> jamal
> 
> 


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:10                     ` Neil Horman
@ 2014-03-26 11:29                       ` Thomas Graf
  2014-03-26 12:58                         ` Jamal Hadi Salim
                                           ` (2 more replies)
  2014-03-26 12:19                       ` Jamal Hadi Salim
  2014-03-26 15:27                       ` John W. Linville
  2 siblings, 3 replies; 125+ messages in thread
From: Thomas Graf @ 2014-03-26 11:29 UTC (permalink / raw)
  To: Neil Horman
  Cc: Jamal Hadi Salim, Jiri Pirko, Florian Fainelli, netdev,
	David Miller, andy, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On 03/26/14 at 07:10am, Neil Horman wrote:
> But by creating net_devices that are registered in the current fashion we
> implicitly agree to levels of functionality that are assumed to be available and
> as such are not within the purview of a net_device to reject.  E.g. it is
> assumed that a netdevice can filter frames using iptables/ebtables, limit
> traffic using tc, etc.

I think this is the point where we disagree. We already have several
devices that hook into the rx handler and never have their packets
pass through either iptables or ebtables. Better examples of this are
macvtap or OVS.

What should happen is that these devices are given a chance to implement
the ACL in their own flow table. If no such facility exists, the rule
insertion should fall back to software mode if that is possible (an
OF capable switching chip could insert an 'upcall' flow), or as
a last resort return an error to indicate EOPNOTSUPP.
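
To make the intended order concrete, here is a rough user-space sketch
of that decision. It is only a model under stated assumptions: the Port
class, its capability fields and insert_acl() are hypothetical names and
do not correspond to any existing kernel interface.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class Port:
    """Hypothetical switch port; the capability fields are illustrative only."""
    name: str
    hw_table_free: int = 0        # free slots in the port's own flow table
    can_upcall: bool = False      # chip can punt matching traffic to the CPU
    hw_rules: list = field(default_factory=list)
    sw_rules: list = field(default_factory=list)


def insert_acl(port, rule):
    """Return 'hw' or 'sw' depending on where the rule landed.

    Raises OSError(EOPNOTSUPP) when neither path is available.
    """
    # 1) Give the device a chance to implement the rule in its flow table.
    if port.hw_table_free > 0:
        port.hw_rules.append(rule)
        port.hw_table_free -= 1
        return "hw"
    # 2) Fall back to software, assuming the chip can punt traffic to the CPU.
    if port.can_upcall:
        port.sw_rules.append(rule)
        return "sw"
    # 3) Last resort: tell the caller the rule cannot be honoured.
    raise OSError(errno.EOPNOTSUPP, "rule can be neither offloaded nor emulated")
```

With a port that has one free hardware slot and upcall support, the
first rule would land in hardware and the second in software; a port
with neither would get the error.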

> And if a switch fabric is short-cutting traffic so that
> the cpu doesn't see it, those bits of functionality won't work.  I agree we
> can likely work around that with richer feature capabilities, but such an
> infrastructure would both require extensive kernel changes to fully cover the
> set of existing features at a sufficient granularity, and require user space
> changes to grok the feature set of a given device.  Not saying it's impossible
> or even undesirable mind you, just that it's not any less invasive than what
> I'm proposing.

What I don't understand at this point is how hiding the ports behind
a master device would buy us anything. We would still need to abstract
the filtering capabilities of the ports at some level and hiding that
behind existing tools seems to be the most convenient way.


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:06                           ` Jamal Hadi Salim
@ 2014-03-26 11:31                             ` Jamal Hadi Salim
  2014-03-26 13:20                             ` Jiri Pirko
  1 sibling, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 11:31 UTC (permalink / raw)
  To: Jiri Pirko, Thomas Graf
  Cc: Neil Horman, Florian Fainelli, netdev, David Miller, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

[-- Attachment #1: Type: text/plain, Size: 389 bytes --]

On 03/26/14 07:06, Jamal Hadi Salim wrote:

> If it can't do bonding and the chip is capable of LAGging, it is simply
> the wrong approach. I don't think what has been described so far will
> have a problem doing bonding.

So here's half a coffee of ascii for yer, just to up the game a little.
If it can't do this when the chip is capable - then IMO it is the wrong
interface.

cheers,
jamal

[-- Attachment #2: x3 --]
[-- Type: text/plain, Size: 1255 bytes --]


   +---------------------+     +---------------------------------------+
   |       br5           |     |            br100                      |
   +-+--+----+---+-+----++     +----+------------+------+------------+-+
     |p0|    |p1 | |eth0|           | bond10     |      |  bond10     |
     ++-+    +-+-+ +----+           +-+---+-+---++      +-+---+--+--+-+
      ^        ^                     |   | |   |         |   |  |  |
      |        |                     |p6 | |p7 |         |p11|  |p8|
      |        |                     +-+-+ +-+-+         +-+-+  +-++
      |        |                       ^     ^             ^      ^
      |        |                       |     |             |      |
      |        |                       |     |             |      |
   +--v--------v-----------------------v-----v-------------v------v----+
   | switch driver (exposes ports and sets bonds/bridges in H/ware)    |
   +------------+--------------------------------+---------------------+
                | Switch control/data interfaces |
                +--------------------------------+
                                | 
                         +-------------+
                         |  Switch HW  |
                         +-------------+



* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:10                     ` Neil Horman
  2014-03-26 11:29                       ` Thomas Graf
@ 2014-03-26 12:19                       ` Jamal Hadi Salim
  2014-03-26 15:27                       ` John W. Linville
  2 siblings, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 12:19 UTC (permalink / raw)
  To: Neil Horman
  Cc: Thomas Graf, Jiri Pirko, Florian Fainelli, netdev, David Miller,
	andy, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/26/14 07:10, Neil Horman wrote:
> On Tue, Mar 25, 2014 at 04:56:38PM -0400, Jamal Hadi Salim wrote:

> But by creating net_devices that are registered in the current fashion we
> implicitly agree to levels of functionality that are assumed to be available and
> as such are not within the purview of a net_device to reject.  E.g. it is
> assumed that a netdevice can filter frames using iptables/ebtables, limit
> traffic using tc, etc.

More like my response to Roopa: there are things that can be tied to
ports and others that cannot.
This is where i think we need more discussion - I believe that
the offload model offered by current bridging is a good starting point
for things tied to ports. In that setup, something in the netlink header
says "give me stuff that is in the hardware" or "set this to the fdb
table in the kernel as well as in the hardware".
That is the part i like - and that concept, although tied to ports in
the case of bridging, applies generically.
Most of these things tend to be in the form of a table abstraction in
the chip.
If i add a FIB where the tie to a port is weak (occasionally the nexthop
tie may be to a port), I should still be able to use the same method.
If this request shows up in the kernel, mechanisms are needed to shunt
it to said driver.
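
A toy model of that selection, for illustration only. The two flag
values are the NTF_SELF/NTF_MASTER bits from linux/neighbour.h that the
bridge fdb path uses today; everything else (table_add, table_dump, the
dict-based tables, and treating "self" as the hardware table and
"master" as the kernel table) is an assumption made for this sketch.

```python
NTF_SELF = 0x02     # values as defined in include/uapi/linux/neighbour.h
NTF_MASTER = 0x04


def table_add(flags, key, value, sw_table, hw_table):
    """Add one entry to the software table, the hardware table, or both."""
    if not flags & (NTF_SELF | NTF_MASTER):
        # Simplification: this sketch does not model any default target.
        raise ValueError("caller must pick at least one target")
    if flags & NTF_MASTER:
        sw_table[key] = value      # "set this in the kernel"
    if flags & NTF_SELF:
        hw_table[key] = value      # "... as well as in the hardware"


def table_dump(flags, sw_table, hw_table):
    """Return the entries of whichever tables the flags select."""
    out = []
    if flags & NTF_MASTER:
        out += [("sw", k, v) for k, v in sorted(sw_table.items())]
    if flags & NTF_SELF:
        out += [("hw", k, v) for k, v in sorted(hw_table.items())]
    return out


sw_fdb, hw_fdb = {}, {}
table_add(NTF_SELF | NTF_MASTER, "00:11:22:33:44:55", "sw1p0", sw_fdb, hw_fdb)
table_add(NTF_MASTER, "66:77:88:99:aa:bb", "sw1p1", sw_fdb, hw_fdb)
```

Nothing in table_add() cares whether the key is a MAC address or a
route prefix, which is the sense in which the method applies generically.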

> And if a switch fabric is short-cutting traffic so that
> the cpu doesn't see it, those bits of functionality won't work.

Separate control from data. What i am referring to is control.
Datapath and what shows up in the CPU is dependent on what gets set in
the hardware.

> I agree we
> can likely work around that with richer feature capabilities, but such an
> infrastructure would both require extensive kernel changes to fully cover the
> set of existing features at a sufficient granularity, and require user space
> changes to grok the feature set of a given device.  Not saying it's impossible
> or even undesirable mind you, just that it's not any less invasive than what
> I'm proposing.

We would need to change some things - but i think it is a small
evolution (as opposed to a revolution).

cheers,
jamal


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:29                       ` Thomas Graf
@ 2014-03-26 12:58                         ` Jamal Hadi Salim
  2014-03-26 15:22                         ` John W. Linville
  2014-03-26 18:21                         ` Neil Horman
  2 siblings, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 12:58 UTC (permalink / raw)
  To: Thomas Graf, Neil Horman
  Cc: Jiri Pirko, Florian Fainelli, netdev, David Miller, andy,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/26/14 07:29, Thomas Graf wrote:
> On 03/26/14 at 07:10am, Neil Horman wrote:
>> But by creating net_devices that are registered in the current fashion we
>> implicitly agree to levels of functionality that are assumed to be available and
>> as such are not within the purview of a net_device to reject.  E.g. it is
>> assumed that a netdevice can filter frames using iptables/ebtables, limit
>> traffic using tc, etc.
>
> I think this is the point where we disagree. We already have several
> devices that hook into the rx handler and never have their packets
> pass through either iptables or ebtables. Better examples of this are
> macvtap or OVS.
>

I am going to disagree here.
Where it makes sense to use port handling is when you have port
layering.
We really don't need to create another parallel stack just because this
is to reflect what is in the hardware. There are corner cases, but
not the ones discussed.
Use ebtables if that is what the hardware reflects best.
Use tc when that is the better abstraction (example: for years now I
have modelled the broadcom acls with tc classifier/actions just fine).

> What should happen is that these devices are given a chance to implement
> the ACL in their own flow table. If no such facility exists, the rule
> insertion should fall back to software mode if that is possible (an
> OF capable switching chip could insert an 'upcall' flow), or as
> a last resort return an error to indicate EOPNOTSUPP.
>

I would like to specify where the acl rule goes.
Refer to the bridge port fdb offloading as an example.

cheers,
jamal


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:00                         ` Jamal Hadi Salim
  2014-03-26 11:06                           ` Jamal Hadi Salim
@ 2014-03-26 13:17                           ` Jiri Pirko
  1 sibling, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-26 13:17 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Thomas Graf, Neil Horman, Florian Fainelli, netdev, David Miller,
	andy, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

Wed, Mar 26, 2014 at 12:00:53PM CET, jhs@mojatatu.com wrote:
>On 03/26/14 03:21, Jiri Pirko wrote:
>
>>
>>Creating bonding of the switch ports does not fit into the picture at
>>all. These port netdevices are just a representation of a port. Not
>>actual netdevice where the data goes through.
>>
>>Please consider the case I gave already to this thread:
>>
>>         switch chip
>>    ------------------------
>>     |  |  |  |  |  |   |               CPU
>>    p1 p2 ...pn px py  MNGMNT       -----------
>>                 |  |   |              pcie
>>                 |  |   |         ---------------
>>                 |  |   |          |  NIC0 NIC1
>>                 |  |   ---pcie-----   |   |
>>                 |  ------someMII-------   |
>>                 ---------someMII-----------
>>
>>         NIC0 and NIC1 are ordinary NICs like 8139too for example with no
>>         notion they are connected to a switch. They as completely
>>         independent on the mngmnt iface.
>>
>> There, actual data is coming through NIC0 and NIC1 which is
>> completely separated
>> from the p1...pn,px.px port representations.
>>
>> And if you understand it this way, it makes perfect sense to have a
>> master device
>> for these port representations.
>>
>
>I think you may be looking at some specific board design which has those
>two NICs; there are typically many variations of such boards and they
>have to be each dealt with slightly differently by whoever is
>porting. Important detail is:

It is just an example, nothing more.



>we  already know how to deal with NICs - remove them from the diagram
>and then the discussion is about the switch chip. I am assuming
>the MNGMT interface is where the control is going to be. i.e

* I just tried to emphasize where the actual network traffic between the
  switch chip and the CPU flows. I believe that is important to realize.


>I can send table updates there, control the different port
>charasterstics etc.
>So Neil's option #1 is to have a driver controlling that interface
>(->priv).
>There's probably some DMA engine's for the datapath for one or more
>of the ports this driver exposes...

See *.

>Replace PCIE with DSA, a simulation chip, whatever the gazillion
>crazy interfaces the openwrt guys have to deal with and we have
>ourselves a consistent interface.
>
>
>>Btw note this model fits into existing DSA as well I believe. The actual DSA
>>devices whould act as NIC0, NIC1 and what would be added is the switch
>>representation (couple of more netdevices to represent actual HW ports and
>>their master)
>>
>
>Refer to my comments above.
>
>
>cheers,
>jamal
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:06                           ` Jamal Hadi Salim
  2014-03-26 11:31                             ` Jamal Hadi Salim
@ 2014-03-26 13:20                             ` Jiri Pirko
  2014-03-26 13:23                               ` Jamal Hadi Salim
  1 sibling, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-26 13:20 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Thomas Graf, Neil Horman, Florian Fainelli, netdev, David Miller,
	andy, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

Wed, Mar 26, 2014 at 12:06:08PM CET, jhs@mojatatu.com wrote:
>On 03/26/14 07:00, Jamal Hadi Salim wrote:
>>On 03/26/14 03:21, Jiri Pirko wrote:
>>
>>>
>>>Creating bonding of the switch ports does not fit into the picture at
>>>all.
>
>Sorry wanted to respond to the bonding part but my fingers kept typing;->
>
>If it cant do bonding and the chip is capable of LAGging, it is simply
>the wrong approach. I dont think what has been described so far will
>have a problem doing bonding.

I think that bonding (the bonding driver or any of the bonding driver
interfaces) should have nothing to do with the switch chip's link
aggregation capability.


>
>
>
>cheers,
>jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 13:20                             ` Jiri Pirko
@ 2014-03-26 13:23                               ` Jamal Hadi Salim
  0 siblings, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 13:23 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Thomas Graf, Neil Horman, Florian Fainelli, netdev, David Miller,
	andy, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/26/14 09:20, Jiri Pirko wrote:

> I think that bonding (bonding driver or any of bonding driver interface)
> should have nothing to do with the switch chip capability of link aggregation.

I am probably missing something: that may work, but not well at the
ASIC level. I can see it would work on packets to/from the CPU because
we can use software techniques.
But how would it work at the ASIC level without the ability to LAG
the ports?

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:29                       ` Thomas Graf
  2014-03-26 12:58                         ` Jamal Hadi Salim
@ 2014-03-26 15:22                         ` John W. Linville
  2014-03-26 21:36                           ` Jamal Hadi Salim
  2014-03-26 18:21                         ` Neil Horman
  2 siblings, 1 reply; 125+ messages in thread
From: John W. Linville @ 2014-03-26 15:22 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Neil Horman, Jamal Hadi Salim, Jiri Pirko, Florian Fainelli,
	netdev, David Miller, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher,
	vyasevic, Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On Wed, Mar 26, 2014 at 11:29:03AM +0000, Thomas Graf wrote:
> On 03/26/14 at 07:10am, Neil Horman wrote:
> > But by creating net_devices that are registered in the current fashion we
> > implicitly agree to levels of functionality that are assumed to be available and
> > as such are not within the purview of a net_device to reject.  E.g. it is
> > assumed that a netdevice can filter frames using iptables/ebtables, limit
> > traffic using tc, etc.
> 
> I think this is the point where we disagree. We already have several
> devices that hook into the rx handler and never have their packets
> pass through either iptables or ebtables. Better examples of this are
> macvtap or OVS.
> 
> What should happen is that these devices are given a chance to implement
> the ACL in their own flow table. If no such facility exists, the rule
> insertion should fall back to software mode if that is possible (an
> OF capable switching chip could insert a 'upcall' flow), or as
> a last resort return an error to indicate EOPNOTSUPP.

This part makes sense to me -- use the hardware forwarding offloads if
they are available, but fall back to software for the sake of flexibility.
It gives the admin enough rope to shoot himself in the foot...
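
The fallback order described above (hardware flow table first, then
software mode, then EOPNOTSUPP) can be sketched as a small userspace
model. This is only an illustration under stated assumptions: the Port
class, its hw_table_size and can_upcall fields, and insert_acl() are
hypothetical names invented here, not part of the patchset or of any
kernel API.

```python
import errno


class Port:
    """Hypothetical model of a switch port; not a kernel structure."""

    def __init__(self, name, hw_table_size=0, can_upcall=False):
        self.name = name
        self.hw_table_size = hw_table_size  # slots in the hardware flow table
        self.can_upcall = can_upcall        # chip can punt packets to the CPU
        self.hw_rules = []
        self.sw_rules = []

    def insert_acl(self, rule):
        """Try hardware first, then software; return where the rule landed."""
        if len(self.hw_rules) < self.hw_table_size:
            self.hw_rules.append(rule)
            return "hardware"
        if self.can_upcall:
            # Software mode: matching traffic is sent up to the CPU,
            # where the rule is evaluated in software.
            self.sw_rules.append(rule)
            return "software"
        # Last resort: tell the caller the rule cannot be honoured.
        raise OSError(errno.EOPNOTSUPP,
                      "ACL offload not supported on " + self.name)
```

A port with a full hardware table and upcall support ends up in
software mode; a port with neither raises OSError with errno set to
EOPNOTSUPP, so the caller can report the failure to the admin.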

> 
> > And if a switch fabric is short cutting traffic so that
> > the cpu doesn't see them, those bits of functionality won't work.  I agree we
> > can likely work around that with richer feature capabilities, but such an
> > infrastructure would both require extensive kernel changes to fully cover the
> > set of existing features at a sufficient granularity, and require user space
> > changes to grok the feature set of a given device.  Not saying its impossibible
> > or even undesireable mind you, just thats its not any less invasive than what
> > I'm proposing.
> 
> What I don't understand at this point is how hiding the ports behind
> a master device would buy us anything. We would still need to abstract
> the filtering capabilities of the ports at some level and hiding that
> behind existing tools seems to most convenient way.

I don't see much benefit from the master driver approach either.
We had something like that in the wireless space for a while, and it
mostly just caused confusion.

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:10                     ` Neil Horman
  2014-03-26 11:29                       ` Thomas Graf
  2014-03-26 12:19                       ` Jamal Hadi Salim
@ 2014-03-26 15:27                       ` John W. Linville
  2 siblings, 0 replies; 125+ messages in thread
From: John W. Linville @ 2014-03-26 15:27 UTC (permalink / raw)
  To: Neil Horman
  Cc: Jamal Hadi Salim, Thomas Graf, Jiri Pirko, Florian Fainelli,
	netdev, David Miller, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher,
	vyasevic, Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On Wed, Mar 26, 2014 at 07:10:31AM -0400, Neil Horman wrote:
> On Tue, Mar 25, 2014 at 04:56:38PM -0400, Jamal Hadi Salim wrote:

> > I think i am with you mostly - just not on the visibility of a "master"
> > device.
> > Expose the ports. Users create bridges bonds and if the hardware is
> > capable it does the hard work to ensure consistency. No change in tools.
> > 
> But by creating net_devices that are registered in the current fashion we
> implicitly agree to levels of functionality that are assumed to be available and
> as such are not within the purview of a net_device to reject.  E.g. it is
> assumed that a netdevice can filter frames using iptables/ebtables, limit
> traffic using tc, etc. And if a switch fabric is short cutting traffic so that
> the cpu doesn't see them, those bits of functionality won't work.  I agree we
> can likely work around that with richer feature capabilities, but such an
> infrastructure would both require extensive kernel changes to fully cover the
> set of existing features at a sufficient granularity, and require user space
> changes to grok the feature set of a given device.  Not saying its impossibible
> or even undesireable mind you, just thats its not any less invasive than what
> I'm proposing.

Some of this sounds akin to the old (but true) arguments against TOE
hardware.  But as Thomas suggests, I think most of this disappears
if you give the driver the chance to implement such rules and/or fall
back to software-only forwarding.

While I'm sure there will be significant kernel changes to allow
for some of that, I think that by putting that intelligence in the
drivers we can avoid most/all of the user space changes for grokking
device features.

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 10:54                       ` Jamal Hadi Salim
@ 2014-03-26 15:31                         ` John W. Linville
  2014-03-26 16:54                         ` Roopa Prabhu
  1 sibling, 0 replies; 125+ messages in thread
From: John W. Linville @ 2014-03-26 15:31 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Roopa Prabhu, Florian Fainelli, Neil Horman, Thomas Graf,
	Jiri Pirko, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

On Wed, Mar 26, 2014 at 06:54:54AM -0400, Jamal Hadi Salim wrote:
> On 03/26/14 01:37, Roopa Prabhu wrote:

> >Most cases the switch port driver (API) can talk to the switch chip
> >driver without a switch netdev in between.
> >But there are cases where a switch netdev might become necessary for
> >switch chip specific operations (which probably has been discussed on
> >this thread). An example could be a global acl rule that applies to all
> >switch ports. One can argue that this can be applied on individual
> >switch ports and the switch driver can take care of consolidating or
> >optimally programming the acl rule in the switch chip.
> >
> 
> There are a lot of things which dont tie to a specific port.
> I think these should be transparent to the chip. If i add a route
> and the decision is for that route to go to the chip, then it
> shows up at the driver and it goes to the ASIC.
> If i am dumping a fib table and some parts of it sit in the chip,
> then whatever interfaces would end up querying the chip.

Right -- I think the netdev portions of the switch hardware drivers can
be made smart enough to handle most of this stuff behind the scenes.

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 10:54                       ` Jamal Hadi Salim
  2014-03-26 15:31                         ` John W. Linville
@ 2014-03-26 16:54                         ` Roopa Prabhu
  2014-03-26 16:59                           ` Jiri Pirko
  1 sibling, 1 reply; 125+ messages in thread
From: Roopa Prabhu @ 2014-03-26 16:54 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Florian Fainelli, Neil Horman, Thomas Graf, Jiri Pirko, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
> On 03/26/14 01:37, Roopa Prabhu wrote:
>> On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>
>> Sorry about getting on this thread late and possibly in the middle.
>> Agree on the idea of keeping the ports linked to the master switch dev
>> (or the 'conduit' to the switch chip) via private list instead of the
>> master-slave relationship proposed earlier.
>> By private i mean the netdev->priv linkage to the master switch dev and
>> not really keeping the ports from being exposed to the user.
>>
>> We think its better to keep the switch ports exposed as any other netdev
>> on linux.
>>   This approach will make the switch ports look exactly like a nic port
>> and all tools will continue to work seamlessly. The switch port
>> operations could internally be forwarded to the switch netdev (sw1 in
>> the above case).
>>
>> example:
>> $ip link set dev sw1p0 up
>> $ethtool -S sw1p0
>>
>
> I like the approach. I know the above is a simple version, but i am
> assuming you also mean i can do things like
> ip route add ...
> bridge fdb add ... (and if you like your brctl go ahead)
> bonding ...
>
Yes, exactly. We support this model on our boxes today.
A user can bond switch ports on our box in exactly the same way as he/she
would bond two NIC ports.
Our 'conduit to switch chip' reflects the corresponding LAG
configuration in the switch chip.
The same goes for bridging, routing, and ACLs.

Thanks,
Roopa

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 16:54                         ` Roopa Prabhu
@ 2014-03-26 16:59                           ` Jiri Pirko
  2014-03-26 17:29                             ` Florian Fainelli
  2014-03-26 17:47                             ` Roopa Prabhu
  0 siblings, 2 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-26 16:59 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>On 03/26/14 01:37, Roopa Prabhu wrote:
>>>On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>
>>>Sorry about getting on this thread late and possibly in the middle.
>>>Agree on the idea of keeping the ports linked to the master switch dev
>>>(or the 'conduit' to the switch chip) via private list instead of the
>>>master-slave relationship proposed earlier.
>>>By private i mean the netdev->priv linkage to the master switch dev and
>>>not really keeping the ports from being exposed to the user.
>>>
>>>We think its better to keep the switch ports exposed as any other netdev
>>>on linux.
>>>  This approach will make the switch ports look exactly like a nic port
>>>and all tools will continue to work seamlessly. The switch port
>>>operations could internally be forwarded to the switch netdev (sw1 in
>>>the above case).
>>>
>>>example:
>>>$ip link set dev sw1p0 up
>>>$ethtool -S sw1p0
>>>
>>
>>I like the approach. I know the above is a simple version, but i am
>>assuming you also mean i can do things like
>>ip route add ...
>>bridge fdb add ... (and if you like your brctl go ahead)
>>bonding ...
>>
>yes, exactly.  We support this model on our boxes today.
>User can bond switch ports on our box in the exact same way as he/she
>would bond two nic ports.
>Our 'conduit to switch chip' reflects the corresponding lag
>configuration in the switch chip.
>Same goes for bridging, routing, acls.


So do you implement the bonding netlink API? Or do you hook into the
bonding driver itself? Can you show us the code?



>
>Thanks,
>Roopa

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 16:59                           ` Jiri Pirko
@ 2014-03-26 17:29                             ` Florian Fainelli
  2014-03-26 17:35                               ` Jiri Pirko
  2014-03-26 17:57                               ` Roopa Prabhu
  2014-03-26 17:47                             ` Roopa Prabhu
  1 sibling, 2 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-26 17:29 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Roopa Prabhu, Jamal Hadi Salim, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

2014-03-26 9:59 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>
>>>>Sorry about getting on this thread late and possibly in the middle.
>>>>Agree on the idea of keeping the ports linked to the master switch dev
>>>>(or the 'conduit' to the switch chip) via private list instead of the
>>>>master-slave relationship proposed earlier.
>>>>By private i mean the netdev->priv linkage to the master switch dev and
>>>>not really keeping the ports from being exposed to the user.
>>>>
>>>>We think its better to keep the switch ports exposed as any other netdev
>>>>on linux.
>>>>  This approach will make the switch ports look exactly like a nic port
>>>>and all tools will continue to work seamlessly. The switch port
>>>>operations could internally be forwarded to the switch netdev (sw1 in
>>>>the above case).
>>>>
>>>>example:
>>>>$ip link set dev sw1p0 up
>>>>$ethtool -S sw1p0
>>>>
>>>
>>>I like the approach. I know the above is a simple version, but i am
>>>assuming you also mean i can do things like
>>>ip route add ...
>>>bridge fdb add ... (and if you like your brctl go ahead)
>>>bonding ...
>>>
>>yes, exactly.  We support this model on our boxes today.
>>User can bond switch ports on our box in the exact same way as he/she
>>would bond two nic ports.
>>Our 'conduit to switch chip' reflects the corresponding lag
>>configuration in the switch chip.
>>Same goes for bridging, routing, acls.
>
>
> So you implement bonding netlink api? Or you hook into bonding driver
> itselt? Can you show us the code?

Before we start talking about bonding, maybe we should make sure that
we cover some basic hardware switch use cases, such as making some
ports belong to certain VLANs, tagged or untagged?

It seems to me like this would become something like this, assuming P0
and P1 are two switch ports and 'eth0' is the CPU port, where P0 and
P1 belong to VLAN1 and CPU belongs to VLAN2:

ip link set dev sw1p0 up
ip link set dev sw1p1 up
ip link set dev eth0 up

ip link add link eth0 name eth0.2 type vlan id 2

ip link add link sw1p0 name sw1p0.1 type vlan id 1
ip link add link sw1p1 name sw1p1.1 type vlan id 1

ip link add sw1.1 type bridge
ip link set sw1p0.1 master sw1.1
ip link set sw1p1.1 master sw1.1

Does that fit the model correctly?
-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 17:29                             ` Florian Fainelli
@ 2014-03-26 17:35                               ` Jiri Pirko
  2014-03-26 17:58                                 ` Florian Fainelli
  2014-03-26 17:57                               ` Roopa Prabhu
  1 sibling, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-26 17:35 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Roopa Prabhu, Jamal Hadi Salim, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

Wed, Mar 26, 2014 at 06:29:07PM CET, f.fainelli@gmail.com wrote:
>2014-03-26 9:59 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>
>>>>>Sorry about getting on this thread late and possibly in the middle.
>>>>>Agree on the idea of keeping the ports linked to the master switch dev
>>>>>(or the 'conduit' to the switch chip) via private list instead of the
>>>>>master-slave relationship proposed earlier.
>>>>>By private i mean the netdev->priv linkage to the master switch dev and
>>>>>not really keeping the ports from being exposed to the user.
>>>>>
>>>>>We think its better to keep the switch ports exposed as any other netdev
>>>>>on linux.
>>>>>  This approach will make the switch ports look exactly like a nic port
>>>>>and all tools will continue to work seamlessly. The switch port
>>>>>operations could internally be forwarded to the switch netdev (sw1 in
>>>>>the above case).
>>>>>
>>>>>example:
>>>>>$ip link set dev sw1p0 up
>>>>>$ethtool -S sw1p0
>>>>>
>>>>
>>>>I like the approach. I know the above is a simple version, but i am
>>>>assuming you also mean i can do things like
>>>>ip route add ...
>>>>bridge fdb add ... (and if you like your brctl go ahead)
>>>>bonding ...
>>>>
>>>yes, exactly.  We support this model on our boxes today.
>>>User can bond switch ports on our box in the exact same way as he/she
>>>would bond two nic ports.
>>>Our 'conduit to switch chip' reflects the corresponding lag
>>>configuration in the switch chip.
>>>Same goes for bridging, routing, acls.
>>
>>
>> So you implement bonding netlink api? Or you hook into bonding driver
>> itselt? Can you show us the code?
>
>Before we start talking about bonding, maybe we should make sure that
>we cover some basic hardware switches uses which are to make some
>ports belong to certain VLANs, tagged or untagged?
>
>It seems to me like this would become something like this, assuming P0
>and P1 are two switch ports and 'eth0' is the CPU port, where P0 and
>P1 belong to VLAN1 and CPU belongs to VLAN2:
>
>ip link set dev sw1p0 up
>ip link set dev sw1p1 up
>ip link set dev eth0 up


I might be mistaken, but I think you are missing a switch port
representing the connection to eth0 (eth0 being the CPU counterpart of it).
Or is it one of sw1p0 and sw1p1?

>
>ip link add link eth0 name eth0.2 type vlan id 2
>
>ip link add link sw1p0 name sw1p0.1 type vlan id 1
>ip link add link sw1p1 name sw1p1.1 type vlan id 1
>
>ip link add sw1.1 type bridge
>ip link set sw1p0.1 master sw1.1
>ip link set sw1p1.1 master sw1.1
>
>Does that fit the model correctly?
>-- 
>Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 16:59                           ` Jiri Pirko
  2014-03-26 17:29                             ` Florian Fainelli
@ 2014-03-26 17:47                             ` Roopa Prabhu
  2014-03-26 18:03                               ` Jiri Pirko
  1 sibling, 1 reply; 125+ messages in thread
From: Roopa Prabhu @ 2014-03-26 17:47 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

On 3/26/14, 9:59 AM, Jiri Pirko wrote:
> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>> On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>> On 03/26/14 01:37, Roopa Prabhu wrote:
>>>> On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>> Sorry about getting on this thread late and possibly in the middle.
>>>> Agree on the idea of keeping the ports linked to the master switch dev
>>>> (or the 'conduit' to the switch chip) via private list instead of the
>>>> master-slave relationship proposed earlier.
>>>> By private i mean the netdev->priv linkage to the master switch dev and
>>>> not really keeping the ports from being exposed to the user.
>>>>
>>>> We think its better to keep the switch ports exposed as any other netdev
>>>> on linux.
>>>>   This approach will make the switch ports look exactly like a nic port
>>>> and all tools will continue to work seamlessly. The switch port
>>>> operations could internally be forwarded to the switch netdev (sw1 in
>>>> the above case).
>>>>
>>>> example:
>>>> $ip link set dev sw1p0 up
>>>> $ethtool -S sw1p0
>>>>
>>> I like the approach. I know the above is a simple version, but i am
>>> assuming you also mean i can do things like
>>> ip route add ...
>>> bridge fdb add ... (and if you like your brctl go ahead)
>>> bonding ...
>>>
>> yes, exactly.  We support this model on our boxes today.
>> User can bond switch ports on our box in the exact same way as he/she
>> would bond two nic ports.
>> Our 'conduit to switch chip' reflects the corresponding lag
>> configuration in the switch chip.
>> Same goes for bridging, routing, acls.
>
> So you implement bonding netlink api? Or you hook into bonding driver
> itselt? Can you show us the code?
We use the netlink API and libnl. In our current model, our switch chip
driver listens to netlink notifications and programs the switch chip.
The switch chip driver uses libnl caches and libnl netlink APIs to
reflect the kernel state to the switch chip.
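
The notification-driven mirroring described above can be sketched
roughly as follows. This is a simplified userspace model and not our
driver code: LagMirror, handle_link_event() and the hw_lags table are
hypothetical names, and the events are fed in by hand instead of being
read from a netlink socket or a libnl cache.

```python
class LagMirror:
    """Tracks kernel bond membership and mirrors it into a fake chip table."""

    def __init__(self):
        self.master_of = {}   # cache: port name -> bond name (kernel state)
        self.hw_lags = {}     # simulated chip state: bond name -> set of ports

    def handle_link_event(self, port, master):
        """Apply one link notification; master is None when the port is released."""
        old = self.master_of.get(port)
        if old == master:
            return  # nothing changed, do not touch the hardware
        if old is not None:
            # Port left its previous bond: shrink or delete the hardware LAG.
            self.hw_lags[old].discard(port)
            if not self.hw_lags[old]:
                del self.hw_lags[old]
        if master is None:
            self.master_of.pop(port, None)
        else:
            # Port joined a bond: create or grow the hardware LAG.
            self.master_of[port] = master
            self.hw_lags.setdefault(master, set()).add(port)


mirror = LagMirror()
mirror.handle_link_event("sw1p0", "bond0")
mirror.handle_link_event("sw1p1", "bond0")
```

After the two events, hw_lags holds one LAG named bond0 with both ports
as members. The cache comparison at the top of handle_link_event() is
what keeps repeated notifications from reprogramming the chip.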

Thanks,
Roopa

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 17:29                             ` Florian Fainelli
  2014-03-26 17:35                               ` Jiri Pirko
@ 2014-03-26 17:57                               ` Roopa Prabhu
  2014-03-26 18:09                                 ` Florian Fainelli
  1 sibling, 1 reply; 125+ messages in thread
From: Roopa Prabhu @ 2014-03-26 17:57 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Jiri Pirko, Jamal Hadi Salim, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

On 3/26/14, 10:29 AM, Florian Fainelli wrote:
> 2014-03-26 9:59 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>> On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>> On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>> On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>> Sorry about getting on this thread late and possibly in the middle.
>>>>> Agree on the idea of keeping the ports linked to the master switch dev
>>>>> (or the 'conduit' to the switch chip) via private list instead of the
>>>>> master-slave relationship proposed earlier.
>>>>> By private i mean the netdev->priv linkage to the master switch dev and
>>>>> not really keeping the ports from being exposed to the user.
>>>>>
>>>>> We think its better to keep the switch ports exposed as any other netdev
>>>>> on linux.
>>>>>   This approach will make the switch ports look exactly like a nic port
>>>>> and all tools will continue to work seamlessly. The switch port
>>>>> operations could internally be forwarded to the switch netdev (sw1 in
>>>>> the above case).
>>>>>
>>>>> example:
>>>>> $ip link set dev sw1p0 up
>>>>> $ethtool -S sw1p0
>>>>>
>>>> I like the approach. I know the above is a simple version, but i am
>>>> assuming you also mean i can do things like
>>>> ip route add ...
>>>> bridge fdb add ... (and if you like your brctl go ahead)
>>>> bonding ...
>>>>
>>> yes, exactly.  We support this model on our boxes today.
>>> User can bond switch ports on our box in the exact same way as he/she
>>> would bond two nic ports.
>>> Our 'conduit to switch chip' reflects the corresponding lag
>>> configuration in the switch chip.
>>> Same goes for bridging, routing, acls.
>>
>> So you implement bonding netlink api? Or you hook into bonding driver
>> itselt? Can you show us the code?
> Before we start talking about bonding, maybe we should make sure that
> we cover some basic hardware switches uses which are to make some
> ports belong to certain VLANs, tagged or untagged?
>
> It seems to me like this would become something like this, assuming P0
> and P1 are two switch ports and 'eth0' is the CPU port, where P0 and
> P1 belong to VLAN1 and CPU belongs to VLAN2:
>
> ip link set dev sw1p0 up
> ip link set dev sw1p1 up
> ip link set dev eth0 up
>
> ip link add link eth0 name eth0.2 type vlan id 2
>
> ip link add link sw1p0 name sw1p0.1 type vlan id 1
> ip link add link sw1p1 name sw1p1.1 type vlan id 1
>
> ip link add sw1.1 type bridge
> ip link set sw1p0.1 master sw1.1
> ip link set sw1p1.1 master sw1.1
>
> Does that fit the model correctly?
Not entirely, but close.
In our current model, there is no netdev for the CPU port (or master
switch netdev):

ip link set dev sw1p0 up
ip link set dev sw1p1 up

ip link add link sw1p0 name sw1p0.1 type vlan id 1
ip link add link sw1p1 name sw1p1.1 type vlan id 1

ip link add brvlan1 type bridge
ip link set sw1p0.1 master brvlan1
ip link set sw1p1.1 master brvlan1

The switch driver then programs the brvlan1 VLAN in the switch ASIC.

Bonding works in the same way.
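
As a rough illustration of how a driver could derive the ASIC VLAN
membership from the netdev topology created by the commands above,
here is a minimal sketch. It is not our implementation:
hw_vlan_membership() and its two input dictionaries are hypothetical,
and a real driver would build them from netlink state.

```python
def hw_vlan_membership(vlan_devs, bridge_members):
    """Map each bridge to the (port, vid) pairs the ASIC would need.

    vlan_devs:      {vlan netdev name: (lower port name, VLAN id)}
    bridge_members: {bridge name: [enslaved netdev names]}
    """
    table = {}
    for bridge, members in bridge_members.items():
        entries = set()
        for dev in members:
            if dev in vlan_devs:
                entries.add(vlan_devs[dev])   # tagged member: (port, vid)
            else:
                entries.add((dev, None))      # plain port: untagged member
        table[bridge] = entries
    return table


# Topology matching the ip commands above.
vlan_devs = {"sw1p0.1": ("sw1p0", 1), "sw1p1.1": ("sw1p1", 1)}
bridge_members = {"brvlan1": ["sw1p0.1", "sw1p1.1"]}
print(hw_vlan_membership(vlan_devs, bridge_members))
```

For this topology the table has a single entry, brvlan1, containing
sw1p0 and sw1p1 as members of VLAN 1.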

Thanks,
Roopa


  

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 17:35                               ` Jiri Pirko
@ 2014-03-26 17:58                                 ` Florian Fainelli
  2014-03-26 18:14                                   ` Jiri Pirko
  0 siblings, 1 reply; 125+ messages in thread
From: Florian Fainelli @ 2014-03-26 17:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Roopa Prabhu, Jamal Hadi Salim, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
> Wed, Mar 26, 2014 at 06:29:07PM CET, f.fainelli@gmail.com wrote:
>>2014-03-26 9:59 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>>On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>>On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>>2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>>
>>>>>>Sorry about getting on this thread late and possibly in the middle.
>>>>>>Agree on the idea of keeping the ports linked to the master switch dev
>>>>>>(or the 'conduit' to the switch chip) via private list instead of the
>>>>>>master-slave relationship proposed earlier.
>>>>>>By private i mean the netdev->priv linkage to the master switch dev and
>>>>>>not really keeping the ports from being exposed to the user.
>>>>>>
>>>>>>We think its better to keep the switch ports exposed as any other netdev
>>>>>>on linux.
>>>>>>  This approach will make the switch ports look exactly like a nic port
>>>>>>and all tools will continue to work seamlessly. The switch port
>>>>>>operations could internally be forwarded to the switch netdev (sw1 in
>>>>>>the above case).
>>>>>>
>>>>>>example:
>>>>>>$ip link set dev sw1p0 up
>>>>>>$ethtool -S sw1p0
>>>>>>
>>>>>
>>>>>I like the approach. I know the above is a simple version, but i am
>>>>>assuming you also mean i can do things like
>>>>>ip route add ...
>>>>>bridge fdb add ... (and if you like your brctl go ahead)
>>>>>bonding ...
>>>>>
>>>>yes, exactly.  We support this model on our boxes today.
>>>>User can bond switch ports on our box in the exact same way as he/she
>>>>would bond two nic ports.
>>>>Our 'conduit to switch chip' reflects the corresponding lag
>>>>configuration in the switch chip.
>>>>Same goes for bridging, routing, acls.
>>>
>>>
>>> So you implement bonding netlink api? Or you hook into bonding driver
>>> itselt? Can you show us the code?
>>
>>Before we start talking about bonding, maybe we should make sure that
>>we cover some basic hardware switches uses which are to make some
>>ports belong to certain VLANs, tagged or untagged?
>>
>>It seems to me like this would become something like this, assuming P0
>>and P1 are two switch ports and 'eth0' is the CPU port, where P0 and
>>P1 belong to VLAN1 and CPU belongs to VLAN2:
>>
>>ip link set dev sw1p0 up
>>ip link set dev sw1p1 up
>>ip link set dev eth0 up
>
>
> I might be mistaken, But I think you are missing a switch port
> representing a connection to eth0 (eth0 being cpu conterpart of it).
> Or is it one of sw1p0 and sw1p1 ?

You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my example.

I think there is an implicit convention that sw1 represents the
Ethernet switch port connected to the CPU Ethernet MAC, and that it is
always connected, hence there is no need to create a "fake" bridge to
link sw1 to eth0 for instance?

>
>>
>>ip link add link eth0 name eth0.2 type vlan id 2
>>
>>ip link add link sw1p0 name sw1p0.1 type vlan id 1
>>ip link add link sw1p1 name sw1p1.1 type vlan id 1
>>
>>ip link add sw1.1 type bridge
>>ip link set sw1p0.1 master sw1.1
>>ip link set sw1p1.1 master sw1.1
>>
>>Does that fit the model correctly?
>>--
>>Florian



-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 17:47                             ` Roopa Prabhu
@ 2014-03-26 18:03                               ` Jiri Pirko
  2014-03-26 21:27                                 ` Roopa Prabhu
  2014-04-01 19:13                                 ` Scott Feldman
  0 siblings, 2 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-26 18:03 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>>Sorry about getting on this thread late and possibly in the middle.
>>>>>Agree on the idea of keeping the ports linked to the master switch dev
>>>>>(or the 'conduit' to the switch chip) via private list instead of the
>>>>>master-slave relationship proposed earlier.
>>>>>By private i mean the netdev->priv linkage to the master switch dev and
>>>>>not really keeping the ports from being exposed to the user.
>>>>>
>>>>>We think its better to keep the switch ports exposed as any other netdev
>>>>>on linux.
>>>>>  This approach will make the switch ports look exactly like a nic port
>>>>>and all tools will continue to work seamlessly. The switch port
>>>>>operations could internally be forwarded to the switch netdev (sw1 in
>>>>>the above case).
>>>>>
>>>>>example:
>>>>>$ip link set dev sw1p0 up
>>>>>$ethtool -S sw1p0
>>>>>
>>>>I like the approach. I know the above is a simple version, but i am
>>>>assuming you also mean i can do things like
>>>>ip route add ...
>>>>bridge fdb add ... (and if you like your brctl go ahead)
>>>>bonding ...
>>>>
>>>yes, exactly.  We support this model on our boxes today.
>>>User can bond switch ports on our box in the exact same way as he/she
>>>would bond two nic ports.
>>>Our 'conduit to switch chip' reflects the corresponding lag
>>>configuration in the switch chip.
>>>Same goes for bridging, routing, acls.
>>
>>So you implement bonding netlink api? Or you hook into bonding driver
>>itselt? Can you show us the code?
>We use the netlink API and libnl. In our current model, our switch
>chip driver listens to netlink notifications and programs the switch
>chip. The switch chip driver uses libnl caches and libnl netlink apis
>to reflect the kernel state to switch chip.


So when you configure, for example, bonding over 2 ports, you actually use
the bonding driver to do that. And your userspace app listens to
notifications and programs the switch chip accordingly. Am I close?
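
A rough model of that flow, for illustration only. The event tuples stand in
for netlink link notifications, and `FakeAsic` and `handle_event()` are made-up
names; nothing here is actual Cumulus code or libnl API.

```python
class FakeAsic:
    """Stand-in for the switch chip: records LAG membership only."""

    def __init__(self):
        self.lags = {}  # bond name -> set of member port names

    def lag_add(self, bond, port):
        self.lags.setdefault(bond, set()).add(port)

    def lag_del(self, bond, port):
        members = self.lags.get(bond, set())
        members.discard(port)
        if not members:
            # Last member gone: drop the LAG entry entirely.
            self.lags.pop(bond, None)


def handle_event(asic, masters, event):
    """Apply one simulated link notification to the ASIC model.

    event is (port, master): master is the bond the port is now enslaved
    to, or None if the port has been released.
    """
    port, master = event
    old = masters.get(port)
    if old == master:
        return  # enslavement unchanged, nothing to program
    if old is not None:
        asic.lag_del(old, port)
    if master is not None:
        asic.lag_add(master, port)
        masters[port] = master
    else:
        masters.pop(port, None)


asic = FakeAsic()
masters = {}  # cached kernel state: port -> current master
for ev in [("sw1p0", "bond0"), ("sw1p1", "bond0"), ("sw1p1", None)]:
    handle_event(asic, masters, ev)

print(asic.lags)
```

The kernel bonding driver stays the source of truth in this model; the
userspace side only mirrors what it observes.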

How about data? Can this new "bonding" interface have an IP assigned to it
and send/receive packets?

I'm still not sure I understand your concept. Do you have some
documentation for it available?

>
>Thanks,
>Roopa
>
>

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 17:57                               ` Roopa Prabhu
@ 2014-03-26 18:09                                 ` Florian Fainelli
  2014-03-27 13:46                                   ` John W. Linville
  0 siblings, 1 reply; 125+ messages in thread
From: Florian Fainelli @ 2014-03-26 18:09 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jiri Pirko, Jamal Hadi Salim, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

2014-03-26 10:57 GMT-07:00 Roopa Prabhu <roopa@cumulusnetworks.com>:
> On 3/26/14, 10:29 AM, Florian Fainelli wrote:
>>
>> 2014-03-26 9:59 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>
>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>
>>>> On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>>
>>>>> On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>>
>>>>>> On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>>
>>>>>>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>>>
>>>>>> Sorry about getting on this thread late and possibly in the middle.
>>>>>> Agree on the idea of keeping the ports linked to the master switch dev
>>>>>> (or the 'conduit' to the switch chip) via private list instead of the
>>>>>> master-slave relationship proposed earlier.
>>>>>> By private i mean the netdev->priv linkage to the master switch dev
>>>>>> and
>>>>>> not really keeping the ports from being exposed to the user.
>>>>>>
>>>>>> We think its better to keep the switch ports exposed as any other
>>>>>> netdev
>>>>>> on linux.
>>>>>>   This approach will make the switch ports look exactly like a nic
>>>>>> port
>>>>>> and all tools will continue to work seamlessly. The switch port
>>>>>> operations could internally be forwarded to the switch netdev (sw1 in
>>>>>> the above case).
>>>>>>
>>>>>> example:
>>>>>> $ip link set dev sw1p0 up
>>>>>> $ethtool -S sw1p0
>>>>>>
>>>>> I like the approach. I know the above is a simple version, but i am
>>>>> assuming you also mean i can do things like
>>>>> ip route add ...
>>>>> bridge fdb add ... (and if you like your brctl go ahead)
>>>>> bonding ...
>>>>>
>>>> yes, exactly.  We support this model on our boxes today.
>>>> User can bond switch ports on our box in the exact same way as he/she
>>>> would bond two nic ports.
>>>> Our 'conduit to switch chip' reflects the corresponding lag
>>>> configuration in the switch chip.
>>>> Same goes for bridging, routing, acls.
>>>
>>>
>>> So you implement bonding netlink api? Or you hook into bonding driver
>>> itselt? Can you show us the code?
>>
>> Before we start talking about bonding, maybe we should make sure that
>> we cover some basic hardware switches uses which are to make some
>> ports belong to certain VLANs, tagged or untagged?
>>
>> It seems to me like this would become something like this, assuming P0
>> and P1 are two switch ports and 'eth0' is the CPU port, where P0 and
>> P1 belong to VLAN1 and CPU belongs to VLAN2:
>>
>> ip link set dev sw1p0 up
>> ip link set dev sw1p1 up
>> ip link set dev eth0 up
>>
>> ip link add link eth0 name eth0.2 type vlan id 2
>>
>> ip link add link sw1p0 name sw1p0.1 type vlan id 1
>> ip link add link sw1p1 name sw1p1.1 type vlan id 1
>>
>> ip link add sw1.1 type bridge
>> ip link set sw1p0.1 master sw1.1
>> ip link set sw1p1.1 master sw1.1
>>
>> Does that fit the model correctly?
>
> Not entirely, but close.
> In our current model, there is no netdev for cpu port (or master switch
> netdev):

You mean there is no netdev for the switch-side CPU-port facing the
CPU Ethernet MAC, right? There is still a netdev for the CPU Ethernet
MAC to receive packets destined to it presumably.

I do not think it hurts nor changes anything to introduce a CPU-port
netdev, this just gives greater flexibility and this should allow for
more complex setups where multiple CPU-ports exist (there are some
real devices using this...).

>
>
> ip link set dev sw1p0 up
> ip link set dev sw1p1 up
>
> ip link add link sw1p0 name sw1p0.1 type vlan id 1
> ip link add link sw1p1 name sw1p1.1 type vlan id 1
>
> ip link add brvlan1 type bridge
> ip link set swp1p0.1 master brvlan1
> ip link set swp1p1.1 master brvlan1
>
> switch driver programs the brvlan1 vlan in the switch asic.
>
> bonding works in the same way.
>
> Thanks,
> Roopa
>
>
>
>
>



-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 17:58                                 ` Florian Fainelli
@ 2014-03-26 18:14                                   ` Jiri Pirko
  2014-03-26 18:29                                     ` Hannes Frederic Sowa
                                                       ` (2 more replies)
  0 siblings, 3 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-26 18:14 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Roopa Prabhu, Jamal Hadi Salim, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>> Wed, Mar 26, 2014 at 06:29:07PM CET, f.fainelli@gmail.com wrote:
>>>2014-03-26 9:59 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>>On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>>>On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>>>On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>>>2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>>>
>>>>>>>Sorry about getting on this thread late and possibly in the middle.
>>>>>>>Agree on the idea of keeping the ports linked to the master switch dev
>>>>>>>(or the 'conduit' to the switch chip) via private list instead of the
>>>>>>>master-slave relationship proposed earlier.
>>>>>>>By private i mean the netdev->priv linkage to the master switch dev and
>>>>>>>not really keeping the ports from being exposed to the user.
>>>>>>>
>>>>>>>We think its better to keep the switch ports exposed as any other netdev
>>>>>>>on linux.
>>>>>>>  This approach will make the switch ports look exactly like a nic port
>>>>>>>and all tools will continue to work seamlessly. The switch port
>>>>>>>operations could internally be forwarded to the switch netdev (sw1 in
>>>>>>>the above case).
>>>>>>>
>>>>>>>example:
>>>>>>>$ip link set dev sw1p0 up
>>>>>>>$ethtool -S sw1p0
>>>>>>>
>>>>>>
>>>>>>I like the approach. I know the above is a simple version, but i am
>>>>>>assuming you also mean i can do things like
>>>>>>ip route add ...
>>>>>>bridge fdb add ... (and if you like your brctl go ahead)
>>>>>>bonding ...
>>>>>>
>>>>>yes, exactly.  We support this model on our boxes today.
>>>>>User can bond switch ports on our box in the exact same way as he/she
>>>>>would bond two nic ports.
>>>>>Our 'conduit to switch chip' reflects the corresponding lag
>>>>>configuration in the switch chip.
>>>>>Same goes for bridging, routing, acls.
>>>>
>>>>
>>>> So you implement bonding netlink api? Or you hook into bonding driver
>>>> itselt? Can you show us the code?
>>>
>>>Before we start talking about bonding, maybe we should make sure that
>>>we cover some basic hardware switches uses which are to make some
>>>ports belong to certain VLANs, tagged or untagged?
>>>
>>>It seems to me like this would become something like this, assuming P0
>>>and P1 are two switch ports and 'eth0' is the CPU port, where P0 and
>>>P1 belong to VLAN1 and CPU belongs to VLAN2:
>>>
>>>ip link set dev sw1p0 up
>>>ip link set dev sw1p1 up
>>>ip link set dev eth0 up
>>
>>
>> I might be mistaken, But I think you are missing a switch port
>> representing a connection to eth0 (eth0 being cpu conterpart of it).
>> Or is it one of sw1p0 and sw1p1 ?
>
>You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my example.
>
>I think there is an implicit convention that sw1 represents the
>Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>always connected, hence there is no need to create a "fake" bridge to
>link sw1 to eth0 for instance?

I think you are kind of mixing apples and oranges (or I might not be
understanding you correctly).
This is how I see it, sticking to the names you use in the example:

            (sw1) (abstract place-holder netdev)
          --------
         switch chip                   CPU
   -----------------------            ------
   sw1p0 sw1p1 sw1p2 sw1p3             eth0
     |     |     |     |                |
    PHY   PHY   PHY    ------someMII-----

You see that eth0 is the CPU part of the "connection" and sw1p3 is the
switch part (port representation). 



>
>>
>>>
>>>ip link add link eth0 name eth0.2 type vlan id 2
>>>
>>>ip link add link sw1p0 name sw1p0.1 type vlan id 1
>>>ip link add link sw1p1 name sw1p1.1 type vlan id 1
>>>
>>>ip link add sw1.1 type bridge
>>>ip link set sw1p0.1 master sw1.1
>>>ip link set sw1p1.1 master sw1.1
>>>
>>>Does that fit the model correctly?
>>>--
>>>Florian
>
>
>
>-- 
>Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 11:29                       ` Thomas Graf
  2014-03-26 12:58                         ` Jamal Hadi Salim
  2014-03-26 15:22                         ` John W. Linville
@ 2014-03-26 18:21                         ` Neil Horman
  2014-03-26 19:11                           ` Florian Fainelli
                                             ` (2 more replies)
  2 siblings, 3 replies; 125+ messages in thread
From: Neil Horman @ 2014-03-26 18:21 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Jamal Hadi Salim, Jiri Pirko, Florian Fainelli, netdev,
	David Miller, andy, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On Wed, Mar 26, 2014 at 11:29:03AM +0000, Thomas Graf wrote:
> On 03/26/14 at 07:10am, Neil Horman wrote:
> > But by creating net_devices that are registered in the current fashion we
> > implicitly agree to levels of functionality that are assumed to be available and
> > as such are not within the purview of a net_device to reject.  E.g. it is
> > assumed that a netdevice can filter frames using iptables/ebtables, limit
> > traffic using tc, etc.
> 
> I think this is the point where we disagree. We already have several
> devices that hook into the rx handler and never have their packets
> pass through either iptables or ebtables. Better examples of this are
> macvtap or OVS.
> 
Yes, this is the point of contention, you're right.  And you're also correct in
that we do have several devices that bypass the network stack.  My
concern is that, in all of those cases, it's being bypassed because we know that
other software is handling that functionality (in the case of macvtap we know
that we're passing it off to a guest to be processed via the full network stack
available in the guest, and in the case of OVS, we know that we are passing
traffic to a software defined switch for handling).  In the case of having a
switch fabric available, we're explicitly hiding the fact that traffic we are
passing between ports never touches the cpu, and that just rubs me the wrong
way.  I suppose I'm looking at switch fabrics in the same way that I look at
TOE.  In offloading forwarding functionality we remove from the cpu activity
which an administrator may reasonably expect to see handled in the cpu, but
won't.  In the case of macvlan, the admin knows that's a macvlan device, and
packet handling for frames bound to it occurs in the guest.  For OVS, packets
received on the cpu with the proper encapsulation are clearly handled in the
OVS bridge.  But in the case of a hardware switch, all they see are 4 net device
interfaces that seem like any other net device.

Perhaps I need to let go of this notion, but it seems to me, if we're going to
allow cpu stack bypass, then we need to make that very obvious to an
administrator.  Maybe a flag like IFF_L2ONLY (or perhaps better still
IFF_LOCALDATAONLY, to indicate that only data directly addressed to the
interface, or to a multi/broadcast address, will be received by it, despite the
promisc or other settings) is sufficient. I really don't know.  That's where my
hang-up is though.
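
To make the intended meaning of such a flag concrete, here is a small Python
model of the receive decision. IFF_LOCALDATAONLY is only the name floated
above, not an existing kernel flag, and `cpu_receives()` is a hypothetical
helper written for this illustration.

```python
def is_group_address(mac: bytes) -> bool:
    """True for multicast and broadcast MACs (I/G bit of the first octet)."""
    return bool(mac[0] & 0x01)


def cpu_receives(dst: bytes, own: bytes, promisc: bool,
                 local_data_only: bool) -> bool:
    """Would a frame with destination dst reach the host stack?"""
    if dst == own or is_group_address(dst):
        return True
    # Foreign unicast: a normal NIC delivers it in promiscuous mode, but a
    # port flagged local-data-only never does, whatever promisc says.
    return promisc and not local_data_only


own = bytes.fromhex("02aabbccdd01")
other = bytes.fromhex("02aabbccdd02")
broadcast = b"\xff" * 6

print(cpu_receives(other, own, promisc=True, local_data_only=False))
print(cpu_receives(other, own, promisc=True, local_data_only=True))
print(cpu_receives(broadcast, own, promisc=False, local_data_only=True))
```

The second call is the case an administrator would otherwise be surprised by:
promisc is on, yet forwarded traffic stays in the fabric.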

> What should happen is that these devices are given a chance to implement
> the ACL in their own flow table. If no such facility exists, the rule
> insertion should fall back to software mode if that is possible (an
> OF capable switching chip could insert a 'upcall' flow), or as
> a last resort return an error to indicate EOPNOTSUPP.
> 
> > And if a switch fabric is short cutting traffic so that
> > the cpu doesn't see them, those bits of functionality won't work.  I agree we
> > can likely work around that with richer feature capabilities, but such an
> > infrastructure would both require extensive kernel changes to fully cover the
> > set of existing features at a sufficient granularity, and require user space
> > changes to grok the feature set of a given device.  Not saying its impossibible
> > or even undesireable mind you, just thats its not any less invasive than what
> > I'm proposing.
> 
> What I don't understand at this point is how hiding the ports behind
> a master device would buy us anything. We would still need to abstract
> the filtering capabilities of the ports at some level and hiding that
> behind existing tools seems to most convenient way.
> 

If we agree that inconsistent frame reception / stack bypass is acceptable, then
hiding the ports buys us nothing.  My only goal with that suggestion was to
differentiate ports on a switch device in such a way as to make it clear that
they don't behave like typical NIC ports that are meant to receive host
terminated traffic only.  If the consensus is to allow sparse reception of
forwarded traffic at the cpu, then no, it's not worthwhile and can be ignored.

Best
Neil

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 18:14                                   ` Jiri Pirko
@ 2014-03-26 18:29                                     ` Hannes Frederic Sowa
  2014-03-26 18:30                                     ` Florian Fainelli
  2014-03-26 21:51                                     ` Jamal Hadi Salim
  2 siblings, 0 replies; 125+ messages in thread
From: Hannes Frederic Sowa @ 2014-03-26 18:29 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Florian Fainelli, Roopa Prabhu, Jamal Hadi Salim, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

On Wed, Mar 26, 2014 at 07:14:36PM +0100, Jiri Pirko wrote:
> >You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my example.
> >
> >I think there is an implicit convention that sw1 represents the
> >Ethernet switch port connected to the CPU Ethernet MAC, and that it is
> >always connected, hence there is no need to create a "fake" bridge to
> >link sw1 to eth0 for instance?
> 
> I think you are kind of mixing apples and oranges (or I might be I'm not
> understanding you correctly).

I guess the discussion is only about user interface:

In case someone adds a bond on switch ports without knowing the details
of the kernel and interfaces, the kernel would transparently set up this
configuration on the switch via the specific driver. The iproute bond
interface stays the same but the backend differs.

Actually, I find the idea pretty neat, but the bonding netlink/sysfs
interface would be required to expose feature information, because
maybe not every switch supports each (bonding) feature (like l4-hashing)
and software fallback may not be possible.
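
A minimal sketch of that kind of feature check, in Python and purely as a
model. The policy strings are the bonding driver's xmit_hash_policy names;
`place_bond()` and the idea of a per-chip capability set are assumptions made
for the illustration, not an existing interface.

```python
import errno


def place_bond(policy, hw_policies, sw_fallback_possible):
    """Decide where a bond using the given hash policy would be handled."""
    if policy in hw_policies:
        return "hardware"
    if sw_fallback_possible:
        return "software"
    # Neither the chip nor the host can do it: report that to the user.
    raise OSError(errno.EOPNOTSUPP, f"hash policy {policy!r} not supported")


# Hypothetical chip that can hash on L2 and L2+L3 headers only.
chip = {"layer2", "layer2+3"}

print(place_bond("layer2", chip, sw_fallback_possible=False))
print(place_bond("layer3+4", chip, sw_fallback_possible=True))
try:
    place_bond("layer3+4", chip, sw_fallback_possible=False)
except OSError as exc:
    print("rejected:", exc.errno == errno.EOPNOTSUPP)
```

Exposing the capability set to userspace would let tools refuse such a
configuration up front instead of discovering the error at setup time.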

> This is how I see it, sticking to the names you use in the example:
> 
>             (sw1) (abstract place-holder netdev)
>           --------
>          switch chip                   CPU
>    -----------------------            ------
>    sw1p0 sw1p1 sw1p2 sw1p3             eth0
>      |     |     |     |                |
>     PHY   PHY   PHY    ------someMII-----
> 
> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
> switch part (port representation). 
>
> >>>ip link add link eth0 name eth0.2 type vlan id 2
> >>>
> >>>ip link add link sw1p0 name sw1p0.1 type vlan id 1
> >>>ip link add link sw1p1 name sw1p1.1 type vlan id 1
> >>>
> >>>ip link add sw1.1 type bridge
> >>>ip link set sw1p0.1 master sw1.1
> >>>ip link set sw1p1.1 master sw1.1
> >>>
> >>>Does that fit the model correctly?

Why not make every switch a bridge (from the user's POV) from the beginning
and then send bridge netlink messages to switchdev?

Greetings,

  Hannes

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 18:14                                   ` Jiri Pirko
  2014-03-26 18:29                                     ` Hannes Frederic Sowa
@ 2014-03-26 18:30                                     ` Florian Fainelli
  2014-03-26 21:51                                     ` Jamal Hadi Salim
  2 siblings, 0 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-26 18:30 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Roopa Prabhu, Jamal Hadi Salim, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

2014-03-26 11:14 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>> Wed, Mar 26, 2014 at 06:29:07PM CET, f.fainelli@gmail.com wrote:
>>>>2014-03-26 9:59 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>>>On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>>>>On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>>>>On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>>>>2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>>>>
>>>>>>>>Sorry about getting on this thread late and possibly in the middle.
>>>>>>>>Agree on the idea of keeping the ports linked to the master switch dev
>>>>>>>>(or the 'conduit' to the switch chip) via private list instead of the
>>>>>>>>master-slave relationship proposed earlier.
>>>>>>>>By private i mean the netdev->priv linkage to the master switch dev and
>>>>>>>>not really keeping the ports from being exposed to the user.
>>>>>>>>
>>>>>>>>We think its better to keep the switch ports exposed as any other netdev
>>>>>>>>on linux.
>>>>>>>>  This approach will make the switch ports look exactly like a nic port
>>>>>>>>and all tools will continue to work seamlessly. The switch port
>>>>>>>>operations could internally be forwarded to the switch netdev (sw1 in
>>>>>>>>the above case).
>>>>>>>>
>>>>>>>>example:
>>>>>>>>$ip link set dev sw1p0 up
>>>>>>>>$ethtool -S sw1p0
>>>>>>>>
>>>>>>>
>>>>>>>I like the approach. I know the above is a simple version, but i am
>>>>>>>assuming you also mean i can do things like
>>>>>>>ip route add ...
>>>>>>>bridge fdb add ... (and if you like your brctl go ahead)
>>>>>>>bonding ...
>>>>>>>
>>>>>>yes, exactly.  We support this model on our boxes today.
>>>>>>User can bond switch ports on our box in the exact same way as he/she
>>>>>>would bond two nic ports.
>>>>>>Our 'conduit to switch chip' reflects the corresponding lag
>>>>>>configuration in the switch chip.
>>>>>>Same goes for bridging, routing, acls.
>>>>>
>>>>>
>>>>> So you implement bonding netlink api? Or you hook into bonding driver
>>>>> itselt? Can you show us the code?
>>>>
>>>>Before we start talking about bonding, maybe we should make sure that
>>>>we cover some basic hardware switches uses which are to make some
>>>>ports belong to certain VLANs, tagged or untagged?
>>>>
>>>>It seems to me like this would become something like this, assuming P0
>>>>and P1 are two switch ports and 'eth0' is the CPU port, where P0 and
>>>>P1 belong to VLAN1 and CPU belongs to VLAN2:
>>>>
>>>>ip link set dev sw1p0 up
>>>>ip link set dev sw1p1 up
>>>>ip link set dev eth0 up
>>>
>>>
>>> I might be mistaken, But I think you are missing a switch port
>>> representing a connection to eth0 (eth0 being cpu conterpart of it).
>>> Or is it one of sw1p0 and sw1p1 ?
>>
>>You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my example.
>>
>>I think there is an implicit convention that sw1 represents the
>>Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>always connected, hence there is no need to create a "fake" bridge to
>>link sw1 to eth0 for instance?
>
> I think you are kind of mixing apples and oranges (or I might be I'm not
> understanding you correctly).
> This is how I see it, sticking to the names you use in the example:
>
>             (sw1) (abstract place-holder netdev)
>           --------
>          switch chip                   CPU
>    -----------------------            ------
>    sw1p0 sw1p1 sw1p2 sw1p3             eth0
>      |     |     |     |                |
>     PHY   PHY   PHY    ------someMII-----
>
> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
> switch part (port representation).

Thanks for clarifying, this is indeed how it should be modelled/represented.
-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 18:21                         ` Neil Horman
@ 2014-03-26 19:11                           ` Florian Fainelli
  2014-03-26 22:44                             ` Jamal Hadi Salim
  2014-03-26 19:24                           ` Hannes Frederic Sowa
  2014-03-27 13:43                           ` John W. Linville
  2 siblings, 1 reply; 125+ messages in thread
From: Florian Fainelli @ 2014-03-26 19:11 UTC (permalink / raw)
  To: Neil Horman
  Cc: Thomas Graf, Jamal Hadi Salim, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Felix Fietkau

2014-03-26 11:21 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> On Wed, Mar 26, 2014 at 11:29:03AM +0000, Thomas Graf wrote:
>> On 03/26/14 at 07:10am, Neil Horman wrote:
>> > But by creating net_devices that are registered in the current fashion we
>> > implicitly agree to levels of functionality that are assumed to be available and
>> > as such are not within the purview of a net_device to reject.  E.g. it is
>> > assumed that a netdevice can filter frames using iptables/ebtables, limit
>> > traffic using tc, etc.
>>
>> I think this is the point where we disagree. We already have several
>> devices that hook into the rx handler and never have their packets
>> pass through either iptables or ebtables. Better examples of this are
>> macvtap or OVS.
>>
> Yes, this is the point of contention, you're right.  And you're also correct in
> that we do have several devices that bypass the network stack on the.  My
> concern is that, in all of those cases its being bypassed because we know that
> other software is handling that functionality (in the case of macvtap we know
> that we're passing it off to a guest to be processed via the full network stack
> available in the guest, and in the case of OVS, we know that we are passing
> traffic to a software defined switch for handling).  In the case of having a
> switch fabric available, we're explicitly hiding the fact that traffic we are
> passing between ports never touches the cpu, and that just rubs me the wrong
> way.  I suppose I'm looking at switch fabrics in the same way that I look at
> TOE.  In offloading forwaring functionality we remove from the cpu activity
> which an administrator may reasonably expect to see handled in the cpu, but they
> wont.  In the case of macvlan, the admin knows thats a macvlan device, and
> packet handling for frames bound to it occurs in the guest.  for OVS, packets
> recieved on the cpu with the proper encapsulation are clearly handled in the
> OVS bridge.  But in the case of a hardware switch, all they see are 4 net device
> interfaces that seem like any other net device.

Right, this is why Felix did not expose the switch ports as netdevices
when he designed swconfig, because this would break the contract and
assumptions that net_devices do actually transport data, and are not
just used for control. It also made it easier to have a separate
control path to expose the gazillion different configuration knobs
that various switches offer...

>
> Perhaps I need to let go of this notion, but it seems to me, if we're going to
> allow cpu stack bypass, then we need to make that very obvious to an
> administrator.  Maybe a flag like IFF_L2ONLY (or perhaps better still
> IFF_LOCALDATAONLY, to indicate that only data directly addressed to the
> interface, or to a multi/broadcast address will be received by it, despite the
> promisc or other settings is sufficient). I really don't know.  Thats where my
> hang up is though.

This is where putting those devices in a separate namespace really
helped make that obvious. That said, there is already in-tree
infrastructure which is "breaking" the contract that per-port
net_devices do transport data, DSA in particular. Those per-port
network devices are just used as control endpoints to reach the switch's
per-port configuration registers. They might deliver some per-port
traffic at some point in time, until you reconfigure the switch to do
otherwise, e.g. by bridging LAN ports together.

If we use Jiri's latest patchset, IFF_LOCALDATAONLY would become
pretty much implied by IFF_SWITCH_PORT.

>
>> What should happen is that these devices are given a chance to implement
>> the ACL in their own flow table. If no such facility exists, the rule
>> insertion should fall back to software mode if that is possible (an
>> OF capable switching chip could insert a 'upcall' flow), or as
>> a last resort return an error to indicate EOPNOTSUPP.
>>
>> > And if a switch fabric is short cutting traffic so that
>> > the cpu doesn't see them, those bits of functionality won't work.  I agree we
>> > can likely work around that with richer feature capabilities, but such an
>> > infrastructure would both require extensive kernel changes to fully cover the
>> > set of existing features at a sufficient granularity, and require user space
>> > changes to grok the feature set of a given device.  Not saying its impossibible
>> > or even undesireable mind you, just thats its not any less invasive than what
>> > I'm proposing.
>>
>> What I don't understand at this point is how hiding the ports behind
>> a master device would buy us anything. We would still need to abstract
>> the filtering capabilities of the ports at some level and hiding that
>> behind existing tools seems to most convenient way.
>>
>
> If we agree that inconsistent frame reception / stack bypass is acceptable, then
> hiding the ports buys us nothing.

I think this was pretty much agreed on a while ago with DSA, macvlan
and TOE as you cited.

> My only goal with that suggestion was to
> differentiate ports on a switch device so that the ports were differentiated in
> such a way as to make it clear that they didn't behave like typical NIC ports
> that were meant to receive host terminated traffic only.  If the consensus is
> to allows sparse reception of forwarded traffic at the cpu, then no, its not
> worthwhile and can be ignored.

Part of the problem is that you might start seeing actual relevant
traffic on these per-port net_devices e.g: during software learning
times, where traffic to specific ports will also be mirrored to the
CPU port for lossless (or close to) traffic delivery, and then some
software agent on the CPU will decide to bridge/bond/add vlans to some
ports, and then we won't be seeing traffic again on these per-port
net_devices for a while (in the context of switches supporting tags).
As such, I'd rather treat those per-port net_devices as almost regular
net_devices to allow that traffic to flow, even though this is not a
permanent state.
-- 
Florian

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 18:21                         ` Neil Horman
  2014-03-26 19:11                           ` Florian Fainelli
@ 2014-03-26 19:24                           ` Hannes Frederic Sowa
  2014-03-27 13:43                           ` John W. Linville
  2 siblings, 0 replies; 125+ messages in thread
From: Hannes Frederic Sowa @ 2014-03-26 19:24 UTC (permalink / raw)
  To: Neil Horman
  Cc: Thomas Graf, Jamal Hadi Salim, Jiri Pirko, Florian Fainelli,
	netdev, David Miller, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher,
	vyasevic, Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On Wed, Mar 26, 2014 at 02:21:22PM -0400, Neil Horman wrote:
> Yes, this is the point of contention, you're right.  And you're also correct in
> that we do have several devices that bypass the network stack on the.  My
> concern is that, in all of those cases its being bypassed because we know that
> other software is handling that functionality (in the case of macvtap we know
> that we're passing it off to a guest to be processed via the full network stack
> available in the guest, and in the case of OVS, we know that we are passing
> traffic to a software defined switch for handling).  In the case of having a
> switch fabric available, we're explicitly hiding the fact that traffic we are
> passing between ports never touches the cpu, and that just rubs me the wrong
> way.  I suppose I'm looking at switch fabrics in the same way that I look at
> TOE.  In offloading forwaring functionality we remove from the cpu activity
> which an administrator may reasonably expect to see handled in the cpu, but they
> wont.  In the case of macvlan, the admin knows thats a macvlan device, and
> packet handling for frames bound to it occurs in the guest.  for OVS, packets
> recieved on the cpu with the proper encapsulation are clearly handled in the
> OVS bridge.  But in the case of a hardware switch, all they see are 4 net device
> interfaces that seem like any other net device.  
> 
> Perhaps I need to let go of this notion, but it seems to me, if we're going to
> allow cpu stack bypass, then we need to make that very obvious to an
> administrator.  Maybe a flag like IFF_L2ONLY (or perhaps better still
> IFF_LOCALDATAONLY, to indicate that only data directly addressed to the
> interface, or to a multi/broadcast address will be received by it, despite the
> promisc or other settings is sufficient). I really don't know.  Thats where my
> hang up is though.

The switch master port would actually be a normal interface only. The
ports which just get managed by the switch and don't directly communicate
with the host could have an IFF_CONFIGONLY flag to show they only react
to netlink config messages but don't handle any traffic.

Maybe we need this for routing offloading soon, too, and should not try to
design for switches only but for all kinds of devices which have their
forwarding plane unreachable from the kernel.

Some of the small routers integrate hardware NAT by now; it seems Broadcom has
some cut-through offloading for IP already implemented on their small-routing SoCs.

So maybe we have to redo this for L3 traffic soon, too. It depends on whether we
want to support that (see the TOE problem).

Greetings,

  Hannes

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 18:03                               ` Jiri Pirko
@ 2014-03-26 21:27                                 ` Roopa Prabhu
  2014-03-26 21:31                                   ` Jiri Pirko
  2014-04-01 19:13                                 ` Scott Feldman
  1 sibling, 1 reply; 125+ messages in thread
From: Roopa Prabhu @ 2014-03-26 21:27 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

On 3/26/14, 11:03 AM, Jiri Pirko wrote:
> Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>> On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>> On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>> On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>> On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>>> Sorry about getting on this thread late and possibly in the middle.
>>>>>> Agree on the idea of keeping the ports linked to the master switch dev
>>>>>> (or the 'conduit' to the switch chip) via private list instead of the
>>>>>> master-slave relationship proposed earlier.
>>>>>> By private i mean the netdev->priv linkage to the master switch dev and
>>>>>> not really keeping the ports from being exposed to the user.
>>>>>>
>>>>>> We think its better to keep the switch ports exposed as any other netdev
>>>>>> on linux.
>>>>>>   This approach will make the switch ports look exactly like a nic port
>>>>>> and all tools will continue to work seamlessly. The switch port
>>>>>> operations could internally be forwarded to the switch netdev (sw1 in
>>>>>> the above case).
>>>>>>
>>>>>> example:
>>>>>> $ip link set dev sw1p0 up
>>>>>> $ethtool -S sw1p0
>>>>>>
>>>>> I like the approach. I know the above is a simple version, but i am
>>>>> assuming you also mean i can do things like
>>>>> ip route add ...
>>>>> bridge fdb add ... (and if you like your brctl go ahead)
>>>>> bonding ...
>>>>>
>>>> yes, exactly.  We support this model on our boxes today.
>>>> User can bond switch ports on our box in the exact same way as he/she
>>>> would bond two nic ports.
>>>> Our 'conduit to switch chip' reflects the corresponding lag
>>>> configuration in the switch chip.
>>>> Same goes for bridging, routing, acls.
>>> So you implement bonding netlink api? Or you hook into bonding driver
>>> itselt? Can you show us the code?
>> We use the netlink API and libnl. In our current model, our switch
>> chip driver listens to netlink notifications and programs the switch
>> chip. The switch chip driver uses libnl caches and libnl netlink apis
>> to reflect the kernel state to switch chip.
>
> So when you configure for example bonding over 2 ports, you actually use
> bonding driver to do that. And you userspace app listens to
> notifications and programs the switch chip accordingly. Am I close?
yes correct.
>
> How about data? Is this new "bonding" interface able to assign ip to is
> and send/receive packets.
yes
>
> I'm still not sure I understand your concept. Do you have some
> documentation for it available?
>
I think the only documentation available today in this area is the user
guide, and that in turn points to the native Linux command manpages
(iproute2, sysfs, Debian ifupdown, etc.).
I will see if i can find anything else.
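
As a rough illustration of the model described in this exchange (a userspace
agent that listens to kernel netlink notifications and mirrors the state into
the switch chip), here is a minimal C sketch. It uses a raw rtnetlink socket
rather than the libnl caches mentioned above, and the function names
open_link_listener() and count_newlink_events() are hypothetical. A real agent
would program the hardware at the point where this sketch only counts events.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/*
 * Open a NETLINK_ROUTE socket subscribed to link notifications.
 * Returns the file descriptor, or -1 on failure.
 */
int open_link_listener(void)
{
	struct sockaddr_nl sa;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (fd < 0)
		return -1;
	memset(&sa, 0, sizeof(sa));
	sa.nl_family = AF_NETLINK;
	sa.nl_groups = RTMGRP_LINK;	/* link up/down, enslave, etc. */
	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Walk a buffer as filled in by recv() on the socket above and count
 * the RTM_NEWLINK messages in it. This is the place where a
 * (hypothetical) agent would translate each event into switch chip
 * configuration.
 */
int count_newlink_events(void *buf, int len)
{
	struct nlmsghdr *nlh = buf;
	int count = 0;

	assert(buf != NULL);
	for (; NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
		if (nlh->nlmsg_type == NLMSG_DONE)
			break;
		if (nlh->nlmsg_type == RTM_NEWLINK)
			count++;
	}
	return count;
}
```

In this sketch the kernel bonding/bridge drivers stay the source of truth; the
agent only reacts to what the kernel reports.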

thanks,
Roopa

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 21:27                                 ` Roopa Prabhu
@ 2014-03-26 21:31                                   ` Jiri Pirko
  2014-03-27 15:35                                     ` Roopa Prabhu
  0 siblings, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-26 21:31 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

Wed, Mar 26, 2014 at 10:27:05PM CET, roopa@cumulusnetworks.com wrote:
>On 3/26/14, 11:03 AM, Jiri Pirko wrote:
>>Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>>>On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>>>Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>>On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>>>On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>>>On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>>>2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>>>>Sorry about getting on this thread late and possibly in the middle.
>>>>>>>Agree on the idea of keeping the ports linked to the master switch dev
>>>>>>>(or the 'conduit' to the switch chip) via private list instead of the
>>>>>>>master-slave relationship proposed earlier.
>>>>>>>By private i mean the netdev->priv linkage to the master switch dev and
>>>>>>>not really keeping the ports from being exposed to the user.
>>>>>>>
>>>>>>>We think its better to keep the switch ports exposed as any other netdev
>>>>>>>on linux.
>>>>>>>  This approach will make the switch ports look exactly like a nic port
>>>>>>>and all tools will continue to work seamlessly. The switch port
>>>>>>>operations could internally be forwarded to the switch netdev (sw1 in
>>>>>>>the above case).
>>>>>>>
>>>>>>>example:
>>>>>>>$ip link set dev sw1p0 up
>>>>>>>$ethtool -S sw1p0
>>>>>>>
>>>>>>I like the approach. I know the above is a simple version, but i am
>>>>>>assuming you also mean i can do things like
>>>>>>ip route add ...
>>>>>>bridge fdb add ... (and if you like your brctl go ahead)
>>>>>>bonding ...
>>>>>>
>>>>>yes, exactly.  We support this model on our boxes today.
>>>>>User can bond switch ports on our box in the exact same way as he/she
>>>>>would bond two nic ports.
>>>>>Our 'conduit to switch chip' reflects the corresponding lag
>>>>>configuration in the switch chip.
>>>>>Same goes for bridging, routing, acls.
>>>>So you implement bonding netlink api? Or you hook into bonding driver
>>>>itselt? Can you show us the code?
>>>We use the netlink API and libnl. In our current model, our switch
>>>chip driver listens to netlink notifications and programs the switch
>>>chip. The switch chip driver uses libnl caches and libnl netlink apis
>>>to reflect the kernel state to switch chip.
>>
>>So when you configure for example bonding over 2 ports, you actually use
>>bonding driver to do that. And you userspace app listens to
>>notifications and programs the switch chip accordingly. Am I close?
>yes correct.
>>
>>How about data? Is this new "bonding" interface able to assign ip to is
>>and send/receive packets.
>yes
>>
>>I'm still not sure I understand your concept. Do you have some
>>documentation for it available?
>>
>I think the only documentation available today in this area is the
>user guide and that in-turn points to native linux command manpages
>iproute2, sysfs, debian ifupdown etc.
>I will see if i can find anything else.

I meant the architecture design documentation. Linux manpages are not
that interesting to me :)

>
>thanks,
>Roopa

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 15:22                         ` John W. Linville
@ 2014-03-26 21:36                           ` Jamal Hadi Salim
  0 siblings, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 21:36 UTC (permalink / raw)
  To: John W. Linville, Thomas Graf
  Cc: Neil Horman, Jiri Pirko, Florian Fainelli, netdev, David Miller,
	andy, dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek

On 03/26/14 11:22, John W. Linville wrote:

> This part makes sense to me -- use the hardware forwarding offloads if
> they are available, but fall back to software for sake of flexibility.
> It gives the admin enough rope to shoot himself in the foot...
>

This should be left up to policy. Maybe a sysctl? Example:
I should be able to choose whether i want to _only_
use the L3 table in hardware and not in software for certain
traffic which may require lower latency.
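
A minimal C sketch of such a policy knob, assuming three hypothetical policy
values (none of these names exist in the kernel or in this patchset). It only
shows the decision: hardware only, hardware with software fallback, or
software only, with -EOPNOTSUPP when the policy demands hardware and the
hardware cannot take the entry.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy values; not an existing sysctl. */
enum offload_policy {
	OFFLOAD_HW_ONLY,	/* fail if hardware cannot hold the entry */
	OFFLOAD_HW_PREFERRED,	/* try hardware, fall back to software */
	OFFLOAD_SW_ONLY,	/* never touch the hardware tables */
};

enum offload_target { TARGET_HW, TARGET_SW };

/*
 * Decide where an L3 entry should live. Returns 0 and sets *target on
 * success, -EOPNOTSUPP if the policy requires hardware but the
 * hardware cannot offload the entry.
 */
int choose_target(enum offload_policy policy, bool hw_can_offload,
		  enum offload_target *target)
{
	assert(target != NULL);
	switch (policy) {
	case OFFLOAD_SW_ONLY:
		*target = TARGET_SW;
		return 0;
	case OFFLOAD_HW_ONLY:
		if (!hw_can_offload)
			return -EOPNOTSUPP;
		*target = TARGET_HW;
		return 0;
	case OFFLOAD_HW_PREFERRED:
		*target = hw_can_offload ? TARGET_HW : TARGET_SW;
		return 0;
	}
	return -EINVAL;
}
```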

cheers,
jamal

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 18:14                                   ` Jiri Pirko
  2014-03-26 18:29                                     ` Hannes Frederic Sowa
  2014-03-26 18:30                                     ` Florian Fainelli
@ 2014-03-26 21:51                                     ` Jamal Hadi Salim
  2014-03-26 22:22                                       ` Florian Fainelli
  2 siblings, 1 reply; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 21:51 UTC (permalink / raw)
  To: Jiri Pirko, Florian Fainelli
  Cc: Roopa Prabhu, Neil Horman, Thomas Graf, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Shrijeet Mukherjee

On 03/26/14 14:14, Jiri Pirko wrote:
> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:


>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my example.
>>
>> I think there is an implicit convention that sw1 represents the
>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>> always connected, hence there is no need to create a "fake" bridge to
>> link sw1 to eth0 for instance?
>
> I think you are kind of mixing apples and oranges (or I might be I'm not
> understanding you correctly).
> This is how I see it, sticking to the names you use in the example:
>
>              (sw1) (abstract place-holder netdev)
>            --------
>           switch chip                   CPU
>     -----------------------            ------
>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>       |     |     |     |                |
>      PHY   PHY   PHY    ------someMII-----
>
> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
> switch part (port representation).
>


Florian - I am sure you explained this before; I just don't remember. Why
is there a need to expose eth0? It seems to me sw1p0-3 are abstracted
already in the kernel and the "cpu port" is merely a control interface.

Note: even the high end chips tend to have the concept of a "cpu port"
but my experience is to hide that as part of the switch driver.

cheers,
jamal

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 21:51                                     ` Jamal Hadi Salim
@ 2014-03-26 22:22                                       ` Florian Fainelli
  2014-03-26 22:53                                         ` Jamal Hadi Salim
  2014-03-27  6:56                                         ` Jiri Pirko
  0 siblings, 2 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-26 22:22 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jiri Pirko, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
> On 03/26/14 14:14, Jiri Pirko wrote:
>>
>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>
>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>
>
>
>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>> example.
>>>
>>> I think there is an implicit convention that sw1 represents the
>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>> always connected, hence there is no need to create a "fake" bridge to
>>> link sw1 to eth0 for instance?
>>
>>
>> I think you are kind of mixing apples and oranges (or I might be I'm not
>> understanding you correctly).
>> This is how I see it, sticking to the names you use in the example:
>>
>>              (sw1) (abstract place-holder netdev)
>>            --------
>>           switch chip                   CPU
>>     -----------------------            ------
>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>       |     |     |     |                |
>>      PHY   PHY   PHY    ------someMII-----
>>
>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>> switch part (port representation).
>>
>
>
> Florian - I am sure you explained this before; I just dont remember. Why
> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
> already in the kernel and the "cpu port" is merely a control interface.

eth0 corresponds to a CPU Ethernet MAC facing e.g. the sw1p3 switch port.
It is a "regular" Ethernet driver connected to the switch without
switch-specific logic. The goal is twofold:

- allow any regular Ethernet driver to be connected to an external
switch via e.g. MDIO/MDC or other without specific switch knowledge
- accurately represent how the hardware is designed/connected

but maybe we can simplify and have e.g. sw1p3 and eth0 be the same interface...

>
> Note: even the high end chips tend to have the concept of a "cpu port"
> but my experience is to hide that as part of the switch driver.
>
> cheers,
> jamal
>
>



-- 
Florian

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 19:11                           ` Florian Fainelli
@ 2014-03-26 22:44                             ` Jamal Hadi Salim
  2014-03-26 23:15                               ` Thomas Graf
  2014-03-27 15:26                               ` Neil Horman
  0 siblings, 2 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 22:44 UTC (permalink / raw)
  To: Florian Fainelli, Neil Horman
  Cc: Thomas Graf, Jiri Pirko, netdev, David Miller, Andy Gospodarek,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Felix Fietkau

On 03/26/14 15:11, Florian Fainelli wrote:
> 2014-03-26 11:21 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:

>> Yes, this is the point of contention, you're right.  And you're also correct in
>> that we do have several devices that bypass the network stack on the.  My
>> concern is that, in all of those cases its being bypassed because we know that
>> other software is handling that functionality (in the case of macvtap we know
>> that we're passing it off to a guest to be processed via the full network stack
>> available in the guest, and in the case of OVS, we know that we are passing
>> traffic to a software defined switch for handling).  In the case of having a
>> switch fabric available, we're explicitly hiding the fact that traffic we are
>> passing between ports never touches the cpu, and that just rubs me the wrong
>> way.  I suppose I'm looking at switch fabrics in the same way that I look at
>> TOE.  In offloading forwaring functionality we remove from the cpu activity
>> which an administrator may reasonably expect to see handled in the cpu, but they
>> wont.  In the case of macvlan, the admin knows thats a macvlan device, and
>> packet handling for frames bound to it occurs in the guest.  for OVS, packets
>> recieved on the cpu with the proper encapsulation are clearly handled in the
>> OVS bridge.  But in the case of a hardware switch, all they see are 4 net device
>> interfaces that seem like any other net device.
>
> Right, this is why Felix did not expose the switch ports as netdevices
> when he designed swconfig, because this would break the contract and
> assumptions that net_devices do actually transport data, and are not
> just used for control. It also made it easier to have a separate
> control path to expose the gazillion different configuration knobs
> that various switches offer...
>

Neil, I may be misreading your "TOE" semantics, but i think you view the
switch ports from a host prism. I am a middle box guy - I love it when
packets transiting through my box are offloaded. I can move more
bits/sec.
It is only TOE if the middle box is trying to do an end host function;->

OTOH, the owrt view is probably because (if i understood correctly
last time) there are cases where there is no way to even pass packets
and attribute them to the originating switch ports. In fact, in some
cases there may be no way at all to even pass packets to the kernel.
Did i understand that part correctly?
I suppose this is eventually all part of that capability discovery.

[..]

>
> Part of the problem is that you might start seeing actual relevant
> traffic on these per-port net_devices e.g: during software learning
> times, where traffic to specific ports will also be mirrored to the
> CPU port for lossless (or close to) traffic delivery, and then some
> software agent on the CPU will decide to bridge/bond/add vlans to some
> ports, and then we won't be seeing traffic again on these per-port
> net_devices for a while (in the context of switches supporting tags).
> As such, I'd rather treat those per-port net_devices as almost regular
> net_devices to allow that traffic to flow, even though this is not a
> permanent state.
>

A nod from here.
I think it would be useful to enumerate these types of devices
and what their control/data capability is.
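
One way such an enumeration could look, as a minimal C sketch. The capability
bits and helper names below are hypothetical and only mirror the cases raised
in this thread (control-only ports, ports that deliver host-addressed traffic,
ports whose CPU-bound frames can be attributed to an ingress port).

```c
#include <stdbool.h>

/* Hypothetical per-port capability bits, for illustration only. */
enum port_cap {
	PORT_CAP_CONTROL  = 1u << 0,	/* accepts configuration (netlink, ethtool) */
	PORT_CAP_RX_LOCAL = 1u << 1,	/* delivers host-addressed frames to the CPU */
	PORT_CAP_RX_ALL   = 1u << 2,	/* can deliver forwarded frames to the CPU */
	PORT_CAP_SRC_PORT = 1u << 3,	/* CPU-bound frames carry the ingress port */
};

/* A port that is configurable but never passes data to the kernel. */
bool port_is_config_only(unsigned int caps)
{
	return (caps & PORT_CAP_CONTROL) &&
	       !(caps & (PORT_CAP_RX_LOCAL | PORT_CAP_RX_ALL));
}

/* A port that can be treated like a regular NIC port by existing tools. */
bool port_behaves_like_nic(unsigned int caps)
{
	unsigned int need = PORT_CAP_CONTROL | PORT_CAP_RX_LOCAL |
			    PORT_CAP_RX_ALL | PORT_CAP_SRC_PORT;

	return (caps & need) == need;
}
```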

cheers,
jamal

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 22:22                                       ` Florian Fainelli
@ 2014-03-26 22:53                                         ` Jamal Hadi Salim
  2014-03-26 23:16                                           ` Florian Fainelli
  2014-03-27  6:56                                         ` Jiri Pirko
  1 sibling, 1 reply; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-26 22:53 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Jiri Pirko, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

On 03/26/14 18:22, Florian Fainelli wrote:
> 2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:

>
> eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
> It is "regular" Ethernet driver connected to the switch without
> switch-specific logic. The goal is twofold:
>
> - allow any regular Ethernet driver to be connected to an external
> switch via e.g: MDIO/MDC or other without specific switch knowledge
> - represents accurately how the hardware is designed/connected
>

Gah. Sorry - I missed the MII interface.
In such a case as shown here then, how do you control sw1p0-3?

> but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...

It sounds to me the CPU side is only a driver for sw1p3.

>>
>> Note: even the high end chips tend to have the concept of a "cpu port"
>> but my experience is to hide that as part of the switch driver.

Note: the high end devices "cpu ports" are accessible typically
via PCIE interfaces for control and some DMA for data activity.

cheers,
jamal

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 22:44                             ` Jamal Hadi Salim
@ 2014-03-26 23:15                               ` Thomas Graf
  2014-03-26 23:21                                 ` Florian Fainelli
  2014-03-27 15:26                               ` Neil Horman
  1 sibling, 1 reply; 125+ messages in thread
From: Thomas Graf @ 2014-03-26 23:15 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Florian Fainelli, Neil Horman, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Felix Fietkau

On 03/26/14 at 06:44pm, Jamal Hadi Salim wrote:
> OTOH, the owrt view is probably because (If i understood correctly
> last time), there are cases where there is no way to even pass packets
> and attribute them to the originating switch ports. Infact, in some
> cases  there may be no way at all to even pass packets to the kernel.
> Did i  understand that part correctly?
> I suppose this is eventually all part of that capability discovery.

Listening to Florian it sounds like the fact that a separate control
path was chosen early on in owrt got rid of the main driver to abstract
everything through globally visible net_devices. Reusing existing
tools was never an objective.

I believe that the question whether a particular port will send
packets to the cpu does not matter that much. We'll see both and we'll
see various forms of hybrid models with software based learning paths,
slow paths, and deliberate upcalls.

The simpler the model, the better. If the desire to hide some of the
complexity is driven by usability then I believe that hiding should
happen in user space.

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 22:53                                         ` Jamal Hadi Salim
@ 2014-03-26 23:16                                           ` Florian Fainelli
  0 siblings, 0 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-26 23:16 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Jiri Pirko, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-26 15:53 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
> On 03/26/14 18:22, Florian Fainelli wrote:
>>
>> 2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>
>
>>
>> eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>> It is "regular" Ethernet driver connected to the switch without
>> switch-specific logic. The goal is twofold:
>>
>> - allow any regular Ethernet driver to be connected to an external
>> switch via e.g: MDIO/MDC or other without specific switch knowledge
>> - represents accurately how the hardware is designed/connected
>>
>
> Gah. Sorry - I missed the MII interface.
> In such a case as shown here then, how do you control sw1p0-3?

Most switches expose individual ports as individual PHY addresses
either on the same MDIO bus that connects the Ethernet MAC to the
switch, or on an internal one which is accessible through a special PHY
address on the "regular" MDIO bus. Ports 0-3 are accessible
individually at MDIO addresses 0-3 and the special CPU port has a
special PHY address, e.g. 16 for Marvell, 30 for Broadcom, which
delivers register access to global and per-port configuration
registers. For memory-mapped switches, well, you get per-port register
ranges, so this is much simpler.

ethtool would be the user interface to talk individually to these
ports here, and the DSA driver just talks to the individual ports
through MDIO to get their port status (right now it regularly polls
for the port link status/duplex/speed).
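
The addressing described above can be sketched in C as follows. This is only
an illustration under stated assumptions: the helper names are hypothetical,
the four-port layout follows the sw1p0-3 example, and the addresses 16 and 30
are the "e.g." values from this mail, not datasheet constants for any
particular chip.

```c
#include <errno.h>

#define NUM_USER_PORTS 4	/* sw1p0..sw1p3 in the running example */

enum switch_vendor { VENDOR_MARVELL, VENDOR_BROADCOM };

/* User ports map 1:1 to MDIO (PHY) addresses 0..3 in this example. */
int port_mdio_addr(int port)
{
	if (port < 0 || port >= NUM_USER_PORTS)
		return -EINVAL;
	return port;
}

/*
 * Special PHY address giving access to the global and per-port
 * configuration registers. Values are the examples quoted above.
 */
int global_mdio_addr(enum switch_vendor vendor)
{
	switch (vendor) {
	case VENDOR_MARVELL:
		return 16;
	case VENDOR_BROADCOM:
		return 30;
	}
	return -EINVAL;
}
```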

>
>
>> but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same
>> interface...
>
>
> It sounds to me the CPU side is only a driver for sw1p3.

I think what I had in mind when I wrote that part of the mail is some
special hardware here on which the CPU Ethernet MAC output queues do
not have a 1:1 mapping to the corresponding switch egress port output
queues, but I think you are right.

>
>
>>>
>>> Note: even the high end chips tend to have the concept of a "cpu port"
>>> but my experience is to hide that as part of the switch driver.
>
>
> Note: the high end devices "cpu ports" are accessible typically
> via PCIE interfaces for control and some DMA for data activity.
>
> cheers,
> jamal
>
>



-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 23:15                               ` Thomas Graf
@ 2014-03-26 23:21                                 ` Florian Fainelli
  0 siblings, 0 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-26 23:21 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Jamal Hadi Salim, Neil Horman, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Felix Fietkau

2014-03-26 16:15 GMT-07:00 Thomas Graf <tgraf@suug.ch>:
> On 03/26/14 at 06:44pm, Jamal Hadi Salim wrote:
>> OTOH, the owrt view is probably because (If i understood correctly
>> last time), there are cases where there is no way to even pass packets
>> and attribute them to the originating switch ports. Infact, in some
>> cases  there may be no way at all to even pass packets to the kernel.
>> Did i  understand that part correctly?
>> I suppose this is eventually all part of that capability discovery.
>
> Listening to Florian it sounds like the fact that a separate control
> path was chosen early on in owrt got rid of the main driver to abstract
> everything through globally visible net_devices. Reusing existing
> tools was never an objective.

Correct. OpenWrt already has a fairly custom user-space, so it was
deemed reasonable to have another lightweight, yet custom, control
interface for switches. The ability to use an unmodified Ethernet
driver was also a key goal. The reason for putting that in the kernel,
versus using e.g. an ioctl(SIOCGMIIREG) based approach in user-space,
is that it allows for better abstraction between control paths (MDIO,
I2C, SPI, memory-mapped I/O ...), and preserves the "kernel has
hardware ownership" paradigm.

>
> I believe that the question whether a particular port will send
> packets to the cpu does not matter that much. We'll see both and we'll
> see various forms of hybrid models with software based learning paths,
> slow paths, and deliberate upcalls.
>
> The simpler the model, the better. If the desire to hide some of the
> complexity is driven by usability then I believe that hiding should
> happen in user space.
-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 22:22                                       ` Florian Fainelli
  2014-03-26 22:53                                         ` Jamal Hadi Salim
@ 2014-03-27  6:56                                         ` Jiri Pirko
  2014-03-27 10:39                                           ` Jamal Hadi Salim
  2014-03-27 14:10                                           ` Sergey Ryazanov
  1 sibling, 2 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-27  6:56 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Jamal Hadi Salim, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>
>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>
>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>
>>
>>
>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>> example.
>>>>
>>>> I think there is an implicit convention that sw1 represents the
>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>> always connected, hence there is no need to create a "fake" bridge to
>>>> link sw1 to eth0 for instance?
>>>
>>>
>>> I think you are kind of mixing apples and oranges (or it might be that I'm not
>>> understanding you correctly).
>>> This is how I see it, sticking to the names you use in the example:
>>>
>>>              (sw1) (abstract place-holder netdev)
>>>            --------
>>>           switch chip                   CPU
>>>     -----------------------            ------
>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>       |     |     |     |                |
>>>      PHY   PHY   PHY    ------someMII-----
>>>
>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>> switch part (port representation).
>>>
>>
>>
>> Florian - I am sure you explained this before; I just dont remember. Why
>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>> already in the kernel and the "cpu port" is merely a control interface.
>
>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>It is "regular" Ethernet driver connected to the switch without
>switch-specific logic. The goal is twofold:
>
>- allow any regular Ethernet driver to be connected to an external
>switch via e.g: MDIO/MDC or other without specific switch knowledge
>- represents accurately how the hardware is designed/connected
>
>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...

I believe that having both sw1p3 and eth0 is the correct way of
modelling this. sw1p3 is an instance of the switch chip driver representing the
actual port of a switch. eth0 is an instance of some other ordinary NIC
driver (8139too is my favorite :))

This model allows us to draw the exact picture.
Also, when you add the described possibility to use iplink to build
vlans, bridges, whatever on the switch ports, it makes perfect sense to
have this model.

Merging sw1p3 and eth0 would cause a loss of information and confusion.

>
>>
>> Note: even the high end chips tend to have the concept of a "cpu port"
>> but my experience is to hide that as part of the switch driver.
>>
>> cheers,
>> jamal
>>
>>
>
>
>
>-- 
>Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27  6:56                                         ` Jiri Pirko
@ 2014-03-27 10:39                                           ` Jamal Hadi Salim
  2014-03-27 10:50                                             ` Jiri Pirko
  2014-03-27 14:10                                           ` Sergey Ryazanov
  1 sibling, 1 reply; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-27 10:39 UTC (permalink / raw)
  To: Jiri Pirko, Florian Fainelli
  Cc: Roopa Prabhu, Neil Horman, Thomas Graf, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Shrijeet Mukherjee, Felix Fietkau

On 03/27/14 02:56, Jiri Pirko wrote:
> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:

>
> I believe that having both sw1p3 and eth0 is the correct way of
> modelling this. sw1p3 is an instance of the switch chip driver representing the
> actual port of a switch. eth0 is an instance of some other ordinary NIC
> driver (8139too is my favorite :))
>
> This model allows us to draw the exact picture.
> Also, when you add the described possibility to use iplink to build
> vlans, bridges, whatever on the switch ports, it makes perfect sense to
> have this model.
>
> Merging sw1p3 and eth0 would cause a loss of information and confusion.

I think that if eth0 and sw1p3 have different control interfaces i
agree.

Jiri, sorry - didnt respond to your other email on your patch. My
comment was on newlink() interface - i was wondering about why it was
there i.e as in allowing user space to create switch ports.

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 10:39                                           ` Jamal Hadi Salim
@ 2014-03-27 10:50                                             ` Jiri Pirko
  2014-03-27 11:12                                               ` Jamal Hadi Salim
  0 siblings, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-27 10:50 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Florian Fainelli, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

Thu, Mar 27, 2014 at 11:39:59AM CET, jhs@mojatatu.com wrote:
>On 03/27/14 02:56, Jiri Pirko wrote:
>>Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>
>>
>>I believe that having both sw1p3 and eth0 is the correct way of
>>modelling this. sw1p3 is an instance of the switch chip driver representing the
>>actual port of a switch. eth0 is an instance of some other ordinary NIC
>>driver (8139too is my favorite :))
>>
>>This model allows us to draw the exact picture.
>>Also, when you add the described possibility to use iplink to build
>>vlans, bridges, whatever on the switch ports, it makes perfect sense to
>>have this model.
>>
>>Merging sw1p3 and eth0 would cause a loss of information and confusion.
>
>I think that if eth0 and sw1p3 have different control interfaces i
>agree.
>
>Jiri, sorry - didnt respond to your other email on your patch. My
>comment was on newlink() interface - i was wondering about why it was
>there i.e as in allowing user space to create switch ports.

In my patchset, this is only for dummyswitch, the dummy switch driver
implementation. It provides a possibility for the user to create these using
iplink.

>
>cheers,
>jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 10:50                                             ` Jiri Pirko
@ 2014-03-27 11:12                                               ` Jamal Hadi Salim
  2014-03-27 11:16                                                 ` Jiri Pirko
  0 siblings, 1 reply; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-27 11:12 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Florian Fainelli, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

On 03/27/14 06:50, Jiri Pirko wrote:
> Thu, Mar 27, 2014 at 11:39:59AM CET, jhs@mojatatu.com wrote:

> In my patchset, this is only for dummyswitch, the dummy switch driver
> implementation. It provides a possibility for the user to create these using
> iplink.
>

Understood, but what is the motivation? ;-> Do you see some vendor SDK 
creating these?

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 11:12                                               ` Jamal Hadi Salim
@ 2014-03-27 11:16                                                 ` Jiri Pirko
  0 siblings, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-27 11:16 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Florian Fainelli, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

Thu, Mar 27, 2014 at 12:12:07PM CET, jhs@mojatatu.com wrote:
>On 03/27/14 06:50, Jiri Pirko wrote:
>>Thu, Mar 27, 2014 at 11:39:59AM CET, jhs@mojatatu.com wrote:
>
>>In my patchset, this is only for dummyswitch, the dummy switch driver
>>implementation. It provides a possibility for the user to create these using
>>iplink.
>>
>
>Understood, but what is the motivation? ;-> Do you see some vendor
>SDK creating these?

No, dummyswitch is just for testing purposes and for showing the switchdev
API use. Nothing else. I most probably will not add it to the final
patchset proposal.

>
>cheers,
>jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 18:21                         ` Neil Horman
  2014-03-26 19:11                           ` Florian Fainelli
  2014-03-26 19:24                           ` Hannes Frederic Sowa
@ 2014-03-27 13:43                           ` John W. Linville
  2 siblings, 0 replies; 125+ messages in thread
From: John W. Linville @ 2014-03-27 13:43 UTC (permalink / raw)
  To: Neil Horman
  Cc: Thomas Graf, Jamal Hadi Salim, Jiri Pirko, Florian Fainelli,
	netdev, David Miller, andy, dborkman, ogerlitz, jesse, pshelar,
	azhou, Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher,
	vyasevic, Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek

On Wed, Mar 26, 2014 at 02:21:22PM -0400, Neil Horman wrote:
> On Wed, Mar 26, 2014 at 11:29:03AM +0000, Thomas Graf wrote:

> > What I don't understand at this point is how hiding the ports behind
> > a master device would buy us anything. We would still need to abstract
> > the filtering capabilities of the ports at some level and hiding that
> > behind existing tools seems to be the most convenient way.
> > 
> 
> If we agree that inconsistent frame reception / stack bypass is acceptable, then
> hiding the ports buys us nothing.  My only goal with that suggestion was to
> differentiate ports on a switch device so that the ports were differentiated in
> such a way as to make it clear that they didn't behave like typical NIC ports
> that were meant to receive host terminated traffic only.  If the consensus is
> to allow sparse reception of forwarded traffic at the cpu, then no, it's not
> worthwhile and can be ignored.

I don't like the master device idea.  It just seems likely to cause
confusion.  But, we may want something to expose some topology
information to the user.  There could be situations where it is
relevant to know that two netdevs are part of the same hardware
device.

In the past I've also seen boxes with multiple switch chips tied
together either through dedicated "stacking" ports or even just with
ethernet ports directly wired together inside the box.  A variety
of combinations are possible, each with performance considerations
that might be relevant to an admin.  It would be good if we could
help them map-out such relationships.

Maybe the existing bus infrastructure (or something like it) could
be leveraged?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 18:09                                 ` Florian Fainelli
@ 2014-03-27 13:46                                   ` John W. Linville
  0 siblings, 0 replies; 125+ messages in thread
From: John W. Linville @ 2014-03-27 13:46 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Roopa Prabhu, Jiri Pirko, Jamal Hadi Salim, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

On Wed, Mar 26, 2014 at 11:09:43AM -0700, Florian Fainelli wrote:
> 2014-03-26 10:57 GMT-07:00 Roopa Prabhu <roopa@cumulusnetworks.com>:
> > On 3/26/14, 10:29 AM, Florian Fainelli wrote:
> >>
> >> 2014-03-26 9:59 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
> >>>
> >>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
> >>>>
> >>>> On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
> >>>>>
> >>>>> On 03/26/14 01:37, Roopa Prabhu wrote:
> >>>>>>
> >>>>>> On 3/25/14, 1:11 PM, Florian Fainelli wrote:
> >>>>>>>
> >>>>>>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> >>>>>>
> >>>>>> Sorry about getting on this thread late and possibly in the middle.
> >>>>>> Agree on the idea of keeping the ports linked to the master switch dev
> >>>>>> (or the 'conduit' to the switch chip) via private list instead of the
> >>>>>> master-slave relationship proposed earlier.
> >>>>>> By private i mean the netdev->priv linkage to the master switch dev
> >>>>>> and
> >>>>>> not really keeping the ports from being exposed to the user.
> >>>>>>
> >>>>>> We think its better to keep the switch ports exposed as any other
> >>>>>> netdev
> >>>>>> on linux.
> >>>>>>   This approach will make the switch ports look exactly like a nic
> >>>>>> port
> >>>>>> and all tools will continue to work seamlessly. The switch port
> >>>>>> operations could internally be forwarded to the switch netdev (sw1 in
> >>>>>> the above case).
> >>>>>>
> >>>>>> example:
> >>>>>> $ip link set dev sw1p0 up
> >>>>>> $ethtool -S sw1p0
> >>>>>>
> >>>>> I like the approach. I know the above is a simple version, but i am
> >>>>> assuming you also mean i can do things like
> >>>>> ip route add ...
> >>>>> bridge fdb add ... (and if you like your brctl go ahead)
> >>>>> bonding ...
> >>>>>
> >>>> yes, exactly.  We support this model on our boxes today.
> >>>> User can bond switch ports on our box in the exact same way as he/she
> >>>> would bond two nic ports.
> >>>> Our 'conduit to switch chip' reflects the corresponding lag
> >>>> configuration in the switch chip.
> >>>> Same goes for bridging, routing, acls.
> >>>
> >>>
> >>> So you implement bonding netlink api? Or you hook into bonding driver
> >>> itself? Can you show us the code?
> >>
> >> Before we start talking about bonding, maybe we should make sure that
> >> we cover some basic hardware switches uses which are to make some
> >> ports belong to certain VLANs, tagged or untagged?
> >>
> >> It seems to me like this would become something like this, assuming P0
> >> and P1 are two switch ports and 'eth0' is the CPU port, where P0 and
> >> P1 belong to VLAN1 and CPU belongs to VLAN2:
> >>
> >> ip link set dev sw1p0 up
> >> ip link set dev sw1p1 up
> >> ip link set dev eth0 up
> >>
> >> ip link add link eth0 name eth0.2 type vlan id 2
> >>
> >> ip link add link sw1p0 name sw1p0.1 type vlan id 1
> >> ip link add link sw1p1 name sw1p1.1 type vlan id 1
> >>
> >> ip link add sw1.1 type bridge
> >> ip link set sw1p0.1 master sw1.1
> >> ip link set sw1p1.1 master sw1.1
> >>
> >> Does that fit the model correctly?
> >
> > Not entirely, but close.
> > In our current model, there is no netdev for cpu port (or master switch
> > netdev):
> 
> You mean there is no netdev for the switch-side CPU-port facing the
> CPU Ethernet MAC, right? There is still a netdev for the CPU Ethernet
> MAC to receive packets destined to it presumably.
> 
> I do not think it hurts nor changes anything to introduce a CPU-port
> netdev, this just gives greater flexibility and this should allow for
> more complex setups where multiple CPU-ports exist (there are some
> real devices using this...).

But how is this useful?  It just seems redundant to me.  Also, how
should we communicate to the user that a given interface on the switch
is hardwired to a given NIC on the CPU?
 
> >
> >
> > ip link set dev sw1p0 up
> > ip link set dev sw1p1 up
> >
> > ip link add link sw1p0 name sw1p0.1 type vlan id 1
> > ip link add link sw1p1 name sw1p1.1 type vlan id 1
> >
> > ip link add brvlan1 type bridge
> > ip link set sw1p0.1 master brvlan1
> > ip link set sw1p1.1 master brvlan1
> >
> > switch driver programs the brvlan1 vlan in the switch asic.
> >
> > bonding works in the same way.
> >
> > Thanks,
> > Roopa
> >
> >
> >
> >
> >
> 
> 
> 
> -- 
> Florian
> 

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27  6:56                                         ` Jiri Pirko
  2014-03-27 10:39                                           ` Jamal Hadi Salim
@ 2014-03-27 14:10                                           ` Sergey Ryazanov
  2014-03-27 16:41                                             ` Florian Fainelli
  2014-03-27 16:55                                             ` Jiri Pirko
  1 sibling, 2 replies; 125+ messages in thread
From: Sergey Ryazanov @ 2014-03-27 14:10 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Florian Fainelli, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

Hi all,

sorry for the intrusion, but let me place my 2 cents.

2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>
>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>
>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>
>>>
>>>
>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>> example.
>>>>>
>>>>> I think there is an implicit convention that sw1 represents the
>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>> link sw1 to eth0 for instance?
>>>>
>>>>
>>>> I think you are kind of mixing apples and oranges (or it might be that I'm not
>>>> understanding you correctly).
>>>> This is how I see it, sticking to the names you use in the example:
>>>>
>>>>              (sw1) (abstract place-holder netdev)
>>>>            --------
>>>>           switch chip                   CPU
>>>>     -----------------------            ------
>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>       |     |     |     |                |
>>>>      PHY   PHY   PHY    ------someMII-----
>>>>
>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>> switch part (port representation).
>>>>
>>>
>>>
>>> Florian - I am sure you explained this before; I just dont remember. Why
>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>> already in the kernel and the "cpu port" is merely a control interface.
>>
>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>It is "regular" Ethernet driver connected to the switch without
>>switch-specific logic. The goal is twofold:
>>
>>- allow any regular Ethernet driver to be connected to an external
>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>- represents accurately how the hardware is designed/connected
>>
>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>
> I believe that having both sw1p3 and eth0 is the correct way of
> modelling this. sw1p3 is an instance of the switch chip driver representing the
> actual port of a switch. eth0 is an instance of some other ordinary NIC
> driver (8139too is my favorite :))
>
> This model allows us to draw the exact picture.
> Also, when you add the described possibility to use iplink to build
> vlans, bridges, whatever on the switch ports, it makes perfect sense to
> have this model.
>
> Merging sw1p3 and eth0 would cause a loss of information and confusion.
>

The CPU switch port and the switch fabric itself should be configured
consistently with the host; first of all I mean the set of VLANs. It
should also be mentioned that some generic knobs, such as port rate and
duplex mode, are meaningless for the CPU switch port, and a lot of status
information (rx/tx counters etc.) duplicates the statistics of the host
interface which is connected to the switch port. So there is no reason
to force the user to configure this port manually, and automatic
configuration of the CPU switch port without exporting it as a netdev
seems like a good approach.

-- 
BR,
Sergey

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 22:44                             ` Jamal Hadi Salim
  2014-03-26 23:15                               ` Thomas Graf
@ 2014-03-27 15:26                               ` Neil Horman
  2014-03-27 21:33                                 ` Jamal Hadi Salim
  1 sibling, 1 reply; 125+ messages in thread
From: Neil Horman @ 2014-03-27 15:26 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Florian Fainelli, Thomas Graf, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Felix Fietkau

On Wed, Mar 26, 2014 at 06:44:08PM -0400, Jamal Hadi Salim wrote:
> On 03/26/14 15:11, Florian Fainelli wrote:
> >2014-03-26 11:21 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
> 
> >>Yes, this is the point of contention, you're right.  And you're also correct in
> >>that we do have several devices that bypass the network stack on the.  My
> >>concern is that, in all of those cases its being bypassed because we know that
> >>other software is handling that functionality (in the case of macvtap we know
> >>that we're passing it off to a guest to be processed via the full network stack
> >>available in the guest, and in the case of OVS, we know that we are passing
> >>traffic to a software defined switch for handling).  In the case of having a
> >>switch fabric available, we're explicitly hiding the fact that traffic we are
> >>passing between ports never touches the cpu, and that just rubs me the wrong
> >>way.  I suppose I'm looking at switch fabrics in the same way that I look at
> >>TOE.  In offloading forwarding functionality we remove from the cpu activity
> >>which an administrator may reasonably expect to see handled in the cpu, but they
> >>won't.  In the case of macvlan, the admin knows that's a macvlan device, and
> >>packet handling for frames bound to it occurs in the guest.  For OVS, packets
> >>received on the cpu with the proper encapsulation are clearly handled in the
> >>OVS bridge.  But in the case of a hardware switch, all they see are 4 net device
> >>interfaces that seem like any other net device.
> >
> >Right, this is why Felix did not expose the switch ports as netdevices
> >when he designed swconfig, because this would break the contract and
> >assumptions that net_devices do actually transport data, and are not
> >just used for control. It also made it easier to have a separate
> >control path to expose the gazillion different configuration knobs
> >that various switches offer...
> >
> 
> Neil, I may be misreading your "TOE" semantics, but i think you view
> the switch ports from a host prism. I am a middle box guy - I love
> it when packets transiting through my box are offloaded. I can move
> more
> bits/sec.
> It is only TOE if the middle box is trying to do an end host function;->
> 
You're absolutely correct - I am viewing this from a host based perspective.
And I completely understand that offload is good in a middle box environment (I
worked for embedded switch companies in a former life).  I'm looking at it from
a host perspective because, as we've been discussing the wide range of devices
covered here (from the small SOC switches used by owrt to the big enterprise
switches), there's this middle ground that's seeing some consolidation here which
I think we need to cover as well.  I'm referring to NICs that have an embedded
switch in them that can (or soon will) perform lots of these flow based
forwarding operations and actions.

> OTOH, the owrt view is probably because (If i understood correctly
> last time), there are cases where there is no way to even pass packets
> and attribute them to the originating switch ports. Infact, in some
> cases  there may be no way at all to even pass packets to the kernel.
> Did i  understand that part correctly?
I think you did.  At least you and I had the same understanding here.

> I suppose this is eventually all part of that capability discovery.
> 
Agreed.

> [..]
> 
> >
> >Part of the problem is that you might start seeing actual relevant
> >traffic on these per-port net_devices e.g: during software learning
> >times, where traffic to specific ports will also be mirrored to the
> >CPU port for lossless (or close to) traffic delivery, and then some
> >software agent on the CPU will decide to bridge/bond/add vlans to some
> >ports, and then we won't be seeing traffic again on these per-port
> >net_devices for a while (in the context of switches supporting tags).
> >As such, I'd rather treat those per-port net_devices as almost regular
> >net_devices to allow that traffic to flow, even though this is not a
> >permanent state.
> >
> 
> A nod from here.
> I think it would be useful to enumerate these types of devices
> and what their control/data capability is.
> 
> cheers,
> jamal
> 

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 21:31                                   ` Jiri Pirko
@ 2014-03-27 15:35                                     ` Roopa Prabhu
  2014-03-27 16:10                                       ` Jiri Pirko
  0 siblings, 1 reply; 125+ messages in thread
From: Roopa Prabhu @ 2014-03-27 15:35 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

On 3/26/14, 2:31 PM, Jiri Pirko wrote:
> Wed, Mar 26, 2014 at 10:27:05PM CET, roopa@cumulusnetworks.com wrote:
>> On 3/26/14, 11:03 AM, Jiri Pirko wrote:
>>> Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>>>> On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>>> On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>>>> On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>>>> On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>>>> 2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>>>>> Sorry about getting on this thread late and possibly in the middle.
>>>>>>>> Agree on the idea of keeping the ports linked to the master switch dev
>>>>>>>> (or the 'conduit' to the switch chip) via private list instead of the
>>>>>>>> master-slave relationship proposed earlier.
>>>>>>>> By private i mean the netdev->priv linkage to the master switch dev and
>>>>>>>> not really keeping the ports from being exposed to the user.
>>>>>>>>
>>>>>>>> We think its better to keep the switch ports exposed as any other netdev
>>>>>>>> on linux.
>>>>>>>>   This approach will make the switch ports look exactly like a nic port
>>>>>>>> and all tools will continue to work seamlessly. The switch port
>>>>>>>> operations could internally be forwarded to the switch netdev (sw1 in
>>>>>>>> the above case).
>>>>>>>>
>>>>>>>> example:
>>>>>>>> $ip link set dev sw1p0 up
>>>>>>>> $ethtool -S sw1p0
>>>>>>>>
>>>>>>> I like the approach. I know the above is a simple version, but i am
>>>>>>> assuming you also mean i can do things like
>>>>>>> ip route add ...
>>>>>>> bridge fdb add ... (and if you like your brctl go ahead)
>>>>>>> bonding ...
>>>>>>>
>>>>>> yes, exactly.  We support this model on our boxes today.
>>>>>> User can bond switch ports on our box in the exact same way as he/she
>>>>>> would bond two nic ports.
>>>>>> Our 'conduit to switch chip' reflects the corresponding lag
>>>>>> configuration in the switch chip.
>>>>>> Same goes for bridging, routing, acls.
>>>>> So you implement bonding netlink api? Or you hook into bonding driver
>>>>> itselt? Can you show us the code?
>>>> We use the netlink API and libnl. In our current model, our switch
>>>> chip driver listens to netlink notifications and programs the switch
>>>> chip. The switch chip driver uses libnl caches and libnl netlink apis
>>>> to reflect the kernel state to switch chip.
>>> So when you configure for example bonding over 2 ports, you actually use
>>> bonding driver to do that. And you userspace app listens to
>>> notifications and programs the switch chip accordingly. Am I close?
>> yes correct.
>>> How about data? Is this new "bonding" interface able to assign ip to is
>>> and send/receive packets.
>> yes
>>> I'm still not sure I understand your concept. Do you have some
>>> documentation for it available?
>>>
>> I think the only documentation available today in this area is the
>> user guide and that in-turn points to native linux command manpages
>> iproute2, sysfs, debian ifupdown etc.
>> I will see if i can find anything else.
> I ment the architecture design documentation. linux manpages are not
> that interesting to me :)
>
Yes, I get that, and that's why I did not include a pointer to our user
guide :).
Sorry, the easiest thing to find right now was a high-level marketing
diagram, so here you go:
http://cumulusnetworks.com/product/architecture/. It shows nothing more
than what I described in my earlier emails.
From here on, the details are just programming the Broadcom ASIC, and
that is mostly covered by Broadcom details/documentation.

The above is our current working/shipping model.

In our second phase of implementation, we wanted to preserve the above
user interface model (which people using our boxes are very fond of),
but introduce the concept of a switchdev and switchports in the kernel.
We had a switchdev API in the works ourselves, which we were planning to
publish on netdev until you beat us to it.
Our version is similar to yours, but it reflects some of the points that
I have brought up in my previous emails.
It probably looks more like your v2 (patch 4/6) without the master/slave
link.
We can share some code in the coming weeks. It does need some cleanup,
and I am also waiting for Scott Feldman, who is on vacation this week.

I know you are looking for specifics, but we don't have switchdev code
to create a bond in the switch chip ASIC yet. We have been thinking
about the details, though, and the current high-level thought is that we
would add a netdev op that the bonding driver could redirect to the
switchdev driver when it has slaves with IFF_SWITCH_PORT set.
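
A rough sketch of how such a redirect could look, written as a userspace
toy model rather than kernel code. Only the IFF_SWITCH_PORT name comes
from the discussion; its bit value, struct switch_ops, lag_join() and
bond_enslave_sketch() are hypothetical names chosen for illustration and
do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit value; the thread names the flag but not its encoding. */
#define IFF_SWITCH_PORT 0x1u

struct net_device;

/* Hypothetical op table, standing in for a new net_device_ops entry. */
struct switch_ops {
	/* Ask the switch driver to add `port` to hardware LAG `lag_id`. */
	int (*lag_join)(struct net_device *port, int lag_id);
};

/* Toy model of a net_device, not the kernel structure. */
struct net_device {
	const char *name;
	unsigned int priv_flags;
	const struct switch_ops *sw_ops;	/* NULL for an ordinary NIC */
	int hw_lag_id;				/* -1 while not offloaded */
};

/* Example driver callback: records the LAG membership "in hardware". */
static int demo_lag_join(struct net_device *port, int lag_id)
{
	port->hw_lag_id = lag_id;
	return 0;
}

static const struct switch_ops demo_switch_ops = {
	.lag_join = demo_lag_join,
};

/*
 * What the enslave path of the bonding driver might do. Software bonding
 * is assumed to happen regardless; the switch driver is called only for
 * slaves that carry IFF_SWITCH_PORT.
 * Returns 1 if offloaded, 0 if software only, negative on driver error.
 */
static int bond_enslave_sketch(struct net_device *slave, int lag_id)
{
	int err;

	assert(slave != NULL);
	if (!(slave->priv_flags & IFF_SWITCH_PORT))
		return 0;
	if (!slave->sw_ops || !slave->sw_ops->lag_join)
		return 0;
	err = slave->sw_ops->lag_join(slave, lag_id);
	return err < 0 ? err : 1;
}
```

With this shape, an ordinary NIC is bonded in software only, while a
port such as sw1p0 that sets the flag and provides demo_switch_ops also
gets its LAG membership pushed to the driver.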

Thanks,
Roopa


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 15:35                                     ` Roopa Prabhu
@ 2014-03-27 16:10                                       ` Jiri Pirko
  0 siblings, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-27 16:10 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee

Thu, Mar 27, 2014 at 04:35:41PM CET, roopa@cumulusnetworks.com wrote:
>On 3/26/14, 2:31 PM, Jiri Pirko wrote:
>>Wed, Mar 26, 2014 at 10:27:05PM CET, roopa@cumulusnetworks.com wrote:
>>>On 3/26/14, 11:03 AM, Jiri Pirko wrote:
>>>>Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>>>>>On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>>>>>Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>>>>On 3/26/14, 3:54 AM, Jamal Hadi Salim wrote:
>>>>>>>>On 03/26/14 01:37, Roopa Prabhu wrote:
>>>>>>>>>On 3/25/14, 1:11 PM, Florian Fainelli wrote:
>>>>>>>>>>2014-03-25 12:35 GMT-07:00 Neil Horman <nhorman@tuxdriver.com>:
>>>>>>>>>Sorry about getting on this thread late and possibly in the middle.
>>>>>>>>>Agree on the idea of keeping the ports linked to the master switch dev
>>>>>>>>>(or the 'conduit' to the switch chip) via private list instead of the
>>>>>>>>>master-slave relationship proposed earlier.
>>>>>>>>>By private i mean the netdev->priv linkage to the master switch dev and
>>>>>>>>>not really keeping the ports from being exposed to the user.
>>>>>>>>>
>>>>>>>>>We think its better to keep the switch ports exposed as any other netdev
>>>>>>>>>on linux.
>>>>>>>>>  This approach will make the switch ports look exactly like a nic port
>>>>>>>>>and all tools will continue to work seamlessly. The switch port
>>>>>>>>>operations could internally be forwarded to the switch netdev (sw1 in
>>>>>>>>>the above case).
>>>>>>>>>
>>>>>>>>>example:
>>>>>>>>>$ip link set dev sw1p0 up
>>>>>>>>>$ethtool -S sw1p0
>>>>>>>>>
>>>>>>>>I like the approach. I know the above is a simple version, but i am
>>>>>>>>assuming you also mean i can do things like
>>>>>>>>ip route add ...
>>>>>>>>bridge fdb add ... (and if you like your brctl go ahead)
>>>>>>>>bonding ...
>>>>>>>>
>>>>>>>yes, exactly.  We support this model on our boxes today.
>>>>>>>User can bond switch ports on our box in the exact same way as he/she
>>>>>>>would bond two nic ports.
>>>>>>>Our 'conduit to switch chip' reflects the corresponding lag
>>>>>>>configuration in the switch chip.
>>>>>>>Same goes for bridging, routing, acls.
>>>>>>So you implement bonding netlink api? Or you hook into bonding driver
>>>>>>itselt? Can you show us the code?
>>>>>We use the netlink API and libnl. In our current model, our switch
>>>>>chip driver listens to netlink notifications and programs the switch
>>>>>chip. The switch chip driver uses libnl caches and libnl netlink apis
>>>>>to reflect the kernel state to switch chip.
>>>>So when you configure for example bonding over 2 ports, you actually use
>>>>bonding driver to do that. And you userspace app listens to
>>>>notifications and programs the switch chip accordingly. Am I close?
>>>yes correct.
>>>>How about data? Is this new "bonding" interface able to assign ip to is
>>>>and send/receive packets.
>>>yes
>>>>I'm still not sure I understand your concept. Do you have some
>>>>documentation for it available?
>>>>
>>>I think the only documentation available today in this area is the
>>>user guide and that in-turn points to native linux command manpages
>>>iproute2, sysfs, debian ifupdown etc.
>>>I will see if i can find anything else.
>>I ment the architecture design documentation. linux manpages are not
>>that interesting to me :)
>>
>yes, i get that and thats why i did not include a pointer to our user
>guide. :).
>Sorry, the easiest thing to find right now was a high level marketing
>diagram and here you go:
>http://cumulusnetworks.com/product/architecture/. This is nothing but
>what i mentioned in my emails.
>From here the details involve nothing but programming the broadcom
>asic. This is mostly broadcom details/documentation.
>
>The above is our current working/shipping model.
>
>In our second phase of implementation, We wanted to preserve the
>above user interface model (which people using our boxes are very
>fond of), but introduce the concept of a switchdev and switchports in
>the kernel.
>We had a switchdev api in the works ourselves which we were planning
>to publish on netdev until you beat us to it.
>Our version is similar to yours but it reflects some of the points
>that i have brought up in my previous emails.
>It probably looks more like your v2 (patch 4/6) without the
>master/slave link.

Yes, I was thinking about this some more and I plan to remove this
implicit master/slave link in the next version.


>We can share some code in the comming weeks. It does need some
>cleanup and i am also waiting for scott feldman who is on vacation
>this week.
>
>I know you are looking for specifics, but we don't have switchdev
>code to create a bond in switch chip asic yet. But we have been
>thinking about the details and the current thought there at a high
>level was, we would add a netdev op which the bonding driver could
>redirect to the switchdev driver when it has slaves with
>IFF_SWITCH_PORT set.

I was thinking about this as well. I believe that if this has to be
done, it should be done at the RTNL level, not as a hack in the
bond/bridge/whatever code.

>
>Thanks,
>Roopa
>
>
>


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 14:10                                           ` Sergey Ryazanov
@ 2014-03-27 16:41                                             ` Florian Fainelli
  2014-03-27 16:57                                               ` Jiri Pirko
                                                                 ` (3 more replies)
  2014-03-27 16:55                                             ` Jiri Pirko
  1 sibling, 4 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-27 16:41 UTC (permalink / raw)
  To: Sergey Ryazanov
  Cc: Jiri Pirko, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-27 7:10 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
> Hi all,
>
> sorry for the intrusion, but let me place my 2 cents.
>
> 2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>>
>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>>
>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>
>>>>
>>>>
>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>>> example.
>>>>>>
>>>>>> I think there is an implicit convention that sw1 represents the
>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>>> link sw1 to eth0 for instance?
>>>>>
>>>>>
>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
>>>>> understanding you correctly).
>>>>> This is how I see it, sticking to the names you use in the example:
>>>>>
>>>>>              (sw1) (abstract place-holder netdev)
>>>>>            --------
>>>>>           switch chip                   CPU
>>>>>     -----------------------            ------
>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>>       |     |     |     |                |
>>>>>      PHY   PHY   PHY    ------someMII-----
>>>>>
>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>>> switch part (port representation).
>>>>>
>>>>
>>>>
>>>> Florian - I am sure you explained this before; I just dont remember. Why
>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>>> already in the kernel and the "cpu port" is merely a control interface.
>>>
>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>>It is "regular" Ethernet driver connected to the switch without
>>>switch-specific logic. The goal is twofold:
>>>
>>>- allow any regular Ethernet driver to be connected to an external
>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>>- represents accurately how the hardware is designed/connected
>>>
>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>>
>> I believe that hawing both sw1p3 and eth0 is the correct way of
>> modelling this. sw1p3 is instance if switch chip driver representing the
>> actual port of a switch. eth0 is an instance of some other ordinary NIC
>> driver (8139too is my favorite :))
>>
>> This model allows to draw the exact picture.
>> Also, when you add the described possibility to use iplink to build
>> vlans, bridges whatever on the switch ports, it makes perfect sense to
>> have this model.
>>
>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
>>
>
> CPU switch port and switch fabric itself should be configured in
> consistence with host, in first place I mean a set of VLANs. Also it
> should be mentioned that some generic knobs such as port rate and
> duplex mode are meaningless for CPU switch port and a lot of status
> information (rx/tx counters etc.) duplicates statistics of host
> interface which is connected to switch port.

It duplicates the information only when things work fine. Consider an
external switch connected via RGMII to a CPU Ethernet MAC: you might
want to get statistics from both sides (the switch CPU port and the
CPU Ethernet MAC) to diagnose why things are not working as expected,
which unfortunately happens once in a while with RGMII.

If we expose both net_devices, we will be able to retrieve statistics
from both sides without resorting to ad-hoc debugging tools,
but maybe this is not worth the effort.
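
As an illustration of why both sets of counters are useful, here is a
small sketch that compares them to tell in which direction frames go
missing on the CPU link. The struct link_stats layout and the
diagnose_link() helper are hypothetical, and the comparison assumes both
ends were zeroed together and sampled while the link is idle.

```c
#include <assert.h>
#include <stdint.h>

/* Minimal packet counters for one end of the CPU <-> switch link. */
struct link_stats {
	uint64_t tx_packets;
	uint64_t rx_packets;
};

enum link_fault {
	LINK_OK = 0,
	LOSS_CPU_TO_SWITCH,	/* CPU MAC sent more than the switch port saw */
	LOSS_SWITCH_TO_CPU,	/* switch port sent more than the CPU MAC saw */
	LOSS_BOTH
};

/*
 * Compare what each side transmitted with what the other side received.
 * cpu_mac is the host Ethernet MAC (eth0 in the thread's example) and
 * sw_port is the switch port facing it (sw1p3 in the example).
 */
static enum link_fault diagnose_link(const struct link_stats *cpu_mac,
				     const struct link_stats *sw_port)
{
	int lost_down = cpu_mac->tx_packets > sw_port->rx_packets;
	int lost_up = sw_port->tx_packets > cpu_mac->rx_packets;

	if (lost_down && lost_up)
		return LOSS_BOTH;
	if (lost_down)
		return LOSS_CPU_TO_SWITCH;
	if (lost_up)
		return LOSS_SWITCH_TO_CPU;
	return LINK_OK;
}
```

If only one of the two net_devices were visible, neither inequality
could be evaluated from standard statistics alone.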

> So there are no reasons
> to force user to configure this port manually, and automatic
> configuration of CPU switch port without exporting them as netdev
> seems as good approach.

Well, maybe that's the answer: since we know that e.g. sw1p3 is always
connected to e.g. eth0, we could create an automatic bridge between
those two. This would keep the netdev exposure to user-space, but a
user would not have to know about that specific detail to get things
to work.
-- 
Florian


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 14:10                                           ` Sergey Ryazanov
  2014-03-27 16:41                                             ` Florian Fainelli
@ 2014-03-27 16:55                                             ` Jiri Pirko
  2014-03-27 19:58                                               ` Sergey Ryazanov
  1 sibling, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-27 16:55 UTC (permalink / raw)
  To: Sergey Ryazanov
  Cc: Florian Fainelli, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

Thu, Mar 27, 2014 at 03:10:24PM CET, ryazanov.s.a@gmail.com wrote:
>Hi all,
>
>sorry for the intrusion, but let me place my 2 cents.
>
>2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>>
>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>>
>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>
>>>>
>>>>
>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>>> example.
>>>>>>
>>>>>> I think there is an implicit convention that sw1 represents the
>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>>> link sw1 to eth0 for instance?
>>>>>
>>>>>
>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
>>>>> understanding you correctly).
>>>>> This is how I see it, sticking to the names you use in the example:
>>>>>
>>>>>              (sw1) (abstract place-holder netdev)
>>>>>            --------
>>>>>           switch chip                   CPU
>>>>>     -----------------------            ------
>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>>       |     |     |     |                |
>>>>>      PHY   PHY   PHY    ------someMII-----
>>>>>
>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>>> switch part (port representation).
>>>>>
>>>>
>>>>
>>>> Florian - I am sure you explained this before; I just dont remember. Why
>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>>> already in the kernel and the "cpu port" is merely a control interface.
>>>
>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>>It is "regular" Ethernet driver connected to the switch without
>>>switch-specific logic. The goal is twofold:
>>>
>>>- allow any regular Ethernet driver to be connected to an external
>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>>- represents accurately how the hardware is designed/connected
>>>
>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>>
>> I believe that hawing both sw1p3 and eth0 is the correct way of
>> modelling this. sw1p3 is instance if switch chip driver representing the
>> actual port of a switch. eth0 is an instance of some other ordinary NIC
>> driver (8139too is my favorite :))
>>
>> This model allows to draw the exact picture.
>> Also, when you add the described possibility to use iplink to build
>> vlans, bridges whatever on the switch ports, it makes perfect sense to
>> have this model.
>>
>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
>>
>
>CPU switch port and switch fabric itself should be configured in
>consistence with host, in first place I mean a set of VLANs. Also it
>should be mentioned that some generic knobs such as port rate and
>duplex mode are meaningless for CPU switch port and a lot of status
>information (rx/tx counters etc.) duplicates statistics of host
>interface which is connected to switch port. So there are no reasons
>to force user to configure this port manually, and automatic
>configuration of CPU switch port without exporting them as netdev
>seems as good approach.

How can you tell that a certain port is connected to the CPU? That is
platform specific.

>
>-- 
>BR,
>Sergey


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 16:41                                             ` Florian Fainelli
@ 2014-03-27 16:57                                               ` Jiri Pirko
  2014-03-27 16:59                                               ` Thomas Graf
                                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 125+ messages in thread
From: Jiri Pirko @ 2014-03-27 16:57 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sergey Ryazanov, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

Thu, Mar 27, 2014 at 05:41:52PM CET, f.fainelli@gmail.com wrote:
>2014-03-27 7:10 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
>> Hi all,
>>
>> sorry for the intrusion, but let me place my 2 cents.
>>
>> 2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>>>
>>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>>>
>>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>>
>>>>>
>>>>>
>>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>>>> example.
>>>>>>>
>>>>>>> I think there is an implicit convention that sw1 represents the
>>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>>>> link sw1 to eth0 for instance?
>>>>>>
>>>>>>
>>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
>>>>>> understanding you correctly).
>>>>>> This is how I see it, sticking to the names you use in the example:
>>>>>>
>>>>>>              (sw1) (abstract place-holder netdev)
>>>>>>            --------
>>>>>>           switch chip                   CPU
>>>>>>     -----------------------            ------
>>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>>>       |     |     |     |                |
>>>>>>      PHY   PHY   PHY    ------someMII-----
>>>>>>
>>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>>>> switch part (port representation).
>>>>>>
>>>>>
>>>>>
>>>>> Florian - I am sure you explained this before; I just dont remember. Why
>>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>>>> already in the kernel and the "cpu port" is merely a control interface.
>>>>
>>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>>>It is "regular" Ethernet driver connected to the switch without
>>>>switch-specific logic. The goal is twofold:
>>>>
>>>>- allow any regular Ethernet driver to be connected to an external
>>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>>>- represents accurately how the hardware is designed/connected
>>>>
>>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>>>
>>> I believe that hawing both sw1p3 and eth0 is the correct way of
>>> modelling this. sw1p3 is instance if switch chip driver representing the
>>> actual port of a switch. eth0 is an instance of some other ordinary NIC
>>> driver (8139too is my favorite :))
>>>
>>> This model allows to draw the exact picture.
>>> Also, when you add the described possibility to use iplink to build
>>> vlans, bridges whatever on the switch ports, it makes perfect sense to
>>> have this model.
>>>
>>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
>>>
>>
>> CPU switch port and switch fabric itself should be configured in
>> consistence with host, in first place I mean a set of VLANs. Also it
>> should be mentioned that some generic knobs such as port rate and
>> duplex mode are meaningless for CPU switch port and a lot of status
>> information (rx/tx counters etc.) duplicates statistics of host
>> interface which is connected to switch port.
>
>It duplicates the information when things just work fine, consider an
>external switch connected via RGMII to a CPU Ethernet MAC, you might
>want to get statistics from both sides (the switch CPU port and the
>CPU Ethernet MAC) to diagnose why things are not working as expected,
>which unfortunately happens once in a while with RGMII.
>
>If we expose both net_device, we will be able to retrieve statistics
>about from both sides, without resorting to ad-hoc debugging tools,
>but maybe this is not worth the effort.

Good point. I agree.

>
>> So there are no reasons
>> to force user to configure this port manually, and automatic
>> configuration of CPU switch port without exporting them as netdev
>> seems as good approach.
>
>Well, maybe that's the answer, since we know that e.g: sw1p3 is always
>connected to e.g: eth0, we could create an automatic bridge between
>those two, this would keep the netdev exposure to user-space, but an
>user would not have to know about that specific detail to get things
>to work.

That might be a good way.

>-- 
>Florian


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 16:41                                             ` Florian Fainelli
  2014-03-27 16:57                                               ` Jiri Pirko
@ 2014-03-27 16:59                                               ` Thomas Graf
  2014-03-27 20:32                                               ` Sergey Ryazanov
  2014-03-27 21:41                                               ` Jamal Hadi Salim
  3 siblings, 0 replies; 125+ messages in thread
From: Thomas Graf @ 2014-03-27 16:59 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sergey Ryazanov, Jiri Pirko, Jamal Hadi Salim, Roopa Prabhu,
	Neil Horman, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

On 03/27/14 at 09:41am, Florian Fainelli wrote:
> 2014-03-27 7:10 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
> > So there are no reasons
> > to force user to configure this port manually, and automatic
> > configuration of CPU switch port without exporting them as netdev
> > seems as good approach.
> 
> Well, maybe that's the answer, since we know that e.g: sw1p3 is always
> connected to e.g: eth0, we could create an automatic bridge between
> those two, this would keep the netdev exposure to user-space, but an
> user would not have to know about that specific detail to get things
> to work.

I like this approach. It provides visibility without additional
configuration burden.


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 16:55                                             ` Jiri Pirko
@ 2014-03-27 19:58                                               ` Sergey Ryazanov
  2014-03-27 20:01                                                 ` Florian Fainelli
  0 siblings, 1 reply; 125+ messages in thread
From: Sergey Ryazanov @ 2014-03-27 19:58 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Florian Fainelli, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-27 20:55 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
> Thu, Mar 27, 2014 at 03:10:24PM CET, ryazanov.s.a@gmail.com wrote:
>>Hi all,
>>
>>sorry for the intrusion, but let me place my 2 cents.
>>
>>2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>>>
>>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>>>
>>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>>
>>>>>
>>>>>
>>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>>>> example.
>>>>>>>
>>>>>>> I think there is an implicit convention that sw1 represents the
>>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>>>> link sw1 to eth0 for instance?
>>>>>>
>>>>>>
>>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
>>>>>> understanding you correctly).
>>>>>> This is how I see it, sticking to the names you use in the example:
>>>>>>
>>>>>>              (sw1) (abstract place-holder netdev)
>>>>>>            --------
>>>>>>           switch chip                   CPU
>>>>>>     -----------------------            ------
>>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>>>       |     |     |     |                |
>>>>>>      PHY   PHY   PHY    ------someMII-----
>>>>>>
>>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>>>> switch part (port representation).
>>>>>>
>>>>>
>>>>>
>>>>> Florian - I am sure you explained this before; I just dont remember. Why
>>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>>>> already in the kernel and the "cpu port" is merely a control interface.
>>>>
>>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>>>It is "regular" Ethernet driver connected to the switch without
>>>>switch-specific logic. The goal is twofold:
>>>>
>>>>- allow any regular Ethernet driver to be connected to an external
>>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>>>- represents accurately how the hardware is designed/connected
>>>>
>>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>>>
>>> I believe that hawing both sw1p3 and eth0 is the correct way of
>>> modelling this. sw1p3 is instance if switch chip driver representing the
>>> actual port of a switch. eth0 is an instance of some other ordinary NIC
>>> driver (8139too is my favorite :))
>>>
>>> This model allows to draw the exact picture.
>>> Also, when you add the described possibility to use iplink to build
>>> vlans, bridges whatever on the switch ports, it makes perfect sense to
>>> have this model.
>>>
>>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
>>>
>>
>>CPU switch port and switch fabric itself should be configured in
>>consistence with host, in first place I mean a set of VLANs. Also it
>>should be mentioned that some generic knobs such as port rate and
>>duplex mode are meaningless for CPU switch port and a lot of status
>>information (rx/tx counters etc.) duplicates statistics of host
>>interface which is connected to switch port. So there are no reasons
>>to force user to configure this port manually, and automatic
>>configuration of CPU switch port without exporting them as netdev
>>seems as good approach.
>
> How can you tell that certain port is connected to CPU? That is platform
> specific.
>
You have answered your own question: via platform data that is
initialized and passed to the driver by the board initialization code.
IMHO, it is not a good idea to make the user guess which switch
port is connected to the CPU.

Moreover, we need to know which ports of the switch chip are really wired
to connectors (e.g. a five-port switch on a board with only three
connectors).
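
A minimal sketch of what such platform data might carry, using the
five-port, three-connector board as the example. The struct
switch_platform_data layout and the helper names are hypothetical, not
an existing kernel interface, and the mask limits the model to 32 ports.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical platform data that board code could pass to the driver. */
struct switch_platform_data {
	unsigned int num_ports;	/* ports implemented by the chip (<= 32) */
	unsigned int cpu_port;	/* index of the port wired to the CPU MAC */
	uint32_t wired_mask;	/* bit n set: port n reaches a connector */
};

/* True if `port` is the one facing the CPU Ethernet MAC. */
static int port_is_cpu(const struct switch_platform_data *pd,
		       unsigned int port)
{
	assert(pd->num_ports <= 32);
	return port < pd->num_ports && port == pd->cpu_port;
}

/* True if `port` is wired to a connector and is not the CPU port. */
static int port_is_user_visible(const struct switch_platform_data *pd,
				unsigned int port)
{
	assert(pd->num_ports <= 32);
	if (port >= pd->num_ports || port == pd->cpu_port)
		return 0;
	return (pd->wired_mask >> port) & 1u;
}

/* Number of ports the driver would expose for the user to configure. */
static unsigned int count_user_ports(const struct switch_platform_data *pd)
{
	unsigned int port, count = 0;

	for (port = 0; port < pd->num_ports; port++)
		if (port_is_user_visible(pd, port))
			count++;
	return count;
}
```

For a board described as { .num_ports = 5, .cpu_port = 4, .wired_mask =
0x07 }, ports 0 to 2 are user-visible, port 3 is left unexposed because
it has no connector, and port 4 is handled as the CPU port.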

-- 
BR,
Sergey


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 19:58                                               ` Sergey Ryazanov
@ 2014-03-27 20:01                                                 ` Florian Fainelli
  2014-03-27 20:04                                                   ` Sergey Ryazanov
  2014-03-27 21:47                                                   ` Jamal Hadi Salim
  0 siblings, 2 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-27 20:01 UTC (permalink / raw)
  To: Sergey Ryazanov
  Cc: Jiri Pirko, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-27 12:58 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
> 2014-03-27 20:55 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>> Thu, Mar 27, 2014 at 03:10:24PM CET, ryazanov.s.a@gmail.com wrote:
>>>Hi all,
>>>
>>>sorry for the intrusion, but let me place my 2 cents.
>>>
>>>2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>>>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>>>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>>>>
>>>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>>>>
>>>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>>>>> example.
>>>>>>>>
>>>>>>>> I think there is an implicit convention that sw1 represents the
>>>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>>>>> link sw1 to eth0 for instance?
>>>>>>>
>>>>>>>
>>>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
>>>>>>> understanding you correctly).
>>>>>>> This is how I see it, sticking to the names you use in the example:
>>>>>>>
>>>>>>>              (sw1) (abstract place-holder netdev)
>>>>>>>            --------
>>>>>>>           switch chip                   CPU
>>>>>>>     -----------------------            ------
>>>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>>>>       |     |     |     |                |
>>>>>>>      PHY   PHY   PHY    ------someMII-----
>>>>>>>
>>>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>>>>> switch part (port representation).
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Florian - I am sure you explained this before; I just dont remember. Why
>>>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>>>>> already in the kernel and the "cpu port" is merely a control interface.
>>>>>
>>>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>>>>It is "regular" Ethernet driver connected to the switch without
>>>>>switch-specific logic. The goal is twofold:
>>>>>
>>>>>- allow any regular Ethernet driver to be connected to an external
>>>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>>>>- represents accurately how the hardware is designed/connected
>>>>>
>>>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>>>>
>>>> I believe that hawing both sw1p3 and eth0 is the correct way of
>>>> modelling this. sw1p3 is instance if switch chip driver representing the
>>>> actual port of a switch. eth0 is an instance of some other ordinary NIC
>>>> driver (8139too is my favorite :))
>>>>
>>>> This model allows to draw the exact picture.
>>>> Also, when you add the described possibility to use iplink to build
>>>> vlans, bridges whatever on the switch ports, it makes perfect sense to
>>>> have this model.
>>>>
>>>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
>>>>
>>>
>>>CPU switch port and switch fabric itself should be configured in
>>>consistence with host, in first place I mean a set of VLANs. Also it
>>>should be mentioned that some generic knobs such as port rate and
>>>duplex mode are meaningless for CPU switch port and a lot of status
>>>information (rx/tx counters etc.) duplicates statistics of host
>>>interface which is connected to switch port. So there are no reasons
>>>to force user to configure this port manually, and automatic
>>>configuration of CPU switch port without exporting them as netdev
>>>seems as good approach.
>>
>> How can you tell that certain port is connected to CPU? That is platform
>> specific.
>>
> You have answered your own question: via platform data which are
> initialized and passed to the driver by the board initialization code.
> IMHO, it is not so good way, to suggest the user to guess the switch
> port, which is connected to CPU.
>
> Moreover, we need to know ports of switch chip, what are really wired
> to the connectors (e.g. five-port switch on the board with only three
> connectors).

Well, DSA already does all of that for you, and it has Device Tree
bindings too, to instantiate per-port net_devices and create the switch
routing table in case switches are cascaded...
-- 
Florian

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 20:01                                                 ` Florian Fainelli
@ 2014-03-27 20:04                                                   ` Sergey Ryazanov
  2014-03-27 21:47                                                   ` Jamal Hadi Salim
  1 sibling, 0 replies; 125+ messages in thread
From: Sergey Ryazanov @ 2014-03-27 20:04 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Jiri Pirko, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-28 0:01 GMT+04:00 Florian Fainelli <f.fainelli@gmail.com>:
> 2014-03-27 12:58 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
>> 2014-03-27 20:55 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>>> Thu, Mar 27, 2014 at 03:10:24PM CET, ryazanov.s.a@gmail.com wrote:
>>>>Hi all,
>>>>
>>>>sorry for the intrusion, but let me place my 2 cents.
>>>>
>>>>2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>>>>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>>>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>>>>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>>>>>
>>>>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>>>>>
>>>>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>>>>>> example.
>>>>>>>>>
>>>>>>>>> I think there is an implicit convention that sw1 represents the
>>>>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>>>>>> link sw1 to eth0 for instance?
>>>>>>>>
>>>>>>>>
>>>>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
>>>>>>>> understanding you correctly).
>>>>>>>> This is how I see it, sticking to the names you use in the example:
>>>>>>>>
>>>>>>>>              (sw1) (abstract place-holder netdev)
>>>>>>>>            --------
>>>>>>>>           switch chip                   CPU
>>>>>>>>     -----------------------            ------
>>>>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>>>>>       |     |     |     |                |
>>>>>>>>      PHY   PHY   PHY    ------someMII-----
>>>>>>>>
>>>>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>>>>>> switch part (port representation).
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Florian - I am sure you explained this before; I just dont remember. Why
>>>>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>>>>>> already in the kernel and the "cpu port" is merely a control interface.
>>>>>>
>>>>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>>>>>It is "regular" Ethernet driver connected to the switch without
>>>>>>switch-specific logic. The goal is twofold:
>>>>>>
>>>>>>- allow any regular Ethernet driver to be connected to an external
>>>>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>>>>>- represents accurately how the hardware is designed/connected
>>>>>>
>>>>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>>>>>
>>>>> I believe that hawing both sw1p3 and eth0 is the correct way of
>>>>> modelling this. sw1p3 is instance if switch chip driver representing the
>>>>> actual port of a switch. eth0 is an instance of some other ordinary NIC
>>>>> driver (8139too is my favorite :))
>>>>>
>>>>> This model allows to draw the exact picture.
>>>>> Also, when you add the described possibility to use iplink to build
>>>>> vlans, bridges whatever on the switch ports, it makes perfect sense to
>>>>> have this model.
>>>>>
>>>>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
>>>>>
>>>>
>>>>CPU switch port and switch fabric itself should be configured in
>>>>consistence with host, in first place I mean a set of VLANs. Also it
>>>>should be mentioned that some generic knobs such as port rate and
>>>>duplex mode are meaningless for CPU switch port and a lot of status
>>>>information (rx/tx counters etc.) duplicates statistics of host
>>>>interface which is connected to switch port. So there are no reasons
>>>>to force user to configure this port manually, and automatic
>>>>configuration of CPU switch port without exporting them as netdev
>>>>seems as good approach.
>>>
>>> How can you tell that certain port is connected to CPU? That is platform
>>> specific.
>>>
>> You have answered your own question: via platform data which are
>> initialized and passed to the driver by the board initialization code.
>> IMHO, it is not so good way, to suggest the user to guess the switch
>> port, which is connected to CPU.
>>
>> Moreover, we need to know ports of switch chip, what are really wired
>> to the connectors (e.g. five-port switch on the board with only three
>> connectors).
>
> Well, DSA already does all of that for you and has Device Tree
> bindings too to instantiate per-port net_device, create the switch
> routing table in case switches are cascaded...
>
Wow, I missed that.

-- 
BR,
Sergey

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 16:41                                             ` Florian Fainelli
  2014-03-27 16:57                                               ` Jiri Pirko
  2014-03-27 16:59                                               ` Thomas Graf
@ 2014-03-27 20:32                                               ` Sergey Ryazanov
  2014-03-27 21:20                                                 ` Florian Fainelli
  2014-03-27 21:41                                               ` Jamal Hadi Salim
  3 siblings, 1 reply; 125+ messages in thread
From: Sergey Ryazanov @ 2014-03-27 20:32 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Jiri Pirko, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-27 20:41 GMT+04:00 Florian Fainelli <f.fainelli@gmail.com>:
> 2014-03-27 7:10 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
>> Hi all,
>>
>> sorry for the intrusion, but let me place my 2 cents.
>>
>> 2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>>>
>>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>>>
>>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>>
>>>>>
>>>>>
>>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>>>> example.
>>>>>>>
>>>>>>> I think there is an implicit convention that sw1 represents the
>>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>>>> link sw1 to eth0 for instance?
>>>>>>
>>>>>>
>>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
>>>>>> understanding you correctly).
>>>>>> This is how I see it, sticking to the names you use in the example:
>>>>>>
>>>>>>              (sw1) (abstract place-holder netdev)
>>>>>>            --------
>>>>>>           switch chip                   CPU
>>>>>>     -----------------------            ------
>>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>>>       |     |     |     |                |
>>>>>>      PHY   PHY   PHY    ------someMII-----
>>>>>>
>>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>>>> switch part (port representation).
>>>>>>
>>>>>
>>>>>
>>>>> Florian - I am sure you explained this before; I just dont remember. Why
>>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>>>> already in the kernel and the "cpu port" is merely a control interface.
>>>>
>>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>>>It is "regular" Ethernet driver connected to the switch without
>>>>switch-specific logic. The goal is twofold:
>>>>
>>>>- allow any regular Ethernet driver to be connected to an external
>>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>>>- represents accurately how the hardware is designed/connected
>>>>
>>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>>>
>>> I believe that hawing both sw1p3 and eth0 is the correct way of
>>> modelling this. sw1p3 is instance if switch chip driver representing the
>>> actual port of a switch. eth0 is an instance of some other ordinary NIC
>>> driver (8139too is my favorite :))
>>>
>>> This model allows to draw the exact picture.
>>> Also, when you add the described possibility to use iplink to build
>>> vlans, bridges whatever on the switch ports, it makes perfect sense to
>>> have this model.
>>>
>>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
>>>
>>
>> CPU switch port and switch fabric itself should be configured in
>> consistence with host, in first place I mean a set of VLANs. Also it
>> should be mentioned that some generic knobs such as port rate and
>> duplex mode are meaningless for CPU switch port and a lot of status
>> information (rx/tx counters etc.) duplicates statistics of host
>> interface which is connected to switch port.
>
> It duplicates the information when things just work fine, consider an
> external switch connected via RGMII to a CPU Ethernet MAC, you might
> want to get statistics from both sides (the switch CPU port and the
> CPU Ethernet MAC) to diagnose why things are not working as expected,
> which unfortunately happens once in a while with RGMII.
>
> If we expose both net_device, we will be able to retrieve statistics
> about from both sides, without resorting to ad-hoc debugging tools,
> but maybe this is not worth the effort.
>
I also thought about this situation. Can we use the debugfs interface
for these purposes?

>> So there are no reasons
>> to force user to configure this port manually, and automatic
>> configuration of CPU switch port without exporting them as netdev
>> seems as good approach.
>
> Well, maybe that's the answer, since we know that e.g: sw1p3 is always
> connected to e.g: eth0, we could create an automatic bridge between
> those two, this would keep the netdev exposure to user-space, but an
> user would not have to know about that specific detail to get things
> to work.
>
I would like to go further and suggest treating the netdev that is
connected to the CPU switch port as the master, for cases when we need
to perform some action on the whole switch (e.g. dump the FIB). We could
even name the switch ports using the master netdev name as a prefix
(e.g. eth1p0, eth1p1, ..., eth1pN for the ports of a switch that is
connected via eth1).

-- 
BR,
Sergey

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 20:32                                               ` Sergey Ryazanov
@ 2014-03-27 21:20                                                 ` Florian Fainelli
  2014-03-27 21:55                                                   ` Jamal Hadi Salim
  2014-03-28  6:28                                                   ` Jiri Pirko
  0 siblings, 2 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-27 21:20 UTC (permalink / raw)
  To: Sergey Ryazanov
  Cc: Jiri Pirko, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-27 13:32 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
> 2014-03-27 20:41 GMT+04:00 Florian Fainelli <f.fainelli@gmail.com>:
>> 2014-03-27 7:10 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
>>> Hi all,
>>>
>>> sorry for the intrusion, but let me place my 2 cents.
>>>
>>> 2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>>>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>>>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>>>>
>>>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>>>>
>>>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>>>>> example.
>>>>>>>>
>>>>>>>> I think there is an implicit convention that sw1 represents the
>>>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>>>>> link sw1 to eth0 for instance?
>>>>>>>
>>>>>>>
>>>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
>>>>>>> understanding you correctly).
>>>>>>> This is how I see it, sticking to the names you use in the example:
>>>>>>>
>>>>>>>              (sw1) (abstract place-holder netdev)
>>>>>>>            --------
>>>>>>>           switch chip                   CPU
>>>>>>>     -----------------------            ------
>>>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>>>>       |     |     |     |                |
>>>>>>>      PHY   PHY   PHY    ------someMII-----
>>>>>>>
>>>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>>>>> switch part (port representation).
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Florian - I am sure you explained this before; I just dont remember. Why
>>>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>>>>> already in the kernel and the "cpu port" is merely a control interface.
>>>>>
>>>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>>>>It is "regular" Ethernet driver connected to the switch without
>>>>>switch-specific logic. The goal is twofold:
>>>>>
>>>>>- allow any regular Ethernet driver to be connected to an external
>>>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>>>>- represents accurately how the hardware is designed/connected
>>>>>
>>>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>>>>
>>>> I believe that hawing both sw1p3 and eth0 is the correct way of
>>>> modelling this. sw1p3 is instance if switch chip driver representing the
>>>> actual port of a switch. eth0 is an instance of some other ordinary NIC
>>>> driver (8139too is my favorite :))
>>>>
>>>> This model allows to draw the exact picture.
>>>> Also, when you add the described possibility to use iplink to build
>>>> vlans, bridges whatever on the switch ports, it makes perfect sense to
>>>> have this model.
>>>>
>>>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
>>>>
>>>
>>> CPU switch port and switch fabric itself should be configured in
>>> consistence with host, in first place I mean a set of VLANs. Also it
>>> should be mentioned that some generic knobs such as port rate and
>>> duplex mode are meaningless for CPU switch port and a lot of status
>>> information (rx/tx counters etc.) duplicates statistics of host
>>> interface which is connected to switch port.
>>
>> It duplicates the information when things just work fine, consider an
>> external switch connected via RGMII to a CPU Ethernet MAC, you might
>> want to get statistics from both sides (the switch CPU port and the
>> CPU Ethernet MAC) to diagnose why things are not working as expected,
>> which unfortunately happens once in a while with RGMII.
>>
>> If we expose both net_device, we will be able to retrieve statistics
>> about from both sides, without resorting to ad-hoc debugging tools,
>> but maybe this is not worth the effort.
>>
> I also thought about this situation. Can we use the debugfs interface
> for these purposes?

Most of the time you are interested in MIB counters when debugging
such issues, so ethtool quickly comes in handy for this task. Since we
will provide per-port counters, the CPU port is no different, so
there is no reason to restrict this.
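
As a rough illustration of why counters from both ends of the link
help, here is a Python sketch. The counter values are made up and the
helper `link_loss` is a hypothetical name; this is not ethtool output.

```python
def link_loss(tx_side, rx_side):
    """Frames sent by one end of a link that the other end never counted."""
    return tx_side["tx_packets"] - rx_side["rx_packets"]


# Made-up numbers for the two ends of the CPU <-> switch link.
eth0 = {"tx_packets": 1000, "rx_packets": 980}
sw1p3 = {"tx_packets": 1000, "rx_packets": 1000}

print("CPU -> switch lost:", link_loss(eth0, sw1p3))
print("switch -> CPU lost:", link_loss(sw1p3, eth0))
```

With only one side visible, the loss in the switch-to-CPU direction
could not be attributed to the link itself.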

>
>>> So there are no reasons
>>> to force user to configure this port manually, and automatic
>>> configuration of CPU switch port without exporting them as netdev
>>> seems as good approach.
>>
>> Well, maybe that's the answer, since we know that e.g: sw1p3 is always
>> connected to e.g: eth0, we could create an automatic bridge between
>> those two, this would keep the netdev exposure to user-space, but an
>> user would not have to know about that specific detail to get things
>> to work.
>>
> I would like go further and suggest to consider a netdev that is
> connected to the CPU switch port, as master. In case when we need to
> perform some action on whole switch (e.g. dump FIB).

This is what the 'sw1' net_device in Jiri's proposal would do.

> And even name
> switch ports, using master netdev name as prefix (e.g. eth1p0, eth1p1,
> ..., eth1pN for ports of switch that is connected via eth1).

I think the port naming using the switch abstract interface (sw1 here)
is better because ports do belong to the switch.
-- 
Florian

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 15:26                               ` Neil Horman
@ 2014-03-27 21:33                                 ` Jamal Hadi Salim
  0 siblings, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-27 21:33 UTC (permalink / raw)
  To: Neil Horman
  Cc: Florian Fainelli, Thomas Graf, Jiri Pirko, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Felix Fietkau

On 03/27/14 11:26, Neil Horman wrote:

> You're absolutely correct - I am viewing this from a host based perspective.
> And I completely understand that offload is good in a middle box environment (I
> worked for embedded switch companies in a former life).  I'm looking at it from
> a host perspective because, as we've been discussing the wide range of devices
> covered here (from the small SOC switches used by owrt to the big enterprise
> switches), theres this middle ground thats seeing some consolodation here which
> I think we need to cover as well.  I'm referring to NICS that have an embedded
> switch in them that can (or soon will) preform lots of these flow based
> forwarding operations and actions.

Agreed - I think we need to capture those. The challenge there may be
how to abstract some of those tables (for example in VMDQ) and make them
feel like an L2 fdb.

cheers,
jamal

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 16:41                                             ` Florian Fainelli
                                                                 ` (2 preceding siblings ...)
  2014-03-27 20:32                                               ` Sergey Ryazanov
@ 2014-03-27 21:41                                               ` Jamal Hadi Salim
  3 siblings, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-27 21:41 UTC (permalink / raw)
  To: Florian Fainelli, Sergey Ryazanov
  Cc: Jiri Pirko, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

I should probably be reading emails backward to catch up, but this
one caught my eye.

On 03/27/14 12:41, Florian Fainelli wrote:

> It duplicates the information when things just work fine, consider an
> external switch connected via RGMII to a CPU Ethernet MAC, you might
> want to get statistics from both sides (the switch CPU port and the
> CPU Ethernet MAC) to diagnose why things are not working as expected,
> which unfortunately happens once in a while with RGMII.
>
> If we expose both net_device, we will be able to retrieve statistics
> about from both sides, without resorting to ad-hoc debugging tools,
> but maybe this is not worth the effort.
>

This is probably the most convincing rationale I have seen
for this netdev. I think the abstraction is what was bothering me
earlier about Jiri's patch. It is not a "master" netdev of the switch
(since a switch port can only be attached to one master);
it is closer to the "conduit" netdev Roopa was alluding to. In this
abstraction you don't attach ports to it, it comes with ports - but it is
merely the control for those ports. Would referring to this port as
"control" be a good way to go forward?
So if I wanted to retrieve hardware stats for sw1p0, I would look up its
"control"->dev->somendo and pass it enough information to give me
the hardware stats for sw1p0.
I can see the rate control Sergey is referring to as merely a tc
ingress policer attached to it (which may translate to some hardware
register setting).
Likewise this is probably where my query to "tell me how many
fdb entries you can support" ends up.
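
A toy model of this idea, in Python, might look like the sketch below.
All names here (`ControlDev`, `PortDev`, `get_port_stats`,
`get_fdb_capacity`) are hypothetical stand-ins for the "somendo" hook
mentioned above; none of them is an existing kernel interface.

```python
class ControlDev:
    """Hypothetical 'control' device that fronts the switch hardware."""

    def __init__(self, name, hw_counters, fdb_capacity):
        self.name = name
        self._hw_counters = hw_counters    # per-port counters kept by the chip
        self._fdb_capacity = fdb_capacity  # fdb entries the chip can hold

    def get_port_stats(self, port_name):
        """Stand-in for the 'somendo' hook: hardware stats for one port."""
        return dict(self._hw_counters[port_name])

    def get_fdb_capacity(self):
        """Answers 'tell me how many fdb entries you can support'."""
        return self._fdb_capacity


class PortDev:
    """A switch port that comes with, and points at, its control device."""

    def __init__(self, name, control):
        self.name = name
        self.control = control

    def hw_stats(self):
        # The port does not talk to hardware itself; it asks its control.
        return self.control.get_port_stats(self.name)


control = ControlDev("sw1", {"sw1p0": {"rx_packets": 10, "tx_packets": 7}}, 1024)
sw1p0 = PortDev("sw1p0", control)
print(sw1p0.hw_stats())
print(control.get_fdb_capacity())
```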

>> So there are no reasons
>> to force user to configure this port manually, and automatic
>> configuration of CPU switch port without exporting them as netdev
>> seems as good approach.
>
> Well, maybe that's the answer, since we know that e.g: sw1p3 is always
> connected to e.g: eth0, we could create an automatic bridge between
> those two, this would keep the netdev exposure to user-space, but an
> user would not have to know about that specific detail to get things
> to work.
>

Could an ifconfig down of this port have the semantic that nothing comes
to the CPU?

cheers,
jamal

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 20:01                                                 ` Florian Fainelli
  2014-03-27 20:04                                                   ` Sergey Ryazanov
@ 2014-03-27 21:47                                                   ` Jamal Hadi Salim
  2014-03-27 21:54                                                     ` Florian Fainelli
  1 sibling, 1 reply; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-27 21:47 UTC (permalink / raw)
  To: Florian Fainelli, Sergey Ryazanov
  Cc: Jiri Pirko, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

On 03/27/14 16:01, Florian Fainelli wrote:
> 2014-03-27 12:58 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
>> 2014-03-27 20:55 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:

>> Moreover, we need to know ports of switch chip, what are really wired
>> to the connectors (e.g. five-port switch on the board with only three
>> connectors).
>
> Well, DSA already does all of that for you and has Device Tree
> bindings too to instantiate per-port net_device, create the switch
> routing table in case switches are cascaded...
>

Just to be clear:
routing here implies how the devices are interconnected (as opposed
to L3 packet processing).
Is that generic enough to be usable for different vendors or only
specific to Marvell?

cheers,
jamal

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 21:47                                                   ` Jamal Hadi Salim
@ 2014-03-27 21:54                                                     ` Florian Fainelli
  2014-03-27 21:59                                                       ` Jamal Hadi Salim
  0 siblings, 1 reply; 125+ messages in thread
From: Florian Fainelli @ 2014-03-27 21:54 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Sergey Ryazanov, Jiri Pirko, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-27 14:47 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
> On 03/27/14 16:01, Florian Fainelli wrote:
>>
>> 2014-03-27 12:58 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
>>>
>>> 2014-03-27 20:55 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>
>
>>> Moreover, we need to know ports of switch chip, what are really wired
>>> to the connectors (e.g. five-port switch on the board with only three
>>> connectors).
>>
>>
>> Well, DSA already does all of that for you and has Device Tree
>> bindings too to instantiate per-port net_device, create the switch
>> routing table in case switches are cascaded...
>>
>
> Just to be clear:
> routing here implies how the devices are interconnected (as opposed
> to L3 packet processing).

Correct, this is more like chaining here.

> Is that generic enough to be usable for different vendors or only
> specific to marvel?

I think it is. There is nothing in-tree that actually uses that
feature (switch chaining), but all the infrastructure is there to:

- represent links between switches, which one is cascaded from which
one, and get a sense of their addressing within the switch tree
- a function call to get that cascading table programmed into each switch
- get all relevant "uplink/downlink" switch ports configured appropriately
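
To make the notion of a cascading (routing) table concrete, here is a
minimal Python sketch that derives, for each switch, the local port to
use to reach every other switch in the tree. This is not DSA's actual
code; the function `build_routing_tables`, the link tuple layout and the
port numbers are assumptions made for the example.

```python
from collections import deque


def build_routing_tables(links):
    """links: iterable of (switch_a, port_a, switch_b, port_b) tuples.

    Returns {switch: {destination_switch: local_port_towards_it}}.
    """
    neighbours = {}
    for a, pa, b, pb in links:
        neighbours.setdefault(a, []).append((pa, b))
        neighbours.setdefault(b, []).append((pb, a))

    tables = {}
    for src in neighbours:
        table = {}
        queue = deque()
        # Directly attached switches are reached via the linking port.
        for port, nxt in neighbours[src]:
            if nxt not in table:
                table[nxt] = port
                queue.append(nxt)
        # Breadth-first search: farther switches inherit the first-hop port.
        while queue:
            cur = queue.popleft()
            for _, nxt in neighbours[cur]:
                if nxt != src and nxt not in table:
                    table[nxt] = table[cur]
                    queue.append(nxt)
        tables[src] = table
    return tables


# Three switches chained 0 <-> 1 <-> 2 (port numbers are made up).
links = [(0, 5, 1, 4), (1, 5, 2, 4)]
tables = build_routing_tables(links)
print(tables)
```

Each per-switch table is what would then be programmed into that
switch, with the ports it names acting as the uplink/downlink ports.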
-- 
Florian

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 21:20                                                 ` Florian Fainelli
@ 2014-03-27 21:55                                                   ` Jamal Hadi Salim
  2014-03-28  6:28                                                   ` Jiri Pirko
  1 sibling, 0 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-27 21:55 UTC (permalink / raw)
  To: Florian Fainelli, Sergey Ryazanov
  Cc: Jiri Pirko, Roopa Prabhu, Neil Horman, Thomas Graf, netdev,
	David Miller, Andy Gospodarek, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

On 03/27/14 17:20, Florian Fainelli wrote:
> 2014-03-27 13:32 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:

>> I would like go further and suggest to consider a netdev that is
>> connected to the CPU switch port, as master. In case when we need to
>> perform some action on whole switch (e.g. dump FIB).
>
> This is what the 'sw1' net_device in Jiri's proposal would do.
>
>> And even name
>> switch ports, using master netdev name as prefix (e.g. eth1p0, eth1p1,
>> ..., eth1pN for ports of switch that is connected via eth1).
>
> I think the port naming using the switch abstract interface (sw1 here)
> is better because ports do belong to the switch.
>


Can we start calling whatever this netdev is something like "control"?
Or maybe it is the "cpu" netdev? Since I now understand better the need
for such a special netdev, I would say: if I do a "route add" and it
needs to go to a specific switch, the FIB entry will find its way to
the "control" netdev, which will invoke device-specific interfaces for
the ASIC.
I would go almost as far as claiming: let this interface use netlink
definitions (we already know what the FIB netlink interfaces look
like); the caveat is that there may be room to tone it down.
Rinse and repeat for everything else we already know how to do.
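
A rough sketch of that dispatch, to make the idea concrete. Everything
here is hypothetical: struct switch_ctl, switch_ctl_ops, fib_entry and
the dummy ASIC backend are invented for illustration, and the entry is
a simplified struct rather than the real netlink FIB attributes.

```c
#include <stddef.h>
#include <stdint.h>
#include <errno.h>

/* Simplified FIB entry; a real interface would carry netlink attributes. */
struct fib_entry {
	uint32_t dst;
	uint8_t  prefix_len;
	uint32_t gw;
};

struct switch_ctl;

/* Device-specific hooks that a (hypothetical) ASIC driver would fill in. */
struct switch_ctl_ops {
	int (*fib_add)(struct switch_ctl *ctl, const struct fib_entry *e);
};

/* Stand-in for the "control" netdev of one switch. */
struct switch_ctl {
	const char *name;
	const struct switch_ctl_ops *ops;
	void *priv;
};

/* Generic entry point: validate, then hand off to the ASIC driver. */
static int switch_ctl_route_add(struct switch_ctl *ctl,
				const struct fib_entry *e)
{
	if (!ctl || !ctl->ops || !ctl->ops->fib_add)
		return -EOPNOTSUPP;
	if (!e || e->prefix_len > 32)
		return -EINVAL;
	return ctl->ops->fib_add(ctl, e);
}

/* Dummy backend that just records entries in a small table. */
#define DUMMY_FIB_SIZE 8

struct dummy_asic {
	struct fib_entry table[DUMMY_FIB_SIZE];
	size_t used;
};

static int dummy_fib_add(struct switch_ctl *ctl, const struct fib_entry *e)
{
	struct dummy_asic *asic = ctl->priv;

	if (asic->used == DUMMY_FIB_SIZE)
		return -ENOSPC;
	asic->table[asic->used++] = *e;
	return 0;
}

static const struct switch_ctl_ops dummy_ops = {
	.fib_add = dummy_fib_add,
};
```

The same pattern (one generic entry point, one ops table per ASIC)
would repeat for neighbours, VLANs and so on.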

Of course there's a lot of evolution on the core code, but that's
a different discussion.

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 21:54                                                     ` Florian Fainelli
@ 2014-03-27 21:59                                                       ` Jamal Hadi Salim
  2014-03-27 22:19                                                         ` Florian Fainelli
  2014-03-27 23:42                                                         ` Thomas Graf
  0 siblings, 2 replies; 125+ messages in thread
From: Jamal Hadi Salim @ 2014-03-27 21:59 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sergey Ryazanov, Jiri Pirko, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

On 03/27/14 17:54, Florian Fainelli wrote:
> 2014-03-27 14:47 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:

> I think it is. There is nothing in-tree that actually uses that
> feature (switch chaining), but all the infrastructure is there to:
>
> - represent links between switches, which one is cascaded from which
> one and get a sense of their addressing within the switch tree
> - a function call to get that cascading table to programmed to each switch
> - get all relevant "uplink/downlink" switch ports to be configured appropriately
>

Sounds like we have a winner on this aspect at least.
Is it a common setup to have such cascading though - or is this some
overzealous way of solving cascading switches within a LAN?

cheers,
jamal

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 21:59                                                       ` Jamal Hadi Salim
@ 2014-03-27 22:19                                                         ` Florian Fainelli
  2014-03-27 23:42                                                         ` Thomas Graf
  1 sibling, 0 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-27 22:19 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Sergey Ryazanov, Jiri Pirko, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-27 14:59 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
> On 03/27/14 17:54, Florian Fainelli wrote:
>>
>> 2014-03-27 14:47 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>
>
>> I think it is. There is nothing in-tree that actually uses that
>> feature (switch chaining), but all the infrastructure is there to:
>>
>> - represent links between switches, which one is cascaded from which
>> one and get a sense of their addressing within the switch tree
>> - a function call to get that cascading table to programmed to each switch
>> - get all relevant "uplink/downlink" switch ports to be configured
>> appropriately
>>
>
> Sounds like we have a winner on this aspect at least.
> Is it a common setup to have such cascading though - or is this some
> overzelous way of solving cascading switches within a LAN?

It is fairly common for racked Ethernet switches that collect user
traffic in university campuses or companies to be virtually assembled
into one big virtual switch which spans multiple physical switches
(Cisco's Flex Stack), as this minimizes the number of uplinks. I would
guess that when Lennert designed DSA, he may have had this in mind
for higher-end Marvell Ethernet switch ASICs, which could have
benefited from that feature.

In the embedded space, I would say this happens from time to time:
you might cascade an internal switch to an external switch when both
are of the same brand and speak the same tagging protocol.
-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 21:59                                                       ` Jamal Hadi Salim
  2014-03-27 22:19                                                         ` Florian Fainelli
@ 2014-03-27 23:42                                                         ` Thomas Graf
  2014-03-27 23:46                                                           ` Florian Fainelli
  1 sibling, 1 reply; 125+ messages in thread
From: Thomas Graf @ 2014-03-27 23:42 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Florian Fainelli, Sergey Ryazanov, Jiri Pirko, Roopa Prabhu,
	Neil Horman, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

On 03/27/14 at 05:59pm, Jamal Hadi Salim wrote:
> On 03/27/14 17:54, Florian Fainelli wrote:
> >2014-03-27 14:47 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
> 
> >I think it is. There is nothing in-tree that actually uses that
> >feature (switch chaining), but all the infrastructure is there to:
> >
> >- represent links between switches, which one is cascaded from which
> >one and get a sense of their addressing within the switch tree
> >- a function call to get that cascading table to programmed to each switch
> >- get all relevant "uplink/downlink" switch ports to be configured appropriately
> >
> 
> Sounds like we have a winner on this aspect at least.
> Is it a common setup to have such cascading though - or is this some
> overzelous way of solving cascading switches within a LAN?

If the DSA code can be refactored, even better. Any issues with
bumping these, Florian?

#define DSA_MAX_SWITCHES        4
#define DSA_MAX_PORTS           12

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 23:42                                                         ` Thomas Graf
@ 2014-03-27 23:46                                                           ` Florian Fainelli
  0 siblings, 0 replies; 125+ messages in thread
From: Florian Fainelli @ 2014-03-27 23:46 UTC (permalink / raw)
  To: Thomas Graf
  Cc: Jamal Hadi Salim, Sergey Ryazanov, Jiri Pirko, Roopa Prabhu,
	Neil Horman, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

2014-03-27 16:42 GMT-07:00 Thomas Graf <tgraf@suug.ch>:
> On 03/27/14 at 05:59pm, Jamal Hadi Salim wrote:
>> On 03/27/14 17:54, Florian Fainelli wrote:
>> >2014-03-27 14:47 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>
>> >I think it is. There is nothing in-tree that actually uses that
>> >feature (switch chaining), but all the infrastructure is there to:
>> >
>> >- represent links between switches, which one is cascaded from which
>> >one and get a sense of their addressing within the switch tree
>> >- a function call to get that cascading table to programmed to each switch
>> >- get all relevant "uplink/downlink" switch ports to be configured appropriately
>> >
>>
>> Sounds like we have a winner on this aspect at least.
>> Is it a common setup to have such cascading though - or is this some
>> overzelous way of solving cascading switches within a LAN?
>
> If the DSA code can be refactored, even better. Any issues with
> bumping these Florian?
>
> #define DSA_MAX_SWITCHES        4
> #define DSA_MAX_PORTS           12

I don't have access to the relevant datasheets at the moment, so I
can't remember off the top of my head whether that represents a hard
limit for all the Marvell switch drivers in tree (Lennert would
surely know), but at any rate, this should certainly be made dynamic
based on what the switch driver advertises, so please go ahead and
bump those.
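
To sketch what "dynamic" could mean here: size the per-port state from
a port count reported by the driver instead of a compile-time maximum.
This is a userspace illustration under that assumption; struct sw_chip,
struct sw_port, sw_chip_alloc and sw_chip_free are hypothetical names,
and kernel code would use the kernel allocators rather than calloc().

```c
#include <stdlib.h>

/* Hypothetical per-port state. */
struct sw_port {
	int id;
	int is_cpu;
};

/* Hypothetical per-switch state, sized at runtime. */
struct sw_chip {
	unsigned int num_ports;   /* as advertised by the switch driver */
	struct sw_port *ports;    /* allocated to fit, no fixed maximum */
};

/* Allocate a chip with exactly 'num_ports' ports; NULL on failure. */
static struct sw_chip *sw_chip_alloc(unsigned int num_ports)
{
	struct sw_chip *chip;
	unsigned int i;

	if (num_ports == 0)
		return NULL;

	chip = calloc(1, sizeof(*chip));
	if (!chip)
		return NULL;

	chip->ports = calloc(num_ports, sizeof(*chip->ports));
	if (!chip->ports) {
		free(chip);
		return NULL;
	}

	chip->num_ports = num_ports;
	for (i = 0; i < num_ports; i++)
		chip->ports[i].id = (int)i;
	return chip;
}

/* Release a chip allocated by sw_chip_alloc(); NULL is accepted. */
static void sw_chip_free(struct sw_chip *chip)
{
	if (!chip)
		return;
	free(chip->ports);
	free(chip);
}
```

A 48-port switch and a 5-port switch then use the same code path and
only pay for the ports they actually have.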
-- 
Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-27 21:20                                                 ` Florian Fainelli
  2014-03-27 21:55                                                   ` Jamal Hadi Salim
@ 2014-03-28  6:28                                                   ` Jiri Pirko
  2014-03-30 12:08                                                     ` Alon Harel
  1 sibling, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-03-28  6:28 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sergey Ryazanov, Jamal Hadi Salim, Roopa Prabhu, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Scott Feldman, Lennert Buytenhek,
	Shrijeet Mukherjee, Felix Fietkau

Thu, Mar 27, 2014 at 10:20:06PM CET, f.fainelli@gmail.com wrote:
>2014-03-27 13:32 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
>> 2014-03-27 20:41 GMT+04:00 Florian Fainelli <f.fainelli@gmail.com>:
>>> 2014-03-27 7:10 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
>>>> Hi all,
>>>>
>>>> sorry for the intrusion, but let me place my 2 cents.
>>>>
>>>> 2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
>>>>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
>>>>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
>>>>>>> On 03/26/14 14:14, Jiri Pirko wrote:
>>>>>>>>
>>>>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
>>>>>>>>>
>>>>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
>>>>>>>>> example.
>>>>>>>>>
>>>>>>>>> I think there is an implicit convention that sw1 represents the
>>>>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
>>>>>>>>> always connected, hence there is no need to create a "fake" bridge to
>>>>>>>>> link sw1 to eth0 for instance?
>>>>>>>>
>>>>>>>>
>>>>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
>>>>>>>> understanding you correctly).
>>>>>>>> This is how I see it, sticking to the names you use in the example:
>>>>>>>>
>>>>>>>>              (sw1) (abstract place-holder netdev)
>>>>>>>>            --------
>>>>>>>>           switch chip                   CPU
>>>>>>>>     -----------------------            ------
>>>>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
>>>>>>>>       |     |     |     |                |
>>>>>>>>      PHY   PHY   PHY    ------someMII-----
>>>>>>>>
>>>>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
>>>>>>>> switch part (port representation).
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Florian - I am sure you explained this before; I just dont remember. Why
>>>>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
>>>>>>> already in the kernel and the "cpu port" is merely a control interface.
>>>>>>
>>>>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
>>>>>>It is "regular" Ethernet driver connected to the switch without
>>>>>>switch-specific logic. The goal is twofold:
>>>>>>
>>>>>>- allow any regular Ethernet driver to be connected to an external
>>>>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
>>>>>>- represents accurately how the hardware is designed/connected
>>>>>>
>>>>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
>>>>>
>>>>> I believe that hawing both sw1p3 and eth0 is the correct way of
>>>>> modelling this. sw1p3 is instance if switch chip driver representing the
>>>>> actual port of a switch. eth0 is an instance of some other ordinary NIC
>>>>> driver (8139too is my favorite :))
>>>>>
>>>>> This model allows to draw the exact picture.
>>>>> Also, when you add the described possibility to use iplink to build
>>>>> vlans, bridges whatever on the switch ports, it makes perfect sense to
>>>>> have this model.
>>>>>
>>>>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
>>>>>
>>>>
>>>> CPU switch port and switch fabric itself should be configured in
>>>> consistence with host, in first place I mean a set of VLANs. Also it
>>>> should be mentioned that some generic knobs such as port rate and
>>>> duplex mode are meaningless for CPU switch port and a lot of status
>>>> information (rx/tx counters etc.) duplicates statistics of host
>>>> interface which is connected to switch port.
>>>
>>> It duplicates the information when things just work fine, consider an
>>> external switch connected via RGMII to a CPU Ethernet MAC, you might
>>> want to get statistics from both sides (the switch CPU port and the
>>> CPU Ethernet MAC) to diagnose why things are not working as expected,
>>> which unfortunately happens once in a while with RGMII.
>>>
>>> If we expose both net_device, we will be able to retrieve statistics
>>> about from both sides, without resorting to ad-hoc debugging tools,
>>> but maybe this is not worth the effort.
>>>
>> I also thought about this situation. Can we use the debugfs interface
>> for these purposes?
>
>Most of the time you are interesting in MIB counters for debugging
>such issues, so ethtool quickly comes handy for this task. Since we
>will provide per-port counters, the CPU port is not different, so
>there are no reason for restricting this.

I agree, no need to provide a parallel API.

>
>>
>>>> So there are no reasons
>>>> to force user to configure this port manually, and automatic
>>>> configuration of CPU switch port without exporting them as netdev
>>>> seems as good approach.
>>>
>>> Well, maybe that's the answer, since we know that e.g: sw1p3 is always
>>> connected to e.g: eth0, we could create an automatic bridge between
>>> those two, this would keep the netdev exposure to user-space, but an
>>> user would not have to know about that specific detail to get things
>>> to work.
>>>
>> I would like go further and suggest to consider a netdev that is
>> connected to the CPU switch port, as master. In case when we need to
>> perform some action on whole switch (e.g. dump FIB).
>
>This is what the 'sw1' net_device in Jiri's proposal would do.

Except sw1 is not the CPU port. It's just a placeholder not representing
any physical port/netdev.

>
>> And even name
>> switch ports, using master netdev name as prefix (e.g. eth1p0, eth1p1,
>> ..., eth1pN for ports of switch that is connected via eth1).
>
>I think the port naming using the switch abstract interface (sw1 here)
>is better because ports do belong to the switch.
>-- 
>Florian

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-28  6:28                                                   ` Jiri Pirko
@ 2014-03-30 12:08                                                     ` Alon Harel
  0 siblings, 0 replies; 125+ messages in thread
From: Alon Harel @ 2014-03-30 12:08 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Florian Fainelli, Sergey Ryazanov, Jamal Hadi Salim,
	Roopa Prabhu, Neil Horman, Thomas Graf, netdev, David Miller,
	Andy Gospodarek, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Scott Feldman,
	Lennert Buytenhek, Shrijeet Mukherjee

2014-03-28 9:28 GMT+03:00 Jiri Pirko <jiri@resnulli.us>:
>
> Thu, Mar 27, 2014 at 10:20:06PM CET, f.fainelli@gmail.com wrote:
> >2014-03-27 13:32 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
> >> 2014-03-27 20:41 GMT+04:00 Florian Fainelli <f.fainelli@gmail.com>:
> >>> 2014-03-27 7:10 GMT-07:00 Sergey Ryazanov <ryazanov.s.a@gmail.com>:
> >>>> Hi all,
> >>>>
> >>>> sorry for the intrusion, but let me place my 2 cents.
> >>>>
> >>>> 2014-03-27 10:56 GMT+04:00 Jiri Pirko <jiri@resnulli.us>:
> >>>>> Wed, Mar 26, 2014 at 11:22:51PM CET, f.fainelli@gmail.com wrote:
> >>>>>>2014-03-26 14:51 GMT-07:00 Jamal Hadi Salim <jhs@mojatatu.com>:
> >>>>>>> On 03/26/14 14:14, Jiri Pirko wrote:
> >>>>>>>>
> >>>>>>>> Wed, Mar 26, 2014 at 06:58:32PM CET, f.fainelli@gmail.com wrote:
> >>>>>>>>>
> >>>>>>>>> 2014-03-26 10:35 GMT-07:00 Jiri Pirko <jiri@resnulli.us>:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>> You are right, sw1p0 and sw1p1 were meant to be, say LAN ports in my
> >>>>>>>>> example.
> >>>>>>>>>
> >>>>>>>>> I think there is an implicit convention that sw1 represents the
> >>>>>>>>> Ethernet switch port connected to the CPU Ethernet MAC, and that it is
> >>>>>>>>> always connected, hence there is no need to create a "fake" bridge to
> >>>>>>>>> link sw1 to eth0 for instance?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I think you are kind of mixing apples and oranges (or I might be I'm not
> >>>>>>>> understanding you correctly).
> >>>>>>>> This is how I see it, sticking to the names you use in the example:
> >>>>>>>>
> >>>>>>>>              (sw1) (abstract place-holder netdev)
> >>>>>>>>            --------
> >>>>>>>>           switch chip                   CPU
> >>>>>>>>     -----------------------            ------
> >>>>>>>>     sw1p0 sw1p1 sw1p2 sw1p3             eth0
> >>>>>>>>       |     |     |     |                |
> >>>>>>>>      PHY   PHY   PHY    ------someMII-----
> >>>>>>>>
> >>>>>>>> You see that eth0 is the CPU part of the "connection" and sw1p3 is the
> >>>>>>>> switch part (port representation).
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Florian - I am sure you explained this before; I just dont remember. Why
> >>>>>>> is there need to expose eth0? It seems to me sw1p0-3 are abstracted
> >>>>>>> already in the kernel and the "cpu port" is merely a control interface.
> >>>>>>
> >>>>>>eth0 corresponds to a CPU Ethernet MAC facing e.g: sw1p3 switch port.
> >>>>>>It is "regular" Ethernet driver connected to the switch without
> >>>>>>switch-specific logic. The goal is twofold:
> >>>>>>
> >>>>>>- allow any regular Ethernet driver to be connected to an external
> >>>>>>switch via e.g: MDIO/MDC or other without specific switch knowledge
> >>>>>>- represents accurately how the hardware is designed/connected
> >>>>>>
> >>>>>>but maybe, we can simplify and have e.g: sw1p3 and eth0 be the same interface...
> >>>>>
> >>>>> I believe that hawing both sw1p3 and eth0 is the correct way of
> >>>>> modelling this. sw1p3 is instance if switch chip driver representing the
> >>>>> actual port of a switch. eth0 is an instance of some other ordinary NIC
> >>>>> driver (8139too is my favorite :))
> >>>>>
> >>>>> This model allows to draw the exact picture.
> >>>>> Also, when you add the described possibility to use iplink to build
> >>>>> vlans, bridges whatever on the switch ports, it makes perfect sense to
> >>>>> have this model.
> >>>>>
> >>>>> Merging sw1p3 and eth0 would cause a loose of information and confusion.
> >>>>>
> >>>>
> >>>> CPU switch port and switch fabric itself should be configured in
> >>>> consistence with host, in first place I mean a set of VLANs. Also it
> >>>> should be mentioned that some generic knobs such as port rate and
> >>>> duplex mode are meaningless for CPU switch port and a lot of status
> >>>> information (rx/tx counters etc.) duplicates statistics of host
> >>>> interface which is connected to switch port.
> >>>
> >>> It duplicates the information when things just work fine, consider an
> >>> external switch connected via RGMII to a CPU Ethernet MAC, you might
> >>> want to get statistics from both sides (the switch CPU port and the
> >>> CPU Ethernet MAC) to diagnose why things are not working as expected,
> >>> which unfortunately happens once in a while with RGMII.
> >>>
> >>> If we expose both net_device, we will be able to retrieve statistics
> >>> about from both sides, without resorting to ad-hoc debugging tools,
> >>> but maybe this is not worth the effort.
> >>>
> >> I also thought about this situation. Can we use the debugfs interface
> >> for these purposes?
> >
> >Most of the time you are interesting in MIB counters for debugging
> >such issues, so ethtool quickly comes handy for this task. Since we
> >will provide per-port counters, the CPU port is not different, so
> >there are no reason for restricting this.
>
> I agree, no need to provide parallel api.
>
> >
> >>
> >>>> So there are no reasons
> >>>> to force user to configure this port manually, and automatic
> >>>> configuration of CPU switch port without exporting them as netdev
> >>>> seems as good approach.
> >>>
> >>> Well, maybe that's the answer, since we know that e.g: sw1p3 is always
> >>> connected to e.g: eth0, we could create an automatic bridge between
> >>> those two, this would keep the netdev exposure to user-space, but an
> >>> user would not have to know about that specific detail to get things
> >>> to work.
> >>>
> >> I would like go further and suggest to consider a netdev that is
> >> connected to the CPU switch port, as master. In case when we need to
> >> perform some action on whole switch (e.g. dump FIB).
> >
> >This is what the 'sw1' net_device in Jiri's proposal would do.
>
> Except, sw1 is not cpu port. It's just a place holder not representing
> any physical port/netdev.
>
> >
> >> And even name
> >> switch ports, using master netdev name as prefix (e.g. eth1p0, eth1p1,
> >> ..., eth1pN for ports of switch that is connected via eth1).
> >
> >I think the port naming using the switch abstract interface (sw1 here)
> >is better because ports do belong to the switch.
> >--
> >Florian

Resending (sorry, I am new to the mailing list and do not have the
previous mails).

Sorry for jumping in a bit late.
I would like to comment on the point of coupling the OVS datapath (dp) with
one piece of hardware. In the model I have in mind, there is an embedded
switch in a NIC (eSwitch) with some virtual functions (VFs) through which
some of the VMs are connected (SR-IOV), while at the same time there is also
a vSwitch (software switch), e.g. an OVS bridge instance, through which other
VMs are connected using 'macvtap'. In this case, according to Jiri's patch,
we will need more than one dp.
It looks like support for multiple dps was removed from OVS during its
evolution. Are we trying to add it back?
Another option would be to stay with a single dp and support crossing flows
(i.e. flows that cross switches) with additional logic that associates the
ingress and egress ports with a switch.
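
As a minimal sketch of that second option, assuming a flat table that
maps each dp port to the switch it belongs to (the names port_map_entry,
port_to_switch and flow_crosses_switches are made up for illustration
and are not OVS code):

```c
#include <stddef.h>

#define SWITCH_UNKNOWN (-1)

/* One row per datapath port: which switch owns it. */
struct port_map_entry {
	int port;
	int switch_id;
};

/* Look up the switch that owns 'port'; SWITCH_UNKNOWN if not found. */
static int port_to_switch(const struct port_map_entry *map, size_t n,
			  int port)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (map[i].port == port)
			return map[i].switch_id;
	return SWITCH_UNKNOWN;
}

/*
 * Classify a flow by its ingress and egress ports.
 * Returns 1 if the flow crosses switches, 0 if it stays within one
 * switch, and -1 if either port is not in the map.
 */
static int flow_crosses_switches(const struct port_map_entry *map, size_t n,
				 int in_port, int out_port)
{
	int in_sw = port_to_switch(map, n, in_port);
	int out_sw = port_to_switch(map, n, out_port);

	if (in_sw == SWITCH_UNKNOWN || out_sw == SWITCH_UNKNOWN)
		return -1;
	return in_sw != out_sw;
}
```

A flow between two VF ports would stay on the eSwitch and could be
offloaded as a whole, while a VF-to-macvtap flow would be flagged as
crossing and need handling on both sides.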

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-03-26 18:03                               ` Jiri Pirko
  2014-03-26 21:27                                 ` Roopa Prabhu
@ 2014-04-01 19:13                                 ` Scott Feldman
  2014-04-02  6:41                                   ` Jiri Pirko
  2014-04-02 14:32                                   ` Andy Gospodarek
  1 sibling, 2 replies; 125+ messages in thread
From: Scott Feldman @ 2014-04-01 19:13 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Roopa Prabhu, Jamal Hadi Salim, Florian Fainelli, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee


On Mar 26, 2014, at 11:03 AM, Jiri Pirko <jiri@resnulli.us> wrote:

> Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>> On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>> So you implement bonding netlink api? Or you hook into bonding driver
>>> itselt? Can you show us the code?
>> We use the netlink API and libnl. In our current model, our switch
>> chip driver listens to netlink notifications and programs the switch
>> chip. The switch chip driver uses libnl caches and libnl netlink apis
>> to reflect the kernel state to switch chip.
> 
> 
> So when you configure for example bonding over 2 ports, you actually use
> bonding driver to do that. And you userspace app listens to
> notifications and programs the switch chip accordingly. Am I close?
> 
> How about data? Is this new "bonding" interface able to assign ip to is
> and send/receive packets.
> 
> I'm still not sure I understand your concept. Do you have some
> documentation for it available?

Actually, Jiri, this is the code you and I worked on recently to netlink-ify bonding/slave attributes and active/inactive notification.  You have it right: the user uses the normal ip link tools and the bonding driver to create a bond, set attributes, and enslave switch ports.  RTM_NEWLINK is used to program the ASIC to offload the LAG to HW.  RTM_NEWLINK msgs contain the bond attributes (mode, etc.) and slave list, as well as slave status.  This is enough information to program the ASIC.  Once programmed, the ASIC offloads the data plane traffic and, in the case of egress, handles the LAG hash distribution.  Only the LACP control plane traffic makes it to the bonding driver; data plane traffic does not.

So, not trying to sound like a smart-ass, but the documentation is the bonding driver, specifically the netlink attributes/notifications.
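
For readers who have not written such a listener, a minimal sketch of
the receive side follows. It assumes a buffer obtained with recv() from
a socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) bound with nl_groups set
to RTMGRP_LINK, and only walks the messages; the bond details sit in
the nested IFLA_LINKINFO attribute, which is not parsed here. The names
handle_link_msgs, newlink_cb and record_ifindex are hypothetical, and
this is not our actual driver code.

```c
#include <string.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Callback invoked once per RTM_NEWLINK message. */
typedef void (*newlink_cb)(const struct ifinfomsg *ifi, void *arg);

/*
 * Walk a buffer of rtnetlink messages and call 'cb' for every
 * RTM_NEWLINK that is large enough to hold a struct ifinfomsg.
 * Returns the number of RTM_NEWLINK messages handled.
 */
static int handle_link_msgs(void *buf, int len, newlink_cb cb, void *arg)
{
	struct nlmsghdr *nh;
	int count = 0;

	for (nh = buf; NLMSG_OK(nh, len); nh = NLMSG_NEXT(nh, len)) {
		if (nh->nlmsg_type == NLMSG_DONE)
			break;
		if (nh->nlmsg_type != RTM_NEWLINK)
			continue;
		if (nh->nlmsg_len < NLMSG_LENGTH(sizeof(struct ifinfomsg)))
			continue;
		if (cb)
			cb(NLMSG_DATA(nh), arg);
		count++;
	}
	return count;
}

/* Example callback: remember the ifindex of the last link seen. */
static void record_ifindex(const struct ifinfomsg *ifi, void *arg)
{
	*(int *)arg = ifi->ifi_index;
}
```

A real agent would replace record_ifindex with code that reads the
link attributes and programs the ASIC accordingly.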

-scott

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-01 19:13                                 ` Scott Feldman
@ 2014-04-02  6:41                                   ` Jiri Pirko
  2014-04-02 15:37                                     ` Scott Feldman
  2014-04-02 14:32                                   ` Andy Gospodarek
  1 sibling, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-04-02  6:41 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Roopa Prabhu, Jamal Hadi Salim, Florian Fainelli, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

Tue, Apr 01, 2014 at 09:13:00PM CEST, sfeldma@cumulusnetworks.com wrote:
>
>On Mar 26, 2014, at 11:03 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>
>> Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>>> On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>> So you implement bonding netlink api? Or you hook into bonding driver
>>>> itselt? Can you show us the code?
>>> We use the netlink API and libnl. In our current model, our switch
>>> chip driver listens to netlink notifications and programs the switch
>>> chip. The switch chip driver uses libnl caches and libnl netlink apis
>>> to reflect the kernel state to switch chip.
>> 
>> 
>> So when you configure for example bonding over 2 ports, you actually use
>> bonding driver to do that. And you userspace app listens to
>> notifications and programs the switch chip accordingly. Am I close?
>> 
>> How about data? Is this new "bonding" interface able to assign ip to is
>> and send/receive packets.
>> 
>> I'm still not sure I understand your concept. Do you have some
>> documentation for it available?
>
>Actually Jiri this is the code you and I worked on recently to netlink-ify bonding/slave attributes and active/inactive notification.  You have it right, user uses normal ip link tools and bonding driver to create bond, set attributes, and enslave switch ports.  RTM_NEWLINK is used to program ASIC to offload LAG to HW.  RTM_NEWLINK msgs contains bond attributes (mode, etc) and slave list, as well as slave status.  This is enough information to program ASIC.  Once programmed, ASIC offloads the data plane traffic, and in the case of egress, handles the LAG hash distribution.  Only the LACP control plane traffic makes it to the bonding driver; data plane traffic does not make it to the bonding driver.
>
>So, not trying to sound like a smart-ass, but the documentation is the bonding driver, specifically the netlink attributes/notifications.

Ok, so no additional kernel code for this? Only some userspace agent
programming the chip?


>
>-scott

^ permalink raw reply	[flat|nested] 125+ messages in thread

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-01 19:13                                 ` Scott Feldman
  2014-04-02  6:41                                   ` Jiri Pirko
@ 2014-04-02 14:32                                   ` Andy Gospodarek
  2014-04-02 15:25                                     ` John W. Linville
  1 sibling, 1 reply; 125+ messages in thread
From: Andy Gospodarek @ 2014-04-02 14:32 UTC (permalink / raw)
  To: Scott Feldman, Jiri Pirko
  Cc: Roopa Prabhu, Jamal Hadi Salim, Florian Fainelli, Neil Horman,
	Thomas Graf, netdev, David Miller, dborkman, ogerlitz, jesse,
	pshelar, azhou, Ben Hutchings, Stephen Hemminger,
	jeffrey.t.kirsher, vyasevic, Cong Wang, John Fastabend,
	Eric Dumazet, Lennert Buytenhek, Shrijeet Mukherjee

On 04/01/2014 03:13 PM, Scott Feldman wrote:
> On Mar 26, 2014, at 11:03 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>
>> Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>>> On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>> So you implement bonding netlink api? Or you hook into bonding driver
>>>> itselt? Can you show us the code?
>>> We use the netlink API and libnl. In our current model, our switch
>>> chip driver listens to netlink notifications and programs the switch
>>> chip. The switch chip driver uses libnl caches and libnl netlink apis
>>> to reflect the kernel state to switch chip.
>>
>> So when you configure for example bonding over 2 ports, you actually use
>> bonding driver to do that. And you userspace app listens to
>> notifications and programs the switch chip accordingly. Am I close?
>>
>> How about data? Is this new "bonding" interface able to assign ip to is
>> and send/receive packets.
>>
>> I'm still not sure I understand your concept. Do you have some
>> documentation for it available?
> Actually Jiri this is the code you and I worked on recently to netlink-ify bonding/slave attributes and active/inactive notification.  You have it right, user uses normal ip link tools and bonding driver to create bond, set attributes, and enslave switch ports.  RTM_NEWLINK is used to program ASIC to offload LAG to HW.  RTM_NEWLINK msgs contains bond attributes (mode, etc) and slave list, as well as slave status.  This is enough information to program ASIC.  Once programmed, ASIC offloads the data plane traffic, and in the case of egress, handles the LAG hash distribution.  Only the LACP control plane traffic makes it to the bonding driver; data plane traffic does not make it to the bonding driver.
>
> So, not trying to sound like a smart-ass, but the documentation is the bonding driver, specifically the netlink attributes/notifications.
>
> -scott

Using netlink messages to notify drivers for these ASICs really seems 
like a great way to handle things.  It would obviously require some 
expansion of netlink, but that seems fine.

I would prefer that ASIC vendors write initial drivers for their ASICs 
such that each physical port is detected and exported as a netdev.  This 
would mean each *minimal* kernel driver for an ASIC would need to have 
support for the following (off the top of my head):

- detect link status on an interface
- set an interface's MAC address
- configure the chip to send all frames to the CPU
- register a napi handler for the interfaces (depending on 
packet-buffering capabilities in the hardware)

As support for new hardware capabilities is moved from switch vendor
SDKs to their kernel driver, the driver can begin to listen for netlink
messages that:

- setup bonds/teams
- add ports to bridge groups
- configure port-based or mac-based VLANs
- add unicast and multicast entries
- add and remove entries from a flow table
- ...

Maybe this all seems too matter-of-fact and the discussion has evolved
well beyond something this high-level, but there still seems to be 
significant discussion about whether or not the ASIC should be exported 
as a netdev and I'm just not seeing a compelling reason. This was my 
attempt to explain why.  :)


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 14:32                                   ` Andy Gospodarek
@ 2014-04-02 15:25                                     ` John W. Linville
  2014-04-02 16:15                                       ` Scott Feldman
  0 siblings, 1 reply; 125+ messages in thread
From: John W. Linville @ 2014-04-02 15:25 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Scott Feldman, Jiri Pirko, Roopa Prabhu, Jamal Hadi Salim,
	Florian Fainelli, Neil Horman, Thomas Graf, netdev, David Miller,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

On Wed, Apr 02, 2014 at 10:32:49AM -0400, Andy Gospodarek wrote:
> On 04/01/2014 03:13 PM, Scott Feldman wrote:
> >On Mar 26, 2014, at 11:03 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> >
> >>Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
> >>>On 3/26/14, 9:59 AM, Jiri Pirko wrote:
> >>>>Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
> >>>>So you implement bonding netlink api? Or you hook into bonding driver
> >>>>itselt? Can you show us the code?
> >>>We use the netlink API and libnl. In our current model, our switch
> >>>chip driver listens to netlink notifications and programs the switch
> >>>chip. The switch chip driver uses libnl caches and libnl netlink apis
> >>>to reflect the kernel state to switch chip.
> >>
> >>So when you configure for example bonding over 2 ports, you actually use
> >>bonding driver to do that. And you userspace app listens to
> >>notifications and programs the switch chip accordingly. Am I close?
> >>
> >>How about data? Is this new "bonding" interface able to assign ip to is
> >>and send/receive packets.
> >>
> >>I'm still not sure I understand your concept. Do you have some
> >>documentation for it available?
> >Actually Jiri this is the code you and I worked on recently to netlink-ify bonding/slave attributes and active/inactive notification.  You have it right, user uses normal ip link tools and bonding driver to create bond, set attributes, and enslave switch ports.  RTM_NEWLINK is used to program ASIC to offload LAG to HW.  RTM_NEWLINK msgs contains bond attributes (mode, etc) and slave list, as well as slave status.  This is enough information to program ASIC.  Once programmed, ASIC offloads the data plane traffic, and in the case of egress, handles the LAG hash distribution.  Only the LACP control plane traffic makes it to the bonding driver; data plane traffic does not make it to the bonding driver.
> >
> >So, not trying to sound like a smart-ass, but the documentation is the bonding driver, specifically the netlink attributes/notifications.
> >
> >-scott
> 
> Using netlink messages to notify drivers for these ASICs really
> seems like a great way to handle things.  It would obviously require
> some expansion of netlink, but that seems fine.
> 
> I would prefer that ASIC vendors write initial drivers for their
> ASICs such that each physical port is detected and exported as a
> netdev.  This would mean each *minimal* kernel driver for an ASIC
> would need to have support for the following (off the top of my
> head):
> 
> - detect link status on an interface
> - set an interface's MAC address
> - configure the chip to send all frames to the CPU
> - register a napi handler for the interfaces (depending on
> packet-buffering capabilities in the hardware)
> 
> As support for new hardware capabilities are moved from switch
> vendor SDKs to their kernel driver the driver can begin to listen
> for netlink messages that:
> 
> - setup bonds/teams
> - add ports to bridge groups
> - configure port-based or mac-based VLANs
> - add unicast and multicast entries
> - add and remove entries from a flow table
> - ...
> 
> Maybe this all seems to matter-of-fact and the discussion has
> evolved well beyond something this high-level, but there still seems
> to be significant discussion about whether or not the ASIC should be
> exported as a netdev and I'm just not seeing a compelling reason.
> This was my attempt to explain why.  :)

Andy and I discussed this off-line, so I am admittedly partial to
the conclusions we shared as reflected above... :-)

While I might be convinced that there should be _something_ to
represent the switch chip for some purpose (e.g. topology mapping),
I'm not at all convinced that thing should be a netdev.  I don't see
where the switch chip by itself looks much like any other netdev at
all, especially once you model the actual front-panel ports with
their own netdevs.  I do know that having an extra "magic netdev"
in the wireless space added a lot of confusion for no clear gain,
leading to it later being abolished.

Modeling at the switch level might make more sense from a flow
management perspective?  But if those flows are managed using a netlink
protocol, does it matter what sort of entity is listening and acting
on those messages?  If a switch-specific interface is needed for that,
we should build it rather than pretending it looks like a netdev.
I also think that throwing the DSA switches in with flow-based and
"Enterprise" switches may just be confusing things.

I think that the opening bid should be a minimal hardware driver that
models each front-panel port with a netdev and passes all traffic
to/from the CPU.  Intelligence beyond that should be added on a
'can-do' basis, with individual drivers (or corresponding userland
components) listening to existing netlink traffic and implementing
support for existing protocols to the best of their abilities.
Missing functionality in the netlink protocols or other functions
(e.g. bonding, bridging, etc) can be evolved over time as we discover
missing bits required for switch acceleration.
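
As a rough illustration of the 'can-do' model (a sketch only, not code
from this thread): a driver or userland agent registers handlers for the
netlink message types it is able to offload, and anything without a
handler simply stays on the CPU path. The class name CanDoOffload, its
methods, and the dictionaries passed as "info" are hypothetical; only the
RTM_* numeric values are taken from linux/rtnetlink.h.

```python
# rtnetlink message types, values as defined in linux/rtnetlink.h
RTM_NEWLINK = 16
RTM_NEWROUTE = 24
RTM_NEWNEIGH = 28


class CanDoOffload:
    """Hypothetical per-driver table of the netlink events it can offload."""

    def __init__(self):
        self._handlers = {}

    def register(self, msg_type, handler):
        # A driver registers only the message types its hardware supports.
        self._handlers[msg_type] = handler

    def notify(self, msg_type, info):
        # Returns True if the event was handed to hardware, False if it
        # is left to the software data path.
        handler = self._handlers.get(msg_type)
        if handler is None:
            return False
        handler(info)
        return True


programmed = []  # stands in for writes to the switch chip
driver = CanDoOffload()
driver.register(RTM_NEWLINK, lambda info: programmed.append(("lag", info)))

offloaded_link = driver.notify(RTM_NEWLINK, {"bond": "bond0", "slaves": ["swp1", "swp2"]})
offloaded_route = driver.notify(RTM_NEWROUTE, {"dst": "10.0.0.0/24"})
```

Here the link event is offloaded while the route event is not, because
this imaginary driver has not (yet) registered route support.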

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02  6:41                                   ` Jiri Pirko
@ 2014-04-02 15:37                                     ` Scott Feldman
  0 siblings, 0 replies; 125+ messages in thread
From: Scott Feldman @ 2014-04-02 15:37 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Roopa Prabhu, Jamal Hadi Salim, Florian Fainelli, Neil Horman,
	Thomas Graf, netdev, David Miller, Andy Gospodarek, dborkman,
	ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, Jeff Kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee


On Apr 1, 2014, at 11:41 PM, Jiri Pirko <jiri@resnulli.us> wrote:

> Tue, Apr 01, 2014 at 09:13:00PM CEST, sfeldma@cumulusnetworks.com wrote:
>> 
>> On Mar 26, 2014, at 11:03 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> 
>>> Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>>>> On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>> So you implement bonding netlink api? Or you hook into bonding driver
>>>>> itselt? Can you show us the code?
>>>> We use the netlink API and libnl. In our current model, our switch
>>>> chip driver listens to netlink notifications and programs the switch
>>>> chip. The switch chip driver uses libnl caches and libnl netlink apis
>>>> to reflect the kernel state to switch chip.
>>> 
>>> 
>>> So when you configure for example bonding over 2 ports, you actually use
>>> bonding driver to do that. And you userspace app listens to
>>> notifications and programs the switch chip accordingly. Am I close?
>>> 
>>> How about data? Is this new "bonding" interface able to assign ip to is
>>> and send/receive packets.
>>> 
>>> I'm still not sure I understand your concept. Do you have some
>>> documentation for it available?
>> 
>> Actually Jiri this is the code you and I worked on recently to netlink-ify bonding/slave attributes and active/inactive notification. You have it right, user uses normal ip link tools and bonding driver to create bond, set attributes, and enslave switch ports. RTM_NEWLINK is used to program ASIC to offload LAG to HW.  RTM_NEWLINK msgs contains bond attributes (mode, etc) and slave list, as well as slave status.  This is enough information to program ASIC.  Once programmed, ASIC offloads the data plane traffic, and in the case of egress, handles the LAG hash distribution.  Only the LACP control plane traffic makes it to the bonding driver; data plane traffic does not make it to the bonding driver.
>> 
>> So, not trying to sound like a smart-ass, but the documentation is the bonding driver, specifically the netlink attributes/notifications.
> 
> Ok, so no additional kernel code for this?

The kernel is rich with netlink.  Bonds, bridges, vlans, vxlans, L3 route tables, flow tables, neigh table, addr table, and the list goes on, all give up their info via netlink.  Add in a simple netdev-based abstraction for switch ports (such as yours) and you have everything (well, almost, the devil is in the details) you need to program switch chips.  From netlink you get the mgmt plane to HW offload the data plane from the kernel.

> Only some userspace agent programming the chip?

If using netlink, the agent programming the chip can live in the kernel (preferred) or user-space.  Netlink listener can reside either place, since it’s a multicast bus.
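
To make that concrete, here is a minimal sketch of what a listener has to do with an RTM_NEWLINK message once it arrives: walk the nlmsghdr, the ifinfomsg, and the attributes, and pull out the port name and its master (the bond). It is an assumption-laden illustration, not the Cumulus code: the helper names (build_newlink, parse_newlink, pack_attr) are made up, the message is built synthetically rather than read from a socket, and only IFLA_IFNAME and IFLA_MASTER are handled. The numeric constants and struct layouts follow linux/netlink.h, linux/rtnetlink.h and linux/if_link.h.

```python
import struct

RTM_NEWLINK = 16   # linux/rtnetlink.h
IFLA_IFNAME = 3    # linux/if_link.h
IFLA_MASTER = 10   # linux/if_link.h

# Netlink uses host byte order; "=" selects native order, standard sizes.
NLMSG_HDR = struct.Struct("=IHHII")   # len, type, flags, seq, pid
IFINFOMSG = struct.Struct("=BxHiII")  # family, pad, type, index, flags, change
RTATTR = struct.Struct("=HH")         # len, type


def align4(n):
    """Round up to the 4-byte alignment netlink attributes use."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    length = RTATTR.size + len(payload)
    padding = b"\0" * (align4(length) - length)
    return RTATTR.pack(length, attr_type) + payload + padding


def build_newlink(ifindex, name, master):
    """Build a synthetic RTM_NEWLINK message (stand-in for a kernel notification)."""
    body = IFINFOMSG.pack(0, 1, ifindex, 0, 0)  # type 1 = ARPHRD_ETHER
    body += pack_attr(IFLA_IFNAME, name.encode() + b"\0")
    body += pack_attr(IFLA_MASTER, struct.pack("=I", master))
    return NLMSG_HDR.pack(NLMSG_HDR.size + len(body), RTM_NEWLINK, 0, 0, 0) + body


def parse_newlink(msg):
    """Return ifindex, name and master ifindex from one RTM_NEWLINK message."""
    length, msg_type, _flags, _seq, _pid = NLMSG_HDR.unpack_from(msg, 0)
    if msg_type != RTM_NEWLINK:
        return None
    offset = NLMSG_HDR.size
    _family, _dev_type, ifindex, _ifflags, _change = IFINFOMSG.unpack_from(msg, offset)
    offset += IFINFOMSG.size
    link = {"ifindex": ifindex, "name": None, "master": None}
    while offset + RTATTR.size <= length:
        attr_len, attr_type = RTATTR.unpack_from(msg, offset)
        if attr_len < RTATTR.size:
            break  # malformed attribute, stop walking
        payload = msg[offset + RTATTR.size:offset + attr_len]
        if attr_type == IFLA_IFNAME:
            link["name"] = payload.rstrip(b"\0").decode()
        elif attr_type == IFLA_MASTER:
            link["master"] = struct.unpack("=I", payload)[0]
        offset += align4(attr_len)
    return link


link = parse_newlink(build_newlink(5, "swp1", 7))
```

A real user-space listener would read such buffers from an AF_NETLINK / NETLINK_ROUTE socket subscribed to the link multicast group; the parsing step would be the same.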

-scott


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 15:25                                     ` John W. Linville
@ 2014-04-02 16:15                                       ` Scott Feldman
  2014-04-02 16:47                                         ` Florian Fainelli
  2014-04-02 19:29                                         ` John W. Linville
  0 siblings, 2 replies; 125+ messages in thread
From: Scott Feldman @ 2014-04-02 16:15 UTC (permalink / raw)
  To: John W. Linville
  Cc: Andy Gospodarek, Jiri Pirko, Roopa Prabhu, Jamal Hadi Salim,
	Florian Fainelli, Neil Horman, Thomas Graf, netdev, David Miller,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee


On Apr 2, 2014, at 8:25 AM, John W. Linville <linville@tuxdriver.com> wrote:

> On Wed, Apr 02, 2014 at 10:32:49AM -0400, Andy Gospodarek wrote:
>> On 04/01/2014 03:13 PM, Scott Feldman wrote:
>>> On Mar 26, 2014, at 11:03 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>> 
>>>> Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>>>>> On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>>>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>>> So you implement bonding netlink api? Or you hook into bonding driver
>>>>>> itselt? Can you show us the code?
>>>>> We use the netlink API and libnl. In our current model, our switch
>>>>> chip driver listens to netlink notifications and programs the switch
>>>>> chip. The switch chip driver uses libnl caches and libnl netlink apis
>>>>> to reflect the kernel state to switch chip.
>>>> 
>>>> So when you configure for example bonding over 2 ports, you actually use
>>>> bonding driver to do that. And you userspace app listens to
>>>> notifications and programs the switch chip accordingly. Am I close?
>>>> 
>>>> How about data? Is this new "bonding" interface able to assign ip to is
>>>> and send/receive packets.
>>>> 
>>>> I'm still not sure I understand your concept. Do you have some
>>>> documentation for it available?
>>> Actually Jiri this is the code you and I worked on recently to netlink-ify bonding/slave attributes and active/inactive notification.  You have it right, user uses normal ip link tools and bonding driver to create bond, set attributes, and enslave switch ports.  RTM_NEWLINK is used to program ASIC to offload LAG to HW.  RTM_NEWLINK msgs contains bond attributes (mode, etc) and slave list, as well as slave status.  This is enough information to program ASIC.  Once programmed, ASIC offloads the data plane traffic, and in the case of egress, handles the LAG hash distribution.  Only the LACP control plane traffic makes it to the bonding driver; data plane traffic does not make it to the bonding driver.
>>> 
>>> So, not trying to sound like a smart-ass, but the documentation is the bonding driver, specifically the netlink attributes/notifications.
>>> 
>>> -scott
>> 
>> Using netlink messages to notify drivers for these ASICs really
>> seems like a great way to handle things.  It would obviously require
>> some expansion of netlink, but that seems fine.
>> 
>> I would prefer that ASIC vendors write initial drivers for their
>> ASICs such that each physical port is detected and exported as a
>> netdev.  This would mean each *minimal* kernel driver for an ASIC
>> would need to have support for the following (off the top of my
>> head):
>> 
>> - detect link status on an interface
>> - set an interface's MAC address
>> - configure the chip to send all frames to the CPU
>> - register a napi handler for the interfaces (depending on
>> packet-buffering capabilities in the hardware)
>> 
>> As support for new hardware capabilities are moved from switch
>> vendor SDKs to their kernel driver the driver can begin to listen
>> for netlink messages that:
>> 
>> - setup bonds/teams
>> - add ports to bridge groups
>> - configure port-based or mac-based VLANs
>> - add unicast and multicast entries
>> - add and remove entries from a flow table
>> - ...
>> 
>> Maybe this all seems to matter-of-fact and the discussion has
>> evolved well beyond something this high-level, but there still seems
>> to be significant discussion about whether or not the ASIC should be
>> exported as a netdev and I'm just not seeing a compelling reason.
>> This was my attempt to explain why.  :)
> 
> Andy and I discussed this off-line, so I am admittedly partial to
> the conclusions we shared as reflected above... :-)
> 
> While I might be convinced that there should be _something_ to
> represent the switch chip for some purpose (e.g. topology mapping),
> I'm not at all convinced that thing should be a netdev.  I don't see
> where the switch chip by itself looks much like any other netdev at
> all, especially once you model the actual front-panel ports with
> their own netdevs.  I do know that having an extra "magic netdev"
> in the wireless space added a lot of confusion for no clear gain,
> leading to it later being abolished.
> 
> Modeling at the switch level might make more sense from a flow
> management perspective?  But if those flows are managed using a netlink
> protocol, does it matter what sort of entity is listening and acting
> on those messages?  If a switch-specific interface is needed for that,
> we should build it rather than pretending it looks like a netdev.
> I also think that throwing the DSA switches in with flow-based and
> "Enterprise" switches may just be confusing things.
> 
> I think that the opening bid should be a minimal hardware driver that
> models each front-panel port with a netdev and passes all traffic
> to/from the CPU.  Intelligence beyond that should be added on a
> 'can-do' basis, with individual drivers (or corresponding userland
> components) listening to existing netlink traffic and implementing
> support for existing protocols to the best of their abilities.
> Missing functionality in the netlink protocols or other functions
> (e.g. bonding, bridging, etc) can be evolved over time as we discover
> missing bits required for switch acceleration.

I agree completely with your/Andy’s view.  It’s the switch port, not the switch, that needs to be modeled as a netdev.  The switch port is the abstraction that allows other existing virtual devices (bridges, bonds, vxlans, etc) to cuddle against.  Is a switch port a special netdev in some way?  At a high level, not really.  I mean, in a sense it’s just eth48 on a super NIC.  OK, there may be some advantage to setting an IFF_SWITCH_PORT flag on the switch port netdev, so cuddling netdevs could get a hint that their data plane might be offloaded.
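
A tiny sketch of how such a hint might be consumed, purely as illustration: IFF_SWITCH_PORT does not exist, so the bit value below is arbitrary, and lag_can_offload is a made-up helper. It also assumes all flagged ports sit on the same switch, which a real check would have to verify.

```python
IFF_UP = 0x1                # real flag from linux/if.h
IFF_SWITCH_PORT = 1 << 30   # hypothetical flag; bit position chosen arbitrarily


def lag_can_offload(slave_flags):
    """True when every slave carries the (hypothetical) switch-port hint."""
    return bool(slave_flags) and all(f & IFF_SWITCH_PORT for f in slave_flags)


all_switch_ports = lag_can_offload([IFF_UP | IFF_SWITCH_PORT, IFF_SWITCH_PORT])
mixed_ports = lag_can_offload([IFF_UP | IFF_SWITCH_PORT, IFF_UP])
```

A bond over two flagged ports could then assume its data plane may be handled by the ASIC, while a bond that mixes in an ordinary NIC could not.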

I’ve been back-and-forth on the switch netdev.  Today I’m not for it.  But I’m still searching for a reason.  At one point I thought a switch netdev would be nice in an L3 router case where we needed a router IP address to do things like OSPF unnumbered interfaces, but even in that case, we can just put the router IP on lo.  Another reason would be to use the switch netdev as a place for switch-wide settings and status.  For example, ethtool -S stats on the switch netdev would show switch-wide stats like ACL drops or something like that.  Maybe a switch device is modeled as a new device class?  I guess it comes down to how much is duplicated between different vendors' switch driver implementations.

Agree on the missing netlink functionality point, add it as we go.  Outside the bonding stuff we recently added, we (Cumulus) find netlink pretty complete as-is to program modern, enterprise-class switch chips.

-scott


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 16:15                                       ` Scott Feldman
@ 2014-04-02 16:47                                         ` Florian Fainelli
  2014-04-02 21:52                                           ` Thomas Graf
  2014-04-02 19:29                                         ` John W. Linville
  1 sibling, 1 reply; 125+ messages in thread
From: Florian Fainelli @ 2014-04-02 16:47 UTC (permalink / raw)
  To: Scott Feldman
  Cc: John W. Linville, Andy Gospodarek, Jiri Pirko, Roopa Prabhu,
	Jamal Hadi Salim, Neil Horman, Thomas Graf, netdev, David Miller,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

2014-04-02 9:15 GMT-07:00 Scott Feldman <sfeldma@cumulusnetworks.com>:
>
> On Apr 2, 2014, at 8:25 AM, John W. Linville <linville@tuxdriver.com> wrote:
>
>> On Wed, Apr 02, 2014 at 10:32:49AM -0400, Andy Gospodarek wrote:
>>> On 04/01/2014 03:13 PM, Scott Feldman wrote:
>>>> On Mar 26, 2014, at 11:03 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>>>>
>>>>> Wed, Mar 26, 2014 at 06:47:15PM CET, roopa@cumulusnetworks.com wrote:
>>>>>> On 3/26/14, 9:59 AM, Jiri Pirko wrote:
>>>>>>> Wed, Mar 26, 2014 at 05:54:17PM CET, roopa@cumulusnetworks.com wrote:
>>>>>>> So you implement bonding netlink api? Or you hook into bonding driver
>>>>>>> itselt? Can you show us the code?
>>>>>> We use the netlink API and libnl. In our current model, our switch
>>>>>> chip driver listens to netlink notifications and programs the switch
>>>>>> chip. The switch chip driver uses libnl caches and libnl netlink apis
>>>>>> to reflect the kernel state to switch chip.
>>>>>
>>>>> So when you configure for example bonding over 2 ports, you actually use
>>>>> bonding driver to do that. And you userspace app listens to
>>>>> notifications and programs the switch chip accordingly. Am I close?
>>>>>
>>>>> How about data? Is this new "bonding" interface able to assign ip to is
>>>>> and send/receive packets.
>>>>>
>>>>> I'm still not sure I understand your concept. Do you have some
>>>>> documentation for it available?
>>>> Actually Jiri this is the code you and I worked on recently to netlink-ify bonding/slave attributes and active/inactive notification.  You have it right, user uses normal ip link tools and bonding driver to create bond, set attributes, and enslave switch ports.  RTM_NEWLINK is used to program ASIC to offload LAG to HW.  RTM_NEWLINK msgs contains bond attributes (mode, etc) and slave list, as well as slave status.  This is enough information to program ASIC.  Once programmed, ASIC offloads the data plane traffic, and in the case of egress, handles the LAG hash distribution.  Only the LACP control plane traffic makes it to the bonding driver; data plane traffic does not make it to the bonding driver.
>>>>
>>>> So, not trying to sound like a smart-ass, but the documentation is the bonding driver, specifically the netlink attributes/notifications.
>>>>
>>>> -scott
>>>
>>> Using netlink messages to notify drivers for these ASICs really
>>> seems like a great way to handle things.  It would obviously require
>>> some expansion of netlink, but that seems fine.
>>>
>>> I would prefer that ASIC vendors write initial drivers for their
>>> ASICs such that each physical port is detected and exported as a
>>> netdev.  This would mean each *minimal* kernel driver for an ASIC
>>> would need to have support for the following (off the top of my
>>> head):
>>>
>>> - detect link status on an interface
>>> - set an interface's MAC address
>>> - configure the chip to send all frames to the CPU
>>> - register a napi handler for the interfaces (depending on
>>> packet-buffering capabilities in the hardware)
>>>
>>> As support for new hardware capabilities are moved from switch
>>> vendor SDKs to their kernel driver the driver can begin to listen
>>> for netlink messages that:
>>>
>>> - setup bonds/teams
>>> - add ports to bridge groups
>>> - configure port-based or mac-based VLANs
>>> - add unicast and multicast entries
>>> - add and remove entries from a flow table
>>> - ...
>>>
>>> Maybe this all seems to matter-of-fact and the discussion has
>>> evolved well beyond something this high-level, but there still seems
>>> to be significant discussion about whether or not the ASIC should be
>>> exported as a netdev and I'm just not seeing a compelling reason.
>>> This was my attempt to explain why.  :)
>>
>> Andy and I discussed this off-line, so I am admittedly partial to
>> the conclusions we shared as reflected above... :-)
>>
>> While I might be convinced that there should be _something_ to
>> represent the switch chip for some purpose (e.g. topology mapping),
>> I'm not at all convinced that thing should be a netdev.  I don't see
>> where the switch chip by itself looks much like any other netdev at
>> all, especially once you model the actual front-panel ports with
>> their own netdevs.  I do know that having an extra "magic netdev"
>> in the wireless space added a lot of confusion for no clear gain,
>> leading to it later being abolished.
>>
>> Modeling at the switch level might make more sense from a flow
>> management perspective?  But if those flows are managed using a netlink
>> protocol, does it matter what sort of entity is listening and acting
>> on those messages?  If a switch-specific interface is needed for that,
>> we should build it rather than pretending it looks like a netdev.
>> I also think that throwing the DSA switches in with flow-based and
>> "Enterprise" switches may just be confusing things.
>>
>> I think that the opening bid should be a minimal hardware driver that
>> models each front-panel port with a netdev and passes all traffic
>> to/from the CPU.  Intelligence beyond that should be added on a
>> 'can-do' basis, with individual drivers (or corresponding userland
>> components) listening to existing netlink traffic and implementing
>> support for existing protocols to the best of their abilities.
>> Missing functionality in the netlink protocols or other functions
>> (e.g. bonding, bridging, etc) can be evolved over time as we discover
>> missing bits required for switch acceleration.
>
> I agree completely with your/Andy’s view.  It’s the switch port, not the switch, that needs to be modeled as a netdev.  The switch port is the abstraction that allows other existing virtual devices (bridges, bond, vxlans, etc) to cuddle against.  Is a switch port a special netdev in some way?  At a high level, not really.  I mean in sense it’s just eth48 on a super NIC.  OK, there may be some advantage to setting a IFF_SWITCH_PORT on the switch port netdev, so cuddling netdevs could get a hint that their data plane might be offloaded.
>
> I’ve been back-and-forth on the switch netdev.  Today I’m not for it.  But I’m still searching for a reason.  At one point I thought a switch netdev would be nice in a L3 router case where we needed a router IP address to do things like OSPF unnumbered interfaces, but even in that case, we can just put the router IP on lo.  Another reason would be to use the switch netdev as a place for switch-wide settings and status.  For example,
> ethtool -S stats on switch netdev would show switch-wide stats like ACL drops or something like that.  Maybe a switch device is modeled as a new device class?  I guess it comes down to how much is duplicated between different vendors' switch driver implementations.

I think the idea behind exposing a switch net_device is to account for
all special cases where there is not already an existing and
well-defined model for switch-wide events/control/information that we
might want to have. Why a net_device? Because the switch ports will
already be exposed as such, so mostly for consistency with the
presented user-space interface. Whether that net_device exposes
different child devices of different classes, e.g: MTD partitions to
access firmware updates, SPI master/slave controller(s), MDIO
controller(s), is yet to be defined I suppose.

>
> Agree on the missing netlink functionality point, add it as we go.  Outside the bonding stuff we recently added, we (Cumulus) find netlink pretty complete as-is to program modern, enterprise-class switch chips.
>
> -scott
>
>
>



-- 
Florian


* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 16:15                                       ` Scott Feldman
  2014-04-02 16:47                                         ` Florian Fainelli
@ 2014-04-02 19:29                                         ` John W. Linville
  2014-04-02 19:54                                           ` Scott Feldman
  2014-04-02 20:04                                           ` Stephen Hemminger
  1 sibling, 2 replies; 125+ messages in thread
From: John W. Linville @ 2014-04-02 19:29 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Andy Gospodarek, Jiri Pirko, Roopa Prabhu, Jamal Hadi Salim,
	Florian Fainelli, Neil Horman, Thomas Graf, netdev, David Miller,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

On Wed, Apr 02, 2014 at 09:15:55AM -0700, Scott Feldman wrote:
> 
> On Apr 2, 2014, at 8:25 AM, John W. Linville <linville@tuxdriver.com> wrote:
> 
> > On Wed, Apr 02, 2014 at 10:32:49AM -0400, Andy Gospodarek wrote:

> >> Using netlink messages to notify drivers for these ASICs really
> >> seems like a great way to handle things.  It would obviously require
> >> some expansion of netlink, but that seems fine.
> >> 
> >> I would prefer that ASIC vendors write initial drivers for their
> >> ASICs such that each physical port is detected and exported as a
> >> netdev.  This would mean each *minimal* kernel driver for an ASIC
> >> would need to have support for the following (off the top of my
> >> head):
> >> 
> >> - detect link status on an interface
> >> - set an interface's MAC address
> >> - configure the chip to send all frames to the CPU
> >> - register a napi handler for the interfaces (depending on
> >> packet-buffering capabilities in the hardware)
> >> 
> >> As support for new hardware capabilities are moved from switch
> >> vendor SDKs to their kernel driver the driver can begin to listen
> >> for netlink messages that:
> >> 
> >> - setup bonds/teams
> >> - add ports to bridge groups
> >> - configure port-based or mac-based VLANs
> >> - add unicast and multicast entries
> >> - add and remove entries from a flow table
> >> - ...
> >> 
> >> Maybe this all seems to matter-of-fact and the discussion has
> >> evolved well beyond something this high-level, but there still seems
> >> to be significant discussion about whether or not the ASIC should be
> >> exported as a netdev and I'm just not seeing a compelling reason.
> >> This was my attempt to explain why.  :)
> > 
> > Andy and I discussed this off-line, so I am admittedly partial to
> > the conclusions we shared as reflected above... :-)
> > 
> > While I might be convinced that there should be _something_ to
> > represent the switch chip for some purpose (e.g. topology mapping),
> > I'm not at all convinced that thing should be a netdev.  I don't see
> > where the switch chip by itself looks much like any other netdev at
> > all, especially once you model the actual front-panel ports with
> > their own netdevs.  I do know that having an extra "magic netdev"
> > in the wireless space added a lot of confusion for no clear gain,
> > leading to it later being abolished.
> > 
> > Modeling at the switch level might make more sense from a flow
> > management perspective?  But if those flows are managed using a netlink
> > protocol, does it matter what sort of entity is listening and acting
> > on those messages?  If a switch-specific interface is needed for that,
> > we should build it rather than pretending it looks like a netdev.
> > I also think that throwing the DSA switches in with flow-based and
> > "Enterprise" switches may just be confusing things.
> > 
> > I think that the opening bid should be a minimal hardware driver that
> > models each front-panel port with a netdev and passes all traffic
> > to/from the CPU.  Intelligence beyond that should be added on a
> > 'can-do' basis, with individual drivers (or corresponding userland
> > components) listening to existing netlink traffic and implementing
> > support for existing protocols to the best of their abilities.
> > Missing functionality in the netlink protocols or other functions
> > (e.g. bonding, bridging, etc) can be evolved over time as we discover
> > missing bits required for switch acceleration.
> 
> I agree completely with your/Andy’s view.  It’s the switch port,
> not the switch, that needs to be modeled as a netdev.  The switch port
> is the abstraction that allows other existing virtual devices (bridges,
> bond, vxlans, etc) to cuddle against.  Is a switch port a special
> netdev in some way?  At a high level, not really.  I mean in a sense
> it’s just eth48 on a super NIC.  OK, there may be some advantage
> to setting an IFF_SWITCH_PORT on the switch port netdev, so cuddling
> netdevs could get a hint that their data plane might be offloaded.

Some sort of "I'm a switch port!" flag or an ndo_whatever might make
sense in the long run.  But, I'm not sure it is the kind of thing
that needs to be modeled right now...?  It seems more important to
get something modeled that we can build upon without having to solve
every problem up front.
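
To make the flag idea concrete, here is a minimal userspace sketch of how
an upper device might read such a hint.  Everything in it is hypothetical:
IFF_SWITCH_PORT has no assigned value anywhere, struct port_dev is only a
stand-in for a netdev, and requiring every lower port to carry the flag is
just one possible policy.

```c
#include <stddef.h>

/* Hypothetical bit; no such flag value has been defined. */
#define IFF_SWITCH_PORT (1u << 0)

/* Stand-in for a net_device: only the flags matter for this sketch. */
struct port_dev {
	const char *name;
	unsigned int priv_flags;
};

/*
 * Return 1 if every lower device is marked as a switch port, so the
 * upper device (bridge, bond, ...) may assume its data plane could be
 * offloaded.  Return 0 otherwise, including for an empty set.
 */
int upper_may_offload(const struct port_dev *lowers, size_t n)
{
	size_t i;

	if (n == 0)
		return 0;

	for (i = 0; i < n; i++)
		if (!(lowers[i].priv_flags & IFF_SWITCH_PORT))
			return 0;

	return 1;
}
```

A bridge stacked on sw1p1 and sw1p2 would get the hint; one that also
enslaves an ordinary eth0 would not.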

> I’ve been back-and-forth on the switch netdev.  Today I’m
> not for it.  But I’m still searching for a reason.  At one point
> I thought a switch netdev would be nice in a L3 router case where
> we needed a router IP address to do things like OSPF unnumbered
> interfaces, but even in that case, we can just put the router IP
> on lo.  Another reason would be to use the switch netdev as a place
> for switch-wide settings and status.  For example,
> ethtool -S stats on switch netdev would show switch-wide stats like
> ACL drops or something like that.  Maybe a switch device is modeled as
> a new device class?  I guess it comes down to how much is duplicated
> between different vendors' switch driver implementations.

I've seen the 'ethtool -S' example before and I guess it is valid.
Still, is it worth the confusion of having a mostly useless/unique
netdev just to reuse an ethtool ioctl?  Maybe, I guess...?

The example of having a netdev that represents an L3 entity riding on
top of the L2 network provided by the switch seems somewhat reasonable.
It reminds me of what we did when I worked on FASTPATH, ages ago.
In some cases it probably makes some sense.  Still, I'm not sure it
provides any utility over just implementing a bridge on top of all
the switch port netdevs?

> Agree on the missing netlink functionality point, add it as we go.
> Outside the bonding stuff we recently added, we (Cumulus) find netlink
> pretty complete as-is to program modern, enterprise-class switch chips.

Cool!  I'm glad we agree.  Now we just need some switch hardware
drivers that fit the general model outlined above...

I would be happy to maintain a kernel.org git tree as a nursery for
such drivers as they develop and mature, and I'm sure my daytime
employer would be happy to support me on that.  I wonder if we can
get any switch people from Intel, Mellanox, Broadcom, or elsewhere
to play along?

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 19:29                                         ` John W. Linville
@ 2014-04-02 19:54                                           ` Scott Feldman
  2014-04-02 20:06                                             ` John W. Linville
  2014-04-02 20:04                                           ` Stephen Hemminger
  1 sibling, 1 reply; 125+ messages in thread
From: Scott Feldman @ 2014-04-02 19:54 UTC (permalink / raw)
  To: John W. Linville
  Cc: Andy Gospodarek, Jiri Pirko, Roopa Prabhu, Jamal Hadi Salim,
	Florian Fainelli, Neil Horman, Thomas Graf, netdev, David Miller,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee


On Apr 2, 2014, at 12:29 PM, John W. Linville <linville@tuxdriver.com> wrote:

> Cool!  I'm glad we agree.  Now we just need some switch hardware
> drivers that fit the general model outlined above...

Why wait?  Let’s create a switch device in qemu and then write the model/sample driver to that.  Put a PCI front end on the qemu device which is mapped to kernel, and define a register set to represent all the switch-like ops we want to offload, in a generic way.  Throw in some DMA for CPU-bound I/O (ctrl traffic).  On the qemu device back end, expose the ports as taps or whatever so we can wire to real-world link partners on the host side.

> I would be happy to maintain a kernel.org git tree as a nursery for
> such drivers as they develop and mature, and I'm sure my daytime
> employer would be happy to support me on that.  I wonder if we can
> get any switch people from Intel, Mellanox, Broadcom, or elsewhere
> to play along?

My gut tells me this is a build-it-and-they-will-come situation.

-scott

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 19:29                                         ` John W. Linville
  2014-04-02 19:54                                           ` Scott Feldman
@ 2014-04-02 20:04                                           ` Stephen Hemminger
  2014-04-02 20:23                                             ` Jiri Pirko
  1 sibling, 1 reply; 125+ messages in thread
From: Stephen Hemminger @ 2014-04-02 20:04 UTC (permalink / raw)
  To: John W. Linville
  Cc: Scott Feldman, Andy Gospodarek, Jiri Pirko, Roopa Prabhu,
	Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

On Wed, 2 Apr 2014 15:29:15 -0400
"John W. Linville" <linville@tuxdriver.com> wrote:

> I've seen the 'ethtool -S' example before and I guess it is valid.
> Still, is it worth the confusion of having a mostly useless/unique
> netdev just to reuse an ethtool ioctl?  Maybe, I guess...?

ethtool is actually the most worthless part of the API.
It can't be monitored, is ioctl based, and the statistics are device
dependent, making them useless for monitoring applications.

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 19:54                                           ` Scott Feldman
@ 2014-04-02 20:06                                             ` John W. Linville
  0 siblings, 0 replies; 125+ messages in thread
From: John W. Linville @ 2014-04-02 20:06 UTC (permalink / raw)
  To: Scott Feldman
  Cc: Andy Gospodarek, Jiri Pirko, Roopa Prabhu, Jamal Hadi Salim,
	Florian Fainelli, Neil Horman, Thomas Graf, netdev, David Miller,
	dborkman, ogerlitz, jesse, pshelar, azhou, Ben Hutchings,
	Stephen Hemminger, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

On Wed, Apr 02, 2014 at 12:54:50PM -0700, Scott Feldman wrote:
> 
> On Apr 2, 2014, at 12:29 PM, John W. Linville <linville@tuxdriver.com> wrote:
> 
> > Cool!  I'm glad we agree.  Now we just need some switch hardware
> > drivers that fit the general model outlined above...
> 
> Why wait?  Let’s create a switch device in qemu and then write the
> model/sample driver to that.  Put a PCI front end on the qemu device
> which is mapped to kernel, and define a register set to represent all
> the switch-like ops we want to offload, in a generic way.  Throw in
> some DMA for CPU-bound I/O (ctrl traffic).  On the qemu device back
> end, expose the ports as taps or whatever so we can wire to real-world
> link partners on the host side.

Not a bad idea at all.  This probably needs further discussion and/or
a spec...and a QEMU hacker.  Where is my friend PJ when I need him? ;-)

> > I would be happy to maintain a kernel.org git tree as a nursery for
> > such drivers as they develop and mature, and I'm sure my daytime
> > employer would be happy to support me on that.  I wonder if we can
> > get any switch people from Intel, Mellanox, Broadcom, or elsewhere
> > to play along?
> 
> My gut tells me this is a build-it-and-they-will-come situation.

No doubt -- in the meantime, feel free to Cc me on some patches!

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 20:04                                           ` Stephen Hemminger
@ 2014-04-02 20:23                                             ` Jiri Pirko
  2014-04-02 20:38                                               ` John W. Linville
  0 siblings, 1 reply; 125+ messages in thread
From: Jiri Pirko @ 2014-04-02 20:23 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: John W. Linville, Scott Feldman, Andy Gospodarek, Roopa Prabhu,
	Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

Wed, Apr 02, 2014 at 10:04:37PM CEST, stephen@networkplumber.org wrote:
>On Wed, 2 Apr 2014 15:29:15 -0400
>"John W. Linville" <linville@tuxdriver.com> wrote:
>
>> I've seen the 'ethtool -S' example before and I guess it is valid.
>> Still, is it worth the confusion of having a mostly useless/unique
>> netdev just to reuse an ethtool ioctl?  Maybe, I guess...?
>
>ethtool is actually the most worthless part of the API.
>It can't be monitored, is ioctl based, and the statistics are device
>dependent, making them useless for monitoring applications.

Has anyone actually been thinking about converting ethtool functionality
to netlink as well? I did some time ago. Most of it should be easy (more or less)
to do I believe.

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 20:23                                             ` Jiri Pirko
@ 2014-04-02 20:38                                               ` John W. Linville
  2014-04-02 21:36                                                 ` Thomas Graf
  0 siblings, 1 reply; 125+ messages in thread
From: John W. Linville @ 2014-04-02 20:38 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Stephen Hemminger, Scott Feldman, Andy Gospodarek, Roopa Prabhu,
	Jamal Hadi Salim, Florian Fainelli, Neil Horman, Thomas Graf,
	netdev, David Miller, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

On Wed, Apr 02, 2014 at 10:23:18PM +0200, Jiri Pirko wrote:
> Wed, Apr 02, 2014 at 10:04:37PM CEST, stephen@networkplumber.org wrote:
> >On Wed, 2 Apr 2014 15:29:15 -0400
> >"John W. Linville" <linville@tuxdriver.com> wrote:
> >
> >> I've seen the 'ethtool -S' example before and I guess it is valid.
> >> Still, is it worth the confusion of having a mostly useless/unique
> >> netdev just to reuse an ethtool ioctl?  Maybe, I guess...?
> >
> >ethtool is actually the most worthless part of the API.
> >It can't be monitored, is ioctl based, and the statistics are device
> >dependent, making them useless for monitoring applications.
> 
> Has anyone actually been thinking about converting ethtool functionality
> to netlink as well? I did some time ago. Most of it should be easy (more or less)
> to do I believe.

Seems like a good idea...but only if you promise to have it primarily
accessed by a tool with a 2-letter name! :-)

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 20:38                                               ` John W. Linville
@ 2014-04-02 21:36                                                 ` Thomas Graf
  0 siblings, 0 replies; 125+ messages in thread
From: Thomas Graf @ 2014-04-02 21:36 UTC (permalink / raw)
  To: John W. Linville
  Cc: Jiri Pirko, Stephen Hemminger, Scott Feldman, Andy Gospodarek,
	Roopa Prabhu, Jamal Hadi Salim, Florian Fainelli, Neil Horman,
	netdev, David Miller, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, jeffrey.t.kirsher, vyasevic, Cong Wang,
	John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

On 04/02/14 at 04:38pm, John W. Linville wrote:
> On Wed, Apr 02, 2014 at 10:23:18PM +0200, Jiri Pirko wrote:
> > Wed, Apr 02, 2014 at 10:04:37PM CEST, stephen@networkplumber.org wrote:
> > >On Wed, 2 Apr 2014 15:29:15 -0400
> > >"John W. Linville" <linville@tuxdriver.com> wrote:
> > >
> > >> I've seen the 'ethtool -S' example before and I guess it is valid.
> > >> Still, is it worth the confusion of having a mostly useless/unique
> > >> netdev just to reuse an ethtool ioctl?  Maybe, I guess...?
> > >
> > >ethtool is actually the most worthless part of the API.
> > >It can't be monitored, is ioctl based, and the statistics are device
> > >dependent, making them useless for monitoring applications.
> > 
> > Has anyone actually been thinking about converting ethtool functionality
> > to netlink as well? I did some time ago. Most of it should be easy (more or less)
> > to do I believe.
> 
> Seems like a good idea...but only if you promise to have it primarily
> accessed by a tool with a 2-letter name! :-)

I have a semi-complete patch rotting on my system ;-) Attaching
below if somebody wants to continue working on it. The
challenging bit is that Netlink typically guarantees atomic
operations per request: either all or none of the requested
changes are applied. This is not compatible with the
ethtool_ops as implemented by the drivers, where each driver
verifies the input data per setting.
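
One way to get the all-or-nothing behaviour would be to split every
setting into a validate step and a commit step, and to commit only after
all validators have passed.  The following is a userspace sketch under
that assumption, not what the patch below does.  struct link_settings,
struct setting_op and the validation rules are made up for illustration
and are not the kernel's ethtool_cmd or ethtool_ops.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified settings; not struct ethtool_cmd. */
struct link_settings {
	unsigned int speed;	/* Mb/s */
	int full_duplex;	/* 0 or 1 */
};

/* One setting: a side-effect-free check plus the function applying it. */
struct setting_op {
	int (*validate)(const struct link_settings *req);
	void (*commit)(struct link_settings *cur,
		       const struct link_settings *req);
};

static int validate_speed(const struct link_settings *req)
{
	switch (req->speed) {
	case 10:
	case 100:
	case 1000:
	case 10000:
		return 0;
	default:
		return -EINVAL;
	}
}

static void commit_speed(struct link_settings *cur,
			 const struct link_settings *req)
{
	cur->speed = req->speed;
}

static int validate_duplex(const struct link_settings *req)
{
	/* Illustrative rule: reject half duplex at 1000 Mb/s and above. */
	if (!req->full_duplex && req->speed >= 1000)
		return -EINVAL;
	return 0;
}

static void commit_duplex(struct link_settings *cur,
			  const struct link_settings *req)
{
	cur->full_duplex = req->full_duplex;
}

static const struct setting_op setting_ops[] = {
	{ validate_speed, commit_speed },
	{ validate_duplex, commit_duplex },
};

/* Apply every requested setting, or none of them if any check fails. */
int apply_settings_atomic(struct link_settings *cur,
			  const struct link_settings *req)
{
	size_t n = sizeof(setting_ops) / sizeof(setting_ops[0]);
	size_t i;
	int err;

	/* Phase 1: validate everything without touching current state. */
	for (i = 0; i < n; i++) {
		err = setting_ops[i].validate(req);
		if (err < 0)
			return err;
	}

	/* Phase 2: nothing can fail any more, so commit all settings. */
	for (i = 0; i < n; i++)
		setting_ops[i].commit(cur, req);

	return 0;
}
```

The catch is exactly the one described above: today's drivers check and
apply a setting in a single callback, so a split like this would need
driver changes or some rollback scheme.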


diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index c6a850a..e461775 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -1200,4 +1200,31 @@ enum ethtool_reset_flags {
 };
 #define ETH_RESET_SHARED_SHIFT	16
 
+enum {
+	IFLA_ETHTOOL_UNSPEC,
+	IFLA_ETHTOOL_SETTINGS,
+	IFLA_ETHTOOL_DRVINFO,
+	__IFLA_ETHTOOL_MAX,
+};
+
+#define IFLA_ETHTOOL_MAX (__IFLA_ETHTOOL_MAX - 1)
+
+enum {
+	IFLA_ET_DRVINFO_UNSPEC,
+	IFLA_ET_DRVINFO_NAME,
+	IFLA_ET_DRVINFO_VERSION,
+	IFLA_ET_DRVINFO_FW_VERSION,
+	IFLA_ET_DRVINFO_BUS_INFO,
+	__IFLA_ET_DRVINFO_MAX,
+};
+
+#define IFLA_ET_DRVINFO_MAX (__IFLA_ET_DRVINFO_MAX - 1)
+
+#ifdef __KERNEL__
+
+extern size_t ethtool_nlattr_size(const struct net_device *);
+extern int ethtool_fill_nlattr(struct sk_buff *, struct net_device *);
+
+#endif /* __KERNEL__ */
+
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index 0ee969a..f3ad6dd 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -137,6 +137,7 @@ enum {
 	IFLA_AF_SPEC,
 	IFLA_GROUP,		/* Group the device belongs to */
 	IFLA_NET_NS_FD,
+	IFLA_ETHTOOL,
 	__IFLA_MAX
 };
 
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index fd14116..088e4e9 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/rtnetlink.h>
 #include <linux/sched.h>
+#include <net/netlink.h>
 
 /*
  * Some useful ethtool_ops methods that're device independent.
@@ -2164,3 +2165,137 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 
 	return rc;
 }
+
+/**************************************************************************
+ *
+ * Ethtool Netlink Implementation
+ *
+ **************************************************************************/
+
+static size_t settings_size(const struct net_device *dev)
+{
+	if (dev->ethtool_ops && dev->ethtool_ops->get_settings)
+		return nla_total_size(sizeof(struct ethtool_cmd));
+	else
+		return 0;
+}
+
+static int settings_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	struct ethtool_cmd cmd = { .cmd = ETHTOOL_GSET };
+	int err;
+
+	if (!dev->ethtool_ops || !dev->ethtool_ops->get_settings)
+		return 0;
+
+	err = dev->ethtool_ops->get_settings(dev, &cmd);
+	if (err < 0)
+		return err;
+
+	return nla_put(skb, IFLA_ETHTOOL_SETTINGS, sizeof(cmd), &cmd);
+}
+
+static size_t drvinfo_size(const struct net_device *dev)
+{
+	if ((dev->ethtool_ops && dev->ethtool_ops->get_drvinfo) ||
+	    (dev->dev.parent && dev->dev.parent->driver)) {
+		size_t size;
+
+		size = nla_total_size(32);	/* IFLA_ET_DRVINFO_NAME */
+		size += nla_total_size(32);	/* IFLA_ET_DRVINFO_VERSION */
+		/* IFLA_ET_DRVINFO_FW_VERSION */
+		size += nla_total_size(ETHTOOL_FWVERS_LEN);
+		/* IFLA_ET_DRVINFO_BUS_INFO */
+		size += nla_total_size(ETHTOOL_BUSINFO_LEN);
+
+		return nla_total_size(size); /* IFLA_ETHTOOL_DRVINFO */
+	}
+
+	return 0;
+}
+
+static int drvinfo_fill(struct sk_buff *skb, struct net_device *dev)
+{
+	const struct ethtool_ops *ops = dev->ethtool_ops;
+	/* Zero-initialize so strings the driver leaves unset read as empty. */
+	struct ethtool_drvinfo info = { .cmd = ETHTOOL_GDRVINFO };
+	struct nlattr *attr;
+
+	if (ops && ops->get_drvinfo) {
+		ops->get_drvinfo(dev, &info);
+	} else if (dev->dev.parent && dev->dev.parent->driver) {
+		strlcpy(info.bus_info, dev_name(dev->dev.parent),
+			sizeof(info.bus_info));
+		strlcpy(info.driver, dev->dev.parent->driver->name,
+			sizeof(info.driver));
+	} else
+		return 0;
+
+	if (!(attr = nla_nest_start(skb, IFLA_ETHTOOL_DRVINFO)))
+		return -EMSGSIZE;
+
+	if (info.driver[0])
+		NLA_PUT_STRING(skb, IFLA_ET_DRVINFO_NAME, info.driver);
+
+	if (info.version[0])
+		NLA_PUT_STRING(skb, IFLA_ET_DRVINFO_VERSION, info.version);
+
+	if (info.fw_version[0])
+		NLA_PUT_STRING(skb, IFLA_ET_DRVINFO_FW_VERSION,
+			       info.fw_version);
+
+	if (info.bus_info[0])
+		NLA_PUT_STRING(skb, IFLA_ET_DRVINFO_BUS_INFO, info.bus_info);
+
+	nla_nest_end(skb, attr);
+	return 0;
+
+nla_put_failure:
+	nla_nest_cancel(skb, attr);
+	return -EMSGSIZE;
+}
+
+static const struct {
+	int (*fill)(struct sk_buff *, struct net_device *);
+	size_t (*size)(const struct net_device *);
+} nl_ops[IFLA_ETHTOOL_MAX+1] = {
+	[IFLA_ETHTOOL_SETTINGS] = { .fill	= settings_fill,
+				    .size	= settings_size, },
+	[IFLA_ETHTOOL_DRVINFO] = {  .fill	= drvinfo_fill,
+				    .size	= drvinfo_size, },
+};
+
+size_t ethtool_nlattr_size(const struct net_device *dev)
+{
+	size_t size = 0;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(nl_ops); i++)
+		if (nl_ops[i].size)
+			size += nl_ops[i].size(dev);
+
+	/* IFLA_ETHTOOL */
+	return nla_total_size(size);
+}
+
+int ethtool_fill_nlattr(struct sk_buff *skb, struct net_device *dev)
+{
+	struct nlattr *attr;
+	int err, i;
+
+	if (!(attr = nla_nest_start(skb, IFLA_ETHTOOL)))
+		return -EMSGSIZE;
+
+	for (i = 0; i < ARRAY_SIZE(nl_ops); i++) {
+		if (nl_ops[i].fill)
+			if ((err = nl_ops[i].fill(skb, dev)) < 0)
+				goto nla_put_failure;
+	}
+
+	nla_nest_end(skb, attr);
+	return 0;
+
+nla_put_failure:
+	nla_nest_cancel(skb, attr);
+	return err;
+}
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index abd936d..684a0e1 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -42,6 +42,7 @@
 
 #include <linux/inet.h>
 #include <linux/netdevice.h>
+#include <linux/ethtool.h>
 #include <net/ip.h>
 #include <net/protocol.h>
 #include <net/arp.h>
@@ -761,7 +762,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev)
 	       + rtnl_vfinfo_size(dev) /* IFLA_VFINFO_LIST */
 	       + rtnl_port_size(dev) /* IFLA_VF_PORTS + IFLA_PORT_SELF */
 	       + rtnl_link_get_size(dev) /* IFLA_LINKINFO */
-	       + rtnl_link_get_af_size(dev); /* IFLA_AF_SPEC */
+	       + rtnl_link_get_af_size(dev) /* IFLA_AF_SPEC */
+	       + ethtool_nlattr_size(dev); /* IFLA_ETHTOOL */
 }
 
 static int rtnl_vf_ports_fill(struct sk_buff *skb, struct net_device *dev)
@@ -989,6 +991,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 
 	nla_nest_end(skb, af_spec);
 
+	if (ethtool_fill_nlattr(skb, dev) < 0)
+		goto nla_put_failure;
+
 	return nlmsg_end(skb, nlh);
 
 nla_put_failure:
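
For anyone checking the size arithmetic in drvinfo_size() above: the
kernel's nla_total_size() adds the 4-byte attribute header to the payload
length and pads the result to a 4-byte boundary.  The sketch below
replicates that in userspace so the numbers can be tried out.  The sk_
names are made up for this example, and the two length parameters stand in
for ETHTOOL_FWVERS_LEN and ETHTOOL_BUSINFO_LEN.

```c
#include <stddef.h>

/* Userspace replicas of the netlink attribute alignment rules. */
#define SK_NLA_ALIGNTO	((size_t)4)
#define SK_NLA_ALIGN(len) \
	(((len) + SK_NLA_ALIGNTO - 1) & ~(SK_NLA_ALIGNTO - 1))
#define SK_NLA_HDRLEN	SK_NLA_ALIGN((size_t)4)	/* sizeof(struct nlattr) */

/* Header plus payload, padded to the next 4-byte boundary. */
size_t sk_nla_total_size(size_t payload)
{
	return SK_NLA_ALIGN(SK_NLA_HDRLEN + payload);
}

/*
 * Mirrors the layout reserved by drvinfo_size(): four string attributes
 * wrapped in one nested attribute.
 */
size_t sk_drvinfo_size(size_t fw_len, size_t bus_len)
{
	size_t size;

	size = sk_nla_total_size(32);		/* driver name */
	size += sk_nla_total_size(32);		/* driver version */
	size += sk_nla_total_size(fw_len);	/* firmware version */
	size += sk_nla_total_size(bus_len);	/* bus info */

	return sk_nla_total_size(size);		/* enclosing nest */
}
```

With both lengths at 32, each string attribute reserves 36 bytes and the
nest comes to 148 bytes.  This is a worst-case reservation, since the fill
side only emits the strings that are non-empty.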

* Re: [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath
  2014-04-02 16:47                                         ` Florian Fainelli
@ 2014-04-02 21:52                                           ` Thomas Graf
  0 siblings, 0 replies; 125+ messages in thread
From: Thomas Graf @ 2014-04-02 21:52 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Scott Feldman, John W. Linville, Andy Gospodarek, Jiri Pirko,
	Roopa Prabhu, Jamal Hadi Salim, Neil Horman, netdev,
	David Miller, dborkman, ogerlitz, jesse, pshelar, azhou,
	Ben Hutchings, Stephen Hemminger, jeffrey.t.kirsher, vyasevic,
	Cong Wang, John Fastabend, Eric Dumazet, Lennert Buytenhek,
	Shrijeet Mukherjee

On 04/02/14 at 09:47am, Florian Fainelli wrote:
> 2014-04-02 9:15 GMT-07:00 Scott Feldman <sfeldma@cumulusnetworks.com>:
> > On Apr 2, 2014, at 8:25 AM, John W. Linville <linville@tuxdriver.com> wrote:
> >> On Wed, Apr 02, 2014 at 10:32:49AM -0400, Andy Gospodarek wrote:
> >>> Maybe this all seems too matter-of-fact and the discussion has
> >>> evolved well beyond something this high-level, but there still seems
> >>> to be significant discussion about whether or not the ASIC should be
> >>> exported as a netdev and I'm just not seeing a compelling reason.
> >>> This was my attempt to explain why.  :)
> >>
> >> Andy and I discussed this off-line, so I am admittedly partial to
> >> the conclusions we shared as reflected above... :-)
> >>
> >> While I might be convinced that there should be _something_ to
> >> represent the switch chip for some purpose (e.g. topology mapping),
> >> I'm not at all convinced that thing should be a netdev.  I don't see
> >> where the switch chip by itself looks much like any other netdev at
> >> all, especially once you model the actual front-panel ports with
> >> their own netdevs.  I do know that having an extra "magic netdev"
> >> in the wireless space added a lot of confusion for no clear gain,
> >> leading to it later being abolished.
> >>
> >> Modeling at the switch level might make more sense from a flow
> >> management perspective?  But if those flows are managed using a netlink
> >> protocol, does it matter what sort of entity is listening and acting
> >> on those messages?  If a switch-specific interface is needed for that,
> >> we should build it rather than pretending it looks like a netdev.
> >> I also think that throwing the DSA switches in with flow-based and
> >> "Enterprise" switches may just be confusing things.
> >>
> >> I think that the opening bid should be a minimal hardware driver that
> >> models each front-panel port with a netdev and passes all traffic
> >> to/from the CPU.  Intelligence beyond that should be added on a
> >> 'can-do' basis, with individual drivers (or corresponding userland
> >> components) listening to existing netlink traffic and implementing
> >> support for existing protocols to the best of their abilities.
> >> Missing functionality in the netlink protocols or other functions
> >> (e.g. bonding, bridging, etc) can be evolved over time as we discover
> >> missing bits required for switch acceleration.
> >
> > I agree completely with your/Andy’s view.  It’s the switch port, not the switch, that needs to be modeled as a netdev.  The switch port is the abstraction that allows other existing virtual devices (bridges, bond, vxlans, etc) to cuddle against.  Is a switch port a special netdev in some way?  At a high level, not really.  I mean in a sense it’s just eth48 on a super NIC.  OK, there may be some advantage to setting an IFF_SWITCH_PORT on the switch port netdev, so cuddling netdevs could get a hint that their data plane might be offloaded.
> >
> > I’ve been back-and-forth on the switch netdev.  Today I’m not for it.  But I’m still searching for a reason.  At one point I thought a switch netdev would be nice in a L3 router case where we needed a router IP address to do things like OSPF unnumbered interfaces, but even in that case, we can just put the router IP on lo.  Another reason would be to use the switch netdev as a place for switch-wide settings and status.  For example,
> > ethtool -S stats on switch netdev would show switch-wide stats like ACL drops or something like that.  Maybe a switch device is modeled as a new device class?  I guess it comes down to how much is duplicated between different vendors' switch driver implementations.
> 
> I think the idea behind exposing a switch net_device is to account for
> all special cases where there is not already an existing and
> well-defined model for switch-wide events/control/information that we
> might want to have. Why a net_device, because the switch ports will
> already be exposed as such, so mostly for consistency with the
> presented user-space interface. Whether that net_device exposes
> different child devices of different classes, e.g: MTD partitions to
> access firmware updates, SPI master/slave controller(s), MDIO
> controller(s), is yet to be defined I suppose.

Having a master net_device seemed logical to me at first just
like it always made sense to me to have software bridges be
represented by a net_device. I agree with a lot of the concerns
though.

I see the following uses for a master net_device:
 - represent slave/master relationship and provide IFF_UP control
 - expose non port specific statistics
 - flow configuration
 - tunnel configuration
 - allow creation of virtual ports that are not backed by HW

I want to expand on the last point a bit. I specifically did not
mention IP configuration above, which is what the bridge master is
frequently used for. I absolutely like the OVS model where multiple
internal ports can be created which hook into the network stack
and can thus be assigned IPs. The model allows for separate internal
ports to be configured as different VLAN access ports for example.
They also provide multiple AF_PACKET rx handlers, etc.

 sw1p1 -+
 sw1p2 -+       +-sw1int0 (ip=30.0.0.1) -> netif_rx()
 sw1p3 -+- sw1 -+-sw1int1 (vlan=10, ip=10.0.0.1) -> netif_rx()
 sw1p4 -+       +-sw1vxlan0 (remote_ip=20.0.0.2)

If supported by the chip, flows can be set up automatically to
feed these virtual ports and set up encapsulation. Others will
require software fallback. Some will not support it at all.
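
The three outcomes (offloaded, software fallback, not supported) could be
decided per virtual port roughly as sketched here.  This is only an
illustration: the capability bits, struct chip_caps and the rule that
fallback needs a path to the CPU are all assumptions, not part of any
existing API.

```c
/* Hypothetical capability bits for a switch chip. */
#define CHIP_CAP_VLAN_ACCESS	(1u << 0)
#define CHIP_CAP_VXLAN_ENCAP	(1u << 1)

enum vport_type { VPORT_VLAN_ACCESS, VPORT_VXLAN };

enum vport_path {
	VPORT_PATH_HW,		/* flows installed in the chip */
	VPORT_PATH_SW,		/* frames punted to the CPU */
	VPORT_PATH_NONE,	/* cannot be supported at all */
};

struct chip_caps {
	unsigned int hw_features;	/* CHIP_CAP_* handled in hardware */
	int has_cpu_port;		/* can frames reach the CPU? */
};

/* Map a virtual port type to the hardware feature it would need. */
static unsigned int vport_required_cap(enum vport_type type)
{
	switch (type) {
	case VPORT_VLAN_ACCESS:
		return CHIP_CAP_VLAN_ACCESS;
	case VPORT_VXLAN:
		return CHIP_CAP_VXLAN_ENCAP;
	}
	return 0;
}

/* Prefer hardware, fall back to the CPU, otherwise give up. */
enum vport_path vport_choose_path(const struct chip_caps *caps,
				  enum vport_type type)
{
	unsigned int need = vport_required_cap(type);

	if (need && (caps->hw_features & need))
		return VPORT_PATH_HW;
	if (caps->has_cpu_port)
		return VPORT_PATH_SW;
	return VPORT_PATH_NONE;
}
```

In the picture above, sw1vxlan0 on a chip without VXLAN support would end
up on the software path as long as the chip can deliver frames to the CPU.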

Thread overview: 125+ messages
2014-03-19 15:33 [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jiri Pirko
2014-03-19 15:33 ` [patch net-next RFC 1/4] openvswitch: split flow structures into ovs specific and generic ones Jiri Pirko
2014-03-20 13:04   ` Thomas Graf
2014-03-19 15:33 ` [patch net-next RFC 2/4] net: introduce switchdev API Jiri Pirko
2014-03-20 13:59   ` Thomas Graf
2014-03-20 14:18     ` Jiri Pirko
2014-03-20 14:43   ` Nikolay Aleksandrov
2014-03-20 15:42     ` Jiri Pirko
2014-03-19 15:33 ` [patch net-next RFC 3/4] openvswitch: Introduce support for switchdev based datapath Jiri Pirko
2014-03-19 15:33 ` [patch net-next RFC 4/4] net: introduce dummy switch Jiri Pirko
2014-03-20 11:49 ` [patch net-next RFC 0/4] introduce infrastructure for support of switch chip datapath Jamal Hadi Salim
2014-03-20 12:40   ` Jiri Pirko
2014-03-20 17:21     ` Florian Fainelli
2014-03-21 12:04       ` Jamal Hadi Salim
2014-03-22  9:48         ` Jiri Pirko
2014-03-24 23:07           ` Jamal Hadi Salim
2014-03-25 17:39             ` Neil Horman
2014-03-25 18:00               ` Thomas Graf
2014-03-25 19:35                 ` Neil Horman
2014-03-25 20:11                   ` Florian Fainelli
2014-03-25 20:31                     ` Neil Horman
2014-03-25 21:22                       ` Jamal Hadi Salim
2014-03-25 21:26                     ` Thomas Graf
2014-03-25 21:42                       ` Florian Fainelli
2014-03-25 21:54                         ` Thomas Graf
2014-03-26 10:55                           ` Neil Horman
2014-03-26  5:37                     ` Roopa Prabhu
2014-03-26 10:54                       ` Jamal Hadi Salim
2014-03-26 15:31                         ` John W. Linville
2014-03-26 16:54                         ` Roopa Prabhu
2014-03-26 16:59                           ` Jiri Pirko
2014-03-26 17:29                             ` Florian Fainelli
2014-03-26 17:35                               ` Jiri Pirko
2014-03-26 17:58                                 ` Florian Fainelli
2014-03-26 18:14                                   ` Jiri Pirko
2014-03-26 18:29                                     ` Hannes Frederic Sowa
2014-03-26 18:30                                     ` Florian Fainelli
2014-03-26 21:51                                     ` Jamal Hadi Salim
2014-03-26 22:22                                       ` Florian Fainelli
2014-03-26 22:53                                         ` Jamal Hadi Salim
2014-03-26 23:16                                           ` Florian Fainelli
2014-03-27  6:56                                         ` Jiri Pirko
2014-03-27 10:39                                           ` Jamal Hadi Salim
2014-03-27 10:50                                             ` Jiri Pirko
2014-03-27 11:12                                               ` Jamal Hadi Salim
2014-03-27 11:16                                                 ` Jiri Pirko
2014-03-27 14:10                                           ` Sergey Ryazanov
2014-03-27 16:41                                             ` Florian Fainelli
2014-03-27 16:57                                               ` Jiri Pirko
2014-03-27 16:59                                               ` Thomas Graf
2014-03-27 20:32                                               ` Sergey Ryazanov
2014-03-27 21:20                                                 ` Florian Fainelli
2014-03-27 21:55                                                   ` Jamal Hadi Salim
2014-03-28  6:28                                                   ` Jiri Pirko
2014-03-30 12:08                                                     ` Alon Harel
2014-03-27 21:41                                               ` Jamal Hadi Salim
2014-03-27 16:55                                             ` Jiri Pirko
2014-03-27 19:58                                               ` Sergey Ryazanov
2014-03-27 20:01                                                 ` Florian Fainelli
2014-03-27 20:04                                                   ` Sergey Ryazanov
2014-03-27 21:47                                                   ` Jamal Hadi Salim
2014-03-27 21:54                                                     ` Florian Fainelli
2014-03-27 21:59                                                       ` Jamal Hadi Salim
2014-03-27 22:19                                                         ` Florian Fainelli
2014-03-27 23:42                                                         ` Thomas Graf
2014-03-27 23:46                                                           ` Florian Fainelli
2014-03-26 17:57                               ` Roopa Prabhu
2014-03-26 18:09                                 ` Florian Fainelli
2014-03-27 13:46                                   ` John W. Linville
2014-03-26 17:47                             ` Roopa Prabhu
2014-03-26 18:03                               ` Jiri Pirko
2014-03-26 21:27                                 ` Roopa Prabhu
2014-03-26 21:31                                   ` Jiri Pirko
2014-03-27 15:35                                     ` Roopa Prabhu
2014-03-27 16:10                                       ` Jiri Pirko
2014-04-01 19:13                                 ` Scott Feldman
2014-04-02  6:41                                   ` Jiri Pirko
2014-04-02 15:37                                     ` Scott Feldman
2014-04-02 14:32                                   ` Andy Gospodarek
2014-04-02 15:25                                     ` John W. Linville
2014-04-02 16:15                                       ` Scott Feldman
2014-04-02 16:47                                         ` Florian Fainelli
2014-04-02 21:52                                           ` Thomas Graf
2014-04-02 19:29                                         ` John W. Linville
2014-04-02 19:54                                           ` Scott Feldman
2014-04-02 20:06                                             ` John W. Linville
2014-04-02 20:04                                           ` Stephen Hemminger
2014-04-02 20:23                                             ` Jiri Pirko
2014-04-02 20:38                                               ` John W. Linville
2014-04-02 21:36                                                 ` Thomas Graf
2014-03-25 20:56                   ` Jamal Hadi Salim
2014-03-25 21:19                     ` Thomas Graf
2014-03-25 21:24                       ` Jamal Hadi Salim
2014-03-26  7:21                       ` Jiri Pirko
2014-03-26 11:00                         ` Jamal Hadi Salim
2014-03-26 11:06                           ` Jamal Hadi Salim
2014-03-26 11:31                             ` Jamal Hadi Salim
2014-03-26 13:20                             ` Jiri Pirko
2014-03-26 13:23                               ` Jamal Hadi Salim
2014-03-26 13:17                           ` Jiri Pirko
2014-03-26 11:10                     ` Neil Horman
2014-03-26 11:29                       ` Thomas Graf
2014-03-26 12:58                         ` Jamal Hadi Salim
2014-03-26 15:22                         ` John W. Linville
2014-03-26 21:36                           ` Jamal Hadi Salim
2014-03-26 18:21                         ` Neil Horman
2014-03-26 19:11                           ` Florian Fainelli
2014-03-26 22:44                             ` Jamal Hadi Salim
2014-03-26 23:15                               ` Thomas Graf
2014-03-26 23:21                                 ` Florian Fainelli
2014-03-27 15:26                               ` Neil Horman
2014-03-27 21:33                                 ` Jamal Hadi Salim
2014-03-26 19:24                           ` Hannes Frederic Sowa
2014-03-27 13:43                           ` John W. Linville
2014-03-26 12:19                       ` Jamal Hadi Salim
2014-03-26 15:27                       ` John W. Linville
2014-03-25 18:33               ` Florian Fainelli
2014-03-25 19:40                 ` Neil Horman
2014-03-25 20:00                   ` Florian Fainelli
2014-03-25 21:39                     ` tgraf
2014-03-25 22:08                       ` Jamal Hadi Salim
2014-03-26  5:48                         ` Roopa Prabhu
2014-03-25 20:46               ` Jamal Hadi Salim
2014-03-26  7:24               ` Jiri Pirko
2014-03-22  9:40       ` Jiri Pirko
