* [net-next 00/10] Add Geneve
@ 2014-07-22 10:19 Andy Zhou
  2014-07-22 10:19 ` [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port Andy Zhou
                   ` (11 more replies)
  0 siblings, 12 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Andy Zhou

The following patches add initial support for the Geneve tunnel protocol:
1. Add a Geneve driver.
2. Add common UDP tunnel code to the UDP tunnel support functions.
3. Refactor the vxlan driver to make use of the UDP tunnel support.
4. Refactor Openvswitch in preparation for #5.
5. Add Geneve support to Openvswitch.

Note: Geneve offloads are not supported in this version. We plan to
post follow-on patches that implement them once we can verify them
with at least one working NIC that supports Geneve offloading.

Andy Zhou (5):
  net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
  udp: Expand UDP tunnel common APIs
  vxlan: Remove vxlan_get_rx_port()
  net: Refactor vxlan driver to make use of common UDP tunnel functions
  net: Add Geneve tunneling protocol driver

Jesse Gross (5):
  openvswitch: Eliminate memset() from flow_extract.
  openvswitch: Add support for matching on OAM packets.
  openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure.
  openvswitch: Factor out allocation and verification of actions.
  openvswitch: Add support for Geneve tunneling.

 drivers/net/ethernet/emulex/benet/be_main.c      |   17 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c      |   18 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |   19 +-
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |   19 +-
 drivers/net/vxlan.c                              |  273 ++++++----------------
 include/linux/netdevice.h                        |   35 +--
 include/net/geneve.h                             |   85 +++++++
 include/net/ip_tunnels.h                         |    2 +
 include/net/udp_tunnel.h                         |   57 +++++
 include/net/vxlan.h                              |   21 +-
 include/uapi/linux/openvswitch.h                 |    5 +-
 net/ipv4/Kconfig                                 |   14 ++
 net/ipv4/Makefile                                |    1 +
 net/ipv4/geneve.c                                |  273 ++++++++++++++++++++++
 net/ipv4/udp_tunnel.c                            |  257 +++++++++++++++++++-
 net/openvswitch/Kconfig                          |   11 +
 net/openvswitch/Makefile                         |    5 +
 net/openvswitch/actions.c                        |    6 +-
 net/openvswitch/datapath.c                       |   71 ++++--
 net/openvswitch/datapath.h                       |    3 +-
 net/openvswitch/flow.c                           |   62 ++++-
 net/openvswitch/flow.h                           |   41 +++-
 net/openvswitch/flow_netlink.c                   |  184 +++++++++++++--
 net/openvswitch/flow_netlink.h                   |    2 +-
 net/openvswitch/vport-geneve.c                   |  258 ++++++++++++++++++++
 net/openvswitch/vport-gre.c                      |   29 +--
 net/openvswitch/vport-vxlan.c                    |   34 +--
 net/openvswitch/vport.c                          |    8 +-
 net/openvswitch/vport.h                          |    3 +-
 29 files changed, 1457 insertions(+), 356 deletions(-)
 create mode 100644 include/net/geneve.h
 create mode 100644 net/ipv4/geneve.c
 create mode 100644 net/openvswitch/vport-geneve.c

-- 
1.7.9.5


* [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
  2014-07-22 10:49   ` Varka Bhadram
  2014-07-24  6:40   ` Or Gerlitz
  2014-07-22 10:19 ` [net-next 02/10] udp: Expand UDP tunnel common APIs Andy Zhou
                   ` (10 subsequent siblings)
  11 siblings, 2 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Andy Zhou

Rename the ndo_add_vxlan_port() API provided by net_device_ops to
ndo_add_udp_tunnel_port(). This generalizes the API in preparation for
upcoming NICs and device drivers that may support offloading more
UDP tunnel protocols besides VXLAN. There are no behavioral changes
with this patch.

Signed-off-by: Andy Zhou <azhou@nicira.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c      |   15 +++++++---
 drivers/net/ethernet/intel/i40e/i40e_main.c      |   16 ++++++++--
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |   17 ++++++++---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |   17 ++++++++---
 drivers/net/vxlan.c                              |   23 +++++++-------
 include/linux/netdevice.h                        |   35 ++++++++++++----------
 include/net/udp_tunnel.h                         |    2 ++
 7 files changed, 84 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 9c50814..028dc6e 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -24,6 +24,7 @@
 #include <linux/if_bridge.h>
 #include <net/busy_poll.h>
 #include <net/vxlan.h>
+#include <net/udp_tunnel.h>
 
 MODULE_VERSION(DRV_VER);
 MODULE_DEVICE_TABLE(pci, be_dev_ids);
@@ -4324,7 +4325,7 @@ static int be_ndo_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
 
 #ifdef CONFIG_BE2NET_VXLAN
 static void be_add_vxlan_port(struct net_device *netdev, sa_family_t sa_family,
-			      __be16 port)
+			      __be16 port, u8 udp_tunnel_type)
 {
 	struct be_adapter *adapter = netdev_priv(netdev);
 	struct device *dev = &adapter->pdev->dev;
@@ -4333,6 +4334,9 @@ static void be_add_vxlan_port(struct net_device *netdev, sa_family_t sa_family,
 	if (lancer_chip(adapter) || BEx_chip(adapter))
 		return;
 
+	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
+		return;
+
 	if (adapter->flags & BE_FLAGS_VXLAN_OFFLOADS) {
 		dev_warn(dev, "Cannot add UDP port %d for VxLAN offloads\n",
 			 be16_to_cpu(port));
@@ -4365,13 +4369,16 @@ err:
 }
 
 static void be_del_vxlan_port(struct net_device *netdev, sa_family_t sa_family,
-			      __be16 port)
+			      __be16 port, u8 udp_tunnel_type)
 {
 	struct be_adapter *adapter = netdev_priv(netdev);
 
 	if (lancer_chip(adapter) || BEx_chip(adapter))
 		return;
 
+	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
+		return;
+
 	if (adapter->vxlan_port != port)
 		return;
 
@@ -4408,8 +4415,8 @@ static const struct net_device_ops be_netdev_ops = {
 	.ndo_busy_poll		= be_busy_poll,
 #endif
 #ifdef CONFIG_BE2NET_VXLAN
-	.ndo_add_vxlan_port	= be_add_vxlan_port,
-	.ndo_del_vxlan_port	= be_del_vxlan_port,
+	.ndo_add_udp_tunnel_port	= be_add_vxlan_port,
+	.ndo_del_udp_tunnel_port	= be_del_vxlan_port,
 #endif
 };
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c34e390..d9fd53b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -29,6 +29,7 @@
 #include "i40e_diag.h"
 #ifdef CONFIG_I40E_VXLAN
 #include <net/vxlan.h>
+#include <net/udp_tunnel.h>
 #endif
 
 const char i40e_driver_name[] = "i40e";
@@ -6944,9 +6945,11 @@ static u8 i40e_get_vxlan_port_idx(struct i40e_pf *pf, __be16 port)
  * @netdev: This physical port's netdev
  * @sa_family: Socket Family that VXLAN is notifying us about
  * @port: New UDP port number that VXLAN started listening to
+ * @udp_tunnel_type: Only UDP_TUNNEL_TYPE_VXLAN will be processed.
  **/
 static void i40e_add_vxlan_port(struct net_device *netdev,
-				sa_family_t sa_family, __be16 port)
+				sa_family_t sa_family, __be16 port,
+				u8 udp_tunnel_type)
 {
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
 	struct i40e_vsi *vsi = np->vsi;
@@ -6957,6 +6960,9 @@ static void i40e_add_vxlan_port(struct net_device *netdev,
 	if (sa_family == AF_INET6)
 		return;
 
+	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
+		return;
+
 	idx = i40e_get_vxlan_port_idx(pf, port);
 
 	/* Check if port already exists */
@@ -6986,6 +6992,7 @@ static void i40e_add_vxlan_port(struct net_device *netdev,
  * @netdev: This physical port's netdev
  * @sa_family: Socket Family that VXLAN is notifying us about
  * @port: UDP port number that VXLAN stopped listening to
+ * @udp_tunnel_type: Only UDP_TUNNEL_TYPE_VXLAN will be processed.
  **/
 static void i40e_del_vxlan_port(struct net_device *netdev,
 				sa_family_t sa_family, __be16 port)
@@ -6998,6 +7005,9 @@ static void i40e_del_vxlan_port(struct net_device *netdev,
 	if (sa_family == AF_INET6)
 		return;
 
+	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
+		return;
+
 	idx = i40e_get_vxlan_port_idx(pf, port);
 
 	/* Check if port already exists */
@@ -7149,8 +7159,8 @@ static const struct net_device_ops i40e_netdev_ops = {
 	.ndo_set_vf_link_state	= i40e_ndo_set_vf_link_state,
 	.ndo_set_vf_spoofchk	= i40e_ndo_set_vf_spoofck,
 #ifdef CONFIG_I40E_VXLAN
-	.ndo_add_vxlan_port	= i40e_add_vxlan_port,
-	.ndo_del_vxlan_port	= i40e_del_vxlan_port,
+	.ndo_add_udp_tunnel_port	= i40e_add_vxlan_port,
+	.ndo_del_udp_tunnel_port	= i40e_del_vxlan_port,
 #endif
 	.ndo_get_phys_port_id	= i40e_get_phys_port_id,
 #ifdef HAVE_FDB_OPS
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 887cf01..d5f6b91 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -40,6 +40,7 @@
 #include <net/ip.h>
 #include <net/busy_poll.h>
 #include <net/vxlan.h>
+#include <net/udp_tunnel.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/device.h>
@@ -2326,7 +2327,8 @@ static void mlx4_en_del_vxlan_offloads(struct work_struct *work)
 }
 
 static void mlx4_en_add_vxlan_port(struct  net_device *dev,
-				   sa_family_t sa_family, __be16 port)
+				   sa_family_t sa_family, __be16 port,
+				   u8 udp_tunnel_type)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	__be16 current_port;
@@ -2337,6 +2339,9 @@ static void mlx4_en_add_vxlan_port(struct  net_device *dev,
 	if (sa_family == AF_INET6)
 		return;
 
+	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
+		return;
+
 	current_port = priv->vxlan_port;
 	if (current_port && current_port != port) {
 		en_warn(priv, "vxlan port %d configured, can't add port %d\n",
@@ -2349,7 +2354,8 @@ static void mlx4_en_add_vxlan_port(struct  net_device *dev,
 }
 
 static void mlx4_en_del_vxlan_port(struct  net_device *dev,
-				   sa_family_t sa_family, __be16 port)
+				   sa_family_t sa_family, __be16 port,
+				   u8 udp_tunnel_type)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	__be16 current_port;
@@ -2360,6 +2366,9 @@ static void mlx4_en_del_vxlan_port(struct  net_device *dev,
 	if (sa_family == AF_INET6)
 		return;
 
+	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
+		return;
+
 	current_port = priv->vxlan_port;
 	if (current_port != port) {
 		en_dbg(DRV, priv, "vxlan port %d isn't configured, ignoring\n", ntohs(port));
@@ -2397,8 +2406,8 @@ static const struct net_device_ops mlx4_netdev_ops = {
 #endif
 	.ndo_get_phys_port_id	= mlx4_en_get_phys_port_id,
 #ifdef CONFIG_MLX4_EN_VXLAN
-	.ndo_add_vxlan_port	= mlx4_en_add_vxlan_port,
-	.ndo_del_vxlan_port	= mlx4_en_del_vxlan_port,
+	.ndo_add_udp_tunnel_port	= mlx4_en_add_vxlan_port,
+	.ndo_del_udp_tunnel_port	= mlx4_en_del_vxlan_port,
 #endif
 };
 
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index 0fdbcc8..a39020d 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -23,6 +23,7 @@
 #include <linux/pci.h>
 #ifdef CONFIG_QLCNIC_VXLAN
 #include <net/vxlan.h>
+#include <net/udp_tunnel.h>
 #endif
 
 MODULE_DESCRIPTION("QLogic 1/10 GbE Converged/Intelligent Ethernet Driver");
@@ -470,7 +471,8 @@ static int qlcnic_get_phys_port_id(struct net_device *netdev,
 
 #ifdef CONFIG_QLCNIC_VXLAN
 static void qlcnic_add_vxlan_port(struct net_device *netdev,
-				  sa_family_t sa_family, __be16 port)
+				  sa_family_t sa_family, __be16 port,
+				  u8 udp_tunnel_type)
 {
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
 	struct qlcnic_hardware_context *ahw = adapter->ahw;
@@ -481,12 +483,16 @@ static void qlcnic_add_vxlan_port(struct net_device *netdev,
 	if (!qlcnic_encap_rx_offload(adapter) || ahw->vxlan_port)
 		return;
 
+	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
+		return;
+
 	ahw->vxlan_port = ntohs(port);
 	adapter->flags |= QLCNIC_ADD_VXLAN_PORT;
 }
 
 static void qlcnic_del_vxlan_port(struct net_device *netdev,
-				  sa_family_t sa_family, __be16 port)
+				  sa_family_t sa_family, __be16 port,
+				  u8 udp_tunnel_type)
 {
 	struct qlcnic_adapter *adapter = netdev_priv(netdev);
 	struct qlcnic_hardware_context *ahw = adapter->ahw;
@@ -495,6 +501,9 @@ static void qlcnic_del_vxlan_port(struct net_device *netdev,
 	    (ahw->vxlan_port != ntohs(port)))
 		return;
 
+	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
+		return;
+
 	adapter->flags |= QLCNIC_DEL_VXLAN_PORT;
 }
 #endif
@@ -518,8 +527,8 @@ static const struct net_device_ops qlcnic_netdev_ops = {
 	.ndo_fdb_dump		= qlcnic_fdb_dump,
 	.ndo_get_phys_port_id	= qlcnic_get_phys_port_id,
 #ifdef CONFIG_QLCNIC_VXLAN
-	.ndo_add_vxlan_port	= qlcnic_add_vxlan_port,
-	.ndo_del_vxlan_port	= qlcnic_del_vxlan_port,
+	.ndo_add_udp_tunnel_port	= qlcnic_add_vxlan_port,
+	.ndo_del_udp_tunnel_port	= qlcnic_del_vxlan_port,
 #endif
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	.ndo_poll_controller = qlcnic_poll_controller,
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index d3f3e5d..829d447 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -650,9 +650,11 @@ static void vxlan_notify_add_rx_port(struct vxlan_sock *vs)
 
 	rcu_read_lock();
 	for_each_netdev_rcu(net, dev) {
-		if (dev->netdev_ops->ndo_add_vxlan_port)
-			dev->netdev_ops->ndo_add_vxlan_port(dev, sa_family,
-							    port);
+		if (!dev->netdev_ops->ndo_add_udp_tunnel_port)
+			continue;
+
+		dev->netdev_ops->ndo_add_udp_tunnel_port(dev, sa_family, port,
+							 UDP_TUNNEL_TYPE_VXLAN);
 	}
 	rcu_read_unlock();
 }
@@ -668,9 +670,10 @@ static void vxlan_notify_del_rx_port(struct vxlan_sock *vs)
 
 	rcu_read_lock();
 	for_each_netdev_rcu(net, dev) {
-		if (dev->netdev_ops->ndo_del_vxlan_port)
-			dev->netdev_ops->ndo_del_vxlan_port(dev, sa_family,
-							    port);
+		if (!dev->netdev_ops->ndo_del_udp_tunnel_port)
+			continue;
+		dev->netdev_ops->ndo_del_udp_tunnel_port(dev, sa_family, port,
+							 UDP_TUNNEL_TYPE_VXLAN);
 	}
 	rcu_read_unlock();
 
@@ -2188,9 +2191,9 @@ static struct device_type vxlan_type = {
 	.name = "vxlan",
 };
 
-/* Calls the ndo_add_vxlan_port of the caller in order to
+/* Calls the ndo_add_tunnel_port of the caller in order to
  * supply the listening VXLAN udp ports. Callers are expected
- * to implement the ndo_add_vxlan_port.
+ * to implement the ndo_add_tunnle_port.
  */
 void vxlan_get_rx_port(struct net_device *dev)
 {
@@ -2206,8 +2209,8 @@ void vxlan_get_rx_port(struct net_device *dev)
 		hlist_for_each_entry_rcu(vs, &vn->sock_list[i], hlist) {
 			port = inet_sk(vs->sock->sk)->inet_sport;
 			sa_family = vs->sock->sk->sk_family;
-			dev->netdev_ops->ndo_add_vxlan_port(dev, sa_family,
-							    port);
+			dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
+					sa_family, port, UDP_TUNNEL_TYPE_VXLAN);
 		}
 	}
 	spin_unlock(&vn->sock_lock);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 8e8fb3e..4b79db4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -966,18 +966,21 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	not implement this, it is assumed that the hw is not able to have
  *	multiple net devices on single physical port.
  *
- * void (*ndo_add_vxlan_port)(struct  net_device *dev,
- *			      sa_family_t sa_family, __be16 port);
- *	Called by vxlan to notiy a driver about the UDP port and socket
- *	address family that vxlan is listnening to. It is called only when
- *	a new port starts listening. The operation is protected by the
- *	vxlan_net->sock_lock.
- *
- * void (*ndo_del_vxlan_port)(struct  net_device *dev,
- *			      sa_family_t sa_family, __be16 port);
- *	Called by vxlan to notify the driver about a UDP port and socket
- *	address family that vxlan is not listening to anymore. The operation
- *	is protected by the vxlan_net->sock_lock.
+ * void (*ndo_add_udp_tunnel_port)(struct  net_device *dev,
+ *				   sa_family_t sa_family, __be16 port,
+ *				   u8 udp_tunnel_type);
+ *	Called by UDP based tunnels to notify a driver about the UDP port,
+ *	socket address family and tunnel type that a UDP tunnel is
+ *	listening to. It is called only when a new port starts listening.
+ *	The operation is protected by the udp_net->sock_lock.
+ *
+ * void (*ndo_del_udp_tunnel_port)(struct  net_device *dev,
+ *				   sa_family_t sa_family, __be16 port,
+ *				   u8 udp_tunnel_type);
+ *	Called by UDP based tunnels to notify the driver about a UDP port,
+ *	socket address family and tunnel type that a UDP tunnel is not
+ *	listening to anymore.  The operation is protected by the
+ *	udp_net->sock_lock.
  *
  * void* (*ndo_dfwd_add_station)(struct net_device *pdev,
  *				 struct net_device *dev)
@@ -1130,12 +1133,12 @@ struct net_device_ops {
 						      bool new_carrier);
 	int			(*ndo_get_phys_port_id)(struct net_device *dev,
 							struct netdev_phys_port_id *ppid);
-	void			(*ndo_add_vxlan_port)(struct  net_device *dev,
+	void			(*ndo_add_udp_tunnel_port)(struct net_device *dev,
 						      sa_family_t sa_family,
-						      __be16 port);
-	void			(*ndo_del_vxlan_port)(struct  net_device *dev,
+						      __be16 port, u8 type);
+	void			(*ndo_del_udp_tunnel_port)(struct  net_device *dev,
 						      sa_family_t sa_family,
-						      __be16 port);
+						      __be16 port, u8 type);
 
 	void*			(*ndo_dfwd_add_station)(struct net_device *pdev,
 							struct net_device *dev);
diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index ffd69cb..3f34c65 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -1,6 +1,8 @@
 #ifndef __NET_UDP_TUNNEL_H
 #define __NET_UDP_TUNNEL_H
 
+#define UDP_TUNNEL_TYPE_VXLAN 0x01
+
 struct udp_port_cfg {
 	u8			family;
 
-- 
1.7.9.5
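
The callback contract introduced by this patch can be pictured with a
minimal userspace sketch. It is not kernel code: struct demo_nic,
struct demo_ops and the demo_* functions are hypothetical stand-ins for
a driver's private state and for the two net_device_ops members, ports
are kept in host byte order for simplicity, and the Geneve type value is
the one added later in the series. The sketch only shows the pattern the
drivers above follow: ignore any tunnel type the device cannot offload.

```c
#include <stdint.h>

/* UDP_TUNNEL_TYPE_VXLAN comes from this patch; the Geneve value is
 * the one added by patch 02 of the series. */
#define UDP_TUNNEL_TYPE_VXLAN  0x01
#define UDP_TUNNEL_TYPE_GENEVE 0x02

/* Hypothetical per-device state: one offloaded VXLAN port, 0 if none. */
struct demo_nic {
	uint16_t vxlan_port;
};

/* Hypothetical, cut-down analogue of the two renamed callbacks. */
struct demo_ops {
	void (*add_udp_tunnel_port)(struct demo_nic *nic, uint16_t port,
				    uint8_t udp_tunnel_type);
	void (*del_udp_tunnel_port)(struct demo_nic *nic, uint16_t port,
				    uint8_t udp_tunnel_type);
};

static void demo_add_port(struct demo_nic *nic, uint16_t port,
			  uint8_t udp_tunnel_type)
{
	/* Ignore tunnel types this device cannot offload. */
	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
		return;

	/* Only a single offloaded port is modelled here. */
	if (nic->vxlan_port)
		return;

	nic->vxlan_port = port;
}

static void demo_del_port(struct demo_nic *nic, uint16_t port,
			  uint8_t udp_tunnel_type)
{
	if (udp_tunnel_type != UDP_TUNNEL_TYPE_VXLAN)
		return;

	/* Only forget the port that is actually configured. */
	if (nic->vxlan_port != port)
		return;

	nic->vxlan_port = 0;
}

/* Notifier side: skip devices that do not implement the callback. */
static void demo_notify_add(const struct demo_ops *ops, struct demo_nic *nic,
			    uint16_t port, uint8_t udp_tunnel_type)
{
	if (!ops->add_udp_tunnel_port)
		return;

	ops->add_udp_tunnel_port(nic, port, udp_tunnel_type);
}

static void demo_notify_del(const struct demo_ops *ops, struct demo_nic *nic,
			    uint16_t port, uint8_t udp_tunnel_type)
{
	if (!ops->del_udp_tunnel_port)
		return;

	ops->del_udp_tunnel_port(nic, port, udp_tunnel_type);
}
```

With this shape, a Geneve notification on a VXLAN-only device leaves
the device state untouched, which is the "no behavioral changes"
property the commit message claims for existing drivers.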


* [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
  2014-07-22 10:19 ` [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
       [not found]   ` <CA+mtBx9M_BpjT-_Egng+jFxmqJzdC2Npg0ufE2ZSAb9Lhw8hxg@mail.gmail.com>
  2014-07-23 19:57   ` Tom Herbert
  2014-07-22 10:19 ` [net-next 03/10] vxlan: Remove vxlan_get_rx_port() Andy Zhou
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Andy Zhou

Add create_udp_tunnel_socket(), packet receive and transmit helpers,
and other related common functions for UDP tunnels.

Open UDP tunnel ports are tracked per network namespace in this common
layer to prevent more than one UDP tunnel from sharing a single port.

Signed-off-by: Andy Zhou <azhou@nicira.com>
---
 include/net/udp_tunnel.h |   57 +++++++++-
 net/ipv4/udp_tunnel.c    |  257 +++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 312 insertions(+), 2 deletions(-)

diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
index 3f34c65..b5e815a 100644
--- a/include/net/udp_tunnel.h
+++ b/include/net/udp_tunnel.h
@@ -1,7 +1,10 @@
 #ifndef __NET_UDP_TUNNEL_H
 #define __NET_UDP_TUNNEL_H
 
-#define UDP_TUNNEL_TYPE_VXLAN 0x01
+#include <net/ip_tunnels.h>
+
+#define UDP_TUNNEL_TYPE_VXLAN  0x01
+#define UDP_TUNNEL_TYPE_GENEVE 0x02
 
 struct udp_port_cfg {
 	u8			family;
@@ -28,7 +31,59 @@ struct udp_port_cfg {
 				use_udp6_rx_checksums:1;
 };
 
+struct udp_tunnel_sock;
+
+typedef void (udp_tunnel_rcv_t)(struct udp_tunnel_sock *uts,
+				struct sk_buff *skb, ...);
+
+typedef int (udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
+
+struct udp_tunnel_socket_cfg {
+	u8 tunnel_type;
+	struct udp_port_cfg port;
+	udp_tunnel_rcv_t *rcv;
+	udp_tunnel_encap_rcv_t *encap_rcv;
+	void *data;
+};
+
+struct udp_tunnel_sock {
+	u8 tunnel_type;
+	struct hlist_node hlist;
+	udp_tunnel_rcv_t *rcv;
+	void *data;
+	struct socket *sock;
+};
+
 int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
 		    struct socket **sockp);
 
+struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
+						 struct udp_tunnel_socket_cfg
+							*socket_cfg);
+
+struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port);
+
+int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
+			struct sk_buff *skb, __be32 src, __be32 dst,
+			__u8 tos, __u8 ttl, __be16 df, __be16 src_port,
+			__be16 dst_port, bool xnet);
+
+#if IS_ENABLED(CONFIG_IPV6)
+int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
+		struct sk_buff *skb, struct net_device *dev,
+		struct in6_addr *saddr, struct in6_addr *daddr,
+		__u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port);
+
+#endif
+
+void udp_tunnel_sock_release(struct udp_tunnel_sock *uts);
+void udp_tunnel_get_rx_port(struct net_device *dev);
+
+static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
+							 bool udp_csum)
+{
+	int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
+
+	return iptunnel_handle_offloads(skb, udp_csum, type);
+}
 #endif
diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
index 61ec1a6..3c14b16 100644
--- a/net/ipv4/udp_tunnel.c
+++ b/net/ipv4/udp_tunnel.c
@@ -7,6 +7,23 @@
 #include <net/udp.h>
 #include <net/udp_tunnel.h>
 #include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#if IS_ENABLED(CONFIG_IPV6)
+#include <net/ipv6.h>
+#include <net/addrconf.h>
+#include <net/ip6_tunnel.h>
+#include <net/ip6_checksum.h>
+#endif
+
+#define PORT_HASH_BITS 8
+#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)
+
+static int udp_tunnel_net_id;
+
+struct udp_tunnel_net {
+	struct hlist_head sock_list[PORT_HASH_SIZE];
+	spinlock_t  sock_lock;   /* Protecting the sock_list */
+};
 
 int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
 		    struct socket **sockp)
@@ -82,7 +99,6 @@ int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
 		return -EPFNOSUPPORT;
 	}
 
-
 	*sockp = sock;
 
 	return 0;
@@ -97,4 +113,243 @@ error:
 }
 EXPORT_SYMBOL(udp_sock_create);
 
+
+/* Socket hash table head */
+static inline struct hlist_head *uts_head(struct net *net, const __be16 port)
+{
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+
+	return &utn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
+}
+
+static int handle_offloads(struct sk_buff *skb)
+{
+	if (skb_is_gso(skb)) {
+		int err = skb_unclone(skb, GFP_ATOMIC);
+
+		if (unlikely(err))
+			return err;
+		skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
+	} else {
+		if (skb->ip_summed != CHECKSUM_PARTIAL)
+			skb->ip_summed = CHECKSUM_NONE;
+	}
+
+	return 0;
+}
+
+struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
+						 struct udp_tunnel_socket_cfg
+							*cfg)
+{
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+	struct udp_tunnel_sock *uts;
+	struct socket *sock;
+	struct sock *sk;
+	const __be16 port = cfg->port.local_udp_port;
+	const int ipv6 = (cfg->port.family == AF_INET6);
+	int err;
+
+	uts = kzalloc(size, GFP_KERNEL);
+	if (!uts)
+		return ERR_PTR(-ENOMEM);
+
+	err = udp_sock_create(net, &cfg->port, &sock);
+	if (err < 0) {
+		kfree(uts);
+		return ERR_PTR(err);
+	}
+
+	/* Disable multicast loopback */
+	inet_sk(sock->sk)->mc_loop = 0;
+
+	uts->sock = sock;
+	sk = sock->sk;
+	uts->rcv = cfg->rcv;
+	uts->data = cfg->data;
+	rcu_assign_sk_user_data(sock->sk, uts);
+
+	spin_lock(&utn->sock_lock);
+	hlist_add_head_rcu(&uts->hlist, uts_head(net, port));
+	spin_unlock(&utn->sock_lock);
+
+	udp_sk(sk)->encap_type = 1;
+	udp_sk(sk)->encap_rcv = cfg->encap_rcv;
+
+#if IS_ENABLED(CONFIG_IPV6)
+	if (ipv6)
+		ipv6_stub->udpv6_encap_enable();
+	else
+#endif
+		udp_encap_enable();
+
+	return uts;
+}
+EXPORT_SYMBOL_GPL(create_udp_tunnel_socket);
+
+int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
+			struct sk_buff *skb, __be32 src, __be32 dst,
+			__u8 tos, __u8 ttl, __be16 df, __be16 src_port,
+			__be16 dst_port, bool xnet)
+{
+	struct udphdr *uh;
+
+	__skb_push(skb, sizeof(*uh));
+	skb_reset_transport_header(skb);
+	uh = udp_hdr(skb);
+
+	uh->dest = dst_port;
+	uh->source = src_port;
+	uh->len = htons(skb->len);
+
+	udp_set_csum(sock->sk->sk_no_check_tx, skb, src, dst, skb->len);
+
+	return iptunnel_xmit(sock->sk, rt, skb, src, dst, IPPROTO_UDP,
+			     tos, ttl, df, xnet);
+}
+EXPORT_SYMBOL_GPL(udp_tunnel_xmit_skb);
+
+#if IS_ENABLED(CONFIG_IPV6)
+int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
+			 struct sk_buff *skb, struct net_device *dev,
+			 struct in6_addr *saddr, struct in6_addr *daddr,
+			 __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port)
+{
+	struct udphdr *uh;
+	struct ipv6hdr *ip6h;
+	int err;
+
+	__skb_push(skb, sizeof(*uh));
+	skb_reset_transport_header(skb);
+	uh = udp_hdr(skb);
+
+	uh->dest = dst_port;
+	uh->source = src_port;
+
+	uh->len = htons(skb->len);
+	uh->check = 0;
+
+	memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
+	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED
+			    | IPSKB_REROUTED);
+	skb_dst_set(skb, dst);
+
+	if (!skb_is_gso(skb) && !(dst->dev->features & NETIF_F_IPV6_CSUM)) {
+		__wsum csum = skb_checksum(skb, 0, skb->len, 0);
+
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
+		uh->check = csum_ipv6_magic(saddr, daddr, skb->len,
+				IPPROTO_UDP, csum);
+		if (uh->check == 0)
+			uh->check = CSUM_MANGLED_0;
+	} else {
+		skb->ip_summed = CHECKSUM_PARTIAL;
+		skb->csum_start = skb_transport_header(skb) - skb->head;
+		skb->csum_offset = offsetof(struct udphdr, check);
+		uh->check = ~csum_ipv6_magic(saddr, daddr,
+				skb->len, IPPROTO_UDP, 0);
+	}
+
+	__skb_push(skb, sizeof(*ip6h));
+	skb_reset_network_header(skb);
+	ip6h		  = ipv6_hdr(skb);
+	ip6h->version	  = 6;
+	ip6h->priority	  = prio;
+	ip6h->flow_lbl[0] = 0;
+	ip6h->flow_lbl[1] = 0;
+	ip6h->flow_lbl[2] = 0;
+	ip6h->payload_len = htons(skb->len);
+	ip6h->nexthdr     = IPPROTO_UDP;
+	ip6h->hop_limit   = ttl;
+	ip6h->daddr	  = *daddr;
+	ip6h->saddr	  = *saddr;
+
+	err = handle_offloads(skb);
+	if (err)
+		return err;
+
+	ip6tunnel_xmit(skb, dev);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
+#endif
+
+struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port)
+{
+	struct udp_tunnel_sock *uts;
+
+	hlist_for_each_entry_rcu(uts, uts_head(net, port), hlist) {
+		if (inet_sk(uts->sock->sk)->inet_sport == port)
+			return uts;
+	}
+
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(udp_tunnel_find_sock);
+
+void udp_tunnel_sock_release(struct udp_tunnel_sock *uts)
+{
+	struct sock *sk = uts->sock->sk;
+	struct net *net = sock_net(sk);
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+
+	spin_lock(&utn->sock_lock);
+	hlist_del_rcu(&uts->hlist);
+	rcu_assign_sk_user_data(uts->sock->sk, NULL);
+	spin_unlock(&utn->sock_lock);
+}
+EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
+
+/* Calls the ndo_add_udp_tunnel_port of the caller in order to
+ * supply the listening UDP tunnel ports. Callers are expected
+ * to implement ndo_add_udp_tunnel_port.
+ */
+void udp_tunnel_get_rx_port(struct net_device *dev)
+{
+	struct udp_tunnel_sock *uts;
+	struct net *net = dev_net(dev);
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+	sa_family_t sa_family;
+	__be16 port;
+	unsigned int i;
+
+	spin_lock(&utn->sock_lock);
+	for (i = 0; i < PORT_HASH_SIZE; ++i) {
+		hlist_for_each_entry_rcu(uts, &utn->sock_list[i], hlist) {
+			port = inet_sk(uts->sock->sk)->inet_sport;
+			sa_family = uts->sock->sk->sk_family;
+			dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
+					sa_family, port, uts->tunnel_type);
+		}
+	}
+	spin_unlock(&utn->sock_lock);
+}
+EXPORT_SYMBOL_GPL(udp_tunnel_get_rx_port);
+
+static int __net_init udp_tunnel_init_net(struct net *net)
+{
+	struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
+	unsigned int h;
+
+	spin_lock_init(&utn->sock_lock);
+
+	for (h = 0; h < PORT_HASH_SIZE; h++)
+		INIT_HLIST_HEAD(&utn->sock_list[h]);
+
+	return 0;
+}
+
+static struct pernet_operations udp_tunnel_net_ops = {
+	.init = udp_tunnel_init_net,
+	.exit = NULL,
+	.id = &udp_tunnel_net_id,
+	.size = sizeof(struct udp_tunnel_net),
+};
+
+static int __init udp_tunnel_init(void)
+{
+	return register_pernet_subsys(&udp_tunnel_net_ops);
+}
+late_initcall(udp_tunnel_init);
+
 MODULE_LICENSE("GPL");
-- 
1.7.9.5
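
The per-namespace port tracking described in the commit message can be
sketched in userspace as a small hash table keyed by UDP port. This is
an illustration under assumptions, not the patch's code: the demo_*
names are hypothetical, a plain singly linked list replaces the RCU
hlist, locking is omitted, the hash constant is an arbitrary
multiplicative constant standing in for hash_32(), and the explicit
duplicate check returning -EADDRINUSE is an assumption about how
"prevent sharing" could be enforced in the table itself.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_HASH_BITS 8
#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)

/* Hypothetical userspace analogue of struct udp_tunnel_sock. */
struct demo_tunnel_sock {
	uint16_t port;
	uint8_t tunnel_type;
	struct demo_tunnel_sock *next;
};

/* Hypothetical analogue of the per-namespace struct udp_tunnel_net. */
struct demo_tunnel_net {
	struct demo_tunnel_sock *sock_list[PORT_HASH_SIZE];
};

/* Multiplicative hash folded down to PORT_HASH_BITS bits. */
static unsigned int demo_port_hash(uint16_t port)
{
	uint32_t h = (uint32_t)port * 0x9E3779B1u;

	return h >> (32 - PORT_HASH_BITS);
}

/* Return the tracked socket bound to @port, or NULL if none. */
static struct demo_tunnel_sock *demo_find_sock(struct demo_tunnel_net *utn,
					       uint16_t port)
{
	struct demo_tunnel_sock *uts;

	for (uts = utn->sock_list[demo_port_hash(port)]; uts; uts = uts->next) {
		if (uts->port == port)
			return uts;
	}

	return NULL;
}

/* Track @uts; refuse a port that another tunnel already owns. */
static int demo_add_sock(struct demo_tunnel_net *utn,
			 struct demo_tunnel_sock *uts)
{
	unsigned int h = demo_port_hash(uts->port);

	if (demo_find_sock(utn, uts->port))
		return -EADDRINUSE;

	uts->next = utn->sock_list[h];
	utn->sock_list[h] = uts;
	return 0;
}

/* Stop tracking @uts; a socket that is not in the table is ignored. */
static void demo_del_sock(struct demo_tunnel_net *utn,
			  struct demo_tunnel_sock *uts)
{
	struct demo_tunnel_sock **pp = &utn->sock_list[demo_port_hash(uts->port)];

	while (*pp && *pp != uts)
		pp = &(*pp)->next;

	if (*pp)
		*pp = uts->next;
}
```

Because the table is keyed only by port, a VXLAN socket and a Geneve
socket asking for the same port in the same namespace collide in the
same bucket, so the second registration can be rejected regardless of
tunnel type.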


* [net-next 03/10] vxlan: Remove vxlan_get_rx_port()
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
  2014-07-22 10:19 ` [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port Andy Zhou
  2014-07-22 10:19 ` [net-next 02/10] udp: Expand UDP tunnel common APIs Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
       [not found]   ` <CAKgT0UeRSc3MaZrLmXyx4jPZO+F1hS5imR1TjFkvKp4S8nQmeg@mail.gmail.com>
  2014-07-22 10:19 ` [net-next 04/10] net: Refactor vxlan driver to make use of common UDP tunnel functions Andy Zhou
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Andy Zhou

Instead of specifically calling vxlan_get_rx_port(), device drivers
should now call udp_tunnel_get_rx_port(). This change supports
future NICs and device drivers that may offload more UDP tunnel
protocols.

Signed-off-by: Andy Zhou <azhou@nicira.com>
---
 drivers/net/ethernet/emulex/benet/be_main.c      |    2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c      |    2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |    2 +-
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |    2 +-
 drivers/net/vxlan.c                              |   26 ----------------------
 include/net/vxlan.h                              |    7 ------
 6 files changed, 4 insertions(+), 37 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 028dc6e..b6badc0 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -2922,7 +2922,7 @@ static int be_open(struct net_device *netdev)
 
 #ifdef CONFIG_BE2NET_VXLAN
 	if (skyhawk_chip(adapter))
-		vxlan_get_rx_port(netdev);
+		udp_tunnel_get_rx_port(netdev);
 #endif
 
 	return 0;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d9fd53b..3cebb1b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -4527,7 +4527,7 @@ static int i40e_open(struct net_device *netdev)
 	wr32(&pf->hw, I40E_GLLAN_TSOMSK_L, be32_to_cpu(TCP_FLAG_CWR) >> 16);
 
 #ifdef CONFIG_I40E_VXLAN
-	vxlan_get_rx_port(netdev);
+	udp_tunnel_get_rx_port(netdev);
 #endif
 
 	return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index d5f6b91..d935ad3 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1721,7 +1721,7 @@ int mlx4_en_start_port(struct net_device *dev)
 
 #ifdef CONFIG_MLX4_EN_VXLAN
 	if (priv->mdev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_VXLAN_OFFLOADS)
-		vxlan_get_rx_port(dev);
+		udp_tunnel_get_rx_port(dev);
 #endif
 	priv->port_up = true;
 	netif_tx_start_all_queues(dev);
diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
index a39020d..df760f7 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c
@@ -2002,7 +2002,7 @@ qlcnic_attach(struct qlcnic_adapter *adapter)
 
 #ifdef CONFIG_QLCNIC_VXLAN
 	if (qlcnic_encap_rx_offload(adapter))
-		vxlan_get_rx_port(netdev);
+		udp_tunnel_get_rx_port(netdev);
 #endif
 
 	adapter->is_up = QLCNIC_ADAPTER_UP_MAGIC;
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 829d447..93f2e40 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2191,32 +2191,6 @@ static struct device_type vxlan_type = {
 	.name = "vxlan",
 };
 
-/* Calls the ndo_add_tunnel_port of the caller in order to
- * supply the listening VXLAN udp ports. Callers are expected
- * to implement the ndo_add_tunnle_port.
- */
-void vxlan_get_rx_port(struct net_device *dev)
-{
-	struct vxlan_sock *vs;
-	struct net *net = dev_net(dev);
-	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
-	sa_family_t sa_family;
-	__be16 port;
-	unsigned int i;
-
-	spin_lock(&vn->sock_lock);
-	for (i = 0; i < PORT_HASH_SIZE; ++i) {
-		hlist_for_each_entry_rcu(vs, &vn->sock_list[i], hlist) {
-			port = inet_sk(vs->sock->sk)->inet_sport;
-			sa_family = vs->sock->sk->sk_family;
-			dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
-					sa_family, port, UDP_TUNNEL_TYPE_VXLAN);
-		}
-	}
-	spin_unlock(&vn->sock_lock);
-}
-EXPORT_SYMBOL_GPL(vxlan_get_rx_port);
-
 /* Initialize the device structure. */
 static void vxlan_setup(struct net_device *dev)
 {
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index d5f59f3..60f9d4d 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -50,11 +50,4 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 /* IPv6 header + UDP + VXLAN + Ethernet header */
 #define VXLAN6_HEADROOM (40 + 8 + 8 + 14)
 
-#if IS_ENABLED(CONFIG_VXLAN)
-void vxlan_get_rx_port(struct net_device *netdev);
-#else
-static inline void vxlan_get_rx_port(struct net_device *netdev)
-{
-}
-#endif
 #endif
-- 
1.7.9.5

* [net-next 04/10] net: Refactor vxlan driver to make use of common UDP tunnel functions
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
                   ` (2 preceding siblings ...)
  2014-07-22 10:19 ` [net-next 03/10] vxlan: Remove vxlan_get_rx_port() Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
  2014-07-24  6:46   ` Or Gerlitz
  2014-07-22 10:19 ` [net-next 05/10] net: Add Geneve tunneling protocol driver Andy Zhou
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Andy Zhou

Refactor vxlan driver to make use of the common UDP tunnel
functions.

Signed-off-by: Andy Zhou <azhou@nicira.com>
---
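Note: the refactor relies on struct vxlan_sock embedding the common
tunnel socket as its first member, so the pointer handed back by the
generic layer can be cast to the protocol-specific type. The sketch
below illustrates that pattern in plain userspace C. It is only an
illustration under stated assumptions: base_sock, derived_sock and
both helper functions are hypothetical stand-ins, not the kernel
structures or the udp_tunnel API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct udp_tunnel_sock (not the kernel type). */
struct base_sock {
	int port;
	void *data;
};

/* Hypothetical stand-in for struct vxlan_sock. */
struct derived_sock {
	struct base_sock uts;	/* must be the first member */
	int refcnt;
};

/* Generic layer: allocates 'size' bytes but only knows about the base. */
static struct base_sock *base_sock_create(size_t size, int port)
{
	struct base_sock *b;

	assert(size >= sizeof(*b));
	b = calloc(1, size);
	if (!b)
		return NULL;
	b->port = port;
	return b;
}

/* Protocol layer: the cast is valid only because uts sits at offset 0. */
static struct derived_sock *derived_sock_create(int port)
{
	struct derived_sock *d;

	d = (struct derived_sock *)base_sock_create(sizeof(*d), port);
	if (!d)
		return NULL;
	d->refcnt = 1;
	return d;
}
```

If the embedded member were moved away from offset 0, the cast would
silently point at the wrong address, which is presumably why the
header marks it "Must be the first member".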
 drivers/net/vxlan.c           |  232 ++++++++++-------------------------------
 include/net/vxlan.h           |   14 ++-
 net/openvswitch/vport-vxlan.c |    7 +-
 3 files changed, 66 insertions(+), 187 deletions(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 93f2e40..816f42d 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -42,6 +42,7 @@
 #include <net/netns/generic.h>
 #include <net/vxlan.h>
 #include <net/protocol.h>
+#include <net/udp_tunnel.h>
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ipv6.h>
 #include <net/addrconf.h>
@@ -51,8 +52,6 @@
 
 #define VXLAN_VERSION	"0.1"
 
-#define PORT_HASH_BITS	8
-#define PORT_HASH_SIZE  (1<<PORT_HASH_BITS)
 #define VNI_HASH_BITS	10
 #define VNI_HASH_SIZE	(1<<VNI_HASH_BITS)
 #define FDB_HASH_BITS	8
@@ -91,8 +90,7 @@ static const u8 all_zeros_mac[ETH_ALEN];
 /* per-network namespace private data for this module */
 struct vxlan_net {
 	struct list_head  vxlan_list;
-	struct hlist_head sock_list[PORT_HASH_SIZE];
-	spinlock_t	  sock_lock;
+	spinlock_t	  vxlan_list_lock; /* protecting vxlan_list */
 };
 
 union vxlan_addr {
@@ -253,14 +251,6 @@ static inline struct hlist_head *vni_head(struct vxlan_sock *vs, u32 id)
 	return &vs->vni_list[hash_32(id, VNI_HASH_BITS)];
 }
 
-/* Socket hash table head */
-static inline struct hlist_head *vs_head(struct net *net, __be16 port)
-{
-	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
-
-	return &vn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
-}
-
 /* First remote destination for a forwarding entry.
  * Guaranteed to be non-NULL because remotes are never deleted.
  */
@@ -277,13 +267,7 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
 /* Find VXLAN socket based on network namespace and UDP port */
 static struct vxlan_sock *vxlan_find_sock(struct net *net, __be16 port)
 {
-	struct vxlan_sock *vs;
-
-	hlist_for_each_entry_rcu(vs, vs_head(net, port), hlist) {
-		if (inet_sk(vs->sock->sk)->inet_sport == port)
-			return vs;
-	}
-	return NULL;
+	return (struct vxlan_sock *)udp_tunnel_find_sock(net, port);
 }
 
 static struct vxlan_dev *vxlan_vs_find_vni(struct vxlan_sock *vs, u32 id)
@@ -636,7 +620,7 @@ static int vxlan_gro_complete(struct sk_buff *skb, int nhoff)
 static void vxlan_notify_add_rx_port(struct vxlan_sock *vs)
 {
 	struct net_device *dev;
-	struct sock *sk = vs->sock->sk;
+	struct sock *sk = vs->uts.sock->sk;
 	struct net *net = sock_net(sk);
 	sa_family_t sa_family = sk->sk_family;
 	__be16 port = inet_sk(sk)->inet_sport;
@@ -663,7 +647,7 @@ static void vxlan_notify_add_rx_port(struct vxlan_sock *vs)
 static void vxlan_notify_del_rx_port(struct vxlan_sock *vs)
 {
 	struct net_device *dev;
-	struct sock *sk = vs->sock->sk;
+	struct sock *sk = vs->uts.sock->sk;
 	struct net *net = sock_net(sk);
 	sa_family_t sa_family = sk->sk_family;
 	__be16 port = inet_sk(sk)->inet_sport;
@@ -1056,19 +1040,11 @@ static void vxlan_sock_hold(struct vxlan_sock *vs)
 
 void vxlan_sock_release(struct vxlan_sock *vs)
 {
-	struct sock *sk = vs->sock->sk;
-	struct net *net = sock_net(sk);
-	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
-
 	if (!atomic_dec_and_test(&vs->refcnt))
 		return;
 
-	spin_lock(&vn->sock_lock);
-	hlist_del_rcu(&vs->hlist);
-	rcu_assign_sk_user_data(vs->sock->sk, NULL);
+	udp_tunnel_sock_release(&vs->uts);
 	vxlan_notify_del_rx_port(vs);
-	spin_unlock(&vn->sock_lock);
-
 	queue_work(vxlan_wq, &vs->del_work);
 }
 EXPORT_SYMBOL_GPL(vxlan_sock_release);
@@ -1081,7 +1057,7 @@ static void vxlan_igmp_join(struct work_struct *work)
 {
 	struct vxlan_dev *vxlan = container_of(work, struct vxlan_dev, igmp_join);
 	struct vxlan_sock *vs = vxlan->vn_sock;
-	struct sock *sk = vs->sock->sk;
+	struct sock *sk = vs->uts.sock->sk;
 	union vxlan_addr *ip = &vxlan->default_dst.remote_ip;
 	int ifindex = vxlan->default_dst.remote_ifindex;
 
@@ -1110,7 +1086,7 @@ static void vxlan_igmp_leave(struct work_struct *work)
 {
 	struct vxlan_dev *vxlan = container_of(work, struct vxlan_dev, igmp_leave);
 	struct vxlan_sock *vs = vxlan->vn_sock;
-	struct sock *sk = vs->sock->sk;
+	struct sock *sk = vs->uts.sock->sk;
 	union vxlan_addr *ip = &vxlan->default_dst.remote_ip;
 	int ifindex = vxlan->default_dst.remote_ifindex;
 
@@ -1163,7 +1139,7 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 
 	skb_pop_rcv_encapsulation(skb);
 
-	vs->rcv(vs, skb, vxh->vx_vni);
+	vs->uts.rcv(&vs->uts, skb, vxh->vx_vni);
 	return 0;
 
 drop:
@@ -1341,7 +1317,6 @@ out:
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
-
 static struct sk_buff *vxlan_na_create(struct sk_buff *request,
 	struct neighbour *n, bool isrouter)
 {
@@ -1575,13 +1550,6 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
 	return false;
 }
 
-static inline struct sk_buff *vxlan_handle_offloads(struct sk_buff *skb,
-						    bool udp_csum)
-{
-	int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
-	return iptunnel_handle_offloads(skb, udp_csum, type);
-}
-
 #if IS_ENABLED(CONFIG_IPV6)
 static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 			   struct dst_entry *dst, struct sk_buff *skb,
@@ -1590,13 +1558,13 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 			   __be16 src_port, __be16 dst_port, __be32 vni,
 			   bool xnet)
 {
-	struct ipv6hdr *ip6h;
 	struct vxlanhdr *vxh;
-	struct udphdr *uh;
 	int min_headroom;
 	int err;
 
-	skb = vxlan_handle_offloads(skb, !udp_get_no_check6_tx(vs->sock->sk));
+	skb = udp_tunnel_handle_offloads(skb,
+					 !udp_get_no_check6_tx(
+						 vs->uts.sock->sk));
 	if (IS_ERR(skb))
 		return -EINVAL;
 
@@ -1624,38 +1592,8 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
 	vxh->vx_flags = htonl(VXLAN_FLAGS);
 	vxh->vx_vni = vni;
 
-	__skb_push(skb, sizeof(*uh));
-	skb_reset_transport_header(skb);
-	uh = udp_hdr(skb);
-
-	uh->dest = dst_port;
-	uh->source = src_port;
-
-	uh->len = htons(skb->len);
-
-	memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
-	IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED |
-			      IPSKB_REROUTED);
-	skb_dst_set(skb, dst);
-
-	udp6_set_csum(udp_get_no_check6_tx(vs->sock->sk), skb,
-		      saddr, daddr, skb->len);
-
-	__skb_push(skb, sizeof(*ip6h));
-	skb_reset_network_header(skb);
-	ip6h		  = ipv6_hdr(skb);
-	ip6h->version	  = 6;
-	ip6h->priority	  = prio;
-	ip6h->flow_lbl[0] = 0;
-	ip6h->flow_lbl[1] = 0;
-	ip6h->flow_lbl[2] = 0;
-	ip6h->payload_len = htons(skb->len);
-	ip6h->nexthdr     = IPPROTO_UDP;
-	ip6h->hop_limit   = ttl;
-	ip6h->daddr	  = *daddr;
-	ip6h->saddr	  = *saddr;
-
-	ip6tunnel_xmit(skb, dev);
+	udp_tunnel6_xmit_skb(vs->uts.sock, dst, skb, dev, saddr, daddr, prio,
+			     ttl, src_port, dst_port);
 	return 0;
 }
 #endif
@@ -1666,11 +1604,11 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 		   __be16 src_port, __be16 dst_port, __be32 vni, bool xnet)
 {
 	struct vxlanhdr *vxh;
-	struct udphdr *uh;
 	int min_headroom;
 	int err;
 
-	skb = vxlan_handle_offloads(skb, !vs->sock->sk->sk_no_check_tx);
+	skb = udp_tunnel_handle_offloads(skb,
+					 !vs->uts.sock->sk->sk_no_check_tx);
 	if (IS_ERR(skb))
 		return -EINVAL;
 
@@ -1696,20 +1634,8 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
 	vxh->vx_flags = htonl(VXLAN_FLAGS);
 	vxh->vx_vni = vni;
 
-	__skb_push(skb, sizeof(*uh));
-	skb_reset_transport_header(skb);
-	uh = udp_hdr(skb);
-
-	uh->dest = dst_port;
-	uh->source = src_port;
-
-	uh->len = htons(skb->len);
-
-	udp_set_csum(vs->sock->sk->sk_no_check_tx, skb,
-		     src, dst, skb->len);
-
-	return iptunnel_xmit(vs->sock->sk, rt, skb, src, dst, IPPROTO_UDP,
-			     tos, ttl, df, xnet);
+	return udp_tunnel_xmit_skb(vs->uts.sock, rt, skb, src, dst, tos,
+				   ttl, df, src_port, dst_port, xnet);
 }
 EXPORT_SYMBOL_GPL(vxlan_xmit_skb);
 
@@ -1834,18 +1760,18 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
 		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
 
-		err = vxlan_xmit_skb(vxlan->vn_sock, rt, skb,
-				     fl4.saddr, dst->sin.sin_addr.s_addr,
-				     tos, ttl, df, src_port, dst_port,
-				     htonl(vni << 8),
-				     !net_eq(vxlan->net, dev_net(vxlan->dev)));
+		err = udp_tunnel_xmit_skb(vxlan->vn_sock->uts.sock, rt, skb,
+					  fl4.saddr, dst->sin.sin_addr.s_addr,
+					  tos, ttl, df, src_port, dst_port,
+					  !net_eq(vxlan->net,
+						  dev_net(vxlan->dev)));
 
 		if (err < 0)
 			goto rt_tx_error;
 		iptunnel_xmit_stats(err, &dev->stats, dev->tstats);
 #if IS_ENABLED(CONFIG_IPV6)
 	} else {
-		struct sock *sk = vxlan->vn_sock->sock->sk;
+		struct sock *sk = vxlan->vn_sock->uts.sock->sk;
 		struct dst_entry *ndst;
 		struct flowi6 fl6;
 		u32 flags;
@@ -2041,7 +1967,7 @@ static int vxlan_init(struct net_device *dev)
 	if (!dev->tstats)
 		return -ENOMEM;
 
-	spin_lock(&vn->sock_lock);
+	spin_lock(&vn->vxlan_list_lock);
 	vs = vxlan_find_sock(vxlan->net, vxlan->dst_port);
 	if (vs) {
 		/* If we have a socket with same port already, reuse it */
@@ -2052,7 +1978,7 @@ static int vxlan_init(struct net_device *dev)
 		dev_hold(dev);
 		queue_work(vxlan_wq, &vxlan->sock_work);
 	}
-	spin_unlock(&vn->sock_lock);
+	spin_unlock(&vn->vxlan_list_lock);
 
 	return 0;
 }
@@ -2312,59 +2238,44 @@ static const struct ethtool_ops vxlan_ethtool_ops = {
 static void vxlan_del_work(struct work_struct *work)
 {
 	struct vxlan_sock *vs = container_of(work, struct vxlan_sock, del_work);
-
-	sk_release_kernel(vs->sock->sk);
+	sk_release_kernel(vs->uts.sock->sk);
 	kfree_rcu(vs, rcu);
 }
 
-static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
-					__be16 port, u32 flags)
+/* Create new listen socket if needed */
+static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
+					      vxlan_rcv_t *rcv, void *data,
+					      u32 flags)
 {
-	struct socket *sock;
-	struct udp_port_cfg udp_conf;
-	int err;
+	bool ipv6 = !!(flags & VXLAN_F_IPV6);
+	struct vxlan_sock *vs;
+	struct udp_tunnel_socket_cfg vxlan_ts_cfg;
+	unsigned int h;
 
-	memset(&udp_conf, 0, sizeof(udp_conf));
+	memset(&vxlan_ts_cfg, 0, sizeof(struct udp_tunnel_socket_cfg));
+
+	vxlan_ts_cfg.tunnel_type = UDP_TUNNEL_TYPE_VXLAN;
 
 	if (ipv6) {
-		udp_conf.family = AF_INET6;
-		udp_conf.use_udp6_tx_checksums =
+		vxlan_ts_cfg.port.family = AF_INET6;
+		vxlan_ts_cfg.port.use_udp6_tx_checksums =
 		    !!(flags & VXLAN_F_UDP_ZERO_CSUM6_TX);
-		udp_conf.use_udp6_rx_checksums =
+		vxlan_ts_cfg.port.use_udp6_rx_checksums =
 		    !!(flags & VXLAN_F_UDP_ZERO_CSUM6_RX);
 	} else {
-		udp_conf.family = AF_INET;
-		udp_conf.local_ip.s_addr = INADDR_ANY;
-		udp_conf.use_udp_checksums =
+		vxlan_ts_cfg.port.family = AF_INET;
+		vxlan_ts_cfg.port.local_ip.s_addr = INADDR_ANY;
+		vxlan_ts_cfg.port.use_udp_checksums =
 		    !!(flags & VXLAN_F_UDP_CSUM);
 	}
 
-	udp_conf.local_udp_port = port;
-
-	/* Open UDP socket */
-	err = udp_sock_create(net, &udp_conf, &sock);
-	if (err < 0)
-		return ERR_PTR(err);
-
-	/* Disable multicast loopback */
-	inet_sk(sock->sk)->mc_loop = 0;
-
-	return sock;
-}
+	vxlan_ts_cfg.port.local_udp_port = port;
+	vxlan_ts_cfg.rcv = (udp_tunnel_rcv_t *)rcv;
+	vxlan_ts_cfg.encap_rcv = vxlan_udp_encap_recv;
+	vxlan_ts_cfg.data = data;
 
-/* Create new listen socket if needed */
-static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
-					      vxlan_rcv_t *rcv, void *data,
-					      u32 flags)
-{
-	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
-	struct vxlan_sock *vs;
-	struct socket *sock;
-	struct sock *sk;
-	unsigned int h;
-	bool ipv6 = !!(flags & VXLAN_F_IPV6);
-
-	vs = kzalloc(sizeof(*vs), GFP_KERNEL);
+	vs = (struct vxlan_sock *)create_udp_tunnel_socket(net, sizeof(*vs),
+							   &vxlan_ts_cfg);
 	if (!vs)
 		return ERR_PTR(-ENOMEM);
 
@@ -2373,38 +2284,14 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
 
 	INIT_WORK(&vs->del_work, vxlan_del_work);
 
-	sock = vxlan_create_sock(net, ipv6, port, flags);
-	if (IS_ERR(sock)) {
-		kfree(vs);
-		return ERR_CAST(sock);
-	}
-
-	vs->sock = sock;
-	sk = sock->sk;
 	atomic_set(&vs->refcnt, 1);
-	vs->rcv = rcv;
-	vs->data = data;
-	rcu_assign_sk_user_data(vs->sock->sk, vs);
 
 	/* Initialize the vxlan udp offloads structure */
 	vs->udp_offloads.port = port;
 	vs->udp_offloads.callbacks.gro_receive  = vxlan_gro_receive;
 	vs->udp_offloads.callbacks.gro_complete = vxlan_gro_complete;
 
-	spin_lock(&vn->sock_lock);
-	hlist_add_head_rcu(&vs->hlist, vs_head(net, port));
 	vxlan_notify_add_rx_port(vs);
-	spin_unlock(&vn->sock_lock);
-
-	/* Mark socket as an encapsulation socket. */
-	udp_sk(sk)->encap_type = 1;
-	udp_sk(sk)->encap_rcv = vxlan_udp_encap_recv;
-#if IS_ENABLED(CONFIG_IPV6)
-	if (ipv6)
-		ipv6_stub->udpv6_encap_enable();
-	else
-#endif
-		udp_encap_enable();
 
 	return vs;
 }
@@ -2413,7 +2300,6 @@ struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
 				  vxlan_rcv_t *rcv, void *data,
 				  bool no_share, u32 flags)
 {
-	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
 	struct vxlan_sock *vs;
 
 	vs = vxlan_socket_create(net, port, rcv, data, flags);
@@ -2423,15 +2309,13 @@ struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
 	if (no_share)	/* Return error if sharing is not allowed. */
 		return vs;
 
-	spin_lock(&vn->sock_lock);
 	vs = vxlan_find_sock(net, port);
 	if (vs) {
-		if (vs->rcv == rcv)
+		if (vs->uts.rcv == (udp_tunnel_rcv_t *)rcv)
 			atomic_inc(&vs->refcnt);
 		else
 			vs = ERR_PTR(-EBUSY);
 	}
-	spin_unlock(&vn->sock_lock);
 
 	if (!vs)
 		vs = ERR_PTR(-EINVAL);
@@ -2450,10 +2334,10 @@ static void vxlan_sock_work(struct work_struct *work)
 	struct vxlan_sock *nvs;
 
 	nvs = vxlan_sock_add(net, port, vxlan_rcv, NULL, false, vxlan->flags);
-	spin_lock(&vn->sock_lock);
+	spin_lock(&vn->vxlan_list_lock);
 	if (!IS_ERR(nvs))
 		vxlan_vs_add_dev(nvs, vxlan);
-	spin_unlock(&vn->sock_lock);
+	spin_unlock(&vn->vxlan_list_lock);
 
 	dev_put(vxlan->dev);
 }
@@ -2620,10 +2504,10 @@ static void vxlan_dellink(struct net_device *dev, struct list_head *head)
 	struct vxlan_dev *vxlan = netdev_priv(dev);
 	struct vxlan_net *vn = net_generic(vxlan->net, vxlan_net_id);
 
-	spin_lock(&vn->sock_lock);
+	spin_lock(&vn->vxlan_list_lock);
 	if (!hlist_unhashed(&vxlan->hlist))
 		hlist_del_rcu(&vxlan->hlist);
-	spin_unlock(&vn->sock_lock);
+	spin_unlock(&vn->vxlan_list_lock);
 
 	list_del(&vxlan->next);
 	unregister_netdevice_queue(dev, head);
@@ -2781,13 +2665,9 @@ static struct notifier_block vxlan_notifier_block __read_mostly = {
 static __net_init int vxlan_init_net(struct net *net)
 {
 	struct vxlan_net *vn = net_generic(net, vxlan_net_id);
-	unsigned int h;
 
 	INIT_LIST_HEAD(&vn->vxlan_list);
-	spin_lock_init(&vn->sock_lock);
-
-	for (h = 0; h < PORT_HASH_SIZE; ++h)
-		INIT_HLIST_HEAD(&vn->sock_list[h]);
+	spin_lock_init(&vn->vxlan_list_lock);
 
 	return 0;
 }
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 60f9d4d..81ce6a0 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -4,26 +4,24 @@
 #include <linux/skbuff.h>
 #include <linux/netdevice.h>
 #include <linux/udp.h>
+#include <net/udp_tunnel.h>
 
 #define VNI_HASH_BITS	10
 #define VNI_HASH_SIZE	(1<<VNI_HASH_BITS)
 
-struct vxlan_sock;
-typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb, __be32 key);
-
-/* per UDP socket information */
+/* per vxlan socket information */
 struct vxlan_sock {
-	struct hlist_node hlist;
-	vxlan_rcv_t	 *rcv;
-	void		 *data;
+	struct udp_tunnel_sock uts;  /* Must be the first member */
 	struct work_struct del_work;
-	struct socket	 *sock;
 	struct rcu_head	  rcu;
 	struct hlist_head vni_list[VNI_HASH_SIZE];
 	atomic_t	  refcnt;
 	struct udp_offload udp_offloads;
 };
 
+typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb,
+			   __be32 key);
+
 #define VXLAN_F_LEARN			0x01
 #define VXLAN_F_PROXY			0x02
 #define VXLAN_F_RSC			0x04
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index d8b7e24..d523e74 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -59,7 +59,7 @@ static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
 static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
 {
 	struct ovs_key_ipv4_tunnel tun_key;
-	struct vport *vport = vs->data;
+	struct vport *vport = vs->uts.data;
 	struct iphdr *iph;
 	__be64 key;
 
@@ -74,7 +74,7 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
 static int vxlan_get_options(const struct vport *vport, struct sk_buff *skb)
 {
 	struct vxlan_port *vxlan_port = vxlan_vport(vport);
-	__be16 dst_port = inet_sk(vxlan_port->vs->sock->sk)->inet_sport;
+	__be16 dst_port = inet_sk(vxlan_port->vs->uts.sock->sk)->inet_sport;
 
 	if (nla_put_u16(skb, OVS_TUNNEL_ATTR_DST_PORT, ntohs(dst_port)))
 		return -EMSGSIZE;
@@ -105,6 +105,7 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
 		err = -EINVAL;
 		goto error;
 	}
+
 	a = nla_find_nested(options, OVS_TUNNEL_ATTR_DST_PORT);
 	if (a && nla_len(a) == sizeof(u16)) {
 		dst_port = nla_get_u16(a);
@@ -139,7 +140,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 {
 	struct net *net = ovs_dp_get_net(vport->dp);
 	struct vxlan_port *vxlan_port = vxlan_vport(vport);
-	__be16 dst_port = inet_sk(vxlan_port->vs->sock->sk)->inet_sport;
+	__be16 dst_port = inet_sk(vxlan_port->vs->uts.sock->sk)->inet_sport;
 	struct rtable *rt;
 	struct flowi4 fl;
 	__be16 src_port;
-- 
1.7.9.5

* [net-next 05/10] net: Add Geneve tunneling protocol driver
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
                   ` (3 preceding siblings ...)
  2014-07-22 10:19 ` [net-next 04/10] net: Refactor vxlan driver to make use of common UDP tunnel functions Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
  2014-07-22 23:12   ` Alexander Duyck
  2014-07-23 18:20   ` Stephen Hemminger
  2014-07-22 10:19 ` [net-next 06/10] openvswitch: Eliminate memset() from flow_extract Andy Zhou
                   ` (6 subsequent siblings)
  11 siblings, 2 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Andy Zhou, Jesse Gross

This adds device-level support for Geneve (Generic Network
Virtualization Encapsulation). The protocol is documented at
http://tools.ietf.org/html/draft-gross-geneve-00

This driver provides only protocol-layer Geneve support.
Open vSwitch can be used to configure, set up, and tear down
functional Geneve tunnels.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
---
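Note: for readers unfamiliar with the wire format, the 8-byte base
header drawn in include/net/geneve.h can also be built and parsed
byte by byte, without bitfields. The sketch below is a userspace
illustration only. gnv_build(), gnv_opts_len() and gnv_vni() are
hypothetical helpers that are not part of this patch, and the layout
is taken from the header diagram in the patch itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GNV_BASE_LEN 8	/* fixed Geneve header, without options */

/* Hypothetical helper: write the 8-byte base header into buf. */
static void gnv_build(uint8_t buf[GNV_BASE_LEN], unsigned int opts_len,
		      int oam, int critical, uint16_t proto,
		      const uint8_t vni[3])
{
	/* Opt Len is a 6-bit count of 4-byte words. */
	assert(opts_len % 4 == 0 && opts_len / 4 < 64);

	buf[0] = (uint8_t)(opts_len / 4);	/* Ver = 0 in the top 2 bits */
	buf[1] = (uint8_t)((oam ? 0x80 : 0) | (critical ? 0x40 : 0));
	buf[2] = (uint8_t)(proto >> 8);		/* network byte order */
	buf[3] = (uint8_t)(proto & 0xff);
	memcpy(&buf[4], vni, 3);		/* 24-bit VNI */
	buf[7] = 0;				/* reserved */
}

/* Length of the variable options in bytes, from the 6-bit Opt Len field. */
static unsigned int gnv_opts_len(const uint8_t *buf)
{
	return (buf[0] & 0x3f) * 4u;
}

/* 24-bit VNI as a host-order integer. */
static uint32_t gnv_vni(const uint8_t *buf)
{
	return ((uint32_t)buf[4] << 16) | ((uint32_t)buf[5] << 8) | buf[6];
}
```

The same multiply and divide by 4 appears in the driver, where
geneve_build_header() stores options_len / 4 and the receive path
recovers the byte count as opt_len * 4.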
 include/net/geneve.h     |   85 +++++++++++++++
 include/net/ip_tunnels.h |    2 +
 net/ipv4/Kconfig         |   14 +++
 net/ipv4/Makefile        |    1 +
 net/ipv4/geneve.c        |  273 ++++++++++++++++++++++++++++++++++++++++++++++
 net/openvswitch/Kconfig  |   11 ++
 net/openvswitch/vport.c  |    3 +
 7 files changed, 389 insertions(+)
 create mode 100644 include/net/geneve.h
 create mode 100644 net/ipv4/geneve.c

diff --git a/include/net/geneve.h b/include/net/geneve.h
new file mode 100644
index 0000000..4e3a301
--- /dev/null
+++ b/include/net/geneve.h
@@ -0,0 +1,85 @@
+#ifndef __NET_GENEVE_H
+#define __NET_GENEVE_H  1
+
+#include <net/udp_tunnel.h>
+
+struct geneve_sock {
+	struct udp_tunnel_sock  uts;
+	atomic_t		refcnt;
+	struct udp_offload	udp_offloads;
+};
+
+typedef void (geneve_rcv_t)(struct geneve_sock *gs, struct sk_buff *skb);
+
+/* Geneve Header:
+ *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *  |Ver|  Opt Len  |O|C|    Rsvd.  |          Protocol Type        |
+ *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *  |        Virtual Network Identifier (VNI)       |    Reserved   |
+ *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *  |                    Variable Length Options                    |
+ *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * Option Header:
+ *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *  |          Option Class         |      Type     |R|R|R| Length  |
+ *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *  |                      Variable Option Data                     |
+ *  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ */
+
+struct geneve_opt {
+	__be16	opt_class;
+	u8	type;
+#ifdef __LITTLE_ENDIAN_BITFIELD
+	u8	length:5;
+	u8	r3:1;
+	u8	r2:1;
+	u8	r1:1;
+#else
+	u8	r1:1;
+	u8	r2:1;
+	u8	r3:1;
+	u8	length:5;
+#endif
+	u8	opt_data[];
+};
+
+struct genevehdr {
+#ifdef __LITTLE_ENDIAN_BITFIELD
+	u8 opt_len:6;
+	u8 ver:2;
+	u8 rsvd1:6;
+	u8 critical:1;
+	u8 oam:1;
+#else
+	u8 ver:2;
+	u8 opt_len:6;
+	u8 oam:1;
+	u8 critical:1;
+	u8 rsvd1:6;
+#endif
+	__be16 proto_type;
+	u8 vni[3];
+	u8 rsvd2;
+	struct geneve_opt options[];
+};
+
+#define GENEVE_VER 0
+#define GENEVE_BASE_HLEN (sizeof(struct udphdr) + sizeof(struct genevehdr))
+
+#define GENEVE_CRIT_OPT_TYPE (1 << 7)
+
+struct geneve_sock *geneve_sock_add(struct net *net, __be16 port,
+				    geneve_rcv_t *rcv, void *data,
+				    bool no_share, bool ipv6);
+
+void geneve_sock_release(struct geneve_sock *vs);
+
+int geneve_xmit_skb(struct geneve_sock *gs,
+		   struct rtable *rt, struct sk_buff *skb,
+		   __be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
+		   __be16 src_port, __be16 dst_port, __be16 tun_flags,
+		   u8 vni[3], u8 opt_len, u8 *opt, bool xnet);
+
+#endif
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index a4daf9e..2e221cd 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -81,6 +81,8 @@ struct ip_tunnel {
 #define TUNNEL_VERSION	__cpu_to_be16(0x40)
 #define TUNNEL_NO_KEY	__cpu_to_be16(0x80)
 #define TUNNEL_DONT_FRAGMENT    __cpu_to_be16(0x0100)
+#define TUNNEL_OAM	__cpu_to_be16(0x0200)
+#define TUNNEL_CRIT_OPT	__cpu_to_be16(0x0400)
 
 struct tnl_ptk_info {
 	__be16 flags;
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index dbc10d8..69cb508 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -595,6 +595,20 @@ endchoice
 
 endif
 
+config GENEVE
+       tristate "Generic Network Virtualization Encapsulation (Geneve)"
+       depends on INET
+       select NET_IP_TUNNEL
+       select NET_UDP_TUNNEL
+       ---help---
+	  This allows one to create Geneve virtual interfaces that provide
+	  Layer 2 Networks over Layer 3 Networks. Geneve is often used
+	  to tunnel virtual network infrastructure in virtualized environments.
+	  For more information see:
+	    http://tools.ietf.org/html/draft-gross-geneve-00
+
+	  To compile this driver as a module, choose M here: the module
+
 config TCP_CONG_CUBIC
 	tristate
 	depends on !TCP_CONG_ADVANCED
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 8ee1cd4..cae2403 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -54,6 +54,7 @@ obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
 obj-$(CONFIG_TCP_CONG_ILLINOIS) += tcp_illinois.o
 obj-$(CONFIG_MEMCG_KMEM) += tcp_memcontrol.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
+obj-$(CONFIG_GENEVE) += geneve.o
 
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 		      xfrm4_output.o xfrm4_protocol.o
diff --git a/net/ipv4/geneve.c b/net/ipv4/geneve.c
new file mode 100644
index 0000000..2fda60e
--- /dev/null
+++ b/net/ipv4/geneve.c
@@ -0,0 +1,273 @@
+/* Copyright (c) 2014 Nicira, Inc.  */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/rculist.h>
+#include <linux/netdevice.h>
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/igmp.h>
+#include <linux/etherdevice.h>
+#include <linux/if_ether.h>
+#include <linux/if_vlan.h>
+#include <linux/hash.h>
+#include <linux/ethtool.h>
+#include <net/arp.h>
+#include <net/ndisc.h>
+#include <net/ip.h>
+#include <net/ip_tunnels.h>
+#include <net/icmp.h>
+#include <net/udp.h>
+#include <net/rtnetlink.h>
+#include <net/route.h>
+#include <net/dsfield.h>
+#include <net/inet_ecn.h>
+#include <net/net_namespace.h>
+#include <net/netns/generic.h>
+#include <net/geneve.h>
+#include <net/protocol.h>
+#include <net/udp_tunnel.h>
+#if IS_ENABLED(CONFIG_IPV6)
+#include <net/ipv6.h>
+#include <net/addrconf.h>
+#include <net/ip6_tunnel.h>
+#include <net/ip6_checksum.h>
+#endif
+
+static inline struct genevehdr *geneve_hdr(const struct sk_buff *skb)
+{
+	return (struct genevehdr *)(udp_hdr(skb) + 1);
+}
+
+/* Find geneve socket based on network namespace and UDP port */
+static struct geneve_sock *geneve_find_sock(struct net *net, __be16 port)
+{
+	return (struct geneve_sock *)udp_tunnel_find_sock(net, port);
+}
+
+static void geneve_build_header(struct genevehdr *geneveh,
+				__be16 tun_flags, u8 vni[3],
+				u8 options_len, u8 *options)
+{
+	geneveh->ver = GENEVE_VER;
+	geneveh->opt_len = options_len / 4;
+	geneveh->oam = !!(tun_flags & TUNNEL_OAM);
+	geneveh->critical = !!(tun_flags & TUNNEL_CRIT_OPT);
+	geneveh->rsvd1 = 0;
+	memcpy(geneveh->vni, vni, 3);
+	geneveh->proto_type = htons(ETH_P_TEB);
+	geneveh->rsvd2 = 0;
+
+	memcpy(geneveh->options, options, options_len);
+}
+
+/* Transmit a fully formatted Geneve frame.
+ *
+ * When calling this function, skb->data should point
+ * to the Geneve header, which is fully formed.
+ *
+ * This function will add other UDP tunnel headers.
+ */
+int geneve_xmit_skb(struct geneve_sock *gs,
+		    struct rtable *rt, struct sk_buff *skb,
+		    __be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
+		    __be16 src_port, __be16 dst_port,
+		    __be16 tun_flags, u8 vni[3], u8 opt_len, u8 *opt,
+		    bool xnet)
+{
+	struct genevehdr *gnvh;
+	int min_headroom;
+	int err;
+
+	skb = udp_tunnel_handle_offloads(skb, !udp_get_no_check6_tx(gs->uts.sock->sk));
+
+	min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len
+			+ GENEVE_BASE_HLEN + opt_len + sizeof(struct iphdr)
+			+ (vlan_tx_tag_present(skb) ? VLAN_HLEN : 0);
+
+	err = skb_cow_head(skb, min_headroom);
+	if (unlikely(err))
+		return err;
+
+	if (vlan_tx_tag_present(skb)) {
+		if (unlikely(!__vlan_put_tag(skb,
+					     skb->vlan_proto,
+					     vlan_tx_tag_get(skb)))) {
+			err = -ENOMEM;
+			return err;
+		}
+		skb->vlan_tci = 0;
+	}
+
+	gnvh = (struct genevehdr *)__skb_push(skb, sizeof(*gnvh));
+	geneve_build_header(gnvh, tun_flags, vni, opt_len, opt);
+
+	return udp_tunnel_xmit_skb(gs->uts.sock, rt, skb, src, dst,
+				   tos, ttl, df, src_port, dst_port, xnet);
+}
+EXPORT_SYMBOL_GPL(geneve_xmit_skb);
+
+static void geneve_notify_add_rx_port(struct geneve_sock *gs)
+{
+	struct net_device *dev;
+	struct sock *sk = gs->uts.sock->sk;
+	struct net *net = sock_net(sk);
+	sa_family_t sa_family = sk->sk_family;
+	__be16 port = inet_sk(sk)->inet_sport;
+	int err;
+
+	if (sa_family == AF_INET) {
+		err = udp_add_offload(&gs->udp_offloads);
+		if (err)
+			pr_warn("geneve: udp_add_offload failed with status %d\n", err);
+	}
+
+	rcu_read_lock();
+	for_each_netdev_rcu(net, dev) {
+		if (!dev->netdev_ops->ndo_add_udp_tunnel_port)
+			continue;
+
+		dev->netdev_ops->ndo_add_udp_tunnel_port(dev, sa_family, port,
+						       UDP_TUNNEL_TYPE_GENEVE);
+	}
+	rcu_read_unlock();
+}
+
+/* Callback from net/ipv4/udp.c to receive packets */
+static int geneve_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
+{
+	struct genevehdr *geneveh;
+	struct geneve_sock *gs;
+	int opts_len;
+
+	/* Need Geneve and inner Ethernet header to be present */
+	if (unlikely(!pskb_may_pull(skb, GENEVE_BASE_HLEN)))
+		goto error;
+
+	/* Return packets with reserved bits set */
+	geneveh = geneve_hdr(skb);
+
+	opts_len = geneveh->opt_len * 4;
+	if (iptunnel_pull_header(skb, GENEVE_BASE_HLEN + opts_len,
+				 htons(ETH_P_TEB)))
+		goto drop;
+
+	gs = rcu_dereference_sk_user_data(sk);
+	if (!gs)
+		goto drop;
+
+	skb_pop_rcv_encapsulation(skb);
+
+	gs->uts.rcv(&gs->uts, skb);
+	return 0;
+
+drop:
+	/* Consume bad packet */
+	kfree_skb(skb);
+	return 0;
+
+error:
+	kfree_skb(skb);
+	return 1;
+}
+
+/* Create new listen socket if needed */
+static struct geneve_sock *geneve_socket_create(struct net *net, __be16 port,
+						geneve_rcv_t *rcv, void *data,
+						bool ipv6)
+{
+	struct geneve_sock *gs;
+	struct udp_tunnel_socket_cfg geneve_ts_cfg;
+
+	memset(&geneve_ts_cfg, 0, sizeof(struct udp_tunnel_socket_cfg));
+
+	if (ipv6) {
+		geneve_ts_cfg.port.family = AF_INET6;
+	} else {
+		geneve_ts_cfg.port.family = AF_INET;
+		geneve_ts_cfg.port.local_ip.s_addr = INADDR_ANY;
+	}
+
+	geneve_ts_cfg.tunnel_type = UDP_TUNNEL_TYPE_GENEVE;
+	geneve_ts_cfg.port.local_udp_port = port;
+	geneve_ts_cfg.rcv = (udp_tunnel_rcv_t *)rcv;
+	geneve_ts_cfg.encap_rcv = geneve_udp_encap_recv;
+	geneve_ts_cfg.data = data;
+
+	gs = (struct geneve_sock *)create_udp_tunnel_socket(net, sizeof(*gs),
+							    &geneve_ts_cfg);
+	if (!gs)
+		return ERR_PTR(-ENOMEM);
+
+	atomic_set(&gs->refcnt, 1);
+
+	/* Initialize the geneve udp offloads structure */
+	gs->udp_offloads.port = port;
+	gs->udp_offloads.callbacks.gro_receive = NULL;
+	gs->udp_offloads.callbacks.gro_complete = NULL;
+
+	geneve_notify_add_rx_port(gs);
+
+	return gs;
+}
+
+struct geneve_sock *geneve_sock_add(struct net *net, __be16 port,
+				    geneve_rcv_t *rcv, void *data,
+				    bool no_share, bool ipv6)
+{
+	struct geneve_sock *gs;
+
+	gs = geneve_socket_create(net, port, rcv, data, ipv6);
+	if (!IS_ERR(gs))
+		return gs;
+
+	if (no_share)	/* Return error if sharing is not allowed. */
+		return ERR_PTR(-EINVAL);
+
+	gs = geneve_find_sock(net, port);
+	if (gs) {
+		if (gs->uts.rcv == (udp_tunnel_rcv_t *)rcv)
+			atomic_inc(&gs->refcnt);
+		else
+			gs = ERR_PTR(-EBUSY);
+	} else {
+		gs = ERR_PTR(-EINVAL);
+	}
+
+	return gs;
+}
+EXPORT_SYMBOL_GPL(geneve_sock_add);
+
+void geneve_sock_release(struct geneve_sock *gs)
+{
+	if (!atomic_dec_and_test(&gs->refcnt))
+		return;
+
+	udp_tunnel_sock_release(&gs->uts);
+}
+EXPORT_SYMBOL_GPL(geneve_sock_release);
+
+static int __init geneve_init_module(void)
+{
+	pr_info("Geneve driver\n");
+
+	return 0;
+}
+late_initcall(geneve_init_module);
+
+static void __exit geneve_cleanup_module(void)
+{
+}
+module_exit(geneve_cleanup_module);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Jesse Gross <jesse@nicira.com>");
+MODULE_DESCRIPTION("Driver for GENEVE encapsulated traffic");
+MODULE_ALIAS_RTNL_LINK("geneve");
diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 6ecf491..ba3bb82 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -54,3 +54,14 @@ config OPENVSWITCH_VXLAN
 	  Say N to exclude this support and reduce the binary size.
 
 	  If unsure, say Y.
+
+config OPENVSWITCH_GENEVE
+	bool "Open vSwitch Geneve tunneling support"
+	depends on INET
+	depends on OPENVSWITCH
+	depends on GENEVE && !(OPENVSWITCH=y && GENEVE=m)
+	default y
+	---help---
+	  If you say Y here, then Open vSwitch will be able to create Geneve vports.
+
+	  Say N to exclude this support and reduce the binary size.
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 42c0f4a..5b4cb82 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -48,6 +48,9 @@ static const struct vport_ops *vport_ops_list[] = {
 #ifdef CONFIG_OPENVSWITCH_VXLAN
 	&ovs_vxlan_vport_ops,
 #endif
+#ifdef CONFIG_OPENVSWITCH_GENEVE
+	&ovs_geneve_vport_ops,
+#endif
 };
 
 /* Protected by RCU read lock for reading, ovs_mutex for writing. */
-- 
1.7.9.5


* [net-next 06/10] openvswitch: Eliminate memset() from flow_extract.
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
                   ` (4 preceding siblings ...)
  2014-07-22 10:19 ` [net-next 05/10] net: Add Geneve tunneling protocol driver Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
  2014-07-22 10:19 ` [net-next 07/10] openvswitch: Add support for matching on OAM packets Andy Zhou
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Jesse Gross, Andy Zhou

From: Jesse Gross <jesse@nicira.com>

As new protocols are added, the size of the flow key tends to
increase although few protocols care about all of the fields. In
order to optimize this for hashing and matching, OVS uses a variable
length portion of the key. However, when fields are extracted from
the packet we must still zero out the entire key.

This is no longer necessary now that OVS implements masking. Any
fields (or holes in the structure) which are not part of a given
protocol will be by definition not part of the mask and zeroed out
during lookup. Furthermore, since masking already uses variable
length keys this zeroing operation automatically benefits as well.

In principle, the only thing that needs to be done at this point
is to remove the memset() at the beginning of flow extraction. However, some
fields assume that they are initialized to zero, which now must be
done explicitly. In addition, in the event of an error we must also
zero out corresponding fields to signal that there is no valid data
present. These increase the total amount of code but very little of
it is executed in non-error situations.

Removing the memset() reduces the profile of ovs_flow_extract()
from 0.64% to 0.56% when tested with large packets on a 10G link.

Suggested-by: Pravin Shelar <pshelar@nicira.com>
Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
---
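The masking argument in the commit message can be illustrated with a small
userspace sketch. This is not OVS code: struct toy_key, toy_mask_key() and
toy_keys_match() are hypothetical stand-ins, and real OVS compares only a
range of the key rather than the whole structure. The point it tries to show
is that bytes outside the mask never influence a lookup, so they do not need
to be zeroed when the key is extracted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, much-simplified stand-in for struct sw_flow_key. */
struct toy_key {
	uint16_t eth_type;
	uint16_t tp_src;
	uint16_t tp_dst;
	uint16_t unused;	/* field the matched protocol never fills in */
};

/* dst = src & mask, byte by byte. */
static void toy_mask_key(struct toy_key *dst, const struct toy_key *src,
			 const struct toy_key *mask)
{
	const uint8_t *s = (const uint8_t *)src;
	const uint8_t *m = (const uint8_t *)mask;
	uint8_t *d = (uint8_t *)dst;
	size_t i;

	for (i = 0; i < sizeof(*dst); i++)
		d[i] = s[i] & m[i];
}

/* Returns 1 when two extracted keys are equal under the mask, else 0. */
static int toy_keys_match(const struct toy_key *a, const struct toy_key *b,
			  const struct toy_key *mask)
{
	struct toy_key ma, mb;

	toy_mask_key(&ma, a, mask);
	toy_mask_key(&mb, b, mask);
	return memcmp(&ma, &mb, sizeof(ma)) == 0;
}
```

Two keys that differ only in tp_dst or unused compare equal as long as the
mask leaves those fields at zero, whatever garbage they happened to hold.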
 net/openvswitch/flow.c |   47 ++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 42 insertions(+), 5 deletions(-)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index d07ab53..7691b11 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -272,6 +272,8 @@ static int parse_ipv6hdr(struct sk_buff *skb, struct sw_flow_key *key)
 			key->ip.frag = OVS_FRAG_TYPE_LATER;
 		else
 			key->ip.frag = OVS_FRAG_TYPE_FIRST;
+	} else {
+		key->ip.frag = OVS_FRAG_TYPE_NONE;
 	}
 
 	nh_len = payload_ofs - nh_ofs;
@@ -356,6 +358,7 @@ static int parse_icmpv6(struct sk_buff *skb, struct sw_flow_key *key,
 	 */
 	key->tp.src = htons(icmp->icmp6_type);
 	key->tp.dst = htons(icmp->icmp6_code);
+	memset(&key->ipv6.nd, 0, sizeof(key->ipv6.nd));
 
 	if (icmp->icmp6_code == 0 &&
 	    (icmp->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION ||
@@ -447,14 +450,18 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 	int error;
 	struct ethhdr *eth;
 
-	memset(key, 0, sizeof(*key));
-
 	key->phy.priority = skb->priority;
 	if (OVS_CB(skb)->tun_key)
 		memcpy(&key->tun_key, OVS_CB(skb)->tun_key, sizeof(key->tun_key));
+	else
+		memset(&key->tun_key, 0, sizeof(key->tun_key));
+
 	key->phy.in_port = in_port;
 	key->phy.skb_mark = skb->mark;
 
+	/* Flags are always used as part of stats. */
+	key->tp.flags = 0;
+
 	skb_reset_mac_header(skb);
 
 	/* Link layer.  We are guaranteed to have at least the 14 byte Ethernet
@@ -469,6 +476,7 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 	 * update skb->csum here.
 	 */
 
+	key->eth.tci = 0;
 	if (vlan_tx_tag_present(skb))
 		key->eth.tci = htons(skb->vlan_tci);
 	else if (eth->h_proto == htons(ETH_P_8021Q))
@@ -489,6 +497,8 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 
 		error = check_iphdr(skb);
 		if (unlikely(error)) {
+			memset(&key->ip, 0, sizeof(key->ip));
+			memset(&key->ipv4, 0, sizeof(key->ipv4));
 			if (error == -EINVAL) {
 				skb->transport_header = skb->network_header;
 				error = 0;
@@ -512,6 +522,8 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 		if (nh->frag_off & htons(IP_MF) ||
 			 skb_shinfo(skb)->gso_type & SKB_GSO_UDP)
 			key->ip.frag = OVS_FRAG_TYPE_FIRST;
+		else
+			key->ip.frag = OVS_FRAG_TYPE_NONE;
 
 		/* Transport layer. */
 		if (key->ip.proto == IPPROTO_TCP) {
@@ -520,18 +532,24 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 				key->tp.src = tcp->source;
 				key->tp.dst = tcp->dest;
 				key->tp.flags = TCP_FLAGS_BE16(tcp);
+			} else {
+				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		} else if (key->ip.proto == IPPROTO_UDP) {
 			if (udphdr_ok(skb)) {
 				struct udphdr *udp = udp_hdr(skb);
 				key->tp.src = udp->source;
 				key->tp.dst = udp->dest;
+			} else {
+				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		} else if (key->ip.proto == IPPROTO_SCTP) {
 			if (sctphdr_ok(skb)) {
 				struct sctphdr *sctp = sctp_hdr(skb);
 				key->tp.src = sctp->source;
 				key->tp.dst = sctp->dest;
+			} else {
+				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		} else if (key->ip.proto == IPPROTO_ICMP) {
 			if (icmphdr_ok(skb)) {
@@ -541,16 +559,19 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 				 * them in 16-bit network byte order. */
 				key->tp.src = htons(icmp->type);
 				key->tp.dst = htons(icmp->code);
+			} else {
+				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		}
 
-	} else if ((key->eth.type == htons(ETH_P_ARP) ||
-		   key->eth.type == htons(ETH_P_RARP)) && arphdr_ok(skb)) {
+	} else if (key->eth.type == htons(ETH_P_ARP) ||
+		   key->eth.type == htons(ETH_P_RARP)) {
 		struct arp_eth_header *arp;
 
 		arp = (struct arp_eth_header *)skb_network_header(skb);
 
-		if (arp->ar_hrd == htons(ARPHRD_ETHER)
+		if (arphdr_ok(skb)
+				&& arp->ar_hrd == htons(ARPHRD_ETHER)
 				&& arp->ar_pro == htons(ETH_P_IP)
 				&& arp->ar_hln == ETH_ALEN
 				&& arp->ar_pln == 4) {
@@ -558,16 +579,24 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 			/* We only match on the lower 8 bits of the opcode. */
 			if (ntohs(arp->ar_op) <= 0xff)
 				key->ip.proto = ntohs(arp->ar_op);
+			else
+				key->ip.proto = 0;
+
 			memcpy(&key->ipv4.addr.src, arp->ar_sip, sizeof(key->ipv4.addr.src));
 			memcpy(&key->ipv4.addr.dst, arp->ar_tip, sizeof(key->ipv4.addr.dst));
 			ether_addr_copy(key->ipv4.arp.sha, arp->ar_sha);
 			ether_addr_copy(key->ipv4.arp.tha, arp->ar_tha);
+		} else {
+			memset(&key->ip, 0, sizeof(key->ip));
+			memset(&key->ipv4, 0, sizeof(key->ipv4));
 		}
 	} else if (key->eth.type == htons(ETH_P_IPV6)) {
 		int nh_len;             /* IPv6 Header + Extensions */
 
 		nh_len = parse_ipv6hdr(skb, key);
 		if (unlikely(nh_len < 0)) {
+			memset(&key->ip, 0, sizeof(key->ip));
+			memset(&key->ipv6.addr, 0, sizeof(key->ipv6.addr));
 			if (nh_len == -EINVAL) {
 				skb->transport_header = skb->network_header;
 				error = 0;
@@ -589,24 +618,32 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 				key->tp.src = tcp->source;
 				key->tp.dst = tcp->dest;
 				key->tp.flags = TCP_FLAGS_BE16(tcp);
+			} else {
+				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		} else if (key->ip.proto == NEXTHDR_UDP) {
 			if (udphdr_ok(skb)) {
 				struct udphdr *udp = udp_hdr(skb);
 				key->tp.src = udp->source;
 				key->tp.dst = udp->dest;
+			} else {
+				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		} else if (key->ip.proto == NEXTHDR_SCTP) {
 			if (sctphdr_ok(skb)) {
 				struct sctphdr *sctp = sctp_hdr(skb);
 				key->tp.src = sctp->source;
 				key->tp.dst = sctp->dest;
+			} else {
+				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		} else if (key->ip.proto == NEXTHDR_ICMP) {
 			if (icmp6hdr_ok(skb)) {
 				error = parse_icmpv6(skb, key, nh_len);
 				if (error)
 					return error;
+			} else {
+				memset(&key->tp, 0, sizeof(key->tp));
 			}
 		}
 	}
-- 
1.7.9.5


* [net-next 07/10] openvswitch: Add support for matching on OAM packets.
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
                   ` (5 preceding siblings ...)
  2014-07-22 10:19 ` [net-next 06/10] openvswitch: Eliminate memset() from flow_extract Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
  2014-07-22 10:19 ` [net-next 08/10] openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure Andy Zhou
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Jesse Gross, Andy Zhou

From: Jesse Gross <jesse@nicira.com>

Some tunnel formats have mechanisms for indicating that packets are
OAM frames that should be handled specially (either as high priority or
not forwarded beyond an endpoint). This provides support for allowing
those types of packets to be matched.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
---
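As a rough illustration of what the patch below does, here is a userspace
sketch of how an argument-less (flag) attribute maps onto a bit in the tunnel
flags and back again. All TOY_* names and bit values are made up for the
example; the real attribute and flag definitions are the ones in the diff and
in the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical attribute types, mirroring the shape of ovs_tunnel_key_attr. */
enum toy_tun_attr {
	TOY_ATTR_DONT_FRAGMENT,
	TOY_ATTR_CSUM,
	TOY_ATTR_OAM,
	TOY_ATTR_MAX
};

/* Hypothetical flag bits; the values are illustrative only. */
#define TOY_TUNNEL_DONT_FRAGMENT	0x1
#define TOY_TUNNEL_CSUM			0x2
#define TOY_TUNNEL_OAM			0x4

/* Parse direction: OR the bit for one flag attribute into *flags.
 * Returns 0 on success, -1 for an unknown attribute type.
 */
static int toy_attr_to_flag(int type, uint16_t *flags)
{
	switch (type) {
	case TOY_ATTR_DONT_FRAGMENT:
		*flags |= TOY_TUNNEL_DONT_FRAGMENT;
		break;
	case TOY_ATTR_CSUM:
		*flags |= TOY_TUNNEL_CSUM;
		break;
	case TOY_ATTR_OAM:
		*flags |= TOY_TUNNEL_OAM;
		break;
	default:
		return -1;
	}
	return 0;
}

/* Dump direction: list the attributes that the set bits translate to.
 * Writes at most max entries to out and returns how many were written.
 */
static int toy_flags_to_attrs(uint16_t flags, int *out, int max)
{
	int n = 0;

	if ((flags & TOY_TUNNEL_DONT_FRAGMENT) && n < max)
		out[n++] = TOY_ATTR_DONT_FRAGMENT;
	if ((flags & TOY_TUNNEL_CSUM) && n < max)
		out[n++] = TOY_ATTR_CSUM;
	if ((flags & TOY_TUNNEL_OAM) && n < max)
		out[n++] = TOY_ATTR_OAM;
	return n;
}
```

Because the attribute carries no payload, its presence alone sets the bit,
which is why the diff registers it with a length of zero.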
 include/uapi/linux/openvswitch.h |    1 +
 net/openvswitch/datapath.c       |    1 +
 net/openvswitch/flow_netlink.c   |    7 +++++++
 3 files changed, 9 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 0b979ee..5e83a95 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -301,6 +301,7 @@ enum ovs_tunnel_key_attr {
 	OVS_TUNNEL_KEY_ATTR_TTL,                /* u8 Tunnel IP TTL. */
 	OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT,      /* No argument, set DF. */
 	OVS_TUNNEL_KEY_ATTR_CSUM,               /* No argument. CSUM packet. */
+	OVS_TUNNEL_KEY_ATTR_OAM,                /* No argument. OAM frame.  */
 	__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 20f59b6..0ddb189 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -375,6 +375,7 @@ static size_t key_attr_size(void)
 		  + nla_total_size(1)   /* OVS_TUNNEL_KEY_ATTR_TTL */
 		  + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT */
 		  + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_CSUM */
+		  + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_OAM */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_IN_PORT */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_SKB_MARK */
 		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d757848..d5837d3 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -344,6 +344,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 			[OVS_TUNNEL_KEY_ATTR_TTL] = 1,
 			[OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT] = 0,
 			[OVS_TUNNEL_KEY_ATTR_CSUM] = 0,
+			[OVS_TUNNEL_KEY_ATTR_OAM] = 0,
 		};
 
 		if (type > OVS_TUNNEL_KEY_ATTR_MAX) {
@@ -388,6 +389,9 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 		case OVS_TUNNEL_KEY_ATTR_CSUM:
 			tun_flags |= TUNNEL_CSUM;
 			break;
+		case OVS_TUNNEL_KEY_ATTR_OAM:
+			tun_flags |= TUNNEL_OAM;
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -445,6 +449,9 @@ static int ipv4_tun_to_nlattr(struct sk_buff *skb,
 	if ((output->tun_flags & TUNNEL_CSUM) &&
 		nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_CSUM))
 		return -EMSGSIZE;
+	if ((output->tun_flags & TUNNEL_OAM) &&
+		nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_OAM))
+		return -EMSGSIZE;
 
 	nla_nest_end(skb, nla);
 	return 0;
-- 
1.7.9.5


* [net-next 08/10] openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure.
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
                   ` (6 preceding siblings ...)
  2014-07-22 10:19 ` [net-next 07/10] openvswitch: Add support for matching on OAM packets Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
  2014-07-22 10:19 ` [net-next 09/10] openvswitch: Factor out allocation and verification of actions Andy Zhou
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Jesse Gross, Andy Zhou

From: Jesse Gross <jesse@nicira.com>

Currently, the flow information that is matched for tunnels and
the tunnel data passed around with packets is the same. However,
as additional information is added this is not necessarily desirable;
pointers, for example, belong with the packet but not in the flow match.

This adds a new structure for tunnel metadata which currently contains
only the existing struct. This change is purely internal to the kernel
since the current OVS_KEY_ATTR_IPV4_TUNNEL is simply a compressed version
of OVS_KEY_ATTR_TUNNEL that is translated at flow setup.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
---
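The split described above can be sketched in userspace as follows. The toy_*
types are hypothetical and far smaller than the real structures; the sketch
only aims to show the idea that per-packet metadata wraps the match key, and
that flow extraction copies just the wrapped key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the part of the tunnel data that is matched. */
struct toy_tunnel_key {
	uint64_t tun_id;
	uint32_t ipv4_src;
	uint32_t ipv4_dst;
	uint16_t tun_flags;
	uint8_t  ipv4_tos;
	uint8_t  ipv4_ttl;
};

/* Per-packet metadata: for now only the key, but data that should not be
 * part of a flow match (pointers, for instance) could be added here later.
 */
struct toy_tunnel_info {
	struct toy_tunnel_key tunnel;
};

/* Hypothetical flow key that embeds only the matched part. */
struct toy_flow_key {
	struct toy_tunnel_key tun_key;
	uint32_t in_port;
};

/* Copy the matched part of the metadata into the flow key, or clear it
 * when the packet did not arrive over a tunnel (info == NULL).
 */
static void toy_extract_tunnel(struct toy_flow_key *key,
			       const struct toy_tunnel_info *info)
{
	if (info)
		memcpy(&key->tun_key, &info->tunnel, sizeof(key->tun_key));
	else
		memset(&key->tun_key, 0, sizeof(key->tun_key));
}
```

Callers keep handing around a struct toy_tunnel_info pointer, so fields can
be added to the wrapper later without growing the flow key.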
 include/uapi/linux/openvswitch.h |    2 +-
 net/openvswitch/actions.c        |    6 +++---
 net/openvswitch/datapath.h       |    3 ++-
 net/openvswitch/flow.c           |   11 +++++++----
 net/openvswitch/flow.h           |   22 +++++++++++++---------
 net/openvswitch/flow_netlink.c   |   38 +++++++++++++++++++++++++++++++-------
 net/openvswitch/vport-gre.c      |   29 +++++++++++++++--------------
 net/openvswitch/vport-vxlan.c    |   27 +++++++++++++++------------
 net/openvswitch/vport.c          |    4 ++--
 net/openvswitch/vport.h          |    2 +-
 10 files changed, 90 insertions(+), 54 deletions(-)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 5e83a95..3b72277 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -286,7 +286,7 @@ enum ovs_key_attr {
 	OVS_KEY_ATTR_TCP_FLAGS,	/* be16 TCP flags. */
 
 #ifdef __KERNEL__
-	OVS_KEY_ATTR_IPV4_TUNNEL,  /* struct ovs_key_ipv4_tunnel */
+	OVS_KEY_ATTR_TUNNEL_INFO,  /* struct ovs_tunnel_info */
 #endif
 	__OVS_KEY_ATTR_MAX
 };
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index e70d8b1..45292da 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -473,8 +473,8 @@ static int execute_set_action(struct sk_buff *skb,
 		skb->mark = nla_get_u32(nested_attr);
 		break;
 
-	case OVS_KEY_ATTR_IPV4_TUNNEL:
-		OVS_CB(skb)->tun_key = nla_data(nested_attr);
+	case OVS_KEY_ATTR_TUNNEL_INFO:
+		OVS_CB(skb)->tun_info = nla_data(nested_attr);
 		break;
 
 	case OVS_KEY_ATTR_ETHERNET:
@@ -578,7 +578,7 @@ int ovs_execute_actions(struct datapath *dp, struct sk_buff *skb)
 {
 	struct sw_flow_actions *acts = rcu_dereference(OVS_CB(skb)->flow->sf_acts);
 
-	OVS_CB(skb)->tun_key = NULL;
+	OVS_CB(skb)->tun_info = NULL;
 	return do_execute_actions(dp, skb, acts->actions,
 					 acts->actions_len, false);
 }
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 7ede507..3b81051 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -102,7 +102,8 @@ struct datapath {
 struct ovs_skb_cb {
 	struct sw_flow		*flow;
 	struct sw_flow_key	*pkt_key;
-	struct ovs_key_ipv4_tunnel  *tun_key;
+	struct ovs_tunnel_info  *tun_info;
+	struct vport	        *input_vport;
 };
 #define OVS_CB(skb) ((struct ovs_skb_cb *)(skb)->cb)
 
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 7691b11..3d0adc5 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -450,12 +450,15 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 	int error;
 	struct ethhdr *eth;
 
-	key->phy.priority = skb->priority;
-	if (OVS_CB(skb)->tun_key)
-		memcpy(&key->tun_key, OVS_CB(skb)->tun_key, sizeof(key->tun_key));
-	else
+	if (OVS_CB(skb)->tun_info) {
+		struct ovs_tunnel_info *tun_info = OVS_CB(skb)->tun_info;
+		memcpy(&key->tun_key, &tun_info->tunnel,
+			sizeof(key->tun_key));
+	} else {
 		memset(&key->tun_key, 0, sizeof(key->tun_key));
+	}
 
+	key->phy.priority = skb->priority;
 	key->phy.in_port = in_port;
 	key->phy.skb_mark = skb->mark;
 
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 5e5aaed..6261ad0 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -49,20 +49,24 @@ struct ovs_key_ipv4_tunnel {
 	u8   ipv4_ttl;
 } __packed __aligned(4); /* Minimize padding. */
 
-static inline void ovs_flow_tun_key_init(struct ovs_key_ipv4_tunnel *tun_key,
+struct ovs_tunnel_info {
+	struct ovs_key_ipv4_tunnel tunnel;
+};
+
+static inline void ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
 					 const struct iphdr *iph, __be64 tun_id,
 					 __be16 tun_flags)
 {
-	tun_key->tun_id = tun_id;
-	tun_key->ipv4_src = iph->saddr;
-	tun_key->ipv4_dst = iph->daddr;
-	tun_key->ipv4_tos = iph->tos;
-	tun_key->ipv4_ttl = iph->ttl;
-	tun_key->tun_flags = tun_flags;
+	tun_info->tunnel.tun_id = tun_id;
+	tun_info->tunnel.ipv4_src = iph->saddr;
+	tun_info->tunnel.ipv4_dst = iph->daddr;
+	tun_info->tunnel.ipv4_tos = iph->tos;
+	tun_info->tunnel.ipv4_ttl = iph->ttl;
+	tun_info->tunnel.tun_flags = tun_flags;
 
 	/* clear struct padding. */
-	memset((unsigned char *) tun_key + OVS_TUNNEL_KEY_SIZE, 0,
-	       sizeof(*tun_key) - OVS_TUNNEL_KEY_SIZE);
+	memset((unsigned char *) &tun_info->tunnel + OVS_TUNNEL_KEY_SIZE, 0,
+	       sizeof(tun_info->tunnel) - OVS_TUNNEL_KEY_SIZE);
 }
 
 struct sw_flow_key {
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d5837d3..aa7c3d5 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -1134,13 +1134,14 @@ out:
 	return  (struct nlattr *) ((unsigned char *)(*sfa) + next_offset);
 }
 
-static int add_action(struct sw_flow_actions **sfa, int attrtype, void *data, int len)
+static struct nlattr *__add_action(struct sw_flow_actions **sfa, int attrtype,
+				   void *data, int len)
 {
 	struct nlattr *a;
 
 	a = reserve_sfa_size(sfa, nla_attr_size(len));
 	if (IS_ERR(a))
-		return PTR_ERR(a);
+		return a;
 
 	a->nla_type = attrtype;
 	a->nla_len = nla_attr_size(len);
@@ -1149,6 +1150,18 @@ static int add_action(struct sw_flow_actions **sfa, int attrtype, void *data, in
 		memcpy(nla_data(a), data, len);
 	memset((unsigned char *) a + a->nla_len, 0, nla_padlen(len));
 
+	return a;
+}
+
+static int add_action(struct sw_flow_actions **sfa, int attrtype,
+		      void *data, int len)
+{
+	struct nlattr *a;
+
+	a = __add_action(sfa, attrtype, data, len);
+	if (IS_ERR(a))
+		return PTR_ERR(a);
+
 	return 0;
 }
 
@@ -1254,6 +1267,8 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 {
 	struct sw_flow_match match;
 	struct sw_flow_key key;
+	struct ovs_tunnel_info *tun_info;
+	struct nlattr *a;
 	int err, start;
 
 	ovs_match_init(&match, &key, NULL);
@@ -1265,8 +1280,14 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	if (start < 0)
 		return start;
 
-	err = add_action(sfa, OVS_KEY_ATTR_IPV4_TUNNEL, &match.key->tun_key,
-			sizeof(match.key->tun_key));
+	a = __add_action(sfa, OVS_KEY_ATTR_TUNNEL_INFO, NULL,
+			sizeof(*tun_info));
+	if (IS_ERR(a))
+		return PTR_ERR(a);
+
+	tun_info = nla_data(a);
+	tun_info->tunnel = key.tun_key;
+
 	add_nested_action_end(*sfa, start);
 
 	return err;
@@ -1532,17 +1553,20 @@ static int set_action_to_attr(const struct nlattr *a, struct sk_buff *skb)
 	int err;
 
 	switch (key_type) {
-	case OVS_KEY_ATTR_IPV4_TUNNEL:
+	case OVS_KEY_ATTR_TUNNEL_INFO: {
+		struct ovs_tunnel_info *tun_info = nla_data(ovs_key);
+
 		start = nla_nest_start(skb, OVS_ACTION_ATTR_SET);
 		if (!start)
 			return -EMSGSIZE;
 
-		err = ipv4_tun_to_nlattr(skb, nla_data(ovs_key),
-					     nla_data(ovs_key));
+		err = ipv4_tun_to_nlattr(skb, &tun_info->tunnel,
+					 &tun_info->tunnel);
 		if (err)
 			return err;
 		nla_nest_end(skb, start);
 		break;
+	}
 	default:
 		if (nla_put(skb, OVS_ACTION_ATTR_SET, nla_len(a), ovs_key))
 			return -EMSGSIZE;
diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
index f49148a..d4fcbb2 100644
--- a/net/openvswitch/vport-gre.c
+++ b/net/openvswitch/vport-gre.c
@@ -63,7 +63,7 @@ static __be16 filter_tnl_flags(__be16 flags)
 static struct sk_buff *__build_header(struct sk_buff *skb,
 				      int tunnel_hlen)
 {
-	const struct ovs_key_ipv4_tunnel *tun_key = OVS_CB(skb)->tun_key;
+	const struct ovs_key_ipv4_tunnel *tun_key = &OVS_CB(skb)->tun_info->tunnel;
 	struct tnl_ptk_info tpi;
 
 	skb = gre_handle_offloads(skb, !!(tun_key->tun_flags & TUNNEL_CSUM));
@@ -92,7 +92,7 @@ static __be64 key_to_tunnel_id(__be32 key, __be32 seq)
 static int gre_rcv(struct sk_buff *skb,
 		   const struct tnl_ptk_info *tpi)
 {
-	struct ovs_key_ipv4_tunnel tun_key;
+	struct ovs_tunnel_info tun_info;
 	struct ovs_net *ovs_net;
 	struct vport *vport;
 	__be64 key;
@@ -103,10 +103,10 @@ static int gre_rcv(struct sk_buff *skb,
 		return PACKET_REJECT;
 
 	key = key_to_tunnel_id(tpi->key, tpi->seq);
-	ovs_flow_tun_key_init(&tun_key, ip_hdr(skb), key,
-			      filter_tnl_flags(tpi->flags));
+	ovs_flow_tun_info_init(&tun_info, ip_hdr(skb), key,
+			       filter_tnl_flags(tpi->flags));
 
-	ovs_vport_receive(vport, skb, &tun_key);
+	ovs_vport_receive(vport, skb, &tun_info);
 	return PACKET_RCVD;
 }
 
@@ -128,6 +128,7 @@ static int gre_err(struct sk_buff *skb, u32 info,
 
 static int gre_tnl_send(struct vport *vport, struct sk_buff *skb)
 {
+	const struct ovs_key_ipv4_tunnel *tun_key = &OVS_CB(skb)->tun_info->tunnel;
 	struct net *net = ovs_dp_get_net(vport->dp);
 	struct flowi4 fl;
 	struct rtable *rt;
@@ -136,16 +137,16 @@ static int gre_tnl_send(struct vport *vport, struct sk_buff *skb)
 	__be16 df;
 	int err;
 
-	if (unlikely(!OVS_CB(skb)->tun_key)) {
+	if (unlikely(!tun_key)) {
 		err = -EINVAL;
 		goto error;
 	}
 
 	/* Route lookup */
 	memset(&fl, 0, sizeof(fl));
-	fl.daddr = OVS_CB(skb)->tun_key->ipv4_dst;
-	fl.saddr = OVS_CB(skb)->tun_key->ipv4_src;
-	fl.flowi4_tos = RT_TOS(OVS_CB(skb)->tun_key->ipv4_tos);
+	fl.daddr = tun_key->ipv4_dst;
+	fl.saddr = tun_key->ipv4_src;
+	fl.flowi4_tos = RT_TOS(tun_key->ipv4_tos);
 	fl.flowi4_mark = skb->mark;
 	fl.flowi4_proto = IPPROTO_GRE;
 
@@ -153,7 +154,7 @@ static int gre_tnl_send(struct vport *vport, struct sk_buff *skb)
 	if (IS_ERR(rt))
 		return PTR_ERR(rt);
 
-	tunnel_hlen = ip_gre_calc_hlen(OVS_CB(skb)->tun_key->tun_flags);
+	tunnel_hlen = ip_gre_calc_hlen(tun_key->tun_flags);
 
 	min_headroom = LL_RESERVED_SPACE(rt->dst.dev) + rt->dst.header_len
 			+ tunnel_hlen + sizeof(struct iphdr)
@@ -185,15 +186,15 @@ static int gre_tnl_send(struct vport *vport, struct sk_buff *skb)
 		goto err_free_rt;
 	}
 
-	df = OVS_CB(skb)->tun_key->tun_flags & TUNNEL_DONT_FRAGMENT ?
+	df = tun_key->tun_flags & TUNNEL_DONT_FRAGMENT ?
 		htons(IP_DF) : 0;
 
 	skb->ignore_df = 1;
 
 	return iptunnel_xmit(skb->sk, rt, skb, fl.saddr,
-			     OVS_CB(skb)->tun_key->ipv4_dst, IPPROTO_GRE,
-			     OVS_CB(skb)->tun_key->ipv4_tos,
-			     OVS_CB(skb)->tun_key->ipv4_ttl, df, false);
+			     tun_key->ipv4_dst, IPPROTO_GRE,
+			     tun_key->ipv4_tos,
+			     tun_key->ipv4_ttl, df, false);
 err_free_rt:
 	ip_rt_put(rt);
 error:
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index d523e74..3835143 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -58,7 +58,7 @@ static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
 /* Called with rcu_read_lock and BH disabled. */
 static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
 {
-	struct ovs_key_ipv4_tunnel tun_key;
+	struct ovs_tunnel_info tun_info;
 	struct vport *vport = vs->uts.data;
 	struct iphdr *iph;
 	__be64 key;
@@ -66,9 +66,9 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
 	/* Save outer tunnel values */
 	iph = ip_hdr(skb);
 	key = cpu_to_be64(ntohl(vx_vni) >> 8);
-	ovs_flow_tun_key_init(&tun_key, iph, key, TUNNEL_KEY);
+	ovs_flow_tun_info_init(&tun_info, iph, key, TUNNEL_KEY);
 
-	ovs_vport_receive(vport, skb, &tun_key);
+	ovs_vport_receive(vport, skb, &tun_info);
 }
 
 static int vxlan_get_options(const struct vport *vport, struct sk_buff *skb)
@@ -138,6 +138,7 @@ error:
 
 static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 {
+	struct ovs_key_ipv4_tunnel *tun_key;
 	struct net *net = ovs_dp_get_net(vport->dp);
 	struct vxlan_port *vxlan_port = vxlan_vport(vport);
 	__be16 dst_port = inet_sk(vxlan_port->vs->uts.sock->sk)->inet_sport;
@@ -147,16 +148,18 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 	__be16 df;
 	int err;
 
-	if (unlikely(!OVS_CB(skb)->tun_key)) {
+	if (unlikely(!OVS_CB(skb)->tun_info)) {
 		err = -EINVAL;
 		goto error;
 	}
 
+	tun_key = &OVS_CB(skb)->tun_info->tunnel;
+
 	/* Route lookup */
 	memset(&fl, 0, sizeof(fl));
-	fl.daddr = OVS_CB(skb)->tun_key->ipv4_dst;
-	fl.saddr = OVS_CB(skb)->tun_key->ipv4_src;
-	fl.flowi4_tos = RT_TOS(OVS_CB(skb)->tun_key->ipv4_tos);
+	fl.daddr = tun_key->ipv4_dst;
+	fl.saddr = tun_key->ipv4_src;
+	fl.flowi4_tos = RT_TOS(tun_key->ipv4_tos);
 	fl.flowi4_mark = skb->mark;
 	fl.flowi4_proto = IPPROTO_UDP;
 
@@ -166,7 +169,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 		goto error;
 	}
 
-	df = OVS_CB(skb)->tun_key->tun_flags & TUNNEL_DONT_FRAGMENT ?
+	df = tun_key->tun_flags & TUNNEL_DONT_FRAGMENT ?
 		htons(IP_DF) : 0;
 
 	skb->ignore_df = 1;
@@ -174,11 +177,11 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
 	src_port = udp_flow_src_port(net, skb, 0, 0, true);
 
 	err = vxlan_xmit_skb(vxlan_port->vs, rt, skb,
-			     fl.saddr, OVS_CB(skb)->tun_key->ipv4_dst,
-			     OVS_CB(skb)->tun_key->ipv4_tos,
-			     OVS_CB(skb)->tun_key->ipv4_ttl, df,
+			     fl.saddr, tun_key->ipv4_dst,
+			     tun_key->ipv4_tos,
+			     tun_key->ipv4_ttl, df,
 			     src_port, dst_port,
-			     htonl(be64_to_cpu(OVS_CB(skb)->tun_key->tun_id) << 8),
+			     htonl(be64_to_cpu(tun_key->tun_id) << 8),
 			     false);
 	if (err < 0)
 		ip_rt_put(rt);
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 5b4cb82..39e2c9c 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -340,7 +340,7 @@ int ovs_vport_get_options(const struct vport *vport, struct sk_buff *skb)
  * skb->data should point to the Ethernet header.
  */
 void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
-		       struct ovs_key_ipv4_tunnel *tun_key)
+		       struct ovs_tunnel_info *tun_info)
 {
 	struct pcpu_sw_netstats *stats;
 
@@ -350,7 +350,7 @@ void ovs_vport_receive(struct vport *vport, struct sk_buff *skb,
 	stats->rx_bytes += skb->len;
 	u64_stats_update_end(&stats->syncp);
 
-	OVS_CB(skb)->tun_key = tun_key;
+	OVS_CB(skb)->tun_info = tun_info;
 	ovs_dp_process_received_packet(vport, skb);
 }
 
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 8d721e6..400cd1e 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -191,7 +191,7 @@ static inline struct vport *vport_from_priv(void *priv)
 }
 
 void ovs_vport_receive(struct vport *, struct sk_buff *,
-		       struct ovs_key_ipv4_tunnel *);
+		       struct ovs_tunnel_info *);
 
 /* List of statically compiled vport implementations.  Don't forget to also
  * add yours to the list at the top of vport.c. */
-- 
1.7.9.5


* [net-next 09/10] openvswitch: Factor out allocation and verification of actions.
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
                   ` (7 preceding siblings ...)
  2014-07-22 10:19 ` [net-next 08/10] openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
  2014-07-22 10:19 ` [net-next 10/10] openvswitch: Add support for Geneve tunneling Andy Zhou
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Jesse Gross, Andy Zhou

From: Jesse Gross <jesse@nicira.com>

As the size of the flow key grows, it can put some pressure on the
stack. This is particularly true in ovs_flow_cmd_set(), which needs several
copies of the key on the stack. One of those uses is logically separate,
so this factors it out to reduce stack pressure and improve readability.

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
---
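The refactoring pattern can be sketched in userspace like this. Everything
named toy_* is hypothetical, including the validation rule; errors are
reported through an out parameter instead of the kernel's ERR_PTR()
convention. The point is that the large temporary (masked_key) lives only in
the helper's stack frame, so the caller no longer carries it for its whole
lifetime.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define TOY_KEY_LEN 256

/* Hypothetical large key, standing in for struct sw_flow_key. */
struct toy_key {
	unsigned char bytes[TOY_KEY_LEN];
};

struct toy_actions {
	size_t len;
	unsigned char data[];
};

static void toy_mask_key(struct toy_key *dst, const struct toy_key *src,
			 const struct toy_key *mask)
{
	size_t i;

	for (i = 0; i < TOY_KEY_LEN; i++)
		dst->bytes[i] = src->bytes[i] & mask->bytes[i];
}

/* Made-up rule: action code 1 is only safe when byte 0 of the masked key
 * is non-zero. Copies the actions on success.
 */
static int toy_copy_actions(const unsigned char *raw, size_t len,
			    const struct toy_key *masked,
			    struct toy_actions *acts)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (raw[i] == 1 && masked->bytes[0] == 0)
			return -EINVAL;

	memcpy(acts->data, raw, len);
	acts->len = len;
	return 0;
}

/* Allocate and validate in one place. Returns NULL and sets *error on
 * failure; the caller owns (and must free) the result on success.
 */
static struct toy_actions *toy_get_actions(const unsigned char *raw,
					   size_t len,
					   const struct toy_key *key,
					   const struct toy_key *mask,
					   int *error)
{
	struct toy_key masked_key;	/* on the stack only during this call */
	struct toy_actions *acts;

	acts = malloc(sizeof(*acts) + len);
	if (!acts) {
		*error = -ENOMEM;
		return NULL;
	}

	toy_mask_key(&masked_key, key, mask);
	*error = toy_copy_actions(raw, len, &masked_key, acts);
	if (*error) {
		free(acts);
		return NULL;
	}

	return acts;
}
```

The helper also frees the allocation on its own error path, so the caller
only has to check the return value.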
 net/openvswitch/datapath.c |   38 +++++++++++++++++++++++++++-----------
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 0ddb189..daa935f 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -929,11 +929,34 @@ error:
 	return error;
 }
 
+static struct sw_flow_actions *get_flow_actions(const struct nlattr *a,
+						const struct sw_flow_key *key,
+						const struct sw_flow_mask *mask)
+{
+	struct sw_flow_actions *acts;
+	struct sw_flow_key masked_key;
+	int error;
+
+	acts = ovs_nla_alloc_flow_actions(nla_len(a));
+	if (IS_ERR(acts))
+		return acts;
+
+	ovs_flow_mask_key(&masked_key, key, mask);
+	error = ovs_nla_copy_actions(a, &masked_key, 0, &acts);
+	if (error) {
+		OVS_NLERR("Flow actions may not be safe on all matching packets.\n");
+		kfree(acts);
+		return ERR_PTR(error);
+	}
+
+	return acts;
+}
+
 static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 {
 	struct nlattr **a = info->attrs;
 	struct ovs_header *ovs_header = info->userhdr;
-	struct sw_flow_key key, masked_key;
+	struct sw_flow_key key;
 	struct sw_flow *flow;
 	struct sw_flow_mask mask;
 	struct sk_buff *reply = NULL;
@@ -955,17 +978,10 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 
 	/* Validate actions. */
 	if (a[OVS_FLOW_ATTR_ACTIONS]) {
-		acts = ovs_nla_alloc_flow_actions(nla_len(a[OVS_FLOW_ATTR_ACTIONS]));
-		error = PTR_ERR(acts);
-		if (IS_ERR(acts))
+		acts = get_flow_actions(a[OVS_FLOW_ATTR_ACTIONS], &key, &mask);
+		if (IS_ERR(acts)) {
+			error = PTR_ERR(acts);
 			goto error;
-
-		ovs_flow_mask_key(&masked_key, &key, &mask);
-		error = ovs_nla_copy_actions(a[OVS_FLOW_ATTR_ACTIONS],
-					     &masked_key, 0, &acts);
-		if (error) {
-			OVS_NLERR("Flow actions may not be safe on all matching packets.\n");
-			goto err_kfree_acts;
 		}
 	}
 
-- 
1.7.9.5

* [net-next 10/10] openvswitch: Add support for Geneve tunneling.
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
                   ` (8 preceding siblings ...)
  2014-07-22 10:19 ` [net-next 09/10] openvswitch: Factor out allocation and verification of actions Andy Zhou
@ 2014-07-22 10:19 ` Andy Zhou
  2014-07-23 20:29   ` Tom Herbert
  2014-07-22 10:54 ` [net-next 00/10] Add Geneve Varka Bhadram
  2014-07-24  6:58 ` Or Gerlitz
  11 siblings, 1 reply; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 10:19 UTC (permalink / raw)
  To: davem; +Cc: netdev, Jesse Gross, Andy Zhou

From: Jesse Gross <jesse@nicira.com>

The Openvswitch implementation is completely agnostic to the options
that are in use and can handle newly defined options without
further work. It does this by simply matching on a byte array
of options and allowing userspace to set up flows on this array.
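
As a rough illustration of the byte-array approach (a sketch under
assumptions, not the datapath code): the comment next to GENEVE_OPTS in
flow.h says options shorter than the maximum are stored at the end of
the array. The userspace sketch below mimics that layout. The demo_*
names, the struct layout and the byte-wise masked compare are
hypothetical; the real flow lookup is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_OPTS_MAX 255

/* Hypothetical, cut-down flow key: options first, then the fields that
 * follow them. Keys must be zero-initialised with memset() before use.
 */
struct demo_flow_key {
	uint8_t tun_opts[DEMO_OPTS_MAX];
	uint8_t tun_opts_len;
	uint32_t tun_id;
};

/* Return where options of opt_len bytes start: right-aligned, so they
 * always end exactly where tun_opts_len begins.
 */
static uint8_t *demo_opts(struct demo_flow_key *key, size_t opt_len)
{
	assert(opt_len <= sizeof(key->tun_opts));
	return key->tun_opts + sizeof(key->tun_opts) - opt_len;
}

static void demo_set_opts(struct demo_flow_key *key, const uint8_t *opts,
			  size_t opt_len)
{
	memcpy(demo_opts(key, opt_len), opts, opt_len);
	key->tun_opts_len = (uint8_t)opt_len;
}

/* Compare two keys under a mask over the byte range [start, end). */
static int demo_masked_equal(const struct demo_flow_key *a,
			     const struct demo_flow_key *b,
			     const struct demo_flow_key *mask,
			     size_t start, size_t end)
{
	const uint8_t *pa = (const uint8_t *)a;
	const uint8_t *pb = (const uint8_t *)b;
	const uint8_t *pm = (const uint8_t *)mask;
	size_t i;

	for (i = start; i < end; i++)
		if ((pa[i] ^ pb[i]) & pm[i])
			return 0;
	return 1;
}
```

Because the options are right-aligned, a flow with 8 bytes of options
only needs bytes 247..255 of the key compared (the options plus the
length byte), rather than the full 255-byte array.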

Userspace currently implements support only for the basic version of
Geneve. It can work with the base header (including the VNI) and
is capable of parsing options but does not currently support any
particular option definitions. Over time, the intention is to
allow options to be matched through OpenFlow without requiring
explicit support in OVS userspace.
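
To show what "parsing options without knowing any option definitions"
can look like, here is a small userspace sketch modelled on the walk in
validate_and_copy_set_tun() from this patch. It is not that code. It
assumes the generic Geneve option header layout (2 bytes of option
class, 1 byte of type whose top bit marks a critical option, 1 byte
whose low 5 bits give the data length in 4-byte units), and the demo_*
names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_OPT_HDR_LEN   4     /* class (2) + type (1) + length (1) */
#define DEMO_CRIT_OPT_TYPE 0x80  /* assumed: top bit of the type byte */

/* Walk a buffer of options, checking only that each option fits.
 * Sets *has_crit if any option carries the critical bit.
 * Returns 0 on success or -EINVAL if the buffer is malformed.
 */
static int demo_walk_opts(const uint8_t *opts, size_t opts_len, int *has_crit)
{
	assert(opts && has_crit);
	*has_crit = 0;

	if (opts_len % 4 != 0)
		return -EINVAL;

	while (opts_len > 0) {
		size_t len;

		if (opts_len < DEMO_OPT_HDR_LEN)
			return -EINVAL;

		/* Data length is stored in units of 4 bytes. */
		len = DEMO_OPT_HDR_LEN + (size_t)(opts[3] & 0x1f) * 4;
		if (len > opts_len)
			return -EINVAL;

		if (opts[2] & DEMO_CRIT_OPT_TYPE)
			*has_crit = 1;

		opts += len;
		opts_len -= len;
	}

	return 0;
}
```

Nothing here interprets the option class or the option data, which is
why newly defined options need no further work in the walker.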

Signed-off-by: Jesse Gross <jesse@nicira.com>
Signed-off-by: Andy Zhou <azhou@nicira.com>
---
 include/uapi/linux/openvswitch.h |    2 +
 net/openvswitch/Makefile         |    5 +
 net/openvswitch/datapath.c       |   32 +++--
 net/openvswitch/flow.c           |   10 ++
 net/openvswitch/flow.h           |   19 ++-
 net/openvswitch/flow_netlink.c   |  143 ++++++++++++++++++---
 net/openvswitch/flow_netlink.h   |    2 +-
 net/openvswitch/vport-geneve.c   |  258 ++++++++++++++++++++++++++++++++++++++
 net/openvswitch/vport-gre.c      |    2 +-
 net/openvswitch/vport-vxlan.c    |    2 +-
 net/openvswitch/vport.c          |    1 +
 net/openvswitch/vport.h          |    1 +
 12 files changed, 446 insertions(+), 31 deletions(-)
 create mode 100644 net/openvswitch/vport-geneve.c

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 3b72277..0c6e846 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -189,6 +189,7 @@ enum ovs_vport_type {
 	OVS_VPORT_TYPE_INTERNAL, /* network device implemented by datapath */
 	OVS_VPORT_TYPE_GRE,      /* GRE tunnel. */
 	OVS_VPORT_TYPE_VXLAN,	 /* VXLAN tunnel. */
+	OVS_VPORT_TYPE_GENEVE = 6,   /* Geneve tunnel. */
 	__OVS_VPORT_TYPE_MAX
 };
 
@@ -302,6 +303,7 @@ enum ovs_tunnel_key_attr {
 	OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT,      /* No argument, set DF. */
 	OVS_TUNNEL_KEY_ATTR_CSUM,               /* No argument. CSUM packet. */
 	OVS_TUNNEL_KEY_ATTR_OAM,                /* No argument. OAM frame.  */
+	OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,        /* Array of Geneve options.  */
 	__OVS_TUNNEL_KEY_ATTR_MAX
 };
 
diff --git a/net/openvswitch/Makefile b/net/openvswitch/Makefile
index 3591cb5..2bbfc32 100644
--- a/net/openvswitch/Makefile
+++ b/net/openvswitch/Makefile
@@ -13,6 +13,7 @@ openvswitch-y := \
 	flow_table.o \
 	vport.o \
 	vport-internal_dev.o \
+	vport-geneve.o  \
 	vport-netdev.o
 
 ifneq ($(CONFIG_OPENVSWITCH_VXLAN),)
@@ -22,3 +23,7 @@ endif
 ifneq ($(CONFIG_OPENVSWITCH_GRE),)
 openvswitch-y += vport-gre.o
 endif
+
+ifneq ($(CONFIG_OPENVSWITCH_GENEVE),)
+openvswitch-y += vport-geneve.o
+endif
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index daa935f..29f877e 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -376,6 +376,7 @@ static size_t key_attr_size(void)
 		  + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT */
 		  + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_CSUM */
 		  + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_OAM */
+		  + nla_total_size(256) /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_IN_PORT */
 		+ nla_total_size(4)   /* OVS_KEY_ATTR_SKB_MARK */
 		+ nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
@@ -465,7 +466,7 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
 	upcall->dp_ifindex = dp_ifindex;
 
 	nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_KEY);
-	ovs_nla_put_flow(upcall_info->key, upcall_info->key, user_skb);
+	ovs_nla_put_flow(dp, upcall_info->key, upcall_info->key, user_skb);
 	nla_nest_end(user_skb, nla);
 
 	if (upcall_info->userdata)
@@ -662,7 +663,8 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts)
 }
 
 /* Called with ovs_mutex or RCU read lock. */
-static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
+static int ovs_flow_cmd_fill_info(struct datapath *dp,
+				  const struct sw_flow *flow, int dp_ifindex,
 				  struct sk_buff *skb, u32 portid,
 				  u32 seq, u32 flags, u8 cmd)
 {
@@ -686,7 +688,8 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 	if (!nla)
 		goto nla_put_failure;
 
-	err = ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key, skb);
+	err = ovs_nla_put_flow(dp, &flow->unmasked_key,
+			       &flow->unmasked_key, skb);
 	if (err)
 		goto error;
 	nla_nest_end(skb, nla);
@@ -695,7 +698,7 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
 	if (!nla)
 		goto nla_put_failure;
 
-	err = ovs_nla_put_flow(&flow->key, &flow->mask->key, skb);
+	err = ovs_nla_put_flow(dp, &flow->key, &flow->mask->key, skb);
 	if (err)
 		goto error;
 
@@ -771,7 +774,8 @@ static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *act
 }
 
 /* Called with ovs_mutex. */
-static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
+static struct sk_buff *ovs_flow_cmd_build_info(struct datapath *dp,
+					       const struct sw_flow *flow,
 					       int dp_ifindex,
 					       struct genl_info *info, u8 cmd,
 					       bool always)
@@ -784,7 +788,7 @@ static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
 	if (!skb || IS_ERR(skb))
 		return skb;
 
-	retval = ovs_flow_cmd_fill_info(flow, dp_ifindex, skb,
+	retval = ovs_flow_cmd_fill_info(dp, flow, dp_ifindex, skb,
 					info->snd_portid, info->snd_seq, 0,
 					cmd);
 	BUG_ON(retval < 0);
@@ -866,7 +870,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		}
 
 		if (unlikely(reply)) {
-			error = ovs_flow_cmd_fill_info(new_flow,
+			error = ovs_flow_cmd_fill_info(dp, new_flow,
 						       ovs_header->dp_ifindex,
 						       reply, info->snd_portid,
 						       info->snd_seq, 0,
@@ -901,7 +905,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
 		rcu_assign_pointer(flow->sf_acts, acts);
 
 		if (unlikely(reply)) {
-			error = ovs_flow_cmd_fill_info(flow,
+			error = ovs_flow_cmd_fill_info(dp, flow,
 						       ovs_header->dp_ifindex,
 						       reply, info->snd_portid,
 						       info->snd_seq, 0,
@@ -1013,7 +1017,7 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 		rcu_assign_pointer(flow->sf_acts, acts);
 
 		if (unlikely(reply)) {
-			error = ovs_flow_cmd_fill_info(flow,
+			error = ovs_flow_cmd_fill_info(dp, flow,
 						       ovs_header->dp_ifindex,
 						       reply, info->snd_portid,
 						       info->snd_seq, 0,
@@ -1022,7 +1026,8 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
 		}
 	} else {
 		/* Could not alloc without acts before locking. */
-		reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex,
+		reply = ovs_flow_cmd_build_info(dp, flow,
+						ovs_header->dp_ifindex,
 						info, OVS_FLOW_CMD_NEW, false);
 		if (unlikely(IS_ERR(reply))) {
 			error = PTR_ERR(reply);
@@ -1085,7 +1090,7 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
 		goto unlock;
 	}
 
-	reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex, info,
+	reply = ovs_flow_cmd_build_info(dp, flow, ovs_header->dp_ifindex, info,
 					OVS_FLOW_CMD_NEW, true);
 	if (IS_ERR(reply)) {
 		err = PTR_ERR(reply);
@@ -1143,7 +1148,8 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
 	if (likely(reply)) {
 		if (likely(!IS_ERR(reply))) {
 			rcu_read_lock();	/*To keep RCU checker happy. */
-			err = ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex,
+			err = ovs_flow_cmd_fill_info(dp, flow,
+						     ovs_header->dp_ifindex,
 						     reply, info->snd_portid,
 						     info->snd_seq, 0,
 						     OVS_FLOW_CMD_DEL);
@@ -1187,7 +1193,7 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
 		if (!flow)
 			break;
 
-		if (ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex, skb,
+		if (ovs_flow_cmd_fill_info(dp, flow, ovs_header->dp_ifindex, skb,
 					   NETLINK_CB(cb->skb).portid,
 					   cb->nlh->nlmsg_seq, NLM_F_MULTI,
 					   OVS_FLOW_CMD_NEW) < 0)
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 3d0adc5..b487cab 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -454,7 +454,17 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
 		struct ovs_tunnel_info *tun_info = OVS_CB(skb)->tun_info;
 		memcpy(&key->tun_key, &tun_info->tunnel,
 			sizeof(key->tun_key));
+		if (tun_info->options) {
+			BUILD_BUG_ON((1 << (sizeof(tun_info->options_len) * 8)) - 1
+					> sizeof(key->tun_opts));
+			memcpy(GENEVE_OPTS(key, tun_info->options_len),
+				tun_info->options, tun_info->options_len);
+			key->tun_opts_len = tun_info->options_len;
+		} else {
+			key->tun_opts_len = 0;
+		}
 	} else {
+		key->tun_opts_len = 0;
 		memset(&key->tun_key, 0, sizeof(key->tun_key));
 	}
 
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index 6261ad0..216aa1b 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -51,11 +51,23 @@ struct ovs_key_ipv4_tunnel {
 
 struct ovs_tunnel_info {
 	struct ovs_key_ipv4_tunnel tunnel;
+	struct geneve_opt *options;
+	u8 options_len;
 };
 
+/* Store options at the end of the array if they are less than the
+ * maximum size. This allows us to get the benefits of variable length
+ * matching for small options.
+ */
+#define GENEVE_OPTS(flow_key, opt_len) (struct geneve_opt *) \
+					((flow_key)->tun_opts + \
+					FIELD_SIZEOF(struct sw_flow_key, tun_opts) - \
+					   opt_len)
 static inline void ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
 					 const struct iphdr *iph, __be64 tun_id,
-					 __be16 tun_flags)
+					 __be16 tun_flags,
+					 struct geneve_opt *opts,
+					 u8 opts_len)
 {
 	tun_info->tunnel.tun_id = tun_id;
 	tun_info->tunnel.ipv4_src = iph->saddr;
@@ -67,9 +79,14 @@ static inline void ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
 	/* clear struct padding. */
 	memset((unsigned char *) &tun_info->tunnel + OVS_TUNNEL_KEY_SIZE, 0,
 	       sizeof(tun_info->tunnel) - OVS_TUNNEL_KEY_SIZE);
+
+	tun_info->options = opts;
+	tun_info->options_len = opts_len;
 }
 
 struct sw_flow_key {
+	u8 tun_opts[255];
+	u8 tun_opts_len;
 	struct ovs_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
 	struct {
 		u32	priority;	/* Packet QoS priority. */
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index aa7c3d5..e0399c9 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -42,6 +42,7 @@
 #include <linux/icmp.h>
 #include <linux/icmpv6.h>
 #include <linux/rculist.h>
+#include <net/geneve.h>
 #include <net/ip.h>
 #include <net/ipv6.h>
 #include <net/ndisc.h>
@@ -88,18 +89,21 @@ static void update_range__(struct sw_flow_match *match,
 		}                                                           \
 	} while (0)
 
-#define SW_FLOW_KEY_MEMCPY(match, field, value_p, len, is_mask) \
+#define SW_FLOW_KEY_MEMCPY_OFFSET(match, offset, value_p, len, is_mask) \
 	do { \
-		update_range__(match, offsetof(struct sw_flow_key, field),  \
-				len, is_mask);                              \
+		update_range__(match, offset, len, is_mask);                \
 		if (is_mask) {						    \
 			if ((match)->mask)				    \
-				memcpy(&(match)->mask->key.field, value_p, len);\
+				memcpy((u8 *)&(match)->mask->key + offset, value_p, len);\
 		} else {                                                    \
-			memcpy(&(match)->key->field, value_p, len);         \
+			memcpy((u8 *)(match)->key + offset, value_p, len);         \
 		}                                                           \
 	} while (0)
 
+#define SW_FLOW_KEY_MEMCPY(match, field, value_p, len, is_mask) \
+	SW_FLOW_KEY_MEMCPY_OFFSET(match, offsetof(struct sw_flow_key, field), \
+				  value_p, len, is_mask)
+
 static u16 range_n_bytes(const struct sw_flow_key_range *range)
 {
 	return range->end - range->start;
@@ -345,6 +349,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 			[OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT] = 0,
 			[OVS_TUNNEL_KEY_ATTR_CSUM] = 0,
 			[OVS_TUNNEL_KEY_ATTR_OAM] = 0,
+			[OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS] = -1,
 		};
 
 		if (type > OVS_TUNNEL_KEY_ATTR_MAX) {
@@ -353,7 +358,8 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 			return -EINVAL;
 		}
 
-		if (ovs_tunnel_key_lens[type] != nla_len(a)) {
+		if (ovs_tunnel_key_lens[type] != nla_len(a) &&
+		    ovs_tunnel_key_lens[type] != -1) {
 			OVS_NLERR("IPv4 tunnel attribute type has unexpected "
 				  " length (type=%d, length=%d, expected=%d).\n",
 				  type, nla_len(a), ovs_tunnel_key_lens[type]);
@@ -392,6 +398,56 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 		case OVS_TUNNEL_KEY_ATTR_OAM:
 			tun_flags |= TUNNEL_OAM;
 			break;
+		case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
+			if (nla_len(a) > sizeof(match->key->tun_opts)) {
+				OVS_NLERR("Geneve option length exceeds "
+					  "maximum size (len %d, max %zu).\n",
+					  nla_len(a),
+					  sizeof(match->key->tun_opts));
+				return -EINVAL;
+			}
+
+			if (nla_len(a) % 4 != 0) {
+				OVS_NLERR("Geneve option length is not "
+					  "a multiple of 4 (len %d).\n",
+					  nla_len(a));
+				return -EINVAL;
+			}
+
+			/* We need to record the length of the options passed
+			 * down, otherwise packets with the same format but
+			 * additional options will be silently matched.
+			 */
+			if (!is_mask) {
+				SW_FLOW_KEY_PUT(match, tun_opts_len, nla_len(a),
+						false);
+			} else {
+				/* This is somewhat unusual because it looks at
+				 * both the key and mask while parsing the
+				 * attributes (and by extension assumes the key
+				 * is parsed first). Normally, we would verify
+				 * that each is the correct length and that the
+				 * attributes line up in the validate function.
+				 * However, that is difficult because this is
+				 * variable length and we won't have the
+				 * information later.
+				 */
+				if (match->key->tun_opts_len != nla_len(a)) {
+					OVS_NLERR("Geneve option key length (%d)"
+					   " is different from mask length (%d).\n",
+					   match->key->tun_opts_len, nla_len(a));
+					return -EINVAL;
+				}
+
+				SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff,
+						true);
+			}
+
+			SW_FLOW_KEY_MEMCPY_OFFSET(match,
+				(unsigned long)GENEVE_OPTS((struct sw_flow_key *)0,
+							   nla_len(a)),
+				nla_data(a), nla_len(a), is_mask);
+			break;
 		default:
 			return -EINVAL;
 		}
@@ -420,8 +476,9 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
 }
 
 static int ipv4_tun_to_nlattr(struct sk_buff *skb,
-			      const struct ovs_key_ipv4_tunnel *tun_key,
-			      const struct ovs_key_ipv4_tunnel *output)
+			      const struct ovs_key_ipv4_tunnel *output,
+			      const struct geneve_opt *tun_opts,
+			      int swkey_tun_opts_len)
 {
 	struct nlattr *nla;
 
@@ -452,6 +509,9 @@ static int ipv4_tun_to_nlattr(struct sk_buff *skb,
 	if ((output->tun_flags & TUNNEL_OAM) &&
 		nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_OAM))
 		return -EMSGSIZE;
+	if (tun_opts && nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
+				swkey_tun_opts_len, tun_opts))
+		return -EMSGSIZE;
 
 	nla_nest_end(skb, nla);
 	return 0;
@@ -881,7 +941,7 @@ int ovs_nla_get_flow_metadata(struct sw_flow *flow,
 	return 0;
 }
 
-int ovs_nla_put_flow(const struct sw_flow_key *swkey,
+int ovs_nla_put_flow(struct datapath *dp, const struct sw_flow_key *swkey,
 		     const struct sw_flow_key *output, struct sk_buff *skb)
 {
 	struct ovs_key_ethernet *eth_key;
@@ -891,9 +951,24 @@ int ovs_nla_put_flow(const struct sw_flow_key *swkey,
 	if (nla_put_u32(skb, OVS_KEY_ATTR_PRIORITY, output->phy.priority))
 		goto nla_put_failure;
 
-	if ((swkey->tun_key.ipv4_dst || is_mask) &&
-	    ipv4_tun_to_nlattr(skb, &swkey->tun_key, &output->tun_key))
-		goto nla_put_failure;
+	if ((swkey->tun_key.ipv4_dst || is_mask)) {
+		const struct geneve_opt *opts = NULL;
+
+		if (!is_mask) {
+			struct vport *in_port;
+
+			in_port = ovs_vport_ovsl_rcu(dp, swkey->phy.in_port);
+			if (in_port->ops->type == OVS_VPORT_TYPE_GENEVE)
+				opts = GENEVE_OPTS(output, swkey->tun_opts_len);
+		} else {
+			if (output->tun_opts_len)
+				opts = GENEVE_OPTS(output, swkey->tun_opts_len);
+		}
+
+		if (ipv4_tun_to_nlattr(skb, &output->tun_key, opts,
+					swkey->tun_opts_len))
+			goto nla_put_failure;
+	}
 
 	if (swkey->phy.in_port == DP_MAX_PORTS) {
 		if (is_mask && (output->phy.in_port == 0xffff))
@@ -1276,17 +1351,55 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
 	if (err)
 		return err;
 
+	if (key.tun_opts_len) {
+		struct geneve_opt *option = GENEVE_OPTS(&key,
+							key.tun_opts_len);
+		int opts_len = key.tun_opts_len;
+		bool crit_opt = false;
+
+		while (opts_len > 0) {
+			int len;
+
+			if (opts_len < sizeof(*option))
+				return -EINVAL;
+
+			len = sizeof(*option) + option->length * 4;
+			if (len > opts_len)
+				return -EINVAL;
+
+			crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
+
+			option = (struct geneve_opt *)((u8 *)option + len);
+			opts_len -= len;
+		}
+
+		key.tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
+	}
+
 	start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET);
 	if (start < 0)
 		return start;
 
 	a = __add_action(sfa, OVS_KEY_ATTR_TUNNEL_INFO, NULL,
-			sizeof(*tun_info));
+			sizeof(*tun_info) + key.tun_opts_len);
 	if (IS_ERR(a))
 		return PTR_ERR(a);
 
 	tun_info = nla_data(a);
 	tun_info->tunnel = key.tun_key;
+	tun_info->options_len = key.tun_opts_len;
+
+	if (tun_info->options_len) {
+		/* We need to store the options in the action itself since
+		 * everything else will go away after flow setup. We can append
+		 * it to tun_info and then point there.
+		 */
+		tun_info->options = (struct geneve_opt *)(tun_info + 1);
+		memcpy(tun_info->options, GENEVE_OPTS(&key, key.tun_opts_len),
+			key.tun_opts_len);
+	} else {
+		tun_info->options = NULL;
+	}
 
 	add_nested_action_end(*sfa, start);
 
@@ -1561,7 +1674,9 @@ static int set_action_to_attr(const struct nlattr *a, struct sk_buff *skb)
 			return -EMSGSIZE;
 
 		err = ipv4_tun_to_nlattr(skb, &tun_info->tunnel,
-					 &tun_info->tunnel);
+					 tun_info->options_len ?
+						tun_info->options : NULL,
+					 tun_info->options_len);
 		if (err)
 			return err;
 		nla_nest_end(skb, start);
diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
index 4401510..42de456 100644
--- a/net/openvswitch/flow_netlink.h
+++ b/net/openvswitch/flow_netlink.h
@@ -40,7 +40,7 @@
 void ovs_match_init(struct sw_flow_match *match,
 		    struct sw_flow_key *key, struct sw_flow_mask *mask);
 
-int ovs_nla_put_flow(const struct sw_flow_key *,
+int ovs_nla_put_flow(struct datapath *dp, const struct sw_flow_key *,
 		     const struct sw_flow_key *, struct sk_buff *);
 int ovs_nla_get_flow_metadata(struct sw_flow *flow,
 			      const struct nlattr *attr);
diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
new file mode 100644
index 0000000..b1b0a3b
--- /dev/null
+++ b/net/openvswitch/vport-geneve.c
@@ -0,0 +1,258 @@
+/*
+ * Copyright (c) 2014 Nicira, Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of version 2 of the GNU General Public
+ * License as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301, USA
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/version.h>
+
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/net.h>
+#include <linux/rculist.h>
+#include <linux/udp.h>
+#include <linux/if_vlan.h>
+
+#include <net/geneve.h>
+#include <net/icmp.h>
+#include <net/ip.h>
+#include <net/route.h>
+#include <net/udp.h>
+#include <net/xfrm.h>
+
+#include "datapath.h"
+#include "vport.h"
+
+/**
+ * struct geneve_port - Keeps track of open UDP ports
+ * @gs: The Geneve socket created for this port number.
+ * @name: vport name.
+ */
+struct geneve_port {
+	struct geneve_sock *gs;
+	char name[IFNAMSIZ];
+};
+
+static LIST_HEAD(geneve_ports);
+
+static inline struct geneve_port *geneve_vport(const struct vport *vport)
+{
+	return vport_priv(vport);
+}
+
+static inline struct genevehdr *geneve_hdr(const struct sk_buff *skb)
+{
+	return (struct genevehdr *)(udp_hdr(skb) + 1);
+}
+
+/* Convert 64 bit tunnel ID to 24 bit VNI. */
+static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
+{
+#ifdef __BIG_ENDIAN
+	vni[0] = (__force __u8)(tun_id >> 16);
+	vni[1] = (__force __u8)(tun_id >> 8);
+	vni[2] = (__force __u8)tun_id;
+#else
+	vni[0] = (__force __u8)((__force u64)tun_id >> 40);
+	vni[1] = (__force __u8)((__force u64)tun_id >> 48);
+	vni[2] = (__force __u8)((__force u64)tun_id >> 56);
+#endif
+}
+
+/* Convert 24 bit VNI to 64 bit tunnel ID. */
+static __be64 vni_to_tunnel_id(__u8 *vni)
+{
+#ifdef __BIG_ENDIAN
+	return (vni[0] << 16) | (vni[1] << 8) | vni[2];
+#else
+	return (__force __be64)(((__force u64)vni[0] << 40) |
+				((__force u64)vni[1] << 48) |
+				((__force u64)vni[2] << 56));
+#endif
+}
+
+static void geneve_rcv(struct geneve_sock *gs, struct sk_buff *skb)
+{
+	struct vport *vport = gs->uts.data;
+	struct genevehdr *geneveh;
+	int opts_len;
+	struct ovs_tunnel_info tun_info;
+	__be64 key;
+	__be16 flags;
+
+	if (unlikely(!pskb_may_pull(skb, GENEVE_BASE_HLEN)))
+		goto error;
+
+	geneveh = geneve_hdr(skb);
+
+	if (unlikely(geneveh->ver != GENEVE_VER))
+		goto error;
+
+	if (unlikely(geneveh->proto_type != htons(ETH_P_TEB)))
+		goto error;
+
+	opts_len = geneveh->opt_len * 4;
+
+	flags = TUNNEL_KEY |
+		(udp_hdr(skb)->check != 0 ? TUNNEL_CSUM : 0) |
+		(geneveh->oam ? TUNNEL_OAM : 0) |
+		(geneveh->critical ? TUNNEL_CRIT_OPT : 0);
+
+	key = vni_to_tunnel_id(geneveh->vni);
+	ovs_flow_tun_info_init(&tun_info, ip_hdr(skb), key, flags,
+			       geneveh->options, opts_len);
+
+	ovs_vport_receive(vport, skb, &tun_info);
+	return;
+
+error:
+	kfree_skb(skb);
+}
+
+static int geneve_get_options(const struct vport *vport,
+			      struct sk_buff *skb)
+{
+	struct geneve_port *geneve_port = geneve_vport(vport);
+	__be16 sport;
+
+	sport = ntohs(inet_sk(geneve_port->gs->uts.sock->sk)->inet_sport);
+	if (nla_put_u16(skb, OVS_TUNNEL_ATTR_DST_PORT, sport))
+		return -EMSGSIZE;
+	return 0;
+}
+
+static void geneve_tnl_destroy(struct vport *vport)
+{
+	struct geneve_port *geneve_port = geneve_vport(vport);
+
+	geneve_sock_release(geneve_port->gs);
+
+	ovs_vport_deferred_free(vport);
+}
+
+static struct vport *geneve_tnl_create(const struct vport_parms *parms)
+{
+	struct net *net = ovs_dp_get_net(parms->dp);
+	struct nlattr *options = parms->options;
+	struct geneve_port *geneve_port;
+	struct geneve_sock *gs;
+	struct vport *vport;
+	struct nlattr *a;
+	int err;
+	u16 dst_port;
+
+	if (!options) {
+		err = -EINVAL;
+		goto error;
+	}
+
+	a = nla_find_nested(options, OVS_TUNNEL_ATTR_DST_PORT);
+	if (a && nla_len(a) == sizeof(u16)) {
+		dst_port = nla_get_u16(a);
+	} else {
+		/* Require destination port from userspace. */
+		err = -EINVAL;
+		goto error;
+	}
+
+	vport = ovs_vport_alloc(sizeof(struct geneve_port),
+				&ovs_geneve_vport_ops, parms);
+	if (IS_ERR(vport))
+		return vport;
+
+	geneve_port = geneve_vport(vport);
+	strncpy(geneve_port->name, parms->name, IFNAMSIZ);
+
+	gs = geneve_sock_add(net, htons(dst_port), geneve_rcv, vport, true, 0);
+	if (IS_ERR(gs)) {
+		ovs_vport_free(vport);
+		return (void *)gs;
+	}
+	geneve_port->gs = gs;
+
+	return vport;
+error:
+	return ERR_PTR(err);
+}
+
+static int geneve_send(struct vport *vport, struct sk_buff *skb)
+{
+	struct ovs_key_ipv4_tunnel *tun_key;
+	struct ovs_tunnel_info *tun_info = OVS_CB(skb)->tun_info;
+	struct net *net = ovs_dp_get_net(vport->dp);
+	struct geneve_port *geneve_port = geneve_vport(vport);
+	__be16 dport = inet_sk(geneve_port->gs->uts.sock->sk)->inet_sport;
+	__be16 sport;
+	struct rtable *rt;
+	struct flowi4 fl;
+	u8 vni[3];
+	__be16 df;
+	int err;
+	int sent;
+
+	if (unlikely(!tun_info))
+		return -EINVAL;
+
+	tun_key = &tun_info->tunnel;
+
+	/* Route lookup */
+	memset(&fl, 0, sizeof(fl));
+	fl.daddr = tun_key->ipv4_dst;
+	fl.saddr = tun_key->ipv4_src;
+	fl.flowi4_tos = RT_TOS(tun_key->ipv4_tos);
+	fl.flowi4_mark = skb->mark;
+	fl.flowi4_proto = IPPROTO_UDP;
+
+	rt = ip_route_output_key(net, &fl);
+	if (IS_ERR(rt)) {
+		err = PTR_ERR(rt);
+		goto error;
+	}
+
+	df = tun_key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
+	sport = udp_flow_src_port(net, skb, 1, USHRT_MAX, true);
+	tunnel_id_to_vni(tun_key->tun_id, vni);
+
+	sent = geneve_xmit_skb(geneve_port->gs, rt, skb, fl.saddr,
+			       tun_key->ipv4_dst, tun_key->ipv4_tos,
+			       tun_key->ipv4_ttl, df, sport, dport,
+			       tun_key->tun_flags, vni,
+			       tun_info->options_len, (u8 *)tun_info->options,
+			       false);
+	if (!sent)
+		ip_rt_put(rt);
+
+	return sent;
+
+error:
+	return err;
+}
+
+static const char *geneve_get_name(const struct vport *vport)
+{
+	struct geneve_port *geneve_port = geneve_vport(vport);
+	return geneve_port->name;
+}
+
+const struct vport_ops ovs_geneve_vport_ops = {
+	.type		= OVS_VPORT_TYPE_GENEVE,
+	.create		= geneve_tnl_create,
+	.destroy	= geneve_tnl_destroy,
+	.get_name	= geneve_get_name,
+	.get_options	= geneve_get_options,
+	.send		= geneve_send,
+};
diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
index d4fcbb2..1aeeed6 100644
--- a/net/openvswitch/vport-gre.c
+++ b/net/openvswitch/vport-gre.c
@@ -104,7 +104,7 @@ static int gre_rcv(struct sk_buff *skb,
 
 	key = key_to_tunnel_id(tpi->key, tpi->seq);
 	ovs_flow_tun_info_init(&tun_info, ip_hdr(skb), key,
-			       filter_tnl_flags(tpi->flags));
+			       filter_tnl_flags(tpi->flags), NULL, 0);
 
 	ovs_vport_receive(vport, skb, &tun_info);
 	return PACKET_RCVD;
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index 3835143..eded300 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -66,7 +66,7 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
 	/* Save outer tunnel values */
 	iph = ip_hdr(skb);
 	key = cpu_to_be64(ntohl(vx_vni) >> 8);
-	ovs_flow_tun_info_init(&tun_info, iph, key, TUNNEL_KEY);
+	ovs_flow_tun_info_init(&tun_info, iph, key, TUNNEL_KEY, NULL, 0);
 
 	ovs_vport_receive(vport, skb, &tun_info);
 }
diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
index 39e2c9c..038d14a 100644
--- a/net/openvswitch/vport.c
+++ b/net/openvswitch/vport.c
@@ -41,6 +41,7 @@ static void ovs_vport_record_error(struct vport *,
 static const struct vport_ops *vport_ops_list[] = {
 	&ovs_netdev_vport_ops,
 	&ovs_internal_vport_ops,
+	&ovs_geneve_vport_ops,
 
 #ifdef CONFIG_OPENVSWITCH_GRE
 	&ovs_gre_vport_ops,
diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
index 400cd1e..d2eb700 100644
--- a/net/openvswitch/vport.h
+++ b/net/openvswitch/vport.h
@@ -197,6 +197,7 @@ void ovs_vport_receive(struct vport *, struct sk_buff *,
  * add yours to the list at the top of vport.c. */
 extern const struct vport_ops ovs_netdev_vport_ops;
 extern const struct vport_ops ovs_internal_vport_ops;
+extern const struct vport_ops ovs_geneve_vport_ops;
 extern const struct vport_ops ovs_gre_vport_ops;
 extern const struct vport_ops ovs_vxlan_vport_ops;
 
-- 
1.7.9.5

* Re: [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
  2014-07-22 10:19 ` [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port Andy Zhou
@ 2014-07-22 10:49   ` Varka Bhadram
  2014-07-24  6:40   ` Or Gerlitz
  1 sibling, 0 replies; 46+ messages in thread
From: Varka Bhadram @ 2014-07-22 10:49 UTC (permalink / raw)
  To: Andy Zhou, davem; +Cc: netdev

On 07/22/2014 03:49 PM, Andy Zhou wrote:

(...)

>   
> -/* Calls the ndo_add_vxlan_port of the caller in order to
> +/* Calls the ndo_add_tunnel_port of the caller in order to
>    * supply the listening VXLAN udp ports. Callers are expected
> - * to implement the ndo_add_vxlan_port.
> + * to implement the ndo_add_tunnle_port.
>    */
>   void vxlan_get_rx_port(struct net_device *dev)
>   {
> @@ -2206,8 +2209,8 @@ void vxlan_get_rx_port(struct net_device *dev)
>   		hlist_for_each_entry_rcu(vs, &vn->sock_list[i], hlist) {
>   			port = inet_sk(vs->sock->sk)->inet_sport;
>   			sa_family = vs->sock->sk->sk_family;
> -			dev->netdev_ops->ndo_add_vxlan_port(dev, sa_family,
> -							    port);
> +			dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
> +					sa_family, port, UDP_TUNNEL_TYPE_VXLAN);

Should match open parenthesis:
	dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
						 sa_family, port,
						 UDP_TUNNEL_TYPE_VXLAN);

>   


-- 
Regards,
Varka Bhadram.

* Re: [net-next 00/10] Add Geneve
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
                   ` (9 preceding siblings ...)
  2014-07-22 10:19 ` [net-next 10/10] openvswitch: Add support for Geneve tunneling Andy Zhou
@ 2014-07-22 10:54 ` Varka Bhadram
  2014-07-24  6:58 ` Or Gerlitz
  11 siblings, 0 replies; 46+ messages in thread
From: Varka Bhadram @ 2014-07-22 10:54 UTC (permalink / raw)
  To: Andy Zhou, davem; +Cc: netdev

On 07/22/2014 03:49 PM, Andy Zhou wrote:
> Following patches adds initial support for Geneve tunnel protocol
> 1. Add Geneve driver.
> 2. Add common UDP tunnel code into UDP tunnel support function
> 3. Refactor vxlan driver to make use of the UDP tunnel support
> 4. Refactor Openvswitch  in preparation for #5
> 5. Add Geneve support to Openvswitch.
>
> Note: Geneve offload are not supported in this version. We plan to
> post follow on patches that implements them we can verified with
> at least one working NIC that supports Geneve offloading.
>
> Andy Zhou (5):
>    net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
>    udp: Expand UDP tunnel common APIs
>    vxlan: Remove vxlan_get_rx_port()
>    net: Refactor vxlan driver to make use of common UDP tunnel functions
>    net: Add Geneve tunneling protocol driver
>
> Jesse Gross (5):
>    openvswitch: Eliminate memset() from flow_extract.
>    openvswitch: Add support for matching on OAM packets.
>    openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure.
>    openvswitch: Factor out allocation and verification of actions.
>    openvswitch: Add support for Geneve tunneling.
>
>   drivers/net/ethernet/emulex/benet/be_main.c      |   17 +-
>   drivers/net/ethernet/intel/i40e/i40e_main.c      |   18 +-
>   drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |   19 +-
>   drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |   19 +-
>   drivers/net/vxlan.c                              |  273 ++++++----------------
>   include/linux/netdevice.h                        |   35 +--
>   include/net/geneve.h                             |   85 +++++++
>   include/net/ip_tunnels.h                         |    2 +
>   include/net/udp_tunnel.h                         |   57 +++++
>   include/net/vxlan.h                              |   21 +-
>   include/uapi/linux/openvswitch.h                 |    5 +-
>   net/ipv4/Kconfig                                 |   14 ++
>   net/ipv4/Makefile                                |    1 +
>   net/ipv4/geneve.c                                |  273 ++++++++++++++++++++++
>   net/ipv4/udp_tunnel.c                            |  257 +++++++++++++++++++-
>   net/openvswitch/Kconfig                          |   11 +
>   net/openvswitch/Makefile                         |    5 +
>   net/openvswitch/actions.c                        |    6 +-
>   net/openvswitch/datapath.c                       |   71 ++++--
>   net/openvswitch/datapath.h                       |    3 +-
>   net/openvswitch/flow.c                           |   62 ++++-
>   net/openvswitch/flow.h                           |   41 +++-
>   net/openvswitch/flow_netlink.c                   |  184 +++++++++++++--
>   net/openvswitch/flow_netlink.h                   |    2 +-
>   net/openvswitch/vport-geneve.c                   |  258 ++++++++++++++++++++
>   net/openvswitch/vport-gre.c                      |   29 +--
>   net/openvswitch/vport-vxlan.c                    |   34 +--
>   net/openvswitch/vport.c                          |    8 +-
>   net/openvswitch/vport.h                          |    3 +-
>   29 files changed, 1457 insertions(+), 356 deletions(-)
>   create mode 100644 include/net/geneve.h
>   create mode 100644 net/ipv4/geneve.c
>   create mode 100644 net/openvswitch/vport-geneve.c
>
There are checkpatch warnings and errors in this series.


-- 
Regards,
Varka Bhadram.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
       [not found]   ` <CA+mtBx9M_BpjT-_Egng+jFxmqJzdC2Npg0ufE2ZSAb9Lhw8hxg@mail.gmail.com>
@ 2014-07-22 21:02     ` Andy Zhou
  2014-07-22 21:16       ` Tom Herbert
  0 siblings, 1 reply; 46+ messages in thread
From: Andy Zhou @ 2014-07-22 21:02 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Linux Netdev List

On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>
>
> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>> Added create_udp_tunnel_socket(), packet receive and transmit,  and
>> other related common functions for UDP tunnels.
>>
>> Per net open UDP tunnel ports are tracked in this common layer to
>> prevent sharing of a single port with more than one UDP tunnel.
>>
> bind should already prevent this. I don't really see a need to track udp
> encap ports separately.

When a new network device driver is activated, does it need to get a list
of currently open UDP tunnel ports to configure its offloads?

>> --- a/include/net/udp_tunnel.h
>> +++ b/include/net/udp_tunnel.h
>> @@ -1,7 +1,10 @@
>>  #ifndef __NET_UDP_TUNNEL_H
>>  #define __NET_UDP_TUNNEL_H
>>
>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>> +#include <net/ip_tunnels.h>
>> +
>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>
> Why do we need to define these? Caller should know what type of port is
> being opened and provide appropriate encap_rcv.

Assuming the UDP tunnel layer needs to keep track of open ports, should
it also keep track of the protocol associated with each port?

>> +
>> +/* Calls the ndo_add_tunnel_port of the caller in order to
>> + * supply the listening VXLAN udp ports. Callers are expected
>> + * to implement the ndo_add_tunnle_port.
>> + */
> Seems a little presumptuous that we're doing VXLAN specific things in what
> should be common and generic code...
>
You are right. Cut-and-paste error. It should read "UDP tunnel ports"
instead. I will fix it.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-22 21:02     ` Andy Zhou
@ 2014-07-22 21:16       ` Tom Herbert
  2014-07-22 21:56         ` Jesse Gross
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Herbert @ 2014-07-22 21:16 UTC (permalink / raw)
  To: Andy Zhou; +Cc: David Miller, Linux Netdev List

On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>
>>
>> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>>> Added create_udp_tunnel_socket(), packet receive and transmit,  and
>>> other related common functions for UDP tunnels.
>>>
>>> Per net open UDP tunnel ports are tracked in this common layer to
>>> prevent sharing of a single port with more than one UDP tunnel.
>>>
>> bind should already prevent this. I don't really see a need to track udp
>> encap ports separately.
>
> When a new network device driver is activated, does it need to get a list
> of currently open UDP tunnel ports to configure its offloads?
>
If that's needed it should be driven by the UDP offload registration
mechanisms, not from UDP tunnel code. It's very conceivable that we
will have UDP offloads that don't correspond to UDP tunnels in the
kernel--QUIC comes to mind.

>>> --- a/include/net/udp_tunnel.h
>>> +++ b/include/net/udp_tunnel.h
>>> @@ -1,7 +1,10 @@
>>>  #ifndef __NET_UDP_TUNNEL_H
>>>  #define __NET_UDP_TUNNEL_H
>>>
>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>> +#include <net/ip_tunnels.h>
>>> +
>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>
>> Why do we need to define these? Caller should know what type of port is
>> being opened and provide appropriate encap_rcv.
>
> Assume udp tunnel layer needs to keep track of open ports, should it
> also keep track of the protocol associated with the port?
>
For what purpose? Other than for offloads and encap_rcv functions that
provide the service function anyway, what need is there for the UDP layer
to know about this? More to the point, if I add a module to the kernel
with a new flavor of UDP tunneling, I shouldn't have to touch any core
code for things to work correctly. So by this line of thinking,
neither of the terms VXLAN or GENEVE should appear in any common code.
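
A rough user-space sketch of this argument, not code from the patch
series: the common layer below stores only a port number and a receive
callback, so a new tunnel flavor can register itself without the common
code ever naming it. Every identifier here (udp_port_entry,
udp_tunnel_register, udp_tunnel_deliver, demo_rcv) is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receive callback: returns 0 if the payload was consumed. */
typedef int (*encap_rcv_fn)(const uint8_t *payload, size_t len);

struct udp_port_entry {
	uint16_t port;          /* UDP destination port, host byte order */
	encap_rcv_fn encap_rcv; /* supplied by the tunnel module */
};

#define MAX_TUNNEL_PORTS 8

static struct udp_port_entry port_table[MAX_TUNNEL_PORTS];
static size_t port_count;

/* Claim a port for a callback. Returns 0 on success, -1 if the port is
 * already claimed, the table is full, or no callback was given. */
static int udp_tunnel_register(uint16_t port, encap_rcv_fn fn)
{
	size_t i;

	if (fn == NULL || port_count == MAX_TUNNEL_PORTS)
		return -1;
	for (i = 0; i < port_count; i++)
		if (port_table[i].port == port)
			return -1;
	port_table[port_count].port = port;
	port_table[port_count].encap_rcv = fn;
	port_count++;
	return 0;
}

/* Hand a UDP payload to whoever owns the port. Returns the callback's
 * result, or 1 when no tunnel owns the port (treat as plain UDP). */
static int udp_tunnel_deliver(uint16_t port, const uint8_t *payload, size_t len)
{
	size_t i;

	for (i = 0; i < port_count; i++)
		if (port_table[i].port == port)
			return port_table[i].encap_rcv(payload, len);
	return 1;
}

/* Stand-in for a tunnel module's handler: it only checks that an
 * 8-byte encapsulation header is present. */
static int demo_rcv(const uint8_t *payload, size_t len)
{
	(void)payload;
	return len >= 8 ? 0 : -1;
}
```

The table never records which protocol owns a port; only the registered
callback knows how to parse the payload.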

>>> +
>>> +/* Calls the ndo_add_tunnel_port of the caller in order to
>>> + * supply the listening VXLAN udp ports. Callers are expected
>>> + * to implement the ndo_add_tunnle_port.
>>> + */
>> Seems a little presumptuous that we're doing VXLAN specific things in what
>> should be common and generic code...
>>
> You are right. Cut-and-past error. It should read "UDP tunnel ports"
> instead. I will fix it.

Given my arguments above, I'm not sure that ndo_add_tunnel_port is the
right interface either.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-22 21:16       ` Tom Herbert
@ 2014-07-22 21:56         ` Jesse Gross
  2014-07-22 22:38           ` Tom Herbert
  0 siblings, 1 reply; 46+ messages in thread
From: Jesse Gross @ 2014-07-22 21:56 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Andy Zhou, David Miller, Linux Netdev List

On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>> --- a/include/net/udp_tunnel.h
>>>> +++ b/include/net/udp_tunnel.h
>>>> @@ -1,7 +1,10 @@
>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>  #define __NET_UDP_TUNNEL_H
>>>>
>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>> +#include <net/ip_tunnels.h>
>>>> +
>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>
>>> Why do we need to define these? Caller should know what type of port is
>>> being opened and provide appropriate encap_rcv.
>>
>> Assume udp tunnel layer needs to keep track of open ports, should it
>> also keep track of the protocol associated with the port?
>>
> For what purpose? Other than for offloads and rcv_encap functions that
> provide the service function anyway, what need is there for UDP layer
> to know about this. More to the point, if I add a module to the kernel
> with a new flavor of UDP tunneling, I shouldn't have to touch any core
> code for things to work correctly. So by this line of thinking,
> neither the terms VXLAN nor GENEVE should appear in any common code.

The hardware will need to know what the header format is so that it
can parse the packets on receive. And since the NIC can't exactly call
into a function pointer like GRO can, I'm not sure that there is a
solution that doesn't involve an identifier that needs to be listed
somewhere. This is a pretty minimal impact - it doesn't actually
appear in the core code.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-22 21:56         ` Jesse Gross
@ 2014-07-22 22:38           ` Tom Herbert
  2014-07-22 22:55             ` Alexander Duyck
  2014-07-22 23:12             ` Jesse Gross
  0 siblings, 2 replies; 46+ messages in thread
From: Tom Herbert @ 2014-07-22 22:38 UTC (permalink / raw)
  To: Jesse Gross; +Cc: Andy Zhou, David Miller, Linux Netdev List

On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>> --- a/include/net/udp_tunnel.h
>>>>> +++ b/include/net/udp_tunnel.h
>>>>> @@ -1,7 +1,10 @@
>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>
>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>> +#include <net/ip_tunnels.h>
>>>>> +
>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>
>>>> Why do we need to define these? Caller should know what type of port is
>>>> being opened and provide appropriate encap_rcv.
>>>
>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>> also keep track of the protocol associated with the port?
>>>
>> For what purpose? Other than for offloads and rcv_encap functions that
>> provide the service function anyway, what need is there for UDP layer
>> to know about this. More to the point, if I add a module to the kernel
>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>> code for things to work correctly. So by this line of thinking,
>> neither the terms VXLAN nor GENEVE should appear in any common code.
>
> The hardware will need to know what the header format is so that it
> can parse the packets on receive. And since the NIC can't exactly call
> into a function pointer like GRO can, I'm not sure that there is a
> solution that doesn't involve an identifier that needs to be listed
> somewhere. This is a pretty minimal impact - it doesn't actually
> appear in the core code.

The hardware doesn't *need* to know this; it must be optional and
should have no bearing on the software stack. I suggest putting them in
their own header file. Also, as HW features these should appear in the
NETIF_F_* list so that we can control on a per-device level whether to
enable this feature (something like how NETIF_F_GSO_* was done).

What about support for L2TP/UDP?

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-22 22:38           ` Tom Herbert
@ 2014-07-22 22:55             ` Alexander Duyck
  2014-07-22 23:24               ` Tom Herbert
  2014-07-22 23:12             ` Jesse Gross
  1 sibling, 1 reply; 46+ messages in thread
From: Alexander Duyck @ 2014-07-22 22:55 UTC (permalink / raw)
  To: Tom Herbert, Jesse Gross; +Cc: Andy Zhou, David Miller, Linux Netdev List

On 07/22/2014 03:38 PM, Tom Herbert wrote:
> On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
>> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>> --- a/include/net/udp_tunnel.h
>>>>>> +++ b/include/net/udp_tunnel.h
>>>>>> @@ -1,7 +1,10 @@
>>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>>
>>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>>> +#include <net/ip_tunnels.h>
>>>>>> +
>>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>>
>>>>> Why do we need to define these? Caller should know what type of port is
>>>>> being opened and provide appropriate encap_rcv.
>>>>
>>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>>> also keep track of the protocol associated with the port?
>>>>
>>> For what purpose? Other than for offloads and rcv_encap functions that
>>> provide the service function anyway, what need is there for UDP layer
>>> to know about this. More to the point, if I add a module to the kernel
>>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>>> code for things to work correctly. So by this line of thinking,
>>> neither the terms VXLAN nor GENEVE should appear in any common code.
>>
>> The hardware will need to know what the header format is so that it
>> can parse the packets on receive. And since the NIC can't exactly call
>> into a function pointer like GRO can, I'm not sure that there is a
>> solution that doesn't involve an identifier that needs to be listed
>> somewhere. This is a pretty minimal impact - it doesn't actually
>> appear in the core code.
> 
> The hardware doesn't *need* to know this, it's must be optional and
> should have no bearing on the software stack. Suggest to put them in
> their own header file. Also, as HW features these should appear in
> NETIF_F_* list so that we can control on a per device level rather to
> enable this feature (something like how NETIF_F_GSO_* was done).
> 
> What about support for L2TP/UDP?

The hardware needs some means of knowing what UDP port numbers are used
for VXLAN and/or GENEVE as the two formats contain subtle differences
that we have to be ready for on the Rx path as we have to parse out the
frames.

We already have feature flags controlling the offloads. What the port
numbers provide is a means for us to determine which Rx packets we should
parse as tunnels vs standard UDP, and which tunnel type we should parse
them as.
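
As a rough illustration of why the type has to travel with the port,
here is a user-space sketch (not driver code). It assumes the
UDP_TUNNEL_TYPE_* values from the quoted patch, an 8-byte VXLAN header,
and an 8-byte Geneve base header followed by options whose length is
given in 4-byte units in the low 6 bits of the first byte, as in the
Geneve draft. UDP_TUNNEL_TYPE_NONE, rx_port_rule, rx_tunnel_type and
rx_inner_offset are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* The first value is a hypothetical addition; the other two are from the patch. */
#define UDP_TUNNEL_TYPE_NONE   0x00
#define UDP_TUNNEL_TYPE_VXLAN  0x01
#define UDP_TUNNEL_TYPE_GENEVE 0x02

struct rx_port_rule {
	uint16_t port; /* UDP destination port, host byte order */
	uint8_t type;  /* one of UDP_TUNNEL_TYPE_* */
};

/* Classify a received UDP packet by destination port. */
static uint8_t rx_tunnel_type(const struct rx_port_rule *rules, size_t n,
			      uint16_t dport)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (rules[i].port == dport)
			return rules[i].type;
	return UDP_TUNNEL_TYPE_NONE;
}

/* Offset of the inner frame within the UDP payload, or -1 if the
 * payload is too short or the type is not a known tunnel. */
static int rx_inner_offset(uint8_t type, const uint8_t *payload, size_t len)
{
	size_t off;

	if (len < 8) /* both base headers are 8 bytes long */
		return -1;

	switch (type) {
	case UDP_TUNNEL_TYPE_VXLAN:
		off = 8; /* fixed-size header */
		break;
	case UDP_TUNNEL_TYPE_GENEVE:
		off = 8 + 4 * (size_t)(payload[0] & 0x3f); /* base + options */
		break;
	default:
		return -1;
	}
	return off <= len ? (int)off : -1;
}
```

The same bytes yield a different inner offset depending on the type, so
knowing the port alone is not enough to parse the frame.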

Thanks,

Alex

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-22 22:38           ` Tom Herbert
  2014-07-22 22:55             ` Alexander Duyck
@ 2014-07-22 23:12             ` Jesse Gross
  1 sibling, 0 replies; 46+ messages in thread
From: Jesse Gross @ 2014-07-22 23:12 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Andy Zhou, David Miller, Linux Netdev List

On Tue, Jul 22, 2014 at 6:38 PM, Tom Herbert <therbert@google.com> wrote:
> On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
>> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>> --- a/include/net/udp_tunnel.h
>>>>>> +++ b/include/net/udp_tunnel.h
>>>>>> @@ -1,7 +1,10 @@
>>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>>
>>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>>> +#include <net/ip_tunnels.h>
>>>>>> +
>>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>>
>>>>> Why do we need to define these? Caller should know what type of port is
>>>>> being opened and provide appropriate encap_rcv.
>>>>
>>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>>> also keep track of the protocol associated with the port?
>>>>
>>> For what purpose? Other than for offloads and rcv_encap functions that
>>> provide the service function anyway, what need is there for UDP layer
>>> to know about this. More to the point, if I add a module to the kernel
>>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>>> code for things to work correctly. So by this line of thinking,
>>> neither the terms VXLAN nor GENEVE should appear in any common code.
>>
>> The hardware will need to know what the header format is so that it
>> can parse the packets on receive. And since the NIC can't exactly call
>> into a function pointer like GRO can, I'm not sure that there is a
>> solution that doesn't involve an identifier that needs to be listed
>> somewhere. This is a pretty minimal impact - it doesn't actually
>> appear in the core code.
>
> The hardware doesn't *need* to know this, it's must be optional and
> should have no bearing on the software stack. Suggest to put them in
> their own header file. Also, as HW features these should appear in
> NETIF_F_* list so that we can control on a per device level rather to
> enable this feature (something like how NETIF_F_GSO_* was done).

Right - I meant for hardware offload. Obviously, pure software
implementations should continue to work fine with the tunnel stack (as
it does here). I don't have any particular objection to moving them to
a different file (udp_offload.h?) but I agree with Alex that these are
slightly different than hardware feature flags.

> What about support for L2TP/UDP?

It should be possible to take advantage of the common UDP tunnel code
here as well. I believe that Andy is planning on doing it as a follow
up patch, which would be a good example of a pure software
implementation.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 05/10] net: Add Geneve tunneling protocol driver
  2014-07-22 10:19 ` [net-next 05/10] net: Add Geneve tunneling protocol driver Andy Zhou
@ 2014-07-22 23:12   ` Alexander Duyck
  2014-07-22 23:24     ` Jesse Gross
  2014-07-23 18:20   ` Stephen Hemminger
  1 sibling, 1 reply; 46+ messages in thread
From: Alexander Duyck @ 2014-07-22 23:12 UTC (permalink / raw)
  To: Andy Zhou, davem; +Cc: netdev, Jesse Gross

On 07/22/2014 03:19 AM, Andy Zhou wrote:
> This adds a device level support for Geneve -- Generic Network
> Virtualization Encapsulation. The protocol is documented at
> http://tools.ietf.org/html/draft-gross-geneve-00
> 
> Only protocol layer Geneve support is provided by this driver.
> Openvswitch can be used for configuring, set up and tear down
> functional Geneve tunnels.
> 
> Signed-off-by: Jesse Gross <jesse@nicira.com>
> Signed-off-by: Andy Zhou <azhou@nicira.com>
> ---
>  include/net/geneve.h     |   85 +++++++++++++++
>  include/net/ip_tunnels.h |    2 +
>  net/ipv4/Kconfig         |   14 +++
>  net/ipv4/Makefile        |    1 +
>  net/ipv4/geneve.c        |  273 ++++++++++++++++++++++++++++++++++++++++++++++
>  net/openvswitch/Kconfig  |   11 ++
>  net/openvswitch/vport.c  |    3 +
>  7 files changed, 389 insertions(+)
>  create mode 100644 include/net/geneve.h
>  create mode 100644 net/ipv4/geneve.c
> 

So all this is really doing is enabling a Geneve socket for use by
Openvswitch.  Do you have any plans to enable a standalone interface
like what we already have for VXLAN?

Thanks,

Alex

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-22 22:55             ` Alexander Duyck
@ 2014-07-22 23:24               ` Tom Herbert
  2014-07-23  2:16                 ` Alexander Duyck
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Herbert @ 2014-07-22 23:24 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Jesse Gross, Andy Zhou, David Miller, Linux Netdev List

On Tue, Jul 22, 2014 at 3:55 PM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> On 07/22/2014 03:38 PM, Tom Herbert wrote:
>> On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
>>> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>>>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>>> --- a/include/net/udp_tunnel.h
>>>>>>> +++ b/include/net/udp_tunnel.h
>>>>>>> @@ -1,7 +1,10 @@
>>>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>>>
>>>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>>>> +#include <net/ip_tunnels.h>
>>>>>>> +
>>>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>>>
>>>>>> Why do we need to define these? Caller should know what type of port is
>>>>>> being opened and provide appropriate encap_rcv.
>>>>>
>>>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>>>> also keep track of the protocol associated with the port?
>>>>>
>>>> For what purpose? Other than for offloads and rcv_encap functions that
>>>> provide the service function anyway, what need is there for UDP layer
>>>> to know about this. More to the point, if I add a module to the kernel
>>>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>>>> code for things to work correctly. So by this line of thinking,
>>>> neither the terms VXLAN nor GENEVE should appear in any common code.
>>>
>>> The hardware will need to know what the header format is so that it
>>> can parse the packets on receive. And since the NIC can't exactly call
>>> into a function pointer like GRO can, I'm not sure that there is a
>>> solution that doesn't involve an identifier that needs to be listed
>>> somewhere. This is a pretty minimal impact - it doesn't actually
>>> appear in the core code.
>>
>> The hardware doesn't *need* to know this, it's must be optional and
>> should have no bearing on the software stack. Suggest to put them in
>> their own header file. Also, as HW features these should appear in
>> NETIF_F_* list so that we can control on a per device level rather to
>> enable this feature (something like how NETIF_F_GSO_* was done).
>>
>> What about support for L2TP/UDP?
>
> The hardware needs some means of knowing what UDP port numbers are used
> for VXLAN and/or GENEVE as the two formats contain subtle differences
> that we have to be ready for on the Rx path as we have to parse out the
> frames.
>
> We already have feature flags controlling the offloads, what the port
> numbers provide is a means for us to determine what Rx packets we should
> parse as tunnels vs standard UDP and which tunnel type we should parse
> it as.
>
Which feature flags control the receive side parsing in the device?

> Thanks,
>
> Alex
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 05/10] net: Add Geneve tunneling protocol driver
  2014-07-22 23:12   ` Alexander Duyck
@ 2014-07-22 23:24     ` Jesse Gross
  2014-07-23 14:11       ` John W. Linville
  0 siblings, 1 reply; 46+ messages in thread
From: Jesse Gross @ 2014-07-22 23:24 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Andy Zhou, David Miller, netdev, John W. Linville

On Tue, Jul 22, 2014 at 7:12 PM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> On 07/22/2014 03:19 AM, Andy Zhou wrote:
>> This adds a device level support for Geneve -- Generic Network
>> Virtualization Encapsulation. The protocol is documented at
>> http://tools.ietf.org/html/draft-gross-geneve-00
>>
>> Only protocol layer Geneve support is provided by this driver.
>> Openvswitch can be used for configuring, set up and tear down
>> functional Geneve tunnels.
>>
>> Signed-off-by: Jesse Gross <jesse@nicira.com>
>> Signed-off-by: Andy Zhou <azhou@nicira.com>
>> ---
>>  include/net/geneve.h     |   85 +++++++++++++++
>>  include/net/ip_tunnels.h |    2 +
>>  net/ipv4/Kconfig         |   14 +++
>>  net/ipv4/Makefile        |    1 +
>>  net/ipv4/geneve.c        |  273 ++++++++++++++++++++++++++++++++++++++++++++++
>>  net/openvswitch/Kconfig  |   11 ++
>>  net/openvswitch/vport.c  |    3 +
>>  7 files changed, 389 insertions(+)
>>  create mode 100644 include/net/geneve.h
>>  create mode 100644 net/ipv4/geneve.c
>>
>
> So all this is really doing is enabling a Geneve socket for use by
> Openvswitch.  Do you have any plans to enable a stand alone interface
> like what we already have for VXLAN?

Yes, this is the basic protocol code that would be shared by all users
of Geneve. OVS is the first user but John Linville is looking at
adding support for connecting this to 'ip' as a non-OVS user.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-22 23:24               ` Tom Herbert
@ 2014-07-23  2:16                 ` Alexander Duyck
  2014-07-23  3:53                   ` Tom Herbert
  0 siblings, 1 reply; 46+ messages in thread
From: Alexander Duyck @ 2014-07-23  2:16 UTC (permalink / raw)
  To: Tom Herbert, Alexander Duyck
  Cc: Jesse Gross, Andy Zhou, David Miller, Linux Netdev List

On 07/22/2014 04:24 PM, Tom Herbert wrote:
> On Tue, Jul 22, 2014 at 3:55 PM, Alexander Duyck
> <alexander.h.duyck@intel.com> wrote:
>> On 07/22/2014 03:38 PM, Tom Herbert wrote:
>>> On Tue, Jul 22, 2014 at 2:56 PM, Jesse Gross <jesse@nicira.com> wrote:
>>>> On Tue, Jul 22, 2014 at 5:16 PM, Tom Herbert <therbert@google.com> wrote:
>>>>> On Tue, Jul 22, 2014 at 2:02 PM, Andy Zhou <azhou@nicira.com> wrote:
>>>>>> On Tue, Jul 22, 2014 at 12:52 PM, Tom Herbert <therbert@google.com> wrote:
>>>>>>>> --- a/include/net/udp_tunnel.h
>>>>>>>> +++ b/include/net/udp_tunnel.h
>>>>>>>> @@ -1,7 +1,10 @@
>>>>>>>>  #ifndef __NET_UDP_TUNNEL_H
>>>>>>>>  #define __NET_UDP_TUNNEL_H
>>>>>>>>
>>>>>>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>>>>>>> +#include <net/ip_tunnels.h>
>>>>>>>> +
>>>>>>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>>>>>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>>>>>>
>>>>>>> Why do we need to define these? Caller should know what type of port is
>>>>>>> being opened and provide appropriate encap_rcv.
>>>>>>
>>>>>> Assume udp tunnel layer needs to keep track of open ports, should it
>>>>>> also keep track of the protocol associated with the port?
>>>>>>
>>>>> For what purpose? Other than for offloads and rcv_encap functions that
>>>>> provide the service function anyway, what need is there for UDP layer
>>>>> to know about this. More to the point, if I add a module to the kernel
>>>>> with a new flavor of UDP tunneling, I shouldn't have to touch any core
>>>>> code for things to work correctly. So by this line of thinking,
>>>>> neither the terms VXLAN nor GENEVE should appear in any common code.
>>>>
>>>> The hardware will need to know what the header format is so that it
>>>> can parse the packets on receive. And since the NIC can't exactly call
>>>> into a function pointer like GRO can, I'm not sure that there is a
>>>> solution that doesn't involve an identifier that needs to be listed
>>>> somewhere. This is a pretty minimal impact - it doesn't actually
>>>> appear in the core code.
>>>
>>> The hardware doesn't *need* to know this, it's must be optional and
>>> should have no bearing on the software stack. Suggest to put them in
>>> their own header file. Also, as HW features these should appear in
>>> NETIF_F_* list so that we can control on a per device level rather to
>>> enable this feature (something like how NETIF_F_GSO_* was done).
>>>
>>> What about support for L2TP/UDP?
>>
>> The hardware needs some means of knowing what UDP port numbers are used
>> for VXLAN and/or GENEVE as the two formats contain subtle differences
>> that we have to be ready for on the Rx path as we have to parse out the
>> frames.
>>
>> We already have feature flags controlling the offloads, what the port
>> numbers provide is a means for us to determine what Rx packets we should
>> parse as tunnels vs standard UDP and which tunnel type we should parse
>> it as.
>>
> Which feature flags control the receive side parsing in the device?

The only real features that need the port info are Rx hash and Rx
checksum.  If those are disabled then there shouldn't be any need for
the port numbers.  I don't recall if you can disable them separately
from the non-tunnel case though.  I believe they are linked to the
standard offloads.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-23  2:16                 ` Alexander Duyck
@ 2014-07-23  3:53                   ` Tom Herbert
  2014-07-23  4:35                     ` Jesse Gross
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Herbert @ 2014-07-23  3:53 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, Jesse Gross, Andy Zhou, David Miller, Linux Netdev List

>> Which feature flags control the receive side parsing in the device?
>
> The only real features that need the port info are Rx hash and Rx
> checksum.  If those are disabled then there shouldn't be any need for
> the port numbers.  I don't recall if you can disable them separately
> from the non-tunnel case though.  I believe they are linked to the
> standard offloads.
>
Rx hash is an unnecessary consideration because we can derive that from
the UDP header. The fact that we can deduce a reasonable hash is a major
rationale for UDP encapsulation. We will need drivers to start
enabling/supporting UDP RSS and providing RX hash to realize the full
benefits of this.
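
A minimal sketch of the idea, assuming the encapsulator has already
folded the inner flow's entropy into the outer UDP source port (as the
VXLAN and Geneve specifications recommend). FNV-1a stands in for
whatever hash a NIC actually uses for RSS; outer_tuple and
outer_udp_hash are hypothetical names.

```c
#include <stdint.h>

/* Outer IPv4/UDP fields only; nothing inside the tunnel is inspected. */
struct outer_tuple {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport; /* carries the inner-flow entropy */
	uint16_t dport; /* the tunnel's well-known port */
};

/* One FNV-1a step: xor in a byte, then multiply by the 32-bit FNV prime. */
static uint32_t fnv1a_byte(uint32_t h, uint8_t b)
{
	return (h ^ b) * 16777619u;
}

/* Hash the outer tuple byte by byte, starting from the FNV offset basis. */
static uint32_t outer_udp_hash(const struct outer_tuple *t)
{
	uint32_t h = 2166136261u;
	int i;

	for (i = 0; i < 4; i++)
		h = fnv1a_byte(h, (uint8_t)(t->saddr >> (8 * i)));
	for (i = 0; i < 4; i++)
		h = fnv1a_byte(h, (uint8_t)(t->daddr >> (8 * i)));
	h = fnv1a_byte(h, (uint8_t)(t->sport & 0xff));
	h = fnv1a_byte(h, (uint8_t)(t->sport >> 8));
	h = fnv1a_byte(h, (uint8_t)(t->dport & 0xff));
	h = fnv1a_byte(h, (uint8_t)(t->dport >> 8));
	return h;
}
```

Two inner flows between the same pair of tunnel endpoints differ only in
the outer source port, which is enough to give them different hash
values without parsing the encapsulation header.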

Rx checksum is also an unnecessary consideration if devices return
CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY. Pretty much
anything can (and probably will) be encapsulated in UDP (VXLAN, GRE,
MPLS, L2TP, IPIP, SIT, etc.), so if your hardware provides
CHECKSUM_COMPLETE this immediately gives us easy calculation of the
embedded checksums no matter how many encapsulation layers there are.
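
To sketch why a whole-packet sum is enough: the Internet checksum is a
ones-complement sum, so the sum over any inner region can be recovered
by subtracting the sum over the headers in front of it. This is a
user-space illustration, not kernel code; it assumes the outer headers
have an even length and it ignores pseudo-headers. All function names
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones-complement sum. */
static uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones-complement sum of big-endian 16-bit words; an odd trailing byte
 * is treated as if padded with a zero byte. */
static uint16_t csum_buf(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
		sum = (sum & 0xffff) + (sum >> 16); /* avoid overflow */
	}
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;
	return csum_fold(sum);
}

/* Ones-complement subtraction: a - b is a + ~b. */
static uint16_t csum_sub(uint16_t a, uint16_t b)
{
	return csum_fold((uint32_t)a + (uint16_t)~b);
}

/* Given the sum over the whole packet (what the device would report),
 * derive the sum over everything after the first outer_len bytes. */
static uint16_t inner_csum_from_complete(uint16_t complete,
					 const uint8_t *pkt, size_t outer_len)
{
	assert(outer_len % 2 == 0);
	return csum_sub(complete, csum_buf(pkt, outer_len));
}
```

The same subtraction can be repeated for each nested header, which is
why the number of encapsulation layers does not matter.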

Another need for parsing UDP contents would be for LRO. This would
require implementation of each encapsulation format supported. I
believe that LRO is pretty much deprecated, so maybe this is not an
issue either.

Are there any other cases where HW needs to know about the port? Is this
needed for those devices that provide SRIOV?

Tom

> Thanks,
>
> Alex
>
>
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 03/10] vxlan: Remove vxlan_get_rx_port()
       [not found]   ` <CAKgT0UeRSc3MaZrLmXyx4jPZO+F1hS5imR1TjFkvKp4S8nQmeg@mail.gmail.com>
@ 2014-07-23  3:57     ` Andy Zhou
  0 siblings, 0 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-23  3:57 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David Miller, Netdev

You are right.  Thanks for pointing this out. I will swap it in the
next version.

On Tue, Jul 22, 2014 at 7:20 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>>
>> Instead of specificly calling vxlan_get_rx_port(), Device driver
>> should now call udp_tunnel_get_rx_port() instead.  Making this change
>> to support future NICs and device drivers that may support more
>> UDP tunnel protocol offloads.
>>
>> Signed-off-by: Andy Zhou <azhou@nicira.com>
>> ---
>>  drivers/net/ethernet/emulex/benet/be_main.c      |    2 +-
>>  drivers/net/ethernet/intel/i40e/i40e_main.c      |    2 +-
>>  drivers/net/ethernet/mellanox/mlx4/en_netdev.c   |    2 +-
>>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_main.c |    2 +-
>>  drivers/net/vxlan.c                              |   26
>> ----------------------
>>  include/net/vxlan.h                              |    7 ------
>>  6 files changed, 4 insertions(+), 37 deletions(-)
>
>
> If I am not mistaken I think this patch is incomplete.  There is nothing
> that is currently initializing the tunnel type for VXLAN until the next
> patch.  As such I believe this patch breaks the functionality.  You might
> want to consider swapping it with patch 4 in order to avoid that.
>
> Thanks,
>
> Alex

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-23  3:53                   ` Tom Herbert
@ 2014-07-23  4:35                     ` Jesse Gross
  2014-07-23 15:45                       ` Tom Herbert
  0 siblings, 1 reply; 46+ messages in thread
From: Jesse Gross @ 2014-07-23  4:35 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexander Duyck, Alexander Duyck, Andy Zhou, David Miller,
	Linux Netdev List

On Tue, Jul 22, 2014 at 11:53 PM, Tom Herbert <therbert@google.com> wrote:
>>> Which feature flags control the receive side parsing in the device?
>>
>> The only real features that need the port info are Rx hash and Rx
>> checksum.  If those are disabled then there shouldn't be any need for
>> the port numbers.  I don't recall if you can disable them separately
>> from the non-tunnel case though.  I believe they are linked to the
>> standard offloads.
>>
> Rx hash is an unnecessary consideration because we can derive that from
> UDP header. The fact that we can deduce a reasonable hash is a major
> rationale of UDP encapsulation. We will need drivers to start
> enabling/supporting UDP RSS and providing RX hash to realize full
> benefits of this.

That's true for basic hashing but for more sophisticated things like
flow steering or sending OAM packets to control queues the hardware
still needs to be able to look into the header.

> Rx checksum is also an unnecessary consideration if devices return
> CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY. Pretty much
> anything can (and probably will) be encapsulated in UDP (VXLAN, GRE,
> MPLS, L2TP, IPIP, SIT, etc.), so if your hardware provides
> CHECKSUM_COMPLETE this immediately gives us easy calculation of the
> embedded checksums no matter how many encapsulation layers there are.

This property only applies to ones-complement checksums though. If I
recall correctly, I believe you have a desire for something stronger
:)

> Another need for parsing UDP contents would be for LRO. This would
> require implementation of each encapsulation format supported. I
> believe that LRO is pretty much deprecated, so maybe this is not an issue
> either.

I think only the old style of LRO is deprecated. Some drivers provide
"GRO" where the hardware supplies the original MSS and that works OK.

Some of these are obviously future looking but I think that means that
even if you got your desired changes, the use of the UDP port on
receive would only shift, not go away.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 05/10] net: Add Geneve tunneling protocol driver
  2014-07-22 23:24     ` Jesse Gross
@ 2014-07-23 14:11       ` John W. Linville
  0 siblings, 0 replies; 46+ messages in thread
From: John W. Linville @ 2014-07-23 14:11 UTC (permalink / raw)
  To: Jesse Gross; +Cc: Alexander Duyck, Andy Zhou, David Miller, netdev

On Tue, Jul 22, 2014 at 07:24:20PM -0400, Jesse Gross wrote:
> On Tue, Jul 22, 2014 at 7:12 PM, Alexander Duyck
> <alexander.h.duyck@intel.com> wrote:
> > On 07/22/2014 03:19 AM, Andy Zhou wrote:
> >> This adds a device level support for Geneve -- Generic Network
> >> Virtualization Encapsulation. The protocol is documented at
> >> http://tools.ietf.org/html/draft-gross-geneve-00
> >>
> >> Only protocol layer Geneve support is provided by this driver.
> >> Openvswitch can be used to configure, set up, and tear down
> >> functional Geneve tunnels.
> >>
> >> Signed-off-by: Jesse Gross <jesse@nicira.com>
> >> Signed-off-by: Andy Zhou <azhou@nicira.com>
> >> ---
> >>  include/net/geneve.h     |   85 +++++++++++++++
> >>  include/net/ip_tunnels.h |    2 +
> >>  net/ipv4/Kconfig         |   14 +++
> >>  net/ipv4/Makefile        |    1 +
> >>  net/ipv4/geneve.c        |  273 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  net/openvswitch/Kconfig  |   11 ++
> >>  net/openvswitch/vport.c  |    3 +
> >>  7 files changed, 389 insertions(+)
> >>  create mode 100644 include/net/geneve.h
> >>  create mode 100644 net/ipv4/geneve.c
> >>
> >
> > So all this is really doing is enabling a Geneve socket for use by
> > Openvswitch.  Do you have any plans to enable a stand alone interface
> > like what we already have for VXLAN?
> 
> Yes, this is the basic protocol code that would be shared by all users
> of Geneve. OVS is the first user but John Linville is looking at
> adding support for connecting this to 'ip' as a non-OVS user.

ACK

-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-23  4:35                     ` Jesse Gross
@ 2014-07-23 15:45                       ` Tom Herbert
  2014-07-24  3:24                         ` Jesse Gross
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Herbert @ 2014-07-23 15:45 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Alexander Duyck, Alexander Duyck, Andy Zhou, David Miller,
	Linux Netdev List

On Tue, Jul 22, 2014 at 9:35 PM, Jesse Gross <jesse@nicira.com> wrote:
> On Tue, Jul 22, 2014 at 11:53 PM, Tom Herbert <therbert@google.com> wrote:
>>>> Which feature flags control the receive side parsing in the device?
>>>
>>> The only real features that need the port info are Rx hash and Rx
>>> checksum.  If those are disabled then there shouldn't be any need for
>>> the port numbers.  I don't recall if you can disable them separately
>>> from the non-tunnel case though.  I believe they are linked to the
>>> standard offloads.
>>>
>> Rx hash is an unnecessary consideration because we can derive that from
>> UDP header. The fact that we can deduce a reasonable hash is a major
>> rationale of UDP encapsulation. We will need drivers to start
>> enabling/supporting UDP RSS and providing RX hash to realize full
>> benefits of this.
>
> That's true for basic hashing but for more sophisticated things like
> flow steering or sending OAM packets to control queues the hardware
> still needs to be able to look into the header.
>
Flow steering (aRFS, FlowDirector, ECMP in network) will work just
fine based on UDP header-- again this is a fundamental property in UDP
encapsulation. If you need to implement mechanisms that require
parsing of the encapsulated headers, then it's better to make this
part of RX filtering.
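
One way the encapsulating side can make the outer UDP header
sufficient for this is to fold a hash of the inner flow into the outer
source port. A minimal sketch, assuming a 32-bit inner flow hash is
already available; flow_src_port and the port range in the usage note
are illustrative and not taken from this patch set:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale a 32-bit inner flow hash into the half-open port range
 * [min, max). Packets of one inner flow always get the same outer
 * source port, so anything hashing on the outer UDP 4-tuple keeps the
 * flow together while different flows spread out.
 */
static uint16_t flow_src_port(uint32_t inner_hash, uint16_t min,
			      uint16_t max)
{
	uint32_t span;

	assert(min < max);
	span = (uint32_t)(max - min);
	return (uint16_t)((((uint64_t)inner_hash * span) >> 32) + min);
}
```

For example, flow_src_port(hash, 32768, 61000) yields a port between
32768 and 60999 inclusive for any hash value.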

We already have a mess with all the GSO protocol variants for
different protocols because no one has defined a generic TSO
mechanism, let's avoid repeating that for RX.

>> Rx checksum is also an unnecessary consideration if devices return
>> CHECKSUM_COMPLETE instead of CHECKSUM_UNNECESSARY. Pretty much
>> anything can (and probably will) be encapsulated in UDP (VXLAN, GRE,
>> MPLS, L2TP, IPIP, SIT, etc.), so if your hardware provides
>> CHECKSUM_COMPLETE this immediately gives us easy calculation of the
>> embedded checksums no matter how many encapsulation layers there are.
>
> This property only applies to ones-complement checksums though. If I
> recall correctly, I believe you have a desire for something stronger
> :)

True, I desire full line rate encryption of all packets :-). In order
to do this efficiently and generically we will want to do
something like ESP/UDP to keep the flow hash visible. So this is one
valid case where we'd need to configure the HW with a UDP port if it
is to do decryption.

btw, the Geneve draft allows non-zero UDP checksums to be ignored like
in VXLAN-- this is a violation of the UDP standard :-(. We will not do
this in the stack, but it opens the possibility that HW may tell us
checksum is okay when it actually isn't. Accepting
CHECKSUM_UNNECESSARY from all these devices is quite the leap of faith
we're taking!

>
>> Another need for parsing UDP contents would be for LRO. This would
>> require implementation of each encapsulation format supported. I
>> believe that LRO is pretty much deprecated, so maybe this is not an issue
>> either.
>
> I think only the old style of LRO is deprecated. Some drivers provide
> "GRO" where the hardware supplies the original MSS and that works OK.
>
> Some of these are obviously future looking but I think that means that
> even if you got your desired changes, the use of the UDP port on
> receive would only shift, not go away.

I think you're hitting the major point that we have to be future
looking. When hardware hardwires specific protocols instead of using
generic mechanisms, we become pigeonholed-- this is *not* future
looking and in the long run it's a disservice to customers if we
advocate this in the stack. Consider that geneve is likely superior to
VXLAN because it is extensible, but that VXLAN may still win since it
is already "supported" in so much HW.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 05/10] net: Add Geneve tunneling protocol driver
  2014-07-22 10:19 ` [net-next 05/10] net: Add Geneve tunneling protocol driver Andy Zhou
  2014-07-22 23:12   ` Alexander Duyck
@ 2014-07-23 18:20   ` Stephen Hemminger
  1 sibling, 0 replies; 46+ messages in thread
From: Stephen Hemminger @ 2014-07-23 18:20 UTC (permalink / raw)
  To: Andy Zhou; +Cc: davem, netdev, Jesse Gross

On Tue, 22 Jul 2014 03:19:48 -0700
Andy Zhou <azhou@nicira.com> wrote:

> +	geneveh->ver = GENEVE_VER;
> +	geneveh->opt_len = options_len / 4;
> +	geneveh->oam = !!(tun_flags & TUNNEL_OAM);
> +	geneveh->critical = !!(tun_flags & TUNNEL_CRIT_OPT);
> +	geneveh->rsvd1 = 0;

Bitfields suck in C.

Setting bitfield values individually generates slow code because
the compiler has to generate multiple mask operations.

Better to change this part to one set of shifts and do one load.
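
A minimal sketch of what that could look like, assuming the layout
from the Geneve draft (Ver in the top 2 bits and Opt Len in the low 6
bits of byte 0, then O, C and 6 reserved bits in byte 1). The EX_*
constants and the function names are hypothetical stand-ins, not the
values or helpers used in the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the patch's constants; values are illustrative. */
#define EX_GENEVE_VER      0u
#define EX_TUNNEL_OAM      0x1u
#define EX_TUNNEL_CRIT_OPT 0x2u

/*
 * Build the first two bytes of the Geneve base header with shifts:
 *   byte 0: Ver (2 bits) | Opt Len (6 bits, in 4-byte words)
 *   byte 1: O (1 bit) | C (1 bit) | Rsvd (6 bits, left zero)
 * The result is in host order with byte 0 in the high 8 bits.
 */
static uint16_t geneve_pack_flags(size_t options_len, unsigned int tun_flags)
{
	unsigned int v;

	/* Opt Len counts 4-byte words and must fit in 6 bits. */
	assert(options_len % 4 == 0 && options_len / 4 <= 0x3f);

	v  = (EX_GENEVE_VER & 0x3u) << 14;
	v |= (unsigned int)(options_len / 4) << 8;
	v |= (unsigned int)!!(tun_flags & EX_TUNNEL_OAM) << 7;
	v |= (unsigned int)!!(tun_flags & EX_TUNNEL_CRIT_OPT) << 6;
	return (uint16_t)v;
}

/* Write the packed value to the first two header bytes in network order. */
static void geneve_store_flags(uint8_t *hdr, uint16_t packed)
{
	hdr[0] = (uint8_t)(packed >> 8);
	hdr[1] = (uint8_t)(packed & 0xff);
}
```

All five fields, including the reserved bits, are assembled in a
register and stored once, instead of a read-modify-write per bitfield.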

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-22 10:19 ` [net-next 02/10] udp: Expand UDP tunnel common APIs Andy Zhou
       [not found]   ` <CA+mtBx9M_BpjT-_Egng+jFxmqJzdC2Npg0ufE2ZSAb9Lhw8hxg@mail.gmail.com>
@ 2014-07-23 19:57   ` Tom Herbert
  2014-07-24 20:23     ` Andy Zhou
  1 sibling, 1 reply; 46+ messages in thread
From: Tom Herbert @ 2014-07-23 19:57 UTC (permalink / raw)
  To: Andy Zhou; +Cc: David Miller, Linux Netdev List

On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
> Added create_udp_tunnel_socket(), packet receive and transmit,  and
> other related common functions for UDP tunnels.
>
> Per net open UDP tunnel ports are tracked in this common layer to
> prevent sharing of a single port with more than one UDP tunnel.
>
> Signed-off-by: Andy Zhou <azhou@nicira.com>
> ---
>  include/net/udp_tunnel.h |   57 +++++++++-
>  net/ipv4/udp_tunnel.c    |  257 +++++++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 312 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
> index 3f34c65..b5e815a 100644
> --- a/include/net/udp_tunnel.h
> +++ b/include/net/udp_tunnel.h
> @@ -1,7 +1,10 @@
>  #ifndef __NET_UDP_TUNNEL_H
>  #define __NET_UDP_TUNNEL_H
>
> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
> +#include <net/ip_tunnels.h>
> +
> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>
>  struct udp_port_cfg {
>         u8                      family;
> @@ -28,7 +31,59 @@ struct udp_port_cfg {
>                                 use_udp6_rx_checksums:1;
>  };
>
> +struct udp_tunnel_sock;
> +
> +typedef void (udp_tunnel_rcv_t)(struct udp_tunnel_sock *uts,
> +                               struct sk_buff *skb, ...);
> +
> +typedef int (udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
> +
> +struct udp_tunnel_socket_cfg {
> +       u8 tunnel_type;
> +       struct udp_port_cfg port;
> +       udp_tunnel_rcv_t *rcv;
> +       udp_tunnel_encap_rcv_t *encap_rcv;

Why do you need two receive functions or udp_tunnel_rcv_t?

> +       void *data;

Similarly, why is this needed when we already have sk_user_data?

> +};
> +
> +struct udp_tunnel_sock {
> +       u8 tunnel_type;
> +       struct hlist_node hlist;
> +       udp_tunnel_rcv_t *rcv;
> +       void *data;
> +       struct socket *sock;
> +};
> +
>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>                     struct socket **sockp);
>
> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
> +                                                struct udp_tunnel_socket_cfg
> +                                                       *socket_cfg);
> +
> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port);
> +
> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
> +                       struct sk_buff *skb, __be32 src, __be32 dst,
> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
> +                       __be16 dst_port, bool xnet);
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
> +               struct sk_buff *skb, struct net_device *dev,
> +               struct in6_addr *saddr, struct in6_addr *daddr,
> +               __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port);
> +
> +#endif
> +
> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts);
> +void udp_tunnel_get_rx_port(struct net_device *dev);
> +
> +static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
> +                                                        bool udp_csum)
> +{
> +       int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
> +
> +       return iptunnel_handle_offloads(skb, udp_csum, type);
> +}
>  #endif
> diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
> index 61ec1a6..3c14b16 100644
> --- a/net/ipv4/udp_tunnel.c
> +++ b/net/ipv4/udp_tunnel.c
> @@ -7,6 +7,23 @@
>  #include <net/udp.h>
>  #include <net/udp_tunnel.h>
>  #include <net/net_namespace.h>
> +#include <net/netns/generic.h>
> +#if IS_ENABLED(CONFIG_IPV6)
> +#include <net/ipv6.h>
> +#include <net/addrconf.h>
> +#include <net/ip6_tunnel.h>
> +#include <net/ip6_checksum.h>
> +#endif
> +
> +#define PORT_HASH_BITS 8
> +#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)
> +
> +static int udp_tunnel_net_id;
> +
> +struct udp_tunnel_net {
> +       struct hlist_head sock_list[PORT_HASH_SIZE];
> +       spinlock_t  sock_lock;   /* Protecting the sock_list */
> +};
>
>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>                     struct socket **sockp)
> @@ -82,7 +99,6 @@ int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>                 return -EPFNOSUPPORT;
>         }
>
> -
>         *sockp = sock;
>
>         return 0;
> @@ -97,4 +113,243 @@ error:
>  }
>  EXPORT_SYMBOL(udp_sock_create);
>
> +
> +/* Socket hash table head */
> +static inline struct hlist_head *uts_head(struct net *net, const __be16 port)
> +{
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +
> +       return &utn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
> +}
> +
> +static int handle_offloads(struct sk_buff *skb)
> +{
> +       if (skb_is_gso(skb)) {
> +               int err = skb_unclone(skb, GFP_ATOMIC);
> +
> +               if (unlikely(err))
> +                       return err;
> +               skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
> +       } else {
> +               if (skb->ip_summed != CHECKSUM_PARTIAL)
> +                       skb->ip_summed = CHECKSUM_NONE;
> +       }
> +
> +       return 0;
> +}
> +
> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
> +                                                struct udp_tunnel_socket_cfg
> +                                                       *cfg)
> +{
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +       struct udp_tunnel_sock *uts;
> +       struct socket *sock;
> +       struct sock *sk;
> +       const __be16 port = cfg->port.local_udp_port;
> +       const int ipv6 = (cfg->port.family == AF_INET6);
> +       int err;
> +
> +       uts = kzalloc(size, GFP_KERNEL);
> +       if (!uts)
> +               return ERR_PTR(-ENOMEM);
> +
> +       err = udp_sock_create(net, &cfg->port, &sock);
> +       if (err < 0) {
> +               kfree(uts);
> +               return ERR_PTR(err);
> +       }
> +
> +       /* Disable multicast loopback */
> +       inet_sk(sock->sk)->mc_loop = 0;
> +
> +       uts->sock = sock;
> +       sk = sock->sk;
> +       uts->rcv = cfg->rcv;
> +       uts->data = cfg->data;
> +       rcu_assign_sk_user_data(sock->sk, uts);
> +
> +       spin_lock(&utn->sock_lock);
> +       hlist_add_head_rcu(&uts->hlist, uts_head(net, port));
> +       spin_unlock(&utn->sock_lock);
> +
> +       udp_sk(sk)->encap_type = 1;
> +       udp_sk(sk)->encap_rcv = cfg->encap_rcv;
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +       if (ipv6)
> +               ipv6_stub->udpv6_encap_enable();
> +       else
> +#endif
> +               udp_encap_enable();
> +
> +       return uts;
> +}
> +EXPORT_SYMBOL_GPL(create_udp_tunnel_socket);
> +
> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
> +                       struct sk_buff *skb, __be32 src, __be32 dst,
> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
> +                       __be16 dst_port, bool xnet)
> +{
> +       struct udphdr *uh;
> +
> +       __skb_push(skb, sizeof(*uh));
> +       skb_reset_transport_header(skb);
> +       uh = udp_hdr(skb);
> +
> +       uh->dest = dst_port;
> +       uh->source = src_port;
> +       uh->len = htons(skb->len);
> +
> +       udp_set_csum(sock->sk->sk_no_check_tx, skb, src, dst, skb->len);
> +
> +       return iptunnel_xmit(sock->sk, rt, skb, src, dst, IPPROTO_UDP,
> +                            tos, ttl, df, xnet);
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel_xmit_skb);
> +
> +#if IS_ENABLED(CONFIG_IPV6)
> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
> +                        struct sk_buff *skb, struct net_device *dev,
> +                        struct in6_addr *saddr, struct in6_addr *daddr,
> +                        __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port)
> +{
> +       struct udphdr *uh;
> +       struct ipv6hdr *ip6h;
> +       int err;
> +
> +       __skb_push(skb, sizeof(*uh));
> +       skb_reset_transport_header(skb);
> +       uh = udp_hdr(skb);
> +
> +       uh->dest = dst_port;
> +       uh->source = src_port;
> +
> +       uh->len = htons(skb->len);
> +       uh->check = 0;
> +
> +       memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
> +       IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED
> +                           | IPSKB_REROUTED);
> +       skb_dst_set(skb, dst);
> +
> +       if (!skb_is_gso(skb) && !(dst->dev->features & NETIF_F_IPV6_CSUM)) {
> +               __wsum csum = skb_checksum(skb, 0, skb->len, 0);
> +
> +               skb->ip_summed = CHECKSUM_UNNECESSARY;
> +               uh->check = csum_ipv6_magic(saddr, daddr, skb->len,
> +                               IPPROTO_UDP, csum);
> +               if (uh->check == 0)
> +                       uh->check = CSUM_MANGLED_0;
> +       } else {
> +               skb->ip_summed = CHECKSUM_PARTIAL;
> +               skb->csum_start = skb_transport_header(skb) - skb->head;
> +               skb->csum_offset = offsetof(struct udphdr, check);
> +               uh->check = ~csum_ipv6_magic(saddr, daddr,
> +                               skb->len, IPPROTO_UDP, 0);
> +       }
> +
> +       __skb_push(skb, sizeof(*ip6h));
> +       skb_reset_network_header(skb);
> +       ip6h              = ipv6_hdr(skb);
> +       ip6h->version     = 6;
> +       ip6h->priority    = prio;
> +       ip6h->flow_lbl[0] = 0;
> +       ip6h->flow_lbl[1] = 0;
> +       ip6h->flow_lbl[2] = 0;
> +       ip6h->payload_len = htons(skb->len);
> +       ip6h->nexthdr     = IPPROTO_UDP;
> +       ip6h->hop_limit   = ttl;
> +       ip6h->daddr       = *daddr;
> +       ip6h->saddr       = *saddr;
> +
> +       err = handle_offloads(skb);
> +       if (err)
> +               return err;
> +
> +       ip6tunnel_xmit(skb, dev);
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
> +#endif
> +
> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port)
> +{
> +       struct udp_tunnel_sock *uts;
> +
> +       hlist_for_each_entry_rcu(uts, uts_head(net, port), hlist) {
> +               if (inet_sk(uts->sock->sk)->inet_sport == port)
> +                       return uts;
> +       }
> +
> +       return NULL;
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel_find_sock);
> +
> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts)
> +{
> +       struct sock *sk = uts->sock->sk;
> +       struct net *net = sock_net(sk);
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +
> +       spin_lock(&utn->sock_lock);
> +       hlist_del_rcu(&uts->hlist);
> +       rcu_assign_sk_user_data(uts->sock->sk, NULL);
> +       spin_unlock(&utn->sock_lock);
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
> +
> +/* Calls the ndo_add_udp_tunnel_port of the caller in order to
> + * supply the listening UDP tunnel ports. Callers are expected
> + * to implement ndo_add_udp_tunnel_port.
> + */
> +void udp_tunnel_get_rx_port(struct net_device *dev)
> +{
> +       struct udp_tunnel_sock *uts;
> +       struct net *net = dev_net(dev);
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +       sa_family_t sa_family;
> +       __be16 port;
> +       unsigned int i;
> +
> +       spin_lock(&utn->sock_lock);
> +       for (i = 0; i < PORT_HASH_SIZE; ++i) {
> +               hlist_for_each_entry_rcu(uts, &utn->sock_list[i], hlist) {
> +                       port = inet_sk(uts->sock->sk)->inet_sport;
> +                       sa_family = uts->sock->sk->sk_family;
> +                       dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
> +                                       sa_family, port, uts->tunnel_type);
> +               }
> +       }
> +       spin_unlock(&utn->sock_lock);
> +}
> +EXPORT_SYMBOL_GPL(udp_tunnel_get_rx_port);
> +
> +static int __net_init udp_tunnel_init_net(struct net *net)
> +{
> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
> +       unsigned int h;
> +
> +       spin_lock_init(&utn->sock_lock);
> +
> +       for (h = 0; h < PORT_HASH_SIZE; h++)
> +               INIT_HLIST_HEAD(&utn->sock_list[h]);
> +
> +       return 0;
> +}
> +
> +static struct pernet_operations udp_tunnel_net_ops = {
> +       .init = udp_tunnel_init_net,
> +       .exit = NULL,
> +       .id = &udp_tunnel_net_id,
> +       .size = sizeof(struct udp_tunnel_net),
> +};
> +
> +static int __init udp_tunnel_init(void)
> +{
> +       return register_pernet_subsys(&udp_tunnel_net_ops);
> +}
> +late_initcall(udp_tunnel_init);
> +
>  MODULE_LICENSE("GPL");
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 10/10] openvswitch: Add support for Geneve tunneling.
  2014-07-22 10:19 ` [net-next 10/10] openvswitch: Add support for Geneve tunneling Andy Zhou
@ 2014-07-23 20:29   ` Tom Herbert
  2014-07-24  4:10     ` Jesse Gross
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Herbert @ 2014-07-23 20:29 UTC (permalink / raw)
  To: Andy Zhou; +Cc: David Miller, Linux Netdev List, Jesse Gross

On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
> From: Jesse Gross <jesse@nicira.com>
>
> The Openvswitch implementation is completely agnostic to the options
> that are in use and can handle newly defined options without
> further work. It does this by simply matching on a byte array
> of options and allowing userspace to setup flows on this array.
>
> Userspace currently implements only support for basic version of
> Geneve. It can work with the base header (including the VNI) and
> is capable of parsing options but does not currently support any
> particular option definitions. Over time, the intention is to
> allow options to be matched through OpenFlow without requiring
> explicit support in OVS userspace.
>
> Signed-off-by: Jesse Gross <jesse@nicira.com>
> Signed-off-by: Andy Zhou <azhou@nicira.com>
> ---
>  include/uapi/linux/openvswitch.h |    2 +
>  net/openvswitch/Makefile         |    5 +
>  net/openvswitch/datapath.c       |   32 +++--
>  net/openvswitch/flow.c           |   10 ++
>  net/openvswitch/flow.h           |   19 ++-
>  net/openvswitch/flow_netlink.c   |  143 ++++++++++++++++++---
>  net/openvswitch/flow_netlink.h   |    2 +-
>  net/openvswitch/vport-geneve.c   |  258 ++++++++++++++++++++++++++++++++++++++
>  net/openvswitch/vport-gre.c      |    2 +-
>  net/openvswitch/vport-vxlan.c    |    2 +-
>  net/openvswitch/vport.c          |    1 +
>  net/openvswitch/vport.h          |    1 +
>  12 files changed, 446 insertions(+), 31 deletions(-)
>  create mode 100644 net/openvswitch/vport-geneve.c
>
> diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
> index 3b72277..0c6e846 100644
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -189,6 +189,7 @@ enum ovs_vport_type {
>         OVS_VPORT_TYPE_INTERNAL, /* network device implemented by datapath */
>         OVS_VPORT_TYPE_GRE,      /* GRE tunnel. */
>         OVS_VPORT_TYPE_VXLAN,    /* VXLAN tunnel. */
> +       OVS_VPORT_TYPE_GENEVE = 6,   /* Geneve tunnel. */
>         __OVS_VPORT_TYPE_MAX
>  };
>
> @@ -302,6 +303,7 @@ enum ovs_tunnel_key_attr {
>         OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT,      /* No argument, set DF. */
>         OVS_TUNNEL_KEY_ATTR_CSUM,               /* No argument. CSUM packet. */
>         OVS_TUNNEL_KEY_ATTR_OAM,                /* No argument. OAM frame.  */
> +       OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,        /* Array of Geneve options.  */
>         __OVS_TUNNEL_KEY_ATTR_MAX
>  };
>
> diff --git a/net/openvswitch/Makefile b/net/openvswitch/Makefile
> index 3591cb5..2bbfc32 100644
> --- a/net/openvswitch/Makefile
> +++ b/net/openvswitch/Makefile
> @@ -13,6 +13,7 @@ openvswitch-y := \
>         flow_table.o \
>         vport.o \
>         vport-internal_dev.o \
> +       vport-geneve.o  \
>         vport-netdev.o
>
>  ifneq ($(CONFIG_OPENVSWITCH_VXLAN),)
> @@ -22,3 +23,7 @@ endif
>  ifneq ($(CONFIG_OPENVSWITCH_GRE),)
>  openvswitch-y += vport-gre.o
>  endif
> +
> +ifneq ($(CONFIG_OPENVSWITCH_GENEVE),)
> +openvswitch-y += vport-geneve.o
> +endif
> diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
> index daa935f..29f877e 100644
> --- a/net/openvswitch/datapath.c
> +++ b/net/openvswitch/datapath.c
> @@ -376,6 +376,7 @@ static size_t key_attr_size(void)
>                   + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT */
>                   + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_CSUM */
>                   + nla_total_size(0)   /* OVS_TUNNEL_KEY_ATTR_OAM */
> +                 + nla_total_size(256) /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
>                 + nla_total_size(4)   /* OVS_KEY_ATTR_IN_PORT */
>                 + nla_total_size(4)   /* OVS_KEY_ATTR_SKB_MARK */
>                 + nla_total_size(12)  /* OVS_KEY_ATTR_ETHERNET */
> @@ -465,7 +466,7 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
>         upcall->dp_ifindex = dp_ifindex;
>
>         nla = nla_nest_start(user_skb, OVS_PACKET_ATTR_KEY);
> -       ovs_nla_put_flow(upcall_info->key, upcall_info->key, user_skb);
> +       ovs_nla_put_flow(dp, upcall_info->key, upcall_info->key, user_skb);
>         nla_nest_end(user_skb, nla);
>
>         if (upcall_info->userdata)
> @@ -662,7 +663,8 @@ static size_t ovs_flow_cmd_msg_size(const struct sw_flow_actions *acts)
>  }
>
>  /* Called with ovs_mutex or RCU read lock. */
> -static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
> +static int ovs_flow_cmd_fill_info(struct datapath *dp,
> +                                 const struct sw_flow *flow, int dp_ifindex,
>                                   struct sk_buff *skb, u32 portid,
>                                   u32 seq, u32 flags, u8 cmd)
>  {
> @@ -686,7 +688,8 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
>         if (!nla)
>                 goto nla_put_failure;
>
> -       err = ovs_nla_put_flow(&flow->unmasked_key, &flow->unmasked_key, skb);
> +       err = ovs_nla_put_flow(dp, &flow->unmasked_key,
> +                              &flow->unmasked_key, skb);
>         if (err)
>                 goto error;
>         nla_nest_end(skb, nla);
> @@ -695,7 +698,7 @@ static int ovs_flow_cmd_fill_info(const struct sw_flow *flow, int dp_ifindex,
>         if (!nla)
>                 goto nla_put_failure;
>
> -       err = ovs_nla_put_flow(&flow->key, &flow->mask->key, skb);
> +       err = ovs_nla_put_flow(dp, &flow->key, &flow->mask->key, skb);
>         if (err)
>                 goto error;
>
> @@ -771,7 +774,8 @@ static struct sk_buff *ovs_flow_cmd_alloc_info(const struct sw_flow_actions *act
>  }
>
>  /* Called with ovs_mutex. */
> -static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
> +static struct sk_buff *ovs_flow_cmd_build_info(struct datapath *dp,
> +                                              const struct sw_flow *flow,
>                                                int dp_ifindex,
>                                                struct genl_info *info, u8 cmd,
>                                                bool always)
> @@ -784,7 +788,7 @@ static struct sk_buff *ovs_flow_cmd_build_info(const struct sw_flow *flow,
>         if (!skb || IS_ERR(skb))
>                 return skb;
>
> -       retval = ovs_flow_cmd_fill_info(flow, dp_ifindex, skb,
> +       retval = ovs_flow_cmd_fill_info(dp, flow, dp_ifindex, skb,
>                                         info->snd_portid, info->snd_seq, 0,
>                                         cmd);
>         BUG_ON(retval < 0);
> @@ -866,7 +870,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
>                 }
>
>                 if (unlikely(reply)) {
> -                       error = ovs_flow_cmd_fill_info(new_flow,
> +                       error = ovs_flow_cmd_fill_info(dp, new_flow,
>                                                        ovs_header->dp_ifindex,
>                                                        reply, info->snd_portid,
>                                                        info->snd_seq, 0,
> @@ -901,7 +905,7 @@ static int ovs_flow_cmd_new(struct sk_buff *skb, struct genl_info *info)
>                 rcu_assign_pointer(flow->sf_acts, acts);
>
>                 if (unlikely(reply)) {
> -                       error = ovs_flow_cmd_fill_info(flow,
> +                       error = ovs_flow_cmd_fill_info(dp, flow,
>                                                        ovs_header->dp_ifindex,
>                                                        reply, info->snd_portid,
>                                                        info->snd_seq, 0,
> @@ -1013,7 +1017,7 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
>                 rcu_assign_pointer(flow->sf_acts, acts);
>
>                 if (unlikely(reply)) {
> -                       error = ovs_flow_cmd_fill_info(flow,
> +                       error = ovs_flow_cmd_fill_info(dp, flow,
>                                                        ovs_header->dp_ifindex,
>                                                        reply, info->snd_portid,
>                                                        info->snd_seq, 0,
> @@ -1022,7 +1026,8 @@ static int ovs_flow_cmd_set(struct sk_buff *skb, struct genl_info *info)
>                 }
>         } else {
>                 /* Could not alloc without acts before locking. */
> -               reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex,
> +               reply = ovs_flow_cmd_build_info(dp, flow,
> +                                               ovs_header->dp_ifindex,
>                                                 info, OVS_FLOW_CMD_NEW, false);
>                 if (unlikely(IS_ERR(reply))) {
>                         error = PTR_ERR(reply);
> @@ -1085,7 +1090,7 @@ static int ovs_flow_cmd_get(struct sk_buff *skb, struct genl_info *info)
>                 goto unlock;
>         }
>
> -       reply = ovs_flow_cmd_build_info(flow, ovs_header->dp_ifindex, info,
> +       reply = ovs_flow_cmd_build_info(dp, flow, ovs_header->dp_ifindex, info,
>                                         OVS_FLOW_CMD_NEW, true);
>         if (IS_ERR(reply)) {
>                 err = PTR_ERR(reply);
> @@ -1143,7 +1148,8 @@ static int ovs_flow_cmd_del(struct sk_buff *skb, struct genl_info *info)
>         if (likely(reply)) {
>                 if (likely(!IS_ERR(reply))) {
>                         rcu_read_lock();        /*To keep RCU checker happy. */
> -                       err = ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex,
> +                       err = ovs_flow_cmd_fill_info(dp, flow,
> +                                                    ovs_header->dp_ifindex,
>                                                      reply, info->snd_portid,
>                                                      info->snd_seq, 0,
>                                                      OVS_FLOW_CMD_DEL);
> @@ -1187,7 +1193,7 @@ static int ovs_flow_cmd_dump(struct sk_buff *skb, struct netlink_callback *cb)
>                 if (!flow)
>                         break;
>
> -               if (ovs_flow_cmd_fill_info(flow, ovs_header->dp_ifindex, skb,
> +               if (ovs_flow_cmd_fill_info(dp, flow, ovs_header->dp_ifindex, skb,
>                                            NETLINK_CB(cb->skb).portid,
>                                            cb->nlh->nlmsg_seq, NLM_F_MULTI,
>                                            OVS_FLOW_CMD_NEW) < 0)
> diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
> index 3d0adc5..b487cab 100644
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -454,7 +454,17 @@ int ovs_flow_extract(struct sk_buff *skb, u16 in_port, struct sw_flow_key *key)
>                 struct ovs_tunnel_info *tun_info = OVS_CB(skb)->tun_info;
>                 memcpy(&key->tun_key, &tun_info->tunnel,
>                         sizeof(key->tun_key));
> +               if (tun_info->options) {
> +                       BUILD_BUG_ON((1 << (sizeof(tun_info->options_len) * 8)) - 1
> +                                       > sizeof(key->tun_opts));
> +                       memcpy(GENEVE_OPTS(key, tun_info->options_len),
> +                               tun_info->options, tun_info->options_len);
> +                       key->tun_opts_len = tun_info->options_len;
> +               } else {
> +                       key->tun_opts_len = 0;
> +               }
>         } else {
> +               key->tun_opts_len = 0;
>                 memset(&key->tun_key, 0, sizeof(key->tun_key));
>         }
>
> diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
> index 6261ad0..216aa1b 100644
> --- a/net/openvswitch/flow.h
> +++ b/net/openvswitch/flow.h
> @@ -51,11 +51,23 @@ struct ovs_key_ipv4_tunnel {
>
>  struct ovs_tunnel_info {
>         struct ovs_key_ipv4_tunnel tunnel;
> +       struct geneve_opt *options;
> +       u8 options_len;
>  };
>
> +/* Store options at the end of the array if they are less than the
> + * maximum size. This allows us to get the benefits of variable length
> + * matching for small options.
> + */
> +#define GENEVE_OPTS(flow_key, opt_len) (struct geneve_opt *) \
> +                                       ((flow_key)->tun_opts + \
> +                                       FIELD_SIZEOF(struct sw_flow_key, tun_opts) - \
> +                                          opt_len)
>  static inline void ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
>                                          const struct iphdr *iph, __be64 tun_id,
> -                                        __be16 tun_flags)
> +                                        __be16 tun_flags,
> +                                        struct geneve_opt *opts,
> +                                        u8 opts_len)
>  {
>         tun_info->tunnel.tun_id = tun_id;
>         tun_info->tunnel.ipv4_src = iph->saddr;
> @@ -67,9 +79,14 @@ static inline void ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
>         /* clear struct padding. */
>         memset((unsigned char *) &tun_info->tunnel + OVS_TUNNEL_KEY_SIZE, 0,
>                sizeof(tun_info->tunnel) - OVS_TUNNEL_KEY_SIZE);
> +
> +       tun_info->options = opts;
> +       tun_info->options_len = opts_len;
>  }
>
>  struct sw_flow_key {
> +       u8 tun_opts[255];
> +       u8 tun_opts_len;
>         struct ovs_key_ipv4_tunnel tun_key;  /* Encapsulating tunnel key. */
>         struct {
>                 u32     priority;       /* Packet QoS priority. */
> diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
> index aa7c3d5..e0399c9 100644
> --- a/net/openvswitch/flow_netlink.c
> +++ b/net/openvswitch/flow_netlink.c
> @@ -42,6 +42,7 @@
>  #include <linux/icmp.h>
>  #include <linux/icmpv6.h>
>  #include <linux/rculist.h>
> +#include <net/geneve.h>
>  #include <net/ip.h>
>  #include <net/ipv6.h>
>  #include <net/ndisc.h>
> @@ -88,18 +89,21 @@ static void update_range__(struct sw_flow_match *match,
>                 }                                                           \
>         } while (0)
>
> -#define SW_FLOW_KEY_MEMCPY(match, field, value_p, len, is_mask) \
> +#define SW_FLOW_KEY_MEMCPY_OFFSET(match, offset, value_p, len, is_mask) \
>         do { \
> -               update_range__(match, offsetof(struct sw_flow_key, field),  \
> -                               len, is_mask);                              \
> +               update_range__(match, offset, len, is_mask);                \
>                 if (is_mask) {                                              \
>                         if ((match)->mask)                                  \
> -                               memcpy(&(match)->mask->key.field, value_p, len);\
> +                               memcpy((u8 *)&(match)->mask->key + offset, value_p, len);\
>                 } else {                                                    \
> -                       memcpy(&(match)->key->field, value_p, len);         \
> +                       memcpy((u8 *)(match)->key + offset, value_p, len);         \
>                 }                                                           \
>         } while (0)
>
> +#define SW_FLOW_KEY_MEMCPY(match, field, value_p, len, is_mask) \
> +       SW_FLOW_KEY_MEMCPY_OFFSET(match, offsetof(struct sw_flow_key, field), \
> +                                 value_p, len, is_mask)
> +
>  static u16 range_n_bytes(const struct sw_flow_key_range *range)
>  {
>         return range->end - range->start;
> @@ -345,6 +349,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
>                         [OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT] = 0,
>                         [OVS_TUNNEL_KEY_ATTR_CSUM] = 0,
>                         [OVS_TUNNEL_KEY_ATTR_OAM] = 0,
> +                       [OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS] = -1,
>                 };
>
>                 if (type > OVS_TUNNEL_KEY_ATTR_MAX) {
> @@ -353,7 +358,8 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
>                         return -EINVAL;
>                 }
>
> -               if (ovs_tunnel_key_lens[type] != nla_len(a)) {
> +               if (ovs_tunnel_key_lens[type] != nla_len(a) &&
> +                   ovs_tunnel_key_lens[type] != -1) {
>                         OVS_NLERR("IPv4 tunnel attribute type has unexpected "
>                                   " length (type=%d, length=%d, expected=%d).\n",
>                                   type, nla_len(a), ovs_tunnel_key_lens[type]);
> @@ -392,6 +398,56 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
>                 case OVS_TUNNEL_KEY_ATTR_OAM:
>                         tun_flags |= TUNNEL_OAM;
>                         break;
> +               case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
> +                       if (nla_len(a) > sizeof(match->key->tun_opts)) {
> +                               OVS_NLERR("Geneve option length exceeds "
> +                                         "maximum size (len %d, max %zu).\n",
> +                                         nla_len(a),
> +                                         sizeof(match->key->tun_opts));
> +                               return -EINVAL;
> +                       }
> +
> +                       if (nla_len(a) % 4 != 0) {
> +                               OVS_NLERR("Geneve option length is not "
> +                                         "a multiple of 4 (len %d).\n",
> +                                         nla_len(a));
> +                               return -EINVAL;
> +                       }
> +
> +                       /* We need to record the length of the options passed
> +                        * down, otherwise packets with the same format but
> +                        * additional options will be silently matched.
> +                        */
> +                       if (!is_mask) {
> +                               SW_FLOW_KEY_PUT(match, tun_opts_len, nla_len(a),
> +                                               false);
> +                       } else {
> +                               /* This is somewhat unusual because it looks at
> +                                * both the key and mask while parsing the
> +                                * attributes (and by extension assumes the key
> +                                * is parsed first). Normally, we would verify
> +                                * that each is the correct length and that the
> +                                * attributes line up in the validate function.
> +                                * However, that is difficult because this is
> +                                * variable length and we won't have the
> +                                * information later.
> +                                */
> +                               if (match->key->tun_opts_len != nla_len(a)) {
> +                                       OVS_NLERR("Geneve option key length (%d)"
> +                                          " is different from mask length (%d).",
> +                                          match->key->tun_opts_len, nla_len(a));
> +                                       return -EINVAL;
> +                               }
> +
> +                               SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff,
> +                                               true);
> +                       }
> +
> +                       SW_FLOW_KEY_MEMCPY_OFFSET(match,
> +                               (unsigned long)GENEVE_OPTS((struct sw_flow_key *)0,
> +                                                          nla_len(a)),
> +                               nla_data(a), nla_len(a), is_mask);
> +                       break;
>                 default:
>                         return -EINVAL;
>                 }
> @@ -420,8 +476,9 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
>  }
>
>  static int ipv4_tun_to_nlattr(struct sk_buff *skb,
> -                             const struct ovs_key_ipv4_tunnel *tun_key,
> -                             const struct ovs_key_ipv4_tunnel *output)
> +                             const struct ovs_key_ipv4_tunnel *output,
> +                             const struct geneve_opt *tun_opts,
> +                             int swkey_tun_opts_len)
>  {
>         struct nlattr *nla;
>
> @@ -452,6 +509,9 @@ static int ipv4_tun_to_nlattr(struct sk_buff *skb,
>         if ((output->tun_flags & TUNNEL_OAM) &&
>                 nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_OAM))
>                 return -EMSGSIZE;
> +       if (tun_opts &&
> +           nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
> +                   swkey_tun_opts_len, tun_opts));
>
>         nla_nest_end(skb, nla);
>         return 0;
> @@ -881,7 +941,7 @@ int ovs_nla_get_flow_metadata(struct sw_flow *flow,
>         return 0;
>  }
>
> -int ovs_nla_put_flow(const struct sw_flow_key *swkey,
> +int ovs_nla_put_flow(struct datapath *dp, const struct sw_flow_key *swkey,
>                      const struct sw_flow_key *output, struct sk_buff *skb)
>  {
>         struct ovs_key_ethernet *eth_key;
> @@ -891,9 +951,24 @@ int ovs_nla_put_flow(const struct sw_flow_key *swkey,
>         if (nla_put_u32(skb, OVS_KEY_ATTR_PRIORITY, output->phy.priority))
>                 goto nla_put_failure;
>
> -       if ((swkey->tun_key.ipv4_dst || is_mask) &&
> -           ipv4_tun_to_nlattr(skb, &swkey->tun_key, &output->tun_key))
> -               goto nla_put_failure;
> +       if ((swkey->tun_key.ipv4_dst || is_mask)) {
> +               const struct geneve_opt *opts = NULL;
> +
> +               if (!is_mask) {
> +                       struct vport *in_port;
> +
> +                       in_port = ovs_vport_ovsl_rcu(dp, swkey->phy.in_port);
> +                       if (in_port->ops->type == OVS_VPORT_TYPE_GENEVE)
> +                               opts = GENEVE_OPTS(output, swkey->tun_opts_len);
> +               } else {
> +                       if (output->tun_opts_len)
> +                               opts = GENEVE_OPTS(output, swkey->tun_opts_len);
> +               }
> +
> +               if (ipv4_tun_to_nlattr(skb, &output->tun_key, opts,
> +                                       swkey->tun_opts_len))
> +                       goto nla_put_failure;
> +       }
>
>         if (swkey->phy.in_port == DP_MAX_PORTS) {
>                 if (is_mask && (output->phy.in_port == 0xffff))
> @@ -1276,17 +1351,55 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
>         if (err)
>                 return err;
>
> +       if (key.tun_opts_len) {
> +               struct geneve_opt *option = GENEVE_OPTS(&key,
> +                                                       key.tun_opts_len);
> +               int opts_len = key.tun_opts_len;
> +               bool crit_opt = false;
> +
> +               while (opts_len > 0) {
> +                       int len;
> +
> +                       if (opts_len < sizeof(*option))
> +                               return -EINVAL;
> +
> +                       len = sizeof(*option) + option->length * 4;
> +                       if (len > opts_len)
> +                               return -EINVAL;
> +
> +                       crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
> +
> +                       option = (struct geneve_opt *)((u8 *)option + len);
> +                       opts_len -= len;
> +               };
> +
> +               key.tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
> +       };
> +
>         start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET);
>         if (start < 0)
>                 return start;
>
>         a = __add_action(sfa, OVS_KEY_ATTR_TUNNEL_INFO, NULL,
> -                       sizeof(*tun_info));
> +                       sizeof(*tun_info) + key.tun_opts_len);
>         if (IS_ERR(a))
>                 return PTR_ERR(a);
>
>         tun_info = nla_data(a);
>         tun_info->tunnel = key.tun_key;
> +       tun_info->options_len = key.tun_opts_len;
> +
> +       if (tun_info->options_len) {
> +               /* We need to store the options in the action itself since
> +                * everything else will go away after flow setup. We can append
> +                * it to tun_info and then point there.
> +                */
> +               tun_info->options = (struct geneve_opt *)(tun_info + 1);
> +               memcpy(tun_info->options, GENEVE_OPTS(&key, key.tun_opts_len),
> +                       key.tun_opts_len);
> +       } else {
> +               tun_info->options = NULL;
> +       }
>
>         add_nested_action_end(*sfa, start);
>
> @@ -1561,7 +1674,9 @@ static int set_action_to_attr(const struct nlattr *a, struct sk_buff *skb)
>                         return -EMSGSIZE;
>
>                 err = ipv4_tun_to_nlattr(skb, &tun_info->tunnel,
> -                                        &tun_info->tunnel);
> +                                        tun_info->options_len ?
> +                                               tun_info->options : NULL,
> +                                        tun_info->options_len);
>                 if (err)
>                         return err;
>                 nla_nest_end(skb, start);
> diff --git a/net/openvswitch/flow_netlink.h b/net/openvswitch/flow_netlink.h
> index 4401510..42de456 100644
> --- a/net/openvswitch/flow_netlink.h
> +++ b/net/openvswitch/flow_netlink.h
> @@ -40,7 +40,7 @@
>  void ovs_match_init(struct sw_flow_match *match,
>                     struct sw_flow_key *key, struct sw_flow_mask *mask);
>
> -int ovs_nla_put_flow(const struct sw_flow_key *,
> +int ovs_nla_put_flow(struct datapath *dp, const struct sw_flow_key *,
>                      const struct sw_flow_key *, struct sk_buff *);
>  int ovs_nla_get_flow_metadata(struct sw_flow *flow,
>                               const struct nlattr *attr);
> diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
> new file mode 100644
> index 0000000..b1b0a3b
> --- /dev/null
> +++ b/net/openvswitch/vport-geneve.c
> @@ -0,0 +1,258 @@
> +/*
> + * Copyright (c) 2014 Nicira, Inc.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> + * 02110-1301, USA
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/version.h>
> +
> +#include <linux/in.h>
> +#include <linux/ip.h>
> +#include <linux/net.h>
> +#include <linux/rculist.h>
> +#include <linux/udp.h>
> +#include <linux/if_vlan.h>
> +
> +#include <net/geneve.h>
> +#include <net/icmp.h>
> +#include <net/ip.h>
> +#include <net/route.h>
> +#include <net/udp.h>
> +#include <net/xfrm.h>
> +
> +#include "datapath.h"
> +#include "vport.h"
> +
> +/**
> + * struct geneve_port - Keeps track of open UDP ports
> + * @sock: The socket created for this port number.
> + * @name: vport name.
> + */
> +struct geneve_port {
> +       struct geneve_sock *gs;
> +       char name[IFNAMSIZ];
> +};
> +
> +static LIST_HEAD(geneve_ports);
> +
> +static inline struct geneve_port *geneve_vport(const struct vport *vport)
> +{
> +       return vport_priv(vport);
> +}
> +
> +static inline struct genevehdr *geneve_hdr(const struct sk_buff *skb)
> +{
> +       return (struct genevehdr *)(udp_hdr(skb) + 1);
> +}
> +
> +/* Convert 64 bit tunnel ID to 24 bit VNI. */
> +static void tunnel_id_to_vni(__be64 tun_id, __u8 *vni)
> +{
> +#ifdef __BIG_ENDIAN
> +       vni[0] = (__force __u8)(tun_id >> 16);
> +       vni[1] = (__force __u8)(tun_id >> 8);
> +       vni[2] = (__force __u8)tun_id;
> +#else
> +       vni[0] = (__force __u8)((__force u64)tun_id >> 40);
> +       vni[1] = (__force __u8)((__force u64)tun_id >> 48);
> +       vni[2] = (__force __u8)((__force u64)tun_id >> 56);
> +#endif
> +}
> +
> +/* Convert 24 bit VNI to 64 bit tunnel ID. */
> +static __be64 vni_to_tunnel_id(__u8 *vni)
> +{
> +#ifdef __BIG_ENDIAN
> +       return (vni[0] << 16) | (vni[1] << 8) | vni[2];
> +#else
> +       return (__force __be64)(((__force u64)vni[0] << 40) |
> +                               ((__force u64)vni[1] << 48) |
> +                               ((__force u64)vni[2] << 56));
> +#endif
> +}
> +
> +static void geneve_rcv(struct geneve_sock *gs, struct sk_buff *skb)
> +{
> +       struct vport *vport = gs->uts.data;
> +       struct genevehdr *geneveh;
> +       int opts_len;
> +       struct ovs_tunnel_info tun_info;
> +       __be64 key;
> +       __be16 flags;
> +
> +       if (unlikely(!pskb_may_pull(skb, GENEVE_BASE_HLEN)))
> +               goto error;
> +
> +       geneveh = geneve_hdr(skb);
> +
> +       if (unlikely(geneveh->ver != GENEVE_VER))
> +               goto error;

> +
> +       if (unlikely(geneveh->proto_type != htons(ETH_P_TEB)))

Why? I thought the point of Geneve carrying a protocol field was to
allow protocols other than Ethernet... is this temporary maybe?

This check also applies in the OAM case, where there is no data packet
but we still force the protocol field to be Ethernet (the meaning of
proto_type when the OAM bit is set is ambiguous in the draft). As I
mentioned on the nvo3 list, this OAM bit is really a 1-bit packet
type. If this bit is donated to the version field (making it a
type/version field), then we can switch on ver_type above and create
another processing path for OAM, so that proto_type is at least not
unnecessarily verified in that case and the bits could even be reused
for some OAM-specific purpose.
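
For illustration only, here is a minimal user-space sketch of that
suggestion; it is not part of the patch or the draft. It assumes the
base header layout from the Geneve draft (2-bit version at the top of
byte 0, O bit at the top of byte 1, protocol type in bytes 2-3). The
combined ver_type value and every sketch_/SKETCH_ name are hypothetical.

```c
#include <stdint.h>

/* Hypothetical verdicts for the two receive paths plus drop. */
enum sketch_verdict { SKETCH_DATA, SKETCH_OAM, SKETCH_DROP };

/* Ethertype for Transparent Ethernet Bridging (ETH_P_TEB). */
#define SKETCH_ETH_P_TEB 0x6558

/*
 * Classify a Geneve base header by a combined "type/version" value
 * instead of checking the version and the protocol type separately.
 * hdr must point at at least 4 readable bytes.
 */
static enum sketch_verdict sketch_classify(const uint8_t *hdr)
{
	unsigned int ver = hdr[0] >> 6;            /* 2-bit version */
	unsigned int oam = hdr[1] >> 7;            /* O bit */
	unsigned int ver_type = (ver << 1) | oam;  /* hypothetical field */
	uint16_t proto = (uint16_t)((hdr[2] << 8) | hdr[3]);

	switch (ver_type) {
	case 0: /* version 0, data packet: protocol type must be Ethernet */
		return proto == SKETCH_ETH_P_TEB ? SKETCH_DATA : SKETCH_DROP;
	case 1: /* version 0, OAM packet: protocol type is not examined */
		return SKETCH_OAM;
	default: /* unknown version */
		return SKETCH_DROP;
	}
}
```

With this shape, the protocol type is only validated on the data path,
and an OAM packet with any protocol type value reaches its own handler.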



> +               goto error;
> +
> +       opts_len = geneveh->opt_len * 4;
> +
> +       flags = TUNNEL_KEY |
> +               (udp_hdr(skb)->check != 0 ? TUNNEL_CSUM : 0) |
> +               (geneveh->oam ? TUNNEL_OAM : 0) |
> +               (geneveh->critical ? TUNNEL_CRIT_OPT : 0);

Three conditionals in the critical data path just to extract the flags,
without even doing anything with them :-(. Also, why should OVS care
about the checksum? It has already been validated at this point.
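
As a rough illustration of how the flag extraction could avoid the
conditionals, here is a user-space sketch. It is not the patch's code,
and the SKETCH_ flag values are assumptions chosen so that the O and C
bits of the second header byte map onto them with one shift and mask;
the kernel's real TUNNEL_* values differ. Whether this is measurably
faster than what the compiler already generates is untested here.

```c
#include <stdint.h>

/* Hypothetical flag bits, NOT the kernel's TUNNEL_* values. */
#define SKETCH_TUNNEL_CSUM     0x01u
#define SKETCH_TUNNEL_KEY      0x04u
#define SKETCH_TUNNEL_CRIT_OPT 0x10u /* C bit (0x40) shifted right by 2 */
#define SKETCH_TUNNEL_OAM      0x20u /* O bit (0x80) shifted right by 2 */

/*
 * Build receive-side tunnel flags without conditionals.
 * udp_check is the UDP checksum field; flags_byte is the second byte
 * of the Geneve base header (O = bit 7, C = bit 6, rest reserved).
 */
static uint16_t sketch_rx_flags(uint16_t udp_check, uint8_t flags_byte)
{
	uint16_t flags = SKETCH_TUNNEL_KEY;

	/* (udp_check != 0) evaluates to 0 or 1, which is the CSUM bit. */
	flags |= (uint16_t)(udp_check != 0);

	/* Move O and C into place together and drop the reserved bits. */
	flags |= (uint16_t)((flags_byte >> 2) &
			    (SKETCH_TUNNEL_OAM | SKETCH_TUNNEL_CRIT_OPT));

	return flags;
}
```

The trick only works because the flag values were picked to line up
with the header bits; with independent flag values a small lookup or
the original conditionals would be needed.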

> +
> +       key = vni_to_tunnel_id(geneveh->vni);
> +       ovs_flow_tun_info_init(&tun_info, ip_hdr(skb), key, flags,
> +                              geneveh->options, opts_len);
> +
> +       ovs_vport_receive(vport, skb, &tun_info);
> +       return;
> +
> +error:
> +       kfree_skb(skb);
> +}
> +
> +static int geneve_get_options(const struct vport *vport,
> +                             struct sk_buff *skb)
> +{
> +       struct geneve_port *geneve_port = geneve_vport(vport);
> +       __be16 sport;
> +
> +       sport = ntohs(inet_sk(geneve_port->gs->uts.sock->sk)->inet_sport);
> +       if (nla_put_u16(skb, OVS_TUNNEL_ATTR_DST_PORT, sport))
> +               return -EMSGSIZE;
> +       return 0;
> +}
> +
> +static void geneve_tnl_destroy(struct vport *vport)
> +{
> +       struct geneve_port *geneve_port = geneve_vport(vport);
> +
> +       geneve_sock_release(geneve_port->gs);
> +
> +       ovs_vport_deferred_free(vport);
> +}
> +
> +static struct vport *geneve_tnl_create(const struct vport_parms *parms)
> +{
> +       struct net *net = ovs_dp_get_net(parms->dp);
> +       struct nlattr *options = parms->options;
> +       struct geneve_port *geneve_port;
> +       struct geneve_sock *gs;
> +       struct vport *vport;
> +       struct nlattr *a;
> +       int err;
> +       u16 dst_port;
> +
> +       if (!options) {
> +               err = -EINVAL;
> +               goto error;
> +       }
> +
> +       a = nla_find_nested(options, OVS_TUNNEL_ATTR_DST_PORT);
> +       if (a && nla_len(a) == sizeof(u16)) {
> +               dst_port = nla_get_u16(a);
> +       } else {
> +               /* Require destination port from userspace. */
> +               err = -EINVAL;
> +               goto error;
> +       }
> +
> +       vport = ovs_vport_alloc(sizeof(struct geneve_port),
> +                               &ovs_geneve_vport_ops, parms);
> +       if (IS_ERR(vport))
> +               return vport;
> +
> +       geneve_port = geneve_vport(vport);
> +       strncpy(geneve_port->name, parms->name, IFNAMSIZ);
> +
> +       gs = geneve_sock_add(net, htons(dst_port), geneve_rcv, vport, true, 0);
> +       if (IS_ERR(gs)) {
> +               ovs_vport_free(vport);
> +               return (void *)gs;
> +       }
> +       geneve_port->gs = gs;
> +
> +       return vport;
> +error:
> +       return ERR_PTR(err);
> +}
> +
> +static int geneve_send(struct vport *vport, struct sk_buff *skb)
> +{
> +       struct ovs_key_ipv4_tunnel *tun_key;
> +       struct ovs_tunnel_info *tun_info = OVS_CB(skb)->tun_info;
> +       struct net *net = ovs_dp_get_net(vport->dp);
> +       struct geneve_port *geneve_port = geneve_vport(vport);
> +       __be16 dport = inet_sk(geneve_port->gs->uts.sock->sk)->inet_sport;
> +       __be16 sport;
> +       struct rtable *rt;
> +       struct flowi4 fl;
> +       u8 vni[3];
> +       __be16 df;
> +       int err;
> +       int sent;
> +
> +       if (unlikely(!tun_info))
> +               return -EINVAL;
> +
> +       tun_key = &tun_info->tunnel;
> +
> +       /* Route lookup */
> +       memset(&fl, 0, sizeof(fl));
> +       fl.daddr = tun_key->ipv4_dst;
> +       fl.saddr = tun_key->ipv4_src;
> +       fl.flowi4_tos = RT_TOS(tun_key->ipv4_tos);
> +       fl.flowi4_mark = skb->mark;
> +       fl.flowi4_proto = IPPROTO_UDP;
> +
> +       rt = ip_route_output_key(net, &fl);

Route lookup on every packet? No route cached in the flow structs?

> +       if (IS_ERR(rt)) {
> +               err = PTR_ERR(rt);
> +               goto error;
> +       }
> +
> +       df = tun_key->tun_flags & TUNNEL_DONT_FRAGMENT ? htons(IP_DF) : 0;
> +       sport = udp_flow_src_port(net, skb, 1, USHRT_MAX, true);
> +       tunnel_id_to_vni(tun_key->tun_id, vni);
> +
> +       sent = geneve_xmit_skb(geneve_port->gs, rt, skb, fl.saddr,
> +                              tun_key->ipv4_dst, tun_key->ipv4_tos,
> +                              tun_key->ipv4_ttl, df, sport, dport,
> +                              tun_key->tun_flags, vni,
> +                              tun_info->options_len, (u8 *)tun_info->options,
> +                              false);
> +       if (!sent)
> +               ip_rt_put(rt);
> +
> +       return sent;
> +
> +error:
> +       return err;
> +}
> +
> +static const char *geneve_get_name(const struct vport *vport)
> +{
> +       struct geneve_port *geneve_port = geneve_vport(vport);
> +       return geneve_port->name;
> +}
> +
> +const struct vport_ops ovs_geneve_vport_ops = {
> +       .type           = OVS_VPORT_TYPE_GENEVE,
> +       .create         = geneve_tnl_create,
> +       .destroy        = geneve_tnl_destroy,
> +       .get_name       = geneve_get_name,
> +       .get_options    = geneve_get_options,
> +       .send           = geneve_send,
> +};
> diff --git a/net/openvswitch/vport-gre.c b/net/openvswitch/vport-gre.c
> index d4fcbb2..1aeeed6 100644
> --- a/net/openvswitch/vport-gre.c
> +++ b/net/openvswitch/vport-gre.c
> @@ -104,7 +104,7 @@ static int gre_rcv(struct sk_buff *skb,
>
>         key = key_to_tunnel_id(tpi->key, tpi->seq);
>         ovs_flow_tun_info_init(&tun_info, ip_hdr(skb), key,
> -                              filter_tnl_flags(tpi->flags));
> +                              filter_tnl_flags(tpi->flags), NULL, 0);
>
>         ovs_vport_receive(vport, skb, &tun_info);
>         return PACKET_RCVD;
> diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
> index 3835143..eded300 100644
> --- a/net/openvswitch/vport-vxlan.c
> +++ b/net/openvswitch/vport-vxlan.c
> @@ -66,7 +66,7 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
>         /* Save outer tunnel values */
>         iph = ip_hdr(skb);
>         key = cpu_to_be64(ntohl(vx_vni) >> 8);
> -       ovs_flow_tun_info_init(&tun_info, iph, key, TUNNEL_KEY);
> +       ovs_flow_tun_info_init(&tun_info, iph, key, TUNNEL_KEY, NULL, 0);
>
>         ovs_vport_receive(vport, skb, &tun_info);
>  }
> diff --git a/net/openvswitch/vport.c b/net/openvswitch/vport.c
> index 39e2c9c..038d14a 100644
> --- a/net/openvswitch/vport.c
> +++ b/net/openvswitch/vport.c
> @@ -41,6 +41,7 @@ static void ovs_vport_record_error(struct vport *,
>  static const struct vport_ops *vport_ops_list[] = {
>         &ovs_netdev_vport_ops,
>         &ovs_internal_vport_ops,
> +       &ovs_geneve_vport_ops,
>
>  #ifdef CONFIG_OPENVSWITCH_GRE
>         &ovs_gre_vport_ops,
> diff --git a/net/openvswitch/vport.h b/net/openvswitch/vport.h
> index 400cd1e..d2eb700 100644
> --- a/net/openvswitch/vport.h
> +++ b/net/openvswitch/vport.h
> @@ -197,6 +197,7 @@ void ovs_vport_receive(struct vport *, struct sk_buff *,
>   * add yours to the list at the top of vport.c. */
>  extern const struct vport_ops ovs_netdev_vport_ops;
>  extern const struct vport_ops ovs_internal_vport_ops;
> +extern const struct vport_ops ovs_geneve_vport_ops;
>  extern const struct vport_ops ovs_gre_vport_ops;
>  extern const struct vport_ops ovs_vxlan_vport_ops;
>
> --
> 1.7.9.5
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-23 15:45                       ` Tom Herbert
@ 2014-07-24  3:24                         ` Jesse Gross
  0 siblings, 0 replies; 46+ messages in thread
From: Jesse Gross @ 2014-07-24  3:24 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexander Duyck, Alexander Duyck, Andy Zhou, David Miller,
	Linux Netdev List

On Wed, Jul 23, 2014 at 11:45 AM, Tom Herbert <therbert@google.com> wrote:
> On Tue, Jul 22, 2014 at 9:35 PM, Jesse Gross <jesse@nicira.com> wrote:
>> On Tue, Jul 22, 2014 at 11:53 PM, Tom Herbert <therbert@google.com> wrote:
>>>>> Which feature flags control the receive side parsing in the device?
>>>>
>>>> The only real features that need the port info are Rx hash and Rx
>>>> checksum.  If those are disabled then there shouldn't be any need for
>>>> the port numbers.  I don't recall if you can disable them separately
>>>> from the non-tunnel case though.  I believe they are linked to the
>>>> standard offloads.
>>>>
>>> Rx hash is an unnecessary consideration because we can derive that from
>>> UDP header. The fact that we can deduce a reasonable hash is a major
>>> rationale of UDP encapsulation. We will need drivers to start
>>> enabling/supporting UDP RSS and providing RX hash to realize full
>>> benefits of this.
>>
>> That's true for basic hashing but for more sophisticated things like
>> flow steering or sending OAM packets to control queues the hardware
>> still needs to be able to look into the header.
>>
> Flow steering (aRFS, FlowDirector, ECMP in network) will work just
> fine based on UDP header-- again this is a fundamental property in UDP
> encapsulation. If you need to implement mechanisms that require
> parsing of the encapsulated headers, then it's better to make this
> part of RX filtering.

Sure, it can operate on the UDP hash but I would argue that it works
better if you actually look into the packet. Using the hash is either
going to just randomly spread traffic or require you to track hashes
and direct them to particular places for established connections.
However, depending on the situation this may not really be optimal
compared to, say, steering based on inner MAC address.

But in reality, whether it is for steering or filtering these
operations are pretty similar to me, just different goals.

> btw, Geneve draft allows for non-zero UDP checksums to be ignored like
> in VXLAN-- this is a violation of UDP standard :-(. We will not do
> this in the stack, but it opens the possibility that HW may tell us
> checksum is okay when it actually isn't. Accepting
> CHECKSUM_UNNECESSARY from all these devices is quite the leap of faith
> we're taking!

This is actually not the intention but I see that the wording of the
draft is poor. I'll see if I can improve it to avoid this situation.

>> Some of these are obviously future looking but I think that means that
>> even if you got your desired changes, the use of the UDP port on
>> receive would only shift, not go away.
>
> I think you're hitting the major point that we have to be future
> looking. When hardware hardwires specific protocols instead of using
> generic mechanisms, we become pigeonholed-- this is *not* future
> looking and in the long run it's a disservice to customers if we
> advocate this in the stack. Consider that geneve is likely superior to
> VXLAN because it is extensible, but that VXLAN may still win since it
> is already "supported" in so much HW.

I understand your goal but I'm not really sure what solution you are
proposing. There are obviously ways that the stack can be made more
generic from where it is today but I think we agree that at least some
things will require protocol knowledge.

Geneve (and GUE) are trying to solve this by having a protocol that is
generic - hardware will still need to have support for a specific
protocol but at least that can support many uses. However, this
doesn't seem to be what you are getting at since it's not true of
VXLAN, particularly if you are concerned with deployed hardware.

* Re: [net-next 10/10] openvswitch: Add support for Geneve tunneling.
  2014-07-23 20:29   ` Tom Herbert
@ 2014-07-24  4:10     ` Jesse Gross
       [not found]       ` <CA+mtBx9umxiFYtnG1kzFkK+Ev=b=4f3q2OOow2QcfCB5rUTUyA@mail.gmail.com>
  0 siblings, 1 reply; 46+ messages in thread
From: Jesse Gross @ 2014-07-24  4:10 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Andy Zhou, David Miller, Linux Netdev List

On Wed, Jul 23, 2014 at 4:29 PM, Tom Herbert <therbert@google.com> wrote:
> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>> diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
>> new file mode 100644
>> index 0000000..b1b0a3b
>> --- /dev/null
>> +++ b/net/openvswitch/vport-geneve.c
>> +static void geneve_rcv(struct geneve_sock *gs, struct sk_buff *skb)
>> +{
>> +       struct vport *vport = gs->uts.data;
>> +       struct genevehdr *geneveh;
>> +       int opts_len;
>> +       struct ovs_tunnel_info tun_info;
>> +       __be64 key;
>> +       __be16 flags;
>> +
>> +       if (unlikely(!pskb_may_pull(skb, GENEVE_BASE_HLEN)))
>> +               goto error;
>> +
>> +       geneveh = geneve_hdr(skb);
>> +
>> +       if (unlikely(geneveh->ver != GENEVE_VER))
>> +               goto error;
>
>> +
>> +       if (unlikely(geneveh->proto_type != htons(ETH_P_TEB)))
>
> Why? I thought the point of geneve carrying protocol field was to
> allow protocols other than Ethernet... is this temporary maybe?

Yes, it is temporary. Currently OVS only handles Ethernet packets but
this restriction can be lifted once we have a consumer that is capable
of handling other protocols.

> This check also applies in the OAM case where there is no data packet
> but we still enforce the protocol field to be Ethernet (meaning of
> prot_type when OAM bit is set is ambiguous in the draft). As I
> mentioned on the nvo3 list, this OAM bit is really a 1-bit packet
> type. If this bit is donated to version field (make it a type version
> field) then we can switch on ver_type above and create another
> processing path for OAM so that the prot_type is at least not
> unnecessarily verified in that case and the bits could even be reused
> for some OAM specific purpose.

I think the draft is clear :) The value of the OAM bit does not change
the interpretation of the protocol field. This is true in the other
drafts as well.

OAM packets are essentially just high priority packets (presumably
with some kind of control semantics but that depends on the control
plane). That means that they might be BFD or some other heartbeat,
header fragments for tracing, or really anything else that should be
treated as control. In all of these cases, the protocol type still
needs to indicate the format of the data.
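
As a rough illustration of why the two fields are independent, here is a userspace sketch of parsing the base header. The bit layout follows my reading of the draft (2-bit version, 6-bit option length in 4-byte units, O and C bits, 16-bit protocol type, 24-bit VNI); the struct and function names are invented for this example and are not the ones in the patch.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_GENEVE_BASE_HLEN 8
#define SKETCH_ETH_P_TEB 0x6558   /* Transparent Ethernet Bridging */

/* Hypothetical decoded view of the fixed Geneve header. */
struct sketch_geneve_info {
	uint8_t  ver;
	uint8_t  opt_len_bytes;   /* at most 63 * 4 = 252 */
	int      oam;
	int      critical;
	uint16_t proto_type;      /* host byte order */
	uint32_t vni;
};

/* Returns 0 on success, -1 if the buffer is too short for the base
 * header or for the options it announces. */
static int sketch_geneve_parse(const uint8_t *buf, size_t len,
			       struct sketch_geneve_info *out)
{
	if (len < SKETCH_GENEVE_BASE_HLEN)
		return -1;

	out->ver = buf[0] >> 6;
	out->opt_len_bytes = (uint8_t)((buf[0] & 0x3f) * 4);
	/* The OAM bit is read on its own; it does not alter how
	 * proto_type is interpreted. */
	out->oam = (buf[1] >> 7) & 1;
	out->critical = (buf[1] >> 6) & 1;
	out->proto_type = (uint16_t)((buf[2] << 8) | buf[3]);
	out->vni = ((uint32_t)buf[4] << 16) | ((uint32_t)buf[5] << 8) | buf[6];

	if (len < (size_t)SKETCH_GENEVE_BASE_HLEN + out->opt_len_bytes)
		return -1;

	return 0;
}
```

A packet with the OAM bit set and a protocol type of 0x6558 still parses as an Ethernet payload; the bit only marks it as control traffic.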

>> +               goto error;
>> +
>> +       opts_len = geneveh->opt_len * 4;
>> +
>> +       flags = TUNNEL_KEY |
>> +               (udp_hdr(skb)->check != 0 ? TUNNEL_CSUM : 0) |
>> +               (geneveh->oam ? TUNNEL_OAM : 0) |
>> +               (geneveh->critical ? TUNNEL_CRIT_OPT : 0);
>
> Three conditionals in critical data path just extract the flags and
> not even do anything with them :-(. Also why should OVS care about
> checksum, it has already been validated at this point?

These flags are included in the flow structure, so they are consumed
by userspace.

OVS isn't validating the checksum, it's just marking which packets
have a checksum so that policy can be enforced. (I think you have
talked about something similar in the past - "I only want to accept
packets with a checksum even if no checksum is allowed by the
protocol.")
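
A small userspace sketch of that idea follows. The flag values and names here are placeholders chosen for the example (the kernel's TUNNEL_* constants are big-endian and defined elsewhere), and `sketch_flow_matches` is a hypothetical stand-in for the flow-table match rather than the OVS implementation.

```c
#include <stdint.h>

/* Placeholder flag bits for illustration only. */
#define SKETCH_TUNNEL_CSUM     0x01u
#define SKETCH_TUNNEL_KEY      0x04u
#define SKETCH_TUNNEL_OAM      0x08u
#define SKETCH_TUNNEL_CRIT_OPT 0x10u

/* Record what was seen on the wire; nothing is validated here. */
static uint16_t sketch_geneve_flags(uint16_t udp_check, int oam, int critical)
{
	return (uint16_t)(SKETCH_TUNNEL_KEY |
			  (udp_check != 0 ? SKETCH_TUNNEL_CSUM : 0u) |
			  (oam ? SKETCH_TUNNEL_OAM : 0u) |
			  (critical ? SKETCH_TUNNEL_CRIT_OPT : 0u));
}

/* A flow that demands certain flags matches only packets carrying
 * all of them, e.g. "accept only if a checksum was present". */
static int sketch_flow_matches(uint16_t pkt_flags, uint16_t required)
{
	return (pkt_flags & required) == required;
}
```

The policy decision stays with whoever installs the flow; the receive path only exposes the bits.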

>> +static int geneve_send(struct vport *vport, struct sk_buff *skb)
>> +{
>> +       struct ovs_key_ipv4_tunnel *tun_key;
>> +       struct ovs_tunnel_info *tun_info = OVS_CB(skb)->tun_info;
>> +       struct net *net = ovs_dp_get_net(vport->dp);
>> +       struct geneve_port *geneve_port = geneve_vport(vport);
>> +       __be16 dport = inet_sk(geneve_port->gs->uts.sock->sk)->inet_sport;
>> +       __be16 sport;
>> +       struct rtable *rt;
>> +       struct flowi4 fl;
>> +       u8 vni[3];
>> +       __be16 df;
>> +       int err;
>> +       int sent;
>> +
>> +       if (unlikely(!tun_info))
>> +               return -EINVAL;
>> +
>> +       tun_key = &tun_info->tunnel;
>> +
>> +       /* Route lookup */
>> +       memset(&fl, 0, sizeof(fl));
>> +       fl.daddr = tun_key->ipv4_dst;
>> +       fl.saddr = tun_key->ipv4_src;
>> +       fl.flowi4_tos = RT_TOS(tun_key->ipv4_tos);
>> +       fl.flowi4_mark = skb->mark;
>> +       fl.flowi4_proto = IPPROTO_UDP;
>> +
>> +       rt = ip_route_output_key(net, &fl);
>
> Route lookup on every packet? No route cached in the flow structs?

This is a possible future optimization.

* Re: [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
  2014-07-22 10:19 ` [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port Andy Zhou
  2014-07-22 10:49   ` Varka Bhadram
@ 2014-07-24  6:40   ` Or Gerlitz
  2014-07-24 20:28     ` Andy Zhou
  1 sibling, 1 reply; 46+ messages in thread
From: Or Gerlitz @ 2014-07-24  6:40 UTC (permalink / raw)
  To: Andy Zhou; +Cc: David Miller, netdev

On Tue, Jul 22, 2014 at 1:19 PM, Andy Zhou <azhou@nicira.com> wrote:
>
> Rename ndo_add_vxlan_port() API provided by net_device_ops to
> ndo_add_udp_tunnel_port(). Generalized the API in preparation for
> up coming NICs and device drivers that may support offloading more
> UDP tunnels protocols besides VxLAN.  There is no behavioral changes
> with this patch.
>


[..]

>
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -650,9 +650,11 @@ static void vxlan_notify_add_rx_port(struct vxlan_sock *vs)
>
>         rcu_read_lock();
>         for_each_netdev_rcu(net, dev) {
> -               if (dev->netdev_ops->ndo_add_vxlan_port)
> -                       dev->netdev_ops->ndo_add_vxlan_port(dev, sa_family,
> -                                                           port);
> +               if (!dev->netdev_ops->ndo_add_udp_tunnel_port)
> +                       continue;
> +
> +               dev->netdev_ops->ndo_add_udp_tunnel_port(dev, sa_family, port,
> +                                                        UDP_TUNNEL_TYPE_VXLAN);
>         }
>         rcu_read_unlock();
>  }



Such changes should be done in a manner which is as minimal as
possible and not introduce further cleanups
or style modifications; here the existing code says

if(ndo X is supported by dev Y)
   call it

Please stick to this and just replace the ndo name in this patch
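
For illustration, the shape being asked for could look like the following userspace sketch. The types and names (`sketch_dev`, `sketch_ops`, `sketch_count_port`) are invented stand-ins for the netdev structures, not the real definitions; the point is only that the "if supported, call it" structure stays and just the op name and the extra type argument change.

```c
#include <stddef.h>

#define SKETCH_TUNNEL_TYPE_VXLAN 0x01u

struct sketch_dev;

/* Stand-in for net_device_ops with only the renamed callback. */
struct sketch_ops {
	void (*ndo_add_udp_tunnel_port)(struct sketch_dev *dev,
					unsigned short family,
					unsigned short port,
					unsigned int type);
};

struct sketch_dev {
	const struct sketch_ops *ops;
	int ports_added;
};

/* Example driver callback: just counts notifications. */
static void sketch_count_port(struct sketch_dev *dev, unsigned short family,
			      unsigned short port, unsigned int type)
{
	(void)family;
	(void)port;
	(void)type;
	dev->ports_added++;
}

static void sketch_notify_add_rx_port(struct sketch_dev *devs, size_t n,
				      unsigned short family,
				      unsigned short port)
{
	for (size_t i = 0; i < n; i++) {
		struct sketch_dev *dev = &devs[i];

		/* Same control flow as before the rename. */
		if (dev->ops->ndo_add_udp_tunnel_port)
			dev->ops->ndo_add_udp_tunnel_port(dev, family, port,
							  SKETCH_TUNNEL_TYPE_VXLAN);
	}
}
```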


>
> @@ -668,9 +670,10 @@ static void vxlan_notify_del_rx_port(struct vxlan_sock *vs)
>
>         rcu_read_lock();
>         for_each_netdev_rcu(net, dev) {
> -               if (dev->netdev_ops->ndo_del_vxlan_port)
> -                       dev->netdev_ops->ndo_del_vxlan_port(dev, sa_family,
> -                                                           port);
> +               if (!dev->netdev_ops->ndo_del_udp_tunnel_port)
> +                       continue;
> +               dev->netdev_ops->ndo_del_udp_tunnel_port(dev, sa_family, port,
> +                                                        UDP_TUNNEL_TYPE_VXLAN);
>         }
>         rcu_read_unlock();
>

same here

* Re: [net-next 04/10] net: Refactor vxlan driver to make use of common UDP tunnel functions
  2014-07-22 10:19 ` [net-next 04/10] net: Refactor vxlan driver to make use of common UDP tunnel functions Andy Zhou
@ 2014-07-24  6:46   ` Or Gerlitz
  0 siblings, 0 replies; 46+ messages in thread
From: Or Gerlitz @ 2014-07-24  6:46 UTC (permalink / raw)
  To: Andy Zhou; +Cc: netdev

On Tue, Jul 22, 2014 at 1:19 PM, Andy Zhou <azhou@nicira.com> wrote:

> @@ -105,6 +105,7 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
>                 err = -EINVAL;
>                 goto error;
>         }
> +
>         a = nla_find_nested(options, OVS_TUNNEL_ATTR_DST_PORT);
>         if (a && nla_len(a) == sizeof(u16)) {
>                 dst_port = nla_get_u16(a);

blank line added by mistake?

* Re: [net-next 00/10] Add Geneve
  2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
                   ` (10 preceding siblings ...)
  2014-07-22 10:54 ` [net-next 00/10] Add Geneve Varka Bhadram
@ 2014-07-24  6:58 ` Or Gerlitz
  2014-07-24 17:40   ` Tom Herbert
  11 siblings, 1 reply; 46+ messages in thread
From: Or Gerlitz @ 2014-07-24  6:58 UTC (permalink / raw)
  To: Andy Zhou, Tom Herbert; +Cc: David Miller, netdev

On Tue, Jul 22, 2014 at 1:19 PM, Andy Zhou <azhou@nicira.com> wrote:
> Following patches adds initial support for Geneve tunnel protocol

Just to make this a bit more clear, would it be correct to say that
the logical ordering here is as follows:

> 1. Add common UDP tunnel code into UDP tunnel support function
> 2. Refactor vxlan driver to make use of the UDP tunnel support
> 3. Add Geneve driver.

(implemented by patches 1-5 below)

> Andy Zhou (5):
>   net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
>   udp: Expand UDP tunnel common APIs
>   vxlan: Remove vxlan_get_rx_port()
>   net: Refactor vxlan driver to make use of common UDP tunnel functions
>   net: Add Geneve tunneling protocol driver

and on top of that

> 4. Refactor Openvswitch  in preparation for #5
> 5. Add Geneve support to Openvswitch.

implemented by patches 6-10 (below)

> Jesse Gross (5):
>   openvswitch: Eliminate memset() from flow_extract.
>   openvswitch: Add support for matching on OAM packets.
>   openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure.
>   openvswitch: Factor out allocation and verification of actions.
>   openvswitch: Add support for Geneve tunneling.

I understand the wish to eventually have something that goes beyond
refactoring of the vxlan and tunneling code plus Geneve basics. However,
doesn't the first part of the series (patches 1-5) have something in
common with Tom's GUE work, which is currently under review too? I think
we first need to see how the basic elements from your series fit
together with GUE.

Or.

* Re: [net-next 00/10] Add Geneve
  2014-07-24  6:58 ` Or Gerlitz
@ 2014-07-24 17:40   ` Tom Herbert
  2014-07-24 21:03     ` Andy Zhou
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Herbert @ 2014-07-24 17:40 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: Andy Zhou, David Miller, netdev

On Wed, Jul 23, 2014 at 11:58 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
> On Tue, Jul 22, 2014 at 1:19 PM, Andy Zhou <azhou@nicira.com> wrote:
>> Following patches adds initial support for Geneve tunnel protocol
>
> Just to make this a bit more clear, would it be correct to say that
> the logical ordering here is as follows:
>
Agreed, improvements to the general infrastructure to support UDP
tunneling should be done first. This already began with the
introduction of udp_tunnel.[ch], and the udp_tunnel_xmit functions seem
like a nice addition at least.

Also, we have at least two instances of UDP tunneling in the code that
should be addressed when making interface improvements: VXLAN and L2TP.
Please make sure *both* of these are considered with such patches (the
needs of Geneve, GUE, LISP, etc. should also be considered, but please no
protocol-specific stuff in the common infrastructure code!)

>> 1. Add common UDP tunnel code into UDP tunnel support function
>> 2. Refactor vxlan driver to make use of the UDP tunnel support
>> 3. Add Geneve driver.
>
> (implemented by patches 1-5 below)
>
>> Andy Zhou (5):
>>   net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
>>   udp: Expand UDP tunnel common APIs
>>   vxlan: Remove vxlan_get_rx_port()
>>   net: Refactor vxlan driver to make use of common UDP tunnel functions
>>   net: Add Geneve tunneling protocol driver
>
> and on top of that
>
>> 4. Refactor Openvswitch  in preparation for #5
>> 5. Add Geneve support to Openvswitch.
>
> implemented by patches 6-10 (below)
>
>> Jesse Gross (5):
>>   openvswitch: Eliminate memset() from flow_extract.
>>   openvswitch: Add support for matching on OAM packets.
>>   openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure.
>>   openvswitch: Factor out allocation and verification of actions.
>>   openvswitch: Add support for Geneve tunneling.
>
> I understand the wish to eventually have something that goes beyond
> refactoring of the vxlan and tunneling code plus Geneve basics. However,
> doesn't the first part of the series (patches 1-5) have something in
> common with Tom's GUE work, which is currently under review too? I think
> we first need to see how the basic elements from your series fit
> together with GUE.
>
> Or.

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-23 19:57   ` Tom Herbert
@ 2014-07-24 20:23     ` Andy Zhou
  2014-07-24 20:47       ` Tom Herbert
  0 siblings, 1 reply; 46+ messages in thread
From: Andy Zhou @ 2014-07-24 20:23 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Linux Netdev List

The general layering I see is tunnel_user (e.g. OVS) -> tunnel_driver
(e.g. vxlan) -> udp_tunnel.

The two receive functions are from two separate layers above
udp_tunnel. I can restructure the APIs to make this cleaner.

On Wed, Jul 23, 2014 at 12:57 PM, Tom Herbert <therbert@google.com> wrote:
> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>> Added create_udp_tunnel_socket(), packet receive and transmit,  and
>> other related common functions for UDP tunnels.
>>
>> Per net open UDP tunnel ports are tracked in this common layer to
>> prevent sharing of a single port with more than one UDP tunnel.
>>
>> Signed-off-by: Andy Zhou <azhou@nicira.com>
>> ---
>>  include/net/udp_tunnel.h |   57 +++++++++-
>>  net/ipv4/udp_tunnel.c    |  257 +++++++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 312 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
>> index 3f34c65..b5e815a 100644
>> --- a/include/net/udp_tunnel.h
>> +++ b/include/net/udp_tunnel.h
>> @@ -1,7 +1,10 @@
>>  #ifndef __NET_UDP_TUNNEL_H
>>  #define __NET_UDP_TUNNEL_H
>>
>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>> +#include <net/ip_tunnels.h>
>> +
>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>
>>  struct udp_port_cfg {
>>         u8                      family;
>> @@ -28,7 +31,59 @@ struct udp_port_cfg {
>>                                 use_udp6_rx_checksums:1;
>>  };
>>
>> +struct udp_tunnel_sock;
>> +
>> +typedef void (udp_tunnel_rcv_t)(struct udp_tunnel_sock *uts,
>> +                               struct sk_buff *skb, ...);
>> +
>> +typedef int (udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
>> +
>> +struct udp_tunnel_socket_cfg {
>> +       u8 tunnel_type;
>> +       struct udp_port_cfg port;
>> +       udp_tunnel_rcv_t *rcv;
>> +       udp_tunnel_encap_rcv_t *encap_rcv;
>
> Why do you need two receive functions or udp_tunnel_rcv_t?
>
>> +       void *data;
>
> Similarly, why is this needed when we already have sk_user_data?
>
>> +};
>> +
>> +struct udp_tunnel_sock {
>> +       u8 tunnel_type;
>> +       struct hlist_node hlist;
>> +       udp_tunnel_rcv_t *rcv;
>> +       void *data;
>> +       struct socket *sock;
>> +};
>> +
>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>                     struct socket **sockp);
>>
>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>> +                                                struct udp_tunnel_socket_cfg
>> +                                                       *socket_cfg);
>> +
>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port);
>> +
>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>> +                       __be16 dst_port, bool xnet);
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>> +               struct sk_buff *skb, struct net_device *dev,
>> +               struct in6_addr *saddr, struct in6_addr *daddr,
>> +               __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port);
>> +
>> +#endif
>> +
>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts);
>> +void udp_tunnel_get_rx_port(struct net_device *dev);
>> +
>> +static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
>> +                                                        bool udp_csum)
>> +{
>> +       int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
>> +
>> +       return iptunnel_handle_offloads(skb, udp_csum, type);
>> +}
>>  #endif
>> diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
>> index 61ec1a6..3c14b16 100644
>> --- a/net/ipv4/udp_tunnel.c
>> +++ b/net/ipv4/udp_tunnel.c
>> @@ -7,6 +7,23 @@
>>  #include <net/udp.h>
>>  #include <net/udp_tunnel.h>
>>  #include <net/net_namespace.h>
>> +#include <net/netns/generic.h>
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +#include <net/ipv6.h>
>> +#include <net/addrconf.h>
>> +#include <net/ip6_tunnel.h>
>> +#include <net/ip6_checksum.h>
>> +#endif
>> +
>> +#define PORT_HASH_BITS 8
>> +#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)
>> +
>> +static int udp_tunnel_net_id;
>> +
>> +struct udp_tunnel_net {
>> +       struct hlist_head sock_list[PORT_HASH_SIZE];
>> +       spinlock_t  sock_lock;   /* Protecting the sock_list */
>> +};
>>
>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>                     struct socket **sockp)
>> @@ -82,7 +99,6 @@ int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>                 return -EPFNOSUPPORT;
>>         }
>>
>> -
>>         *sockp = sock;
>>
>>         return 0;
>> @@ -97,4 +113,243 @@ error:
>>  }
>>  EXPORT_SYMBOL(udp_sock_create);
>>
>> +
>> +/* Socket hash table head */
>> +static inline struct hlist_head *uts_head(struct net *net, const __be16 port)
>> +{
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +
>> +       return &utn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
>> +}
>> +
>> +static int handle_offloads(struct sk_buff *skb)
>> +{
>> +       if (skb_is_gso(skb)) {
>> +               int err = skb_unclone(skb, GFP_ATOMIC);
>> +
>> +               if (unlikely(err))
>> +                       return err;
>> +               skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
>> +       } else {
>> +               if (skb->ip_summed != CHECKSUM_PARTIAL)
>> +                       skb->ip_summed = CHECKSUM_NONE;
>> +       }
>> +
>> +       return 0;
>> +}
>> +
>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>> +                                                struct udp_tunnel_socket_cfg
>> +                                                       *cfg)
>> +{
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +       struct udp_tunnel_sock *uts;
>> +       struct socket *sock;
>> +       struct sock *sk;
>> +       const __be16 port = cfg->port.local_udp_port;
>> +       const int ipv6 = (cfg->port.family == AF_INET6);
>> +       int err;
>> +
>> +       uts = kzalloc(size, GFP_KERNEL);
>> +       if (!uts)
>> +               return ERR_PTR(-ENOMEM);
>> +
>> +       err = udp_sock_create(net, &cfg->port, &sock);
>> +       if (err < 0) {
>> +               kfree(uts);
>> +               return ERR_PTR(err);
>> +       }
>> +
>> +       /* Disable multicast loopback */
>> +       inet_sk(sock->sk)->mc_loop = 0;
>> +
>> +       uts->sock = sock;
>> +       sk = sock->sk;
>> +       uts->rcv = cfg->rcv;
>> +       uts->data = cfg->data;
>> +       rcu_assign_sk_user_data(sock->sk, uts);
>> +
>> +       spin_lock(&utn->sock_lock);
>> +       hlist_add_head_rcu(&uts->hlist, uts_head(net, port));
>> +       spin_unlock(&utn->sock_lock);
>> +
>> +       udp_sk(sk)->encap_type = 1;
>> +       udp_sk(sk)->encap_rcv = cfg->encap_rcv;
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +       if (ipv6)
>> +               ipv6_stub->udpv6_encap_enable();
>> +       else
>> +#endif
>> +               udp_encap_enable();
>> +
>> +       return uts;
>> +}
>> +EXPORT_SYMBOL_GPL(create_udp_tunnel_socket);
>> +
>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>> +                       __be16 dst_port, bool xnet)
>> +{
>> +       struct udphdr *uh;
>> +
>> +       __skb_push(skb, sizeof(*uh));
>> +       skb_reset_transport_header(skb);
>> +       uh = udp_hdr(skb);
>> +
>> +       uh->dest = dst_port;
>> +       uh->source = src_port;
>> +       uh->len = htons(skb->len);
>> +
>> +       udp_set_csum(sock->sk->sk_no_check_tx, skb, src, dst, skb->len);
>> +
>> +       return iptunnel_xmit(sock->sk, rt, skb, src, dst, IPPROTO_UDP,
>> +                            tos, ttl, df, xnet);
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel_xmit_skb);
>> +
>> +#if IS_ENABLED(CONFIG_IPV6)
>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>> +                        struct sk_buff *skb, struct net_device *dev,
>> +                        struct in6_addr *saddr, struct in6_addr *daddr,
>> +                        __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port)
>> +{
>> +       struct udphdr *uh;
>> +       struct ipv6hdr *ip6h;
>> +       int err;
>> +
>> +       __skb_push(skb, sizeof(*uh));
>> +       skb_reset_transport_header(skb);
>> +       uh = udp_hdr(skb);
>> +
>> +       uh->dest = dst_port;
>> +       uh->source = src_port;
>> +
>> +       uh->len = htons(skb->len);
>> +       uh->check = 0;
>> +
>> +       memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
>> +       IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED
>> +                           | IPSKB_REROUTED);
>> +       skb_dst_set(skb, dst);
>> +
>> +       if (!skb_is_gso(skb) && !(dst->dev->features & NETIF_F_IPV6_CSUM)) {
>> +               __wsum csum = skb_checksum(skb, 0, skb->len, 0);
>> +
>> +               skb->ip_summed = CHECKSUM_UNNECESSARY;
>> +               uh->check = csum_ipv6_magic(saddr, daddr, skb->len,
>> +                               IPPROTO_UDP, csum);
>> +               if (uh->check == 0)
>> +                       uh->check = CSUM_MANGLED_0;
>> +       } else {
>> +               skb->ip_summed = CHECKSUM_PARTIAL;
>> +               skb->csum_start = skb_transport_header(skb) - skb->head;
>> +               skb->csum_offset = offsetof(struct udphdr, check);
>> +               uh->check = ~csum_ipv6_magic(saddr, daddr,
>> +                               skb->len, IPPROTO_UDP, 0);
>> +       }
>> +
>> +       __skb_push(skb, sizeof(*ip6h));
>> +       skb_reset_network_header(skb);
>> +       ip6h              = ipv6_hdr(skb);
>> +       ip6h->version     = 6;
>> +       ip6h->priority    = prio;
>> +       ip6h->flow_lbl[0] = 0;
>> +       ip6h->flow_lbl[1] = 0;
>> +       ip6h->flow_lbl[2] = 0;
>> +       ip6h->payload_len = htons(skb->len);
>> +       ip6h->nexthdr     = IPPROTO_UDP;
>> +       ip6h->hop_limit   = ttl;
>> +       ip6h->daddr       = *daddr;
>> +       ip6h->saddr       = *saddr;
>> +
>> +       err = handle_offloads(skb);
>> +       if (err)
>> +               return err;
>> +
>> +       ip6tunnel_xmit(skb, dev);
>> +       return 0;
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
>> +#endif
>> +
>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port)
>> +{
>> +       struct udp_tunnel_sock *uts;
>> +
>> +       hlist_for_each_entry_rcu(uts, uts_head(net, port), hlist) {
>> +               if (inet_sk(uts->sock->sk)->inet_sport == port)
>> +                       return uts;
>> +       }
>> +
>> +       return NULL;
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel_find_sock);
>> +
>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts)
>> +{
>> +       struct sock *sk = uts->sock->sk;
>> +       struct net *net = sock_net(sk);
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +
>> +       spin_lock(&utn->sock_lock);
>> +       hlist_del_rcu(&uts->hlist);
>> +       rcu_assign_sk_user_data(uts->sock->sk, NULL);
>> +       spin_unlock(&utn->sock_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
>> +
>> +/* Calls the ndo_add_udp_tunnel_port of the caller in order to
>> + * supply the listening UDP tunnel ports. Callers are expected
>> + * to implement ndo_add_udp_tunnel_port.
>> + */
>> +void udp_tunnel_get_rx_port(struct net_device *dev)
>> +{
>> +       struct udp_tunnel_sock *uts;
>> +       struct net *net = dev_net(dev);
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +       sa_family_t sa_family;
>> +       __be16 port;
>> +       unsigned int i;
>> +
>> +       spin_lock(&utn->sock_lock);
>> +       for (i = 0; i < PORT_HASH_SIZE; ++i) {
>> +               hlist_for_each_entry_rcu(uts, &utn->sock_list[i], hlist) {
>> +                       port = inet_sk(uts->sock->sk)->inet_sport;
>> +                       sa_family = uts->sock->sk->sk_family;
>> +                       dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
>> +                                       sa_family, port, uts->tunnel_type);
>> +               }
>> +       }
>> +       spin_unlock(&utn->sock_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(udp_tunnel_get_rx_port);
>> +
>> +static int __net_init udp_tunnel_init_net(struct net *net)
>> +{
>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>> +       unsigned int h;
>> +
>> +       spin_lock_init(&utn->sock_lock);
>> +
>> +       for (h = 0; h < PORT_HASH_SIZE; h++)
>> +               INIT_HLIST_HEAD(&utn->sock_list[h]);
>> +
>> +       return 0;
>> +}
>> +
>> +static struct pernet_operations udp_tunnel_net_ops = {
>> +       .init = udp_tunnel_init_net,
>> +       .exit = NULL,
>> +       .id = &udp_tunnel_net_id,
>> +       .size = sizeof(struct udp_tunnel_net),
>> +};
>> +
>> +static int __init udp_tunnel_init(void)
>> +{
>> +       return register_pernet_subsys(&udp_tunnel_net_ops);
>> +}
>> +late_initcall(udp_tunnel_init);
>> +
>>  MODULE_LICENSE("GPL");
>> --
>> 1.7.9.5
>>

* Re: [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
  2014-07-24  6:40   ` Or Gerlitz
@ 2014-07-24 20:28     ` Andy Zhou
  0 siblings, 0 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-24 20:28 UTC (permalink / raw)
  To: Or Gerlitz; +Cc: David Miller, netdev

On Wed, Jul 23, 2014 at 11:40 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
> On Tue, Jul 22, 2014 at 1:19 PM, Andy Zhou <azhou@nicira.com> wrote:
>>
>> Rename ndo_add_vxlan_port() API provided by net_device_ops to
>> ndo_add_udp_tunnel_port(). Generalized the API in preparation for
>> up coming NICs and device drivers that may support offloading more
>> UDP tunnels protocols besides VxLAN.  There is no behavioral changes
>> with this patch.
>>
>
>
> [..]
>
>>
>> --- a/drivers/net/vxlan.c
>> +++ b/drivers/net/vxlan.c
>> @@ -650,9 +650,11 @@ static void vxlan_notify_add_rx_port(struct vxlan_sock *vs)
>>
>>         rcu_read_lock();
>>         for_each_netdev_rcu(net, dev) {
>> -               if (dev->netdev_ops->ndo_add_vxlan_port)
>> -                       dev->netdev_ops->ndo_add_vxlan_port(dev, sa_family,
>> -                                                           port);
>> +               if (!dev->netdev_ops->ndo_add_udp_tunnel_port)
>> +                       continue;
>> +
>> +               dev->netdev_ops->ndo_add_udp_tunnel_port(dev, sa_family, port,
>> +                                                        UDP_TUNNEL_TYPE_VXLAN);
>>         }
>>         rcu_read_unlock();
>>  }
>
>
>
> Such changes should be done in a manner which is as minimal as
> possible and not introduce further cleanups
> or style modifications; here the existing code says
>
> if(ndo X is supported by dev Y)
>    call it
>
> Please stick to this and just replace the ndo name in this patch
>
>
>>
>> @@ -668,9 +670,10 @@ static void vxlan_notify_del_rx_port(struct vxlan_sock *vs)
>>
>>         rcu_read_lock();
>>         for_each_netdev_rcu(net, dev) {
>> -               if (dev->netdev_ops->ndo_del_vxlan_port)
>> -                       dev->netdev_ops->ndo_del_vxlan_port(dev, sa_family,
>> -                                                           port);
>> +               if (!dev->netdev_ops->ndo_del_udp_tunnel_port)
>> +                       continue;
>> +               dev->netdev_ops->ndo_del_udp_tunnel_port(dev, sa_family, port,
>> +                                                        UDP_TUNNEL_TYPE_VXLAN);
>>         }
>>         rcu_read_unlock();
>>
>
> same here

O.K. I will revert them back.

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-24 20:23     ` Andy Zhou
@ 2014-07-24 20:47       ` Tom Herbert
  2014-07-24 20:54         ` Andy Zhou
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Herbert @ 2014-07-24 20:47 UTC (permalink / raw)
  To: Andy Zhou; +Cc: David Miller, Linux Netdev List

On Thu, Jul 24, 2014 at 1:23 PM, Andy Zhou <azhou@nicira.com> wrote:
> The general layering I see is tunnel_user (e.g. OVS) -> tunnel_driver
> (e.g. vxlan) -> udp_tunnel.
>
Simpler and more efficient if you stick with UDP->UDP_encap_handler as
the most general model for RX.

> The two receive functions are from two separate layers above
> udp_tunnel. I can restructure the APIs to make it
> cleaner.
>
The only necessary function for opening the UDP encap port is the UDP
receive handler (encap receive). If you want to implement more
indirection within your handler then it should be pretty easy to
create another layer of API for that purpose.
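
Roughly, and only as a userspace sketch with made-up types (`sketch_sock`, `sketch_tunnel` and the helpers below are illustrative, not proposed kernel API): the socket carries a single encap receive handler plus a user-data pointer, and any further dispatch happens inside that handler. The return convention mirrors what I understand encap_rcv to use (0 means consumed, a positive value means hand the packet back to normal UDP processing).

```c
#include <stddef.h>

/* Stand-in for the UDP socket: one handler, one opaque pointer. */
struct sketch_sock {
	int (*encap_rcv)(struct sketch_sock *sk, const void *pkt, size_t len);
	void *user_data;
};

/* Driver-private layer that lives entirely behind user_data. */
struct sketch_tunnel {
	void (*rcv)(struct sketch_tunnel *t, const void *pkt, size_t len);
	size_t rx_bytes;
};

/* The only receive hook the common code needs to know about. */
static int sketch_encap_rcv(struct sketch_sock *sk, const void *pkt,
			    size_t len)
{
	struct sketch_tunnel *t = sk->user_data;

	if (!t || !t->rcv)
		return 1;       /* not ours: let regular UDP handle it */

	t->rcv(t, pkt, len);    /* second-level dispatch, driver's choice */
	return 0;               /* consumed */
}

/* Example second-level receive function: counts payload bytes. */
static void sketch_count_rcv(struct sketch_tunnel *t, const void *pkt,
			     size_t len)
{
	(void)pkt;
	t->rx_bytes += len;
}

/* Opening the port only wires up the handler and the private data. */
static void sketch_open(struct sketch_sock *sk, struct sketch_tunnel *t)
{
	sk->encap_rcv = sketch_encap_rcv;
	sk->user_data = t;
}
```

With this split the common layer never needs a second callback type or its own data pointer.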

> On Wed, Jul 23, 2014 at 12:57 PM, Tom Herbert <therbert@google.com> wrote:
>> On Tue, Jul 22, 2014 at 3:19 AM, Andy Zhou <azhou@nicira.com> wrote:
>>> Added create_udp_tunnel_socket(), packet receive and transmit,  and
>>> other related common functions for UDP tunnels.
>>>
>>> Per net open UDP tunnel ports are tracked in this common layer to
>>> prevent sharing of a single port with more than one UDP tunnel.
>>>
>>> Signed-off-by: Andy Zhou <azhou@nicira.com>
>>> ---
>>>  include/net/udp_tunnel.h |   57 +++++++++-
>>>  net/ipv4/udp_tunnel.c    |  257 +++++++++++++++++++++++++++++++++++++++++++++-
>>>  2 files changed, 312 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/net/udp_tunnel.h b/include/net/udp_tunnel.h
>>> index 3f34c65..b5e815a 100644
>>> --- a/include/net/udp_tunnel.h
>>> +++ b/include/net/udp_tunnel.h
>>> @@ -1,7 +1,10 @@
>>>  #ifndef __NET_UDP_TUNNEL_H
>>>  #define __NET_UDP_TUNNEL_H
>>>
>>> -#define UDP_TUNNEL_TYPE_VXLAN 0x01
>>> +#include <net/ip_tunnels.h>
>>> +
>>> +#define UDP_TUNNEL_TYPE_VXLAN  0x01
>>> +#define UDP_TUNNEL_TYPE_GENEVE 0x02
>>>
>>>  struct udp_port_cfg {
>>>         u8                      family;
>>> @@ -28,7 +31,59 @@ struct udp_port_cfg {
>>>                                 use_udp6_rx_checksums:1;
>>>  };
>>>
>>> +struct udp_tunnel_sock;
>>> +
>>> +typedef void (udp_tunnel_rcv_t)(struct udp_tunnel_sock *uts,
>>> +                               struct sk_buff *skb, ...);
>>> +
>>> +typedef int (udp_tunnel_encap_rcv_t)(struct sock *sk, struct sk_buff *skb);
>>> +
>>> +struct udp_tunnel_socket_cfg {
>>> +       u8 tunnel_type;
>>> +       struct udp_port_cfg port;
>>> +       udp_tunnel_rcv_t *rcv;
>>> +       udp_tunnel_encap_rcv_t *encap_rcv;
>>
>> Why do you need two receive functions or udp_tunnel_rcv_t?
>>
>>> +       void *data;
>>
>> Similarly, why is this needed when we already have sk_user_data?
>>
>>> +};
>>> +
>>> +struct udp_tunnel_sock {
>>> +       u8 tunnel_type;
>>> +       struct hlist_node hlist;
>>> +       udp_tunnel_rcv_t *rcv;
>>> +       void *data;
>>> +       struct socket *sock;
>>> +};
>>> +
>>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>>                     struct socket **sockp);
>>>
>>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>>> +                                                struct udp_tunnel_socket_cfg
>>> +                                                       *socket_cfg);
>>> +
>>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port);
>>> +
>>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>>> +                       __be16 dst_port, bool xnet);
>>> +
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>>> +               struct sk_buff *skb, struct net_device *dev,
>>> +               struct in6_addr *saddr, struct in6_addr *daddr,
>>> +               __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port);
>>> +
>>> +#endif
>>> +
>>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts);
>>> +void udp_tunnel_get_rx_port(struct net_device *dev);
>>> +
>>> +static inline struct sk_buff *udp_tunnel_handle_offloads(struct sk_buff *skb,
>>> +                                                        bool udp_csum)
>>> +{
>>> +       int type = udp_csum ? SKB_GSO_UDP_TUNNEL_CSUM : SKB_GSO_UDP_TUNNEL;
>>> +
>>> +       return iptunnel_handle_offloads(skb, udp_csum, type);
>>> +}
>>>  #endif
>>> diff --git a/net/ipv4/udp_tunnel.c b/net/ipv4/udp_tunnel.c
>>> index 61ec1a6..3c14b16 100644
>>> --- a/net/ipv4/udp_tunnel.c
>>> +++ b/net/ipv4/udp_tunnel.c
>>> @@ -7,6 +7,23 @@
>>>  #include <net/udp.h>
>>>  #include <net/udp_tunnel.h>
>>>  #include <net/net_namespace.h>
>>> +#include <net/netns/generic.h>
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> +#include <net/ipv6.h>
>>> +#include <net/addrconf.h>
>>> +#include <net/ip6_tunnel.h>
>>> +#include <net/ip6_checksum.h>
>>> +#endif
>>> +
>>> +#define PORT_HASH_BITS 8
>>> +#define PORT_HASH_SIZE (1 << PORT_HASH_BITS)
>>> +
>>> +static int udp_tunnel_net_id;
>>> +
>>> +struct udp_tunnel_net {
>>> +       struct hlist_head sock_list[PORT_HASH_SIZE];
>>> +       spinlock_t  sock_lock;   /* Protecting the sock_list */
>>> +};
>>>
>>>  int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>>                     struct socket **sockp)
>>> @@ -82,7 +99,6 @@ int udp_sock_create(struct net *net, struct udp_port_cfg *cfg,
>>>                 return -EPFNOSUPPORT;
>>>         }
>>>
>>> -
>>>         *sockp = sock;
>>>
>>>         return 0;
>>> @@ -97,4 +113,243 @@ error:
>>>  }
>>>  EXPORT_SYMBOL(udp_sock_create);
>>>
>>> +
>>> +/* Socket hash table head */
>>> +static inline struct hlist_head *uts_head(struct net *net, const __be16 port)
>>> +{
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +
>>> +       return &utn->sock_list[hash_32(ntohs(port), PORT_HASH_BITS)];
>>> +}
>>> +
>>> +static int handle_offloads(struct sk_buff *skb)
>>> +{
>>> +       if (skb_is_gso(skb)) {
>>> +               int err = skb_unclone(skb, GFP_ATOMIC);
>>> +
>>> +               if (unlikely(err))
>>> +                       return err;
>>> +               skb_shinfo(skb)->gso_type |= SKB_GSO_UDP_TUNNEL;
>>> +       } else {
>>> +               if (skb->ip_summed != CHECKSUM_PARTIAL)
>>> +                       skb->ip_summed = CHECKSUM_NONE;
>>> +       }
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +struct udp_tunnel_sock *create_udp_tunnel_socket(struct net *net, size_t size,
>>> +                                                struct udp_tunnel_socket_cfg
>>> +                                                       *cfg)
>>> +{
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +       struct udp_tunnel_sock *uts;
>>> +       struct socket *sock;
>>> +       struct sock *sk;
>>> +       const __be16 port = cfg->port.local_udp_port;
>>> +       const int ipv6 = (cfg->port.family == AF_INET6);
>>> +       int err;
>>> +
>>> +       uts = kzalloc(size, GFP_KERNEL);
>>> +       if (!uts)
>>> +               return ERR_PTR(-ENOMEM);
>>> +
>>> +       err = udp_sock_create(net, &cfg->port, &sock);
>>> +       if (err < 0) {
>>> +               kfree(uts);
>>> +               return ERR_PTR(err);
>>> +       }
>>> +
>>> +       /* Disable multicast loopback */
>>> +       inet_sk(sock->sk)->mc_loop = 0;
>>> +
>>> +       uts->sock = sock;
>>> +       sk = sock->sk;
>>> +       uts->rcv = cfg->rcv;
>>> +       uts->data = cfg->data;
>>> +       rcu_assign_sk_user_data(sock->sk, uts);
>>> +
>>> +       spin_lock(&utn->sock_lock);
>>> +       hlist_add_head_rcu(&uts->hlist, uts_head(net, port));
>>> +       spin_unlock(&utn->sock_lock);
>>> +
>>> +       udp_sk(sk)->encap_type = 1;
>>> +       udp_sk(sk)->encap_rcv = cfg->encap_rcv;
>>> +
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> +       if (ipv6)
>>> +               ipv6_stub->udpv6_encap_enable();
>>> +       else
>>> +#endif
>>> +               udp_encap_enable();
>>> +
>>> +       return uts;
>>> +}
>>> +EXPORT_SYMBOL_GPL(create_udp_tunnel_socket);
>>> +
>>> +int udp_tunnel_xmit_skb(struct socket *sock, struct rtable *rt,
>>> +                       struct sk_buff *skb, __be32 src, __be32 dst,
>>> +                       __u8 tos, __u8 ttl, __be16 df, __be16 src_port,
>>> +                       __be16 dst_port, bool xnet)
>>> +{
>>> +       struct udphdr *uh;
>>> +
>>> +       __skb_push(skb, sizeof(*uh));
>>> +       skb_reset_transport_header(skb);
>>> +       uh = udp_hdr(skb);
>>> +
>>> +       uh->dest = dst_port;
>>> +       uh->source = src_port;
>>> +       uh->len = htons(skb->len);
>>> +
>>> +       udp_set_csum(sock->sk->sk_no_check_tx, skb, src, dst, skb->len);
>>> +
>>> +       return iptunnel_xmit(sock->sk, rt, skb, src, dst, IPPROTO_UDP,
>>> +                            tos, ttl, df, xnet);
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel_xmit_skb);
>>> +
>>> +#if IS_ENABLED(CONFIG_IPV6)
>>> +int udp_tunnel6_xmit_skb(struct socket *sock, struct dst_entry *dst,
>>> +                        struct sk_buff *skb, struct net_device *dev,
>>> +                        struct in6_addr *saddr, struct in6_addr *daddr,
>>> +                        __u8 prio, __u8 ttl, __be16 src_port, __be16 dst_port)
>>> +{
>>> +       struct udphdr *uh;
>>> +       struct ipv6hdr *ip6h;
>>> +       int err;
>>> +
>>> +       __skb_push(skb, sizeof(*uh));
>>> +       skb_reset_transport_header(skb);
>>> +       uh = udp_hdr(skb);
>>> +
>>> +       uh->dest = dst_port;
>>> +       uh->source = src_port;
>>> +
>>> +       uh->len = htons(skb->len);
>>> +       uh->check = 0;
>>> +
>>> +       memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt));
>>> +       IPCB(skb)->flags &= ~(IPSKB_XFRM_TUNNEL_SIZE | IPSKB_XFRM_TRANSFORMED
>>> +                           | IPSKB_REROUTED);
>>> +       skb_dst_set(skb, dst);
>>> +
>>> +       if (!skb_is_gso(skb) && !(dst->dev->features & NETIF_F_IPV6_CSUM)) {
>>> +               __wsum csum = skb_checksum(skb, 0, skb->len, 0);
>>> +
>>> +               skb->ip_summed = CHECKSUM_UNNECESSARY;
>>> +               uh->check = csum_ipv6_magic(saddr, daddr, skb->len,
>>> +                               IPPROTO_UDP, csum);
>>> +               if (uh->check == 0)
>>> +                       uh->check = CSUM_MANGLED_0;
>>> +       } else {
>>> +               skb->ip_summed = CHECKSUM_PARTIAL;
>>> +               skb->csum_start = skb_transport_header(skb) - skb->head;
>>> +               skb->csum_offset = offsetof(struct udphdr, check);
>>> +               uh->check = ~csum_ipv6_magic(saddr, daddr,
>>> +                               skb->len, IPPROTO_UDP, 0);
>>> +       }
>>> +
>>> +       __skb_push(skb, sizeof(*ip6h));
>>> +       skb_reset_network_header(skb);
>>> +       ip6h              = ipv6_hdr(skb);
>>> +       ip6h->version     = 6;
>>> +       ip6h->priority    = prio;
>>> +       ip6h->flow_lbl[0] = 0;
>>> +       ip6h->flow_lbl[1] = 0;
>>> +       ip6h->flow_lbl[2] = 0;
>>> +       ip6h->payload_len = htons(skb->len);
>>> +       ip6h->nexthdr     = IPPROTO_UDP;
>>> +       ip6h->hop_limit   = ttl;
>>> +       ip6h->daddr       = *daddr;
>>> +       ip6h->saddr       = *saddr;
>>> +
>>> +       err = handle_offloads(skb);
>>> +       if (err)
>>> +               return err;
>>> +
>>> +       ip6tunnel_xmit(skb, dev);
>>> +       return 0;
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel6_xmit_skb);
>>> +#endif
>>> +
>>> +struct udp_tunnel_sock *udp_tunnel_find_sock(struct net *net, __be16 port)
>>> +{
>>> +       struct udp_tunnel_sock *uts;
>>> +
>>> +       hlist_for_each_entry_rcu(uts, uts_head(net, port), hlist) {
>>> +               if (inet_sk(uts->sock->sk)->inet_sport == port)
>>> +                       return uts;
>>> +       }
>>> +
>>> +       return NULL;
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel_find_sock);
>>> +
>>> +void udp_tunnel_sock_release(struct udp_tunnel_sock *uts)
>>> +{
>>> +       struct sock *sk = uts->sock->sk;
>>> +       struct net *net = sock_net(sk);
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +
>>> +       spin_lock(&utn->sock_lock);
>>> +       hlist_del_rcu(&uts->hlist);
>>> +       rcu_assign_sk_user_data(uts->sock->sk, NULL);
>>> +       spin_unlock(&utn->sock_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel_sock_release);
>>> +
>>> +/* Calls the ndo_add_udp_tunnel_port of the caller in order to
>>> + * supply the listening UDP tunnel ports. Callers are expected
>>> + * to implement ndo_add_udp_tunnel_port.
>>> + */
>>> +void udp_tunnel_get_rx_port(struct net_device *dev)
>>> +{
>>> +       struct udp_tunnel_sock *uts;
>>> +       struct net *net = dev_net(dev);
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +       sa_family_t sa_family;
>>> +       __be16 port;
>>> +       unsigned int i;
>>> +
>>> +       spin_lock(&utn->sock_lock);
>>> +       for (i = 0; i < PORT_HASH_SIZE; ++i) {
>>> +               hlist_for_each_entry_rcu(uts, &utn->sock_list[i], hlist) {
>>> +                       port = inet_sk(uts->sock->sk)->inet_sport;
>>> +                       sa_family = uts->sock->sk->sk_family;
>>> +                       dev->netdev_ops->ndo_add_udp_tunnel_port(dev,
>>> +                                       sa_family, port, uts->tunnel_type);
>>> +               }
>>> +       }
>>> +       spin_unlock(&utn->sock_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(udp_tunnel_get_rx_port);
>>> +
>>> +static int __net_init udp_tunnel_init_net(struct net *net)
>>> +{
>>> +       struct udp_tunnel_net *utn = net_generic(net, udp_tunnel_net_id);
>>> +       unsigned int h;
>>> +
>>> +       spin_lock_init(&utn->sock_lock);
>>> +
>>> +       for (h = 0; h < PORT_HASH_SIZE; h++)
>>> +               INIT_HLIST_HEAD(&utn->sock_list[h]);
>>> +
>>> +       return 0;
>>> +}
>>> +
>>> +static struct pernet_operations udp_tunnel_net_ops = {
>>> +       .init = udp_tunnel_init_net,
>>> +       .exit = NULL,
>>> +       .id = &udp_tunnel_net_id,
>>> +       .size = sizeof(struct udp_tunnel_net),
>>> +};
>>> +
>>> +static int __init udp_tunnel_init(void)
>>> +{
>>> +       return register_pernet_subsys(&udp_tunnel_net_ops);
>>> +}
>>> +late_initcall(udp_tunnel_init);
>>> +
>>>  MODULE_LICENSE("GPL");
>>> --
>>> 1.7.9.5
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 02/10] udp: Expand UDP tunnel common APIs
  2014-07-24 20:47       ` Tom Herbert
@ 2014-07-24 20:54         ` Andy Zhou
  0 siblings, 0 replies; 46+ messages in thread
From: Andy Zhou @ 2014-07-24 20:54 UTC (permalink / raw)
  To: Tom Herbert; +Cc: David Miller, Linux Netdev List

On Thu, Jul 24, 2014 at 1:47 PM, Tom Herbert <therbert@google.com> wrote:
> On Thu, Jul 24, 2014 at 1:23 PM, Andy Zhou <azhou@nicira.com> wrote:
>> The general layering I see is tunnel_user (e.g. OVS) -> tunnel_driver
>> (e.g. vxlan) -> udp_tunnel.
>>
> Simpler and more efficient if you stick with UDP->UDP_encap_handler as
> the most general model for RX.

I believe this is the case now, and I don't plan to change it. I will
just stop exposing the higher-layer callback to the udp_tunnel layer.

>
>> The two receive functions are from two separate layers above
>> udp_tunnel. I can restructure the APIs to make it
>> cleaner.
>>
> The only necessary function for opening the UDP encap port is the UDP
> receive handler (encap receive). If you want to implement more
> indirection within your handler then it should be pretty easy to
> create another layer of API for that purpose.
>

Yes, this is the direction I am going towards.


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [net-next 00/10] Add Geneve
  2014-07-24 17:40   ` Tom Herbert
@ 2014-07-24 21:03     ` Andy Zhou
  2014-07-24 22:03       ` Tom Herbert
  0 siblings, 1 reply; 46+ messages in thread
From: Andy Zhou @ 2014-07-24 21:03 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Or Gerlitz, David Miller, netdev

@or.gerlitz@gmail.com: Yes, this is the logical order of the patch
series. I can improve the cover letter to make it clearer.

@therbert@google.com: I will add the L2TP refactor to the series in the
next version. Given that the port number and associated protocol are
still required by NIC device drivers for advanced offloads, we may still
need to define the protocol types. Are you O.K. with this? If not, do you
have any other suggestions?

On Thu, Jul 24, 2014 at 10:40 AM, Tom Herbert <therbert@google.com> wrote:
> On Wed, Jul 23, 2014 at 11:58 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
>> On Tue, Jul 22, 2014 at 1:19 PM, Andy Zhou <azhou@nicira.com> wrote:
>>> Following patches adds initial support for Geneve tunnel protocol
>>
>> Just to make this a bit more clear, would it be correct to say that
>> the logical ordering here is as follows:
>>
> Agreed, improvements to the general infrastructure to support UDP
> tunneling should be done first. This was already begun with
> introduction of udp_tunnel.[ch] and the udp_tunnel_xmit functions seem
> like a nice addition at least.
>
> Also, we have at least two instances of UDP tunneling in the code that
> should be addressed when making interface improvements: VXLAN and L2TP.
> Please make sure *both* of these are considered with such patches (the
> needs of Geneve, GUE, LISP, etc. should also be considered, but please no
> protocol-specific stuff in the common infrastructure code!)
>
>>> 1. Add common UDP tunnel code into UDP tunnel support function
>>> 2. Refactor vxlan driver to make use of the UDP tunnel support
>>> 3. Add Geneve driver.
>>
>> (implemented by patches 1-5 below)
>>
>>> Andy Zhou (5):
>>>   net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
>>>   udp: Expand UDP tunnel common APIs
>>>   vxlan: Remove vxlan_get_rx_port()
>>>   net: Refactor vxlan driver to make use of common UDP tunnel functions
>>>   net: Add Geneve tunneling protocol driver
>>
>> and on top of that
>>
>>> 4. Refactor Openvswitch  in preparation for #5
>>> 5. Add Geneve support to Openvswitch.
>>
>> implemented by patches 6-10 (below)
>>
>>> Jesse Gross (5):
>>>   openvswitch: Eliminate memset() from flow_extract.
>>>   openvswitch: Add support for matching on OAM packets.
>>>   openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure.
>>>   openvswitch: Factor out allocation and verification of actions.
>>>   openvswitch: Add support for Geneve tunneling.
>>
>> I understand the wish to eventually have something that goes beyond
>> refactoring of the vxlan and tunneling code plus Geneve basics. However,
>> doesn't the first part of the series (patches 1-5) have something in
>> common with Tom's GUE work, which is currently under review too? I think
>> we first need to see how the basic elements from your series fit
>> together with GUE.
>>
>> Or.

* Re: [net-next 00/10] Add Geneve
  2014-07-24 21:03     ` Andy Zhou
@ 2014-07-24 22:03       ` Tom Herbert
  0 siblings, 0 replies; 46+ messages in thread
From: Tom Herbert @ 2014-07-24 22:03 UTC (permalink / raw)
  To: Andy Zhou; +Cc: Or Gerlitz, David Miller, netdev

On Thu, Jul 24, 2014 at 2:03 PM, Andy Zhou <azhou@nicira.com> wrote:
> @or.gerlitz@gmail.com:  Yes, this is the logical order of the patch
> series. I can improve the cover letter to make it clearer.
>
Please send patches in two different patch sets.

>
> @therbert@google.com: I will add the L2TP refactor to the series in the
> next version. Given that the port number and associated protocol are
> still required by NIC device drivers for advanced offloads, we may still
> need to define the protocol types. Are you O.K. with this? If not, any
> other suggestions?
>
If you really need this, then the constants for this should be moved
to their own header file. Also, something needs to be added to
hw_features to give control over the feature. For instance, if I
discovered that my VXLAN device was incorrectly ignoring UDP
checksums, I'd want the ability to turn off checksum offload of VXLAN
without needing to disable checksum offload for everyone else. I'm not
sure what goes into features since what the device actually does with
the port seems undefined; maybe the features would be purposely generic
(e.g. NETIF_F_VXLAN_RX_PARSING). Also, we should not blindly be
telling devices port information for things they don't support or need
to know about; doing so would be a security risk.
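
To illustrate the kind of control described above, here is a minimal
userspace sketch in C. It is not taken from the patch series: the flag
names (TUN_F_VXLAN_RX_PARSING, TUN_F_GENEVE_RX_PARSING), struct fake_dev,
and should_notify_port() are all hypothetical stand-ins for whatever the
real netdev feature bits and driver hooks would be. The point is only that
a port is announced to a device when the device both supports and still
has enabled the matching per-protocol feature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-protocol feature bits (illustrative names only). */
#define TUN_F_VXLAN_RX_PARSING  (UINT32_C(1) << 0)
#define TUN_F_GENEVE_RX_PARSING (UINT32_C(1) << 1)

enum tunnel_proto { TUN_PROTO_VXLAN, TUN_PROTO_GENEVE, TUN_PROTO_L2TP };

/* Hypothetical stand-in for a network device. */
struct fake_dev {
    uint32_t hw_features; /* what the hardware is able to do */
    uint32_t features;    /* what the administrator has left enabled */
};

/* Map a tunnel protocol to the feature bit that governs it, or 0 if none. */
static uint32_t proto_feature(enum tunnel_proto proto)
{
    switch (proto) {
    case TUN_PROTO_VXLAN:
        return TUN_F_VXLAN_RX_PARSING;
    case TUN_PROTO_GENEVE:
        return TUN_F_GENEVE_RX_PARSING;
    default:
        return 0; /* e.g. L2TP: no offload feature defined in this sketch */
    }
}

/* Only tell the device about a port if it supports and has enabled the
 * matching feature; otherwise the port information stays in the stack. */
static bool should_notify_port(const struct fake_dev *dev,
                               enum tunnel_proto proto)
{
    assert(dev != NULL);
    uint32_t bit = proto_feature(proto);
    return bit != 0 && (dev->hw_features & dev->features & bit) != 0;
}
```

With this shape, clearing TUN_F_VXLAN_RX_PARSING in features stops VXLAN
ports from being announced to that one device without touching Geneve or
the generic checksum offload of other traffic.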

Also, L2TP (and probably LISP) are interesting use cases since the
encapsulated protocol does not appear in the header and can only be
determined from the tunnel context. Telling devices about L2TP ports
probably isn't enough to do anything meaningful.

> On Thu, Jul 24, 2014 at 10:40 AM, Tom Herbert <therbert@google.com> wrote:
>> On Wed, Jul 23, 2014 at 11:58 PM, Or Gerlitz <or.gerlitz@gmail.com> wrote:
>>> On Tue, Jul 22, 2014 at 1:19 PM, Andy Zhou <azhou@nicira.com> wrote:
>>>> Following patches adds initial support for Geneve tunnel protocol
>>>
>>> Just to make this a bit more clear, would it be correct to say that
>>> the logical ordering here is as follows:
>>>
>> Agreed, improvements to the general infrastructure to support UDP
>> tunneling should be done first. This was already begun with
>> introduction of udp_tunnel.[ch] and the udp_tunnel_xmit functions seem
>> like a nice addition at least.
>>
>> Also, we have at least two instances of UDP tunneling in the code that
>> should be addressed when making interface improvements: VXLAN and L2TP. Please
>> make sure *both* of these are considered with such patches (also the
>> needs for Geneve, GUE, LISP, etc. should be considered, but please no
>> protocol specific stuff in the common infrastructure code!)
>>
>>>> 1. Add common UDP tunnel code into UDP tunnel support function
>>>> 2. Refactor vxlan driver to make use of the UDP tunnel support
>>>> 3. Add Geneve driver.
>>>
>>> (implemented by patches 1-5 below)
>>>
>>>> Andy Zhou (5):
>>>>   net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port.
>>>>   udp: Expand UDP tunnel common APIs
>>>>   vxlan: Remove vxlan_get_rx_port()
>>>>   net: Refactor vxlan driver to make use of common UDP tunnel functions
>>>>   net: Add Geneve tunneling protocol driver
>>>
>>> and on top of that
>>>
>>>> 4. Refactor Openvswitch  in preparation for #5
>>>> 5. Add Geneve support to Openvswitch.
>>>
>>> implemented by patches 6-10 (below)
>>>
>>>> Jesse Gross (5):
>>>>   openvswitch: Eliminate memset() from flow_extract.
>>>>   openvswitch: Add support for matching on OAM packets.
>>>>   openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure.
>>>>   openvswitch: Factor out allocation and verification of actions.
>>>>   openvswitch: Add support for Geneve tunneling.
>>>
>>> I understand the wish to eventually have something that goes beyond
>>> refactoring the vxlan and tunneling code plus Geneve basics. However,
>>> doesn't the 1st part of the series (patches 1-5) have something in
>>> common with Tom's GUE work, which is currently under review too? I
>>> think we first need to see how the basic elements from your series go
>>> along together with GUE.
>>>
>>> Or.

* Re: [net-next 10/10] openvswitch: Add support for Geneve tunneling.
       [not found]       ` <CA+mtBx9umxiFYtnG1kzFkK+Ev=b=4f3q2OOow2QcfCB5rUTUyA@mail.gmail.com>
@ 2014-07-24 22:59         ` Jesse Gross
  2014-07-24 23:45           ` Tom Herbert
  0 siblings, 1 reply; 46+ messages in thread
From: Jesse Gross @ 2014-07-24 22:59 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Andy Zhou, David Miller, Linux Netdev List

On Thu, Jul 24, 2014 at 12:43 PM, Tom Herbert <therbert@google.com> wrote:
> On Wed, Jul 23, 2014 at 9:10 PM, Jesse Gross <jesse@nicira.com> wrote:
>> On Wed, Jul 23, 2014 at 4:29 PM, Tom Herbert <therbert@google.com> wrote:
>> > This check also applies in the OAM case where there is no data packet
>> > but we still enforce the protocol field to be Ethernet (meaning of
>> > prot_type when OAM bit is set is ambiguous in the draft). As I
>> > mentioned on the nvo3 list, this OAM bit is really a 1-bit packet
>> > type. If this bit is donated to version field (make it a type version
>> > field) then we can switch on ver_type above and create another
>> > processing path for OAM so that the prot_type is at least not
>> > unnecessarily verified in that case and the bits could even be reused
>> > for some OAM specific purpose.
>>
>> I think the draft is clear :) The value of the OAM bit does not change
>> the interpretation of the protocol field. This is true in the other
>> drafts as well.
>
>
>>
>> OAM packets are essentially just high priority packets (presumably
>> with some kind of control semantics but that depends on the control
>> plane). That means that they might be BFD or some other heartbeat,
>> header fragments for tracing, or really anything else that should be
>> treated as control. In all of these cases, the protocol type still
>> needs to indicate the format of the data.
>>
> I see, I think you're saying that control messages still contain some sort
> of message after Geneve header whose type is indicated by protocol field.
> So, if OAM doesn't change the interpretation of the protocol field then this
> must be an Ether_type protocol, hence control messages must be formatted as
> Ether_types. So then I would extrapolate that control messages could be
> directed to the control plane based on demux of protocol field (i.e. type is
> something like NSH, or a new control message Ethertype). Consequently, the
> OAM bit is really just a hint to request expedited processing but does not
> influence demux so could in theory be ignored without loss of functionality.
> Is this interpretation correct?

Yes, this is largely correct. There is one other possible significance
to the OAM flag, which is to say "do not forward the resulting packet
after decapsulation". I think this is slightly less likely to be an
issue for Geneve (although I can imagine situations where it might be)
but it was a problem with TRILL OAM, which used data-like packets and
has no OAM bit. The result is that they had to use all kinds of
special addresses and other things to separate out data traffic from
OAM.
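
As a rough illustration of the semantics described above, here is a small
C sketch. It assumes the base header layout as I understand it from the
Geneve draft (2-bit version, 6-bit option length in 4-byte words, O and C
flag bits, 16-bit protocol type, 24-bit VNI, 8 reserved bits). The names
struct geneve_info, geneve_parse() and geneve_may_forward() are invented
for this example and are not the API of the driver in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GENEVE_BASE_HLEN  8
#define GENEVE_F_OAM      0x80 /* 'O' bit in the second header byte */
#define GENEVE_F_CRITICAL 0x40 /* 'C' bit in the second header byte */

/* Hypothetical parsed view of the fixed Geneve header. */
struct geneve_info {
    uint8_t  ver;
    uint8_t  opt_len;    /* option length in bytes */
    bool     oam;
    bool     critical;
    uint16_t proto_type; /* EtherType of the payload, host byte order */
    uint32_t vni;
};

/* Parse the fixed header; returns 0 on success, -1 if the buffer is too
 * short to hold the header plus its declared options. */
static int geneve_parse(const uint8_t *buf, size_t len,
                        struct geneve_info *out)
{
    assert(buf != NULL && out != NULL);
    if (len < GENEVE_BASE_HLEN)
        return -1;

    out->ver = buf[0] >> 6;
    out->opt_len = (uint8_t)((buf[0] & 0x3f) * 4);
    out->oam = (buf[1] & GENEVE_F_OAM) != 0;
    out->critical = (buf[1] & GENEVE_F_CRITICAL) != 0;
    /* The protocol type is read the same way whether or not OAM is set. */
    out->proto_type = (uint16_t)((buf[2] << 8) | buf[3]);
    out->vni = ((uint32_t)buf[4] << 16) | ((uint32_t)buf[5] << 8) | buf[6];

    if (len < (size_t)GENEVE_BASE_HLEN + out->opt_len)
        return -1;
    return 0;
}

/* OAM packets are handed to the control plane and are not forwarded
 * after decapsulation. */
static bool geneve_may_forward(const struct geneve_info *g)
{
    assert(g != NULL);
    return !g->oam;
}
```

The demux on proto_type is identical for data and OAM packets; the flag
only decides whether the decapsulated result may be forwarded.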

> A concrete example of a control message with headers would be useful in
> understanding the OAM bit. For instance, how would a mechanism to
> implement NAT keepalive of UDP flows be implemented (see RFC3948 for how
> this was done in ESP/UDP)?

A real example that comes to mind is if you want to have a heartbeat
protocol such as CFM or BFD. You would set the appropriate EtherType
for this protocol and then run it as the payload. As you said, it
would already be obvious that this is a control protocol based on the
EtherType and the OAM bit is not strictly required. However, it would
allow easy (and generic) separation of control traffic from data
traffic so in the event that you have a flood of data packets, the
heartbeats can be prioritized to avoid falsely detecting a failure.

If I was trying to implement just a totally simple keep alive, then
there's really no need to put anything in the payload. I would
probably just set the EtherType to zero (guaranteed to be unused) and
leave it at that.
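
For what it's worth, a keepalive along those lines could be sketched in C
as below. This is only an illustration of the idea: nothing in the draft
defines a zero protocol type as a keepalive, the header layout is the one
I understand from the Geneve draft, and geneve_build_keepalive() is a
made-up helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GENEVE_BASE_HLEN 8
#define GENEVE_F_OAM     0x80 /* 'O' bit in the second header byte */

/* Fill in an 8-byte Geneve header with no options, the OAM bit set, a
 * protocol type of zero, and no payload following it. */
static void geneve_build_keepalive(uint8_t hdr[GENEVE_BASE_HLEN],
                                   uint32_t vni)
{
    assert(hdr != NULL);
    assert(vni <= 0xffffff); /* the VNI field is 24 bits wide */

    memset(hdr, 0, GENEVE_BASE_HLEN); /* ver 0, opt_len 0, proto type 0 */
    hdr[1] = GENEVE_F_OAM;
    hdr[4] = (uint8_t)(vni >> 16);
    hdr[5] = (uint8_t)(vni >> 8);
    hdr[6] = (uint8_t)vni;
}
```

A receiver that honours the OAM bit would consume such a packet locally
and never try to interpret or forward a payload.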

* Re: [net-next 10/10] openvswitch: Add support for Geneve tunneling.
  2014-07-24 22:59         ` Jesse Gross
@ 2014-07-24 23:45           ` Tom Herbert
  2014-07-25  1:04             ` Jesse Gross
  0 siblings, 1 reply; 46+ messages in thread
From: Tom Herbert @ 2014-07-24 23:45 UTC (permalink / raw)
  To: Jesse Gross; +Cc: Andy Zhou, David Miller, Linux Netdev List

On Thu, Jul 24, 2014 at 3:59 PM, Jesse Gross <jesse@nicira.com> wrote:
> On Thu, Jul 24, 2014 at 12:43 PM, Tom Herbert <therbert@google.com> wrote:
>> On Wed, Jul 23, 2014 at 9:10 PM, Jesse Gross <jesse@nicira.com> wrote:
>>> On Wed, Jul 23, 2014 at 4:29 PM, Tom Herbert <therbert@google.com> wrote:
>>> > This check also applies in the OAM case where there is no data packet
>>> > but we still enforce the protocol field to be Ethernet (meaning of
>>> > prot_type when OAM bit is set is ambiguous in the draft). As I
>>> > mentioned on the nvo3 list, this OAM bit is really a 1-bit packet
>>> > type. If this bit is donated to version field (make it a type version
>>> > field) then we can switch on ver_type above and create another
>>> > processing path for OAM so that the prot_type is at least not
>>> > unnecessarily verified in that case and the bits could even be reused
>>> > for some OAM specific purpose.
>>>
>>> I think the draft is clear :) The value of the OAM bit does not change
>>> the interpretation of the protocol field. This is true in the other
>>> drafts as well.
>>
>>
>>>
>>> OAM packets are essentially just high priority packets (presumably
>>> with some kind of control semantics but that depends on the control
>>> plane). That means that they might be BFD or some other heartbeat,
>>> header fragments for tracing, or really anything else that should be
>>> treated as control. In all of these cases, the protocol type still
>>> needs to indicate the format of the data.
>>>
>> I see, I think you're saying that control messages still contain some sort
>> of message after Geneve header whose type is indicated by protocol field.
>> So, if OAM doesn't change the interpretation of the protocol field then this
>> must be an Ether_type protocol, hence control messages must be formatted as
>> Ether_types. So then I would extrapolate that control messages could be
>> directed to the control plane based on demux of protocol field (i.e. type is
>> something like NSH, or a new control message Ethertype). Consequently, the
>> OAM bit is really just a hint to request expedited processing but does not
>> influence demux so could in theory be ignored without loss of functionality.
>> Is this interpretation correct?
>
> Yes, this is largely correct. There is one other possible significance
> to the OAM flag, which is to say "do not forward the resulting packet
> after decapsulation". I think this is slightly less likely to be an
> issue for Geneve (although I can imagine situations where it might be)
> but it was a problem with TRILL OAM, which used data-like packets and
> has no OAM bit. The result is that they had to use all kinds of
> special addresses and other things to separate out data traffic from
> OAM.
>
Thanks for the clarification.

>> A concrete example of a control message with headers would be useful in
>> understanding the OAM bit. For instance, how would a mechanism to
>> implement NAT keepalive of UDP flows be implemented (see RFC3948 for how
>> this was done in ESP/UDP)?
>
> A real example that comes to mind is if you want to have a heartbeat
> protocol such as CFM or BFD. You would set the appropriate EtherType
> for this protocol and then run it as the payload. As you said, it
> would already be obvious that this is a control protocol based on the
> EtherType and the OAM bit is not strictly required. However, it would
> allow easy (and generic) separation of control traffic from data
> traffic so in the event that you have a flood of data packets, the
> heartbeats can be prioritized to avoid falsely detecting a failure.
>

I don't see a whole lot of value in using an OAM bit as a priority bit;
we don't have any support for it in the network, which is where we really
need to apply priority. There are already existing mechanisms in the
network to deal with prioritization which are more generic (IP TOS,
Ethernet priority) and could work today. The sender of these control
messages should be setting the priority to reflect this.
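
As a concrete example of the sender-side marking, a userspace sender on
Linux can set the IP TOS byte on its UDP socket with the standard IP_TOS
socket option. The sketch below is illustrative only; dscp_to_tos() and
mark_control_socket() are hypothetical helper names, and the choice of
DSCP value is left to the caller.

```c
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Convert a 6-bit DSCP code point into the TOS byte value: DSCP occupies
 * the upper six bits, the lower two (ECN) bits are left at zero. */
static int dscp_to_tos(unsigned int dscp)
{
    assert(dscp < 64);
    return (int)(dscp << 2);
}

/* Mark all IPv4 packets sent on fd with the given DSCP.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
static int mark_control_socket(int fd, unsigned int dscp)
{
    int tos = dscp_to_tos(dscp);
    return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}
```

For instance, mark_control_socket(fd, 48) would request CS6, a code point
conventionally used for network control traffic, so every hop that honours
DSCP can prioritize the packet without parsing the tunnel header.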


> If I was trying to implement just a totally simple keep alive, then
> there's really no need to put anything in the payload. I would
> probably just set the EtherType to zero (guaranteed to be unused) and
> leave it at that.

Interesting idea. I don't think 0 has been reserved for this purpose,
but it does seem like a NULL ether_type might be useful. Has any other
encapsulation protocol carrying an ether_type specified a special meaning
for zero?

* Re: [net-next 10/10] openvswitch: Add support for Geneve tunneling.
  2014-07-24 23:45           ` Tom Herbert
@ 2014-07-25  1:04             ` Jesse Gross
  0 siblings, 0 replies; 46+ messages in thread
From: Jesse Gross @ 2014-07-25  1:04 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Andy Zhou, David Miller, Linux Netdev List

On Thu, Jul 24, 2014 at 7:45 PM, Tom Herbert <therbert@google.com> wrote:
> On Thu, Jul 24, 2014 at 3:59 PM, Jesse Gross <jesse@nicira.com> wrote:
>> On Thu, Jul 24, 2014 at 12:43 PM, Tom Herbert <therbert@google.com> wrote:
>>> On Wed, Jul 23, 2014 at 9:10 PM, Jesse Gross <jesse@nicira.com> wrote:
>>>> On Wed, Jul 23, 2014 at 4:29 PM, Tom Herbert <therbert@google.com> wrote:
>>>> > This check also applies in the OAM case where there is no data packet
>>>> > but we still enforce the protocol field to be Ethernet (meaning of
>>>> > prot_type when OAM bit is set is ambiguous in the draft). As I
>>>> > mentioned on the nvo3 list, this OAM bit is really a 1-bit packet
>>>> > type. If this bit is donated to version field (make it a type version
>>>> > field) then we can switch on ver_type above and create another
>>>> > processing path for OAM so that the prot_type is at least not
>>>> > unnecessarily verified in that case and the bits could even be reused
>>>> > for some OAM specific purpose.
>>>>
>>>> I think the draft is clear :) The value of the OAM bit does not change
>>>> the interpretation of the protocol field. This is true in the other
>>>> drafts as well.
>>>
>>>
>>>>
>>>> OAM packets are essentially just high priority packets (presumably
>>>> with some kind of control semantics but that depends on the control
>>>> plane). That means that they might be BFD or some other heartbeat,
>>>> header fragments for tracing, or really anything else that should be
>>>> treated as control. In all of these cases, the protocol type still
>>>> needs to indicate the format of the data.
>>>>
>>> I see, I think you're saying that control messages still contain some sort
>>> of message after Geneve header whose type is indicated by protocol field.
>>> So, if OAM doesn't change the interpretation of the protocol field then this
>>> must be an Ether_type protocol, hence control messages must be formatted as
>>> Ether_types. So then I would extrapolate that control messages could be
>>> directed to the control plane based on demux of protocol field (i.e. type is
>>> something like NSH, or a new control message Ethertype). Consequently, the
>>> OAM bit is really just a hint to request expedited processing but does not
>>> influence demux so could in theory be ignored without loss of functionality.
>>> Is this interpretation correct?
>>
>> Yes, this is largely correct. There is one other possible significance
>> to the OAM flag, which is to say "do not forward the resulting packet
>> after decapsulation". I think this is slightly less likely to be an
>> issue for Geneve (although I can imagine situations where it might be)
>> but it was a problem with TRILL OAM, which used data-like packets and
>> has no OAM bit. The result is that they had to use all kinds of
>> special addresses and other things to separate out data traffic from
>> OAM.
>>
> Thanks for the clarification.
>
>>> A concrete example of a control message with headers would be useful in
>>> understanding the OAM bit. For instance, how would a mechanism to
>>> implement NAT keepalive of UDP flows be implemented (see RFC3948 for how
>>> this was done in ESP/UDP)?
>>
>> A real example that comes to mind is if you want to have a heartbeat
>> protocol such as CFM or BFD. You would set the appropriate EtherType
>> for this protocol and then run it as the payload. As you said, it
>> would already be obvious that this is a control protocol based on the
>> EtherType and the OAM bit is not strictly required. However, it would
>> allow easy (and generic) separation of control traffic from data
>> traffic so in the event that you have a flood of data packets, the
>> heartbeats can be prioritized to avoid falsely detecting a failure.
>>
>
> I don't see a whole lot of value in using an OAM bit as a priority bit;
> we don't have any support for it in the network, which is where we really
> need to apply priority. There are already existing mechanisms in the
> network to deal with prioritization which are more generic (IP TOS,
> Ethernet priority) and could work today. The sender of these control
> messages should be setting the priority to reflect this.

Some things can be done with ToS marking but this is really more a
"send to control plane" bit (which could be a special queue in a host
or the management CPU in a physical switch), which is a little
different from priority. Between the semantics of distinguishing
control packets from data packets and different queuing I think it
ends up proving to be fairly useful.

>> If I was trying to implement just a totally simple keep alive, then
>> there's really no need to put anything in the payload. I would
>> probably just set the EtherType to zero (guaranteed to be unused) and
>> leave it at that.
>
> Interesting idea. I don't think 0 has been reserved for this purpose,
> but it does seem like a NULL ether_type might be useful. Has any other
> encapsulation protocol carrying an ether_type specified a special meaning
> for zero?

This has never come up as something that I've needed, so I haven't
really researched/designed it all that much. An EtherType of zero is
guaranteed to be unused since it is reserved for compatibility with
802.2 frames but I don't know of anyone else using it that way.
However, I agree that having a NULL EtherType seems potentially
useful.

Thread overview: 46+ messages
2014-07-22 10:19 [net-next 00/10] Add Geneve Andy Zhou
2014-07-22 10:19 ` [net-next 01/10] net: Rename ndo_add_vxlan_port to ndo_add_udp_tunnel_port Andy Zhou
2014-07-22 10:49   ` Varka Bhadram
2014-07-24  6:40   ` Or Gerlitz
2014-07-24 20:28     ` Andy Zhou
2014-07-22 10:19 ` [net-next 02/10] udp: Expand UDP tunnel common APIs Andy Zhou
     [not found]   ` <CA+mtBx9M_BpjT-_Egng+jFxmqJzdC2Npg0ufE2ZSAb9Lhw8hxg@mail.gmail.com>
2014-07-22 21:02     ` Andy Zhou
2014-07-22 21:16       ` Tom Herbert
2014-07-22 21:56         ` Jesse Gross
2014-07-22 22:38           ` Tom Herbert
2014-07-22 22:55             ` Alexander Duyck
2014-07-22 23:24               ` Tom Herbert
2014-07-23  2:16                 ` Alexander Duyck
2014-07-23  3:53                   ` Tom Herbert
2014-07-23  4:35                     ` Jesse Gross
2014-07-23 15:45                       ` Tom Herbert
2014-07-24  3:24                         ` Jesse Gross
2014-07-22 23:12             ` Jesse Gross
2014-07-23 19:57   ` Tom Herbert
2014-07-24 20:23     ` Andy Zhou
2014-07-24 20:47       ` Tom Herbert
2014-07-24 20:54         ` Andy Zhou
2014-07-22 10:19 ` [net-next 03/10] vxlan: Remove vxlan_get_rx_port() Andy Zhou
     [not found]   ` <CAKgT0UeRSc3MaZrLmXyx4jPZO+F1hS5imR1TjFkvKp4S8nQmeg@mail.gmail.com>
2014-07-23  3:57     ` Andy Zhou
2014-07-22 10:19 ` [net-next 04/10] net: Refactor vxlan driver to make use of common UDP tunnel functions Andy Zhou
2014-07-24  6:46   ` Or Gerlitz
2014-07-22 10:19 ` [net-next 05/10] net: Add Geneve tunneling protocol driver Andy Zhou
2014-07-22 23:12   ` Alexander Duyck
2014-07-22 23:24     ` Jesse Gross
2014-07-23 14:11       ` John W. Linville
2014-07-23 18:20   ` Stephen Hemminger
2014-07-22 10:19 ` [net-next 06/10] openvswitch: Eliminate memset() from flow_extract Andy Zhou
2014-07-22 10:19 ` [net-next 07/10] openvswitch: Add support for matching on OAM packets Andy Zhou
2014-07-22 10:19 ` [net-next 08/10] openvswitch: Wrap struct ovs_key_ipv4_tunnel in a new structure Andy Zhou
2014-07-22 10:19 ` [net-next 09/10] openvswitch: Factor out allocation and verification of actions Andy Zhou
2014-07-22 10:19 ` [net-next 10/10] openvswitch: Add support for Geneve tunneling Andy Zhou
2014-07-23 20:29   ` Tom Herbert
2014-07-24  4:10     ` Jesse Gross
     [not found]       ` <CA+mtBx9umxiFYtnG1kzFkK+Ev=b=4f3q2OOow2QcfCB5rUTUyA@mail.gmail.com>
2014-07-24 22:59         ` Jesse Gross
2014-07-24 23:45           ` Tom Herbert
2014-07-25  1:04             ` Jesse Gross
2014-07-22 10:54 ` [net-next 00/10] Add Geneve Varka Bhadram
2014-07-24  6:58 ` Or Gerlitz
2014-07-24 17:40   ` Tom Herbert
2014-07-24 21:03     ` Andy Zhou
2014-07-24 22:03       ` Tom Herbert
