All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3 v3] macvtap driver
@ 2010-01-27 10:04 ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 10:04 UTC (permalink / raw)
  To: David Miller
  Cc: Stephen Hemminger, Patrick McHardy, Michael S. Tsirkin,
	Herbert Xu, Or Gerlitz, netdev, bridge, linux-kernel

This is the third version of the macvtap device driver, following another major restructuring and a lot of bug fixes:

* Change macvtap to be based around a struct sock
* macvtap: fix initialization
* return 0 to netlink
* don't use rcu for q->file and q->vlan pointers
* macvtap: checkpatch.pl fixes
* macvtap: fix tun IFF flags
* Use a struct socket to make tx flow control work
* disable BH processing during transmit
* only add an ethernet header for receive not forward
* allocate the SKB using GFP_NOWAIT since we're
  in rcu_read_lock
* use atomic allocation for socket
* fix blocking on send
* do not destroy netdev twice in error path

There are still known problems, but unless there
are fundamental concerns, I'd like this to go
into net-next as an experimental driver,
fixing up the remaining problems by 2.6.34-rc1.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [Bridge] [PATCH 0/3 v3] macvtap driver
@ 2010-01-27 10:04 ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 10:04 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel, Or Gerlitz

This is the third version of the macvtap device driver, following another major restructuring and a lot of bug fixes:

* Change macvtap to be based around a struct sock
* macvtap: fix initialization
* return 0 to netlink
* don't use rcu for q->file and q->vlan pointers
* macvtap: checkpatch.pl fixes
* macvtap: fix tun IFF flags
* Use a struct socket to make tx flow control work
* disable BH processing during transmit
* only add an ethernet header for receive not forward
* allocate the SKB using GFP_NOWAIT since we're
  in rcu_read_lock
* use atomic allocation for socket
* fix blocking on send
* do not destroy netdev twice in error path

There are still known problems, but unless there
are fundamental concerns, I'd like this to go
into net-next as an experimental driver,
fixing up the remaining problems by 2.6.34-rc1.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 1/3] net: maintain namespace isolation between vlan and real device
  2010-01-27 10:04 ` [Bridge] " Arnd Bergmann
@ 2010-01-27 10:05   ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 10:05 UTC (permalink / raw)
  To: David Miller
  Cc: Stephen Hemminger, Patrick McHardy, Michael S. Tsirkin,
	Herbert Xu, Or Gerlitz, netdev, bridge, linux-kernel

In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.

To make sure that classification stays within a single namespace,
this resets the potentially critical fields.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c     |    2 +-
 include/linux/netdevice.h |    9 +++++++++
 net/8021q/vlan_dev.c      |    2 +-
 net/core/dev.c            |   35 +++++++++++++++++++++++++++++++----
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index bad1303..e0436fd 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -269,7 +269,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 xmit_world:
-	skb->dev = vlan->lowerdev;
+	skb_set_dev(skb, vlan->lowerdev);
 	return dev_queue_xmit(skb);
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 93a32a5..622ba5a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1004,6 +1004,15 @@ static inline bool netdev_uses_dsa_tags(struct net_device *dev)
 	return 0;
 }
 
+#ifndef CONFIG_NET_NS
+static inline void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb->dev = dev;
+}
+#else /* CONFIG_NET_NS */
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev);
+#endif
+
 static inline bool netdev_uses_trailer_tags(struct net_device *dev)
 {
 #ifdef CONFIG_NET_DSA_TAG_TRAILER
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 77a49ff..95034a8 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -322,7 +322,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
 	}
 
 
-	skb->dev = vlan_dev_info(dev)->real_dev;
+	skb_set_dev(skb, vlan_dev_info(dev)->real_dev);
 	len = skb->len;
 	ret = dev_queue_xmit(skb);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2cba5c5..e80403a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1448,13 +1448,10 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 	if (skb->len > (dev->mtu + dev->hard_header_len))
 		return NET_RX_DROP;
 
-	skb_dst_drop(skb);
+	skb_set_dev(skb, dev);
 	skb->tstamp.tv64 = 0;
 	skb->pkt_type = PACKET_HOST;
 	skb->protocol = eth_type_trans(skb, dev);
-	skb->mark = 0;
-	secpath_reset(skb);
-	nf_reset(skb);
 	return netif_rx(skb);
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
@@ -1614,6 +1611,36 @@ static bool dev_can_checksum(struct net_device *dev, struct sk_buff *skb)
 	return false;
 }
 
+/**
+ * skb_dev_set -- assign a buffer to a new device
+ * @skb: buffer for the new device
+ * @dev: network device
+ *
+ * If an skb is owned by a device already, we have to reset
+ * all data private to the namespace a device belongs to
+ * before assigning it a new device.
+ */
+#ifdef CONFIG_NET_NS
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb_dst_drop(skb);
+	if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) {
+		secpath_reset(skb);
+		nf_reset(skb);
+		skb_init_secmark(skb);
+		skb->mark = 0;
+		skb->priority = 0;
+		skb->nf_trace = 0;
+		skb->ipvs_property = 0;
+#ifdef CONFIG_NET_SCHED
+		skb->tc_index = 0;
+#endif
+	}
+	skb->dev = dev;
+}
+EXPORT_SYMBOL(skb_set_dev);
+#endif /* CONFIG_NET_NS */
+
 /*
  * Invalidate hardware checksum when packet is to be mangled, and
  * complete checksum manually on outgoing path.
-- 
1.6.3.3



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Bridge] [PATCH 1/3] net: maintain namespace isolation between vlan and real device
@ 2010-01-27 10:05   ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 10:05 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel, Or Gerlitz

In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.

To make sure that classification stays within a single namespace,
this resets the potentially critical fields.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c     |    2 +-
 include/linux/netdevice.h |    9 +++++++++
 net/8021q/vlan_dev.c      |    2 +-
 net/core/dev.c            |   35 +++++++++++++++++++++++++++++++----
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index bad1303..e0436fd 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -269,7 +269,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 xmit_world:
-	skb->dev = vlan->lowerdev;
+	skb_set_dev(skb, vlan->lowerdev);
 	return dev_queue_xmit(skb);
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 93a32a5..622ba5a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1004,6 +1004,15 @@ static inline bool netdev_uses_dsa_tags(struct net_device *dev)
 	return 0;
 }
 
+#ifndef CONFIG_NET_NS
+static inline void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb->dev = dev;
+}
+#else /* CONFIG_NET_NS */
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev);
+#endif
+
 static inline bool netdev_uses_trailer_tags(struct net_device *dev)
 {
 #ifdef CONFIG_NET_DSA_TAG_TRAILER
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index 77a49ff..95034a8 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -322,7 +322,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
 	}
 
 
-	skb->dev = vlan_dev_info(dev)->real_dev;
+	skb_set_dev(skb, vlan_dev_info(dev)->real_dev);
 	len = skb->len;
 	ret = dev_queue_xmit(skb);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2cba5c5..e80403a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1448,13 +1448,10 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 	if (skb->len > (dev->mtu + dev->hard_header_len))
 		return NET_RX_DROP;
 
-	skb_dst_drop(skb);
+	skb_set_dev(skb, dev);
 	skb->tstamp.tv64 = 0;
 	skb->pkt_type = PACKET_HOST;
 	skb->protocol = eth_type_trans(skb, dev);
-	skb->mark = 0;
-	secpath_reset(skb);
-	nf_reset(skb);
 	return netif_rx(skb);
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
@@ -1614,6 +1611,36 @@ static bool dev_can_checksum(struct net_device *dev, struct sk_buff *skb)
 	return false;
 }
 
+/**
+ * skb_dev_set -- assign a buffer to a new device
+ * @skb: buffer for the new device
+ * @dev: network device
+ *
+ * If an skb is owned by a device already, we have to reset
+ * all data private to the namespace a device belongs to
+ * before assigning it a new device.
+ */
+#ifdef CONFIG_NET_NS
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb_dst_drop(skb);
+	if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) {
+		secpath_reset(skb);
+		nf_reset(skb);
+		skb_init_secmark(skb);
+		skb->mark = 0;
+		skb->priority = 0;
+		skb->nf_trace = 0;
+		skb->ipvs_property = 0;
+#ifdef CONFIG_NET_SCHED
+		skb->tc_index = 0;
+#endif
+	}
+	skb->dev = dev;
+}
+EXPORT_SYMBOL(skb_set_dev);
+#endif /* CONFIG_NET_NS */
+
 /*
  * Invalidate hardware checksum when packet is to be mangled, and
  * complete checksum manually on outgoing path.
-- 
1.6.3.3



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 2/3] net/macvlan: allow multiple driver backends
  2010-01-27 10:04 ` [Bridge] " Arnd Bergmann
@ 2010-01-27 10:06   ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 10:06 UTC (permalink / raw)
  To: David Miller
  Cc: Stephen Hemminger, Patrick McHardy, Michael S. Tsirkin,
	Herbert Xu, Or Gerlitz, netdev, bridge, linux-kernel

This makes it possible to hook into the macvlan driver
from another kernel module. In particular, the goal is
to extend it with the macvtap backend that provides
a tun/tap compatible interface directly on the macvlan
device.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c      |  113 +++++++++++++++++++-------------------------
 include/linux/if_macvlan.h |   70 +++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 64 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index e0436fd..1517537 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -39,31 +39,6 @@ struct macvlan_port {
 	struct list_head	vlans;
 };
 
-/**
- *	struct macvlan_rx_stats - MACVLAN percpu rx stats
- *	@rx_packets: number of received packets
- *	@rx_bytes: number of received bytes
- *	@multicast: number of received multicast packets
- *	@rx_errors: number of errors
- */
-struct macvlan_rx_stats {
-	unsigned long rx_packets;
-	unsigned long rx_bytes;
-	unsigned long multicast;
-	unsigned long rx_errors;
-};
-
-struct macvlan_dev {
-	struct net_device	*dev;
-	struct list_head	list;
-	struct hlist_node	hlist;
-	struct macvlan_port	*port;
-	struct net_device	*lowerdev;
-	struct macvlan_rx_stats *rx_stats;
-	enum macvlan_mode	mode;
-};
-
-
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
 					       const unsigned char *addr)
 {
@@ -118,31 +93,17 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
 	return 0;
 }
 
-static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-				    unsigned int len, bool success,
-				    bool multicast)
-{
-	struct macvlan_rx_stats *rx_stats;
-
-	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
-	if (likely(success)) {
-		rx_stats->rx_packets++;;
-		rx_stats->rx_bytes += len;
-		if (multicast)
-			rx_stats->multicast++;
-	} else {
-		rx_stats->rx_errors++;
-	}
-}
 
-static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
+static int macvlan_broadcast_one(struct sk_buff *skb,
+				 const struct macvlan_dev *vlan,
 				 const struct ethhdr *eth, bool local)
 {
+	struct net_device *dev = vlan->dev;
 	if (!skb)
 		return NET_RX_DROP;
 
 	if (local)
-		return dev_forward_skb(dev, skb);
+		return vlan->forward(dev, skb);
 
 	skb->dev = dev;
 	if (!compare_ether_addr_64bits(eth->h_dest,
@@ -151,7 +112,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
 	else
 		skb->pkt_type = PACKET_MULTICAST;
 
-	return netif_receive_skb(skb);
+	return vlan->receive(skb);
 }
 
 static void macvlan_broadcast(struct sk_buff *skb,
@@ -175,7 +136,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
 				continue;
 
 			nskb = skb_clone(skb, GFP_ATOMIC);
-			err = macvlan_broadcast_one(nskb, vlan->dev, eth,
+			err = macvlan_broadcast_one(nskb, vlan, eth,
 					 mode == MACVLAN_MODE_BRIDGE);
 			macvlan_count_rx(vlan, skb->len + ETH_HLEN,
 					 err == NET_RX_SUCCESS, 1);
@@ -238,7 +199,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 	skb->dev = dev;
 	skb->pkt_type = PACKET_HOST;
 
-	netif_receive_skb(skb);
+	vlan->receive(skb);
 	return NULL;
 }
 
@@ -260,7 +221,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 		dest = macvlan_hash_lookup(port, eth->h_dest);
 		if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
 			unsigned int length = skb->len + ETH_HLEN;
-			int ret = dev_forward_skb(dest->dev, skb);
+			int ret = dest->forward(dest->dev, skb);
 			macvlan_count_rx(dest, length,
 					 ret == NET_RX_SUCCESS, 0);
 
@@ -273,8 +234,8 @@ xmit_world:
 	return dev_queue_xmit(skb);
 }
 
-static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
-				      struct net_device *dev)
+netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+			       struct net_device *dev)
 {
 	int i = skb_get_queue_mapping(skb);
 	struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
@@ -290,6 +251,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(macvlan_start_xmit);
 
 static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
 			       unsigned short type, const void *daddr,
@@ -623,8 +585,11 @@ static int macvlan_get_tx_queues(struct net *net,
 	return 0;
 }
 
-static int macvlan_newlink(struct net *src_net, struct net_device *dev,
-			   struct nlattr *tb[], struct nlattr *data[])
+int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[],
+			   int (*receive)(struct sk_buff *skb),
+			   int (*forward)(struct net_device *dev,
+					  struct sk_buff *skb))
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port;
@@ -664,6 +629,8 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	vlan->lowerdev = lowerdev;
 	vlan->dev      = dev;
 	vlan->port     = port;
+	vlan->receive  = receive;
+	vlan->forward  = forward;
 
 	vlan->mode     = MACVLAN_MODE_VEPA;
 	if (data && data[IFLA_MACVLAN_MODE])
@@ -677,8 +644,17 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	netif_stacked_transfer_operstate(lowerdev, dev);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_common_newlink);
 
-static void macvlan_dellink(struct net_device *dev, struct list_head *head)
+static int macvlan_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[])
+{
+	return macvlan_common_newlink(src_net, dev, tb, data,
+				      netif_receive_skb,
+				      dev_forward_skb);
+}
+
+void macvlan_dellink(struct net_device *dev, struct list_head *head)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port = vlan->port;
@@ -689,6 +665,7 @@ static void macvlan_dellink(struct net_device *dev, struct list_head *head)
 	if (list_empty(&port->vlans))
 		macvlan_port_destroy(port->dev);
 }
+EXPORT_SYMBOL_GPL(macvlan_dellink);
 
 static int macvlan_changelink(struct net_device *dev,
 		struct nlattr *tb[], struct nlattr *data[])
@@ -720,19 +697,27 @@ static const struct nla_policy macvlan_policy[IFLA_MACVLAN_MAX + 1] = {
 	[IFLA_MACVLAN_MODE] = { .type = NLA_U32 },
 };
 
-static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
+int macvlan_link_register(struct rtnl_link_ops *ops)
+{
+	/* common fields */
+	ops->priv_size		= sizeof(struct macvlan_dev);
+	ops->get_tx_queues	= macvlan_get_tx_queues;
+	ops->setup		= macvlan_setup;
+	ops->validate		= macvlan_validate;
+	ops->maxtype		= IFLA_MACVLAN_MAX;
+	ops->policy		= macvlan_policy;
+	ops->changelink		= macvlan_changelink;
+	ops->get_size		= macvlan_get_size;
+	ops->fill_info		= macvlan_fill_info;
+
+	return rtnl_link_register(ops);
+};
+EXPORT_SYMBOL_GPL(macvlan_link_register);
+
+static struct rtnl_link_ops macvlan_link_ops = {
 	.kind		= "macvlan",
-	.priv_size	= sizeof(struct macvlan_dev),
-	.get_tx_queues  = macvlan_get_tx_queues,
-	.setup		= macvlan_setup,
-	.validate	= macvlan_validate,
 	.newlink	= macvlan_newlink,
 	.dellink	= macvlan_dellink,
-	.maxtype	= IFLA_MACVLAN_MAX,
-	.policy		= macvlan_policy,
-	.changelink	= macvlan_changelink,
-	.get_size	= macvlan_get_size,
-	.fill_info	= macvlan_fill_info,
 };
 
 static int macvlan_device_event(struct notifier_block *unused,
@@ -761,7 +746,7 @@ static int macvlan_device_event(struct notifier_block *unused,
 		break;
 	case NETDEV_UNREGISTER:
 		list_for_each_entry_safe(vlan, next, &port->vlans, list)
-			macvlan_dellink(vlan->dev, NULL);
+			vlan->dev->rtnl_link_ops->dellink(vlan->dev, NULL);
 		break;
 	}
 	return NOTIFY_DONE;
@@ -778,7 +763,7 @@ static int __init macvlan_init_module(void)
 	register_netdevice_notifier(&macvlan_notifier_block);
 	macvlan_handle_frame_hook = macvlan_handle_frame;
 
-	err = rtnl_link_register(&macvlan_link_ops);
+	err = macvlan_link_register(&macvlan_link_ops);
 	if (err < 0)
 		goto err1;
 	return 0;
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 5f200ba..9a11544 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -1,6 +1,76 @@
 #ifndef _LINUX_IF_MACVLAN_H
 #define _LINUX_IF_MACVLAN_H
 
+#include <linux/if_link.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <net/netlink.h>
+
+struct macvlan_port;
+struct macvtap_queue;
+
+/**
+ *	struct macvlan_rx_stats - MACVLAN percpu rx stats
+ *	@rx_packets: number of received packets
+ *	@rx_bytes: number of received bytes
+ *	@multicast: number of received multicast packets
+ *	@rx_errors: number of errors
+ */
+struct macvlan_rx_stats {
+	unsigned long rx_packets;
+	unsigned long rx_bytes;
+	unsigned long multicast;
+	unsigned long rx_errors;
+};
+
+struct macvlan_dev {
+	struct net_device	*dev;
+	struct list_head	list;
+	struct hlist_node	hlist;
+	struct macvlan_port	*port;
+	struct net_device	*lowerdev;
+	struct macvlan_rx_stats *rx_stats;
+	enum macvlan_mode	mode;
+	int (*receive)(struct sk_buff *skb);
+	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+};
+
+static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
+				    unsigned int len, bool success,
+				    bool multicast)
+{
+	struct macvlan_rx_stats *rx_stats;
+
+	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
+	if (likely(success)) {
+		rx_stats->rx_packets++;;
+		rx_stats->rx_bytes += len;
+		if (multicast)
+			rx_stats->multicast++;
+	} else {
+		rx_stats->rx_errors++;
+	}
+}
+
+extern int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+				  struct nlattr *tb[], struct nlattr *data[],
+				  int (*receive)(struct sk_buff *skb),
+				  int (*forward)(struct net_device *dev,
+						 struct sk_buff *skb));
+
+extern void macvlan_count_rx(const struct macvlan_dev *vlan,
+			     unsigned int len, bool success,
+			     bool multicast);
+
+extern void macvlan_dellink(struct net_device *dev, struct list_head *head);
+
+extern int macvlan_link_register(struct rtnl_link_ops *ops);
+
+extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+				      struct net_device *dev);
+
+
 extern struct sk_buff *(*macvlan_handle_frame_hook)(struct sk_buff *);
 
 #endif /* _LINUX_IF_MACVLAN_H */
-- 
1.6.3.3




^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Bridge] [PATCH 2/3] net/macvlan: allow multiple driver backends
@ 2010-01-27 10:06   ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 10:06 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel, Or Gerlitz

This makes it possible to hook into the macvlan driver
from another kernel module. In particular, the goal is
to extend it with the macvtap backend that provides
a tun/tap compatible interface directly on the macvlan
device.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c      |  113 +++++++++++++++++++-------------------------
 include/linux/if_macvlan.h |   70 +++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 64 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index e0436fd..1517537 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -39,31 +39,6 @@ struct macvlan_port {
 	struct list_head	vlans;
 };
 
-/**
- *	struct macvlan_rx_stats - MACVLAN percpu rx stats
- *	@rx_packets: number of received packets
- *	@rx_bytes: number of received bytes
- *	@multicast: number of received multicast packets
- *	@rx_errors: number of errors
- */
-struct macvlan_rx_stats {
-	unsigned long rx_packets;
-	unsigned long rx_bytes;
-	unsigned long multicast;
-	unsigned long rx_errors;
-};
-
-struct macvlan_dev {
-	struct net_device	*dev;
-	struct list_head	list;
-	struct hlist_node	hlist;
-	struct macvlan_port	*port;
-	struct net_device	*lowerdev;
-	struct macvlan_rx_stats *rx_stats;
-	enum macvlan_mode	mode;
-};
-
-
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
 					       const unsigned char *addr)
 {
@@ -118,31 +93,17 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
 	return 0;
 }
 
-static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-				    unsigned int len, bool success,
-				    bool multicast)
-{
-	struct macvlan_rx_stats *rx_stats;
-
-	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
-	if (likely(success)) {
-		rx_stats->rx_packets++;;
-		rx_stats->rx_bytes += len;
-		if (multicast)
-			rx_stats->multicast++;
-	} else {
-		rx_stats->rx_errors++;
-	}
-}
 
-static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
+static int macvlan_broadcast_one(struct sk_buff *skb,
+				 const struct macvlan_dev *vlan,
 				 const struct ethhdr *eth, bool local)
 {
+	struct net_device *dev = vlan->dev;
 	if (!skb)
 		return NET_RX_DROP;
 
 	if (local)
-		return dev_forward_skb(dev, skb);
+		return vlan->forward(dev, skb);
 
 	skb->dev = dev;
 	if (!compare_ether_addr_64bits(eth->h_dest,
@@ -151,7 +112,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
 	else
 		skb->pkt_type = PACKET_MULTICAST;
 
-	return netif_receive_skb(skb);
+	return vlan->receive(skb);
 }
 
 static void macvlan_broadcast(struct sk_buff *skb,
@@ -175,7 +136,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
 				continue;
 
 			nskb = skb_clone(skb, GFP_ATOMIC);
-			err = macvlan_broadcast_one(nskb, vlan->dev, eth,
+			err = macvlan_broadcast_one(nskb, vlan, eth,
 					 mode == MACVLAN_MODE_BRIDGE);
 			macvlan_count_rx(vlan, skb->len + ETH_HLEN,
 					 err == NET_RX_SUCCESS, 1);
@@ -238,7 +199,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 	skb->dev = dev;
 	skb->pkt_type = PACKET_HOST;
 
-	netif_receive_skb(skb);
+	vlan->receive(skb);
 	return NULL;
 }
 
@@ -260,7 +221,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 		dest = macvlan_hash_lookup(port, eth->h_dest);
 		if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
 			unsigned int length = skb->len + ETH_HLEN;
-			int ret = dev_forward_skb(dest->dev, skb);
+			int ret = dest->forward(dest->dev, skb);
 			macvlan_count_rx(dest, length,
 					 ret == NET_RX_SUCCESS, 0);
 
@@ -273,8 +234,8 @@ xmit_world:
 	return dev_queue_xmit(skb);
 }
 
-static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
-				      struct net_device *dev)
+netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+			       struct net_device *dev)
 {
 	int i = skb_get_queue_mapping(skb);
 	struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
@@ -290,6 +251,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(macvlan_start_xmit);
 
 static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
 			       unsigned short type, const void *daddr,
@@ -623,8 +585,11 @@ static int macvlan_get_tx_queues(struct net *net,
 	return 0;
 }
 
-static int macvlan_newlink(struct net *src_net, struct net_device *dev,
-			   struct nlattr *tb[], struct nlattr *data[])
+int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[],
+			   int (*receive)(struct sk_buff *skb),
+			   int (*forward)(struct net_device *dev,
+					  struct sk_buff *skb))
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port;
@@ -664,6 +629,8 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	vlan->lowerdev = lowerdev;
 	vlan->dev      = dev;
 	vlan->port     = port;
+	vlan->receive  = receive;
+	vlan->forward  = forward;
 
 	vlan->mode     = MACVLAN_MODE_VEPA;
 	if (data && data[IFLA_MACVLAN_MODE])
@@ -677,8 +644,17 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	netif_stacked_transfer_operstate(lowerdev, dev);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_common_newlink);
 
-static void macvlan_dellink(struct net_device *dev, struct list_head *head)
+static int macvlan_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[])
+{
+	return macvlan_common_newlink(src_net, dev, tb, data,
+				      netif_receive_skb,
+				      dev_forward_skb);
+}
+
+void macvlan_dellink(struct net_device *dev, struct list_head *head)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port = vlan->port;
@@ -689,6 +665,7 @@ static void macvlan_dellink(struct net_device *dev, struct list_head *head)
 	if (list_empty(&port->vlans))
 		macvlan_port_destroy(port->dev);
 }
+EXPORT_SYMBOL_GPL(macvlan_dellink);
 
 static int macvlan_changelink(struct net_device *dev,
 		struct nlattr *tb[], struct nlattr *data[])
@@ -720,19 +697,27 @@ static const struct nla_policy macvlan_policy[IFLA_MACVLAN_MAX + 1] = {
 	[IFLA_MACVLAN_MODE] = { .type = NLA_U32 },
 };
 
-static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
+int macvlan_link_register(struct rtnl_link_ops *ops)
+{
+	/* common fields */
+	ops->priv_size		= sizeof(struct macvlan_dev);
+	ops->get_tx_queues	= macvlan_get_tx_queues;
+	ops->setup		= macvlan_setup;
+	ops->validate		= macvlan_validate;
+	ops->maxtype		= IFLA_MACVLAN_MAX;
+	ops->policy		= macvlan_policy;
+	ops->changelink		= macvlan_changelink;
+	ops->get_size		= macvlan_get_size;
+	ops->fill_info		= macvlan_fill_info;
+
+	return rtnl_link_register(ops);
+};
+EXPORT_SYMBOL_GPL(macvlan_link_register);
+
+static struct rtnl_link_ops macvlan_link_ops = {
 	.kind		= "macvlan",
-	.priv_size	= sizeof(struct macvlan_dev),
-	.get_tx_queues  = macvlan_get_tx_queues,
-	.setup		= macvlan_setup,
-	.validate	= macvlan_validate,
 	.newlink	= macvlan_newlink,
 	.dellink	= macvlan_dellink,
-	.maxtype	= IFLA_MACVLAN_MAX,
-	.policy		= macvlan_policy,
-	.changelink	= macvlan_changelink,
-	.get_size	= macvlan_get_size,
-	.fill_info	= macvlan_fill_info,
 };
 
 static int macvlan_device_event(struct notifier_block *unused,
@@ -761,7 +746,7 @@ static int macvlan_device_event(struct notifier_block *unused,
 		break;
 	case NETDEV_UNREGISTER:
 		list_for_each_entry_safe(vlan, next, &port->vlans, list)
-			macvlan_dellink(vlan->dev, NULL);
+			vlan->dev->rtnl_link_ops->dellink(vlan->dev, NULL);
 		break;
 	}
 	return NOTIFY_DONE;
@@ -778,7 +763,7 @@ static int __init macvlan_init_module(void)
 	register_netdevice_notifier(&macvlan_notifier_block);
 	macvlan_handle_frame_hook = macvlan_handle_frame;
 
-	err = rtnl_link_register(&macvlan_link_ops);
+	err = macvlan_link_register(&macvlan_link_ops);
 	if (err < 0)
 		goto err1;
 	return 0;
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 5f200ba..9a11544 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -1,6 +1,76 @@
 #ifndef _LINUX_IF_MACVLAN_H
 #define _LINUX_IF_MACVLAN_H
 
+#include <linux/if_link.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <net/netlink.h>
+
+struct macvlan_port;
+struct macvtap_queue;
+
+/**
+ *	struct macvlan_rx_stats - MACVLAN percpu rx stats
+ *	@rx_packets: number of received packets
+ *	@rx_bytes: number of received bytes
+ *	@multicast: number of received multicast packets
+ *	@rx_errors: number of errors
+ */
+struct macvlan_rx_stats {
+	unsigned long rx_packets;
+	unsigned long rx_bytes;
+	unsigned long multicast;
+	unsigned long rx_errors;
+};
+
+struct macvlan_dev {
+	struct net_device	*dev;
+	struct list_head	list;
+	struct hlist_node	hlist;
+	struct macvlan_port	*port;
+	struct net_device	*lowerdev;
+	struct macvlan_rx_stats *rx_stats;
+	enum macvlan_mode	mode;
+	int (*receive)(struct sk_buff *skb);
+	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+};
+
+static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
+				    unsigned int len, bool success,
+				    bool multicast)
+{
+	struct macvlan_rx_stats *rx_stats;
+
+	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
+	if (likely(success)) {
+		rx_stats->rx_packets++;;
+		rx_stats->rx_bytes += len;
+		if (multicast)
+			rx_stats->multicast++;
+	} else {
+		rx_stats->rx_errors++;
+	}
+}
+
+extern int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+				  struct nlattr *tb[], struct nlattr *data[],
+				  int (*receive)(struct sk_buff *skb),
+				  int (*forward)(struct net_device *dev,
+						 struct sk_buff *skb));
+
+extern void macvlan_count_rx(const struct macvlan_dev *vlan,
+			     unsigned int len, bool success,
+			     bool multicast);
+
+extern void macvlan_dellink(struct net_device *dev, struct list_head *head);
+
+extern int macvlan_link_register(struct rtnl_link_ops *ops);
+
+extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+				      struct net_device *dev);
+
+
 extern struct sk_buff *(*macvlan_handle_frame_hook)(struct sk_buff *);
 
 #endif /* _LINUX_IF_MACVLAN_H */
-- 
1.6.3.3




^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 3/3] net: macvtap driver
  2010-01-27 10:04 ` [Bridge] " Arnd Bergmann
@ 2010-01-27 21:09   ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 21:09 UTC (permalink / raw)
  To: David Miller
  Cc: Stephen Hemminger, Patrick McHardy, Michael S. Tsirkin,
	Herbert Xu, Or Gerlitz, netdev, bridge, linux-kernel

In order to use macvlan with qemu and other tools that require
a tap file descriptor, the macvtap driver adds a small backend
with a character device with the same interface as the tun
driver, with a minimum set of features.

Macvtap interfaces are created in the same way as macvlan
interfaces using ip link, but the netif is just used as a
handle for configuration and accounting, while the data
goes through the chardev. Each macvtap interface has its
own character device, simplifying permission management
significantly over the generic tun/tap driver.

Cc: Patrick McHardy <kaber@trash.net>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz@voltaire.com>
Cc: netdev@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/Kconfig        |   12 +
 drivers/net/Makefile       |    1 +
 drivers/net/macvtap.c      |  572 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/if_macvlan.h |    1 +
 4 files changed, 586 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/macvtap.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index cb0e534..411e207 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,18 @@ config MACVLAN
 	  To compile this driver as a module, choose M here: the module
 	  will be called macvlan.
 
+config MACVTAP
+	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
+	depends on MACVLAN
+	help
+	  This adds a specialized tap character device driver that is based
+	  on the MAC-VLAN network interface, called macvtap. A macvtap device
+	  can be added in the same way as a macvlan device, using 'type
+	  macvlan', and then be accessed through the tap user space interface.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called macvtap.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 0b763cb..9595803 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -169,6 +169,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
 obj-$(CONFIG_DUMMY) += dummy.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
+obj-$(CONFIG_MACVTAP) += macvtap.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
 obj-$(CONFIG_LANCE) += lance.o
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..2916202
--- /dev/null
+++ b/drivers/net/macvtap.c
@@ -0,0 +1,572 @@
+#include <linux/etherdevice.h>
+#include <linux/if_macvlan.h>
+#include <linux/interrupt.h>
+#include <linux/nsproxy.h>
+#include <linux/compat.h>
+#include <linux/if_tun.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/cache.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+
+#include <net/net_namespace.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+/*
+ * A macvtap queue is the central object of this driver, it connects
+ * an open character device to a macvlan interface. There can be
+ * multiple queues on one interface, which map back to queues
+ * implemented in hardware on the underlying device.
+ *
+ * macvtap_proto is used to allocate queues through the sock allocation
+ * mechanism.
+ *
+ * TODO: multiqueue support is currently not implemented, even though
+ * macvtap is basically prepared for that. We will need to add this
+ * here as well as in virtio-net and qemu to get line rate on 10gbit
+ * adapters from a guest.
+ */
+struct macvtap_queue {
+	struct sock sk;
+	struct socket sock;
+	struct macvlan_dev *vlan;
+	struct file *file;
+};
+
+static struct proto macvtap_proto = {
+	.name = "macvtap",
+	.owner = THIS_MODULE,
+	.obj_size = sizeof (struct macvtap_queue),
+};
+
+/*
+ * Minor number matches netdev->ifindex, so need a potentially
+ * large value. This also makes it possible to split the
+ * tap functionality out again in the future by offering it
+ * from other drivers besides macvtap. As long as every device
+ * only has one tap, the interface numbers assure that the
+ * device nodes are unique.
+ */
+static unsigned int macvtap_major;
+#define MACVTAP_NUM_DEVS 65536
+static struct class *macvtap_class;
+static struct cdev macvtap_cdev;
+
+/*
+ * RCU usage:
+ * The macvtap_queue is referenced both from the chardev struct file
+ * and from the struct macvlan_dev using rcu_read_lock.
+ *
+ * We never actually update the contents of a macvtap_queue atomically
+ * with RCU but it is used for race-free destruction of a queue when
+ * either the file or the macvlan_dev goes away. Pointers back to
+ * the dev and the file are implicitly valid as long as the queue
+ * exists.
+ *
+ * The callbacks from macvlan are always done with rcu_read_lock held
+ * already, while in the file_operations, we get it ourselves.
+ *
+ * When destroying a queue, we remove the pointers from the file and
+ * from the dev and then synchronize_rcu to make sure no thread is
+ * still using the queue. There may still be references to the struct
+ * sock inside of the queue from outbound SKBs, but these never
+ * reference back to the file or the dev. The data structure is freed
+ * through __sk_free when both our references and any pending SKBs
+ * are gone.
+ *
+ * macvtap_lock is only used to prevent multiple concurrent open()
+ * calls to assign a new vlan->tap pointer. It could be moved into
+ * the macvlan_dev itself but is extremely rarely used.
+ */
+static DEFINE_SPINLOCK(macvtap_lock);
+
+/*
+ * Choose the next free queue, for now there is only one
+ */
+static int macvtap_set_queue(struct net_device *dev, struct file *file,
+				struct macvtap_queue *q)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	int err = -EBUSY;
+
+	spin_lock(&macvtap_lock);
+	if (rcu_dereference(vlan->tap))
+		goto out;
+
+	err = 0;
+	q->vlan = vlan;
+	rcu_assign_pointer(vlan->tap, q);
+
+	q->file = file;
+	rcu_assign_pointer(file->private_data, q);
+
+out:
+	spin_unlock(&macvtap_lock);
+	return err;
+}
+
+/*
+ * We must destroy each queue exactly once, when either
+ * the netdev or the file go away.
+ *
+ * Using the spinlock makes sure that we don't get
+ * to the queue again after destroying it.
+ *
+ * synchronize_rcu serializes with the packet flow
+ * that uses rcu_read_lock.
+ */
+static void macvtap_del_queue(struct macvtap_queue **qp)
+{
+	struct macvtap_queue *q;
+
+	spin_lock(&macvtap_lock);
+	q = rcu_dereference(*qp);
+	if (!q) {
+		spin_unlock(&macvtap_lock);
+		return;
+	}
+
+	rcu_assign_pointer(q->vlan->tap, NULL);
+	rcu_assign_pointer(q->file->private_data, NULL);
+	spin_unlock(&macvtap_lock);
+
+	synchronize_rcu();
+	sock_put(&q->sk);
+}
+
+/*
+ * Since we only support one queue, just dereference the pointer.
+ */
+static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
+					       struct sk_buff *skb)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+
+	return rcu_dereference(vlan->tap);
+}
+
+static void macvtap_del_queues(struct net_device *dev)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	macvtap_del_queue(&vlan->tap);
+}
+
+static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
+{
+	rcu_read_lock_bh();
+	return rcu_dereference(file->private_data);
+}
+
+static inline void macvtap_file_put_queue(void)
+{
+	rcu_read_unlock_bh();
+}
+
+/*
+ * Forward happens for data that gets sent from one macvlan
+ * endpoint to another one in bridge mode. We just take
+ * the skb and put it into the receive queue.
+ */
+static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
+{
+	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
+	if (!q)
+		return -ENOLINK;
+
+	skb_queue_tail(&q->sk.sk_receive_queue, skb);
+	wake_up(q->sk.sk_sleep);
+	return 0;
+}
+
+/*
+ * Receive is for data from the external interface (lowerdev),
+ * in case of macvtap, we can treat that the same way as
+ * forward, which macvlan cannot.
+ */
+static int macvtap_receive(struct sk_buff *skb)
+{
+	skb_push(skb, ETH_HLEN);
+	return macvtap_forward(skb->dev, skb);
+}
+
+static int macvtap_newlink(struct net *src_net,
+			   struct net_device *dev,
+			   struct nlattr *tb[],
+			   struct nlattr *data[])
+{
+	struct device *classdev;
+	dev_t devt;
+	int err;
+
+	err = macvlan_common_newlink(src_net, dev, tb, data,
+				     macvtap_receive, macvtap_forward);
+	if (err)
+		goto out;
+
+	devt = MKDEV(MAJOR(macvtap_major), dev->ifindex);
+
+	classdev = device_create(macvtap_class, &dev->dev, devt,
+				 dev, "tap%d", dev->ifindex);
+	if (IS_ERR(classdev)) {
+		err = PTR_ERR(classdev);
+		macvtap_del_queues(dev);
+	}
+
+out:
+	return err;
+}
+
+static void macvtap_dellink(struct net_device *dev,
+			    struct list_head *head)
+{
+	device_destroy(macvtap_class,
+		       MKDEV(MAJOR(macvtap_major), dev->ifindex));
+
+	macvtap_del_queues(dev);
+	macvlan_dellink(dev, head);
+}
+
+static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
+	.kind		= "macvtap",
+	.newlink	= macvtap_newlink,
+	.dellink	= macvtap_dellink,
+};
+
+
+static void macvtap_sock_write_space(struct sock *sk)
+{
+	if (!sock_writeable(sk) ||
+	    !test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
+		return;
+
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		wake_up_interruptible_sync(sk->sk_sleep);
+}
+
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct net_device *dev = dev_get_by_index(net, iminor(inode));
+	struct macvtap_queue *q;
+	int err;
+
+	err = -ENODEV;
+	if (!dev)
+		goto out;
+
+	/* check if this is a macvtap device */
+	err = -EINVAL;
+	if (dev->rtnl_link_ops != &macvtap_link_ops)
+		goto out;
+
+	err = -ENOMEM;
+	q = (struct macvtap_queue *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
+					     &macvtap_proto);
+	if (!q)
+		goto out;
+
+	init_waitqueue_head(&q->sock.wait);
+	q->sock.type = SOCK_RAW;
+	q->sock.state = SS_CONNECTED;
+	sock_init_data(&q->sock, &q->sk);
+	q->sk.sk_allocation = GFP_ATOMIC; /* for now */
+	q->sk.sk_write_space = macvtap_sock_write_space;
+
+	err = macvtap_set_queue(dev, file, q);
+	if (err)
+		sock_put(&q->sk);
+
+out:
+	return err;
+}
+
+static int macvtap_release(struct inode *inode, struct file *file)
+{
+	macvtap_del_queue((struct macvtap_queue **)&file->private_data);
+	return 0;
+}
+
+static unsigned int macvtap_poll(struct file *file, poll_table * wait)
+{
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	unsigned int mask = POLLERR;
+
+	if (!q)
+		goto out;
+
+	mask = 0;
+	poll_wait(file, &q->sock.wait, wait);
+
+	if (!skb_queue_empty(&q->sk.sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(&q->sk) ||
+	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &q->sock.flags) &&
+	     sock_writeable(&q->sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+out:
+	macvtap_file_put_queue();
+	return mask;
+}
+
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_queue *q,
+				const struct iovec *iv, size_t count,
+				int noblock)
+{
+	struct sk_buff *skb;
+	size_t len = count;
+	int err;
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
+
+	if (!skb) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		return err;
+	}
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, count);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+
+	skb_set_network_header(skb, ETH_HLEN);
+
+	macvlan_start_xmit(skb, q->vlan->dev);
+
+	return count;
+}
+
+static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
+				 unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t result = -ENOLINK;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	if (!q)
+		goto out;
+
+	result = macvtap_get_user(q, iv, iov_length(iv, count),
+			      file->f_flags & O_NONBLOCK);
+out:
+	macvtap_file_put_queue();
+	return result;
+}
+
+/* Put packet to the user space buffer */
+static ssize_t macvtap_put_user(struct macvtap_queue *q,
+				const struct sk_buff *skb,
+				const struct iovec *iv, int len)
+{
+	struct macvlan_dev *vlan = q->vlan;
+	int ret;
+
+	len = min_t(int, skb->len, len);
+
+	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
+
+	macvlan_count_rx(vlan, len, ret == 0, 0);
+
+	return ret ? ret : len;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	DECLARE_WAITQUEUE(wait, current);
+	struct sk_buff *skb;
+	ssize_t len, ret = 0;
+
+	if (!q) {
+		ret = -ENOLINK;
+		goto out;
+	}
+
+	len = iov_length(iv, count);
+	if (len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	add_wait_queue(q->sk.sk_sleep, &wait);
+	while (len) {
+		current->state = TASK_INTERRUPTIBLE;
+
+		/* Read frames from the queue */
+		skb = skb_dequeue(&q->sk.sk_receive_queue);
+		if (!skb) {
+			if (file->f_flags & O_NONBLOCK) {
+				ret = -EAGAIN;
+				break;
+			}
+			if (signal_pending(current)) {
+				ret = -ERESTARTSYS;
+				break;
+			}
+			/* Nothing to read, let's sleep */
+			schedule();
+			continue;
+		}
+		ret = macvtap_put_user(q, skb, iv, len);
+		kfree_skb(skb);
+		break;
+	}
+
+	current->state = TASK_RUNNING;
+	remove_wait_queue(q->sk.sk_sleep, &wait);
+
+out:
+	macvtap_file_put_queue();
+	return ret;
+}
+
+/*
+ * provide compatibility with generic tun/tap interface
+ */
+static long macvtap_ioctl(struct file *file, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct macvtap_queue *q;
+	void __user *argp = (void __user *)arg;
+	struct ifreq __user *ifr = argp;
+	unsigned int __user *up = argp;
+	unsigned int u;
+	int err;
+
+	switch (cmd) {
+	case TUNSETIFF:
+		/* ignore the name, just look at flags */
+		if (get_user(u, &ifr->ifr_flags))
+			return -EFAULT;
+		if (u != (IFF_TAP | IFF_NO_PI))
+			return -EINVAL;
+		return 0;
+
+	case TUNGETIFF:
+		q = macvtap_file_get_queue(file);
+		if (!q)
+			return -ENOLINK;
+		err = 0;
+		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
+		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
+			err = -EFAULT;
+		macvtap_file_put_queue();
+		return err;
+
+	case TUNGETFEATURES:
+		if (put_user((IFF_TAP | IFF_NO_PI), up))
+			return -EFAULT;
+		return 0;
+
+	case TUNSETSNDBUF:
+		/* ignore */
+		return 0;
+
+	case TUNSETOFFLOAD:
+		/* let the user check for future flags */
+		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			  TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		/* TODO: add support for these, so far we don't
+			 support any offload */
+		if (arg & (TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			 TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		return 0;
+
+	default:
+		return -EINVAL;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static long macvtap_compat_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	return macvtap_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations macvtap_fops = {
+	.owner		= THIS_MODULE,
+	.open		= macvtap_open,
+	.release	= macvtap_release,
+	.aio_read	= macvtap_aio_read,
+	.aio_write	= macvtap_aio_write,
+	.poll		= macvtap_poll,
+	.llseek		= no_llseek,
+	.unlocked_ioctl	= macvtap_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= macvtap_compat_ioctl,
+#endif
+};
+
+static int macvtap_init(void)
+{
+	int err;
+
+	err = alloc_chrdev_region(&macvtap_major, 0,
+				MACVTAP_NUM_DEVS, "macvtap");
+	if (err)
+		goto out1;
+
+	cdev_init(&macvtap_cdev, &macvtap_fops);
+	err = cdev_add(&macvtap_cdev, macvtap_major, MACVTAP_NUM_DEVS);
+	if (err)
+		goto out2;
+
+	macvtap_class = class_create(THIS_MODULE, "macvtap");
+	if (IS_ERR(macvtap_class)) {
+		err = PTR_ERR(macvtap_class);
+		goto out3;
+	}
+
+	err = macvlan_link_register(&macvtap_link_ops);
+	if (err)
+		goto out4;
+
+	return 0;
+
+out4:
+	class_unregister(macvtap_class);
+out3:
+	cdev_del(&macvtap_cdev);
+out2:
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+out1:
+	return err;
+}
+module_init(macvtap_init);
+
+static void macvtap_exit(void)
+{
+	rtnl_link_unregister(&macvtap_link_ops);
+	class_unregister(macvtap_class);
+	cdev_del(&macvtap_cdev);
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+}
+module_exit(macvtap_exit);
+
+MODULE_ALIAS_RTNL_LINK("macvtap");
+MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9a11544..51f1512 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -34,6 +34,7 @@ struct macvlan_dev {
 	enum macvlan_mode	mode;
 	int (*receive)(struct sk_buff *skb);
 	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+	struct macvtap_queue	*tap;
 };
 
 static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Bridge] [PATCH 3/3] net: macvtap driver
@ 2010-01-27 21:09   ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 21:09 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel, Or Gerlitz

In order to use macvlan with qemu and other tools that require
a tap file descriptor, the macvtap driver adds a small backend
with a character device with the same interface as the tun
driver, with a minimum set of features.

Macvtap interfaces are created in the same way as macvlan
interfaces using ip link, but the netif is just used as a
handle for configuration and accounting, while the data
goes through the chardev. Each macvtap interface has its
own character device, simplifying permission management
significantly over the generic tun/tap driver.

Cc: Patrick McHardy <kaber@trash.net>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz@voltaire.com>
Cc: netdev@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/Kconfig        |   12 +
 drivers/net/Makefile       |    1 +
 drivers/net/macvtap.c      |  572 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/if_macvlan.h |    1 +
 4 files changed, 586 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/macvtap.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index cb0e534..411e207 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,18 @@ config MACVLAN
 	  To compile this driver as a module, choose M here: the module
 	  will be called macvlan.
 
+config MACVTAP
+	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
+	depends on MACVLAN
+	help
+	  This adds a specialized tap character device driver that is based
+	  on the MAC-VLAN network interface, called macvtap. A macvtap device
+	  can be added in the same way as a macvlan device, using 'type
+	  macvlan', and then be accessed through the tap user space interface.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called macvtap.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 0b763cb..9595803 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -169,6 +169,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
 obj-$(CONFIG_DUMMY) += dummy.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
+obj-$(CONFIG_MACVTAP) += macvtap.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
 obj-$(CONFIG_LANCE) += lance.o
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..2916202
--- /dev/null
+++ b/drivers/net/macvtap.c
@@ -0,0 +1,572 @@
+#include <linux/etherdevice.h>
+#include <linux/if_macvlan.h>
+#include <linux/interrupt.h>
+#include <linux/nsproxy.h>
+#include <linux/compat.h>
+#include <linux/if_tun.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/cache.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+
+#include <net/net_namespace.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+/*
+ * A macvtap queue is the central object of this driver, it connects
+ * an open character device to a macvlan interface. There can be
+ * multiple queues on one interface, which map back to queues
+ * implemented in hardware on the underlying device.
+ *
+ * macvtap_proto is used to allocate queues through the sock allocation
+ * mechanism.
+ *
+ * TODO: multiqueue support is currently not implemented, even though
+ * macvtap is basically prepared for that. We will need to add this
+ * here as well as in virtio-net and qemu to get line rate on 10gbit
+ * adapters from a guest.
+ */
+struct macvtap_queue {
+	struct sock sk;
+	struct socket sock;
+	struct macvlan_dev *vlan;
+	struct file *file;
+};
+
+static struct proto macvtap_proto = {
+	.name = "macvtap",
+	.owner = THIS_MODULE,
+	.obj_size = sizeof (struct macvtap_queue),
+};
+
+/*
+ * Minor number matches netdev->ifindex, so need a potentially
+ * large value. This also makes it possible to split the
+ * tap functionality out again in the future by offering it
+ * from other drivers besides macvtap. As long as every device
+ * only has one tap, the interface numbers assure that the
+ * device nodes are unique.
+ */
+static unsigned int macvtap_major;
+#define MACVTAP_NUM_DEVS 65536
+static struct class *macvtap_class;
+static struct cdev macvtap_cdev;
+
+/*
+ * RCU usage:
+ * The macvtap_queue is referenced both from the chardev struct file
+ * and from the struct macvlan_dev using rcu_read_lock.
+ *
+ * We never actually update the contents of a macvtap_queue atomically
+ * with RCU but it is used for race-free destruction of a queue when
+ * either the file or the macvlan_dev goes away. Pointers back to
+ * the dev and the file are implicitly valid as long as the queue
+ * exists.
+ *
+ * The callbacks from macvlan are always done with rcu_read_lock held
+ * already, while in the file_operations, we get it ourselves.
+ *
+ * When destroying a queue, we remove the pointers from the file and
+ * from the dev and then synchronize_rcu to make sure no thread is
+ * still using the queue. There may still be references to the struct
+ * sock inside of the queue from outbound SKBs, but these never
+ * reference back to the file or the dev. The data structure is freed
+ * through __sk_free when both our references and any pending SKBs
+ * are gone.
+ *
+ * macvtap_lock is only used to prevent multiple concurrent open()
+ * calls to assign a new vlan->tap pointer. It could be moved into
+ * the macvlan_dev itself but is extremely rarely used.
+ */
+static DEFINE_SPINLOCK(macvtap_lock);
+
+/*
+ * Choose the next free queue, for now there is only one
+ */
+static int macvtap_set_queue(struct net_device *dev, struct file *file,
+				struct macvtap_queue *q)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	int err = -EBUSY;
+
+	spin_lock(&macvtap_lock);
+	if (rcu_dereference(vlan->tap))
+		goto out;
+
+	err = 0;
+	q->vlan = vlan;
+	rcu_assign_pointer(vlan->tap, q);
+
+	q->file = file;
+	rcu_assign_pointer(file->private_data, q);
+
+out:
+	spin_unlock(&macvtap_lock);
+	return err;
+}
+
+/*
+ * We must destroy each queue exactly once, when either
+ * the netdev or the file go away.
+ *
+ * Using the spinlock makes sure that we don't get
+ * to the queue again after destroying it.
+ *
+ * synchronize_rcu serializes with the packet flow
+ * that uses rcu_read_lock.
+ */
+static void macvtap_del_queue(struct macvtap_queue **qp)
+{
+	struct macvtap_queue *q;
+
+	spin_lock(&macvtap_lock);
+	q = rcu_dereference(*qp);
+	if (!q) {
+		spin_unlock(&macvtap_lock);
+		return;
+	}
+
+	rcu_assign_pointer(q->vlan->tap, NULL);
+	rcu_assign_pointer(q->file->private_data, NULL);
+	spin_unlock(&macvtap_lock);
+
+	synchronize_rcu();
+	sock_put(&q->sk);
+}
+
+/*
+ * Since we only support one queue, just dereference the pointer.
+ */
+static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
+					       struct sk_buff *skb)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+
+	return rcu_dereference(vlan->tap);
+}
+
+static void macvtap_del_queues(struct net_device *dev)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	macvtap_del_queue(&vlan->tap);
+}
+
+static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
+{
+	rcu_read_lock_bh();
+	return rcu_dereference(file->private_data);
+}
+
+static inline void macvtap_file_put_queue(void)
+{
+	rcu_read_unlock_bh();
+}
+
+/*
+ * Forward happens for data that gets sent from one macvlan
+ * endpoint to another one in bridge mode. We just take
+ * the skb and put it into the receive queue.
+ */
+static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
+{
+	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
+	if (!q)
+		return -ENOLINK;
+
+	skb_queue_tail(&q->sk.sk_receive_queue, skb);
+	wake_up(q->sk.sk_sleep);
+	return 0;
+}
+
+/*
+ * Receive is for data from the external interface (lowerdev),
+ * in case of macvtap, we can treat that the same way as
+ * forward, which macvlan cannot.
+ */
+static int macvtap_receive(struct sk_buff *skb)
+{
+	skb_push(skb, ETH_HLEN);
+	return macvtap_forward(skb->dev, skb);
+}
+
+static int macvtap_newlink(struct net *src_net,
+			   struct net_device *dev,
+			   struct nlattr *tb[],
+			   struct nlattr *data[])
+{
+	struct device *classdev;
+	dev_t devt;
+	int err;
+
+	err = macvlan_common_newlink(src_net, dev, tb, data,
+				     macvtap_receive, macvtap_forward);
+	if (err)
+		goto out;
+
+	devt = MKDEV(MAJOR(macvtap_major), dev->ifindex);
+
+	classdev = device_create(macvtap_class, &dev->dev, devt,
+				 dev, "tap%d", dev->ifindex);
+	if (IS_ERR(classdev)) {
+		err = PTR_ERR(classdev);
+		macvtap_del_queues(dev);
+	}
+
+out:
+	return err;
+}
+
+static void macvtap_dellink(struct net_device *dev,
+			    struct list_head *head)
+{
+	device_destroy(macvtap_class,
+		       MKDEV(MAJOR(macvtap_major), dev->ifindex));
+
+	macvtap_del_queues(dev);
+	macvlan_dellink(dev, head);
+}
+
+static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
+	.kind		= "macvtap",
+	.newlink	= macvtap_newlink,
+	.dellink	= macvtap_dellink,
+};
+
+
+static void macvtap_sock_write_space(struct sock *sk)
+{
+	if (!sock_writeable(sk) ||
+	    !test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
+		return;
+
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		wake_up_interruptible_sync(sk->sk_sleep);
+}
+
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct net_device *dev = dev_get_by_index(net, iminor(inode));
+	struct macvtap_queue *q;
+	int err;
+
+	err = -ENODEV;
+	if (!dev)
+		goto out;
+
+	/* check if this is a macvtap device */
+	err = -EINVAL;
+	if (dev->rtnl_link_ops != &macvtap_link_ops)
+		goto out;
+
+	err = -ENOMEM;
+	q = (struct macvtap_queue *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
+					     &macvtap_proto);
+	if (!q)
+		goto out;
+
+	init_waitqueue_head(&q->sock.wait);
+	q->sock.type = SOCK_RAW;
+	q->sock.state = SS_CONNECTED;
+	sock_init_data(&q->sock, &q->sk);
+	q->sk.sk_allocation = GFP_ATOMIC; /* for now */
+	q->sk.sk_write_space = macvtap_sock_write_space;
+
+	err = macvtap_set_queue(dev, file, q);
+	if (err)
+		sock_put(&q->sk);
+
+out:
+	return err;
+}
+
+static int macvtap_release(struct inode *inode, struct file *file)
+{
+	macvtap_del_queue((struct macvtap_queue **)&file->private_data);
+	return 0;
+}
+
+static unsigned int macvtap_poll(struct file *file, poll_table * wait)
+{
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	unsigned int mask = POLLERR;
+
+	if (!q)
+		goto out;
+
+	mask = 0;
+	poll_wait(file, &q->sock.wait, wait);
+
+	if (!skb_queue_empty(&q->sk.sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(&q->sk) ||
+	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &q->sock.flags) &&
+	     sock_writeable(&q->sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+out:
+	macvtap_file_put_queue();
+	return mask;
+}
+
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_queue *q,
+				const struct iovec *iv, size_t count,
+				int noblock)
+{
+	struct sk_buff *skb;
+	size_t len = count;
+	int err;
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
+
+	if (!skb) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		return err;
+	}
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, count);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+
+	skb_set_network_header(skb, ETH_HLEN);
+
+	macvlan_start_xmit(skb, q->vlan->dev);
+
+	return count;
+}
+
+static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
+				 unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t result = -ENOLINK;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	if (!q)
+		goto out;
+
+	result = macvtap_get_user(q, iv, iov_length(iv, count),
+			      file->f_flags & O_NONBLOCK);
+out:
+	macvtap_file_put_queue();
+	return result;
+}
+
+/* Put packet to the user space buffer */
+static ssize_t macvtap_put_user(struct macvtap_queue *q,
+				const struct sk_buff *skb,
+				const struct iovec *iv, int len)
+{
+	struct macvlan_dev *vlan = q->vlan;
+	int ret;
+
+	len = min_t(int, skb->len, len);
+
+	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
+
+	macvlan_count_rx(vlan, len, ret == 0, 0);
+
+	return ret ? ret : len;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	DECLARE_WAITQUEUE(wait, current);
+	struct sk_buff *skb;
+	ssize_t len, ret = 0;
+
+	if (!q) {
+		ret = -ENOLINK;
+		goto out;
+	}
+
+	len = iov_length(iv, count);
+	if (len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	add_wait_queue(q->sk.sk_sleep, &wait);
+	while (len) {
+		current->state = TASK_INTERRUPTIBLE;
+
+		/* Read frames from the queue */
+		skb = skb_dequeue(&q->sk.sk_receive_queue);
+		if (!skb) {
+			if (file->f_flags & O_NONBLOCK) {
+				ret = -EAGAIN;
+				break;
+			}
+			if (signal_pending(current)) {
+				ret = -ERESTARTSYS;
+				break;
+			}
+			/* Nothing to read, let's sleep */
+			schedule();
+			continue;
+		}
+		ret = macvtap_put_user(q, skb, iv, len);
+		kfree_skb(skb);
+		break;
+	}
+
+	current->state = TASK_RUNNING;
+	remove_wait_queue(q->sk.sk_sleep, &wait);
+
+out:
+	macvtap_file_put_queue();
+	return ret;
+}
+
+/*
+ * provide compatibility with generic tun/tap interface
+ */
+static long macvtap_ioctl(struct file *file, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct macvtap_queue *q;
+	void __user *argp = (void __user *)arg;
+	struct ifreq __user *ifr = argp;
+	unsigned int __user *up = argp;
+	unsigned int u;
+	int err;
+
+	switch (cmd) {
+	case TUNSETIFF:
+		/* ignore the name, just look at flags */
+		if (get_user(u, &ifr->ifr_flags))
+			return -EFAULT;
+		if (u != (IFF_TAP | IFF_NO_PI))
+			return -EINVAL;
+		return 0;
+
+	case TUNGETIFF:
+		q = macvtap_file_get_queue(file);
+		if (!q)
+			return -ENOLINK;
+		err = 0;
+		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
+		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
+			err = -EFAULT;
+		macvtap_file_put_queue();
+		return err;
+
+	case TUNGETFEATURES:
+		if (put_user((IFF_TAP | IFF_NO_PI), up))
+			return -EFAULT;
+		return 0;
+
+	case TUNSETSNDBUF:
+		/* ignore */
+		return 0;
+
+	case TUNSETOFFLOAD:
+		/* let the user check for future flags */
+		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			  TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		/* TODO: add support for these, so far we don't
+			 support any offload */
+		if (arg & (TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			 TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		return 0;
+
+	default:
+		return -EINVAL;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static long macvtap_compat_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	return macvtap_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations macvtap_fops = {
+	.owner		= THIS_MODULE,
+	.open		= macvtap_open,
+	.release	= macvtap_release,
+	.aio_read	= macvtap_aio_read,
+	.aio_write	= macvtap_aio_write,
+	.poll		= macvtap_poll,
+	.llseek		= no_llseek,
+	.unlocked_ioctl	= macvtap_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= macvtap_compat_ioctl,
+#endif
+};
+
+static int macvtap_init(void)
+{
+	int err;
+
+	err = alloc_chrdev_region(&macvtap_major, 0,
+				MACVTAP_NUM_DEVS, "macvtap");
+	if (err)
+		goto out1;
+
+	cdev_init(&macvtap_cdev, &macvtap_fops);
+	err = cdev_add(&macvtap_cdev, macvtap_major, MACVTAP_NUM_DEVS);
+	if (err)
+		goto out2;
+
+	macvtap_class = class_create(THIS_MODULE, "macvtap");
+	if (IS_ERR(macvtap_class)) {
+		err = PTR_ERR(macvtap_class);
+		goto out3;
+	}
+
+	err = macvlan_link_register(&macvtap_link_ops);
+	if (err)
+		goto out4;
+
+	return 0;
+
+out4:
+	class_unregister(macvtap_class);
+out3:
+	cdev_del(&macvtap_cdev);
+out2:
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+out1:
+	return err;
+}
+module_init(macvtap_init);
+
+static void macvtap_exit(void)
+{
+	rtnl_link_unregister(&macvtap_link_ops);
+	class_unregister(macvtap_class);
+	cdev_del(&macvtap_cdev);
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+}
+module_exit(macvtap_exit);
+
+MODULE_ALIAS_RTNL_LINK("macvtap");
+MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9a11544..51f1512 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -34,6 +34,7 @@ struct macvlan_dev {
 	enum macvlan_mode	mode;
 	int (*receive)(struct sk_buff *skb);
 	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+	struct macvtap_queue	*tap;
 };
 
 static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v3] macvtap driver
  2010-01-27 10:04 ` [Bridge] " Arnd Bergmann
@ 2010-01-27 21:59   ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 21:59 UTC (permalink / raw)
  To: David Miller
  Cc: Stephen Hemminger, Patrick McHardy, Michael S. Tsirkin,
	Herbert Xu, Or Gerlitz, netdev, bridge, linux-kernel

On Wednesday 27 January 2010, Arnd Bergmann wrote:
> There are still known problems, but unless there
> are fundamental concerns, I'd like this to go
> into net-next as an experimental driver,
> fixing up the remaining problems by 2.6.34-rc1.

I should have been more specific here. The one
really annoying problem is a reference counting
problem I introduced in one of the last changes
that prevents you from destroying a device after
it has been used.

Unfortunately, I'm still traveling after LCA,
and haven't had a chance to look into this before
sending out the patches as I had originally
planned. I've also seen crashes that are not
fully reproducible, any bug reports on those
are appreciated.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Bridge] [PATCH 0/3 v3] macvtap driver
@ 2010-01-27 21:59   ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-27 21:59 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel, Or Gerlitz

On Wednesday 27 January 2010, Arnd Bergmann wrote:
> There are still known problems, but unless there
> are fundamental concerns, I'd like this to go
> into net-next as an experimental driver,
> fixing up the remaining problems by 2.6.34-rc1.

I should have been more specific here. The one
really annoying problem is a reference counting
problem I introduced in one of the last changes
that prevents you from destroying a device after
it has been used.

Unfortunately, I'm still traveling after LCA,
and haven't had a chance to look into this before
sending out the patches as I had originally
planned. I've also seen crashes that are not
fully reproducible, any bug reports on those
are appreciated.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/3] net: macvtap driver
  2010-01-27 21:09   ` [Bridge] " Arnd Bergmann
@ 2010-01-28 17:34     ` Michael S. Tsirkin
  -1 siblings, 0 replies; 63+ messages in thread
From: Michael S. Tsirkin @ 2010-01-28 17:34 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David Miller, Stephen Hemminger, Patrick McHardy, Herbert Xu,
	Or Gerlitz, netdev, bridge, linux-kernel

On Wed, Jan 27, 2010 at 10:09:27PM +0100, Arnd Bergmann wrote:
> In order to use macvlan with qemu and other tools that require
> a tap file descriptor, the macvtap driver adds a small backend
> with a character device with the same interface as the tun
> driver, with a minimum set of features.
> 
> Macvtap interfaces are created in the same way as macvlan
> interfaces using ip link, but the netif is just used as a
> handle for configuration and accounting, while the data
> goes through the chardev. Each macvtap interface has its
> own character device, simplifying permission management
> significantly over the generic tun/tap driver.
> 
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Stephen Hemminger <shemminger@linux-foundation.org>
> Cc: David S. Miller" <davem@davemloft.net>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Or Gerlitz <ogerlitz@voltaire.com>
> Cc: netdev@vger.kernel.org
> Cc: bridge@lists.linux-foundation.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/net/Kconfig        |   12 +
>  drivers/net/Makefile       |    1 +
>  drivers/net/macvtap.c      |  572 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/if_macvlan.h |    1 +
>  4 files changed, 586 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/macvtap.c
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index cb0e534..411e207 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -90,6 +90,18 @@ config MACVLAN
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called macvlan.
>  
> +config MACVTAP
> +	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
> +	depends on MACVLAN
> +	help
> +	  This adds a specialized tap character device driver that is based
> +	  on the MAC-VLAN network interface, called macvtap. A macvtap device
> +	  can be added in the same way as a macvlan device, using 'type
> +	  macvlan', and then be accessed through the tap user space interface.
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called macvtap.
> +
>  config EQUALIZER
>  	tristate "EQL (serial line load balancing) support"
>  	---help---
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 0b763cb..9595803 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -169,6 +169,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
>  obj-$(CONFIG_DUMMY) += dummy.o
>  obj-$(CONFIG_IFB) += ifb.o
>  obj-$(CONFIG_MACVLAN) += macvlan.o
> +obj-$(CONFIG_MACVTAP) += macvtap.o
>  obj-$(CONFIG_DE600) += de600.o
>  obj-$(CONFIG_DE620) += de620.o
>  obj-$(CONFIG_LANCE) += lance.o
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> new file mode 100644
> index 0000000..2916202
> --- /dev/null
> +++ b/drivers/net/macvtap.c
> @@ -0,0 +1,572 @@
> +#include <linux/etherdevice.h>
> +#include <linux/if_macvlan.h>
> +#include <linux/interrupt.h>
> +#include <linux/nsproxy.h>
> +#include <linux/compat.h>
> +#include <linux/if_tun.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/cache.h>
> +#include <linux/sched.h>
> +#include <linux/types.h>
> +#include <linux/init.h>
> +#include <linux/wait.h>
> +#include <linux/cdev.h>
> +#include <linux/fs.h>
> +
> +#include <net/net_namespace.h>
> +#include <net/rtnetlink.h>
> +#include <net/sock.h>
> +
> +/*
> + * A macvtap queue is the central object of this driver, it connects
> + * an open character device to a macvlan interface. There can be
> + * multiple queues on one interface, which map back to queues
> + * implemented in hardware on the underlying device.
> + *
> + * macvtap_proto is used to allocate queues through the sock allocation
> + * mechanism.
> + *
> + * TODO: multiqueue support is currently not implemented, even though
> + * macvtap is basically prepared for that. We will need to add this
> + * here as well as in virtio-net and qemu to get line rate on 10gbit
> + * adapters from a guest.
> + */
> +struct macvtap_queue {
> +	struct sock sk;
> +	struct socket sock;
> +	struct macvlan_dev *vlan;
> +	struct file *file;
> +};
> +
> +static struct proto macvtap_proto = {
> +	.name = "macvtap",
> +	.owner = THIS_MODULE,
> +	.obj_size = sizeof (struct macvtap_queue),
> +};
> +
> +/*
> + * Minor number matches netdev->ifindex, so need a potentially
> + * large value. This also makes it possible to split the
> + * tap functionality out again in the future by offering it
> + * from other drivers besides macvtap. As long as every device
> + * only has one tap, the interface numbers assure that the
> + * device nodes are unique.
> + */
> +static unsigned int macvtap_major;
> +#define MACVTAP_NUM_DEVS 65536
> +static struct class *macvtap_class;
> +static struct cdev macvtap_cdev;
> +
> +/*
> + * RCU usage:
> + * The macvtap_queue is referenced both from the chardev struct file
> + * and from the struct macvlan_dev using rcu_read_lock.
> + *
> + * We never actually update the contents of a macvtap_queue atomically
> + * with RCU but it is used for race-free destruction of a queue when
> + * either the file or the macvlan_dev goes away. Pointers back to
> + * the dev and the file are implicitly valid as long as the queue
> + * exists.
> + *
> + * The callbacks from macvlan are always done with rcu_read_lock held
> + * already, while in the file_operations, we get it ourselves.
> + *
> + * When destroying a queue, we remove the pointers from the file and
> + * from the dev and then synchronize_rcu to make sure no thread is
> + * still using the queue. There may still be references to the struct
> + * sock inside of the queue from outbound SKBs, but these never
> + * reference back to the file or the dev. The data structure is freed
> + * through __sk_free when both our references and any pending SKBs
> + * are gone.
> + *
> + * macvtap_lock is only used to prevent multiple concurrent open()
> + * calls to assign a new vlan->tap pointer. It could be moved into
> + * the macvlan_dev itself but is extremely rarely used.
> + */
> +static DEFINE_SPINLOCK(macvtap_lock);
> +
> +/*
> + * Choose the next free queue, for now there is only one
> + */
> +static int macvtap_set_queue(struct net_device *dev, struct file *file,
> +				struct macvtap_queue *q)
> +{
> +	struct macvlan_dev *vlan = netdev_priv(dev);
> +	int err = -EBUSY;
> +
> +	spin_lock(&macvtap_lock);
> +	if (rcu_dereference(vlan->tap))
> +		goto out;
> +
> +	err = 0;
> +	q->vlan = vlan;
> +	rcu_assign_pointer(vlan->tap, q);
> +
> +	q->file = file;
> +	rcu_assign_pointer(file->private_data, q);
> +
> +out:
> +	spin_unlock(&macvtap_lock);
> +	return err;
> +}
> +
> +/*
> + * We must destroy each queue exactly once, when either
> + * the netdev or the file go away.
> + *
> + * Using the spinlock makes sure that we don't get
> + * to the queue again after destroying it.
> + *
> + * synchronize_rcu serializes with the packet flow
> + * that uses rcu_read_lock.
> + */
> +static void macvtap_del_queue(struct macvtap_queue **qp)
> +{
> +	struct macvtap_queue *q;
> +
> +	spin_lock(&macvtap_lock);
> +	q = rcu_dereference(*qp);
> +	if (!q) {
> +		spin_unlock(&macvtap_lock);
> +		return;
> +	}
> +
> +	rcu_assign_pointer(q->vlan->tap, NULL);
> +	rcu_assign_pointer(q->file->private_data, NULL);
> +	spin_unlock(&macvtap_lock);
> +
> +	synchronize_rcu();
> +	sock_put(&q->sk);
> +}
> +
> +/*
> + * Since we only support one queue, just dereference the pointer.
> + */
> +static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
> +					       struct sk_buff *skb)
> +{
> +	struct macvlan_dev *vlan = netdev_priv(dev);
> +
> +	return rcu_dereference(vlan->tap);
> +}
> +
> +static void macvtap_del_queues(struct net_device *dev)
> +{
> +	struct macvlan_dev *vlan = netdev_priv(dev);
> +	macvtap_del_queue(&vlan->tap);
> +}
> +
> +static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
> +{
> +	rcu_read_lock_bh();
> +	return rcu_dereference(file->private_data);
> +}
> +
> +static inline void macvtap_file_put_queue(void)
> +{
> +	rcu_read_unlock_bh();
> +}
> +

I find such wrappers around rcu obscure this,
already sufficiently complex, pattern.
Might be just me.

> +/*
> + * Forward happens for data that gets sent from one macvlan
> + * endpoint to another one in bridge mode. We just take
> + * the skb and put it into the receive queue.
> + */
> +static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
> +{
> +	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
> +	if (!q)
> +		return -ENOLINK;
> +
> +	skb_queue_tail(&q->sk.sk_receive_queue, skb);
> +	wake_up(q->sk.sk_sleep);
> +	return 0;
> +}
> +
> +/*
> + * Receive is for data from the external interface (lowerdev),
> + * in case of macvtap, we can treat that the same way as
> + * forward, which macvlan cannot.
> + */
> +static int macvtap_receive(struct sk_buff *skb)
> +{
> +	skb_push(skb, ETH_HLEN);
> +	return macvtap_forward(skb->dev, skb);
> +}
> +
> +static int macvtap_newlink(struct net *src_net,
> +			   struct net_device *dev,
> +			   struct nlattr *tb[],
> +			   struct nlattr *data[])
> +{
> +	struct device *classdev;
> +	dev_t devt;
> +	int err;
> +
> +	err = macvlan_common_newlink(src_net, dev, tb, data,
> +				     macvtap_receive, macvtap_forward);
> +	if (err)
> +		goto out;
> +
> +	devt = MKDEV(MAJOR(macvtap_major), dev->ifindex);
> +
> +	classdev = device_create(macvtap_class, &dev->dev, devt,
> +				 dev, "tap%d", dev->ifindex);
> +	if (IS_ERR(classdev)) {
> +		err = PTR_ERR(classdev);
> +		macvtap_del_queues(dev);
> +	}
> +
> +out:
> +	return err;
> +}
> +
> +static void macvtap_dellink(struct net_device *dev,
> +			    struct list_head *head)
> +{
> +	device_destroy(macvtap_class,
> +		       MKDEV(MAJOR(macvtap_major), dev->ifindex));
> +
> +	macvtap_del_queues(dev);
> +	macvlan_dellink(dev, head);
> +}
> +
> +static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
> +	.kind		= "macvtap",
> +	.newlink	= macvtap_newlink,
> +	.dellink	= macvtap_dellink,
> +};
> +
> +
> +static void macvtap_sock_write_space(struct sock *sk)
> +{
> +	if (!sock_writeable(sk) ||
> +	    !test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
> +		return;
> +
> +	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		wake_up_interruptible_sync(sk->sk_sleep);
> +}
> +
> +static int macvtap_open(struct inode *inode, struct file *file)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	struct net_device *dev = dev_get_by_index(net, iminor(inode));
> +	struct macvtap_queue *q;
> +	int err;
> +

This seems to keep reference to device as long as character device is
open, which, if I understand correctly, will start printing error
messages to kernel log about once a second if you try to remove the
device.

I suspect the best way to fix this issue would be to use some kind
of notifier so that macvtap can disconnect on device removal.


> +	err = -ENODEV;
> +	if (!dev)
> +		goto out;
> +
> +	/* check if this is a macvtap device */
> +	err = -EINVAL;
> +	if (dev->rtnl_link_ops != &macvtap_link_ops)
> +		goto out;
> +
> +	err = -ENOMEM;
> +	q = (struct macvtap_queue *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
> +					     &macvtap_proto);
> +	if (!q)
> +		goto out;
> +
> +	init_waitqueue_head(&q->sock.wait);
> +	q->sock.type = SOCK_RAW;
> +	q->sock.state = SS_CONNECTED;
> +	sock_init_data(&q->sock, &q->sk);
> +	q->sk.sk_allocation = GFP_ATOMIC; /* for now */
> +	q->sk.sk_write_space = macvtap_sock_write_space;
> +
> +	err = macvtap_set_queue(dev, file, q);
> +	if (err)
> +		sock_put(&q->sk);
> +
> +out:
> +	return err;
> +}
> +
> +static int macvtap_release(struct inode *inode, struct file *file)
> +{
> +	macvtap_del_queue((struct macvtap_queue **)&file->private_data);
> +	return 0;
> +}
> +
> +static unsigned int macvtap_poll(struct file *file, poll_table * wait)
> +{
> +	struct macvtap_queue *q = macvtap_file_get_queue(file);
> +	unsigned int mask = POLLERR;
> +
> +	if (!q)
> +		goto out;
> +
> +	mask = 0;
> +	poll_wait(file, &q->sock.wait, wait);
> +
> +	if (!skb_queue_empty(&q->sk.sk_receive_queue))
> +		mask |= POLLIN | POLLRDNORM;
> +
> +	if (sock_writeable(&q->sk) ||
> +	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &q->sock.flags) &&
> +	     sock_writeable(&q->sk)))
> +		mask |= POLLOUT | POLLWRNORM;
> +
> +out:
> +	macvtap_file_put_queue();
> +	return mask;
> +}
> +
> +/* Get packet from user space buffer */
> +static ssize_t macvtap_get_user(struct macvtap_queue *q,
> +				const struct iovec *iv, size_t count,
> +				int noblock)
> +{
> +	struct sk_buff *skb;
> +	size_t len = count;
> +	int err;
> +
> +	if (unlikely(len < ETH_HLEN))
> +		return -EINVAL;
> +
> +	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
> +
> +	if (!skb) {
> +		macvlan_count_rx(q->vlan, 0, false, false);
> +		return err;
> +	}
> +
> +	skb_reserve(skb, NET_IP_ALIGN);
> +	skb_put(skb, count);
> +
> +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> +		macvlan_count_rx(q->vlan, 0, false, false);
> +		kfree_skb(skb);
> +		return -EFAULT;
> +	}
> +
> +	skb_set_network_header(skb, ETH_HLEN);
> +
> +	macvlan_start_xmit(skb, q->vlan->dev);
> +
> +	return count;
> +}
> +

I am surprised there's no GSO support.  Would it be a good idea to share
code with tun driver? That already has GSO ...
Also, network header pointer seems off for vlan packets?

> +static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
> +				 unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	ssize_t result = -ENOLINK;
> +	struct macvtap_queue *q = macvtap_file_get_queue(file);
> +
> +	if (!q)
> +		goto out;
> +
> +	result = macvtap_get_user(q, iv, iov_length(iv, count),
> +			      file->f_flags & O_NONBLOCK);
> +out:
> +	macvtap_file_put_queue();
> +	return result;
> +}
> +
> +/* Put packet to the user space buffer */
> +static ssize_t macvtap_put_user(struct macvtap_queue *q,
> +				const struct sk_buff *skb,
> +				const struct iovec *iv, int len)
> +{
> +	struct macvlan_dev *vlan = q->vlan;
> +	int ret;
> +
> +	len = min_t(int, skb->len, len);
> +
> +	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
> +
> +	macvlan_count_rx(vlan, len, ret == 0, 0);
> +
> +	return ret ? ret : len;
> +}
> +
> +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> +				unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct macvtap_queue *q = macvtap_file_get_queue(file);
> +
> +	DECLARE_WAITQUEUE(wait, current);
> +	struct sk_buff *skb;
> +	ssize_t len, ret = 0;
> +
> +	if (!q) {
> +		ret = -ENOLINK;
> +		goto out;
> +	}
> +
> +	len = iov_length(iv, count);
> +	if (len < 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	add_wait_queue(q->sk.sk_sleep, &wait);
> +	while (len) {
> +		current->state = TASK_INTERRUPTIBLE;
> +
> +		/* Read frames from the queue */
> +		skb = skb_dequeue(&q->sk.sk_receive_queue);
> +		if (!skb) {
> +			if (file->f_flags & O_NONBLOCK) {
> +				ret = -EAGAIN;
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				ret = -ERESTARTSYS;
> +				break;
> +			}
> +			/* Nothing to read, let's sleep */
> +			schedule();
> +			continue;
> +		}
> +		ret = macvtap_put_user(q, skb, iv, len);
> +		kfree_skb(skb);
> +		break;
> +	}
> +
> +	current->state = TASK_RUNNING;
> +	remove_wait_queue(q->sk.sk_sleep, &wait);
> +
> +out:
> +	macvtap_file_put_queue();
> +	return ret;
> +}
> +

Same GSO comment here.

> +/*
> + * provide compatibility with generic tun/tap interface
> + */
> +static long macvtap_ioctl(struct file *file, unsigned int cmd,
> +			  unsigned long arg)
> +{

All of these seem to be stubs, and tun has many more that you didn't
stub out. So, why do you bother to support any ioctls at all?

> +	struct macvtap_queue *q;
> +	void __user *argp = (void __user *)arg;
> +	struct ifreq __user *ifr = argp;
> +	unsigned int __user *up = argp;
> +	unsigned int u;
> +	int err;
> +
> +	switch (cmd) {
> +	case TUNSETIFF:
> +		/* ignore the name, just look at flags */
> +		if (get_user(u, &ifr->ifr_flags))
> +			return -EFAULT;
> +		if (u != (IFF_TAP | IFF_NO_PI))
> +			return -EINVAL;
> +		return 0;
> +
> +	case TUNGETIFF:
> +		q = macvtap_file_get_queue(file);
> +		if (!q)
> +			return -ENOLINK;
> +		err = 0;
> +		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
> +		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
> +			err = -EFAULT;
> +		macvtap_file_put_queue();
> +		return err;
> +
> +	case TUNGETFEATURES:
> +		if (put_user((IFF_TAP | IFF_NO_PI), up))
> +			return -EFAULT;
> +		return 0;
> +
> +	case TUNSETSNDBUF:
> +		/* ignore */
> +		return 0;
> +
> +	case TUNSETOFFLOAD:
> +		/* let the user check for future flags */
> +		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
> +			  TUN_F_TSO_ECN | TUN_F_UFO))
> +			return -EINVAL;
> +
> +		/* TODO: add support for these, so far we don't
> +			 support any offload */
> +		if (arg & (TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
> +			 TUN_F_TSO_ECN | TUN_F_UFO))
> +			return -EINVAL;
> +
> +		return 0;
> +
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long macvtap_compat_ioctl(struct file *file, unsigned int cmd,
> +				 unsigned long arg)
> +{
> +	return macvtap_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> +}
> +#endif
> +
> +static const struct file_operations macvtap_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= macvtap_open,
> +	.release	= macvtap_release,
> +	.aio_read	= macvtap_aio_read,
> +	.aio_write	= macvtap_aio_write,
> +	.poll		= macvtap_poll,
> +	.llseek		= no_llseek,
> +	.unlocked_ioctl	= macvtap_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= macvtap_compat_ioctl,
> +#endif
> +};
> +
> +static int macvtap_init(void)
> +{
> +	int err;
> +
> +	err = alloc_chrdev_region(&macvtap_major, 0,
> +				MACVTAP_NUM_DEVS, "macvtap");
> +	if (err)
> +		goto out1;
> +
> +	cdev_init(&macvtap_cdev, &macvtap_fops);
> +	err = cdev_add(&macvtap_cdev, macvtap_major, MACVTAP_NUM_DEVS);
> +	if (err)
> +		goto out2;
> +
> +	macvtap_class = class_create(THIS_MODULE, "macvtap");
> +	if (IS_ERR(macvtap_class)) {
> +		err = PTR_ERR(macvtap_class);
> +		goto out3;
> +	}
> +
> +	err = macvlan_link_register(&macvtap_link_ops);
> +	if (err)
> +		goto out4;
> +
> +	return 0;
> +
> +out4:
> +	class_unregister(macvtap_class);
> +out3:
> +	cdev_del(&macvtap_cdev);
> +out2:
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +out1:
> +	return err;
> +}
> +module_init(macvtap_init);
> +
> +static void macvtap_exit(void)
> +{
> +	rtnl_link_unregister(&macvtap_link_ops);
> +	class_unregister(macvtap_class);
> +	cdev_del(&macvtap_cdev);
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +}
> +module_exit(macvtap_exit);
> +
> +MODULE_ALIAS_RTNL_LINK("macvtap");
> +MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>");
> +MODULE_LICENSE("GPL");
> diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
> index 9a11544..51f1512 100644
> --- a/include/linux/if_macvlan.h
> +++ b/include/linux/if_macvlan.h
> @@ -34,6 +34,7 @@ struct macvlan_dev {
>  	enum macvlan_mode	mode;
>  	int (*receive)(struct sk_buff *skb);
>  	int (*forward)(struct net_device *dev, struct sk_buff *skb);
> +	struct macvtap_queue	*tap;
>  };
>  
>  static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
> -- 
> 1.6.3.3

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Bridge] [PATCH 3/3] net: macvtap driver
@ 2010-01-28 17:34     ` Michael S. Tsirkin
  0 siblings, 0 replies; 63+ messages in thread
From: Michael S. Tsirkin @ 2010-01-28 17:34 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Herbert Xu, netdev, bridge, linux-kernel, Or Gerlitz, David Miller

On Wed, Jan 27, 2010 at 10:09:27PM +0100, Arnd Bergmann wrote:
> In order to use macvlan with qemu and other tools that require
> a tap file descriptor, the macvtap driver adds a small backend
> with a character device with the same interface as the tun
> driver, with a minimum set of features.
> 
> Macvtap interfaces are created in the same way as macvlan
> interfaces using ip link, but the netif is just used as a
> handle for configuration and accounting, while the data
> goes through the chardev. Each macvtap interface has its
> own character device, simplifying permission management
> significantly over the generic tun/tap driver.
> 
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Stephen Hemminger <shemminger@linux-foundation.org>
> Cc: David S. Miller" <davem@davemloft.net>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Or Gerlitz <ogerlitz@voltaire.com>
> Cc: netdev@vger.kernel.org
> Cc: bridge@lists.linux-foundation.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
>  drivers/net/Kconfig        |   12 +
>  drivers/net/Makefile       |    1 +
>  drivers/net/macvtap.c      |  572 ++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/if_macvlan.h |    1 +
>  4 files changed, 586 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/net/macvtap.c
> 
> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
> index cb0e534..411e207 100644
> --- a/drivers/net/Kconfig
> +++ b/drivers/net/Kconfig
> @@ -90,6 +90,18 @@ config MACVLAN
>  	  To compile this driver as a module, choose M here: the module
>  	  will be called macvlan.
>  
> +config MACVTAP
> +	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
> +	depends on MACVLAN
> +	help
> +	  This adds a specialized tap character device driver that is based
> +	  on the MAC-VLAN network interface, called macvtap. A macvtap device
> +	  can be added in the same way as a macvlan device, using 'type
> +	  macvlan', and then be accessed through the tap user space interface.
> +
> +	  To compile this driver as a module, choose M here: the module
> +	  will be called macvtap.
> +
>  config EQUALIZER
>  	tristate "EQL (serial line load balancing) support"
>  	---help---
> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> index 0b763cb..9595803 100644
> --- a/drivers/net/Makefile
> +++ b/drivers/net/Makefile
> @@ -169,6 +169,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
>  obj-$(CONFIG_DUMMY) += dummy.o
>  obj-$(CONFIG_IFB) += ifb.o
>  obj-$(CONFIG_MACVLAN) += macvlan.o
> +obj-$(CONFIG_MACVTAP) += macvtap.o
>  obj-$(CONFIG_DE600) += de600.o
>  obj-$(CONFIG_DE620) += de620.o
>  obj-$(CONFIG_LANCE) += lance.o
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> new file mode 100644
> index 0000000..2916202
> --- /dev/null
> +++ b/drivers/net/macvtap.c
> @@ -0,0 +1,572 @@
> +#include <linux/etherdevice.h>
> +#include <linux/if_macvlan.h>
> +#include <linux/interrupt.h>
> +#include <linux/nsproxy.h>
> +#include <linux/compat.h>
> +#include <linux/if_tun.h>
> +#include <linux/module.h>
> +#include <linux/skbuff.h>
> +#include <linux/cache.h>
> +#include <linux/sched.h>
> +#include <linux/types.h>
> +#include <linux/init.h>
> +#include <linux/wait.h>
> +#include <linux/cdev.h>
> +#include <linux/fs.h>
> +
> +#include <net/net_namespace.h>
> +#include <net/rtnetlink.h>
> +#include <net/sock.h>
> +
> +/*
> + * A macvtap queue is the central object of this driver, it connects
> + * an open character device to a macvlan interface. There can be
> + * multiple queues on one interface, which map back to queues
> + * implemented in hardware on the underlying device.
> + *
> + * macvtap_proto is used to allocate queues through the sock allocation
> + * mechanism.
> + *
> + * TODO: multiqueue support is currently not implemented, even though
> + * macvtap is basically prepared for that. We will need to add this
> + * here as well as in virtio-net and qemu to get line rate on 10gbit
> + * adapters from a guest.
> + */
> +struct macvtap_queue {
> +	struct sock sk;
> +	struct socket sock;
> +	struct macvlan_dev *vlan;
> +	struct file *file;
> +};
> +
> +static struct proto macvtap_proto = {
> +	.name = "macvtap",
> +	.owner = THIS_MODULE,
> +	.obj_size = sizeof (struct macvtap_queue),
> +};
> +
> +/*
> + * Minor number matches netdev->ifindex, so need a potentially
> + * large value. This also makes it possible to split the
> + * tap functionality out again in the future by offering it
> + * from other drivers besides macvtap. As long as every device
> + * only has one tap, the interface numbers assure that the
> + * device nodes are unique.
> + */
> +static unsigned int macvtap_major;
> +#define MACVTAP_NUM_DEVS 65536
> +static struct class *macvtap_class;
> +static struct cdev macvtap_cdev;
> +
> +/*
> + * RCU usage:
> + * The macvtap_queue is referenced both from the chardev struct file
> + * and from the struct macvlan_dev using rcu_read_lock.
> + *
> + * We never actually update the contents of a macvtap_queue atomically
> + * with RCU but it is used for race-free destruction of a queue when
> + * either the file or the macvlan_dev goes away. Pointers back to
> + * the dev and the file are implicitly valid as long as the queue
> + * exists.
> + *
> + * The callbacks from macvlan are always done with rcu_read_lock held
> + * already, while in the file_operations, we get it ourselves.
> + *
> + * When destroying a queue, we remove the pointers from the file and
> + * from the dev and then synchronize_rcu to make sure no thread is
> + * still using the queue. There may still be references to the struct
> + * sock inside of the queue from outbound SKBs, but these never
> + * reference back to the file or the dev. The data structure is freed
> + * through __sk_free when both our references and any pending SKBs
> + * are gone.
> + *
> + * macvtap_lock is only used to prevent multiple concurrent open()
> + * calls to assign a new vlan->tap pointer. It could be moved into
> + * the macvlan_dev itself but is extremely rarely used.
> + */
> +static DEFINE_SPINLOCK(macvtap_lock);
> +
> +/*
> + * Choose the next free queue, for now there is only one
> + */
> +static int macvtap_set_queue(struct net_device *dev, struct file *file,
> +				struct macvtap_queue *q)
> +{
> +	struct macvlan_dev *vlan = netdev_priv(dev);
> +	int err = -EBUSY;
> +
> +	spin_lock(&macvtap_lock);
> +	if (rcu_dereference(vlan->tap))
> +		goto out;
> +
> +	err = 0;
> +	q->vlan = vlan;
> +	rcu_assign_pointer(vlan->tap, q);
> +
> +	q->file = file;
> +	rcu_assign_pointer(file->private_data, q);
> +
> +out:
> +	spin_unlock(&macvtap_lock);
> +	return err;
> +}
> +
> +/*
> + * We must destroy each queue exactly once, when either
> + * the netdev or the file go away.
> + *
> + * Using the spinlock makes sure that we don't get
> + * to the queue again after destroying it.
> + *
> + * synchronize_rcu serializes with the packet flow
> + * that uses rcu_read_lock.
> + */
> +static void macvtap_del_queue(struct macvtap_queue **qp)
> +{
> +	struct macvtap_queue *q;
> +
> +	spin_lock(&macvtap_lock);
> +	q = rcu_dereference(*qp);
> +	if (!q) {
> +		spin_unlock(&macvtap_lock);
> +		return;
> +	}
> +
> +	rcu_assign_pointer(q->vlan->tap, NULL);
> +	rcu_assign_pointer(q->file->private_data, NULL);
> +	spin_unlock(&macvtap_lock);
> +
> +	synchronize_rcu();
> +	sock_put(&q->sk);
> +}
> +
> +/*
> + * Since we only support one queue, just dereference the pointer.
> + */
> +static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
> +					       struct sk_buff *skb)
> +{
> +	struct macvlan_dev *vlan = netdev_priv(dev);
> +
> +	return rcu_dereference(vlan->tap);
> +}
> +
> +static void macvtap_del_queues(struct net_device *dev)
> +{
> +	struct macvlan_dev *vlan = netdev_priv(dev);
> +	macvtap_del_queue(&vlan->tap);
> +}
> +
> +static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
> +{
> +	rcu_read_lock_bh();
> +	return rcu_dereference(file->private_data);
> +}
> +
> +static inline void macvtap_file_put_queue(void)
> +{
> +	rcu_read_unlock_bh();
> +}
> +

I find such wrappers around rcu obscure this,
already sufficiently complex, pattern.
Might be just me.

> +/*
> + * Forward happens for data that gets sent from one macvlan
> + * endpoint to another one in bridge mode. We just take
> + * the skb and put it into the receive queue.
> + */
> +static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
> +{
> +	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
> +	if (!q)
> +		return -ENOLINK;
> +
> +	skb_queue_tail(&q->sk.sk_receive_queue, skb);
> +	wake_up(q->sk.sk_sleep);
> +	return 0;
> +}
> +
> +/*
> + * Receive is for data from the external interface (lowerdev),
> + * in case of macvtap, we can treat that the same way as
> + * forward, which macvlan cannot.
> + */
> +static int macvtap_receive(struct sk_buff *skb)
> +{
> +	skb_push(skb, ETH_HLEN);
> +	return macvtap_forward(skb->dev, skb);
> +}
> +
> +static int macvtap_newlink(struct net *src_net,
> +			   struct net_device *dev,
> +			   struct nlattr *tb[],
> +			   struct nlattr *data[])
> +{
> +	struct device *classdev;
> +	dev_t devt;
> +	int err;
> +
> +	err = macvlan_common_newlink(src_net, dev, tb, data,
> +				     macvtap_receive, macvtap_forward);
> +	if (err)
> +		goto out;
> +
> +	devt = MKDEV(MAJOR(macvtap_major), dev->ifindex);
> +
> +	classdev = device_create(macvtap_class, &dev->dev, devt,
> +				 dev, "tap%d", dev->ifindex);
> +	if (IS_ERR(classdev)) {
> +		err = PTR_ERR(classdev);
> +		macvtap_del_queues(dev);
> +	}
> +
> +out:
> +	return err;
> +}
> +
> +static void macvtap_dellink(struct net_device *dev,
> +			    struct list_head *head)
> +{
> +	device_destroy(macvtap_class,
> +		       MKDEV(MAJOR(macvtap_major), dev->ifindex));
> +
> +	macvtap_del_queues(dev);
> +	macvlan_dellink(dev, head);
> +}
> +
> +static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
> +	.kind		= "macvtap",
> +	.newlink	= macvtap_newlink,
> +	.dellink	= macvtap_dellink,
> +};
> +
> +
> +static void macvtap_sock_write_space(struct sock *sk)
> +{
> +	if (!sock_writeable(sk) ||
> +	    !test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
> +		return;
> +
> +	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		wake_up_interruptible_sync(sk->sk_sleep);
> +}
> +
> +static int macvtap_open(struct inode *inode, struct file *file)
> +{
> +	struct net *net = current->nsproxy->net_ns;
> +	struct net_device *dev = dev_get_by_index(net, iminor(inode));
> +	struct macvtap_queue *q;
> +	int err;
> +

This seems to keep reference to device as long as character device is
open, which, if I understand correctly, will start printing error
messages to kernel log about once a second if you try to remove the
device.

I suspect the best way to fix this issue would be to use some kind
of notifier so that macvtap can disconnect on device removal.


> +	err = -ENODEV;
> +	if (!dev)
> +		goto out;
> +
> +	/* check if this is a macvtap device */
> +	err = -EINVAL;
> +	if (dev->rtnl_link_ops != &macvtap_link_ops)
> +		goto out;
> +
> +	err = -ENOMEM;
> +	q = (struct macvtap_queue *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
> +					     &macvtap_proto);
> +	if (!q)
> +		goto out;
> +
> +	init_waitqueue_head(&q->sock.wait);
> +	q->sock.type = SOCK_RAW;
> +	q->sock.state = SS_CONNECTED;
> +	sock_init_data(&q->sock, &q->sk);
> +	q->sk.sk_allocation = GFP_ATOMIC; /* for now */
> +	q->sk.sk_write_space = macvtap_sock_write_space;
> +
> +	err = macvtap_set_queue(dev, file, q);
> +	if (err)
> +		sock_put(&q->sk);
> +
> +out:
> +	return err;
> +}
> +
> +static int macvtap_release(struct inode *inode, struct file *file)
> +{
> +	macvtap_del_queue((struct macvtap_queue **)&file->private_data);
> +	return 0;
> +}
> +
> +static unsigned int macvtap_poll(struct file *file, poll_table * wait)
> +{
> +	struct macvtap_queue *q = macvtap_file_get_queue(file);
> +	unsigned int mask = POLLERR;
> +
> +	if (!q)
> +		goto out;
> +
> +	mask = 0;
> +	poll_wait(file, &q->sock.wait, wait);
> +
> +	if (!skb_queue_empty(&q->sk.sk_receive_queue))
> +		mask |= POLLIN | POLLRDNORM;
> +
> +	if (sock_writeable(&q->sk) ||
> +	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &q->sock.flags) &&
> +	     sock_writeable(&q->sk)))
> +		mask |= POLLOUT | POLLWRNORM;
> +
> +out:
> +	macvtap_file_put_queue();
> +	return mask;
> +}
> +
> +/* Get packet from user space buffer */
> +static ssize_t macvtap_get_user(struct macvtap_queue *q,
> +				const struct iovec *iv, size_t count,
> +				int noblock)
> +{
> +	struct sk_buff *skb;
> +	size_t len = count;
> +	int err;
> +
> +	if (unlikely(len < ETH_HLEN))
> +		return -EINVAL;
> +
> +	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
> +
> +	if (!skb) {
> +		macvlan_count_rx(q->vlan, 0, false, false);
> +		return err;
> +	}
> +
> +	skb_reserve(skb, NET_IP_ALIGN);
> +	skb_put(skb, count);
> +
> +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> +		macvlan_count_rx(q->vlan, 0, false, false);
> +		kfree_skb(skb);
> +		return -EFAULT;
> +	}
> +
> +	skb_set_network_header(skb, ETH_HLEN);
> +
> +	macvlan_start_xmit(skb, q->vlan->dev);
> +
> +	return count;
> +}
> +

I am surprised there's no GSO support.  Would it be a good idea to share
code with tun driver? That already has GSO ...
Also, network header pointer seems off for vlan packets?

> +static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
> +				 unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	ssize_t result = -ENOLINK;
> +	struct macvtap_queue *q = macvtap_file_get_queue(file);
> +
> +	if (!q)
> +		goto out;
> +
> +	result = macvtap_get_user(q, iv, iov_length(iv, count),
> +			      file->f_flags & O_NONBLOCK);
> +out:
> +	macvtap_file_put_queue();
> +	return result;
> +}
> +
> +/* Put packet to the user space buffer */
> +static ssize_t macvtap_put_user(struct macvtap_queue *q,
> +				const struct sk_buff *skb,
> +				const struct iovec *iv, int len)
> +{
> +	struct macvlan_dev *vlan = q->vlan;
> +	int ret;
> +
> +	len = min_t(int, skb->len, len);
> +
> +	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
> +
> +	macvlan_count_rx(vlan, len, ret == 0, 0);
> +
> +	return ret ? ret : len;
> +}
> +
> +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> +				unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct macvtap_queue *q = macvtap_file_get_queue(file);
> +
> +	DECLARE_WAITQUEUE(wait, current);
> +	struct sk_buff *skb;
> +	ssize_t len, ret = 0;
> +
> +	if (!q) {
> +		ret = -ENOLINK;
> +		goto out;
> +	}
> +
> +	len = iov_length(iv, count);
> +	if (len < 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	add_wait_queue(q->sk.sk_sleep, &wait);
> +	while (len) {
> +		current->state = TASK_INTERRUPTIBLE;
> +
> +		/* Read frames from the queue */
> +		skb = skb_dequeue(&q->sk.sk_receive_queue);
> +		if (!skb) {
> +			if (file->f_flags & O_NONBLOCK) {
> +				ret = -EAGAIN;
> +				break;
> +			}
> +			if (signal_pending(current)) {
> +				ret = -ERESTARTSYS;
> +				break;
> +			}
> +			/* Nothing to read, let's sleep */
> +			schedule();
> +			continue;
> +		}
> +		ret = macvtap_put_user(q, skb, iv, len);
> +		kfree_skb(skb);
> +		break;
> +	}
> +
> +	current->state = TASK_RUNNING;
> +	remove_wait_queue(q->sk.sk_sleep, &wait);
> +
> +out:
> +	macvtap_file_put_queue();
> +	return ret;
> +}
> +

Same GSO comment here.

> +/*
> + * provide compatibility with generic tun/tap interface
> + */
> +static long macvtap_ioctl(struct file *file, unsigned int cmd,
> +			  unsigned long arg)
> +{

All of these seem to be stubs, and tun has many more that you didn't
stub out. So, why do you bother to support any ioctls at all?

> +	struct macvtap_queue *q;
> +	void __user *argp = (void __user *)arg;
> +	struct ifreq __user *ifr = argp;
> +	unsigned int __user *up = argp;
> +	unsigned int u;
> +	int err;
> +
> +	switch (cmd) {
> +	case TUNSETIFF:
> +		/* ignore the name, just look at flags */
> +		if (get_user(u, &ifr->ifr_flags))
> +			return -EFAULT;
> +		if (u != (IFF_TAP | IFF_NO_PI))
> +			return -EINVAL;
> +		return 0;
> +
> +	case TUNGETIFF:
> +		q = macvtap_file_get_queue(file);
> +		if (!q)
> +			return -ENOLINK;
> +		err = 0;
> +		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
> +		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
> +			err = -EFAULT;
> +		macvtap_file_put_queue();
> +		return err;
> +
> +	case TUNGETFEATURES:
> +		if (put_user((IFF_TAP | IFF_NO_PI), up))
> +			return -EFAULT;
> +		return 0;
> +
> +	case TUNSETSNDBUF:
> +		/* ignore */
> +		return 0;
> +
> +	case TUNSETOFFLOAD:
> +		/* let the user check for future flags */
> +		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
> +			  TUN_F_TSO_ECN | TUN_F_UFO))
> +			return -EINVAL;
> +
> +		/* TODO: add support for these, so far we don't
> +			 support any offload */
> +		if (arg & (TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
> +			 TUN_F_TSO_ECN | TUN_F_UFO))
> +			return -EINVAL;
> +
> +		return 0;
> +
> +	default:
> +		return -EINVAL;
> +	}
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long macvtap_compat_ioctl(struct file *file, unsigned int cmd,
> +				 unsigned long arg)
> +{
> +	return macvtap_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
> +}
> +#endif
> +
> +static const struct file_operations macvtap_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= macvtap_open,
> +	.release	= macvtap_release,
> +	.aio_read	= macvtap_aio_read,
> +	.aio_write	= macvtap_aio_write,
> +	.poll		= macvtap_poll,
> +	.llseek		= no_llseek,
> +	.unlocked_ioctl	= macvtap_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= macvtap_compat_ioctl,
> +#endif
> +};
> +
> +static int macvtap_init(void)
> +{
> +	int err;
> +
> +	err = alloc_chrdev_region(&macvtap_major, 0,
> +				MACVTAP_NUM_DEVS, "macvtap");
> +	if (err)
> +		goto out1;
> +
> +	cdev_init(&macvtap_cdev, &macvtap_fops);
> +	err = cdev_add(&macvtap_cdev, macvtap_major, MACVTAP_NUM_DEVS);
> +	if (err)
> +		goto out2;
> +
> +	macvtap_class = class_create(THIS_MODULE, "macvtap");
> +	if (IS_ERR(macvtap_class)) {
> +		err = PTR_ERR(macvtap_class);
> +		goto out3;
> +	}
> +
> +	err = macvlan_link_register(&macvtap_link_ops);
> +	if (err)
> +		goto out4;
> +
> +	return 0;
> +
> +out4:
> +	class_unregister(macvtap_class);
> +out3:
> +	cdev_del(&macvtap_cdev);
> +out2:
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +out1:
> +	return err;
> +}
> +module_init(macvtap_init);
> +
> +static void macvtap_exit(void)
> +{
> +	rtnl_link_unregister(&macvtap_link_ops);
> +	class_unregister(macvtap_class);
> +	cdev_del(&macvtap_cdev);
> +	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
> +}
> +module_exit(macvtap_exit);
> +
> +MODULE_ALIAS_RTNL_LINK("macvtap");
> +MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>");
> +MODULE_LICENSE("GPL");
> diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
> index 9a11544..51f1512 100644
> --- a/include/linux/if_macvlan.h
> +++ b/include/linux/if_macvlan.h
> @@ -34,6 +34,7 @@ struct macvlan_dev {
>  	enum macvlan_mode	mode;
>  	int (*receive)(struct sk_buff *skb);
>  	int (*forward)(struct net_device *dev, struct sk_buff *skb);
> +	struct macvtap_queue	*tap;
>  };
>  
>  static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
> -- 
> 1.6.3.3

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/3] net: macvtap driver
  2010-01-28 17:34     ` [Bridge] " Michael S. Tsirkin
@ 2010-01-28 20:18       ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-28 20:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Stephen Hemminger, Patrick McHardy, Herbert Xu,
	Or Gerlitz, netdev, bridge, linux-kernel

On Thursday 28 January 2010, Michael S. Tsirkin wrote:
> On Wed, Jan 27, 2010 at 10:09:27PM +0100, Arnd Bergmann wrote:
> > +static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
> > +{
> > +	rcu_read_lock_bh();
> > +	return rcu_dereference(file->private_data);
> > +}
> > +
> > +static inline void macvtap_file_put_queue(void)
> > +{
> > +	rcu_read_unlock_bh();
> > +}
> > +
> 
> I find such wrappers around rcu obscure this,
> already sufficiently complex, pattern.
> Might be just me.

Obviously I find them useful here, but if more people feel the
same as you, I'll just open-code them.

> > +static int macvtap_open(struct inode *inode, struct file *file)
> > +{
> > +	struct net *net = current->nsproxy->net_ns;
> > +	struct net_device *dev = dev_get_by_index(net, iminor(inode));
> > +	struct macvtap_queue *q;
> > +	int err;
> > +
> 
> This seems to keep reference to device as long as character device is
> open, which, if I understand correctly, will start printing error
> messages to kernel log about once a second if you try to remove the
> device.
> 
> I suspect the best way to fix this issue would be to use some kind
> of notifier so that macvtap can disconnect on device removal.

I think I'm just missing the put in the open function, the code
already handles the netif and the file disappearing independently.

Thanks for spotting this one, I'll fix that in the next post.

> > +	skb_reserve(skb, NET_IP_ALIGN);
> > +	skb_put(skb, count);
> > +
> > +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> > +		macvlan_count_rx(q->vlan, 0, false, false);
> > +		kfree_skb(skb);
> > +		return -EFAULT;
> > +	}
> > +
> > +	skb_set_network_header(skb, ETH_HLEN);
> > +
> > +	macvlan_start_xmit(skb, q->vlan->dev);
> > +
> > +	return count;
> > +}
> > +
> 
> I am surprised there's no GSO support.  Would it be a good idea to share
> code with tun driver? That already has GSO ...

The driver still only does the minimum feature set to get things going.
GSO is an obvious extension, but I wanted the code to be as simple
as possible to find all the basic bugs before we do anything fancy.

> Also, network header pointer seems off for vlan packets?

That may well be, I haven't tried vlan. What do you think it should do
then?

> > +/*
> > + * provide compatibility with generic tun/tap interface
> > + */
> > +static long macvtap_ioctl(struct file *file, unsigned int cmd,
> > +			  unsigned long arg)
> > +{
> 
> All of these seem to be stubs, and tun has many more that you didn't
> stub out. So, why do you bother to support any ioctls at all?

Again, minimum features to get things going. qemu fails to open
the device if these ioctls are not implemented, but any of the
more advanced features are left out.

Thansk for the review,

	Arnd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Bridge] [PATCH 3/3] net: macvtap driver
@ 2010-01-28 20:18       ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-28 20:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Herbert Xu, netdev, bridge, linux-kernel, Or Gerlitz, David Miller

On Thursday 28 January 2010, Michael S. Tsirkin wrote:
> On Wed, Jan 27, 2010 at 10:09:27PM +0100, Arnd Bergmann wrote:
> > +static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
> > +{
> > +	rcu_read_lock_bh();
> > +	return rcu_dereference(file->private_data);
> > +}
> > +
> > +static inline void macvtap_file_put_queue(void)
> > +{
> > +	rcu_read_unlock_bh();
> > +}
> > +
> 
> I find such wrappers around rcu obscure this,
> already sufficiently complex, pattern.
> Might be just me.

Obviously I find them useful here, but if more people feel the
same as you, I'll just open-code them.

> > +static int macvtap_open(struct inode *inode, struct file *file)
> > +{
> > +	struct net *net = current->nsproxy->net_ns;
> > +	struct net_device *dev = dev_get_by_index(net, iminor(inode));
> > +	struct macvtap_queue *q;
> > +	int err;
> > +
> 
> This seems to keep reference to device as long as character device is
> open, which, if I understand correctly, will start printing error
> messages to kernel log about once a second if you try to remove the
> device.
> 
> I suspect the best way to fix this issue would be to use some kind
> of notifier so that macvtap can disconnect on device removal.

I think I'm just missing the put in the open function, the code
already handles the netif and the file disappearing independently.

Thanks for spotting this one, I'll fix that in the next post.

> > +	skb_reserve(skb, NET_IP_ALIGN);
> > +	skb_put(skb, count);
> > +
> > +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> > +		macvlan_count_rx(q->vlan, 0, false, false);
> > +		kfree_skb(skb);
> > +		return -EFAULT;
> > +	}
> > +
> > +	skb_set_network_header(skb, ETH_HLEN);
> > +
> > +	macvlan_start_xmit(skb, q->vlan->dev);
> > +
> > +	return count;
> > +}
> > +
> 
> I am surprised there's no GSO support.  Would it be a good idea to share
> code with tun driver? That already has GSO ...

The driver still only does the minimum feature set to get things going.
GSO is an obvious extension, but I wanted the code to be as simple
as possible to find all the basic bugs before we do anything fancy.

> Also, network header pointer seems off for vlan packets?

That may well be, I haven't tried vlan. What do you think it should do
then?

> > +/*
> > + * provide compatibility with generic tun/tap interface
> > + */
> > +static long macvtap_ioctl(struct file *file, unsigned int cmd,
> > +			  unsigned long arg)
> > +{
> 
> All of these seem to be stubs, and tun has many more that you didn't
> stub out. So, why do you bother to support any ioctls at all?

Again, minimum features to get things going. qemu fails to open
the device if these ioctls are not implemented, but any of the
more advanced features are left out.

Thansk for the review,

	Arnd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/3] net: maintain namespace isolation between vlan and real device
  2010-01-27 10:05   ` [Bridge] " Arnd Bergmann
@ 2010-01-29  5:33     ` David Miller
  -1 siblings, 0 replies; 63+ messages in thread
From: David Miller @ 2010-01-29  5:33 UTC (permalink / raw)
  To: arnd
  Cc: shemminger, kaber, mst, herbert, ogerlitz, netdev, bridge, linux-kernel

From: Arnd Bergmann <arnd@arndb.de>
Date: Wed, 27 Jan 2010 11:05:15 +0100

> + * skb_dev_set -- assign a buffer to a new device

My english is terrible, but I think this should be
"assign a new device to a buffer".

If you agree, please fix this up when you resubmit patches #1 and #2
along with the fix you already plan to make to patch #3.

Thanks!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Bridge] [PATCH 1/3] net: maintain namespace isolation between vlan and real device
@ 2010-01-29  5:33     ` David Miller
  0 siblings, 0 replies; 63+ messages in thread
From: David Miller @ 2010-01-29  5:33 UTC (permalink / raw)
  To: arnd; +Cc: herbert, mst, netdev, bridge, linux-kernel, ogerlitz

From: Arnd Bergmann <arnd@arndb.de>
Date: Wed, 27 Jan 2010 11:05:15 +0100

> + * skb_dev_set -- assign a buffer to a new device

My english is terrible, but I think this should be
"assign a new device to a buffer".

If you agree, please fix this up when you resubmit patches #1 and #2
along with the fix you already plan to make to patch #3.

Thanks!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/3] net: maintain namespace isolation between vlan and real device
  2010-01-29  5:33     ` [Bridge] " David Miller
@ 2010-01-29 10:12       ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-29 10:12 UTC (permalink / raw)
  To: David Miller
  Cc: shemminger, kaber, mst, herbert, ogerlitz, netdev, bridge, linux-kernel

On Friday 29 January 2010, David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Wed, 27 Jan 2010 11:05:15 +0100
> 
> > + * skb_dev_set -- assign a buffer to a new device
> 
> My english is terrible, but I think this should be
> "assign a new device to a buffer".

Right, that seems clearer.

> If you agree, please fix this up when you resubmit patches #1 and #2
> along with the fix you already plan to make to patch #3.

Ok, will do.

Thanks,

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Bridge] [PATCH 1/3] net: maintain namespace isolation between vlan and real device
@ 2010-01-29 10:12       ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-29 10:12 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, mst, netdev, bridge, linux-kernel, ogerlitz

On Friday 29 January 2010, David Miller wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> Date: Wed, 27 Jan 2010 11:05:15 +0100
> 
> > + * skb_dev_set -- assign a buffer to a new device
> 
> My english is terrible, but I think this should be
> "assign a new device to a buffer".

Right, that seems clearer.

> If you agree, please fix this up when you resubmit patches #1 and #2
> along with the fix you already plan to make to patch #3.

Ok, will do.

Thanks,

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/3] net: macvtap driver
  2010-01-28 20:18       ` [Bridge] " Arnd Bergmann
@ 2010-01-29 11:21         ` Michael S. Tsirkin
  -1 siblings, 0 replies; 63+ messages in thread
From: Michael S. Tsirkin @ 2010-01-29 11:21 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David Miller, Stephen Hemminger, Patrick McHardy, Herbert Xu,
	Or Gerlitz, netdev, bridge, linux-kernel

On Thu, Jan 28, 2010 at 09:18:08PM +0100, Arnd Bergmann wrote:
> On Thursday 28 January 2010, Michael S. Tsirkin wrote:
> > On Wed, Jan 27, 2010 at 10:09:27PM +0100, Arnd Bergmann wrote:
> > > +static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
> > > +{
> > > +	rcu_read_lock_bh();
> > > +	return rcu_dereference(file->private_data);
> > > +}
> > > +
> > > +static inline void macvtap_file_put_queue(void)
> > > +{
> > > +	rcu_read_unlock_bh();
> > > +}
> > > +
> > 
> > I find such wrappers around rcu obscure this,
> > already sufficiently complex, pattern.
> > Might be just me.
> 
> Obviously I find them useful here, but if more people feel the
> same as you, I'll just open-code them.
> 
> > > +static int macvtap_open(struct inode *inode, struct file *file)
> > > +{
> > > +	struct net *net = current->nsproxy->net_ns;
> > > +	struct net_device *dev = dev_get_by_index(net, iminor(inode));
> > > +	struct macvtap_queue *q;
> > > +	int err;
> > > +
> > 
> > This seems to keep reference to device as long as character device is
> > open, which, if I understand correctly, will start printing error
> > messages to kernel log about once a second if you try to remove the
> > device.
> > 
> > I suspect the best way to fix this issue would be to use some kind
> > of notifier so that macvtap can disconnect on device removal.
> 
> I think I'm just missing the put in the open function, the code
> already handles the netif and the file disappearing independently.
> 
> Thanks for spotting this one, I'll fix that in the next post.
> 
> > > +	skb_reserve(skb, NET_IP_ALIGN);
> > > +	skb_put(skb, count);
> > > +
> > > +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> > > +		macvlan_count_rx(q->vlan, 0, false, false);
> > > +		kfree_skb(skb);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	skb_set_network_header(skb, ETH_HLEN);
> > > +
> > > +	macvlan_start_xmit(skb, q->vlan->dev);
> > > +
> > > +	return count;
> > > +}
> > > +
> > 
> > I am surprised there's no GSO support.  Would it be a good idea to share
> > code with tun driver? That already has GSO ...
> 
> The driver still only does the minimum feature set to get things going.
> GSO is an obvious extension, but I wanted the code to be as simple
> as possible to find all the basic bugs before we do anything fancy.
> 
> > Also, network header pointer seems off for vlan packets?
> 
> That may well be, I haven't tried vlan. What do you think it should do
> then?

Look at eth_type for a more complete packet parsing.

> > > +/*
> > > + * provide compatibility with generic tun/tap interface
> > > + */
> > > +static long macvtap_ioctl(struct file *file, unsigned int cmd,
> > > +			  unsigned long arg)
> > > +{
> > 
> > All of these seem to be stubs, and tun has many more that you didn't
> > stub out. So, why do you bother to support any ioctls at all?
> 
> Again, minimum features to get things going. qemu fails to open
> the device if these ioctls are not implemented, but any of the
> more advanced features are left out.

This is strange, could be application bug.  E.g. send buf size is
relatively new and apps should handle failure gracefully.  IMO,
returning success and ignoring the value is not a good idea.  How about
we just fix qemu?  What about other apps?

> Thansk for the review,
> 
> 	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Bridge] [PATCH 3/3] net: macvtap driver
@ 2010-01-29 11:21         ` Michael S. Tsirkin
  0 siblings, 0 replies; 63+ messages in thread
From: Michael S. Tsirkin @ 2010-01-29 11:21 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Herbert Xu, netdev, bridge, linux-kernel, Or Gerlitz, David Miller

On Thu, Jan 28, 2010 at 09:18:08PM +0100, Arnd Bergmann wrote:
> On Thursday 28 January 2010, Michael S. Tsirkin wrote:
> > On Wed, Jan 27, 2010 at 10:09:27PM +0100, Arnd Bergmann wrote:
> > > +static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
> > > +{
> > > +	rcu_read_lock_bh();
> > > +	return rcu_dereference(file->private_data);
> > > +}
> > > +
> > > +static inline void macvtap_file_put_queue(void)
> > > +{
> > > +	rcu_read_unlock_bh();
> > > +}
> > > +
> > 
> > I find such wrappers around rcu obscure this,
> > already sufficiently complex, pattern.
> > Might be just me.
> 
> Obviously I find them useful here, but if more people feel the
> same as you, I'll just open-code them.
> 
> > > +static int macvtap_open(struct inode *inode, struct file *file)
> > > +{
> > > +	struct net *net = current->nsproxy->net_ns;
> > > +	struct net_device *dev = dev_get_by_index(net, iminor(inode));
> > > +	struct macvtap_queue *q;
> > > +	int err;
> > > +
> > 
> > This seems to keep reference to device as long as character device is
> > open, which, if I understand correctly, will start printing error
> > messages to kernel log about once a second if you try to remove the
> > device.
> > 
> > I suspect the best way to fix this issue would be to use some kind
> > of notifier so that macvtap can disconnect on device removal.
> 
> I think I'm just missing the put in the open function, the code
> already handles the netif and the file disappearing independently.
> 
> Thanks for spotting this one, I'll fix that in the next post.
> 
> > > +	skb_reserve(skb, NET_IP_ALIGN);
> > > +	skb_put(skb, count);
> > > +
> > > +	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> > > +		macvlan_count_rx(q->vlan, 0, false, false);
> > > +		kfree_skb(skb);
> > > +		return -EFAULT;
> > > +	}
> > > +
> > > +	skb_set_network_header(skb, ETH_HLEN);
> > > +
> > > +	macvlan_start_xmit(skb, q->vlan->dev);
> > > +
> > > +	return count;
> > > +}
> > > +
> > 
> > I am surprised there's no GSO support.  Would it be a good idea to share
> > code with tun driver? That already has GSO ...
> 
> The driver still only does the minimum feature set to get things going.
> GSO is an obvious extension, but I wanted the code to be as simple
> as possible to find all the basic bugs before we do anything fancy.
> 
> > Also, network header pointer seems off for vlan packets?
> 
> That may well be, I haven't tried vlan. What do you think it should do
> then?

Look at eth_type for a more complete packet parsing.

> > > +/*
> > > + * provide compatibility with generic tun/tap interface
> > > + */
> > > +static long macvtap_ioctl(struct file *file, unsigned int cmd,
> > > +			  unsigned long arg)
> > > +{
> > 
> > All of these seem to be stubs, and tun has many more that you didn't
> > stub out. So, why do you bother to support any ioctls at all?
> 
> Again, minimum features to get things going. qemu fails to open
> the device if these ioctls are not implemented, but any of the
> more advanced features are left out.

This is strange, could be application bug.  E.g. send buf size is
relatively new and apps should handle failure gracefully.  IMO,
returning success and ignoring the value is not a good idea.  How about
we just fix qemu?  What about other apps?

> Thansk for the review,
> 
> 	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/3] net: macvtap driver
  2010-01-29 11:21         ` [Bridge] " Michael S. Tsirkin
@ 2010-01-29 19:49           ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-29 19:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Stephen Hemminger, Patrick McHardy, Herbert Xu,
	Or Gerlitz, netdev, bridge, linux-kernel

On Friday 29 January 2010, Michael S. Tsirkin wrote:
> > That may well be, I haven't tried vlan. What do you think it should do
> > then?
> 
> Look at eth_type for a more complete packet parsing.

ok. I initially called that but it crashed because the skb was not initialized
properly at that point. I'll have a look.

> > > > +/*
> > > > + * provide compatibility with generic tun/tap interface
> > > > + */
> > > > +static long macvtap_ioctl(struct file *file, unsigned int cmd,
> > > > +                   unsigned long arg)
> > > > +{
> > > 
> > > All of these seem to be stubs, and tun has many more that you didn't
> > > stub out. So, why do you bother to support any ioctls at all?
> > 
> > Again, minimum features to get things going. qemu fails to open
> > the device if these ioctls are not implemented, but any of the
> > more advanced features are left out.
> 
> This is strange, could be application bug.  E.g. send buf size is
> relatively new and apps should handle failure gracefully.  IMO,
> returning success and ignoring the value is not a good idea.  How about
> we just fix qemu?  What about other apps?

Ok, I'll go through the ioctls again and make sure they behave correctly
they way you said. I haven't tried against against anything but qemu and
cat.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Bridge] [PATCH 3/3] net: macvtap driver
@ 2010-01-29 19:49           ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-29 19:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Herbert Xu, netdev, bridge, linux-kernel, Or Gerlitz, David Miller

On Friday 29 January 2010, Michael S. Tsirkin wrote:
> > That may well be, I haven't tried vlan. What do you think it should do
> > then?
> 
> Look at eth_type for a more complete packet parsing.

ok. I initially called that but it crashed because the skb was not initialized
properly at that point. I'll have a look.

> > > > +/*
> > > > + * provide compatibility with generic tun/tap interface
> > > > + */
> > > > +static long macvtap_ioctl(struct file *file, unsigned int cmd,
> > > > +                   unsigned long arg)
> > > > +{
> > > 
> > > All of these seem to be stubs, and tun has many more that you didn't
> > > stub out. So, why do you bother to support any ioctls at all?
> > 
> > Again, minimum features to get things going. qemu fails to open
> > the device if these ioctls are not implemented, but any of the
> > more advanced features are left out.
> 
> This is strange, could be application bug.  E.g. send buf size is
> relatively new and apps should handle failure gracefully.  IMO,
> returning success and ignoring the value is not a good idea.  How about
> we just fix qemu?  What about other apps?

Ok, I'll go through the ioctls again and make sure they behave correctly
they way you said. I haven't tried against against anything but qemu and
cat.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 0/3 v4] macvtap driver
  2010-01-27 10:04 ` [Bridge] " Arnd Bergmann
@ 2010-01-30 22:22   ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:22 UTC (permalink / raw)
  To: David Miller
  Cc: Stephen Hemminger, Patrick McHardy, Michael S. Tsirkin,
	Herbert Xu, Or Gerlitz, netdev, bridge, linux-kernel,
	virtualization

This is the fourth version of the macvtap driver,
based on the comments I got for the last version
I got a few days ago. Very few changes:

* release netdev in chardev open function so
  we can destroy it properly.
* Implement TUNSETSNDBUF
* fix sleeping call in rcu_read_lock
* Fix comment in namespace isolation patch
* Fix small context difference to make it apply
  to net-next

I can't really test here while travelling, so please
give it a go if you're interested in this driver.

	Arnd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 0/3 v4] macvtap driver
  2010-01-27 10:04 ` [Bridge] " Arnd Bergmann
                   ` (4 preceding siblings ...)
  (?)
@ 2010-01-30 22:22 ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:22 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel,
	virtualization, Stephen Hemminger

This is the fourth version of the macvtap driver,
based on the comments I got for the last version
I got a few days ago. Very few changes:

* release netdev in chardev open function so
  we can destroy it properly.
* Implement TUNSETSNDBUF
* fix sleeping call in rcu_read_lock
* Fix comment in namespace isolation patch
* Fix small context difference to make it apply
  to net-next

I can't really test here while travelling, so please
give it a go if you're interested in this driver.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [Bridge] [PATCH 0/3 v4] macvtap driver
@ 2010-01-30 22:22   ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:22 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel,
	virtualization, Or Gerlitz

This is the fourth version of the macvtap driver,
based on the comments I got for the last version
I got a few days ago. Very few changes:

* release netdev in chardev open function so
  we can destroy it properly.
* Implement TUNSETSNDBUF
* fix sleeping call in rcu_read_lock
* Fix comment in namespace isolation patch
* Fix small context difference to make it apply
  to net-next

I can't really test here while travelling, so please
give it a go if you're interested in this driver.

	Arnd


^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 1/3] net: maintain namespace isolation between vlan and real device
  2010-01-30 22:22   ` [Bridge] " Arnd Bergmann
@ 2010-01-30 22:23     ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:23 UTC (permalink / raw)
  To: David Miller
  Cc: Stephen Hemminger, Patrick McHardy, Michael S. Tsirkin,
	Herbert Xu, Or Gerlitz, netdev, bridge, linux-kernel,
	virtualization

In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.

To make sure that classification stays within a single namespace,
this resets the potentially critical fields.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c     |    2 +-
 include/linux/netdevice.h |    9 +++++++++
 net/8021q/vlan_dev.c      |    2 +-
 net/core/dev.c            |   35 +++++++++++++++++++++++++++++++----
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fa0dc51..d32e0bd 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -269,7 +269,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 xmit_world:
-	skb->dev = vlan->lowerdev;
+	skb_set_dev(skb, vlan->lowerdev);
 	return dev_queue_xmit(skb);
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 93a32a5..622ba5a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1004,6 +1004,15 @@ static inline bool netdev_uses_dsa_tags(struct net_device *dev)
 	return 0;
 }
 
+#ifndef CONFIG_NET_NS
+static inline void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb->dev = dev;
+}
+#else /* CONFIG_NET_NS */
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev);
+#endif
+
 static inline bool netdev_uses_trailer_tags(struct net_device *dev)
 {
 #ifdef CONFIG_NET_DSA_TAG_TRAILER
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index a9e1f17..9e83272 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -322,7 +322,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
 	}
 
 
-	skb->dev = vlan_dev_info(dev)->real_dev;
+	skb_set_dev(skb, vlan_dev_info(dev)->real_dev);
 	len = skb->len;
 	ret = dev_queue_xmit(skb);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2cba5c5..94c1eee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1448,13 +1448,10 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 	if (skb->len > (dev->mtu + dev->hard_header_len))
 		return NET_RX_DROP;
 
-	skb_dst_drop(skb);
+	skb_set_dev(skb, dev);
 	skb->tstamp.tv64 = 0;
 	skb->pkt_type = PACKET_HOST;
 	skb->protocol = eth_type_trans(skb, dev);
-	skb->mark = 0;
-	secpath_reset(skb);
-	nf_reset(skb);
 	return netif_rx(skb);
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
@@ -1614,6 +1611,36 @@ static bool dev_can_checksum(struct net_device *dev, struct sk_buff *skb)
 	return false;
 }
 
+/**
+ * skb_dev_set -- assign a new device to a buffer
+ * @skb: buffer for the new device
+ * @dev: network device
+ *
+ * If an skb is owned by a device already, we have to reset
+ * all data private to the namespace a device belongs to
+ * before assigning it a new device.
+ */
+#ifdef CONFIG_NET_NS
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb_dst_drop(skb);
+	if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) {
+		secpath_reset(skb);
+		nf_reset(skb);
+		skb_init_secmark(skb);
+		skb->mark = 0;
+		skb->priority = 0;
+		skb->nf_trace = 0;
+		skb->ipvs_property = 0;
+#ifdef CONFIG_NET_SCHED
+		skb->tc_index = 0;
+#endif
+	}
+	skb->dev = dev;
+}
+EXPORT_SYMBOL(skb_set_dev);
+#endif /* CONFIG_NET_NS */
+
 /*
  * Invalidate hardware checksum when packet is to be mangled, and
  * complete checksum manually on outgoing path.
-- 
1.6.3.3



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 1/3] net: maintain namespace isolation between vlan and real device
  2010-01-30 22:22   ` [Bridge] " Arnd Bergmann
  (?)
@ 2010-01-30 22:23   ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:23 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel,
	virtualization, Stephen Hemminger

In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.

To make sure that classification stays within a single namespace,
this resets the potentially critical fields.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c     |    2 +-
 include/linux/netdevice.h |    9 +++++++++
 net/8021q/vlan_dev.c      |    2 +-
 net/core/dev.c            |   35 +++++++++++++++++++++++++++++++----
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fa0dc51..d32e0bd 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -269,7 +269,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 xmit_world:
-	skb->dev = vlan->lowerdev;
+	skb_set_dev(skb, vlan->lowerdev);
 	return dev_queue_xmit(skb);
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 93a32a5..622ba5a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1004,6 +1004,15 @@ static inline bool netdev_uses_dsa_tags(struct net_device *dev)
 	return 0;
 }
 
+#ifndef CONFIG_NET_NS
+static inline void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb->dev = dev;
+}
+#else /* CONFIG_NET_NS */
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev);
+#endif
+
 static inline bool netdev_uses_trailer_tags(struct net_device *dev)
 {
 #ifdef CONFIG_NET_DSA_TAG_TRAILER
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index a9e1f17..9e83272 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -322,7 +322,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
 	}
 
 
-	skb->dev = vlan_dev_info(dev)->real_dev;
+	skb_set_dev(skb, vlan_dev_info(dev)->real_dev);
 	len = skb->len;
 	ret = dev_queue_xmit(skb);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2cba5c5..94c1eee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1448,13 +1448,10 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 	if (skb->len > (dev->mtu + dev->hard_header_len))
 		return NET_RX_DROP;
 
-	skb_dst_drop(skb);
+	skb_set_dev(skb, dev);
 	skb->tstamp.tv64 = 0;
 	skb->pkt_type = PACKET_HOST;
 	skb->protocol = eth_type_trans(skb, dev);
-	skb->mark = 0;
-	secpath_reset(skb);
-	nf_reset(skb);
 	return netif_rx(skb);
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
@@ -1614,6 +1611,36 @@ static bool dev_can_checksum(struct net_device *dev, struct sk_buff *skb)
 	return false;
 }
 
+/**
+ * skb_dev_set -- assign a new device to a buffer
+ * @skb: buffer for the new device
+ * @dev: network device
+ *
+ * If an skb is owned by a device already, we have to reset
+ * all data private to the namespace a device belongs to
+ * before assigning it a new device.
+ */
+#ifdef CONFIG_NET_NS
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb_dst_drop(skb);
+	if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) {
+		secpath_reset(skb);
+		nf_reset(skb);
+		skb_init_secmark(skb);
+		skb->mark = 0;
+		skb->priority = 0;
+		skb->nf_trace = 0;
+		skb->ipvs_property = 0;
+#ifdef CONFIG_NET_SCHED
+		skb->tc_index = 0;
+#endif
+	}
+	skb->dev = dev;
+}
+EXPORT_SYMBOL(skb_set_dev);
+#endif /* CONFIG_NET_NS */
+
 /*
  * Invalidate hardware checksum when packet is to be mangled, and
  * complete checksum manually on outgoing path.
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Bridge] [PATCH 1/3] net: maintain namespace isolation between vlan and real device
@ 2010-01-30 22:23     ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:23 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel,
	virtualization, Or Gerlitz

In the vlan and macvlan drivers, the start_xmit function forwards
data to the dev_queue_xmit function for another device, which may
potentially belong to a different namespace.

To make sure that classification stays within a single namespace,
this resets the potentially critical fields.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c     |    2 +-
 include/linux/netdevice.h |    9 +++++++++
 net/8021q/vlan_dev.c      |    2 +-
 net/core/dev.c            |   35 +++++++++++++++++++++++++++++++----
 4 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fa0dc51..d32e0bd 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -269,7 +269,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 xmit_world:
-	skb->dev = vlan->lowerdev;
+	skb_set_dev(skb, vlan->lowerdev);
 	return dev_queue_xmit(skb);
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 93a32a5..622ba5a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1004,6 +1004,15 @@ static inline bool netdev_uses_dsa_tags(struct net_device *dev)
 	return 0;
 }
 
+#ifndef CONFIG_NET_NS
+static inline void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb->dev = dev;
+}
+#else /* CONFIG_NET_NS */
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev);
+#endif
+
 static inline bool netdev_uses_trailer_tags(struct net_device *dev)
 {
 #ifdef CONFIG_NET_DSA_TAG_TRAILER
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index a9e1f17..9e83272 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -322,7 +322,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff *skb,
 	}
 
 
-	skb->dev = vlan_dev_info(dev)->real_dev;
+	skb_set_dev(skb, vlan_dev_info(dev)->real_dev);
 	len = skb->len;
 	ret = dev_queue_xmit(skb);
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 2cba5c5..94c1eee 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1448,13 +1448,10 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 	if (skb->len > (dev->mtu + dev->hard_header_len))
 		return NET_RX_DROP;
 
-	skb_dst_drop(skb);
+	skb_set_dev(skb, dev);
 	skb->tstamp.tv64 = 0;
 	skb->pkt_type = PACKET_HOST;
 	skb->protocol = eth_type_trans(skb, dev);
-	skb->mark = 0;
-	secpath_reset(skb);
-	nf_reset(skb);
 	return netif_rx(skb);
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
@@ -1614,6 +1611,36 @@ static bool dev_can_checksum(struct net_device *dev, struct sk_buff *skb)
 	return false;
 }
 
+/**
+ * skb_dev_set -- assign a new device to a buffer
+ * @skb: buffer for the new device
+ * @dev: network device
+ *
+ * If an skb is owned by a device already, we have to reset
+ * all data private to the namespace a device belongs to
+ * before assigning it a new device.
+ */
+#ifdef CONFIG_NET_NS
+void skb_set_dev(struct sk_buff *skb, struct net_device *dev)
+{
+	skb_dst_drop(skb);
+	if (skb->dev && !net_eq(dev_net(skb->dev), dev_net(dev))) {
+		secpath_reset(skb);
+		nf_reset(skb);
+		skb_init_secmark(skb);
+		skb->mark = 0;
+		skb->priority = 0;
+		skb->nf_trace = 0;
+		skb->ipvs_property = 0;
+#ifdef CONFIG_NET_SCHED
+		skb->tc_index = 0;
+#endif
+	}
+	skb->dev = dev;
+}
+EXPORT_SYMBOL(skb_set_dev);
+#endif /* CONFIG_NET_NS */
+
 /*
  * Invalidate hardware checksum when packet is to be mangled, and
  * complete checksum manually on outgoing path.
-- 
1.6.3.3



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 2/3] macvlan: allow multiple driver backends
  2010-01-30 22:22   ` [Bridge] " Arnd Bergmann
@ 2010-01-30 22:23     ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:23 UTC (permalink / raw)
  To: David Miller
  Cc: Stephen Hemminger, Patrick McHardy, Michael S. Tsirkin,
	Herbert Xu, Or Gerlitz, netdev, bridge, linux-kernel,
	virtualization

This makes it possible to hook into the macvlan driver
from another kernel module. In particular, the goal is
to extend it with the macvtap backend that provides
a tun/tap compatible interface directly on the macvlan
device.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c      |  113 +++++++++++++++++++-------------------------
 include/linux/if_macvlan.h |   70 +++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 64 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index d32e0bd..40faa36 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -39,31 +39,6 @@ struct macvlan_port {
 	struct list_head	vlans;
 };
 
-/**
- *	struct macvlan_rx_stats - MACVLAN percpu rx stats
- *	@rx_packets: number of received packets
- *	@rx_bytes: number of received bytes
- *	@multicast: number of received multicast packets
- *	@rx_errors: number of errors
- */
-struct macvlan_rx_stats {
-	unsigned long rx_packets;
-	unsigned long rx_bytes;
-	unsigned long multicast;
-	unsigned long rx_errors;
-};
-
-struct macvlan_dev {
-	struct net_device	*dev;
-	struct list_head	list;
-	struct hlist_node	hlist;
-	struct macvlan_port	*port;
-	struct net_device	*lowerdev;
-	struct macvlan_rx_stats *rx_stats;
-	enum macvlan_mode	mode;
-};
-
-
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
 					       const unsigned char *addr)
 {
@@ -118,31 +93,17 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
 	return 0;
 }
 
-static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-				    unsigned int len, bool success,
-				    bool multicast)
-{
-	struct macvlan_rx_stats *rx_stats;
-
-	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
-	if (likely(success)) {
-		rx_stats->rx_packets++;;
-		rx_stats->rx_bytes += len;
-		if (multicast)
-			rx_stats->multicast++;
-	} else {
-		rx_stats->rx_errors++;
-	}
-}
 
-static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
+static int macvlan_broadcast_one(struct sk_buff *skb,
+				 const struct macvlan_dev *vlan,
 				 const struct ethhdr *eth, bool local)
 {
+	struct net_device *dev = vlan->dev;
 	if (!skb)
 		return NET_RX_DROP;
 
 	if (local)
-		return dev_forward_skb(dev, skb);
+		return vlan->forward(dev, skb);
 
 	skb->dev = dev;
 	if (!compare_ether_addr_64bits(eth->h_dest,
@@ -151,7 +112,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
 	else
 		skb->pkt_type = PACKET_MULTICAST;
 
-	return netif_rx(skb);
+	return vlan->receive(skb);
 }
 
 static void macvlan_broadcast(struct sk_buff *skb,
@@ -175,7 +136,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
 				continue;
 
 			nskb = skb_clone(skb, GFP_ATOMIC);
-			err = macvlan_broadcast_one(nskb, vlan->dev, eth,
+			err = macvlan_broadcast_one(nskb, vlan, eth,
 					 mode == MACVLAN_MODE_BRIDGE);
 			macvlan_count_rx(vlan, skb->len + ETH_HLEN,
 					 err == NET_RX_SUCCESS, 1);
@@ -238,7 +199,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 	skb->dev = dev;
 	skb->pkt_type = PACKET_HOST;
 
-	netif_rx(skb);
+	vlan->receive(skb);
 	return NULL;
 }
 
@@ -260,7 +221,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 		dest = macvlan_hash_lookup(port, eth->h_dest);
 		if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
 			unsigned int length = skb->len + ETH_HLEN;
-			int ret = dev_forward_skb(dest->dev, skb);
+			int ret = dest->forward(dest->dev, skb);
 			macvlan_count_rx(dest, length,
 					 ret == NET_RX_SUCCESS, 0);
 
@@ -273,8 +234,8 @@ xmit_world:
 	return dev_queue_xmit(skb);
 }
 
-static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
-				      struct net_device *dev)
+netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+			       struct net_device *dev)
 {
 	int i = skb_get_queue_mapping(skb);
 	struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
@@ -290,6 +251,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(macvlan_start_xmit);
 
 static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
 			       unsigned short type, const void *daddr,
@@ -623,8 +585,11 @@ static int macvlan_get_tx_queues(struct net *net,
 	return 0;
 }
 
-static int macvlan_newlink(struct net *src_net, struct net_device *dev,
-			   struct nlattr *tb[], struct nlattr *data[])
+int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[],
+			   int (*receive)(struct sk_buff *skb),
+			   int (*forward)(struct net_device *dev,
+					  struct sk_buff *skb))
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port;
@@ -664,6 +629,8 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	vlan->lowerdev = lowerdev;
 	vlan->dev      = dev;
 	vlan->port     = port;
+	vlan->receive  = receive;
+	vlan->forward  = forward;
 
 	vlan->mode     = MACVLAN_MODE_VEPA;
 	if (data && data[IFLA_MACVLAN_MODE])
@@ -677,8 +644,17 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	netif_stacked_transfer_operstate(lowerdev, dev);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_common_newlink);
 
-static void macvlan_dellink(struct net_device *dev, struct list_head *head)
+static int macvlan_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[])
+{
+	return macvlan_common_newlink(src_net, dev, tb, data,
+				      netif_rx,
+				      dev_forward_skb);
+}
+
+void macvlan_dellink(struct net_device *dev, struct list_head *head)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port = vlan->port;
@@ -689,6 +665,7 @@ static void macvlan_dellink(struct net_device *dev, struct list_head *head)
 	if (list_empty(&port->vlans))
 		macvlan_port_destroy(port->dev);
 }
+EXPORT_SYMBOL_GPL(macvlan_dellink);
 
 static int macvlan_changelink(struct net_device *dev,
 		struct nlattr *tb[], struct nlattr *data[])
@@ -720,19 +697,27 @@ static const struct nla_policy macvlan_policy[IFLA_MACVLAN_MAX + 1] = {
 	[IFLA_MACVLAN_MODE] = { .type = NLA_U32 },
 };
 
-static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
+int macvlan_link_register(struct rtnl_link_ops *ops)
+{
+	/* common fields */
+	ops->priv_size		= sizeof(struct macvlan_dev);
+	ops->get_tx_queues	= macvlan_get_tx_queues;
+	ops->setup		= macvlan_setup;
+	ops->validate		= macvlan_validate;
+	ops->maxtype		= IFLA_MACVLAN_MAX;
+	ops->policy		= macvlan_policy;
+	ops->changelink		= macvlan_changelink;
+	ops->get_size		= macvlan_get_size;
+	ops->fill_info		= macvlan_fill_info;
+
+	return rtnl_link_register(ops);
+};
+EXPORT_SYMBOL_GPL(macvlan_link_register);
+
+static struct rtnl_link_ops macvlan_link_ops = {
 	.kind		= "macvlan",
-	.priv_size	= sizeof(struct macvlan_dev),
-	.get_tx_queues  = macvlan_get_tx_queues,
-	.setup		= macvlan_setup,
-	.validate	= macvlan_validate,
 	.newlink	= macvlan_newlink,
 	.dellink	= macvlan_dellink,
-	.maxtype	= IFLA_MACVLAN_MAX,
-	.policy		= macvlan_policy,
-	.changelink	= macvlan_changelink,
-	.get_size	= macvlan_get_size,
-	.fill_info	= macvlan_fill_info,
 };
 
 static int macvlan_device_event(struct notifier_block *unused,
@@ -761,7 +746,7 @@ static int macvlan_device_event(struct notifier_block *unused,
 		break;
 	case NETDEV_UNREGISTER:
 		list_for_each_entry_safe(vlan, next, &port->vlans, list)
-			macvlan_dellink(vlan->dev, NULL);
+			vlan->dev->rtnl_link_ops->dellink(vlan->dev, NULL);
 		break;
 	}
 	return NOTIFY_DONE;
@@ -778,7 +763,7 @@ static int __init macvlan_init_module(void)
 	register_netdevice_notifier(&macvlan_notifier_block);
 	macvlan_handle_frame_hook = macvlan_handle_frame;
 
-	err = rtnl_link_register(&macvlan_link_ops);
+	err = macvlan_link_register(&macvlan_link_ops);
 	if (err < 0)
 		goto err1;
 	return 0;
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 5f200ba..9a11544 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -1,6 +1,76 @@
 #ifndef _LINUX_IF_MACVLAN_H
 #define _LINUX_IF_MACVLAN_H
 
+#include <linux/if_link.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <net/netlink.h>
+
+struct macvlan_port;
+struct macvtap_queue;
+
+/**
+ *	struct macvlan_rx_stats - MACVLAN percpu rx stats
+ *	@rx_packets: number of received packets
+ *	@rx_bytes: number of received bytes
+ *	@multicast: number of received multicast packets
+ *	@rx_errors: number of errors
+ */
+struct macvlan_rx_stats {
+	unsigned long rx_packets;
+	unsigned long rx_bytes;
+	unsigned long multicast;
+	unsigned long rx_errors;
+};
+
+struct macvlan_dev {
+	struct net_device	*dev;
+	struct list_head	list;
+	struct hlist_node	hlist;
+	struct macvlan_port	*port;
+	struct net_device	*lowerdev;
+	struct macvlan_rx_stats *rx_stats;
+	enum macvlan_mode	mode;
+	int (*receive)(struct sk_buff *skb);
+	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+};
+
+static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
+				    unsigned int len, bool success,
+				    bool multicast)
+{
+	struct macvlan_rx_stats *rx_stats;
+
+	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
+	if (likely(success)) {
+		rx_stats->rx_packets++;;
+		rx_stats->rx_bytes += len;
+		if (multicast)
+			rx_stats->multicast++;
+	} else {
+		rx_stats->rx_errors++;
+	}
+}
+
+extern int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+				  struct nlattr *tb[], struct nlattr *data[],
+				  int (*receive)(struct sk_buff *skb),
+				  int (*forward)(struct net_device *dev,
+						 struct sk_buff *skb));
+
+extern void macvlan_count_rx(const struct macvlan_dev *vlan,
+			     unsigned int len, bool success,
+			     bool multicast);
+
+extern void macvlan_dellink(struct net_device *dev, struct list_head *head);
+
+extern int macvlan_link_register(struct rtnl_link_ops *ops);
+
+extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+				      struct net_device *dev);
+
+
 extern struct sk_buff *(*macvlan_handle_frame_hook)(struct sk_buff *);
 
 #endif /* _LINUX_IF_MACVLAN_H */
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 2/3] macvlan: allow multiple driver backends
  2010-01-30 22:22   ` [Bridge] " Arnd Bergmann
                     ` (2 preceding siblings ...)
  (?)
@ 2010-01-30 22:23   ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:23 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel,
	virtualization, Stephen Hemminger

This makes it possible to hook into the macvlan driver
from another kernel module. In particular, the goal is
to extend it with the macvtap backend that provides
a tun/tap compatible interface directly on the macvlan
device.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c      |  113 +++++++++++++++++++-------------------------
 include/linux/if_macvlan.h |   70 +++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 64 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index d32e0bd..40faa36 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -39,31 +39,6 @@ struct macvlan_port {
 	struct list_head	vlans;
 };
 
-/**
- *	struct macvlan_rx_stats - MACVLAN percpu rx stats
- *	@rx_packets: number of received packets
- *	@rx_bytes: number of received bytes
- *	@multicast: number of received multicast packets
- *	@rx_errors: number of errors
- */
-struct macvlan_rx_stats {
-	unsigned long rx_packets;
-	unsigned long rx_bytes;
-	unsigned long multicast;
-	unsigned long rx_errors;
-};
-
-struct macvlan_dev {
-	struct net_device	*dev;
-	struct list_head	list;
-	struct hlist_node	hlist;
-	struct macvlan_port	*port;
-	struct net_device	*lowerdev;
-	struct macvlan_rx_stats *rx_stats;
-	enum macvlan_mode	mode;
-};
-
-
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
 					       const unsigned char *addr)
 {
@@ -118,31 +93,17 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
 	return 0;
 }
 
-static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-				    unsigned int len, bool success,
-				    bool multicast)
-{
-	struct macvlan_rx_stats *rx_stats;
-
-	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
-	if (likely(success)) {
-		rx_stats->rx_packets++;;
-		rx_stats->rx_bytes += len;
-		if (multicast)
-			rx_stats->multicast++;
-	} else {
-		rx_stats->rx_errors++;
-	}
-}
 
-static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
+static int macvlan_broadcast_one(struct sk_buff *skb,
+				 const struct macvlan_dev *vlan,
 				 const struct ethhdr *eth, bool local)
 {
+	struct net_device *dev = vlan->dev;
 	if (!skb)
 		return NET_RX_DROP;
 
 	if (local)
-		return dev_forward_skb(dev, skb);
+		return vlan->forward(dev, skb);
 
 	skb->dev = dev;
 	if (!compare_ether_addr_64bits(eth->h_dest,
@@ -151,7 +112,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
 	else
 		skb->pkt_type = PACKET_MULTICAST;
 
-	return netif_rx(skb);
+	return vlan->receive(skb);
 }
 
 static void macvlan_broadcast(struct sk_buff *skb,
@@ -175,7 +136,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
 				continue;
 
 			nskb = skb_clone(skb, GFP_ATOMIC);
-			err = macvlan_broadcast_one(nskb, vlan->dev, eth,
+			err = macvlan_broadcast_one(nskb, vlan, eth,
 					 mode == MACVLAN_MODE_BRIDGE);
 			macvlan_count_rx(vlan, skb->len + ETH_HLEN,
 					 err == NET_RX_SUCCESS, 1);
@@ -238,7 +199,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 	skb->dev = dev;
 	skb->pkt_type = PACKET_HOST;
 
-	netif_rx(skb);
+	vlan->receive(skb);
 	return NULL;
 }
 
@@ -260,7 +221,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 		dest = macvlan_hash_lookup(port, eth->h_dest);
 		if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
 			unsigned int length = skb->len + ETH_HLEN;
-			int ret = dev_forward_skb(dest->dev, skb);
+			int ret = dest->forward(dest->dev, skb);
 			macvlan_count_rx(dest, length,
 					 ret == NET_RX_SUCCESS, 0);
 
@@ -273,8 +234,8 @@ xmit_world:
 	return dev_queue_xmit(skb);
 }
 
-static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
-				      struct net_device *dev)
+netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+			       struct net_device *dev)
 {
 	int i = skb_get_queue_mapping(skb);
 	struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
@@ -290,6 +251,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(macvlan_start_xmit);
 
 static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
 			       unsigned short type, const void *daddr,
@@ -623,8 +585,11 @@ static int macvlan_get_tx_queues(struct net *net,
 	return 0;
 }
 
-static int macvlan_newlink(struct net *src_net, struct net_device *dev,
-			   struct nlattr *tb[], struct nlattr *data[])
+int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[],
+			   int (*receive)(struct sk_buff *skb),
+			   int (*forward)(struct net_device *dev,
+					  struct sk_buff *skb))
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port;
@@ -664,6 +629,8 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	vlan->lowerdev = lowerdev;
 	vlan->dev      = dev;
 	vlan->port     = port;
+	vlan->receive  = receive;
+	vlan->forward  = forward;
 
 	vlan->mode     = MACVLAN_MODE_VEPA;
 	if (data && data[IFLA_MACVLAN_MODE])
@@ -677,8 +644,17 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	netif_stacked_transfer_operstate(lowerdev, dev);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_common_newlink);
 
-static void macvlan_dellink(struct net_device *dev, struct list_head *head)
+static int macvlan_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[])
+{
+	return macvlan_common_newlink(src_net, dev, tb, data,
+				      netif_rx,
+				      dev_forward_skb);
+}
+
+void macvlan_dellink(struct net_device *dev, struct list_head *head)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port = vlan->port;
@@ -689,6 +665,7 @@ static void macvlan_dellink(struct net_device *dev, struct list_head *head)
 	if (list_empty(&port->vlans))
 		macvlan_port_destroy(port->dev);
 }
+EXPORT_SYMBOL_GPL(macvlan_dellink);
 
 static int macvlan_changelink(struct net_device *dev,
 		struct nlattr *tb[], struct nlattr *data[])
@@ -720,19 +697,27 @@ static const struct nla_policy macvlan_policy[IFLA_MACVLAN_MAX + 1] = {
 	[IFLA_MACVLAN_MODE] = { .type = NLA_U32 },
 };
 
-static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
+int macvlan_link_register(struct rtnl_link_ops *ops)
+{
+	/* common fields */
+	ops->priv_size		= sizeof(struct macvlan_dev);
+	ops->get_tx_queues	= macvlan_get_tx_queues;
+	ops->setup		= macvlan_setup;
+	ops->validate		= macvlan_validate;
+	ops->maxtype		= IFLA_MACVLAN_MAX;
+	ops->policy		= macvlan_policy;
+	ops->changelink		= macvlan_changelink;
+	ops->get_size		= macvlan_get_size;
+	ops->fill_info		= macvlan_fill_info;
+
+	return rtnl_link_register(ops);
+};
+EXPORT_SYMBOL_GPL(macvlan_link_register);
+
+static struct rtnl_link_ops macvlan_link_ops = {
 	.kind		= "macvlan",
-	.priv_size	= sizeof(struct macvlan_dev),
-	.get_tx_queues  = macvlan_get_tx_queues,
-	.setup		= macvlan_setup,
-	.validate	= macvlan_validate,
 	.newlink	= macvlan_newlink,
 	.dellink	= macvlan_dellink,
-	.maxtype	= IFLA_MACVLAN_MAX,
-	.policy		= macvlan_policy,
-	.changelink	= macvlan_changelink,
-	.get_size	= macvlan_get_size,
-	.fill_info	= macvlan_fill_info,
 };
 
 static int macvlan_device_event(struct notifier_block *unused,
@@ -761,7 +746,7 @@ static int macvlan_device_event(struct notifier_block *unused,
 		break;
 	case NETDEV_UNREGISTER:
 		list_for_each_entry_safe(vlan, next, &port->vlans, list)
-			macvlan_dellink(vlan->dev, NULL);
+			vlan->dev->rtnl_link_ops->dellink(vlan->dev, NULL);
 		break;
 	}
 	return NOTIFY_DONE;
@@ -778,7 +763,7 @@ static int __init macvlan_init_module(void)
 	register_netdevice_notifier(&macvlan_notifier_block);
 	macvlan_handle_frame_hook = macvlan_handle_frame;
 
-	err = rtnl_link_register(&macvlan_link_ops);
+	err = macvlan_link_register(&macvlan_link_ops);
 	if (err < 0)
 		goto err1;
 	return 0;
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 5f200ba..9a11544 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -1,6 +1,76 @@
 #ifndef _LINUX_IF_MACVLAN_H
 #define _LINUX_IF_MACVLAN_H
 
+#include <linux/if_link.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <net/netlink.h>
+
+struct macvlan_port;
+struct macvtap_queue;
+
+/**
+ *	struct macvlan_rx_stats - MACVLAN percpu rx stats
+ *	@rx_packets: number of received packets
+ *	@rx_bytes: number of received bytes
+ *	@multicast: number of received multicast packets
+ *	@rx_errors: number of errors
+ */
+struct macvlan_rx_stats {
+	unsigned long rx_packets;
+	unsigned long rx_bytes;
+	unsigned long multicast;
+	unsigned long rx_errors;
+};
+
+struct macvlan_dev {
+	struct net_device	*dev;
+	struct list_head	list;
+	struct hlist_node	hlist;
+	struct macvlan_port	*port;
+	struct net_device	*lowerdev;
+	struct macvlan_rx_stats *rx_stats;
+	enum macvlan_mode	mode;
+	int (*receive)(struct sk_buff *skb);
+	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+};
+
+static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
+				    unsigned int len, bool success,
+				    bool multicast)
+{
+	struct macvlan_rx_stats *rx_stats;
+
+	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
+	if (likely(success)) {
+		rx_stats->rx_packets++;;
+		rx_stats->rx_bytes += len;
+		if (multicast)
+			rx_stats->multicast++;
+	} else {
+		rx_stats->rx_errors++;
+	}
+}
+
+extern int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+				  struct nlattr *tb[], struct nlattr *data[],
+				  int (*receive)(struct sk_buff *skb),
+				  int (*forward)(struct net_device *dev,
+						 struct sk_buff *skb));
+
+extern void macvlan_count_rx(const struct macvlan_dev *vlan,
+			     unsigned int len, bool success,
+			     bool multicast);
+
+extern void macvlan_dellink(struct net_device *dev, struct list_head *head);
+
+extern int macvlan_link_register(struct rtnl_link_ops *ops);
+
+extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+				      struct net_device *dev);
+
+
 extern struct sk_buff *(*macvlan_handle_frame_hook)(struct sk_buff *);
 
 #endif /* _LINUX_IF_MACVLAN_H */
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Bridge] [PATCH 2/3] macvlan: allow multiple driver backends
@ 2010-01-30 22:23     ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:23 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel,
	virtualization, Or Gerlitz

This makes it possible to hook into the macvlan driver
from another kernel module. In particular, the goal is
to extend it with the macvtap backend that provides
a tun/tap compatible interface directly on the macvlan
device.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvlan.c      |  113 +++++++++++++++++++-------------------------
 include/linux/if_macvlan.h |   70 +++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 64 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index d32e0bd..40faa36 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -39,31 +39,6 @@ struct macvlan_port {
 	struct list_head	vlans;
 };
 
-/**
- *	struct macvlan_rx_stats - MACVLAN percpu rx stats
- *	@rx_packets: number of received packets
- *	@rx_bytes: number of received bytes
- *	@multicast: number of received multicast packets
- *	@rx_errors: number of errors
- */
-struct macvlan_rx_stats {
-	unsigned long rx_packets;
-	unsigned long rx_bytes;
-	unsigned long multicast;
-	unsigned long rx_errors;
-};
-
-struct macvlan_dev {
-	struct net_device	*dev;
-	struct list_head	list;
-	struct hlist_node	hlist;
-	struct macvlan_port	*port;
-	struct net_device	*lowerdev;
-	struct macvlan_rx_stats *rx_stats;
-	enum macvlan_mode	mode;
-};
-
-
 static struct macvlan_dev *macvlan_hash_lookup(const struct macvlan_port *port,
 					       const unsigned char *addr)
 {
@@ -118,31 +93,17 @@ static int macvlan_addr_busy(const struct macvlan_port *port,
 	return 0;
 }
 
-static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-				    unsigned int len, bool success,
-				    bool multicast)
-{
-	struct macvlan_rx_stats *rx_stats;
-
-	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
-	if (likely(success)) {
-		rx_stats->rx_packets++;;
-		rx_stats->rx_bytes += len;
-		if (multicast)
-			rx_stats->multicast++;
-	} else {
-		rx_stats->rx_errors++;
-	}
-}
 
-static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
+static int macvlan_broadcast_one(struct sk_buff *skb,
+				 const struct macvlan_dev *vlan,
 				 const struct ethhdr *eth, bool local)
 {
+	struct net_device *dev = vlan->dev;
 	if (!skb)
 		return NET_RX_DROP;
 
 	if (local)
-		return dev_forward_skb(dev, skb);
+		return vlan->forward(dev, skb);
 
 	skb->dev = dev;
 	if (!compare_ether_addr_64bits(eth->h_dest,
@@ -151,7 +112,7 @@ static int macvlan_broadcast_one(struct sk_buff *skb, struct net_device *dev,
 	else
 		skb->pkt_type = PACKET_MULTICAST;
 
-	return netif_rx(skb);
+	return vlan->receive(skb);
 }
 
 static void macvlan_broadcast(struct sk_buff *skb,
@@ -175,7 +136,7 @@ static void macvlan_broadcast(struct sk_buff *skb,
 				continue;
 
 			nskb = skb_clone(skb, GFP_ATOMIC);
-			err = macvlan_broadcast_one(nskb, vlan->dev, eth,
+			err = macvlan_broadcast_one(nskb, vlan, eth,
 					 mode == MACVLAN_MODE_BRIDGE);
 			macvlan_count_rx(vlan, skb->len + ETH_HLEN,
 					 err == NET_RX_SUCCESS, 1);
@@ -238,7 +199,7 @@ static struct sk_buff *macvlan_handle_frame(struct sk_buff *skb)
 	skb->dev = dev;
 	skb->pkt_type = PACKET_HOST;
 
-	netif_rx(skb);
+	vlan->receive(skb);
 	return NULL;
 }
 
@@ -260,7 +221,7 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 		dest = macvlan_hash_lookup(port, eth->h_dest);
 		if (dest && dest->mode == MACVLAN_MODE_BRIDGE) {
 			unsigned int length = skb->len + ETH_HLEN;
-			int ret = dev_forward_skb(dest->dev, skb);
+			int ret = dest->forward(dest->dev, skb);
 			macvlan_count_rx(dest, length,
 					 ret == NET_RX_SUCCESS, 0);
 
@@ -273,8 +234,8 @@ xmit_world:
 	return dev_queue_xmit(skb);
 }
 
-static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
-				      struct net_device *dev)
+netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+			       struct net_device *dev)
 {
 	int i = skb_get_queue_mapping(skb);
 	struct netdev_queue *txq = netdev_get_tx_queue(dev, i);
@@ -290,6 +251,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 
 	return ret;
 }
+EXPORT_SYMBOL_GPL(macvlan_start_xmit);
 
 static int macvlan_hard_header(struct sk_buff *skb, struct net_device *dev,
 			       unsigned short type, const void *daddr,
@@ -623,8 +585,11 @@ static int macvlan_get_tx_queues(struct net *net,
 	return 0;
 }
 
-static int macvlan_newlink(struct net *src_net, struct net_device *dev,
-			   struct nlattr *tb[], struct nlattr *data[])
+int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[],
+			   int (*receive)(struct sk_buff *skb),
+			   int (*forward)(struct net_device *dev,
+					  struct sk_buff *skb))
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port;
@@ -664,6 +629,8 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	vlan->lowerdev = lowerdev;
 	vlan->dev      = dev;
 	vlan->port     = port;
+	vlan->receive  = receive;
+	vlan->forward  = forward;
 
 	vlan->mode     = MACVLAN_MODE_VEPA;
 	if (data && data[IFLA_MACVLAN_MODE])
@@ -677,8 +644,17 @@ static int macvlan_newlink(struct net *src_net, struct net_device *dev,
 	netif_stacked_transfer_operstate(lowerdev, dev);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(macvlan_common_newlink);
 
-static void macvlan_dellink(struct net_device *dev, struct list_head *head)
+static int macvlan_newlink(struct net *src_net, struct net_device *dev,
+			   struct nlattr *tb[], struct nlattr *data[])
+{
+	return macvlan_common_newlink(src_net, dev, tb, data,
+				      netif_rx,
+				      dev_forward_skb);
+}
+
+void macvlan_dellink(struct net_device *dev, struct list_head *head)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
 	struct macvlan_port *port = vlan->port;
@@ -689,6 +665,7 @@ static void macvlan_dellink(struct net_device *dev, struct list_head *head)
 	if (list_empty(&port->vlans))
 		macvlan_port_destroy(port->dev);
 }
+EXPORT_SYMBOL_GPL(macvlan_dellink);
 
 static int macvlan_changelink(struct net_device *dev,
 		struct nlattr *tb[], struct nlattr *data[])
@@ -720,19 +697,27 @@ static const struct nla_policy macvlan_policy[IFLA_MACVLAN_MAX + 1] = {
 	[IFLA_MACVLAN_MODE] = { .type = NLA_U32 },
 };
 
-static struct rtnl_link_ops macvlan_link_ops __read_mostly = {
+int macvlan_link_register(struct rtnl_link_ops *ops)
+{
+	/* common fields */
+	ops->priv_size		= sizeof(struct macvlan_dev);
+	ops->get_tx_queues	= macvlan_get_tx_queues;
+	ops->setup		= macvlan_setup;
+	ops->validate		= macvlan_validate;
+	ops->maxtype		= IFLA_MACVLAN_MAX;
+	ops->policy		= macvlan_policy;
+	ops->changelink		= macvlan_changelink;
+	ops->get_size		= macvlan_get_size;
+	ops->fill_info		= macvlan_fill_info;
+
+	return rtnl_link_register(ops);
+};
+EXPORT_SYMBOL_GPL(macvlan_link_register);
+
+static struct rtnl_link_ops macvlan_link_ops = {
 	.kind		= "macvlan",
-	.priv_size	= sizeof(struct macvlan_dev),
-	.get_tx_queues  = macvlan_get_tx_queues,
-	.setup		= macvlan_setup,
-	.validate	= macvlan_validate,
 	.newlink	= macvlan_newlink,
 	.dellink	= macvlan_dellink,
-	.maxtype	= IFLA_MACVLAN_MAX,
-	.policy		= macvlan_policy,
-	.changelink	= macvlan_changelink,
-	.get_size	= macvlan_get_size,
-	.fill_info	= macvlan_fill_info,
 };
 
 static int macvlan_device_event(struct notifier_block *unused,
@@ -761,7 +746,7 @@ static int macvlan_device_event(struct notifier_block *unused,
 		break;
 	case NETDEV_UNREGISTER:
 		list_for_each_entry_safe(vlan, next, &port->vlans, list)
-			macvlan_dellink(vlan->dev, NULL);
+			vlan->dev->rtnl_link_ops->dellink(vlan->dev, NULL);
 		break;
 	}
 	return NOTIFY_DONE;
@@ -778,7 +763,7 @@ static int __init macvlan_init_module(void)
 	register_netdevice_notifier(&macvlan_notifier_block);
 	macvlan_handle_frame_hook = macvlan_handle_frame;
 
-	err = rtnl_link_register(&macvlan_link_ops);
+	err = macvlan_link_register(&macvlan_link_ops);
 	if (err < 0)
 		goto err1;
 	return 0;
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 5f200ba..9a11544 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -1,6 +1,76 @@
 #ifndef _LINUX_IF_MACVLAN_H
 #define _LINUX_IF_MACVLAN_H
 
+#include <linux/if_link.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/netlink.h>
+#include <net/netlink.h>
+
+struct macvlan_port;
+struct macvtap_queue;
+
+/**
+ *	struct macvlan_rx_stats - MACVLAN percpu rx stats
+ *	@rx_packets: number of received packets
+ *	@rx_bytes: number of received bytes
+ *	@multicast: number of received multicast packets
+ *	@rx_errors: number of errors
+ */
+struct macvlan_rx_stats {
+	unsigned long rx_packets;
+	unsigned long rx_bytes;
+	unsigned long multicast;
+	unsigned long rx_errors;
+};
+
+struct macvlan_dev {
+	struct net_device	*dev;
+	struct list_head	list;
+	struct hlist_node	hlist;
+	struct macvlan_port	*port;
+	struct net_device	*lowerdev;
+	struct macvlan_rx_stats *rx_stats;
+	enum macvlan_mode	mode;
+	int (*receive)(struct sk_buff *skb);
+	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+};
+
+static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
+				    unsigned int len, bool success,
+				    bool multicast)
+{
+	struct macvlan_rx_stats *rx_stats;
+
+	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
+	if (likely(success)) {
+		rx_stats->rx_packets++;;
+		rx_stats->rx_bytes += len;
+		if (multicast)
+			rx_stats->multicast++;
+	} else {
+		rx_stats->rx_errors++;
+	}
+}
+
+extern int macvlan_common_newlink(struct net *src_net, struct net_device *dev,
+				  struct nlattr *tb[], struct nlattr *data[],
+				  int (*receive)(struct sk_buff *skb),
+				  int (*forward)(struct net_device *dev,
+						 struct sk_buff *skb));
+
+extern void macvlan_count_rx(const struct macvlan_dev *vlan,
+			     unsigned int len, bool success,
+			     bool multicast);
+
+extern void macvlan_dellink(struct net_device *dev, struct list_head *head);
+
+extern int macvlan_link_register(struct rtnl_link_ops *ops);
+
+extern netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
+				      struct net_device *dev);
+
+
 extern struct sk_buff *(*macvlan_handle_frame_hook)(struct sk_buff *);
 
 #endif /* _LINUX_IF_MACVLAN_H */
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 3/3] net: macvtap driver
  2010-01-30 22:22   ` [Bridge] " Arnd Bergmann
@ 2010-01-30 22:24     ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:24 UTC (permalink / raw)
  To: David Miller
  Cc: Stephen Hemminger, Patrick McHardy, Michael S. Tsirkin,
	Herbert Xu, Or Gerlitz, netdev, bridge, linux-kernel,
	virtualization

In order to use macvlan with qemu and other tools that require
a tap file descriptor, the macvtap driver adds a small backend
with a character device with the same interface as the tun
driver, with a minimum set of features.

Macvtap interfaces are created in the same way as macvlan
interfaces using ip link, but the netif is just used as a
handle for configuration and accounting, while the data
goes through the chardev. Each macvtap interface has its
own character device, simplifying permission management
significantly over the generic tun/tap driver.

Cc: Patrick McHardy <kaber@trash.net>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz@voltaire.com>
Cc: netdev@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/Kconfig        |   12 +
 drivers/net/Makefile       |    1 +
 drivers/net/macvtap.c      |  581 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/if_macvlan.h |    1 +
 4 files changed, 595 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/macvtap.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index cb0e534..411e207 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,18 @@ config MACVLAN
 	  To compile this driver as a module, choose M here: the module
 	  will be called macvlan.
 
+config MACVTAP
+	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
+	depends on MACVLAN
+	help
+	  This adds a specialized tap character device driver that is based
+	  on the MAC-VLAN network interface, called macvtap. A macvtap device
+	  can be added in the same way as a macvlan device, using 'type
+	  macvlan', and then be accessed through the tap user space interface.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called macvtap.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 0b763cb..9595803 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -169,6 +169,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
 obj-$(CONFIG_DUMMY) += dummy.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
+obj-$(CONFIG_MACVTAP) += macvtap.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
 obj-$(CONFIG_LANCE) += lance.o
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..ad1f6ef
--- /dev/null
+++ b/drivers/net/macvtap.c
@@ -0,0 +1,581 @@
+#include <linux/etherdevice.h>
+#include <linux/if_macvlan.h>
+#include <linux/interrupt.h>
+#include <linux/nsproxy.h>
+#include <linux/compat.h>
+#include <linux/if_tun.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/cache.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+
+#include <net/net_namespace.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+/*
+ * A macvtap queue is the central object of this driver, it connects
+ * an open character device to a macvlan interface. There can be
+ * multiple queues on one interface, which map back to queues
+ * implemented in hardware on the underlying device.
+ *
+ * macvtap_proto is used to allocate queues through the sock allocation
+ * mechanism.
+ *
+ * TODO: multiqueue support is currently not implemented, even though
+ * macvtap is basically prepared for that. We will need to add this
+ * here as well as in virtio-net and qemu to get line rate on 10gbit
+ * adapters from a guest.
+ */
+struct macvtap_queue {
+	struct sock sk;
+	struct socket sock;
+	struct macvlan_dev *vlan;
+	struct file *file;
+};
+
+static struct proto macvtap_proto = {
+	.name = "macvtap",
+	.owner = THIS_MODULE,
+	.obj_size = sizeof (struct macvtap_queue),
+};
+
+/*
+ * Minor number matches netdev->ifindex, so need a potentially
+ * large value. This also makes it possible to split the
+ * tap functionality out again in the future by offering it
+ * from other drivers besides macvtap. As long as every device
+ * only has one tap, the interface numbers assure that the
+ * device nodes are unique.
+ */
+static unsigned int macvtap_major;
+#define MACVTAP_NUM_DEVS 65536
+static struct class *macvtap_class;
+static struct cdev macvtap_cdev;
+
+/*
+ * RCU usage:
+ * The macvtap_queue is referenced both from the chardev struct file
+ * and from the struct macvlan_dev using rcu_read_lock.
+ *
+ * We never actually update the contents of a macvtap_queue atomically
+ * with RCU but it is used for race-free destruction of a queue when
+ * either the file or the macvlan_dev goes away. Pointers back to
+ * the dev and the file are implicitly valid as long as the queue
+ * exists.
+ *
+ * The callbacks from macvlan are always done with rcu_read_lock held
+ * already, while in the file_operations, we get it ourselves.
+ *
+ * When destroying a queue, we remove the pointers from the file and
+ * from the dev and then synchronize_rcu to make sure no thread is
+ * still using the queue. There may still be references to the struct
+ * sock inside of the queue from outbound SKBs, but these never
+ * reference back to the file or the dev. The data structure is freed
+ * through __sk_free when both our references and any pending SKBs
+ * are gone.
+ *
+ * macvtap_lock is only used to prevent multiple concurrent open()
+ * calls to assign a new vlan->tap pointer. It could be moved into
+ * the macvlan_dev itself but is extremely rarely used.
+ */
+static DEFINE_SPINLOCK(macvtap_lock);
+
+/*
+ * Choose the next free queue, for now there is only one
+ */
+static int macvtap_set_queue(struct net_device *dev, struct file *file,
+				struct macvtap_queue *q)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	int err = -EBUSY;
+
+	spin_lock(&macvtap_lock);
+	if (rcu_dereference(vlan->tap))
+		goto out;
+
+	err = 0;
+	q->vlan = vlan;
+	rcu_assign_pointer(vlan->tap, q);
+
+	q->file = file;
+	rcu_assign_pointer(file->private_data, q);
+
+out:
+	spin_unlock(&macvtap_lock);
+	return err;
+}
+
+/*
+ * We must destroy each queue exactly once, when either
+ * the netdev or the file go away.
+ *
+ * Using the spinlock makes sure that we don't get
+ * to the queue again after destroying it.
+ *
+ * synchronize_rcu serializes with the packet flow
+ * that uses rcu_read_lock.
+ */
+static void macvtap_del_queue(struct macvtap_queue **qp)
+{
+	struct macvtap_queue *q;
+
+	spin_lock(&macvtap_lock);
+	q = rcu_dereference(*qp);
+	if (!q) {
+		spin_unlock(&macvtap_lock);
+		return;
+	}
+
+	rcu_assign_pointer(q->vlan->tap, NULL);
+	rcu_assign_pointer(q->file->private_data, NULL);
+	spin_unlock(&macvtap_lock);
+
+	synchronize_rcu();
+	sock_put(&q->sk);
+}
+
+/*
+ * Since we only support one queue, just dereference the pointer.
+ */
+static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
+					       struct sk_buff *skb)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+
+	return rcu_dereference(vlan->tap);
+}
+
+static void macvtap_del_queues(struct net_device *dev)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	macvtap_del_queue(&vlan->tap);
+}
+
+static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
+{
+	rcu_read_lock_bh();
+	return rcu_dereference(file->private_data);
+}
+
+static inline void macvtap_file_put_queue(void)
+{
+	rcu_read_unlock_bh();
+}
+
+/*
+ * Forward happens for data that gets sent from one macvlan
+ * endpoint to another one in bridge mode. We just take
+ * the skb and put it into the receive queue.
+ */
+static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
+{
+	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
+	if (!q)
+		return -ENOLINK;
+
+	skb_queue_tail(&q->sk.sk_receive_queue, skb);
+	wake_up(q->sk.sk_sleep);
+	return 0;
+}
+
+/*
+ * Receive is for data from the external interface (lowerdev),
+ * in case of macvtap, we can treat that the same way as
+ * forward, which macvlan cannot.
+ */
+static int macvtap_receive(struct sk_buff *skb)
+{
+	skb_push(skb, ETH_HLEN);
+	return macvtap_forward(skb->dev, skb);
+}
+
+static int macvtap_newlink(struct net *src_net,
+			   struct net_device *dev,
+			   struct nlattr *tb[],
+			   struct nlattr *data[])
+{
+	struct device *classdev;
+	dev_t devt;
+	int err;
+
+	err = macvlan_common_newlink(src_net, dev, tb, data,
+				     macvtap_receive, macvtap_forward);
+	if (err)
+		goto out;
+
+	devt = MKDEV(MAJOR(macvtap_major), dev->ifindex);
+
+	classdev = device_create(macvtap_class, &dev->dev, devt,
+				 dev, "tap%d", dev->ifindex);
+	if (IS_ERR(classdev)) {
+		err = PTR_ERR(classdev);
+		macvtap_del_queues(dev);
+	}
+
+out:
+	return err;
+}
+
+static void macvtap_dellink(struct net_device *dev,
+			    struct list_head *head)
+{
+	device_destroy(macvtap_class,
+		       MKDEV(MAJOR(macvtap_major), dev->ifindex));
+
+	macvtap_del_queues(dev);
+	macvlan_dellink(dev, head);
+}
+
+static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
+	.kind		= "macvtap",
+	.newlink	= macvtap_newlink,
+	.dellink	= macvtap_dellink,
+};
+
+
+static void macvtap_sock_write_space(struct sock *sk)
+{
+	if (!sock_writeable(sk) ||
+	    !test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
+		return;
+
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		wake_up_interruptible_sync(sk->sk_sleep);
+}
+
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct net_device *dev = dev_get_by_index(net, iminor(inode));
+	struct macvtap_queue *q;
+	int err;
+
+	err = -ENODEV;
+	if (!dev)
+		goto out;
+
+	/* check if this is a macvtap device */
+	err = -EINVAL;
+	if (dev->rtnl_link_ops != &macvtap_link_ops)
+		goto out;
+
+	err = -ENOMEM;
+	q = (struct macvtap_queue *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
+					     &macvtap_proto);
+	if (!q)
+		goto out;
+
+	init_waitqueue_head(&q->sock.wait);
+	q->sock.type = SOCK_RAW;
+	q->sock.state = SS_CONNECTED;
+	sock_init_data(&q->sock, &q->sk);
+	q->sk.sk_allocation = GFP_ATOMIC; /* for now */
+	q->sk.sk_write_space = macvtap_sock_write_space;
+
+	err = macvtap_set_queue(dev, file, q);
+	if (err)
+		sock_put(&q->sk);
+
+out:
+	if (dev)
+		dev_put(dev);
+
+	return err;
+}
+
+static int macvtap_release(struct inode *inode, struct file *file)
+{
+	macvtap_del_queue((struct macvtap_queue **)&file->private_data);
+	return 0;
+}
+
+static unsigned int macvtap_poll(struct file *file, poll_table * wait)
+{
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	unsigned int mask = POLLERR;
+
+	if (!q)
+		goto out;
+
+	mask = 0;
+	poll_wait(file, &q->sock.wait, wait);
+
+	if (!skb_queue_empty(&q->sk.sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(&q->sk) ||
+	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &q->sock.flags) &&
+	     sock_writeable(&q->sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+out:
+	macvtap_file_put_queue();
+	return mask;
+}
+
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_queue *q,
+				const struct iovec *iv, size_t count,
+				int noblock)
+{
+	struct sk_buff *skb;
+	size_t len = count;
+	int err;
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
+
+	if (!skb) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		return err;
+	}
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, count);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+
+	skb_set_network_header(skb, ETH_HLEN);
+
+	macvlan_start_xmit(skb, q->vlan->dev);
+
+	return count;
+}
+
+static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
+				 unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t result = -ENOLINK;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	if (!q)
+		goto out;
+
+	result = macvtap_get_user(q, iv, iov_length(iv, count),
+			      file->f_flags & O_NONBLOCK);
+out:
+	macvtap_file_put_queue();
+	return result;
+}
+
+/* Put packet to the user space buffer */
+static ssize_t macvtap_put_user(struct macvtap_queue *q,
+				const struct sk_buff *skb,
+				const struct iovec *iv, int len)
+{
+	struct macvlan_dev *vlan = q->vlan;
+	int ret;
+
+	len = min_t(int, skb->len, len);
+
+	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
+
+	macvlan_count_rx(vlan, len, ret == 0, 0);
+
+	return ret ? ret : len;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	DECLARE_WAITQUEUE(wait, current);
+	struct sk_buff *skb;
+	ssize_t len, ret = 0;
+
+	if (!q) {
+		ret = -ENOLINK;
+		goto out;
+	}
+
+	len = iov_length(iv, count);
+	if (len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	add_wait_queue(q->sk.sk_sleep, &wait);
+	while (len) {
+		current->state = TASK_INTERRUPTIBLE;
+
+		/* Read frames from the queue */
+		skb = skb_dequeue(&q->sk.sk_receive_queue);
+		if (!skb) {
+			if (file->f_flags & O_NONBLOCK) {
+				ret = -EAGAIN;
+				break;
+			}
+			if (signal_pending(current)) {
+				ret = -ERESTARTSYS;
+				break;
+			}
+			/* Nothing to read, let's sleep */
+			schedule();
+			continue;
+		}
+		ret = macvtap_put_user(q, skb, iv, len);
+		kfree_skb(skb);
+		break;
+	}
+
+	current->state = TASK_RUNNING;
+	remove_wait_queue(q->sk.sk_sleep, &wait);
+
+out:
+	macvtap_file_put_queue();
+	return ret;
+}
+
+/*
+ * provide compatibility with generic tun/tap interface
+ */
+static long macvtap_ioctl(struct file *file, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct macvtap_queue *q;
+	void __user *argp = (void __user *)arg;
+	struct ifreq __user *ifr = argp;
+	unsigned int __user *up = argp;
+	unsigned int u;
+	char devname[IFNAMSIZ];
+
+	switch (cmd) {
+	case TUNSETIFF:
+		/* ignore the name, just look at flags */
+		if (get_user(u, &ifr->ifr_flags))
+			return -EFAULT;
+		if (u != (IFF_TAP | IFF_NO_PI))
+			return -EINVAL;
+		return 0;
+
+	case TUNGETIFF:
+		q = macvtap_file_get_queue(file);
+		if (!q)
+			return -ENOLINK;
+		memcpy(devname, q->vlan->dev->name, sizeof(devname));
+		macvtap_file_put_queue();
+
+		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
+		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
+			return -EFAULT;
+		return 0;
+
+	case TUNGETFEATURES:
+		if (put_user((IFF_TAP | IFF_NO_PI), up))
+			return -EFAULT;
+		return 0;
+
+	case TUNSETSNDBUF:
+		if (get_user(u, up))
+			return -EFAULT;
+
+		q = macvtap_file_get_queue(file);
+		q->sk.sk_sndbuf = u;
+		macvtap_file_put_queue();
+		return 0;
+
+	case TUNSETOFFLOAD:
+		/* let the user check for future flags */
+		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			  TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		/* TODO: add support for these, so far we don't
+			 support any offload */
+		if (arg & (TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			 TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		return 0;
+
+	default:
+		return -EINVAL;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static long macvtap_compat_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	return macvtap_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations macvtap_fops = {
+	.owner		= THIS_MODULE,
+	.open		= macvtap_open,
+	.release	= macvtap_release,
+	.aio_read	= macvtap_aio_read,
+	.aio_write	= macvtap_aio_write,
+	.poll		= macvtap_poll,
+	.llseek		= no_llseek,
+	.unlocked_ioctl	= macvtap_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= macvtap_compat_ioctl,
+#endif
+};
+
+static int macvtap_init(void)
+{
+	int err;
+
+	err = alloc_chrdev_region(&macvtap_major, 0,
+				MACVTAP_NUM_DEVS, "macvtap");
+	if (err)
+		goto out1;
+
+	cdev_init(&macvtap_cdev, &macvtap_fops);
+	err = cdev_add(&macvtap_cdev, macvtap_major, MACVTAP_NUM_DEVS);
+	if (err)
+		goto out2;
+
+	macvtap_class = class_create(THIS_MODULE, "macvtap");
+	if (IS_ERR(macvtap_class)) {
+		err = PTR_ERR(macvtap_class);
+		goto out3;
+	}
+
+	err = macvlan_link_register(&macvtap_link_ops);
+	if (err)
+		goto out4;
+
+	return 0;
+
+out4:
+	class_unregister(macvtap_class);
+out3:
+	cdev_del(&macvtap_cdev);
+out2:
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+out1:
+	return err;
+}
+module_init(macvtap_init);
+
+static void macvtap_exit(void)
+{
+	rtnl_link_unregister(&macvtap_link_ops);
+	class_unregister(macvtap_class);
+	cdev_del(&macvtap_cdev);
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+}
+module_exit(macvtap_exit);
+
+MODULE_ALIAS_RTNL_LINK("macvtap");
+MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9a11544..51f1512 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -34,6 +34,7 @@ struct macvlan_dev {
 	enum macvlan_mode	mode;
 	int (*receive)(struct sk_buff *skb);
 	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+	struct macvtap_queue	*tap;
 };
 
 static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 3/3] net: macvtap driver
  2010-01-30 22:22   ` [Bridge] " Arnd Bergmann
                     ` (4 preceding siblings ...)
  (?)
@ 2010-01-30 22:24   ` Arnd Bergmann
  -1 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:24 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel,
	virtualization, Stephen Hemminger

In order to use macvlan with qemu and other tools that require
a tap file descriptor, the macvtap driver adds a small backend
with a character device with the same interface as the tun
driver, with a minimum set of features.

Macvtap interfaces are created in the same way as macvlan
interfaces using ip link, but the netif is just used as a
handle for configuration and accounting, while the data
goes through the chardev. Each macvtap interface has its
own character device, simplifying permission management
significantly over the generic tun/tap driver.

Cc: Patrick McHardy <kaber@trash.net>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz@voltaire.com>
Cc: netdev@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/Kconfig        |   12 +
 drivers/net/Makefile       |    1 +
 drivers/net/macvtap.c      |  581 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/if_macvlan.h |    1 +
 4 files changed, 595 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/macvtap.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index cb0e534..411e207 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,18 @@ config MACVLAN
 	  To compile this driver as a module, choose M here: the module
 	  will be called macvlan.
 
+config MACVTAP
+	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
+	depends on MACVLAN
+	help
+	  This adds a specialized tap character device driver that is based
+	  on the MAC-VLAN network interface, called macvtap. A macvtap device
+	  can be added in the same way as a macvlan device, using 'type
+	  macvlan', and then be accessed through the tap user space interface.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called macvtap.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 0b763cb..9595803 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -169,6 +169,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
 obj-$(CONFIG_DUMMY) += dummy.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
+obj-$(CONFIG_MACVTAP) += macvtap.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
 obj-$(CONFIG_LANCE) += lance.o
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..ad1f6ef
--- /dev/null
+++ b/drivers/net/macvtap.c
@@ -0,0 +1,581 @@
+#include <linux/etherdevice.h>
+#include <linux/if_macvlan.h>
+#include <linux/interrupt.h>
+#include <linux/nsproxy.h>
+#include <linux/compat.h>
+#include <linux/if_tun.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/cache.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+
+#include <net/net_namespace.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+/*
+ * A macvtap queue is the central object of this driver, it connects
+ * an open character device to a macvlan interface. There can be
+ * multiple queues on one interface, which map back to queues
+ * implemented in hardware on the underlying device.
+ *
+ * macvtap_proto is used to allocate queues through the sock allocation
+ * mechanism.
+ *
+ * TODO: multiqueue support is currently not implemented, even though
+ * macvtap is basically prepared for that. We will need to add this
+ * here as well as in virtio-net and qemu to get line rate on 10gbit
+ * adapters from a guest.
+ */
+struct macvtap_queue {
+	struct sock sk;
+	struct socket sock;
+	struct macvlan_dev *vlan;
+	struct file *file;
+};
+
+static struct proto macvtap_proto = {
+	.name = "macvtap",
+	.owner = THIS_MODULE,
+	.obj_size = sizeof (struct macvtap_queue),
+};
+
+/*
+ * Minor number matches netdev->ifindex, so need a potentially
+ * large value. This also makes it possible to split the
+ * tap functionality out again in the future by offering it
+ * from other drivers besides macvtap. As long as every device
+ * only has one tap, the interface numbers assure that the
+ * device nodes are unique.
+ */
+static unsigned int macvtap_major;
+#define MACVTAP_NUM_DEVS 65536
+static struct class *macvtap_class;
+static struct cdev macvtap_cdev;
+
+/*
+ * RCU usage:
+ * The macvtap_queue is referenced both from the chardev struct file
+ * and from the struct macvlan_dev using rcu_read_lock.
+ *
+ * We never actually update the contents of a macvtap_queue atomically
+ * with RCU but it is used for race-free destruction of a queue when
+ * either the file or the macvlan_dev goes away. Pointers back to
+ * the dev and the file are implicitly valid as long as the queue
+ * exists.
+ *
+ * The callbacks from macvlan are always done with rcu_read_lock held
+ * already, while in the file_operations, we get it ourselves.
+ *
+ * When destroying a queue, we remove the pointers from the file and
+ * from the dev and then synchronize_rcu to make sure no thread is
+ * still using the queue. There may still be references to the struct
+ * sock inside of the queue from outbound SKBs, but these never
+ * reference back to the file or the dev. The data structure is freed
+ * through __sk_free when both our references and any pending SKBs
+ * are gone.
+ *
+ * macvtap_lock is only used to prevent multiple concurrent open()
+ * calls to assign a new vlan->tap pointer. It could be moved into
+ * the macvlan_dev itself but is extremely rarely used.
+ */
+static DEFINE_SPINLOCK(macvtap_lock);
+
+/*
+ * Choose the next free queue, for now there is only one
+ */
+static int macvtap_set_queue(struct net_device *dev, struct file *file,
+				struct macvtap_queue *q)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	int err = -EBUSY;
+
+	spin_lock(&macvtap_lock);
+	if (rcu_dereference(vlan->tap))
+		goto out;
+
+	err = 0;
+	q->vlan = vlan;
+	rcu_assign_pointer(vlan->tap, q);
+
+	q->file = file;
+	rcu_assign_pointer(file->private_data, q);
+
+out:
+	spin_unlock(&macvtap_lock);
+	return err;
+}
+
+/*
+ * We must destroy each queue exactly once, when either
+ * the netdev or the file go away.
+ *
+ * Using the spinlock makes sure that we don't get
+ * to the queue again after destroying it.
+ *
+ * synchronize_rcu serializes with the packet flow
+ * that uses rcu_read_lock.
+ */
+static void macvtap_del_queue(struct macvtap_queue **qp)
+{
+	struct macvtap_queue *q;
+
+	spin_lock(&macvtap_lock);
+	q = rcu_dereference(*qp);
+	if (!q) {
+		spin_unlock(&macvtap_lock);
+		return;
+	}
+
+	rcu_assign_pointer(q->vlan->tap, NULL);
+	rcu_assign_pointer(q->file->private_data, NULL);
+	spin_unlock(&macvtap_lock);
+
+	synchronize_rcu();
+	sock_put(&q->sk);
+}
+
+/*
+ * Since we only support one queue, just dereference the pointer.
+ */
+static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
+					       struct sk_buff *skb)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+
+	return rcu_dereference(vlan->tap);
+}
+
+static void macvtap_del_queues(struct net_device *dev)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	macvtap_del_queue(&vlan->tap);
+}
+
+static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
+{
+	rcu_read_lock_bh();
+	return rcu_dereference(file->private_data);
+}
+
+static inline void macvtap_file_put_queue(void)
+{
+	rcu_read_unlock_bh();
+}
+
+/*
+ * Forward happens for data that gets sent from one macvlan
+ * endpoint to another one in bridge mode. We just take
+ * the skb and put it into the receive queue.
+ */
+static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
+{
+	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
+	if (!q)
+		return -ENOLINK;
+
+	skb_queue_tail(&q->sk.sk_receive_queue, skb);
+	wake_up(q->sk.sk_sleep);
+	return 0;
+}
+
+/*
+ * Receive is for data from the external interface (lowerdev),
+ * in case of macvtap, we can treat that the same way as
+ * forward, which macvlan cannot.
+ */
+static int macvtap_receive(struct sk_buff *skb)
+{
+	skb_push(skb, ETH_HLEN);
+	return macvtap_forward(skb->dev, skb);
+}
+
+static int macvtap_newlink(struct net *src_net,
+			   struct net_device *dev,
+			   struct nlattr *tb[],
+			   struct nlattr *data[])
+{
+	struct device *classdev;
+	dev_t devt;
+	int err;
+
+	err = macvlan_common_newlink(src_net, dev, tb, data,
+				     macvtap_receive, macvtap_forward);
+	if (err)
+		goto out;
+
+	devt = MKDEV(MAJOR(macvtap_major), dev->ifindex);
+
+	classdev = device_create(macvtap_class, &dev->dev, devt,
+				 dev, "tap%d", dev->ifindex);
+	if (IS_ERR(classdev)) {
+		err = PTR_ERR(classdev);
+		macvtap_del_queues(dev);
+	}
+
+out:
+	return err;
+}
+
+static void macvtap_dellink(struct net_device *dev,
+			    struct list_head *head)
+{
+	device_destroy(macvtap_class,
+		       MKDEV(MAJOR(macvtap_major), dev->ifindex));
+
+	macvtap_del_queues(dev);
+	macvlan_dellink(dev, head);
+}
+
+static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
+	.kind		= "macvtap",
+	.newlink	= macvtap_newlink,
+	.dellink	= macvtap_dellink,
+};
+
+
+static void macvtap_sock_write_space(struct sock *sk)
+{
+	if (!sock_writeable(sk) ||
+	    !test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
+		return;
+
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		wake_up_interruptible_sync(sk->sk_sleep);
+}
+
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct net_device *dev = dev_get_by_index(net, iminor(inode));
+	struct macvtap_queue *q;
+	int err;
+
+	err = -ENODEV;
+	if (!dev)
+		goto out;
+
+	/* check if this is a macvtap device */
+	err = -EINVAL;
+	if (dev->rtnl_link_ops != &macvtap_link_ops)
+		goto out;
+
+	err = -ENOMEM;
+	q = (struct macvtap_queue *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
+					     &macvtap_proto);
+	if (!q)
+		goto out;
+
+	init_waitqueue_head(&q->sock.wait);
+	q->sock.type = SOCK_RAW;
+	q->sock.state = SS_CONNECTED;
+	sock_init_data(&q->sock, &q->sk);
+	q->sk.sk_allocation = GFP_ATOMIC; /* for now */
+	q->sk.sk_write_space = macvtap_sock_write_space;
+
+	err = macvtap_set_queue(dev, file, q);
+	if (err)
+		sock_put(&q->sk);
+
+out:
+	if (dev)
+		dev_put(dev);
+
+	return err;
+}
+
+static int macvtap_release(struct inode *inode, struct file *file)
+{
+	macvtap_del_queue((struct macvtap_queue **)&file->private_data);
+	return 0;
+}
+
+static unsigned int macvtap_poll(struct file *file, poll_table * wait)
+{
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	unsigned int mask = POLLERR;
+
+	if (!q)
+		goto out;
+
+	mask = 0;
+	poll_wait(file, &q->sock.wait, wait);
+
+	if (!skb_queue_empty(&q->sk.sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(&q->sk) ||
+	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &q->sock.flags) &&
+	     sock_writeable(&q->sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+out:
+	macvtap_file_put_queue();
+	return mask;
+}
+
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_queue *q,
+				const struct iovec *iv, size_t count,
+				int noblock)
+{
+	struct sk_buff *skb;
+	size_t len = count;
+	int err;
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
+
+	if (!skb) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		return err;
+	}
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, count);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+
+	skb_set_network_header(skb, ETH_HLEN);
+
+	macvlan_start_xmit(skb, q->vlan->dev);
+
+	return count;
+}
+
+static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
+				 unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t result = -ENOLINK;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	if (!q)
+		goto out;
+
+	result = macvtap_get_user(q, iv, iov_length(iv, count),
+			      file->f_flags & O_NONBLOCK);
+out:
+	macvtap_file_put_queue();
+	return result;
+}
+
+/* Put packet to the user space buffer */
+static ssize_t macvtap_put_user(struct macvtap_queue *q,
+				const struct sk_buff *skb,
+				const struct iovec *iv, int len)
+{
+	struct macvlan_dev *vlan = q->vlan;
+	int ret;
+
+	len = min_t(int, skb->len, len);
+
+	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
+
+	macvlan_count_rx(vlan, len, ret == 0, 0);
+
+	return ret ? ret : len;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	DECLARE_WAITQUEUE(wait, current);
+	struct sk_buff *skb;
+	ssize_t len, ret = 0;
+
+	if (!q) {
+		ret = -ENOLINK;
+		goto out;
+	}
+
+	len = iov_length(iv, count);
+	if (len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	add_wait_queue(q->sk.sk_sleep, &wait);
+	while (len) {
+		current->state = TASK_INTERRUPTIBLE;
+
+		/* Read frames from the queue */
+		skb = skb_dequeue(&q->sk.sk_receive_queue);
+		if (!skb) {
+			if (file->f_flags & O_NONBLOCK) {
+				ret = -EAGAIN;
+				break;
+			}
+			if (signal_pending(current)) {
+				ret = -ERESTARTSYS;
+				break;
+			}
+			/* Nothing to read, let's sleep */
+			schedule();
+			continue;
+		}
+		ret = macvtap_put_user(q, skb, iv, len);
+		kfree_skb(skb);
+		break;
+	}
+
+	current->state = TASK_RUNNING;
+	remove_wait_queue(q->sk.sk_sleep, &wait);
+
+out:
+	macvtap_file_put_queue();
+	return ret;
+}
+
+/*
+ * provide compatibility with generic tun/tap interface
+ */
+static long macvtap_ioctl(struct file *file, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct macvtap_queue *q;
+	void __user *argp = (void __user *)arg;
+	struct ifreq __user *ifr = argp;
+	unsigned int __user *up = argp;
+	unsigned int u;
+	char devname[IFNAMSIZ];
+
+	switch (cmd) {
+	case TUNSETIFF:
+		/* ignore the name, just look at flags */
+		if (get_user(u, &ifr->ifr_flags))
+			return -EFAULT;
+		if (u != (IFF_TAP | IFF_NO_PI))
+			return -EINVAL;
+		return 0;
+
+	case TUNGETIFF:
+		q = macvtap_file_get_queue(file);
+		if (!q)
+			return -ENOLINK;
+		memcpy(devname, q->vlan->dev->name, sizeof(devname));
+		macvtap_file_put_queue();
+
+		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
+		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
+			return -EFAULT;
+		return 0;
+
+	case TUNGETFEATURES:
+		if (put_user((IFF_TAP | IFF_NO_PI), up))
+			return -EFAULT;
+		return 0;
+
+	case TUNSETSNDBUF:
+		if (get_user(u, up))
+			return -EFAULT;
+
+		q = macvtap_file_get_queue(file);
+		q->sk.sk_sndbuf = u;
+		macvtap_file_put_queue();
+		return 0;
+
+	case TUNSETOFFLOAD:
+		/* let the user check for future flags */
+		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			  TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		/* TODO: add support for these, so far we don't
+			 support any offload */
+		if (arg & (TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			 TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		return 0;
+
+	default:
+		return -EINVAL;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static long macvtap_compat_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	return macvtap_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations macvtap_fops = {
+	.owner		= THIS_MODULE,
+	.open		= macvtap_open,
+	.release	= macvtap_release,
+	.aio_read	= macvtap_aio_read,
+	.aio_write	= macvtap_aio_write,
+	.poll		= macvtap_poll,
+	.llseek		= no_llseek,
+	.unlocked_ioctl	= macvtap_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= macvtap_compat_ioctl,
+#endif
+};
+
+static int macvtap_init(void)
+{
+	int err;
+
+	err = alloc_chrdev_region(&macvtap_major, 0,
+				MACVTAP_NUM_DEVS, "macvtap");
+	if (err)
+		goto out1;
+
+	cdev_init(&macvtap_cdev, &macvtap_fops);
+	err = cdev_add(&macvtap_cdev, macvtap_major, MACVTAP_NUM_DEVS);
+	if (err)
+		goto out2;
+
+	macvtap_class = class_create(THIS_MODULE, "macvtap");
+	if (IS_ERR(macvtap_class)) {
+		err = PTR_ERR(macvtap_class);
+		goto out3;
+	}
+
+	err = macvlan_link_register(&macvtap_link_ops);
+	if (err)
+		goto out4;
+
+	return 0;
+
+out4:
+	class_unregister(macvtap_class);
+out3:
+	cdev_del(&macvtap_cdev);
+out2:
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+out1:
+	return err;
+}
+module_init(macvtap_init);
+
+static void macvtap_exit(void)
+{
+	rtnl_link_unregister(&macvtap_link_ops);
+	class_unregister(macvtap_class);
+	cdev_del(&macvtap_cdev);
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+}
+module_exit(macvtap_exit);
+
+MODULE_ALIAS_RTNL_LINK("macvtap");
+MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9a11544..51f1512 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -34,6 +34,7 @@ struct macvlan_dev {
 	enum macvlan_mode	mode;
 	int (*receive)(struct sk_buff *skb);
 	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+	struct macvtap_queue	*tap;
 };
 
 static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [Bridge] [PATCH 3/3] net: macvtap driver
@ 2010-01-30 22:24     ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-01-30 22:24 UTC (permalink / raw)
  To: David Miller
  Cc: Herbert Xu, Michael S. Tsirkin, netdev, bridge, linux-kernel,
	virtualization, Or Gerlitz

In order to use macvlan with qemu and other tools that require
a tap file descriptor, the macvtap driver adds a small backend
with a character device with the same interface as the tun
driver, with a minimum set of features.

Macvtap interfaces are created in the same way as macvlan
interfaces using ip link, but the netif is just used as a
handle for configuration and accounting, while the data
goes through the chardev. Each macvtap interface has its
own character device, simplifying permission management
significantly over the generic tun/tap driver.

Cc: Patrick McHardy <kaber@trash.net>
Cc: Stephen Hemminger <shemminger@linux-foundation.org>
Cc: David S. Miller" <davem@davemloft.net>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Or Gerlitz <ogerlitz@voltaire.com>
Cc: netdev@vger.kernel.org
Cc: bridge@lists.linux-foundation.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/Kconfig        |   12 +
 drivers/net/Makefile       |    1 +
 drivers/net/macvtap.c      |  581 ++++++++++++++++++++++++++++++++++++++++++++
 include/linux/if_macvlan.h |    1 +
 4 files changed, 595 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/macvtap.c

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index cb0e534..411e207 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,18 @@ config MACVLAN
 	  To compile this driver as a module, choose M here: the module
 	  will be called macvlan.
 
+config MACVTAP
+	tristate "MAC-VLAN based tap driver (EXPERIMENTAL)"
+	depends on MACVLAN
+	help
+	  This adds a specialized tap character device driver that is based
+	  on the MAC-VLAN network interface, called macvtap. A macvtap device
+	  can be added in the same way as a macvlan device, using 'type
+	  macvlan', and then be accessed through the tap user space interface.
+
+	  To compile this driver as a module, choose M here: the module
+	  will be called macvtap.
+
 config EQUALIZER
 	tristate "EQL (serial line load balancing) support"
 	---help---
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 0b763cb..9595803 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -169,6 +169,7 @@ obj-$(CONFIG_XEN_NETDEV_FRONTEND) += xen-netfront.o
 obj-$(CONFIG_DUMMY) += dummy.o
 obj-$(CONFIG_IFB) += ifb.o
 obj-$(CONFIG_MACVLAN) += macvlan.o
+obj-$(CONFIG_MACVTAP) += macvtap.o
 obj-$(CONFIG_DE600) += de600.o
 obj-$(CONFIG_DE620) += de620.o
 obj-$(CONFIG_LANCE) += lance.o
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
new file mode 100644
index 0000000..ad1f6ef
--- /dev/null
+++ b/drivers/net/macvtap.c
@@ -0,0 +1,581 @@
+#include <linux/etherdevice.h>
+#include <linux/if_macvlan.h>
+#include <linux/interrupt.h>
+#include <linux/nsproxy.h>
+#include <linux/compat.h>
+#include <linux/if_tun.h>
+#include <linux/module.h>
+#include <linux/skbuff.h>
+#include <linux/cache.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/wait.h>
+#include <linux/cdev.h>
+#include <linux/fs.h>
+
+#include <net/net_namespace.h>
+#include <net/rtnetlink.h>
+#include <net/sock.h>
+
+/*
+ * A macvtap queue is the central object of this driver, it connects
+ * an open character device to a macvlan interface. There can be
+ * multiple queues on one interface, which map back to queues
+ * implemented in hardware on the underlying device.
+ *
+ * macvtap_proto is used to allocate queues through the sock allocation
+ * mechanism.
+ *
+ * TODO: multiqueue support is currently not implemented, even though
+ * macvtap is basically prepared for that. We will need to add this
+ * here as well as in virtio-net and qemu to get line rate on 10gbit
+ * adapters from a guest.
+ */
+struct macvtap_queue {
+	struct sock sk;
+	struct socket sock;
+	struct macvlan_dev *vlan;
+	struct file *file;
+};
+
+static struct proto macvtap_proto = {
+	.name = "macvtap",
+	.owner = THIS_MODULE,
+	.obj_size = sizeof (struct macvtap_queue),
+};
+
+/*
+ * Minor number matches netdev->ifindex, so need a potentially
+ * large value. This also makes it possible to split the
+ * tap functionality out again in the future by offering it
+ * from other drivers besides macvtap. As long as every device
+ * only has one tap, the interface numbers assure that the
+ * device nodes are unique.
+ */
+static unsigned int macvtap_major;
+#define MACVTAP_NUM_DEVS 65536
+static struct class *macvtap_class;
+static struct cdev macvtap_cdev;
+
+/*
+ * RCU usage:
+ * The macvtap_queue is referenced both from the chardev struct file
+ * and from the struct macvlan_dev using rcu_read_lock.
+ *
+ * We never actually update the contents of a macvtap_queue atomically
+ * with RCU but it is used for race-free destruction of a queue when
+ * either the file or the macvlan_dev goes away. Pointers back to
+ * the dev and the file are implicitly valid as long as the queue
+ * exists.
+ *
+ * The callbacks from macvlan are always done with rcu_read_lock held
+ * already, while in the file_operations, we get it ourselves.
+ *
+ * When destroying a queue, we remove the pointers from the file and
+ * from the dev and then synchronize_rcu to make sure no thread is
+ * still using the queue. There may still be references to the struct
+ * sock inside of the queue from outbound SKBs, but these never
+ * reference back to the file or the dev. The data structure is freed
+ * through __sk_free when both our references and any pending SKBs
+ * are gone.
+ *
+ * macvtap_lock is only used to prevent multiple concurrent open()
+ * calls to assign a new vlan->tap pointer. It could be moved into
+ * the macvlan_dev itself but is extremely rarely used.
+ */
+static DEFINE_SPINLOCK(macvtap_lock);
+
+/*
+ * Choose the next free queue, for now there is only one
+ */
+static int macvtap_set_queue(struct net_device *dev, struct file *file,
+				struct macvtap_queue *q)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	int err = -EBUSY;
+
+	spin_lock(&macvtap_lock);
+	if (rcu_dereference(vlan->tap))
+		goto out;
+
+	err = 0;
+	q->vlan = vlan;
+	rcu_assign_pointer(vlan->tap, q);
+
+	q->file = file;
+	rcu_assign_pointer(file->private_data, q);
+
+out:
+	spin_unlock(&macvtap_lock);
+	return err;
+}
+
+/*
+ * We must destroy each queue exactly once, when either
+ * the netdev or the file go away.
+ *
+ * Using the spinlock makes sure that we don't get
+ * to the queue again after destroying it.
+ *
+ * synchronize_rcu serializes with the packet flow
+ * that uses rcu_read_lock.
+ */
+static void macvtap_del_queue(struct macvtap_queue **qp)
+{
+	struct macvtap_queue *q;
+
+	spin_lock(&macvtap_lock);
+	q = rcu_dereference(*qp);
+	if (!q) {
+		spin_unlock(&macvtap_lock);
+		return;
+	}
+
+	rcu_assign_pointer(q->vlan->tap, NULL);
+	rcu_assign_pointer(q->file->private_data, NULL);
+	spin_unlock(&macvtap_lock);
+
+	synchronize_rcu();
+	sock_put(&q->sk);
+}
+
+/*
+ * Since we only support one queue, just dereference the pointer.
+ */
+static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
+					       struct sk_buff *skb)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+
+	return rcu_dereference(vlan->tap);
+}
+
+static void macvtap_del_queues(struct net_device *dev)
+{
+	struct macvlan_dev *vlan = netdev_priv(dev);
+	macvtap_del_queue(&vlan->tap);
+}
+
+static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
+{
+	rcu_read_lock_bh();
+	return rcu_dereference(file->private_data);
+}
+
+static inline void macvtap_file_put_queue(void)
+{
+	rcu_read_unlock_bh();
+}
+
+/*
+ * Forward happens for data that gets sent from one macvlan
+ * endpoint to another one in bridge mode. We just take
+ * the skb and put it into the receive queue.
+ */
+static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
+{
+	struct macvtap_queue *q = macvtap_get_queue(dev, skb);
+	if (!q)
+		return -ENOLINK;
+
+	skb_queue_tail(&q->sk.sk_receive_queue, skb);
+	wake_up(q->sk.sk_sleep);
+	return 0;
+}
+
+/*
+ * Receive is for data from the external interface (lowerdev),
+ * in case of macvtap, we can treat that the same way as
+ * forward, which macvlan cannot.
+ */
+static int macvtap_receive(struct sk_buff *skb)
+{
+	skb_push(skb, ETH_HLEN);
+	return macvtap_forward(skb->dev, skb);
+}
+
+static int macvtap_newlink(struct net *src_net,
+			   struct net_device *dev,
+			   struct nlattr *tb[],
+			   struct nlattr *data[])
+{
+	struct device *classdev;
+	dev_t devt;
+	int err;
+
+	err = macvlan_common_newlink(src_net, dev, tb, data,
+				     macvtap_receive, macvtap_forward);
+	if (err)
+		goto out;
+
+	devt = MKDEV(MAJOR(macvtap_major), dev->ifindex);
+
+	classdev = device_create(macvtap_class, &dev->dev, devt,
+				 dev, "tap%d", dev->ifindex);
+	if (IS_ERR(classdev)) {
+		err = PTR_ERR(classdev);
+		macvtap_del_queues(dev);
+	}
+
+out:
+	return err;
+}
+
+static void macvtap_dellink(struct net_device *dev,
+			    struct list_head *head)
+{
+	device_destroy(macvtap_class,
+		       MKDEV(MAJOR(macvtap_major), dev->ifindex));
+
+	macvtap_del_queues(dev);
+	macvlan_dellink(dev, head);
+}
+
+static struct rtnl_link_ops macvtap_link_ops __read_mostly = {
+	.kind		= "macvtap",
+	.newlink	= macvtap_newlink,
+	.dellink	= macvtap_dellink,
+};
+
+
+static void macvtap_sock_write_space(struct sock *sk)
+{
+	if (!sock_writeable(sk) ||
+	    !test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
+		return;
+
+	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		wake_up_interruptible_sync(sk->sk_sleep);
+}
+
+static int macvtap_open(struct inode *inode, struct file *file)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct net_device *dev = dev_get_by_index(net, iminor(inode));
+	struct macvtap_queue *q;
+	int err;
+
+	err = -ENODEV;
+	if (!dev)
+		goto out;
+
+	/* check if this is a macvtap device */
+	err = -EINVAL;
+	if (dev->rtnl_link_ops != &macvtap_link_ops)
+		goto out;
+
+	err = -ENOMEM;
+	q = (struct macvtap_queue *)sk_alloc(net, AF_UNSPEC, GFP_KERNEL,
+					     &macvtap_proto);
+	if (!q)
+		goto out;
+
+	init_waitqueue_head(&q->sock.wait);
+	q->sock.type = SOCK_RAW;
+	q->sock.state = SS_CONNECTED;
+	sock_init_data(&q->sock, &q->sk);
+	q->sk.sk_allocation = GFP_ATOMIC; /* for now */
+	q->sk.sk_write_space = macvtap_sock_write_space;
+
+	err = macvtap_set_queue(dev, file, q);
+	if (err)
+		sock_put(&q->sk);
+
+out:
+	if (dev)
+		dev_put(dev);
+
+	return err;
+}
+
+static int macvtap_release(struct inode *inode, struct file *file)
+{
+	macvtap_del_queue((struct macvtap_queue **)&file->private_data);
+	return 0;
+}
+
+static unsigned int macvtap_poll(struct file *file, poll_table * wait)
+{
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	unsigned int mask = POLLERR;
+
+	if (!q)
+		goto out;
+
+	mask = 0;
+	poll_wait(file, &q->sock.wait, wait);
+
+	if (!skb_queue_empty(&q->sk.sk_receive_queue))
+		mask |= POLLIN | POLLRDNORM;
+
+	if (sock_writeable(&q->sk) ||
+	    (!test_and_set_bit(SOCK_ASYNC_NOSPACE, &q->sock.flags) &&
+	     sock_writeable(&q->sk)))
+		mask |= POLLOUT | POLLWRNORM;
+
+out:
+	macvtap_file_put_queue();
+	return mask;
+}
+
+/* Get packet from user space buffer */
+static ssize_t macvtap_get_user(struct macvtap_queue *q,
+				const struct iovec *iv, size_t count,
+				int noblock)
+{
+	struct sk_buff *skb;
+	size_t len = count;
+	int err;
+
+	if (unlikely(len < ETH_HLEN))
+		return -EINVAL;
+
+	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
+
+	if (!skb) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		return err;
+	}
+
+	skb_reserve(skb, NET_IP_ALIGN);
+	skb_put(skb, count);
+
+	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
+		macvlan_count_rx(q->vlan, 0, false, false);
+		kfree_skb(skb);
+		return -EFAULT;
+	}
+
+	skb_set_network_header(skb, ETH_HLEN);
+
+	macvlan_start_xmit(skb, q->vlan->dev);
+
+	return count;
+}
+
+static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
+				 unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	ssize_t result = -ENOLINK;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	if (!q)
+		goto out;
+
+	result = macvtap_get_user(q, iv, iov_length(iv, count),
+			      file->f_flags & O_NONBLOCK);
+out:
+	macvtap_file_put_queue();
+	return result;
+}
+
+/* Put packet to the user space buffer */
+static ssize_t macvtap_put_user(struct macvtap_queue *q,
+				const struct sk_buff *skb,
+				const struct iovec *iv, int len)
+{
+	struct macvlan_dev *vlan = q->vlan;
+	int ret;
+
+	len = min_t(int, skb->len, len);
+
+	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
+
+	macvlan_count_rx(vlan, len, ret == 0, 0);
+
+	return ret ? ret : len;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct macvtap_queue *q = macvtap_file_get_queue(file);
+
+	DECLARE_WAITQUEUE(wait, current);
+	struct sk_buff *skb;
+	ssize_t len, ret = 0;
+
+	if (!q) {
+		ret = -ENOLINK;
+		goto out;
+	}
+
+	len = iov_length(iv, count);
+	if (len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	add_wait_queue(q->sk.sk_sleep, &wait);
+	while (len) {
+		current->state = TASK_INTERRUPTIBLE;
+
+		/* Read frames from the queue */
+		skb = skb_dequeue(&q->sk.sk_receive_queue);
+		if (!skb) {
+			if (file->f_flags & O_NONBLOCK) {
+				ret = -EAGAIN;
+				break;
+			}
+			if (signal_pending(current)) {
+				ret = -ERESTARTSYS;
+				break;
+			}
+			/* Nothing to read, let's sleep */
+			schedule();
+			continue;
+		}
+		ret = macvtap_put_user(q, skb, iv, len);
+		kfree_skb(skb);
+		break;
+	}
+
+	current->state = TASK_RUNNING;
+	remove_wait_queue(q->sk.sk_sleep, &wait);
+
+out:
+	macvtap_file_put_queue();
+	return ret;
+}
+
+/*
+ * provide compatibility with generic tun/tap interface
+ */
+static long macvtap_ioctl(struct file *file, unsigned int cmd,
+			  unsigned long arg)
+{
+	struct macvtap_queue *q;
+	void __user *argp = (void __user *)arg;
+	struct ifreq __user *ifr = argp;
+	unsigned int __user *up = argp;
+	unsigned int u;
+	char devname[IFNAMSIZ];
+
+	switch (cmd) {
+	case TUNSETIFF:
+		/* ignore the name, just look at flags */
+		if (get_user(u, &ifr->ifr_flags))
+			return -EFAULT;
+		if (u != (IFF_TAP | IFF_NO_PI))
+			return -EINVAL;
+		return 0;
+
+	case TUNGETIFF:
+		q = macvtap_file_get_queue(file);
+		if (!q)
+			return -ENOLINK;
+		memcpy(devname, q->vlan->dev->name, sizeof(devname));
+		macvtap_file_put_queue();
+
+		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
+		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
+			return -EFAULT;
+		return 0;
+
+	case TUNGETFEATURES:
+		if (put_user((IFF_TAP | IFF_NO_PI), up))
+			return -EFAULT;
+		return 0;
+
+	case TUNSETSNDBUF:
+		if (get_user(u, up))
+			return -EFAULT;
+
+		q = macvtap_file_get_queue(file);
+		q->sk.sk_sndbuf = u;
+		macvtap_file_put_queue();
+		return 0;
+
+	case TUNSETOFFLOAD:
+		/* let the user check for future flags */
+		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			  TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		/* TODO: add support for these, so far we don't
+			 support any offload */
+		if (arg & (TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
+			 TUN_F_TSO_ECN | TUN_F_UFO))
+			return -EINVAL;
+
+		return 0;
+
+	default:
+		return -EINVAL;
+	}
+}
+
+#ifdef CONFIG_COMPAT
+static long macvtap_compat_ioctl(struct file *file, unsigned int cmd,
+				 unsigned long arg)
+{
+	return macvtap_ioctl(file, cmd, (unsigned long)compat_ptr(arg));
+}
+#endif
+
+static const struct file_operations macvtap_fops = {
+	.owner		= THIS_MODULE,
+	.open		= macvtap_open,
+	.release	= macvtap_release,
+	.aio_read	= macvtap_aio_read,
+	.aio_write	= macvtap_aio_write,
+	.poll		= macvtap_poll,
+	.llseek		= no_llseek,
+	.unlocked_ioctl	= macvtap_ioctl,
+#ifdef CONFIG_COMPAT
+	.compat_ioctl	= macvtap_compat_ioctl,
+#endif
+};
+
+static int macvtap_init(void)
+{
+	int err;
+
+	err = alloc_chrdev_region(&macvtap_major, 0,
+				MACVTAP_NUM_DEVS, "macvtap");
+	if (err)
+		goto out1;
+
+	cdev_init(&macvtap_cdev, &macvtap_fops);
+	err = cdev_add(&macvtap_cdev, macvtap_major, MACVTAP_NUM_DEVS);
+	if (err)
+		goto out2;
+
+	macvtap_class = class_create(THIS_MODULE, "macvtap");
+	if (IS_ERR(macvtap_class)) {
+		err = PTR_ERR(macvtap_class);
+		goto out3;
+	}
+
+	err = macvlan_link_register(&macvtap_link_ops);
+	if (err)
+		goto out4;
+
+	return 0;
+
+out4:
+	class_unregister(macvtap_class);
+out3:
+	cdev_del(&macvtap_cdev);
+out2:
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+out1:
+	return err;
+}
+module_init(macvtap_init);
+
+static void macvtap_exit(void)
+{
+	rtnl_link_unregister(&macvtap_link_ops);
+	class_unregister(macvtap_class);
+	cdev_del(&macvtap_cdev);
+	unregister_chrdev_region(macvtap_major, MACVTAP_NUM_DEVS);
+}
+module_exit(macvtap_exit);
+
+MODULE_ALIAS_RTNL_LINK("macvtap");
+MODULE_AUTHOR("Arnd Bergmann <arnd@arndb.de>");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index 9a11544..51f1512 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -34,6 +34,7 @@ struct macvlan_dev {
 	enum macvlan_mode	mode;
 	int (*receive)(struct sk_buff *skb);
 	int (*forward)(struct net_device *dev, struct sk_buff *skb);
+	struct macvtap_queue	*tap;
 };
 
 static inline void macvlan_count_rx(const struct macvlan_dev *vlan,
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-01-30 22:22   ` [Bridge] " Arnd Bergmann
@ 2010-02-04  4:21     ` David Miller
  -1 siblings, 0 replies; 63+ messages in thread
From: David Miller @ 2010-02-04  4:21 UTC (permalink / raw)
  To: arnd
  Cc: shemminger, kaber, mst, herbert, ogerlitz, netdev, bridge,
	linux-kernel, virtualization

From: Arnd Bergmann <arnd@arndb.de>
Date: Sat, 30 Jan 2010 23:22:15 +0100

> This is the fourth version of the macvtap driver,
> based on the comments I got for the last version
> I got a few days ago. Very few changes:
> 
> * release netdev in chardev open function so
>   we can destroy it properly.
> * Implement TUNSETSNDBUF
> * fix sleeping call in rcu_read_lock
> * Fix comment in namespace isolation patch
> * Fix small context difference to make it apply
>   to net-next
> 
> I can't really test here while travelling, so please
> give it a go if you're interested in this driver.

All applied to net-next-2.6, thanks!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-01-30 22:22   ` [Bridge] " Arnd Bergmann
                     ` (6 preceding siblings ...)
  (?)
@ 2010-02-04  4:21   ` David Miller
  -1 siblings, 0 replies; 63+ messages in thread
From: David Miller @ 2010-02-04  4:21 UTC (permalink / raw)
  To: arnd
  Cc: herbert, mst, netdev, bridge, linux-kernel, virtualization, shemminger

From: Arnd Bergmann <arnd@arndb.de>
Date: Sat, 30 Jan 2010 23:22:15 +0100

> This is the fourth version of the macvtap driver,
> based on the comments I got for the last version
> I got a few days ago. Very few changes:
> 
> * release netdev in chardev open function so
>   we can destroy it properly.
> * Implement TUNSETSNDBUF
> * fix sleeping call in rcu_read_lock
> * Fix comment in namespace isolation patch
> * Fix small context difference to make it apply
>   to net-next
> 
> I can't really test here while travelling, so please
> give it a go if you're interested in this driver.

All applied to net-next-2.6, thanks!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [Bridge] [PATCH 0/3 v4] macvtap driver
@ 2010-02-04  4:21     ` David Miller
  0 siblings, 0 replies; 63+ messages in thread
From: David Miller @ 2010-02-04  4:21 UTC (permalink / raw)
  To: arnd; +Cc: herbert, mst, netdev, bridge, linux-kernel, virtualization, ogerlitz

From: Arnd Bergmann <arnd@arndb.de>
Date: Sat, 30 Jan 2010 23:22:15 +0100

> This is the fourth version of the macvtap driver,
> based on the comments I got for the last version
> I got a few days ago. Very few changes:
> 
> * release netdev in chardev open function so
>   we can destroy it properly.
> * Implement TUNSETSNDBUF
> * fix sleeping call in rcu_read_lock
> * Fix comment in namespace isolation patch
> * Fix small context difference to make it apply
>   to net-next
> 
> I can't really test here while travelling, so please
> give it a go if you're interested in this driver.

All applied to net-next-2.6, thanks!

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-04  4:21     ` [Bridge] " David Miller
  (?)
@ 2010-02-08 17:14     ` Ed Swierk
  2010-02-08 18:55       ` Sridhar Samudrala
  -1 siblings, 1 reply; 63+ messages in thread
From: Ed Swierk @ 2010-02-08 17:14 UTC (permalink / raw)
  To: arnd; +Cc: netdev

> From: Arnd Bergmann <arnd@arndb.de>
> Date: Sat, 30 Jan 2010 23:22:15 +0100
>
>> This is the fourth version of the macvtap driver,
>> based on the comments I got for the last version
>> I got a few days ago. Very few changes:
>>
>> * release netdev in chardev open function so
>>   we can destroy it properly.
>> * Implement TUNSETSNDBUF
>> * fix sleeping call in rcu_read_lock
>> * Fix comment in namespace isolation patch
>> * Fix small context difference to make it apply
>>   to net-next
>>
>> I can't really test here while travelling, so please
>> give it a go if you're interested in this driver.

I'm seeing complaints from might_sleep():

Feb  8 16:21:06 ti102 kernel: BUG: sleeping function called from
invalid context at include/linux/kernel.h:155
Feb  8 16:21:06 ti102 kernel: in_atomic(): 1, irqs_disabled(): 0, pid:
2881, name: qemu-kvm
Feb  8 16:21:06 ti102 kernel: Pid: 2881, comm: qemu-kvm Not tainted
2.6.29.6.Ar-224527.2009eswierk8 #1
Feb  8 16:21:06 ti102 kernel: Call Trace:
Feb  8 16:21:06 ti102 kernel: [<c0119250>] __might_sleep+0xdc/0xe3
Feb  8 16:21:06 ti102 kernel: [<c0210f7c>] copy_to_user+0x36/0x106
Feb  8 16:21:06 ti102 kernel: [<c02af568>] memcpy_toiovec+0x2c/0x50
Feb  8 16:21:06 ti102 kernel: [<c02afbb3>] skb_copy_datagram_iovec+0x47/0x184
Feb  8 16:21:06 ti102 kernel: [<c034bd07>] ? _spin_unlock_irqrestore+0x17/0x2c
Feb  8 16:21:06 ti102 kernel: [<f829a776>]
macvtap_aio_read+0x102/0x158 [macvtap]
Feb  8 16:21:06 ti102 kernel: [<c011eaf7>] ? default_wake_function+0x0/0xd
Feb  8 16:21:06 ti102 kernel: [<c016c75f>] do_sync_read+0xab/0xe9
Feb  8 16:21:06 ti102 kernel: [<c0133933>] ? autoremove_wake_function+0x0/0x33
Feb  8 16:21:06 ti102 kernel: [<c019211f>] ? eventfd_read+0x121/0x156
Feb  8 16:21:06 ti102 kernel: [<c011eaf7>] ? default_wake_function+0x0/0xd
Feb  8 16:21:06 ti102 kernel: [<c016d101>] vfs_read+0xb5/0x129
Feb  8 16:21:06 ti102 kernel: [<c016d20e>] sys_read+0x3b/0x60
Feb  8 16:21:06 ti102 kernel: [<c0102e71>] sysenter_do_call+0x12/0x25

Feb  8 16:21:08 ti102 kernel: BUG: sleeping function called from
invalid context at include/linux/kernel.h:155
Feb  8 16:21:08 ti102 kernel: in_atomic(): 1, irqs_disabled(): 0, pid:
2882, name: qemu-kvm
Feb  8 16:21:08 ti102 kernel: Pid: 2882, comm: qemu-kvm Not tainted
2.6.29.6.Ar-224527.2009eswierk8 #1
Feb  8 16:21:08 ti102 kernel: Call Trace:
Feb  8 16:21:08 ti102 kernel: [<c0119250>] __might_sleep+0xdc/0xe3
Feb  8 16:21:08 ti102 kernel: [<c0210e5f>] copy_from_user+0x34/0x11b
Feb  8 16:21:08 ti102 kernel: [<c02af388>] memcpy_fromiovec+0x2c/0x50
Feb  8 16:21:08 ti102 kernel: [<c02afa30>]
skb_copy_datagram_from_iovec+0x47/0x183
Feb  8 16:21:08 ti102 kernel: [<f829a618>]
macvtap_aio_write+0xaa/0x106 [macvtap]
Feb  8 16:21:08 ti102 kernel: [<c034a553>] ? mutex_unlock+0x8/0xa
Feb  8 16:21:08 ti102 kernel: [<c016c58d>] do_sync_readv_writev+0xa1/0xdf
Feb  8 16:21:08 ti102 kernel: [<c0133933>] ? autoremove_wake_function+0x0/0x33
Feb  8 16:21:08 ti102 kernel: [<c016c44f>] ? rw_copy_check_uvector+0x5b/0xc3
Feb  8 16:21:08 ti102 kernel: [<c016cbec>] do_readv_writev+0x82/0x165
Feb  8 16:21:08 ti102 kernel: [<f829a56e>] ?
macvtap_aio_write+0x0/0x106 [macvtap]
Feb  8 16:21:08 ti102 kernel: [<c0177461>] ? do_vfs_ioctl+0x4a3/0x4dc
Feb  8 16:21:08 ti102 kernel: [<c0112d83>] ? read_hpet+0xf/0x13
Feb  8 16:21:08 ti102 kernel: [<c016cd1e>] vfs_writev+0x4f/0x6b
Feb  8 16:21:08 ti102 kernel: [<c016cd75>] sys_writev+0x3b/0x60
Feb  8 16:21:08 ti102 kernel: [<c0102e71>] sysenter_do_call+0x12/0x25

I backported your patch to kernel 2.6.29.6, an i386 kernel with
CONFIG_PREEMPT=y and CONFIG_CLASSIC_RCU=y. It's entirely possible that
I screwed something up in the backport; I can post my modified patch
if it would help.

--Ed

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-08 17:14     ` Ed Swierk
@ 2010-02-08 18:55       ` Sridhar Samudrala
  2010-02-08 23:30         ` Ed Swierk
                           ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Sridhar Samudrala @ 2010-02-08 18:55 UTC (permalink / raw)
  To: Ed Swierk; +Cc: arnd, netdev

On Mon, 2010-02-08 at 09:14 -0800, Ed Swierk wrote:
> > From: Arnd Bergmann <arnd@arndb.de>
> > Date: Sat, 30 Jan 2010 23:22:15 +0100
> >
> >> This is the fourth version of the macvtap driver,
> >> based on the comments I got for the last version
> >> I got a few days ago. Very few changes:
> >>
> >> * release netdev in chardev open function so
> >>   we can destroy it properly.
> >> * Implement TUNSETSNDBUF
> >> * fix sleeping call in rcu_read_lock
> >> * Fix comment in namespace isolation patch
> >> * Fix small context difference to make it apply
> >>   to net-next
> >>
> >> I can't really test here while travelling, so please
> >> give it a go if you're interested in this driver.
> 
> I'm seeing complaints from might_sleep():
> 
> Feb  8 16:21:06 ti102 kernel: BUG: sleeping function called from
> invalid context at include/linux/kernel.h:155
> Feb  8 16:21:06 ti102 kernel: in_atomic(): 1, irqs_disabled(): 0, pid:
> 2881, name: qemu-kvm
> Feb  8 16:21:06 ti102 kernel: Pid: 2881, comm: qemu-kvm Not tainted
> 2.6.29.6.Ar-224527.2009eswierk8 #1
> Feb  8 16:21:06 ti102 kernel: Call Trace:
> Feb  8 16:21:06 ti102 kernel: [<c0119250>] __might_sleep+0xdc/0xe3
> Feb  8 16:21:06 ti102 kernel: [<c0210f7c>] copy_to_user+0x36/0x106
> Feb  8 16:21:06 ti102 kernel: [<c02af568>] memcpy_toiovec+0x2c/0x50
> Feb  8 16:21:06 ti102 kernel: [<c02afbb3>] skb_copy_datagram_iovec+0x47/0x184
> Feb  8 16:21:06 ti102 kernel: [<c034bd07>] ? _spin_unlock_irqrestore+0x17/0x2c
> Feb  8 16:21:06 ti102 kernel: [<f829a776>]
> macvtap_aio_read+0x102/0x158 [macvtap]
> Feb  8 16:21:06 ti102 kernel: [<c011eaf7>] ? default_wake_function+0x0/0xd
> Feb  8 16:21:06 ti102 kernel: [<c016c75f>] do_sync_read+0xab/0xe9
> Feb  8 16:21:06 ti102 kernel: [<c0133933>] ? autoremove_wake_function+0x0/0x33
> Feb  8 16:21:06 ti102 kernel: [<c019211f>] ? eventfd_read+0x121/0x156
> Feb  8 16:21:06 ti102 kernel: [<c011eaf7>] ? default_wake_function+0x0/0xd
> Feb  8 16:21:06 ti102 kernel: [<c016d101>] vfs_read+0xb5/0x129
> Feb  8 16:21:06 ti102 kernel: [<c016d20e>] sys_read+0x3b/0x60
> Feb  8 16:21:06 ti102 kernel: [<c0102e71>] sysenter_do_call+0x12/0x25

I am also seeing this issue with net-next-2.6.
Basically macvtap_put_user() and macvtap_get_user() call copy_to/from_user
from within a RCU read-side critical section.

The following patch fixes this issue by releasing the RCU read lock before
calling these routines, but instead hold a reference to q->sk.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index ad1f6ef..e3102ab 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -320,7 +320,7 @@ out:
 }
 
 /* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
+static ssize_t macvtap_get_user(struct macvlan_dev *vlan, struct sock *sk,
 				const struct iovec *iv, size_t count,
 				int noblock)
 {
@@ -331,10 +331,10 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	if (unlikely(len < ETH_HLEN))
 		return -EINVAL;
 
-	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
+	skb = sock_alloc_send_skb(sk, NET_IP_ALIGN + len, noblock, &err);
 
 	if (!skb) {
-		macvlan_count_rx(q->vlan, 0, false, false);
+		macvlan_count_rx(vlan, 0, false, false);
 		return err;
 	}
 
@@ -342,14 +342,14 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	skb_put(skb, count);
 
 	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
-		macvlan_count_rx(q->vlan, 0, false, false);
+		macvlan_count_rx(vlan, 0, false, false);
 		kfree_skb(skb);
 		return -EFAULT;
 	}
 
 	skb_set_network_header(skb, ETH_HLEN);
 
-	macvlan_start_xmit(skb, q->vlan->dev);
+	macvlan_start_xmit(skb, vlan->dev);
 
 	return count;
 }
@@ -360,23 +360,29 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
 	struct file *file = iocb->ki_filp;
 	ssize_t result = -ENOLINK;
 	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	struct macvlan_dev *vlan;
+	struct sock *sk;
 
 	if (!q)
 		goto out;
 
-	result = macvtap_get_user(q, iv, iov_length(iv, count),
+	vlan = q->vlan;
+	sk = &q->sk;
+	sock_hold(sk);
+	macvtap_file_put_queue();
+
+	result = macvtap_get_user(vlan, sk, iv, iov_length(iv, count),
 			      file->f_flags & O_NONBLOCK);
+	sock_put(sk);
 out:
-	macvtap_file_put_queue();
 	return result;
 }
 
 /* Put packet to the user space buffer */
-static ssize_t macvtap_put_user(struct macvtap_queue *q,
+static ssize_t macvtap_put_user(struct macvlan_dev *vlan,
 				const struct sk_buff *skb,
 				const struct iovec *iv, int len)
 {
-	struct macvlan_dev *vlan = q->vlan;
 	int ret;
 
 	len = min_t(int, skb->len, len);
@@ -393,15 +399,20 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 {
 	struct file *file = iocb->ki_filp;
 	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	struct macvlan_dev *vlan;
+	struct sock *sk;
 
 	DECLARE_WAITQUEUE(wait, current);
 	struct sk_buff *skb;
 	ssize_t len, ret = 0;
 
-	if (!q) {
-		ret = -ENOLINK;
-		goto out;
-	}
+	if (!q)
+		return -ENOLINK;
+
+	vlan = q->vlan;
+	sk = &q->sk;
+	sock_hold(sk);
+	macvtap_file_put_queue();
 
 	len = iov_length(iv, count);
 	if (len < 0) {
@@ -409,12 +420,12 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 		goto out;
 	}
 
-	add_wait_queue(q->sk.sk_sleep, &wait);
+	add_wait_queue(sk->sk_sleep, &wait);
 	while (len) {
 		current->state = TASK_INTERRUPTIBLE;
 
 		/* Read frames from the queue */
-		skb = skb_dequeue(&q->sk.sk_receive_queue);
+		skb = skb_dequeue(&sk->sk_receive_queue);
 		if (!skb) {
 			if (file->f_flags & O_NONBLOCK) {
 				ret = -EAGAIN;
@@ -428,16 +439,16 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 			schedule();
 			continue;
 		}
-		ret = macvtap_put_user(q, skb, iv, len);
+		ret = macvtap_put_user(vlan, skb, iv, len);
 		kfree_skb(skb);
 		break;
 	}
 
 	current->state = TASK_RUNNING;
-	remove_wait_queue(q->sk.sk_sleep, &wait);
+	remove_wait_queue(sk->sk_sleep, &wait);
 
 out:
-	macvtap_file_put_queue();
+	sock_put(sk);
 	return ret;
 }
 





^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-08 18:55       ` Sridhar Samudrala
@ 2010-02-08 23:30         ` Ed Swierk
  2010-02-10 14:50           ` Arnd Bergmann
  2010-02-09  3:25         ` Ed Swierk
  2010-02-10 14:48         ` Arnd Bergmann
  2 siblings, 1 reply; 63+ messages in thread
From: Ed Swierk @ 2010-02-08 23:30 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: arnd, netdev

On Mon, 2010-02-08 at 10:55 -0800, Sridhar Samudrala wrote:
> I am also seeing this issue with net-next-2.6.
> Basically macvtap_put_user() and macvtap_get_user() call copy_to/from_user
> from within a RCU read-side critical section.
> 
> The following patch fixes this issue by releasing the RCU read lock before
> calling these routines, but instead hold a reference to q->sk.

Thanks, I tried your patch and it fixes the problem.

However, it seems to cause another minor problem.  macvlan_count_rx() is
now getting called from macvtap_put_user() with preemption enabled,
which causes smp_processor_id() to BUG:

Feb  8 20:31:38 ti102 kernel: BUG: using smp_processor_id() in
preemptible [00000000] code: qemu-kvm/4546 
Feb  8 20:31:38 ti102 kernel: caller is macvtap_aio_read+0x18c/0x221
[macvtap] 
Feb  8 20:31:38 ti102 kernel: Pid: 4546, comm: qemu-kvm Not tainted
2.6.29.6.Ar-224686.2009eswierk8.2 #1 
Feb  8 20:31:38 ti102 kernel: Call Trace: 
Feb  8 20:31:38 ti102 kernel: [<c0349546>] ? printk+0xf/0x11 
Feb  8 20:31:38 ti102 kernel: [<c02142c0>] debug_smp_processor_id
+0xa4/0xb8 
Feb  8 20:31:38 ti102 kernel: [<f8af581f>] macvtap_aio_read+0x18c/0x221
[macvtap] 
Feb  8 20:31:38 ti102 kernel: [<c011eaf7>] ? default_wake_function
+0x0/0xd 
Feb  8 20:31:38 ti102 kernel: [<c016c75f>] do_sync_read+0xab/0xe9 
Feb  8 20:31:38 ti102 kernel: [<c011933d>] ? update_curr+0x6c/0x147 
Feb  8 20:31:38 ti102 kernel: [<c0133933>] ? autoremove_wake_function
+0x0/0x33 
Feb  8 20:31:38 ti102 kernel: [<c0349fd0>] ? schedule+0x7af/0x7e3 
Feb  8 20:31:38 ti102 kernel: [<c016d101>] vfs_read+0xb5/0x129 
Feb  8 20:31:38 ti102 kernel: [<c016d20e>] sys_read+0x3b/0x60 
Feb  8 20:31:38 ti102 kernel: [<c0102e71>] sysenter_do_call+0x12/0x25 

I fixed this problem with the change below.  I'm not sure if replacing
smp_processor_id() with get_cpu() is the right thing to do but it works
for macvtap at least.

Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>

---
Index: linux-2.6.29.6/include/linux/if_macvlan.h
===================================================================
--- linux-2.6.29.6.orig/include/linux/if_macvlan.h
+++ linux-2.6.29.6/include/linux/if_macvlan.h
@@ -42,8 +42,9 @@ static inline void macvlan_count_rx(cons
 				    bool multicast)
 {
 	struct macvlan_rx_stats *rx_stats;
+	int cpu = get_cpu();
 
-	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
+	rx_stats = per_cpu_ptr(vlan->rx_stats, cpu);
 	if (likely(success)) {
 		rx_stats->rx_packets++;;
 		rx_stats->rx_bytes += len;
@@ -52,6 +53,7 @@ static inline void macvlan_count_rx(cons
 	} else {
 		rx_stats->rx_errors++;
 	}
+	put_cpu();
 }
 
 extern int macvlan_common_newlink(struct net_device *dev,



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-08 18:55       ` Sridhar Samudrala
  2010-02-08 23:30         ` Ed Swierk
@ 2010-02-09  3:25         ` Ed Swierk
  2010-02-10 14:52           ` Arnd Bergmann
  2010-02-10 14:48         ` Arnd Bergmann
  2 siblings, 1 reply; 63+ messages in thread
From: Ed Swierk @ 2010-02-09  3:25 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: arnd, netdev

On Mon, 2010-02-08 at 10:55 -0800, Sridhar Samudrala wrote:
> I am also seeing this issue with net-next-2.6.
> Basically macvtap_put_user() and macvtap_get_user() call copy_to/from_user
> from within a RCU read-side critical section.
> 
> The following patch fixes this issue by releasing the RCU read lock before
> calling these routines, but instead hold a reference to q->sk.

I've encountered some more problems, with various users of
macvtap_file_get_queue() either calling or neglecting to call
macvtap_file_put_queue() in error cases.

I modified your patch so that when macvtap_file_get_queue() returns 0,
it also calls rcu_read_unlock_bh(), and modified the users
appropriately.

This patch also incorporates my preemption fix for macvlan_count_rx().

Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>

---
On Mon, 2010-02-08 at 09:14 -0800, Ed Swierk wrote:
> > From: Arnd Bergmann <arnd@arndb.de>
> > Date: Sat, 30 Jan 2010 23:22:15 +0100
> >
> >> This is the fourth version of the macvtap driver,
> >> based on the comments I got for the last version
> >> I got a few days ago. Very few changes:
> >>
> >> * release netdev in chardev open function so
> >>   we can destroy it properly.
> >> * Implement TUNSETSNDBUF
> >> * fix sleeping call in rcu_read_lock
> >> * Fix comment in namespace isolation patch
> >> * Fix small context difference to make it apply
> >>   to net-next
> >>
> >> I can't really test here while travelling, so please
> >> give it a go if you're interested in this driver.
> 
> I'm seeing complaints from might_sleep():
> 
> Feb  8 16:21:06 ti102 kernel: BUG: sleeping function called from
> invalid context at include/linux/kernel.h:155
> Feb  8 16:21:06 ti102 kernel: in_atomic(): 1, irqs_disabled(): 0, pid:
> 2881, name: qemu-kvm
> Feb  8 16:21:06 ti102 kernel: Pid: 2881, comm: qemu-kvm Not tainted
> 2.6.29.6.Ar-224527.2009eswierk8 #1
> Feb  8 16:21:06 ti102 kernel: Call Trace:
> Feb  8 16:21:06 ti102 kernel: [<c0119250>] __might_sleep+0xdc/0xe3
> Feb  8 16:21:06 ti102 kernel: [<c0210f7c>] copy_to_user+0x36/0x106
> Feb  8 16:21:06 ti102 kernel: [<c02af568>] memcpy_toiovec+0x2c/0x50
> Feb  8 16:21:06 ti102 kernel: [<c02afbb3>] skb_copy_datagram_iovec+0x47/0x184
> Feb  8 16:21:06 ti102 kernel: [<c034bd07>] ? _spin_unlock_irqrestore+0x17/0x2c
> Feb  8 16:21:06 ti102 kernel: [<f829a776>]
> macvtap_aio_read+0x102/0x158 [macvtap]
> Feb  8 16:21:06 ti102 kernel: [<c011eaf7>] ? default_wake_function+0x0/0xd
> Feb  8 16:21:06 ti102 kernel: [<c016c75f>] do_sync_read+0xab/0xe9
> Feb  8 16:21:06 ti102 kernel: [<c0133933>] ? autoremove_wake_function+0x0/0x33
> Feb  8 16:21:06 ti102 kernel: [<c019211f>] ? eventfd_read+0x121/0x156
> Feb  8 16:21:06 ti102 kernel: [<c011eaf7>] ? default_wake_function+0x0/0xd
> Feb  8 16:21:06 ti102 kernel: [<c016d101>] vfs_read+0xb5/0x129
> Feb  8 16:21:06 ti102 kernel: [<c016d20e>] sys_read+0x3b/0x60
> Feb  8 16:21:06 ti102 kernel: [<c0102e71>] sysenter_do_call+0x12/0x25

I am also seeing this issue with net-next-2.6.
Basically macvtap_put_user() and macvtap_get_user() call copy_to/from_user
from within a RCU read-side critical section.

The following patch fixes this issue by releasing the RCU read lock before
calling these routines, but instead hold a reference to q->sk.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>

Index: linux-2.6.29.6/drivers/net/macvtap.c
===================================================================
--- linux-2.6.29.6.orig/drivers/net/macvtap.c
+++ linux-2.6.29.6/drivers/net/macvtap.c
@@ -160,8 +160,12 @@ static void macvtap_del_queues(struct ne
 
 static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
 {
+	struct macvtap_queue *q;
 	rcu_read_lock_bh();
-	return rcu_dereference(file->private_data);
+	q = rcu_dereference(file->private_data);
+	if (!q)
+		rcu_read_unlock_bh();
+	return q;
 }
 
 static inline void macvtap_file_put_queue(void)
@@ -313,13 +317,14 @@ static unsigned int macvtap_poll(struct 
 	     sock_writeable(&q->sk)))
 		mask |= POLLOUT | POLLWRNORM;
 
-out:
 	macvtap_file_put_queue();
+
+out:
 	return mask;
 }
 
 /* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
+static ssize_t macvtap_get_user(struct macvlan_dev *vlan, struct sock *sk,
 				struct iovec *iv, size_t count,
 				int noblock)
 {
@@ -330,10 +335,10 @@ static ssize_t macvtap_get_user(struct m
 	if (unlikely(len < ETH_HLEN))
 		return -EINVAL;
 
-	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
+	skb = sock_alloc_send_skb(sk, NET_IP_ALIGN + len, noblock, &err);
 
 	if (!skb) {
-		macvlan_count_rx(q->vlan, 0, false, false);
+		macvlan_count_rx(vlan, 0, false, false);
 		return err;
 	}
 
@@ -341,14 +346,14 @@ static ssize_t macvtap_get_user(struct m
 	skb_put(skb, count);
 
 	if (skb_copy_datagram_from_iovec(skb, 0, iv, len)) {
-		macvlan_count_rx(q->vlan, 0, false, false);
+		macvlan_count_rx(vlan, 0, false, false);
 		kfree_skb(skb);
 		return -EFAULT;
 	}
 
 	skb_set_network_header(skb, ETH_HLEN);
 
-	macvlan_start_xmit(skb, q->vlan->dev);
+	macvlan_start_xmit(skb, vlan->dev);
 
 	return count;
 }
@@ -359,23 +364,29 @@ static ssize_t macvtap_aio_write(struct 
 	struct file *file = iocb->ki_filp;
 	ssize_t result = -ENOLINK;
 	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	struct macvlan_dev *vlan;
+	struct sock *sk;
 
 	if (!q)
 		goto out;
 
-	result = macvtap_get_user(q, (struct iovec *) iv, iov_length(iv, count),
+	vlan = q->vlan;
+	sk = &q->sk;
+	sock_hold(sk);
+	macvtap_file_put_queue();
+
+	result = macvtap_get_user(vlan, sk, (struct iovec *) iv, iov_length(iv, count),
 			      file->f_flags & O_NONBLOCK);
+	sock_put(sk);
 out:
-	macvtap_file_put_queue();
 	return result;
 }
 
 /* Put packet to the user space buffer */
-static ssize_t macvtap_put_user(struct macvtap_queue *q,
+static ssize_t macvtap_put_user(struct macvlan_dev *vlan,
 				struct sk_buff *skb,
 				struct iovec *iv, int len)
 {
-	struct macvlan_dev *vlan = q->vlan;
 	int ret;
 
 	len = min_t(int, skb->len, len);
@@ -392,15 +403,20 @@ static ssize_t macvtap_aio_read(struct k
 {
 	struct file *file = iocb->ki_filp;
 	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	struct macvlan_dev *vlan;
+	struct sock *sk;
 
 	DECLARE_WAITQUEUE(wait, current);
 	struct sk_buff *skb;
 	ssize_t len, ret = 0;
 
-	if (!q) {
-		ret = -ENOLINK;
-		goto out;
-	}
+	if (!q)
+		return -ENOLINK;
+
+	vlan = q->vlan;
+	sk = &q->sk;
+	sock_hold(sk);
+	macvtap_file_put_queue();
 
 	len = iov_length(iv, count);
 	if (len < 0) {
@@ -408,12 +424,12 @@ static ssize_t macvtap_aio_read(struct k
 		goto out;
 	}
 
-	add_wait_queue(q->sk.sk_sleep, &wait);
+	add_wait_queue(sk->sk_sleep, &wait);
 	while (len) {
 		current->state = TASK_INTERRUPTIBLE;
 
 		/* Read frames from the queue */
-		skb = skb_dequeue(&q->sk.sk_receive_queue);
+		skb = skb_dequeue(&sk->sk_receive_queue);
 		if (!skb) {
 			if (file->f_flags & O_NONBLOCK) {
 				ret = -EAGAIN;
@@ -427,16 +443,16 @@ static ssize_t macvtap_aio_read(struct k
 			schedule();
 			continue;
 		}
-		ret = macvtap_put_user(q, skb, (struct iovec *) iv, len);
+		ret = macvtap_put_user(vlan, skb, (struct iovec *) iv, len);
 		kfree_skb(skb);
 		break;
 	}
 
 	current->state = TASK_RUNNING;
-	remove_wait_queue(q->sk.sk_sleep, &wait);
+	remove_wait_queue(sk->sk_sleep, &wait);
 
 out:
-	macvtap_file_put_queue();
+	sock_put(sk);
 	return ret;
 }
 
Index: linux-2.6.29.6/include/linux/if_macvlan.h
===================================================================
--- linux-2.6.29.6.orig/include/linux/if_macvlan.h
+++ linux-2.6.29.6/include/linux/if_macvlan.h
@@ -42,8 +42,9 @@ static inline void macvlan_count_rx(cons
 				    bool multicast)
 {
 	struct macvlan_rx_stats *rx_stats;
+	int cpu = get_cpu();
 
-	rx_stats = per_cpu_ptr(vlan->rx_stats, smp_processor_id());
+	rx_stats = per_cpu_ptr(vlan->rx_stats, cpu);
 	if (likely(success)) {
 		rx_stats->rx_packets++;;
 		rx_stats->rx_bytes += len;
@@ -52,6 +53,7 @@ static inline void macvlan_count_rx(cons
 	} else {
 		rx_stats->rx_errors++;
 	}
+	put_cpu();
 }
 
 extern int macvlan_common_newlink(struct net_device *dev,



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-08 18:55       ` Sridhar Samudrala
  2010-02-08 23:30         ` Ed Swierk
  2010-02-09  3:25         ` Ed Swierk
@ 2010-02-10 14:48         ` Arnd Bergmann
  2010-02-10 18:05           ` Sridhar Samudrala
  2 siblings, 1 reply; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-10 14:48 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Ed Swierk, netdev

On Monday 08 February 2010, Sridhar Samudrala wrote:
> I am also seeing this issue with net-next-2.6.
> Basically macvtap_put_user() and macvtap_get_user() call copy_to/from_user
> from within a RCU read-side critical section.
> 
> The following patch fixes this issue by releasing the RCU read lock before
> calling these routines, but instead hold a reference to q->sk.
> 
> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>

Yes, we need something like this, but we also need to protect the
device from going away. The concept right now is to use file_get_queue
to protect both the macvtap_queue and the macvlan_dev from going
away. The sock_hold will keep the macvtap_queue around, but
as far as I can tell, a user could still destroy the macvlan_dev
using netlink at the same time, which still breaks.

>  /* Get packet from user space buffer */
> -static ssize_t macvtap_get_user(struct macvtap_queue *q,
> +static ssize_t macvtap_get_user(struct macvlan_dev *vlan, struct sock *sk,
>                                 const struct iovec *iv, size_t count,
>                                 int noblock)
>  {
> @@ -331,10 +331,10 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
>         if (unlikely(len < ETH_HLEN))
>                 return -EINVAL;
>  
> -       skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
> +       skb = sock_alloc_send_skb(sk, NET_IP_ALIGN + len, noblock, &err);
>  
>         if (!skb) {
> -               macvlan_count_rx(q->vlan, 0, false, false);
> +               macvlan_count_rx(vlan, 0, false, false);
>                 return err;
>         }
>  
> @@ -342,14 +342,14 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
>         skb_put(skb, count);
>  
>         if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> -               macvlan_count_rx(q->vlan, 0, false, false);
> +               macvlan_count_rx(vlan, 0, false, false);
>                 kfree_skb(skb);
>                 return -EFAULT;
>         }
>  
>         skb_set_network_header(skb, ETH_HLEN);
>  
> -       macvlan_start_xmit(skb, q->vlan->dev);
> +       macvlan_start_xmit(skb, vlan->dev);
>  
>         return count;
>  }

What are these changes for? The lifetime of q is the same as &q->sk, so
it won't change anything, right?
Moving the macvlan_count_rx and maxclan_start_xmit under the lock
should be fine though, but we'd have to take it twice then for
each transmit.

I'd hope that this could get simpler by adding zero-copy transmit,
where we first get_user() the whole buffer and do the rest under
rcu_read_lock_bh().

> @@ -393,15 +399,20 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
>  {
>         struct file *file = iocb->ki_filp;
>         struct macvtap_queue *q = macvtap_file_get_queue(file);
> +       struct macvlan_dev *vlan;
> +       struct sock *sk;
>  
>         DECLARE_WAITQUEUE(wait, current);
>         struct sk_buff *skb;
>         ssize_t len, ret = 0;
>  
> -       if (!q) {
> -               ret = -ENOLINK;
> -               goto out;
> -       }
> +       if (!q)
> +               return -ENOLINK;
> +
> +       vlan = q->vlan;
> +       sk = &q->sk;
> +       sock_hold(sk);
> +       macvtap_file_put_queue();

Here, we probably need to prevent vlan from going away by dev_hold(),
not just sock_hold(). Or is one implied by the other?

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-08 23:30         ` Ed Swierk
@ 2010-02-10 14:50           ` Arnd Bergmann
  2010-02-11  0:42             ` Ed Swierk
  0 siblings, 1 reply; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-10 14:50 UTC (permalink / raw)
  To: Ed Swierk; +Cc: Sridhar Samudrala, netdev

On Tuesday 09 February 2010, Ed Swierk wrote:
> I fixed this problem with the change below.  I'm not sure if replacing
> smp_processor_id() with get_cpu() is the right thing to do but it works
> for macvtap at least.

I think we also need to ensure the device doesn't go away, which
was one of the reasons for the rcu_read_lock_bh() earlier.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-09  3:25         ` Ed Swierk
@ 2010-02-10 14:52           ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-10 14:52 UTC (permalink / raw)
  To: Ed Swierk; +Cc: Sridhar Samudrala, netdev

On Tuesday 09 February 2010, Ed Swierk wrote:
> I've encountered some more problems, with various users of
> macvtap_file_get_queue() either calling or neglecting to call
> macvtap_file_put_queue() in error cases.
> 
> I modified your patch so that when macvtap_file_get_queue() returns 0,
> it also calls rcu_read_unlock_bh(), and modified the users
> appropriately.
> 
> This patch also incorporates my preemption fix for macvlan_count_rx().
> 
> Signed-off-by: Ed Swierk <eswierk@aristanetworks.com>

Good idea, I'll incorporate it into my next fix, need to think
about it some more to make sure I catch all the corner cases
of the lifetime rules.

Thanks,

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-10 14:48         ` Arnd Bergmann
@ 2010-02-10 18:05           ` Sridhar Samudrala
  2010-02-10 18:10             ` Patrick McHardy
  0 siblings, 1 reply; 63+ messages in thread
From: Sridhar Samudrala @ 2010-02-10 18:05 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Ed Swierk, netdev

On Wed, 2010-02-10 at 15:48 +0100, Arnd Bergmann wrote:
> On Monday 08 February 2010, Sridhar Samudrala wrote:
> > I am also seeing this issue with net-next-2.6.
> > Basically macvtap_put_user() and macvtap_get_user() call copy_to/from_user
> > from within a RCU read-side critical section.
> > 
> > The following patch fixes this issue by releasing the RCU read lock before
> > calling these routines, but instead hold a reference to q->sk.
> > 
> > Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> 
> Yes, we need something like this, but we also need to protect the
> device from going away. The concept right now is to use file_get_queue
> to protect both the macvtap_queue and the macvlan_dev from going
> away. The sock_hold will keep the macvtap_queue around, but
> as far as I can tell, a user could still destroy the macvlan_dev
> using netlink at the same time, which still breaks.

may be we should do a dev_hold() in macvtap_set_queue() and dev_put()
in macvtap_del_queue() so that the underlying device cannot go away as
long the macvtap fd is open.

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-10 18:05           ` Sridhar Samudrala
@ 2010-02-10 18:10             ` Patrick McHardy
  2010-02-11 15:45               ` [PATCH] net/macvtap: fix reference counting Arnd Bergmann
  0 siblings, 1 reply; 63+ messages in thread
From: Patrick McHardy @ 2010-02-10 18:10 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Arnd Bergmann, Ed Swierk, netdev

Sridhar Samudrala wrote:
> On Wed, 2010-02-10 at 15:48 +0100, Arnd Bergmann wrote:
>> On Monday 08 February 2010, Sridhar Samudrala wrote:
>>> I am also seeing this issue with net-next-2.6.
>>> Basically macvtap_put_user() and macvtap_get_user() call copy_to/from_user
>>> from within a RCU read-side critical section.
>>>
>>> The following patch fixes this issue by releasing the RCU read lock before
>>> calling these routines, but instead hold a reference to q->sk.
>>>
>>> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
>> Yes, we need something like this, but we also need to protect the
>> device from going away. The concept right now is to use file_get_queue
>> to protect both the macvtap_queue and the macvlan_dev from going
>> away. The sock_hold will keep the macvtap_queue around, but
>> as far as I can tell, a user could still destroy the macvlan_dev
>> using netlink at the same time, which still breaks.
> 
> may be we should do a dev_hold() in macvtap_set_queue() and dev_put()
> in macvtap_del_queue() so that the underlying device cannot go away as
> long the macvtap fd is open.

You either need some kind of loose binding (f.i. using the ifindex)
or need to handle the case that the device goes away asynchronously
by indicating an error to the socket and unbinding it.

But you can't make the lifetime of the device dependant on the socket.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-10 14:50           ` Arnd Bergmann
@ 2010-02-11  0:42             ` Ed Swierk
  2010-02-11  7:12               ` Arnd Bergmann
  0 siblings, 1 reply; 63+ messages in thread
From: Ed Swierk @ 2010-02-11  0:42 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Sridhar Samudrala, netdev

On Wed, Feb 10, 2010 at 6:50 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> I think we also need to ensure the device doesn't go away, which
> was one of the reasons for the rcu_read_lock_bh() earlier.

This may be veering far off into the weeds, but I'm wondering if you
considered making macvtap devices behave more like tap devices.
Specifically, the application would open /dev/net/macvtap and send it
an ioctl with the name of the macvtap interface, the name of the lower
interface to attach to, the MAC address, etc; this would cause the
macvtap interface to spring into existence. The macvtap interface
would go away when the application exits or closes the file.

The tricky part here would be noticing when the lower interface goes
away, and (ideally) reattaching when an interface with the same name
reappears.

I think the advantage of this approach is that it better fits the way
applications like qemu and libvirt use tap interfaces. Unlike the
current approach, however, this wouldn't allow creating a macvtap
interface and keep it around independently of the application using
it. Is it desirable to support this use case?

--Ed

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 0/3 v4] macvtap driver
  2010-02-11  0:42             ` Ed Swierk
@ 2010-02-11  7:12               ` Arnd Bergmann
  0 siblings, 0 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-11  7:12 UTC (permalink / raw)
  To: Ed Swierk; +Cc: Sridhar Samudrala, netdev

On Thursday 11 February 2010 01:42:04 Ed Swierk wrote:
> On Wed, Feb 10, 2010 at 6:50 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> > I think we also need to ensure the device doesn't go away, which
> > was one of the reasons for the rcu_read_lock_bh() earlier.
> 
> This may be veering far off into the weeds, but I'm wondering if you
> considered making macvtap devices behave more like tap devices.
> Specifically, the application would open /dev/net/macvtap and send it
> an ioctl with the name of the macvtap interface, the name of the lower
> interface to attach to, the MAC address, etc; this would cause the
> macvtap interface to spring into existence. The macvtap interface
> would go away when the application exits or closes the file.

No, I never considered this. In fact, this behavior of tun/tap
is what makes that driver have really complex lifetime rules (more
so than macvtap) and causes all sorts of problems if you want to
manage unprivileged users accessing different outgoing interfaces.

> The tricky part here would be noticing when the lower interface goes
> away, and (ideally) reattaching when an interface with the same name
> reappears.

The first part is not so hard, the second part I'd rather not do.
 
> I think the advantage of this approach is that it better fits the way
> applications like qemu and libvirt use tap interfaces. Unlike the
> current approach, however, this wouldn't allow creating a macvtap
> interface and keep it around independently of the application using
> it. Is it desirable to support this use case?

I think it's very useful that you can set up static interfaces and give
them to a user (or group) that are then able to use these interfaces
without getting any network privileges beyond that.

Another reason for having one chardev per interface is to support
multiple open files for the same interface. I want to use that as
an easy way to support multi-queue NICs.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH] net/macvtap: fix reference counting
  2010-02-10 18:10             ` Patrick McHardy
@ 2010-02-11 15:45               ` Arnd Bergmann
  2010-02-11 15:55                 ` [PATCH v2] " Arnd Bergmann
  0 siblings, 1 reply; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-11 15:45 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Sridhar Samudrala, Ed Swierk, netdev

The RCU usage in the original code was broken because
there are cases where we possibly sleep with rcu_read_lock
held. As a fix, change the macvtap_file_get_queue to
get a reference on the socket and the netdev instead of
taking the full rcu_read_lock.

Also, change macvtap_file_get_queue failure case to
not require a subsequent macvtap_file_put_queue, as
pointed out by Ed Swierk.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ed Swierk <eswierk@aristanetworks.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
---
 drivers/net/macvtap.c |   57 +++++++++++++++++++++++++++++++-----------------
 1 files changed, 37 insertions(+), 20 deletions(-)

Sridhar, Ed: Does this look ok to you? I'm still working
on restoring my test setup, but I'd like you to take a
look at this version.

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index ad1f6ef..5954324 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -159,8 +159,12 @@ static void macvtap_del_queues(struct net_device *dev)
 
 static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
 {
+	struct macvtap_queue *q;
 	rcu_read_lock_bh();
-	return rcu_dereference(file->private_data);
+	q = rcu_dereference(file->private_data);
+	if (!q)
+		rcu_read_unlock_bh();
+	return q;
 }
 
 static inline void macvtap_file_put_queue(void)
@@ -314,13 +318,13 @@ static unsigned int macvtap_poll(struct file *file, poll_table * wait)
 	     sock_writeable(&q->sk)))
 		mask |= POLLOUT | POLLWRNORM;
 
-out:
 	macvtap_file_put_queue();
+out:
 	return mask;
 }
 
 /* Get packet from user space buffer */
-static ssize_t macvtap_get_user(struct macvtap_queue *q,
+static ssize_t macvtap_get_user(struct macvlan_dev *vlan, struct sock *sk,
 				const struct iovec *iv, size_t count,
 				int noblock)
 {
@@ -331,10 +335,10 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	if (unlikely(len < ETH_HLEN))
 		return -EINVAL;
 
-	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
+	skb = sock_alloc_send_skb(sk, NET_IP_ALIGN + len, noblock, &err);
 
 	if (!skb) {
-		macvlan_count_rx(q->vlan, 0, false, false);
+		macvlan_count_rx(vlan, 0, false, false);
 		return err;
 	}
 
@@ -342,14 +346,14 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	skb_put(skb, count);
 
 	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
-		macvlan_count_rx(q->vlan, 0, false, false);
+		macvlan_count_rx(vlan, 0, false, false);
 		kfree_skb(skb);
 		return -EFAULT;
 	}
 
 	skb_set_network_header(skb, ETH_HLEN);
 
-	macvlan_start_xmit(skb, q->vlan->dev);
+	macvlan_start_xmit(skb, vlan->dev);
 
 	return count;
 }
@@ -360,23 +364,29 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
 	struct file *file = iocb->ki_filp;
 	ssize_t result = -ENOLINK;
 	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	struct macvlan_dev *vlan;
+	struct sock *sk;
 
 	if (!q)
 		goto out;
 
-	result = macvtap_get_user(q, iv, iov_length(iv, count),
+	vlan = q->vlan;
+	sk = &q->sk;
+	sock_hold(sk);
+	macvtap_file_put_queue();
+
+	result = macvtap_get_user(vlan, sk, iv, iov_length(iv, count),
 			      file->f_flags & O_NONBLOCK);
+	sock_put(sk);
 out:
-	macvtap_file_put_queue();
 	return result;
 }
 
 /* Put packet to the user space buffer */
-static ssize_t macvtap_put_user(struct macvtap_queue *q,
+static ssize_t macvtap_put_user(struct macvlan_dev *vlan,
 				const struct sk_buff *skb,
 				const struct iovec *iv, int len)
 {
-	struct macvlan_dev *vlan = q->vlan;
 	int ret;
 
 	len = min_t(int, skb->len, len);
@@ -393,15 +403,20 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 {
 	struct file *file = iocb->ki_filp;
 	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	struct macvlan_dev *vlan;
+	struct sock *sk;
 
 	DECLARE_WAITQUEUE(wait, current);
 	struct sk_buff *skb;
 	ssize_t len, ret = 0;
 
-	if (!q) {
-		ret = -ENOLINK;
-		goto out;
-	}
+	if (!q)
+		return -ENOLINK;
+
+	vlan = q->vlan;
+	sk = &q->sk;
+	sock_hold(sk);
+	macvtap_file_put_queue();
 
 	len = iov_length(iv, count);
 	if (len < 0) {
@@ -409,12 +424,12 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 		goto out;
 	}
 
-	add_wait_queue(q->sk.sk_sleep, &wait);
+	add_wait_queue(sk->sk_sleep, &wait);
 	while (len) {
 		current->state = TASK_INTERRUPTIBLE;
 
 		/* Read frames from the queue */
-		skb = skb_dequeue(&q->sk.sk_receive_queue);
+		skb = skb_dequeue(&sk->sk_receive_queue);
 		if (!skb) {
 			if (file->f_flags & O_NONBLOCK) {
 				ret = -EAGAIN;
@@ -428,16 +443,16 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 			schedule();
 			continue;
 		}
-		ret = macvtap_put_user(q, skb, iv, len);
+		ret = macvtap_put_user(vlan, skb, iv, len);
 		kfree_skb(skb);
 		break;
 	}
 
 	current->state = TASK_RUNNING;
-	remove_wait_queue(q->sk.sk_sleep, &wait);
+	remove_wait_queue(sk->sk_sleep, &wait);
 
 out:
-	macvtap_file_put_queue();
+	sock_put(sk);
 	return ret;
 }
 
@@ -485,6 +500,8 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 			return -EFAULT;
 
 		q = macvtap_file_get_queue(file);
+		if (!q)
+			return -ENOLINK;
 		q->sk.sk_sndbuf = u;
 		macvtap_file_put_queue();
 		return 0;
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v2] net/macvtap: fix reference counting
  2010-02-11 15:45               ` [PATCH] net/macvtap: fix reference counting Arnd Bergmann
@ 2010-02-11 15:55                 ` Arnd Bergmann
  2010-02-11 21:09                   ` Sridhar Samudrala
  2010-02-12 20:58                   ` [PATCH v2] net/macvtap: fix reference counting Ed Swierk
  0 siblings, 2 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-11 15:55 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Sridhar Samudrala, Ed Swierk, netdev

The RCU usage in the original code was broken because
there are cases where we possibly sleep with rcu_read_lock
held. As a fix, change the macvtap_file_get_queue to
get a reference on the socket and the netdev instead of
taking the full rcu_read_lock.

Also, change macvtap_file_get_queue failure case to
not require a subsequent macvtap_file_put_queue, as
pointed out by Ed Swierk.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Ed Swierk <eswierk@aristanetworks.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
---
 drivers/net/macvtap.c |   35 ++++++++++++++++++++++-------------
 1 files changed, 22 insertions(+), 13 deletions(-)

Please disregard v1 of this patch, I accidentally sent Sridhar's
patch...

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index ad1f6ef..fe7656b 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -70,7 +70,8 @@ static struct cdev macvtap_cdev;
  * exists.
  *
  * The callbacks from macvlan are always done with rcu_read_lock held
- * already, while in the file_operations, we get it ourselves.
+ * already. For calls from file_operations, we use the rcu_read_lock_bh
+ * to get a reference count on the socket and the device.
  *
  * When destroying a queue, we remove the pointers from the file and
  * from the dev and then synchronize_rcu to make sure no thread is
@@ -159,13 +160,21 @@ static void macvtap_del_queues(struct net_device *dev)
 
 static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
 {
+	struct macvtap_queue *q;
 	rcu_read_lock_bh();
-	return rcu_dereference(file->private_data);
+	q = rcu_dereference(file->private_data);
+	if (q) {
+		sock_hold(&q->sk);
+		dev_hold(q->vlan->dev);
+	}
+	rcu_read_unlock_bh();
+	return q;
 }
 
-static inline void macvtap_file_put_queue(void)
+static inline void macvtap_file_put_queue(struct macvtap_queue *q)
 {
-	rcu_read_unlock_bh();
+	sock_put(&q->sk);
+	dev_put(q->vlan->dev);
 }
 
 /*
@@ -314,8 +323,8 @@ static unsigned int macvtap_poll(struct file *file, poll_table * wait)
 	     sock_writeable(&q->sk)))
 		mask |= POLLOUT | POLLWRNORM;
 
+	macvtap_file_put_queue(q);
 out:
-	macvtap_file_put_queue();
 	return mask;
 }
 
@@ -366,8 +375,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
 
 	result = macvtap_get_user(q, iv, iov_length(iv, count),
 			      file->f_flags & O_NONBLOCK);
+	macvtap_file_put_queue(q);
 out:
-	macvtap_file_put_queue();
 	return result;
 }
 
@@ -398,10 +407,8 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 	struct sk_buff *skb;
 	ssize_t len, ret = 0;
 
-	if (!q) {
-		ret = -ENOLINK;
-		goto out;
-	}
+	if (!q)
+		return -ENOLINK;
 
 	len = iov_length(iv, count);
 	if (len < 0) {
@@ -437,7 +444,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 	remove_wait_queue(q->sk.sk_sleep, &wait);
 
 out:
-	macvtap_file_put_queue();
+	macvtap_file_put_queue(q);
 	return ret;
 }
 
@@ -468,7 +475,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		if (!q)
 			return -ENOLINK;
 		memcpy(devname, q->vlan->dev->name, sizeof(devname));
-		macvtap_file_put_queue();
+		macvtap_file_put_queue(q);
 
 		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
 		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
@@ -485,8 +492,10 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 			return -EFAULT;
 
 		q = macvtap_file_get_queue(file);
+		if (!q)
+			return -ENOLINK;
 		q->sk.sk_sndbuf = u;
-		macvtap_file_put_queue();
+		macvtap_file_put_queue(q);
 		return 0;
 
 	case TUNSETOFFLOAD:
-- 
1.6.3.3



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v2] net/macvtap: fix reference counting
  2010-02-11 15:55                 ` [PATCH v2] " Arnd Bergmann
@ 2010-02-11 21:09                   ` Sridhar Samudrala
  2010-02-16  5:53                     ` David Miller
  2010-02-12 20:58                   ` [PATCH v2] net/macvtap: fix reference counting Ed Swierk
  1 sibling, 1 reply; 63+ messages in thread
From: Sridhar Samudrala @ 2010-02-11 21:09 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Patrick McHardy, Ed Swierk, netdev

On Thu, 2010-02-11 at 16:55 +0100, Arnd Bergmann wrote:
> The RCU usage in the original code was broken because
> there are cases where we possibly sleep with rcu_read_lock
> held. As a fix, change the macvtap_file_get_queue to
> get a reference on the socket and the netdev instead of
> taking the full rcu_read_lock.
> 
> Also, change macvtap_file_get_queue failure case to
> not require a subsequent macvtap_file_put_queue, as
> pointed out by Ed Swierk.

Looks good.

Acked-by: Sridhar Samudrala <sri@us.ibm.com>

> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> Cc: Ed Swierk <eswierk@aristanetworks.com>
> Cc: Sridhar Samudrala <sri@us.ibm.com>
> ---
>  drivers/net/macvtap.c |   35 ++++++++++++++++++++++-------------
>  1 files changed, 22 insertions(+), 13 deletions(-)
> 
> Please disregard v1 of this patch, I accidentally sent Sridhar's
> patch...
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index ad1f6ef..fe7656b 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -70,7 +70,8 @@ static struct cdev macvtap_cdev;
>   * exists.
>   *
>   * The callbacks from macvlan are always done with rcu_read_lock held
> - * already, while in the file_operations, we get it ourselves.
> + * already. For calls from file_operations, we use the rcu_read_lock_bh
> + * to get a reference count on the socket and the device.
>   *
>   * When destroying a queue, we remove the pointers from the file and
>   * from the dev and then synchronize_rcu to make sure no thread is
> @@ -159,13 +160,21 @@ static void macvtap_del_queues(struct net_device *dev)
> 
>  static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
>  {
> +	struct macvtap_queue *q;
>  	rcu_read_lock_bh();
> -	return rcu_dereference(file->private_data);
> +	q = rcu_dereference(file->private_data);
> +	if (q) {
> +		sock_hold(&q->sk);
> +		dev_hold(q->vlan->dev);
> +	}
> +	rcu_read_unlock_bh();
> +	return q;
>  }
> 
> -static inline void macvtap_file_put_queue(void)
> +static inline void macvtap_file_put_queue(struct macvtap_queue *q)
>  {
> -	rcu_read_unlock_bh();
> +	sock_put(&q->sk);
> +	dev_put(q->vlan->dev);
>  }
> 
>  /*
> @@ -314,8 +323,8 @@ static unsigned int macvtap_poll(struct file *file, poll_table * wait)
>  	     sock_writeable(&q->sk)))
>  		mask |= POLLOUT | POLLWRNORM;
> 
> +	macvtap_file_put_queue(q);
>  out:
> -	macvtap_file_put_queue();
>  	return mask;
>  }
> 
> @@ -366,8 +375,8 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
> 
>  	result = macvtap_get_user(q, iv, iov_length(iv, count),
>  			      file->f_flags & O_NONBLOCK);
> +	macvtap_file_put_queue(q);
>  out:
> -	macvtap_file_put_queue();
>  	return result;
>  }
> 
> @@ -398,10 +407,8 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
>  	struct sk_buff *skb;
>  	ssize_t len, ret = 0;
> 
> -	if (!q) {
> -		ret = -ENOLINK;
> -		goto out;
> -	}
> +	if (!q)
> +		return -ENOLINK;
> 
>  	len = iov_length(iv, count);
>  	if (len < 0) {
> @@ -437,7 +444,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
>  	remove_wait_queue(q->sk.sk_sleep, &wait);
> 
>  out:
> -	macvtap_file_put_queue();
> +	macvtap_file_put_queue(q);
>  	return ret;
>  }
> 
> @@ -468,7 +475,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
>  		if (!q)
>  			return -ENOLINK;
>  		memcpy(devname, q->vlan->dev->name, sizeof(devname));
> -		macvtap_file_put_queue();
> +		macvtap_file_put_queue(q);
> 
>  		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
>  		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
> @@ -485,8 +492,10 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
>  			return -EFAULT;
> 
>  		q = macvtap_file_get_queue(file);
> +		if (!q)
> +			return -ENOLINK;
>  		q->sk.sk_sndbuf = u;
> -		macvtap_file_put_queue();
> +		macvtap_file_put_queue(q);
>  		return 0;
> 
>  	case TUNSETOFFLOAD:


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2] net/macvtap: fix reference counting
  2010-02-11 15:55                 ` [PATCH v2] " Arnd Bergmann
  2010-02-11 21:09                   ` Sridhar Samudrala
@ 2010-02-12 20:58                   ` Ed Swierk
  1 sibling, 0 replies; 63+ messages in thread
From: Ed Swierk @ 2010-02-12 20:58 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Patrick McHardy, Sridhar Samudrala, netdev

On Thu, Feb 11, 2010 at 7:55 AM, Arnd Bergmann <arnd@arndb.de> wrote:
> The RCU usage in the original code was broken because
> there are cases where we possibly sleep with rcu_read_lock
> held. As a fix, change the macvtap_file_get_queue to
> get a reference on the socket and the netdev instead of
> taking the full rcu_read_lock.
>
> Also, change macvtap_file_get_queue failure case to
> not require a subsequent macvtap_file_put_queue, as
> pointed out by Ed Swierk.

Works for me. Thanks.

Acked-by: Ed Swierk <eswierk@aristanetworks.com>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2] net/macvtap: fix reference counting
  2010-02-11 21:09                   ` Sridhar Samudrala
@ 2010-02-16  5:53                     ` David Miller
  2010-02-18 15:44                       ` Arnd Bergmann
  0 siblings, 1 reply; 63+ messages in thread
From: David Miller @ 2010-02-16  5:53 UTC (permalink / raw)
  To: sri; +Cc: arnd, kaber, eswierk, netdev

From: Sridhar Samudrala <sri@us.ibm.com>
Date: Thu, 11 Feb 2010 13:09:37 -0800

> On Thu, 2010-02-11 at 16:55 +0100, Arnd Bergmann wrote:
>> The RCU usage in the original code was broken because
>> there are cases where we possibly sleep with rcu_read_lock
>> held. As a fix, change the macvtap_file_get_queue to
>> get a reference on the socket and the netdev instead of
>> taking the full rcu_read_lock.
>> 
>> Also, change macvtap_file_get_queue failure case to
>> not require a subsequent macvtap_file_put_queue, as
>> pointed out by Ed Swierk.
> 
> Looks good.
> 
> Acked-by: Sridhar Samudrala <sri@us.ibm.com>

Applied.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v2] net/macvtap: fix reference counting
  2010-02-16  5:53                     ` David Miller
@ 2010-02-18 15:44                       ` Arnd Bergmann
  2010-02-18 15:45                         ` [PATCH 1/3] macvtap: rework object lifetime rules Arnd Bergmann
                                           ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-18 15:44 UTC (permalink / raw)
  To: David Miller; +Cc: sri, kaber, eswierk, netdev

On Tuesday 16 February 2010, David Miller wrote:
> From: Sridhar Samudrala <sri@us.ibm.com>
> > Acked-by: Sridhar Samudrala <sri@us.ibm.com>
> 
> Applied.

Thanks for applying this one, but unfortunately I had reworked
the patch in a different way in the meantime and Sridhar has
added another patch conflicting with my rework. I think I've
got it all sorted out now, so I'll send a new series with
the three patches I was intending to get merged, but rebased
on this one and mostly reverting it.

Sridhar, please have a look at this series and send an Ack
if you're fine with it.

	Arnd

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 1/3] macvtap: rework object lifetime rules
  2010-02-18 15:44                       ` Arnd Bergmann
@ 2010-02-18 15:45                         ` Arnd Bergmann
  2010-02-18 20:09                           ` Sridhar Samudrala
  2010-02-18 22:11                           ` David Miller
  2010-02-18 15:46                         ` [PATCH 2/3] net/macvtap: add vhost support Arnd Bergmann
  2010-02-18 15:48                         ` [PATCH 3/3] macvtap: add GSO/csum offload support Arnd Bergmann
  2 siblings, 2 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-18 15:45 UTC (permalink / raw)
  To: David Miller; +Cc: sri, kaber, eswierk, netdev

This reworks the change done by the previous patch
in a more complete way.

The original macvtap code has a number of problems
resulting from the use of RCU for protecting the
access to struct macvtap_queue from open files.

This includes
- need for GFP_ATOMIC allocations for skbs
- potential deadlocks when copy_*_user sleeps
- inability to work with vhost-net

Changing the lifetime of macvtap_queue to always
depend on the open file solves all these. The
RCU reference simply moves one step down to
the reference on the macvlan_dev, which we
only need for nonblocking operations.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvtap.c |  183 ++++++++++++++++++++++++-------------------------
 1 files changed, 91 insertions(+), 92 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index fe7656b..7050997 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -60,30 +60,19 @@ static struct cdev macvtap_cdev;
 
 /*
  * RCU usage:
- * The macvtap_queue is referenced both from the chardev struct file
- * and from the struct macvlan_dev using rcu_read_lock.
+ * The macvtap_queue and the macvlan_dev are loosely coupled, the
+ * pointers from one to the other can only be read while rcu_read_lock
+ * or macvtap_lock is held.
  *
- * We never actually update the contents of a macvtap_queue atomically
- * with RCU but it is used for race-free destruction of a queue when
- * either the file or the macvlan_dev goes away. Pointers back to
- * the dev and the file are implicitly valid as long as the queue
- * exists.
+ * Both the file and the macvlan_dev hold a reference on the macvtap_queue
+ * through sock_hold(&q->sk). When the macvlan_dev goes away first,
+ * q->vlan becomes inaccessible. When the files gets closed,
+ * macvtap_get_queue() fails.
  *
- * The callbacks from macvlan are always done with rcu_read_lock held
- * already. For calls from file_operations, we use the rcu_read_lock_bh
- * to get a reference count on the socket and the device.
- *
- * When destroying a queue, we remove the pointers from the file and
- * from the dev and then synchronize_rcu to make sure no thread is
- * still using the queue. There may still be references to the struct
- * sock inside of the queue from outbound SKBs, but these never
- * reference back to the file or the dev. The data structure is freed
- * through __sk_free when both our references and any pending SKBs
- * are gone.
- *
- * macvtap_lock is only used to prevent multiple concurrent open()
- * calls to assign a new vlan->tap pointer. It could be moved into
- * the macvlan_dev itself but is extremely rarely used.
+ * There may still be references to the struct sock inside of the
+ * queue from outbound SKBs, but these never reference back to the
+ * file or the dev. The data structure is freed through __sk_free
+ * when both our references and any pending SKBs are gone.
  */
 static DEFINE_SPINLOCK(macvtap_lock);
 
@@ -101,11 +90,12 @@ static int macvtap_set_queue(struct net_device *dev, struct file *file,
 		goto out;
 
 	err = 0;
-	q->vlan = vlan;
+	rcu_assign_pointer(q->vlan, vlan);
 	rcu_assign_pointer(vlan->tap, q);
+	sock_hold(&q->sk);
 
 	q->file = file;
-	rcu_assign_pointer(file->private_data, q);
+	file->private_data = q;
 
 out:
 	spin_unlock(&macvtap_lock);
@@ -113,28 +103,25 @@ out:
 }
 
 /*
- * We must destroy each queue exactly once, when either
- * the netdev or the file go away.
+ * The file owning the queue got closed, give up both
+ * the reference that the files holds as well as the
+ * one from the macvlan_dev if that still exists.
  *
  * Using the spinlock makes sure that we don't get
  * to the queue again after destroying it.
- *
- * synchronize_rcu serializes with the packet flow
- * that uses rcu_read_lock.
  */
-static void macvtap_del_queue(struct macvtap_queue **qp)
+static void macvtap_put_queue(struct macvtap_queue *q)
 {
-	struct macvtap_queue *q;
+	struct macvlan_dev *vlan;
 
 	spin_lock(&macvtap_lock);
-	q = rcu_dereference(*qp);
-	if (!q) {
-		spin_unlock(&macvtap_lock);
-		return;
+	vlan = rcu_dereference(q->vlan);
+	if (vlan) {
+		rcu_assign_pointer(vlan->tap, NULL);
+		rcu_assign_pointer(q->vlan, NULL);
+		sock_put(&q->sk);
 	}
 
-	rcu_assign_pointer(q->vlan->tap, NULL);
-	rcu_assign_pointer(q->file->private_data, NULL);
 	spin_unlock(&macvtap_lock);
 
 	synchronize_rcu();
@@ -152,29 +139,29 @@ static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
 	return rcu_dereference(vlan->tap);
 }
 
+/*
+ * The net_device is going away, give up the reference
+ * that it holds on the queue (all the queues one day)
+ * and safely set the pointer from the queues to NULL.
+ */
 static void macvtap_del_queues(struct net_device *dev)
 {
 	struct macvlan_dev *vlan = netdev_priv(dev);
-	macvtap_del_queue(&vlan->tap);
-}
-
-static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
-{
 	struct macvtap_queue *q;
-	rcu_read_lock_bh();
-	q = rcu_dereference(file->private_data);
-	if (q) {
-		sock_hold(&q->sk);
-		dev_hold(q->vlan->dev);
+
+	spin_lock(&macvtap_lock);
+	q = rcu_dereference(vlan->tap);
+	if (!q) {
+		spin_unlock(&macvtap_lock);
+		return;
 	}
-	rcu_read_unlock_bh();
-	return q;
-}
 
-static inline void macvtap_file_put_queue(struct macvtap_queue *q)
-{
+	rcu_assign_pointer(vlan->tap, NULL);
+	rcu_assign_pointer(q->vlan, NULL);
+	spin_unlock(&macvtap_lock);
+
+	synchronize_rcu();
 	sock_put(&q->sk);
-	dev_put(q->vlan->dev);
 }
 
 /*
@@ -284,7 +271,6 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	q->sock.type = SOCK_RAW;
 	q->sock.state = SS_CONNECTED;
 	sock_init_data(&q->sock, &q->sk);
-	q->sk.sk_allocation = GFP_ATOMIC; /* for now */
 	q->sk.sk_write_space = macvtap_sock_write_space;
 
 	err = macvtap_set_queue(dev, file, q);
@@ -300,13 +286,14 @@ out:
 
 static int macvtap_release(struct inode *inode, struct file *file)
 {
-	macvtap_del_queue((struct macvtap_queue **)&file->private_data);
+	struct macvtap_queue *q = file->private_data;
+	macvtap_put_queue(q);
 	return 0;
 }
 
 static unsigned int macvtap_poll(struct file *file, poll_table * wait)
 {
-	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	struct macvtap_queue *q = file->private_data;
 	unsigned int mask = POLLERR;
 
 	if (!q)
@@ -323,7 +310,6 @@ static unsigned int macvtap_poll(struct file *file, poll_table * wait)
 	     sock_writeable(&q->sk)))
 		mask |= POLLOUT | POLLWRNORM;
 
-	macvtap_file_put_queue(q);
 out:
 	return mask;
 }
@@ -334,6 +320,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 				int noblock)
 {
 	struct sk_buff *skb;
+	struct macvlan_dev *vlan;
 	size_t len = count;
 	int err;
 
@@ -341,26 +328,37 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 		return -EINVAL;
 
 	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
-
-	if (!skb) {
-		macvlan_count_rx(q->vlan, 0, false, false);
-		return err;
-	}
+	if (!skb)
+		goto err;
 
 	skb_reserve(skb, NET_IP_ALIGN);
 	skb_put(skb, count);
 
-	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
-		macvlan_count_rx(q->vlan, 0, false, false);
-		kfree_skb(skb);
-		return -EFAULT;
-	}
+	err = skb_copy_datagram_from_iovec(skb, 0, iv, 0, len);
+	if (err)
+		goto err;
 
 	skb_set_network_header(skb, ETH_HLEN);
-
-	macvlan_start_xmit(skb, q->vlan->dev);
+	rcu_read_lock_bh();
+	vlan = rcu_dereference(q->vlan);
+	if (vlan)
+		macvlan_start_xmit(skb, vlan->dev);
+	else
+		kfree_skb(skb);
+	rcu_read_unlock_bh();
 
 	return count;
+
+err:
+	rcu_read_lock_bh();
+	vlan = rcu_dereference(q->vlan);
+	if (vlan)
+		macvlan_count_rx(q->vlan, 0, false, false);
+	rcu_read_unlock_bh();
+
+	kfree_skb(skb);
+
+	return err;
 }
 
 static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
@@ -368,15 +366,10 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
 {
 	struct file *file = iocb->ki_filp;
 	ssize_t result = -ENOLINK;
-	struct macvtap_queue *q = macvtap_file_get_queue(file);
-
-	if (!q)
-		goto out;
+	struct macvtap_queue *q = file->private_data;
 
 	result = macvtap_get_user(q, iv, iov_length(iv, count),
 			      file->f_flags & O_NONBLOCK);
-	macvtap_file_put_queue(q);
-out:
 	return result;
 }
 
@@ -385,14 +378,17 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 				const struct sk_buff *skb,
 				const struct iovec *iv, int len)
 {
-	struct macvlan_dev *vlan = q->vlan;
+	struct macvlan_dev *vlan;
 	int ret;
 
 	len = min_t(int, skb->len, len);
 
 	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
 
+	rcu_read_lock_bh();
+	vlan = rcu_dereference(q->vlan);
 	macvlan_count_rx(vlan, len, ret == 0, 0);
+	rcu_read_unlock_bh();
 
 	return ret ? ret : len;
 }
@@ -401,14 +397,16 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 				unsigned long count, loff_t pos)
 {
 	struct file *file = iocb->ki_filp;
-	struct macvtap_queue *q = macvtap_file_get_queue(file);
+	struct macvtap_queue *q = file->private_data;
 
 	DECLARE_WAITQUEUE(wait, current);
 	struct sk_buff *skb;
 	ssize_t len, ret = 0;
 
-	if (!q)
-		return -ENOLINK;
+	if (!q) {
+		ret = -ENOLINK;
+		goto out;
+	}
 
 	len = iov_length(iv, count);
 	if (len < 0) {
@@ -444,7 +442,6 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 	remove_wait_queue(q->sk.sk_sleep, &wait);
 
 out:
-	macvtap_file_put_queue(q);
 	return ret;
 }
 
@@ -454,12 +451,13 @@ out:
 static long macvtap_ioctl(struct file *file, unsigned int cmd,
 			  unsigned long arg)
 {
-	struct macvtap_queue *q;
+	struct macvtap_queue *q = file->private_data;
+	struct macvlan_dev *vlan;
 	void __user *argp = (void __user *)arg;
 	struct ifreq __user *ifr = argp;
 	unsigned int __user *up = argp;
 	unsigned int u;
-	char devname[IFNAMSIZ];
+	int ret;
 
 	switch (cmd) {
 	case TUNSETIFF:
@@ -471,16 +469,21 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		return 0;
 
 	case TUNGETIFF:
-		q = macvtap_file_get_queue(file);
-		if (!q)
+		rcu_read_lock_bh();
+		vlan = rcu_dereference(q->vlan);
+		if (vlan)
+			dev_hold(vlan->dev);
+		rcu_read_unlock_bh();
+
+		if (!vlan)
 			return -ENOLINK;
-		memcpy(devname, q->vlan->dev->name, sizeof(devname));
-		macvtap_file_put_queue(q);
 
+		ret = 0;
 		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
 		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
-			return -EFAULT;
-		return 0;
+			ret = -EFAULT;
+		dev_put(vlan->dev);
+		return ret;
 
 	case TUNGETFEATURES:
 		if (put_user((IFF_TAP | IFF_NO_PI), up))
@@ -491,11 +494,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		if (get_user(u, up))
 			return -EFAULT;
 
-		q = macvtap_file_get_queue(file);
-		if (!q)
-			return -ENOLINK;
 		q->sk.sk_sndbuf = u;
-		macvtap_file_put_queue(q);
 		return 0;
 
 	case TUNSETOFFLOAD:
-- 
1.6.3.3



^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 2/3] net/macvtap: add vhost support
  2010-02-18 15:44                       ` Arnd Bergmann
  2010-02-18 15:45                         ` [PATCH 1/3] macvtap: rework object lifetime rules Arnd Bergmann
@ 2010-02-18 15:46                         ` Arnd Bergmann
  2010-02-18 20:10                           ` Sridhar Samudrala
  2010-02-18 22:11                           ` David Miller
  2010-02-18 15:48                         ` [PATCH 3/3] macvtap: add GSO/csum offload support Arnd Bergmann
  2 siblings, 2 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-18 15:46 UTC (permalink / raw)
  To: David Miller; +Cc: sri, kaber, eswierk, netdev, Michael S. Tsirkin

This adds support for passing a macvtap file descriptor into
vhost-net, much like we already do for tun/tap.

Most of the new code is taken from the respective patch
in the tun driver and may get consolidated in the future.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvtap.c      |   98 ++++++++++++++++++++++++++++++++++---------
 drivers/vhost/Kconfig      |    2 +-
 drivers/vhost/net.c        |    8 +++-
 include/linux/if_macvlan.h |   13 ++++++
 4 files changed, 97 insertions(+), 24 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 7050997..e354501 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -58,6 +58,8 @@ static unsigned int macvtap_major;
 static struct class *macvtap_class;
 static struct cdev macvtap_cdev;
 
+static const struct proto_ops macvtap_socket_ops;
+
 /*
  * RCU usage:
  * The macvtap_queue and the macvlan_dev are loosely coupled, the
@@ -176,7 +178,7 @@ static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
 		return -ENOLINK;
 
 	skb_queue_tail(&q->sk.sk_receive_queue, skb);
-	wake_up(q->sk.sk_sleep);
+	wake_up_interruptible_poll(q->sk.sk_sleep, POLLIN | POLLRDNORM | POLLRDBAND);
 	return 0;
 }
 
@@ -242,7 +244,7 @@ static void macvtap_sock_write_space(struct sock *sk)
 		return;
 
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_sync(sk->sk_sleep);
+		wake_up_interruptible_poll(sk->sk_sleep, POLLOUT | POLLWRNORM | POLLWRBAND);
 }
 
 static int macvtap_open(struct inode *inode, struct file *file)
@@ -270,6 +272,8 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	init_waitqueue_head(&q->sock.wait);
 	q->sock.type = SOCK_RAW;
 	q->sock.state = SS_CONNECTED;
+	q->sock.file = file;
+	q->sock.ops = &macvtap_socket_ops;
 	sock_init_data(&q->sock, &q->sk);
 	q->sk.sk_write_space = macvtap_sock_write_space;
 
@@ -387,32 +391,20 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 
 	rcu_read_lock_bh();
 	vlan = rcu_dereference(q->vlan);
-	macvlan_count_rx(vlan, len, ret == 0, 0);
+	if (vlan)
+		macvlan_count_rx(vlan, len, ret == 0, 0);
 	rcu_read_unlock_bh();
 
 	return ret ? ret : len;
 }
 
-static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
-				unsigned long count, loff_t pos)
+static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb,
+			       const struct iovec *iv, unsigned long len,
+			       int noblock)
 {
-	struct file *file = iocb->ki_filp;
-	struct macvtap_queue *q = file->private_data;
-
 	DECLARE_WAITQUEUE(wait, current);
 	struct sk_buff *skb;
-	ssize_t len, ret = 0;
-
-	if (!q) {
-		ret = -ENOLINK;
-		goto out;
-	}
-
-	len = iov_length(iv, count);
-	if (len < 0) {
-		ret = -EINVAL;
-		goto out;
-	}
+	ssize_t ret = 0;
 
 	add_wait_queue(q->sk.sk_sleep, &wait);
 	while (len) {
@@ -421,7 +413,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 		/* Read frames from the queue */
 		skb = skb_dequeue(&q->sk.sk_receive_queue);
 		if (!skb) {
-			if (file->f_flags & O_NONBLOCK) {
+			if (noblock) {
 				ret = -EAGAIN;
 				break;
 			}
@@ -440,7 +432,24 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
 
 	current->state = TASK_RUNNING;
 	remove_wait_queue(q->sk.sk_sleep, &wait);
+	return ret;
+}
+
+static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
+				unsigned long count, loff_t pos)
+{
+	struct file *file = iocb->ki_filp;
+	struct macvtap_queue *q = file->private_data;
+	ssize_t len, ret = 0;
 
+	len = iov_length(iv, count);
+	if (len < 0) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ret = macvtap_do_read(q, iocb, iv, len, file->f_flags & O_NONBLOCK);
+	ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */
 out:
 	return ret;
 }
@@ -538,6 +547,53 @@ static const struct file_operations macvtap_fops = {
 #endif
 };
 
+static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
+			   struct msghdr *m, size_t total_len)
+{
+	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
+	return macvtap_get_user(q, m->msg_iov, total_len,
+			    m->msg_flags & MSG_DONTWAIT);
+}
+
+static int macvtap_recvmsg(struct kiocb *iocb, struct socket *sock,
+			   struct msghdr *m, size_t total_len,
+			   int flags)
+{
+	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
+	int ret;
+	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
+		return -EINVAL;
+	ret = macvtap_do_read(q, iocb, m->msg_iov, total_len,
+			  flags & MSG_DONTWAIT);
+	if (ret > total_len) {
+		m->msg_flags |= MSG_TRUNC;
+		ret = flags & MSG_TRUNC ? ret : total_len;
+	}
+	return ret;
+}
+
+/* Ops structure to mimic raw sockets with tun */
+static const struct proto_ops macvtap_socket_ops = {
+	.sendmsg = macvtap_sendmsg,
+	.recvmsg = macvtap_recvmsg,
+};
+
+/* Get an underlying socket object from tun file.  Returns error unless file is
+ * attached to a device.  The returned object works like a packet socket, it
+ * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
+ * holding a reference to the file for as long as the socket is in use. */
+struct socket *macvtap_get_socket(struct file *file)
+{
+	struct macvtap_queue *q;
+	if (file->f_op != &macvtap_fops)
+		return ERR_PTR(-EINVAL);
+	q = file->private_data;
+	if (!q)
+		return ERR_PTR(-EBADFD);
+	return &q->sock;
+}
+EXPORT_SYMBOL_GPL(macvtap_get_socket);
+
 static int macvtap_init(void)
 {
 	int err;
diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 9e93553..e4e2fd1 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -1,6 +1,6 @@
 config VHOST_NET
 	tristate "Host kernel accelerator for virtio net (EXPERIMENTAL)"
-	depends on NET && EVENTFD && (TUN || !TUN) && EXPERIMENTAL
+	depends on NET && EVENTFD && (TUN || !TUN) && (MACVTAP || !MACVTAP) && EXPERIMENTAL
 	---help---
 	  This kernel module can be loaded in host kernel to accelerate
 	  guest networking with virtio_net. Not to be confused with virtio_net
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 4c89283..91a324c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -22,6 +22,7 @@
 #include <linux/if_packet.h>
 #include <linux/if_arp.h>
 #include <linux/if_tun.h>
+#include <linux/if_macvlan.h>
 
 #include <net/sock.h>
 
@@ -452,13 +453,16 @@ err:
 	return ERR_PTR(r);
 }
 
-static struct socket *get_tun_socket(int fd)
+static struct socket *get_tap_socket(int fd)
 {
 	struct file *file = fget(fd);
 	struct socket *sock;
 	if (!file)
 		return ERR_PTR(-EBADF);
 	sock = tun_get_socket(file);
+	if (!IS_ERR(sock))
+		return sock;
+	sock = macvtap_get_socket(file);
 	if (IS_ERR(sock))
 		fput(file);
 	return sock;
@@ -473,7 +477,7 @@ static struct socket *get_socket(int fd)
 	sock = get_raw_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
-	sock = get_tun_socket(fd);
+	sock = get_tap_socket(fd);
 	if (!IS_ERR(sock))
 		return sock;
 	return ERR_PTR(-ENOTSOCK);
diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
index f9cb9ba..b78a712 100644
--- a/include/linux/if_macvlan.h
+++ b/include/linux/if_macvlan.h
@@ -7,6 +7,19 @@
 #include <linux/netlink.h>
 #include <net/netlink.h>
 
+#if defined(CONFIG_MACVTAP) || defined(CONFIG_MACVTAP_MODULE)
+struct socket *macvtap_get_socket(struct file *);
+#else
+#include <linux/err.h>
+#include <linux/errno.h>
+struct file;
+struct socket;
+static inline struct socket *macvtap_get_socket(struct file *f)
+{
+	return ERR_PTR(-EINVAL);
+}
+#endif /* CONFIG_MACVTAP */
+
 struct macvlan_port;
 struct macvtap_queue;
 
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH 3/3] macvtap: add GSO/csum offload support
  2010-02-18 15:44                       ` Arnd Bergmann
  2010-02-18 15:45                         ` [PATCH 1/3] macvtap: rework object lifetime rules Arnd Bergmann
  2010-02-18 15:46                         ` [PATCH 2/3] net/macvtap: add vhost support Arnd Bergmann
@ 2010-02-18 15:48                         ` Arnd Bergmann
  2010-02-18 20:38                           ` Sridhar Samudrala
  2010-02-18 22:11                           ` David Miller
  2 siblings, 2 replies; 63+ messages in thread
From: Arnd Bergmann @ 2010-02-18 15:48 UTC (permalink / raw)
  To: David Miller; +Cc: sri, kaber, eswierk, netdev

Added flags field to macvtap_queue to enable/disable processing of
virtio_net_hdr via IFF_VNET_HDR. This flag is checked to prepend virtio_net_hdr
in the receive path and process/skip virtio_net_hdr in the send path.

Original patch by Sridhar, further changes by Arnd.

Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/macvtap.c |  206 +++++++++++++++++++++++++++++++++++++++++++------
 1 files changed, 182 insertions(+), 24 deletions(-)

diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index e354501..55ceae0 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -17,6 +17,7 @@
 #include <net/net_namespace.h>
 #include <net/rtnetlink.h>
 #include <net/sock.h>
+#include <linux/virtio_net.h>
 
 /*
  * A macvtap queue is the central object of this driver, it connects
@@ -37,6 +38,7 @@ struct macvtap_queue {
 	struct socket sock;
 	struct macvlan_dev *vlan;
 	struct file *file;
+	unsigned int flags;
 };
 
 static struct proto macvtap_proto = {
@@ -276,6 +278,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
 	q->sock.ops = &macvtap_socket_ops;
 	sock_init_data(&q->sock, &q->sk);
 	q->sk.sk_write_space = macvtap_sock_write_space;
+	q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
 
 	err = macvtap_set_queue(dev, file, q);
 	if (err)
@@ -318,6 +321,111 @@ out:
 	return mask;
 }
 
+static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
+						size_t len, size_t linear,
+						int noblock, int *err)
+{
+	struct sk_buff *skb;
+
+	/* Under a page?  Don't bother with paged skb. */
+	if (prepad + len < PAGE_SIZE || !linear)
+		linear = len;
+
+	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
+				   err);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, prepad);
+	skb_put(skb, linear);
+	skb->data_len = len - linear;
+	skb->len += len - linear;
+
+	return skb;
+}
+
+/*
+ * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
+ * be shared with the tun/tap driver.
+ */
+static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb,
+				     struct virtio_net_hdr *vnet_hdr)
+{
+	unsigned short gso_type = 0;
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
+		case VIRTIO_NET_HDR_GSO_TCPV4:
+			gso_type = SKB_GSO_TCPV4;
+			break;
+		case VIRTIO_NET_HDR_GSO_TCPV6:
+			gso_type = SKB_GSO_TCPV6;
+			break;
+		case VIRTIO_NET_HDR_GSO_UDP:
+			gso_type = SKB_GSO_UDP;
+			break;
+		default:
+			return -EINVAL;
+		}
+
+		if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
+			gso_type |= SKB_GSO_TCP_ECN;
+
+		if (vnet_hdr->gso_size == 0)
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
+		if (!skb_partial_csum_set(skb, vnet_hdr->csum_start,
+					  vnet_hdr->csum_offset))
+			return -EINVAL;
+	}
+
+	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
+		skb_shinfo(skb)->gso_size = vnet_hdr->gso_size;
+		skb_shinfo(skb)->gso_type = gso_type;
+
+		/* Header must be checked, and gso_segs computed. */
+		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+		skb_shinfo(skb)->gso_segs = 0;
+	}
+	return 0;
+}
+
+static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
+				   struct virtio_net_hdr *vnet_hdr)
+{
+	memset(vnet_hdr, 0, sizeof(*vnet_hdr));
+
+	if (skb_is_gso(skb)) {
+		struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+		/* This is a hint as to how much should be linear. */
+		vnet_hdr->hdr_len = skb_headlen(skb);
+		vnet_hdr->gso_size = sinfo->gso_size;
+		if (sinfo->gso_type & SKB_GSO_TCPV4)
+			vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+		else if (sinfo->gso_type & SKB_GSO_TCPV6)
+			vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+		else if (sinfo->gso_type & SKB_GSO_UDP)
+			vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_UDP;
+		else
+			BUG();
+		if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+			vnet_hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
+	} else
+		vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
+
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		vnet_hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+		vnet_hdr->csum_start = skb->csum_start -
+					skb_headroom(skb);
+		vnet_hdr->csum_offset = skb->csum_offset;
+	} /* else everything is zero */
+
+	return 0;
+}
+
+
 /* Get packet from user space buffer */
 static ssize_t macvtap_get_user(struct macvtap_queue *q,
 				const struct iovec *iv, size_t count,
@@ -327,22 +435,53 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 	struct macvlan_dev *vlan;
 	size_t len = count;
 	int err;
+	struct virtio_net_hdr vnet_hdr = { 0 };
+	int vnet_hdr_len = 0;
+
+	if (q->flags & IFF_VNET_HDR) {
+		vnet_hdr_len = sizeof(vnet_hdr);
+
+		err = -EINVAL;
+		if ((len -= vnet_hdr_len) < 0)
+			goto err;
+
+		err = memcpy_fromiovecend((void *)&vnet_hdr, iv, 0,
+					   vnet_hdr_len);
+		if (err < 0)
+			goto err;
+		if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+		     vnet_hdr.csum_start + vnet_hdr.csum_offset + 2 >
+							vnet_hdr.hdr_len)
+			vnet_hdr.hdr_len = vnet_hdr.csum_start +
+						vnet_hdr.csum_offset + 2;
+		err = -EINVAL;
+		if (vnet_hdr.hdr_len > len)
+			goto err;
+	}
 
+	err = -EINVAL;
 	if (unlikely(len < ETH_HLEN))
-		return -EINVAL;
+		goto err;
 
-	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
+	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
+				noblock, &err);
 	if (!skb)
 		goto err;
 
-	skb_reserve(skb, NET_IP_ALIGN);
-	skb_put(skb, count);
-
-	err = skb_copy_datagram_from_iovec(skb, 0, iv, 0, len);
+	err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
 	if (err)
-		goto err;
+		goto err_kfree;
 
 	skb_set_network_header(skb, ETH_HLEN);
+	skb_reset_mac_header(skb);
+	skb->protocol = eth_hdr(skb)->h_proto;
+
+	if (vnet_hdr_len) {
+		err = macvtap_skb_from_vnet_hdr(skb, &vnet_hdr);
+		if (err)
+			goto err_kfree;
+	}
+
 	rcu_read_lock_bh();
 	vlan = rcu_dereference(q->vlan);
 	if (vlan)
@@ -353,15 +492,16 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
 
 	return count;
 
+err_kfree:
+	kfree_skb(skb);
+
 err:
 	rcu_read_lock_bh();
 	vlan = rcu_dereference(q->vlan);
 	if (vlan)
-		macvlan_count_rx(q->vlan, 0, false, false);
+		netdev_get_tx_queue(vlan->dev, 0)->tx_dropped++;
 	rcu_read_unlock_bh();
 
-	kfree_skb(skb);
-
 	return err;
 }
 
@@ -384,10 +524,25 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 {
 	struct macvlan_dev *vlan;
 	int ret;
+	int vnet_hdr_len = 0;
+
+	if (q->flags & IFF_VNET_HDR) {
+		struct virtio_net_hdr vnet_hdr;
+		vnet_hdr_len = sizeof (vnet_hdr);
+		if ((len -= vnet_hdr_len) < 0)
+			return -EINVAL;
+
+		ret = macvtap_skb_to_vnet_hdr(skb, &vnet_hdr);
+		if (ret)
+			return ret;
+
+		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, vnet_hdr_len))
+			return -EFAULT;
+	}
 
 	len = min_t(int, skb->len, len);
 
-	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
+	ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len, len);
 
 	rcu_read_lock_bh();
 	vlan = rcu_dereference(q->vlan);
@@ -395,7 +550,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
 		macvlan_count_rx(vlan, len, ret == 0, 0);
 	rcu_read_unlock_bh();
 
-	return ret ? ret : len;
+	return ret ? ret : (len + vnet_hdr_len);
 }
 
 static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb,
@@ -473,9 +628,14 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 		/* ignore the name, just look at flags */
 		if (get_user(u, &ifr->ifr_flags))
 			return -EFAULT;
-		if (u != (IFF_TAP | IFF_NO_PI))
-			return -EINVAL;
-		return 0;
+
+		ret = 0;
+		if ((u & ~IFF_VNET_HDR) != (IFF_NO_PI | IFF_TAP))
+			ret = -EINVAL;
+		else
+			q->flags = u;
+
+		return ret;
 
 	case TUNGETIFF:
 		rcu_read_lock_bh();
@@ -489,13 +649,13 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 
 		ret = 0;
 		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
-		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
+		    put_user(q->flags, &ifr->ifr_flags))
 			ret = -EFAULT;
 		dev_put(vlan->dev);
 		return ret;
 
 	case TUNGETFEATURES:
-		if (put_user((IFF_TAP | IFF_NO_PI), up))
+		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR, up))
 			return -EFAULT;
 		return 0;
 
@@ -509,15 +669,13 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
 	case TUNSETOFFLOAD:
 		/* let the user check for future flags */
 		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-			  TUN_F_TSO_ECN | TUN_F_UFO))
-			return -EINVAL;
-
-		/* TODO: add support for these, so far we don't
-			 support any offload */
-		if (arg & (TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
-			 TUN_F_TSO_ECN | TUN_F_UFO))
+			    TUN_F_TSO_ECN | TUN_F_UFO))
 			return -EINVAL;
 
+		/* TODO: only accept frames with the features that
+			 got enabled for forwarded frames */
+		if (!(q->flags & IFF_VNET_HDR))
+			return  -EINVAL;
 		return 0;
 
 	default:
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/3] macvtap: rework object lifetime rules
  2010-02-18 15:45                         ` [PATCH 1/3] macvtap: rework object lifetime rules Arnd Bergmann
@ 2010-02-18 20:09                           ` Sridhar Samudrala
  2010-02-18 22:11                           ` David Miller
  1 sibling, 0 replies; 63+ messages in thread
From: Sridhar Samudrala @ 2010-02-18 20:09 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: David Miller, kaber, eswierk, netdev

On Thu, 2010-02-18 at 16:45 +0100, Arnd Bergmann wrote:
> This reworks the change done by the previous patch
> in a more complete way.
> 
> The original macvtap code has a number of problems
> resulting from the use of RCU for protecting the
> access to struct macvtap_queue from open files.
> 
> This includes
> - need for GFP_ATOMIC allocations for skbs
> - potential deadlocks when copy_*_user sleeps
> - inability to work with vhost-net
> 
> Changing the lifetime of macvtap_queue to always
> depend on the open file solves all these. The
> RCU reference simply moves one step down to
> the reference on the macvlan_dev, which we
> only need for nonblocking operations.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Reviewed and tested.
Acked-by: Sridhar Samudrala <sri@us.ibm.com>

> ---
>  drivers/net/macvtap.c |  183 ++++++++++++++++++++++++-------------------------
>  1 files changed, 91 insertions(+), 92 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index fe7656b..7050997 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -60,30 +60,19 @@ static struct cdev macvtap_cdev;
> 
>  /*
>   * RCU usage:
> - * The macvtap_queue is referenced both from the chardev struct file
> - * and from the struct macvlan_dev using rcu_read_lock.
> + * The macvtap_queue and the macvlan_dev are loosely coupled, the
> + * pointers from one to the other can only be read while rcu_read_lock
> + * or macvtap_lock is held.
>   *
> - * We never actually update the contents of a macvtap_queue atomically
> - * with RCU but it is used for race-free destruction of a queue when
> - * either the file or the macvlan_dev goes away. Pointers back to
> - * the dev and the file are implicitly valid as long as the queue
> - * exists.
> + * Both the file and the macvlan_dev hold a reference on the macvtap_queue
> + * through sock_hold(&q->sk). When the macvlan_dev goes away first,
> + * q->vlan becomes inaccessible. When the files gets closed,
> + * macvtap_get_queue() fails.
>   *
> - * The callbacks from macvlan are always done with rcu_read_lock held
> - * already. For calls from file_operations, we use the rcu_read_lock_bh
> - * to get a reference count on the socket and the device.
> - *
> - * When destroying a queue, we remove the pointers from the file and
> - * from the dev and then synchronize_rcu to make sure no thread is
> - * still using the queue. There may still be references to the struct
> - * sock inside of the queue from outbound SKBs, but these never
> - * reference back to the file or the dev. The data structure is freed
> - * through __sk_free when both our references and any pending SKBs
> - * are gone.
> - *
> - * macvtap_lock is only used to prevent multiple concurrent open()
> - * calls to assign a new vlan->tap pointer. It could be moved into
> - * the macvlan_dev itself but is extremely rarely used.
> + * There may still be references to the struct sock inside of the
> + * queue from outbound SKBs, but these never reference back to the
> + * file or the dev. The data structure is freed through __sk_free
> + * when both our references and any pending SKBs are gone.
>   */
>  static DEFINE_SPINLOCK(macvtap_lock);
> 
> @@ -101,11 +90,12 @@ static int macvtap_set_queue(struct net_device *dev, struct file *file,
>  		goto out;
> 
>  	err = 0;
> -	q->vlan = vlan;
> +	rcu_assign_pointer(q->vlan, vlan);
>  	rcu_assign_pointer(vlan->tap, q);
> +	sock_hold(&q->sk);
> 
>  	q->file = file;
> -	rcu_assign_pointer(file->private_data, q);
> +	file->private_data = q;
> 
>  out:
>  	spin_unlock(&macvtap_lock);
> @@ -113,28 +103,25 @@ out:
>  }
> 
>  /*
> - * We must destroy each queue exactly once, when either
> - * the netdev or the file go away.
> + * The file owning the queue got closed, give up both
> + * the reference that the files holds as well as the
> + * one from the macvlan_dev if that still exists.
>   *
>   * Using the spinlock makes sure that we don't get
>   * to the queue again after destroying it.
> - *
> - * synchronize_rcu serializes with the packet flow
> - * that uses rcu_read_lock.
>   */
> -static void macvtap_del_queue(struct macvtap_queue **qp)
> +static void macvtap_put_queue(struct macvtap_queue *q)
>  {
> -	struct macvtap_queue *q;
> +	struct macvlan_dev *vlan;
> 
>  	spin_lock(&macvtap_lock);
> -	q = rcu_dereference(*qp);
> -	if (!q) {
> -		spin_unlock(&macvtap_lock);
> -		return;
> +	vlan = rcu_dereference(q->vlan);
> +	if (vlan) {
> +		rcu_assign_pointer(vlan->tap, NULL);
> +		rcu_assign_pointer(q->vlan, NULL);
> +		sock_put(&q->sk);
>  	}
> 
> -	rcu_assign_pointer(q->vlan->tap, NULL);
> -	rcu_assign_pointer(q->file->private_data, NULL);
>  	spin_unlock(&macvtap_lock);
> 
>  	synchronize_rcu();
> @@ -152,29 +139,29 @@ static struct macvtap_queue *macvtap_get_queue(struct net_device *dev,
>  	return rcu_dereference(vlan->tap);
>  }
> 
> +/*
> + * The net_device is going away, give up the reference
> + * that it holds on the queue (all the queues one day)
> + * and safely set the pointer from the queues to NULL.
> + */
>  static void macvtap_del_queues(struct net_device *dev)
>  {
>  	struct macvlan_dev *vlan = netdev_priv(dev);
> -	macvtap_del_queue(&vlan->tap);
> -}
> -
> -static inline struct macvtap_queue *macvtap_file_get_queue(struct file *file)
> -{
>  	struct macvtap_queue *q;
> -	rcu_read_lock_bh();
> -	q = rcu_dereference(file->private_data);
> -	if (q) {
> -		sock_hold(&q->sk);
> -		dev_hold(q->vlan->dev);
> +
> +	spin_lock(&macvtap_lock);
> +	q = rcu_dereference(vlan->tap);
> +	if (!q) {
> +		spin_unlock(&macvtap_lock);
> +		return;
>  	}
> -	rcu_read_unlock_bh();
> -	return q;
> -}
> 
> -static inline void macvtap_file_put_queue(struct macvtap_queue *q)
> -{
> +	rcu_assign_pointer(vlan->tap, NULL);
> +	rcu_assign_pointer(q->vlan, NULL);
> +	spin_unlock(&macvtap_lock);
> +
> +	synchronize_rcu();
>  	sock_put(&q->sk);
> -	dev_put(q->vlan->dev);
>  }
> 
>  /*
> @@ -284,7 +271,6 @@ static int macvtap_open(struct inode *inode, struct file *file)
>  	q->sock.type = SOCK_RAW;
>  	q->sock.state = SS_CONNECTED;
>  	sock_init_data(&q->sock, &q->sk);
> -	q->sk.sk_allocation = GFP_ATOMIC; /* for now */
>  	q->sk.sk_write_space = macvtap_sock_write_space;
> 
>  	err = macvtap_set_queue(dev, file, q);
> @@ -300,13 +286,14 @@ out:
> 
>  static int macvtap_release(struct inode *inode, struct file *file)
>  {
> -	macvtap_del_queue((struct macvtap_queue **)&file->private_data);
> +	struct macvtap_queue *q = file->private_data;
> +	macvtap_put_queue(q);
>  	return 0;
>  }
> 
>  static unsigned int macvtap_poll(struct file *file, poll_table * wait)
>  {
> -	struct macvtap_queue *q = macvtap_file_get_queue(file);
> +	struct macvtap_queue *q = file->private_data;
>  	unsigned int mask = POLLERR;
> 
>  	if (!q)
> @@ -323,7 +310,6 @@ static unsigned int macvtap_poll(struct file *file, poll_table * wait)
>  	     sock_writeable(&q->sk)))
>  		mask |= POLLOUT | POLLWRNORM;
> 
> -	macvtap_file_put_queue(q);
>  out:
>  	return mask;
>  }
> @@ -334,6 +320,7 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
>  				int noblock)
>  {
>  	struct sk_buff *skb;
> +	struct macvlan_dev *vlan;
>  	size_t len = count;
>  	int err;
> 
> @@ -341,26 +328,37 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
>  		return -EINVAL;
> 
>  	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
> -
> -	if (!skb) {
> -		macvlan_count_rx(q->vlan, 0, false, false);
> -		return err;
> -	}
> +	if (!skb)
> +		goto err;
> 
>  	skb_reserve(skb, NET_IP_ALIGN);
>  	skb_put(skb, count);
> 
> -	if (skb_copy_datagram_from_iovec(skb, 0, iv, 0, len)) {
> -		macvlan_count_rx(q->vlan, 0, false, false);
> -		kfree_skb(skb);
> -		return -EFAULT;
> -	}
> +	err = skb_copy_datagram_from_iovec(skb, 0, iv, 0, len);
> +	if (err)
> +		goto err;
> 
>  	skb_set_network_header(skb, ETH_HLEN);
> -
> -	macvlan_start_xmit(skb, q->vlan->dev);
> +	rcu_read_lock_bh();
> +	vlan = rcu_dereference(q->vlan);
> +	if (vlan)
> +		macvlan_start_xmit(skb, vlan->dev);
> +	else
> +		kfree_skb(skb);
> +	rcu_read_unlock_bh();
> 
>  	return count;
> +
> +err:
> +	rcu_read_lock_bh();
> +	vlan = rcu_dereference(q->vlan);
> +	if (vlan)
> +		macvlan_count_rx(q->vlan, 0, false, false);
> +	rcu_read_unlock_bh();
> +
> +	kfree_skb(skb);
> +
> +	return err;
>  }
> 
>  static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
> @@ -368,15 +366,10 @@ static ssize_t macvtap_aio_write(struct kiocb *iocb, const struct iovec *iv,
>  {
>  	struct file *file = iocb->ki_filp;
>  	ssize_t result = -ENOLINK;
> -	struct macvtap_queue *q = macvtap_file_get_queue(file);
> -
> -	if (!q)
> -		goto out;
> +	struct macvtap_queue *q = file->private_data;
> 
>  	result = macvtap_get_user(q, iv, iov_length(iv, count),
>  			      file->f_flags & O_NONBLOCK);
> -	macvtap_file_put_queue(q);
> -out:
>  	return result;
>  }
> 
> @@ -385,14 +378,17 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>  				const struct sk_buff *skb,
>  				const struct iovec *iv, int len)
>  {
> -	struct macvlan_dev *vlan = q->vlan;
> +	struct macvlan_dev *vlan;
>  	int ret;
> 
>  	len = min_t(int, skb->len, len);
> 
>  	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
> 
> +	rcu_read_lock_bh();
> +	vlan = rcu_dereference(q->vlan);
>  	macvlan_count_rx(vlan, len, ret == 0, 0);
> +	rcu_read_unlock_bh();
> 
>  	return ret ? ret : len;
>  }
> @@ -401,14 +397,16 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
>  				unsigned long count, loff_t pos)
>  {
>  	struct file *file = iocb->ki_filp;
> -	struct macvtap_queue *q = macvtap_file_get_queue(file);
> +	struct macvtap_queue *q = file->private_data;
> 
>  	DECLARE_WAITQUEUE(wait, current);
>  	struct sk_buff *skb;
>  	ssize_t len, ret = 0;
> 
> -	if (!q)
> -		return -ENOLINK;
> +	if (!q) {
> +		ret = -ENOLINK;
> +		goto out;
> +	}
> 
>  	len = iov_length(iv, count);
>  	if (len < 0) {
> @@ -444,7 +442,6 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
>  	remove_wait_queue(q->sk.sk_sleep, &wait);
> 
>  out:
> -	macvtap_file_put_queue(q);
>  	return ret;
>  }
> 
> @@ -454,12 +451,13 @@ out:
>  static long macvtap_ioctl(struct file *file, unsigned int cmd,
>  			  unsigned long arg)
>  {
> -	struct macvtap_queue *q;
> +	struct macvtap_queue *q = file->private_data;
> +	struct macvlan_dev *vlan;
>  	void __user *argp = (void __user *)arg;
>  	struct ifreq __user *ifr = argp;
>  	unsigned int __user *up = argp;
>  	unsigned int u;
> -	char devname[IFNAMSIZ];
> +	int ret;
> 
>  	switch (cmd) {
>  	case TUNSETIFF:
> @@ -471,16 +469,21 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
>  		return 0;
> 
>  	case TUNGETIFF:
> -		q = macvtap_file_get_queue(file);
> -		if (!q)
> +		rcu_read_lock_bh();
> +		vlan = rcu_dereference(q->vlan);
> +		if (vlan)
> +			dev_hold(vlan->dev);
> +		rcu_read_unlock_bh();
> +
> +		if (!vlan)
>  			return -ENOLINK;
> -		memcpy(devname, q->vlan->dev->name, sizeof(devname));
> -		macvtap_file_put_queue(q);
> 
> +		ret = 0;
>  		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
>  		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
> -			return -EFAULT;
> -		return 0;
> +			ret = -EFAULT;
> +		dev_put(vlan->dev);
> +		return ret;
> 
>  	case TUNGETFEATURES:
>  		if (put_user((IFF_TAP | IFF_NO_PI), up))
> @@ -491,11 +494,7 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
>  		if (get_user(u, up))
>  			return -EFAULT;
> 
> -		q = macvtap_file_get_queue(file);
> -		if (!q)
> -			return -ENOLINK;
>  		q->sk.sk_sndbuf = u;
> -		macvtap_file_put_queue(q);
>  		return 0;
> 
>  	case TUNSETOFFLOAD:


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/3] net/macvtap: add vhost support
  2010-02-18 15:46                         ` [PATCH 2/3] net/macvtap: add vhost support Arnd Bergmann
@ 2010-02-18 20:10                           ` Sridhar Samudrala
  2010-02-18 22:11                           ` David Miller
  1 sibling, 0 replies; 63+ messages in thread
From: Sridhar Samudrala @ 2010-02-18 20:10 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: David Miller, kaber, eswierk, netdev, Michael S. Tsirkin

On Thu, 2010-02-18 at 16:46 +0100, Arnd Bergmann wrote:
> This adds support for passing a macvtap file descriptor into
> vhost-net, much like we already do for tun/tap.
> 
> Most of the new code is taken from the respective patch
> in the tun driver and may get consolidated in the future.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Reviewed and tested.
Acked-by: Sridhar Samudrala <sri@us.ibm.com>

> ---
>  drivers/net/macvtap.c      |   98 ++++++++++++++++++++++++++++++++++---------
>  drivers/vhost/Kconfig      |    2 +-
>  drivers/vhost/net.c        |    8 +++-
>  include/linux/if_macvlan.h |   13 ++++++
>  4 files changed, 97 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index 7050997..e354501 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -58,6 +58,8 @@ static unsigned int macvtap_major;
>  static struct class *macvtap_class;
>  static struct cdev macvtap_cdev;
> 
> +static const struct proto_ops macvtap_socket_ops;
> +
>  /*
>   * RCU usage:
>   * The macvtap_queue and the macvlan_dev are loosely coupled, the
> @@ -176,7 +178,7 @@ static int macvtap_forward(struct net_device *dev, struct sk_buff *skb)
>  		return -ENOLINK;
> 
>  	skb_queue_tail(&q->sk.sk_receive_queue, skb);
> -	wake_up(q->sk.sk_sleep);
> +	wake_up_interruptible_poll(q->sk.sk_sleep, POLLIN | POLLRDNORM | POLLRDBAND);
>  	return 0;
>  }
> 
> @@ -242,7 +244,7 @@ static void macvtap_sock_write_space(struct sock *sk)
>  		return;
> 
>  	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> -		wake_up_interruptible_sync(sk->sk_sleep);
> +		wake_up_interruptible_poll(sk->sk_sleep, POLLOUT | POLLWRNORM | POLLWRBAND);
>  }
> 
>  static int macvtap_open(struct inode *inode, struct file *file)
> @@ -270,6 +272,8 @@ static int macvtap_open(struct inode *inode, struct file *file)
>  	init_waitqueue_head(&q->sock.wait);
>  	q->sock.type = SOCK_RAW;
>  	q->sock.state = SS_CONNECTED;
> +	q->sock.file = file;
> +	q->sock.ops = &macvtap_socket_ops;
>  	sock_init_data(&q->sock, &q->sk);
>  	q->sk.sk_write_space = macvtap_sock_write_space;
> 
> @@ -387,32 +391,20 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
> 
>  	rcu_read_lock_bh();
>  	vlan = rcu_dereference(q->vlan);
> -	macvlan_count_rx(vlan, len, ret == 0, 0);
> +	if (vlan)
> +		macvlan_count_rx(vlan, len, ret == 0, 0);
>  	rcu_read_unlock_bh();
> 
>  	return ret ? ret : len;
>  }
> 
> -static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> -				unsigned long count, loff_t pos)
> +static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb,
> +			       const struct iovec *iv, unsigned long len,
> +			       int noblock)
>  {
> -	struct file *file = iocb->ki_filp;
> -	struct macvtap_queue *q = file->private_data;
> -
>  	DECLARE_WAITQUEUE(wait, current);
>  	struct sk_buff *skb;
> -	ssize_t len, ret = 0;
> -
> -	if (!q) {
> -		ret = -ENOLINK;
> -		goto out;
> -	}
> -
> -	len = iov_length(iv, count);
> -	if (len < 0) {
> -		ret = -EINVAL;
> -		goto out;
> -	}
> +	ssize_t ret = 0;
> 
>  	add_wait_queue(q->sk.sk_sleep, &wait);
>  	while (len) {
> @@ -421,7 +413,7 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
>  		/* Read frames from the queue */
>  		skb = skb_dequeue(&q->sk.sk_receive_queue);
>  		if (!skb) {
> -			if (file->f_flags & O_NONBLOCK) {
> +			if (noblock) {
>  				ret = -EAGAIN;
>  				break;
>  			}
> @@ -440,7 +432,24 @@ static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> 
>  	current->state = TASK_RUNNING;
>  	remove_wait_queue(q->sk.sk_sleep, &wait);
> +	return ret;
> +}
> +
> +static ssize_t macvtap_aio_read(struct kiocb *iocb, const struct iovec *iv,
> +				unsigned long count, loff_t pos)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct macvtap_queue *q = file->private_data;
> +	ssize_t len, ret = 0;
> 
> +	len = iov_length(iv, count);
> +	if (len < 0) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ret = macvtap_do_read(q, iocb, iv, len, file->f_flags & O_NONBLOCK);
> +	ret = min_t(ssize_t, ret, len); /* XXX copied from tun.c. Why? */
>  out:
>  	return ret;
>  }
> @@ -538,6 +547,53 @@ static const struct file_operations macvtap_fops = {
>  #endif
>  };
> 
> +static int macvtap_sendmsg(struct kiocb *iocb, struct socket *sock,
> +			   struct msghdr *m, size_t total_len)
> +{
> +	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
> +	return macvtap_get_user(q, m->msg_iov, total_len,
> +			    m->msg_flags & MSG_DONTWAIT);
> +}
> +
> +static int macvtap_recvmsg(struct kiocb *iocb, struct socket *sock,
> +			   struct msghdr *m, size_t total_len,
> +			   int flags)
> +{
> +	struct macvtap_queue *q = container_of(sock, struct macvtap_queue, sock);
> +	int ret;
> +	if (flags & ~(MSG_DONTWAIT|MSG_TRUNC))
> +		return -EINVAL;
> +	ret = macvtap_do_read(q, iocb, m->msg_iov, total_len,
> +			  flags & MSG_DONTWAIT);
> +	if (ret > total_len) {
> +		m->msg_flags |= MSG_TRUNC;
> +		ret = flags & MSG_TRUNC ? ret : total_len;
> +	}
> +	return ret;
> +}
> +
> +/* Ops structure to mimic raw sockets with tun */
> +static const struct proto_ops macvtap_socket_ops = {
> +	.sendmsg = macvtap_sendmsg,
> +	.recvmsg = macvtap_recvmsg,
> +};
> +
> +/* Get an underlying socket object from tun file.  Returns error unless file is
> + * attached to a device.  The returned object works like a packet socket, it
> + * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for
> + * holding a reference to the file for as long as the socket is in use. */
> +struct socket *macvtap_get_socket(struct file *file)
> +{
> +	struct macvtap_queue *q;
> +	if (file->f_op != &macvtap_fops)
> +		return ERR_PTR(-EINVAL);
> +	q = file->private_data;
> +	if (!q)
> +		return ERR_PTR(-EBADFD);
> +	return &q->sock;
> +}
> +EXPORT_SYMBOL_GPL(macvtap_get_socket);
> +
>  static int macvtap_init(void)
>  {
>  	int err;
> diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
> index 9e93553..e4e2fd1 100644
> --- a/drivers/vhost/Kconfig
> +++ b/drivers/vhost/Kconfig
> @@ -1,6 +1,6 @@
>  config VHOST_NET
>  	tristate "Host kernel accelerator for virtio net (EXPERIMENTAL)"
> -	depends on NET && EVENTFD && (TUN || !TUN) && EXPERIMENTAL
> +	depends on NET && EVENTFD && (TUN || !TUN) && (MACVTAP || !MACVTAP) && EXPERIMENTAL
>  	---help---
>  	  This kernel module can be loaded in host kernel to accelerate
>  	  guest networking with virtio_net. Not to be confused with virtio_net
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 4c89283..91a324c 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -22,6 +22,7 @@
>  #include <linux/if_packet.h>
>  #include <linux/if_arp.h>
>  #include <linux/if_tun.h>
> +#include <linux/if_macvlan.h>
> 
>  #include <net/sock.h>
> 
> @@ -452,13 +453,16 @@ err:
>  	return ERR_PTR(r);
>  }
> 
> -static struct socket *get_tun_socket(int fd)
> +static struct socket *get_tap_socket(int fd)
>  {
>  	struct file *file = fget(fd);
>  	struct socket *sock;
>  	if (!file)
>  		return ERR_PTR(-EBADF);
>  	sock = tun_get_socket(file);
> +	if (!IS_ERR(sock))
> +		return sock;
> +	sock = macvtap_get_socket(file);
>  	if (IS_ERR(sock))
>  		fput(file);
>  	return sock;
> @@ -473,7 +477,7 @@ static struct socket *get_socket(int fd)
>  	sock = get_raw_socket(fd);
>  	if (!IS_ERR(sock))
>  		return sock;
> -	sock = get_tun_socket(fd);
> +	sock = get_tap_socket(fd);
>  	if (!IS_ERR(sock))
>  		return sock;
>  	return ERR_PTR(-ENOTSOCK);
> diff --git a/include/linux/if_macvlan.h b/include/linux/if_macvlan.h
> index f9cb9ba..b78a712 100644
> --- a/include/linux/if_macvlan.h
> +++ b/include/linux/if_macvlan.h
> @@ -7,6 +7,19 @@
>  #include <linux/netlink.h>
>  #include <net/netlink.h>
> 
> +#if defined(CONFIG_MACVTAP) || defined(CONFIG_MACVTAP_MODULE)
> +struct socket *macvtap_get_socket(struct file *);
> +#else
> +#include <linux/err.h>
> +#include <linux/errno.h>
> +struct file;
> +struct socket;
> +static inline struct socket *macvtap_get_socket(struct file *f)
> +{
> +	return ERR_PTR(-EINVAL);
> +}
> +#endif /* CONFIG_MACVTAP */
> +
>  struct macvlan_port;
>  struct macvtap_queue;
> 


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/3] macvtap: add GSO/csum offload support
  2010-02-18 15:48                         ` [PATCH 3/3] macvtap: add GSO/csum offload support Arnd Bergmann
@ 2010-02-18 20:38                           ` Sridhar Samudrala
  2010-02-18 22:11                           ` David Miller
  1 sibling, 0 replies; 63+ messages in thread
From: Sridhar Samudrala @ 2010-02-18 20:38 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: David Miller, kaber, eswierk, netdev

On Thu, 2010-02-18 at 16:48 +0100, Arnd Bergmann wrote:
> Added flags field to macvtap_queue to enable/disable processing of
> virtio_net_hdr via IFF_VNET_HDR. This flag is checked to prepend virtio_net_hdr
> in the receive path and process/skip virtio_net_hdr in the send path.
> 
> Original patch by Sridhar, further changes by Arnd.
> 
> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

The changes look good.
Tested it over gigabit ethernet. I am seeing send side CPU utilization dropping 
from 80% to 15% with netperf TCP_STREAM from guest to remote host.

Thanks
Sridhar

> ---
>  drivers/net/macvtap.c |  206 +++++++++++++++++++++++++++++++++++++++++++------
>  1 files changed, 182 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index e354501..55ceae0 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -17,6 +17,7 @@
>  #include <net/net_namespace.h>
>  #include <net/rtnetlink.h>
>  #include <net/sock.h>
> +#include <linux/virtio_net.h>
> 
>  /*
>   * A macvtap queue is the central object of this driver, it connects
> @@ -37,6 +38,7 @@ struct macvtap_queue {
>  	struct socket sock;
>  	struct macvlan_dev *vlan;
>  	struct file *file;
> +	unsigned int flags;
>  };
> 
>  static struct proto macvtap_proto = {
> @@ -276,6 +278,7 @@ static int macvtap_open(struct inode *inode, struct file *file)
>  	q->sock.ops = &macvtap_socket_ops;
>  	sock_init_data(&q->sock, &q->sk);
>  	q->sk.sk_write_space = macvtap_sock_write_space;
> +	q->flags = IFF_VNET_HDR | IFF_NO_PI | IFF_TAP;
> 
>  	err = macvtap_set_queue(dev, file, q);
>  	if (err)
> @@ -318,6 +321,111 @@ out:
>  	return mask;
>  }
> 
> +static inline struct sk_buff *macvtap_alloc_skb(struct sock *sk, size_t prepad,
> +						size_t len, size_t linear,
> +						int noblock, int *err)
> +{
> +	struct sk_buff *skb;
> +
> +	/* Under a page?  Don't bother with paged skb. */
> +	if (prepad + len < PAGE_SIZE || !linear)
> +		linear = len;
> +
> +	skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
> +				   err);
> +	if (!skb)
> +		return NULL;
> +
> +	skb_reserve(skb, prepad);
> +	skb_put(skb, linear);
> +	skb->data_len = len - linear;
> +	skb->len += len - linear;
> +
> +	return skb;
> +}
> +
> +/*
> + * macvtap_skb_from_vnet_hdr and macvtap_skb_to_vnet_hdr should
> + * be shared with the tun/tap driver.
> + */
> +static int macvtap_skb_from_vnet_hdr(struct sk_buff *skb,
> +				     struct virtio_net_hdr *vnet_hdr)
> +{
> +	unsigned short gso_type = 0;
> +	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> +		switch (vnet_hdr->gso_type & ~VIRTIO_NET_HDR_GSO_ECN) {
> +		case VIRTIO_NET_HDR_GSO_TCPV4:
> +			gso_type = SKB_GSO_TCPV4;
> +			break;
> +		case VIRTIO_NET_HDR_GSO_TCPV6:
> +			gso_type = SKB_GSO_TCPV6;
> +			break;
> +		case VIRTIO_NET_HDR_GSO_UDP:
> +			gso_type = SKB_GSO_UDP;
> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +
> +		if (vnet_hdr->gso_type & VIRTIO_NET_HDR_GSO_ECN)
> +			gso_type |= SKB_GSO_TCP_ECN;
> +
> +		if (vnet_hdr->gso_size == 0)
> +			return -EINVAL;
> +	}
> +
> +	if (vnet_hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) {
> +		if (!skb_partial_csum_set(skb, vnet_hdr->csum_start,
> +					  vnet_hdr->csum_offset))
> +			return -EINVAL;
> +	}
> +
> +	if (vnet_hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {
> +		skb_shinfo(skb)->gso_size = vnet_hdr->gso_size;
> +		skb_shinfo(skb)->gso_type = gso_type;
> +
> +		/* Header must be checked, and gso_segs computed. */
> +		skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> +		skb_shinfo(skb)->gso_segs = 0;
> +	}
> +	return 0;
> +}
> +
> +static int macvtap_skb_to_vnet_hdr(const struct sk_buff *skb,
> +				   struct virtio_net_hdr *vnet_hdr)
> +{
> +	memset(vnet_hdr, 0, sizeof(*vnet_hdr));
> +
> +	if (skb_is_gso(skb)) {
> +		struct skb_shared_info *sinfo = skb_shinfo(skb);
> +
> +		/* This is a hint as to how much should be linear. */
> +		vnet_hdr->hdr_len = skb_headlen(skb);
> +		vnet_hdr->gso_size = sinfo->gso_size;
> +		if (sinfo->gso_type & SKB_GSO_TCPV4)
> +			vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
> +		else if (sinfo->gso_type & SKB_GSO_TCPV6)
> +			vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
> +		else if (sinfo->gso_type & SKB_GSO_UDP)
> +			vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_UDP;
> +		else
> +			BUG();
> +		if (sinfo->gso_type & SKB_GSO_TCP_ECN)
> +			vnet_hdr->gso_type |= VIRTIO_NET_HDR_GSO_ECN;
> +	} else
> +		vnet_hdr->gso_type = VIRTIO_NET_HDR_GSO_NONE;
> +
> +	if (skb->ip_summed == CHECKSUM_PARTIAL) {
> +		vnet_hdr->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
> +		vnet_hdr->csum_start = skb->csum_start -
> +					skb_headroom(skb);
> +		vnet_hdr->csum_offset = skb->csum_offset;
> +	} /* else everything is zero */
> +
> +	return 0;
> +}
> +
> +
>  /* Get packet from user space buffer */
>  static ssize_t macvtap_get_user(struct macvtap_queue *q,
>  				const struct iovec *iv, size_t count,
> @@ -327,22 +435,53 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
>  	struct macvlan_dev *vlan;
>  	size_t len = count;
>  	int err;
> +	struct virtio_net_hdr vnet_hdr = { 0 };
> +	int vnet_hdr_len = 0;
> +
> +	if (q->flags & IFF_VNET_HDR) {
> +		vnet_hdr_len = sizeof(vnet_hdr);
> +
> +		err = -EINVAL;
> +		if ((len -= vnet_hdr_len) < 0)
> +			goto err;
> +
> +		err = memcpy_fromiovecend((void *)&vnet_hdr, iv, 0,
> +					   vnet_hdr_len);
> +		if (err < 0)
> +			goto err;
> +		if ((vnet_hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> +		     vnet_hdr.csum_start + vnet_hdr.csum_offset + 2 >
> +							vnet_hdr.hdr_len)
> +			vnet_hdr.hdr_len = vnet_hdr.csum_start +
> +						vnet_hdr.csum_offset + 2;
> +		err = -EINVAL;
> +		if (vnet_hdr.hdr_len > len)
> +			goto err;
> +	}
> 
> +	err = -EINVAL;
>  	if (unlikely(len < ETH_HLEN))
> -		return -EINVAL;
> +		goto err;
> 
> -	skb = sock_alloc_send_skb(&q->sk, NET_IP_ALIGN + len, noblock, &err);
> +	skb = macvtap_alloc_skb(&q->sk, NET_IP_ALIGN, len, vnet_hdr.hdr_len,
> +				noblock, &err);
>  	if (!skb)
>  		goto err;
> 
> -	skb_reserve(skb, NET_IP_ALIGN);
> -	skb_put(skb, count);
> -
> -	err = skb_copy_datagram_from_iovec(skb, 0, iv, 0, len);
> +	err = skb_copy_datagram_from_iovec(skb, 0, iv, vnet_hdr_len, len);
>  	if (err)
> -		goto err;
> +		goto err_kfree;
> 
>  	skb_set_network_header(skb, ETH_HLEN);
> +	skb_reset_mac_header(skb);
> +	skb->protocol = eth_hdr(skb)->h_proto;
> +
> +	if (vnet_hdr_len) {
> +		err = macvtap_skb_from_vnet_hdr(skb, &vnet_hdr);
> +		if (err)
> +			goto err_kfree;
> +	}
> +
>  	rcu_read_lock_bh();
>  	vlan = rcu_dereference(q->vlan);
>  	if (vlan)
> @@ -353,15 +492,16 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q,
> 
>  	return count;
> 
> +err_kfree:
> +	kfree_skb(skb);
> +
>  err:
>  	rcu_read_lock_bh();
>  	vlan = rcu_dereference(q->vlan);
>  	if (vlan)
> -		macvlan_count_rx(q->vlan, 0, false, false);
> +		netdev_get_tx_queue(vlan->dev, 0)->tx_dropped++;
>  	rcu_read_unlock_bh();
> 
> -	kfree_skb(skb);
> -
>  	return err;
>  }
> 
> @@ -384,10 +524,25 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>  {
>  	struct macvlan_dev *vlan;
>  	int ret;
> +	int vnet_hdr_len = 0;
> +
> +	if (q->flags & IFF_VNET_HDR) {
> +		struct virtio_net_hdr vnet_hdr;
> +		vnet_hdr_len = sizeof (vnet_hdr);
> +		if ((len -= vnet_hdr_len) < 0)
> +			return -EINVAL;
> +
> +		ret = macvtap_skb_to_vnet_hdr(skb, &vnet_hdr);
> +		if (ret)
> +			return ret;
> +
> +		if (memcpy_toiovecend(iv, (void *)&vnet_hdr, 0, vnet_hdr_len))
> +			return -EFAULT;
> +	}
> 
>  	len = min_t(int, skb->len, len);
> 
> -	ret = skb_copy_datagram_const_iovec(skb, 0, iv, 0, len);
> +	ret = skb_copy_datagram_const_iovec(skb, 0, iv, vnet_hdr_len, len);
> 
>  	rcu_read_lock_bh();
>  	vlan = rcu_dereference(q->vlan);
> @@ -395,7 +550,7 @@ static ssize_t macvtap_put_user(struct macvtap_queue *q,
>  		macvlan_count_rx(vlan, len, ret == 0, 0);
>  	rcu_read_unlock_bh();
> 
> -	return ret ? ret : len;
> +	return ret ? ret : (len + vnet_hdr_len);
>  }
> 
>  static ssize_t macvtap_do_read(struct macvtap_queue *q, struct kiocb *iocb,
> @@ -473,9 +628,14 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
>  		/* ignore the name, just look at flags */
>  		if (get_user(u, &ifr->ifr_flags))
>  			return -EFAULT;
> -		if (u != (IFF_TAP | IFF_NO_PI))
> -			return -EINVAL;
> -		return 0;
> +
> +		ret = 0;
> +		if ((u & ~IFF_VNET_HDR) != (IFF_NO_PI | IFF_TAP))
> +			ret = -EINVAL;
> +		else
> +			q->flags = u;
> +
> +		return ret;
> 
>  	case TUNGETIFF:
>  		rcu_read_lock_bh();
> @@ -489,13 +649,13 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
> 
>  		ret = 0;
>  		if (copy_to_user(&ifr->ifr_name, q->vlan->dev->name, IFNAMSIZ) ||
> -		    put_user((TUN_TAP_DEV | TUN_NO_PI), &ifr->ifr_flags))
> +		    put_user(q->flags, &ifr->ifr_flags))
>  			ret = -EFAULT;
>  		dev_put(vlan->dev);
>  		return ret;
> 
>  	case TUNGETFEATURES:
> -		if (put_user((IFF_TAP | IFF_NO_PI), up))
> +		if (put_user(IFF_TAP | IFF_NO_PI | IFF_VNET_HDR, up))
>  			return -EFAULT;
>  		return 0;
> 
> @@ -509,15 +669,13 @@ static long macvtap_ioctl(struct file *file, unsigned int cmd,
>  	case TUNSETOFFLOAD:
>  		/* let the user check for future flags */
>  		if (arg & ~(TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
> -			  TUN_F_TSO_ECN | TUN_F_UFO))
> -			return -EINVAL;
> -
> -		/* TODO: add support for these, so far we don't
> -			 support any offload */
> -		if (arg & (TUN_F_CSUM | TUN_F_TSO4 | TUN_F_TSO6 |
> -			 TUN_F_TSO_ECN | TUN_F_UFO))
> +			    TUN_F_TSO_ECN | TUN_F_UFO))
>  			return -EINVAL;
> 
> +		/* TODO: only accept frames with the features that
> +			 got enabled for forwarded frames */
> +		if (!(q->flags & IFF_VNET_HDR))
> +			return  -EINVAL;
>  		return 0;
> 
>  	default:


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 1/3] macvtap: rework object lifetime rules
  2010-02-18 15:45                         ` [PATCH 1/3] macvtap: rework object lifetime rules Arnd Bergmann
  2010-02-18 20:09                           ` Sridhar Samudrala
@ 2010-02-18 22:11                           ` David Miller
  1 sibling, 0 replies; 63+ messages in thread
From: David Miller @ 2010-02-18 22:11 UTC (permalink / raw)
  To: arnd; +Cc: sri, kaber, eswierk, netdev

From: Arnd Bergmann <arnd@arndb.de>
Date: Thu, 18 Feb 2010 16:45:36 +0100

> This reworks the change done by the previous patch
> in a more complete way.
> 
> The original macvtap code has a number of problems
> resulting from the use of RCU for protecting the
> access to struct macvtap_queue from open files.
> 
> This includes
> - need for GFP_ATOMIC allocations for skbs
> - potential deadlocks when copy_*_user sleeps
> - inability to work with vhost-net
> 
> Changing the lifetime of macvtap_queue to always
> depend on the open file solves all these. The
> RCU reference simply moves one step down to
> the reference on the macvlan_dev, which we
> only need for nonblocking operations.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/3] net/macvtap: add vhost support
  2010-02-18 15:46                         ` [PATCH 2/3] net/macvtap: add vhost support Arnd Bergmann
  2010-02-18 20:10                           ` Sridhar Samudrala
@ 2010-02-18 22:11                           ` David Miller
  1 sibling, 0 replies; 63+ messages in thread
From: David Miller @ 2010-02-18 22:11 UTC (permalink / raw)
  To: arnd; +Cc: sri, kaber, eswierk, netdev, mst

From: Arnd Bergmann <arnd@arndb.de>
Date: Thu, 18 Feb 2010 16:46:50 +0100

> This adds support for passing a macvtap file descriptor into
> vhost-net, much like we already do for tun/tap.
> 
> Most of the new code is taken from the respective patch
> in the tun driver and may get consolidated in the future.
> 
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 3/3] macvtap: add GSO/csum offload support
  2010-02-18 15:48                         ` [PATCH 3/3] macvtap: add GSO/csum offload support Arnd Bergmann
  2010-02-18 20:38                           ` Sridhar Samudrala
@ 2010-02-18 22:11                           ` David Miller
  1 sibling, 0 replies; 63+ messages in thread
From: David Miller @ 2010-02-18 22:11 UTC (permalink / raw)
  To: arnd; +Cc: sri, kaber, eswierk, netdev

From: Arnd Bergmann <arnd@arndb.de>
Date: Thu, 18 Feb 2010 16:48:17 +0100

> Added flags field to macvtap_queue to enable/disable processing of
> virtio_net_hdr via IFF_VNET_HDR. This flag is checked to prepend virtio_net_hdr
> in the receive path and process/skip virtio_net_hdr in the send path.
> 
> Original patch by Sridhar, further changes by Arnd.
> 
> Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>

Applied.

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2010-02-18 22:11 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-27 10:04 [PATCH 0/3 v3] macvtap driver Arnd Bergmann
2010-01-27 10:04 ` [Bridge] " Arnd Bergmann
2010-01-27 10:05 ` [PATCH 1/3] net: maintain namespace isolation between vlan and real device Arnd Bergmann
2010-01-27 10:05   ` [Bridge] " Arnd Bergmann
2010-01-29  5:33   ` David Miller
2010-01-29  5:33     ` [Bridge] " David Miller
2010-01-29 10:12     ` Arnd Bergmann
2010-01-29 10:12       ` [Bridge] " Arnd Bergmann
2010-01-27 10:06 ` [PATCH 2/3] net/macvlan: allow multiple driver backends Arnd Bergmann
2010-01-27 10:06   ` [Bridge] " Arnd Bergmann
2010-01-27 21:09 ` [PATCH 3/3] net: macvtap driver Arnd Bergmann
2010-01-27 21:09   ` [Bridge] " Arnd Bergmann
2010-01-28 17:34   ` Michael S. Tsirkin
2010-01-28 17:34     ` [Bridge] " Michael S. Tsirkin
2010-01-28 20:18     ` Arnd Bergmann
2010-01-28 20:18       ` [Bridge] " Arnd Bergmann
2010-01-29 11:21       ` Michael S. Tsirkin
2010-01-29 11:21         ` [Bridge] " Michael S. Tsirkin
2010-01-29 19:49         ` Arnd Bergmann
2010-01-29 19:49           ` [Bridge] " Arnd Bergmann
2010-01-27 21:59 ` [PATCH 0/3 v3] " Arnd Bergmann
2010-01-27 21:59   ` [Bridge] " Arnd Bergmann
2010-01-30 22:22 ` [PATCH 0/3 v4] " Arnd Bergmann
2010-01-30 22:22 ` Arnd Bergmann
2010-01-30 22:22   ` [Bridge] " Arnd Bergmann
2010-01-30 22:23   ` [PATCH 1/3] net: maintain namespace isolation between vlan and real device Arnd Bergmann
2010-01-30 22:23   ` Arnd Bergmann
2010-01-30 22:23     ` [Bridge] " Arnd Bergmann
2010-01-30 22:23   ` [PATCH 2/3] macvlan: allow multiple driver backends Arnd Bergmann
2010-01-30 22:23   ` Arnd Bergmann
2010-01-30 22:23     ` [Bridge] " Arnd Bergmann
2010-01-30 22:24   ` [PATCH 3/3] net: macvtap driver Arnd Bergmann
2010-01-30 22:24   ` Arnd Bergmann
2010-01-30 22:24     ` [Bridge] " Arnd Bergmann
2010-02-04  4:21   ` [PATCH 0/3 v4] " David Miller
2010-02-04  4:21   ` David Miller
2010-02-04  4:21     ` [Bridge] " David Miller
2010-02-08 17:14     ` Ed Swierk
2010-02-08 18:55       ` Sridhar Samudrala
2010-02-08 23:30         ` Ed Swierk
2010-02-10 14:50           ` Arnd Bergmann
2010-02-11  0:42             ` Ed Swierk
2010-02-11  7:12               ` Arnd Bergmann
2010-02-09  3:25         ` Ed Swierk
2010-02-10 14:52           ` Arnd Bergmann
2010-02-10 14:48         ` Arnd Bergmann
2010-02-10 18:05           ` Sridhar Samudrala
2010-02-10 18:10             ` Patrick McHardy
2010-02-11 15:45               ` [PATCH] net/macvtap: fix reference counting Arnd Bergmann
2010-02-11 15:55                 ` [PATCH v2] " Arnd Bergmann
2010-02-11 21:09                   ` Sridhar Samudrala
2010-02-16  5:53                     ` David Miller
2010-02-18 15:44                       ` Arnd Bergmann
2010-02-18 15:45                         ` [PATCH 1/3] macvtap: rework object lifetime rules Arnd Bergmann
2010-02-18 20:09                           ` Sridhar Samudrala
2010-02-18 22:11                           ` David Miller
2010-02-18 15:46                         ` [PATCH 2/3] net/macvtap: add vhost support Arnd Bergmann
2010-02-18 20:10                           ` Sridhar Samudrala
2010-02-18 22:11                           ` David Miller
2010-02-18 15:48                         ` [PATCH 3/3] macvtap: add GSO/csum offload support Arnd Bergmann
2010-02-18 20:38                           ` Sridhar Samudrala
2010-02-18 22:11                           ` David Miller
2010-02-12 20:58                   ` [PATCH v2] net/macvtap: fix reference counting Ed Swierk

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.