[net-next,3/3] tap: XDP support
diff mbox series

Message ID 1501147533-12368-4-git-send-email-jasowang@redhat.com
State New, archived
Headers show
Series
  • XDP support for tap
Related show

Commit Message

Jason Wang July 27, 2017, 9:25 a.m. UTC
This patch tries to implement XDP for tun. The implementation was
split into two parts:

- fast path: small and no gso packet. We try to do XDP at page level
  before build_skb(). For XDP_TX, since creating/destroying queues
  were completely under control of userspace, it was implemented
  through generic XDP helper after skb has been built. This could be
  optimized in the future.
- slow path: big or gso packet. We try to do it after skb was created
  through generic XDP helpers.

XDP_REDIRECT was not implemented, it could be done on top.

xdp1 test shows 47.6% improvement:

Before: ~2.1Mpps
After:  ~3.1Mpps

Suggested-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/net/tun.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 142 insertions(+), 7 deletions(-)

Comments

Jakub Kicinski July 28, 2017, 3:13 a.m. UTC | #1
On Thu, 27 Jul 2017 17:25:33 +0800, Jason Wang wrote:
> This patch tries to implement XDP for tun. The implementation was
> split into two parts:
> 
> - fast path: small and no gso packet. We try to do XDP at page level
>   before build_skb(). For XDP_TX, since creating/destroying queues
>   were completely under control of userspace, it was implemented
>   through generic XDP helper after skb has been built. This could be
>   optimized in the future.
> - slow path: big or gso packet. We try to do it after skb was created
>   through generic XDP helpers.
> 
> XDP_REDIRECT was not implemented, it could be done on top.
> 
> xdp1 test shows 47.6% improvement:
> 
> Before: ~2.1Mpps
> After:  ~3.1Mpps
> 
> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

> @@ -1008,6 +1016,56 @@ tun_net_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
>  	stats->tx_dropped = tx_dropped;
>  }
>  
> +static int tun_xdp_set(struct net_device *dev, struct bpf_prog *prog,
> +		       struct netlink_ext_ack *extack)
> +{
> +	struct tun_struct *tun = netdev_priv(dev);
> +	struct bpf_prog *old_prog;
> +
> +	/* We will shift the packet that can't be handled to generic
> +	 * XDP layer.
> +	 */
> +
> +	old_prog = rtnl_dereference(tun->xdp_prog);
> +	if (old_prog)
> +		bpf_prog_put(old_prog);
> +	rcu_assign_pointer(tun->xdp_prog, prog);

Is this OK?  Could this lead to the program getting freed and then
datapath accessing a stale pointer?  I mean in the scenario where the
process gets pre-empted between the bpf_prog_put() and
rcu_assign_pointer()?

> +	if (prog) {
> +		prog = bpf_prog_add(prog, 1);
> +		if (IS_ERR(prog))
> +			return PTR_ERR(prog);
> +	}

I don't think you need this extra reference here.  dev_change_xdp_fd()
will call bpf_prog_get_type() which means driver gets the program with
a reference already taken, drivers does have to free that reference when
program is removed (or device is freed, as you correctly do).

> +	return 0;
> +}
> +
Jason Wang July 28, 2017, 3:28 a.m. UTC | #2
On 2017年07月28日 11:13, Jakub Kicinski wrote:
> On Thu, 27 Jul 2017 17:25:33 +0800, Jason Wang wrote:
>> This patch tries to implement XDP for tun. The implementation was
>> split into two parts:
>>
>> - fast path: small and no gso packet. We try to do XDP at page level
>>    before build_skb(). For XDP_TX, since creating/destroying queues
>>    were completely under control of userspace, it was implemented
>>    through generic XDP helper after skb has been built. This could be
>>    optimized in the future.
>> - slow path: big or gso packet. We try to do it after skb was created
>>    through generic XDP helpers.
>>
>> XDP_REDIRECT was not implemented, it could be done on top.
>>
>> xdp1 test shows 47.6% improvement:
>>
>> Before: ~2.1Mpps
>> After:  ~3.1Mpps
>>
>> Suggested-by: Michael S. Tsirkin <mst@redhat.com>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> @@ -1008,6 +1016,56 @@ tun_net_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
>>   	stats->tx_dropped = tx_dropped;
>>   }
>>   
>> +static int tun_xdp_set(struct net_device *dev, struct bpf_prog *prog,
>> +		       struct netlink_ext_ack *extack)
>> +{
>> +	struct tun_struct *tun = netdev_priv(dev);
>> +	struct bpf_prog *old_prog;
>> +
>> +	/* We will shift the packet that can't be handled to generic
>> +	 * XDP layer.
>> +	 */
>> +
>> +	old_prog = rtnl_dereference(tun->xdp_prog);
>> +	if (old_prog)
>> +		bpf_prog_put(old_prog);
>> +	rcu_assign_pointer(tun->xdp_prog, prog);
> Is this OK?  Could this lead to the program getting freed and then
> datapath accessing a stale pointer?  I mean in the scenario where the
> process gets pre-empted between the bpf_prog_put() and
> rcu_assign_pointer()?

Will call bpf_prog_put() after rcu_assign_pointer().

>
>> +	if (prog) {
>> +		prog = bpf_prog_add(prog, 1);
>> +		if (IS_ERR(prog))
>> +			return PTR_ERR(prog);
>> +	}
> I don't think you need this extra reference here.  dev_change_xdp_fd()
> will call bpf_prog_get_type() which means driver gets the program with
> a reference already taken, drivers does have to free that reference when
> program is removed (or device is freed, as you correctly do).

I see, will drop this in next version.

Thanks.

>
>> +	return 0;
>> +}
>> +
Michael S. Tsirkin July 28, 2017, 3:46 a.m. UTC | #3
On Fri, Jul 28, 2017 at 11:28:54AM +0800, Jason Wang wrote:
> > > +	old_prog = rtnl_dereference(tun->xdp_prog);
> > > +	if (old_prog)
> > > +		bpf_prog_put(old_prog);
> > > +	rcu_assign_pointer(tun->xdp_prog, prog);
> > Is this OK?  Could this lead to the program getting freed and then
> > datapath accessing a stale pointer?  I mean in the scenario where the
> > process gets pre-empted between the bpf_prog_put() and
> > rcu_assign_pointer()?
> 
> Will call bpf_prog_put() after rcu_assign_pointer().

I suspect you need to sync RCU or something before that.
Jason Wang July 28, 2017, 3:50 a.m. UTC | #4
On 2017年07月28日 11:46, Michael S. Tsirkin wrote:
> On Fri, Jul 28, 2017 at 11:28:54AM +0800, Jason Wang wrote:
>>>> +	old_prog = rtnl_dereference(tun->xdp_prog);
>>>> +	if (old_prog)
>>>> +		bpf_prog_put(old_prog);
>>>> +	rcu_assign_pointer(tun->xdp_prog, prog);
>>> Is this OK?  Could this lead to the program getting freed and then
>>> datapath accessing a stale pointer?  I mean in the scenario where the
>>> process gets pre-empted between the bpf_prog_put() and
>>> rcu_assign_pointer()?
>> Will call bpf_prog_put() after rcu_assign_pointer().
> I suspect you need to sync RCU or something before that.

__bpf_prog_put() will do call_rcu(), so looks like it was ok.

Thanks
Jakub Kicinski July 28, 2017, 3:51 a.m. UTC | #5
On Fri, 28 Jul 2017 06:46:40 +0300, Michael S. Tsirkin wrote:
> On Fri, Jul 28, 2017 at 11:28:54AM +0800, Jason Wang wrote:
> > > > +	old_prog = rtnl_dereference(tun->xdp_prog);
> > > > +	if (old_prog)
> > > > +		bpf_prog_put(old_prog);
> > > > +	rcu_assign_pointer(tun->xdp_prog, prog);  
> > > Is this OK?  Could this lead to the program getting freed and then
> > > datapath accessing a stale pointer?  I mean in the scenario where the
> > > process gets pre-empted between the bpf_prog_put() and
> > > rcu_assign_pointer()?  
> > 
> > Will call bpf_prog_put() after rcu_assign_pointer().  
> 
> I suspect you need to sync RCU or something before that.

I think the bpf_prog_put() will use call_rcu() to do the actual free:

static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
{
	if (atomic_dec_and_test(&prog->aux->refcnt)) {
		trace_bpf_prog_put_rcu(prog);
		/* bpf_prog_free_id() must be called first */
		bpf_prog_free_id(prog, do_idr_lock);
		bpf_prog_kallsyms_del(prog);
		call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
	}
}

It's just that we are only under the rtnl here, RCU lock is not held, so
grace period may elapse between bpf_prog_put() and rcu_assign_pointer().
Michael S. Tsirkin July 28, 2017, 4:05 a.m. UTC | #6
On Fri, Jul 28, 2017 at 11:50:45AM +0800, Jason Wang wrote:
> 
> 
> On 2017年07月28日 11:46, Michael S. Tsirkin wrote:
> > On Fri, Jul 28, 2017 at 11:28:54AM +0800, Jason Wang wrote:
> > > > > +	old_prog = rtnl_dereference(tun->xdp_prog);
> > > > > +	if (old_prog)
> > > > > +		bpf_prog_put(old_prog);
> > > > > +	rcu_assign_pointer(tun->xdp_prog, prog);
> > > > Is this OK?  Could this lead to the program getting freed and then
> > > > datapath accessing a stale pointer?  I mean in the scenario where the
> > > > process gets pre-empted between the bpf_prog_put() and
> > > > rcu_assign_pointer()?
> > > Will call bpf_prog_put() after rcu_assign_pointer().
> > I suspect you need to sync RCU or something before that.
> 
> __bpf_prog_put() will do call_rcu(), so looks like it was ok.
> 
> Thanks

True - I missed that.
Daniel Borkmann July 28, 2017, 3:11 p.m. UTC | #7
On 07/28/2017 05:46 AM, Michael S. Tsirkin wrote:
> On Fri, Jul 28, 2017 at 11:28:54AM +0800, Jason Wang wrote:
>>>> +	old_prog = rtnl_dereference(tun->xdp_prog);
>>>> +	if (old_prog)
>>>> +		bpf_prog_put(old_prog);
>>>> +	rcu_assign_pointer(tun->xdp_prog, prog);
>>> Is this OK?  Could this lead to the program getting freed and then
>>> datapath accessing a stale pointer?  I mean in the scenario where the
>>> process gets pre-empted between the bpf_prog_put() and
>>> rcu_assign_pointer()?
>>
>> Will call bpf_prog_put() after rcu_assign_pointer().

+1, good catch Jakub.

> I suspect you need to sync RCU or something before that.

No, see for example generic_xdp_install().

Patch
diff mbox series

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index de1a2ef..13abe23 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -73,6 +73,8 @@ 
 #include <linux/seq_file.h>
 #include <linux/uio.h>
 #include <linux/skb_array.h>
+#include <linux/bpf.h>
+#include <linux/bpf_trace.h>
 
 #include <linux/uaccess.h>
 
@@ -105,7 +107,8 @@  do {								\
 } while (0)
 #endif
 
-#define TUN_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
+#define TUN_HEADROOM 256
+#define TUN_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD + TUN_HEADROOM)
 
 /* TUN device flags */
 
@@ -224,6 +227,7 @@  struct tun_struct {
 	u32 flow_count;
 	u32 rx_batched;
 	struct tun_pcpu_stats __percpu *pcpu_stats;
+	struct bpf_prog __rcu *xdp_prog;
 };
 
 #ifdef CONFIG_TUN_VNET_CROSS_LE
@@ -590,6 +594,7 @@  static void tun_detach(struct tun_file *tfile, bool clean)
 static void tun_detach_all(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
+	struct bpf_prog *xdp_prog = rtnl_dereference(tun->xdp_prog);
 	struct tun_file *tfile, *tmp;
 	int i, n = tun->numqueues;
 
@@ -622,6 +627,9 @@  static void tun_detach_all(struct net_device *dev)
 	}
 	BUG_ON(tun->numdisabled != 0);
 
+	if (xdp_prog)
+		bpf_prog_put(xdp_prog);
+
 	if (tun->flags & IFF_PERSIST)
 		module_put(THIS_MODULE);
 }
@@ -1008,6 +1016,56 @@  tun_net_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 	stats->tx_dropped = tx_dropped;
 }
 
+static int tun_xdp_set(struct net_device *dev, struct bpf_prog *prog,
+		       struct netlink_ext_ack *extack)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+
+	/* We will shift the packet that can't be handled to generic
+	 * XDP layer.
+	 */
+
+	old_prog = rtnl_dereference(tun->xdp_prog);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+	rcu_assign_pointer(tun->xdp_prog, prog);
+
+	if (prog) {
+		prog = bpf_prog_add(prog, 1);
+		if (IS_ERR(prog))
+			return PTR_ERR(prog);
+	}
+
+	return 0;
+}
+
+static u32 tun_xdp_query(struct net_device *dev)
+{
+	struct tun_struct *tun = netdev_priv(dev);
+	const struct bpf_prog *xdp_prog;
+
+	xdp_prog = rtnl_dereference(tun->xdp_prog);
+	if (xdp_prog)
+		return xdp_prog->aux->id;
+
+	return 0;
+}
+
+static int tun_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return tun_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_QUERY_PROG:
+		xdp->prog_id = tun_xdp_query(dev);
+		xdp->prog_attached = !!xdp->prog_id;
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops tun_netdev_ops = {
 	.ndo_uninit		= tun_net_uninit,
 	.ndo_open		= tun_net_open,
@@ -1038,6 +1096,7 @@  static const struct net_device_ops tap_netdev_ops = {
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= tun_set_headroom,
 	.ndo_get_stats64	= tun_net_get_stats64,
+	.ndo_xdp		= tun_xdp,
 };
 
 static void tun_flow_init(struct tun_struct *tun)
@@ -1217,16 +1276,21 @@  static bool tun_can_build_skb(struct tun_struct *tun, struct tun_file *tfile,
 	return true;
 }
 
-static struct sk_buff *tun_build_skb(struct tun_file *tfile,
+static struct sk_buff *tun_build_skb(struct tun_struct *tun,
+				     struct tun_file *tfile,
 				     struct iov_iter *from,
-				     int len)
+				     struct virtio_net_hdr *hdr,
+				     int len, int *generic_xdp)
 {
 	struct page_frag *alloc_frag = &tfile->alloc_frag;
 	struct sk_buff *skb;
+	struct bpf_prog *xdp_prog;
 	int buflen = SKB_DATA_ALIGN(len + TUN_RX_PAD) +
 		     SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	unsigned int delta = 0;
 	char *buf;
 	size_t copied;
+	bool xdp_xmit = false;
 
 	if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
 		return ERR_PTR(-ENOMEM);
@@ -1238,16 +1302,68 @@  static struct sk_buff *tun_build_skb(struct tun_file *tfile,
 	if (copied != len)
 		return ERR_PTR(-EFAULT);
 
+	if (hdr->gso_type)
+		*generic_xdp = 1;
+	else
+		*generic_xdp = 0;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(tun->xdp_prog);
+	if (xdp_prog && !*generic_xdp) {
+		struct xdp_buff xdp;
+		void *orig_data;
+		u32 act;
+
+		xdp.data_hard_start = buf;
+		xdp.data = buf + TUN_RX_PAD;
+		xdp.data_end = xdp.data + len;
+		orig_data = xdp.data;
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+		switch (act) {
+		case XDP_TX:
+			xdp_xmit = true;
+			/* fall through */
+		case XDP_PASS:
+			delta = orig_data - xdp.data;
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(act);
+			/* fall through */
+		case XDP_ABORTED:
+			trace_xdp_exception(tun->dev, xdp_prog, act);
+			/* fall through */
+		case XDP_DROP:
+			goto err_xdp;
+		}
+	}
+
 	skb = build_skb(buf, buflen);
-	if (!skb)
+	if (!skb) {
+		rcu_read_unlock();
 		return ERR_PTR(-ENOMEM);
+	}
 
-	skb_reserve(skb, TUN_RX_PAD);
-	skb_put(skb, len);
+	skb_reserve(skb, TUN_RX_PAD - delta);
+	skb_put(skb, len + delta);
 	get_page(alloc_frag->page);
 	alloc_frag->offset += buflen;
 
+	if (xdp_xmit) {
+		skb->dev = tun->dev;
+		generic_xdp_tx(skb, xdp_prog);
+		rcu_read_lock();
+		return NULL;
+	}
+
+	rcu_read_unlock();
+
 	return skb;
+
+err_xdp:
+	rcu_read_unlock();
+	this_cpu_inc(tun->pcpu_stats->rx_dropped);
+	return NULL;
 }
 
 /* Get packet from user space buffer */
@@ -1266,6 +1382,7 @@  static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	bool zerocopy = false;
 	int err;
 	u32 rxhash;
+	int generic_xdp = 1;
 
 	if (!(tun->dev->flags & IFF_UP))
 		return -EIO;
@@ -1324,11 +1441,13 @@  static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	}
 
 	if (tun_can_build_skb(tun, tfile, len, noblock, zerocopy)) {
-		skb = tun_build_skb(tfile, from, len);
+		skb = tun_build_skb(tun, tfile, from, &gso, len, &generic_xdp);
 		if (IS_ERR(skb)) {
 			this_cpu_inc(tun->pcpu_stats->rx_dropped);
 			return PTR_ERR(skb);
 		}
+		if (!skb)
+			return total_len;
 	} else {
 		if (!zerocopy) {
 			copylen = len;
@@ -1402,6 +1521,22 @@  static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
 	skb_reset_network_header(skb);
 	skb_probe_transport_header(skb, 0);
 
+	if (generic_xdp) {
+		struct bpf_prog *xdp_prog;
+		int ret;
+
+		rcu_read_lock();
+		xdp_prog = rcu_dereference(tun->xdp_prog);
+		if (xdp_prog) {
+			ret = do_xdp_generic(xdp_prog, skb);
+			if (ret != XDP_PASS) {
+				rcu_read_unlock();
+				return total_len;
+			}
+		}
+		rcu_read_unlock();
+	}
+
 	rxhash = __skb_get_hash_symmetric(skb);
 #ifndef CONFIG_4KSTACKS
 	tun_rx_batched(tun, tfile, skb, more);