netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v3 bpf-next 0/8] veth: Driver XDP
@ 2018-07-22 15:13 Toshiaki Makita
  2018-07-22 15:13 ` [PATCH v3 bpf-next 1/8] net: Export skb_headers_offset_update Toshiaki Makita
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-22 15:13 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, Jesper Dangaard Brouer

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.

  NIC -----------> veth===veth
 (XDP) (redirect)        (XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.

Envisioned use-cases
--------------------

* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.

Implementation
--------------

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
 - patch 1: Export a function needed for veth XDP.
 - patch 2-3: Basic implementation of veth XDP.
 - patch 4-5: Add ndo_xdp_xmit.
 - patch 6-7: Add XDP_TX and XDP_REDIRECT.
 - patch 8: Performance optimization for multi-queue env.

Tests and performance numbers
-----------------------------

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The
server has 20 of Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).

veth XDP action    Flows    Mpps
================================
DROP                   1    10.6
DROP                   2    21.2
DROP                 100    36.0
TX                     1     5.0
TX                     2    10.0
TX                   100    31.0

I also measured netperf TCP_STREAM but was not so great performance due
to lack of tx/rx checksum offload and TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction         Flows   Gbps
==============================
external->veth        1   20.8
external->veth        2   23.5
external->veth      100   23.6
veth->external        1    9.0
veth->external        2   17.8
veth->external      100   22.9

Also tested doing ifup/down or load/unload a XDP program repeatedly
during processing XDP packets in order to check if enabling/disabling
NAPI is working as expected, and found no problems.

Note:
This patchset depends on commit d8d7218ad842 ("xdp: XDP_REDIRECT should
check IFF_UP and MTU") which is currently in DaveM's net-next tree.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Toshiaki Makita (8):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  xdp: Add a flag for disabling napi_direct of xdp_return_frame in
    xdp_mem_info
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c     | 735 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/skbuff.h |   1 +
 include/net/xdp.h      |   4 +
 net/core/skbuff.c      |   3 +-
 net/core/xdp.c         |   6 +-
 5 files changed, 737 insertions(+), 12 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 1/8] net: Export skb_headers_offset_update
  2018-07-22 15:13 [PATCH v3 bpf-next 0/8] veth: Driver XDP Toshiaki Makita
@ 2018-07-22 15:13 ` Toshiaki Makita
  2018-07-22 15:13 ` [PATCH v3 bpf-next 2/8] veth: Add driver XDP Toshiaki Makita
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-22 15:13 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, Jesper Dangaard Brouer

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c      | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..f6929688853a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 0c1a00672ba9..5366d1660e5b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
 	/* Only adjust this if it actually is csum_start rather than csum */
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_network_header += off;
 	skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 2/8] veth: Add driver XDP
  2018-07-22 15:13 [PATCH v3 bpf-next 0/8] veth: Driver XDP Toshiaki Makita
  2018-07-22 15:13 ` [PATCH v3 bpf-next 1/8] net: Export skb_headers_offset_update Toshiaki Makita
@ 2018-07-22 15:13 ` Toshiaki Makita
  2018-07-24  0:23   ` Jakub Kicinski
  2018-07-22 15:13 ` [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-22 15:13 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, Jesper Dangaard Brouer

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but a starting point to implement other
useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP is now heavily
relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
enqueues packets to the ring and peer NAPI handler drains the ring.

Currently only one ring is allocated for each veth device, so it does
not scale on multiqueue env. This can be resolved by allocating rings
on the per-queue basis later.

Note that NAPI is not used but netif_rx is used when XDP is not loaded,
so this does not change the default behaviour.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 373 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 366 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39ee57e..78fa08cb6e24 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/ptr_ring.h>
+#include <linux/skb_array.h>
+#include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_RING_SIZE		256
+#define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
 	u64			packets;
 	u64			bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+	struct napi_struct	xdp_napi;
+	struct net_device	*dev;
+	struct bpf_prog __rcu	*xdp_prog;
+	struct bpf_prog		*_xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	bool			rx_notify_masked;
+	struct ptr_ring		xdp_ring;
+	struct xdp_rxq_info	xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_link_ksettings	= veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+	/* Write ptr_ring before reading rx_notify_masked */
+	smp_mb();
+	if (!priv->rx_notify_masked) {
+		priv->rx_notify_masked = true;
+		napi_schedule(&priv->xdp_napi);
+	}
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+	if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+		dev_kfree_skb_any(skb);
+		return NET_RX_DROP;
+	}
+
+	return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+
+	return __dev_forward_skb(dev, skb) ?: xdp ?
+		veth_xdp_rx(priv, skb) :
+		netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
 	int length = skb->len;
+	bool rcv_xdp = false;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+	rcv_priv = netdev_priv(rcv);
+	rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+	if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 drop:
 		atomic64_inc(&priv->dropped);
 	}
+
+	if (rcv_xdp)
+		__veth_xdp_flush(rcv_priv);
+
 	rcu_read_unlock();
+
 	return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,253 @@ static void veth_set_multicast_list(struct net_device *dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+				      int buflen)
+{
+	struct sk_buff *skb;
+
+	if (!buflen) {
+		buflen = SKB_DATA_ALIGN(headroom + len) +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	}
+	skb = build_skb(head, buflen);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, headroom);
+	skb_put(skb, len);
+
+	return skb;
+}
+
+static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
+					struct sk_buff *skb)
+{
+	u32 pktlen, headroom, act, metalen;
+	void *orig_data, *orig_data_end;
+	int size, mac_len, delta, off;
+	struct bpf_prog *xdp_prog;
+	struct xdp_buff xdp;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (unlikely(!xdp_prog)) {
+		rcu_read_unlock();
+		goto out;
+	}
+
+	mac_len = skb->data - skb_mac_header(skb);
+	pktlen = skb->len + mac_len;
+	size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
+	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	if (size > PAGE_SIZE)
+		goto drop;
+
+	headroom = skb_headroom(skb) - mac_len;
+	if (skb_shared(skb) || skb_head_is_locked(skb) ||
+	    skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
+		struct sk_buff *nskb;
+		void *head, *start;
+		struct page *page;
+		int head_off;
+
+		page = alloc_page(GFP_ATOMIC);
+		if (!page)
+			goto drop;
+
+		head = page_address(page);
+		start = head + VETH_XDP_HEADROOM;
+		if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		nskb = veth_build_skb(head,
+				      VETH_XDP_HEADROOM + mac_len, skb->len,
+				      PAGE_SIZE);
+		if (!nskb) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		skb_copy_header(nskb, skb);
+		head_off = skb_headroom(nskb) - skb_headroom(skb);
+		skb_headers_offset_update(nskb, head_off);
+		if (skb->sk)
+			skb_set_owner_w(nskb, skb->sk);
+		consume_skb(skb);
+		skb = nskb;
+	}
+
+	xdp.data_hard_start = skb->head;
+	xdp.data = skb_mac_header(skb);
+	xdp.data_end = xdp.data + pktlen;
+	xdp.data_meta = xdp.data;
+	xdp.rxq = &priv->xdp_rxq;
+	orig_data = xdp.data;
+	orig_data_end = xdp.data_end;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	switch (act) {
+	case XDP_PASS:
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(priv->dev, xdp_prog, act);
+	case XDP_DROP:
+		goto drop;
+	}
+	rcu_read_unlock();
+
+	delta = orig_data - xdp.data;
+	off = mac_len + delta;
+	if (off > 0)
+		__skb_push(skb, off);
+	else if (off < 0)
+		__skb_pull(skb, -off);
+	skb->mac_header -= delta;
+	off = xdp.data_end - orig_data_end;
+	if (off != 0)
+		__skb_put(skb, off);
+	skb->protocol = eth_type_trans(skb, priv->dev);
+
+	metalen = xdp.data - xdp.data_meta;
+	if (metalen)
+		skb_metadata_set(skb, metalen);
+out:
+	return skb;
+drop:
+	rcu_read_unlock();
+	kfree_skb(skb);
+	return NULL;
+}
+
+static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+{
+	int i, done = 0;
+
+	for (i = 0; i < budget; i++) {
+		struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+
+		if (!skb)
+			break;
+
+		skb = veth_xdp_rcv_skb(priv, skb);
+
+		if (skb)
+			napi_gro_receive(&priv->xdp_napi, skb);
+
+		done++;
+	}
+
+	return done;
+}
+
+static int veth_poll(struct napi_struct *napi, int budget)
+{
+	struct veth_priv *priv =
+		container_of(napi, struct veth_priv, xdp_napi);
+	int done;
+
+	done = veth_xdp_rcv(priv, budget);
+
+	if (done < budget && napi_complete_done(napi, done)) {
+		/* Write rx_notify_masked before reading ptr_ring */
+		smp_store_mb(priv->rx_notify_masked, false);
+		if (unlikely(!__ptr_ring_empty(&priv->xdp_ring))) {
+			priv->rx_notify_masked = true;
+			napi_schedule(&priv->xdp_napi);
+		}
+	}
+
+	return done;
+}
+
+static int veth_napi_add(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int err;
+
+	err = ptr_ring_init(&priv->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
+	if (err)
+		return err;
+
+	netif_napi_add(dev, &priv->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
+	napi_enable(&priv->xdp_napi);
+
+	return 0;
+}
+
+static void veth_napi_del(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	napi_disable(&priv->xdp_napi);
+	netif_napi_del(&priv->xdp_napi);
+	priv->rx_notify_masked = false;
+	ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+}
+
+static int veth_enable_xdp(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int err;
+
+	if (!xdp_rxq_info_is_reg(&priv->xdp_rxq)) {
+		err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
+		if (err < 0)
+			return err;
+
+		err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0)
+			goto err;
+
+		err = veth_napi_add(dev);
+		if (err)
+			goto err;
+	}
+
+	rcu_assign_pointer(priv->xdp_prog, priv->_xdp_prog);
+
+	return 0;
+err:
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+
+	return err;
+}
+
+static void veth_disable_xdp(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	rcu_assign_pointer(priv->xdp_prog, NULL);
+	veth_napi_del(dev);
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
 	struct net_device *peer = rtnl_dereference(priv->peer);
+	int err;
 
 	if (!peer)
 		return -ENOTCONN;
 
+	if (priv->_xdp_prog) {
+		err = veth_enable_xdp(dev);
+		if (err)
+			return err;
+	}
+
 	if (peer->flags & IFF_UP) {
 		netif_carrier_on(dev);
 		netif_carrier_on(peer);
 	}
+
 	return 0;
 }
 
@@ -203,6 +489,9 @@ static int veth_close(struct net_device *dev)
 	if (peer)
 		netif_carrier_off(peer);
 
+	if (priv->_xdp_prog)
+		veth_disable_xdp(dev);
+
 	return 0;
 }
 
@@ -228,7 +517,7 @@ static void veth_dev_free(struct net_device *dev)
 static void veth_poll_controller(struct net_device *dev)
 {
 	/* veth only receives frames when its peer sends one
-	 * Since it's a synchronous operation, we are guaranteed
+	 * Since it has nothing to do with disabling irqs, we are guaranteed
 	 * never to have pending data when we poll for it so
 	 * there is nothing to do here.
 	 *
@@ -276,6 +565,72 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 	rcu_read_unlock();
 }
 
+static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
+			struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+	struct net_device *peer;
+	int err;
+
+	old_prog = priv->_xdp_prog;
+	priv->_xdp_prog = prog;
+	peer = rtnl_dereference(priv->peer);
+
+	if (prog) {
+		if (!peer) {
+			NL_SET_ERR_MSG_MOD(extack, "Cannot set XDP when peer is detached");
+			err = -ENOTCONN;
+			goto err;
+		}
+
+		if (dev->flags & IFF_UP) {
+			err = veth_enable_xdp(dev);
+			if (err) {
+				NL_SET_ERR_MSG_MOD(extack, "Setup for XDP failed");
+				goto err;
+			}
+		}
+	}
+
+	if (old_prog) {
+		if (!prog && dev->flags & IFF_UP)
+			veth_disable_xdp(dev);
+		bpf_prog_put(old_prog);
+	}
+
+	return 0;
+err:
+	priv->_xdp_prog = old_prog;
+
+	return err;
+}
+
+static u32 veth_xdp_query(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	const struct bpf_prog *xdp_prog;
+
+	xdp_prog = priv->_xdp_prog;
+	if (xdp_prog)
+		return xdp_prog->aux->id;
+
+	return 0;
+}
+
+static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return veth_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_QUERY_PROG:
+		xdp->prog_id = veth_xdp_query(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -290,6 +645,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_get_iflink		= veth_get_iflink,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
+	.ndo_bpf		= veth_xdp,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
@@ -451,10 +807,13 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 */
 
 	priv = netdev_priv(dev);
+	priv->dev = dev;
 	rcu_assign_pointer(priv->peer, peer);
 
 	priv = netdev_priv(peer);
+	priv->dev = peer;
 	rcu_assign_pointer(priv->peer, dev);
+
 	return 0;
 
 err_register_dev:
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled
  2018-07-22 15:13 [PATCH v3 bpf-next 0/8] veth: Driver XDP Toshiaki Makita
  2018-07-22 15:13 ` [PATCH v3 bpf-next 1/8] net: Export skb_headers_offset_update Toshiaki Makita
  2018-07-22 15:13 ` [PATCH v3 bpf-next 2/8] veth: Add driver XDP Toshiaki Makita
@ 2018-07-22 15:13 ` Toshiaki Makita
  2018-07-24  0:27   ` Jakub Kicinski
  2018-07-22 15:13 ` [PATCH v3 bpf-next 4/8] veth: Handle xdp_frames in xdp napi ring Toshiaki Makita
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-22 15:13 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, Jesper Dangaard Brouer

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

All oversized packets including GSO packets are dropped if XDP is
enabled on receiver side, so don't send such packets from peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 78fa08cb6e24..f5b72e937d9d 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
 	return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+					   netdev_features_t features)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
+
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		struct veth_priv *peer_priv = netdev_priv(peer);
+
+		if (peer_priv->_xdp_prog)
+			features &= ~NETIF_F_GSO_SOFTWARE;
+	}
+
+	return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
 	struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -591,14 +608,33 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 				goto err;
 			}
 		}
+
+		if (!old_prog) {
+			peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+			peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+				peer->hard_header_len -
+				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+			if (peer->mtu > peer->max_mtu)
+				dev_set_mtu(peer, peer->max_mtu);
+		}
 	}
 
 	if (old_prog) {
-		if (!prog && dev->flags & IFF_UP)
-			veth_disable_xdp(dev);
+		if (!prog) {
+			if (dev->flags & IFF_UP)
+				veth_disable_xdp(dev);
+
+			if (peer) {
+				peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+				peer->max_mtu = ETH_MAX_MTU;
+			}
+		}
 		bpf_prog_put(old_prog);
 	}
 
+	if ((!!old_prog ^ !!prog) && peer)
+		netdev_update_features(peer);
+
 	return 0;
 err:
 	priv->_xdp_prog = old_prog;
@@ -643,6 +679,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_poll_controller	= veth_poll_controller,
 #endif
 	.ndo_get_iflink		= veth_get_iflink,
+	.ndo_fix_features	= veth_fix_features,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 4/8] veth: Handle xdp_frames in xdp napi ring
  2018-07-22 15:13 [PATCH v3 bpf-next 0/8] veth: Driver XDP Toshiaki Makita
                   ` (2 preceding siblings ...)
  2018-07-22 15:13 ` [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
@ 2018-07-22 15:13 ` Toshiaki Makita
  2018-07-22 15:13 ` [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit Toshiaki Makita
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-22 15:13 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, Jesper Dangaard Brouer

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This is preparation for XDP TX and ndo_xdp_xmit.
This allows napi handler to handle xdp_frames through xdp ring as well
as sk_buff.

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame feilds in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f5b72e937d9d..4be75c58bc6a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
-#include <linux/skb_array.h>
 #include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_XDP_FLAG		BIT(0)
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static const struct ethtool_ops veth_ethtool_ops = {
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+	if (veth_is_xdp_frame(ptr))
+		xdp_return_frame(veth_ptr_to_xdp(ptr));
+	else
+		kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
 	/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,61 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+					struct xdp_frame *frame)
+{
+	int len = frame->len, delta = 0;
+	struct bpf_prog *xdp_prog;
+	unsigned int headroom;
+	struct sk_buff *skb;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (likely(xdp_prog)) {
+		struct xdp_buff xdp;
+		u32 act;
+
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = frame->data + frame->len;
+		xdp.data_meta = frame->data - frame->metasize;
+		xdp.rxq = &priv->xdp_rxq;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+		switch (act) {
+		case XDP_PASS:
+			delta = frame->data - xdp.data;
+			len = xdp.data_end - xdp.data;
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(act);
+		case XDP_ABORTED:
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+		case XDP_DROP:
+			goto err_xdp;
+		}
+	}
+	rcu_read_unlock();
+
+	headroom = frame->data - delta - (void *)frame;
+	skb = veth_build_skb(frame, headroom, len, 0);
+	if (!skb) {
+		xdp_return_frame(frame);
+		goto err;
+	}
+
+	memset(frame, 0, sizeof(*frame));
+	skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+	return skb;
+err_xdp:
+	rcu_read_unlock();
+	xdp_return_frame(frame);
+
+	return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 					struct sk_buff *skb)
 {
@@ -358,12 +431,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 	int i, done = 0;
 
 	for (i = 0; i < budget; i++) {
-		struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+		void *ptr = __ptr_ring_consume(&priv->xdp_ring);
+		struct sk_buff *skb;
 
-		if (!skb)
+		if (!ptr)
 			break;
 
-		skb = veth_xdp_rcv_skb(priv, skb);
+		if (veth_is_xdp_frame(ptr))
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+		else
+			skb = veth_xdp_rcv_skb(priv, ptr);
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
@@ -416,7 +493,7 @@ static void veth_napi_del(struct net_device *dev)
 	napi_disable(&priv->xdp_napi);
 	netif_napi_del(&priv->xdp_napi);
 	priv->rx_notify_masked = false;
-	ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
 static int veth_enable_xdp(struct net_device *dev)
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit
  2018-07-22 15:13 [PATCH v3 bpf-next 0/8] veth: Driver XDP Toshiaki Makita
                   ` (3 preceding siblings ...)
  2018-07-22 15:13 ` [PATCH v3 bpf-next 4/8] veth: Handle xdp_frames in xdp napi ring Toshiaki Makita
@ 2018-07-22 15:13 ` Toshiaki Makita
  2018-07-24  0:19   ` kbuild test robot
                     ` (2 more replies)
  2018-07-22 15:13 ` [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info Toshiaki Makita
                   ` (2 subsequent siblings)
  7 siblings, 3 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-22 15:13 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, Jesper Dangaard Brouer

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought as calling another XDP program by XDP program using
REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 4be75c58bc6a..57187e955fea 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include <net/rtnetlink.h>
 #include <net/dst.h>
 #include <net/xfrm.h>
+#include <net/xdp.h>
 #include <linux/veth.h>
 #include <linux/module.h>
 #include <linux/bpf.h>
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
 	if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,44 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+			 struct xdp_frame **frames, u32 flags)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct net_device *rcv;
+	int i, drops = 0;
+
+	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+		return -EINVAL;
+
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		return -ENXIO;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (!rcu_access_pointer(rcv_priv->xdp_prog))
+		return -ENXIO;
+
+	spin_lock(&rcv_priv->xdp_ring.producer_lock);
+	for (i = 0; i < n; i++) {
+		struct xdp_frame *frame = frames[i];
+		void *ptr = veth_xdp_to_ptr(frame);
+
+		if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
+			     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+			xdp_return_frame_rx_napi(frame);
+			drops++;
+		}
+	}
+	spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+	if (flags & XDP_XMIT_FLUSH)
+		__veth_xdp_flush(rcv_priv);
+
+	return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 					struct xdp_frame *frame)
 {
@@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
+	.ndo_xdp_xmit		= veth_xdp_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info
  2018-07-22 15:13 [PATCH v3 bpf-next 0/8] veth: Driver XDP Toshiaki Makita
                   ` (4 preceding siblings ...)
  2018-07-22 15:13 ` [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit Toshiaki Makita
@ 2018-07-22 15:13 ` Toshiaki Makita
  2018-07-24  1:22   ` Jakub Kicinski
  2018-07-22 15:13 ` [PATCH v3 bpf-next 7/8] veth: Add XDP TX and REDIRECT Toshiaki Makita
  2018-07-22 15:13 ` [PATCH v3 bpf-next 8/8] veth: Support per queue XDP ring Toshiaki Makita
  7 siblings, 1 reply; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-22 15:13 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, Jesper Dangaard Brouer

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support of XDP_REDIRECT, it will redirects packets which
are redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
not called directly from the rxq which owns the xdp_mem_info.

This approach introduces a flag in xdp_mem_info to indicate that
napi_direct should be disabled even when _rx_napi variant is used.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/net/xdp.h | 4 ++++
 net/core/xdp.c    | 6 ++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index fcb033f51d8c..1d1bc6553ff2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -41,6 +41,9 @@ enum xdp_mem_type {
 	MEM_TYPE_MAX,
 };
 
+/* XDP flags for xdp_mem_info */
+#define XDP_MEM_RF_NO_DIRECT	BIT(0)	/* don't use napi_direct */
+
 /* XDP flags for ndo_xdp_xmit */
 #define XDP_XMIT_FLUSH		(1U << 0)	/* doorbell signal consumer */
 #define XDP_XMIT_FLAGS_MASK	XDP_XMIT_FLUSH
@@ -48,6 +51,7 @@ enum xdp_mem_type {
 struct xdp_mem_info {
 	u32 type; /* enum xdp_mem_type, but known size type */
 	u32 id;
+	u32 flags;
 };
 
 struct page_pool;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 57285383ed00..1426c608fd75 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
 		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
 		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
 		page = virt_to_head_page(data);
-		if (xa)
+		if (xa) {
+			napi_direct &= !(mem->flags & XDP_MEM_RF_NO_DIRECT);
 			page_pool_put_page(xa->page_pool, page, napi_direct);
-		else
+		} else {
 			put_page(page);
+		}
 		rcu_read_unlock();
 		break;
 	case MEM_TYPE_PAGE_SHARED:
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 7/8] veth: Add XDP TX and REDIRECT
  2018-07-22 15:13 [PATCH v3 bpf-next 0/8] veth: Driver XDP Toshiaki Makita
                   ` (5 preceding siblings ...)
  2018-07-22 15:13 ` [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info Toshiaki Makita
@ 2018-07-22 15:13 ` Toshiaki Makita
  2018-07-22 15:13 ` [PATCH v3 bpf-next 8/8] veth: Support per queue XDP ring Toshiaki Makita
  7 siblings, 0 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-22 15:13 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, Jesper Dangaard Brouer

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)          (XDP)         (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by XDP_MEM_RF_NO_DIRECT flag.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 57187e955fea..0323a4ca74e2 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TX		BIT(0)
+#define VETH_XDP_REDIR		BIT(1)
+
 struct pcpu_vstats {
 	u64			packets;
 	u64			bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
 	struct bpf_prog		*_xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
+	struct xdp_mem_info	xdp_mem;
 	unsigned		requested_headroom;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
@@ -311,10 +316,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct net_device *rcv;
+
+	rcu_read_lock();
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		goto out;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+		goto out;
+
+	__veth_xdp_flush(rcv_priv);
+out:
+	rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-					struct xdp_frame *frame)
+					struct xdp_frame *frame,
+					unsigned int *xdp_xmit)
 {
 	int len = frame->len, delta = 0;
+	struct xdp_frame orig_frame;
 	struct bpf_prog *xdp_prog;
 	unsigned int headroom;
 	struct sk_buff *skb;
@@ -338,6 +375,31 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 			delta = frame->data - xdp.data;
 			len = xdp.data_end - xdp.data;
 			break;
+		case XDP_TX:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
+			if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+				trace_xdp_exception(priv->dev, xdp_prog, act);
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_xmit |= VETH_XDP_TX;
+			rcu_read_unlock();
+			goto xdp_xmit;
+		case XDP_REDIRECT:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
+			if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_xmit |= VETH_XDP_REDIR;
+			rcu_read_unlock();
+			goto xdp_xmit;
 		default:
 			bpf_warn_invalid_xdp_action(act);
 		case XDP_ABORTED:
@@ -362,12 +424,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 err_xdp:
 	rcu_read_unlock();
 	xdp_return_frame(frame);
-
+xdp_xmit:
 	return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-					struct sk_buff *skb)
+					struct sk_buff *skb,
+					unsigned int *xdp_xmit)
 {
 	u32 pktlen, headroom, act, metalen;
 	void *orig_data, *orig_data_end;
@@ -438,6 +501,26 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	switch (act) {
 	case XDP_PASS:
 		break;
+	case XDP_TX:
+		get_page(virt_to_page(xdp.data));
+		consume_skb(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+			goto err_xdp;
+		}
+		*xdp_xmit |= VETH_XDP_TX;
+		rcu_read_unlock();
+		goto xdp_xmit;
+	case XDP_REDIRECT:
+		get_page(virt_to_page(xdp.data));
+		consume_skb(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (xdp_do_redirect(priv->dev, &xdp, xdp_prog))
+			goto err_xdp;
+		*xdp_xmit |= VETH_XDP_REDIR;
+		rcu_read_unlock();
+		goto xdp_xmit;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
@@ -468,9 +551,15 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	rcu_read_unlock();
 	kfree_skb(skb);
 	return NULL;
+err_xdp:
+	rcu_read_unlock();
+	page_frag_free(xdp.data);
+xdp_xmit:
+	return NULL;
 }
 
-static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+static int veth_xdp_rcv(struct veth_priv *priv, int budget,
+			unsigned int *xdp_xmit)
 {
 	int i, done = 0;
 
@@ -481,10 +570,12 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 		if (!ptr)
 			break;
 
-		if (veth_is_xdp_frame(ptr))
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
-		else
-			skb = veth_xdp_rcv_skb(priv, ptr);
+		if (veth_is_xdp_frame(ptr)) {
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
+					       xdp_xmit);
+		} else {
+			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit);
+		}
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
@@ -499,9 +590,10 @@ static int veth_poll(struct napi_struct *napi, int budget)
 {
 	struct veth_priv *priv =
 		container_of(napi, struct veth_priv, xdp_napi);
+	unsigned int xdp_xmit = 0;
 	int done;
 
-	done = veth_xdp_rcv(priv, budget);
+	done = veth_xdp_rcv(priv, budget, &xdp_xmit);
 
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
@@ -512,6 +604,11 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		}
 	}
 
+	if (xdp_xmit & VETH_XDP_TX)
+		veth_xdp_flush(priv->dev);
+	if (xdp_xmit & VETH_XDP_REDIR)
+		xdp_do_flush_map();
+
 	return done;
 }
 
@@ -558,6 +655,9 @@ static int veth_enable_xdp(struct net_device *dev)
 		err = veth_napi_add(dev);
 		if (err)
 			goto err;
+
+		/* Save original mem info as it can be overwritten */
+		priv->xdp_mem = priv->xdp_rxq.mem;
 	}
 
 	rcu_assign_pointer(priv->xdp_prog, priv->_xdp_prog);
@@ -575,6 +675,7 @@ static void veth_disable_xdp(struct net_device *dev)
 
 	rcu_assign_pointer(priv->xdp_prog, NULL);
 	veth_napi_del(dev);
+	priv->xdp_rxq.mem = priv->xdp_mem;
 	xdp_rxq_info_unreg(&priv->xdp_rxq);
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH v3 bpf-next 8/8] veth: Support per queue XDP ring
  2018-07-22 15:13 [PATCH v3 bpf-next 0/8] veth: Driver XDP Toshiaki Makita
                   ` (6 preceding siblings ...)
  2018-07-22 15:13 ` [PATCH v3 bpf-next 7/8] veth: Add XDP TX and REDIRECT Toshiaki Makita
@ 2018-07-22 15:13 ` Toshiaki Makita
  7 siblings, 0 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-22 15:13 UTC (permalink / raw)
  To: netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, Jesper Dangaard Brouer

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Move XDP and napi related fields in veth_priv to newly created veth_rq
structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
selected by current cpu.

When skbs are enqueued from the peer device, rxq is one to one mapping
of its peer txq. This way we have a restriction that the number of rxqs
must not less than the number of peer txqs, but leave the possibility to
achieve bulk skb xmit in the future because txq lock would make it
possible to remove rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 278 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0323a4ca74e2..84482d9901ec 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
 	struct u64_stats_sync	syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
 	struct napi_struct	xdp_napi;
 	struct net_device	*dev;
 	struct bpf_prog __rcu	*xdp_prog;
-	struct bpf_prog		*_xdp_prog;
-	struct net_device __rcu	*peer;
-	atomic64_t		dropped;
 	struct xdp_mem_info	xdp_mem;
-	unsigned		requested_headroom;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
 };
 
+struct veth_priv {
+	struct net_device __rcu	*peer;
+	atomic64_t		dropped;
+	struct bpf_prog		*_xdp_prog;
+	struct veth_rq		*rq;
+	unsigned int		requested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
 		kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
 	/* Write ptr_ring before reading rx_notify_masked */
 	smp_mb();
-	if (!priv->rx_notify_masked) {
-		priv->rx_notify_masked = true;
-		napi_schedule(&priv->xdp_napi);
+	if (!rq->rx_notify_masked) {
+		rq->rx_notify_masked = true;
+		napi_schedule(&rq->xdp_napi);
 	}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-	if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
 		dev_kfree_skb_any(skb);
 		return NET_RX_DROP;
 	}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
 	return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+			    struct veth_rq *rq, bool xdp)
 {
-	struct veth_priv *priv = netdev_priv(dev);
-
 	return __dev_forward_skb(dev, skb) ?: xdp ?
-		veth_xdp_rx(priv, skb) :
+		veth_xdp_rx(rq, skb) :
 		netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct veth_rq *rq = NULL;
 	struct net_device *rcv;
 	int length = skb->len;
 	bool rcv_xdp = false;
+	int rxq;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	rcv_priv = netdev_priv(rcv);
-	rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+	rxq = skb_get_queue_mapping(skb);
+	if (rxq < rcv->real_num_rx_queues) {
+		rq = &rcv_priv->rq[rxq];
+		rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+		if (rcv_xdp)
+			skb_record_rx_queue(skb, rxq);
+	}
 
-	if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+	if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	if (rcv_xdp)
-		__veth_xdp_flush(rcv_priv);
+		__veth_xdp_flush(rq);
 
 	rcu_read_unlock();
 
@@ -278,11 +289,17 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+	return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
 			 struct xdp_frame **frames, u32 flags)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
+	struct veth_rq *rq;
 	int i, drops = 0;
 
 	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
@@ -293,25 +310,26 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 		return -ENXIO;
 
 	rcv_priv = netdev_priv(rcv);
+	rq = &rcv_priv->rq[veth_select_rxq(rcv)];
 	/* xdp_ring is initialized on receive side? */
-	if (!rcu_access_pointer(rcv_priv->xdp_prog))
+	if (!rcu_access_pointer(rq->xdp_prog))
 		return -ENXIO;
 
-	spin_lock(&rcv_priv->xdp_ring.producer_lock);
+	spin_lock(&rq->xdp_ring.producer_lock);
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *frame = frames[i];
 		void *ptr = veth_xdp_to_ptr(frame);
 
 		if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
-			     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+			     __ptr_ring_produce(&rq->xdp_ring, ptr))) {
 			xdp_return_frame_rx_napi(frame);
 			drops++;
 		}
 	}
-	spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+	spin_unlock(&rq->xdp_ring.producer_lock);
 
 	if (flags & XDP_XMIT_FLUSH)
-		__veth_xdp_flush(rcv_priv);
+		__veth_xdp_flush(rq);
 
 	return n - drops;
 }
@@ -320,6 +338,7 @@ static void veth_xdp_flush(struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
+	struct veth_rq *rq;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -327,11 +346,12 @@ static void veth_xdp_flush(struct net_device *dev)
 		goto out;
 
 	rcv_priv = netdev_priv(rcv);
+	rq = &rcv_priv->rq[veth_select_rxq(rcv)];
 	/* xdp_ring is initialized on receive side? */
-	if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+	if (unlikely(!rcu_access_pointer(rq->xdp_prog)))
 		goto out;
 
-	__veth_xdp_flush(rcv_priv);
+	__veth_xdp_flush(rq);
 out:
 	rcu_read_unlock();
 }
@@ -346,7 +366,7 @@ static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
 	return veth_xdp_xmit(dev, 1, &frame, 0);
 }
 
-static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
 					struct xdp_frame *frame,
 					unsigned int *xdp_xmit)
 {
@@ -357,7 +377,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 	struct sk_buff *skb;
 
 	rcu_read_lock();
-	xdp_prog = rcu_dereference(priv->xdp_prog);
+	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (likely(xdp_prog)) {
 		struct xdp_buff xdp;
 		u32 act;
@@ -366,7 +386,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 		xdp.data = frame->data;
 		xdp.data_end = frame->data + frame->len;
 		xdp.data_meta = frame->data - frame->metasize;
-		xdp.rxq = &priv->xdp_rxq;
+		xdp.rxq = &rq->xdp_rxq;
 
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
@@ -380,8 +400,8 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 			xdp.data_hard_start = frame;
 			xdp.rxq->mem = frame->mem;
 			xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
-			if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
-				trace_xdp_exception(priv->dev, xdp_prog, act);
+			if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) {
+				trace_xdp_exception(rq->dev, xdp_prog, act);
 				frame = &orig_frame;
 				goto err_xdp;
 			}
@@ -393,7 +413,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 			xdp.data_hard_start = frame;
 			xdp.rxq->mem = frame->mem;
 			xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
-			if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+			if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
 				frame = &orig_frame;
 				goto err_xdp;
 			}
@@ -403,7 +423,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 		default:
 			bpf_warn_invalid_xdp_action(act);
 		case XDP_ABORTED:
-			trace_xdp_exception(priv->dev, xdp_prog, act);
+			trace_xdp_exception(rq->dev, xdp_prog, act);
 		case XDP_DROP:
 			goto err_xdp;
 		}
@@ -418,7 +438,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 	}
 
 	memset(frame, 0, sizeof(*frame));
-	skb->protocol = eth_type_trans(skb, priv->dev);
+	skb->protocol = eth_type_trans(skb, rq->dev);
 err:
 	return skb;
 err_xdp:
@@ -428,8 +448,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 	return NULL;
 }
 
-static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-					struct sk_buff *skb,
+static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
 					unsigned int *xdp_xmit)
 {
 	u32 pktlen, headroom, act, metalen;
@@ -439,7 +458,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	struct xdp_buff xdp;
 
 	rcu_read_lock();
-	xdp_prog = rcu_dereference(priv->xdp_prog);
+	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (unlikely(!xdp_prog)) {
 		rcu_read_unlock();
 		goto out;
@@ -492,7 +511,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	xdp.data = skb_mac_header(skb);
 	xdp.data_end = xdp.data + pktlen;
 	xdp.data_meta = xdp.data;
-	xdp.rxq = &priv->xdp_rxq;
+	xdp.rxq = &rq->xdp_rxq;
 	orig_data = xdp.data;
 	orig_data_end = xdp.data_end;
 
@@ -504,9 +523,9 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	case XDP_TX:
 		get_page(virt_to_page(xdp.data));
 		consume_skb(skb);
-		xdp.rxq->mem = priv->xdp_mem;
-		if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
-			trace_xdp_exception(priv->dev, xdp_prog, act);
+		xdp.rxq->mem = rq->xdp_mem;
+		if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) {
+			trace_xdp_exception(rq->dev, xdp_prog, act);
 			goto err_xdp;
 		}
 		*xdp_xmit |= VETH_XDP_TX;
@@ -515,8 +534,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	case XDP_REDIRECT:
 		get_page(virt_to_page(xdp.data));
 		consume_skb(skb);
-		xdp.rxq->mem = priv->xdp_mem;
-		if (xdp_do_redirect(priv->dev, &xdp, xdp_prog))
+		xdp.rxq->mem = rq->xdp_mem;
+		if (xdp_do_redirect(rq->dev, &xdp, xdp_prog))
 			goto err_xdp;
 		*xdp_xmit |= VETH_XDP_REDIR;
 		rcu_read_unlock();
@@ -524,7 +543,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
-		trace_xdp_exception(priv->dev, xdp_prog, act);
+		trace_xdp_exception(rq->dev, xdp_prog, act);
 	case XDP_DROP:
 		goto drop;
 	}
@@ -540,7 +559,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	off = xdp.data_end - orig_data_end;
 	if (off != 0)
 		__skb_put(skb, off);
-	skb->protocol = eth_type_trans(skb, priv->dev);
+	skb->protocol = eth_type_trans(skb, rq->dev);
 
 	metalen = xdp.data - xdp.data_meta;
 	if (metalen)
@@ -558,27 +577,26 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	return NULL;
 }
 
-static int veth_xdp_rcv(struct veth_priv *priv, int budget,
-			unsigned int *xdp_xmit)
+static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
 {
 	int i, done = 0;
 
 	for (i = 0; i < budget; i++) {
-		void *ptr = __ptr_ring_consume(&priv->xdp_ring);
+		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
 		struct sk_buff *skb;
 
 		if (!ptr)
 			break;
 
 		if (veth_is_xdp_frame(ptr)) {
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
+			skb = veth_xdp_rcv_one(rq, veth_ptr_to_xdp(ptr),
 					       xdp_xmit);
 		} else {
-			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit);
+			skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit);
 		}
 
 		if (skb)
-			napi_gro_receive(&priv->xdp_napi, skb);
+			napi_gro_receive(&rq->xdp_napi, skb);
 
 		done++;
 	}
@@ -588,24 +606,24 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget,
 
 static int veth_poll(struct napi_struct *napi, int budget)
 {
-	struct veth_priv *priv =
-		container_of(napi, struct veth_priv, xdp_napi);
+	struct veth_rq *rq =
+		container_of(napi, struct veth_rq, xdp_napi);
 	unsigned int xdp_xmit = 0;
 	int done;
 
-	done = veth_xdp_rcv(priv, budget, &xdp_xmit);
+	done = veth_xdp_rcv(rq, budget, &xdp_xmit);
 
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
-		smp_store_mb(priv->rx_notify_masked, false);
-		if (unlikely(!__ptr_ring_empty(&priv->xdp_ring))) {
-			priv->rx_notify_masked = true;
-			napi_schedule(&priv->xdp_napi);
+		smp_store_mb(rq->rx_notify_masked, false);
+		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
+			rq->rx_notify_masked = true;
+			napi_schedule(&rq->xdp_napi);
 		}
 	}
 
 	if (xdp_xmit & VETH_XDP_TX)
-		veth_xdp_flush(priv->dev);
+		veth_xdp_flush(rq->dev);
 	if (xdp_xmit & VETH_XDP_REDIR)
 		xdp_do_flush_map();
 
@@ -615,56 +633,90 @@ static int veth_poll(struct napi_struct *napi, int budget)
 static int veth_napi_add(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
-	int err;
+	int err, i;
 
-	err = ptr_ring_init(&priv->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
-	if (err)
-		return err;
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		err = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
+		if (err)
+			goto err_xdp_ring;
+	}
 
-	netif_napi_add(dev, &priv->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
-	napi_enable(&priv->xdp_napi);
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		netif_napi_add(dev, &rq->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
+		napi_enable(&rq->xdp_napi);
+	}
 
 	return 0;
+err_xdp_ring:
+	for (i--; i >= 0; i--)
+		ptr_ring_cleanup(&priv->rq[i].xdp_ring, veth_ptr_free);
+
+	return err;
 }
 
 static void veth_napi_del(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+	int i;
 
-	napi_disable(&priv->xdp_napi);
-	netif_napi_del(&priv->xdp_napi);
-	priv->rx_notify_masked = false;
-	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		napi_disable(&rq->xdp_napi);
+		napi_hash_del(&rq->xdp_napi);
+	}
+	synchronize_net();
+
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		netif_napi_del(&rq->xdp_napi);
+		rq->rx_notify_masked = false;
+		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
+	}
 }
 
 static int veth_enable_xdp(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
-	int err;
+	int err, i;
 
-	if (!xdp_rxq_info_is_reg(&priv->xdp_rxq)) {
-		err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
-		if (err < 0)
-			return err;
+	if (!xdp_rxq_info_is_reg(&priv->rq[0].xdp_rxq)) {
+		for (i = 0; i < dev->real_num_rx_queues; i++) {
+			struct veth_rq *rq = &priv->rq[i];
 
-		err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
-						 MEM_TYPE_PAGE_SHARED, NULL);
-		if (err < 0)
-			goto err;
+			err = xdp_rxq_info_reg(&rq->xdp_rxq, dev, i);
+			if (err < 0)
+				goto err_rxq_reg;
+
+			err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+							 MEM_TYPE_PAGE_SHARED,
+							 NULL);
+			if (err < 0)
+				goto err_reg_mem;
+
+			/* Save original mem info as it can be overwritten */
+			rq->xdp_mem = rq->xdp_rxq.mem;
+		}
 
 		err = veth_napi_add(dev);
 		if (err)
-			goto err;
-
-		/* Save original mem info as it can be overwritten */
-		priv->xdp_mem = priv->xdp_rxq.mem;
+			goto err_rxq_reg;
 	}
 
-	rcu_assign_pointer(priv->xdp_prog, priv->_xdp_prog);
+	for (i = 0; i < dev->real_num_rx_queues; i++)
+		rcu_assign_pointer(priv->rq[i].xdp_prog, priv->_xdp_prog);
 
 	return 0;
-err:
-	xdp_rxq_info_unreg(&priv->xdp_rxq);
+err_reg_mem:
+	xdp_rxq_info_unreg(&priv->rq[i].xdp_rxq);
+err_rxq_reg:
+	for (i--; i >= 0; i--)
+		xdp_rxq_info_unreg(&priv->rq[i].xdp_rxq);
 
 	return err;
 }
@@ -672,11 +724,17 @@ static int veth_enable_xdp(struct net_device *dev)
 static void veth_disable_xdp(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+	int i;
 
-	rcu_assign_pointer(priv->xdp_prog, NULL);
+	for (i = 0; i < dev->real_num_rx_queues; i++)
+		rcu_assign_pointer(priv->rq[i].xdp_prog, NULL);
 	veth_napi_del(dev);
-	priv->xdp_rxq.mem = priv->xdp_mem;
-	xdp_rxq_info_unreg(&priv->xdp_rxq);
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		rq->xdp_rxq.mem = rq->xdp_mem;
+		xdp_rxq_info_unreg(&rq->xdp_rxq);
+	}
 }
 
 static int veth_open(struct net_device *dev)
@@ -823,6 +881,12 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 			goto err;
 		}
 
+		if (dev->real_num_rx_queues < peer->real_num_tx_queues) {
+			NL_SET_ERR_MSG_MOD(extack, "XDP expects number of rx queues not less than peer tx queues");
+			err = -ENOSPC;
+			goto err;
+		}
+
 		if (dev->flags & IFF_UP) {
 			err = veth_enable_xdp(dev);
 			if (err) {
@@ -961,13 +1025,31 @@ static int veth_validate(struct nlattr *tb[], struct nlattr *data[],
 	return 0;
 }
 
+static int veth_alloc_queues(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
+	if (!priv->rq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void veth_free_queues(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	kfree(priv->rq);
+}
+
 static struct rtnl_link_ops veth_link_ops;
 
 static int veth_newlink(struct net *src_net, struct net_device *dev,
 			struct nlattr *tb[], struct nlattr *data[],
 			struct netlink_ext_ack *extack)
 {
-	int err;
+	int err, i;
 	struct net_device *peer;
 	struct veth_priv *priv;
 	char ifname[IFNAMSIZ];
@@ -1020,6 +1102,12 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 		return PTR_ERR(peer);
 	}
 
+	err = veth_alloc_queues(peer);
+	if (err) {
+		put_net(net);
+		goto err_peer_alloc_queues;
+	}
+
 	if (!ifmp || !tbp[IFLA_ADDRESS])
 		eth_hw_addr_random(peer);
 
@@ -1048,6 +1136,10 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 * should be re-allocated
 	 */
 
+	err = veth_alloc_queues(dev);
+	if (err)
+		goto err_alloc_queues;
+
 	if (tb[IFLA_ADDRESS] == NULL)
 		eth_hw_addr_random(dev);
 
@@ -1067,22 +1159,28 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 */
 
 	priv = netdev_priv(dev);
-	priv->dev = dev;
+	for (i = 0; i < dev->real_num_rx_queues; i++)
+		priv->rq[i].dev = dev;
 	rcu_assign_pointer(priv->peer, peer);
 
 	priv = netdev_priv(peer);
-	priv->dev = peer;
+	for (i = 0; i < peer->real_num_rx_queues; i++)
+		priv->rq[i].dev = peer;
 	rcu_assign_pointer(priv->peer, dev);
 
 	return 0;
 
 err_register_dev:
+	veth_free_queues(dev);
+err_alloc_queues:
 	/* nothing to do */
 err_configure_peer:
 	unregister_netdevice(peer);
 	return err;
 
 err_register_peer:
+	veth_free_queues(peer);
+err_peer_alloc_queues:
 	free_netdev(peer);
 	return err;
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit
  2018-07-22 15:13 ` [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit Toshiaki Makita
@ 2018-07-24  0:19   ` kbuild test robot
  2018-07-24  1:59     ` Toshiaki Makita
  2018-07-24  0:33   ` kbuild test robot
  2018-07-24  1:02   ` Jakub Kicinski
  2 siblings, 1 reply; 27+ messages in thread
From: kbuild test robot @ 2018-07-24  0:19 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: kbuild-all, netdev, Alexei Starovoitov, Daniel Borkmann,
	Toshiaki Makita, Jesper Dangaard Brouer

[-- Attachment #1: Type: text/plain, Size: 2659 bytes --]

Hi Toshiaki,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Toshiaki-Makita/veth-Driver-XDP/20180724-065517
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: i386-randconfig-x001-201829 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from include/linux/kernel.h:10:0,
                    from include/linux/list.h:9,
                    from include/linux/timer.h:5,
                    from include/linux/netdevice.h:28,
                    from drivers//net/veth.c:11:
   drivers//net/veth.c: In function 'veth_xdp_xmit':
>> drivers//net/veth.c:300:16: error: implicit declaration of function 'xdp_ok_fwd_dev' [-Werror=implicit-function-declaration]
      if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
                   ^
   include/linux/compiler.h:77:42: note: in definition of macro 'unlikely'
    # define unlikely(x) __builtin_expect(!!(x), 0)
                                             ^
   cc1: some warnings being treated as errors

vim +/xdp_ok_fwd_dev +300 drivers//net/veth.c

   275	
   276	static int veth_xdp_xmit(struct net_device *dev, int n,
   277				 struct xdp_frame **frames, u32 flags)
   278	{
   279		struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
   280		struct net_device *rcv;
   281		int i, drops = 0;
   282	
   283		if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
   284			return -EINVAL;
   285	
   286		rcv = rcu_dereference(priv->peer);
   287		if (unlikely(!rcv))
   288			return -ENXIO;
   289	
   290		rcv_priv = netdev_priv(rcv);
   291		/* xdp_ring is initialized on receive side? */
   292		if (!rcu_access_pointer(rcv_priv->xdp_prog))
   293			return -ENXIO;
   294	
   295		spin_lock(&rcv_priv->xdp_ring.producer_lock);
   296		for (i = 0; i < n; i++) {
   297			struct xdp_frame *frame = frames[i];
   298			void *ptr = veth_xdp_to_ptr(frame);
   299	
 > 300			if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
   301				     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
   302				xdp_return_frame_rx_napi(frame);
   303				drops++;
   304			}
   305		}
   306		spin_unlock(&rcv_priv->xdp_ring.producer_lock);
   307	
   308		if (flags & XDP_XMIT_FLUSH)
   309			__veth_xdp_flush(rcv_priv);
   310	
   311		return n - drops;
   312	}
   313	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34669 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 2/8] veth: Add driver XDP
  2018-07-22 15:13 ` [PATCH v3 bpf-next 2/8] veth: Add driver XDP Toshiaki Makita
@ 2018-07-24  0:23   ` Jakub Kicinski
  2018-07-24  1:47     ` Toshiaki Makita
  0 siblings, 1 reply; 27+ messages in thread
From: Jakub Kicinski @ 2018-07-24  0:23 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Toshiaki Makita,
	Jesper Dangaard Brouer

On Mon, 23 Jul 2018 00:13:02 +0900, Toshiaki Makita wrote:
> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> This is the basic implementation of veth driver XDP.
> 
> Incoming packets are sent from the peer veth device in the form of skb,
> so this is generally doing the same thing as generic XDP.
> 
> This itself is not so useful, but a starting point to implement other
> useful veth XDP features like TX and REDIRECT.
> 
> This introduces NAPI when XDP is enabled, because XDP is now heavily
> relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
> enqueues packets to the ring and peer NAPI handler drains the ring.
> 
> Currently only one ring is allocated for each veth device, so it does
> not scale on multiqueue env. This can be resolved by allocating rings
> on the per-queue basis later.
> 
> Note that NAPI is not used but netif_rx is used when XDP is not loaded,
> so this does not change the default behaviour.
> 
> v3:
> - Fix race on closing the device.
> - Add extack messages in ndo_bpf.
> 
> v2:
> - Squashed with the patch adding NAPI.
> - Implement adjust_tail.
> - Don't acquire consumer lock because it is guarded by NAPI.
> - Make poll_controller noop since it is unnecessary.
> - Register rxq_info on enabling XDP rather than on opening the device.
> 
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

> +static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
> +					struct sk_buff *skb)
> +{
> +	u32 pktlen, headroom, act, metalen;
> +	void *orig_data, *orig_data_end;
> +	int size, mac_len, delta, off;
> +	struct bpf_prog *xdp_prog;
> +	struct xdp_buff xdp;
> +
> +	rcu_read_lock();
> +	xdp_prog = rcu_dereference(priv->xdp_prog);
> +	if (unlikely(!xdp_prog)) {
> +		rcu_read_unlock();
> +		goto out;
> +	}
> +
> +	mac_len = skb->data - skb_mac_header(skb);
> +	pktlen = skb->len + mac_len;
> +	size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
> +	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	if (size > PAGE_SIZE)
> +		goto drop;
> +
> +	headroom = skb_headroom(skb) - mac_len;
> +	if (skb_shared(skb) || skb_head_is_locked(skb) ||
> +	    skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
> +		struct sk_buff *nskb;
> +		void *head, *start;
> +		struct page *page;
> +		int head_off;
> +
> +		page = alloc_page(GFP_ATOMIC);
> +		if (!page)
> +			goto drop;
> +
> +		head = page_address(page);
> +		start = head + VETH_XDP_HEADROOM;
> +		if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
> +			page_frag_free(head);
> +			goto drop;
> +		}
> +
> +		nskb = veth_build_skb(head,
> +				      VETH_XDP_HEADROOM + mac_len, skb->len,
> +				      PAGE_SIZE);
> +		if (!nskb) {
> +			page_frag_free(head);
> +			goto drop;
> +		}

> +static int veth_enable_xdp(struct net_device *dev)
> +{
> +	struct veth_priv *priv = netdev_priv(dev);
> +	int err;
> +
> +	if (!xdp_rxq_info_is_reg(&priv->xdp_rxq)) {
> +		err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
> +		if (err < 0)
> +			return err;
> +
> +		err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
> +						 MEM_TYPE_PAGE_SHARED, NULL);

nit: doesn't matter much but looks like a mix of MEM_TYPE_PAGE_SHARED
     and MEM_TYPE_PAGE_ORDER0

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled
  2018-07-22 15:13 ` [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
@ 2018-07-24  0:27   ` Jakub Kicinski
  2018-07-24  1:56     ` Toshiaki Makita
  0 siblings, 1 reply; 27+ messages in thread
From: Jakub Kicinski @ 2018-07-24  0:27 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Toshiaki Makita,
	Jesper Dangaard Brouer

On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:
> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> All oversized packets including GSO packets are dropped if XDP is
> enabled on receiver side, so don't send such packets from peer.
> 
> Drop TSO and SCTP fragmentation features so that veth devices themselves
> segment packets with XDP enabled. Also cap MTU accordingly.
> 
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Is there any precedence for fixing up features and MTU like this?  Most
drivers just refuse to install the program if settings are incompatible.

> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 78fa08cb6e24..f5b72e937d9d 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
>  	return iflink;
>  }
>  
> +static netdev_features_t veth_fix_features(struct net_device *dev,
> +					   netdev_features_t features)
> +{
> +	struct veth_priv *priv = netdev_priv(dev);
> +	struct net_device *peer;
> +
> +	peer = rtnl_dereference(priv->peer);
> +	if (peer) {
> +		struct veth_priv *peer_priv = netdev_priv(peer);
> +
> +		if (peer_priv->_xdp_prog)
> +			features &= ~NETIF_F_GSO_SOFTWARE;
> +	}
> +
> +	return features;
> +}
> +
>  static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
>  {
>  	struct veth_priv *peer_priv, *priv = netdev_priv(dev);
> @@ -591,14 +608,33 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
>  				goto err;
>  			}
>  		}
> +
> +		if (!old_prog) {
> +			peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
> +			peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
> +				peer->hard_header_len -
> +				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +			if (peer->mtu > peer->max_mtu)
> +				dev_set_mtu(peer, peer->max_mtu);
> +		}
>  	}
>  
>  	if (old_prog) {
> -		if (!prog && dev->flags & IFF_UP)
> -			veth_disable_xdp(dev);
> +		if (!prog) {
> +			if (dev->flags & IFF_UP)
> +				veth_disable_xdp(dev);
> +
> +			if (peer) {
> +				peer->hw_features |= NETIF_F_GSO_SOFTWARE;
> +				peer->max_mtu = ETH_MAX_MTU;
> +			}
> +		}
>  		bpf_prog_put(old_prog);
>  	}
>  
> +	if ((!!old_prog ^ !!prog) && peer)
> +		netdev_update_features(peer);
> +
>  	return 0;
>  err:
>  	priv->_xdp_prog = old_prog;
> @@ -643,6 +679,7 @@ static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_poll_controller	= veth_poll_controller,
>  #endif
>  	.ndo_get_iflink		= veth_get_iflink,
> +	.ndo_fix_features	= veth_fix_features,
>  	.ndo_features_check	= passthru_features_check,
>  	.ndo_set_rx_headroom	= veth_set_rx_headroom,
>  	.ndo_bpf		= veth_xdp,

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit
  2018-07-22 15:13 ` [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit Toshiaki Makita
  2018-07-24  0:19   ` kbuild test robot
@ 2018-07-24  0:33   ` kbuild test robot
  2018-07-24  1:02   ` Jakub Kicinski
  2 siblings, 0 replies; 27+ messages in thread
From: kbuild test robot @ 2018-07-24  0:33 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: kbuild-all, netdev, Alexei Starovoitov, Daniel Borkmann,
	Toshiaki Makita, Jesper Dangaard Brouer

[-- Attachment #1: Type: text/plain, Size: 3127 bytes --]

Hi Toshiaki,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on bpf-next/master]

url:    https://github.com/0day-ci/linux/commits/Toshiaki-Makita/veth-Driver-XDP/20180724-065517
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
config: x86_64-randconfig-x010-201829 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All warnings (new ones prefixed by >>):

   In file included from include/linux/kernel.h:10:0,
                    from include/linux/list.h:9,
                    from include/linux/timer.h:5,
                    from include/linux/netdevice.h:28,
                    from drivers//net/veth.c:11:
   drivers//net/veth.c: In function 'veth_xdp_xmit':
   drivers//net/veth.c:300:16: error: implicit declaration of function 'xdp_ok_fwd_dev' [-Werror=implicit-function-declaration]
      if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
                   ^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
     if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
                                 ^~~~
>> drivers//net/veth.c:300:3: note: in expansion of macro 'if'
      if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
      ^~
   include/linux/compiler.h:48:24: note: in expansion of macro '__branch_check__'
    #  define unlikely(x) (__branch_check__(x, 0, __builtin_constant_p(x)))
                           ^~~~~~~~~~~~~~~~
>> drivers//net/veth.c:300:7: note: in expansion of macro 'unlikely'
      if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
          ^~~~~~~~
   cc1: some warnings being treated as errors

vim +/if +300 drivers//net/veth.c

   275	
   276	static int veth_xdp_xmit(struct net_device *dev, int n,
   277				 struct xdp_frame **frames, u32 flags)
   278	{
   279		struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
   280		struct net_device *rcv;
   281		int i, drops = 0;
   282	
   283		if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
   284			return -EINVAL;
   285	
   286		rcv = rcu_dereference(priv->peer);
   287		if (unlikely(!rcv))
   288			return -ENXIO;
   289	
   290		rcv_priv = netdev_priv(rcv);
   291		/* xdp_ring is initialized on receive side? */
   292		if (!rcu_access_pointer(rcv_priv->xdp_prog))
   293			return -ENXIO;
   294	
   295		spin_lock(&rcv_priv->xdp_ring.producer_lock);
   296		for (i = 0; i < n; i++) {
   297			struct xdp_frame *frame = frames[i];
   298			void *ptr = veth_xdp_to_ptr(frame);
   299	
 > 300			if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
   301				     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
   302				xdp_return_frame_rx_napi(frame);
   303				drops++;
   304			}
   305		}
   306		spin_unlock(&rcv_priv->xdp_ring.producer_lock);
   307	
   308		if (flags & XDP_XMIT_FLUSH)
   309			__veth_xdp_flush(rcv_priv);
   310	
   311		return n - drops;
   312	}
   313	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 31379 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit
  2018-07-22 15:13 ` [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit Toshiaki Makita
  2018-07-24  0:19   ` kbuild test robot
  2018-07-24  0:33   ` kbuild test robot
@ 2018-07-24  1:02   ` Jakub Kicinski
  2018-07-24  2:11     ` Toshiaki Makita
  2018-07-24  2:24     ` Toshiaki Makita
  2 siblings, 2 replies; 27+ messages in thread
From: Jakub Kicinski @ 2018-07-24  1:02 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Toshiaki Makita,
	Jesper Dangaard Brouer

On Mon, 23 Jul 2018 00:13:05 +0900, Toshiaki Makita wrote:
> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> This allows NIC's XDP to redirect packets to veth. The destination veth
> device enqueues redirected packets to the napi ring of its peer, then
> they are processed by XDP on its peer veth device.
> This can be thought as calling another XDP program by XDP program using
> REDIRECT, when the peer enables driver XDP.
> 
> Note that when the peer veth device does not set driver xdp, redirected
> packets will be dropped because the peer is not ready for NAPI.

Often we can't redirect to devices which don't have am xdp program
installed.  In your case we can't redirect unless the peer of the
target doesn't have a program installed?  :(

Perhaps it is time to reconsider what Saeed once asked for, a flag or
attribute to enable being the destination of a XDP_REDIRECT.

> v2:
> - Drop the part converting xdp_frame into skb when XDP is not enabled.
> - Implement bulk interface of ndo_xdp_xmit.
> - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
> 
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
>  drivers/net/veth.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 45 insertions(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 4be75c58bc6a..57187e955fea 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -17,6 +17,7 @@
>  #include <net/rtnetlink.h>
>  #include <net/dst.h>
>  #include <net/xfrm.h>
> +#include <net/xdp.h>
>  #include <linux/veth.h>
>  #include <linux/module.h>
>  #include <linux/bpf.h>
> @@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
>  	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
>  }
>  
> +static void *veth_xdp_to_ptr(void *ptr)
> +{
> +	return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
> +}
> +
>  static void veth_ptr_free(void *ptr)
>  {
>  	if (veth_is_xdp_frame(ptr))
> @@ -267,6 +273,44 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
>  	return skb;
>  }
>  
> +static int veth_xdp_xmit(struct net_device *dev, int n,
> +			 struct xdp_frame **frames, u32 flags)
> +{
> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> +	struct net_device *rcv;
> +	int i, drops = 0;
> +
> +	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
> +		return -EINVAL;
> +
> +	rcv = rcu_dereference(priv->peer);
> +	if (unlikely(!rcv))
> +		return -ENXIO;
> +
> +	rcv_priv = netdev_priv(rcv);
> +	/* xdp_ring is initialized on receive side? */
> +	if (!rcu_access_pointer(rcv_priv->xdp_prog))
> +		return -ENXIO;
> +
> +	spin_lock(&rcv_priv->xdp_ring.producer_lock);
> +	for (i = 0; i < n; i++) {
> +		struct xdp_frame *frame = frames[i];
> +		void *ptr = veth_xdp_to_ptr(frame);
> +
> +		if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
> +			     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {

Would you mind sparing a few more words how this is safe vs the
.ndo_close() on the peer?  Personally I'm a bit uncomfortable with the
IFF_UP check in xdp_ok_fwd_dev(), I'm not sure what's supposed to
guarantee the device doesn't go down right after that check, or is
already down, but netdev->flags are not atomic...  

> +			xdp_return_frame_rx_napi(frame);
> +			drops++;
> +		}
> +	}
> +	spin_unlock(&rcv_priv->xdp_ring.producer_lock);
> +
> +	if (flags & XDP_XMIT_FLUSH)
> +		__veth_xdp_flush(rcv_priv);
> +
> +	return n - drops;
> +}
> +
>  static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>  					struct xdp_frame *frame)
>  {
> @@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
>  	.ndo_features_check	= passthru_features_check,
>  	.ndo_set_rx_headroom	= veth_set_rx_headroom,
>  	.ndo_bpf		= veth_xdp,
> +	.ndo_xdp_xmit		= veth_xdp_xmit,
>  };
>  
>  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info
  2018-07-22 15:13 ` [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info Toshiaki Makita
@ 2018-07-24  1:22   ` Jakub Kicinski
  2018-07-24  2:43     ` Toshiaki Makita
  0 siblings, 1 reply; 27+ messages in thread
From: Jakub Kicinski @ 2018-07-24  1:22 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Toshiaki Makita,
	Jesper Dangaard Brouer

On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:
> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> We need some mechanism to disable napi_direct on calling
> xdp_return_frame_rx_napi() from some context.
> When veth gets support of XDP_REDIRECT, it will redirects packets which
> are redirected from other devices. On redirection veth will reuse
> xdp_mem_info of the redirection source device to make return_frame work.
> But in this case .ndo_xdp_xmit() called from veth redirection uses
> xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
> not called directly from the rxq which owns the xdp_mem_info.
> 
> This approach introduces a flag in xdp_mem_info to indicate that
> napi_direct should be disabled even when _rx_napi variant is used.
> 
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

To be clear - you will modify flags of the original source device if it
ever redirected a frame to a software device like veth?  Seems a bit
heavy handed.  The xdp_return_frame_rx_napi() is only really used on
error paths, but still..  Also as you note the original NAPI can run
concurrently with your veth dest one, but also with NAPIs of other veth
devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
makes me worried.

Would you mind elaborating why not handle the RX completely in the NAPI
context of the original device?

> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index fcb033f51d8c..1d1bc6553ff2 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -41,6 +41,9 @@ enum xdp_mem_type {
>  	MEM_TYPE_MAX,
>  };
>  
> +/* XDP flags for xdp_mem_info */
> +#define XDP_MEM_RF_NO_DIRECT	BIT(0)	/* don't use napi_direct */
> +
>  /* XDP flags for ndo_xdp_xmit */
>  #define XDP_XMIT_FLUSH		(1U << 0)	/* doorbell signal consumer */
>  #define XDP_XMIT_FLAGS_MASK	XDP_XMIT_FLUSH
> @@ -48,6 +51,7 @@ enum xdp_mem_type {
>  struct xdp_mem_info {
>  	u32 type; /* enum xdp_mem_type, but known size type */
>  	u32 id;
> +	u32 flags;
>  };
>  
>  struct page_pool;
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 57285383ed00..1426c608fd75 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
>  		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
>  		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
>  		page = virt_to_head_page(data);
> -		if (xa)
> +		if (xa) {
> +			napi_direct &= !(mem->flags & XDP_MEM_RF_NO_DIRECT);
>  			page_pool_put_page(xa->page_pool, page, napi_direct);
> -		else
> +		} else {
>  			put_page(page);
> +		}
>  		rcu_read_unlock();
>  		break;
>  	case MEM_TYPE_PAGE_SHARED:

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 2/8] veth: Add driver XDP
  2018-07-24  0:23   ` Jakub Kicinski
@ 2018-07-24  1:47     ` Toshiaki Makita
  0 siblings, 0 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-24  1:47 UTC (permalink / raw)
  To: Jakub Kicinski, Toshiaki Makita
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer

Hi Jakub,

Thanks for reviewing!

On 2018/07/24 9:23, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:02 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>
>> This is the basic implementation of veth driver XDP.
>>
>> Incoming packets are sent from the peer veth device in the form of skb,
>> so this is generally doing the same thing as generic XDP.
>>
>> This itself is not so useful, but a starting point to implement other
>> useful veth XDP features like TX and REDIRECT.
>>
>> This introduces NAPI when XDP is enabled, because XDP is now heavily
>> relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
>> enqueues packets to the ring and peer NAPI handler drains the ring.
>>
>> Currently only one ring is allocated for each veth device, so it does
>> not scale on multiqueue env. This can be resolved by allocating rings
>> on the per-queue basis later.
>>
>> Note that NAPI is not used but netif_rx is used when XDP is not loaded,
>> so this does not change the default behaviour.
>>
>> v3:
>> - Fix race on closing the device.
>> - Add extack messages in ndo_bpf.
>>
>> v2:
>> - Squashed with the patch adding NAPI.
>> - Implement adjust_tail.
>> - Don't acquire consumer lock because it is guarded by NAPI.
>> - Make poll_controller noop since it is unnecessary.
>> - Register rxq_info on enabling XDP rather than on opening the device.
>>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
>> +static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
>> +					struct sk_buff *skb)
>> +{
>> +	u32 pktlen, headroom, act, metalen;
>> +	void *orig_data, *orig_data_end;
>> +	int size, mac_len, delta, off;
>> +	struct bpf_prog *xdp_prog;
>> +	struct xdp_buff xdp;
>> +
>> +	rcu_read_lock();
>> +	xdp_prog = rcu_dereference(priv->xdp_prog);
>> +	if (unlikely(!xdp_prog)) {
>> +		rcu_read_unlock();
>> +		goto out;
>> +	}
>> +
>> +	mac_len = skb->data - skb_mac_header(skb);
>> +	pktlen = skb->len + mac_len;
>> +	size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
>> +	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +	if (size > PAGE_SIZE)
>> +		goto drop;
>> +
>> +	headroom = skb_headroom(skb) - mac_len;
>> +	if (skb_shared(skb) || skb_head_is_locked(skb) ||
>> +	    skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
>> +		struct sk_buff *nskb;
>> +		void *head, *start;
>> +		struct page *page;
>> +		int head_off;
>> +
>> +		page = alloc_page(GFP_ATOMIC);
>> +		if (!page)
>> +			goto drop;
>> +
>> +		head = page_address(page);
>> +		start = head + VETH_XDP_HEADROOM;
>> +		if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
>> +			page_frag_free(head);
>> +			goto drop;
>> +		}
>> +
>> +		nskb = veth_build_skb(head,
>> +				      VETH_XDP_HEADROOM + mac_len, skb->len,
>> +				      PAGE_SIZE);
>> +		if (!nskb) {
>> +			page_frag_free(head);
>> +			goto drop;
>> +		}
> 
>> +static int veth_enable_xdp(struct net_device *dev)
>> +{
>> +	struct veth_priv *priv = netdev_priv(dev);
>> +	int err;
>> +
>> +	if (!xdp_rxq_info_is_reg(&priv->xdp_rxq)) {
>> +		err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
>> +		if (err < 0)
>> +			return err;
>> +
>> +		err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
>> +						 MEM_TYPE_PAGE_SHARED, NULL);
> 
> nit: doesn't matter much but looks like a mix of MEM_TYPE_PAGE_SHARED
>      and MEM_TYPE_PAGE_ORDER0

Actually I'm not sure when to use MEM_TYPE_PAGE_ORDER0. It seems a page
allocated by alloc_page() can be freed by page_frag_free() and it is
more lightweight than put_page(), isn't it?
virtio_net is doing it in a similar way.

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled
  2018-07-24  0:27   ` Jakub Kicinski
@ 2018-07-24  1:56     ` Toshiaki Makita
  2018-07-24  9:39       ` Toshiaki Makita
  0 siblings, 1 reply; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-24  1:56 UTC (permalink / raw)
  To: Jakub Kicinski, Toshiaki Makita
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann, Jesper Dangaard Brouer

On 2018/07/24 9:27, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>
>> All oversized packets including GSO packets are dropped if XDP is
>> enabled on receiver side, so don't send such packets from peer.
>>
>> Drop TSO and SCTP fragmentation features so that veth devices themselves
>> segment packets with XDP enabled. Also cap MTU accordingly.
>>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> Is there any precedence for fixing up features and MTU like this?  Most
> drivers just refuse to install the program if settings are incompatible.

I don't know any precedence. I can refuse the program on installing it
when features and MTU are not appropriate. Is it preferred?
Note that with current implementation wanted_features are not touched so
features will be restored when the XDP program is removed. MTU will not
be restored though, as I do not remember the original MTU.


>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 78fa08cb6e24..f5b72e937d9d 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -542,6 +542,23 @@ static int veth_get_iflink(const struct net_device *dev)
>>  	return iflink;
>>  }
>>  
>> +static netdev_features_t veth_fix_features(struct net_device *dev,
>> +					   netdev_features_t features)
>> +{
>> +	struct veth_priv *priv = netdev_priv(dev);
>> +	struct net_device *peer;
>> +
>> +	peer = rtnl_dereference(priv->peer);
>> +	if (peer) {
>> +		struct veth_priv *peer_priv = netdev_priv(peer);
>> +
>> +		if (peer_priv->_xdp_prog)
>> +			features &= ~NETIF_F_GSO_SOFTWARE;
>> +	}
>> +
>> +	return features;
>> +}
>> +
>>  static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
>>  {
>>  	struct veth_priv *peer_priv, *priv = netdev_priv(dev);
>> @@ -591,14 +608,33 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
>>  				goto err;
>>  			}
>>  		}
>> +
>> +		if (!old_prog) {
>> +			peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
>> +			peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
>> +				peer->hard_header_len -
>> +				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +			if (peer->mtu > peer->max_mtu)
>> +				dev_set_mtu(peer, peer->max_mtu);
>> +		}
>>  	}
>>  
>>  	if (old_prog) {
>> -		if (!prog && dev->flags & IFF_UP)
>> -			veth_disable_xdp(dev);
>> +		if (!prog) {
>> +			if (dev->flags & IFF_UP)
>> +				veth_disable_xdp(dev);
>> +
>> +			if (peer) {
>> +				peer->hw_features |= NETIF_F_GSO_SOFTWARE;
>> +				peer->max_mtu = ETH_MAX_MTU;
>> +			}
>> +		}
>>  		bpf_prog_put(old_prog);
>>  	}
>>  
>> +	if ((!!old_prog ^ !!prog) && peer)
>> +		netdev_update_features(peer);
>> +
>>  	return 0;
>>  err:
>>  	priv->_xdp_prog = old_prog;
>> @@ -643,6 +679,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>  	.ndo_poll_controller	= veth_poll_controller,
>>  #endif
>>  	.ndo_get_iflink		= veth_get_iflink,
>> +	.ndo_fix_features	= veth_fix_features,
>>  	.ndo_features_check	= passthru_features_check,
>>  	.ndo_set_rx_headroom	= veth_set_rx_headroom,
>>  	.ndo_bpf		= veth_xdp,

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit
  2018-07-24  0:19   ` kbuild test robot
@ 2018-07-24  1:59     ` Toshiaki Makita
  0 siblings, 0 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-24  1:59 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: kbuild test robot, kbuild-all, netdev, Alexei Starovoitov,
	Daniel Borkmann, Jesper Dangaard Brouer

On 2018/07/24 9:19, kbuild test robot wrote:
> Hi Toshiaki,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on bpf-next/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Toshiaki-Makita/veth-Driver-XDP/20180724-065517
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> config: i386-randconfig-x001-201829 (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 
> 
> All errors (new ones prefixed by >>):
> 
>    In file included from include/linux/kernel.h:10:0,
>                     from include/linux/list.h:9,
>                     from include/linux/timer.h:5,
>                     from include/linux/netdevice.h:28,
>                     from drivers//net/veth.c:11:
>    drivers//net/veth.c: In function 'veth_xdp_xmit':
>>> drivers//net/veth.c:300:16: error: implicit declaration of function 'xdp_ok_fwd_dev' [-Werror=implicit-function-declaration]
>       if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||

This is because this series depends on commit d8d7218ad842 ("xdp:
XDP_REDIRECT should check IFF_UP and MTU") which is currently in DaveM's
net-next tree, as I noted in the cover letter.

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit
  2018-07-24  1:02   ` Jakub Kicinski
@ 2018-07-24  2:11     ` Toshiaki Makita
  2018-07-24 13:58       ` Tariq Toukan
  2018-07-24  2:24     ` Toshiaki Makita
  1 sibling, 1 reply; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-24  2:11 UTC (permalink / raw)
  To: Jakub Kicinski, Toshiaki Makita
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, tariqt

On 2018/07/24 10:02, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:05 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>
>> This allows NIC's XDP to redirect packets to veth. The destination veth
>> device enqueues redirected packets to the napi ring of its peer, then
>> they are processed by XDP on its peer veth device.
>> This can be thought as calling another XDP program by XDP program using
>> REDIRECT, when the peer enables driver XDP.
>>
>> Note that when the peer veth device does not set driver xdp, redirected
>> packets will be dropped because the peer is not ready for NAPI.
> 
> Often we can't redirect to devices which don't have am xdp program
> installed.  In your case we can't redirect unless the peer of the
> target doesn't have a program installed?  :(

Right. I tried to avoid this case by converting xdp_frames to skb but
realized that should not be done.
https://patchwork.ozlabs.org/patch/903536/

> Perhaps it is time to reconsider what Saeed once asked for, a flag or
> attribute to enable being the destination of a XDP_REDIRECT.

Yes, something will be necessary. Jesper said Tariq had some ideas to
implement it.

> 
>> v2:
>> - Drop the part converting xdp_frame into skb when XDP is not enabled.
>> - Implement bulk interface of ndo_xdp_xmit.
>> - Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.
>>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>> ---
>>  drivers/net/veth.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 45 insertions(+)
>>
>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>> index 4be75c58bc6a..57187e955fea 100644
>> --- a/drivers/net/veth.c
>> +++ b/drivers/net/veth.c
>> @@ -17,6 +17,7 @@
>>  #include <net/rtnetlink.h>
>>  #include <net/dst.h>
>>  #include <net/xfrm.h>
>> +#include <net/xdp.h>
>>  #include <linux/veth.h>
>>  #include <linux/module.h>
>>  #include <linux/bpf.h>
>> @@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
>>  	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
>>  }
>>  
>> +static void *veth_xdp_to_ptr(void *ptr)
>> +{
>> +	return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
>> +}
>> +
>>  static void veth_ptr_free(void *ptr)
>>  {
>>  	if (veth_is_xdp_frame(ptr))
>> @@ -267,6 +273,44 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
>>  	return skb;
>>  }
>>  
>> +static int veth_xdp_xmit(struct net_device *dev, int n,
>> +			 struct xdp_frame **frames, u32 flags)
>> +{
>> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>> +	struct net_device *rcv;
>> +	int i, drops = 0;
>> +
>> +	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>> +		return -EINVAL;
>> +
>> +	rcv = rcu_dereference(priv->peer);
>> +	if (unlikely(!rcv))
>> +		return -ENXIO;
>> +
>> +	rcv_priv = netdev_priv(rcv);
>> +	/* xdp_ring is initialized on receive side? */
>> +	if (!rcu_access_pointer(rcv_priv->xdp_prog))
>> +		return -ENXIO;
>> +
>> +	spin_lock(&rcv_priv->xdp_ring.producer_lock);
>> +	for (i = 0; i < n; i++) {
>> +		struct xdp_frame *frame = frames[i];
>> +		void *ptr = veth_xdp_to_ptr(frame);
>> +
>> +		if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
>> +			     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
> 
> Would you mind sparing a few more words how this is safe vs the
> .ndo_close() on the peer?  Personally I'm a bit uncomfortable with the
> IFF_UP check in xdp_ok_fwd_dev(), I'm not sure what's supposed to
> guarantee the device doesn't go down right after that check, or is
> already down, but netdev->flags are not atomic...  
> 
>> +			xdp_return_frame_rx_napi(frame);
>> +			drops++;
>> +		}
>> +	}
>> +	spin_unlock(&rcv_priv->xdp_ring.producer_lock);
>> +
>> +	if (flags & XDP_XMIT_FLUSH)
>> +		__veth_xdp_flush(rcv_priv);
>> +
>> +	return n - drops;
>> +}
>> +
>>  static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>>  					struct xdp_frame *frame)
>>  {
>> @@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>  	.ndo_features_check	= passthru_features_check,
>>  	.ndo_set_rx_headroom	= veth_set_rx_headroom,
>>  	.ndo_bpf		= veth_xdp,
>> +	.ndo_xdp_xmit		= veth_xdp_xmit,
>>  };
>>  
>>  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> 
> 
> 

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit
  2018-07-24  1:02   ` Jakub Kicinski
  2018-07-24  2:11     ` Toshiaki Makita
@ 2018-07-24  2:24     ` Toshiaki Makita
  1 sibling, 0 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-24  2:24 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toshiaki Makita, netdev, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer

On 2018/07/24 10:02, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:05 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>
>> This allows NIC's XDP to redirect packets to veth. The destination veth
>> device enqueues redirected packets to the napi ring of its peer, then
>> they are processed by XDP on its peer veth device.
>> This can be thought as calling another XDP program by XDP program using
>> REDIRECT, when the peer enables driver XDP.
>>
>> Note that when the peer veth device does not set driver xdp, redirected
>> packets will be dropped because the peer is not ready for NAPI.
...
>> +static int veth_xdp_xmit(struct net_device *dev, int n,
>> +			 struct xdp_frame **frames, u32 flags)
>> +{
>> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>> +	struct net_device *rcv;
>> +	int i, drops = 0;
>> +
>> +	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
>> +		return -EINVAL;
>> +
>> +	rcv = rcu_dereference(priv->peer);
>> +	if (unlikely(!rcv))
>> +		return -ENXIO;
>> +
>> +	rcv_priv = netdev_priv(rcv);
>> +	/* xdp_ring is initialized on receive side? */
>> +	if (!rcu_access_pointer(rcv_priv->xdp_prog))
>> +		return -ENXIO;
>> +
>> +	spin_lock(&rcv_priv->xdp_ring.producer_lock);
>> +	for (i = 0; i < n; i++) {
>> +		struct xdp_frame *frame = frames[i];
>> +		void *ptr = veth_xdp_to_ptr(frame);
>> +
>> +		if (unlikely(xdp_ok_fwd_dev(rcv, frame->len) ||
>> +			     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
> 
> Would you mind sparing a few more words how this is safe vs the
> .ndo_close() on the peer?  Personally I'm a bit uncomfortable with the
> IFF_UP check in xdp_ok_fwd_dev(), I'm not sure what's supposed to
> guarantee the device doesn't go down right after that check, or is
> already down, but netdev->flags are not atomic...  

Actually it is guarded by RCU. On closing the device rcv_priv->xdp_prog
is set to be NULL, and synchronize_net() is called from within
netif_napi_del(). Then ptr_ring is cleaned-up.
xdp_ok_fwd_dev() is doing the same check as non-XDP case, but it may not
be appropriate because IFF_UP check here is not usable as you say.

> 
>> +			xdp_return_frame_rx_napi(frame);
>> +			drops++;
>> +		}
>> +	}
>> +	spin_unlock(&rcv_priv->xdp_ring.producer_lock);
>> +
>> +	if (flags & XDP_XMIT_FLUSH)
>> +		__veth_xdp_flush(rcv_priv);
>> +
>> +	return n - drops;
>> +}
>> +
>>  static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>>  					struct xdp_frame *frame)
>>  {
>> @@ -760,6 +804,7 @@ static const struct net_device_ops veth_netdev_ops = {
>>  	.ndo_features_check	= passthru_features_check,
>>  	.ndo_set_rx_headroom	= veth_set_rx_headroom,
>>  	.ndo_bpf		= veth_xdp,
>> +	.ndo_xdp_xmit		= veth_xdp_xmit,
>>  };
>>  
>>  #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> 
> 
> 

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info
  2018-07-24  1:22   ` Jakub Kicinski
@ 2018-07-24  2:43     ` Toshiaki Makita
  2018-07-24  3:38       ` Jakub Kicinski
  0 siblings, 1 reply; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-24  2:43 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toshiaki Makita, netdev, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer

On 2018/07/24 10:22, Jakub Kicinski wrote:
> On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:
>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>
>> We need some mechanism to disable napi_direct on calling
>> xdp_return_frame_rx_napi() from some context.
>> When veth gets support of XDP_REDIRECT, it will redirects packets which
>> are redirected from other devices. On redirection veth will reuse
>> xdp_mem_info of the redirection source device to make return_frame work.
>> But in this case .ndo_xdp_xmit() called from veth redirection uses
>> xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
>> not called directly from the rxq which owns the xdp_mem_info.
>>
>> This approach introduces a flag in xdp_mem_info to indicate that
>> napi_direct should be disabled even when _rx_napi variant is used.
>>
>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> 
> To be clear - you will modify flags of the original source device if it
> ever redirected a frame to a software device like veth?  Seems a bit
> heavy handed.  The xdp_return_frame_rx_napi() is only really used on
> error paths, but still..  Also as you note the original NAPI can run
> concurrently with your veth dest one, but also with NAPIs of other veth
> devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
> makes me worried.

xdp_mem_info is copied in xdp_frame in convert_to_xdp_frame() so the
field is local to the frame. Changing flags affects only the frame.
xdp.rxq is local to NAPI thread, so no worries about atomicity.

> Would you mind elaborating why not handle the RX completely in the NAPI
> context of the original device?

Originally it was difficult to implement .ndo_xdp_xmit() and
.ndo_xdp_flush() model without creating NAPI in veth. Now it is changed
so I'm not sure how difficult it is at this point.
But in any case I want to avoid stack inflation by veth NAPI. (Imagine
some misconfiguration like calling XDP_TX on both side of veth...)

> 
>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>> index fcb033f51d8c..1d1bc6553ff2 100644
>> --- a/include/net/xdp.h
>> +++ b/include/net/xdp.h
>> @@ -41,6 +41,9 @@ enum xdp_mem_type {
>>  	MEM_TYPE_MAX,
>>  };
>>  
>> +/* XDP flags for xdp_mem_info */
>> +#define XDP_MEM_RF_NO_DIRECT	BIT(0)	/* don't use napi_direct */
>> +
>>  /* XDP flags for ndo_xdp_xmit */
>>  #define XDP_XMIT_FLUSH		(1U << 0)	/* doorbell signal consumer */
>>  #define XDP_XMIT_FLAGS_MASK	XDP_XMIT_FLUSH
>> @@ -48,6 +51,7 @@ enum xdp_mem_type {
>>  struct xdp_mem_info {
>>  	u32 type; /* enum xdp_mem_type, but known size type */
>>  	u32 id;
>> +	u32 flags;
>>  };
>>  
>>  struct page_pool;
>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>> index 57285383ed00..1426c608fd75 100644
>> --- a/net/core/xdp.c
>> +++ b/net/core/xdp.c
>> @@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
>>  		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
>>  		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
>>  		page = virt_to_head_page(data);
>> -		if (xa)
>> +		if (xa) {
>> +			napi_direct &= !(mem->flags & XDP_MEM_RF_NO_DIRECT);
>>  			page_pool_put_page(xa->page_pool, page, napi_direct);
>> -		else
>> +		} else {
>>  			put_page(page);
>> +		}
>>  		rcu_read_unlock();
>>  		break;
>>  	case MEM_TYPE_PAGE_SHARED:
> 
> 
> 

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info
  2018-07-24  2:43     ` Toshiaki Makita
@ 2018-07-24  3:38       ` Jakub Kicinski
  2018-07-24  4:02         ` Toshiaki Makita
  0 siblings, 1 reply; 27+ messages in thread
From: Jakub Kicinski @ 2018-07-24  3:38 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toshiaki Makita, netdev, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer

On Tue, 24 Jul 2018 11:43:11 +0900, Toshiaki Makita wrote:
> On 2018/07/24 10:22, Jakub Kicinski wrote:
> > On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:  
> >> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> >>
> >> We need some mechanism to disable napi_direct on calling
> >> xdp_return_frame_rx_napi() from some context.
> >> When veth gets support of XDP_REDIRECT, it will redirects packets which
> >> are redirected from other devices. On redirection veth will reuse
> >> xdp_mem_info of the redirection source device to make return_frame work.
> >> But in this case .ndo_xdp_xmit() called from veth redirection uses
> >> xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
> >> not called directly from the rxq which owns the xdp_mem_info.
> >>
> >> This approach introduces a flag in xdp_mem_info to indicate that
> >> napi_direct should be disabled even when _rx_napi variant is used.
> >>
> >> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>  
> > 
> > To be clear - you will modify flags of the original source device if it
> > ever redirected a frame to a software device like veth?  Seems a bit
> > heavy handed.  The xdp_return_frame_rx_napi() is only really used on
> > error paths, but still..  Also as you note the original NAPI can run
> > concurrently with your veth dest one, but also with NAPIs of other veth
> > devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
> > makes me worried.  
> 
> xdp_mem_info is copied in xdp_frame in convert_to_xdp_frame() so the
> field is local to the frame. Changing flags affects only the frame.
> xdp.rxq is local to NAPI thread, so no worries about atomicity.

Ah, right!  mem_info used to be just 8B, now it would be 12B.
Alternatively we could perhaps add this info to struct redirect_info,
through xdp_do_redirect() to avoid the per-frame cost.  I'm not sure
that's better.

> > Would you mind elaborating why not handle the RX completely in the NAPI
> > context of the original device?  
> 
> Originally it was difficult to implement .ndo_xdp_xmit() and
> .ndo_xdp_flush() model without creating NAPI in veth. Now it is changed
> so I'm not sure how difficult it is at this point.
> But in any case I want to avoid stack inflation by veth NAPI. (Imagine
> some misconfiguration like calling XDP_TX on both side of veth...)

True :/

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info
  2018-07-24  3:38       ` Jakub Kicinski
@ 2018-07-24  4:02         ` Toshiaki Makita
  0 siblings, 0 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-24  4:02 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toshiaki Makita, netdev, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer

On 2018/07/24 12:38, Jakub Kicinski wrote:
> On Tue, 24 Jul 2018 11:43:11 +0900, Toshiaki Makita wrote:
>> On 2018/07/24 10:22, Jakub Kicinski wrote:
>>> On Mon, 23 Jul 2018 00:13:06 +0900, Toshiaki Makita wrote:  
>>>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>>
>>>> We need some mechanism to disable napi_direct on calling
>>>> xdp_return_frame_rx_napi() from some context.
>>>> When veth gets support of XDP_REDIRECT, it will redirects packets which
>>>> are redirected from other devices. On redirection veth will reuse
>>>> xdp_mem_info of the redirection source device to make return_frame work.
>>>> But in this case .ndo_xdp_xmit() called from veth redirection uses
>>>> xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit is
>>>> not called directly from the rxq which owns the xdp_mem_info.
>>>>
>>>> This approach introduces a flag in xdp_mem_info to indicate that
>>>> napi_direct should be disabled even when _rx_napi variant is used.
>>>>
>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>  
>>>
>>> To be clear - you will modify flags of the original source device if it
>>> ever redirected a frame to a software device like veth?  Seems a bit
>>> heavy handed.  The xdp_return_frame_rx_napi() is only really used on
>>> error paths, but still..  Also as you note the original NAPI can run
>>> concurrently with your veth dest one, but also with NAPIs of other veth
>>> devices, so the non-atomic xdp.rxq->mem.flags |= XDP_MEM_RF_NO_DIRECT;
>>> makes me worried.  
>>
>> xdp_mem_info is copied in xdp_frame in convert_to_xdp_frame() so the
>> field is local to the frame. Changing flags affects only the frame.
>> xdp.rxq is local to NAPI thread, so no worries about atomicity.
> 
> Ah, right!  mem_info used to be just 8B, now it would be 12B.
> Alternatively we could perhaps add this info to struct redirect_info,
> through xdp_do_redirect() to avoid the per-frame cost.  I'm not sure
> that's better.

OK, let me check if this works.

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled
  2018-07-24  1:56     ` Toshiaki Makita
@ 2018-07-24  9:39       ` Toshiaki Makita
  2018-07-24 19:10         ` Jakub Kicinski
  0 siblings, 1 reply; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-24  9:39 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toshiaki Makita, netdev, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer

On 2018/07/24 10:56, Toshiaki Makita wrote:
> On 2018/07/24 9:27, Jakub Kicinski wrote:
>> On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:
>>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>
>>> All oversized packets including GSO packets are dropped if XDP is
>>> enabled on receiver side, so don't send such packets from peer.
>>>
>>> Drop TSO and SCTP fragmentation features so that veth devices themselves
>>> segment packets with XDP enabled. Also cap MTU accordingly.
>>>
>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>
>> Is there any precedence for fixing up features and MTU like this?  Most
>> drivers just refuse to install the program if settings are incompatible.
> 
> I don't know any precedence. I can refuse the program on installing it
> when features and MTU are not appropriate. Is it preferred?
> Note that with current implementation wanted_features are not touched so
> features will be restored when the XDP program is removed. MTU will not
> be restored though, as I do not remember the original MTU.

I just recalled that virtio_net used to refused XDP when guest offload
features are incompatible but now it dynamically fixup them on
installing an XDP program.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3f93522ffab2d46a36b57adf324a54e674fc9536

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit
  2018-07-24  2:11     ` Toshiaki Makita
@ 2018-07-24 13:58       ` Tariq Toukan
  0 siblings, 0 replies; 27+ messages in thread
From: Tariq Toukan @ 2018-07-24 13:58 UTC (permalink / raw)
  To: Toshiaki Makita, Jakub Kicinski, Toshiaki Makita
  Cc: netdev, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer, tariqt



On 24/07/2018 5:11 AM, Toshiaki Makita wrote:
> On 2018/07/24 10:02, Jakub Kicinski wrote:
>> On Mon, 23 Jul 2018 00:13:05 +0900, Toshiaki Makita wrote:
>>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>
>>> This allows NIC's XDP to redirect packets to veth. The destination veth
>>> device enqueues redirected packets to the napi ring of its peer, then
>>> they are processed by XDP on its peer veth device.
>>> This can be thought as calling another XDP program by XDP program using
>>> REDIRECT, when the peer enables driver XDP.
>>>
>>> Note that when the peer veth device does not set driver xdp, redirected
>>> packets will be dropped because the peer is not ready for NAPI.
>>
>> Often we can't redirect to devices which don't have am xdp program
>> installed.  In your case we can't redirect unless the peer of the
>> target doesn't have a program installed?  :(
> 
> Right. I tried to avoid this case by converting xdp_frames to skb but
> realized that should not be done.
> https://patchwork.ozlabs.org/patch/903536/
> 
>> Perhaps it is time to reconsider what Saeed once asked for, a flag or
>> attribute to enable being the destination of a XDP_REDIRECT.
> 
> Yes, something will be necessary. Jesper said Tariq had some ideas to
> implement it.
> 

Yes, that bothered me as well.

I think that the driver-out capability of the XDP redirect is totally 
unrelated to any XDP program, and is a standalone feature that should be 
simply turned on/off just like any other performance feature, via 
ethtool -K.

I am going to push my driver implementation (mlx5) of XDP redirect 
driver-out side very soon (this week).
As you will see, it does not require loading any XDP program, and the 
feature will be always on (for now, until we add a control flow for it).

Later, we plan to push ethtool infrastructure and driver implementation 
to control the feature.

Thanks,
Tariq

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled
  2018-07-24  9:39       ` Toshiaki Makita
@ 2018-07-24 19:10         ` Jakub Kicinski
  2018-07-25  4:22           ` Toshiaki Makita
  0 siblings, 1 reply; 27+ messages in thread
From: Jakub Kicinski @ 2018-07-24 19:10 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toshiaki Makita, netdev, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer

On Tue, 24 Jul 2018 18:39:09 +0900, Toshiaki Makita wrote:
> On 2018/07/24 10:56, Toshiaki Makita wrote:
> > On 2018/07/24 9:27, Jakub Kicinski wrote:  
> >> On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:  
> >>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> >>>
> >>> All oversized packets including GSO packets are dropped if XDP is
> >>> enabled on receiver side, so don't send such packets from peer.
> >>>
> >>> Drop TSO and SCTP fragmentation features so that veth devices themselves
> >>> segment packets with XDP enabled. Also cap MTU accordingly.
> >>>
> >>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>  
> >>
> >> Is there any precedence for fixing up features and MTU like this?  Most
> >> drivers just refuse to install the program if settings are incompatible.  
> > 
> > I don't know any precedence. I can refuse the program on installing it
> > when features and MTU are not appropriate. Is it preferred?
> > Note that with current implementation wanted_features are not touched so
> > features will be restored when the XDP program is removed. MTU will not
> > be restored though, as I do not remember the original MTU.  
> 
> I just recalled that virtio_net used to refused XDP when guest offload
> features are incompatible but now it dynamically fixup them on
> installing an XDP program.
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3f93522ffab2d46a36b57adf324a54e674fc9536

That's slightly different AFAIU, because the virtio features weren't
really controllable at runtime at all.  I'm not dead set on leaving the
features be, but I just want to make sure we think this through
properly before we commit to any magic behaviour for ever...

Taking a quick glance at the MTU side, it seems that today if someone
decides to set MTU on one side of a veth pair the packets will simply
get dropped.  So the MTU coupling for XDP doesn't seem in line with
existing behaviour of veth, not only other XDP drivers.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled
  2018-07-24 19:10         ` Jakub Kicinski
@ 2018-07-25  4:22           ` Toshiaki Makita
  0 siblings, 0 replies; 27+ messages in thread
From: Toshiaki Makita @ 2018-07-25  4:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Toshiaki Makita, netdev, Alexei Starovoitov, Daniel Borkmann,
	Jesper Dangaard Brouer

On 2018/07/25 4:10, Jakub Kicinski wrote:
> On Tue, 24 Jul 2018 18:39:09 +0900, Toshiaki Makita wrote:
>> On 2018/07/24 10:56, Toshiaki Makita wrote:
>>> On 2018/07/24 9:27, Jakub Kicinski wrote:  
>>>> On Mon, 23 Jul 2018 00:13:03 +0900, Toshiaki Makita wrote:  
>>>>> From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
>>>>>
>>>>> All oversized packets including GSO packets are dropped if XDP is
>>>>> enabled on receiver side, so don't send such packets from peer.
>>>>>
>>>>> Drop TSO and SCTP fragmentation features so that veth devices themselves
>>>>> segment packets with XDP enabled. Also cap MTU accordingly.
>>>>>
>>>>> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>  
>>>>
>>>> Is there any precedence for fixing up features and MTU like this?  Most
>>>> drivers just refuse to install the program if settings are incompatible.  
>>>
>>> I don't know any precedence. I can refuse the program on installing it
>>> when features and MTU are not appropriate. Is it preferred?
>>> Note that with current implementation wanted_features are not touched so
>>> features will be restored when the XDP program is removed. MTU will not
>>> be restored though, as I do not remember the original MTU.  
>>
>> I just recalled that virtio_net used to refused XDP when guest offload
>> features are incompatible but now it dynamically fixup them on
>> installing an XDP program.
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3f93522ffab2d46a36b57adf324a54e674fc9536
> 
> That's slightly different AFAIU, because the virtio features weren't
> really controllable at runtime at all.  I'm not dead set on leaving the
> features be, but I just want to make sure we think this through
> properly before we commit to any magic behaviour for ever...

To me it does not look so different. What the above virtio commit is
doing is almost disabling LRO so we could add a feature to toggle LRO
instead but it chose to automatically disable it. And this veth commit
is also almost equivalent to disabling LRO.
IMHO we should do this feature adjustment. It just avoids packet drops
and has no downside. It forces software segmentation on the peer veth.
Features will be restored when XDP is removed so there would be no
surprising for users. It seems there is no benefit to not doing this.

> Taking a quick glance at the MTU side, it seems that today if someone
> decides to set MTU on one side of a veth pair the packets will simply
> get dropped.  So the MTU coupling for XDP doesn't seem in line with
> existing behaviour of veth, not only other XDP drivers.

It looks weird to allow such inconsistent MTU settings. But anyway
changing MTU can have negative effect on users, such as causing
fragmentation or EMSGSIZE error on UDP sendmsg() or not restoring MTU. I
think I should not adjust MTU but just cap max_mtu.

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2018-07-25  5:32 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-22 15:13 [PATCH v3 bpf-next 0/8] veth: Driver XDP Toshiaki Makita
2018-07-22 15:13 ` [PATCH v3 bpf-next 1/8] net: Export skb_headers_offset_update Toshiaki Makita
2018-07-22 15:13 ` [PATCH v3 bpf-next 2/8] veth: Add driver XDP Toshiaki Makita
2018-07-24  0:23   ` Jakub Kicinski
2018-07-24  1:47     ` Toshiaki Makita
2018-07-22 15:13 ` [PATCH v3 bpf-next 3/8] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
2018-07-24  0:27   ` Jakub Kicinski
2018-07-24  1:56     ` Toshiaki Makita
2018-07-24  9:39       ` Toshiaki Makita
2018-07-24 19:10         ` Jakub Kicinski
2018-07-25  4:22           ` Toshiaki Makita
2018-07-22 15:13 ` [PATCH v3 bpf-next 4/8] veth: Handle xdp_frames in xdp napi ring Toshiaki Makita
2018-07-22 15:13 ` [PATCH v3 bpf-next 5/8] veth: Add ndo_xdp_xmit Toshiaki Makita
2018-07-24  0:19   ` kbuild test robot
2018-07-24  1:59     ` Toshiaki Makita
2018-07-24  0:33   ` kbuild test robot
2018-07-24  1:02   ` Jakub Kicinski
2018-07-24  2:11     ` Toshiaki Makita
2018-07-24 13:58       ` Tariq Toukan
2018-07-24  2:24     ` Toshiaki Makita
2018-07-22 15:13 ` [PATCH v3 bpf-next 6/8] xdp: Add a flag for disabling napi_direct of xdp_return_frame in xdp_mem_info Toshiaki Makita
2018-07-24  1:22   ` Jakub Kicinski
2018-07-24  2:43     ` Toshiaki Makita
2018-07-24  3:38       ` Jakub Kicinski
2018-07-24  4:02         ` Toshiaki Makita
2018-07-22 15:13 ` [PATCH v3 bpf-next 7/8] veth: Add XDP TX and REDIRECT Toshiaki Makita
2018-07-22 15:13 ` [PATCH v3 bpf-next 8/8] veth: Support per queue XDP ring Toshiaki Makita

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).