All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v6 bpf-next 0/9] veth: Driver XDP
@ 2018-07-30 10:43 Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 1/9] net: Export skb_headers_offset_update Toshiaki Makita
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.

  NIC -----------> veth===veth
 (XDP) (redirect)        (XDP)

In this case xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than generic XDP.


Envisioned use-cases
--------------------

* Container managed XDP program
Container host redirects frames to containers by XDP redirect action, and
privileged containers can deploy their own XDP programs.

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.


Implementation
--------------

This changeset is making use of NAPI to implement ndo_xdp_xmit and
XDP_TX/REDIRECT. This is mainly because XDP heavily relies on NAPI
context.
 - patch 1: Export a function needed for veth XDP.
 - patch 2-3: Basic implementation of veth XDP.
 - patch 4-5: Add ndo_xdp_xmit.
 - patch 6-8: Add XDP_TX and XDP_REDIRECT.
 - patch 9: Performance optimization for multi-queue env.


Tests and performance numbers
-----------------------------

Tested with a simple XDP program which only redirects packets between
NIC and veth. I used i40e 25G NIC (XXV710) for the physical NIC. The
server has 20 of Xeon Silver 2.20 GHz cores.

  pktgen --(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP)

The rightmost veth loads XDP progs and just does DROP or TX. The number
of packets is measured in the XDP progs. The leftmost pktgen sends
packets at 37.1 Mpps (almost 25G wire speed).

veth XDP action    Flows    Mpps
================================
DROP                   1    10.6
DROP                   2    21.2
DROP                 100    36.0
TX                     1     5.0
TX                     2    10.0
TX                   100    31.0

I also measured netperf TCP_STREAM but was not so great performance due
to lack of tx/rx checksum offload and TSO, etc.

  netperf <--(wire)--> XXV710 (i40e) <--(XDP redirect)--> veth===veth (XDP PASS)

Direction         Flows   Gbps
==============================
external->veth        1   20.8
external->veth        2   23.5
external->veth      100   23.6
veth->external        1    9.0
veth->external        2   17.8
veth->external      100   22.9

Also tested doing ifup/down or load/unload a XDP program repeatedly
during processing XDP packets in order to check if enabling/disabling
NAPI is working as expected, and found no problems.

v6:
- Check skb->len only if reallocation is needed.
- Add __GFP_NOWARN to alloc_page() since it can be triggered by external
  events.
- Fix sparse warning around EXPORT_SYMBOL.

v5:
- Fix broken SOBs.

v4:
- Don't adjust MTU automatically.
- Skip peer IFF_UP check on .ndo_xdp_xmit() because it is unnecessary.
  Add comments to explain that.
- Use redirect_info instead of xdp_mem_info for storing no_direct flag
  to avoid per packet copy cost.

v3:
- Drop skb bulk xmit patch since it makes little performance
  difference. The hotspot in TCP skb xmit at this point is checksum
  computation in skb_segment and packet copy on XDP_REDIRECT due to
  cloned/nonlinear skb.
- Fix race on closing device.
- Add extack messages in ndo_bpf.

v2:
- Squash NAPI patch with "Add driver XDP" patch.
- Remove conversion from xdp_frame to skb when NAPI is not enabled.
- Introduce per-queue XDP ring (patch 8).
- Introduce bulk skb xmit when XDP is enabled on the peer (patch 9).

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Toshiaki Makita (9):
  net: Export skb_headers_offset_update
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Handle xdp_frames in xdp napi ring
  veth: Add ndo_xdp_xmit
  bpf: Make redirect_info accessible from modules
  xdp: Helpers for disabling napi_direct of xdp_return_frame
  veth: Add XDP TX and REDIRECT
  veth: Support per queue XDP ring

 drivers/net/veth.c     | 748 ++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/filter.h |  35 +++
 include/linux/skbuff.h |   1 +
 net/core/filter.c      |  29 +-
 net/core/skbuff.c      |   3 +-
 net/core/xdp.c         |   6 +-
 6 files changed, 792 insertions(+), 30 deletions(-)

-- 
1.8.3.1

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v6 bpf-next 1/9] net: Export skb_headers_offset_update
  2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
@ 2018-07-30 10:43 ` Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 2/9] veth: Add driver XDP Toshiaki Makita
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/linux/skbuff.h | 1 +
 net/core/skbuff.c      | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b..f692968 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1035,6 +1035,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 266b954..f5670e6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1291,7 +1291,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
 	/* Only adjust this if it actually is csum_start rather than csum */
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1305,6 +1305,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_network_header += off;
 	skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
 void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 bpf-next 2/9] veth: Add driver XDP
  2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 1/9] net: Export skb_headers_offset_update Toshiaki Makita
@ 2018-07-30 10:43 ` Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 3/9] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

This is the basic implementation of veth driver XDP.

Incoming packets are sent from the peer veth device in the form of skb,
so this is generally doing the same thing as generic XDP.

This itself is not so useful, but a starting point to implement other
useful veth XDP features like TX and REDIRECT.

This introduces NAPI when XDP is enabled, because XDP is now heavily
relies on NAPI context. Use ptr_ring to emulate NIC ring. Tx function
enqueues packets to the ring and peer NAPI handler drains the ring.

Currently only one ring is allocated for each veth device, so it does
not scale on multiqueue env. This can be resolved by allocating rings
on the per-queue basis later.

Note that NAPI is not used but netif_rx is used when XDP is not loaded,
so this does not change the default behaviour.

v6:
- Check skb->len only when allocation is needed.
- Add __GFP_NOWARN to alloc_page() as it can be triggered by external
  events.

v3:
- Fix race on closing the device.
- Add extack messages in ndo_bpf.

v2:
- Squashed with the patch adding NAPI.
- Implement adjust_tail.
- Don't acquire consumer lock because it is guarded by NAPI.
- Make poll_controller noop since it is unnecessary.
- Register rxq_info on enabling XDP rather than on opening the device.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 374 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 367 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39..d3b9f10 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,18 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/ptr_ring.h>
+#include <linux/skb_array.h>
+#include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_RING_SIZE		256
+#define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
 	u64			packets;
 	u64			bytes;
@@ -30,9 +38,16 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+	struct napi_struct	xdp_napi;
+	struct net_device	*dev;
+	struct bpf_prog __rcu	*xdp_prog;
+	struct bpf_prog		*_xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	bool			rx_notify_masked;
+	struct ptr_ring		xdp_ring;
+	struct xdp_rxq_info	xdp_rxq;
 };
 
 /*
@@ -98,11 +113,43 @@ static void veth_get_ethtool_stats(struct net_device *dev,
 	.get_link_ksettings	= veth_get_link_ksettings,
 };
 
-static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+/* general routines */
+
+static void __veth_xdp_flush(struct veth_priv *priv)
+{
+	/* Write ptr_ring before reading rx_notify_masked */
+	smp_mb();
+	if (!priv->rx_notify_masked) {
+		priv->rx_notify_masked = true;
+		napi_schedule(&priv->xdp_napi);
+	}
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+	if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+		dev_kfree_skb_any(skb);
+		return NET_RX_DROP;
+	}
+
+	return NET_RX_SUCCESS;
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+
+	return __dev_forward_skb(dev, skb) ?: xdp ?
+		veth_xdp_rx(priv, skb) :
+		netif_rx(skb);
+}
+
+static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
 	int length = skb->len;
+	bool rcv_xdp = false;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -111,7 +158,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+	rcv_priv = netdev_priv(rcv);
+	rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+	if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -122,14 +172,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 drop:
 		atomic64_inc(&priv->dropped);
 	}
+
+	if (rcv_xdp)
+		__veth_xdp_flush(rcv_priv);
+
 	rcu_read_unlock();
+
 	return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -179,18 +230,254 @@ static void veth_set_multicast_list(struct net_device *dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+				      int buflen)
+{
+	struct sk_buff *skb;
+
+	if (!buflen) {
+		buflen = SKB_DATA_ALIGN(headroom + len) +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	}
+	skb = build_skb(head, buflen);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, headroom);
+	skb_put(skb, len);
+
+	return skb;
+}
+
+static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
+					struct sk_buff *skb)
+{
+	u32 pktlen, headroom, act, metalen;
+	void *orig_data, *orig_data_end;
+	struct bpf_prog *xdp_prog;
+	int mac_len, delta, off;
+	struct xdp_buff xdp;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (unlikely(!xdp_prog)) {
+		rcu_read_unlock();
+		goto out;
+	}
+
+	mac_len = skb->data - skb_mac_header(skb);
+	pktlen = skb->len + mac_len;
+	headroom = skb_headroom(skb) - mac_len;
+
+	if (skb_shared(skb) || skb_head_is_locked(skb) ||
+	    skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
+		struct sk_buff *nskb;
+		int size, head_off;
+		void *head, *start;
+		struct page *page;
+
+		size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
+		       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		if (size > PAGE_SIZE)
+			goto drop;
+
+		page = alloc_page(GFP_ATOMIC | __GFP_NOWARN);
+		if (!page)
+			goto drop;
+
+		head = page_address(page);
+		start = head + VETH_XDP_HEADROOM;
+		if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		nskb = veth_build_skb(head,
+				      VETH_XDP_HEADROOM + mac_len, skb->len,
+				      PAGE_SIZE);
+		if (!nskb) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		skb_copy_header(nskb, skb);
+		head_off = skb_headroom(nskb) - skb_headroom(skb);
+		skb_headers_offset_update(nskb, head_off);
+		if (skb->sk)
+			skb_set_owner_w(nskb, skb->sk);
+		consume_skb(skb);
+		skb = nskb;
+	}
+
+	xdp.data_hard_start = skb->head;
+	xdp.data = skb_mac_header(skb);
+	xdp.data_end = xdp.data + pktlen;
+	xdp.data_meta = xdp.data;
+	xdp.rxq = &priv->xdp_rxq;
+	orig_data = xdp.data;
+	orig_data_end = xdp.data_end;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	switch (act) {
+	case XDP_PASS:
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(priv->dev, xdp_prog, act);
+	case XDP_DROP:
+		goto drop;
+	}
+	rcu_read_unlock();
+
+	delta = orig_data - xdp.data;
+	off = mac_len + delta;
+	if (off > 0)
+		__skb_push(skb, off);
+	else if (off < 0)
+		__skb_pull(skb, -off);
+	skb->mac_header -= delta;
+	off = xdp.data_end - orig_data_end;
+	if (off != 0)
+		__skb_put(skb, off);
+	skb->protocol = eth_type_trans(skb, priv->dev);
+
+	metalen = xdp.data - xdp.data_meta;
+	if (metalen)
+		skb_metadata_set(skb, metalen);
+out:
+	return skb;
+drop:
+	rcu_read_unlock();
+	kfree_skb(skb);
+	return NULL;
+}
+
+static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+{
+	int i, done = 0;
+
+	for (i = 0; i < budget; i++) {
+		struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+
+		if (!skb)
+			break;
+
+		skb = veth_xdp_rcv_skb(priv, skb);
+
+		if (skb)
+			napi_gro_receive(&priv->xdp_napi, skb);
+
+		done++;
+	}
+
+	return done;
+}
+
+static int veth_poll(struct napi_struct *napi, int budget)
+{
+	struct veth_priv *priv =
+		container_of(napi, struct veth_priv, xdp_napi);
+	int done;
+
+	done = veth_xdp_rcv(priv, budget);
+
+	if (done < budget && napi_complete_done(napi, done)) {
+		/* Write rx_notify_masked before reading ptr_ring */
+		smp_store_mb(priv->rx_notify_masked, false);
+		if (unlikely(!__ptr_ring_empty(&priv->xdp_ring))) {
+			priv->rx_notify_masked = true;
+			napi_schedule(&priv->xdp_napi);
+		}
+	}
+
+	return done;
+}
+
+static int veth_napi_add(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int err;
+
+	err = ptr_ring_init(&priv->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
+	if (err)
+		return err;
+
+	netif_napi_add(dev, &priv->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
+	napi_enable(&priv->xdp_napi);
+
+	return 0;
+}
+
+static void veth_napi_del(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	napi_disable(&priv->xdp_napi);
+	netif_napi_del(&priv->xdp_napi);
+	priv->rx_notify_masked = false;
+	ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+}
+
+static int veth_enable_xdp(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int err;
+
+	if (!xdp_rxq_info_is_reg(&priv->xdp_rxq)) {
+		err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
+		if (err < 0)
+			return err;
+
+		err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0)
+			goto err;
+
+		err = veth_napi_add(dev);
+		if (err)
+			goto err;
+	}
+
+	rcu_assign_pointer(priv->xdp_prog, priv->_xdp_prog);
+
+	return 0;
+err:
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+
+	return err;
+}
+
+static void veth_disable_xdp(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	rcu_assign_pointer(priv->xdp_prog, NULL);
+	veth_napi_del(dev);
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
 	struct net_device *peer = rtnl_dereference(priv->peer);
+	int err;
 
 	if (!peer)
 		return -ENOTCONN;
 
+	if (priv->_xdp_prog) {
+		err = veth_enable_xdp(dev);
+		if (err)
+			return err;
+	}
+
 	if (peer->flags & IFF_UP) {
 		netif_carrier_on(dev);
 		netif_carrier_on(peer);
 	}
+
 	return 0;
 }
 
@@ -203,6 +490,9 @@ static int veth_close(struct net_device *dev)
 	if (peer)
 		netif_carrier_off(peer);
 
+	if (priv->_xdp_prog)
+		veth_disable_xdp(dev);
+
 	return 0;
 }
 
@@ -228,7 +518,7 @@ static void veth_dev_free(struct net_device *dev)
 static void veth_poll_controller(struct net_device *dev)
 {
 	/* veth only receives frames when its peer sends one
-	 * Since it's a synchronous operation, we are guaranteed
+	 * Since it has nothing to do with disabling irqs, we are guaranteed
 	 * never to have pending data when we poll for it so
 	 * there is nothing to do here.
 	 *
@@ -276,6 +566,72 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 	rcu_read_unlock();
 }
 
+static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
+			struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+	struct net_device *peer;
+	int err;
+
+	old_prog = priv->_xdp_prog;
+	priv->_xdp_prog = prog;
+	peer = rtnl_dereference(priv->peer);
+
+	if (prog) {
+		if (!peer) {
+			NL_SET_ERR_MSG_MOD(extack, "Cannot set XDP when peer is detached");
+			err = -ENOTCONN;
+			goto err;
+		}
+
+		if (dev->flags & IFF_UP) {
+			err = veth_enable_xdp(dev);
+			if (err) {
+				NL_SET_ERR_MSG_MOD(extack, "Setup for XDP failed");
+				goto err;
+			}
+		}
+	}
+
+	if (old_prog) {
+		if (!prog && dev->flags & IFF_UP)
+			veth_disable_xdp(dev);
+		bpf_prog_put(old_prog);
+	}
+
+	return 0;
+err:
+	priv->_xdp_prog = old_prog;
+
+	return err;
+}
+
+static u32 veth_xdp_query(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	const struct bpf_prog *xdp_prog;
+
+	xdp_prog = priv->_xdp_prog;
+	if (xdp_prog)
+		return xdp_prog->aux->id;
+
+	return 0;
+}
+
+static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return veth_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_QUERY_PROG:
+		xdp->prog_id = veth_xdp_query(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -290,6 +646,7 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 	.ndo_get_iflink		= veth_get_iflink,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
+	.ndo_bpf		= veth_xdp,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
@@ -451,10 +808,13 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 */
 
 	priv = netdev_priv(dev);
+	priv->dev = dev;
 	rcu_assign_pointer(priv->peer, peer);
 
 	priv = netdev_priv(peer);
+	priv->dev = peer;
 	rcu_assign_pointer(priv->peer, dev);
+
 	return 0;
 
 err_register_dev:
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 bpf-next 3/9] veth: Avoid drops by oversized packets when XDP is enabled
  2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 1/9] net: Export skb_headers_offset_update Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 2/9] veth: Add driver XDP Toshiaki Makita
@ 2018-07-30 10:43 ` Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring Toshiaki Makita
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

Oversized packets including GSO packets can be dropped if XDP is
enabled on receiver side, so don't send such packets from peer.

Drop TSO and SCTP fragmentation features so that veth devices themselves
segment packets with XDP enabled. Also cap MTU accordingly.

v4:
- Don't auto-adjust MTU but cap max MTU.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d3b9f10..9edf104 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -543,6 +543,23 @@ static int veth_get_iflink(const struct net_device *dev)
 	return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+					   netdev_features_t features)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
+
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		struct veth_priv *peer_priv = netdev_priv(peer);
+
+		if (peer_priv->_xdp_prog)
+			features &= ~NETIF_F_GSO_SOFTWARE;
+	}
+
+	return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
 	struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -572,6 +589,7 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	struct veth_priv *priv = netdev_priv(dev);
 	struct bpf_prog *old_prog;
 	struct net_device *peer;
+	unsigned int max_mtu;
 	int err;
 
 	old_prog = priv->_xdp_prog;
@@ -585,6 +603,15 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 			goto err;
 		}
 
+		max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+			  peer->hard_header_len -
+			  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		if (peer->mtu > max_mtu) {
+			NL_SET_ERR_MSG_MOD(extack, "Peer MTU is too large to set XDP");
+			err = -ERANGE;
+			goto err;
+		}
+
 		if (dev->flags & IFF_UP) {
 			err = veth_enable_xdp(dev);
 			if (err) {
@@ -592,14 +619,29 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 				goto err;
 			}
 		}
+
+		if (!old_prog) {
+			peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+			peer->max_mtu = max_mtu;
+		}
 	}
 
 	if (old_prog) {
-		if (!prog && dev->flags & IFF_UP)
-			veth_disable_xdp(dev);
+		if (!prog) {
+			if (dev->flags & IFF_UP)
+				veth_disable_xdp(dev);
+
+			if (peer) {
+				peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+				peer->max_mtu = ETH_MAX_MTU;
+			}
+		}
 		bpf_prog_put(old_prog);
 	}
 
+	if ((!!old_prog ^ !!prog) && peer)
+		netdev_update_features(peer);
+
 	return 0;
 err:
 	priv->_xdp_prog = old_prog;
@@ -644,6 +686,7 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	.ndo_poll_controller	= veth_poll_controller,
 #endif
 	.ndo_get_iflink		= veth_get_iflink,
+	.ndo_fix_features	= veth_fix_features,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring
  2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
                   ` (2 preceding siblings ...)
  2018-07-30 10:43 ` [PATCH v6 bpf-next 3/9] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
@ 2018-07-30 10:43 ` Toshiaki Makita
  2018-07-31 10:26   ` Jesper Dangaard Brouer
  2018-07-30 10:43 ` [PATCH v6 bpf-next 5/9] veth: Add ndo_xdp_xmit Toshiaki Makita
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

This is preparation for XDP TX and ndo_xdp_xmit.
This allows napi handler to handle xdp_frames through xdp ring as well
as sk_buff.

v3:
- Revert v2 change around rings and use a flag to differentiate skb and
  xdp_frame, since bulk skb xmit makes little performance difference
  for now.

v2:
- Use another ring instead of using flag to differentiate skb and
  xdp_frame. This approach makes bulk skb transmit possible in
  veth_xmit later.
- Clear xdp_frame feilds in skb->head.
- Implement adjust_tail.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 drivers/net/veth.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 82 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9edf104..9de0e90 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -22,12 +22,12 @@
 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
-#include <linux/skb_array.h>
 #include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_XDP_FLAG		BIT(0)
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -115,6 +115,24 @@ static void veth_get_ethtool_stats(struct net_device *dev,
 
 /* general routines */
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
+static void veth_ptr_free(void *ptr)
+{
+	if (veth_is_xdp_frame(ptr))
+		xdp_return_frame(veth_ptr_to_xdp(ptr));
+	else
+		kfree_skb(ptr);
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
 	/* Write ptr_ring before reading rx_notify_masked */
@@ -249,6 +267,61 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+					struct xdp_frame *frame)
+{
+	int len = frame->len, delta = 0;
+	struct bpf_prog *xdp_prog;
+	unsigned int headroom;
+	struct sk_buff *skb;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (likely(xdp_prog)) {
+		struct xdp_buff xdp;
+		u32 act;
+
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = frame->data + frame->len;
+		xdp.data_meta = frame->data - frame->metasize;
+		xdp.rxq = &priv->xdp_rxq;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+		switch (act) {
+		case XDP_PASS:
+			delta = frame->data - xdp.data;
+			len = xdp.data_end - xdp.data;
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(act);
+		case XDP_ABORTED:
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+		case XDP_DROP:
+			goto err_xdp;
+		}
+	}
+	rcu_read_unlock();
+
+	headroom = frame->data - delta - (void *)frame;
+	skb = veth_build_skb(frame, headroom, len, 0);
+	if (!skb) {
+		xdp_return_frame(frame);
+		goto err;
+	}
+
+	memset(frame, 0, sizeof(*frame));
+	skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+	return skb;
+err_xdp:
+	rcu_read_unlock();
+	xdp_return_frame(frame);
+
+	return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 					struct sk_buff *skb)
 {
@@ -359,12 +432,16 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 	int i, done = 0;
 
 	for (i = 0; i < budget; i++) {
-		struct sk_buff *skb = __ptr_ring_consume(&priv->xdp_ring);
+		void *ptr = __ptr_ring_consume(&priv->xdp_ring);
+		struct sk_buff *skb;
 
-		if (!skb)
+		if (!ptr)
 			break;
 
-		skb = veth_xdp_rcv_skb(priv, skb);
+		if (veth_is_xdp_frame(ptr))
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+		else
+			skb = veth_xdp_rcv_skb(priv, ptr);
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
@@ -417,7 +494,7 @@ static void veth_napi_del(struct net_device *dev)
 	napi_disable(&priv->xdp_napi);
 	netif_napi_del(&priv->xdp_napi);
 	priv->rx_notify_masked = false;
-	ptr_ring_cleanup(&priv->xdp_ring, __skb_array_destroy_skb);
+	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
 static int veth_enable_xdp(struct net_device *dev)
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 bpf-next 5/9] veth: Add ndo_xdp_xmit
  2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
                   ` (3 preceding siblings ...)
  2018-07-30 10:43 ` [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring Toshiaki Makita
@ 2018-07-30 10:43 ` Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 6/9] bpf: Make redirect_info accessible from modules Toshiaki Makita
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

This allows NIC's XDP to redirect packets to veth. The destination veth
device enqueues redirected packets to the napi ring of its peer, then
they are processed by XDP on its peer veth device.
This can be thought as calling another XDP program by XDP program using
REDIRECT, when the peer enables driver XDP.

Note that when the peer veth device does not set driver xdp, redirected
packets will be dropped because the peer is not ready for NAPI.

v4:
- Don't use xdp_ok_fwd_dev() because checking IFF_UP is not necessary.
  Add comments about it and check only MTU.

v2:
- Drop the part converting xdp_frame into skb when XDP is not enabled.
- Implement bulk interface of ndo_xdp_xmit.
- Implement XDP_XMIT_FLUSH bit and drop ndo_xdp_flush.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 drivers/net/veth.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9de0e90..c13f7a4 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -17,6 +17,7 @@
 #include <net/rtnetlink.h>
 #include <net/dst.h>
 #include <net/xfrm.h>
+#include <net/xdp.h>
 #include <linux/veth.h>
 #include <linux/module.h>
 #include <linux/bpf.h>
@@ -125,6 +126,11 @@ static void *veth_ptr_to_xdp(void *ptr)
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void veth_ptr_free(void *ptr)
 {
 	if (veth_is_xdp_frame(ptr))
@@ -267,6 +273,50 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, int n,
+			 struct xdp_frame **frames, u32 flags)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct net_device *rcv;
+	unsigned int max_len;
+	int i, drops = 0;
+
+	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
+		return -EINVAL;
+
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		return -ENXIO;
+
+	rcv_priv = netdev_priv(rcv);
+	/* Non-NULL xdp_prog ensures that xdp_ring is initialized on receive
+	 * side. This means an XDP program is loaded on the peer and the peer
+	 * device is up.
+	 */
+	if (!rcu_access_pointer(rcv_priv->xdp_prog))
+		return -ENXIO;
+
+	max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
+
+	spin_lock(&rcv_priv->xdp_ring.producer_lock);
+	for (i = 0; i < n; i++) {
+		struct xdp_frame *frame = frames[i];
+		void *ptr = veth_xdp_to_ptr(frame);
+
+		if (unlikely(frame->len > max_len ||
+			     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+			xdp_return_frame_rx_napi(frame);
+			drops++;
+		}
+	}
+	spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+
+	if (flags & XDP_XMIT_FLUSH)
+		__veth_xdp_flush(rcv_priv);
+
+	return n - drops;
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 					struct xdp_frame *frame)
 {
@@ -767,6 +817,7 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
+	.ndo_xdp_xmit		= veth_xdp_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 bpf-next 6/9] bpf: Make redirect_info accessible from modules
  2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
                   ` (4 preceding siblings ...)
  2018-07-30 10:43 ` [PATCH v6 bpf-next 5/9] veth: Add ndo_xdp_xmit Toshiaki Makita
@ 2018-07-30 10:43 ` Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 7/9] xdp: Helpers for disabling napi_direct of xdp_return_frame Toshiaki Makita
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

We are going to add kern_flags field in redirect_info for kernel
internal use.
In order to avoid function call to access the flags, make redirect_info
accessible from modules. Also as it is now non-static, add prefix bpf_
to redirect_info.

v6:
- Fix sparse warning around EXPORT_SYMBOL.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/linux/filter.h | 10 ++++++++++
 net/core/filter.c      | 29 +++++++++++------------------
 2 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index c73dd73..4717af8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -537,6 +537,16 @@ struct sk_msg_buff {
 	struct list_head list;
 };
 
+struct bpf_redirect_info {
+	u32 ifindex;
+	u32 flags;
+	struct bpf_map *map;
+	struct bpf_map *map_to_flush;
+	unsigned long   map_owner;
+};
+
+DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
diff --git a/net/core/filter.c b/net/core/filter.c
index 104d560..2766a55 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2080,19 +2080,12 @@ static int __bpf_redirect(struct sk_buff *skb, struct net_device *dev,
 	.arg3_type      = ARG_ANYTHING,
 };
 
-struct redirect_info {
-	u32 ifindex;
-	u32 flags;
-	struct bpf_map *map;
-	struct bpf_map *map_to_flush;
-	unsigned long   map_owner;
-};
-
-static DEFINE_PER_CPU(struct redirect_info, redirect_info);
+DEFINE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
+EXPORT_PER_CPU_SYMBOL_GPL(bpf_redirect_info);
 
 BPF_CALL_2(bpf_redirect, u32, ifindex, u64, flags)
 {
-	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	if (unlikely(flags & ~(BPF_F_INGRESS)))
 		return TC_ACT_SHOT;
@@ -2105,7 +2098,7 @@ struct redirect_info {
 
 int skb_do_redirect(struct sk_buff *skb)
 {
-	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct net_device *dev;
 
 	dev = dev_get_by_index_rcu(dev_net(skb->dev), ri->ifindex);
@@ -3198,7 +3191,7 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 
 void xdp_do_flush_map(void)
 {
-	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct bpf_map *map = ri->map_to_flush;
 
 	ri->map_to_flush = NULL;
@@ -3243,7 +3236,7 @@ static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 			       struct bpf_prog *xdp_prog)
 {
-	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	unsigned long map_owner = ri->map_owner;
 	struct bpf_map *map = ri->map;
 	u32 index = ri->ifindex;
@@ -3283,7 +3276,7 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
 		    struct bpf_prog *xdp_prog)
 {
-	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	struct net_device *fwd;
 	u32 index = ri->ifindex;
 	int err;
@@ -3315,7 +3308,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 				       struct xdp_buff *xdp,
 				       struct bpf_prog *xdp_prog)
 {
-	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	unsigned long map_owner = ri->map_owner;
 	struct bpf_map *map = ri->map;
 	u32 index = ri->ifindex;
@@ -3366,7 +3359,7 @@ static int xdp_do_generic_redirect_map(struct net_device *dev,
 int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 			    struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
 {
-	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 	u32 index = ri->ifindex;
 	struct net_device *fwd;
 	int err = 0;
@@ -3397,7 +3390,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
 BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
 {
-	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	if (unlikely(flags))
 		return XDP_ABORTED;
@@ -3421,7 +3414,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 BPF_CALL_4(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex, u64, flags,
 	   unsigned long, map_owner)
 {
-	struct redirect_info *ri = this_cpu_ptr(&redirect_info);
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
 
 	if (unlikely(flags))
 		return XDP_ABORTED;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 bpf-next 7/9] xdp: Helpers for disabling napi_direct of xdp_return_frame
  2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
                   ` (5 preceding siblings ...)
  2018-07-30 10:43 ` [PATCH v6 bpf-next 6/9] bpf: Make redirect_info accessible from modules Toshiaki Makita
@ 2018-07-30 10:43 ` Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 8/9] veth: Add XDP TX and REDIRECT Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 9/9] veth: Support per queue XDP ring Toshiaki Makita
  8 siblings, 0 replies; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support of XDP_REDIRECT, it will redirects packets which
are redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.

This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when _rx_napi variant is used as
well as helper functions to use it.

A NAPI handler who wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.

v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
  avoid per-frame copy cost.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/linux/filter.h | 25 +++++++++++++++++++++++++
 net/core/xdp.c         |  6 ++++--
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4717af8..2b072da 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -543,10 +543,14 @@ struct bpf_redirect_info {
 	struct bpf_map *map;
 	struct bpf_map *map_to_flush;
 	unsigned long   map_owner;
+	u32 kern_flags;
 };
 
 DECLARE_PER_CPU(struct bpf_redirect_info, bpf_redirect_info);
 
+/* flags for bpf_redirect_info kern_flags */
+#define BPF_RI_F_RF_NO_DIRECT	BIT(0)	/* no napi_direct on return_frame */
+
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -775,6 +779,27 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
+static inline bool xdp_return_frame_no_direct(void)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	return ri->kern_flags & BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_set_return_frame_no_direct(void)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	ri->kern_flags |= BPF_RI_F_RF_NO_DIRECT;
+}
+
+static inline void xdp_clear_return_frame_no_direct(void)
+{
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+	ri->kern_flags &= ~BPF_RI_F_RF_NO_DIRECT;
+}
+
 static inline int xdp_ok_fwd_dev(const struct net_device *fwd,
 				 unsigned int pktlen)
 {
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 5728538..3dd99e1 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -330,10 +330,12 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
 		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
 		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
 		page = virt_to_head_page(data);
-		if (xa)
+		if (xa) {
+			napi_direct &= !xdp_return_frame_no_direct();
 			page_pool_put_page(xa->page_pool, page, napi_direct);
-		else
+		} else {
 			put_page(page);
+		}
 		rcu_read_unlock();
 		break;
 	case MEM_TYPE_PAGE_SHARED:
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 bpf-next 8/9] veth: Add XDP TX and REDIRECT
  2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
                   ` (6 preceding siblings ...)
  2018-07-30 10:43 ` [PATCH v6 bpf-next 7/9] xdp: Helpers for disabling napi_direct of xdp_return_frame Toshiaki Makita
@ 2018-07-30 10:43 ` Toshiaki Makita
  2018-07-30 10:43 ` [PATCH v6 bpf-next 9/9] veth: Support per queue XDP ring Toshiaki Makita
  8 siblings, 0 replies; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)          (XDP)         (XDP)

The intermediate XDP, redirecting packets from NIC to the other veth,
reuses xdp_mem_info from NIC so that page recycling of the NIC works on
the destination veth's XDP.
In this way return_frame is not fully guarded by NAPI, since another
NAPI handler on another cpu may use the same xdp_mem_info concurrently.
Thus disable napi_direct by xdp_set_return_frame_no_direct() during the
NAPI context.

v4:
- Use xdp_[set|clear]_return_frame_no_direct() instead of a flag in
  xdp_mem_info.

v3:
- Fix double free when veth_xdp_tx() returns a positive value.
- Convert xdp_xmit and xdp_redir variables into flags.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 110 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index c13f7a4..b1ce3691 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -32,6 +32,10 @@
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
+/* Separating two types of XDP xmit */
+#define VETH_XDP_TX		BIT(0)
+#define VETH_XDP_REDIR		BIT(1)
+
 struct pcpu_vstats {
 	u64			packets;
 	u64			bytes;
@@ -45,6 +49,7 @@ struct veth_priv {
 	struct bpf_prog		*_xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
+	struct xdp_mem_info	xdp_mem;
 	unsigned		requested_headroom;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
@@ -317,10 +322,42 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	return n - drops;
 }
 
+static void veth_xdp_flush(struct net_device *dev)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct net_device *rcv;
+
+	rcu_read_lock();
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		goto out;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+		goto out;
+
+	__veth_xdp_flush(rcv_priv);
+out:
+	rcu_read_unlock();
+}
+
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return veth_xdp_xmit(dev, 1, &frame, 0);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-					struct xdp_frame *frame)
+					struct xdp_frame *frame,
+					unsigned int *xdp_xmit)
 {
 	int len = frame->len, delta = 0;
+	struct xdp_frame orig_frame;
 	struct bpf_prog *xdp_prog;
 	unsigned int headroom;
 	struct sk_buff *skb;
@@ -344,6 +381,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 			delta = frame->data - xdp.data;
 			len = xdp.data_end - xdp.data;
 			break;
+		case XDP_TX:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+				trace_xdp_exception(priv->dev, xdp_prog, act);
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_xmit |= VETH_XDP_TX;
+			rcu_read_unlock();
+			goto xdp_xmit;
+		case XDP_REDIRECT:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_xmit |= VETH_XDP_REDIR;
+			rcu_read_unlock();
+			goto xdp_xmit;
 		default:
 			bpf_warn_invalid_xdp_action(act);
 		case XDP_ABORTED:
@@ -368,12 +428,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 err_xdp:
 	rcu_read_unlock();
 	xdp_return_frame(frame);
-
+xdp_xmit:
 	return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-					struct sk_buff *skb)
+					struct sk_buff *skb,
+					unsigned int *xdp_xmit)
 {
 	u32 pktlen, headroom, act, metalen;
 	void *orig_data, *orig_data_end;
@@ -445,6 +506,26 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	switch (act) {
 	case XDP_PASS:
 		break;
+	case XDP_TX:
+		get_page(virt_to_page(xdp.data));
+		consume_skb(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+			goto err_xdp;
+		}
+		*xdp_xmit |= VETH_XDP_TX;
+		rcu_read_unlock();
+		goto xdp_xmit;
+	case XDP_REDIRECT:
+		get_page(virt_to_page(xdp.data));
+		consume_skb(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (xdp_do_redirect(priv->dev, &xdp, xdp_prog))
+			goto err_xdp;
+		*xdp_xmit |= VETH_XDP_REDIR;
+		rcu_read_unlock();
+		goto xdp_xmit;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
@@ -475,9 +556,15 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	rcu_read_unlock();
 	kfree_skb(skb);
 	return NULL;
+err_xdp:
+	rcu_read_unlock();
+	page_frag_free(xdp.data);
+xdp_xmit:
+	return NULL;
 }
 
-static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+static int veth_xdp_rcv(struct veth_priv *priv, int budget,
+			unsigned int *xdp_xmit)
 {
 	int i, done = 0;
 
@@ -488,10 +575,12 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 		if (!ptr)
 			break;
 
-		if (veth_is_xdp_frame(ptr))
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
-		else
-			skb = veth_xdp_rcv_skb(priv, ptr);
+		if (veth_is_xdp_frame(ptr)) {
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
+					       xdp_xmit);
+		} else {
+			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit);
+		}
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
@@ -506,9 +595,11 @@ static int veth_poll(struct napi_struct *napi, int budget)
 {
 	struct veth_priv *priv =
 		container_of(napi, struct veth_priv, xdp_napi);
+	unsigned int xdp_xmit = 0;
 	int done;
 
-	done = veth_xdp_rcv(priv, budget);
+	xdp_set_return_frame_no_direct();
+	done = veth_xdp_rcv(priv, budget, &xdp_xmit);
 
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
@@ -519,6 +610,12 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		}
 	}
 
+	if (xdp_xmit & VETH_XDP_TX)
+		veth_xdp_flush(priv->dev);
+	if (xdp_xmit & VETH_XDP_REDIR)
+		xdp_do_flush_map();
+	xdp_clear_return_frame_no_direct();
+
 	return done;
 }
 
@@ -565,6 +662,9 @@ static int veth_enable_xdp(struct net_device *dev)
 		err = veth_napi_add(dev);
 		if (err)
 			goto err;
+
+		/* Save original mem info as it can be overwritten */
+		priv->xdp_mem = priv->xdp_rxq.mem;
 	}
 
 	rcu_assign_pointer(priv->xdp_prog, priv->_xdp_prog);
@@ -582,6 +682,7 @@ static void veth_disable_xdp(struct net_device *dev)
 
 	rcu_assign_pointer(priv->xdp_prog, NULL);
 	veth_napi_del(dev);
+	priv->xdp_rxq.mem = priv->xdp_mem;
 	xdp_rxq_info_unreg(&priv->xdp_rxq);
 }
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 bpf-next 9/9] veth: Support per queue XDP ring
  2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
                   ` (7 preceding siblings ...)
  2018-07-30 10:43 ` [PATCH v6 bpf-next 8/9] veth: Add XDP TX and REDIRECT Toshiaki Makita
@ 2018-07-30 10:43 ` Toshiaki Makita
  8 siblings, 0 replies; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-30 10:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann
  Cc: Toshiaki Makita, netdev, Jesper Dangaard Brouer, Jakub Kicinski,
	John Fastabend

Move XDP and napi related fields in veth_priv to newly created veth_rq
structure.

When xdp_frames are enqueued from ndo_xdp_xmit and XDP_TX, rxq is
selected by current cpu.

When skbs are enqueued from the peer device, rxq is one to one mapping
of its peer txq. This way we have a restriction that the number of rxqs
must not less than the number of peer txqs, but leave the possibility to
achieve bulk skb xmit in the future because txq lock would make it
possible to remove rxq ptr_ring lock.

v3:
- Add extack messages.
- Fix array overrun in veth_xmit.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 278 ++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 188 insertions(+), 90 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index b1ce3691..c276a72 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -42,20 +42,24 @@ struct pcpu_vstats {
 	struct u64_stats_sync	syncp;
 };
 
-struct veth_priv {
+struct veth_rq {
 	struct napi_struct	xdp_napi;
 	struct net_device	*dev;
 	struct bpf_prog __rcu	*xdp_prog;
-	struct bpf_prog		*_xdp_prog;
-	struct net_device __rcu	*peer;
-	atomic64_t		dropped;
 	struct xdp_mem_info	xdp_mem;
-	unsigned		requested_headroom;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
 };
 
+struct veth_priv {
+	struct net_device __rcu	*peer;
+	atomic64_t		dropped;
+	struct bpf_prog		*_xdp_prog;
+	struct veth_rq		*rq;
+	unsigned int		requested_headroom;
+};
+
 /*
  * ethtool interface
  */
@@ -144,19 +148,19 @@ static void veth_ptr_free(void *ptr)
 		kfree_skb(ptr);
 }
 
-static void __veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_rq *rq)
 {
 	/* Write ptr_ring before reading rx_notify_masked */
 	smp_mb();
-	if (!priv->rx_notify_masked) {
-		priv->rx_notify_masked = true;
-		napi_schedule(&priv->xdp_napi);
+	if (!rq->rx_notify_masked) {
+		rq->rx_notify_masked = true;
+		napi_schedule(&rq->xdp_napi);
 	}
 }
 
-static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 {
-	if (unlikely(ptr_ring_produce(&priv->xdp_ring, skb))) {
+	if (unlikely(ptr_ring_produce(&rq->xdp_ring, skb))) {
 		dev_kfree_skb_any(skb);
 		return NET_RX_DROP;
 	}
@@ -164,21 +168,22 @@ static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
 	return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
+			    struct veth_rq *rq, bool xdp)
 {
-	struct veth_priv *priv = netdev_priv(dev);
-
 	return __dev_forward_skb(dev, skb) ?: xdp ?
-		veth_xdp_rx(priv, skb) :
+		veth_xdp_rx(rq, skb) :
 		netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct veth_rq *rq = NULL;
 	struct net_device *rcv;
 	int length = skb->len;
 	bool rcv_xdp = false;
+	int rxq;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -188,9 +193,15 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	rcv_priv = netdev_priv(rcv);
-	rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+	rxq = skb_get_queue_mapping(skb);
+	if (rxq < rcv->real_num_rx_queues) {
+		rq = &rcv_priv->rq[rxq];
+		rcv_xdp = rcu_access_pointer(rq->xdp_prog);
+		if (rcv_xdp)
+			skb_record_rx_queue(skb, rxq);
+	}
 
-	if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
+	if (likely(veth_forward_skb(rcv, skb, rq, rcv_xdp) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -203,7 +214,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 
 	if (rcv_xdp)
-		__veth_xdp_flush(rcv_priv);
+		__veth_xdp_flush(rq);
 
 	rcu_read_unlock();
 
@@ -278,12 +289,18 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static int veth_select_rxq(struct net_device *dev)
+{
+	return smp_processor_id() % dev->real_num_rx_queues;
+}
+
 static int veth_xdp_xmit(struct net_device *dev, int n,
 			 struct xdp_frame **frames, u32 flags)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
 	unsigned int max_len;
+	struct veth_rq *rq;
 	int i, drops = 0;
 
 	if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK))
@@ -294,30 +311,31 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 		return -ENXIO;
 
 	rcv_priv = netdev_priv(rcv);
+	rq = &rcv_priv->rq[veth_select_rxq(rcv)];
 	/* Non-NULL xdp_prog ensures that xdp_ring is initialized on receive
 	 * side. This means an XDP program is loaded on the peer and the peer
 	 * device is up.
 	 */
-	if (!rcu_access_pointer(rcv_priv->xdp_prog))
+	if (!rcu_access_pointer(rq->xdp_prog))
 		return -ENXIO;
 
 	max_len = rcv->mtu + rcv->hard_header_len + VLAN_HLEN;
 
-	spin_lock(&rcv_priv->xdp_ring.producer_lock);
+	spin_lock(&rq->xdp_ring.producer_lock);
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *frame = frames[i];
 		void *ptr = veth_xdp_to_ptr(frame);
 
 		if (unlikely(frame->len > max_len ||
-			     __ptr_ring_produce(&rcv_priv->xdp_ring, ptr))) {
+			     __ptr_ring_produce(&rq->xdp_ring, ptr))) {
 			xdp_return_frame_rx_napi(frame);
 			drops++;
 		}
 	}
-	spin_unlock(&rcv_priv->xdp_ring.producer_lock);
+	spin_unlock(&rq->xdp_ring.producer_lock);
 
 	if (flags & XDP_XMIT_FLUSH)
-		__veth_xdp_flush(rcv_priv);
+		__veth_xdp_flush(rq);
 
 	return n - drops;
 }
@@ -326,6 +344,7 @@ static void veth_xdp_flush(struct net_device *dev)
 {
 	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
+	struct veth_rq *rq;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -333,11 +352,12 @@ static void veth_xdp_flush(struct net_device *dev)
 		goto out;
 
 	rcv_priv = netdev_priv(rcv);
+	rq = &rcv_priv->rq[veth_select_rxq(rcv)];
 	/* xdp_ring is initialized on receive side? */
-	if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+	if (unlikely(!rcu_access_pointer(rq->xdp_prog)))
 		goto out;
 
-	__veth_xdp_flush(rcv_priv);
+	__veth_xdp_flush(rq);
 out:
 	rcu_read_unlock();
 }
@@ -352,7 +372,7 @@ static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
 	return veth_xdp_xmit(dev, 1, &frame, 0);
 }
 
-static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+static struct sk_buff *veth_xdp_rcv_one(struct veth_rq *rq,
 					struct xdp_frame *frame,
 					unsigned int *xdp_xmit)
 {
@@ -363,7 +383,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 	struct sk_buff *skb;
 
 	rcu_read_lock();
-	xdp_prog = rcu_dereference(priv->xdp_prog);
+	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (likely(xdp_prog)) {
 		struct xdp_buff xdp;
 		u32 act;
@@ -372,7 +392,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 		xdp.data = frame->data;
 		xdp.data_end = frame->data + frame->len;
 		xdp.data_meta = frame->data - frame->metasize;
-		xdp.rxq = &priv->xdp_rxq;
+		xdp.rxq = &rq->xdp_rxq;
 
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
@@ -385,8 +405,8 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 			orig_frame = *frame;
 			xdp.data_hard_start = frame;
 			xdp.rxq->mem = frame->mem;
-			if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
-				trace_xdp_exception(priv->dev, xdp_prog, act);
+			if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) {
+				trace_xdp_exception(rq->dev, xdp_prog, act);
 				frame = &orig_frame;
 				goto err_xdp;
 			}
@@ -397,7 +417,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 			orig_frame = *frame;
 			xdp.data_hard_start = frame;
 			xdp.rxq->mem = frame->mem;
-			if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+			if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
 				frame = &orig_frame;
 				goto err_xdp;
 			}
@@ -407,7 +427,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 		default:
 			bpf_warn_invalid_xdp_action(act);
 		case XDP_ABORTED:
-			trace_xdp_exception(priv->dev, xdp_prog, act);
+			trace_xdp_exception(rq->dev, xdp_prog, act);
 		case XDP_DROP:
 			goto err_xdp;
 		}
@@ -422,7 +442,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 	}
 
 	memset(frame, 0, sizeof(*frame));
-	skb->protocol = eth_type_trans(skb, priv->dev);
+	skb->protocol = eth_type_trans(skb, rq->dev);
 err:
 	return skb;
 err_xdp:
@@ -432,8 +452,7 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 	return NULL;
 }
 
-static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-					struct sk_buff *skb,
+static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq, struct sk_buff *skb,
 					unsigned int *xdp_xmit)
 {
 	u32 pktlen, headroom, act, metalen;
@@ -443,7 +462,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	struct xdp_buff xdp;
 
 	rcu_read_lock();
-	xdp_prog = rcu_dereference(priv->xdp_prog);
+	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (unlikely(!xdp_prog)) {
 		rcu_read_unlock();
 		goto out;
@@ -497,7 +516,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	xdp.data = skb_mac_header(skb);
 	xdp.data_end = xdp.data + pktlen;
 	xdp.data_meta = xdp.data;
-	xdp.rxq = &priv->xdp_rxq;
+	xdp.rxq = &rq->xdp_rxq;
 	orig_data = xdp.data;
 	orig_data_end = xdp.data_end;
 
@@ -509,9 +528,9 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	case XDP_TX:
 		get_page(virt_to_page(xdp.data));
 		consume_skb(skb);
-		xdp.rxq->mem = priv->xdp_mem;
-		if (unlikely(veth_xdp_tx(priv->dev, &xdp) < 0)) {
-			trace_xdp_exception(priv->dev, xdp_prog, act);
+		xdp.rxq->mem = rq->xdp_mem;
+		if (unlikely(veth_xdp_tx(rq->dev, &xdp) < 0)) {
+			trace_xdp_exception(rq->dev, xdp_prog, act);
 			goto err_xdp;
 		}
 		*xdp_xmit |= VETH_XDP_TX;
@@ -520,8 +539,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	case XDP_REDIRECT:
 		get_page(virt_to_page(xdp.data));
 		consume_skb(skb);
-		xdp.rxq->mem = priv->xdp_mem;
-		if (xdp_do_redirect(priv->dev, &xdp, xdp_prog))
+		xdp.rxq->mem = rq->xdp_mem;
+		if (xdp_do_redirect(rq->dev, &xdp, xdp_prog))
 			goto err_xdp;
 		*xdp_xmit |= VETH_XDP_REDIR;
 		rcu_read_unlock();
@@ -529,7 +548,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
-		trace_xdp_exception(priv->dev, xdp_prog, act);
+		trace_xdp_exception(rq->dev, xdp_prog, act);
 	case XDP_DROP:
 		goto drop;
 	}
@@ -545,7 +564,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	off = xdp.data_end - orig_data_end;
 	if (off != 0)
 		__skb_put(skb, off);
-	skb->protocol = eth_type_trans(skb, priv->dev);
+	skb->protocol = eth_type_trans(skb, rq->dev);
 
 	metalen = xdp.data - xdp.data_meta;
 	if (metalen)
@@ -563,27 +582,26 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	return NULL;
 }
 
-static int veth_xdp_rcv(struct veth_priv *priv, int budget,
-			unsigned int *xdp_xmit)
+static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
 {
 	int i, done = 0;
 
 	for (i = 0; i < budget; i++) {
-		void *ptr = __ptr_ring_consume(&priv->xdp_ring);
+		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
 		struct sk_buff *skb;
 
 		if (!ptr)
 			break;
 
 		if (veth_is_xdp_frame(ptr)) {
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
+			skb = veth_xdp_rcv_one(rq, veth_ptr_to_xdp(ptr),
 					       xdp_xmit);
 		} else {
-			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit);
+			skb = veth_xdp_rcv_skb(rq, ptr, xdp_xmit);
 		}
 
 		if (skb)
-			napi_gro_receive(&priv->xdp_napi, skb);
+			napi_gro_receive(&rq->xdp_napi, skb);
 
 		done++;
 	}
@@ -593,25 +611,25 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget,
 
 static int veth_poll(struct napi_struct *napi, int budget)
 {
-	struct veth_priv *priv =
-		container_of(napi, struct veth_priv, xdp_napi);
+	struct veth_rq *rq =
+		container_of(napi, struct veth_rq, xdp_napi);
 	unsigned int xdp_xmit = 0;
 	int done;
 
 	xdp_set_return_frame_no_direct();
-	done = veth_xdp_rcv(priv, budget, &xdp_xmit);
+	done = veth_xdp_rcv(rq, budget, &xdp_xmit);
 
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
-		smp_store_mb(priv->rx_notify_masked, false);
-		if (unlikely(!__ptr_ring_empty(&priv->xdp_ring))) {
-			priv->rx_notify_masked = true;
-			napi_schedule(&priv->xdp_napi);
+		smp_store_mb(rq->rx_notify_masked, false);
+		if (unlikely(!__ptr_ring_empty(&rq->xdp_ring))) {
+			rq->rx_notify_masked = true;
+			napi_schedule(&rq->xdp_napi);
 		}
 	}
 
 	if (xdp_xmit & VETH_XDP_TX)
-		veth_xdp_flush(priv->dev);
+		veth_xdp_flush(rq->dev);
 	if (xdp_xmit & VETH_XDP_REDIR)
 		xdp_do_flush_map();
 	xdp_clear_return_frame_no_direct();
@@ -622,56 +640,90 @@ static int veth_poll(struct napi_struct *napi, int budget)
 static int veth_napi_add(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
-	int err;
+	int err, i;
 
-	err = ptr_ring_init(&priv->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
-	if (err)
-		return err;
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		err = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
+		if (err)
+			goto err_xdp_ring;
+	}
 
-	netif_napi_add(dev, &priv->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
-	napi_enable(&priv->xdp_napi);
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		netif_napi_add(dev, &rq->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
+		napi_enable(&rq->xdp_napi);
+	}
 
 	return 0;
+err_xdp_ring:
+	for (i--; i >= 0; i--)
+		ptr_ring_cleanup(&priv->rq[i].xdp_ring, veth_ptr_free);
+
+	return err;
 }
 
 static void veth_napi_del(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+	int i;
 
-	napi_disable(&priv->xdp_napi);
-	netif_napi_del(&priv->xdp_napi);
-	priv->rx_notify_masked = false;
-	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		napi_disable(&rq->xdp_napi);
+		napi_hash_del(&rq->xdp_napi);
+	}
+	synchronize_net();
+
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		netif_napi_del(&rq->xdp_napi);
+		rq->rx_notify_masked = false;
+		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
+	}
 }
 
 static int veth_enable_xdp(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
-	int err;
+	int err, i;
 
-	if (!xdp_rxq_info_is_reg(&priv->xdp_rxq)) {
-		err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
-		if (err < 0)
-			return err;
+	if (!xdp_rxq_info_is_reg(&priv->rq[0].xdp_rxq)) {
+		for (i = 0; i < dev->real_num_rx_queues; i++) {
+			struct veth_rq *rq = &priv->rq[i];
 
-		err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
-						 MEM_TYPE_PAGE_SHARED, NULL);
-		if (err < 0)
-			goto err;
+			err = xdp_rxq_info_reg(&rq->xdp_rxq, dev, i);
+			if (err < 0)
+				goto err_rxq_reg;
+
+			err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+							 MEM_TYPE_PAGE_SHARED,
+							 NULL);
+			if (err < 0)
+				goto err_reg_mem;
+
+			/* Save original mem info as it can be overwritten */
+			rq->xdp_mem = rq->xdp_rxq.mem;
+		}
 
 		err = veth_napi_add(dev);
 		if (err)
-			goto err;
-
-		/* Save original mem info as it can be overwritten */
-		priv->xdp_mem = priv->xdp_rxq.mem;
+			goto err_rxq_reg;
 	}
 
-	rcu_assign_pointer(priv->xdp_prog, priv->_xdp_prog);
+	for (i = 0; i < dev->real_num_rx_queues; i++)
+		rcu_assign_pointer(priv->rq[i].xdp_prog, priv->_xdp_prog);
 
 	return 0;
-err:
-	xdp_rxq_info_unreg(&priv->xdp_rxq);
+err_reg_mem:
+	xdp_rxq_info_unreg(&priv->rq[i].xdp_rxq);
+err_rxq_reg:
+	for (i--; i >= 0; i--)
+		xdp_rxq_info_unreg(&priv->rq[i].xdp_rxq);
 
 	return err;
 }
@@ -679,11 +731,17 @@ static int veth_enable_xdp(struct net_device *dev)
 static void veth_disable_xdp(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
+	int i;
 
-	rcu_assign_pointer(priv->xdp_prog, NULL);
+	for (i = 0; i < dev->real_num_rx_queues; i++)
+		rcu_assign_pointer(priv->rq[i].xdp_prog, NULL);
 	veth_napi_del(dev);
-	priv->xdp_rxq.mem = priv->xdp_mem;
-	xdp_rxq_info_unreg(&priv->xdp_rxq);
+	for (i = 0; i < dev->real_num_rx_queues; i++) {
+		struct veth_rq *rq = &priv->rq[i];
+
+		rq->xdp_rxq.mem = rq->xdp_mem;
+		xdp_rxq_info_unreg(&rq->xdp_rxq);
+	}
 }
 
 static int veth_open(struct net_device *dev)
@@ -840,6 +898,12 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 			goto err;
 		}
 
+		if (dev->real_num_rx_queues < peer->real_num_tx_queues) {
+			NL_SET_ERR_MSG_MOD(extack, "XDP expects number of rx queues not less than peer tx queues");
+			err = -ENOSPC;
+			goto err;
+		}
+
 		if (dev->flags & IFF_UP) {
 			err = veth_enable_xdp(dev);
 			if (err) {
@@ -974,13 +1038,31 @@ static int veth_validate(struct nlattr *tb[], struct nlattr *data[],
 	return 0;
 }
 
+static int veth_alloc_queues(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	priv->rq = kcalloc(dev->num_rx_queues, sizeof(*priv->rq), GFP_KERNEL);
+	if (!priv->rq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static void veth_free_queues(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	kfree(priv->rq);
+}
+
 static struct rtnl_link_ops veth_link_ops;
 
 static int veth_newlink(struct net *src_net, struct net_device *dev,
 			struct nlattr *tb[], struct nlattr *data[],
 			struct netlink_ext_ack *extack)
 {
-	int err;
+	int err, i;
 	struct net_device *peer;
 	struct veth_priv *priv;
 	char ifname[IFNAMSIZ];
@@ -1033,6 +1115,12 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 		return PTR_ERR(peer);
 	}
 
+	err = veth_alloc_queues(peer);
+	if (err) {
+		put_net(net);
+		goto err_peer_alloc_queues;
+	}
+
 	if (!ifmp || !tbp[IFLA_ADDRESS])
 		eth_hw_addr_random(peer);
 
@@ -1061,6 +1149,10 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 * should be re-allocated
 	 */
 
+	err = veth_alloc_queues(dev);
+	if (err)
+		goto err_alloc_queues;
+
 	if (tb[IFLA_ADDRESS] == NULL)
 		eth_hw_addr_random(dev);
 
@@ -1080,22 +1172,28 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 */
 
 	priv = netdev_priv(dev);
-	priv->dev = dev;
+	for (i = 0; i < dev->real_num_rx_queues; i++)
+		priv->rq[i].dev = dev;
 	rcu_assign_pointer(priv->peer, peer);
 
 	priv = netdev_priv(peer);
-	priv->dev = peer;
+	for (i = 0; i < peer->real_num_rx_queues; i++)
+		priv->rq[i].dev = peer;
 	rcu_assign_pointer(priv->peer, dev);
 
 	return 0;
 
 err_register_dev:
+	veth_free_queues(dev);
+err_alloc_queues:
 	/* nothing to do */
 err_configure_peer:
 	unregister_netdevice(peer);
 	return err;
 
 err_register_peer:
+	veth_free_queues(peer);
+err_peer_alloc_queues:
 	free_netdev(peer);
 	return err;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring
  2018-07-30 10:43 ` [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring Toshiaki Makita
@ 2018-07-31 10:26   ` Jesper Dangaard Brouer
  2018-07-31 10:40     ` Toshiaki Makita
  0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2018-07-31 10:26 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Jakub Kicinski,
	John Fastabend, brouer, Karlsson, Magnus, Björn Töpel


Context needed from: [PATCH v6 bpf-next 2/9] veth: Add driver XDP

On Mon, 30 Jul 2018 19:43:44 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> +static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
> +				      int buflen)
> +{
> +	struct sk_buff *skb;
> +
> +	if (!buflen) {
> +		buflen = SKB_DATA_ALIGN(headroom + len) +
> +			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	}
> +	skb = build_skb(head, buflen);
> +	if (!skb)
> +		return NULL;
> +
> +	skb_reserve(skb, headroom);
> +	skb_put(skb, len);
> +
> +	return skb;
> +}


On Mon, 30 Jul 2018 19:43:46 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> +static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
> +					struct xdp_frame *frame)
> +{
> +	int len = frame->len, delta = 0;
> +	struct bpf_prog *xdp_prog;
> +	unsigned int headroom;
> +	struct sk_buff *skb;
> +
> +	rcu_read_lock();
> +	xdp_prog = rcu_dereference(priv->xdp_prog);
> +	if (likely(xdp_prog)) {
> +		struct xdp_buff xdp;
> +		u32 act;
> +
> +		xdp.data_hard_start = frame->data - frame->headroom;
> +		xdp.data = frame->data;
> +		xdp.data_end = frame->data + frame->len;
> +		xdp.data_meta = frame->data - frame->metasize;
> +		xdp.rxq = &priv->xdp_rxq;
> +
> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
> +
> +		switch (act) {
> +		case XDP_PASS:
> +			delta = frame->data - xdp.data;
> +			len = xdp.data_end - xdp.data;
> +			break;
> +		default:
> +			bpf_warn_invalid_xdp_action(act);
> +		case XDP_ABORTED:
> +			trace_xdp_exception(priv->dev, xdp_prog, act);
> +		case XDP_DROP:
> +			goto err_xdp;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	headroom = frame->data - delta - (void *)frame;
> +	skb = veth_build_skb(frame, headroom, len, 0);

Here you are adding an assumption that struct xdp_frame is always
located in-the-top of the packet-data area.  I tried hard not to add
such a dependency!  You can calculate the beginning of the frame from
the xdp_frame->data pointer.

Why not add such a dependency?  Because for AF_XDP zero-copy, we cannot
make such an assumption.  

Currently, when an RX-queue is in AF-XDP-ZC mode (MEM_TYPE_ZERO_COPY)
the packet will get dropped when calling convert_to_xdp_frame(), but as
the TODO comment indicated in convert_to_xdp_frame() this is not the
end-goal. 

The comment in convert_to_xdp_frame(), indicate we need a full
alloc+copy, but that is actually not necessary, if we can just use
another memory area for struct xdp_frame, and a pointer to data.  Thus,
allowing devmap-redir to work-ZC and allow cpumap-redir to do the copy
on the remote CPU.


> +	if (!skb) {
> +		xdp_return_frame(frame);
> +		goto err;
> +	}
> +
> +	memset(frame, 0, sizeof(*frame));
> +	skb->protocol = eth_type_trans(skb, priv->dev);
> +err:
> +	return skb;
> +err_xdp:
> +	rcu_read_unlock();
> +	xdp_return_frame(frame);
> +
> +	return NULL;
> +}


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring
  2018-07-31 10:26   ` Jesper Dangaard Brouer
@ 2018-07-31 10:40     ` Toshiaki Makita
  2018-07-31 12:46       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 15+ messages in thread
From: Toshiaki Makita @ 2018-07-31 10:40 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Jakub Kicinski,
	John Fastabend, Karlsson, Magnus, Björn Töpel

On 2018/07/31 19:26, Jesper Dangaard Brouer wrote:
> 
> Context needed from: [PATCH v6 bpf-next 2/9] veth: Add driver XDP
> 
> On Mon, 30 Jul 2018 19:43:44 +0900
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> +static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
>> +				      int buflen)
>> +{
>> +	struct sk_buff *skb;
>> +
>> +	if (!buflen) {
>> +		buflen = SKB_DATA_ALIGN(headroom + len) +
>> +			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>> +	}
>> +	skb = build_skb(head, buflen);
>> +	if (!skb)
>> +		return NULL;
>> +
>> +	skb_reserve(skb, headroom);
>> +	skb_put(skb, len);
>> +
>> +	return skb;
>> +}
> 
> 
> On Mon, 30 Jul 2018 19:43:46 +0900
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> +static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>> +					struct xdp_frame *frame)
>> +{
>> +	int len = frame->len, delta = 0;
>> +	struct bpf_prog *xdp_prog;
>> +	unsigned int headroom;
>> +	struct sk_buff *skb;
>> +
>> +	rcu_read_lock();
>> +	xdp_prog = rcu_dereference(priv->xdp_prog);
>> +	if (likely(xdp_prog)) {
>> +		struct xdp_buff xdp;
>> +		u32 act;
>> +
>> +		xdp.data_hard_start = frame->data - frame->headroom;
>> +		xdp.data = frame->data;
>> +		xdp.data_end = frame->data + frame->len;
>> +		xdp.data_meta = frame->data - frame->metasize;
>> +		xdp.rxq = &priv->xdp_rxq;
>> +
>> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
>> +
>> +		switch (act) {
>> +		case XDP_PASS:
>> +			delta = frame->data - xdp.data;
>> +			len = xdp.data_end - xdp.data;
>> +			break;
>> +		default:
>> +			bpf_warn_invalid_xdp_action(act);
>> +		case XDP_ABORTED:
>> +			trace_xdp_exception(priv->dev, xdp_prog, act);
>> +		case XDP_DROP:
>> +			goto err_xdp;
>> +		}
>> +	}
>> +	rcu_read_unlock();
>> +
>> +	headroom = frame->data - delta - (void *)frame;
>> +	skb = veth_build_skb(frame, headroom, len, 0);
> 
> Here you are adding an assumption that struct xdp_frame is always
> located in-the-top of the packet-data area.  I tried hard not to add
> such a dependency!  You can calculate the beginning of the frame from
> the xdp_frame->data pointer.
> 
> Why not add such a dependency?  Because for AF_XDP zero-copy, we cannot
> make such an assumption.  
> 
> Currently, when an RX-queue is in AF-XDP-ZC mode (MEM_TYPE_ZERO_COPY)
> the packet will get dropped when calling convert_to_xdp_frame(), but as
> the TODO comment indicated in convert_to_xdp_frame() this is not the
> end-goal. 
> 
> The comment in convert_to_xdp_frame(), indicate we need a full
> alloc+copy, but that is actually not necessary, if we can just use
> another memory area for struct xdp_frame, and a pointer to data.  Thus,
> allowing devmap-redir to work-ZC and allow cpumap-redir to do the copy
> on the remote CPU.

Thanks for pointing this out.
Seems you are saying xdp_frame area is not reusable. That means we
reduce usable headroom on every REDIRECT. I wanted to avoid this but
actually it is impossible, right?

>> +	if (!skb) {
>> +		xdp_return_frame(frame);
>> +		goto err;
>> +	}
>> +
>> +	memset(frame, 0, sizeof(*frame));
>> +	skb->protocol = eth_type_trans(skb, priv->dev);
>> +err:
>> +	return skb;
>> +err_xdp:
>> +	rcu_read_unlock();
>> +	xdp_return_frame(frame);
>> +
>> +	return NULL;
>> +}
> 
> 

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring
  2018-07-31 10:40     ` Toshiaki Makita
@ 2018-07-31 12:46       ` Jesper Dangaard Brouer
  2018-08-01  5:41         ` Toshiaki Makita
  0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2018-07-31 12:46 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Jakub Kicinski,
	John Fastabend, Karlsson, Magnus, Björn Töpel, brouer

On Tue, 31 Jul 2018 19:40:08 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> On 2018/07/31 19:26, Jesper Dangaard Brouer wrote:
> > 
> > Context needed from: [PATCH v6 bpf-next 2/9] veth: Add driver XDP
> > 
> > On Mon, 30 Jul 2018 19:43:44 +0900
> > Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> +static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
> >> +				      int buflen)
> >> +{
> >> +	struct sk_buff *skb;
> >> +
> >> +	if (!buflen) {
> >> +		buflen = SKB_DATA_ALIGN(headroom + len) +
> >> +			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >> +	}
> >> +	skb = build_skb(head, buflen);
> >> +	if (!skb)
> >> +		return NULL;
> >> +
> >> +	skb_reserve(skb, headroom);
> >> +	skb_put(skb, len);
> >> +
> >> +	return skb;
> >> +}  
> > 
> > 
> > On Mon, 30 Jul 2018 19:43:46 +0900
> > Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> +static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
> >> +					struct xdp_frame *frame)
> >> +{
> >> +	int len = frame->len, delta = 0;
> >> +	struct bpf_prog *xdp_prog;
> >> +	unsigned int headroom;
> >> +	struct sk_buff *skb;
> >> +
> >> +	rcu_read_lock();
> >> +	xdp_prog = rcu_dereference(priv->xdp_prog);
> >> +	if (likely(xdp_prog)) {
> >> +		struct xdp_buff xdp;
> >> +		u32 act;
> >> +
> >> +		xdp.data_hard_start = frame->data - frame->headroom;
> >> +		xdp.data = frame->data;
> >> +		xdp.data_end = frame->data + frame->len;
> >> +		xdp.data_meta = frame->data - frame->metasize;
> >> +		xdp.rxq = &priv->xdp_rxq;
> >> +
> >> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
> >> +
> >> +		switch (act) {
> >> +		case XDP_PASS:
> >> +			delta = frame->data - xdp.data;
> >> +			len = xdp.data_end - xdp.data;
> >> +			break;
> >> +		default:
> >> +			bpf_warn_invalid_xdp_action(act);
> >> +		case XDP_ABORTED:
> >> +			trace_xdp_exception(priv->dev, xdp_prog, act);
> >> +		case XDP_DROP:
> >> +			goto err_xdp;
> >> +		}
> >> +	}
> >> +	rcu_read_unlock();
> >> +
> >> +	headroom = frame->data - delta - (void *)frame;
> >> +	skb = veth_build_skb(frame, headroom, len, 0);  
> > 
> > Here you are adding an assumption that struct xdp_frame is always
> > located in-the-top of the packet-data area.  I tried hard not to add
> > such a dependency!  You can calculate the beginning of the frame from
> > the xdp_frame->data pointer.
> > 
> > Why not add such a dependency?  Because for AF_XDP zero-copy, we cannot
> > make such an assumption.  
> > 
> > Currently, when an RX-queue is in AF-XDP-ZC mode (MEM_TYPE_ZERO_COPY)
> > the packet will get dropped when calling convert_to_xdp_frame(), but as
> > the TODO comment indicated in convert_to_xdp_frame() this is not the
> > end-goal. 
> > 
> > The comment in convert_to_xdp_frame(), indicate we need a full
> > alloc+copy, but that is actually not necessary, if we can just use
> > another memory area for struct xdp_frame, and a pointer to data.  Thus,
> > allowing devmap-redir to work-ZC and allow cpumap-redir to do the copy
> > on the remote CPU.  
> 
> Thanks for pointing this out.
> Seems you are saying xdp_frame area is not reusable. That means we
> reduce usable headroom on every REDIRECT. I wanted to avoid this but
> actually it is impossible, right?

I'm not sure I understand fully...  has this something to do, with the
below memset?

When cpumap generate an SKB for the netstack, then we sacrifice/reduce
the SKB headroom available, by in convert_to_xdp_frame() reducing the
headroom by xdp_frame size.

 xdp_frame->headroom = headroom - sizeof(*xdp_frame)

In-order to avoid doing such memset of this area.  We are actually only
worried about exposing the 'data' pointer, thus we could just clear
that.  (See commit 6dfb970d3dbd, this is because Alexei is planing to
move from CAP_SYS_ADMIN to lesser privileged mode CAP_NET_ADMIN)

See commits:
 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp headroom")
 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")


> >> +	if (!skb) {
> >> +		xdp_return_frame(frame);
> >> +		goto err;
> >> +	}
> >> +
> >> +	memset(frame, 0, sizeof(*frame));

This memset can become a performance issue later, if we change the size
of xdp_frame. (e.g I'm considering to extend this with the DMA addr,
but I'm not sure about that scheme yet).

Currently sizeof(xdp_frame) == 32 bytes, and a memset of 32 bytes is
fast, due to compiler reasons.  Above 32 bytes are more expensive,
because the compiler translates this into a "rep stos" operation, which
is slower, as it needs to save some registers (to allow it to be
interrupted). See [1] for experiments.

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c

> >> +	skb->protocol = eth_type_trans(skb, priv->dev);
> >> +err:
> >> +	return skb;
> >> +err_xdp:
> >> +	rcu_read_unlock();
> >> +	xdp_return_frame(frame);
> >> +
> >> +	return NULL;
> >> +}  
> > 
> >   
> 



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring
  2018-07-31 12:46       ` Jesper Dangaard Brouer
@ 2018-08-01  5:41         ` Toshiaki Makita
  2018-08-01 15:09           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 15+ messages in thread
From: Toshiaki Makita @ 2018-08-01  5:41 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Daniel Borkmann, netdev, Jakub Kicinski,
	John Fastabend, Karlsson, Magnus, Björn Töpel

On 2018/07/31 21:46, Jesper Dangaard Brouer wrote:
> On Tue, 31 Jul 2018 19:40:08 +0900
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> On 2018/07/31 19:26, Jesper Dangaard Brouer wrote:
>>>
>>> Context needed from: [PATCH v6 bpf-next 2/9] veth: Add driver XDP
>>>
>>> On Mon, 30 Jul 2018 19:43:44 +0900
>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
>>>   
>>>> +static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
>>>> +				      int buflen)
>>>> +{
>>>> +	struct sk_buff *skb;
>>>> +
>>>> +	if (!buflen) {
>>>> +		buflen = SKB_DATA_ALIGN(headroom + len) +
>>>> +			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>>>> +	}
>>>> +	skb = build_skb(head, buflen);
>>>> +	if (!skb)
>>>> +		return NULL;
>>>> +
>>>> +	skb_reserve(skb, headroom);
>>>> +	skb_put(skb, len);
>>>> +
>>>> +	return skb;
>>>> +}  
>>>
>>>
>>> On Mon, 30 Jul 2018 19:43:46 +0900
>>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
>>>   
>>>> +static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
>>>> +					struct xdp_frame *frame)
>>>> +{
>>>> +	int len = frame->len, delta = 0;
>>>> +	struct bpf_prog *xdp_prog;
>>>> +	unsigned int headroom;
>>>> +	struct sk_buff *skb;
>>>> +
>>>> +	rcu_read_lock();
>>>> +	xdp_prog = rcu_dereference(priv->xdp_prog);
>>>> +	if (likely(xdp_prog)) {
>>>> +		struct xdp_buff xdp;
>>>> +		u32 act;
>>>> +
>>>> +		xdp.data_hard_start = frame->data - frame->headroom;
>>>> +		xdp.data = frame->data;
>>>> +		xdp.data_end = frame->data + frame->len;
>>>> +		xdp.data_meta = frame->data - frame->metasize;
>>>> +		xdp.rxq = &priv->xdp_rxq;
>>>> +
>>>> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);
>>>> +
>>>> +		switch (act) {
>>>> +		case XDP_PASS:
>>>> +			delta = frame->data - xdp.data;
>>>> +			len = xdp.data_end - xdp.data;
>>>> +			break;
>>>> +		default:
>>>> +			bpf_warn_invalid_xdp_action(act);
>>>> +		case XDP_ABORTED:
>>>> +			trace_xdp_exception(priv->dev, xdp_prog, act);
>>>> +		case XDP_DROP:
>>>> +			goto err_xdp;
>>>> +		}
>>>> +	}
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	headroom = frame->data - delta - (void *)frame;
>>>> +	skb = veth_build_skb(frame, headroom, len, 0);  
>>>
>>> Here you are adding an assumption that struct xdp_frame is always
>>> located in-the-top of the packet-data area.  I tried hard not to add
>>> such a dependency!  You can calculate the beginning of the frame from
>>> the xdp_frame->data pointer.
>>>
>>> Why not add such a dependency?  Because for AF_XDP zero-copy, we cannot
>>> make such an assumption.  
>>>
>>> Currently, when an RX-queue is in AF-XDP-ZC mode (MEM_TYPE_ZERO_COPY)
>>> the packet will get dropped when calling convert_to_xdp_frame(), but as
>>> the TODO comment indicated in convert_to_xdp_frame() this is not the
>>> end-goal. 
>>>
>>> The comment in convert_to_xdp_frame(), indicate we need a full
>>> alloc+copy, but that is actually not necessary, if we can just use
>>> another memory area for struct xdp_frame, and a pointer to data.  Thus,
>>> allowing devmap-redir to work-ZC and allow cpumap-redir to do the copy
>>> on the remote CPU.  
>>
>> Thanks for pointing this out.
>> Seems you are saying xdp_frame area is not reusable. That means we
>> reduce usable headroom on every REDIRECT. I wanted to avoid this but
>> actually it is impossible, right?
> 
> I'm not sure I understand fully...  has this something to do, with the
> below memset?

Sorry for not being so clear...
It has something to do with the memset as well but mainly I was talking
about XDP_TX and REDIRECT introduced in patch 8. On REDIRECT,
dev_map_enqueue() calls convert_to_xdp_frame() so we use the headroom
for struct xdp_frame on REDIRECT. If we don't reuse xdp_frame region of
the original xdp packet, we reduce the headroom size each time on
REDIRECT. When ZC is used, in the future xdp_frame can be non-contiguous
to the buffer, so we cannot reuse the xdp_frame region in
convert_to_xdp_frame()? But current convert_to_xdp_frame()
implementation requires xdp_frame region in headroom so I think I cannot
avoid this dependency now.

SKB has a similar problem if we cannot reuse it. It can be passed to a
bridge and redirected to another veth which has driver XDP. In that case
we need to reallocate the page if we have reduced the headroom because
sufficient headroom is required for XDP processing for now (can we
remove this requirement actually?).
Instead I think I need to drop the packet (or reallocate the buffer and
copy data) for ZC according to this patch.
https://patchwork.codeaurora.org/patch/540887/

> When cpumap generate an SKB for the netstack, then we sacrifice/reduce
> the SKB headroom available, by in convert_to_xdp_frame() reducing the
> headroom by xdp_frame size.
> 
>  xdp_frame->headroom = headroom - sizeof(*xdp_frame)
> 
> In-order to avoid doing such memset of this area.  We are actually only
> worried about exposing the 'data' pointer, thus we could just clear
> that.  (See commit 6dfb970d3dbd, this is because Alexei is planing to
> move from CAP_SYS_ADMIN to lesser privileged mode CAP_NET_ADMIN)
> 
> See commits:
>  97e19cce05e5 ("bpf: reserve xdp_frame size in xdp headroom")
>  6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")

We have talked about that...
https://patchwork.ozlabs.org/patch/903536/

The memset is introduced as per your feedback, but I'm still not sure if
we need this. In general the headroom is not cleared after allocation in
drivers, so anyway unprivileged users should not see it no matter if it
contains xdp_frame or not...

> 
> 
>>>> +	if (!skb) {
>>>> +		xdp_return_frame(frame);
>>>> +		goto err;
>>>> +	}
>>>> +
>>>> +	memset(frame, 0, sizeof(*frame));
> 
> This memset can become a performance issue later, if we change the size
> of xdp_frame. (e.g I'm considering to extend this with the DMA addr,
> but I'm not sure about that scheme yet).
> 
> Currently sizeof(xdp_frame) == 32 bytes, and a memset of 32 bytes is
> fast, due to compiler reasons.  Above 32 bytes are more expensive,
> because the compiler translates this into a "rep stos" operation, which
> is slower, as it needs to save some registers (to allow it to be
> interrupted). See [1] for experiments.
> 
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c
> 
>>>> +	skb->protocol = eth_type_trans(skb, priv->dev);
>>>> +err:
>>>> +	return skb;
>>>> +err_xdp:
>>>> +	rcu_read_unlock();
>>>> +	xdp_return_frame(frame);
>>>> +
>>>> +	return NULL;
>>>> +}  
>>>
>>>   
>>
> 
> 
> 

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring
  2018-08-01  5:41         ` Toshiaki Makita
@ 2018-08-01 15:09           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2018-08-01 15:09 UTC (permalink / raw)
  To: Toshiaki Makita, Alexei Starovoitov
  Cc: Daniel Borkmann, netdev, Jakub Kicinski, John Fastabend,
	Karlsson, Magnus, Björn Töpel, brouer

On Wed, 1 Aug 2018 14:41:08 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> On 2018/07/31 21:46, Jesper Dangaard Brouer wrote:
> > On Tue, 31 Jul 2018 19:40:08 +0900
> > Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> On 2018/07/31 19:26, Jesper Dangaard Brouer wrote:  
> >>>
> >>> Context needed from: [PATCH v6 bpf-next 2/9] veth: Add driver XDP
> >>>
> >>> On Mon, 30 Jul 2018 19:43:44 +0900
> >>> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> >>>     
[...]
> >>>
> >>> Here you are adding an assumption that struct xdp_frame is always
> >>> located in-the-top of the packet-data area.  I tried hard not to add
> >>> such a dependency!  You can calculate the beginning of the frame from
> >>> the xdp_frame->data pointer.
> >>>
> >>> Why not add such a dependency?  Because for AF_XDP zero-copy, we cannot
> >>> make such an assumption.  
> >>>
> >>> Currently, when an RX-queue is in AF-XDP-ZC mode (MEM_TYPE_ZERO_COPY)
> >>> the packet will get dropped when calling convert_to_xdp_frame(), but as
> >>> the TODO comment indicated in convert_to_xdp_frame() this is not the
> >>> end-goal. 
> >>>
> >>> The comment in convert_to_xdp_frame(), indicate we need a full
> >>> alloc+copy, but that is actually not necessary, if we can just use
> >>> another memory area for struct xdp_frame, and a pointer to data.  Thus,
> >>> allowing devmap-redir to work-ZC and allow cpumap-redir to do the copy
> >>> on the remote CPU.    
> >>
> >> Thanks for pointing this out.
> >> Seems you are saying xdp_frame area is not reusable. That means we
> >> reduce usable headroom on every REDIRECT. I wanted to avoid this but
> >> actually it is impossible, right?  
> > 
> > I'm not sure I understand fully...  has this something to do, with the
> > below memset?  
> 
> Sorry for not being so clear...
> It has something to do with the memset as well but mainly I was talking
> about XDP_TX and REDIRECT introduced in patch 8. On REDIRECT,
> dev_map_enqueue() calls convert_to_xdp_frame() so we use the headroom
> for struct xdp_frame on REDIRECT. If we don't reuse xdp_frame region of
> the original xdp packet, we reduce the headroom size each time on
> REDIRECT. When ZC is used, in the future xdp_frame can be non-contiguous
> to the buffer, so we cannot reuse the xdp_frame region in
> convert_to_xdp_frame()? But current convert_to_xdp_frame()
> implementation requires xdp_frame region in headroom so I think I cannot
> avoid this dependency now.
> 
> SKB has a similar problem if we cannot reuse it. It can be passed to a
> bridge and redirected to another veth which has driver XDP. In that case
> we need to reallocate the page if we have reduced the headroom because
> sufficient headroom is required for XDP processing for now (can we
> remove this requirement actually?).

Okay, now I understand.  Your changes allow multiple levels of
XDP_REDIRECT between/into other veth net_devices.  This is very
interesting and exciting stuff, but also a bit scary, when thinking
about if we got he life-time correct for the different memory objects.

You have convinced me.  We should not sacrifice/reduce the headroom
this way.  I'll also fix up cpumap.

To avoid the performance penalty of the memset, I propose that we just
clear the xdp_frame->data pointer.  But lets implement it via a common
sanitize/scrub function.


> > When cpumap generate an SKB for the netstack, then we sacrifice/reduce
> > the SKB headroom available, by in convert_to_xdp_frame() reducing the
> > headroom by xdp_frame size.
> > 
> >  xdp_frame->headroom = headroom - sizeof(*xdp_frame)
> > 
> > In-order to avoid doing such memset of this area.  We are actually only
> > worried about exposing the 'data' pointer, thus we could just clear
> > that.  (See commit 6dfb970d3dbd, this is because Alexei is planing to
> > move from CAP_SYS_ADMIN to lesser privileged mode CAP_NET_ADMIN)
> > 
> > See commits:
> >  97e19cce05e5 ("bpf: reserve xdp_frame size in xdp headroom")
> >  6dfb970d3dbd ("xdp: avoid leaking info stored in frame data on page reuse")  
> 
> We have talked about that...
> https://patchwork.ozlabs.org/patch/903536/
> 
> The memset is introduced as per your feedback, but I'm still not sure if
> we need this. In general the headroom is not cleared after allocation in
> drivers, so anyway unprivileged users should not see it no matter if it
> contains xdp_frame or not...

I actually got this request from Alexei. That is why I implemented it.
Personally I don't think this clearing is really needed, until someone
actually makes the TC/cls_act BPF hook CAP_NET_ADMIN.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-08-01 16:56 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-30 10:43 [PATCH v6 bpf-next 0/9] veth: Driver XDP Toshiaki Makita
2018-07-30 10:43 ` [PATCH v6 bpf-next 1/9] net: Export skb_headers_offset_update Toshiaki Makita
2018-07-30 10:43 ` [PATCH v6 bpf-next 2/9] veth: Add driver XDP Toshiaki Makita
2018-07-30 10:43 ` [PATCH v6 bpf-next 3/9] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
2018-07-30 10:43 ` [PATCH v6 bpf-next 4/9] veth: Handle xdp_frames in xdp napi ring Toshiaki Makita
2018-07-31 10:26   ` Jesper Dangaard Brouer
2018-07-31 10:40     ` Toshiaki Makita
2018-07-31 12:46       ` Jesper Dangaard Brouer
2018-08-01  5:41         ` Toshiaki Makita
2018-08-01 15:09           ` Jesper Dangaard Brouer
2018-07-30 10:43 ` [PATCH v6 bpf-next 5/9] veth: Add ndo_xdp_xmit Toshiaki Makita
2018-07-30 10:43 ` [PATCH v6 bpf-next 6/9] bpf: Make redirect_info accessible from modules Toshiaki Makita
2018-07-30 10:43 ` [PATCH v6 bpf-next 7/9] xdp: Helpers for disabling napi_direct of xdp_return_frame Toshiaki Makita
2018-07-30 10:43 ` [PATCH v6 bpf-next 8/9] veth: Add XDP TX and REDIRECT Toshiaki Makita
2018-07-30 10:43 ` [PATCH v6 bpf-next 9/9] veth: Support per queue XDP ring Toshiaki Makita

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.