* [PATCH RFC 0/9] veth: Driver XDP
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This patch set introduces driver XDP for veth.
It is mainly meant to be used in conjunction with the redirect action of
another XDP program.

  NIC -----------> veth===veth
 (XDP) (redirect)        (XDP)

In this case an xdp_frame can be forwarded to the peer veth without
modification, so we can expect far better performance than with generic XDP.

The envisioned use cases are:

* Container managed XDP program
The container host redirects frames to containers with the XDP redirect
action, and privileged containers can deploy their own XDP programs (a
loader sketch follows this list).

* XDP program cascading
Two or more XDP programs can be called for each packet by redirecting
xdp_frames to veth.

* Internal interface for an XDP bridge
When using XDP redirection to create a virtual bridge, veth can be used
to create an internal interface for the bridge.
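
For illustration, here is a minimal sketch of how a container manager
might attach a program to a veth endpoint from C. It assumes libbpf
(bpf_prog_load() and bpf_set_link_xdp_fd() from tools/lib/bpf); the
interface name, object path and helper name are made up for the example
and are not part of this series:

  #include <linux/bpf.h>
  #include <net/if.h>
  #include <bpf/libbpf.h>
  #include <bpf/bpf.h>

  /* Hypothetical loader sketch; error handling trimmed. */
  static int attach_xdp(const char *ifname, const char *obj_path)
  {
  	struct bpf_object *obj;
  	int prog_fd, ifindex;

  	ifindex = if_nametoindex(ifname);
  	if (!ifindex)
  		return -1;

  	/* Load the first XDP program found in the object file */
  	if (bpf_prog_load(obj_path, BPF_PROG_TYPE_XDP, &obj, &prog_fd))
  		return -1;

  	/* Attach it to the veth device (flags 0: default attach mode) */
  	return bpf_set_link_xdp_fd(ifindex, prog_fd, 0);
  }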

With a single core and simple XDP programs that only redirect and drop
packets, I got a 10.5 Mpps redirect/drop rate with an i40e 25G NIC + veth.

XXV710 (i40e) --- (XDP redirect) --> veth===veth (XDP drop)
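
The benchmark programs themselves are not included in this posting; the
following is an illustrative sketch of such a pair (the section names
and ifindex are assumptions, not taken from the actual setup):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  #define VETH_IFINDEX	10	/* assumed ifindex of the veth device */

  SEC("xdp_redirect")
  int xdp_redirect_prog(struct xdp_md *ctx)
  {
  	/* Runs on the XXV710: forward every frame to the veth */
  	return bpf_redirect(VETH_IFINDEX, 0);
  }

  SEC("xdp_drop")
  int xdp_drop_prog(struct xdp_md *ctx)
  {
  	/* Runs on the peer veth: drop everything */
  	return XDP_DROP;
  }

  char _license[] SEC("license") = "GPL";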

This changeset uses NAPI to implement ndo_xdp_xmit and XDP_TX/REDIRECT,
mainly because I wanted to avoid stack inflation caused by recursive
calls into XDP programs.

As an RFC, this does not yet implement the recently introduced
xdp_adjust_tail, and it is based on top of Jesper's redirect memory
return API patch set (684009d4fdaf).
Any feedback is welcome. Thanks!

Toshiaki Makita (9):
  net: Export skb_headers_offset_update and skb_copy_header
  veth: Add driver XDP
  veth: Avoid drops by oversized packets when XDP is enabled
  veth: Use NAPI for XDP
  veth: Handle xdp_frame in xdp napi ring
  veth: Add ndo_xdp_xmit
  veth: Add XDP TX and REDIRECT
  veth: Avoid per-packet spinlock of XDP napi ring on dequeueing
  veth: Avoid per-packet spinlock of XDP napi ring on enqueueing

 drivers/net/veth.c     | 688 +++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/filter.h |  16 ++
 include/linux/skbuff.h |   2 +
 net/core/filter.c      |  11 +-
 net/core/skbuff.c      |  12 +-
 5 files changed, 699 insertions(+), 30 deletions(-)

-- 
2.14.3


* [PATCH RFC 1/9] net: Export skb_headers_offset_update and skb_copy_header
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Rename copy_skb_header to skb_copy_header and export it together with
skb_headers_offset_update. A later patch in this series uses them in the
veth driver to copy an skb into a freshly allocated, XDP-capable buffer.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c      | 12 +++++++-----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 9065477ed255..fdf80a9d4582 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1030,6 +1030,8 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 }
 
 struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
+void skb_headers_offset_update(struct sk_buff *skb, int off);
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old);
 int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask);
 struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t priority);
 struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t priority);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 345b51837ca8..531354900177 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1290,7 +1290,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(skb_clone);
 
-static void skb_headers_offset_update(struct sk_buff *skb, int off)
+void skb_headers_offset_update(struct sk_buff *skb, int off)
 {
 	/* Only adjust this if it actually is csum_start rather than csum */
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
@@ -1304,8 +1304,9 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
 	skb->inner_network_header += off;
 	skb->inner_mac_header += off;
 }
+EXPORT_SYMBOL(skb_headers_offset_update);
 
-static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
+void skb_copy_header(struct sk_buff *new, const struct sk_buff *old)
 {
 	__copy_skb_header(new, old);
 
@@ -1313,6 +1314,7 @@ static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
 	skb_shinfo(new)->gso_segs = skb_shinfo(old)->gso_segs;
 	skb_shinfo(new)->gso_type = skb_shinfo(old)->gso_type;
 }
+EXPORT_SYMBOL(skb_copy_header);
 
 static inline int skb_alloc_rx_flag(const struct sk_buff *skb)
 {
@@ -1355,7 +1357,7 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask)
 
 	BUG_ON(skb_copy_bits(skb, -headerlen, n->head, headerlen + skb->len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 	return n;
 }
 EXPORT_SYMBOL(skb_copy);
@@ -1419,7 +1421,7 @@ struct sk_buff *__pskb_copy_fclone(struct sk_buff *skb, int headroom,
 		skb_clone_fraglist(n);
 	}
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 out:
 	return n;
 }
@@ -1599,7 +1601,7 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb,
 	BUG_ON(skb_copy_bits(skb, -head_copy_len, n->head + head_copy_off,
 			     skb->len + head_copy_len));
 
-	copy_skb_header(n, skb);
+	skb_copy_header(n, skb);
 
 	skb_headers_offset_update(n, newheadroom - oldheadroom);
 
-- 
2.14.3


* [PATCH RFC 2/9] veth: Add driver XDP
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This is a basic implementation of veth driver XDP.

Incoming packets arrive from the peer veth device in the form of an skb,
so this generally does the same thing as generic XDP.

By itself this is not very useful, but it is a starting point for other
veth XDP features such as TX and REDIRECT.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 210 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 205 insertions(+), 5 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index a69ad39ee57e..9c4197306716 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -19,10 +19,15 @@
 #include <net/xfrm.h>
 #include <linux/veth.h>
 #include <linux/module.h>
+#include <linux/bpf.h>
+#include <linux/filter.h>
+#include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+
 struct pcpu_vstats {
 	u64			packets;
 	u64			bytes;
@@ -30,9 +35,11 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	struct xdp_rxq_info	xdp_rxq;
 };
 
 /*
@@ -98,6 +105,25 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_link_ksettings	= veth_get_link_ksettings,
 };
 
+/* general routines */
+
+static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
+					struct sk_buff *skb);
+
+static int veth_xdp_rx(struct net_device *dev, struct sk_buff *skb)
+{
+	skb = veth_xdp_rcv_skb(dev, skb);
+	if (!skb)
+		return NET_RX_DROP;
+
+	return netif_rx(skb);
+}
+
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb)
+{
+	return __dev_forward_skb(dev, skb) ?: veth_xdp_rx(dev, skb);
+}
+
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -111,7 +137,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+	if (likely(veth_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -126,10 +152,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 	return NETDEV_TX_OK;
 }
 
-/*
- * general routines
- */
-
 static u64 veth_stats_one(struct pcpu_vstats *result, struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -179,19 +201,152 @@ static void veth_set_multicast_list(struct net_device *dev)
 {
 }
 
+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+				      int buflen)
+{
+	struct sk_buff *skb;
+
+	if (!buflen) {
+		buflen = SKB_DATA_ALIGN(headroom + len) +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	}
+	skb = build_skb(head, buflen);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, headroom);
+	skb_put(skb, len);
+
+	return skb;
+}
+
+static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
+					struct sk_buff *skb)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	u32 pktlen, headroom, act, metalen;
+	int size, mac_len, delta, off;
+	struct bpf_prog *xdp_prog;
+	struct xdp_buff xdp;
+	void *orig_data;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (!xdp_prog) {
+		rcu_read_unlock();
+		goto out;
+	}
+
+	mac_len = skb->data - skb_mac_header(skb);
+	pktlen = skb->len + mac_len;
+	size = SKB_DATA_ALIGN(VETH_XDP_HEADROOM + pktlen) +
+	       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	if (size > PAGE_SIZE)
+		goto drop;
+
+	headroom = skb_headroom(skb) - mac_len;
+	if (skb_shared(skb) || skb_head_is_locked(skb) ||
+	    skb_is_nonlinear(skb) || headroom < XDP_PACKET_HEADROOM) {
+		struct sk_buff *nskb;
+		void *head, *start;
+		struct page *page;
+		int head_off;
+
+		page = alloc_page(GFP_ATOMIC);
+		if (!page)
+			goto drop;
+
+		head = page_address(page);
+		start = head + VETH_XDP_HEADROOM;
+		if (skb_copy_bits(skb, -mac_len, start, pktlen)) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		nskb = veth_build_skb(head,
+				      VETH_XDP_HEADROOM + mac_len, skb->len,
+				      PAGE_SIZE);
+		if (!nskb) {
+			page_frag_free(head);
+			goto drop;
+		}
+
+		skb_copy_header(nskb, skb);
+		head_off = skb_headroom(nskb) - skb_headroom(skb);
+		skb_headers_offset_update(nskb, head_off);
+		dev_consume_skb_any(skb);
+		skb = nskb;
+	}
+
+	xdp.data_hard_start = skb->head;
+	xdp.data = skb_mac_header(skb);
+	xdp.data_end = xdp.data + pktlen;
+	xdp.data_meta = xdp.data;
+	xdp.rxq = &priv->xdp_rxq;
+	orig_data = xdp.data;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	switch (act) {
+	case XDP_PASS:
+		break;
+	default:
+		bpf_warn_invalid_xdp_action(act);
+	case XDP_ABORTED:
+		trace_xdp_exception(dev, xdp_prog, act);
+	case XDP_DROP:
+		goto drop;
+	}
+	rcu_read_unlock();
+
+	delta = orig_data - xdp.data;
+	off = mac_len + delta;
+	if (off > 0)
+		__skb_push(skb, off);
+	else if (off < 0)
+		__skb_pull(skb, -off);
+	skb->mac_header -= delta;
+	skb->protocol = eth_type_trans(skb, dev);
+
+	metalen = xdp.data - xdp.data_meta;
+	if (metalen)
+		skb_metadata_set(skb, metalen);
+out:
+	return skb;
+drop:
+	rcu_read_unlock();
+	dev_kfree_skb_any(skb);
+	return NULL;
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
 	struct net_device *peer = rtnl_dereference(priv->peer);
+	int err;
 
 	if (!peer)
 		return -ENOTCONN;
 
+	err = xdp_rxq_info_reg(&priv->xdp_rxq, dev, 0);
+	if (err < 0)
+		return err;
+
+	err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq,
+					 MEM_TYPE_PAGE_SHARED, NULL);
+	if (err < 0)
+		goto err_reg_mem;
+
 	if (peer->flags & IFF_UP) {
 		netif_carrier_on(dev);
 		netif_carrier_on(peer);
 	}
+
 	return 0;
+err_reg_mem:
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+
+	return err;
 }
 
 static int veth_close(struct net_device *dev)
@@ -203,6 +358,8 @@ static int veth_close(struct net_device *dev)
 	if (peer)
 		netif_carrier_off(peer);
 
+	xdp_rxq_info_unreg(&priv->xdp_rxq);
+
 	return 0;
 }
 
@@ -276,6 +433,48 @@ static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 	rcu_read_unlock();
 }
 
+static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
+			struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct bpf_prog *old_prog;
+
+	old_prog = rtnl_dereference(priv->xdp_prog);
+
+	rcu_assign_pointer(priv->xdp_prog, prog);
+
+	if (old_prog)
+		bpf_prog_put(old_prog);
+
+	return 0;
+}
+
+static u32 veth_xdp_query(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	const struct bpf_prog *xdp_prog;
+
+	xdp_prog = rtnl_dereference(priv->xdp_prog);
+	if (xdp_prog)
+		return xdp_prog->aux->id;
+
+	return 0;
+}
+
+static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return veth_xdp_set(dev, xdp->prog, xdp->extack);
+	case XDP_QUERY_PROG:
+		xdp->prog_id = veth_xdp_query(dev);
+		xdp->prog_attached = !!xdp->prog_id;
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -290,6 +489,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_get_iflink		= veth_get_iflink,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
+	.ndo_bpf		= veth_xdp,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.14.3


* [PATCH RFC 3/9] veth: Avoid drops by oversized packets when XDP is enabled
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

All oversized packets, including GSO packets, are dropped if XDP is
enabled on the receiver side, so don't send such packets from the peer.

Drop the TSO and SCTP fragmentation features so that the veth devices
themselves segment packets when XDP is enabled. Also cap the MTU
accordingly.
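
For a sense of scale: with 4 KB pages, a 14-byte Ethernet header,
XDP_PACKET_HEADROOM of 256, NET_IP_ALIGN of 2 (the generic default;
some architectures use 0) and a 320-byte cache-line-aligned
skb_shared_info (exact sizes vary by architecture and config), the cap
works out to roughly

  max_mtu = 4096 - (256 + 2) - 14 - 320 = 3504 bytes

i.e. PAGE_SIZE minus VETH_XDP_HEADROOM, the hard header length and
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)).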

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 9c4197306716..7271d9582b4a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -410,6 +410,23 @@ static int veth_get_iflink(const struct net_device *dev)
 	return iflink;
 }
 
+static netdev_features_t veth_fix_features(struct net_device *dev,
+					   netdev_features_t features)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct net_device *peer;
+
+	peer = rtnl_dereference(priv->peer);
+	if (peer) {
+		struct veth_priv *peer_priv = netdev_priv(peer);
+
+		if (rtnl_dereference(peer_priv->xdp_prog))
+			features &= ~NETIF_F_GSO_SOFTWARE;
+	}
+
+	return features;
+}
+
 static void veth_set_rx_headroom(struct net_device *dev, int new_hr)
 {
 	struct veth_priv *peer_priv, *priv = netdev_priv(dev);
@@ -438,13 +455,32 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 {
 	struct veth_priv *priv = netdev_priv(dev);
 	struct bpf_prog *old_prog;
+	struct net_device *peer;
 
 	old_prog = rtnl_dereference(priv->xdp_prog);
+	peer = rtnl_dereference(priv->peer);
+
+	if (!old_prog && prog && peer) {
+		peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+		peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+			peer->hard_header_len -
+			SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+		if (peer->mtu > peer->max_mtu)
+			dev_set_mtu(peer, peer->max_mtu);
+	}
 
 	rcu_assign_pointer(priv->xdp_prog, prog);
 
-	if (old_prog)
+	if (old_prog) {
 		bpf_prog_put(old_prog);
+		if (!prog && peer) {
+			peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+			peer->max_mtu = ETH_MAX_MTU;
+		}
+	}
+
+	if ((!!old_prog ^ !!prog) && peer)
+		netdev_update_features(peer);
 
 	return 0;
 }
@@ -487,6 +523,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_poll_controller	= veth_poll_controller,
 #endif
 	.ndo_get_iflink		= veth_get_iflink,
+	.ndo_fix_features	= veth_fix_features,
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
-- 
2.14.3


* [PATCH RFC 4/9] veth: Use NAPI for XDP
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

To avoid stack inflation from recursive XDP program calls in
ndo_xdp_xmit, this change introduces NAPI in veth.

Add veth's own NAPI handler, used when XDP is enabled.
A ptr_ring emulates the NIC ring: the Tx function enqueues packets to
the ring and the peer's NAPI handler drains it.

This also keeps the REDIRECT bulking interface simple. When ndo_xdp_xmit
is implemented later, ndo_xdp_flush schedules NAPI on the peer veth
device, and NAPI handles the xdp_frames enqueued by preceding
ndo_xdp_xmit calls. This is quite similar to a physical NIC's tx path,
which uses DMA ring descriptors and an MMIO doorbell.

Currently only one ring is allocated per veth device, so this does not
scale in multiqueue environments. That can be resolved in the future by
allocating rings on a per-queue basis.

Note that when XDP is not loaded, netif_rx is used instead of NAPI, so
the default behaviour is unchanged.
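
The subtle part is the barrier pairing between the producer and the NAPI
consumer. Condensed from the diff below, with explanatory comments added
here:

  /* Producer side (veth_xmit -> veth_xdp_flush): */
  	ptr_ring_produce(&priv->xdp_ring, skb);	/* A: publish entry */
  	smp_mb();				/* order A before B */
  	if (!priv->rx_notify_masked) {		/* B: is NAPI running? */
  		priv->rx_notify_masked = true;
  		napi_schedule(&priv->xdp_napi);	/* ring the "doorbell" */
  	}

  /* Consumer side (veth_poll), when it drained less than the budget: */
  	smp_store_mb(priv->rx_notify_masked, false);	/* C: unmask, fence */
  	if (unlikely(!ptr_ring_empty(&priv->xdp_ring))) {
  		priv->rx_notify_masked = true;		/* D: recheck ring */
  		napi_schedule(&priv->xdp_napi);
  	}

Either the producer observes rx_notify_masked == false at B and
schedules NAPI, or the consumer's recheck at D observes the new ring
entry; the barriers ensure the two sides cannot both miss.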

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 197 ++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 164 insertions(+), 33 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 7271d9582b4a..452771f31c30 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -21,11 +21,13 @@
 #include <linux/module.h>
 #include <linux/bpf.h>
 #include <linux/filter.h>
+#include <linux/ptr_ring.h>
 #include <linux/bpf_trace.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
 struct pcpu_vstats {
@@ -35,10 +37,14 @@ struct pcpu_vstats {
 };
 
 struct veth_priv {
+	struct napi_struct	xdp_napi;
+	struct net_device	*dev;
 	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
 	unsigned		requested_headroom;
+	bool			rx_notify_masked;
+	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
 };
 
@@ -107,28 +113,56 @@ static const struct ethtool_ops veth_ethtool_ops = {
 
 /* general routines */
 
-static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
-					struct sk_buff *skb);
+static void veth_ptr_free(void *ptr)
+{
+	if (!ptr)
+		return;
+	dev_kfree_skb_any(ptr);
+}
 
-static int veth_xdp_rx(struct net_device *dev, struct sk_buff *skb)
+static void veth_xdp_flush(struct veth_priv *priv)
 {
-	skb = veth_xdp_rcv_skb(dev, skb);
-	if (!skb)
+	/* Write ptr_ring before reading rx_notify_masked */
+	smp_mb();
+	if (!priv->rx_notify_masked) {
+		priv->rx_notify_masked = true;
+		napi_schedule(&priv->xdp_napi);
+	}
+}
+
+static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
+{
+	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
+		return -ENOSPC;
+
+	return 0;
+}
+
+static int veth_xdp_rx(struct veth_priv *priv, struct sk_buff *skb)
+{
+	if (unlikely(veth_xdp_enqueue(priv, skb))) {
+		dev_kfree_skb_any(skb);
 		return NET_RX_DROP;
+	}
 
-	return netif_rx(skb);
+	return NET_RX_SUCCESS;
 }
 
-static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb)
+static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb, bool xdp)
 {
-	return __dev_forward_skb(dev, skb) ?: veth_xdp_rx(dev, skb);
+	struct veth_priv *priv = netdev_priv(dev);
+
+	return __dev_forward_skb(dev, skb) ?: xdp ?
+		veth_xdp_rx(priv, skb) :
+		netif_rx(skb);
 }
 
 static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 {
-	struct veth_priv *priv = netdev_priv(dev);
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
 	struct net_device *rcv;
 	int length = skb->len;
+	bool rcv_xdp = false;
 
 	rcu_read_lock();
 	rcv = rcu_dereference(priv->peer);
@@ -137,7 +171,10 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	if (likely(veth_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
+	rcv_priv = netdev_priv(rcv);
+	rcv_xdp = rcu_access_pointer(rcv_priv->xdp_prog);
+
+	if (likely(veth_forward_skb(rcv, skb, rcv_xdp) == NET_RX_SUCCESS)) {
 		struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
 
 		u64_stats_update_begin(&stats->syncp);
@@ -148,7 +185,13 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 drop:
 		atomic64_inc(&priv->dropped);
 	}
+
+	/* TODO: check xmit_more and tx_stopped */
+	if (rcv_xdp)
+		veth_xdp_flush(rcv_priv);
+
 	rcu_read_unlock();
+
 	return NETDEV_TX_OK;
 }
 
@@ -220,10 +263,9 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
-static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
+static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 					struct sk_buff *skb)
 {
-	struct veth_priv *priv = netdev_priv(dev);
 	u32 pktlen, headroom, act, metalen;
 	int size, mac_len, delta, off;
 	struct bpf_prog *xdp_prog;
@@ -293,7 +335,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
-		trace_xdp_exception(dev, xdp_prog, act);
+		trace_xdp_exception(priv->dev, xdp_prog, act);
 	case XDP_DROP:
 		goto drop;
 	}
@@ -306,7 +348,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
 	else if (off < 0)
 		__skb_pull(skb, -off);
 	skb->mac_header -= delta;
-	skb->protocol = eth_type_trans(skb, dev);
+	skb->protocol = eth_type_trans(skb, priv->dev);
 
 	metalen = xdp.data - xdp.data_meta;
 	if (metalen)
@@ -319,6 +361,72 @@ static struct sk_buff *veth_xdp_rcv_skb(struct net_device *dev,
 	return NULL;
 }
 
+static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+{
+	int i, done = 0;
+
+	for (i = 0; i < budget; i++) {
+		void *ptr = ptr_ring_consume(&priv->xdp_ring);
+		struct sk_buff *skb;
+
+		if (!ptr)
+			break;
+
+		skb = veth_xdp_rcv_skb(priv, ptr);
+
+		if (skb)
+			napi_gro_receive(&priv->xdp_napi, skb);
+
+		done++;
+	}
+
+	return done;
+}
+
+static int veth_poll(struct napi_struct *napi, int budget)
+{
+	struct veth_priv *priv =
+		container_of(napi, struct veth_priv, xdp_napi);
+	int done;
+
+	done = veth_xdp_rcv(priv, budget);
+
+	if (done < budget && napi_complete_done(napi, done)) {
+		/* Write rx_notify_masked before reading ptr_ring */
+		smp_store_mb(priv->rx_notify_masked, false);
+		if (unlikely(!ptr_ring_empty(&priv->xdp_ring))) {
+			priv->rx_notify_masked = true;
+			napi_schedule(&priv->xdp_napi);
+		}
+	}
+
+	return done;
+}
+
+static int veth_napi_add(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	int err;
+
+	err = ptr_ring_init(&priv->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
+	if (err)
+		return err;
+
+	netif_napi_add(dev, &priv->xdp_napi, veth_poll, NAPI_POLL_WEIGHT);
+	napi_enable(&priv->xdp_napi);
+
+	return 0;
+}
+
+static void veth_napi_del(struct net_device *dev)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	napi_disable(&priv->xdp_napi);
+	netif_napi_del(&priv->xdp_napi);
+	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
+}
+
 static int veth_open(struct net_device *dev)
 {
 	struct veth_priv *priv = netdev_priv(dev);
@@ -337,6 +445,12 @@ static int veth_open(struct net_device *dev)
 	if (err < 0)
 		goto err_reg_mem;
 
+	if (rtnl_dereference(priv->xdp_prog)) {
+		err = veth_napi_add(dev);
+		if (err)
+			goto err_reg_mem;
+	}
+
 	if (peer->flags & IFF_UP) {
 		netif_carrier_on(dev);
 		netif_carrier_on(peer);
@@ -358,6 +472,9 @@ static int veth_close(struct net_device *dev)
 	if (peer)
 		netif_carrier_off(peer);
 
+	if (rtnl_dereference(priv->xdp_prog))
+		veth_napi_del(dev);
+
 	xdp_rxq_info_unreg(&priv->xdp_rxq);
 
 	return 0;
@@ -384,15 +501,12 @@ static void veth_dev_free(struct net_device *dev)
 #ifdef CONFIG_NET_POLL_CONTROLLER
 static void veth_poll_controller(struct net_device *dev)
 {
-	/* veth only receives frames when its peer sends one
-	 * Since it's a synchronous operation, we are guaranteed
-	 * never to have pending data when we poll for it so
-	 * there is nothing to do here.
-	 *
-	 * We need this though so netpoll recognizes us as an interface that
-	 * supports polling, which enables bridge devices in virt setups to
-	 * still use netconsole
-	 */
+	struct veth_priv *priv = netdev_priv(dev);
+
+	rcu_read_lock();
+	if (rcu_access_pointer(priv->xdp_prog))
+		veth_xdp_flush(priv);
+	rcu_read_unlock();
 }
 #endif	/* CONFIG_NET_POLL_CONTROLLER */
 
@@ -456,26 +570,40 @@ static int veth_xdp_set(struct net_device *dev, struct bpf_prog *prog,
 	struct veth_priv *priv = netdev_priv(dev);
 	struct bpf_prog *old_prog;
 	struct net_device *peer;
+	int err;
 
 	old_prog = rtnl_dereference(priv->xdp_prog);
 	peer = rtnl_dereference(priv->peer);
 
-	if (!old_prog && prog && peer) {
-		peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
-		peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
-			peer->hard_header_len -
-			SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-		if (peer->mtu > peer->max_mtu)
-			dev_set_mtu(peer, peer->max_mtu);
+	if (!old_prog && prog) {
+		if (dev->flags & IFF_UP) {
+			err = veth_napi_add(dev);
+			if (err)
+				return err;
+		}
+
+		if (peer) {
+			peer->hw_features &= ~NETIF_F_GSO_SOFTWARE;
+			peer->max_mtu = PAGE_SIZE - VETH_XDP_HEADROOM -
+				peer->hard_header_len -
+				SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+			if (peer->mtu > peer->max_mtu)
+				dev_set_mtu(peer, peer->max_mtu);
+		}
 	}
 
 	rcu_assign_pointer(priv->xdp_prog, prog);
 
 	if (old_prog) {
 		bpf_prog_put(old_prog);
-		if (!prog && peer) {
-			peer->hw_features |= NETIF_F_GSO_SOFTWARE;
-			peer->max_mtu = ETH_MAX_MTU;
+		if (!prog) {
+			if (dev->flags & IFF_UP)
+				veth_napi_del(dev);
+
+			if (peer) {
+				peer->hw_features |= NETIF_F_GSO_SOFTWARE;
+				peer->max_mtu = ETH_MAX_MTU;
+			}
 		}
 	}
 
@@ -688,10 +816,13 @@ static int veth_newlink(struct net *src_net, struct net_device *dev,
 	 */
 
 	priv = netdev_priv(dev);
+	priv->dev = dev;
 	rcu_assign_pointer(priv->peer, peer);
 
 	priv = netdev_priv(peer);
+	priv->dev = peer;
 	rcu_assign_pointer(priv->peer, dev);
+
 	return 0;
 
 err_register_dev:
-- 
2.14.3


* [PATCH RFC 5/9] veth: Handle xdp_frame in xdp napi ring
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This is preparation for XDP TX and ndo_xdp_xmit.

The napi ring now accepts both skbs and xdp_frames. When an xdp_frame is
enqueued, no skb is allocated until the XDP program on veth returns
PASS. This will speed up XDP processing once ndo_xdp_xmit is implemented
and xdp_frames are enqueued by the peer device.
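
The ring distinguishes the two pointer types by tagging bit 0 of the
pointer, which is always clear for a word-aligned object. In isolation
the trick looks like this (mirroring the helpers in the diff below; the
tagging side, veth_xdp_to_ptr, is added by a later patch):

  #define VETH_XDP_FLAG	0x1UL

  static bool veth_is_xdp_frame(void *ptr)
  {
  	return (unsigned long)ptr & VETH_XDP_FLAG;
  }

  static void *veth_xdp_to_ptr(void *frame)	/* tag (patch 6) */
  {
  	return (void *)((unsigned long)frame | VETH_XDP_FLAG);
  }

  static void *veth_ptr_to_xdp(void *ptr)	/* untag */
  {
  	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
  }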

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 452771f31c30..89c91c1c9935 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,7 @@
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
 
+#define VETH_XDP_FLAG		0x1UL
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
 
@@ -48,6 +49,16 @@ struct veth_priv {
 	struct xdp_rxq_info	xdp_rxq;
 };
 
+static bool veth_is_xdp_frame(void *ptr)
+{
+	return (unsigned long)ptr & VETH_XDP_FLAG;
+}
+
+static void *veth_ptr_to_xdp(void *ptr)
+{
+	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
+}
+
 /*
  * ethtool interface
  */
@@ -117,7 +128,14 @@ static void veth_ptr_free(void *ptr)
 {
 	if (!ptr)
 		return;
-	dev_kfree_skb_any(ptr);
+
+	if (veth_is_xdp_frame(ptr)) {
+		struct xdp_frame *frame = veth_ptr_to_xdp(ptr);
+
+		xdp_return_frame(frame);
+	} else {
+		dev_kfree_skb_any(ptr);
+	}
 }
 
 static void veth_xdp_flush(struct veth_priv *priv)
@@ -263,6 +281,60 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
+					struct xdp_frame *frame)
+{
+	struct bpf_prog *xdp_prog;
+	unsigned int headroom;
+	struct sk_buff *skb;
+	int len, delta = 0;
+
+	rcu_read_lock();
+	xdp_prog = rcu_dereference(priv->xdp_prog);
+	if (xdp_prog) {
+		struct xdp_buff xdp;
+		u32 act;
+
+		xdp.data_hard_start = frame->data - frame->headroom;
+		xdp.data = frame->data;
+		xdp.data_end = frame->data + frame->len;
+		xdp.data_meta = frame->data - frame->metasize;
+		xdp.rxq = &priv->xdp_rxq;
+
+		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+		switch (act) {
+		case XDP_PASS:
+			delta = frame->data - xdp.data;
+			break;
+		default:
+			bpf_warn_invalid_xdp_action(act);
+		case XDP_ABORTED:
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+		case XDP_DROP:
+			goto err_xdp;
+		}
+	}
+	rcu_read_unlock();
+
+	headroom = frame->data - delta - (void *)frame;
+	len = frame->len + delta;
+	skb = veth_build_skb(frame, headroom, len, 0);
+	if (!skb) {
+		xdp_return_frame(frame);
+		goto err;
+	}
+
+	skb->protocol = eth_type_trans(skb, priv->dev);
+err:
+	return skb;
+err_xdp:
+	rcu_read_unlock();
+	xdp_return_frame(frame);
+
+	return NULL;
+}
+
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 					struct sk_buff *skb)
 {
@@ -372,7 +444,10 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 		if (!ptr)
 			break;
 
-		skb = veth_xdp_rcv_skb(priv, ptr);
+		if (veth_is_xdp_frame(ptr))
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
+		else
+			skb = veth_xdp_rcv_skb(priv, ptr);
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
-- 
2.14.3


* [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This allows a NIC's XDP program to redirect packets to veth. The
destination veth device enqueues redirected packets to the napi ring of
its peer, and they are then processed by XDP on the peer veth device.
When the peer has driver XDP enabled, this can be thought of as one XDP
program calling another via REDIRECT.

Note that whether an XDP program is loaded on the redirect-target veth
device itself does not affect how xdp_frames sent by ndo_xdp_xmit are
handled, since the ring sits on the rx (peer) side. What matters is
whether an XDP program is loaded on the peer veth.

When the peer veth device has driver XDP, ndo_xdp_xmit forwards
xdp_frames to it without modification.
If not, ndo_xdp_xmit converts the xdp_frames to skbs on the sender side
and invokes netif_rx rather than dropping them. Although this will not
yield good performance, dropping redirected packets whenever XDP is not
loaded on the peer device seems too restrictive, so this fallback was
added.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c     | 73 +++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/filter.h | 16 +++++++++++
 net/core/filter.c      | 11 +-------
 3 files changed, 87 insertions(+), 13 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 89c91c1c9935..b1d591be0eba 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -54,6 +54,11 @@ static bool veth_is_xdp_frame(void *ptr)
 	return (unsigned long)ptr & VETH_XDP_FLAG;
 }
 
+static void *veth_xdp_to_ptr(void *ptr)
+{
+	return (void *)((unsigned long)ptr | VETH_XDP_FLAG);
+}
+
 static void *veth_ptr_to_xdp(void *ptr)
 {
 	return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG);
@@ -138,7 +143,7 @@ static void veth_ptr_free(void *ptr)
 	}
 }
 
-static void veth_xdp_flush(struct veth_priv *priv)
+static void __veth_xdp_flush(struct veth_priv *priv)
 {
 	/* Write ptr_ring before reading rx_notify_masked */
 	smp_mb();
@@ -206,7 +211,7 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	/* TODO: check xmit_more and tx_stopped */
 	if (rcv_xdp)
-		veth_xdp_flush(rcv_priv);
+		__veth_xdp_flush(rcv_priv);
 
 	rcu_read_unlock();
 
@@ -281,6 +286,66 @@ static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
 	return skb;
 }
 
+static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	int headroom = frame->data - (void *)frame;
+	struct net_device *rcv;
+	int err = 0;
+
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		return -ENXIO;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
+		err = xdp_ok_fwd_dev(rcv, frame->len);
+		if (unlikely(err))
+			return err;
+
+		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
+	} else {
+		struct sk_buff *skb;
+
+		skb = veth_build_skb(frame, headroom, frame->len, 0);
+		if (unlikely(!skb))
+			return -ENOMEM;
+
+		/* Get page ref in case skb is dropped in netif_rx.
+		 * The caller is responsible for freeing the page on error.
+		 */
+		get_page(virt_to_page(frame->data));
+		if (unlikely(veth_forward_skb(rcv, skb, false) != NET_RX_SUCCESS))
+			return -ENXIO;
+
+		/* Put page ref on success */
+		page_frag_free(frame->data);
+	}
+
+	return err;
+}
+
+static void veth_xdp_flush(struct net_device *dev)
+{
+	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
+	struct net_device *rcv;
+
+	rcu_read_lock();
+	rcv = rcu_dereference(priv->peer);
+	if (unlikely(!rcv))
+		goto out;
+
+	rcv_priv = netdev_priv(rcv);
+	/* xdp_ring is initialized on receive side? */
+	if (unlikely(!rcu_access_pointer(rcv_priv->xdp_prog)))
+		goto out;
+
+	__veth_xdp_flush(rcv_priv);
+out:
+	rcu_read_unlock();
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 					struct xdp_frame *frame)
 {
@@ -580,7 +645,7 @@ static void veth_poll_controller(struct net_device *dev)
 
 	rcu_read_lock();
 	if (rcu_access_pointer(priv->xdp_prog))
-		veth_xdp_flush(priv);
+		__veth_xdp_flush(priv);
 	rcu_read_unlock();
 }
 #endif	/* CONFIG_NET_POLL_CONTROLLER */
@@ -730,6 +795,8 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_features_check	= passthru_features_check,
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
+	.ndo_xdp_xmit		= veth_xdp_xmit,
+	.ndo_xdp_flush		= veth_xdp_flush,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4da8b2308174..7d043f51d1d7 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -19,6 +19,7 @@
 #include <linux/cryptohash.h>
 #include <linux/set_memory.h>
 #include <linux/kallsyms.h>
+#include <linux/if_vlan.h>
 
 #include <net/sch_generic.h>
 
@@ -752,6 +753,21 @@ static inline bool bpf_dump_raw_ok(void)
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
 				       const struct bpf_insn *patch, u32 len);
 
+static __always_inline int
+xdp_ok_fwd_dev(const struct net_device *fwd, unsigned int pktlen)
+{
+	unsigned int len;
+
+	if (unlikely(!(fwd->flags & IFF_UP)))
+		return -ENETDOWN;
+
+	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
+	if (pktlen > len)
+		return -EMSGSIZE;
+
+	return 0;
+}
+
 /* The pair of xdp_do_redirect and xdp_do_flush_map MUST be called in the
  * same cpu context. Further for best results no more than a single map
  * for the do_redirect/do_flush pair should be used. This limitation is
diff --git a/net/core/filter.c b/net/core/filter.c
index a374b8560bc4..25ae8ffaa968 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2923,16 +2923,7 @@ EXPORT_SYMBOL_GPL(xdp_do_redirect);
 
 static int __xdp_generic_ok_fwd_dev(struct sk_buff *skb, struct net_device *fwd)
 {
-	unsigned int len;
-
-	if (unlikely(!(fwd->flags & IFF_UP)))
-		return -ENETDOWN;
-
-	len = fwd->mtu + fwd->hard_header_len + VLAN_HLEN;
-	if (skb->len > len)
-		return -EMSGSIZE;
-
-	return 0;
+	return xdp_ok_fwd_dev(fwd, skb->len);
 }
 
 static int xdp_do_generic_redirect_map(struct net_device *dev,
-- 
2.14.3


* [PATCH RFC 7/9] veth: Add XDP TX and REDIRECT
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

This allows further redirection of xdp_frames like

 NIC   -> veth--veth -> veth--veth
 (XDP)          (XDP)         (XDP)

The intermediate XDP program, which redirects packets from the NIC to
the other veth, reuses the xdp_mem info from the NIC so that the NIC's
page recycling keeps working on the destination veth's XDP.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 94 ++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 85 insertions(+), 9 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index b1d591be0eba..98fc91a64e29 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -43,6 +43,7 @@ struct veth_priv {
 	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
+	struct xdp_mem_info	xdp_mem;
 	unsigned		requested_headroom;
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
@@ -346,9 +347,21 @@ static void veth_xdp_flush(struct net_device *dev)
 	rcu_read_unlock();
 }
 
+static int veth_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return veth_xdp_xmit(dev, frame);
+}
+
 static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
-					struct xdp_frame *frame)
+					struct xdp_frame *frame, bool *xdp_xmit,
+					bool *xdp_redir)
 {
+	struct xdp_frame orig_frame;
 	struct bpf_prog *xdp_prog;
 	unsigned int headroom;
 	struct sk_buff *skb;
@@ -372,6 +385,29 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 		case XDP_PASS:
 			delta = frame->data - xdp.data;
 			break;
+		case XDP_TX:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			if (unlikely(veth_xdp_tx(priv->dev, &xdp))) {
+				trace_xdp_exception(priv->dev, xdp_prog, act);
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_xmit = true;
+			rcu_read_unlock();
+			goto xdp_xmit;
+		case XDP_REDIRECT:
+			orig_frame = *frame;
+			xdp.data_hard_start = frame;
+			xdp.rxq->mem = frame->mem;
+			if (xdp_do_redirect(priv->dev, &xdp, xdp_prog)) {
+				frame = &orig_frame;
+				goto err_xdp;
+			}
+			*xdp_redir = true;
+			rcu_read_unlock();
+			goto xdp_xmit;
 		default:
 			bpf_warn_invalid_xdp_action(act);
 		case XDP_ABORTED:
@@ -396,12 +432,13 @@ static struct sk_buff *veth_xdp_rcv_one(struct veth_priv *priv,
 err_xdp:
 	rcu_read_unlock();
 	xdp_return_frame(frame);
-
+xdp_xmit:
 	return NULL;
 }
 
 static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
-					struct sk_buff *skb)
+					struct sk_buff *skb, bool *xdp_xmit,
+					bool *xdp_redir)
 {
 	u32 pktlen, headroom, act, metalen;
 	int size, mac_len, delta, off;
@@ -469,6 +506,26 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	switch (act) {
 	case XDP_PASS:
 		break;
+	case XDP_TX:
+		get_page(virt_to_page(xdp.data));
+		dev_consume_skb_any(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (unlikely(veth_xdp_tx(priv->dev, &xdp))) {
+			trace_xdp_exception(priv->dev, xdp_prog, act);
+			goto err_xdp;
+		}
+		*xdp_xmit = true;
+		rcu_read_unlock();
+		goto xdp_xmit;
+	case XDP_REDIRECT:
+		get_page(virt_to_page(xdp.data));
+		dev_consume_skb_any(skb);
+		xdp.rxq->mem = priv->xdp_mem;
+		if (xdp_do_redirect(priv->dev, &xdp, xdp_prog))
+			goto err_xdp;
+		*xdp_redir = true;
+		rcu_read_unlock();
+		goto xdp_xmit;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
@@ -496,9 +553,15 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 	rcu_read_unlock();
 	dev_kfree_skb_any(skb);
 	return NULL;
+err_xdp:
+	rcu_read_unlock();
+	page_frag_free(xdp.data);
+xdp_xmit:
+	return NULL;
 }
 
-static int veth_xdp_rcv(struct veth_priv *priv, int budget)
+static int veth_xdp_rcv(struct veth_priv *priv, int budget, bool *xdp_xmit,
+			bool *xdp_redir)
 {
 	int i, done = 0;
 
@@ -509,10 +572,12 @@ static int veth_xdp_rcv(struct veth_priv *priv, int budget)
 		if (!ptr)
 			break;
 
-		if (veth_is_xdp_frame(ptr))
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr));
-		else
-			skb = veth_xdp_rcv_skb(priv, ptr);
+		if (veth_is_xdp_frame(ptr)) {
+			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
+					       xdp_xmit, xdp_redir);
+		} else {
+			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit, xdp_redir);
+		}
 
 		if (skb)
 			napi_gro_receive(&priv->xdp_napi, skb);
@@ -527,9 +592,11 @@ static int veth_poll(struct napi_struct *napi, int budget)
 {
 	struct veth_priv *priv =
 		container_of(napi, struct veth_priv, xdp_napi);
+	bool xdp_xmit = false;
+	bool xdp_redir = false;
 	int done;
 
-	done = veth_xdp_rcv(priv, budget);
+	done = veth_xdp_rcv(priv, budget, &xdp_xmit, &xdp_redir);
 
 	if (done < budget && napi_complete_done(napi, done)) {
 		/* Write rx_notify_masked before reading ptr_ring */
@@ -540,6 +607,11 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		}
 	}
 
+	if (xdp_xmit)
+		veth_xdp_flush(priv->dev);
+	if (xdp_redir)
+		xdp_do_flush_map();
+
 	return done;
 }
 
@@ -585,6 +657,9 @@ static int veth_open(struct net_device *dev)
 	if (err < 0)
 		goto err_reg_mem;
 
+	/* Save original mem info as it can be overwritten */
+	priv->xdp_mem = priv->xdp_rxq.mem;
+
 	if (rtnl_dereference(priv->xdp_prog)) {
 		err = veth_napi_add(dev);
 		if (err)
@@ -615,6 +690,7 @@ static int veth_close(struct net_device *dev)
 	if (rtnl_dereference(priv->xdp_prog))
 		veth_napi_del(dev);
 
+	priv->xdp_rxq.mem = priv->xdp_mem;
 	xdp_rxq_info_unreg(&priv->xdp_rxq);
 
 	return 0;
-- 
2.14.3


* [PATCH RFC 8/9] veth: Avoid per-packet spinlock of XDP napi ring on dequeueing
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Use per-cpu temporary storage to avoid taking the ring's spinlock for
every packet: the NAPI handler now dequeues packets in batches with
ptr_ring_consume_batched().

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 46 +++++++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 98fc91a64e29..1592119e3873 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -30,6 +30,7 @@
 #define VETH_XDP_FLAG		0x1UL
 #define VETH_RING_SIZE		256
 #define VETH_XDP_HEADROOM	(XDP_PACKET_HEADROOM + NET_IP_ALIGN)
+#define VETH_XDP_QUEUE_SIZE	NAPI_POLL_WEIGHT
 
 struct pcpu_vstats {
 	u64			packets;
@@ -50,6 +51,8 @@ struct veth_priv {
 	struct xdp_rxq_info	xdp_rxq;
 };
 
+static DEFINE_PER_CPU(void *[VETH_XDP_QUEUE_SIZE], xdp_consume_q);
+
 static bool veth_is_xdp_frame(void *ptr)
 {
 	return (unsigned long)ptr & VETH_XDP_FLAG;
@@ -563,27 +566,32 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_priv *priv,
 static int veth_xdp_rcv(struct veth_priv *priv, int budget, bool *xdp_xmit,
 			bool *xdp_redir)
 {
-	int i, done = 0;
-
-	for (i = 0; i < budget; i++) {
-		void *ptr = ptr_ring_consume(&priv->xdp_ring);
-		struct sk_buff *skb;
-
-		if (!ptr)
-			break;
+	void **q = this_cpu_ptr(xdp_consume_q);
+	int num, lim, done = 0;
+
+	do {
+		int i;
+
+		lim = min(budget - done, VETH_XDP_QUEUE_SIZE);
+		num = ptr_ring_consume_batched(&priv->xdp_ring, q, lim);
+		for (i = 0; i < num; i++) {
+			struct sk_buff *skb;
+			void *ptr = q[i];
+
+			if (veth_is_xdp_frame(ptr)) {
+				skb = veth_xdp_rcv_one(priv,
+						       veth_ptr_to_xdp(ptr),
+						       xdp_xmit, xdp_redir);
+			} else {
+				skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit,
+						       xdp_redir);
+			}
 
-		if (veth_is_xdp_frame(ptr)) {
-			skb = veth_xdp_rcv_one(priv, veth_ptr_to_xdp(ptr),
-					       xdp_xmit, xdp_redir);
-		} else {
-			skb = veth_xdp_rcv_skb(priv, ptr, xdp_xmit, xdp_redir);
+			if (skb)
+				napi_gro_receive(&priv->xdp_napi, skb);
 		}
-
-		if (skb)
-			napi_gro_receive(&priv->xdp_napi, skb);
-
-		done++;
-	}
+		done += num;
+	} while (unlikely(num == lim && done < budget));
 
 	return done;
 }
-- 
2.14.3


* [PATCH RFC 9/9] veth: Avoid per-packet spinlock of XDP napi ring on enqueueing
From: Toshiaki Makita @ 2018-04-24 14:39 UTC
  To: netdev; +Cc: Toshiaki Makita

From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>

Use per-cpu temporary storage to avoid taking the ring's spinlock for
every packet. This differs from the dequeue side in that multiple veth
devices can be redirect targets within one napi loop, so the per-cpu
storage is allocated in the veth private structure.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
---
 drivers/net/veth.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 1592119e3873..5978d76f2c00 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -38,12 +38,18 @@ struct pcpu_vstats {
 	struct u64_stats_sync	syncp;
 };
 
+struct xdp_queue {
+	void *q[VETH_XDP_QUEUE_SIZE];
+	unsigned int len;
+};
+
 struct veth_priv {
 	struct napi_struct	xdp_napi;
 	struct net_device	*dev;
 	struct bpf_prog __rcu	*xdp_prog;
 	struct net_device __rcu	*peer;
 	atomic64_t		dropped;
+	struct xdp_queue __percpu *xdp_produce_q;
 	struct xdp_mem_info	xdp_mem;
 	unsigned		requested_headroom;
 	bool			rx_notify_masked;
@@ -147,8 +153,48 @@ static void veth_ptr_free(void *ptr)
 	}
 }
 
+static void veth_xdp_cleanup_queues(struct veth_priv *priv)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct xdp_queue *q = per_cpu_ptr(priv->xdp_produce_q, cpu);
+		int i;
+
+		for (i = 0; i < q->len; i++)
+			veth_ptr_free(q->q[i]);
+
+		q->len = 0;
+	}
+}
+
+static bool veth_xdp_flush_queue(struct veth_priv *priv)
+{
+	struct xdp_queue *q = this_cpu_ptr(priv->xdp_produce_q);
+	int i;
+
+	if (unlikely(!q->len))
+		return false;
+
+	spin_lock(&priv->xdp_ring.producer_lock);
+	for (i = 0; i < q->len; i++) {
+		void *ptr = q->q[i];
+
+		if (unlikely(__ptr_ring_produce(&priv->xdp_ring, ptr)))
+			veth_ptr_free(ptr);
+	}
+	spin_unlock(&priv->xdp_ring.producer_lock);
+
+	q->len = 0;
+
+	return true;
+}
+
 static void __veth_xdp_flush(struct veth_priv *priv)
 {
+	if (unlikely(!veth_xdp_flush_queue(priv)))
+		return;
+
 	/* Write ptr_ring before reading rx_notify_masked */
 	smp_mb();
 	if (!priv->rx_notify_masked) {
@@ -159,9 +205,13 @@ static void __veth_xdp_flush(struct veth_priv *priv)
 
 static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
 {
-	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
+	struct xdp_queue *q = this_cpu_ptr(priv->xdp_produce_q);
+
+	if (unlikely(q->len >= VETH_XDP_QUEUE_SIZE))
 		return -ENOSPC;
 
+	q->q[q->len++] = ptr;
+
 	return 0;
 }
 
@@ -644,6 +694,7 @@ static void veth_napi_del(struct net_device *dev)
 
 	napi_disable(&priv->xdp_napi);
 	netif_napi_del(&priv->xdp_napi);
+	veth_xdp_cleanup_queues(priv);
 	ptr_ring_cleanup(&priv->xdp_ring, veth_ptr_free);
 }
 
@@ -711,15 +762,28 @@ static int is_valid_veth_mtu(int mtu)
 
 static int veth_dev_init(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	dev->vstats = netdev_alloc_pcpu_stats(struct pcpu_vstats);
 	if (!dev->vstats)
 		return -ENOMEM;
+
+	priv->xdp_produce_q = __alloc_percpu(sizeof(*priv->xdp_produce_q),
+					     sizeof (void *));
+	if (!priv->xdp_produce_q) {
+		free_percpu(dev->vstats);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
 static void veth_dev_free(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	free_percpu(dev->vstats);
+	free_percpu(priv->xdp_produce_q);
 }
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
-- 
2.14.3


* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
From: Jesper Dangaard Brouer @ 2018-04-25 20:24 UTC
  To: Toshiaki Makita; +Cc: brouer, netdev, Toshiaki Makita

On Tue, 24 Apr 2018 23:39:20 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> +static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
> +{
> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> +	int headroom = frame->data - (void *)frame;
> +	struct net_device *rcv;
> +	int err = 0;
> +
> +	rcv = rcu_dereference(priv->peer);
> +	if (unlikely(!rcv))
> +		return -ENXIO;
> +
> +	rcv_priv = netdev_priv(rcv);
> +	/* xdp_ring is initialized on receive side? */
> +	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
> +		err = xdp_ok_fwd_dev(rcv, frame->len);
> +		if (unlikely(err))
> +			return err;
> +
> +		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
> +	} else {
> +		struct sk_buff *skb;
> +
> +		skb = veth_build_skb(frame, headroom, frame->len, 0);
> +		if (unlikely(!skb))
> +			return -ENOMEM;
> +
> +		/* Get page ref in case skb is dropped in netif_rx.
> +		 * The caller is responsible for freeing the page on error.
> +		 */
> +		get_page(virt_to_page(frame->data));

I'm not sure you can make this assumption, that xdp_frames coming from
another device driver use a refcnt-based memory model. But maybe I'm
confused, as this looks like an SKB receive path, but in the
ndo_xdp_xmit().


> +		if (unlikely(veth_forward_skb(rcv, skb, false) != NET_RX_SUCCESS))
> +			return -ENXIO;
> +
> +		/* Put page ref on success */
> +		page_frag_free(frame->data);
> +	}
> +
> +	return err;
> +}


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH RFC 2/9] veth: Add driver XDP
From: Jesper Dangaard Brouer @ 2018-04-25 20:39 UTC
  To: Toshiaki Makita; +Cc: brouer, netdev, Toshiaki Makita

On Tue, 24 Apr 2018 23:39:16 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> This is a basic implementation of veth driver XDP.
> 
> Incoming packets arrive from the peer veth device in the form of an skb,
> so this generally does the same thing as generic XDP.

I'm unsure what context you are calling veth_xdp_rcv_skb() from.  The
XDP (RX side) depends heavily on the protection provided by NAPI context.
It looks like you are adding a NAPI handler later.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH RFC 2/9] veth: Add driver XDP
From: Toshiaki Makita @ 2018-04-26 10:46 UTC
  To: Jesper Dangaard Brouer; +Cc: Toshiaki Makita, netdev

Hi Jesper,

Thanks for taking a look!

On 2018/04/26 5:39, Jesper Dangaard Brouer wrote:
> On Tue, 24 Apr 2018 23:39:16 +0900
> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> 
>> This is a basic implementation of veth driver XDP.
>>
>> Incoming packets arrive from the peer veth device in the form of an skb,
>> so this generally does the same thing as generic XDP.
> 
> I'm unsure what context you are calling veth_xdp_rcv_skb() from.  The
> XDP (RX side) depends heavily on the protection provided by NAPI context.
> It looks like you are adding a NAPI handler later.

This is called from softirq or bh-disabled context.
I can see that XDP REDIRECT depends on NAPI, since it uses per-cpu
temporary storage that is consumed in ndo_xdp_flush. I thought DROP and
PASS were safe here. Also, this is basically the same context as generic
XDP, which is called from netif_rx_internal.

Anyway, this is a temporary state and not needed. Since it does not seem
to help review, I'll squash this patch and patch 4 (the napi patch).

-- 
Toshiaki Makita


* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
From: Toshiaki Makita @ 2018-04-26 10:52 UTC
  To: Jesper Dangaard Brouer; +Cc: Toshiaki Makita, netdev

On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
> On Tue, 24 Apr 2018 23:39:20 +0900
> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> 
>> +static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
>> +{
>> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>> +	int headroom = frame->data - (void *)frame;
>> +	struct net_device *rcv;
>> +	int err = 0;
>> +
>> +	rcv = rcu_dereference(priv->peer);
>> +	if (unlikely(!rcv))
>> +		return -ENXIO;
>> +
>> +	rcv_priv = netdev_priv(rcv);
>> +	/* xdp_ring is initialized on receive side? */
>> +	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
>> +		err = xdp_ok_fwd_dev(rcv, frame->len);
>> +		if (unlikely(err))
>> +			return err;
>> +
>> +		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
>> +	} else {
>> +		struct sk_buff *skb;
>> +
>> +		skb = veth_build_skb(frame, headroom, frame->len, 0);
>> +		if (unlikely(!skb))
>> +			return -ENOMEM;
>> +
>> +		/* Get page ref in case skb is dropped in netif_rx.
>> +		 * The caller is responsible for freeing the page on error.
>> +		 */
>> +		get_page(virt_to_page(frame->data));
> 
> I'm not sure you can make this assumption, that xdp_frames coming from
> another device driver use a refcnt-based memory model. But maybe I'm
> confused, as this looks like an SKB receive path, but in the
> ndo_xdp_xmit().

I find this path similar to cpumap, which creates an skb from the
redirected xdp frame. Once it is converted to an skb, the skb head is
freed by page_frag_free(), so I needed to take the page refcount here
regardless of the memory model.

> 
> 
>> +		if (unlikely(veth_forward_skb(rcv, skb, false) != NET_RX_SUCCESS))
>> +			return -ENXIO;
>> +
>> +		/* Put page ref on success */
>> +		page_frag_free(frame->data);
>> +	}
>> +
>> +	return err;
>> +}
> 
> 

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
  2018-04-26 10:52     ` Toshiaki Makita
@ 2018-04-30 17:27       ` Jesper Dangaard Brouer
  2018-05-01  1:02         ` Toshiaki Makita
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2018-04-30 17:27 UTC (permalink / raw)
  To: Toshiaki Makita; +Cc: Toshiaki Makita, netdev, brouer, Tariq Toukan

On Thu, 26 Apr 2018 19:52:40 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
> > On Tue, 24 Apr 2018 23:39:20 +0900
> > Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> >   
> >> +static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
> >> +{
> >> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> >> +	int headroom = frame->data - (void *)frame;
> >> +	struct net_device *rcv;
> >> +	int err = 0;
> >> +
> >> +	rcv = rcu_dereference(priv->peer);
> >> +	if (unlikely(!rcv))
> >> +		return -ENXIO;
> >> +
> >> +	rcv_priv = netdev_priv(rcv);
> >> +	/* xdp_ring is initialized on receive side? */
> >> +	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
> >> +		err = xdp_ok_fwd_dev(rcv, frame->len);
> >> +		if (unlikely(err))
> >> +			return err;
> >> +
> >> +		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
> >> +	} else {
> >> +		struct sk_buff *skb;
> >> +
> >> +		skb = veth_build_skb(frame, headroom, frame->len, 0);
> >> +		if (unlikely(!skb))
> >> +			return -ENOMEM;
> >> +
> >> +		/* Get page ref in case skb is dropped in netif_rx.
> >> +		 * The caller is responsible for freeing the page on error.
> >> +		 */
> >> +		get_page(virt_to_page(frame->data));  
> > 
> > I'm not sure you can make this assumption, that xdp_frames coming from
> > another device driver use a refcnt-based memory model. But maybe I'm
> > confused, as this looks like an SKB receive path, but in the
> > ndo_xdp_xmit().  
> 
> I find this path similar to cpumap, which creates an skb from the
> redirected xdp frame. Once it is converted to an skb, the skb head is
> freed by page_frag_free(), so I needed to take the page refcount here
> regardless of the memory model.

Yes I know, I wrote cpumap ;-)

First of all, I don't want to see such xdp_frame to SKB conversion code
in every driver, because that increases the chances of errors.  And
when looking at the details, it seems that you have made the
mistake of making it possible to leak xdp_frame info to the SKB (which
cpumap takes into account).

Second, I think the refcnt scheme here is wrong. The xdp_frame should
be "owned" by XDP and have the proper refcnt to deliver it directly to
the network stack.

Third, if we choose that we want a fallback, in case XDP is not enabled
on the egress dev (but it has an ndo_xdp_xmit), then it should be placed
in the generic/core code.  E.g. __bpf_tx_xdp_map() could look at the
return code from dev->netdev_ops->ndo_xdp_xmit() and create an SKB.  (Hint:
this would make it easy to implement TX bulking towards the dev.)
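
Roughly something like this (an untested sketch; xdp_frame_to_skb() and
the -EOPNOTSUPP convention are made up for illustration):

	static int bpf_tx_xdp_fallback(struct net_device *dev,
				       struct xdp_frame *xdpf)
	{
		struct sk_buff *skb;
		int err;

		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
		if (err != -EOPNOTSUPP)	/* assumed "XDP xmit not ready" */
			return err;

		/* One SKB conversion in core code, not in every driver */
		skb = xdp_frame_to_skb(xdpf, dev);	/* hypothetical */
		if (unlikely(!skb))
			return -ENOMEM;

		skb->dev = dev;
		return dev_queue_xmit(skb);
	}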

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


+static struct sk_buff *veth_build_skb(void *head, int headroom, int len,
+				      int buflen)
+{
+	struct sk_buff *skb;
+
+	if (!buflen) {
+		/* Derive the buffer size from headroom + data, leaving
+		 * room for the skb_shared_info that build_skb() expects
+		 * at the tail.
+		 */
+		buflen = SKB_DATA_ALIGN(headroom + len) +
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	}
+	skb = build_skb(head, buflen);
+	if (!skb)
+		return NULL;
+
+	skb_reserve(skb, headroom);
+	skb_put(skb, len);
+
+	return skb;
+}

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
  2018-04-30 17:27       ` Jesper Dangaard Brouer
@ 2018-05-01  1:02         ` Toshiaki Makita
  2018-05-01  8:14           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 23+ messages in thread
From: Toshiaki Makita @ 2018-05-01  1:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Toshiaki Makita, netdev, Tariq Toukan

On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:
> On Thu, 26 Apr 2018 19:52:40 +0900
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
>>> On Tue, 24 Apr 2018 23:39:20 +0900
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
>>>   
>>>> +static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
>>>> +{
>>>> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
>>>> +	int headroom = frame->data - (void *)frame;
>>>> +	struct net_device *rcv;
>>>> +	int err = 0;
>>>> +
>>>> +	rcv = rcu_dereference(priv->peer);
>>>> +	if (unlikely(!rcv))
>>>> +		return -ENXIO;
>>>> +
>>>> +	rcv_priv = netdev_priv(rcv);
>>>> +	/* xdp_ring is initialized on receive side? */
>>>> +	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
>>>> +		err = xdp_ok_fwd_dev(rcv, frame->len);
>>>> +		if (unlikely(err))
>>>> +			return err;
>>>> +
>>>> +		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
>>>> +	} else {
>>>> +		struct sk_buff *skb;
>>>> +
>>>> +		skb = veth_build_skb(frame, headroom, frame->len, 0);
>>>> +		if (unlikely(!skb))
>>>> +			return -ENOMEM;
>>>> +
>>>> +		/* Get page ref in case skb is dropped in netif_rx.
>>>> +		 * The caller is responsible for freeing the page on error.
>>>> +		 */
>>>> +		get_page(virt_to_page(frame->data));  
>>>
>>> I'm not sure you can make this assumption, that xdp_frames coming from
>>> another device driver use a refcnt-based memory model. But maybe I'm
>>> confused, as this looks like an SKB receive path, but in the
>>> ndo_xdp_xmit().  
>>
>> I find this path similar to cpumap, which creates an skb from the
>> redirected xdp frame. Once it is converted to an skb, the skb head is
>> freed by page_frag_free(), so I needed to take the page refcount here
>> regardless of the memory model.
> 
> Yes I know, I wrote cpumap ;-)
> 
> First of all, I don't want to see such xdp_frame to SKB conversion code
> in every driver, because that increases the chances of errors.  And
> when looking at the details, it seems that you have made the
> mistake of making it possible to leak xdp_frame info to the SKB (which
> cpumap takes into account).  

Do you mean leaving xdp_frame in skb->head is leaking something? how?

> 
> Second, I think the refcnt scheme here is wrong. The xdp_frame should
> be "owned" by XDP and have the proper refcnt to deliver it directly to
> the network stack.
> 
> Third, if we choose that we want a fallback, in case XDP is not enabled
> on the egress dev (but it has an ndo_xdp_xmit), then it should be placed
> in the generic/core code.  E.g. __bpf_tx_xdp_map() could look at the
> return code from dev->netdev_ops->ndo_xdp_xmit() and create an SKB.  (Hint:
> this would make it easy to implement TX bulking towards the dev.)

Right, this is a much cleaner way.
Although I feel like we should add this fallback for veth because it
requires something which is different from other drivers (enabling XDP
on the peer device of the egress device), I'll drop the part for now. It
should not be resolved in the driver code.

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/9] veth: Use NAPI for XDP
  2018-04-24 14:39 ` [PATCH RFC 4/9] veth: Use NAPI for XDP Toshiaki Makita
@ 2018-05-01  7:50   ` Jesper Dangaard Brouer
  2018-05-01  8:02     ` Toshiaki Makita
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-01  7:50 UTC (permalink / raw)
  To: Toshiaki Makita; +Cc: brouer, netdev, Toshiaki Makita

On Tue, 24 Apr 2018 23:39:18 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> +static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
> +{
> +	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
> +		return -ENOSPC;
> +
> +	return 0;
> +}

Here we have a lock per (enqueued) packet.  I'm working on changing the
ndo_xdp_xmit API to allow bulking.  And the tun driver has the exact same
issue/need.
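
The shape I'm aiming for is roughly this (work in progress, subject to
change):

	/* Hand the driver an array of frames per call, so any ring
	 * lock is taken once per bulk instead of once per packet.
	 */
	int (*ndo_xdp_xmit)(struct net_device *dev, int n,
			    struct xdp_frame **frames);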

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/9] veth: Use NAPI for XDP
  2018-05-01  7:50   ` Jesper Dangaard Brouer
@ 2018-05-01  8:02     ` Toshiaki Makita
  2018-05-01  8:43       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 23+ messages in thread
From: Toshiaki Makita @ 2018-05-01  8:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Toshiaki Makita, netdev

On 2018/05/01 16:50, Jesper Dangaard Brouer wrote:
> On Tue, 24 Apr 2018 23:39:18 +0900
> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> 
>> +static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
>> +{
>> +	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
>> +		return -ENOSPC;
>> +
>> +	return 0;
>> +}
> 
> Here we have a lock per (enqueued) packet.  I'm working on changing the
> ndo_xdp_xmit API to allow bulking.  And the tun driver has the exact same
> issue/need.

Probably I should have noted it in the commit log, but this per-packet lock
is removed in patch 9.
I'm curious whether any change will be needed for your new API.
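
The general idea is to amortize the producer lock over a batch; an
illustrative sketch (not the actual patch code):

	int i;

	spin_lock(&ring->producer_lock);
	for (i = 0; i < n; i++) {
		if (__ptr_ring_produce(ring, ptrs[i]))
			break;	/* ring full; caller frees ptrs[i..n) */
	}
	spin_unlock(&ring->producer_lock);
	/* i entries were actually enqueued */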

-- 
Toshiaki Makita

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
  2018-05-01  1:02         ` Toshiaki Makita
@ 2018-05-01  8:14           ` Jesper Dangaard Brouer
  2018-05-02  3:33             ` Toshiaki Makita
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-01  8:14 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toshiaki Makita, netdev, Tariq Toukan, brouer, Daniel Borkmann,
	Alexei Starovoitov, Eran Ben Elisha

On Tue, 1 May 2018 10:02:01 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:
> > On Thu, 26 Apr 2018 19:52:40 +0900
> > Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:  
> >>> On Tue, 24 Apr 2018 23:39:20 +0900
> >>> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> >>>     
> >>>> +static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
> >>>> +{
> >>>> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> >>>> +	int headroom = frame->data - (void *)frame;
> >>>> +	struct net_device *rcv;
> >>>> +	int err = 0;
> >>>> +
> >>>> +	rcv = rcu_dereference(priv->peer);
> >>>> +	if (unlikely(!rcv))
> >>>> +		return -ENXIO;
> >>>> +
> >>>> +	rcv_priv = netdev_priv(rcv);
> >>>> +	/* xdp_ring is initialized on receive side? */
> >>>> +	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
> >>>> +		err = xdp_ok_fwd_dev(rcv, frame->len);
> >>>> +		if (unlikely(err))
> >>>> +			return err;
> >>>> +
> >>>> +		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
> >>>> +	} else {
> >>>> +		struct sk_buff *skb;
> >>>> +
> >>>> +		skb = veth_build_skb(frame, headroom, frame->len, 0);
> >>>> +		if (unlikely(!skb))
> >>>> +			return -ENOMEM;
> >>>> +
> >>>> +		/* Get page ref in case skb is dropped in netif_rx.
> >>>> +		 * The caller is responsible for freeing the page on error.
> >>>> +		 */
> >>>> +		get_page(virt_to_page(frame->data));    
> >>>
> >>> I'm not sure you can make this assumption, that xdp_frames coming from
> >>> another device driver use a refcnt-based memory model. But maybe I'm
> >>> confused, as this looks like an SKB receive path, but in the
> >>> ndo_xdp_xmit().    
> >>
> >> I find this path similar to cpumap, which creates an skb from the
> >> redirected xdp frame. Once it is converted to an skb, the skb head is
> >> freed by page_frag_free(), so I needed to take the page refcount here
> >> regardless of the memory model.  
> > 
> > Yes I know, I wrote cpumap ;-)
> > 
> > First of all, I don't want to see such xdp_frame to SKB conversion code
> > in every driver, because that increases the chances of errors.  And
> > when looking at the details, it seems that you have made the
> > mistake of making it possible to leak xdp_frame info to the SKB (which
> > cpumap takes into account).  
> 
> Do you mean leaving xdp_frame in skb->head is leaking something? how?

Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp headroom")
and commit 6dfb970d3dbd ("xdp: avoid leaking info stored in frame data
on page reuse").

But this time, the concern is a bpf_prog attached at the TC/bpf_cls level,
which can adjust the head via bpf_skb_change_head (for XDP it is
bpf_xdp_adjust_head) into the area used by xdp_frame.  As described in
https://git.kernel.org/davem/net-next/c/6dfb970d3dbd it is not super
critical at the moment, as this _currently_ runs as CAP_SYS_ADMIN, but
we would like to move towards CAP_NET_ADMIN.
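
For reference, cpumap guards against this when building the SKB;
roughly (from memory, the exact code may differ):

	/* The xdp_frame sits at the top of the headroom; clear it so
	 * frame internals cannot be read back through skb->head, e.g.
	 * by a TC prog moving the head into that area.
	 */
	memset(xdpf, 0, sizeof(*xdpf));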

 
> > Second, I think the refcnt scheme here is wrong. The xdp_frame should
> > be "owned" by XDP and have the proper refcnt to deliver it directly to
> > the network stack.
> > 
> >>> Third, if we choose that we want a fallback, in case XDP is not enabled
> >>> on the egress dev (but it has an ndo_xdp_xmit), then it should be placed
> >>> in the generic/core code.  E.g. __bpf_tx_xdp_map() could look at the
> >>> return code from dev->netdev_ops->ndo_xdp_xmit() and create an SKB.  (Hint:
> >>> this would make it easy to implement TX bulking towards the dev.)  
> 
> Right, this is a much cleaner way.
>
> Although I feel like we should add this fallback for veth because it
> requires something which is different from other drivers (enabling XDP
> on the peer device of the egress device), 

(This is why I Cc'ed Tariq...)

This is actually a general problem with the xdp "xmit" side (and not
specific to the veth driver). The problem exists for other drivers as well.

The problem is that a driver can implement ndo_xdp_xmit(), but the
driver might not have allocated the necessary XDP TX-queue resources
yet (or it might not be possible due to system resource limits).

The current "hack" is to load an XDP prog on the egress device, and
then assume that this causes the driver to also allocate the XDP
ndo_xdp_xmit side HW resources.  This is IMHO a wrong assumption.

We need a more generic way to test if a net_device is "ready/enabled"
for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas on
how to implement this...
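
Purely hypothetical sketch of the kind of check I mean (no such helper
or flag exists today; both names are made up):

	static inline bool netif_xdp_xmit_ready(const struct net_device *dev)
	{
		return dev->netdev_ops->ndo_xdp_xmit &&
		       READ_ONCE(dev->xdp_xmit_enabled); /* made-up flag */
	}

Then the redirect core could fail early (or fall back) instead of
assuming that a loaded XDP prog implies usable TX resources.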

My opinion is that it is a waste of (HW/mem) resources to always
allocate resources for ndo_xdp_xmit when loading an XDP program.
If my use-case is an XDP_DROP DDoS filter, or a CPUMAP
redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g.
today, using ixgbe on machines with more than 96 CPUs will fail due to
limited TX queue resources, thus blocking the mentioned use-cases.


> I'll drop the part for now. It should not be resolved in the driver
> code.

Thank you.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/9] veth: Use NAPI for XDP
  2018-05-01  8:02     ` Toshiaki Makita
@ 2018-05-01  8:43       ` Jesper Dangaard Brouer
  2018-05-02  3:37         ` Toshiaki Makita
  0 siblings, 1 reply; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-01  8:43 UTC (permalink / raw)
  To: Toshiaki Makita; +Cc: Toshiaki Makita, netdev, brouer

On Tue, 1 May 2018 17:02:34 +0900
Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:

> On 2018/05/01 16:50, Jesper Dangaard Brouer wrote:
> > On Tue, 24 Apr 2018 23:39:18 +0900
> > Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
> >   
> >> +static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
> >> +{
> >> +	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
> >> +		return -ENOSPC;
> >> +
> >> +	return 0;
> >> +}  
> > 
> > Here we have a lock per (enqueued) packet.  I'm working on changing the
> > ndo_xdp_xmit API to allow bulking.  And the tun driver has the exact same
> > issue/need.  
> 
> Probably I should have noted it in the commit log, but this per-packet lock
> is removed in patch 9.
> I'm curious whether any change will be needed for your new API.

Again, I'm just moving this into the generic code, to avoid having to
implement this for every driver.

Plus, with CONFIG_RETPOLINE we have the advantage of only calling
ndo_xdp_xmit once per bulk (an indirect function pointer call, which
retpolines make expensive).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
  2018-05-01  8:14           ` Jesper Dangaard Brouer
@ 2018-05-02  3:33             ` Toshiaki Makita
  2018-05-02  5:56               ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 23+ messages in thread
From: Toshiaki Makita @ 2018-05-02  3:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Toshiaki Makita, netdev, Tariq Toukan, Daniel Borkmann,
	Alexei Starovoitov, Eran Ben Elisha

On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita 
> <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:
>>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita 
>>> <makita.toshiaki@lab.ntt.co.jp> wrote:
>>> 
>>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:
>>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita 
>>>>> <toshiaki.makita1@gmail.com> wrote:
>>>>> 
> >>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
> >>>>>> +{
> >>>>>> +	struct veth_priv *rcv_priv, *priv = netdev_priv(dev);
> >>>>>> +	int headroom = frame->data - (void *)frame;
> >>>>>> +	struct net_device *rcv;
> >>>>>> +	int err = 0;
> >>>>>> +
> >>>>>> +	rcv = rcu_dereference(priv->peer);
> >>>>>> +	if (unlikely(!rcv))
> >>>>>> +		return -ENXIO;
> >>>>>> +
> >>>>>> +	rcv_priv = netdev_priv(rcv);
> >>>>>> +	/* xdp_ring is initialized on receive side? */
> >>>>>> +	if (rcu_access_pointer(rcv_priv->xdp_prog)) {
> >>>>>> +		err = xdp_ok_fwd_dev(rcv, frame->len);
> >>>>>> +		if (unlikely(err))
> >>>>>> +			return err;
> >>>>>> +
> >>>>>> +		err = veth_xdp_enqueue(rcv_priv, veth_xdp_to_ptr(frame));
> >>>>>> +	} else {
> >>>>>> +		struct sk_buff *skb;
> >>>>>> +
> >>>>>> +		skb = veth_build_skb(frame, headroom, frame->len, 0);
> >>>>>> +		if (unlikely(!skb))
> >>>>>> +			return -ENOMEM;
> >>>>>> +
> >>>>>> +		/* Get page ref in case skb is dropped in netif_rx.
> >>>>>> +		 * The caller is responsible for freeing the page on error.
> >>>>>> +		 */
> >>>>>> +		get_page(virt_to_page(frame->data));
>>>>> 
>>>>> I'm not sure you can make this assumption, that xdp_frames
>>>>> coming from another device driver use a refcnt-based memory
>>>>> model. But maybe I'm confused, as this looks like an SKB
>>>>> receive path, but in the ndo_xdp_xmit().
>>>> 
>>>> I find this path similar to cpumap, which creates an skb from the
>>>> redirected xdp frame. Once it is converted to an skb, the skb head
>>>> is freed by page_frag_free(), so I needed to take the page refcount
>>>> here regardless of the memory model.
>>> 
>>> Yes I know, I wrote cpumap ;-)
>>> 
>>> First of all, I don't want to see such xdp_frame to SKB
>>> conversion code in every driver, because that increases the
>>> chances of errors.  And when looking at the details, it
>>> seems that you have made the mistake of making it possible to
>>> leak xdp_frame info to the SKB (which cpumap takes into
>>> account).
>> 
>> Do you mean leaving xdp_frame in skb->head is leaking something? 
>> how?
> 
> Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp 
> headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored 
> in frame data on page reuse").

Thanks for sharing the info.

> But this time, the concern is a bpf_prog attached at the TC/bpf_cls
> level, which can adjust the head via bpf_skb_change_head (for
> XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame.  As
> described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd it
> is not super critical at the moment, as this _currently_ runs as
> CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.

What I don't get is why we are special-casing the xdp_frame info. My
assumption is that the area above skb->mac_header is uninitialized kernel
memory and should not be readable by unprivileged users anyway, so I
didn't clear the area at this point.

>>> Second, I think the refcnt scheme here is wrong. The xdp_frame 
>>> should be "owned" by XDP and have the proper refcnt to deliver
>>> it directly to the network stack.
>>> 
>>> Third, if we choose that we want a fallback, in case XDP is not
>>> enabled on the egress dev (but it has an ndo_xdp_xmit), then it
>>> should be placed in the generic/core code.  E.g.
>>> __bpf_tx_xdp_map() could look at the return code from
>>> dev->netdev_ops->ndo_xdp_xmit() and create an SKB.  (Hint: this would
>>> make it easy to implement TX bulking towards the dev.)
>> 
>> Right, this is a much cleaner way.
>> 
>> Although I feel like we should add this fallback for veth because 
>> it requires something which is different from other drivers 
>> (enabling XDP on the peer device of the egress device),
> 
> (This is why I Cc'ed Tariq...)
> 
> This is actually a general problem with the xdp "xmit" side (and not
> specific to the veth driver). The problem exists for other drivers as
> well.
> 
> The problem is that a driver can implement ndo_xdp_xmit(), but the 
> driver might not have allocated the necessary XDP TX-queue resources
>  yet (or it might not be possible due to system resource limits).
> 
> The current "hack" is to load an XDP prog on the egress device, and
> then assume that this causes the driver to also allocate the XDP
> ndo_xdp_xmit side HW resources.  This is IMHO a wrong assumption.
>
> We need a more generic way to test if a net_device is "ready/enabled"
> for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas
> on how to implement this...

My assumption on the REDIRECT requirement came from this:
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b

I guess you are saying things are changing, and having an XDP program
attached on the egress device is no longer generally sufficient. Looking
forward to Tariq's solution.

Toshiaki Makita

> 
> > My opinion is that it is a waste of (HW/mem) resources to always
> > allocate resources for ndo_xdp_xmit when loading an XDP program.
> > If my use-case is an XDP_DROP DDoS filter, or a CPUMAP
> > redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g.
> > today, using ixgbe on machines with more than 96 CPUs will fail due
> > to limited TX queue resources, thus blocking the mentioned
> > use-cases.
> 
> 
>> I'll drop the part for now. It should not be resolved in the driver
>> code.
> 
> Thank you.
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/9] veth: Use NAPI for XDP
  2018-05-01  8:43       ` Jesper Dangaard Brouer
@ 2018-05-02  3:37         ` Toshiaki Makita
  0 siblings, 0 replies; 23+ messages in thread
From: Toshiaki Makita @ 2018-05-02  3:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Toshiaki Makita, netdev

On 18/05/01 (火) 17:43, Jesper Dangaard Brouer wrote:
> On Tue, 1 May 2018 17:02:34 +0900
> Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> wrote:
> 
>> On 2018/05/01 16:50, Jesper Dangaard Brouer wrote:
>>> On Tue, 24 Apr 2018 23:39:18 +0900
>>> Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:
>>>    
>>>> +static int veth_xdp_enqueue(struct veth_priv *priv, void *ptr)
>>>> +{
>>>> +	if (unlikely(ptr_ring_produce(&priv->xdp_ring, ptr)))
>>>> +		return -ENOSPC;
>>>> +
>>>> +	return 0;
>>>> +}
>>>
>>> Here we have a lock per (enqueued) packet.  I'm working on changing the
>>> ndo_xdp_xmit API to allow bulking.  And the tun driver has the exact same
>>> issue/need.
>>
>> Probably I should have noted it in the commit log, but this per-packet lock
>> is removed in patch 9.
>> I'm curious whether any change will be needed for your new API.
> 
> Again, I'm just moving this into the generic code, to avoid having to
> implement this for every driver.
> 
> Plus, with CONFIG_RETPOLINE we have the advantage of only calling
> ndo_xdp_xmit once per bulk (an indirect function pointer call, which
> retpolines make expensive).

Nice. Once it becomes available I'll use it instead.

Toshiaki Makita

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
  2018-05-02  3:33             ` Toshiaki Makita
@ 2018-05-02  5:56               ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 23+ messages in thread
From: Jesper Dangaard Brouer @ 2018-05-02  5:56 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toshiaki Makita, netdev, Tariq Toukan, Daniel Borkmann,
	Alexei Starovoitov, Eran Ben Elisha, brouer

On Wed, 2 May 2018 12:33:47 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> > On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita 
> > <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:  
> >>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita 
> >>> <makita.toshiaki@lab.ntt.co.jp> wrote:
> >>>   
> >>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:  
> >>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita 
> >>>>> <toshiaki.makita1@gmail.com> wrote:
> >>>>>   
> >>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct 
[...]
> >>>>> 
> >>>>> I'm not sure you can make this assumption, that xdp_frames
> >>>>> coming from another device driver use a refcnt-based memory
> >>>>> model. But maybe I'm confused, as this looks like an SKB
> >>>>> receive path, but in the ndo_xdp_xmit().  
> >>>> 
> >>>> I find this path similar to cpumap, which creates an skb from the
> >>>> redirected xdp frame. Once it is converted to an skb, the skb head
> >>>> is freed by page_frag_free(), so I needed to take the page refcount
> >>>> here regardless of the memory model.  
> >>> 
> >>> Yes I know, I wrote cpumap ;-)
> >>> 
> >>> First of all, I don't want to see such xdp_frame to SKB
> >>> conversion code in every driver, because that increases the
> >>> chances of errors.  And when looking at the details, it
> >>> seems that you have made the mistake of making it possible to
> >>> leak xdp_frame info to the SKB (which cpumap takes into
> >>> account).  
> >> 
> >> Do you mean leaving xdp_frame in skb->head is leaking something? 
> >> how?  
> > 
> > Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp 
> > headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored 
> > in frame data on page reuse").  
> 
> Thanks for sharing the info.
> 
> > But this time, the concern is a bpf_prog attached at the TC/bpf_cls
> > level, which can adjust the head via bpf_skb_change_head (for
> > XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame.  As
> > described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd it
> > is not super critical at the moment, as this _currently_ runs as
> > CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.  
> 
> What I don't get is why we are special-casing the xdp_frame info. My
> assumption is that the area above skb->mac_header is uninitialized kernel
> memory and should not be readable by unprivileged users anyway, so I
> didn't clear the area at this point.

This is also my understanding, but Alexei explicitly asked me to handle
this xdp_frame info case.  I assume that more work is required for the
transition from CAP_SYS_ADMIN to CAP_NET_ADMIN; we just don't want to
add more new code that makes this more difficult.


> >>> Second, I think the refcnt scheme here is wrong. The xdp_frame 
> >>> should be "owned" by XDP and have the proper refcnt to deliver
> >>> it directly to the network stack.
> >>> 
> >>> Third, if we choose that we want a fallback, in case XDP is not
> >>> enabled on the egress dev (but it has an ndo_xdp_xmit), then it
> >>> should be placed in the generic/core code.  E.g.
> >>> __bpf_tx_xdp_map() could look at the return code from
> >>> dev->netdev_ops->ndo_xdp_xmit() and create an SKB.  (Hint: this would
> >>> make it easy to implement TX bulking towards the dev.)  
> >> 
> >> Right, this is a much cleaner way.
> >> 
> >> Although I feel like we should add this fallback for veth because 
> >> it requires something which is different from other drivers 
> >> (enabling XDP on the peer device of the egress device),  
> > 
> > (This is why I Cc'ed Tariq...)
> > 
> > This is actually a general problem with the xdp "xmit" side (and not
> > specific to the veth driver). The problem exists for other drivers as
> > well.
> > 
> > The problem is that a driver can implement ndo_xdp_xmit(), but the 
> > driver might not have allocated the necessary XDP TX-queue resources
> >  yet (or it might not be possible due to system resource limits).
> > 
> > The current "hack" is to load an XDP prog on the egress device, and
> > then assume that this causes the driver to also allocate the XDP
> > ndo_xdp_xmit side HW resources.  This is IMHO a wrong assumption.
> >
> > We need a more generic way to test if a net_device is "ready/enabled"
> > for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas
> > on how to implement this...  
> 
> My assumption on the REDIRECT requirement came from this:
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b

Yes, I wrote that.

> I guess you are saying things are changing, and having an XDP program
> attached on the egress device is no longer generally sufficient. Looking
> forward to Tariq's solution.

Yes, (hopefully) things are changing. Loading a dummy XDP program to
enable ndo_xdp_xmit turned out to be a bad model, for all the reasons
I listed.

I hope Tariq finds some time to work on this ... ;-)


> Toshiaki Makita
> 
> > 
> > My opinion is that it is a waste of (HW/mem) resources to always
> > allocate resources for ndo_xdp_xmit when loading an XDP program.
> > If my use-case is an XDP_DROP DDoS filter, or a CPUMAP
> > redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g.
> > today, using ixgbe on machines with more than 96 CPUs will fail due
> > to limited TX queue resources, thus blocking the mentioned
> > use-cases.
> > 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2018-05-02  5:57 UTC | newest]

Thread overview: 23+ messages
2018-04-24 14:39 [PATCH RFC 0/9] veth: Driver XDP Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 1/9] net: Export skb_headers_offset_update and skb_copy_header Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 2/9] veth: Add driver XDP Toshiaki Makita
2018-04-25 20:39   ` Jesper Dangaard Brouer
2018-04-26 10:46     ` Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 3/9] veth: Avoid drops by oversized packets when XDP is enabled Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 4/9] veth: Use NAPI for XDP Toshiaki Makita
2018-05-01  7:50   ` Jesper Dangaard Brouer
2018-05-01  8:02     ` Toshiaki Makita
2018-05-01  8:43       ` Jesper Dangaard Brouer
2018-05-02  3:37         ` Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 5/9] veth: Handle xdp_frame in xdp napi ring Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 6/9] veth: Add ndo_xdp_xmit Toshiaki Makita
2018-04-25 20:24   ` Jesper Dangaard Brouer
2018-04-26 10:52     ` Toshiaki Makita
2018-04-30 17:27       ` Jesper Dangaard Brouer
2018-05-01  1:02         ` Toshiaki Makita
2018-05-01  8:14           ` Jesper Dangaard Brouer
2018-05-02  3:33             ` Toshiaki Makita
2018-05-02  5:56               ` Jesper Dangaard Brouer
2018-04-24 14:39 ` [PATCH RFC 7/9] veth: Add XDP TX and REDIRECT Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 8/9] veth: Avoid per-packet spinlock of XDP napi ring on dequeueing Toshiaki Makita
2018-04-24 14:39 ` [PATCH RFC 9/9] veth: Avoid per-packet spinlock of XDP napi ring on enqueueing Toshiaki Makita
