* [PATCH bpf-next RFCv3 0/6] AF_XDP support for veth.
From: William Tu @ 2018-12-26 20:27 UTC
  To: bjorn.topel, magnus.karlsson, ast, daniel, netdev,
	makita.toshiaki, yihung.wei, magnus.karlsson

The patch series adds AF_XDP support for the veth device. The first patch
adds a new API that lets non-physical NIC devices get a packet's
virtual address.  The second patch implements the async xmit
with one extra copy. The third and fourth patches implement the
zero-copy AF_XDP TX support.  The fifth patch implements AF_XDP
RX, and the last patch adds example use cases.

I tested with 2 namespaces, one as sender, the other as receiver.
The packet rate is measured at the receiver side.
  ip netns add at_ns0
  ip link add p0 type veth peer name p1
  ip link set p0 netns at_ns0
  ip link set dev p1 up
  ip netns exec at_ns0 ip link set dev p0 up
  
  # receiver without AF_XDP
  ip netns exec at_ns0 xdp_rxq_info --dev p0 --action XDP_DROP

  # receiver with AF_XDP
  ip netns exec at_ns0 xdpsock -i p0 -r -N -z 

  # sender without AF_XDP ZC
  xdpsock -i p1 -t -S

  # sender with AF_XDP ZC
  xdpsock -i p1 -t -N -z

Without AF_XDP: 724 Kpps
With AF_XDP TX copy-mode (patch 2): 1.1 Mpps (ksoftirqd at 100% CPU)
With AF_XDP TX zero-copy (patch 4): 1.4 Mpps (ksoftirqd at 100% CPU)
With both peers running AF_XDP (patch 5): 2.4 Mpps

v2->v3:
- refactor xsk_umem_consume_tx_virtual, as suggested by Björn Töpel
- fix the race condition by processing TX in the peer's RX napi,
  as suggested by Björn Töpel
- add AF_XDP zero copy TX
- add AF_XDP RX

v1->v2:
- refactor xsk_umem_consume_tx_virtual
- use the umems provided by the netdev
- fix a bug in locating the peer side's rq by qid


William Tu (6):
  xsk: add xsk_umem_consume_tx_virtual.
  veth: support AF_XDP TX copy-mode.
  xsk: add new MEM type for virtual device.
  veth: add zero-copy AF_XDP TX support.
  veth: add AF_XDP RX support.
  samples: bpf: add veth AF_XDP example.

 drivers/net/veth.c             | 212 ++++++++++++++++++++++++++++++++++++++++-
 include/net/xdp.h              |   1 +
 include/net/xdp_sock.h         |   7 ++
 net/core/xdp.c                 |   1 +
 net/xdp/xdp_umem.c             |   1 +
 net/xdp/xsk.c                  |  41 ++++++--
 samples/bpf/test_veth_afxdp.sh |  82 ++++++++++++++++
 7 files changed, 336 insertions(+), 9 deletions(-)
 create mode 100755 samples/bpf/test_veth_afxdp.sh

-- 
2.7.4


* [PATCH bpf-next RFCv3 1/6] xsk: add xsk_umem_consume_tx_virtual.
From: William Tu @ 2018-12-26 20:27 UTC
  To: bjorn.topel, magnus.karlsson, ast, daniel, netdev,
	makita.toshiaki, yihung.wei, magnus.karlsson

Currently xsk_umem_consume_tx expects only physical NICs, so the API
returns a DMA address.  This patch introduces a new function that
returns the virtual address instead, for when XSK is used by a
virtual device.
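
For reference, a minimal sketch of how a virtual device driver might
drain descriptors with the new helper; the function below is
illustrative only and not part of this series' diffs:

	static int vdev_drain_xsk_tx(struct xdp_umem *umem, int budget)
	{
		int done = 0;
		void *vaddr;
		u32 len;

		while (done < budget &&
		       xsk_umem_consume_tx_virtual(umem, &vaddr, &len)) {
			/* vaddr points into the umem; hand (vaddr, len)
			 * to the device's receive path here, then
			 * complete the consumed entry.
			 */
			xsk_umem_complete_tx(umem, 1);
			done++;
		}
		if (done)
			xsk_umem_consume_tx_done(umem);
		return done;
	}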

Signed-off-by: William Tu <u9012063@gmail.com>
---
 include/net/xdp_sock.h |  7 +++++++
 net/xdp/xdp_umem.c     |  1 +
 net/xdp/xsk.c          | 38 +++++++++++++++++++++++++++++++-------
 3 files changed, 39 insertions(+), 7 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 13acb9803a6d..7fefe74f7fb5 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -81,6 +81,7 @@ u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
 void xsk_umem_discard_addr(struct xdp_umem *umem);
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
 bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
+bool xsk_umem_consume_tx_virtual(struct xdp_umem *umem, void **addr, u32 *len);
 void xsk_umem_consume_tx_done(struct xdp_umem *umem);
 struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
 struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
@@ -165,6 +166,12 @@ static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
 	return false;
 }
 
+static inline bool xsk_umem_consume_tx_virtual(struct xdp_umem *umem,
+					       void **vaddr, u32 *len)
+{
+	return false;
+}
+
 static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem)
 {
 }
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index a264cf2accd0..424ae2538f9f 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -60,6 +60,7 @@ struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev,
 
 	return NULL;
 }
+EXPORT_SYMBOL(xdp_get_umem_from_qid);
 
 static void xdp_clear_umem_at_qid(struct net_device *dev, u16 queue_id)
 {
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 07156f43d295..0e252047f55f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -170,22 +170,19 @@ void xsk_umem_consume_tx_done(struct xdp_umem *umem)
 }
 EXPORT_SYMBOL(xsk_umem_consume_tx_done);
 
-bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len)
+static __always_inline bool __xsk_umem_consume_tx(struct xdp_umem *umem,
+						  struct xdp_desc *desc)
 {
-	struct xdp_desc desc;
 	struct xdp_sock *xs;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
-		if (!xskq_peek_desc(xs->tx, &desc))
+		if (!xskq_peek_desc(xs->tx, desc))
 			continue;
 
-		if (xskq_produce_addr_lazy(umem->cq, desc.addr))
+		if (xskq_produce_addr_lazy(umem->cq, desc->addr))
 			goto out;
 
-		*dma = xdp_umem_get_dma(umem, desc.addr);
-		*len = desc.len;
-
 		xskq_discard_desc(xs->tx);
 		rcu_read_unlock();
 		return true;
@@ -195,8 +192,35 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len)
 	rcu_read_unlock();
 	return false;
 }
+
+bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len)
+{
+	struct xdp_desc desc;
+
+	if (!__xsk_umem_consume_tx(umem, &desc))
+		return false;
+
+	*dma = xdp_umem_get_dma(umem, desc.addr);
+	*len = desc.len;
+
+	return true;
+}
 EXPORT_SYMBOL(xsk_umem_consume_tx);
 
+bool xsk_umem_consume_tx_virtual(struct xdp_umem *umem, void **vaddr, u32 *len)
+{
+	struct xdp_desc desc;
+
+	if (!__xsk_umem_consume_tx(umem, &desc))
+		return false;
+
+	*vaddr = xdp_umem_get_data(umem, desc.addr);
+	*len = desc.len;
+
+	return true;
+}
+EXPORT_SYMBOL(xsk_umem_consume_tx_virtual);
+
 static int xsk_zc_xmit(struct sock *sk)
 {
 	struct xdp_sock *xs = xdp_sk(sk);
-- 
2.7.4


* [PATCH bpf-next RFCv3 2/6] veth: support AF_XDP TX copy-mode.
From: William Tu @ 2018-12-26 20:27 UTC
  To: bjorn.topel, magnus.karlsson, ast, daniel, netdev,
	makita.toshiaki, yihung.wei, magnus.karlsson

The patch adds support for AF_XDP async xmit.  Users can use
AF_XDP on both sides of the veth and get better performance, at
the cost of ksoftirqd doing the xmit.  veth_xsk_async_xmit
simply kicks the napi handler, veth_poll, to receive the packets
that are on the umem transmit ring at the _peer_ side.

Tested using two namespaces, one runs xdpsock and the other runs
xdp_rxq_info.  A simple script comparing the performance with/without
AF_XDP shows improvement from 724Kpps to 1.1Mpps.

  ip netns add at_ns0
  ip link add p0 type veth peer name p1
  ip link set p0 netns at_ns0
  ip link set dev p1 up
  ip netns exec at_ns0 ip link set dev p0 up

  # receiver
  ip netns exec at_ns0 xdp_rxq_info --dev p0 --action XDP_DROP

  # sender
  xdpsock -i p1 -t -N -z
  or
  xdpsock -i p1 -t -S

Signed-off-by: William Tu <u9012063@gmail.com>
---
 drivers/net/veth.c | 200 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 199 insertions(+), 1 deletion(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index f412ea1cef18..10cf9ded59f1 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -25,6 +25,10 @@
 #include <linux/ptr_ring.h>
 #include <linux/bpf_trace.h>
 #include <linux/net_tstamp.h>
+#include <net/xdp_sock.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <net/page_pool.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
@@ -53,6 +57,8 @@ struct veth_rq {
 	bool			rx_notify_masked;
 	struct ptr_ring		xdp_ring;
 	struct xdp_rxq_info	xdp_rxq;
+	struct xdp_umem *xsk_umem;
+	u16 qid;
 };
 
 struct veth_priv {
@@ -737,11 +743,95 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
 	return done;
 }
 
+static int veth_xsk_poll(struct napi_struct *napi, int budget)
+{
+	struct veth_priv *priv, *peer_priv;
+	struct net_device *dev, *peer_dev;
+	struct veth_rq *peer_rq;
+	struct veth_rq *rq =
+		container_of(napi, struct veth_rq, xdp_napi);
+	int done = 0;
+
+	dev = rq->dev;
+	priv = netdev_priv(dev);
+	peer_dev = priv->peer;
+	peer_priv = netdev_priv(peer_dev);
+	peer_rq = &peer_priv->rq[rq->qid];
+
+	while (peer_rq->xsk_umem && budget--) {
+		unsigned int inner_xdp_xmit = 0;
+		unsigned int metasize = 0;
+		struct xdp_frame *xdpf;
+		bool dropped = false;
+		struct sk_buff *skb;
+		struct page *page;
+		void *vaddr;
+		void *addr;
+		u32 len;
+
+		if (!xsk_umem_consume_tx_virtual(peer_rq->xsk_umem, &vaddr, &len))
+			break;
+
+		page = dev_alloc_page();
+		if (!page) {
+			xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
+			xsk_umem_consume_tx_done(peer_rq->xsk_umem);
+			return -ENOMEM;
+		}
+
+		addr = page_to_virt(page);
+		xdpf = addr;
+		memset(xdpf, 0, sizeof(*xdpf));
+
+		addr += sizeof(*xdpf);
+		memcpy(addr, vaddr, len);
+
+		xdpf->data = addr + metasize;
+		xdpf->len = len;
+		xdpf->headroom = 0;
+		xdpf->metasize = metasize;
+		xdpf->mem.type = MEM_TYPE_PAGE_SHARED;
+
+		/* put into rq */
+		skb = veth_xdp_rcv_one(rq, xdpf, &inner_xdp_xmit);
+		if (!skb) {
+			/* Peer side has XDP program attached */
+			if (inner_xdp_xmit & VETH_XDP_TX) {
+				/* Not supported */
+				pr_warn("veth: peer XDP_TX not supported\n");
+				xdp_return_frame(xdpf);
+				dropped = true;
+				goto skip_tx;
+			} else if (inner_xdp_xmit & VETH_XDP_REDIR) {
+				xdp_do_flush_map();
+			} else {
+				dropped = true;
+			}
+		} else {
+			napi_gro_receive(&rq->xdp_napi, skb);
+		}
+skip_tx:
+		xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
+		xsk_umem_consume_tx_done(peer_rq->xsk_umem);
+
+		/* update rq stats */
+		u64_stats_update_begin(&rq->stats.syncp);
+		rq->stats.xdp_packets++;
+		rq->stats.xdp_bytes += len;
+		if (dropped)
+			rq->stats.xdp_drops++;
+		u64_stats_update_end(&rq->stats.syncp);
+		done++;
+	}
+	return done;
+}
+
 static int veth_poll(struct napi_struct *napi, int budget)
 {
 	struct veth_rq *rq =
 		container_of(napi, struct veth_rq, xdp_napi);
 	unsigned int xdp_xmit = 0;
+	int tx_done;
 	int done;
 
 	xdp_set_return_frame_no_direct();
@@ -756,13 +846,17 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		}
 	}
 
+	tx_done = veth_xsk_poll(napi, budget);
+	if (tx_done > 0)
+		done += tx_done;
+
 	if (xdp_xmit & VETH_XDP_TX)
 		veth_xdp_flush(rq->dev);
 	if (xdp_xmit & VETH_XDP_REDIR)
 		xdp_do_flush_map();
 	xdp_clear_return_frame_no_direct();
 
-	return done;
+	return done > budget ? budget : done;
 }
 
 static int veth_napi_add(struct net_device *dev)
@@ -776,6 +870,7 @@ static int veth_napi_add(struct net_device *dev)
 		err = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
 		if (err)
 			goto err_xdp_ring;
+		rq->qid = i;
 	}
 
 	for (i = 0; i < dev->real_num_rx_queues; i++) {
@@ -812,6 +907,7 @@ static void veth_napi_del(struct net_device *dev)
 		netif_napi_del(&rq->xdp_napi);
 		rq->rx_notify_masked = false;
 		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
+		rq->qid = -1;
 	}
 }
 
@@ -836,6 +932,7 @@ static int veth_enable_xdp(struct net_device *dev)
 
 			/* Save original mem info as it can be overwritten */
 			rq->xdp_mem = rq->xdp_rxq.mem;
+			rq->qid = i;
 		}
 
 		err = veth_napi_add(dev);
@@ -1115,6 +1212,84 @@ static u32 veth_xdp_query(struct net_device *dev)
 	return 0;
 }
 
+int veth_xsk_umem_query(struct net_device *dev, struct xdp_umem **umem,
+			u16 qid)
+{
+	struct xdp_umem *queried_umem;
+
+	queried_umem = xdp_get_umem_from_qid(dev, qid);
+
+	if (!queried_umem)
+		return -EINVAL;
+
+	*umem = queried_umem;
+	return 0;
+}
+
+static int veth_xsk_umem_enable(struct net_device *dev,
+				struct xdp_umem *umem,
+				u16 qid)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct xdp_umem_fq_reuse *reuseq;
+	int err = 0;
+
+	if (qid >= dev->real_num_rx_queues)
+		return -EINVAL;
+
+	reuseq = xsk_reuseq_prepare(priv->rq[0].xdp_ring.size);
+	if (!reuseq)
+		return -ENOMEM;
+
+	xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
+
+	priv->rq[qid].xsk_umem = umem;
+
+	return err;
+}
+
+static int veth_xsk_umem_disable(struct net_device *dev,
+				 u16 qid)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+	struct xdp_umem *umem;
+
+	umem = xdp_get_umem_from_qid(dev, qid);
+	if (!umem)
+		return -EINVAL;
+
+	priv->rq[qid].xsk_umem = NULL;
+	return 0;
+}
+
+int veth_xsk_umem_setup(struct net_device *dev, struct xdp_umem *umem,
+			u16 qid)
+{
+	return umem ? veth_xsk_umem_enable(dev, umem, qid) :
+		      veth_xsk_umem_disable(dev, qid);
+}
+
+int veth_xsk_async_xmit(struct net_device *dev, u32 qid)
+{
+	struct veth_priv *priv, *peer_priv;
+	struct net_device *peer_dev;
+	struct veth_rq *peer_rq;
+
+	priv = netdev_priv(dev);
+	peer_dev = priv->peer;
+	peer_priv = netdev_priv(peer_dev);
+	peer_rq = &peer_priv->rq[qid];
+
+	if (qid >= dev->real_num_rx_queues)
+		return -ENXIO;
+
+	/* Schedule the peer side NAPI to receive */
+	if (!napi_if_scheduled_mark_missed(&peer_rq->xdp_napi))
+		napi_schedule(&peer_rq->xdp_napi);
+
+	return 0;
+}
+
 static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 {
 	switch (xdp->command) {
@@ -1123,6 +1298,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	case XDP_QUERY_PROG:
 		xdp->prog_id = veth_xdp_query(dev);
 		return 0;
+	case XDP_QUERY_XSK_UMEM:
+		return veth_xsk_umem_query(dev, &xdp->xsk.umem,
+					   xdp->xsk.queue_id);
+	case XDP_SETUP_XSK_UMEM: {
+		struct veth_priv *priv;
+		int err;
+
+		/* Enable NAPI on both sides, by enabling
+		 * their XDP.
+		 */
+		err = veth_enable_xdp(dev);
+		if (err)
+			return err;
+
+		priv = netdev_priv(dev);
+		err = veth_enable_xdp(priv->peer);
+		if (err)
+			return err;
+
+		return veth_xsk_umem_setup(dev, xdp->xsk.umem,
+					   xdp->xsk.queue_id);
+	}
 	default:
 		return -EINVAL;
 	}
@@ -1145,6 +1342,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_set_rx_headroom	= veth_set_rx_headroom,
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_xdp_xmit,
+	.ndo_xsk_async_xmit	= veth_xsk_async_xmit,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.7.4


* [PATCH bpf-next RFCv3 3/6] xsk: add new MEM type for virtual device.
From: William Tu @ 2018-12-26 20:27 UTC
  To: bjorn.topel, magnus.karlsson, ast, daniel, netdev,
	makita.toshiaki, yihung.wei, magnus.karlsson

Add MEM_TYPE_ZERO_COPY_VDEV for supporting veth AF_XDP.
For zero-copy veth, the memory comes from userspace and does
not need to be freed by the kernel.  Thus, when XDP returns the
frame, detect this type and do nothing.

Signed-off-by: William Tu <u9012063@gmail.com>
---
 include/net/xdp.h | 1 +
 net/core/xdp.c    | 1 +
 net/xdp/xsk.c     | 3 ++-
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 0f25b3675c5c..010cb9efca90 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -38,6 +38,7 @@ enum xdp_mem_type {
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
 	MEM_TYPE_PAGE_POOL,
 	MEM_TYPE_ZERO_COPY,
+	MEM_TYPE_ZERO_COPY_VDEV,
 	MEM_TYPE_MAX,
 };
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 4b2b194f4f1f..d47f003714c4 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -355,6 +355,7 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
 		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
 		xa->zc_alloc->free(xa->zc_alloc, handle);
 		rcu_read_unlock();
+	case MEM_TYPE_ZERO_COPY_VDEV:
 	default:
 		/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
 		break;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0e252047f55f..3d151d64fc79 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -109,7 +109,8 @@ int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 
 	len = xdp->data_end - xdp->data;
 
-	return (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) ?
+	return (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY ||
+		xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY_VDEV) ?
 		__xsk_rcv_zc(xs, xdp, len) : __xsk_rcv(xs, xdp, len);
 }
 
-- 
2.7.4


* [PATCH bpf-next RFCv3 4/6] veth: add zero-copy AF_XDP TX support.
From: William Tu @ 2018-12-26 20:27 UTC
  To: bjorn.topel, magnus.karlsson, ast, daniel, netdev,
	makita.toshiaki, yihung.wei, magnus.karlsson

Remove the extra copy when doing AF_XDP TX.  The xdp frame now comes
directly from the umem element and is passed to the receiving logic.
Also, depending only on async_xmit to kick the napi poll isn't fast
enough, so re-schedule the napi at the end of the poll so that
ksoftirqd can keep processing the packets.  The performance increases
from 1.1 Mpps to 1.4 Mpps, when running zero-copy xdpsock as the
sender and XDP_DROP at the receiver side.

Signed-off-by: William Tu <u9012063@gmail.com>
---
 drivers/net/veth.c | 41 ++++++++++++++++++-----------------------
 1 file changed, 18 insertions(+), 23 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 10cf9ded59f1..551444195398 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -761,45 +761,34 @@ static int veth_xsk_poll(struct napi_struct *napi, int budget)
 	while (peer_rq->xsk_umem && budget--) {
 		unsigned int inner_xdp_xmit = 0;
 		unsigned int metasize = 0;
-		struct xdp_frame *xdpf;
+		struct xdp_frame xdpf;
 		bool dropped = false;
 		struct sk_buff *skb;
 		struct page *page;
 		void *vaddr;
-		void *addr;
 		u32 len;
 
 		if (!xsk_umem_consume_tx_virtual(peer_rq->xsk_umem, &vaddr, &len))
 			break;
 
-		page = dev_alloc_page();
-		if (!page) {
-			xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
-			xsk_umem_consume_tx_done(peer_rq->xsk_umem);
-			return -ENOMEM;
-		}
-
-		addr = page_to_virt(page);
-		xdpf = addr;
-		memset(xdpf, 0, sizeof(*xdpf));
-
-		addr += sizeof(*xdpf);
-		memcpy(addr, vaddr, len);
+		xdpf.data = vaddr + metasize;
+		xdpf.len = len;
+		xdpf.headroom = 0;
+		xdpf.metasize = metasize;
+		xdpf.mem.type = MEM_TYPE_ZERO_COPY_VDEV;
 
-		xdpf->data = addr + metasize;
-		xdpf->len = len;
-		xdpf->headroom = 0;
-		xdpf->metasize = metasize;
-		xdpf->mem.type = MEM_TYPE_PAGE_SHARED;
+		page = virt_to_head_page(vaddr);
+		if (page->mem_cgroup)
+			page->mem_cgroup = NULL;
 
 		/* put into rq */
-		skb = veth_xdp_rcv_one(rq, xdpf, &inner_xdp_xmit);
+		skb = veth_xdp_rcv_one(rq, &xdpf, &inner_xdp_xmit);
 		if (!skb) {
 			/* Peer side has XDP program attached */
 			if (inner_xdp_xmit & VETH_XDP_TX) {
 				/* Not supported */
 				pr_warn("veth: peer XDP_TX not supported\n");
-				xdp_return_frame(xdpf);
+				xdp_return_frame(&xdpf);
 				dropped = true;
 				goto skip_tx;
 			} else if (inner_xdp_xmit & VETH_XDP_REDIR) {
@@ -808,7 +797,8 @@ static int veth_xsk_poll(struct napi_struct *napi, int budget)
 				dropped = true;
 			}
 		} else {
-			napi_gro_receive(&rq->xdp_napi, skb);
+			napi_gro_receive(&rq->xdp_napi, skb_copy(skb, GFP_KERNEL));
+			kfree(skb);
 		}
 skip_tx:
 		xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
@@ -856,6 +846,11 @@ static int veth_poll(struct napi_struct *napi, int budget)
 		xdp_do_flush_map();
 	xdp_clear_return_frame_no_direct();
 
+	/* schedule again so the CPU can keep receiving
+	 * at higher rate
+	 */
+	napi_schedule(&rq->xdp_napi);
+
 	return done > budget ? budget : done;
 }
 
-- 
2.7.4


* [PATCH bpf-next RFCv3 5/6] veth: add AF_XDP RX support.
From: William Tu @ 2018-12-26 20:27 UTC
  To: bjorn.topel, magnus.karlsson, ast, daniel, netdev,
	makita.toshiaki, yihung.wei, magnus.karlsson

If the receiving veth side has umem RX enabled, the patch
copies the packet directly from the peer side's send buffer
to the umem receive buffer.  This requires running AF_XDP
on both sides of the veth pair.  For example:
Receiver:
  # ip netns exec at_ns0 xdpsock -r -N -z -i p0
Sender:
  # xdpsock -i p1 -t -N -z
The performance increases from 1.4 Mpps to 2.3 Mpps.

Signed-off-by: William Tu <u9012063@gmail.com>
---
 drivers/net/veth.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 551444195398..8aac67554880 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -766,11 +766,28 @@ static int veth_xsk_poll(struct napi_struct *napi, int budget)
 		struct sk_buff *skb;
 		struct page *page;
 		void *vaddr;
+		u64 handle;
 		u32 len;
 
 		if (!xsk_umem_consume_tx_virtual(peer_rq->xsk_umem, &vaddr, &len))
 			break;
 
+		if (rq->xsk_umem && xsk_umem_peek_addr(rq->xsk_umem, &handle)) {
+			char *daddr;
+			u64 hr = 0;
+
+			/* the peer side also has umem enabled,
+			 * copy directly to it.
+			 */
+			handle &= rq->xsk_umem->chunk_mask;
+			hr = rq->xsk_umem->headroom + XDP_PACKET_HEADROOM;
+			daddr = xdp_umem_get_data(rq->xsk_umem, handle);
+			daddr += hr;
+			memcpy((void *)daddr, vaddr, len);
+			xsk_umem_discard_addr(rq->xsk_umem);
+			vaddr = daddr;
+		}
+
 		xdpf.data = vaddr + metasize;
 		xdpf.len = len;
 		xdpf.headroom = 0;
-- 
2.7.4


* [PATCH bpf-next RFCv3 6/6] samples: bpf: add veth AF_XDP example.
From: William Tu @ 2018-12-26 20:27 UTC
  To: bjorn.topel, magnus.karlsson, ast, daniel, netdev,
	makita.toshiaki, yihung.wei, magnus.karlsson

Add example use cases for AF_XDP sockets on two namespaces.
The script runs the sender in the root namespace, and the receiver
in the at_ns0 namespace, with different XDP actions.

Signed-off-by: William Tu <u9012063@gmail.com>
---
 samples/bpf/test_veth_afxdp.sh | 82 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 82 insertions(+)
 create mode 100755 samples/bpf/test_veth_afxdp.sh

diff --git a/samples/bpf/test_veth_afxdp.sh b/samples/bpf/test_veth_afxdp.sh
new file mode 100755
index 000000000000..d51c6031db47
--- /dev/null
+++ b/samples/bpf/test_veth_afxdp.sh
@@ -0,0 +1,82 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# The script runs the sender at the root namespace, and
+# the receiver at the namespace at_ns0, with different mode
+#  1. XDP_DROP
+#  2. XDP_TX
+#  3. XDP_PASS
+#  4. XDP_REDIRECT
+#  5. Generic XDP
+
+XDPSOCK=./xdpsock
+XDP_RXQ_INFO=./xdp_rxq_info
+
+ip netns add at_ns0
+ip link add p0 type veth peer name p1
+ip link set p0 netns at_ns0
+ip link set dev p1 up
+ip netns exec at_ns0 ip link set dev p0 up
+
+test_xdp_drop()
+{
+	echo "[Peer] XDP_DROP"
+	ip netns exec at_ns0 $XDP_RXQ_INFO --dev p0 --action XDP_DROP &
+}
+
+test_xdp_pass()
+{
+	echo "[Peer] XDP_PASS"
+	ip netns exec at_ns0 $XDP_RXQ_INFO --dev p0 --action XDP_PASS &
+}
+
+test_xdp_tx()
+{
+	echo "[Peer] XDP_TX"
+	ip netns exec at_ns0 $XDP_RXQ_INFO --dev p0 --action XDP_TX &
+}
+
+test_generic_xdp()
+{
+	echo "[Peer] Generic XDP"
+	ip netns exec at_ns0 $XDPSOCK -i p0 -r -S &
+}
+
+test_xdp_redirect()
+{
+	echo "[Peer] XDP_REDIRECT"
+	ip netns exec at_ns0 $XDPSOCK -i p0 -r -N &
+}
+
+test_xdp_zcrx()
+{
+	echo "[Peer] AF_XDP RX"
+	ip netns exec at_ns0 $XDPSOCK -i p0 -r -N -z &
+}
+
+cleanup() {
+	killall xdpsock
+	sleep 1
+	killall xdp_rxq_info
+	ip netns del at_ns0
+	ip link del p1
+}
+
+trap cleanup 0 3 6
+
+if [ "$1" == "drop" ]; then
+	test_xdp_drop
+elif [ "$1" == "pass" ]; then
+	test_xdp_pass
+elif [ "$1" == "tx" ]; then
+	test_xdp_tx
+elif [ "$1" == "redirect" ]; then
+	test_xdp_redirect
+elif [ "$1" == "zcrx" ]; then
+	test_xdp_zcrx
+else
+	test_xdp_drop
+fi
+
+# send at root namespace
+$XDPSOCK -i p1 -t -N -z
-- 
2.7.4


* Re: [PATCH bpf-next RFCv3 2/6] veth: support AF_XDP TX copy-mode.
From: Toshiaki Makita @ 2019-01-01 13:44 UTC
  To: William Tu
  Cc: bjorn.topel, magnus.karlsson, ast, daniel, netdev,
	makita.toshiaki, yihung.wei, magnus.karlsson

Hi, William. Nice work.
I have some feedback and questions.

On 18/12/27 (Thu) 5:27:49, William Tu wrote:
> The patch adds support for AF_XDP async xmit.  Users can use
> AF_XDP on both sides of the veth and get better performance, at
> the cost of ksoftirqd doing the xmit.  veth_xsk_async_xmit
> simply kicks the napi handler, veth_poll, to receive the packets
> that are on the umem transmit ring at the _peer_ side.
> 
> Tested using two namespaces, one runs xdpsock and the other runs
> xdp_rxq_info.  A simple script comparing the performance with/without
> AF_XDP shows improvement from 724Kpps to 1.1Mpps.
> 
>    ip netns add at_ns0
>    ip link add p0 type veth peer name p1
>    ip link set p0 netns at_ns0
>    ip link set dev p1 up
>    ip netns exec at_ns0 ip link set dev p0 up
> 
>    # receiver
>    ip netns exec at_ns0 xdp_rxq_info --dev p0 --action XDP_DROP
> 
>    # sender
>    xdpsock -i p1 -t -N -z
>    or
>    xdpsock -i p1 -t -S
> 
> Signed-off-by: William Tu <u9012063@gmail.com>
> ---
>   drivers/net/veth.c | 200 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 199 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index f412ea1cef18..10cf9ded59f1 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -25,6 +25,10 @@
>   #include <linux/ptr_ring.h>
>   #include <linux/bpf_trace.h>
>   #include <linux/net_tstamp.h>
> +#include <net/xdp_sock.h>
> +#include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <net/page_pool.h>
>   
>   #define DRV_NAME	"veth"
>   #define DRV_VERSION	"1.0"
> @@ -53,6 +57,8 @@ struct veth_rq {
>   	bool			rx_notify_masked;
>   	struct ptr_ring		xdp_ring;
>   	struct xdp_rxq_info	xdp_rxq;
> +	struct xdp_umem *xsk_umem;
> +	u16 qid;

veth_rq has a 31-byte hole after rx_notify_masked, so these fields
should be placed there.

>   };
>   
>   struct veth_priv {
> @@ -737,11 +743,95 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget, unsigned int *xdp_xmit)
>   	return done;
>   }
>   
> +static int veth_xsk_poll(struct napi_struct *napi, int budget)
> +{
> +	struct veth_priv *priv, *peer_priv;
> +	struct net_device *dev, *peer_dev;
> +	struct veth_rq *peer_rq;
> +	struct veth_rq *rq =
> +		container_of(napi, struct veth_rq, xdp_napi);
> +	int done = 0;
> +
> +	dev = rq->dev;
> +	priv = netdev_priv(dev);
> +	peer_dev = priv->peer;
> +	peer_priv = netdev_priv(peer_dev);
> +	peer_rq = &peer_priv->rq[rq->qid];
> +
> +	while (peer_rq->xsk_umem && budget--) {
> +		unsigned int inner_xdp_xmit = 0;
> +		unsigned int metasize = 0;
> +		struct xdp_frame *xdpf;
> +		bool dropped = false;
> +		struct sk_buff *skb;
> +		struct page *page;
> +		void *vaddr;
> +		void *addr;
> +		u32 len;
> +
> +		if (!xsk_umem_consume_tx_virtual(peer_rq->xsk_umem, &vaddr, &len))
> +			break;

How do you prevent races around xsk_umem?
It seems you are checking xsk_umem in the above while() condition, but
there is no guarantee that xsk_umem is non-NULL here, since the umem
can be disabled under us.

> +
> +		page = dev_alloc_page();
> +		if (!page) {
> +			xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
> +			xsk_umem_consume_tx_done(peer_rq->xsk_umem);
> +			return -ENOMEM;
> +		}
> +
> +		addr = page_to_virt(page);
> +		xdpf = addr;
> +		memset(xdpf, 0, sizeof(*xdpf));
> +
> +		addr += sizeof(*xdpf);
> +		memcpy(addr, vaddr, len);
> +
> +		xdpf->data = addr + metasize;
> +		xdpf->len = len;
> +		xdpf->headroom = 0;
> +		xdpf->metasize = metasize;
> +		xdpf->mem.type = MEM_TYPE_PAGE_SHARED;
> +
> +		/* put into rq */
> +		skb = veth_xdp_rcv_one(rq, xdpf, &inner_xdp_xmit);
> +		if (!skb) {
> +			/* Peer side has XDP program attached */
> +			if (inner_xdp_xmit & VETH_XDP_TX) {
> +				/* Not supported */
> +				pr_warn("veth: peer XDP_TX not supported\n");

As this can be triggered by users, we need rate limiting at least.

But since this is envisioned to be used in OVS, XDP_TX would be a very
important feature to me. I expect XDP programs in containers to process
packets and send them back to OVS.

> +				xdp_return_frame(xdpf);
> +				dropped = true;
> +				goto skip_tx;
> +			} else if (inner_xdp_xmit & VETH_XDP_REDIR) {
> +				xdp_do_flush_map();
> +			} else {
> +				dropped = true;
> +			}
> +		} else {
> +			napi_gro_receive(&rq->xdp_napi, skb);
> +		}
> +skip_tx:
> +		xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
> +		xsk_umem_consume_tx_done(peer_rq->xsk_umem);
> +
> +		/* update rq stats */
> +		u64_stats_update_begin(&rq->stats.syncp);
> +		rq->stats.xdp_packets++;
> +		rq->stats.xdp_bytes += len;
> +		if (dropped)
> +			rq->stats.xdp_drops++;
> +		u64_stats_update_end(&rq->stats.syncp);
> +		done++;
> +	}
> +	return done;
> +}
> +
>   static int veth_poll(struct napi_struct *napi, int budget)
>   {
>   	struct veth_rq *rq =
>   		container_of(napi, struct veth_rq, xdp_napi);
>   	unsigned int xdp_xmit = 0;
> +	int tx_done;
>   	int done;
>   
>   	xdp_set_return_frame_no_direct();
> @@ -756,13 +846,17 @@ static int veth_poll(struct napi_struct *napi, int budget)
>   		}
>   	}
>   
> +	tx_done = veth_xsk_poll(napi, budget);
> +	if (tx_done > 0)
> +		done += tx_done;
> +

This can receive more packets than the budget.

>   	if (xdp_xmit & VETH_XDP_TX)
>   		veth_xdp_flush(rq->dev);
>   	if (xdp_xmit & VETH_XDP_REDIR)
>   		xdp_do_flush_map();
>   	xdp_clear_return_frame_no_direct();
>   
> -	return done;
> +	return done > budget ? budget : done;
>   }
>   
>   static int veth_napi_add(struct net_device *dev)
> @@ -776,6 +870,7 @@ static int veth_napi_add(struct net_device *dev)
>   		err = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
>   		if (err)
>   			goto err_xdp_ring;
> +		rq->qid = i;
>   	}
>   
>   	for (i = 0; i < dev->real_num_rx_queues; i++) {
> @@ -812,6 +907,7 @@ static void veth_napi_del(struct net_device *dev)
>   		netif_napi_del(&rq->xdp_napi);
>   		rq->rx_notify_masked = false;
>   		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
> +		rq->qid = -1;
>   	}
>   }
>   
> @@ -836,6 +932,7 @@ static int veth_enable_xdp(struct net_device *dev)
>   
>   			/* Save original mem info as it can be overwritten */
>   			rq->xdp_mem = rq->xdp_rxq.mem;
> +			rq->qid = i;
>   		}
>   
>   		err = veth_napi_add(dev);
> @@ -1115,6 +1212,84 @@ static u32 veth_xdp_query(struct net_device *dev)
>   	return 0;
>   }
>   
> +int veth_xsk_umem_query(struct net_device *dev, struct xdp_umem **umem,
> +			u16 qid)
> +{
> +	struct xdp_umem *queried_umem;
> +
> +	queried_umem = xdp_get_umem_from_qid(dev, qid);
> +
> +	if (!queried_umem)
> +		return -EINVAL;
> +
> +	*umem = queried_umem;
> +	return 0;
> +}
> +
> +static int veth_xsk_umem_enable(struct net_device *dev,
> +				struct xdp_umem *umem,
> +				u16 qid)
> +{
> +	struct veth_priv *priv = netdev_priv(dev);
> +	struct xdp_umem_fq_reuse *reuseq;
> +	int err = 0;
> +
> +	if (qid >= dev->real_num_rx_queues)
> +		return -EINVAL;
> +
> +	reuseq = xsk_reuseq_prepare(priv->rq[0].xdp_ring.size);
> +	if (!reuseq)
> +		return -ENOMEM;
> +
> +	xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
> +
> +	priv->rq[qid].xsk_umem = umem;
> +
> +	return err;
> +}
> +
> +static int veth_xsk_umem_disable(struct net_device *dev,
> +				 u16 qid)
> +{
> +	struct veth_priv *priv = netdev_priv(dev);
> +	struct xdp_umem *umem;
> +
> +	umem = xdp_get_umem_from_qid(dev, qid);
> +	if (!umem)
> +		return -EINVAL;
> +
> +	priv->rq[qid].xsk_umem = NULL;
> +	return 0;
> +}
> +
> +int veth_xsk_umem_setup(struct net_device *dev, struct xdp_umem *umem,
> +			u16 qid)
> +{
> +	return umem ? veth_xsk_umem_enable(dev, umem, qid) :
> +		      veth_xsk_umem_disable(dev, qid);
> +}
> +
> +int veth_xsk_async_xmit(struct net_device *dev, u32 qid)
> +{
> +	struct veth_priv *priv, *peer_priv;
> +	struct net_device *peer_dev;
> +	struct veth_rq *peer_rq;
> +
> +	priv = netdev_priv(dev);
> +	peer_dev = priv->peer;
> +	peer_priv = netdev_priv(peer_dev);
> +	peer_rq = &peer_priv->rq[qid];
> +
> +	if (qid >= dev->real_num_rx_queues)
> +		return -ENXIO;
> +
> +	/* Schedule the peer side NAPI to receive */
> +	if (!napi_if_scheduled_mark_missed(&peer_rq->xdp_napi))
> +		napi_schedule(&peer_rq->xdp_napi);
> +
> +	return 0;
> +}
> +
>   static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>   {
>   	switch (xdp->command) {
> @@ -1123,6 +1298,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>   	case XDP_QUERY_PROG:
>   		xdp->prog_id = veth_xdp_query(dev);
>   		return 0;
> +	case XDP_QUERY_XSK_UMEM:
> +		return veth_xsk_umem_query(dev, &xdp->xsk.umem,
> +					   xdp->xsk.queue_id);
> +	case XDP_SETUP_XSK_UMEM: {
> +		struct veth_priv *priv;
> +		int err;
> +
> +		/* Enable NAPI on both sides, by enabling
> +		 * their XDP.
> +		 */
> +		err = veth_enable_xdp(dev);

Looks like there is no need to enable XDP on this side. You only use
the peer's NAPI, right?

> +		if (err)
> +			return err;
> +
> +		priv = netdev_priv(dev);
> +		err = veth_enable_xdp(priv->peer);

Enabling NAPI here and never disabling it?
Also, what happens if the peer disables XDP later by detaching an XDP
program? Probably you need something like refcounting.


BTW I'm not 100% sure the current way of accessing the peer rq in the
NAPI handler is safe, although I did not find an obvious problem.

Looking into physical NIC drivers, ndo_xsk_async_xmit() is expected to
kick the TX softirq. This should be something like the below process
in veth.

Say you attach an xsk to veth A, and B is the peer of A.

1. async_xmit kicks A's NAPI
2. A's NAPI drains xsk tx_ring and pushes frames into B's xdp_ring, then 
kicks B's NAPI
3. B's NAPI drains the xdp_ring and processes the xdp_frames in the
same way as normal xdp_frames.

What you are currently doing seems to be skipping 2 and making B's NAPI
directly drain the tx_ring of the xsk bound to A.

I don't have a particular opinion about which is better. Probably your
approach is more efficient performance-wise? If you later find some
race with your approach, please consider the more physical-NIC-like
approach I described above.
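
For concreteness, a rough sketch of step 2 above; locking, error
handling, and the frame-building helper are illustrative assumptions,
not an actual implementation:

	static int veth_drain_xsk_to_peer(struct veth_rq *rq,
					  struct veth_rq *peer_rq,
					  int budget)
	{
		int done = 0;
		void *vaddr;
		u32 len;

		while (done < budget &&
		       xsk_umem_consume_tx_virtual(rq->xsk_umem, &vaddr,
						   &len)) {
			/* build_xdp_frame() is hypothetical: it copies
			 * (vaddr, len) into a freshly built xdp_frame.
			 */
			struct xdp_frame *xdpf = build_xdp_frame(vaddr, len);

			if (!xdpf || ptr_ring_produce(&peer_rq->xdp_ring,
						      xdpf))
				break;	/* peer ring full; cleanup omitted */
			done++;
		}
		if (done) {
			xsk_umem_complete_tx(rq->xsk_umem, done);
			xsk_umem_consume_tx_done(rq->xsk_umem);
			/* step 2's final part: kick B's NAPI */
			napi_schedule(&peer_rq->xdp_napi);
		}
		return done;
	}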

> +		if (err)
> +			return err;
> +
> +		return veth_xsk_umem_setup(dev, xdp->xsk.umem,
> +					   xdp->xsk.queue_id);
> +	}
>   	default:
>   		return -EINVAL;
>   	}
> @@ -1145,6 +1342,7 @@ static const struct net_device_ops veth_netdev_ops = {
>   	.ndo_set_rx_headroom	= veth_set_rx_headroom,
>   	.ndo_bpf		= veth_xdp,
>   	.ndo_xdp_xmit		= veth_xdp_xmit,
> +	.ndo_xsk_async_xmit	= veth_xsk_async_xmit,
>   };
>   
>   #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> 


* Re: [PATCH bpf-next RFCv3 2/6] veth: support AF_XDP TX copy-mode.
From: William Tu @ 2019-01-05 15:55 UTC
  To: Toshiaki Makita
  Cc: Björn Töpel, Magnus Karlsson, Alexei Starovoitov,
	Daniel Borkmann, Linux Kernel Network Developers,
	makita.toshiaki, Yi-Hung Wei, Karlsson, Magnus

Hi Toshiaki,

Thanks a lot for the feedback.

On Tue, Jan 1, 2019 at 5:44 AM Toshiaki Makita
<toshiaki.makita1@gmail.com> wrote:
>
> Hi, William. Nice work.
> I have some feedback and questions.
>
> > +     while (peer_rq->xsk_umem && budget--) {
> > +             unsigned int inner_xdp_xmit = 0;
> > +             unsigned int metasize = 0;
> > +             struct xdp_frame *xdpf;
> > +             bool dropped = false;
> > +             struct sk_buff *skb;
> > +             struct page *page;
> > +             void *vaddr;
> > +             void *addr;
> > +             u32 len;
> > +
> > +             if (!xsk_umem_consume_tx_virtual(peer_rq->xsk_umem, &vaddr, &len))
> > +                     break;
>
> How do you prevent races around xsk_umem?
> It seems you are checking xsk_umem in the above while() condition, but
> there is no guarantee that xsk_umem is non-NULL here, since the umem
> can be disabled under us.

You're right. Looking at similar code (i40e_xsk_umem_enable/disable):
when setting xsk_umem to NULL (in veth_xsk_umem_disable), I should first
disable the napi, then set it to NULL.
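
A minimal sketch of that ordering, as an assumption of how the fix
could look (in this design the consumer is the peer's NAPI, since the
peer's poll loop drains this side's umem):

	static int veth_xsk_umem_disable(struct net_device *dev, u16 qid)
	{
		struct veth_priv *priv = netdev_priv(dev);
		struct veth_priv *peer_priv = netdev_priv(priv->peer);

		if (!xdp_get_umem_from_qid(dev, qid))
			return -EINVAL;

		/* quiesce the consuming NAPI before clearing the umem */
		napi_disable(&peer_priv->rq[qid].xdp_napi);
		priv->rq[qid].xsk_umem = NULL;
		napi_enable(&peer_priv->rq[qid].xdp_napi);

		return 0;
	}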

>
> > +
> > +             page = dev_alloc_page();
> > +             if (!page) {
> > +                     xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
> > +                     xsk_umem_consume_tx_done(peer_rq->xsk_umem);
> > +                     return -ENOMEM;
> > +             }
> > +
> > +             addr = page_to_virt(page);
> > +             xdpf = addr;
> > +             memset(xdpf, 0, sizeof(*xdpf));
> > +
> > +             addr += sizeof(*xdpf);
> > +             memcpy(addr, vaddr, len);
> > +
> > +             xdpf->data = addr + metasize;
> > +             xdpf->len = len;
> > +             xdpf->headroom = 0;
> > +             xdpf->metasize = metasize;
> > +             xdpf->mem.type = MEM_TYPE_PAGE_SHARED;
> > +
> > +             /* put into rq */
> > +             skb = veth_xdp_rcv_one(rq, xdpf, &inner_xdp_xmit);
> > +             if (!skb) {
> > +                     /* Peer side has XDP program attached */
> > +                     if (inner_xdp_xmit & VETH_XDP_TX) {
> > +                             /* Not supported */
> > +                             pr_warn("veth: peer XDP_TX not supported\n");
>
> As this can be triggered by users, we need rate limiting at least.
How to rate limit here? Can I slow down the napi poll or reduce the budget?

>
> But since this is envisioned to be used in OVS, XDP_TX would be a very
> important feature to me. I expect XDP programs in containers to process
> packets and send them back to OVS.

It's a little tricky here, since the receiving veth pulls the packet sent
from its peer side, but due to XDP_TX it has to put the packet back to
its peer side to be received. But I can see the use case you mentioned.
Let me think about how to implement it.

>
> > +                             xdp_return_frame(xdpf);
> > +                             dropped = true;
> > +                             goto skip_tx;
> > +                     } else if (inner_xdp_xmit & VETH_XDP_REDIR) {
> > +                             xdp_do_flush_map();
> > +                     } else {
> > +                             dropped = true;
> > +                     }
> > +             } else {
> > +                     napi_gro_receive(&rq->xdp_napi, skb);
> > +             }
> > +skip_tx:
> > +             xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
> > +             xsk_umem_consume_tx_done(peer_rq->xsk_umem);
> > +
> > +             /* update rq stats */
> > +             u64_stats_update_begin(&rq->stats.syncp);
> > +             rq->stats.xdp_packets++;
> > +             rq->stats.xdp_bytes += len;
> > +             if (dropped)
> > +                     rq->stats.xdp_drops++;
> > +             u64_stats_update_end(&rq->stats.syncp);
> > +             done++;
> > +     }
> > +     return done;
> > +}
> > +
> >   static int veth_poll(struct napi_struct *napi, int budget)
> >   {
> >       struct veth_rq *rq =
> >               container_of(napi, struct veth_rq, xdp_napi);
> >       unsigned int xdp_xmit = 0;
> > +     int tx_done;
> >       int done;
> >
> >       xdp_set_return_frame_no_direct();
> > @@ -756,13 +846,17 @@ static int veth_poll(struct napi_struct *napi, int budget)
> >               }
> >       }
> >
> > +     tx_done = veth_xsk_poll(napi, budget);
> > +     if (tx_done > 0)
> > +             done += tx_done;
> > +
>
> This can receive more packets than the budget.
>
> >       if (xdp_xmit & VETH_XDP_TX)
> >               veth_xdp_flush(rq->dev);
> >       if (xdp_xmit & VETH_XDP_REDIR)
> >               xdp_do_flush_map();
> >       xdp_clear_return_frame_no_direct();
> >
> > -     return done;
> > +     return done > budget ? budget : done;
> >   }
> >
> >   static int veth_napi_add(struct net_device *dev)
> > @@ -776,6 +870,7 @@ static int veth_napi_add(struct net_device *dev)
> >               err = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
> >               if (err)
> >                       goto err_xdp_ring;
> > +             rq->qid = i;
> >       }
> >
> >       for (i = 0; i < dev->real_num_rx_queues; i++) {
> > @@ -812,6 +907,7 @@ static void veth_napi_del(struct net_device *dev)
> >               netif_napi_del(&rq->xdp_napi);
> >               rq->rx_notify_masked = false;
> >               ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
> > +             rq->qid = -1;
> >       }
> >   }
> >
> > @@ -836,6 +932,7 @@ static int veth_enable_xdp(struct net_device *dev)
> >
> >                       /* Save original mem info as it can be overwritten */
> >                       rq->xdp_mem = rq->xdp_rxq.mem;
> > +                     rq->qid = i;
> >               }
> >
> >               err = veth_napi_add(dev);
> > @@ -1115,6 +1212,84 @@ static u32 veth_xdp_query(struct net_device *dev)
> >       return 0;
> >   }
> >
> > +int veth_xsk_umem_query(struct net_device *dev, struct xdp_umem **umem,
> > +                     u16 qid)
> > +{
> > +     struct xdp_umem *queried_umem;
> > +
> > +     queried_umem = xdp_get_umem_from_qid(dev, qid);
> > +
> > +     if (!queried_umem)
> > +             return -EINVAL;
> > +
> > +     *umem = queried_umem;
> > +     return 0;
> > +}
> > +
> > +static int veth_xsk_umem_enable(struct net_device *dev,
> > +                             struct xdp_umem *umem,
> > +                             u16 qid)
> > +{
> > +     struct veth_priv *priv = netdev_priv(dev);
> > +     struct xdp_umem_fq_reuse *reuseq;
> > +     int err = 0;
> > +
> > +     if (qid >= dev->real_num_rx_queues)
> > +             return -EINVAL;
> > +
> > +     reuseq = xsk_reuseq_prepare(priv->rq[0].xdp_ring.size);
> > +     if (!reuseq)
> > +             return -ENOMEM;
> > +
> > +     xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
> > +
> > +     priv->rq[qid].xsk_umem = umem;
> > +
> > +     return err;
> > +}
> > +
> > +static int veth_xsk_umem_disable(struct net_device *dev,
> > +                              u16 qid)
> > +{
> > +     struct veth_priv *priv = netdev_priv(dev);
> > +     struct xdp_umem *umem;
> > +
> > +     umem = xdp_get_umem_from_qid(dev, qid);
> > +     if (!umem)
> > +             return -EINVAL;
> > +
> > +     priv->rq[qid].xsk_umem = NULL;
> > +     return 0;
> > +}
> > +
> > +int veth_xsk_umem_setup(struct net_device *dev, struct xdp_umem *umem,
> > +                     u16 qid)
> > +{
> > +     return umem ? veth_xsk_umem_enable(dev, umem, qid) :
> > +                   veth_xsk_umem_disable(dev, qid);
> > +}
> > +
> > +int veth_xsk_async_xmit(struct net_device *dev, u32 qid)
> > +{
> > +     struct veth_priv *priv, *peer_priv;
> > +     struct net_device *peer_dev;
> > +     struct veth_rq *peer_rq;
> > +
> > +     priv = netdev_priv(dev);
> > +     peer_dev = priv->peer;
> > +     peer_priv = netdev_priv(peer_dev);
> > +     peer_rq = &peer_priv->rq[qid];
> > +
> > +     if (qid >= dev->real_num_rx_queues)
> > +             return -ENXIO;
> > +
> > +     /* Schedule the peer side NAPI to receive */
> > +     if (!napi_if_scheduled_mark_missed(&peer_rq->xdp_napi))
> > +             napi_schedule(&peer_rq->xdp_napi);
> > +
> > +     return 0;
> > +}
> > +
> >   static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >   {
> >       switch (xdp->command) {
> > @@ -1123,6 +1298,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >       case XDP_QUERY_PROG:
> >               xdp->prog_id = veth_xdp_query(dev);
> >               return 0;
> > +     case XDP_QUERY_XSK_UMEM:
> > +             return veth_xsk_umem_query(dev, &xdp->xsk.umem,
> > +                                        xdp->xsk.queue_id);
> > +     case XDP_SETUP_XSK_UMEM: {
> > +             struct veth_priv *priv;
> > +             int err;
> > +
> > +             /* Enable NAPI on both sides, by enabling
> > +              * their XDP.
> > +              */
> > +             err = veth_enable_xdp(dev);
>
> Looks like there is no need to enable XDP on this side. You only use
> the peer's NAPI, right?

Yes, thanks. The peer's NAPI is doing the work.

>
> > +             if (err)
> > +                     return err;
> > +
> > +             priv = netdev_priv(dev);
> > +             err = veth_enable_xdp(priv->peer);
>
> Enabling NAPI here and never disabling it?

Here XDP_SETUP_XSK_UMEM might be an enable or a disable, so I should
disable NAPI when receiving a umem disable command.
Will fix it in the next version.

> Also, what happens if the peer disables XDP later by detaching an XDP
> program? Probably you need something like refcounting.

OK, so the refcounting will keep the NAPI running when the XDP program is detached.
>
>
> BTW I'm not 100% sure the current way of accessing the peer rq in the
> NAPI handler is safe, although I did not find an obvious problem.
>
> Looking into physical NIC drivers, ndo_xsk_async_xmit() is expected to
> kick the TX softirq. This should be something like the below process
> in veth.

Right. In the physical NIC drivers (i40e and ixgbe), async_xmit simply
kicks the TX softirq. For veth, we're doing something a little
different: we kick the peer side's RX softirq.

>
> Say you attach an xsk to veth A, and B is the peer of A.
>
> 1. async_xmit kicks A's NAPI
> 2. A's NAPI drains xsk tx_ring and pushes frames into B's xdp_ring, then
> kicks B's NAPI
> 3. B's NAPI drains the xdp_ring and processes the xdp_frames in the
> same way as normal xdp_frames.
>
> What you are currently doing seems to be skipping 2 and making B's NAPI
> directly drain the tx_ring of the xsk bound to A.
Right.

>
> I don't have a particular opinion about which is better. Probably your
> approach is more efficient performance-wise? If you later find some
> race with your approach, please consider the more physical-NIC-like
> approach I described above.

Yes, I thought that skipping the push/drain of frames into/out of the
xdp_ring might give better performance.

I will work on the next version. Thanks!
William

>
> > +             if (err)
> > +                     return err;
> > +
> > +             return veth_xsk_umem_setup(dev, xdp->xsk.umem,
> > +                                        xdp->xsk.queue_id);
> > +     }
> >       default:
> >               return -EINVAL;
> >       }
> > @@ -1145,6 +1342,7 @@ static const struct net_device_ops veth_netdev_ops = {
> >       .ndo_set_rx_headroom    = veth_set_rx_headroom,
> >       .ndo_bpf                = veth_xdp,
> >       .ndo_xdp_xmit           = veth_xdp_xmit,
> > +     .ndo_xsk_async_xmit     = veth_xsk_async_xmit,
> >   };
> >
> >   #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> >


* Re: [PATCH bpf-next RFCv3 2/6] veth: support AF_XDP TX copy-mode.
From: Toshiaki Makita @ 2019-01-08  7:25 UTC
  To: William Tu
  Cc: Toshiaki Makita, Björn Töpel, Magnus Karlsson,
	Alexei Starovoitov, Daniel Borkmann,
	Linux Kernel Network Developers, Yi-Hung Wei, Karlsson, Magnus

On 2019/01/06 0:55, William Tu wrote:
...
>>> +             /* put into rq */
>>> +             skb = veth_xdp_rcv_one(rq, xdpf, &inner_xdp_xmit);
>>> +             if (!skb) {
>>> +                     /* Peer side has XDP program attached */
>>> +                     if (inner_xdp_xmit & VETH_XDP_TX) {
>>> +                             /* Not supported */
>>> +                             pr_warn("veth: peer XDP_TX not supported\n");
>>
>> As this can be triggered by users, we need rate limiting at least.
> How to rate limit here? Can I slow down the napi poll or reduce the budget?

net_ratelimit()
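
e.g. the usual pattern:

	if (net_ratelimit())
		pr_warn("veth: peer XDP_TX not supported\n");

(net_warn_ratelimited() would work as well.)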

>> But since this is envisioned to be used in OVS, XDP_TX would be a very
>> important feature to me. I expect XDP programs in containers to process
>> packets and send them back to OVS.
> 
> It's a little tricky here, since the receiving veth pulls the packet sent
> from its peer side, but due to XDP_TX it has to put the packet back to
> its peer side to be received. But I can see the use case you mentioned.
> Let me think about how to implement it.

You have already implemented XDP_REDIRECT. XDP_TX is a kind of special
case of XDP_REDIRECT, so I wonder why you think XDP_TX is especially
difficult.

-- 
Toshiaki Makita

