* [bpf-next V2 PATCH 00/15] XDP redirect memory return API
@ 2018-03-08 13:07 Jesper Dangaard Brouer
  2018-03-08 13:07 ` [bpf-next V2 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support Jesper Dangaard Brouer
                   ` (15 more replies)
  0 siblings, 16 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:07 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This patchset works towards supporting different XDP RX-ring memory
allocators, as this will be needed by the AF_XDP zero-copy mode.

The patchset uses mlx5 as the sample driver, which gets XDP_REDIRECT
RX-mode implemented, but not ndo_xdp_xmit (as this API is subject to
change throughout the patchset).

A new struct xdp_frame is introduced (modeled after the cpumap xdp_pkt),
and both ndo_xdp_xmit and the new xdp_return_frame end up using it.
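
To illustrate the flow (a sketch only, using the helpers from patches
02 and 05; note that patch 14 later transitions the return API to take
the xdp_frame directly):

  /* RX/NAPI context: convert the short-lived xdp_buff into an
   * xdp_frame, which carries its own copy of the xdp_mem_info.
   */
  struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
  if (unlikely(!xdpf))
          return -EOVERFLOW; /* not enough headroom to store the frame */

  /* ... queue xdpf, e.g. towards ndo_xdp_xmit or cpumap ... */

  /* Later, possibly on a remote CPU at DMA-TX completion time,
   * return the frame memory via its registered memory model:
   */
  xdp_return_frame(xdpf->data, &xdpf->mem);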

Support for a driver-supplied allocator is implemented, and a
refurbished version of page_pool is the first return allocator type
introduced.  This will be an integration point for AF_XDP zero-copy.

The mlx5 driver evolves into using the page_pool, and sees a performance
increase (redirecting out the ixgbe driver via ndo_xdp_xmit) from
6 Mpps to 12 Mpps.


The patchset stops at the 15-patch limit, but more API changes are
planned.  Specifically, extending the ndo_xdp_xmit and xdp_return_frame
APIs to support bulking, as this will address some known limits.

V2: Updated according to Tariq's feedback

---

Jesper Dangaard Brouer (15):
      mlx5: basic XDP_REDIRECT forward support
      xdp: introduce xdp_return_frame API and use in cpumap
      ixgbe: use xdp_return_frame API
      xdp: move struct xdp_buff from filter.h to xdp.h
      xdp: introduce a new xdp_frame type
      tun: convert to use generic xdp_frame and xdp_return_frame API
      virtio_net: convert to use generic xdp_frame and xdp_return_frame API
      bpf: cpumap convert to use generic xdp_frame
      mlx5: register a memory model when XDP is enabled
      xdp: rhashtable with allocator ID to pointer mapping
      page_pool: refurbish version of page_pool code
      xdp: allow page_pool as an allocator type in xdp_return_frame
      mlx5: use page_pool for xdp_return_frame call
      xdp: transition into using xdp_frame for return API
      xdp: transition into using xdp_frame for ndo_xdp_xmit


 drivers/net/ethernet/intel/ixgbe/ixgbe.h          |    3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |   37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   42 ++-
 drivers/net/tun.c                                 |   42 +--
 drivers/net/virtio_net.c                          |   45 ++-
 drivers/vhost/net.c                               |    7 
 include/linux/filter.h                            |   24 --
 include/linux/if_tun.h                            |    4 
 include/linux/netdevice.h                         |    4 
 include/net/page_pool.h                           |  123 ++++++++
 include/net/xdp.h                                 |   86 +++++
 kernel/bpf/cpumap.c                               |  132 +++-----
 net/core/Makefile                                 |    1 
 net/core/filter.c                                 |   17 +
 net/core/page_pool.c                              |  334 +++++++++++++++++++++
 net/core/xdp.c                                    |  257 ++++++++++++++++
 18 files changed, 1024 insertions(+), 175 deletions(-)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

--


* [bpf-next V2 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
@ 2018-03-08 13:07 ` Jesper Dangaard Brouer
  2018-03-08 13:07 ` [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap Jesper Dangaard Brouer
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:07 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This implements basic XDP redirect support in mlx5 driver.

Notice that ndo_xdp_xmit() is NOT implemented, because that API
needs some changes that this patchset is working towards.

The main purpose of this patch is to have different drivers doing
XDP_REDIRECT, to show how different memory models behave in a cross-
driver world.

Update(pre-RFCv2 Tariq): Need to DMA unmap the page before xdp_do_redirect,
as the return API that would keep this mapped does not exist yet.

Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing,
introduce xdpsq.db.redirect_flush boolean.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |    1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |   27 ++++++++++++++++++++---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 4c9360b25532..28cc26debeda 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -398,6 +398,7 @@ struct mlx5e_xdpsq {
 	struct {
 		struct mlx5e_dma_info     *di;
 		bool                       doorbell;
+		bool                       redirect_flush;
 	} db;
 
 	/* read only */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8cce90dc461d..6dcc3e8fbd3e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -236,14 +236,20 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 	return 0;
 }
 
+static inline void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
+					struct mlx5e_dma_info *dma_info)
+{
+	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
+		       rq->buff.map_dir);
+}
+
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle)
 {
 	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
 		return;
 
-	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
-		       rq->buff.map_dir);
+	mlx5e_page_dma_unmap(rq, dma_info);
 	put_page(dma_info->page);
 }
 
@@ -822,9 +828,10 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				   struct mlx5e_dma_info *di,
 				   void *va, u16 *rx_headroom, u32 *len)
 {
-	const struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
 	struct xdp_buff xdp;
 	u32 act;
+	int err;
 
 	if (!prog)
 		return false;
@@ -845,6 +852,15 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		if (unlikely(!mlx5e_xmit_xdp_frame(rq, di, &xdp)))
 			trace_xdp_exception(rq->netdev, prog, act);
 		return true;
+	case XDP_REDIRECT:
+		/* When XDP enabled then page-refcnt==1 here */
+		err = xdp_do_redirect(rq->netdev, &xdp, prog);
+		if (!err) {
+			rq->wqe.xdp_xmit = true; /* XDP xmit owns page */
+			rq->xdpsq.db.redirect_flush = true;
+			mlx5e_page_dma_unmap(rq, di);
+		}
+		return true;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
@@ -1107,6 +1123,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 		xdpsq->db.doorbell = false;
 	}
 
+	if (xdpsq->db.redirect_flush) {
+		xdp_do_flush_map();
+		xdpsq->db.redirect_flush = false;
+	}
+
 	mlx5_cqwq_update_db_record(&cq->wq);
 
 	/* ensure cq space is freed before enabling more cqes */


* [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
  2018-03-08 13:07 ` [bpf-next V2 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support Jesper Dangaard Brouer
@ 2018-03-08 13:07 ` Jesper Dangaard Brouer
  2018-03-09  7:24   ` Jason Wang
  2018-03-08 13:07 ` [bpf-next V2 PATCH 03/15] ixgbe: use xdp_return_frame API Jesper Dangaard Brouer
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:07 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it has a queued XDP frame structure to leverage.
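
For reference, a sketch of the intended driver-side usage (illustration
only, not part of this patch): the memory model is registered once at
RX-ring setup time, and the xdp_mem_info that was copied into the
queued frame is what gets used at free time.

  /* RX-ring setup, after xdp_rxq_info_reg(): */
  err = xdp_rxq_info_reg_mem_model(&rxq_info, MEM_TYPE_PAGE_SHARED, NULL);
  if (err)
          goto err_unreg;

  /* Free time, possibly on a remote CPU (e.g. DMA-TX completion);
   * mem was copied from rxq_info.mem while the rxq was still valid.
   */
  xdp_return_frame(data, &mem);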

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h   |   32 +++++++++++++++++++++++++++
 kernel/bpf/cpumap.c |   60 +++++++++++++++++++++++++++++++--------------------
 net/core/xdp.c      |   18 +++++++++++++++
 3 files changed, 86 insertions(+), 24 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index b2362ddfa694..3cb726a6dc5b 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -33,16 +33,48 @@
  * also mandatory during RX-ring setup.
  */
 
+enum mem_type {
+	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
+	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
+	// Possible new ideas for types:
+	// MEM_TYPE_PAGE_POOL,    /* Will be added later  */
+	// MEM_TYPE_AF_XDP,
+	// MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo?
+	MEM_TYPE_MAX,
+};
+
+struct xdp_mem_info {
+	u32 type; /* enum mem_type, but known size type */
+	u32 id; // Needed later (to lookup struct xdp_rxq_info)
+};
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
 	u32 reg_state;
+	struct xdp_mem_info mem;
 } ____cacheline_aligned; /* perf critical, avoid false-sharing */
 
+
+static inline
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+	if (mem->type == MEM_TYPE_PAGE_SHARED)
+		page_frag_free(data);
+
+	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+		struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+		put_page(page);
+	}
+}
+
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
 void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
 bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
+int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
+			       enum mem_type type, void *allocator);
 
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a4bb0b34375a..3e4bbcbe3e86 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -19,6 +19,7 @@
 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
+#include <net/xdp.h>
 
 #include <linux/sched.h>
 #include <linux/workqueue.h>
@@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void __cpu_map_queue_destructor(void *ptr)
-{
-	/* The tear-down procedure should have made sure that queue is
-	 * empty.  See __cpu_map_entry_replace() and work-queue
-	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
-	 * gracefully and warn once.
-	 */
-	if (WARN_ON_ONCE(ptr))
-		page_frag_free(ptr);
-}
-
-static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
-{
-	if (atomic_dec_and_test(&rcpu->refcnt)) {
-		/* The queue should be empty at this point */
-		ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
-		kfree(rcpu->queue);
-		kfree(rcpu);
-	}
-}
-
 static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 {
 	atomic_inc(&rcpu->refcnt);
@@ -188,6 +168,10 @@ struct xdp_pkt {
 	u16 len;
 	u16 headroom;
 	u16 metasize;
+	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+	 * while mem info is valid on remote CPU.
+	 */
+	struct xdp_mem_info mem;
 	struct net_device *dev_rx;
 };
 
@@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
 	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
 	xdp_pkt->metasize = metasize;
 
+	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+	xdp_pkt->mem = xdp->rxq->mem;
+
 	return xdp_pkt;
 }
 
@@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 	return skb;
 }
 
+static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
+{
+	/* The tear-down procedure should have made sure that queue is
+	 * empty.  See __cpu_map_entry_replace() and work-queue
+	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
+	 * gracefully and warn once.
+	 */
+	struct xdp_pkt *xdp_pkt;
+
+	while ((xdp_pkt = ptr_ring_consume(ring)))
+		if (WARN_ON_ONCE(xdp_pkt))
+			xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+}
+
+static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
+{
+	if (atomic_dec_and_test(&rcpu->refcnt)) {
+		/* The queue should be empty at this point */
+		__cpu_map_ring_cleanup(rcpu->queue);
+		ptr_ring_cleanup(rcpu->queue, NULL);
+		kfree(rcpu->queue);
+		kfree(rcpu);
+	}
+}
+
 static int cpu_map_kthread_run(void *data)
 {
 	struct bpf_cpu_map_entry *rcpu = data;
@@ -307,7 +319,7 @@ static int cpu_map_kthread_run(void *data)
 
 			skb = cpu_map_build_skb(rcpu, xdp_pkt);
 			if (!skb) {
-				page_frag_free(xdp_pkt);
+				xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
 				continue;
 			}
 
@@ -604,13 +616,13 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 	spin_lock(&q->producer_lock);
 
 	for (i = 0; i < bq->count; i++) {
-		void *xdp_pkt = bq->q[i];
+		struct xdp_pkt *xdp_pkt = bq->q[i];
 		int err;
 
 		err = __ptr_ring_produce(q, xdp_pkt);
 		if (err) {
 			drops++;
-			page_frag_free(xdp_pkt); /* Free xdp_pkt */
+			xdp_return_frame(xdp_pkt->data, &xdp_pkt->mem);
 		}
 		processed++;
 	}
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 097a0f74e004..9eee0c431126 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -71,3 +71,21 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
 	return (xdp_rxq->reg_state == REG_STATE_REGISTERED);
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
+
+int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
+			       enum mem_type type, void *allocator)
+{
+	if (type >= MEM_TYPE_MAX)
+		return -EINVAL;
+
+	xdp_rxq->mem.type = type;
+
+	if (allocator)
+		return -EOPNOTSUPP;
+
+	/* TODO: Allocate an ID that maps to allocator pointer
+	 * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);


* [bpf-next V2 PATCH 03/15] ixgbe: use xdp_return_frame API
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
  2018-03-08 13:07 ` [bpf-next V2 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support Jesper Dangaard Brouer
  2018-03-08 13:07 ` [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap Jesper Dangaard Brouer
@ 2018-03-08 13:07 ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h Jesper Dangaard Brouer
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:07 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Extend struct ixgbe_tx_buffer to store the xdp_mem_info.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index c1e3a0039ea5..cbc20f199364 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -249,6 +249,7 @@ struct ixgbe_tx_buffer {
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
 	u32 tx_flags;
+	struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 85369423452d..45520eb503ee 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 		/* free the skb */
 		if (ring_is_xdp(tx_ring))
-			page_frag_free(tx_buffer->data);
+			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
 		else
 			napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -5787,7 +5787,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
 
 		/* Free all the Tx ring sk_buffs */
 		if (ring_is_xdp(tx_ring))
-			page_frag_free(tx_buffer->data);
+			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8351,6 +8351,8 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 	dma_unmap_len_set(tx_buffer, len, len);
 	dma_unmap_addr_set(tx_buffer, dma, dma);
 	tx_buffer->data = xdp->data;
+	tx_buffer->xdp_mem = xdp->rxq->mem;
+
 	tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
 	/* put descriptor type bits */


* [bpf-next V2 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2018-03-08 13:07 ` [bpf-next V2 PATCH 03/15] ixgbe: use xdp_return_frame API Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 05/15] xdp: introduce a new xdp_frame type Jesper Dangaard Brouer
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This is done to prepare for the next patch, and it is also
nice to move this XDP related struct out of filter.h.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/filter.h |   24 +-----------------------
 include/net/xdp.h      |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index fdb691b520c0..a856edb5c4eb 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -30,6 +30,7 @@ struct sock;
 struct seccomp_data;
 struct bpf_prog_aux;
 struct xdp_rxq_info;
+struct xdp_buff;
 
 /* ArgX, context and stack frame pointer register positions. Note,
  * Arg1, Arg2, Arg3, etc are used as argument mappings of function
@@ -499,14 +500,6 @@ struct bpf_skb_data_end {
 	void *data_end;
 };
 
-struct xdp_buff {
-	void *data;
-	void *data_end;
-	void *data_meta;
-	void *data_hard_start;
-	struct xdp_rxq_info *rxq;
-};
-
 /* Compute the linear packet data range [data, data_end) which
  * will be accessed by various program types (cls_bpf, act_bpf,
  * lwt, ...). Subsystems allowing direct data access must (!)
@@ -753,21 +746,6 @@ int xdp_do_redirect(struct net_device *dev,
 		    struct bpf_prog *prog);
 void xdp_do_flush_map(void);
 
-/* Drivers not supporting XDP metadata can use this helper, which
- * rejects any room expansion for metadata as a result.
- */
-static __always_inline void
-xdp_set_data_meta_invalid(struct xdp_buff *xdp)
-{
-	xdp->data_meta = xdp->data + 1;
-}
-
-static __always_inline bool
-xdp_data_meta_unsupported(const struct xdp_buff *xdp)
-{
-	return unlikely(xdp->data_meta > xdp->data);
-}
-
 void bpf_warn_invalid_xdp_action(u32 act);
 
 struct sock *do_sk_redirect_map(struct sk_buff *skb);
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 3cb726a6dc5b..f6cbbbad412f 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -55,6 +55,13 @@ struct xdp_rxq_info {
 	struct xdp_mem_info mem;
 } ____cacheline_aligned; /* perf critical, avoid false-sharing */
 
+struct xdp_buff {
+	void *data;
+	void *data_end;
+	void *data_meta;
+	void *data_hard_start;
+	struct xdp_rxq_info *rxq;
+};
 
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
@@ -77,4 +84,19 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum mem_type type, void *allocator);
 
+/* Drivers not supporting XDP metadata can use this helper, which
+ * rejects any room expansion for metadata as a result.
+ */
+static __always_inline void
+xdp_set_data_meta_invalid(struct xdp_buff *xdp)
+{
+	xdp->data_meta = xdp->data + 1;
+}
+
+static __always_inline bool
+xdp_data_meta_unsupported(const struct xdp_buff *xdp)
+{
+	return unlikely(xdp->data_meta > xdp->data);
+}
+
 #endif /* __LINUX_NET_XDP_H__ */


* [bpf-next V2 PATCH 05/15] xdp: introduce a new xdp_frame type
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API Jesper Dangaard Brouer
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This is needed to convert the tuntap and virtio_net drivers.

This is a generalization of what is done inside cpumap, which will be
converted later.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index f6cbbbad412f..c5e7e2d27dd8 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -63,6 +63,46 @@ struct xdp_buff {
 	struct xdp_rxq_info *rxq;
 };
 
+struct xdp_frame {
+	void *data;
+	u16 len;
+	u16 headroom;
+	u16 metasize;
+	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+	 * while mem info is valid on remote CPU.
+	 */
+	struct xdp_mem_info mem;
+};
+
+/* Convert xdp_buff to xdp_frame */
+static inline
+struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
+{
+	struct xdp_frame *xdp_frame;
+	int metasize;
+	int headroom;
+
+	/* Assure headroom is available for storing info */
+	headroom = xdp->data - xdp->data_hard_start;
+	metasize = xdp->data - xdp->data_meta;
+	metasize = metasize > 0 ? metasize : 0;
+	if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
+		return NULL;
+
+	/* Store info in top of packet */
+	xdp_frame = xdp->data_hard_start;
+
+	xdp_frame->data = xdp->data;
+	xdp_frame->len  = xdp->data_end - xdp->data;
+	xdp_frame->headroom = headroom - sizeof(*xdp_frame);
+	xdp_frame->metasize = metasize;
+
+	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+	xdp_frame->mem = xdp->rxq->mem;
+
+	return xdp_frame;
+}
+
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {


* [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 05/15] xdp: introduce a new xdp_frame type Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 15:16   ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 07/15] virtio_net: " Jesper Dangaard Brouer
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

The tuntap driver invented its own driver-specific way of queuing
XDP packets, by storing the xdp_buff information in the top of
the XDP frame data.

Convert it over to use the more generic xdp_frame structure.  The
main problem with the in-driver method is that the xdp_rxq_info pointer
cannot be trusted/used when dequeueing the frame.
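
For reviewers, a sketch of how consumers tell the two ptr_ring entry
types apart; this is the existing TUN_XDP_FLAG lowest-bit tagging
scheme, unchanged by this patch (only the payload type changes):

  void *ptr = ptr_ring_consume(&tfile->tx_ring);

  if (tun_is_xdp_frame(ptr)) {  /* lowest pointer bit set */
          struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr); /* clears bit */

          /* use xdpf->data and xdpf->len, then release via: */
          xdp_return_frame(xdpf->data, &xdpf->mem);
  } else {
          struct sk_buff *skb = ptr;  /* normal skb path */
  }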

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/tun.c      |   45 +++++++++++++++++++++++----------------------
 drivers/vhost/net.c    |    7 ++++---
 include/linux/if_tun.h |    4 ++--
 3 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 475088f947bb..cd046cf31b77 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -248,11 +248,11 @@ struct veth {
 	__be16 h_vlan_TCI;
 };
 
-bool tun_is_xdp_buff(void *ptr)
+bool tun_is_xdp_frame(void *ptr)
 {
 	return (unsigned long)ptr & TUN_XDP_FLAG;
 }
-EXPORT_SYMBOL(tun_is_xdp_buff);
+EXPORT_SYMBOL(tun_is_xdp_frame);
 
 void *tun_xdp_to_ptr(void *ptr)
 {
@@ -660,10 +660,10 @@ static void tun_ptr_free(void *ptr)
 {
 	if (!ptr)
 		return;
-	if (tun_is_xdp_buff(ptr)) {
-		struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+	if (tun_is_xdp_frame(ptr)) {
+		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		put_page(virt_to_head_page(xdp->data));
+		xdp_return_frame(xdpf->data, &xdpf->mem);
 	} else {
 		__skb_array_destroy_skb(ptr);
 	}
@@ -1290,17 +1290,18 @@ static const struct net_device_ops tun_netdev_ops = {
 static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 {
 	struct tun_struct *tun = netdev_priv(dev);
-	struct xdp_buff *buff = xdp->data_hard_start;
-	int headroom = xdp->data - xdp->data_hard_start;
+	struct xdp_frame *frame;
 	struct tun_file *tfile;
 	u32 numqueues;
 	int ret = 0;
 
-	/* Assure headroom is available and buff is properly aligned */
-	if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
-		return -ENOSPC;
+	/* FIXME: Explain why this check is needed! */
+	if (unlikely(tun_is_xdp_frame(xdp)))
+		return -EBADRQC;
 
-	*buff = *xdp;
+	frame = convert_to_xdp_frame(xdp);
+	if (unlikely(!frame))
+		return -EOVERFLOW;
 
 	rcu_read_lock();
 
@@ -1315,7 +1316,7 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	/* Encode the XDP flag into lowest bit for consumer to differ
 	 * XDP buffer from sk_buff.
 	 */
-	if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(buff))) {
+	if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(frame))) {
 		this_cpu_inc(tun->pcpu_stats->tx_dropped);
 		ret = -ENOSPC;
 	}
@@ -1996,11 +1997,11 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 static ssize_t tun_put_user_xdp(struct tun_struct *tun,
 				struct tun_file *tfile,
-				struct xdp_buff *xdp,
+				struct xdp_frame *xdp_frame,
 				struct iov_iter *iter)
 {
 	int vnet_hdr_sz = 0;
-	size_t size = xdp->data_end - xdp->data;
+	size_t size = xdp_frame->len;
 	struct tun_pcpu_stats *stats;
 	size_t ret;
 
@@ -2016,7 +2017,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
 		iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
 	}
 
-	ret = copy_to_iter(xdp->data, size, iter) + vnet_hdr_sz;
+	ret = copy_to_iter(xdp_frame->data, size, iter) + vnet_hdr_sz;
 
 	stats = get_cpu_ptr(tun->pcpu_stats);
 	u64_stats_update_begin(&stats->syncp);
@@ -2184,11 +2185,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 			return err;
 	}
 
-	if (tun_is_xdp_buff(ptr)) {
-		struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+	if (tun_is_xdp_frame(ptr)) {
+		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		ret = tun_put_user_xdp(tun, tfile, xdp, to);
-		put_page(virt_to_head_page(xdp->data));
+		ret = tun_put_user_xdp(tun, tfile, xdpf, to);
+		xdp_return_frame(xdpf->data, &xdpf->mem);
 	} else {
 		struct sk_buff *skb = ptr;
 
@@ -2427,10 +2428,10 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 static int tun_ptr_peek_len(void *ptr)
 {
 	if (likely(ptr)) {
-		if (tun_is_xdp_buff(ptr)) {
-			struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+		if (tun_is_xdp_frame(ptr)) {
+			struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-			return xdp->data_end - xdp->data;
+			return xdpf->len;
 		}
 		return __skb_array_len_with_tag(ptr);
 	} else {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b5fb56b822fd..5aee3aaf6c8c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,7 @@
 #include <linux/skbuff.h>
 
 #include <net/sock.h>
+#include <net/xdp.h>
 
 #include "vhost.h"
 
@@ -177,10 +178,10 @@ static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
 
 static int vhost_net_buf_peek_len(void *ptr)
 {
-	if (tun_is_xdp_buff(ptr)) {
-		struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+	if (tun_is_xdp_frame(ptr)) {
+		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		return xdp->data_end - xdp->data;
+		return xdpf->len;
 	}
 
 	return __skb_array_len_with_tag(ptr);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index c5b0a75a7812..33b817b172af 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -22,7 +22,7 @@
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
-bool tun_is_xdp_buff(void *ptr);
+bool tun_is_xdp_frame(void *ptr);
 void *tun_xdp_to_ptr(void *ptr);
 void *tun_ptr_to_xdp(void *ptr);
 #else
@@ -38,7 +38,7 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline bool tun_is_xdp_buff(void *ptr)
+static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
 }


* [bpf-next V2 PATCH 07/15] virtio_net: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (5 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-09  8:03   ` Jason Wang
  2018-03-08 13:08 ` [bpf-next V2 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame Jesper Dangaard Brouer
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

The virtio_net driver assumes XDP frames are always released based on
page refcnt (via put_page).  Thus, it only queues the XDP data pointer
address and uses virt_to_head_page() to retrieve the struct page.

Use the XDP return API to get away from such assumptions.  Instead,
queue an xdp_frame, which allows us to use the xdp_return_frame API
when releasing the frame.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/virtio_net.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 23374603e4d9..6c4220450506 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -419,30 +419,41 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 			       struct xdp_buff *xdp)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
-	unsigned int len;
+	struct xdp_frame *xdpf, *xdpf_sent;
 	struct send_queue *sq;
+	unsigned int len;
 	unsigned int qp;
-	void *xdp_sent;
 	int err;
 
 	qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
 	sq = &vi->sq[qp];
 
 	/* Free up any pending old buffers before queueing new ones. */
-	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		struct page *sent_page = virt_to_head_page(xdp_sent);
+	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
+		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
 
-		put_page(sent_page);
-	}
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	/* virtqueue wants to use the data area in front of the packet */
+	if (unlikely(xdpf->metasize > 0))
+		return -EOPNOTSUPP;
+
+	if (unlikely(xdpf->headroom < vi->hdr_len))
+		return -EOVERFLOW;
 
-	xdp->data -= vi->hdr_len;
+	/* Make room for virtqueue hdr (also change xdpf->headroom?) */
+	xdpf->data -= vi->hdr_len;
 	/* Zero header and leave csum up to XDP layers */
-	hdr = xdp->data;
+	hdr = xdpf->data;
 	memset(hdr, 0, vi->hdr_len);
+	hdr->hdr.hdr_len = xdpf->len; /* Q: is this needed? */
+	xdpf->len   += vi->hdr_len;
 
-	sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+	sg_init_one(sq->sg, xdpf->data, xdpf->len);
 
-	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC);
+	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
 	if (unlikely(err))
 		return false; /* Caller handle free/refcnt */
 


* [bpf-next V2 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (6 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 07/15] virtio_net: " Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 09/15] mlx5: register a memory model when XDP is enabled Jesper Dangaard Brouer
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

The generic xdp_frame format was inspired by cpumap's own internal
xdp_pkt format.  It is now time to convert cpumap over to the generic
xdp_frame format.  The cpumap needs one extra field, dev_rx.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h   |    1 +
 kernel/bpf/cpumap.c |  100 ++++++++++++++-------------------------------------
 2 files changed, 29 insertions(+), 72 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index c5e7e2d27dd8..9f47fbc78aa8 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -72,6 +72,7 @@ struct xdp_frame {
 	 * while mem info is valid on remote CPU.
 	 */
 	struct xdp_mem_info mem;
+	struct net_device *dev_rx; /* used by cpumap */
 };
 
 /* Convert xdp_buff to xdp_frame */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3e4bbcbe3e86..bcdc4dea5ce7 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -159,52 +159,8 @@ static void cpu_map_kthread_stop(struct work_struct *work)
 	kthread_stop(rcpu->kthread);
 }
 
-/* For now, xdp_pkt is a cpumap internal data structure, with info
- * carried between enqueue to dequeue. It is mapped into the top
- * headroom of the packet, to avoid allocating separate mem.
- */
-struct xdp_pkt {
-	void *data;
-	u16 len;
-	u16 headroom;
-	u16 metasize;
-	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
-	 * while mem info is valid on remote CPU.
-	 */
-	struct xdp_mem_info mem;
-	struct net_device *dev_rx;
-};
-
-/* Convert xdp_buff to xdp_pkt */
-static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
-{
-	struct xdp_pkt *xdp_pkt;
-	int metasize;
-	int headroom;
-
-	/* Assure headroom is available for storing info */
-	headroom = xdp->data - xdp->data_hard_start;
-	metasize = xdp->data - xdp->data_meta;
-	metasize = metasize > 0 ? metasize : 0;
-	if (unlikely((headroom - metasize) < sizeof(*xdp_pkt)))
-		return NULL;
-
-	/* Store info in top of packet */
-	xdp_pkt = xdp->data_hard_start;
-
-	xdp_pkt->data = xdp->data;
-	xdp_pkt->len  = xdp->data_end - xdp->data;
-	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
-	xdp_pkt->metasize = metasize;
-
-	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
-	xdp_pkt->mem = xdp->rxq->mem;
-
-	return xdp_pkt;
-}
-
 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
-					 struct xdp_pkt *xdp_pkt)
+					 struct xdp_frame *xdpf)
 {
 	unsigned int frame_size;
 	void *pkt_data_start;
@@ -219,7 +175,7 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 	 * would be preferred to set frame_size to 2048 or 4096
 	 * depending on the driver.
 	 *   frame_size = 2048;
-	 *   frame_len  = frame_size - sizeof(*xdp_pkt);
+	 *   frame_len  = frame_size - sizeof(*xdp_frame);
 	 *
 	 * Instead, with info avail, skb_shared_info in placed after
 	 * packet len.  This, unfortunately fakes the truesize.
@@ -227,21 +183,21 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 	 * is not at a fixed memory location, with mixed length
 	 * packets, which is bad for cache-line hotness.
 	 */
-	frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom +
+	frame_size = SKB_DATA_ALIGN(xdpf->len) + xdpf->headroom +
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;
+	pkt_data_start = xdpf->data - xdpf->headroom;
 	skb = build_skb(pkt_data_start, frame_size);
 	if (!skb)
 		return NULL;
 
-	skb_reserve(skb, xdp_pkt->headroom);
-	__skb_put(skb, xdp_pkt->len);
-	if (xdp_pkt->metasize)
-		skb_metadata_set(skb, xdp_pkt->metasize);
+	skb_reserve(skb, xdpf->headroom);
+	__skb_put(skb, xdpf->len);
+	if (xdpf->metasize)
+		skb_metadata_set(skb, xdpf->metasize);
 
 	/* Essential SKB info: protocol and skb->dev */
-	skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);
+	skb->protocol = eth_type_trans(skb, xdpf->dev_rx);
 
 	/* Optional SKB info, currently missing:
 	 * - HW checksum info		(skb->ip_summed)
@@ -259,11 +215,11 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
 	 * gracefully and warn once.
 	 */
-	struct xdp_pkt *xdp_pkt;
+	struct xdp_frame *xdpf;
 
-	while ((xdp_pkt = ptr_ring_consume(ring)))
-		if (WARN_ON_ONCE(xdp_pkt))
-			xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+	while ((xdpf = ptr_ring_consume(ring)))
+		if (WARN_ON_ONCE(xdpf))
+			xdp_return_frame(xdpf->data, &xdpf->mem);
 }
 
 static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
@@ -290,7 +246,7 @@ static int cpu_map_kthread_run(void *data)
 	 */
 	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
 		unsigned int processed = 0, drops = 0, sched = 0;
-		struct xdp_pkt *xdp_pkt;
+		struct xdp_frame *xdpf;
 
 		/* Release CPU reschedule checks */
 		if (__ptr_ring_empty(rcpu->queue)) {
@@ -313,13 +269,13 @@ static int cpu_map_kthread_run(void *data)
 		 * kthread CPU pinned. Lockless access to ptr_ring
 		 * consume side valid as no-resize allowed of queue.
 		 */
-		while ((xdp_pkt = __ptr_ring_consume(rcpu->queue))) {
+		while ((xdpf = __ptr_ring_consume(rcpu->queue))) {
 			struct sk_buff *skb;
 			int ret;
 
-			skb = cpu_map_build_skb(rcpu, xdp_pkt);
+			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
-				xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+				xdp_return_frame(xdpf->data, &xdpf->mem);
 				continue;
 			}
 
@@ -616,13 +572,13 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 	spin_lock(&q->producer_lock);
 
 	for (i = 0; i < bq->count; i++) {
-		struct xdp_pkt *xdp_pkt = bq->q[i];
+		struct xdp_frame *xdpf = bq->q[i];
 		int err;
 
-		err = __ptr_ring_produce(q, xdp_pkt);
+		err = __ptr_ring_produce(q, xdpf);
 		if (err) {
 			drops++;
-			xdp_return_frame(xdp_pkt->data, &xdp_pkt->mem);
+			xdp_return_frame(xdpf->data, &xdpf->mem);
 		}
 		processed++;
 	}
@@ -637,7 +593,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
  * Thus, safe percpu variable access.
  */
-static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_pkt *xdp_pkt)
+static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
 
@@ -648,28 +604,28 @@ static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_pkt *xdp_pkt)
 	 * driver to code invoking us to finished, due to driver
 	 * (e.g. ixgbe) recycle tricks based on page-refcnt.
 	 *
-	 * Thus, incoming xdp_pkt is always queued here (else we race
+	 * Thus, incoming xdp_frame is always queued here (else we race
 	 * with another CPU on page-refcnt and remaining driver code).
 	 * Queue time is very short, as driver will invoke flush
 	 * operation, when completing napi->poll call.
 	 */
-	bq->q[bq->count++] = xdp_pkt;
+	bq->q[bq->count++] = xdpf;
 	return 0;
 }
 
 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
 		    struct net_device *dev_rx)
 {
-	struct xdp_pkt *xdp_pkt;
+	struct xdp_frame *xdpf;
 
-	xdp_pkt = convert_to_xdp_pkt(xdp);
-	if (unlikely(!xdp_pkt))
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
 		return -EOVERFLOW;
 
 	/* Info needed when constructing SKB on remote CPU */
-	xdp_pkt->dev_rx = dev_rx;
+	xdpf->dev_rx = dev_rx;
 
-	bq_enqueue(rcpu, xdp_pkt);
+	bq_enqueue(rcpu, xdpf);
 	return 0;
 }
 


* [bpf-next V2 PATCH 09/15] mlx5: register a memory model when XDP is enabled
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (7 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping Jesper Dangaard Brouer
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enables a different memory model, thus activating another code path
in the xdp_return_frame API.

V2: Fixed issues pointed out by Tariq.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index da94c8cba5ee..2e4ca0f15b62 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -506,6 +506,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
+	/* This must only be activated for order-0 pages */
+	if (rq->xdp_prog) {
+		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+						 MEM_TYPE_PAGE_ORDER0, NULL);
+		if (err)
+			goto err_rq_wq_destroy;
+	}
+
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
 


* [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (8 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 09/15] mlx5: register a memory model when XDP is enabled Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-09  8:08   ` Jason Wang
  2018-03-08 13:08 ` [bpf-next V2 PATCH 11/15] page_pool: refurbish version of page_pool code Jesper Dangaard Brouer
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Use the IDA infrastructure for getting a cyclic increasing ID number,
which is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable for creating the ID to
pointer lookup table, because this is faster.

The problem being solved here is that the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as its
guaranteed lifetime is too short.  The info is needed on a
(potentially) remote CPU during DMA-TX completion time.  The
xdp_mem_info is stored in the xdp_frame when it gets converted from an
xdp_buff, which is sufficient for the simple page refcnt based recycle
schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID-to-pointer map.  The removal and validity of the
allocator and helper struct xdp_mem_allocator are guarded by RCU.  The
allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().

It is up for debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.

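For illustration, the expected end state once a real allocator type
exists (MEM_TYPE_PAGE_POOL only arrives later in this series, so this
is a forward-looking sketch; rq->page_pool is hypothetical here):

  /* Driver registers its allocator for this RX-queue */
  err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
                                   MEM_TYPE_PAGE_POOL, rq->page_pool);
  if (err)
          goto err_rq_wq_destroy;

  /* Internally this IDA-allocates rq->xdp_rxq.mem.id and inserts the
   * (id -> allocator) pair into the rhashtable, allowing a later
   * xdp_return_frame() call to do an RCU-protected lookup via mem->id.
   */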

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    9 +
 include/net/xdp.h                             |   15 --
 net/core/xdp.c                                |  230 ++++++++++++++++++++++++-
 3 files changed, 235 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 45520eb503ee..ff069597fccf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6360,7 +6360,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	struct device *dev = rx_ring->dev;
 	int orig_node = dev_to_node(dev);
 	int ring_node = -1;
-	int size;
+	int size, err;
 
 	size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count;
 
@@ -6397,6 +6397,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 			     rx_ring->queue_index) < 0)
 		goto err;
 
+	err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
+					 MEM_TYPE_PAGE_SHARED, NULL);
+	if (err) {
+		xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+		goto err;
+	}
+
 	rx_ring->xdp_prog = adapter->xdp_prog;
 
 	return 0;
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 9f47fbc78aa8..e82cf91f384a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -45,7 +45,7 @@ enum mem_type {
 
 struct xdp_mem_info {
 	u32 type; /* enum mem_type, but known size type */
-	u32 id; // Needed later (to lookup struct xdp_rxq_info)
+	u32 id;
 };
 
 struct xdp_rxq_info {
@@ -104,18 +104,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-static inline
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
-{
-	if (mem->type == MEM_TYPE_PAGE_SHARED)
-		page_frag_free(data);
-
-	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
-		struct page *page = virt_to_page(data); /* Assumes order0 page*/
-
-		put_page(page);
-	}
-}
+void xdp_return_frame(void *data, struct xdp_mem_info *mem);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 9eee0c431126..06a5b39491ad 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -5,6 +5,9 @@
  */
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <linux/rhashtable.h>
 
 #include <net/xdp.h>
 
@@ -13,6 +16,99 @@
 #define REG_STATE_UNREGISTERED	0x2
 #define REG_STATE_UNUSED	0x3
 
+DEFINE_IDA(mem_id_pool);
+static DEFINE_MUTEX(mem_id_lock);
+#define MEM_ID_MAX 0xFFFE
+#define MEM_ID_MIN 1
+static int mem_id_next = MEM_ID_MIN;
+
+static bool mem_id_init; /* false */
+static struct rhashtable *mem_id_ht;
+
+struct xdp_mem_allocator {
+	struct xdp_mem_info mem;
+	void *allocator;
+	struct rhash_head node;
+	struct rcu_head rcu;
+};
+
+static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
+{
+	const u32 *k = data;
+	const u32 key = *k;
+
+	BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_mem_allocator, mem.id)
+		     != sizeof(u32));
+
+	/* Use cyclic increasing ID as direct hash key, see rht_bucket_index */
+	return key << RHT_HASH_RESERVED_SPACE;
+}
+
+static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg,
+			  const void *ptr)
+{
+	const struct xdp_mem_allocator *xa = ptr;
+	u32 mem_id = *(u32 *)arg->key;
+
+	return xa->mem.id != mem_id;
+}
+
+static const struct rhashtable_params mem_id_rht_params = {
+	.nelem_hint = 64,
+	.head_offset = offsetof(struct xdp_mem_allocator, node),
+	.key_offset  = offsetof(struct xdp_mem_allocator, mem.id),
+	.key_len = FIELD_SIZEOF(struct xdp_mem_allocator, mem.id),
+	.max_size = MEM_ID_MAX,
+	.min_size = 8,
+	.automatic_shrinking = true,
+	.hashfn    = xdp_mem_id_hashfn,
+	.obj_cmpfn = xdp_mem_id_cmp,
+};
+
+void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
+{
+	struct xdp_mem_allocator *xa;
+
+	xa = container_of(rcu, struct xdp_mem_allocator, rcu);
+
+	/* Allow this ID to be reused */
+	ida_simple_remove(&mem_id_pool, xa->mem.id);
+
+	/* TODO: Depending on allocator type/pointer free resources */
+
+	/* Poison memory */
+	xa->mem.id = 0xFFFF;
+	xa->mem.type = 0xF0F0;
+	xa->allocator = (void *)0xDEAD9001;
+
+	kfree(xa);
+}
+
+void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
+{
+	struct xdp_mem_allocator *xa;
+	int id = xdp_rxq->mem.id;
+	int err;
+
+	if (id == 0)
+		return;
+
+	mutex_lock(&mem_id_lock);
+
+	xa = rhashtable_lookup(mem_id_ht, &id, mem_id_rht_params);
+	if (!xa) {
+		mutex_unlock(&mem_id_lock);
+		return;
+	}
+
+	err = rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params);
+	WARN_ON(err);
+
+	call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
+
+	mutex_unlock(&mem_id_lock);
+}
+
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 {
 	/* Simplify driver cleanup code paths, allow unreg "unused" */
@@ -21,8 +117,14 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 
 	WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
 
+	__xdp_rxq_info_unreg_mem_model(xdp_rxq);
+
 	xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
 	xdp_rxq->dev = NULL;
+
+	/* Reset mem info to defaults */
+	xdp_rxq->mem.id = 0;
+	xdp_rxq->mem.type = 0;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
 
@@ -72,20 +174,138 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
 
+int __mem_id_init_hash_table(void)
+{
+	struct rhashtable *rht;
+	int ret;
+
+	if (unlikely(mem_id_init))
+		return 0;
+
+	rht = kzalloc(sizeof(*rht), GFP_KERNEL);
+	if (!rht)
+		return -ENOMEM;
+
+	ret = rhashtable_init(rht, &mem_id_rht_params);
+	if (ret < 0) {
+		kfree(rht);
+		return ret;
+	}
+	mem_id_ht = rht;
+	smp_mb(); /* mutex lock should provide enough pairing */
+	mem_id_init = true;
+
+	return 0;
+}
+
+/* Allocate a cyclic ID that maps to allocator pointer.
+ * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
+ *
+ * Caller must lock mem_id_lock.
+ */
+static int __mem_id_cyclic_get(gfp_t gfp)
+{
+	int retries = 1;
+	int id;
+
+again:
+	id = ida_simple_get(&mem_id_pool, mem_id_next, MEM_ID_MAX, gfp);
+	if (id < 0) {
+		if (id == -ENOSPC) {
+			/* Cyclic allocator, reset next id */
+			if (retries--) {
+				mem_id_next = MEM_ID_MIN;
+				goto again;
+			}
+		}
+		return id; /* errno */
+	}
+	mem_id_next = id + 1;
+
+	return id;
+}
+
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum mem_type type, void *allocator)
 {
+	struct xdp_mem_allocator *xdp_alloc;
+	gfp_t gfp = GFP_KERNEL;
+	int id, errno, ret;
+	void *ptr;
+
+	if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
+		WARN(1, "Missing register, driver bug");
+		return -EFAULT;
+	}
+
 	if (type >= MEM_TYPE_MAX)
 		return -EINVAL;
 
 	xdp_rxq->mem.type = type;
 
-	if (allocator)
-		return -EOPNOTSUPP;
+	if (!allocator)
+		return 0;
+
+	/* Delay init of rhashtable to save memory if feature isn't used */
+	if (!mem_id_init) {
+		mutex_lock(&mem_id_lock);
+		ret = __mem_id_init_hash_table();
+		mutex_unlock(&mem_id_lock);
+		if (ret < 0) {
+			WARN_ON(1);
+			return ret;
+		}
+	}
+
+	xdp_alloc = kzalloc(sizeof(*xdp_alloc), gfp);
+	if (!xdp_alloc)
+		return -ENOMEM;
+
+	mutex_lock(&mem_id_lock);
+	id = __mem_id_cyclic_get(gfp);
+	if (id < 0) {
+		errno = id;
+		goto err;
+	}
+	xdp_rxq->mem.id = id;
+	xdp_alloc->mem  = xdp_rxq->mem;
+	xdp_alloc->allocator = allocator;
+
+	/* Insert allocator into ID lookup table */
+	ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node);
+	if (IS_ERR(ptr)) {
+		errno = PTR_ERR(ptr);
+		goto err;
+	}
+
+	mutex_unlock(&mem_id_lock);
 
-	/* TODO: Allocate an ID that maps to allocator pointer
-	 * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
-	 */
 	return 0;
+err:
+	mutex_unlock(&mem_id_lock);
+	kfree(xdp_alloc);
+	return errno;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
+
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+	struct xdp_mem_allocator *xa;
+
+	rcu_read_lock();
+	if (mem->id)
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+	rcu_read_unlock();
+
+	if (mem->type == MEM_TYPE_PAGE_SHARED) {
+		page_frag_free(data);
+		return;
+	}
+
+	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+		struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+		put_page(page);
+	}
+}
+EXPORT_SYMBOL_GPL(xdp_return_frame);


* [bpf-next V2 PATCH 11/15] page_pool: refurbish version of page_pool code
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (9 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Need a fast page recycle mechanism for the ndo_xdp_xmit API, for
returning pages at DMA-TX completion time, with good cross-CPU
performance, given DMA-TX completion can happen on a remote CPU.

Refurbish my page_pool code, which was presented[1] at MM-summit 2016.
The page_pool code is adapted to not depend on the page allocator and
integration into struct page.  The DMA mapping feature is kept,
even though it will not be activated/used in this patchset.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
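
A sketch of how a driver is expected to set up and use the pool, based
on the header below (the page_pool_create() constructor name and the
concrete parameter values are assumptions for illustration):

  struct page_pool_params pp_params = {
          .size      = PAGE_POOL_PARAMS_SIZE,
          .order     = 0,                /* order-0 pages */
          .flags     = 0,                /* DMA map feature not used here */
          .nid       = dev_to_node(dev), /* NUMA node to allocate from */
          .pool_size = 1024,             /* ptr_ring size for recycling */
  };
  struct page_pool *pool = page_pool_create(&pp_params); /* assumed ctor */

  /* RX-ring refill; allocation side is NAPI/softirq protected */
  struct page *page = page_pool_dev_alloc_pages(pool);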

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/page_pool.h |  123 +++++++++++++++++
 net/core/Makefile       |    1 
 net/core/page_pool.c    |  334 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 458 insertions(+)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
new file mode 100644
index 000000000000..bfbd6fd018bb
--- /dev/null
+++ b/include/net/page_pool.h
@@ -0,0 +1,123 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * page_pool.h
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ *
+ * Notice: this page_pool variant is no longer hooked into the
+ * put_page() code.  Thus, it is the responsibility of the API user to
+ * return pages.
+ *
+ * UPDATE: This can be used without the DMA mapping part (which uses
+ * page->private).  This means that the pages-must-be-returned-to-pool
+ * requirement can be relaxed, as no state is maintained in
+ * struct page.
+ *
+ */
+#ifndef _NET_PAGE_POOL_H
+#define _NET_PAGE_POOL_H
+
+#include <linux/mm.h> /* Needed by ptr_ring */
+#include <linux/ptr_ring.h>
+#include <linux/dma-direction.h>
+
+#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */
+#define PP_FLAG_ALL	PP_FLAG_DMA_MAP
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark are related to the network
+ * use-case.  The NAPI budget is 64 packets.  After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects is due to the XDP_DROP use-case: as
+ * XDP_DROP shares the same softirq/NAPI protection, it allows the
+ * opportunity to recycle objects directly into this array.  If the
+ * cache is already full (or partly full), then the XDP_DROP recycles
+ * would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE	128
+#define PP_ALLOC_CACHE_REFILL	64
+struct pp_alloc_cache {
+	u32 count ____cacheline_aligned_in_smp;
+	u32 refill; /* not used atm */
+	void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+struct page_pool_params {
+	u32		size; /* caller sets size of struct */
+	unsigned int	order;
+	unsigned long	flags;
+	struct device	*dev; /* device, for DMA pre-mapping purposes */
+	int		nid;  /* NUMA node id to allocate pages from */
+	enum dma_data_direction dma_dir; /* DMA mapping direction */
+	unsigned int	pool_size;
+	char		end_marker[0]; /* must be last struct member */
+};
+#define	PAGE_POOL_PARAMS_SIZE	offsetof(struct page_pool_params, end_marker)
+
+struct page_pool {
+	struct page_pool_params p;
+
+	/*
+	 * Data structure for allocation side
+	 *
+	 * The driver's allocation side usually already performs some
+	 * kind of resource protection.  Piggyback on this protection,
+	 * and require the driver to protect the allocation side.
+	 *
+	 * For NIC drivers this means allocating a page_pool per
+	 * RX-queue, as the RX-queue is already protected by
+	 * softirq/BH scheduling and napi_schedule.  The NAPI schedule
+	 * guarantees that a single napi_struct will only be scheduled
+	 * on a single CPU (see napi_schedule).
+	 */
+	struct pp_alloc_cache alloc;
+
+	/* Data structure for storing recycled pages.
+	 *
+	 * Returning/freeing pages is more complicated
+	 * synchronization-wise, because frees can happen on remote
+	 * CPUs, with no association to the allocation resource.
+	 *
+	 * Use ptr_ring, as it separates consumer and producer
+	 * efficiently, in a way that doesn't bounce cache-lines.
+	 *
+	 * TODO: Implement bulk return pages into this structure.
+	 */
+	struct ptr_ring ring;
+
+	struct rcu_head rcu;
+};
+
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+	return page_pool_alloc_pages(pool, gfp);
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params);
+
+void page_pool_destroy_rcu(struct page_pool *pool);
+
+/* Never call this directly, use helpers below */
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct);
+
+static inline void page_pool_put_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_put_page(pool, page, false);
+}
+/* Only very limited use-cases allow direct recycling */
+static inline void page_pool_recycle_direct(struct page_pool *pool,
+					    struct page *page)
+{
+	__page_pool_put_page(pool, page, true);
+}
+
+#endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..100a2b3b2a08 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -14,6 +14,7 @@ obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			fib_notifier.o xdp.o
 
 obj-y += net-sysfs.o
+obj-y += page_pool.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
 obj-$(CONFIG_NET_PKTGEN) += pktgen.o
 obj-$(CONFIG_NETPOLL) += netpoll.o
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
new file mode 100644
index 000000000000..8bf84d2a6a60
--- /dev/null
+++ b/net/core/page_pool.c
@@ -0,0 +1,334 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note
+ *
+ * page_pool.c
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include <net/page_pool.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include <linux/page-flags.h>
+#include <linux/mm.h> /* for __put_page() */
+
+int page_pool_init(struct page_pool *pool,
+		   const struct page_pool_params *params)
+{
+	int ring_qsize = 1024; /* Default */
+	int param_copy_sz;
+
+	if (!pool)
+		return -EFAULT;
+
+	/* Note, the below struct compat code was primarily needed when
+	 * the page_pool code lived under MM-tree control, given the
+	 * mmots and net-next trees progress at very different rates.
+	 *
+	 * Allow kernel devel trees and drivers to progress at different rates
+	 */
+	param_copy_sz = PAGE_POOL_PARAMS_SIZE;
+	memset(&pool->p, 0, param_copy_sz);
+	if (params->size < param_copy_sz) {
+		/* Older module calling newer kernel, handled by only
+		 * copying the supplied size, keeping remaining params zero
+		 */
+		param_copy_sz = params->size;
+	} else if (params->size > param_copy_sz) {
+		/* Newer module calling older kernel. Need to validate
+		 * no new features were requested.
+		 */
+		unsigned char *addr = (unsigned char *)params + param_copy_sz;
+		unsigned char *end  = (unsigned char *)params + params->size;
+
+		for (; addr < end; addr++) {
+			if (*addr != 0)
+				return -E2BIG;
+		}
+	}
+	memcpy(&pool->p, params, param_copy_sz);
+
+	/* Validate only known flags were used */
+	if (pool->p.flags & ~(PP_FLAG_ALL))
+		return -EINVAL;
+
+	if (pool->p.pool_size)
+		ring_qsize = pool->p.pool_size;
+
+	/* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+	 * DMA_BIDIRECTIONAL is for allowing a page to be used for DMA
+	 * sending, which is the XDP_TX use-case.
+	 */
+	if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
+	    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
+		return -EINVAL;
+
+	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
+		return -ENOMEM;
+
+	return 0;
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params)
+{
+	struct page_pool *pool;
+	int err = 0;
+
+	if (params->size < offsetof(struct page_pool_params, nid)) {
+		WARN(1, "Fix page_pool_params->size code\n");
+		return NULL;
+	}
+
+	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+	err = page_pool_init(pool, params);
+	if (err < 0) {
+		pr_warn("%s() gave up with errno %d\n", __func__, err);
+		kfree(pool);
+		return ERR_PTR(err);
+	}
+	return pool;
+}
+EXPORT_SYMBOL(page_pool_create);
+
+/* fast path */
+static struct page *__page_pool_get_cached(struct page_pool *pool)
+{
+	struct page *page;
+
+	/* Test for safe-context, caller should provide this guarantee */
+	if (likely(in_serving_softirq())) {
+		struct ptr_ring *r;
+
+		if (likely(pool->alloc.count)) {
+			/* Fast-path */
+			page = pool->alloc.cache[--pool->alloc.count];
+			return page;
+		}
+		/* Slower-path: Alloc array empty, time to refill */
+		r = &pool->ring;
+		/* Open-coded bulk ptr_ring consumer.
+		 *
+		 * Discussion: ATM the ring consumer lock is not
+		 * really needed due to the softirq/NAPI protection,
+		 * but later the MM-layer will need the ability to
+		 * reclaim pages from the ring. Thus, keep the locks.
+		 */
+		spin_lock(&r->consumer_lock);
+		while ((page = __ptr_ring_consume(r))) {
+			if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+				break;
+			pool->alloc.cache[pool->alloc.count++] = page;
+		}
+		spin_unlock(&r->consumer_lock);
+		return page;
+	}
+
+	/* Slow-path: Get page from locked ring queue */
+	page = ptr_ring_consume(&pool->ring);
+	return page;
+}
+
+/* slow path */
+noinline
+static struct page *__page_pool_alloc_pages(struct page_pool *pool,
+					    gfp_t _gfp)
+{
+	struct page *page;
+	gfp_t gfp = _gfp;
+	dma_addr_t dma;
+
+	/* We could always set __GFP_COMP, and avoid this branch, as
+	 * prep_new_page() can handle order-0 with __GFP_COMP.
+	 */
+	if (pool->p.order)
+		gfp |= __GFP_COMP;
+
+	/* FUTURE development:
+	 *
+	 * Current slow-path essentially falls back to single page
+	 * allocations, which doesn't improve performance.  This code
+	 * needs bulk allocation support from the page allocator code.
+	 *
+	 * For now, the page_pool recycle cache is not refilled.  Hint:
+	 * when pages are returned, they will go into the recycle
+	 * cache.
+	 */
+
+	/* Cache was empty, do real allocation */
+	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+	if (!page)
+		return NULL;
+
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		goto skip_dma_map;
+
+	/* Setup DMA mapping: use page->private for DMA-addr
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page(pool->p.dev, page, 0,
+			   (PAGE_SIZE << pool->p.order),
+			   pool->p.dma_dir);
+	if (dma_mapping_error(pool->p.dev, dma)) {
+		put_page(page);
+		return NULL;
+	}
+	set_page_private(page, dma); /* page->private = dma; */
+
+skip_dma_map:
+	/* A page that was just alloc'ed should/must have refcnt 1. */
+	return page;
+}
+
+/* Use this to replace alloc_pages() API calls, but with the added
+ * synchronization guarantee for the allocation side.
+ */
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	struct page *page;
+
+	/* Fast-path: Get a page from cache */
+	page = __page_pool_get_cached(pool);
+	if (page)
+		return page;
+
+	/* Slow-path: cache empty, do real allocation */
+	page = __page_pool_alloc_pages(pool, gfp);
+	return page;
+}
+EXPORT_SYMBOL(page_pool_alloc_pages);
+
+/* Cleanup page_pool state from page */
+static void __page_pool_clean_page(struct page_pool *pool,
+				   struct page *page)
+{
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		return;
+
+	/* DMA unmap */
+	dma_unmap_page(pool->p.dev, page_private(page),
+		       PAGE_SIZE << pool->p.order, pool->p.dma_dir);
+	set_page_private(page, 0);
+}
+
+/* Return a page to the page allocator, cleaning up our state */
+static void __page_pool_return_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_clean_page(pool, page);
+	/* Given the state was cleared, the page should be freed here.
+	 * Thus, the code invariant assumes refcnt==1, as __free_pages()
+	 * calls put_page_testzero().
+	 */
+	__free_pages(page, pool->p.order);
+}
+
+static bool __page_pool_recycle_into_ring(struct page_pool *pool,
+					  struct page *page)
+{
+	int ret;
+	/* BH protection not needed if current is serving softirq */
+	if (in_serving_softirq())
+		ret = ptr_ring_produce(&pool->ring, page);
+	else
+		ret = ptr_ring_produce_bh(&pool->ring, page);
+
+	return ret == 0;
+}
+
+/* Only allow direct recycling in very special circumstances, into the
+ * alloc cache.  E.g. XDP_DROP use-case.
+ *
+ * Caller must provide appropriate safe context.
+ */
+static bool __page_pool_recycle_direct(struct page *page,
+				       struct page_pool *pool)
+{
+	/* WARN_RATELIMIT(!in_serving_softirq(), "Wrong context\n"); */
+
+	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
+		return false;
+
+	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	pool->alloc.cache[pool->alloc.count++] = page;
+	return true;
+}
+
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct)
+{
+	/* This is a fast-path optimization that avoids an atomic
+	 * operation in the case where a single object is (refcnt-wise)
+	 * using the page.
+	 *
+	 * refcnt == 1 means page_pool owns page, and can recycle it.
+	 */
+	if (likely(page_ref_count(page) == 1)) {
+		/* Read barrier implicitly paired with full MB of atomic ops */
+		smp_rmb();
+
+		if (allow_direct && in_serving_softirq())
+			if (__page_pool_recycle_direct(page, pool))
+				return;
+
+		if (!__page_pool_recycle_into_ring(pool, page)) {
+			/* Cache full, do real __free_pages() */
+			__page_pool_return_page(pool, page);
+		}
+		return;
+	}
+	/* Many drivers split up the page into fragments, and some
+	 * want to keep doing this to save memory. Here put_page_testzero()
+	 * works as a refcnt decrement, and should not return true.
+	 */
+	if (unlikely(put_page_testzero(page))) {
+		/* Reaching refcnt zero here should not be possible,
+		 * and indicates a code error.  Don't crash, but warn,
+		 * and handle the case by not recycling, instead
+		 * returning the page to the page allocator.
+		 */
+		WARN(1, "%s() violating page_pool invariant refcnt:%d\n",
+		     __func__, page_ref_count(page));
+		/* Cleanup state before directly returning page */
+		__page_pool_clean_page(pool, page);
+		__put_page(page);
+	}
+}
+EXPORT_SYMBOL(__page_pool_put_page);
+
+/* Cleanup and release resources */
+static void __page_pool_destroy_rcu(struct rcu_head *rcu)
+{
+	struct page_pool *pool;
+	struct page *page;
+
+	pool = container_of(rcu, struct page_pool, rcu);
+
+	/* Empty the alloc cache; assume the caller made sure the pool
+	 * is no longer in use, and that page_pool_alloc_pages() cannot
+	 * be called concurrently.
+	 */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
+
+	/* Empty recycle ring */
+	while ((page = ptr_ring_consume(&pool->ring))) {
+		/* Verify the refcnt invariant of cached pages */
+		if (page_ref_count(page) != 1) {
+			pr_crit("%s() page_pool refcnt %d violation\n",
+				__func__, page_ref_count(page));
+			WARN_ON(1);
+		}
+		__page_pool_return_page(pool, page);
+	}
+	ptr_ring_cleanup(&pool->ring, NULL);
+	kfree(pool);
+}
+
+void page_pool_destroy_rcu(struct page_pool *pool)
+{
+	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
+}
+EXPORT_SYMBOL(page_pool_destroy_rcu);


* [bpf-next V2 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (10 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 11/15] page_pool: refurbish version of page_pool code Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.

The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed).  For now, the driver
should keep track of the page_pool pointer separately, and use it for
RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the driver's
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver has called xdp_rxq_info_unreg(), it is the
driver's responsibility to free the page_pool, but via an RCU free
call.  This is done easily via the page_pool helper
page_pool_destroy_rcu() (which avoids touching any driver code during
the RCU callback, which could happen after the driver has been
unloaded).
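
A sketch of the resulting setup/teardown ordering, as seen from a
driver (hypothetical names; the driver keeps the pool pointer in its
own RX-ring state):

	/* setup */
	pool = page_pool_create(&pp_params);
	if (IS_ERR_OR_NULL(pool))
		return -ENOMEM;
	err = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
					 MEM_TYPE_PAGE_POOL, pool);

	/* teardown: unregister the rxq first, then RCU-free the pool */
	xdp_rxq_info_unreg(&ring->xdp_rxq);
	page_pool_destroy_rcu(pool);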

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h |    4 +++-
 net/core/xdp.c    |   23 ++++++++++++++++++++---
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index e82cf91f384a..26eb645c699a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -36,8 +36,8 @@
 enum mem_type {
 	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
+	MEM_TYPE_PAGE_POOL,
 	// Possible new ideas for types:
-	// MEM_TYPE_PAGE_POOL,    /* Will be added later  */
 	// MEM_TYPE_AF_XDP,
 	// MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo?
 	MEM_TYPE_MAX,
@@ -48,6 +48,8 @@ struct xdp_mem_info {
 	u32 id;
 };
 
+struct page_pool;
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 06a5b39491ad..fe8e87abc266 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -8,6 +8,7 @@
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/rhashtable.h>
+#include <net/page_pool.h>
 
 #include <net/xdp.h>
 
@@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht;
 
 struct xdp_mem_allocator {
 	struct xdp_mem_info mem;
-	void *allocator;
+	union {
+		void *allocator;
+		struct page_pool *page_pool;
+	};
 	struct rhash_head node;
 	struct rcu_head rcu;
 };
@@ -74,7 +78,9 @@ void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
 	/* Allow this ID to be reused */
 	ida_simple_remove(&mem_id_pool, xa->mem.id);
 
-	/* TODO: Depending on allocator type/pointer free resources */
+	/* Notice, the driver is expected to free the *allocator,
+	 * e.g. the page_pool, and MUST also use an RCU free.
+	 */
 
 	/* Poison memory */
 	xa->mem.id = 0xFFFF;
@@ -290,11 +296,22 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
-	struct xdp_mem_allocator *xa;
+	struct xdp_mem_allocator *xa = NULL;
 
 	rcu_read_lock();
 	if (mem->id)
 		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+
+	if (mem->type == MEM_TYPE_PAGE_POOL) {
+		struct page *page = virt_to_head_page(data);
+
+		if (xa)
+			page_pool_put_page(xa->page_pool, page);
+		else
+			put_page(page);
+		rcu_read_unlock();
+		return;
+	}
 	rcu_read_unlock();
 
 	if (mem->type == MEM_TYPE_PAGE_SHARED) {
@@ -306,6 +322,7 @@ void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 		struct page *page = virt_to_page(data); /* Assumes order-0 page */
 
 		put_page(page);
+		return;
 	}
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);


* [bpf-next V2 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (11 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 14/15] xdp: transition into using xdp_frame for return API Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This patch shows how it is possible to have both the driver-local page
cache, which uses an elevated refcnt for "catching"/avoiding SKB
put_page, and at the same time have pages getting returned to the
page_pool from the ndo_xdp_xmit DMA completion path.

Performance is surprisingly good. Tested DMA-TX completion on ixgbe,
which calls "xdp_return_frame", which in turn calls
page_pool_put_page().  Stats show DMA-TX-completion runs on CPU#9 and
mlx5 RX runs on CPU#5.  (Internally page_pool uses ptr_ring, which is
what gives the good cross-CPU performance.)

Show adapter(s) (ixgbe2 mlx5p2) statistics (ONLY that changed!)
Ethtool(ixgbe2  ) stat:    732863573 (    732,863,573) <= tx_bytes /sec
Ethtool(ixgbe2  ) stat:    781724427 (    781,724,427) <= tx_bytes_nic /sec
Ethtool(ixgbe2  ) stat:     12214393 (     12,214,393) <= tx_packets /sec
Ethtool(ixgbe2  ) stat:     12214435 (     12,214,435) <= tx_pkts_nic /sec
Ethtool(mlx5p2  ) stat:     12211786 (     12,211,786) <= rx3_cache_empty /sec
Ethtool(mlx5p2  ) stat:     36506736 (     36,506,736) <= rx_64_bytes_phy /sec
Ethtool(mlx5p2  ) stat:   2336430575 (  2,336,430,575) <= rx_bytes_phy /sec
Ethtool(mlx5p2  ) stat:     12211786 (     12,211,786) <= rx_cache_empty /sec
Ethtool(mlx5p2  ) stat:     22823073 (     22,823,073) <= rx_discards_phy /sec
Ethtool(mlx5p2  ) stat:      1471860 (      1,471,860) <= rx_out_of_buffer /sec
Ethtool(mlx5p2  ) stat:     36506715 (     36,506,715) <= rx_packets_phy /sec
Ethtool(mlx5p2  ) stat:   2336542282 (  2,336,542,282) <= rx_prio0_bytes /sec
Ethtool(mlx5p2  ) stat:     13683921 (     13,683,921) <= rx_prio0_packets /sec
Ethtool(mlx5p2  ) stat:    821015537 (    821,015,537) <= rx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat:     13683608 (     13,683,608) <= rx_vport_unicast_packets /sec

Before this patch: single flow performance was 6Mpps, and if I started
two flows the collective performance dropped to 4Mpps, because we hit
the page allocator lock (further negative scaling occurs from there).

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes to not return NULL, only
   ERR_PTR, as this simplifies error handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    3 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   41 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   16 ++++++--
 net/core/page_pool.c                              |    2 +
 4 files changed, 49 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 28cc26debeda..ab91166f7c5a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,8 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+struct page_pool;
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
@@ -535,6 +537,7 @@ struct mlx5e_rq {
 	/* XDP */
 	struct bpf_prog       *xdp_prog;
 	struct mlx5e_xdpsq     xdpsq;
+	struct page_pool      *page_pool;
 
 	/* control */
 	struct mlx5_wq_ctrl    wq_ctrl;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2e4ca0f15b62..bf17e6d614d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -35,6 +35,7 @@
 #include <linux/mlx5/fs.h>
 #include <net/vxlan.h>
 #include <linux/bpf.h>
+#include <net/page_pool.h>
 #include "eswitch.h"
 #include "en.h"
 #include "en_tc.h"
@@ -387,10 +388,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 			  struct mlx5e_rq_param *rqp,
 			  struct mlx5e_rq *rq)
 {
+	struct page_pool_params pp_params = { 0 };
 	struct mlx5_core_dev *mdev = c->mdev;
 	void *rqc = rqp->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
-	u32 byte_count;
+	u32 byte_count, pool_size;
 	int npages;
 	int wq_sz;
 	int err;
@@ -429,10 +431,13 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
 	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = params->rq_headroom;
+	pool_size = 1 << params->log_rq_size;
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 
+		pool_size = pool_size * MLX5_MPWRQ_PAGES_PER_WQE;
+
 		rq->post_wqes = mlx5e_post_rx_mpwqes;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
@@ -506,13 +511,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
-	/* This must only be activate for order-0 pages */
-	if (rq->xdp_prog) {
-		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
-						 MEM_TYPE_PAGE_ORDER0, NULL);
-		if (err)
-			goto err_rq_wq_destroy;
+	/* Create a page_pool and register it with rxq */
+	pp_params.size      = PAGE_POOL_PARAMS_SIZE;
+	pp_params.order     = rq->buff.page_order;
+	pp_params.dev       = c->pdev;
+	pp_params.nid       = cpu_to_node(c->cpu);
+	pp_params.dma_dir   = rq->buff.map_dir;
+	pp_params.pool_size = pool_size;
+	pp_params.flags     = 0; /* No-internal DMA mapping in page_pool */
+
+	/* page_pool can be used even when there is no rq->xdp_prog:
+	 * given this page_pool does not handle the DMA mapping, there
+	 * is no required state to clear.  And page_pool gracefully
+	 * handles an elevated refcnt.
+	 */
+	rq->page_pool = page_pool_create(&pp_params);
+	if (IS_ERR(rq->page_pool)) {
+		kfree(rq->wqe.frag_info);
+		err = PTR_ERR(rq->page_pool);
+		rq->page_pool = NULL;
+		goto err_rq_wq_destroy;
 	}
+	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+					 MEM_TYPE_PAGE_POOL, rq->page_pool);
+	if (err)
+		goto err_rq_wq_destroy;
 
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
@@ -550,6 +573,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy_rcu(rq->page_pool);
 	mlx5_wq_destroy(&rq->wq_ctrl);
 
 	return err;
@@ -563,6 +588,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 		bpf_prog_put(rq->xdp_prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy_rcu(rq->page_pool);
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 6dcc3e8fbd3e..2ac78b88fc3d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -37,6 +37,7 @@
 #include <linux/bpf_trace.h>
 #include <net/busy_poll.h>
 #include <net/ip6_checksum.h>
+#include <net/page_pool.h>
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -221,7 +222,7 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 	if (mlx5e_rx_cache_get(rq, dma_info))
 		return 0;
 
-	dma_info->page = dev_alloc_pages(rq->buff.page_order);
+	dma_info->page = page_pool_dev_alloc_pages(rq->page_pool);
 	if (unlikely(!dma_info->page))
 		return -ENOMEM;
 
@@ -246,11 +247,16 @@ static inline void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle)
 {
-	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
-		return;
+	if (likely(recycle)) {
+		if (mlx5e_rx_cache_put(rq, dma_info))
+			return;
 
-	mlx5e_page_dma_unmap(rq, dma_info);
-	put_page(dma_info->page);
+		mlx5e_page_dma_unmap(rq, dma_info);
+		page_pool_recycle_direct(rq->page_pool, dma_info->page);
+	} else {
+		mlx5e_page_dma_unmap(rq, dma_info);
+		put_page(dma_info->page);
+	}
 }
 
 static inline bool mlx5e_page_reuse(struct mlx5e_rq *rq,
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 8bf84d2a6a60..9706ff78a4c9 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -78,7 +78,7 @@ struct page_pool *page_pool_create(const struct page_pool_params *params)
 
 	if (params->size < offsetof(struct page_pool_params, nid)) {
 		WARN(1, "Fix page_pool_params->size code\n");
-		return NULL;
+		return ERR_PTR(-EBADR);
 	}
 
 	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);


* [bpf-next V2 PATCH 14/15] xdp: transition into using xdp_frame for return API
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (12 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 13:08 ` [bpf-next V2 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit Jesper Dangaard Brouer
  2018-03-08 15:03 ` [bpf-next V2 PATCH 00/15] XDP redirect memory return API Alexander Duyck
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Changing the xdp_return_frame() API to take a struct xdp_frame as
argument seems like a natural choice.  But there are some subtle
performance details here that need extra care, and the handling below
is a deliberate choice.

Dereferencing the xdp_frame on a remote CPU during DMA-TX completion
results in the cache-line changing to "Shared" state.  Later, when the
page is reused for RX, this xdp_frame cache-line is written, which
changes the state to "Modified".

This situation already happens (naturally) for virtio_net, tun and
cpumap, as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs.  Thus, the only option is to dereference
the xdp_frame.

It is only the ixgbe driver that had an optimization that could avoid
dereferencing the xdp_frame.  The driver already has a TX-ring queue,
which (in case of remote DMA-TX completion) has to be transferred
between CPUs anyhow.  In this data area, we stored a struct
xdp_mem_info and a data pointer, which allowed us to avoid
dereferencing the xdp_frame.

To compensate for this, a prefetchw is used to tell the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate for the change in the ixgbe
driver.
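
For context on why the xdp_frame cache-line gets written at all: the
convert_to_xdp_frame() helper places the frame metadata inside the
packet's own headroom, at data_hard_start.  In essence it does the
following (simplified sketch; the real helper also records headroom
and metasize):

	struct xdp_frame *xdp_frame = xdp->data_hard_start;

	/* Guard: enough headroom to store the metadata? */
	if (unlikely(xdp->data - xdp->data_hard_start < sizeof(*xdp_frame)))
		return NULL;

	xdp_frame->data = xdp->data;
	xdp_frame->len  = xdp->data_end - xdp->data;
	xdp_frame->mem  = xdp->rxq->mem; /* consumed by xdp_return_frame() */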

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h        |    4 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   17 +++++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    1 +
 drivers/net/tun.c                               |    4 ++--
 drivers/net/virtio_net.c                        |    2 +-
 include/net/xdp.h                               |    2 +-
 kernel/bpf/cpumap.c                             |    6 +++---
 net/core/xdp.c                                  |    4 +++-
 8 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index cbc20f199364..dfbc15a45cb4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -240,8 +240,7 @@ struct ixgbe_tx_buffer {
 	unsigned long time_stamp;
 	union {
 		struct sk_buff *skb;
-		/* XDP uses address ptr on irq_clean */
-		void *data;
+		struct xdp_frame *xdpf;
 	};
 	unsigned int bytecount;
 	unsigned short gso_segs;
@@ -249,7 +248,6 @@ struct ixgbe_tx_buffer {
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
 	u32 tx_flags;
-	struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ff069597fccf..e6e9b28ecfba 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 		/* free the skb */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			xdp.data_hard_start = xdp.data -
 					      ixgbe_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
+			prefetchw(xdp.data_hard_start); /* xdp_frame write */
 
 			skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
 		}
@@ -5787,7 +5788,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
 
 		/* Free all the Tx ring sk_buffs */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8333,16 +8334,21 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
+	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	len = xdp->data_end - xdp->data;
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
 		return IXGBE_XDP_CONSUMED;
 
-	dma = dma_map_single(ring->dev, xdp->data, len, DMA_TO_DEVICE);
+	dma = dma_map_single(ring->dev, xdpf->data, len, DMA_TO_DEVICE);
 	if (dma_mapping_error(ring->dev, dma))
 		return IXGBE_XDP_CONSUMED;
 
@@ -8357,8 +8363,7 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 
 	dma_unmap_len_set(tx_buffer, len, len);
 	dma_unmap_addr_set(tx_buffer, dma, dma);
-	tx_buffer->data = xdp->data;
-	tx_buffer->xdp_mem = xdp->rxq->mem;
+	tx_buffer->xdpf = xdpf;
 
 	tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 2ac78b88fc3d..00b9b13d9fea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -896,6 +896,7 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 				      di->addr + wi->offset,
 				      0, frag_size,
 				      DMA_FROM_DEVICE);
+	prefetchw(va); /* xdp_frame data area */
 	prefetch(data);
 	wi->offset += frag_size;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index cd046cf31b77..fedf01ca3e51 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -663,7 +663,7 @@ static void tun_ptr_free(void *ptr)
 	if (tun_is_xdp_frame(ptr)) {
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		__skb_array_destroy_skb(ptr);
 	}
@@ -2189,7 +2189,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
 		ret = tun_put_user_xdp(tun, tfile, xdpf, to);
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		struct sk_buff *skb = ptr;
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6c4220450506..9fdb70fe74fe 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -430,7 +430,7 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 
 	/* Free up any pending old buffers before queueing new ones. */
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
-		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
+		xdp_return_frame(xdpf_sent);
 
 	xdpf = convert_to_xdp_frame(xdp);
 	if (unlikely(!xdpf))
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 26eb645c699a..ca023a1fd95b 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -106,7 +106,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem);
+void xdp_return_frame(struct xdp_frame *xdpf);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index bcdc4dea5ce7..c95b04ec103e 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -219,7 +219,7 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 
 	while ((xdpf = ptr_ring_consume(ring)))
 		if (WARN_ON_ONCE(xdpf))
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 }
 
 static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
@@ -275,7 +275,7 @@ static int cpu_map_kthread_run(void *data)
 
 			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
-				xdp_return_frame(xdpf->data, &xdpf->mem);
+				xdp_return_frame(xdpf);
 				continue;
 			}
 
@@ -578,7 +578,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 		err = __ptr_ring_produce(q, xdpf);
 		if (err) {
 			drops++;
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 		}
 		processed++;
 	}
diff --git a/net/core/xdp.c b/net/core/xdp.c
index fe8e87abc266..6ed3d73a73be 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -294,9 +294,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+void xdp_return_frame(struct xdp_frame *xdpf)
 {
 	struct xdp_mem_allocator *xa = NULL;
+	struct xdp_mem_info *mem = &xdpf->mem;
+	void *data = xdpf->data;
 
 	rcu_read_lock();
 	if (mem->id)


* [bpf-next V2 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (13 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 14/15] xdp: transition into using xdp_frame for return API Jesper Dangaard Brouer
@ 2018-03-08 13:08 ` Jesper Dangaard Brouer
  2018-03-08 15:03 ` [bpf-next V2 PATCH 00/15] XDP redirect memory return API Alexander Duyck
  15 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 13:08 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Change the ndo_xdp_xmit API to take a struct xdp_frame instead of a
struct xdp_buff.  This brings xdp_return_frame and ndo_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.
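
Because an xdp_frame is self-contained (the data pointer, len and mem
info travel with it), it can be queued in arrays and ptr_rings, which
a stack-allocated xdp_buff cannot.  Purely as an illustration of where
this is heading, a bulk variant could eventually look something like
this (hypothetical signature, not part of this patch):

	int (*ndo_xdp_xmit_bulk)(struct net_device *dev,
				 struct xdp_frame **frames, int n);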

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +++++++++++----------
 drivers/net/tun.c                             |   11 +----------
 drivers/net/virtio_net.c                      |   24 ++++++++++++++----------
 include/linux/netdevice.h                     |    4 ++--
 net/core/filter.c                             |   17 +++++++++++++++--
 5 files changed, 43 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e6e9b28ecfba..f78096ed4c86 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2252,7 +2252,7 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring *rx_ring,
 #define IXGBE_XDP_TX 2
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp);
+			       struct xdp_frame *xdpf);
 
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 				     struct ixgbe_ring *rx_ring,
@@ -2260,6 +2260,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 {
 	int err, result = IXGBE_XDP_PASS;
 	struct bpf_prog *xdp_prog;
+	struct xdp_frame *xdpf;
 	u32 act;
 
 	rcu_read_lock();
@@ -2273,7 +2274,12 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		result = ixgbe_xmit_xdp_ring(adapter, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf)) {
+			result = IXGBE_XDP_CONSUMED;
+			break;
+		}
+		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
 		break;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
@@ -8329,20 +8335,15 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 }
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
-	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
@@ -9995,7 +9996,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
-static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
 	struct ixgbe_ring *ring;
@@ -10011,7 +10012,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (unlikely(!ring))
 		return -ENXIO;
 
-	err = ixgbe_xmit_xdp_ring(adapter, xdp);
+	err = ixgbe_xmit_xdp_ring(adapter, xdpf);
 	if (err != IXGBE_XDP_TX)
 		return -ENOSPC;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index fedf01ca3e51..16ccaf614cf3 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1287,22 +1287,13 @@ static const struct net_device_ops tun_netdev_ops = {
 	.ndo_get_stats64	= tun_net_get_stats64,
 };
 
-static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int tun_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
 {
 	struct tun_struct *tun = netdev_priv(dev);
-	struct xdp_frame *frame;
 	struct tun_file *tfile;
 	u32 numqueues;
 	int ret = 0;
 
-	/* FIXME: Explain why this check is the needed! */
-	if (unlikely(tun_is_xdp_frame(xdp)))
-		return -EBADRQC;
-
-	frame = convert_to_xdp_frame(xdp);
-	if (unlikely(!frame))
-		return -EOVERFLOW;
-
 	rcu_read_lock();
 
 	numqueues = READ_ONCE(tun->numqueues);
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9fdb70fe74fe..aacefc260cdf 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -416,10 +416,10 @@ static void virtnet_xdp_flush(struct net_device *dev)
 }
 
 static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
-			       struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
-	struct xdp_frame *xdpf, *xdpf_sent;
+	struct xdp_frame *xdpf_sent;
 	struct send_queue *sq;
 	unsigned int len;
 	unsigned int qp;
@@ -432,10 +432,6 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
 		xdp_return_frame(xdpf_sent);
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	/* virtqueue want to use data area in-front of packet */
 	if (unlikely(xdpf->metasize > 0))
 		return -EOPNOTSUPP;
@@ -460,7 +456,7 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 	return true;
 }
 
-static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct receive_queue *rq = vi->rq;
@@ -474,7 +470,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (!xdp_prog)
 		return -ENXIO;
 
-	sent = __virtnet_xdp_xmit(vi, xdp);
+	sent = __virtnet_xdp_xmit(vi, xdpf);
 	if (!sent)
 		return -ENOSPC;
 	return 0;
@@ -575,6 +571,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
 		struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset;
+		struct xdp_frame *xdpf;
 		struct xdp_buff xdp;
 		void *orig_data;
 		u32 act;
@@ -617,7 +614,10 @@ static struct sk_buff *receive_small(struct net_device *dev,
 			delta = orig_data - xdp.data;
 			break;
 		case XDP_TX:
-			sent = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			sent = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(!sent)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				goto err_xdp;
@@ -709,6 +709,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
+		struct xdp_frame *xdpf;
 		struct page *xdp_page;
 		struct xdp_buff xdp;
 		void *data;
@@ -773,7 +774,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			}
 			break;
 		case XDP_TX:
-			sent = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			sent = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(!sent)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				if (unlikely(xdp_page != page))
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dbe6344b727a..775e16827df3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1155,7 +1155,7 @@ struct dev_ifalias {
  *	This function is used to set or query state related to XDP on the
  *	netdevice and manage BPF offload. See definition of
  *	enum bpf_netdev_command for details.
- * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
+ * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdp);
  *	This function is used to submit a XDP packet for transmit on a
  *	netdevice.
  * void (*ndo_xdp_flush)(struct net_device *dev);
@@ -1346,7 +1346,7 @@ struct net_device_ops {
 	int			(*ndo_bpf)(struct net_device *dev,
 					   struct netdev_bpf *bpf);
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
-						struct xdp_buff *xdp);
+						struct xdp_frame *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 33edfa8372fd..25a8980f9d33 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2528,13 +2528,18 @@ static int __bpf_tx_xdp(struct net_device *dev,
 			struct xdp_buff *xdp,
 			u32 index)
 {
+	struct xdp_frame *xdpf;
 	int err;
 
 	if (!dev->netdev_ops->ndo_xdp_xmit) {
 		return -EOPNOTSUPP;
 	}
 
-	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 	if (err)
 		return err;
 	dev->netdev_ops->ndo_xdp_flush(dev);
@@ -2550,11 +2555,19 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
 		struct net_device *dev = fwd;
+		struct xdp_frame *xdpf;
 
 		if (!dev->netdev_ops->ndo_xdp_xmit)
 			return -EOPNOTSUPP;
 
-		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf))
+			return -EOVERFLOW;
+
+		/* TODO: move to inside map code instead, for bulk support
+		 * err = dev_map_enqueue(dev, xdp);
+		 */
+		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);


* Re: [bpf-next V2 PATCH 00/15] XDP redirect memory return API
  2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (14 preceding siblings ...)
  2018-03-08 13:08 ` [bpf-next V2 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit Jesper Dangaard Brouer
@ 2018-03-08 15:03 ` Alexander Duyck
  2018-03-08 15:30   ` Jesper Dangaard Brouer
  15 siblings, 1 reply; 38+ messages in thread
From: Alexander Duyck @ 2018-03-08 15:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

On Thu, Mar 8, 2018 at 5:07 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> This patchset works towards supporting different XDP RX-ring memory
> allocators.  As this will be needed by the AF_XDP zero-copy mode.
>
> The patchset uses mlx5 as the sample driver, which gets implemented
> XDP_REDIRECT RX-mode, but not ndo_xdp_xmit (as this API is subject to
> change thought the patchset).
>
> A new struct xdp_frame is introduced (modeled after cpumap xdp_pkt).
> And both ndo_xdp_xmit and the new xdp_return_frame end-up using this.
>
> Support for a driver supplied allocator is implemented, and a
> refurbished version of page_pool is the first return allocator type
> introduced.  This will be a integration point for AF_XDP zero-copy.
>
> The mlx5 driver evolve into using the page_pool, and see a performance
> increase (with ndo_xdp_xmit out ixgbe driver) from 6Mpps to 12Mpps.

So one question I would have is what effect does your patch set have
on packets moving in the opposite direction, or ixgbe to ixgbe. My
concern is how much of a hit we are taking in ixgbe to support a more
generic solution.

> The patchset stop at the 15 patches limit, but more API changes are
> planned.  Specifically extending ndo_xdp_xmit and xdp_return_frame
> APIs to support bulking.  As this will address some known limits.
>
> V2: Updated according to Tariq's feedback

I am assuming you are submitting this as an RFC, aren't you? If not,
you have some general quality issues to work on. I saw c99 style "//"
comments in patch 2, for example:
+struct xdp_mem_info {
+       u32 type; /* enum mem_type, but known size type */
+       u32 id; // Needed later (to lookup struct xdp_rxq_info)
+};
+

You should probably just go with the standard "/* */" style comments
and be consistent.

There are also "FIXME" comments in patch 6, either drop them entirely
so you don't need to remove them in patch 15, or address the issue of
explaining why you needed the call.

- Alex


* Re: [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-08 13:08 ` [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API Jesper Dangaard Brouer
@ 2018-03-08 15:16   ` Jesper Dangaard Brouer
  2018-03-09  7:16     ` Jason Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 15:16 UTC (permalink / raw)
  To: BjörnTöpel, magnus.karlsson, Jason Wang
  Cc: netdev, eugenia, John Fastabend, Eran Ben Elisha, Saeed Mahameed,
	galp, Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

Hi Jason,

Please see below FIXME, which is actually a question to you.

On Thu, 08 Mar 2018 14:08:11 +0100 Jesper Dangaard Brouer <brouer@redhat.com> wrote:

> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 475088f947bb..cd046cf31b77 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
[...]

> @@ -1290,17 +1290,18 @@ static const struct net_device_ops tun_netdev_ops = {
>  static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
>  {
>  	struct tun_struct *tun = netdev_priv(dev);
> -	struct xdp_buff *buff = xdp->data_hard_start;
> -	int headroom = xdp->data - xdp->data_hard_start;
> +	struct xdp_frame *frame;
>  	struct tun_file *tfile;
>  	u32 numqueues;
>  	int ret = 0;
>  
> -	/* Assure headroom is available and buff is properly aligned */
> -	if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
> -		return -ENOSPC;
> +	/* FIXME: Explain why this check is the needed! */
> +	if (unlikely(tun_is_xdp_frame(xdp)))
> +		return -EBADRQC;
>  
> -	*buff = *xdp;
> +	frame = convert_to_xdp_frame(xdp);
> +	if (unlikely(!frame))
> +		return -EOVERFLOW;

To Jason, in the FIXME, I'm inheriting a check you put in, but I don't
understand why this check is needed.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [bpf-next V2 PATCH 00/15] XDP redirect memory return API
  2018-03-08 15:03 ` [bpf-next V2 PATCH 00/15] XDP redirect memory return API Alexander Duyck
@ 2018-03-08 15:30   ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-08 15:30 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan, brouer

On Thu, 8 Mar 2018 07:03:07 -0800
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Thu, Mar 8, 2018 at 5:07 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > This patchset works towards supporting different XDP RX-ring memory
> > allocators.  As this will be needed by the AF_XDP zero-copy mode.
> >
> > The patchset uses mlx5 as the sample driver, which gets implemented
> > XDP_REDIRECT RX-mode, but not ndo_xdp_xmit (as this API is subject to
> > change thought the patchset).
> >
> > A new struct xdp_frame is introduced (modeled after cpumap xdp_pkt).
> > And both ndo_xdp_xmit and the new xdp_return_frame end-up using this.
> >
> > Support for a driver supplied allocator is implemented, and a
> > refurbished version of page_pool is the first return allocator type
> > introduced.  This will be a integration point for AF_XDP zero-copy.
> >
> > The mlx5 driver evolve into using the page_pool, and see a performance
> > increase (with ndo_xdp_xmit out ixgbe driver) from 6Mpps to 12Mpps.  
> 
> So one question I would have is what effect does your patch set have
> on packets moving in the opposite direction, or ixgbe to ixgbe. My
> concern is how much of a hit we are taking in ixgbe to support a more
> generic solution.

The overhead for ixgbe is basically close to zero.  Notice that the
MEM_TYPE_PAGE_SHARED type (which ixgbe uses) does not get allocated an
ID, thus it avoids doing any hash lookup, and basically just calls
page_frag_free():

	if (mem->type == MEM_TYPE_PAGE_SHARED) {
		page_frag_free(data);
		return;
	}

> > The patchset stop at the 15 patches limit, but more API changes are
> > planned.  Specifically extending ndo_xdp_xmit and xdp_return_frame
> > APIs to support bulking.  As this will address some known limits.
> >
> > V2: Updated according to Tariq's feedback  
> 
> I am assuming you are submitting this as an RFC aren't you? If not you
> have some general quality issues to work on. I saw c99 style "//"
> comments in patch 2, for example:
> +struct xdp_mem_info {
> +       u32 type; /* enum mem_type, but known size type */
> +       u32 id; // Needed later (to lookup struct xdp_rxq_info)
> +};
> +
> 
> You should probably just go with the standard "/* */" style comments
> and be consistent.

This comment style is basically a reminder to myself (and it gets
removed later).  I guess I should have removed it completely before I
submitted. Note, I've already done 3 versions of the pre-RFC patchset
(offlist), so I hope this patchset can soon be considered non-RFC
material.

> There are also "FIXME" comments in patch 6, either drop them entirely
> so you don't need to remove them in patch 15, or address the issue of
> explaining why you needed the call.

This was actually a question to Jason Wang... I pointed it out explicitly.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-08 15:16   ` Jesper Dangaard Brouer
@ 2018-03-09  7:16     ` Jason Wang
  2018-03-09  8:46       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Wang @ 2018-03-09  7:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, BjörnTöpel, magnus.karlsson
  Cc: netdev, eugenia, John Fastabend, Eran Ben Elisha, Saeed Mahameed,
	galp, Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-08 23:16, Jesper Dangaard Brouer wrote:
> Hi Jason,
>
> Please see below FIXME, which is actually a question to you.
>
> On Thu, 08 Mar 2018 14:08:11 +0100 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 475088f947bb..cd046cf31b77 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
> [...]
>
>> @@ -1290,17 +1290,18 @@ static const struct net_device_ops tun_netdev_ops = {
>>   static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
>>   {
>>   	struct tun_struct *tun = netdev_priv(dev);
>> -	struct xdp_buff *buff = xdp->data_hard_start;
>> -	int headroom = xdp->data - xdp->data_hard_start;
>> +	struct xdp_frame *frame;
>>   	struct tun_file *tfile;
>>   	u32 numqueues;
>>   	int ret = 0;
>>   
>> -	/* Assure headroom is available and buff is properly aligned */
>> -	if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
>> -		return -ENOSPC;
>> +	/* FIXME: Explain why this check is the needed! */
>> +	if (unlikely(tun_is_xdp_frame(xdp)))
>> +		return -EBADRQC;
>>   
>> -	*buff = *xdp;
>> +	frame = convert_to_xdp_frame(xdp);
>> +	if (unlikely(!frame))
>> +		return -EOVERFLOW;
> To Jason, in the FIXME, I'm inheriting a check you put in, but I don't
> understand why this check was needed?
>

Sorry for the late reply.

I think it was used to make sure we do not use a misaligned or invalid
pointer that the caller passed to us.
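
For reference, the tagging scheme is just the low bit of the pointer
(a sketch based on the existing tun code; heap pointers are at least
word-aligned, so bit 0 is free to use as a tag):

	#define TUN_XDP_FLAG 0x1UL

	/* Tag XDP buffers stored in tun's ptr_ring, so the consumer can
	 * tell them apart from sk_buff pointers sharing the same ring.
	 */
	static bool tun_is_xdp_buff(void *ptr)
	{
		return (unsigned long)ptr & TUN_XDP_FLAG;
	}

	static void *tun_xdp_to_ptr(void *ptr)
	{
		return (void *)((unsigned long)ptr | TUN_XDP_FLAG);
	}

	static void *tun_ptr_to_xdp(void *ptr)
	{
		return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG);
	}

So a pointer arriving at tun_xdp_xmit() with the tag bit already set
cannot be a valid, properly aligned xdp_buff from a caller.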

Thanks

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
  2018-03-08 13:07 ` [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap Jesper Dangaard Brouer
@ 2018-03-09  7:24   ` Jason Wang
  2018-03-09  9:35     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Wang @ 2018-03-09  7:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-08 21:07, Jesper Dangaard Brouer wrote:
> Introduce an xdp_return_frame API, and convert over cpumap as
> the first user, given it has a queued XDP frame structure to leverage.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>   include/net/xdp.h   |   32 +++++++++++++++++++++++++++
>   kernel/bpf/cpumap.c |   60 +++++++++++++++++++++++++++++++--------------------
>   net/core/xdp.c      |   18 +++++++++++++++
>   3 files changed, 86 insertions(+), 24 deletions(-)
>
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index b2362ddfa694..3cb726a6dc5b 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -33,16 +33,48 @@
>    * also mandatory during RX-ring setup.
>    */
>   
> +enum mem_type {
> +	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
> +	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
> +	// Possible new ideas for types:
> +	// MEM_TYPE_PAGE_POOL,    /* Will be added later  */
> +	// MEM_TYPE_AF_XDP,
> +	// MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo?
> +	MEM_TYPE_MAX,
> +};

So if we plan to support dev->ndo, it looks to me like two types, AF_XDP
and DEVICE_OWNED, are sufficient? The driver can do what it wants (e.g.
page pool or ordinary page allocator) in the ndo or whatever other callbacks.

> +
> +struct xdp_mem_info {
> +	u32 type; /* enum mem_type, but known size type */
> +	u32 id; // Needed later (to lookup struct xdp_rxq_info)
> +};
> +
>   struct xdp_rxq_info {
>   	struct net_device *dev;
>   	u32 queue_index;
>   	u32 reg_state;
> +	struct xdp_mem_info mem;
>   } ____cacheline_aligned; /* perf critical, avoid false-sharing */
>   
> +
> +static inline
> +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> +{
> +	if (mem->type == MEM_TYPE_PAGE_SHARED)
> +		page_frag_free(data);
> +
> +	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> +		struct page *page = virt_to_page(data); /* Assumes order0 page*/
> +
> +		put_page(page);
> +	}
> +}
> +
>   int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
>   		     struct net_device *dev, u32 queue_index);
>   void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
>   void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
>   bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
> +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> +			       enum mem_type type, void *allocator);
>   
>   #endif /* __LINUX_NET_XDP_H__ */
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index a4bb0b34375a..3e4bbcbe3e86 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -19,6 +19,7 @@
>   #include <linux/bpf.h>
>   #include <linux/filter.h>
>   #include <linux/ptr_ring.h>
> +#include <net/xdp.h>
>   
>   #include <linux/sched.h>
>   #include <linux/workqueue.h>
> @@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
>   	return ERR_PTR(err);
>   }
>   
> -static void __cpu_map_queue_destructor(void *ptr)
> -{
> -	/* The tear-down procedure should have made sure that queue is
> -	 * empty.  See __cpu_map_entry_replace() and work-queue
> -	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
> -	 * gracefully and warn once.
> -	 */
> -	if (WARN_ON_ONCE(ptr))
> -		page_frag_free(ptr);
> -}
> -
> -static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> -{
> -	if (atomic_dec_and_test(&rcpu->refcnt)) {
> -		/* The queue should be empty at this point */
> -		ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
> -		kfree(rcpu->queue);
> -		kfree(rcpu);
> -	}
> -}
> -
>   static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
>   {
>   	atomic_inc(&rcpu->refcnt);
> @@ -188,6 +168,10 @@ struct xdp_pkt {
>   	u16 len;
>   	u16 headroom;
>   	u16 metasize;
> +	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> +	 * while mem info is valid on remote CPU.
> +	 */

Can we simply move the xdp_mem_info to xdp_buff to avoid conversion?

Thanks

> +	struct xdp_mem_info mem;
>   	struct net_device *dev_rx;
>   };
>   
> @@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
>   	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
>   	xdp_pkt->metasize = metasize;
>   
> +	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
> +	xdp_pkt->mem = xdp->rxq->mem;
> +
>   	return xdp_pkt;
>   }
>   
> @@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
>   	return skb;
>   }
>   
> +static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
> +{
> +	/* The tear-down procedure should have made sure that queue is
> +	 * empty.  See __cpu_map_entry_replace() and work-queue
> +	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
> +	 * gracefully and warn once.
> +	 */
> +	struct xdp_pkt *xdp_pkt;
> +
> +	while ((xdp_pkt = ptr_ring_consume(ring)))
> +		if (WARN_ON_ONCE(xdp_pkt))
> +			xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
> +}
> +
> +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> +{
> +	if (atomic_dec_and_test(&rcpu->refcnt)) {
> +		/* The queue should be empty at this point */
> +		__cpu_map_ring_cleanup(rcpu->queue);
> +		ptr_ring_cleanup(rcpu->queue, NULL);
> +		kfree(rcpu->queue);
> +		kfree(rcpu);
> +	}
> +}
> +
>   static int cpu_map_kthread_run(void *data)
>   {
>   	struct bpf_cpu_map_entry *rcpu = data;
> @@ -307,7 +319,7 @@ static int cpu_map_kthread_run(void *data)
>   
>   			skb = cpu_map_build_skb(rcpu, xdp_pkt);
>   			if (!skb) {
> -				page_frag_free(xdp_pkt);
> +				xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
>   				continue;
>   			}
>   
> @@ -604,13 +616,13 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
>   	spin_lock(&q->producer_lock);
>   
>   	for (i = 0; i < bq->count; i++) {
> -		void *xdp_pkt = bq->q[i];
> +		struct xdp_pkt *xdp_pkt = bq->q[i];
>   		int err;
>   
>   		err = __ptr_ring_produce(q, xdp_pkt);
>   		if (err) {
>   			drops++;
> -			page_frag_free(xdp_pkt); /* Free xdp_pkt */
> +			xdp_return_frame(xdp_pkt->data, &xdp_pkt->mem);
>   		}
>   		processed++;
>   	}
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 097a0f74e004..9eee0c431126 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -71,3 +71,21 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
>   	return (xdp_rxq->reg_state == REG_STATE_REGISTERED);
>   }
>   EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
> +
> +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> +			       enum mem_type type, void *allocator)
> +{
> +	if (type >= MEM_TYPE_MAX)
> +		return -EINVAL;
> +
> +	xdp_rxq->mem.type = type;
> +
> +	if (allocator)
> +		return -EOPNOTSUPP;
> +
> +	/* TODO: Allocate an ID that maps to allocator pointer
> +	 * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
> +	 */
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 07/15] virtio_net: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-08 13:08 ` [bpf-next V2 PATCH 07/15] virtio_net: " Jesper Dangaard Brouer
@ 2018-03-09  8:03   ` Jason Wang
  2018-03-09  9:44     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Wang @ 2018-03-09  8:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-08 21:08, Jesper Dangaard Brouer wrote:
> The virtio_net driver assumes XDP frames are always released based on
> page refcnt (via put_page).  Thus, it only queues the XDP data pointer
> address and uses virt_to_head_page() to retrieve struct page.
>
> Use the XDP return API to get away from such assumptions. Instead
> queue an xdp_frame, which allows us to use the xdp_return_frame API,
> when releasing the frame.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>   drivers/net/virtio_net.c |   31 +++++++++++++++++++++----------
>   1 file changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 23374603e4d9..6c4220450506 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -419,30 +419,41 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
>   			       struct xdp_buff *xdp)
>   {
>   	struct virtio_net_hdr_mrg_rxbuf *hdr;
> -	unsigned int len;
> +	struct xdp_frame *xdpf, *xdpf_sent;
>   	struct send_queue *sq;
> +	unsigned int len;
>   	unsigned int qp;
> -	void *xdp_sent;
>   	int err;
>   
>   	qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
>   	sq = &vi->sq[qp];
>   
>   	/* Free up any pending old buffers before queueing new ones. */
> -	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> -		struct page *sent_page = virt_to_head_page(xdp_sent);
> +	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
> +		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
>   
> -		put_page(sent_page);
> -	}
> +	xdpf = convert_to_xdp_frame(xdp);
> +	if (unlikely(!xdpf))
> +		return -EOVERFLOW;
> +
> +	/* virtqueue want to use data area in-front of packet */
> +	if (unlikely(xdpf->metasize > 0))
> +		return -EOPNOTSUPP;
> +
> +	if (unlikely(xdpf->headroom < vi->hdr_len))
> +		return -EOVERFLOW;

I think this needs another independent patch. For now we can simply drop
the packet, but we probably need to think more to address it completely,
e.g. allocating the header part either dynamically or statically.

>   
> -	xdp->data -= vi->hdr_len;
> +	/* Make room for virtqueue hdr (also change xdpf->headroom?) */
> +	xdpf->data -= vi->hdr_len;
>   	/* Zero header and leave csum up to XDP layers */
> -	hdr = xdp->data;
> +	hdr = xdpf->data;
>   	memset(hdr, 0, vi->hdr_len);
> +	hdr->hdr.hdr_len = xdpf->len; /* Q: is this needed? */

Maybe another patch, but not a must; hdr_len is a hint for the linear part
of the skb used in the host. Backend implementations may simply ignore this value.

Thanks

> +	xdpf->len   += vi->hdr_len;
>   
> -	sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
> +	sg_init_one(sq->sg, xdpf->data, xdpf->len);
>   
> -	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC);
> +	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
>   	if (unlikely(err))
>   		return false; /* Caller handle free/refcnt */
>   
>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-08 13:08 ` [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping Jesper Dangaard Brouer
@ 2018-03-09  8:08   ` Jason Wang
  2018-03-09  9:37     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Wang @ 2018-03-09  8:08 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-08 21:08, Jesper Dangaard Brouer wrote:
> Use the IDA infrastructure for getting a cyclic increasing ID number,
> that is used for keeping track of each registered allocator per
> RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
> uses a radix tree, use a dynamic rhashtable, for creating ID to
> pointer lookup table, because this is faster.
>
> The problem that is being solved here is that the xdp_rxq_info
> pointer (stored in xdp_buff) cannot be used directly, as the
> guaranteed lifetime is too short.  The info is needed on a
> (potentially) remote CPU during DMA-TX completion time.  In an
> xdp_frame the xdp_mem_info is stored, when it got converted from an
> xdp_buff, which is sufficient for the simple page refcnt based recycle
> schemes.
>
> For more advanced allocators there is a need to store a pointer to the
> registered allocator.  Thus, there is a need to guard the lifetime or
> validity of the allocator pointer, which is done through this
> rhashtable ID map to pointer. The removal and validity of the
> allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
> allocator will be created by the driver, and registered with
> xdp_rxq_info_reg_mem_model().
>
> It is up for debate who is responsible for freeing the allocator
> pointer or invoking the allocator destructor function.  In any case,
> this must happen via RCU freeing.
>
> Signed-off-by: Jesper Dangaard Brouer<brouer@redhat.com>

A stupid question is, can we manage to unify this ID with NAPI id?

Thanks

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-09  7:16     ` Jason Wang
@ 2018-03-09  8:46       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-09  8:46 UTC (permalink / raw)
  To: Jason Wang
  Cc: BjörnTöpel, magnus.karlsson, netdev, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 9 Mar 2018 15:16:35 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2018-03-08 23:16, Jesper Dangaard Brouer wrote:
> > Hi Jason,
> >
> > Please see below FIXME, which is actually a question to you.
> >
> > On Thu, 08 Mar 2018 14:08:11 +0100 Jesper Dangaard Brouer <brouer@redhat.com> wrote:
> >  
> >> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> >> index 475088f947bb..cd046cf31b77 100644
> >> --- a/drivers/net/tun.c
> >> +++ b/drivers/net/tun.c  
> > [...]
> >  
> >> @@ -1290,17 +1290,18 @@ static const struct net_device_ops tun_netdev_ops = {
> >>   static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
> >>   {
> >>   	struct tun_struct *tun = netdev_priv(dev);
> >> -	struct xdp_buff *buff = xdp->data_hard_start;
> >> -	int headroom = xdp->data - xdp->data_hard_start;
> >> +	struct xdp_frame *frame;
> >>   	struct tun_file *tfile;
> >>   	u32 numqueues;
> >>   	int ret = 0;
> >>   
> >> -	/* Assure headroom is available and buff is properly aligned */
> >> -	if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
> >> -		return -ENOSPC;
> >> +	/* FIXME: Explain why this check is the needed! */
> >> +	if (unlikely(tun_is_xdp_frame(xdp)))
> >> +		return -EBADRQC;
> >>   
> >> -	*buff = *xdp;
> >> +	frame = convert_to_xdp_frame(xdp);
> >> +	if (unlikely(!frame))
> >> +		return -EOVERFLOW;  
> > To Jason, in the FIXME, I'm inheriting a check you put in, but I don't
> > understand why this check was needed?
> >  
> 
> Sorry for the late reply.
> 
> I think it was used to make sure we do not use a misaligned or invalid
> pointer that the caller passed to us.

Okay, but I don't think this can happen, thus I'm going to remove this
check in V3.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
  2018-03-09  7:24   ` Jason Wang
@ 2018-03-09  9:35     ` Jesper Dangaard Brouer
  2018-03-09 13:04       ` Jason Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-09  9:35 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 9 Mar 2018 15:24:10 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2018-03-08 21:07, Jesper Dangaard Brouer wrote:
> > Introduce an xdp_return_frame API, and convert over cpumap as
> > the first user, given it has a queued XDP frame structure to leverage.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >   include/net/xdp.h   |   32 +++++++++++++++++++++++++++
> >   kernel/bpf/cpumap.c |   60 +++++++++++++++++++++++++++++++--------------------
> >   net/core/xdp.c      |   18 +++++++++++++++
> >   3 files changed, 86 insertions(+), 24 deletions(-)
> >
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index b2362ddfa694..3cb726a6dc5b 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -33,16 +33,48 @@
> >    * also mandatory during RX-ring setup.
> >    */
> >   
> > +enum mem_type {
> > +	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
> > +	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
> > +	// Possible new ideas for types:
> > +	// MEM_TYPE_PAGE_POOL,    /* Will be added later  */
> > +	// MEM_TYPE_AF_XDP,
> > +	// MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo?
> > +	MEM_TYPE_MAX,
> > +};  
> 
> So if we plan to support dev->ndo, it looks to me like two types, AF_XDP
> and DEVICE_OWNED, are sufficient? The driver can do what it wants (e.g.
> page pool or ordinary page allocator) in the ndo or whatever other callbacks.

So, the design is open to go both ways; we can figure this out later.

The reason I'm not calling page_pool from a driver-level dev->ndo is
that I'm trying to avoid invoking code in a module, as the driver
module code can be (in the process of being) unloaded from the kernel
while another driver is calling xdp_return_frame.  The current design
in patch 10 uses the RCU period to guarantee that the allocator
pointer is valid as long as the ID lookup succeeded.

To do what you propose, we also need to guarantee that the net_device
cannot disappear, so that it is safe to invoke dev->ndo.  To do so, the
driver likely has to add an rcu_barrier before unloading.  I'm also
thinking that for MEM_TYPE_DEVICE_OWNED the allocator pointer ("under
protection") could be the net_device pointer, and we could take a
refcnt on the dev (dev_hold/dev_put). Thus, I think it is doable, but let's
figure this out later.

Another important design consideration is that the xdp core needs to
know how to release memory in case the ID lookup fails.  This is
another argument for keeping the memory release code inside the xdp
core, and not leaving too much freedom to the drivers.
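
For example, the core can always fall back to a safe release when the
lookup fails; a minimal sketch of that split (the lookup helper name is
made up here, and the put_page fallback assumes an order-0 page):

	struct xdp_mem_allocator *xa;
	struct page *page = virt_to_page(data);

	rcu_read_lock();
	xa = xdp_mem_id_lookup(mem->id);  /* hypothetical lookup helper */
	if (likely(xa)) {
		/* Allocator still registered; RCU keeps it valid here,
		 * even if the driver module is being unloaded.
		 */
		page_pool_put_page(xa->page_pool, page);  /* placeholder call */
	} else {
		/* ID already torn down: last-resort safe free */
		put_page(page);
	}
	rcu_read_unlock();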


> > +struct xdp_mem_info {
> > +	u32 type; /* enum mem_type, but known size type */
> > +	u32 id; // Needed later (to lookup struct xdp_rxq_info)
> > +};
> > +
> >   struct xdp_rxq_info {
> >   	struct net_device *dev;
> >   	u32 queue_index;
> >   	u32 reg_state;
> > +	struct xdp_mem_info mem;
> >   } ____cacheline_aligned; /* perf critical, avoid false-sharing */
> >   
> > +
> > +static inline
> > +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> > +{
> > +	if (mem->type == MEM_TYPE_PAGE_SHARED)
> > +		page_frag_free(data);
> > +
> > +	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> > +		struct page *page = virt_to_page(data); /* Assumes order0 page*/
> > +
> > +		put_page(page);
> > +	}
> > +}
> > +
> >   int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
> >   		     struct net_device *dev, u32 queue_index);
> >   void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
> >   void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
> >   bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
> > +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> > +			       enum mem_type type, void *allocator);
> >   
> >   #endif /* __LINUX_NET_XDP_H__ */
> > diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> > index a4bb0b34375a..3e4bbcbe3e86 100644
> > --- a/kernel/bpf/cpumap.c
> > +++ b/kernel/bpf/cpumap.c
[...]
> >   static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> >   {
> >   	atomic_inc(&rcpu->refcnt);
> > @@ -188,6 +168,10 @@ struct xdp_pkt {
> >   	u16 len;
> >   	u16 headroom;
> >   	u16 metasize;
> > +	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> > +	 * while mem info is valid on remote CPU.
> > +	 */  
> 
> Can we simply move the xdp_mem_info to xdp_buff to avoid conversion?

No, xdp_buff is a stack allocated piece of memory, thus we do need a
conversion into another piece of memory.
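
The trick (used by convert_to_xdp_frame in this patchset) is that the
"other piece of memory" is the packet's own headroom, so no extra
allocation is needed.  A simplified sketch, with metadata handling
left out:

	/* Persist the xdp_buff state in the headroom at data_hard_start,
	 * which lives as long as the packet page itself.
	 */
	static inline struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
	{
		struct xdp_frame *xdp_frame;
		int headroom = xdp->data - xdp->data_hard_start;

		/* Assure enough headroom to store the xdp_frame itself */
		if (unlikely(headroom < (int)sizeof(*xdp_frame)))
			return NULL;

		xdp_frame = xdp->data_hard_start; /* store info in top of packet */
		xdp_frame->data     = xdp->data;
		xdp_frame->len      = xdp->data_end - xdp->data;
		xdp_frame->headroom = headroom - sizeof(*xdp_frame);

		/* rxq is only valid in NAPI context; copy out the mem info */
		xdp_frame->mem = xdp->rxq->mem;

		return xdp_frame;
	}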

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-09  8:08   ` Jason Wang
@ 2018-03-09  9:37     ` Jesper Dangaard Brouer
  2018-03-09 13:07       ` Jason Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-09  9:37 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 9 Mar 2018 16:08:58 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2018-03-08 21:08, Jesper Dangaard Brouer wrote:
> > Use the IDA infrastructure for getting a cyclic increasing ID number,
> > that is used for keeping track of each registered allocator per
> > RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
> > uses a radix tree, use a dynamic rhashtable, for creating ID to
> > pointer lookup table, because this is faster.
> >
> > The problem that is being solved here is that the xdp_rxq_info
> > pointer (stored in xdp_buff) cannot be used directly, as the
> > guaranteed lifetime is too short.  The info is needed on a
> > (potentially) remote CPU during DMA-TX completion time.  In an
> > xdp_frame the xdp_mem_info is stored, when it got converted from an
> > xdp_buff, which is sufficient for the simple page refcnt based recycle
> > schemes.
> >
> > For more advanced allocators there is a need to store a pointer to the
> > registered allocator.  Thus, there is a need to guard the lifetime or
> > validity of the allocator pointer, which is done through this
> > rhashtable ID map to pointer. The removal and validity of the
> > allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
> > allocator will be created by the driver, and registered with
> > xdp_rxq_info_reg_mem_model().
> >
> > It is up for debate who is responsible for freeing the allocator
> > pointer or invoking the allocator destructor function.  In any case,
> > this must happen via RCU freeing.
> >
> > Signed-off-by: Jesper Dangaard Brouer<brouer@redhat.com>  
> 
> A stupid question is, can we manage to unify this ID with NAPI id?

Sorry I don't understand the question?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 07/15] virtio_net: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-09  8:03   ` Jason Wang
@ 2018-03-09  9:44     ` Jesper Dangaard Brouer
  2018-03-09 13:11       ` Jason Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-09  9:44 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 9 Mar 2018 16:03:28 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2018-03-08 21:08, Jesper Dangaard Brouer wrote:
> > The virtio_net driver assumes XDP frames are always released based on
> > page refcnt (via put_page).  Thus, it only queues the XDP data pointer
> > address and uses virt_to_head_page() to retrieve struct page.
> >
> > Use the XDP return API to get away from such assumptions. Instead
> > queue an xdp_frame, which allows us to use the xdp_return_frame API,
> > when releasing the frame.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >   drivers/net/virtio_net.c |   31 +++++++++++++++++++++----------
> >   1 file changed, 21 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 23374603e4d9..6c4220450506 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -419,30 +419,41 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
> >   			       struct xdp_buff *xdp)
> >   {
> >   	struct virtio_net_hdr_mrg_rxbuf *hdr;
> > -	unsigned int len;
> > +	struct xdp_frame *xdpf, *xdpf_sent;
> >   	struct send_queue *sq;
> > +	unsigned int len;
> >   	unsigned int qp;
> > -	void *xdp_sent;
> >   	int err;
> >   
> >   	qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
> >   	sq = &vi->sq[qp];
> >   
> >   	/* Free up any pending old buffers before queueing new ones. */
> > -	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> > -		struct page *sent_page = virt_to_head_page(xdp_sent);
> > +	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
> > +		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
> >   
> > -		put_page(sent_page);
> > -	}
> > +	xdpf = convert_to_xdp_frame(xdp);
> > +	if (unlikely(!xdpf))
> > +		return -EOVERFLOW;
> > +
> > +	/* virtqueue want to use data area in-front of packet */
> > +	if (unlikely(xdpf->metasize > 0))
> > +		return -EOPNOTSUPP;
> > +
> > +	if (unlikely(xdpf->headroom < vi->hdr_len))
> > +		return -EOVERFLOW;  
> 
> I think this needs another independent patch. For now we can simply drop
> the packet, but we probably need to think more to address it completely,
> e.g. allocating the header part either dynamically or statically.

Okay, so we can follow up later if we want to handle this case better
than dropping.

> >   
> > -	xdp->data -= vi->hdr_len;
> > +	/* Make room for virtqueue hdr (also change xdpf->headroom?) */
> > +	xdpf->data -= vi->hdr_len;
> >   	/* Zero header and leave csum up to XDP layers */
> > -	hdr = xdp->data;
> > +	hdr = xdpf->data;
> >   	memset(hdr, 0, vi->hdr_len);
> > +	hdr->hdr.hdr_len = xdpf->len; /* Q: is this needed? */  
> 
> Maybe another patch, but not a must; hdr_len is a hint for the linear part
> of the skb used in the host. Backend implementations may simply ignore this value.

So, should I leave it out for now?
Or is it okay to keep it?


> > +	xdpf->len   += vi->hdr_len;
> >   
> > -	sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
> > +	sg_init_one(sq->sg, xdpf->data, xdpf->len);

When _later_ introducing bulking, can we use something other than
sg_init_one() to send/queue multiple raw XDP frames?
(I'm asking as I don't know this "sg_*" API usage.)

> > -	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC);
> > +	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
> >  	if (unlikely(err))
> >   		return false; /* Caller handle free/refcnt */
> >   

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
  2018-03-09  9:35     ` Jesper Dangaard Brouer
@ 2018-03-09 13:04       ` Jason Wang
  2018-03-09 16:05         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Wang @ 2018-03-09 13:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-09 17:35, Jesper Dangaard Brouer wrote:
> On Fri, 9 Mar 2018 15:24:10 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2018-03-08 21:07, Jesper Dangaard Brouer wrote:
>>> Introduce an xdp_return_frame API, and convert over cpumap as
>>> the first user, given it has a queued XDP frame structure to leverage.
>>>
>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>> ---
>>>    include/net/xdp.h   |   32 +++++++++++++++++++++++++++
>>>    kernel/bpf/cpumap.c |   60 +++++++++++++++++++++++++++++++--------------------
>>>    net/core/xdp.c      |   18 +++++++++++++++
>>>    3 files changed, 86 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>> index b2362ddfa694..3cb726a6dc5b 100644
>>> --- a/include/net/xdp.h
>>> +++ b/include/net/xdp.h
>>> @@ -33,16 +33,48 @@
>>>     * also mandatory during RX-ring setup.
>>>     */
>>>    
>>> +enum mem_type {
>>> +	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
>>> +	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
>>> +	// Possible new ideas for types:
>>> +	// MEM_TYPE_PAGE_POOL,    /* Will be added later  */
>>> +	// MEM_TYPE_AF_XDP,
>>> +	// MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo?
>>> +	MEM_TYPE_MAX,
>>> +};
>> So if we plan to support dev->ndo, it looks to me like two types, AF_XDP
>> and DEVICE_OWNED, are sufficient? The driver can do what it wants (e.g.
>> page pool or ordinary page allocator) in the ndo or whatever other callbacks.
> So, the design is open to go both ways; we can figure this out later.
>
> The reason I'm not calling page_pool from a driver-level dev->ndo is
> that I'm trying to avoid invoking code in a module, as the driver
> module code can be (in the process of being) unloaded from the kernel
> while another driver is calling xdp_return_frame.  The current design
> in patch 10 uses the RCU period to guarantee that the allocator
> pointer is valid as long as the ID lookup succeeded.
>
> To do what you propose, we also need to guarantee that the net_device
> cannot disappear, so that it is safe to invoke dev->ndo.  To do so, the
> driver likely has to add an rcu_barrier before unloading.  I'm also
> thinking that for MEM_TYPE_DEVICE_OWNED the allocator pointer ("under
> protection") could be the net_device pointer, and we could take a
> refcnt on the dev (dev_hold/dev_put).

Then it looks like a device (e.g. tun) could lock another device for
an indefinite time.

> Thus, I think it is doable, but let's
> figure this out later.
>
> Another important design consideration is that the xdp core needs to
> know how to release memory in case the ID lookup fails.

Can we simply use put_page() in this case?

>   This is
> another argument for keeping the memory release code inside the xdp
> core, and not leaving too much freedom to the drivers.
>
>
>>> +struct xdp_mem_info {
>>> +	u32 type; /* enum mem_type, but known size type */
>>> +	u32 id; // Needed later (to lookup struct xdp_rxq_info)
>>> +};
>>> +
>>>    struct xdp_rxq_info {
>>>    	struct net_device *dev;
>>>    	u32 queue_index;
>>>    	u32 reg_state;
>>> +	struct xdp_mem_info mem;
>>>    } ____cacheline_aligned; /* perf critical, avoid false-sharing */
>>>    
>>> +
>>> +static inline
>>> +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
>>> +{
>>> +	if (mem->type == MEM_TYPE_PAGE_SHARED)
>>> +		page_frag_free(data);
>>> +
>>> +	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
>>> +		struct page *page = virt_to_page(data); /* Assumes order0 page*/
>>> +
>>> +		put_page(page);
>>> +	}
>>> +}
>>> +
>>>    int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
>>>    		     struct net_device *dev, u32 queue_index);
>>>    void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
>>>    void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
>>>    bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
>>> +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
>>> +			       enum mem_type type, void *allocator);
>>>    
>>>    #endif /* __LINUX_NET_XDP_H__ */
>>> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
>>> index a4bb0b34375a..3e4bbcbe3e86 100644
>>> --- a/kernel/bpf/cpumap.c
>>> +++ b/kernel/bpf/cpumap.c
> [...]
>>>    static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
>>>    {
>>>    	atomic_inc(&rcpu->refcnt);
>>> @@ -188,6 +168,10 @@ struct xdp_pkt {
>>>    	u16 len;
>>>    	u16 headroom;
>>>    	u16 metasize;
>>> +	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
>>> +	 * while mem info is valid on remote CPU.
>>> +	 */
>> Can we simply move the xdp_mem_info to xdp_buff to avoid conversion?
> No, xdp_buff is a stack allocated piece of memory, thus we do need a
> conversion into another piece of memory.
>

Right, but is just duplicating the xdp_buff content sufficient in this
case, or is there anything else I missed?

Thanks

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-09  9:37     ` Jesper Dangaard Brouer
@ 2018-03-09 13:07       ` Jason Wang
  2018-03-09 16:07         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Wang @ 2018-03-09 13:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-09 17:37, Jesper Dangaard Brouer wrote:
> On Fri, 9 Mar 2018 16:08:58 +0800
> Jason Wang<jasowang@redhat.com>  wrote:
>
>> On 2018-03-08 21:08, Jesper Dangaard Brouer wrote:
>>> Use the IDA infrastructure for getting a cyclic increasing ID number,
>>> that is used for keeping track of each registered allocator per
>>> RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
>>> uses a radix tree, use a dynamic rhashtable, for creating ID to
>>> pointer lookup table, because this is faster.
>>>
>>> The problem that is being solved here is that the xdp_rxq_info
>>> pointer (stored in xdp_buff) cannot be used directly, as the
>>> guaranteed lifetime is too short.  The info is needed on a
>>> (potentially) remote CPU during DMA-TX completion time.  In an
>>> xdp_frame the xdp_mem_info is stored, when it got converted from an
>>> xdp_buff, which is sufficient for the simple page refcnt based recycle
>>> schemes.
>>>
>>> For more advanced allocators there is a need to store a pointer to the
>>> registered allocator.  Thus, there is a need to guard the lifetime or
>>> validity of the allocator pointer, which is done through this
>>> rhashtable ID map to pointer. The removal and validity of the
>>> allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
>>> allocator will be created by the driver, and registered with
>>> xdp_rxq_info_reg_mem_model().
>>>
>>> It is up for debate who is responsible for freeing the allocator
>>> pointer or invoking the allocator destructor function.  In any case,
>>> this must happen via RCU freeing.
>>>
>>> Signed-off-by: Jesper Dangaard Brouer<brouer@redhat.com>   
>> A stupid question is, can we manage to unify this ID with NAPI id?
> Sorry I don't understand the question?

I mean, can we associate the page pool pointer with the napi_struct,
record the NAPI id in xdp_mem_info, and do the lookup through the NAPI id?

Thanks

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 07/15] virtio_net: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-09  9:44     ` Jesper Dangaard Brouer
@ 2018-03-09 13:11       ` Jason Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Wang @ 2018-03-09 13:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-09 17:44, Jesper Dangaard Brouer wrote:
> On Fri, 9 Mar 2018 16:03:28 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2018-03-08 21:08, Jesper Dangaard Brouer wrote:
>>> The virtio_net driver assumes XDP frames are always released based on
>>> page refcnt (via put_page).  Thus, it only queues the XDP data pointer
>>> address and uses virt_to_head_page() to retrieve struct page.
>>>
>>> Use the XDP return API to get away from such assumptions. Instead
>>> queue an xdp_frame, which allows us to use the xdp_return_frame API,
>>> when releasing the frame.
>>>
>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>> ---
>>>    drivers/net/virtio_net.c |   31 +++++++++++++++++++++----------
>>>    1 file changed, 21 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>> index 23374603e4d9..6c4220450506 100644
>>> --- a/drivers/net/virtio_net.c
>>> +++ b/drivers/net/virtio_net.c
>>> @@ -419,30 +419,41 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
>>>    			       struct xdp_buff *xdp)
>>>    {
>>>    	struct virtio_net_hdr_mrg_rxbuf *hdr;
>>> -	unsigned int len;
>>> +	struct xdp_frame *xdpf, *xdpf_sent;
>>>    	struct send_queue *sq;
>>> +	unsigned int len;
>>>    	unsigned int qp;
>>> -	void *xdp_sent;
>>>    	int err;
>>>    
>>>    	qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
>>>    	sq = &vi->sq[qp];
>>>    
>>>    	/* Free up any pending old buffers before queueing new ones. */
>>> -	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
>>> -		struct page *sent_page = virt_to_head_page(xdp_sent);
>>> +	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
>>> +		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
>>>    
>>> -		put_page(sent_page);
>>> -	}
>>> +	xdpf = convert_to_xdp_frame(xdp);
>>> +	if (unlikely(!xdpf))
>>> +		return -EOVERFLOW;
>>> +
>>> +	/* virtqueue want to use data area in-front of packet */
>>> +	if (unlikely(xdpf->metasize > 0))
>>> +		return -EOPNOTSUPP;
>>> +
>>> +	if (unlikely(xdpf->headroom < vi->hdr_len))
>>> +		return -EOVERFLOW;
>> I think this needs another independent patch. For now we can simply drop
>> the packet, but we probably need to think more to address it completely,
>> e.g. allocating the header part either dynamically or statically.
> Okay, so we can follow up later if we want to handle this case better
> than dropping.
>
>>>    
>>> -	xdp->data -= vi->hdr_len;
>>> +	/* Make room for virtqueue hdr (also change xdpf->headroom?) */
>>> +	xdpf->data -= vi->hdr_len;
>>>    	/* Zero header and leave csum up to XDP layers */
>>> -	hdr = xdp->data;
>>> +	hdr = xdpf->data;
>>>    	memset(hdr, 0, vi->hdr_len);
>>> +	hdr->hdr.hdr_len = xdpf->len; /* Q: is this needed? */
>> Maybe another patch, but not a must; hdr_len is a hint for the linear part
>> of the skb used in the host. Backend implementations may simply ignore this value.
> So, should I leave it out for now?
> Or is it okay to keep it?
>
>

If you stick to it, you can keep it.

>>> +	xdpf->len   += vi->hdr_len;
>>>    
>>> -	sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
>>> +	sg_init_one(sq->sg, xdpf->data, xdpf->len);
> When _later_ introducing bulking, can we use something other than
> sg_init_one() to send/queue multiple raw XDP frames?
> (I'm asking as I don't know this "sg_*" API usage.)

Looks not, but considering the simplicity of sg_init_one(), I wonder
whether we would see any difference if there were such an API.

It looks to me the actual issue is the virtio API, which does not support
bulking. I can try to extend it if necessary.
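
I.e. without a bulk-add API, "bulking" on the virtio side degenerates
into a plain loop, paying the per-call overhead for every frame; a
hypothetical sketch:

	/* One virtqueue_add_outbuf() call (and descriptor update) per
	 * frame; nothing is amortized across the batch.
	 */
	for (i = 0; i < n; i++) {
		sg_init_one(sq->sg, frames[i]->data, frames[i]->len);
		err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, frames[i],
					   GFP_ATOMIC);
		if (unlikely(err))
			break;	/* caller handles free/refcnt of the rest */
	}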

Thanks

>
>>> -	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC);
>>> +	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
>>>  	if (unlikely(err))
>>>    		return false; /* Caller handle free/refcnt */
>>>    

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
  2018-03-09 13:04       ` Jason Wang
@ 2018-03-09 16:05         ` Jesper Dangaard Brouer
  2018-03-16  8:43           ` Jason Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-09 16:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 9 Mar 2018 21:04:23 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2018-03-09 17:35, Jesper Dangaard Brouer wrote:
> > On Fri, 9 Mar 2018 15:24:10 +0800
> > Jason Wang <jasowang@redhat.com> wrote:
> >  
> >> On 2018-03-08 21:07, Jesper Dangaard Brouer wrote:
> >>> Introduce an xdp_return_frame API, and convert over cpumap as
> >>> the first user, given it has a queued XDP frame structure to leverage.
> >>>
> >>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> >>> ---
> >>>    include/net/xdp.h   |   32 +++++++++++++++++++++++++++
> >>>    kernel/bpf/cpumap.c |   60 +++++++++++++++++++++++++++++++--------------------
> >>>    net/core/xdp.c      |   18 +++++++++++++++
> >>>    3 files changed, 86 insertions(+), 24 deletions(-)
> >>>
> >>> diff --git a/include/net/xdp.h b/include/net/xdp.h
> >>> index b2362ddfa694..3cb726a6dc5b 100644
> >>> --- a/include/net/xdp.h
> >>> +++ b/include/net/xdp.h
> >>> @@ -33,16 +33,48 @@
> >>>     * also mandatory during RX-ring setup.
> >>>     */
> >>>    
> >>> +enum mem_type {
> >>> +	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
> >>> +	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
> >>> +	// Possible new ideas for types:
> >>> +	// MEM_TYPE_PAGE_POOL,    /* Will be added later  */
> >>> +	// MEM_TYPE_AF_XDP,
> >>> +	// MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo?
> >>> +	MEM_TYPE_MAX,
> >>> +};  
> >> So if we plan to support dev->ndo, it looks to me like two types, AF_XDP
> >> and DEVICE_OWNED, are sufficient? The driver can do what it wants (e.g.
> >> page pool or ordinary page allocator) in the ndo or whatever other callbacks.
> > So, the design is open to go both ways; we can figure this out later.
> >
> > The reason I'm not calling page_pool from a driver-level dev->ndo is
> > that I'm trying to avoid invoking code in a module, as the driver
> > module code can be (in the process of being) unloaded from the kernel
> > while another driver is calling xdp_return_frame.  The current design
> > in patch 10 uses the RCU period to guarantee that the allocator
> > pointer is valid as long as the ID lookup succeeded.
> >
> > To do what you propose, we also need to guarantee that the net_device
> > cannot disappear, so that it is safe to invoke dev->ndo.  To do so, the
> > driver likely has to add an rcu_barrier before unloading.  I'm also
> > thinking that for MEM_TYPE_DEVICE_OWNED the allocator pointer ("under
> > protection") could be the net_device pointer, and we could take a
> > refcnt on the dev (dev_hold/dev_put).
> 
> Then it looks like a device (e.g. tun) could lock another device for
> an indefinite time.
> 
> > Thus, I think it is doable, but let's
> > figure this out later.
> >
> > Another important design consideration is that the xdp core needs to
> > know how to release memory in case the ID lookup fails.
> 
> Can we simply use put_page() in this case?

For now, yes. The (future) use case is when pages are kept DMA mapped;
then the page also needs to be DMA unmapped.  To DMA unmap we need to
know the struct device (not the net_device).  (And then the discussion
starts around how to know the struct device is still valid.)
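
A sketch of such a (future) DMA-aware return path -- everything below
is an assumption about a later page_pool extension, not code from this
patchset:

	/* Hypothetical: the pool kept the page DMA mapped, so returning
	 * it must unmap using the struct device saved at map time.
	 */
	static void page_pool_release_mapped_page(struct page_pool *pool,
						  struct page *page)
	{
		dma_addr_t dma = page->dma_addr; /* assumes the pool stored it */

		dma_unmap_page(pool->p.dev, dma, PAGE_SIZE, DMA_BIDIRECTIONAL);
		put_page(page);
	}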


> >   This is
> > another argument for keeping the memory release code inside the xdp
> > core, and not leave too much freedom to the drivers.
> >
> >  
> >>> +struct xdp_mem_info {
> >>> +	u32 type; /* enum mem_type, but known size type */
> >>> +	u32 id; // Needed later (to lookup struct xdp_rxq_info)
> >>> +};
> >>> +
> >>>    struct xdp_rxq_info {
> >>>    	struct net_device *dev;
> >>>    	u32 queue_index;
> >>>    	u32 reg_state;
> >>> +	struct xdp_mem_info mem;
> >>>    } ____cacheline_aligned; /* perf critical, avoid false-sharing */
> >>>    
> >>> +
> >>> +static inline
> >>> +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> >>> +{
> >>> +	if (mem->type == MEM_TYPE_PAGE_SHARED)
> >>> +		page_frag_free(data);
> >>> +
> >>> +	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> >>> +		struct page *page = virt_to_page(data); /* Assumes order0 page*/
> >>> +
> >>> +		put_page(page);
> >>> +	}
> >>> +}
> >>> +
> >>>    int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
> >>>    		     struct net_device *dev, u32 queue_index);
> >>>    void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
> >>>    void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
> >>>    bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
> >>> +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> >>> +			       enum mem_type type, void *allocator);
> >>>    
> >>>    #endif /* __LINUX_NET_XDP_H__ */
> >>> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> >>> index a4bb0b34375a..3e4bbcbe3e86 100644
> >>> --- a/kernel/bpf/cpumap.c
> >>> +++ b/kernel/bpf/cpumap.c  
> > [...]  
> >>>    static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> >>>    {
> >>>    	atomic_inc(&rcpu->refcnt);
> >>> @@ -188,6 +168,10 @@ struct xdp_pkt {
> >>>    	u16 len;
> >>>    	u16 headroom;
> >>>    	u16 metasize;
> >>> +	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> >>> +	 * while mem info is valid on remote CPU.
> >>> +	 */  
> >> Can we simply move the xdp_mem_info to xdp_buff to avoid conversion?  
> > No, xdp_buff is a stack allocated piece of memory, thus we do need a
> > conversion into another piece of memory.
> >  
> 
> Right, but is just duplicating the xdp_buff content sufficient in this
> case, or is there anything else I missed?

That is what I'm doing (just with state compression).


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-09 13:07       ` Jason Wang
@ 2018-03-09 16:07         ` Jesper Dangaard Brouer
  2018-03-16  8:45           ` Jason Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-09 16:07 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 9 Mar 2018 21:07:36 +0800
Jason Wang <jasowang@redhat.com> wrote:

> >>> Use the IDA infrastructure for getting a cyclic increasing ID number,
> >>> that is used for keeping track of each registered allocator per
> >>> RX-queue xdp_rxq_info.
> >>>
> >>> Signed-off-by: Jesper Dangaard Brouer<brouer@redhat.com>     
> >> A stupid question is, can we manage to unify this ID with NAPI id?  
> > Sorry I don't understand the question?  
> 
> I mean can we associate page poll pointer to napi_struct, record NAPI id 
> in xdp_mem_info and do lookup through NAPI id?

No. The driver can unreg/reg a new XDP memory model without reloading
the NAPI and generating a new NAPI id.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
  2018-03-09 16:05         ` Jesper Dangaard Brouer
@ 2018-03-16  8:43           ` Jason Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Wang @ 2018-03-16  8:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-10 00:05, Jesper Dangaard Brouer wrote:
> On Fri, 9 Mar 2018 21:04:23 +0800
> Jason Wang<jasowang@redhat.com>  wrote:
>
>> On 2018-03-09 17:35, Jesper Dangaard Brouer wrote:
>>> On Fri, 9 Mar 2018 15:24:10 +0800
>>> Jason Wang<jasowang@redhat.com>  wrote:
>>>   
>>>> On 2018-03-08 21:07, Jesper Dangaard Brouer wrote:
>>>>> Introduce an xdp_return_frame API, and convert over cpumap as
>>>>> the first user, given it has a queued XDP frame structure to leverage.
>>>>>
>>>>> Signed-off-by: Jesper Dangaard Brouer<brouer@redhat.com>
>>>>> ---
>>>>>     include/net/xdp.h   |   32 +++++++++++++++++++++++++++
>>>>>     kernel/bpf/cpumap.c |   60 +++++++++++++++++++++++++++++++--------------------
>>>>>     net/core/xdp.c      |   18 +++++++++++++++
>>>>>     3 files changed, 86 insertions(+), 24 deletions(-)
>>>>>
>>>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>>>> index b2362ddfa694..3cb726a6dc5b 100644
>>>>> --- a/include/net/xdp.h
>>>>> +++ b/include/net/xdp.h
>>>>> @@ -33,16 +33,48 @@
>>>>>      * also mandatory during RX-ring setup.
>>>>>      */
>>>>>     
>>>>> +enum mem_type {
>>>>> +	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
>>>>> +	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
>>>>> +	// Possible new ideas for types:
>>>>> +	// MEM_TYPE_PAGE_POOL,    /* Will be added later  */
>>>>> +	// MEM_TYPE_AF_XDP,
>>>>> +	// MEM_TYPE_DEVICE_OWNED -- invoking an dev->ndo?
>>>>> +	MEM_TYPE_MAX,
>>>>> +};
>>>> So if we plan to support dev->ndo, it looks to me like two types, AF_XDP
>>>> and DEVICE_OWNED, are sufficient? The driver can do what it wants (e.g.
>>>> page pool or ordinary page allocator) in the ndo or whatever other callbacks.
>>> So, the design is open to go both ways; we can figure this out later.
>>>
>>> The reason I'm not calling page_pool from a driver-level dev->ndo is
>>> that I'm trying to avoid invoking code in a module, as the driver
>>> module code can be (in the process of being) unloaded from the kernel
>>> while another driver is calling xdp_return_frame.  The current design
>>> in patch 10 uses the RCU period to guarantee that the allocator
>>> pointer is valid as long as the ID lookup succeeded.
>>>
>>> To do what you propose, we also need to guarantee that the net_device
>>> cannot disappear, so that it is safe to invoke dev->ndo.  To do so, the
>>> driver likely has to add an rcu_barrier before unloading.  I'm also
>>> thinking that for MEM_TYPE_DEVICE_OWNED the allocator pointer ("under
>>> protection") could be the net_device pointer, and we could take a
>>> refcnt on the dev (dev_hold/dev_put).
>> Then it looks like a device (e.g. tun) could lock another device for
>> an indefinite time.
>>
>>> Thus, I think it is doable, but let's
>>> figure this out later.
>>>
>>> Another important design consideration is that the xdp core needs to
>>> know how to release memory in case the ID lookup fails.
>> Can we simply use put_page() in this case?
> For now, yes. The (future) use case is when pages are kept DMA mapped;
> then the page also needs to be DMA unmapped.

This probably needs careful thought. E.g. the current virtio-net driver
does not do the mapping by itself; instead it depends on the virtio
core to do the map and unmap.

> To DMA unmap we need to
> know the struct device (not the net_device).  (And then the discussion
> starts around how to know the struct device is still valid.)

So we still need a safe fallback, if I understand this correctly.

Thanks

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-09 16:07         ` Jesper Dangaard Brouer
@ 2018-03-16  8:45           ` Jason Wang
  2018-03-19  9:48             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Wang @ 2018-03-16  8:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-10 00:07, Jesper Dangaard Brouer wrote:
> On Fri, 9 Mar 2018 21:07:36 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>>>>> Use the IDA infrastructure for getting a cyclic increasing ID number,
>>>>> that is used for keeping track of each registered allocator per
>>>>> RX-queue xdp_rxq_info.
>>>>>
>>>>> Signed-off-by: Jesper Dangaard Brouer<brouer@redhat.com>
>>>> A stupid question is, can we manage to unify this ID with NAPI id?
>>> Sorry I don't understand the question?
>> I mean, can we associate the page pool pointer with the napi_struct,
>> record the NAPI id in xdp_mem_info, and do the lookup through the NAPI id?
> No. The driver can unreg/reg a new XDP memory model,

Is there an actual use case for this?

>   without reloading
> the NAPI and generating a new NAPI id.
>

We could still do this with, e.g., an allocator pointer protected by RCU, I think?

Thanks

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-16  8:45           ` Jason Wang
@ 2018-03-19  9:48             ` Jesper Dangaard Brouer
  2018-03-20  2:26               ` Jason Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-19  9:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 16 Mar 2018 16:45:30 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2018-03-10 00:07, Jesper Dangaard Brouer wrote:
> > On Fri, 9 Mar 2018 21:07:36 +0800
> > Jason Wang <jasowang@redhat.com> wrote:
> >  
> >>>>> Use the IDA infrastructure for getting a cyclic increasing ID number,
> >>>>> that is used for keeping track of each registered allocator per
> >>>>> RX-queue xdp_rxq_info.
> >>>>>
> >>>>> Signed-off-by: Jesper Dangaard Brouer<brouer@redhat.com>  
> >>>> A stupid question is, can we manage to unify this ID with NAPI id?  
> >>> Sorry I don't understand the question?
> >>
> >> I mean, can we associate the page pool pointer with the napi_struct,
> >> record the NAPI id in xdp_mem_info, and do the lookup through the NAPI id?
> >
> > No. The driver can unreg/reg a new XDP memory model,  
> 
> Is there an actual use case for this?

I believe this is the common use case.  When attaching an XDP/bpf prog,
the driver usually wants to change the RX-ring memory model
(different performance trade-off).  When detaching XDP, the driver
wants to change back to the old memory model.  During this process, I
believe, the NAPI-ID remains the same (right?).
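
As a sketch of the flow I have in mind, using the register function
from patch 02 (the ring struct is made up, and MEM_TYPE_PAGE_POOL is
one of the future types listed in the enum):

	/* Swap the memory model on XDP attach/detach; the NAPI instance
	 * (and thus its NAPI-ID) is never reloaded in the process.
	 */
	static int drv_set_mem_model(struct drv_rx_ring *ring, bool xdp_on)
	{
		if (xdp_on)	/* e.g. recycle via page_pool while XDP is on */
			return xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
							  MEM_TYPE_PAGE_POOL,
							  ring->page_pool);

		/* back to the normal split-page refcnt model */
		return xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
						  MEM_TYPE_PAGE_SHARED, NULL);
	}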

> >   without reloading
> > the NAPI and generating a new NAPI id.
> >  
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-19  9:48             ` Jesper Dangaard Brouer
@ 2018-03-20  2:26               ` Jason Wang
  2018-03-20 14:27                 ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Wang @ 2018-03-20  2:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-19 17:48, Jesper Dangaard Brouer wrote:
> On Fri, 16 Mar 2018 16:45:30 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2018-03-10 00:07, Jesper Dangaard Brouer wrote:
>>> On Fri, 9 Mar 2018 21:07:36 +0800
>>> Jason Wang <jasowang@redhat.com> wrote:
>>>   
>>>>>>> Use the IDA infrastructure for getting a cyclically increasing ID number,
>>>>>>> which is used for keeping track of each registered allocator per
>>>>>>> RX-queue xdp_rxq_info.
>>>>>>>
>>>>>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>>> A stupid question is, can we manage to unify this ID with NAPI id?
>>>>> Sorry I don't understand the question?
>>>> I mean can we associate the page pool pointer with napi_struct, record the NAPI id
>>>> in xdp_mem_info and do the lookup through the NAPI id?
>>> No. The driver can unreg/reg a new XDP memory model,
>> Is there an actual use case for this?
> I believe this is the common use case.  When attaching an XDP/bpf prog,
> the driver usually wants to change the RX-ring memory model
> (a different performance trade-off).

Right, but a single driver should only have one XDP memory model.  (Or
do you want all drivers to use this generic allocator?)

> When detaching XDP, the driver
> wants to change back to the old memory model.  During this process, I
> believe, the NAPI id remains the same (right?).

Yes, but we can change the allocator pointer in the NAPI struct in this 
case too.

Thanks

>
>>>    without reloading
>>> the NAPI and generating a new NAPI id.
>>>   

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-20  2:26               ` Jason Wang
@ 2018-03-20 14:27                 ` Jesper Dangaard Brouer
  2018-03-22  2:16                   ` Jason Wang
  0 siblings, 1 reply; 38+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-20 14:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Tue, 20 Mar 2018 10:26:50 +0800
Jason Wang <jasowang@redhat.com> wrote:

> On 2018-03-19 17:48, Jesper Dangaard Brouer wrote:
> > On Fri, 16 Mar 2018 16:45:30 +0800
> > Jason Wang <jasowang@redhat.com> wrote:
> >  
> >> On 2018-03-10 00:07, Jesper Dangaard Brouer wrote:
> >>> On Fri, 9 Mar 2018 21:07:36 +0800
> >>> Jason Wang <jasowang@redhat.com> wrote:
> >>>     
> >>>>>>> Use the IDA infrastructure for getting a cyclically increasing ID number,
> >>>>>>> which is used for keeping track of each registered allocator per
> >>>>>>> RX-queue xdp_rxq_info.
> >>>>>>>
> >>>>>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> >>>>>> A stupid question is, can we manage to unify this ID with NAPI id?  
> >>>>> Sorry I don't understand the question? 
> >>>> I mean can we associate the page pool pointer with napi_struct, record the NAPI id
> >>>> in xdp_mem_info and do the lookup through the NAPI id?
> >>> No. The driver can unreg/reg a new XDP memory model,  
> >>
> >> Is there an actual use case for this?  
> >
> > I believe this is the common use case.  When attaching an XDP/bpf prog,
> > the driver usually wants to change the RX-ring memory model
> > (a different performance trade-off).
> 
> Right, but a single driver should only have one XDP memory model. 

No! -- a driver can have multiple XDP memory models, based on different
performance trade-offs and hardware capabilities.

The mlx5 (100Gbit/s) driver/hardware is a good example, which needs
different memory models.  It already supports multiple RX memory models,
depending on HW support.  So, I predict that we hit a performance
limit around 42Mpps on PCIe (I can measure 36Mpps); this is due to the
PCI-express transactions/sec limit.  The mlx5 HW supports a compressed
descriptor format which delivers packets in several pages (based on
offset and len), thus lowering the needed PCIe transactions.  The
pitfall is that this comes with tail-room limitations, which can be okay if
e.g. the user's use case does not involve cpumap.

Plus, when a driver needs to support AF_XDP zero-copy, that also counts
as another XDP memory model...
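
To make "multiple memory models" concrete, the per-RX-queue choice
looks roughly like this (a sketch; the names follow the series, and
the zero-copy entry is speculation on my part at this point):

enum xdp_mem_type {
	MEM_TYPE_PAGE_SHARED,	/* split-page model, refcount based */
	MEM_TYPE_PAGE_ORDER0,	/* one page per frame */
	MEM_TYPE_PAGE_POOL,	/* recycling via page_pool */
	MEM_TYPE_ZERO_COPY,	/* AF_XDP zero-copy, still to come */
	MEM_TYPE_MAX,
};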

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-20 14:27                 ` Jesper Dangaard Brouer
@ 2018-03-22  2:16                   ` Jason Wang
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Wang @ 2018-03-22  2:16 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 2018-03-20 22:27, Jesper Dangaard Brouer wrote:
> On Tue, 20 Mar 2018 10:26:50 +0800
> Jason Wang <jasowang@redhat.com> wrote:
>
>> On 2018-03-19 17:48, Jesper Dangaard Brouer wrote:
>>> On Fri, 16 Mar 2018 16:45:30 +0800
>>> Jason Wang <jasowang@redhat.com> wrote:
>>>   
>>>> On 2018-03-10 00:07, Jesper Dangaard Brouer wrote:
>>>>> On Fri, 9 Mar 2018 21:07:36 +0800
>>>>> Jason Wang <jasowang@redhat.com> wrote:
>>>>>      
>>>>>>>>> Use the IDA infrastructure for getting a cyclically increasing ID number,
>>>>>>>>> which is used for keeping track of each registered allocator per
>>>>>>>>> RX-queue xdp_rxq_info.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>>>>> A stupid question is, can we manage to unify this ID with NAPI id?
>>>>>>> Sorry I don't understand the question?
>>>>>> I mean can we associate the page pool pointer with napi_struct, record the NAPI id
>>>>>> in xdp_mem_info and do the lookup through the NAPI id?
>>>>> No. The driver can unreg/reg a new XDP memory model,
>>>> Is there an actual use case for this?
>>> I believe this is the common use case.  When attaching an XDP/bpf prog,
>>> the driver usually wants to change the RX-ring memory model
>>> (a different performance trade-off).
>> Right, but a single driver should only have one XDP memory model.
> No! -- a driver can have multiple XDP memory models, based on different
> performance trade-offs and hardware capabilities.
>
> The mlx5 (100Gbit/s) driver/hardware is a good example, which needs
> different memory models.  It already supports multiple RX memory models,
> depending on HW support.

So let me correct my question: I'm not familiar with the mlx5e driver,
but if I understand correctly, the driver (mlx5) will not change the
memory model at runtime for each NAPI instance.  So the NAPI id still
works in this case?

> So, I predict that we hit a performance
> limit around 42Mpps on PCIe (I can measure 36Mpps); this is due to the
> PCI-express transactions/sec limit.  The mlx5 HW supports a compressed
> descriptor format which delivers packets in several pages (based on
> offset and len), thus lowering the needed PCIe transactions.  The
> pitfall is that this comes with tail-room limitations, which can be okay if
> e.g. the user's use case does not involve cpumap.
>
> Plus, when a driver needs to support AF_XDP zero-copy, that also counts
> as another XDP memory model...

Yes, or TAP zero-copy XDP.  But it looks to me like we don't even need
to care about recycling here, since the pages belong to userspace.

Thanks

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2018-03-22  2:17 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-08 13:07 [bpf-next V2 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
2018-03-08 13:07 ` [bpf-next V2 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support Jesper Dangaard Brouer
2018-03-08 13:07 ` [bpf-next V2 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap Jesper Dangaard Brouer
2018-03-09  7:24   ` Jason Wang
2018-03-09  9:35     ` Jesper Dangaard Brouer
2018-03-09 13:04       ` Jason Wang
2018-03-09 16:05         ` Jesper Dangaard Brouer
2018-03-16  8:43           ` Jason Wang
2018-03-08 13:07 ` [bpf-next V2 PATCH 03/15] ixgbe: use xdp_return_frame API Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 05/15] xdp: introduce a new xdp_frame type Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API Jesper Dangaard Brouer
2018-03-08 15:16   ` Jesper Dangaard Brouer
2018-03-09  7:16     ` Jason Wang
2018-03-09  8:46       ` Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 07/15] virtio_net: " Jesper Dangaard Brouer
2018-03-09  8:03   ` Jason Wang
2018-03-09  9:44     ` Jesper Dangaard Brouer
2018-03-09 13:11       ` Jason Wang
2018-03-08 13:08 ` [bpf-next V2 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 09/15] mlx5: register a memory model when XDP is enabled Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping Jesper Dangaard Brouer
2018-03-09  8:08   ` Jason Wang
2018-03-09  9:37     ` Jesper Dangaard Brouer
2018-03-09 13:07       ` Jason Wang
2018-03-09 16:07         ` Jesper Dangaard Brouer
2018-03-16  8:45           ` Jason Wang
2018-03-19  9:48             ` Jesper Dangaard Brouer
2018-03-20  2:26               ` Jason Wang
2018-03-20 14:27                 ` Jesper Dangaard Brouer
2018-03-22  2:16                   ` Jason Wang
2018-03-08 13:08 ` [bpf-next V2 PATCH 11/15] page_pool: refurbish version of page_pool code Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 14/15] xdp: transition into using xdp_frame for return API Jesper Dangaard Brouer
2018-03-08 13:08 ` [bpf-next V2 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit Jesper Dangaard Brouer
2018-03-08 15:03 ` [bpf-next V2 PATCH 00/15] XDP redirect memory return API Alexander Duyck
2018-03-08 15:30   ` Jesper Dangaard Brouer
