* [bpf-next V5 PATCH 00/15] XDP redirect memory return API
@ 2018-03-23 12:17 Jesper Dangaard Brouer
  2018-03-23 12:18 ` [bpf-next V5 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support Jesper Dangaard Brouer
                   ` (14 more replies)
  0 siblings, 15 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:17 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This patchset works towards supporting different XDP RX-ring memory
allocators, as this will be needed by the AF_XDP zero-copy mode.

The patchset uses mlx5 as the sample driver, which gets XDP_REDIRECT
RX-mode implemented, but not ndo_xdp_xmit (as this API is subject to
change throughout the patchset).

A new struct xdp_frame is introduced (modeled after the cpumap xdp_pkt),
and both ndo_xdp_xmit and the new xdp_return_frame end up using this.

Support for a driver-supplied allocator is implemented, and a
refurbished version of page_pool is the first return-allocator type
introduced.  This will be an integration point for AF_XDP zero-copy.
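
To make the intended driver-side flow concrete, here is a rough sketch
using only the APIs introduced in this patchset (the drv_* helper names
are illustrative and not part of any patch; error handling and driver
glue are omitted, and patches 14-15 later transition these APIs to take
an xdp_frame directly):

  #include <net/xdp.h>

  /* RX-ring setup: tell the XDP core which memory model the ring uses */
  static int drv_reg_rxq_mem(struct xdp_rxq_info *xdp_rxq)
  {
  	/* Simple page-refcnt based model, no allocator pointer needed */
  	return xdp_rxq_info_reg_mem_model(xdp_rxq, MEM_TYPE_PAGE_SHARED, NULL);
  }

  /* Transmit side: converting the xdp_buff captures the xdp_mem_info */
  static struct xdp_frame *drv_xdp_to_frame(struct xdp_buff *xdp)
  {
  	return convert_to_xdp_frame(xdp); /* NULL if headroom is too small */
  }

  /* DMA-TX completion, possibly on a remote CPU: return via mem info */
  static void drv_tx_complete(struct xdp_frame *xdpf)
  {
  	xdp_return_frame(xdpf->data, &xdpf->mem);
  }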

The mlx5 driver evolves into using the page_pool, and sees a performance
increase (with ndo_xdp_xmit out of the ixgbe driver) from 6 Mpps to 12 Mpps.


The patchset stops at the 15-patch limit, but more API changes are
planned: specifically, extending the ndo_xdp_xmit and xdp_return_frame
APIs to support bulking, as this will address some known limits.

V2: Updated according to Tariq's feedback
V3: Feedback from Jason Wang and Alex Duyck
V4: Feedback from Tariq and Jason
V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test

---

Jesper Dangaard Brouer (15):
      mlx5: basic XDP_REDIRECT forward support
      xdp: introduce xdp_return_frame API and use in cpumap
      ixgbe: use xdp_return_frame API
      xdp: move struct xdp_buff from filter.h to xdp.h
      xdp: introduce a new xdp_frame type
      tun: convert to use generic xdp_frame and xdp_return_frame API
      virtio_net: convert to use generic xdp_frame and xdp_return_frame API
      bpf: cpumap convert to use generic xdp_frame
      mlx5: register a memory model when XDP is enabled
      xdp: rhashtable with allocator ID to pointer mapping
      page_pool: refurbish version of page_pool code
      xdp: allow page_pool as an allocator type in xdp_return_frame
      mlx5: use page_pool for xdp_return_frame call
      xdp: transition into using xdp_frame for return API
      xdp: transition into using xdp_frame for ndo_xdp_xmit


 drivers/net/ethernet/intel/ixgbe/ixgbe.h          |    3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c     |   37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   37 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   42 ++-
 drivers/net/tun.c                                 |   60 ++--
 drivers/net/virtio_net.c                          |   52 ++-
 drivers/vhost/net.c                               |    7 
 include/linux/filter.h                            |   24 --
 include/linux/if_tun.h                            |    4 
 include/linux/netdevice.h                         |    4 
 include/net/page_pool.h                           |  133 ++++++++
 include/net/xdp.h                                 |   83 +++++
 kernel/bpf/cpumap.c                               |  132 +++-----
 net/core/Makefile                                 |    1 
 net/core/filter.c                                 |   17 +
 net/core/page_pool.c                              |  329 +++++++++++++++++++++
 net/core/xdp.c                                    |  257 ++++++++++++++++
 18 files changed, 1050 insertions(+), 176 deletions(-)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

--


* [bpf-next V5 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 12:18 ` [bpf-next V5 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap Jesper Dangaard Brouer
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This implements basic XDP redirect support in the mlx5 driver.

Notice that ndo_xdp_xmit() is NOT implemented, because that API needs
some changes that this patchset is working towards.

The main purpose of this patch is to have different drivers doing
XDP_REDIRECT, to show how different memory models behave in a
cross-driver world.

Update(pre-RFCv2 Tariq): Need to DMA unmap the page before xdp_do_redirect,
as the return API does not yet exist to keep it mapped.

Update(pre-RFCv3 Saeed): Don't mix XDP_TX and XDP_REDIRECT flushing,
introduce the xdpsq.db.redirect_flush boolean.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h    |    1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |   27 ++++++++++++++++++++---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 4c9360b25532..28cc26debeda 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -398,6 +398,7 @@ struct mlx5e_xdpsq {
 	struct {
 		struct mlx5e_dma_info     *di;
 		bool                       doorbell;
+		bool                       redirect_flush;
 	} db;
 
 	/* read only */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8cce90dc461d..6dcc3e8fbd3e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -236,14 +236,20 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 	return 0;
 }
 
+static inline void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
+					struct mlx5e_dma_info *dma_info)
+{
+	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
+		       rq->buff.map_dir);
+}
+
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle)
 {
 	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
 		return;
 
-	dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
-		       rq->buff.map_dir);
+	mlx5e_page_dma_unmap(rq, dma_info);
 	put_page(dma_info->page);
 }
 
@@ -822,9 +828,10 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
 				   struct mlx5e_dma_info *di,
 				   void *va, u16 *rx_headroom, u32 *len)
 {
-	const struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
 	struct xdp_buff xdp;
 	u32 act;
+	int err;
 
 	if (!prog)
 		return false;
@@ -845,6 +852,15 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		if (unlikely(!mlx5e_xmit_xdp_frame(rq, di, &xdp)))
 			trace_xdp_exception(rq->netdev, prog, act);
 		return true;
+	case XDP_REDIRECT:
+		/* When XDP enabled then page-refcnt==1 here */
+		err = xdp_do_redirect(rq->netdev, &xdp, prog);
+		if (!err) {
+			rq->wqe.xdp_xmit = true; /* XDP xmit owns page */
+			rq->xdpsq.db.redirect_flush = true;
+			mlx5e_page_dma_unmap(rq, di);
+		}
+		return true;
 	default:
 		bpf_warn_invalid_xdp_action(act);
 	case XDP_ABORTED:
@@ -1107,6 +1123,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
 		xdpsq->db.doorbell = false;
 	}
 
+	if (xdpsq->db.redirect_flush) {
+		xdp_do_flush_map();
+		xdpsq->db.redirect_flush = false;
+	}
+
 	mlx5_cqwq_update_db_record(&cq->wq);
 
 	/* ensure cq space is freed before enabling more cqes */


* [bpf-next V5 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
  2018-03-23 12:18 ` [bpf-next V5 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 16:35   ` Alexander Duyck
  2018-03-23 12:18 ` [bpf-next V5 PATCH 03/15] ixgbe: use xdp_return_frame API Jesper Dangaard Brouer
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Introduce an xdp_return_frame API, and convert over cpumap as
the first user, given it has a queued XDP frame structure to leverage.
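
For any user that queues frames, the calling convention boils down to
the following sketch (mirroring the cpumap conversion below; the
xdp_mem_info must be captured while the xdp_buff and its rxq are still
valid, i.e. inside the NAPI/enqueue context):

  /* enqueue time, while xdp->rxq is still valid */
  xdp_pkt->mem = xdp->rxq->mem;

  /* later, possibly on a remote CPU, when freeing the queued frame */
  xdp_return_frame(xdp_pkt->data, &xdp_pkt->mem);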

V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h   |   28 ++++++++++++++++++++++++
 kernel/bpf/cpumap.c |   60 +++++++++++++++++++++++++++++++--------------------
 net/core/xdp.c      |   18 +++++++++++++++
 3 files changed, 82 insertions(+), 24 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index b2362ddfa694..15b546325e31 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -33,16 +33,44 @@
  * also mandatory during RX-ring setup.
  */
 
+enum mem_type {
+	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
+	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
+	MEM_TYPE_MAX,
+};
+
+struct xdp_mem_info {
+	u32 type; /* enum mem_type, but known size type */
+	/* u32 id; will be added later in this patchset */
+};
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
 	u32 reg_state;
+	struct xdp_mem_info mem;
 } ____cacheline_aligned; /* perf critical, avoid false-sharing */
 
+
+static inline
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+	if (mem->type == MEM_TYPE_PAGE_SHARED)
+		page_frag_free(data);
+
+	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+		struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+		put_page(page);
+	}
+}
+
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
 void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
 bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
+int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
+			       enum mem_type type, void *allocator);
 
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index a4bb0b34375a..3e4bbcbe3e86 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -19,6 +19,7 @@
 #include <linux/bpf.h>
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
+#include <net/xdp.h>
 
 #include <linux/sched.h>
 #include <linux/workqueue.h>
@@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void __cpu_map_queue_destructor(void *ptr)
-{
-	/* The tear-down procedure should have made sure that queue is
-	 * empty.  See __cpu_map_entry_replace() and work-queue
-	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
-	 * gracefully and warn once.
-	 */
-	if (WARN_ON_ONCE(ptr))
-		page_frag_free(ptr);
-}
-
-static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
-{
-	if (atomic_dec_and_test(&rcpu->refcnt)) {
-		/* The queue should be empty at this point */
-		ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
-		kfree(rcpu->queue);
-		kfree(rcpu);
-	}
-}
-
 static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
 {
 	atomic_inc(&rcpu->refcnt);
@@ -188,6 +168,10 @@ struct xdp_pkt {
 	u16 len;
 	u16 headroom;
 	u16 metasize;
+	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+	 * while mem info is valid on remote CPU.
+	 */
+	struct xdp_mem_info mem;
 	struct net_device *dev_rx;
 };
 
@@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
 	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
 	xdp_pkt->metasize = metasize;
 
+	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+	xdp_pkt->mem = xdp->rxq->mem;
+
 	return xdp_pkt;
 }
 
@@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 	return skb;
 }
 
+static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
+{
+	/* The tear-down procedure should have made sure that queue is
+	 * empty.  See __cpu_map_entry_replace() and work-queue
+	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
+	 * gracefully and warn once.
+	 */
+	struct xdp_pkt *xdp_pkt;
+
+	while ((xdp_pkt = ptr_ring_consume(ring)))
+		if (WARN_ON_ONCE(xdp_pkt))
+			xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+}
+
+static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
+{
+	if (atomic_dec_and_test(&rcpu->refcnt)) {
+		/* The queue should be empty at this point */
+		__cpu_map_ring_cleanup(rcpu->queue);
+		ptr_ring_cleanup(rcpu->queue, NULL);
+		kfree(rcpu->queue);
+		kfree(rcpu);
+	}
+}
+
 static int cpu_map_kthread_run(void *data)
 {
 	struct bpf_cpu_map_entry *rcpu = data;
@@ -307,7 +319,7 @@ static int cpu_map_kthread_run(void *data)
 
 			skb = cpu_map_build_skb(rcpu, xdp_pkt);
 			if (!skb) {
-				page_frag_free(xdp_pkt);
+				xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
 				continue;
 			}
 
@@ -604,13 +616,13 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 	spin_lock(&q->producer_lock);
 
 	for (i = 0; i < bq->count; i++) {
-		void *xdp_pkt = bq->q[i];
+		struct xdp_pkt *xdp_pkt = bq->q[i];
 		int err;
 
 		err = __ptr_ring_produce(q, xdp_pkt);
 		if (err) {
 			drops++;
-			page_frag_free(xdp_pkt); /* Free xdp_pkt */
+			xdp_return_frame(xdp_pkt->data, &xdp_pkt->mem);
 		}
 		processed++;
 	}
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 097a0f74e004..9eee0c431126 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -71,3 +71,21 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
 	return (xdp_rxq->reg_state == REG_STATE_REGISTERED);
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
+
+int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
+			       enum mem_type type, void *allocator)
+{
+	if (type >= MEM_TYPE_MAX)
+		return -EINVAL;
+
+	xdp_rxq->mem.type = type;
+
+	if (allocator)
+		return -EOPNOTSUPP;
+
+	/* TODO: Allocate an ID that maps to allocator pointer
+	 * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
+	 */
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);


* [bpf-next V5 PATCH 03/15] ixgbe: use xdp_return_frame API
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
  2018-03-23 12:18 ` [bpf-next V5 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support Jesper Dangaard Brouer
  2018-03-23 12:18 ` [bpf-next V5 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 16:46   ` Alexander Duyck
  2018-03-23 12:18 ` [bpf-next V5 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h Jesper Dangaard Brouer
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Extend struct ixgbe_tx_buffer to store the xdp_mem_info.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    6 ++++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index c1e3a0039ea5..cbc20f199364 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -249,6 +249,7 @@ struct ixgbe_tx_buffer {
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
 	u32 tx_flags;
+	struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 85369423452d..45520eb503ee 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 		/* free the skb */
 		if (ring_is_xdp(tx_ring))
-			page_frag_free(tx_buffer->data);
+			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
 		else
 			napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -5787,7 +5787,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
 
 		/* Free all the Tx ring sk_buffs */
 		if (ring_is_xdp(tx_ring))
-			page_frag_free(tx_buffer->data);
+			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8351,6 +8351,8 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 	dma_unmap_len_set(tx_buffer, len, len);
 	dma_unmap_addr_set(tx_buffer, dma, dma);
 	tx_buffer->data = xdp->data;
+	tx_buffer->xdp_mem = xdp->rxq->mem;
+
 	tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
 	/* put descriptor type bits */


* [bpf-next V5 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (2 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 03/15] ixgbe: use xdp_return_frame API Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 12:18 ` [bpf-next V5 PATCH 05/15] xdp: introduce a new xdp_frame type Jesper Dangaard Brouer
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This is done to prepare for the next patch, and it is also
nice to move this XDP related struct out of filter.h.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/linux/filter.h |   24 +-----------------------
 include/net/xdp.h      |   22 ++++++++++++++++++++++
 2 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 109d05ccea9a..340ba653dd80 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -30,6 +30,7 @@ struct sock;
 struct seccomp_data;
 struct bpf_prog_aux;
 struct xdp_rxq_info;
+struct xdp_buff;
 
 /* ArgX, context and stack frame pointer register positions. Note,
  * Arg1, Arg2, Arg3, etc are used as argument mappings of function
@@ -499,14 +500,6 @@ struct bpf_skb_data_end {
 	void *data_end;
 };
 
-struct xdp_buff {
-	void *data;
-	void *data_end;
-	void *data_meta;
-	void *data_hard_start;
-	struct xdp_rxq_info *rxq;
-};
-
 struct sk_msg_buff {
 	void *data;
 	void *data_end;
@@ -769,21 +762,6 @@ int xdp_do_redirect(struct net_device *dev,
 		    struct bpf_prog *prog);
 void xdp_do_flush_map(void);
 
-/* Drivers not supporting XDP metadata can use this helper, which
- * rejects any room expansion for metadata as a result.
- */
-static __always_inline void
-xdp_set_data_meta_invalid(struct xdp_buff *xdp)
-{
-	xdp->data_meta = xdp->data + 1;
-}
-
-static __always_inline bool
-xdp_data_meta_unsupported(const struct xdp_buff *xdp)
-{
-	return unlikely(xdp->data_meta > xdp->data);
-}
-
 void bpf_warn_invalid_xdp_action(u32 act);
 
 struct sock *do_sk_redirect_map(struct sk_buff *skb);
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 15b546325e31..1ee154fe0be6 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -51,6 +51,13 @@ struct xdp_rxq_info {
 	struct xdp_mem_info mem;
 } ____cacheline_aligned; /* perf critical, avoid false-sharing */
 
+struct xdp_buff {
+	void *data;
+	void *data_end;
+	void *data_meta;
+	void *data_hard_start;
+	struct xdp_rxq_info *rxq;
+};
 
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
@@ -73,4 +80,19 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum mem_type type, void *allocator);
 
+/* Drivers not supporting XDP metadata can use this helper, which
+ * rejects any room expansion for metadata as a result.
+ */
+static __always_inline void
+xdp_set_data_meta_invalid(struct xdp_buff *xdp)
+{
+	xdp->data_meta = xdp->data + 1;
+}
+
+static __always_inline bool
+xdp_data_meta_unsupported(const struct xdp_buff *xdp)
+{
+	return unlikely(xdp->data_meta > xdp->data);
+}
+
 #endif /* __LINUX_NET_XDP_H__ */


* [bpf-next V5 PATCH 05/15] xdp: introduce a new xdp_frame type
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (3 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 17:11   ` Alexander Duyck
  2018-03-23 12:18 ` [bpf-next V5 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API Jesper Dangaard Brouer
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This is needed to convert the tuntap and virtio_net drivers.

This is a generalization of what is done inside cpumap, which will be
converted later.
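
A rough sketch of the expected usage in a driver xmit path (the tun and
virtio_net conversions in the following patches end up doing exactly
this; names and error handling abbreviated):

  struct xdp_frame *xdpf;

  xdpf = convert_to_xdp_frame(xdp);
  if (unlikely(!xdpf))
  	return -EOVERFLOW; /* not enough headroom to store the xdp_frame */

  /* ... queue xdpf (ptr_ring, virtqueue, HW ring) ... */

  /* completion/free path, later and possibly on another CPU */
  xdp_return_frame(xdpf->data, &xdpf->mem);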

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h |   40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 1ee154fe0be6..13f71a15c79f 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -59,6 +59,46 @@ struct xdp_buff {
 	struct xdp_rxq_info *rxq;
 };
 
+struct xdp_frame {
+	void *data;
+	u16 len;
+	u16 headroom;
+	u16 metasize;
+	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
+	 * while mem info is valid on remote CPU.
+	 */
+	struct xdp_mem_info mem;
+};
+
+/* Convert xdp_buff to xdp_frame */
+static inline
+struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
+{
+	struct xdp_frame *xdp_frame;
+	int metasize;
+	int headroom;
+
+	/* Assure headroom is available for storing info */
+	headroom = xdp->data - xdp->data_hard_start;
+	metasize = xdp->data - xdp->data_meta;
+	metasize = metasize > 0 ? metasize : 0;
+	if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
+		return NULL;
+
+	/* Store info in top of packet */
+	xdp_frame = xdp->data_hard_start;
+
+	xdp_frame->data = xdp->data;
+	xdp_frame->len  = xdp->data_end - xdp->data;
+	xdp_frame->headroom = headroom - sizeof(*xdp_frame);
+	xdp_frame->metasize = metasize;
+
+	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
+	xdp_frame->mem = xdp->rxq->mem;
+
+	return xdp_frame;
+}
+
 static inline
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {


* [bpf-next V5 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (4 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 05/15] xdp: introduce a new xdp_frame type Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 12:18 ` [bpf-next V5 PATCH 07/15] virtio_net: " Jesper Dangaard Brouer
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

The tuntap driver invented its own driver-specific way of queuing
XDP packets, by storing the xdp_buff information in the top of
the XDP frame data.

Convert it over to use the more generic xdp_frame structure.  The
main problem with the in-driver method is that the xdp_rxq_info pointer
cannot be trusted/used when dequeuing the frame.

V3: Remove check based on feedback from Jason

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/tun.c      |   43 ++++++++++++++++++++-----------------------
 drivers/vhost/net.c    |    7 ++++---
 include/linux/if_tun.h |    4 ++--
 3 files changed, 26 insertions(+), 28 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index baeafa004463..6750980d9f30 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -248,11 +248,11 @@ struct veth {
 	__be16 h_vlan_TCI;
 };
 
-bool tun_is_xdp_buff(void *ptr)
+bool tun_is_xdp_frame(void *ptr)
 {
 	return (unsigned long)ptr & TUN_XDP_FLAG;
 }
-EXPORT_SYMBOL(tun_is_xdp_buff);
+EXPORT_SYMBOL(tun_is_xdp_frame);
 
 void *tun_xdp_to_ptr(void *ptr)
 {
@@ -660,10 +660,10 @@ static void tun_ptr_free(void *ptr)
 {
 	if (!ptr)
 		return;
-	if (tun_is_xdp_buff(ptr)) {
-		struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+	if (tun_is_xdp_frame(ptr)) {
+		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		put_page(virt_to_head_page(xdp->data));
+		xdp_return_frame(xdpf->data, &xdpf->mem);
 	} else {
 		__skb_array_destroy_skb(ptr);
 	}
@@ -1290,17 +1290,14 @@ static const struct net_device_ops tun_netdev_ops = {
 static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 {
 	struct tun_struct *tun = netdev_priv(dev);
-	struct xdp_buff *buff = xdp->data_hard_start;
-	int headroom = xdp->data - xdp->data_hard_start;
+	struct xdp_frame *frame;
 	struct tun_file *tfile;
 	u32 numqueues;
 	int ret = 0;
 
-	/* Assure headroom is available and buff is properly aligned */
-	if (unlikely(headroom < sizeof(*xdp) || tun_is_xdp_buff(xdp)))
-		return -ENOSPC;
-
-	*buff = *xdp;
+	frame = convert_to_xdp_frame(xdp);
+	if (unlikely(!frame))
+		return -EOVERFLOW;
 
 	rcu_read_lock();
 
@@ -1315,7 +1312,7 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	/* Encode the XDP flag into lowest bit for consumer to differ
 	 * XDP buffer from sk_buff.
 	 */
-	if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(buff))) {
+	if (ptr_ring_produce(&tfile->tx_ring, tun_xdp_to_ptr(frame))) {
 		this_cpu_inc(tun->pcpu_stats->tx_dropped);
 		ret = -ENOSPC;
 	}
@@ -1993,11 +1990,11 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 static ssize_t tun_put_user_xdp(struct tun_struct *tun,
 				struct tun_file *tfile,
-				struct xdp_buff *xdp,
+				struct xdp_frame *xdp_frame,
 				struct iov_iter *iter)
 {
 	int vnet_hdr_sz = 0;
-	size_t size = xdp->data_end - xdp->data;
+	size_t size = xdp_frame->len;
 	struct tun_pcpu_stats *stats;
 	size_t ret;
 
@@ -2013,7 +2010,7 @@ static ssize_t tun_put_user_xdp(struct tun_struct *tun,
 		iov_iter_advance(iter, vnet_hdr_sz - sizeof(gso));
 	}
 
-	ret = copy_to_iter(xdp->data, size, iter) + vnet_hdr_sz;
+	ret = copy_to_iter(xdp_frame->data, size, iter) + vnet_hdr_sz;
 
 	stats = get_cpu_ptr(tun->pcpu_stats);
 	u64_stats_update_begin(&stats->syncp);
@@ -2181,11 +2178,11 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 			return err;
 	}
 
-	if (tun_is_xdp_buff(ptr)) {
-		struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+	if (tun_is_xdp_frame(ptr)) {
+		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		ret = tun_put_user_xdp(tun, tfile, xdp, to);
-		put_page(virt_to_head_page(xdp->data));
+		ret = tun_put_user_xdp(tun, tfile, xdpf, to);
+		xdp_return_frame(xdpf->data, &xdpf->mem);
 	} else {
 		struct sk_buff *skb = ptr;
 
@@ -2424,10 +2421,10 @@ static int tun_recvmsg(struct socket *sock, struct msghdr *m, size_t total_len,
 static int tun_ptr_peek_len(void *ptr)
 {
 	if (likely(ptr)) {
-		if (tun_is_xdp_buff(ptr)) {
-			struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+		if (tun_is_xdp_frame(ptr)) {
+			struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-			return xdp->data_end - xdp->data;
+			return xdpf->len;
 		}
 		return __skb_array_len_with_tag(ptr);
 	} else {
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b5fb56b822fd..5aee3aaf6c8c 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,7 @@
 #include <linux/skbuff.h>
 
 #include <net/sock.h>
+#include <net/xdp.h>
 
 #include "vhost.h"
 
@@ -177,10 +178,10 @@ static void vhost_net_buf_unproduce(struct vhost_net_virtqueue *nvq)
 
 static int vhost_net_buf_peek_len(void *ptr)
 {
-	if (tun_is_xdp_buff(ptr)) {
-		struct xdp_buff *xdp = tun_ptr_to_xdp(ptr);
+	if (tun_is_xdp_frame(ptr)) {
+		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		return xdp->data_end - xdp->data;
+		return xdpf->len;
 	}
 
 	return __skb_array_len_with_tag(ptr);
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index c5b0a75a7812..33b817b172af 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -22,7 +22,7 @@
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
-bool tun_is_xdp_buff(void *ptr);
+bool tun_is_xdp_frame(void *ptr);
 void *tun_xdp_to_ptr(void *ptr);
 void *tun_ptr_to_xdp(void *ptr);
 #else
@@ -38,7 +38,7 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
 {
 	return ERR_PTR(-EINVAL);
 }
-static inline bool tun_is_xdp_buff(void *ptr)
+static inline bool tun_is_xdp_frame(void *ptr)
 {
 	return false;
 }


* [bpf-next V5 PATCH 07/15] virtio_net: convert to use generic xdp_frame and xdp_return_frame API
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (5 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 12:18 ` [bpf-next V5 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame Jesper Dangaard Brouer
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

The virtio_net driver assumes XDP frames are always released based on
page refcnt (via put_page).  Thus, it only queues the XDP data pointer
address and uses virt_to_head_page() to retrieve the struct page.

Use the XDP return API to get away from such assumptions.  Instead,
queue an xdp_frame, which allows us to use the xdp_return_frame API
when releasing the frame.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/virtio_net.c |   31 +++++++++++++++++++++----------
 1 file changed, 21 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 23374603e4d9..6c4220450506 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -419,30 +419,41 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 			       struct xdp_buff *xdp)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
-	unsigned int len;
+	struct xdp_frame *xdpf, *xdpf_sent;
 	struct send_queue *sq;
+	unsigned int len;
 	unsigned int qp;
-	void *xdp_sent;
 	int err;
 
 	qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
 	sq = &vi->sq[qp];
 
 	/* Free up any pending old buffers before queueing new ones. */
-	while ((xdp_sent = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-		struct page *sent_page = virt_to_head_page(xdp_sent);
+	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
+		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
 
-		put_page(sent_page);
-	}
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	/* virtqueue want to use data area in-front of packet */
+	if (unlikely(xdpf->metasize > 0))
+		return -EOPNOTSUPP;
+
+	if (unlikely(xdpf->headroom < vi->hdr_len))
+		return -EOVERFLOW;
 
-	xdp->data -= vi->hdr_len;
+	/* Make room for virtqueue hdr (also change xdpf->headroom?) */
+	xdpf->data -= vi->hdr_len;
 	/* Zero header and leave csum up to XDP layers */
-	hdr = xdp->data;
+	hdr = xdpf->data;
 	memset(hdr, 0, vi->hdr_len);
+	hdr->hdr.hdr_len = xdpf->len; /* Q: is this needed? */
+	xdpf->len   += vi->hdr_len;
 
-	sg_init_one(sq->sg, xdp->data, xdp->data_end - xdp->data);
+	sg_init_one(sq->sg, xdpf->data, xdpf->len);
 
-	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp->data, GFP_ATOMIC);
+	err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdpf, GFP_ATOMIC);
 	if (unlikely(err))
 		return false; /* Caller handle free/refcnt */
 


* [bpf-next V5 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (6 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 07/15] virtio_net: " Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 12:18 ` [bpf-next V5 PATCH 09/15] mlx5: register a memory model when XDP is enabled Jesper Dangaard Brouer
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

The generic xdp_frame format was inspired by cpumap's own internal
xdp_pkt format.  It is now time to convert cpumap over to the generic
xdp_frame format.  The cpumap needs one extra field, dev_rx.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h   |    1 +
 kernel/bpf/cpumap.c |  100 ++++++++++++++-------------------------------------
 2 files changed, 29 insertions(+), 72 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 13f71a15c79f..bc0cb97e20dc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -68,6 +68,7 @@ struct xdp_frame {
 	 * while mem info is valid on remote CPU.
 	 */
 	struct xdp_mem_info mem;
+	struct net_device *dev_rx; /* used by cpumap */
 };
 
 /* Convert xdp_buff to xdp_frame */
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 3e4bbcbe3e86..bcdc4dea5ce7 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -159,52 +159,8 @@ static void cpu_map_kthread_stop(struct work_struct *work)
 	kthread_stop(rcpu->kthread);
 }
 
-/* For now, xdp_pkt is a cpumap internal data structure, with info
- * carried between enqueue to dequeue. It is mapped into the top
- * headroom of the packet, to avoid allocating separate mem.
- */
-struct xdp_pkt {
-	void *data;
-	u16 len;
-	u16 headroom;
-	u16 metasize;
-	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
-	 * while mem info is valid on remote CPU.
-	 */
-	struct xdp_mem_info mem;
-	struct net_device *dev_rx;
-};
-
-/* Convert xdp_buff to xdp_pkt */
-static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
-{
-	struct xdp_pkt *xdp_pkt;
-	int metasize;
-	int headroom;
-
-	/* Assure headroom is available for storing info */
-	headroom = xdp->data - xdp->data_hard_start;
-	metasize = xdp->data - xdp->data_meta;
-	metasize = metasize > 0 ? metasize : 0;
-	if (unlikely((headroom - metasize) < sizeof(*xdp_pkt)))
-		return NULL;
-
-	/* Store info in top of packet */
-	xdp_pkt = xdp->data_hard_start;
-
-	xdp_pkt->data = xdp->data;
-	xdp_pkt->len  = xdp->data_end - xdp->data;
-	xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
-	xdp_pkt->metasize = metasize;
-
-	/* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
-	xdp_pkt->mem = xdp->rxq->mem;
-
-	return xdp_pkt;
-}
-
 static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
-					 struct xdp_pkt *xdp_pkt)
+					 struct xdp_frame *xdpf)
 {
 	unsigned int frame_size;
 	void *pkt_data_start;
@@ -219,7 +175,7 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 	 * would be preferred to set frame_size to 2048 or 4096
 	 * depending on the driver.
 	 *   frame_size = 2048;
-	 *   frame_len  = frame_size - sizeof(*xdp_pkt);
+	 *   frame_len  = frame_size - sizeof(*xdp_frame);
 	 *
 	 * Instead, with info avail, skb_shared_info in placed after
 	 * packet len.  This, unfortunately fakes the truesize.
@@ -227,21 +183,21 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
 	 * is not at a fixed memory location, with mixed length
 	 * packets, which is bad for cache-line hotness.
 	 */
-	frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom +
+	frame_size = SKB_DATA_ALIGN(xdpf->len) + xdpf->headroom +
 		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;
+	pkt_data_start = xdpf->data - xdpf->headroom;
 	skb = build_skb(pkt_data_start, frame_size);
 	if (!skb)
 		return NULL;
 
-	skb_reserve(skb, xdp_pkt->headroom);
-	__skb_put(skb, xdp_pkt->len);
-	if (xdp_pkt->metasize)
-		skb_metadata_set(skb, xdp_pkt->metasize);
+	skb_reserve(skb, xdpf->headroom);
+	__skb_put(skb, xdpf->len);
+	if (xdpf->metasize)
+		skb_metadata_set(skb, xdpf->metasize);
 
 	/* Essential SKB info: protocol and skb->dev */
-	skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);
+	skb->protocol = eth_type_trans(skb, xdpf->dev_rx);
 
 	/* Optional SKB info, currently missing:
 	 * - HW checksum info		(skb->ip_summed)
@@ -259,11 +215,11 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 	 * invoked cpu_map_kthread_stop(). Catch any broken behaviour
 	 * gracefully and warn once.
 	 */
-	struct xdp_pkt *xdp_pkt;
+	struct xdp_frame *xdpf;
 
-	while ((xdp_pkt = ptr_ring_consume(ring)))
-		if (WARN_ON_ONCE(xdp_pkt))
-			xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+	while ((xdpf = ptr_ring_consume(ring)))
+		if (WARN_ON_ONCE(xdpf))
+			xdp_return_frame(xdpf->data, &xdpf->mem);
 }
 
 static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
@@ -290,7 +246,7 @@ static int cpu_map_kthread_run(void *data)
 	 */
 	while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
 		unsigned int processed = 0, drops = 0, sched = 0;
-		struct xdp_pkt *xdp_pkt;
+		struct xdp_frame *xdpf;
 
 		/* Release CPU reschedule checks */
 		if (__ptr_ring_empty(rcpu->queue)) {
@@ -313,13 +269,13 @@ static int cpu_map_kthread_run(void *data)
 		 * kthread CPU pinned. Lockless access to ptr_ring
 		 * consume side valid as no-resize allowed of queue.
 		 */
-		while ((xdp_pkt = __ptr_ring_consume(rcpu->queue))) {
+		while ((xdpf = __ptr_ring_consume(rcpu->queue))) {
 			struct sk_buff *skb;
 			int ret;
 
-			skb = cpu_map_build_skb(rcpu, xdp_pkt);
+			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
-				xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
+				xdp_return_frame(xdpf->data, &xdpf->mem);
 				continue;
 			}
 
@@ -616,13 +572,13 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 	spin_lock(&q->producer_lock);
 
 	for (i = 0; i < bq->count; i++) {
-		struct xdp_pkt *xdp_pkt = bq->q[i];
+		struct xdp_frame *xdpf = bq->q[i];
 		int err;
 
-		err = __ptr_ring_produce(q, xdp_pkt);
+		err = __ptr_ring_produce(q, xdpf);
 		if (err) {
 			drops++;
-			xdp_return_frame(xdp_pkt->data, &xdp_pkt->mem);
+			xdp_return_frame(xdpf->data, &xdpf->mem);
 		}
 		processed++;
 	}
@@ -637,7 +593,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 /* Runs under RCU-read-side, plus in softirq under NAPI protection.
  * Thus, safe percpu variable access.
  */
-static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_pkt *xdp_pkt)
+static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf)
 {
 	struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq);
 
@@ -648,28 +604,28 @@ static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_pkt *xdp_pkt)
 	 * driver to code invoking us to finished, due to driver
 	 * (e.g. ixgbe) recycle tricks based on page-refcnt.
 	 *
-	 * Thus, incoming xdp_pkt is always queued here (else we race
+	 * Thus, incoming xdp_frame is always queued here (else we race
 	 * with another CPU on page-refcnt and remaining driver code).
 	 * Queue time is very short, as driver will invoke flush
 	 * operation, when completing napi->poll call.
 	 */
-	bq->q[bq->count++] = xdp_pkt;
+	bq->q[bq->count++] = xdpf;
 	return 0;
 }
 
 int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp,
 		    struct net_device *dev_rx)
 {
-	struct xdp_pkt *xdp_pkt;
+	struct xdp_frame *xdpf;
 
-	xdp_pkt = convert_to_xdp_pkt(xdp);
-	if (unlikely(!xdp_pkt))
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
 		return -EOVERFLOW;
 
 	/* Info needed when constructing SKB on remote CPU */
-	xdp_pkt->dev_rx = dev_rx;
+	xdpf->dev_rx = dev_rx;
 
-	bq_enqueue(rcpu, xdp_pkt);
+	bq_enqueue(rcpu, xdpf);
 	return 0;
 }
 


* [bpf-next V5 PATCH 09/15] mlx5: register a memory model when XDP is enabled
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (7 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 16:18   ` Sergei Shtylyov
  2018-03-23 12:18 ` [bpf-next V5 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping Jesper Dangaard Brouer
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
This enables a different memory model, thus activating another code path
in the xdp_return_frame API.

V2: Fixed issues pointed out by Tariq.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index da94c8cba5ee..2e4ca0f15b62 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -506,6 +506,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
+	/* This must only be activate for order-0 pages */
+	if (rq->xdp_prog) {
+		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+						 MEM_TYPE_PAGE_ORDER0, NULL);
+		if (err)
+			goto err_rq_wq_destroy;
+	}
+
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
 


* [bpf-next V5 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (8 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 09/15] mlx5: register a memory model when XDP is enabled Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 16:56   ` Alexander Duyck
  2018-03-23 12:18 ` [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code Jesper Dangaard Brouer
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, Björn Töpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Use the IDA infrastructure for getting a cyclically increasing ID number,
which is used for keeping track of each registered allocator per
RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
uses a radix tree, use a dynamic rhashtable for creating the ID-to-pointer
lookup table, because this is faster.

The problem being solved here is that the xdp_rxq_info
pointer (stored in xdp_buff) cannot be used directly, as its
guaranteed lifetime is too short.  The info is needed on a
(potentially) remote CPU at DMA-TX completion time.  The xdp_mem_info
is stored in the xdp_frame when it gets converted from an xdp_buff,
which is sufficient for the simple page-refcnt based recycle schemes.

For more advanced allocators there is a need to store a pointer to the
registered allocator.  Thus, there is a need to guard the lifetime or
validity of the allocator pointer, which is done through this
rhashtable ID-to-pointer map.  The removal and validity of the
allocator and the helper struct xdp_mem_allocator are guarded by RCU.
The allocator will be created by the driver, and registered with
xdp_rxq_info_reg_mem_model().

It is up for debate who is responsible for freeing the allocator
pointer or invoking the allocator destructor function.  In any case,
this must happen via RCU freeing.
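
For reference, the fast-path lookup at frame-return time then looks
roughly like this (sketch of the idea, matching the xdp_return_frame()
change in net/core/xdp.c below; the allocator pointer must only be
dereferenced under rcu_read_lock()):

  struct xdp_mem_allocator *xa;

  rcu_read_lock();
  xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
  /* This patch only performs the lookup; a later patch in the series
   * hands pages back to xa->allocator (e.g. page_pool) instead of
   * falling back to page_frag_free()/put_page().
   */
  rcu_read_unlock();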

V4: Per req of Jason Wang
- Use xdp_rxq_info_reg_mem_model() in all drivers implementing
  XDP_REDIRECT, even though it's not strictly necessary when
  allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    9 +
 drivers/net/tun.c                             |    6 +
 drivers/net/virtio_net.c                      |    7 +
 include/net/xdp.h                             |   15 --
 net/core/xdp.c                                |  230 ++++++++++++++++++++++++-
 5 files changed, 248 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 45520eb503ee..ff069597fccf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6360,7 +6360,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 	struct device *dev = rx_ring->dev;
 	int orig_node = dev_to_node(dev);
 	int ring_node = -1;
-	int size;
+	int size, err;
 
 	size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count;
 
@@ -6397,6 +6397,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
 			     rx_ring->queue_index) < 0)
 		goto err;
 
+	err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
+					 MEM_TYPE_PAGE_SHARED, NULL);
+	if (err) {
+		xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
+		goto err;
+	}
+
 	rx_ring->xdp_prog = adapter->xdp_prog;
 
 	return 0;
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 6750980d9f30..81fddf9cc58f 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -846,6 +846,12 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
 				       tun->dev, tfile->queue_index);
 		if (err < 0)
 			goto out;
+		err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&tfile->xdp_rxq);
+			goto out;
+		}
 		err = 0;
 	}
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6c4220450506..48c86accd3b8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1312,6 +1312,13 @@ static int virtnet_open(struct net_device *dev)
 		if (err < 0)
 			return err;
 
+		err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq,
+						 MEM_TYPE_PAGE_SHARED, NULL);
+		if (err < 0) {
+			xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
+			return err;
+		}
+
 		virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
 		virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
 	}
diff --git a/include/net/xdp.h b/include/net/xdp.h
index bc0cb97e20dc..859aa9b737fe 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -41,7 +41,7 @@ enum mem_type {
 
 struct xdp_mem_info {
 	u32 type; /* enum mem_type, but known size type */
-	/* u32 id; will be added later in this patchset */
+	u32 id;
 };
 
 struct xdp_rxq_info {
@@ -100,18 +100,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-static inline
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
-{
-	if (mem->type == MEM_TYPE_PAGE_SHARED)
-		page_frag_free(data);
-
-	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
-		struct page *page = virt_to_page(data); /* Assumes order0 page*/
-
-		put_page(page);
-	}
-}
+void xdp_return_frame(void *data, struct xdp_mem_info *mem);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 9eee0c431126..06a5b39491ad 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -5,6 +5,9 @@
  */
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <linux/rhashtable.h>
 
 #include <net/xdp.h>
 
@@ -13,6 +16,99 @@
 #define REG_STATE_UNREGISTERED	0x2
 #define REG_STATE_UNUSED	0x3
 
+DEFINE_IDA(mem_id_pool);
+static DEFINE_MUTEX(mem_id_lock);
+#define MEM_ID_MAX 0xFFFE
+#define MEM_ID_MIN 1
+static int mem_id_next = MEM_ID_MIN;
+
+static bool mem_id_init; /* false */
+static struct rhashtable *mem_id_ht;
+
+struct xdp_mem_allocator {
+	struct xdp_mem_info mem;
+	void *allocator;
+	struct rhash_head node;
+	struct rcu_head rcu;
+};
+
+static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
+{
+	const u32 *k = data;
+	const u32 key = *k;
+
+	BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_mem_allocator, mem.id)
+		     != sizeof(u32));
+
+	/* Use cyclic increasing ID as direct hash key, see rht_bucket_index */
+	return key << RHT_HASH_RESERVED_SPACE;
+}
+
+static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg,
+			  const void *ptr)
+{
+	const struct xdp_mem_allocator *xa = ptr;
+	u32 mem_id = *(u32 *)arg->key;
+
+	return xa->mem.id != mem_id;
+}
+
+static const struct rhashtable_params mem_id_rht_params = {
+	.nelem_hint = 64,
+	.head_offset = offsetof(struct xdp_mem_allocator, node),
+	.key_offset  = offsetof(struct xdp_mem_allocator, mem.id),
+	.key_len = FIELD_SIZEOF(struct xdp_mem_allocator, mem.id),
+	.max_size = MEM_ID_MAX,
+	.min_size = 8,
+	.automatic_shrinking = true,
+	.hashfn    = xdp_mem_id_hashfn,
+	.obj_cmpfn = xdp_mem_id_cmp,
+};
+
+void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
+{
+	struct xdp_mem_allocator *xa;
+
+	xa = container_of(rcu, struct xdp_mem_allocator, rcu);
+
+	/* Allow this ID to be reused */
+	ida_simple_remove(&mem_id_pool, xa->mem.id);
+
+	/* TODO: Depending on allocator type/pointer free resources */
+
+	/* Poison memory */
+	xa->mem.id = 0xFFFF;
+	xa->mem.type = 0xF0F0;
+	xa->allocator = (void *)0xDEAD9001;
+
+	kfree(xa);
+}
+
+void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
+{
+	struct xdp_mem_allocator *xa;
+	int id = xdp_rxq->mem.id;
+	int err;
+
+	if (id == 0)
+		return;
+
+	mutex_lock(&mem_id_lock);
+
+	xa = rhashtable_lookup(mem_id_ht, &id, mem_id_rht_params);
+	if (!xa) {
+		mutex_unlock(&mem_id_lock);
+		return;
+	}
+
+	err = rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params);
+	WARN_ON(err);
+
+	call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
+
+	mutex_unlock(&mem_id_lock);
+}
+
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 {
 	/* Simplify driver cleanup code paths, allow unreg "unused" */
@@ -21,8 +117,14 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 
 	WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
 
+	__xdp_rxq_info_unreg_mem_model(xdp_rxq);
+
 	xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
 	xdp_rxq->dev = NULL;
+
+	/* Reset mem info to defaults */
+	xdp_rxq->mem.id = 0;
+	xdp_rxq->mem.type = 0;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
 
@@ -72,20 +174,138 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
 
+int __mem_id_init_hash_table(void)
+{
+	struct rhashtable *rht;
+	int ret;
+
+	if (unlikely(mem_id_init))
+		return 0;
+
+	rht = kzalloc(sizeof(*rht), GFP_KERNEL);
+	if (!rht)
+		return -ENOMEM;
+
+	ret = rhashtable_init(rht, &mem_id_rht_params);
+	if (ret < 0) {
+		kfree(rht);
+		return ret;
+	}
+	mem_id_ht = rht;
+	smp_mb(); /* mutex lock should provide enough pairing */
+	mem_id_init = true;
+
+	return 0;
+}
+
+/* Allocate a cyclic ID that maps to allocator pointer.
+ * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
+ *
+ * Caller must lock mem_id_lock.
+ */
+static int __mem_id_cyclic_get(gfp_t gfp)
+{
+	int retries = 1;
+	int id;
+
+again:
+	id = ida_simple_get(&mem_id_pool, mem_id_next, MEM_ID_MAX, gfp);
+	if (id < 0) {
+		if (id == -ENOSPC) {
+			/* Cyclic allocator, reset next id */
+			if (retries--) {
+				mem_id_next = MEM_ID_MIN;
+				goto again;
+			}
+		}
+		return id; /* errno */
+	}
+	mem_id_next = id + 1;
+
+	return id;
+}
+
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 			       enum mem_type type, void *allocator)
 {
+	struct xdp_mem_allocator *xdp_alloc;
+	gfp_t gfp = GFP_KERNEL;
+	int id, errno, ret;
+	void *ptr;
+
+	if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
+		WARN(1, "Missing register, driver bug");
+		return -EFAULT;
+	}
+
 	if (type >= MEM_TYPE_MAX)
 		return -EINVAL;
 
 	xdp_rxq->mem.type = type;
 
-	if (allocator)
-		return -EOPNOTSUPP;
+	if (!allocator)
+		return 0;
+
+	/* Delay init of rhashtable to save memory if feature isn't used */
+	if (!mem_id_init) {
+		mutex_lock(&mem_id_lock);
+		ret = __mem_id_init_hash_table();
+		mutex_unlock(&mem_id_lock);
+		if (ret < 0) {
+			WARN_ON(1);
+			return ret;
+		}
+	}
+
+	xdp_alloc = kzalloc(sizeof(*xdp_alloc), gfp);
+	if (!xdp_alloc)
+		return -ENOMEM;
+
+	mutex_lock(&mem_id_lock);
+	id = __mem_id_cyclic_get(gfp);
+	if (id < 0) {
+		errno = id;
+		goto err;
+	}
+	xdp_rxq->mem.id = id;
+	xdp_alloc->mem  = xdp_rxq->mem;
+	xdp_alloc->allocator = allocator;
+
+	/* Insert allocator into ID lookup table */
+	ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node);
+	if (IS_ERR(ptr)) {
+		errno = PTR_ERR(ptr);
+		goto err;
+	}
+
+	mutex_unlock(&mem_id_lock);
 
-	/* TODO: Allocate an ID that maps to allocator pointer
-	 * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
-	 */
 	return 0;
+err:
+	mutex_unlock(&mem_id_lock);
+	kfree(xdp_alloc);
+	return errno;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
+
+void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+{
+	struct xdp_mem_allocator *xa;
+
+	rcu_read_lock();
+	if (mem->id)
+		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+	rcu_read_unlock();
+
+	if (mem->type == MEM_TYPE_PAGE_SHARED) {
+		page_frag_free(data);
+		return;
+	}
+
+	if (mem->type == MEM_TYPE_PAGE_ORDER0) {
+		struct page *page = virt_to_page(data); /* Assumes order0 page*/
+
+		put_page(page);
+	}
+}
+EXPORT_SYMBOL_GPL(xdp_return_frame);

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (9 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 13:28   ` Eric Dumazet
                     ` (2 more replies)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame Jesper Dangaard Brouer
                   ` (3 subsequent siblings)
  14 siblings, 3 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

The ndo_xdp_xmit API needs a fast page recycle mechanism for returning
pages at DMA-TX completion time, one with good cross-CPU performance,
given DMA-TX completion can happen on a remote CPU.

Refurbish my page_pool code, which was presented[1] at MM-summit 2016.
The page_pool code is adapted to not depend on the page allocator or on
integration into struct page.  The DMA mapping feature is kept, even
though it will not be activated/used in this patchset.

[1] http://people.netfilter.org/hawk/presentations/MM-summit2016/generic_page_pool_mm_summit2016.pdf
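
To illustrate the intended driver-side usage, here is a minimal sketch.
The example_* names and the constants chosen are purely illustrative;
only the page_pool_* calls and the struct page_pool_params fields come
from this patch:

#include <linux/err.h>
#include <net/page_pool.h>

/* Illustrative only: one page_pool per RX-queue, allocation side
 * protected by the driver's NAPI context.
 */
static struct page_pool *example_rxq_pool_create(struct device *dev, int nid)
{
	struct page_pool_params pp_params = { 0 };
	struct page_pool *pool;

	pp_params.size      = PAGE_POOL_PARAMS_SIZE;
	pp_params.order     = 0;              /* order-0 pages */
	pp_params.flags     = 0;              /* no DMA mapping by page_pool */
	pp_params.dev       = dev;
	pp_params.nid       = nid;
	pp_params.dma_dir   = DMA_FROM_DEVICE;
	pp_params.pool_size = 1024;           /* recycle ptr_ring size */

	pool = page_pool_create(&pp_params);
	if (IS_ERR(pool))                     /* no NULL returns, only ERR_PTR() */
		return NULL;
	return pool;
}

/* RX refill: replaces dev_alloc_pages() */
static struct page *example_rx_alloc(struct page_pool *pool)
{
	return page_pool_dev_alloc_pages(pool);
}

/* Return path, e.g. DMA-TX completion, possibly on a remote CPU */
static void example_page_return(struct page_pool *pool, struct page *page)
{
	page_pool_put_page(pool, page);
}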

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes, don't return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.

V4: many small improvements and cleanups
- Add DOC comment section, that can be used by kernel-doc
- Improve fallback mode, to work better with refcnt based recycling
  e.g. remove a WARN as pointed out by Tariq
  e.g. quicker fallback if ptr_ring is empty.

V5: Fixed SPDX license as pointed out by Alexei

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/page_pool.h |  133 +++++++++++++++++++
 net/core/Makefile       |    1 
 net/core/page_pool.c    |  329 +++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 463 insertions(+)
 create mode 100644 include/net/page_pool.h
 create mode 100644 net/core/page_pool.c

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
new file mode 100644
index 000000000000..4486fedbf180
--- /dev/null
+++ b/include/net/page_pool.h
@@ -0,0 +1,133 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool.h
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+
+/**
+ * DOC: page_pool allocator
+ *
+ * This page_pool allocator is optimized for the XDP mode that
+ * uses one-frame-per-page, but it has fallbacks that act like the
+ * regular page allocator APIs.
+ *
+ * Basic use involves replacing alloc_pages() calls with the
+ * page_pool_alloc_pages() call.  Drivers should likely use
+ * page_pool_dev_alloc_pages() in place of dev_alloc_pages().
+ *
+ * If page_pool handles DMA mapping (using page->private), then the API
+ * user is responsible for invoking page_pool_put_page() once.  In case
+ * of an elevated refcnt, the DMA state is released, assuming other
+ * users of the page will eventually call put_page().
+ *
+ * If no DMA mapping is done, then it can act as a shim-layer that
+ * falls through to alloc_pages().  As no state is kept on the page,
+ * the regular put_page() call is sufficient.
+ */
+#ifndef _NET_PAGE_POOL_H
+#define _NET_PAGE_POOL_H
+
+#include <linux/mm.h> /* Needed by ptr_ring */
+#include <linux/ptr_ring.h>
+#include <linux/dma-direction.h>
+
+#define PP_FLAG_DMA_MAP 1 /* Should page_pool do the DMA map/unmap */
+#define PP_FLAG_ALL	PP_FLAG_DMA_MAP
+
+/*
+ * Fast allocation side cache array/stack
+ *
+ * The cache size and refill watermark is related to the network
+ * use-case.  The NAPI budget is 64 packets.  After a NAPI poll the RX
+ * ring is usually refilled and the max consumed elements will be 64,
+ * thus a natural max size of objects needed in the cache.
+ *
+ * Keeping room for more objects is due to the XDP_DROP use-case.  As
+ * XDP_DROP allows the opportunity to recycle objects directly into
+ * this array, because it shares the same softirq/NAPI protection.  If
+ * the cache is already full (or partly full), then the XDP_DROP
+ * recycles would have to take a slower code path.
+ */
+#define PP_ALLOC_CACHE_SIZE	128
+#define PP_ALLOC_CACHE_REFILL	64
+struct pp_alloc_cache {
+	u32 count ____cacheline_aligned_in_smp;
+	void *cache[PP_ALLOC_CACHE_SIZE];
+};
+
+struct page_pool_params {
+	u32		size; /* caller sets size of struct */
+	unsigned int	order;
+	unsigned long	flags;
+	struct device	*dev; /* device, for DMA pre-mapping purposes */
+	int		nid;  /* Numa node id to allocate pages from */
+	enum dma_data_direction dma_dir; /* DMA mapping direction */
+	unsigned int	pool_size;
+	char		end_marker[0]; /* must be last struct member */
+};
+#define	PAGE_POOL_PARAMS_SIZE	offsetof(struct page_pool_params, end_marker)
+
+struct page_pool {
+	struct page_pool_params p;
+
+	/*
+	 * Data structure for allocation side
+	 *
+	 * The driver's allocation side usually already performs some
+	 * kind of resource protection.  Piggyback on this protection,
+	 * and require the driver to protect the allocation side.
+	 *
+	 * For NIC drivers this means: allocate a page_pool per
+	 * RX-queue, as the RX-queue is already protected by
+	 * softirq/BH scheduling and napi_schedule.  NAPI scheduling
+	 * guarantees that a single napi_struct will only be scheduled
+	 * on a single CPU (see napi_schedule).
+	 */
+	struct pp_alloc_cache alloc;
+
+	/* Data structure for storing recycled pages.
+	 *
+	 * Returning/freeing pages is more complicated synchronization
+	 * wise, because frees can happen on remote CPUs, with no
+	 * association with the allocation resource.
+	 *
+	 * Use ptr_ring, as it separates consumer and producer
+	 * efficiently, in a way that doesn't bounce cache-lines.
+	 *
+	 * TODO: Implement bulk return pages into this structure.
+	 */
+	struct ptr_ring ring;
+
+	struct rcu_head rcu;
+};
+
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
+
+static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
+{
+	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
+
+	return page_pool_alloc_pages(pool, gfp);
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params);
+
+void page_pool_destroy_rcu(struct page_pool *pool);
+
+/* Never call this directly, use helpers below */
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct);
+
+static inline void page_pool_put_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_put_page(pool, page, false);
+}
+/* Very limited use-cases allow recycle direct */
+static inline void page_pool_recycle_direct(struct page_pool *pool,
+					    struct page *page)
+{
+	__page_pool_put_page(pool, page, true);
+}
+
+#endif /* _NET_PAGE_POOL_H */
diff --git a/net/core/Makefile b/net/core/Makefile
index 6dbbba8c57ae..100a2b3b2a08 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -14,6 +14,7 @@ obj-y		     += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
 			fib_notifier.o xdp.o
 
 obj-y += net-sysfs.o
+obj-y += page_pool.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
 obj-$(CONFIG_NET_PKTGEN) += pktgen.o
 obj-$(CONFIG_NETPOLL) += netpoll.o
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
new file mode 100644
index 000000000000..351fdfb1d872
--- /dev/null
+++ b/net/core/page_pool.c
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * page_pool.c
+ *	Author:	Jesper Dangaard Brouer <netoptimizer@brouer.com>
+ *	Copyright (C) 2016 Red Hat, Inc.
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+
+#include <net/page_pool.h>
+#include <linux/dma-direction.h>
+#include <linux/dma-mapping.h>
+#include <linux/page-flags.h>
+#include <linux/mm.h> /* for __put_page() */
+
+int page_pool_init(struct page_pool *pool,
+		   const struct page_pool_params *params)
+{
+	int ring_qsize = 1024; /* Default */
+	int param_copy_sz;
+
+	if (!pool)
+		return -EFAULT;
+
+	/* Note, the struct compat code below was primarily needed when
+	 * the page_pool code lived under MM-tree control, given mmots and
+	 * net-next trees progress at very different rates.
+	 *
+	 * Allow kernel devel trees and drivers to progress at different rates.
+	 */
+	param_copy_sz = PAGE_POOL_PARAMS_SIZE;
+	memset(&pool->p, 0, param_copy_sz);
+	if (params->size < param_copy_sz) {
+		/* Older module calling newer kernel, handled by only
+		 * copying supplied size, and keep remaining params zero
+		 */
+		param_copy_sz = params->size;
+	} else if (params->size > param_copy_sz) {
+		/* Newer module calling older kernel. Need to validate
+		 * no new features were requested.
+		 */
+		unsigned char *addr = (unsigned char *)params + param_copy_sz;
+		unsigned char *end  = (unsigned char *)params + params->size;
+
+		for (; addr < end; addr++) {
+			if (*addr != 0)
+				return -E2BIG;
+		}
+	}
+	memcpy(&pool->p, params, param_copy_sz);
+
+	/* Validate only known flags were used */
+	if (pool->p.flags & ~(PP_FLAG_ALL))
+		return -EINVAL;
+
+	if (pool->p.pool_size)
+		ring_qsize = pool->p.pool_size;
+
+	if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
+		return -ENOMEM;
+
+	/* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
+	 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,
+	 * which is the XDP_TX use-case.
+	 */
+	if ((pool->p.dma_dir != DMA_FROM_DEVICE) &&
+	    (pool->p.dma_dir != DMA_BIDIRECTIONAL))
+		return -EINVAL;
+
+	return 0;
+}
+
+struct page_pool *page_pool_create(const struct page_pool_params *params)
+{
+	struct page_pool *pool;
+	int err = 0;
+
+	if (params->size < offsetof(struct page_pool_params, nid)) {
+		WARN(1, "Fix page_pool_params->size code\n");
+		return ERR_PTR(-EBADR);
+	}
+
+	pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, params->nid);
+	err = page_pool_init(pool, params);
+	if (err < 0) {
+		pr_warn("%s() gave up with errno %d\n", __func__, err);
+		kfree(pool);
+		return ERR_PTR(err);
+	}
+	return pool;
+}
+EXPORT_SYMBOL(page_pool_create);
+
+/* fast path */
+static struct page *__page_pool_get_cached(struct page_pool *pool)
+{
+	struct ptr_ring *r = &pool->ring;
+	struct page *page;
+
+	/* Quicker fallback, avoid locks when ring is empty */
+	if (__ptr_ring_empty(r))
+		return NULL;
+
+	/* Test for safe-context, caller should provide this guarantee */
+	if (likely(in_serving_softirq())) {
+		if (likely(pool->alloc.count)) {
+			/* Fast-path */
+			page = pool->alloc.cache[--pool->alloc.count];
+			return page;
+		}
+		/* Slower-path: Alloc array empty, time to refill
+		 *
+		 * Open-coded bulk ptr_ring consumer.
+		 *
+		 * Discussion: the ring consumer lock is not really
+		 * needed due to the softirq/NAPI protection, but
+		 * later need the ability to reclaim pages on the
+		 * ring. Thus, keeping the locks.
+		 */
+		spin_lock(&r->consumer_lock);
+		while ((page = __ptr_ring_consume(r))) {
+			if (pool->alloc.count == PP_ALLOC_CACHE_REFILL)
+				break;
+			pool->alloc.cache[pool->alloc.count++] = page;
+		}
+		spin_unlock(&r->consumer_lock);
+		return page;
+	}
+
+	/* Slow-path: Get page from locked ring queue */
+	page = ptr_ring_consume(&pool->ring);
+	return page;
+}
+
+/* slow path */
+noinline
+static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
+						 gfp_t _gfp)
+{
+	struct page *page;
+	gfp_t gfp = _gfp;
+	dma_addr_t dma;
+
+	/* We could always set __GFP_COMP, and avoid this branch, as
+	 * prep_new_page() can handle order-0 with __GFP_COMP.
+	 */
+	if (pool->p.order)
+		gfp |= __GFP_COMP;
+
+	/* FUTURE development:
+	 *
+	 * Current slow-path essentially falls back to single page
+	 * allocations, which doesn't improve performance.  This code
+	 * need bulk allocation support from the page allocator code.
+	 */
+
+	/* Cache was empty, do real allocation */
+	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+	if (!page)
+		return NULL;
+
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		goto skip_dma_map;
+
+	/* Setup DMA mapping: use page->private for DMA-addr
+	 * This mapping is kept for lifetime of page, until leaving pool.
+	 */
+	dma = dma_map_page(pool->p.dev, page, 0,
+			   (PAGE_SIZE << pool->p.order),
+			   pool->p.dma_dir);
+	if (dma_mapping_error(pool->p.dev, dma)) {
+		put_page(page);
+		return NULL;
+	}
+	set_page_private(page, dma); /* page->private = dma; */
+
+skip_dma_map:
+	/* When a page is just alloc'ed, it should/must have refcnt == 1. */
+	return page;
+}
+
+/* For using page_pool replace: alloc_pages() API calls, but provide
+ * synchronization guarantee for allocation side.
+ */
+struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
+{
+	struct page *page;
+
+	/* Fast-path: Get a page from cache */
+	page = __page_pool_get_cached(pool);
+	if (page)
+		return page;
+
+	/* Slow-path: cache empty, do real allocation */
+	page = __page_pool_alloc_pages_slow(pool, gfp);
+	return page;
+}
+EXPORT_SYMBOL(page_pool_alloc_pages);
+
+/* Cleanup page_pool state from page */
+static void __page_pool_clean_page(struct page_pool *pool,
+				   struct page *page)
+{
+	if (!(pool->p.flags & PP_FLAG_DMA_MAP))
+		return;
+
+	/* DMA unmap */
+	dma_unmap_page(pool->p.dev, page_private(page),
+		       PAGE_SIZE << pool->p.order, pool->p.dma_dir);
+	set_page_private(page, 0);
+}
+
+/* Return a page to the page allocator, cleaning up our state */
+static void __page_pool_return_page(struct page_pool *pool, struct page *page)
+{
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+	/* An optimization would be to call __free_pages(page, pool->p.order)
+	 * knowing page is not part of page-cache (thus avoiding a
+	 * __page_cache_release() call).
+	 */
+}
+
+bool __page_pool_recycle_into_ring(struct page_pool *pool,
+				   struct page *page)
+{
+	int ret;
+	/* BH protection not needed if current is serving softirq */
+	if (in_serving_softirq())
+		ret = ptr_ring_produce(&pool->ring, page);
+	else
+		ret = ptr_ring_produce_bh(&pool->ring, page);
+
+	return (ret == 0) ? true : false;
+}
+
+/* Only allow direct recycling in special circumstances, into the
+ * alloc side cache.  E.g. during RX-NAPI processing for XDP_DROP use-case.
+ *
+ * Caller must provide appropriate safe context.
+ */
+static bool __page_pool_recycle_direct(struct page *page,
+				       struct page_pool *pool)
+{
+	if (unlikely(pool->alloc.count == PP_ALLOC_CACHE_SIZE))
+		return false;
+
+	/* Caller MUST have verified/know (page_ref_count(page) == 1) */
+	pool->alloc.cache[pool->alloc.count++] = page;
+	return true;
+}
+
+void __page_pool_put_page(struct page_pool *pool,
+			  struct page *page, bool allow_direct)
+{
+	/* This allocator is optimized for the XDP mode that uses
+	 * one-frame-per-page, but has fallbacks that act like the
+	 * regular page allocator APIs.
+	 *
+	 * refcnt == 1 means page_pool owns page, and can recycle it.
+	 */
+	if (likely(page_ref_count(page) == 1)) {
+		/* Read barrier done in page_ref_count / READ_ONCE */
+
+		if (allow_direct && in_serving_softirq())
+			if (__page_pool_recycle_direct(page, pool))
+				return;
+
+		if (!__page_pool_recycle_into_ring(pool, page)) {
+			/* Cache full, fallback to free pages */
+			__page_pool_return_page(pool, page);
+		}
+		return;
+	}
+	/* Fallback/non-XDP mode: API user has an elevated refcnt.
+	 *
+	 * Many drivers split up the page into fragments, and some
+	 * want to keep doing this to save memory and do refcnt based
+	 * recycling. Support this use case too, to ease drivers
+	 * switching between XDP/non-XDP.
+	 *
+	 * In case page_pool maintains the DMA mapping, the API user must
+	 * call page_pool_put_page() once.  In this elevated refcnt
+	 * case, the DMA is unmapped/released, as the driver is likely
+	 * doing refcnt-based recycle tricks, meaning another user
+	 * will be invoking put_page().
+	 */
+	__page_pool_clean_page(pool, page);
+	put_page(page);
+}
+EXPORT_SYMBOL(__page_pool_put_page);
+
+/* Cleanup and release resources */
+void __page_pool_destroy_rcu(struct rcu_head *rcu)
+{
+	struct page_pool *pool;
+	struct page *page;
+
+	pool = container_of(rcu, struct page_pool, rcu);
+
+	/* Empty alloc cache, assume caller made sure this is
+	 * no longer in use, and page_pool_alloc_pages() cannot be
+	 * called concurrently.
+	 */
+	while (pool->alloc.count) {
+		page = pool->alloc.cache[--pool->alloc.count];
+		__page_pool_return_page(pool, page);
+	}
+
+	/* Empty recycle ring */
+	while ((page = ptr_ring_consume(&pool->ring))) {
+		/* Verify the refcnt invariant of cached pages */
+		if (!(page_ref_count(page) == 1)) {
+			pr_crit("%s() page_pool refcnt %d violation\n",
+				__func__, page_ref_count(page));
+			WARN_ON(1);
+		}
+		__page_pool_return_page(pool, page);
+	}
+	ptr_ring_cleanup(&pool->ring, NULL);
+	kfree(pool);
+}
+
+void page_pool_destroy_rcu(struct page_pool *pool)
+{
+	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
+}
+EXPORT_SYMBOL(page_pool_destroy_rcu);

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [bpf-next V5 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (10 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code Jesper Dangaard Brouer
@ 2018-03-23 12:18 ` Jesper Dangaard Brouer
  2018-03-23 17:02   ` Alexander Duyck
  2018-03-23 12:19 ` [bpf-next V5 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call Jesper Dangaard Brouer
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:18 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.

The registered allocator page_pool pointer is not available directly
from xdp_rxq_info, but it could be (if needed).  For now, the driver
should keep separate track of the page_pool pointer, which it should
use for RX-ring page allocation.

As suggested by Saeed, to maintain a symmetric API it is the driver's
responsibility to allocate/create and free/destroy the page_pool.
Thus, after the driver has called xdp_rxq_info_unreg(), it is the
driver's responsibility to free the page_pool, but with an RCU free
call.  This is easily done via the page_pool helper
page_pool_destroy_rcu() (which avoids touching any driver code during
the RCU callback, which could happen after the driver has been unloaded).
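
Below is a condensed sketch of the expected driver call order.  struct
example_rq and its members are stand-ins for driver state (mlx5 uses
rq->xdp_rxq and rq->page_pool in the next patch); the xdp_* and
page_pool_* functions are the ones introduced in this patchset:

#include <net/xdp.h>
#include <net/page_pool.h>

struct example_rq {				/* stand-in for driver RX state */
	struct xdp_rxq_info xdp_rxq;
	struct page_pool *page_pool;		/* created via page_pool_create() */
};

static int example_rxq_setup(struct example_rq *rq, struct net_device *dev,
			     u32 queue_index)
{
	int err;

	err = xdp_rxq_info_reg(&rq->xdp_rxq, dev, queue_index);
	if (err)
		return err;

	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, MEM_TYPE_PAGE_POOL,
					 rq->page_pool);
	if (err)
		xdp_rxq_info_unreg(&rq->xdp_rxq);
	return err;
}

static void example_rxq_teardown(struct example_rq *rq)
{
	xdp_rxq_info_unreg(&rq->xdp_rxq);
	/* Driver owns the page_pool, but must free it via an RCU callback */
	page_pool_destroy_rcu(rq->page_pool);
}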

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 include/net/xdp.h |    3 +++
 net/core/xdp.c    |   23 ++++++++++++++++++++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 859aa9b737fe..98b55eaf8fd7 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -36,6 +36,7 @@
 enum mem_type {
 	MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
 	MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
+	MEM_TYPE_PAGE_POOL,
 	MEM_TYPE_MAX,
 };
 
@@ -44,6 +45,8 @@ struct xdp_mem_info {
 	u32 id;
 };
 
+struct page_pool;
+
 struct xdp_rxq_info {
 	struct net_device *dev;
 	u32 queue_index;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 06a5b39491ad..fe8e87abc266 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -8,6 +8,7 @@
 #include <linux/slab.h>
 #include <linux/idr.h>
 #include <linux/rhashtable.h>
+#include <net/page_pool.h>
 
 #include <net/xdp.h>
 
@@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht;
 
 struct xdp_mem_allocator {
 	struct xdp_mem_info mem;
-	void *allocator;
+	union {
+		void *allocator;
+		struct page_pool *page_pool;
+	};
 	struct rhash_head node;
 	struct rcu_head rcu;
 };
@@ -74,7 +78,9 @@ void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
 	/* Allow this ID to be reused */
 	ida_simple_remove(&mem_id_pool, xa->mem.id);
 
-	/* TODO: Depending on allocator type/pointer free resources */
+	/* Notice, driver is expected to free the *allocator,
+	 * e.g. page_pool, and MUST also use RCU free.
+	 */
 
 	/* Poison memory */
 	xa->mem.id = 0xFFFF;
@@ -290,11 +296,21 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
 void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 {
-	struct xdp_mem_allocator *xa;
+	struct xdp_mem_allocator *xa = NULL;
 
 	rcu_read_lock();
 	if (mem->id)
 		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
+
+	if (mem->type == MEM_TYPE_PAGE_POOL) {
+		struct page *page = virt_to_head_page(data);
+
+		if (xa)
+			page_pool_put_page(xa->page_pool, page);
+		else
+			put_page(page);
+		return;
+	}
 	rcu_read_unlock();
 
 	if (mem->type == MEM_TYPE_PAGE_SHARED) {
@@ -306,6 +322,7 @@ void xdp_return_frame(void *data, struct xdp_mem_info *mem)
 		struct page *page = virt_to_page(data); /* Assumes order0 page*/
 
 		put_page(page);
+		return;
 	}
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [bpf-next V5 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (11 preceding siblings ...)
  2018-03-23 12:18 ` [bpf-next V5 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame Jesper Dangaard Brouer
@ 2018-03-23 12:19 ` Jesper Dangaard Brouer
  2018-03-23 12:19 ` [bpf-next V5 PATCH 14/15] xdp: transition into using xdp_frame for return API Jesper Dangaard Brouer
  2018-03-23 12:19 ` [bpf-next V5 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit Jesper Dangaard Brouer
  14 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:19 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

This patch shows how it is possible to have both the driver-local page
cache, which uses an elevated refcnt for "catching"/avoiding that SKB
put_page returns the page through the page allocator, and at the
same time have pages getting returned to the page_pool from
ndo_xdp_xmit DMA completion.

The performance improvement for XDP_REDIRECT in this patch is really
good, especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() do per-frame operations of both an
rhashtable ID-lookup and a locked return into the (page_pool) ptr_ring.
(The plan is to remove these per-frame operations in a followup
patchset).

The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap).  The target/maximum
capability of ixgbe is 13Mpps (on this HW setup).

Before this patch for mlx5, XDP redirected frames were returned via
the page allocator.  The single-flow performance was 6Mpps, and if I
started two flows the collective performance dropped to 4Mpps, because
we hit the page allocator lock (further negative scaling occurs).

Two test scenarios need to be covered for the xdp_return_frame API,
namely DMA-TX completion running same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps and cross-CPU=12Mpps.  This is very
close to our 13Mpps max target.

The reason the max target isn't reached in the cross-CPU test is likely
the RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing).  It is also planned to remove this unnecessary DMA
unmap in a later patchset.

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes to not return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V5: Updated patch desc

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    3 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   41 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   16 ++++++--
 3 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 28cc26debeda..ab91166f7c5a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,8 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+struct page_pool;
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
@@ -535,6 +537,7 @@ struct mlx5e_rq {
 	/* XDP */
 	struct bpf_prog       *xdp_prog;
 	struct mlx5e_xdpsq     xdpsq;
+	struct page_pool      *page_pool;
 
 	/* control */
 	struct mlx5_wq_ctrl    wq_ctrl;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2e4ca0f15b62..bf17e6d614d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -35,6 +35,7 @@
 #include <linux/mlx5/fs.h>
 #include <net/vxlan.h>
 #include <linux/bpf.h>
+#include <net/page_pool.h>
 #include "eswitch.h"
 #include "en.h"
 #include "en_tc.h"
@@ -387,10 +388,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 			  struct mlx5e_rq_param *rqp,
 			  struct mlx5e_rq *rq)
 {
+	struct page_pool_params pp_params = { 0 };
 	struct mlx5_core_dev *mdev = c->mdev;
 	void *rqc = rqp->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
-	u32 byte_count;
+	u32 byte_count, pool_size;
 	int npages;
 	int wq_sz;
 	int err;
@@ -429,10 +431,13 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
 	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = params->rq_headroom;
+	pool_size = 1 << params->log_rq_size;
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
 
+		pool_size = pool_size * MLX5_MPWRQ_PAGES_PER_WQE;
+
 		rq->post_wqes = mlx5e_post_rx_mpwqes;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
@@ -506,13 +511,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
-	/* This must only be activate for order-0 pages */
-	if (rq->xdp_prog) {
-		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
-						 MEM_TYPE_PAGE_ORDER0, NULL);
-		if (err)
-			goto err_rq_wq_destroy;
+	/* Create a page_pool and register it with rxq */
+	pp_params.size      = PAGE_POOL_PARAMS_SIZE;
+	pp_params.order     = rq->buff.page_order;
+	pp_params.dev       = c->pdev;
+	pp_params.nid       = cpu_to_node(c->cpu);
+	pp_params.dma_dir   = rq->buff.map_dir;
+	pp_params.pool_size = pool_size;
+	pp_params.flags     = 0; /* No internal DMA mapping in page_pool */
+
+	/* page_pool can be used even when there is no rq->xdp_prog,
+	 * given page_pool does not handle DMA mapping, there is no
+	 * required state to clear. And page_pool gracefully handles
+	 * an elevated refcnt.
+	 */
+	rq->page_pool = page_pool_create(&pp_params);
+	if (IS_ERR(rq->page_pool)) {
+		kfree(rq->wqe.frag_info);
+		err = PTR_ERR(rq->page_pool);
+		rq->page_pool = NULL;
+		goto err_rq_wq_destroy;
 	}
+	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+					 MEM_TYPE_PAGE_POOL, rq->page_pool);
+	if (err)
+		goto err_rq_wq_destroy;
 
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
@@ -550,6 +573,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy_rcu(rq->page_pool);
 	mlx5_wq_destroy(&rq->wq_ctrl);
 
 	return err;
@@ -563,6 +588,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 		bpf_prog_put(rq->xdp_prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy_rcu(rq->page_pool);
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 6dcc3e8fbd3e..2ac78b88fc3d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -37,6 +37,7 @@
 #include <linux/bpf_trace.h>
 #include <net/busy_poll.h>
 #include <net/ip6_checksum.h>
+#include <net/page_pool.h>
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -221,7 +222,7 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 	if (mlx5e_rx_cache_get(rq, dma_info))
 		return 0;
 
-	dma_info->page = dev_alloc_pages(rq->buff.page_order);
+	dma_info->page = page_pool_dev_alloc_pages(rq->page_pool);
 	if (unlikely(!dma_info->page))
 		return -ENOMEM;
 
@@ -246,11 +247,16 @@ static inline void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle)
 {
-	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
-		return;
+	if (likely(recycle)) {
+		if (mlx5e_rx_cache_put(rq, dma_info))
+			return;
 
-	mlx5e_page_dma_unmap(rq, dma_info);
-	put_page(dma_info->page);
+		mlx5e_page_dma_unmap(rq, dma_info);
+		page_pool_recycle_direct(rq->page_pool, dma_info->page);
+	} else {
+		mlx5e_page_dma_unmap(rq, dma_info);
+		put_page(dma_info->page);
+	}
 }
 
 static inline bool mlx5e_page_reuse(struct mlx5e_rq *rq,

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [bpf-next V5 PATCH 14/15] xdp: transition into using xdp_frame for return API
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (12 preceding siblings ...)
  2018-03-23 12:19 ` [bpf-next V5 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call Jesper Dangaard Brouer
@ 2018-03-23 12:19 ` Jesper Dangaard Brouer
  2018-03-23 17:29   ` Alexander Duyck
  2018-03-23 12:19 ` [bpf-next V5 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit Jesper Dangaard Brouer
  14 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:19 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Changing the xdp_return_frame() API to take a struct xdp_frame as
argument seems like a natural choice, but there are some subtle
performance details here that need extra care and deliberate handling.

De-referencing the xdp_frame on a remote CPU during DMA-TX
completion results in the cache-line changing to "Shared"
state.  Later, when the page is reused for RX, this xdp_frame
cache-line is written, which changes the state to "Modified".

This situation already happens (naturally) for virtio_net, tun and
cpumap, as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs.  Thus, the only option is
de-referencing the xdp_frame.

Only the ixgbe driver had an optimization that allowed it to avoid
de-referencing the xdp_frame.  The driver already has a TX-ring
queue, which (in case of remote DMA-TX completion) has to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing the xdp_frame.

To compensate for this, a prefetchw is used to tell the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate for it in the ixgbe driver.
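
For drivers being converted, the pattern after this patch is simply to
store the xdp_frame pointer at xmit time and hand the same pointer back
at DMA-TX completion time.  A hedged sketch, where example_tx_buffer is
a stand-in for a driver's per-descriptor state (compare ixgbe_tx_buffer
below); the xdp_* calls are the real API:

#include <linux/errno.h>
#include <net/xdp.h>

struct example_tx_buffer {			/* stand-in, not a real driver struct */
	struct xdp_frame *xdpf;			/* replaces data ptr + xdp_mem_info */
};

static int example_xmit(struct example_tx_buffer *tx_buffer,
			struct xdp_buff *xdp)
{
	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);

	if (unlikely(!xdpf))
		return -EOVERFLOW;

	tx_buffer->xdpf = xdpf;			/* queue a single pointer */
	return 0;
}

/* DMA-TX completion, possibly running on a remote CPU */
static void example_tx_clean(struct example_tx_buffer *tx_buffer)
{
	xdp_return_frame(tx_buffer->xdpf);
}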

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h        |    4 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   17 +++++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    1 +
 drivers/net/tun.c                               |    4 ++--
 drivers/net/virtio_net.c                        |    2 +-
 include/net/xdp.h                               |    2 +-
 kernel/bpf/cpumap.c                             |    6 +++---
 net/core/xdp.c                                  |    4 +++-
 8 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index cbc20f199364..dfbc15a45cb4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -240,8 +240,7 @@ struct ixgbe_tx_buffer {
 	unsigned long time_stamp;
 	union {
 		struct sk_buff *skb;
-		/* XDP uses address ptr on irq_clean */
-		void *data;
+		struct xdp_frame *xdpf;
 	};
 	unsigned int bytecount;
 	unsigned short gso_segs;
@@ -249,7 +248,6 @@ struct ixgbe_tx_buffer {
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
 	u32 tx_flags;
-	struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index ff069597fccf..e6e9b28ecfba 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 		/* free the skb */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			xdp.data_hard_start = xdp.data -
 					      ixgbe_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
+			prefetchw(xdp.data_hard_start); /* xdp_frame write */
 
 			skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
 		}
@@ -5787,7 +5788,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
 
 		/* Free all the Tx ring sk_buffs */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8333,16 +8334,21 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
+	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	len = xdp->data_end - xdp->data;
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
 		return IXGBE_XDP_CONSUMED;
 
-	dma = dma_map_single(ring->dev, xdp->data, len, DMA_TO_DEVICE);
+	dma = dma_map_single(ring->dev, xdpf->data, len, DMA_TO_DEVICE);
 	if (dma_mapping_error(ring->dev, dma))
 		return IXGBE_XDP_CONSUMED;
 
@@ -8357,8 +8363,7 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 
 	dma_unmap_len_set(tx_buffer, len, len);
 	dma_unmap_addr_set(tx_buffer, dma, dma);
-	tx_buffer->data = xdp->data;
-	tx_buffer->xdp_mem = xdp->rxq->mem;
+	tx_buffer->xdpf = xdpf;
 
 	tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 2ac78b88fc3d..00b9b13d9fea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -896,6 +896,7 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 				      di->addr + wi->offset,
 				      0, frag_size,
 				      DMA_FROM_DEVICE);
+	prefetchw(va); /* xdp_frame data area */
 	prefetch(data);
 	wi->offset += frag_size;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 81fddf9cc58f..a7e42ae1b220 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -663,7 +663,7 @@ static void tun_ptr_free(void *ptr)
 	if (tun_is_xdp_frame(ptr)) {
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		__skb_array_destroy_skb(ptr);
 	}
@@ -2188,7 +2188,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
 		ret = tun_put_user_xdp(tun, tfile, xdpf, to);
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		struct sk_buff *skb = ptr;
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 48c86accd3b8..479a80339fad 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -430,7 +430,7 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 
 	/* Free up any pending old buffers before queueing new ones. */
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
-		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
+		xdp_return_frame(xdpf_sent);
 
 	xdpf = convert_to_xdp_frame(xdp);
 	if (unlikely(!xdpf))
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 98b55eaf8fd7..35aa9825fdd0 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -103,7 +103,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem);
+void xdp_return_frame(struct xdp_frame *xdpf);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index bcdc4dea5ce7..c95b04ec103e 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -219,7 +219,7 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 
 	while ((xdpf = ptr_ring_consume(ring)))
 		if (WARN_ON_ONCE(xdpf))
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 }
 
 static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
@@ -275,7 +275,7 @@ static int cpu_map_kthread_run(void *data)
 
 			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
-				xdp_return_frame(xdpf->data, &xdpf->mem);
+				xdp_return_frame(xdpf);
 				continue;
 			}
 
@@ -578,7 +578,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 		err = __ptr_ring_produce(q, xdpf);
 		if (err) {
 			drops++;
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 		}
 		processed++;
 	}
diff --git a/net/core/xdp.c b/net/core/xdp.c
index fe8e87abc266..6ed3d73a73be 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -294,9 +294,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+void xdp_return_frame(struct xdp_frame *xdpf)
 {
 	struct xdp_mem_allocator *xa = NULL;
+	struct xdp_mem_info *mem = &xdpf->mem;
+	void *data = xdpf->data;
 
 	rcu_read_lock();
 	if (mem->id)

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [bpf-next V5 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit
  2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
                   ` (13 preceding siblings ...)
  2018-03-23 12:19 ` [bpf-next V5 PATCH 14/15] xdp: transition into using xdp_frame for return API Jesper Dangaard Brouer
@ 2018-03-23 12:19 ` Jesper Dangaard Brouer
  14 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 12:19 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

Changing the ndo_xdp_xmit API to take a struct xdp_frame instead of a
struct xdp_buff brings xdp_return_frame and ndo_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.
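
In short, every call site now converts the xdp_buff first and then
passes the frame to the ndo.  A condensed sketch of the new convention,
mirroring the net/core/filter.c change below (the example_tx_one wrapper
name is made up; error paths are trimmed):

#include <linux/errno.h>
#include <linux/netdevice.h>
#include <net/xdp.h>

static int example_tx_one(struct net_device *dev, struct xdp_buff *xdp)
{
	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
	int err;

	if (unlikely(!xdpf))
		return -EOVERFLOW;	/* conversion failed */

	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
	if (err)
		return err;

	dev->netdev_ops->ndo_xdp_flush(dev);
	return 0;
}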

V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   21 +++++++++++----------
 drivers/net/tun.c                             |   19 ++++++++++++-------
 drivers/net/virtio_net.c                      |   24 ++++++++++++++----------
 include/linux/netdevice.h                     |    4 ++--
 net/core/filter.c                             |   17 +++++++++++++++--
 5 files changed, 54 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index e6e9b28ecfba..f78096ed4c86 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2252,7 +2252,7 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring *rx_ring,
 #define IXGBE_XDP_TX 2
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp);
+			       struct xdp_frame *xdpf);
 
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 				     struct ixgbe_ring *rx_ring,
@@ -2260,6 +2260,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 {
 	int err, result = IXGBE_XDP_PASS;
 	struct bpf_prog *xdp_prog;
+	struct xdp_frame *xdpf;
 	u32 act;
 
 	rcu_read_lock();
@@ -2273,7 +2274,12 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		result = ixgbe_xmit_xdp_ring(adapter, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf)) {
+			result = IXGBE_XDP_CONSUMED;
+			break;
+		}
+		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
 		break;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
@@ -8329,20 +8335,15 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 }
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
-	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
@@ -9995,7 +9996,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
-static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
 	struct ixgbe_ring *ring;
@@ -10011,7 +10012,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (unlikely(!ring))
 		return -ENXIO;
 
-	err = ixgbe_xmit_xdp_ring(adapter, xdp);
+	err = ixgbe_xmit_xdp_ring(adapter, xdpf);
 	if (err != IXGBE_XDP_TX)
 		return -ENOSPC;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a7e42ae1b220..da0402ebc5ce 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1293,18 +1293,13 @@ static const struct net_device_ops tun_netdev_ops = {
 	.ndo_get_stats64	= tun_net_get_stats64,
 };
 
-static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int tun_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
 {
 	struct tun_struct *tun = netdev_priv(dev);
-	struct xdp_frame *frame;
 	struct tun_file *tfile;
 	u32 numqueues;
 	int ret = 0;
 
-	frame = convert_to_xdp_frame(xdp);
-	if (unlikely(!frame))
-		return -EOVERFLOW;
-
 	rcu_read_lock();
 
 	numqueues = READ_ONCE(tun->numqueues);
@@ -1328,6 +1323,16 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	return ret;
 }
 
+static int tun_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return tun_xdp_xmit(dev, frame);
+}
+
 static void tun_xdp_flush(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
@@ -1675,7 +1680,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		case XDP_TX:
 			get_page(alloc_frag->page);
 			alloc_frag->offset += buflen;
-			if (tun_xdp_xmit(tun->dev, &xdp))
+			if (tun_xdp_tx(tun->dev, &xdp))
 				goto err_redirect;
 			tun_xdp_flush(tun->dev);
 			rcu_read_unlock();
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 479a80339fad..906fcd9ff49b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -416,10 +416,10 @@ static void virtnet_xdp_flush(struct net_device *dev)
 }
 
 static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
-			       struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
-	struct xdp_frame *xdpf, *xdpf_sent;
+	struct xdp_frame *xdpf_sent;
 	struct send_queue *sq;
 	unsigned int len;
 	unsigned int qp;
@@ -432,10 +432,6 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
 		xdp_return_frame(xdpf_sent);
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	/* virtqueue want to use data area in-front of packet */
 	if (unlikely(xdpf->metasize > 0))
 		return -EOPNOTSUPP;
@@ -460,7 +456,7 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
 	return true;
 }
 
-static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct receive_queue *rq = vi->rq;
@@ -474,7 +470,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (!xdp_prog)
 		return -ENXIO;
 
-	sent = __virtnet_xdp_xmit(vi, xdp);
+	sent = __virtnet_xdp_xmit(vi, xdpf);
 	if (!sent)
 		return -ENOSPC;
 	return 0;
@@ -575,6 +571,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
 		struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset;
+		struct xdp_frame *xdpf;
 		struct xdp_buff xdp;
 		void *orig_data;
 		u32 act;
@@ -617,7 +614,10 @@ static struct sk_buff *receive_small(struct net_device *dev,
 			delta = orig_data - xdp.data;
 			break;
 		case XDP_TX:
-			sent = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			sent = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(!sent)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				goto err_xdp;
@@ -709,6 +709,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
+		struct xdp_frame *xdpf;
 		struct page *xdp_page;
 		struct xdp_buff xdp;
 		void *data;
@@ -773,7 +774,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			}
 			break;
 		case XDP_TX:
-			sent = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			sent = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(!sent)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				if (unlikely(xdp_page != page))
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 913b1cc882cf..62d984ac6c7c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1164,7 +1164,7 @@ struct dev_ifalias {
  *	This function is used to set or query state related to XDP on the
  *	netdevice and manage BPF offload. See definition of
  *	enum bpf_netdev_command for details.
- * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
+ * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdp);
  *	This function is used to submit a XDP packet for transmit on a
  *	netdevice.
  * void (*ndo_xdp_flush)(struct net_device *dev);
@@ -1355,7 +1355,7 @@ struct net_device_ops {
 	int			(*ndo_bpf)(struct net_device *dev,
 					   struct netdev_bpf *bpf);
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
-						struct xdp_buff *xdp);
+						struct xdp_frame *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index c86f03fd9ea5..189ae8e4dda3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2724,13 +2724,18 @@ static int __bpf_tx_xdp(struct net_device *dev,
 			struct xdp_buff *xdp,
 			u32 index)
 {
+	struct xdp_frame *xdpf;
 	int err;
 
 	if (!dev->netdev_ops->ndo_xdp_xmit) {
 		return -EOPNOTSUPP;
 	}
 
-	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 	if (err)
 		return err;
 	dev->netdev_ops->ndo_xdp_flush(dev);
@@ -2746,11 +2751,19 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
 		struct net_device *dev = fwd;
+		struct xdp_frame *xdpf;
 
 		if (!dev->netdev_ops->ndo_xdp_xmit)
 			return -EOPNOTSUPP;
 
-		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf))
+			return -EOVERFLOW;
+
+		/* TODO: move to inside map code instead, for bulk support
+		 * err = dev_map_enqueue(dev, xdp);
+		 */
+		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code
  2018-03-23 12:18 ` [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code Jesper Dangaard Brouer
@ 2018-03-23 13:28   ` Eric Dumazet
  2018-03-26 14:09     ` Jesper Dangaard Brouer
  2018-03-23 13:29   ` Eric Dumazet
  2018-03-23 13:37   ` Eric Dumazet
  2 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2018-03-23 13:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Daniel Borkmann, Alexei Starovoitov,
	Tariq Toukan



On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:

> +
> +	/* Note, below struct compat code was primarily needed when
> +	 * page_pool code lived under MM-tree control, given mmots and
> +	 * net-next trees progress in very different rates.
> +	 *
> +	 * Allow kernel devel trees and driver to progress at different rates
> +	 */
> +	param_copy_sz = PAGE_POOL_PARAMS_SIZE;
> +	memset(&pool->p, 0, param_copy_sz);
> +	if (params->size < param_copy_sz) {
> +		/* Older module calling newer kernel, handled by only
> +		 * copying supplied size, and keep remaining params zero
> +		 */
> +		param_copy_sz = params->size;
> +	} else if (params->size > param_copy_sz) {
> +		/* Newer module calling older kernel. Need to validate
> +		 * no new features were requested.
> +		 */
> +		unsigned char *addr = (unsigned char *)params + param_copy_sz;
> +		unsigned char *end  = (unsigned char *)params + params->size;
> +
> +		for (; addr < end; addr++) {
> +			if (*addr != 0)
> +				return -E2BIG;
> +		}
> +	}

I do not see the need for this part.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code
  2018-03-23 12:18 ` [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code Jesper Dangaard Brouer
  2018-03-23 13:28   ` Eric Dumazet
@ 2018-03-23 13:29   ` Eric Dumazet
  2018-03-23 14:15     ` Jesper Dangaard Brouer
  2018-03-23 13:37   ` Eric Dumazet
  2 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2018-03-23 13:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Daniel Borkmann, Alexei Starovoitov,
	Tariq Toukan



On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:

> +
> +void page_pool_destroy_rcu(struct page_pool *pool)
> +{
> +	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
> +}
> +EXPORT_SYMBOL(page_pool_destroy_rcu);
> 


Why do we need to respect one rcu grace period before destroying a page pool ?

In any case, this should be called page_pool_destroy()

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code
  2018-03-23 12:18 ` [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code Jesper Dangaard Brouer
  2018-03-23 13:28   ` Eric Dumazet
  2018-03-23 13:29   ` Eric Dumazet
@ 2018-03-23 13:37   ` Eric Dumazet
  2 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2018-03-23 13:37 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Daniel Borkmann, Alexei Starovoitov,
	Tariq Toukan



On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:


> +#define PP_ALLOC_CACHE_SIZE	128
> +#define PP_ALLOC_CACHE_REFILL	64
> +struct pp_alloc_cache {
> +	u32 count ____cacheline_aligned_in_smp;
> +	void *cache[PP_ALLOC_CACHE_SIZE];
> +};
> +
> +struct page_pool_params {
...
> +};
> +#define	PAGE_POOL_PARAMS_SIZE	offsetof(struct page_pool_params, end_marker)
> +
> +struct page_pool {
> +	struct page_pool_params p;
> +
> +	struct pp_alloc_cache alloc;
> +
>...
> +	struct ptr_ring ring;
> +
> +	struct rcu_head rcu;
> +};
> +


The placement of ____cacheline_aligned_in_smp in pp_alloc_cache is odd.

I would rather put it in struct page_pool as in :

+struct page_pool {
+	struct page_pool_params p;

+	struct pp_alloc_cache alloc ____cacheline_aligned_in_smp;

To make the intent clear here (let the page_pool_params be in a read-only cache line)

Also you probably can move the struct rcu_head between p and the pp_alloc_cache alloc member to fill a hole.

(assuming allowing variable sized params is no longer needed)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code
  2018-03-23 13:29   ` Eric Dumazet
@ 2018-03-23 14:15     ` Jesper Dangaard Brouer
  2018-03-23 14:55       ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 14:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	Jason Wang, John Fastabend, Eran Ben Elisha, Saeed Mahameed,
	galp, Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 23 Mar 2018 06:29:55 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:
> 
> > +
> > +void page_pool_destroy_rcu(struct page_pool *pool)
> > +{
> > +	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
> > +}
> > +EXPORT_SYMBOL(page_pool_destroy_rcu);
> >   
> 
> 
> Why do we need to respect one rcu grace period before destroying a page pool ?

Due to previous allocator ID patch, which can have a pointer reference
to a page_pool, and the allocator ID lookup uses RCU.

> In any case, this should be called page_pool_destroy()

Okay.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code
  2018-03-23 14:15     ` Jesper Dangaard Brouer
@ 2018-03-23 14:55       ` Eric Dumazet
  2018-03-26 15:19         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2018-03-23 14:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	Jason Wang, John Fastabend, Eran Ben Elisha, Saeed Mahameed,
	galp, Daniel Borkmann, Alexei Starovoitov, Tariq Toukan



On 03/23/2018 07:15 AM, Jesper Dangaard Brouer wrote:
> On Fri, 23 Mar 2018 06:29:55 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
>> On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:
>>
>>> +
>>> +void page_pool_destroy_rcu(struct page_pool *pool)
>>> +{
>>> +	call_rcu(&pool->rcu, __page_pool_destroy_rcu);
>>> +}
>>> +EXPORT_SYMBOL(page_pool_destroy_rcu);
>>>   
>>
>>
>> Why do we need to respect one rcu grace period before destroying a page pool ?
> 
> Due to previous allocator ID patch, which can have a pointer reference
> to a page_pool, and the allocator ID lookup uses RCU.
> 

I am not convinced.  How come a patch that is _before_ this one can have any impact?

Normally, we put first infrastructure, then something using it.

An rcu grace period before freeing huge quantities of pages is problematic and could
be used by syzbot to OOM the host.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 09/15] mlx5: register a memory model when XDP is enabled
  2018-03-23 12:18 ` [bpf-next V5 PATCH 09/15] mlx5: register a memory model when XDP is enabled Jesper Dangaard Brouer
@ 2018-03-23 16:18   ` Sergei Shtylyov
  0 siblings, 0 replies; 35+ messages in thread
From: Sergei Shtylyov @ 2018-03-23 16:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Daniel Borkmann, Alexei Starovoitov,
	Tariq Toukan

Hello!

On 03/23/2018 03:18 PM, Jesper Dangaard Brouer wrote:

> Now all the users of ndo_xdp_xmit have been converted to use xdp_return_frame.
> This enable a different memory model, thus activating another code path
> in the xdp_return_frame API.
> 
> V2: Fixed issues pointed out by Tariq.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index da94c8cba5ee..2e4ca0f15b62 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -506,6 +506,14 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>  		rq->mkey_be = c->mkey_be;
>  	}
>  
> +	/* This must only be activate for order-0 pages */

   Activated?

[...]

MBR, Sergei

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
  2018-03-23 12:18 ` [bpf-next V5 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap Jesper Dangaard Brouer
@ 2018-03-23 16:35   ` Alexander Duyck
  2018-03-26 19:06     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2018-03-23 16:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

On Fri, Mar 23, 2018 at 5:18 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Introduce an xdp_return_frame API, and convert over cpumap as
> the first user, given it have queued XDP frame structure to leverage.
>
> V3: Cleanup and remove C99 style comments, pointed out by Alex Duyck.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/net/xdp.h   |   28 ++++++++++++++++++++++++
>  kernel/bpf/cpumap.c |   60 +++++++++++++++++++++++++++++++--------------------
>  net/core/xdp.c      |   18 +++++++++++++++
>  3 files changed, 82 insertions(+), 24 deletions(-)
>
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index b2362ddfa694..15b546325e31 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -33,16 +33,44 @@
>   * also mandatory during RX-ring setup.
>   */
>
> +enum mem_type {
> +       MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
> +       MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
> +       MEM_TYPE_MAX,
> +};
> +
> +struct xdp_mem_info {
> +       u32 type; /* enum mem_type, but known size type */

Do you really need to make this a full u32 value for something that
is likely never going to exceed a single digit value?

> +       /* u32 id; will be added later in this patchset */

Wouldn't it be better to just hold off and add it then instead of
adding it as a comment?

> +};
> +
>  struct xdp_rxq_info {
>         struct net_device *dev;
>         u32 queue_index;
>         u32 reg_state;
> +       struct xdp_mem_info mem;
>  } ____cacheline_aligned; /* perf critical, avoid false-sharing */
>
> +
> +static inline
> +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> +{
> +       if (mem->type == MEM_TYPE_PAGE_SHARED)
> +               page_frag_free(data);
> +
> +       if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> +               struct page *page = virt_to_page(data); /* Assumes order0 page*/
> +
> +               put_page(page);
> +       }

Actually page_frag_free would probably work for either one. Also is it
safe to assume that the page is order 0? Are the only users of
compound pages that support XDP also only supporting the page fragment
setup?

Also you probably don't need put_page. It might be better to use
__free_page if you are certain the pages are coming from the Rx path
of drivers and don't have any special destructors associated with
them.
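
Roughly what I mean (untested sketch, assuming the above holds and the
pages really do come straight from the driver Rx path with no special
destructors):

static inline
void xdp_return_frame(void *data, struct xdp_mem_info *mem)
{
	/* Per the above, page_frag_free() likely copes with a whole
	 * order-0 page as well, so the split may not buy much.
	 */
	if (mem->type == MEM_TYPE_PAGE_SHARED) {
		page_frag_free(data);
		return;
	}

	if (mem->type == MEM_TYPE_PAGE_ORDER0)
		__free_page(virt_to_page(data)); /* no special destructor */
}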

> +}
> +
>  int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
>                      struct net_device *dev, u32 queue_index);
>  void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
>  void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
>  bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
> +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> +                              enum mem_type type, void *allocator);
>
>  #endif /* __LINUX_NET_XDP_H__ */
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index a4bb0b34375a..3e4bbcbe3e86 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -19,6 +19,7 @@
>  #include <linux/bpf.h>
>  #include <linux/filter.h>
>  #include <linux/ptr_ring.h>
> +#include <net/xdp.h>
>
>  #include <linux/sched.h>
>  #include <linux/workqueue.h>
> @@ -137,27 +138,6 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
>         return ERR_PTR(err);
>  }
>
> -static void __cpu_map_queue_destructor(void *ptr)
> -{
> -       /* The tear-down procedure should have made sure that queue is
> -        * empty.  See __cpu_map_entry_replace() and work-queue
> -        * invoked cpu_map_kthread_stop(). Catch any broken behaviour
> -        * gracefully and warn once.
> -        */
> -       if (WARN_ON_ONCE(ptr))
> -               page_frag_free(ptr);
> -}
> -
> -static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> -{
> -       if (atomic_dec_and_test(&rcpu->refcnt)) {
> -               /* The queue should be empty at this point */
> -               ptr_ring_cleanup(rcpu->queue, __cpu_map_queue_destructor);
> -               kfree(rcpu->queue);
> -               kfree(rcpu);
> -       }
> -}
> -
>  static void get_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
>  {
>         atomic_inc(&rcpu->refcnt);
> @@ -188,6 +168,10 @@ struct xdp_pkt {
>         u16 len;
>         u16 headroom;
>         u16 metasize;
> +       /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> +        * while mem info is valid on remote CPU.
> +        */
> +       struct xdp_mem_info mem;
>         struct net_device *dev_rx;
>  };
>
> @@ -213,6 +197,9 @@ static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
>         xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
>         xdp_pkt->metasize = metasize;
>
> +       /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
> +       xdp_pkt->mem = xdp->rxq->mem;
> +
>         return xdp_pkt;
>  }
>
> @@ -265,6 +252,31 @@ static struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
>         return skb;
>  }
>
> +static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
> +{
> +       /* The tear-down procedure should have made sure that queue is
> +        * empty.  See __cpu_map_entry_replace() and work-queue
> +        * invoked cpu_map_kthread_stop(). Catch any broken behaviour
> +        * gracefully and warn once.
> +        */
> +       struct xdp_pkt *xdp_pkt;
> +
> +       while ((xdp_pkt = ptr_ring_consume(ring)))
> +               if (WARN_ON_ONCE(xdp_pkt))
> +                       xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
> +}
> +
> +static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> +{
> +       if (atomic_dec_and_test(&rcpu->refcnt)) {
> +               /* The queue should be empty at this point */
> +               __cpu_map_ring_cleanup(rcpu->queue);
> +               ptr_ring_cleanup(rcpu->queue, NULL);
> +               kfree(rcpu->queue);
> +               kfree(rcpu);
> +       }
> +}
> +
>  static int cpu_map_kthread_run(void *data)
>  {
>         struct bpf_cpu_map_entry *rcpu = data;
> @@ -307,7 +319,7 @@ static int cpu_map_kthread_run(void *data)
>
>                         skb = cpu_map_build_skb(rcpu, xdp_pkt);
>                         if (!skb) {
> -                               page_frag_free(xdp_pkt);
> +                               xdp_return_frame(xdp_pkt, &xdp_pkt->mem);
>                                 continue;
>                         }
>
> @@ -604,13 +616,13 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
>         spin_lock(&q->producer_lock);
>
>         for (i = 0; i < bq->count; i++) {
> -               void *xdp_pkt = bq->q[i];
> +               struct xdp_pkt *xdp_pkt = bq->q[i];
>                 int err;
>
>                 err = __ptr_ring_produce(q, xdp_pkt);
>                 if (err) {
>                         drops++;
> -                       page_frag_free(xdp_pkt); /* Free xdp_pkt */
> +                       xdp_return_frame(xdp_pkt->data, &xdp_pkt->mem);
>                 }
>                 processed++;
>         }
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 097a0f74e004..9eee0c431126 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -71,3 +71,21 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
>         return (xdp_rxq->reg_state == REG_STATE_REGISTERED);
>  }
>  EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
> +
> +int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> +                              enum mem_type type, void *allocator)
> +{
> +       if (type >= MEM_TYPE_MAX)
> +               return -EINVAL;
> +
> +       xdp_rxq->mem.type = type;
> +
> +       if (allocator)
> +               return -EOPNOTSUPP;
> +
> +       /* TODO: Allocate an ID that maps to allocator pointer
> +        * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
> +        */
> +       return 0;
> +}
> +EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 03/15] ixgbe: use xdp_return_frame API
  2018-03-23 12:18 ` [bpf-next V5 PATCH 03/15] ixgbe: use xdp_return_frame API Jesper Dangaard Brouer
@ 2018-03-23 16:46   ` Alexander Duyck
  0 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2018-03-23 16:46 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

On Fri, Mar 23, 2018 at 5:18 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Extend struct ixgbe_tx_buffer to store the xdp_mem_info.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h      |    1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    6 ++++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index c1e3a0039ea5..cbc20f199364 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -249,6 +249,7 @@ struct ixgbe_tx_buffer {
>         DEFINE_DMA_UNMAP_ADDR(dma);
>         DEFINE_DMA_UNMAP_LEN(len);
>         u32 tx_flags;
> +       struct xdp_mem_info xdp_mem;

Instead of increasing the size of this structure it might make more
sense to find free space somewhere inside of it.

One thing you could probably look at doing is making it so that the
gso_segs, protocol, and tx_flags fields could be put into an anonymous
structure that is then part of a union with the xdp_mem_info
structure. Then you would just have to tweak things slightly so that
the else section for the ring_is_xdp block pulls the code just above
it into that section so that those fields are not read.

You would need to flip the dma, length, and type field ordering but
that shouldn't be much of an issue and it should have no effect on
performance since they are all in the same 16 byte aligned block
anyway.
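
Roughly this shape (untested sketch, fields outside the quoted hunk
left out):

struct ixgbe_tx_buffer {
	/* ... */
	union {
		struct sk_buff *skb;
		void *data;		/* XDP uses address ptr on irq_clean */
	};
	unsigned int bytecount;
	DEFINE_DMA_UNMAP_ADDR(dma);
	DEFINE_DMA_UNMAP_LEN(len);
	union {
		struct {			/* only read on the skb path */
			unsigned short gso_segs;
			__be16 protocol;
			u32 tx_flags;
		};
		struct xdp_mem_info xdp_mem;	/* only read on the XDP path */
	};
};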

>  };
>
>  struct ixgbe_rx_buffer {
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 85369423452d..45520eb503ee 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
>
>                 /* free the skb */
>                 if (ring_is_xdp(tx_ring))
> -                       page_frag_free(tx_buffer->data);
> +                       xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);

Based on my suggestion above this section would have:
    total_packets++;

>                 else
>                         napi_consume_skb(tx_buffer->skb, napi_budget);

This section would pull the lines that read gso_segs and tx_flags down
into this else statement so that the fields go unused.

>
> @@ -5787,7 +5787,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
>
>                 /* Free all the Tx ring sk_buffs */
>                 if (ring_is_xdp(tx_ring))
> -                       page_frag_free(tx_buffer->data);
> +                       xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
>                 else
>                         dev_kfree_skb_any(tx_buffer->skb);
>
> @@ -8351,6 +8351,8 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
>         dma_unmap_len_set(tx_buffer, len, len);
>         dma_unmap_addr_set(tx_buffer, dma, dma);
>         tx_buffer->data = xdp->data;
> +       tx_buffer->xdp_mem = xdp->rxq->mem;
> +

It would work out better if you moved this up to the lines that are
setting the tx_buffer fields. Really you could probably pull the line
that is recording xdp->data up as well.

>         tx_desc->read.buffer_addr = cpu_to_le64(dma);
>
>         /* put descriptor type bits */
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-23 12:18 ` [bpf-next V5 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping Jesper Dangaard Brouer
@ 2018-03-23 16:56   ` Alexander Duyck
  2018-03-23 18:15     ` Jesper Dangaard Brouer
  2018-03-26 21:04     ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 35+ messages in thread
From: Alexander Duyck @ 2018-03-23 16:56 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

On Fri, Mar 23, 2018 at 5:18 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Use the IDA infrastructure for getting a cyclic increasing ID number,
> that is used for keeping track of each registered allocator per
> RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
> uses a radix tree, use a dynamic rhashtable, for creating ID to
> pointer lookup table, because this is faster.
>
> The problem that is being solved here is that, the xdp_rxq_info
> pointer (stored in xdp_buff) cannot be used directly, as the
> guaranteed lifetime is too short.  The info is needed on a
> (potentially) remote CPU during DMA-TX completion time. In an
> xdp_frame the xdp_mem_info is stored, when it got converted from an
> xdp_buff, which is sufficient for the simple page refcnt based recycle
> schemes.
>
> For more advanced allocators there is a need to store a pointer to the
> registered allocator.  Thus, there is a need to guard the lifetime or
> validity of the allocator pointer, which is done through this
> rhashtable ID map to pointer. The removal and validity of the
> allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
> allocator will be created by the driver, and registered with
> xdp_rxq_info_reg_mem_model().
>
> It is up for debate who is responsible for freeing the allocator
> pointer or invoking the allocator destructor function.  In any case,
> this must happen via RCU freeing.
>
> Use the IDA infrastructure for getting a cyclic increasing ID number,
> that is used for keeping track of each registered allocator per
> RX-queue xdp_rxq_info.
>
> V4: Per req of Jason Wang
> - Use xdp_rxq_info_reg_mem_model() in all drivers implementing
>   XDP_REDIRECT, even-though it's not strictly necessary when
>   allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    9 +
>  drivers/net/tun.c                             |    6 +
>  drivers/net/virtio_net.c                      |    7 +
>  include/net/xdp.h                             |   15 --
>  net/core/xdp.c                                |  230 ++++++++++++++++++++++++-
>  5 files changed, 248 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 45520eb503ee..ff069597fccf 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -6360,7 +6360,7 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
>         struct device *dev = rx_ring->dev;
>         int orig_node = dev_to_node(dev);
>         int ring_node = -1;
> -       int size;
> +       int size, err;
>
>         size = sizeof(struct ixgbe_rx_buffer) * rx_ring->count;
>
> @@ -6397,6 +6397,13 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter *adapter,
>                              rx_ring->queue_index) < 0)
>                 goto err;
>
> +       err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
> +                                        MEM_TYPE_PAGE_SHARED, NULL);
> +       if (err) {
> +               xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
> +               goto err;
> +       }
> +
>         rx_ring->xdp_prog = adapter->xdp_prog;
>
>         return 0;
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 6750980d9f30..81fddf9cc58f 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -846,6 +846,12 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
>                                        tun->dev, tfile->queue_index);
>                 if (err < 0)
>                         goto out;
> +               err = xdp_rxq_info_reg_mem_model(&tfile->xdp_rxq,
> +                                                MEM_TYPE_PAGE_SHARED, NULL);
> +               if (err < 0) {
> +                       xdp_rxq_info_unreg(&tfile->xdp_rxq);
> +                       goto out;
> +               }
>                 err = 0;
>         }
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 6c4220450506..48c86accd3b8 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1312,6 +1312,13 @@ static int virtnet_open(struct net_device *dev)
>                 if (err < 0)
>                         return err;
>
> +               err = xdp_rxq_info_reg_mem_model(&vi->rq[i].xdp_rxq,
> +                                                MEM_TYPE_PAGE_SHARED, NULL);
> +               if (err < 0) {
> +                       xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
> +                       return err;
> +               }
> +
>                 virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
>                 virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
>         }
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index bc0cb97e20dc..859aa9b737fe 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -41,7 +41,7 @@ enum mem_type {
>
>  struct xdp_mem_info {
>         u32 type; /* enum mem_type, but known size type */
> -       /* u32 id; will be added later in this patchset */
> +       u32 id;
>  };
>
>  struct xdp_rxq_info {
> @@ -100,18 +100,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>         return xdp_frame;
>  }
>
> -static inline
> -void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> -{
> -       if (mem->type == MEM_TYPE_PAGE_SHARED)
> -               page_frag_free(data);
> -
> -       if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> -               struct page *page = virt_to_page(data); /* Assumes order0 page*/
> -
> -               put_page(page);
> -       }
> -}
> +void xdp_return_frame(void *data, struct xdp_mem_info *mem);
>
>  int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
>                      struct net_device *dev, u32 queue_index);
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 9eee0c431126..06a5b39491ad 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -5,6 +5,9 @@
>   */
>  #include <linux/types.h>
>  #include <linux/mm.h>
> +#include <linux/slab.h>
> +#include <linux/idr.h>
> +#include <linux/rhashtable.h>
>
>  #include <net/xdp.h>
>
> @@ -13,6 +16,99 @@
>  #define REG_STATE_UNREGISTERED 0x2
>  #define REG_STATE_UNUSED       0x3
>
> +DEFINE_IDA(mem_id_pool);
> +static DEFINE_MUTEX(mem_id_lock);
> +#define MEM_ID_MAX 0xFFFE
> +#define MEM_ID_MIN 1
> +static int mem_id_next = MEM_ID_MIN;
> +
> +static bool mem_id_init; /* false */
> +static struct rhashtable *mem_id_ht;
> +
> +struct xdp_mem_allocator {
> +       struct xdp_mem_info mem;
> +       void *allocator;
> +       struct rhash_head node;
> +       struct rcu_head rcu;
> +};
> +
> +static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
> +{
> +       const u32 *k = data;
> +       const u32 key = *k;
> +
> +       BUILD_BUG_ON(FIELD_SIZEOF(struct xdp_mem_allocator, mem.id)
> +                    != sizeof(u32));
> +
> +       /* Use cyclic increasing ID as direct hash key, see rht_bucket_index */
> +       return key << RHT_HASH_RESERVED_SPACE;
> +}
> +
> +static int xdp_mem_id_cmp(struct rhashtable_compare_arg *arg,
> +                         const void *ptr)
> +{
> +       const struct xdp_mem_allocator *xa = ptr;
> +       u32 mem_id = *(u32 *)arg->key;
> +
> +       return xa->mem.id != mem_id;
> +}
> +
> +static const struct rhashtable_params mem_id_rht_params = {
> +       .nelem_hint = 64,
> +       .head_offset = offsetof(struct xdp_mem_allocator, node),
> +       .key_offset  = offsetof(struct xdp_mem_allocator, mem.id),
> +       .key_len = FIELD_SIZEOF(struct xdp_mem_allocator, mem.id),
> +       .max_size = MEM_ID_MAX,
> +       .min_size = 8,
> +       .automatic_shrinking = true,
> +       .hashfn    = xdp_mem_id_hashfn,
> +       .obj_cmpfn = xdp_mem_id_cmp,
> +};
> +
> +void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
> +{
> +       struct xdp_mem_allocator *xa;
> +
> +       xa = container_of(rcu, struct xdp_mem_allocator, rcu);
> +
> +       /* Allow this ID to be reused */
> +       ida_simple_remove(&mem_id_pool, xa->mem.id);
> +
> +       /* TODO: Depending on allocator type/pointer free resources */
> +
> +       /* Poison memory */
> +       xa->mem.id = 0xFFFF;
> +       xa->mem.type = 0xF0F0;
> +       xa->allocator = (void *)0xDEAD9001;
> +
> +       kfree(xa);
> +}
> +
> +void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
> +{
> +       struct xdp_mem_allocator *xa;
> +       int id = xdp_rxq->mem.id;
> +       int err;
> +
> +       if (id == 0)
> +               return;
> +
> +       mutex_lock(&mem_id_lock);
> +
> +       xa = rhashtable_lookup(mem_id_ht, &id, mem_id_rht_params);
> +       if (!xa) {
> +               mutex_unlock(&mem_id_lock);
> +               return;
> +       }
> +
> +       err = rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params);
> +       WARN_ON(err);
> +
> +       call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
> +
> +       mutex_unlock(&mem_id_lock);
> +}
> +
>  void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
>  {
>         /* Simplify driver cleanup code paths, allow unreg "unused" */
> @@ -21,8 +117,14 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
>
>         WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
>
> +       __xdp_rxq_info_unreg_mem_model(xdp_rxq);
> +
>         xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
>         xdp_rxq->dev = NULL;
> +
> +       /* Reset mem info to defaults */
> +       xdp_rxq->mem.id = 0;
> +       xdp_rxq->mem.type = 0;
>  }
>  EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg);
>
> @@ -72,20 +174,138 @@ bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
>  }
>  EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);
>
> +int __mem_id_init_hash_table(void)
> +{
> +       struct rhashtable *rht;
> +       int ret;
> +
> +       if (unlikely(mem_id_init))
> +               return 0;
> +
> +       rht = kzalloc(sizeof(*rht), GFP_KERNEL);
> +       if (!rht)
> +               return -ENOMEM;
> +
> +       ret = rhashtable_init(rht, &mem_id_rht_params);
> +       if (ret < 0) {
> +               kfree(rht);
> +               return ret;
> +       }
> +       mem_id_ht = rht;
> +       smp_mb(); /* mutex lock should provide enough pairing */
> +       mem_id_init = true;
> +
> +       return 0;
> +}
> +
> +/* Allocate a cyclic ID that maps to allocator pointer.
> + * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
> + *
> + * Caller must lock mem_id_lock.
> + */
> +static int __mem_id_cyclic_get(gfp_t gfp)
> +{
> +       int retries = 1;
> +       int id;
> +
> +again:
> +       id = ida_simple_get(&mem_id_pool, mem_id_next, MEM_ID_MAX, gfp);
> +       if (id < 0) {
> +               if (id == -ENOSPC) {
> +                       /* Cyclic allocator, reset next id */
> +                       if (retries--) {
> +                               mem_id_next = MEM_ID_MIN;
> +                               goto again;
> +                       }
> +               }
> +               return id; /* errno */
> +       }
> +       mem_id_next = id + 1;
> +
> +       return id;
> +}
> +
>  int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
>                                enum mem_type type, void *allocator)
>  {
> +       struct xdp_mem_allocator *xdp_alloc;
> +       gfp_t gfp = GFP_KERNEL;
> +       int id, errno, ret;
> +       void *ptr;
> +
> +       if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
> +               WARN(1, "Missing register, driver bug");
> +               return -EFAULT;
> +       }
> +
>         if (type >= MEM_TYPE_MAX)
>                 return -EINVAL;
>
>         xdp_rxq->mem.type = type;
>
> -       if (allocator)
> -               return -EOPNOTSUPP;
> +       if (!allocator)
> +               return 0;
> +
> +       /* Delay init of rhashtable to save memory if feature isn't used */
> +       if (!mem_id_init) {
> +               mutex_lock(&mem_id_lock);
> +               ret = __mem_id_init_hash_table();
> +               mutex_unlock(&mem_id_lock);
> +               if (ret < 0) {
> +                       WARN_ON(1);
> +                       return ret;
> +               }
> +       }
> +
> +       xdp_alloc = kzalloc(sizeof(*xdp_alloc), gfp);
> +       if (!xdp_alloc)
> +               return -ENOMEM;
> +
> +       mutex_lock(&mem_id_lock);
> +       id = __mem_id_cyclic_get(gfp);
> +       if (id < 0) {
> +               errno = id;
> +               goto err;
> +       }
> +       xdp_rxq->mem.id = id;
> +       xdp_alloc->mem  = xdp_rxq->mem;
> +       xdp_alloc->allocator = allocator;
> +
> +       /* Insert allocator into ID lookup table */
> +       ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node);
> +       if (IS_ERR(ptr)) {
> +               errno = PTR_ERR(ptr);
> +               goto err;
> +       }
> +
> +       mutex_unlock(&mem_id_lock);
>
> -       /* TODO: Allocate an ID that maps to allocator pointer
> -        * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
> -        */
>         return 0;
> +err:
> +       mutex_unlock(&mem_id_lock);
> +       kfree(xdp_alloc);
> +       return errno;
>  }
>  EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
> +
> +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> +{
> +       struct xdp_mem_allocator *xa;
> +
> +       rcu_read_lock();
> +       if (mem->id)
> +               xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> +       rcu_read_unlock();
> +
> +       if (mem->type == MEM_TYPE_PAGE_SHARED) {
> +               page_frag_free(data);
> +               return;
> +       }
> +
> +       if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> +               struct page *page = virt_to_page(data); /* Assumes order0 page*/
> +
> +               put_page(page);
> +       }
> +}
> +EXPORT_SYMBOL_GPL(xdp_return_frame);
>

I'm not sure what the point is of getting the xa value if it is not
going to be used. Also I would assume there are types that won't even
need the hash table lookup. I would prefer to see this bit held off on
until you have something that actually needs it.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame
  2018-03-23 12:18 ` [bpf-next V5 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame Jesper Dangaard Brouer
@ 2018-03-23 17:02   ` Alexander Duyck
  0 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2018-03-23 17:02 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

On Fri, Mar 23, 2018 at 5:18 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> New allocator type MEM_TYPE_PAGE_POOL for page_pool usage.
>
> The registered allocator page_pool pointer is not available directly
> from xdp_rxq_info, but it could be (if needed).  For now, the driver
> should keep separate track of the page_pool pointer, which it should
> use for RX-ring page allocation.
>
> As suggested by Saeed, to maintain a symmetric API it is the drivers
> responsibility to allocate/create and free/destroy the page_pool.
> Thus, after the driver have called xdp_rxq_info_unreg(), it is drivers
> responsibility to free the page_pool, but with a RCU free call.  This
> is done easily via the page_pool helper page_pool_destroy_rcu() (which
> avoids touching any driver code during the RCU callback, which could
> happen after the driver have been unloaded).
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/net/xdp.h |    3 +++
>  net/core/xdp.c    |   23 ++++++++++++++++++++---
>  2 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 859aa9b737fe..98b55eaf8fd7 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -36,6 +36,7 @@
>  enum mem_type {
>         MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
>         MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
> +       MEM_TYPE_PAGE_POOL,
>         MEM_TYPE_MAX,
>  };
>
> @@ -44,6 +45,8 @@ struct xdp_mem_info {
>         u32 id;
>  };
>
> +struct page_pool;
> +
>  struct xdp_rxq_info {
>         struct net_device *dev;
>         u32 queue_index;
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 06a5b39491ad..fe8e87abc266 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -8,6 +8,7 @@
>  #include <linux/slab.h>
>  #include <linux/idr.h>
>  #include <linux/rhashtable.h>
> +#include <net/page_pool.h>
>
>  #include <net/xdp.h>
>
> @@ -27,7 +28,10 @@ static struct rhashtable *mem_id_ht;
>
>  struct xdp_mem_allocator {
>         struct xdp_mem_info mem;
> -       void *allocator;
> +       union {
> +               void *allocator;
> +               struct page_pool *page_pool;
> +       };
>         struct rhash_head node;
>         struct rcu_head rcu;
>  };
> @@ -74,7 +78,9 @@ void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
>         /* Allow this ID to be reused */
>         ida_simple_remove(&mem_id_pool, xa->mem.id);
>
> -       /* TODO: Depending on allocator type/pointer free resources */
> +       /* Notice, driver is expected to free the *allocator,
> +        * e.g. page_pool, and MUST also use RCU free.
> +        */
>
>         /* Poison memory */
>         xa->mem.id = 0xFFFF;
> @@ -290,11 +296,21 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
>
>  void xdp_return_frame(void *data, struct xdp_mem_info *mem)
>  {
> -       struct xdp_mem_allocator *xa;
> +       struct xdp_mem_allocator *xa = NULL;
>
>         rcu_read_lock();
>         if (mem->id)
>                 xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> +
> +       if (mem->type == MEM_TYPE_PAGE_POOL) {
> +               struct page *page = virt_to_head_page(data);
> +
> +               if (xa)
> +                       page_pool_put_page(xa->page_pool, page);
> +               else
> +                       put_page(page);
> +               return;
> +       }

So at this point there need to be a few changes.

I would suggest doing a switch based on mem->type instead of this mess
of if statements.

Also the PAGE_POOL is the only one that needs the reference to the
memory allocator. You should move all of the logic for accessing the
hash table into the section specific to the page pool with a
fall-through into the other paths.
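
I.e. something along these lines (untested sketch):

void xdp_return_frame(void *data, struct xdp_mem_info *mem)
{
	struct xdp_mem_allocator *xa;
	struct page *page;

	switch (mem->type) {
	case MEM_TYPE_PAGE_POOL:
		/* Only the page_pool case needs the ID -> allocator lookup */
		rcu_read_lock();
		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
		page = virt_to_head_page(data);
		if (xa)
			page_pool_put_page(xa->page_pool, page);
		else
			put_page(page);
		rcu_read_unlock();
		break;
	case MEM_TYPE_PAGE_SHARED:
		page_frag_free(data);
		break;
	case MEM_TYPE_PAGE_ORDER0:
		put_page(virt_to_page(data)); /* Assumes order0 page */
		break;
	default:
		/* Rejected in xdp_rxq_info_reg_mem_model() */
		break;
	}
}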

>         rcu_read_unlock();
>
>         if (mem->type == MEM_TYPE_PAGE_SHARED) {
> @@ -306,6 +322,7 @@ void xdp_return_frame(void *data, struct xdp_mem_info *mem)
>                 struct page *page = virt_to_page(data); /* Assumes order0 page*/
>
>                 put_page(page);
> +               return;

Why add a return here? This was the end of the function already, wasn't it?

>         }
>  }
>  EXPORT_SYMBOL_GPL(xdp_return_frame);
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 05/15] xdp: introduce a new xdp_frame type
  2018-03-23 12:18 ` [bpf-next V5 PATCH 05/15] xdp: introduce a new xdp_frame type Jesper Dangaard Brouer
@ 2018-03-23 17:11   ` Alexander Duyck
  0 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2018-03-23 17:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

On Fri, Mar 23, 2018 at 5:18 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> This is needed to convert drivers tuntap and virtio_net.
>
> This is a generalization of what is done inside cpumap, which will be
> converted later.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  include/net/xdp.h |   40 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 40 insertions(+)
>
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 1ee154fe0be6..13f71a15c79f 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -59,6 +59,46 @@ struct xdp_buff {
>         struct xdp_rxq_info *rxq;
>  };
>
> +struct xdp_frame {
> +       void *data;
> +       u16 len;
> +       u16 headroom;
> +       u16 metasize;
> +       /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
> +        * while mem info is valid on remote CPU.
> +        */
> +       struct xdp_mem_info mem;
> +};
> +
> +/* Convert xdp_buff to xdp_frame */
> +static inline
> +struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> +{
> +       struct xdp_frame *xdp_frame;
> +       int metasize;
> +       int headroom;
> +
> +       /* Assure headroom is available for storing info */
> +       headroom = xdp->data - xdp->data_hard_start;
> +       metasize = xdp->data - xdp->data_meta;
> +       metasize = metasize > 0 ? metasize : 0;
> +       if (unlikely((headroom - metasize) < sizeof(*xdp_frame)))
> +               return NULL;
> +
> +       /* Store info in top of packet */
> +       xdp_frame = xdp->data_hard_start;
> +

I thought the point of this stuff was supposed to be performance. This
is going to hurt. You are adding yet another cacheline to the number
of lines accessed to process the frame.

I would like to know how much of a slowdown is being added after you
apply this to the Tx path of ixgbe. I suspect there should be a noticeable
difference for your basic xmit loopback test.

> +       xdp_frame->data = xdp->data;
> +       xdp_frame->len  = xdp->data_end - xdp->data;
> +       xdp_frame->headroom = headroom - sizeof(*xdp_frame);
> +       xdp_frame->metasize = metasize;
> +
> +       /* rxq only valid until napi_schedule ends, convert to xdp_mem_info */
> +       xdp_frame->mem = xdp->rxq->mem;
> +
> +       return xdp_frame;
> +}
> +
>  static inline
>  void xdp_return_frame(void *data, struct xdp_mem_info *mem)
>  {
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 14/15] xdp: transition into using xdp_frame for return API
  2018-03-23 12:19 ` [bpf-next V5 PATCH 14/15] xdp: transition into using xdp_frame for return API Jesper Dangaard Brouer
@ 2018-03-23 17:29   ` Alexander Duyck
  2018-03-26 11:42     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 35+ messages in thread
From: Alexander Duyck @ 2018-03-23 17:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

On Fri, Mar 23, 2018 at 5:19 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> Changing API xdp_return_frame() to take struct xdp_frame as argument,
> seems like a natural choice. But there are some subtle performance
> details here that needs extra care, which is a deliberate choice.
>
> When de-referencing xdp_frame on a remote CPU during DMA-TX
> completion, the cache-line changes to "Shared" state. Later, when the
> page is reused for RX, this xdp_frame cache-line is written, which
> changes the state to "Modified".
>
> This situation already happens (naturally) for, virtio_net, tun and
> cpumap as the xdp_frame pointer is the queued object.  In tun and
> cpumap, the ptr_ring is used for efficiently transferring cache-lines
> (with pointers) between CPUs. Thus, the only option is to
> de-referencing xdp_frame.
>
> It is only the ixgbe driver that had an optimization, in which it can
> avoid doing the de-reference of xdp_frame.  The driver already have
> TX-ring queue, which (in case of remote DMA-TX completion) have to be
> transferred between CPUs anyhow.  In this data area, we stored a
> struct xdp_mem_info and a data pointer, which allowed us to avoid
> de-referencing xdp_frame.
>
> To compensate for this, a prefetchw is used for telling the cache
> coherency protocol about our access pattern.  My benchmarks show that
> this prefetchw is enough to compensate the ixgbe driver.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

I'm really not a fan of this patch. It seems like it is adding a ton
of overhead for one case that is going to penalize all of the other
use cases for XDP. Just the extra prefetch is going to have a
significant impact on things like the XDP_DROP case.

> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h        |    4 +---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   17 +++++++++++------
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    1 +
>  drivers/net/tun.c                               |    4 ++--
>  drivers/net/virtio_net.c                        |    2 +-
>  include/net/xdp.h                               |    2 +-
>  kernel/bpf/cpumap.c                             |    6 +++---
>  net/core/xdp.c                                  |    4 +++-
>  8 files changed, 23 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index cbc20f199364..dfbc15a45cb4 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -240,8 +240,7 @@ struct ixgbe_tx_buffer {
>         unsigned long time_stamp;
>         union {
>                 struct sk_buff *skb;
> -               /* XDP uses address ptr on irq_clean */
> -               void *data;
> +               struct xdp_frame *xdpf;
>         };
>         unsigned int bytecount;
>         unsigned short gso_segs;
> @@ -249,7 +248,6 @@ struct ixgbe_tx_buffer {
>         DEFINE_DMA_UNMAP_ADDR(dma);
>         DEFINE_DMA_UNMAP_LEN(len);
>         u32 tx_flags;
> -       struct xdp_mem_info xdp_mem;
>  };
>
>  struct ixgbe_rx_buffer {

I suppose this makes most of my earlier suggestions pointless, though
I would be interested in seeing what the extra cost is that we are
taking for having to initialize one more cache line than before to
support the xdp_frame format.

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index ff069597fccf..e6e9b28ecfba 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
>
>                 /* free the skb */
>                 if (ring_is_xdp(tx_ring))
> -                       xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
> +                       xdp_return_frame(tx_buffer->xdpf);
>                 else
>                         napi_consume_skb(tx_buffer->skb, napi_budget);
>
> @@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>                         xdp.data_hard_start = xdp.data -
>                                               ixgbe_rx_offset(rx_ring);
>                         xdp.data_end = xdp.data + size;
> +                       prefetchw(xdp.data_hard_start); /* xdp_frame write */
>
>                         skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
>                 }

Do we really need to be prefetching yet another cache line? This is
going to hurt general performance since for most cases you are now
prefetching a cache line that will never be used.

> @@ -5787,7 +5788,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
>
>                 /* Free all the Tx ring sk_buffs */
>                 if (ring_is_xdp(tx_ring))
> -                       xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
> +                       xdp_return_frame(tx_buffer->xdpf);
>                 else
>                         dev_kfree_skb_any(tx_buffer->skb);
>
> @@ -8333,16 +8334,21 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
>         struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
>         struct ixgbe_tx_buffer *tx_buffer;
>         union ixgbe_adv_tx_desc *tx_desc;
> +       struct xdp_frame *xdpf;
>         u32 len, cmd_type;
>         dma_addr_t dma;
>         u16 i;
>
> -       len = xdp->data_end - xdp->data;
> +       xdpf = convert_to_xdp_frame(xdp);
> +       if (unlikely(!xdpf))
> +               return -EOVERFLOW;
> +
> +       len = xdpf->len;
>
>         if (unlikely(!ixgbe_desc_unused(ring)))
>                 return IXGBE_XDP_CONSUMED;
>
> -       dma = dma_map_single(ring->dev, xdp->data, len, DMA_TO_DEVICE);
> +       dma = dma_map_single(ring->dev, xdpf->data, len, DMA_TO_DEVICE);
>         if (dma_mapping_error(ring->dev, dma))
>                 return IXGBE_XDP_CONSUMED;
>
> @@ -8357,8 +8363,7 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
>
>         dma_unmap_len_set(tx_buffer, len, len);
>         dma_unmap_addr_set(tx_buffer, dma, dma);
> -       tx_buffer->data = xdp->data;
> -       tx_buffer->xdp_mem = xdp->rxq->mem;
> +       tx_buffer->xdpf = xdpf;
>
>         tx_desc->read.buffer_addr = cpu_to_le64(dma);
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 2ac78b88fc3d..00b9b13d9fea 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -896,6 +896,7 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
>                                       di->addr + wi->offset,
>                                       0, frag_size,
>                                       DMA_FROM_DEVICE);
> +       prefetchw(va); /* xdp_frame data area */
>         prefetch(data);
>         wi->offset += frag_size;
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 81fddf9cc58f..a7e42ae1b220 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -663,7 +663,7 @@ static void tun_ptr_free(void *ptr)
>         if (tun_is_xdp_frame(ptr)) {
>                 struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
>
> -               xdp_return_frame(xdpf->data, &xdpf->mem);
> +               xdp_return_frame(xdpf);
>         } else {
>                 __skb_array_destroy_skb(ptr);
>         }
> @@ -2188,7 +2188,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
>                 struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
>
>                 ret = tun_put_user_xdp(tun, tfile, xdpf, to);
> -               xdp_return_frame(xdpf->data, &xdpf->mem);
> +               xdp_return_frame(xdpf);
>         } else {
>                 struct sk_buff *skb = ptr;
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 48c86accd3b8..479a80339fad 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -430,7 +430,7 @@ static bool __virtnet_xdp_xmit(struct virtnet_info *vi,
>
>         /* Free up any pending old buffers before queueing new ones. */
>         while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
> -               xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
> +               xdp_return_frame(xdpf_sent);
>
>         xdpf = convert_to_xdp_frame(xdp);
>         if (unlikely(!xdpf))
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 98b55eaf8fd7..35aa9825fdd0 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -103,7 +103,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
>         return xdp_frame;
>  }
>
> -void xdp_return_frame(void *data, struct xdp_mem_info *mem);
> +void xdp_return_frame(struct xdp_frame *xdpf);
>
>  int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
>                      struct net_device *dev, u32 queue_index);
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index bcdc4dea5ce7..c95b04ec103e 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -219,7 +219,7 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
>
>         while ((xdpf = ptr_ring_consume(ring)))
>                 if (WARN_ON_ONCE(xdpf))
> -                       xdp_return_frame(xdpf->data, &xdpf->mem);
> +                       xdp_return_frame(xdpf);
>  }
>
>  static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
> @@ -275,7 +275,7 @@ static int cpu_map_kthread_run(void *data)
>
>                         skb = cpu_map_build_skb(rcpu, xdpf);
>                         if (!skb) {
> -                               xdp_return_frame(xdpf->data, &xdpf->mem);
> +                               xdp_return_frame(xdpf);
>                                 continue;
>                         }
>
> @@ -578,7 +578,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
>                 err = __ptr_ring_produce(q, xdpf);
>                 if (err) {
>                         drops++;
> -                       xdp_return_frame(xdpf->data, &xdpf->mem);
> +                       xdp_return_frame(xdpf);
>                 }
>                 processed++;
>         }
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index fe8e87abc266..6ed3d73a73be 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -294,9 +294,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
>  }
>  EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
>
> -void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> +void xdp_return_frame(struct xdp_frame *xdpf)
>  {
>         struct xdp_mem_allocator *xa = NULL;
> +       struct xdp_mem_info *mem = &xdpf->mem;
> +       void *data = xdpf->data;
>
>         rcu_read_lock();
>         if (mem->id)
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-23 16:56   ` Alexander Duyck
@ 2018-03-23 18:15     ` Jesper Dangaard Brouer
  2018-03-23 18:22       ` Alexander Duyck
  2018-03-26 21:04     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-23 18:15 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 23 Mar 2018 09:56:50 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Fri, Mar 23, 2018 at 5:18 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > Use the IDA infrastructure for getting a cyclic increasing ID number,
> > that is used for keeping track of each registered allocator per
> > RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
> > uses a radix tree, use a dynamic rhashtable, for creating ID to
> > pointer lookup table, because this is faster.
> >
> > The problem that is being solved here is that, the xdp_rxq_info
> > pointer (stored in xdp_buff) cannot be used directly, as the
> > guaranteed lifetime is too short.  The info is needed on a
> > (potentially) remote CPU during DMA-TX completion time. In an
> > xdp_frame the xdp_mem_info is stored, when it got converted from an
> > xdp_buff, which is sufficient for the simple page refcnt based recycle
> > schemes.
> >
> > For more advanced allocators there is a need to store a pointer to the
> > registered allocator.  Thus, there is a need to guard the lifetime or
> > validity of the allocator pointer, which is done through this
> > rhashtable ID map to pointer. The removal and validity of the
> > allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
> > allocator will be created by the driver, and registered with
> > xdp_rxq_info_reg_mem_model().
> >
> > It is up for debate who is responsible for freeing the allocator
> > pointer or invoking the allocator destructor function.  In any case,
> > this must happen via RCU freeing.
> >
> > Use the IDA infrastructure for getting a cyclic increasing ID number,
> > that is used for keeping track of each registered allocator per
> > RX-queue xdp_rxq_info.
> >
> > V4: Per req of Jason Wang
> > - Use xdp_rxq_info_reg_mem_model() in all drivers implementing
> >   XDP_REDIRECT, even-though it's not strictly necessary when
> >   allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    9 +
> >  drivers/net/tun.c                             |    6 +
> >  drivers/net/virtio_net.c                      |    7 +
> >  include/net/xdp.h                             |   15 --
> >  net/core/xdp.c                                |  230 ++++++++++++++++++++++++-
> >  5 files changed, 248 insertions(+), 19 deletions(-)
> >
[...]
> >  int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
> >                                enum mem_type type, void *allocator)
> >  {
> > +       struct xdp_mem_allocator *xdp_alloc;
> > +       gfp_t gfp = GFP_KERNEL;
> > +       int id, errno, ret;
> > +       void *ptr;
> > +
> > +       if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
> > +               WARN(1, "Missing register, driver bug");
> > +               return -EFAULT;
> > +       }
> > +
> >         if (type >= MEM_TYPE_MAX)
> >                 return -EINVAL;
> >
> >         xdp_rxq->mem.type = type;
> >
> > -       if (allocator)
> > -               return -EOPNOTSUPP;
> > +       if (!allocator)
> > +               return 0;
> > +
> > +       /* Delay init of rhashtable to save memory if feature isn't used */
> > +       if (!mem_id_init) {
> > +               mutex_lock(&mem_id_lock);
> > +               ret = __mem_id_init_hash_table();
> > +               mutex_unlock(&mem_id_lock);
> > +               if (ret < 0) {
> > +                       WARN_ON(1);
> > +                       return ret;
> > +               }
> > +       }
> > +
> > +       xdp_alloc = kzalloc(sizeof(*xdp_alloc), gfp);
> > +       if (!xdp_alloc)
> > +               return -ENOMEM;
> > +
> > +       mutex_lock(&mem_id_lock);
> > +       id = __mem_id_cyclic_get(gfp);
> > +       if (id < 0) {
> > +               errno = id;
> > +               goto err;
> > +       }
> > +       xdp_rxq->mem.id = id;
> > +       xdp_alloc->mem  = xdp_rxq->mem;
> > +       xdp_alloc->allocator = allocator;
> > +
> > +       /* Insert allocator into ID lookup table */
> > +       ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node);
> > +       if (IS_ERR(ptr)) {
> > +               errno = PTR_ERR(ptr);
> > +               goto err;
> > +       }
> > +
> > +       mutex_unlock(&mem_id_lock);
> >
> > -       /* TODO: Allocate an ID that maps to allocator pointer
> > -        * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
> > -        */
> >         return 0;
> > +err:
> > +       mutex_unlock(&mem_id_lock);
> > +       kfree(xdp_alloc);
> > +       return errno;
> >  }
> >  EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
> > +
> > +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> > +{
> > +       struct xdp_mem_allocator *xa;
> > +
> > +       rcu_read_lock();
> > +       if (mem->id)
> > +               xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > +       rcu_read_unlock();
> > +
> > +       if (mem->type == MEM_TYPE_PAGE_SHARED) {
> > +               page_frag_free(data);
> > +               return;
> > +       }
> > +
> > +       if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> > +               struct page *page = virt_to_page(data); /* Assumes order0 page*/
> > +
> > +               put_page(page);
> > +       }
> > +}
> > +EXPORT_SYMBOL_GPL(xdp_return_frame);
> >  
> 
> I'm not sure what the point is of getting the xa value if it is not
> going to be used. Also I would assume there are types that won't even
> need the hash table lookup. I would prefer to see this bit held off on
> until you have something that actually needs it.

I think you misread the patch. The lookup is NOT going to be performed
when mem->id is zero, which is the case that you are interested in for
your ixgbe driver.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-23 18:15     ` Jesper Dangaard Brouer
@ 2018-03-23 18:22       ` Alexander Duyck
  0 siblings, 0 replies; 35+ messages in thread
From: Alexander Duyck @ 2018-03-23 18:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan

On Fri, Mar 23, 2018 at 11:15 AM, Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
> On Fri, 23 Mar 2018 09:56:50 -0700
> Alexander Duyck <alexander.duyck@gmail.com> wrote:
>
>> On Fri, Mar 23, 2018 at 5:18 AM, Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>> > Use the IDA infrastructure for getting a cyclic increasing ID number,
>> > that is used for keeping track of each registered allocator per
>> > RX-queue xdp_rxq_info.  Instead of using the IDR infrastructure, which
>> > uses a radix tree, use a dynamic rhashtable, for creating ID to
>> > pointer lookup table, because this is faster.
>> >
>> > The problem that is being solved here is that, the xdp_rxq_info
>> > pointer (stored in xdp_buff) cannot be used directly, as the
>> > guaranteed lifetime is too short.  The info is needed on a
>> > (potentially) remote CPU during DMA-TX completion time. In an
>> > xdp_frame the xdp_mem_info is stored, when it got converted from an
>> > xdp_buff, which is sufficient for the simple page refcnt based recycle
>> > schemes.
>> >
>> > For more advanced allocators there is a need to store a pointer to the
>> > registered allocator.  Thus, there is a need to guard the lifetime or
>> > validity of the allocator pointer, which is done through this
>> > rhashtable ID map to pointer. The removal and validity of the
>> > allocator and helper struct xdp_mem_allocator is guarded by RCU.  The
>> > allocator will be created by the driver, and registered with
>> > xdp_rxq_info_reg_mem_model().
>> >
>> > It is up-to debate who is responsible for freeing the allocator
>> > pointer or invoking the allocator destructor function.  In any case,
>> > this must happen via RCU freeing.
>> >
>> > Use the IDA infrastructure for getting a cyclic increasing ID number,
>> > that is used for keeping track of each registered allocator per
>> > RX-queue xdp_rxq_info.
>> >
>> > V4: Per req of Jason Wang
>> > - Use xdp_rxq_info_reg_mem_model() in all drivers implementing
>> >   XDP_REDIRECT, even-though it's not strictly necessary when
>> >   allocator==NULL for type MEM_TYPE_PAGE_SHARED (given it's zero).
>> >
>> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> > ---
>> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |    9 +
>> >  drivers/net/tun.c                             |    6 +
>> >  drivers/net/virtio_net.c                      |    7 +
>> >  include/net/xdp.h                             |   15 --
>> >  net/core/xdp.c                                |  230 ++++++++++++++++++++++++-
>> >  5 files changed, 248 insertions(+), 19 deletions(-)
>> >
> [...]
>> >  int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
>> >                                enum mem_type type, void *allocator)
>> >  {
>> > +       struct xdp_mem_allocator *xdp_alloc;
>> > +       gfp_t gfp = GFP_KERNEL;
>> > +       int id, errno, ret;
>> > +       void *ptr;
>> > +
>> > +       if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
>> > +               WARN(1, "Missing register, driver bug");
>> > +               return -EFAULT;
>> > +       }
>> > +
>> >         if (type >= MEM_TYPE_MAX)
>> >                 return -EINVAL;
>> >
>> >         xdp_rxq->mem.type = type;
>> >
>> > -       if (allocator)
>> > -               return -EOPNOTSUPP;
>> > +       if (!allocator)
>> > +               return 0;
>> > +
>> > +       /* Delay init of rhashtable to save memory if feature isn't used */
>> > +       if (!mem_id_init) {
>> > +               mutex_lock(&mem_id_lock);
>> > +               ret = __mem_id_init_hash_table();
>> > +               mutex_unlock(&mem_id_lock);
>> > +               if (ret < 0) {
>> > +                       WARN_ON(1);
>> > +                       return ret;
>> > +               }
>> > +       }
>> > +
>> > +       xdp_alloc = kzalloc(sizeof(*xdp_alloc), gfp);
>> > +       if (!xdp_alloc)
>> > +               return -ENOMEM;
>> > +
>> > +       mutex_lock(&mem_id_lock);
>> > +       id = __mem_id_cyclic_get(gfp);
>> > +       if (id < 0) {
>> > +               errno = id;
>> > +               goto err;
>> > +       }
>> > +       xdp_rxq->mem.id = id;
>> > +       xdp_alloc->mem  = xdp_rxq->mem;
>> > +       xdp_alloc->allocator = allocator;
>> > +
>> > +       /* Insert allocator into ID lookup table */
>> > +       ptr = rhashtable_insert_slow(mem_id_ht, &id, &xdp_alloc->node);
>> > +       if (IS_ERR(ptr)) {
>> > +               errno = PTR_ERR(ptr);
>> > +               goto err;
>> > +       }
>> > +
>> > +       mutex_unlock(&mem_id_lock);
>> >
>> > -       /* TODO: Allocate an ID that maps to allocator pointer
>> > -        * See: https://www.kernel.org/doc/html/latest/core-api/idr.html
>> > -        */
>> >         return 0;
>> > +err:
>> > +       mutex_unlock(&mem_id_lock);
>> > +       kfree(xdp_alloc);
>> > +       return errno;
>> >  }
>> >  EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
>> > +
>> > +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
>> > +{
>> > +       struct xdp_mem_allocator *xa;
>> > +
>> > +       rcu_read_lock();
>> > +       if (mem->id)
>> > +               xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
>> > +       rcu_read_unlock();
>> > +
>> > +       if (mem->type == MEM_TYPE_PAGE_SHARED) {
>> > +               page_frag_free(data);
>> > +               return;
>> > +       }
>> > +
>> > +       if (mem->type == MEM_TYPE_PAGE_ORDER0) {
>> > +               struct page *page = virt_to_page(data); /* Assumes order0 page*/
>> > +
>> > +               put_page(page);
>> > +       }
>> > +}
>> > +EXPORT_SYMBOL_GPL(xdp_return_frame);
>> >
>>
>> I'm not sure what the point is of getting the xa value if it is not
>> going to be used. Also I would assume there are types that won't even
>> need the hash table lookup. I would prefer to see this bit held off on
>> until you have something that actually needs it.
>
> I think you misread the patch. The lookup is NOT going to be performed
> when mem->id is zero, which is the case you are interested in for
> your ixgbe driver.

Sorry, to clarify: why do I have to take rcu_read_lock and
rcu_read_unlock if I am not doing an RCU read? And why even bother doing
a conditional check on mem->id if the result of that lookup is never used?

Basically, if I am not using it, why should I take any of the overhead
for it? I would much rather have this code reduced to be as small and
fast as possible, instead of wasting cycles on the RCU acquire/release,
reading mem->id, testing mem->id, and then jumping over code that is
not used in either case, even when mem->id isn't 0.

What I don't get is why you aren't getting compiler warnings about xa
being assigned but never used.

- Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 14/15] xdp: transition into using xdp_frame for return API
  2018-03-23 17:29   ` Alexander Duyck
@ 2018-03-26 11:42     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-26 11:42 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan, brouer


On Fri, 23 Mar 2018 10:29:29 -0700 Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Fri, Mar 23, 2018 at 5:19 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
> > Changing the xdp_return_frame() API to take a struct xdp_frame as
> > argument seems like a natural choice. But there are some subtle
> > performance details here that need extra care, so the handling below
> > is a deliberate choice.
> >
> > De-referencing the xdp_frame on a remote CPU during DMA-TX completion
> > results in the cache-line changing to "Shared" state. Later, when the
> > page is reused for RX, this xdp_frame cache-line is written, which
> > changes the state to "Modified".
> >
> > This situation already happens (naturally) for virtio_net, tun and
> > cpumap, as the xdp_frame pointer is the queued object.  In tun and
> > cpumap, the ptr_ring is used for efficiently transferring cache-lines
> > (with pointers) between CPUs. Thus, the only option is to
> > de-reference the xdp_frame.
> >
> > It is only the ixgbe driver that had an optimization which allowed it
> > to avoid de-referencing the xdp_frame.  The driver already has a
> > TX-ring queue, which (in case of remote DMA-TX completion) has to be
> > transferred between CPUs anyhow.  In this data area, we stored a
> > struct xdp_mem_info and a data pointer, which allowed us to avoid
> > de-referencing the xdp_frame.
> >
> > To compensate for this, a prefetchw is used to tell the cache
> > coherency protocol about our access pattern.  My benchmarks show that
> > this prefetchw is enough to compensate in the ixgbe driver.
> >
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>  
> 
> I'm really not a fan of this patch. It seems like it is adding a ton
> of overhead for one case that is going to penalize all of the other
> use cases for XDP. Just the extra prefetch is going to have a
> significant impact on things like the XDP_DROP case.
> 
[...]
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index ff069597fccf..e6e9b28ecfba 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -1207,7 +1207,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
> >
> >                 /* free the skb */
> >                 if (ring_is_xdp(tx_ring))
> > -                       xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
> > +                       xdp_return_frame(tx_buffer->xdpf);
> >                 else
> >                         napi_consume_skb(tx_buffer->skb, napi_budget);
> >
> > @@ -2376,6 +2376,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
> >                         xdp.data_hard_start = xdp.data -
> >                                               ixgbe_rx_offset(rx_ring);
> >                         xdp.data_end = xdp.data + size;
> > +                       prefetchw(xdp.data_hard_start); /* xdp_frame write */
> >
> >                         skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
> >                 }  
> 
> Do we really need to be prefetching yet another cache line? This is
> going to hurt general performance since for most cases you are now
> prefetching a cache line that will never be used.

My intuition was also that this would hurt performance.  But I've done
many tests, and they all show no performance regression from this change!

XDP_DROP testing with xdp1 on ixgbe:

 Baseline: 14,703,344 pkt/s
 Patched:  14,711,574 pkt/s

For people reproducing, notice that it requires tuning on the generator
side (with pktgen) to reach these extreme speeds.

As you might have noticed, the prefetchw is only active when an XDP
program is loaded.  Thus, I created a benchmark[1] that loads an XDP
program and always returns XDP_PASS (via --action).
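
For reference, a minimal sketch of such a pass-through program
(illustrative only, not the actual xdp_bench01 code from [1]; assumes a
libbpf-style build):

  /* Minimal pass-through XDP program (sketch) */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("xdp")
  int xdp_pass_prog(struct xdp_md *ctx)
  {
          return XDP_PASS; /* every packet continues into the normal stack */
  }

  char _license[] SEC("license") = "GPL";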

Then, I drop packets in the iptables raw table:

 Baseline: 5,783,509 pps
 Patched:  5,789,195 pps

Then I unload the netfilter modules and let packets reach the UDP
"UdpNoPorts" listening stage:

 Baseline: 3,832,956 pps
 Patched:  3,855,470 pps

Then, I add a udp_sink (--recvmsg) to allow packets to travel deeper into
the network stack (and force udp_sink to run on another CPU):

 Baseline: 2,278,855 pps
 Patched:  2,270,373 pps

These measurements should make it clear that this patchset does not add
"a ton of overhead" and does not "penalize all of the other use cases
for XDP".

For the XDP_REDIRECT use-case, I actually find the prefetchw() quite
beautiful: the xdp_buff (which is a stack variable) is used by the
bpf_prog, and xdp_buff is known to be hot, so the prefetch has time to
fetch the memory area that the xdp_buff-to-xdp_frame conversion writes
into.
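
To illustrate what that prefetchw actually covers, here is a rough
sketch of the xdp_buff to xdp_frame conversion (simplified; the real
helper in this patchset also carries metasize and the xdp_mem_info):

  /* Simplified sketch: the xdp_frame meta data is placed in the top of
   * the packet headroom (at data_hard_start), which is exactly the
   * cache-line that the prefetchw warms up while the bpf_prog runs.
   */
  static inline struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
  {
          struct xdp_frame *xdp_frame;
          int headroom;

          headroom = xdp->data - xdp->data_hard_start;
          if (headroom < (int)sizeof(*xdp_frame))
                  return NULL; /* not enough headroom, caller handles fallback */

          /* Store frame info in the top of the packet headroom */
          xdp_frame = xdp->data_hard_start;
          xdp_frame->data     = xdp->data;
          xdp_frame->len      = xdp->data_end - xdp->data;
          xdp_frame->headroom = headroom - sizeof(*xdp_frame);

          return xdp_frame;
  }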

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_bench01_mem_access_cost_kern.c

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code
  2018-03-23 13:28   ` Eric Dumazet
@ 2018-03-26 14:09     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-26 14:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	Jason Wang, John Fastabend, Eran Ben Elisha, Saeed Mahameed,
	galp, Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 23 Mar 2018 06:28:13 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 03/23/2018 05:18 AM, Jesper Dangaard Brouer wrote:
> 
> > +
> > +	/* Note, below struct compat code was primarily needed when
> > +	 * page_pool code lived under MM-tree control, given mmots and
> > +	 * net-next trees progress in very different rates.
> > +	 *
> > +	 * Allow kernel devel trees and driver to progress at different rates
> > +	 */
> > +	param_copy_sz = PAGE_POOL_PARAMS_SIZE;
> > +	memset(&pool->p, 0, param_copy_sz);
> > +	if (params->size < param_copy_sz) {
> > +		/* Older module calling newer kernel, handled by only
> > +		 * copying supplied size, and keep remaining params zero
> > +		 */
> > +		param_copy_sz = params->size;
> > +	} else if (params->size > param_copy_sz) {
> > +		/* Newer module calling older kernel. Need to validate
> > +		 * no new features were requested.
> > +		 */
> > +		unsigned char *addr = (unsigned char *)params + param_copy_sz;
> > +		unsigned char *end  = (unsigned char *)params + params->size;
> > +
> > +		for (; addr < end; addr++) {
> > +			if (*addr != 0)
> > +				return -E2BIG;
> > +		}
> > +	}  
> 
> I do not see the need for this part.

Okay, then I'll just drop it.
I'm considering changing page_pool_create() to just take everything
as parameters.
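
Something along these lines; this is only a sketch of the direction,
the field set mirrors the current params but nothing here is a settled
API:

  /* Hypothetical sketch -- a plain parameter struct instead of the
   * size-versioned compat handling above.
   */
  struct page_pool_params {
          u32             flags;          /* e.g. DMA-mapping on/off */
          unsigned int    order;          /* page order, 0 for order-0 pages */
          unsigned int    pool_size;      /* elements in the recycle ptr_ring */
          int             nid;            /* NUMA node id for page allocation */
          struct device   *dev;           /* device, for the DMA-mapping case */
          enum dma_data_direction dma_dir; /* DMA mapping direction */
  };

  struct page_pool *page_pool_create(const struct page_pool_params *params);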

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code
  2018-03-23 14:55       ` Eric Dumazet
@ 2018-03-26 15:19         ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-26 15:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	Jason Wang, John Fastabend, Eran Ben Elisha, Saeed Mahameed,
	galp, Daniel Borkmann, Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 23 Mar 2018 07:55:37 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> An RCU grace period before freeing huge quantities of pages is
> problematic and could be used by syzbot to OOM the host.

Okay. I've adjusted the code to empty the ring right away when the
driver calls destroy, and to only RCU-delay the freeing of the
page_pool pointer itself, which also frees/empties the ring of pages
from any in-flight producers.
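
Roughly like the sketch below. Note this is only an illustration of the
scheme: the 'rcu' member and the page_pool_destroy_rcu() callback name
are assumptions, and the real code also has to undo DMA mappings:

  /* Sketch of the destroy scheme: pages are flushed synchronously, only
   * the small page_pool struct waits for an RCU grace period.
   */
  static void page_pool_destroy_rcu(struct rcu_head *rcu)
  {
          struct page_pool *pool = container_of(rcu, struct page_pool, rcu);
          struct page *page;

          /* Catch pages that in-flight producers managed to return in
           * the meantime, then release the ring and the pool itself.
           */
          while ((page = ptr_ring_consume(&pool->ring)))
                  put_page(page);

          ptr_ring_cleanup(&pool->ring, NULL);
          kfree(pool);
  }

  void page_pool_destroy(struct page_pool *pool)
  {
          struct page *page;

          /* Empty the recycle ring right away, returning pages to the
           * page allocator (real code also undoes the DMA mapping here).
           */
          while ((page = ptr_ring_consume(&pool->ring)))
                  put_page(page);

          /* In-flight producers may still hold the pool pointer under
           * rcu_read_lock(), so delay freeing the struct itself.
           */
          call_rcu(&pool->rcu, page_pool_destroy_rcu);
  }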

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap
  2018-03-23 16:35   ` Alexander Duyck
@ 2018-03-26 19:06     ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-26 19:06 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 23 Mar 2018 09:35:47 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index b2362ddfa694..15b546325e31 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -33,16 +33,44 @@
> >   * also mandatory during RX-ring setup.
> >   */
> >
> > +enum mem_type {
> > +       MEM_TYPE_PAGE_SHARED = 0, /* Split-page refcnt based model */
> > +       MEM_TYPE_PAGE_ORDER0,     /* Orig XDP full page model */
> > +       MEM_TYPE_MAX,
> > +};
> > +
> > +struct xdp_mem_info {
> > +       u32 type; /* enum mem_type, but known size type */  
> 
> Do you really need to make this a full u32 value for something that
> is likely never going to exceed a single-digit value?

Could we please save this kind of optimization for later?

I do have a plan for state compression later.  If you noticed, I'll
add an 'id' member later in this patchset.  And in the ID patch, I'm
actually limiting the IDs to max 16 bits.  Thus, the plan is to allow
storing type+id in e.g. a u32.  BUT I see no reason to do this now and
complicate the patchset further. (When we do this state compression, I
actually want to run some perf tests, as that kind of work can cause
unfortunate compiler and assembler level constructs that are slower...)
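
Purely as an illustration of that future direction (not part of this
patchset), the packed layout could look something like:

  /* Hypothetical future layout: type and id packed into one 32-bit
   * word, since the id is limited to 16 bits anyway.
   */
  struct xdp_mem_info {
          u32 type:16;    /* enum mem_type, but known-size bitfield */
          u32 id:16;      /* allocator id from the rhashtable registration */
  };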

> > +       /* u32 id; will be added later in this patchset */  
> 
> Wouldn't it be better to just hold off and add it then instead of
> adding it as a comment?

I was trying to help the reviewer see where this was going.  The line
will be updated within the patchset, as the comment promises, so there
is no real danger in having it here.

[...]
> > +static inline
> > +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> > +{
> > +       if (mem->type == MEM_TYPE_PAGE_SHARED)
> > +               page_frag_free(data);
> > +
> > +       if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> > +               struct page *page = virt_to_page(data); /* Assumes order0 page*/
> > +
> > +               put_page(page);
> > +       }  
> 
> Actually page_frag_free would probably work for either one.  Also is it
> safe to assume that the page is order 0? Are the only users of
> compound pages that support XDP also only supporting the page fragment
> setup?
> 
> Also you probably don't need put_page. It might be better to use
> __free_page if you are certain the pages are coming from the Rx path
> of drivers and don't have any special destructors associated with
> them.

Can we also keep this kind of optimization for later, please? (P.S. I
do know this; e.g. I put a comment in page_pool about using __free_pages().)


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [bpf-next V5 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping
  2018-03-23 16:56   ` Alexander Duyck
  2018-03-23 18:15     ` Jesper Dangaard Brouer
@ 2018-03-26 21:04     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 35+ messages in thread
From: Jesper Dangaard Brouer @ 2018-03-26 21:04 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Netdev, BjörnTöpel, Karlsson, Magnus,
	Eugenia Emantayev, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, Gal Pressman, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan, brouer

On Fri, 23 Mar 2018 09:56:50 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> > +void xdp_return_frame(void *data, struct xdp_mem_info *mem)
> > +{
> > +       struct xdp_mem_allocator *xa;
> > +
> > +       rcu_read_lock();
> > +       if (mem->id)
> > +               xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > +       rcu_read_unlock();
> > +
> > +       if (mem->type == MEM_TYPE_PAGE_SHARED) {
> > +               page_frag_free(data);
> > +               return;
> > +       }
> > +
> > +       if (mem->type == MEM_TYPE_PAGE_ORDER0) {
> > +               struct page *page = virt_to_page(data); /* Assumes order0 page*/
> > +
> > +               put_page(page);
> > +       }
> > +}
> > +EXPORT_SYMBOL_GPL(xdp_return_frame);
> >  
> 
> I'm not sure what the point is of getting the xa value if it is not
> going to be used. Also I would assume there are types that won't even
> need the hash table lookup. I would prefer to see this bit held off on
> until you have something that actually needs it.

I agree, fixed.
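
For reference, the fixed version goes roughly in this direction; treat
it as a sketch only, since the MEM_TYPE_PAGE_POOL branch only becomes
meaningful later in the patchset and the page_pool member/call shown
here are assumptions, not the final code:

  /* Sketch: only mem types backed by a registered allocator pay for the
   * RCU-protected rhashtable lookup; the simple page based types do not.
   */
  void xdp_return_frame(void *data, struct xdp_mem_info *mem)
  {
          struct xdp_mem_allocator *xa;
          struct page *page;

          switch (mem->type) {
          case MEM_TYPE_PAGE_SHARED:
                  page_frag_free(data);
                  break;
          case MEM_TYPE_PAGE_ORDER0:
                  page = virt_to_page(data); /* assumes order-0 page */
                  put_page(page);
                  break;
          case MEM_TYPE_PAGE_POOL: /* added later in this patchset */
                  rcu_read_lock();
                  /* mem->id is always non-zero for allocator-backed types */
                  xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
                  page = virt_to_head_page(data);
                  if (xa)
                          page_pool_put_page(xa->page_pool, page);
                  rcu_read_unlock();
                  break;
          default:
                  /* Not possible, checked in xdp_rxq_info_reg_mem_model() */
                  break;
          }
  }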

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2018-03-26 21:04 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-23 12:17 [bpf-next V5 PATCH 00/15] XDP redirect memory return API Jesper Dangaard Brouer
2018-03-23 12:18 ` [bpf-next V5 PATCH 01/15] mlx5: basic XDP_REDIRECT forward support Jesper Dangaard Brouer
2018-03-23 12:18 ` [bpf-next V5 PATCH 02/15] xdp: introduce xdp_return_frame API and use in cpumap Jesper Dangaard Brouer
2018-03-23 16:35   ` Alexander Duyck
2018-03-26 19:06     ` Jesper Dangaard Brouer
2018-03-23 12:18 ` [bpf-next V5 PATCH 03/15] ixgbe: use xdp_return_frame API Jesper Dangaard Brouer
2018-03-23 16:46   ` Alexander Duyck
2018-03-23 12:18 ` [bpf-next V5 PATCH 04/15] xdp: move struct xdp_buff from filter.h to xdp.h Jesper Dangaard Brouer
2018-03-23 12:18 ` [bpf-next V5 PATCH 05/15] xdp: introduce a new xdp_frame type Jesper Dangaard Brouer
2018-03-23 17:11   ` Alexander Duyck
2018-03-23 12:18 ` [bpf-next V5 PATCH 06/15] tun: convert to use generic xdp_frame and xdp_return_frame API Jesper Dangaard Brouer
2018-03-23 12:18 ` [bpf-next V5 PATCH 07/15] virtio_net: " Jesper Dangaard Brouer
2018-03-23 12:18 ` [bpf-next V5 PATCH 08/15] bpf: cpumap convert to use generic xdp_frame Jesper Dangaard Brouer
2018-03-23 12:18 ` [bpf-next V5 PATCH 09/15] mlx5: register a memory model when XDP is enabled Jesper Dangaard Brouer
2018-03-23 16:18   ` Sergei Shtylyov
2018-03-23 12:18 ` [bpf-next V5 PATCH 10/15] xdp: rhashtable with allocator ID to pointer mapping Jesper Dangaard Brouer
2018-03-23 16:56   ` Alexander Duyck
2018-03-23 18:15     ` Jesper Dangaard Brouer
2018-03-23 18:22       ` Alexander Duyck
2018-03-26 21:04     ` Jesper Dangaard Brouer
2018-03-23 12:18 ` [bpf-next V5 PATCH 11/15] page_pool: refurbish version of page_pool code Jesper Dangaard Brouer
2018-03-23 13:28   ` Eric Dumazet
2018-03-26 14:09     ` Jesper Dangaard Brouer
2018-03-23 13:29   ` Eric Dumazet
2018-03-23 14:15     ` Jesper Dangaard Brouer
2018-03-23 14:55       ` Eric Dumazet
2018-03-26 15:19         ` Jesper Dangaard Brouer
2018-03-23 13:37   ` Eric Dumazet
2018-03-23 12:18 ` [bpf-next V5 PATCH 12/15] xdp: allow page_pool as an allocator type in xdp_return_frame Jesper Dangaard Brouer
2018-03-23 17:02   ` Alexander Duyck
2018-03-23 12:19 ` [bpf-next V5 PATCH 13/15] mlx5: use page_pool for xdp_return_frame call Jesper Dangaard Brouer
2018-03-23 12:19 ` [bpf-next V5 PATCH 14/15] xdp: transition into using xdp_frame for return API Jesper Dangaard Brouer
2018-03-23 17:29   ` Alexander Duyck
2018-03-26 11:42     ` Jesper Dangaard Brouer
2018-03-23 12:19 ` [bpf-next V5 PATCH 15/15] xdp: transition into using xdp_frame for ndo_xdp_xmit Jesper Dangaard Brouer
