netdev.vger.kernel.org archive mirror
* [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support
@ 2020-09-03 20:58 Lorenzo Bianconi
  2020-09-03 20:58 ` [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
                   ` (10 more replies)
  0 siblings, 11 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

- Finalize XDP multi-buffer support for the mvneta driver, introducing the
  capability to map non-linear buffers on the tx side.
- Introduce a multi-buffer bit (mb) in xdp_frame/xdp_buff to specify if the
  shared_info area has been properly initialized.
- Initialize the multi-buffer bit (mb) to 0 in all XDP-capable drivers.
- Add multi-buff support to the xdp_return_{buff/frame} utility routines.
- Introduce the bpf_xdp_adjust_mb_header helper to adjust frame headers, moving
  *offset* bytes from/to the second buffer to/from the first one.
  This helper can be used to move headers when the hw DMA SG is not able
  to copy all the headers into the first fragment and splits header and data
  pages. A possible use case for bpf_xdp_adjust_mb_header is described
  here [0].
- Introduce the bpf_xdp_get_frag_count and bpf_xdp_get_frags_total_size helpers
  to report the total number/size of frags for a given xdp multi-buff.

XDP multi-buffer design principles are described here [1].
For the moment we have not implemented any self-tests for the introduced bpf
helpers. We can address this in a follow-up series if the proposed approach
is accepted.
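As a rough illustration of what the proposed frag-count/total-size helpers are
expected to report, here is a userspace sketch with mock stand-ins for the
kernel types (struct frag, struct shared_info, frag_count, frags_total_size are
all hypothetical names, not the series' actual API):

```c
#include <stddef.h>

/* Minimal stand-ins for the kernel's skb_frag_t / skb_shared_info,
 * for illustration only. */
struct frag { size_t size; };
struct shared_info { unsigned int nr_frags; struct frag frags[17]; };

/* A frag-count helper is only meaningful when the mb bit says the
 * shared_info area was initialized; otherwise the buffer is linear. */
static unsigned int frag_count(const struct shared_info *sinfo, int mb)
{
	return mb ? sinfo->nr_frags : 0;
}

/* A total-size helper sums the sizes of all fragments. */
static size_t frags_total_size(const struct shared_info *sinfo, int mb)
{
	size_t total = 0;
	unsigned int i;

	for (i = 0; i < frag_count(sinfo, mb); i++)
		total += sinfo->frags[i].size;
	return total;
}
```

Note the mb guard in both helpers: for a linear buffer the shared_info area is
uninitialized, so the count and total must both be reported as zero.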

Changes since v1:
- Fix use-after-free in xdp_return_{buff/frame}
- Introduce bpf helpers
- Introduce xdp_mb sample program
- Access skb_shared_info->nr_frags only on the last fragment

Changes since RFC:
- squash multi-buffer bit initialization in a single patch
- add mvneta non-linear XDP buff support for tx side

[0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org

Lorenzo Bianconi (7):
  xdp: introduce mb in xdp_buff/xdp_frame
  xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  xdp: add multi-buff support to xdp_return_{buff/frame}
  net: mvneta: add multi buffer support to XDP_TX
  bpf: helpers: add bpf_xdp_adjust_mb_header helper
  net: mvneta: enable jumbo frames for XDP

Sameeh Jubran (2):
  bpf: helpers: add multibuffer support
  samples/bpf: add bpf program that uses xdp mb helpers

 drivers/net/ethernet/amazon/ena/ena_netdev.c  |   1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   1 +
 .../net/ethernet/cavium/thunder/nicvf_main.c  |   1 +
 .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |   1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   1 +
 drivers/net/ethernet/intel/ice/ice_txrx.c     |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   1 +
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |   1 +
 drivers/net/ethernet/marvell/mvneta.c         | 126 ++++++------
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |   1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |   1 +
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   1 +
 .../ethernet/netronome/nfp/nfp_net_common.c   |   1 +
 drivers/net/ethernet/qlogic/qede/qede_fp.c    |   1 +
 drivers/net/ethernet/sfc/rx.c                 |   1 +
 drivers/net/ethernet/socionext/netsec.c       |   1 +
 drivers/net/ethernet/ti/cpsw.c                |   1 +
 drivers/net/ethernet/ti/cpsw_new.c            |   1 +
 drivers/net/hyperv/netvsc_bpf.c               |   1 +
 drivers/net/tun.c                             |   2 +
 drivers/net/veth.c                            |   1 +
 drivers/net/virtio_net.c                      |   2 +
 drivers/net/xen-netfront.c                    |   1 +
 include/net/xdp.h                             |  26 ++-
 include/uapi/linux/bpf.h                      |  39 +++-
 net/core/dev.c                                |   1 +
 net/core/filter.c                             |  93 +++++++++
 net/core/xdp.c                                |  40 ++++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_mb_kern.c                     |  68 +++++++
 samples/bpf/xdp_mb_user.c                     | 182 ++++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  40 +++-
 32 files changed, 572 insertions(+), 70 deletions(-)
 create mode 100644 samples/bpf/xdp_mb_kern.c
 create mode 100644 samples/bpf/xdp_mb_user.c

-- 
2.26.2


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
@ 2020-09-03 20:58 ` Lorenzo Bianconi
  2020-09-04  1:07   ` Alexei Starovoitov
  2020-09-03 20:58 ` [PATCH v2 net-next 2/9] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers Lorenzo Bianconi
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

Introduce a multi-buffer bit (mb) in xdp_frame/xdp_buff to specify
whether the shared_info area has been properly initialized for
non-linear xdp buffers.
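The bit layout can be sketched in userspace to show that stealing one bit from
frame_sz keeps both structs at their original size (struct names here are
hypothetical mocks, not the kernel structs):

```c
#include <stdint.h>

/* Mock of the xdp_buff change: frame_sz shrinks from 32 to 31 bits,
 * so the mb flag costs no extra space. */
struct buff_flags {
	uint32_t frame_sz : 31; /* frame size to deduce data_hard_end */
	uint32_t mb : 1;        /* set when the buffer is non-linear */
};

/* Mock of the xdp_frame change: frame_sz shrinks from 24 to 23 bits,
 * sharing the same u32 with metasize. */
struct frame_flags {
	uint32_t metasize : 8;
	uint32_t frame_sz : 23;
	uint32_t mb : 1;        /* set when the frame is non-linear */
};
```

Both mocks still fit in a single 32-bit word, mirroring how the patch avoids
growing either structure.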

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/net/xdp.h | 8 ++++++--
 net/core/xdp.c    | 1 +
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 3814fb631d52..42f439f9fcda 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -72,7 +72,8 @@ struct xdp_buff {
 	void *data_hard_start;
 	struct xdp_rxq_info *rxq;
 	struct xdp_txq_info *txq;
-	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
+	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved tailroom*/
+	u32 mb:1; /* xdp non-linear buffer */
 };
 
 /* Reserve memory area at end-of data area.
@@ -96,7 +97,8 @@ struct xdp_frame {
 	u16 len;
 	u16 headroom;
 	u32 metasize:8;
-	u32 frame_sz:24;
+	u32 frame_sz:23;
+	u32 mb:1; /* xdp non-linear frame */
 	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
 	 * while mem info is valid on remote CPU.
 	 */
@@ -141,6 +143,7 @@ void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
 	xdp->data_end = frame->data + frame->len;
 	xdp->data_meta = frame->data - frame->metasize;
 	xdp->frame_sz = frame->frame_sz;
+	xdp->mb = frame->mb;
 }
 
 static inline
@@ -167,6 +170,7 @@ int xdp_update_frame_from_buff(struct xdp_buff *xdp,
 	xdp_frame->headroom = headroom - sizeof(*xdp_frame);
 	xdp_frame->metasize = metasize;
 	xdp_frame->frame_sz = xdp->frame_sz;
+	xdp_frame->mb = xdp->mb;
 
 	return 0;
 }
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 48aba933a5a8..884f140fc3be 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -454,6 +454,7 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	xdpf->headroom = 0;
 	xdpf->metasize = metasize;
 	xdpf->frame_sz = PAGE_SIZE;
+	xdpf->mb = xdp->mb;
 	xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
 
 	xsk_buff_free(xdp);
-- 
2.26.2



* [PATCH v2 net-next 2/9] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
  2020-09-03 20:58 ` [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
@ 2020-09-03 20:58 ` Lorenzo Bianconi
  2020-09-04  7:35   ` Jesper Dangaard Brouer
  2020-09-03 20:58 ` [PATCH v2 net-next 3/9] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer Lorenzo Bianconi
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
This is a preliminary patch to enable xdp multi-buffer support.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c        | 1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c       | 1 +
 drivers/net/ethernet/cavium/thunder/nicvf_main.c    | 1 +
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c    | 1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c         | 1 +
 drivers/net/ethernet/intel/ice/ice_txrx.c           | 1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c       | 1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c   | 1 +
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c     | 1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c          | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c     | 1 +
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 1 +
 drivers/net/ethernet/qlogic/qede/qede_fp.c          | 1 +
 drivers/net/ethernet/sfc/rx.c                       | 1 +
 drivers/net/ethernet/socionext/netsec.c             | 1 +
 drivers/net/ethernet/ti/cpsw.c                      | 1 +
 drivers/net/ethernet/ti/cpsw_new.c                  | 1 +
 drivers/net/hyperv/netvsc_bpf.c                     | 1 +
 drivers/net/tun.c                                   | 2 ++
 drivers/net/veth.c                                  | 1 +
 drivers/net/virtio_net.c                            | 2 ++
 drivers/net/xen-netfront.c                          | 1 +
 net/core/dev.c                                      | 1 +
 23 files changed, 25 insertions(+)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index a3a8edf9a734..a8e36e4204e6 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -1607,6 +1607,7 @@ static int ena_clean_rx_irq(struct ena_ring *rx_ring, struct napi_struct *napi,
 	res_budget = budget;
 	xdp.rxq = &rx_ring->xdp_rxq;
 	xdp.frame_sz = ENA_PAGE_SIZE;
+	xdp.mb = 0;
 
 	do {
 		xdp_verdict = XDP_PASS;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 2704a4709bc7..63dde7a369b7 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -139,6 +139,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
 	xdp.data_end = *data_ptr + *len;
 	xdp.rxq = &rxr->xdp_rxq;
 	xdp.frame_sz = PAGE_SIZE; /* BNXT_RX_PAGE_MODE(bp) when XDP enabled */
+	xdp.mb = 0;
 	orig_data = xdp.data;
 
 	rcu_read_lock();
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index c1378b5c780c..28448033750d 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -553,6 +553,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
 	xdp.data_end = xdp.data + len;
 	xdp.rxq = &rq->xdp_rxq;
 	xdp.frame_sz = RCV_FRAG_LEN + XDP_PACKET_HEADROOM;
+	xdp.mb = 0;
 	orig_data = xdp.data;
 
 	rcu_read_lock();
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index cb3083d2b4ab..dfc93e94c8e5 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -362,6 +362,7 @@ static u32 dpaa2_eth_run_xdp(struct dpaa2_eth_priv *priv,
 
 	xdp.frame_sz = DPAA2_ETH_RX_BUF_RAW_SIZE -
 		(dpaa2_fd_get_offset(fd) - XDP_PACKET_HEADROOM);
+	xdp.mb = 0;
 
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 91ab824926b9..49d3f3b2ba7a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2320,6 +2320,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 	xdp.frame_sz = i40e_rx_frame_truesize(rx_ring, 0);
 #endif
 	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.mb = 0;
 
 	while (likely(total_rx_packets < (unsigned int)budget)) {
 		struct i40e_rx_buffer *rx_buffer;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index eae75260fe20..d641f513b8d9 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1089,6 +1089,7 @@ int ice_clean_rx_irq(struct ice_ring *rx_ring, int budget)
 #if (PAGE_SIZE < 8192)
 	xdp.frame_sz = ice_rx_frame_truesize(rx_ring, 0);
 #endif
+	xdp.mb = 0;
 
 	/* start the loop to process Rx packets bounded by 'budget' */
 	while (likely(total_rx_pkts < (unsigned int)budget)) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 0b675c34ce49..20c8fd3cd4a3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2298,6 +2298,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 #if (PAGE_SIZE < 8192)
 	xdp.frame_sz = ixgbe_rx_frame_truesize(rx_ring, 0);
 #endif
+	xdp.mb = 0;
 
 	while (likely(total_rx_packets < budget)) {
 		union ixgbe_adv_rx_desc *rx_desc;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 50afec43e001..3fb200f315bb 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1128,6 +1128,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
 	struct xdp_buff xdp;
 
 	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.mb = 0;
 
 	/* Frame size depend on rx_ring setup when PAGE_SIZE=4K */
 #if (PAGE_SIZE < 8192)
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index 2a8a5842eaef..d02e08eb3df8 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3475,6 +3475,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
 			xdp.data = data + MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM;
 			xdp.data_end = xdp.data + rx_bytes;
 			xdp.frame_sz = PAGE_SIZE;
+			xdp.mb = 0;
 
 			if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE)
 				xdp.rxq = &rxq->xdp_rxq_short;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 99d7737e8ad6..de1ae36b068e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -684,6 +684,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	xdp_prog = rcu_dereference(ring->xdp_prog);
 	xdp.rxq = &ring->xdp_rxq;
 	xdp.frame_sz = priv->frag_info[0].frag_stride;
+	xdp.mb = 0;
 	doorbell_pending = 0;
 
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 7aab69e991a5..0bfe6606710b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1120,6 +1120,7 @@ static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, void *va, u16 headroom,
 	xdp->data_end = xdp->data + len;
 	xdp->rxq = &rq->xdp_rxq;
 	xdp->frame_sz = rq->buff.frame0_sz;
+	xdp->mb = 0;
 }
 
 static struct sk_buff *
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 39ee23e8c0bf..0bfccf2a7b1c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1824,6 +1824,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 	true_bufsz = xdp_prog ? PAGE_SIZE : dp->fl_bufsz;
 	xdp.frame_sz = PAGE_SIZE - NFP_NET_RX_BUF_HEADROOM;
 	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.mb = 0;
 	tx_ring = r_vec->xdp_ring;
 
 	while (pkts_polled < budget) {
diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index a2494bf85007..14a54094ca08 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1096,6 +1096,7 @@ static bool qede_rx_xdp(struct qede_dev *edev,
 	xdp.data_end = xdp.data + *len;
 	xdp.rxq = &rxq->xdp_rxq;
 	xdp.frame_sz = rxq->rx_buf_seg_size; /* PAGE_SIZE when XDP enabled */
+	xdp.mb = 0;
 
 	/* Queues always have a full reset currently, so for the time
 	 * being until there's atomic program replace just mark read
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index 59a43d586967..8fd6023995d4 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -301,6 +301,7 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel,
 	xdp.data_end = xdp.data + rx_buf->len;
 	xdp.rxq = &rx_queue->xdp_rxq_info;
 	xdp.frame_sz = efx->rx_page_buf_step;
+	xdp.mb = 0;
 
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 	rcu_read_unlock();
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 25db667fa879..c73108ce0a32 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -947,6 +947,7 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
 
 	xdp.rxq = &dring->xdp_rxq;
 	xdp.frame_sz = PAGE_SIZE;
+	xdp.mb = 0;
 
 	rcu_read_lock();
 	xdp_prog = READ_ONCE(priv->xdp_prog);
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 9b17bbbe102f..53a55c540adc 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -407,6 +407,7 @@ static void cpsw_rx_handler(void *token, int len, int status)
 		xdp.data_hard_start = pa;
 		xdp.rxq = &priv->xdp_rxq[ch];
 		xdp.frame_sz = PAGE_SIZE;
+		xdp.mb = 0;
 
 		port = priv->emac_port + cpsw->data.dual_emac;
 		ret = cpsw_run_xdp(priv, ch, &xdp, page, port);
diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c
index 1247d35d42ef..703d079fd479 100644
--- a/drivers/net/ethernet/ti/cpsw_new.c
+++ b/drivers/net/ethernet/ti/cpsw_new.c
@@ -349,6 +349,7 @@ static void cpsw_rx_handler(void *token, int len, int status)
 		xdp.data_hard_start = pa;
 		xdp.rxq = &priv->xdp_rxq[ch];
 		xdp.frame_sz = PAGE_SIZE;
+		xdp.mb = 0;
 
 		ret = cpsw_run_xdp(priv, ch, &xdp, page, priv->emac_port);
 		if (ret != CPSW_XDP_PASS)
diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c
index 440486d9c999..a4bafc64997f 100644
--- a/drivers/net/hyperv/netvsc_bpf.c
+++ b/drivers/net/hyperv/netvsc_bpf.c
@@ -50,6 +50,7 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan,
 	xdp->data_end = xdp->data + len;
 	xdp->rxq = &nvchan->xdp_rxq;
 	xdp->frame_sz = PAGE_SIZE;
+	xdp->mb = 0;
 
 	memcpy(xdp->data, data, len);
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index efaef83b8897..1aebd1f390a1 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1641,6 +1641,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		xdp.data_end = xdp.data + len;
 		xdp.rxq = &tfile->xdp_rxq;
 		xdp.frame_sz = buflen;
+		xdp.mb = 0;
 
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 		if (act == XDP_REDIRECT || act == XDP_TX) {
@@ -2388,6 +2389,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 		xdp_set_data_meta_invalid(xdp);
 		xdp->rxq = &tfile->xdp_rxq;
 		xdp->frame_sz = buflen;
+		xdp->mb = 0;
 
 		act = bpf_prog_run_xdp(xdp_prog, xdp);
 		err = tun_xdp_act(tun, xdp_prog, xdp, act);
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index b80cbffeb88e..7486ea1364b2 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -711,6 +711,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	/* SKB "head" area always have tailroom for skb_shared_info */
 	xdp.frame_sz = (void *)skb_end_pointer(skb) - xdp.data_hard_start;
 	xdp.frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	xdp.mb = 0;
 
 	orig_data = xdp.data;
 	orig_data_end = xdp.data_end;
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 0ada48edf749..3bee68f59e19 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -690,6 +690,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 		xdp.data_meta = xdp.data;
 		xdp.rxq = &rq->xdp_rxq;
 		xdp.frame_sz = buflen;
+		xdp.mb = 0;
 		orig_data = xdp.data;
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 		stats->xdp_packets++;
@@ -860,6 +861,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		xdp.data_meta = xdp.data;
 		xdp.rxq = &rq->xdp_rxq;
 		xdp.frame_sz = frame_sz - vi->hdr_len;
+		xdp.mb = 0;
 
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 		stats->xdp_packets++;
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 458be6882b98..02ed7f26097d 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -870,6 +870,7 @@ static u32 xennet_run_xdp(struct netfront_queue *queue, struct page *pdata,
 	xdp->data_end = xdp->data + len;
 	xdp->rxq = &queue->xdp_rxq;
 	xdp->frame_sz = XEN_PAGE_SIZE - XDP_PACKET_HEADROOM;
+	xdp->mb = 0;
 
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
diff --git a/net/core/dev.c b/net/core/dev.c
index d42c9ea0c3c0..2c3c961997d3 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4639,6 +4639,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	/* SKB "head" area always have tailroom for skb_shared_info */
 	xdp->frame_sz  = (void *)skb_end_pointer(skb) - xdp->data_hard_start;
 	xdp->frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	xdp->mb = 0;
 
 	orig_data_end = xdp->data_end;
 	orig_data = xdp->data;
-- 
2.26.2



* [PATCH v2 net-next 3/9] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
  2020-09-03 20:58 ` [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
  2020-09-03 20:58 ` [PATCH v2 net-next 2/9] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers Lorenzo Bianconi
@ 2020-09-03 20:58 ` Lorenzo Bianconi
  2020-09-06  7:33   ` Shay Agroskin
  2020-09-03 20:58 ` [PATCH v2 net-next 4/9] xdp: add multi-buff support to xdp_return_{buff/frame} Lorenzo Bianconi
                   ` (7 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

Update the multi-buffer bit (mb) in xdp_buff to notify the XDP/eBPF layer
and XDP remote drivers whether this is a "non-linear" XDP buffer. Access
skb_shared_info only if the xdp_buff mb bit is set.
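The access rule this patch enforces can be sketched with mock types (the names
mock_xdp/mock_sinfo/xdp_frag_count are illustrative, not driver code):

```c
/* Mocks for illustration: in the driver, sinfo lives in the buffer's
 * tailroom and is only initialized once the last descriptor arrives. */
struct mock_sinfo { unsigned int nr_frags; };
struct mock_xdp { unsigned int mb; struct mock_sinfo sinfo; };

static unsigned int xdp_frag_count(const struct mock_xdp *xdp)
{
	/* Never read sinfo->nr_frags unless mb says it is valid:
	 * for a linear buffer the field holds uninitialized memory. */
	return xdp->mb ? xdp->sinfo.nr_frags : 0;
}
```

This is why the patch drops the unconditional `sinfo->nr_frags = 0` from
mvneta_swbm_rx_frame(): clearing the mb bit alone is enough to mark the
shared_info area as not-yet-valid.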

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 37 +++++++++++++++------------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 832bbb8b05c8..4f745a2b702a 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2027,11 +2027,11 @@ mvneta_xdp_put_buff(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 		    struct xdp_buff *xdp, int sync_len, bool napi)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
-	int i;
+	int i, num_frames = xdp->mb ? sinfo->nr_frags : 0;
 
 	page_pool_put_page(rxq->page_pool, virt_to_head_page(xdp->data),
 			   sync_len, napi);
-	for (i = 0; i < sinfo->nr_frags; i++)
+	for (i = 0; i < num_frames; i++)
 		page_pool_put_full_page(rxq->page_pool,
 					skb_frag_page(&sinfo->frags[i]), napi);
 }
@@ -2175,6 +2175,7 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 
 	len = xdp->data_end - xdp->data_hard_start - pp->rx_offset_correction;
 	data_len = xdp->data_end - xdp->data;
+
 	act = bpf_prog_run_xdp(prog, xdp);
 
 	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
@@ -2234,7 +2235,6 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,
 	int data_len = -MVNETA_MH_SIZE, len;
 	struct net_device *dev = pp->dev;
 	enum dma_data_direction dma_dir;
-	struct skb_shared_info *sinfo;
 
 	if (MVNETA_SKB_SIZE(rx_desc->data_size) > PAGE_SIZE) {
 		len = MVNETA_MAX_RX_BUF_SIZE;
@@ -2256,9 +2256,7 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,
 	xdp->data = data + pp->rx_offset_correction + MVNETA_MH_SIZE;
 	xdp->data_end = xdp->data + data_len;
 	xdp_set_data_meta_invalid(xdp);
-
-	sinfo = xdp_get_shared_info_from_buff(xdp);
-	sinfo->nr_frags = 0;
+	xdp->mb = 0;
 
 	*size = rx_desc->data_size - len;
 	rx_desc->buf_phys_addr = 0;
@@ -2269,7 +2267,7 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,
 			    struct mvneta_rx_desc *rx_desc,
 			    struct mvneta_rx_queue *rxq,
 			    struct xdp_buff *xdp, int *size,
-			    struct page *page)
+			    int *nfrags, struct page *page)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
 	struct net_device *dev = pp->dev;
@@ -2288,13 +2286,18 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,
 				rx_desc->buf_phys_addr,
 				len, dma_dir);
 
-	if (data_len > 0 && sinfo->nr_frags < MAX_SKB_FRAGS) {
-		skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags];
+	if (data_len > 0 && *nfrags < MAX_SKB_FRAGS) {
+		skb_frag_t *frag = &sinfo->frags[*nfrags];
 
 		skb_frag_off_set(frag, pp->rx_offset_correction);
 		skb_frag_size_set(frag, data_len);
 		__skb_frag_set_page(frag, page);
-		sinfo->nr_frags++;
+		*nfrags = *nfrags + 1;
+
+		if (rx_desc->status & MVNETA_RXD_LAST_DESC) {
+			sinfo->nr_frags = *nfrags;
+			xdp->mb = true;
+		}
 
 		rx_desc->buf_phys_addr = 0;
 	}
@@ -2306,7 +2309,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 		      struct xdp_buff *xdp, u32 desc_status)
 {
 	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
-	int i, num_frags = sinfo->nr_frags;
+	int i, num_frags = xdp->mb ? sinfo->nr_frags : 0;
 	skb_frag_t frags[MAX_SKB_FRAGS];
 	struct sk_buff *skb;
 
@@ -2341,13 +2344,14 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 {
 	int rx_proc = 0, rx_todo, refill, size = 0;
 	struct net_device *dev = pp->dev;
-	struct xdp_buff xdp_buf = {
-		.frame_sz = PAGE_SIZE,
-		.rxq = &rxq->xdp_rxq,
-	};
 	struct mvneta_stats ps = {};
 	struct bpf_prog *xdp_prog;
 	u32 desc_status, frame_sz;
+	struct xdp_buff xdp_buf;
+	int nfrags;
+
+	xdp_buf.frame_sz = PAGE_SIZE;
+	xdp_buf.rxq = &rxq->xdp_rxq;
 
 	/* Get number of received packets */
 	rx_todo = mvneta_rxq_busy_desc_num_get(pp, rxq);
@@ -2379,6 +2383,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 			size = rx_desc->data_size;
 			frame_sz = size - ETH_FCS_LEN;
 			desc_status = rx_desc->status;
+			nfrags = 0;
 
 			mvneta_swbm_rx_frame(pp, rx_desc, rxq, &xdp_buf,
 					     &size, page, &ps);
@@ -2387,7 +2392,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 				continue;
 
 			mvneta_swbm_add_rx_fragment(pp, rx_desc, rxq, &xdp_buf,
-						    &size, page);
+						    &size, &nfrags, page);
 		} /* Middle or Last descriptor */
 
 		if (!(rx_status & MVNETA_RXD_LAST_DESC))
-- 
2.26.2



* [PATCH v2 net-next 4/9] xdp: add multi-buff support to xdp_return_{buff/frame}
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (2 preceding siblings ...)
  2020-09-03 20:58 ` [PATCH v2 net-next 3/9] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer Lorenzo Bianconi
@ 2020-09-03 20:58 ` Lorenzo Bianconi
  2020-09-03 20:58 ` [PATCH v2 net-next 5/9] net: mvneta: add multi buffer support to XDP_TX Lorenzo Bianconi
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

Take into account whether the received xdp_buff/xdp_frame is non-linear
when recycling/returning the frame memory to the allocator.
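The release order the patch implements can be sketched with a mock that records
which pages get freed (rel_frame, release_one, frame_release are hypothetical
names used only for this illustration):

```c
#define MAX_FRAGS 17

/* Mock frame that logs release order: fragment ids 0..n-1, then -1
 * for the linear head page. */
struct rel_frame {
	unsigned int mb;
	unsigned int nr_frags;
	int released[MAX_FRAGS + 1];
	unsigned int n_released;
};

static void release_one(struct rel_frame *f, int id)
{
	f->released[f->n_released++] = id;
}

/* Mirrors the patch's flow: for a multi-buff frame, return every
 * fragment page first, then fall through to the linear head. */
static void frame_release(struct rel_frame *f)
{
	unsigned int i;

	if (f->mb)
		for (i = 0; i < f->nr_frags; i++)
			release_one(f, i);
	release_one(f, -1);
}
```

A linear frame (mb == 0) takes the fast path and frees only the head page,
matching the `if (likely(!xdpf->mb)) goto out;` pattern in the patch.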

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/net/xdp.h | 18 ++++++++++++++++--
 net/core/xdp.c    | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 42f439f9fcda..4d47076546ff 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -208,10 +208,24 @@ void __xdp_release_frame(void *data, struct xdp_mem_info *mem);
 static inline void xdp_release_frame(struct xdp_frame *xdpf)
 {
 	struct xdp_mem_info *mem = &xdpf->mem;
+	struct skb_shared_info *sinfo;
+	int i;
 
 	/* Curr only page_pool needs this */
-	if (mem->type == MEM_TYPE_PAGE_POOL)
-		__xdp_release_frame(xdpf->data, mem);
+	if (mem->type != MEM_TYPE_PAGE_POOL)
+		return;
+
+	if (likely(!xdpf->mb))
+		goto out;
+
+	sinfo = xdp_get_shared_info_from_frame(xdpf);
+	for (i = 0; i < sinfo->nr_frags; i++) {
+		struct page *page = skb_frag_page(&sinfo->frags[i]);
+
+		__xdp_release_frame(page_address(page), mem);
+	}
+out:
+	__xdp_release_frame(xdpf->data, mem);
 }
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 884f140fc3be..6d4fd4dddb00 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -370,18 +370,57 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct)
 
 void xdp_return_frame(struct xdp_frame *xdpf)
 {
+	struct skb_shared_info *sinfo;
+	int i;
+
+	if (likely(!xdpf->mb))
+		goto out;
+
+	sinfo = xdp_get_shared_info_from_frame(xdpf);
+	for (i = 0; i < sinfo->nr_frags; i++) {
+		struct page *page = skb_frag_page(&sinfo->frags[i]);
+
+		__xdp_return(page_address(page), &xdpf->mem, false);
+	}
+out:
 	__xdp_return(xdpf->data, &xdpf->mem, false);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);
 
 void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
 {
+	struct skb_shared_info *sinfo;
+	int i;
+
+	if (likely(!xdpf->mb))
+		goto out;
+
+	sinfo = xdp_get_shared_info_from_frame(xdpf);
+	for (i = 0; i < sinfo->nr_frags; i++) {
+		struct page *page = skb_frag_page(&sinfo->frags[i]);
+
+		__xdp_return(page_address(page), &xdpf->mem, true);
+	}
+out:
 	__xdp_return(xdpf->data, &xdpf->mem, true);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
 
 void xdp_return_buff(struct xdp_buff *xdp)
 {
+	struct skb_shared_info *sinfo;
+	int i;
+
+	if (likely(!xdp->mb))
+		goto out;
+
+	sinfo = xdp_get_shared_info_from_buff(xdp);
+	for (i = 0; i < sinfo->nr_frags; i++) {
+		struct page *page = skb_frag_page(&sinfo->frags[i]);
+
+		__xdp_return(page_address(page), &xdp->rxq->mem, true);
+	}
+out:
 	__xdp_return(xdp->data, &xdp->rxq->mem, true);
 }
 
-- 
2.26.2



* [PATCH v2 net-next 5/9] net: mvneta: add multi buffer support to XDP_TX
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (3 preceding siblings ...)
  2020-09-03 20:58 ` [PATCH v2 net-next 4/9] xdp: add multi-buff support to xdp_return_{buff/frame} Lorenzo Bianconi
@ 2020-09-03 20:58 ` Lorenzo Bianconi
  2020-09-06  7:20   ` Shay Agroskin
  2020-09-03 20:58 ` [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper Lorenzo Bianconi
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

Introduce the capability to map non-linear xdp buffers in
mvneta_xdp_submit_frame() for the XDP_TX and XDP_REDIRECT paths.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 79 +++++++++++++++++----------
 1 file changed, 49 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 4f745a2b702a..65fbed957e4f 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1854,8 +1854,8 @@ static void mvneta_txq_bufs_free(struct mvneta_port *pp,
 			bytes_compl += buf->skb->len;
 			pkts_compl++;
 			dev_kfree_skb_any(buf->skb);
-		} else if (buf->type == MVNETA_TYPE_XDP_TX ||
-			   buf->type == MVNETA_TYPE_XDP_NDO) {
+		} else if ((buf->type == MVNETA_TYPE_XDP_TX ||
+			    buf->type == MVNETA_TYPE_XDP_NDO) && buf->xdpf) {
 			xdp_return_frame(buf->xdpf);
 		}
 	}
@@ -2040,43 +2040,62 @@ static int
 mvneta_xdp_submit_frame(struct mvneta_port *pp, struct mvneta_tx_queue *txq,
 			struct xdp_frame *xdpf, bool dma_map)
 {
-	struct mvneta_tx_desc *tx_desc;
-	struct mvneta_tx_buf *buf;
-	dma_addr_t dma_addr;
+	struct skb_shared_info *sinfo = xdp_get_shared_info_from_frame(xdpf);
+	int i, num_frames = xdpf->mb ? sinfo->nr_frags + 1 : 1;
+	struct mvneta_tx_desc *tx_desc = NULL;
+	struct page *page;
 
-	if (txq->count >= txq->tx_stop_threshold)
+	if (txq->count + num_frames >= txq->tx_stop_threshold)
 		return MVNETA_XDP_DROPPED;
 
-	tx_desc = mvneta_txq_next_desc_get(txq);
+	for (i = 0; i < num_frames; i++) {
+		struct mvneta_tx_buf *buf = &txq->buf[txq->txq_put_index];
+		skb_frag_t *frag = i ? &sinfo->frags[i - 1] : NULL;
+		int len = frag ? skb_frag_size(frag) : xdpf->len;
+		dma_addr_t dma_addr;
 
-	buf = &txq->buf[txq->txq_put_index];
-	if (dma_map) {
-		/* ndo_xdp_xmit */
-		dma_addr = dma_map_single(pp->dev->dev.parent, xdpf->data,
-					  xdpf->len, DMA_TO_DEVICE);
-		if (dma_mapping_error(pp->dev->dev.parent, dma_addr)) {
-			mvneta_txq_desc_put(txq);
-			return MVNETA_XDP_DROPPED;
+		tx_desc = mvneta_txq_next_desc_get(txq);
+		if (dma_map) {
+			/* ndo_xdp_xmit */
+			void *data;
+
+			data = frag ? page_address(skb_frag_page(frag))
+				    : xdpf->data;
+			dma_addr = dma_map_single(pp->dev->dev.parent, data,
+						  len, DMA_TO_DEVICE);
+			if (dma_mapping_error(pp->dev->dev.parent, dma_addr)) {
+				for (; i >= 0; i--)
+					mvneta_txq_desc_put(txq);
+				return MVNETA_XDP_DROPPED;
+			}
+			buf->type = MVNETA_TYPE_XDP_NDO;
+		} else {
+			page = frag ? skb_frag_page(frag)
+				    : virt_to_page(xdpf->data);
+			dma_addr = page_pool_get_dma_addr(page);
+			if (!frag)
+				dma_addr += sizeof(*xdpf) + xdpf->headroom;
+			dma_sync_single_for_device(pp->dev->dev.parent,
+						   dma_addr, len,
+						   DMA_BIDIRECTIONAL);
+			buf->type = MVNETA_TYPE_XDP_TX;
 		}
-		buf->type = MVNETA_TYPE_XDP_NDO;
-	} else {
-		struct page *page = virt_to_page(xdpf->data);
+		buf->xdpf = i ? NULL : xdpf;
 
-		dma_addr = page_pool_get_dma_addr(page) +
-			   sizeof(*xdpf) + xdpf->headroom;
-		dma_sync_single_for_device(pp->dev->dev.parent, dma_addr,
-					   xdpf->len, DMA_BIDIRECTIONAL);
-		buf->type = MVNETA_TYPE_XDP_TX;
+		if (!i)
+			tx_desc->command = MVNETA_TXD_F_DESC;
+		tx_desc->buf_phys_addr = dma_addr;
+		tx_desc->data_size = len;
+
+		mvneta_txq_inc_put(txq);
 	}
-	buf->xdpf = xdpf;
 
-	tx_desc->command = MVNETA_TXD_FLZ_DESC;
-	tx_desc->buf_phys_addr = dma_addr;
-	tx_desc->data_size = xdpf->len;
+	/* last descriptor */
+	if (tx_desc)
+		tx_desc->command |= MVNETA_TXD_L_DESC | MVNETA_TXD_Z_PAD;
 
-	mvneta_txq_inc_put(txq);
-	txq->pending++;
-	txq->count++;
+	txq->pending += num_frames;
+	txq->count += num_frames;
 
 	return MVNETA_XDP_TX;
 }
-- 
2.26.2



* [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (4 preceding siblings ...)
  2020-09-03 20:58 ` [PATCH v2 net-next 5/9] net: mvneta: add multi buffer support to XDP_TX Lorenzo Bianconi
@ 2020-09-03 20:58 ` Lorenzo Bianconi
  2020-09-04  1:09   ` Alexei Starovoitov
                     ` (2 more replies)
  2020-09-03 20:58 ` [PATCH v2 net-next 7/9] bpf: helpers: add multibuffer support Lorenzo Bianconi
                   ` (4 subsequent siblings)
  10 siblings, 3 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

Introduce the bpf_xdp_adjust_mb_header helper to adjust frame headers,
moving *offset* bytes from/to the second buffer to/from the first one.
This helper can be used to move headers when the hw DMA SG engine is
not able to copy all the headers into the first fragment, in order to
split header and data pages.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/uapi/linux/bpf.h       | 25 ++++++++++++----
 net/core/filter.c              | 54 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h | 26 ++++++++++++----
 3 files changed, 95 insertions(+), 10 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8dda13880957..c4a6d245619c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3571,11 +3571,25 @@ union bpf_attr {
  *		value.
  *
  * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
- * 	Description
- * 		Read *size* bytes from user space address *user_ptr* and store
- * 		the data in *dst*. This is a wrapper of copy_from_user().
- * 	Return
- * 		0 on success, or a negative error in case of failure.
+ *	Description
+ *		Read *size* bytes from user space address *user_ptr* and store
+ *		the data in *dst*. This is a wrapper of copy_from_user().
+ *
+ * long bpf_xdp_adjust_mb_header(struct xdp_buff *xdp_md, int offset)
+ *	Description
+ *		Adjust frame headers moving *offset* bytes from/to the second
+ *		buffer to/from the first one. This helper can be used to move
+ *		headers when the hw DMA SG does not copy all the headers in
+ *		the first fragment.
+ *
+ *		A call to this helper is susceptible to change the underlying
+ *		packet buffer. Therefore, at load time, all checks on pointers
+ *		previously done by the verifier are invalidated and must be
+ *		performed again, if the helper is used in combination with
+ *		direct packet access.
+ *
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3727,6 +3741,7 @@ union bpf_attr {
 	FN(inode_storage_delete),	\
 	FN(d_path),			\
 	FN(copy_from_user),		\
+	FN(xdp_adjust_mb_header),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 47eef9a0be6a..ae6b10cf062d 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3475,6 +3475,57 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_2(bpf_xdp_adjust_mb_header, struct xdp_buff *, xdp,
+	   int, offset)
+{
+	void *data_hard_end, *data_end;
+	struct skb_shared_info *sinfo;
+	int frag_offset, frag_len;
+	u8 *addr;
+
+	if (!xdp->mb)
+		return -EOPNOTSUPP;
+
+	sinfo = xdp_get_shared_info_from_buff(xdp);
+
+	frag_len = skb_frag_size(&sinfo->frags[0]);
+	if (offset > frag_len)
+		return -EINVAL;
+
+	frag_offset = skb_frag_off(&sinfo->frags[0]);
+	data_end = xdp->data_end + offset;
+
+	if (offset < 0 && (-offset > frag_offset ||
+			   data_end < xdp->data + ETH_HLEN))
+		return -EINVAL;
+
+	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
+	if (data_end > data_hard_end)
+		return -EINVAL;
+
+	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
+	if (offset > 0) {
+		memcpy(xdp->data_end, addr, offset);
+	} else {
+		memcpy(addr + offset, xdp->data_end + offset, -offset);
+		memset(xdp->data_end + offset, 0, -offset);
+	}
+
+	skb_frag_size_sub(&sinfo->frags[0], offset);
+	skb_frag_off_add(&sinfo->frags[0], offset);
+	xdp->data_end = data_end;
+
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_mb_header_proto = {
+	.func		= bpf_xdp_adjust_mb_header,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_ANYTHING,
+};
+
 BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
 {
 	void *data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
@@ -6505,6 +6556,7 @@ bool bpf_helper_changes_pkt_data(void *func)
 	    func == bpf_msg_push_data ||
 	    func == bpf_msg_pop_data ||
 	    func == bpf_xdp_adjust_tail ||
+	    func == bpf_xdp_adjust_mb_header ||
 #if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
 	    func == bpf_lwt_seg6_store_bytes ||
 	    func == bpf_lwt_seg6_adjust_srh ||
@@ -6835,6 +6887,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_redirect_map_proto;
 	case BPF_FUNC_xdp_adjust_tail:
 		return &bpf_xdp_adjust_tail_proto;
+	case BPF_FUNC_xdp_adjust_mb_header:
+		return &bpf_xdp_adjust_mb_header_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_xdp_fib_lookup_proto;
 #ifdef CONFIG_INET
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 8dda13880957..392d52a2ecef 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3407,6 +3407,7 @@ union bpf_attr {
  *		A non-negative value equal to or less than *size* on success,
  *		or a negative error in case of failure.
  *
+ *
  * long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags)
  *	Description
  *		Load header option.  Support reading a particular TCP header
@@ -3571,11 +3572,25 @@ union bpf_attr {
  *		value.
  *
  * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
- * 	Description
- * 		Read *size* bytes from user space address *user_ptr* and store
- * 		the data in *dst*. This is a wrapper of copy_from_user().
- * 	Return
- * 		0 on success, or a negative error in case of failure.
+ *	Description
+ *		Read *size* bytes from user space address *user_ptr* and store
+ *		the data in *dst*. This is a wrapper of copy_from_user().
+ *
+ * long bpf_xdp_adjust_mb_header(struct xdp_buff *xdp_md, int offset)
+ *	Description
+ *		Adjust frame headers moving *offset* bytes from/to the second
+ *		buffer to/from the first one. This helper can be used to move
+ *		headers when the hw DMA SG does not copy all the headers in
+ *		the first fragment.
+ *
+ *		A call to this helper is susceptible to change the underlying
+ *		packet buffer. Therefore, at load time, all checks on pointers
+ *		previously done by the verifier are invalidated and must be
+ *		performed again, if the helper is used in combination with
+ *		direct packet access.
+ *
+ *	Return
+ *		0 on success, or a negative error in case of failure.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3727,6 +3742,7 @@ union bpf_attr {
 	FN(inode_storage_delete),	\
 	FN(d_path),			\
 	FN(copy_from_user),		\
+	FN(xdp_adjust_mb_header),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.26.2



* [PATCH v2 net-next 7/9] bpf: helpers: add multibuffer support
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (5 preceding siblings ...)
  2020-09-03 20:58 ` [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper Lorenzo Bianconi
@ 2020-09-03 20:58 ` Lorenzo Bianconi
  2020-09-03 21:24   ` Maciej Fijalkowski
  2020-09-03 20:58 ` [PATCH v2 net-next 8/9] samples/bpf: add bpf program that uses xdp mb helpers Lorenzo Bianconi
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

From: Sameeh Jubran <sameehj@amazon.com>

The implementation is based on this [0] draft by Jesper D. Brouer.

Provide two new helpers:

* bpf_xdp_get_frag_count()
* bpf_xdp_get_frags_total_size()

[0] xdp mb design - https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/uapi/linux/bpf.h       | 14 ++++++++++++
 net/core/filter.c              | 39 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h | 14 ++++++++++++
 3 files changed, 67 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c4a6d245619c..53db75095306 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -3590,6 +3590,18 @@ union bpf_attr {
  *
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_xdp_get_frag_count(struct xdp_buff *xdp_md)
+ *	Description
+ *		Get the total number of frags for a given packet.
+ *	Return
+ *		The number of frags
+ *
+ * int bpf_xdp_get_frags_total_size(struct xdp_buff *xdp_md)
+ *	Description
+ *		Get the total size of frags for a given packet.
+ *	Return
+ *		The total size of frags for a given packet.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3742,6 +3754,8 @@ union bpf_attr {
 	FN(d_path),			\
 	FN(copy_from_user),		\
 	FN(xdp_adjust_mb_header),	\
+	FN(xdp_get_frag_count),		\
+	FN(xdp_get_frags_total_size),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index ae6b10cf062d..ba058fc16440 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3526,6 +3526,41 @@ static const struct bpf_func_proto bpf_xdp_adjust_mb_header_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+BPF_CALL_1(bpf_xdp_get_frag_count, struct xdp_buff *, xdp)
+{
+	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+
+	return xdp->mb ? sinfo->nr_frags : 0;
+}
+
+const struct bpf_func_proto bpf_xdp_get_frag_count_proto = {
+	.func		= bpf_xdp_get_frag_count,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
+
+BPF_CALL_1(bpf_xdp_get_frags_total_size, struct xdp_buff *, xdp)
+{
+	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+	int nfrags, i;
+	int size = 0;
+
+	nfrags = xdp->mb ? sinfo->nr_frags : 0;
+
+	for (i = 0; i < nfrags && i < MAX_SKB_FRAGS; i++)
+		size += skb_frag_size(&sinfo->frags[i]);
+
+	return size;
+}
+
+const struct bpf_func_proto bpf_xdp_get_frags_total_size_proto = {
+	.func		= bpf_xdp_get_frags_total_size,
+	.gpl_only	= false,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+};
+
 BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
 {
 	void *data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
@@ -6889,6 +6924,10 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_adjust_tail_proto;
 	case BPF_FUNC_xdp_adjust_mb_header:
 		return &bpf_xdp_adjust_mb_header_proto;
+	case BPF_FUNC_xdp_get_frag_count:
+		return &bpf_xdp_get_frag_count_proto;
+	case BPF_FUNC_xdp_get_frags_total_size:
+		return &bpf_xdp_get_frags_total_size_proto;
 	case BPF_FUNC_fib_lookup:
 		return &bpf_xdp_fib_lookup_proto;
 #ifdef CONFIG_INET
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 392d52a2ecef..dd4669096cbb 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -3591,6 +3591,18 @@ union bpf_attr {
  *
  *	Return
  *		0 on success, or a negative error in case of failure.
+ *
+ * int bpf_xdp_get_frag_count(struct xdp_buff *xdp_md)
+ *	Description
+ *		Get the total number of frags for a given packet.
+ *	Return
+ *		The number of frags
+ *
+ * int bpf_xdp_get_frags_total_size(struct xdp_buff *xdp_md)
+ *	Description
+ *		Get the total size of frags for a given packet.
+ *	Return
+ *		The total size of frags for a given packet.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -3743,6 +3755,8 @@ union bpf_attr {
 	FN(d_path),			\
 	FN(copy_from_user),		\
 	FN(xdp_adjust_mb_header),	\
+	FN(xdp_get_frag_count),		\
+	FN(xdp_get_frags_total_size),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.26.2



* [PATCH v2 net-next 8/9] samples/bpf: add bpf program that uses xdp mb helpers
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (6 preceding siblings ...)
  2020-09-03 20:58 ` [PATCH v2 net-next 7/9] bpf: helpers: add multibuffer support Lorenzo Bianconi
@ 2020-09-03 20:58 ` Lorenzo Bianconi
  2020-09-03 20:58 ` [PATCH v2 net-next 9/9] net: mvneta: enable jumbo frames for XDP Lorenzo Bianconi
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

From: Sameeh Jubran <sameehj@amazon.com>

The bpf program returns XDP_PASS for every packet and calculates the
total number of bytes in its linear and paged parts.

The program is executed with:
./xdp_mb [if name]

and has the following output format:
[if index]: [rx packet count] pkt/sec, [number of bytes] bytes/sec

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 samples/bpf/Makefile      |   3 +
 samples/bpf/xdp_mb_kern.c |  68 ++++++++++++++
 samples/bpf/xdp_mb_user.c | 182 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 253 insertions(+)
 create mode 100644 samples/bpf/xdp_mb_kern.c
 create mode 100644 samples/bpf/xdp_mb_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4f1ed0e3cf9f..12e32516f02a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -54,6 +54,7 @@ tprogs-y += task_fd_query
 tprogs-y += xdp_sample_pkts
 tprogs-y += ibumad
 tprogs-y += hbm
+tprogs-y += xdp_mb
 
 # Libbpf dependencies
 LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
@@ -111,6 +112,7 @@ task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS)
 xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS)
 ibumad-objs := bpf_load.o ibumad_user.o $(TRACE_HELPERS)
 hbm-objs := bpf_load.o hbm.o $(CGROUP_HELPERS)
+xdp_mb-objs := xdp_mb_user.o
 
 # Tell kbuild to always build the programs
 always-y := $(tprogs-y)
@@ -172,6 +174,7 @@ always-y += ibumad_kern.o
 always-y += hbm_out_kern.o
 always-y += hbm_edt_kern.o
 always-y += xdpsock_kern.o
+always-y += xdp_mb_kern.o
 
 ifeq ($(ARCH), arm)
 # Strip all except -D__LINUX_ARM_ARCH__ option needed to handle linux
diff --git a/samples/bpf/xdp_mb_kern.c b/samples/bpf/xdp_mb_kern.c
new file mode 100644
index 000000000000..554c3b9a3243
--- /dev/null
+++ b/samples/bpf/xdp_mb_kern.c
@@ -0,0 +1,68 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright 2020 Amazon.com, Inc. or its affiliates. All rights reserved.
+ */
+#define KBUILD_MODNAME "foo"
+#include <uapi/linux/bpf.h>
+#include <linux/in.h>
+#include <linux/if_ether.h>
+#include <linux/if_packet.h>
+#include <linux/if_vlan.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <bpf/bpf_helpers.h>
+
+/* count RX packets */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__type(key, u32);
+	__type(value, long);
+	__uint(max_entries, 1);
+} rx_cnt SEC(".maps");
+
+/* count RX fragments */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__type(key, u32);
+	__type(value, long);
+	__uint(max_entries, 1);
+} rx_frags SEC(".maps");
+
+/* count total number of bytes */
+struct {
+	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+	__type(key, u32);
+	__type(value, long);
+	__uint(max_entries, 1);
+} tot_len SEC(".maps");
+
+SEC("xdp_mb")
+int xdp_mb_prog(struct xdp_md *ctx)
+{
+	void *data_end = (void *)(long)ctx->data_end;
+	void *data = (void *)(long)ctx->data;
+	u32 frag_offset = 0, frag_size = 0;
+	u32 key = 0, nfrags;
+	long *value;
+	int i, len;
+
+	value = bpf_map_lookup_elem(&rx_cnt, &key);
+	if (value)
+		*value += 1;
+
+	len = data_end - data;
+	nfrags = bpf_xdp_get_frag_count(ctx);
+	len += bpf_xdp_get_frags_total_size(ctx);
+
+	value = bpf_map_lookup_elem(&tot_len, &key);
+	if (value)
+		*value += len;
+
+	value = bpf_map_lookup_elem(&rx_frags, &key);
+	if (value)
+		*value += nfrags;
+
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_mb_user.c b/samples/bpf/xdp_mb_user.c
new file mode 100644
index 000000000000..6f555e94b748
--- /dev/null
+++ b/samples/bpf/xdp_mb_user.c
@@ -0,0 +1,182 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright 2020 Amazon.com, Inc. or its affiliates. All rights reserved.
+ */
+#include <linux/bpf.h>
+#include <linux/if_link.h>
+#include <assert.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <libgen.h>
+#include <sys/resource.h>
+#include <net/if.h>
+
+#include "bpf_util.h"
+#include <bpf/bpf.h>
+#include <bpf/libbpf.h>
+
+static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST | XDP_FLAGS_DRV_MODE;
+static __u32 prog_id;
+static int rx_cnt_fd, tot_len_fd, rx_frags_fd;
+static int ifindex;
+
+static void int_exit(int sig)
+{
+	__u32 curr_prog_id = 0;
+
+	if (bpf_get_link_xdp_id(ifindex, &curr_prog_id, xdp_flags)) {
+		printf("bpf_get_link_xdp_id failed\n");
+		exit(1);
+	}
+	if (prog_id == curr_prog_id)
+		bpf_set_link_xdp_fd(ifindex, -1, xdp_flags);
+	else if (!curr_prog_id)
+		printf("couldn't find a prog id on a given interface\n");
+	else
+		printf("program on interface changed, not removing\n");
+	exit(0);
+}
+
+/* count total packets and bytes per second */
+static void poll_stats(int interval)
+{
+	unsigned int nr_cpus = bpf_num_possible_cpus();
+	__u64 rx_frags_cnt[nr_cpus], rx_frags_cnt_prev[nr_cpus];
+	__u64 tot_len[nr_cpus], tot_len_prev[nr_cpus];
+	__u64 rx_cnt[nr_cpus], rx_cnt_prev[nr_cpus];
+	int i;
+
+	memset(rx_frags_cnt_prev, 0, sizeof(rx_frags_cnt_prev));
+	memset(tot_len_prev, 0, sizeof(tot_len_prev));
+	memset(rx_cnt_prev, 0, sizeof(rx_cnt_prev));
+
+	while (1) {
+		__u64 n_rx_pkts = 0, rx_frags = 0, rx_len = 0;
+		__u32 key = 0;
+
+		sleep(interval);
+
+		/* fetch rx cnt */
+		assert(bpf_map_lookup_elem(rx_cnt_fd, &key, rx_cnt) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			n_rx_pkts += (rx_cnt[i] - rx_cnt_prev[i]);
+		memcpy(rx_cnt_prev, rx_cnt, sizeof(rx_cnt));
+
+		/* fetch rx frags */
+		assert(bpf_map_lookup_elem(rx_frags_fd, &key, rx_frags_cnt) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			rx_frags += (rx_frags_cnt[i] - rx_frags_cnt_prev[i]);
+		memcpy(rx_frags_cnt_prev, rx_frags_cnt, sizeof(rx_frags_cnt));
+
+		/* count total bytes of packets */
+		assert(bpf_map_lookup_elem(tot_len_fd, &key, tot_len) == 0);
+		for (i = 0; i < nr_cpus; i++)
+			rx_len += (tot_len[i] - tot_len_prev[i]);
+		memcpy(tot_len_prev, tot_len, sizeof(tot_len));
+
+		if (n_rx_pkts)
+			printf("ifindex %i: %10llu pkt/s, %10llu frags/s, %10llu bytes/s\n",
+			       ifindex, n_rx_pkts / interval, rx_frags / interval,
+			       rx_len / interval);
+	}
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"%s: %s [OPTS] IFACE\n\n"
+		"OPTS:\n"
+		"    -F    force loading prog\n",
+		__func__, prog);
+}
+
+int main(int argc, char **argv)
+{
+	struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
+	struct bpf_prog_load_attr prog_load_attr = {
+		.prog_type	= BPF_PROG_TYPE_XDP,
+	};
+	int prog_fd, opt;
+	struct bpf_prog_info info = {};
+	__u32 info_len = sizeof(info);
+	const char *optstr = "F";
+	struct bpf_program *prog;
+	struct bpf_object *obj;
+	char filename[256];
+	int err;
+
+	while ((opt = getopt(argc, argv, optstr)) != -1) {
+		switch (opt) {
+		case 'F':
+			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (optind == argc) {
+		usage(basename(argv[0]));
+		return 1;
+	}
+
+	if (setrlimit(RLIMIT_MEMLOCK, &r)) {
+		perror("setrlimit(RLIMIT_MEMLOCK)");
+		return 1;
+	}
+
+	ifindex = if_nametoindex(argv[optind]);
+	if (!ifindex) {
+		perror("if_nametoindex");
+		return 1;
+	}
+
+	snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+	prog_load_attr.file = filename;
+
+	if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
+		return 1;
+
+	prog = bpf_program__next(NULL, obj);
+	if (!prog) {
+		printf("finding a prog in obj file failed\n");
+		return 1;
+	}
+
+	if (!prog_fd) {
+		printf("bpf_prog_load_xattr: %s\n", strerror(errno));
+		return 1;
+	}
+
+	rx_cnt_fd = bpf_object__find_map_fd_by_name(obj, "rx_cnt");
+	rx_frags_fd = bpf_object__find_map_fd_by_name(obj, "rx_frags");
+	tot_len_fd = bpf_object__find_map_fd_by_name(obj, "tot_len");
+	if (rx_cnt_fd < 0 || rx_frags_fd < 0 || tot_len_fd < 0) {
+		printf("bpf_object__find_map_fd_by_name failed\n");
+		return 1;
+	}
+
+	if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
+		printf("ERROR: link set xdp fd failed on %d\n", ifindex);
+		return 1;
+	}
+
+	err = bpf_obj_get_info_by_fd(prog_fd, &info, &info_len);
+	if (err) {
+		printf("can't get prog info - %s\n", strerror(errno));
+		return err;
+	}
+	prog_id = info.id;
+
+	signal(SIGINT, int_exit);
+	signal(SIGTERM, int_exit);
+
+	poll_stats(1);
+
+	return 0;
+}
-- 
2.26.2



* [PATCH v2 net-next 9/9] net: mvneta: enable jumbo frames for XDP
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (7 preceding siblings ...)
  2020-09-03 20:58 ` [PATCH v2 net-next 8/9] samples/bpf: add bpf program that uses xdp mb helpers Lorenzo Bianconi
@ 2020-09-03 20:58 ` Lorenzo Bianconi
  2020-09-04  1:08 ` [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Alexei Starovoitov
  2020-09-04  5:41 ` John Fastabend
  10 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-03 20:58 UTC (permalink / raw)
  To: netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

Enable reception of jumbo frames even when the interface is running in
XDP mode.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 65fbed957e4f..85853fefcfd1 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -3737,11 +3737,6 @@ static int mvneta_change_mtu(struct net_device *dev, int mtu)
 		mtu = ALIGN(MVNETA_RX_PKT_SIZE(mtu), 8);
 	}
 
-	if (pp->xdp_prog && mtu > MVNETA_MAX_RX_BUF_SIZE) {
-		netdev_info(dev, "Illegal MTU value %d for XDP mode\n", mtu);
-		return -EINVAL;
-	}
-
 	dev->mtu = mtu;
 
 	if (!netif_running(dev)) {
@@ -4439,11 +4434,6 @@ static int mvneta_xdp_setup(struct net_device *dev, struct bpf_prog *prog,
 	struct mvneta_port *pp = netdev_priv(dev);
 	struct bpf_prog *old_prog;
 
-	if (prog && dev->mtu > MVNETA_MAX_RX_BUF_SIZE) {
-		NL_SET_ERR_MSG_MOD(extack, "Jumbo frames not supported on XDP");
-		return -EOPNOTSUPP;
-	}
-
 	if (pp->bm_priv) {
 		NL_SET_ERR_MSG_MOD(extack,
 				   "Hardware Buffer Management not supported on XDP");
-- 
2.26.2



* Re: [PATCH v2 net-next 7/9] bpf: helpers: add multibuffer support
  2020-09-03 20:58 ` [PATCH v2 net-next 7/9] bpf: helpers: add multibuffer support Lorenzo Bianconi
@ 2020-09-03 21:24   ` Maciej Fijalkowski
  2020-09-04  9:47     ` Lorenzo Bianconi
  0 siblings, 1 reply; 42+ messages in thread
From: Maciej Fijalkowski @ 2020-09-03 21:24 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast, shayagr

On Thu, Sep 03, 2020 at 10:58:51PM +0200, Lorenzo Bianconi wrote:
> From: Sameeh Jubran <sameehj@amazon.com>
> 
> The implementation is based on this [0] draft by Jesper D. Brouer.
> 
> Provided two new helpers:
> 
> * bpf_xdp_get_frag_count()
> * bpf_xdp_get_frags_total_size()
> 
> [0] xdp mb design - https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  include/uapi/linux/bpf.h       | 14 ++++++++++++
>  net/core/filter.c              | 39 ++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 14 ++++++++++++
>  3 files changed, 67 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c4a6d245619c..53db75095306 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3590,6 +3590,18 @@ union bpf_attr {
>   *
>   *	Return
>   *		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_xdp_get_frag_count(struct xdp_buff *xdp_md)
> + *	Description
> + *		Get the total number of frags for a given packet.
> + *	Return
> + *		The number of frags
> + *
> + * int bpf_xdp_get_frags_total_size(struct xdp_buff *xdp_md)
> + *	Description
> + *		Get the total size of frags for a given packet.
> + *	Return
> + *		The total size of frags for a given packet.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3742,6 +3754,8 @@ union bpf_attr {
>  	FN(d_path),			\
>  	FN(copy_from_user),		\
>  	FN(xdp_adjust_mb_header),	\
> +	FN(xdp_get_frag_count),		\
> +	FN(xdp_get_frags_total_size),	\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/net/core/filter.c b/net/core/filter.c
> index ae6b10cf062d..ba058fc16440 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3526,6 +3526,41 @@ static const struct bpf_func_proto bpf_xdp_adjust_mb_header_proto = {
>  	.arg2_type	= ARG_ANYTHING,
>  };
>  
> +BPF_CALL_1(bpf_xdp_get_frag_count, struct  xdp_buff*, xdp)
> +{
> +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> +
> +	return xdp->mb ? sinfo->nr_frags : 0;
> +}
> +
> +const struct bpf_func_proto bpf_xdp_get_frag_count_proto = {
> +	.func		= bpf_xdp_get_frag_count,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +};
> +
> +BPF_CALL_1(bpf_xdp_get_frags_total_size, struct  xdp_buff*, xdp)
> +{
> +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> +	int nfrags, i;
> +	int size = 0;
> +
> +	nfrags = xdp->mb ? sinfo->nr_frags : 0;
> +
> +	for (i = 0; i < nfrags && i < MAX_SKB_FRAGS; i++)
> +		size += skb_frag_size(&sinfo->frags[i]);
> +

I only quickly jumped through the series, but IMHO this helper should be
rewritten/optimized so that we bail out as early as possible when
!xdp->mb, since in that case the rest of the code does nothing and I'm
not sure the compiler would optimize it away.


	struct skb_shared_info *sinfo;
	int nfrags, i;
	int size = 0;

	if (!xdp->mb)
		return 0;

	sinfo = xdp_get_shared_info_from_buff(xdp);

	nfrags = min(sinfo->nr_frags, MAX_SKB_FRAGS);

	for (i = 0; i < nfrags; i++)
		size += skb_frag_size(&sinfo->frags[i]);

	return size;

Thoughts?


> +	return size;
> +}
> +
> +const struct bpf_func_proto bpf_xdp_get_frags_total_size_proto = {
> +	.func		= bpf_xdp_get_frags_total_size,
> +	.gpl_only	= false,
> +	.ret_type	= RET_INTEGER,
> +	.arg1_type	= ARG_PTR_TO_CTX,
> +};
> +
>  BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
>  {
>  	void *data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> @@ -6889,6 +6924,10 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>  		return &bpf_xdp_adjust_tail_proto;
>  	case BPF_FUNC_xdp_adjust_mb_header:
>  		return &bpf_xdp_adjust_mb_header_proto;
> +	case BPF_FUNC_xdp_get_frag_count:
> +		return &bpf_xdp_get_frag_count_proto;
> +	case BPF_FUNC_xdp_get_frags_total_size:
> +		return &bpf_xdp_get_frags_total_size_proto;
>  	case BPF_FUNC_fib_lookup:
>  		return &bpf_xdp_fib_lookup_proto;
>  #ifdef CONFIG_INET
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 392d52a2ecef..dd4669096cbb 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -3591,6 +3591,18 @@ union bpf_attr {
>   *
>   *	Return
>   *		0 on success, or a negative error in case of failure.
> + *
> + * int bpf_xdp_get_frag_count(struct xdp_buff *xdp_md)
> + *	Description
> + *		Get the total number of frags for a given packet.
> + *	Return
> + *		The number of frags
> + *
> + * int bpf_xdp_get_frags_total_size(struct xdp_buff *xdp_md)
> + *	Description
> + *		Get the total size of frags for a given packet.
> + *	Return
> + *		The total size of frags for a given packet.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3743,6 +3755,8 @@ union bpf_attr {
>  	FN(d_path),			\
>  	FN(copy_from_user),		\
>  	FN(xdp_adjust_mb_header),	\
> +	FN(xdp_get_frag_count),		\
> +	FN(xdp_get_frags_total_size),	\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> -- 
> 2.26.2
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame
  2020-09-03 20:58 ` [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
@ 2020-09-04  1:07   ` Alexei Starovoitov
  2020-09-04  7:19     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2020-09-04  1:07 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast, shayagr

On Thu, Sep 03, 2020 at 10:58:45PM +0200, Lorenzo Bianconi wrote:
> Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer to specify
> if shared_info area has been properly initialized for non-linear
> xdp buffers
> 
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  include/net/xdp.h | 8 ++++++--
>  net/core/xdp.c    | 1 +
>  2 files changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 3814fb631d52..42f439f9fcda 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -72,7 +72,8 @@ struct xdp_buff {
>  	void *data_hard_start;
>  	struct xdp_rxq_info *rxq;
>  	struct xdp_txq_info *txq;
> -	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
> +	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved tailroom*/
> +	u32 mb:1; /* xdp non-linear buffer */
>  };
>  
>  /* Reserve memory area at end-of data area.
> @@ -96,7 +97,8 @@ struct xdp_frame {
>  	u16 len;
>  	u16 headroom;
>  	u32 metasize:8;
> -	u32 frame_sz:24;
> +	u32 frame_sz:23;
> +	u32 mb:1; /* xdp non-linear frame */

Hmm. Last time I checked compilers were generating ugly code with bitfields.
Not performant and not efficient.
frame_sz is used in the fast path.
I suspect the first hunk alone will cause performance degradation.
Could you use normal u8 or u32 flag field?


* Re: [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (8 preceding siblings ...)
  2020-09-03 20:58 ` [PATCH v2 net-next 9/9] net: mvneta: enable jumbo frames for XDP Lorenzo Bianconi
@ 2020-09-04  1:08 ` Alexei Starovoitov
  2020-09-04  7:40   ` Lorenzo Bianconi
  2020-09-04  5:41 ` John Fastabend
  10 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2020-09-04  1:08 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast, shayagr

On Thu, Sep 03, 2020 at 10:58:44PM +0200, Lorenzo Bianconi wrote:
> For the moment we have not implemented any self-test for the introduced bpf
> helpers. We can address this in a follow-up series if the proposed approach
> is accepted.

selftest has to be part of the same patch set.


* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-03 20:58 ` [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper Lorenzo Bianconi
@ 2020-09-04  1:09   ` Alexei Starovoitov
  2020-09-04  8:07     ` Lorenzo Bianconi
  2020-09-04  1:13   ` Alexei Starovoitov
  2020-09-04  6:47   ` John Fastabend
  2 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2020-09-04  1:09 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast, shayagr

On Thu, Sep 03, 2020 at 10:58:50PM +0200, Lorenzo Bianconi wrote:
> Introduce bpf_xdp_adjust_mb_header helper in order to adjust frame
> headers moving *offset* bytes from/to the second buffer to/from the
> first one.
> This helper can be used to move headers when the hw DMA SG is not able
> to copy all the headers in the first fragment and split header and data
> pages.
> 
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  include/uapi/linux/bpf.h       | 25 ++++++++++++----
>  net/core/filter.c              | 54 ++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 26 ++++++++++++----
>  3 files changed, 95 insertions(+), 10 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 8dda13880957..c4a6d245619c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3571,11 +3571,25 @@ union bpf_attr {
>   *		value.
>   *
>   * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
> - * 	Description
> - * 		Read *size* bytes from user space address *user_ptr* and store
> - * 		the data in *dst*. This is a wrapper of copy_from_user().
> - * 	Return
> - * 		0 on success, or a negative error in case of failure.
> + *	Description
> + *		Read *size* bytes from user space address *user_ptr* and store
> + *		the data in *dst*. This is a wrapper of copy_from_user().
> + *
> + * long bpf_xdp_adjust_mb_header(struct xdp_buff *xdp_md, int offset)

botched rebase?


* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-03 20:58 ` [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper Lorenzo Bianconi
  2020-09-04  1:09   ` Alexei Starovoitov
@ 2020-09-04  1:13   ` Alexei Starovoitov
  2020-09-04  7:50     ` Lorenzo Bianconi
  2020-09-04  6:47   ` John Fastabend
  2 siblings, 1 reply; 42+ messages in thread
From: Alexei Starovoitov @ 2020-09-04  1:13 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast, shayagr

On Thu, Sep 03, 2020 at 10:58:50PM +0200, Lorenzo Bianconi wrote:
> +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> +	   int, offset)
> +{
> +	void *data_hard_end, *data_end;
> +	struct skb_shared_info *sinfo;
> +	int frag_offset, frag_len;
> +	u8 *addr;
> +
> +	if (!xdp->mb)
> +		return -EOPNOTSUPP;
> +
> +	sinfo = xdp_get_shared_info_from_buff(xdp);
> +
> +	frag_len = skb_frag_size(&sinfo->frags[0]);
> +	if (offset > frag_len)
> +		return -EINVAL;
> +
> +	frag_offset = skb_frag_off(&sinfo->frags[0]);
> +	data_end = xdp->data_end + offset;
> +
> +	if (offset < 0 && (-offset > frag_offset ||
> +			   data_end < xdp->data + ETH_HLEN))
> +		return -EINVAL;
> +
> +	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> +	if (data_end > data_hard_end)
> +		return -EINVAL;
> +
> +	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> +	if (offset > 0) {
> +		memcpy(xdp->data_end, addr, offset);
> +	} else {
> +		memcpy(addr + offset, xdp->data_end + offset, -offset);
> +		memset(xdp->data_end + offset, 0, -offset);
> +	}
> +
> +	skb_frag_size_sub(&sinfo->frags[0], offset);
> +	skb_frag_off_add(&sinfo->frags[0], offset);
> +	xdp->data_end = data_end;
> +
> +	return 0;
> +}

wait a sec. Are you saying that multi buffer XDP actually should be skb based?
If that's what mvneta driver is doing that's fine, but that is not a
reasonable requirement to put on all other drivers.


* RE: [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support
  2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (9 preceding siblings ...)
  2020-09-04  1:08 ` [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Alexei Starovoitov
@ 2020-09-04  5:41 ` John Fastabend
  2020-09-04  7:39   ` Lorenzo Bianconi
  10 siblings, 1 reply; 42+ messages in thread
From: John Fastabend @ 2020-09-04  5:41 UTC (permalink / raw)
  To: Lorenzo Bianconi, netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

Lorenzo Bianconi wrote:
> - Finalize XDP multi-buffer support for mvneta driver introducing the
>   capability to map non-linear buffers on tx side.
> - Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer to specify if
>   shared_info area has been properly initialized.
> - Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> - Add multi-buff support to xdp_return_{buff/frame} utility routines.
> - Introduce bpf_xdp_adjust_mb_header helper to adjust frame headers moving
>   *offset* bytes from/to the second buffer to/from the first one.
>   This helper can be used to move headers when the hw DMA SG is not able
>   to copy all the headers in the first fragment and split header and data
>   pages. A possible use case for bpf_xdp_adjust_mb_header is described
>   here [0]

Are those slides available anywhere? [0] is just a link to the abstract.

> - Introduce bpf_xdp_get_frag_count and bpf_xdp_get_frags_total_size helpers to
>   report the total number/size of frags for a given xdp multi-buff.
> 
> XDP multi-buffer design principles are described here [1]
> > For the moment we have not implemented any self-test for the introduced bpf
> > helpers. We can address this in a follow-up series if the proposed approach
> > is accepted.

Will need to include selftests with series.

> 
> Changes since v1:
> - Fix use-after-free in xdp_return_{buff/frame}
> - Introduce bpf helpers
> - Introduce xdp_mb sample program
> - access skb_shared_info->nr_frags only on the last fragment
> 
> Changes since RFC:
> - squash multi-buffer bit initialization in a single patch
> - add mvneta non-linear XDP buff support for tx side
> 
> [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> 
> Lorenzo Bianconi (7):
>   xdp: introduce mb in xdp_buff/xdp_frame
>   xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
>   net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
>   xdp: add multi-buff support to xdp_return_{buff/frame}
>   net: mvneta: add multi buffer support to XDP_TX
>   bpf: helpers: add bpf_xdp_adjust_mb_header helper
>   net: mvneta: enable jumbo frames for XDP
> 
> Sameeh Jubran (2):
>   bpf: helpers: add multibuffer support
>   samples/bpf: add bpf program that uses xdp mb helpers
> 
>  drivers/net/ethernet/amazon/ena/ena_netdev.c  |   1 +
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   1 +
>  .../net/ethernet/cavium/thunder/nicvf_main.c  |   1 +
>  .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |   1 +
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   1 +
>  drivers/net/ethernet/intel/ice/ice_txrx.c     |   1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   1 +
>  .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |   1 +
>  drivers/net/ethernet/marvell/mvneta.c         | 126 ++++++------
>  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |   1 +
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c    |   1 +
>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   1 +
>  .../ethernet/netronome/nfp/nfp_net_common.c   |   1 +
>  drivers/net/ethernet/qlogic/qede/qede_fp.c    |   1 +
>  drivers/net/ethernet/sfc/rx.c                 |   1 +
>  drivers/net/ethernet/socionext/netsec.c       |   1 +
>  drivers/net/ethernet/ti/cpsw.c                |   1 +
>  drivers/net/ethernet/ti/cpsw_new.c            |   1 +
>  drivers/net/hyperv/netvsc_bpf.c               |   1 +
>  drivers/net/tun.c                             |   2 +
>  drivers/net/veth.c                            |   1 +
>  drivers/net/virtio_net.c                      |   2 +
>  drivers/net/xen-netfront.c                    |   1 +
>  include/net/xdp.h                             |  26 ++-
>  include/uapi/linux/bpf.h                      |  39 +++-
>  net/core/dev.c                                |   1 +
>  net/core/filter.c                             |  93 +++++++++
>  net/core/xdp.c                                |  40 ++++
>  samples/bpf/Makefile                          |   3 +
>  samples/bpf/xdp_mb_kern.c                     |  68 +++++++
>  samples/bpf/xdp_mb_user.c                     | 182 ++++++++++++++++++
>  tools/include/uapi/linux/bpf.h                |  40 +++-
>  32 files changed, 572 insertions(+), 70 deletions(-)
>  create mode 100644 samples/bpf/xdp_mb_kern.c
>  create mode 100644 samples/bpf/xdp_mb_user.c
> 
> -- 
> 2.26.2
> 




* RE: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-03 20:58 ` [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper Lorenzo Bianconi
  2020-09-04  1:09   ` Alexei Starovoitov
  2020-09-04  1:13   ` Alexei Starovoitov
@ 2020-09-04  6:47   ` John Fastabend
  2020-09-04  9:45     ` Lorenzo Bianconi
  2 siblings, 1 reply; 42+ messages in thread
From: John Fastabend @ 2020-09-04  6:47 UTC (permalink / raw)
  To: Lorenzo Bianconi, netdev
  Cc: bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr

Lorenzo Bianconi wrote:
> Introduce bpf_xdp_adjust_mb_header helper in order to adjust frame
> headers moving *offset* bytes from/to the second buffer to/from the
> first one.
> This helper can be used to move headers when the hw DMA SG is not able
> to copy all the headers in the first fragment and split header and data
> pages.
> 
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  include/uapi/linux/bpf.h       | 25 ++++++++++++----
>  net/core/filter.c              | 54 ++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h | 26 ++++++++++++----
>  3 files changed, 95 insertions(+), 10 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 8dda13880957..c4a6d245619c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3571,11 +3571,25 @@ union bpf_attr {
>   *		value.
>   *
>   * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
> - * 	Description
> - * 		Read *size* bytes from user space address *user_ptr* and store
> - * 		the data in *dst*. This is a wrapper of copy_from_user().
> - * 	Return
> - * 		0 on success, or a negative error in case of failure.
> + *	Description
> + *		Read *size* bytes from user space address *user_ptr* and store
> + *		the data in *dst*. This is a wrapper of copy_from_user().
> + *
> + * long bpf_xdp_adjust_mb_header(struct xdp_buff *xdp_md, int offset)
> + *	Description
> + *		Adjust frame headers moving *offset* bytes from/to the second
> + *		buffer to/from the first one. This helper can be used to move
> + *		headers when the hw DMA SG does not copy all the headers in
> + *		the first fragment.

This is confusing to read. Does this mean I can "move bytes to the second
buffer from the first one" or "move bytes from the second buffer to the first
one"? And what are frame headers? I'm sure I can read the code below and work
it out, but users reading the header file should be able to parse this.

Also we want to be able to read all data not just headers. Reading the
payload of a TCP packet is equally important for many l7 load balancers.

> + *
> + *		A call to this helper is susceptible to change the underlying
> + *		packet buffer. Therefore, at load time, all checks on pointers
> + *		previously done by the verifier are invalidated and must be
> + *		performed again, if the helper is used in combination with
> + *		direct packet access.
> + *
> + *	Return
> + *		0 on success, or a negative error in case of failure.
>   */
>  #define __BPF_FUNC_MAPPER(FN)		\
>  	FN(unspec),			\
> @@ -3727,6 +3741,7 @@ union bpf_attr {
>  	FN(inode_storage_delete),	\
>  	FN(d_path),			\
>  	FN(copy_from_user),		\
> +	FN(xdp_adjust_mb_header),	\
>  	/* */
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 47eef9a0be6a..ae6b10cf062d 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3475,6 +3475,57 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
>  	.arg2_type	= ARG_ANYTHING,
>  };
>  
> +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> +	   int, offset)
> +{
> +	void *data_hard_end, *data_end;
> +	struct skb_shared_info *sinfo;
> +	int frag_offset, frag_len;
> +	u8 *addr;
> +
> +	if (!xdp->mb)
> +		return -EOPNOTSUPP;
> +
> +	sinfo = xdp_get_shared_info_from_buff(xdp);
> +
> +	frag_len = skb_frag_size(&sinfo->frags[0]);
> +	if (offset > frag_len)
> +		return -EINVAL;

What if we want data in frags[1] and so on.

> +
> +	frag_offset = skb_frag_off(&sinfo->frags[0]);
> +	data_end = xdp->data_end + offset;
> +
> +	if (offset < 0 && (-offset > frag_offset ||
> +			   data_end < xdp->data + ETH_HLEN))
> +		return -EINVAL;
> +
> +	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> +	if (data_end > data_hard_end)
> +		return -EINVAL;
> +
> +	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> +	if (offset > 0) {
> +		memcpy(xdp->data_end, addr, offset);

But this could get expensive for a large amount of data? And also
limiting because we require the room to do the copy. Presumably
the reason we have frags[1] is because the normal page or half
page is in use?

> +	} else {
> +		memcpy(addr + offset, xdp->data_end + offset, -offset);
> +		memset(xdp->data_end + offset, 0, -offset);
> +	}
> +
> +	skb_frag_size_sub(&sinfo->frags[0], offset);
> +	skb_frag_off_add(&sinfo->frags[0], offset);
> +	xdp->data_end = data_end;
> +
> +	return 0;
> +}

So overall I don't see the point of copying bytes from one frag to
another. Create an API that adjusts the data pointers and then
copies are avoided and manipulating frags is not needed.

Also, and even more concerning, I think this API requires the
driver to populate shinfo. If we use TX_REDIRECT a lot or TX_XMIT,
this means we need to populate shinfo when it's probably never
used. If our driver is smart, L2/L3 headers are in the readable
data and prefetched. Writing frags into the end of a page is likely
not free.

Did you benchmark this?

In general users of this API should know the bytes they want
to fetch. Use an API like this,

  bpf_xdp_adjust_bytes(xdp, start, end)

Where xdp is the frame, start is the first byte the user wants
and end is the last byte. Then usually you can skip the entire
copy part and just move the xdp pointers around. The ugly case
is if the user puts start/end across a frag boundary and a copy
is needed. In that case maybe we use end as a hint and not a
hard requirement.

The use case I see is I read L2/L3/L4 headers and I need the
first N bytes of the payload. I know where the payload starts
and I know how many bytes I need so,

  bpf_xdp_adjust_bytes(xdp, payload_offset, bytes_needed);

Then hopefully that is all in one frag. If it's not, I'll need
to do a second helper call. Still nicer than forcing drivers
to populate this shinfo, I think. If you think I'm wrong, try
a benchmark. Benchmarks aside, I get stuck when data_end and
data_hard_end are too close together.

Thanks,
John


* Re: [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame
  2020-09-04  1:07   ` Alexei Starovoitov
@ 2020-09-04  7:19     ` Jesper Dangaard Brouer
  2020-09-04 15:15       ` David Ahern
  0 siblings, 1 reply; 42+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-04  7:19 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Lorenzo Bianconi, netdev, bpf, davem, lorenzo.bianconi, echaudro,
	sameehj, kuba, john.fastabend, daniel, ast, shayagr, brouer,
	David Ahern

On Thu, 3 Sep 2020 18:07:05 -0700
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Thu, Sep 03, 2020 at 10:58:45PM +0200, Lorenzo Bianconi wrote:
> > Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer to specify
> > if shared_info area has been properly initialized for non-linear
> > xdp buffers
> > 
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  include/net/xdp.h | 8 ++++++--
> >  net/core/xdp.c    | 1 +
> >  2 files changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 3814fb631d52..42f439f9fcda 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -72,7 +72,8 @@ struct xdp_buff {
> >  	void *data_hard_start;
> >  	struct xdp_rxq_info *rxq;
> >  	struct xdp_txq_info *txq;
> > -	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
> > +	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved tailroom*/
> > +	u32 mb:1; /* xdp non-linear buffer */
> >  };
> >  
> >  /* Reserve memory area at end-of data area.
> > @@ -96,7 +97,8 @@ struct xdp_frame {
> >  	u16 len;
> >  	u16 headroom;
> >  	u32 metasize:8;
> > -	u32 frame_sz:24;
> > +	u32 frame_sz:23;
> > +	u32 mb:1; /* xdp non-linear frame */  
> 
> Hmm. Last time I checked compilers were generating ugly code with bitfields.
> Not performant and not efficient.
> frame_sz is used in the fast path.
> I suspect the first hunk alone will cause performance degradation.
> Could you use normal u8 or u32 flag field?

For struct xdp_buff sure we can do this.  For struct xdp_frame, I'm not
sure, as it is a state-compressed version of xdp_buff + extra
information.  The xdp_frame has been called skb-light, and I know
people (e.g. Ahern) want to add more info to it (vlan, RX-hash, csum),
and we must keep this to 1 cache line, for performance reasons.

You do make a good point that these bit-fields might hurt performance
more.  I guess we need to test this, as I constantly worry that we
will slowly kill XDP performance with a thousand paper cuts.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH v2 net-next 2/9] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  2020-09-03 20:58 ` [PATCH v2 net-next 2/9] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers Lorenzo Bianconi
@ 2020-09-04  7:35   ` Jesper Dangaard Brouer
  2020-09-04  7:54     ` Lorenzo Bianconi
  0 siblings, 1 reply; 42+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-04  7:35 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: netdev, bpf, davem, lorenzo.bianconi, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr, brouer, intel-wired-lan

On Thu,  3 Sep 2020 22:58:46 +0200
Lorenzo Bianconi <lorenzo@kernel.org> wrote:

> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 0b675c34ce49..20c8fd3cd4a3 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -2298,6 +2298,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
>  #if (PAGE_SIZE < 8192)
>  	xdp.frame_sz = ixgbe_rx_frame_truesize(rx_ring, 0);
>  #endif
> +	xdp.mb = 0;
>  
>  	while (likely(total_rx_packets < budget)) {
>  		union ixgbe_adv_rx_desc *rx_desc;

In this ixgbe driver you are smart and init the xdp.mb bit outside the
loop (like xdp.frame_sz, when frame_sz is constant).  This is a nice
optimization, but the driver developer who adds XDP multi-buffer
support must remember to reset it.  The patch itself is okay; it is
just something to keep in mind when reviewing/changing drivers.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support
  2020-09-04  5:41 ` John Fastabend
@ 2020-09-04  7:39   ` Lorenzo Bianconi
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-04  7:39 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, daniel, ast, shayagr


> Lorenzo Bianconi wrote:
> > - Finalize XDP multi-buffer support for mvneta driver introducing the
> >   capability to map non-linear buffers on tx side.
> > - Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer to specify if
> >   shared_info area has been properly initialized.
> > - Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> > - Add multi-buff support to xdp_return_{buff/frame} utility routines.
> > - Introduce bpf_xdp_adjust_mb_header helper to adjust frame headers moving
> >   *offset* bytes from/to the second buffer to/from the first one.
> >   This helper can be used to move headers when the hw DMA SG is not able
> >   to copy all the headers in the first fragment and split header and data
> >   pages. A possible use case for bpf_xdp_adjust_mb_header is described
> >   here [0]
> 
> Are those slides available anywhere? [0] is just a link to the abstract.

Yes, sorry, I wanted to point out where we got the idea for this helper.
I do not think the slides are available yet, but I guess they will be soon.

> 
> > - Introduce bpf_xdp_get_frag_count and bpf_xdp_get_frags_total_size helpers to
> >   report the total number/size of frags for a given xdp multi-buff.
> > 
> > XDP multi-buffer design principles are described here [1]
> > For the moment we have not implemented any self-test for the introduced bpf
> > helpers. We can address this in a follow-up series if the proposed approach
> > is accepted.
> 
> Will need to include selftests with series.

Sure, I will add selftests in v3.

Regards,
Lorenzo

> 
> > 
> > Changes since v1:
> > - Fix use-after-free in xdp_return_{buff/frame}
> > - Introduce bpf helpers
> > - Introduce xdp_mb sample program
> > - access skb_shared_info->nr_frags only on the last fragment
> > 
> > Changes since RFC:
> > - squash multi-buffer bit initialization in a single patch
> > - add mvneta non-linear XDP buff support for tx side
> > 
> > [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
> > [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> > 
> > Lorenzo Bianconi (7):
> >   xdp: introduce mb in xdp_buff/xdp_frame
> >   xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
> >   net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
> >   xdp: add multi-buff support to xdp_return_{buff/frame}
> >   net: mvneta: add multi buffer support to XDP_TX
> >   bpf: helpers: add bpf_xdp_adjust_mb_header helper
> >   net: mvneta: enable jumbo frames for XDP
> > 
> > Sameeh Jubran (2):
> >   bpf: helpers: add multibuffer support
> >   samples/bpf: add bpf program that uses xdp mb helpers
> > 
> >  drivers/net/ethernet/amazon/ena/ena_netdev.c  |   1 +
> >  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   1 +
> >  .../net/ethernet/cavium/thunder/nicvf_main.c  |   1 +
> >  .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |   1 +
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   1 +
> >  drivers/net/ethernet/intel/ice/ice_txrx.c     |   1 +
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   1 +
> >  .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |   1 +
> >  drivers/net/ethernet/marvell/mvneta.c         | 126 ++++++------
> >  .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |   1 +
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c    |   1 +
> >  .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   1 +
> >  .../ethernet/netronome/nfp/nfp_net_common.c   |   1 +
> >  drivers/net/ethernet/qlogic/qede/qede_fp.c    |   1 +
> >  drivers/net/ethernet/sfc/rx.c                 |   1 +
> >  drivers/net/ethernet/socionext/netsec.c       |   1 +
> >  drivers/net/ethernet/ti/cpsw.c                |   1 +
> >  drivers/net/ethernet/ti/cpsw_new.c            |   1 +
> >  drivers/net/hyperv/netvsc_bpf.c               |   1 +
> >  drivers/net/tun.c                             |   2 +
> >  drivers/net/veth.c                            |   1 +
> >  drivers/net/virtio_net.c                      |   2 +
> >  drivers/net/xen-netfront.c                    |   1 +
> >  include/net/xdp.h                             |  26 ++-
> >  include/uapi/linux/bpf.h                      |  39 +++-
> >  net/core/dev.c                                |   1 +
> >  net/core/filter.c                             |  93 +++++++++
> >  net/core/xdp.c                                |  40 ++++
> >  samples/bpf/Makefile                          |   3 +
> >  samples/bpf/xdp_mb_kern.c                     |  68 +++++++
> >  samples/bpf/xdp_mb_user.c                     | 182 ++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h                |  40 +++-
> >  32 files changed, 572 insertions(+), 70 deletions(-)
> >  create mode 100644 samples/bpf/xdp_mb_kern.c
> >  create mode 100644 samples/bpf/xdp_mb_user.c
> > 
> > -- 
> > 2.26.2
> > 
> 
> 



* Re: [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support
  2020-09-04  1:08 ` [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Alexei Starovoitov
@ 2020-09-04  7:40   ` Lorenzo Bianconi
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-04  7:40 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast, shayagr


> On Thu, Sep 03, 2020 at 10:58:44PM +0200, Lorenzo Bianconi wrote:
> > For the moment we have not implemented any self-test for the introduced bpf
> > helpers. We can address this in a follow-up series if the proposed approach
> > is accepted.
> 
> selftest has to be part of the same patch set.

sure, I will add it in v3.

Regards,
Lorenzo



* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-04  1:13   ` Alexei Starovoitov
@ 2020-09-04  7:50     ` Lorenzo Bianconi
  2020-09-04 13:52       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-04  7:50 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast, shayagr

[-- Attachment #1: Type: text/plain, Size: 1957 bytes --]

> On Thu, Sep 03, 2020 at 10:58:50PM +0200, Lorenzo Bianconi wrote:
> > +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> > +	   int, offset)
> > +{
> > +	void *data_hard_end, *data_end;
> > +	struct skb_shared_info *sinfo;
> > +	int frag_offset, frag_len;
> > +	u8 *addr;
> > +
> > +	if (!xdp->mb)
> > +		return -EOPNOTSUPP;
> > +
> > +	sinfo = xdp_get_shared_info_from_buff(xdp);
> > +
> > +	frag_len = skb_frag_size(&sinfo->frags[0]);
> > +	if (offset > frag_len)
> > +		return -EINVAL;
> > +
> > +	frag_offset = skb_frag_off(&sinfo->frags[0]);
> > +	data_end = xdp->data_end + offset;
> > +
> > +	if (offset < 0 && (-offset > frag_offset ||
> > +			   data_end < xdp->data + ETH_HLEN))
> > +		return -EINVAL;
> > +
> > +	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> > +	if (data_end > data_hard_end)
> > +		return -EINVAL;
> > +
> > +	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> > +	if (offset > 0) {
> > +		memcpy(xdp->data_end, addr, offset);
> > +	} else {
> > +		memcpy(addr + offset, xdp->data_end + offset, -offset);
> > +		memset(xdp->data_end + offset, 0, -offset);
> > +	}
> > +
> > +	skb_frag_size_sub(&sinfo->frags[0], offset);
> > +	skb_frag_off_add(&sinfo->frags[0], offset);
> > +	xdp->data_end = data_end;
> > +
> > +	return 0;
> > +}
> 
> wait a sec. Are you saying that multi buffer XDP actually should be skb based?
> If that's what mvneta driver is doing that's fine, but that is not a
> reasonable requirement to put on all other drivers.

I did not get what you mean here. The xdp multi-buffer layout uses the skb_shared_info
at the end of the first buffer to link subsequent frames [0], and we rely on the skb_frag*
utilities to set/read the offset and length of subsequent buffers.

Regards,
Lorenzo

[0] http://people.redhat.com/lbiancon/conference/NetDevConf2020-0x14/add-xdp-on-driver.html - XDP multi-buffers section

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 net-next 2/9] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  2020-09-04  7:35   ` Jesper Dangaard Brouer
@ 2020-09-04  7:54     ` Lorenzo Bianconi
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-04  7:54 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, bpf, davem, lorenzo.bianconi, echaudro, sameehj, kuba,
	john.fastabend, daniel, ast, shayagr, intel-wired-lan

[-- Attachment #1: Type: text/plain, Size: 1284 bytes --]

> On Thu,  3 Sep 2020 22:58:46 +0200
> Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> 
> > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > index 0b675c34ce49..20c8fd3cd4a3 100644
> > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> > @@ -2298,6 +2298,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
> >  #if (PAGE_SIZE < 8192)
> >  	xdp.frame_sz = ixgbe_rx_frame_truesize(rx_ring, 0);
> >  #endif
> > +	xdp.mb = 0;
> >  
> >  	while (likely(total_rx_packets < budget)) {
> >  		union ixgbe_adv_rx_desc *rx_desc;
> 
> In this ixgbe driver you are smart and init the xdp.mb bit outside the
> loop (like xdp.frame_sz, when frame_sz is constant).  This is a nice
> optimization, but the driver developer that adds XDP multi-buffer
> support must remember to reset it.  The patch itself is okay; it is
> just something to keep in mind when reviewing/changing drivers.

yes, I just decided to avoid unnecessary instructions in the fast path for the moment.

Regards,
Lorenzo

> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-04  1:09   ` Alexei Starovoitov
@ 2020-09-04  8:07     ` Lorenzo Bianconi
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-04  8:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast, shayagr

[-- Attachment #1: Type: text/plain, Size: 1670 bytes --]

On Sep 03, Alexei Starovoitov wrote:
> On Thu, Sep 03, 2020 at 10:58:50PM +0200, Lorenzo Bianconi wrote:
> > Introduce bpf_xdp_adjust_mb_header helper in order to adjust frame
> > headers moving *offset* bytes from/to the second buffer to/from the
> > first one.
> > This helper can be used to move headers when the hw DMA SG is not able
> > to copy all the headers in the first fragment and split header and data
> > pages.
> > 
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  include/uapi/linux/bpf.h       | 25 ++++++++++++----
> >  net/core/filter.c              | 54 ++++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h | 26 ++++++++++++----
> >  3 files changed, 95 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 8dda13880957..c4a6d245619c 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3571,11 +3571,25 @@ union bpf_attr {
> >   *		value.
> >   *
> >   * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
> > - * 	Description
> > - * 		Read *size* bytes from user space address *user_ptr* and store
> > - * 		the data in *dst*. This is a wrapper of copy_from_user().
> > - * 	Return
> > - * 		0 on success, or a negative error in case of failure.
> > + *	Description
> > + *		Read *size* bytes from user space address *user_ptr* and store
> > + *		the data in *dst*. This is a wrapper of copy_from_user().
> > + *
> > + * long bpf_xdp_adjust_mb_header(struct xdp_buff *xdp_md, int offset)
> 
> botched rebase?

Yes, sorry. I will fix in v3.

Regards,
Lorenzo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-04  6:47   ` John Fastabend
@ 2020-09-04  9:45     ` Lorenzo Bianconi
  2020-09-04 15:23       ` John Fastabend
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-04  9:45 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, daniel, ast, shayagr, edumazet

[-- Attachment #1: Type: text/plain, Size: 8672 bytes --]

> Lorenzo Bianconi wrote:
> > Introduce bpf_xdp_adjust_mb_header helper in order to adjust frame
> > headers moving *offset* bytes from/to the second buffer to/from the
> > first one.
> > This helper can be used to move headers when the hw DMA SG is not able
> > to copy all the headers in the first fragment and split header and data
> > pages.
> > 
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  include/uapi/linux/bpf.h       | 25 ++++++++++++----
> >  net/core/filter.c              | 54 ++++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h | 26 ++++++++++++----
> >  3 files changed, 95 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 8dda13880957..c4a6d245619c 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3571,11 +3571,25 @@ union bpf_attr {
> >   *		value.
> >   *
> >   * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
> > - * 	Description
> > - * 		Read *size* bytes from user space address *user_ptr* and store
> > - * 		the data in *dst*. This is a wrapper of copy_from_user().
> > - * 	Return
> > - * 		0 on success, or a negative error in case of failure.
> > + *	Description
> > + *		Read *size* bytes from user space address *user_ptr* and store
> > + *		the data in *dst*. This is a wrapper of copy_from_user().
> > + *
> > + * long bpf_xdp_adjust_mb_header(struct xdp_buff *xdp_md, int offset)
> > + *	Description
> > + *		Adjust frame headers moving *offset* bytes from/to the second
> > + *		buffer to/from the first one. This helper can be used to move
> > + *		headers when the hw DMA SG does not copy all the headers in
> > + *		the first fragment.

+ Eric to the discussion

> 
> This is confusing to read. Does this mean I can "move bytes to the second
> buffer from the first one" or "move bytes from the second buffer to the first
> one"? And what are frame headers? I'm sure I can read the code below and work
> it out, but users reading the header file should be able to parse this.

Our main goal with this helper is to fix the use-case where we request the hw
to put the L2/L3/L4 headers (and all the TCP options) in the first fragment and the TCP
data starting from the second fragment (header split), but for some reason
the hw puts the TCP options in the second fragment (if we understood correctly,
this issue was raised by Eric @ NetDevConf 0x14).
bpf_xdp_adjust_mb_header() can fix this use-case by moving bytes from the second fragment
to the first one (offset > 0) or from the first buffer to the second one (offset < 0).

> 
> Also we want to be able to read all data not just headers. Reading the
> payload of a TCP packet is equally important for many l7 load balancers.
> 

In order to avoid slowing down performance, we require that the eBPF sandbox can
read/write only buff0 in an xdp multi-buffer. The xdp program can only
perform some restricted changes to buff<n> (n >= 1) (e.g. what we did in
bpf_xdp_adjust_mb_header()).
You can find the xdp multi-buff design principles here [0][1]

[0] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
[1] http://people.redhat.com/lbiancon/conference/NetDevConf2020-0x14/add-xdp-on-driver.html - XDP multi-buffers section (slide 40)

> > + *
> > + *		A call to this helper is susceptible to change the underlying
> > + *		packet buffer. Therefore, at load time, all checks on pointers
> > + *		previously done by the verifier are invalidated and must be
> > + *		performed again, if the helper is used in combination with
> > + *		direct packet access.
> > + *
> > + *	Return
> > + *		0 on success, or a negative error in case of failure.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)		\
> >  	FN(unspec),			\
> > @@ -3727,6 +3741,7 @@ union bpf_attr {
> >  	FN(inode_storage_delete),	\
> >  	FN(d_path),			\
> >  	FN(copy_from_user),		\
> > +	FN(xdp_adjust_mb_header),	\
> >  	/* */
> >  
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 47eef9a0be6a..ae6b10cf062d 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -3475,6 +3475,57 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
> >  	.arg2_type	= ARG_ANYTHING,
> >  };
> >  
> > +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> > +	   int, offset)
> > +{
> > +	void *data_hard_end, *data_end;
> > +	struct skb_shared_info *sinfo;
> > +	int frag_offset, frag_len;
> > +	u8 *addr;
> > +
> > +	if (!xdp->mb)
> > +		return -EOPNOTSUPP;
> > +
> > +	sinfo = xdp_get_shared_info_from_buff(xdp);
> > +
> > +	frag_len = skb_frag_size(&sinfo->frags[0]);
> > +	if (offset > frag_len)
> > +		return -EINVAL;
> 
> What if we want data in frags[1] and so on.
> 
> > +
> > +	frag_offset = skb_frag_off(&sinfo->frags[0]);
> > +	data_end = xdp->data_end + offset;
> > +
> > +	if (offset < 0 && (-offset > frag_offset ||
> > +			   data_end < xdp->data + ETH_HLEN))
> > +		return -EINVAL;
> > +
> > +	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> > +	if (data_end > data_hard_end)
> > +		return -EINVAL;
> > +
> > +	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> > +	if (offset > 0) {
> > +		memcpy(xdp->data_end, addr, offset);
> 
> But this could get expensive for large amounts of data? And also
> limiting because we require the room to do the copy. Presumably
> the reason we have frags[1] is because the normal page or half
> page is in use?
> 
> > +	} else {
> > +		memcpy(addr + offset, xdp->data_end + offset, -offset);
> > +		memset(xdp->data_end + offset, 0, -offset);
> > +	}
> > +
> > +	skb_frag_size_sub(&sinfo->frags[0], offset);
> > +	skb_frag_off_add(&sinfo->frags[0], offset);
> > +	xdp->data_end = data_end;
> > +
> > +	return 0;
> > +}
> 
> So overall I don't see the point of copying bytes from one frag to
> another. Create an API that adjusts the data pointers and then
> copies are avoided and manipulating frags is not needed.

please see above.

> 
> Also and even more concerning I think this API requires the
> driver to populate shinfo. If we use TX_REDIRECT a lot or TX_XMIT
> this means we need to populate shinfo when its probably not ever
> used. If our driver is smart L2/L3 headers are in the readable
> data and prefetched. Writing frags into the end of a page is likely
> not free.

Sorry, I did not get what you mean with "populate shinfo" here. We need to
properly set the shared_info in order to create the xdp multi-buff.
Apart from header split, please consider that the main use-cases for
xdp multi-buff are XDP with TSO and jumbo frames.

> 
> Did you benchmark this?

will do, I need to understand if we can use tiny buffers in mvneta.

> 
> In general users of this API should know the bytes they want
> to fetch. Use an API like this,
> 
>   bpf_xdp_adjust_bytes(xdp, start, end)
> 
> Where xdp is the frame, start is the first byte the user wants
> and end is the last byte. Then usually you can skip the entire
> copy part and just move the xdp pointers around. The ugly case
> is if the user puts start/end across a frag boundary and a copy
> is needed. In that case maybe we use end as a hint and not a
> hard requirement.
> 
> The use case I see is I read L2/L3/L4 headers and I need the
> first N bytes of the payload. I know where the payload starts
> and I know how many bytes I need so,
> 
>   bpf_xdp_adjust_bytes(xdp, payload_offset, bytes_needed);
> 
> Then hopefully that is all in one frag. If its not I'll need
> to do a second helper call. Still nicer than forcing drivers
> to populate this shinfo I think. If you think I'm wrong try
> a benchmark. Benchmarks aside I get stuck when data_end and
> data_hard_end are too close together.

IIUC what you mean here is to expose the L2/L3/L4 headers + some data to
the ebpf program to perform e.g. L7 load-balancing, right?
Let's consider the jumbo frame use-case (so the data is split across multiple
buffers). I can see two issues here:
- since in XDP we can't linearize the buffer if start and end are on
  different pages (like we do in bpf_msg_pull_data()), do we end up
  in the case where the requested data must all be in buff0?
- if start and end are in buff<2>, we should report the fragment number to the
  ebpf program to "fetch" the data. Are we exposing too many low-level details to
  user-space?

Regards,
Lorenzo

> 
> Thanks,
> John

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 net-next 7/9] bpf: helpers: add multibuffer support
  2020-09-03 21:24   ` Maciej Fijalkowski
@ 2020-09-04  9:47     ` Lorenzo Bianconi
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-04  9:47 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast, shayagr

[-- Attachment #1: Type: text/plain, Size: 3899 bytes --]

> On Thu, Sep 03, 2020 at 10:58:51PM +0200, Lorenzo Bianconi wrote:
> > From: Sameeh Jubran <sameehj@amazon.com>
> > 
> > The implementation is based on this [0] draft by Jesper D. Brouer.
> > 
> > Provided two new helpers:
> > 
> > * bpf_xdp_get_frag_count()
> > * bpf_xdp_get_frags_total_size()
> > 
> > [0] xdp mb design - https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> > Signed-off-by: Sameeh Jubran <sameehj@amazon.com>
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  include/uapi/linux/bpf.h       | 14 ++++++++++++
> >  net/core/filter.c              | 39 ++++++++++++++++++++++++++++++++++
> >  tools/include/uapi/linux/bpf.h | 14 ++++++++++++
> >  3 files changed, 67 insertions(+)
> > 
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index c4a6d245619c..53db75095306 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3590,6 +3590,18 @@ union bpf_attr {
> >   *
> >   *	Return
> >   *		0 on success, or a negative error in case of failure.

[...]

> > +
> 
> I only quickly jumped through the series, and IMHO this helper should be
> rewritten/optimized in a way that we bail out as early as possible if
> !xdp->mb, as the rest of the code on that condition will do nothing and I'm
> not sure if the compiler would optimize it.
> 
> 
> 	struct skb_shared_info *sinfo;
> 	int nfrags, i;
> 	int size = 0;
> 
> 	if (!xdp->mb)
> 		return 0;
> 
> 	sinfo = xdp_get_shared_info_from_buff(xdp);
> 
> 	nfrags = min(sinfo->nr_frags, MAX_SKB_FRAGS);
> 
> 	for (i = 0; i < nfrags; i++)
> 		size += skb_frag_size(&sinfo->frags[i]);
> 
> 	return size;
> 
> Thoughts?

I agree.

Regards,
Lorenzo

> 
> 
> > +	return size;
> > +}
> > +
> > +const struct bpf_func_proto bpf_xdp_get_frags_total_size_proto = {
> > +	.func		= bpf_xdp_get_frags_total_size,
> > +	.gpl_only	= false,
> > +	.ret_type	= RET_INTEGER,
> > +	.arg1_type	= ARG_PTR_TO_CTX,
> > +};
> > +
> >  BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
> >  {
> >  	void *data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> > @@ -6889,6 +6924,10 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> >  		return &bpf_xdp_adjust_tail_proto;
> >  	case BPF_FUNC_xdp_adjust_mb_header:
> >  		return &bpf_xdp_adjust_mb_header_proto;
> > +	case BPF_FUNC_xdp_get_frag_count:
> > +		return &bpf_xdp_get_frag_count_proto;
> > +	case BPF_FUNC_xdp_get_frags_total_size:
> > +		return &bpf_xdp_get_frags_total_size_proto;
> >  	case BPF_FUNC_fib_lookup:
> >  		return &bpf_xdp_fib_lookup_proto;
> >  #ifdef CONFIG_INET
> > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > index 392d52a2ecef..dd4669096cbb 100644
> > --- a/tools/include/uapi/linux/bpf.h
> > +++ b/tools/include/uapi/linux/bpf.h
> > @@ -3591,6 +3591,18 @@ union bpf_attr {
> >   *
> >   *	Return
> >   *		0 on success, or a negative error in case of failure.
> > + *
> > + * int bpf_xdp_get_frag_count(struct xdp_buff *xdp_md)
> > + *	Description
> > + *		Get the total number of frags for a given packet.
> > + *	Return
> > + *		The number of frags
> > + *
> > + * int bpf_xdp_get_frags_total_size(struct xdp_buff *xdp_md)
> > + *	Description
> > + *		Get the total size of frags for a given packet.
> > + *	Return
> > + *		The total size of frags for a given packet.
> >   */
> >  #define __BPF_FUNC_MAPPER(FN)		\
> >  	FN(unspec),			\
> > @@ -3743,6 +3755,8 @@ union bpf_attr {
> >  	FN(d_path),			\
> >  	FN(copy_from_user),		\
> >  	FN(xdp_adjust_mb_header),	\
> > +	FN(xdp_get_frag_count),		\
> > +	FN(xdp_get_frags_total_size),	\
> >  	/* */
> >  
> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > -- 
> > 2.26.2
> > 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-04  7:50     ` Lorenzo Bianconi
@ 2020-09-04 13:52       ` Jesper Dangaard Brouer
  2020-09-04 14:27         ` Lorenzo Bianconi
  0 siblings, 1 reply; 42+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-04 13:52 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Alexei Starovoitov, netdev, bpf, davem, lorenzo.bianconi,
	echaudro, sameehj, kuba, john.fastabend, daniel, ast, shayagr,
	brouer

On Fri, 4 Sep 2020 09:50:31 +0200
Lorenzo Bianconi <lorenzo@kernel.org> wrote:

> > On Thu, Sep 03, 2020 at 10:58:50PM +0200, Lorenzo Bianconi wrote:  
> > > +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> > > +	   int, offset)
> > > +{
> > > +	void *data_hard_end, *data_end;
> > > +	struct skb_shared_info *sinfo;
> > > +	int frag_offset, frag_len;
> > > +	u8 *addr;
> > > +
> > > +	if (!xdp->mb)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	sinfo = xdp_get_shared_info_from_buff(xdp);
> > > +
> > > +	frag_len = skb_frag_size(&sinfo->frags[0]);
> > > +	if (offset > frag_len)
> > > +		return -EINVAL;
> > > +
> > > +	frag_offset = skb_frag_off(&sinfo->frags[0]);
> > > +	data_end = xdp->data_end + offset;
> > > +
> > > +	if (offset < 0 && (-offset > frag_offset ||
> > > +			   data_end < xdp->data + ETH_HLEN))
> > > +		return -EINVAL;
> > > +
> > > +	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> > > +	if (data_end > data_hard_end)
> > > +		return -EINVAL;
> > > +
> > > +	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> > > +	if (offset > 0) {
> > > +		memcpy(xdp->data_end, addr, offset);
> > > +	} else {
> > > +		memcpy(addr + offset, xdp->data_end + offset, -offset);
> > > +		memset(xdp->data_end + offset, 0, -offset);
> > > +	}
> > > +
> > > +	skb_frag_size_sub(&sinfo->frags[0], offset);
> > > +	skb_frag_off_add(&sinfo->frags[0], offset);
> > > +	xdp->data_end = data_end;
> > > +
> > > +	return 0;
> > > +}  
> > 
> > wait a sec. Are you saying that multi buffer XDP actually should be skb based?
> > If that's what mvneta driver is doing that's fine, but that is not a
> > reasonable requirement to put on all other drivers.  
> 
> I did not get what you mean here. The xdp multi-buffer layout uses
> the skb_shared_info at the end of the first buffer to link subsequent
> frames [0], and we rely on the skb_frag* utilities to set/read the
> offset and length of subsequent buffers.

Yes, for now the same layout as "skb_shared_info" is being reused, but I
think we should think of this as "xdp_shared_info" instead, as how it
is used for XDP is going to diverge from SKBs.  We already discussed (in a
conf call) that we could store the total len of the "frags" here, to
simplify the other helper.

Using the skb_frag_* helper functions is misleading, and will make it
more difficult to diverge from how SKBs handle frags.  What about
introducing xdp_frag_* wrappers? (what do others think?)


> 
> [0] http://people.redhat.com/lbiancon/conference/NetDevConf2020-0x14/add-xdp-on-driver.html
>     - XDP multi-buffers section (slide 40)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-04 13:52       ` Jesper Dangaard Brouer
@ 2020-09-04 14:27         ` Lorenzo Bianconi
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-04 14:27 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Lorenzo Bianconi, Alexei Starovoitov, netdev, bpf, davem,
	echaudro, sameehj, kuba, john.fastabend, daniel, ast, shayagr

[-- Attachment #1: Type: text/plain, Size: 3331 bytes --]

> On Fri, 4 Sep 2020 09:50:31 +0200
> Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> 
> > > On Thu, Sep 03, 2020 at 10:58:50PM +0200, Lorenzo Bianconi wrote:  
> > > > +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> > > > +	   int, offset)
> > > > +{
> > > > +	void *data_hard_end, *data_end;
> > > > +	struct skb_shared_info *sinfo;
> > > > +	int frag_offset, frag_len;
> > > > +	u8 *addr;
> > > > +
> > > > +	if (!xdp->mb)
> > > > +		return -EOPNOTSUPP;
> > > > +
> > > > +	sinfo = xdp_get_shared_info_from_buff(xdp);
> > > > +
> > > > +	frag_len = skb_frag_size(&sinfo->frags[0]);
> > > > +	if (offset > frag_len)
> > > > +		return -EINVAL;
> > > > +
> > > > +	frag_offset = skb_frag_off(&sinfo->frags[0]);
> > > > +	data_end = xdp->data_end + offset;
> > > > +
> > > > +	if (offset < 0 && (-offset > frag_offset ||
> > > > +			   data_end < xdp->data + ETH_HLEN))
> > > > +		return -EINVAL;
> > > > +
> > > > +	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> > > > +	if (data_end > data_hard_end)
> > > > +		return -EINVAL;
> > > > +
> > > > +	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> > > > +	if (offset > 0) {
> > > > +		memcpy(xdp->data_end, addr, offset);
> > > > +	} else {
> > > > +		memcpy(addr + offset, xdp->data_end + offset, -offset);
> > > > +		memset(xdp->data_end + offset, 0, -offset);
> > > > +	}
> > > > +
> > > > +	skb_frag_size_sub(&sinfo->frags[0], offset);
> > > > +	skb_frag_off_add(&sinfo->frags[0], offset);
> > > > +	xdp->data_end = data_end;
> > > > +
> > > > +	return 0;
> > > > +}  
> > > 
> > > wait a sec. Are you saying that multi buffer XDP actually should be skb based?
> > > If that's what mvneta driver is doing that's fine, but that is not a
> > > reasonable requirement to put on all other drivers.  
> > 
> > I did not get what you mean here. The xdp multi-buffer layout uses
> > the skb_shared_info at the end of the first buffer to link subsequent
> > frames [0], and we rely on the skb_frag* utilities to set/read the
> > offset and length of subsequent buffers.
> 
> Yes, for now the same layout as "skb_shared_info" is being reused, but I
> think we should think of this as "xdp_shared_info" instead, as how it
> is used for XDP is going to diverge from SKBs.  We already discussed (in a
> conf call) that we could store the total len of the "frags" here, to
> simplify the other helper.

I like this approach; at the end of the first fragment we can have something like:

struct xdp_shared_info {
	skb_frag_t frags[16];
	int n_frags;
	int frag_len;
};

or do you prefer not to use the skb_frag_t struct?

> 
> Using the skb_frag_* helper functions is misleading, and will make it
> more difficult to diverge from how SKBs handle frags.  What about
> introducing xdp_frag_* wrappers? (what do others think?)

I am fine with having some dedicated helpers.
Anyway, we need to construct the xdp_buff {} when receiving the DMA descriptors.

Regards,
Lorenzo

> 
> 
> > 
> > [0] http://people.redhat.com/lbiancon/conference/NetDevConf2020-0x14/add-xdp-on-driver.html
> >     - XDP multi-buffers section (slide 40)
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame
  2020-09-04  7:19     ` Jesper Dangaard Brouer
@ 2020-09-04 15:15       ` David Ahern
  2020-09-04 15:59         ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 42+ messages in thread
From: David Ahern @ 2020-09-04 15:15 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Alexei Starovoitov
  Cc: Lorenzo Bianconi, netdev, bpf, davem, lorenzo.bianconi, echaudro,
	sameehj, kuba, john.fastabend, daniel, ast, shayagr, David Ahern

On 9/4/20 1:19 AM, Jesper Dangaard Brouer wrote:
> On Thu, 3 Sep 2020 18:07:05 -0700
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
>> On Thu, Sep 03, 2020 at 10:58:45PM +0200, Lorenzo Bianconi wrote:
>>> Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer to specify
>>> if shared_info area has been properly initialized for non-linear
>>> xdp buffers
>>>
>>> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
>>> ---
>>>  include/net/xdp.h | 8 ++++++--
>>>  net/core/xdp.c    | 1 +
>>>  2 files changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>> index 3814fb631d52..42f439f9fcda 100644
>>> --- a/include/net/xdp.h
>>> +++ b/include/net/xdp.h
>>> @@ -72,7 +72,8 @@ struct xdp_buff {
>>>  	void *data_hard_start;
>>>  	struct xdp_rxq_info *rxq;
>>>  	struct xdp_txq_info *txq;
>>> -	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
>>> +	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved tailroom*/
>>> +	u32 mb:1; /* xdp non-linear buffer */
>>>  };
>>>  
>>>  /* Reserve memory area at end-of data area.
>>> @@ -96,7 +97,8 @@ struct xdp_frame {
>>>  	u16 len;
>>>  	u16 headroom;
>>>  	u32 metasize:8;
>>> -	u32 frame_sz:24;
>>> +	u32 frame_sz:23;
>>> +	u32 mb:1; /* xdp non-linear frame */  
>>
>> Hmm. Last time I checked compilers were generating ugly code with bitfields.
>> Not performant and not efficient.
>> frame_sz is used in the fast path.
>> I suspect the first hunk alone will cause performance degradation.
>> Could you use normal u8 or u32 flag field?
> 
> For struct xdp_buff sure we can do this.  For struct xdp_frame, I'm not
> sure, as it is a state-compressed version of xdp_buff + extra
> information.  The xdp_frame has been called skb-light, and I know
> people (e.g. Ahern) want to add more info to this (vlan, RX-hash, csum),
> and we must keep this to 1 cache line, for performance reasons.
> 
> You do make a good point that these bit-fields might hurt performance
> more.  I guess we need to test this, as I constantly worry that we
> will slowly kill XDP performance with a thousand paper-cuts.
> 

That struct is tight on space, and we have to be very smart about
additions. dev_rx, for example, seems like it could just be the netdev
index rather than a pointer, or perhaps it can be removed completely. I
believe it is only used for 1 use case (redirects to CPUMAP); maybe that
code can be refactored to handle the dev outside of xdp_frame.

xdp_mem_info is 2 u32's; the type in that struct really could be a u8.
In this case it means removing the struct in favor of 2 elements to reclaim
the space, but as we reach the 64B limit this is a place to change:
e.g., make it a single u32 with the id only 24 bits; the
rhashtable key can stay u32 but now with the combined type + id.

As for frame_sz, why does it need to be larger than a u16?

If it really needs to be larger than u16, there are several examples of
using a bit (or bits) in the data path. dst metrics for examples uses
lowest 4 bits of the dst pointer as a bitfield. It does so using a mask
with accessors vs a bitfield. Perhaps that is the way to go here.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-04  9:45     ` Lorenzo Bianconi
@ 2020-09-04 15:23       ` John Fastabend
  2020-09-06 13:36         ` Lorenzo Bianconi
  0 siblings, 1 reply; 42+ messages in thread
From: John Fastabend @ 2020-09-04 15:23 UTC (permalink / raw)
  To: Lorenzo Bianconi, John Fastabend
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, daniel, ast, shayagr, edumazet

Lorenzo Bianconi wrote:
> > Lorenzo Bianconi wrote:
> > > Introduce bpf_xdp_adjust_mb_header helper in order to adjust frame
> > > headers moving *offset* bytes from/to the second buffer to/from the
> > > first one.
> > > This helper can be used to move headers when the hw DMA SG is not able
> > > to copy all the headers in the first fragment and split header and data
> > > pages.
> > > 
> > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > ---
> > >  include/uapi/linux/bpf.h       | 25 ++++++++++++----
> > >  net/core/filter.c              | 54 ++++++++++++++++++++++++++++++++++
> > >  tools/include/uapi/linux/bpf.h | 26 ++++++++++++----
> > >  3 files changed, 95 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 8dda13880957..c4a6d245619c 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -3571,11 +3571,25 @@ union bpf_attr {
> > >   *		value.
> > >   *
> > >   * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr)
> > > - * 	Description
> > > - * 		Read *size* bytes from user space address *user_ptr* and store
> > > - * 		the data in *dst*. This is a wrapper of copy_from_user().
> > > - * 	Return
> > > - * 		0 on success, or a negative error in case of failure.
> > > + *	Description
> > > + *		Read *size* bytes from user space address *user_ptr* and store
> > > + *		the data in *dst*. This is a wrapper of copy_from_user().
> > > + *
> > > + * long bpf_xdp_adjust_mb_header(struct xdp_buff *xdp_md, int offset)
> > > + *	Description
> > > + *		Adjust frame headers moving *offset* bytes from/to the second
> > > + *		buffer to/from the first one. This helper can be used to move
> > > + *		headers when the hw DMA SG does not copy all the headers in
> > > + *		the first fragment.
> 
> + Eric to the discussion
> 
> > 
> > This is confusing to read. Does this mean I can "move bytes to the second
> > buffer from the first one" or "move bytes from the second buffer to the
> > first one"? And what are frame headers? I'm sure I can read the code below
> > and work it out, but users reading the header file should be able to parse this.
> 
> Our main goal with this helper is to fix the use-case where we request the hw
> to put L2/L3/L4 headers (and all the TCP options) in the first fragment and TCP
> data starting from the second fragment (header split), but for some reason
> the hw puts the TCP options in the second fragment (if we understood correctly,
> this issue was raised by Eric @ NetDevConf 0x14).
> bpf_xdp_adjust_mb_header() can fix this use-case, moving bytes from the second
> fragment to the first one (offset > 0) or from the first buffer to the second
> one (offset < 0).

Ah OK, so the description needs the information about how to use offset; then it
would have been clear, I think. Something like that last line, "moving bytes from
the second fragment ...."

So this is to fix up header-split for RX zerocopy? Add that to the commit
message then.

> 
> > 
> > Also we want to be able to read all data not just headers. Reading the
> > payload of a TCP packet is equally important for many l7 load balancers.
> > 
> 
> In order to avoid slowing down performance, we require that the eBPF sandbox can
> read/write only buff0 in an xdp multi-buffer. The xdp program can only
> perform some restricted changes to buff<n> (n >= 1) (e.g. what we did in
> bpf_xdp_adjust_mb_header()).
> You can find the xdp multi-buff design principles here [0][1]
> 
> [0] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> [1] http://people.redhat.com/lbiancon/conference/NetDevConf2020-0x14/add-xdp-on-driver.html - XDP multi-buffers section (slide 40)
> 
> > > + *
> > > + *		A call to this helper is susceptible to change the underlying
> > > + *		packet buffer. Therefore, at load time, all checks on pointers
> > > + *		previously done by the verifier are invalidated and must be
> > > + *		performed again, if the helper is used in combination with
> > > + *		direct packet access.
> > > + *
> > > + *	Return
> > > + *		0 on success, or a negative error in case of failure.
> > >   */
> > >  #define __BPF_FUNC_MAPPER(FN)		\
> > >  	FN(unspec),			\
> > > @@ -3727,6 +3741,7 @@ union bpf_attr {
> > >  	FN(inode_storage_delete),	\
> > >  	FN(d_path),			\
> > >  	FN(copy_from_user),		\
> > > +	FN(xdp_adjust_mb_header),	\
> > >  	/* */
> > >  
> > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 47eef9a0be6a..ae6b10cf062d 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -3475,6 +3475,57 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
> > >  	.arg2_type	= ARG_ANYTHING,
> > >  };
> > >  
> > > +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> > > +	   int, offset)
> > > +{
> > > +	void *data_hard_end, *data_end;
> > > +	struct skb_shared_info *sinfo;
> > > +	int frag_offset, frag_len;
> > > +	u8 *addr;
> > > +
> > > +	if (!xdp->mb)
> > > +		return -EOPNOTSUPP;

Not necessarily required for this patch, but I think it would be a better user
experience if, instead of EOPNOTSUPP here, we did the header split. This
would allocate a frag and copy the bytes around as needed. Yes, it might
be slow if you don't have a frag free in the driver, but if the user wants to
do header split and their hardware can't do it, we would have a way out.

I guess it could be an improvement for later though.


> > > +
> > > +	sinfo = xdp_get_shared_info_from_buff(xdp);
> > > +
> > > +	frag_len = skb_frag_size(&sinfo->frags[0]);
> > > +	if (offset > frag_len)
> > > +		return -EINVAL;
> > 
> > What if we want data in frags[1] and so on.
> > 
> > > +
> > > +	frag_offset = skb_frag_off(&sinfo->frags[0]);
> > > +	data_end = xdp->data_end + offset;
> > > +
> > > +	if (offset < 0 && (-offset > frag_offset ||
> > > +			   data_end < xdp->data + ETH_HLEN))
> > > +		return -EINVAL;
> > > +
> > > +	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> > > +	if (data_end > data_hard_end)
> > > +		return -EINVAL;
> > > +
> > > +	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> > > +	if (offset > 0) {
> > > +		memcpy(xdp->data_end, addr, offset);
> > 
> > But this could get expensive for a large amount of data? And also
> > limiting because we require the room to do the copy. Presumably
> > the reason we have frags[1] is because the normal page or half
> > page is in use?
> > 
> > > +	} else {
> > > +		memcpy(addr + offset, xdp->data_end + offset, -offset);
> > > +		memset(xdp->data_end + offset, 0, -offset);
> > > +	}
> > > +
> > > +	skb_frag_size_sub(&sinfo->frags[0], offset);
> > > +	skb_frag_off_add(&sinfo->frags[0], offset);
> > > +	xdp->data_end = data_end;
> > > +
> > > +	return 0;
> > > +}
> > 
> > So overall I don't see the point of copying bytes from one frag to
> > another. Create an API that adjusts the data pointers and then
> > copies are avoided and manipulating frags is not needed.
> 
> please see above.

OK, it makes more sense with the context. It doesn't have much, if anything,
to do with making data visible to the BPF program. This is about
changing the layout of the frags list.

How/when does the header split go wrong on the mvneta device? I guess
this is to fix a real bug/issue, not some theoretical one? An example
in the commit message would make this concrete. Something like,
"When using RX zerocopy to mmap data into a userspace application, if
a packet with [all these wild headers] is received, rx zerocopy breaks
because header split puts headers X in the data frag, confusing apps".

> 
> > 
> > Also and even more concerning I think this API requires the
> > driver to populate shinfo. If we use TX_REDIRECT a lot or TX_XMIT
> > this means we need to populate shinfo when its probably not ever
> > used. If our driver is smart L2/L3 headers are in the readable
> > data and prefetched. Writing frags into the end of a page is likely
> > not free.
> 
> Sorry I did not get what you mean with "populate shinfo" here. We need to
> properly set shared_info in order to create the xdp multi-buff.
> Apart of header splits, please consider the main uses-cases for
> xdp multi-buff are XDP with TSO and Jumbo frames.

The use case I have in mind is an XDP_TX or XDP_REDIRECT load balancer.
I won't know this at the driver level, and now I'll have to write into
the back of every page with this shinfo just in case. If header
split is working I should never need to even touch the page outside
the first N bytes that were DMAd and in cache with DDIO. So it's extra
overhead for something that is unlikely to happen in the LB case.

If you take the simplest possible program that just returns XDP_TX
and run a packet generator against it, I believe (haven't run any
tests) that you will see overhead now just from populating this
shinfo. I think it needs to only be done when it's needed, e.g. when
the user makes this helper call or we need to build the skb and populate
the frags there.

I think a smart driver will just keep the frags list in whatever
form it has them (rx descriptors?) and push them over to the
tx descriptors without having to do extra work with frag lists.

> 
> > 
> > Did you benchmark this?
> 
> will do, I need to understand if we can use tiny buffers in mvneta.

Why tiny buffers? How does mvneta lay out the frags when doing
header split? Can we just benchmark what mvneta is doing at the
end of this patch series?

Also can you try the basic XDP_TX case mentioned above.
I don't want this to degrade existing use cases if at all
possible.

> 
> > 
> > In general users of this API should know the bytes they want
> > to fetch. Use an API like this,
> > 
> >   bpf_xdp_adjust_bytes(xdp, start, end)
> > 
> > Where xdp is the frame, start is the first byte the user wants
> > and end is the last byte. Then usually you can skip the entire
> > copy part and just move the xdp pointers around. The ugly case
> > is if the user puts start/end across a frag boundary and a copy
> > is needed. In that case maybe we use end as a hint and not a
> > hard requirement.
> > 
> > The use case I see is I read L2/L3/L4 headers and I need the
> > first N bytes of the payload. I know where the payload starts
> > and I know how many bytes I need so,
> > 
> >   bpf_xdp_adjust_bytes(xdp, payload_offset, bytes_needed);
> > 
> > Then hopefully that is all in one frag. If its not I'll need
> > to do a second helper call. Still nicer than forcing drivers
> > to populate this shinfo I think. If you think I'm wrong try
> > a benchmark. Benchmarks aside I get stuck when data_end and
> > data_hard_end are too close together.
> 
> IIUC what you mean here is to expose L2/L3/L4 headers + some data to
> the ebpf program to perform something like L7 load-balancing, right?

Correct, but with extra context I see in this patch you are trying
to build an XDP controlled header split. This seems like a different
use case from mine.

> Let's consider the Jumbo frames use-case (so the data are split in multiple
> buffers). I can see two issues here:
> - since in XDP we can't linearize the buffer if start and end are on
>   different pages (like we do in bpf_msg_pull_data()), are we ending up
>   in the case where requested data are all in buff0? 

In this case I would expect the helper either to return how many bytes
were pulled in, maybe just (start, end_of_frag), or the user can find
it from the data_end pointer. Here end is just a hint.

> - if start and end are in buff<2>, we should report the fragment number to the
>   ebpf program to "fetch" the data. Are we exposing too low-level details to
>   user-space?

Why do you need the frag number? Just say I want bytes (X,Y); if that
happens to be in buff<2>, let the helper find it.

I think having a helper to read/write any bytes is important and
necessary, but the helper implemented in this patch is something
else. I get naming is hard; what if we called it xdp_header_split()?

> 
> Regards,
> Lorenzo
> 
> > 
> > Thanks,
> > John




* Re: [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame
  2020-09-04 15:15       ` David Ahern
@ 2020-09-04 15:59         ` Jesper Dangaard Brouer
  2020-09-04 16:30           ` David Ahern
  0 siblings, 1 reply; 42+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-04 15:59 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Lorenzo Bianconi, netdev, bpf, davem,
	lorenzo.bianconi, echaudro, sameehj, kuba, john.fastabend,
	daniel, ast, shayagr, David Ahern, brouer, Ilias Apalodimas

On Fri, 4 Sep 2020 09:15:04 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 9/4/20 1:19 AM, Jesper Dangaard Brouer wrote:
> > On Thu, 3 Sep 2020 18:07:05 -0700
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >   
> >> On Thu, Sep 03, 2020 at 10:58:45PM +0200, Lorenzo Bianconi wrote:  
> >>> Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer to specify
> >>> if shared_info area has been properly initialized for non-linear
> >>> xdp buffers
> >>>
> >>> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> >>> ---
> >>>  include/net/xdp.h | 8 ++++++--
> >>>  net/core/xdp.c    | 1 +
> >>>  2 files changed, 7 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/include/net/xdp.h b/include/net/xdp.h
> >>> index 3814fb631d52..42f439f9fcda 100644
> >>> --- a/include/net/xdp.h
> >>> +++ b/include/net/xdp.h
> >>> @@ -72,7 +72,8 @@ struct xdp_buff {
> >>>  	void *data_hard_start;
> >>>  	struct xdp_rxq_info *rxq;
> >>>  	struct xdp_txq_info *txq;
> >>> -	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
> >>> +	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved tailroom*/
> >>> +	u32 mb:1; /* xdp non-linear buffer */
> >>>  };
> >>>  
> >>>  /* Reserve memory area at end-of data area.
> >>> @@ -96,7 +97,8 @@ struct xdp_frame {
> >>>  	u16 len;
> >>>  	u16 headroom;
> >>>  	u32 metasize:8;
> >>> -	u32 frame_sz:24;
> >>> +	u32 frame_sz:23;
> >>> +	u32 mb:1; /* xdp non-linear frame */    
> >>
> >> Hmm. Last time I checked compilers were generating ugly code with bitfields.
> >> Not performant and not efficient.
> >> frame_sz is used in the fast path.
> >> I suspect the first hunk alone will cause performance degradation.
> >> Could you use normal u8 or u32 flag field?  
> > 
> > For struct xdp_buff sure we can do this.  For struct xdp_frame, I'm not
> > sure, as it is a state-compressed version of xdp_buff + extra
> > information.  The xdp_frame has been called skb-light, and I know
> > people (e.g. Ahern) want to add more info to this: vlan, RX-hash, csum,
> > and we must keep this to 1 cache line, for performance reasons.
> > 
> > You do make a good point, that these bit-fields might hurt performance
> > more.  I guess, we need to test this.  As I constantly worry that we
> > will slowly kill XDP performance with a 1000 paper-cuts.
> >   
> 
> That struct is tight on space, and we have to be very smart about
> additions. 

I fully agree.

> dev_rx for example seems like it could just be the netdev
> index rather than a pointer or perhaps can be removed completely. I
> believe it is only used for 1 use case (redirects to CPUMAP); maybe that
> code can be refactored to handle the dev outside of xdp_frame.

The dev_rx is needed when creating an SKB from an xdp_frame (basically
skb->dev = rx_dev). Yes, that is done in cpumap, but I want to
generalize this.  The veth driver also creates SKBs from xdp_frames, but
uses itself as skb->dev.

And yes, we could save some space storing the index instead, and trade
space for cycles in a lookup.

 
> xdp_mem_info is 2 u32's; the type in that struct really could be a u8.

Yes, I have floated a patch that did this earlier, but it was never
merged, as it was part of storing the xdp_mem_info in the SKB to create
a return path for page_pool pages.

> In this case it means removing struct in favor of 2 elements to reclaim
> the space, but as we reach the 64B limit this is a place to change.
> e.g., make it a single u32 with the id only 24 bits though the
> rhashtable key can stay u32 but now with the combined type + id.
> 
> As for frame_sz, why does it need to be larger than a u16?

Because PAGE_SIZE can be 64KiB on some archs.

> If it really needs to be larger than a u16, there are several examples of
> using a bit (or bits) in the data path. dst metrics, for example, uses the
> lowest 4 bits of the dst pointer as a bitfield. It does so using a mask
> with accessors vs a bitfield. Perhaps that is the way to go here.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

perl -e 'my $a=65536; printf("%d b%b 0x%X\n", $a, $a, $a)'
65536 b10000000000000000 0x10000



* Re: [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame
  2020-09-04 15:59         ` Jesper Dangaard Brouer
@ 2020-09-04 16:30           ` David Ahern
  2020-09-07 18:02             ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 42+ messages in thread
From: David Ahern @ 2020-09-04 16:30 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Lorenzo Bianconi, netdev, bpf, davem,
	lorenzo.bianconi, echaudro, sameehj, kuba, john.fastabend,
	daniel, ast, shayagr, David Ahern, Ilias Apalodimas

On 9/4/20 9:59 AM, Jesper Dangaard Brouer wrote:
>> dev_rx for example seems like it could just be the netdev
>> index rather than a pointer or perhaps can be removed completely. I
>> believe it is only used for 1 use case (redirects to CPUMAP); maybe that
>> code can be refactored to handle the dev outside of xdp_frame.
> 
> The dev_rx is needed when creating an SKB from an xdp_frame (basically
> skb->dev = rx_dev). Yes, that is done in cpumap, but I want to
> generalize this.  The veth driver also creates SKBs from xdp_frames, but
> uses itself as skb->dev.
> 
> And yes, we could save some space storing the index instead, and trade
> space for cycles in a lookup.

I think this can be managed without adding a reference to the xdp_frame.
I'll start a separate thread on that.

>>
>> As for frame_sz, why does it need to be larger than a u16?
> 
> Because PAGE_SIZE can be 64KiB on some archs.
> 

ok, is there any alignment requirement? Can frame_sz be a number of 32-bit
words? I believe bit shifts are cheap.


* Re: [PATCH v2 net-next 5/9] net: mvneta: add multi buffer support to XDP_TX
  2020-09-03 20:58 ` [PATCH v2 net-next 5/9] net: mvneta: add multi buffer support to XDP_TX Lorenzo Bianconi
@ 2020-09-06  7:20   ` Shay Agroskin
  2020-09-06  8:43     ` Lorenzo Bianconi
  0 siblings, 1 reply; 42+ messages in thread
From: Shay Agroskin @ 2020-09-06  7:20 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast


Lorenzo Bianconi <lorenzo@kernel.org> writes:

> Introduce the capability to map non-linear xdp buffer running
> mvneta_xdp_submit_frame() for XDP_TX and XDP_REDIRECT
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 79 +++++++++++++++++----------
>  1 file changed, 49 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 4f745a2b702a..65fbed957e4f 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -1854,8 +1854,8 @@ static void mvneta_txq_bufs_free(struct mvneta_port *pp,
>  			bytes_compl += buf->skb->len;
>  			pkts_compl++;
>  			dev_kfree_skb_any(buf->skb);
> -		} else if (buf->type == MVNETA_TYPE_XDP_TX ||
> -			   buf->type == MVNETA_TYPE_XDP_NDO) {
> +		} else if ((buf->type == MVNETA_TYPE_XDP_TX ||
> +			    buf->type == MVNETA_TYPE_XDP_NDO) && buf->xdpf) {
>  			xdp_return_frame(buf->xdpf);
>  		}
>  	}
> @@ -2040,43 +2040,62 @@ static int
>  mvneta_xdp_submit_frame(struct mvneta_port *pp, struct mvneta_tx_queue *txq,
>  			struct xdp_frame *xdpf, bool dma_map)
>  {
> -	struct mvneta_tx_desc *tx_desc;
> -	struct mvneta_tx_buf *buf;
> -	dma_addr_t dma_addr;
> +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_frame(xdpf);
> +	int i, num_frames = xdpf->mb ? sinfo->nr_frags + 1 : 1;
> +	struct mvneta_tx_desc *tx_desc = NULL;
> +	struct page *page;
>  
> -	if (txq->count >= txq->tx_stop_threshold)
> +	if (txq->count + num_frames >= txq->tx_stop_threshold)
>  		return MVNETA_XDP_DROPPED;
>  
> -	tx_desc = mvneta_txq_next_desc_get(txq);
> +	for (i = 0; i < num_frames; i++) {
> +		struct mvneta_tx_buf *buf = &txq->buf[txq->txq_put_index];
> +		skb_frag_t *frag = i ? &sinfo->frags[i - 1] : NULL;
> +		int len = frag ? skb_frag_size(frag) : xdpf->len;
> +		dma_addr_t dma_addr;
>  
> -	buf = &txq->buf[txq->txq_put_index];
> -	if (dma_map) {
> -		/* ndo_xdp_xmit */
> -		dma_addr = dma_map_single(pp->dev->dev.parent, xdpf->data,
> -					  xdpf->len, DMA_TO_DEVICE);
> -		if (dma_mapping_error(pp->dev->dev.parent, dma_addr)) {
> -			mvneta_txq_desc_put(txq);
> -			return MVNETA_XDP_DROPPED;
> +		tx_desc = mvneta_txq_next_desc_get(txq);
> +		if (dma_map) {
> +			/* ndo_xdp_xmit */
> +			void *data;
> +
> +			data = frag ? page_address(skb_frag_page(frag))
> +				    : xdpf->data;
> +			dma_addr = dma_map_single(pp->dev->dev.parent, data,
> +						  len, DMA_TO_DEVICE);
> +			if (dma_mapping_error(pp->dev->dev.parent, dma_addr)) {
> +				for (; i >= 0; i--)
> +					mvneta_txq_desc_put(txq);
> +				return MVNETA_XDP_DROPPED;
> +			}
> +			buf->type = MVNETA_TYPE_XDP_NDO;
> +		} else {
> +			page = frag ? skb_frag_page(frag)
> +				    : virt_to_page(xdpf->data);
> +			dma_addr = page_pool_get_dma_addr(page);
> +			if (!frag)
> +				dma_addr += sizeof(*xdpf) + xdpf->headroom;
> +			dma_sync_single_for_device(pp->dev->dev.parent,
> +						   dma_addr, len,
> +						   DMA_BIDIRECTIONAL);
> +			buf->type = MVNETA_TYPE_XDP_TX;
>  		}
> -		buf->type = MVNETA_TYPE_XDP_NDO;
> -	} else {
> -		struct page *page = virt_to_page(xdpf->data);
> +		buf->xdpf = i ? NULL : xdpf;
>  
> -		dma_addr = page_pool_get_dma_addr(page) +
> -			   sizeof(*xdpf) + xdpf->headroom;
> -		dma_sync_single_for_device(pp->dev->dev.parent, dma_addr,
> -					   xdpf->len, DMA_BIDIRECTIONAL);
> -		buf->type = MVNETA_TYPE_XDP_TX;
> +		if (!i)
> +			tx_desc->command = MVNETA_TXD_F_DESC;
> +		tx_desc->buf_phys_addr = dma_addr;
> +		tx_desc->data_size = len;
> +
> +		mvneta_txq_inc_put(txq);
>  	}
> -	buf->xdpf = xdpf;
>  
> -	tx_desc->command = MVNETA_TXD_FLZ_DESC;
> -	tx_desc->buf_phys_addr = dma_addr;
> -	tx_desc->data_size = xdpf->len;
> +	/*last descriptor */
> +	if (tx_desc)
> +		tx_desc->command |= MVNETA_TXD_L_DESC | MVNETA_TXD_Z_PAD;

        When is this condition not taken? You initialize tx_desc
        to NULL, but it seems to me like you always set it inside
        the for loop to the output of mvneta_txq_next_desc_get(),
        which doesn't look like it returns NULL. The for loop runs
        1 iteration or `sinfo->nr_frags + 1` iterations (which is
        also equal to or larger than 1).

>  
> -	mvneta_txq_inc_put(txq);
> -	txq->pending++;
> -	txq->count++;
> +	txq->pending += num_frames;
> +	txq->count += num_frames;
>  
>  	return MVNETA_XDP_TX;
>  }



* Re: [PATCH v2 net-next 3/9] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  2020-09-03 20:58 ` [PATCH v2 net-next 3/9] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer Lorenzo Bianconi
@ 2020-09-06  7:33   ` Shay Agroskin
  2020-09-06  9:05     ` Lorenzo Bianconi
  0 siblings, 1 reply; 42+ messages in thread
From: Shay Agroskin @ 2020-09-06  7:33 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast


Lorenzo Bianconi <lorenzo@kernel.org> writes:

> Update multi-buffer bit (mb) in xdp_buff to notify XDP/eBPF 
> layer and
> XDP remote drivers if this is a "non-linear" XDP buffer. Access
> skb_shared_info only if xdp_buff mb is set
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 37 +++++++++++++++------------
>  1 file changed, 21 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 832bbb8b05c8..4f745a2b702a 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -2027,11 +2027,11 @@ mvneta_xdp_put_buff(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
>  		    struct xdp_buff *xdp, int sync_len, bool napi)
>  {
>  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
> -	int i;
> +	int i, num_frames = xdp->mb ? sinfo->nr_frags : 0;
>  
>  	page_pool_put_page(rxq->page_pool, virt_to_head_page(xdp->data),
>  			   sync_len, napi);
> -	for (i = 0; i < sinfo->nr_frags; i++)
> +	for (i = 0; i < num_frames; i++)
>  		page_pool_put_full_page(rxq->page_pool,
>  					skb_frag_page(&sinfo->frags[i]), napi);
>  }
> ...
> @@ -2175,6 +2175,7 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
>  
>  	len = xdp->data_end - xdp->data_hard_start - pp->rx_offset_correction;
>  	data_len = xdp->data_end - xdp->data;
> -	sinfo = xdp_get_shared_info_from_buff(xdp);
> -	sinfo->nr_frags = 0;
> +	xdp->mb = 0;
>  
>  	*size = rx_desc->data_size - len;
>  	rx_desc->buf_phys_addr = 0;
> @@ -2269,7 +2267,7 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,
>  			    struct mvneta_rx_desc *rx_desc,
>  			    struct mvneta_rx_queue *rxq,
>  			    struct xdp_buff *xdp, int *size,
> -			    struct page *page)
> +			    int *nfrags, struct page *page)
>  {
>  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
>  	struct net_device *dev = pp->dev;
> @@ -2288,13 +2286,18 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,
>  				rx_desc->buf_phys_addr,
>  				len, dma_dir);
>  
> -	if (data_len > 0 && sinfo->nr_frags < MAX_SKB_FRAGS) {
> -		skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags];
> +	if (data_len > 0 && *nfrags < MAX_SKB_FRAGS) {
> +		skb_frag_t *frag = &sinfo->frags[*nfrags];
>  
>  		skb_frag_off_set(frag, pp->rx_offset_correction);
>  		skb_frag_size_set(frag, data_len);
>  		__skb_frag_set_page(frag, page);
> -		sinfo->nr_frags++;
> +		*nfrags = *nfrags + 1;

nit, why do you use the nfrags variable instead of sinfo->nr_frags (as
it was before this patch)? You don't seem to use the nfrags variable in
the caller function, and you update nr_frags as well. If it's used in a
different patch (haven't reviewed it all yet), maybe move it to that
patch? (sorry in advance if I missed the logic here)

> +
> +		if (rx_desc->status & MVNETA_RXD_LAST_DESC) {
> +			sinfo->nr_frags = *nfrags;
> +			xdp->mb = true;
> +		}
>  
>  		rx_desc->buf_phys_addr = 0;
>               ...


* Re: [PATCH v2 net-next 5/9] net: mvneta: add multi buffer support to XDP_TX
  2020-09-06  7:20   ` Shay Agroskin
@ 2020-09-06  8:43     ` Lorenzo Bianconi
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-06  8:43 UTC (permalink / raw)
  To: Shay Agroskin
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast


> 
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
> 

[...]

> > -	buf->xdpf = xdpf;
> > -	tx_desc->command = MVNETA_TXD_FLZ_DESC;
> > -	tx_desc->buf_phys_addr = dma_addr;
> > -	tx_desc->data_size = xdpf->len;
> > +	/*last descriptor */
> > +	if (tx_desc)
> > +		tx_desc->command |= MVNETA_TXD_L_DESC | MVNETA_TXD_Z_PAD;
> 
>        When is this condition not taken? You initialize tx_desc to
> NULL, but it seems to me like you always set it inside the for loop
> to the output of mvneta_txq_next_desc_get(), which doesn't look like
> it returns NULL. The for loop runs 1 iteration or `sinfo->nr_frags + 1`
> iterations (which is also equal to or larger than 1).

ack, right. I will fix it in v3.

Regards,
Lorenzo

> 
> > -	mvneta_txq_inc_put(txq);
> > -	txq->pending++;
> > -	txq->count++;
> > +	txq->pending += num_frames;
> > +	txq->count += num_frames;
> >  	return MVNETA_XDP_TX;
> >  }
> 



* Re: [PATCH v2 net-next 3/9] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  2020-09-06  7:33   ` Shay Agroskin
@ 2020-09-06  9:05     ` Lorenzo Bianconi
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-06  9:05 UTC (permalink / raw)
  To: Shay Agroskin
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, john.fastabend, daniel, ast


[...]

> >  				rx_desc->buf_phys_addr,
> >  				len, dma_dir);
> > -	if (data_len > 0 && sinfo->nr_frags < MAX_SKB_FRAGS) {
> > -		skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags];
> > +	if (data_len > 0 && *nfrags < MAX_SKB_FRAGS) {
> > +		skb_frag_t *frag = &sinfo->frags[*nfrags];
> >  		skb_frag_off_set(frag, pp->rx_offset_correction);
> >  		skb_frag_size_set(frag, data_len);
> >  		__skb_frag_set_page(frag, page);
> > -		sinfo->nr_frags++;
> > +		*nfrags = *nfrags + 1;
> 
> nit, why do you use the nfrags variable instead of sinfo->nr_frags (as
> it was before this patch)? You don't seem to use the nfrags variable in
> the caller function, and you update nr_frags as well. If it's used in a
> different patch (haven't reviewed it all yet), maybe move it to that
> patch? (sorry in advance if I missed the logic here)

The nfrags pointer is used to avoid the sinfo->nr_frags initialization in
mvneta_swbm_rx_frame(), since skb_shared_info is usually not in the cache.
AFAIK the hw does not provide the info "this is the second descriptor", but
just "this is the first/last descriptor".

Regards,
Lorenzo

> 
> > +
> > +		if (rx_desc->status & MVNETA_RXD_LAST_DESC) {
> > +			sinfo->nr_frags = *nfrags;
> > +			xdp->mb = true;
> > +		}
> >  		rx_desc->buf_phys_addr = 0;
> >               ...



* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-04 15:23       ` John Fastabend
@ 2020-09-06 13:36         ` Lorenzo Bianconi
  2020-09-08 19:57           ` John Fastabend
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-06 13:36 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, daniel, ast, shayagr, edumazet


> Lorenzo Bianconi wrote:
> > > Lorenzo Bianconi wrote:

[...]

> > > > + *	Description
> > > > + *		Adjust frame headers moving *offset* bytes from/to the second
> > > > + *		buffer to/from the first one. This helper can be used to move
> > > > + *		headers when the hw DMA SG does not copy all the headers in
> > > > + *		the first fragment.
> > 
> > + Eric to the discussion
> > 
> > > 
> > > This is confusing to read. Does this mean I can "move bytes to the second
> > > buffer from the first one" or "move bytes from the second buffer to the
> > > first one"? And what are frame headers? I'm sure I can read the code below
> > > and work it out, but users reading the header file should be able to parse this.
> > 
> > Our main goal with this helper is to fix the use-case where we request the hw
> > to put L2/L3/L4 headers (and all the TCP options) in the first fragment and TCP
> > data starting from the second fragment (header split), but for some reason
> > the hw puts the TCP options in the second fragment (if we understood correctly,
> > this issue was raised by Eric @ NetDevConf 0x14).
> > bpf_xdp_adjust_mb_header() can fix this use-case, moving bytes from the second
> > fragment to the first one (offset > 0) or from the first buffer to the second
> > one (offset < 0).
> 
> Ah OK, so the description needs the information about how to use offset; then it
> would have been clear, I think. Something like that last line, "moving bytes from
> the second fragment ...."

ack, right. I will do it in v3.
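To make the sign convention concrete, here is a toy userspace model of the
helper's semantics (buffer names and sizes are invented for illustration;
this is not the kernel implementation):

```c
#include <assert.h>
#include <string.h>

/* Toy model of bpf_xdp_adjust_mb_header(): buf is buff0 (the linear
 * area, with tailroom) and frag models the second buffer.  A positive
 * offset pulls bytes from the frag to the end of buff0 (e.g. TCP
 * options the hw left behind); a negative offset pushes them back. */
struct toy_xdp {
	char buf[64];
	int  data_end;   /* bytes used in buf */
	char frag[64];
	int  frag_off;   /* start of valid data in frag */
	int  frag_len;   /* valid bytes in frag */
};

static int toy_adjust_mb_header(struct toy_xdp *x, int offset)
{
	if (offset > x->frag_len || x->data_end + offset > (int)sizeof(x->buf))
		return -1;
	if (offset < 0 && (-offset > x->frag_off || x->data_end + offset < 0))
		return -1;

	if (offset > 0) {
		/* pull bytes from the frag into the linear area */
		memcpy(x->buf + x->data_end, x->frag + x->frag_off, offset);
	} else {
		/* push bytes back and clear the vacated tail of buff0 */
		memcpy(x->frag + x->frag_off + offset,
		       x->buf + x->data_end + offset, -offset);
		memset(x->buf + x->data_end + offset, 0, -offset);
	}
	x->data_end += offset;
	x->frag_off += offset;
	x->frag_len -= offset;
	return 0;
}
```

With a frag holding "OPTSdata", toy_adjust_mb_header(&x, 4) moves the four
option bytes into buff0 and shrinks the frag accordingly, mirroring the
offset > 0 path of the patch.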

> 
> So this is to fixup header-spit for RX zerocopy? Add that to the commit
> message then.

Right. I will improve comments in v3.

> 
> > 
> > > 
> > > Also we want to be able to read all data not just headers. Reading the
> > > payload of a TCP packet is equally important for many l7 load balancers.
> > > 
> > 
> > In order to avoid slowing down performance, we require that the eBPF sandbox
> > can read/write only buff0 in an xdp multi-buffer. The xdp program can only
> > perform some restricted changes to buff<n> (n >= 1) (e.g. what we did in
> > bpf_xdp_adjust_mb_header()).
> > You can find the xdp multi-buff design principles here [0][1]
> > 
> > [0] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> > [1] http://people.redhat.com/lbiancon/conference/NetDevConf2020-0x14/add-xdp-on-driver.html - XDP multi-buffers section (slide 40)
> > 
> > > > + *
> > > > + *		A call to this helper is susceptible to change the underlying
> > > > + *		packet buffer. Therefore, at load time, all checks on pointers
> > > > + *		previously done by the verifier are invalidated and must be
> > > > + *		performed again, if the helper is used in combination with
> > > > + *		direct packet access.
> > > > + *
> > > > + *	Return
> > > > + *		0 on success, or a negative error in case of failure.
> > > >   */
> > > >  #define __BPF_FUNC_MAPPER(FN)		\
> > > >  	FN(unspec),			\
> > > > @@ -3727,6 +3741,7 @@ union bpf_attr {
> > > >  	FN(inode_storage_delete),	\
> > > >  	FN(d_path),			\
> > > >  	FN(copy_from_user),		\
> > > > +	FN(xdp_adjust_mb_header),	\
> > > >  	/* */
> > > >  
> > > >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > index 47eef9a0be6a..ae6b10cf062d 100644
> > > > --- a/net/core/filter.c
> > > > +++ b/net/core/filter.c
> > > > @@ -3475,6 +3475,57 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
> > > >  	.arg2_type	= ARG_ANYTHING,
> > > >  };
> > > >  
> > > > +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> > > > +	   int, offset)
> > > > +{
> > > > +	void *data_hard_end, *data_end;
> > > > +	struct skb_shared_info *sinfo;
> > > > +	int frag_offset, frag_len;
> > > > +	u8 *addr;
> > > > +
> > > > +	if (!xdp->mb)
> > > > +		return -EOPNOTSUPP;
> 
> Not required for this patch necessarily but I think it would be better user
> experience if instead of EOPNOTSUPP here we did the header split. This
> would allocate a frag and copy the bytes around as needed. Yes it might
> be slow if you don't have a frag free in the driver, but if user wants to
> do header split and their hardware can't do it we would have a way out.
> 
> I guess it could be an improvement for later though.

I have no strong opinion on this; I did it this way to respect the rule "we
do not allocate memory for XDP".

@Jesper, David: thoughts?

> 
> 
> > > > +
> > > > +	sinfo = xdp_get_shared_info_from_buff(xdp);
> > > > +
> > > > +	frag_len = skb_frag_size(&sinfo->frags[0]);
> > > > +	if (offset > frag_len)
> > > > +		return -EINVAL;
> > > 
> > > What if we want data in frags[1] and so on.
> > > 
> > > > +
> > > > +	frag_offset = skb_frag_off(&sinfo->frags[0]);
> > > > +	data_end = xdp->data_end + offset;
> > > > +
> > > > +	if (offset < 0 && (-offset > frag_offset ||
> > > > +			   data_end < xdp->data + ETH_HLEN))
> > > > +		return -EINVAL;
> > > > +
> > > > +	data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
> > > > +	if (data_end > data_hard_end)
> > > > +		return -EINVAL;
> > > > +
> > > > +	addr = page_address(skb_frag_page(&sinfo->frags[0])) + frag_offset;
> > > > +	if (offset > 0) {
> > > > +		memcpy(xdp->data_end, addr, offset);
> > > 
> > > But this could get expensive for a large amount of data? And also
> > > limiting, because we require the room to do the copy. Presumably
> > > the reason we have frags[1] is because the normal page or half
> > > page is in use?
> > > 
> > > > +	} else {
> > > > +		memcpy(addr + offset, xdp->data_end + offset, -offset);
> > > > +		memset(xdp->data_end + offset, 0, -offset);
> > > > +	}
> > > > +
> > > > +	skb_frag_size_sub(&sinfo->frags[0], offset);
> > > > +	skb_frag_off_add(&sinfo->frags[0], offset);
> > > > +	xdp->data_end = data_end;
> > > > +
> > > > +	return 0;
> > > > +}
> > > 
> > > So overall I don't see the point of copying bytes from one frag to
> > > another. Create an API that adjusts the data pointers and then
> > > copies are avoided and manipulating frags is not needed.
> > 
> > please see above.
> 
> OK, it makes more sense with the context. It doesn't have much, if anything,
> to do with making data visible to the BPF program. This is about
> changing the layout of the frags list.

correct.

> 
> How/when does the header split go wrong on the mvneta device? I guess
> this is to fix a real bug/issue, not some theoretical one? An example
> in the commit message would make this concrete. Something like:
> "When using RX zerocopy to mmap data into a userspace application, if
> a packet with [all these wild headers] is received, rx zerocopy breaks
> because header split puts headers X in the data frag, confusing apps".

This issue does not occur with mvneta since the driver is not capable of
performing header split AFAIK. The helper has been introduced to cover the
"issue" reported by Eric in his NetDevConf presentation. In order to test the
helper I modified the mvneta rx napi loop in a controlled way (this patch can't
be sent upstream, it is for testing only :)).
I will improve the commit message in v3.

> 
> > 
> > > 
> > > Also and even more concerning I think this API requires the
> > > driver to populate shinfo. If we use TX_REDIRECT a lot or TX_XMIT
> > > this means we need to populate shinfo when its probably not ever
> > > used. If our driver is smart L2/L3 headers are in the readable
> > > data and prefetched. Writing frags into the end of a page is likely
> > > not free.
> > 
> > Sorry, I did not get what you mean with "populate shinfo" here. We need to
> > properly set shared_info in order to create the xdp multi-buff.
> > Apart from header split, please consider that the main use-cases for
> > xdp multi-buff are XDP with TSO and Jumbo frames.
> 
> The use case I have in mind is an XDP_TX or XDP_REDIRECT load balancer.
> I won't know this at the driver level, and now I'll have to write into
> the back of every page with this shinfo just in case. If header
> split is working I should never need to even touch the page outside
> the first N bytes that were DMAd and in cache with DDIO. So it's extra
> overhead for something that is unlikely to happen in the LB case.

So far the skb_shared_info is constructed in mvneta only if the hw splits
the received data over multiple buffers (so if the MTU is greater than 1 PAGE,
please see comments below). Moreover, the shared_info is present only in the
first buffer.

> 
> If you take the simplest possible program that just returns XDP_TX
> and run a pkt generator against it. I believe (haven't run any
> tests) that you will see overhead now just from populating this
> shinfo. I think it needs to only be done when its needed e.g. when
> user makes this helper call or we need to build the skb and populate
> the frags there.

sure, I will carry out some tests.

> 
> I think a smart driver will just keep the frags list in whatever
> form it has them (rx descriptors?) and push them over to the
> tx descriptors without having to do extra work with frag lists.

I think there are many use-cases where we want to have this info available in
xdp_buff/xdp_frame. E.g. let's consider the following Jumbo frame example:
- MTU > 1 PAGE (so the driver will split the received data over multiple rx
  descriptors)
- the driver performs a XDP_REDIRECT to a veth or cpumap

Relying on the proposed architecture, we could enable GRO in veth or cpumap, I
guess, since we can build a non-linear skb from the xdp multi-buff, right?

> 
> > 
> > > 
> > > Did you benchmark this?
> > 
> > will do, I need to understand if we can use tiny buffers in mvneta.
> 
> Why tiny buffers? How does mvneta layout the frags when doing
> header split? Can we just benchmark what mvneta is doing at the
> end of this patch series?

For the moment mvneta can split the received data when the previous buffer is
full (e.g. when the first page is completely written). I want to explore if
I can set a tiny buffer (e.g. 128B) as the max receive buffer to run some
performance tests and get some "comparable" results with respect to the ones I
got when I added XDP support to mvneta.

> 
> Also can you try the basic XDP_TX case mentioned above.
> I don't want this to degrade existing use cases if at all
> possible.

sure, will do.

> 
> > 
> > > 
> > > In general users of this API should know the bytes they want
> > > to fetch. Use an API like this,
> > > 
> > >   bpf_xdp_adjust_bytes(xdp, start, end)
> > > 
> > > Where xdp is the frame, start is the first byte the user wants
> > > and end is the last byte. Then usually you can skip the entire
> > > copy part and just move the xdp pointers around. The ugly case
> > > is if the user puts start/end across a frag boundary and a copy
> > > is needed. In that case maybe we use end as a hint and not a
> > > hard requirement.
> > > 
> > > The use case I see is I read L2/L3/L4 headers and I need the
> > > first N bytes of the payload. I know where the payload starts
> > > and I know how many bytes I need so,
> > > 
> > >   bpf_xdp_adjust_bytes(xdp, payload_offset, bytes_needed);
> > > 
> > > Then hopefully that is all in one frag. If its not I'll need
> > > to do a second helper call. Still nicer than forcing drivers
> > > to populate this shinfo I think. If you think I'm wrong try
> > > a benchmark. Benchmarks aside I get stuck when data_end and
> > > data_hard_end are too close together.
> > 
> > IIUC what you mean here is to expose L2/L3/L4 headers + some data to
> > the eBPF program to perform e.g. L7 load-balancing, right?
> 
> Correct, but with extra context I see in this patch you are trying
> to build an XDP controlled header split. This seems like a different
> use case from mine.

I agree.
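A userspace sketch of the frag lookup such a bpf_xdp_adjust_bytes(xdp,
start, end) would need (names invented; only the lookup is modeled, no
copying, and end is treated as a hint clamped to the frag boundary):

```c
#include <assert.h>
#include <stddef.h>

/* Given the frag sizes, find which frag holds byte `start` and how
 * many contiguous bytes (at most end - start) are available there. */
struct frag_span {
	int frag;     /* frag index, -1 if start is out of range */
	size_t off;   /* offset of start within that frag */
	size_t len;   /* contiguous bytes, clamped to the frag boundary */
};

static struct frag_span find_span(const size_t *frag_len, int nr_frags,
				  size_t start, size_t end)
{
	struct frag_span s = { .frag = -1, .off = 0, .len = 0 };
	size_t base = 0;
	int i;

	for (i = 0; i < nr_frags; i++) {
		if (start < base + frag_len[i]) {
			s.frag = i;
			s.off = start - base;
			s.len = frag_len[i] - s.off;
			if (end > start && end - start < s.len)
				s.len = end - start;   /* end is just a hint */
			break;
		}
		base += frag_len[i];
	}
	return s;
}
```

If the requested range crosses a frag boundary, the caller would see
len < end - start and issue a second call, which matches the "second
helper call" described above.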

> 
> > Let's consider the Jumbo frames use-case (so the data are split in multiple
> > buffers). I can see two issues here:
> > - since in XDP we can't linearize the buffer if start and end are on
> >   different pages (like we do in bpf_msg_pull_data()), are we ending up
> >   in the case where the requested data are all in buff0?
> 
> In this case I would expect either the helper returns how many bytes
> were pulled in, maybe just (start, end_of_frag) or user can find
> it from data_end pointer. Here end is just a hint.
> 
> > - if start and end are in buff<2>, we should report the fragment number to the
> >   ebpf program to "fetch" the data. Are we exposing too low-level details to
> >   user-space?
> 
> Why do you need the frag number? Just say I want bytes (X,Y) if that
> happens to be on buff<2> let the helper find it.
> 
> I think having a helper to read/write any bytes is important and
> necessary, but the helper implemented in this patch is something
> else. I get that naming is hard; what if we called it xdp_header_split()?

ack, sure. I will fix it in v3.

Regards,
Lorenzo

> 
> > 
> > Regards,
> > Lorenzo
> > 
> > > 
> > > Thanks,
> > > John
> 
> 



* Re: [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame
  2020-09-04 16:30           ` David Ahern
@ 2020-09-07 18:02             ` Jesper Dangaard Brouer
  2020-09-08  1:22               ` David Ahern
  0 siblings, 1 reply; 42+ messages in thread
From: Jesper Dangaard Brouer @ 2020-09-07 18:02 UTC (permalink / raw)
  To: David Ahern
  Cc: Alexei Starovoitov, Lorenzo Bianconi, netdev, bpf, davem,
	lorenzo.bianconi, echaudro, sameehj, kuba, john.fastabend,
	daniel, ast, shayagr, David Ahern, Ilias Apalodimas, brouer

On Fri, 4 Sep 2020 10:30:48 -0600
David Ahern <dsahern@gmail.com> wrote:

> On 9/4/20 9:59 AM, Jesper Dangaard Brouer wrote:
> >> dev_rx for example seems like it could just be the netdev
> >> index rather than a pointer or perhaps can be removed completely. I
> >> believe it is only used for 1 use case (redirects to CPUMAP); maybe that
> >> code can be refactored to handle the dev outside of xdp_frame.  
> > 
> > The dev_rx is needed when creating an SKB from a xdp_frame (basically
> > skb->dev = rx_dev). Yes, that is done in cpumap, but I want to
> > generalize this.  The veth also creates SKBs from xdp_frame, but use
> > itself as skb->dev.
> > 
> > And yes, we could save some space storing the index instead, and trade
> > space for cycles in a lookup.  
> 
> I think this can be managed without adding a reference to the xdp_frame.
> I'll start a separate thread on that.
> 
> >>
> >> As for frame_sz, why does it need to be larger than a u16?  
> > 
> > Because PAGE_SIZE can be 64KiB on some archs.
> >   

I also believe syzbot managed to create packets for generic-XDP with
frame_sz 128KiB, which was a bit weird (it's on my todo list to
investigate and fix).

> ok, is there any alignment requirement? can frame_sz be number of 32-bit
> words? I believe bit shifts are cheap.

No that is not possible, because some drivers and generic-XDP have a
fully dynamic frame_sz.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame
  2020-09-07 18:02             ` Jesper Dangaard Brouer
@ 2020-09-08  1:22               ` David Ahern
  0 siblings, 0 replies; 42+ messages in thread
From: David Ahern @ 2020-09-08  1:22 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, Lorenzo Bianconi, netdev, bpf, davem,
	lorenzo.bianconi, echaudro, sameehj, kuba, john.fastabend,
	daniel, ast, shayagr, David Ahern, Ilias Apalodimas

On 9/7/20 12:02 PM, Jesper Dangaard Brouer wrote:
> 
>> ok, is there any alignment requirement? can frame_sz be number of 32-bit
>> words? I believe bit shifts are cheap.
> 
> No that is not possible, because some drivers and generic-XDP have a
> fully dynamic frame_sz.
> 

frame_sz represents allocated memory, right? What is the real range that
needs to be supported for frame_sz? Surely there is some upper limit,
and I thought it was 64kB.

Allocated memory will not be an odd number of bytes, so it is fair to assume
at a minimum that it is a multiple of 2, correct? At a minimum we should be
able to shift frame_sz by 1, which now covers 64kB in a u16.
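The packing idea above can be sketched as follows (hypothetical helper
names; whether the precision loss on odd sizes is acceptable is exactly
the open question, since some drivers and generic XDP use a fully
dynamic frame_sz):

```c
#include <assert.h>
#include <stdint.h>

/* If frame_sz were guaranteed to be a multiple of 2, it could be
 * stored right-shifted by one bit in a u16, covering sizes up to
 * 131070 bytes (just under 128 KiB); an odd size would silently
 * lose its low bit. */
static inline uint16_t frame_sz_pack(uint32_t frame_sz)
{
	return (uint16_t)(frame_sz >> 1);
}

static inline uint32_t frame_sz_unpack(uint16_t packed)
{
	return (uint32_t)packed << 1;
}
```

Note that a 128 KiB (131072-byte) frame, as mentioned for generic-XDP
earlier in the thread, would already overflow this encoding.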


* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-06 13:36         ` Lorenzo Bianconi
@ 2020-09-08 19:57           ` John Fastabend
  2020-09-08 21:31             ` Lorenzo Bianconi
  0 siblings, 1 reply; 42+ messages in thread
From: John Fastabend @ 2020-09-08 19:57 UTC (permalink / raw)
  To: Lorenzo Bianconi, John Fastabend
  Cc: netdev, bpf, davem, lorenzo.bianconi, brouer, echaudro, sameehj,
	kuba, daniel, ast, shayagr, edumazet

Lorenzo Bianconi wrote:
> > Lorenzo Bianconi wrote:
> > > > Lorenzo Bianconi wrote:
> 
> [...]
> 
> > > > > + *	Description
> > > > > + *		Adjust frame headers moving *offset* bytes from/to the second
> > > > > + *		buffer to/from the first one. This helper can be used to move
> > > > > + *		headers when the hw DMA SG does not copy all the headers in
> > > > > + *		the first fragment.
> > > 
> > > + Eric to the discussion
> > > 

[...]

> > > > > +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> > > > > +	   int, offset)
> > > > > +{
> > > > > +	void *data_hard_end, *data_end;
> > > > > +	struct skb_shared_info *sinfo;
> > > > > +	int frag_offset, frag_len;
> > > > > +	u8 *addr;
> > > > > +
> > > > > +	if (!xdp->mb)
> > > > > +		return -EOPNOTSUPP;
> > 
> > Not required for this patch necessarily but I think it would be better user
> > experience if instead of EOPNOTSUPP here we did the header split. This
> > would allocate a frag and copy the bytes around as needed. Yes it might
> > be slow if you don't have a frag free in the driver, but if user wants to
> > do header split and their hardware can't do it we would have a way out.
> > 
> > I guess it could be an improvement for later though.
> 
> I have no strong opinion on this; I did it this way to respect the rule "we
> do not allocate memory for XDP".
> 
> @Jesper, David: thoughts?

Consider adding a flags field to the helper so we could do this later with
a flag. Then users who want the alloc can set the flag and get it.

[...]

> 
> > 
> > How/when does the header split go wrong on the mvneta device? I guess
> > this is to fix a real bug/issue, not some theoretical one? An example
> > in the commit message would make this concrete. Something like:
> > "When using RX zerocopy to mmap data into a userspace application, if
> > a packet with [all these wild headers] is received, rx zerocopy breaks
> > because header split puts headers X in the data frag, confusing apps".
> 
> This issue does not occur with mvneta since the driver is not capable of
> performing header split AFAIK. The helper has been introduced to cover the
> "issue" reported by Eric in his NetDevConf presentation. In order to test the
> helper I modified the mvneta rx napi loop in a controlled way (this patch can't
> be sent upstream, it is for testing only :)).
> I will improve the commit message in v3.

Ah, OK, so there are really no users for the helper then. IMO just drop
the patch until we have a user.

> 
> > 
> > > 
> > > > 
> > > > Also and even more concerning I think this API requires the
> > > > driver to populate shinfo. If we use TX_REDIRECT a lot or TX_XMIT
> > > > this means we need to populate shinfo when its probably not ever
> > > > used. If our driver is smart L2/L3 headers are in the readable
> > > > data and prefetched. Writing frags into the end of a page is likely
> > > > not free.
> > > 
> > > Sorry, I did not get what you mean with "populate shinfo" here. We need to
> > > properly set shared_info in order to create the xdp multi-buff.
> > > Apart from header split, please consider that the main use-cases for
> > > xdp multi-buff are XDP with TSO and Jumbo frames.
> > 
> > The use case I have in mind is an XDP_TX or XDP_REDIRECT load balancer.
> > I won't know this at the driver level, and now I'll have to write into
> > the back of every page with this shinfo just in case. If header
> > split is working I should never need to even touch the page outside
> > the first N bytes that were DMAd and in cache with DDIO. So it's extra
> > overhead for something that is unlikely to happen in the LB case.
> 
> So far the skb_shared_info is constructed in mvneta only if the hw splits
> the received data over multiple buffers (so if the MTU is greater than 1 PAGE,
> please see comments below). Moreover, the shared_info is present only in the
> first buffer.

Still, in a normal L2/L3/L4 use case I expect all the headers you
need to be in the first buffer, so it's unlikely for use cases that
send most traffic via XDP_TX, for example, to ever need the extra
info. In these cases I think you are paying some penalty for
having to do the work of populating the shinfo. Maybe it's measurable,
maybe not; I'm not sure.

Also, if we make it required for multi-buffer then we also need
the shinfo on 40gbps or 100gbps nics, and now even small costs
matter.

> 
> > 
> > If you take the simplest possible program that just returns XDP_TX
> > and run a pkt generator against it. I believe (haven't run any
> > tests) that you will see overhead now just from populating this
> > shinfo. I think it needs to only be done when its needed e.g. when
> > user makes this helper call or we need to build the skb and populate
> > the frags there.
> 
> sure, I will carry out some tests.

Thanks!

> 
> > 
> > I think a smart driver will just keep the frags list in whatever
> > form it has them (rx descriptors?) and push them over to the
> > tx descriptors without having to do extra work with frag lists.
> 
> I think there are many use-cases where we want to have this info available in
> xdp_buff/xdp_frame. E.g: let's consider the following Jumbo frame example:
> - MTU > 1 PAGE (so the driver will split the received data in multiple rx
>   descriptors)
> - the driver performs a XDP_REDIRECT to a veth or cpumap
> 
> Relying on the proposed architecture we could enable GRO in veth or cpumap I
> guess since we can build a non-linear skb from the xdp multi-buff, right?

I'm not disputing there are use-cases. But, I'm trying to see if we
can cover those without introducing additional latency in other
cases. Hence the extra benchmarks request ;)

> 
> > 
> > > 
> > > > 
> > > > Did you benchmark this?
> > > 
> > > will do, I need to understand if we can use tiny buffers in mvneta.
> > 
> > Why tiny buffers? How does mvneta layout the frags when doing
> > header split? Can we just benchmark what mvneta is doing at the
> > end of this patch series?
> 
> for the moment mvneta can split the received data when the previous buffer is
> full (e.g. when the first page is completely written). I want to explore if
> I can set a tiny buffer (e.g. 128B) as max received buffer to run some performance
> tests and have some "comparable" results respect to the ones I got when I added XDP
> support to mvneta.

OK would be great.

> 
> > 
> > Also can you try the basic XDP_TX case mentioned above.
> > I don't want this to degrade existing use cases if at all
> > possible.
> 
> sure, will do.

Thanks!


* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-08 19:57           ` John Fastabend
@ 2020-09-08 21:31             ` Lorenzo Bianconi
  2020-09-09 20:51               ` Lorenzo Bianconi
  0 siblings, 1 reply; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-08 21:31 UTC (permalink / raw)
  To: John Fastabend
  Cc: Lorenzo Bianconi, netdev, bpf, davem, brouer, echaudro, sameehj,
	kuba, daniel, ast, shayagr, edumazet

[-- Attachment #1: Type: text/plain, Size: 8063 bytes --]

> Lorenzo Bianconi wrote:
> > > Lorenzo Bianconi wrote:
> > > > > Lorenzo Bianconi wrote:
> > 
> > [...]
> > 
> > > > > > + *	Description
> > > > > > + *		Adjust frame headers moving *offset* bytes from/to the second
> > > > > > + *		buffer to/from the first one. This helper can be used to move
> > > > > > + *		headers when the hw DMA SG does not copy all the headers in
> > > > > > + *		the first fragment.
> > > > 
> > > > + Eric to the discussion
> > > > 
> 
> [...]
> 
> > > > > > +BPF_CALL_2(bpf_xdp_adjust_mb_header, struct  xdp_buff *, xdp,
> > > > > > +	   int, offset)
> > > > > > +{
> > > > > > +	void *data_hard_end, *data_end;
> > > > > > +	struct skb_shared_info *sinfo;
> > > > > > +	int frag_offset, frag_len;
> > > > > > +	u8 *addr;
> > > > > > +
> > > > > > +	if (!xdp->mb)
> > > > > > +		return -EOPNOTSUPP;
> > > 
> > > Not required for this patch necessarily but I think it would be better user
> > > experience if instead of EOPNOTSUPP here we did the header split. This
> > > would allocate a frag and copy the bytes around as needed. Yes it might
> > > be slow if you don't have a frag free in the driver, but if user wants to
> > > do header split and their hardware can't do it we would have a way out.
> > > 
> > > I guess it could be an improvement for later though.
> > 
> > I have no strong opinion on this; I did it this way to respect the rule "we
> > do not allocate memory for XDP".
> > 
> > @Jesper, David: thoughts?
> 
> Consider adding a flags field to the helper so we could do this later with
> a flag. Then users who want the alloc can set the flag and get it.
> 
> [...]
> 
> > 
> > > 
> > > How/when does the header split go wrong on the mvneta device? I guess
> > > this is to fix a real bug/issue, not some theoretical one? An example
> > > in the commit message would make this concrete. Something like:
> > > "When using RX zerocopy to mmap data into a userspace application, if
> > > a packet with [all these wild headers] is received, rx zerocopy breaks
> > > because header split puts headers X in the data frag, confusing apps".
> > 
> > This issue does not occur with mvneta since the driver is not capable of
> > performing header split AFAIK. The helper has been introduced to cover the
> > "issue" reported by Eric in his NetDevConf presentation. In order to test the
> > helper I modified the mvneta rx napi loop in a controlled way (this patch can't
> > be sent upstream, it is for testing only :)).
> > I will improve the commit message in v3.
> 
> Ah, OK, so there are really no users for the helper then. IMO just drop
> the patch until we have a user.

I agree, this helper is not strictly related to the series. I added it in v2
to provide another example of using the xdp_buff mb bit.

> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > Also and even more concerning I think this API requires the
> > > > > driver to populate shinfo. If we use TX_REDIRECT a lot or TX_XMIT
> > > > > this means we need to populate shinfo when its probably not ever
> > > > > used. If our driver is smart L2/L3 headers are in the readable
> > > > > data and prefetched. Writing frags into the end of a page is likely
> > > > > not free.
> > > > 
> > > > Sorry, I did not get what you mean with "populate shinfo" here. We need to
> > > > properly set shared_info in order to create the xdp multi-buff.
> > > > Apart from header split, please consider that the main use-cases for
> > > > xdp multi-buff are XDP with TSO and Jumbo frames.
> > > 
> > > The use case I have in mind is an XDP_TX or XDP_REDIRECT load balancer.
> > > I won't know this at the driver level, and now I'll have to write into
> > > the back of every page with this shinfo just in case. If header
> > > split is working I should never need to even touch the page outside
> > > the first N bytes that were DMAd and in cache with DDIO. So it's extra
> > > overhead for something that is unlikely to happen in the LB case.
> > 
> > So far the skb_shared_info is constructed in mvneta only if the hw splits
> > the received data over multiple buffers (so if the MTU is greater than 1 PAGE,
> > please see comments below). Moreover, the shared_info is present only in the
> > first buffer.
> 
> Still, in a normal L2/L3/L4 use case I expect all the headers you
> need to be in the first buffer, so it's unlikely for use cases that
> send most traffic via XDP_TX, for example, to ever need the extra
> info. In these cases I think you are paying some penalty for
> having to do the work of populating the shinfo. Maybe it's measurable,
> maybe not; I'm not sure.
> 
> Also, if we make it required for multi-buffer then we also need
> the shinfo on 40gbps or 100gbps nics, and now even small costs
> matter.

Now I realized I used the word "split" in an unclear way here,
I apologize for that.
What I mean is not related to header split; I am referring to the case where
the hw is configured with a given rx buffer size (e.g. 1 PAGE) and we have
set a higher MTU/max receive size (e.g. 9K).
In this case the hw will "split" the jumbo received frame over multiple rx
buffers/descriptors. Populating the "xdp_shared_info", we will forward this
layout info to the eBPF sandbox and to a remote driver/cpu.
Please note this use case is not currently covered by XDP, so if we develop it
in a proper way I guess we should not get any performance hit for the legacy
single-buffer mode, since we will not populate the shared_info for it (I think
you refer to the "legacy" use-case in your "normal L2/L3/L4" example, right?)
Anyway, I will run some tests to verify that the performance of the
single-buffer use-case is not hit.
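The buffer accounting described above can be sketched like this
(illustrative numbers, not mvneta specifics):

```c
#include <assert.h>
#include <stddef.h>

/* With a fixed per-descriptor rx buffer (e.g. one page, minus the
 * headroom reserved in the first buffer), a frame larger than the
 * first buffer spills over additional rx descriptors/frags. */
static int rx_descs_needed(size_t frame_len, size_t buf_size,
			   size_t first_buf_headroom)
{
	size_t first = buf_size - first_buf_headroom;

	if (frame_len <= first)
		return 1;
	frame_len -= first;
	/* one descriptor for the first buffer, then ceil() for the rest */
	return 1 + (int)((frame_len + buf_size - 1) / buf_size);
}
```

E.g. a 9K jumbo frame with 4 KiB buffers and 256 B of headroom lands in
three descriptors, so the shared_info in the first buffer would describe
two frags.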

Regards,
Lorenzo

> 
> > 
> > > 
> > > If you take the simplest possible program that just returns XDP_TX
> > > and run a pkt generator against it. I believe (haven't run any
> > > tests) that you will see overhead now just from populating this
> > > shinfo. I think it needs to only be done when its needed e.g. when
> > > user makes this helper call or we need to build the skb and populate
> > > the frags there.
> > 
> > sure, I will carry out some tests.
> 
> Thanks!
> 
> > 
> > > 
> > > I think a smart driver will just keep the frags list in whatever
> > > form it has them (rx descriptors?) and push them over to the
> > > tx descriptors without having to do extra work with frag lists.
> > 
> > I think there are many use-cases where we want to have this info available in
> > xdp_buff/xdp_frame. E.g: let's consider the following Jumbo frame example:
> > - MTU > 1 PAGE (so the driver will split the received data in multiple rx
> >   descriptors)
> > - the driver performs a XDP_REDIRECT to a veth or cpumap
> > 
> > Relying on the proposed architecture we could enable GRO in veth or cpumap I
> > guess since we can build a non-linear skb from the xdp multi-buff, right?
> 
> I'm not disputing there are use-cases. But, I'm trying to see if we
> can cover those without introducing additional latency in other
> cases. Hence the extra benchmarks request ;)
> 
> > 
> > > 
> > > > 
> > > > > 
> > > > > Did you benchmark this?
> > > > 
> > > > will do, I need to understand if we can use tiny buffers in mvneta.
> > > 
> > > Why tiny buffers? How does mvneta layout the frags when doing
> > > header split? Can we just benchmark what mvneta is doing at the
> > > end of this patch series?
> > 
> > for the moment mvneta can split the received data when the previous buffer is
> > full (e.g. when the first page is completely written). I want to explore if
> > I can set a tiny buffer (e.g. 128B) as max received buffer to run some performance
> > tests and have some "comparable" results respect to the ones I got when I added XDP
> > support to mvneta.
> 
> OK would be great.
> 
> > 
> > > 
> > > Also can you try the basic XDP_TX case mentioned above.
> > > I don't want this to degrade existing use cases if at all
> > > possible.
> > 
> > sure, will do.
> 
> Thanks!
> 



* Re: [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper
  2020-09-08 21:31             ` Lorenzo Bianconi
@ 2020-09-09 20:51               ` Lorenzo Bianconi
  0 siblings, 0 replies; 42+ messages in thread
From: Lorenzo Bianconi @ 2020-09-09 20:51 UTC (permalink / raw)
  To: John Fastabend
  Cc: Lorenzo Bianconi, netdev, bpf, davem, brouer, echaudro, sameehj,
	kuba, daniel, ast, shayagr, edumazet

[-- Attachment #1: Type: text/plain, Size: 5234 bytes --]

> > Lorenzo Bianconi wrote:
> > > > Lorenzo Bianconi wrote:
> > > > > > Lorenzo Bianconi wrote:
> > > 
> > > [...]
> > > 
> > > > > > > + *	Description
> > > > > > > + *		Adjust frame headers moving *offset* bytes from/to the second
> > > > > > > + *		buffer to/from the first one. This helper can be used to move
> > > > > > > + *		headers when the hw DMA SG does not copy all the headers in
> > > > > > > + *		the first fragment.
> > > > > 
> > > > > + Eric to the discussion
> > > > > 
> > 
> > [...]
> > 

[...]

> > 
> > Still, in a normal L2/L3/L4 use case I expect all the headers you
> > need to be in the first buffer, so it's unlikely for use cases that
> > send most traffic via XDP_TX, for example, to ever need the extra
> > info. In these cases I think you are paying some penalty for
> > having to do the work of populating the shinfo. Maybe it's measurable,
> > maybe not; I'm not sure.
> > 
> > Also, if we make it required for multi-buffer then we also need
> > the shinfo on 40gbps or 100gbps nics, and now even small costs
> > matter.
> 
> Now I realize I used the word "split" in an unclear way here,
> I apologize for that.
> What I mean is not related to "header" split; I am referring to the case where
> the hw is configured with a given rx buffer size (e.g. 1 PAGE) and we have
> set a higher MTU/max receive size (e.g. 9K).
> In this case the hw will "split" the received jumbo frame over multiple rx
> buffers/descriptors. By populating the "xdp_shared_info" we can forward this
> layout info to the eBPF sandbox and to a remote driver/cpu.
> Please note this use case is not currently covered by XDP, so if we develop it
> in a proper way I guess we should not get any performance hit for the legacy
> single-buffer mode, since we will not populate the shared_info for it (I think
> you refer to the "legacy" use-case in your "normal L2/L3/L4" example, right?)
> Anyway, I will run some tests to verify that the performance of the single
> buffer use-case is not hurt.
> 
> Regards,
> Lorenzo

I carried out some performance measurements on my Espressobin to check whether
the XDP "single buffer" use-case is degraded by introducing xdp multi-buff
support. Each test was run sending ~900 Kpps (pkt length 64B), with the rx
buffer size set to 1 PAGE (the default value).
The results are roughly the same:

commit: f2ca673d2cd5 "net: mvneta: fix use of state->speed"
==========================================================
- XDP-DROP: ~ 740 Kpps
- XDP-TX: ~ 286 Kpps
- XDP-PASS + tc drop: ~ 219.5 Kpps

xdp multi-buff:
===============
- XDP-DROP: ~ 739-740 Kpps
- XDP-TX: ~ 285 Kpps
- XDP-PASS + tc drop: ~ 223 Kpps

I will add these results to v3 cover letter.

Regards,
Lorenzo

> 
> > 
> > > 
> > > > 
> > > > If you take the simplest possible program that just returns XDP_TX
> > > > and run a pkt generator against it, I believe (I haven't run any
> > > > tests) that you will see overhead now just from populating this
> > > > shinfo. I think it should only be done when it's needed, e.g. when
> > > > the user makes this helper call or when we need to build the skb and
> > > > populate the frags there.
> > > 
> > > sure, I will carry out some tests.
> > 
> > Thanks!
> > 
> > > 
> > > > 
> > > > I think a smart driver will just keep the frags list in whatever
> > > > form it has them (rx descriptors?) and push them over to the
> > > > tx descriptors without having to do extra work with frag lists.
> > > 
> > > I think there are many use-cases where we want to have this info available
> > > in xdp_buff/xdp_frame. E.g. let's consider the following jumbo frame example:
> > > - MTU > 1 PAGE (so the driver will split the received data over multiple rx
> > >   descriptors)
> > > - the driver performs an XDP_REDIRECT to a veth or cpumap
> > > 
> > > Relying on the proposed architecture we could enable GRO in veth or cpumap I
> > > guess, since we can build a non-linear skb from the xdp multi-buff, right?
> > 
> > I'm not disputing there are use-cases. But I'm trying to see if we
> > can cover those without introducing additional latency in other
> > cases. Hence the request for extra benchmarks ;)
> > 
> > > 
> > > > 
> > > > > 
> > > > > > 
> > > > > > Did you benchmark this?
> > > > > 
> > > > > will do, I need to understand if we can use tiny buffers in mvneta.
> > > > 
> > > > Why tiny buffers? How does mvneta lay out the frags when doing
> > > > header split? Can we just benchmark what mvneta is doing at the
> > > > end of this patch series?
> > > 
> > > For the moment mvneta splits the received data when the previous buffer is
> > > full (e.g. when the first page is completely written). I want to explore
> > > whether I can set a tiny buffer (e.g. 128B) as the max receive buffer size
> > > to run some performance tests and have some "comparable" results with
> > > respect to the ones I got when I added XDP support to mvneta.
> > 
> > OK would be great.
> > 
> > > 
> > > > 
> > > > Also can you try the basic XDP_TX case mentioned above.
> > > > I don't want this to degrade existing use cases if at all
> > > > possible.
> > > 
> > > sure, will do.
> > 
> > Thanks!
> > 



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2020-09-09 20:51 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03 20:58 [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 1/9] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
2020-09-04  1:07   ` Alexei Starovoitov
2020-09-04  7:19     ` Jesper Dangaard Brouer
2020-09-04 15:15       ` David Ahern
2020-09-04 15:59         ` Jesper Dangaard Brouer
2020-09-04 16:30           ` David Ahern
2020-09-07 18:02             ` Jesper Dangaard Brouer
2020-09-08  1:22               ` David Ahern
2020-09-03 20:58 ` [PATCH v2 net-next 2/9] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers Lorenzo Bianconi
2020-09-04  7:35   ` Jesper Dangaard Brouer
2020-09-04  7:54     ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 3/9] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer Lorenzo Bianconi
2020-09-06  7:33   ` Shay Agroskin
2020-09-06  9:05     ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 4/9] xdp: add multi-buff support to xdp_return_{buff/frame} Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 5/9] net: mvneta: add multi buffer support to XDP_TX Lorenzo Bianconi
2020-09-06  7:20   ` Shay Agroskin
2020-09-06  8:43     ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 6/9] bpf: helpers: add bpf_xdp_adjust_mb_header helper Lorenzo Bianconi
2020-09-04  1:09   ` Alexei Starovoitov
2020-09-04  8:07     ` Lorenzo Bianconi
2020-09-04  1:13   ` Alexei Starovoitov
2020-09-04  7:50     ` Lorenzo Bianconi
2020-09-04 13:52       ` Jesper Dangaard Brouer
2020-09-04 14:27         ` Lorenzo Bianconi
2020-09-04  6:47   ` John Fastabend
2020-09-04  9:45     ` Lorenzo Bianconi
2020-09-04 15:23       ` John Fastabend
2020-09-06 13:36         ` Lorenzo Bianconi
2020-09-08 19:57           ` John Fastabend
2020-09-08 21:31             ` Lorenzo Bianconi
2020-09-09 20:51               ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 7/9] bpf: helpers: add multibuffer support Lorenzo Bianconi
2020-09-03 21:24   ` Maciej Fijalkowski
2020-09-04  9:47     ` Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 8/9] samples/bpf: add bpf program that uses xdp mb helpers Lorenzo Bianconi
2020-09-03 20:58 ` [PATCH v2 net-next 9/9] net: mvneta: enable jumbo frames for XDP Lorenzo Bianconi
2020-09-04  1:08 ` [PATCH v2 net-next 0/9] mvneta: introduce XDP multi-buffer support Alexei Starovoitov
2020-09-04  7:40   ` Lorenzo Bianconi
2020-09-04  5:41 ` John Fastabend
2020-09-04  7:39   ` Lorenzo Bianconi
