netdev.vger.kernel.org archive mirror
* [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support
@ 2020-12-07 16:32 Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
                   ` (13 more replies)
  0 siblings, 14 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

This series introduces XDP multi-buffer support. The mvneta driver is
the first to support these new "non-linear" xdp_{buff,frame}. Reviewers,
please focus on how these new types of xdp_{buff,frame} packets
traverse the different layers and on the layout design. The BPF helpers
are deliberately kept simple: we don't want to expose the internal
layout, so that it can still be changed later.

For now, to keep the design simple and to maintain performance, the XDP
BPF program (still) only has access to the first buffer. Payload access
across multiple buffers is left for a later patchset, and this patchset
should still allow for such future extensions. The goal is to lift the
MTU restriction that comes with XDP while maintaining the same
performance as before.

The main idea for the new multi-buffer layout is to reuse the layout
used for non-linear SKBs. We introduce an "xdp_shared_info" data
structure at the end of the first buffer to link together the subsequent
buffers. xdp_shared_info aliases skb_shared_info, but keeps most of the
frags in the same cache line (whereas with skb_shared_info only the
first fragment fits in the first "shared_info" cache line). Moreover, we
introduce some xdp_shared_info helpers aligned with the skb_frag* ones.
Converting an xdp_frame to an SKB and delivering it to the network stack
is shown in the cpumap code (patch 11/14): while building the SKB, the
xdp_shared_info structure is converted into a skb_shared_info one.

A multi-buffer bit (mb) has been introduced in the xdp_{buff,frame}
structures to tell the bpf/network layer whether this is an xdp
multi-buffer frame (mb = 1) or not (mb = 0).
The mb bit is set by an xdp multi-buffer capable driver only for
non-linear frames; linear frames are still received without any extra
cost, since the xdp_shared_info structure at the end of the first
buffer is initialized only if mb is set.

Typical use cases for this series are:
- Jumbo-frames
- Packet header split (please see Google’s use-case @ NetDevConf 0x14, [0])
- TSO

A new frame_length field has been introduced in the XDP ctx in order to
report the total frame size (linear + paged parts) to the eBPF layer.

The bpf_xdp_adjust_tail helper has been modified to take xdp multi-buff
frames into account.

More info about the main idea behind this approach can be found here [1][2].

Changes since v4:
- rebase ontop of bpf-next
- introduce xdp_shared_info to build xdp multi-buff instead of using the
  skb_shared_info struct
- introduce frame_length in xdp ctx
- drop previous bpf helpers
- fix bpf_xdp_adjust_tail for xdp multi-buff
- introduce xdp multi-buff self-tests for bpf_xdp_adjust_tail
- fix xdp_return_frame_bulk for xdp multi-buff

Changes since v3:
- rebase ontop of bpf-next
- add patch 10/13 to copy back paged data from a xdp multi-buff frame to
  userspace buffer for xdp multi-buff selftests

Changes since v2:
- add throughput measurements
- drop bpf_xdp_adjust_mb_header bpf helper
- introduce selftest for xdp multibuffer
- addressed comments on bpf_xdp_get_frags_count
- introduce xdp multi-buff support to cpumaps

Changes since v1:
- Fix use-after-free in xdp_return_{buff/frame}
- Introduce bpf helpers
- Introduce xdp_mb sample program
- access skb_shared_info->nr_frags only on the last fragment

Changes since RFC:
- squash multi-buffer bit initialization in a single patch
- add mvneta non-linear XDP buff support for tx side

[0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
[2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDPmulti-buffers section)

Eelco Chaudron (3):
  bpf: add multi-buff support to the bpf_xdp_adjust_tail() API
  bpf: add new frame_length field to the XDP ctx
  bpf: update xdp_adjust_tail selftest to include multi-buffer

Lorenzo Bianconi (11):
  xdp: introduce mb in xdp_buff/xdp_frame
  xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  xdp: add xdp_shared_info data structure
  net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  xdp: add multi-buff support to xdp_return_{buff/frame}
  net: mvneta: add multi buffer support to XDP_TX
  bpf: move user_size out of bpf_test_init
  bpf: introduce multibuff support to bpf_prog_test_run_xdp()
  bpf: test_run: add xdp_shared_info pointer in bpf_test_finish
    signature
  net: mvneta: enable jumbo frames for XDP
  bpf: cpumap: introduce xdp multi-buff support

 drivers/net/ethernet/amazon/ena/ena_netdev.c  |   1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |   1 +
 .../net/ethernet/cavium/thunder/nicvf_main.c  |   1 +
 .../net/ethernet/freescale/dpaa2/dpaa2-eth.c  |   1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   1 +
 drivers/net/ethernet/intel/ice/ice_txrx.c     |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   1 +
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |   1 +
 drivers/net/ethernet/marvell/mvneta.c         | 181 ++++++++++--------
 .../net/ethernet/marvell/mvpp2/mvpp2_main.c   |   1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |   1 +
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   1 +
 .../ethernet/netronome/nfp/nfp_net_common.c   |   1 +
 drivers/net/ethernet/qlogic/qede/qede_fp.c    |   1 +
 drivers/net/ethernet/sfc/rx.c                 |   1 +
 drivers/net/ethernet/socionext/netsec.c       |   1 +
 drivers/net/ethernet/ti/cpsw.c                |   1 +
 drivers/net/ethernet/ti/cpsw_new.c            |   1 +
 drivers/net/hyperv/netvsc_bpf.c               |   1 +
 drivers/net/tun.c                             |   2 +
 drivers/net/veth.c                            |   1 +
 drivers/net/virtio_net.c                      |   2 +
 drivers/net/xen-netfront.c                    |   1 +
 include/net/xdp.h                             | 111 ++++++++++-
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/cpumap.c                           |  45 +----
 kernel/bpf/verifier.c                         |   2 +-
 net/bpf/test_run.c                            | 107 +++++++++--
 net/core/dev.c                                |   1 +
 net/core/filter.c                             | 146 ++++++++++++++
 net/core/xdp.c                                | 150 ++++++++++++++-
 tools/include/uapi/linux/bpf.h                |   1 +
 .../bpf/prog_tests/xdp_adjust_tail.c          | 105 ++++++++++
 .../bpf/progs/test_xdp_adjust_tail_grow.c     |  16 +-
 .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 +++-
 35 files changed, 761 insertions(+), 161 deletions(-)

-- 
2.28.0



* [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-07 21:16   ` Alexander Duyck
  2020-12-07 16:32 ` [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers Lorenzo Bianconi
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Introduce a multi-buffer bit (mb) in the xdp_frame/xdp_buff data
structures to specify whether this is a linear buffer (mb = 0) or a
multi-buffer frame (mb = 1). In the latter case the shared_info area at
the end of the first buffer has been properly initialized to link
together the subsequent buffers.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/net/xdp.h | 8 ++++++--
 net/core/xdp.c    | 1 +
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 700ad5db7f5d..70559720ff44 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -73,7 +73,8 @@ struct xdp_buff {
 	void *data_hard_start;
 	struct xdp_rxq_info *rxq;
 	struct xdp_txq_info *txq;
-	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
+	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved tailroom*/
+	u32 mb:1; /* xdp non-linear buffer */
 };
 
 /* Reserve memory area at end-of data area.
@@ -97,7 +98,8 @@ struct xdp_frame {
 	u16 len;
 	u16 headroom;
 	u32 metasize:8;
-	u32 frame_sz:24;
+	u32 frame_sz:23;
+	u32 mb:1; /* xdp non-linear frame */
 	/* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
 	 * while mem info is valid on remote CPU.
 	 */
@@ -154,6 +156,7 @@ void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
 	xdp->data_end = frame->data + frame->len;
 	xdp->data_meta = frame->data - frame->metasize;
 	xdp->frame_sz = frame->frame_sz;
+	xdp->mb = frame->mb;
 }
 
 static inline
@@ -180,6 +183,7 @@ int xdp_update_frame_from_buff(struct xdp_buff *xdp,
 	xdp_frame->headroom = headroom - sizeof(*xdp_frame);
 	xdp_frame->metasize = metasize;
 	xdp_frame->frame_sz = xdp->frame_sz;
+	xdp_frame->mb = xdp->mb;
 
 	return 0;
 }
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 17ffd33c6b18..79dd45234e4d 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -509,6 +509,7 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	xdpf->headroom = 0;
 	xdpf->metasize = metasize;
 	xdpf->frame_sz = PAGE_SIZE;
+	xdpf->mb = xdp->mb;
 	xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
 
 	xsk_buff_free(xdp);
-- 
2.28.0



* [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-07 21:15   ` Alexander Duyck
  2020-12-07 16:32 ` [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure Lorenzo Bianconi
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Initialize the multi-buffer bit (mb) to 0 in all XDP-capable drivers.
This is a preliminary patch to enable xdp multi-buffer support.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c        | 1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c       | 1 +
 drivers/net/ethernet/cavium/thunder/nicvf_main.c    | 1 +
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c    | 1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c         | 1 +
 drivers/net/ethernet/intel/ice/ice_txrx.c           | 1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c       | 1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c   | 1 +
 drivers/net/ethernet/marvell/mvneta.c               | 1 +
 drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c     | 1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c          | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c     | 1 +
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 1 +
 drivers/net/ethernet/qlogic/qede/qede_fp.c          | 1 +
 drivers/net/ethernet/sfc/rx.c                       | 1 +
 drivers/net/ethernet/socionext/netsec.c             | 1 +
 drivers/net/ethernet/ti/cpsw.c                      | 1 +
 drivers/net/ethernet/ti/cpsw_new.c                  | 1 +
 drivers/net/hyperv/netvsc_bpf.c                     | 1 +
 drivers/net/tun.c                                   | 2 ++
 drivers/net/veth.c                                  | 1 +
 drivers/net/virtio_net.c                            | 2 ++
 drivers/net/xen-netfront.c                          | 1 +
 net/core/dev.c                                      | 1 +
 24 files changed, 26 insertions(+)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 0e98f45c2b22..abe826395e2f 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -1569,6 +1569,7 @@ static int ena_clean_rx_irq(struct ena_ring *rx_ring, struct napi_struct *napi,
 	res_budget = budget;
 	xdp.rxq = &rx_ring->xdp_rxq;
 	xdp.frame_sz = ENA_PAGE_SIZE;
+	xdp.mb = 0;
 
 	do {
 		xdp_verdict = XDP_PASS;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index fcc262064766..344644b6dd4d 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -139,6 +139,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info *rxr, u16 cons,
 	xdp.data_end = *data_ptr + *len;
 	xdp.rxq = &rxr->xdp_rxq;
 	xdp.frame_sz = PAGE_SIZE; /* BNXT_RX_PAGE_MODE(bp) when XDP enabled */
+	xdp.mb = 0;
 	orig_data = xdp.data;
 
 	rcu_read_lock();
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index f3b7b443f964..4e790a50d14c 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -553,6 +553,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
 	xdp.data_end = xdp.data + len;
 	xdp.rxq = &rq->xdp_rxq;
 	xdp.frame_sz = RCV_FRAG_LEN + XDP_PACKET_HEADROOM;
+	xdp.mb = 0;
 	orig_data = xdp.data;
 
 	rcu_read_lock();
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 91cff93dbdae..fe70be3ca399 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -366,6 +366,7 @@ static u32 dpaa2_eth_run_xdp(struct dpaa2_eth_priv *priv,
 
 	xdp.frame_sz = DPAA2_ETH_RX_BUF_RAW_SIZE -
 		(dpaa2_fd_get_offset(fd) - XDP_PACKET_HEADROOM);
+	xdp.mb = 0;
 
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 9f73cd7aee09..1c8acebfde3d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2343,6 +2343,7 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
 	xdp.frame_sz = i40e_rx_frame_truesize(rx_ring, 0);
 #endif
 	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.mb = 0;
 
 	while (likely(total_rx_packets < (unsigned int)budget)) {
 		struct i40e_rx_buffer *rx_buffer;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 77d5eae6b4c2..0f8a996e298b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1089,6 +1089,7 @@ int ice_clean_rx_irq(struct ice_ring *rx_ring, int budget)
 #if (PAGE_SIZE < 8192)
 	xdp.frame_sz = ice_rx_frame_truesize(rx_ring, 0);
 #endif
+	xdp.mb = 0;
 
 	/* start the loop to process Rx packets bounded by 'budget' */
 	while (likely(total_rx_pkts < (unsigned int)budget)) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 50e6b8b6ba7b..2e028b306677 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2298,6 +2298,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 #if (PAGE_SIZE < 8192)
 	xdp.frame_sz = ixgbe_rx_frame_truesize(rx_ring, 0);
 #endif
+	xdp.mb = 0;
 
 	while (likely(total_rx_packets < budget)) {
 		union ixgbe_adv_rx_desc *rx_desc;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 4061cd7db5dd..037bfb2aadac 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1129,6 +1129,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
 	struct xdp_buff xdp;
 
 	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.mb = 0;
 
 	/* Frame size depend on rx_ring setup when PAGE_SIZE=4K */
 #if (PAGE_SIZE < 8192)
diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 563ceac3060f..1e5b5c69685a 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2366,6 +2366,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 	xdp_buf.data_hard_start = NULL;
 	xdp_buf.frame_sz = PAGE_SIZE;
 	xdp_buf.rxq = &rxq->xdp_rxq;
+	xdp_buf.mb = 0;
 
 	sinfo.nr_frags = 0;
 
diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
index afdd22827223..32b48de36841 100644
--- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
+++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
@@ -3566,6 +3566,7 @@ static int mvpp2_rx(struct mvpp2_port *port, struct napi_struct *napi,
 			xdp.data = data + MVPP2_MH_SIZE + MVPP2_SKB_HEADROOM;
 			xdp.data_end = xdp.data + rx_bytes;
 			xdp.frame_sz = PAGE_SIZE;
+			xdp.mb = 0;
 
 			if (bm_pool->pkt_size == MVPP2_BM_SHORT_PKT_SIZE)
 				xdp.rxq = &rxq->xdp_rxq_short;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 7954c1daf2b6..547ff84bb71a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -684,6 +684,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	xdp_prog = rcu_dereference(ring->xdp_prog);
 	xdp.rxq = &ring->xdp_rxq;
 	xdp.frame_sz = priv->frag_info[0].frag_stride;
+	xdp.mb = 0;
 	doorbell_pending = false;
 
 	/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 6628a0197b4e..50fd12ba3a0f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1133,6 +1133,7 @@ static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, void *va, u16 headroom,
 	xdp->data_end = xdp->data + len;
 	xdp->rxq = &rq->xdp_rxq;
 	xdp->frame_sz = rq->buff.frame0_sz;
+	xdp->mb = 0;
 }
 
 static struct sk_buff *
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index b4acf2f41e84..4e762bbf283c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1824,6 +1824,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
 	true_bufsz = xdp_prog ? PAGE_SIZE : dp->fl_bufsz;
 	xdp.frame_sz = PAGE_SIZE - NFP_NET_RX_BUF_HEADROOM;
 	xdp.rxq = &rx_ring->xdp_rxq;
+	xdp.mb = 0;
 	tx_ring = r_vec->xdp_ring;
 
 	while (pkts_polled < budget) {
diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index a2494bf85007..14a54094ca08 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1096,6 +1096,7 @@ static bool qede_rx_xdp(struct qede_dev *edev,
 	xdp.data_end = xdp.data + *len;
 	xdp.rxq = &rxq->xdp_rxq;
 	xdp.frame_sz = rxq->rx_buf_seg_size; /* PAGE_SIZE when XDP enabled */
+	xdp.mb = 0;
 
 	/* Queues always have a full reset currently, so for the time
 	 * being until there's atomic program replace just mark read
diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c
index aaa112877561..286feb510c21 100644
--- a/drivers/net/ethernet/sfc/rx.c
+++ b/drivers/net/ethernet/sfc/rx.c
@@ -301,6 +301,7 @@ static bool efx_do_xdp(struct efx_nic *efx, struct efx_channel *channel,
 	xdp.data_end = xdp.data + rx_buf->len;
 	xdp.rxq = &rx_queue->xdp_rxq_info;
 	xdp.frame_sz = efx->rx_page_buf_step;
+	xdp.mb = 0;
 
 	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 	rcu_read_unlock();
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index 19d20a6d0d44..8853db2575f0 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -958,6 +958,7 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
 
 	xdp.rxq = &dring->xdp_rxq;
 	xdp.frame_sz = PAGE_SIZE;
+	xdp.mb = 0;
 
 	rcu_read_lock();
 	xdp_prog = READ_ONCE(priv->xdp_prog);
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index b0f00b4edd94..6e3fa1994bd5 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -407,6 +407,7 @@ static void cpsw_rx_handler(void *token, int len, int status)
 		xdp.data_hard_start = pa;
 		xdp.rxq = &priv->xdp_rxq[ch];
 		xdp.frame_sz = PAGE_SIZE;
+		xdp.mb = 0;
 
 		port = priv->emac_port + cpsw->data.dual_emac;
 		ret = cpsw_run_xdp(priv, ch, &xdp, page, port);
diff --git a/drivers/net/ethernet/ti/cpsw_new.c b/drivers/net/ethernet/ti/cpsw_new.c
index 2f5e0ad23ad7..a13535fefbeb 100644
--- a/drivers/net/ethernet/ti/cpsw_new.c
+++ b/drivers/net/ethernet/ti/cpsw_new.c
@@ -350,6 +350,7 @@ static void cpsw_rx_handler(void *token, int len, int status)
 		xdp.data_hard_start = pa;
 		xdp.rxq = &priv->xdp_rxq[ch];
 		xdp.frame_sz = PAGE_SIZE;
+		xdp.mb = 0;
 
 		ret = cpsw_run_xdp(priv, ch, &xdp, page, priv->emac_port);
 		if (ret != CPSW_XDP_PASS)
diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c
index 440486d9c999..a4bafc64997f 100644
--- a/drivers/net/hyperv/netvsc_bpf.c
+++ b/drivers/net/hyperv/netvsc_bpf.c
@@ -50,6 +50,7 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct netvsc_channel *nvchan,
 	xdp->data_end = xdp->data + len;
 	xdp->rxq = &nvchan->xdp_rxq;
 	xdp->frame_sz = PAGE_SIZE;
+	xdp->mb = 0;
 
 	memcpy(xdp->data, data, len);
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index fbed05ae7b0f..8c100f4b2001 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1605,6 +1605,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		xdp.data_end = xdp.data + len;
 		xdp.rxq = &tfile->xdp_rxq;
 		xdp.frame_sz = buflen;
+		xdp.mb = 0;
 
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 		if (act == XDP_REDIRECT || act == XDP_TX) {
@@ -2347,6 +2348,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 		xdp_set_data_meta_invalid(xdp);
 		xdp->rxq = &tfile->xdp_rxq;
 		xdp->frame_sz = buflen;
+		xdp->mb = 0;
 
 		act = bpf_prog_run_xdp(xdp_prog, xdp);
 		err = tun_xdp_act(tun, xdp_prog, xdp, act);
diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 02bfcdf50a7a..52e050228a42 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -719,6 +719,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	/* SKB "head" area always have tailroom for skb_shared_info */
 	xdp.frame_sz = (void *)skb_end_pointer(skb) - xdp.data_hard_start;
 	xdp.frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	xdp.mb = 0;
 
 	orig_data = xdp.data;
 	orig_data_end = xdp.data_end;
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 052975ea0af4..4c15e30d7ef1 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -695,6 +695,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 		xdp.data_meta = xdp.data;
 		xdp.rxq = &rq->xdp_rxq;
 		xdp.frame_sz = buflen;
+		xdp.mb = 0;
 		orig_data = xdp.data;
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 		stats->xdp_packets++;
@@ -865,6 +866,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 		xdp.data_meta = xdp.data;
 		xdp.rxq = &rq->xdp_rxq;
 		xdp.frame_sz = frame_sz - vi->hdr_len;
+		xdp.mb = 0;
 
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
 		stats->xdp_packets++;
diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index b01848ef4649..34a254eab58b 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -870,6 +870,7 @@ static u32 xennet_run_xdp(struct netfront_queue *queue, struct page *pdata,
 	xdp->data_end = xdp->data + len;
 	xdp->rxq = &queue->xdp_rxq;
 	xdp->frame_sz = XEN_PAGE_SIZE - XDP_PACKET_HEADROOM;
+	xdp->mb = 0;
 
 	act = bpf_prog_run_xdp(prog, xdp);
 	switch (act) {
diff --git a/net/core/dev.c b/net/core/dev.c
index ce8fea2e2788..7f0b2b25860a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4633,6 +4633,7 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 	/* SKB "head" area always have tailroom for skb_shared_info */
 	xdp->frame_sz  = (void *)skb_end_pointer(skb) - xdp->data_hard_start;
 	xdp->frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	xdp->mb = 0;
 
 	orig_data_end = xdp->data_end;
 	orig_data = xdp->data;
-- 
2.28.0



* [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-08  0:22   ` Saeed Mahameed
  2020-12-07 16:32 ` [PATCH v5 bpf-next 04/14] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer Lorenzo Bianconi
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Introduce the xdp_shared_info data structure to contain info about
"non-linear" xdp frames. xdp_shared_info aliases skb_shared_info, but
keeps most of the frags in the same cache line.
Introduce some xdp_shared_info helpers aligned with the skb_frag* ones.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 62 +++++++++++++++------------
 include/net/xdp.h                     | 52 ++++++++++++++++++++--
 2 files changed, 82 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 1e5b5c69685a..d635463609ad 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2033,14 +2033,17 @@ int mvneta_rx_refill_queue(struct mvneta_port *pp, struct mvneta_rx_queue *rxq)
 
 static void
 mvneta_xdp_put_buff(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
-		    struct xdp_buff *xdp, struct skb_shared_info *sinfo,
+		    struct xdp_buff *xdp, struct xdp_shared_info *xdp_sinfo,
 		    int sync_len)
 {
 	int i;
 
-	for (i = 0; i < sinfo->nr_frags; i++)
+	for (i = 0; i < xdp_sinfo->nr_frags; i++) {
+		skb_frag_t *frag = &xdp_sinfo->frags[i];
+
 		page_pool_put_full_page(rxq->page_pool,
-					skb_frag_page(&sinfo->frags[i]), true);
+					xdp_get_frag_page(frag), true);
+	}
 	page_pool_put_page(rxq->page_pool, virt_to_head_page(xdp->data),
 			   sync_len, true);
 }
@@ -2179,7 +2182,7 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 	       struct bpf_prog *prog, struct xdp_buff *xdp,
 	       u32 frame_sz, struct mvneta_stats *stats)
 {
-	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+	struct xdp_shared_info *xdp_sinfo = xdp_get_shared_info_from_buff(xdp);
 	unsigned int len, data_len, sync;
 	u32 ret, act;
 
@@ -2200,7 +2203,7 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 
 		err = xdp_do_redirect(pp->dev, xdp, prog);
 		if (unlikely(err)) {
-			mvneta_xdp_put_buff(pp, rxq, xdp, sinfo, sync);
+			mvneta_xdp_put_buff(pp, rxq, xdp, xdp_sinfo, sync);
 			ret = MVNETA_XDP_DROPPED;
 		} else {
 			ret = MVNETA_XDP_REDIR;
@@ -2211,7 +2214,7 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 	case XDP_TX:
 		ret = mvneta_xdp_xmit_back(pp, xdp);
 		if (ret != MVNETA_XDP_TX)
-			mvneta_xdp_put_buff(pp, rxq, xdp, sinfo, sync);
+			mvneta_xdp_put_buff(pp, rxq, xdp, xdp_sinfo, sync);
 		break;
 	default:
 		bpf_warn_invalid_xdp_action(act);
@@ -2220,7 +2223,7 @@ mvneta_run_xdp(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 		trace_xdp_exception(pp->dev, prog, act);
 		fallthrough;
 	case XDP_DROP:
-		mvneta_xdp_put_buff(pp, rxq, xdp, sinfo, sync);
+		mvneta_xdp_put_buff(pp, rxq, xdp, xdp_sinfo, sync);
 		ret = MVNETA_XDP_DROPPED;
 		stats->xdp_drop++;
 		break;
@@ -2241,9 +2244,9 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,
 {
 	unsigned char *data = page_address(page);
 	int data_len = -MVNETA_MH_SIZE, len;
+	struct xdp_shared_info *xdp_sinfo;
 	struct net_device *dev = pp->dev;
 	enum dma_data_direction dma_dir;
-	struct skb_shared_info *sinfo;
 
 	if (*size > MVNETA_MAX_RX_BUF_SIZE) {
 		len = MVNETA_MAX_RX_BUF_SIZE;
@@ -2269,8 +2272,8 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,
 	xdp->data_end = xdp->data + data_len;
 	xdp_set_data_meta_invalid(xdp);
 
-	sinfo = xdp_get_shared_info_from_buff(xdp);
-	sinfo->nr_frags = 0;
+	xdp_sinfo = xdp_get_shared_info_from_buff(xdp);
+	xdp_sinfo->nr_frags = 0;
 }
 
 static void
@@ -2278,7 +2281,7 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,
 			    struct mvneta_rx_desc *rx_desc,
 			    struct mvneta_rx_queue *rxq,
 			    struct xdp_buff *xdp, int *size,
-			    struct skb_shared_info *xdp_sinfo,
+			    struct xdp_shared_info *xdp_sinfo,
 			    struct page *page)
 {
 	struct net_device *dev = pp->dev;
@@ -2301,13 +2304,13 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,
 	if (data_len > 0 && xdp_sinfo->nr_frags < MAX_SKB_FRAGS) {
 		skb_frag_t *frag = &xdp_sinfo->frags[xdp_sinfo->nr_frags++];
 
-		skb_frag_off_set(frag, pp->rx_offset_correction);
-		skb_frag_size_set(frag, data_len);
-		__skb_frag_set_page(frag, page);
+		xdp_set_frag_offset(frag, pp->rx_offset_correction);
+		xdp_set_frag_size(frag, data_len);
+		xdp_set_frag_page(frag, page);
 
 		/* last fragment */
 		if (len == *size) {
-			struct skb_shared_info *sinfo;
+			struct xdp_shared_info *sinfo;
 
 			sinfo = xdp_get_shared_info_from_buff(xdp);
 			sinfo->nr_frags = xdp_sinfo->nr_frags;
@@ -2324,10 +2327,13 @@ static struct sk_buff *
 mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 		      struct xdp_buff *xdp, u32 desc_status)
 {
-	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
-	int i, num_frags = sinfo->nr_frags;
+	struct xdp_shared_info *xdp_sinfo = xdp_get_shared_info_from_buff(xdp);
+	int i, num_frags = xdp_sinfo->nr_frags;
+	skb_frag_t frag_list[MAX_SKB_FRAGS];
 	struct sk_buff *skb;
 
+	memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * num_frags);
+
 	skb = build_skb(xdp->data_hard_start, PAGE_SIZE);
 	if (!skb)
 		return ERR_PTR(-ENOMEM);
@@ -2339,12 +2345,12 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 	mvneta_rx_csum(pp, desc_status, skb);
 
 	for (i = 0; i < num_frags; i++) {
-		skb_frag_t *frag = &sinfo->frags[i];
+		struct page *page = xdp_get_frag_page(&frag_list[i]);
 
 		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
-				skb_frag_page(frag), skb_frag_off(frag),
-				skb_frag_size(frag), PAGE_SIZE);
-		page_pool_release_page(rxq->page_pool, skb_frag_page(frag));
+				page, xdp_get_frag_offset(&frag_list[i]),
+				xdp_get_frag_size(&frag_list[i]), PAGE_SIZE);
+		page_pool_release_page(rxq->page_pool, page);
 	}
 
 	return skb;
@@ -2357,7 +2363,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 {
 	int rx_proc = 0, rx_todo, refill, size = 0;
 	struct net_device *dev = pp->dev;
-	struct skb_shared_info sinfo;
+	struct xdp_shared_info xdp_sinfo;
 	struct mvneta_stats ps = {};
 	struct bpf_prog *xdp_prog;
 	u32 desc_status, frame_sz;
@@ -2368,7 +2374,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 	xdp_buf.rxq = &rxq->xdp_rxq;
 	xdp_buf.mb = 0;
 
-	sinfo.nr_frags = 0;
+	xdp_sinfo.nr_frags = 0;
 
 	/* Get number of received packets */
 	rx_todo = mvneta_rxq_busy_desc_num_get(pp, rxq);
@@ -2412,7 +2418,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 			}
 
 			mvneta_swbm_add_rx_fragment(pp, rx_desc, rxq, &xdp_buf,
-						    &size, &sinfo, page);
+						    &size, &xdp_sinfo, page);
 		} /* Middle or Last descriptor */
 
 		if (!(rx_status & MVNETA_RXD_LAST_DESC))
@@ -2420,7 +2426,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 			continue;
 
 		if (size) {
-			mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo, -1);
+			mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &xdp_sinfo, -1);
 			goto next;
 		}
 
@@ -2432,7 +2438,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 		if (IS_ERR(skb)) {
 			struct mvneta_pcpu_stats *stats = this_cpu_ptr(pp->stats);
 
-			mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo, -1);
+			mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &xdp_sinfo, -1);
 
 			u64_stats_update_begin(&stats->syncp);
 			stats->es.skb_alloc_error++;
@@ -2449,12 +2455,12 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 		napi_gro_receive(napi, skb);
 next:
 		xdp_buf.data_hard_start = NULL;
-		sinfo.nr_frags = 0;
+		xdp_sinfo.nr_frags = 0;
 	}
 	rcu_read_unlock();
 
 	if (xdp_buf.data_hard_start)
-		mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo, -1);
+		mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &xdp_sinfo, -1);
 
 	if (ps.xdp_redirect)
 		xdp_do_flush_map();
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 70559720ff44..614f66d35ee8 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -87,10 +87,54 @@ struct xdp_buff {
 	((xdp)->data_hard_start + (xdp)->frame_sz -	\
 	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
 
-static inline struct skb_shared_info *
+struct xdp_shared_info {
+	u16 nr_frags;
+	u16 data_length; /* paged area length */
+	skb_frag_t frags[MAX_SKB_FRAGS];
+};
+
+static inline struct xdp_shared_info *
 xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
 {
-	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
+	BUILD_BUG_ON(sizeof(struct xdp_shared_info) >
+		     sizeof(struct skb_shared_info));
+	return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
+}
+
+static inline struct page *xdp_get_frag_page(const skb_frag_t *frag)
+{
+	return frag->bv_page;
+}
+
+static inline unsigned int xdp_get_frag_offset(const skb_frag_t *frag)
+{
+	return frag->bv_offset;
+}
+
+static inline unsigned int xdp_get_frag_size(const skb_frag_t *frag)
+{
+	return frag->bv_len;
+}
+
+static inline void *xdp_get_frag_address(const skb_frag_t *frag)
+{
+	return page_address(xdp_get_frag_page(frag)) +
+	       xdp_get_frag_offset(frag);
+}
+
+static inline void xdp_set_frag_page(skb_frag_t *frag, struct page *page)
+{
+	frag->bv_page = page;
+}
+
+static inline void xdp_set_frag_offset(skb_frag_t *frag, u32 offset)
+{
+	frag->bv_offset = offset;
+}
+
+static inline void xdp_set_frag_size(skb_frag_t *frag, u32 size)
+{
+	frag->bv_len = size;
 }
 
 struct xdp_frame {
@@ -120,12 +164,12 @@ static __always_inline void xdp_frame_bulk_init(struct xdp_frame_bulk *bq)
 	bq->xa = NULL;
 }
 
-static inline struct skb_shared_info *
+static inline struct xdp_shared_info *
 xdp_get_shared_info_from_frame(struct xdp_frame *frame)
 {
 	void *data_hard_start = frame->data - frame->headroom - sizeof(*frame);
 
-	return (struct skb_shared_info *)(data_hard_start + frame->frame_sz -
+	return (struct xdp_shared_info *)(data_hard_start + frame->frame_sz -
 				SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
 }
 
-- 
2.28.0
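The xdp_shared_info frag accessors introduced in the patch above mirror the skb_frag_* helpers. As a rough illustration, here is a self-contained userspace model (the `model_*` types and names are invented for the sketch; in the kernel, skb_frag_t wraps a bio_vec and `page_address()` maps a real struct page):

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of the frag accessors: skb_frag_t is modeled as the
 * bio_vec-style triple (page, len, offset), and the helpers mirror the
 * xdp_{set,get}_frag_* ones from the patch. */
struct model_page { char data[4096]; };

typedef struct {
	struct model_page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
} model_frag_t;

static void model_set_frag(model_frag_t *frag, struct model_page *page,
			   unsigned int offset, unsigned int size)
{
	frag->bv_page = page;     /* xdp_set_frag_page() */
	frag->bv_offset = offset; /* xdp_set_frag_offset() */
	frag->bv_len = size;      /* xdp_set_frag_size() */
}

/* Mirrors xdp_get_frag_address(): page address plus the frag offset. */
static void *model_frag_address(const model_frag_t *frag)
{
	return (char *)frag->bv_page + frag->bv_offset;
}
```

The point of the accessor layer is that drivers never touch bv_page/bv_offset/bv_len directly, so the frag layout can change later without touching driver code.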


^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v5 bpf-next 04/14] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (2 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 05/14] xdp: add multi-buff support to xdp_return_{buff/frame} Lorenzo Bianconi
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Set the multi-buffer bit (mb) in xdp_buff to notify the XDP/eBPF layer and
XDP remote drivers that this is a "non-linear" XDP buffer. Access the
xdp_shared_info area only if the xdp_buff mb bit is set.
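The invariant here, sketched below as a userspace model (all `model_*` names are invented for the sketch), is that the shared_info area at the tail of the buffer is only valid when mb is set, so any code walking the fragment list must test mb first:

```c
#include <assert.h>

/* Userspace model of the mb-bit gating: the frags area stands in for the
 * xdp_shared_info region at the end of the first buffer. */
struct frag { unsigned int len; };

struct model_xdp_buff {
	unsigned int mb;       /* 1 if the shared_info/frags area is valid */
	unsigned int nr_frags; /* stands in for xdp_shared_info->nr_frags */
	struct frag frags[16];
};

/* Count the pages that would be returned to the allocator: the head page
 * always, plus one per fragment only for multi-buffer packets (this is
 * the shape of mvneta_xdp_put_buff() after the patch). */
static int pages_to_release(const struct model_xdp_buff *xdp)
{
	unsigned int i;
	int n = 1; /* head buffer is always released */

	if (!xdp->mb) /* frags area may be stale unless mb is set */
		return n;
	for (i = 0; i < xdp->nr_frags; i++)
		n++;
	return n;
}
```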

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index d635463609ad..bac1ae7014eb 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2038,12 +2038,16 @@ mvneta_xdp_put_buff(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 {
 	int i;
 
+	if (likely(!xdp->mb))
+		goto out;
+
 	for (i = 0; i < xdp_sinfo->nr_frags; i++) {
 		skb_frag_t *frag = &xdp_sinfo->frags[i];
 
 		page_pool_put_full_page(rxq->page_pool,
 					xdp_get_frag_page(frag), true);
 	}
+out:
 	page_pool_put_page(rxq->page_pool, virt_to_head_page(xdp->data),
 			   sync_len, true);
 }
@@ -2244,7 +2248,6 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,
 {
 	unsigned char *data = page_address(page);
 	int data_len = -MVNETA_MH_SIZE, len;
-	struct xdp_shared_info *xdp_sinfo;
 	struct net_device *dev = pp->dev;
 	enum dma_data_direction dma_dir;
 
@@ -2271,9 +2274,6 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,
 	xdp->data = data + pp->rx_offset_correction + MVNETA_MH_SIZE;
 	xdp->data_end = xdp->data + data_len;
 	xdp_set_data_meta_invalid(xdp);
-
-	xdp_sinfo = xdp_get_shared_info_from_buff(xdp);
-	xdp_sinfo->nr_frags = 0;
 }
 
 static void
@@ -2308,12 +2308,18 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,
 		xdp_set_frag_size(frag, data_len);
 		xdp_set_frag_page(frag, page);
 
+		if (!xdp->mb) {
+			xdp_sinfo->data_length = *size;
+			xdp->mb = 1;
+		}
+
 		/* last fragment */
 		if (len == *size) {
 			struct xdp_shared_info *sinfo;
 
 			sinfo = xdp_get_shared_info_from_buff(xdp);
 			sinfo->nr_frags = xdp_sinfo->nr_frags;
+			sinfo->data_length = xdp_sinfo->data_length;
+
 			memcpy(sinfo->frags, xdp_sinfo->frags,
 			       sinfo->nr_frags * sizeof(skb_frag_t));
 		}
@@ -2328,11 +2334,13 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 		      struct xdp_buff *xdp, u32 desc_status)
 {
 	struct xdp_shared_info *xdp_sinfo = xdp_get_shared_info_from_buff(xdp);
-	int i, num_frags = xdp_sinfo->nr_frags;
+	int i, num_frags = xdp->mb ? xdp_sinfo->nr_frags : 0;
 	skb_frag_t frag_list[MAX_SKB_FRAGS];
 	struct sk_buff *skb;
 
-	memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * num_frags);
+	if (unlikely(xdp->mb))
+		memcpy(frag_list, xdp_sinfo->frags,
+		       sizeof(skb_frag_t) * num_frags);
 
 	skb = build_skb(xdp->data_hard_start, PAGE_SIZE);
 	if (!skb)
@@ -2344,6 +2352,9 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 	skb_put(skb, xdp->data_end - xdp->data);
 	mvneta_rx_csum(pp, desc_status, skb);
 
+	if (likely(!xdp->mb))
+		return skb;
+
 	for (i = 0; i < num_frags; i++) {
 		struct page *page = xdp_get_frag_page(&frag_list[i]);
 
@@ -2407,6 +2418,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 			frame_sz = size - ETH_FCS_LEN;
 			desc_status = rx_status;
 
+			xdp_buf.mb = 0;
 			mvneta_swbm_rx_frame(pp, rx_desc, rxq, &xdp_buf,
 					     &size, page);
 		} else {
-- 
2.28.0




* [PATCH v5 bpf-next 05/14] xdp: add multi-buff support to xdp_return_{buff/frame}
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (3 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 04/14] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 06/14] net: mvneta: add multi buffer support to XDP_TX Lorenzo Bianconi
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Take into account whether the received xdp_buff/xdp_frame is non-linear
when recycling/returning the frame memory to the allocator or into the
xdp_frame_bulk structure.
Introduce xdp_return_num_frags_from_buff() to return a given number of
fragments of an xdp multi-buff, starting from the tail.
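The tail-trim bookkeeping can be modeled in userspace as follows (a sketch, not the kernel code; `model_sinfo` and `trim_tail_frags` are invented names, and "releasing" a fragment is reduced to subtracting its length):

```c
#include <assert.h>

/* Model of xdp_return_num_frags_from_buff(): release up to num_frags
 * fragments starting from the last one, keeping nr_frags, data_length
 * and the mb bit consistent. */
struct model_sinfo {
	unsigned short nr_frags;
	unsigned short data_length; /* paged area length */
	unsigned short frag_len[8];
};

static void trim_tail_frags(struct model_sinfo *s, unsigned short num_frags,
			    unsigned int *mb)
{
	unsigned short i;

	if (!*mb) /* linear buffer: nothing paged to release */
		return;

	if (num_frags > s->nr_frags)
		num_frags = s->nr_frags; /* min_t(u16, ...) in the patch */

	for (i = 1; i <= num_frags; i++) {
		unsigned short idx = s->nr_frags - i; /* walk from the tail */

		/* here the kernel code returns the frag page via __xdp_return() */
		s->data_length -= s->frag_len[idx];
	}
	s->nr_frags -= num_frags;
	*mb = s->nr_frags != 0; /* clear mb once the buffer is linear again */
}
```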

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/net/xdp.h | 19 ++++++++++--
 net/core/xdp.c    | 76 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 92 insertions(+), 3 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 614f66d35ee8..d0e90d729023 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -258,6 +258,7 @@ void xdp_return_buff(struct xdp_buff *xdp);
 void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq);
 void xdp_return_frame_bulk(struct xdp_frame *xdpf,
 			   struct xdp_frame_bulk *bq);
+void xdp_return_num_frags_from_buff(struct xdp_buff *xdp, u16 num_frags);
 
 /* When sending xdp_frame into the network stack, then there is no
  * return point callback, which is needed to release e.g. DMA-mapping
@@ -268,10 +269,24 @@ void __xdp_release_frame(void *data, struct xdp_mem_info *mem);
 static inline void xdp_release_frame(struct xdp_frame *xdpf)
 {
 	struct xdp_mem_info *mem = &xdpf->mem;
+	struct xdp_shared_info *xdp_sinfo;
+	int i;
 
 	/* Curr only page_pool needs this */
-	if (mem->type == MEM_TYPE_PAGE_POOL)
-		__xdp_release_frame(xdpf->data, mem);
+	if (mem->type != MEM_TYPE_PAGE_POOL)
+		return;
+
+	if (likely(!xdpf->mb))
+		goto out;
+
+	xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
+	for (i = 0; i < xdp_sinfo->nr_frags; i++) {
+		struct page *page = xdp_get_frag_page(&xdp_sinfo->frags[i]);
+
+		__xdp_release_frame(page_address(page), mem);
+	}
+out:
+	__xdp_release_frame(xdpf->data, mem);
 }
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 79dd45234e4d..6c8e743ad03a 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -371,12 +371,38 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct)
 
 void xdp_return_frame(struct xdp_frame *xdpf)
 {
+	struct xdp_shared_info *xdp_sinfo;
+	int i;
+
+	if (likely(!xdpf->mb))
+		goto out;
+
+	xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
+	for (i = 0; i < xdp_sinfo->nr_frags; i++) {
+		struct page *page = xdp_get_frag_page(&xdp_sinfo->frags[i]);
+
+		__xdp_return(page_address(page), &xdpf->mem, false);
+	}
+out:
 	__xdp_return(xdpf->data, &xdpf->mem, false);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame);
 
 void xdp_return_frame_rx_napi(struct xdp_frame *xdpf)
 {
+	struct xdp_shared_info *xdp_sinfo;
+	int i;
+
+	if (likely(!xdpf->mb))
+		goto out;
+
+	xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
+	for (i = 0; i < xdp_sinfo->nr_frags; i++) {
+		struct page *page = xdp_get_frag_page(&xdp_sinfo->frags[i]);
+
+		__xdp_return(page_address(page), &xdpf->mem, true);
+	}
+out:
 	__xdp_return(xdpf->data, &xdpf->mem, true);
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame_rx_napi);
@@ -412,7 +438,7 @@ void xdp_return_frame_bulk(struct xdp_frame *xdpf,
 	struct xdp_mem_allocator *xa;
 
 	if (mem->type != MEM_TYPE_PAGE_POOL) {
-		__xdp_return(xdpf->data, &xdpf->mem, false);
+		xdp_return_frame(xdpf);
 		return;
 	}
 
@@ -431,15 +457,63 @@ void xdp_return_frame_bulk(struct xdp_frame *xdpf,
 		bq->xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
 	}
 
+	if (unlikely(xdpf->mb)) {
+		struct xdp_shared_info *xdp_sinfo;
+		int i;
+
+		xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
+		for (i = 0; i < xdp_sinfo->nr_frags; i++) {
+			skb_frag_t *frag = &xdp_sinfo->frags[i];
+
+			bq->q[bq->count++] = xdp_get_frag_address(frag);
+			if (bq->count == XDP_BULK_QUEUE_SIZE)
+				xdp_flush_frame_bulk(bq);
+		}
+	}
 	bq->q[bq->count++] = xdpf->data;
 }
 EXPORT_SYMBOL_GPL(xdp_return_frame_bulk);
 
 void xdp_return_buff(struct xdp_buff *xdp)
 {
+	struct xdp_shared_info *xdp_sinfo;
+	int i;
+
+	if (likely(!xdp->mb))
+		goto out;
+
+	xdp_sinfo = xdp_get_shared_info_from_buff(xdp);
+	for (i = 0; i < xdp_sinfo->nr_frags; i++) {
+		struct page *page = xdp_get_frag_page(&xdp_sinfo->frags[i]);
+
+		__xdp_return(page_address(page), &xdp->rxq->mem, true);
+	}
+out:
 	__xdp_return(xdp->data, &xdp->rxq->mem, true);
 }
 
+void xdp_return_num_frags_from_buff(struct xdp_buff *xdp, u16 num_frags)
+{
+	struct xdp_shared_info *xdp_sinfo;
+	int i;
+
+	if (unlikely(!xdp->mb))
+		return;
+
+	xdp_sinfo = xdp_get_shared_info_from_buff(xdp);
+	num_frags = min_t(u16, num_frags, xdp_sinfo->nr_frags);
+	for (i = 1; i <= num_frags; i++) {
+		skb_frag_t *frag = &xdp_sinfo->frags[xdp_sinfo->nr_frags - i];
+		struct page *page = xdp_get_frag_page(frag);
+
+		xdp_sinfo->data_length -= xdp_get_frag_size(frag);
+		__xdp_return(page_address(page), &xdp->rxq->mem, false);
+	}
+	xdp_sinfo->nr_frags -= num_frags;
+	xdp->mb = !!xdp_sinfo->nr_frags;
+}
+EXPORT_SYMBOL_GPL(xdp_return_num_frags_from_buff);
+
 /* Only called for MEM_TYPE_PAGE_POOL see xdp.h */
 void __xdp_release_frame(void *data, struct xdp_mem_info *mem)
 {
-- 
2.28.0



* [PATCH v5 bpf-next 06/14] net: mvneta: add multi buffer support to XDP_TX
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (4 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 05/14] xdp: add multi-buff support to xdp_return_{buff/frame} Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-19 15:56   ` Shay Agroskin
  2020-12-07 16:32 ` [PATCH v5 bpf-next 07/14] bpf: move user_size out of bpf_test_init Lorenzo Bianconi
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Introduce the capability to map a non-linear xdp buffer in
mvneta_xdp_submit_frame() for the XDP_TX and XDP_REDIRECT use cases.
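The descriptor-flag logic the patch introduces can be illustrated with a simplified userspace model (the flag values below are invented for the sketch, not the mvneta register layout): the first descriptor of the burst carries the F ("first") flag, the last carries L ("last") plus padding, and middle descriptors carry neither.

```c
#include <assert.h>

#define TXD_F 0x1 /* stands in for MVNETA_TXD_F_DESC */
#define TXD_L 0x2 /* stands in for MVNETA_TXD_L_DESC */
#define TXD_Z 0x4 /* stands in for MVNETA_TXD_Z_PAD */

/* Command word for descriptor i of a num_frames-descriptor frame; in the
 * patch the L|Z bits are OR-ed into the last tx_desc after the loop. */
static unsigned int txd_command(int i, int num_frames)
{
	unsigned int cmd = 0;

	if (i == 0)
		cmd |= TXD_F;
	if (i == num_frames - 1)
		cmd |= TXD_L | TXD_Z;
	return cmd;
}

/* The patch also rejects the frame up front unless the whole burst of
 * descriptors fits in the ring. */
static int ring_has_room(int count, int num_frames, int size)
{
	return count + num_frames < size;
}
```

For a single-buffer frame the two paths collapse: descriptor 0 is both first and last, matching the old MVNETA_TXD_FLZ_DESC behavior.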

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 94 ++++++++++++++++-----------
 1 file changed, 56 insertions(+), 38 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index bac1ae7014eb..dc1f1f25fce0 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -1857,8 +1857,8 @@ static void mvneta_txq_bufs_free(struct mvneta_port *pp,
 			bytes_compl += buf->skb->len;
 			pkts_compl++;
 			dev_kfree_skb_any(buf->skb);
-		} else if (buf->type == MVNETA_TYPE_XDP_TX ||
-			   buf->type == MVNETA_TYPE_XDP_NDO) {
+		} else if ((buf->type == MVNETA_TYPE_XDP_TX ||
+			    buf->type == MVNETA_TYPE_XDP_NDO) && buf->xdpf) {
 			if (napi && buf->type == MVNETA_TYPE_XDP_TX)
 				xdp_return_frame_rx_napi(buf->xdpf);
 			else
@@ -2054,45 +2054,64 @@ mvneta_xdp_put_buff(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,
 
 static int
 mvneta_xdp_submit_frame(struct mvneta_port *pp, struct mvneta_tx_queue *txq,
-			struct xdp_frame *xdpf, bool dma_map)
+			struct xdp_frame *xdpf, int *nxmit_byte, bool dma_map)
 {
-	struct mvneta_tx_desc *tx_desc;
-	struct mvneta_tx_buf *buf;
-	dma_addr_t dma_addr;
+	struct xdp_shared_info *xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
+	int i, num_frames = xdpf->mb ? xdp_sinfo->nr_frags + 1 : 1;
+	struct mvneta_tx_desc *tx_desc = NULL;
+	struct page *page;
 
-	if (txq->count >= txq->tx_stop_threshold)
+	if (txq->count + num_frames >= txq->size)
 		return MVNETA_XDP_DROPPED;
 
-	tx_desc = mvneta_txq_next_desc_get(txq);
+	for (i = 0; i < num_frames; i++) {
+		struct mvneta_tx_buf *buf = &txq->buf[txq->txq_put_index];
+		skb_frag_t *frag = i ? &xdp_sinfo->frags[i - 1] : NULL;
+		int len = frag ? xdp_get_frag_size(frag) : xdpf->len;
+		dma_addr_t dma_addr;
 
-	buf = &txq->buf[txq->txq_put_index];
-	if (dma_map) {
-		/* ndo_xdp_xmit */
-		dma_addr = dma_map_single(pp->dev->dev.parent, xdpf->data,
-					  xdpf->len, DMA_TO_DEVICE);
-		if (dma_mapping_error(pp->dev->dev.parent, dma_addr)) {
-			mvneta_txq_desc_put(txq);
-			return MVNETA_XDP_DROPPED;
+		tx_desc = mvneta_txq_next_desc_get(txq);
+		if (dma_map) {
+			/* ndo_xdp_xmit */
+			void *data;
+
+			data = frag ? xdp_get_frag_address(frag) : xdpf->data;
+			dma_addr = dma_map_single(pp->dev->dev.parent, data,
+						  len, DMA_TO_DEVICE);
+			if (dma_mapping_error(pp->dev->dev.parent, dma_addr)) {
+				for (; i >= 0; i--)
+					mvneta_txq_desc_put(txq);
+				return MVNETA_XDP_DROPPED;
+			}
+			buf->type = MVNETA_TYPE_XDP_NDO;
+		} else {
+			page = frag ? xdp_get_frag_page(frag)
+				    : virt_to_page(xdpf->data);
+			dma_addr = page_pool_get_dma_addr(page);
+			if (frag)
+				dma_addr += xdp_get_frag_offset(frag);
+			else
+				dma_addr += sizeof(*xdpf) + xdpf->headroom;
+			dma_sync_single_for_device(pp->dev->dev.parent,
+						   dma_addr, len,
+						   DMA_BIDIRECTIONAL);
+			buf->type = MVNETA_TYPE_XDP_TX;
 		}
-		buf->type = MVNETA_TYPE_XDP_NDO;
-	} else {
-		struct page *page = virt_to_page(xdpf->data);
+		buf->xdpf = i ? NULL : xdpf;
+
+		tx_desc->command = !i ? MVNETA_TXD_F_DESC : 0;
+		tx_desc->buf_phys_addr = dma_addr;
+		tx_desc->data_size = len;
+		*nxmit_byte += len;
 
-		dma_addr = page_pool_get_dma_addr(page) +
-			   sizeof(*xdpf) + xdpf->headroom;
-		dma_sync_single_for_device(pp->dev->dev.parent, dma_addr,
-					   xdpf->len, DMA_BIDIRECTIONAL);
-		buf->type = MVNETA_TYPE_XDP_TX;
+		mvneta_txq_inc_put(txq);
 	}
-	buf->xdpf = xdpf;
 
-	tx_desc->command = MVNETA_TXD_FLZ_DESC;
-	tx_desc->buf_phys_addr = dma_addr;
-	tx_desc->data_size = xdpf->len;
+	/* last descriptor */
+	tx_desc->command |= MVNETA_TXD_L_DESC | MVNETA_TXD_Z_PAD;
 
-	mvneta_txq_inc_put(txq);
-	txq->pending++;
-	txq->count++;
+	txq->pending += num_frames;
+	txq->count += num_frames;
 
 	return MVNETA_XDP_TX;
 }
@@ -2103,8 +2122,8 @@ mvneta_xdp_xmit_back(struct mvneta_port *pp, struct xdp_buff *xdp)
 	struct mvneta_pcpu_stats *stats = this_cpu_ptr(pp->stats);
 	struct mvneta_tx_queue *txq;
 	struct netdev_queue *nq;
+	int cpu, nxmit_byte = 0;
 	struct xdp_frame *xdpf;
-	int cpu;
 	u32 ret;
 
 	xdpf = xdp_convert_buff_to_frame(xdp);
@@ -2116,10 +2135,10 @@ mvneta_xdp_xmit_back(struct mvneta_port *pp, struct xdp_buff *xdp)
 	nq = netdev_get_tx_queue(pp->dev, txq->id);
 
 	__netif_tx_lock(nq, cpu);
-	ret = mvneta_xdp_submit_frame(pp, txq, xdpf, false);
+	ret = mvneta_xdp_submit_frame(pp, txq, xdpf, &nxmit_byte, false);
 	if (ret == MVNETA_XDP_TX) {
 		u64_stats_update_begin(&stats->syncp);
-		stats->es.ps.tx_bytes += xdpf->len;
+		stats->es.ps.tx_bytes += nxmit_byte;
 		stats->es.ps.tx_packets++;
 		stats->es.ps.xdp_tx++;
 		u64_stats_update_end(&stats->syncp);
@@ -2158,10 +2177,9 @@ mvneta_xdp_xmit(struct net_device *dev, int num_frame,
 
 	__netif_tx_lock(nq, cpu);
 	for (i = 0; i < num_frame; i++) {
-		ret = mvneta_xdp_submit_frame(pp, txq, frames[i], true);
-		if (ret == MVNETA_XDP_TX) {
-			nxmit_byte += frames[i]->len;
-		} else {
+		ret = mvneta_xdp_submit_frame(pp, txq, frames[i], &nxmit_byte,
+					      true);
+		if (ret != MVNETA_XDP_TX) {
 			xdp_return_frame_rx_napi(frames[i]);
 			nxmit--;
 		}
-- 
2.28.0



* [PATCH v5 bpf-next 07/14] bpf: move user_size out of bpf_test_init
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (5 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 06/14] net: mvneta: add multi buffer support to XDP_TX Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 08/14] bpf: introduce multibuff support to bpf_prog_test_run_xdp() Lorenzo Bianconi
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Move the user buffer size (data_size_in) into the bpf_test_init() routine
signature instead of reading it from kattr internally. This is a
preliminary patch to introduce the xdp multi-buff selftests.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 net/bpf/test_run.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index c1c30a9f76f3..bd291f5f539c 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -171,11 +171,10 @@ __diag_pop();
 
 ALLOW_ERROR_INJECTION(bpf_modify_return_test, ERRNO);
 
-static void *bpf_test_init(const union bpf_attr *kattr, u32 size,
-			   u32 headroom, u32 tailroom)
+static void *bpf_test_init(const union bpf_attr *kattr, u32 user_size,
+			   u32 size, u32 headroom, u32 tailroom)
 {
 	void __user *data_in = u64_to_user_ptr(kattr->test.data_in);
-	u32 user_size = kattr->test.data_size_in;
 	void *data;
 
 	if (size < ETH_HLEN || size > PAGE_SIZE - headroom - tailroom)
@@ -495,7 +494,8 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 	if (kattr->test.flags || kattr->test.cpu)
 		return -EINVAL;
 
-	data = bpf_test_init(kattr, size, NET_SKB_PAD + NET_IP_ALIGN,
+	data = bpf_test_init(kattr, kattr->test.data_size_in,
+			     size, NET_SKB_PAD + NET_IP_ALIGN,
 			     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
 	if (IS_ERR(data))
 		return PTR_ERR(data);
@@ -632,7 +632,8 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	/* XDP have extra tailroom as (most) drivers use full page */
 	max_data_sz = 4096 - headroom - tailroom;
 
-	data = bpf_test_init(kattr, max_data_sz, headroom, tailroom);
+	data = bpf_test_init(kattr, kattr->test.data_size_in,
+			     max_data_sz, headroom, tailroom);
 	if (IS_ERR(data))
 		return PTR_ERR(data);
 
@@ -698,7 +699,7 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	if (size < ETH_HLEN)
 		return -EINVAL;
 
-	data = bpf_test_init(kattr, size, 0, 0);
+	data = bpf_test_init(kattr, kattr->test.data_size_in, size, 0, 0);
 	if (IS_ERR(data))
 		return PTR_ERR(data);
 
-- 
2.28.0



* [PATCH v5 bpf-next 08/14] bpf: introduce multibuff support to bpf_prog_test_run_xdp()
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (6 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 07/14] bpf: move user_size out of bpf_test_init Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 09/14] bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature Lorenzo Bianconi
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Introduce the capability to allocate an xdp multi-buff in the
bpf_prog_test_run_xdp() routine. This is a preliminary patch to introduce
the selftests for the new xdp multi-buff eBPF helpers.
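The layout the patch builds can be sketched in userspace (an illustration only; `split_into_frags` and `MODEL_PAGE_SIZE` are invented, and the real code additionally allocates a page and copies user data for each fragment): the first max_linear bytes of data_size_in go in the linear area, and the remainder is split into page-sized fragments.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096

/* Record the fragment lengths an oversized input would be split into;
 * returns the number of fragments (xdp.mb would be set iff this is > 0). */
static int split_into_frags(unsigned int data_size_in, unsigned int max_linear,
			    unsigned int *frag_len, int max_frags)
{
	unsigned int size = data_size_in < max_linear ? data_size_in
						      : max_linear;
	int nr_frags = 0;

	while (size < data_size_in && nr_frags < max_frags) {
		unsigned int len = data_size_in - size;

		if (len > MODEL_PAGE_SIZE) /* min_t(int, remaining, PAGE_SIZE) */
			len = MODEL_PAGE_SIZE;
		frag_len[nr_frags++] = len;
		size += len;
	}
	return nr_frags;
}
```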

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 net/bpf/test_run.c | 52 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 44 insertions(+), 8 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index bd291f5f539c..e4b7b749184d 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -617,23 +617,22 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 {
 	u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	u32 headroom = XDP_PACKET_HEADROOM;
-	u32 size = kattr->test.data_size_in;
+	struct xdp_shared_info *xdp_sinfo;
 	u32 repeat = kattr->test.repeat;
 	struct netdev_rx_queue *rxqueue;
 	struct xdp_buff xdp = {};
+	u32 max_data_sz, size;
 	u32 retval, duration;
-	u32 max_data_sz;
+	int i, ret;
 	void *data;
-	int ret;
 
 	if (kattr->test.ctx_in || kattr->test.ctx_out)
 		return -EINVAL;
 
-	/* XDP have extra tailroom as (most) drivers use full page */
 	max_data_sz = 4096 - headroom - tailroom;
+	size = min_t(u32, kattr->test.data_size_in, max_data_sz);
 
-	data = bpf_test_init(kattr, kattr->test.data_size_in,
-			     max_data_sz, headroom, tailroom);
+	data = bpf_test_init(kattr, size, max_data_sz, headroom, tailroom);
 	if (IS_ERR(data))
 		return PTR_ERR(data);
 
@@ -643,18 +642,55 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	xdp.data_end = xdp.data + size;
 	xdp.frame_sz = headroom + max_data_sz + tailroom;
 
+	xdp_sinfo = xdp_get_shared_info_from_buff(&xdp);
+	if (unlikely(kattr->test.data_size_in > size)) {
+		void __user *data_in = u64_to_user_ptr(kattr->test.data_in);
+
+		while (size < kattr->test.data_size_in) {
+			struct page *page;
+			skb_frag_t *frag;
+			int data_len;
+
+			page = alloc_page(GFP_KERNEL);
+			if (!page) {
+				ret = -ENOMEM;
+				goto out;
+			}
+
+			frag = &xdp_sinfo->frags[xdp_sinfo->nr_frags++];
+			xdp_set_frag_page(frag, page);
+
+			data_len = min_t(int, kattr->test.data_size_in - size,
+					 PAGE_SIZE);
+			xdp_set_frag_size(frag, data_len);
+
+			if (copy_from_user(page_address(page), data_in + size,
+					   data_len)) {
+				ret = -EFAULT;
+				goto out;
+			}
+			xdp_sinfo->data_length += data_len;
+			size += data_len;
+		}
+		xdp.mb = 1;
+	}
+
 	rxqueue = __netif_get_rx_queue(current->nsproxy->net_ns->loopback_dev, 0);
 	xdp.rxq = &rxqueue->xdp_rxq;
 	bpf_prog_change_xdp(NULL, prog);
 	ret = bpf_test_run(prog, &xdp, repeat, &retval, &duration, true);
 	if (ret)
 		goto out;
-	if (xdp.data != data + headroom || xdp.data_end != xdp.data + size)
-		size = xdp.data_end - xdp.data;
+
+	size = xdp.data_end - xdp.data + xdp_sinfo->data_length;
 	ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration);
+
 out:
 	bpf_prog_change_xdp(prog, NULL);
+	for (i = 0; i < xdp_sinfo->nr_frags; i++)
+		__free_page(xdp_get_frag_page(&xdp_sinfo->frags[i]));
 	kfree(data);
+
 	return ret;
 }
 
-- 
2.28.0



* [PATCH v5 bpf-next 09/14] bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (7 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 08/14] bpf: introduce multibuff support to bpf_prog_test_run_xdp() Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 10/14] net: mvneta: enable jumbo frames for XDP Lorenzo Bianconi
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Introduce an xdp_shared_info pointer in the bpf_test_finish() signature in
order to copy paged data from an xdp multi-buff frame back to the
userspace buffer.
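The copy-out order can be modeled in userspace as below (a sketch under invented names; `copy_out` returns -1 where the patch returns -ENOSPC, and plain memcpy() stands in for copy_to_user()): the linear part is copied first, then each fragment, clamped at copy_size.

```c
#include <assert.h>
#include <string.h>

/* Copy linear data followed by fragment data into dst, stopping at
 * copy_size; returns bytes copied, or -1 if the buffer runs out while
 * fragments remain. data_length is the total paged length, so the
 * linear part is copy_size - data_length bytes, as in the patch. */
static int copy_out(char *dst, unsigned int copy_size,
		    const char *linear, unsigned int data_length,
		    const char *const *frags, const unsigned int *frag_len,
		    int nr_frags)
{
	unsigned int offset = copy_size - data_length; /* linear bytes */
	int i;

	memcpy(dst, linear, offset);
	for (i = 0; i < nr_frags; i++) {
		unsigned int data_len;

		if (offset >= copy_size)
			return -1; /* -ENOSPC in the patch */
		data_len = copy_size - offset;
		if (data_len > frag_len[i]) /* min_t(int, ...) */
			data_len = frag_len[i];
		memcpy(dst + offset, frags[i], data_len);
		offset += data_len;
	}
	return offset;
}
```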

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 net/bpf/test_run.c | 46 +++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 39 insertions(+), 7 deletions(-)

diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index e4b7b749184d..32cda04ac7fb 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -81,7 +81,8 @@ static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
 
 static int bpf_test_finish(const union bpf_attr *kattr,
 			   union bpf_attr __user *uattr, const void *data,
-			   u32 size, u32 retval, u32 duration)
+			   struct xdp_shared_info *xdp_sinfo, u32 size,
+			   u32 retval, u32 duration)
 {
 	void __user *data_out = u64_to_user_ptr(kattr->test.data_out);
 	int err = -EFAULT;
@@ -96,8 +97,37 @@ static int bpf_test_finish(const union bpf_attr *kattr,
 		err = -ENOSPC;
 	}
 
-	if (data_out && copy_to_user(data_out, data, copy_size))
-		goto out;
+	if (data_out) {
+		int len = xdp_sinfo ? copy_size - xdp_sinfo->data_length
+				    : copy_size;
+
+		if (copy_to_user(data_out, data, len))
+			goto out;
+
+		if (xdp_sinfo) {
+			int i, offset = len, data_len;
+
+			for (i = 0; i < xdp_sinfo->nr_frags; i++) {
+				skb_frag_t *frag = &xdp_sinfo->frags[i];
+
+				if (offset >= copy_size) {
+					err = -ENOSPC;
+					break;
+				}
+
+				data_len = min_t(int, copy_size - offset,
+						 xdp_get_frag_size(frag));
+
+				if (copy_to_user(data_out + offset,
+						 xdp_get_frag_address(frag),
+						 data_len))
+					goto out;
+
+				offset += data_len;
+			}
+		}
+	}
+
 	if (copy_to_user(&uattr->test.data_size_out, &size, sizeof(size)))
 		goto out;
 	if (copy_to_user(&uattr->test.retval, &retval, sizeof(retval)))
@@ -598,7 +628,8 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 	/* bpf program can never convert linear skb to non-linear */
 	if (WARN_ON_ONCE(skb_is_nonlinear(skb)))
 		size = skb_headlen(skb);
-	ret = bpf_test_finish(kattr, uattr, skb->data, size, retval, duration);
+	ret = bpf_test_finish(kattr, uattr, skb->data, NULL, size, retval,
+			      duration);
 	if (!ret)
 		ret = bpf_ctx_finish(kattr, uattr, ctx,
 				     sizeof(struct __sk_buff));
@@ -683,7 +714,8 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 		goto out;
 
 	size = xdp.data_end - xdp.data + xdp_sinfo->data_length;
-	ret = bpf_test_finish(kattr, uattr, xdp.data, size, retval, duration);
+	ret = bpf_test_finish(kattr, uattr, xdp.data, xdp_sinfo, size, retval,
+			      duration);
 
 out:
 	bpf_prog_change_xdp(prog, NULL);
@@ -794,8 +826,8 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	do_div(time_spent, repeat);
 	duration = time_spent > U32_MAX ? U32_MAX : (u32)time_spent;
 
-	ret = bpf_test_finish(kattr, uattr, &flow_keys, sizeof(flow_keys),
-			      retval, duration);
+	ret = bpf_test_finish(kattr, uattr, &flow_keys, NULL,
+			      sizeof(flow_keys), retval, duration);
 	if (!ret)
 		ret = bpf_ctx_finish(kattr, uattr, user_ctx,
 				     sizeof(struct bpf_flow_keys));
-- 
2.28.0



* [PATCH v5 bpf-next 10/14] net: mvneta: enable jumbo frames for XDP
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (8 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 09/14] bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 11/14] bpf: cpumap: introduce xdp multi-buff support Lorenzo Bianconi
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Enable the capability to receive jumbo frames even if the interface is
running in XDP mode.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 drivers/net/ethernet/marvell/mvneta.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index dc1f1f25fce0..853bea9fcb14 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -3766,11 +3766,6 @@ static int mvneta_change_mtu(struct net_device *dev, int mtu)
 		mtu = ALIGN(MVNETA_RX_PKT_SIZE(mtu), 8);
 	}
 
-	if (pp->xdp_prog && mtu > MVNETA_MAX_RX_BUF_SIZE) {
-		netdev_info(dev, "Illegal MTU value %d for XDP mode\n", mtu);
-		return -EINVAL;
-	}
-
 	dev->mtu = mtu;
 
 	if (!netif_running(dev)) {
@@ -4468,11 +4463,6 @@ static int mvneta_xdp_setup(struct net_device *dev, struct bpf_prog *prog,
 	struct mvneta_port *pp = netdev_priv(dev);
 	struct bpf_prog *old_prog;
 
-	if (prog && dev->mtu > MVNETA_MAX_RX_BUF_SIZE) {
-		NL_SET_ERR_MSG_MOD(extack, "Jumbo frames not supported on XDP");
-		return -EOPNOTSUPP;
-	}
-
 	if (pp->bm_priv) {
 		NL_SET_ERR_MSG_MOD(extack,
 				   "Hardware Buffer Management not supported on XDP");
-- 
2.28.0



* [PATCH v5 bpf-next 11/14] bpf: cpumap: introduce xdp multi-buff support
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (9 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 10/14] net: mvneta: enable jumbo frames for XDP Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-19 17:46   ` Shay Agroskin
  2020-12-07 16:32 ` [PATCH v5 bpf-next 12/14] bpf: add multi-buff support to the bpf_xdp_adjust_tail() API Lorenzo Bianconi
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

Introduce the __xdp_build_skb_from_frame and xdp_build_skb_from_frame
utility routines to build an skb from an xdp_frame.
Add xdp multi-buff support to cpumap.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/net/xdp.h   |  5 ++++
 kernel/bpf/cpumap.c | 45 +---------------------------
 net/core/xdp.c      | 73 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 79 insertions(+), 44 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index d0e90d729023..76cfee6a40f7 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -191,6 +191,11 @@ void xdp_warn(const char *msg, const char *func, const int line);
 #define XDP_WARN(msg) xdp_warn(msg, __func__, __LINE__)
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
+					   struct sk_buff *skb,
+					   struct net_device *dev);
+struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
+					 struct net_device *dev);
 
 static inline
 void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 747313698178..49113880b82a 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -141,49 +141,6 @@ static void cpu_map_kthread_stop(struct work_struct *work)
 	kthread_stop(rcpu->kthread);
 }
 
-static struct sk_buff *cpu_map_build_skb(struct xdp_frame *xdpf,
-					 struct sk_buff *skb)
-{
-	unsigned int hard_start_headroom;
-	unsigned int frame_size;
-	void *pkt_data_start;
-
-	/* Part of headroom was reserved to xdpf */
-	hard_start_headroom = sizeof(struct xdp_frame) +  xdpf->headroom;
-
-	/* Memory size backing xdp_frame data already have reserved
-	 * room for build_skb to place skb_shared_info in tailroom.
-	 */
-	frame_size = xdpf->frame_sz;
-
-	pkt_data_start = xdpf->data - hard_start_headroom;
-	skb = build_skb_around(skb, pkt_data_start, frame_size);
-	if (unlikely(!skb))
-		return NULL;
-
-	skb_reserve(skb, hard_start_headroom);
-	__skb_put(skb, xdpf->len);
-	if (xdpf->metasize)
-		skb_metadata_set(skb, xdpf->metasize);
-
-	/* Essential SKB info: protocol and skb->dev */
-	skb->protocol = eth_type_trans(skb, xdpf->dev_rx);
-
-	/* Optional SKB info, currently missing:
-	 * - HW checksum info		(skb->ip_summed)
-	 * - HW RX hash			(skb_set_hash)
-	 * - RX ring dev queue index	(skb_record_rx_queue)
-	 */
-
-	/* Until page_pool get SKB return path, release DMA here */
-	xdp_release_frame(xdpf);
-
-	/* Allow SKB to reuse area used by xdp_frame */
-	xdp_scrub_frame(xdpf);
-
-	return skb;
-}
-
 static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 {
 	/* The tear-down procedure should have made sure that queue is
@@ -350,7 +307,7 @@ static int cpu_map_kthread_run(void *data)
 			struct sk_buff *skb = skbs[i];
 			int ret;
 
-			skb = cpu_map_build_skb(xdpf, skb);
+			skb = __xdp_build_skb_from_frame(xdpf, skb, xdpf->dev_rx);
 			if (!skb) {
 				xdp_return_frame(xdpf);
 				continue;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 6c8e743ad03a..55f3e9c69427 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -597,3 +597,76 @@ void xdp_warn(const char *msg, const char *func, const int line)
 	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
 };
 EXPORT_SYMBOL_GPL(xdp_warn);
+
+struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
+					   struct sk_buff *skb,
+					   struct net_device *dev)
+{
+	unsigned int headroom = sizeof(*xdpf) + xdpf->headroom;
+	void *hard_start = xdpf->data - headroom;
+	skb_frag_t frag_list[MAX_SKB_FRAGS];
+	struct xdp_shared_info *xdp_sinfo;
+	int i, num_frags = 0;
+
+	xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
+	if (unlikely(xdpf->mb)) {
+		num_frags = xdp_sinfo->nr_frags;
+		memcpy(frag_list, xdp_sinfo->frags,
+		       sizeof(skb_frag_t) * num_frags);
+	}
+
+	skb = build_skb_around(skb, hard_start, xdpf->frame_sz);
+	if (unlikely(!skb))
+		return NULL;
+
+	skb_reserve(skb, headroom);
+	__skb_put(skb, xdpf->len);
+	if (xdpf->metasize)
+		skb_metadata_set(skb, xdpf->metasize);
+
+	if (likely(!num_frags))
+		goto out;
+
+	for (i = 0; i < num_frags; i++) {
+		struct page *page = xdp_get_frag_page(&frag_list[i]);
+
+		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
+				page, xdp_get_frag_offset(&frag_list[i]),
+				xdp_get_frag_size(&frag_list[i]),
+				xdpf->frame_sz);
+	}
+
+out:
+	/* Essential SKB info: protocol and skb->dev */
+	skb->protocol = eth_type_trans(skb, dev);
+
+	/* Optional SKB info, currently missing:
+	 * - HW checksum info		(skb->ip_summed)
+	 * - HW RX hash			(skb_set_hash)
+	 * - RX ring dev queue index	(skb_record_rx_queue)
+	 */
+
+	/* Until page_pool get SKB return path, release DMA here */
+	xdp_release_frame(xdpf);
+
+	/* Allow SKB to reuse area used by xdp_frame */
+	xdp_scrub_frame(xdpf);
+
+	return skb;
+}
+EXPORT_SYMBOL_GPL(__xdp_build_skb_from_frame);
+
+struct sk_buff *xdp_build_skb_from_frame(struct xdp_frame *xdpf,
+					 struct net_device *dev)
+{
+	struct sk_buff *skb;
+
+	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+	if (unlikely(!skb))
+		return NULL;
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+
+	return __xdp_build_skb_from_frame(xdpf, skb, dev);
+}
+EXPORT_SYMBOL_GPL(xdp_build_skb_from_frame);
-- 
2.28.0



* [PATCH v5 bpf-next 12/14] bpf: add multi-buff support to the bpf_xdp_adjust_tail() API
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (10 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 11/14] bpf: cpumap: introduce xdp multi-buff support Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx Lorenzo Bianconi
  2020-12-07 16:32 ` [PATCH v5 bpf-next 14/14] bpf: update xdp_adjust_tail selftest to include multi-buffer Lorenzo Bianconi
  13 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

From: Eelco Chaudron <echaudro@redhat.com>

This change adds support for tail growing and shrinking for XDP multi-buff.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/net/xdp.h |  5 ++++
 net/core/filter.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 76cfee6a40f7..09078ab6644c 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -137,6 +137,11 @@ static inline void xdp_set_frag_size(skb_frag_t *frag, u32 size)
 	frag->bv_len = size;
 }
 
+static inline unsigned int xdp_get_frag_tailroom(const skb_frag_t *frag)
+{
+	return PAGE_SIZE - xdp_get_frag_size(frag) - xdp_get_frag_offset(frag);
+}
+
 struct xdp_frame {
 	void *data;
 	u16 len;
diff --git a/net/core/filter.c b/net/core/filter.c
index 77001a35768f..4c4882d4d92c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3860,11 +3860,74 @@ static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
 	.arg2_type	= ARG_ANYTHING,
 };
 
+static int bpf_xdp_mb_adjust_tail(struct xdp_buff *xdp, int offset)
+{
+	struct xdp_shared_info *xdp_sinfo = xdp_get_shared_info_from_buff(xdp);
+
+	if (unlikely(xdp_sinfo->nr_frags == 0))
+		return -EINVAL;
+
+	if (offset >= 0) {
+		skb_frag_t *frag = &xdp_sinfo->frags[xdp_sinfo->nr_frags - 1];
+		int size;
+
+		if (unlikely(offset > xdp_get_frag_tailroom(frag)))
+			return -EINVAL;
+
+		size = xdp_get_frag_size(frag);
+		memset(xdp_get_frag_address(frag) + size, 0, offset);
+		xdp_set_frag_size(frag, size + offset);
+		xdp_sinfo->data_length += offset;
+	} else {
+		int i, frags_to_free = 0;
+
+		offset = abs(offset);
+
+		if (unlikely(offset > ((int)(xdp->data_end - xdp->data) +
+				       xdp_sinfo->data_length -
+				       ETH_HLEN)))
+			return -EINVAL;
+
+		for (i = xdp_sinfo->nr_frags - 1; i >= 0 && offset > 0; i--) {
+			skb_frag_t *frag = &xdp_sinfo->frags[i];
+			int size = xdp_get_frag_size(frag);
+			int shrink = min_t(int, offset, size);
+
+			offset -= shrink;
+			if (likely(size - shrink > 0)) {
+				/* When updating the final fragment we have
+				 * to adjust the data_length in line.
+				 */
+				xdp_sinfo->data_length -= shrink;
+				xdp_set_frag_size(frag, size - shrink);
+				break;
+			}
+
+			/* When we free the fragments,
+			 * xdp_return_num_frags_from_buff() will take care
+			 * of updating the xdp shared info data_length.
+			 */
+			frags_to_free++;
+		}
+
+		if (unlikely(frags_to_free))
+			xdp_return_num_frags_from_buff(xdp, frags_to_free);
+
+		if (unlikely(offset > 0))
+			xdp->data_end -= offset;
+	}
+
+	return 0;
+}
+
 BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
 {
 	void *data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
 	void *data_end = xdp->data_end + offset;
 
+	if (unlikely(xdp->mb))
+		return bpf_xdp_mb_adjust_tail(xdp, offset);
+
 	/* Notice that xdp_data_hard_end have reserved some tailroom */
 	if (unlikely(data_end > data_hard_end))
 		return -EINVAL;
-- 
2.28.0



* [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (11 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 12/14] bpf: add multi-buff support to the bpf_xdp_adjust_tail() API Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  2020-12-08 22:17   ` Maciej Fijalkowski
  2020-12-07 16:32 ` [PATCH v5 bpf-next 14/14] bpf: update xdp_adjust_tail selftest to include multi-buffer Lorenzo Bianconi
  13 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

From: Eelco Chaudron <echaudro@redhat.com>

This patch adds a new field to the XDP context called frame_length,
which will hold the full length of the packet, including any
fragments present.

eBPF programs can determine if fragments are present using something
like:

  if (ctx->data_end - ctx->data < ctx->frame_length) {
    /* Fragments exist. */
  }

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 include/net/xdp.h              | 22 +++++++++
 include/uapi/linux/bpf.h       |  1 +
 kernel/bpf/verifier.c          |  2 +-
 net/core/filter.c              | 83 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  1 +
 5 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 09078ab6644c..e54d733c90ed 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -73,8 +73,30 @@ struct xdp_buff {
 	void *data_hard_start;
 	struct xdp_rxq_info *rxq;
 	struct xdp_txq_info *txq;
+	/* If any of the bitfield lengths for frame_sz or mb below change,
+	 * make sure the defines here are also updated!
+	 */
+#ifdef __BIG_ENDIAN_BITFIELD
+#define MB_SHIFT	  0
+#define MB_MASK		  0x00000001
+#define FRAME_SZ_SHIFT	  1
+#define FRAME_SZ_MASK	  0xfffffffe
+#else
+#define MB_SHIFT	  31
+#define MB_MASK		  0x80000000
+#define FRAME_SZ_SHIFT	  0
+#define FRAME_SZ_MASK	  0x7fffffff
+#endif
+#define FRAME_SZ_OFFSET() offsetof(struct xdp_buff, __u32_bit_fields_offset)
+#define MB_OFFSET()	  offsetof(struct xdp_buff, __u32_bit_fields_offset)
+	/* private: */
+	u32 __u32_bit_fields_offset[0];
+	/* public: */
 	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved tailroom*/
 	u32 mb:1; /* xdp non-linear buffer */
+
+	/* Temporary registers to make conditional access/stores possible. */
+	u64 tmp_reg[2];
 };
 
 /* Reserve memory area at end-of data area.
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 30b477a26482..62c50ab28ea9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -4380,6 +4380,7 @@ struct xdp_md {
 	__u32 rx_queue_index;  /* rxq->queue_index  */
 
 	__u32 egress_ifindex;  /* txq->dev->ifindex */
+	__u32 frame_length;
 };
 
 /* DEVMAP map-value layout
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 93def76cf32b..c50caea29fa2 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10526,7 +10526,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 	const struct bpf_verifier_ops *ops = env->ops;
 	int i, cnt, size, ctx_field_size, delta = 0;
 	const int insn_cnt = env->prog->len;
-	struct bpf_insn insn_buf[16], *insn;
+	struct bpf_insn insn_buf[32], *insn;
 	u32 target_size, size_default, off;
 	struct bpf_prog *new_prog;
 	enum bpf_access_type type;
diff --git a/net/core/filter.c b/net/core/filter.c
index 4c4882d4d92c..278640db9e0a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8908,6 +8908,7 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 				  struct bpf_insn *insn_buf,
 				  struct bpf_prog *prog, u32 *target_size)
 {
+	int ctx_reg, dst_reg, scratch_reg;
 	struct bpf_insn *insn = insn_buf;
 
 	switch (si->off) {
@@ -8954,6 +8955,88 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
 				      offsetof(struct net_device, ifindex));
 		break;
+	case offsetof(struct xdp_md, frame_length):
+		/* Need tmp storage for src_reg in case src_reg == dst_reg,
+		 * and a scratch reg */
+		scratch_reg = BPF_REG_9;
+		dst_reg = si->dst_reg;
+
+		if (dst_reg == scratch_reg)
+			scratch_reg--;
+
+		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 : si->src_reg;
+		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
+			ctx_reg--;
+
+		/* Save scratch registers */
+		if (ctx_reg != si->src_reg) {
+			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
+					      offsetof(struct xdp_buff,
+						       tmp_reg[1]));
+
+			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
+		}
+
+		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
+				      offsetof(struct xdp_buff, tmp_reg[0]));
+
+		/* What does this code do?
+		 *   dst_reg = 0
+		 *
+		 *   if (!ctx_reg->mb)
+		 *      goto no_mb:
+		 *
+		 *   dst_reg = (struct xdp_shared_info *)xdp_data_hard_end(xdp)
+		 *   dst_reg = dst_reg->data_length
+		 *
+		 * NOTE: xdp_data_hard_end() is xdp->hard_start +
+		 *       xdp->frame_sz - sizeof(shared_info)
+		 *
+		 * no_mb:
+		 *   dst_reg += ctx_reg->data_end - ctx_reg->data
+		 */
+		*insn++ = BPF_MOV64_IMM(dst_reg, 0);
+
+		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg, MB_OFFSET());
+		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, MB_MASK);
+		*insn++ = BPF_JMP_IMM(BPF_JEQ, scratch_reg, 0, 7); /*goto no_mb; */
+
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff,
+						       data_hard_start),
+				      dst_reg, ctx_reg,
+				      offsetof(struct xdp_buff, data_hard_start));
+		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg,
+				      FRAME_SZ_OFFSET());
+		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, FRAME_SZ_MASK);
+		*insn++ = BPF_ALU32_IMM(BPF_RSH, scratch_reg, FRAME_SZ_SHIFT);
+		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
+		*insn++ = BPF_ALU64_IMM(BPF_SUB, dst_reg,
+					SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_shared_info,
+						       data_length),
+				      dst_reg, dst_reg,
+				      offsetof(struct xdp_shared_info,
+					       data_length));
+
+		/* no_mb: */
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_end),
+				      scratch_reg, ctx_reg,
+				      offsetof(struct xdp_buff, data_end));
+		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data),
+				      scratch_reg, ctx_reg,
+				      offsetof(struct xdp_buff, data));
+		*insn++ = BPF_ALU64_REG(BPF_SUB, dst_reg, scratch_reg);
+
+		/* Restore scratch registers */
+		*insn++ = BPF_LDX_MEM(BPF_DW, scratch_reg, ctx_reg,
+				      offsetof(struct xdp_buff, tmp_reg[0]));
+
+		if (ctx_reg != si->src_reg)
+			*insn++ = BPF_LDX_MEM(BPF_DW, ctx_reg, ctx_reg,
+					      offsetof(struct xdp_buff,
+						       tmp_reg[1]));
+		break;
 	}
 
 	return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 30b477a26482..62c50ab28ea9 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -4380,6 +4380,7 @@ struct xdp_md {
 	__u32 rx_queue_index;  /* rxq->queue_index  */
 
 	__u32 egress_ifindex;  /* txq->dev->ifindex */
+	__u32 frame_length;
 };
 
 /* DEVMAP map-value layout
-- 
2.28.0



* [PATCH v5 bpf-next 14/14] bpf: update xdp_adjust_tail selftest to include multi-buffer
  2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
                   ` (12 preceding siblings ...)
  2020-12-07 16:32 ` [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx Lorenzo Bianconi
@ 2020-12-07 16:32 ` Lorenzo Bianconi
  13 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-07 16:32 UTC (permalink / raw)
  To: bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

From: Eelco Chaudron <echaudro@redhat.com>

This change adds test cases for the multi-buffer scenarios when shrinking
and growing.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
---
 .../bpf/prog_tests/xdp_adjust_tail.c          | 105 ++++++++++++++++++
 .../bpf/progs/test_xdp_adjust_tail_grow.c     |  16 +--
 .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 +++++-
 3 files changed, 142 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_adjust_tail.c b/tools/testing/selftests/bpf/prog_tests/xdp_adjust_tail.c
index d5c98f2cb12f..b936beaba797 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_adjust_tail.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_adjust_tail.c
@@ -130,6 +130,107 @@ void test_xdp_adjust_tail_grow2(void)
 	bpf_object__close(obj);
 }
 
+void test_xdp_adjust_mb_tail_shrink(void)
+{
+	const char *file = "./test_xdp_adjust_tail_shrink.o";
+	__u32 duration, retval, size, exp_size;
+	struct bpf_object *obj;
+	static char buf[9000];
+	int err, prog_fd;
+
+	/* For the individual test cases, the first byte in the packet
+	 * indicates which test will be run.
+	 */
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_XDP, &obj, &prog_fd);
+	if (CHECK_FAIL(err))
+		return;
+
+	/* Test case removing 10 bytes from last frag, NOT freeing it */
+	buf[0] = 0;
+	exp_size = sizeof(buf) - 10;
+	err = bpf_prog_test_run(prog_fd, 1, buf, sizeof(buf),
+				buf, &size, &retval, &duration);
+
+	CHECK(err || retval != XDP_TX || size != exp_size,
+	      "9k-10b", "err %d errno %d retval %d[%d] size %d[%u]\n",
+	      err, errno, retval, XDP_TX, size, exp_size);
+
+	/* Test case removing one of two pages, assuming 4K pages */
+	buf[0] = 1;
+	exp_size = sizeof(buf) - 4100;
+	err = bpf_prog_test_run(prog_fd, 1, buf, sizeof(buf),
+				buf, &size, &retval, &duration);
+
+	CHECK(err || retval != XDP_TX || size != exp_size,
+	      "9k-1p", "err %d errno %d retval %d[%d] size %d[%u]\n",
+	      err, errno, retval, XDP_TX, size, exp_size);
+
+	/* Test case removing two pages resulting in a non mb xdp_buff */
+	buf[0] = 2;
+	exp_size = sizeof(buf) - 8200;
+	err = bpf_prog_test_run(prog_fd, 1, buf, sizeof(buf),
+				buf, &size, &retval, &duration);
+
+	CHECK(err || retval != XDP_TX || size != exp_size,
+	      "9k-2p", "err %d errno %d retval %d[%d] size %d[%u]\n",
+	      err, errno, retval, XDP_TX, size, exp_size);
+
+	bpf_object__close(obj);
+}
+
+void test_xdp_adjust_mb_tail_grow(void)
+{
+	const char *file = "./test_xdp_adjust_tail_grow.o";
+	__u32 duration, retval, size, exp_size;
+	static char buf[16384];
+	struct bpf_object *obj;
+	int err, i, prog_fd;
+
+	err = bpf_prog_load(file, BPF_PROG_TYPE_XDP, &obj, &prog_fd);
+	if (CHECK_FAIL(err))
+		return;
+
+	/* Test case add 10 bytes to last frag */
+	memset(buf, 1, sizeof(buf));
+	size = 9000;
+	exp_size = size + 10;
+	err = bpf_prog_test_run(prog_fd, 1, buf, size,
+				buf, &size, &retval, &duration);
+
+	CHECK(err || retval != XDP_TX || size != exp_size,
+	      "9k+10b", "err %d retval %d[%d] size %d[%u]\n",
+	      err, retval, XDP_TX, size, exp_size);
+
+	for (i = 0; i < 9000; i++)
+		CHECK(buf[i] != 1, "9k+10b-old",
+		      "Old data not all ok, offset %i is failing [%u]!\n",
+		      i, buf[i]);
+
+	for (i = 9000; i < 9010; i++)
+		CHECK(buf[i] != 0, "9k+10b-new",
+		      "New data not all ok, offset %i is failing [%u]!\n",
+		      i, buf[i]);
+
+	for (i = 9010; i < sizeof(buf); i++)
+		CHECK(buf[i] != 1, "9k+10b-untouched",
+		      "Unused data not all ok, offset %i is failing [%u]!\n",
+		      i, buf[i]);
+
+	/* Test a too large grow */
+	memset(buf, 1, sizeof(buf));
+	size = 9001;
+	exp_size = size;
+	err = bpf_prog_test_run(prog_fd, 1, buf, size,
+				buf, &size, &retval, &duration);
+
+	CHECK(err || retval != XDP_DROP || size != exp_size,
+	      "9k+10b", "err %d retval %d[%d] size %d[%u]\n",
+	      err, retval, XDP_DROP, size, exp_size);
+
+	bpf_object__close(obj);
+}
+
 void test_xdp_adjust_tail(void)
 {
 	if (test__start_subtest("xdp_adjust_tail_shrink"))
@@ -138,4 +239,8 @@ void test_xdp_adjust_tail(void)
 		test_xdp_adjust_tail_grow();
 	if (test__start_subtest("xdp_adjust_tail_grow2"))
 		test_xdp_adjust_tail_grow2();
+	if (test__start_subtest("xdp_adjust_mb_tail_shrink"))
+		test_xdp_adjust_mb_tail_shrink();
+	if (test__start_subtest("xdp_adjust_mb_tail_grow"))
+		test_xdp_adjust_mb_tail_grow();
 }
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_adjust_tail_grow.c b/tools/testing/selftests/bpf/progs/test_xdp_adjust_tail_grow.c
index 3d66599eee2e..25ac7108a53f 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_adjust_tail_grow.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_adjust_tail_grow.c
@@ -7,20 +7,22 @@ int _xdp_adjust_tail_grow(struct xdp_md *xdp)
 {
 	void *data_end = (void *)(long)xdp->data_end;
 	void *data = (void *)(long)xdp->data;
-	unsigned int data_len;
 	int offset = 0;
 
 	/* Data length determine test case */
-	data_len = data_end - data;
 
-	if (data_len == 54) { /* sizeof(pkt_v4) */
+	if (xdp->frame_length == 54) { /* sizeof(pkt_v4) */
 		offset = 4096; /* test too large offset */
-	} else if (data_len == 74) { /* sizeof(pkt_v6) */
+	} else if (xdp->frame_length == 74) { /* sizeof(pkt_v6) */
 		offset = 40;
-	} else if (data_len == 64) {
+	} else if (xdp->frame_length == 64) {
 		offset = 128;
-	} else if (data_len == 128) {
-		offset = 4096 - 256 - 320 - data_len; /* Max tail grow 3520 */
+	} else if (xdp->frame_length == 128) {
+		offset = 4096 - 256 - 320 - xdp->frame_length; /* Max tail grow 3520 */
+	} else if (xdp->frame_length == 9000) {
+		offset = 10;
+	} else if (xdp->frame_length == 9001) {
+		offset = 4096;
 	} else {
 		return XDP_ABORTED; /* No matching test */
 	}
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_adjust_tail_shrink.c b/tools/testing/selftests/bpf/progs/test_xdp_adjust_tail_shrink.c
index 22065a9cfb25..689450414d29 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_adjust_tail_shrink.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_adjust_tail_shrink.c
@@ -14,14 +14,38 @@ int _version SEC("version") = 1;
 SEC("xdp_adjust_tail_shrink")
 int _xdp_adjust_tail_shrink(struct xdp_md *xdp)
 {
-	void *data_end = (void *)(long)xdp->data_end;
-	void *data = (void *)(long)xdp->data;
+	__u8 *data_end = (void *)(long)xdp->data_end;
+	__u8 *data = (void *)(long)xdp->data;
 	int offset = 0;
 
-	if (data_end - data == 54) /* sizeof(pkt_v4) */
+	switch (xdp->frame_length) {
+	case 54:
+		/* sizeof(pkt_v4) */
 		offset = 256; /* shrink too much */
-	else
+		break;
+	case 9000:
+		/* Multi-buffer test cases */
+		if (data + 1 > data_end)
+			return XDP_DROP;
+
+		switch (data[0]) {
+		case 0:
+			offset = 10;
+			break;
+		case 1:
+			offset = 4100;
+			break;
+		case 2:
+			offset = 8200;
+			break;
+		default:
+			return XDP_DROP;
+		}
+		break;
+	default:
 		offset = 20;
+		break;
+	}
 	if (bpf_xdp_adjust_tail(xdp, 0 - offset))
 		return XDP_DROP;
 	return XDP_TX;
-- 
2.28.0



* Re: [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  2020-12-07 16:32 ` [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers Lorenzo Bianconi
@ 2020-12-07 21:15   ` Alexander Duyck
  2020-12-07 21:37     ` Maciej Fijalkowski
  0 siblings, 1 reply; 48+ messages in thread
From: Alexander Duyck @ 2020-12-07 21:15 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: bpf, Netdev, David Miller, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, shayagr, Jubran, Samih, John Fastabend, dsahern,
	Jesper Dangaard Brouer, Eelco Chaudron, lorenzo.bianconi,
	Jason Wang

On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> This is a preliminary patch to enable xdp multi-buffer support.
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

I'm really not a fan of this design. Having to update every driver in
order to initialize a field that was fragmented is a pain. At a
minimum it seems like it might be time to consider introducing some
sort of initializer function for this so that you can update things in
one central place the next time you have to add a new field instead of
having to update every individual driver that supports XDP. Otherwise
this isn't going to scale going forward.

> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c        | 1 +
>  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c       | 1 +
>  drivers/net/ethernet/cavium/thunder/nicvf_main.c    | 1 +
>  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c    | 1 +
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c         | 1 +
>  drivers/net/ethernet/intel/ice/ice_txrx.c           | 1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c       | 1 +
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c   | 1 +
>  drivers/net/ethernet/marvell/mvneta.c               | 1 +
>  drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c     | 1 +
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c          | 1 +
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c     | 1 +
>  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 1 +
>  drivers/net/ethernet/qlogic/qede/qede_fp.c          | 1 +
>  drivers/net/ethernet/sfc/rx.c                       | 1 +
>  drivers/net/ethernet/socionext/netsec.c             | 1 +
>  drivers/net/ethernet/ti/cpsw.c                      | 1 +
>  drivers/net/ethernet/ti/cpsw_new.c                  | 1 +
>  drivers/net/hyperv/netvsc_bpf.c                     | 1 +
>  drivers/net/tun.c                                   | 2 ++
>  drivers/net/veth.c                                  | 1 +
>  drivers/net/virtio_net.c                            | 2 ++
>  drivers/net/xen-netfront.c                          | 1 +
>  net/core/dev.c                                      | 1 +
>  24 files changed, 26 insertions(+)
>


* Re: [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame
  2020-12-07 16:32 ` [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
@ 2020-12-07 21:16   ` Alexander Duyck
  2020-12-07 23:03     ` Saeed Mahameed
  0 siblings, 1 reply; 48+ messages in thread
From: Alexander Duyck @ 2020-12-07 21:16 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: bpf, Netdev, David Miller, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, shayagr, Jubran, Samih, John Fastabend, dsahern,
	Jesper Dangaard Brouer, Eelco Chaudron, lorenzo.bianconi,
	Jason Wang

On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
>
> Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer data structure
> in order to specify if this is a linear buffer (mb = 0) or a multi-buffer
> frame (mb = 1). In the latter case the shared_info area at the end of the
> first buffer has been properly initialized to link together subsequent
> buffers.
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  include/net/xdp.h | 8 ++++++--
>  net/core/xdp.c    | 1 +
>  2 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 700ad5db7f5d..70559720ff44 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -73,7 +73,8 @@ struct xdp_buff {
>         void *data_hard_start;
>         struct xdp_rxq_info *rxq;
>         struct xdp_txq_info *txq;
> -       u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/
> +       u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved tailroom*/
> +       u32 mb:1; /* xdp non-linear buffer */
>  };
>

If we are really going to do something like this, I say we should just
rip a swath of bits out instead of just grabbing one. Since we are
already cutting the size down, we should decide on the minimum size
that is acceptable and jump straight to that, instead of stealing one
bit at a time. It looks like we already have differences between the
size here and frame_sz in xdp_frame.

If we have to steal a bit, why not look at something like one of the
lower 2-3 bits in rxq? You could then do the same thing using dev_rx
in a similar fashion, instead of stealing from a field that is likely
to be used in multiple spots and that modifying like this adds extra
overhead to.

>  /* Reserve memory area at end-of data area.
> @@ -97,7 +98,8 @@ struct xdp_frame {
>         u16 len;
>         u16 headroom;
>         u32 metasize:8;
> -       u32 frame_sz:24;
> +       u32 frame_sz:23;
> +       u32 mb:1; /* xdp non-linear frame */
>         /* Lifetime of xdp_rxq_info is limited to NAPI/enqueue time,
>          * while mem info is valid on remote CPU.
>          */

Again, if we are just going to start shrinking frame_sz, we should
probably define the limit we are willing to shrink to and go straight
to that value. Otherwise we are going to start jeopardizing backwards
compatibility at some point when we steal too many bits.

> @@ -154,6 +156,7 @@ void xdp_convert_frame_to_buff(struct xdp_frame *frame, struct xdp_buff *xdp)
>         xdp->data_end = frame->data + frame->len;
>         xdp->data_meta = frame->data - frame->metasize;
>         xdp->frame_sz = frame->frame_sz;
> +       xdp->mb = frame->mb;
>  }
>
>  static inline
> @@ -180,6 +183,7 @@ int xdp_update_frame_from_buff(struct xdp_buff *xdp,
>         xdp_frame->headroom = headroom - sizeof(*xdp_frame);
>         xdp_frame->metasize = metasize;
>         xdp_frame->frame_sz = xdp->frame_sz;
> +       xdp_frame->mb = xdp->mb;
>
>         return 0;
>  }
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 17ffd33c6b18..79dd45234e4d 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -509,6 +509,7 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
>         xdpf->headroom = 0;
>         xdpf->metasize = metasize;
>         xdpf->frame_sz = PAGE_SIZE;
> +       xdpf->mb = xdp->mb;
>         xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
>
>         xsk_buff_free(xdp);

At this point all you are doing is moving a meaningless flag. I would
think we would want to wait on adding this code until there is some
meaning behind the bit, since it doesn't make sense to convert a
multi-buffer xdp_frame to a buffer. If nothing else, it really feels
like there is some exception handling missing here, as I would expect
the conversion of a multi-buffer frame to fail: you cannot convert
something from multiple buffers to a single one without redoing
allocations and/or linearizing it.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  2020-12-07 21:15   ` Alexander Duyck
@ 2020-12-07 21:37     ` Maciej Fijalkowski
  2020-12-07 23:20       ` Saeed Mahameed
  0 siblings, 1 reply; 48+ messages in thread
From: Maciej Fijalkowski @ 2020-12-07 21:37 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Lorenzo Bianconi, bpf, Netdev, David Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, shayagr, Jubran, Samih,
	John Fastabend, dsahern, Jesper Dangaard Brouer, Eelco Chaudron,
	lorenzo.bianconi, Jason Wang

On Mon, Dec 07, 2020 at 01:15:00PM -0800, Alexander Duyck wrote:
> On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org> wrote:
> >
> > Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> > This is a preliminary patch to enable xdp multi-buffer support.
> >
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> 
> I'm really not a fan of this design. Having to update every driver in
> order to initialize a field that was fragmented is a pain. At a
> minimum it seems like it might be time to consider introducing some
> sort of initializer function for this so that you can update things in
> one central place the next time you have to add a new field instead of
> having to update every individual driver that supports XDP. Otherwise
> this isn't going to scale going forward.

Also, a good example of why this might be bothersome for us is the fact
that in the meantime the dpaa driver got XDP support and this patch
hasn't been updated to include the mb setting in that driver.

> 
> > ---
> >  drivers/net/ethernet/amazon/ena/ena_netdev.c        | 1 +
> >  drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c       | 1 +
> >  drivers/net/ethernet/cavium/thunder/nicvf_main.c    | 1 +
> >  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c    | 1 +
> >  drivers/net/ethernet/intel/i40e/i40e_txrx.c         | 1 +
> >  drivers/net/ethernet/intel/ice/ice_txrx.c           | 1 +
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c       | 1 +
> >  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c   | 1 +
> >  drivers/net/ethernet/marvell/mvneta.c               | 1 +
> >  drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c     | 1 +
> >  drivers/net/ethernet/mellanox/mlx4/en_rx.c          | 1 +
> >  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c     | 1 +
> >  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 1 +
> >  drivers/net/ethernet/qlogic/qede/qede_fp.c          | 1 +
> >  drivers/net/ethernet/sfc/rx.c                       | 1 +
> >  drivers/net/ethernet/socionext/netsec.c             | 1 +
> >  drivers/net/ethernet/ti/cpsw.c                      | 1 +
> >  drivers/net/ethernet/ti/cpsw_new.c                  | 1 +
> >  drivers/net/hyperv/netvsc_bpf.c                     | 1 +
> >  drivers/net/tun.c                                   | 2 ++
> >  drivers/net/veth.c                                  | 1 +
> >  drivers/net/virtio_net.c                            | 2 ++
> >  drivers/net/xen-netfront.c                          | 1 +
> >  net/core/dev.c                                      | 1 +
> >  24 files changed, 26 insertions(+)
> >

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame
  2020-12-07 21:16   ` Alexander Duyck
@ 2020-12-07 23:03     ` Saeed Mahameed
  2020-12-08  3:16       ` Alexander Duyck
  0 siblings, 1 reply; 48+ messages in thread
From: Saeed Mahameed @ 2020-12-07 23:03 UTC (permalink / raw)
  To: Alexander Duyck, Lorenzo Bianconi
  Cc: bpf, Netdev, David Miller, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, shayagr, Jubran, Samih, John Fastabend, dsahern,
	Jesper Dangaard Brouer, Eelco Chaudron, lorenzo.bianconi,
	Jason Wang

On Mon, 2020-12-07 at 13:16 -0800, Alexander Duyck wrote:
> On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org>
> wrote:
> > Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer data
> > structure
> > in order to specify if this is a linear buffer (mb = 0) or a multi-
> > buffer
> > frame (mb = 1). In the latter case the shared_info area at the end
> > of the
> > first buffer is been properly initialized to link together
> > subsequent
> > buffers.
> > 
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  include/net/xdp.h | 8 ++++++--
> >  net/core/xdp.c    | 1 +
> >  2 files changed, 7 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 700ad5db7f5d..70559720ff44 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -73,7 +73,8 @@ struct xdp_buff {
> >         void *data_hard_start;
> >         struct xdp_rxq_info *rxq;
> >         struct xdp_txq_info *txq;
> > -       u32 frame_sz; /* frame size to deduce
> > data_hard_end/reserved tailroom*/
> > +       u32 frame_sz:31; /* frame size to deduce
> > data_hard_end/reserved tailroom*/
> > +       u32 mb:1; /* xdp non-linear buffer */
> >  };
> > 
> 
> If we are really going to do something like this I say we should just
> rip a swath of bits out instead of just grabbing one. We are already
> cutting the size down then we should just decide on the minimum size
> that is acceptable and just jump to that instead of just stealing one
> bit at a time. It looks like we already have differences between the
> size here and frame_size in xdp_frame.
> 

+1

> If we have to steal a bit why not look at something like one of the
> lower 2/3 bits in rxq? You could then do the same thing using dev_rx
> in a similar fashion instead of stealing from a bit that is likely to
> be used in multiple spots and modifying like this adds extra overhead
> to?
> 

What do you mean by rxq? From the pointer?


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  2020-12-07 21:37     ` Maciej Fijalkowski
@ 2020-12-07 23:20       ` Saeed Mahameed
  2020-12-08 10:31         ` Lorenzo Bianconi
  0 siblings, 1 reply; 48+ messages in thread
From: Saeed Mahameed @ 2020-12-07 23:20 UTC (permalink / raw)
  To: Maciej Fijalkowski, Alexander Duyck
  Cc: Lorenzo Bianconi, bpf, Netdev, David Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, shayagr, Jubran, Samih,
	John Fastabend, dsahern, Jesper Dangaard Brouer, Eelco Chaudron,
	lorenzo.bianconi, Jason Wang

On Mon, 2020-12-07 at 22:37 +0100, Maciej Fijalkowski wrote:
> On Mon, Dec 07, 2020 at 01:15:00PM -0800, Alexander Duyck wrote:
> > On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org
> > > wrote:
> > > Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> > > This is a preliminary patch to enable xdp multi-buffer support.
> > > 
> > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > 
> > I'm really not a fan of this design. Having to update every driver
> > in
> > order to initialize a field that was fragmented is a pain. At a
> > minimum it seems like it might be time to consider introducing some
> > sort of initializer function for this so that you can update things
> > in
> > one central place the next time you have to add a new field instead
> > of
> > having to update every individual driver that supports XDP.
> > Otherwise
> > this isn't going to scale going forward.
> 
> Also, a good example of why this might be bothering for us is a fact
> that
> in the meantime the dpaa driver got XDP support and this patch hasn't
> been
> updated to include mb setting in that driver.
> 
something like
init_xdp_buff(hard_start, headroom, len, frame_sz, rxq);

would work for most of the drivers.
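A minimal user-space sketch of what such a central initializer could look like (the struct layout is trimmed and the helper name `init_xdp_buff` is the one proposed above, not an existing kernel API):

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; fields trimmed to what
 * the initializer touches. */
struct xdp_rxq_info { int queue_index; };

struct xdp_buff {
	void *data;
	void *data_end;
	void *data_meta;
	void *data_hard_start;
	struct xdp_rxq_info *rxq;
	unsigned int frame_sz;
	unsigned int mb;
};

/* One central place to set up an xdp_buff: new fields (like mb) get a
 * default here instead of in every individual driver. */
static void init_xdp_buff(struct xdp_buff *xdp, void *hard_start,
			  unsigned int headroom, unsigned int len,
			  unsigned int frame_sz, struct xdp_rxq_info *rxq)
{
	xdp->data_hard_start = hard_start;
	xdp->data = (char *)hard_start + headroom;
	xdp->data_end = (char *)xdp->data + len;
	xdp->data_meta = xdp->data;
	xdp->frame_sz = frame_sz;
	xdp->rxq = rxq;
	xdp->mb = 0;	/* linear buffer by default */
}
```

With this in place, adding another xdp_buff field later only means touching one function instead of two dozen drivers.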


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
  2020-12-07 16:32 ` [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure Lorenzo Bianconi
@ 2020-12-08  0:22   ` Saeed Mahameed
  2020-12-08 11:01     ` Lorenzo Bianconi
  0 siblings, 1 reply; 48+ messages in thread
From: Saeed Mahameed @ 2020-12-08  0:22 UTC (permalink / raw)
  To: Lorenzo Bianconi, bpf, netdev
  Cc: davem, kuba, ast, daniel, shayagr, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang

On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
> Introduce xdp_shared_info data structure to contain info about
> "non-linear" xdp frame. xdp_shared_info will alias skb_shared_info
> allowing to keep most of the frags in the same cache-line.
> Introduce some xdp_shared_info helpers aligned to skb_frag* ones
> 

Is there, or will there be, a more general-purpose use for this
xdp_shared_info, other than hosting frags?

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 62 +++++++++++++++--------
> ----
>  include/net/xdp.h                     | 52 ++++++++++++++++++++--
>  2 files changed, 82 insertions(+), 32 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvneta.c
> b/drivers/net/ethernet/marvell/mvneta.c
> index 1e5b5c69685a..d635463609ad 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -2033,14 +2033,17 @@ int mvneta_rx_refill_queue(struct mvneta_port
> *pp, struct mvneta_rx_queue *rxq)
>  

[...]

>  static void
> @@ -2278,7 +2281,7 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port
> *pp,
>  			    struct mvneta_rx_desc *rx_desc,
>  			    struct mvneta_rx_queue *rxq,
>  			    struct xdp_buff *xdp, int *size,
> -			    struct skb_shared_info *xdp_sinfo,
> +			    struct xdp_shared_info *xdp_sinfo,
>  			    struct page *page)
>  {
>  	struct net_device *dev = pp->dev;
> @@ -2301,13 +2304,13 @@ mvneta_swbm_add_rx_fragment(struct
> mvneta_port *pp,
>  	if (data_len > 0 && xdp_sinfo->nr_frags < MAX_SKB_FRAGS) {
>  		skb_frag_t *frag = &xdp_sinfo->frags[xdp_sinfo-
> >nr_frags++];
>  
> -		skb_frag_off_set(frag, pp->rx_offset_correction);
> -		skb_frag_size_set(frag, data_len);
> -		__skb_frag_set_page(frag, page);
> +		xdp_set_frag_offset(frag, pp->rx_offset_correction);
> +		xdp_set_frag_size(frag, data_len);
> +		xdp_set_frag_page(frag, page);
>  

Why three separate setters? Why not just one
xdp_set_frag(page, offset, size)?
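For comparison, a combined setter as suggested could look like the following sketch (the function name `xdp_set_frag` is hypothetical, and the struct is a simplified stand-in for skb_frag_t's bio_vec layout):

```c
#include <stddef.h>

struct page;	/* opaque, as in the kernel */

/* Simplified stand-in for skb_frag_t (a bio_vec underneath). */
typedef struct {
	struct page *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
} xdp_frag_t;

/* Hypothetical all-in-one setter: fills the whole frag in one call,
 * replacing the three separate page/offset/size setters. */
static inline void xdp_set_frag(xdp_frag_t *frag, struct page *page,
				unsigned int offset, unsigned int size)
{
	frag->bv_page = page;
	frag->bv_offset = offset;
	frag->bv_len = size;
}
```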

>  		/* last fragment */
>  		if (len == *size) {
> -			struct skb_shared_info *sinfo;
> +			struct xdp_shared_info *sinfo;
>  
>  			sinfo = xdp_get_shared_info_from_buff(xdp);
>  			sinfo->nr_frags = xdp_sinfo->nr_frags;
> @@ -2324,10 +2327,13 @@ static struct sk_buff *
>  mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue
> *rxq,
>  		      struct xdp_buff *xdp, u32 desc_status)
>  {
> -	struct skb_shared_info *sinfo =
> xdp_get_shared_info_from_buff(xdp);
> -	int i, num_frags = sinfo->nr_frags;
> +	struct xdp_shared_info *xdp_sinfo =
> xdp_get_shared_info_from_buff(xdp);
> +	int i, num_frags = xdp_sinfo->nr_frags;
> +	skb_frag_t frag_list[MAX_SKB_FRAGS];
>  	struct sk_buff *skb;
>  
> +	memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) *
> num_frags);
> +
>  	skb = build_skb(xdp->data_hard_start, PAGE_SIZE);
>  	if (!skb)
>  		return ERR_PTR(-ENOMEM);
> @@ -2339,12 +2345,12 @@ mvneta_swbm_build_skb(struct mvneta_port *pp,
> struct mvneta_rx_queue *rxq,
>  	mvneta_rx_csum(pp, desc_status, skb);
>  
>  	for (i = 0; i < num_frags; i++) {
> -		skb_frag_t *frag = &sinfo->frags[i];
> +		struct page *page = xdp_get_frag_page(&frag_list[i]);
>  
>  		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> -				skb_frag_page(frag),
> skb_frag_off(frag),
> -				skb_frag_size(frag), PAGE_SIZE);
> -		page_pool_release_page(rxq->page_pool,
> skb_frag_page(frag));
> +				page,
> xdp_get_frag_offset(&frag_list[i]),
> +				xdp_get_frag_size(&frag_list[i]),
> PAGE_SIZE);
> +		page_pool_release_page(rxq->page_pool, page);
>  	}
>  
>  	return skb;
> @@ -2357,7 +2363,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  {
>  	int rx_proc = 0, rx_todo, refill, size = 0;
>  	struct net_device *dev = pp->dev;
> -	struct skb_shared_info sinfo;
> +	struct xdp_shared_info xdp_sinfo;
>  	struct mvneta_stats ps = {};
>  	struct bpf_prog *xdp_prog;
>  	u32 desc_status, frame_sz;
> @@ -2368,7 +2374,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  	xdp_buf.rxq = &rxq->xdp_rxq;
>  	xdp_buf.mb = 0;
>  
> -	sinfo.nr_frags = 0;
> +	xdp_sinfo.nr_frags = 0;
>  
>  	/* Get number of received packets */
>  	rx_todo = mvneta_rxq_busy_desc_num_get(pp, rxq);
> @@ -2412,7 +2418,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  			}
>  
>  			mvneta_swbm_add_rx_fragment(pp, rx_desc, rxq,
> &xdp_buf,
> -						    &size, &sinfo,
> page);
> +						    &size, &xdp_sinfo,
> page);
>  		} /* Middle or Last descriptor */
>  
>  		if (!(rx_status & MVNETA_RXD_LAST_DESC))
> @@ -2420,7 +2426,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  			continue;
>  
>  		if (size) {
> -			mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo,
> -1);
> +			mvneta_xdp_put_buff(pp, rxq, &xdp_buf,
> &xdp_sinfo, -1);
>  			goto next;
>  		}
>  
> @@ -2432,7 +2438,7 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  		if (IS_ERR(skb)) {
>  			struct mvneta_pcpu_stats *stats =
> this_cpu_ptr(pp->stats);
>  
> -			mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo,
> -1);
> +			mvneta_xdp_put_buff(pp, rxq, &xdp_buf,
> &xdp_sinfo, -1);
>  
>  			u64_stats_update_begin(&stats->syncp);
>  			stats->es.skb_alloc_error++;
> @@ -2449,12 +2455,12 @@ static int mvneta_rx_swbm(struct napi_struct
> *napi,
>  		napi_gro_receive(napi, skb);
>  next:
>  		xdp_buf.data_hard_start = NULL;
> -		sinfo.nr_frags = 0;
> +		xdp_sinfo.nr_frags = 0;
>  	}
>  	rcu_read_unlock();
>  
>  	if (xdp_buf.data_hard_start)
> -		mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &sinfo, -1);
> +		mvneta_xdp_put_buff(pp, rxq, &xdp_buf, &xdp_sinfo, -1);
>  
>  	if (ps.xdp_redirect)
>  		xdp_do_flush_map();
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 70559720ff44..614f66d35ee8 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -87,10 +87,54 @@ struct xdp_buff {
>  	((xdp)->data_hard_start + (xdp)->frame_sz -	\
>  	 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>  
> -static inline struct skb_shared_info *
> +struct xdp_shared_info {

xdp_shared_info is a bad name; we need this to have a specific purpose.
xdp_frags would be the proper name, so people will think twice before
adding weird bits to this so-called shared_info.

> +	u16 nr_frags;
> +	u16 data_length; /* paged area length */
> +	skb_frag_t frags[MAX_SKB_FRAGS];

Why MAX_SKB_FRAGS? Just use a flexible array member:
skb_frag_t frags[];

and enforce the size via nr_frags and on construction of the
tailroom-preserved buffer, which is already being done.

This is a waste of space, at least by the definition of the
struct. In your use case you do:
memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * num_frags);
and the tailroom space was already preserved for a full skb_shinfo,
so I don't see why you need this array to be of a fixed MAX_SKB_FRAGS
size.
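A user-space sketch of the flexible-array variant (illustrative only: `xdp_frag_t` stands in for skb_frag_t, and `xdp_sinfo_size` is a hypothetical helper, not kernel code):

```c
#include <stddef.h>

typedef struct {
	void *page;
	unsigned int off, len;
} xdp_frag_t;	/* stand-in for skb_frag_t */

/* Flexible array member: the struct itself declares only the header;
 * the frags[] storage is whatever tailroom the driver preserved, so
 * no fixed MAX_SKB_FRAGS array is baked into the type. */
struct xdp_shared_info {
	unsigned short nr_frags;
	unsigned short data_length;	/* paged area length */
	xdp_frag_t frags[];
};

/* Bytes actually needed for n frags -- bounded at runtime by the
 * preserved tailroom rather than by the struct definition. */
static size_t xdp_sinfo_size(unsigned int n)
{
	return sizeof(struct xdp_shared_info) + n * sizeof(xdp_frag_t);
}
```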

> +};
> +
> +static inline struct xdp_shared_info *
>  xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
>  {
> -	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> +	BUILD_BUG_ON(sizeof(struct xdp_shared_info) >
> +		     sizeof(struct skb_shared_info));
> +	return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
> +}
> +

Back to my first comment: do we have plans to use this tailroom buffer
for anything other than the frag-list use case? What will the buffer
format be then? Should we push all new fields to the end of the
xdp_shared_info struct, or treat this tailroom buffer as a stack?
My main concern is that drivers that don't support a frag list but
still want to utilize the tailroom buffer for other use cases will
have to skip the first sizeof(xdp_shared_info) bytes so they won't
break the stack.

> +static inline struct page *xdp_get_frag_page(const skb_frag_t *frag)
> +{
> +	return frag->bv_page;
> +}
> +
> +static inline unsigned int xdp_get_frag_offset(const skb_frag_t
> *frag)
> +{
> +	return frag->bv_offset;
> +}
> +
> +static inline unsigned int xdp_get_frag_size(const skb_frag_t *frag)
> +{
> +	return frag->bv_len;
> +}
> +
> +static inline void *xdp_get_frag_address(const skb_frag_t *frag)
> +{
> +	return page_address(xdp_get_frag_page(frag)) +
> +	       xdp_get_frag_offset(frag);
> +}
> +
> +static inline void xdp_set_frag_page(skb_frag_t *frag, struct page
> *page)
> +{
> +	frag->bv_page = page;
> +}
> +
> +static inline void xdp_set_frag_offset(skb_frag_t *frag, u32 offset)
> +{
> +	frag->bv_offset = offset;
> +}
> +
> +static inline void xdp_set_frag_size(skb_frag_t *frag, u32 size)
> +{
> +	frag->bv_len = size;
>  }
>  
>  struct xdp_frame {
> @@ -120,12 +164,12 @@ static __always_inline void
> xdp_frame_bulk_init(struct xdp_frame_bulk *bq)
>  	bq->xa = NULL;
>  }
>  
> -static inline struct skb_shared_info *
> +static inline struct xdp_shared_info *
>  xdp_get_shared_info_from_frame(struct xdp_frame *frame)
>  {
>  	void *data_hard_start = frame->data - frame->headroom -
> sizeof(*frame);
>  
> -	return (struct skb_shared_info *)(data_hard_start + frame-
> >frame_sz -
> +	return (struct xdp_shared_info *)(data_hard_start + frame-
> >frame_sz -
>  				SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info)));
>  }
>  

A comment is needed here explaining why we preserve the size of
skb_shared_info while the usable buffer is of type xdp_shared_info.


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame
  2020-12-07 23:03     ` Saeed Mahameed
@ 2020-12-08  3:16       ` Alexander Duyck
  2020-12-08  6:49         ` Saeed Mahameed
  0 siblings, 1 reply; 48+ messages in thread
From: Alexander Duyck @ 2020-12-08  3:16 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Lorenzo Bianconi, bpf, Netdev, David Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, shayagr, Jubran, Samih,
	John Fastabend, dsahern, Jesper Dangaard Brouer, Eelco Chaudron,
	lorenzo.bianconi, Jason Wang

On Mon, Dec 7, 2020 at 3:03 PM Saeed Mahameed <saeed@kernel.org> wrote:
>
> On Mon, 2020-12-07 at 13:16 -0800, Alexander Duyck wrote:
> > On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org>
> > wrote:
> > > Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer data
> > > structure
> > > in order to specify if this is a linear buffer (mb = 0) or a multi-
> > > buffer
> > > frame (mb = 1). In the latter case the shared_info area at the end
> > > of the
> > > first buffer is been properly initialized to link together
> > > subsequent
> > > buffers.
> > >
> > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > ---
> > >  include/net/xdp.h | 8 ++++++--
> > >  net/core/xdp.c    | 1 +
> > >  2 files changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > index 700ad5db7f5d..70559720ff44 100644
> > > --- a/include/net/xdp.h
> > > +++ b/include/net/xdp.h
> > > @@ -73,7 +73,8 @@ struct xdp_buff {
> > >         void *data_hard_start;
> > >         struct xdp_rxq_info *rxq;
> > >         struct xdp_txq_info *txq;
> > > -       u32 frame_sz; /* frame size to deduce
> > > data_hard_end/reserved tailroom*/
> > > +       u32 frame_sz:31; /* frame size to deduce
> > > data_hard_end/reserved tailroom*/
> > > +       u32 mb:1; /* xdp non-linear buffer */
> > >  };
> > >
> >
> > If we are really going to do something like this I say we should just
> > rip a swath of bits out instead of just grabbing one. We are already
> > cutting the size down then we should just decide on the minimum size
> > that is acceptable and just jump to that instead of just stealing one
> > bit at a time. It looks like we already have differences between the
> > size here and frame_size in xdp_frame.
> >
>
> +1
>
> > If we have to steal a bit why not look at something like one of the
> > lower 2/3 bits in rxq? You could then do the same thing using dev_rx
> > in a similar fashion instead of stealing from a bit that is likely to
> > be used in multiple spots and modifying like this adds extra overhead
> > to?
> >
>
> What do you mean in rxq ? from the pointer ?

Yeah, the pointers have a few bits that are guaranteed to be 0, and in
my mind reusing the lower bits of a 4- or 8-byte-aligned pointer would
make more sense than stealing the upper bits from the size of the
frame.
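The trick being discussed can be sketched in plain C: since a 4- or 8-byte-aligned pointer always has zero low bits, one of them can carry a flag, at the cost of masking on every access (the helper names and the `XDP_FLAG_MB` macro are illustrative, not kernel API):

```c
#include <stdint.h>

#define XDP_FLAG_MB	0x1UL	/* illustrative flag in bit 0 */

/* Store a one-bit flag in the low bit of an aligned pointer. */
static inline void *ptr_pack(void *p, unsigned long mb)
{
	return (void *)((uintptr_t)p | (mb & XDP_FLAG_MB));
}

/* Every reader must strip the flag bit before dereferencing --
 * this masking is the per-access fixup cost of the trick. */
static inline void *ptr_unpack(void *p)
{
	return (void *)((uintptr_t)p & ~XDP_FLAG_MB);
}

static inline unsigned long ptr_flag(void *p)
{
	return (uintptr_t)p & XDP_FLAG_MB;
}
```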

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame
  2020-12-08  3:16       ` Alexander Duyck
@ 2020-12-08  6:49         ` Saeed Mahameed
  2020-12-08  9:47           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 48+ messages in thread
From: Saeed Mahameed @ 2020-12-08  6:49 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Lorenzo Bianconi, bpf, Netdev, David Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, shayagr, Jubran, Samih,
	John Fastabend, dsahern, Jesper Dangaard Brouer, Eelco Chaudron,
	lorenzo.bianconi, Jason Wang

On Mon, 2020-12-07 at 19:16 -0800, Alexander Duyck wrote:
> On Mon, Dec 7, 2020 at 3:03 PM Saeed Mahameed <saeed@kernel.org>
> wrote:
> > On Mon, 2020-12-07 at 13:16 -0800, Alexander Duyck wrote:
> > > On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <
> > > lorenzo@kernel.org>
> > > wrote:
> > > > Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer data
> > > > structure
> > > > in order to specify if this is a linear buffer (mb = 0) or a
> > > > multi-
> > > > buffer
> > > > frame (mb = 1). In the latter case the shared_info area at the
> > > > end
> > > > of the
> > > > first buffer is been properly initialized to link together
> > > > subsequent
> > > > buffers.
> > > > 
> > > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > > ---
> > > >  include/net/xdp.h | 8 ++++++--
> > > >  net/core/xdp.c    | 1 +
> > > >  2 files changed, 7 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > > index 700ad5db7f5d..70559720ff44 100644
> > > > --- a/include/net/xdp.h
> > > > +++ b/include/net/xdp.h
> > > > @@ -73,7 +73,8 @@ struct xdp_buff {
> > > >         void *data_hard_start;
> > > >         struct xdp_rxq_info *rxq;
> > > >         struct xdp_txq_info *txq;
> > > > -       u32 frame_sz; /* frame size to deduce
> > > > data_hard_end/reserved tailroom*/
> > > > +       u32 frame_sz:31; /* frame size to deduce
> > > > data_hard_end/reserved tailroom*/
> > > > +       u32 mb:1; /* xdp non-linear buffer */
> > > >  };
> > > > 
> > > 
> > > If we are really going to do something like this I say we should
> > > just
> > > rip a swath of bits out instead of just grabbing one. We are
> > > already
> > > cutting the size down then we should just decide on the minimum
> > > size
> > > that is acceptable and just jump to that instead of just stealing
> > > one
> > > bit at a time. It looks like we already have differences between
> > > the
> > > size here and frame_size in xdp_frame.
> > > 
> > 
> > +1
> > 
> > > If we have to steal a bit why not look at something like one of
> > > the
> > > lower 2/3 bits in rxq? You could then do the same thing using
> > > dev_rx
> > > in a similar fashion instead of stealing from a bit that is
> > > likely to
> > > be used in multiple spots and modifying like this adds extra
> > > overhead
> > > to?
> > > 
> > 
> > What do you mean in rxq ? from the pointer ?
> 
> Yeah, the pointers have a few bits that are guaranteed 0 and in my
> mind reusing the lower bits from a 4 or 8 byte aligned pointer would
> make more sense then stealing the upper bits from the size of the
> frame.

Ha, I can't imagine what accessing that pointer would look like.
Is it possible to define the pointer as a bit-field and just access it
normally, or do we need to fix it up every time we access it?
Will gcc/static checkers complain about the wrong pointer type?



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame
  2020-12-08  6:49         ` Saeed Mahameed
@ 2020-12-08  9:47           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 48+ messages in thread
From: Jesper Dangaard Brouer @ 2020-12-08  9:47 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Alexander Duyck, Lorenzo Bianconi, bpf, Netdev, David Miller,
	Jakub Kicinski, Alexei Starovoitov, Daniel Borkmann, shayagr,
	Jubran, Samih, John Fastabend, dsahern, Eelco Chaudron,
	lorenzo.bianconi, Jason Wang, brouer

On Mon, 07 Dec 2020 22:49:55 -0800
Saeed Mahameed <saeed@kernel.org> wrote:

> On Mon, 2020-12-07 at 19:16 -0800, Alexander Duyck wrote:
> > On Mon, Dec 7, 2020 at 3:03 PM Saeed Mahameed <saeed@kernel.org>
> > wrote:  
> > > On Mon, 2020-12-07 at 13:16 -0800, Alexander Duyck wrote:  
> > > > On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <  
> > > > lorenzo@kernel.org>  
> > > > wrote:  
> > > > > Introduce multi-buffer bit (mb) in xdp_frame/xdp_buffer data
> > > > > structure
> > > > > in order to specify if this is a linear buffer (mb = 0) or a
> > > > > multi-
> > > > > buffer
> > > > > frame (mb = 1). In the latter case the shared_info area at the
> > > > > end
> > > > > of the
> > > > > first buffer is been properly initialized to link together
> > > > > subsequent
> > > > > buffers.
> > > > > 
> > > > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > > > ---
> > > > >  include/net/xdp.h | 8 ++++++--
> > > > >  net/core/xdp.c    | 1 +
> > > > >  2 files changed, 7 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > > > index 700ad5db7f5d..70559720ff44 100644
> > > > > --- a/include/net/xdp.h
> > > > > +++ b/include/net/xdp.h
> > > > > @@ -73,7 +73,8 @@ struct xdp_buff {
> > > > >         void *data_hard_start;
> > > > >         struct xdp_rxq_info *rxq;
> > > > >         struct xdp_txq_info *txq;
> > > > > -       u32 frame_sz; /* frame size to deduce
> > > > > data_hard_end/reserved tailroom*/
> > > > > +       u32 frame_sz:31; /* frame size to deduce
> > > > > data_hard_end/reserved tailroom*/
> > > > > +       u32 mb:1; /* xdp non-linear buffer */
> > > > >  };
> > > > >   
> > > > 
> > > > If we are really going to do something like this I say we should
> > > > just
> > > > rip a swath of bits out instead of just grabbing one. We are
> > > > already
> > > > cutting the size down then we should just decide on the minimum
> > > > size
> > > > that is acceptable and just jump to that instead of just stealing
> > > > one
> > > > bit at a time. It looks like we already have differences between
> > > > the
> > > > size here and frame_size in xdp_frame.
> > > >   
> > > 
> > > +1
> > >   
> > > > If we have to steal a bit why not look at something like one of
> > > > the
> > > > lower 2/3 bits in rxq? You could then do the same thing using
> > > > dev_rx
> > > > in a similar fashion instead of stealing from a bit that is
> > > > likely to
> > > > be used in multiple spots and modifying like this adds extra
> > > > overhead
> > > > to?
> > > >   
> > > 
> > > What do you mean in rxq ? from the pointer ?  
> > 
> > Yeah, the pointers have a few bits that are guaranteed 0 and in my
> > mind reusing the lower bits from a 4 or 8 byte aligned pointer would
> > make more sense then stealing the upper bits from the size of the
> > frame.  
> 
> Ha, i can't imagine how accessing that pointer would look like ..
> is possible to define the pointer as a bit-field and just access it
> normally ? or do we need to fix it up every time we need to access it ?
> will gcc/static checkers complain about wrong pointer type ?

This is a pattern that is used all over the kernel. Yes, it needs to
be fixed up every time we access it. In this case, we don't want to
deploy this trick, for two reasons: (1) rxq is accessed by BPF
byte-code rewrites (which would also need to handle masking out the
bit), and (2) this optimization trades CPU cycles for saving space.

IIRC Alexei has already pointed out that the change to struct xdp_buff
looks suboptimal. Why not simply add a u8 with the info?

The general point is that the struct xdp_buff layout is for fast
access, and struct xdp_frame is a state-compressed version of
xdp_buff. (Still, room in xdp_buff is limited to 64 bytes, one
cacheline, which it is already rather close to according to pahole.)

Thus, it is more acceptable to do these bit tricks in struct
xdp_frame. For xdp_frame, it might be better to take some room/space
from the 'mem' member (struct xdp_mem_info). (Would it help later if
the multi-buffer bit were officially part of struct xdp_mem_info, when
eventually freeing the memory backing the frame?)
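One way to carve the bit out of xdp_mem_info could be a bitfield on its type member, roughly as sketched below (purely illustrative: the real kernel struct uses plain u32 fields, and whether the MEM_TYPE_* range can spare a bit is an assumption here):

```c
/* xdp_mem_info is 8 bytes (see the pahole output below); the type
 * member has far more range than the handful of MEM_TYPE_* enum
 * values need, so one bit could hypothetically be spared for mb. */
struct xdp_mem_info {
	unsigned int type:31;	/* MEM_TYPE_* value; needs only a few bits */
	unsigned int mb:1;	/* multi-buffer flag, colocated with mem info */
	unsigned int id;
};
```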

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

$ pahole -C xdp_buff
struct xdp_buff {
	void *                     data;                 /*     0     8 */
	void *                     data_end;             /*     8     8 */
	void *                     data_meta;            /*    16     8 */
	void *                     data_hard_start;      /*    24     8 */
	struct xdp_rxq_info *      rxq;                  /*    32     8 */
	struct xdp_txq_info *      txq;                  /*    40     8 */
	u32                        frame_sz;             /*    48     4 */

	/* size: 56, cachelines: 1, members: 7 */
	/* padding: 4 */
	/* last cacheline: 56 bytes */
};

$ pahole -C xdp_frame
struct xdp_frame {
	void *                     data;                 /*     0     8 */
	u16                        len;                  /*     8     2 */
	u16                        headroom;             /*    10     2 */
	u32                        metasize:8;           /*    12: 0  4 */
	u32                        frame_sz:24;          /*    12: 8  4 */
	struct xdp_mem_info        mem;                  /*    16     8 */
	struct net_device *        dev_rx;               /*    24     8 */

	/* size: 32, cachelines: 1, members: 7 */
	/* last cacheline: 32 bytes */
};

$ pahole -C xdp_mem_info
struct xdp_mem_info {
	u32                        type;                 /*     0     4 */
	u32                        id;                   /*     4     4 */

	/* size: 8, cachelines: 1, members: 2 */
	/* last cacheline: 8 bytes */
};


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  2020-12-07 23:20       ` Saeed Mahameed
@ 2020-12-08 10:31         ` Lorenzo Bianconi
  2020-12-08 13:29           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-08 10:31 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Maciej Fijalkowski, Alexander Duyck, Lorenzo Bianconi, bpf,
	Netdev, David Miller, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, shayagr, Jubran, Samih, John Fastabend, dsahern,
	Jesper Dangaard Brouer, Eelco Chaudron, Jason Wang

[-- Attachment #1: Type: text/plain, Size: 1412 bytes --]

> On Mon, 2020-12-07 at 22:37 +0100, Maciej Fijalkowski wrote:
> > On Mon, Dec 07, 2020 at 01:15:00PM -0800, Alexander Duyck wrote:
> > > On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org
> > > > wrote:
> > > > Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> > > > This is a preliminary patch to enable xdp multi-buffer support.
> > > > 
> > > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > 
> > > I'm really not a fan of this design. Having to update every driver
> > > in
> > > order to initialize a field that was fragmented is a pain. At a
> > > minimum it seems like it might be time to consider introducing some
> > > sort of initializer function for this so that you can update things
> > > in
> > > one central place the next time you have to add a new field instead
> > > of
> > > having to update every individual driver that supports XDP.
> > > Otherwise
> > > this isn't going to scale going forward.
> > 
> > Also, a good example of why this might be bothering for us is a fact
> > that
> > in the meantime the dpaa driver got XDP support and this patch hasn't
> > been
> > updated to include mb setting in that driver.
> > 
> something like
> init_xdp_buff(hard_start, headroom, len, frame_sz, rxq);
> 
> would work for most of the drivers.
> 

ack, agree. I will add init_xdp_buff() in v6.

Regards,
Lorenzo

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]


* Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
  2020-12-08  0:22   ` Saeed Mahameed
@ 2020-12-08 11:01     ` Lorenzo Bianconi
  2020-12-19 14:53       ` Shay Agroskin
  0 siblings, 1 reply; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-08 11:01 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Lorenzo Bianconi, bpf, netdev, davem, kuba, ast, daniel, shayagr,
	sameehj, john.fastabend, dsahern, brouer, echaudro, jasowang

[-- Attachment #1: Type: text/plain, Size: 7571 bytes --]

> On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
> > Introduce xdp_shared_info data structure to contain info about
> > "non-linear" xdp frame. xdp_shared_info will alias skb_shared_info
> > allowing to keep most of the frags in the same cache-line.
> > Introduce some xdp_shared_info helpers aligned to skb_frag* ones
> > 
> 
> is there or will be a more general purpose use to this xdp_shared_info
> ? other than hosting frags ?

I do not have other use-cases at the moment besides multi-buffer, but
in theory it is possible, I guess.
The reason we introduced it is to keep most of the frags in the first
shared_info cache-line and so avoid cache-misses.

> 
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  drivers/net/ethernet/marvell/mvneta.c | 62 +++++++++++++++--------
> > ----
> >  include/net/xdp.h                     | 52 ++++++++++++++++++++--
> >  2 files changed, 82 insertions(+), 32 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c
> > b/drivers/net/ethernet/marvell/mvneta.c
> > index 1e5b5c69685a..d635463609ad 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -2033,14 +2033,17 @@ int mvneta_rx_refill_queue(struct mvneta_port
> > *pp, struct mvneta_rx_queue *rxq)
> >  
> 
> [...]
> 
> >  static void
> > @@ -2278,7 +2281,7 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port
> > *pp,
> >  			    struct mvneta_rx_desc *rx_desc,
> >  			    struct mvneta_rx_queue *rxq,
> >  			    struct xdp_buff *xdp, int *size,
> > -			    struct skb_shared_info *xdp_sinfo,
> > +			    struct xdp_shared_info *xdp_sinfo,
> >  			    struct page *page)
> >  {
> >  	struct net_device *dev = pp->dev;
> > @@ -2301,13 +2304,13 @@ mvneta_swbm_add_rx_fragment(struct
> > mvneta_port *pp,
> >  	if (data_len > 0 && xdp_sinfo->nr_frags < MAX_SKB_FRAGS) {
> >  		skb_frag_t *frag = &xdp_sinfo->frags[xdp_sinfo-
> > >nr_frags++];
> >  
> > -		skb_frag_off_set(frag, pp->rx_offset_correction);
> > -		skb_frag_size_set(frag, data_len);
> > -		__skb_frag_set_page(frag, page);
> > +		xdp_set_frag_offset(frag, pp->rx_offset_correction);
> > +		xdp_set_frag_size(frag, data_len);
> > +		xdp_set_frag_page(frag, page);
> >  
> 
> why three separate setters ? why not just one 
> xdp_set_frag(page, offset, size) ?

to be aligned with the skb_frag helpers, but I guess we can have a
single helper; I do not have a strong opinion on it

> 
> >  		/* last fragment */
> >  		if (len == *size) {
> > -			struct skb_shared_info *sinfo;
> > +			struct xdp_shared_info *sinfo;
> >  
> >  			sinfo = xdp_get_shared_info_from_buff(xdp);
> >  			sinfo->nr_frags = xdp_sinfo->nr_frags;
> > @@ -2324,10 +2327,13 @@ static struct sk_buff *
> >  mvneta_swbm_build_skb(struct mvneta_port *pp, struct mvneta_rx_queue
> > *rxq,
> >  		      struct xdp_buff *xdp, u32 desc_status)
> >  {

[...]

> >  
> > -static inline struct skb_shared_info *
> > +struct xdp_shared_info {
> 
> xdp_shared_info is a bad name, we need this to have a specific purpose 
> xdp_frags should the proper name, so people will think twice before
> adding weird bits to this so called shared_info.

I named the struct xdp_shared_info to recall skb_shared_info, but I
guess xdp_frags is fine too. Agree?

> 
> > +	u16 nr_frags;
> > +	u16 data_length; /* paged area length */
> > +	skb_frag_t frags[MAX_SKB_FRAGS];
> 
> why MAX_SKB_FRAGS ? just use a flexible array member 
> skb_frag_t frags[]; 
> 
> and enforce size via the n_frags and on the construction of the
> tailroom preserved buffer, which is already being done.
> 
> this is waste of unnecessary space, at lease by definition of the
> struct, in your use case you do:
> memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * num_frags);
> And the tailroom space was already preserved for a full skb_shinfo.
> so i don't see why you need this array to be of a fixed MAX_SKB_FRAGS
> size.

In order to avoid cache-misses, the xdp_shared_info is built as a variable
on the mvneta_rx_swbm() stack and is written to the "shared_info" area only
on the last fragment in mvneta_swbm_add_rx_fragment(). I used MAX_SKB_FRAGS
to be aligned with the skb_shared_info struct, but we can probably use an
even smaller value.
Another approach would be to define two different struct, e.g.

struct xdp_frag_metadata {
	u16 nr_frags;
	u16 data_length; /* paged area length */
};

struct xdp_frags {
	skb_frag_t frags[MAX_SKB_FRAGS];
};

and then define xdp_shared_info as

struct xdp_shared_info {
	struct xdp_frag_metadata meta;
	skb_frag_t frags[];
};

In this way we can probably optimize the space. What do you think?

> 
> > +};
> > +
> > +static inline struct xdp_shared_info *
> >  xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
> >  {
> > -	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> > +	BUILD_BUG_ON(sizeof(struct xdp_shared_info) >
> > +		     sizeof(struct skb_shared_info));
> > +	return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
> > +}
> > +
> 
> Back to my first comment, do we have plans to use this tail room buffer
> for other than frag_list use cases ? what will be the buffer format
> then ? should we push all new fields to the end of the xdp_shared_info
> struct ? or deal with this tailroom buffer as a stack ? 
> my main concern is that for drivers that don't support frag list and
> still want to utilize the tailroom buffer for other usecases they will
> have to skip the first sizeof(xdp_shared_info) so they won't break the
> stack.

For the moment I am not aware of this area being used for other
purposes. Do you think there are other use-cases for it?

> 
> > +static inline struct page *xdp_get_frag_page(const skb_frag_t *frag)
> > +{
> > +	return frag->bv_page;
> > +}
> > +
> > +static inline unsigned int xdp_get_frag_offset(const skb_frag_t
> > *frag)
> > +{
> > +	return frag->bv_offset;
> > +}
> > +
> > +static inline unsigned int xdp_get_frag_size(const skb_frag_t *frag)
> > +{
> > +	return frag->bv_len;
> > +}
> > +
> > +static inline void *xdp_get_frag_address(const skb_frag_t *frag)
> > +{
> > +	return page_address(xdp_get_frag_page(frag)) +
> > +	       xdp_get_frag_offset(frag);
> > +}
> > +
> > +static inline void xdp_set_frag_page(skb_frag_t *frag, struct page
> > *page)
> > +{
> > +	frag->bv_page = page;
> > +}
> > +
> > +static inline void xdp_set_frag_offset(skb_frag_t *frag, u32 offset)
> > +{
> > +	frag->bv_offset = offset;
> > +}
> > +
> > +static inline void xdp_set_frag_size(skb_frag_t *frag, u32 size)
> > +{
> > +	frag->bv_len = size;
> >  }
> >  
> >  struct xdp_frame {
> > @@ -120,12 +164,12 @@ static __always_inline void
> > xdp_frame_bulk_init(struct xdp_frame_bulk *bq)
> >  	bq->xa = NULL;
> >  }
> >  
> > -static inline struct skb_shared_info *
> > +static inline struct xdp_shared_info *
> >  xdp_get_shared_info_from_frame(struct xdp_frame *frame)
> >  {
> >  	void *data_hard_start = frame->data - frame->headroom -
> > sizeof(*frame);
> >  
> > -	return (struct skb_shared_info *)(data_hard_start + frame-
> > >frame_sz -
> > +	return (struct xdp_shared_info *)(data_hard_start + frame-
> > >frame_sz -
> >  				SKB_DATA_ALIGN(sizeof(struct
> > skb_shared_info)));
> >  }
> >  
> 
> need a comment here why we preserve the size of skb_shared_info, yet
> the usable buffer is of type xdp_shared_info.

ack, I will add it in v6.

Regards,
Lorenzo

> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]


* Re: [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers
  2020-12-08 10:31         ` Lorenzo Bianconi
@ 2020-12-08 13:29           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 48+ messages in thread
From: Jesper Dangaard Brouer @ 2020-12-08 13:29 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Saeed Mahameed, Maciej Fijalkowski, Alexander Duyck,
	Lorenzo Bianconi, bpf, Netdev, David Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, shayagr, Jubran, Samih,
	John Fastabend, dsahern, Eelco Chaudron, Jason Wang, brouer

On Tue, 8 Dec 2020 11:31:03 +0100
Lorenzo Bianconi <lorenzo.bianconi@redhat.com> wrote:

> > On Mon, 2020-12-07 at 22:37 +0100, Maciej Fijalkowski wrote:  
> > > On Mon, Dec 07, 2020 at 01:15:00PM -0800, Alexander Duyck wrote:  
> > > > On Mon, Dec 7, 2020 at 8:36 AM Lorenzo Bianconi <lorenzo@kernel.org  
> > > > > wrote:
> > > > > Initialize multi-buffer bit (mb) to 0 in all XDP-capable drivers.
> > > > > This is a preliminary patch to enable xdp multi-buffer support.
> > > > > 
> > > > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>  
> > > > 
> > > > I'm really not a fan of this design. Having to update every driver in
> > > > order to initialize a field that was fragmented is a pain. At a
> > > > minimum it seems like it might be time to consider introducing some
> > > > sort of initializer function for this so that you can update things in
> > > > one central place the next time you have to add a new field instead of
> > > > having to update every individual driver that supports XDP. Otherwise
> > > > this isn't going to scale going forward.  

+1

> > > Also, a good example of why this might be bothering for us is a fact that
> > > in the meantime the dpaa driver got XDP support and this patch hasn't been
> > > updated to include mb setting in that driver.
> > >   
> > something like
> > init_xdp_buff(hard_start, headroom, len, frame_sz, rxq);
> >
> > would work for most of the drivers.
> >   
> 
> ack, agree. I will add init_xdp_buff() in v6.

I do like the idea of an initialize helper function.
Remember this is fast-path code and likely needs to be inlined.

Furthermore, remember that drivers can and do optimize the number of
writes they do to xdp_buff.  There are a number of fields in xdp_buff
that only need to be initialized once per NAPI cycle, e.g. rxq and
frame_sz (although some drivers do change frame_sz per packet).  Thus,
you likely need two inlined helpers for init.

Again, remember that the C compiler will generate an expensive
operation (rep stos) for clearing the struct if it is initialized like
this, where not all members are given initializers (do NOT do this):

 struct xdp_buff xdp = {
   .rxq = rxq,
   .frame_sz = PAGE_SIZE,
 };

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2020-12-07 16:32 ` [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx Lorenzo Bianconi
@ 2020-12-08 22:17   ` Maciej Fijalkowski
  2020-12-09 10:35     ` Eelco Chaudron
  0 siblings, 1 reply; 48+ messages in thread
From: Maciej Fijalkowski @ 2020-12-08 22:17 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: bpf, netdev, davem, kuba, ast, daniel, shayagr, sameehj,
	john.fastabend, dsahern, brouer, echaudro, lorenzo.bianconi,
	jasowang

On Mon, Dec 07, 2020 at 05:32:42PM +0100, Lorenzo Bianconi wrote:
> From: Eelco Chaudron <echaudro@redhat.com>
> 
> This patch adds a new field to the XDP context called frame_length,
> which will hold the full length of the packet, including fragments
> if existing.

The approach you took for ctx access conversion is barely described :/

> 
> eBPF programs can determine if fragments are present using something
> like:
> 
>   if (ctx->data_end - ctx->data < ctx->frame_length) {
>     /* Fragments exist. */
>   }
> 
> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  include/net/xdp.h              | 22 +++++++++
>  include/uapi/linux/bpf.h       |  1 +
>  kernel/bpf/verifier.c          |  2 +-
>  net/core/filter.c              | 83 ++++++++++++++++++++++++++++++++++
>  tools/include/uapi/linux/bpf.h |  1 +
>  5 files changed, 108 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 09078ab6644c..e54d733c90ed 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -73,8 +73,30 @@ struct xdp_buff {
>  	void *data_hard_start;
>  	struct xdp_rxq_info *rxq;
>  	struct xdp_txq_info *txq;
> +	/* If any of the bitfield lengths for frame_sz or mb below change,
> +	 * make sure the defines here are also updated!
> +	 */
> +#ifdef __BIG_ENDIAN_BITFIELD
> +#define MB_SHIFT	  0
> +#define MB_MASK		  0x00000001
> +#define FRAME_SZ_SHIFT	  1
> +#define FRAME_SZ_MASK	  0xfffffffe
> +#else
> +#define MB_SHIFT	  31
> +#define MB_MASK		  0x80000000
> +#define FRAME_SZ_SHIFT	  0
> +#define FRAME_SZ_MASK	  0x7fffffff
> +#endif
> +#define FRAME_SZ_OFFSET() offsetof(struct xdp_buff, __u32_bit_fields_offset)
> +#define MB_OFFSET()	  offsetof(struct xdp_buff, __u32_bit_fields_offset)
> +	/* private: */
> +	u32 __u32_bit_fields_offset[0];

Why? I don't get that. Please explain.
Also, looking at all the need for masking/shifting, I wonder if it would
be better to have u32 frame_sz and u8 mb...

> +	/* public: */
>  	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved tailroom*/
>  	u32 mb:1; /* xdp non-linear buffer */
> +
> +	/* Temporary registers to make conditional access/stores possible. */
> +	u64 tmp_reg[2];

IMHO this kills the bitfield approach we have for vars above.

>  };
>  
>  /* Reserve memory area at end-of data area.
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 30b477a26482..62c50ab28ea9 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -4380,6 +4380,7 @@ struct xdp_md {
>  	__u32 rx_queue_index;  /* rxq->queue_index  */
>  
>  	__u32 egress_ifindex;  /* txq->dev->ifindex */
> +	__u32 frame_length;
>  };
>  
>  /* DEVMAP map-value layout
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 93def76cf32b..c50caea29fa2 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -10526,7 +10526,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
>  	const struct bpf_verifier_ops *ops = env->ops;
>  	int i, cnt, size, ctx_field_size, delta = 0;
>  	const int insn_cnt = env->prog->len;
> -	struct bpf_insn insn_buf[16], *insn;
> +	struct bpf_insn insn_buf[32], *insn;
>  	u32 target_size, size_default, off;
>  	struct bpf_prog *new_prog;
>  	enum bpf_access_type type;
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4c4882d4d92c..278640db9e0a 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -8908,6 +8908,7 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
>  				  struct bpf_insn *insn_buf,
>  				  struct bpf_prog *prog, u32 *target_size)
>  {
> +	int ctx_reg, dst_reg, scratch_reg;
>  	struct bpf_insn *insn = insn_buf;
>  
>  	switch (si->off) {
> @@ -8954,6 +8955,88 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
>  		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
>  				      offsetof(struct net_device, ifindex));
>  		break;
> +	case offsetof(struct xdp_md, frame_length):
> +		/* Need tmp storage for src_reg in case src_reg == dst_reg,
> +		 * and a scratch reg */
> +		scratch_reg = BPF_REG_9;
> +		dst_reg = si->dst_reg;
> +
> +		if (dst_reg == scratch_reg)
> +			scratch_reg--;
> +
> +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 : si->src_reg;
> +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
> +			ctx_reg--;
> +
> +		/* Save scratch registers */
> +		if (ctx_reg != si->src_reg) {
> +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
> +					      offsetof(struct xdp_buff,
> +						       tmp_reg[1]));
> +
> +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
> +		}
> +
> +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
> +				      offsetof(struct xdp_buff, tmp_reg[0]));

Why don't you push regs to stack, use it and then pop it back? That way I
suppose you could avoid polluting xdp_buff with tmp_reg[2].

> +
> +		/* What does this code do?
> +		 *   dst_reg = 0
> +		 *
> +		 *   if (!ctx_reg->mb)
> +		 *      goto no_mb:
> +		 *
> +		 *   dst_reg = (struct xdp_shared_info *)xdp_data_hard_end(xdp)
> +		 *   dst_reg = dst_reg->data_length
> +		 *
> +		 * NOTE: xdp_data_hard_end() is xdp->hard_start +
> +		 *       xdp->frame_sz - sizeof(shared_info)
> +		 *
> +		 * no_mb:
> +		 *   dst_reg += ctx_reg->data_end - ctx_reg->data
> +		 */
> +		*insn++ = BPF_MOV64_IMM(dst_reg, 0);
> +
> +		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg, MB_OFFSET());
> +		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, MB_MASK);
> +		*insn++ = BPF_JMP_IMM(BPF_JEQ, scratch_reg, 0, 7); /*goto no_mb; */
> +
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff,
> +						       data_hard_start),
> +				      dst_reg, ctx_reg,
> +				      offsetof(struct xdp_buff, data_hard_start));
> +		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg,
> +				      FRAME_SZ_OFFSET());
> +		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, FRAME_SZ_MASK);
> +		*insn++ = BPF_ALU32_IMM(BPF_RSH, scratch_reg, FRAME_SZ_SHIFT);
> +		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
> +		*insn++ = BPF_ALU64_IMM(BPF_SUB, dst_reg,
> +					SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_shared_info,
> +						       data_length),
> +				      dst_reg, dst_reg,
> +				      offsetof(struct xdp_shared_info,
> +					       data_length));
> +
> +		/* no_mb: */
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_end),
> +				      scratch_reg, ctx_reg,
> +				      offsetof(struct xdp_buff, data_end));
> +		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data),
> +				      scratch_reg, ctx_reg,
> +				      offsetof(struct xdp_buff, data));
> +		*insn++ = BPF_ALU64_REG(BPF_SUB, dst_reg, scratch_reg);
> +
> +		/* Restore scratch registers */
> +		*insn++ = BPF_LDX_MEM(BPF_DW, scratch_reg, ctx_reg,
> +				      offsetof(struct xdp_buff, tmp_reg[0]));
> +
> +		if (ctx_reg != si->src_reg)
> +			*insn++ = BPF_LDX_MEM(BPF_DW, ctx_reg, ctx_reg,
> +					      offsetof(struct xdp_buff,
> +						       tmp_reg[1]));
> +		break;
>  	}
>  
>  	return insn - insn_buf;
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index 30b477a26482..62c50ab28ea9 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -4380,6 +4380,7 @@ struct xdp_md {
>  	__u32 rx_queue_index;  /* rxq->queue_index  */
>  
>  	__u32 egress_ifindex;  /* txq->dev->ifindex */
> +	__u32 frame_length;
>  };
>  
>  /* DEVMAP map-value layout
> -- 
> 2.28.0
> 


* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2020-12-08 22:17   ` Maciej Fijalkowski
@ 2020-12-09 10:35     ` Eelco Chaudron
  2020-12-09 11:10       ` Maciej Fijalkowski
  0 siblings, 1 reply; 48+ messages in thread
From: Eelco Chaudron @ 2020-12-09 10:35 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Lorenzo Bianconi, bpf, netdev, davem, kuba, ast, daniel, shayagr,
	sameehj, john.fastabend, dsahern, brouer, lorenzo.bianconi,
	jasowang



On 8 Dec 2020, at 23:17, Maciej Fijalkowski wrote:

> On Mon, Dec 07, 2020 at 05:32:42PM +0100, Lorenzo Bianconi wrote:
>> From: Eelco Chaudron <echaudro@redhat.com>
>>
>> This patch adds a new field to the XDP context called frame_length,
>> which will hold the full length of the packet, including fragments
>> if existing.
>
> The approach you took for ctx access conversion is barely described :/

You are right, I should have added some details on why I chose this
approach. The reason is to avoid a dedicated entry in the xdp_frame
structure and having to maintain it in the various eBPF helpers.

I'll update the commit message in the next revision to include this.

>>
>> eBPF programs can determine if fragments are present using something
>> like:
>>
>>   if (ctx->data_end - ctx->data < ctx->frame_length) {
>>     /* Fragments exist. */
>>   }
>>
>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
>> ---
>>  include/net/xdp.h              | 22 +++++++++
>>  include/uapi/linux/bpf.h       |  1 +
>>  kernel/bpf/verifier.c          |  2 +-
>>  net/core/filter.c              | 83 
>> ++++++++++++++++++++++++++++++++++
>>  tools/include/uapi/linux/bpf.h |  1 +
>>  5 files changed, 108 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>> index 09078ab6644c..e54d733c90ed 100644
>> --- a/include/net/xdp.h
>> +++ b/include/net/xdp.h
>> @@ -73,8 +73,30 @@ struct xdp_buff {
>>  	void *data_hard_start;
>>  	struct xdp_rxq_info *rxq;
>>  	struct xdp_txq_info *txq;
>> +	/* If any of the bitfield lengths for frame_sz or mb below change,
>> +	 * make sure the defines here are also updated!
>> +	 */
>> +#ifdef __BIG_ENDIAN_BITFIELD
>> +#define MB_SHIFT	  0
>> +#define MB_MASK		  0x00000001
>> +#define FRAME_SZ_SHIFT	  1
>> +#define FRAME_SZ_MASK	  0xfffffffe
>> +#else
>> +#define MB_SHIFT	  31
>> +#define MB_MASK		  0x80000000
>> +#define FRAME_SZ_SHIFT	  0
>> +#define FRAME_SZ_MASK	  0x7fffffff
>> +#endif
>> +#define FRAME_SZ_OFFSET() offsetof(struct xdp_buff, 
>> __u32_bit_fields_offset)
>> +#define MB_OFFSET()	  offsetof(struct xdp_buff, 
>> __u32_bit_fields_offset)
>> +	/* private: */
>> +	u32 __u32_bit_fields_offset[0];
>
> Why? I don't get that. Please explain.

I was trying to find an easy way to extract the data/fields, maybe using
BTF, but had no luck.
So I resorted to an existing approach in sk_buff, see
https://elixir.bootlin.com/linux/v5.10-rc7/source/include/linux/skbuff.h#L780

> Also, looking at all the need for masking/shifting, I wonder if it 
> would
> be better to have u32 frame_sz and u8 mb...

Yes, I agree having a u32 would be way better, even a u32 for the mb
field. I’ve seen other code converting flags to u32 for easy access in
the eBPF context structures.

I see there are some comments in general on the bit definitions for mb,
but I’ll try to convince people to use a u32 for both in the next
revision, as I think size is not a real problem for the xdp_buff
structure ;)

>> +	/* public: */
>>  	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved 
>> tailroom*/
>>  	u32 mb:1; /* xdp non-linear buffer */
>> +
>> +	/* Temporary registers to make conditional access/stores possible. 
>> */
>> +	u64 tmp_reg[2];
>
> IMHO this kills the bitfield approach we have for vars above.

See above…

>>  };
>>
>>  /* Reserve memory area at end-of data area.
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 30b477a26482..62c50ab28ea9 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -4380,6 +4380,7 @@ struct xdp_md {
>>  	__u32 rx_queue_index;  /* rxq->queue_index  */
>>
>>  	__u32 egress_ifindex;  /* txq->dev->ifindex */
>> +	__u32 frame_length;
>>  };
>>
>>  /* DEVMAP map-value layout
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 93def76cf32b..c50caea29fa2 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -10526,7 +10526,7 @@ static int convert_ctx_accesses(struct 
>> bpf_verifier_env *env)
>>  	const struct bpf_verifier_ops *ops = env->ops;
>>  	int i, cnt, size, ctx_field_size, delta = 0;
>>  	const int insn_cnt = env->prog->len;
>> -	struct bpf_insn insn_buf[16], *insn;
>> +	struct bpf_insn insn_buf[32], *insn;
>>  	u32 target_size, size_default, off;
>>  	struct bpf_prog *new_prog;
>>  	enum bpf_access_type type;
>> diff --git a/net/core/filter.c b/net/core/filter.c
>> index 4c4882d4d92c..278640db9e0a 100644
>> --- a/net/core/filter.c
>> +++ b/net/core/filter.c
>> @@ -8908,6 +8908,7 @@ static u32 xdp_convert_ctx_access(enum 
>> bpf_access_type type,
>>  				  struct bpf_insn *insn_buf,
>>  				  struct bpf_prog *prog, u32 *target_size)
>>  {
>> +	int ctx_reg, dst_reg, scratch_reg;
>>  	struct bpf_insn *insn = insn_buf;
>>
>>  	switch (si->off) {
>> @@ -8954,6 +8955,88 @@ static u32 xdp_convert_ctx_access(enum 
>> bpf_access_type type,
>>  		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
>>  				      offsetof(struct net_device, ifindex));
>>  		break;
>> +	case offsetof(struct xdp_md, frame_length):
>> +		/* Need tmp storage for src_reg in case src_reg == dst_reg,
>> +		 * and a scratch reg */
>> +		scratch_reg = BPF_REG_9;
>> +		dst_reg = si->dst_reg;
>> +
>> +		if (dst_reg == scratch_reg)
>> +			scratch_reg--;
>> +
>> +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 : 
>> si->src_reg;
>> +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
>> +			ctx_reg--;
>> +
>> +		/* Save scratch registers */
>> +		if (ctx_reg != si->src_reg) {
>> +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
>> +					      offsetof(struct xdp_buff,
>> +						       tmp_reg[1]));
>> +
>> +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
>> +		}
>> +
>> +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
>> +				      offsetof(struct xdp_buff, tmp_reg[0]));
>
> Why don't you push regs to stack, use it and then pop it back? That 
> way I
> suppose you could avoid polluting xdp_buff with tmp_reg[2].

There is no “real” stack in eBPF, only a read-only frame pointer, 
and as we are replacing a single instruction, we have no info on what we 
can use as scratch space.

>> +
>> +		/* What does this code do?
>> +		 *   dst_reg = 0
>> +		 *
>> +		 *   if (!ctx_reg->mb)
>> +		 *      goto no_mb:
>> +		 *
>> +		 *   dst_reg = (struct xdp_shared_info *)xdp_data_hard_end(xdp)
>> +		 *   dst_reg = dst_reg->data_length
>> +		 *
>> +		 * NOTE: xdp_data_hard_end() is xdp->hard_start +
>> +		 *       xdp->frame_sz - sizeof(shared_info)
>> +		 *
>> +		 * no_mb:
>> +		 *   dst_reg += ctx_reg->data_end - ctx_reg->data
>> +		 */
>> +		*insn++ = BPF_MOV64_IMM(dst_reg, 0);
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg, MB_OFFSET());
>> +		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, MB_MASK);
>> +		*insn++ = BPF_JMP_IMM(BPF_JEQ, scratch_reg, 0, 7); /*goto no_mb; 
>> */
>> +
>> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff,
>> +						       data_hard_start),
>> +				      dst_reg, ctx_reg,
>> +				      offsetof(struct xdp_buff, data_hard_start));
>> +		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg,
>> +				      FRAME_SZ_OFFSET());
>> +		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, FRAME_SZ_MASK);
>> +		*insn++ = BPF_ALU32_IMM(BPF_RSH, scratch_reg, FRAME_SZ_SHIFT);
>> +		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
>> +		*insn++ = BPF_ALU64_IMM(BPF_SUB, dst_reg,
>> +					SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
>> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_shared_info,
>> +						       data_length),
>> +				      dst_reg, dst_reg,
>> +				      offsetof(struct xdp_shared_info,
>> +					       data_length));
>> +
>> +		/* no_mb: */
>> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_end),
>> +				      scratch_reg, ctx_reg,
>> +				      offsetof(struct xdp_buff, data_end));
>> +		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
>> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data),
>> +				      scratch_reg, ctx_reg,
>> +				      offsetof(struct xdp_buff, data));
>> +		*insn++ = BPF_ALU64_REG(BPF_SUB, dst_reg, scratch_reg);
>> +
>> +		/* Restore scratch registers */
>> +		*insn++ = BPF_LDX_MEM(BPF_DW, scratch_reg, ctx_reg,
>> +				      offsetof(struct xdp_buff, tmp_reg[0]));
>> +
>> +		if (ctx_reg != si->src_reg)
>> +			*insn++ = BPF_LDX_MEM(BPF_DW, ctx_reg, ctx_reg,
>> +					      offsetof(struct xdp_buff,
>> +						       tmp_reg[1]));
>> +		break;
>>  	}
>>
>>  	return insn - insn_buf;
>> diff --git a/tools/include/uapi/linux/bpf.h 
>> b/tools/include/uapi/linux/bpf.h
>> index 30b477a26482..62c50ab28ea9 100644
>> --- a/tools/include/uapi/linux/bpf.h
>> +++ b/tools/include/uapi/linux/bpf.h
>> @@ -4380,6 +4380,7 @@ struct xdp_md {
>>  	__u32 rx_queue_index;  /* rxq->queue_index  */
>>
>>  	__u32 egress_ifindex;  /* txq->dev->ifindex */
>> +	__u32 frame_length;
>>  };
>>
>>  /* DEVMAP map-value layout
>> -- 
>> 2.28.0
>>



* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2020-12-09 10:35     ` Eelco Chaudron
@ 2020-12-09 11:10       ` Maciej Fijalkowski
  2020-12-09 12:07         ` Eelco Chaudron
  0 siblings, 1 reply; 48+ messages in thread
From: Maciej Fijalkowski @ 2020-12-09 11:10 UTC (permalink / raw)
  To: Eelco Chaudron
  Cc: Lorenzo Bianconi, bpf, netdev, davem, kuba, ast, daniel, shayagr,
	sameehj, john.fastabend, dsahern, brouer, lorenzo.bianconi,
	jasowang

On Wed, Dec 09, 2020 at 11:35:13AM +0100, Eelco Chaudron wrote:
> 
> 
> On 8 Dec 2020, at 23:17, Maciej Fijalkowski wrote:
> 
> > On Mon, Dec 07, 2020 at 05:32:42PM +0100, Lorenzo Bianconi wrote:
> > > From: Eelco Chaudron <echaudro@redhat.com>
> > > 
> > > This patch adds a new field to the XDP context called frame_length,
> > > which will hold the full length of the packet, including fragments
> > > if existing.
> > 
> > The approach you took for ctx access conversion is barely described :/
> 
> You are right, I should have added some details on why I have chosen to take
> this approach. The reason is to avoid a dedicated entry in the xdp_frame
> structure and having to maintain it in the various eBPF helpers.
> 
> I'll update the commit message in the next revision to include this.
> 
> > > 
> > > eBPF programs can determine if fragments are present using something
> > > like:
> > > 
> > >   if (ctx->data_end - ctx->data < ctx->frame_length) {
> > >     /* Fragments exist. */
> > >   }
> > > 
> > > Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
> > > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > > ---
> > >  include/net/xdp.h              | 22 +++++++++
> > >  include/uapi/linux/bpf.h       |  1 +
> > >  kernel/bpf/verifier.c          |  2 +-
> > >  net/core/filter.c              | 83
> > > ++++++++++++++++++++++++++++++++++
> > >  tools/include/uapi/linux/bpf.h |  1 +
> > >  5 files changed, 108 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > > index 09078ab6644c..e54d733c90ed 100644
> > > --- a/include/net/xdp.h
> > > +++ b/include/net/xdp.h
> > > @@ -73,8 +73,30 @@ struct xdp_buff {
> > >  	void *data_hard_start;
> > >  	struct xdp_rxq_info *rxq;
> > >  	struct xdp_txq_info *txq;
> > > +	/* If any of the bitfield lengths for frame_sz or mb below change,
> > > +	 * make sure the defines here are also updated!
> > > +	 */
> > > +#ifdef __BIG_ENDIAN_BITFIELD
> > > +#define MB_SHIFT	  0
> > > +#define MB_MASK		  0x00000001
> > > +#define FRAME_SZ_SHIFT	  1
> > > +#define FRAME_SZ_MASK	  0xfffffffe
> > > +#else
> > > +#define MB_SHIFT	  31
> > > +#define MB_MASK		  0x80000000
> > > +#define FRAME_SZ_SHIFT	  0
> > > +#define FRAME_SZ_MASK	  0x7fffffff
> > > +#endif
> > > +#define FRAME_SZ_OFFSET() offsetof(struct xdp_buff,
> > > __u32_bit_fields_offset)
> > > +#define MB_OFFSET()	  offsetof(struct xdp_buff,
> > > __u32_bit_fields_offset)
> > > +	/* private: */
> > > +	u32 __u32_bit_fields_offset[0];
> > 
> > Why? I don't get that. Please explain.
> 
> I was trying to find an easy way to extract the data/fields, maybe using BTF
> but had no luck.
> So I resorted back to an existing approach in sk_buff, see
> https://elixir.bootlin.com/linux/v5.10-rc7/source/include/linux/skbuff.h#L780
> 
> > Also, looking at all the need for masking/shifting, I wonder if it would
> > be better to have u32 frame_sz and u8 mb...
> 
> Yes, I agree having a u32 would be way better, even a u32 for the mb field.
> I’ve seen other code converting flags to u32 for easy access in the eBPF
> context structures.
> 
> I see there are some general comments on the bit definitions for mb,
> but I’ll try to convince people to use u32 for both in the next revision, as
> I think size is not a real problem for the xdp_buff structure ;)

Generally people were really strict on xdp_buff extensions as we didn't
want to end up with another skb-like monster. I think Jesper somewhere
said that one cacheline is max for that. With your tmp_reg[2] you exceed
that from what I see, but I might be short on coffee.

> 
> > > +	/* public: */
> > >  	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved
> > > tailroom*/
> > >  	u32 mb:1; /* xdp non-linear buffer */
> > > +
> > > +	/* Temporary registers to make conditional access/stores possible.
> > > */
> > > +	u64 tmp_reg[2];
> > 
> > IMHO this kills the bitfield approach we have for vars above.
> 
> See above…
> 
> > >  };
> > > 
> > >  /* Reserve memory area at end-of data area.
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 30b477a26482..62c50ab28ea9 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -4380,6 +4380,7 @@ struct xdp_md {
> > >  	__u32 rx_queue_index;  /* rxq->queue_index  */
> > > 
> > >  	__u32 egress_ifindex;  /* txq->dev->ifindex */
> > > +	__u32 frame_length;
> > >  };
> > > 
> > >  /* DEVMAP map-value layout
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 93def76cf32b..c50caea29fa2 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -10526,7 +10526,7 @@ static int convert_ctx_accesses(struct
> > > bpf_verifier_env *env)
> > >  	const struct bpf_verifier_ops *ops = env->ops;
> > >  	int i, cnt, size, ctx_field_size, delta = 0;
> > >  	const int insn_cnt = env->prog->len;
> > > -	struct bpf_insn insn_buf[16], *insn;
> > > +	struct bpf_insn insn_buf[32], *insn;
> > >  	u32 target_size, size_default, off;
> > >  	struct bpf_prog *new_prog;
> > >  	enum bpf_access_type type;
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 4c4882d4d92c..278640db9e0a 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -8908,6 +8908,7 @@ static u32 xdp_convert_ctx_access(enum
> > > bpf_access_type type,
> > >  				  struct bpf_insn *insn_buf,
> > >  				  struct bpf_prog *prog, u32 *target_size)
> > >  {
> > > +	int ctx_reg, dst_reg, scratch_reg;
> > >  	struct bpf_insn *insn = insn_buf;
> > > 
> > >  	switch (si->off) {
> > > @@ -8954,6 +8955,88 @@ static u32 xdp_convert_ctx_access(enum
> > > bpf_access_type type,
> > >  		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
> > >  				      offsetof(struct net_device, ifindex));
> > >  		break;
> > > +	case offsetof(struct xdp_md, frame_length):
> > > +		/* Need tmp storage for src_reg in case src_reg == dst_reg,
> > > +		 * and a scratch reg */
> > > +		scratch_reg = BPF_REG_9;
> > > +		dst_reg = si->dst_reg;
> > > +
> > > +		if (dst_reg == scratch_reg)
> > > +			scratch_reg--;
> > > +
> > > +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 :
> > > si->src_reg;
> > > +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
> > > +			ctx_reg--;
> > > +
> > > +		/* Save scratch registers */
> > > +		if (ctx_reg != si->src_reg) {
> > > +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
> > > +					      offsetof(struct xdp_buff,
> > > +						       tmp_reg[1]));
> > > +
> > > +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
> > > +		}
> > > +
> > > +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
> > > +				      offsetof(struct xdp_buff, tmp_reg[0]));
> > 
> > Why don't you push regs to stack, use it and then pop it back? That way
> > I
> > suppose you could avoid polluting xdp_buff with tmp_reg[2].
> 
> There is no “real” stack in eBPF, only a read-only frame pointer, and as we
> are replacing a single instruction, we have no info on what we can use as
> scratch space.

Uhm, what? You use R10 for stack operations. The verifier tracks the stack
depth used by programs, and it is then passed down to the JIT so that the
native asm will create a properly sized stack frame.

From the top of my head, I would let xdp_convert_ctx_access() know the
current stack depth and use it for R10 stores, so your scratch space would
be at R10 + (stack_depth + 8) and R10 + (stack_depth + 16).

Problem with that would be the fact that convert_ctx_accesses() happens to
be called after the check_max_stack_depth(), so probably stack_depth of a
prog that has frame_length accesses would have to be adjusted earlier.

> 
> > > +
> > > +		/* What does this code do?
> > > +		 *   dst_reg = 0
> > > +		 *
> > > +		 *   if (!ctx_reg->mb)
> > > +		 *      goto no_mb:
> > > +		 *
> > > +		 *   dst_reg = (struct xdp_shared_info *)xdp_data_hard_end(xdp)
> > > +		 *   dst_reg = dst_reg->data_length
> > > +		 *
> > > +		 * NOTE: xdp_data_hard_end() is xdp->hard_start +
> > > +		 *       xdp->frame_sz - sizeof(shared_info)
> > > +		 *
> > > +		 * no_mb:
> > > +		 *   dst_reg += ctx_reg->data_end - ctx_reg->data
> > > +		 */
> > > +		*insn++ = BPF_MOV64_IMM(dst_reg, 0);
> > > +
> > > +		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg, MB_OFFSET());
> > > +		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, MB_MASK);
> > > +		*insn++ = BPF_JMP_IMM(BPF_JEQ, scratch_reg, 0, 7); /*goto no_mb;
> > > */
> > > +
> > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff,
> > > +						       data_hard_start),
> > > +				      dst_reg, ctx_reg,
> > > +				      offsetof(struct xdp_buff, data_hard_start));
> > > +		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg,
> > > +				      FRAME_SZ_OFFSET());
> > > +		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, FRAME_SZ_MASK);
> > > +		*insn++ = BPF_ALU32_IMM(BPF_RSH, scratch_reg, FRAME_SZ_SHIFT);
> > > +		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
> > > +		*insn++ = BPF_ALU64_IMM(BPF_SUB, dst_reg,
> > > +					SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
> > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_shared_info,
> > > +						       data_length),
> > > +				      dst_reg, dst_reg,
> > > +				      offsetof(struct xdp_shared_info,
> > > +					       data_length));
> > > +
> > > +		/* no_mb: */
> > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_end),
> > > +				      scratch_reg, ctx_reg,
> > > +				      offsetof(struct xdp_buff, data_end));
> > > +		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
> > > +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data),
> > > +				      scratch_reg, ctx_reg,
> > > +				      offsetof(struct xdp_buff, data));
> > > +		*insn++ = BPF_ALU64_REG(BPF_SUB, dst_reg, scratch_reg);
> > > +
> > > +		/* Restore scratch registers */
> > > +		*insn++ = BPF_LDX_MEM(BPF_DW, scratch_reg, ctx_reg,
> > > +				      offsetof(struct xdp_buff, tmp_reg[0]));
> > > +
> > > +		if (ctx_reg != si->src_reg)
> > > +			*insn++ = BPF_LDX_MEM(BPF_DW, ctx_reg, ctx_reg,
> > > +					      offsetof(struct xdp_buff,
> > > +						       tmp_reg[1]));
> > > +		break;
> > >  	}
> > > 
> > >  	return insn - insn_buf;
> > > diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> > > index 30b477a26482..62c50ab28ea9 100644
> > > --- a/tools/include/uapi/linux/bpf.h
> > > +++ b/tools/include/uapi/linux/bpf.h
> > > @@ -4380,6 +4380,7 @@ struct xdp_md {
> > >  	__u32 rx_queue_index;  /* rxq->queue_index  */
> > > 
> > >  	__u32 egress_ifindex;  /* txq->dev->ifindex */
> > > +	__u32 frame_length;
> > >  };
> > > 
> > >  /* DEVMAP map-value layout
> > > -- 
> > > 2.28.0
> > > 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2020-12-09 11:10       ` Maciej Fijalkowski
@ 2020-12-09 12:07         ` Eelco Chaudron
  2020-12-15 13:28           ` Eelco Chaudron
  0 siblings, 1 reply; 48+ messages in thread
From: Eelco Chaudron @ 2020-12-09 12:07 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Lorenzo Bianconi, bpf, netdev, davem, kuba, ast, daniel, shayagr,
	sameehj, john.fastabend, dsahern, brouer, lorenzo.bianconi,
	jasowang



On 9 Dec 2020, at 12:10, Maciej Fijalkowski wrote:

> On Wed, Dec 09, 2020 at 11:35:13AM +0100, Eelco Chaudron wrote:
>>
>>
>> On 8 Dec 2020, at 23:17, Maciej Fijalkowski wrote:
>>
>>> On Mon, Dec 07, 2020 at 05:32:42PM +0100, Lorenzo Bianconi wrote:
>>>> From: Eelco Chaudron <echaudro@redhat.com>
>>>>
>>>> This patch adds a new field to the XDP context called frame_length,
>>>> which will hold the full length of the packet, including fragments
>>>> if existing.
>>>
>>> The approach you took for ctx access conversion is barely described 
>>> :/
>>
>> You are right, I should have added some details on why I have chosen 
>> to take
>> this approach. The reason is to avoid a dedicated entry in the
>> xdp_frame structure and having to maintain it in the various eBPF
>> helpers.
>>
>> I'll update the commit message in the next revision to include this.
>>
>>>>
>>>> eBPF programs can determine if fragments are present using 
>>>> something
>>>> like:
>>>>
>>>>   if (ctx->data_end - ctx->data < ctx->frame_length) {
>>>>     /* Fragments exist. */
>>>>   }
>>>>
>>>> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
>>>> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
>>>> ---
>>>>  include/net/xdp.h              | 22 +++++++++
>>>>  include/uapi/linux/bpf.h       |  1 +
>>>>  kernel/bpf/verifier.c          |  2 +-
>>>>  net/core/filter.c              | 83
>>>> ++++++++++++++++++++++++++++++++++
>>>>  tools/include/uapi/linux/bpf.h |  1 +
>>>>  5 files changed, 108 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>>> index 09078ab6644c..e54d733c90ed 100644
>>>> --- a/include/net/xdp.h
>>>> +++ b/include/net/xdp.h
>>>> @@ -73,8 +73,30 @@ struct xdp_buff {
>>>>  	void *data_hard_start;
>>>>  	struct xdp_rxq_info *rxq;
>>>>  	struct xdp_txq_info *txq;
>>>> +	/* If any of the bitfield lengths for frame_sz or mb below 
>>>> change,
>>>> +	 * make sure the defines here are also updated!
>>>> +	 */
>>>> +#ifdef __BIG_ENDIAN_BITFIELD
>>>> +#define MB_SHIFT	  0
>>>> +#define MB_MASK		  0x00000001
>>>> +#define FRAME_SZ_SHIFT	  1
>>>> +#define FRAME_SZ_MASK	  0xfffffffe
>>>> +#else
>>>> +#define MB_SHIFT	  31
>>>> +#define MB_MASK		  0x80000000
>>>> +#define FRAME_SZ_SHIFT	  0
>>>> +#define FRAME_SZ_MASK	  0x7fffffff
>>>> +#endif
>>>> +#define FRAME_SZ_OFFSET() offsetof(struct xdp_buff,
>>>> __u32_bit_fields_offset)
>>>> +#define MB_OFFSET()	  offsetof(struct xdp_buff,
>>>> __u32_bit_fields_offset)
>>>> +	/* private: */
>>>> +	u32 __u32_bit_fields_offset[0];
>>>
>>> Why? I don't get that. Please explain.
>>
>> I was trying to find an easy way to extract the data/fields, maybe 
>> using BTF
>> but had no luck.
>> So I resorted back to an existing approach in sk_buff, see
>> https://elixir.bootlin.com/linux/v5.10-rc7/source/include/linux/skbuff.h#L780
>>
>>> Also, looking at all the need for masking/shifting, I wonder if it 
>>> would
>>> be better to have u32 frame_sz and u8 mb...
>>
>> Yes, I agree having a u32 would be way better, even a u32 for the mb
>> field. I’ve seen other code converting flags to u32 for easy access
>> in the eBPF context structures.
>>
>> I see there are some general comments on the bit definitions for mb,
>> but I’ll try to convince people to use u32 for both in the next
>> revision, as I think size is not a real problem for the xdp_buff
>> structure ;)
>
> Generally people were really strict on xdp_buff extensions as we 
> didn't
> want to end up with another skb-like monster. I think Jesper somewhere
> said that one cacheline is max for that. With your tmp_reg[2] you 
> exceed
> that from what I see, but I might be short on coffee.

Guess you are right! I got confused with xdp_md, guess I did not have 
enough coffee when I replied :)

The common use case will not hit the second cache line (if src reg != 
dst reg), but it might happen.

>>
>>>> +	/* public: */
>>>>  	u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved
>>>> tailroom*/
>>>>  	u32 mb:1; /* xdp non-linear buffer */
>>>> +
>>>> +	/* Temporary registers to make conditional access/stores 
>>>> possible.
>>>> */
>>>> +	u64 tmp_reg[2];
>>>
>>> IMHO this kills the bitfield approach we have for vars above.
>>
>> See above…
>>
>>>>  };
>>>>
>>>>  /* Reserve memory area at end-of data area.
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index 30b477a26482..62c50ab28ea9 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -4380,6 +4380,7 @@ struct xdp_md {
>>>>  	__u32 rx_queue_index;  /* rxq->queue_index  */
>>>>
>>>>  	__u32 egress_ifindex;  /* txq->dev->ifindex */
>>>> +	__u32 frame_length;
>>>>  };
>>>>
>>>>  /* DEVMAP map-value layout
>>>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>>>> index 93def76cf32b..c50caea29fa2 100644
>>>> --- a/kernel/bpf/verifier.c
>>>> +++ b/kernel/bpf/verifier.c
>>>> @@ -10526,7 +10526,7 @@ static int convert_ctx_accesses(struct
>>>> bpf_verifier_env *env)
>>>>  	const struct bpf_verifier_ops *ops = env->ops;
>>>>  	int i, cnt, size, ctx_field_size, delta = 0;
>>>>  	const int insn_cnt = env->prog->len;
>>>> -	struct bpf_insn insn_buf[16], *insn;
>>>> +	struct bpf_insn insn_buf[32], *insn;
>>>>  	u32 target_size, size_default, off;
>>>>  	struct bpf_prog *new_prog;
>>>>  	enum bpf_access_type type;
>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>> index 4c4882d4d92c..278640db9e0a 100644
>>>> --- a/net/core/filter.c
>>>> +++ b/net/core/filter.c
>>>> @@ -8908,6 +8908,7 @@ static u32 xdp_convert_ctx_access(enum
>>>> bpf_access_type type,
>>>>  				  struct bpf_insn *insn_buf,
>>>>  				  struct bpf_prog *prog, u32 *target_size)
>>>>  {
>>>> +	int ctx_reg, dst_reg, scratch_reg;
>>>>  	struct bpf_insn *insn = insn_buf;
>>>>
>>>>  	switch (si->off) {
>>>> @@ -8954,6 +8955,88 @@ static u32 xdp_convert_ctx_access(enum
>>>> bpf_access_type type,
>>>>  		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
>>>>  				      offsetof(struct net_device, ifindex));
>>>>  		break;
>>>> +	case offsetof(struct xdp_md, frame_length):
>>>> +		/* Need tmp storage for src_reg in case src_reg == dst_reg,
>>>> +		 * and a scratch reg */
>>>> +		scratch_reg = BPF_REG_9;
>>>> +		dst_reg = si->dst_reg;
>>>> +
>>>> +		if (dst_reg == scratch_reg)
>>>> +			scratch_reg--;
>>>> +
>>>> +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 :
>>>> si->src_reg;
>>>> +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
>>>> +			ctx_reg--;
>>>> +
>>>> +		/* Save scratch registers */
>>>> +		if (ctx_reg != si->src_reg) {
>>>> +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
>>>> +					      offsetof(struct xdp_buff,
>>>> +						       tmp_reg[1]));
>>>> +
>>>> +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
>>>> +		}
>>>> +
>>>> +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
>>>> +				      offsetof(struct xdp_buff, tmp_reg[0]));
>>>
>>> Why don't you push regs to stack, use it and then pop it back? That 
>>> way
>>> I
>>> suppose you could avoid polluting xdp_buff with tmp_reg[2].
>>
>> There is no “real” stack in eBPF, only a read-only frame pointer, 
>> and as we
>> are replacing a single instruction, we have no info on what we can 
>> use as
>> scratch space.
>
> Uhm, what? You use R10 for stack operations. Verifier tracks the stack
> depth used by programs and then it is passed down to JIT so that 
> native
> asm will create a properly sized stack frame.
>
> From the top of my head I would let know xdp_convert_ctx_access of a
> current stack depth and use it for R10 stores, so your scratch space 
> would
> be R10 + (stack depth + 8), R10 + (stack_depth + 16).

Other instances do exactly the same, i.e. put some scratch registers in 
the underlying data structure, so I reused this approach. From the 
current information in the callback, I was not able to determine the 
current stack_depth. With "real" stack above, I meant having a pop/push 
like instruction.

I do not know the verifier code well enough, but are you suggesting I 
can get the current stack_depth from the verifier in the 
xdp_convert_ctx_access() callback? If so any pointers?

> Problem with that would be the fact that convert_ctx_accesses() 
> happens to
> be called after the check_max_stack_depth(), so probably stack_depth 
> of a
> prog that has frame_length accesses would have to be adjusted earlier.

Ack, need to learn more on the verifier part…

>>
>>>> +
>>>> +		/* What does this code do?
>>>> +		 *   dst_reg = 0
>>>> +		 *
>>>> +		 *   if (!ctx_reg->mb)
>>>> +		 *      goto no_mb:
>>>> +		 *
>>>> +		 *   dst_reg = (struct xdp_shared_info *)xdp_data_hard_end(xdp)
>>>> +		 *   dst_reg = dst_reg->data_length
>>>> +		 *
>>>> +		 * NOTE: xdp_data_hard_end() is xdp->hard_start +
>>>> +		 *       xdp->frame_sz - sizeof(shared_info)
>>>> +		 *
>>>> +		 * no_mb:
>>>> +		 *   dst_reg += ctx_reg->data_end - ctx_reg->data
>>>> +		 */
>>>> +		*insn++ = BPF_MOV64_IMM(dst_reg, 0);
>>>> +
>>>> +		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg, MB_OFFSET());
>>>> +		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, MB_MASK);
>>>> +		*insn++ = BPF_JMP_IMM(BPF_JEQ, scratch_reg, 0, 7); /*goto no_mb;
>>>> */
>>>> +
>>>> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff,
>>>> +						       data_hard_start),
>>>> +				      dst_reg, ctx_reg,
>>>> +				      offsetof(struct xdp_buff, data_hard_start));
>>>> +		*insn++ = BPF_LDX_MEM(BPF_W, scratch_reg, ctx_reg,
>>>> +				      FRAME_SZ_OFFSET());
>>>> +		*insn++ = BPF_ALU32_IMM(BPF_AND, scratch_reg, FRAME_SZ_MASK);
>>>> +		*insn++ = BPF_ALU32_IMM(BPF_RSH, scratch_reg, FRAME_SZ_SHIFT);
>>>> +		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
>>>> +		*insn++ = BPF_ALU64_IMM(BPF_SUB, dst_reg,
>>>> +					SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
>>>> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_shared_info,
>>>> +						       data_length),
>>>> +				      dst_reg, dst_reg,
>>>> +				      offsetof(struct xdp_shared_info,
>>>> +					       data_length));
>>>> +
>>>> +		/* no_mb: */
>>>> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, 
>>>> data_end),
>>>> +				      scratch_reg, ctx_reg,
>>>> +				      offsetof(struct xdp_buff, data_end));
>>>> +		*insn++ = BPF_ALU64_REG(BPF_ADD, dst_reg, scratch_reg);
>>>> +		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data),
>>>> +				      scratch_reg, ctx_reg,
>>>> +				      offsetof(struct xdp_buff, data));
>>>> +		*insn++ = BPF_ALU64_REG(BPF_SUB, dst_reg, scratch_reg);
>>>> +
>>>> +		/* Restore scratch registers */
>>>> +		*insn++ = BPF_LDX_MEM(BPF_DW, scratch_reg, ctx_reg,
>>>> +				      offsetof(struct xdp_buff, tmp_reg[0]));
>>>> +
>>>> +		if (ctx_reg != si->src_reg)
>>>> +			*insn++ = BPF_LDX_MEM(BPF_DW, ctx_reg, ctx_reg,
>>>> +					      offsetof(struct xdp_buff,
>>>> +						       tmp_reg[1]));
>>>> +		break;
>>>>  	}
>>>>
>>>>  	return insn - insn_buf;
>>>> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
>>>> index 30b477a26482..62c50ab28ea9 100644
>>>> --- a/tools/include/uapi/linux/bpf.h
>>>> +++ b/tools/include/uapi/linux/bpf.h
>>>> @@ -4380,6 +4380,7 @@ struct xdp_md {
>>>>  	__u32 rx_queue_index;  /* rxq->queue_index  */
>>>>
>>>>  	__u32 egress_ifindex;  /* txq->dev->ifindex */
>>>> +	__u32 frame_length;
>>>>  };
>>>>
>>>>  /* DEVMAP map-value layout
>>>> -- 
>>>> 2.28.0
>>>>
>>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2020-12-09 12:07         ` Eelco Chaudron
@ 2020-12-15 13:28           ` Eelco Chaudron
  2020-12-15 18:06             ` Maciej Fijalkowski
  0 siblings, 1 reply; 48+ messages in thread
From: Eelco Chaudron @ 2020-12-15 13:28 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: Lorenzo Bianconi, bpf, netdev



On 9 Dec 2020, at 13:07, Eelco Chaudron wrote:

> On 9 Dec 2020, at 12:10, Maciej Fijalkowski wrote:

<SNIP>

>>>>> +
>>>>> +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 :
>>>>> si->src_reg;
>>>>> +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
>>>>> +			ctx_reg--;
>>>>> +
>>>>> +		/* Save scratch registers */
>>>>> +		if (ctx_reg != si->src_reg) {
>>>>> +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
>>>>> +					      offsetof(struct xdp_buff,
>>>>> +						       tmp_reg[1]));
>>>>> +
>>>>> +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
>>>>> +		}
>>>>> +
>>>>> +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
>>>>> +				      offsetof(struct xdp_buff, tmp_reg[0]));
>>>>
>>>> Why don't you push regs to stack, use it and then pop it back? That 
>>>> way
>>>> I
>>>> suppose you could avoid polluting xdp_buff with tmp_reg[2].
>>>
>>> There is no “real” stack in eBPF, only a read-only frame 
>>> pointer, and as we
>>> are replacing a single instruction, we have no info on what we can 
>>> use as
>>> scratch space.
>>
>> Uhm, what? You use R10 for stack operations. Verifier tracks the 
>> stack
>> depth used by programs and then it is passed down to JIT so that 
>> native
>> asm will create a properly sized stack frame.
>>
>> From the top of my head I would let know xdp_convert_ctx_access of a
>> current stack depth and use it for R10 stores, so your scratch space 
>> would
>> be R10 + (stack depth + 8), R10 + (stack_depth + 16).
>
> Other instances do exactly the same, i.e. put some scratch registers 
> in the underlying data structure, so I reused this approach. From the 
> current information in the callback, I was not able to determine the 
> current stack_depth. With "real" stack above, I meant having a 
> pop/push like instruction.
>
> I do not know the verifier code well enough, but are you suggesting I 
> can get the current stack_depth from the verifier in the 
> xdp_convert_ctx_access() callback? If so any pointers?

Maciej, any feedback on the above, i.e. getting the stack_depth in 
xdp_convert_ctx_access()?

>> Problem with that would be the fact that convert_ctx_accesses() 
>> happens to
>> be called after the check_max_stack_depth(), so probably stack_depth 
>> of a
>> prog that has frame_length accesses would have to be adjusted 
>> earlier.
>
> Ack, need to learn more on the verifier part…

<SNIP>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2020-12-15 13:28           ` Eelco Chaudron
@ 2020-12-15 18:06             ` Maciej Fijalkowski
  2020-12-16 14:08               ` Eelco Chaudron
  0 siblings, 1 reply; 48+ messages in thread
From: Maciej Fijalkowski @ 2020-12-15 18:06 UTC (permalink / raw)
  To: Eelco Chaudron; +Cc: Lorenzo Bianconi, bpf, netdev

On Tue, Dec 15, 2020 at 02:28:39PM +0100, Eelco Chaudron wrote:
> 
> 
> On 9 Dec 2020, at 13:07, Eelco Chaudron wrote:
> 
> > On 9 Dec 2020, at 12:10, Maciej Fijalkowski wrote:
> 
> <SNIP>
> 
> > > > > > +
> > > > > > +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 :
> > > > > > si->src_reg;
> > > > > > +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
> > > > > > +			ctx_reg--;
> > > > > > +
> > > > > > +		/* Save scratch registers */
> > > > > > +		if (ctx_reg != si->src_reg) {
> > > > > > +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
> > > > > > +					      offsetof(struct xdp_buff,
> > > > > > +						       tmp_reg[1]));
> > > > > > +
> > > > > > +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
> > > > > > +		}
> > > > > > +
> > > > > > +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
> > > > > > +				      offsetof(struct xdp_buff, tmp_reg[0]));
> > > > > 
> > > > > Why don't you push regs to stack, use it and then pop it
> > > > > back? That way
> > > > > I
> > > > > suppose you could avoid polluting xdp_buff with tmp_reg[2].
> > > > 
> > > > There is no “real” stack in eBPF, only a read-only frame
> > > > pointer, and as we
> > > > are replacing a single instruction, we have no info on what we
> > > > can use as
> > > > scratch space.
> > > 
> > > Uhm, what? You use R10 for stack operations. Verifier tracks the
> > > stack
> > > depth used by programs and then it is passed down to JIT so that
> > > native
> > > asm will create a properly sized stack frame.
> > > 
> > > From the top of my head I would let know xdp_convert_ctx_access of a
> > > current stack depth and use it for R10 stores, so your scratch space
> > > would
> > > be R10 + (stack depth + 8), R10 + (stack_depth + 16).
> > 
> > Other instances do exactly the same, i.e. put some scratch registers in
> > the underlying data structure, so I reused this approach. From the
> > current information in the callback, I was not able to determine the
> > current stack_depth. With "real" stack above, I meant having a pop/push
> > like instruction.
> > 
> > I do not know the verifier code well enough, but are you suggesting I
> > can get the current stack_depth from the verifier in the
> > xdp_convert_ctx_access() callback? If so any pointers?
> 
> Maciej any feedback on the above, i.e. getting the stack_depth in
> xdp_convert_ctx_access()?

Sorry. I'll try to get my head around it. If I recall correctly, stack
depth is tracked per subprogram, whereas convert_ctx_accesses() iterates
through *all* insns (i.e. over a prog that is not split into subprogs), but
maybe we could dig up the subprog based on the insn idx.

But at first, you mentioned that you took the approach from other
instances, can you point me to them?

I'd also like to hear from Daniel/Alexei/John and others their thoughts.

> 
> > > Problem with that would be the fact that convert_ctx_accesses()
> > > happens to
> > > be called after the check_max_stack_depth(), so probably stack_depth
> > > of a
> > > prog that has frame_length accesses would have to be adjusted
> > > earlier.
> > 
> > Ack, need to learn more on the verifier part…
> 
> <SNIP>
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2020-12-15 18:06             ` Maciej Fijalkowski
@ 2020-12-16 14:08               ` Eelco Chaudron
  2021-01-15 16:36                 ` Eelco Chaudron
  0 siblings, 1 reply; 48+ messages in thread
From: Eelco Chaudron @ 2020-12-16 14:08 UTC (permalink / raw)
  To: Maciej Fijalkowski; +Cc: Lorenzo Bianconi, bpf, netdev



On 15 Dec 2020, at 19:06, Maciej Fijalkowski wrote:

> On Tue, Dec 15, 2020 at 02:28:39PM +0100, Eelco Chaudron wrote:
>>
>>
>> On 9 Dec 2020, at 13:07, Eelco Chaudron wrote:
>>
>>> On 9 Dec 2020, at 12:10, Maciej Fijalkowski wrote:
>>
>> <SNIP>
>>
>>>>>>> +
>>>>>>> +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 :
>>>>>>> si->src_reg;
>>>>>>> +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
>>>>>>> +			ctx_reg--;
>>>>>>> +
>>>>>>> +		/* Save scratch registers */
>>>>>>> +		if (ctx_reg != si->src_reg) {
>>>>>>> +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
>>>>>>> +					      offsetof(struct xdp_buff,
>>>>>>> +						       tmp_reg[1]));
>>>>>>> +
>>>>>>> +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
>>>>>>> +		}
>>>>>>> +
>>>>>>> +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
>>>>>>> +				      offsetof(struct xdp_buff, tmp_reg[0]));
>>>>>>
>>>>>> Why don't you push regs to stack, use it and then pop it
>>>>>> back? That way
>>>>>> I
>>>>>> suppose you could avoid polluting xdp_buff with tmp_reg[2].
>>>>>
>>>>> There is no “real” stack in eBPF, only a read-only frame
>>>>> pointer, and as we
>>>>> are replacing a single instruction, we have no info on what we
>>>>> can use as
>>>>> scratch space.
>>>>
>>>> Uhm, what? You use R10 for stack operations. Verifier tracks the
>>>> stack
>>>> depth used by programs and then it is passed down to JIT so that
>>>> native
>>>> asm will create a properly sized stack frame.
>>>>
>>>> From the top of my head I would let know xdp_convert_ctx_access of a
>>>> current stack depth and use it for R10 stores, so your scratch space
>>>> would
>>>> be R10 + (stack depth + 8), R10 + (stack_depth + 16).
>>>
>>> Other instances do exactly the same, i.e. put some scratch registers in
>>> the underlying data structure, so I reused this approach. From the
>>> current information in the callback, I was not able to determine the
>>> current stack_depth. With "real" stack above, I meant having a pop/push
>>> like instruction.
>>>
>>> I do not know the verifier code well enough, but are you suggesting I
>>> can get the current stack_depth from the verifier in the
>>> xdp_convert_ctx_access() callback? If so any pointers?
>>
>> Maciej any feedback on the above, i.e. getting the stack_depth in
>> xdp_convert_ctx_access()?
>
> Sorry. I'll try to get my head around it. If i recall correctly stack
> depth is tracked per subprogram whereas convert_ctx_accesses is iterating
> through *all* insns (so a prog that is not chunked onto subprogs), but
> maybe we could dig up the subprog based on insn idx.
>
> But at first, you mentioned that you took the approach from other
> instances, can you point me to them?

Quick search found the following two (sure there is one more with two regs):

https://elixir.bootlin.com/linux/v5.10.1/source/kernel/bpf/cgroup.c#L1718
https://elixir.bootlin.com/linux/v5.10.1/source/net/core/filter.c#L8977

> I'd also like to hear from Daniel/Alexei/John and others their thoughts.

Please keep me in the loop…

>>
>>>> Problem with that would be the fact that convert_ctx_accesses()
>>>> happens to
>>>> be called after the check_max_stack_depth(), so probably stack_depth
>>>> of a
>>>> prog that has frame_length accesses would have to be adjusted
>>>> earlier.
>>>
>>> Ack, need to learn more on the verifier part…
>>
>> <SNIP>
>>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
  2020-12-08 11:01     ` Lorenzo Bianconi
@ 2020-12-19 14:53       ` Shay Agroskin
  2020-12-19 15:30         ` Jamal Hadi Salim
  2020-12-20 17:52         ` Lorenzo Bianconi
  0 siblings, 2 replies; 48+ messages in thread
From: Shay Agroskin @ 2020-12-19 14:53 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Saeed Mahameed, Lorenzo Bianconi, bpf, netdev, davem, kuba, ast,
	daniel, sameehj, john.fastabend, dsahern, brouer, echaudro,
	jasowang


Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:

>> On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
>> > Introduce xdp_shared_info data structure to contain info 
>> > about
>> > "non-linear" xdp frame. xdp_shared_info will alias 
>> > skb_shared_info
>> > allowing to keep most of the frags in the same cache-line.
[...]
>> 
>> > +	u16 nr_frags;
>> > +	u16 data_length; /* paged area length */
>> > +	skb_frag_t frags[MAX_SKB_FRAGS];
>> 
>> why MAX_SKB_FRAGS ? just use a flexible array member 
>> skb_frag_t frags[]; 
>> 
>> and enforce size via the n_frags and on the construction of the
>> tailroom preserved buffer, which is already being done.
>> 
>> this is a waste of space, at least by definition of the
>> struct, in your use case you do:
>> memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) * 
>> num_frags);
>> And the tailroom space was already preserved for a full 
>> skb_shinfo.
>> so i don't see why you need this array to be of a fixed 
>> MAX_SKB_FRAGS
>> size.
>
> In order to avoid cache-misses, the xdp_shared_info is built as a
> variable on the mvneta_rx_swbm() stack and it is written to the
> "shared_info" area only on the last fragment in
> mvneta_swbm_add_rx_fragment(). I used MAX_SKB_FRAGS to stay aligned
> with the skb_shared_info struct, but we can probably use even a
> smaller value.
> Another approach would be to define two different struct, e.g.
>
> struct xdp_frag_metadata {
> 	u16 nr_frags;
> 	u16 data_length; /* paged area length */
> };
>
> struct xdp_frags {
> 	skb_frag_t frags[MAX_SKB_FRAGS];
> };
>
> and then define xdp_shared_info as
>
> struct xdp_shared_info {
> 	struct xdp_frag_metadata meta;
> 	skb_frag_t frags[];
> };
>
> In this way we can probably optimize the space. What do you 
> think?

We're still reserving ~sizeof(skb_shared_info) bytes at the end of 
the first buffer and it seems like in mvneta code you keep 
updating all three fields (frags, nr_frags and data_length).
Can you explain how the space is optimized by splitting the 
structs please?

>> 
>> > +};
>> > +
>> > +static inline struct xdp_shared_info *
>> >  xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
>> >  {
>> > -	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
>> > +	BUILD_BUG_ON(sizeof(struct xdp_shared_info) >
>> > +		     sizeof(struct skb_shared_info));
>> > +	return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
>> > +}
>> > +
>> 
>> Back to my first comment, do we have plans to use this tail 
>> room buffer
>> for other than frag_list use cases ? what will be the buffer 
>> format
>> then ? should we push all new fields to the end of the 
>> xdp_shared_info
>> struct ? or deal with this tailroom buffer as a stack ? 
>> my main concern is that for drivers that don't support frag 
>> list and
>> still want to utilize the tailroom buffer for other usecases 
>> they will
>> have to skip the first sizeof(xdp_shared_info) so they won't 
>> break the
>> stack.
>
> for the moment I do not know if this area is used for other 
> purposes.
> Do you think there are other use-cases for it?
>

Saeed, the stack receives skb_shared_info when the frames are 
passed to the stack (skb_add_rx_frag is used to add the whole 
information to skb's shared info), and for XDP_REDIRECT use case, 
it doesn't seem like all drivers check page's tailroom for more 
information anyway (ena doesn't at least).
Can you please explain what you mean by "break the stack"?

Thanks, Shay

>> 
[...]
>
>> 


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
  2020-12-19 14:53       ` Shay Agroskin
@ 2020-12-19 15:30         ` Jamal Hadi Salim
  2020-12-21  9:01           ` Jesper Dangaard Brouer
  2020-12-20 17:52         ` Lorenzo Bianconi
  1 sibling, 1 reply; 48+ messages in thread
From: Jamal Hadi Salim @ 2020-12-19 15:30 UTC (permalink / raw)
  To: Shay Agroskin, Lorenzo Bianconi
  Cc: Saeed Mahameed, Lorenzo Bianconi, bpf, netdev, davem, kuba, ast,
	daniel, sameehj, john.fastabend, dsahern, brouer, echaudro,
	jasowang

On 2020-12-19 9:53 a.m., Shay Agroskin wrote:
> 
> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
> 

>> for the moment I do not know if this area is used for other purposes.
>> Do you think there are other use-cases for it?

Sorry to interject:
Does it make sense to use it to store arbitrary metadata or a scratchpad
in this space? Something equivalent to skb->cb which is lacking in
XDP.

cheers,
jamal

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 06/14] net: mvneta: add multi buffer support to XDP_TX
  2020-12-07 16:32 ` [PATCH v5 bpf-next 06/14] net: mvneta: add multi buffer support to XDP_TX Lorenzo Bianconi
@ 2020-12-19 15:56   ` Shay Agroskin
  2020-12-20 18:06     ` Lorenzo Bianconi
  0 siblings, 1 reply; 48+ messages in thread
From: Shay Agroskin @ 2020-12-19 15:56 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: bpf, netdev, davem, kuba, ast, daniel, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang


Lorenzo Bianconi <lorenzo@kernel.org> writes:

> Introduce the capability to map a non-linear xdp buffer when running
> mvneta_xdp_submit_frame() for XDP_TX and XDP_REDIRECT
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 94 
>  ++++++++++++++++-----------
>  1 file changed, 56 insertions(+), 38 deletions(-)
[...]
>  			if (napi && buf->type == 
>  MVNETA_TYPE_XDP_TX)
>  				xdp_return_frame_rx_napi(buf->xdpf);
>  			else
> @@ -2054,45 +2054,64 @@ mvneta_xdp_put_buff(struct mvneta_port 
> *pp, struct mvneta_rx_queue *rxq,
>  
>  static int
>  mvneta_xdp_submit_frame(struct mvneta_port *pp, struct 
>  mvneta_tx_queue *txq,
> -			struct xdp_frame *xdpf, bool dma_map)
> +			struct xdp_frame *xdpf, int *nxmit_byte, 
> bool dma_map)
>  {
> -	struct mvneta_tx_desc *tx_desc;
> -	struct mvneta_tx_buf *buf;
> -	dma_addr_t dma_addr;
> +	struct xdp_shared_info *xdp_sinfo = 
> xdp_get_shared_info_from_frame(xdpf);
> +	int i, num_frames = xdpf->mb ? xdp_sinfo->nr_frags + 1 : 
> 1;
> +	struct mvneta_tx_desc *tx_desc = NULL;
> +	struct page *page;
>  
> -	if (txq->count >= txq->tx_stop_threshold)
> +	if (txq->count + num_frames >= txq->size)
>  		return MVNETA_XDP_DROPPED;
>  
> -	tx_desc = mvneta_txq_next_desc_get(txq);
> +	for (i = 0; i < num_frames; i++) {
> +		struct mvneta_tx_buf *buf = 
> &txq->buf[txq->txq_put_index];
> +		skb_frag_t *frag = i ? &xdp_sinfo->frags[i - 1] : 
> NULL;
> +		int len = frag ? xdp_get_frag_size(frag) : 
> xdpf->len;

nit, from branch prediction point of view, maybe it would be 
better to write
     int len = i ? xdp_get_frag_size(frag) : xdpf->len;

since the value of i is checked one line above
Disclaimer: I'm far from a compiler expert, and don't know whether 
the compiler would know to group these two assignments together 
into a single branch prediction decision, but it feels like using 
'i' would make this decision easier for it.

Thanks,
Shay

[...]


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 11/14] bpf: cpumap: introduce xdp multi-buff support
  2020-12-07 16:32 ` [PATCH v5 bpf-next 11/14] bpf: cpumap: introduce xdp multi-buff support Lorenzo Bianconi
@ 2020-12-19 17:46   ` Shay Agroskin
  2020-12-20 17:56     ` Lorenzo Bianconi
  0 siblings, 1 reply; 48+ messages in thread
From: Shay Agroskin @ 2020-12-19 17:46 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: bpf, netdev, davem, kuba, ast, daniel, sameehj, john.fastabend,
	dsahern, brouer, echaudro, lorenzo.bianconi, jasowang


Lorenzo Bianconi <lorenzo@kernel.org> writes:

> Introduce __xdp_build_skb_from_frame and 
> xdp_build_skb_from_frame
> utility routines to build the skb from xdp_frame.
> Add xdp multi-buff support to cpumap
>
> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> ---
>  include/net/xdp.h   |  5 ++++
>  kernel/bpf/cpumap.c | 45 +---------------------------
>  net/core/xdp.c      | 73 
>  +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 79 insertions(+), 44 deletions(-)
>
[...]
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 6c8e743ad03a..55f3e9c69427 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -597,3 +597,76 @@ void xdp_warn(const char *msg, const char 
> *func, const int line)
>  	WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
>  };
>  EXPORT_SYMBOL_GPL(xdp_warn);
> +
> +struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame 
> *xdpf,
> +					   struct sk_buff *skb,
> +					   struct net_device *dev)
> +{
> +	unsigned int headroom = sizeof(*xdpf) + xdpf->headroom;
> +	void *hard_start = xdpf->data - headroom;
> +	skb_frag_t frag_list[MAX_SKB_FRAGS];
> +	struct xdp_shared_info *xdp_sinfo;
> +	int i, num_frags = 0;
> +
> +	xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
> +	if (unlikely(xdpf->mb)) {
> +		num_frags = xdp_sinfo->nr_frags;
> +		memcpy(frag_list, xdp_sinfo->frags,
> +		       sizeof(skb_frag_t) * num_frags);
> +	}

nit, can you please move the xdp_sinfo assignment inside this 'if' 
? This would help to emphasize that treating the xdp_frame tailroom 
as an xdp_shared_info struct (rather than skb_shared_info) is correct 
only when the mb bit is set

thanks,
Shay

> +
> +	skb = build_skb_around(skb, hard_start, xdpf->frame_sz);
> +	if (unlikely(!skb))
> +		return NULL;
[...]

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
  2020-12-19 14:53       ` Shay Agroskin
  2020-12-19 15:30         ` Jamal Hadi Salim
@ 2020-12-20 17:52         ` Lorenzo Bianconi
  2020-12-21 20:55           ` Shay Agroskin
  1 sibling, 1 reply; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-20 17:52 UTC (permalink / raw)
  To: Shay Agroskin
  Cc: Saeed Mahameed, Lorenzo Bianconi, BPF-dev-list,
	Network Development, David S. Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jubran, Samih,
	John Fastabend, David Ahern, Jesper Brouer, Eelco Chaudron,
	Jason Wang

>
>
> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
>
> >> On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
> >> > Introduce xdp_shared_info data structure to contain info
> >> > about
> >> > "non-linear" xdp frame. xdp_shared_info will alias
> >> > skb_shared_info
> >> > allowing to keep most of the frags in the same cache-line.
> [...]
> >>
> >> > +  u16 nr_frags;
> >> > +  u16 data_length; /* paged area length */
> >> > +  skb_frag_t frags[MAX_SKB_FRAGS];
> >>
> >> why MAX_SKB_FRAGS ? just use a flexible array member
> >> skb_frag_t frags[];
> >>
> >> and enforce size via the n_frags and on the construction of the
> >> tailroom preserved buffer, which is already being done.
> >>
> >> this is a waste of space, at least by definition of the
> >> struct, in your use case you do:
> >> memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) *
> >> num_frags);
> >> And the tailroom space was already preserved for a full
> >> skb_shinfo.
> >> so i don't see why you need this array to be of a fixed
> >> MAX_SKB_FRAGS
> >> size.
> >
> > In order to avoid cache-misses, xdp_shared info is built as a
> > variable
> > on mvneta_rx_swbm() stack and it is written to "shared_info"
> > area only on the
> > last fragment in mvneta_swbm_add_rx_fragment(). I used
> > MAX_SKB_FRAGS to be
> > aligned with skb_shared_info struct but probably we can use even
> > a smaller value.
> > Another approach would be to define two different struct, e.g.
> >
> > struct xdp_frag_metadata {
> >       u16 nr_frags;
> >       u16 data_length; /* paged area length */
> > };
> >
> > struct xdp_frags {
> >       skb_frag_t frags[MAX_SKB_FRAGS];
> > };
> >
> > and then define xdp_shared_info as
> >
> > struct xdp_shared_info {
> >       struct xdp_frag_metadata meta;
> >       skb_frag_t frags[];
> > };
> >
> > In this way we can probably optimize the space. What do you
> > think?
>
> We're still reserving ~sizeof(skb_shared_info) bytes at the end of
> the first buffer and it seems like in mvneta code you keep
> updating all three fields (frags, nr_frags and data_length).
> Can you explain how the space is optimized by splitting the
> structs please?

Using the xdp_shared_info struct, the first three fragments will be in
the same cache-line as nr_frags, while with the skb_shared_info struct
only the first fragment will be. Moreover, skb_shared_info has multiple
fields unused by XDP.

Regards,
Lorenzo

>
> >>
> >> > +};
> >> > +
> >> > +static inline struct xdp_shared_info *
> >> >  xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
> >> >  {
> >> > -  return (struct skb_shared_info *)xdp_data_hard_end(xdp);
> >> > +  BUILD_BUG_ON(sizeof(struct xdp_shared_info) >
> >> > +               sizeof(struct skb_shared_info));
> >> > +  return (struct xdp_shared_info *)xdp_data_hard_end(xdp);
> >> > +}
> >> > +
> >>
> >> Back to my first comment, do we have plans to use this tail
> >> room buffer
> >> for other than frag_list use cases ? what will be the buffer
> >> format
> >> then ? should we push all new fields to the end of the
> >> xdp_shared_info
> >> struct ? or deal with this tailroom buffer as a stack ?
> >> my main concern is that for drivers that don't support frag
> >> list and
> >> still want to utilize the tailroom buffer for other usecases
> >> they will
> >> have to skip the first sizeof(xdp_shared_info) so they won't
> >> break the
> >> stack.
> >
> > for the moment I do not know if this area is used for other
> > purposes.
> > Do you think there are other use-cases for it?
> >
>
> Saeed, the stack receives skb_shared_info when the frames are
> passed to the stack (skb_add_rx_frag is used to add the whole
> information to skb's shared info), and for XDP_REDIRECT use case,
> it doesn't seem like all drivers check page's tailroom for more
> information anyway (ena doesn't at least).
> Can you please explain what you mean by "break the stack"?
>
> Thanks, Shay
>
> >>
> [...]
> >
> >>
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 11/14] bpf: cpumap: introduce xdp multi-buff support
  2020-12-19 17:46   ` Shay Agroskin
@ 2020-12-20 17:56     ` Lorenzo Bianconi
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-20 17:56 UTC (permalink / raw)
  To: Shay Agroskin
  Cc: Lorenzo Bianconi, BPF-dev-list, Network Development,
	David S. Miller, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jubran, Samih, John Fastabend, David Ahern,
	Jesper Brouer, Eelco Chaudron, Jason Wang

>
>
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>
> > Introduce __xdp_build_skb_from_frame and
> > xdp_build_skb_from_frame
> > utility routines to build the skb from xdp_frame.
> > Add xdp multi-buff support to cpumap
> >
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  include/net/xdp.h   |  5 ++++
> >  kernel/bpf/cpumap.c | 45 +---------------------------
> >  net/core/xdp.c      | 73
> >  +++++++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 79 insertions(+), 44 deletions(-)
> >
> [...]
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 6c8e743ad03a..55f3e9c69427 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -597,3 +597,76 @@ void xdp_warn(const char *msg, const char
> > *func, const int line)
> >       WARN(1, "XDP_WARN: %s(line:%d): %s\n", func, line, msg);
> >  };
> >  EXPORT_SYMBOL_GPL(xdp_warn);
> > +
> > +struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame
> > *xdpf,
> > +                                        struct sk_buff *skb,
> > +                                        struct net_device *dev)
> > +{
> > +     unsigned int headroom = sizeof(*xdpf) + xdpf->headroom;
> > +     void *hard_start = xdpf->data - headroom;
> > +     skb_frag_t frag_list[MAX_SKB_FRAGS];
> > +     struct xdp_shared_info *xdp_sinfo;
> > +     int i, num_frags = 0;
> > +
> > +     xdp_sinfo = xdp_get_shared_info_from_frame(xdpf);
> > +     if (unlikely(xdpf->mb)) {
> > +             num_frags = xdp_sinfo->nr_frags;
> > +             memcpy(frag_list, xdp_sinfo->frags,
> > +                    sizeof(skb_frag_t) * num_frags);
> > +     }
>
> > nit, can you please move the xdp_sinfo assignment inside this 'if'
> > ? This would help to emphasize that treating the xdp_frame tailroom
> > as an xdp_shared_info struct (rather than skb_shared_info) is correct
> > only when the mb bit is set
>
> thanks,
> Shay

ack, will do in v6.

Regards,
Lorenzo

>
> > +
> > +     skb = build_skb_around(skb, hard_start, xdpf->frame_sz);
> > +     if (unlikely(!skb))
> > +             return NULL;
> [...]
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 06/14] net: mvneta: add multi buffer support to XDP_TX
  2020-12-19 15:56   ` Shay Agroskin
@ 2020-12-20 18:06     ` Lorenzo Bianconi
  0 siblings, 0 replies; 48+ messages in thread
From: Lorenzo Bianconi @ 2020-12-20 18:06 UTC (permalink / raw)
  To: Shay Agroskin
  Cc: Lorenzo Bianconi, BPF-dev-list, Network Development,
	David S. Miller, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jubran, Samih, John Fastabend, David Ahern,
	Jesper Brouer, Eelco Chaudron, Jason Wang

On Sat, Dec 19, 2020 at 4:56 PM Shay Agroskin <shayagr@amazon.com> wrote:
>
>
> Lorenzo Bianconi <lorenzo@kernel.org> writes:
>
> > Introduce the capability to map non-linear xdp buffer running
> > mvneta_xdp_submit_frame() for XDP_TX and XDP_REDIRECT
> >
> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
> > ---
> >  drivers/net/ethernet/marvell/mvneta.c | 94
> >  ++++++++++++++++-----------
> >  1 file changed, 56 insertions(+), 38 deletions(-)
> [...]
> >                       if (napi && buf->type ==
> >  MVNETA_TYPE_XDP_TX)
> >                               xdp_return_frame_rx_napi(buf->xdpf);
> >                       else
> > @@ -2054,45 +2054,64 @@ mvneta_xdp_put_buff(struct mvneta_port
> > *pp, struct mvneta_rx_queue *rxq,
> >
> >  static int
> >  mvneta_xdp_submit_frame(struct mvneta_port *pp, struct
> >  mvneta_tx_queue *txq,
> > -                     struct xdp_frame *xdpf, bool dma_map)
> > +                     struct xdp_frame *xdpf, int *nxmit_byte,
> > bool dma_map)
> >  {
> > -     struct mvneta_tx_desc *tx_desc;
> > -     struct mvneta_tx_buf *buf;
> > -     dma_addr_t dma_addr;
> > +     struct xdp_shared_info *xdp_sinfo =
> > xdp_get_shared_info_from_frame(xdpf);
> > +     int i, num_frames = xdpf->mb ? xdp_sinfo->nr_frags + 1 :
> > 1;
> > +     struct mvneta_tx_desc *tx_desc = NULL;
> > +     struct page *page;
> >
> > -     if (txq->count >= txq->tx_stop_threshold)
> > +     if (txq->count + num_frames >= txq->size)
> >               return MVNETA_XDP_DROPPED;
> >
> > -     tx_desc = mvneta_txq_next_desc_get(txq);
> > +     for (i = 0; i < num_frames; i++) {
> > +             struct mvneta_tx_buf *buf =
> > &txq->buf[txq->txq_put_index];
> > +             skb_frag_t *frag = i ? &xdp_sinfo->frags[i - 1] :
> > NULL;
> > +             int len = frag ? xdp_get_frag_size(frag) :
> > xdpf->len;
>
> nit, from branch prediction point of view, maybe it would be
> better to write
>      int len = i ? xdp_get_frag_size(frag) : xdpf->len;
>

ack, I will fix it in v6.

Regards,
Lorenzo

> since the value of i is checked one line above
> Disclaimer: I'm far from a compiler expert, and don't know whether
> the compiler would know to group these two assignments together
> into a single branch prediction decision, but it feels like using
> 'i' would make this decision easier for it.
>
> Thanks,
> Shay
>
> [...]
>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
  2020-12-19 15:30         ` Jamal Hadi Salim
@ 2020-12-21  9:01           ` Jesper Dangaard Brouer
  2020-12-21 13:00             ` Jamal Hadi Salim
  0 siblings, 1 reply; 48+ messages in thread
From: Jesper Dangaard Brouer @ 2020-12-21  9:01 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Shay Agroskin, Lorenzo Bianconi, Saeed Mahameed,
	Lorenzo Bianconi, bpf, netdev, davem, kuba, ast, daniel, sameehj,
	john.fastabend, dsahern, echaudro, jasowang, brouer

On Sat, 19 Dec 2020 10:30:57 -0500
Jamal Hadi Salim <jhs@mojatatu.com> wrote:

> On 2020-12-19 9:53 a.m., Shay Agroskin wrote:
> > 
> > Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
> >   
> 
> >> for the moment I do not know if this area is used for other purposes.
> >> Do you think there are other use-cases for it?  

Yes, all the same use-cases as SKBs have.  I wanted to keep this the
same as skb_shared_info, but Lorenzo chose to take John's advice and
it is going in this direction (which is fine, we can always change and
adjust this later).


> Sorry to interject:
> Does it make sense to use it to store arbitrary metadata or a scratchpad
> in this space? Something equivalent to skb->cb which is lacking in
> XDP.

Well, XDP has the data_meta area, but it is difficult to rely on because
a lot of drivers don't implement it.  And Saeed and I plan to use this
area and populate it with driver info from the RX-descriptor.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
  2020-12-21  9:01           ` Jesper Dangaard Brouer
@ 2020-12-21 13:00             ` Jamal Hadi Salim
  0 siblings, 0 replies; 48+ messages in thread
From: Jamal Hadi Salim @ 2020-12-21 13:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Shay Agroskin, Lorenzo Bianconi, Saeed Mahameed,
	Lorenzo Bianconi, bpf, netdev, davem, kuba, ast, daniel, sameehj,
	john.fastabend, dsahern, echaudro, jasowang

On 2020-12-21 4:01 a.m., Jesper Dangaard Brouer wrote:
> On Sat, 19 Dec 2020 10:30:57 -0500

>> Sorry to interject:
>> Does it make sense to use it to store arbitrary metadata or a scratchpad
>> in this space? Something equivalent to skb->cb which is lacking in
>> XDP.
> 
> Well, XDP has the data_meta area, but it is difficult to rely on
> because a lot of drivers don't implement it.  And Saeed and I plan to
> use this area and populate it with driver info from the RX-descriptor.
> 

What I was thinking of is some scratch pad that I can write to within
an XDP prog (not the driver); for example, in a prog array map the scratch
pad is written by one program in the array and read by another later on.
skb->cb allows for that. Unless you mean I can already write to some
XDP data_meta area?

cheers,
jamal

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure
  2020-12-20 17:52         ` Lorenzo Bianconi
@ 2020-12-21 20:55           ` Shay Agroskin
  0 siblings, 0 replies; 48+ messages in thread
From: Shay Agroskin @ 2020-12-21 20:55 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: Saeed Mahameed, Lorenzo Bianconi, BPF-dev-list,
	Network Development, David S. Miller, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jubran, Samih,
	John Fastabend, David Ahern, Jesper Brouer, Eelco Chaudron,
	Jason Wang


Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:

>>
>>
>> Lorenzo Bianconi <lorenzo.bianconi@redhat.com> writes:
>>
>> >> On Mon, 2020-12-07 at 17:32 +0100, Lorenzo Bianconi wrote:
>> >> > Introduce xdp_shared_info data structure to contain info
>> >> > about
>> >> > "non-linear" xdp frame. xdp_shared_info will alias
>> >> > skb_shared_info
>> >> > allowing to keep most of the frags in the same cache-line.
>> [...]
>> >>
>> >> > +  u16 nr_frags;
>> >> > +  u16 data_length; /* paged area length */
>> >> > +  skb_frag_t frags[MAX_SKB_FRAGS];
>> >>
>> >> why MAX_SKB_FRAGS ? just use a flexible array member
>> >> skb_frag_t frags[];
>> >>
>> >> and enforce size via the n_frags and on the construction of 
>> >> the
>> >> tailroom preserved buffer, which is already being done.
>> >>
>> >> this is a waste of space, at least by definition of the
>> >> struct, in your use case you do:
>> >> memcpy(frag_list, xdp_sinfo->frags, sizeof(skb_frag_t) *
>> >> num_frags);
>> >> And the tailroom space was already preserved for a full
>> >> skb_shinfo.
>> >> so i don't see why you need this array to be of a fixed
>> >> MAX_SKB_FRAGS
>> >> size.
>> >
>> > In order to avoid cache-misses, xdp_shared info is built as a
>> > variable
>> > on mvneta_rx_swbm() stack and it is written to "shared_info"
>> > area only on the
>> > last fragment in mvneta_swbm_add_rx_fragment(). I used
>> > MAX_SKB_FRAGS to be
>> > aligned with skb_shared_info struct but probably we can use 
>> > even
>> > a smaller value.
>> > Another approach would be to define two different struct, 
>> > e.g.
>> >
>> > struct xdp_frag_metadata {
>> >       u16 nr_frags;
>> >       u16 data_length; /* paged area length */
>> > };
>> >
>> > struct xdp_frags {
>> >       skb_frag_t frags[MAX_SKB_FRAGS];
>> > };
>> >
>> > and then define xdp_shared_info as
>> >
>> > struct xdp_shared_info {
>> >       struct xdp_frag_metadata meta;
>> >       skb_frag_t frags[];
>> > };
>> >
>> > In this way we can probably optimize the space. What do you
>> > think?
>>
>> We're still reserving ~sizeof(skb_shared_info) bytes at the end 
>> of
>> the first buffer and it seems like in mvneta code you keep
>> updating all three fields (frags, nr_frags and data_length).
>> Can you explain how the space is optimized by splitting the
>> structs please?
>
> using xdp_shared_info struct we will have the first 3 fragments 
> in the
> same cacheline of nr_frags while using skb_shared_info struct 
> only the
> first fragment will be in the same cacheline of 
> nr_frags. Moreover
> skb_shared_info has multiple fields unused by xdp.
>
> Regards,
> Lorenzo
>

Thanks for your reply. I was actually referring to your suggestion 
to Saeed. Namely, defining

struct xdp_shared_info {
       struct xdp_frag_metadata meta;
       skb_frag_t frags[];
}

I don't see what benefits there are to this scheme compared to the 
original patch

Thanks,
Shay

>>
>> >>
>> >> > +};
>> >> > +
[...]
>>
>> Saeed, the stack receives skb_shared_info when the frames are
>> passed to the stack (skb_add_rx_frag is used to add the whole
>> information to skb's shared info), and for XDP_REDIRECT use 
>> case,
>> it doesn't seem like all drivers check page's tailroom for more
>> information anyway (ena doesn't at least).
>> Can you please explain what you mean by "break the stack"?
>>
>> Thanks, Shay
>>
>> >>
>> [...]
>> >
>> >>
>>


^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2020-12-16 14:08               ` Eelco Chaudron
@ 2021-01-15 16:36                 ` Eelco Chaudron
  2021-01-18 16:48                   ` Maciej Fijalkowski
  0 siblings, 1 reply; 48+ messages in thread
From: Eelco Chaudron @ 2021-01-15 16:36 UTC (permalink / raw)
  To: Maciej Fijalkowski, Alexei Starovoitov, Daniel Borkmann
  Cc: Lorenzo Bianconi, bpf, netdev



On 16 Dec 2020, at 15:08, Eelco Chaudron wrote:

> On 15 Dec 2020, at 19:06, Maciej Fijalkowski wrote:
>
>> On Tue, Dec 15, 2020 at 02:28:39PM +0100, Eelco Chaudron wrote:
>>>
>>>
>>> On 9 Dec 2020, at 13:07, Eelco Chaudron wrote:
>>>
>>>> On 9 Dec 2020, at 12:10, Maciej Fijalkowski wrote:
>>>
>>> <SNIP>
>>>
>>>>>>>> +
>>>>>>>> +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 :
>>>>>>>> si->src_reg;
>>>>>>>> +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
>>>>>>>> +			ctx_reg--;
>>>>>>>> +
>>>>>>>> +		/* Save scratch registers */
>>>>>>>> +		if (ctx_reg != si->src_reg) {
>>>>>>>> +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
>>>>>>>> +					      offsetof(struct xdp_buff,
>>>>>>>> +						       tmp_reg[1]));
>>>>>>>> +
>>>>>>>> +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
>>>>>>>> +		}
>>>>>>>> +
>>>>>>>> +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
>>>>>>>> +				      offsetof(struct xdp_buff, tmp_reg[0]));
>>>>>>>
>>>>>>> Why don't you push regs to stack, use it and then pop it
>>>>>>> back? That way
>>>>>>> I
>>>>>>> suppose you could avoid polluting xdp_buff with tmp_reg[2].
>>>>>>
>>>>>> There is no “real” stack in eBPF, only a read-only frame
>>>>>> pointer, and as we
>>>>>> are replacing a single instruction, we have no info on what we
>>>>>> can use as
>>>>>> scratch space.
>>>>>
>>>>> Uhm, what? You use R10 for stack operations. Verifier tracks the
>>>>> stack
>>>>> depth used by programs and then it is passed down to JIT so that
>>>>> native
>>>>> asm will create a properly sized stack frame.
>>>>>
>>>>> From the top of my head I would let know xdp_convert_ctx_access of 
>>>>> a
>>>>> current stack depth and use it for R10 stores, so your scratch 
>>>>> space
>>>>> would
>>>>> be R10 + (stack depth + 8), R10 + (stack_depth + 16).
>>>>
>>>> Other instances do exactly the same, i.e. put some scratch 
>>>> registers in
>>>> the underlying data structure, so I reused this approach. From the
>>>> current information in the callback, I was not able to determine 
>>>> the
>>>> current stack_depth. With "real" stack above, I meant having a 
>>>> pop/push
>>>> like instruction.
>>>>
>>>> I do not know the verifier code well enough, but are you suggesting 
>>>> I
>>>> can get the current stack_depth from the verifier in the
>>>> xdp_convert_ctx_access() callback? If so any pointers?
>>>
>>> Maciej any feedback on the above, i.e. getting the stack_depth in
>>> xdp_convert_ctx_access()?
>>
>> Sorry. I'll try to get my head around it. If I recall correctly stack
>> depth is tracked per subprogram whereas convert_ctx_accesses is 
>> iterating
>> through *all* insns (so a prog that is not chunked onto subprogs), 
>> but
>> maybe we could dig up the subprog based on insn idx.
>>
>> But at first, you mentioned that you took the approach from other
>> instances, can you point me to them?
>
> Quick search found the following two (sure there is one more with two 
> regs):
>
> https://elixir.bootlin.com/linux/v5.10.1/source/kernel/bpf/cgroup.c#L1718
> https://elixir.bootlin.com/linux/v5.10.1/source/net/core/filter.c#L8977
>
>> I'd also like to hear from Daniel/Alexei/John and others their 
>> thoughts.
>
> Please keep me in the loop…

Any thoughts/update on the above so I can move this patchset forward?



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2021-01-15 16:36                 ` Eelco Chaudron
@ 2021-01-18 16:48                   ` Maciej Fijalkowski
  2021-01-20 13:20                     ` Eelco Chaudron
  0 siblings, 1 reply; 48+ messages in thread
From: Maciej Fijalkowski @ 2021-01-18 16:48 UTC (permalink / raw)
  To: Eelco Chaudron
  Cc: Alexei Starovoitov, Daniel Borkmann, Lorenzo Bianconi, bpf,
	netdev, brouer, bjorn, toke, john.fastabend

On Fri, Jan 15, 2021 at 05:36:23PM +0100, Eelco Chaudron wrote:
> 
> 
> On 16 Dec 2020, at 15:08, Eelco Chaudron wrote:
> 
> > On 15 Dec 2020, at 19:06, Maciej Fijalkowski wrote:
> > 
> > > On Tue, Dec 15, 2020 at 02:28:39PM +0100, Eelco Chaudron wrote:
> > > > 
> > > > 
> > > > On 9 Dec 2020, at 13:07, Eelco Chaudron wrote:
> > > > 
> > > > > On 9 Dec 2020, at 12:10, Maciej Fijalkowski wrote:
> > > > 
> > > > <SNIP>
> > > > 
> > > > > > > > > +
> > > > > > > > > +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 :
> > > > > > > > > si->src_reg;
> > > > > > > > > +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
> > > > > > > > > +			ctx_reg--;
> > > > > > > > > +
> > > > > > > > > +		/* Save scratch registers */
> > > > > > > > > +		if (ctx_reg != si->src_reg) {
> > > > > > > > > +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
> > > > > > > > > +					      offsetof(struct xdp_buff,
> > > > > > > > > +						       tmp_reg[1]));
> > > > > > > > > +
> > > > > > > > > +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
> > > > > > > > > +		}
> > > > > > > > > +
> > > > > > > > > +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
> > > > > > > > > +				      offsetof(struct xdp_buff, tmp_reg[0]));
> > > > > > > > 
> > > > > > > > Why don't you push regs to stack, use it and then pop it
> > > > > > > > back? That way
> > > > > > > > I
> > > > > > > > suppose you could avoid polluting xdp_buff with tmp_reg[2].
> > > > > > > 
> > > > > > > There is no “real” stack in eBPF, only a read-only frame
> > > > > > > pointer, and as we
> > > > > > > are replacing a single instruction, we have no info on what we
> > > > > > > can use as
> > > > > > > scratch space.
> > > > > > 
> > > > > > Uhm, what? You use R10 for stack operations. Verifier tracks the
> > > > > > stack
> > > > > > depth used by programs and then it is passed down to JIT so that
> > > > > > native
> > > > > > asm will create a properly sized stack frame.
> > > > > > 
> > > > > > From the top of my head I would let know
> > > > > > xdp_convert_ctx_access of a
> > > > > > current stack depth and use it for R10 stores, so your
> > > > > > scratch space
> > > > > > would
> > > > > > be R10 + (stack depth + 8), R10 + (stack_depth + 16).
> > > > > 
> > > > > Other instances do exactly the same, i.e. put some scratch
> > > > > registers in
> > > > > the underlying data structure, so I reused this approach. From the
> > > > > current information in the callback, I was not able to
> > > > > determine the
> > > > > current stack_depth. With "real" stack above, I meant having
> > > > > a pop/push
> > > > > like instruction.
> > > > > 
> > > > > I do not know the verifier code well enough, but are you
> > > > > suggesting I
> > > > > can get the current stack_depth from the verifier in the
> > > > > xdp_convert_ctx_access() callback? If so any pointers?
> > > > 
> > > > Maciej any feedback on the above, i.e. getting the stack_depth in
> > > > xdp_convert_ctx_access()?
> > > 
> > > Sorry. I'll try to get my head around it. If i recall correctly stack
> > > depth is tracked per subprogram whereas convert_ctx_accesses is
> > > iterating
> > > through *all* insns (so a prog that is not chunked onto subprogs),
> > > but
> > > maybe we could dig up the subprog based on insn idx.
> > > 
> > > But at first, you mentioned that you took the approach from other
> > > instances, can you point me to them?
> > 
> > Quick search found the following two (sure there is one more with two
> > regs):
> > 
> > https://elixir.bootlin.com/linux/v5.10.1/source/kernel/bpf/cgroup.c#L1718
> > https://elixir.bootlin.com/linux/v5.10.1/source/net/core/filter.c#L8977
> > 
> > > I'd also like to hear from Daniel/Alexei/John and others their
> > > thoughts.
> > 
> > Please keep me in the loop…
> 
> Any thoughts/update on the above so I can move this patchset forward?

Cc: John, Jesper, Bjorn

I didn't spend time thinking about it, but I am still against extending
xdp_buff for the purpose the code in this patch uses it for.

Daniel/Alexei/John/Jesper/Bjorn,

any objections to dropping the scratch registers and instead using the
stack, updating the tracked stack depth, to calculate the frame length?

This does not seem trivial, so I would really like input from BPF
developers better than me :)


* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2021-01-18 16:48                   ` Maciej Fijalkowski
@ 2021-01-20 13:20                     ` Eelco Chaudron
  2021-02-01 16:00                       ` Eelco Chaudron
  0 siblings, 1 reply; 48+ messages in thread
From: Eelco Chaudron @ 2021-01-20 13:20 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: Alexei Starovoitov, Daniel Borkmann, Lorenzo Bianconi, bpf,
	netdev, brouer, bjorn, toke, john.fastabend



On 18 Jan 2021, at 17:48, Maciej Fijalkowski wrote:

> On Fri, Jan 15, 2021 at 05:36:23PM +0100, Eelco Chaudron wrote:
>>
>>
>> On 16 Dec 2020, at 15:08, Eelco Chaudron wrote:
>>
>>> On 15 Dec 2020, at 19:06, Maciej Fijalkowski wrote:
>>>
>>>> On Tue, Dec 15, 2020 at 02:28:39PM +0100, Eelco Chaudron wrote:
>>>>>
>>>>>
>>>>> On 9 Dec 2020, at 13:07, Eelco Chaudron wrote:
>>>>>
>>>>>> On 9 Dec 2020, at 12:10, Maciej Fijalkowski wrote:
>>>>>
>>>>> <SNIP>
>>>>>
>>>>>>>>>> +
>>>>>>>>>> +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 :
>>>>>>>>>> si->src_reg;
>>>>>>>>>> +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
>>>>>>>>>> +			ctx_reg--;
>>>>>>>>>> +
>>>>>>>>>> +		/* Save scratch registers */
>>>>>>>>>> +		if (ctx_reg != si->src_reg) {
>>>>>>>>>> +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
>>>>>>>>>> +					      offsetof(struct xdp_buff,
>>>>>>>>>> +						       tmp_reg[1]));
>>>>>>>>>> +
>>>>>>>>>> +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
>>>>>>>>>> +		}
>>>>>>>>>> +
>>>>>>>>>> +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
>>>>>>>>>> +				      offsetof(struct xdp_buff, tmp_reg[0]));
>>>>>>>>>
>>>>>>>>> Why don't you push regs to stack, use it and then pop it
>>>>>>>>> back? That way
>>>>>>>>> I
>>>>>>>>> suppose you could avoid polluting xdp_buff with tmp_reg[2].
>>>>>>>>
>>>>>>>> There is no “real” stack in eBPF, only a read-only frame
>>>>>>>> pointer, and as we
>>>>>>>> are replacing a single instruction, we have no info on what we
>>>>>>>> can use as
>>>>>>>> scratch space.
>>>>>>>
>>>>>>> Uhm, what? You use R10 for stack operations. Verifier tracks the
>>>>>>> stack
>>>>>>> depth used by programs and then it is passed down to JIT so that
>>>>>>> native
>>>>>>> asm will create a properly sized stack frame.
>>>>>>>
>>>>>>> From the top of my head I would let know
>>>>>>> xdp_convert_ctx_access of a
>>>>>>> current stack depth and use it for R10 stores, so your
>>>>>>> scratch space
>>>>>>> would
>>>>>>> be R10 + (stack depth + 8), R10 + (stack_depth + 16).
>>>>>>
>>>>>> Other instances do exactly the same, i.e. put some scratch
>>>>>> registers in
>>>>>> the underlying data structure, so I reused this approach. From 
>>>>>> the
>>>>>> current information in the callback, I was not able to
>>>>>> determine the
>>>>>> current stack_depth. With "real" stack above, I meant having
>>>>>> a pop/push
>>>>>> like instruction.
>>>>>>
>>>>>> I do not know the verifier code well enough, but are you
>>>>>> suggesting I
>>>>>> can get the current stack_depth from the verifier in the
>>>>>> xdp_convert_ctx_access() callback? If so any pointers?
>>>>>
>>>>> Maciej any feedback on the above, i.e. getting the stack_depth in
>>>>> xdp_convert_ctx_access()?
>>>>
>>>> Sorry. I'll try to get my head around it. If i recall correctly 
>>>> stack
>>>> depth is tracked per subprogram whereas convert_ctx_accesses is
>>>> iterating
>>>> through *all* insns (so a prog that is not chunked onto subprogs),
>>>> but
>>>> maybe we could dig up the subprog based on insn idx.
>>>>
>>>> But at first, you mentioned that you took the approach from other
>>>> instances, can you point me to them?
>>>
>>> Quick search found the following two (sure there is one more with 
>>> two
>>> regs):
>>>
>>> https://elixir.bootlin.com/linux/v5.10.1/source/kernel/bpf/cgroup.c#L1718
>>> https://elixir.bootlin.com/linux/v5.10.1/source/net/core/filter.c#L8977
>>>
>>>> I'd also like to hear from Daniel/Alexei/John and others their
>>>> thoughts.
>>>
>>> Please keep me in the loop…
>>
>> Any thoughts/update on the above so I can move this patchset forward?
>
> Cc: John, Jesper, Bjorn
>
> I didn't spend time thinking about it, but I still am against xdp_buff
> extension for the purpose that code within this patch has.

Yes, I agree. If we cannot find an easy way to store the scratch 
registers on the stack, I’ll rework this patch to just store the total 
frame length in xdp_buff, as that takes less space and still fits in one 
cache line.

> Daniel/Alexei/John/Jesper/Bjorn,
>
> any objections for not having the scratch registers but rather use the
> stack and update the stack depth to calculate the frame length?
>
> This seems not trivial so I really would like to have an input from 
> better
> BPF developers than me :)




* Re: [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx
  2021-01-20 13:20                     ` Eelco Chaudron
@ 2021-02-01 16:00                       ` Eelco Chaudron
  0 siblings, 0 replies; 48+ messages in thread
From: Eelco Chaudron @ 2021-02-01 16:00 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov
  Cc: Maciej Fijalkowski, Lorenzo Bianconi, bpf, netdev, brouer, bjorn,
	toke, john.fastabend



On 20 Jan 2021, at 14:20, Eelco Chaudron wrote:

> On 18 Jan 2021, at 17:48, Maciej Fijalkowski wrote:
>
>> On Fri, Jan 15, 2021 at 05:36:23PM +0100, Eelco Chaudron wrote:
>>>
>>>
>>> On 16 Dec 2020, at 15:08, Eelco Chaudron wrote:
>>>
>>>> On 15 Dec 2020, at 19:06, Maciej Fijalkowski wrote:
>>>>
>>>>> On Tue, Dec 15, 2020 at 02:28:39PM +0100, Eelco Chaudron wrote:
>>>>>>
>>>>>>
>>>>>> On 9 Dec 2020, at 13:07, Eelco Chaudron wrote:
>>>>>>
>>>>>>> On 9 Dec 2020, at 12:10, Maciej Fijalkowski wrote:
>>>>>>
>>>>>> <SNIP>
>>>>>>
>>>>>>>>>>> +
>>>>>>>>>>> +		ctx_reg = (si->src_reg == si->dst_reg) ? scratch_reg - 1 
>>>>>>>>>>> :
>>>>>>>>>>> si->src_reg;
>>>>>>>>>>> +		while (dst_reg == ctx_reg || scratch_reg == ctx_reg)
>>>>>>>>>>> +			ctx_reg--;
>>>>>>>>>>> +
>>>>>>>>>>> +		/* Save scratch registers */
>>>>>>>>>>> +		if (ctx_reg != si->src_reg) {
>>>>>>>>>>> +			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, ctx_reg,
>>>>>>>>>>> +					      offsetof(struct xdp_buff,
>>>>>>>>>>> +						       tmp_reg[1]));
>>>>>>>>>>> +
>>>>>>>>>>> +			*insn++ = BPF_MOV64_REG(ctx_reg, si->src_reg);
>>>>>>>>>>> +		}
>>>>>>>>>>> +
>>>>>>>>>>> +		*insn++ = BPF_STX_MEM(BPF_DW, ctx_reg, scratch_reg,
>>>>>>>>>>> +				      offsetof(struct xdp_buff, tmp_reg[0]));
>>>>>>>>>>
>>>>>>>>>> Why don't you push regs to stack, use it and then pop it
>>>>>>>>>> back? That way
>>>>>>>>>> I
>>>>>>>>>> suppose you could avoid polluting xdp_buff with tmp_reg[2].
>>>>>>>>>
>>>>>>>>> There is no “real” stack in eBPF, only a read-only frame
>>>>>>>>> pointer, and as we
>>>>>>>>> are replacing a single instruction, we have no info on what we
>>>>>>>>> can use as
>>>>>>>>> scratch space.
>>>>>>>>
>>>>>>>> Uhm, what? You use R10 for stack operations. Verifier tracks 
>>>>>>>> the
>>>>>>>> stack
>>>>>>>> depth used by programs and then it is passed down to JIT so 
>>>>>>>> that
>>>>>>>> native
>>>>>>>> asm will create a properly sized stack frame.
>>>>>>>>
>>>>>>>> From the top of my head I would let know
>>>>>>>> xdp_convert_ctx_access of a
>>>>>>>> current stack depth and use it for R10 stores, so your
>>>>>>>> scratch space
>>>>>>>> would
>>>>>>>> be R10 + (stack depth + 8), R10 + (stack_depth + 16).
>>>>>>>
>>>>>>> Other instances do exactly the same, i.e. put some scratch
>>>>>>> registers in
>>>>>>> the underlying data structure, so I reused this approach. From 
>>>>>>> the
>>>>>>> current information in the callback, I was not able to
>>>>>>> determine the
>>>>>>> current stack_depth. With "real" stack above, I meant having
>>>>>>> a pop/push
>>>>>>> like instruction.
>>>>>>>
>>>>>>> I do not know the verifier code well enough, but are you
>>>>>>> suggesting I
>>>>>>> can get the current stack_depth from the verifier in the
>>>>>>> xdp_convert_ctx_access() callback? If so any pointers?
>>>>>>
>>>>>> Maciej any feedback on the above, i.e. getting the stack_depth in
>>>>>> xdp_convert_ctx_access()?
>>>>>
>>>>> Sorry. I'll try to get my head around it. If i recall correctly 
>>>>> stack
>>>>> depth is tracked per subprogram whereas convert_ctx_accesses is
>>>>> iterating
>>>>> through *all* insns (so a prog that is not chunked onto subprogs),
>>>>> but
>>>>> maybe we could dig up the subprog based on insn idx.
>>>>>
>>>>> But at first, you mentioned that you took the approach from other
>>>>> instances, can you point me to them?
>>>>
>>>> Quick search found the following two (sure there is one more with 
>>>> two
>>>> regs):
>>>>
>>>> https://elixir.bootlin.com/linux/v5.10.1/source/kernel/bpf/cgroup.c#L1718
>>>> https://elixir.bootlin.com/linux/v5.10.1/source/net/core/filter.c#L8977
>>>>
>>>>> I'd also like to hear from Daniel/Alexei/John and others their
>>>>> thoughts.
>>>>
>>>> Please keep me in the loop…
>>>
>>> Any thoughts/update on the above so I can move this patchset 
>>> forward?
>>
>> Cc: John, Jesper, Bjorn
>>
>> I didn't spend time thinking about it, but I still am against 
>> xdp_buff
>> extension for the purpose that code within this patch has.
>
> Yes I agree, if we can not find an easy way to store the scratch 
> registers on the stack, I’ll rework this patch to just store the 
> total frame length in xdp_buff, as it will be less and still fit in 
> one cache line.
>
>> Daniel/Alexei/John/Jesper/Bjorn,

Daniel/Alexei, any input on how to easily allocate two scratch-register 
slots on the stack from a function like xdp_convert_ctx_access() through 
the verifier state? See above for some more details.

If you are not the right persons, who might be the verifier guru to ask?

>> any objections for not having the scratch registers but rather use 
>> the
>> stack and update the stack depth to calculate the frame length?
>>
>> This seems not trivial so I really would like to have an input from 
>> better
>> BPF developers than me :)



end of thread, other threads:[~2021-02-01 16:02 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-07 16:32 [PATCH v5 bpf-next 00/14] mvneta: introduce XDP multi-buffer support Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 01/14] xdp: introduce mb in xdp_buff/xdp_frame Lorenzo Bianconi
2020-12-07 21:16   ` Alexander Duyck
2020-12-07 23:03     ` Saeed Mahameed
2020-12-08  3:16       ` Alexander Duyck
2020-12-08  6:49         ` Saeed Mahameed
2020-12-08  9:47           ` Jesper Dangaard Brouer
2020-12-07 16:32 ` [PATCH v5 bpf-next 02/14] xdp: initialize xdp_buff mb bit to 0 in all XDP drivers Lorenzo Bianconi
2020-12-07 21:15   ` Alexander Duyck
2020-12-07 21:37     ` Maciej Fijalkowski
2020-12-07 23:20       ` Saeed Mahameed
2020-12-08 10:31         ` Lorenzo Bianconi
2020-12-08 13:29           ` Jesper Dangaard Brouer
2020-12-07 16:32 ` [PATCH v5 bpf-next 03/14] xdp: add xdp_shared_info data structure Lorenzo Bianconi
2020-12-08  0:22   ` Saeed Mahameed
2020-12-08 11:01     ` Lorenzo Bianconi
2020-12-19 14:53       ` Shay Agroskin
2020-12-19 15:30         ` Jamal Hadi Salim
2020-12-21  9:01           ` Jesper Dangaard Brouer
2020-12-21 13:00             ` Jamal Hadi Salim
2020-12-20 17:52         ` Lorenzo Bianconi
2020-12-21 20:55           ` Shay Agroskin
2020-12-07 16:32 ` [PATCH v5 bpf-next 04/14] net: mvneta: update mb bit before passing the xdp buffer to eBPF layer Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 05/14] xdp: add multi-buff support to xdp_return_{buff/frame} Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 06/14] net: mvneta: add multi buffer support to XDP_TX Lorenzo Bianconi
2020-12-19 15:56   ` Shay Agroskin
2020-12-20 18:06     ` Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 07/14] bpf: move user_size out of bpf_test_init Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 08/14] bpf: introduce multibuff support to bpf_prog_test_run_xdp() Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 09/14] bpf: test_run: add xdp_shared_info pointer in bpf_test_finish signature Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 10/14] net: mvneta: enable jumbo frames for XDP Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 11/14] bpf: cpumap: introduce xdp multi-buff support Lorenzo Bianconi
2020-12-19 17:46   ` Shay Agroskin
2020-12-20 17:56     ` Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 12/14] bpf: add multi-buff support to the bpf_xdp_adjust_tail() API Lorenzo Bianconi
2020-12-07 16:32 ` [PATCH v5 bpf-next 13/14] bpf: add new frame_length field to the XDP ctx Lorenzo Bianconi
2020-12-08 22:17   ` Maciej Fijalkowski
2020-12-09 10:35     ` Eelco Chaudron
2020-12-09 11:10       ` Maciej Fijalkowski
2020-12-09 12:07         ` Eelco Chaudron
2020-12-15 13:28           ` Eelco Chaudron
2020-12-15 18:06             ` Maciej Fijalkowski
2020-12-16 14:08               ` Eelco Chaudron
2021-01-15 16:36                 ` Eelco Chaudron
2021-01-18 16:48                   ` Maciej Fijalkowski
2021-01-20 13:20                     ` Eelco Chaudron
2021-02-01 16:00                       ` Eelco Chaudron
2020-12-07 16:32 ` [PATCH v5 bpf-next 14/14] bpf: update xdp_adjust_tail selftest to include multi-buffer Lorenzo Bianconi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).