* [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support
@ 2019-04-30 18:12 Maxim Mikityanskiy
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

This series contains improvements to the AF_XDP kernel infrastructure
and AF_XDP support in mlx5e. The infrastructure improvements are
required for mlx5e, but some of them also benefit all drivers, and some
can be useful for other drivers that want to implement AF_XDP.

The performance testing was performed on a machine with the following
configuration:

- 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
- Mellanox ConnectX-5 Ex with 100 Gbit/s link

The results with retpoline disabled, single stream:

txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
rxdrop: 12.2 Mpps
l2fwd: 9.4 Mpps

The results with retpoline enabled, single stream:

txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
rxdrop: 9.9 Mpps
l2fwd: 6.8 Mpps

v2 changes:

Added patches for mlx5e and addressed the review comments for v1.
Rebased onto bpf-next (net-next has to be merged first, because this
series depends on some patches from there).

Maxim Mikityanskiy (16):
  xsk: Add API to check for available entries in FQ
  xsk: Add getsockopt XDP_OPTIONS
  libbpf: Support getsockopt XDP_OPTIONS
  xsk: Extend channels to support combined XSK/non-XSK traffic
  xsk: Change the default frame size to 4096 and allow controlling it
  xsk: Return the whole xdp_desc from xsk_umem_consume_tx
  net/mlx5e: Replace deprecated PCI_DMA_TODEVICE
  net/mlx5e: Calculate linear RX frag size considering XSK
  net/mlx5e: Allow ICO SQ to be used by multiple RQs
  net/mlx5e: Refactor struct mlx5e_xdp_info
  net/mlx5e: Share the XDP SQ for XDP_TX between RQs
  net/mlx5e: XDP_TX from UMEM support
  net/mlx5e: Consider XSK in XDP MTU limit calculation
  net/mlx5e: Encapsulate open/close queues into a function
  net/mlx5e: Move queue param structs to en/params.h
  net/mlx5e: Add XSK support

 drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  12 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |  15 +-
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 147 +++-
 .../ethernet/mellanox/mlx5/core/en/params.c   | 108 ++-
 .../ethernet/mellanox/mlx5/core/en/params.h   |  87 ++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 231 ++++--
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  36 +-
 .../mellanox/mlx5/core/en/xsk/Makefile        |   1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 192 +++++
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  27 +
 .../mellanox/mlx5/core/en/xsk/setup.c         | 220 ++++++
 .../mellanox/mlx5/core/en/xsk/setup.h         |  25 +
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   | 108 +++
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.h   |  15 +
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.c | 252 +++++++
 .../ethernet/mellanox/mlx5/core/en/xsk/umem.h |  34 +
 .../ethernet/mellanox/mlx5/core/en_ethtool.c  |  21 +-
 .../mellanox/mlx5/core/en_fs_ethtool.c        |  44 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 680 +++++++++++-------
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  12 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 104 ++-
 .../ethernet/mellanox/mlx5/core/en_stats.c    | 115 ++-
 .../ethernet/mellanox/mlx5/core/en_stats.h    |  30 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |  42 +-
 .../ethernet/mellanox/mlx5/core/ipoib/ipoib.c |  14 +-
 drivers/net/ethernet/mellanox/mlx5/core/wq.h  |   5 -
 include/net/xdp_sock.h                        |  27 +-
 include/uapi/linux/if_xdp.h                   |  18 +
 net/xdp/xsk.c                                 |  43 +-
 net/xdp/xsk_queue.h                           |  14 +
 samples/bpf/xdpsock_user.c                    |  52 +-
 tools/include/uapi/linux/if_xdp.h             |  18 +
 tools/lib/bpf/xsk.c                           | 127 +++-
 tools/lib/bpf/xsk.h                           |   6 +-
 35 files changed, 2384 insertions(+), 500 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/Makefile
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h

-- 
2.19.1



* [PATCH bpf-next v2 01/16] xsk: Add API to check for available entries in FQ
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

Add a function that checks whether the Fill Ring has the specified
number of descriptors available. It will be useful for mlx5e, which
wants to check in advance whether it can allocate a bulk of RX
descriptors, to get the best performance.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/net/xdp_sock.h | 21 +++++++++++++++++++++
 net/xdp/xsk.c          |  6 ++++++
 net/xdp/xsk_queue.h    | 14 ++++++++++++++
 3 files changed, 41 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index d074b6d60f8a..1acddcbda236 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -77,6 +77,7 @@ int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
 /* Used from netdev driver */
+bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt);
 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
 void xsk_umem_discard_addr(struct xdp_umem *umem);
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
@@ -99,6 +100,16 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 }
 
 /* Reuse-queue aware version of FILL queue helpers */
+static inline bool xsk_umem_has_addrs_rq(struct xdp_umem *umem, u32 cnt)
+{
+	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+	if (rq->length >= cnt)
+		return true;
+
+	return xsk_umem_has_addrs(umem, cnt - rq->length);
+}
+
 static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
 {
 	struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
@@ -146,6 +157,11 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 	return false;
 }
 
+static inline bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt)
+{
+	return false;
+}
+
 static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
 {
 	return NULL;
@@ -200,6 +216,11 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 	return 0;
 }
 
+static inline bool xsk_umem_has_addrs_rq(struct xdp_umem *umem, u32 cnt)
+{
+	return false;
+}
+
 static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
 {
 	return NULL;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a14e8864e4fa..b68a380f50b3 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -37,6 +37,12 @@ bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 		READ_ONCE(xs->umem->fq);
 }
 
+bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt)
+{
+	return xskq_has_addrs(umem->fq, cnt);
+}
+EXPORT_SYMBOL(xsk_umem_has_addrs);
+
 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
 {
 	return xskq_peek_addr(umem->fq, addr);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 88b9ae24658d..12b49784a6d5 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -117,6 +117,20 @@ static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
 	return q->nentries - (producer - q->cons_tail);
 }
 
+static inline bool xskq_has_addrs(struct xsk_queue *q, u32 cnt)
+{
+	u32 entries = q->prod_tail - q->cons_tail;
+
+	if (entries >= cnt)
+		return true;
+
+	/* Refresh the local pointer. */
+	q->prod_tail = READ_ONCE(q->ring->producer);
+	entries = q->prod_tail - q->cons_tail;
+
+	return entries >= cnt;
+}
+
 /* UMEM queue */
 
 static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
-- 
2.19.1



* [PATCH bpf-next v2 02/16] xsk: Add getsockopt XDP_OPTIONS
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

Make it possible for the application to determine whether the AF_XDP
socket is running in zero-copy mode. To achieve this, add a new
getsockopt option XDP_OPTIONS that returns flags. The only flag
supported for now is the zero-copy mode indicator.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/uapi/linux/if_xdp.h       |  7 +++++++
 net/xdp/xsk.c                     | 22 ++++++++++++++++++++++
 tools/include/uapi/linux/if_xdp.h |  7 +++++++
 3 files changed, 36 insertions(+)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index caed8b1614ff..9ae4b4e08b68 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_FILL_RING		5
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
+#define XDP_OPTIONS			8
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -60,6 +61,12 @@ struct xdp_statistics {
 	__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
 };
 
+struct xdp_options {
+	__u32 flags;
+};
+
+#define XDP_OPTIONS_FLAG_ZEROCOPY (1 << 0)
+
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
 #define XDP_PGOFF_TX_RING		 0x80000000
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b68a380f50b3..998199109d5c 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -650,6 +650,28 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
 
 		return 0;
 	}
+	case XDP_OPTIONS:
+	{
+		struct xdp_options opts;
+
+		if (len < sizeof(opts))
+			return -EINVAL;
+
+		opts.flags = 0;
+
+		mutex_lock(&xs->mutex);
+		if (xs->zc)
+			opts.flags |= XDP_OPTIONS_FLAG_ZEROCOPY;
+		mutex_unlock(&xs->mutex);
+
+		len = sizeof(opts);
+		if (copy_to_user(optval, &opts, len))
+			return -EFAULT;
+		if (put_user(len, optlen))
+			return -EFAULT;
+
+		return 0;
+	}
 	default:
 		break;
 	}
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index caed8b1614ff..9ae4b4e08b68 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_FILL_RING		5
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
+#define XDP_OPTIONS			8
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -60,6 +61,12 @@ struct xdp_statistics {
 	__u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
 };
 
+struct xdp_options {
+	__u32 flags;
+};
+
+#define XDP_OPTIONS_FLAG_ZEROCOPY (1 << 0)
+
 /* Pgoff for mmaping the rings */
 #define XDP_PGOFF_RX_RING			  0
 #define XDP_PGOFF_TX_RING		 0x80000000
-- 
2.19.1



* [PATCH bpf-next v2 03/16] libbpf: Support getsockopt XDP_OPTIONS
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

Query XDP_OPTIONS in libbpf to determine if the zero-copy mode is active
or not.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 tools/lib/bpf/xsk.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index 557ef8d1250d..a95b06d1f81d 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -67,6 +67,7 @@ struct xsk_socket {
 	int xsks_map_fd;
 	__u32 queue_id;
 	char ifname[IFNAMSIZ];
+	bool zc;
 };
 
 struct xsk_nl_info {
@@ -526,6 +527,7 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
 {
 	struct sockaddr_xdp sxdp = {};
 	struct xdp_mmap_offsets off;
+	struct xdp_options opts;
 	struct xsk_socket *xsk;
 	socklen_t optlen;
 	void *map;
@@ -643,6 +645,15 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname,
 		goto out_mmap_tx;
 	}
 
+	optlen = sizeof(opts);
+	err = getsockopt(xsk->fd, SOL_XDP, XDP_OPTIONS, &opts, &optlen);
+	if (err) {
+		err = -errno;
+		goto out_mmap_tx;
+	}
+
+	xsk->zc = opts.flags & XDP_OPTIONS_FLAG_ZEROCOPY;
+
 	if (!(xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)) {
 		err = xsk_setup_xdp_prog(xsk);
 		if (err)
-- 
2.19.1



* [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

Currently, the drivers that implement AF_XDP zero-copy support (e.g.,
i40e) switch the channel into a different mode when an XSK is opened. It
causes some issues that have to be taken into account. For example, RSS
needs to be reconfigured to skip the XSK-enabled channels, or the XDP
program should filter out traffic not intended for that socket and
XDP_PASS it with an additional copy. As nothing validates or forces the
proper configuration, it's easy to get packet drops when packets end up
in an XSK by mistake, and, in fact, that's the default configuration.
There would have to be some tool that reconfigures RSS on each socket
open and close event, but such a tool is problematic to implement,
because no one reports these events, and it's race-prone.

This commit extends XSK to support both kinds of traffic (XSK and
non-XSK) in the same channel. It implies having two RX queues in
XSK-enabled channels: one for the regular traffic, and the other for
XSK. It solves the problem with RSS: the default configuration just
works without the need to manually reconfigure RSS or to perform some
possibly complicated filtering in the XDP layer. It makes it easy to run
both AF_XDP and regular sockets on the same machine. In the XDP program,
the QID's most significant bit will serve as a flag to indicate whether
it's the XSK queue or not. The extension is compatible with the legacy
configuration, so if one wants to run the legacy mode, they can
reconfigure RSS and ignore the flag in the XDP program (implemented in
the reference XDP program in libbpf). mlx5e will support this extension.

A single XDP program can run both with drivers that support this
extension and with those that don't. The xdpsock sample and libbpf are
updated accordingly.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/uapi/linux/if_xdp.h       |  11 +++
 net/xdp/xsk.c                     |   5 +-
 samples/bpf/xdpsock_user.c        |  10 ++-
 tools/include/uapi/linux/if_xdp.h |  11 +++
 tools/lib/bpf/xsk.c               | 116 ++++++++++++++++++++++--------
 tools/lib/bpf/xsk.h               |   4 ++
 6 files changed, 126 insertions(+), 31 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 9ae4b4e08b68..cf6ff1ecc6bd 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -82,4 +82,15 @@ struct xdp_desc {
 
 /* UMEM descriptor is __u64 */
 
+/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
+ * this flag bit in the queue index to distinguish between two RQs of the same
+ * channel.
+ */
+#define XDP_QID_FLAG_XSKRQ (1 << 31)
+
+static inline __u32 xdp_qid_get_channel(__u32 qid)
+{
+	return qid & ~XDP_QID_FLAG_XSKRQ;
+}
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 998199109d5c..114ba17acb09 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -104,9 +104,12 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
+	struct xdp_rxq_info *rxq = xdp->rxq;
+	u32 channel = xdp_qid_get_channel(rxq->queue_index);
 	u32 len;
 
-	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+	if (xs->dev != rxq->dev || xs->queue_id != channel ||
+	    xs->zc != (rxq->mem.type == MEM_TYPE_ZERO_COPY))
 		return -EINVAL;
 
 	len = xdp->data_end - xdp->data;
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index d08ee1ab7bb4..a6b13025ee79 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -62,6 +62,7 @@ enum benchmark_type {
 
 static enum benchmark_type opt_bench = BENCH_RXDROP;
 static u32 opt_xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static u32 opt_libbpf_flags;
 static const char *opt_if = "";
 static int opt_ifindex;
 static int opt_queue;
@@ -306,7 +307,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
 	xsk->umem = umem;
 	cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
 	cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
-	cfg.libbpf_flags = 0;
+	cfg.libbpf_flags = opt_libbpf_flags;
 	cfg.xdp_flags = opt_xdp_flags;
 	cfg.bind_flags = opt_xdp_bind_flags;
 	ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
@@ -346,6 +347,7 @@ static struct option long_options[] = {
 	{"interval", required_argument, 0, 'n'},
 	{"zero-copy", no_argument, 0, 'z'},
 	{"copy", no_argument, 0, 'c'},
+	{"combined", no_argument, 0, 'C'},
 	{0, 0, 0, 0}
 };
 
@@ -365,6 +367,7 @@ static void usage(const char *prog)
 		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
 		"  -z, --zero-copy      Force zero-copy mode.\n"
 		"  -c, --copy           Force copy mode.\n"
+		"  -C, --combined       Driver supports combined XSK and non-XSK traffic in a channel.\n"
 		"\n";
 	fprintf(stderr, str, prog);
 	exit(EXIT_FAILURE);
@@ -377,7 +380,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
+		c = getopt_long(argc, argv, "Frtli:q:psSNn:czC", long_options,
 				&option_index);
 		if (c == -1)
 			break;
@@ -420,6 +423,9 @@ static void parse_command_line(int argc, char **argv)
 		case 'F':
 			opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
 			break;
+		case 'C':
+			opt_libbpf_flags |= XSK_LIBBPF_FLAGS__COMBINED_CHANNELS;
+			break;
 		default:
 			usage(basename(argv[0]));
 		}
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 9ae4b4e08b68..cf6ff1ecc6bd 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -82,4 +82,15 @@ struct xdp_desc {
 
 /* UMEM descriptor is __u64 */
 
+/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
+ * this flag bit in the queue index to distinguish between two RQs of the same
+ * channel.
+ */
+#define XDP_QID_FLAG_XSKRQ (1 << 31)
+
+static inline __u32 xdp_qid_get_channel(__u32 qid)
+{
+	return qid & ~XDP_QID_FLAG_XSKRQ;
+}
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index a95b06d1f81d..969dfd856039 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -76,6 +76,12 @@ struct xsk_nl_info {
 	int fd;
 };
 
+enum qidconf {
+	QIDCONF_REGULAR,
+	QIDCONF_XSK,
+	QIDCONF_XSK_COMBINED,
+};
+
 /* For 32-bit systems, we need to use mmap2 as the offsets are 64-bit.
  * Unfortunately, it is not part of glibc.
  */
@@ -139,7 +145,7 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
 		return 0;
 	}
 
-	if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)
+	if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS_MASK)
 		return -EINVAL;
 
 	cfg->rx_size = usr_cfg->rx_size;
@@ -267,44 +273,93 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
 	/* This is the C-program:
 	 * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
 	 * {
-	 *     int *qidconf, index = ctx->rx_queue_index;
+	 *     int *qidconf, qc;
+	 *     int index = ctx->rx_queue_index & ~(1 << 31);
+	 *     bool is_xskrq = ctx->rx_queue_index & (1 << 31);
 	 *
-	 *     // A set entry here means that the correspnding queue_id
-	 *     // has an active AF_XDP socket bound to it.
+	 *     // A set entry here means that the corresponding queue_id
+	 *     // has an active AF_XDP socket bound to it. Value 2 means
+	 *     // it's zero-copy multi-RQ mode.
 	 *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
 	 *     if (!qidconf)
 	 *         return XDP_ABORTED;
 	 *
-	 *     if (*qidconf)
+	 *     qc = *qidconf;
+	 *
+	 *     if (qc == 2)
+	 *         qc = is_xskrq ? 1 : 0;
+	 *
+	 *     switch (qc) {
+	 *     case 0:
+	 *         return XDP_PASS;
+	 *     case 1:
 	 *         return bpf_redirect_map(&xsks_map, index, 0);
+	 *     }
 	 *
-	 *     return XDP_PASS;
+	 *     return XDP_ABORTED;
 	 * }
 	 */
 	struct bpf_insn prog[] = {
-		/* r1 = *(u32 *)(r1 + 16) */
-		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
-		/* *(u32 *)(r10 - 4) = r1 */
-		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
-		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
-		BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
+		/* Load index. */
+		/* r6 = *(u32 *)(r1 + 16) */
+		BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_ARG1, 16),
+		/* w7 = w6 */
+		BPF_MOV32_REG(BPF_REG_7, BPF_REG_6),
+		/* w7 &= 2147483647 */
+		BPF_ALU32_IMM(BPF_AND, BPF_REG_7, ~XDP_QID_FLAG_XSKRQ),
+		/* *(u32 *)(r10 - 4) = r7 */
+		BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_REG_7, -4),
+
+		/* Call bpf_map_lookup_elem. */
+		/* r2 = r10 */
+		BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
+		/* r2 += -4 */
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
+		/* r1 = qidconf_map ll */
+		BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->qidconf_map_fd),
+		/* call 1 */
 		BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
-		BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
-		BPF_MOV32_IMM(BPF_REG_0, 0),
-		/* if r1 == 0 goto +8 */
-		BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
-		BPF_MOV32_IMM(BPF_REG_0, 2),
-		/* r1 = *(u32 *)(r1 + 0) */
-		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
-		/* if r1 == 0 goto +5 */
-		BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
-		/* r2 = *(u32 *)(r10 - 4) */
-		BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
-		BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
+
+		/* Check the return value. */
+		/* if r0 == 0 goto +14 */
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 14),
+
+		/* Check qc == QIDCONF_XSK_COMBINED. */
+		/* r6 >>= 31 */
+		BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 31),
+		/* r1 = *(u32 *)(r0 + 0) */
+		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
+		/* if r1 == 2 goto +1 */
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, QIDCONF_XSK_COMBINED, 1),
+
+		/* qc != QIDCONF_XSK_COMBINED */
+		/* r6 = r1 */
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+
+		/* switch (qc) */
+		/* w0 = 2 */
+		BPF_MOV32_IMM(BPF_REG_0, XDP_PASS),
+		/* if w6 == 0 goto +8 */
+		BPF_JMP32_IMM(BPF_JEQ, BPF_REG_6, QIDCONF_REGULAR, 8),
+		/* if w6 != 1 goto +6 */
+		BPF_JMP32_IMM(BPF_JNE, BPF_REG_6, QIDCONF_XSK, 6),
+
+		/* Call bpf_redirect_map. */
+		/* r1 = xsks_map ll */
+		BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->xsks_map_fd),
+		/* w2 = w7 */
+		BPF_MOV32_REG(BPF_REG_ARG2, BPF_REG_7),
+		/* w3 = 0 */
 		BPF_MOV32_IMM(BPF_REG_3, 0),
+		/* call 51 */
 		BPF_EMIT_CALL(BPF_FUNC_redirect_map),
-		/* The jumps are to this instruction */
+		/* exit */
+		BPF_EXIT_INSN(),
+
+		/* XDP_ABORTED */
+		/* w0 = 0 */
+		BPF_MOV32_IMM(BPF_REG_0, XDP_ABORTED),
+		/* exit */
 		BPF_EXIT_INSN(),
 	};
 	size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
@@ -483,6 +538,7 @@ static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
 
 static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
 {
+	int qidconf_value = QIDCONF_XSK;
 	bool prog_attached = false;
 	__u32 prog_id = 0;
 	int err;
@@ -505,7 +561,11 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
 		xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
 	}
 
-	err = xsk_update_bpf_maps(xsk, true, xsk->fd);
+	if (xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
+		if (xsk->zc)
+			qidconf_value = QIDCONF_XSK_COMBINED;
+
+	err = xsk_update_bpf_maps(xsk, qidconf_value, xsk->fd);
 	if (err)
 		goto out_load;
 
@@ -717,7 +777,7 @@ void xsk_socket__delete(struct xsk_socket *xsk)
 	if (!xsk)
 		return;
 
-	(void)xsk_update_bpf_maps(xsk, 0, 0);
+	(void)xsk_update_bpf_maps(xsk, QIDCONF_REGULAR, 0);
 
 	optlen = sizeof(off);
 	err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index 82ea71a0f3ec..be26a2423c04 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -180,6 +180,10 @@ struct xsk_umem_config {
 
 /* Flags for the libbpf_flags field. */
 #define XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD (1 << 0)
+#define XSK_LIBBPF_FLAGS__COMBINED_CHANNELS (1 << 1)
+#define XSK_LIBBPF_FLAGS_MASK ( \
+	XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD | \
+	XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
 
 struct xsk_socket_config {
 	__u32 rx_size;
-- 
2.19.1



* [PATCH bpf-next v2 05/16] xsk: Change the default frame size to 4096 and allow controlling it
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

The typical XDP memory scheme is one packet per page. Change the AF_XDP
frame size in libbpf to 4096, which is the page size on x86, to allow
libbpf to be used with drivers that use the packet-per-page scheme.

Add a command line option -f to xdpsock to allow specifying a custom
frame size.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 samples/bpf/xdpsock_user.c | 44 ++++++++++++++++++++++++--------------
 tools/lib/bpf/xsk.h        |  2 +-
 2 files changed, 29 insertions(+), 17 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index a6b13025ee79..9a2fd899c3e0 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -69,6 +69,7 @@ static int opt_queue;
 static int opt_poll;
 static int opt_interval = 1;
 static u32 opt_xdp_bind_flags;
+static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
 static __u32 prog_id;
 
 struct xsk_umem_info {
@@ -277,6 +278,12 @@ static size_t gen_eth_frame(struct xsk_umem_info *umem, u64 addr)
 static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
 {
 	struct xsk_umem_info *umem;
+	struct xsk_umem_config cfg = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = opt_xsk_frame_size,
+		.frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM,
+	};
 	int ret;
 
 	umem = calloc(1, sizeof(*umem));
@@ -284,7 +291,7 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
 		exit_with_error(errno);
 
 	ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
-			       NULL);
+			       &cfg);
 	if (ret)
 		exit_with_error(-ret);
 
@@ -324,11 +331,9 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
 				     &idx);
 	if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS)
 		exit_with_error(-ret);
-	for (i = 0;
-	     i < XSK_RING_PROD__DEFAULT_NUM_DESCS *
-		     XSK_UMEM__DEFAULT_FRAME_SIZE;
-	     i += XSK_UMEM__DEFAULT_FRAME_SIZE)
-		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) = i;
+	for (i = 0; i < XSK_RING_PROD__DEFAULT_NUM_DESCS; i++)
+		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx++) =
+			i * opt_xsk_frame_size;
 	xsk_ring_prod__submit(&xsk->umem->fq,
 			      XSK_RING_PROD__DEFAULT_NUM_DESCS);
 
@@ -348,6 +353,7 @@ static struct option long_options[] = {
 	{"zero-copy", no_argument, 0, 'z'},
 	{"copy", no_argument, 0, 'c'},
 	{"combined", no_argument, 0, 'C'},
+	{"frame-size", required_argument, 0, 'f'},
 	{0, 0, 0, 0}
 };
 
@@ -368,8 +374,9 @@ static void usage(const char *prog)
 		"  -z, --zero-copy      Force zero-copy mode.\n"
 		"  -c, --copy           Force copy mode.\n"
 		"  -C, --combined       Driver supports combined XSK and non-XSK traffic in a channel.\n"
+		"  -f, --frame-size=n   Set the frame size (must be a power of two, default is %d).\n"
 		"\n";
-	fprintf(stderr, str, prog);
+	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
 	exit(EXIT_FAILURE);
 }
 
@@ -380,7 +387,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "Frtli:q:psSNn:czC", long_options,
+		c = getopt_long(argc, argv, "Frtli:q:psSNn:czCf:", long_options,
 				&option_index);
 		if (c == -1)
 			break;
@@ -426,6 +433,9 @@ static void parse_command_line(int argc, char **argv)
 		case 'C':
 			opt_libbpf_flags |= XSK_LIBBPF_FLAGS__COMBINED_CHANNELS;
 			break;
+		case 'f':
+			opt_xsk_frame_size = atoi(optarg);
+			break;
 		default:
 			usage(basename(argv[0]));
 		}
@@ -438,6 +448,11 @@ static void parse_command_line(int argc, char **argv)
 		usage(basename(argv[0]));
 	}
 
+	if (opt_xsk_frame_size & (opt_xsk_frame_size - 1)) {
+		fprintf(stderr, "--frame-size=%d is not a power of two\n",
+			opt_xsk_frame_size);
+		usage(basename(argv[0]));
+	}
 }
 
 static void kick_tx(struct xsk_socket_info *xsk)
@@ -589,8 +604,7 @@ static void tx_only(struct xsk_socket_info *xsk)
 
 			for (i = 0; i < BATCH_SIZE; i++) {
 				xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->addr
-					= (frame_nb + i) <<
-					XSK_UMEM__DEFAULT_FRAME_SHIFT;
+					= (frame_nb + i) * opt_xsk_frame_size;
 				xsk_ring_prod__tx_desc(&xsk->tx, idx + i)->len =
 					sizeof(pkt_data) - 1;
 			}
@@ -667,21 +681,19 @@ int main(int argc, char **argv)
 	}
 
 	ret = posix_memalign(&bufs, getpagesize(), /* PAGE_SIZE aligned */
-			     NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE);
+			     NUM_FRAMES * opt_xsk_frame_size);
 	if (ret)
 		exit_with_error(ret);
 
        /* Create sockets... */
-	umem = xsk_configure_umem(bufs,
-				  NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE);
+	umem = xsk_configure_umem(bufs, NUM_FRAMES * opt_xsk_frame_size);
 	xsks[num_socks++] = xsk_configure_socket(umem);
 
 	if (opt_bench == BENCH_TXONLY) {
 		int i;
 
-		for (i = 0; i < NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE;
-		     i += XSK_UMEM__DEFAULT_FRAME_SIZE)
-			(void)gen_eth_frame(umem, i);
+		for (i = 0; i < NUM_FRAMES; i++)
+			(void)gen_eth_frame(umem, i * opt_xsk_frame_size);
 	}
 
 	signal(SIGINT, int_exit);
diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index be26a2423c04..31d3db1bd1d5 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -167,7 +167,7 @@ LIBBPF_API int xsk_socket__fd(const struct xsk_socket *xsk);
 
 #define XSK_RING_CONS__DEFAULT_NUM_DESCS      2048
 #define XSK_RING_PROD__DEFAULT_NUM_DESCS      2048
-#define XSK_UMEM__DEFAULT_FRAME_SHIFT    11 /* 2048 bytes */
+#define XSK_UMEM__DEFAULT_FRAME_SHIFT    12 /* 4096 bytes */
 #define XSK_UMEM__DEFAULT_FRAME_SIZE     (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT)
 #define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0
 
-- 
2.19.1



* [PATCH bpf-next v2 06/16] xsk: Return the whole xdp_desc from xsk_umem_consume_tx
@ 2019-04-30 18:12 ` Maxim Mikityanskiy
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

Some drivers want to access the data being transmitted in order to
implement acceleration features of their NICs. It is also useful in the
AF_XDP TX flow.

Change the xsk_umem_consume_tx API to return the whole xdp_desc, which
contains the data pointer, length and DMA address, instead of only the
latter two. Adapt the implementations of i40e and ixgbe to this change.
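As an illustration of the new contract, here is a minimal, self-contained
userspace model of the changed API. The xdp_desc layout mirrors the uapi
struct; fake_umem, its ring, and the constants in it are invented
stand-ins for the real umem and TX queue, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the uapi struct xdp_desc: the consumer now receives the whole
 * descriptor (addr + len) and derives the DMA address itself, instead of
 * getting a precomputed (dma, len) pair. */
struct xdp_desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

/* Simplified stand-in for the umem and its TX ring. */
struct fake_umem {
	uint64_t dma_base;          /* base of the DMA-mapped umem area */
	struct xdp_desc *tx_ring;   /* pending TX descriptors */
	size_t head, tail;
};

/* New-style consume: hand out the whole descriptor. */
static bool xsk_umem_consume_tx(struct fake_umem *umem, struct xdp_desc *desc)
{
	if (umem->head == umem->tail)
		return false;
	*desc = umem->tx_ring[umem->head++];
	return true;
}

/* What drivers previously got directly; now derived from desc.addr,
 * as the i40e/ixgbe hunks do with xdp_umem_get_dma(). */
static uint64_t xdp_umem_get_dma(struct fake_umem *umem, uint64_t addr)
{
	return umem->dma_base + addr;
}
```

A driver loop would call xsk_umem_consume_tx() until it returns false,
use desc.addr both for the DMA mapping and (in later patches) for
touching the payload, and desc.len for the descriptor length.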

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c   | 12 +++++++-----
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 15 +++++++++------
 include/net/xdp_sock.h                       |  6 +++---
 net/xdp/xsk.c                                | 10 +++-------
 4 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 1b17486543ac..eae6fafad1b8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -640,8 +640,8 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 	struct i40e_tx_desc *tx_desc = NULL;
 	struct i40e_tx_buffer *tx_bi;
 	bool work_done = true;
+	struct xdp_desc desc;
 	dma_addr_t dma;
-	u32 len;
 
 	while (budget-- > 0) {
 		if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
@@ -650,21 +650,23 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
 			break;
 		}
 
-		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
 			break;
 
-		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+		dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
+
+		dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
 					   DMA_BIDIRECTIONAL);
 
 		tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
-		tx_bi->bytecount = len;
+		tx_bi->bytecount = desc.len;
 
 		tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
 		tx_desc->buffer_addr = cpu_to_le64(dma);
 		tx_desc->cmd_type_offset_bsz =
 			build_ctob(I40E_TX_DESC_CMD_ICRC
 				   | I40E_TX_DESC_CMD_EOP,
-				   0, len, 0);
+				   0, desc.len, 0);
 
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index bfe95ce0bd7f..0297a70a4e2d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -621,8 +621,9 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
 	union ixgbe_adv_tx_desc *tx_desc = NULL;
 	struct ixgbe_tx_buffer *tx_bi;
 	bool work_done = true;
-	u32 len, cmd_type;
+	struct xdp_desc desc;
 	dma_addr_t dma;
+	u32 cmd_type;
 
 	while (budget-- > 0) {
 		if (unlikely(!ixgbe_desc_unused(xdp_ring)) ||
@@ -631,14 +632,16 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
 			break;
 		}
 
-		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+		if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc))
 			break;
 
-		dma_sync_single_for_device(xdp_ring->dev, dma, len,
+		dma = xdp_umem_get_dma(xdp_ring->xsk_umem, desc.addr);
+
+		dma_sync_single_for_device(xdp_ring->dev, dma, desc.len,
 					   DMA_BIDIRECTIONAL);
 
 		tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use];
-		tx_bi->bytecount = len;
+		tx_bi->bytecount = desc.len;
 		tx_bi->xdpf = NULL;
 
 		tx_desc = IXGBE_TX_DESC(xdp_ring, xdp_ring->next_to_use);
@@ -648,10 +651,10 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
 		cmd_type = IXGBE_ADVTXD_DTYP_DATA |
 			   IXGBE_ADVTXD_DCMD_DEXT |
 			   IXGBE_ADVTXD_DCMD_IFCS;
-		cmd_type |= len | IXGBE_TXD_CMD;
+		cmd_type |= desc.len | IXGBE_TXD_CMD;
 		tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
 		tx_desc->read.olinfo_status =
-			cpu_to_le32(len << IXGBE_ADVTXD_PAYLEN_SHIFT);
+			cpu_to_le32(desc.len << IXGBE_ADVTXD_PAYLEN_SHIFT);
 
 		xdp_ring->next_to_use++;
 		if (xdp_ring->next_to_use == xdp_ring->count)
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 1acddcbda236..eec5b66aee93 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -81,7 +81,7 @@ bool xsk_umem_has_addrs(struct xdp_umem *umem, u32 cnt);
 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr);
 void xsk_umem_discard_addr(struct xdp_umem *umem);
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
-bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
+bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc);
 void xsk_umem_consume_tx_done(struct xdp_umem *umem);
 struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
 struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
@@ -175,8 +175,8 @@ static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
 {
 }
 
-static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
-				       u32 *len)
+static inline bool xsk_umem_consume_tx(struct xdp_umem *umem,
+				       struct xdp_desc *desc)
 {
 	return false;
 }
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 114ba17acb09..99e987e25a4d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -175,22 +175,18 @@ void xsk_umem_consume_tx_done(struct xdp_umem *umem)
 }
 EXPORT_SYMBOL(xsk_umem_consume_tx_done);
 
-bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len)
+bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc)
 {
-	struct xdp_desc desc;
 	struct xdp_sock *xs;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
-		if (!xskq_peek_desc(xs->tx, &desc))
+		if (!xskq_peek_desc(xs->tx, desc))
 			continue;
 
-		if (xskq_produce_addr_lazy(umem->cq, desc.addr))
+		if (xskq_produce_addr_lazy(umem->cq, desc->addr))
 			goto out;
 
-		*dma = xdp_umem_get_dma(umem, desc.addr);
-		*len = desc.len;
-
 		xskq_discard_desc(xs->tx);
 		rcu_read_unlock();
 		return true;
-- 
2.19.1



* [PATCH bpf-next v2 07/16] net/mlx5e: Replace deprecated PCI_DMA_TODEVICE
@ 2019-04-30 18:12 ` Maxim Mikityanskiy
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)

The PCI API for DMA is deprecated: PCI_DMA_TODEVICE is just defined as
DMA_TO_DEVICE for backward compatibility. Use DMA_TO_DEVICE directly.
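For reference, the relationship can be sketched like this. The enum
values mirror enum dma_data_direction from the generic DMA API, and the
define mirrors the pci-dma-compat alias; this is an illustrative
userspace reduction, not the kernel headers themselves:

```c
#include <assert.h>

/* Values mirror enum dma_data_direction in the generic DMA API. */
enum dma_data_direction {
	DMA_BIDIRECTIONAL = 0,
	DMA_TO_DEVICE     = 1,
	DMA_FROM_DEVICE   = 2,
	DMA_NONE          = 3,
};

/* The deprecated PCI name is a plain alias of the generic one, so the
 * substitution in the patch is a no-op at the binary level. */
#define PCI_DMA_TODEVICE DMA_TO_DEVICE
```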

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index eb8ef78e5626..5a900b70b203 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -64,7 +64,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_dma_info *di,
 		return false;
 	xdpi.dma_addr = di->addr + (xdpi.xdpf->data - (void *)xdpi.xdpf);
 	dma_sync_single_for_device(sq->pdev, xdpi.dma_addr,
-				   xdpi.xdpf->len, PCI_DMA_TODEVICE);
+				   xdpi.xdpf->len, DMA_TO_DEVICE);
 	xdpi.di = *di;
 
 	return sq->xmit_xdp_frame(sq, &xdpi);
-- 
2.19.1



* [PATCH bpf-next v2 08/16] net/mlx5e: Calculate linear RX frag size considering XSK
@ 2019-04-30 18:12 ` Maxim Mikityanskiy
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)

Additional conditions introduced:

- XSK implies XDP.
- Headroom includes the XSK headroom if it exists.
- No space is reserved for struct skb_shared_info in XSK mode.
- Fragment size smaller than the XSK chunk size is not allowed.

A new auxiliary function, mlx5e_get_linear_rq_headroom, with support for
XSK is introduced. Use this function in the implementation of
mlx5e_get_rq_headroom. Change headroom to u32 to match the headroom
field in struct xdp_umem.
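The resulting fragment-size logic can be sketched as a standalone model.
All constants and the skb_frag_sz() helper below are illustrative
stand-ins (e.g. NET_IP_ALIGN and MLX5_RX_HEADROOM vary by architecture
and configuration, and struct skb_shared_info is not really 320 bytes
everywhere), so the numbers only demonstrate the shape of the
calculation, not the exact values the driver computes:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants; the real values depend on arch and config. */
#define PAGE_SIZE_           4096u
#define XDP_PACKET_HEADROOM_  256u
#define MLX5_RX_HEADROOM_      64u
#define NET_IP_ALIGN_           2u

struct xsk_param {
	uint16_t headroom;
	uint16_t chunk_size;
};

static uint32_t max_u32(uint32_t a, uint32_t b) { return a > b ? a : b; }

/* Stand-in for MLX5_SKB_FRAG_SZ: payload plus an assumed 320-byte
 * struct skb_shared_info. */
static uint32_t skb_frag_sz(uint32_t sz) { return sz + 320u; }

/* Mirrors the flow of the new mlx5e_rx_get_linear_frag_sz(). */
static uint32_t linear_frag_sz(uint32_t hw_mtu, bool has_xdp_prog,
			       const struct xsk_param *xsk)
{
	bool is_xdp = has_xdp_prog || xsk;   /* XSK implies XDP */
	uint32_t headroom = NET_IP_ALIGN_ +
		(is_xdp ? XDP_PACKET_HEADROOM_ + (xsk ? xsk->headroom : 0)
			: MLX5_RX_HEADROOM_);
	uint32_t frag_sz = headroom + hw_mtu;

	/* AF_XDP doesn't build SKBs in place: no skb_shared_info space. */
	if (!xsk)
		frag_sz = skb_frag_sz(frag_sz);

	/* XDP in mlx5e doesn't support multiple packets per page. */
	if (is_xdp)
		frag_sz = max_u32(frag_sz, PAGE_SIZE_);

	/* The fragment must not be smaller than the XSK chunk. */
	if (xsk)
		frag_sz = max_u32(frag_sz, xsk->chunk_size);

	return frag_sz;
}
```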

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../ethernet/mellanox/mlx5/core/en/params.c   | 65 +++++++++++++------
 .../ethernet/mellanox/mlx5/core/en/params.h   |  8 ++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 3 files changed, 52 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index d3744bffbae3..50a458dc3836 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -3,33 +3,62 @@
 
 #include "en/params.h"
 
-u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params)
+static inline bool mlx5e_rx_is_xdp(struct mlx5e_params *params,
+				   struct mlx5e_xsk_param *xsk)
 {
-	u16 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
-	u16 linear_rq_headroom = params->xdp_prog ?
-		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
-	u32 frag_sz;
+	return params->xdp_prog || xsk;
+}
+
+static inline u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
+					       struct mlx5e_xsk_param *xsk)
+{
+	u16 headroom = NET_IP_ALIGN;
+
+	if (mlx5e_rx_is_xdp(params, xsk)) {
+		headroom += XDP_PACKET_HEADROOM;
+		if (xsk)
+			headroom += xsk->headroom;
+	} else {
+		headroom += MLX5_RX_HEADROOM;
+	}
+
+	return headroom;
+}
+
+u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk)
+{
+	u32 hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu);
+	u16 linear_rq_headroom = mlx5e_get_linear_rq_headroom(params, xsk);
+	u32 frag_sz = linear_rq_headroom + hw_mtu;
 
-	linear_rq_headroom += NET_IP_ALIGN;
+	/* AF_XDP doesn't build SKBs in place. */
+	if (!xsk)
+		frag_sz = MLX5_SKB_FRAG_SZ(frag_sz);
 
-	frag_sz = MLX5_SKB_FRAG_SZ(linear_rq_headroom + hw_mtu);
+	/* XDP in mlx5e doesn't support multiple packets per page. */
+	if (mlx5e_rx_is_xdp(params, xsk))
+		frag_sz = max_t(u32, frag_sz, PAGE_SIZE);
 
-	if (params->xdp_prog && frag_sz < PAGE_SIZE)
-		frag_sz = PAGE_SIZE;
+	/* Even if we can go with a smaller fragment size, we must not put
+	 * multiple packets into a single frame.
+	 */
+	if (xsk)
+		frag_sz = max_t(u32, frag_sz, xsk->chunk_size);
 
 	return frag_sz;
 }
 
 u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params)
 {
-	u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params);
+	u32 linear_frag_sz = mlx5e_rx_get_linear_frag_sz(params, NULL);
 
 	return MLX5_MPWRQ_LOG_WQE_SZ - order_base_2(linear_frag_sz);
 }
 
 bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params)
 {
-	u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params);
+	u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params, NULL);
 
 	return !params->lro_en && frag_sz <= PAGE_SIZE;
 }
@@ -39,7 +68,7 @@ bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params)
 bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev,
 				  struct mlx5e_params *params)
 {
-	u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params);
+	u32 frag_sz = mlx5e_rx_get_linear_frag_sz(params, NULL);
 	s8 signed_log_num_strides_param;
 	u8 log_num_strides;
 
@@ -75,7 +104,7 @@ u8 mlx5e_mpwqe_get_log_stride_size(struct mlx5_core_dev *mdev,
 				   struct mlx5e_params *params)
 {
 	if (mlx5e_rx_mpwqe_is_linear_skb(mdev, params))
-		return order_base_2(mlx5e_rx_get_linear_frag_sz(params));
+		return order_base_2(mlx5e_rx_get_linear_frag_sz(params, NULL));
 
 	return MLX5_MPWRQ_DEF_LOG_STRIDE_SZ(mdev);
 }
@@ -90,15 +119,9 @@ u8 mlx5e_mpwqe_get_log_num_strides(struct mlx5_core_dev *mdev,
 u16 mlx5e_get_rq_headroom(struct mlx5_core_dev *mdev,
 			  struct mlx5e_params *params)
 {
-	u16 linear_rq_headroom = params->xdp_prog ?
-		XDP_PACKET_HEADROOM : MLX5_RX_HEADROOM;
-	bool is_linear_skb;
-
-	linear_rq_headroom += NET_IP_ALIGN;
-
-	is_linear_skb = (params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
+	bool is_linear_skb = (params->rq_wq_type == MLX5_WQ_TYPE_CYCLIC) ?
 		mlx5e_rx_is_linear_skb(params) :
 		mlx5e_rx_mpwqe_is_linear_skb(mdev, params);
 
-	return is_linear_skb ? linear_rq_headroom : 0;
+	return is_linear_skb ? mlx5e_get_linear_rq_headroom(params, NULL) : 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index b106a0236f36..ed420f3efe52 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -6,7 +6,13 @@
 
 #include "en.h"
 
-u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params);
+struct mlx5e_xsk_param {
+	u16 headroom;
+	u16 chunk_size;
+};
+
+u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
+				struct mlx5e_xsk_param *xsk);
 u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params);
 bool mlx5e_rx_is_linear_skb(struct mlx5e_params *params);
 bool mlx5e_rx_mpwqe_is_linear_skb(struct mlx5_core_dev *mdev,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 457cc39423f2..204e199b141e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1955,7 +1955,7 @@ static void mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 	if (mlx5e_rx_is_linear_skb(params)) {
 		int frag_stride;
 
-		frag_stride = mlx5e_rx_get_linear_frag_sz(params);
+		frag_stride = mlx5e_rx_get_linear_frag_sz(params, NULL);
 		frag_stride = roundup_pow_of_two(frag_stride);
 
 		info->arr[0].frag_size = byte_count;
-- 
2.19.1



* [PATCH bpf-next v2 09/16] net/mlx5e: Allow ICO SQ to be used by multiple RQs
@ 2019-04-30 18:12 ` Maxim Mikityanskiy
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)

Prepare for the creation of the XSK RQ, which will require posting UMRs,
too. The same ICO SQ will be used for both RQs and also to trigger
interrupts by posting NOPs. UMR WQEs can no longer be reused, so the
optimization introduced in commit ab966d7e4ff98 ("net/mlx5e: RX, Recycle
buffer of UMR WQEs") is reverted.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  9 +++++++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 27 +++++++------------
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |  4 ++-
 drivers/net/ethernet/mellanox/mlx5/core/wq.h  |  5 ----
 4 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 3a183d690e23..41e22763007c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -348,6 +348,13 @@ enum {
 
 struct mlx5e_sq_wqe_info {
 	u8  opcode;
+
+	/* Auxiliary data for different opcodes. */
+	union {
+		struct {
+			struct mlx5e_rq *rq;
+		} umr;
+	};
 };
 
 struct mlx5e_txqsq {
@@ -570,6 +577,7 @@ struct mlx5e_rq {
 			u8                     log_stride_sz;
 			u8                     umr_in_progress;
 			u8                     umr_last_bulk;
+			u8                     umr_completed;
 		} mpwqe;
 	};
 	struct {
@@ -797,6 +805,7 @@ void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
+void mlx5e_poll_ico_cq(struct mlx5e_cq *cq);
 bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq);
 void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
 void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 13133e7f088e..5d762da6bf9b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -425,11 +425,6 @@ static void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq, u8 n)
 	mlx5_wq_ll_update_db_record(wq);
 }
 
-static inline u16 mlx5e_icosq_wrap_cnt(struct mlx5e_icosq *sq)
-{
-	return mlx5_wq_cyc_get_ctr_wrap_cnt(&sq->wq, sq->pc);
-}
-
 static inline void mlx5e_fill_icosq_frag_edge(struct mlx5e_icosq *sq,
 					      struct mlx5_wq_cyc *wq,
 					      u16 pi, u16 nnops)
@@ -465,9 +460,7 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 	}
 
 	umr_wqe = mlx5_wq_cyc_get_wqe(wq, pi);
-	if (unlikely(mlx5e_icosq_wrap_cnt(sq) < 2))
-		memcpy(umr_wqe, &rq->mpwqe.umr_wqe,
-		       offsetof(struct mlx5e_umr_wqe, inline_mtts));
+	memcpy(umr_wqe, &rq->mpwqe.umr_wqe, offsetof(struct mlx5e_umr_wqe, inline_mtts));
 
 	for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++, dma_info++) {
 		err = mlx5e_page_alloc_mapped(rq, dma_info);
@@ -485,6 +478,7 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 	umr_wqe->uctrl.xlt_offset = cpu_to_be16(xlt_offset);
 
 	sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_UMR;
+	sq->db.ico_wqe[pi].umr.rq = rq;
 	sq->pc += MLX5E_UMR_WQEBBS;
 
 	sq->doorbell_cseg = &umr_wqe->ctrl;
@@ -542,11 +536,10 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
 	return !!err;
 }
 
-static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
+void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
 {
 	struct mlx5e_icosq *sq = container_of(cq, struct mlx5e_icosq, cq);
 	struct mlx5_cqe64 *cqe;
-	u8  completed_umr = 0;
 	u16 sqcc;
 	int i;
 
@@ -587,7 +580,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
 
 			if (likely(wi->opcode == MLX5_OPCODE_UMR)) {
 				sqcc += MLX5E_UMR_WQEBBS;
-				completed_umr++;
+				wi->umr.rq->mpwqe.umr_completed++;
 			} else if (likely(wi->opcode == MLX5_OPCODE_NOP)) {
 				sqcc++;
 			} else {
@@ -603,24 +596,24 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
 	sq->cc = sqcc;
 
 	mlx5_cqwq_update_db_record(&cq->wq);
-
-	if (likely(completed_umr)) {
-		mlx5e_post_rx_mpwqe(rq, completed_umr);
-		rq->mpwqe.umr_in_progress -= completed_umr;
-	}
 }
 
 bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq)
 {
 	struct mlx5e_icosq *sq = &rq->channel->icosq;
 	struct mlx5_wq_ll *wq = &rq->mpwqe.wq;
+	u8  umr_completed = rq->mpwqe.umr_completed;
 	u8  missing, i;
 	u16 head;
 
 	if (unlikely(!test_bit(MLX5E_RQ_STATE_ENABLED, &rq->state)))
 		return false;
 
-	mlx5e_poll_ico_cq(&sq->cq, rq);
+	if (umr_completed) {
+		mlx5e_post_rx_mpwqe(rq, umr_completed);
+		rq->mpwqe.umr_in_progress -= umr_completed;
+		rq->mpwqe.umr_completed = 0;
+	}
 
 	missing = mlx5_wq_ll_missing(wq) - rq->mpwqe.umr_in_progress;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index f9862bf75491..de4d5ae431af 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -107,7 +107,9 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 		busy |= work_done == budget;
 	}
 
-	busy |= c->rq.post_wqes(rq);
+	mlx5e_poll_ico_cq(&c->icosq.cq);
+
+	busy |= rq->post_wqes(rq);
 
 	if (busy) {
 		if (likely(mlx5e_channel_no_affinity_change(c)))
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/wq.h b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
index 1f87cce421e0..f1ec58c9e9e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/wq.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/wq.h
@@ -134,11 +134,6 @@ static inline void mlx5_wq_cyc_update_db_record(struct mlx5_wq_cyc *wq)
 	*wq->db = cpu_to_be32(wq->wqe_ctr);
 }
 
-static inline u16 mlx5_wq_cyc_get_ctr_wrap_cnt(struct mlx5_wq_cyc *wq, u16 ctr)
-{
-	return ctr >> wq->fbc.log_sz;
-}
-
 static inline u16 mlx5_wq_cyc_ctr2ix(struct mlx5_wq_cyc *wq, u16 ctr)
 {
 	return ctr & wq->fbc.sz_m1;
-- 
2.19.1



* [PATCH bpf-next v2 10/16] net/mlx5e: Refactor struct mlx5e_xdp_info
@ 2019-04-30 18:12 ` Maxim Mikityanskiy
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)

Currently, struct mlx5e_xdp_info has some issues that have to be cleaned
up before the upcoming AF_XDP support makes things too complicated and
messy. This structure is used both when sending a packet and on
completion. Moreover, the cleanup procedure on completion depends on the
origin of the packet (XDP_REDIRECT, XDP_TX). Adding AF_XDP support will
introduce new flows that use this structure in yet other ways. To avoid
overcomplicating the code, this commit refactors its usage in the
following ways:

1. struct mlx5e_xdp_info is split into two different structures. One is
struct mlx5e_xdp_xmit_data, a transient structure that doesn't need to
be stored and is only used while sending the packet. The other is still
struct mlx5e_xdp_info that is stored in a FIFO and contains the fields
needed on completion.

2. The fields of struct mlx5e_xdp_info that are used in different flows
are put into a union. A special enum indicates the cleanup mode and
helps choose the right union member. This approach is clear and
explicit. Although it would be possible to "guess" the mode by looking
at the values of the fields and at the XDP SQ type, it wouldn't be as
clear or extensible and would require looking through the whole chain
to understand what's going on.

For reference, these are the fields of struct mlx5e_xdp_info that are
used in the different flows (including the AF_XDP ones):

Packet origin          | Fields used on completion | Cleanup steps
-----------------------+---------------------------+------------------
XDP_REDIRECT,          | xdpf, dma_addr            | DMA unmap and
XDP_TX from XSK RQ     |                           | xdp_return_frame.
-----------------------+---------------------------+------------------
XDP_TX from regular RQ | di                        | Recycle page.
-----------------------+---------------------------+------------------
AF_XDP TX              | (none)                    | Increment the
                       |                           | producer index in
                       |                           | Completion Ring.

On send, the same set of mlx5e_xdp_xmit_data fields is used in all
flows: the DMA address, the virtual address and the length.
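A minimal sketch of the completion-side dispatch that the tagged union
enables. The cleanup bodies are replaced by counters here; in the real
driver they are DMA unmap + xdp_return_frame, page recycling, and a
Completion Ring producer increment, respectively:

```c
#include <assert.h>
#include <stdint.h>

/* The xmit-mode tag selects which union member is valid and which
 * cleanup path to take on TX completion. */
enum xmit_mode { MODE_FRAME, MODE_PAGE, MODE_XSK };

struct xdp_info {
	enum xmit_mode mode;
	union {
		struct { void *xdpf; uint64_t dma_addr; } frame;  /* MODE_FRAME */
		struct { void *di; } page;                        /* MODE_PAGE */
		/* MODE_XSK needs no per-packet data at all. */
	};
};

struct cleanup_stats {
	int frames_returned, pages_recycled, xsk_completions;
};

static void complete_xdp_tx(const struct xdp_info *xdpi,
			    struct cleanup_stats *s)
{
	switch (xdpi->mode) {
	case MODE_FRAME: s->frames_returned++; break; /* unmap + return frame */
	case MODE_PAGE:  s->pages_recycled++;  break; /* recycle xdpi->page.di */
	case MODE_XSK:   s->xsk_completions++; break; /* bump CQ producer */
	}
}
```

The explicit tag is what makes the dispatch safe: the completion path
never has to infer the packet origin from field values or the SQ type.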

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  | 46 +++++++++--
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 81 ++++++++++++-------
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 11 ++-
 3 files changed, 97 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 41e22763007c..cdb73568a344 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -402,10 +402,44 @@ struct mlx5e_dma_info {
 	dma_addr_t      addr;
 };
 
+/* XDP packets can be transmitted in different ways. On completion, we need to
+ * distinguish between them to clean up things in a proper way.
+ */
+enum mlx5e_xdp_xmit_mode {
+	/* An xdp_frame was transmitted due to either XDP_REDIRECT from another
+	 * device or XDP_TX from an XSK RQ. The frame has to be unmapped and
+	 * returned.
+	 */
+	MLX5E_XDP_XMIT_MODE_FRAME,
+
+	/* The xdp_frame was created in place as a result of XDP_TX from a
+	 * regular RQ. No DMA remapping happened, and the page belongs to us.
+	 */
+	MLX5E_XDP_XMIT_MODE_PAGE,
+
+	/* No xdp_frame was created at all, the transmit happened from a UMEM
+	 * page. The UMEM Completion Ring producer pointer has to be increased.
+	 */
+	MLX5E_XDP_XMIT_MODE_XSK,
+};
+
 struct mlx5e_xdp_info {
-	struct xdp_frame      *xdpf;
-	dma_addr_t            dma_addr;
-	struct mlx5e_dma_info di;
+	enum mlx5e_xdp_xmit_mode mode;
+	union {
+		struct {
+			struct xdp_frame *xdpf;
+			dma_addr_t dma_addr;
+		} frame;
+		struct {
+			struct mlx5e_dma_info di;
+		} page;
+	};
+};
+
+struct mlx5e_xdp_xmit_data {
+	dma_addr_t  dma_addr;
+	void       *data;
+	u32         len;
 };
 
 struct mlx5e_xdp_info_fifo {
@@ -431,8 +465,10 @@ struct mlx5e_xdp_mpwqe {
 };
 
 struct mlx5e_xdpsq;
-typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq*,
-					struct mlx5e_xdp_info*);
+typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
+					struct mlx5e_xdp_xmit_data *,
+					struct mlx5e_xdp_info *);
+
 struct mlx5e_xdpsq {
 	/* data path */
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 5a900b70b203..89f6eb1109cf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -57,17 +57,27 @@ static inline bool
 mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_dma_info *di,
 		    struct xdp_buff *xdp)
 {
+	struct mlx5e_xdp_xmit_data xdptxd;
 	struct mlx5e_xdp_info xdpi;
+	struct xdp_frame *xdpf;
+	dma_addr_t dma_addr;
 
-	xdpi.xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpi.xdpf))
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
 		return false;
-	xdpi.dma_addr = di->addr + (xdpi.xdpf->data - (void *)xdpi.xdpf);
-	dma_sync_single_for_device(sq->pdev, xdpi.dma_addr,
-				   xdpi.xdpf->len, DMA_TO_DEVICE);
-	xdpi.di = *di;
 
-	return sq->xmit_xdp_frame(sq, &xdpi);
+	xdptxd.data = xdpf->data;
+	xdptxd.len  = xdpf->len;
+
+	xdpi.mode = MLX5E_XDP_XMIT_MODE_PAGE;
+
+	dma_addr = di->addr + (xdpf->data - (void *)xdpf);
+	dma_sync_single_for_device(sq->pdev, dma_addr, xdptxd.len, DMA_TO_DEVICE);
+
+	xdptxd.dma_addr = dma_addr;
+	xdpi.page.di = *di;
+
+	return sq->xmit_xdp_frame(sq, &xdptxd, &xdpi);
 }
 
 /* returns true if packet was consumed by xdp */
@@ -184,14 +194,13 @@ static void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq)
 }
 
 static bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
+				       struct mlx5e_xdp_xmit_data *xdptxd,
 				       struct mlx5e_xdp_info *xdpi)
 {
 	struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
 	struct mlx5e_xdpsq_stats *stats = sq->stats;
 
-	struct xdp_frame *xdpf = xdpi->xdpf;
-
-	if (unlikely(sq->hw_mtu < xdpf->len)) {
+	if (unlikely(xdptxd->len > sq->hw_mtu)) {
 		stats->err++;
 		return false;
 	}
@@ -208,7 +217,7 @@ static bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
 		mlx5e_xdp_mpwqe_session_start(sq);
 	}
 
-	mlx5e_xdp_mpwqe_add_dseg(sq, xdpi, stats);
+	mlx5e_xdp_mpwqe_add_dseg(sq, xdptxd, stats);
 
 	if (unlikely(session->complete ||
 		     session->ds_count == session->max_ds_count))
@@ -219,7 +228,9 @@ static bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
 	return true;
 }
 
-static bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi)
+static bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
+				 struct mlx5e_xdp_xmit_data *xdptxd,
+				 struct mlx5e_xdp_info *xdpi)
 {
 	struct mlx5_wq_cyc       *wq   = &sq->wq;
 	u16                       pi   = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
@@ -229,9 +240,8 @@ static bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *
 	struct mlx5_wqe_eth_seg  *eseg = &wqe->eth;
 	struct mlx5_wqe_data_seg *dseg = wqe->data;
 
-	struct xdp_frame *xdpf = xdpi->xdpf;
-	dma_addr_t dma_addr  = xdpi->dma_addr;
-	unsigned int dma_len = xdpf->len;
+	dma_addr_t dma_addr = xdptxd->dma_addr;
+	u32 dma_len = xdptxd->len;
 
 	struct mlx5e_xdpsq_stats *stats = sq->stats;
 
@@ -253,7 +263,7 @@ static bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *
 
 	/* copy the inline part if required */
 	if (sq->min_inline_mode != MLX5_INLINE_MODE_NONE) {
-		memcpy(eseg->inline_hdr.start, xdpf->data, MLX5E_XDP_MIN_INLINE);
+		memcpy(eseg->inline_hdr.start, xdptxd->data, MLX5E_XDP_MIN_INLINE);
 		eseg->inline_hdr.sz = cpu_to_be16(MLX5E_XDP_MIN_INLINE);
 		dma_len  -= MLX5E_XDP_MIN_INLINE;
 		dma_addr += MLX5E_XDP_MIN_INLINE;
@@ -286,14 +296,19 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 	for (i = 0; i < wi->num_pkts; i++) {
 		struct mlx5e_xdp_info xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 
-		if (rq) {
-			/* XDP_TX */
-			mlx5e_page_release(rq, &xdpi.di, recycle);
-		} else {
+		switch (xdpi.mode) {
+		case MLX5E_XDP_XMIT_MODE_FRAME:
 			/* XDP_REDIRECT */
-			dma_unmap_single(sq->pdev, xdpi.dma_addr,
-					 xdpi.xdpf->len, DMA_TO_DEVICE);
-			xdp_return_frame(xdpi.xdpf);
+			dma_unmap_single(sq->pdev, xdpi.frame.dma_addr,
+					 xdpi.frame.xdpf->len, DMA_TO_DEVICE);
+			xdp_return_frame(xdpi.frame.xdpf);
+			break;
+		case MLX5E_XDP_XMIT_MODE_PAGE:
+			/* XDP_TX */
+			mlx5e_page_release(rq, &xdpi.page.di, recycle);
+			break;
+		default:
+			WARN_ON_ONCE(true);
 		}
 	}
 }
@@ -398,21 +413,27 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *xdpf = frames[i];
+		struct mlx5e_xdp_xmit_data xdptxd;
 		struct mlx5e_xdp_info xdpi;
 
-		xdpi.dma_addr = dma_map_single(sq->pdev, xdpf->data, xdpf->len,
-					       DMA_TO_DEVICE);
-		if (unlikely(dma_mapping_error(sq->pdev, xdpi.dma_addr))) {
+		xdptxd.data = xdpf->data;
+		xdptxd.len = xdpf->len;
+		xdptxd.dma_addr = dma_map_single(sq->pdev, xdptxd.data,
+						 xdptxd.len, DMA_TO_DEVICE);
+
+		if (unlikely(dma_mapping_error(sq->pdev, xdptxd.dma_addr))) {
 			xdp_return_frame_rx_napi(xdpf);
 			drops++;
 			continue;
 		}
 
-		xdpi.xdpf = xdpf;
+		xdpi.mode           = MLX5E_XDP_XMIT_MODE_FRAME;
+		xdpi.frame.xdpf     = xdpf;
+		xdpi.frame.dma_addr = xdptxd.dma_addr;
 
-		if (unlikely(!sq->xmit_xdp_frame(sq, &xdpi))) {
-			dma_unmap_single(sq->pdev, xdpi.dma_addr,
-					 xdpf->len, DMA_TO_DEVICE);
+		if (unlikely(!sq->xmit_xdp_frame(sq, &xdptxd, &xdpi))) {
+			dma_unmap_single(sq->pdev, xdptxd.dma_addr,
+					 xdptxd.len, DMA_TO_DEVICE);
 			xdp_return_frame_rx_napi(xdpf);
 			drops++;
 		}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 8b537a4b0840..2a5158993349 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -97,15 +97,14 @@ static inline void mlx5e_xdp_update_inline_state(struct mlx5e_xdpsq *sq)
 }
 
 static inline void
-mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi,
+mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq,
+			 struct mlx5e_xdp_xmit_data *xdptxd,
 			 struct mlx5e_xdpsq_stats *stats)
 {
 	struct mlx5e_xdp_mpwqe *session = &sq->mpwqe;
-	dma_addr_t dma_addr    = xdpi->dma_addr;
-	struct xdp_frame *xdpf = xdpi->xdpf;
 	struct mlx5_wqe_data_seg *dseg =
 		(struct mlx5_wqe_data_seg *)session->wqe + session->ds_count;
-	u16 dma_len = xdpf->len;
+	u32 dma_len = xdptxd->len;
 
 	session->pkt_count++;
 
@@ -124,7 +123,7 @@ mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi,
 		}
 
 		inline_dseg->byte_count = cpu_to_be32(dma_len | MLX5_INLINE_SEG);
-		memcpy(inline_dseg->data, xdpf->data, dma_len);
+		memcpy(inline_dseg->data, xdptxd->data, dma_len);
 
 		session->ds_count += ds_cnt;
 		stats->inlnw++;
@@ -132,7 +131,7 @@ mlx5e_xdp_mpwqe_add_dseg(struct mlx5e_xdpsq *sq, struct mlx5e_xdp_info *xdpi,
 	}
 
 no_inline:
-	dseg->addr       = cpu_to_be64(dma_addr);
+	dseg->addr       = cpu_to_be64(xdptxd->dma_addr);
 	dseg->byte_count = cpu_to_be32(dma_len);
 	dseg->lkey       = sq->mkey_be;
 	session->ds_count++;
-- 
2.19.1
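The mode/union refactoring in the patch above boils down to a tagged union that is dispatched at completion time. A standalone sketch of the pattern, using toy types and counters rather than the real mlx5e structures:

```c
#include <assert.h>

/* Toy model of the mlx5e_xdp_info tagged union: each in-flight packet
 * records how it was transmitted, so the completion path knows which
 * cleanup to perform. Names mirror the patch; bodies are illustrative. */
enum xmit_mode { XMIT_MODE_FRAME, XMIT_MODE_PAGE };

struct xdp_info {
	enum xmit_mode mode;
	union {
		struct { int frame_id; } frame; /* needs unmap + frame return */
		struct { int page_id;  } page;  /* needs page release */
	};
};

static int frames_returned, pages_released;

/* Completion handler: dispatch on the mode tag, as mlx5e_free_xdpsq_desc()
 * does in the patch (dma_unmap_single/xdp_return_frame vs. page release). */
static void complete(const struct xdp_info *xdpi)
{
	switch (xdpi->mode) {
	case XMIT_MODE_FRAME:
		frames_returned++;	/* stands in for unmap + xdp_return_frame */
		break;
	case XMIT_MODE_PAGE:
		pages_released++;	/* stands in for mlx5e_page_release */
		break;
	}
}
```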


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next v2 11/16] net/mlx5e: Share the XDP SQ for XDP_TX between RQs
  2019-04-30 18:12 [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Maxim Mikityanskiy
                   ` (9 preceding siblings ...)
  2019-04-30 18:12 ` [PATCH bpf-next v2 10/16] net/mlx5e: Refactor struct mlx5e_xdp_info Maxim Mikityanskiy
@ 2019-04-30 18:12 ` Maxim Mikityanskiy
  2019-04-30 18:13 ` [PATCH bpf-next v2 12/16] net/mlx5e: XDP_TX from UMEM support Maxim Mikityanskiy
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Maxim Mikityanskiy @ 2019-04-30 18:12 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

Put the XDP SQ that is used for XDP_TX into the channel. It used to be a
part of the RQ, but with the introduction of AF_XDP there will be one
more RQ that could share the same XDP SQ. This patch is a preparation
for that change.

Separate XDP_TX statistics per RQ were implemented in one of the previous
patches.
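Once the SQ is shared, completions arrive on one queue for packets that came from different RQs, which is why this patch adds the per-packet `rq` pointer to `mlx5e_xdp_info`. A minimal sketch of that idea, with toy types in place of the real mlx5e structures:

```c
#include <assert.h>

/* Sketch: with a shared XDP SQ, only the per-packet record tells the
 * completion path which RQ's page pool to release the page to.
 * Toy types; the real code uses xdpi.page.rq and mlx5e_page_release(). */
#define NUM_RQS 2

struct rq { int pages_back; };

struct tx_entry {
	struct rq *rq;   /* like xdpi.page.rq in the patch */
	int page_id;
};

/* Completion: release the page back to the RQ recorded at transmit time. */
static void complete_entry(const struct tx_entry *e)
{
	e->rq->pages_back++;
}
```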

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  4 ++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 20 +++++++-------
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  4 +--
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 26 +++++++++++--------
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |  4 +--
 5 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index cdb73568a344..8cb28e5604f0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -431,6 +431,7 @@ struct mlx5e_xdp_info {
 			dma_addr_t dma_addr;
 		} frame;
 		struct {
+			struct mlx5e_rq *rq;
 			struct mlx5e_dma_info di;
 		} page;
 	};
@@ -643,7 +644,7 @@ struct mlx5e_rq {
 
 	/* XDP */
 	struct bpf_prog       *xdp_prog;
-	struct mlx5e_xdpsq     xdpsq;
+	struct mlx5e_xdpsq    *xdpsq;
 	DECLARE_BITMAP(flags, 8);
 	struct page_pool      *page_pool;
 
@@ -662,6 +663,7 @@ struct mlx5e_rq {
 struct mlx5e_channel {
 	/* data path */
 	struct mlx5e_rq            rq;
+	struct mlx5e_xdpsq         rq_xdpsq;
 	struct mlx5e_txqsq         sq[MLX5E_MAX_NUM_TC];
 	struct mlx5e_icosq         icosq;   /* internal control operations */
 	bool                       xdp;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 89f6eb1109cf..b3e118fc4521 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -54,8 +54,8 @@ int mlx5e_xdp_max_mtu(struct mlx5e_params *params)
 }
 
 static inline bool
-mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_dma_info *di,
-		    struct xdp_buff *xdp)
+mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
+		    struct mlx5e_dma_info *di, struct xdp_buff *xdp)
 {
 	struct mlx5e_xdp_xmit_data xdptxd;
 	struct mlx5e_xdp_info xdpi;
@@ -75,6 +75,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_dma_info *di,
 	dma_sync_single_for_device(sq->pdev, dma_addr, xdptxd.len, DMA_TO_DEVICE);
 
 	xdptxd.dma_addr = dma_addr;
+	xdpi.page.rq = rq;
 	xdpi.page.di = *di;
 
 	return sq->xmit_xdp_frame(sq, &xdptxd, &xdpi);
@@ -105,7 +106,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
 		*len = xdp.data_end - xdp.data;
 		return false;
 	case XDP_TX:
-		if (unlikely(!mlx5e_xmit_xdp_buff(&rq->xdpsq, di, &xdp)))
+		if (unlikely(!mlx5e_xmit_xdp_buff(rq->xdpsq, rq, di, &xdp)))
 			goto xdp_abort;
 		__set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags); /* non-atomic */
 		return true;
@@ -287,7 +288,6 @@ static bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
 
 static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				  struct mlx5e_xdp_wqe_info *wi,
-				  struct mlx5e_rq *rq,
 				  bool recycle)
 {
 	struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
@@ -305,7 +305,7 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 			break;
 		case MLX5E_XDP_XMIT_MODE_PAGE:
 			/* XDP_TX */
-			mlx5e_page_release(rq, &xdpi.page.di, recycle);
+			mlx5e_page_release(xdpi.page.rq, &xdpi.page.di, recycle);
 			break;
 		default:
 			WARN_ON_ONCE(true);
@@ -313,7 +313,7 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 	}
 }
 
-bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
+bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
 {
 	struct mlx5e_xdpsq *sq;
 	struct mlx5_cqe64 *cqe;
@@ -358,7 +358,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
 
 			sqcc += wi->num_wqebbs;
 
-			mlx5e_free_xdpsq_desc(sq, wi, rq, true);
+			mlx5e_free_xdpsq_desc(sq, wi, true);
 		} while (!last_wqe);
 	} while ((++i < MLX5E_TX_CQ_POLL_BUDGET) && (cqe = mlx5_cqwq_get_cqe(&cq->wq)));
 
@@ -373,7 +373,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq)
 	return (i == MLX5E_TX_CQ_POLL_BUDGET);
 }
 
-void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq)
+void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
 {
 	while (sq->cc != sq->pc) {
 		struct mlx5e_xdp_wqe_info *wi;
@@ -384,7 +384,7 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq)
 
 		sq->cc += wi->num_wqebbs;
 
-		mlx5e_free_xdpsq_desc(sq, wi, rq, false);
+		mlx5e_free_xdpsq_desc(sq, wi, false);
 	}
 }
 
@@ -450,7 +450,7 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 
 void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq)
 {
-	struct mlx5e_xdpsq *xdpsq = &rq->xdpsq;
+	struct mlx5e_xdpsq *xdpsq = rq->xdpsq;
 
 	if (xdpsq->mpwqe.wqe)
 		mlx5e_xdp_mpwqe_complete(xdpsq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 2a5158993349..86db5ad49a42 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -42,8 +42,8 @@
 int mlx5e_xdp_max_mtu(struct mlx5e_params *params);
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
 		      void *va, u16 *rx_headroom, u32 *len);
-bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq, struct mlx5e_rq *rq);
-void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq);
+bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
+void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
 void mlx5e_set_xmit_fp(struct mlx5e_xdpsq *sq, bool is_mpw);
 void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
 int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 204e199b141e..35b42c875cf3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -418,6 +418,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	rq->mdev    = mdev;
 	rq->hw_mtu  = MLX5E_SW2HW_MTU(params, params->sw_mtu);
 	rq->stats   = &c->priv->channel_stats[c->ix].rq;
+	rq->xdpsq   = &c->rq_xdpsq;
 
 	rq->xdp_prog = params->xdp_prog ? bpf_prog_inc(params->xdp_prog) : NULL;
 	if (IS_ERR(rq->xdp_prog)) {
@@ -1439,7 +1440,7 @@ static int mlx5e_open_xdpsq(struct mlx5e_channel *c,
 	return err;
 }
 
-static void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq)
+static void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq)
 {
 	struct mlx5e_channel *c = sq->channel;
 
@@ -1447,7 +1448,7 @@ static void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq)
 	napi_synchronize(&c->napi);
 
 	mlx5e_destroy_sq(c->mdev, sq->sqn);
-	mlx5e_free_xdpsq_descs(sq, rq);
+	mlx5e_free_xdpsq_descs(sq);
 	mlx5e_free_xdpsq(sq);
 }
 
@@ -1826,7 +1827,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 
 	/* XDP SQ CQ params are same as normal TXQ sq CQ params */
 	err = c->xdp ? mlx5e_open_cq(c, params->tx_cq_moderation,
-				     &cparam->tx_cq, &c->rq.xdpsq.cq) : 0;
+				     &cparam->tx_cq, &c->rq_xdpsq.cq) : 0;
 	if (err)
 		goto err_close_rx_cq;
 
@@ -1840,9 +1841,12 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	if (err)
 		goto err_close_icosq;
 
-	err = c->xdp ? mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, &c->rq.xdpsq, false) : 0;
-	if (err)
-		goto err_close_sqs;
+	if (c->xdp) {
+		err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq,
+				       &c->rq_xdpsq, false);
+		if (err)
+			goto err_close_sqs;
+	}
 
 	err = mlx5e_open_rq(c, params, &cparam->rq, &c->rq);
 	if (err)
@@ -1861,7 +1865,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 
 err_close_xdp_sq:
 	if (c->xdp)
-		mlx5e_close_xdpsq(&c->rq.xdpsq, &c->rq);
+		mlx5e_close_xdpsq(&c->rq_xdpsq);
 
 err_close_sqs:
 	mlx5e_close_sqs(c);
@@ -1872,7 +1876,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 err_disable_napi:
 	napi_disable(&c->napi);
 	if (c->xdp)
-		mlx5e_close_cq(&c->rq.xdpsq.cq);
+		mlx5e_close_cq(&c->rq_xdpsq.cq);
 
 err_close_rx_cq:
 	mlx5e_close_cq(&c->rq.cq);
@@ -1917,15 +1921,15 @@ static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
 
 static void mlx5e_close_channel(struct mlx5e_channel *c)
 {
-	mlx5e_close_xdpsq(&c->xdpsq, NULL);
+	mlx5e_close_xdpsq(&c->xdpsq);
 	mlx5e_close_rq(&c->rq);
 	if (c->xdp)
-		mlx5e_close_xdpsq(&c->rq.xdpsq, &c->rq);
+		mlx5e_close_xdpsq(&c->rq_xdpsq);
 	mlx5e_close_sqs(c);
 	mlx5e_close_icosq(&c->icosq);
 	napi_disable(&c->napi);
 	if (c->xdp)
-		mlx5e_close_cq(&c->rq.xdpsq.cq);
+		mlx5e_close_cq(&c->rq_xdpsq.cq);
 	mlx5e_close_cq(&c->rq.cq);
 	mlx5e_close_cq(&c->xdpsq.cq);
 	mlx5e_close_tx_cqs(c);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index de4d5ae431af..d2b8ce5df59c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -97,10 +97,10 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
 	for (i = 0; i < c->num_tc; i++)
 		busy |= mlx5e_poll_tx_cq(&c->sq[i].cq, budget);
 
-	busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq, NULL);
+	busy |= mlx5e_poll_xdpsq_cq(&c->xdpsq.cq);
 
 	if (c->xdp)
-		busy |= mlx5e_poll_xdpsq_cq(&rq->xdpsq.cq, rq);
+		busy |= mlx5e_poll_xdpsq_cq(&c->rq_xdpsq.cq);
 
 	if (likely(budget)) { /* budget=0 means: don't poll rx rings */
 		work_done = mlx5e_poll_rx_cq(&rq->cq, budget);
-- 
2.19.1



* [PATCH bpf-next v2 12/16] net/mlx5e: XDP_TX from UMEM support
  2019-04-30 18:12 [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Maxim Mikityanskiy
                   ` (10 preceding siblings ...)
  2019-04-30 18:12 ` [PATCH bpf-next v2 11/16] net/mlx5e: Share the XDP SQ for XDP_TX between RQs Maxim Mikityanskiy
@ 2019-04-30 18:13 ` Maxim Mikityanskiy
  2019-04-30 18:13 ` [PATCH bpf-next v2 13/16] net/mlx5e: Consider XSK in XDP MTU limit calculation Maxim Mikityanskiy
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Maxim Mikityanskiy @ 2019-04-30 18:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

When an XDP program returns XDP_TX, and the RQ is XSK-enabled, the frame
requires careful handling, because convert_to_xdp_frame creates a new
page and copies the data there, while our driver expects the xdp_frame
to point to the same memory as the xdp_buff. Handle this case
separately: map the new page for DMA, and on completion unmap it and
return it via xdp_return_frame.
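The resulting split in mlx5e_xmit_xdp_buff can be summarized as: a frame copied out of a zero-copy (UMEM) buffer needs a fresh DMA mapping and FRAME-mode cleanup, while a frame from a regular RQ reuses the existing bidirectional mapping and PAGE-mode cleanup. A toy sketch of that branch (illustrative names, not the driver's API):

```c
#include <assert.h>

/* Toy model of the memory-type branch in mlx5e_xmit_xdp_buff(). */
enum mem_type  { MEM_PAGE, MEM_ZERO_COPY };
enum xmit_mode { MODE_FRAME, MODE_PAGE };

static enum xmit_mode choose_mode(enum mem_type t, int *maps_done)
{
	if (t == MEM_ZERO_COPY) {
		(*maps_done)++;    /* dma_map_single() on the new page */
		return MODE_FRAME; /* unmap + xdp_return_frame on completion */
	}
	return MODE_PAGE;          /* dma_sync only; page recycled to the RQ */
}
```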

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 50 ++++++++++++++++---
 1 file changed, 42 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index b3e118fc4521..1364bdff702c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -69,14 +69,48 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 	xdptxd.data = xdpf->data;
 	xdptxd.len  = xdpf->len;
 
-	xdpi.mode = MLX5E_XDP_XMIT_MODE_PAGE;
+	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) {
+		/* The xdp_buff was in the UMEM and was copied into a newly
+		 * allocated page. The UMEM page was returned via the ZCA, and
+		 * this new page has to be mapped at this point and has to be
+		 * unmapped and returned via xdp_return_frame on completion.
+		 */
+
+		/* Prevent double recycling of the UMEM page. Even if this
+		 * function returns false, the xdp_buff shouldn't be recycled,
+		 * as it was already done in xdp_convert_zc_to_xdp_frame.
+		 */
+		__set_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags); /* non-atomic */
+
+		xdpi.mode = MLX5E_XDP_XMIT_MODE_FRAME;
 
-	dma_addr = di->addr + (xdpf->data - (void *)xdpf);
-	dma_sync_single_for_device(sq->pdev, dma_addr, xdptxd.len, DMA_TO_DEVICE);
+		dma_addr = dma_map_single(sq->pdev, xdptxd.data, xdptxd.len,
+					  DMA_TO_DEVICE);
+		if (dma_mapping_error(sq->pdev, dma_addr)) {
+			xdp_return_frame(xdpf);
+			return false;
+		}
 
-	xdptxd.dma_addr = dma_addr;
-	xdpi.page.rq = rq;
-	xdpi.page.di = *di;
+		xdptxd.dma_addr     = dma_addr;
+		xdpi.frame.xdpf     = xdpf;
+		xdpi.frame.dma_addr = dma_addr;
+	} else {
+		/* Driver assumes that convert_to_xdp_frame returns an xdp_frame
+		 * that points to the same memory region as the original
+		 * xdp_buff. This allows mapping the memory only once and using
+		 * the DMA_BIDIRECTIONAL mode.
+		 */
+
+		xdpi.mode = MLX5E_XDP_XMIT_MODE_PAGE;
+
+		dma_addr = di->addr + (xdpf->data - (void *)xdpf);
+		dma_sync_single_for_device(sq->pdev, dma_addr, xdptxd.len,
+					   DMA_TO_DEVICE);
+
+		xdptxd.dma_addr = dma_addr;
+		xdpi.page.rq    = rq;
+		xdpi.page.di    = *di;
+	}
 
 	return sq->xmit_xdp_frame(sq, &xdptxd, &xdpi);
 }
@@ -298,13 +332,13 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 
 		switch (xdpi.mode) {
 		case MLX5E_XDP_XMIT_MODE_FRAME:
-			/* XDP_REDIRECT */
+			/* XDP_TX from the XSK RQ and XDP_REDIRECT */
 			dma_unmap_single(sq->pdev, xdpi.frame.dma_addr,
 					 xdpi.frame.xdpf->len, DMA_TO_DEVICE);
 			xdp_return_frame(xdpi.frame.xdpf);
 			break;
 		case MLX5E_XDP_XMIT_MODE_PAGE:
-			/* XDP_TX */
+			/* XDP_TX from the regular RQ */
 			mlx5e_page_release(xdpi.page.rq, &xdpi.page.di, recycle);
 			break;
 		default:
-- 
2.19.1



* [PATCH bpf-next v2 13/16] net/mlx5e: Consider XSK in XDP MTU limit calculation
  2019-04-30 18:12 [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Maxim Mikityanskiy
                   ` (11 preceding siblings ...)
  2019-04-30 18:13 ` [PATCH bpf-next v2 12/16] net/mlx5e: XDP_TX from UMEM support Maxim Mikityanskiy
@ 2019-04-30 18:13 ` Maxim Mikityanskiy
  2019-04-30 18:13 ` [PATCH bpf-next v2 14/16] net/mlx5e: Encapsulate open/close queues into a function Maxim Mikityanskiy
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Maxim Mikityanskiy @ 2019-04-30 18:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

Use the existing mlx5e_get_linear_rq_headroom function to calculate the
headroom for mlx5e_xdp_max_mtu. This function takes the XSK headroom
into consideration; the XSK case will be exercised by the following
patches.
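The arithmetic behind such an XDP MTU limit is that one page must hold the headroom, the frame, and the skb_shared_info tail. A hedged sketch of that relation — the constants below are placeholders, and the real mlx5e_xdp_max_mtu() additionally accounts for the HW/SW MTU conversion:

```c
#include <assert.h>

/* Illustrative XDP MTU arithmetic. PAGE_SZ and SHINFO_SZ are made-up
 * stand-ins for PAGE_SIZE and SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
 * not the driver's actual values. */
#define PAGE_SZ   4096
#define SHINFO_SZ 320

/* More headroom (e.g. the XSK headroom) directly shrinks the max frame. */
static int xdp_max_frame_len(int headroom)
{
	return PAGE_SZ - headroom - SHINFO_SZ;
}
```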

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en/params.c | 4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en/params.h | 2 ++
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c    | 5 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h    | 3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c   | 4 ++--
 5 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 50a458dc3836..0de908b12fcc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -9,8 +9,8 @@ static inline bool mlx5e_rx_is_xdp(struct mlx5e_params *params,
 	return params->xdp_prog || xsk;
 }
 
-static inline u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
-					       struct mlx5e_xsk_param *xsk)
+u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
+				 struct mlx5e_xsk_param *xsk)
 {
 	u16 headroom = NET_IP_ALIGN;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index ed420f3efe52..7f29b82dd8c2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -11,6 +11,8 @@ struct mlx5e_xsk_param {
 	u16 chunk_size;
 };
 
+u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
+				 struct mlx5e_xsk_param *xsk);
 u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
 				struct mlx5e_xsk_param *xsk);
 u8 mlx5e_mpwqe_log_pkts_per_wqe(struct mlx5e_params *params);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 1364bdff702c..ee99efde9143 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -32,10 +32,11 @@
 
 #include <linux/bpf_trace.h>
 #include "en/xdp.h"
+#include "en/params.h"
 
-int mlx5e_xdp_max_mtu(struct mlx5e_params *params)
+int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk)
 {
-	int hr = NET_IP_ALIGN + XDP_PACKET_HEADROOM;
+	int hr = mlx5e_get_linear_rq_headroom(params, xsk);
 
 	/* Let S := SKB_DATA_ALIGN(sizeof(struct skb_shared_info)).
 	 * The condition checked in mlx5e_rx_is_linear_skb is:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 86db5ad49a42..9200cb9f499b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -39,7 +39,8 @@
 	(sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
 #define MLX5E_XDP_TX_DS_COUNT (MLX5E_XDP_TX_EMPTY_DS_COUNT + 1 /* SG DS */)
 
-int mlx5e_xdp_max_mtu(struct mlx5e_params *params);
+struct mlx5e_xsk_param;
+int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
 		      void *va, u16 *rx_headroom, u32 *len);
 bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 35b42c875cf3..08c2fd0be7ac 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3718,7 +3718,7 @@ int mlx5e_change_mtu(struct net_device *netdev, int new_mtu,
 	if (params->xdp_prog &&
 	    !mlx5e_rx_is_linear_skb(&new_channels.params)) {
 		netdev_err(netdev, "MTU(%d) > %d is not allowed while XDP enabled\n",
-			   new_mtu, mlx5e_xdp_max_mtu(params));
+			   new_mtu, mlx5e_xdp_max_mtu(params, NULL));
 		err = -EINVAL;
 		goto out;
 	}
@@ -4160,7 +4160,7 @@ static int mlx5e_xdp_allowed(struct mlx5e_priv *priv, struct bpf_prog *prog)
 	if (!mlx5e_rx_is_linear_skb(&new_channels.params)) {
 		netdev_warn(netdev, "XDP is not allowed with MTU(%d) > %d\n",
 			    new_channels.params.sw_mtu,
-			    mlx5e_xdp_max_mtu(&new_channels.params));
+			    mlx5e_xdp_max_mtu(&new_channels.params, NULL));
 		return -EINVAL;
 	}
 
-- 
2.19.1



* [PATCH bpf-next v2 14/16] net/mlx5e: Encapsulate open/close queues into a function
  2019-04-30 18:12 [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Maxim Mikityanskiy
                   ` (12 preceding siblings ...)
  2019-04-30 18:13 ` [PATCH bpf-next v2 13/16] net/mlx5e: Consider XSK in XDP MTU limit calculation Maxim Mikityanskiy
@ 2019-04-30 18:13 ` Maxim Mikityanskiy
  2019-04-30 18:13 ` [PATCH bpf-next v2 15/16] net/mlx5e: Move queue param structs to en/params.h Maxim Mikityanskiy
  2019-05-04 17:25 ` [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Björn Töpel
  15 siblings, 0 replies; 26+ messages in thread
From: Maxim Mikityanskiy @ 2019-04-30 18:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

Create new functions mlx5e_{open,close}_queues to encapsulate opening
and closing RQs and SQs, and call the new functions from
mlx5e_{open,close}_channel. It simplifies the existing functions a bit
and prepares them for the upcoming AF_XDP changes.
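The extracted mlx5e_open_queues keeps the kernel's usual goto-based unwinding: every resource opened so far is closed in reverse order when a later step fails. A self-contained sketch of that pattern with toy "resources" (not the driver's CQs/SQs):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy goto-unwind error handling, in the style of mlx5e_open_queues(). */
static int opened[3];

static bool open_res(int i, int fail_at)
{
	if (i == fail_at)
		return false;
	opened[i] = 1;
	return true;
}

static void close_res(int i)
{
	opened[i] = 0;
}

/* Open resources 0..2; on failure, unwind the ones already opened. */
static int open_queues(int fail_at)
{
	if (!open_res(0, fail_at))
		return -1;
	if (!open_res(1, fail_at))
		goto err_close_0;
	if (!open_res(2, fail_at))
		goto err_close_1;
	return 0;

err_close_1:
	close_res(1);
err_close_0:
	close_res(0);
	return -1;
}
```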

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 125 ++++++++++--------
 1 file changed, 73 insertions(+), 52 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 08c2fd0be7ac..9f74b9f72135 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -1769,49 +1769,16 @@ static void mlx5e_free_xps_cpumask(struct mlx5e_channel *c)
 	free_cpumask_var(c->xps_cpumask);
 }
 
-static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
-			      struct mlx5e_params *params,
-			      struct mlx5e_channel_param *cparam,
-			      struct mlx5e_channel **cp)
+static int mlx5e_open_queues(struct mlx5e_channel *c,
+			     struct mlx5e_params *params,
+			     struct mlx5e_channel_param *cparam)
 {
-	int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
 	struct net_dim_cq_moder icocq_moder = {0, 0};
-	struct net_device *netdev = priv->netdev;
-	struct mlx5e_channel *c;
-	unsigned int irq;
 	int err;
-	int eqn;
-
-	err = mlx5_vector2eqn(priv->mdev, ix, &eqn, &irq);
-	if (err)
-		return err;
-
-	c = kvzalloc_node(sizeof(*c), GFP_KERNEL, cpu_to_node(cpu));
-	if (!c)
-		return -ENOMEM;
-
-	c->priv     = priv;
-	c->mdev     = priv->mdev;
-	c->tstamp   = &priv->tstamp;
-	c->ix       = ix;
-	c->cpu      = cpu;
-	c->pdev     = priv->mdev->device;
-	c->netdev   = priv->netdev;
-	c->mkey_be  = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
-	c->num_tc   = params->num_tc;
-	c->xdp      = !!params->xdp_prog;
-	c->stats    = &priv->channel_stats[ix].ch;
-	c->irq_desc = irq_to_desc(irq);
-
-	err = mlx5e_alloc_xps_cpumask(c, params);
-	if (err)
-		goto err_free_channel;
-
-	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);
 
 	err = mlx5e_open_cq(c, icocq_moder, &cparam->icosq_cq, &c->icosq.cq);
 	if (err)
-		goto err_napi_del;
+		return err;
 
 	err = mlx5e_open_tx_cqs(c, params, cparam);
 	if (err)
@@ -1856,8 +1823,6 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 	if (err)
 		goto err_close_rq;
 
-	*cp = c;
-
 	return 0;
 
 err_close_rq:
@@ -1875,6 +1840,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 
 err_disable_napi:
 	napi_disable(&c->napi);
+
 	if (c->xdp)
 		mlx5e_close_cq(&c->rq_xdpsq.cq);
 
@@ -1890,6 +1856,73 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
 err_close_icosq_cq:
 	mlx5e_close_cq(&c->icosq.cq);
 
+	return err;
+}
+
+static void mlx5e_close_queues(struct mlx5e_channel *c)
+{
+	mlx5e_close_xdpsq(&c->xdpsq);
+	mlx5e_close_rq(&c->rq);
+	if (c->xdp)
+		mlx5e_close_xdpsq(&c->rq_xdpsq);
+	mlx5e_close_sqs(c);
+	mlx5e_close_icosq(&c->icosq);
+	napi_disable(&c->napi);
+	if (c->xdp)
+		mlx5e_close_cq(&c->rq_xdpsq.cq);
+	mlx5e_close_cq(&c->rq.cq);
+	mlx5e_close_cq(&c->xdpsq.cq);
+	mlx5e_close_tx_cqs(c);
+	mlx5e_close_cq(&c->icosq.cq);
+}
+
+static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
+			      struct mlx5e_params *params,
+			      struct mlx5e_channel_param *cparam,
+			      struct mlx5e_channel **cp)
+{
+	int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
+	struct net_device *netdev = priv->netdev;
+	struct mlx5e_channel *c;
+	unsigned int irq;
+	int err;
+	int eqn;
+
+	err = mlx5_vector2eqn(priv->mdev, ix, &eqn, &irq);
+	if (err)
+		return err;
+
+	c = kvzalloc_node(sizeof(*c), GFP_KERNEL, cpu_to_node(cpu));
+	if (!c)
+		return -ENOMEM;
+
+	c->priv     = priv;
+	c->mdev     = priv->mdev;
+	c->tstamp   = &priv->tstamp;
+	c->ix       = ix;
+	c->cpu      = cpu;
+	c->pdev     = priv->mdev->device;
+	c->netdev   = priv->netdev;
+	c->mkey_be  = cpu_to_be32(priv->mdev->mlx5e_res.mkey.key);
+	c->num_tc   = params->num_tc;
+	c->xdp      = !!params->xdp_prog;
+	c->stats    = &priv->channel_stats[ix].ch;
+	c->irq_desc = irq_to_desc(irq);
+
+	err = mlx5e_alloc_xps_cpumask(c, params);
+	if (err)
+		goto err_free_channel;
+
+	netif_napi_add(netdev, &c->napi, mlx5e_napi_poll, 64);
+
+	err = mlx5e_open_queues(c, params, cparam);
+	if (unlikely(err))
+		goto err_napi_del;
+
+	*cp = c;
+
+	return 0;
+
 err_napi_del:
 	netif_napi_del(&c->napi);
 	mlx5e_free_xps_cpumask(c);
@@ -1921,19 +1954,7 @@ static void mlx5e_deactivate_channel(struct mlx5e_channel *c)
 
 static void mlx5e_close_channel(struct mlx5e_channel *c)
 {
-	mlx5e_close_xdpsq(&c->xdpsq);
-	mlx5e_close_rq(&c->rq);
-	if (c->xdp)
-		mlx5e_close_xdpsq(&c->rq_xdpsq);
-	mlx5e_close_sqs(c);
-	mlx5e_close_icosq(&c->icosq);
-	napi_disable(&c->napi);
-	if (c->xdp)
-		mlx5e_close_cq(&c->rq_xdpsq.cq);
-	mlx5e_close_cq(&c->rq.cq);
-	mlx5e_close_cq(&c->xdpsq.cq);
-	mlx5e_close_tx_cqs(c);
-	mlx5e_close_cq(&c->icosq.cq);
+	mlx5e_close_queues(c);
 	netif_napi_del(&c->napi);
 	mlx5e_free_xps_cpumask(c);
 
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH bpf-next v2 15/16] net/mlx5e: Move queue param structs to en/params.h
  2019-04-30 18:12 [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Maxim Mikityanskiy
                   ` (13 preceding siblings ...)
  2019-04-30 18:13 ` [PATCH bpf-next v2 14/16] net/mlx5e: Encapsulate open/close queues into a function Maxim Mikityanskiy
@ 2019-04-30 18:13 ` Maxim Mikityanskiy
  2019-05-04 17:25 ` [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Björn Töpel
  15 siblings, 0 replies; 26+ messages in thread
From: Maxim Mikityanskiy @ 2019-04-30 18:13 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson
  Cc: bpf, netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski, Maxim Mikityanskiy

structs mlx5e_{rq,sq,cq,channel}_param are going to be used in the
upcoming XSK RX and TX patches. Move them to a header file to make
them accessible from other C files.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 .../ethernet/mellanox/mlx5/core/en/params.h   | 31 +++++++++++++++++++
 .../net/ethernet/mellanox/mlx5/core/en_main.c | 29 -----------------
 2 files changed, 31 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
index 7f29b82dd8c2..f83417b822bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.h
@@ -11,6 +11,37 @@ struct mlx5e_xsk_param {
 	u16 chunk_size;
 };
 
+struct mlx5e_rq_param {
+	u32                        rqc[MLX5_ST_SZ_DW(rqc)];
+	struct mlx5_wq_param       wq;
+	struct mlx5e_rq_frags_info frags_info;
+};
+
+struct mlx5e_sq_param {
+	u32                        sqc[MLX5_ST_SZ_DW(sqc)];
+	struct mlx5_wq_param       wq;
+	bool                       is_mpw;
+};
+
+struct mlx5e_cq_param {
+	u32                        cqc[MLX5_ST_SZ_DW(cqc)];
+	struct mlx5_wq_param       wq;
+	u16                        eq_ix;
+	u8                         cq_period_mode;
+};
+
+struct mlx5e_channel_param {
+	struct mlx5e_rq_param      rq;
+	struct mlx5e_sq_param      sq;
+	struct mlx5e_sq_param      xdp_sq;
+	struct mlx5e_sq_param      icosq;
+	struct mlx5e_cq_param      rx_cq;
+	struct mlx5e_cq_param      tx_cq;
+	struct mlx5e_cq_param      icosq_cq;
+};
+
+/* Parameter calculations */
+
 u16 mlx5e_get_linear_rq_headroom(struct mlx5e_params *params,
 				 struct mlx5e_xsk_param *xsk);
 u32 mlx5e_rx_get_linear_frag_sz(struct mlx5e_params *params,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9f74b9f72135..3b2d7a1fa7df 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -57,35 +57,6 @@
 #include "en/reporter.h"
 #include "en/params.h"
 
-struct mlx5e_rq_param {
-	u32			rqc[MLX5_ST_SZ_DW(rqc)];
-	struct mlx5_wq_param	wq;
-	struct mlx5e_rq_frags_info frags_info;
-};
-
-struct mlx5e_sq_param {
-	u32                        sqc[MLX5_ST_SZ_DW(sqc)];
-	struct mlx5_wq_param       wq;
-	bool                       is_mpw;
-};
-
-struct mlx5e_cq_param {
-	u32                        cqc[MLX5_ST_SZ_DW(cqc)];
-	struct mlx5_wq_param       wq;
-	u16                        eq_ix;
-	u8                         cq_period_mode;
-};
-
-struct mlx5e_channel_param {
-	struct mlx5e_rq_param      rq;
-	struct mlx5e_sq_param      sq;
-	struct mlx5e_sq_param      xdp_sq;
-	struct mlx5e_sq_param      icosq;
-	struct mlx5e_cq_param      rx_cq;
-	struct mlx5e_cq_param      tx_cq;
-	struct mlx5e_cq_param      icosq_cq;
-};
-
 bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
 {
 	bool striding_rq_umr = MLX5_CAP_GEN(mdev, striding_rq) &&
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support
  2019-04-30 18:12 [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Maxim Mikityanskiy
                   ` (14 preceding siblings ...)
  2019-04-30 18:13 ` [PATCH bpf-next v2 15/16] net/mlx5e: Move queue param structs to en/params.h Maxim Mikityanskiy
@ 2019-05-04 17:25 ` Björn Töpel
  15 siblings, 0 replies; 26+ messages in thread
From: Björn Töpel @ 2019-05-04 17:25 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson, bpf, netdev, David S. Miller, Saeed Mahameed,
	Jonathan Lemon, Tariq Toukan, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jakub Kicinski, Maciej Fijalkowski

On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> This series contains improvements to the AF_XDP kernel infrastructure
> and AF_XDP support in mlx5e. The infrastructure improvements are
> required for mlx5e, but some of them also benefit all drivers, and
> some can be useful for other drivers that want to implement AF_XDP.
>
> The performance testing was performed on a machine with the following
> configuration:
>
> - 24 cores of Intel Xeon E5-2620 v3 @ 2.40 GHz
> - Mellanox ConnectX-5 Ex with 100 Gbit/s link
>
> The results with retpoline disabled, single stream:
>
> txonly: 33.3 Mpps (21.5 Mpps with queue and app pinned to the same CPU)
> rxdrop: 12.2 Mpps
> l2fwd: 9.4 Mpps
>
> The results with retpoline enabled, single stream:
>
> txonly: 21.3 Mpps (14.1 Mpps with queue and app pinned to the same CPU)
> rxdrop: 9.9 Mpps
> l2fwd: 6.8 Mpps
>
> v2 changes:
>
> Added patches for mlx5e and addressed the comments for v1. Rebased for
> bpf-next (net-next has to be merged first, because this series depends
> on some patches from there).
>

Nit: There are some checkpatch warnings (>80-char lines) in the driver parts.


> Maxim Mikityanskiy (16):
>   xsk: Add API to check for available entries in FQ
>   xsk: Add getsockopt XDP_OPTIONS
>   libbpf: Support getsockopt XDP_OPTIONS
>   xsk: Extend channels to support combined XSK/non-XSK traffic
>   xsk: Change the default frame size to 4096 and allow controlling it
>   xsk: Return the whole xdp_desc from xsk_umem_consume_tx
>   net/mlx5e: Replace deprecated PCI_DMA_TODEVICE
>   net/mlx5e: Calculate linear RX frag size considering XSK
>   net/mlx5e: Allow ICO SQ to be used by multiple RQs
>   net/mlx5e: Refactor struct mlx5e_xdp_info
>   net/mlx5e: Share the XDP SQ for XDP_TX between RQs
>   net/mlx5e: XDP_TX from UMEM support
>   net/mlx5e: Consider XSK in XDP MTU limit calculation
>   net/mlx5e: Encapsulate open/close queues into a function
>   net/mlx5e: Move queue param structs to en/params.h
>   net/mlx5e: Add XSK support
>
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c    |  12 +-
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  |  15 +-
>  .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  | 147 +++-
>  .../ethernet/mellanox/mlx5/core/en/params.c   | 108 ++-
>  .../ethernet/mellanox/mlx5/core/en/params.h   |  87 ++-
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 231 ++++--
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  36 +-
>  .../mellanox/mlx5/core/en/xsk/Makefile        |   1 +
>  .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 192 +++++
>  .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  27 +
>  .../mellanox/mlx5/core/en/xsk/setup.c         | 220 ++++++
>  .../mellanox/mlx5/core/en/xsk/setup.h         |  25 +
>  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   | 108 +++
>  .../ethernet/mellanox/mlx5/core/en/xsk/tx.h   |  15 +
>  .../ethernet/mellanox/mlx5/core/en/xsk/umem.c | 252 +++++++
>  .../ethernet/mellanox/mlx5/core/en/xsk/umem.h |  34 +
>  .../ethernet/mellanox/mlx5/core/en_ethtool.c  |  21 +-
>  .../mellanox/mlx5/core/en_fs_ethtool.c        |  44 +-
>  .../net/ethernet/mellanox/mlx5/core/en_main.c | 680 +++++++++++-------
>  .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  12 +-
>  .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 104 ++-
>  .../ethernet/mellanox/mlx5/core/en_stats.c    | 115 ++-
>  .../ethernet/mellanox/mlx5/core/en_stats.h    |  30 +
>  .../net/ethernet/mellanox/mlx5/core/en_txrx.c |  42 +-
>  .../ethernet/mellanox/mlx5/core/ipoib/ipoib.c |  14 +-
>  drivers/net/ethernet/mellanox/mlx5/core/wq.h  |   5 -
>  include/net/xdp_sock.h                        |  27 +-
>  include/uapi/linux/if_xdp.h                   |  18 +
>  net/xdp/xsk.c                                 |  43 +-
>  net/xdp/xsk_queue.h                           |  14 +
>  samples/bpf/xdpsock_user.c                    |  52 +-
>  tools/include/uapi/linux/if_xdp.h             |  18 +
>  tools/lib/bpf/xsk.c                           | 127 +++-
>  tools/lib/bpf/xsk.h                           |   6 +-
>  35 files changed, 2384 insertions(+), 500 deletions(-)
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/Makefile
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.h
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c
>  create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h
>
> --
> 2.19.1
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next v2 02/16] xsk: Add getsockopt XDP_OPTIONS
  2019-04-30 18:12 ` [PATCH bpf-next v2 02/16] xsk: Add getsockopt XDP_OPTIONS Maxim Mikityanskiy
@ 2019-05-04 17:25   ` Björn Töpel
  2019-05-06 13:45     ` Maxim Mikityanskiy
  0 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2019-05-04 17:25 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson, bpf, netdev, David S. Miller, Saeed Mahameed,
	Jonathan Lemon, Tariq Toukan, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jakub Kicinski, Maciej Fijalkowski

On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> Make it possible for the application to determine whether the AF_XDP
> socket is running in zero-copy mode. To achieve this, add a new
> getsockopt option XDP_OPTIONS that returns flags. The only flag
> supported for now is the zero-copy mode indicator.
>
> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  include/uapi/linux/if_xdp.h       |  7 +++++++
>  net/xdp/xsk.c                     | 22 ++++++++++++++++++++++
>  tools/include/uapi/linux/if_xdp.h |  7 +++++++
>  3 files changed, 36 insertions(+)
>
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index caed8b1614ff..9ae4b4e08b68 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
>  #define XDP_UMEM_FILL_RING             5
>  #define XDP_UMEM_COMPLETION_RING       6
>  #define XDP_STATISTICS                 7
> +#define XDP_OPTIONS                    8
>
>  struct xdp_umem_reg {
>         __u64 addr; /* Start of packet data area */
> @@ -60,6 +61,12 @@ struct xdp_statistics {
>         __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
>  };
>
> +struct xdp_options {
> +       __u32 flags;
> +};
> +
> +#define XDP_OPTIONS_FLAG_ZEROCOPY (1 << 0)

Nit: The other flags don't use "FLAG" in their names, but that doesn't
really matter.

> +
>  /* Pgoff for mmaping the rings */
>  #define XDP_PGOFF_RX_RING                        0
>  #define XDP_PGOFF_TX_RING               0x80000000
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index b68a380f50b3..998199109d5c 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -650,6 +650,28 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
>
>                 return 0;
>         }
> +       case XDP_OPTIONS:
> +       {
> +               struct xdp_options opts;
> +
> +               if (len < sizeof(opts))
> +                       return -EINVAL;
> +
> +               opts.flags = 0;

Maybe get rid of this, in favor of "opts = {}" if the structure grows?


> +
> +               mutex_lock(&xs->mutex);
> +               if (xs->zc)
> +                       opts.flags |= XDP_OPTIONS_FLAG_ZEROCOPY;
> +               mutex_unlock(&xs->mutex);
> +
> +               len = sizeof(opts);
> +               if (copy_to_user(optval, &opts, len))
> +                       return -EFAULT;
> +               if (put_user(len, optlen))
> +                       return -EFAULT;
> +
> +               return 0;
> +       }
>         default:
>                 break;
>         }
> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> index caed8b1614ff..9ae4b4e08b68 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
>  #define XDP_UMEM_FILL_RING             5
>  #define XDP_UMEM_COMPLETION_RING       6
>  #define XDP_STATISTICS                 7
> +#define XDP_OPTIONS                    8
>
>  struct xdp_umem_reg {
>         __u64 addr; /* Start of packet data area */
> @@ -60,6 +61,12 @@ struct xdp_statistics {
>         __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
>  };
>
> +struct xdp_options {
> +       __u32 flags;
> +};
> +
> +#define XDP_OPTIONS_FLAG_ZEROCOPY (1 << 0)
> +
>  /* Pgoff for mmaping the rings */
>  #define XDP_PGOFF_RX_RING                        0
>  #define XDP_PGOFF_TX_RING               0x80000000
> --
> 2.19.1
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic
  2019-04-30 18:12 ` [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic Maxim Mikityanskiy
@ 2019-05-04 17:26   ` Björn Töpel
  2019-05-06 13:45     ` Maxim Mikityanskiy
  0 siblings, 1 reply; 26+ messages in thread
From: Björn Töpel @ 2019-05-04 17:26 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson, bpf, netdev, David S. Miller, Saeed Mahameed,
	Jonathan Lemon, Tariq Toukan, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jakub Kicinski, Maciej Fijalkowski

On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> Currently, the drivers that implement AF_XDP zero-copy support (e.g.,
> i40e) switch the channel into a different mode when an XSK is opened. It
> causes some issues that have to be taken into account. For example, RSS
> needs to be reconfigured to skip the XSK-enabled channels, or the XDP
> program should filter out traffic not intended for that socket and
> XDP_PASS it with an additional copy. As nothing validates or forces the
> proper configuration, it's easy to have packet drops when packets get
> into an XSK by mistake, and, in fact, that is the default configuration.
> There has to be some tool to have RSS reconfigured on each socket open
> and close event, but such a tool is problematic to implement, because no
> one reports these events, and it's race-prone.
>
> This commit extends XSK to support both kinds of traffic (XSK and
> non-XSK) in the same channel. It implies having two RX queues in
> XSK-enabled channels: one for the regular traffic, and the other for
> XSK. It solves the problem with RSS: the default configuration just
> works without the need to manually reconfigure RSS or to perform some
> possibly complicated filtering in the XDP layer. It makes it easy to run
> both AF_XDP and regular sockets on the same machine. In the XDP program,
> the QID's most significant bit will serve as a flag to indicate whether
> it's the XSK queue or not. The extension is compatible with the legacy
> configuration, so if one wants to run the legacy mode, they can
> reconfigure RSS and ignore the flag in the XDP program (implemented in
> the reference XDP program in libbpf). mlx5e will support this extension.
>
> A single XDP program can run both with drivers that support this
> extension and with drivers that don't. The xdpsock sample and libbpf are updated
> accordingly.
>

I'm still not a fan of this, or maybe I'm not following you. It makes
it more complex and even harder to use. Let's take a look at the
kernel nomenclature. "ethtool" uses netdevs and channels. A channel is
an Rx queue or a Tx queue. In AF_XDP we call the channel a queue, which
is what the kernel uses internally (netdev_rx_queue, netdev_queue).

Today, AF_XDP can attach to an existing queue for ingress. (On the
egress side, we're also using "a queue", but it's the "XDP queue". XDP has
these "shadow queues", which are separate from the netdev. This is a bit
messy, and we can't really configure them. I believe Jakub has some
ideas here. :-) For now, let's leave egress aside.)

If an application would like to get all the traffic from a netdev,
it'll create as many sockets as there are queues and bind one to each
queue. Yes, even the queues in the RSS set.

What you would like (I think):
a) a way of spawning a new queue for a netdev that is not part of
the stack and/or RSS set
b) steering traffic to that queue using a configuration mechanism (tc?
some yet to be hacked BPF configuration hook?)

With your mechanism you're doing this in a contrived way. This makes the
existing AF_XDP model *more* complex/hard(er) to use.

How do you steer traffic to this dual-channel RQ? So you have a netdev
receiving on all queues. Then, e.g., the last queue is a "dual
channel" queue that can receive traffic from some other filter. How do
you use it?



Björn

> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  include/uapi/linux/if_xdp.h       |  11 +++
>  net/xdp/xsk.c                     |   5 +-
>  samples/bpf/xdpsock_user.c        |  10 ++-
>  tools/include/uapi/linux/if_xdp.h |  11 +++
>  tools/lib/bpf/xsk.c               | 116 ++++++++++++++++++++++--------
>  tools/lib/bpf/xsk.h               |   4 ++
>  6 files changed, 126 insertions(+), 31 deletions(-)
>
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -82,4 +82,15 @@ struct xdp_desc {
>
>  /* UMEM descriptor is __u64 */
>
> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
> + * this flag bit in the queue index to distinguish between two RQs of the same
> + * channel.
> + */
> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
> +
> +static inline __u32 xdp_qid_get_channel(__u32 qid)
> +{
> +       return qid & ~XDP_QID_FLAG_XSKRQ;
> +}
> +
>  #endif /* _LINUX_IF_XDP_H */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 998199109d5c..114ba17acb09 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -104,9 +104,12 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>
>  int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>  {
> +       struct xdp_rxq_info *rxq = xdp->rxq;
> +       u32 channel = xdp_qid_get_channel(rxq->queue_index);
>         u32 len;
>
> -       if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
> +       if (xs->dev != rxq->dev || xs->queue_id != channel ||
> +           xs->zc != (rxq->mem.type == MEM_TYPE_ZERO_COPY))
>                 return -EINVAL;
>
>         len = xdp->data_end - xdp->data;
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> index d08ee1ab7bb4..a6b13025ee79 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -62,6 +62,7 @@ enum benchmark_type {
>
>  static enum benchmark_type opt_bench = BENCH_RXDROP;
>  static u32 opt_xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
> +static u32 opt_libbpf_flags;
>  static const char *opt_if = "";
>  static int opt_ifindex;
>  static int opt_queue;
> @@ -306,7 +307,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
>         xsk->umem = umem;
>         cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
>         cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
> -       cfg.libbpf_flags = 0;
> +       cfg.libbpf_flags = opt_libbpf_flags;
>         cfg.xdp_flags = opt_xdp_flags;
>         cfg.bind_flags = opt_xdp_bind_flags;
>         ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
> @@ -346,6 +347,7 @@ static struct option long_options[] = {
>         {"interval", required_argument, 0, 'n'},
>         {"zero-copy", no_argument, 0, 'z'},
>         {"copy", no_argument, 0, 'c'},
> +       {"combined", no_argument, 0, 'C'},
>         {0, 0, 0, 0}
>  };
>
> @@ -365,6 +367,7 @@ static void usage(const char *prog)
>                 "  -n, --interval=n     Specify statistics update interval (default 1 sec).\n"
>                 "  -z, --zero-copy      Force zero-copy mode.\n"
>                 "  -c, --copy           Force copy mode.\n"
> +               "  -C, --combined       Driver supports combined XSK and non-XSK traffic in a channel.\n"
>                 "\n";
>         fprintf(stderr, str, prog);
>         exit(EXIT_FAILURE);
> @@ -377,7 +380,7 @@ static void parse_command_line(int argc, char **argv)
>         opterr = 0;
>
>         for (;;) {
> -               c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
> +               c = getopt_long(argc, argv, "Frtli:q:psSNn:czC", long_options,
>                                 &option_index);
>                 if (c == -1)
>                         break;
> @@ -420,6 +423,9 @@ static void parse_command_line(int argc, char **argv)
>                 case 'F':
>                         opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
>                         break;
> +               case 'C':
> +                       opt_libbpf_flags |= XSK_LIBBPF_FLAGS__COMBINED_CHANNELS;
> +                       break;
>                 default:
>                         usage(basename(argv[0]));
>                 }
> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -82,4 +82,15 @@ struct xdp_desc {
>
>  /* UMEM descriptor is __u64 */
>
> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
> + * this flag bit in the queue index to distinguish between two RQs of the same
> + * channel.
> + */
> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
> +
> +static inline __u32 xdp_qid_get_channel(__u32 qid)
> +{
> +       return qid & ~XDP_QID_FLAG_XSKRQ;
> +}
> +
>  #endif /* _LINUX_IF_XDP_H */
> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> index a95b06d1f81d..969dfd856039 100644
> --- a/tools/lib/bpf/xsk.c
> +++ b/tools/lib/bpf/xsk.c
> @@ -76,6 +76,12 @@ struct xsk_nl_info {
>         int fd;
>  };
>
> +enum qidconf {
> +       QIDCONF_REGULAR,
> +       QIDCONF_XSK,
> +       QIDCONF_XSK_COMBINED,
> +};
> +
>  /* For 32-bit systems, we need to use mmap2 as the offsets are 64-bit.
>   * Unfortunately, it is not part of glibc.
>   */
> @@ -139,7 +145,7 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
>                 return 0;
>         }
>
> -       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)
> +       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS_MASK)
>                 return -EINVAL;
>
>         cfg->rx_size = usr_cfg->rx_size;
> @@ -267,44 +273,93 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
>         /* This is the C-program:
>          * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
>          * {
> -        *     int *qidconf, index = ctx->rx_queue_index;
> +        *     int *qidconf, qc;
> +        *     int index = ctx->rx_queue_index & ~(1 << 31);
> +        *     bool is_xskrq = ctx->rx_queue_index & (1 << 31);
>          *
> -        *     // A set entry here means that the correspnding queue_id
> -        *     // has an active AF_XDP socket bound to it.
> +        *     // A set entry here means that the corresponding queue_id
> +        *     // has an active AF_XDP socket bound to it. Value 2 means
> +        *     // it's zero-copy multi-RQ mode.
>          *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
>          *     if (!qidconf)
>          *         return XDP_ABORTED;
>          *
> -        *     if (*qidconf)
> +        *     qc = *qidconf;
> +        *
> +        *     if (qc == 2)
> +        *         qc = is_xskrq ? 1 : 0;
> +        *
> +        *     switch (qc) {
> +        *     case 0:
> +        *         return XDP_PASS;
> +        *     case 1:
>          *         return bpf_redirect_map(&xsks_map, index, 0);
> +        *     }
>          *
> -        *     return XDP_PASS;
> +        *     return XDP_ABORTED;
>          * }
>          */
>         struct bpf_insn prog[] = {
> -               /* r1 = *(u32 *)(r1 + 16) */
> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
> -               /* *(u32 *)(r10 - 4) = r1 */
> -               BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
> -               BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> -               BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
> +               /* Load index. */
> +               /* r6 = *(u32 *)(r1 + 16) */
> +               BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_ARG1, 16),
> +               /* w7 = w6 */
> +               BPF_MOV32_REG(BPF_REG_7, BPF_REG_6),
> +               /* w7 &= 2147483647 */
> +               BPF_ALU32_IMM(BPF_AND, BPF_REG_7, ~XDP_QID_FLAG_XSKRQ),
> +               /* *(u32 *)(r10 - 4) = r7 */
> +               BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_REG_7, -4),
> +
> +               /* Call bpf_map_lookup_elem. */
> +               /* r2 = r10 */
> +               BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
> +               /* r2 += -4 */
> +               BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
> +               /* r1 = qidconf_map ll */
> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->qidconf_map_fd),
> +               /* call 1 */
>                 BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> -               BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> -               BPF_MOV32_IMM(BPF_REG_0, 0),
> -               /* if r1 == 0 goto +8 */
> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
> -               BPF_MOV32_IMM(BPF_REG_0, 2),
> -               /* r1 = *(u32 *)(r1 + 0) */
> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
> -               /* if r1 == 0 goto +5 */
> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
> -               /* r2 = *(u32 *)(r10 - 4) */
> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
> -               BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
> +
> +               /* Check the return value. */
> +               /* if r0 == 0 goto +14 */
> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 14),
> +
> +               /* Check qc == QIDCONF_XSK_COMBINED. */
> +               /* r6 >>= 31 */
> +               BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 31),
> +               /* r1 = *(u32 *)(r0 + 0) */
> +               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
> +               /* if r1 == 2 goto +1 */
> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, QIDCONF_XSK_COMBINED, 1),
> +
> +               /* qc != QIDCONF_XSK_COMBINED */
> +               /* r6 = r1 */
> +               BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
> +
> +               /* switch (qc) */
> +               /* w0 = 2 */
> +               BPF_MOV32_IMM(BPF_REG_0, XDP_PASS),
> +               /* if w6 == 0 goto +8 */
> +               BPF_JMP32_IMM(BPF_JEQ, BPF_REG_6, QIDCONF_REGULAR, 8),
> +               /* if w6 != 1 goto +6 */
> +               BPF_JMP32_IMM(BPF_JNE, BPF_REG_6, QIDCONF_XSK, 6),
> +
> +               /* Call bpf_redirect_map. */
> +               /* r1 = xsks_map ll */
> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->xsks_map_fd),
> +               /* w2 = w7 */
> +               BPF_MOV32_REG(BPF_REG_ARG2, BPF_REG_7),
> +               /* w3 = 0 */
>                 BPF_MOV32_IMM(BPF_REG_3, 0),
> +               /* call 51 */
>                 BPF_EMIT_CALL(BPF_FUNC_redirect_map),
> -               /* The jumps are to this instruction */
> +               /* exit */
> +               BPF_EXIT_INSN(),
> +
> +               /* XDP_ABORTED */
> +               /* w0 = 0 */
> +               BPF_MOV32_IMM(BPF_REG_0, XDP_ABORTED),
> +               /* exit */
>                 BPF_EXIT_INSN(),
>         };
>         size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
> @@ -483,6 +538,7 @@ static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
>
>  static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>  {
> +       int qidconf_value = QIDCONF_XSK;
>         bool prog_attached = false;
>         __u32 prog_id = 0;
>         int err;
> @@ -505,7 +561,11 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>                 xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
>         }
>
> -       err = xsk_update_bpf_maps(xsk, true, xsk->fd);
> +       if (xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
> +               if (xsk->zc)
> +                       qidconf_value = QIDCONF_XSK_COMBINED;
> +
> +       err = xsk_update_bpf_maps(xsk, qidconf_value, xsk->fd);
>         if (err)
>                 goto out_load;
>
> @@ -717,7 +777,7 @@ void xsk_socket__delete(struct xsk_socket *xsk)
>         if (!xsk)
>                 return;
>
> -       (void)xsk_update_bpf_maps(xsk, 0, 0);
> +       (void)xsk_update_bpf_maps(xsk, QIDCONF_REGULAR, 0);
>
>         optlen = sizeof(off);
>         err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> index 82ea71a0f3ec..be26a2423c04 100644
> --- a/tools/lib/bpf/xsk.h
> +++ b/tools/lib/bpf/xsk.h
> @@ -180,6 +180,10 @@ struct xsk_umem_config {
>
>  /* Flags for the libbpf_flags field. */
>  #define XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD (1 << 0)
> +#define XSK_LIBBPF_FLAGS__COMBINED_CHANNELS (1 << 1)
> +#define XSK_LIBBPF_FLAGS_MASK ( \
> +       XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD | \
> +       XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
>
>  struct xsk_socket_config {
>         __u32 rx_size;
> --
> 2.19.1
>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic
  2019-05-04 17:26   ` Björn Töpel
@ 2019-05-06 13:45     ` Maxim Mikityanskiy
  2019-05-06 14:23       ` Magnus Karlsson
  0 siblings, 1 reply; 26+ messages in thread
From: Maxim Mikityanskiy @ 2019-05-06 13:45 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Alexei Starovoitov, Daniel Borkmann, Björn Töpel,
	Magnus Karlsson, bpf, netdev, David S. Miller, Saeed Mahameed,
	Jonathan Lemon, Tariq Toukan, Martin KaFai Lau, Song Liu,
	Yonghong Song, Jakub Kicinski, Maciej Fijalkowski

On 2019-05-04 20:26, Björn Töpel wrote:
> On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>>
>> Currently, the drivers that implement AF_XDP zero-copy support (e.g.,
>> i40e) switch the channel into a different mode when an XSK is opened. It
>> causes some issues that have to be taken into account. For example, RSS
>> needs to be reconfigured to skip the XSK-enabled channels, or the XDP
>> program should filter out traffic not intended for that socket and
>> XDP_PASS it with an additional copy. As nothing validates or forces the
>> proper configuration, it's easy to have packets drops, when they get
>> into an XSK by mistake, and, in fact, it's the default configuration.
>> There has to be some tool to have RSS reconfigured on each socket open
>> and close event, but such a tool is problematic to implement, because no
>> one reports these events, and it's race-prone.
>>
>> This commit extends XSK to support both kinds of traffic (XSK and
>> non-XSK) in the same channel. It implies having two RX queues in
>> XSK-enabled channels: one for the regular traffic, and the other for
>> XSK. It solves the problem with RSS: the default configuration just
>> works without the need to manually reconfigure RSS or to perform some
>> possibly complicated filtering in the XDP layer. It makes it easy to run
>> both AF_XDP and regular sockets on the same machine. In the XDP program,
>> the QID's most significant bit will serve as a flag to indicate whether
>> it's the XSK queue or not. The extension is compatible with the legacy
>> configuration, so if one wants to run the legacy mode, they can
>> reconfigure RSS and ignore the flag in the XDP program (implemented in
>> the reference XDP program in libbpf). mlx5e will support this extension.
>>
>> A single XDP program can run both with drivers supporting or not
>> supporting this extension. The xdpsock sample and libbpf are updated
>> accordingly.
>>
> 
> I'm still not a fan of this, or maybe I'm not following you. It makes
> it more complex and even harder to use. Let's take a look at the
> kernel nomenclature. "ethtool" uses netdevs and channels. A channel is
> a Rx queue or a Tx queue.

There are also combined channels that consist of an RX and a TX queue. 
mlx5e has only this kind of channel. For us, a channel is a set of 
queues "pinned to a CPU core" (they use the same NAPI).

> In AF_XDP we call the channel a queue, which
> is what kernel uses internally (netdev_rx_queue, netdev_queue).

You seem to agree it's a channel, right?

AF_XDP doesn't allow configuring the RX queue number and the TX queue 
number separately. Basically, you choose a channel in AF_XDP. For some 
reason, it's referred to as a queue in some places, but logically it 
means "channel".

> Today, AF_XDP can attach to an existing queue for ingress. (On the
> egress side, we're using "a queue", but the "XDP queue". XDP has these
> "shadow queues" which are separated from the netdev. This is a bit
> messy, and we can't really configure them. I believe Jakub has some
> ideas here. :-) For now, let's leave egress aside.)

So, XDP already has "shadow queues" for TX, and I see no problem in 
having a similar concept for AF_XDP RX.

> If an application would like to get all the traffic from a netdev,
> it'll create an equal amout of sockets as the queues and bind to the
> queues. Yes, even the queues in the RSS  set.
> 
> What you would like (I think):
> a) is a way of spawning a new queue for a netdev, that is not part of
> the stack and/or RSS set

Yes - for simplicity's sake and to make configuration easier. The only 
things needed are to steer the traffic and to open an AF_XDP socket on 
channel X. We don't need to care about removing the queue from RSS, or 
about finding a way to administer this (which is hard, because it's racy 
if the configuration is not known in advance). So I don't agree that I'm 
complicating things; my goal is to make them easier.

> b) steering traffic to that queue using a configuration mechanism (tc?
> some yet to be hacked BPF configuration hook?)

Currently, ethtool --config-ntuple is used to steer the traffic. The 
user-def argument has a bit that selects XSK RQ/regular RQ, and action 
selects a channel:

ethtool -N eth0 flow-type udp4 dst-port 4242 action 3 user-def 1

> With your mechanism you're doing this in contrived way. This makes the
> existing AF_XDP model *more* complex/hard(er) to use.

No, as I said above, some issues are eliminated with my approach, and no 
new limitations are introduced, so it makes things more universal and 
simpler to configure.

> How do you steer traffic to this dual-channel RQ?

First, there is no dual-channel RQ; a more accurate term is a dual-RQ 
channel, because now the channel contains a regular RQ and can also 
contain an XSK RQ.

For the steering itself, see the ethtool command above - the user-def 
argument has a bit that selects one of two RQs.

> So you have a netdev
> receiving on all queues. Then, e.g., the last queue is a "dual
> channel" queue that can receive traffic from some other filter. How do
> you use it?

If I want to take the last (or some) channel and start using AF_XDP with 
it, I simply configure steering to the XSK RQ of that channel and open a 
socket specifying the channel number. I don't need to reconfigure RSS, 
because RSS packets go to the regular RQ of that channel and don't 
interfere with XSK.

No functionality is lost - if you don't distinguish the regular and XSK 
RQs on the XDP level, you'll get the same effect as with i40e's 
implementation. If you want to dedicate the CPU core and channel solely 
to AF_XDP, in i40e you exclude the channel from RSS, and here you can 
do exactly the same thing. So, no use case is more complicated compared 
to i40e, but there are use cases where this feature is an advantage.

I hope I explained the points you were interested in. Please ask more 
questions if there is still something that I should clarify regarding 
this topic.

Thanks,
Max

> 
> 
> Björn
> 
>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>   include/uapi/linux/if_xdp.h       |  11 +++
>>   net/xdp/xsk.c                     |   5 +-
>>   samples/bpf/xdpsock_user.c        |  10 ++-
>>   tools/include/uapi/linux/if_xdp.h |  11 +++
>>   tools/lib/bpf/xsk.c               | 116 ++++++++++++++++++++++--------
>>   tools/lib/bpf/xsk.h               |   4 ++
>>   6 files changed, 126 insertions(+), 31 deletions(-)
>>
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -82,4 +82,15 @@ struct xdp_desc {
>>
>>   /* UMEM descriptor is __u64 */
>>
>> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
>> + * this flag bit in the queue index to distinguish between two RQs of the same
>> + * channel.
>> + */
>> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
>> +
>> +static inline __u32 xdp_qid_get_channel(__u32 qid)
>> +{
>> +       return qid & ~XDP_QID_FLAG_XSKRQ;
>> +}
>> +
>>   #endif /* _LINUX_IF_XDP_H */
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index 998199109d5c..114ba17acb09 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -104,9 +104,12 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>>
>>   int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>>   {
>> +       struct xdp_rxq_info *rxq = xdp->rxq;
>> +       u32 channel = xdp_qid_get_channel(rxq->queue_index);
>>          u32 len;
>>
>> -       if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
>> +       if (xs->dev != rxq->dev || xs->queue_id != channel ||
>> +           xs->zc != (rxq->mem.type == MEM_TYPE_ZERO_COPY))
>>                  return -EINVAL;
>>
>>          len = xdp->data_end - xdp->data;
>> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
>> index d08ee1ab7bb4..a6b13025ee79 100644
>> --- a/samples/bpf/xdpsock_user.c
>> +++ b/samples/bpf/xdpsock_user.c
>> @@ -62,6 +62,7 @@ enum benchmark_type {
>>
>>   static enum benchmark_type opt_bench = BENCH_RXDROP;
>>   static u32 opt_xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
>> +static u32 opt_libbpf_flags;
>>   static const char *opt_if = "";
>>   static int opt_ifindex;
>>   static int opt_queue;
>> @@ -306,7 +307,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
>>          xsk->umem = umem;
>>          cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
>>          cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
>> -       cfg.libbpf_flags = 0;
>> +       cfg.libbpf_flags = opt_libbpf_flags;
>>          cfg.xdp_flags = opt_xdp_flags;
>>          cfg.bind_flags = opt_xdp_bind_flags;
>>          ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
>> @@ -346,6 +347,7 @@ static struct option long_options[] = {
>>          {"interval", required_argument, 0, 'n'},
>>          {"zero-copy", no_argument, 0, 'z'},
>>          {"copy", no_argument, 0, 'c'},
>> +       {"combined", no_argument, 0, 'C'},
>>          {0, 0, 0, 0}
>>   };
>>
>> @@ -365,6 +367,7 @@ static void usage(const char *prog)
>>                  "  -n, --interval=n     Specify statistics update interval (default 1 sec).\n"
>>                  "  -z, --zero-copy      Force zero-copy mode.\n"
>>                  "  -c, --copy           Force copy mode.\n"
>> +               "  -C, --combined       Driver supports combined XSK and non-XSK traffic in a channel.\n"
>>                  "\n";
>>          fprintf(stderr, str, prog);
>>          exit(EXIT_FAILURE);
>> @@ -377,7 +380,7 @@ static void parse_command_line(int argc, char **argv)
>>          opterr = 0;
>>
>>          for (;;) {
>> -               c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
>> +               c = getopt_long(argc, argv, "Frtli:q:psSNn:czC", long_options,
>>                                  &option_index);
>>                  if (c == -1)
>>                          break;
>> @@ -420,6 +423,9 @@ static void parse_command_line(int argc, char **argv)
>>                  case 'F':
>>                          opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
>>                          break;
>> +               case 'C':
>> +                       opt_libbpf_flags |= XSK_LIBBPF_FLAGS__COMBINED_CHANNELS;
>> +                       break;
>>                  default:
>>                          usage(basename(argv[0]));
>>                  }
>> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
>> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
>> --- a/tools/include/uapi/linux/if_xdp.h
>> +++ b/tools/include/uapi/linux/if_xdp.h
>> @@ -82,4 +82,15 @@ struct xdp_desc {
>>
>>   /* UMEM descriptor is __u64 */
>>
>> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
>> + * this flag bit in the queue index to distinguish between two RQs of the same
>> + * channel.
>> + */
>> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
>> +
>> +static inline __u32 xdp_qid_get_channel(__u32 qid)
>> +{
>> +       return qid & ~XDP_QID_FLAG_XSKRQ;
>> +}
>> +
>>   #endif /* _LINUX_IF_XDP_H */
>> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
>> index a95b06d1f81d..969dfd856039 100644
>> --- a/tools/lib/bpf/xsk.c
>> +++ b/tools/lib/bpf/xsk.c
>> @@ -76,6 +76,12 @@ struct xsk_nl_info {
>>          int fd;
>>   };
>>
>> +enum qidconf {
>> +       QIDCONF_REGULAR,
>> +       QIDCONF_XSK,
>> +       QIDCONF_XSK_COMBINED,
>> +};
>> +
>>   /* For 32-bit systems, we need to use mmap2 as the offsets are 64-bit.
>>    * Unfortunately, it is not part of glibc.
>>    */
>> @@ -139,7 +145,7 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
>>                  return 0;
>>          }
>>
>> -       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)
>> +       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS_MASK)
>>                  return -EINVAL;
>>
>>          cfg->rx_size = usr_cfg->rx_size;
>> @@ -267,44 +273,93 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
>>          /* This is the C-program:
>>           * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
>>           * {
>> -        *     int *qidconf, index = ctx->rx_queue_index;
>> +        *     int *qidconf, qc;
>> +        *     int index = ctx->rx_queue_index & ~(1 << 31);
>> +        *     bool is_xskrq = ctx->rx_queue_index & (1 << 31);
>>           *
>> -        *     // A set entry here means that the correspnding queue_id
>> -        *     // has an active AF_XDP socket bound to it.
>> +        *     // A set entry here means that the corresponding queue_id
>> +        *     // has an active AF_XDP socket bound to it. Value 2 means
>> +        *     // it's zero-copy multi-RQ mode.
>>           *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
>>           *     if (!qidconf)
>>           *         return XDP_ABORTED;
>>           *
>> -        *     if (*qidconf)
>> +        *     qc = *qidconf;
>> +        *
>> +        *     if (qc == 2)
>> +        *         qc = is_xskrq ? 1 : 0;
>> +        *
>> +        *     switch (qc) {
>> +        *     case 0:
>> +        *         return XDP_PASS;
>> +        *     case 1:
>>           *         return bpf_redirect_map(&xsks_map, index, 0);
>> +        *     }
>>           *
>> -        *     return XDP_PASS;
>> +        *     return XDP_ABORTED;
>>           * }
>>           */
>>          struct bpf_insn prog[] = {
>> -               /* r1 = *(u32 *)(r1 + 16) */
>> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
>> -               /* *(u32 *)(r10 - 4) = r1 */
>> -               BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
>> -               BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
>> -               BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
>> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
>> +               /* Load index. */
>> +               /* r6 = *(u32 *)(r1 + 16) */
>> +               BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_ARG1, 16),
>> +               /* w7 = w6 */
>> +               BPF_MOV32_REG(BPF_REG_7, BPF_REG_6),
>> +               /* w7 &= 2147483647 */
>> +               BPF_ALU32_IMM(BPF_AND, BPF_REG_7, ~XDP_QID_FLAG_XSKRQ),
>> +               /* *(u32 *)(r10 - 4) = r7 */
>> +               BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_REG_7, -4),
>> +
>> +               /* Call bpf_map_lookup_elem. */
>> +               /* r2 = r10 */
>> +               BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
>> +               /* r2 += -4 */
>> +               BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
>> +               /* r1 = qidconf_map ll */
>> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->qidconf_map_fd),
>> +               /* call 1 */
>>                  BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>> -               BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
>> -               BPF_MOV32_IMM(BPF_REG_0, 0),
>> -               /* if r1 == 0 goto +8 */
>> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
>> -               BPF_MOV32_IMM(BPF_REG_0, 2),
>> -               /* r1 = *(u32 *)(r1 + 0) */
>> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
>> -               /* if r1 == 0 goto +5 */
>> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
>> -               /* r2 = *(u32 *)(r10 - 4) */
>> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
>> -               BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
>> +
>> +               /* Check the return value. */
>> +               /* if r0 == 0 goto +14 */
>> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 14),
>> +
>> +               /* Check qc == QIDCONF_XSK_COMBINED. */
>> +               /* r6 >>= 31 */
>> +               BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 31),
>> +               /* r1 = *(u32 *)(r0 + 0) */
>> +               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
>> +               /* if r1 == 2 goto +1 */
>> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, QIDCONF_XSK_COMBINED, 1),
>> +
>> +               /* qc != QIDCONF_XSK_COMBINED */
>> +               /* r6 = r1 */
>> +               BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
>> +
>> +               /* switch (qc) */
>> +               /* w0 = 2 */
>> +               BPF_MOV32_IMM(BPF_REG_0, XDP_PASS),
>> +               /* if w6 == 0 goto +8 */
>> +               BPF_JMP32_IMM(BPF_JEQ, BPF_REG_6, QIDCONF_REGULAR, 8),
>> +               /* if w6 != 1 goto +6 */
>> +               BPF_JMP32_IMM(BPF_JNE, BPF_REG_6, QIDCONF_XSK, 6),
>> +
>> +               /* Call bpf_redirect_map. */
>> +               /* r1 = xsks_map ll */
>> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->xsks_map_fd),
>> +               /* w2 = w7 */
>> +               BPF_MOV32_REG(BPF_REG_ARG2, BPF_REG_7),
>> +               /* w3 = 0 */
>>                  BPF_MOV32_IMM(BPF_REG_3, 0),
>> +               /* call 51 */
>>                  BPF_EMIT_CALL(BPF_FUNC_redirect_map),
>> -               /* The jumps are to this instruction */
>> +               /* exit */
>> +               BPF_EXIT_INSN(),
>> +
>> +               /* XDP_ABORTED */
>> +               /* w0 = 0 */
>> +               BPF_MOV32_IMM(BPF_REG_0, XDP_ABORTED),
>> +               /* exit */
>>                  BPF_EXIT_INSN(),
>>          };
>>          size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
>> @@ -483,6 +538,7 @@ static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
>>
>>   static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>>   {
>> +       int qidconf_value = QIDCONF_XSK;
>>          bool prog_attached = false;
>>          __u32 prog_id = 0;
>>          int err;
>> @@ -505,7 +561,11 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>>                  xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
>>          }
>>
>> -       err = xsk_update_bpf_maps(xsk, true, xsk->fd);
>> +       if (xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
>> +               if (xsk->zc)
>> +                       qidconf_value = QIDCONF_XSK_COMBINED;
>> +
>> +       err = xsk_update_bpf_maps(xsk, qidconf_value, xsk->fd);
>>          if (err)
>>                  goto out_load;
>>
>> @@ -717,7 +777,7 @@ void xsk_socket__delete(struct xsk_socket *xsk)
>>          if (!xsk)
>>                  return;
>>
>> -       (void)xsk_update_bpf_maps(xsk, 0, 0);
>> +       (void)xsk_update_bpf_maps(xsk, QIDCONF_REGULAR, 0);
>>
>>          optlen = sizeof(off);
>>          err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
>> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
>> index 82ea71a0f3ec..be26a2423c04 100644
>> --- a/tools/lib/bpf/xsk.h
>> +++ b/tools/lib/bpf/xsk.h
>> @@ -180,6 +180,10 @@ struct xsk_umem_config {
>>
>>   /* Flags for the libbpf_flags field. */
>>   #define XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD (1 << 0)
>> +#define XSK_LIBBPF_FLAGS__COMBINED_CHANNELS (1 << 1)
>> +#define XSK_LIBBPF_FLAGS_MASK ( \
>> +       XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD | \
>> +       XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
>>
>>   struct xsk_socket_config {
>>          __u32 rx_size;
>> --
>> 2.19.1
>>



* Re: [PATCH bpf-next v2 02/16] xsk: Add getsockopt XDP_OPTIONS
  2019-05-04 17:25   ` Björn Töpel
@ 2019-05-06 13:45     ` Maxim Mikityanskiy
  2019-05-06 16:35       ` Alexei Starovoitov
  0 siblings, 1 reply; 26+ messages in thread
From: Maxim Mikityanskiy @ 2019-05-06 13:45 UTC (permalink / raw)
  To: Björn Töpel, Alexei Starovoitov
  Cc: Daniel Borkmann, Björn Töpel, Magnus Karlsson, bpf,
	netdev, David S. Miller, Saeed Mahameed, Jonathan Lemon,
	Tariq Toukan, Martin KaFai Lau, Song Liu, Yonghong Song,
	Jakub Kicinski, Maciej Fijalkowski

On 2019-05-04 20:25, Björn Töpel wrote:
> On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>>
>> Make it possible for the application to determine whether the AF_XDP
>> socket is running in zero-copy mode. To achieve this, add a new
>> getsockopt option XDP_OPTIONS that returns flags. The only flag
>> supported for now is the zero-copy mode indicator.
>>
>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
>> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>   include/uapi/linux/if_xdp.h       |  7 +++++++
>>   net/xdp/xsk.c                     | 22 ++++++++++++++++++++++
>>   tools/include/uapi/linux/if_xdp.h |  7 +++++++
>>   3 files changed, 36 insertions(+)
>>
>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>> index caed8b1614ff..9ae4b4e08b68 100644
>> --- a/include/uapi/linux/if_xdp.h
>> +++ b/include/uapi/linux/if_xdp.h
>> @@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
>>   #define XDP_UMEM_FILL_RING             5
>>   #define XDP_UMEM_COMPLETION_RING       6
>>   #define XDP_STATISTICS                 7
>> +#define XDP_OPTIONS                    8
>>
>>   struct xdp_umem_reg {
>>          __u64 addr; /* Start of packet data area */
>> @@ -60,6 +61,12 @@ struct xdp_statistics {
>>          __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
>>   };
>>
>> +struct xdp_options {
>> +       __u32 flags;
>> +};
>> +
>> +#define XDP_OPTIONS_FLAG_ZEROCOPY (1 << 0)
> 
> Nit: The other flags doesn't use "FLAG" in its name, but that doesn't
> really matter.
> 
>> +
>>   /* Pgoff for mmaping the rings */
>>   #define XDP_PGOFF_RX_RING                        0
>>   #define XDP_PGOFF_TX_RING               0x80000000
>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>> index b68a380f50b3..998199109d5c 100644
>> --- a/net/xdp/xsk.c
>> +++ b/net/xdp/xsk.c
>> @@ -650,6 +650,28 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
>>
>>                  return 0;
>>          }
>> +       case XDP_OPTIONS:
>> +       {
>> +               struct xdp_options opts;
>> +
>> +               if (len < sizeof(opts))
>> +                       return -EINVAL;
>> +
>> +               opts.flags = 0;
> 
> Maybe get rid of this, in favor of "opts = {}" if the structure grows?

I'm OK with any of these options. Should I respin the series, or can I 
follow up with the change in RCs if the series gets to 5.2?

Alexei, is it even possible to still make changes to this series? The 
window appears closed.

> 
>> +
>> +               mutex_lock(&xs->mutex);
>> +               if (xs->zc)
>> +                       opts.flags |= XDP_OPTIONS_FLAG_ZEROCOPY;
>> +               mutex_unlock(&xs->mutex);
>> +
>> +               len = sizeof(opts);
>> +               if (copy_to_user(optval, &opts, len))
>> +                       return -EFAULT;
>> +               if (put_user(len, optlen))
>> +                       return -EFAULT;
>> +
>> +               return 0;
>> +       }
>>          default:
>>                  break;
>>          }
>> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
>> index caed8b1614ff..9ae4b4e08b68 100644
>> --- a/tools/include/uapi/linux/if_xdp.h
>> +++ b/tools/include/uapi/linux/if_xdp.h
>> @@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
>>   #define XDP_UMEM_FILL_RING             5
>>   #define XDP_UMEM_COMPLETION_RING       6
>>   #define XDP_STATISTICS                 7
>> +#define XDP_OPTIONS                    8
>>
>>   struct xdp_umem_reg {
>>          __u64 addr; /* Start of packet data area */
>> @@ -60,6 +61,12 @@ struct xdp_statistics {
>>          __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
>>   };
>>
>> +struct xdp_options {
>> +       __u32 flags;
>> +};
>> +
>> +#define XDP_OPTIONS_FLAG_ZEROCOPY (1 << 0)
>> +
>>   /* Pgoff for mmaping the rings */
>>   #define XDP_PGOFF_RX_RING                        0
>>   #define XDP_PGOFF_TX_RING               0x80000000
>> --
>> 2.19.1
>>



* Re: [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic
  2019-05-06 13:45     ` Maxim Mikityanskiy
@ 2019-05-06 14:23       ` Magnus Karlsson
  2019-05-07 14:19         ` Maxim Mikityanskiy
  0 siblings, 1 reply; 26+ messages in thread
From: Magnus Karlsson @ 2019-05-06 14:23 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, Magnus Karlsson, bpf, netdev,
	David S. Miller, Saeed Mahameed, Jonathan Lemon, Tariq Toukan,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jakub Kicinski,
	Maciej Fijalkowski

On Mon, May 6, 2019 at 3:46 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> On 2019-05-04 20:26, Björn Töpel wrote:
> > On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
> >>
> >> Currently, the drivers that implement AF_XDP zero-copy support (e.g.,
> >> i40e) switch the channel into a different mode when an XSK is opened. It
> >> causes some issues that have to be taken into account. For example, RSS
> >> needs to be reconfigured to skip the XSK-enabled channels, or the XDP
> >> program should filter out traffic not intended for that socket and
> >> XDP_PASS it with an additional copy. As nothing validates or forces the
> >> proper configuration, it's easy to have packets drops, when they get
> >> into an XSK by mistake, and, in fact, it's the default configuration.
> >> There has to be some tool to have RSS reconfigured on each socket open
> >> and close event, but such a tool is problematic to implement, because no
> >> one reports these events, and it's race-prone.
> >>
> >> This commit extends XSK to support both kinds of traffic (XSK and
> >> non-XSK) in the same channel. It implies having two RX queues in
> >> XSK-enabled channels: one for the regular traffic, and the other for
> >> XSK. It solves the problem with RSS: the default configuration just
> >> works without the need to manually reconfigure RSS or to perform some
> >> possibly complicated filtering in the XDP layer. It makes it easy to run
> >> both AF_XDP and regular sockets on the same machine. In the XDP program,
> >> the QID's most significant bit will serve as a flag to indicate whether
> >> it's the XSK queue or not. The extension is compatible with the legacy
> >> configuration, so if one wants to run the legacy mode, they can
> >> reconfigure RSS and ignore the flag in the XDP program (implemented in
> >> the reference XDP program in libbpf). mlx5e will support this extension.
> >>
> >> A single XDP program can run both with drivers supporting or not
> >> supporting this extension. The xdpsock sample and libbpf are updated
> >> accordingly.
> >>
> >
> > I'm still not a fan of this, or maybe I'm not following you. It makes
> > it more complex and even harder to use. Let's take a look at the
> > kernel nomenclature. "ethtool" uses netdevs and channels. A channel is
> > a Rx queue or a Tx queue.
>
> There are also combined channels that consist of an RX and a TX queue.
> mlx5e has only this kind of channel. For us, a channel is a set of
> queues "pinned to a CPU core" (they use the same NAPI).
>
> > In AF_XDP we call the channel a queue, which
> > is what kernel uses internally (netdev_rx_queue, netdev_queue).
>
> You seem to agree it's a channel, right?
>
> AF_XDP doesn't allow configuring the RX queue number and the TX queue
> number separately. Basically, you choose a channel in AF_XDP. For some
> reason, it's referred to as a queue in some places, but logically it
> means "channel".

You can configure the Rx queue and the Tx queue separately by creating
two sockets tied to the same umem area. But if you just create one,
you are correct.

> > Today, AF_XDP can attach to an existing queue for ingress. (On the
> > egress side, we're using "a queue", but the "XDP queue". XDP has these
> > "shadow queues" which are separated from the netdev. This is a bit
> > messy, and we can't really configure them. I believe Jakub has some
> > ideas here. :-) For now, let's leave egress aside.)
>
> So, XDP already has "shadow queues" for TX, and I see no problem in
> having a similar concept for AF_XDP RX.

The question is whether we would like to continue down the path of "shadow
queues" by adding even more. In the busy-poll RFC I sent out last
week, I talk about the possibility of creating a new queue (set) not
tied to the NAPI of the regular Rx queues in order to get better
performance when using busy-poll. How would such a queue set fit into
a shadow queue set approach? At what point does hiding the real queues
created to support various features break down, so that we have to
expose the real queue number? I'm trying to wrap my head around these
questions.

Maxim, would it be possible for you to respin this set without this
feature? I like the other stuff you have implemented and think that
the rest of the common functionality should be useful for all of us.
This way you can get the AF_XDP support accepted quicker while we
debate the best way to solve the issue in this thread.

Thanks for all your work: Magnus

> > If an application would like to get all the traffic from a netdev,
> > it'll create an equal amount of sockets as the queues and bind to the
> > queues. Yes, even the queues in the RSS set.
> >
> > What you would like (I think):
> > a) is a way of spawning a new queue for a netdev, that is not part of
> > the stack and/or RSS set
>
> Yes - for simplicity's sake and to make configuration easier. The only
> things needed are to steer the traffic and to open an AF_XDP socket on
> channel X. We don't need to care about removing the queue from RSS, or
> about finding a way to administer this (which is hard, because it's racy
> if the configuration is not known in advance). So I don't agree that I'm
> complicating things; my goal is to make them easier.
>
> > b) steering traffic to that queue using a configuration mechanism (tc?
> > some yet to be hacked BPF configuration hook?)
>
> Currently, ethtool --config-ntuple is used to steer the traffic. The
> user-def argument has a bit that selects XSK RQ/regular RQ, and action
> selects a channel:
>
> ethtool -N eth0 flow-type udp4 dst-port 4242 action 3 user-def 1
>
> > With your mechanism you're doing this in contrived way. This makes the
> > existing AF_XDP model *more* complex/hard(er) to use.
>
> No, as I said above, some issues are eliminated with my approach, and no
> new limitations are introduced, so it makes things more universal and
> simpler to configure.
>
> > How do you steer traffic to this dual-channel RQ?
>
> First, there is no dual-channel RQ; a more accurate term is a dual-RQ
> channel, because now the channel contains a regular RQ and can also
> contain an XSK RQ.
>
> For the steering itself, see the ethtool command above - the user-def
> argument has a bit that selects one of two RQs.
>
> > So you have a netdev
> > receiving on all queues. Then, e.g., the last queue is a "dual
> > channel" queue that can receive traffic from some other filter. How do
> > you use it?
>
> If I want to take the last (or some) channel and start using AF_XDP with
> it, I simply configure steering to the XSK RQ of that channel and open a
> socket specifying the channel number. I don't need to reconfigure RSS,
> because RSS packets go to the regular RQ of that channel and don't
> interfere with XSK.
>
> No functionality is lost - if you don't distinguish the regular and XSK
> RQs on the XDP level, you'll get the same effect as with i40e's
> implementation. If you want to dedicate the CPU core and channel solely
> for AF_XDP, in i40e you exclude the channel from RSS, and here you can
> > do exactly the same thing. So, no use case becomes more complicated
> > compared to i40e, but there are use cases where this feature is an
> > advantage.
>
> I hope I explained the points you were interested in. Please ask more
> questions if there is still something that I should clarify regarding
> this topic.
>
> Thanks,
> Max
>
> >
> >
> > Björn
> >
> >> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> >> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> >> ---
> >>   include/uapi/linux/if_xdp.h       |  11 +++
> >>   net/xdp/xsk.c                     |   5 +-
> >>   samples/bpf/xdpsock_user.c        |  10 ++-
> >>   tools/include/uapi/linux/if_xdp.h |  11 +++
> >>   tools/lib/bpf/xsk.c               | 116 ++++++++++++++++++++++--------
> >>   tools/lib/bpf/xsk.h               |   4 ++
> >>   6 files changed, 126 insertions(+), 31 deletions(-)
> >>
> >> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> >> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
> >> --- a/include/uapi/linux/if_xdp.h
> >> +++ b/include/uapi/linux/if_xdp.h
> >> @@ -82,4 +82,15 @@ struct xdp_desc {
> >>
> >>   /* UMEM descriptor is __u64 */
> >>
> >> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
> >> + * this flag bit in the queue index to distinguish between two RQs of the same
> >> + * channel.
> >> + */
> >> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
> >> +
> >> +static inline __u32 xdp_qid_get_channel(__u32 qid)
> >> +{
> >> +       return qid & ~XDP_QID_FLAG_XSKRQ;
> >> +}
> >> +
> >>   #endif /* _LINUX_IF_XDP_H */
> >> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >> index 998199109d5c..114ba17acb09 100644
> >> --- a/net/xdp/xsk.c
> >> +++ b/net/xdp/xsk.c
> >> @@ -104,9 +104,12 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
> >>
> >>   int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> >>   {
> >> +       struct xdp_rxq_info *rxq = xdp->rxq;
> >> +       u32 channel = xdp_qid_get_channel(rxq->queue_index);
> >>          u32 len;
> >>
> >> -       if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
> >> +       if (xs->dev != rxq->dev || xs->queue_id != channel ||
> >> +           xs->zc != (rxq->mem.type == MEM_TYPE_ZERO_COPY))
> >>                  return -EINVAL;
> >>
> >>          len = xdp->data_end - xdp->data;
> >> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> >> index d08ee1ab7bb4..a6b13025ee79 100644
> >> --- a/samples/bpf/xdpsock_user.c
> >> +++ b/samples/bpf/xdpsock_user.c
> >> @@ -62,6 +62,7 @@ enum benchmark_type {
> >>
> >>   static enum benchmark_type opt_bench = BENCH_RXDROP;
> >>   static u32 opt_xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
> >> +static u32 opt_libbpf_flags;
> >>   static const char *opt_if = "";
> >>   static int opt_ifindex;
> >>   static int opt_queue;
> >> @@ -306,7 +307,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
> >>          xsk->umem = umem;
> >>          cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> >>          cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
> >> -       cfg.libbpf_flags = 0;
> >> +       cfg.libbpf_flags = opt_libbpf_flags;
> >>          cfg.xdp_flags = opt_xdp_flags;
> >>          cfg.bind_flags = opt_xdp_bind_flags;
> >>          ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
> >> @@ -346,6 +347,7 @@ static struct option long_options[] = {
> >>          {"interval", required_argument, 0, 'n'},
> >>          {"zero-copy", no_argument, 0, 'z'},
> >>          {"copy", no_argument, 0, 'c'},
> >> +       {"combined", no_argument, 0, 'C'},
> >>          {0, 0, 0, 0}
> >>   };
> >>
> >> @@ -365,6 +367,7 @@ static void usage(const char *prog)
> >>                  "  -n, --interval=n     Specify statistics update interval (default 1 sec).\n"
> >>                  "  -z, --zero-copy      Force zero-copy mode.\n"
> >>                  "  -c, --copy           Force copy mode.\n"
> >> +               "  -C, --combined       Driver supports combined XSK and non-XSK traffic in a channel.\n"
> >>                  "\n";
> >>          fprintf(stderr, str, prog);
> >>          exit(EXIT_FAILURE);
> >> @@ -377,7 +380,7 @@ static void parse_command_line(int argc, char **argv)
> >>          opterr = 0;
> >>
> >>          for (;;) {
> >> -               c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
> >> +               c = getopt_long(argc, argv, "Frtli:q:psSNn:czC", long_options,
> >>                                  &option_index);
> >>                  if (c == -1)
> >>                          break;
> >> @@ -420,6 +423,9 @@ static void parse_command_line(int argc, char **argv)
> >>                  case 'F':
> >>                          opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
> >>                          break;
> >> +               case 'C':
> >> +                       opt_libbpf_flags |= XSK_LIBBPF_FLAGS__COMBINED_CHANNELS;
> >> +                       break;
> >>                  default:
> >>                          usage(basename(argv[0]));
> >>                  }
> >> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> >> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
> >> --- a/tools/include/uapi/linux/if_xdp.h
> >> +++ b/tools/include/uapi/linux/if_xdp.h
> >> @@ -82,4 +82,15 @@ struct xdp_desc {
> >>
> >>   /* UMEM descriptor is __u64 */
> >>
> >> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
> >> + * this flag bit in the queue index to distinguish between two RQs of the same
> >> + * channel.
> >> + */
> >> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
> >> +
> >> +static inline __u32 xdp_qid_get_channel(__u32 qid)
> >> +{
> >> +       return qid & ~XDP_QID_FLAG_XSKRQ;
> >> +}
> >> +
> >>   #endif /* _LINUX_IF_XDP_H */
> >> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> >> index a95b06d1f81d..969dfd856039 100644
> >> --- a/tools/lib/bpf/xsk.c
> >> +++ b/tools/lib/bpf/xsk.c
> >> @@ -76,6 +76,12 @@ struct xsk_nl_info {
> >>          int fd;
> >>   };
> >>
> >> +enum qidconf {
> >> +       QIDCONF_REGULAR,
> >> +       QIDCONF_XSK,
> >> +       QIDCONF_XSK_COMBINED,
> >> +};
> >> +
> >>   /* For 32-bit systems, we need to use mmap2 as the offsets are 64-bit.
> >>    * Unfortunately, it is not part of glibc.
> >>    */
> >> @@ -139,7 +145,7 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
> >>                  return 0;
> >>          }
> >>
> >> -       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)
> >> +       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS_MASK)
> >>                  return -EINVAL;
> >>
> >>          cfg->rx_size = usr_cfg->rx_size;
> >> @@ -267,44 +273,93 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
> >>          /* This is the C-program:
> >>           * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
> >>           * {
> >> -        *     int *qidconf, index = ctx->rx_queue_index;
> >> +        *     int *qidconf, qc;
> >> +        *     int index = ctx->rx_queue_index & ~(1 << 31);
> >> +        *     bool is_xskrq = ctx->rx_queue_index & (1 << 31);
> >>           *
> >> -        *     // A set entry here means that the correspnding queue_id
> >> -        *     // has an active AF_XDP socket bound to it.
> >> +        *     // A set entry here means that the corresponding queue_id
> >> +        *     // has an active AF_XDP socket bound to it. Value 2 means
> >> +        *     // it's zero-copy multi-RQ mode.
> >>           *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
> >>           *     if (!qidconf)
> >>           *         return XDP_ABORTED;
> >>           *
> >> -        *     if (*qidconf)
> >> +        *     qc = *qidconf;
> >> +        *
> >> +        *     if (qc == 2)
> >> +        *         qc = is_xskrq ? 1 : 0;
> >> +        *
> >> +        *     switch (qc) {
> >> +        *     case 0:
> >> +        *         return XDP_PASS;
> >> +        *     case 1:
> >>           *         return bpf_redirect_map(&xsks_map, index, 0);
> >> +        *     }
> >>           *
> >> -        *     return XDP_PASS;
> >> +        *     return XDP_ABORTED;
> >>           * }
> >>           */
> >>          struct bpf_insn prog[] = {
> >> -               /* r1 = *(u32 *)(r1 + 16) */
> >> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
> >> -               /* *(u32 *)(r10 - 4) = r1 */
> >> -               BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
> >> -               BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> >> -               BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
> >> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
> >> +               /* Load index. */
> >> +               /* r6 = *(u32 *)(r1 + 16) */
> >> +               BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_ARG1, 16),
> >> +               /* w7 = w6 */
> >> +               BPF_MOV32_REG(BPF_REG_7, BPF_REG_6),
> >> +               /* w7 &= 2147483647 */
> >> +               BPF_ALU32_IMM(BPF_AND, BPF_REG_7, ~XDP_QID_FLAG_XSKRQ),
> >> +               /* *(u32 *)(r10 - 4) = r7 */
> >> +               BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_REG_7, -4),
> >> +
> >> +               /* Call bpf_map_lookup_elem. */
> >> +               /* r2 = r10 */
> >> +               BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
> >> +               /* r2 += -4 */
> >> +               BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
> >> +               /* r1 = qidconf_map ll */
> >> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->qidconf_map_fd),
> >> +               /* call 1 */
> >>                  BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> >> -               BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> >> -               BPF_MOV32_IMM(BPF_REG_0, 0),
> >> -               /* if r1 == 0 goto +8 */
> >> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
> >> -               BPF_MOV32_IMM(BPF_REG_0, 2),
> >> -               /* r1 = *(u32 *)(r1 + 0) */
> >> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
> >> -               /* if r1 == 0 goto +5 */
> >> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
> >> -               /* r2 = *(u32 *)(r10 - 4) */
> >> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
> >> -               BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
> >> +
> >> +               /* Check the return value. */
> >> +               /* if r0 == 0 goto +14 */
> >> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 14),
> >> +
> >> +               /* Check qc == QIDCONF_XSK_COMBINED. */
> >> +               /* r6 >>= 31 */
> >> +               BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 31),
> >> +               /* r1 = *(u32 *)(r0 + 0) */
> >> +               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
> >> +               /* if r1 == 2 goto +1 */
> >> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, QIDCONF_XSK_COMBINED, 1),
> >> +
> >> +               /* qc != QIDCONF_XSK_COMBINED */
> >> +               /* r6 = r1 */
> >> +               BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
> >> +
> >> +               /* switch (qc) */
> >> +               /* w0 = 2 */
> >> +               BPF_MOV32_IMM(BPF_REG_0, XDP_PASS),
> >> +               /* if w6 == 0 goto +8 */
> >> +               BPF_JMP32_IMM(BPF_JEQ, BPF_REG_6, QIDCONF_REGULAR, 8),
> >> +               /* if w6 != 1 goto +6 */
> >> +               BPF_JMP32_IMM(BPF_JNE, BPF_REG_6, QIDCONF_XSK, 6),
> >> +
> >> +               /* Call bpf_redirect_map. */
> >> +               /* r1 = xsks_map ll */
> >> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->xsks_map_fd),
> >> +               /* w2 = w7 */
> >> +               BPF_MOV32_REG(BPF_REG_ARG2, BPF_REG_7),
> >> +               /* w3 = 0 */
> >>                  BPF_MOV32_IMM(BPF_REG_3, 0),
> >> +               /* call 51 */
> >>                  BPF_EMIT_CALL(BPF_FUNC_redirect_map),
> >> -               /* The jumps are to this instruction */
> >> +               /* exit */
> >> +               BPF_EXIT_INSN(),
> >> +
> >> +               /* XDP_ABORTED */
> >> +               /* w0 = 0 */
> >> +               BPF_MOV32_IMM(BPF_REG_0, XDP_ABORTED),
> >> +               /* exit */
> >>                  BPF_EXIT_INSN(),
> >>          };
> >>          size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
> >> @@ -483,6 +538,7 @@ static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
> >>
> >>   static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
> >>   {
> >> +       int qidconf_value = QIDCONF_XSK;
> >>          bool prog_attached = false;
> >>          __u32 prog_id = 0;
> >>          int err;
> >> @@ -505,7 +561,11 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
> >>                  xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
> >>          }
> >>
> >> -       err = xsk_update_bpf_maps(xsk, true, xsk->fd);
> >> +       if (xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
> >> +               if (xsk->zc)
> >> +                       qidconf_value = QIDCONF_XSK_COMBINED;
> >> +
> >> +       err = xsk_update_bpf_maps(xsk, qidconf_value, xsk->fd);
> >>          if (err)
> >>                  goto out_load;
> >>
> >> @@ -717,7 +777,7 @@ void xsk_socket__delete(struct xsk_socket *xsk)
> >>          if (!xsk)
> >>                  return;
> >>
> >> -       (void)xsk_update_bpf_maps(xsk, 0, 0);
> >> +       (void)xsk_update_bpf_maps(xsk, QIDCONF_REGULAR, 0);
> >>
> >>          optlen = sizeof(off);
> >>          err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
> >> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> >> index 82ea71a0f3ec..be26a2423c04 100644
> >> --- a/tools/lib/bpf/xsk.h
> >> +++ b/tools/lib/bpf/xsk.h
> >> @@ -180,6 +180,10 @@ struct xsk_umem_config {
> >>
> >>   /* Flags for the libbpf_flags field. */
> >>   #define XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD (1 << 0)
> >> +#define XSK_LIBBPF_FLAGS__COMBINED_CHANNELS (1 << 1)
> >> +#define XSK_LIBBPF_FLAGS_MASK ( \
> >> +       XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD | \
> >> +       XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
> >>
> >>   struct xsk_socket_config {
> >>          __u32 rx_size;
> >> --
> >> 2.19.1
> >>
>


* Re: [PATCH bpf-next v2 02/16] xsk: Add getsockopt XDP_OPTIONS
  2019-05-06 13:45     ` Maxim Mikityanskiy
@ 2019-05-06 16:35       ` Alexei Starovoitov
  0 siblings, 0 replies; 26+ messages in thread
From: Alexei Starovoitov @ 2019-05-06 16:35 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, Magnus Karlsson, bpf, netdev,
	David S. Miller, Saeed Mahameed, Jonathan Lemon, Tariq Toukan,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jakub Kicinski,
	Maciej Fijalkowski

On Mon, May 06, 2019 at 01:45:40PM +0000, Maxim Mikityanskiy wrote:
> On 2019-05-04 20:25, Björn Töpel wrote:
> > On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
> >>
> >> Make it possible for the application to determine whether the AF_XDP
> >> socket is running in zero-copy mode. To achieve this, add a new
> >> getsockopt option XDP_OPTIONS that returns flags. The only flag
> >> supported for now is the zero-copy mode indicator.
> >>
> >> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> >> Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
> >> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> >> ---
> >>   include/uapi/linux/if_xdp.h       |  7 +++++++
> >>   net/xdp/xsk.c                     | 22 ++++++++++++++++++++++
> >>   tools/include/uapi/linux/if_xdp.h |  7 +++++++
> >>   3 files changed, 36 insertions(+)
> >>
> >> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> >> index caed8b1614ff..9ae4b4e08b68 100644
> >> --- a/include/uapi/linux/if_xdp.h
> >> +++ b/include/uapi/linux/if_xdp.h
> >> @@ -46,6 +46,7 @@ struct xdp_mmap_offsets {
> >>   #define XDP_UMEM_FILL_RING             5
> >>   #define XDP_UMEM_COMPLETION_RING       6
> >>   #define XDP_STATISTICS                 7
> >> +#define XDP_OPTIONS                    8
> >>
> >>   struct xdp_umem_reg {
> >>          __u64 addr; /* Start of packet data area */
> >> @@ -60,6 +61,12 @@ struct xdp_statistics {
> >>          __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */
> >>   };
> >>
> >> +struct xdp_options {
> >> +       __u32 flags;
> >> +};
> >> +
> >> +#define XDP_OPTIONS_FLAG_ZEROCOPY (1 << 0)
> > 
> > Nit: The other flags doesn't use "FLAG" in its name, but that doesn't
> > really matter.
> > 
> >> +
> >>   /* Pgoff for mmaping the rings */
> >>   #define XDP_PGOFF_RX_RING                        0
> >>   #define XDP_PGOFF_TX_RING               0x80000000
> >> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >> index b68a380f50b3..998199109d5c 100644
> >> --- a/net/xdp/xsk.c
> >> +++ b/net/xdp/xsk.c
> >> @@ -650,6 +650,28 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname,
> >>
> >>                  return 0;
> >>          }
> >> +       case XDP_OPTIONS:
> >> +       {
> >> +               struct xdp_options opts;
> >> +
> >> +               if (len < sizeof(opts))
> >> +                       return -EINVAL;
> >> +
> >> +               opts.flags = 0;
> > 
> > Maybe get rid of this, in favor of "opts = {}" if the structure grows?
> 
> I'm OK with any of these options. Should I respin the series, or can I 
> follow up with the change in RCs if the series gets to 5.2?
> 
> Alexei, is it even possible to still make changes to this series? The 
> window appears closed.

The series were not applied.
Please resubmit addressing all feedback when bpf-next reopens.
Likely in ~2 weeks.



* Re: [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic
  2019-05-06 14:23       ` Magnus Karlsson
@ 2019-05-07 14:19         ` Maxim Mikityanskiy
  2019-05-08 13:06           ` Magnus Karlsson
  0 siblings, 1 reply; 26+ messages in thread
From: Maxim Mikityanskiy @ 2019-05-07 14:19 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, Magnus Karlsson, bpf, netdev,
	David S. Miller, Saeed Mahameed, Jonathan Lemon, Tariq Toukan,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jakub Kicinski,
	Maciej Fijalkowski

On 2019-05-06 17:23, Magnus Karlsson wrote:
> On Mon, May 6, 2019 at 3:46 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>>
>> On 2019-05-04 20:26, Björn Töpel wrote:
>>> On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>>>>
>>>> Currently, the drivers that implement AF_XDP zero-copy support (e.g.,
>>>> i40e) switch the channel into a different mode when an XSK is opened. It
>>>> causes some issues that have to be taken into account. For example, RSS
>>>> needs to be reconfigured to skip the XSK-enabled channels, or the XDP
>>>> program should filter out traffic not intended for that socket and
>>>> XDP_PASS it with an additional copy. As nothing validates or forces the
>>>> proper configuration, it's easy to have packet drops when they get
>>>> into an XSK by mistake, and, in fact, it's the default configuration.
>>>> There has to be some tool to have RSS reconfigured on each socket open
>>>> and close event, but such a tool is problematic to implement, because no
>>>> one reports these events, and it's race-prone.
>>>>
>>>> This commit extends XSK to support both kinds of traffic (XSK and
>>>> non-XSK) in the same channel. It implies having two RX queues in
>>>> XSK-enabled channels: one for the regular traffic, and the other for
>>>> XSK. It solves the problem with RSS: the default configuration just
>>>> works without the need to manually reconfigure RSS or to perform some
>>>> possibly complicated filtering in the XDP layer. It makes it easy to run
>>>> both AF_XDP and regular sockets on the same machine. In the XDP program,
>>>> the QID's most significant bit will serve as a flag to indicate whether
>>>> it's the XSK queue or not. The extension is compatible with the legacy
>>>> configuration, so if one wants to run the legacy mode, they can
>>>> reconfigure RSS and ignore the flag in the XDP program (implemented in
>>>> the reference XDP program in libbpf). mlx5e will support this extension.
>>>>
>>>> A single XDP program can run both with drivers supporting or not
>>>> supporting this extension. The xdpsock sample and libbpf are updated
>>>> accordingly.
>>>>
>>>
>>> I'm still not a fan of this, or maybe I'm not following you. It makes
>>> it more complex and even harder to use. Let's take a look at the
>>> kernel nomenclature. "ethtool" uses netdevs and channels. A channel is
>>> a Rx queue or a Tx queue.
>>
>> There are also combined channels that consist of an RX and a TX queue.
>> mlx5e has only this kind of channels. For us, a channel is a set of
>> queues "pinned to a CPU core" (they use the same NAPI).
>>
>>> In AF_XDP we call the channel a queue, which
>>> is what kernel uses internally (netdev_rx_queue, netdev_queue).
>>
>> You seem to agree it's a channel, right?
>>
>> AF_XDP doesn't allow configuring the RX queue number and the TX queue
>> number separately. Basically, you choose a channel in AF_XDP. For some
>> reason, it's referred to as a queue in some places, but logically it
>> means "channel".
> 
> You can configure the Rx queue and the Tx queue separately by creating
> two sockets tied to the same umem area. But if you just create one,
> you are correct.

Yes, I know I can open two sockets, but it's only a workaround. If I 
want to RX on RQ 3 and TX on SQ 5, I'll have one socket bound to RQ 3 
and SQ 3, and the other bound to RQ 5 and SQ 5. It's not a clean way to 
achieve the initial goal. It means that the current implementation 
actually binds a socket to a channel X (that has RQ X and SQ X). It 
could be different if there were two kinds of XSK sockets: RX-only and 
TX-only, but it's not the case.

>>> Today, AF_XDP can attach to an existing queue for ingress. (On the
>>> egress side, we're using "a queue", but the "XDP queue". XDP has these
>>> "shadow queues" which are separated from the netdev. This is a bit
>>> messy, and we can't really configure them. I believe Jakub has some
>>> ideas here. :-) For now, let's leave egress aside.)
>>
>> So, XDP already has "shadow queues" for TX, so I see no problem in
>> having the similar concept for AF_XDP RX.
> 
> The question is if we would like to continue down the path of "shadow
> queues" by adding even more.

OK, I think "shadow queues" is not a valid name, and it's certainly not 
something bad. Initially, the kernel had a concept of RX and TX queues. They 
can match the queues the driver has for regular traffic. Then new 
features appeared (XDP, then AF_XDP), and they required new types of 
queues: XDP TX queue, AF_XDP RX queue. These are absolutely new 
concepts, they aren't interchangeable with the legacy RX and TX queues. 
It means we can't just say we have 32 TX queues, 16 of which are regular 
SQs and 16 are XDP SQs. They function differently: the stack must not 
try sending anything from an XDP SQ, and XDP_REDIRECT must not try 
sending from the regular SQ.

However, the kernel didn't learn about new types of queues. That's why 
the new queues remain "in shadow". And we certainly shouldn't use the 
same numeration for different types of queues, i.e. it's incorrect to 
say that TX queues 0..15 are regular SQs, TX queues 16..31 are XDP SQs, 
etc. The correct way is: there are 16 regular SQs 0..15 and 16 XDP SQs 
0..15.

The same considerations apply to AF_XDP RX queues (XSK RQs). This is a 
whole new type of queue, so it can't be mixed with the regular RQs. 
That's why they have their own numeration and are not accounted as RX 
queues (moreover, their numeration can be non-contiguous). If the kernel 
needed to know about them, they could be exposed to the netdev as a 
third type of queue, but I see no need currently.

> In the busy-poll RFC I sent out last
> week, I talk about the possibility to create a new queue (set) not
> tied to the napi of the regular Rx queues in order to get better
> performance when using busy-poll. How would such a queue set fit into
> a shadow queue set approach?

Looking at the question in the RFC, I don't see in what way it 
interferes with XSK RQs. Moreover, it's even more convenient to have a 
separate type of queue - you'll be able to create and destroy "unbound" 
(not driven by an interrupt, having a separate NAPI) XSK RQs without 
interfering with the regular queues. At first sight, I would say it's a 
perfect fit :). For me, it looks cleaner when we have regular RQs in 
channels, XSK RQs in channels and "unbound" XSK RQs than having channels 
that switch the RQ type and having some extra out-of-channel RQs that 
behave differently, but look the same to the kernel.

> When does hiding the real queues created
> to support various features break and we have to expose the real queue
> number? Trying to wrap my head around these questions.

As I said above, different types of queues should not share the 
numeration. So, if need be, they can be exposed as a different type, but 
I don't see the necessity yet.

> Maxim, would it be possible for you to respin this set without this
> feature?

That's not as easy as you may think... For the kernel it's just a 
feature, but for the driver it's part of the core design of AF_XDP 
support in mlx5. Removing it would require a lot of changes, reviewing 
and testing to adapt the driver to a different approach. I'd 
prefer to come to the conclusion before thinking of such changes :)

Thanks for giving the feedback and joining this discussion!

> I like the other stuff you have implemented and think that
> the rest of the common functionality should be useful for all of us.
> This way you can get the AF_XDP support accepted quicker while we
> debate the best way to solve the issue in this thread.
> 
> Thanks for all your work: Magnus
> 
>>> If an application would like to get all the traffic from a netdev,
>>> it'll create as many sockets as there are queues and bind to the
>>> queues. Yes, even the queues in the RSS set.
>>>
>>> What you would like (I think):
>>> a) is a way of spawning a new queue for a netdev, that is not part of
>>> the stack and/or RSS set
>>
>> Yes - for simplicity's sake and to make configuration easier. The only
>> thing needed is to steer the traffic and to open an AF_XDP socket on
>> channel X. We don't need to care about removing the queue from RSS, or
>> about finding a way to administer this (which is hard because it's racy
>> if the configuration is not known in advance). So I don't agree that I'm
>> complicating things; my goal is to make them easier.
>>
>>> b) steering traffic to that queue using a configuration mechanism (tc?
>>> some yet to be hacked BPF configuration hook?)
>>
>> Currently, ethtool --config-ntuple is used to steer the traffic. The
>> user-def argument has a bit that selects XSK RQ/regular RQ, and action
>> selects a channel:
>>
>> ethtool -N eth0 flow-type udp4 dst-port 4242 action 3 user-def 1
>>
>>> With your mechanism you're doing this in contrived way. This makes the
>>> existing AF_XDP model *more* complex/hard(er) to use.
>>
>> No, as I said above, some issues are eliminated with my approach, and no
>> new limitations are introduced, so it makes things more universal and
>> simpler to configure.
>>
>>> How do you steer traffic to this dual-channel RQ?
>>
>> First, there is no dual-channel RQ; a more accurate term is dual-RQ
>> channel, because the channel now contains a regular RQ and can also
>> contain an XSK RQ.
>>
>> For the steering itself, see the ethtool command above - the user-def
>> argument has a bit that selects one of two RQs.
>>
>>> So you have a netdev
>>> receiving on all queues. Then, e.g., the last queue is a "dual
>>> channel" queue that can receive traffic from some other filter. How do
>>> you use it?
>>
>> If I want to take the last (or some) channel and start using AF_XDP with
>> it, I simply configure steering to the XSK RQ of that channel and open a
>> socket specifying the channel number. I don't need to reconfigure RSS,
>> because RSS packets go to the regular RQ of that channel and don't
>> interfere with XSK.
>>
>> No functionality is lost - if you don't distinguish the regular and XSK
>> RQs on the XDP level, you'll get the same effect as with i40e's
>> implementation. If you want to dedicate the CPU core and channel solely
>> for AF_XDP, in i40e you exclude the channel from RSS, and here you can
>> do exactly the same thing. So, no use case becomes more complicated
>> compared to i40e, but there are use cases where this feature is an
>> advantage.
>>
>> I hope I explained the points you were interested in. Please ask more
>> questions if there is still something that I should clarify regarding
>> this topic.
>>
>> Thanks,
>> Max
>>
>>>
>>>
>>> Björn
>>>
>>>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>>>> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
>>>> ---
>>>>    include/uapi/linux/if_xdp.h       |  11 +++
>>>>    net/xdp/xsk.c                     |   5 +-
>>>>    samples/bpf/xdpsock_user.c        |  10 ++-
>>>>    tools/include/uapi/linux/if_xdp.h |  11 +++
>>>>    tools/lib/bpf/xsk.c               | 116 ++++++++++++++++++++++--------
>>>>    tools/lib/bpf/xsk.h               |   4 ++
>>>>    6 files changed, 126 insertions(+), 31 deletions(-)
>>>>
>>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>>>> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
>>>> --- a/include/uapi/linux/if_xdp.h
>>>> +++ b/include/uapi/linux/if_xdp.h
>>>> @@ -82,4 +82,15 @@ struct xdp_desc {
>>>>
>>>>    /* UMEM descriptor is __u64 */
>>>>
>>>> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
>>>> + * this flag bit in the queue index to distinguish between two RQs of the same
>>>> + * channel.
>>>> + */
>>>> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
>>>> +
>>>> +static inline __u32 xdp_qid_get_channel(__u32 qid)
>>>> +{
>>>> +       return qid & ~XDP_QID_FLAG_XSKRQ;
>>>> +}
>>>> +
>>>>    #endif /* _LINUX_IF_XDP_H */
>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>> index 998199109d5c..114ba17acb09 100644
>>>> --- a/net/xdp/xsk.c
>>>> +++ b/net/xdp/xsk.c
>>>> @@ -104,9 +104,12 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>>>>
>>>>    int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>>>>    {
>>>> +       struct xdp_rxq_info *rxq = xdp->rxq;
>>>> +       u32 channel = xdp_qid_get_channel(rxq->queue_index);
>>>>           u32 len;
>>>>
>>>> -       if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
>>>> +       if (xs->dev != rxq->dev || xs->queue_id != channel ||
>>>> +           xs->zc != (rxq->mem.type == MEM_TYPE_ZERO_COPY))
>>>>                   return -EINVAL;
>>>>
>>>>           len = xdp->data_end - xdp->data;
>>>> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
>>>> index d08ee1ab7bb4..a6b13025ee79 100644
>>>> --- a/samples/bpf/xdpsock_user.c
>>>> +++ b/samples/bpf/xdpsock_user.c
>>>> @@ -62,6 +62,7 @@ enum benchmark_type {
>>>>
>>>>    static enum benchmark_type opt_bench = BENCH_RXDROP;
>>>>    static u32 opt_xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
>>>> +static u32 opt_libbpf_flags;
>>>>    static const char *opt_if = "";
>>>>    static int opt_ifindex;
>>>>    static int opt_queue;
>>>> @@ -306,7 +307,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
>>>>           xsk->umem = umem;
>>>>           cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
>>>>           cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
>>>> -       cfg.libbpf_flags = 0;
>>>> +       cfg.libbpf_flags = opt_libbpf_flags;
>>>>           cfg.xdp_flags = opt_xdp_flags;
>>>>           cfg.bind_flags = opt_xdp_bind_flags;
>>>>           ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
>>>> @@ -346,6 +347,7 @@ static struct option long_options[] = {
>>>>           {"interval", required_argument, 0, 'n'},
>>>>           {"zero-copy", no_argument, 0, 'z'},
>>>>           {"copy", no_argument, 0, 'c'},
>>>> +       {"combined", no_argument, 0, 'C'},
>>>>           {0, 0, 0, 0}
>>>>    };
>>>>
>>>> @@ -365,6 +367,7 @@ static void usage(const char *prog)
>>>>                   "  -n, --interval=n     Specify statistics update interval (default 1 sec).\n"
>>>>                   "  -z, --zero-copy      Force zero-copy mode.\n"
>>>>                   "  -c, --copy           Force copy mode.\n"
>>>> +               "  -C, --combined       Driver supports combined XSK and non-XSK traffic in a channel.\n"
>>>>                   "\n";
>>>>           fprintf(stderr, str, prog);
>>>>           exit(EXIT_FAILURE);
>>>> @@ -377,7 +380,7 @@ static void parse_command_line(int argc, char **argv)
>>>>           opterr = 0;
>>>>
>>>>           for (;;) {
>>>> -               c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
>>>> +               c = getopt_long(argc, argv, "Frtli:q:psSNn:czC", long_options,
>>>>                                   &option_index);
>>>>                   if (c == -1)
>>>>                           break;
>>>> @@ -420,6 +423,9 @@ static void parse_command_line(int argc, char **argv)
>>>>                   case 'F':
>>>>                           opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
>>>>                           break;
>>>> +               case 'C':
>>>> +                       opt_libbpf_flags |= XSK_LIBBPF_FLAGS__COMBINED_CHANNELS;
>>>> +                       break;
>>>>                   default:
>>>>                           usage(basename(argv[0]));
>>>>                   }
>>>> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
>>>> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
>>>> --- a/tools/include/uapi/linux/if_xdp.h
>>>> +++ b/tools/include/uapi/linux/if_xdp.h
>>>> @@ -82,4 +82,15 @@ struct xdp_desc {
>>>>
>>>>    /* UMEM descriptor is __u64 */
>>>>
>>>> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
>>>> + * this flag bit in the queue index to distinguish between two RQs of the same
>>>> + * channel.
>>>> + */
>>>> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
>>>> +
>>>> +static inline __u32 xdp_qid_get_channel(__u32 qid)
>>>> +{
>>>> +       return qid & ~XDP_QID_FLAG_XSKRQ;
>>>> +}
>>>> +
>>>>    #endif /* _LINUX_IF_XDP_H */
>>>> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
>>>> index a95b06d1f81d..969dfd856039 100644
>>>> --- a/tools/lib/bpf/xsk.c
>>>> +++ b/tools/lib/bpf/xsk.c
>>>> @@ -76,6 +76,12 @@ struct xsk_nl_info {
>>>>           int fd;
>>>>    };
>>>>
>>>> +enum qidconf {
>>>> +       QIDCONF_REGULAR,
>>>> +       QIDCONF_XSK,
>>>> +       QIDCONF_XSK_COMBINED,
>>>> +};
>>>> +
>>>>    /* For 32-bit systems, we need to use mmap2 as the offsets are 64-bit.
>>>>     * Unfortunately, it is not part of glibc.
>>>>     */
>>>> @@ -139,7 +145,7 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
>>>>                   return 0;
>>>>           }
>>>>
>>>> -       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)
>>>> +       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS_MASK)
>>>>                   return -EINVAL;
>>>>
>>>>           cfg->rx_size = usr_cfg->rx_size;
>>>> @@ -267,44 +273,93 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
>>>>           /* This is the C-program:
>>>>            * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
>>>>            * {
>>>> -        *     int *qidconf, index = ctx->rx_queue_index;
>>>> +        *     int *qidconf, qc;
>>>> +        *     int index = ctx->rx_queue_index & ~(1 << 31);
>>>> +        *     bool is_xskrq = ctx->rx_queue_index & (1 << 31);
>>>>            *
>>>> -        *     // A set entry here means that the correspnding queue_id
>>>> -        *     // has an active AF_XDP socket bound to it.
>>>> +        *     // A set entry here means that the corresponding queue_id
>>>> +        *     // has an active AF_XDP socket bound to it. Value 2 means
>>>> +        *     // it's zero-copy multi-RQ mode.
>>>>            *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
>>>>            *     if (!qidconf)
>>>>            *         return XDP_ABORTED;
>>>>            *
>>>> -        *     if (*qidconf)
>>>> +        *     qc = *qidconf;
>>>> +        *
>>>> +        *     if (qc == 2)
>>>> +        *         qc = is_xskrq ? 1 : 0;
>>>> +        *
>>>> +        *     switch (qc) {
>>>> +        *     case 0:
>>>> +        *         return XDP_PASS;
>>>> +        *     case 1:
>>>>            *         return bpf_redirect_map(&xsks_map, index, 0);
>>>> +        *     }
>>>>            *
>>>> -        *     return XDP_PASS;
>>>> +        *     return XDP_ABORTED;
>>>>            * }
>>>>            */
>>>>           struct bpf_insn prog[] = {
>>>> -               /* r1 = *(u32 *)(r1 + 16) */
>>>> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
>>>> -               /* *(u32 *)(r10 - 4) = r1 */
>>>> -               BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
>>>> -               BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
>>>> -               BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
>>>> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
>>>> +               /* Load index. */
>>>> +               /* r6 = *(u32 *)(r1 + 16) */
>>>> +               BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_ARG1, 16),
>>>> +               /* w7 = w6 */
>>>> +               BPF_MOV32_REG(BPF_REG_7, BPF_REG_6),
>>>> +               /* w7 &= 2147483647 */
>>>> +               BPF_ALU32_IMM(BPF_AND, BPF_REG_7, ~XDP_QID_FLAG_XSKRQ),
>>>> +               /* *(u32 *)(r10 - 4) = r7 */
>>>> +               BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_REG_7, -4),
>>>> +
>>>> +               /* Call bpf_map_lookup_elem. */
>>>> +               /* r2 = r10 */
>>>> +               BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
>>>> +               /* r2 += -4 */
>>>> +               BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
>>>> +               /* r1 = qidconf_map ll */
>>>> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->qidconf_map_fd),
>>>> +               /* call 1 */
>>>>                   BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>>>> -               BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
>>>> -               BPF_MOV32_IMM(BPF_REG_0, 0),
>>>> -               /* if r1 == 0 goto +8 */
>>>> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
>>>> -               BPF_MOV32_IMM(BPF_REG_0, 2),
>>>> -               /* r1 = *(u32 *)(r1 + 0) */
>>>> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
>>>> -               /* if r1 == 0 goto +5 */
>>>> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
>>>> -               /* r2 = *(u32 *)(r10 - 4) */
>>>> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
>>>> -               BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
>>>> +
>>>> +               /* Check the return value. */
>>>> +               /* if r0 == 0 goto +14 */
>>>> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 14),
>>>> +
>>>> +               /* Check qc == QIDCONF_XSK_COMBINED. */
>>>> +               /* r6 >>= 31 */
>>>> +               BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 31),
>>>> +               /* r1 = *(u32 *)(r0 + 0) */
>>>> +               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
>>>> +               /* if r1 == 2 goto +1 */
>>>> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, QIDCONF_XSK_COMBINED, 1),
>>>> +
>>>> +               /* qc != QIDCONF_XSK_COMBINED */
>>>> +               /* r6 = r1 */
>>>> +               BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
>>>> +
>>>> +               /* switch (qc) */
>>>> +               /* w0 = 2 */
>>>> +               BPF_MOV32_IMM(BPF_REG_0, XDP_PASS),
>>>> +               /* if w6 == 0 goto +8 */
>>>> +               BPF_JMP32_IMM(BPF_JEQ, BPF_REG_6, QIDCONF_REGULAR, 8),
>>>> +               /* if w6 != 1 goto +6 */
>>>> +               BPF_JMP32_IMM(BPF_JNE, BPF_REG_6, QIDCONF_XSK, 6),
>>>> +
>>>> +               /* Call bpf_redirect_map. */
>>>> +               /* r1 = xsks_map ll */
>>>> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->xsks_map_fd),
>>>> +               /* w2 = w7 */
>>>> +               BPF_MOV32_REG(BPF_REG_ARG2, BPF_REG_7),
>>>> +               /* w3 = 0 */
>>>>                   BPF_MOV32_IMM(BPF_REG_3, 0),
>>>> +               /* call 51 */
>>>>                   BPF_EMIT_CALL(BPF_FUNC_redirect_map),
>>>> -               /* The jumps are to this instruction */
>>>> +               /* exit */
>>>> +               BPF_EXIT_INSN(),
>>>> +
>>>> +               /* XDP_ABORTED */
>>>> +               /* w0 = 0 */
>>>> +               BPF_MOV32_IMM(BPF_REG_0, XDP_ABORTED),
>>>> +               /* exit */
>>>>                   BPF_EXIT_INSN(),
>>>>           };
>>>>           size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
>>>> @@ -483,6 +538,7 @@ static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
>>>>
>>>>    static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>>>>    {
>>>> +       int qidconf_value = QIDCONF_XSK;
>>>>           bool prog_attached = false;
>>>>           __u32 prog_id = 0;
>>>>           int err;
>>>> @@ -505,7 +561,11 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>>>>                   xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
>>>>           }
>>>>
>>>> -       err = xsk_update_bpf_maps(xsk, true, xsk->fd);
>>>> +       if (xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
>>>> +               if (xsk->zc)
>>>> +                       qidconf_value = QIDCONF_XSK_COMBINED;
>>>> +
>>>> +       err = xsk_update_bpf_maps(xsk, qidconf_value, xsk->fd);
>>>>           if (err)
>>>>                   goto out_load;
>>>>
>>>> @@ -717,7 +777,7 @@ void xsk_socket__delete(struct xsk_socket *xsk)
>>>>           if (!xsk)
>>>>                   return;
>>>>
>>>> -       (void)xsk_update_bpf_maps(xsk, 0, 0);
>>>> +       (void)xsk_update_bpf_maps(xsk, QIDCONF_REGULAR, 0);
>>>>
>>>>           optlen = sizeof(off);
>>>>           err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
>>>> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
>>>> index 82ea71a0f3ec..be26a2423c04 100644
>>>> --- a/tools/lib/bpf/xsk.h
>>>> +++ b/tools/lib/bpf/xsk.h
>>>> @@ -180,6 +180,10 @@ struct xsk_umem_config {
>>>>
>>>>    /* Flags for the libbpf_flags field. */
>>>>    #define XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD (1 << 0)
>>>> +#define XSK_LIBBPF_FLAGS__COMBINED_CHANNELS (1 << 1)
>>>> +#define XSK_LIBBPF_FLAGS_MASK ( \
>>>> +       XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD | \
>>>> +       XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
>>>>
>>>>    struct xsk_socket_config {
>>>>           __u32 rx_size;
>>>> --
>>>> 2.19.1
>>>>
>>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic
  2019-05-07 14:19         ` Maxim Mikityanskiy
@ 2019-05-08 13:06           ` Magnus Karlsson
  2019-05-13 14:52             ` Maxim Mikityanskiy
  0 siblings, 1 reply; 26+ messages in thread
From: Magnus Karlsson @ 2019-05-08 13:06 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, Magnus Karlsson, bpf, netdev,
	David S. Miller, Saeed Mahameed, Jonathan Lemon, Tariq Toukan,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jakub Kicinski,
	Maciej Fijalkowski

On Tue, May 7, 2019 at 4:19 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>
> On 2019-05-06 17:23, Magnus Karlsson wrote:
> > On Mon, May 6, 2019 at 3:46 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
> >>
> >> On 2019-05-04 20:26, Björn Töpel wrote:
> >>> On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
> >>>>
> >>>> Currently, the drivers that implement AF_XDP zero-copy support (e.g.,
> >>>> i40e) switch the channel into a different mode when an XSK is opened. It
> >>>> causes some issues that have to be taken into account. For example, RSS
> >>>> needs to be reconfigured to skip the XSK-enabled channels, or the XDP
> >>>> program should filter out traffic not intended for that socket and
> >>>> XDP_PASS it with an additional copy. As nothing validates or forces the
> >>>> proper configuration, it's easy to have packet drops when packets get
> >>>> into an XSK by mistake, and, in fact, that's the default configuration.
> >>>> There has to be some tool to have RSS reconfigured on each socket open
> >>>> and close event, but such a tool is problematic to implement, because no
> >>>> one reports these events, and it's race-prone.
> >>>>
> >>>> This commit extends XSK to support both kinds of traffic (XSK and
> >>>> non-XSK) in the same channel. It implies having two RX queues in
> >>>> XSK-enabled channels: one for the regular traffic, and the other for
> >>>> XSK. It solves the problem with RSS: the default configuration just
> >>>> works without the need to manually reconfigure RSS or to perform some
> >>>> possibly complicated filtering in the XDP layer. It makes it easy to run
> >>>> both AF_XDP and regular sockets on the same machine. In the XDP program,
> >>>> the QID's most significant bit will serve as a flag to indicate whether
> >>>> it's the XSK queue or not. The extension is compatible with the legacy
> >>>> configuration, so if one wants to run the legacy mode, they can
> >>>> reconfigure RSS and ignore the flag in the XDP program (implemented in
> >>>> the reference XDP program in libbpf). mlx5e will support this extension.
> >>>>
> >>>> A single XDP program can run both with drivers supporting or not
> >>>> supporting this extension. The xdpsock sample and libbpf are updated
> >>>> accordingly.
> >>>>
> >>>
> >>> I'm still not a fan of this, or maybe I'm not following you. It makes
> >>> it more complex and even harder to use. Let's take a look at the
> >>> kernel nomenclature. "ethtool" uses netdevs and channels. A channel is
> >>> an Rx queue or a Tx queue.
> >>
> >> There are also combined channels that consist of an RX and a TX queue.
> >> mlx5e has only this kind of channels. For us, a channel is a set of
> >> queues "pinned to a CPU core" (they use the same NAPI).
> >>
> >>> In AF_XDP we call the channel a queue, which
> >>> is what the kernel uses internally (netdev_rx_queue, netdev_queue).
> >>
> >> You seem to agree it's a channel, right?
> >>
> >> AF_XDP doesn't allow configuring the RX queue number and TX queue number
> >> separately. Basically, you choose a channel in AF_XDP. For some reason,
> >> it's referred to as a queue in some places, but logically it means "channel".
> >
> > You can configure the Rx queue and the Tx queue separately by creating
> > two sockets tied to the same umem area. But if you just create one,
> > you are correct.
>
> Yes, I know I can open two sockets, but it's only a workaround. If I
> want to RX on RQ 3 and TX on SQ 5, I'll have one socket bound to RQ 3
> and SQ 3, and the other bound to RQ 5 and SQ 5. It's not a clean way to
> achieve the initial goal. It means that the current implementation
> actually binds a socket to a channel X (that has RQ X and SQ X). It
> could be different if there were two kinds of XSK sockets: RX-only and
> TX-only, but it's not the case.

It is possible to create Rx-only or Tx-only AF_XDP sockets from the
uapi by creating only an Rx or a Tx ring. However, the NDO that
registers the umem in the driver would need to be extended with this
information so that the driver could act upon it. Today it is assumed
that it is always Rx and Tx, which wastes resources and possibly
performance as well. This has been on my todo list for quite some
time, but it has never floated up to the top.

> >>> Today, AF_XDP can attach to an existing queue for ingress. (On the
> >>> egress side, we're using "a queue", but the "XDP queue". XDP has these
> >>> "shadow queues" which are separated from the netdev. This is a bit
> >>> messy, and we can't really configure them. I believe Jakub has some
> >>> ideas here. :-) For now, let's leave egress aside.)
> >>
> >> So, XDP already has "shadow queues" for TX, so I see no problem in
> >> having the similar concept for AF_XDP RX.
> >
> > The question is if we would like to continue down the path of "shadow
> > queues" by adding even more.
>
> OK, I think "shadow queues" is not a valid name, and it's certainly not
> something bad. Initially, the kernel had a concept of RX and TX queues.
> They can match the queues the driver has for regular traffic. Then new
> features appeared (XDP, then AF_XDP), and they required new types of
> queues: XDP TX queue, AF_XDP RX queue. These are absolutely new
> concepts, they aren't interchangeable with the legacy RX and TX queues.
> It means we can't just say we have 32 TX queues, 16 of which are regular
> SQs and 16 are XDP SQs. They function differently: the stack must not
> try sending anything from an XDP SQ, and XDP_REDIRECT must not try
> sending from the regular SQ.
>
> However, the kernel didn't learn about new types of queues. That's why
> the new queues remain "in shadow". And we certainly shouldn't use the
> same numeration for different types of queues, i.e. it's incorrect to
> say that TX queues 0..15 are regular SQs, TX queues 16..31 are XDP SQs,
> etc. The correct way is: there are 16 regular SQs 0..15 and 16 XDP SQs
> 0..15.
>
> The same considerations apply to AF_XDP RX queues (XSK RQs). This is a
> whole new type of queue, so it can't be mixed with the regular RQs.
> That's why they have their own numeration and are not accounted as RX
> queues (moreover, their numeration can be non-contiguous). If the kernel
> needed to know about them, they could be exposed to the netdev as a
> third type of queue, but I see no need for that currently.
>
> > In the busy-poll RFC I sent out last
> > week, I talk about the possibility to create a new queue (set) not
> > tied to the napi of the regular Rx queues in order to get better
> > performance when using busy-poll. How would such a queue set fit into
> > a shadow queue set approach?
>
> Looking at the question in the RFC, I don't see in what way it
> interferes with XSK RQs. Moreover, it's even more convenient to have a
> separate type of queue - you'll be able to create and destroy "unbound"
> (not driven by an interrupt, having a separate NAPI) XSK RQs without
> interfering with the regular queues. At first sight, I would say it's a
> perfect fit :). For me, it looks cleaner when we have regular RQs in
> channels, XSK RQs in channels and "unbound" XSK RQs than having channels
> that switch the RQ type and having some extra out-of-channel RQs that
> behave differently, but look the same to the kernel.

Agree with you that they are not interchangeable and that they
should be explicit. So we are on the same page here. Good. But
let us try to solve the Tx problem first because it has been
there from day one for both XDP and AF_XDP. Jakub Kicinski has
some ideas in this direction that he presented at Linux Kernel
Developers' bpfconf 2019
(http://vger.kernel.org/bpfconf2019.html), where he suggested
starting with Tx first. Let us start a discussion around that. Many
of the concepts developed there should hopefully extend to the Rx
side too.

Note that AF_XDP is today using the Rx queues that XDP operates
on (because it is an extension of XDP) both in the uapi and in
the driver. And these are the regular netdev Rx queues. But as we
all agree, we need to change this so that the choice becomes more
flexible and plays nicely with all side-band configuration tools.

> > When does hiding the real queues created
> > to support various features break and we have to expose the real queue
> > number? Trying to wrap my head around these questions.
>
> As I said above, different types of queues should not share the
> numeration. So, if need be, they can be exposed as a different type, but
> I don't see the necessity yet.
>
> > Maxim, would it be possible for you to respin this set without this
> > feature?
>
> That's not as easy as you may think... It may be just a feature from the
> kernel's point of view, but for the driver it's part of the core design
> of AF_XDP support in mlx5. Removing it will require a lot of changes,
> reviewing and testing to adapt the driver to a different approach. I'd
> prefer to come to the conclusion before thinking of such changes :)

I leave it up to you if you would like to wait to submit the next
version of your patch set until the "queue id" problem of both Rx
and Tx has been resolved. I hope that it will be speedy so that
you can get your patch set in, but it might not be. It would have
been better if you had partitioned your patch set from the start
as: basic support -> new uapi/kernel feature -> implementation of
the new feature in your driver. Then we could have acked the
basic support quickly. It is good stuff.

Thanks: Magnus

> Thanks for giving the feedback and joining this discussion!
>
> > I like the other stuff you have implemented and think that
> > the rest of the common functionality should be useful for all of us.
> > This way you can get the AF_XDP support accepted quicker while we
> > debate the best way to solve the issue in this thread.
> >
> > Thanks for all your work: Magnus
> >
> >>> If an application would like to get all the traffic from a netdev,
> >>> it'll create as many sockets as there are queues and bind to the
> >>> queues. Yes, even the queues in the RSS set.
> >>>
> >>> What you would like (I think):
> >>> a) is a way of spawning a new queue for a netdev, that is not part of
> >>> the stack and/or RSS set
> >>
> >> Yes - for the simplicity sake and to make configuration easier. The only
> >> thing needed is to steer the traffic and to open an AF_XDP socket on
> >> channel X. We don't need to care about removing the queue out of RSS,
> >> about finding a way to administer this (which is hard because it's racy
> >> if the configuration in not known in advance). So I don't agree I'm
> >> complicating things, my goal is to make them easier.
> >>
> >>> b) steering traffic to that queue using a configuration mechanism (tc?
> >>> some yet to be hacked BPF configuration hook?)
> >>
> >> Currently, ethtool --config-ntuple is used to steer the traffic. The
> >> user-def argument has a bit that selects XSK RQ/regular RQ, and action
> >> selects a channel:
> >>
> >> ethtool -N eth0 flow-type udp4 dst-port 4242 action 3 user-def 1
> >>
> >>> With your mechanism you're doing this in contrived way. This makes the
> >>> existing AF_XDP model *more* complex/hard(er) to use.
> >>
> >> No, as I said above, some issues are eliminated with my approach, and no
> >> new limitations are introduced, so it makes things more universal and
> >> simpler to configure.
> >>
> >>> How do you steer traffic to this dual-channel RQ?
> >>
> >> First, there is no dual-channel RQ; a more accurate term is dual-RQ
> >> channel, because now the channel contains a regular RQ and can contain an
> >> XSK RQ.
> >>
> >> For the steering itself, see the ethtool command above - the user-def
> >> argument has a bit that selects one of two RQs.
> >>
> >>> So you have a netdev
> >>> receiving on all queues. Then, e.g., the last queue is a "dual
> >>> channel" queue that can receive traffic from some other filter. How do
> >>> you use it?
> >>
> >> If I want to take the last (or some) channel and start using AF_XDP with
> >> it, I simply configure steering to the XSK RQ of that channel and open a
> >> socket specifying the channel number. I don't need to reconfigure RSS,
> >> because RSS packets go to the regular RQ of that channel and don't
> >> interfere with XSK.
> >>
> >> No functionality is lost - if you don't distinguish the regular and XSK
> >> RQs on the XDP level, you'll get the same effect as with i40e's
> >> implementation. If you want to dedicate the CPU core and channel solely
> >> for AF_XDP, in i40e you exclude the channel from RSS, and here you can
> >> do exactly the same thing. So, no use case is more complicated compared to
> >> i40e, but there are use cases where this feature is an advantage.
> >>
> >> I hope I explained the points you were interested in. Please ask more
> >> questions if there is still something that I should clarify regarding
> >> this topic.
> >>
> >> Thanks,
> >> Max
> >>
> >>>
> >>>
> >>> Björn
> >>>
> >>>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
> >>>> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> >>>> ---
> >>>>    include/uapi/linux/if_xdp.h       |  11 +++
> >>>>    net/xdp/xsk.c                     |   5 +-
> >>>>    samples/bpf/xdpsock_user.c        |  10 ++-
> >>>>    tools/include/uapi/linux/if_xdp.h |  11 +++
> >>>>    tools/lib/bpf/xsk.c               | 116 ++++++++++++++++++++++--------
> >>>>    tools/lib/bpf/xsk.h               |   4 ++
> >>>>    6 files changed, 126 insertions(+), 31 deletions(-)
> >>>>
> >>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> >>>> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
> >>>> --- a/include/uapi/linux/if_xdp.h
> >>>> +++ b/include/uapi/linux/if_xdp.h
> >>>> @@ -82,4 +82,15 @@ struct xdp_desc {
> >>>>
> >>>>    /* UMEM descriptor is __u64 */
> >>>>
> >>>> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
> >>>> + * this flag bit in the queue index to distinguish between two RQs of the same
> >>>> + * channel.
> >>>> + */
> >>>> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
> >>>> +
> >>>> +static inline __u32 xdp_qid_get_channel(__u32 qid)
> >>>> +{
> >>>> +       return qid & ~XDP_QID_FLAG_XSKRQ;
> >>>> +}
> >>>> +
> >>>>    #endif /* _LINUX_IF_XDP_H */
> >>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> >>>> index 998199109d5c..114ba17acb09 100644
> >>>> --- a/net/xdp/xsk.c
> >>>> +++ b/net/xdp/xsk.c
> >>>> @@ -104,9 +104,12 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
> >>>>
> >>>>    int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> >>>>    {
> >>>> +       struct xdp_rxq_info *rxq = xdp->rxq;
> >>>> +       u32 channel = xdp_qid_get_channel(rxq->queue_index);
> >>>>           u32 len;
> >>>>
> >>>> -       if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
> >>>> +       if (xs->dev != rxq->dev || xs->queue_id != channel ||
> >>>> +           xs->zc != (rxq->mem.type == MEM_TYPE_ZERO_COPY))
> >>>>                   return -EINVAL;
> >>>>
> >>>>           len = xdp->data_end - xdp->data;
> >>>> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> >>>> index d08ee1ab7bb4..a6b13025ee79 100644
> >>>> --- a/samples/bpf/xdpsock_user.c
> >>>> +++ b/samples/bpf/xdpsock_user.c
> >>>> @@ -62,6 +62,7 @@ enum benchmark_type {
> >>>>
> >>>>    static enum benchmark_type opt_bench = BENCH_RXDROP;
> >>>>    static u32 opt_xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
> >>>> +static u32 opt_libbpf_flags;
> >>>>    static const char *opt_if = "";
> >>>>    static int opt_ifindex;
> >>>>    static int opt_queue;
> >>>> @@ -306,7 +307,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
> >>>>           xsk->umem = umem;
> >>>>           cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
> >>>>           cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
> >>>> -       cfg.libbpf_flags = 0;
> >>>> +       cfg.libbpf_flags = opt_libbpf_flags;
> >>>>           cfg.xdp_flags = opt_xdp_flags;
> >>>>           cfg.bind_flags = opt_xdp_bind_flags;
> >>>>           ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
> >>>> @@ -346,6 +347,7 @@ static struct option long_options[] = {
> >>>>           {"interval", required_argument, 0, 'n'},
> >>>>           {"zero-copy", no_argument, 0, 'z'},
> >>>>           {"copy", no_argument, 0, 'c'},
> >>>> +       {"combined", no_argument, 0, 'C'},
> >>>>           {0, 0, 0, 0}
> >>>>    };
> >>>>
> >>>> @@ -365,6 +367,7 @@ static void usage(const char *prog)
> >>>>                   "  -n, --interval=n     Specify statistics update interval (default 1 sec).\n"
> >>>>                   "  -z, --zero-copy      Force zero-copy mode.\n"
> >>>>                   "  -c, --copy           Force copy mode.\n"
> >>>> +               "  -C, --combined       Driver supports combined XSK and non-XSK traffic in a channel.\n"
> >>>>                   "\n";
> >>>>           fprintf(stderr, str, prog);
> >>>>           exit(EXIT_FAILURE);
> >>>> @@ -377,7 +380,7 @@ static void parse_command_line(int argc, char **argv)
> >>>>           opterr = 0;
> >>>>
> >>>>           for (;;) {
> >>>> -               c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
> >>>> +               c = getopt_long(argc, argv, "Frtli:q:psSNn:czC", long_options,
> >>>>                                   &option_index);
> >>>>                   if (c == -1)
> >>>>                           break;
> >>>> @@ -420,6 +423,9 @@ static void parse_command_line(int argc, char **argv)
> >>>>                   case 'F':
> >>>>                           opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
> >>>>                           break;
> >>>> +               case 'C':
> >>>> +                       opt_libbpf_flags |= XSK_LIBBPF_FLAGS__COMBINED_CHANNELS;
> >>>> +                       break;
> >>>>                   default:
> >>>>                           usage(basename(argv[0]));
> >>>>                   }
> >>>> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> >>>> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
> >>>> --- a/tools/include/uapi/linux/if_xdp.h
> >>>> +++ b/tools/include/uapi/linux/if_xdp.h
> >>>> @@ -82,4 +82,15 @@ struct xdp_desc {
> >>>>
> >>>>    /* UMEM descriptor is __u64 */
> >>>>
> >>>> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
> >>>> + * this flag bit in the queue index to distinguish between two RQs of the same
> >>>> + * channel.
> >>>> + */
> >>>> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
> >>>> +
> >>>> +static inline __u32 xdp_qid_get_channel(__u32 qid)
> >>>> +{
> >>>> +       return qid & ~XDP_QID_FLAG_XSKRQ;
> >>>> +}
> >>>> +
> >>>>    #endif /* _LINUX_IF_XDP_H */
> >>>> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
> >>>> index a95b06d1f81d..969dfd856039 100644
> >>>> --- a/tools/lib/bpf/xsk.c
> >>>> +++ b/tools/lib/bpf/xsk.c
> >>>> @@ -76,6 +76,12 @@ struct xsk_nl_info {
> >>>>           int fd;
> >>>>    };
> >>>>
> >>>> +enum qidconf {
> >>>> +       QIDCONF_REGULAR,
> >>>> +       QIDCONF_XSK,
> >>>> +       QIDCONF_XSK_COMBINED,
> >>>> +};
> >>>> +
> >>>>    /* For 32-bit systems, we need to use mmap2 as the offsets are 64-bit.
> >>>>     * Unfortunately, it is not part of glibc.
> >>>>     */
> >>>> @@ -139,7 +145,7 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
> >>>>                   return 0;
> >>>>           }
> >>>>
> >>>> -       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)
> >>>> +       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS_MASK)
> >>>>                   return -EINVAL;
> >>>>
> >>>>           cfg->rx_size = usr_cfg->rx_size;
> >>>> @@ -267,44 +273,93 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
> >>>>           /* This is the C-program:
> >>>>            * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
> >>>>            * {
> >>>> -        *     int *qidconf, index = ctx->rx_queue_index;
> >>>> +        *     int *qidconf, qc;
> >>>> +        *     int index = ctx->rx_queue_index & ~(1 << 31);
> >>>> +        *     bool is_xskrq = ctx->rx_queue_index & (1 << 31);
> >>>>            *
> >>>> -        *     // A set entry here means that the correspnding queue_id
> >>>> -        *     // has an active AF_XDP socket bound to it.
> >>>> +        *     // A set entry here means that the corresponding queue_id
> >>>> +        *     // has an active AF_XDP socket bound to it. Value 2 means
> >>>> +        *     // it's zero-copy multi-RQ mode.
> >>>>            *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
> >>>>            *     if (!qidconf)
> >>>>            *         return XDP_ABORTED;
> >>>>            *
> >>>> -        *     if (*qidconf)
> >>>> +        *     qc = *qidconf;
> >>>> +        *
> >>>> +        *     if (qc == 2)
> >>>> +        *         qc = is_xskrq ? 1 : 0;
> >>>> +        *
> >>>> +        *     switch (qc) {
> >>>> +        *     case 0:
> >>>> +        *         return XDP_PASS;
> >>>> +        *     case 1:
> >>>>            *         return bpf_redirect_map(&xsks_map, index, 0);
> >>>> +        *     }
> >>>>            *
> >>>> -        *     return XDP_PASS;
> >>>> +        *     return XDP_ABORTED;
> >>>>            * }
> >>>>            */
> >>>>           struct bpf_insn prog[] = {
> >>>> -               /* r1 = *(u32 *)(r1 + 16) */
> >>>> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
> >>>> -               /* *(u32 *)(r10 - 4) = r1 */
> >>>> -               BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
> >>>> -               BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> >>>> -               BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
> >>>> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
> >>>> +               /* Load index. */
> >>>> +               /* r6 = *(u32 *)(r1 + 16) */
> >>>> +               BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_ARG1, 16),
> >>>> +               /* w7 = w6 */
> >>>> +               BPF_MOV32_REG(BPF_REG_7, BPF_REG_6),
> >>>> +               /* w7 &= 2147483647 */
> >>>> +               BPF_ALU32_IMM(BPF_AND, BPF_REG_7, ~XDP_QID_FLAG_XSKRQ),
> >>>> +               /* *(u32 *)(r10 - 4) = r7 */
> >>>> +               BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_REG_7, -4),
> >>>> +
> >>>> +               /* Call bpf_map_lookup_elem. */
> >>>> +               /* r2 = r10 */
> >>>> +               BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
> >>>> +               /* r2 += -4 */
> >>>> +               BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
> >>>> +               /* r1 = qidconf_map ll */
> >>>> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->qidconf_map_fd),
> >>>> +               /* call 1 */
> >>>>                   BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
> >>>> -               BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> >>>> -               BPF_MOV32_IMM(BPF_REG_0, 0),
> >>>> -               /* if r1 == 0 goto +8 */
> >>>> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
> >>>> -               BPF_MOV32_IMM(BPF_REG_0, 2),
> >>>> -               /* r1 = *(u32 *)(r1 + 0) */
> >>>> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
> >>>> -               /* if r1 == 0 goto +5 */
> >>>> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
> >>>> -               /* r2 = *(u32 *)(r10 - 4) */
> >>>> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
> >>>> -               BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
> >>>> +
> >>>> +               /* Check the return value. */
> >>>> +               /* if r0 == 0 goto +14 */
> >>>> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 14),
> >>>> +
> >>>> +               /* Check qc == QIDCONF_XSK_COMBINED. */
> >>>> +               /* r6 >>= 31 */
> >>>> +               BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 31),
> >>>> +               /* r1 = *(u32 *)(r0 + 0) */
> >>>> +               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
> >>>> +               /* if r1 == 2 goto +1 */
> >>>> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, QIDCONF_XSK_COMBINED, 1),
> >>>> +
> >>>> +               /* qc != QIDCONF_XSK_COMBINED */
> >>>> +               /* r6 = r1 */
> >>>> +               BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
> >>>> +
> >>>> +               /* switch (qc) */
> >>>> +               /* w0 = 2 */
> >>>> +               BPF_MOV32_IMM(BPF_REG_0, XDP_PASS),
> >>>> +               /* if w6 == 0 goto +8 */
> >>>> +               BPF_JMP32_IMM(BPF_JEQ, BPF_REG_6, QIDCONF_REGULAR, 8),
> >>>> +               /* if w6 != 1 goto +6 */
> >>>> +               BPF_JMP32_IMM(BPF_JNE, BPF_REG_6, QIDCONF_XSK, 6),
> >>>> +
> >>>> +               /* Call bpf_redirect_map. */
> >>>> +               /* r1 = xsks_map ll */
> >>>> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->xsks_map_fd),
> >>>> +               /* w2 = w7 */
> >>>> +               BPF_MOV32_REG(BPF_REG_ARG2, BPF_REG_7),
> >>>> +               /* w3 = 0 */
> >>>>                   BPF_MOV32_IMM(BPF_REG_3, 0),
> >>>> +               /* call 51 */
> >>>>                   BPF_EMIT_CALL(BPF_FUNC_redirect_map),
> >>>> -               /* The jumps are to this instruction */
> >>>> +               /* exit */
> >>>> +               BPF_EXIT_INSN(),
> >>>> +
> >>>> +               /* XDP_ABORTED */
> >>>> +               /* w0 = 0 */
> >>>> +               BPF_MOV32_IMM(BPF_REG_0, XDP_ABORTED),
> >>>> +               /* exit */
> >>>>                   BPF_EXIT_INSN(),
> >>>>           };
> >>>>           size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
> >>>> @@ -483,6 +538,7 @@ static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
> >>>>
> >>>>    static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
> >>>>    {
> >>>> +       int qidconf_value = QIDCONF_XSK;
> >>>>           bool prog_attached = false;
> >>>>           __u32 prog_id = 0;
> >>>>           int err;
> >>>> @@ -505,7 +561,11 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
> >>>>                   xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
> >>>>           }
> >>>>
> >>>> -       err = xsk_update_bpf_maps(xsk, true, xsk->fd);
> >>>> +       if (xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
> >>>> +               if (xsk->zc)
> >>>> +                       qidconf_value = QIDCONF_XSK_COMBINED;
> >>>> +
> >>>> +       err = xsk_update_bpf_maps(xsk, qidconf_value, xsk->fd);
> >>>>           if (err)
> >>>>                   goto out_load;
> >>>>
> >>>> @@ -717,7 +777,7 @@ void xsk_socket__delete(struct xsk_socket *xsk)
> >>>>           if (!xsk)
> >>>>                   return;
> >>>>
> >>>> -       (void)xsk_update_bpf_maps(xsk, 0, 0);
> >>>> +       (void)xsk_update_bpf_maps(xsk, QIDCONF_REGULAR, 0);
> >>>>
> >>>>           optlen = sizeof(off);
> >>>>           err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
> >>>> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> >>>> index 82ea71a0f3ec..be26a2423c04 100644
> >>>> --- a/tools/lib/bpf/xsk.h
> >>>> +++ b/tools/lib/bpf/xsk.h
> >>>> @@ -180,6 +180,10 @@ struct xsk_umem_config {
> >>>>
> >>>>    /* Flags for the libbpf_flags field. */
> >>>>    #define XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD (1 << 0)
> >>>> +#define XSK_LIBBPF_FLAGS__COMBINED_CHANNELS (1 << 1)
> >>>> +#define XSK_LIBBPF_FLAGS_MASK ( \
> >>>> +       XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD | \
> >>>> +       XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
> >>>>
> >>>>    struct xsk_socket_config {
> >>>>           __u32 rx_size;
> >>>> --
> >>>> 2.19.1
> >>>>
> >>
>
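[Editor's note: the decision logic of the generated XDP program in the patch above can be modeled in ordinary C. This is an illustrative sketch only — the map lookup is replaced by a plain qidconf value, and the MODEL_* names are invented here; it is not the code libbpf emits.]

```c
#include <assert.h>
#include <stdint.h>

/* Values stored in qidconf_map, per the patch. */
enum qidconf {
	QIDCONF_REGULAR,      /* 0: no socket, pass to the stack */
	QIDCONF_XSK,          /* 1: redirect everything to the XSK */
	QIDCONF_XSK_COMBINED, /* 2: zero-copy multi-RQ mode */
};

/* Stand-ins for XDP_ABORTED / XDP_PASS / bpf_redirect_map() outcomes. */
enum xdp_action_model { MODEL_ABORTED, MODEL_PASS, MODEL_REDIRECT };

#define QID_FLAG_XSKRQ (1U << 31)

/* Models the generated program: QIDCONF_XSK_COMBINED resolves to XSK or
 * regular depending on the flag bit in the queue index; anything else
 * falls through to the abort path. */
static int xdp_sock_prog_model(int qc, uint32_t rx_queue_index)
{
	int is_xskrq = !!(rx_queue_index & QID_FLAG_XSKRQ);

	if (qc == QIDCONF_XSK_COMBINED)
		qc = is_xskrq ? QIDCONF_XSK : QIDCONF_REGULAR;

	switch (qc) {
	case QIDCONF_REGULAR:
		return MODEL_PASS;
	case QIDCONF_XSK:
		return MODEL_REDIRECT; /* bpf_redirect_map(&xsks_map, index, 0) */
	}
	return MODEL_ABORTED;
}
```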

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic
  2019-05-08 13:06           ` Magnus Karlsson
@ 2019-05-13 14:52             ` Maxim Mikityanskiy
  0 siblings, 0 replies; 26+ messages in thread
From: Maxim Mikityanskiy @ 2019-05-13 14:52 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, Magnus Karlsson, bpf, netdev,
	David S. Miller, Saeed Mahameed, Jonathan Lemon, Tariq Toukan,
	Martin KaFai Lau, Song Liu, Yonghong Song, Jakub Kicinski,
	Maciej Fijalkowski

On 2019-05-08 16:06, Magnus Karlsson wrote:
> On Tue, May 7, 2019 at 4:19 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>>
>> On 2019-05-06 17:23, Magnus Karlsson wrote:
>>> On Mon, May 6, 2019 at 3:46 PM Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>>>>
>>>> On 2019-05-04 20:26, Björn Töpel wrote:
>>>>> On Tue, 30 Apr 2019 at 20:12, Maxim Mikityanskiy <maximmi@mellanox.com> wrote:
>>>>>>
>>>>>> Currently, the drivers that implement AF_XDP zero-copy support (e.g.,
>>>>>> i40e) switch the channel into a different mode when an XSK is opened. It
>>>>>> causes some issues that have to be taken into account. For example, RSS
>>>>>> needs to be reconfigured to skip the XSK-enabled channels, or the XDP
>>>>>> program should filter out traffic not intended for that socket and
>>>>>> XDP_PASS it with an additional copy. As nothing validates or forces the
>>>>>> proper configuration, it's easy to have packets drops, when they get
>>>>>> into an XSK by mistake, and, in fact, it's the default configuration.
>>>>>> There has to be some tool to have RSS reconfigured on each socket open
>>>>>> and close event, but such a tool is problematic to implement, because no
>>>>>> one reports these events, and it's race-prone.
>>>>>>
>>>>>> This commit extends XSK to support both kinds of traffic (XSK and
>>>>>> non-XSK) in the same channel. It implies having two RX queues in
>>>>>> XSK-enabled channels: one for the regular traffic, and the other for
>>>>>> XSK. It solves the problem with RSS: the default configuration just
>>>>>> works without the need to manually reconfigure RSS or to perform some
>>>>>> possibly complicated filtering in the XDP layer. It makes it easy to run
>>>>>> both AF_XDP and regular sockets on the same machine. In the XDP program,
>>>>>> the QID's most significant bit will serve as a flag to indicate whether
>>>>>> it's the XSK queue or not. The extension is compatible with the legacy
>>>>>> configuration, so if one wants to run the legacy mode, they can
>>>>>> reconfigure RSS and ignore the flag in the XDP program (implemented in
>>>>>> the reference XDP program in libbpf). mlx5e will support this extension.
>>>>>>
>>>>>> A single XDP program can run both with drivers supporting or not
>>>>>> supporting this extension. The xdpsock sample and libbpf are updated
>>>>>> accordingly.
>>>>>>
>>>>>
>>>>> I'm still not a fan of this, or maybe I'm not following you. It makes
>>>>> it more complex and even harder to use. Let's take a look at the
>>>>> kernel nomenclature. "ethtool" uses netdevs and channels. A channel is
>>>>> a Rx queue or a Tx queue.
>>>>
>>>> There are also combined channels that consist of an RX and a TX queue.
>>>> mlx5e has only this kind of channels. For us, a channel is a set of
>>>> queues "pinned to a CPU core" (they use the same NAPI).
>>>>
>>>>> In AF_XDP we call the channel a queue, which
>>>>> is what kernel uses internally (netdev_rx_queue, netdev_queue).
>>>>
>>>> You seem to agree it's a channel, right?
>>>>
> >>>> AF_XDP doesn't allow configuring the RX queue number and TX queue number
> >>>> separately. Basically, you choose a channel in AF_XDP. For some reason,
> >>>> it's referred to as a queue in some places, but logically it means "channel".
>>>
>>> You can configure the Rx queue and the Tx queue separately by creating
>>> two sockets tied to the same umem area. But if you just create one,
>>> you are correct.
>>
>> Yes, I know I can open two sockets, but it's only a workaround. If I
>> want to RX on RQ 3 and TX on SQ 5, I'll have one socket bound to RQ 3
>> and SQ 3, and the other bound to RQ 5 and SQ 5. It's not a clean way to
>> achieve the initial goal. It means that the current implementation
>> actually binds a socket to a channel X (that has RQ X and SQ X). It
>> could be different if there were two kinds of XSK sockets: RX-only and
>> TX-only, but it's not the case.
> 
> It is possible to create Rx only or Tx only AF_XDP sockets from the
> uapi by only creating an Rx or a Tx ring. Though, the NDO that
> registers the umem in the driver would need to be extended with this
> information so that the driver could act upon it. Today it is assumed
> that it is always Rx and Tx which wastes resources and maybe also
> performance. This has been on my todo list for quite some time, but it
> has never floated up to the top.
> 
>>>>> Today, AF_XDP can attach to an existing queue for ingress. (On the
>>>>> egress side, we're using "a queue", but the "XDP queue". XDP has these
>>>>> "shadow queues" which are separated from the netdev. This is a bit
>>>>> messy, and we can't really configure them. I believe Jakub has some
>>>>> ideas here. :-) For now, let's leave egress aside.)
>>>>
>>>> So, XDP already has "shadow queues" for TX, so I see no problem in
> >>>> having a similar concept for AF_XDP RX.
>>>
>>> The question is if we would like to continue down the path of "shadow
>>> queues" by adding even more.
>>
>> OK, I think "shadow queues" is not a valid name, and it's certainly not
>> something bad. Initially, the kernel had a concept of RX and TX queues. They
>> can match the queues the driver has for regular traffic. Then new
>> features appeared (XDP, then AF_XDP), and they required new types of
>> queues: XDP TX queue, AF_XDP RX queue. These are absolutely new
>> concepts, they aren't interchangeable with the legacy RX and TX queues.
>> It means we can't just say we have 32 TX queues, 16 of which are regular
>> SQs and 16 are XDP SQs. They function differently: the stack must not
>> try sending anything from an XDP SQ, and XDP_REDIRECT must not try
>> sending from the regular SQ.
>>
>> However, the kernel didn't learn about new types of queues. That's why
>> the new queues remain "in shadow". And we certainly shouldn't use the
>> same numeration for different types of queues, i.e. it's incorrect to
>> say that TX queues 0..15 are regular SQs, TX queues 16..31 are XDP SQs,
>> etc. The correct way is: there are 16 regular SQs 0..15 and 16 XDP SQs
>> 0..15.
>>
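[Editor's note: the "separate numerations" argument can be made concrete with a small sketch — identify a queue by a (type, index) pair instead of one flat index. The type names here are illustrative, not kernel API.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative queue identity: each queue type gets its own 0-based
 * numeration instead of sharing one flat index space. */
enum queue_type { Q_RX, Q_TX, Q_XDP_TX, Q_XSK_RX };

struct queue_id {
	enum queue_type type;
	uint32_t index;
};

/* Two queues are the same only if both type and index match, so
 * "regular SQ 3" and "XDP SQ 3" never collide. */
static bool queue_id_eq(struct queue_id a, struct queue_id b)
{
	return a.type == b.type && a.index == b.index;
}
```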
>> The same considerations apply to AF_XDP RX queues (XSK RQs). This is a
>> whole new type of queue, so it can't be mixed with the regular RQs.
>> That's why they have their own numeration and are not accounted as RX
>> queues (moreover, their numeration can be non-contiguous). If the kernel
>> had need to know about them, they could be exposed to the netdev as a
>> third type of queue, but I see no need currently.
>>
>>> In the busy-poll RFC I sent out last
>>> week, I talk about the possibility to create a new queue (set) not
>>> tied to the napi of the regular Rx queues in order to get better
>>> performance when using busy-poll. How would such a queue set fit into
>>> a shadow queue set approach?
>>
>> Looking at the question in the RFC, I don't see in what way it
>> interferes with XSK RQs. Moreover, it's even more convenient to have a
>> separate type of queue - you'll be able to create and destroy "unbound"
>> (not driven by an interrupt, having a separate NAPI) XSK RQs without
>> interfering with the regular queues. At first sight, I would say it's a
>> perfect fit :). For me, it looks cleaner when we have regular RQs in
>> channels, XSK RQs in channels and "unbound" XSK RQs than having channels
>> that switch the RQ type and having some extra out-of-channel RQs that
>> behave differently, but look the same to the kernel.
> 
> Agree with you that they are not interchangeable and that they
> should be explicit. So we are on the same page here. Good. But
> let us try to solve the Tx problem first because it has been
> there from day one for both XDP and AF_XDP. Jakub Kicinski has
> some ideas in this direction that he presented at Linux Kernel
> Developers' bpfconf 2019
> (http://vger.kernel.org/bpfconf2019.html) where he suggested to
> start with Tx first. Let us start a discussion around that. Many
> of the concepts developed there should hopefully extend to the Rx
> side too.

Solving the problem with TX first isn't something that helps get my code 
in. I'm not denying the problem - but it's not related to this patchset.

Regarding XSK RQs, currently things work without registering XSK RQs in 
the kernel. If a need for it arises, it can be addressed later, together 
with XDP TX queues. The new feature doesn't conflict with your RFC, and 
if you push the dedicated NAPI context feature, I'll be able to adapt 
my feature easily, which wouldn't be the case without it.

> Note that AF_XDP is today using the Rx queues that XDP operates
> on (because it is an extension of XDP) both in the uapi and in
> the driver. And these are the regular netdev Rx queues. But as we
> all agree on, we need to change this so the choice becomes more
> flexible and plays nicely with all side band configuration tools.
> 
>>> When does hiding the real queues created
>>> to support various features break and we have to expose the real queue
>>> number? Trying to wrap my head around these questions.
>>
>> As I said above, different types of queues should not share the
>> numeration. So, if need be, they can be exposed as a different type, but
>> I don't see the necessity yet.
>>
>>> Maxim, would it be possible for you to respin this set without this
>>> feature?
>>
>> That's not as easy as you may think... It's not just a feature - for the
>> kernel it's just a feature, but for the driver it's part of the core design
>> of AF_XDP support in mlx5. Removing it will require a lot of changes,
>> reviewing and testing to adapt the driver to a different approach. I'd
>> prefer to come to the conclusion before thinking of such changes :)
> 
> I leave it up to you if you would like to wait to submit the next
> version of your patch set until the "queue id" problem of both Rx
> and Tx has been resolved. I hope that it will be speedy so that
> you can get your patch set in, but it might not be. It would have
> been better if you had partitioned your patch set from the start
> as: basic support -> new uapi/kernel feature -> implementation of
> the new feature in your driver.

I'm not sure you got it right. You probably call "basic support" an 
implementation similar to the one in i40e, right? The problem is that 
the new feature doesn't complement "basic support", it replaces it. It's 
impossible to split this series into "basic support" and "new feature" 
parts. Our design is based on the new feature, and to split the series 
means to throw away the whole design, to start from scratch and 
implement "basic support", then remove it immediately with the "new 
feature" patch. It doesn't make sense for me to do it this way.

Thanks,
Max

> Then we could have acked the
> basic support quickly. It is good stuff.
> 
> Thanks: Magnus
> 
>> Thanks for giving the feedback and joining this discussion!
>>
>>> I like the other stuff you have implemented and think that
>>> the rest of the common functionality should be useful for all of us.
>>> This way you can get the AF_XDP support accepted quicker while we
>>> debate the best way to solve the issue in this thread.
>>>
>>> Thanks for all your work: Magnus
>>>
>>>>> If an application would like to get all the traffic from a netdev,
> >>>>> it'll create as many sockets as there are queues and bind to the
> >>>>> queues. Yes, even the queues in the RSS set.
>>>>>
>>>>> What you would like (I think):
>>>>> a) is a way of spawning a new queue for a netdev, that is not part of
>>>>> the stack and/or RSS set
>>>>
> >>>> Yes - for simplicity's sake and to make configuration easier. The only
> >>>> thing needed is to steer the traffic and to open an AF_XDP socket on
> >>>> channel X. We don't need to care about removing the queue from RSS,
> >>>> or about finding a way to administer this (which is hard because it's racy
> >>>> if the configuration is not known in advance). So I don't agree that I'm
> >>>> complicating things; my goal is to make them easier.
>>>>
>>>>> b) steering traffic to that queue using a configuration mechanism (tc?
>>>>> some yet to be hacked BPF configuration hook?)
>>>>
>>>> Currently, ethtool --config-ntuple is used to steer the traffic. The
>>>> user-def argument has a bit that selects XSK RQ/regular RQ, and action
>>>> selects a channel:
>>>>
>>>> ethtool -N eth0 flow-type udp4 dst-port 4242 action 3 user-def 1
>>>>
>>>>> With your mechanism you're doing this in contrived way. This makes the
>>>>> existing AF_XDP model *more* complex/hard(er) to use.
>>>>
>>>> No, as I said above, some issues are eliminated with my approach, and no
>>>> new limitations are introduced, so it makes things more universal and
>>>> simpler to configure.
>>>>
>>>>> How do you steer traffic to this dual-channel RQ?
>>>>
> >>>> First, there is no dual-channel RQ; a more accurate term is dual-RQ
> >>>> channel, because now the channel contains a regular RQ and can contain an
> >>>> XSK RQ.
>>>>
>>>> For the steering itself, see the ethtool command above - the user-def
>>>> argument has a bit that selects one of two RQs.
>>>>
>>>>> So you have a netdev
>>>>> receiving on all queues. Then, e.g., the last queue is a "dual
>>>>> channel" queue that can receive traffic from some other filter. How do
>>>>> you use it?
>>>>
>>>> If I want to take the last (or some) channel and start using AF_XDP with
>>>> it, I simply configure steering to the XSK RQ of that channel and open a
>>>> socket specifying the channel number. I don't need to reconfigure RSS,
>>>> because RSS packets go to the regular RQ of that channel and don't
>>>> interfere with XSK.
>>>>
>>>> No functionality is lost - if you don't distinguish the regular and XSK
>>>> RQs on the XDP level, you'll get the same effect as with i40e's
>>>> implementation. If you want to dedicate the CPU core and channel solely
>>>> for AF_XDP, in i40e you exclude the channel from RSS, and here you can
> >>>> do exactly the same thing. So, no use case is more complicated than with
> >>>> i40e, but there are use cases where this feature is an advantage.
>>>>
>>>> I hope I explained the points you were interested in. Please ask more
>>>> questions if there is still something that I should clarify regarding
>>>> this topic.
>>>>
>>>> Thanks,
>>>> Max
>>>>
>>>>>
>>>>>
>>>>> Björn
>>>>>
>>>>>> Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
>>>>>> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
>>>>>> ---
>>>>>>     include/uapi/linux/if_xdp.h       |  11 +++
>>>>>>     net/xdp/xsk.c                     |   5 +-
>>>>>>     samples/bpf/xdpsock_user.c        |  10 ++-
>>>>>>     tools/include/uapi/linux/if_xdp.h |  11 +++
>>>>>>     tools/lib/bpf/xsk.c               | 116 ++++++++++++++++++++++--------
>>>>>>     tools/lib/bpf/xsk.h               |   4 ++
>>>>>>     6 files changed, 126 insertions(+), 31 deletions(-)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>>>>>> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
>>>>>> --- a/include/uapi/linux/if_xdp.h
>>>>>> +++ b/include/uapi/linux/if_xdp.h
>>>>>> @@ -82,4 +82,15 @@ struct xdp_desc {
>>>>>>
>>>>>>     /* UMEM descriptor is __u64 */
>>>>>>
>>>>>> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
>>>>>> + * this flag bit in the queue index to distinguish between two RQs of the same
>>>>>> + * channel.
>>>>>> + */
>>>>>> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
>>>>>> +
>>>>>> +static inline __u32 xdp_qid_get_channel(__u32 qid)
>>>>>> +{
>>>>>> +       return qid & ~XDP_QID_FLAG_XSKRQ;
>>>>>> +}
>>>>>> +
>>>>>>     #endif /* _LINUX_IF_XDP_H */
>>>>>> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>>>>>> index 998199109d5c..114ba17acb09 100644
>>>>>> --- a/net/xdp/xsk.c
>>>>>> +++ b/net/xdp/xsk.c
>>>>>> @@ -104,9 +104,12 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>>>>>>
>>>>>>     int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>>>>>>     {
>>>>>> +       struct xdp_rxq_info *rxq = xdp->rxq;
>>>>>> +       u32 channel = xdp_qid_get_channel(rxq->queue_index);
>>>>>>            u32 len;
>>>>>>
>>>>>> -       if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
>>>>>> +       if (xs->dev != rxq->dev || xs->queue_id != channel ||
>>>>>> +           xs->zc != (rxq->mem.type == MEM_TYPE_ZERO_COPY))
>>>>>>                    return -EINVAL;
>>>>>>
>>>>>>            len = xdp->data_end - xdp->data;
>>>>>> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
>>>>>> index d08ee1ab7bb4..a6b13025ee79 100644
>>>>>> --- a/samples/bpf/xdpsock_user.c
>>>>>> +++ b/samples/bpf/xdpsock_user.c
>>>>>> @@ -62,6 +62,7 @@ enum benchmark_type {
>>>>>>
>>>>>>     static enum benchmark_type opt_bench = BENCH_RXDROP;
>>>>>>     static u32 opt_xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
>>>>>> +static u32 opt_libbpf_flags;
>>>>>>     static const char *opt_if = "";
>>>>>>     static int opt_ifindex;
>>>>>>     static int opt_queue;
>>>>>> @@ -306,7 +307,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
>>>>>>            xsk->umem = umem;
>>>>>>            cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
>>>>>>            cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
>>>>>> -       cfg.libbpf_flags = 0;
>>>>>> +       cfg.libbpf_flags = opt_libbpf_flags;
>>>>>>            cfg.xdp_flags = opt_xdp_flags;
>>>>>>            cfg.bind_flags = opt_xdp_bind_flags;
>>>>>>            ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
>>>>>> @@ -346,6 +347,7 @@ static struct option long_options[] = {
>>>>>>            {"interval", required_argument, 0, 'n'},
>>>>>>            {"zero-copy", no_argument, 0, 'z'},
>>>>>>            {"copy", no_argument, 0, 'c'},
>>>>>> +       {"combined", no_argument, 0, 'C'},
>>>>>>            {0, 0, 0, 0}
>>>>>>     };
>>>>>>
>>>>>> @@ -365,6 +367,7 @@ static void usage(const char *prog)
>>>>>>                    "  -n, --interval=n     Specify statistics update interval (default 1 sec).\n"
>>>>>>                    "  -z, --zero-copy      Force zero-copy mode.\n"
>>>>>>                    "  -c, --copy           Force copy mode.\n"
>>>>>> +               "  -C, --combined       Driver supports combined XSK and non-XSK traffic in a channel.\n"
>>>>>>                    "\n";
>>>>>>            fprintf(stderr, str, prog);
>>>>>>            exit(EXIT_FAILURE);
>>>>>> @@ -377,7 +380,7 @@ static void parse_command_line(int argc, char **argv)
>>>>>>            opterr = 0;
>>>>>>
>>>>>>            for (;;) {
>>>>>> -               c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
>>>>>> +               c = getopt_long(argc, argv, "Frtli:q:psSNn:czC", long_options,
>>>>>>                                    &option_index);
>>>>>>                    if (c == -1)
>>>>>>                            break;
>>>>>> @@ -420,6 +423,9 @@ static void parse_command_line(int argc, char **argv)
>>>>>>                    case 'F':
>>>>>>                            opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
>>>>>>                            break;
>>>>>> +               case 'C':
>>>>>> +                       opt_libbpf_flags |= XSK_LIBBPF_FLAGS__COMBINED_CHANNELS;
>>>>>> +                       break;
>>>>>>                    default:
>>>>>>                            usage(basename(argv[0]));
>>>>>>                    }
>>>>>> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
>>>>>> index 9ae4b4e08b68..cf6ff1ecc6bd 100644
>>>>>> --- a/tools/include/uapi/linux/if_xdp.h
>>>>>> +++ b/tools/include/uapi/linux/if_xdp.h
>>>>>> @@ -82,4 +82,15 @@ struct xdp_desc {
>>>>>>
>>>>>>     /* UMEM descriptor is __u64 */
>>>>>>
>>>>>> +/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
>>>>>> + * this flag bit in the queue index to distinguish between two RQs of the same
>>>>>> + * channel.
>>>>>> + */
>>>>>> +#define XDP_QID_FLAG_XSKRQ (1 << 31)
>>>>>> +
>>>>>> +static inline __u32 xdp_qid_get_channel(__u32 qid)
>>>>>> +{
>>>>>> +       return qid & ~XDP_QID_FLAG_XSKRQ;
>>>>>> +}
>>>>>> +
>>>>>>     #endif /* _LINUX_IF_XDP_H */
>>>>>> diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
>>>>>> index a95b06d1f81d..969dfd856039 100644
>>>>>> --- a/tools/lib/bpf/xsk.c
>>>>>> +++ b/tools/lib/bpf/xsk.c
>>>>>> @@ -76,6 +76,12 @@ struct xsk_nl_info {
>>>>>>            int fd;
>>>>>>     };
>>>>>>
>>>>>> +enum qidconf {
>>>>>> +       QIDCONF_REGULAR,
>>>>>> +       QIDCONF_XSK,
>>>>>> +       QIDCONF_XSK_COMBINED,
>>>>>> +};
>>>>>> +
>>>>>>     /* For 32-bit systems, we need to use mmap2 as the offsets are 64-bit.
>>>>>>      * Unfortunately, it is not part of glibc.
>>>>>>      */
>>>>>> @@ -139,7 +145,7 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
>>>>>>                    return 0;
>>>>>>            }
>>>>>>
>>>>>> -       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)
>>>>>> +       if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS_MASK)
>>>>>>                    return -EINVAL;
>>>>>>
>>>>>>            cfg->rx_size = usr_cfg->rx_size;
>>>>>> @@ -267,44 +273,93 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
>>>>>>            /* This is the C-program:
>>>>>>             * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
>>>>>>             * {
>>>>>> -        *     int *qidconf, index = ctx->rx_queue_index;
>>>>>> +        *     int *qidconf, qc;
>>>>>> +        *     int index = ctx->rx_queue_index & ~(1 << 31);
>>>>>> +        *     bool is_xskrq = ctx->rx_queue_index & (1 << 31);
>>>>>>             *
>>>>>> -        *     // A set entry here means that the correspnding queue_id
>>>>>> -        *     // has an active AF_XDP socket bound to it.
>>>>>> +        *     // A set entry here means that the corresponding queue_id
>>>>>> +        *     // has an active AF_XDP socket bound to it. Value 2 means
>>>>>> +        *     // it's zero-copy multi-RQ mode.
>>>>>>             *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
>>>>>>             *     if (!qidconf)
>>>>>>             *         return XDP_ABORTED;
>>>>>>             *
>>>>>> -        *     if (*qidconf)
>>>>>> +        *     qc = *qidconf;
>>>>>> +        *
>>>>>> +        *     if (qc == 2)
>>>>>> +        *         qc = is_xskrq ? 1 : 0;
>>>>>> +        *
>>>>>> +        *     switch (qc) {
>>>>>> +        *     case 0:
>>>>>> +        *         return XDP_PASS;
>>>>>> +        *     case 1:
>>>>>>             *         return bpf_redirect_map(&xsks_map, index, 0);
>>>>>> +        *     }
>>>>>>             *
>>>>>> -        *     return XDP_PASS;
>>>>>> +        *     return XDP_ABORTED;
>>>>>>             * }
>>>>>>             */
>>>>>>            struct bpf_insn prog[] = {
>>>>>> -               /* r1 = *(u32 *)(r1 + 16) */
>>>>>> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
>>>>>> -               /* *(u32 *)(r10 - 4) = r1 */
>>>>>> -               BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
>>>>>> -               BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
>>>>>> -               BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
>>>>>> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
>>>>>> +               /* Load index. */
>>>>>> +               /* r6 = *(u32 *)(r1 + 16) */
>>>>>> +               BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_ARG1, 16),
>>>>>> +               /* w7 = w6 */
>>>>>> +               BPF_MOV32_REG(BPF_REG_7, BPF_REG_6),
>>>>>> +               /* w7 &= 2147483647 */
>>>>>> +               BPF_ALU32_IMM(BPF_AND, BPF_REG_7, ~XDP_QID_FLAG_XSKRQ),
>>>>>> +               /* *(u32 *)(r10 - 4) = r7 */
>>>>>> +               BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_REG_7, -4),
>>>>>> +
>>>>>> +               /* Call bpf_map_lookup_elem. */
>>>>>> +               /* r2 = r10 */
>>>>>> +               BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
>>>>>> +               /* r2 += -4 */
>>>>>> +               BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
>>>>>> +               /* r1 = qidconf_map ll */
>>>>>> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->qidconf_map_fd),
>>>>>> +               /* call 1 */
>>>>>>                    BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>>>>>> -               BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
>>>>>> -               BPF_MOV32_IMM(BPF_REG_0, 0),
>>>>>> -               /* if r1 == 0 goto +8 */
>>>>>> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
>>>>>> -               BPF_MOV32_IMM(BPF_REG_0, 2),
>>>>>> -               /* r1 = *(u32 *)(r1 + 0) */
>>>>>> -               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
>>>>>> -               /* if r1 == 0 goto +5 */
>>>>>> -               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
>>>>>> -               /* r2 = *(u32 *)(r10 - 4) */
>>>>>> -               BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
>>>>>> -               BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
>>>>>> +
>>>>>> +               /* Check the return value. */
>>>>>> +               /* if r0 == 0 goto +14 */
>>>>>> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 14),
>>>>>> +
>>>>>> +               /* Check qc == QIDCONF_XSK_COMBINED. */
>>>>>> +               /* r6 >>= 31 */
>>>>>> +               BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 31),
>>>>>> +               /* r1 = *(u32 *)(r0 + 0) */
>>>>>> +               BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
>>>>>> +               /* if r1 == 2 goto +1 */
>>>>>> +               BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, QIDCONF_XSK_COMBINED, 1),
>>>>>> +
>>>>>> +               /* qc != QIDCONF_XSK_COMBINED */
>>>>>> +               /* r6 = r1 */
>>>>>> +               BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
>>>>>> +
>>>>>> +               /* switch (qc) */
>>>>>> +               /* w0 = 2 */
>>>>>> +               BPF_MOV32_IMM(BPF_REG_0, XDP_PASS),
>>>>>> +               /* if w6 == 0 goto +8 */
>>>>>> +               BPF_JMP32_IMM(BPF_JEQ, BPF_REG_6, QIDCONF_REGULAR, 8),
>>>>>> +               /* if w6 != 1 goto +6 */
>>>>>> +               BPF_JMP32_IMM(BPF_JNE, BPF_REG_6, QIDCONF_XSK, 6),
>>>>>> +
>>>>>> +               /* Call bpf_redirect_map. */
>>>>>> +               /* r1 = xsks_map ll */
>>>>>> +               BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->xsks_map_fd),
>>>>>> +               /* w2 = w7 */
>>>>>> +               BPF_MOV32_REG(BPF_REG_ARG2, BPF_REG_7),
>>>>>> +               /* w3 = 0 */
>>>>>>                    BPF_MOV32_IMM(BPF_REG_3, 0),
>>>>>> +               /* call 51 */
>>>>>>                    BPF_EMIT_CALL(BPF_FUNC_redirect_map),
>>>>>> -               /* The jumps are to this instruction */
>>>>>> +               /* exit */
>>>>>> +               BPF_EXIT_INSN(),
>>>>>> +
>>>>>> +               /* XDP_ABORTED */
>>>>>> +               /* w0 = 0 */
>>>>>> +               BPF_MOV32_IMM(BPF_REG_0, XDP_ABORTED),
>>>>>> +               /* exit */
>>>>>>                    BPF_EXIT_INSN(),
>>>>>>            };
>>>>>>            size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
>>>>>> @@ -483,6 +538,7 @@ static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
>>>>>>
>>>>>>     static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>>>>>>     {
>>>>>> +       int qidconf_value = QIDCONF_XSK;
>>>>>>            bool prog_attached = false;
>>>>>>            __u32 prog_id = 0;
>>>>>>            int err;
>>>>>> @@ -505,7 +561,11 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
>>>>>>                    xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
>>>>>>            }
>>>>>>
>>>>>> -       err = xsk_update_bpf_maps(xsk, true, xsk->fd);
>>>>>> +       if (xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
>>>>>> +               if (xsk->zc)
>>>>>> +                       qidconf_value = QIDCONF_XSK_COMBINED;
>>>>>> +
>>>>>> +       err = xsk_update_bpf_maps(xsk, qidconf_value, xsk->fd);
>>>>>>            if (err)
>>>>>>                    goto out_load;
>>>>>>
>>>>>> @@ -717,7 +777,7 @@ void xsk_socket__delete(struct xsk_socket *xsk)
>>>>>>            if (!xsk)
>>>>>>                    return;
>>>>>>
>>>>>> -       (void)xsk_update_bpf_maps(xsk, 0, 0);
>>>>>> +       (void)xsk_update_bpf_maps(xsk, QIDCONF_REGULAR, 0);
>>>>>>
>>>>>>            optlen = sizeof(off);
>>>>>>            err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
>>>>>> diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
>>>>>> index 82ea71a0f3ec..be26a2423c04 100644
>>>>>> --- a/tools/lib/bpf/xsk.h
>>>>>> +++ b/tools/lib/bpf/xsk.h
>>>>>> @@ -180,6 +180,10 @@ struct xsk_umem_config {
>>>>>>
>>>>>>     /* Flags for the libbpf_flags field. */
>>>>>>     #define XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD (1 << 0)
>>>>>> +#define XSK_LIBBPF_FLAGS__COMBINED_CHANNELS (1 << 1)
>>>>>> +#define XSK_LIBBPF_FLAGS_MASK ( \
>>>>>> +       XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD | \
>>>>>> +       XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
>>>>>>
>>>>>>     struct xsk_socket_config {
>>>>>>            __u32 rx_size;
>>>>>> --
>>>>>> 2.19.1
>>>>>>
>>>>
>>
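For readers tracing the raw insns above, the control flow the generated program encodes (mask bit 31 off the queue index for the map key, shift it out as the XSK-RQ flag, then branch on the qidconf value) can be modeled in plain userspace C. This is only a sketch of the logic in the patch's own comment: `xdp_decision` and the "-1 means failed lookup" convention are illustrative stand-ins, not kernel or libbpf API; the `QIDCONF_*` names and `XDP_QID_FLAG_XSKRQ` come from the patch, and the `XDP_*` values from the uapi headers.

```c
#include <assert.h>
#include <stdint.h>

/* Per-queue values stored in qidconf_map, as used in the patch. */
enum { QIDCONF_REGULAR = 0, QIDCONF_XSK = 1, QIDCONF_XSK_COMBINED = 2 };

/* XDP return codes relevant here (values from include/uapi/linux/bpf.h). */
enum { XDP_ABORTED = 0, XDP_PASS = 2, XDP_REDIRECT = 4 };

/* Bit 31 of the rx_queue_index carries the "this is the XSK RQ" flag;
 * the insns mask it off for the map key (w7 &= 0x7fffffff) and shift it
 * out as a boolean (r6 >>= 31). */
#define XDP_QID_FLAG_XSKRQ (1U << 31)

/* Illustrative model of the generated program. qc is the qidconf_map
 * value for this queue; -1 models a failed map lookup (the real map
 * holds u32 values, so the negative convention is only for this sketch). */
static int xdp_decision(uint32_t raw_index, int qc)
{
	int is_xskrq = raw_index >> 31;

	if (qc < 0)
		return XDP_ABORTED;	/* bpf_map_lookup_elem returned NULL */

	/* Combined channel: route by which RQ the packet arrived on. */
	if (qc == QIDCONF_XSK_COMBINED)
		qc = is_xskrq ? QIDCONF_XSK : QIDCONF_REGULAR;

	switch (qc) {
	case QIDCONF_REGULAR:
		return XDP_PASS;
	case QIDCONF_XSK:
		return XDP_REDIRECT;	/* bpf_redirect_map(&xsks_map, ...) */
	}
	return XDP_ABORTED;
}
```

On a combined channel (`QIDCONF_XSK_COMBINED`), the same queue number can arrive with or without bit 31 set, and only the XSK-RQ copy is redirected; on a plain `QIDCONF_XSK` queue everything is redirected regardless of the flag.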



end of thread, other threads:[~2019-05-13 14:53 UTC | newest]

Thread overview: 26+ messages
2019-04-30 18:12 [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 01/16] xsk: Add API to check for available entries in FQ Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 02/16] xsk: Add getsockopt XDP_OPTIONS Maxim Mikityanskiy
2019-05-04 17:25   ` Björn Töpel
2019-05-06 13:45     ` Maxim Mikityanskiy
2019-05-06 16:35       ` Alexei Starovoitov
2019-04-30 18:12 ` [PATCH bpf-next v2 03/16] libbpf: Support " Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic Maxim Mikityanskiy
2019-05-04 17:26   ` Björn Töpel
2019-05-06 13:45     ` Maxim Mikityanskiy
2019-05-06 14:23       ` Magnus Karlsson
2019-05-07 14:19         ` Maxim Mikityanskiy
2019-05-08 13:06           ` Magnus Karlsson
2019-05-13 14:52             ` Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 05/16] xsk: Change the default frame size to 4096 and allow controlling it Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 06/16] xsk: Return the whole xdp_desc from xsk_umem_consume_tx Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 07/16] net/mlx5e: Replace deprecated PCI_DMA_TODEVICE Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 08/16] net/mlx5e: Calculate linear RX frag size considering XSK Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 09/16] net/mlx5e: Allow ICO SQ to be used by multiple RQs Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 10/16] net/mlx5e: Refactor struct mlx5e_xdp_info Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 11/16] net/mlx5e: Share the XDP SQ for XDP_TX between RQs Maxim Mikityanskiy
2019-04-30 18:13 ` [PATCH bpf-next v2 12/16] net/mlx5e: XDP_TX from UMEM support Maxim Mikityanskiy
2019-04-30 18:13 ` [PATCH bpf-next v2 13/16] net/mlx5e: Consider XSK in XDP MTU limit calculation Maxim Mikityanskiy
2019-04-30 18:13 ` [PATCH bpf-next v2 14/16] net/mlx5e: Encapsulate open/close queues into a function Maxim Mikityanskiy
2019-04-30 18:13 ` [PATCH bpf-next v2 15/16] net/mlx5e: Move queue param structs to en/params.h Maxim Mikityanskiy
2019-05-04 17:25 ` [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Björn Töpel
