netdev.vger.kernel.org archive mirror
From: Maxim Mikityanskiy <maximmi@mellanox.com>
To: "Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Magnus Karlsson" <magnus.karlsson@intel.com>
Cc: "bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"David S. Miller" <davem@davemloft.net>,
	Saeed Mahameed <saeedm@mellanox.com>, Jonathan Lemon <bsd@fb.com>,
	Tariq Toukan <tariqt@mellanox.com>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	Jakub Kicinski <jakub.kicinski@netronome.com>,
	Maciej Fijalkowski <maciejromanfijalkowski@gmail.com>,
	Maxim Mikityanskiy <maximmi@mellanox.com>
Subject: [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic
Date: Tue, 30 Apr 2019 18:12:41 +0000
Message-ID: <20190430181215.15305-5-maximmi@mellanox.com>
In-Reply-To: <20190430181215.15305-1-maximmi@mellanox.com>

Currently, the drivers that implement AF_XDP zero-copy support (e.g.,
i40e) switch the channel into a different mode when an XSK is opened. It
causes some issues that have to be taken into account. For example, RSS
needs to be reconfigured to skip the XSK-enabled channels, or the XDP
program should filter out traffic not intended for that socket and
XDP_PASS it with an additional copy. As nothing validates or enforces
the proper configuration, it's easy to get packet drops when packets
land in an XSK by mistake; in fact, that is the default configuration.
Some tool would have to reconfigure RSS on every socket open and close
event, but such a tool is problematic to implement: no one reports
these events, and the approach is race-prone.

This commit extends XSK to support both kinds of traffic (XSK and
non-XSK) in the same channel. It implies having two RX queues in
XSK-enabled channels: one for the regular traffic, and the other for
XSK. It solves the problem with RSS: the default configuration just
works without the need to manually reconfigure RSS or to perform some
possibly complicated filtering in the XDP layer. It makes it easy to run
both AF_XDP and regular sockets on the same machine. In the XDP program,
the QID's most significant bit will serve as a flag to indicate whether
it's the XSK queue or not. The extension is compatible with the legacy
configuration, so if one wants to run the legacy mode, they can
reconfigure RSS and ignore the flag in the XDP program (implemented in
the reference XDP program in libbpf). mlx5e will support this extension.

A single XDP program can run with drivers that either support or do not
support this extension. The xdpsock sample and libbpf are updated
accordingly.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 include/uapi/linux/if_xdp.h       |  11 +++
 net/xdp/xsk.c                     |   5 +-
 samples/bpf/xdpsock_user.c        |  10 ++-
 tools/include/uapi/linux/if_xdp.h |  11 +++
 tools/lib/bpf/xsk.c               | 116 ++++++++++++++++++++++--------
 tools/lib/bpf/xsk.h               |   4 ++
 6 files changed, 126 insertions(+), 31 deletions(-)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 9ae4b4e08b68..cf6ff1ecc6bd 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -82,4 +82,15 @@ struct xdp_desc {
 
 /* UMEM descriptor is __u64 */
 
+/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
+ * this flag bit in the queue index to distinguish between two RQs of the same
+ * channel.
+ */
+#define XDP_QID_FLAG_XSKRQ (1 << 31)
+
+static inline __u32 xdp_qid_get_channel(__u32 qid)
+{
+	return qid & ~XDP_QID_FLAG_XSKRQ;
+}
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 998199109d5c..114ba17acb09 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -104,9 +104,12 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
+	struct xdp_rxq_info *rxq = xdp->rxq;
+	u32 channel = xdp_qid_get_channel(rxq->queue_index);
 	u32 len;
 
-	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+	if (xs->dev != rxq->dev || xs->queue_id != channel ||
+	    xs->zc != (rxq->mem.type == MEM_TYPE_ZERO_COPY))
 		return -EINVAL;
 
 	len = xdp->data_end - xdp->data;
diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index d08ee1ab7bb4..a6b13025ee79 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -62,6 +62,7 @@ enum benchmark_type {
 
 static enum benchmark_type opt_bench = BENCH_RXDROP;
 static u32 opt_xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+static u32 opt_libbpf_flags;
 static const char *opt_if = "";
 static int opt_ifindex;
 static int opt_queue;
@@ -306,7 +307,7 @@ static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem)
 	xsk->umem = umem;
 	cfg.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
 	cfg.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
-	cfg.libbpf_flags = 0;
+	cfg.libbpf_flags = opt_libbpf_flags;
 	cfg.xdp_flags = opt_xdp_flags;
 	cfg.bind_flags = opt_xdp_bind_flags;
 	ret = xsk_socket__create(&xsk->xsk, opt_if, opt_queue, umem->umem,
@@ -346,6 +347,7 @@ static struct option long_options[] = {
 	{"interval", required_argument, 0, 'n'},
 	{"zero-copy", no_argument, 0, 'z'},
 	{"copy", no_argument, 0, 'c'},
+	{"combined", no_argument, 0, 'C'},
 	{0, 0, 0, 0}
 };
 
@@ -365,6 +367,7 @@ static void usage(const char *prog)
 		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
 		"  -z, --zero-copy      Force zero-copy mode.\n"
 		"  -c, --copy           Force copy mode.\n"
+		"  -C, --combined       Driver supports combined XSK and non-XSK traffic in a channel.\n"
 		"\n";
 	fprintf(stderr, str, prog);
 	exit(EXIT_FAILURE);
@@ -377,7 +380,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "Frtli:q:psSNn:cz", long_options,
+		c = getopt_long(argc, argv, "Frtli:q:psSNn:czC", long_options,
 				&option_index);
 		if (c == -1)
 			break;
@@ -420,6 +423,9 @@ static void parse_command_line(int argc, char **argv)
 		case 'F':
 			opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
 			break;
+		case 'C':
+			opt_libbpf_flags |= XSK_LIBBPF_FLAGS__COMBINED_CHANNELS;
+			break;
 		default:
 			usage(basename(argv[0]));
 		}
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 9ae4b4e08b68..cf6ff1ecc6bd 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -82,4 +82,15 @@ struct xdp_desc {
 
 /* UMEM descriptor is __u64 */
 
+/* The driver may run a dedicated XSK RQ in the channel. The XDP program uses
+ * this flag bit in the queue index to distinguish between two RQs of the same
+ * channel.
+ */
+#define XDP_QID_FLAG_XSKRQ (1 << 31)
+
+static inline __u32 xdp_qid_get_channel(__u32 qid)
+{
+	return qid & ~XDP_QID_FLAG_XSKRQ;
+}
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index a95b06d1f81d..969dfd856039 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -76,6 +76,12 @@ struct xsk_nl_info {
 	int fd;
 };
 
+enum qidconf {
+	QIDCONF_REGULAR,
+	QIDCONF_XSK,
+	QIDCONF_XSK_COMBINED,
+};
+
 /* For 32-bit systems, we need to use mmap2 as the offsets are 64-bit.
  * Unfortunately, it is not part of glibc.
  */
@@ -139,7 +145,7 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
 		return 0;
 	}
 
-	if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)
+	if (usr_cfg->libbpf_flags & ~XSK_LIBBPF_FLAGS_MASK)
 		return -EINVAL;
 
 	cfg->rx_size = usr_cfg->rx_size;
@@ -267,44 +273,93 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk)
 	/* This is the C-program:
 	 * SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx)
 	 * {
-	 *     int *qidconf, index = ctx->rx_queue_index;
+	 *     int *qidconf, qc;
+	 *     int index = ctx->rx_queue_index & ~(1 << 31);
+	 *     bool is_xskrq = ctx->rx_queue_index & (1 << 31);
 	 *
-	 *     // A set entry here means that the correspnding queue_id
-	 *     // has an active AF_XDP socket bound to it.
+	 *     // A set entry here means that the corresponding queue_id
+	 *     // has an active AF_XDP socket bound to it. Value 2 means
+	 *     // it's zero-copy multi-RQ mode.
 	 *     qidconf = bpf_map_lookup_elem(&qidconf_map, &index);
 	 *     if (!qidconf)
 	 *         return XDP_ABORTED;
 	 *
-	 *     if (*qidconf)
+	 *     qc = *qidconf;
+	 *
+	 *     if (qc == 2)
+	 *         qc = is_xskrq ? 1 : 0;
+	 *
+	 *     switch (qc) {
+	 *     case 0:
+	 *         return XDP_PASS;
+	 *     case 1:
 	 *         return bpf_redirect_map(&xsks_map, index, 0);
+	 *     }
 	 *
-	 *     return XDP_PASS;
+	 *     return XDP_ABORTED;
 	 * }
 	 */
 	struct bpf_insn prog[] = {
-		/* r1 = *(u32 *)(r1 + 16) */
-		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 16),
-		/* *(u32 *)(r10 - 4) = r1 */
-		BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_1, -4),
-		BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
-		BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4),
-		BPF_LD_MAP_FD(BPF_REG_1, xsk->qidconf_map_fd),
+		/* Load index. */
+		/* r6 = *(u32 *)(r1 + 16) */
+		BPF_LDX_MEM(BPF_W, BPF_REG_6, BPF_REG_ARG1, 16),
+		/* w7 = w6 */
+		BPF_MOV32_REG(BPF_REG_7, BPF_REG_6),
+		/* w7 &= 2147483647 */
+		BPF_ALU32_IMM(BPF_AND, BPF_REG_7, ~XDP_QID_FLAG_XSKRQ),
+		/* *(u32 *)(r10 - 4) = r7 */
+		BPF_STX_MEM(BPF_W, BPF_REG_FP, BPF_REG_7, -4),
+
+		/* Call bpf_map_lookup_elem. */
+		/* r2 = r10 */
+		BPF_MOV64_REG(BPF_REG_ARG2, BPF_REG_FP),
+		/* r2 += -4 */
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_ARG2, -4),
+		/* r1 = qidconf_map ll */
+		BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->qidconf_map_fd),
+		/* call 1 */
 		BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
-		BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
-		BPF_MOV32_IMM(BPF_REG_0, 0),
-		/* if r1 == 0 goto +8 */
-		BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 8),
-		BPF_MOV32_IMM(BPF_REG_0, 2),
-		/* r1 = *(u32 *)(r1 + 0) */
-		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0),
-		/* if r1 == 0 goto +5 */
-		BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0, 5),
-		/* r2 = *(u32 *)(r10 - 4) */
-		BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd),
-		BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4),
+
+		/* Check the return value. */
+		/* if r0 == 0 goto +14 */
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 14),
+
+		/* Check qc == QIDCONF_XSK_COMBINED. */
+		/* r6 >>= 31 */
+		BPF_ALU64_IMM(BPF_RSH, BPF_REG_6, 31),
+		/* r1 = *(u32 *)(r0 + 0) */
+		BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),
+		/* if r1 == 2 goto +1 */
+		BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, QIDCONF_XSK_COMBINED, 1),
+
+		/* qc != QIDCONF_XSK_COMBINED */
+		/* r6 = r1 */
+		BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+
+		/* switch (qc) */
+		/* w0 = 2 */
+		BPF_MOV32_IMM(BPF_REG_0, XDP_PASS),
+		/* if w6 == 0 goto +8 */
+		BPF_JMP32_IMM(BPF_JEQ, BPF_REG_6, QIDCONF_REGULAR, 8),
+		/* if w6 != 1 goto +6 */
+		BPF_JMP32_IMM(BPF_JNE, BPF_REG_6, QIDCONF_XSK, 6),
+
+		/* Call bpf_redirect_map. */
+		/* r1 = xsks_map ll */
+		BPF_LD_MAP_FD(BPF_REG_ARG1, xsk->xsks_map_fd),
+		/* w2 = w7 */
+		BPF_MOV32_REG(BPF_REG_ARG2, BPF_REG_7),
+		/* w3 = 0 */
 		BPF_MOV32_IMM(BPF_REG_3, 0),
+		/* call 51 */
 		BPF_EMIT_CALL(BPF_FUNC_redirect_map),
-		/* The jumps are to this instruction */
+		/* exit */
+		BPF_EXIT_INSN(),
+
+		/* XDP_ABORTED */
+		/* w0 = 0 */
+		BPF_MOV32_IMM(BPF_REG_0, XDP_ABORTED),
+		/* exit */
 		BPF_EXIT_INSN(),
 	};
 	size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
@@ -483,6 +538,7 @@ static int xsk_update_bpf_maps(struct xsk_socket *xsk, int qidconf_value,
 
 static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
 {
+	int qidconf_value = QIDCONF_XSK;
 	bool prog_attached = false;
 	__u32 prog_id = 0;
 	int err;
@@ -505,7 +561,11 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk)
 		xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id);
 	}
 
-	err = xsk_update_bpf_maps(xsk, true, xsk->fd);
+	if (xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
+		if (xsk->zc)
+			qidconf_value = QIDCONF_XSK_COMBINED;
+
+	err = xsk_update_bpf_maps(xsk, qidconf_value, xsk->fd);
 	if (err)
 		goto out_load;
 
@@ -717,7 +777,7 @@ void xsk_socket__delete(struct xsk_socket *xsk)
 	if (!xsk)
 		return;
 
-	(void)xsk_update_bpf_maps(xsk, 0, 0);
+	(void)xsk_update_bpf_maps(xsk, QIDCONF_REGULAR, 0);
 
 	optlen = sizeof(off);
 	err = getsockopt(xsk->fd, SOL_XDP, XDP_MMAP_OFFSETS, &off, &optlen);
diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index 82ea71a0f3ec..be26a2423c04 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -180,6 +180,10 @@ struct xsk_umem_config {
 
 /* Flags for the libbpf_flags field. */
 #define XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD (1 << 0)
+#define XSK_LIBBPF_FLAGS__COMBINED_CHANNELS (1 << 1)
+#define XSK_LIBBPF_FLAGS_MASK ( \
+	XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD | \
+	XSK_LIBBPF_FLAGS__COMBINED_CHANNELS)
 
 struct xsk_socket_config {
 	__u32 rx_size;
-- 
2.19.1


Thread overview: 27+ messages
2019-04-30 18:12 [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 01/16] xsk: Add API to check for available entries in FQ Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 02/16] xsk: Add getsockopt XDP_OPTIONS Maxim Mikityanskiy
2019-05-04 17:25   ` Björn Töpel
2019-05-06 13:45     ` Maxim Mikityanskiy
2019-05-06 16:35       ` Alexei Starovoitov
2019-04-30 18:12 ` [PATCH bpf-next v2 03/16] libbpf: Support " Maxim Mikityanskiy
2019-04-30 18:12 ` Maxim Mikityanskiy [this message]
2019-05-04 17:26   ` [PATCH bpf-next v2 04/16] xsk: Extend channels to support combined XSK/non-XSK traffic Björn Töpel
2019-05-06 13:45     ` Maxim Mikityanskiy
2019-05-06 14:23       ` Magnus Karlsson
2019-05-07 14:19         ` Maxim Mikityanskiy
2019-05-08 13:06           ` Magnus Karlsson
2019-05-13 14:52             ` Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 05/16] xsk: Change the default frame size to 4096 and allow controlling it Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 06/16] xsk: Return the whole xdp_desc from xsk_umem_consume_tx Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 07/16] net/mlx5e: Replace deprecated PCI_DMA_TODEVICE Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 08/16] net/mlx5e: Calculate linear RX frag size considering XSK Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 09/16] net/mlx5e: Allow ICO SQ to be used by multiple RQs Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 10/16] net/mlx5e: Refactor struct mlx5e_xdp_info Maxim Mikityanskiy
2019-04-30 18:12 ` [PATCH bpf-next v2 11/16] net/mlx5e: Share the XDP SQ for XDP_TX between RQs Maxim Mikityanskiy
2019-04-30 18:13 ` [PATCH bpf-next v2 12/16] net/mlx5e: XDP_TX from UMEM support Maxim Mikityanskiy
2019-04-30 18:13 ` [PATCH bpf-next v2 13/16] net/mlx5e: Consider XSK in XDP MTU limit calculation Maxim Mikityanskiy
2019-04-30 18:13 ` [PATCH bpf-next v2 14/16] net/mlx5e: Encapsulate open/close queues into a function Maxim Mikityanskiy
2019-04-30 18:13 ` [PATCH bpf-next v2 15/16] net/mlx5e: Move queue param structs to en/params.h Maxim Mikityanskiy
2019-04-30 18:13 ` [PATCH bpf-next v2 16/16] net/mlx5e: Add XSK support Maxim Mikityanskiy
2019-05-04 17:25 ` [PATCH bpf-next v2 00/16] AF_XDP infrastructure improvements and mlx5e support Björn Töpel
