* [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

Packet forwarding is an important use case for XDP, which offers
significant performance improvements compared to forwarding using the
regular networking stack. However, XDP currently offers no mechanism to
delay, queue or schedule packets, which limits the practical uses of
XDP-based forwarding to those cases where the capacities of the input and
output links always match (i.e., no rate transitions or many-to-one
forwarding). It also prevents an XDP-based router from doing any kind of
traffic shaping or reordering to enforce policy.

This series represents a first RFC of our attempt to remedy this lack. The
code in these patches is functional, but needs additional testing and
polishing before being considered for merging. I'm posting it here as an
RFC to get some early feedback on the API and overall design of the
feature.

DESIGN

The design consists of three components: a new map type for storing XDP
frames, a new 'dequeue' program type that will run in the TX softirq to
provide the stack with packets to transmit, and a set of helpers to dequeue
packets from the map, optionally drop them, and schedule an interface for
transmission.

The new map type is modelled on the PIFO data structure proposed in the
literature[0][1]. It represents a priority queue where packets can be
enqueued at any priority, but are always dequeued from the head. From the
XDP side, the map is simply used as a target for the bpf_redirect_map()
helper, where the target index is the desired priority.
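
As a rough illustration, enqueueing from XDP could look like the sketch
below (the BPF_MAP_TYPE_PIFO_XDP type and the 64-bit map key are
introduced in patches 4 and 2, respectively; the priority computation is
just a placeholder):

  /* Sketch only: redirect packets into a PIFO map, using the map index
   * as the enqueue priority.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_PIFO_XDP);
          __uint(key_size, sizeof(__u32));
          __uint(value_size, sizeof(__u32));
          __uint(max_entries, 1024);
          __uint(map_extra, 1024); /* priority range; a power of two */
  } pifo_map SEC(".maps");

  SEC("xdp")
  int enqueue_prog(struct xdp_md *ctx)
  {
          __u64 prio = 0; /* placeholder: compute from packet headers */

          return bpf_redirect_map(&pifo_map, prio, 0);
  }

  char _license[] SEC("license") = "GPL";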

The dequeue program type is a new BPF program type that is attached to an
interface; when an interface is scheduled for transmission, the stack will
execute the attached dequeue program and, if it returns a packet to
transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
driver function.

The dequeue program can obtain packets by pulling them out of a PIFO map
using the new bpf_packet_dequeue() helper. This returns a pointer to an
xdp_md structure, which can be dereferenced to obtain packet data and
data_meta pointers like in an XDP program. The returned packets are also
reference counted, meaning the verifier enforces that the dequeue program
either drops the packet (with the bpf_packet_drop() helper), or returns it
for transmission. Finally, a helper is added that can be used to actually
schedule an interface for transmission using the dequeue program type; this
helper can be called from both XDP and dequeue programs.
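
To make the flow concrete, a dequeue program might look like the sketch
below, pulling from the map in the previous sketch (the section name,
context type and helper signatures here are illustrative assumptions; see
patches 6 and 7 for the real definitions):

  /* Sketch only: return the highest-priority queued packet to the
   * stack, which transmits it via ndo_xdp_xmit().
   */
  SEC("dequeue")
  void *dequeue_prog(struct dequeue_ctx *ctx)
  {
          struct xdp_md *pkt;
          __u64 prio;

          pkt = (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &prio);
          if (!pkt)
                  return NULL; /* queue empty; nothing to transmit */

          /* A scheduler could inspect pkt->data here, or call
           * bpf_packet_drop(ctx, pkt) instead of returning the packet.
           */
          return pkt;
  }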

PERFORMANCE

Preliminary performance tests indicate an overhead of about 50 ns per
packet from adding queueing to the xdp_fwd example (last patch), which
translates to a 20% PPS overhead (but still 2x the forwarding performance
of the netstack):

xdp_fwd :     4.7 Mpps  (213 ns /pkt)
xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
netstack:       2 Mpps  (500 ns /pkt)

RELATION TO BPF QDISC

Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
particular the use of a map to store packets. This is no accident, as we've
had ongoing discussions for a while now. I have no great hope that we can
completely converge the two efforts into a single BPF-based queueing
API (as has been discussed before[3], consolidating the SKB and XDP paths
is challenging). Rather, I'm hoping that we can converge the designs enough
that we can share BPF code between XDP and qdisc layers using common
functions, like it's possible to do with XDP and TC-BPF today. This would
imply agreeing on the map type and API, and possibly on the set of helpers
available to the BPF programs.

PATCH STRUCTURE

This series consists of a total of 17 patches, as follows:

Patches 1-3 are smaller preparatory refactoring patches used by subsequent
patches.

Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
program type.

Patches 7-10 add the dequeue helpers and the verifier features needed to
recognise packet pointers, reference count them, and allow dereferencing
them to obtain packet data pointers.

Patches 11 and 12 add the dequeue program hook to the TX path, and the
helper to schedule an interface for transmission.

Patches 13-16 add libbpf support for the new types, and selftests for the
new features.

Finally, patch 17 adds queueing support to the xdp_fwd program in
samples/bpf to provide an easy-to-use way of testing the feature; this is
for illustrative purposes for the RFC only, and will not be included in the
final submission.

SUPPLEMENTARY MATERIAL

A (WiP) test harness for implementing and unit-testing scheduling
algorithms using this framework (and the bpf_prog_run() hook) is available
as part of the bpf-examples repository[4]. We plan to expand this with more
test algorithms to smoke-test the API, and also add ready-to-use queueing
algorithms for use for forwarding (to replace the xdp_fwd patch included as
part of this RFC submission).

The work represented in this series was done in collaboration with several
people. Thanks to Kumar Kartikeya Dwivedi for writing the verifier
enhancements in this series, to Frey Alfredsson for his work on the testing
harness in [4], and to Jesper Brouer, Per Hurtig and Anna Brunstrom for
their valuable input on the design of the queueing APIs.

This series is also available as a git tree on git.kernel.org[5].

NOTES

[0] http://web.mit.edu/pifo/
[1] https://arxiv.org/abs/1810.03060
[2] https://lore.kernel.org/r/20220602041028.95124-1-xiyou.wangcong@gmail.com
[3] https://lore.kernel.org/r/b4ff6a2b-1478-89f8-ea9f-added498c59f@gmail.com
[4] https://github.com/xdp-project/bpf-examples/pull/40
[5] https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/log/?h=xdp-queueing-06

Kumar Kartikeya Dwivedi (5):
  bpf: Use 64-bit return value for bpf_prog_run
  bpf: Teach the verifier about referenced packets returned from dequeue
    programs
  bpf: Introduce pkt_uid member for PTR_TO_PACKET
  bpf: Implement direct packet access in dequeue progs
  selftests/bpf: Add verifier tests for dequeue prog

Toke Høiland-Jørgensen (12):
  dev: Move received_rps counter next to RPS members in softnet data
  bpf: Expand map key argument of bpf_redirect_map to u64
  bpf: Add a PIFO priority queue map type
  pifomap: Add queue rotation for continuously increasing rank mode
  xdp: Add dequeue program type for getting packets from a PIFO
  bpf: Add helpers to dequeue from a PIFO map
  dev: Add XDP dequeue hook
  bpf: Add helper to schedule an interface for TX dequeue
  libbpf: Add support for dequeue program type and PIFO map type
  libbpf: Add support for querying dequeue programs
  selftests/bpf: Add test for XDP queueing through PIFO maps
  samples/bpf: Add queueing support to xdp_fwd sample

 include/linux/bpf-cgroup.h                    |  12 +-
 include/linux/bpf.h                           |  64 +-
 include/linux/bpf_types.h                     |   4 +
 include/linux/bpf_verifier.h                  |  14 +-
 include/linux/filter.h                        |  63 +-
 include/linux/netdevice.h                     |   8 +-
 include/net/xdp.h                             |  16 +-
 include/uapi/linux/bpf.h                      |  50 +-
 include/uapi/linux/if_link.h                  |   4 +-
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/cgroup.c                           |  12 +-
 kernel/bpf/core.c                             |  14 +-
 kernel/bpf/cpumap.c                           |   4 +-
 kernel/bpf/devmap.c                           |  92 ++-
 kernel/bpf/offload.c                          |   4 +-
 kernel/bpf/pifomap.c                          | 635 ++++++++++++++++++
 kernel/bpf/syscall.c                          |   3 +
 kernel/bpf/verifier.c                         | 148 +++-
 net/bpf/test_run.c                            |  54 +-
 net/core/dev.c                                | 109 +++
 net/core/dev.h                                |   2 +
 net/core/filter.c                             | 307 ++++++++-
 net/core/rtnetlink.c                          |  30 +-
 net/packet/af_packet.c                        |   7 +-
 net/xdp/xskmap.c                              |   4 +-
 samples/bpf/xdp_fwd_kern.c                    |  65 +-
 samples/bpf/xdp_fwd_user.c                    | 200 ++++--
 tools/include/uapi/linux/bpf.h                |  48 ++
 tools/include/uapi/linux/if_link.h            |   4 +-
 tools/lib/bpf/libbpf.c                        |   1 +
 tools/lib/bpf/libbpf.h                        |   1 +
 tools/lib/bpf/libbpf_probes.c                 |   5 +
 tools/lib/bpf/netlink.c                       |   8 +
 .../selftests/bpf/prog_tests/pifo_map.c       | 125 ++++
 .../bpf/prog_tests/xdp_pifo_test_run.c        | 154 +++++
 tools/testing/selftests/bpf/progs/pifo_map.c  |  54 ++
 .../selftests/bpf/progs/test_xdp_pifo.c       | 110 +++
 tools/testing/selftests/bpf/test_verifier.c   |  29 +-
 .../testing/selftests/bpf/verifier/dequeue.c  | 160 +++++
 39 files changed, 2426 insertions(+), 200 deletions(-)
 create mode 100644 kernel/bpf/pifomap.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/pifo_map.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
 create mode 100644 tools/testing/selftests/bpf/progs/pifo_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pifo.c
 create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c

-- 
2.37.0


* [RFC PATCH 01/17] dev: Move received_rps counter next to RPS members in softnet data
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

Move the received_rps counter value next to the other RPS-related members
in softnet_data. This closes two four-byte holes in the structure, making
room for another pointer in the first two cache lines without bumping the
xmit struct to its own line.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/netdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1a3cb93c3dcc..fe9aeca2fce9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3100,7 +3100,6 @@ struct softnet_data {
 	/* stats */
 	unsigned int		processed;
 	unsigned int		time_squeeze;
-	unsigned int		received_rps;
 #ifdef CONFIG_RPS
 	struct softnet_data	*rps_ipi_list;
 #endif
@@ -3133,6 +3132,7 @@ struct softnet_data {
 	unsigned int		cpu;
 	unsigned int		input_queue_tail;
 #endif
+	unsigned int		received_rps;
 	unsigned int		dropped;
 	struct sk_buff_head	input_pkt_queue;
 	struct napi_struct	backlog;
-- 
2.37.0


* [RFC PATCH 02/17] bpf: Expand map key argument of bpf_redirect_map to u64
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen, Eric Dumazet,
	Paolo Abeni

We want to be able to support 64-bit indexes for PIFO maps, so expand the
width of the 'key' argument to the bpf_redirect_map() helper. Since BPF
registers are always 64-bit, this should be safe to do after the fact.
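
As a hedged sketch (the PIFO map that motivates this change is only added
later in the series, so the map definition is omitted here), a program can
now pass a full 64-bit index:

  SEC("xdp")
  int redirect64(struct xdp_md *ctx)
  {
          __u64 prio = 1ULL << 40; /* would previously have been truncated */

          return bpf_redirect_map(&pifo_map, prio, 0);
  }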

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/bpf.h      |  2 +-
 include/linux/filter.h   | 12 ++++++------
 include/uapi/linux/bpf.h |  2 +-
 kernel/bpf/cpumap.c      |  4 ++--
 kernel/bpf/devmap.c      |  4 ++--
 kernel/bpf/verifier.c    |  2 +-
 net/core/filter.c        |  4 ++--
 net/xdp/xskmap.c         |  4 ++--
 8 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2b21f2a3452f..d877d9825e77 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -132,7 +132,7 @@ struct bpf_map_ops {
 	struct bpf_local_storage __rcu ** (*map_owner_storage_ptr)(void *owner);
 
 	/* Misc helpers.*/
-	int (*map_redirect)(struct bpf_map *map, u32 ifindex, u64 flags);
+	int (*map_redirect)(struct bpf_map *map, u64 key, u64 flags);
 
 	/* map_meta_equal must be implemented for maps that can be
 	 * used as an inner map.  It is a runtime check to ensure
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4c1a8b247545..10167ab1ef95 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -637,13 +637,13 @@ struct bpf_nh_params {
 };
 
 struct bpf_redirect_info {
-	u32 flags;
-	u32 tgt_index;
+	u64 tgt_index;
 	void *tgt_value;
 	struct bpf_map *map;
+	u32 flags;
+	u32 kern_flags;
 	u32 map_id;
 	enum bpf_map_type map_type;
-	u32 kern_flags;
 	struct bpf_nh_params nh;
 };
 
@@ -1486,7 +1486,7 @@ static inline bool bpf_sk_lookup_run_v6(struct net *net, int protocol,
 }
 #endif /* IS_ENABLED(CONFIG_IPV6) */
 
-static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifindex,
+static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u64 index,
 						  u64 flags, const u64 flag_mask,
 						  void *lookup_elem(struct bpf_map *map, u32 key))
 {
@@ -1497,7 +1497,7 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 	if (unlikely(flags & ~(action_mask | flag_mask)))
 		return XDP_ABORTED;
 
-	ri->tgt_value = lookup_elem(map, ifindex);
+	ri->tgt_value = lookup_elem(map, index);
 	if (unlikely(!ri->tgt_value) && !(flags & BPF_F_BROADCAST)) {
 		/* If the lookup fails we want to clear out the state in the
 		 * redirect_info struct completely, so that if an eBPF program
@@ -1509,7 +1509,7 @@ static __always_inline int __bpf_xdp_redirect_map(struct bpf_map *map, u32 ifind
 		return flags & action_mask;
 	}
 
-	ri->tgt_index = ifindex;
+	ri->tgt_index = index;
 	ri->map_id = map->id;
 	ri->map_type = map->map_type;
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 379e68fb866f..aec623f60048 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2607,7 +2607,7 @@ union bpf_attr {
  * 	Return
  * 		0 on success, or a negative error in case of failure.
  *
- * long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+ * long bpf_redirect_map(struct bpf_map *map, u64 key, u64 flags)
  * 	Description
  * 		Redirect the packet to the endpoint referenced by *map* at
  * 		index *key*. Depending on its type, this *map* can contain
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index f4860ac756cd..2e7ee53ae3e4 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -668,9 +668,9 @@ static int cpu_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	return 0;
 }
 
-static int cpu_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
+static int cpu_map_redirect(struct bpf_map *map, u64 index, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, 0,
+	return __bpf_xdp_redirect_map(map, index, flags, 0,
 				      __cpu_map_lookup_elem);
 }
 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index c2867068e5bd..980f8928e977 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -992,14 +992,14 @@ static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value,
 					 map, key, value, map_flags);
 }
 
-static int dev_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
+static int dev_map_redirect(struct bpf_map *map, u64 ifindex, u64 flags)
 {
 	return __bpf_xdp_redirect_map(map, ifindex, flags,
 				      BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS,
 				      __dev_map_lookup_elem);
 }
 
-static int dev_hash_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
+static int dev_hash_map_redirect(struct bpf_map *map, u64 ifindex, u64 flags)
 {
 	return __bpf_xdp_redirect_map(map, ifindex, flags,
 				      BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS,
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 328cfab3af60..039f7b61c305 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -14183,7 +14183,7 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			BUILD_BUG_ON(!__same_type(ops->map_peek_elem,
 				     (int (*)(struct bpf_map *map, void *value))NULL));
 			BUILD_BUG_ON(!__same_type(ops->map_redirect,
-				     (int (*)(struct bpf_map *map, u32 ifindex, u64 flags))NULL));
+				     (int (*)(struct bpf_map *map, u64 index, u64 flags))NULL));
 			BUILD_BUG_ON(!__same_type(ops->map_for_each_callback,
 				     (int (*)(struct bpf_map *map,
 					      bpf_callback_t callback_fn,
diff --git a/net/core/filter.c b/net/core/filter.c
index 4ef77ec5255e..e23e53ed1b04 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4408,10 +4408,10 @@ static const struct bpf_func_proto bpf_xdp_redirect_proto = {
 	.arg2_type      = ARG_ANYTHING,
 };
 
-BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex,
+BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u64, key,
 	   u64, flags)
 {
-	return map->ops->map_redirect(map, ifindex, flags);
+	return map->ops->map_redirect(map, key, flags);
 }
 
 static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index acc8e52a4f5f..771d0fa90ef5 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -231,9 +231,9 @@ static int xsk_map_delete_elem(struct bpf_map *map, void *key)
 	return 0;
 }
 
-static int xsk_map_redirect(struct bpf_map *map, u32 ifindex, u64 flags)
+static int xsk_map_redirect(struct bpf_map *map, u64 index, u64 flags)
 {
-	return __bpf_xdp_redirect_map(map, ifindex, flags, 0,
+	return __bpf_xdp_redirect_map(map, index, flags, 0,
 				      __xsk_map_lookup_elem);
 }
 
-- 
2.37.0


* [RFC PATCH 03/17] bpf: Use 64-bit return value for bpf_prog_run
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen, Eric Dumazet,
	Paolo Abeni

From: Kumar Kartikeya Dwivedi <memxor@gmail.com>

The BPF ABI always uses a 64-bit return value, but so far __bpf_prog_run
and the higher-level wrappers have always truncated the return value to
32 bits. Future patches introducing a new BPF_PROG_TYPE_DEQUEUE need to
return a PTR_TO_BTF_ID or NULL from the BPF program to the caller context
in the kernel. The verifier is taught to enforce that such a referenced
PTR_TO_BTF_ID is either returned on success, explicitly released, or
replaced by a NULL return to indicate absence. To be able to use this
returned pointer value, the bpf_prog_run invocation needs to be able to
return a 64-bit value.

To avoid code churn in the whole kernel, we let the compiler handle the
truncation normally, and allow new call sites to utilize the 64-bit return
value by receiving it as a u64.
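
For illustration, a new call site can simply receive the value as a u64,
while existing callers keep the old truncating behaviour (sketch only; the
dequeue hook added later in this series is the real consumer):

  static u64 run_prog_64(const struct bpf_prog *prog, void *ctx)
  {
          /* full 64 bits preserved, e.g. for a returned pointer value */
          return bpf_prog_run(prog, ctx);
  }

  static u32 run_prog_legacy(const struct bpf_prog *prog, void *ctx)
  {
          return bpf_prog_run(prog, ctx); /* implicitly truncated to 32 bits */
  }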

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/bpf-cgroup.h | 12 ++++++------
 include/linux/bpf.h        | 14 +++++++-------
 include/linux/filter.h     | 34 +++++++++++++++++-----------------
 kernel/bpf/cgroup.c        | 12 ++++++------
 kernel/bpf/core.c          | 14 +++++++-------
 kernel/bpf/offload.c       |  4 ++--
 net/bpf/test_run.c         | 21 ++++++++++++---------
 net/packet/af_packet.c     |  7 +++++--
 8 files changed, 62 insertions(+), 56 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 2bd1b5f8de9b..e975f89c491b 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -23,12 +23,12 @@ struct ctl_table;
 struct ctl_table_header;
 struct task_struct;
 
-unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx,
-				       const struct bpf_insn *insn);
-unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx,
-					 const struct bpf_insn *insn);
-unsigned int __cgroup_bpf_run_lsm_current(const void *ctx,
-					  const struct bpf_insn *insn);
+u64 __cgroup_bpf_run_lsm_sock(const void *ctx,
+			      const struct bpf_insn *insn);
+u64 __cgroup_bpf_run_lsm_socket(const void *ctx,
+				const struct bpf_insn *insn);
+u64 __cgroup_bpf_run_lsm_current(const void *ctx,
+				 const struct bpf_insn *insn);
 
 #ifdef CONFIG_CGROUP_BPF
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d877d9825e77..ebe6f2d95182 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -56,8 +56,8 @@ typedef u64 (*bpf_callback_t)(u64, u64, u64, u64, u64);
 typedef int (*bpf_iter_init_seq_priv_t)(void *private_data,
 					struct bpf_iter_aux_info *aux);
 typedef void (*bpf_iter_fini_seq_priv_t)(void *private_data);
-typedef unsigned int (*bpf_func_t)(const void *,
-				   const struct bpf_insn *);
+typedef u64 (*bpf_func_t)(const void *,
+			  const struct bpf_insn *);
 struct bpf_iter_seq_info {
 	const struct seq_operations *seq_ops;
 	bpf_iter_init_seq_priv_t init_seq_private;
@@ -882,7 +882,7 @@ struct bpf_dispatcher {
 	struct bpf_ksym ksym;
 };
 
-static __always_inline __nocfi unsigned int bpf_dispatcher_nop_func(
+static __always_inline __nocfi u64 bpf_dispatcher_nop_func(
 	const void *ctx,
 	const struct bpf_insn *insnsi,
 	bpf_func_t bpf_func)
@@ -911,7 +911,7 @@ int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs);
 }
 
 #define DEFINE_BPF_DISPATCHER(name)					\
-	noinline __nocfi unsigned int bpf_dispatcher_##name##_func(	\
+	noinline __nocfi u64 bpf_dispatcher_##name##_func(		\
 		const void *ctx,					\
 		const struct bpf_insn *insnsi,				\
 		bpf_func_t bpf_func)					\
@@ -922,7 +922,7 @@ int arch_prepare_bpf_dispatcher(void *image, s64 *funcs, int num_funcs);
 	struct bpf_dispatcher bpf_dispatcher_##name =			\
 		BPF_DISPATCHER_INIT(bpf_dispatcher_##name);
 #define DECLARE_BPF_DISPATCHER(name)					\
-	unsigned int bpf_dispatcher_##name##_func(			\
+	u64 bpf_dispatcher_##name##_func(				\
 		const void *ctx,					\
 		const struct bpf_insn *insnsi,				\
 		bpf_func_t bpf_func);					\
@@ -1127,7 +1127,7 @@ struct bpf_prog {
 	u8			tag[BPF_TAG_SIZE];
 	struct bpf_prog_stats __percpu *stats;
 	int __percpu		*active;
-	unsigned int		(*bpf_func)(const void *ctx,
+	u64			(*bpf_func)(const void *ctx,
 					    const struct bpf_insn *insn);
 	struct bpf_prog_aux	*aux;		/* Auxiliary fields */
 	struct sock_fprog_kern	*orig_prog;	/* Original BPF program */
@@ -1472,7 +1472,7 @@ static inline void bpf_reset_run_ctx(struct bpf_run_ctx *old_ctx)
 /* BPF program asks to set CN on the packet. */
 #define BPF_RET_SET_CN						(1 << 0)
 
-typedef u32 (*bpf_prog_run_fn)(const struct bpf_prog *prog, const void *ctx);
+typedef u64 (*bpf_prog_run_fn)(const struct bpf_prog *prog, const void *ctx);
 
 static __always_inline u32
 bpf_prog_run_array(const struct bpf_prog_array *array,
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 10167ab1ef95..b0ddb647d5f2 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -567,16 +567,16 @@ struct sk_filter {
 
 DECLARE_STATIC_KEY_FALSE(bpf_stats_enabled_key);
 
-typedef unsigned int (*bpf_dispatcher_fn)(const void *ctx,
-					  const struct bpf_insn *insnsi,
-					  unsigned int (*bpf_func)(const void *,
-								   const struct bpf_insn *));
+typedef u64 (*bpf_dispatcher_fn)(const void *ctx,
+				 const struct bpf_insn *insnsi,
+				 u64 (*bpf_func)(const void *,
+						 const struct bpf_insn *));
 
-static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
+static __always_inline u64 __bpf_prog_run(const struct bpf_prog *prog,
 					  const void *ctx,
 					  bpf_dispatcher_fn dfunc)
 {
-	u32 ret;
+	u64 ret;
 
 	cant_migrate();
 	if (static_branch_unlikely(&bpf_stats_enabled_key)) {
@@ -596,7 +596,7 @@ static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
 	return ret;
 }
 
-static __always_inline u32 bpf_prog_run(const struct bpf_prog *prog, const void *ctx)
+static __always_inline u64 bpf_prog_run(const struct bpf_prog *prog, const void *ctx)
 {
 	return __bpf_prog_run(prog, ctx, bpf_dispatcher_nop_func);
 }
@@ -609,10 +609,10 @@ static __always_inline u32 bpf_prog_run(const struct bpf_prog *prog, const void
  * invocation of a BPF program does not require reentrancy protection
  * against a BPF program which is invoked from a preempting task.
  */
-static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
+static inline u64 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
 					  const void *ctx)
 {
-	u32 ret;
+	u64 ret;
 
 	migrate_disable();
 	ret = bpf_prog_run(prog, ctx);
@@ -708,13 +708,13 @@ static inline u8 *bpf_skb_cb(const struct sk_buff *skb)
 }
 
 /* Must be invoked with migration disabled */
-static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog,
+static inline u64 __bpf_prog_run_save_cb(const struct bpf_prog *prog,
 					 const void *ctx)
 {
 	const struct sk_buff *skb = ctx;
 	u8 *cb_data = bpf_skb_cb(skb);
 	u8 cb_saved[BPF_SKB_CB_LEN];
-	u32 res;
+	u64 res;
 
 	if (unlikely(prog->cb_access)) {
 		memcpy(cb_saved, cb_data, sizeof(cb_saved));
@@ -729,10 +729,10 @@ static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog,
 	return res;
 }
 
-static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog,
+static inline u64 bpf_prog_run_save_cb(const struct bpf_prog *prog,
 				       struct sk_buff *skb)
 {
-	u32 res;
+	u64 res;
 
 	migrate_disable();
 	res = __bpf_prog_run_save_cb(prog, skb);
@@ -740,11 +740,11 @@ static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog,
 	return res;
 }
 
-static inline u32 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
+static inline u64 bpf_prog_run_clear_cb(const struct bpf_prog *prog,
 					struct sk_buff *skb)
 {
 	u8 *cb_data = bpf_skb_cb(skb);
-	u32 res;
+	u64 res;
 
 	if (unlikely(prog->cb_access))
 		memset(cb_data, 0, BPF_SKB_CB_LEN);
@@ -759,14 +759,14 @@ DECLARE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
 
 u32 xdp_master_redirect(struct xdp_buff *xdp);
 
-static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
+static __always_inline u64 bpf_prog_run_xdp(const struct bpf_prog *prog,
 					    struct xdp_buff *xdp)
 {
 	/* Driver XDP hooks are invoked within a single NAPI poll cycle and thus
 	 * under local_bh_disable(), which provides the needed RCU protection
 	 * for accessing map entries.
 	 */
-	u32 act = __bpf_prog_run(prog, xdp, BPF_DISPATCHER_FUNC(xdp));
+	u64 act = __bpf_prog_run(prog, xdp, BPF_DISPATCHER_FUNC(xdp));
 
 	if (static_branch_unlikely(&bpf_master_redirect_enabled_key)) {
 		if (act == XDP_TX && netif_is_bond_slave(xdp->rxq->dev))
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 59b7eb60d5b4..1721b09d0838 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -63,8 +63,8 @@ bpf_prog_run_array_cg(const struct cgroup_bpf *cgrp,
 	return run_ctx.retval;
 }
 
-unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx,
-				       const struct bpf_insn *insn)
+u64 __cgroup_bpf_run_lsm_sock(const void *ctx,
+			      const struct bpf_insn *insn)
 {
 	const struct bpf_prog *shim_prog;
 	struct sock *sk;
@@ -85,8 +85,8 @@ unsigned int __cgroup_bpf_run_lsm_sock(const void *ctx,
 	return ret;
 }
 
-unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx,
-					 const struct bpf_insn *insn)
+u64 __cgroup_bpf_run_lsm_socket(const void *ctx,
+				const struct bpf_insn *insn)
 {
 	const struct bpf_prog *shim_prog;
 	struct socket *sock;
@@ -107,8 +107,8 @@ unsigned int __cgroup_bpf_run_lsm_socket(const void *ctx,
 	return ret;
 }
 
-unsigned int __cgroup_bpf_run_lsm_current(const void *ctx,
-					  const struct bpf_insn *insn)
+u64 __cgroup_bpf_run_lsm_current(const void *ctx,
+				 const struct bpf_insn *insn)
 {
 	const struct bpf_prog *shim_prog;
 	struct cgroup *cgrp;
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 805c2ad5c793..a94dbb822f11 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2039,7 +2039,7 @@ static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn)
 
 #define PROG_NAME(stack_size) __bpf_prog_run##stack_size
 #define DEFINE_BPF_PROG_RUN(stack_size) \
-static unsigned int PROG_NAME(stack_size)(const void *ctx, const struct bpf_insn *insn) \
+static u64 PROG_NAME(stack_size)(const void *ctx, const struct bpf_insn *insn) \
 { \
 	u64 stack[stack_size / sizeof(u64)]; \
 	u64 regs[MAX_BPF_EXT_REG]; \
@@ -2083,8 +2083,8 @@ EVAL4(DEFINE_BPF_PROG_RUN_ARGS, 416, 448, 480, 512);
 
 #define PROG_NAME_LIST(stack_size) PROG_NAME(stack_size),
 
-static unsigned int (*interpreters[])(const void *ctx,
-				      const struct bpf_insn *insn) = {
+static u64 (*interpreters[])(const void *ctx,
+			     const struct bpf_insn *insn) = {
 EVAL6(PROG_NAME_LIST, 32, 64, 96, 128, 160, 192)
 EVAL6(PROG_NAME_LIST, 224, 256, 288, 320, 352, 384)
 EVAL4(PROG_NAME_LIST, 416, 448, 480, 512)
@@ -2109,8 +2109,8 @@ void bpf_patch_call_args(struct bpf_insn *insn, u32 stack_depth)
 }
 
 #else
-static unsigned int __bpf_prog_ret0_warn(const void *ctx,
-					 const struct bpf_insn *insn)
+static u64 __bpf_prog_ret0_warn(const void *ctx,
+				const struct bpf_insn *insn)
 {
 	/* If this handler ever gets executed, then BPF_JIT_ALWAYS_ON
 	 * is not working properly, so warn about it!
@@ -2245,8 +2245,8 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 }
 EXPORT_SYMBOL_GPL(bpf_prog_select_runtime);
 
-static unsigned int __bpf_prog_ret1(const void *ctx,
-				    const struct bpf_insn *insn)
+static u64 __bpf_prog_ret1(const void *ctx,
+			   const struct bpf_insn *insn)
 {
 	return 1;
 }
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index bd09290e3648..fabda7ed5dd0 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -246,8 +246,8 @@ static int bpf_prog_offload_translate(struct bpf_prog *prog)
 	return ret;
 }
 
-static unsigned int bpf_prog_warn_on_exec(const void *ctx,
-					  const struct bpf_insn *insn)
+static u64 bpf_prog_warn_on_exec(const void *ctx,
+				 const struct bpf_insn *insn)
 {
 	WARN(1, "attempt to execute device eBPF program on the host!");
 	return 0;
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2ca96acbc50a..f05d13717430 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -370,7 +370,7 @@ static int bpf_test_run_xdp_live(struct bpf_prog *prog, struct xdp_buff *ctx,
 }
 
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
-			u32 *retval, u32 *time, bool xdp)
+			u64 *retval, u32 *time, bool xdp)
 {
 	struct bpf_prog_array_item item = {.prog = prog};
 	struct bpf_run_ctx *old_ctx;
@@ -769,7 +769,7 @@ int bpf_prog_test_run_tracing(struct bpf_prog *prog,
 	struct bpf_fentry_test_t arg = {};
 	u16 side_effect = 0, ret = 0;
 	int b = 2, err = -EFAULT;
-	u32 retval = 0;
+	u64 retval = 0;
 
 	if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size)
 		return -EINVAL;
@@ -809,7 +809,7 @@ int bpf_prog_test_run_tracing(struct bpf_prog *prog,
 struct bpf_raw_tp_test_run_info {
 	struct bpf_prog *prog;
 	void *ctx;
-	u32 retval;
+	u64 retval;
 };
 
 static void
@@ -1054,15 +1054,15 @@ int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 			  union bpf_attr __user *uattr)
 {
 	bool is_l2 = false, is_direct_pkt_access = false;
+	u32 size = kattr->test.data_size_in, duration;
 	struct net *net = current->nsproxy->net_ns;
 	struct net_device *dev = net->loopback_dev;
-	u32 size = kattr->test.data_size_in;
 	u32 repeat = kattr->test.repeat;
 	struct __sk_buff *ctx = NULL;
-	u32 retval, duration;
 	int hh_len = ETH_HLEN;
 	struct sk_buff *skb;
 	struct sock *sk;
+	u64 retval;
 	void *data;
 	int ret;
 
@@ -1250,15 +1250,16 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	bool do_live = (kattr->test.flags & BPF_F_TEST_XDP_LIVE_FRAMES);
 	u32 tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	u32 batch_size = kattr->test.batch_size;
-	u32 retval = 0, duration, max_data_sz;
 	u32 size = kattr->test.data_size_in;
 	u32 headroom = XDP_PACKET_HEADROOM;
 	u32 repeat = kattr->test.repeat;
 	struct netdev_rx_queue *rxqueue;
 	struct skb_shared_info *sinfo;
+	u32 duration, max_data_sz;
 	struct xdp_buff xdp = {};
 	int i, ret = -EINVAL;
 	struct xdp_md *ctx;
+	u64 retval = 0;
 	void *data;
 
 	if (prog->expected_attach_type == BPF_XDP_DEVMAP ||
@@ -1416,7 +1417,8 @@ int bpf_prog_test_run_flow_dissector(struct bpf_prog *prog,
 	struct bpf_flow_keys flow_keys;
 	const struct ethhdr *eth;
 	unsigned int flags = 0;
-	u32 retval, duration;
+	u32 duration;
+	u64 retval;
 	void *data;
 	int ret;
 
@@ -1481,8 +1483,9 @@ int bpf_prog_test_run_sk_lookup(struct bpf_prog *prog, const union bpf_attr *kat
 	struct bpf_sk_lookup_kern ctx = {};
 	u32 repeat = kattr->test.repeat;
 	struct bpf_sk_lookup *user_ctx;
-	u32 retval, duration;
 	int ret = -EINVAL;
+	u32 duration;
+	u64 retval;
 
 	if (kattr->test.flags || kattr->test.cpu || kattr->test.batch_size)
 		return -EINVAL;
@@ -1580,8 +1583,8 @@ int bpf_prog_test_run_syscall(struct bpf_prog *prog,
 	void __user *ctx_in = u64_to_user_ptr(kattr->test.ctx_in);
 	__u32 ctx_size_in = kattr->test.ctx_size_in;
 	void *ctx = NULL;
-	u32 retval;
 	int err = 0;
+	u64 retval;
 
 	/* doesn't support data_in/out, ctx_out, duration, or repeat or flags */
 	if (kattr->test.data_in || kattr->test.data_out ||
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d08c4728523b..5b91f712d246 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -1444,8 +1444,11 @@ static unsigned int fanout_demux_bpf(struct packet_fanout *f,
 
 	rcu_read_lock();
 	prog = rcu_dereference(f->bpf_prog);
-	if (prog)
-		ret = bpf_prog_run_clear_cb(prog, skb) % num;
+	if (prog) {
+		ret = bpf_prog_run_clear_cb(prog, skb);
+		/* For some architectures, we need to do modulus in 32-bit width */
+		ret %= num;
+	}
 	rcu_read_unlock();
 
 	return ret;
-- 
2.37.0


* [RFC PATCH 04/17] bpf: Add a PIFO priority queue map type
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Eric Dumazet, Paolo Abeni
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

The PIFO (Push-In First-Out) data structure is a priority queue where
entries can be added at any point in the queue, but dequeue is always
in-order. The primary application is packet queueing, but with arbitrary
data it can also be used as a generic priority queue data structure.

This patch implements two variants of the PIFO map: a generic priority
queue and a queueing structure for XDP frames. The former,
BPF_MAP_TYPE_PIFO_GENERIC, is a generic priority queue that BPF programs
can use via the bpf_map_push_elem() and bpf_map_pop_elem() helpers to
insert and dequeue items. When pushing items, the lower 60 bits of the
helper's flags argument are interpreted as the priority.

The BPF_MAP_TYPE_PIFO_XDP is a priority queue that stores XDP frames, which
are added to the map using the bpf_redirect_map() helper, where the map
index is used as the priority. Frames can be dequeued by a separate
dequeue program type, added in a later commit.

The two variants of the PIFO share most of their implementation. The user
selects the maximum number of entries stored in the map (using the regular
max_entries parameter), as well as the range of valid priorities, where the
latter is expressed using the lower 32 bits of the map_extra parameter; the
range must be a power of two.

Each priority can have multiple entries queued, in which case entries are
stored in FIFO order of enqueue. The implementation uses a tree of
word-sized bitmaps as an index of which buckets contain any items. This
allows fast lookups: finding the next item to dequeue requires log_k(N)
"find first set" operations, where N is the range of the map (chosen at
setup time), and k is the number of bits in the native 'unsigned long'
type (so either 32 or 64).
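
For illustration, using the generic variant from BPF might look like the
sketch below (the map definition follows the constraints described above;
the values are arbitrary):

  struct {
          __uint(type, BPF_MAP_TYPE_PIFO_GENERIC);
          __uint(key_size, sizeof(__u32));
          __uint(value_size, sizeof(__u64));
          __uint(max_entries, 256);
          __uint(map_extra, 256); /* priority range; a power of two */
  } prio_queue SEC(".maps");

  SEC("xdp")
  int pifo_generic_example(struct xdp_md *ctx)
  {
          __u64 val = 42, out;

          /* the lower 60 bits of the flags argument carry the priority */
          bpf_map_push_elem(&prio_queue, &val, 7);

          /* pop always dequeues from the head, i.e. the lowest rank */
          bpf_map_pop_elem(&prio_queue, &out);
          return XDP_PASS;
  }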

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/bpf.h            |  13 +
 include/linux/bpf_types.h      |   2 +
 include/net/xdp.h              |   5 +-
 include/uapi/linux/bpf.h       |  13 +
 kernel/bpf/Makefile            |   2 +-
 kernel/bpf/pifomap.c           | 581 +++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           |   2 +
 kernel/bpf/verifier.c          |  10 +-
 net/core/filter.c              |   7 +
 tools/include/uapi/linux/bpf.h |  13 +
 10 files changed, 645 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/pifomap.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ebe6f2d95182..ea994acebb81 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1849,6 +1849,9 @@ int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf,
 int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
 			     struct sk_buff *skb);
 
+int pifo_map_enqueue(struct bpf_map *map, struct xdp_frame *xdpf, u32 index);
+struct xdp_frame *pifo_map_dequeue(struct bpf_map *map, u64 flags, u64 *rank);
+
 /* Return map's numa specified by userspace */
 static inline int bpf_map_attr_numa_node(const union bpf_attr *attr)
 {
@@ -2081,6 +2084,16 @@ static inline int cpu_map_generic_redirect(struct bpf_cpu_map_entry *rcpu,
 	return -EOPNOTSUPP;
 }
 
+static inline int pifo_map_enqueue(struct bpf_map *map, struct xdp_frame *xdp, u32 index)
+{
+	return 0;
+}
+
+static inline struct xdp_frame *pifo_map_dequeue(struct bpf_map *map, u64 flags, u64 *rank)
+{
+	return NULL;
+}
+
 static inline struct bpf_prog *bpf_prog_get_type_path(const char *name,
 				enum bpf_prog_type type)
 {
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 2b9112b80171..26ef981a8aa5 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -105,11 +105,13 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_INODE_STORAGE, inode_storage_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_TASK_STORAGE, task_storage_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_PIFO_GENERIC, pifo_generic_map_ops)
 #ifdef CONFIG_NET
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP_HASH, dev_map_hash_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_PIFO_XDP, pifo_xdp_map_ops)
 #if defined(CONFIG_XDP_SOCKETS)
 BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
 #endif
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 04c852c7a77f..7c694fb26f34 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -170,7 +170,10 @@ struct xdp_frame {
 	 * while mem info is valid on remote CPU.
 	 */
 	struct xdp_mem_info mem;
-	struct net_device *dev_rx; /* used by cpumap */
+	union {
+		struct net_device *dev_rx; /* used by cpumap */
+		struct xdp_frame *next; /* used by pifomap */
+	};
 	u32 flags; /* supported values defined in xdp_buff_flags */
 };
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aec623f60048..f0947ddee784 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -909,6 +909,8 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_PIFO_GENERIC,
+	BPF_MAP_TYPE_PIFO_XDP,
 };
 
 /* Note that tracing related programs such as
@@ -1244,6 +1246,13 @@ enum {
 /* If set, XDP frames will be transmitted after processing */
 #define BPF_F_TEST_XDP_LIVE_FRAMES	(1U << 1)
 
+/* Flags for BPF_MAP_TYPE_PIFO_* */
+
+/* Used for flags argument of bpf_map_push_elem(); reserve top four bits for
+ * actual flags, the rest is the enqueue priority
+ */
+#define BPF_PIFO_PRIO_MASK	(~0ULL >> 4)
+
 /* type for BPF_ENABLE_STATS */
 enum bpf_stats_type {
 	/* enabled run_time_ns and run_cnt */
@@ -1298,6 +1307,10 @@ union bpf_attr {
 		 * BPF_MAP_TYPE_BLOOM_FILTER - the lowest 4 bits indicate the
 		 * number of hash functions (if 0, the bloom filter will default
 		 * to using 5 hash functions).
+		 *
+		 * BPF_MAP_TYPE_PIFO_* - the lower 32 bits indicate the valid
+		 * range of priorities for entries enqueued in the map. Must be
+		 * a power of two.
 		 */
 		__u64	map_extra;
 	};
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 057ba8e01e70..e66b4d0d3135 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -7,7 +7,7 @@ endif
 CFLAGS_core.o += $(call cc-disable-warning, override-init) $(cflags-nogcse-yy)
 
 obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o link_iter.o
-obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o
+obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o bloom_filter.o pifomap.o
 obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
diff --git a/kernel/bpf/pifomap.c b/kernel/bpf/pifomap.c
new file mode 100644
index 000000000000..5040f532e5d8
--- /dev/null
+++ b/kernel/bpf/pifomap.c
@@ -0,0 +1,581 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+/* PIFO maps: priority queues for XDP frames and for generic data
+ */
+#include <linux/spinlock.h>
+#include <linux/bpf.h>
+#include <linux/bitops.h>
+#include <linux/btf_ids.h>
+#include <net/xdp.h>
+#include <linux/filter.h>
+#include <trace/events/xdp.h>
+
+#define PIFO_CREATE_FLAG_MASK \
+	(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
+
+struct bpf_pifo_element {
+	struct bpf_pifo_element *next;
+	char data[];
+};
+
+union bpf_pifo_item {
+	struct bpf_pifo_element elem;
+	struct xdp_frame frame;
+};
+
+struct bpf_pifo_element_cache {
+	u32 free_elems;
+	struct bpf_pifo_element *elements[];
+};
+
+struct bpf_pifo_bucket {
+	union bpf_pifo_item *head, *tail;
+	u32 elem_count;
+};
+
+struct bpf_pifo_queue {
+	struct bpf_pifo_bucket *buckets;
+	unsigned long *bitmap;
+	unsigned long **lvl_bitmap;
+	u64 min_rank;
+	u32 range;
+	u32 levels;
+};
+
+struct bpf_pifo_map {
+	struct bpf_map map;
+	struct bpf_pifo_queue *queue;
+	unsigned long num_queued;
+	spinlock_t lock; /* protects enqueue / dequeue */
+
+	size_t elem_size;
+	struct bpf_pifo_element_cache *elem_cache;
+	char elements[] __aligned(8);
+};
+
+static struct bpf_pifo_element *elem_cache_get(struct bpf_pifo_element_cache *cache)
+{
+	if (unlikely(!cache->free_elems))
+		return NULL;
+	return cache->elements[--cache->free_elems];
+}
+
+static void elem_cache_put(struct bpf_pifo_element_cache *cache,
+			   struct bpf_pifo_element *elem)
+{
+	cache->elements[cache->free_elems++] = elem;
+}
+
+static bool pifo_map_is_full(struct bpf_pifo_map *pifo)
+{
+	return pifo->num_queued >= pifo->map.max_entries;
+}
+
+static void pifo_queue_free(struct bpf_pifo_queue *q)
+{
+	bpf_map_area_free(q->buckets);
+	bpf_map_area_free(q->bitmap);
+	bpf_map_area_free(q->lvl_bitmap);
+	kfree(q);
+}
+
+static struct bpf_pifo_queue *pifo_queue_alloc(u32 range, int numa_node)
+{
+	u32 num_longs = 0, offset = 0, i, lvl, levels;
+	struct bpf_pifo_queue *q;
+
+	levels = __KERNEL_DIV_ROUND_UP(ilog2(range), ilog2(BITS_PER_TYPE(long)));
+	for (i = 0, lvl = 1; i < levels; i++) {
+		num_longs += lvl;
+		lvl *= BITS_PER_TYPE(long);
+	}
+
+	q = kzalloc(sizeof(*q), GFP_USER | __GFP_ACCOUNT);
+	if (!q)
+		return NULL;
+	q->buckets = bpf_map_area_alloc(sizeof(struct bpf_pifo_bucket) * range,
+					numa_node);
+	if (!q->buckets)
+		goto err;
+
+	q->bitmap = bpf_map_area_alloc(sizeof(unsigned long) * num_longs,
+				       numa_node);
+	if (!q->bitmap)
+		goto err;
+
+	q->lvl_bitmap = bpf_map_area_alloc(sizeof(unsigned long *) * levels,
+					   numa_node);
+	if (!q->lvl_bitmap)
+		goto err;
+
+	for (i = 0, lvl = 1; i < levels; i++) {
+		q->lvl_bitmap[i] = &q->bitmap[offset];
+		offset += lvl;
+		lvl *= BITS_PER_TYPE(long);
+	}
+	q->levels = levels;
+	q->range = range;
+	return q;
+
+err:
+	pifo_queue_free(q);
+	return NULL;
+}
+
+static int pifo_map_init_map(struct bpf_pifo_map *pifo, union bpf_attr *attr,
+			     size_t elem_size, u32 range)
+{
+	int err = -ENOMEM;
+
+	/* Packet map is special, we don't want BPF writing straight to it
+	 */
+	if (attr->map_type != BPF_MAP_TYPE_PIFO_GENERIC)
+		attr->map_flags |= BPF_F_RDONLY_PROG;
+
+	bpf_map_init_from_attr(&pifo->map, attr);
+
+	pifo->queue = pifo_queue_alloc(range, pifo->map.numa_node);
+	if (!pifo->queue)
+		return -ENOMEM;
+
+	if (attr->map_type == BPF_MAP_TYPE_PIFO_GENERIC) {
+		size_t cache_size;
+		int i;
+
+		cache_size = sizeof(void *) * attr->max_entries +
+			sizeof(struct bpf_pifo_element_cache);
+		pifo->elem_cache = bpf_map_area_alloc(cache_size,
+						      pifo->map.numa_node);
+		if (!pifo->elem_cache)
+			goto err_queue;
+
+		for (i = 0; i < attr->max_entries; i++)
+			pifo->elem_cache->elements[i] = (void *)&pifo->elements[i * elem_size];
+		pifo->elem_cache->free_elems = attr->max_entries;
+	}
+
+	return 0;
+
+err_queue:
+	pifo_queue_free(pifo->queue);
+	return err;
+}
+
+static struct bpf_map *pifo_map_alloc(union bpf_attr *attr)
+{
+	int numa_node = bpf_map_attr_numa_node(attr);
+	size_t size, elem_size = 0;
+	struct bpf_pifo_map *pifo;
+	u32 range;
+	int err;
+
+	if (!capable(CAP_NET_ADMIN))
+		return ERR_PTR(-EPERM);
+
+	if ((attr->map_type == BPF_MAP_TYPE_PIFO_XDP && attr->value_size != 4) ||
+	    attr->key_size != 4 || attr->map_extra & ~0xFFFFFFFFULL ||
+	    attr->map_flags & ~PIFO_CREATE_FLAG_MASK)
+		return ERR_PTR(-EINVAL);
+
+	range = attr->map_extra;
+	if (!range || !is_power_of_2(range))
+		return ERR_PTR(-EINVAL);
+
+	if (attr->map_type == BPF_MAP_TYPE_PIFO_GENERIC) {
+		elem_size = (attr->value_size + sizeof(struct bpf_pifo_element));
+		if (elem_size > U32_MAX / attr->max_entries)
+			return ERR_PTR(-E2BIG);
+	}
+
+	size = sizeof(*pifo) + attr->max_entries * elem_size;
+	pifo = bpf_map_area_alloc(size, numa_node);
+	if (!pifo)
+		return ERR_PTR(-ENOMEM);
+
+	err = pifo_map_init_map(pifo, attr, elem_size, range);
+	if (err) {
+		bpf_map_area_free(pifo);
+		return ERR_PTR(err);
+	}
+
+	spin_lock_init(&pifo->lock);
+	return &pifo->map;
+}
+
+static void pifo_queue_flush(struct bpf_pifo_queue *queue)
+{
+#ifdef CONFIG_NET
+	unsigned long *bitmap = queue->lvl_bitmap[queue->levels - 1];
+	int i = 0;
+
+	/* this is only ever called in the RCU callback when freeing the map, so
+	 * no need for locking
+	 */
+	while (i < queue->range) {
+		struct bpf_pifo_bucket *bucket = &queue->buckets[i];
+		struct xdp_frame *frame = &bucket->head->frame, *next;
+
+		while (frame) {
+			next = frame->next;
+			xdp_return_frame(frame);
+			frame = next;
+		}
+		i = find_next_bit(bitmap, queue->range, i + 1);
+	}
+#endif
+}
+
+static void pifo_map_free(struct bpf_map *map)
+{
+	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
+
+	/* At this point bpf_prog->aux->refcnt == 0 and this map->refcnt == 0,
+	 * so the programs (can be more than one that used this map) were
+	 * disconnected from events. The following synchronize_rcu() guarantees
+	 * both rcu read critical sections complete and waits for
+	 * preempt-disable regions (NAPI being the relevant context here) so we
+	 * are certain there will be no further reads against the netdev_map and
+	 * all flush operations are complete. Flush operations can only be done
+	 * from NAPI context for this reason.
+	 */
+
+	synchronize_rcu();
+
+	if (map->map_type == BPF_MAP_TYPE_PIFO_XDP)
+		pifo_queue_flush(pifo->queue);
+	pifo_queue_free(pifo->queue);
+	bpf_map_area_free(pifo->elem_cache);
+	bpf_map_area_free(pifo);
+}
+
+static int pifo_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
+{
+	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
+	u32 index = key ? *(u32 *)key : U32_MAX, offset;
+	struct bpf_pifo_queue *queue = pifo->queue;
+	unsigned long idx, flags;
+	u32 *next = next_key;
+	int ret = -ENOENT;
+
+	spin_lock_irqsave(&pifo->lock, flags);
+
+	if (index == U32_MAX || index < queue->min_rank)
+		offset = 0;
+	else
+		offset = index - queue->min_rank + 1;
+
+	if (offset >= queue->range)
+		goto out;
+
+	idx = find_next_bit(queue->lvl_bitmap[queue->levels - 1],
+			    queue->range, offset);
+	if (idx == queue->range)
+		goto out;
+
+	*next = idx;
+	ret = 0;
+out:
+	spin_unlock_irqrestore(&pifo->lock, flags);
+	return ret;
+}
+
+static void pifo_set_bit(struct bpf_pifo_queue *queue, u32 rank)
+{
+	u32 i;
+
+	for (i = queue->levels; i > 0; i--) {
+		unsigned long *bitmap = queue->lvl_bitmap[i - 1];
+
+		set_bit(rank, bitmap);
+		rank /= BITS_PER_TYPE(long);
+	}
+}
+
+static void pifo_clear_bit(struct bpf_pifo_queue *queue, u32 rank)
+{
+	u32 i;
+
+	for (i = queue->levels; i > 0; i--) {
+		unsigned long *bitmap = queue->lvl_bitmap[i - 1];
+
+		clear_bit(rank, bitmap);
+		rank /= BITS_PER_TYPE(long);
+
+		// another bit is set in this word, don't clear bit in higher
+		// level
+		if (*(bitmap + rank))
+			break;
+	}
+}
+
+static void pifo_item_set_next(union bpf_pifo_item *item, void *next, bool xdp)
+{
+	if (xdp)
+		item->frame.next = next;
+	else
+		item->elem.next = next;
+}
+
+static int __pifo_map_enqueue(struct bpf_pifo_map *pifo, union bpf_pifo_item *item,
+			      u64 rank, bool xdp)
+{
+	struct bpf_pifo_queue *queue = pifo->queue;
+	struct bpf_pifo_bucket *bucket;
+	u64 q_index;
+
+	lockdep_assert_held(&pifo->lock);
+
+	if (unlikely(pifo_map_is_full(pifo)))
+		return -EOVERFLOW;
+
+	if (rank < queue->min_rank)
+		return -ERANGE;
+
+	pifo_item_set_next(item, NULL, xdp);
+
+	q_index = rank - queue->min_rank;
+	if (unlikely(q_index >= queue->range))
+		q_index = queue->range - 1;
+
+	bucket = &queue->buckets[q_index];
+	if (likely(!bucket->head)) {
+		bucket->head = item;
+		bucket->tail = item;
+		pifo_set_bit(queue, q_index);
+	} else {
+		pifo_item_set_next(bucket->tail, item, xdp);
+		bucket->tail = item;
+	}
+
+	pifo->num_queued++;
+	bucket->elem_count++;
+	return 0;
+}
+
+int pifo_map_enqueue(struct bpf_map *map, struct xdp_frame *xdpf, u32 index)
+{
+	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
+	int ret;
+
+	/* called under local_bh_disable() so no need to use irqsave variant */
+	spin_lock(&pifo->lock);
+	ret = __pifo_map_enqueue(pifo, (union bpf_pifo_item *)xdpf, index, true);
+	spin_unlock(&pifo->lock);
+
+	return ret;
+}
+
+static unsigned long pifo_find_first_bucket(struct bpf_pifo_queue *queue)
+{
+	unsigned long *bitmap, bit = 0, offset = 0;
+	int i;
+
+	for (i = 0; i < queue->levels; i++) {
+		bitmap = queue->lvl_bitmap[i] + offset;
+		if (!*bitmap)
+			return -1;
+		bit = __ffs(*bitmap);
+		offset = offset * BITS_PER_TYPE(long) + bit;
+	}
+	return offset;
+}
+
+static union bpf_pifo_item *__pifo_map_dequeue(struct bpf_pifo_map *pifo,
+					       u64 flags, u64 *rank, bool xdp)
+{
+	struct bpf_pifo_queue *queue = pifo->queue;
+	struct bpf_pifo_bucket *bucket;
+	union bpf_pifo_item *item;
+	unsigned long bucket_idx;
+
+	lockdep_assert_held(&pifo->lock);
+
+	if (flags) {
+		*rank = -EINVAL;
+		return NULL;
+	}
+
+	bucket_idx = pifo_find_first_bucket(queue);
+	if (bucket_idx == -1) {
+		*rank = -ENOENT;
+		return NULL;
+	}
+	bucket = &queue->buckets[bucket_idx];
+
+	if (WARN_ON_ONCE(!bucket->tail)) {
+		*rank = -EFAULT;
+		return NULL;
+	}
+
+	item = bucket->head;
+	if (xdp)
+		bucket->head = (union bpf_pifo_item *)item->frame.next;
+	else
+		bucket->head = (union bpf_pifo_item *)item->elem.next;
+
+	if (!bucket->head) {
+		bucket->tail = NULL;
+		pifo_clear_bit(queue, bucket_idx);
+	}
+	pifo->num_queued--;
+	bucket->elem_count--;
+
+	*rank = bucket_idx + queue->min_rank;
+	return item;
+}
+
+struct xdp_frame *pifo_map_dequeue(struct bpf_map *map, u64 flags, u64 *rank)
+{
+	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
+	union bpf_pifo_item *item;
+	unsigned long lflags;
+
+	spin_lock_irqsave(&pifo->lock, lflags);
+	item = __pifo_map_dequeue(pifo, flags, rank, true);
+	spin_unlock_irqrestore(&pifo->lock, lflags);
+
+	return item ? &item->frame : NULL;
+}
+
+static void *pifo_map_lookup_elem(struct bpf_map *map, void *key)
+{
+	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
+	struct bpf_pifo_queue *queue = pifo->queue;
+	struct bpf_pifo_bucket *bucket;
+	u32 rank =  *(u32 *)key, idx;
+
+	if (rank < queue->min_rank)
+		return NULL;
+
+	idx = rank - queue->min_rank;
+	if (idx >= queue->range)
+		return NULL;
+
+	bucket = &queue->buckets[idx];
+	/* FIXME: what happens if this changes while userspace is reading the
+	 * value
+	 */
+	return &bucket->elem_count;
+}
+
+static int pifo_map_push_elem(struct bpf_map *map, void *value, u64 flags)
+{
+	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
+	struct bpf_pifo_element *dst;
+	unsigned long irq_flags;
+	u64 prio;
+	int ret;
+
+	/* Check if any of the actual flag bits are set */
+	if (flags & ~BPF_PIFO_PRIO_MASK)
+		return -EINVAL;
+
+	prio = flags & BPF_PIFO_PRIO_MASK;
+
+	spin_lock_irqsave(&pifo->lock, irq_flags);
+
+	dst = elem_cache_get(pifo->elem_cache);
+	if (!dst) {
+		ret = -EOVERFLOW;
+		goto out;
+	}
+
+	memcpy(&dst->data, value, pifo->map.value_size);
+
+	ret = __pifo_map_enqueue(pifo, (union bpf_pifo_item *)dst, prio, false);
+	if (ret)
+		elem_cache_put(pifo->elem_cache, dst);
+
+out:
+	spin_unlock_irqrestore(&pifo->lock, irq_flags);
+	return ret;
+}
+
+static int pifo_map_pop_elem(struct bpf_map *map, void *value)
+{
+	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
+	union bpf_pifo_item *item;
+	unsigned long flags;
+	int err = 0;
+	u64 rank;
+
+	spin_lock_irqsave(&pifo->lock, flags);
+
+	item = __pifo_map_dequeue(pifo, 0, &rank, false);
+	if (!item) {
+		err = rank;
+		goto out;
+	}
+
+	memcpy(value, &item->elem.data, pifo->map.value_size);
+	elem_cache_put(pifo->elem_cache, &item->elem);
+
+out:
+	spin_unlock_irqrestore(&pifo->lock, flags);
+	return err;
+}
+
+static int pifo_map_update_elem(struct bpf_map *map, void *key, void *value,
+				u64 map_flags)
+{
+	return -EINVAL;
+}
+
+static int pifo_map_delete_elem(struct bpf_map *map, void *key)
+{
+	return -EINVAL;
+}
+
+static int pifo_map_peek_elem(struct bpf_map *map, void *value)
+{
+	return -EINVAL;
+}
+
+static int pifo_map_redirect(struct bpf_map *map, u64 index, u64 flags)
+{
+#ifdef CONFIG_NET
+	struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+	const u64 action_mask = XDP_ABORTED | XDP_DROP | XDP_PASS | XDP_TX;
+
+	/* Lower bits of the flags are used as return code on lookup failure */
+	if (unlikely(flags & ~action_mask))
+		return XDP_ABORTED;
+
+	ri->tgt_value = NULL;
+	ri->tgt_index = index;
+	ri->map_id = map->id;
+	ri->map_type = map->map_type;
+	ri->flags = flags;
+	WRITE_ONCE(ri->map, map);
+	return XDP_REDIRECT;
+#else
+	return XDP_ABORTED;
+#endif
+}
+
+BTF_ID_LIST_SINGLE(pifo_xdp_map_btf_ids, struct, bpf_pifo_map);
+const struct bpf_map_ops pifo_xdp_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc = pifo_map_alloc,
+	.map_free = pifo_map_free,
+	.map_get_next_key = pifo_map_get_next_key,
+	.map_lookup_elem = pifo_map_lookup_elem,
+	.map_update_elem = pifo_map_update_elem,
+	.map_delete_elem = pifo_map_delete_elem,
+	.map_check_btf = map_check_no_btf,
+	.map_btf_id = &pifo_xdp_map_btf_ids[0],
+	.map_redirect = pifo_map_redirect,
+};
+
+BTF_ID_LIST_SINGLE(pifo_generic_map_btf_ids, struct, bpf_pifo_map);
+const struct bpf_map_ops pifo_generic_map_ops = {
+	.map_meta_equal = bpf_map_meta_equal,
+	.map_alloc = pifo_map_alloc,
+	.map_free = pifo_map_free,
+	.map_get_next_key = pifo_map_get_next_key,
+	.map_lookup_elem = pifo_map_lookup_elem,
+	.map_update_elem = pifo_map_update_elem,
+	.map_delete_elem = pifo_map_delete_elem,
+	.map_push_elem = pifo_map_push_elem,
+	.map_pop_elem = pifo_map_pop_elem,
+	.map_peek_elem = pifo_map_peek_elem,
+	.map_check_btf = map_check_no_btf,
+	.map_btf_id = &pifo_generic_map_btf_ids[0],
+};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ab688d85b2c6..31899882e513 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1066,6 +1066,8 @@ static int map_create(union bpf_attr *attr)
 	}
 
 	if (attr->map_type != BPF_MAP_TYPE_BLOOM_FILTER &&
+	    attr->map_type != BPF_MAP_TYPE_PIFO_XDP &&
+	    attr->map_type != BPF_MAP_TYPE_PIFO_GENERIC &&
 	    attr->map_extra != 0)
 		return -EINVAL;
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 039f7b61c305..489ea3f368a1 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6249,6 +6249,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_MAP_TYPE_QUEUE:
 	case BPF_MAP_TYPE_STACK:
+	case BPF_MAP_TYPE_PIFO_GENERIC:
 		if (func_id != BPF_FUNC_map_peek_elem &&
 		    func_id != BPF_FUNC_map_pop_elem &&
 		    func_id != BPF_FUNC_map_push_elem)
@@ -6274,6 +6275,10 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		    func_id != BPF_FUNC_map_push_elem)
 			goto error;
 		break;
+	case BPF_MAP_TYPE_PIFO_XDP:
+		if (func_id != BPF_FUNC_redirect_map)
+			goto error;
+		break;
 	default:
 		break;
 	}
@@ -6318,6 +6323,7 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (map->map_type != BPF_MAP_TYPE_DEVMAP &&
 		    map->map_type != BPF_MAP_TYPE_DEVMAP_HASH &&
 		    map->map_type != BPF_MAP_TYPE_CPUMAP &&
+		    map->map_type != BPF_MAP_TYPE_PIFO_XDP &&
 		    map->map_type != BPF_MAP_TYPE_XSKMAP)
 			goto error;
 		break;
@@ -6346,13 +6352,15 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		break;
 	case BPF_FUNC_map_pop_elem:
 		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
-		    map->map_type != BPF_MAP_TYPE_STACK)
+		    map->map_type != BPF_MAP_TYPE_STACK &&
+		    map->map_type != BPF_MAP_TYPE_PIFO_GENERIC)
 			goto error;
 		break;
 	case BPF_FUNC_map_peek_elem:
 	case BPF_FUNC_map_push_elem:
 		if (map->map_type != BPF_MAP_TYPE_QUEUE &&
 		    map->map_type != BPF_MAP_TYPE_STACK &&
+		    map->map_type != BPF_MAP_TYPE_PIFO_GENERIC &&
 		    map->map_type != BPF_MAP_TYPE_BLOOM_FILTER)
 			goto error;
 		break;
diff --git a/net/core/filter.c b/net/core/filter.c
index e23e53ed1b04..8e6ea17a29db 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4236,6 +4236,13 @@ static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
 			err = dev_map_enqueue(fwd, xdpf, dev);
 		}
 		break;
+	case BPF_MAP_TYPE_PIFO_XDP:
+		map = READ_ONCE(ri->map);
+		if (unlikely(!map))
+			err = -EINVAL;
+		else
+			err = pifo_map_enqueue(map, xdpf, ri->tgt_index);
+		break;
 	case BPF_MAP_TYPE_CPUMAP:
 		err = cpu_map_enqueue(fwd, xdpf, dev);
 		break;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 379e68fb866f..623421377f6e 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -909,6 +909,8 @@ enum bpf_map_type {
 	BPF_MAP_TYPE_INODE_STORAGE,
 	BPF_MAP_TYPE_TASK_STORAGE,
 	BPF_MAP_TYPE_BLOOM_FILTER,
+	BPF_MAP_TYPE_PIFO_GENERIC,
+	BPF_MAP_TYPE_PIFO_XDP,
 };
 
 /* Note that tracing related programs such as
@@ -1244,6 +1246,13 @@ enum {
 /* If set, XDP frames will be transmitted after processing */
 #define BPF_F_TEST_XDP_LIVE_FRAMES	(1U << 1)
 
+/* Flags for BPF_MAP_TYPE_PIFO_* */
+
+/* Used for flags argument of bpf_map_push_elem(); reserve top four bits for
+ * actual flags, the rest is the enqueue priority
+ */
+#define BPF_PIFO_PRIO_MASK	(~0ULL >> 4)
+
 /* type for BPF_ENABLE_STATS */
 enum bpf_stats_type {
 	/* enabled run_time_ns and run_cnt */
@@ -1298,6 +1307,10 @@ union bpf_attr {
 		 * BPF_MAP_TYPE_BLOOM_FILTER - the lowest 4 bits indicate the
 		 * number of hash functions (if 0, the bloom filter will default
 		 * to using 5 hash functions).
+		 *
+		 * BPF_MAP_TYPE_PIFO_* - the lower 32 bits indicate the valid
+		 * range of priorities for entries enqueued in the map. Must be
+		 * a power of two.
 		 */
 		__u64	map_extra;
 	};
-- 
2.37.0



* [RFC PATCH 05/17] pifomap: Add queue rotation for continuously increasing rank mode
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (3 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 04/17] bpf: Add a PIFO priority queue map type Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 11:14 ` [RFC PATCH 06/17] xdp: Add dequeue program type for getting packets from a PIFO Toke Høiland-Jørgensen
                   ` (14 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

Amend the PIFO map so it can operate in a mode where the rank range
increases continuously. This works by allocating two underlying queues,
and queueing entries into the second one if the first one overflows.
When the primary queue runs empty and the secondary queue has entries,
swap the two queues and shift the operating range of the new secondary
queue to start just after that of the (new) primary. This way the map
can support a continuously increasing rank, for instance when indexing
by timestamps (see the sketch below).
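
As a rough sketch of how this is meant to be used (the map definition,
key/value sizes and the timestamp shift below are illustrative
assumptions, not part of this patch):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_PIFO_GENERIC);
            __uint(max_entries, 1024);
            __uint(map_extra, 4096); /* rank range; must be a power of two */
            __type(key, __u32);
            __type(value, __u64);
    } pifo SEC(".maps");

    SEC("xdp")
    int enqueue_ts(struct xdp_md *ctx)
    {
            /* Coarse timestamp as rank: keeps increasing forever, so the
             * two underlying queues keep rotating as the primary drains.
             */
            __u64 rank = bpf_ktime_get_ns() >> 20;
            __u64 val = 1;

            /* the rank rides in the priority bits of the flags argument */
            bpf_map_push_elem(&pifo, &val, rank & BPF_PIFO_PRIO_MASK);
            return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";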

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 kernel/bpf/pifomap.c | 96 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 75 insertions(+), 21 deletions(-)

diff --git a/kernel/bpf/pifomap.c b/kernel/bpf/pifomap.c
index 5040f532e5d8..62633c2c7419 100644
--- a/kernel/bpf/pifomap.c
+++ b/kernel/bpf/pifomap.c
@@ -6,6 +6,7 @@
 #include <linux/bpf.h>
 #include <linux/bitops.h>
 #include <linux/btf_ids.h>
+#include <linux/minmax.h>
 #include <net/xdp.h>
 #include <linux/filter.h>
 #include <trace/events/xdp.h>
@@ -44,7 +45,8 @@ struct bpf_pifo_queue {
 
 struct bpf_pifo_map {
 	struct bpf_map map;
-	struct bpf_pifo_queue *queue;
+	struct bpf_pifo_queue *q_primary;
+	struct bpf_pifo_queue *q_secondary;
 	unsigned long num_queued;
 	spinlock_t lock; /* protects enqueue / dequeue */
 
@@ -71,6 +73,12 @@ static bool pifo_map_is_full(struct bpf_pifo_map *pifo)
 	return pifo->num_queued >= pifo->map.max_entries;
 }
 
+static bool pifo_queue_is_empty(struct bpf_pifo_queue *queue)
+{
+	/* the first word in the bitmap is always the top-level bitmap */
+	return !queue->bitmap[0];
+}
+
 static void pifo_queue_free(struct bpf_pifo_queue *q)
 {
 	bpf_map_area_free(q->buckets);
@@ -79,7 +87,7 @@ static void pifo_queue_free(struct bpf_pifo_queue *q)
 	kfree(q);
 }
 
-static struct bpf_pifo_queue *pifo_queue_alloc(u32 range, int numa_node)
+static struct bpf_pifo_queue *pifo_queue_alloc(u32 range, u32 min_rank, int numa_node)
 {
 	u32 num_longs = 0, offset = 0, i, lvl, levels;
 	struct bpf_pifo_queue *q;
@@ -112,6 +120,7 @@ static struct bpf_pifo_queue *pifo_queue_alloc(u32 range, int numa_node)
 	}
 	q->levels = levels;
 	q->range = range;
+	q->min_rank = min_rank;
 	return q;
 
 err:
@@ -131,10 +140,14 @@ static int pifo_map_init_map(struct bpf_pifo_map *pifo, union bpf_attr *attr,
 
 	bpf_map_init_from_attr(&pifo->map, attr);
 
-	pifo->queue = pifo_queue_alloc(range, pifo->map.numa_node);
-	if (!pifo->queue)
+	pifo->q_primary = pifo_queue_alloc(range, 0, pifo->map.numa_node);
+	if (!pifo->q_primary)
 		return -ENOMEM;
 
+	pifo->q_secondary = pifo_queue_alloc(range, range, pifo->map.numa_node);
+	if (!pifo->q_secondary)
+		goto err_queue;
+
 	if (attr->map_type == BPF_MAP_TYPE_PIFO_GENERIC) {
 		size_t cache_size;
 		int i;
@@ -144,7 +157,7 @@ static int pifo_map_init_map(struct bpf_pifo_map *pifo, union bpf_attr *attr,
 		pifo->elem_cache = bpf_map_area_alloc(cache_size,
 						      pifo->map.numa_node);
 		if (!pifo->elem_cache)
-			goto err_queue;
+			goto err;
 
 		for (i = 0; i < attr->max_entries; i++)
 			pifo->elem_cache->elements[i] = (void *)&pifo->elements[i * elem_size];
@@ -153,8 +166,10 @@ static int pifo_map_init_map(struct bpf_pifo_map *pifo, union bpf_attr *attr,
 
 	return 0;
 
+err:
+	pifo_queue_free(pifo->q_secondary);
 err_queue:
-	pifo_queue_free(pifo->queue);
+	pifo_queue_free(pifo->q_primary);
 	return err;
 }
 
@@ -238,9 +253,12 @@ static void pifo_map_free(struct bpf_map *map)
 
 	synchronize_rcu();
 
-	if (map->map_type == BPF_MAP_TYPE_PIFO_XDP)
-		pifo_queue_flush(pifo->queue);
-	pifo_queue_free(pifo->queue);
+	if (map->map_type == BPF_MAP_TYPE_PIFO_XDP) {
+		pifo_queue_flush(pifo->q_primary);
+		pifo_queue_flush(pifo->q_secondary);
+	}
+	pifo_queue_free(pifo->q_primary);
+	pifo_queue_free(pifo->q_secondary);
 	bpf_map_area_free(pifo->elem_cache);
 	bpf_map_area_free(pifo);
 }
@@ -249,7 +267,7 @@ static int pifo_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 {
 	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
 	u32 index = key ? *(u32 *)key : U32_MAX, offset;
-	struct bpf_pifo_queue *queue = pifo->queue;
+	struct bpf_pifo_queue *queue = pifo->q_primary;
 	unsigned long idx, flags;
 	u32 *next = next_key;
 	int ret = -ENOENT;
@@ -261,15 +279,27 @@ static int pifo_map_get_next_key(struct bpf_map *map, void *key, void *next_key)
 	else
 		offset = index - queue->min_rank + 1;
 
-	if (offset >= queue->range)
-		goto out;
+	if (offset >= queue->range) {
+		offset -= queue->range;
+		queue = pifo->q_secondary;
+
+		if (offset >= queue->range)
+			goto out;
+	}
 
+search:
 	idx = find_next_bit(queue->lvl_bitmap[queue->levels - 1],
 			    queue->range, offset);
-	if (idx == queue->range)
+	if (idx == queue->range) {
+		if (queue == pifo->q_primary) {
+			queue = pifo->q_secondary;
+			offset = 0;
+			goto search;
+		}
 		goto out;
+	}
 
-	*next = idx;
+	*next = idx + queue->min_rank;
 	ret = 0;
 out:
 	spin_unlock_irqrestore(&pifo->lock, flags);
@@ -316,7 +346,7 @@ static void pifo_item_set_next(union bpf_pifo_item *item, void *next, bool xdp)
 static int __pifo_map_enqueue(struct bpf_pifo_map *pifo, union bpf_pifo_item *item,
 			      u64 rank, bool xdp)
 {
-	struct bpf_pifo_queue *queue = pifo->queue;
+	struct bpf_pifo_queue *queue = pifo->q_primary;
 	struct bpf_pifo_bucket *bucket;
 	u64 q_index;
 
@@ -331,8 +361,16 @@ static int __pifo_map_enqueue(struct bpf_pifo_map *pifo, union bpf_pifo_item *it
 	pifo_item_set_next(item, NULL, xdp);
 
 	q_index = rank - queue->min_rank;
-	if (unlikely(q_index >= queue->range))
-		q_index = queue->range - 1;
+	if (unlikely(q_index >= queue->range)) {
+		/* If the rank overflows the primary queue, enqueue into the
+		 * secondary queue; if it overflows that too, enqueue as the
+		 * last item
+		 */
+		q_index -= queue->range;
+		queue = pifo->q_secondary;
+
+		if (q_index >= queue->range)
+			q_index = queue->range - 1;
+	}
 
 	bucket = &queue->buckets[q_index];
 	if (likely(!bucket->head)) {
@@ -380,7 +418,7 @@ static unsigned long pifo_find_first_bucket(struct bpf_pifo_queue *queue)
 static union bpf_pifo_item *__pifo_map_dequeue(struct bpf_pifo_map *pifo,
 					       u64 flags, u64 *rank, bool xdp)
 {
-	struct bpf_pifo_queue *queue = pifo->queue;
+	struct bpf_pifo_queue *queue = pifo->q_primary;
 	struct bpf_pifo_bucket *bucket;
 	union bpf_pifo_item *item;
 	unsigned long bucket_idx;
@@ -392,6 +430,17 @@ static union bpf_pifo_item *__pifo_map_dequeue(struct bpf_pifo_map *pifo,
 		return NULL;
 	}
 
+	if (!pifo->num_queued) {
+		*rank = -ENOENT;
+		return NULL;
+	}
+
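+	/* num_queued is non-zero but the primary queue is empty, so the
+	 * remaining entries must be in the secondary queue: rotate the two
+	 * queues and move the secondary range to just after the new primary.
+	 */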
+	if (unlikely(pifo_queue_is_empty(queue))) {
+		swap(pifo->q_primary, pifo->q_secondary);
+		pifo->q_secondary->min_rank = pifo->q_primary->min_rank + pifo->q_primary->range;
+		queue = pifo->q_primary;
+	}
+
 	bucket_idx = pifo_find_first_bucket(queue);
 	if (bucket_idx == -1) {
 		*rank = -ENOENT;
@@ -437,7 +486,7 @@ struct xdp_frame *pifo_map_dequeue(struct bpf_map *map, u64 flags, u64 *rank)
 static void *pifo_map_lookup_elem(struct bpf_map *map, void *key)
 {
 	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
-	struct bpf_pifo_queue *queue = pifo->queue;
+	struct bpf_pifo_queue *queue = pifo->q_primary;
 	struct bpf_pifo_bucket *bucket;
 	u32 rank = *(u32 *)key, idx;
 
@@ -445,8 +494,13 @@ static void *pifo_map_lookup_elem(struct bpf_map *map, void *key)
 		return NULL;
 
 	idx = rank - queue->min_rank;
-	if (idx >= queue->range)
-		return NULL;
+	if (idx >= queue->range) {
+		idx -= queue->range;
+		queue = pifo->q_secondary;
+
+		if (idx >= queue->range)
+			return NULL;
+	}
 
 	bucket = &queue->buckets[idx];
 	/* FIXME: what happens if this changes while userspace is reading the
-- 
2.37.0



* [RFC PATCH 06/17] xdp: Add dequeue program type for getting packets from a PIFO
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (4 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 05/17] pifomap: Add queue rotation for continuously increasing rank mode Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 11:14 ` [RFC PATCH 07/17] bpf: Teach the verifier about referenced packets returned from dequeue programs Toke Høiland-Jørgensen
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Eric Dumazet, Paolo Abeni
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

Add a new BPF_PROG_TYPE_DEQUEUE, which will be executed by a new device
hook to retrieve queued packets for transmission. The API of the dequeue
program is simple: it takes a context object containing as its sole member
the ifindex of the device it is being executed on. The program can return a
pointer to a packet, or NULL to indicate it has nothing to transmit at this
time. Packet pointers are obtained by dequeueing them from a PIFO
map (using a helper added in a subsequent commit).

This commit adds the dequeue program type and the ability to run such
programs using the bpf_prog_run() syscall (returning the dequeued packet
to userspace); a subsequent commit introduces the network stack hook to
attach and execute dequeue programs.
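
For illustration, a minimal dequeue program could look like the sketch
below (the "dequeue" section name assumes corresponding libbpf support
and is not part of this patch):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("dequeue")
    void *dequeue_prog(struct dequeue_ctx *ctx)
    {
            /* nothing queued for ctx->egress_ifindex right now */
            return NULL;
    }

    char _license[] SEC("license") = "GPL";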

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/bpf.h            |  9 ++++++
 include/linux/bpf_types.h      |  2 ++
 include/net/xdp.h              |  4 +++
 include/uapi/linux/bpf.h       |  5 ++++
 kernel/bpf/syscall.c           |  1 +
 net/bpf/test_run.c             | 33 +++++++++++++++++++++
 net/core/filter.c              | 53 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  5 ++++
 8 files changed, 112 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ea994acebb81..6ea5d6d188cf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1864,6 +1864,8 @@ int array_map_alloc_check(union bpf_attr *attr);
 
 int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 			  union bpf_attr __user *uattr);
+int bpf_prog_test_run_dequeue(struct bpf_prog *prog, const union bpf_attr *kattr,
+			      union bpf_attr __user *uattr);
 int bpf_prog_test_run_skb(struct bpf_prog *prog, const union bpf_attr *kattr,
 			  union bpf_attr __user *uattr);
 int bpf_prog_test_run_tracing(struct bpf_prog *prog,
@@ -2107,6 +2109,13 @@ static inline int bpf_prog_test_run_xdp(struct bpf_prog *prog,
 	return -ENOTSUPP;
 }
 
+static inline int bpf_prog_test_run_dequeue(struct bpf_prog *prog,
+					    const union bpf_attr *kattr,
+					    union bpf_attr __user *uattr)
+{
+	return -ENOTSUPP;
+}
+
 static inline int bpf_prog_test_run_skb(struct bpf_prog *prog,
 					const union bpf_attr *kattr,
 					union bpf_attr __user *uattr)
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 26ef981a8aa5..e6bc962befb7 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -10,6 +10,8 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_SCHED_ACT, tc_cls_act,
 	      struct __sk_buff, struct sk_buff)
 BPF_PROG_TYPE(BPF_PROG_TYPE_XDP, xdp,
 	      struct xdp_md, struct xdp_buff)
+BPF_PROG_TYPE(BPF_PROG_TYPE_DEQUEUE, dequeue,
+	      struct dequeue_ctx, struct dequeue_data)
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_SKB, cg_skb,
 	      struct __sk_buff, struct sk_buff)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 7c694fb26f34..728ce943d352 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -85,6 +85,10 @@ struct xdp_buff {
 	u32 flags; /* supported values defined in xdp_buff_flags */
 };
 
+struct dequeue_data {
+	struct xdp_txq_info *txq;
+};
+
 static __always_inline bool xdp_buff_has_frags(struct xdp_buff *xdp)
 {
 	return !!(xdp->flags & XDP_FLAGS_HAS_FRAGS);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f0947ddee784..974fb5882305 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -954,6 +954,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_DEQUEUE,
 };
 
 enum bpf_attach_type {
@@ -5961,6 +5962,10 @@ struct xdp_md {
 	__u32 egress_ifindex;  /* txq->dev->ifindex */
 };
 
+struct dequeue_ctx {
+	__u32 egress_ifindex;
+};
+
 /* DEVMAP map-value layout
  *
  * The struct data-layout of map-value is a configuration interface.
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 31899882e513..c4af9119b68a 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2370,6 +2370,7 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type,
 		default:
 			return -EINVAL;
 		}
+	case BPF_PROG_TYPE_DEQUEUE:
 	case BPF_PROG_TYPE_SYSCALL:
 	case BPF_PROG_TYPE_EXT:
 		if (expected_attach_type)
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index f05d13717430..a7f479a19fe0 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1390,6 +1390,39 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	return ret;
 }
 
+int bpf_prog_test_run_dequeue(struct bpf_prog *prog, const union bpf_attr *kattr,
+			      union bpf_attr __user *uattr)
+{
+	struct xdp_txq_info txq = { .dev = current->nsproxy->net_ns->loopback_dev };
+	u32 repeat = kattr->test.repeat, duration, size;
+	struct dequeue_data ctx = { .txq = &txq };
+	struct xdp_buff xdp = {};
+	struct xdp_frame *pkt;
+	int ret = -EINVAL;
+	u64 retval;
+
+	if (prog->expected_attach_type)
+		return -EINVAL;
+
+	if (kattr->test.data_in || kattr->test.data_size_in ||
+	    kattr->test.ctx_in || kattr->test.ctx_out || repeat > 1)
+		return -EINVAL;
+
+	ret = bpf_test_run(prog, &ctx, repeat, &retval, &duration, false);
+	if (ret)
+		return ret;
+	if (!retval)
+		return bpf_test_finish(kattr, uattr, NULL, NULL, 0, retval, duration);
+
+	pkt = (void *)(unsigned long)retval;
+	xdp_convert_frame_to_buff(pkt, &xdp);
+	size = xdp.data_end - xdp.data_meta;
+	/* We set retval == 1 if pkt != NULL, otherwise 0 */
+	ret = bpf_test_finish(kattr, uattr, xdp.data_meta, NULL, size, !!retval, duration);
+	xdp_return_frame(pkt);
+	return ret;
+}
+
 static int verify_user_bpf_flow_keys(struct bpf_flow_keys *ctx)
 {
 	/* make sure the fields we don't use are zeroed */
diff --git a/net/core/filter.c b/net/core/filter.c
index 8e6ea17a29db..30bd3a6aedab 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8062,6 +8062,12 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 	}
 }
 
+static const struct bpf_func_proto *
+dequeue_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return bpf_base_func_proto(func_id);
+}
+
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
 const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
 
@@ -8776,6 +8782,20 @@ void bpf_warn_invalid_xdp_action(struct net_device *dev, struct bpf_prog *prog,
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
+static bool dequeue_is_valid_access(int off, int size,
+				    enum bpf_access_type type,
+				    const struct bpf_prog *prog,
+				    struct bpf_insn_access_aux *info)
+{
+	if (type == BPF_WRITE)
+		return false;
+	switch (off) {
+	case offsetof(struct dequeue_ctx, egress_ifindex):
+		return true;
+	}
+	return false;
+}
+
 static bool sock_addr_is_valid_access(int off, int size,
 				      enum bpf_access_type type,
 				      const struct bpf_prog *prog,
@@ -9835,6 +9855,28 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static u32 dequeue_convert_ctx_access(enum bpf_access_type type,
+				      const struct bpf_insn *si,
+				      struct bpf_insn *insn_buf,
+				      struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+
+	switch (si->off) {
+	case offsetof(struct dequeue_ctx, egress_ifindex):
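+		/* dst_reg = ctx->txq->dev->ifindex, via two pointer loads */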
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct dequeue_data, txq),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct dequeue_data, txq));
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_txq_info, dev),
+				      si->dst_reg, si->dst_reg,
+				      offsetof(struct xdp_txq_info, dev));
+		*insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+				      offsetof(struct net_device, ifindex));
+		break;
+	}
+	return insn - insn_buf;
+}
+
 /* SOCK_ADDR_LOAD_NESTED_FIELD() loads Nested Field S.F.NF where S is type of
  * context Structure, F is Field in context structure that contains a pointer
  * to Nested Structure of type NS that has the field NF.
@@ -10687,6 +10729,17 @@ const struct bpf_prog_ops xdp_prog_ops = {
 	.test_run		= bpf_prog_test_run_xdp,
 };
 
+const struct bpf_verifier_ops dequeue_verifier_ops = {
+	.get_func_proto		= dequeue_func_proto,
+	.is_valid_access	= dequeue_is_valid_access,
+	.convert_ctx_access	= dequeue_convert_ctx_access,
+	.gen_prologue		= bpf_noop_prologue,
+};
+
+const struct bpf_prog_ops dequeue_prog_ops = {
+	.test_run		= bpf_prog_test_run_dequeue,
+};
+
 const struct bpf_verifier_ops cg_skb_verifier_ops = {
 	.get_func_proto		= cg_skb_func_proto,
 	.is_valid_access	= cg_skb_is_valid_access,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 623421377f6e..4dd8a563f85d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -954,6 +954,7 @@ enum bpf_prog_type {
 	BPF_PROG_TYPE_LSM,
 	BPF_PROG_TYPE_SK_LOOKUP,
 	BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
+	BPF_PROG_TYPE_DEQUEUE,
 };
 
 enum bpf_attach_type {
@@ -5961,6 +5962,10 @@ struct xdp_md {
 	__u32 egress_ifindex;  /* txq->dev->ifindex */
 };
 
+struct dequeue_ctx {
+	__u32 egress_ifindex;
+};
+
 /* DEVMAP map-value layout
  *
  * The struct data-layout of map-value is a configuration interface.
-- 
2.37.0



* [RFC PATCH 07/17] bpf: Teach the verifier about referenced packets returned from dequeue programs
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (5 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 06/17] xdp: Add dequeue program type for getting packets from a PIFO Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 11:14 ` [RFC PATCH 08/17] bpf: Add helpers to dequeue from a PIFO map Toke Høiland-Jørgensen
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

From: Kumar Kartikeya Dwivedi <memxor@gmail.com>

The use case is to allow returning a dequeued packet, or NULL, directly
from the BPF program. Shift the check_reference_leak call to after
check_return_code, since returning the packet releases the reference (it
is transferred to the caller of the BPF program); a reference leak check
run before check_return_code would therefore always fail verification.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 kernel/bpf/verifier.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 489ea3f368a1..e3662460a095 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -10421,6 +10421,9 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn)
 	return 0;
 }
 
+BTF_ID_LIST(dequeue_btf_ids)
+BTF_ID(struct, xdp_md)
+
 static int check_return_code(struct bpf_verifier_env *env)
 {
 	struct tnum enforce_attach_type_range = tnum_unknown;
@@ -10554,6 +10557,17 @@ static int check_return_code(struct bpf_verifier_env *env)
 		}
 		break;
 
+	case BPF_PROG_TYPE_DEQUEUE:
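+		/* A dequeue program may exit with NULL, or with a referenced
+		 * xdp_md pointer; in the latter case the reference is
+		 * released here, as it is transferred to the caller.
+		 */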
+		if (register_is_null(reg))
+			return 0;
+		if ((reg->type == PTR_TO_BTF_ID || reg->type == PTR_TO_BTF_ID_OR_NULL) &&
+		    reg->btf == btf_vmlinux && reg->btf_id == dequeue_btf_ids[0] &&
+		    reg->ref_obj_id != 0)
+			return release_reference(env, reg->ref_obj_id);
+		verbose(env, "At program exit the register R0 must be NULL or referenced %s%s\n",
+			reg_type_str(env, PTR_TO_BTF_ID),
+			kernel_type_name(btf_vmlinux, dequeue_btf_ids[0]));
+		return -EINVAL;
 	case BPF_PROG_TYPE_EXT:
 		/* freplace program can return anything as its return value
 		 * depends on the to-be-replaced kernel func or bpf program.
@@ -12339,11 +12353,11 @@ static int do_check(struct bpf_verifier_env *env)
 					continue;
 				}
 
-				err = check_reference_leak(env);
+				err = check_return_code(env);
 				if (err)
 					return err;
 
-				err = check_return_code(env);
+				err = check_reference_leak(env);
 				if (err)
 					return err;
 process_bpf_exit:
-- 
2.37.0



* [RFC PATCH 08/17] bpf: Add helpers to dequeue from a PIFO map
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (6 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 07/17] bpf: Teach the verifier about referenced packets returned from dequeue programs Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 11:14 ` [RFC PATCH 09/17] bpf: Introduce pkt_uid member for PTR_TO_PACKET Toke Høiland-Jørgensen
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen, Eric Dumazet,
	Paolo Abeni

This adds a new helper, bpf_packet_dequeue(), to dequeue a packet from
a PIFO map. The helper returns a refcounted pointer to the packet
dequeued from the map; the reference must be released either by dropping
the packet (using the bpf_packet_drop() helper, also added here), or by
returning the packet to the caller.
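
As a sketch of the intended usage (the map definition and the rank-based
policy are made up for the example; the helper prototypes are assumed to
come from the updated UAPI header):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_PIFO_XDP);
            __uint(max_entries, 1024);
            __uint(map_extra, 4096); /* range of valid priorities */
    } pifo SEC(".maps");

    SEC("dequeue")
    void *dequeue_prog(struct dequeue_ctx *ctx)
    {
            struct xdp_md *pkt;
            __u64 rank;

            pkt = (void *)bpf_packet_dequeue(ctx, &pifo, 0, &rank);
            if (!pkt)
                    return NULL; /* PIFO empty, nothing to transmit */

            if (rank > 1000) { /* arbitrary example policy */
                    bpf_packet_drop(ctx, pkt); /* releases the reference */
                    return NULL;
            }

            return pkt; /* reference is transferred on return */
    }

    char _license[] SEC("license") = "GPL";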

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/uapi/linux/bpf.h       | 19 +++++++++++++++
 kernel/bpf/verifier.c          | 13 +++++++---
 net/core/filter.c              | 43 +++++++++++++++++++++++++++++++++-
 tools/include/uapi/linux/bpf.h | 19 +++++++++++++++
 4 files changed, 90 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 974fb5882305..d44382644391 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5341,6 +5341,23 @@ union bpf_attr {
  *		**-EACCES** if the SYN cookie is not valid.
  *
  *		**-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
+ *
+ * long bpf_packet_dequeue(void *ctx, struct bpf_map *map, u64 flags, u64 *rank)
+ *	Description
+ *		Dequeue the packet at the head of the PIFO in *map* and return a pointer
+ *		to the packet (or NULL if the PIFO is empty).
+ *	Return
+ *		On success, a pointer to the packet, or NULL if the PIFO is empty. The
+ *		packet pointer must be freed using *bpf_packet_drop()* or returning
+ *		the packet pointer. The *rank* pointer will be set to the rank of
+ *		the dequeued packet on success, or a negative error code on error.
+ *
+ * long bpf_packet_drop(void *ctx, void *pkt)
+ *	Description
+ *		Drop *pkt*, which must be a reference previously returned by
+ *		*bpf_packet_dequeue()* (and checked to not be NULL).
+ *	Return
+ *		This always succeeds and returns zero.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5551,6 +5568,8 @@ union bpf_attr {
 	FN(tcp_raw_gen_syncookie_ipv6),	\
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
+	FN(packet_dequeue),		\
+	FN(packet_drop),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index e3662460a095..68f98d76bc78 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -483,7 +483,8 @@ static bool may_be_acquire_function(enum bpf_func_id func_id)
 		func_id == BPF_FUNC_sk_lookup_udp ||
 		func_id == BPF_FUNC_skc_lookup_tcp ||
 		func_id == BPF_FUNC_map_lookup_elem ||
-	        func_id == BPF_FUNC_ringbuf_reserve;
+		func_id == BPF_FUNC_ringbuf_reserve ||
+		func_id == BPF_FUNC_packet_dequeue;
 }
 
 static bool is_acquire_function(enum bpf_func_id func_id,
@@ -495,7 +496,8 @@ static bool is_acquire_function(enum bpf_func_id func_id,
 	    func_id == BPF_FUNC_sk_lookup_udp ||
 	    func_id == BPF_FUNC_skc_lookup_tcp ||
 	    func_id == BPF_FUNC_ringbuf_reserve ||
-	    func_id == BPF_FUNC_kptr_xchg)
+	    func_id == BPF_FUNC_kptr_xchg ||
+	    func_id == BPF_FUNC_packet_dequeue)
 		return true;
 
 	if (func_id == BPF_FUNC_map_lookup_elem &&
@@ -6276,7 +6278,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 			goto error;
 		break;
 	case BPF_MAP_TYPE_PIFO_XDP:
-		if (func_id != BPF_FUNC_redirect_map)
+		if (func_id != BPF_FUNC_redirect_map &&
+		    func_id != BPF_FUNC_packet_dequeue)
 			goto error;
 		break;
 	default:
@@ -6385,6 +6388,10 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
 		if (map->map_type != BPF_MAP_TYPE_TASK_STORAGE)
 			goto error;
 		break;
+	case BPF_FUNC_packet_dequeue:
+		if (map->map_type != BPF_MAP_TYPE_PIFO_XDP)
+			goto error;
+		break;
 	default:
 		break;
 	}
diff --git a/net/core/filter.c b/net/core/filter.c
index 30bd3a6aedab..893b75515859 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4430,6 +4430,40 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+BTF_ID_LIST_SINGLE(xdp_md_btf_ids, struct, xdp_md)
+
+BPF_CALL_4(bpf_packet_dequeue, struct dequeue_data *, ctx, struct bpf_map *, map,
+	   u64, flags, u64 *, rank)
+{
+	return (unsigned long)pifo_map_dequeue(map, flags, rank);
+}
+
+static const struct bpf_func_proto bpf_packet_dequeue_proto = {
+	.func           = bpf_packet_dequeue,
+	.gpl_only       = false,
+	.ret_type       = RET_PTR_TO_BTF_ID_OR_NULL,
+	.ret_btf_id	= xdp_md_btf_ids,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_CONST_MAP_PTR,
+	.arg3_type      = ARG_ANYTHING,
+	.arg4_type      = ARG_PTR_TO_LONG,
+};
+
+BPF_CALL_2(bpf_packet_drop, struct dequeue_data *, ctx, struct xdp_frame *, pkt)
+{
+	xdp_return_frame(pkt);
+	return 0;
+}
+
+static const struct bpf_func_proto bpf_packet_drop_proto = {
+	.func           = bpf_packet_drop,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_PTR_TO_BTF_ID | OBJ_RELEASE,
+	.arg2_btf_id	= xdp_md_btf_ids,
+};
+
 static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
 				  unsigned long off, unsigned long len)
 {
@@ -8065,7 +8099,14 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 static const struct bpf_func_proto *
 dequeue_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
-	return bpf_base_func_proto(func_id);
+	switch (func_id) {
+	case BPF_FUNC_packet_dequeue:
+		return &bpf_packet_dequeue_proto;
+	case BPF_FUNC_packet_drop:
+		return &bpf_packet_drop_proto;
+	default:
+		return bpf_base_func_proto(func_id);
+	}
 }
 
 const struct bpf_func_proto bpf_sock_map_update_proto __weak;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4dd8a563f85d..1dab68a89e18 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5341,6 +5341,23 @@ union bpf_attr {
  *		**-EACCES** if the SYN cookie is not valid.
  *
  *		**-EPROTONOSUPPORT** if CONFIG_IPV6 is not builtin.
+ *
+ * long bpf_packet_dequeue(void *ctx, struct bpf_map *map, u64 flags, u64 *rank)
+ *	Description
+ *		Dequeue the packet at the head of the PIFO in *map* and return a pointer
+ *		to the packet (or NULL if the PIFO is empty).
+ *	Return
+ *		On success, a pointer to the packet, or NULL if the PIFO is empty. The
+ *		packet pointer must be freed using *bpf_packet_drop()* or returning
+ *		the packet pointer. The *rank* pointer will be set to the rank of
+ *		the dequeued packet on success, or a negative error code on error.
+ *
+ * long bpf_packet_drop(void *ctx, void *pkt)
+ *	Description
+ *		Drop *pkt*, which must be a reference previously returned by
+ *		*bpf_packet_dequeue()* (and checked to not be NULL).
+ *	Return
+ *		This always succeeds and returns zero.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5551,6 +5568,8 @@ union bpf_attr {
 	FN(tcp_raw_gen_syncookie_ipv6),	\
 	FN(tcp_raw_check_syncookie_ipv4),	\
 	FN(tcp_raw_check_syncookie_ipv6),	\
+	FN(packet_dequeue),		\
+	FN(packet_drop),		\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.37.0



* [RFC PATCH 09/17] bpf: Introduce pkt_uid member for PTR_TO_PACKET
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (7 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 08/17] bpf: Add helpers to dequeue from a PIFO map Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 11:14 ` [RFC PATCH 10/17] bpf: Implement direct packet access in dequeue progs Toke Høiland-Jørgensen
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

From: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Add a new member in the PTR_TO_PACKET-specific register state, namely
pkt_uid. This is used to classify packet pointers into different sets,
with the invariant that packet pointers not belonging to the same set,
i.e. not sharing the same pkt_uid, are not allowed to be compared with
each other. During range propagation in __find_good_pkt_pointers, we now
need to take care to skip packet pointers with a different pkt_uid.

This change is necessary so that we can dequeue multiple XDP frames in a
single program, obtain packet pointers through their xdp_md fake struct,
and prevent confusion arising from comparisons of packet pointers that
point into different frames. Attaching a pkt_uid to the PTR_TO_PACKET
type prevents such comparisons, and also allows the user to see which
frame a packet pointer belongs to in the verbose verifier log (by
matching the pkt_uid against the ref_obj_id of the referenced xdp_md
obtained from bpf_packet_dequeue).

regsafe is updated to match non-zero pkt_uid values through the idmap,
to ensure it rejects pkt pointers with distinct pkt_uid.

We also replace the memset of reg->raw with an assignment that only
clears the range. In commit 0962590e5533 ("bpf: fix partial copy of
map_ptr when dst is scalar"), the copying was changed to go through
'raw' so that all possible members of the type-specific register state
are copied, since at that point the type of the register is not known.
But inside the reg_is_pkt_pointer block, there is no need to memset the
whole 'raw' struct, since we now also have a pkt_uid member that we want
to preserve when copying from one register to another for pkt pointers.
A test for this case has been included to prevent regressions.
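
For example, with two packets dequeued in the same program (fragment
only; hypothetical map name, and packet-pointer loads from xdp_md as
enabled by the next patch), the verifier will now reject the comparison,
since d1 and d2 carry different pkt_uid values:

    struct xdp_md *p1, *p2;
    __u64 r1, r2;

    p1 = (void *)bpf_packet_dequeue(ctx, &pifo, 0, &r1);
    p2 = (void *)bpf_packet_dequeue(ctx, &pifo, 0, &r2);
    if (p1 && p2) {
            void *d1 = (void *)(long)p1->data;
            void *d2 = (void *)(long)p2->data;

            if (d1 < d2) /* pkt pointer comparison across frames: rejected */
                    bpf_packet_drop(ctx, p1);
    }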

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/bpf_verifier.h |  8 ++++-
 kernel/bpf/verifier.c        | 59 +++++++++++++++++++++++++++---------
 2 files changed, 52 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 2e3bad8640dc..93b69dbf3d19 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -50,7 +50,13 @@ struct bpf_reg_state {
 	s32 off;
 	union {
 		/* valid when type == PTR_TO_PACKET */
-		int range;
+		struct {
+			int range;
+			/* To distinguish packet pointers backed by different
+			 * packets, to prevent pkt pointer comparisons.
+			 */
+			u32 pkt_uid;
+		};
 
 		/* valid when type == CONST_PTR_TO_MAP | PTR_TO_MAP_VALUE |
 		 *   PTR_TO_MAP_VALUE_OR_NULL
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 68f98d76bc78..f319e9392587 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -431,6 +431,12 @@ static bool type_is_pkt_pointer(enum bpf_reg_type type)
 	       type == PTR_TO_PACKET_META;
 }
 
+static bool type_is_pkt_pointer_any(enum bpf_reg_type type)
+{
+	return type_is_pkt_pointer(type) ||
+	       type == PTR_TO_PACKET_END;
+}
+
 static bool type_is_sk_pointer(enum bpf_reg_type type)
 {
 	return type == PTR_TO_SOCKET ||
@@ -861,6 +867,8 @@ static void print_verifier_state(struct bpf_verifier_env *env,
 				verbose_a("off=%d", reg->off);
 			if (type_is_pkt_pointer(t))
 				verbose_a("r=%d", reg->range);
+			if (type_is_pkt_pointer_any(t) && reg->pkt_uid)
+				verbose_a("pkt_uid=%d", reg->pkt_uid);
 			else if (base_type(t) == CONST_PTR_TO_MAP ||
 				 base_type(t) == PTR_TO_MAP_KEY ||
 				 base_type(t) == PTR_TO_MAP_VALUE)
@@ -1394,8 +1402,7 @@ static bool reg_is_pkt_pointer(const struct bpf_reg_state *reg)
 
 static bool reg_is_pkt_pointer_any(const struct bpf_reg_state *reg)
 {
-	return reg_is_pkt_pointer(reg) ||
-	       reg->type == PTR_TO_PACKET_END;
+	return type_is_pkt_pointer_any(reg->type);
 }
 
 /* Unmodified PTR_TO_PACKET[_META,_END] register from ctx access. */
@@ -6575,14 +6582,17 @@ static void release_reg_references(struct bpf_verifier_env *env,
 	struct bpf_reg_state *regs = state->regs, *reg;
 	int i;
 
-	for (i = 0; i < MAX_BPF_REG; i++)
-		if (regs[i].ref_obj_id == ref_obj_id)
+	for (i = 0; i < MAX_BPF_REG; i++) {
+		if (regs[i].ref_obj_id == ref_obj_id ||
+		    (reg_is_pkt_pointer_any(&regs[i]) && regs[i].pkt_uid == ref_obj_id))
 			mark_reg_unknown(env, regs, i);
+	}
 
 	bpf_for_each_spilled_reg(i, state, reg) {
 		if (!reg)
 			continue;
-		if (reg->ref_obj_id == ref_obj_id)
+		if (reg->ref_obj_id == ref_obj_id ||
+		    (reg_is_pkt_pointer_any(reg) && reg->pkt_uid == ref_obj_id))
 			__mark_reg_unknown(env, reg);
 	}
 }
@@ -8200,7 +8210,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 		if (reg_is_pkt_pointer(ptr_reg)) {
 			dst_reg->id = ++env->id_gen;
 			/* something was added to pkt_ptr, set range to zero */
-			memset(&dst_reg->raw, 0, sizeof(dst_reg->raw));
+			dst_reg->range = 0;
 		}
 		break;
 	case BPF_SUB:
@@ -8260,7 +8270,7 @@ static int adjust_ptr_min_max_vals(struct bpf_verifier_env *env,
 			dst_reg->id = ++env->id_gen;
 			/* something was added to pkt_ptr, set range to zero */
 			if (smin_val < 0)
-				memset(&dst_reg->raw, 0, sizeof(dst_reg->raw));
+				dst_reg->range = 0;
 		}
 		break;
 	case BPF_AND:
@@ -9287,7 +9297,8 @@ static void __find_good_pkt_pointers(struct bpf_func_state *state,
 
 	for (i = 0; i < MAX_BPF_REG; i++) {
 		reg = &state->regs[i];
-		if (reg->type == type && reg->id == dst_reg->id)
+		if (reg->type == type && reg->id == dst_reg->id &&
+		    reg->pkt_uid == dst_reg->pkt_uid)
 			/* keep the maximum range already checked */
 			reg->range = max(reg->range, new_range);
 	}
@@ -9295,7 +9306,8 @@ static void __find_good_pkt_pointers(struct bpf_func_state *state,
 	bpf_for_each_spilled_reg(i, state, reg) {
 		if (!reg)
 			continue;
-		if (reg->type == type && reg->id == dst_reg->id)
+		if (reg->type == type && reg->id == dst_reg->id &&
+		    reg->pkt_uid == dst_reg->pkt_uid)
 			reg->range = max(reg->range, new_range);
 	}
 }
@@ -9910,6 +9922,14 @@ static void mark_ptr_or_null_regs(struct bpf_verifier_state *vstate, u32 regno,
 		__mark_ptr_or_null_regs(vstate->frame[i], id, is_null);
 }
 
+static bool is_bad_pkt_comparison(const struct bpf_reg_state *dst_reg,
+				  const struct bpf_reg_state *src_reg)
+{
+	if (!reg_is_pkt_pointer_any(dst_reg) || !reg_is_pkt_pointer_any(src_reg))
+		return false;
+	return dst_reg->pkt_uid != src_reg->pkt_uid;
+}
+
 static bool try_match_pkt_pointers(const struct bpf_insn *insn,
 				   struct bpf_reg_state *dst_reg,
 				   struct bpf_reg_state *src_reg,
@@ -9923,6 +9943,9 @@ static bool try_match_pkt_pointers(const struct bpf_insn *insn,
 	if (BPF_CLASS(insn->code) == BPF_JMP32)
 		return false;
 
+	if (is_bad_pkt_comparison(dst_reg, src_reg))
+		return false;
+
 	switch (BPF_OP(insn->code)) {
 	case BPF_JGT:
 		if ((dst_reg->type == PTR_TO_PACKET &&
@@ -10220,11 +10243,17 @@ static int check_cond_jmp_op(struct bpf_verifier_env *env,
 		mark_ptr_or_null_regs(other_branch, insn->dst_reg,
 				      opcode == BPF_JEQ);
 	} else if (!try_match_pkt_pointers(insn, dst_reg, &regs[insn->src_reg],
-					   this_branch, other_branch) &&
-		   is_pointer_value(env, insn->dst_reg)) {
-		verbose(env, "R%d pointer comparison prohibited\n",
-			insn->dst_reg);
-		return -EACCES;
+					   this_branch, other_branch)) {
+		if (is_pointer_value(env, insn->dst_reg)) {
+			verbose(env, "R%d pointer comparison prohibited\n",
+				insn->dst_reg);
+			return -EACCES;
+		}
+		if (is_bad_pkt_comparison(dst_reg, &regs[insn->src_reg])) {
+			verbose(env, "R%d, R%d pkt pointer comparison prohibited\n",
+				insn->dst_reg, insn->src_reg);
+			return -EACCES;
+		}
 	}
 	if (env->log.level & BPF_LOG_LEVEL)
 		print_insn_state(env, this_branch->frame[this_branch->curframe]);
@@ -11514,6 +11543,8 @@ static bool regsafe(struct bpf_verifier_env *env, struct bpf_reg_state *rold,
 		/* id relations must be preserved */
 		if (rold->id && !check_ids(rold->id, rcur->id, idmap))
 			return false;
+		if (rold->pkt_uid && !check_ids(rold->pkt_uid, rcur->pkt_uid, idmap))
+			return false;
 		/* new val must satisfy old val knowledge */
 		return range_within(rold, rcur) &&
 		       tnum_in(rold->var_off, rcur->var_off);
-- 
2.37.0



* [RFC PATCH 10/17] bpf: Implement direct packet access in dequeue progs
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (8 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 09/17] bpf: Introduce pkt_uid member for PTR_TO_PACKET Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 11:14 ` [RFC PATCH 11/17] dev: Add XDP dequeue hook Toke Høiland-Jørgensen
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen, Eric Dumazet,
	Paolo Abeni

From: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Allow users to obtain packet pointers from a dequeued xdp_md BTF
pointer, by allowing a convert_ctx_access implementation for
PTR_TO_BTF_ID, and then tagging the resulting loads as packet pointers
in the verifier context.

Previously, convert_ctx_access was limited to just PTR_TO_CTX, but now
it will also be used to translate accesses to the PTR_TO_BTF_ID of
xdp_md obtained from bpf_packet_dequeue, so it works like the xdp_md ctx
in XDP programs. We must also remember that while an xdp_buff backs the
ctx in XDP programs, an xdp_frame backs the xdp_md in dequeue programs.

Next, we use the pkt_uid support and transfer the ref_obj_id on loads
of the data, data_end, and data_meta fields, to make the verifier aware
of the provenance of these packet pointers, so that comparisons can be
rejected for unsafe cases.

In the end, users can reuse code written for the XDP ctx in dequeue
programs as well, and don't have to do things differently.

Once packet pointers are obtained, the regular verifier logic kicks in,
where pointers into the same xdp_frame can be compared to establish the
range and perform accesses into the packet.
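
As a sketch, inside a dequeue program after a successful
bpf_packet_dequeue() (the ethhdr bounds check is just an example), the
packet can be parsed just like the ctx in an XDP program:

    void *data = (void *)(long)pkt->data;
    void *data_end = (void *)(long)pkt->data_end;

    if (data + sizeof(struct ethhdr) > data_end) {
            bpf_packet_drop(ctx, pkt); /* too short; release the reference */
            return NULL;
    }
    /* ... parse the Ethernet header as usual ... */
    return pkt;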

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/bpf.h          |  26 +++++--
 include/linux/bpf_verifier.h |   6 ++
 kernel/bpf/verifier.c        |  48 +++++++++---
 net/core/filter.c            | 143 +++++++++++++++++++++++++++++++++++
 4 files changed, 206 insertions(+), 17 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 6ea5d6d188cf..a568ddc1f1ea 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -653,6 +653,12 @@ struct bpf_prog_ops {
 			union bpf_attr __user *uattr);
 };
 
+typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
+					const struct bpf_insn *src,
+					struct bpf_insn *dst,
+					struct bpf_prog *prog,
+					u32 *target_size);
+
 struct bpf_verifier_ops {
 	/* return eBPF function prototype for verification */
 	const struct bpf_func_proto *
@@ -678,6 +684,9 @@ struct bpf_verifier_ops {
 				 const struct btf_type *t, int off, int size,
 				 enum bpf_access_type atype,
 				 u32 *next_btf_id, enum bpf_type_flag *flag);
+	bpf_convert_ctx_access_t (*get_convert_ctx_access)(struct bpf_verifier_log *log,
+							   const struct btf *btf,
+							   u32 btf_id);
 };
 
 struct bpf_prog_offload_ops {
@@ -1360,11 +1369,6 @@ const struct bpf_func_proto *bpf_get_trace_vprintk_proto(void);
 
 typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
 					unsigned long off, unsigned long len);
-typedef u32 (*bpf_convert_ctx_access_t)(enum bpf_access_type type,
-					const struct bpf_insn *src,
-					struct bpf_insn *dst,
-					struct bpf_prog *prog,
-					u32 *target_size);
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 		     void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
@@ -2180,6 +2184,18 @@ static inline bool unprivileged_ebpf_enabled(void)
 	return false;
 }
 
+static inline struct btf *bpf_get_btf_vmlinux(void)
+{
+	return ERR_PTR(-EINVAL);
+}
+
+static inline int btf_struct_access(struct bpf_verifier_log *log, const struct btf *btf,
+				    const struct btf_type *t, int off, int size,
+				    enum bpf_access_type atype __maybe_unused,
+				    u32 *next_btf_id, enum bpf_type_flag *flag)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_BPF_SYSCALL */
 
 void __bpf_free_used_btfs(struct bpf_prog_aux *aux,
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 93b69dbf3d19..640f92fece12 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -532,8 +532,14 @@ __printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log,
 				      const char *fmt, va_list args);
 __printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
 					   const char *fmt, ...);
+#ifdef CONFIG_BPF_SYSCALL
 __printf(2, 3) void bpf_log(struct bpf_verifier_log *log,
 			    const char *fmt, ...);
+#else
+static inline void bpf_log(struct bpf_verifier_log *log, const char *fmt, ...)
+{
+}
+#endif
 
 static inline struct bpf_func_state *cur_func(struct bpf_verifier_env *env)
 {
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f319e9392587..7edc2b834d9b 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1707,7 +1707,7 @@ static void mark_reg_not_init(struct bpf_verifier_env *env,
 static void mark_btf_ld_reg(struct bpf_verifier_env *env,
 			    struct bpf_reg_state *regs, u32 regno,
 			    enum bpf_reg_type reg_type,
-			    struct btf *btf, u32 btf_id,
+			    struct btf *btf, u32 reg_id,
 			    enum bpf_type_flag flag)
 {
 	if (reg_type == SCALAR_VALUE) {
@@ -1715,9 +1715,14 @@ static void mark_btf_ld_reg(struct bpf_verifier_env *env,
 		return;
 	}
 	mark_reg_known_zero(env, regs, regno);
-	regs[regno].type = PTR_TO_BTF_ID | flag;
+	regs[regno].type = (int)reg_type | flag;
+	if (type_is_pkt_pointer_any(reg_type)) {
+		regs[regno].pkt_uid = reg_id;
+		return;
+	}
+	WARN_ON_ONCE(base_type(reg_type) != PTR_TO_BTF_ID);
 	regs[regno].btf = btf;
-	regs[regno].btf_id = btf_id;
+	regs[regno].btf_id = reg_id;
 }
 
 #define DEF_NOT_SUBREG	(0)
@@ -4479,13 +4484,14 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 				   struct bpf_reg_state *regs,
 				   int regno, int off, int size,
 				   enum bpf_access_type atype,
-				   int value_regno)
+				   int value_regno, int insn_idx)
 {
 	struct bpf_reg_state *reg = regs + regno;
 	const struct btf_type *t = btf_type_by_id(reg->btf, reg->btf_id);
 	const char *tname = btf_name_by_offset(reg->btf, t->name_off);
+	struct bpf_insn_aux_data *aux = &env->insn_aux_data[insn_idx];
 	enum bpf_type_flag flag = 0;
-	u32 btf_id;
+	u32 reg_id;
 	int ret;
 
 	if (off < 0) {
@@ -4520,7 +4526,7 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 
 	if (env->ops->btf_struct_access) {
 		ret = env->ops->btf_struct_access(&env->log, reg->btf, t,
-						  off, size, atype, &btf_id, &flag);
+						  off, size, atype, &reg_id, &flag);
 	} else {
 		if (atype != BPF_READ) {
 			verbose(env, "only read is supported\n");
@@ -4528,7 +4534,7 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 		}
 
 		ret = btf_struct_access(&env->log, reg->btf, t, off, size,
-					atype, &btf_id, &flag);
+					atype, &reg_id, &flag);
 	}
 
 	if (ret < 0)
@@ -4540,8 +4546,19 @@ static int check_ptr_to_btf_access(struct bpf_verifier_env *env,
 	if (type_flag(reg->type) & PTR_UNTRUSTED)
 		flag |= PTR_UNTRUSTED;
 
-	if (atype == BPF_READ && value_regno >= 0)
-		mark_btf_ld_reg(env, regs, value_regno, ret, reg->btf, btf_id, flag);
+	/* Remember the BTF ID for later use in convert_ctx_accesses */
+	aux->btf_var.btf_id = reg->btf_id;
+	aux->btf_var.btf = reg->btf;
+
+	if (atype == BPF_READ && value_regno >= 0) {
+		/* For pkt pointers, reg_id is set to pkt_uid, which must be the
+		 * ref_obj_id of the referenced register from which they are
+		 * obtained, denoting different packets e.g. in dequeue progs.
+		 */
+		if (type_is_pkt_pointer_any(ret))
+			reg_id = reg->ref_obj_id;
+		mark_btf_ld_reg(env, regs, value_regno, ret, reg->btf, reg_id, flag);
+	}
 
 	return 0;
 }
@@ -4896,7 +4913,7 @@ static int check_mem_access(struct bpf_verifier_env *env, int insn_idx, u32 regn
 	} else if (base_type(reg->type) == PTR_TO_BTF_ID &&
 		   !type_may_be_null(reg->type)) {
 		err = check_ptr_to_btf_access(env, regs, regno, off, size, t,
-					      value_regno);
+					      value_regno, insn_idx);
 	} else if (reg->type == CONST_PTR_TO_MAP) {
 		err = check_ptr_to_map_access(env, regs, regno, off, size, t,
 					      value_regno);
@@ -13515,8 +13532,15 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		case PTR_TO_BTF_ID:
 		case PTR_TO_BTF_ID | PTR_UNTRUSTED:
 			if (type == BPF_READ) {
-				insn->code = BPF_LDX | BPF_PROBE_MEM |
-					BPF_SIZE((insn)->code);
+				if (env->ops->get_convert_ctx_access) {
+					struct btf *btf = env->insn_aux_data[i + delta].btf_var.btf;
+					u32 btf_id = env->insn_aux_data[i + delta].btf_var.btf_id;
+
+					convert_ctx_access = env->ops->get_convert_ctx_access(&env->log, btf, btf_id);
+					if (convert_ctx_access)
+						break;
+				}
+				insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE((insn)->code);
 				env->prog->aux->num_exentries++;
 			} else if (resolve_prog_type(env->prog) != BPF_PROG_TYPE_STRUCT_OPS) {
 				verbose(env, "Writes through BTF pointers are not allowed\n");
diff --git a/net/core/filter.c b/net/core/filter.c
index 893b75515859..6a4881739e9b 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -79,6 +79,7 @@
 #include <net/tls.h>
 #include <net/xdp.h>
 #include <net/mptcp.h>
+#include <linux/bpf_verifier.h>
 
 static const struct bpf_func_proto *
 bpf_sk_base_func_proto(enum bpf_func_id func_id);
@@ -9918,6 +9919,146 @@ static u32 dequeue_convert_ctx_access(enum bpf_access_type type,
 	return insn - insn_buf;
 }
 
+static int dequeue_btf_struct_access(struct bpf_verifier_log *log,
+				     const struct btf *btf,
+				     const struct btf_type *t, int off, int size,
+				     enum bpf_access_type atype,
+				     u32 *next_btf_id, enum bpf_type_flag *flag)
+{
+	const struct btf_type *pkt_type;
+	enum bpf_reg_type reg_type;
+	struct btf *btf_vmlinux;
+
+	btf_vmlinux = bpf_get_btf_vmlinux();
+	if (IS_ERR_OR_NULL(btf_vmlinux) || btf != btf_vmlinux)
+		return -EINVAL;
+
+	if (atype != BPF_READ)
+		return -EACCES;
+
+	pkt_type = btf_type_by_id(btf_vmlinux, xdp_md_btf_ids[0]);
+	if (!pkt_type)
+		return -EINVAL;
+	if (t != pkt_type)
+		return btf_struct_access(log, btf, t, off, size, atype,
+					 next_btf_id, flag);
+
+	switch (off) {
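+	/* Accesses to the xdp_md fake struct are translated into packet
+	 * pointer register types, mirroring the ctx in XDP programs.
+	 */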
+	case offsetof(struct xdp_md, data):
+		reg_type = PTR_TO_PACKET;
+		break;
+	case offsetof(struct xdp_md, data_meta):
+		reg_type = PTR_TO_PACKET_META;
+		break;
+	case offsetof(struct xdp_md, data_end):
+		reg_type = PTR_TO_PACKET_END;
+		break;
+	default:
+		bpf_log(log, "no read support for xdp_md at off %d\n", off);
+		return -EACCES;
+	}
+
+	if (!__is_valid_xdp_access(off, size))
+		return -EINVAL;
+	return reg_type;
+}
+
+static u32
+dequeue_convert_xdp_md_access(enum bpf_access_type type,
+			      const struct bpf_insn *si, struct bpf_insn *insn_buf,
+			      struct bpf_prog *prog, u32 *target_size)
+{
+	struct bpf_insn *insn = insn_buf;
+	int src_reg;
+
+	switch (si->off) {
+	case offsetof(struct xdp_md, data):
+		/* dst_reg = *(src_reg + off(xdp_frame, data)) */
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_frame, data),
+				      si->dst_reg, si->src_reg,
+				      offsetof(struct xdp_frame, data));
+		break;
+	case offsetof(struct xdp_md, data_meta):
+		if (si->dst_reg == si->src_reg) {
+			src_reg = BPF_REG_9;
+			if (si->dst_reg == src_reg)
+				src_reg--;
+			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, src_reg,
+					      offsetof(struct xdp_frame, next));
+			*insn++ = BPF_MOV64_REG(src_reg, si->src_reg);
+		} else {
+			src_reg = si->src_reg;
+		}
+		/* AX = src_reg
+		 * dst_reg = *(src_reg + off(xdp_frame, data))
+		 * src_reg = *(src_reg + off(xdp_frame, metasize))
+		 * dst_reg -= src_reg
+		 * src_reg = AX
+		 */
+		*insn++ = BPF_MOV64_REG(BPF_REG_AX, src_reg);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_frame, data),
+				      si->dst_reg, src_reg,
+				      offsetof(struct xdp_frame, data));
+		*insn++ = BPF_LDX_MEM(BPF_B, /* metasize == 8 bits */
+				      src_reg, src_reg,
+#if defined(__LITTLE_ENDIAN_BITFIELD)
+				      offsetofend(struct xdp_frame, headroom) + 3);
+#elif defined(__BIG_ENDIAN_BITFIELD)
+				      offsetofend(struct xdp_frame, headroom));
+#endif
+		*insn++ = BPF_ALU64_REG(BPF_SUB, si->dst_reg, src_reg);
+		*insn++ = BPF_MOV64_REG(src_reg, BPF_REG_AX);
+		if (si->dst_reg == si->src_reg)
+			*insn++ = BPF_LDX_MEM(BPF_DW, src_reg, si->src_reg,
+					      offsetof(struct xdp_frame, next));
+		break;
+	case offsetof(struct xdp_md, data_end):
+		if (si->dst_reg == si->src_reg) {
+			src_reg = BPF_REG_9;
+			if (si->dst_reg == src_reg)
+				src_reg--;
+			*insn++ = BPF_STX_MEM(BPF_DW, si->src_reg, src_reg,
+					      offsetof(struct xdp_frame, next));
+			*insn++ = BPF_MOV64_REG(src_reg, si->src_reg);
+		} else {
+			src_reg = si->src_reg;
+		}
+		/* AX = src_reg
+		 * dst_reg = *(src_reg + off(xdp_frame, data))
+		 * src_reg = *(src_reg + off(xdp_frame, len))
+		 * dst_reg += src_reg
+		 * src_reg = AX
+		 */
+		*insn++ = BPF_MOV64_REG(BPF_REG_AX, src_reg);
+		*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_frame, data),
+				      si->dst_reg, src_reg,
+				      offsetof(struct xdp_frame, data));
+		*insn++ = BPF_LDX_MEM(BPF_H, src_reg, src_reg,
+				      offsetof(struct xdp_frame, len));
+		*insn++ = BPF_ALU64_REG(BPF_ADD, si->dst_reg, src_reg);
+		*insn++ = BPF_MOV64_REG(src_reg, BPF_REG_AX);
+		if (si->dst_reg == si->src_reg)
+			*insn++ = BPF_LDX_MEM(BPF_DW, src_reg, si->src_reg,
+					      offsetof(struct xdp_frame, next));
+		break;
+	}
+	return insn - insn_buf;
+}
+
+static bpf_convert_ctx_access_t
+dequeue_get_convert_ctx_access(struct bpf_verifier_log *log,
+			       const struct btf *btf, u32 btf_id)
+{
+	struct btf *btf_vmlinux;
+
+	btf_vmlinux = bpf_get_btf_vmlinux();
+	if (IS_ERR_OR_NULL(btf_vmlinux) || btf != btf_vmlinux)
+		return NULL;
+	if (btf_id != xdp_md_btf_ids[0])
+		return NULL;
+	return dequeue_convert_xdp_md_access;
+}
+
 /* SOCK_ADDR_LOAD_NESTED_FIELD() loads Nested Field S.F.NF where S is type of
  * context Structure, F is Field in context structure that contains a pointer
  * to Nested Structure of type NS that has the field NF.
@@ -10775,6 +10916,8 @@ const struct bpf_verifier_ops dequeue_verifier_ops = {
 	.is_valid_access	= dequeue_is_valid_access,
 	.convert_ctx_access	= dequeue_convert_ctx_access,
 	.gen_prologue		= bpf_noop_prologue,
+	.btf_struct_access	= dequeue_btf_struct_access,
+	.get_convert_ctx_access = dequeue_get_convert_ctx_access,
 };
 
 const struct bpf_prog_ops dequeue_prog_ops = {
-- 
2.37.0



* [RFC PATCH 11/17] dev: Add XDP dequeue hook
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (9 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 10/17] bpf: Implement direct packet access in dequeue progs Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 11:14 ` [RFC PATCH 12/17] bpf: Add helper to schedule an interface for TX dequeue Toke Høiland-Jørgensen
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

Add a second per-interface XDP hook for dequeueing packets. This hook
allows attaching programs of the dequeue type, which will be executed by
the stack in the TX softirq. Packets returned by the dequeue hook are
subsequently transmitted on the interface using the ndo_xdp_xmit() driver
function. The code to do this is added to devmap.c so that it can reuse the
existing bulking mechanism there.

To actually schedule a device for transmission, a BPF program needs to call
a helper that is added in the next commit.
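
For illustration, a dequeue program is attached through the regular XDP
attach API, just with the new mode flag set. A minimal userspace sketch
using libbpf, mirroring the selftest later in this series (the skeleton,
interface and program names are illustrative):

	/* Sketch: attach a dequeue program to an interface. Setting
	 * old_prog_fd to -1 makes libbpf request XDP_FLAGS_REPLACE with
	 * "expect no existing program", which the dequeue attach path
	 * below requires.
	 */
	LIBBPF_OPTS(bpf_xdp_attach_opts, opts, .old_prog_fd = -1);
	int ifindex = if_nametoindex("veth_src");
	int err;

	err = bpf_xdp_attach(ifindex, bpf_program__fd(skel->progs.dequeue_pifo),
			     XDP_FLAGS_DEQUEUE_MODE, &opts);
	if (err)
		fprintf(stderr, "failed to attach dequeue program: %d\n", err);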

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/linux/filter.h             |  17 +++++
 include/linux/netdevice.h          |   6 ++
 include/net/xdp.h                  |   7 ++
 include/uapi/linux/if_link.h       |   4 +-
 kernel/bpf/devmap.c                |  88 ++++++++++++++++++++---
 net/core/dev.c                     | 109 +++++++++++++++++++++++++++++
 net/core/dev.h                     |   2 +
 net/core/filter.c                  |   7 ++
 net/core/rtnetlink.c               |  30 ++++++--
 tools/include/uapi/linux/if_link.h |   4 +-
 10 files changed, 256 insertions(+), 18 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index b0ddb647d5f2..0f1570daaa52 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -778,6 +778,23 @@ static __always_inline u64 bpf_prog_run_xdp(const struct bpf_prog *prog,
 
 void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog);
 
+DECLARE_BPF_DISPATCHER(xdp_dequeue)
+
+static __always_inline struct xdp_frame *bpf_prog_run_xdp_dequeue(const struct bpf_prog *prog,
+								  struct dequeue_data *ctx)
+{
+	struct xdp_frame *frm = NULL;
+	u64 ret;
+
+	ret = __bpf_prog_run(prog, ctx, BPF_DISPATCHER_FUNC(xdp_dequeue));
+	if (ret)
+		frm = (struct xdp_frame *)(unsigned long)ret;
+
+	return frm;
+}
+
+void bpf_prog_change_xdp_dequeue(struct bpf_prog *prev_prog, struct bpf_prog *prog);
+
 static inline u32 bpf_prog_insn_size(const struct bpf_prog *prog)
 {
 	return prog->len * sizeof(struct bpf_insn);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index fe9aeca2fce9..4096caac5a2a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -74,6 +74,7 @@ struct udp_tunnel_nic_info;
 struct udp_tunnel_nic;
 struct bpf_prog;
 struct xdp_buff;
+struct xdp_dequeue;
 
 void synchronize_net(void);
 void netdev_set_default_ethtool_ops(struct net_device *dev,
@@ -2326,6 +2327,7 @@ struct net_device {
 
 	/* protected by rtnl_lock */
 	struct bpf_xdp_entity	xdp_state[__MAX_XDP_MODE];
+	struct bpf_prog __rcu	*xdp_dequeue_prog;
 
 	u8 dev_addr_shadow[MAX_ADDR_LEN];
 	netdevice_tracker	linkwatch_dev_tracker;
@@ -3109,6 +3111,7 @@ struct softnet_data {
 	struct Qdisc		*output_queue;
 	struct Qdisc		**output_queue_tailp;
 	struct sk_buff		*completion_queue;
+	struct xdp_dequeue	*xdp_dequeue;
 #ifdef CONFIG_XFRM_OFFLOAD
 	struct sk_buff_head	xfrm_backlog;
 #endif
@@ -3143,6 +3146,7 @@ struct softnet_data {
 	int			defer_ipi_scheduled;
 	struct sk_buff		*defer_list;
 	call_single_data_t	defer_csd;
+
 };
 
 static inline void input_queue_head_incr(struct softnet_data *sd)
@@ -3222,6 +3226,7 @@ static inline void netif_tx_start_all_queues(struct net_device *dev)
 }
 
 void netif_tx_wake_queue(struct netdev_queue *dev_queue);
+void netif_tx_schedule_xdp(struct xdp_dequeue *deq);
 
 /**
  *	netif_wake_queue - restart transmit
@@ -3851,6 +3856,7 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
 u8 dev_xdp_prog_count(struct net_device *dev);
 u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode);
+u32 dev_xdp_dequeue_prog_id(struct net_device *dev);
 
 int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
 int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 728ce943d352..e06b340132dd 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -89,6 +89,13 @@ struct dequeue_data {
 	struct xdp_txq_info *txq;
 };
 
+struct xdp_dequeue {
+	struct xdp_dequeue *next;
+};
+
+void dev_run_xdp_dequeue(struct xdp_dequeue *deq);
+void dev_schedule_xdp_dequeue(struct net_device *dev);
+
 static __always_inline bool xdp_buff_has_frags(struct xdp_buff *xdp)
 {
 	return !!(xdp->flags & XDP_FLAGS_HAS_FRAGS);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index e36d9d2c65a7..fb8ab1796cd2 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1283,9 +1283,10 @@ enum {
 #define XDP_FLAGS_DRV_MODE		(1U << 2)
 #define XDP_FLAGS_HW_MODE		(1U << 3)
 #define XDP_FLAGS_REPLACE		(1U << 4)
+#define XDP_FLAGS_DEQUEUE_MODE		(1U << 5)
 #define XDP_FLAGS_MODES			(XDP_FLAGS_SKB_MODE | \
 					 XDP_FLAGS_DRV_MODE | \
-					 XDP_FLAGS_HW_MODE)
+					 XDP_FLAGS_HW_MODE | XDP_FLAGS_DEQUEUE_MODE)
 #define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
 					 XDP_FLAGS_MODES | XDP_FLAGS_REPLACE)
 
@@ -1308,6 +1309,7 @@ enum {
 	IFLA_XDP_SKB_PROG_ID,
 	IFLA_XDP_HW_PROG_ID,
 	IFLA_XDP_EXPECTED_FD,
+	IFLA_XDP_DEQUEUE_PROG_ID,
 	__IFLA_XDP_MAX,
 };
 
diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 980f8928e977..949a60f06d24 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -59,6 +59,7 @@ struct xdp_dev_bulk_queue {
 	struct net_device *dev;
 	struct net_device *dev_rx;
 	struct bpf_prog *xdp_prog;
+	struct xdp_dequeue deq;
 	unsigned int count;
 };
 
@@ -362,16 +363,17 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
 	return nframes; /* sent frames count */
 }
 
-static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
+static bool bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags, bool keep)
 {
 	struct net_device *dev = bq->dev;
 	unsigned int cnt = bq->count;
 	int sent = 0, err = 0;
 	int to_send = cnt;
-	int i;
+	bool ret = true;
+	int i, kept = 0;
 
 	if (unlikely(!cnt))
-		return;
+		return true;
 
 	for (i = 0; i < cnt; i++) {
 		struct xdp_frame *xdpf = bq->q[i];
@@ -394,15 +396,29 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
 		sent = 0;
 	}
 
-	/* If not all frames have been transmitted, it is our
-	 * responsibility to free them
+	/* If not all frames have been transmitted, it is our responsibility to
+	 * free them, unless the caller asked for them to be kept, in which case
+	 * we'll move them to the head of the queue
 	 */
-	for (i = sent; unlikely(i < to_send); i++)
-		xdp_return_frame_rx_napi(bq->q[i]);
+	if (unlikely(sent < to_send)) {
+		ret = false;
+		if (keep) {
+			if (!sent) {
+				kept = to_send;
+				goto out;
+			}
+			for (i = sent; i < to_send; i++)
+				bq->q[kept++] = bq->q[i];
+		} else {
+			for (i = sent; i < to_send; i++)
+				xdp_return_frame_rx_napi(bq->q[i]);
+		}
+	}
 
 out:
-	bq->count = 0;
-	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent, err);
+	bq->count = kept;
+	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent - kept, err);
+	return ret;
 }
 
 /* __dev_flush is called from xdp_do_flush() which _must_ be signalled from the
@@ -415,13 +431,63 @@ void __dev_flush(void)
 	struct xdp_dev_bulk_queue *bq, *tmp;
 
 	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
-		bq_xmit_all(bq, XDP_XMIT_FLUSH);
+		bq_xmit_all(bq, XDP_XMIT_FLUSH, false);
 		bq->dev_rx = NULL;
 		bq->xdp_prog = NULL;
 		__list_del_clearprev(&bq->flush_node);
 	}
 }
 
+void dev_schedule_xdp_dequeue(struct net_device *dev)
+{
+	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
+
+	netif_tx_schedule_xdp(&bq->deq);
+}
+
+void dev_run_xdp_dequeue(struct xdp_dequeue *deq)
+{
+	while (deq) {
+		struct xdp_dev_bulk_queue *bq = container_of(deq, struct xdp_dev_bulk_queue, deq);
+		struct xdp_txq_info txqi = { .dev = bq->dev };
+		struct dequeue_data ctx = { .txq = &txqi };
+		struct xdp_dequeue *nxt = deq->next;
+		int quota = dev_tx_weight;
+		struct xdp_frame *xdpf;
+		struct bpf_prog *prog;
+		bool ret = true;
+
+		local_bh_disable();
+
+		prog = rcu_dereference(bq->dev->xdp_dequeue_prog);
+		if (likely(prog)) {
+			do {
+				if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) {
+					ret = bq_xmit_all(bq, 0, true);
+					if (!ret)
+						break;
+				}
+				xdpf = bpf_prog_run_xdp_dequeue(prog, &ctx);
+				if (xdpf)
+					bq->q[bq->count++] = xdpf;
+
+			} while (xdpf && --quota);
+
+			if (ret)
+				ret = bq_xmit_all(bq, XDP_XMIT_FLUSH, true);
+
+			if (!ret || !quota)
+				/* out of space or out of quota, reschedule */
+				netif_tx_schedule_xdp(deq);
+		}
+
+		deq->next = NULL;
+		deq = nxt;
+
+		local_bh_enable();
+	}
+}
+
 /* Elements are kept alive by RCU; either by rcu_read_lock() (from syscall) or
  * by local_bh_disable() (from XDP calls inside NAPI). The
  * rcu_read_lock_bh_held() below makes lockdep accept both.
@@ -450,7 +516,7 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
 	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
 
 	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
-		bq_xmit_all(bq, 0);
+		bq_xmit_all(bq, 0, false);
 
 	/* Ingress dev_rx will be the same for all xdp_frame's in
 	 * bulk_queue, because bq stored per-CPU and must be flushed
diff --git a/net/core/dev.c b/net/core/dev.c
index 978ed0622d8f..07505c88117a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3120,6 +3120,22 @@ void netif_tx_wake_queue(struct netdev_queue *dev_queue)
 }
 EXPORT_SYMBOL(netif_tx_wake_queue);
 
+void netif_tx_schedule_xdp(struct xdp_dequeue *deq)
+{
+	bool need_bh_off = !(hardirq_count() | softirq_count());
+
+	WARN_ON_ONCE(need_bh_off);
+
+	if (!deq->next) {
+		struct softnet_data *sd = this_cpu_ptr(&softnet_data);
+
+		deq->next = sd->xdp_dequeue;
+		sd->xdp_dequeue = deq;
+		raise_softirq_irqoff(NET_TX_SOFTIRQ);
+	}
+}
+EXPORT_SYMBOL(netif_tx_schedule_xdp);
+
 void __dev_kfree_skb_irq(struct sk_buff *skb, enum skb_free_reason reason)
 {
 	unsigned long flags;
@@ -5011,6 +5027,17 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
 
+	if (sd->xdp_dequeue) {
+		struct xdp_dequeue *deq;
+
+		local_irq_disable();
+		deq = sd->xdp_dequeue;
+		sd->xdp_dequeue = NULL;
+		local_irq_enable();
+
+		dev_run_xdp_dequeue(deq);
+	}
+
 	if (sd->completion_queue) {
 		struct sk_buff *clist;
 
@@ -9522,6 +9549,88 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 	return err;
 }
 
+u32 dev_xdp_dequeue_prog_id(struct net_device *dev)
+{
+	struct bpf_prog *prog = rtnl_dereference(dev->xdp_dequeue_prog);
+
+	return prog ? prog->aux->id : 0;
+}
+
+static int dev_xdp_dequeue_attach(struct net_device *dev, struct netlink_ext_ack *extack,
+				  struct bpf_prog *new_prog, struct bpf_prog *old_prog, u32 flags)
+{
+	struct bpf_prog *cur_prog;
+
+	ASSERT_RTNL();
+
+	if (!(flags & XDP_FLAGS_REPLACE) || (flags & XDP_FLAGS_UPDATE_IF_NOEXIST)) {
+		NL_SET_ERR_MSG(extack, "Dequeue prog must use XDP_FLAGS_REPLACE");
+		return -EINVAL;
+	}
+
+	cur_prog = rcu_dereference(dev->xdp_dequeue_prog);
+
+	if (cur_prog != old_prog) {
+		NL_SET_ERR_MSG(extack, "Active program does not match expected");
+		return -EEXIST;
+	}
+
+	if (cur_prog != new_prog) {
+		rcu_assign_pointer(dev->xdp_dequeue_prog, new_prog);
+		bpf_prog_change_xdp_dequeue(cur_prog, new_prog);
+	}
+
+	if (cur_prog)
+		bpf_prog_put(cur_prog);
+
+	return 0;
+}
+
+/**
+ *	dev_change_xdp_dequeue_fd - set or clear a bpf program for an XDP dequeue hook
+ *	@dev: device
+ *	@extack: netlink extended ack
+ *	@fd: new program fd or negative value to clear
+ *	@expected_fd: old program fd that userspace expects to replace or clear
+ *	@flags: xdp dequeue-related flags
+ *
+ *	Set or clear an XDP dequeue program for a device
+ */
+int dev_change_xdp_dequeue_fd(struct net_device *dev, struct netlink_ext_ack *extack,
+			      int fd, int expected_fd, u32 flags)
+{
+	struct bpf_prog *new_prog = NULL, *old_prog = NULL;
+	int err;
+
+	ASSERT_RTNL();
+
+	if (fd >= 0) {
+		new_prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_DEQUEUE, false);
+		if (IS_ERR(new_prog))
+			return PTR_ERR(new_prog);
+	}
+
+	if (expected_fd >= 0) {
+		old_prog = bpf_prog_get_type_dev(expected_fd,
+						 BPF_PROG_TYPE_DEQUEUE,
+						 false);
+		if (IS_ERR(old_prog)) {
+			err = PTR_ERR(old_prog);
+			old_prog = NULL;
+			goto err_out;
+		}
+	}
+
+	err = dev_xdp_dequeue_attach(dev, extack, new_prog, old_prog, flags);
+
+err_out:
+	if (err && new_prog)
+		bpf_prog_put(new_prog);
+	if (old_prog)
+		bpf_prog_put(old_prog);
+	return err;
+}
+
 /**
  *	dev_new_index	-	allocate an ifindex
  *	@net: the applicable net namespace
diff --git a/net/core/dev.h b/net/core/dev.h
index cbb8a925175a..fe598287f786 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -81,6 +81,8 @@ void dev_change_proto_down_reason(struct net_device *dev, unsigned long mask,
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
 		      int fd, int expected_fd, u32 flags);
+int dev_change_xdp_dequeue_fd(struct net_device *dev, struct netlink_ext_ack *extack,
+			      int fd, int expected_fd, u32 flags);
 
 int dev_change_tx_queue_len(struct net_device *dev, unsigned long new_len);
 void dev_set_group(struct net_device *dev, int new_group);
diff --git a/net/core/filter.c b/net/core/filter.c
index 6a4881739e9b..7c89eaa01c29 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -11584,6 +11584,13 @@ void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog)
 	bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(xdp), prev_prog, prog);
 }
 
+DEFINE_BPF_DISPATCHER(xdp_dequeue)
+
+void bpf_prog_change_xdp_dequeue(struct bpf_prog *prev_prog, struct bpf_prog *prog)
+{
+	bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(xdp_dequeue), prev_prog, prog);
+}
+
 BTF_ID_LIST_GLOBAL(btf_sock_ids, MAX_BTF_SOCK_TYPE)
 #define BTF_SOCK_TYPE(name, type) BTF_ID(struct, type)
 BTF_SOCK_TYPE_xxx
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index ac45328607f7..495acb5a6616 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1012,7 +1012,8 @@ static size_t rtnl_xdp_size(void)
 	size_t xdp_size = nla_total_size(0) +	/* nest IFLA_XDP */
 			  nla_total_size(1) +	/* XDP_ATTACHED */
 			  nla_total_size(4) +	/* XDP_PROG_ID (or 1st mode) */
-			  nla_total_size(4);	/* XDP_<mode>_PROG_ID */
+			  nla_total_size(4) +	/* XDP_<mode>_PROG_ID */
+			  nla_total_size(4);	/* XDP_DEQUEUE_PROG_ID */
 
 	return xdp_size;
 }
@@ -1467,6 +1468,11 @@ static u32 rtnl_xdp_prog_hw(struct net_device *dev)
 	return dev_xdp_prog_id(dev, XDP_MODE_HW);
 }
 
+static u32 rtnl_xdp_dequeue_prog(struct net_device *dev)
+{
+	return dev_xdp_dequeue_prog_id(dev);
+}
+
 static int rtnl_xdp_report_one(struct sk_buff *skb, struct net_device *dev,
 			       u32 *prog_id, u8 *mode, u8 tgt_mode, u32 attr,
 			       u32 (*get_prog_id)(struct net_device *dev))
@@ -1527,6 +1533,13 @@ static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
 			goto err_cancel;
 	}
 
+	prog_id = rtnl_xdp_dequeue_prog(dev);
+	if (prog_id) {
+		err = nla_put_u32(skb, IFLA_XDP_DEQUEUE_PROG_ID, prog_id);
+		if (err)
+			goto err_cancel;
+	}
+
 	nla_nest_end(skb, xdp);
 	return 0;
 
@@ -1979,6 +1992,7 @@ static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
 	[IFLA_XDP_ATTACHED]	= { .type = NLA_U8 },
 	[IFLA_XDP_FLAGS]	= { .type = NLA_U32 },
 	[IFLA_XDP_PROG_ID]	= { .type = NLA_U32 },
+	[IFLA_XDP_DEQUEUE_PROG_ID]	= { .type = NLA_U32 },
 };
 
 static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla)
@@ -2998,10 +3012,16 @@ static int do_setlink(const struct sk_buff *skb,
 					nla_get_s32(xdp[IFLA_XDP_EXPECTED_FD]);
 			}
 
-			err = dev_change_xdp_fd(dev, extack,
-						nla_get_s32(xdp[IFLA_XDP_FD]),
-						expected_fd,
-						xdp_flags);
+			if (xdp_flags & XDP_FLAGS_DEQUEUE_MODE)
+				err = dev_change_xdp_dequeue_fd(dev, extack,
+								nla_get_s32(xdp[IFLA_XDP_FD]),
+								expected_fd,
+								xdp_flags);
+			else
+				err = dev_change_xdp_fd(dev, extack,
+							nla_get_s32(xdp[IFLA_XDP_FD]),
+							expected_fd,
+							xdp_flags);
 			if (err)
 				goto errout;
 			status |= DO_SETLINK_NOTIFY;
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 0242f31e339c..f40ad0db46b7 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -1188,9 +1188,10 @@ enum {
 #define XDP_FLAGS_DRV_MODE		(1U << 2)
 #define XDP_FLAGS_HW_MODE		(1U << 3)
 #define XDP_FLAGS_REPLACE		(1U << 4)
+#define XDP_FLAGS_DEQUEUE_MODE		(1U << 5)
 #define XDP_FLAGS_MODES			(XDP_FLAGS_SKB_MODE | \
 					 XDP_FLAGS_DRV_MODE | \
-					 XDP_FLAGS_HW_MODE)
+					 XDP_FLAGS_HW_MODE | XDP_FLAGS_DEQUEUE_MODE)
 #define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
 					 XDP_FLAGS_MODES | XDP_FLAGS_REPLACE)
 
@@ -1213,6 +1214,7 @@ enum {
 	IFLA_XDP_SKB_PROG_ID,
 	IFLA_XDP_HW_PROG_ID,
 	IFLA_XDP_EXPECTED_FD,
+	IFLA_XDP_DEQUEUE_PROG_ID,
 	__IFLA_XDP_MAX,
 };
 
-- 
2.37.0



* [RFC PATCH 12/17] bpf: Add helper to schedule an interface for TX dequeue
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (10 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 11/17] dev: Add XDP dequeue hook Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 11:14 ` [RFC PATCH 13/17] libbpf: Add support for dequeue program type and PIFO map type Toke Høiland-Jørgensen
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen, Eric Dumazet,
	Paolo Abeni

This adds a helper that a BPF program can call to schedule an interface for
transmission. The helper can be used from both a regular XDP program (to
schedule transmission after queueing a packet), and from a dequeue program
to (re-)schedule transmission after a dequeue operation. In particular, the
latter use can be combined with BPF timers to schedule delayed
transmission, for instance to implement traffic shaping.

The helper always schedules transmission on the interface on the current
CPU. For cross-CPU operation, it is up to the BPF program to arrange for
the helper to be called on the appropriate CPU, either by configuring
hardware RSS appropriately, or by using a cpumap. Likewise, it is up to the
BPF programs to decide whether to use separate queues per CPU (by queueing
packets into multiple maps), or to accept the lock contention of sharing a
single map across CPUs.
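
Put together with the PIFO map, the expected XDP-side usage looks roughly
like this (a sketch based on the selftest at the end of the series; the
map and the tgt_ifindex variable are illustrative):

	SEC("xdp")
	int xdp_enqueue(struct xdp_md *xdp)
	{
		/* enqueue the packet into the PIFO map at priority 0 */
		int ret = bpf_redirect_map(&pifo_map, 0, 0);

		/* kick TX processing on the egress device so its attached
		 * dequeue program will run and pull the packet back out
		 */
		if (ret == XDP_REDIRECT)
			bpf_schedule_iface_dequeue(xdp, tgt_ifindex, 0);

		return ret;
	}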

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/uapi/linux/bpf.h       | 11 +++++++
 net/core/filter.c              | 52 ++++++++++++++++++++++++++++++++++
 tools/include/uapi/linux/bpf.h | 11 +++++++
 3 files changed, 74 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index d44382644391..b352ecc280f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5358,6 +5358,16 @@ union bpf_attr {
  *		*bpf_packet_dequeue()* (and checked to not be NULL).
  *	Return
  *		This always succeeds and returns zero.
+ *
+ * long bpf_schedule_iface_dequeue(void *ctx, int ifindex, int flags)
+ *	Description
+ *		Schedule the interface with index *ifindex* for transmission from
+ *		its dequeue program as soon as possible. The *flags* argument
+ *		must be zero.
+ *
+ *	Return
+ *		Returns zero on success, or a negative error: -EINVAL if
+ *		*flags* is non-zero, -ENODEV if *ifindex* does not identify a
+ *		valid interface, or -ENOENT if no dequeue program is attached
+ *		to the interface.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5570,6 +5580,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(packet_dequeue),		\
 	FN(packet_drop),		\
+	FN(schedule_iface_dequeue),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
diff --git a/net/core/filter.c b/net/core/filter.c
index 7c89eaa01c29..bb556d873b52 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4431,6 +4431,54 @@ static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
 	.arg3_type      = ARG_ANYTHING,
 };
 
+static int bpf_schedule_iface_dequeue(struct net *net, int ifindex, int flags)
+{
+	struct net_device *dev;
+	struct bpf_prog *prog;
+
+	if (flags)
+		return -EINVAL;
+
+	dev = dev_get_by_index_rcu(net, ifindex);
+	if (!dev)
+		return -ENODEV;
+
+	prog = rcu_dereference(dev->xdp_dequeue_prog);
+	if (!prog)
+		return -ENOENT;
+
+	dev_schedule_xdp_dequeue(dev);
+	return 0;
+}
+
+BPF_CALL_3(bpf_xdp_schedule_iface_dequeue, struct xdp_buff *, ctx, int, ifindex, int, flags)
+{
+	return bpf_schedule_iface_dequeue(dev_net(ctx->rxq->dev), ifindex, flags);
+}
+
+static const struct bpf_func_proto bpf_xdp_schedule_iface_dequeue_proto = {
+	.func           = bpf_xdp_schedule_iface_dequeue,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_dequeue_schedule_iface_dequeue, struct dequeue_data *, ctx, int, ifindex, int, flags)
+{
+	return bpf_schedule_iface_dequeue(dev_net(ctx->txq->dev), ifindex, flags);
+}
+
+static const struct bpf_func_proto bpf_dequeue_schedule_iface_dequeue_proto = {
+	.func           = bpf_dequeue_schedule_iface_dequeue,
+	.gpl_only       = false,
+	.ret_type       = RET_INTEGER,
+	.arg1_type      = ARG_PTR_TO_CTX,
+	.arg2_type      = ARG_ANYTHING,
+	.arg3_type      = ARG_ANYTHING,
+};
+
 BTF_ID_LIST_SINGLE(xdp_md_btf_ids, struct, xdp_md)
 
 BPF_CALL_4(bpf_packet_dequeue, struct dequeue_data *, ctx, struct bpf_map *, map,
@@ -8068,6 +8116,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_xdp_fib_lookup_proto;
 	case BPF_FUNC_check_mtu:
 		return &bpf_xdp_check_mtu_proto;
+	case BPF_FUNC_schedule_iface_dequeue:
+		return &bpf_xdp_schedule_iface_dequeue_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_udp:
 		return &bpf_xdp_sk_lookup_udp_proto;
@@ -8105,6 +8155,8 @@ dequeue_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_packet_dequeue_proto;
 	case BPF_FUNC_packet_drop:
 		return &bpf_packet_drop_proto;
+	case BPF_FUNC_schedule_iface_dequeue:
+		return &bpf_dequeue_schedule_iface_dequeue_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 1dab68a89e18..9eb9a5b52c76 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5358,6 +5358,16 @@ union bpf_attr {
  *		*bpf_packet_dequeue()* (and checked to not be NULL).
  *	Return
  *		This always succeeds and returns zero.
+ *
+ * long bpf_schedule_iface_dequeue(void *ctx, int ifindex, int flags)
+ *	Description
+ *		Schedule the interface with index *ifindex* for transmission from
+ *		its dequeue program as soon as possible. The *flags* argument
+ *		must be zero.
+ *
+ *	Return
+ *		Returns zero on success, or a negative error: -EINVAL if
+ *		*flags* is non-zero, -ENODEV if *ifindex* does not identify a
+ *		valid interface, or -ENOENT if no dequeue program is attached
+ *		to the interface.
  */
 #define __BPF_FUNC_MAPPER(FN)		\
 	FN(unspec),			\
@@ -5570,6 +5580,7 @@ union bpf_attr {
 	FN(tcp_raw_check_syncookie_ipv6),	\
 	FN(packet_dequeue),		\
 	FN(packet_drop),		\
+	FN(schedule_iface_dequeue),	\
 	/* */
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
-- 
2.37.0



* [RFC PATCH 13/17] libbpf: Add support for dequeue program type and PIFO map type
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (11 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 12/17] bpf: Add helper to schedule an interface for TX dequeue Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 11:14 ` [RFC PATCH 14/17] libbpf: Add support for querying dequeue programs Toke Høiland-Jørgensen
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

Add support for a 'dequeue' section type for specifying dequeue-type
programs, and add the dequeue program type and the PIFO map types to the
feature-probing code.
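
For reference, a skeleton of a program in the new section type (a sketch
based on the selftests elsewhere in the series; struct dequeue_ctx and
the helpers come from earlier patches, and the map name is illustrative):

	SEC("dequeue")
	void *dequeue_prog(struct dequeue_ctx *ctx)
	{
		__u64 prio = 0;
		void *pkt;

		/* pull the highest-priority packet out of the PIFO map */
		pkt = (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &prio);
		if (!pkt)
			return NULL; /* queue empty, nothing to transmit */

		/* the returned pointer is reference-counted: either return
		 * it for transmission, or release it with
		 * bpf_packet_drop(ctx, pkt)
		 */
		return pkt;
	}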

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 tools/lib/bpf/libbpf.c        | 1 +
 tools/lib/bpf/libbpf_probes.c | 5 +++++
 2 files changed, 6 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index cb49408eb298..8553bb8369e0 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -8431,6 +8431,7 @@ static const struct bpf_sec_def section_defs[] = {
 	SEC_DEF("xdp/cpumap",		XDP, BPF_XDP_CPUMAP, SEC_ATTACHABLE),
 	SEC_DEF("xdp.frags",		XDP, BPF_XDP, SEC_XDP_FRAGS),
 	SEC_DEF("xdp",			XDP, BPF_XDP, SEC_ATTACHABLE_OPT),
+	SEC_DEF("dequeue",		DEQUEUE, 0, SEC_NONE),
 	SEC_DEF("perf_event",		PERF_EVENT, 0, SEC_NONE),
 	SEC_DEF("lwt_in",		LWT_IN, 0, SEC_NONE),
 	SEC_DEF("lwt_out",		LWT_OUT, 0, SEC_NONE),
diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c
index 0b5398786bf3..a9ead2d55264 100644
--- a/tools/lib/bpf/libbpf_probes.c
+++ b/tools/lib/bpf/libbpf_probes.c
@@ -97,6 +97,7 @@ static int probe_prog_load(enum bpf_prog_type prog_type,
 	case BPF_PROG_TYPE_SK_REUSEPORT:
 	case BPF_PROG_TYPE_FLOW_DISSECTOR:
 	case BPF_PROG_TYPE_CGROUP_SYSCTL:
+	case BPF_PROG_TYPE_DEQUEUE:
 		break;
 	default:
 		return -EOPNOTSUPP;
@@ -244,6 +245,10 @@ static int probe_map_create(enum bpf_map_type map_type)
 		key_size = 0;
 		max_entries = 1;
 		break;
+	case BPF_MAP_TYPE_PIFO_GENERIC:
+	case BPF_MAP_TYPE_PIFO_XDP:
+		opts.map_extra = 8;
+		break;
 	case BPF_MAP_TYPE_HASH:
 	case BPF_MAP_TYPE_ARRAY:
 	case BPF_MAP_TYPE_PROG_ARRAY:
-- 
2.37.0



* [RFC PATCH 14/17] libbpf: Add support for querying dequeue programs
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (12 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 13/17] libbpf: Add support for dequeue program type and PIFO map type Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-14  5:36   ` Andrii Nakryiko
  2022-07-13 11:14 ` [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog Toke Høiland-Jørgensen
                   ` (5 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

Add support to libbpf for reading the dequeue program ID from netlink when
querying for installed XDP programs. No additional support is needed to
install dequeue programs, as they are just using a new mode flag for the
regular XDP program installation mechanism.
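
With this in place, querying for an attached dequeue program looks like
the following sketch (error handling elided; ifindex is assumed to be
set):

	LIBBPF_OPTS(bpf_xdp_query_opts, opts);
	__u32 dequeue_prog_id = 0;

	if (!bpf_xdp_query(ifindex, 0, &opts))
		dequeue_prog_id = opts.dequeue_prog_id;

	/* or, equivalently, through the id-only convenience wrapper: */
	bpf_xdp_query_id(ifindex, XDP_FLAGS_DEQUEUE_MODE, &dequeue_prog_id);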

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 tools/lib/bpf/libbpf.h  | 1 +
 tools/lib/bpf/netlink.c | 8 ++++++++
 2 files changed, 9 insertions(+)

diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index e4d5353f757b..b15ff90279cb 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -906,6 +906,7 @@ struct bpf_xdp_query_opts {
 	__u32 drv_prog_id;	/* output */
 	__u32 hw_prog_id;	/* output */
 	__u32 skb_prog_id;	/* output */
+	__u32 dequeue_prog_id;	/* output */
 	__u8 attach_mode;	/* output */
 	size_t :0;
 };
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 6c013168032d..64a9aceb9c9c 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -32,6 +32,7 @@ struct xdp_link_info {
 	__u32 drv_prog_id;
 	__u32 hw_prog_id;
 	__u32 skb_prog_id;
+	__u32 dequeue_prog_id;
 	__u8 attach_mode;
 };
 
@@ -354,6 +355,10 @@ static int get_xdp_info(void *cookie, void *msg, struct nlattr **tb)
 		xdp_id->info.hw_prog_id = libbpf_nla_getattr_u32(
 			xdp_tb[IFLA_XDP_HW_PROG_ID]);
 
+	if (xdp_tb[IFLA_XDP_DEQUEUE_PROG_ID])
+		xdp_id->info.dequeue_prog_id = libbpf_nla_getattr_u32(
+			xdp_tb[IFLA_XDP_DEQUEUE_PROG_ID]);
+
 	return 0;
 }
 
@@ -391,6 +396,7 @@ int bpf_xdp_query(int ifindex, int xdp_flags, struct bpf_xdp_query_opts *opts)
 	OPTS_SET(opts, drv_prog_id, xdp_id.info.drv_prog_id);
 	OPTS_SET(opts, hw_prog_id, xdp_id.info.hw_prog_id);
 	OPTS_SET(opts, skb_prog_id, xdp_id.info.skb_prog_id);
+	OPTS_SET(opts, dequeue_prog_id, xdp_id.info.dequeue_prog_id);
 	OPTS_SET(opts, attach_mode, xdp_id.info.attach_mode);
 
 	return 0;
@@ -415,6 +421,8 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id)
 		*prog_id = opts.hw_prog_id;
 	else if (flags & XDP_FLAGS_SKB_MODE)
 		*prog_id = opts.skb_prog_id;
+	else if (flags & XDP_FLAGS_DEQUEUE_MODE)
+		*prog_id = opts.dequeue_prog_id;
 	else
 		*prog_id = 0;
 
-- 
2.37.0



* [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (13 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 14/17] libbpf: Add support for querying dequeue programs Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-14  5:38   ` Andrii Nakryiko
  2022-07-13 11:14 ` [RFC PATCH 16/17] selftests/bpf: Add test for XDP queueing through PIFO maps Toke Høiland-Jørgensen
                   ` (4 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen, Shuah Khan

From: Kumar Kartikeya Dwivedi <memxor@gmail.com>

Test various cases of direct packet access: proper range propagation,
comparison of packet pointers pointing into separate xdp_frames, and
correct invalidation on packet drop (so that multiple packet pointers
can be used safely in a dequeue program).
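
In C terms, the "bad comparison" case below corresponds to something like
the following sketch (illustrative only, not part of the patch), where
pkt1 and pkt2 were returned by two separate bpf_packet_dequeue() calls:

	void *data2 = (void *)(long)pkt2->data;
	void *data_end1 = (void *)(long)pkt1->data_end;

	/* rejected: data2 and data_end1 carry different pkt_uid values, so
	 * the verifier refuses the comparison rather than propagating a
	 * range from one xdp_frame to accesses into the other
	 */
	if (data2 + 8 <= data_end1)
		return pkt2;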

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 tools/testing/selftests/bpf/test_verifier.c   |  29 +++-
 .../testing/selftests/bpf/verifier/dequeue.c  | 160 ++++++++++++++++++
 2 files changed, 180 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index f9d553fbf68a..8d26ca96520b 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -55,7 +55,7 @@
 #define MAX_UNEXPECTED_INSNS	32
 #define MAX_TEST_INSNS	1000000
 #define MAX_FIXUPS	8
-#define MAX_NR_MAPS	23
+#define MAX_NR_MAPS	24
 #define MAX_TEST_RUNS	8
 #define POINTER_VALUE	0xcafe4all
 #define TEST_DATA_LEN	64
@@ -131,6 +131,7 @@ struct bpf_test {
 	int fixup_map_ringbuf[MAX_FIXUPS];
 	int fixup_map_timer[MAX_FIXUPS];
 	int fixup_map_kptr[MAX_FIXUPS];
+	int fixup_map_pifo[MAX_FIXUPS];
 	struct kfunc_btf_id_pair fixup_kfunc_btf_id[MAX_FIXUPS];
 	/* Expected verifier log output for result REJECT or VERBOSE_ACCEPT.
 	 * Can be a tab-separated sequence of expected strings. An empty string
@@ -145,6 +146,7 @@ struct bpf_test {
 		ACCEPT,
 		REJECT,
 		VERBOSE_ACCEPT,
+		VERBOSE_REJECT,
 	} result, result_unpriv;
 	enum bpf_prog_type prog_type;
 	uint8_t flags;
@@ -546,11 +548,12 @@ static bool skip_unsupported_map(enum bpf_map_type map_type)
 
 static int __create_map(uint32_t type, uint32_t size_key,
 			uint32_t size_value, uint32_t max_elem,
-			uint32_t extra_flags)
+			uint32_t extra_flags, uint64_t map_extra)
 {
 	LIBBPF_OPTS(bpf_map_create_opts, opts);
 	int fd;
 
+	opts.map_extra = map_extra;
 	opts.map_flags = (type == BPF_MAP_TYPE_HASH ? BPF_F_NO_PREALLOC : 0) | extra_flags;
 	fd = bpf_map_create(type, NULL, size_key, size_value, max_elem, &opts);
 	if (fd < 0) {
@@ -565,7 +568,7 @@ static int __create_map(uint32_t type, uint32_t size_key,
 static int create_map(uint32_t type, uint32_t size_key,
 		      uint32_t size_value, uint32_t max_elem)
 {
-	return __create_map(type, size_key, size_value, max_elem, 0);
+	return __create_map(type, size_key, size_value, max_elem, 0, 0);
 }
 
 static void update_map(int fd, int index)
@@ -904,6 +907,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	int *fixup_map_ringbuf = test->fixup_map_ringbuf;
 	int *fixup_map_timer = test->fixup_map_timer;
 	int *fixup_map_kptr = test->fixup_map_kptr;
+	int *fixup_map_pifo = test->fixup_map_pifo;
 	struct kfunc_btf_id_pair *fixup_kfunc_btf_id = test->fixup_kfunc_btf_id;
 
 	if (test->fill_helper) {
@@ -1033,7 +1037,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	if (*fixup_map_array_ro) {
 		map_fds[14] = __create_map(BPF_MAP_TYPE_ARRAY, sizeof(int),
 					   sizeof(struct test_val), 1,
-					   BPF_F_RDONLY_PROG);
+					   BPF_F_RDONLY_PROG, 0);
 		update_map(map_fds[14], 0);
 		do {
 			prog[*fixup_map_array_ro].imm = map_fds[14];
@@ -1043,7 +1047,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	if (*fixup_map_array_wo) {
 		map_fds[15] = __create_map(BPF_MAP_TYPE_ARRAY, sizeof(int),
 					   sizeof(struct test_val), 1,
-					   BPF_F_WRONLY_PROG);
+					   BPF_F_WRONLY_PROG, 0);
 		update_map(map_fds[15], 0);
 		do {
 			prog[*fixup_map_array_wo].imm = map_fds[15];
@@ -1052,7 +1056,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	}
 	if (*fixup_map_array_small) {
 		map_fds[16] = __create_map(BPF_MAP_TYPE_ARRAY, sizeof(int),
-					   1, 1, 0);
+					   1, 1, 0, 0);
 		update_map(map_fds[16], 0);
 		do {
 			prog[*fixup_map_array_small].imm = map_fds[16];
@@ -1068,7 +1072,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	}
 	if (*fixup_map_event_output) {
 		map_fds[18] = __create_map(BPF_MAP_TYPE_PERF_EVENT_ARRAY,
-					   sizeof(int), sizeof(int), 1, 0);
+					   sizeof(int), sizeof(int), 1, 0, 0);
 		do {
 			prog[*fixup_map_event_output].imm = map_fds[18];
 			fixup_map_event_output++;
@@ -1076,7 +1080,7 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 	}
 	if (*fixup_map_reuseport_array) {
 		map_fds[19] = __create_map(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
-					   sizeof(u32), sizeof(u64), 1, 0);
+					   sizeof(u32), sizeof(u64), 1, 0, 0);
 		do {
 			prog[*fixup_map_reuseport_array].imm = map_fds[19];
 			fixup_map_reuseport_array++;
@@ -1104,6 +1108,13 @@ static void do_test_fixup(struct bpf_test *test, enum bpf_prog_type prog_type,
 			fixup_map_kptr++;
 		} while (*fixup_map_kptr);
 	}
+	if (*fixup_map_pifo) {
+		map_fds[23] = __create_map(BPF_MAP_TYPE_PIFO_XDP, sizeof(u32), sizeof(u32), 1, 0, 8);
+		do {
+			prog[*fixup_map_pifo].imm = map_fds[23];
+			fixup_map_pifo++;
+		} while (*fixup_map_pifo);
+	}
 
 	/* Patch in kfunc BTF IDs */
 	if (fixup_kfunc_btf_id->kfunc) {
@@ -1490,7 +1501,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 		       test->errstr_unpriv : test->errstr;
 
 	opts.expected_attach_type = test->expected_attach_type;
-	if (verbose)
+	if (verbose || expected_ret == VERBOSE_REJECT)
 		opts.log_level = VERBOSE_LIBBPF_LOG_LEVEL;
 	else if (expected_ret == VERBOSE_ACCEPT)
 		opts.log_level = 2;
diff --git a/tools/testing/selftests/bpf/verifier/dequeue.c b/tools/testing/selftests/bpf/verifier/dequeue.c
new file mode 100644
index 000000000000..730f14395bcc
--- /dev/null
+++ b/tools/testing/selftests/bpf/verifier/dequeue.c
@@ -0,0 +1,160 @@
+{
+	"dequeue: non-xdp_md retval",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+	BPF_LD_MAP_FD(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+	BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct xdp_md, data)),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_DEQUEUE,
+	.result = REJECT,
+	.errstr = "At program exit the register R0 must be NULL or referenced ptr_xdp_md",
+	.fixup_map_pifo = { 1 },
+},
+{
+	"dequeue: NULL retval",
+	.insns = {
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	},
+	.runs = -1,
+	.prog_type = BPF_PROG_TYPE_DEQUEUE,
+	.result = ACCEPT,
+},
+{
+	"dequeue: cannot access except data, data_end, data_meta",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+	BPF_LD_MAP_FD(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+	BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, offsetof(struct xdp_md, data)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, offsetof(struct xdp_md, data_end)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, offsetof(struct xdp_md, data_meta)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, offsetof(struct xdp_md, ingress_ifindex)),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_DEQUEUE,
+	.result = REJECT,
+	.errstr = "no read support for xdp_md at off 12",
+	.fixup_map_pifo = { 1 },
+},
+{
+	"dequeue: pkt_uid preserved when resetting range on rX += var",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+	BPF_LD_MAP_FD(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+	BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_6, offsetof(struct dequeue_ctx, egress_ifindex)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, offsetof(struct xdp_md, data)),
+	BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_1),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_DEQUEUE,
+	.result = VERBOSE_REJECT,
+	.errstr = "13: (0f) r0 += r1                     ; R0_w=pkt(id=3,off=0,r=0,pkt_uid=2",
+	.fixup_map_pifo = { 1 },
+},
+{
+	"dequeue: dpa bad comparison",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+	BPF_LD_MAP_FD(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+	BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0),
+	BPF_MOV64_REG(BPF_REG_8, BPF_REG_4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_LD_MAP_FD(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_MOV64_REG(BPF_REG_4, BPF_REG_8),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_REG(BPF_REG_0, BPF_REG_7),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_8, offsetof(struct xdp_md, data)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_7, offsetof(struct xdp_md, data_end)),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+	BPF_JMP_REG(BPF_JGE, BPF_REG_0, BPF_REG_1, 1),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_drop),
+	BPF_MOV64_REG(BPF_REG_0, BPF_REG_8),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_DEQUEUE,
+	.result = REJECT,
+	.errstr = "R0, R1 pkt pointer comparison prohibited",
+	.fixup_map_pifo = { 1, 14 },
+},
+{
+	"dequeue: dpa scoped range propagation",
+	.insns = {
+	BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+	BPF_LD_MAP_FD(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_MOV64_REG(BPF_REG_4, BPF_REG_10),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_4, -8),
+	BPF_ST_MEM(BPF_DW, BPF_REG_4, 0, 0),
+	BPF_MOV64_REG(BPF_REG_8, BPF_REG_4),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_IMM(BPF_REG_0, 0),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_7, BPF_REG_0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_LD_MAP_FD(BPF_REG_2, 0),
+	BPF_MOV64_IMM(BPF_REG_3, 0),
+	BPF_MOV64_REG(BPF_REG_4, BPF_REG_8),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_dequeue),
+	BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 2),
+	BPF_MOV64_REG(BPF_REG_0, BPF_REG_7),
+	BPF_EXIT_INSN(),
+	BPF_MOV64_REG(BPF_REG_8, BPF_REG_0),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_8, offsetof(struct xdp_md, data)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_8, offsetof(struct xdp_md, data_end)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_7, offsetof(struct xdp_md, data)),
+	BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_7, offsetof(struct xdp_md, data_end)),
+	BPF_ALU64_IMM(BPF_ADD, BPF_REG_0, 8),
+	BPF_JMP_REG(BPF_JGE, BPF_REG_0, BPF_REG_1, 1),
+	BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_2, 0),
+	BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+	BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_packet_drop),
+	BPF_MOV64_REG(BPF_REG_0, BPF_REG_8),
+	BPF_EXIT_INSN(),
+	},
+	.prog_type = BPF_PROG_TYPE_DEQUEUE,
+	.result = REJECT,
+	.errstr = "invalid access to packet, off=0 size=4, R2(id=0,off=0,r=0)",
+	.fixup_map_pifo = { 1, 14 },
+},
-- 
2.37.0



* [RFC PATCH 16/17] selftests/bpf: Add test for XDP queueing through PIFO maps
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (14 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-14  5:41   ` Andrii Nakryiko
  2022-07-13 11:14 ` [RFC PATCH 17/17] samples/bpf: Add queueing support to xdp_fwd sample Toke Høiland-Jørgensen
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	Shuah Khan

This adds selftests for both variants of the PIFO map type, and for
the dequeue program type. The XDP test uses bpf_prog_run() to run an XDP
program that puts packets into a PIFO map, and then adds tests that pull
them back out again through bpf_prog_run() of a dequeue program, as well as
by attaching a dequeue program to a veth device and scheduling transmission
there.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 .../selftests/bpf/prog_tests/pifo_map.c       | 125 ++++++++++++++
 .../bpf/prog_tests/xdp_pifo_test_run.c        | 154 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/pifo_map.c  |  54 ++++++
 .../selftests/bpf/progs/test_xdp_pifo.c       | 110 +++++++++++++
 4 files changed, 443 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/pifo_map.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
 create mode 100644 tools/testing/selftests/bpf/progs/pifo_map.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pifo.c

diff --git a/tools/testing/selftests/bpf/prog_tests/pifo_map.c b/tools/testing/selftests/bpf/prog_tests/pifo_map.c
new file mode 100644
index 000000000000..ae23bcc0683f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/pifo_map.c
@@ -0,0 +1,125 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include "pifo_map.skel.h"
+
+static int run_prog(int prog_fd, __u32 exp_retval)
+{
+	struct xdp_md ctx_in = {};
+	char data[10] = {};
+	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
+			    .data_in = data,
+			    .data_size_in = sizeof(data),
+			    .ctx_in = &ctx_in,
+			    .ctx_size_in = sizeof(ctx_in),
+			    .repeat = 1,
+		);
+	int err;
+
+	ctx_in.data_end = sizeof(data);
+	err = bpf_prog_test_run_opts(prog_fd, &opts);
+	if (!ASSERT_OK(err, "bpf_prog_test_run(valid)"))
+		return -1;
+	if (!ASSERT_EQ(opts.retval, exp_retval, "prog retval"))
+		return -1;
+
+	return 0;
+}
+
+static void check_map_counts(int map_fd, int start, int interval, int num, int exp_val)
+{
+	__u32 val, key, next_key, *kptr = NULL;
+	int i, err;
+
+	for (i = 0; i < num; i++) {
+		err = bpf_map_get_next_key(map_fd, kptr, &next_key);
+		if (!ASSERT_OK(err, "bpf_map_get_next_key()"))
+			return;
+
+		key = next_key;
+		kptr = &key;
+
+		if (!ASSERT_EQ(key, start + i * interval, "expected key"))
+			break;
+		err = bpf_map_lookup_elem(map_fd, &key, &val);
+		if (!ASSERT_OK(err, "bpf_map_lookup_elem()"))
+			break;
+		if (!ASSERT_EQ(val, exp_val, "map value"))
+			break;
+	}
+}
+
+static void run_enqueue_fail(struct pifo_map *skel, int start, int interval, __u32 exp_retval)
+{
+	int enqueue_fd;
+
+	skel->bss->start = start;
+	skel->data->interval = interval;
+
+	enqueue_fd = bpf_program__fd(skel->progs.pifo_enqueue);
+
+	if (run_prog(enqueue_fd, exp_retval))
+		return;
+}
+
+static void run_test(struct pifo_map *skel, int start, int interval)
+{
+	int enqueue_fd, dequeue_fd;
+
+	skel->bss->start = start;
+	skel->data->interval = interval;
+
+	enqueue_fd = bpf_program__fd(skel->progs.pifo_enqueue);
+	dequeue_fd = bpf_program__fd(skel->progs.pifo_dequeue);
+
+	if (run_prog(enqueue_fd, 0))
+		return;
+	check_map_counts(bpf_map__fd(skel->maps.pifo_map),
+			 skel->bss->start, skel->data->interval,
+			 skel->rodata->num_entries, 1);
+	run_prog(dequeue_fd, 0);
+}
+
+void test_pifo_map(void)
+{
+	struct pifo_map *skel = NULL;
+	int err;
+
+	skel = pifo_map__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel"))
+		return;
+
+	run_test(skel, 0, 1);
+	run_test(skel, 0, 10);
+	run_test(skel, 0, 100);
+
+	/* do a series of runs that keep advancing the priority, to check that
+	 * we can keep rotating the two internal maps
+	 */
+	run_test(skel, 0, 125);
+	run_test(skel, 1250, 1);
+	run_test(skel, 1250, 125);
+
+	/* after rotating, starting enqueue at prio 0 will now fail */
+	run_enqueue_fail(skel, 0, 1, -ERANGE);
+
+	run_test(skel, 2500, 125);
+	run_test(skel, 3750, 125);
+	run_test(skel, 5000, 125);
+
+	pifo_map__destroy(skel);
+
+	/* reopen but change rodata */
+	skel = pifo_map__open();
+	if (!ASSERT_OK_PTR(skel, "open skel"))
+		return;
+
+	skel->rodata->num_entries = 12;
+	err = pifo_map__load(skel);
+	if (!ASSERT_OK(err, "load skel"))
+		goto out;
+
+	/* fails because the map is too small */
+	run_enqueue_fail(skel, 0, 1, -EOVERFLOW);
+out:
+	pifo_map__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c b/tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
new file mode 100644
index 000000000000..bac029731eee
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include <net/if.h>
+#include <linux/if_link.h>
+
+#include "test_xdp_pifo.skel.h"
+
+#define SYS(fmt, ...)						\
+	({							\
+		char cmd[1024];					\
+		snprintf(cmd, sizeof(cmd), fmt, ##__VA_ARGS__);	\
+		if (!ASSERT_OK(system(cmd), cmd))		\
+			goto out;				\
+	})
+
+static void run_xdp_prog(int prog_fd, void *data, size_t data_size, int repeat)
+{
+	struct xdp_md ctx_in = {};
+	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
+			    .data_in = data,
+			    .data_size_in = data_size,
+			    .ctx_in = &ctx_in,
+			    .ctx_size_in = sizeof(ctx_in),
+			    .repeat = repeat,
+			    .flags = BPF_F_TEST_XDP_LIVE_FRAMES,
+		);
+	int err;
+
+	ctx_in.data_end = ctx_in.data + sizeof(pkt_v4);
+	err = bpf_prog_test_run_opts(prog_fd, &opts);
+	ASSERT_OK(err, "bpf_prog_test_run(valid)");
+}
+
+static void run_dequeue_prog(int prog_fd, int exp_proto)
+{
+	struct ipv4_packet data_out;
+	DECLARE_LIBBPF_OPTS(bpf_test_run_opts, opts,
+			    .data_out = &data_out,
+			    .data_size_out = sizeof(data_out),
+			    .repeat = 1,
+		);
+	int err;
+
+	err = bpf_prog_test_run_opts(prog_fd, &opts);
+	ASSERT_OK(err, "bpf_prog_test_run(valid)");
+	ASSERT_EQ(opts.retval, exp_proto == -1 ? 0 : 1, "valid-retval");
+	if (exp_proto >= 0) {
+		ASSERT_EQ(opts.data_size_out, sizeof(pkt_v4), "valid-datasize");
+		ASSERT_EQ(data_out.eth.h_proto, exp_proto, "valid-pkt");
+	} else {
+		ASSERT_EQ(opts.data_size_out, 0, "no-pkt-returned");
+	}
+}
+
+void test_xdp_pifo(void)
+{
+	int xdp_prog_fd, dequeue_prog_fd, i;
+	struct test_xdp_pifo *skel = NULL;
+	struct ipv4_packet data;
+
+	skel = test_xdp_pifo__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel"))
+		return;
+
+	xdp_prog_fd = bpf_program__fd(skel->progs.xdp_pifo);
+	dequeue_prog_fd = bpf_program__fd(skel->progs.dequeue_pifo);
+	data = pkt_v4;
+
+	run_xdp_prog(xdp_prog_fd, &data, sizeof(data), 3);
+
+	/* kernel program queues packets with prio 2, 1, 0 (in that order), we
+	 * should get back 0 and 1, and 2 should get dropped on dequeue
+	 */
+	run_dequeue_prog(dequeue_prog_fd, 0);
+	run_dequeue_prog(dequeue_prog_fd, 1);
+	run_dequeue_prog(dequeue_prog_fd, -1);
+
+	xdp_prog_fd = bpf_program__fd(skel->progs.xdp_pifo_inc);
+	run_xdp_prog(xdp_prog_fd, &data, sizeof(data), 1024);
+
+	skel->bss->pkt_count = 0;
+	skel->data->prio = 0;
+	skel->data->drop_above = 1024;
+	for (i = 0; i < 1024; i++)
+		run_dequeue_prog(dequeue_prog_fd, i*10);
+
+	test_xdp_pifo__destroy(skel);
+}
+
+void test_xdp_pifo_live(void)
+{
+	struct test_xdp_pifo *skel = NULL;
+	int err, ifindex_src, ifindex_dst;
+	int xdp_prog_fd, dequeue_prog_fd;
+	struct nstoken *nstoken = NULL;
+	struct ipv4_packet data;
+	struct bpf_link *link;
+	__u32 xdp_flags = XDP_FLAGS_DEQUEUE_MODE;
+	LIBBPF_OPTS(bpf_xdp_attach_opts, opts,
+		    .old_prog_fd = -1);
+
+	skel = test_xdp_pifo__open();
+	if (!ASSERT_OK_PTR(skel, "skel"))
+		return;
+
+	SYS("ip netns add testns");
+	nstoken = open_netns("testns");
+	if (!ASSERT_OK_PTR(nstoken, "setns"))
+		goto out;
+
+	SYS("ip link add veth_src type veth peer name veth_dst");
+	SYS("ip link set dev veth_src up");
+	SYS("ip link set dev veth_dst up");
+
+	ifindex_src = if_nametoindex("veth_src");
+	ifindex_dst = if_nametoindex("veth_dst");
+	if (!ASSERT_NEQ(ifindex_src, 0, "ifindex_src") ||
+	    !ASSERT_NEQ(ifindex_dst, 0, "ifindex_dst"))
+		goto out;
+
+	skel->bss->tgt_ifindex = ifindex_src;
+	skel->data->drop_above = 3;
+
+	err = test_xdp_pifo__load(skel);
+	ASSERT_OK(err, "load skel");
+
+	link = bpf_program__attach_xdp(skel->progs.xdp_check_pkt, ifindex_dst);
+	if (!ASSERT_OK_PTR(link, "prog_attach"))
+		goto out;
+	skel->links.xdp_check_pkt = link;
+
+	xdp_prog_fd = bpf_program__fd(skel->progs.xdp_pifo);
+	dequeue_prog_fd = bpf_program__fd(skel->progs.dequeue_pifo);
+	data = pkt_v4;
+
+	err = bpf_xdp_attach(ifindex_src, dequeue_prog_fd, xdp_flags, &opts);
+	if (!ASSERT_OK(err, "attach-dequeue"))
+		goto out;
+
+	run_xdp_prog(xdp_prog_fd, &data, sizeof(data), 3);
+
+	/* wait for the packets to be flushed */
+	kern_sync_rcu();
+
+	ASSERT_EQ(skel->bss->seen_good_pkts, 3, "live packets OK");
+
+	opts.old_prog_fd = dequeue_prog_fd;
+	err = bpf_xdp_attach(ifindex_src, -1, xdp_flags, &opts);
+	ASSERT_OK(err, "dequeue-detach");
+
+out:
+	if (nstoken)
+		close_netns(nstoken);
+	system("ip netns del testns");
+	test_xdp_pifo__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/pifo_map.c b/tools/testing/selftests/bpf/progs/pifo_map.c
new file mode 100644
index 000000000000..b27bc2d0de03
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/pifo_map.c
@@ -0,0 +1,54 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PIFO_GENERIC);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, 10);
+	__uint(map_extra, 1024); /* range */
+} pifo_map SEC(".maps");
+
+const volatile int num_entries = 10;
+volatile int interval = 10;
+volatile int start = 0;
+
+SEC("xdp")
+int pifo_dequeue(struct xdp_md *xdp)
+{
+	__u32 val, exp;
+	int i, ret;
+
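+	/* pop all entries; the map should return them in increasing rank order */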
+	for (i = 0; i < num_entries; i++) {
+		exp = start + i * interval;
+		ret = bpf_map_pop_elem(&pifo_map, &val);
+		if (ret)
+			return ret;
+		if (val != exp)
+			return 1;
+	}
+
+	return 0;
+}
+
+SEC("xdp")
+int pifo_enqueue(struct xdp_md *xdp)
+{
+	__u64 flags;
+	__u32 val;
+	int i, ret;
+
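+	/* push in reverse order, passing the rank through the flags argument;
+	 * the dequeue program relies on the map re-ordering these by rank
+	 */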
+	for (i = num_entries - 1; i >= 0; i--) {
+		val = start + i * interval;
+		flags = val;
+		ret = bpf_map_push_elem(&pifo_map, &val, flags);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_pifo.c b/tools/testing/selftests/bpf/progs/test_xdp_pifo.c
new file mode 100644
index 000000000000..702611e0cd1a
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_xdp_pifo.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <bpf/bpf_helpers.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PIFO_XDP);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, 1024);
+	__uint(map_extra, 8192); /* range */
+} pifo_map SEC(".maps");
+
+__u16 prio = 3;
+int tgt_ifindex = 0;
+
+SEC("xdp")
+int xdp_pifo(struct xdp_md *xdp)
+{
+	void *data = (void *)(long)xdp->data;
+	void *data_end = (void *)(long)xdp->data_end;
+	struct ethhdr *eth = data;
+	int ret;
+
+	if (eth + 1 > data_end)
+		return XDP_DROP;
+
+	/* We write the priority into the ethernet proto field so userspace can
+	 * pick it back out and confirm that it's correct
+	 */
+	eth->h_proto = --prio;
+	ret = bpf_redirect_map(&pifo_map, prio, 0);
+	if (tgt_ifindex && ret == XDP_REDIRECT)
+		bpf_schedule_iface_dequeue(xdp, tgt_ifindex, 0);
+	return ret;
+}
+
+__u16 check_prio = 0;
+__u16 seen_good_pkts = 0;
+
+SEC("xdp")
+int xdp_check_pkt(struct xdp_md *xdp)
+{
+	void *data = (void *)(long)xdp->data;
+	void *data_end = (void *)(long)xdp->data_end;
+	struct ethhdr *eth = data;
+
+	if (eth + 1 > data_end)
+		return XDP_DROP;
+
+	if (eth->h_proto == check_prio) {
+		check_prio++;
+		seen_good_pkts++;
+		return XDP_DROP;
+	}
+
+	return XDP_PASS;
+}
+
+SEC("xdp")
+int xdp_pifo_inc(struct xdp_md *xdp)
+{
+	void *data = (void *)(long)xdp->data;
+	void *data_end = (void *)(long)xdp->data_end;
+	struct ethhdr *eth = data;
+	int ret;
+
+	if (eth + 1 > data_end)
+		return XDP_DROP;
+
+	/* We write the priority into the ethernet proto field so userspace can
+	 * pick it back out and confirm that it's correct
+	 */
+	eth->h_proto = prio;
+	ret = bpf_redirect_map(&pifo_map, prio, 0);
+	prio += 10;
+	return ret;
+}
+
+__u16 pkt_count = 0;
+__u16 drop_above = 2;
+
+SEC("dequeue")
+void *dequeue_pifo(struct dequeue_ctx *ctx)
+{
+	__u64 prio = 0, pkt_prio = 0;
+	void *data, *data_end;
+	struct xdp_md *pkt;
+	struct ethhdr *eth;
+
+	pkt = (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &prio);
+	if (!pkt)
+		return NULL;
+
+	data = (void *)(long)pkt->data;
+	data_end = (void *)(long)pkt->data_end;
+	eth = data;
+
+	if (eth + 1 <= data_end)
+		pkt_prio = eth->h_proto;
+
+	if (pkt_prio != prio || ++pkt_count > drop_above) {
+		bpf_packet_drop(ctx, pkt);
+		return NULL;
+	}
+
+	return pkt;
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [RFC PATCH 17/17] samples/bpf: Add queueing support to xdp_fwd sample
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (15 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 16/17] selftests/bpf: Add test for XDP queueing through PIFO maps Toke Høiland-Jørgensen
@ 2022-07-13 11:14 ` Toke Høiland-Jørgensen
  2022-07-13 18:36 ` [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Stanislav Fomichev
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 11:14 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa
  Cc: Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang, Toke Høiland-Jørgensen

Add support to the xdp_fwd sample for queueing packets before forwarding
them. This is meant to serve as an example (for the RFC series) of how
one could add queueing to a forwarding application. It doesn't actually
implement any fancy queueing algorithms; it just uses the PIFO maps to do
simple FIFO queueing, instantiating one queue map per interface.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 samples/bpf/xdp_fwd_kern.c |  65 +++++++++++-
 samples/bpf/xdp_fwd_user.c | 200 +++++++++++++++++++++++++++----------
 2 files changed, 205 insertions(+), 60 deletions(-)

diff --git a/samples/bpf/xdp_fwd_kern.c b/samples/bpf/xdp_fwd_kern.c
index 54c099cbd639..125adb02c658 100644
--- a/samples/bpf/xdp_fwd_kern.c
+++ b/samples/bpf/xdp_fwd_kern.c
@@ -23,6 +23,14 @@
 
 #define IPV6_FLOWINFO_MASK              cpu_to_be32(0x0FFFFFFF)
 
+struct pifo_map {
+	__uint(type, BPF_MAP_TYPE_PIFO_XDP);
+	__uint(key_size, sizeof(__u32));
+	__uint(value_size, sizeof(__u32));
+	__uint(max_entries, 1024);
+	__uint(map_extra, 8192); /* range */
+} pmap SEC(".maps");
+
 struct {
 	__uint(type, BPF_MAP_TYPE_DEVMAP);
 	__uint(key_size, sizeof(int));
@@ -30,6 +38,13 @@ struct {
 	__uint(max_entries, 64);
 } xdp_tx_ports SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+	__uint(key_size, sizeof(__u32));
+	__uint(max_entries, 64);
+	__array(values, struct pifo_map);
+} pifo_maps SEC(".maps");
+
 /* from include/net/ip.h */
 static __always_inline int ip_decrease_ttl(struct iphdr *iph)
 {
@@ -40,7 +55,7 @@ static __always_inline int ip_decrease_ttl(struct iphdr *iph)
 	return --iph->ttl;
 }
 
-static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
+static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags, bool queue)
 {
 	void *data_end = (void *)(long)ctx->data_end;
 	void *data = (void *)(long)ctx->data;
@@ -137,22 +152,62 @@ static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
 
 		memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
 		memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
+
+		if (queue) {
+			void *ptr;
+			int ret;
+
+			ptr = bpf_map_lookup_elem(&pifo_maps, &fib_params.ifindex);
+			if (!ptr)
+				return XDP_DROP;
+
+			ret = bpf_redirect_map(ptr, 0, 0);
+			if (ret == XDP_REDIRECT)
+				bpf_schedule_iface_dequeue(ctx, fib_params.ifindex, 0);
+			return ret;
+		}
+
 		return bpf_redirect_map(&xdp_tx_ports, fib_params.ifindex, 0);
 	}
 
 	return XDP_PASS;
 }
 
-SEC("xdp_fwd")
+SEC("xdp")
 int xdp_fwd_prog(struct xdp_md *ctx)
 {
-	return xdp_fwd_flags(ctx, 0);
+	return xdp_fwd_flags(ctx, 0, false);
 }
 
-SEC("xdp_fwd_direct")
+SEC("xdp")
 int xdp_fwd_direct_prog(struct xdp_md *ctx)
 {
-	return xdp_fwd_flags(ctx, BPF_FIB_LOOKUP_DIRECT);
+	return xdp_fwd_flags(ctx, BPF_FIB_LOOKUP_DIRECT, false);
+}
+
+SEC("xdp")
+int xdp_fwd_queue(struct xdp_md *ctx)
+{
+	return xdp_fwd_flags(ctx, 0, true);
+}
+
+SEC("dequeue")
+void *xdp_dequeue(struct dequeue_ctx *ctx)
+{
+	__u32 ifindex = ctx->egress_ifindex;
+	struct xdp_md *pkt;
+	__u64 prio = 0;
+	void *pifo_ptr;
+
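+	/* look up this interface's FIFO and pull out the next queued packet */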
+	pifo_ptr = bpf_map_lookup_elem(&pifo_maps, &ifindex);
+	if (!pifo_ptr)
+		return NULL;
+
+	pkt = (void *)bpf_packet_dequeue(ctx, pifo_ptr, 0, &prio);
+	if (!pkt)
+		return NULL;
+
+	return pkt;
 }
 
 char _license[] SEC("license") = "GPL";
diff --git a/samples/bpf/xdp_fwd_user.c b/samples/bpf/xdp_fwd_user.c
index 84f57f1209ce..ec3f29d0babe 100644
--- a/samples/bpf/xdp_fwd_user.c
+++ b/samples/bpf/xdp_fwd_user.c
@@ -11,6 +11,7 @@
  * General Public License for more details.
  */
 
+#include "linux/if_link.h"
 #include <linux/bpf.h>
 #include <linux/if_link.h>
 #include <linux/limits.h>
@@ -29,66 +30,122 @@
 
 static __u32 xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
 
-static int do_attach(int idx, int prog_fd, int map_fd, const char *name)
+#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))
+
+const char *redir_prog_names[] = {
+	"xdp_fwd_prog",
+	"xdp_fwd_direct_", /* name truncated to BPF_OBJ_NAME_LEN */
+	"xdp_fwd_queue",
+};
+
+const char *dequeue_prog_names[] = {
+	"xdp_dequeue"
+};
+
+static int do_attach(int idx, int redir_prog_fd, int dequeue_prog_fd,
+		     int redir_map_fd, int pifos_map_fd, const char *name)
 {
 	int err;
 
-	err = bpf_xdp_attach(idx, prog_fd, xdp_flags, NULL);
+	if (pifos_map_fd > -1) {
+		LIBBPF_OPTS(bpf_map_create_opts, map_opts, .map_extra = 8192);
+		char map_name[BPF_OBJ_NAME_LEN];
+		int pifo_fd;
+
+		snprintf(map_name, sizeof(map_name), "pifo_%d", idx);
+		map_name[BPF_OBJ_NAME_LEN - 1] = '\0';
+
+		pifo_fd = bpf_map_create(BPF_MAP_TYPE_PIFO_XDP, map_name,
+					 sizeof(__u32), sizeof(__u32), 10240, &map_opts);
+		if (pifo_fd < 0) {
+			err = -errno;
+			printf("ERROR: Couldn't create PIFO map: %s\n", strerror(-err));
+			return err;
+		}
+
+		err = bpf_map_update_elem(pifos_map_fd, &idx, &pifo_fd, 0);
+		if (err)
+			printf("ERROR: failed adding PIFO map for device %s\n", name);
+	}
+
+	if (dequeue_prog_fd > -1) {
+		LIBBPF_OPTS(bpf_xdp_attach_opts, prog_opts, .old_prog_fd = -1);
+
+		err = bpf_xdp_attach(idx, dequeue_prog_fd,
+				     (XDP_FLAGS_DEQUEUE_MODE | XDP_FLAGS_REPLACE),
+				     &prog_opts);
+		if (err < 0) {
+			printf("ERROR: failed to attach dequeue program to %s\n", name);
+			return err;
+		}
+	}
+
+	err = bpf_xdp_attach(idx, redir_prog_fd, xdp_flags, NULL);
 	if (err < 0) {
-		printf("ERROR: failed to attach program to %s\n", name);
+		printf("ERROR: failed to attach redir program to %s\n", name);
 		return err;
 	}
 
 	/* Adding ifindex as a possible egress TX port */
-	err = bpf_map_update_elem(map_fd, &idx, &idx, 0);
+	err = bpf_map_update_elem(redir_map_fd, &idx, &idx, 0);
 	if (err)
 		printf("ERROR: failed using device %s as TX-port\n", name);
 
 	return err;
 }
 
+static bool should_detach(__u32 prog_fd, const char **prog_names, int num_prog_names)
+{
+	struct bpf_prog_info prog_info = {};
+	__u32 info_len = sizeof(prog_info);
+	int err, i;
+
+	err = bpf_obj_get_info_by_fd(prog_fd, &prog_info, &info_len);
+	if (err) {
+		printf("ERROR: bpf_obj_get_info_by_fd failed (%s)\n",
+		       strerror(errno));
+		return false;
+	}
+
+	for (i = 0; i < num_prog_names; i++)
+		if (!strcmp(prog_info.name, prog_names[i]))
+			return true;
+
+	return false;
+}
+
 static int do_detach(int ifindex, const char *ifname, const char *app_name)
 {
 	LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
-	struct bpf_prog_info prog_info = {};
-	char prog_name[BPF_OBJ_NAME_LEN];
-	__u32 info_len, curr_prog_id;
-	int prog_fd;
-	int err = 1;
+	LIBBPF_OPTS(bpf_xdp_query_opts, query_opts);
+	int prog_fd, err = 1;
+	__u32 curr_prog_id;
 
-	if (bpf_xdp_query_id(ifindex, xdp_flags, &curr_prog_id)) {
+	if (bpf_xdp_query(ifindex, xdp_flags, &query_opts)) {
 		printf("ERROR: bpf_xdp_query_id failed (%s)\n",
 		       strerror(errno));
 		return err;
 	}
 
+	curr_prog_id = (xdp_flags & XDP_FLAGS_SKB_MODE) ?
+		query_opts.skb_prog_id : query_opts.drv_prog_id;
 	if (!curr_prog_id) {
 		printf("ERROR: flags(0x%x) xdp prog is not attached to %s\n",
 		       xdp_flags, ifname);
 		return err;
 	}
 
-	info_len = sizeof(prog_info);
 	prog_fd = bpf_prog_get_fd_by_id(curr_prog_id);
 	if (prog_fd < 0) {
 		printf("ERROR: bpf_prog_get_fd_by_id failed (%s)\n",
 		       strerror(errno));
-		return prog_fd;
-	}
-
-	err = bpf_obj_get_info_by_fd(prog_fd, &prog_info, &info_len);
-	if (err) {
-		printf("ERROR: bpf_obj_get_info_by_fd failed (%s)\n",
-		       strerror(errno));
-		goto close_out;
+		return err;
 	}
-	snprintf(prog_name, sizeof(prog_name), "%s_prog", app_name);
-	prog_name[BPF_OBJ_NAME_LEN - 1] = '\0';
 
-	if (strcmp(prog_info.name, prog_name)) {
+	if (!should_detach(prog_fd, redir_prog_names, ARRAY_SIZE(redir_prog_names))) {
 		printf("ERROR: %s isn't attached to %s\n", app_name, ifname);
-		err = 1;
-		goto close_out;
+		close(prog_fd);
+		return 1;
 	}
 
 	opts.old_prog_fd = prog_fd;
@@ -96,11 +153,34 @@ static int do_detach(int ifindex, const char *ifname, const char *app_name)
 	if (err < 0)
 		printf("ERROR: failed to detach program from %s (%s)\n",
 		       ifname, strerror(errno));
-	/* TODO: Remember to cleanup map, when adding use of shared map
+
+	close(prog_fd);
+
+	if (query_opts.dequeue_prog_id) {
+		prog_fd = bpf_prog_get_fd_by_id(query_opts.dequeue_prog_id);
+		if (prog_fd < 0) {
+			printf("ERROR: bpf_prog_get_fd_by_id failed (%s)\n",
+			       strerror(errno));
+			return prog_fd;
+		}
+
+		if (!should_detach(prog_fd, dequeue_prog_names, ARRAY_SIZE(dequeue_prog_names))) {
+			close(prog_fd);
+			return err;
+		}
+
+		opts.old_prog_fd = prog_fd;
+		err = bpf_xdp_detach(ifindex,
+				     (XDP_FLAGS_DEQUEUE_MODE | XDP_FLAGS_REPLACE),
+				     &opts);
+		if (err < 0)
+			printf("ERROR: failed to detach dequeue program from %s (%s)\n",
+			       ifname, strerror(errno));
+	}
+
+	/* TODO: Remember to cleanup map, when adding use of shared map
 	 *  bpf_map_delete_elem((map_fd, &idx);
 	 */
-close_out:
-	close(prog_fd);
 	return err;
 }
 
@@ -112,24 +192,23 @@ static void usage(const char *prog)
 		"    -d    detach program\n"
 		"    -S    use skb-mode\n"
 		"    -F    force loading prog\n"
-		"    -D    direct table lookups (skip fib rules)\n",
+		"    -D    direct table lookups (skip fib rules)\n"
+		"    -Q    direct table lookups (skip fib rules)\n",
 		prog);
 }
 
 int main(int argc, char **argv)
 {
-	const char *prog_name = "xdp_fwd";
-	struct bpf_program *prog = NULL;
-	struct bpf_program *pos;
-	const char *sec_name;
-	int prog_fd = -1, map_fd = -1;
+	int redir_prog_fd = -1, dequeue_prog_fd = -1, redir_map_fd = -1, pifos_map_fd = -1;
+	const char *prog_name = "xdp_fwd_prog";
 	char filename[PATH_MAX];
 	struct bpf_object *obj;
 	int opt, i, idx, err;
+	bool queue = false;
 	int attach = 1;
 	int ret = 0;
 
-	while ((opt = getopt(argc, argv, ":dDSF")) != -1) {
+	while ((opt = getopt(argc, argv, ":dDQSF")) != -1) {
 		switch (opt) {
 		case 'd':
 			attach = 0;
@@ -141,7 +220,11 @@ int main(int argc, char **argv)
 			xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
 			break;
 		case 'D':
-			prog_name = "xdp_fwd_direct";
+			prog_name = "xdp_fwd_direct_prog";
+			break;
+		case 'Q':
+			prog_name = "xdp_fwd_queue";
+			queue = true;
 			break;
 		default:
 			usage(basename(argv[0]));
@@ -170,9 +253,6 @@ int main(int argc, char **argv)
 		if (libbpf_get_error(obj))
 			return 1;
 
-		prog = bpf_object__next_program(obj, NULL);
-		bpf_program__set_type(prog, BPF_PROG_TYPE_XDP);
-
 		err = bpf_object__load(obj);
 		if (err) {
 			printf("Does kernel support devmap lookup?\n");
@@ -181,25 +261,34 @@ int main(int argc, char **argv)
 			 */
 			return 1;
 		}
-
-		bpf_object__for_each_program(pos, obj) {
-			sec_name = bpf_program__section_name(pos);
-			if (sec_name && !strcmp(sec_name, prog_name)) {
-				prog = pos;
-				break;
-			}
-		}
-		prog_fd = bpf_program__fd(prog);
-		if (prog_fd < 0) {
-			printf("program not found: %s\n", strerror(prog_fd));
+		redir_prog_fd = bpf_program__fd(bpf_object__find_program_by_name(obj,
+										 prog_name));
+		if (redir_prog_fd < 0) {
+			printf("program not found: %s\n", strerror(redir_prog_fd));
 			return 1;
 		}
-		map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj,
-							"xdp_tx_ports"));
-		if (map_fd < 0) {
-			printf("map not found: %s\n", strerror(map_fd));
+
+		redir_map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj,
+									"xdp_tx_ports"));
+		if (redir_map_fd < 0) {
+			printf("map not found: %s\n", strerror(redir_map_fd));
 			return 1;
 		}
+
+		if (queue) {
+			dequeue_prog_fd = bpf_program__fd(bpf_object__find_program_by_name(obj,
+											   "xdp_dequeue"));
+			if (dequeue_prog_fd < 0) {
+				printf("dequeue program not found: %s\n",
+				       strerror(-dequeue_prog_fd));
+				return 1;
+			}
+			pifos_map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, "pifo_maps"));
+			if (pifos_map_fd < 0) {
+				printf("map not found: %s\n", strerror(-pifos_map_fd));
+				return 1;
+			}
+		}
 	}
 
 	for (i = optind; i < argc; ++i) {
@@ -212,11 +301,12 @@ int main(int argc, char **argv)
 			return 1;
 		}
 		if (!attach) {
-			err = do_detach(idx, argv[i], prog_name);
+			err = do_detach(idx, argv[i], argv[0]);
 			if (err)
 				ret = err;
 		} else {
-			err = do_attach(idx, prog_fd, map_fd, argv[i]);
+			err = do_attach(idx, redir_prog_fd, dequeue_prog_fd,
+					redir_map_fd, pifos_map_fd, argv[i]);
 			if (err)
 				ret = err;
 		}
-- 
2.37.0


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (16 preceding siblings ...)
  2022-07-13 11:14 ` [RFC PATCH 17/17] samples/bpf: Add queueing support to xdp_fwd sample Toke Høiland-Jørgensen
@ 2022-07-13 18:36 ` Stanislav Fomichev
  2022-07-13 21:52   ` Toke Høiland-Jørgensen
  2022-07-14 14:05 ` Jamal Hadi Salim
  2022-07-17 17:46 ` Cong Wang
  19 siblings, 1 reply; 46+ messages in thread
From: Stanislav Fomichev @ 2022-07-13 18:36 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Mykola Lysenko, Kumar Kartikeya Dwivedi, netdev,
	bpf, Freysteinn Alfredsson, Cong Wang

On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Packet forwarding is an important use case for XDP, which offers
> significant performance improvements compared to forwarding using the
> regular networking stack. However, XDP currently offers no mechanism to
> delay, queue or schedule packets, which limits the practical uses for
> XDP-based forwarding to those where the capacity of input and output links
> always match each other (i.e., no rate transitions or many-to-one
> forwarding). It also prevents an XDP-based router from doing any kind of
> traffic shaping or reordering to enforce policy.
>
> This series represents a first RFC of our attempt to remedy this lack. The
> code in these patches is functional, but needs additional testing and
> polishing before being considered for merging. I'm posting it here as an
> RFC to get some early feedback on the API and overall design of the
> feature.
>
> DESIGN
>
> The design consists of three components: A new map type for storing XDP
> frames, a new 'dequeue' program type that will run in the TX softirq to
> provide the stack with packets to transmit, and a set of helpers to dequeue
> packets from the map, optionally drop them, and to schedule an interface
> for transmission.
>
> The new map type is modelled on the PIFO data structure proposed in the
> literature[0][1]. It represents a priority queue where packets can be
> enqueued in any priority, but is always dequeued from the head. From the
> XDP side, the map is simply used as a target for the bpf_redirect_map()
> helper, where the target index is the desired priority.

I have the same question I asked on the series from Cong:
Any considerations for existing carousel/edt-like models?
Can we make the map flexible enough to implement different qdisc policies?

> The dequeue program type is a new BPF program type that is attached to an
> interface; when an interface is scheduled for transmission, the stack will
> execute the attached dequeue program and, if it returns a packet to
> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
> driver function.
>
> The dequeue program can obtain packets by pulling them out of a PIFO map
> using the new bpf_packet_dequeue() helper. This returns a pointer to an
> xdp_md structure, which can be dereferenced to obtain packet data and
> data_meta pointers like in an XDP program. The returned packets are also
> reference counted, meaning the verifier enforces that the dequeue program
> either drops the packet (with the bpf_packet_drop() helper), or returns it
> for transmission. Finally, a helper is added that can be used to actually
> schedule an interface for transmission using the dequeue program type; this
> helper can be called from both XDP and dequeue programs.
>
> PERFORMANCE
>
> Preliminary performance tests indicate about 50ns overhead of adding
> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
> overhead (but still 2x the forwarding performance of the netstack):
>
> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
> netstack:       2 Mpps  (500 ns /pkt)
>
> RELATION TO BPF QDISC
>
> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
> particular the use of a map to store packets. This is no accident, as we've
> had ongoing discussions for a while now. I have no great hope that we can
> completely converge the two efforts into a single BPF-based queueing
> API (as has been discussed before[3], consolidating the SKB and XDP paths
> is challenging). Rather, I'm hoping that we can converge the designs enough
> that we can share BPF code between XDP and qdisc layers using common
> functions, like it's possible to do with XDP and TC-BPF today. This would
> imply agreeing on the map type and API, and possibly on the set of helpers
> available to the BPF programs.

What would be the big difference for the map wrt xdp_frame vs sk_buff
excluding all obvious stuff like locking/refcnt?

> PATCH STRUCTURE
>
> This series consists of a total of 17 patches, as follows:
>
> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
> patches.

Seems like these can go separately without holding the rest?

> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
> program type.

[...]

> Patches 7-10 adds the dequeue helpers and the verifier features needed to
> recognise packet pointers, reference count them, and allow dereferencing
> them to obtain packet data pointers.

Have you considered using kfuncs for these instead of introducing new
hooks/contexts/etc?

> Patches 11 and 12 add the dequeue program hook to the TX path, and the
> helpers to schedule an interface.
>
> Patches 13-16 add libbpf support for the new types, and selftests for the
> new features.
>
> Finally, patch 17 adds queueing support to the xdp_fwd program in
> samples/bpf to provide an easy-to-use way of testing the feature; this is
> for illustrative purposes for the RFC only, and will not be included in the
> final submission.
>
> SUPPLEMENTARY MATERIAL
>
> A (WiP) test harness for implementing and unit-testing scheduling
> algorithms using this framework (and the bpf_prog_run() hook) is available
> as part of the bpf-examples repository[4]. We plan to expand this with more
> test algorithms to smoke-test the API, and also add ready-to-use queueing
> algorithms for use for forwarding (to replace the xdp_fwd patch included as
> part of this RFC submission).
>
> The work represented in this series was done in collaboration with several
> people. Thanks to Kumar Kartikeya Dwivedi for writing the verifier
> enhancements in this series, to Frey Alfredsson for his work on the testing
> harness in [4], and to Jesper Brouer, Per Hurtig and Anna Brunstrom for
> their valuable input on the design of the queueing APIs.
>
> This series is also available as a git tree on git.kernel.org[5].
>
> NOTES
>
> [0] http://web.mit.edu/pifo/
> [1] https://arxiv.org/abs/1810.03060
> [2] https://lore.kernel.org/r/20220602041028.95124-1-xiyou.wangcong@gmail.com
> [3] https://lore.kernel.org/r/b4ff6a2b-1478-89f8-ea9f-added498c59f@gmail.com
> [4] https://github.com/xdp-project/bpf-examples/pull/40
> [5] https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/log/?h=xdp-queueing-06
>
> Kumar Kartikeya Dwivedi (5):
>   bpf: Use 64-bit return value for bpf_prog_run
>   bpf: Teach the verifier about referenced packets returned from dequeue
>     programs
>   bpf: Introduce pkt_uid member for PTR_TO_PACKET
>   bpf: Implement direct packet access in dequeue progs
>   selftests/bpf: Add verifier tests for dequeue prog
>
> Toke Høiland-Jørgensen (12):
>   dev: Move received_rps counter next to RPS members in softnet data
>   bpf: Expand map key argument of bpf_redirect_map to u64
>   bpf: Add a PIFO priority queue map type
>   pifomap: Add queue rotation for continuously increasing rank mode
>   xdp: Add dequeue program type for getting packets from a PIFO
>   bpf: Add helpers to dequeue from a PIFO map
>   dev: Add XDP dequeue hook
>   bpf: Add helper to schedule an interface for TX dequeue
>   libbpf: Add support for dequeue program type and PIFO map type
>   libbpf: Add support for querying dequeue programs
>   selftests/bpf: Add test for XDP queueing through PIFO maps
>   samples/bpf: Add queueing support to xdp_fwd sample
>
>  include/linux/bpf-cgroup.h                    |  12 +-
>  include/linux/bpf.h                           |  64 +-
>  include/linux/bpf_types.h                     |   4 +
>  include/linux/bpf_verifier.h                  |  14 +-
>  include/linux/filter.h                        |  63 +-
>  include/linux/netdevice.h                     |   8 +-
>  include/net/xdp.h                             |  16 +-
>  include/uapi/linux/bpf.h                      |  50 +-
>  include/uapi/linux/if_link.h                  |   4 +-
>  kernel/bpf/Makefile                           |   2 +-
>  kernel/bpf/cgroup.c                           |  12 +-
>  kernel/bpf/core.c                             |  14 +-
>  kernel/bpf/cpumap.c                           |   4 +-
>  kernel/bpf/devmap.c                           |  92 ++-
>  kernel/bpf/offload.c                          |   4 +-
>  kernel/bpf/pifomap.c                          | 635 ++++++++++++++++++
>  kernel/bpf/syscall.c                          |   3 +
>  kernel/bpf/verifier.c                         | 148 +++-
>  net/bpf/test_run.c                            |  54 +-
>  net/core/dev.c                                | 109 +++
>  net/core/dev.h                                |   2 +
>  net/core/filter.c                             | 307 ++++++++-
>  net/core/rtnetlink.c                          |  30 +-
>  net/packet/af_packet.c                        |   7 +-
>  net/xdp/xskmap.c                              |   4 +-
>  samples/bpf/xdp_fwd_kern.c                    |  65 +-
>  samples/bpf/xdp_fwd_user.c                    | 200 ++++--
>  tools/include/uapi/linux/bpf.h                |  48 ++
>  tools/include/uapi/linux/if_link.h            |   4 +-
>  tools/lib/bpf/libbpf.c                        |   1 +
>  tools/lib/bpf/libbpf.h                        |   1 +
>  tools/lib/bpf/libbpf_probes.c                 |   5 +
>  tools/lib/bpf/netlink.c                       |   8 +
>  .../selftests/bpf/prog_tests/pifo_map.c       | 125 ++++
>  .../bpf/prog_tests/xdp_pifo_test_run.c        | 154 +++++
>  tools/testing/selftests/bpf/progs/pifo_map.c  |  54 ++
>  .../selftests/bpf/progs/test_xdp_pifo.c       | 110 +++
>  tools/testing/selftests/bpf/test_verifier.c   |  29 +-
>  .../testing/selftests/bpf/verifier/dequeue.c  | 160 +++++
>  39 files changed, 2426 insertions(+), 200 deletions(-)
>  create mode 100644 kernel/bpf/pifomap.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/pifo_map.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
>  create mode 100644 tools/testing/selftests/bpf/progs/pifo_map.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pifo.c
>  create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c
>
> --
> 2.37.0
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-13 18:36 ` [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Stanislav Fomichev
@ 2022-07-13 21:52   ` Toke Høiland-Jørgensen
  2022-07-13 22:56     ` Stanislav Fomichev
                       ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-13 21:52 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Mykola Lysenko, Kumar Kartikeya Dwivedi, netdev,
	bpf, Freysteinn Alfredsson, Cong Wang

Stanislav Fomichev <sdf@google.com> writes:

> On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Packet forwarding is an important use case for XDP, which offers
>> significant performance improvements compared to forwarding using the
>> regular networking stack. However, XDP currently offers no mechanism to
>> delay, queue or schedule packets, which limits the practical uses for
>> XDP-based forwarding to those where the capacity of input and output links
>> always match each other (i.e., no rate transitions or many-to-one
>> forwarding). It also prevents an XDP-based router from doing any kind of
>> traffic shaping or reordering to enforce policy.
>>
>> This series represents a first RFC of our attempt to remedy this lack. The
>> code in these patches is functional, but needs additional testing and
>> polishing before being considered for merging. I'm posting it here as an
>> RFC to get some early feedback on the API and overall design of the
>> feature.
>>
>> DESIGN
>>
>> The design consists of three components: A new map type for storing XDP
>> frames, a new 'dequeue' program type that will run in the TX softirq to
>> provide the stack with packets to transmit, and a set of helpers to dequeue
>> packets from the map, optionally drop them, and to schedule an interface
>> for transmission.
>>
>> The new map type is modelled on the PIFO data structure proposed in the
>> literature[0][1]. It represents a priority queue where packets can be
>> enqueued in any priority, but is always dequeued from the head. From the
>> XDP side, the map is simply used as a target for the bpf_redirect_map()
>> helper, where the target index is the desired priority.
>
> I have the same question I asked on the series from Cong:
> Any considerations for existing carousel/edt-like models?

Well, the reason for the addition in patch 5 (continuously increasing
priorities) is exactly to be able to implement EDT-like behaviour, where
the priority is used as time units to clock out packets.
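
For illustration, an EDT-style enqueue could look something like this
(rough sketch; compute_tx_delay() is a stand-in for whatever rate
calculation the scheduler does, and pifo_map is declared like the one in
the selftests):

SEC("xdp")
int xdp_edt_enqueue(struct xdp_md *ctx)
{
	/* the departure time in ns becomes the PIFO rank, so packets are
	 * dequeued in earliest-departure-time order; compute_tx_delay()
	 * is hypothetical
	 */
	__u64 tstamp = bpf_ktime_get_ns() + compute_tx_delay(ctx);

	return bpf_redirect_map(&pifo_map, tstamp, 0);
}

The dequeue side gets the timestamp back through the rank pointer of
bpf_packet_dequeue() and can compare it against the current time to
decide whether to transmit yet.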

> Can we make the map flexible enough to implement different qdisc
> policies?

That's one of the things we want to be absolutely sure about. We are
starting out with the PIFO map type because the literature makes a good
case that it is flexible enough to implement all conceivable policies.
The goal of the test harness linked as note [4] is to actually examine
this; Frey is our PhD student working on this bit.

Thus far we haven't hit any limitations on this, but we'll need to add
more policies before we are done with this. Another consideration is
performance, of course, so we're also planning to do a comparison with a
more traditional "bunch of FIFO queues" type data structure for at least
a subset of the algorithms. Kartikeya also had an idea for an
alternative way to implement a priority queue using (semi-)lockless
skiplists, which may turn out to perform better.

If there's any particular policy/algorithm you'd like to see included in
this evaluation, please do let us know, BTW! :)

>> The dequeue program type is a new BPF program type that is attached to an
>> interface; when an interface is scheduled for transmission, the stack will
>> execute the attached dequeue program and, if it returns a packet to
>> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
>> driver function.
>>
>> The dequeue program can obtain packets by pulling them out of a PIFO map
>> using the new bpf_packet_dequeue() helper. This returns a pointer to an
>> xdp_md structure, which can be dereferenced to obtain packet data and
>> data_meta pointers like in an XDP program. The returned packets are also
>> reference counted, meaning the verifier enforces that the dequeue program
>> either drops the packet (with the bpf_packet_drop() helper), or returns it
>> for transmission. Finally, a helper is added that can be used to actually
>> schedule an interface for transmission using the dequeue program type; this
>> helper can be called from both XDP and dequeue programs.
>>
>> PERFORMANCE
>>
>> Preliminary performance tests indicate about 50ns overhead of adding
>> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
>> overhead (but still 2x the forwarding performance of the netstack):
>>
>> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
>> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
>> netstack:       2 Mpps  (500 ns /pkt)
>>
>> RELATION TO BPF QDISC
>>
>> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
>> particular the use of a map to store packets. This is no accident, as we've
>> had ongoing discussions for a while now. I have no great hope that we can
>> completely converge the two efforts into a single BPF-based queueing
>> API (as has been discussed before[3], consolidating the SKB and XDP paths
>> is challenging). Rather, I'm hoping that we can converge the designs enough
>> that we can share BPF code between XDP and qdisc layers using common
>> functions, like it's possible to do with XDP and TC-BPF today. This would
>> imply agreeing on the map type and API, and possibly on the set of helpers
>> available to the BPF programs.
>
> What would be the big difference for the map wrt xdp_frame vs sk_buff
> excluding all obvious stuff like locking/refcnt?

I expect it would be quite straight-forward to just add a second subtype
of the PIFO map in this series that holds skbs. In fact, I think that
from the BPF side, the whole model implemented here would be possible to
carry over to the qdisc layer more or less wholesale. Some other
features of the qdisc layer, like locking, classes, and
multi-CPU/multi-queue management may be trickier, but I'm not sure how
much of that we should expose in a BPF qdisc anyway (as you may have
noticed I commented on Cong's series to this effect regarding the
classful qdiscs).
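
From the BPF side I'd expect the skb variant to look identical; something
like this hypothetical declaration, mirroring the BPF_MAP_TYPE_PIFO_XDP
map used in the selftests:

struct {
	__uint(type, BPF_MAP_TYPE_PIFO_SKB);	/* hypothetical subtype */
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 1024);
	__uint(map_extra, 8192); /* rank range */
} skb_pifo SEC(".maps");

with the enqueue/dequeue programs reusing the same rank logic as their
XDP counterparts.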

>> PATCH STRUCTURE
>>
>> This series consists of a total of 17 patches, as follows:
>>
>> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
>> patches.
>
> Seems like these can go separately without holding the rest?

Yeah, guess so? They don't really provide much benefit without the users
later in the series, though, so not sure there's much point in sending
them separately?

>> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
>> program type.
>
> [...]
>
>> Patches 7-10 adds the dequeue helpers and the verifier features needed to
>> recognise packet pointers, reference count them, and allow dereferencing
>> them to obtain packet data pointers.
>
> Have you considered using kfuncs for these instead of introducing new
> hooks/contexts/etc?

I did, but I'm not sure it's such a good fit? In particular, the way the
direct packet access is implemented for dequeue programs (where you can
get an xdp_md pointer and deref that to get data and data_end pointers)
is done this way so programs can share utility functions between XDP and
dequeue programs. And having a new program type for the dequeue progs
seems like the obvious thing to do since they're doing something new?

Maybe I'm missing something, though; could you elaborate on how you'd
use kfuncs instead?

-Toke


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-13 21:52   ` Toke Høiland-Jørgensen
@ 2022-07-13 22:56     ` Stanislav Fomichev
  2022-07-14 10:46       ` Toke Høiland-Jørgensen
  2022-07-14  6:34     ` Kumar Kartikeya Dwivedi
  2022-07-17 18:17     ` Cong Wang
  2 siblings, 1 reply; 46+ messages in thread
From: Stanislav Fomichev @ 2022-07-13 22:56 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Mykola Lysenko, Kumar Kartikeya Dwivedi, netdev,
	bpf, Freysteinn Alfredsson, Cong Wang

On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Packet forwarding is an important use case for XDP, which offers
> >> significant performance improvements compared to forwarding using the
> >> regular networking stack. However, XDP currently offers no mechanism to
> >> delay, queue or schedule packets, which limits the practical uses for
> >> XDP-based forwarding to those where the capacity of input and output links
> >> always match each other (i.e., no rate transitions or many-to-one
> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> traffic shaping or reordering to enforce policy.
> >>
> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> code in these patches is functional, but needs additional testing and
> >> polishing before being considered for merging. I'm posting it here as an
> >> RFC to get some early feedback on the API and overall design of the
> >> feature.
> >>
> >> DESIGN
> >>
> >> The design consists of three components: A new map type for storing XDP
> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> packets from the map, optionally drop them, and to schedule an interface
> >> for transmission.
> >>
> >> The new map type is modelled on the PIFO data structure proposed in the
> >> literature[0][1]. It represents a priority queue where packets can be
> >> enqueued in any priority, but is always dequeued from the head. From the
> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> helper, where the target index is the desired priority.
> >
> > I have the same question I asked on the series from Cong:
> > Any considerations for existing carousel/edt-like models?
>
> Well, the reason for the addition in patch 5 (continuously increasing
> priorities) is exactly to be able to implement EDT-like behaviour, where
> the priority is used as time units to clock out packets.

Ah, ok, I didn't read the patches closely enough. I saw some limits
for the ranges and assumed that it wasn't capable of efficiently
storing 64-bit timestamps...

> > Can we make the map flexible enough to implement different qdisc
> > policies?
>
> That's one of the things we want to be absolutely sure about. We are
> starting out with the PIFO map type because the literature makes a good
> case that it is flexible enough to implement all conceivable policies.
> The goal of the test harness linked as note [4] is to actually examine
> this; Frey is our PhD student working on this bit.
>
> Thus far we haven't hit any limitations on this, but we'll need to add
> more policies before we are done with this. Another consideration is
> performance, of course, so we're also planning to do a comparison with a
> more traditional "bunch of FIFO queues" type data structure for at least
> a subset of the algorithms. Kartikeya also had an idea for an
> alternative way to implement a priority queue using (semi-)lockless
> skiplists, which may turn out to perform better.
>
> If there's any particular policy/algorithm you'd like to see included in
> this evaluation, please do let us know, BTW! :)

I honestly am not sure what the bar for accepting this should be. But
on Cong's series I mentioned Martin's BPF congestion control work as a
great example of what we should be trying to do for qdisc-like maps.
Having a bpf version of fq/fq_codel/whatever_other_complex_qdisc might
be very convincing :-)

> >> The dequeue program type is a new BPF program type that is attached to an
> >> interface; when an interface is scheduled for transmission, the stack will
> >> execute the attached dequeue program and, if it returns a packet to
> >> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
> >> driver function.
> >>
> >> The dequeue program can obtain packets by pulling them out of a PIFO map
> >> using the new bpf_packet_dequeue() helper. This returns a pointer to an
> >> xdp_md structure, which can be dereferenced to obtain packet data and
> >> data_meta pointers like in an XDP program. The returned packets are also
> >> reference counted, meaning the verifier enforces that the dequeue program
> >> either drops the packet (with the bpf_packet_drop() helper), or returns it
> >> for transmission. Finally, a helper is added that can be used to actually
> >> schedule an interface for transmission using the dequeue program type; this
> >> helper can be called from both XDP and dequeue programs.
> >>
> >> PERFORMANCE
> >>
> >> Preliminary performance tests indicate about 50ns overhead of adding
> >> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
> >> overhead (but still 2x the forwarding performance of the netstack):
> >>
> >> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
> >> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
> >> netstack:       2 Mpps  (500 ns /pkt)
> >>
> >> RELATION TO BPF QDISC
> >>
> >> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
> >> particular the use of a map to store packets. This is no accident, as we've
> >> had ongoing discussions for a while now. I have no great hope that we can
> >> completely converge the two efforts into a single BPF-based queueing
> >> API (as has been discussed before[3], consolidating the SKB and XDP paths
> >> is challenging). Rather, I'm hoping that we can converge the designs enough
> >> that we can share BPF code between XDP and qdisc layers using common
> >> functions, like it's possible to do with XDP and TC-BPF today. This would
> >> imply agreeing on the map type and API, and possibly on the set of helpers
> >> available to the BPF programs.
> >
> > What would be the big difference for the map wrt xdp_frame vs sk_buff
> > excluding all obvious stuff like locking/refcnt?
>
> I expect it would be quite straight-forward to just add a second subtype
> of the PIFO map in this series that holds skbs. In fact, I think that
> from the BPF side, the whole model implemented here would be possible to
> carry over to the qdisc layer more or less wholesale. Some other
> features of the qdisc layer, like locking, classes, and
> multi-CPU/multi-queue management may be trickier, but I'm not sure how
> much of that we should expose in a BPF qdisc anyway (as you may have
> noticed I commented on Cong's series to this effect regarding the
> classful qdiscs).

Maybe a related question here: with the way you do
BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard would it
be to have support for storing xdp_frames/skbs in any map? Let's say we
have a generic BPF_MAP_TYPE_RBTREE, where the key is
priority/timestamp/whatever; can we, based on the value's btf_id,
figure out the rest? (that the value is a kernel structure and needs
special care and more constraints - can't be looked up from user space
and so on)
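
To make that concrete, I'm imagining something like this (completely
made-up API, just to illustrate):

/* hypothetical: a generic ordered map whose value is a kernel type,
 * recognized by the verifier via the value's BTF
 */
struct {
	__uint(type, BPF_MAP_TYPE_RBTREE);	/* doesn't exist today */
	__type(key, __u64);			/* priority/timestamp */
	__type(value, struct xdp_frame);	/* kernel struct, by btf_id */
} frame_queue SEC(".maps");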

Seems like we really only need two special cases: transferring
ownership of an xdp_frame/skb into the map and back out of it. Any
other big complications?

That way we can maybe untangle the series a bit: we can talk about
efficient data structures for storing frames/skbs independently of
some generic support for storing them in the maps. Any major
complications with that approach?

> >> PATCH STRUCTURE
> >>
> >> This series consists of a total of 17 patches, as follows:
> >>
> >> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
> >> patches.
> >
> > Seems like these can go separately without holding the rest?
>
> Yeah, guess so? They don't really provide much benefit without the users
> later in the series, though, so not sure there's much point in sending
> them separately?
>
> >> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
> >> program type.
> >
> > [...]
> >
> >> Patches 7-10 adds the dequeue helpers and the verifier features needed to
> >> recognise packet pointers, reference count them, and allow dereferencing
> >> them to obtain packet data pointers.
> >
> > Have you considered using kfuncs for these instead of introducing new
> > hooks/contexts/etc?
>
> I did, but I'm not sure it's such a good fit? In particular, the way the
> direct packet access is implemented for dequeue programs (where you can
> get an xdp_md pointer and deref that to get data and data_end pointers)
> is done this way so programs can share utility functions between XDP and
> dequeue programs. And having a new program type for the dequeue progs
> seems like the obvious thing to do since they're doing something new?
>
> Maybe I'm missing something, though; could you elaborate on how you'd
> use kfuncs instead?

I was thinking about the approach in general. In networking bpf, we've
been adding new program types, new contexts and new explicit hooks.
This all requires a ton of boilerplate (converting from the uapi ctx to
the kernel one, exposing hook points, etc, etc). And looking at
Benjamin's HID series, it's so much more elegant: there is no uapi,
just a kernel function that can be overridden and a bunch of kfuncs
exposed. No uapi, no helpers, no fake contexts.
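
Concretely, the dequeue side of this series could then just be kfunc
declarations instead of new helpers (rough sketch, not a real API):

/* hypothetical kfunc versions of the helpers from this series */
extern struct xdp_md *bpf_packet_dequeue(void *ctx, void *map,
					 __u64 flags, __u64 *rank) __ksym;
extern void bpf_packet_drop(void *ctx, struct xdp_md *pkt) __ksym;

with the acquire/release semantics expressed through kfunc annotations
rather than a new helper proto and fake context.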

For networking and xdp the ship might have sailed, but I was wondering
whether we should be still stuck in that 'old' boilerplate world or we
have a chance to use new nice shiny things :-)

(but it might be all moot if we'd like to have stable uapis?)

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 14/17] libbpf: Add support for querying dequeue programs
  2022-07-13 11:14 ` [RFC PATCH 14/17] libbpf: Add support for querying dequeue programs Toke Høiland-Jørgensen
@ 2022-07-14  5:36   ` Andrii Nakryiko
  2022-07-14 10:13     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 46+ messages in thread
From: Andrii Nakryiko @ 2022-07-14  5:36 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Kumar Kartikeya Dwivedi, Networking, bpf, Freysteinn Alfredsson,
	Cong Wang

On Wed, Jul 13, 2022 at 4:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Add support to libbpf for reading the dequeue program ID from netlink when
> querying for installed XDP programs. No additional support is needed to
> install dequeue programs, as they are just using a new mode flag for the
> regular XDP program installation mechanism.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  tools/lib/bpf/libbpf.h  | 1 +
>  tools/lib/bpf/netlink.c | 8 ++++++++
>  2 files changed, 9 insertions(+)
>
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index e4d5353f757b..b15ff90279cb 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -906,6 +906,7 @@ struct bpf_xdp_query_opts {
>         __u32 drv_prog_id;      /* output */
>         __u32 hw_prog_id;       /* output */
>         __u32 skb_prog_id;      /* output */
> +       __u32 dequeue_prog_id;  /* output */

can't do that, you have to put it after attach_mode to preserve
backwards/forward compat

>         __u8 attach_mode;       /* output */
>         size_t :0;
>  };
> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> index 6c013168032d..64a9aceb9c9c 100644
> --- a/tools/lib/bpf/netlink.c
> +++ b/tools/lib/bpf/netlink.c
> @@ -32,6 +32,7 @@ struct xdp_link_info {
>         __u32 drv_prog_id;
>         __u32 hw_prog_id;
>         __u32 skb_prog_id;
> +       __u32 dequeue_prog_id;
>         __u8 attach_mode;
>  };
>
> @@ -354,6 +355,10 @@ static int get_xdp_info(void *cookie, void *msg, struct nlattr **tb)
>                 xdp_id->info.hw_prog_id = libbpf_nla_getattr_u32(
>                         xdp_tb[IFLA_XDP_HW_PROG_ID]);
>
> +       if (xdp_tb[IFLA_XDP_DEQUEUE_PROG_ID])
> +               xdp_id->info.dequeue_prog_id = libbpf_nla_getattr_u32(
> +                       xdp_tb[IFLA_XDP_DEQUEUE_PROG_ID]);
> +
>         return 0;
>  }
>
> @@ -391,6 +396,7 @@ int bpf_xdp_query(int ifindex, int xdp_flags, struct bpf_xdp_query_opts *opts)
>         OPTS_SET(opts, drv_prog_id, xdp_id.info.drv_prog_id);
>         OPTS_SET(opts, hw_prog_id, xdp_id.info.hw_prog_id);
>         OPTS_SET(opts, skb_prog_id, xdp_id.info.skb_prog_id);
> +       OPTS_SET(opts, dequeue_prog_id, xdp_id.info.dequeue_prog_id);
>         OPTS_SET(opts, attach_mode, xdp_id.info.attach_mode);
>
>         return 0;
> @@ -415,6 +421,8 @@ int bpf_xdp_query_id(int ifindex, int flags, __u32 *prog_id)
>                 *prog_id = opts.hw_prog_id;
>         else if (flags & XDP_FLAGS_SKB_MODE)
>                 *prog_id = opts.skb_prog_id;
> +       else if (flags & XDP_FLAGS_DEQUEUE_MODE)
> +               *prog_id = opts.dequeue_prog_id;
>         else
>                 *prog_id = 0;
>
> --
> 2.37.0
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog
  2022-07-13 11:14 ` [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog Toke Høiland-Jørgensen
@ 2022-07-14  5:38   ` Andrii Nakryiko
  2022-07-14  6:45     ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 46+ messages in thread
From: Andrii Nakryiko @ 2022-07-14  5:38 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Mykola Lysenko,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Kumar Kartikeya Dwivedi, Networking, bpf, Freysteinn Alfredsson,
	Cong Wang, Shuah Khan

On Wed, Jul 13, 2022 at 4:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
>
> Test various cases of direct packet access (proper range propagation,
> comparison of packet pointers pointing into separate xdp_frames, and
> correct invalidation on packet drop (so that multiple packet pointers
> are usable safely in a dequeue program)).
>
> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---

Consider writing these tests as plain C BPF code and putting them in
test_progs; is there anything you can't express in C that would require
test_verifier?

>  tools/testing/selftests/bpf/test_verifier.c   |  29 +++-
>  .../testing/selftests/bpf/verifier/dequeue.c  | 160 ++++++++++++++++++
>  2 files changed, 180 insertions(+), 9 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c
>

[...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 16/17] selftests/bpf: Add test for XDP queueing through PIFO maps
  2022-07-13 11:14 ` [RFC PATCH 16/17] selftests/bpf: Add test for XDP queueing through PIFO maps Toke Høiland-Jørgensen
@ 2022-07-14  5:41   ` Andrii Nakryiko
  2022-07-14 10:18     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 46+ messages in thread
From: Andrii Nakryiko @ 2022-07-14  5:41 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Kumar Kartikeya Dwivedi, Networking, bpf, Freysteinn Alfredsson,
	Cong Wang, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan

On Wed, Jul 13, 2022 at 4:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> This adds selftests for both variants of the generic PIFO map type, and for
> the dequeue program type. The XDP test uses bpf_prog_run() to run an XDP
> program that puts packets into a PIFO map, and then adds tests that pull
> them back out again through bpf_prog_run() of a dequeue program, as well as
> by attaching a dequeue program to a veth device and scheduling transmission
> there.
>
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>  .../selftests/bpf/prog_tests/pifo_map.c       | 125 ++++++++++++++
>  .../bpf/prog_tests/xdp_pifo_test_run.c        | 154 ++++++++++++++++++
>  tools/testing/selftests/bpf/progs/pifo_map.c  |  54 ++++++
>  .../selftests/bpf/progs/test_xdp_pifo.c       | 110 +++++++++++++
>  4 files changed, 443 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/pifo_map.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
>  create mode 100644 tools/testing/selftests/bpf/progs/pifo_map.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pifo.c
>

[...]

> +__u16 pkt_count = 0;
> +__u16 drop_above = 2;
> +
> +SEC("dequeue")

"dequeue" seems like a way too generic term, why not "xdp_dequeue" or
something like that? Isn't this XDP specific program?

> +void *dequeue_pifo(struct dequeue_ctx *ctx)
> +{
> +       __u64 prio = 0, pkt_prio = 0;
> +       void *data, *data_end;
> +       struct xdp_md *pkt;
> +       struct ethhdr *eth;
> +

[...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-13 21:52   ` Toke Høiland-Jørgensen
  2022-07-13 22:56     ` Stanislav Fomichev
@ 2022-07-14  6:34     ` Kumar Kartikeya Dwivedi
  2022-07-17 18:17     ` Cong Wang
  2 siblings, 0 replies; 46+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-07-14  6:34 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko, netdev, bpf,
	Freysteinn Alfredsson, Cong Wang

On Wed, 13 Jul 2022 at 23:52, Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Packet forwarding is an important use case for XDP, which offers
> >> significant performance improvements compared to forwarding using the
> >> regular networking stack. However, XDP currently offers no mechanism to
> >> delay, queue or schedule packets, which limits the practical uses for
> >> XDP-based forwarding to those where the capacity of input and output links
> >> always match each other (i.e., no rate transitions or many-to-one
> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> traffic shaping or reordering to enforce policy.
> >>
> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> code in these patches is functional, but needs additional testing and
> >> polishing before being considered for merging. I'm posting it here as an
> >> RFC to get some early feedback on the API and overall design of the
> >> feature.
> >>
> >> DESIGN
> >>
> >> The design consists of three components: A new map type for storing XDP
> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> packets from the map, optionally drop them, and to schedule an interface
> >> for transmission.
> >>
> >> The new map type is modelled on the PIFO data structure proposed in the
> >> literature[0][1]. It represents a priority queue where packets can be
> >> enqueued in any priority, but is always dequeued from the head. From the
> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> helper, where the target index is the desired priority.
> >
> > I have the same question I asked on the series from Cong:
> > Any considerations for existing carousel/edt-like models?
>
> Well, the reason for the addition in patch 5 (continuously increasing
> priorities) is exactly to be able to implement EDT-like behaviour, where
> the priority is used as time units to clock out packets.
>
> > Can we make the map flexible enough to implement different qdisc
> > policies?
>
> That's one of the things we want to be absolutely sure about. We are
> starting out with the PIFO map type because the literature makes a good
> case that it is flexible enough to implement all conceivable policies.
> The goal of the test harness linked as note [4] is to actually examine
> this; Frey is our PhD student working on this bit.
>
> Thus far we haven't hit any limitations on this, but we'll need to add
> more policies before we are done with this. Another consideration is
> performance, of course, so we're also planning to do a comparison with a
> more traditional "bunch of FIFO queues" type data structure for at least
> a subset of the algorithms. Kartikeya also had an idea for an
> alternative way to implement a priority queue using (semi-)lockless
> skiplists, which may turn out to perform better.
>

There's also code to go with the idea, just to show it can work :)
https://github.com/kkdwivedi/linux/commits/skiplist
Lookups are fully lockless; updates only contend when they share
predecessor/successor nodes. Still needs a lot of testing, though.
It's meant to be a generic ordered map, but can be repurposed as a
priority queue.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog
  2022-07-14  5:38   ` Andrii Nakryiko
@ 2022-07-14  6:45     ` Kumar Kartikeya Dwivedi
  2022-07-14 18:54       ` Andrii Nakryiko
  0 siblings, 1 reply; 46+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-07-14  6:45 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Networking, bpf,
	Freysteinn Alfredsson, Cong Wang, Shuah Khan

On Thu, 14 Jul 2022 at 07:38, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jul 13, 2022 at 4:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> >
> > Test various cases of direct packet access (proper range propagation,
> > comparison of packet pointers pointing into separate xdp_frames, and
> > correct invalidation on packet drop (so that multiple packet pointers
> > are usable safely in a dequeue program)).
> >
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > ---
>
> Consider writing these tests as plain C BPF code and put them in
> test_progs; is there anything you can't express in C and thus requires
> test_verifier?

Not really, but in general I like test_verifier because it stays
immune to compiler shenanigans.
So going forward should test_verifier tests be avoided, and normal C
tests (using SEC("?...")) be preferred for these cases?

>
> >  tools/testing/selftests/bpf/test_verifier.c   |  29 +++-
> >  .../testing/selftests/bpf/verifier/dequeue.c  | 160 ++++++++++++++++++
> >  2 files changed, 180 insertions(+), 9 deletions(-)
> >  create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c
> >
>
> [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 14/17] libbpf: Add support for querying dequeue programs
  2022-07-14  5:36   ` Andrii Nakryiko
@ 2022-07-14 10:13     ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-14 10:13 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Andrii Nakryiko, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Jakub Kicinski, Jesper Dangaard Brouer,
	Kumar Kartikeya Dwivedi, Networking, bpf, Freysteinn Alfredsson,
	Cong Wang

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Wed, Jul 13, 2022 at 4:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Add support to libbpf for reading the dequeue program ID from netlink when
>> querying for installed XDP programs. No additional support is needed to
>> install dequeue programs, as they are just using a new mode flag for the
>> regular XDP program installation mechanism.
>>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>>  tools/lib/bpf/libbpf.h  | 1 +
>>  tools/lib/bpf/netlink.c | 8 ++++++++
>>  2 files changed, 9 insertions(+)
>>
>> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
>> index e4d5353f757b..b15ff90279cb 100644
>> --- a/tools/lib/bpf/libbpf.h
>> +++ b/tools/lib/bpf/libbpf.h
>> @@ -906,6 +906,7 @@ struct bpf_xdp_query_opts {
>>         __u32 drv_prog_id;      /* output */
>>         __u32 hw_prog_id;       /* output */
>>         __u32 skb_prog_id;      /* output */
>> +       __u32 dequeue_prog_id;  /* output */
>
> can't do that, you have to put it after attach_mode to preserve
> backwards/forward compat

Argh, yes, of course, total brainfart - thanks for pointing that out! :)

-Toke
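
For reference, the compat-safe shape of the change looks like the
sketch below (full struct reproduced from libbpf as of this series):
libbpf opts structs are versioned by their leading sz field, so new
members may only ever be appended after the last existing one.
Inserting dequeue_prog_id in the middle shifts the offset of
attach_mode and breaks binary compatibility with callers built
against the old layout.

struct bpf_xdp_query_opts {
	size_t sz;              /* size of this struct, for fwd/bwd compat */
	__u32 prog_id;          /* output */
	__u32 drv_prog_id;      /* output */
	__u32 hw_prog_id;       /* output */
	__u32 skb_prog_id;      /* output */
	__u8 attach_mode;       /* output */
	__u32 dequeue_prog_id;  /* output; new member, appended last */
	size_t :0;              /* keep trailing padding out of sz */
};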


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 16/17] selftests/bpf: Add test for XDP queueing through PIFO maps
  2022-07-14  5:41   ` Andrii Nakryiko
@ 2022-07-14 10:18     ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-14 10:18 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, Daniel Borkmann, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
	Kumar Kartikeya Dwivedi, Networking, bpf, Freysteinn Alfredsson,
	Cong Wang, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Mykola Lysenko, Shuah Khan

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Wed, Jul 13, 2022 at 4:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> This adds selftests for both variants of the generic PIFO map type, and for
>> the dequeue program type. The XDP test uses bpf_prog_run() to run an XDP
>> program that puts packets into a PIFO map, and then adds tests that pull
>> them back out again through bpf_prog_run() of a dequeue program, as well as
>> by attaching a dequeue program to a veth device and scheduling transmission
>> there.
>>
>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> ---
>>  .../selftests/bpf/prog_tests/pifo_map.c       | 125 ++++++++++++++
>>  .../bpf/prog_tests/xdp_pifo_test_run.c        | 154 ++++++++++++++++++
>>  tools/testing/selftests/bpf/progs/pifo_map.c  |  54 ++++++
>>  .../selftests/bpf/progs/test_xdp_pifo.c       | 110 +++++++++++++
>>  4 files changed, 443 insertions(+)
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/pifo_map.c
>>  create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/pifo_map.c
>>  create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pifo.c
>>
>
> [...]
>
>> +__u16 pkt_count = 0;
>> +__u16 drop_above = 2;
>> +
>> +SEC("dequeue")
>
> "dequeue" seems like a way too generic term, why not "xdp_dequeue" or
> something like that? Isn't this an XDP-specific program?

Well, depending on how close the qdisc/xdp APIs end up being we may be
able to reuse the program type but have subtypes (so we could have
"dequeue/xdp" and "dequeue/skb" for instance). But if that doesn't pan
out I do see your point that "dequeue" is a bit too generic; will change
it to 'xdp_dequeue' in that case...

-Toke


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-13 22:56     ` Stanislav Fomichev
@ 2022-07-14 10:46       ` Toke Høiland-Jørgensen
  2022-07-14 17:24         ` Stanislav Fomichev
                           ` (2 more replies)
  0 siblings, 3 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-14 10:46 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Mykola Lysenko, Kumar Kartikeya Dwivedi, netdev,
	bpf, Freysteinn Alfredsson, Cong Wang

Stanislav Fomichev <sdf@google.com> writes:

> On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> Packet forwarding is an important use case for XDP, which offers
>> >> significant performance improvements compared to forwarding using the
>> >> regular networking stack. However, XDP currently offers no mechanism to
>> >> delay, queue or schedule packets, which limits the practical uses for
>> >> XDP-based forwarding to those where the capacity of input and output links
>> >> always match each other (i.e., no rate transitions or many-to-one
>> >> forwarding). It also prevents an XDP-based router from doing any kind of
>> >> traffic shaping or reordering to enforce policy.
>> >>
>> >> This series represents a first RFC of our attempt to remedy this lack. The
>> >> code in these patches is functional, but needs additional testing and
>> >> polishing before being considered for merging. I'm posting it here as an
>> >> RFC to get some early feedback on the API and overall design of the
>> >> feature.
>> >>
>> >> DESIGN
>> >>
>> >> The design consists of three components: A new map type for storing XDP
>> >> frames, a new 'dequeue' program type that will run in the TX softirq to
>> >> provide the stack with packets to transmit, and a set of helpers to dequeue
>> >> packets from the map, optionally drop them, and to schedule an interface
>> >> for transmission.
>> >>
>> >> The new map type is modelled on the PIFO data structure proposed in the
>> >> literature[0][1]. It represents a priority queue where packets can be
>> >> enqueued in any priority, but is always dequeued from the head. From the
>> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
>> >> helper, where the target index is the desired priority.
>> >
>> > I have the same question I asked on the series from Cong:
>> > Any considerations for existing carousel/edt-like models?
>>
>> Well, the reason for the addition in patch 5 (continuously increasing
>> priorities) is exactly to be able to implement EDT-like behaviour, where
>> the priority is used as time units to clock out packets.
>
> Ah, ok, I didn't read the patches closely enough. I saw some limits
> for the ranges and assumed that it wasn't capable of efficiently
> storing 64-bit timestamps..

The goal is definitely to support full 64-bit priorities. Right now you
have to start out at 0 but can go on for a full 64 bits, but that's a
bit of an API wart that I'd like to get rid of eventually...
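
To make the EDT idea concrete, a minimal sketch of an enqueue-side
program in the style of this series (the pieces taken from the series
are the BPF_MAP_TYPE_PIFO_XDP map type and bpf_redirect_map() with the
rank as target index; the map sizing parameters and the single-flow
pacing bookkeeping are invented for illustration):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PIFO_XDP);
	__uint(key_size, sizeof(__u64));   /* rank; sizes are guesses */
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 4096);
} pifo_map SEC(".maps");

__u64 t_epoch;                  /* set on first packet */
__u64 next_tstamp;              /* next free departure slot, ns since epoch */
const __u64 ns_per_pkt = 10000; /* pace at 100k packets/s, say */

SEC("xdp")
int xdp_edt_enqueue(struct xdp_md *ctx)
{
	__u64 now = bpf_ktime_get_ns(), rank;

	if (!t_epoch)
		t_epoch = now;
	now -= t_epoch;         /* ranks currently have to start at 0 */

	/* The rank is the departure time: now if the link is idle,
	 * otherwise the next free slot at the configured rate.
	 * (Single flow, no locking: strictly a sketch.) */
	rank = next_tstamp > now ? next_tstamp : now;
	next_tstamp = rank + ns_per_pkt;

	/* Dequeue always pulls the head of the PIFO, i.e. the packet
	 * whose clock-out time comes first. */
	return bpf_redirect_map(&pifo_map, rank, 0);
}

char _license[] SEC("license") = "GPL";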

>> > Can we make the map flexible enough to implement different qdisc
>> > policies?
>>
>> That's one of the things we want to be absolutely sure about. We are
>> starting out with the PIFO map type because the literature makes a good
>> case that it is flexible enough to implement all conceivable policies.
>> The goal of the test harness linked as note [4] is to actually examine
>> this; Frey is our PhD student working on this bit.
>>
>> Thus far we haven't hit any limitations on this, but we'll need to add
>> more policies before we are done with this. Another consideration is
>> performance, of course, so we're also planning to do a comparison with a
>> more traditional "bunch of FIFO queues" type data structure for at least
>> a subset of the algorithms. Kartikeya also had an idea for an
>> alternative way to implement a priority queue using (semi-)lockless
>> skiplists, which may turn out to perform better.
>>
>> If there's any particular policy/algorithm you'd like to see included in
>> this evaluation, please do let us know, BTW! :)
>
> I honestly am not sure what the bar for accepting this should be. But
> on Cong's series I mentioned Martin's CC bpf work as a great
> example of what we should be trying to do for qdisc-like maps. Having
> a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
> very convincing :-)

Just doing flow queueing is quite straightforward with PIFOs. We're
working on fq_codel. Personally I also want to implement something that
has feature parity with sch_cake (which includes every feature and the
kitchen sink already) :)
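
For a flavour of how plain flow queueing falls out of the same
primitive, a rough sketch (the flow hash and map shapes are simplified
stand-ins): each flow keeps a virtual finish time that advances by the
bytes it enqueues, and that virtual time becomes the PIFO rank,
approximating byte-fair sharing between flows:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PIFO_XDP);
	__uint(key_size, sizeof(__u64));
	__uint(value_size, sizeof(__u32));
	__uint(max_entries, 4096);
} pifo_map SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, __u32);   /* flow hash */
	__type(value, __u64); /* flow's virtual finish time */
	__uint(max_entries, 1024);
} flow_vtime SEC(".maps");

SEC("xdp")
int xdp_fq_enqueue(struct xdp_md *ctx)
{
	__u64 len = ctx->data_end - ctx->data, zero = 0, rank, *vt;
	/* Stand-in for a real 5-tuple flow hash, to keep this short */
	__u32 hash = ctx->rx_queue_index;

	vt = bpf_map_lookup_elem(&flow_vtime, &hash);
	if (!vt) {
		bpf_map_update_elem(&flow_vtime, &hash, &zero, BPF_NOEXIST);
		vt = bpf_map_lookup_elem(&flow_vtime, &hash);
		if (!vt)
			return XDP_DROP;
	}

	/* Enqueue at the flow's virtual finish time, then advance it
	 * by the bytes queued. (A real implementation would clamp the
	 * per-flow time to a global virtual clock so idle flows can't
	 * bank credit.) */
	rank = *vt;
	__sync_fetch_and_add(vt, len);

	return bpf_redirect_map(&pifo_map, rank, 0);
}

char _license[] SEC("license") = "GPL";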

>> >> The dequeue program type is a new BPF program type that is attached to an
>> >> interface; when an interface is scheduled for transmission, the stack will
>> >> execute the attached dequeue program and, if it returns a packet to
>> >> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
>> >> driver function.
>> >>
>> >> The dequeue program can obtain packets by pulling them out of a PIFO map
>> >> using the new bpf_packet_dequeue() helper. This returns a pointer to an
>> >> xdp_md structure, which can be dereferenced to obtain packet data and
>> >> data_meta pointers like in an XDP program. The returned packets are also
>> >> reference counted, meaning the verifier enforces that the dequeue program
>> >> either drops the packet (with the bpf_packet_drop() helper), or returns it
>> >> for transmission. Finally, a helper is added that can be used to actually
>> >> schedule an interface for transmission using the dequeue program type; this
>> >> helper can be called from both XDP and dequeue programs.
>> >>
>> >> PERFORMANCE
>> >>
>> >> Preliminary performance tests indicate about 50ns overhead of adding
>> >> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
>> >> overhead (but still 2x the forwarding performance of the netstack):
>> >>
>> >> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
>> >> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
>> >> netstack:       2 Mpps  (500 ns /pkt)
>> >>
>> >> RELATION TO BPF QDISC
>> >>
>> >> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
>> >> particular the use of a map to store packets. This is no accident, as we've
>> >> had ongoing discussions for a while now. I have no great hope that we can
>> >> completely converge the two efforts into a single BPF-based queueing
>> >> API (as has been discussed before[3], consolidating the SKB and XDP paths
>> >> is challenging). Rather, I'm hoping that we can converge the designs enough
>> >> that we can share BPF code between XDP and qdisc layers using common
>> >> functions, like it's possible to do with XDP and TC-BPF today. This would
>> >> imply agreeing on the map type and API, and possibly on the set of helpers
>> >> available to the BPF programs.
>> >
>> > What would be the big difference for the map wrt xdp_frame vs sk_buff
>> > excluding all obvious stuff like locking/refcnt?
>>
>> I expect it would be quite straight-forward to just add a second subtype
>> of the PIFO map in this series that holds skbs. In fact, I think that
>> from the BPF side, the whole model implemented here would be possible to
>> carry over to the qdisc layer more or less wholesale. Some other
>> features of the qdisc layer, like locking, classes, and
>> multi-CPU/multi-queue management may be trickier, but I'm not sure how
>> much of that we should expose in a BPF qdisc anyway (as you may have
>> noticed I commented on Cong's series to this effect regarding the
>> classful qdiscs).
>
> Maybe a related question here: with the way you do
> BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard would it
> be to have support for storing xdp_frames/skb in any map? Let's say we
> have generic BPF_MAP_TYPE_RBTREE, where the key is
> priority/timestamp/whatever, can we, based on the value's btf_id,
> figure out the rest? (that the value is kernel structure and needs
> special care and more constraints - can't be looked up from user space
> and so on)
>
> Seems like we really need to have two special cases: where we transfer
> ownership of xdp_frame/skb to/from the map; any other big
> complications?
>
> That way we can maybe untangle the series a bit: we can talk about
> efficient data structures for storing frames/skbs independently of
> some generic support for storing them in the maps. Any major
> complications with that approach?

I've had discussions with Kartikeya on this already (based on his 'kptr
in map' work). That may well end up being feasible, which would be
fantastic. The reason we didn't use it for this series is that there's
still some work to do on the generic verifier/infrastructure support
side of this (the PIFO map is the oldest part of this series), and I
didn't want to hold up the rest of the queueing work until that landed.

Now that we have a functional prototype I expect that iterating on the
data structure will be the next step. One complication with XDP is that
we probably want to keep using XDP_REDIRECT to place packets into the
map because that gets us bulking which is important for performance;
however, in general I like the idea of using BTF to designate the map
value type, and if we can figure out a way to make it completely generic
even for packets I'm all for that! :)
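
To make the BTF direction a bit more concrete, a purely hypothetical
sketch of a generic packet-carrying map built on the kptr work;
neither xdp_frame as an allowed kptr type nor this map usage exists
today, and only bpf_kptr_xchg() (the existing ownership-transfer
primitive) is real:

/* Mirrors the tag definition used by the kptr selftests */
#define __kptr_ref __attribute__((btf_type_tag("kptr_ref")))

/* Hypothetical map value: owns a reference to an xdp_frame through a
 * referenced kptr, so any map (a hash, or some future rbtree) could
 * store packets, with the key acting as the rank/timestamp. */
struct pkt_slot {
	struct xdp_frame __kptr_ref *frame;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, __u64);            /* rank / timestamp */
	__type(value, struct pkt_slot);
	__uint(max_entries, 4096);
} frame_store SEC(".maps");

/* Ownership would move through bpf_kptr_xchg(): storing a frame hands
 * it to the map (returning any displaced frame, which must then be
 * dropped or transmitted); loading it back out leaves NULL behind, so
 * the verifier can always account for exactly one owner. */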

>> >> PATCH STRUCTURE
>> >>
>> >> This series consists of a total of 17 patches, as follows:
>> >>
>> >> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
>> >> patches.
>> >
>> > Seems like these can go separately without holding the rest?
>>
>> Yeah, guess so? They don't really provide much benefit without the users
>> later in the series, though, so not sure there's much point in sending
>> them separately?
>>
>> >> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
>> >> program type.
>> >
>> > [...]
>> >
>> >> Patches 7-10 adds the dequeue helpers and the verifier features needed to
>> >> recognise packet pointers, reference count them, and allow dereferencing
>> >> them to obtain packet data pointers.
>> >
>> > Have you considered using kfuncs for these instead of introducing new
>> > hooks/contexts/etc?
>>
>> I did, but I'm not sure it's such a good fit? In particular, the way the
>> direct packet access is implemented for dequeue programs (where you can
>> get an xdp_md pointer and deref that to get data and data_end pointers)
>> is done this way so programs can share utility functions between XDP and
>> dequeue programs. And having a new program type for the dequeue progs
>> seem like the obvious thing to do since they're doing something new?
>>
>> Maybe I'm missing something, though; could you elaborate on how you'd
>> use kfuncs instead?
>
> I was thinking about the approach in general. In networking bpf, we've
> been adding new program types, new contexts and new explicit hooks.
> This all requires a ton of boilerplate (converting from uapi ctx to
> the kernel, exposing hook points, etc, etc). And looking at Benjamin's
> HID series, it's so much more elegant: there is no uapi, just a kernel
> function that allows it to be overridden and a bunch of kfuncs
> exposed. No uapi, no helpers, no fake contexts.
>
> For networking and xdp the ship might have sailed, but I was wondering
> whether we should be still stuck in that 'old' boilerplate world or we
> have a chance to use new nice shiny things :-)
>
> (but it might be all moot if we'd like to have stable uapis?)

Right, I see what you mean. My immediate feeling is that having an
explicit stable UAPI for XDP has served us well. We do all kinds of
rewrite tricks behind the scenes (things like switching between xdp_buff
and xdp_frame, bulking, direct packet access, reading ifindexes by
pointer walking txq->dev, etc) which are important ways to improve
performance without exposing too many nitty-gritty details into the API.

There's also consistency to consider: I think the addition of queueing
should work as a natural extension of the existing programming model for
XDP. So I feel like this is more a case of "if we were starting from
scratch today we might do things differently (like the HID series), but
when extending things let's keep it consistent"?

-Toke


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (17 preceding siblings ...)
  2022-07-13 18:36 ` [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Stanislav Fomichev
@ 2022-07-14 14:05 ` Jamal Hadi Salim
  2022-07-14 14:56   ` Dave Taht
  2022-07-14 16:21   ` Toke Høiland-Jørgensen
  2022-07-17 17:46 ` Cong Wang
  19 siblings, 2 replies; 46+ messages in thread
From: Jamal Hadi Salim @ 2022-07-14 14:05 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, Linux Kernel Network Developers, bpf,
	Freysteinn Alfredsson, Cong Wang

I think what would be really interesting is to see the performance
numbers when you have multiple producers/consumers (translation:
multiple threads/softirqs) in play targeting the same queues. Does
PIFO alleviate the synchronization challenge when you have multiple
concurrent readers/writers? Or maybe for your use case this would not
be a common occurrence or not something you care about?

As I mentioned previously, I think this is what Cong's approach gets for free.

cheers,
jamal

On Wed, Jul 13, 2022 at 7:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Packet forwarding is an important use case for XDP, which offers
> significant performance improvements compared to forwarding using the
> regular networking stack. However, XDP currently offers no mechanism to
> delay, queue or schedule packets, which limits the practical uses for
> XDP-based forwarding to those where the capacity of input and output links
> always match each other (i.e., no rate transitions or many-to-one
> forwarding). It also prevents an XDP-based router from doing any kind of
> traffic shaping or reordering to enforce policy.
>
> This series represents a first RFC of our attempt to remedy this lack. The
> code in these patches is functional, but needs additional testing and
> polishing before being considered for merging. I'm posting it here as an
> RFC to get some early feedback on the API and overall design of the
> feature.
>
> DESIGN
>
> The design consists of three components: A new map type for storing XDP
> frames, a new 'dequeue' program type that will run in the TX softirq to
> provide the stack with packets to transmit, and a set of helpers to dequeue
> packets from the map, optionally drop them, and to schedule an interface
> for transmission.
>
> The new map type is modelled on the PIFO data structure proposed in the
> literature[0][1]. It represents a priority queue where packets can be
> enqueued in any priority, but is always dequeued from the head. From the
> XDP side, the map is simply used as a target for the bpf_redirect_map()
> helper, where the target index is the desired priority.
>
> The dequeue program type is a new BPF program type that is attached to an
> interface; when an interface is scheduled for transmission, the stack will
> execute the attached dequeue program and, if it returns a packet to
> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
> driver function.
>
> The dequeue program can obtain packets by pulling them out of a PIFO map
> using the new bpf_packet_dequeue() helper. This returns a pointer to an
> xdp_md structure, which can be dereferenced to obtain packet data and
> data_meta pointers like in an XDP program. The returned packets are also
> reference counted, meaning the verifier enforces that the dequeue program
> either drops the packet (with the bpf_packet_drop() helper), or returns it
> for transmission. Finally, a helper is added that can be used to actually
> schedule an interface for transmission using the dequeue program type; this
> helper can be called from both XDP and dequeue programs.
>
> PERFORMANCE
>
> Preliminary performance tests indicate about 50ns overhead of adding
> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
> overhead (but still 2x the forwarding performance of the netstack):
>
> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
> netstack:       2 Mpps  (500 ns /pkt)
>
> RELATION TO BPF QDISC
>
> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
> particular the use of a map to store packets. This is no accident, as we've
> had ongoing discussions for a while now. I have no great hope that we can
> completely converge the two efforts into a single BPF-based queueing
> API (as has been discussed before[3], consolidating the SKB and XDP paths
> is challenging). Rather, I'm hoping that we can converge the designs enough
> that we can share BPF code between XDP and qdisc layers using common
> functions, like it's possible to do with XDP and TC-BPF today. This would
> imply agreeing on the map type and API, and possibly on the set of helpers
> available to the BPF programs.
>
> PATCH STRUCTURE
>
> This series consists of a total of 17 patches, as follows:
>
> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
> patches.
>
> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
> program type.
>
> Patches 7-10 adds the dequeue helpers and the verifier features needed to
> recognise packet pointers, reference count them, and allow dereferencing
> them to obtain packet data pointers.
>
> Patches 11 and 12 add the dequeue program hook to the TX path, and the
> helpers to schedule an interface.
>
> Patches 13-16 add libbpf support for the new types, and selftests for the
> new features.
>
> Finally, patch 17 adds queueing support to the xdp_fwd program in
> samples/bpf to provide an easy-to-use way of testing the feature; this is
> for illustrative purposes for the RFC only, and will not be included in the
> final submission.
>
> SUPPLEMENTARY MATERIAL
>
> A (WiP) test harness for implementing and unit-testing scheduling
> algorithms using this framework (and the bpf_prog_run() hook) is available
> as part of the bpf-examples repository[4]. We plan to expand this with more
> test algorithms to smoke-test the API, and also add ready-to-use queueing
> algorithms for use for forwarding (to replace the xdp_fwd patch included as
> part of this RFC submission).
>
> The work represented in this series was done in collaboration with several
> people. Thanks to Kumar Kartikeya Dwivedi for writing the verifier
> enhancements in this series, to Frey Alfredsson for his work on the testing
> harness in [4], and to Jesper Brouer, Per Hurtig and Anna Brunstrom for
> their valuable input on the design of the queueing APIs.
>
> This series is also available as a git tree on git.kernel.org[5].
>
> NOTES
>
> [0] http://web.mit.edu/pifo/
> [1] https://arxiv.org/abs/1810.03060
> [2] https://lore.kernel.org/r/20220602041028.95124-1-xiyou.wangcong@gmail.com
> [3] https://lore.kernel.org/r/b4ff6a2b-1478-89f8-ea9f-added498c59f@gmail.com
> [4] https://github.com/xdp-project/bpf-examples/pull/40
> [5] https://git.kernel.org/pub/scm/linux/kernel/git/toke/linux.git/log/?h=xdp-queueing-06
>
> Kumar Kartikeya Dwivedi (5):
>   bpf: Use 64-bit return value for bpf_prog_run
>   bpf: Teach the verifier about referenced packets returned from dequeue
>     programs
>   bpf: Introduce pkt_uid member for PTR_TO_PACKET
>   bpf: Implement direct packet access in dequeue progs
>   selftests/bpf: Add verifier tests for dequeue prog
>
> Toke Høiland-Jørgensen (12):
>   dev: Move received_rps counter next to RPS members in softnet data
>   bpf: Expand map key argument of bpf_redirect_map to u64
>   bpf: Add a PIFO priority queue map type
>   pifomap: Add queue rotation for continuously increasing rank mode
>   xdp: Add dequeue program type for getting packets from a PIFO
>   bpf: Add helpers to dequeue from a PIFO map
>   dev: Add XDP dequeue hook
>   bpf: Add helper to schedule an interface for TX dequeue
>   libbpf: Add support for dequeue program type and PIFO map type
>   libbpf: Add support for querying dequeue programs
>   selftests/bpf: Add test for XDP queueing through PIFO maps
>   samples/bpf: Add queueing support to xdp_fwd sample
>
>  include/linux/bpf-cgroup.h                    |  12 +-
>  include/linux/bpf.h                           |  64 +-
>  include/linux/bpf_types.h                     |   4 +
>  include/linux/bpf_verifier.h                  |  14 +-
>  include/linux/filter.h                        |  63 +-
>  include/linux/netdevice.h                     |   8 +-
>  include/net/xdp.h                             |  16 +-
>  include/uapi/linux/bpf.h                      |  50 +-
>  include/uapi/linux/if_link.h                  |   4 +-
>  kernel/bpf/Makefile                           |   2 +-
>  kernel/bpf/cgroup.c                           |  12 +-
>  kernel/bpf/core.c                             |  14 +-
>  kernel/bpf/cpumap.c                           |   4 +-
>  kernel/bpf/devmap.c                           |  92 ++-
>  kernel/bpf/offload.c                          |   4 +-
>  kernel/bpf/pifomap.c                          | 635 ++++++++++++++++++
>  kernel/bpf/syscall.c                          |   3 +
>  kernel/bpf/verifier.c                         | 148 +++-
>  net/bpf/test_run.c                            |  54 +-
>  net/core/dev.c                                | 109 +++
>  net/core/dev.h                                |   2 +
>  net/core/filter.c                             | 307 ++++++++-
>  net/core/rtnetlink.c                          |  30 +-
>  net/packet/af_packet.c                        |   7 +-
>  net/xdp/xskmap.c                              |   4 +-
>  samples/bpf/xdp_fwd_kern.c                    |  65 +-
>  samples/bpf/xdp_fwd_user.c                    | 200 ++++--
>  tools/include/uapi/linux/bpf.h                |  48 ++
>  tools/include/uapi/linux/if_link.h            |   4 +-
>  tools/lib/bpf/libbpf.c                        |   1 +
>  tools/lib/bpf/libbpf.h                        |   1 +
>  tools/lib/bpf/libbpf_probes.c                 |   5 +
>  tools/lib/bpf/netlink.c                       |   8 +
>  .../selftests/bpf/prog_tests/pifo_map.c       | 125 ++++
>  .../bpf/prog_tests/xdp_pifo_test_run.c        | 154 +++++
>  tools/testing/selftests/bpf/progs/pifo_map.c  |  54 ++
>  .../selftests/bpf/progs/test_xdp_pifo.c       | 110 +++
>  tools/testing/selftests/bpf/test_verifier.c   |  29 +-
>  .../testing/selftests/bpf/verifier/dequeue.c  | 160 +++++
>  39 files changed, 2426 insertions(+), 200 deletions(-)
>  create mode 100644 kernel/bpf/pifomap.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/pifo_map.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_pifo_test_run.c
>  create mode 100644 tools/testing/selftests/bpf/progs/pifo_map.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_pifo.c
>  create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c
>
> --
> 2.37.0
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-14 14:05 ` Jamal Hadi Salim
@ 2022-07-14 14:56   ` Dave Taht
  2022-07-14 15:33     ` Jamal Hadi Salim
  2022-07-14 16:21   ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 46+ messages in thread
From: Dave Taht @ 2022-07-14 14:56 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Mykola Lysenko, Kumar Kartikeya Dwivedi,
	Linux Kernel Network Developers, bpf, Freysteinn Alfredsson,
	Cong Wang

In general I feel a programmable packet pacing approach is the right
way forward for the internet as a whole.

It lends itself more easily and accurately to offloading in an age
where it is difficult to do anything sane within a ms on the host
cpu, especially in virtualized environments, in the enormous dynamic
range of kbits/ms to gbits/ms between host and potential recipient [1]

So considerations about what is easier to offload moving forward vs
central cpu costs should be in this conversation.

[1] I'm kind of on a campaign to get people to stop thinking in
mbits/sec and start thinking about intervals well below human
perception: thus gbits/ms - or packets/ns!

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-14 14:56   ` Dave Taht
@ 2022-07-14 15:33     ` Jamal Hadi Salim
  0 siblings, 0 replies; 46+ messages in thread
From: Jamal Hadi Salim @ 2022-07-14 15:33 UTC (permalink / raw)
  To: Dave Taht
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Mykola Lysenko, Kumar Kartikeya Dwivedi,
	Linux Kernel Network Developers, bpf, Freysteinn Alfredsson,
	Cong Wang

On Thu, Jul 14, 2022 at 10:56 AM Dave Taht <dave.taht@gmail.com> wrote:
>
> In general I feel a programmable packet pacing approach is the right
> way forward for the internet as a whole.
>
> It lends itself more easily and accurately to offloading in an age
> where it is difficult to do anything sane within a ms on the host
> cpu, especially in virtualized environments, in the enormous dynamic
> range of kbits/ms to gbits/ms between host and potential recipient [1]
>
> So considerations about what is easier to offload moving forward vs
> central cpu costs should be in this conversation.
>

If you know your hardware can offload - there is a lot less to worry about.
You can let the synchronization be handled by hardware. For example,
if your hardware can do strict priority scheduling/queueing you really
should bypass the kernel layer, set appropriate metadata (skb prio)
and let the hw handle it. See the HTB offload from Nvidia.
OTOH, EDT-based approaches are best as a lightweight approach that
takes advantage of simple hardware features (like timestamps, etc).

cheers,
jamal

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-14 14:05 ` Jamal Hadi Salim
  2022-07-14 14:56   ` Dave Taht
@ 2022-07-14 16:21   ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-14 16:21 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, Linux Kernel Network Developers, bpf,
	Freysteinn Alfredsson, Cong Wang

Jamal Hadi Salim <jhs@mojatatu.com> writes:

> I think what would be really interesting is to see the performance
> numbers when you have multiple producers/consumers (translation:
> multiple threads/softirqs) in play targeting the same queues. Does
> PIFO alleviate the synchronization challenge when you have multiple
> concurrent readers/writers? Or maybe for your use case this would not
> be a common occurrence or not something you care about?

Right, this is definitely one of the areas we want to flesh out some
more and benchmark. I think a PIFO-based algorithm *can* be an
improvement here because you can compute the priority without holding
any lock and only grab a lock for inserting the packet, which can be
made even better with a (partially) lockless data structure and/or
batching.
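
In pseudo-C, the split being described looks roughly like this (all
names invented, not the actual pifomap code): the rank computation,
which embodies the scheduling policy, runs lock-free, and only the
insertion itself takes a short per-bucket lock:

static int pifo_enqueue(struct pifo_map *map, struct xdp_frame *frame)
{
	/* Policy part, lock-free: compute the rank from packet fields,
	 * per-flow state or a timestamp. */
	u64 rank = compute_rank(frame);
	struct pifo_bucket *b = pifo_bucket(map, rank);

	/* Mechanism part: only the insertion is serialized, and
	 * batching several frames per acquisition amortizes the lock
	 * cost further. */
	spin_lock(&b->lock);
	__pifo_insert(b, frame, rank);
	spin_unlock(&b->lock);

	return 0;
}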

In any case we *have* to do a certain amount of re-inventing for XDP
because we can't reuse the qdisc infrastructure anyway. Ultimately, I
expect it will be possible to write both really well-performing
algorithms, and really badly-performing ones. Such is the power of BPF,
after all, and as long as we can provide an existence proof of the
former, that's fine with me :)

> As I mentioned previously, I think this is what Cong's approach gets
> for free.

Yes, but it also retains the global qdisc lock; my (naive, perhaps?)
hope is that since we have to do things differently in XDP land anyway,
that work can translate into something that is amenable to being
lockless in qdisc land as well...

-Toke


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-14 10:46       ` Toke Høiland-Jørgensen
@ 2022-07-14 17:24         ` Stanislav Fomichev
  2022-07-15  1:12         ` Alexei Starovoitov
  2022-07-17 19:12         ` Cong Wang
  2 siblings, 0 replies; 46+ messages in thread
From: Stanislav Fomichev @ 2022-07-14 17:24 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Mykola Lysenko, Kumar Kartikeya Dwivedi, netdev,
	bpf, Freysteinn Alfredsson, Cong Wang

On Thu, Jul 14, 2022 at 3:46 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >>
> >> >> Packet forwarding is an important use case for XDP, which offers
> >> >> significant performance improvements compared to forwarding using the
> >> >> regular networking stack. However, XDP currently offers no mechanism to
> >> >> delay, queue or schedule packets, which limits the practical uses for
> >> >> XDP-based forwarding to those where the capacity of input and output links
> >> >> always match each other (i.e., no rate transitions or many-to-one
> >> >> forwarding). It also prevents an XDP-based router from doing any kind of
> >> >> traffic shaping or reordering to enforce policy.
> >> >>
> >> >> This series represents a first RFC of our attempt to remedy this lack. The
> >> >> code in these patches is functional, but needs additional testing and
> >> >> polishing before being considered for merging. I'm posting it here as an
> >> >> RFC to get some early feedback on the API and overall design of the
> >> >> feature.
> >> >>
> >> >> DESIGN
> >> >>
> >> >> The design consists of three components: A new map type for storing XDP
> >> >> frames, a new 'dequeue' program type that will run in the TX softirq to
> >> >> provide the stack with packets to transmit, and a set of helpers to dequeue
> >> >> packets from the map, optionally drop them, and to schedule an interface
> >> >> for transmission.
> >> >>
> >> >> The new map type is modelled on the PIFO data structure proposed in the
> >> >> literature[0][1]. It represents a priority queue where packets can be
> >> >> enqueued in any priority, but is always dequeued from the head. From the
> >> >> XDP side, the map is simply used as a target for the bpf_redirect_map()
> >> >> helper, where the target index is the desired priority.
> >> >
> >> > I have the same question I asked on the series from Cong:
> >> > Any considerations for existing carousel/edt-like models?
> >>
> >> Well, the reason for the addition in patch 5 (continuously increasing
> >> priorities) is exactly to be able to implement EDT-like behaviour, where
> >> the priority is used as time units to clock out packets.
> >
> > Ah, ok, I didn't read the patches closely enough. I saw some limits
> > for the ranges and assumed that it wasn't capable of efficiently
> > storing 64-bit timestamps..
>
> The goal is definitely to support full 64-bit priorities. Right now you
> have to start out at 0 but can go on for a full 64 bits, but that's a
> bit of an API wart that I'd like to get rid of eventually...
>
> >> > Can we make the map flexible enough to implement different qdisc
> >> > policies?
> >>
> >> That's one of the things we want to be absolutely sure about. We are
> >> starting out with the PIFO map type because the literature makes a good
> >> case that it is flexible enough to implement all conceivable policies.
> >> The goal of the test harness linked as note [4] is to actually examine
> >> this; Frey is our PhD student working on this bit.
> >>
> >> Thus far we haven't hit any limitations on this, but we'll need to add
> >> more policies before we are done with this. Another consideration is
> >> performance, of course, so we're also planning to do a comparison with a
> >> more traditional "bunch of FIFO queues" type data structure for at least
> >> a subset of the algorithms. Kartikeya also had an idea for an
> >> alternative way to implement a priority queue using (semi-)lockless
> >> skiplists, which may turn out to perform better.
> >>
> >> If there's any particular policy/algorithm you'd like to see included in
> >> this evaluation, please do let us know, BTW! :)
> >
> > I honestly am not sure what the bar for accepting this should be. But
>> > on Cong's series I mentioned Martin's CC bpf work as a great
> > example of what we should be trying to do for qdisc-like maps. Having
> > a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
> > very convincing :-)
>
>> Just doing flow queueing is quite straightforward with PIFOs. We're
> working on fq_codel. Personally I also want to implement something that
> has feature parity with sch_cake (which includes every feature and the
> kitchen sink already) :)

Yeah, sch_cake works too 👍

> >> >> The dequeue program type is a new BPF program type that is attached to an
> >> >> interface; when an interface is scheduled for transmission, the stack will
> >> >> execute the attached dequeue program and, if it returns a packet to
> >> >> transmit, that packet will be transmitted using the existing ndo_xdp_xmit()
> >> >> driver function.
> >> >>
> >> >> The dequeue program can obtain packets by pulling them out of a PIFO map
> >> >> using the new bpf_packet_dequeue() helper. This returns a pointer to an
> >> >> xdp_md structure, which can be dereferenced to obtain packet data and
> >> >> data_meta pointers like in an XDP program. The returned packets are also
> >> >> reference counted, meaning the verifier enforces that the dequeue program
> >> >> either drops the packet (with the bpf_packet_drop() helper), or returns it
> >> >> for transmission. Finally, a helper is added that can be used to actually
> >> >> schedule an interface for transmission using the dequeue program type; this
> >> >> helper can be called from both XDP and dequeue programs.
> >> >>
> >> >> PERFORMANCE
> >> >>
> >> >> Preliminary performance tests indicate about 50ns overhead of adding
> >> >> queueing to the xdp_fwd example (last patch), which translates to a 20% PPS
> >> >> overhead (but still 2x the forwarding performance of the netstack):
> >> >>
> >> >> xdp_fwd :     4.7 Mpps  (213 ns /pkt)
> >> >> xdp_fwd -Q:   3.8 Mpps  (263 ns /pkt)
> >> >> netstack:       2 Mpps  (500 ns /pkt)
> >> >>
> >> >> RELATION TO BPF QDISC
> >> >>
> >> >> Cong Wang's BPF qdisc patches[2] share some aspects of this series, in
> >> >> particular the use of a map to store packets. This is no accident, as we've
> >> >> had ongoing discussions for a while now. I have no great hope that we can
> >> >> completely converge the two efforts into a single BPF-based queueing
> >> >> API (as has been discussed before[3], consolidating the SKB and XDP paths
> >> >> is challenging). Rather, I'm hoping that we can converge the designs enough
> >> >> that we can share BPF code between XDP and qdisc layers using common
> >> >> functions, like it's possible to do with XDP and TC-BPF today. This would
> >> >> imply agreeing on the map type and API, and possibly on the set of helpers
> >> >> available to the BPF programs.
> >> >
> >> > What would be the big difference for the map wrt xdp_frame vs sk_buff
> >> > excluding all obvious stuff like locking/refcnt?
> >>
> >> I expect it would be quite straight-forward to just add a second subtype
> >> of the PIFO map in this series that holds skbs. In fact, I think that
> >> from the BPF side, the whole model implemented here would be possible to
> >> carry over to the qdisc layer more or less wholesale. Some other
> >> features of the qdisc layer, like locking, classes, and
> >> multi-CPU/multi-queue management may be trickier, but I'm not sure how
> >> much of that we should expose in a BPF qdisc anyway (as you may have
> >> noticed I commented on Cong's series to this effect regarding the
> >> classful qdiscs).
> >
> > Maybe a related question here: with the way you do
> > BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard would it
> > be to have support for storing xdp_frames/skb in any map? Let's say we
> > have generic BPF_MAP_TYPE_RBTREE, where the key is
> > priority/timestamp/whatever, can we, based on the value's btf_id,
> > figure out the rest? (that the value is kernel structure and needs
> > special care and more constraints - can't be looked up from user space
> > and so on)
> >
> > Seems like we really need to have two special cases: where we transfer
> > ownership of xdp_frame/skb to/from the map; any other big
> > complications?
> >
> > That way we can maybe untangle the series a bit: we can talk about
> > efficient data structures for storing frames/skbs independently of
> > some generic support for storing them in the maps. Any major
> > complications with that approach?
>
> I've had discussions with Kartikeya on this already (based on his 'kptr
> in map' work). That may well end up being feasible, which would be
> fantastic. The reason we didn't use it for this series is that there's
> still some work to do on the generic verifier/infrastructure support
> side of this (the PIFO map is the oldest part of this series), and I
> didn't want to hold up the rest of the queueing work until that landed.

Yes, exactly, kptr seems like a very promising thing that you can leverage.
I'm looking forward to it!

> Now that we have a functional prototype I expect that iterating on the
> data structure will be the next step. One complication with XDP is that
> we probably want to keep using XDP_REDIRECT to place packets into the
> map because that gets us bulking which is important for performance;
> however, in general I like the idea of using BTF to designate the map
> value type, and if we can figure out a way to make it completely generic
> even for packets I'm all for that! :)

As long as we have generic kptr-based-skb-capable-maps and can
add/remove/lookup skbs using existing helpers it seems fine to have
XDP_REDIRECT as some kind of xdp-specific optimization.

> >> >> PATCH STRUCTURE
> >> >>
> >> >> This series consists of a total of 17 patches, as follows:
> >> >>
> >> >> Patches 1-3 are smaller preparatory refactoring patches used by subsequent
> >> >> patches.
> >> >
> >> > Seems like these can go separately without holding the rest?
> >>
> >> Yeah, guess so? They don't really provide much benefit without the users
>> later in the series, though, so not sure there's much point in sending
> >> them separately?
> >>
> >> >> Patches 4-5 introduce the PIFO map type, and patch 6 introduces the dequeue
> >> >> program type.
> >> >
> >> > [...]
> >> >
> >> >> Patches 7-10 adds the dequeue helpers and the verifier features needed to
> >> >> recognise packet pointers, reference count them, and allow dereferencing
> >> >> them to obtain packet data pointers.
> >> >
> >> > Have you considered using kfuncs for these instead of introducing new
> >> > hooks/contexts/etc?
> >>
> >> I did, but I'm not sure it's such a good fit? In particular, the way the
> >> direct packet access is implemented for dequeue programs (where you can
> >> get an xdp_md pointer and deref that to get data and data_end pointers)
> >> is done this way so programs can share utility functions between XDP and
> >> dequeue programs. And having a new program type for the dequeue progs
> >> seem like the obvious thing to do since they're doing something new?
> >>
> >> Maybe I'm missing something, though; could you elaborate on how you'd
> >> use kfuncs instead?
> >
> > I was thinking about the approach in general. In networking bpf, we've
> > been adding new program types, new contexts and new explicit hooks.
> > This all requires a ton of boilerplate (converting from uapi ctx to
> > the kernel, exposing hook points, etc, etc). And looking at Benjamin's
> > HID series, it's so much more elegant: there is no uapi, just a kernel
> > function that allows it to be overridden and a bunch of kfuncs
> > exposed. No uapi, no helpers, no fake contexts.
> >
> > For networking and xdp the ship might have sailed, but I was wondering
> > whether we should be still stuck in that 'old' boilerplate world or we
> > have a chance to use new nice shiny things :-)
> >
> > (but it might be all moot if we'd like to have stable uapis?)
>
> Right, I see what you mean. My immediate feeling is that having an
> explicit stable UAPI for XDP has served us well. We do all kinds of
> rewrite tricks behind the scenes (things like switching between xdp_buff
> and xdp_frame, bulking, direct packet access, reading ifindexes by
> pointer walking txq->dev, etc) which are important ways to improve
> performance without exposing too many nitty-gritty details into the API.
>
> There's also consistency to consider: I think the addition of queueing
> should work as a natural extension of the existing programming model for
> XDP. So I feel like this is more a case of "if we were starting from
> scratch today we might do things differently (like the HID series), but
> when extending things let's keep it consistent"?

Agreed. If we want to have the ability to inspect/change xdp/skb in
your new dequeue hooks, we might need to keep that fake xdp_buff for
consistency :-(

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog
  2022-07-14  6:45     ` Kumar Kartikeya Dwivedi
@ 2022-07-14 18:54       ` Andrii Nakryiko
  2022-07-15 11:11         ` Kumar Kartikeya Dwivedi
  0 siblings, 1 reply; 46+ messages in thread
From: Andrii Nakryiko @ 2022-07-14 18:54 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Networking, bpf,
	Freysteinn Alfredsson, Cong Wang, Shuah Khan

On Wed, Jul 13, 2022 at 11:45 PM Kumar Kartikeya Dwivedi
<memxor@gmail.com> wrote:
>
> On Thu, 14 Jul 2022 at 07:38, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > On Wed, Jul 13, 2022 at 4:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >
> > > From: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > >
> > > Test various cases of direct packet access (proper range propagation,
> > > comparison of packet pointers pointing into separate xdp_frames, and
> > > correct invalidation on packet drop (so that multiple packet pointers
> > > are usable safely in a dequeue program)).
> > >
> > > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
> > > Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > > ---
> >
> > Consider writing these tests as plain C BPF code and put them in
> > test_progs; is there anything you can't express in C and thus requires
> > test_verifier?
>
> Not really, but in general I like test_verifier because it stays
> immune to compiler shenanigans.

In general I dislike them because they are almost incomprehensible. So
unless there is a very particular sequence of low-level BPF assembly
instructions one needs to test, I'd always opt for test_progs as the
more maintainable solution.

Things like making sure that the verifier rejects invalid use of
particular objects or helpers don't seem to rely much on a particular
assembly sequence and can and should be expressed in plain C.


> So going forward should test_verifier tests be avoided, and normal C
> tests (using SEC("?...")) be preferred for these cases?

In my opinion, yes, unless absolutely requiring low-level assembly to
express conditions which are otherwise hard to express reliably in C.
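
Applied to this series, that pattern could look roughly like the
sketch below (skeleton and program names invented; the
bpf_packet_dequeue() argument list is approximated from the cover
letter). The BPF side marks the known-bad program with SEC("?...") so
it is skipped on normal load, and the test enables autoload for just
that program and asserts that the verifier refuses it:

/* BPF side (progs/test_dequeue_fail.c, hypothetical) */
SEC("?dequeue")
void *leak_packet(struct dequeue_ctx *ctx)
{
	__u64 rank = 0;
	struct xdp_md *pkt;

	pkt = (void *)bpf_packet_dequeue(ctx, &pifo_map, 0, &rank);
	if (!pkt)
		return NULL;
	return NULL; /* pkt reference leaked: the verifier must reject this */
}

/* test_progs side (prog_tests/dequeue_fail.c, hypothetical) */
void test_dequeue_fail(void)
{
	struct test_dequeue_fail *skel = test_dequeue_fail__open();

	if (!ASSERT_OK_PTR(skel, "skel_open"))
		return;

	/* SEC("?...") programs default to autoload == false */
	bpf_program__set_autoload(skel->progs.leak_packet, true);
	ASSERT_ERR(test_dequeue_fail__load(skel), "load_should_fail");

	test_dequeue_fail__destroy(skel);
}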

>
> >
> > >  tools/testing/selftests/bpf/test_verifier.c   |  29 +++-
> > >  .../testing/selftests/bpf/verifier/dequeue.c  | 160 ++++++++++++++++++
> > >  2 files changed, 180 insertions(+), 9 deletions(-)
> > >  create mode 100644 tools/testing/selftests/bpf/verifier/dequeue.c
> > >
> >
> > [...]

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-14 10:46       ` Toke Høiland-Jørgensen
  2022-07-14 17:24         ` Stanislav Fomichev
@ 2022-07-15  1:12         ` Alexei Starovoitov
  2022-07-15 12:55           ` Toke Høiland-Jørgensen
  2022-07-17 19:12         ` Cong Wang
  2 siblings, 1 reply; 46+ messages in thread
From: Alexei Starovoitov @ 2022-07-15  1:12 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang

On Thu, Jul 14, 2022 at 12:46:54PM +0200, Toke Høiland-Jørgensen wrote:
> >
> > Maybe a related question here: with the way you do
> > BPF_MAP_TYPE_PIFO_GENERIC vs BPF_MAP_TYPE_PIFO_XDP, how hard it would
> > be have support for storing xdp_frames/skb in any map? Let's say we
> > have generic BPF_MAP_TYPE_RBTREE, where the key is
> > priority/timestamp/whatever, can we, based on the value's btf_id,
> > figure out the rest? (that the value is kernel structure and needs
> > special care and more constraints - can't be looked up from user space
> > and so on)
> >
> > Seems like we really need to have two special cases: where we transfer
> > ownership of xdp_frame/skb to/from the map, any other big
> > complications?
> >
> > That way we can maybe untangle the series a bit: we can talk about
> > efficient data structures for storing frames/skbs independently of
> > some generic support for storing them in the maps. Any major
> > complications with that approach?
> 
> I've had discussions with Kartikeya on this already (based on his 'kptr
> in map' work). That may well end up being feasible, which would be
> fantastic. The reason we didn't use it for this series is that there's
> still some work to do on the generic verifier/infrastructure support
> side of this (the PIFO map is the oldest part of this series), and I
> didn't want to hold up the rest of the queueing work until that landed.

Certainly makes sense for the RFC to be sent out early,
but Stan's point stands. We have to avoid type-specific maps when a
generic one will do. The kptr infra is getting close to being that answer.
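
Something like (sketch, not saying the verifier is ready for this today):

struct map_val {
	struct xdp_frame __kptr *frame; /* referenced kptr in a plain map value */
};

/* ownership of the frame moves into the map on xchg */
old = bpf_kptr_xchg(&val->frame, frame);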

> >> Maybe I'm missing something, though; could you elaborate on how you'd
> >> use kfuncs instead?
> >
> > I was thinking about the approach in general. In networking bpf, we've
> > been adding new program types, new contexts and new explicit hooks.
> > This all requires a ton of boiler plate (converting from uapi ctx to
> > the kernel, exposing hook points, etc, etc). And looking at Benjamin's
> > HID series, it's so much more elegant: there is no uapi, just kernel
> > function that allows it to be overridden and a bunch of kfuncs
> > exposed. No uapi, no helpers, no fake contexts.
> >
> > For networking and xdp the ship might have sailed, but I was wondering
> > whether we should be still stuck in that 'old' boilerplate world or we
> > have a chance to use new nice shiny things :-)
> >
> > (but it might be all moot if we'd like to have stable upis?)
> 
> Right, I see what you mean. My immediate feeling is that having an
> explicit stable UAPI for XDP has served us well. We do all kinds of
> rewrite tricks behind the scenes (things like switching between xdp_buff
> and xdp_frame, bulking, direct packet access, reading ifindexes by
> pointer walking txq->dev, etc) which are important ways to improve
> performance without exposing too many nitty-gritty details into the API.
> 
> There's also consistency to consider: I think the addition of queueing
> should work as a natural extension of the existing programming model for
> XDP. So I feel like this is more a case of "if we were starting from
> scratch today we might do things differently (like the HID series), but
> when extending things let's keep it consistent"?

"consistent" makes sense when new feature follows established path.
The programmable packet scheduling in TX is just as revolutionary as
XDP in RX was years ago :)
This feature can be done similar to hid-bpf without cast-in-stone uapi
and hooks. Such patches would be much easier to land and iterate on top.
The amount of bike shedding will be 10 times less.
No need for new program type, no new hooks, no new FDs and attach uapi-s.
Even libbpf won't need any changes.
Add few kfuncs and __weak noinline "hooks" in TX path.
Only new map type would be necessary.
If it can be made with kptr then it will be the only uapi exposure that
will be heavily scrutinized.

It doesn't mean that it will stay unstable-api forever. Once it demonstrates
that it is on par with fq/fq_codel/cake feature-wise we can bake it into uapi.
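
Shape-wise, roughly (all names invented, just to illustrate):

/* kernel, TX path: a default no-op hook that a BPF prog can override
 * via fmod_ret, the same trick hid-bpf uses */
__weak noinline int bpf_xdp_tx_dequeue(struct net_device *dev, u64 flags)
{
	return 0; /* default: nothing queued, fall through */
}

/* BPF side: no new prog type, no new attach uapi */
SEC("fmod_ret/bpf_xdp_tx_dequeue")
int BPF_PROG(my_sched, struct net_device *dev, u64 flags)
{
	/* kfuncs to pull frames out of the pifo map go here */
	return 0;
}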


* Re: [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog
  2022-07-14 18:54       ` Andrii Nakryiko
@ 2022-07-15 11:11         ` Kumar Kartikeya Dwivedi
  0 siblings, 0 replies; 46+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-07-15 11:11 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Toke Høiland-Jørgensen, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Mykola Lysenko, David S. Miller,
	Jakub Kicinski, Jesper Dangaard Brouer, Networking, bpf,
	Freysteinn Alfredsson, Cong Wang, Shuah Khan

On Thu, 14 Jul 2022 at 20:54, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jul 13, 2022 at 11:45 PM Kumar Kartikeya Dwivedi
> <memxor@gmail.com> wrote:
> >
> > On Thu, 14 Jul 2022 at 07:38, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> > >
> > > On Wed, Jul 13, 2022 at 4:15 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > > [...]
> > >
> > > Consider writing these tests as plain C BPF code and put them in
> > > test_progs, is there anything you can't express in C and thus requires
> > > test_verifier?
> >
> > Not really, but in general I like test_verifier because it stays
> > immune to compiler shenanigans.
>
> In general I dislike them because they are almost incomprehensible. So
> unless there is a very particular sequence of low-level BPF assembly
> instructions one needs to test, I'd always opt for test_progs as the more
> maintainable solution.
>
> Things like making sure that the verifier rejects invalid use of
> particular objects or helpers don't seem to rely much on a particular
> assembly sequence, and can and should be expressed in plain C.
>
>
> > So going forward should test_verifier tests be avoided, and normal C
> > tests (using SEC("?...")) be preferred for these cases?
>
> In my opinion, yes, unless absolutely requiring low-level assembly to
> express conditions which are otherwise hard to express reliably in C.
>

Ok, fair point. I will replace these with C tests in the next version.



* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-15  1:12         ` Alexei Starovoitov
@ 2022-07-15 12:55           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-15 12:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson,
	Cong Wang

Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:

> On Thu, Jul 14, 2022 at 12:46:54PM +0200, Toke Høiland-Jørgensen wrote:
>> [...]
>
> Certainly makes sense for the RFC to be sent out early,
> but Stan's point stands. We have to avoid type-specific maps when a
> generic one will do. The kptr infra is getting close to being that answer.

ACK, I'll iterate on the map types and see how far we can get with the
kptr approach.

Do people feel a generic priority queue type would be generally useful?
Because in that case I can split out this work into a separate series...
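
(From the BPF side I'd imagine it looking roughly like this, purely a
sketch:

struct {
	__uint(type, BPF_MAP_TYPE_PIFO_GENERIC);
	__uint(max_entries, 4096);
	__type(key, __u64);   /* the rank/priority */
	__type(value, struct my_val);
} prioq SEC(".maps");

with push/pop keyed on the u64 rank.)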

>> [...]
>> 
>> There's also consistency to consider: I think the addition of queueing
>> should work as a natural extension of the existing programming model for
>> XDP. So I feel like this is more a case of "if we were starting from
>> scratch today we might do things differently (like the HID series), but
>> when extending things let's keep it consistent"?
>
> "consistent" makes sense when new feature follows established path.
> The programmable packet scheduling in TX is just as revolutionary as
> XDP in RX was years ago :)
> This feature can be done similar to hid-bpf without cast-in-stone uapi
> and hooks. Such patches would be much easier to land and iterate on top.
> The amount of bike shedding will be 10 times less.
> No need for new program type, no new hooks, no new FDs and attach uapi-s.
> Even libbpf won't need any changes.
> Add few kfuncs and __weak noinline "hooks" in TX path.
> Only new map type would be necessary.
> If it can be made with kptr then it will be the only uapi exposure that
> will be heavily scrutinized.

I'm not quite convinced it's that obvious that it can be implemented
this way; but I don't mind trying it out either, if nothing else it'll
give us something to compare against...

> It doesn't mean that it will stay unstable-api forever. Once it demonstrates
> that it is on par with fq/fq_codel/cake feature-wise we can bake it into uapi.

In any case, I don't think we should merge anything until we've shown
this :)

-Toke



* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
                   ` (18 preceding siblings ...)
  2022-07-14 14:05 ` Jamal Hadi Salim
@ 2022-07-17 17:46 ` Cong Wang
  2022-07-18 12:45   ` Toke Høiland-Jørgensen
  19 siblings, 1 reply; 46+ messages in thread
From: Cong Wang @ 2022-07-17 17:46 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson

On Wed, Jul 13, 2022 at 01:14:08PM +0200, Toke Høiland-Jørgensen wrote:
> Packet forwarding is an important use case for XDP, which offers
> significant performance improvements compared to forwarding using the
> regular networking stack. However, XDP currently offers no mechanism to
> delay, queue or schedule packets, which limits the practical uses for
> XDP-based forwarding to those where the capacity of input and output links
> always match each other (i.e., no rate transitions or many-to-one
> forwarding). It also prevents an XDP-based router from doing any kind of
> traffic shaping or reordering to enforce policy.
> 

Sorry for forgetting to respond to your email to my patchset.

The most important question from you is actually why I gave up on PIFO.
Its limitation is already in its name: Push In First Out already says
clearly that it only allows dequeueing the first element. Still
confusing?

You can take a look at your pifo_map_pop_elem(), which is the
implementation for bpf_map_pop_elem(), which is:

       long bpf_map_pop_elem(struct bpf_map *map, void *value)

Clearly, there is not even a 'key' in its parameter list. If you just
compare it to mine:

	BPF_CALL_2(bpf_skb_map_pop, struct bpf_map *, map, u64, key)

Is their difference now 100% clear? :)

The next question is why this is important (actually it is the most
important one): because we (I mean eBPF Qdisc users, not sure about you)
want the programmability, which I have been emphasizing since V1...

Clearly it is already too late to fix bpf_map_pop_elem(); we don't want
to repeat that mistake again.

More importantly, the latter can easily implement the former, as shown below:

bpf_stack_for_min; // auxiliary BPF_MAP_TYPE_STACK tracking the running minimum key

push(map, key, value)
{
    // remember the smallest key seen so far
    bpf_stack_for_min.push(min(key, bpf_stack_for_min.top()));
    // insert the (key, value) pair into the map here
}

pop_the_first(map, value)
{
    // the top of the min-stack is the smallest key, pop that element
    val = pop_any(map, bpf_stack_for_min.top());
    *value = val;
    bpf_stack_for_min.pop();
}


BTW, what is _your_ use case for skb map and user-space PIFO map? I am
sure you have use cases for XDP; it is unclear what you have for the
other cases. Please don't piggyback use cases you don't have, we all
have to justify all use cases. :)

Thanks.


* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-13 21:52   ` Toke Høiland-Jørgensen
  2022-07-13 22:56     ` Stanislav Fomichev
  2022-07-14  6:34     ` Kumar Kartikeya Dwivedi
@ 2022-07-17 18:17     ` Cong Wang
  2022-07-17 18:41       ` Kumar Kartikeya Dwivedi
  2022-07-18 12:12       ` Toke Høiland-Jørgensen
  2 siblings, 2 replies; 46+ messages in thread
From: Cong Wang @ 2022-07-17 18:17 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson

On Wed, Jul 13, 2022 at 11:52:07PM +0200, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> 
> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> [...]
> >
> > I have the same question I asked on the series from Cong:
> > Any considerations for existing carousel/edt-like models?
> 
> Well, the reason for the addition in patch 5 (continuously increasing
> priorities) is exactly to be able to implement EDT-like behaviour, where
> the priority is used as time units to clock out packets.

Are you sure? I seriously doubt your patch can do this at all...

Since your patch relies on bpf_map_push_elem(), which has no room for a
'key', you reuse 'flags' for it, but you also reserve 4 bits there... How
could a tstamp be packed with 4 of the bits reserved?

To answer Stanislav's question, this is how my code could handle EDT:

// BPF_CALL_3(bpf_skb_map_push, struct bpf_map *, map, struct sk_buff *, skb, u64, key)
skb->tstamp = XXX;
bpf_skb_map_push(map, skb, skb->tstamp);

(Please refer to another reply from me for how to get the min when popping,
which is essentially just a popular interview coding problem.)

Actually, if we look into the in-kernel EDT implementation (net/sched/sch_etf.c),
it is also based on an rbtree rather than a PIFO. ;-)

Thanks.


* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-17 18:17     ` Cong Wang
@ 2022-07-17 18:41       ` Kumar Kartikeya Dwivedi
  2022-07-17 19:23         ` Cong Wang
  2022-07-18 12:12       ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 46+ messages in thread
From: Kumar Kartikeya Dwivedi @ 2022-07-17 18:41 UTC (permalink / raw)
  To: Cong Wang
  Cc: Toke Høiland-Jørgensen, Stanislav Fomichev,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Mykola Lysenko, netdev, bpf,
	Freysteinn Alfredsson

On Sun, 17 Jul 2022 at 20:17, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Jul 13, 2022 at 11:52:07PM +0200, Toke Høiland-Jørgensen wrote:
> > Stanislav Fomichev <sdf@google.com> writes:
> >
> > > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >> [...]
> > >
> > > I have the same question I asked on the series from Cong:
> > > Any considerations for existing carousel/edt-like models?
> >
> > Well, the reason for the addition in patch 5 (continuously increasing
> > priorities) is exactly to be able to implement EDT-like behaviour, where
> > the priority is used as time units to clock out packets.
>
> Are you sure? I seriously doubt your patch can do this at all...
>
> Since your patch relies on bpf_map_push_elem(), which has no room for a
> 'key', you reuse 'flags' for it, but you also reserve 4 bits there... How
> could a tstamp be packed with 4 of the bits reserved?
>
> To answer Stanislav's question, this is how my code could handle EDT:
>
> // BPF_CALL_3(bpf_skb_map_push, struct bpf_map *, map, struct sk_buff *, skb, u64, key)
> skb->tstamp = XXX;
> bpf_skb_map_push(map, skb, skb->tstamp);

It is also possible here; if we could not push into the map with a
certain key, it wouldn't be a PIFO.
Please look at patch 16/17 for an example (test_xdp_pifo.c); it's just
that the interface is different (bpf_redirect_map):
the key has been expanded to 64 bits to accommodate such use cases. It
is also possible in a future version of the patch to amortize the cost
of taking the lock for each enqueue by doing batching, similar to what
cpumap/devmap implementations do.
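
On the enqueue side that looks roughly like this (get_rank() is a
hypothetical stand-in for whatever rank computation you want, e.g. a
timestamp for EDT):

SEC("xdp")
int xdp_enqueue(struct xdp_md *xdp)
{
	__u64 prio = get_rank(xdp); /* hypothetical */

	return bpf_redirect_map(&pifo_map, prio, 0);
}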

>
> (Please refer to another reply from me for how to get the min when popping,
> which is essentially just a popular interview coding problem.)
>
> Actually, if we look into the in-kernel EDT implementation (net/sched/sch_etf.c),
> it is also based on an rbtree rather than a PIFO. ;-)
>
> Thanks.


* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-14 10:46       ` Toke Høiland-Jørgensen
  2022-07-14 17:24         ` Stanislav Fomichev
  2022-07-15  1:12         ` Alexei Starovoitov
@ 2022-07-17 19:12         ` Cong Wang
  2022-07-18 12:25           ` Toke Høiland-Jørgensen
  2 siblings, 1 reply; 46+ messages in thread
From: Cong Wang @ 2022-07-17 19:12 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson

On Thu, Jul 14, 2022 at 12:46:54PM +0200, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> 
> > On Wed, Jul 13, 2022 at 2:52 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Stanislav Fomichev <sdf@google.com> writes:
> >>
> >> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >> >> [...]
> >> >
> >> > I have the same question I asked on the series from Cong:
> >> > Any considerations for existing carousel/edt-like models?
> >>
> >> Well, the reason for the addition in patch 5 (continuously increasing
> >> priorities) is exactly to be able to implement EDT-like behaviour, where
> >> the priority is used as time units to clock out packets.
> >
> > Ah, ok, I didn't read the patches closely enough. I saw some limits
> > for the ranges and assumed that it wasn't capable of efficiently
> > storing 64-bit timestamps..
> 
> The goal is definitely to support full 64-bit priorities. Right now you
> have to start out at 0 but can go on for a full 64 bits, but that's a
> bit of an API wart that I'd like to get rid of eventually...
> 
> >> > Can we make the map flexible enough to implement different qdisc
> >> > policies?
> >>
> >> That's one of the things we want to be absolutely sure about. We are
> >> starting out with the PIFO map type because the literature makes a good
> >> case that it is flexible enough to implement all conceivable policies.
> >> The goal of the test harness linked as note [4] is to actually examine
> >> this; Frey is our PhD student working on this bit.
> >>
> >> Thus far we haven't hit any limitations on this, but we'll need to add
> >> more policies before we are done with this. Another consideration is
> >> performance, of course, so we're also planning to do a comparison with a
> >> more traditional "bunch of FIFO queues" type data structure for at least
> >> a subset of the algorithms. Kartikeya also had an idea for an
> >> alternative way to implement a priority queue using (semi-)lockless
> >> skiplists, which may turn out to perform better.
> >>
> >> If there's any particular policy/algorithm you'd like to see included in
> >> this evaluation, please do let us know, BTW! :)
> >
> > I honestly am not sure what the bar for accepting this should be. But
> > on the Cong's series I mentioned Martin's CC bpf work as a great
> > example of what we should be trying to do for qdisc-like maps. Having
> > a bpf version of fq/fq_codel/whatever_other_complex_qdisc might be
> > very convincing :-)
> 
> Just doing flow queueing is quite straightforward with PIFOs. We're
> working on fq_codel. Personally I also want to implement something that
> has feature parity with sch_cake (which includes every feature and the
> kitchen sink already) :)

And how exactly would you plan to implement Least Slack Time First with
PIFOs?  See https://www.usenix.org/system/files/nsdi20-paper-sharma.pdf.
Can you be as specific as possible, ideally with pseudo code?

BTW, this is very easy to do with my approach, as it has no first-out
limitation.

Thanks!


* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-17 18:41       ` Kumar Kartikeya Dwivedi
@ 2022-07-17 19:23         ` Cong Wang
  0 siblings, 0 replies; 46+ messages in thread
From: Cong Wang @ 2022-07-17 19:23 UTC (permalink / raw)
  To: Kumar Kartikeya Dwivedi
  Cc: Toke Høiland-Jørgensen, Stanislav Fomichev,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Jesper Dangaard Brouer,
	Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
	Jonathan Lemon, Mykola Lysenko, netdev, bpf,
	Freysteinn Alfredsson

On Sun, Jul 17, 2022 at 08:41:10PM +0200, Kumar Kartikeya Dwivedi wrote:
> On Sun, 17 Jul 2022 at 20:17, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Wed, Jul 13, 2022 at 11:52:07PM +0200, Toke Høiland-Jørgensen wrote:
> > > Stanislav Fomichev <sdf@google.com> writes:
> > >
> > > > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > > >>
> > > >> [...]
> > > >
> > > > I have the same question I asked on the series from Cong:
> > > > Any considerations for existing carousel/edt-like models?
> > >
> > > Well, the reason for the addition in patch 5 (continuously increasing
> > > priorities) is exactly to be able to implement EDT-like behaviour, where
> > > the priority is used as time units to clock out packets.
> >
> > Are you sure? I seriously doubt your patch can do this at all...
> >
> > Since your patch relies on bpf_map_push_elem(), which has no room for a
> > 'key', you reuse 'flags' for it, but you also reserve 4 bits there... How
> > could a tstamp be packed with 4 of the bits reserved?
> >
> > To answer Stanislav's question, this is how my code could handle EDT:
> >
> > // BPF_CALL_3(bpf_skb_map_push, struct bpf_map *, map, struct sk_buff *, skb, u64, key)
> > skb->tstamp = XXX;
> > bpf_skb_map_push(map, skb, skb->tstamp);
> 
> It is also possible here; if we could not push into the map with a
> certain key, it wouldn't be a PIFO.
> Please look at patch 16/17 for an example (test_xdp_pifo.c); it's just
> that the interface is different (bpf_redirect_map):


Sorry, but I have to mention that I don't care about the XDP case at all.
Please let me know how this works for eBPF Qdisc. This is what I found in 16/17:

+ ret = bpf_map_push_elem(&pifo_map, &val, flags);


> the key has been expanded to 64 bits to accommodate such use cases. It
> is also possible in a future version of the patch to amortize the cost
> of taking the lock for each enqueue by doing batching, similar to what
> cpumap/devmap implementations do.

How about the 4 reserved bits?

 ret = bpf_map_push_elem(&pifo_map, &val, flags);

which leads to:

+#define BPF_PIFO_PRIO_MASK	(~0ULL >> 4)
...
+static int pifo_map_push_elem(struct bpf_map *map, void *value, u64 flags)
+{
+	struct bpf_pifo_map *pifo = container_of(map, struct bpf_pifo_map, map);
+	struct bpf_pifo_element *dst;
+	unsigned long irq_flags;
+	u64 prio;
+	int ret;
+
+	/* Check if any of the actual flag bits are set */
+	if (flags & ~BPF_PIFO_PRIO_MASK)
+		return -EINVAL;
+
+	prio = flags & BPF_PIFO_PRIO_MASK;


Please let me know how you arrive at 64 bits while I only count 60
bits (for the skb case, obviously)?

Wait a second: since BPF_EXIST already takes a bit, I think you actually
have 59 bits here...

Thanks!


* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-17 18:17     ` Cong Wang
  2022-07-17 18:41       ` Kumar Kartikeya Dwivedi
@ 2022-07-18 12:12       ` Toke Høiland-Jørgensen
  1 sibling, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-18 12:12 UTC (permalink / raw)
  To: Cong Wang
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson

Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 13, 2022 at 11:52:07PM +0200, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>> 
>> > On Wed, Jul 13, 2022 at 4:14 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> >>
>> >> [...]
>> >
>> > I have the same question I asked on the series from Cong:
>> > Any considerations for existing carousel/edt-like models?
>> 
>> Well, the reason for the addition in patch 5 (continuously increasing
>> priorities) is exactly to be able to implement EDT-like behaviour, where
>> the priority is used as time units to clock out packets.
>
> Are you sure? I seriously doubt your patch can do this at all...
>
> Since your patch relies on bpf_map_push_elem(), which has no room for a
> 'key', you reuse 'flags' for it, but you also reserve 4 bits there... How
> could a tstamp be packed with 4 of the bits reserved?

Well, my point was that the *data structure* itself supports 64-bit
priorities, and that's what we use from bpf_redirect_map() in XDP.
Reserving four bits was a somewhat arbitrary choice on my part. I
actually figured 60 bits would be plenty to represent timestamps in
themselves, but I guess I miscalculated a bit for nanosecond
timestamps (60 bits only gets you 36 years of range there).

We could lower that to 2 reserved bits, which gets you a range of 146
years using 62 bits; or users could just right-shift the value by a
couple of bits before putting it in the map (scheduling with
single-nanosecond precision is not possible anyway, so losing a few bits
of precision is no big deal); or we could add a new helper instead of
reusing the existing one.
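
E.g. something like (sketch):

	__u64 prio = bpf_ktime_get_ns() >> 2; /* 4 ns units: 60 bits cover ~146 years */

	return bpf_redirect_map(&pifo_map, prio, 0);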

> Actually, if we look into the in-kernel EDT implementation
> (net/sched/sch_etf.c), it is also based on an rbtree rather than a PIFO.

The main reason I eschewed the existing rbtree code is that I don't
believe it's sufficiently performant, mainly due to the rebalancing.
This is a hunch, though, and as I mentioned in a different reply I'm
planning to go back and revisit the data structure, including
benchmarking different implementations against each other.

-Toke



* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-17 19:12         ` Cong Wang
@ 2022-07-18 12:25           ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-18 12:25 UTC (permalink / raw)
  To: Cong Wang
  Cc: Stanislav Fomichev, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson

Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Thu, Jul 14, 2022 at 12:46:54PM +0200, Toke Høiland-Jørgensen wrote:
>> Stanislav Fomichev <sdf@google.com> writes:
>> 
>> [...]
>> 
>> Just doing flow queueing is quite straightforward with PIFOs. We're
>> working on fq_codel. Personally I also want to implement something that
>> has feature parity with sch_cake (which includes every feature and the
>> kitchen sink already) :)
>
> And how exactly would you plan to implement Least Slack Time First with
> PIFOs?  See https://www.usenix.org/system/files/nsdi20-paper-sharma.pdf.
> Can you be as specific as possible, ideally with pseudo code?

By sticking flows into the PIFO instead of individual packets.
Basically:

enqueue:

flow_id = hash_pkt(pkt);
flow_pifo = &flows[flow_id];
pifo_enqueue(flow_pifo, pkt, 0); // always enqueue at rank 0, so effectively a FIFO
pifo_enqueue(toplevel_pifo, flow_id, compute_rank(flow_id)); // for LSTF, rank = slack of the flow's head packet

dequeue:

flow_id = pifo_dequeue(toplevel_pifo);
flow_pifo = &flows[flow_id];
pkt = pifo_dequeue(flow_pifo);
if (!pifo_empty(flow_pifo))
	pifo_enqueue(toplevel_pifo, flow_id, compute_rank(flow_id)); // re-enqueue only flows with backlog
return pkt;

(Bookkeeping to avoid enqueueing the same flow_id twice in the top-level
PIFO is omitted for brevity.)


We have not gotten around to doing a full implementation of this yet,
but SRPT/LSTF is on our list of algorithms to add :)

> BTW, this is very easy to do with my approach as no FO limitations.

How does being able to dequeue out-of-order actually help with this
particular scheme? On dequeue you still process things in priority
order?

-Toke



* Re: [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities
  2022-07-17 17:46 ` Cong Wang
@ 2022-07-18 12:45   ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 46+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-18 12:45 UTC (permalink / raw)
  To: Cong Wang
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
	Maciej Fijalkowski, Jonathan Lemon, Mykola Lysenko,
	Kumar Kartikeya Dwivedi, netdev, bpf, Freysteinn Alfredsson

Cong Wang <xiyou.wangcong@gmail.com> writes:

> On Wed, Jul 13, 2022 at 01:14:08PM +0200, Toke Høiland-Jørgensen wrote:
>> [...]
>> 
>
> Sorry for forgetting to respond to your email to my patchset.
>
> The most important question from you is actually why I gave up on PIFO.
> Its limitation is already in its name: Push In First Out already says
> clearly that it only allows dequeueing the first element. Still
> confusing?
>
> You can take a look at your pifo_map_pop_elem(), which is the
> implementation for bpf_map_pop_elem(), which is:
>
>        long bpf_map_pop_elem(struct bpf_map *map, void *value)
>
> Clearly, there is not even a 'key' in its parameter list. If you just
> compare it to mine:
>
> 	BPF_CALL_2(bpf_skb_map_pop, struct bpf_map *, map, u64, key)
>
> Is their difference now 100% clear? :)
>
> The next question is why this is important (actually it is the most
> important one): because we (I mean eBPF Qdisc users, not sure about you)
> want the programmability, which I have been emphasizing since V1...

Right, I realise that in a strictly abstract sense, only being able to
dequeue at the head is a limitation. However, what I'm missing is what
concrete thing that limitation prevents you from implementing (see my
reply to your other email about LSTF)? I'm really not trying to be
disingenuous, I have no interest in ending up with a map primitive that
turns out to be limiting down the road...

> BTW, what is _your_ use case for skb map and user-space PIFO map? I am
> sure you have use cases for XDP; it is unclear what you have for the
> other cases. Please don't piggyback use cases you don't have, we all
> have to justify all use cases. :)

You keep talking about the SKB and XDP paths as though they're
completely separate things. They're not: we're adding a general
capability to the kernel to implement programmable packet queueing using
BPF. One is for packets forwarded from the XDP hook, the other is for
packets coming from the stack. In an ideal world we'd only need one hook
that could handle both paths, but as the discussion I linked to in my
cover letter shows, that is probably not going to be feasible.

So we'll most likely end up with two hooks, but as far as use cases are
concerned they are the same: I want to make sure that the primitives we
add are expressive enough to implement every conceivable queueing and
scheduling algorithm. I don't view the two efforts to be in competition
with each other either; we're really trying to achieve the same thing
here, so let's work together to make sure we end up with something that
works for both the XDP and qdisc layers? :)

The reason I mention the SKB path in particular in this series is that I
want to make sure we make the two as compatible as we can, so as not to
unnecessarily fragment the ecosystem. Sharing primitives is the obvious
way to do that.

-Toke



Thread overview: 46+ messages

2022-07-13 11:14 [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 01/17] dev: Move received_rps counter next to RPS members in softnet data Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 02/17] bpf: Expand map key argument of bpf_redirect_map to u64 Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 03/17] bpf: Use 64-bit return value for bpf_prog_run Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 04/17] bpf: Add a PIFO priority queue map type Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 05/17] pifomap: Add queue rotation for continuously increasing rank mode Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 06/17] xdp: Add dequeue program type for getting packets from a PIFO Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 07/17] bpf: Teach the verifier about referenced packets returned from dequeue programs Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 08/17] bpf: Add helpers to dequeue from a PIFO map Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 09/17] bpf: Introduce pkt_uid member for PTR_TO_PACKET Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 10/17] bpf: Implement direct packet access in dequeue progs Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 11/17] dev: Add XDP dequeue hook Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 12/17] bpf: Add helper to schedule an interface for TX dequeue Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 13/17] libbpf: Add support for dequeue program type and PIFO map type Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 14/17] libbpf: Add support for querying dequeue programs Toke Høiland-Jørgensen
2022-07-14  5:36   ` Andrii Nakryiko
2022-07-14 10:13     ` Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 15/17] selftests/bpf: Add verifier tests for dequeue prog Toke Høiland-Jørgensen
2022-07-14  5:38   ` Andrii Nakryiko
2022-07-14  6:45     ` Kumar Kartikeya Dwivedi
2022-07-14 18:54       ` Andrii Nakryiko
2022-07-15 11:11         ` Kumar Kartikeya Dwivedi
2022-07-13 11:14 ` [RFC PATCH 16/17] selftests/bpf: Add test for XDP queueing through PIFO maps Toke Høiland-Jørgensen
2022-07-14  5:41   ` Andrii Nakryiko
2022-07-14 10:18     ` Toke Høiland-Jørgensen
2022-07-13 11:14 ` [RFC PATCH 17/17] samples/bpf: Add queueing support to xdp_fwd sample Toke Høiland-Jørgensen
2022-07-13 18:36 ` [RFC PATCH 00/17] xdp: Add packet queueing and scheduling capabilities Stanislav Fomichev
2022-07-13 21:52   ` Toke Høiland-Jørgensen
2022-07-13 22:56     ` Stanislav Fomichev
2022-07-14 10:46       ` Toke Høiland-Jørgensen
2022-07-14 17:24         ` Stanislav Fomichev
2022-07-15  1:12         ` Alexei Starovoitov
2022-07-15 12:55           ` Toke Høiland-Jørgensen
2022-07-17 19:12         ` Cong Wang
2022-07-18 12:25           ` Toke Høiland-Jørgensen
2022-07-14  6:34     ` Kumar Kartikeya Dwivedi
2022-07-17 18:17     ` Cong Wang
2022-07-17 18:41       ` Kumar Kartikeya Dwivedi
2022-07-17 19:23         ` Cong Wang
2022-07-18 12:12       ` Toke Høiland-Jørgensen
2022-07-14 14:05 ` Jamal Hadi Salim
2022-07-14 14:56   ` Dave Taht
2022-07-14 15:33     ` Jamal Hadi Salim
2022-07-14 16:21   ` Toke Høiland-Jørgensen
2022-07-17 17:46 ` Cong Wang
2022-07-18 12:45   ` Toke Høiland-Jørgensen
