* [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls.
@ 2023-03-16  2:36 Kui-Feng Lee
  2023-03-16  2:36 ` [PATCH bpf-next v7 1/8] bpf: Retire the struct_ops map kvalue->refcnt Kui-Feng Lee
                   ` (7 more replies)
  0 siblings, 8 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-16  2:36 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf; +Cc: Kui-Feng Lee

Major changes:

 - Create bpf_links in the kernel for BPF struct_ops maps, to
   register and unregister them.

 - Enable switching between implementations of bpf-tcp-cc under a
   name instantly by replacing the backing struct_ops map of a
   bpf_link.

Previously, a BPF struct_ops did not need to be pinned to stay
registered: even after the user program that created it terminated,
the struct_ops remained active. For instance, the TCP congestion
control subsystem indirectly maintains a reference count on the
struct_ops of any registered BPF-implemented algorithm, so the
algorithm is not deactivated until someone deliberately unregisters
it. For consistency with other BPF programs, bpf_links have been
created to work in coordination with struct_ops maps, ensuring that
these maps are registered when their bpf_link is created and
unregistered when it goes away.

We also faced complications when attempting to replace an existing
TCP congestion control algorithm with a new implementation on the
fly. A struct_ops map registers a TCP congestion control algorithm
under a unique name, so we had to either register the alternative
implementation under a new name and migrate users over, or
unregister the current one before re-registering under the same
name. To fix this problem, we add an option to move the
registration of the algorithm from struct_ops maps to bpf_links. By
replacing the backing struct_ops map of a bpf_link, it becomes
possible to swap in a new implementation of an existing TCP
congestion control algorithm with ease.

The major differences from v6:

 - Reword commit logs of patches 1, 2, and 8.

 - Call synchronize_rcu_tasks() as well in bpf_struct_ops_map_free().

 - Refactor bpf_struct_ops_map_free() so that
   bpf_struct_ops_map_alloc() can free a struct_ops without waiting
   for an RCU grace period.

The major differences from v5:

 - Add a new step to bpf_object__load() to prepare vdata.

 - Accept BPF_F_REPLACE.

 - Check section IDs in find_struct_ops_map_by_offset().

 - Add a test case to check mixing w/ and w/o link struct_ops.

 - Add a test case of using struct_ops w/o link to update a link.

 - Improve bpf_link__detach_struct_ops() to handle the w/ link case.

The major differences from v4:

 - Rebase.

 - Reorder patches and merge part 4 into part 2 of the v4 series.

The major differences from v3:

 - Remove bpf_struct_ops_map_free_rcu(), and use synchronize_rcu().

 - Improve the commit log of part 1.

 - Before transitioning to the READY state, we validate the value to
   ensure that the struct_ops can be utilized successfully and a
   link created for it later.

The major differences from v2:

 - Simplify states

   - Remove TOBEUNREG.

   - Rename UNREG to READY.

 - Stop using the refcnt of the kvalue of a struct_ops. Explicitly
   increase and decrease the refcount of the struct_ops map instead.

 - Prepare kernel vdata during the load phase of libbpf.

The major differences from v1:

 - Added bpf_struct_ops_link to replace the previous union-based
   approach.

 - Added UNREG and TOBEUNREG to the state of bpf_struct_ops_map.

   - bpf_struct_ops_transit_state() maintains state transitions.

 - Fixed a synchronization issue.

 - Prepare kernel vdata of struct_ops during the loading phase of
   bpf_object.

 - Merged previous patch 3 into patch 1.

v6: https://lore.kernel.org/all/20230310043812.3087672-1-kuifeng@meta.com/
v5: https://lore.kernel.org/all/20230308005050.255859-1-kuifeng@meta.com/
v4: https://lore.kernel.org/all/20230307232913.576893-1-andrii@kernel.org/
v3: https://lore.kernel.org/all/20230303012122.852654-1-kuifeng@meta.com/
v2: https://lore.kernel.org/bpf/20230223011238.12313-1-kuifeng@meta.com/
v1: https://lore.kernel.org/bpf/20230214221718.503964-1-kuifeng@meta.com/

Kui-Feng Lee (8):
  bpf: Retire the struct_ops map kvalue->refcnt.
  net: Update an existing TCP congestion control algorithm.
  bpf: Create links for BPF struct_ops maps.
  libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
  bpf: Update the struct_ops of a bpf_link.
  libbpf: Update a bpf_link with another struct_ops.
  libbpf: Use .struct_ops.link section to indicate a struct_ops with a
    link.
  selftests/bpf: Test switching TCP Congestion Control algorithms.

 include/linux/bpf.h                           |  10 +
 include/net/tcp.h                             |   3 +
 include/uapi/linux/bpf.h                      |  20 +-
 kernel/bpf/bpf_struct_ops.c                   | 245 +++++++++++++++---
 kernel/bpf/syscall.c                          |  49 +++-
 net/bpf/bpf_dummy_struct_ops.c                |   6 +
 net/ipv4/bpf_tcp_ca.c                         |  14 +-
 net/ipv4/tcp_cong.c                           |  60 ++++-
 tools/include/uapi/linux/bpf.h                |  20 +-
 tools/lib/bpf/libbpf.c                        | 180 ++++++++++---
 tools/lib/bpf/libbpf.h                        |   1 +
 tools/lib/bpf/libbpf.map                      |   1 +
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     |  91 +++++++
 .../selftests/bpf/progs/tcp_ca_update.c       |  80 ++++++
 14 files changed, 685 insertions(+), 95 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/progs/tcp_ca_update.c

-- 
2.34.1



* [PATCH bpf-next v7 1/8] bpf: Retire the struct_ops map kvalue->refcnt.
  2023-03-16  2:36 [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
@ 2023-03-16  2:36 ` Kui-Feng Lee
  2023-03-17 16:47   ` Martin KaFai Lau
  2023-03-16  2:36 ` [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-16  2:36 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf; +Cc: Kui-Feng Lee

We have retired kvalue->refcnt and instead rely on synchronize_rcu()
to wait for an RCU grace period.

Maintaining kvalue->refcnt was a complicated task, as we had to keep
track of two reference counts simultaneously: kvalue->refcnt itself
and the reference count of the bpf_map. When kvalue->refcnt reached
zero, we also had to decrease the reference count on the bpf_map;
yet these steps were not performed atomically and required us to be
vigilant when managing them. By eliminating kvalue->refcnt, the
maintenance becomes more straightforward, as the refcount of the
bpf_map is now the only one to manage.

To prevent the trampoline image of a struct_ops from being released
while it is still in use, we wait for an RCU grace period. The
setsockopt(TCP_CONGESTION, "...") command allows changing a socket's
congestion control algorithm, which can result in releasing the old
struct_ops implementation. That alone is fine; however, this
function is also exposed through bpf_setsockopt(), so it may be
invoked by BPF programs as well. To ensure that the trampoline image
belonging to a struct_ops can be safely called while one of its
methods is in use, the trampoline safeguards the BPF program with
rcu_read_lock(). Doing so prevents destruction of the associated
images before returning from the trampoline, and requires us to wait
for an RCU grace period.
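
For illustration only, here is a hypothetical BPF congestion control
method that runs into the situation described above; the program and
the fallback name are made up, in the style of the bpf_dctcp selftest:

   SEC("struct_ops/bpf_cc_init")
   void BPF_PROG(bpf_cc_init, struct sock *sk)
   {
           char cubic[] = "cubic";

           /* Switching the socket away from this very algorithm may
            * drop the last reference on the running struct_ops.  The
            * grace period waited for in bpf_struct_ops_map_free()
            * keeps the trampoline image alive until this method
            * returns.
            */
           bpf_setsockopt(sk, SOL_TCP, TCP_CONGESTION,
                          cubic, sizeof(cubic));
   }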

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf.h         |  1 +
 kernel/bpf/bpf_struct_ops.c | 75 +++++++++++++++++++++----------------
 kernel/bpf/syscall.c        |  6 ++-
 3 files changed, 47 insertions(+), 35 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 71cc92a4ba48..f4f923c19692 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1943,6 +1943,7 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd);
 struct bpf_map *__bpf_map_get(struct fd f);
 void bpf_map_inc(struct bpf_map *map);
 void bpf_map_inc_with_uref(struct bpf_map *map);
+struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref);
 struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
 void bpf_map_put_with_uref(struct bpf_map *map);
 void bpf_map_put(struct bpf_map *map);
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 38903fb52f98..2a854e9cee52 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -58,6 +58,8 @@ struct bpf_struct_ops_map {
 	struct bpf_struct_ops_value kvalue;
 };
 
+static DEFINE_MUTEX(update_mutex);
+
 #define VALUE_PREFIX "bpf_struct_ops_"
 #define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
 
@@ -249,6 +251,7 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
 	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
 	struct bpf_struct_ops_value *uvalue, *kvalue;
 	enum bpf_struct_ops_state state;
+	s64 refcnt;
 
 	if (unlikely(*(u32 *)key != 0))
 		return -ENOENT;
@@ -267,7 +270,14 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
 	uvalue = value;
 	memcpy(uvalue, st_map->uvalue, map->value_size);
 	uvalue->state = state;
-	refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
+
+	/* This value offers the user space a general estimate of how
+	 * many sockets are still utilizing this struct_ops for TCP
+	 * congestion control. The number might not be exact, but it
+	 * should sufficiently meet our present goals.
+	 */
+	refcnt = atomic64_read(&map->refcnt) - atomic64_read(&map->usercnt);
+	refcount_set(&uvalue->refcnt, max_t(s64, refcnt, 0));
 
 	return 0;
 }
@@ -491,7 +501,6 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 		*(unsigned long *)(udata + moff) = prog->aux->id;
 	}
 
-	refcount_set(&kvalue->refcnt, 1);
 	bpf_map_inc(map);
 
 	set_memory_rox((long)st_map->image, 1);
@@ -536,8 +545,7 @@ static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
 	switch (prev_state) {
 	case BPF_STRUCT_OPS_STATE_INUSE:
 		st_map->st_ops->unreg(&st_map->kvalue.data);
-		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
-			bpf_map_put(map);
+		bpf_map_put(map);
 		return 0;
 	case BPF_STRUCT_OPS_STATE_TOBEFREE:
 		return -EINPROGRESS;
@@ -570,7 +578,7 @@ static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
 	kfree(value);
 }
 
-static void bpf_struct_ops_map_free(struct bpf_map *map)
+static void bpf_struct_ops_map_free_nosync(struct bpf_map *map)
 {
 	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
 
@@ -582,6 +590,25 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 	bpf_map_area_free(st_map);
 }
 
+static void bpf_struct_ops_map_free(struct bpf_map *map)
+{
+	/* The struct_ops's function may switch to another struct_ops.
+	 *
+	 * For example, bpf_tcp_cc_x->init() may switch to
+	 * another tcp_cc_y by calling
+	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
+	 * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
+	 * and its refcount may reach 0 which then free its
+	 * trampoline image while tcp_cc_x is still running.
+	 *
+	 * Thus, a rcu grace period is needed here.
+	 */
+	synchronize_rcu();
+	synchronize_rcu_tasks();
+
+	bpf_struct_ops_map_free_nosync(map);
+}
+
 static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
 {
 	if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
@@ -630,7 +657,7 @@ static struct bpf_map *bpf_struct_ops_map_alloc(union bpf_attr *attr)
 				   NUMA_NO_NODE);
 	st_map->image = bpf_jit_alloc_exec(PAGE_SIZE);
 	if (!st_map->uvalue || !st_map->links || !st_map->image) {
-		bpf_struct_ops_map_free(map);
+		bpf_struct_ops_map_free_nosync(map);
 		return ERR_PTR(-ENOMEM);
 	}
 
@@ -676,41 +703,23 @@ const struct bpf_map_ops bpf_struct_ops_map_ops = {
 bool bpf_struct_ops_get(const void *kdata)
 {
 	struct bpf_struct_ops_value *kvalue;
+	struct bpf_struct_ops_map *st_map;
+	struct bpf_map *map;
 
 	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
+	st_map = container_of(kvalue, struct bpf_struct_ops_map, kvalue);
 
-	return refcount_inc_not_zero(&kvalue->refcnt);
-}
-
-static void bpf_struct_ops_put_rcu(struct rcu_head *head)
-{
-	struct bpf_struct_ops_map *st_map;
-
-	st_map = container_of(head, struct bpf_struct_ops_map, rcu);
-	bpf_map_put(&st_map->map);
+	map = __bpf_map_inc_not_zero(&st_map->map, false);
+	return !IS_ERR(map);
 }
 
 void bpf_struct_ops_put(const void *kdata)
 {
 	struct bpf_struct_ops_value *kvalue;
+	struct bpf_struct_ops_map *st_map;
 
 	kvalue = container_of(kdata, struct bpf_struct_ops_value, data);
-	if (refcount_dec_and_test(&kvalue->refcnt)) {
-		struct bpf_struct_ops_map *st_map;
-
-		st_map = container_of(kvalue, struct bpf_struct_ops_map,
-				      kvalue);
-		/* The struct_ops's function may switch to another struct_ops.
-		 *
-		 * For example, bpf_tcp_cc_x->init() may switch to
-		 * another tcp_cc_y by calling
-		 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
-		 * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
-		 * and its map->refcnt may reach 0 which then free its
-		 * trampoline image while tcp_cc_x is still running.
-		 *
-		 * Thus, a rcu grace period is needed here.
-		 */
-		call_rcu(&st_map->rcu, bpf_struct_ops_put_rcu);
-	}
+	st_map = container_of(kvalue, struct bpf_struct_ops_map, kvalue);
+
+	bpf_map_put(&st_map->map);
 }
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5b88301a2ae0..8f6eb22f53a7 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1303,8 +1303,10 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd)
 	return map;
 }
 
-/* map_idr_lock should have been held */
-static struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
+/* map_idr_lock should have been held or the map should have been
+ * protected by rcu read lock.
+ */
+struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref)
 {
 	int refold;
 
-- 
2.34.1



* [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm.
  2023-03-16  2:36 [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
  2023-03-16  2:36 ` [PATCH bpf-next v7 1/8] bpf: Retire the struct_ops map kvalue->refcnt Kui-Feng Lee
@ 2023-03-16  2:36 ` Kui-Feng Lee
  2023-03-17 15:23   ` Daniel Borkmann
  2023-03-16  2:36 ` [PATCH bpf-next v7 3/8] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-16  2:36 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf
  Cc: Kui-Feng Lee, netdev, Eric Dumazet

This feature lets you immediately transition to another congestion
control algorithm or implementation registered under the same name.
Once a name is updated, new connections will use the new algorithm.

The purpose is to update a customized algorithm implemented in BPF
struct_ops with a new version on the fly. The following is an
example of using the user-space API implemented in later BPF
patches.

   link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
   .......
   err = bpf_link__update_map(link, skel->maps.ca_update_2);

We first load and register an algorithm implemented in BPF struct_ops,
then swap it out with a new one using the same name. After that, newly
created connections will apply the updated algorithm, while older ones
retain the previous version already applied.

This patch also takes the opportunity to refactor the ca validation
into a new tcp_validate_congestion_control() function.

Cc: netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com>
Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/net/tcp.h   |  3 +++
 net/ipv4/tcp_cong.c | 60 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1e..2abb755e6a3a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1117,6 +1117,9 @@ struct tcp_congestion_ops {
 
 int tcp_register_congestion_control(struct tcp_congestion_ops *type);
 void tcp_unregister_congestion_control(struct tcp_congestion_ops *type);
+int tcp_update_congestion_control(struct tcp_congestion_ops *type,
+				  struct tcp_congestion_ops *old_type);
+int tcp_validate_congestion_control(struct tcp_congestion_ops *ca);
 
 void tcp_assign_congestion_control(struct sock *sk);
 void tcp_init_congestion_control(struct sock *sk);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index db8b4b488c31..c90791ae8389 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -75,14 +75,8 @@ struct tcp_congestion_ops *tcp_ca_find_key(u32 key)
 	return NULL;
 }
 
-/*
- * Attach new congestion control algorithm to the list
- * of available options.
- */
-int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
+int tcp_validate_congestion_control(struct tcp_congestion_ops *ca)
 {
-	int ret = 0;
-
 	/* all algorithms must implement these */
 	if (!ca->ssthresh || !ca->undo_cwnd ||
 	    !(ca->cong_avoid || ca->cong_control)) {
@@ -90,6 +84,20 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
 		return -EINVAL;
 	}
 
+	return 0;
+}
+
+/* Attach new congestion control algorithm to the list
+ * of available options.
+ */
+int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
+{
+	int ret;
+
+	ret = tcp_validate_congestion_control(ca);
+	if (ret)
+		return ret;
+
 	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
 
 	spin_lock(&tcp_cong_list_lock);
@@ -130,6 +138,44 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
 }
 EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
 
+/* Replace a registered old ca with a new one.
+ *
+ * The new ca must have the same name as the old one, that has been
+ * registered.
+ */
+int tcp_update_congestion_control(struct tcp_congestion_ops *ca, struct tcp_congestion_ops *old_ca)
+{
+	struct tcp_congestion_ops *existing;
+	int ret;
+
+	ret = tcp_validate_congestion_control(ca);
+	if (ret)
+		return ret;
+
+	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
+
+	spin_lock(&tcp_cong_list_lock);
+	existing = tcp_ca_find_key(old_ca->key);
+	if (ca->key == TCP_CA_UNSPEC || !existing || strcmp(existing->name, ca->name)) {
+		pr_notice("%s not registered or non-unique key\n",
+			  ca->name);
+		ret = -EINVAL;
+	} else if (existing != old_ca) {
+		pr_notice("invalid old congestion control algorithm to replace\n");
+		ret = -EINVAL;
+	} else {
+		/* Add the new one before removing the old one to keep
+		 * one implementation available all the time.
+		 */
+		list_add_tail_rcu(&ca->list, &tcp_cong_list);
+		list_del_rcu(&existing->list);
+		pr_debug("%s updated\n", ca->name);
+	}
+	spin_unlock(&tcp_cong_list_lock);
+
+	return ret;
+}
+
 u32 tcp_ca_get_key_by_name(struct net *net, const char *name, bool *ecn_ca)
 {
 	const struct tcp_congestion_ops *ca;
-- 
2.34.1



* [PATCH bpf-next v7 3/8] bpf: Create links for BPF struct_ops maps.
  2023-03-16  2:36 [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
  2023-03-16  2:36 ` [PATCH bpf-next v7 1/8] bpf: Retire the struct_ops map kvalue->refcnt Kui-Feng Lee
  2023-03-16  2:36 ` [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
@ 2023-03-16  2:36 ` Kui-Feng Lee
  2023-03-17 18:10   ` Martin KaFai Lau
  2023-03-16  2:36 ` [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-16  2:36 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf; +Cc: Kui-Feng Lee

Make bpf_link support struct_ops. Previously, struct_ops were always
used alone, without any associated link. Upon updating its value, a
struct_ops would be activated automatically, yet other BPF program
types require a bpf_link to be created with their instances before
they can become active. Now you can create an inactive struct_ops
and create a link later to activate it.

With bpf_links, struct_ops behaves similarly to other BPF program
types. You can pin/unpin a struct_ops through its link, and the
struct_ops is deactivated when its link is removed, whereas
previously someone had to delete the value for it to be deactivated.

bpf_links are responsible for registering their associated
struct_ops. A bpf_link can only be created from a struct_ops whose
map has the BPF_F_LINK flag set; a struct_ops without this flag
behaves in the same manner as before and is registered upon updating
its value.

BPF_LINK_TYPE_STRUCT_OPS serves a dual purpose. Not only is it used
to craft the links for BPF struct_ops programs, but also to create
links for BPF struct_ops themselves. Since the links of BPF
struct_ops programs are only used to create trampolines internally,
they are never seen in other contexts; thus, the link type can be
reused for struct_ops themselves.

To maintain a reference to the map backing this link, we add
bpf_struct_ops_link as an additional link type. The map pointer is
RCU-protected, although that protection will not be needed until
later in the patchset.
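
As a rough sketch of the intended usage at the syscall level, through
libbpf's thin wrapper (map_fd is assumed to refer to a struct_ops map
created with BPF_F_LINK whose value has already been updated):

   int link_fd;

   /* Creating the link is what actually registers the struct_ops;
    * the map must be in the READY state, i.e. its value update has
    * already succeeded.
    */
   link_fd = bpf_link_create(map_fd, -1, BPF_STRUCT_OPS, NULL);
   if (link_fd < 0)
           return link_fd;

   /* Closing the last reference to the link unregisters the
    * struct_ops.
    */
   close(link_fd);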

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf.h            |   7 ++
 include/uapi/linux/bpf.h       |  12 ++-
 kernel/bpf/bpf_struct_ops.c    | 144 ++++++++++++++++++++++++++++++++-
 kernel/bpf/syscall.c           |  23 ++++--
 net/ipv4/bpf_tcp_ca.c          |   8 +-
 tools/include/uapi/linux/bpf.h |  12 ++-
 6 files changed, 191 insertions(+), 15 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index f4f923c19692..455b14bf8f28 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1516,6 +1516,7 @@ struct bpf_struct_ops {
 			   void *kdata, const void *udata);
 	int (*reg)(void *kdata);
 	void (*unreg)(void *kdata);
+	int (*validate)(void *kdata);
 	const struct btf_type *type;
 	const struct btf_type *value_type;
 	const char *name;
@@ -1550,6 +1551,7 @@ static inline void bpf_module_put(const void *data, struct module *owner)
 	else
 		module_put(owner);
 }
+int bpf_struct_ops_link_create(union bpf_attr *attr);
 
 #ifdef CONFIG_NET
 /* Define it here to avoid the use of forward declaration */
@@ -1590,6 +1592,11 @@ static inline int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map,
 {
 	return -EINVAL;
 }
+static inline int bpf_struct_ops_link_create(union bpf_attr *attr)
+{
+	return -EOPNOTSUPP;
+}
+
 #endif
 
 #if defined(CONFIG_CGROUP_BPF) && defined(CONFIG_BPF_LSM)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 13129df937cd..42f40ee083bf 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1033,6 +1033,7 @@ enum bpf_attach_type {
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
 	BPF_LSM_CGROUP,
+	BPF_STRUCT_OPS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1266,6 +1267,9 @@ enum {
 
 /* Create a map that is suitable to be an inner map with dynamic max entries */
 	BPF_F_INNER_MAP		= (1U << 12),
+
+/* Create a map that will be registered/unregistered by the backed bpf_link */
+	BPF_F_LINK		= (1U << 13),
 };
 
 /* Flags for BPF_PROG_QUERY. */
@@ -1507,7 +1511,10 @@ union bpf_attr {
 	} task_fd_query;
 
 	struct { /* struct used by BPF_LINK_CREATE command */
-		__u32		prog_fd;	/* eBPF program to attach */
+		union {
+			__u32		prog_fd;	/* eBPF program to attach */
+			__u32		map_fd;		/* struct_ops to attach */
+		};
 		union {
 			__u32		target_fd;	/* object to attach to */
 			__u32		target_ifindex; /* target ifindex */
@@ -6379,6 +6386,9 @@ struct bpf_link_info {
 		struct {
 			__u32 ifindex;
 		} xdp;
+		struct {
+			__u32 map_id;
+		} struct_ops;
 	};
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 2a854e9cee52..8ce6c7581ca3 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -16,6 +16,7 @@ enum bpf_struct_ops_state {
 	BPF_STRUCT_OPS_STATE_INIT,
 	BPF_STRUCT_OPS_STATE_INUSE,
 	BPF_STRUCT_OPS_STATE_TOBEFREE,
+	BPF_STRUCT_OPS_STATE_READY,
 };
 
 #define BPF_STRUCT_OPS_COMMON_VALUE			\
@@ -58,6 +59,11 @@ struct bpf_struct_ops_map {
 	struct bpf_struct_ops_value kvalue;
 };
 
+struct bpf_struct_ops_link {
+	struct bpf_link link;
+	struct bpf_map __rcu *map;
+};
+
 static DEFINE_MUTEX(update_mutex);
 
 #define VALUE_PREFIX "bpf_struct_ops_"
@@ -501,11 +507,31 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 		*(unsigned long *)(udata + moff) = prog->aux->id;
 	}
 
-	bpf_map_inc(map);
+	if (st_map->map.map_flags & BPF_F_LINK) {
+		if (st_ops->validate) {
+			err = st_ops->validate(kdata);
+			if (err)
+				goto reset_unlock;
+		}
+		set_memory_rox((long)st_map->image, 1);
+		/* Let bpf_link handle registration & unregistration.
+		 *
+		 * Pair with smp_load_acquire() during lookup_elem().
+		 */
+		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_READY);
+		goto unlock;
+	}
 
 	set_memory_rox((long)st_map->image, 1);
 	err = st_ops->reg(kdata);
 	if (likely(!err)) {
+		/* This refcnt increment on the map here after
+		 * 'st_ops->reg()' is safe since the state of the
+		 * map must be set to INIT at this moment, and thus
+		 * bpf_struct_ops_map_delete_elem() can't unregister
+		 * or transition it to TOBEFREE concurrently.
+		 */
+		bpf_map_inc(map);
 		/* Pair with smp_load_acquire() during lookup_elem().
 		 * It ensures the above udata updates (e.g. prog->aux->id)
 		 * can be seen once BPF_STRUCT_OPS_STATE_INUSE is set.
@@ -521,7 +547,6 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
 	 */
 	set_memory_nx((long)st_map->image, 1);
 	set_memory_rw((long)st_map->image, 1);
-	bpf_map_put(map);
 
 reset_unlock:
 	bpf_struct_ops_map_put_progs(st_map);
@@ -539,6 +564,9 @@ static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
 	struct bpf_struct_ops_map *st_map;
 
 	st_map = (struct bpf_struct_ops_map *)map;
+	if (st_map->map.map_flags & BPF_F_LINK)
+		return -EOPNOTSUPP;
+
 	prev_state = cmpxchg(&st_map->kvalue.state,
 			     BPF_STRUCT_OPS_STATE_INUSE,
 			     BPF_STRUCT_OPS_STATE_TOBEFREE);
@@ -612,7 +640,7 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
 static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
 {
 	if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
-	    attr->map_flags || !attr->btf_vmlinux_value_type_id)
+	    (attr->map_flags & ~BPF_F_LINK) || !attr->btf_vmlinux_value_type_id)
 		return -EINVAL;
 	return 0;
 }
@@ -723,3 +751,113 @@ void bpf_struct_ops_put(const void *kdata)
 
 	bpf_map_put(&st_map->map);
 }
+
+static bool bpf_struct_ops_valid_to_reg(struct bpf_map *map)
+{
+	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
+
+	return map->map_type == BPF_MAP_TYPE_STRUCT_OPS &&
+		map->map_flags & BPF_F_LINK &&
+		/* Pair with smp_store_release() during map_update */
+		smp_load_acquire(&st_map->kvalue.state) == BPF_STRUCT_OPS_STATE_READY;
+}
+
+static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
+{
+	struct bpf_struct_ops_link *st_link;
+	struct bpf_struct_ops_map *st_map;
+
+	st_link = container_of(link, struct bpf_struct_ops_link, link);
+	st_map = (struct bpf_struct_ops_map *)
+		rcu_dereference_protected(st_link->map, true);
+	if (st_map) {
+		/* st_link->map can be NULL if
+		 * bpf_struct_ops_link_create() fails to register.
+		 */
+		st_map->st_ops->unreg(&st_map->kvalue.data);
+		bpf_map_put(&st_map->map);
+	}
+	kfree(st_link);
+}
+
+static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
+					    struct seq_file *seq)
+{
+	struct bpf_struct_ops_link *st_link;
+	struct bpf_map *map;
+
+	st_link = container_of(link, struct bpf_struct_ops_link, link);
+	rcu_read_lock();
+	map = rcu_dereference(st_link->map);
+	seq_printf(seq, "map_id:\t%d\n", map->id);
+	rcu_read_unlock();
+}
+
+static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
+					       struct bpf_link_info *info)
+{
+	struct bpf_struct_ops_link *st_link;
+	struct bpf_map *map;
+
+	st_link = container_of(link, struct bpf_struct_ops_link, link);
+	rcu_read_lock();
+	map = rcu_dereference(st_link->map);
+	info->struct_ops.map_id = map->id;
+	rcu_read_unlock();
+	return 0;
+}
+
+static const struct bpf_link_ops bpf_struct_ops_map_lops = {
+	.dealloc = bpf_struct_ops_map_link_dealloc,
+	.show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
+	.fill_link_info = bpf_struct_ops_map_link_fill_link_info,
+};
+
+int bpf_struct_ops_link_create(union bpf_attr *attr)
+{
+	struct bpf_struct_ops_link *link = NULL;
+	struct bpf_link_primer link_primer;
+	struct bpf_struct_ops_map *st_map;
+	struct bpf_map *map;
+	int err;
+
+	map = bpf_map_get(attr->link_create.map_fd);
+	if (!map)
+		return -EINVAL;
+
+	st_map = (struct bpf_struct_ops_map *)map;
+
+	if (!bpf_struct_ops_valid_to_reg(map)) {
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	link = kzalloc(sizeof(*link), GFP_USER);
+	if (!link) {
+		err = -ENOMEM;
+		goto err_out;
+	}
+	bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
+	RCU_INIT_POINTER(link->map, map);
+
+	err = bpf_link_prime(&link->link, &link_primer);
+	if (err)
+		goto err_out;
+
+	err = st_map->st_ops->reg(st_map->kvalue.data);
+	if (err) {
+		/* No RCU since no one has a chance to read this pointer yet. */
+		RCU_INIT_POINTER(link->map, NULL);
+		bpf_link_cleanup(&link_primer);
+		link = NULL;
+		goto err_out;
+	}
+
+	return bpf_link_settle(&link_primer);
+
+err_out:
+	bpf_map_put(map);
+	kfree(link);
+	return err;
+}
+
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8f6eb22f53a7..5a45e3bf34e2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2824,16 +2824,19 @@ static void bpf_link_show_fdinfo(struct seq_file *m, struct file *filp)
 	const struct bpf_prog *prog = link->prog;
 	char prog_tag[sizeof(prog->tag) * 2 + 1] = { };
 
-	bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
 	seq_printf(m,
 		   "link_type:\t%s\n"
-		   "link_id:\t%u\n"
-		   "prog_tag:\t%s\n"
-		   "prog_id:\t%u\n",
+		   "link_id:\t%u\n",
 		   bpf_link_type_strs[link->type],
-		   link->id,
-		   prog_tag,
-		   prog->aux->id);
+		   link->id);
+	if (prog) {
+		bin2hex(prog_tag, prog->tag, sizeof(prog->tag));
+		seq_printf(m,
+			   "prog_tag:\t%s\n"
+			   "prog_id:\t%u\n",
+			   prog_tag,
+			   prog->aux->id);
+	}
 	if (link->ops->show_fdinfo)
 		link->ops->show_fdinfo(link, m);
 }
@@ -4308,7 +4311,8 @@ static int bpf_link_get_info_by_fd(struct file *file,
 
 	info.type = link->type;
 	info.id = link->id;
-	info.prog_id = link->prog->aux->id;
+	if (link->prog)
+		info.prog_id = link->prog->aux->id;
 
 	if (link->ops->fill_link_info) {
 		err = link->ops->fill_link_info(link, &info);
@@ -4571,6 +4575,9 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	if (CHECK_ATTR(BPF_LINK_CREATE))
 		return -EINVAL;
 
+	if (attr->link_create.attach_type == BPF_STRUCT_OPS)
+		return bpf_struct_ops_link_create(attr);
+
 	prog = bpf_prog_get(attr->link_create.prog_fd);
 	if (IS_ERR(prog))
 		return PTR_ERR(prog);
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 13fc0c185cd9..bbbd5eb94db2 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -239,8 +239,6 @@ static int bpf_tcp_ca_init_member(const struct btf_type *t,
 		if (bpf_obj_name_cpy(tcp_ca->name, utcp_ca->name,
 				     sizeof(tcp_ca->name)) <= 0)
 			return -EINVAL;
-		if (tcp_ca_find(utcp_ca->name))
-			return -EEXIST;
 		return 1;
 	}
 
@@ -266,6 +264,11 @@ static void bpf_tcp_ca_unreg(void *kdata)
 	tcp_unregister_congestion_control(kdata);
 }
 
+static int bpf_tcp_ca_validate(void *kdata)
+{
+	return tcp_validate_congestion_control(kdata);
+}
+
 struct bpf_struct_ops bpf_tcp_congestion_ops = {
 	.verifier_ops = &bpf_tcp_ca_verifier_ops,
 	.reg = bpf_tcp_ca_reg,
@@ -273,6 +276,7 @@ struct bpf_struct_ops bpf_tcp_congestion_ops = {
 	.check_member = bpf_tcp_ca_check_member,
 	.init_member = bpf_tcp_ca_init_member,
 	.init = bpf_tcp_ca_init,
+	.validate = bpf_tcp_ca_validate,
 	.name = "tcp_congestion_ops",
 };
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 13129df937cd..9cf1deaf21f2 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1033,6 +1033,7 @@ enum bpf_attach_type {
 	BPF_PERF_EVENT,
 	BPF_TRACE_KPROBE_MULTI,
 	BPF_LSM_CGROUP,
+	BPF_STRUCT_OPS,
 	__MAX_BPF_ATTACH_TYPE
 };
 
@@ -1266,6 +1267,9 @@ enum {
 
 /* Create a map that is suitable to be an inner map with dynamic max entries */
 	BPF_F_INNER_MAP		= (1U << 12),
+
+/* Create a map that will be registered/unregistered by the backed bpf_link */
+	BPF_F_LINK		= (1U << 13),
 };
 
 /* Flags for BPF_PROG_QUERY. */
@@ -1507,7 +1511,10 @@ union bpf_attr {
 	} task_fd_query;
 
 	struct { /* struct used by BPF_LINK_CREATE command */
-		__u32		prog_fd;	/* eBPF program to attach */
+		union {
+			__u32		prog_fd;	/* eBPF program to attach */
+			__u32		map_fd;		/* eBPF struct_ops to attach */
+		};
 		union {
 			__u32		target_fd;	/* object to attach to */
 			__u32		target_ifindex; /* target ifindex */
@@ -6379,6 +6386,9 @@ struct bpf_link_info {
 		struct {
 			__u32 ifindex;
 		} xdp;
+		struct {
+			__u32 map_id;
+		} struct_ops;
 	};
 } __attribute__((aligned(8)));
 
-- 
2.34.1



* [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
  2023-03-16  2:36 [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
                   ` (2 preceding siblings ...)
  2023-03-16  2:36 ` [PATCH bpf-next v7 3/8] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
@ 2023-03-16  2:36 ` Kui-Feng Lee
  2023-03-17 18:44   ` Martin KaFai Lau
  2023-03-17 22:23   ` Andrii Nakryiko
  2023-03-16  2:36 ` [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-16  2:36 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf; +Cc: Kui-Feng Lee

bpf_map__attach_struct_ops() used to create a dummy bpf_link as a
placeholder; now it constructs an authentic one by calling
bpf_link_create() if the map has the BPF_F_LINK flag set.

You can flag a struct_ops map with BPF_F_LINK by calling
bpf_map__set_map_flags().
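
For example (a sketch; the skeleton name is hypothetical and error
handling is elided):

   struct tcp_ca_update *skel;
   struct bpf_link *link;

   skel = tcp_ca_update__open();

   /* Must be done before load: with BPF_F_LINK set, attaching the
    * map creates a real kernel bpf_link instead of the placeholder.
    */
   bpf_map__set_map_flags(skel->maps.ca_update_1,
                          bpf_map__map_flags(skel->maps.ca_update_1) |
                          BPF_F_LINK);

   tcp_ca_update__load(skel);
   link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);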

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 tools/lib/bpf/libbpf.c | 90 +++++++++++++++++++++++++++++++-----------
 1 file changed, 66 insertions(+), 24 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index a557718401e4..6dbae7ffab48 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -116,6 +116,7 @@ static const char * const attach_type_name[] = {
 	[BPF_SK_REUSEPORT_SELECT_OR_MIGRATE]	= "sk_reuseport_select_or_migrate",
 	[BPF_PERF_EVENT]		= "perf_event",
 	[BPF_TRACE_KPROBE_MULTI]	= "trace_kprobe_multi",
+	[BPF_STRUCT_OPS]		= "struct_ops",
 };
 
 static const char * const link_type_name[] = {
@@ -7677,6 +7678,37 @@ static int bpf_object__resolve_externs(struct bpf_object *obj,
 	return 0;
 }
 
+static void bpf_map_prepare_vdata(const struct bpf_map *map)
+{
+	struct bpf_struct_ops *st_ops;
+	__u32 i;
+
+	st_ops = map->st_ops;
+	for (i = 0; i < btf_vlen(st_ops->type); i++) {
+		struct bpf_program *prog = st_ops->progs[i];
+		void *kern_data;
+		int prog_fd;
+
+		if (!prog)
+			continue;
+
+		prog_fd = bpf_program__fd(prog);
+		kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
+		*(unsigned long *)kern_data = prog_fd;
+	}
+}
+
+static int bpf_object_prepare_struct_ops(struct bpf_object *obj)
+{
+	int i;
+
+	for (i = 0; i < obj->nr_maps; i++)
+		if (bpf_map__is_struct_ops(&obj->maps[i]))
+			bpf_map_prepare_vdata(&obj->maps[i]);
+
+	return 0;
+}
+
 static int bpf_object_load(struct bpf_object *obj, int extra_log_level, const char *target_btf_path)
 {
 	int err, i;
@@ -7702,6 +7734,7 @@ static int bpf_object_load(struct bpf_object *obj, int extra_log_level, const char *target_btf_path)
 	err = err ? : bpf_object__relocate(obj, obj->btf_custom_path ? : target_btf_path);
 	err = err ? : bpf_object__load_progs(obj, extra_log_level);
 	err = err ? : bpf_object_init_prog_arrays(obj);
+	err = err ? : bpf_object_prepare_struct_ops(obj);
 
 	if (obj->gen_loader) {
 		/* reset FDs */
@@ -11566,22 +11599,30 @@ struct bpf_link *bpf_program__attach(const struct bpf_program *prog)
 	return link;
 }
 
+struct bpf_link_struct_ops {
+	struct bpf_link link;
+	int map_fd;
+};
+
 static int bpf_link__detach_struct_ops(struct bpf_link *link)
 {
+	struct bpf_link_struct_ops *st_link;
 	__u32 zero = 0;
 
-	if (bpf_map_delete_elem(link->fd, &zero))
-		return -errno;
+	st_link = container_of(link, struct bpf_link_struct_ops, link);
 
-	return 0;
+	if (st_link->map_fd < 0)
+		/* w/o a real link */
+		return bpf_map_delete_elem(link->fd, &zero);
+
+	return close(link->fd);
 }
 
 struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
 {
-	struct bpf_struct_ops *st_ops;
-	struct bpf_link *link;
-	__u32 i, zero = 0;
-	int err;
+	struct bpf_link_struct_ops *link;
+	__u32 zero = 0;
+	int err, fd;
 
 	if (!bpf_map__is_struct_ops(map) || map->fd == -1)
 		return libbpf_err_ptr(-EINVAL);
@@ -11590,31 +11631,32 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
 	if (!link)
 		return libbpf_err_ptr(-EINVAL);
 
-	st_ops = map->st_ops;
-	for (i = 0; i < btf_vlen(st_ops->type); i++) {
-		struct bpf_program *prog = st_ops->progs[i];
-		void *kern_data;
-		int prog_fd;
+	/* kern_vdata should be prepared during the loading phase. */
+	err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
+	if (err) {
+		free(link);
+		return libbpf_err_ptr(err);
+	}
 
-		if (!prog)
-			continue;
+	link->link.detach = bpf_link__detach_struct_ops;
 
-		prog_fd = bpf_program__fd(prog);
-		kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
-		*(unsigned long *)kern_data = prog_fd;
+	if (!(map->def.map_flags & BPF_F_LINK)) {
+		/* w/o a real link */
+		link->link.fd = map->fd;
+		link->map_fd = -1;
+		return &link->link;
 	}
 
-	err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
-	if (err) {
-		err = -errno;
+	fd = bpf_link_create(map->fd, -1, BPF_STRUCT_OPS, NULL);
+	if (fd < 0) {
 		free(link);
-		return libbpf_err_ptr(err);
+		return libbpf_err_ptr(fd);
 	}
 
-	link->detach = bpf_link__detach_struct_ops;
-	link->fd = map->fd;
+	link->link.fd = fd;
+	link->map_fd = map->fd;
 
-	return link;
+	return &link->link;
 }
 
 typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
-- 
2.34.1



* [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link.
  2023-03-16  2:36 [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
                   ` (3 preceding siblings ...)
  2023-03-16  2:36 ` [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
@ 2023-03-16  2:36 ` Kui-Feng Lee
  2023-03-17 19:23   ` Martin KaFai Lau
  2023-03-17 22:27   ` Andrii Nakryiko
  2023-03-16  2:36 ` [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-16  2:36 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf; +Cc: Kui-Feng Lee

Improve the BPF_LINK_UPDATE command of bpf() to allow switching
conveniently between different struct_ops maps on a single bpf_link,
enabling smooth transitions from one struct_ops to another.

The struct_ops maps passed along with BPF_LINK_UPDATE must have the
BPF_F_LINK flag set.
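
At the syscall level, the update can be driven through libbpf's thin
bpf_link_update() wrapper, as in this sketch (link_fd and new_map_fd
are assumed to be valid descriptors):

   int err;

   /* new_map_fd must refer to a READY struct_ops map created with
    * BPF_F_LINK and of the same struct_ops type as the current
    * backing map; the kernel swaps it in under update_mutex.
    */
   err = bpf_link_update(link_fd, new_map_fd, NULL);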

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf.h            |  2 ++
 include/uapi/linux/bpf.h       |  8 +++++--
 kernel/bpf/bpf_struct_ops.c    | 38 ++++++++++++++++++++++++++++++++++
 kernel/bpf/syscall.c           | 20 ++++++++++++++++++
 net/bpf/bpf_dummy_struct_ops.c |  6 ++++++
 net/ipv4/bpf_tcp_ca.c          |  6 ++++++
 tools/include/uapi/linux/bpf.h |  8 +++++--
 7 files changed, 84 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 455b14bf8f28..56e6ab7559ef 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1474,6 +1474,7 @@ struct bpf_link_ops {
 	void (*show_fdinfo)(const struct bpf_link *link, struct seq_file *seq);
 	int (*fill_link_info)(const struct bpf_link *link,
 			      struct bpf_link_info *info);
+	int (*update_map)(struct bpf_link *link, struct bpf_map *new_map);
 };
 
 struct bpf_tramp_link {
@@ -1516,6 +1517,7 @@ struct bpf_struct_ops {
 			   void *kdata, const void *udata);
 	int (*reg)(void *kdata);
 	void (*unreg)(void *kdata);
+	int (*update)(void *kdata, void *old_kdata);
 	int (*validate)(void *kdata);
 	const struct btf_type *type;
 	const struct btf_type *value_type;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 42f40ee083bf..24e1dec4ad97 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1555,8 +1555,12 @@ union bpf_attr {
 
 	struct { /* struct used by BPF_LINK_UPDATE command */
 		__u32		link_fd;	/* link fd */
-		/* new program fd to update link with */
-		__u32		new_prog_fd;
+		union {
+			/* new program fd to update link with */
+			__u32		new_prog_fd;
+			/* new struct_ops map fd to update link with */
+			__u32           new_map_fd;
+		};
 		__u32		flags;		/* extra flags */
 		/* expected link's program fd; is specified only if
 		 * BPF_F_REPLACE flag is set in flags */
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 8ce6c7581ca3..5a9e10b92423 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -807,10 +807,48 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
 	return 0;
 }
 
+static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map *new_map)
+{
+	struct bpf_struct_ops_map *st_map, *old_st_map;
+	struct bpf_struct_ops_link *st_link;
+	struct bpf_map *old_map;
+	int err = 0;
+
+	st_link = container_of(link, struct bpf_struct_ops_link, link);
+	st_map = container_of(new_map, struct bpf_struct_ops_map, map);
+
+	if (!bpf_struct_ops_valid_to_reg(new_map))
+		return -EINVAL;
+
+	mutex_lock(&update_mutex);
+
+	old_map = rcu_dereference_protected(st_link->map, lockdep_is_held(&update_mutex));
+	old_st_map = container_of(old_map, struct bpf_struct_ops_map, map);
+	/* The new and old struct_ops must be the same type. */
+	if (st_map->st_ops != old_st_map->st_ops) {
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
+	if (err)
+		goto err_out;
+
+	bpf_map_inc(new_map);
+	rcu_assign_pointer(st_link->map, new_map);
+	bpf_map_put(old_map);
+
+err_out:
+	mutex_unlock(&update_mutex);
+
+	return err;
+}
+
 static const struct bpf_link_ops bpf_struct_ops_map_lops = {
 	.dealloc = bpf_struct_ops_map_link_dealloc,
 	.show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
 	.fill_link_info = bpf_struct_ops_map_link_fill_link_info,
+	.update_map = bpf_struct_ops_map_link_update,
 };
 
 int bpf_struct_ops_link_create(union bpf_attr *attr)
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5a45e3bf34e2..6fa10d108278 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4676,6 +4676,21 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
 	return ret;
 }
 
+static int link_update_map(struct bpf_link *link, union bpf_attr *attr)
+{
+	struct bpf_map *new_map;
+	int ret = 0;
+
+	new_map = bpf_map_get(attr->link_update.new_map_fd);
+	if (IS_ERR(new_map))
+		return -EINVAL;
+
+	ret = link->ops->update_map(link, new_map);
+
+	bpf_map_put(new_map);
+	return ret;
+}
+
 #define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
 
 static int link_update(union bpf_attr *attr)
@@ -4696,6 +4711,11 @@ static int link_update(union bpf_attr *attr)
 	if (IS_ERR(link))
 		return PTR_ERR(link);
 
+	if (link->ops->update_map) {
+		ret = link_update_map(link, attr);
+		goto out_put_link;
+	}
+
 	new_prog = bpf_prog_get(attr->link_update.new_prog_fd);
 	if (IS_ERR(new_prog)) {
 		ret = PTR_ERR(new_prog);
diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c
index ff4f89a2b02a..158f14e240d0 100644
--- a/net/bpf/bpf_dummy_struct_ops.c
+++ b/net/bpf/bpf_dummy_struct_ops.c
@@ -222,12 +222,18 @@ static void bpf_dummy_unreg(void *kdata)
 {
 }
 
+static int bpf_dummy_update(void *kdata, void *old_kdata)
+{
+	return -EOPNOTSUPP;
+}
+
 struct bpf_struct_ops bpf_bpf_dummy_ops = {
 	.verifier_ops = &bpf_dummy_verifier_ops,
 	.init = bpf_dummy_init,
 	.check_member = bpf_dummy_ops_check_member,
 	.init_member = bpf_dummy_init_member,
 	.reg = bpf_dummy_reg,
+	.update = bpf_dummy_update,
 	.unreg = bpf_dummy_unreg,
 	.name = "bpf_dummy_ops",
 };
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index bbbd5eb94db2..e8b27826283e 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -264,6 +264,11 @@ static void bpf_tcp_ca_unreg(void *kdata)
 	tcp_unregister_congestion_control(kdata);
 }
 
+static int bpf_tcp_ca_update(void *kdata, void *old_kdata)
+{
+	return tcp_update_congestion_control(kdata, old_kdata);
+}
+
 static int bpf_tcp_ca_validate(void *kdata)
 {
 	return tcp_validate_congestion_control(kdata);
@@ -273,6 +278,7 @@ struct bpf_struct_ops bpf_tcp_congestion_ops = {
 	.verifier_ops = &bpf_tcp_ca_verifier_ops,
 	.reg = bpf_tcp_ca_reg,
 	.unreg = bpf_tcp_ca_unreg,
+	.update = bpf_tcp_ca_update,
 	.check_member = bpf_tcp_ca_check_member,
 	.init_member = bpf_tcp_ca_init_member,
 	.init = bpf_tcp_ca_init,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 9cf1deaf21f2..d224d7adef71 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1555,8 +1555,12 @@ union bpf_attr {
 
 	struct { /* struct used by BPF_LINK_UPDATE command */
 		__u32		link_fd;	/* link fd */
-		/* new program fd to update link with */
-		__u32		new_prog_fd;
+		union {
+			/* new program fd to update link with */
+			__u32		new_prog_fd;
+			/* new struct_ops map fd to update link with */
+			__u32           new_map_fd;
+		};
 		__u32		flags;		/* extra flags */
 		/* expected link's program fd; is specified only if
 		 * BPF_F_REPLACE flag is set in flags */
-- 
2.34.1



* [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops.
  2023-03-16  2:36 [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
                   ` (4 preceding siblings ...)
  2023-03-16  2:36 ` [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
@ 2023-03-16  2:36 ` Kui-Feng Lee
  2023-03-17 19:42   ` Martin KaFai Lau
  2023-03-17 22:33   ` Andrii Nakryiko
  2023-03-16  2:36 ` [PATCH bpf-next v7 7/8] libbpf: Use .struct_ops.link section to indicate a struct_ops with a link Kui-Feng Lee
  2023-03-16  2:36 ` [PATCH bpf-next v7 8/8] selftests/bpf: Test switching TCP Congestion Control algorithms Kui-Feng Lee
  7 siblings, 2 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-16  2:36 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf; +Cc: Kui-Feng Lee

Introduce bpf_link__update_map(), which allows atomically updating
the underlying struct_ops implementation for a given struct_ops BPF
link.
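
Typical usage looks like this sketch (the skeleton and map names are
assumptions):

   link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);

   /* Atomically replace the struct_ops map backing the link.  Both
    * maps must have BPF_F_LINK set and implement the same
    * struct_ops type.
    */
   err = bpf_link__update_map(link, skel->maps.ca_update_2);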

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 tools/lib/bpf/libbpf.c   | 30 ++++++++++++++++++++++++++++++
 tools/lib/bpf/libbpf.h   |  1 +
 tools/lib/bpf/libbpf.map |  1 +
 3 files changed, 32 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 6dbae7ffab48..63ec1f8fe8a0 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -11659,6 +11659,36 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
 	return &link->link;
 }
 
+/*
+ * Swap the backing struct_ops of a link with a new struct_ops map.
+ */
+int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map)
+{
+	struct bpf_link_struct_ops *st_ops_link;
+	__u32 zero = 0;
+	int err, fd;
+
+	if (!bpf_map__is_struct_ops(map) || map->fd < 0)
+		return -EINVAL;
+
+	st_ops_link = container_of(link, struct bpf_link_struct_ops, link);
+	/* Ensure the type of a link is correct */
+	if (st_ops_link->map_fd < 0)
+		return -EINVAL;
+
+	err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
+	if (err && err != -EBUSY)
+		return err;
+
+	fd = bpf_link_update(link->fd, map->fd, NULL);
+	if (fd < 0)
+		return fd;
+
+	st_ops_link->map_fd = map->fd;
+
+	return 0;
+}
+
 typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
 							  void *private_data);
 
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index db4992a036f8..1615e55e2e79 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -719,6 +719,7 @@ bpf_program__attach_freplace(const struct bpf_program *prog,
 struct bpf_map;
 
 LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
+LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
 
 struct bpf_iter_attach_opts {
 	size_t sz; /* size of this struct for forward/backward compatibility */
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 50dde1f6521e..cc05be376257 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -387,6 +387,7 @@ LIBBPF_1.2.0 {
 	global:
 		bpf_btf_get_info_by_fd;
 		bpf_link_get_info_by_fd;
+		bpf_link__update_map;
 		bpf_map_get_info_by_fd;
 		bpf_prog_get_info_by_fd;
 } LIBBPF_1.1.0;
-- 
2.34.1



* [PATCH bpf-next v7 7/8] libbpf: Use .struct_ops.link section to indicate a struct_ops with a link.
  2023-03-16  2:36 [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
                   ` (5 preceding siblings ...)
  2023-03-16  2:36 ` [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
@ 2023-03-16  2:36 ` Kui-Feng Lee
  2023-03-17 22:35   ` Andrii Nakryiko
  2023-03-16  2:36 ` [PATCH bpf-next v7 8/8] selftests/bpf: Test switching TCP Congestion Control algorithms Kui-Feng Lee
  7 siblings, 1 reply; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-16  2:36 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf; +Cc: Kui-Feng Lee

Flag that a struct_ops is to back a bpf_link by placing it in the
".struct_ops.link" section. Once it is flagged this way, the created
struct_ops map can be used to create a bpf_link, or to update a
bpf_link that is already backed by another struct_ops.
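
In BPF program source, flagging a struct_ops this way looks like the
following sketch (the variable and callback names are placeholders):

   /* Placing the map in ".struct_ops.link" makes libbpf create it
    * with BPF_F_LINK, so it is registered via a bpf_link rather
    * than upon updating its value.
    */
   SEC(".struct_ops.link")
   struct tcp_congestion_ops ca_update_1 = {
           .init       = (void *)ca_init,
           .ssthresh   = (void *)ca_ssthresh,
           .undo_cwnd  = (void *)ca_undo_cwnd,
           .cong_avoid = (void *)ca_cong_avoid,
           .name       = "tcp_ca_update",
   };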

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 tools/lib/bpf/libbpf.c | 60 +++++++++++++++++++++++++++++++-----------
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 63ec1f8fe8a0..e16b5714f998 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -468,6 +468,7 @@ struct bpf_struct_ops {
 #define KCONFIG_SEC ".kconfig"
 #define KSYMS_SEC ".ksyms"
 #define STRUCT_OPS_SEC ".struct_ops"
+#define STRUCT_OPS_LINK_SEC ".struct_ops.link"
 
 enum libbpf_map_type {
 	LIBBPF_MAP_UNSPEC,
@@ -597,6 +598,7 @@ struct elf_state {
 	Elf64_Ehdr *ehdr;
 	Elf_Data *symbols;
 	Elf_Data *st_ops_data;
+	Elf_Data *st_ops_link_data;
 	size_t shstrndx; /* section index for section name strings */
 	size_t strtabidx;
 	struct elf_sec_desc *secs;
@@ -606,6 +608,7 @@ struct elf_state {
 	int text_shndx;
 	int symbols_shndx;
 	int st_ops_shndx;
+	int st_ops_link_shndx;
 };
 
 struct usdt_manager;
@@ -1119,7 +1122,8 @@ static int bpf_object__init_kern_struct_ops_maps(struct bpf_object *obj)
 	return 0;
 }
 
-static int bpf_object__init_struct_ops_maps(struct bpf_object *obj)
+static int init_struct_ops_maps(struct bpf_object *obj, const char *sec_name,
+				int shndx, Elf_Data *data, __u32 map_flags)
 {
 	const struct btf_type *type, *datasec;
 	const struct btf_var_secinfo *vsi;
@@ -1130,15 +1134,15 @@ static int bpf_object__init_struct_ops_maps(struct bpf_object *obj)
 	struct bpf_map *map;
 	__u32 i;
 
-	if (obj->efile.st_ops_shndx == -1)
+	if (shndx == -1)
 		return 0;
 
 	btf = obj->btf;
-	datasec_id = btf__find_by_name_kind(btf, STRUCT_OPS_SEC,
+	datasec_id = btf__find_by_name_kind(btf, sec_name,
 					    BTF_KIND_DATASEC);
 	if (datasec_id < 0) {
 		pr_warn("struct_ops init: DATASEC %s not found\n",
-			STRUCT_OPS_SEC);
+			sec_name);
 		return -EINVAL;
 	}
 
@@ -1151,7 +1155,7 @@ static int bpf_object__init_struct_ops_maps(struct bpf_object *obj)
 		type_id = btf__resolve_type(obj->btf, vsi->type);
 		if (type_id < 0) {
 			pr_warn("struct_ops init: Cannot resolve var type_id %u in DATASEC %s\n",
-				vsi->type, STRUCT_OPS_SEC);
+				vsi->type, sec_name);
 			return -EINVAL;
 		}
 
@@ -1170,7 +1174,7 @@ static int bpf_object__init_struct_ops_maps(struct bpf_object *obj)
 		if (IS_ERR(map))
 			return PTR_ERR(map);
 
-		map->sec_idx = obj->efile.st_ops_shndx;
+		map->sec_idx = shndx;
 		map->sec_offset = vsi->offset;
 		map->name = strdup(var_name);
 		if (!map->name)
@@ -1180,6 +1184,7 @@ static int bpf_object__init_struct_ops_maps(struct bpf_object *obj)
 		map->def.key_size = sizeof(int);
 		map->def.value_size = type->size;
 		map->def.max_entries = 1;
+		map->def.map_flags = map_flags;
 
 		map->st_ops = calloc(1, sizeof(*map->st_ops));
 		if (!map->st_ops)
@@ -1192,14 +1197,14 @@ static int bpf_object__init_struct_ops_maps(struct bpf_object *obj)
 		if (!st_ops->data || !st_ops->progs || !st_ops->kern_func_off)
 			return -ENOMEM;
 
-		if (vsi->offset + type->size > obj->efile.st_ops_data->d_size) {
+		if (vsi->offset + type->size > data->d_size) {
 			pr_warn("struct_ops init: var %s is beyond the end of DATASEC %s\n",
-				var_name, STRUCT_OPS_SEC);
+				var_name, sec_name);
 			return -EINVAL;
 		}
 
 		memcpy(st_ops->data,
-		       obj->efile.st_ops_data->d_buf + vsi->offset,
+		       data->d_buf + vsi->offset,
 		       type->size);
 		st_ops->tname = tname;
 		st_ops->type = type;
@@ -1212,6 +1217,19 @@ static int bpf_object__init_struct_ops_maps(struct bpf_object *obj)
 	return 0;
 }
 
+static int bpf_object_init_struct_ops(struct bpf_object *obj)
+{
+	int err;
+
+	err = init_struct_ops_maps(obj, STRUCT_OPS_SEC, obj->efile.st_ops_shndx,
+				   obj->efile.st_ops_data, 0);
+	err = err ?: init_struct_ops_maps(obj, STRUCT_OPS_LINK_SEC,
+					  obj->efile.st_ops_link_shndx,
+					  obj->efile.st_ops_link_data,
+					  BPF_F_LINK);
+	return err;
+}
+
 static struct bpf_object *bpf_object__new(const char *path,
 					  const void *obj_buf,
 					  size_t obj_buf_sz,
@@ -1248,6 +1266,7 @@ static struct bpf_object *bpf_object__new(const char *path,
 	obj->efile.obj_buf_sz = obj_buf_sz;
 	obj->efile.btf_maps_shndx = -1;
 	obj->efile.st_ops_shndx = -1;
+	obj->efile.st_ops_link_shndx = -1;
 	obj->kconfig_map_idx = -1;
 
 	obj->kern_version = get_kernel_version();
@@ -1265,6 +1284,7 @@ static void bpf_object__elf_finish(struct bpf_object *obj)
 	obj->efile.elf = NULL;
 	obj->efile.symbols = NULL;
 	obj->efile.st_ops_data = NULL;
+	obj->efile.st_ops_link_data = NULL;
 
 	zfree(&obj->efile.secs);
 	obj->efile.sec_cnt = 0;
@@ -2619,7 +2639,7 @@ static int bpf_object__init_maps(struct bpf_object *obj,
 	err = bpf_object__init_user_btf_maps(obj, strict, pin_root_path);
 	err = err ?: bpf_object__init_global_data_maps(obj);
 	err = err ?: bpf_object__init_kconfig_map(obj);
-	err = err ?: bpf_object__init_struct_ops_maps(obj);
+	err = err ?: bpf_object_init_struct_ops(obj);
 
 	return err;
 }
@@ -2753,12 +2773,13 @@ static bool libbpf_needs_btf(const struct bpf_object *obj)
 {
 	return obj->efile.btf_maps_shndx >= 0 ||
 	       obj->efile.st_ops_shndx >= 0 ||
+	       obj->efile.st_ops_link_shndx >= 0 ||
 	       obj->nr_extern > 0;
 }
 
 static bool kernel_needs_btf(const struct bpf_object *obj)
 {
-	return obj->efile.st_ops_shndx >= 0;
+	return obj->efile.st_ops_shndx >= 0 || obj->efile.st_ops_link_shndx >= 0;
 }
 
 static int bpf_object__init_btf(struct bpf_object *obj,
@@ -3451,6 +3472,9 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
 			} else if (strcmp(name, STRUCT_OPS_SEC) == 0) {
 				obj->efile.st_ops_data = data;
 				obj->efile.st_ops_shndx = idx;
+			} else if (strcmp(name, STRUCT_OPS_LINK_SEC) == 0) {
+				obj->efile.st_ops_link_data = data;
+				obj->efile.st_ops_link_shndx = idx;
 			} else {
 				pr_info("elf: skipping unrecognized data section(%d) %s\n",
 					idx, name);
@@ -3465,6 +3489,7 @@ static int bpf_object__elf_collect(struct bpf_object *obj)
 			/* Only do relo for section with exec instructions */
 			if (!section_have_execinstr(obj, targ_sec_idx) &&
 			    strcmp(name, ".rel" STRUCT_OPS_SEC) &&
+			    strcmp(name, ".rel" STRUCT_OPS_LINK_SEC) &&
 			    strcmp(name, ".rel" MAPS_ELF_SEC)) {
 				pr_info("elf: skipping relo section(%d) %s for section(%d) %s\n",
 					idx, name, targ_sec_idx,
@@ -6611,7 +6636,7 @@ static int bpf_object__collect_relos(struct bpf_object *obj)
 			return -LIBBPF_ERRNO__INTERNAL;
 		}
 
-		if (idx == obj->efile.st_ops_shndx)
+		if (idx == obj->efile.st_ops_shndx || idx == obj->efile.st_ops_link_shndx)
 			err = bpf_object__collect_st_ops_relos(obj, shdr, data);
 		else if (idx == obj->efile.btf_maps_shndx)
 			err = bpf_object__collect_map_relos(obj, shdr, data);
@@ -8844,6 +8869,7 @@ const char *libbpf_bpf_prog_type_str(enum bpf_prog_type t)
 }
 
 static struct bpf_map *find_struct_ops_map_by_offset(struct bpf_object *obj,
+						     int sec_idx,
 						     size_t offset)
 {
 	struct bpf_map *map;
@@ -8853,7 +8879,8 @@ static struct bpf_map *find_struct_ops_map_by_offset(struct bpf_object *obj,
 		map = &obj->maps[i];
 		if (!bpf_map__is_struct_ops(map))
 			continue;
-		if (map->sec_offset <= offset &&
+		if (map->sec_idx == sec_idx &&
+		    map->sec_offset <= offset &&
 		    offset - map->sec_offset < map->def.value_size)
 			return map;
 	}
@@ -8895,7 +8922,7 @@ static int bpf_object__collect_st_ops_relos(struct bpf_object *obj,
 		}
 
 		name = elf_sym_str(obj, sym->st_name) ?: "<?>";
-		map = find_struct_ops_map_by_offset(obj, rel->r_offset);
+		map = find_struct_ops_map_by_offset(obj, shdr->sh_info, rel->r_offset);
 		if (!map) {
 			pr_warn("struct_ops reloc: cannot find map at rel->r_offset %zu\n",
 				(size_t)rel->r_offset);
@@ -8962,8 +8989,9 @@ static int bpf_object__collect_st_ops_relos(struct bpf_object *obj,
 		}
 
 		/* struct_ops BPF prog can be re-used between multiple
-		 * .struct_ops as long as it's the same struct_ops struct
-		 * definition and the same function pointer field
+		 * .struct_ops & .struct_ops.link as long as it's the
+		 * same struct_ops struct definition and the same
+		 * function pointer field
 		 */
 		if (prog->attach_btf_id != st_ops->type_id ||
 		    prog->expected_attach_type != member_idx) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH bpf-next v7 8/8] selftests/bpf: Test switching TCP Congestion Control algorithms.
  2023-03-16  2:36 [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
                   ` (6 preceding siblings ...)
  2023-03-16  2:36 ` [PATCH bpf-next v7 7/8] libbpf: Use .struct_ops.link section to indicate a struct_ops with a link Kui-Feng Lee
@ 2023-03-16  2:36 ` Kui-Feng Lee
  7 siblings, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-16  2:36 UTC (permalink / raw)
  To: bpf, ast, martin.lau, song, kernel-team, andrii, sdf; +Cc: Kui-Feng Lee

Create a pair of sockets that use the congestion control algorithm
registered under a particular name. Then switch the algorithm under
that name to another implementation and check whether newly created
connections using the same cc name now run the new implementation.

Also, try to update a link with a struct_ops that lacks BPF_F_LINK or
that has a wrong or different name.  These cases should fail because
they violate the constraints of an update: the struct_ops backing a
bpf_link can only be replaced by another struct_ops that is identical
in type and name and that has the BPF_F_LINK flag.
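
For example, with the maps defined in this test (illustrative snippet,
error handling elided), the expected results are:

    link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
    err = bpf_link__update_map(link, skel->maps.ca_update_2); /* ok */
    err = bpf_link__update_map(link, skel->maps.ca_wrong);    /* fails: different name */
    err = bpf_link__update_map(link, skel->maps.ca_no_link);  /* fails: no BPF_F_LINK */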

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 .../selftests/bpf/prog_tests/bpf_tcp_ca.c     | 91 +++++++++++++++++++
 .../selftests/bpf/progs/tcp_ca_update.c       | 80 ++++++++++++++++
 2 files changed, 171 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/progs/tcp_ca_update.c

diff --git a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
index e980188d4124..e53d611bbf06 100644
--- a/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
+++ b/tools/testing/selftests/bpf/prog_tests/bpf_tcp_ca.c
@@ -8,6 +8,7 @@
 #include "bpf_dctcp.skel.h"
 #include "bpf_cubic.skel.h"
 #include "bpf_tcp_nogpl.skel.h"
+#include "tcp_ca_update.skel.h"
 #include "bpf_dctcp_release.skel.h"
 #include "tcp_ca_write_sk_pacing.skel.h"
 #include "tcp_ca_incompl_cong_ops.skel.h"
@@ -381,6 +382,90 @@ static void test_unsupp_cong_op(void)
 	libbpf_set_print(old_print_fn);
 }
 
+static void test_update_ca(void)
+{
+	struct tcp_ca_update *skel;
+	struct bpf_link *link;
+	int saved_ca1_cnt;
+	int err;
+
+	skel = tcp_ca_update__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		return;
+
+	link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
+	ASSERT_OK_PTR(link, "attach_struct_ops");
+
+	do_test("tcp_ca_update", NULL);
+	saved_ca1_cnt = skel->bss->ca1_cnt;
+	ASSERT_GT(saved_ca1_cnt, 0, "ca1_ca1_cnt");
+
+	err = bpf_link__update_map(link, skel->maps.ca_update_2);
+	ASSERT_OK(err, "update_map");
+
+	do_test("tcp_ca_update", NULL);
+	ASSERT_EQ(skel->bss->ca1_cnt, saved_ca1_cnt, "ca2_ca1_cnt");
+	ASSERT_GT(skel->bss->ca2_cnt, 0, "ca2_ca2_cnt");
+
+	bpf_link__destroy(link);
+	tcp_ca_update__destroy(skel);
+}
+
+static void test_update_wrong(void)
+{
+	struct tcp_ca_update *skel;
+	struct bpf_link *link;
+	int saved_ca1_cnt;
+	int err;
+
+	skel = tcp_ca_update__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		return;
+
+	link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
+	ASSERT_OK_PTR(link, "attach_struct_ops");
+
+	do_test("tcp_ca_update", NULL);
+	saved_ca1_cnt = skel->bss->ca1_cnt;
+	ASSERT_GT(saved_ca1_cnt, 0, "ca1_ca1_cnt");
+
+	err = bpf_link__update_map(link, skel->maps.ca_wrong);
+	ASSERT_ERR(err, "update_map");
+
+	do_test("tcp_ca_update", NULL);
+	ASSERT_GT(skel->bss->ca1_cnt, saved_ca1_cnt, "ca2_ca1_cnt");
+
+	bpf_link__destroy(link);
+	tcp_ca_update__destroy(skel);
+}
+
+static void test_mixed_links(void)
+{
+	struct tcp_ca_update *skel;
+	struct bpf_link *link, *link_nl;
+	int err;
+
+	skel = tcp_ca_update__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "open"))
+		return;
+
+	link_nl = bpf_map__attach_struct_ops(skel->maps.ca_no_link);
+	ASSERT_OK_PTR(link_nl, "attach_struct_ops_nl");
+
+	link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
+	ASSERT_OK_PTR(link, "attach_struct_ops");
+
+	do_test("tcp_ca_update", NULL);
+	ASSERT_GT(skel->bss->ca1_cnt, 0, "ca1_ca1_cnt");
+
+	err = bpf_link__update_map(link, skel->maps.ca_no_link);
+	ASSERT_ERR(err, "update_map");
+
+	bpf_link__destroy(link);
+	bpf_link__destroy(link_nl);
+	tcp_ca_update__destroy(skel);
+}
+
 void test_bpf_tcp_ca(void)
 {
 	if (test__start_subtest("dctcp"))
@@ -399,4 +484,10 @@ void test_bpf_tcp_ca(void)
 		test_incompl_cong_ops();
 	if (test__start_subtest("unsupp_cong_op"))
 		test_unsupp_cong_op();
+	if (test__start_subtest("update_ca"))
+		test_update_ca();
+	if (test__start_subtest("update_wrong"))
+		test_update_wrong();
+	if (test__start_subtest("mixed_links"))
+		test_mixed_links();
 }
diff --git a/tools/testing/selftests/bpf/progs/tcp_ca_update.c b/tools/testing/selftests/bpf/progs/tcp_ca_update.c
new file mode 100644
index 000000000000..b93a0ed33057
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/tcp_ca_update.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "vmlinux.h"
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+int ca1_cnt = 0;
+int ca2_cnt = 0;
+
+static inline struct tcp_sock *tcp_sk(const struct sock *sk)
+{
+	return (struct tcp_sock *)sk;
+}
+
+SEC("struct_ops/ca_update_1_init")
+void BPF_PROG(ca_update_1_init, struct sock *sk)
+{
+	ca1_cnt++;
+}
+
+SEC("struct_ops/ca_update_2_init")
+void BPF_PROG(ca_update_2_init, struct sock *sk)
+{
+	ca2_cnt++;
+}
+
+SEC("struct_ops/ca_update_cong_control")
+void BPF_PROG(ca_update_cong_control, struct sock *sk,
+	      const struct rate_sample *rs)
+{
+}
+
+SEC("struct_ops/ca_update_ssthresh")
+__u32 BPF_PROG(ca_update_ssthresh, struct sock *sk)
+{
+	return tcp_sk(sk)->snd_ssthresh;
+}
+
+SEC("struct_ops/ca_update_undo_cwnd")
+__u32 BPF_PROG(ca_update_undo_cwnd, struct sock *sk)
+{
+	return tcp_sk(sk)->snd_cwnd;
+}
+
+SEC(".struct_ops.link")
+struct tcp_congestion_ops ca_update_1 = {
+	.init = (void *)ca_update_1_init,
+	.cong_control = (void *)ca_update_cong_control,
+	.ssthresh = (void *)ca_update_ssthresh,
+	.undo_cwnd = (void *)ca_update_undo_cwnd,
+	.name = "tcp_ca_update",
+};
+
+SEC(".struct_ops.link")
+struct tcp_congestion_ops ca_update_2 = {
+	.init = (void *)ca_update_2_init,
+	.cong_control = (void *)ca_update_cong_control,
+	.ssthresh = (void *)ca_update_ssthresh,
+	.undo_cwnd = (void *)ca_update_undo_cwnd,
+	.name = "tcp_ca_update",
+};
+
+SEC(".struct_ops.link")
+struct tcp_congestion_ops ca_wrong = {
+	.cong_control = (void *)ca_update_cong_control,
+	.ssthresh = (void *)ca_update_ssthresh,
+	.undo_cwnd = (void *)ca_update_undo_cwnd,
+	.name = "tcp_ca_wrong",
+};
+
+SEC(".struct_ops")
+struct tcp_congestion_ops ca_no_link = {
+	.cong_control = (void *)ca_update_cong_control,
+	.ssthresh = (void *)ca_update_ssthresh,
+	.undo_cwnd = (void *)ca_update_undo_cwnd,
+	.name = "tcp_ca_no_link",
+};
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm.
  2023-03-16  2:36 ` [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
@ 2023-03-17 15:23   ` Daniel Borkmann
  2023-03-17 17:18     ` Martin KaFai Lau
  2023-03-17 23:07     ` Kui-Feng Lee
  0 siblings, 2 replies; 33+ messages in thread
From: Daniel Borkmann @ 2023-03-17 15:23 UTC (permalink / raw)
  To: Kui-Feng Lee, bpf, ast, martin.lau, song, kernel-team, andrii, sdf
  Cc: netdev, Eric Dumazet

On 3/16/23 3:36 AM, Kui-Feng Lee wrote:
> This feature lets you immediately transition to another congestion
> control algorithm or implementation with the same name.  Once a name
> is updated, new connections will apply this new algorithm.
> 
> The purpose is to update a customized algorithm implemented in BPF
> struct_ops with a new version on the fly.  The following is an
> example of using the userspace API implemented in later BPF patches.
> 
>     link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
>     .......
>     err = bpf_link__update_map(link, skel->maps.ca_update_2);
> 
> We first load and register an algorithm implemented in BPF struct_ops,
> then swap it out with a new one using the same name. After that, newly
> created connections will apply the updated algorithm, while older ones
> retain the previous version already applied.
> 
> This patch also takes this chance to refactor the ca validation into
> the new tcp_validate_congestion_control() function.
> 
> Cc: netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>   include/net/tcp.h   |  3 +++
>   net/ipv4/tcp_cong.c | 60 +++++++++++++++++++++++++++++++++++++++------
>   2 files changed, 56 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index db9f828e9d1e..2abb755e6a3a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1117,6 +1117,9 @@ struct tcp_congestion_ops {
>   
>   int tcp_register_congestion_control(struct tcp_congestion_ops *type);
>   void tcp_unregister_congestion_control(struct tcp_congestion_ops *type);
> +int tcp_update_congestion_control(struct tcp_congestion_ops *type,
> +				  struct tcp_congestion_ops *old_type);
> +int tcp_validate_congestion_control(struct tcp_congestion_ops *ca);
>   
>   void tcp_assign_congestion_control(struct sock *sk);
>   void tcp_init_congestion_control(struct sock *sk);
> diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
> index db8b4b488c31..c90791ae8389 100644
> --- a/net/ipv4/tcp_cong.c
> +++ b/net/ipv4/tcp_cong.c
> @@ -75,14 +75,8 @@ struct tcp_congestion_ops *tcp_ca_find_key(u32 key)
>   	return NULL;
>   }
>   
> -/*
> - * Attach new congestion control algorithm to the list
> - * of available options.
> - */
> -int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
> +int tcp_validate_congestion_control(struct tcp_congestion_ops *ca)
>   {
> -	int ret = 0;
> -
>   	/* all algorithms must implement these */
>   	if (!ca->ssthresh || !ca->undo_cwnd ||
>   	    !(ca->cong_avoid || ca->cong_control)) {
> @@ -90,6 +84,20 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
>   		return -EINVAL;
>   	}
>   
> +	return 0;
> +}
> +
> +/* Attach new congestion control algorithm to the list
> + * of available options.
> + */
> +int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
> +{
> +	int ret;
> +
> +	ret = tcp_validate_congestion_control(ca);
> +	if (ret)
> +		return ret;
> +
>   	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
>   
>   	spin_lock(&tcp_cong_list_lock);
> @@ -130,6 +138,44 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
>   }
>   EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
>   
> +/* Replace a registered old ca with a new one.
> + *
> + * The new ca must have the same name as the old one, that has been
> + * registered.
> + */
> +int tcp_update_congestion_control(struct tcp_congestion_ops *ca, struct tcp_congestion_ops *old_ca)
> +{
> +	struct tcp_congestion_ops *existing;
> +	int ret;
> +
> +	ret = tcp_validate_congestion_control(ca);
> +	if (ret)
> +		return ret;
> +
> +	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
> +
> +	spin_lock(&tcp_cong_list_lock);
> +	existing = tcp_ca_find_key(old_ca->key);
> +	if (ca->key == TCP_CA_UNSPEC || !existing || strcmp(existing->name, ca->name)) {
> +		pr_notice("%s not registered or non-unique key\n",
> +			  ca->name);
> +		ret = -EINVAL;
> +	} else if (existing != old_ca) {
> +		pr_notice("invalid old congestion control algorithm to replace\n");
> +		ret = -EINVAL;
> +	} else {
> +		/* Add the new one before removing the old one to keep
> +		 * one implementation available all the time.
> +		 */
> +		list_add_tail_rcu(&ca->list, &tcp_cong_list);
> +		list_del_rcu(&existing->list);
> +		pr_debug("%s updated\n", ca->name);
> +	}
> +	spin_unlock(&tcp_cong_list_lock);
> +
> +	return ret;
> +}

Was wondering if tcp_register_congestion_control() and tcp_update_congestion_control()
could be refactored for reuse. Maybe like below. From the function itself it is not 
clear whether callers that replace an existing one should do the synchronize_rcu() 
themselves or whether this should be part of tcp_update_congestion_control()?

int tcp_check_congestion_control(struct tcp_congestion_ops *ca)
{
	/* All algorithms must implement these. */
	if (!ca->ssthresh || !ca->undo_cwnd ||
	    !(ca->cong_avoid || ca->cong_control)) {
		pr_err("%s does not implement required ops\n", ca->name);
		return -EINVAL;
	}
	if (ca->key == TCP_CA_UNSPEC)
		ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
	if (ca->key == TCP_CA_UNSPEC) {
		pr_notice("%s results in zero key\n", ca->name);
		return -EEXIST;
	}
	return 0;
}

/* Attach new congestion control algorithm to the list of available
  * options or replace an existing one if old is non-NULL.
  */
int tcp_update_congestion_control(struct tcp_congestion_ops *new,
				  struct tcp_congestion_ops *old)
{
	struct tcp_congestion_ops *found;
	int ret;

	ret = tcp_check_congestion_control(new);
	if (ret)
		return ret;
	if (old &&
	    (old->key != new->key ||
	     strcmp(old->name, new->name))) {
		pr_notice("%s & %s have non-matching congestion control names\n",
			  old->name, new->name);
		return -EINVAL;
	}
	spin_lock(&tcp_cong_list_lock);
	found = tcp_ca_find_key(new->key);
	if (old) {
		if (found == old) {
			list_add_tail_rcu(&new->list, &tcp_cong_list);
			list_del_rcu(&old->list);
		} else {
			pr_notice("%s not registered\n", old->name);
			ret = -EINVAL;
		}
	} else {
		if (found) {
			pr_notice("%s already registered\n", new->name);
			ret = -EEXIST;
		} else {
			list_add_tail_rcu(&new->list, &tcp_cong_list);
		}
	}
	spin_unlock(&tcp_cong_list_lock);
	return ret;
}

int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
{
	return tcp_update_congestion_control(ca, NULL);
}
EXPORT_SYMBOL_GPL(tcp_register_congestion_control);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 1/8] bpf: Retire the struct_ops map kvalue->refcnt.
  2023-03-16  2:36 ` [PATCH bpf-next v7 1/8] bpf: Retire the struct_ops map kvalue->refcnt Kui-Feng Lee
@ 2023-03-17 16:47   ` Martin KaFai Lau
  2023-03-17 20:41     ` Kui-Feng Lee
  0 siblings, 1 reply; 33+ messages in thread
From: Martin KaFai Lau @ 2023-03-17 16:47 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf

On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1943,6 +1943,7 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd);
>   struct bpf_map *__bpf_map_get(struct fd f);
>   void bpf_map_inc(struct bpf_map *map);
>   void bpf_map_inc_with_uref(struct bpf_map *map);
> +struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref);
>   struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map *map);
>   void bpf_map_put_with_uref(struct bpf_map *map);
>   void bpf_map_put(struct bpf_map *map);
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 38903fb52f98..2a854e9cee52 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -58,6 +58,8 @@ struct bpf_struct_ops_map {
>   	struct bpf_struct_ops_value kvalue;
>   };
>   
> +static DEFINE_MUTEX(update_mutex);

There has been a comment on the unused "update_mutex" since v3 and v5:

"...This is only used in patch 5 of this set. Please move it there..."


> +
>   #define VALUE_PREFIX "bpf_struct_ops_"
>   #define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
>   
> @@ -249,6 +251,7 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
>   	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
>   	struct bpf_struct_ops_value *uvalue, *kvalue;
>   	enum bpf_struct_ops_state state;
> +	s64 refcnt;
>   
>   	if (unlikely(*(u32 *)key != 0))
>   		return -ENOENT;
> @@ -267,7 +270,14 @@ int bpf_struct_ops_map_sys_lookup_elem(struct bpf_map *map, void *key,
>   	uvalue = value;
>   	memcpy(uvalue, st_map->uvalue, map->value_size);
>   	uvalue->state = state;
> -	refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
> +
> +	/* This value offers the user space a general estimate of how
> +	 * many sockets are still utilizing this struct_ops for TCP
> +	 * congestion control. The number might not be exact, but it
> +	 * should sufficiently meet our present goals.
> +	 */
> +	refcnt = atomic64_read(&map->refcnt) - atomic64_read(&map->usercnt);
> +	refcount_set(&uvalue->refcnt, max_t(s64, refcnt, 0));
>   
>   	return 0;
>   }
> @@ -491,7 +501,6 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>   		*(unsigned long *)(udata + moff) = prog->aux->id;
>   	}
>   
> -	refcount_set(&kvalue->refcnt, 1);
>   	bpf_map_inc(map);
>   
>   	set_memory_rox((long)st_map->image, 1);
> @@ -536,8 +545,7 @@ static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
>   	switch (prev_state) {
>   	case BPF_STRUCT_OPS_STATE_INUSE:
>   		st_map->st_ops->unreg(&st_map->kvalue.data);
> -		if (refcount_dec_and_test(&st_map->kvalue.refcnt))
> -			bpf_map_put(map);
> +		bpf_map_put(map);
>   		return 0;
>   	case BPF_STRUCT_OPS_STATE_TOBEFREE:
>   		return -EINPROGRESS;
> @@ -570,7 +578,7 @@ static void bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
>   	kfree(value);
>   }
>   
> -static void bpf_struct_ops_map_free(struct bpf_map *map)
> +static void bpf_struct_ops_map_free_nosync(struct bpf_map *map)

nit. __bpf_struct_ops_map_free() is the usual alternative name to use in this case.

>   {
>   	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
>   
> @@ -582,6 +590,25 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>   	bpf_map_area_free(st_map);
>   }
>   
> +static void bpf_struct_ops_map_free(struct bpf_map *map)
> +{
> +	/* The struct_ops's function may switch to another struct_ops.
> +	 *
> +	 * For example, bpf_tcp_cc_x->init() may switch to
> +	 * another tcp_cc_y by calling
> +	 * setsockopt(TCP_CONGESTION, "tcp_cc_y").
> +	 * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
> +	 * and its refcount may reach 0 which then free its
> +	 * trampoline image while tcp_cc_x is still running.
> +	 *
> +	 * Thus, a rcu grace period is needed here.
> +	 */
> +	synchronize_rcu();
> +	synchronize_rcu_tasks();

synchronize_rcu_mult(call_rcu, call_rcu_tasks) to wait both in parallel (credit 
to Paul's tip).
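
e.g. (untested), replacing the two back-to-back waits above:

	/* Wait for an RCU and an RCU-tasks grace period in parallel. */
	synchronize_rcu_mult(call_rcu, call_rcu_tasks);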



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm.
  2023-03-17 15:23   ` Daniel Borkmann
@ 2023-03-17 17:18     ` Martin KaFai Lau
  2023-03-17 17:23       ` Daniel Borkmann
  2023-03-17 23:07     ` Kui-Feng Lee
  1 sibling, 1 reply; 33+ messages in thread
From: Martin KaFai Lau @ 2023-03-17 17:18 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: netdev, Eric Dumazet, Kui-Feng Lee, bpf, ast, song, kernel-team,
	andrii, sdf

On 3/17/23 8:23 AM, Daniel Borkmann wrote:
> From the function itself it is not clear whether callers that replace an 
> existing one should do the synchronize_rcu() themselves or whether this 
> should be part of tcp_update_congestion_control()?

bpf_struct_ops_map_free (in patch 1) also does synchronize_rcu() for another 
reason (bpf_setsockopt), so the caller (bpf_struct_ops) is doing it. From 
looking at tcp_unregister_congestion_control(), it makes sense that it is more 
correct to have another synchronize_rcu() also in tcp_update_congestion_control 
in case there will be other non-bpf_struct_ops callers doing updates in the future.
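
e.g. the tail of tcp_update_congestion_control() could look like this
(untested sketch):

	spin_unlock(&tcp_cong_list_lock);
	if (!ret) {
		/* Wait for outstanding readers of the old ops to
		 * finish before the caller releases them.
		 */
		synchronize_rcu();
	}
	return ret;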

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm.
  2023-03-17 17:18     ` Martin KaFai Lau
@ 2023-03-17 17:23       ` Daniel Borkmann
  2023-03-17 21:46         ` Kui-Feng Lee
  0 siblings, 1 reply; 33+ messages in thread
From: Daniel Borkmann @ 2023-03-17 17:23 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Kui-Feng Lee, bpf, ast, song, kernel-team,
	andrii, sdf

On 3/17/23 6:18 PM, Martin KaFai Lau wrote:
> On 3/17/23 8:23 AM, Daniel Borkmann wrote:
>> From the function itself it is not clear whether callers that replace an 
>> existing one should do the synchronize_rcu() themselves or whether this 
>> should be part of tcp_update_congestion_control()?
> 
> bpf_struct_ops_map_free (in patch 1) also does synchronize_rcu() for another reason (bpf_setsockopt), so the caller (bpf_struct_ops) is doing it. From looking at tcp_unregister_congestion_control(), it makes sense that it is more correct to have another synchronize_rcu() also in tcp_update_congestion_control in case there will be other non-bpf_struct_ops callers doing updates in the future.

Agree, I was looking at 'bpf: Update the struct_ops of a bpf_link', and essentially as-is
it was implicit via map free. +1, tcp_update_congestion_control() would be more obvious and
better for other/non-BPF users.

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 3/8] bpf: Create links for BPF struct_ops maps.
  2023-03-16  2:36 ` [PATCH bpf-next v7 3/8] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
@ 2023-03-17 18:10   ` Martin KaFai Lau
  2023-03-17 20:52     ` Kui-Feng Lee
  0 siblings, 1 reply; 33+ messages in thread
From: Martin KaFai Lau @ 2023-03-17 18:10 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf

On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
> +int bpf_struct_ops_link_create(union bpf_attr *attr)
> +{
> +	struct bpf_struct_ops_link *link = NULL;
> +	struct bpf_link_primer link_primer;
> +	struct bpf_struct_ops_map *st_map;
> +	struct bpf_map *map;
> +	int err;
> +
> +	map = bpf_map_get(attr->link_create.map_fd);
> +	if (!map)
> +		return -EINVAL;
> +
> +	st_map = (struct bpf_struct_ops_map *)map;
> +
> +	if (!bpf_struct_ops_valid_to_reg(map)) {
> +		err = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	link = kzalloc(sizeof(*link), GFP_USER);
> +	if (!link) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
> +	RCU_INIT_POINTER(link->map, map);

The link->map assignment should be done together with bpf_link_settle(), meaning 
only assign after everything else has succeeded.

The link is not exposed to user space until bpf_link_settle(). The link->map 
assignment can be done after ->reg has succeeded, just before 
bpf_link_settle(). Then there is no need to do the RCU_INIT_POINTER(link->map, 
NULL) dance in the error case.
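
i.e. something like this (untested sketch; error path otherwise unchanged):

	err = st_map->st_ops->reg(st_map->kvalue.data);
	if (err) {
		bpf_link_cleanup(&link_primer);
		link = NULL;
		goto err_out;
	}
	RCU_INIT_POINTER(link->map, map);

	return bpf_link_settle(&link_primer);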

> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err)
> +		goto err_out;
> +
> +	err = st_map->st_ops->reg(st_map->kvalue.data);
> +	if (err) {
> +		/* No RCU since no one has a chance to read this pointer yet. */
> +		RCU_INIT_POINTER(link->map, NULL);
> +		bpf_link_cleanup(&link_primer);
> +		link = NULL;
> +		goto err_out;
> +	}
> +
> +	return bpf_link_settle(&link_primer);
> +
> +err_out:
> +	bpf_map_put(map);
> +	kfree(link);
> +	return err;
> +}


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
  2023-03-16  2:36 ` [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
@ 2023-03-17 18:44   ` Martin KaFai Lau
  2023-03-17 21:00     ` Kui-Feng Lee
  2023-03-17 22:23   ` Andrii Nakryiko
  1 sibling, 1 reply; 33+ messages in thread
From: Martin KaFai Lau @ 2023-03-17 18:44 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf

On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
> @@ -11590,31 +11631,32 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>   	if (!link)
>   		return libbpf_err_ptr(-EINVAL);
>   
> -	st_ops = map->st_ops;
> -	for (i = 0; i < btf_vlen(st_ops->type); i++) {
> -		struct bpf_program *prog = st_ops->progs[i];
> -		void *kern_data;
> -		int prog_fd;
> +	/* kern_vdata should be prepared during the loading phase. */
> +	err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
> +	if (err) {

It should not fail for BPF_F_LINK struct_ops when err is EBUSY.
The struct_ops map can attach, detach, and then attach again.

It needs a test for this case.
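
e.g. (untested sketch), the attach side could tolerate -EBUSY for
BPF_F_LINK maps that have already been updated once:

	err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
	if (err && (!(map->def.map_flags & BPF_F_LINK) || err != -EBUSY)) {
		free(link);
		return libbpf_err_ptr(err);
	}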

> +		free(link);
> +		return libbpf_err_ptr(err);
> +	}


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link.
  2023-03-16  2:36 ` [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
@ 2023-03-17 19:23   ` Martin KaFai Lau
  2023-03-17 21:39     ` Kui-Feng Lee
  2023-03-18  1:11     ` Kui-Feng Lee
  2023-03-17 22:27   ` Andrii Nakryiko
  1 sibling, 2 replies; 33+ messages in thread
From: Martin KaFai Lau @ 2023-03-17 19:23 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf

On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
> +static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map *new_map)
> +{
> +	struct bpf_struct_ops_map *st_map, *old_st_map;
> +	struct bpf_struct_ops_link *st_link;
> +	struct bpf_map *old_map;
> +	int err = 0;
> +
> +	st_link = container_of(link, struct bpf_struct_ops_link, link);
> +	st_map = container_of(new_map, struct bpf_struct_ops_map, map);
> +
> +	if (!bpf_struct_ops_valid_to_reg(new_map))
> +		return -EINVAL;
> +
> +	mutex_lock(&update_mutex);
> +
> +	old_map = rcu_dereference_protected(st_link->map, lockdep_is_held(&update_mutex));
> +	old_st_map = container_of(old_map, struct bpf_struct_ops_map, map);
> +	/* The new and old struct_ops must be the same type. */
> +	if (st_map->st_ops != old_st_map->st_ops) {
> +		err = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);

I don't think it has completely addressed Andrii's comment in v4 regarding 
BPF_F_REPLACE: 
https://lore.kernel.org/bpf/CAEf4BzbK8s+VFG5HefydD7CRLzkRFKg-Er0PKV_-C2-yttfXzA@mail.gmail.com/

For now, tcp_update_congestion_control() enforces the same cc-name. However, it 
is still not the same as what BPF_F_REPLACE intended to do: update only when it 
is the same old-map. Same cc-name does not necessarily mean the same old-map.

> +	if (err)
> +		goto err_out;
> +
> +	bpf_map_inc(new_map);
> +	rcu_assign_pointer(st_link->map, new_map);
> +	bpf_map_put(old_map);
> +
> +err_out:
> +	mutex_unlock(&update_mutex);
> +
> +	return err;
> +}
> +
>   static const struct bpf_link_ops bpf_struct_ops_map_lops = {
>   	.dealloc = bpf_struct_ops_map_link_dealloc,
>   	.show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
>   	.fill_link_info = bpf_struct_ops_map_link_fill_link_info,
> +	.update_map = bpf_struct_ops_map_link_update,
>   };
>   
>   int bpf_struct_ops_link_create(union bpf_attr *attr)
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 5a45e3bf34e2..6fa10d108278 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -4676,6 +4676,21 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>   	return ret;
>   }
>   
> +static int link_update_map(struct bpf_link *link, union bpf_attr *attr)
> +{
> +	struct bpf_map *new_map;
> +	int ret = 0;

nit. init zero is unnecessary.

> +
> +	new_map = bpf_map_get(attr->link_update.new_map_fd);
> +	if (IS_ERR(new_map))
> +		return -EINVAL;
> +
> +	ret = link->ops->update_map(link, new_map);
> +
> +	bpf_map_put(new_map);
> +	return ret;
> +}
> +
>   #define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
>   
>   static int link_update(union bpf_attr *attr)
> @@ -4696,6 +4711,11 @@ static int link_update(union bpf_attr *attr)
>   	if (IS_ERR(link))
>   		return PTR_ERR(link);
>   
> +	if (link->ops->update_map) {
> +		ret = link_update_map(link, attr);
> +		goto out_put_link;
> +	}
> +
>   	new_prog = bpf_prog_get(attr->link_update.new_prog_fd);
>   	if (IS_ERR(new_prog)) {
>   		ret = PTR_ERR(new_prog);
> diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c
> index ff4f89a2b02a..158f14e240d0 100644
> --- a/net/bpf/bpf_dummy_struct_ops.c
> +++ b/net/bpf/bpf_dummy_struct_ops.c
> @@ -222,12 +222,18 @@ static void bpf_dummy_unreg(void *kdata)
>   {
>   }
>   
> +static int bpf_dummy_update(void *kdata, void *old_kdata)
> +{
> +	return -EOPNOTSUPP;
> +}
> +
>   struct bpf_struct_ops bpf_bpf_dummy_ops = {
>   	.verifier_ops = &bpf_dummy_verifier_ops,
>   	.init = bpf_dummy_init,
>   	.check_member = bpf_dummy_ops_check_member,
>   	.init_member = bpf_dummy_init_member,
>   	.reg = bpf_dummy_reg,
> +	.update = bpf_dummy_update,

When looking at this together in patch 5, the changes in bpf_dummy_struct_ops.c 
should not be needed.



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops.
  2023-03-16  2:36 ` [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
@ 2023-03-17 19:42   ` Martin KaFai Lau
  2023-03-17 21:40     ` Kui-Feng Lee
  2023-03-17 22:33   ` Andrii Nakryiko
  1 sibling, 1 reply; 33+ messages in thread
From: Martin KaFai Lau @ 2023-03-17 19:42 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf

On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
> Introduce bpf_link__update_map(), which allows atomically updating the
> underlying struct_ops implementation for a given struct_ops BPF link
> 
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>   tools/lib/bpf/libbpf.c   | 30 ++++++++++++++++++++++++++++++
>   tools/lib/bpf/libbpf.h   |  1 +
>   tools/lib/bpf/libbpf.map |  1 +
>   3 files changed, 32 insertions(+)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 6dbae7ffab48..63ec1f8fe8a0 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -11659,6 +11659,36 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>   	return &link->link;
>   }
>   
> +/*
> + * Swap the backing struct_ops map of a link with a new struct_ops map.
> + */
> +int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map)
> +{
> +	struct bpf_link_struct_ops *st_ops_link;
> +	__u32 zero = 0;
> +	int err, fd;
> +
> +	if (!bpf_map__is_struct_ops(map) || map->fd < 0)
> +		return -EINVAL;
> +
> +	st_ops_link = container_of(link, struct bpf_link_struct_ops, link);
> +	/* Ensure the type of a link is correct */
> +	if (st_ops_link->map_fd < 0)
> +		return -EINVAL;
> +
> +	err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
> +	if (err && err != -EBUSY)
> +		return err;
> +
> +	fd = bpf_link_update(link->fd, map->fd, NULL);

bpf_link_update() returns ok/error. Using "fd = ..." is confusing. Use "err = 
..." instead and remove the need for "int fd;".

> +	if (fd < 0)
> +		return fd;
> +
> +	st_ops_link->map_fd = map->fd;
> +
> +	return 0;
> +}
> +
>   typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
>   							  void *private_data);
>   
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index db4992a036f8..1615e55e2e79 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -719,6 +719,7 @@ bpf_program__attach_freplace(const struct bpf_program *prog,
>   struct bpf_map;
>   
>   LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
>   
>   struct bpf_iter_attach_opts {
>   	size_t sz; /* size of this struct for forward/backward compatibility */
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 50dde1f6521e..cc05be376257 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -387,6 +387,7 @@ LIBBPF_1.2.0 {
>   	global:
>   		bpf_btf_get_info_by_fd;
>   		bpf_link_get_info_by_fd;
> +		bpf_link__update_map;
>   		bpf_map_get_info_by_fd;
>   		bpf_prog_get_info_by_fd;
>   } LIBBPF_1.1.0;


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 1/8] bpf: Retire the struct_ops map kvalue->refcnt.
  2023-03-17 16:47   ` Martin KaFai Lau
@ 2023-03-17 20:41     ` Kui-Feng Lee
  0 siblings, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-17 20:41 UTC (permalink / raw)
  To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf



On 3/17/23 09:47, Martin KaFai Lau wrote:
> On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1943,6 +1943,7 @@ struct bpf_map *bpf_map_get_with_uref(u32 ufd);
>>   struct bpf_map *__bpf_map_get(struct fd f);
>>   void bpf_map_inc(struct bpf_map *map);
>>   void bpf_map_inc_with_uref(struct bpf_map *map);
>> +struct bpf_map *__bpf_map_inc_not_zero(struct bpf_map *map, bool uref);
>>   struct bpf_map * __must_check bpf_map_inc_not_zero(struct bpf_map 
>> *map);
>>   void bpf_map_put_with_uref(struct bpf_map *map);
>>   void bpf_map_put(struct bpf_map *map);
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index 38903fb52f98..2a854e9cee52 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -58,6 +58,8 @@ struct bpf_struct_ops_map {
>>       struct bpf_struct_ops_value kvalue;
>>   };
>> +static DEFINE_MUTEX(update_mutex);
> 
> There has been a comment on the unused "update_mutex" since v3 and v5:
> 
> "...This is only used in patch 5 of this set. Please move it there..."

Got it! Sorry about that.  I moved it to patch 5 in v6; somehow it
appeared in patch 1 again in v7. I must have done something wrong during
rebasing.


> 
> 
>> +
>>   #define VALUE_PREFIX "bpf_struct_ops_"
>>   #define VALUE_PREFIX_LEN (sizeof(VALUE_PREFIX) - 1)
>> @@ -249,6 +251,7 @@ int bpf_struct_ops_map_sys_lookup_elem(struct 
>> bpf_map *map, void *key,
>>       struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map 
>> *)map;
>>       struct bpf_struct_ops_value *uvalue, *kvalue;
>>       enum bpf_struct_ops_state state;
>> +    s64 refcnt;
>>       if (unlikely(*(u32 *)key != 0))
>>           return -ENOENT;
>> @@ -267,7 +270,14 @@ int bpf_struct_ops_map_sys_lookup_elem(struct 
>> bpf_map *map, void *key,
>>       uvalue = value;
>>       memcpy(uvalue, st_map->uvalue, map->value_size);
>>       uvalue->state = state;
>> -    refcount_set(&uvalue->refcnt, refcount_read(&kvalue->refcnt));
>> +
>> +    /* This value offers the user space a general estimate of how
>> +     * many sockets are still utilizing this struct_ops for TCP
>> +     * congestion control. The number might not be exact, but it
>> +     * should sufficiently meet our present goals.
>> +     */
>> +    refcnt = atomic64_read(&map->refcnt) - atomic64_read(&map->usercnt);
>> +    refcount_set(&uvalue->refcnt, max_t(s64, refcnt, 0));
>>       return 0;
>>   }
>> @@ -491,7 +501,6 @@ static int bpf_struct_ops_map_update_elem(struct 
>> bpf_map *map, void *key,
>>           *(unsigned long *)(udata + moff) = prog->aux->id;
>>       }
>> -    refcount_set(&kvalue->refcnt, 1);
>>       bpf_map_inc(map);
>>       set_memory_rox((long)st_map->image, 1);
>> @@ -536,8 +545,7 @@ static int bpf_struct_ops_map_delete_elem(struct 
>> bpf_map *map, void *key)
>>       switch (prev_state) {
>>       case BPF_STRUCT_OPS_STATE_INUSE:
>>           st_map->st_ops->unreg(&st_map->kvalue.data);
>> -        if (refcount_dec_and_test(&st_map->kvalue.refcnt))
>> -            bpf_map_put(map);
>> +        bpf_map_put(map);
>>           return 0;
>>       case BPF_STRUCT_OPS_STATE_TOBEFREE:
>>           return -EINPROGRESS;
>> @@ -570,7 +578,7 @@ static void 
>> bpf_struct_ops_map_seq_show_elem(struct bpf_map *map, void *key,
>>       kfree(value);
>>   }
>> -static void bpf_struct_ops_map_free(struct bpf_map *map)
>> +static void bpf_struct_ops_map_free_nosync(struct bpf_map *map)
> 
> nit. __bpf_struct_ops_map_free() is the usual alternative name to use in 
> this case.

Got it!

> 
>>   {
>>       struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map 
>> *)map;
>> @@ -582,6 +590,25 @@ static void bpf_struct_ops_map_free(struct 
>> bpf_map *map)
>>       bpf_map_area_free(st_map);
>>   }
>> +static void bpf_struct_ops_map_free(struct bpf_map *map)
>> +{
>> +    /* The struct_ops's function may switch to another struct_ops.
>> +     *
>> +     * For example, bpf_tcp_cc_x->init() may switch to
>> +     * another tcp_cc_y by calling
>> +     * setsockopt(TCP_CONGESTION, "tcp_cc_y").
>> +     * During the switch,  bpf_struct_ops_put(tcp_cc_x) is called
>> +     * and its refcount may reach 0 which then free its
>> +     * trampoline image while tcp_cc_x is still running.
>> +     *
>> +     * Thus, a rcu grace period is needed here.
>> +     */
>> +    synchronize_rcu();
>> +    synchronize_rcu_tasks();
> 
> synchronize_rcu_mult(call_rcu, call_rcu_tasks) to wait both in parallel 
> (credit to Paul's tip).
> 

Nice!


> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 3/8] bpf: Create links for BPF struct_ops maps.
  2023-03-17 18:10   ` Martin KaFai Lau
@ 2023-03-17 20:52     ` Kui-Feng Lee
  0 siblings, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-17 20:52 UTC (permalink / raw)
  To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf



On 3/17/23 11:10, Martin KaFai Lau wrote:
> On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
>> +int bpf_struct_ops_link_create(union bpf_attr *attr)
>> +{
>> +    struct bpf_struct_ops_link *link = NULL;
>> +    struct bpf_link_primer link_primer;
>> +    struct bpf_struct_ops_map *st_map;
>> +    struct bpf_map *map;
>> +    int err;
>> +
>> +    map = bpf_map_get(attr->link_create.map_fd);
>> +    if (!map)
>> +        return -EINVAL;
>> +
>> +    st_map = (struct bpf_struct_ops_map *)map;
>> +
>> +    if (!bpf_struct_ops_valid_to_reg(map)) {
>> +        err = -EINVAL;
>> +        goto err_out;
>> +    }
>> +
>> +    link = kzalloc(sizeof(*link), GFP_USER);
>> +    if (!link) {
>> +        err = -ENOMEM;
>> +        goto err_out;
>> +    }
>> +    bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, 
>> &bpf_struct_ops_map_lops, NULL);
>> +    RCU_INIT_POINTER(link->map, map);
> 
> The link->map assignment should be done together with bpf_link_settle(), 
> meaning only assign after everything else has succeeded.
> 
> The link is not exposed to user space until bpf_link_settle(). The 
> link->map assignment can be done after ->reg has succeeded, just 
> before bpf_link_settle(). Then there is no need to do the 
> RCU_INIT_POINTER(link->map, NULL) dance in the error case.
> 

Sounds good! I will move this line.

>> +
>> +    err = bpf_link_prime(&link->link, &link_primer);
>> +    if (err)
>> +        goto err_out;
>> +
>> +    err = st_map->st_ops->reg(st_map->kvalue.data);
>> +    if (err) {
>> +        /* No RCU since no one has a chance to read this pointer yet. */
>> +        RCU_INIT_POINTER(link->map, NULL);
>> +        bpf_link_cleanup(&link_primer);
>> +        link = NULL;
>> +        goto err_out;
>> +    }
>> +
>> +    return bpf_link_settle(&link_primer);
>> +
>> +err_out:
>> +    bpf_map_put(map);
>> +    kfree(link);
>> +    return err;
>> +}
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
  2023-03-17 18:44   ` Martin KaFai Lau
@ 2023-03-17 21:00     ` Kui-Feng Lee
  0 siblings, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-17 21:00 UTC (permalink / raw)
  To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf



On 3/17/23 11:44, Martin KaFai Lau wrote:
> On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
>> @@ -11590,31 +11631,32 @@ struct bpf_link 
>> *bpf_map__attach_struct_ops(const struct bpf_map *map)
>>       if (!link)
>>           return libbpf_err_ptr(-EINVAL);
>> -    st_ops = map->st_ops;
>> -    for (i = 0; i < btf_vlen(st_ops->type); i++) {
>> -        struct bpf_program *prog = st_ops->progs[i];
>> -        void *kern_data;
>> -        int prog_fd;
>> +    /* kern_vdata should be prepared during the loading phase. */
>> +    err = bpf_map_update_elem(map->fd, &zero, 
>> map->st_ops->kern_vdata, 0);
>> +    if (err) {
> 
> It should not fail for BPF_F_LINK struct_ops when err is EBUSY.
> The struct_ops map can attach, detach, and then attach again.
> 
> It needs a test for this case.

Got it!

> 
>> +        free(link);
>> +        return libbpf_err_ptr(err);
>> +    }
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link.
  2023-03-17 19:23   ` Martin KaFai Lau
@ 2023-03-17 21:39     ` Kui-Feng Lee
  2023-03-18  1:11     ` Kui-Feng Lee
  1 sibling, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-17 21:39 UTC (permalink / raw)
  To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf



On 3/17/23 12:23, Martin KaFai Lau wrote:
> On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
>> +static int bpf_struct_ops_map_link_update(struct bpf_link *link, 
>> struct bpf_map *new_map)
>> +{
>> +    struct bpf_struct_ops_map *st_map, *old_st_map;
>> +    struct bpf_struct_ops_link *st_link;
>> +    struct bpf_map *old_map;
>> +    int err = 0;
>> +
>> +    st_link = container_of(link, struct bpf_struct_ops_link, link);
>> +    st_map = container_of(new_map, struct bpf_struct_ops_map, map);
>> +
>> +    if (!bpf_struct_ops_valid_to_reg(new_map))
>> +        return -EINVAL;
>> +
>> +    mutex_lock(&update_mutex);
>> +
>> +    old_map = rcu_dereference_protected(st_link->map, 
>> lockdep_is_held(&update_mutex));
>> +    old_st_map = container_of(old_map, struct bpf_struct_ops_map, map);
>> +    /* The new and old struct_ops must be the same type. */
>> +    if (st_map->st_ops != old_st_map->st_ops) {
>> +        err = -EINVAL;
>> +        goto err_out;
>> +    }
>> +
>> +    err = st_map->st_ops->update(st_map->kvalue.data, 
>> old_st_map->kvalue.data);
> 
> I don't think it has completely addressed Andrii's comment in v4 
> regarding BPF_F_REPLACE: 
> https://lore.kernel.org/bpf/CAEf4BzbK8s+VFG5HefydD7CRLzkRFKg-Er0PKV_-C2-yttfXzA@mail.gmail.com/

I just checked with Andrii.
For this, I will add an old_map_fd field to bpf_attr to pass a fd for
the old map.  The signature of bpf_link_ops::update_map will be changed
to also pass the old_map, so we can check that it is the expected one.
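
Roughly (untested; field names are tentative):

	struct { /* struct used by BPF_LINK_UPDATE command */
		__u32		link_fd;	/* link fd */
		union {
			/* new program fd to update link with */
			__u32		new_prog_fd;
			/* new struct_ops map fd to update link with */
			__u32		new_map_fd;
		};
		__u32		flags;		/* extra flags */
		union {
			/* expected link's program fd; is specified only if
			 * BPF_F_REPLACE flag is set in flags */
			__u32		old_prog_fd;
			/* expected link's map fd; is specified only if
			 * BPF_F_REPLACE flag is set in flags */
			__u32		old_map_fd;
		};
	} link_update;

and bpf_link_ops::update_map would become:

	int (*update_map)(struct bpf_link *link, struct bpf_map *new_map,
			  struct bpf_map *old_map);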

> 
> For now, tcp_update_congestion_control() enforces the same cc-name. 
> However, it is still not the same as what BPF_F_REPLACE intended to 
> update only when it is the same old-map. Same cc-name does not 
> necessarily mean the same old-map.

st_ops->update() will check if old_st_map->kvalue.data is the same
one registered before.  That will enforce that old_st_map is old_map.

> 
>> +    if (err)
>> +        goto err_out;
>> +
>> +    bpf_map_inc(new_map);
>> +    rcu_assign_pointer(st_link->map, new_map);
>> +    bpf_map_put(old_map);
>> +
>> +err_out:
>> +    mutex_unlock(&update_mutex);
>> +
>> +    return err;
>> +}
>> +
>>   static const struct bpf_link_ops bpf_struct_ops_map_lops = {
>>       .dealloc = bpf_struct_ops_map_link_dealloc,
>>       .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
>>       .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
>> +    .update_map = bpf_struct_ops_map_link_update,
>>   };
>>   int bpf_struct_ops_link_create(union bpf_attr *attr)
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 5a45e3bf34e2..6fa10d108278 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -4676,6 +4676,21 @@ static int link_create(union bpf_attr *attr, 
>> bpfptr_t uattr)
>>       return ret;
>>   }
>> +static int link_update_map(struct bpf_link *link, union bpf_attr *attr)
>> +{
>> +    struct bpf_map *new_map;
>> +    int ret = 0;
> 
> nit. init zero is unnecessary.
> 
>> +
>> +    new_map = bpf_map_get(attr->link_update.new_map_fd);
>> +    if (IS_ERR(new_map))
>> +        return -EINVAL;
>> +
>> +    ret = link->ops->update_map(link, new_map);
>> +
>> +    bpf_map_put(new_map);
>> +    return ret;
>> +}
>> +
>>   #define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
>>   static int link_update(union bpf_attr *attr)
>> @@ -4696,6 +4711,11 @@ static int link_update(union bpf_attr *attr)
>>       if (IS_ERR(link))
>>           return PTR_ERR(link);
>> +    if (link->ops->update_map) {
>> +        ret = link_update_map(link, attr);
>> +        goto out_put_link;
>> +    }
>> +
>>       new_prog = bpf_prog_get(attr->link_update.new_prog_fd);
>>       if (IS_ERR(new_prog)) {
>>           ret = PTR_ERR(new_prog);
>> diff --git a/net/bpf/bpf_dummy_struct_ops.c 
>> b/net/bpf/bpf_dummy_struct_ops.c
>> index ff4f89a2b02a..158f14e240d0 100644
>> --- a/net/bpf/bpf_dummy_struct_ops.c
>> +++ b/net/bpf/bpf_dummy_struct_ops.c
>> @@ -222,12 +222,18 @@ static void bpf_dummy_unreg(void *kdata)
>>   {
>>   }
>> +static int bpf_dummy_update(void *kdata, void *old_kdata)
>> +{
>> +    return -EOPNOTSUPP;
>> +}
>> +
>>   struct bpf_struct_ops bpf_bpf_dummy_ops = {
>>       .verifier_ops = &bpf_dummy_verifier_ops,
>>       .init = bpf_dummy_init,
>>       .check_member = bpf_dummy_ops_check_member,
>>       .init_member = bpf_dummy_init_member,
>>       .reg = bpf_dummy_reg,
>> +    .update = bpf_dummy_update,
> 
> When looking at this together in patch 5, the changes in 
> bpf_dummy_struct_ops.c should not be needed.
> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops.
  2023-03-17 19:42   ` Martin KaFai Lau
@ 2023-03-17 21:40     ` Kui-Feng Lee
  0 siblings, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-17 21:40 UTC (permalink / raw)
  To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf



On 3/17/23 12:42, Martin KaFai Lau wrote:
> On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
>> Introduce bpf_link__update_map(), which allows atomically updating the
>> underlying struct_ops implementation for a given struct_ops BPF link
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   tools/lib/bpf/libbpf.c   | 30 ++++++++++++++++++++++++++++++
>>   tools/lib/bpf/libbpf.h   |  1 +
>>   tools/lib/bpf/libbpf.map |  1 +
>>   3 files changed, 32 insertions(+)
>>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 6dbae7ffab48..63ec1f8fe8a0 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -11659,6 +11659,36 @@ struct bpf_link 
>> *bpf_map__attach_struct_ops(const struct bpf_map *map)
>>       return &link->link;
>>   }
>> +/*
>> + * Swap the backing struct_ops map of a link with a new struct_ops map.
>> + */
>> +int bpf_link__update_map(struct bpf_link *link, const struct bpf_map 
>> *map)
>> +{
>> +    struct bpf_link_struct_ops *st_ops_link;
>> +    __u32 zero = 0;
>> +    int err, fd;
>> +
>> +    if (!bpf_map__is_struct_ops(map) || map->fd < 0)
>> +        return -EINVAL;
>> +
>> +    st_ops_link = container_of(link, struct bpf_link_struct_ops, link);
>> +    /* Ensure the type of a link is correct */
>> +    if (st_ops_link->map_fd < 0)
>> +        return -EINVAL;
>> +
>> +    err = bpf_map_update_elem(map->fd, &zero, 
>> map->st_ops->kern_vdata, 0);
>> +    if (err && err != -EBUSY)
>> +        return err;
>> +
>> +    fd = bpf_link_update(link->fd, map->fd, NULL);
> 
> bpf_link_update() returns ok/error. Using "fd = ..." is confusing. Use 
> "err = ..." instead and remove the need for "int fd;".

got it.

> 
>> +    if (fd < 0)
>> +        return fd;
>> +
>> +    st_ops_link->map_fd = map->fd;
>> +
>> +    return 0;
>> +}
>> +
>>   typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct 
>> perf_event_header *hdr,
>>                                 void *private_data);
>> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
>> index db4992a036f8..1615e55e2e79 100644
>> --- a/tools/lib/bpf/libbpf.h
>> +++ b/tools/lib/bpf/libbpf.h
>> @@ -719,6 +719,7 @@ bpf_program__attach_freplace(const struct 
>> bpf_program *prog,
>>   struct bpf_map;
>>   LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct 
>> bpf_map *map);
>> +LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const 
>> struct bpf_map *map);
>>   struct bpf_iter_attach_opts {
>>       size_t sz; /* size of this struct for forward/backward 
>> compatibility */
>> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
>> index 50dde1f6521e..cc05be376257 100644
>> --- a/tools/lib/bpf/libbpf.map
>> +++ b/tools/lib/bpf/libbpf.map
>> @@ -387,6 +387,7 @@ LIBBPF_1.2.0 {
>>       global:
>>           bpf_btf_get_info_by_fd;
>>           bpf_link_get_info_by_fd;
>> +        bpf_link__update_map;
>>           bpf_map_get_info_by_fd;
>>           bpf_prog_get_info_by_fd;
>>   } LIBBPF_1.1.0;
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm.
  2023-03-17 17:23       ` Daniel Borkmann
@ 2023-03-17 21:46         ` Kui-Feng Lee
  0 siblings, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-17 21:46 UTC (permalink / raw)
  To: Daniel Borkmann, Martin KaFai Lau
  Cc: netdev, Eric Dumazet, Kui-Feng Lee, bpf, ast, song, kernel-team,
	andrii, sdf



On 3/17/23 10:23, Daniel Borkmann wrote:
> On 3/17/23 6:18 PM, Martin KaFai Lau wrote:
>> On 3/17/23 8:23 AM, Daniel Borkmann wrote:
>>> From the function itself it is not clear whether callers that replace 
>>> an existing one should do the synchronize_rcu() themselves or whether 
>>> this should be part of tcp_update_congestion_control()?
>>
>> bpf_struct_ops_map_free (in patch 1) also does synchronize_rcu() for 
>> another reason (bpf_setsockopt), so the caller (bpf_struct_ops) is 
>> doing it. From looking at tcp_unregister_congestion_control(), it makes 
>> sense that it is more correct to have another synchronize_rcu() also 
>> in tcp_update_congestion_control in case there will be other 
>> non-bpf_struct_ops callers doing updates in the future.
> 
> Agree, I was looking at 'bpf: Update the struct_ops of a bpf_link', and 
> essentially as-is
> it was implicit via map free. +1, tcp_update_congestion_control() would 
> be more obvious and
> better for other/non-BPF users.

It makes sense to me.
I will refactor the functions as well.

> 
> Thanks,
> Daniel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
  2023-03-16  2:36 ` [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
  2023-03-17 18:44   ` Martin KaFai Lau
@ 2023-03-17 22:23   ` Andrii Nakryiko
  2023-03-17 23:48     ` Kui-Feng Lee
  1 sibling, 1 reply; 33+ messages in thread
From: Andrii Nakryiko @ 2023-03-17 22:23 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf

On Wed, Mar 15, 2023 at 7:37 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>
> bpf_map__attach_struct_ops() was creating a dummy bpf_link as a
> placeholder, but now it is constructing an authentic one by calling
> bpf_link_create() if the map has the BPF_F_LINK flag.
>
> You can flag a struct_ops map with BPF_F_LINK by calling
> bpf_map__set_map_flags().
>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>  tools/lib/bpf/libbpf.c | 90 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 66 insertions(+), 24 deletions(-)
>

[...]

> -               if (!prog)
> -                       continue;
> +       link->link.detach = bpf_link__detach_struct_ops;
>
> -               prog_fd = bpf_program__fd(prog);
> -               kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
> -               *(unsigned long *)kern_data = prog_fd;
> +       if (!(map->def.map_flags & BPF_F_LINK)) {
> +               /* w/o a real link */
> +               link->link.fd = map->fd;
> +               link->map_fd = -1;
> +               return &link->link;
>         }
>
> -       err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
> -       if (err) {
> -               err = -errno;
> +       fd = bpf_link_create(map->fd, -1, BPF_STRUCT_OPS, NULL);

pass 0, not -1. BPF APIs have a convention that fd=0 means "no fd was
provided". And actually the kernel should have rejected this -1, so please
check why that didn't happen in your testing; we might be missing some
kernel validation.
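
i.e. just:

	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);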


> +       if (fd < 0) {
>                 free(link);
> -               return libbpf_err_ptr(err);
> +               return libbpf_err_ptr(fd);
>         }
>
> -       link->detach = bpf_link__detach_struct_ops;
> -       link->fd = map->fd;
> +       link->link.fd = fd;
> +       link->map_fd = map->fd;
>
> -       return link;
> +       return &link->link;
>  }
>
>  typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link.
  2023-03-16  2:36 ` [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
  2023-03-17 19:23   ` Martin KaFai Lau
@ 2023-03-17 22:27   ` Andrii Nakryiko
  2023-03-18  0:41     ` Kui-Feng Lee
  1 sibling, 1 reply; 33+ messages in thread
From: Andrii Nakryiko @ 2023-03-17 22:27 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf

On Wed, Mar 15, 2023 at 7:37 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>
> By improving the BPF_LINK_UPDATE command of bpf(), it should allow you
> to conveniently switch between different struct_ops on a single
> bpf_link. This would enable smoother transitions from one struct_ops
> to another.
>
> The struct_ops maps passing along with BPF_LINK_UPDATE should have the
> BPF_F_LINK flag.
>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>  include/linux/bpf.h            |  2 ++
>  include/uapi/linux/bpf.h       |  8 +++++--
>  kernel/bpf/bpf_struct_ops.c    | 38 ++++++++++++++++++++++++++++++++++
>  kernel/bpf/syscall.c           | 20 ++++++++++++++++++
>  net/bpf/bpf_dummy_struct_ops.c |  6 ++++++
>  net/ipv4/bpf_tcp_ca.c          |  6 ++++++
>  tools/include/uapi/linux/bpf.h |  8 +++++--
>  7 files changed, 84 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 455b14bf8f28..56e6ab7559ef 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1474,6 +1474,7 @@ struct bpf_link_ops {
>         void (*show_fdinfo)(const struct bpf_link *link, struct seq_file *seq);
>         int (*fill_link_info)(const struct bpf_link *link,
>                               struct bpf_link_info *info);
> +       int (*update_map)(struct bpf_link *link, struct bpf_map *new_map);
>  };
>
>  struct bpf_tramp_link {
> @@ -1516,6 +1517,7 @@ struct bpf_struct_ops {
>                            void *kdata, const void *udata);
>         int (*reg)(void *kdata);
>         void (*unreg)(void *kdata);
> +       int (*update)(void *kdata, void *old_kdata);
>         int (*validate)(void *kdata);
>         const struct btf_type *type;
>         const struct btf_type *value_type;
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 42f40ee083bf..24e1dec4ad97 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1555,8 +1555,12 @@ union bpf_attr {
>
>         struct { /* struct used by BPF_LINK_UPDATE command */
>                 __u32           link_fd;        /* link fd */
> -               /* new program fd to update link with */
> -               __u32           new_prog_fd;
> +               union {
> +                       /* new program fd to update link with */
> +                       __u32           new_prog_fd;
> +                       /* new struct_ops map fd to update link with */
> +                       __u32           new_map_fd;
> +               };
>                 __u32           flags;          /* extra flags */
>                 /* expected link's program fd; is specified only if
>                  * BPF_F_REPLACE flag is set in flags */
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 8ce6c7581ca3..5a9e10b92423 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -807,10 +807,48 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
>         return 0;
>  }
>
> +static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map *new_map)
> +{
> +       struct bpf_struct_ops_map *st_map, *old_st_map;
> +       struct bpf_struct_ops_link *st_link;
> +       struct bpf_map *old_map;
> +       int err = 0;
> +
> +       st_link = container_of(link, struct bpf_struct_ops_link, link);
> +       st_map = container_of(new_map, struct bpf_struct_ops_map, map);
> +
> +       if (!bpf_struct_ops_valid_to_reg(new_map))
> +               return -EINVAL;
> +
> +       mutex_lock(&update_mutex);
> +
> +       old_map = rcu_dereference_protected(st_link->map, lockdep_is_held(&update_mutex));
> +       old_st_map = container_of(old_map, struct bpf_struct_ops_map, map);
> +       /* The new and old struct_ops must be the same type. */
> +       if (st_map->st_ops != old_st_map->st_ops) {
> +               err = -EINVAL;
> +               goto err_out;
> +       }
> +
> +       err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
> +       if (err)
> +               goto err_out;
> +
> +       bpf_map_inc(new_map);
> +       rcu_assign_pointer(st_link->map, new_map);
> +       bpf_map_put(old_map);
> +
> +err_out:
> +       mutex_unlock(&update_mutex);
> +
> +       return err;
> +}
> +
>  static const struct bpf_link_ops bpf_struct_ops_map_lops = {
>         .dealloc = bpf_struct_ops_map_link_dealloc,
>         .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
>         .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
> +       .update_map = bpf_struct_ops_map_link_update,
>  };
>
>  int bpf_struct_ops_link_create(union bpf_attr *attr)
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index 5a45e3bf34e2..6fa10d108278 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -4676,6 +4676,21 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>         return ret;
>  }
>
> +static int link_update_map(struct bpf_link *link, union bpf_attr *attr)
> +{
> +       struct bpf_map *new_map;
> +       int ret = 0;
> +
> +       new_map = bpf_map_get(attr->link_update.new_map_fd);
> +       if (IS_ERR(new_map))
> +               return -EINVAL;

I was expecting a check for the BPF_F_LINK flag here. Isn't it
necessary to verify that here?
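
Something like this, perhaps (untested sketch; assuming
new_map->map_flags is accessible at this point):

	/* hypothetical check: the new map must be created with BPF_F_LINK */
	if (!(new_map->map_flags & BPF_F_LINK)) {
		bpf_map_put(new_map);
		return -EINVAL;
	}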



> +
> +       ret = link->ops->update_map(link, new_map);
> +
> +       bpf_map_put(new_map);
> +       return ret;
> +}
> +
>  #define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
>

[...]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops.
  2023-03-16  2:36 ` [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
  2023-03-17 19:42   ` Martin KaFai Lau
@ 2023-03-17 22:33   ` Andrii Nakryiko
  2023-03-18  1:17     ` Kui-Feng Lee
  1 sibling, 1 reply; 33+ messages in thread
From: Andrii Nakryiko @ 2023-03-17 22:33 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf

On Wed, Mar 15, 2023 at 7:37 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>
> Introduce bpf_link__update_map(), which allows atomically updating the
> underlying struct_ops implementation for a given struct_ops BPF link.
>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>  tools/lib/bpf/libbpf.c   | 30 ++++++++++++++++++++++++++++++
>  tools/lib/bpf/libbpf.h   |  1 +
>  tools/lib/bpf/libbpf.map |  1 +
>  3 files changed, 32 insertions(+)
>
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index 6dbae7ffab48..63ec1f8fe8a0 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -11659,6 +11659,36 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>         return &link->link;
>  }
>
> +/*
> + * Swap the backing struct_ops of a link with a new struct_ops map.
> + */
> +int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map)
> +{
> +       struct bpf_link_struct_ops *st_ops_link;
> +       __u32 zero = 0;
> +       int err, fd;
> +
> +       if (!bpf_map__is_struct_ops(map) || map->fd < 0)
> +               return -EINVAL;
> +
> +       st_ops_link = container_of(link, struct bpf_link_struct_ops, link);
> +       /* Ensure the type of a link is correct */
> +       if (st_ops_link->map_fd < 0)
> +               return -EINVAL;
> +
> +       err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
> +       if (err && err != -EBUSY)

Why is it ok to ignore -EBUSY? Let's leave a comment as well, as this
is not clear.
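
If the reason is what I suspect, a comment along these lines would help
(my guess, please confirm):

	/* A struct_ops map's value cannot be changed once it is set, so
	 * -EBUSY here should mean the map was already used to create or
	 * update a link and kern_vdata is already in place, in which
	 * case it is safe to proceed.
	 */
	err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
	if (err && err != -EBUSY)
		return err;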

> +               return err;
> +
> +       fd = bpf_link_update(link->fd, map->fd, NULL);
> +       if (fd < 0)
> +               return fd;
> +
> +       st_ops_link->map_fd = map->fd;
> +
> +       return 0;
> +}
> +
>  typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
>                                                           void *private_data);
>
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index db4992a036f8..1615e55e2e79 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -719,6 +719,7 @@ bpf_program__attach_freplace(const struct bpf_program *prog,
>  struct bpf_map;
>
>  LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
>
>  struct bpf_iter_attach_opts {
>         size_t sz; /* size of this struct for forward/backward compatibility */
> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
> index 50dde1f6521e..cc05be376257 100644
> --- a/tools/lib/bpf/libbpf.map
> +++ b/tools/lib/bpf/libbpf.map
> @@ -387,6 +387,7 @@ LIBBPF_1.2.0 {
>         global:
>                 bpf_btf_get_info_by_fd;
>                 bpf_link_get_info_by_fd;
> +               bpf_link__update_map;

nit: this should go before bpf_link_get_info_by_fd ('_' orders before 'g')
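
I.e.:

                bpf_btf_get_info_by_fd;
                bpf_link__update_map;
                bpf_link_get_info_by_fd;
                bpf_map_get_info_by_fd;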

>                 bpf_map_get_info_by_fd;
>                 bpf_prog_get_info_by_fd;
>  } LIBBPF_1.1.0;
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 7/8] libbpf: Use .struct_ops.link section to indicate a struct_ops with a link.
  2023-03-16  2:36 ` [PATCH bpf-next v7 7/8] libbpf: Use .struct_ops.link section to indicate a struct_ops with a link Kui-Feng Lee
@ 2023-03-17 22:35   ` Andrii Nakryiko
  0 siblings, 0 replies; 33+ messages in thread
From: Andrii Nakryiko @ 2023-03-17 22:35 UTC (permalink / raw)
  To: Kui-Feng Lee; +Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf

On Wed, Mar 15, 2023 at 7:38 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>
> Flag that a struct_ops is to back a bpf_link by placing it in the
> ".struct_ops.link" section.  Once it is flagged, the created
> struct_ops can be used to create a bpf_link or to update a bpf_link
> that is already backed by another struct_ops.
>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---

LGTM.

Acked-by: Andrii Nakryiko <andrii@kernel.org>

>  tools/lib/bpf/libbpf.c | 60 +++++++++++++++++++++++++++++++-----------
>  1 file changed, 44 insertions(+), 16 deletions(-)
>

[...]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm.
  2023-03-17 15:23   ` Daniel Borkmann
  2023-03-17 17:18     ` Martin KaFai Lau
@ 2023-03-17 23:07     ` Kui-Feng Lee
  1 sibling, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-17 23:07 UTC (permalink / raw)
  To: Daniel Borkmann, Kui-Feng Lee, bpf, ast, martin.lau, song,
	kernel-team, andrii, sdf
  Cc: netdev, Eric Dumazet



On 3/17/23 08:23, Daniel Borkmann wrote:
> On 3/16/23 3:36 AM, Kui-Feng Lee wrote:
>> This feature lets you immediately transition to another congestion
>> control algorithm or implementation with the same name.  Once a name
>> is updated, new connections will apply this new algorithm.
>>
>> The purpose is to update a customized algorithm implemented in BPF
>> struct_ops with a new version on the fly.  The following is an
>> example of using the userspace API implemented in later BPF patches.
>>
>>     link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
>>     .......
>>     err = bpf_link__update_map(link, skel->maps.ca_update_2);
>>
>> We first load and register an algorithm implemented in BPF struct_ops,
>> then swap it out with a new one using the same name. After that, newly
>> created connections will apply the updated algorithm, while older ones
>> retain the previous version already applied.
>>
>> This patch also takes this chance to refactor the ca validation into
>> the new tcp_validate_congestion_control() function.
>>
>> Cc: netdev@vger.kernel.org, Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   include/net/tcp.h   |  3 +++
>>   net/ipv4/tcp_cong.c | 60 +++++++++++++++++++++++++++++++++++++++------
>>   2 files changed, 56 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index db9f828e9d1e..2abb755e6a3a 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -1117,6 +1117,9 @@ struct tcp_congestion_ops {
>>   int tcp_register_congestion_control(struct tcp_congestion_ops *type);
>>   void tcp_unregister_congestion_control(struct tcp_congestion_ops *type);
>> +int tcp_update_congestion_control(struct tcp_congestion_ops *type,
>> +                  struct tcp_congestion_ops *old_type);
>> +int tcp_validate_congestion_control(struct tcp_congestion_ops *ca);
>>   void tcp_assign_congestion_control(struct sock *sk);
>>   void tcp_init_congestion_control(struct sock *sk);
>> diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
>> index db8b4b488c31..c90791ae8389 100644
>> --- a/net/ipv4/tcp_cong.c
>> +++ b/net/ipv4/tcp_cong.c
>> @@ -75,14 +75,8 @@ struct tcp_congestion_ops *tcp_ca_find_key(u32 key)
>>       return NULL;
>>   }
>> -/*
>> - * Attach new congestion control algorithm to the list
>> - * of available options.
>> - */
>> -int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
>> +int tcp_validate_congestion_control(struct tcp_congestion_ops *ca)
>>   {
>> -    int ret = 0;
>> -
>>       /* all algorithms must implement these */
>>       if (!ca->ssthresh || !ca->undo_cwnd ||
>>           !(ca->cong_avoid || ca->cong_control)) {
>> @@ -90,6 +84,20 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
>>           return -EINVAL;
>>       }
>> +    return 0;
>> +}
>> +
>> +/* Attach new congestion control algorithm to the list
>> + * of available options.
>> + */
>> +int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
>> +{
>> +    int ret;
>> +
>> +    ret = tcp_validate_congestion_control(ca);
>> +    if (ret)
>> +        return ret;
>> +
>>       ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
>>       spin_lock(&tcp_cong_list_lock);
>> @@ -130,6 +138,44 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
>>   }
>>   EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
>> +/* Replace a registered old ca with a new one.
>> + *
>> + * The new ca must have the same name as the old one, that has been
>> + * registered.
>> + */
>> +int tcp_update_congestion_control(struct tcp_congestion_ops *ca, struct tcp_congestion_ops *old_ca)
>> +{
>> +    struct tcp_congestion_ops *existing;
>> +    int ret;
>> +
>> +    ret = tcp_validate_congestion_control(ca);
>> +    if (ret)
>> +        return ret;
>> +
>> +    ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
>> +
>> +    spin_lock(&tcp_cong_list_lock);
>> +    existing = tcp_ca_find_key(old_ca->key);
>> +    if (ca->key == TCP_CA_UNSPEC || !existing || strcmp(existing->name, ca->name)) {
>> +        pr_notice("%s not registered or non-unique key\n",
>> +              ca->name);
>> +        ret = -EINVAL;
>> +    } else if (existing != old_ca) {
>> +        pr_notice("invalid old congestion control algorithm to 
>> replace\n");
>> +        ret = -EINVAL;
>> +    } else {
>> +        /* Add the new one before removing the old one to keep
>> +         * one implementation available all the time.
>> +         */
>> +        list_add_tail_rcu(&ca->list, &tcp_cong_list);
>> +        list_del_rcu(&existing->list);
>> +        pr_debug("%s updated\n", ca->name);
>> +    }
>> +    spin_unlock(&tcp_cong_list_lock);
>> +
>> +    return ret;
>> +}
> 
> Was wondering whether tcp_register_congestion_control and
> tcp_update_congestion_control could be refactored for reuse. Maybe
> like below. From the function itself it is not clear whether callers
> that replace an existing one should do the synchronize_rcu()
> themselves or whether this should be part of
> tcp_update_congestion_control.
> 
> int tcp_check_congestion_control(struct tcp_congestion_ops *ca)
> {
>      /* All algorithms must implement these. */
>      if (!ca->ssthresh || !ca->undo_cwnd ||
>          !(ca->cong_avoid || ca->cong_control)) {
>          pr_err("%s does not implement required ops\n", ca->name);
>          return -EINVAL;
>      }
>      if (ca->key == TCP_CA_UNSPEC)
>          ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
>      if (ca->key == TCP_CA_UNSPEC) {
>          pr_notice("%s results in zero key\n", ca->name);
>          return -EEXIST;
>      }
>      return 0;
> }
> 
> /* Attach new congestion control algorithm to the list of available
>   * options or replace an existing one if old is non-NULL.
>   */
> int tcp_update_congestion_control(struct tcp_congestion_ops *new,
>                    struct tcp_congestion_ops *old)
> {
>      struct tcp_congestion_ops *found;
>      int ret;
> 
>      ret = tcp_check_congestion_control(new);
>      if (ret)
>          return ret;
>      if (old &&
>          (old->key != new->key ||
>           strcmp(old->name, new->name))) {
>          pr_notice("%s & %s have non-matching congestion control names\n",
>                old->name, new->name);
>          return -EINVAL;
>      }
>      spin_lock(&tcp_cong_list_lock);
>      found = tcp_ca_find_key(new->key);
>      if (old) {
>          if (found == old) {
>              list_add_tail_rcu(&new->list, &tcp_cong_list);
>              list_del_rcu(&old->list);
>          } else {
>              pr_notice("%s not registered\n", old->name);
>              ret = -EINVAL;
>          }
>      } else {
>          if (found) {
>              pr_notice("%s already registered\n", new->name);
>              ret = -EEXIST;
>          } else {
>              list_add_tail_rcu(&new->list, &tcp_cong_list);
>          }
>      }
>      spin_unlock(&tcp_cong_list_lock);
>      return ret;
> }

After trying this refactoring, I found it only shares a few lines of
code but makes things more complicated by adding extra checks.  So I
prefer to keep it as it is.  What do you think?


> 
> int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
> {
>      return tcp_update_congestion_control(ca, NULL);
> }
> EXPORT_SYMBOL_GPL(tcp_register_congestion_control);

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops().
  2023-03-17 22:23   ` Andrii Nakryiko
@ 2023-03-17 23:48     ` Kui-Feng Lee
  0 siblings, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-17 23:48 UTC (permalink / raw)
  To: Andrii Nakryiko, Kui-Feng Lee
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf



On 3/17/23 15:23, Andrii Nakryiko wrote:
> On Wed, Mar 15, 2023 at 7:37 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>>
>> bpf_map__attach_struct_ops() used to create a dummy bpf_link as a
>> placeholder, but now it constructs a real one by calling
>> bpf_link_create() if the map has the BPF_F_LINK flag.
>>
>> You can flag a struct_ops map with BPF_F_LINK by calling
>> bpf_map__set_map_flags().
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   tools/lib/bpf/libbpf.c | 90 +++++++++++++++++++++++++++++++-----------
>>   1 file changed, 66 insertions(+), 24 deletions(-)
>>
> 
> [...]
> 
>> -               if (!prog)
>> -                       continue;
>> +       link->link.detach = bpf_link__detach_struct_ops;
>>
>> -               prog_fd = bpf_program__fd(prog);
>> -               kern_data = st_ops->kern_vdata + st_ops->kern_func_off[i];
>> -               *(unsigned long *)kern_data = prog_fd;
>> +       if (!(map->def.map_flags & BPF_F_LINK)) {
>> +               /* w/o a real link */
>> +               link->link.fd = map->fd;
>> +               link->map_fd = -1;
>> +               return &link->link;
>>          }
>>
>> -       err = bpf_map_update_elem(map->fd, &zero, st_ops->kern_vdata, 0);
>> -       if (err) {
>> -               err = -errno;
>> +       fd = bpf_link_create(map->fd, -1, BPF_STRUCT_OPS, NULL);
> 
> pass 0, not -1. BPF APIs have a convention that fd=0 means "no fd was
> provided". And actually the kernel should have rejected this -1, so
> please check why that didn't happen in your testing; we might be
> missing some kernel validation.

Oh! probe_perf_link() also passes -1.
I will fix it.

> 
> 
>> +       if (fd < 0) {
>>                  free(link);
>> -               return libbpf_err_ptr(err);
>> +               return libbpf_err_ptr(fd);
>>          }
>>
>> -       link->detach = bpf_link__detach_struct_ops;
>> -       link->fd = map->fd;
>> +       link->link.fd = fd;
>> +       link->map_fd = map->fd;
>>
>> -       return link;
>> +       return &link->link;
>>   }
>>
>>   typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link.
  2023-03-17 22:27   ` Andrii Nakryiko
@ 2023-03-18  0:41     ` Kui-Feng Lee
  0 siblings, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-18  0:41 UTC (permalink / raw)
  To: Andrii Nakryiko, Kui-Feng Lee
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf



On 3/17/23 15:27, Andrii Nakryiko wrote:
> On Wed, Mar 15, 2023 at 7:37 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>>
>> Improve the BPF_LINK_UPDATE command of bpf() to allow conveniently
>> switching between different struct_ops on a single bpf_link. This
>> enables smoother transitions from one struct_ops to another.
>>
>> The struct_ops maps passed along with BPF_LINK_UPDATE should have the
>> BPF_F_LINK flag.
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   include/linux/bpf.h            |  2 ++
>>   include/uapi/linux/bpf.h       |  8 +++++--
>>   kernel/bpf/bpf_struct_ops.c    | 38 ++++++++++++++++++++++++++++++++++
>>   kernel/bpf/syscall.c           | 20 ++++++++++++++++++
>>   net/bpf/bpf_dummy_struct_ops.c |  6 ++++++
>>   net/ipv4/bpf_tcp_ca.c          |  6 ++++++
>>   tools/include/uapi/linux/bpf.h |  8 +++++--
>>   7 files changed, 84 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 455b14bf8f28..56e6ab7559ef 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1474,6 +1474,7 @@ struct bpf_link_ops {
>>          void (*show_fdinfo)(const struct bpf_link *link, struct seq_file *seq);
>>          int (*fill_link_info)(const struct bpf_link *link,
>>                                struct bpf_link_info *info);
>> +       int (*update_map)(struct bpf_link *link, struct bpf_map *new_map);
>>   };
>>
>>   struct bpf_tramp_link {
>> @@ -1516,6 +1517,7 @@ struct bpf_struct_ops {
>>                             void *kdata, const void *udata);
>>          int (*reg)(void *kdata);
>>          void (*unreg)(void *kdata);
>> +       int (*update)(void *kdata, void *old_kdata);
>>          int (*validate)(void *kdata);
>>          const struct btf_type *type;
>>          const struct btf_type *value_type;
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 42f40ee083bf..24e1dec4ad97 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -1555,8 +1555,12 @@ union bpf_attr {
>>
>>          struct { /* struct used by BPF_LINK_UPDATE command */
>>                  __u32           link_fd;        /* link fd */
>> -               /* new program fd to update link with */
>> -               __u32           new_prog_fd;
>> +               union {
>> +                       /* new program fd to update link with */
>> +                       __u32           new_prog_fd;
>> +                       /* new struct_ops map fd to update link with */
>> +                       __u32           new_map_fd;
>> +               };
>>                  __u32           flags;          /* extra flags */
>>                  /* expected link's program fd; is specified only if
>>                   * BPF_F_REPLACE flag is set in flags */
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index 8ce6c7581ca3..5a9e10b92423 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -807,10 +807,48 @@ static int bpf_struct_ops_map_link_fill_link_info(const struct bpf_link *link,
>>          return 0;
>>   }
>>
>> +static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map *new_map)
>> +{
>> +       struct bpf_struct_ops_map *st_map, *old_st_map;
>> +       struct bpf_struct_ops_link *st_link;
>> +       struct bpf_map *old_map;
>> +       int err = 0;
>> +
>> +       st_link = container_of(link, struct bpf_struct_ops_link, link);
>> +       st_map = container_of(new_map, struct bpf_struct_ops_map, map);
>> +
>> +       if (!bpf_struct_ops_valid_to_reg(new_map))
>> +               return -EINVAL;
>> +
>> +       mutex_lock(&update_mutex);
>> +
>> +       old_map = rcu_dereference_protected(st_link->map, lockdep_is_held(&update_mutex));
>> +       old_st_map = container_of(old_map, struct bpf_struct_ops_map, map);
>> +       /* The new and old struct_ops must be the same type. */
>> +       if (st_map->st_ops != old_st_map->st_ops) {
>> +               err = -EINVAL;
>> +               goto err_out;
>> +       }
>> +
>> +       err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
>> +       if (err)
>> +               goto err_out;
>> +
>> +       bpf_map_inc(new_map);
>> +       rcu_assign_pointer(st_link->map, new_map);
>> +       bpf_map_put(old_map);
>> +
>> +err_out:
>> +       mutex_unlock(&update_mutex);
>> +
>> +       return err;
>> +}
>> +
>>   static const struct bpf_link_ops bpf_struct_ops_map_lops = {
>>          .dealloc = bpf_struct_ops_map_link_dealloc,
>>          .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
>>          .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
>> +       .update_map = bpf_struct_ops_map_link_update,
>>   };
>>
>>   int bpf_struct_ops_link_create(union bpf_attr *attr)
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 5a45e3bf34e2..6fa10d108278 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -4676,6 +4676,21 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>>          return ret;
>>   }
>>
>> +static int link_update_map(struct bpf_link *link, union bpf_attr *attr)
>> +{
>> +       struct bpf_map *new_map;
>> +       int ret = 0;
>> +
>> +       new_map = bpf_map_get(attr->link_update.new_map_fd);
>> +       if (IS_ERR(new_map))
>> +               return -EINVAL;
> 
> I was expecting a check for the BPF_F_LINK flag here. Isn't it
> necessary to verify that here?


link->ops->update_map != NULL implies BPF_F_LINK, so no additional
check for BPF_F_LINK is done here.
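
Also, bpf_struct_ops_map_link_update() calls
bpf_struct_ops_valid_to_reg() on the new map before doing anything
else, and that helper (added in patch 3) checks the flag itself,
roughly like this (paraphrased sketch, not a verbatim quote):

	static bool bpf_struct_ops_valid_to_reg(struct bpf_map *map)
	{
		/* the actual helper also checks that the map's kernel
		 * value is in the READY state */
		return map->map_type == BPF_MAP_TYPE_STRUCT_OPS &&
		       (map->map_flags & BPF_F_LINK);
	}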


> 
> 
> 
>> +
>> +       ret = link->ops->update_map(link, new_map);
>> +
>> +       bpf_map_put(new_map);
>> +       return ret;
>> +}
>> +
>>   #define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
>>
> 
> [...]

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link.
  2023-03-17 19:23   ` Martin KaFai Lau
  2023-03-17 21:39     ` Kui-Feng Lee
@ 2023-03-18  1:11     ` Kui-Feng Lee
  2023-03-18  5:38       ` Martin KaFai Lau
  1 sibling, 1 reply; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-18  1:11 UTC (permalink / raw)
  To: Martin KaFai Lau, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf



On 3/17/23 12:23, Martin KaFai Lau wrote:
> On 3/15/23 7:36 PM, Kui-Feng Lee wrote:
>> +static int bpf_struct_ops_map_link_update(struct bpf_link *link, struct bpf_map *new_map)
>> +{
>> +    struct bpf_struct_ops_map *st_map, *old_st_map;
>> +    struct bpf_struct_ops_link *st_link;
>> +    struct bpf_map *old_map;
>> +    int err = 0;
>> +
>> +    st_link = container_of(link, struct bpf_struct_ops_link, link);
>> +    st_map = container_of(new_map, struct bpf_struct_ops_map, map);
>> +
>> +    if (!bpf_struct_ops_valid_to_reg(new_map))
>> +        return -EINVAL;
>> +
>> +    mutex_lock(&update_mutex);
>> +
>> +    old_map = rcu_dereference_protected(st_link->map, lockdep_is_held(&update_mutex));
>> +    old_st_map = container_of(old_map, struct bpf_struct_ops_map, map);
>> +    /* The new and old struct_ops must be the same type. */
>> +    if (st_map->st_ops != old_st_map->st_ops) {
>> +        err = -EINVAL;
>> +        goto err_out;
>> +    }
>> +
>> +    err = st_map->st_ops->update(st_map->kvalue.data, old_st_map->kvalue.data);
> 
> I don't think it has completely addressed Andrii's comment in v4 
> regarding BPF_F_REPLACE: 
> https://lore.kernel.org/bpf/CAEf4BzbK8s+VFG5HefydD7CRLzkRFKg-Er0PKV_-C2-yttfXzA@mail.gmail.com/
> 
> For now, tcp_update_congestion_control() enforces the same cc-name.
> However, it is still not the same as what BPF_F_REPLACE is intended to
> do: update only when it is the same old map. The same cc-name does not
> necessarily mean the same old map.
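> 
> I.e., something along these lines (untested sketch; the old_map_fd
> field does not exist yet, its name is only illustrative):
> 
> 	if (attr->link_update.flags & BPF_F_REPLACE) {
> 		old_map = bpf_map_get(attr->link_update.old_map_fd);
> 		if (IS_ERR(old_map))
> 			return PTR_ERR(old_map);
> 		/* and fail the update unless st_link->map == old_map */
> 	}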
> 
>> +    if (err)
>> +        goto err_out;
>> +
>> +    bpf_map_inc(new_map);
>> +    rcu_assign_pointer(st_link->map, new_map);
>> +    bpf_map_put(old_map);
>> +
>> +err_out:
>> +    mutex_unlock(&update_mutex);
>> +
>> +    return err;
>> +}
>> +
>>   static const struct bpf_link_ops bpf_struct_ops_map_lops = {
>>       .dealloc = bpf_struct_ops_map_link_dealloc,
>>       .show_fdinfo = bpf_struct_ops_map_link_show_fdinfo,
>>       .fill_link_info = bpf_struct_ops_map_link_fill_link_info,
>> +    .update_map = bpf_struct_ops_map_link_update,
>>   };
>>   int bpf_struct_ops_link_create(union bpf_attr *attr)
>> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index 5a45e3bf34e2..6fa10d108278 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -4676,6 +4676,21 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
>>       return ret;
>>   }
>> +static int link_update_map(struct bpf_link *link, union bpf_attr *attr)
>> +{
>> +    struct bpf_map *new_map;
>> +    int ret = 0;
> 
> nit. init zero is unnecessarily.
> 
>> +
>> +    new_map = bpf_map_get(attr->link_update.new_map_fd);
>> +    if (IS_ERR(new_map))
>> +        return -EINVAL;
>> +
>> +    ret = link->ops->update_map(link, new_map);
>> +
>> +    bpf_map_put(new_map);
>> +    return ret;
>> +}
>> +
>>   #define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
>>   static int link_update(union bpf_attr *attr)
>> @@ -4696,6 +4711,11 @@ static int link_update(union bpf_attr *attr)
>>       if (IS_ERR(link))
>>           return PTR_ERR(link);
>> +    if (link->ops->update_map) {
>> +        ret = link_update_map(link, attr);
>> +        goto out_put_link;
>> +    }
>> +
>>       new_prog = bpf_prog_get(attr->link_update.new_prog_fd);
>>       if (IS_ERR(new_prog)) {
>>           ret = PTR_ERR(new_prog);
>> diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c
>> index ff4f89a2b02a..158f14e240d0 100644
>> --- a/net/bpf/bpf_dummy_struct_ops.c
>> +++ b/net/bpf/bpf_dummy_struct_ops.c
>> @@ -222,12 +222,18 @@ static void bpf_dummy_unreg(void *kdata)
>>   {
>>   }
>> +static int bpf_dummy_update(void *kdata, void *old_kdata)
>> +{
>> +    return -EOPNOTSUPP;
>> +}
>> +
>>   struct bpf_struct_ops bpf_bpf_dummy_ops = {
>>       .verifier_ops = &bpf_dummy_verifier_ops,
>>       .init = bpf_dummy_init,
>>       .check_member = bpf_dummy_ops_check_member,
>>       .init_member = bpf_dummy_init_member,
>>       .reg = bpf_dummy_reg,
>> +    .update = bpf_dummy_update,
> 
> When looking at this together in patch 5, the changes in 
> bpf_dummy_struct_ops.c should not be needed.

I don't follow you.
If we don't assign a function to .update, it will fail in
bpf_struct_ops_map_link_update(). Of course, I can add a check
in bpf_struct_ops_map_link_update() to return an error if .update
is NULL.
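
E.g., a sketch of that alternative, placed near the top of
bpf_struct_ops_map_link_update():

	/* reject struct_ops that do not implement ->update instead of
	 * requiring a dummy callback */
	if (!st_map->st_ops->update)
		return -EOPNOTSUPP;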


> 
> 

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops.
  2023-03-17 22:33   ` Andrii Nakryiko
@ 2023-03-18  1:17     ` Kui-Feng Lee
  0 siblings, 0 replies; 33+ messages in thread
From: Kui-Feng Lee @ 2023-03-18  1:17 UTC (permalink / raw)
  To: Andrii Nakryiko, Kui-Feng Lee
  Cc: bpf, ast, martin.lau, song, kernel-team, andrii, sdf



On 3/17/23 15:33, Andrii Nakryiko wrote:
> On Wed, Mar 15, 2023 at 7:37 PM Kui-Feng Lee <kuifeng@meta.com> wrote:
>>
>> Introduce bpf_link__update_map(), which allows atomically updating the
>> underlying struct_ops implementation for a given struct_ops BPF link.
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>> ---
>>   tools/lib/bpf/libbpf.c   | 30 ++++++++++++++++++++++++++++++
>>   tools/lib/bpf/libbpf.h   |  1 +
>>   tools/lib/bpf/libbpf.map |  1 +
>>   3 files changed, 32 insertions(+)
>>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index 6dbae7ffab48..63ec1f8fe8a0 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -11659,6 +11659,36 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>>          return &link->link;
>>   }
>>
>> +/*
>> + * Swap the backing struct_ops of a link with a new struct_ops map.
>> + */
>> +int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map)
>> +{
>> +       struct bpf_link_struct_ops *st_ops_link;
>> +       __u32 zero = 0;
>> +       int err, fd;
>> +
>> +       if (!bpf_map__is_struct_ops(map) || map->fd < 0)
>> +               return -EINVAL;
>> +
>> +       st_ops_link = container_of(link, struct bpf_link_struct_ops, link);
>> +       /* Ensure the type of a link is correct */
>> +       if (st_ops_link->map_fd < 0)
>> +               return -EINVAL;
>> +
>> +       err = bpf_map_update_elem(map->fd, &zero, map->st_ops->kern_vdata, 0);
>> +       if (err && err != -EBUSY)
> 
> Why is it ok to ignore -EBUSY? Let's leave a comment as well, as this
> is not clear.

got it!

> 
>> +               return err;
>> +
>> +       fd = bpf_link_update(link->fd, map->fd, NULL);
>> +       if (fd < 0)
>> +               return fd;
>> +
>> +       st_ops_link->map_fd = map->fd;
>> +
>> +       return 0;
>> +}
>> +
>>   typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr,
>>                                                            void *private_data);
>>
>> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
>> index db4992a036f8..1615e55e2e79 100644
>> --- a/tools/lib/bpf/libbpf.h
>> +++ b/tools/lib/bpf/libbpf.h
>> @@ -719,6 +719,7 @@ bpf_program__attach_freplace(const struct bpf_program *prog,
>>   struct bpf_map;
>>
>>   LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
>> +LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
>>
>>   struct bpf_iter_attach_opts {
>>          size_t sz; /* size of this struct for forward/backward compatibility */
>> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
>> index 50dde1f6521e..cc05be376257 100644
>> --- a/tools/lib/bpf/libbpf.map
>> +++ b/tools/lib/bpf/libbpf.map
>> @@ -387,6 +387,7 @@ LIBBPF_1.2.0 {
>>          global:
>>                  bpf_btf_get_info_by_fd;
>>                  bpf_link_get_info_by_fd;
>> +               bpf_link__update_map;
> 
> nit: this should go before bpf_link_get_info_by_fd ('_' orders before 'g')

ok!

> 
>>                  bpf_map_get_info_by_fd;
>>                  bpf_prog_get_info_by_fd;
>>   } LIBBPF_1.1.0;
>> --
>> 2.34.1
>>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link.
  2023-03-18  1:11     ` Kui-Feng Lee
@ 2023-03-18  5:38       ` Martin KaFai Lau
  0 siblings, 0 replies; 33+ messages in thread
From: Martin KaFai Lau @ 2023-03-18  5:38 UTC (permalink / raw)
  To: Kui-Feng Lee, Kui-Feng Lee; +Cc: bpf, ast, song, kernel-team, andrii, sdf

On 3/17/23 6:11 PM, Kui-Feng Lee wrote:
>>> --- a/net/bpf/bpf_dummy_struct_ops.c
>>> +++ b/net/bpf/bpf_dummy_struct_ops.c
>>> @@ -222,12 +222,18 @@ static void bpf_dummy_unreg(void *kdata)
>>>   {
>>>   }
>>> +static int bpf_dummy_update(void *kdata, void *old_kdata)
>>> +{
>>> +    return -EOPNOTSUPP;
>>> +}
>>> +
>>>   struct bpf_struct_ops bpf_bpf_dummy_ops = {
>>>       .verifier_ops = &bpf_dummy_verifier_ops,
>>>       .init = bpf_dummy_init,
>>>       .check_member = bpf_dummy_ops_check_member,
>>>       .init_member = bpf_dummy_init_member,
>>>       .reg = bpf_dummy_reg,
>>> +    .update = bpf_dummy_update,
>>
>> When looking at this together in patch 5, the changes in 
>> bpf_dummy_struct_ops.c should not be needed.
> 
> I don't follow you.
> If we don't assign a function to .update, it will fail in
> bpf_struct_ops_map_link_update(). Of course, I can add a check
> in bpf_struct_ops_map_link_update() to return an error if .update
> is NULL.

Yes, testing ->update in bpf_struct_ops.c is better, but not in
bpf_struct_ops_map_link_update() (more on this below). The dummy
struct_ops is not needed to test the link; it was created to catch
trampoline image issues with ops/funcs having different args and return
values.

It is better to enforce from the beginning that a BPF_F_LINK struct_ops
must support both ->validate and ->update; this can be revisited later.
The current '->validate' testing in bpf_struct_ops_map_update_elem() in
patch 3 is too late.  Being able to '->validate' is particularly
important for BPF_F_LINK struct_ops.  Testing '->validate' and
'->update' in bpf_struct_ops_init() would be too strict for now, though,
considering other ongoing efforts to support struct_ops in other
subsystems.  A better place to test both '->validate' and '->update' is
bpf_struct_ops_map_alloc(), which should return -ENOTSUPP when trying
to create a struct_ops BPF_F_LINK map without st_ops->validate and
st_ops->update.
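
I.e., something like (untested sketch):

	/* refuse a BPF_F_LINK struct_ops map whose st_ops does not
	 * implement both callbacks */
	if ((attr->map_flags & BPF_F_LINK) &&
	    (!st_ops->validate || !st_ops->update))
		return ERR_PTR(-ENOTSUPP);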

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2023-03-18  5:38 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-16  2:36 [PATCH bpf-next v7 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
2023-03-16  2:36 ` [PATCH bpf-next v7 1/8] bpf: Retire the struct_ops map kvalue->refcnt Kui-Feng Lee
2023-03-17 16:47   ` Martin KaFai Lau
2023-03-17 20:41     ` Kui-Feng Lee
2023-03-16  2:36 ` [PATCH bpf-next v7 2/8] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
2023-03-17 15:23   ` Daniel Borkmann
2023-03-17 17:18     ` Martin KaFai Lau
2023-03-17 17:23       ` Daniel Borkmann
2023-03-17 21:46         ` Kui-Feng Lee
2023-03-17 23:07     ` Kui-Feng Lee
2023-03-16  2:36 ` [PATCH bpf-next v7 3/8] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
2023-03-17 18:10   ` Martin KaFai Lau
2023-03-17 20:52     ` Kui-Feng Lee
2023-03-16  2:36 ` [PATCH bpf-next v7 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
2023-03-17 18:44   ` Martin KaFai Lau
2023-03-17 21:00     ` Kui-Feng Lee
2023-03-17 22:23   ` Andrii Nakryiko
2023-03-17 23:48     ` Kui-Feng Lee
2023-03-16  2:36 ` [PATCH bpf-next v7 5/8] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
2023-03-17 19:23   ` Martin KaFai Lau
2023-03-17 21:39     ` Kui-Feng Lee
2023-03-18  1:11     ` Kui-Feng Lee
2023-03-18  5:38       ` Martin KaFai Lau
2023-03-17 22:27   ` Andrii Nakryiko
2023-03-18  0:41     ` Kui-Feng Lee
2023-03-16  2:36 ` [PATCH bpf-next v7 6/8] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
2023-03-17 19:42   ` Martin KaFai Lau
2023-03-17 21:40     ` Kui-Feng Lee
2023-03-17 22:33   ` Andrii Nakryiko
2023-03-18  1:17     ` Kui-Feng Lee
2023-03-16  2:36 ` [PATCH bpf-next v7 7/8] libbpf: Use .struct_ops.link section to indicate a struct_ops with a link Kui-Feng Lee
2023-03-17 22:35   ` Andrii Nakryiko
2023-03-16  2:36 ` [PATCH bpf-next v7 8/8] selftests/bpf: Test switching TCP Congestion Control algorithms Kui-Feng Lee
