All of lore.kernel.org
 help / color / mirror / Atom feed
From: Martin KaFai Lau <martin.lau@linux.dev>
To: Kui-Feng Lee <kuifeng@meta.com>
Cc: bpf@vger.kernel.org, ast@kernel.org, song@kernel.org,
	kernel-team@meta.com, andrii@kernel.org, sdf@google.com
Subject: Re: [PATCH bpf-next v6 3/8] bpf: Create links for BPF struct_ops maps.
Date: Mon, 13 Mar 2023 18:42:06 -0700	[thread overview]
Message-ID: <6586775f-079e-7c21-9747-651be221dcd1@linux.dev> (raw)
In-Reply-To: <20230310043812.3087672-4-kuifeng@meta.com>

On 3/9/23 8:38 PM, Kui-Feng Lee wrote:
> BPF struct_ops maps are employed directly to register TCP Congestion
> Control algorithms. Unlike other BPF programs that terminate when
> their links gone. The link of a BPF struct_ops map provides a uniform
> experience akin to other types of BPF programs.

This part is a little confusing to read. I think it is trying to explain how the 
current bpf struct_ops uses update_elem to do both "update" and "register". It 
was done before the bpf_link was introduced. With bpf_link, the prog attach is 
done at the link creation time and prog detach is done when the link is gone. It 
is a more consistent experience to do the same for bpf struct_ops: attach 
(register) bpf_struct_ops during bpf_link creation and detach (unregister) when 
the link is gone.  This patch adds a new link type BPF_LINK_TYPE_STRUCT_OPS for 
attaching a bpf struct_ops to a subsystem.

> 
> bpf_links are responsible for registering their associated
> struct_ops. You can only use a struct_ops that has the BPF_F_LINK flag
> set to create a bpf_link, while a structs without this flag behaves in
> the same manner as before and is registered upon updating its value.
> 
> The BPF_LINK_TYPE_STRUCT_OPS serves a dual purpose. Not only is it
> used to craft the links for BPF struct_ops programs, but also to
> create links for BPF struct_ops them-self.  Since the links of BPF
> struct_ops programs are only used to create trampolines internally,
> they are never seen in other contexts. Thus, they can be reused for
> struct_ops themself.
> 
> To maintain a reference to the map supporting this link, we add
> bpf_struct_ops_link as an additional type. The pointer of the map is
> RCU and won't be necessary until later in the patchset.
> 

[ ... ]

> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 239cc0e2639c..2abb755e6a3a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -1119,6 +1119,7 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *type);
>   void tcp_unregister_congestion_control(struct tcp_congestion_ops *type);
>   int tcp_update_congestion_control(struct tcp_congestion_ops *type,
>   				  struct tcp_congestion_ops *old_type);
> +int tcp_validate_congestion_control(struct tcp_congestion_ops *ca);

I may not be clear in comment in v5. This is also tcp_cong.c changes and belongs 
to patch 2.

[ ... ]

> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index ab7811a4c1dd..888d6aefc31a 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -16,6 +16,7 @@ enum bpf_struct_ops_state {
>   	BPF_STRUCT_OPS_STATE_INIT,
>   	BPF_STRUCT_OPS_STATE_INUSE,
>   	BPF_STRUCT_OPS_STATE_TOBEFREE,
> +	BPF_STRUCT_OPS_STATE_READY,
>   };
>   
>   #define BPF_STRUCT_OPS_COMMON_VALUE			\
> @@ -504,11 +505,25 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>   		*(unsigned long *)(udata + moff) = prog->aux->id;
>   	}
>   
> -	bpf_map_inc(map);
> +	if (st_map->map.map_flags & BPF_F_LINK) {
> +		if (st_ops->validate) {
> +			err = st_ops->validate(kdata);
> +			if (err)
> +				goto reset_unlock;
> +		}
> +		set_memory_rox((long)st_map->image, 1);
> +		/* Let bpf_link handle registration & unregistration.
> +		 *
> +		 * Pair with smp_load_acquire() during lookup_elem().
> +		 */
> +		smp_store_release(&kvalue->state, BPF_STRUCT_OPS_STATE_READY);
> +		goto unlock;
> +	}
>   
>   	set_memory_rox((long)st_map->image, 1);
>   	err = st_ops->reg(kdata);
>   	if (likely(!err)) {
> +		bpf_map_inc(map);

The bpf_map_inc(map) line-move for the non BPF_F_LINK case has been spinning in 
my head since v5 because the bpf_map_inc is now done after publishing the map in 
reg(). I think it works considering only delete_elem() can remove this map at 
this point and delete_elem() cannot be run now. It is tricky, so please help to 
add some comments here.


>   		/* Pair with smp_load_acquire() during lookup_elem().
>   		 * It ensures the above udata updates (e.g. prog->aux->id)
>   		 * can be seen once BPF_STRUCT_OPS_STATE_INUSE is set.
> @@ -524,7 +539,6 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key,
>   	 */
>   	set_memory_nx((long)st_map->image, 1);
>   	set_memory_rw((long)st_map->image, 1);
> -	bpf_map_put(map);
>   
>   reset_unlock:
>   	bpf_struct_ops_map_put_progs(st_map);
> @@ -542,6 +556,9 @@ static int bpf_struct_ops_map_delete_elem(struct bpf_map *map, void *key)
>   	struct bpf_struct_ops_map *st_map;
>   
>   	st_map = (struct bpf_struct_ops_map *)map;
> +	if (st_map->map.map_flags & BPF_F_LINK)
> +		return -EOPNOTSUPP;
> +
>   	prev_state = cmpxchg(&st_map->kvalue.state,
>   			     BPF_STRUCT_OPS_STATE_INUSE,
>   			     BPF_STRUCT_OPS_STATE_TOBEFREE);
> @@ -609,7 +626,7 @@ static void bpf_struct_ops_map_free(struct bpf_map *map)
>   static int bpf_struct_ops_map_alloc_check(union bpf_attr *attr)
>   {
>   	if (attr->key_size != sizeof(unsigned int) || attr->max_entries != 1 ||
> -	    attr->map_flags || !attr->btf_vmlinux_value_type_id)
> +	    (attr->map_flags & ~BPF_F_LINK) || !attr->btf_vmlinux_value_type_id)
>   		return -EINVAL;
>   	return 0;
>   }
> @@ -720,3 +737,113 @@ void bpf_struct_ops_put(const void *kdata)
>   
>   	bpf_map_put(&st_map->map);
>   }
> +
> +static bool bpf_struct_ops_valid_to_reg(struct bpf_map *map)
> +{
> +	struct bpf_struct_ops_map *st_map = (struct bpf_struct_ops_map *)map;
> +
> +	return map->map_type == BPF_MAP_TYPE_STRUCT_OPS &&
> +		map->map_flags & BPF_F_LINK &&
> +		/* Pair with smp_store_release() during map_update */
> +		smp_load_acquire(&st_map->kvalue.state) == BPF_STRUCT_OPS_STATE_READY;
> +}
> +
> +static void bpf_struct_ops_map_link_dealloc(struct bpf_link *link)
> +{
> +	struct bpf_struct_ops_link *st_link;
> +	struct bpf_struct_ops_map *st_map;
> +
> +	st_link = container_of(link, struct bpf_struct_ops_link, link);
> +	st_map = (struct bpf_struct_ops_map *)
> +		rcu_dereference_protected(st_link->map, true);
> +	if (st_map) {
> +		/* st_link->map can be NULL if
> +		 * bpf_struct_ops_link_create() fails to register.
> +		 */

Thanks for the comment. This helps the review a lot.

> +		st_map->st_ops->unreg(&st_map->kvalue.data);
> +		bpf_map_put(&st_map->map);
> +	}
> +	kfree(st_link);
> +}
> +

[ ... ]

> +int bpf_struct_ops_link_create(union bpf_attr *attr)
> +{
> +	struct bpf_struct_ops_link *link = NULL;
> +	struct bpf_link_primer link_primer;
> +	struct bpf_struct_ops_map *st_map;
> +	struct bpf_map *map;
> +	int err;
> +
> +	map = bpf_map_get(attr->link_create.map_fd);
> +	if (!map)
> +		return -EINVAL;
> +
> +	st_map = (struct bpf_struct_ops_map *)map;
> +
> +	if (!bpf_struct_ops_valid_to_reg(map)) {
> +		err = -EINVAL;
> +		goto err_out;
> +	}
> +
> +	link = kzalloc(sizeof(*link), GFP_USER);
> +	if (!link) {
> +		err = -ENOMEM;
> +		goto err_out;
> +	}
> +	bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL);
> +	RCU_INIT_POINTER(link->map, map);
> +
> +	err = bpf_link_prime(&link->link, &link_primer);
> +	if (err)
> +		goto err_out;
> +
> +	err = st_map->st_ops->reg(st_map->kvalue.data);
> +	if (err) {
> +		/* No RCU since no one has a chance to read this pointer yet. */
> +		link->map = NULL;

RCU_INIT_POINTER(link->map, NULL). Otherwise, it will have the same sparse warning.

Others lgtm.

> +		bpf_link_cleanup(&link_primer);
> +		link = NULL;
> +		goto err_out;
> +	}
> +
> +	return bpf_link_settle(&link_primer);
> +
> +err_out:
> +	bpf_map_put(map);
> +	kfree(link);
> +	return err;
> +}
> +



  reply	other threads:[~2023-03-14  1:50 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-10  4:38 [PATCH bpf-next v6 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 1/8] bpf: Retire the struct_ops map kvalue->refcnt Kui-Feng Lee
2023-03-14  6:05   ` Martin KaFai Lau
2023-03-10  4:38 ` [PATCH bpf-next v6 2/8] net: Update an existing TCP congestion control algorithm Kui-Feng Lee
2023-03-10 16:47   ` Stephen Hemminger
2023-03-13 15:46     ` Kui-Feng Lee
2023-03-13 16:43       ` Kui-Feng Lee
2023-03-14  0:28   ` Martin KaFai Lau
2023-03-14  4:31     ` Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 3/8] bpf: Create links for BPF struct_ops maps Kui-Feng Lee
2023-03-14  1:42   ` Martin KaFai Lau [this message]
2023-03-16  0:21     ` Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 4/8] libbpf: Create a bpf_link in bpf_map__attach_struct_ops() Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 5/8] bpf: Update the struct_ops of a bpf_link Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 6/8] libbpf: Update a bpf_link with another struct_ops Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 7/8] libbpf: Use .struct_ops.link section to indicate a struct_ops with a link Kui-Feng Lee
2023-03-10  4:38 ` [PATCH bpf-next v6 8/8] selftests/bpf: Test switching TCP Congestion Control algorithms Kui-Feng Lee
2023-03-14  5:04   ` Martin KaFai Lau
2023-03-10 16:28 ` [PATCH bpf-next v6 0/8] Transit between BPF TCP congestion controls Kui-Feng Lee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6586775f-079e-7c21-9747-651be221dcd1@linux.dev \
    --to=martin.lau@linux.dev \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=kernel-team@meta.com \
    --cc=kuifeng@meta.com \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.