All of lore.kernel.org
 help / color / mirror / Atom feed
* Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
@ 2023-08-03 13:19 Hou Tao
  2023-08-03 13:28 ` Hou Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Tao @ 2023-08-03 13:19 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Joanne Koong,
	Kumar Kartikeya Dwivedi

Hi,

I am preparing for qp-trie v4, but I need some help on how to support
variable-sized key in bpf syscall. The implementation of qp-trie needs
to distinguish between dynptr key from bpf program and variable-sized
key from bpf syscall. In v3, I added a new dynptr type:
BPF_DYNPTR_TYPE_USER for variable-sized key from bpf syscall [0], so
both bpf program and bpf syscall will use the same type to represent the
variable-sized key, but Andrii thought ptr+size tuple was simpler and
would be enough for user APIs, so in v4, the type of key for bpf program
and syscall will be different. One way to handle that is to add a new
parameter in .map_lookup_elem()/.map_delete_elem()/.map_update_elem() to
tell whether the key comes from bpf program or syscall or introduce new
APIs in bpf_map_ops for variable-sized key related syscall, but I think
it will introduce too much churn. Considering that the size of
bpf_dynptr_kern is 8-bytes aligned, so I think maybe I could reuse the
lowest 1-bit of key pointer to tell qp-trie whether or not it is a
bpf_dynptr_kern or a variable-sized key pointer from syscall. For
bpf_dynptr_kern, because it is 8B-aligned, so its lowest bit must be 0,
and for variable-sized key from syscall, I could allocated a 4B-aligned
pointer and setting the lowest bit as 1, so qp-trie can distinguish
between these two types of pointer. The question is that I am not sure
whether the idea above is a good one or not. Does it sound fragile ? Or
is there any better way to handle that ?

Regards


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-03 13:19 Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ? Hou Tao
@ 2023-08-03 13:28 ` Hou Tao
  2023-08-17  6:34   ` Hou Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Tao @ 2023-08-03 13:28 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Joanne Koong,
	Kumar Kartikeya Dwivedi



On 8/3/2023 9:19 PM, Hou Tao wrote:
> Hi,
>
> I am preparing for qp-trie v4, but I need some help on how to support
> variable-sized key in bpf syscall. The implementation of qp-trie needs
> to distinguish between dynptr key from bpf program and variable-sized
> key from bpf syscall. In v3, I added a new dynptr type:
> BPF_DYNPTR_TYPE_USER for variable-sized key from bpf syscall [0], so
> both bpf program and bpf syscall will use the same type to represent the
> variable-sized key, but Andrii thought ptr+size tuple was simpler and
> would be enough for user APIs, so in v4, the type of key for bpf program
> and syscall will be different. One way to handle that is to add a new
> parameter in .map_lookup_elem()/.map_delete_elem()/.map_update_elem() to
> tell whether the key comes from bpf program or syscall or introduce new
> APIs in bpf_map_ops for variable-sized key related syscall, but I think
> it will introduce too much churn. Considering that the size of
> bpf_dynptr_kern is 8-bytes aligned, so I think maybe I could reuse the
> lowest 1-bit of key pointer to tell qp-trie whether or not it is a
> bpf_dynptr_kern or a variable-sized key pointer from syscall. For
> bpf_dynptr_kern, because it is 8B-aligned, so its lowest bit must be 0,
> and for variable-sized key from syscall, I could allocated a 4B-aligned
> pointer and setting the lowest bit as 1, so qp-trie can distinguish
> between these two types of pointer. The question is that I am not sure
> whether the idea above is a good one or not. Does it sound fragile ? Or
> is there any better way to handle that ?
Forgot to add the URL for [0]:

[0]:
https://lore.kernel.org/bpf/CAEf4BzZyfUOfGkQP67urmG9=7pqUF-5E9LjZf-Y0sL9nbcHFww@mail.gmail.com/
>
> Regards
>
>
>
> .


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-03 13:28 ` Hou Tao
@ 2023-08-17  6:34   ` Hou Tao
  2023-08-17 23:00     ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Tao @ 2023-08-17  6:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Andrii Nakryiko, Joanne Koong,
	Kumar Kartikeya Dwivedi

ping ?

On 8/3/2023 9:28 PM, Hou Tao wrote:
>
> On 8/3/2023 9:19 PM, Hou Tao wrote:
>> Hi,
>>
>> I am preparing for qp-trie v4, but I need some help on how to support
>> variable-sized key in bpf syscall. The implementation of qp-trie needs
>> to distinguish between dynptr key from bpf program and variable-sized
>> key from bpf syscall. In v3, I added a new dynptr type:
>> BPF_DYNPTR_TYPE_USER for variable-sized key from bpf syscall [0], so
>> both bpf program and bpf syscall will use the same type to represent the
>> variable-sized key, but Andrii thought ptr+size tuple was simpler and
>> would be enough for user APIs, so in v4, the type of key for bpf program
>> and syscall will be different. One way to handle that is to add a new
>> parameter in .map_lookup_elem()/.map_delete_elem()/.map_update_elem() to
>> tell whether the key comes from bpf program or syscall or introduce new
>> APIs in bpf_map_ops for variable-sized key related syscall, but I think
>> it will introduce too much churn. Considering that the size of
>> bpf_dynptr_kern is 8-bytes aligned, so I think maybe I could reuse the
>> lowest 1-bit of key pointer to tell qp-trie whether or not it is a
>> bpf_dynptr_kern or a variable-sized key pointer from syscall. For
>> bpf_dynptr_kern, because it is 8B-aligned, so its lowest bit must be 0,
>> and for variable-sized key from syscall, I could allocated a 4B-aligned
>> pointer and setting the lowest bit as 1, so qp-trie can distinguish
>> between these two types of pointer. The question is that I am not sure
>> whether the idea above is a good one or not. Does it sound fragile ? Or
>> is there any better way to handle that ?
> Forgot to add the URL for [0]:
>
> [0]:
> https://lore.kernel.org/bpf/CAEf4BzZyfUOfGkQP67urmG9=7pqUF-5E9LjZf-Y0sL9nbcHFww@mail.gmail.com/
>> Regards
>>
>>
>>
>> .


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-17  6:34   ` Hou Tao
@ 2023-08-17 23:00     ` Alexei Starovoitov
  2023-08-19  9:15       ` Hou Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2023-08-17 23:00 UTC (permalink / raw)
  To: Hou Tao; +Cc: bpf, Andrii Nakryiko, Joanne Koong, Kumar Kartikeya Dwivedi

On Wed, Aug 16, 2023 at 11:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> ping ?

Sorry for the delay. I've been on PTO.

> On 8/3/2023 9:28 PM, Hou Tao wrote:
> >
> > On 8/3/2023 9:19 PM, Hou Tao wrote:
> >> Hi,
> >>
> >> I am preparing for qp-trie v4, but I need some help on how to support
> >> variable-sized key in bpf syscall. The implementation of qp-trie needs
> >> to distinguish between dynptr key from bpf program and variable-sized
> >> key from bpf syscall. In v3, I added a new dynptr type:
> >> BPF_DYNPTR_TYPE_USER for variable-sized key from bpf syscall [0], so
> >> both bpf program and bpf syscall will use the same type to represent the
> >> variable-sized key, but Andrii thought ptr+size tuple was simpler and
> >> would be enough for user APIs, so in v4, the type of key for bpf program
> >> and syscall will be different. One way to handle that is to add a new
> >> parameter in .map_lookup_elem()/.map_delete_elem()/.map_update_elem() to
> >> tell whether the key comes from bpf program or syscall or introduce new
> >> APIs in bpf_map_ops for variable-sized key related syscall, but I think
> >> it will introduce too much churn. Considering that the size of
> >> bpf_dynptr_kern is 8-bytes aligned, so I think maybe I could reuse the
> >> lowest 1-bit of key pointer to tell qp-trie whether or not it is a
> >> bpf_dynptr_kern or a variable-sized key pointer from syscall. For
> >> bpf_dynptr_kern, because it is 8B-aligned, so its lowest bit must be 0,
> >> and for variable-sized key from syscall, I could allocated a 4B-aligned
> >> pointer and setting the lowest bit as 1, so qp-trie can distinguish
> >> between these two types of pointer. The question is that I am not sure
> >> whether the idea above is a good one or not. Does it sound fragile ? Or
> >> is there any better way to handle that ?

Let's avoid bit hacks. They're not extensible and should be used
only in cases where performance matters a lot or memory constraints are extreme.

ptr/sz tuple from syscall side sounds the simplest.
I agree with Andrii exposing the dynptr concept to user space
and especially as part of syscall is unnecessary.
We already have LPM as a precedent. Maybe we can do the same here?
No need to add new sys_bpf commands.

If the existing bpf_map_lookup_elem() helper doesn't fit qp-tree we can
use new kfuncs from bpf prog and LPM-like map accessors from syscall.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-17 23:00     ` Alexei Starovoitov
@ 2023-08-19  9:15       ` Hou Tao
  2023-08-21 23:49         ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Tao @ 2023-08-19  9:15 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Joanne Koong, Kumar Kartikeya Dwivedi

Hi,

On 8/18/2023 7:00 AM, Alexei Starovoitov wrote:
> On Wed, Aug 16, 2023 at 11:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> ping ?
> Sorry for the delay. I've been on PTO.
>
>> On 8/3/2023 9:28 PM, Hou Tao wrote:
>>> On 8/3/2023 9:19 PM, Hou Tao wrote:
>>>> Hi,
>>>>
>>>> I am preparing for qp-trie v4, but I need some help on how to support
>>>> variable-sized key in bpf syscall. The implementation of qp-trie needs
>>>> to distinguish between dynptr key from bpf program and variable-sized
>>>> key from bpf syscall. In v3, I added a new dynptr type:
>>>> BPF_DYNPTR_TYPE_USER for variable-sized key from bpf syscall [0], so
>>>> both bpf program and bpf syscall will use the same type to represent the
>>>> variable-sized key, but Andrii thought ptr+size tuple was simpler and
>>>> would be enough for user APIs, so in v4, the type of key for bpf program
>>>> and syscall will be different. One way to handle that is to add a new
>>>> parameter in .map_lookup_elem()/.map_delete_elem()/.map_update_elem() to
>>>> tell whether the key comes from bpf program or syscall or introduce new
>>>> APIs in bpf_map_ops for variable-sized key related syscall, but I think
>>>> it will introduce too much churn. Considering that the size of
>>>> bpf_dynptr_kern is 8-bytes aligned, so I think maybe I could reuse the
>>>> lowest 1-bit of key pointer to tell qp-trie whether or not it is a
>>>> bpf_dynptr_kern or a variable-sized key pointer from syscall. For
>>>> bpf_dynptr_kern, because it is 8B-aligned, so its lowest bit must be 0,
>>>> and for variable-sized key from syscall, I could allocated a 4B-aligned
>>>> pointer and setting the lowest bit as 1, so qp-trie can distinguish
>>>> between these two types of pointer. The question is that I am not sure
>>>> whether the idea above is a good one or not. Does it sound fragile ? Or
>>>> is there any better way to handle that ?
> Let's avoid bit hacks. They're not extensible and should be used
> only in cases where performance matters a lot or memory constraints are extreme.
I see. Neither the performance reason nor the memory limitation fit here.
>
> ptr/sz tuple from syscall side sounds the simplest.
> I agree with Andrii exposing the dynptr concept to user space
> and especially as part of syscall is unnecessary.
> We already have LPM as a precedent. Maybe we can do the same here?
> No need to add new sys_bpf commands.

There is no need to add new sys_bpf commands. We can extend bpf_attr to
support variable-sized key in qp-trie for bpf syscall. The probem is the
keys from bpf syscall and bpf program are different: bpf syscall uses
ptr+size tuple and bpf program uses dynptr, but the APIs in bpf_map_ops
only uses a pointer to represent the key, so qp-trie can not distinguish
between the keys from bpf syscall and bpf program. In qp-trie v1, the
key of qp-trie was similar with LPM-trie: both the syscall and program
used the same key format. But the key format for bpf program changed to
dynptr in qp-trie v2 according to the suggestion from Andrii. I think it
is also a bad ideal to go back to v1 again.

>
> If the existing bpf_map_lookup_elem() helper doesn't fit qp-tree we can
> use new kfuncs from bpf prog and LPM-like map accessors from syscall.

It is a feasible solution, but it will make qp-trie be different with
other map types. I will try to extend the APIs in bpf_map_ops first to
see how much churns it may introduce.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-19  9:15       ` Hou Tao
@ 2023-08-21 23:49         ` Alexei Starovoitov
  2023-08-22  0:54           ` Hou Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2023-08-21 23:49 UTC (permalink / raw)
  To: Hou Tao; +Cc: bpf, Andrii Nakryiko, Joanne Koong, Kumar Kartikeya Dwivedi

On Sat, Aug 19, 2023 at 3:39 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/18/2023 7:00 AM, Alexei Starovoitov wrote:
> > On Wed, Aug 16, 2023 at 11:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> ping ?
> > Sorry for the delay. I've been on PTO.
> >
> >> On 8/3/2023 9:28 PM, Hou Tao wrote:
> >>> On 8/3/2023 9:19 PM, Hou Tao wrote:
> >>>> Hi,
> >>>>
> >>>> I am preparing for qp-trie v4, but I need some help on how to support
> >>>> variable-sized key in bpf syscall. The implementation of qp-trie needs
> >>>> to distinguish between dynptr key from bpf program and variable-sized
> >>>> key from bpf syscall. In v3, I added a new dynptr type:
> >>>> BPF_DYNPTR_TYPE_USER for variable-sized key from bpf syscall [0], so
> >>>> both bpf program and bpf syscall will use the same type to represent the
> >>>> variable-sized key, but Andrii thought ptr+size tuple was simpler and
> >>>> would be enough for user APIs, so in v4, the type of key for bpf program
> >>>> and syscall will be different. One way to handle that is to add a new
> >>>> parameter in .map_lookup_elem()/.map_delete_elem()/.map_update_elem() to
> >>>> tell whether the key comes from bpf program or syscall or introduce new
> >>>> APIs in bpf_map_ops for variable-sized key related syscall, but I think
> >>>> it will introduce too much churn. Considering that the size of
> >>>> bpf_dynptr_kern is 8-bytes aligned, so I think maybe I could reuse the
> >>>> lowest 1-bit of key pointer to tell qp-trie whether or not it is a
> >>>> bpf_dynptr_kern or a variable-sized key pointer from syscall. For
> >>>> bpf_dynptr_kern, because it is 8B-aligned, so its lowest bit must be 0,
> >>>> and for variable-sized key from syscall, I could allocated a 4B-aligned
> >>>> pointer and setting the lowest bit as 1, so qp-trie can distinguish
> >>>> between these two types of pointer. The question is that I am not sure
> >>>> whether the idea above is a good one or not. Does it sound fragile ? Or
> >>>> is there any better way to handle that ?
> > Let's avoid bit hacks. They're not extensible and should be used
> > only in cases where performance matters a lot or memory constraints are extreme.
> I see. Neither the performance reason nor the memory limitation fit here.
> >
> > ptr/sz tuple from syscall side sounds the simplest.
> > I agree with Andrii exposing the dynptr concept to user space
> > and especially as part of syscall is unnecessary.
> > We already have LPM as a precedent. Maybe we can do the same here?
> > No need to add new sys_bpf commands.
>
> There is no need to add new sys_bpf commands. We can extend bpf_attr to
> support variable-sized key in qp-trie for bpf syscall. The probem is the
> keys from bpf syscall and bpf program are different: bpf syscall uses
> ptr+size tuple and bpf program uses dynptr, but the APIs in bpf_map_ops
> only uses a pointer to represent the key, so qp-trie can not distinguish
> between the keys from bpf syscall and bpf program. In qp-trie v1, the
> key of qp-trie was similar with LPM-trie: both the syscall and program
> used the same key format. But the key format for bpf program changed to
> dynptr in qp-trie v2 according to the suggestion from Andrii. I think it
> is also a bad ideal to go back to v1 again.
>
> >
> > If the existing bpf_map_lookup_elem() helper doesn't fit qp-tree we can
> > use new kfuncs from bpf prog and LPM-like map accessors from syscall.
>
> It is a feasible solution, but it will make qp-trie be different with
> other map types. I will try to extend the APIs in bpf_map_ops first to
> see how much churns it may introduce.

you mean you want to try to dynamically adapt bpf_map_lookup_elem()
to consider 'void *key' as a pointer to dynptr for bpf prog and
lpm-like tuple for syscall?
I'm afraid the verifier changes will be messy, since PTR_TO_MAP_KEY is
quite special.
__bpf_kfunc void *bpf_qptree_lookup(const bpf_map *map, const struct
bpf_dynptr_kern *key, ...);
will be so much easier to add.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-21 23:49         ` Alexei Starovoitov
@ 2023-08-22  0:54           ` Hou Tao
  2023-08-22  0:58             ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Tao @ 2023-08-22  0:54 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Joanne Koong, Kumar Kartikeya Dwivedi

Hi,

On 8/22/2023 7:49 AM, Alexei Starovoitov wrote:
> On Sat, Aug 19, 2023 at 3:39 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 8/18/2023 7:00 AM, Alexei Starovoitov wrote:
>>> On Wed, Aug 16, 2023 at 11:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> ping ?
>>> Sorry for the delay. I've been on PTO.
>>>
>>>> On 8/3/2023 9:28 PM, Hou Tao wrote:
>>>>> On 8/3/2023 9:19 PM, Hou Tao wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I am preparing for qp-trie v4, but I need some help on how to support
>>>>>> variable-sized key in bpf syscall. The implementation of qp-trie needs
>>>>>> to distinguish between dynptr key from bpf program and variable-sized
>>>>>> key from bpf syscall. In v3, I added a new dynptr type:
>>>>>> BPF_DYNPTR_TYPE_USER for variable-sized key from bpf syscall [0], so
>>>>>> both bpf program and bpf syscall will use the same type to represent the
>>>>>> variable-sized key, but Andrii thought ptr+size tuple was simpler and
>>>>>> would be enough for user APIs, so in v4, the type of key for bpf program
>>>>>> and syscall will be different. One way to handle that is to add a new
>>>>>> parameter in .map_lookup_elem()/.map_delete_elem()/.map_update_elem() to
>>>>>> tell whether the key comes from bpf program or syscall or introduce new
>>>>>> APIs in bpf_map_ops for variable-sized key related syscall, but I think
>>>>>> it will introduce too much churn. Considering that the size of
>>>>>> bpf_dynptr_kern is 8-bytes aligned, so I think maybe I could reuse the
>>>>>> lowest 1-bit of key pointer to tell qp-trie whether or not it is a
>>>>>> bpf_dynptr_kern or a variable-sized key pointer from syscall. For
>>>>>> bpf_dynptr_kern, because it is 8B-aligned, so its lowest bit must be 0,
>>>>>> and for variable-sized key from syscall, I could allocated a 4B-aligned
>>>>>> pointer and setting the lowest bit as 1, so qp-trie can distinguish
>>>>>> between these two types of pointer. The question is that I am not sure
>>>>>> whether the idea above is a good one or not. Does it sound fragile ? Or
>>>>>> is there any better way to handle that ?
>>> Let's avoid bit hacks. They're not extensible and should be used
>>> only in cases where performance matters a lot or memory constraints are extreme.
>> I see. Neither the performance reason nor the memory limitation fit here.
>>> ptr/sz tuple from syscall side sounds the simplest.
>>> I agree with Andrii exposing the dynptr concept to user space
>>> and especially as part of syscall is unnecessary.
>>> We already have LPM as a precedent. Maybe we can do the same here?
>>> No need to add new sys_bpf commands.
>> There is no need to add new sys_bpf commands. We can extend bpf_attr to
>> support variable-sized key in qp-trie for bpf syscall. The probem is the
>> keys from bpf syscall and bpf program are different: bpf syscall uses
>> ptr+size tuple and bpf program uses dynptr, but the APIs in bpf_map_ops
>> only uses a pointer to represent the key, so qp-trie can not distinguish
>> between the keys from bpf syscall and bpf program. In qp-trie v1, the
>> key of qp-trie was similar with LPM-trie: both the syscall and program
>> used the same key format. But the key format for bpf program changed to
>> dynptr in qp-trie v2 according to the suggestion from Andrii. I think it
>> is also a bad ideal to go back to v1 again.
>>
>>> If the existing bpf_map_lookup_elem() helper doesn't fit qp-tree we can
>>> use new kfuncs from bpf prog and LPM-like map accessors from syscall.
>> It is a feasible solution, but it will make qp-trie be different with
>> other map types. I will try to extend the APIs in bpf_map_ops first to
>> see how much churns it may introduce.
> you mean you want to try to dynamically adapt bpf_map_lookup_elem()
> to consider 'void *key' as a pointer to dynptr for bpf prog and
> lpm-like tuple for syscall?
> I'm afraid the verifier changes will be messy, since PTR_TO_MAP_KEY is
> quite special.

No. I didn't plan to do that. I am trying to add three new APIs in
bpf_map_ops to handle lookup/update/delete operation from bpf syscall
(e.g, map_lookup_elem_syscall). So bpf program and bpf syscall can use
different API to operate on qp-trie.
> __bpf_kfunc void *bpf_qptree_lookup(const bpf_map *map, const struct
> bpf_dynptr_kern *key, ...);
> will be so much easier to add.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-22  0:54           ` Hou Tao
@ 2023-08-22  0:58             ` Alexei Starovoitov
  2023-08-22  1:46               ` Hou Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2023-08-22  0:58 UTC (permalink / raw)
  To: Hou Tao; +Cc: bpf, Andrii Nakryiko, Joanne Koong, Kumar Kartikeya Dwivedi

On Mon, Aug 21, 2023 at 5:55 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/22/2023 7:49 AM, Alexei Starovoitov wrote:
> > On Sat, Aug 19, 2023 at 3:39 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 8/18/2023 7:00 AM, Alexei Starovoitov wrote:
> >>> On Wed, Aug 16, 2023 at 11:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> ping ?
> >>> Sorry for the delay. I've been on PTO.
> >>>
> >>>> On 8/3/2023 9:28 PM, Hou Tao wrote:
> >>>>> On 8/3/2023 9:19 PM, Hou Tao wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I am preparing for qp-trie v4, but I need some help on how to support
> >>>>>> variable-sized key in bpf syscall. The implementation of qp-trie needs
> >>>>>> to distinguish between dynptr key from bpf program and variable-sized
> >>>>>> key from bpf syscall. In v3, I added a new dynptr type:
> >>>>>> BPF_DYNPTR_TYPE_USER for variable-sized key from bpf syscall [0], so
> >>>>>> both bpf program and bpf syscall will use the same type to represent the
> >>>>>> variable-sized key, but Andrii thought ptr+size tuple was simpler and
> >>>>>> would be enough for user APIs, so in v4, the type of key for bpf program
> >>>>>> and syscall will be different. One way to handle that is to add a new
> >>>>>> parameter in .map_lookup_elem()/.map_delete_elem()/.map_update_elem() to
> >>>>>> tell whether the key comes from bpf program or syscall or introduce new
> >>>>>> APIs in bpf_map_ops for variable-sized key related syscall, but I think
> >>>>>> it will introduce too much churn. Considering that the size of
> >>>>>> bpf_dynptr_kern is 8-bytes aligned, so I think maybe I could reuse the
> >>>>>> lowest 1-bit of key pointer to tell qp-trie whether or not it is a
> >>>>>> bpf_dynptr_kern or a variable-sized key pointer from syscall. For
> >>>>>> bpf_dynptr_kern, because it is 8B-aligned, so its lowest bit must be 0,
> >>>>>> and for variable-sized key from syscall, I could allocated a 4B-aligned
> >>>>>> pointer and setting the lowest bit as 1, so qp-trie can distinguish
> >>>>>> between these two types of pointer. The question is that I am not sure
> >>>>>> whether the idea above is a good one or not. Does it sound fragile ? Or
> >>>>>> is there any better way to handle that ?
> >>> Let's avoid bit hacks. They're not extensible and should be used
> >>> only in cases where performance matters a lot or memory constraints are extreme.
> >> I see. Neither the performance reason nor the memory limitation fit here.
> >>> ptr/sz tuple from syscall side sounds the simplest.
> >>> I agree with Andrii exposing the dynptr concept to user space
> >>> and especially as part of syscall is unnecessary.
> >>> We already have LPM as a precedent. Maybe we can do the same here?
> >>> No need to add new sys_bpf commands.
> >> There is no need to add new sys_bpf commands. We can extend bpf_attr to
> >> support variable-sized key in qp-trie for bpf syscall. The probem is the
> >> keys from bpf syscall and bpf program are different: bpf syscall uses
> >> ptr+size tuple and bpf program uses dynptr, but the APIs in bpf_map_ops
> >> only uses a pointer to represent the key, so qp-trie can not distinguish
> >> between the keys from bpf syscall and bpf program. In qp-trie v1, the
> >> key of qp-trie was similar with LPM-trie: both the syscall and program
> >> used the same key format. But the key format for bpf program changed to
> >> dynptr in qp-trie v2 according to the suggestion from Andrii. I think it
> >> is also a bad ideal to go back to v1 again.
> >>
> >>> If the existing bpf_map_lookup_elem() helper doesn't fit qp-tree we can
> >>> use new kfuncs from bpf prog and LPM-like map accessors from syscall.
> >> It is a feasible solution, but it will make qp-trie be different with
> >> other map types. I will try to extend the APIs in bpf_map_ops first to
> >> see how much churns it may introduce.
> > you mean you want to try to dynamically adapt bpf_map_lookup_elem()
> > to consider 'void *key' as a pointer to dynptr for bpf prog and
> > lpm-like tuple for syscall?
> > I'm afraid the verifier changes will be messy, since PTR_TO_MAP_KEY is
> > quite special.
>
> No. I didn't plan to do that. I am trying to add three new APIs in
> bpf_map_ops to handle lookup/update/delete operation from bpf syscall
> (e.g, map_lookup_elem_syscall). So bpf program and bpf syscall can use
> different API to operate on qp-trie.

How does bpf prog side api look like?
I thought we wanted to use dynptr as a key?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-22  0:58             ` Alexei Starovoitov
@ 2023-08-22  1:46               ` Hou Tao
  2023-08-22  3:25                 ` Alexei Starovoitov
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Tao @ 2023-08-22  1:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Joanne Koong, Kumar Kartikeya Dwivedi

Hi,

On 8/22/2023 8:58 AM, Alexei Starovoitov wrote:
> On Mon, Aug 21, 2023 at 5:55 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 8/22/2023 7:49 AM, Alexei Starovoitov wrote:
>>> On Sat, Aug 19, 2023 at 3:39 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> Hi,
>>>>
>>>> On 8/18/2023 7:00 AM, Alexei Starovoitov wrote:
>>>>> On Wed, Aug 16, 2023 at 11:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>
SNIP
>>>>> If the existing bpf_map_lookup_elem() helper doesn't fit qp-tree we can
>>>>> use new kfuncs from bpf prog and LPM-like map accessors from syscall.
>>>> It is a feasible solution, but it will make qp-trie be different with
>>>> other map types. I will try to extend the APIs in bpf_map_ops first to
>>>> see how much churns it may introduce.
>>> you mean you want to try to dynamically adapt bpf_map_lookup_elem()
>>> to consider 'void *key' as a pointer to dynptr for bpf prog and
>>> lpm-like tuple for syscall?
>>> I'm afraid the verifier changes will be messy, since PTR_TO_MAP_KEY is
>>> quite special.
>> No. I didn't plan to do that. I am trying to add three new APIs in
>> bpf_map_ops to handle lookup/update/delete operation from bpf syscall
>> (e.g, map_lookup_elem_syscall). So bpf program and bpf syscall can use
>> different API to operate on qp-trie.
> How does bpf prog side api look like?
> I thought we wanted to use dynptr as a key?

Yes. bpf prog will use dynptr as the map key. The bpf program will use
the same map helpers as hash map to operate on qp-trie and the verifier
will be updated to allow using dynptr as map key for qp-trie.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-22  1:46               ` Hou Tao
@ 2023-08-22  3:25                 ` Alexei Starovoitov
  2023-08-22 13:12                   ` Hou Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Alexei Starovoitov @ 2023-08-22  3:25 UTC (permalink / raw)
  To: Hou Tao; +Cc: bpf, Andrii Nakryiko, Joanne Koong, Kumar Kartikeya Dwivedi

On Mon, Aug 21, 2023 at 6:46 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/22/2023 8:58 AM, Alexei Starovoitov wrote:
> > On Mon, Aug 21, 2023 at 5:55 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 8/22/2023 7:49 AM, Alexei Starovoitov wrote:
> >>> On Sat, Aug 19, 2023 at 3:39 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> Hi,
> >>>>
> >>>> On 8/18/2023 7:00 AM, Alexei Starovoitov wrote:
> >>>>> On Wed, Aug 16, 2023 at 11:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>>>
> SNIP
> >>>>> If the existing bpf_map_lookup_elem() helper doesn't fit qp-tree we can
> >>>>> use new kfuncs from bpf prog and LPM-like map accessors from syscall.
> >>>> It is a feasible solution, but it will make qp-trie be different with
> >>>> other map types. I will try to extend the APIs in bpf_map_ops first to
> >>>> see how much churns it may introduce.
> >>> you mean you want to try to dynamically adapt bpf_map_lookup_elem()
> >>> to consider 'void *key' as a pointer to dynptr for bpf prog and
> >>> lpm-like tuple for syscall?
> >>> I'm afraid the verifier changes will be messy, since PTR_TO_MAP_KEY is
> >>> quite special.
> >> No. I didn't plan to do that. I am trying to add three new APIs in
> >> bpf_map_ops to handle lookup/update/delete operation from bpf syscall
> >> (e.g, map_lookup_elem_syscall). So bpf program and bpf syscall can use
> >> different API to operate on qp-trie.
> > How does bpf prog side api look like?
> > I thought we wanted to use dynptr as a key?
>
> Yes. bpf prog will use dynptr as the map key. The bpf program will use
> the same map helpers as hash map to operate on qp-trie and the verifier
> will be updated to allow using dynptr as map key for qp-trie.

And that's the problem I just mentioned.
PTR_TO_MAP_KEY is special. I don't think we should hack it to also
mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-22  3:25                 ` Alexei Starovoitov
@ 2023-08-22 13:12                   ` Hou Tao
  2023-08-25 18:33                     ` Andrii Nakryiko
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Tao @ 2023-08-22 13:12 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Andrii Nakryiko, Joanne Koong, Kumar Kartikeya Dwivedi

Hi,

On 8/22/2023 11:25 AM, Alexei Starovoitov wrote:
> On Mon, Aug 21, 2023 at 6:46 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi,
>>
>> On 8/22/2023 8:58 AM, Alexei Starovoitov wrote:
>>> On Mon, Aug 21, 2023 at 5:55 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>> Hi,
>>>>
>>>> On 8/22/2023 7:49 AM, Alexei Starovoitov wrote:
>>>>> On Sat, Aug 19, 2023 at 3:39 AM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/18/2023 7:00 AM, Alexei Starovoitov wrote:
>>>>>>> On Wed, Aug 16, 2023 at 11:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>>>>>>
>> SNIP
>>>>>>> If the existing bpf_map_lookup_elem() helper doesn't fit qp-tree we can
>>>>>>> use new kfuncs from bpf prog and LPM-like map accessors from syscall.
>>>>>> It is a feasible solution, but it will make qp-trie be different with
>>>>>> other map types. I will try to extend the APIs in bpf_map_ops first to
>>>>>> see how much churns it may introduce.
>>>>> you mean you want to try to dynamically adapt bpf_map_lookup_elem()
>>>>> to consider 'void *key' as a pointer to dynptr for bpf prog and
>>>>> lpm-like tuple for syscall?
>>>>> I'm afraid the verifier changes will be messy, since PTR_TO_MAP_KEY is
>>>>> quite special.
>>>> No. I didn't plan to do that. I am trying to add three new APIs in
>>>> bpf_map_ops to handle lookup/update/delete operation from bpf syscall
>>>> (e.g, map_lookup_elem_syscall). So bpf program and bpf syscall can use
>>>> different API to operate on qp-trie.
>>> How does bpf prog side api look like?
>>> I thought we wanted to use dynptr as a key?
>> Yes. bpf prog will use dynptr as the map key. The bpf program will use
>> the same map helpers as hash map to operate on qp-trie and the verifier
>> will be updated to allow using dynptr as map key for qp-trie.
> And that's the problem I just mentioned.
> PTR_TO_MAP_KEY is special. I don't think we should hack it to also
> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).

Sorry for misunderstanding your reply. But before switch to the kfunc
way, could you please point me to some code or function which shows the
specialty of PTR_MAP_KEY ?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-22 13:12                   ` Hou Tao
@ 2023-08-25 18:33                     ` Andrii Nakryiko
  2023-08-31  6:29                       ` Hou Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Andrii Nakryiko @ 2023-08-25 18:33 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, bpf, Andrii Nakryiko, Joanne Koong,
	Kumar Kartikeya Dwivedi

On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 8/22/2023 11:25 AM, Alexei Starovoitov wrote:
> > On Mon, Aug 21, 2023 at 6:46 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi,
> >>
> >> On 8/22/2023 8:58 AM, Alexei Starovoitov wrote:
> >>> On Mon, Aug 21, 2023 at 5:55 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>> Hi,
> >>>>
> >>>> On 8/22/2023 7:49 AM, Alexei Starovoitov wrote:
> >>>>> On Sat, Aug 19, 2023 at 3:39 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 8/18/2023 7:00 AM, Alexei Starovoitov wrote:
> >>>>>>> On Wed, Aug 16, 2023 at 11:35 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>>>>>>
> >> SNIP
> >>>>>>> If the existing bpf_map_lookup_elem() helper doesn't fit qp-tree we can
> >>>>>>> use new kfuncs from bpf prog and LPM-like map accessors from syscall.
> >>>>>> It is a feasible solution, but it will make qp-trie be different with
> >>>>>> other map types. I will try to extend the APIs in bpf_map_ops first to
> >>>>>> see how much churns it may introduce.
> >>>>> you mean you want to try to dynamically adapt bpf_map_lookup_elem()
> >>>>> to consider 'void *key' as a pointer to dynptr for bpf prog and
> >>>>> lpm-like tuple for syscall?
> >>>>> I'm afraid the verifier changes will be messy, since PTR_TO_MAP_KEY is
> >>>>> quite special.
> >>>> No. I didn't plan to do that. I am trying to add three new APIs in
> >>>> bpf_map_ops to handle lookup/update/delete operation from bpf syscall
> >>>> (e.g, map_lookup_elem_syscall). So bpf program and bpf syscall can use
> >>>> different API to operate on qp-trie.
> >>> How does bpf prog side api look like?
> >>> I thought we wanted to use dynptr as a key?
> >> Yes. bpf prog will use dynptr as the map key. The bpf program will use
> >> the same map helpers as hash map to operate on qp-trie and the verifier
> >> will be updated to allow using dynptr as map key for qp-trie.
> > And that's the problem I just mentioned.
> > PTR_TO_MAP_KEY is special. I don't think we should hack it to also
> > mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).
>
> Sorry for misunderstanding your reply. But before switch to the kfunc
> way, could you please point me to some code or function which shows the
> specialty of PTR_MAP_KEY ?
>
>

Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The
logic assumes that there is associated struct bpf_map * pointer from
which we know fixed-sized key length.

But getting back to the topic at hand. I vaguely remember discussion
we had, but it would be good if you could summarize it again here to
avoid talking past each other. What is the bpf_map_ops changes you
were thinking to do? How bpf_attr will look like? How BPF-side API for
lookup/delete/update will look like? And then let's go from there?
Thanks!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-25 18:33                     ` Andrii Nakryiko
@ 2023-08-31  6:29                       ` Hou Tao
  2023-09-08 22:34                         ` Andrii Nakryiko
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Tao @ 2023-08-31  6:29 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, Andrii Nakryiko, Joanne Koong,
	Kumar Kartikeya Dwivedi

Hi Andrii,

On 8/26/2023 2:33 AM, Andrii Nakryiko wrote:
> On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@huaweicloud.com> wrote:
SNIP
>>>> Yes. bpf prog will use dynptr as the map key. The bpf program will use
>>>> the same map helpers as hash map to operate on qp-trie and the verifier
>>>> will be updated to allow using dynptr as map key for qp-trie.
>>> And that's the problem I just mentioned.
>>> PTR_TO_MAP_KEY is special. I don't think we should hack it to also
>>> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).
>> Sorry for misunderstanding your reply. But before switch to the kfunc
>> way, could you please point me to some code or function which shows the
>> specialty of PTR_MAP_KEY ?
>>
>>
> Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The
> logic assumes that there is associated struct bpf_map * pointer from
> which we know fixed-sized key length.

Thanks for the information. Will check that.
>
> But getting back to the topic at hand. I vaguely remember discussion
> we had, but it would be good if you could summarize it again here to
> avoid talking past each other. What is the bpf_map_ops changes you
> were thinking to do? How bpf_attr will look like? How BPF-side API for
> lookup/delete/update will look like? And then let's go from there?
> Thanks!

Sorry for the late reply. I am a bit distracted by other work this week.

For bpf_attr, a new field 'dynkey_size' is added to support
BPF_MAP_{LOOKUP/UPDATE/DELETE}_ELEM and BPF_MAP_GET_NEXT_KEY on qp-trie
as shown below:

struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
        __u32           map_fd;
        __aligned_u64   key;
        union {
                __aligned_u64 value;
                __aligned_u64 next_key;
        };
        __u64           flags;
        __u32           dynkey_size;    /* input/output for
                                         * BPF_MAP_GET_NEXT_KEY. input
                                         * only for other commands.
                                         */
};

And 4 new APIs are added in libbpf to support basic operations on qp-trie:

LIBBPF_API int bpf_map_update_dynkey_elem(int fd, const void *key,
unsigned int key_size, const void *value, __u64 flags);
LIBBPF_API int bpf_map_lookup_dynkey_elem(int fd, const void *key,
unsigned int key_size, void *value);
LIBBPF_API int bpf_map_delete_dynkey_elem(int fd, const void *key,
unsigned int key_size);
LIBBPF_API int bpf_map_get_next_dynkey(int fd, const void *key, void
*next_key, unsigned int *key_size);

About 3 weeks again, I have used the lowest bit of key pointer in
.map_lookup_elem/.map_update_elem/.map_delete_elem to distinguish
between bpf_user_dynkey-typed key from syscall and bpf_dynptr_kern-typed
key from bpf program. The definition of bpf_user_dynkey and its
allocation method are shown below. bpf syscall uses it to allocate
variable-sized key and passes it to qp-trie.

/* Allocate bpf_user_dynkey and its data together */
struct bpf_user_dynkey {
        unsigned int size;
        void *data;
};

static void *bpf_new_user_dynkey(unsigned int size)
{
        struct bpf_user_dynkey *dynkey;
        size_t total;

        total = round_up(sizeof(*dynkey) + size, 2);
        dynkey = kvmalloc(total, GFP_USER | __GFP_NOWARN);
        if (!dynkey)
                return ERR_PTR(-ENOMEM);

        dynkey->size = size;
        dynkey->data = &dynkey[1];
        return (void *)((long)dynkey | BPF_USER_DYNKEY_MARK);
}


After Alexei suggested that bit hack is only OK for memory or
performance reason, I'm planning to add 2 new callbacks in bpf_map_ops
to support update/delete operations in bpf syscall as shown below, but I
have tried it yet.

/* map is generic key/value storage optionally accessible by eBPF
programs */
struct bpf_map_ops {
        /* funcs callable from userspace (via syscall) */
        /* ...... */
        void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
        long (*map_update_elem_sys_only)(struct bpf_map *map, void *key,
void *value, u64 flags);
        long (*map_delete_elem_sys_only)(struct bpf_map *map, void *key);
        /* ...... */
};







^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-08-31  6:29                       ` Hou Tao
@ 2023-09-08 22:34                         ` Andrii Nakryiko
  2023-09-09  2:39                           ` Hou Tao
  0 siblings, 1 reply; 16+ messages in thread
From: Andrii Nakryiko @ 2023-09-08 22:34 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, bpf, Andrii Nakryiko, Joanne Koong,
	Kumar Kartikeya Dwivedi

On Wed, Aug 30, 2023 at 11:29 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi Andrii,
>
> On 8/26/2023 2:33 AM, Andrii Nakryiko wrote:
> > On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@huaweicloud.com> wrote:
> SNIP
> >>>> Yes. bpf prog will use dynptr as the map key. The bpf program will use
> >>>> the same map helpers as hash map to operate on qp-trie and the verifier
> >>>> will be updated to allow using dynptr as map key for qp-trie.
> >>> And that's the problem I just mentioned.
> >>> PTR_TO_MAP_KEY is special. I don't think we should hack it to also
> >>> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).
> >> Sorry for misunderstanding your reply. But before switch to the kfunc
> >> way, could you please point me to some code or function which shows the
> >> specialty of PTR_MAP_KEY ?
> >>
> >>
> > Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The
> > logic assumes that there is associated struct bpf_map * pointer from
> > which we know fixed-sized key length.
>
> Thanks for the information. Will check that.
> >
> > But getting back to the topic at hand. I vaguely remember discussion
> > we had, but it would be good if you could summarize it again here to
> > avoid talking past each other. What is the bpf_map_ops changes you
> > were thinking to do? How bpf_attr will look like? How BPF-side API for
> > lookup/delete/update will look like? And then let's go from there?
> > Thanks!
>
> Sorry for the late reply. I am a bit distracted by other work this week.
>
> For bpf_attr, a new field 'dynkey_size' is added to support
> BPF_MAP_{LOOKUP/UPDATE/DELETE}_ELEM and BPF_MAP_GET_NEXT_KEY on qp-trie
> as shown below:
>
> struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
>         __u32           map_fd;
>         __aligned_u64   key;
>         union {
>                 __aligned_u64 value;
>                 __aligned_u64 next_key;
>         };
>         __u64           flags;
>         __u32           dynkey_size;    /* input/output for
>                                          * BPF_MAP_GET_NEXT_KEY. input
>                                          * only for other commands.
>                                          */

hm.. I wonder if it would be more elegant to add `key_size` and
`value_size`, and allow to specify it (optionally) even for maps that
have fixed-size keys and values. Return error if expected key/value
size doesn't match map definition. From libbpf side, libbpf can be
smart to not set it on older kernels (or if user didn't provide this
information). But for bpf_map__lookup_elem() and other higher-level
APIs, we should have all this information available.


> };
>
> And 4 new APIs are added in libbpf to support basic operations on qp-trie:
>
> LIBBPF_API int bpf_map_update_dynkey_elem(int fd, const void *key,
> unsigned int key_size, const void *value, __u64 flags);
> LIBBPF_API int bpf_map_lookup_dynkey_elem(int fd, const void *key,
> unsigned int key_size, void *value);
> LIBBPF_API int bpf_map_delete_dynkey_elem(int fd, const void *key,
> unsigned int key_size);
> LIBBPF_API int bpf_map_get_next_dynkey(int fd, const void *key, void
> *next_key, unsigned int *key_size);
>
> About 3 weeks again, I have used the lowest bit of key pointer in
> .map_lookup_elem/.map_update_elem/.map_delete_elem to distinguish
> between bpf_user_dynkey-typed key from syscall and bpf_dynptr_kern-typed
> key from bpf program. The definition of bpf_user_dynkey and its
> allocation method are shown below. bpf syscall uses it to allocate
> variable-sized key and passes it to qp-trie.
>
> /* Allocate bpf_user_dynkey and its data together */
> struct bpf_user_dynkey {
>         unsigned int size;
>         void *data;
> };
>
> static void *bpf_new_user_dynkey(unsigned int size)
> {
>         struct bpf_user_dynkey *dynkey;
>         size_t total;
>
>         total = round_up(sizeof(*dynkey) + size, 2);
>         dynkey = kvmalloc(total, GFP_USER | __GFP_NOWARN);
>         if (!dynkey)
>                 return ERR_PTR(-ENOMEM);
>
>         dynkey->size = size;
>         dynkey->data = &dynkey[1];
>         return (void *)((long)dynkey | BPF_USER_DYNKEY_MARK);
> }
>
>
> After Alexei suggested that bit hack is only OK for memory or
> performance reason, I'm planning to add 2 new callbacks in bpf_map_ops
> to support update/delete operations in bpf syscall as shown below, but I
> have tried it yet.
>
> /* map is generic key/value storage optionally accessible by eBPF
> programs */
> struct bpf_map_ops {
>         /* funcs callable from userspace (via syscall) */
>         /* ...... */
>         void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);

a bit confused, did you mean to also have key_size as a third argument here?

>         long (*map_update_elem_sys_only)(struct bpf_map *map, void *key,
> void *value, u64 flags);
>         long (*map_delete_elem_sys_only)(struct bpf_map *map, void *key);
>         /* ...... */
> };
>
>
>
>
>
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-09-08 22:34                         ` Andrii Nakryiko
@ 2023-09-09  2:39                           ` Hou Tao
  2023-09-11 18:10                             ` Andrii Nakryiko
  0 siblings, 1 reply; 16+ messages in thread
From: Hou Tao @ 2023-09-09  2:39 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Alexei Starovoitov, bpf, Andrii Nakryiko, Joanne Koong,
	Kumar Kartikeya Dwivedi

Hi,

On 9/9/2023 6:34 AM, Andrii Nakryiko wrote:
> On Wed, Aug 30, 2023 at 11:29 PM Hou Tao <houtao@huaweicloud.com> wrote:
>> Hi Andrii,
>>
>> On 8/26/2023 2:33 AM, Andrii Nakryiko wrote:
>>> On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@huaweicloud.com> wrote:
>> SNIP
>>>>>> Yes. bpf prog will use dynptr as the map key. The bpf program will use
>>>>>> the same map helpers as hash map to operate on qp-trie and the verifier
>>>>>> will be updated to allow using dynptr as map key for qp-trie.
>>>>> And that's the problem I just mentioned.
>>>>> PTR_TO_MAP_KEY is special. I don't think we should hack it to also
>>>>> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).
>>>> Sorry for misunderstanding your reply. But before switch to the kfunc
>>>> way, could you please point me to some code or function which shows the
>>>> specialty of PTR_MAP_KEY ?
>>>>
>>>>
>>> Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The
>>> logic assumes that there is associated struct bpf_map * pointer from
>>> which we know fixed-sized key length.
>> Thanks for the information. Will check that.
>>> But getting back to the topic at hand. I vaguely remember discussion
>>> we had, but it would be good if you could summarize it again here to
>>> avoid talking past each other. What is the bpf_map_ops changes you
>>> were thinking to do? How bpf_attr will look like? How BPF-side API for
>>> lookup/delete/update will look like? And then let's go from there?
>>> Thanks!
>> Sorry for the late reply. I am a bit distracted by other work this week.
>>
>> For bpf_attr, a new field 'dynkey_size' is added to support
>> BPF_MAP_{LOOKUP/UPDATE/DELETE}_ELEM and BPF_MAP_GET_NEXT_KEY on qp-trie
>> as shown below:
>>
>> struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
>>         __u32           map_fd;
>>         __aligned_u64   key;
>>         union {
>>                 __aligned_u64 value;
>>                 __aligned_u64 next_key;
>>         };
>>         __u64           flags;
>>         __u32           dynkey_size;    /* input/output for
>>                                          * BPF_MAP_GET_NEXT_KEY. input
>>                                          * only for other commands.
>>                                          */
> hm.. I wonder if it would be more elegant to add `key_size` and
> `value_size`, and allow to specify it (optionally) even for maps that
> have fixed-size keys and values. Return error if expected key/value
> size doesn't match map definition. From libbpf side, libbpf can be
> smart to not set it on older kernels (or if user didn't provide this
> information). But for bpf_map__lookup_elem() and other higher-level
> APIs, we should have all this information available.

I am OK with the addition of key_size and value_size in bpf_attr and I
will try to do that. After the addition, bpf syscall will also need to
check these two size for fixed-size map, but I am a bit worried about
the compatibility of libbpf and kernel for fixed-size map. There are
three possible cases:
1) new libbpf and older kernel. key_size and value_size will be ignored
by older kernel, because the definition of bpf_attr in older kernel is
shorter. It will be OK.
2) old libbpf and new kernel.  key_size and value_size will be zero for
kernel, because libbpf doesn't pass these values. We can use 0 as an
unspecified size, but for some map (e.g., bloom-filter) zero key_size is
also valid, so do we need to introduce a feature bit to tell both
key_size and value_size are valid ?
3) matched libbpf and kernel. Same problem as 2): use zero as an
unspecified size or an extra flag is needed.

I totally missed these higher-level APIs with both key_sz and value_sz. 
It seems these high-level also need updates to support qp-trie (e.g.,
bpf_map__get_next_key, because the size of current key and next key are
different).
>
>
>> };
>>
>> And 4 new APIs are added in libbpf to support basic operations on qp-trie:
>>
>> LIBBPF_API int bpf_map_update_dynkey_elem(int fd, const void *key,
>> unsigned int key_size, const void *value, __u64 flags);
>> LIBBPF_API int bpf_map_lookup_dynkey_elem(int fd, const void *key,
>> unsigned int key_size, void *value);
>> LIBBPF_API int bpf_map_delete_dynkey_elem(int fd, const void *key,
>> unsigned int key_size);
>> LIBBPF_API int bpf_map_get_next_dynkey(int fd, const void *key, void
>> *next_key, unsigned int *key_size);
>>
>> About 3 weeks again, I have used the lowest bit of key pointer in
>> .map_lookup_elem/.map_update_elem/.map_delete_elem to distinguish
>> between bpf_user_dynkey-typed key from syscall and bpf_dynptr_kern-typed
>> key from bpf program. The definition of bpf_user_dynkey and its
>> allocation method are shown below. bpf syscall uses it to allocate
>> variable-sized key and passes it to qp-trie.
>>
>> /* Allocate bpf_user_dynkey and its data together */
>> struct bpf_user_dynkey {
>>         unsigned int size;
>>         void *data;
>> };
>>
>> static void *bpf_new_user_dynkey(unsigned int size)
>> {
>>         struct bpf_user_dynkey *dynkey;
>>         size_t total;
>>
>>         total = round_up(sizeof(*dynkey) + size, 2);
>>         dynkey = kvmalloc(total, GFP_USER | __GFP_NOWARN);
>>         if (!dynkey)
>>                 return ERR_PTR(-ENOMEM);
>>
>>         dynkey->size = size;
>>         dynkey->data = &dynkey[1];
>>         return (void *)((long)dynkey | BPF_USER_DYNKEY_MARK);
>> }
>>
>>
>> After Alexei suggested that bit hack is only OK for memory or
>> performance reason, I'm planning to add 2 new callbacks in bpf_map_ops
>> to support update/delete operations in bpf syscall as shown below, but I
>> have tried it yet.
>>
>> /* map is generic key/value storage optionally accessible by eBPF
>> programs */
>> struct bpf_map_ops {
>>         /* funcs callable from userspace (via syscall) */
>>         /* ...... */
>>         void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
> a bit confused, did you mean to also have key_size as a third argument here?

Ah, I am planning to pass bpf_user_dynkey to these newly-added APIs and
bpf_user_dynke::size is the size of the key. Passing a plain pointer and
key size is also fine.
>
>>         long (*map_update_elem_sys_only)(struct bpf_map *map, void *key,
>> void *value, u64 flags);
>>         long (*map_delete_elem_sys_only)(struct bpf_map *map, void *key);
>>         /* ...... */
>> };
>>
>>
>>
>>
>>
>>
>>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ?
  2023-09-09  2:39                           ` Hou Tao
@ 2023-09-11 18:10                             ` Andrii Nakryiko
  0 siblings, 0 replies; 16+ messages in thread
From: Andrii Nakryiko @ 2023-09-11 18:10 UTC (permalink / raw)
  To: Hou Tao
  Cc: Alexei Starovoitov, bpf, Andrii Nakryiko, Joanne Koong,
	Kumar Kartikeya Dwivedi

On Fri, Sep 8, 2023 at 7:39 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 9/9/2023 6:34 AM, Andrii Nakryiko wrote:
> > On Wed, Aug 30, 2023 at 11:29 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >> Hi Andrii,
> >>
> >> On 8/26/2023 2:33 AM, Andrii Nakryiko wrote:
> >>> On Tue, Aug 22, 2023 at 6:12 AM Hou Tao <houtao@huaweicloud.com> wrote:
> >> SNIP
> >>>>>> Yes. bpf prog will use dynptr as the map key. The bpf program will use
> >>>>>> the same map helpers as hash map to operate on qp-trie and the verifier
> >>>>>> will be updated to allow using dynptr as map key for qp-trie.
> >>>>> And that's the problem I just mentioned.
> >>>>> PTR_TO_MAP_KEY is special. I don't think we should hack it to also
> >>>>> mean ARG_PTR_TO_DYNPTR depending on the first argument (map type).
> >>>> Sorry for misunderstanding your reply. But before switch to the kfunc
> >>>> way, could you please point me to some code or function which shows the
> >>>> specialty of PTR_MAP_KEY ?
> >>>>
> >>>>
> >>> Search in kernel/bpf/verifier.c how PTR_TO_MAP_KEY is handled. The
> >>> logic assumes that there is associated struct bpf_map * pointer from
> >>> which we know fixed-sized key length.
> >> Thanks for the information. Will check that.
> >>> But getting back to the topic at hand. I vaguely remember discussion
> >>> we had, but it would be good if you could summarize it again here to
> >>> avoid talking past each other. What is the bpf_map_ops changes you
> >>> were thinking to do? How bpf_attr will look like? How BPF-side API for
> >>> lookup/delete/update will look like? And then let's go from there?
> >>> Thanks!
> >> Sorry for the late reply. I am a bit distracted by other work this week.
> >>
> >> For bpf_attr, a new field 'dynkey_size' is added to support
> >> BPF_MAP_{LOOKUP/UPDATE/DELETE}_ELEM and BPF_MAP_GET_NEXT_KEY on qp-trie
> >> as shown below:
> >>
> >> struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
> >>         __u32           map_fd;
> >>         __aligned_u64   key;
> >>         union {
> >>                 __aligned_u64 value;
> >>                 __aligned_u64 next_key;
> >>         };
> >>         __u64           flags;
> >>         __u32           dynkey_size;    /* input/output for
> >>                                          * BPF_MAP_GET_NEXT_KEY. input
> >>                                          * only for other commands.
> >>                                          */
> > hm.. I wonder if it would be more elegant to add `key_size` and
> > `value_size`, and allow to specify it (optionally) even for maps that
> > have fixed-size keys and values. Return error if expected key/value
> > size doesn't match map definition. From libbpf side, libbpf can be
> > smart to not set it on older kernels (or if user didn't provide this
> > information). But for bpf_map__lookup_elem() and other higher-level
> > APIs, we should have all this information available.
>
> I am OK with the addition of key_size and value_size in bpf_attr and I
> will try to do that. After the addition, bpf syscall will also need to
> check these two size for fixed-size map, but I am a bit worried about
> the compatibility of libbpf and kernel for fixed-size map. There are
> three possible cases:
> 1) new libbpf and older kernel. key_size and value_size will be ignored
> by older kernel, because the definition of bpf_attr in older kernel is
> shorter. It will be OK.

Not ignored. Libbpf will have to zero out unknown fields. But yes, it
will be OK.

> 2) old libbpf and new kernel.  key_size and value_size will be zero for
> kernel, because libbpf doesn't pass these values. We can use 0 as an
> unspecified size, but for some map (e.g., bloom-filter) zero key_size is
> also valid, so do we need to introduce a feature bit to tell both
> key_size and value_size are valid ?

I wouldn't bother. Zero is "not provided". It will still be a valid
size for zero-sized key/value cases.

> 3) matched libbpf and kernel. Same problem as 2): use zero as an
> unspecified size or an extra flag is needed.

Again, I don't see a problem. Valid case (zero expected, zero
provided) will work correctly. The only case where we could have
caught an error would be expected non-zero size, but provided zero
size (which we treat as not specified size). Kernel won't reject it
outright, but that's backwards compatible behavior anyway, so I think
it's fine.

tl;dr I wouldn't bother with the new flag, it seems unnecessary.

>
> I totally missed these higher-level APIs with both key_sz and value_sz.
> It seems these high-level also need updates to support qp-trie (e.g.,
> bpf_map__get_next_key, because the size of current key and next key are
> different).
> >
> >

yep, we probably will have to add more generic bpf_map__next_key(map,
cur_key, cur_key_size, *next_key, *next_key_sz)


Keep in mind that for next_key operation the user should provide the
maximum next key size it expects, otherwise this kernel API will be
hard to use: very easy for the kernel to overrun a user-provided
buffer. So the next key size will be an in/out param. On input, it's
maximum expected size, on output -- actual filled out size.


> >> };
> >>
> >> And 4 new APIs are added in libbpf to support basic operations on qp-trie:
> >>
> >> LIBBPF_API int bpf_map_update_dynkey_elem(int fd, const void *key,
> >> unsigned int key_size, const void *value, __u64 flags);
> >> LIBBPF_API int bpf_map_lookup_dynkey_elem(int fd, const void *key,
> >> unsigned int key_size, void *value);
> >> LIBBPF_API int bpf_map_delete_dynkey_elem(int fd, const void *key,
> >> unsigned int key_size);
> >> LIBBPF_API int bpf_map_get_next_dynkey(int fd, const void *key, void
> >> *next_key, unsigned int *key_size);
> >>
> >> About 3 weeks again, I have used the lowest bit of key pointer in
> >> .map_lookup_elem/.map_update_elem/.map_delete_elem to distinguish
> >> between bpf_user_dynkey-typed key from syscall and bpf_dynptr_kern-typed
> >> key from bpf program. The definition of bpf_user_dynkey and its
> >> allocation method are shown below. bpf syscall uses it to allocate
> >> variable-sized key and passes it to qp-trie.
> >>
> >> /* Allocate bpf_user_dynkey and its data together */
> >> struct bpf_user_dynkey {
> >>         unsigned int size;
> >>         void *data;
> >> };
> >>
> >> static void *bpf_new_user_dynkey(unsigned int size)
> >> {
> >>         struct bpf_user_dynkey *dynkey;
> >>         size_t total;
> >>
> >>         total = round_up(sizeof(*dynkey) + size, 2);
> >>         dynkey = kvmalloc(total, GFP_USER | __GFP_NOWARN);
> >>         if (!dynkey)
> >>                 return ERR_PTR(-ENOMEM);
> >>
> >>         dynkey->size = size;
> >>         dynkey->data = &dynkey[1];
> >>         return (void *)((long)dynkey | BPF_USER_DYNKEY_MARK);
> >> }
> >>
> >>
> >> After Alexei suggested that bit hack is only OK for memory or
> >> performance reason, I'm planning to add 2 new callbacks in bpf_map_ops
> >> to support update/delete operations in bpf syscall as shown below, but I
> >> have tried it yet.
> >>
> >> /* map is generic key/value storage optionally accessible by eBPF
> >> programs */
> >> struct bpf_map_ops {
> >>         /* funcs callable from userspace (via syscall) */
> >>         /* ...... */
> >>         void *(*map_lookup_elem_sys_only)(struct bpf_map *map, void *key);
> > a bit confused, did you mean to also have key_size as a third argument here?
>
> Ah, I am planning to pass bpf_user_dynkey to these newly-added APIs and
> bpf_user_dynke::size is the size of the key. Passing a plain pointer and
> key size is also fine.

Wait, you want to pass bpf_user_dynkey from user-space side? Why?
pointer + size as part of bpf_attr seems like a more straightforward
syscall-side interface, IMO.

> >
> >>         long (*map_update_elem_sys_only)(struct bpf_map *map, void *key,
> >> void *value, u64 flags);
> >>         long (*map_delete_elem_sys_only)(struct bpf_map *map, void *key);
> >>         /* ...... */
> >> };
> >>
> >>
> >>
> >>
> >>
> >>
> >>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-09-11 18:10 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-03 13:19 Question: Is it OK to assume the address of bpf_dynptr_kern will be 8-bytes aligned and reuse the lowest bits to save extra info ? Hou Tao
2023-08-03 13:28 ` Hou Tao
2023-08-17  6:34   ` Hou Tao
2023-08-17 23:00     ` Alexei Starovoitov
2023-08-19  9:15       ` Hou Tao
2023-08-21 23:49         ` Alexei Starovoitov
2023-08-22  0:54           ` Hou Tao
2023-08-22  0:58             ` Alexei Starovoitov
2023-08-22  1:46               ` Hou Tao
2023-08-22  3:25                 ` Alexei Starovoitov
2023-08-22 13:12                   ` Hou Tao
2023-08-25 18:33                     ` Andrii Nakryiko
2023-08-31  6:29                       ` Hou Tao
2023-09-08 22:34                         ` Andrii Nakryiko
2023-09-09  2:39                           ` Hou Tao
2023-09-11 18:10                             ` Andrii Nakryiko

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.