Re: [PATCH v9 bpf-next 3/5] bpf: Add skb dynptrs

From: Martin KaFai Lau <martin.lau@linux.dev>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Joanne Koong <joannelkoong@gmail.com>,
	daniel@iogearbox.net, andrii@kernel.org, martin.lau@kernel.org,
	ast@kernel.org, netdev@vger.kernel.org, memxor@gmail.com,
	kernel-team@fb.com, bpf@vger.kernel.org,
	Andrii Nakryiko <andrii.nakryiko@gmail.com>
Subject: Re: [PATCH v9 bpf-next 3/5] bpf: Add skb dynptrs
Date: Tue, 31 Jan 2023 14:07:17 -0800	[thread overview]
Message-ID: <c873deaf-6d63-98b1-a29a-01dd311be99a@linux.dev> (raw)
In-Reply-To: <20230131053042.h7wp3w2zq46swfmk@macbook-pro-6.dhcp.thefacebook.com>

On 1/30/23 9:30 PM, Alexei Starovoitov wrote:
>>>>> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse
>>>>> the packet calling that in bpf_dynptr_data is a heavy hammer.
>>>>>
>>>>> It feels that we need to go back to skb_header_pointer-like discussion.
>>>>> Something like:
>>>>> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer)
>>>>> Whether buffer is a part of dynptr or program provided is tbd.
>>>>
>>>> making it hidden within dynptr would make this approach unreliable
>>>> (memory allocations, which can fail, etc). But if we ask users to pass
>>>> it directly, then it should be relatively easy to use in practice with
>>>> some pre-allocated per-CPU buffer:
> 
> bpf_skb_pull_data() is even more unreliable, since it's a bigger allocation.
> I like preallocated approach more, so we're in agreement here.
> 
>>>>
>>>>
>>>> struct {
>>>>     __int(type, BPF_MAP_TYPE_PERCPU_ARRAY);
>>>>     __int(max_entries, 1);
>>>>     __type(key, int);
>>>>     __type(value, char[4096]);
>>>> } scratch SEC(".maps");
>>>>
>>>>
>>>> ...
>>>>
>>>>
>>>> struct dyn_ptr *dp = bpf_dynptr_from_skb(...).
>>>> void *p, *buf;
>>>> int zero = 0;
>>>>
>>>> buf = bpf_map_lookup_elem(&scratch, &zero);
>>>> if (!buf) return 0; /* can't happen */
>>>>
>>>> p = bpf_dynptr_slice(dp, off, 16, buf);
>>>> if (p == NULL) {
>>>>      /* out of range */
>>>> } else {
>>>>      /* work with p directly */
>>>> }
>>>>
>>>> /* if we wrote something to p and it was copied to buffer, write it back */
>>>> if (p == buf) {
>>>>       bpf_dynptr_write(dp, buf, 16);
>>>> }
>>>>
>>>>
>>>> We'll just need to teach verifier to make sure that buf is at least 16
>>>> byte long.
>>>
>>> A fifth __sz arg may do:
>>> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void
>>> *buffer, u32 buffer__sz);
>>
>> We'll need to make sure that buffer__sz is >= len (or preferably not
>> require extra size at all). We can check that at runtime, of course,
>> but rejecting too small buffer at verification time would be a better
>> experience.
> 
> I don't follow. Why two equivalent 'len' args ?
> Just to allow 'len' to be a variable instead of constant ?
> It's unusual for the verifier to have 'len' before 'buffer',
> but this is fixable.

Agree. One const scalar 'len' should be enough. Buffer should have the same size 
as the requesting slice.

> 
> How about adding 'rd_only vs rdwr' flag ?
> Then MEM_RDONLY for ret value of bpf_dynptr_slice can be set by the verifier
> and in run-time bpf_dynptr_slice() wouldn't need to check for skb->cloned.
> if (rd_only) return skb_header_pointer()
> if (rdwr) bpf_try_make_writable(); return skb->data + off;
> and final bpf_dynptr_write() is not needed.
> 
> But that doesn't work for xdp, since there is no pull.
> 
> It's not clear how to deal with BPF_F_RECOMPUTE_CSUM though.
> Expose __skb_postpull_rcsum/__skb_postpush_rcsum as kfuncs?
> But that defeats Andrii's goal to use dynptr as a generic wrapper.
> skb is quite special.
> 
> Maybe something like:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
>                         void *buffer, u32 buffer__sz)
> {
>    if (skb_cloned()) {
>      skb_copy_bits(skb, offset, buffer, len);
>      return buffer;
>    }
>    return skb_header_pointer(...);
> }
> 
> When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write.
> The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write.
> No need for rdonly flag, but extra copy is there in case of cloned which
> could have been avoided with extra rd_only flag.
> 
> In case of xdp it will be:
> void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len,
>                         void *buffer, u32 buffer__sz)
> {
>     ptr = bpf_xdp_pointer(xdp, offset, len);
>     if (ptr)
>        return ptr;
>     bpf_xdp_copy_buf(xdp, offset, buffer, len, false); /* copy into buf */
>     return buffer;
> }
> 
> bpf_dynptr_write will use bpf_xdp_copy_buf(,true); /* copy into xdp */

My preference would be making bpf_dynptr_slice() work similarly for skb and xdp, 
so above bpf_dynptr_slice() skb and xdp logic looks good.

Regarding, MEM_RDONLY, it probably is not relevant to xdp. For skb, not sure how 
often is the 'skb_cloned() && !skb_clone_writable()'. May be it can be left for 
later optimization?

Regarding BPF_F_RECOMPUTE_CSUM, I wonder if bpf_csum_diff() is enough to come up 
the csum. Then the missing kfunc is to update the skb->csum. Not sure how the 
csum logic will look like in xdp, probably getting csum from the xdp-hint, 
calculate csum_diff and then set it to the to-be-created skb. All this is likely 
a kfunc also, eg. a kfunc to directly allocate skb during the XDP_PASS case. The 
bpf prog will have to be written differently if it needs to deal with the csum 
but the header parsing part could at least be shared.