From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 518BBC636CC for ; Tue, 31 Jan 2023 22:16:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230189AbjAaWQL (ORCPT ); Tue, 31 Jan 2023 17:16:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43902 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230007AbjAaWQK (ORCPT ); Tue, 31 Jan 2023 17:16:10 -0500 X-Greylist: delayed 522 seconds by postgrey-1.37 at lindbergh.monkeyblade.net; Tue, 31 Jan 2023 14:16:09 PST Received: from out-170.mta1.migadu.com (out-170.mta1.migadu.com [95.215.58.170]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9695645BE8 for ; Tue, 31 Jan 2023 14:16:09 -0800 (PST) Message-ID: DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1; t=1675202846; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=ck7B5zBKkHCM9oUtwGTyOQHqePke66EuHA6ljEevXRw=; b=CenTPmdiLMg3jshXTEFCGkCdW2d1qrmfHOQiVjTZu05vjyUWy4dOowfMV4aLNkC4OX5PV6 TTgdIpZ0s4DaznUxxmLPffhGlcTQyUXKPTTmI3ZM6P95gOjHmq30zNUrLdeDyMk0DVsm7/ sapep+rq2RfdHhsxiCuhi4wpuvUYu8c= Date: Tue, 31 Jan 2023 14:07:17 -0800 MIME-Version: 1.0 Subject: Re: [PATCH v9 bpf-next 3/5] bpf: Add skb dynptrs Content-Language: en-US To: Alexei Starovoitov Cc: Joanne Koong , daniel@iogearbox.net, andrii@kernel.org, martin.lau@kernel.org, ast@kernel.org, netdev@vger.kernel.org, memxor@gmail.com, kernel-team@fb.com, bpf@vger.kernel.org, Andrii Nakryiko References: <20230127191703.3864860-1-joannelkoong@gmail.com> <20230127191703.3864860-4-joannelkoong@gmail.com> <5715ea83-c4aa-c884-ab95-3d5e630cad05@linux.dev> <20230130223141.r24nlg2jp5byvuph@macbook-pro-6.dhcp.thefacebook.com> <4b7b09b5-fd23-2447-7f05-5f903288625f@linux.dev> <20230131053042.h7wp3w2zq46swfmk@macbook-pro-6.dhcp.thefacebook.com> X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers. From: Martin KaFai Lau In-Reply-To: <20230131053042.h7wp3w2zq46swfmk@macbook-pro-6.dhcp.thefacebook.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Migadu-Flow: FLOW_OUT Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On 1/30/23 9:30 PM, Alexei Starovoitov wrote: >>>>> Also bpf_skb_pull_data is quite heavy. For progs that only want to parse >>>>> the packet calling that in bpf_dynptr_data is a heavy hammer. >>>>> >>>>> It feels that we need to go back to skb_header_pointer-like discussion. >>>>> Something like: >>>>> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void *buffer) >>>>> Whether buffer is a part of dynptr or program provided is tbd. >>>> >>>> making it hidden within dynptr would make this approach unreliable >>>> (memory allocations, which can fail, etc). But if we ask users to pass >>>> it directly, then it should be relatively easy to use in practice with >>>> some pre-allocated per-CPU buffer: > > bpf_skb_pull_data() is even more unreliable, since it's a bigger allocation. > I like preallocated approach more, so we're in agreement here. > >>>> >>>> >>>> struct { >>>> __int(type, BPF_MAP_TYPE_PERCPU_ARRAY); >>>> __int(max_entries, 1); >>>> __type(key, int); >>>> __type(value, char[4096]); >>>> } scratch SEC(".maps"); >>>> >>>> >>>> ... >>>> >>>> >>>> struct dyn_ptr *dp = bpf_dynptr_from_skb(...). >>>> void *p, *buf; >>>> int zero = 0; >>>> >>>> buf = bpf_map_lookup_elem(&scratch, &zero); >>>> if (!buf) return 0; /* can't happen */ >>>> >>>> p = bpf_dynptr_slice(dp, off, 16, buf); >>>> if (p == NULL) { >>>> /* out of range */ >>>> } else { >>>> /* work with p directly */ >>>> } >>>> >>>> /* if we wrote something to p and it was copied to buffer, write it back */ >>>> if (p == buf) { >>>> bpf_dynptr_write(dp, buf, 16); >>>> } >>>> >>>> >>>> We'll just need to teach verifier to make sure that buf is at least 16 >>>> byte long. >>> >>> A fifth __sz arg may do: >>> bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, void >>> *buffer, u32 buffer__sz); >> >> We'll need to make sure that buffer__sz is >= len (or preferably not >> require extra size at all). We can check that at runtime, of course, >> but rejecting too small buffer at verification time would be a better >> experience. > > I don't follow. Why two equivalent 'len' args ? > Just to allow 'len' to be a variable instead of constant ? > It's unusual for the verifier to have 'len' before 'buffer', > but this is fixable. Agree. One const scalar 'len' should be enough. Buffer should have the same size as the requesting slice. > > How about adding 'rd_only vs rdwr' flag ? > Then MEM_RDONLY for ret value of bpf_dynptr_slice can be set by the verifier > and in run-time bpf_dynptr_slice() wouldn't need to check for skb->cloned. > if (rd_only) return skb_header_pointer() > if (rdwr) bpf_try_make_writable(); return skb->data + off; > and final bpf_dynptr_write() is not needed. > > But that doesn't work for xdp, since there is no pull. > > It's not clear how to deal with BPF_F_RECOMPUTE_CSUM though. > Expose __skb_postpull_rcsum/__skb_postpush_rcsum as kfuncs? > But that defeats Andrii's goal to use dynptr as a generic wrapper. > skb is quite special. > > Maybe something like: > void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, > void *buffer, u32 buffer__sz) > { > if (skb_cloned()) { > skb_copy_bits(skb, offset, buffer, len); > return buffer; > } > return skb_header_pointer(...); > } > > When prog is just parsing the packet it doesn't need to finalize with bpf_dynptr_write. > The prog can always write into the pointer followed by if (p == buf) bpf_dynptr_write. > No need for rdonly flag, but extra copy is there in case of cloned which > could have been avoided with extra rd_only flag. > > In case of xdp it will be: > void *bpf_dynptr_slice(const struct bpf_dynptr *ptr, u32 offset, u32 len, > void *buffer, u32 buffer__sz) > { > ptr = bpf_xdp_pointer(xdp, offset, len); > if (ptr) > return ptr; > bpf_xdp_copy_buf(xdp, offset, buffer, len, false); /* copy into buf */ > return buffer; > } > > bpf_dynptr_write will use bpf_xdp_copy_buf(,true); /* copy into xdp */ My preference would be making bpf_dynptr_slice() work similarly for skb and xdp, so above bpf_dynptr_slice() skb and xdp logic looks good. Regarding, MEM_RDONLY, it probably is not relevant to xdp. For skb, not sure how often is the 'skb_cloned() && !skb_clone_writable()'. May be it can be left for later optimization? Regarding BPF_F_RECOMPUTE_CSUM, I wonder if bpf_csum_diff() is enough to come up the csum. Then the missing kfunc is to update the skb->csum. Not sure how the csum logic will look like in xdp, probably getting csum from the xdp-hint, calculate csum_diff and then set it to the to-be-created skb. All this is likely a kfunc also, eg. a kfunc to directly allocate skb during the XDP_PASS case. The bpf prog will have to be written differently if it needs to deal with the csum but the header parsing part could at least be shared.