Date: Wed, 5 Oct 2022 11:43:00 -0700
In-Reply-To: <982b9125-f849-5e1c-0082-7239b8c8eebf@redhat.com>
References: <166256538687.1434226.15760041133601409770.stgit@firesoul>
 <35fcfb25-583a-e923-6eee-e8bbcc19db17@redhat.com>
 <982b9125-f849-5e1c-0082-7239b8c8eebf@redhat.com>
Subject: Re: [PATCH RFCv2 bpf-next 00/18] XDP-hints: XDP gaining access to HW offload hints via BTF
From: sdf@google.com
To: Jesper Dangaard Brouer
Cc: brouer@redhat.com, bpf@vger.kernel.org, netdev@vger.kernel.org,
 xdp-hints@xdp-project.net, larysa.zaremba@intel.com, memxor@gmail.com,
 Lorenzo Bianconi, mtahhan@redhat.com, Alexei Starovoitov,
 Daniel Borkmann, Andrii Nakryiko, dave@dtucker.co.uk,
 Magnus Karlsson, bjorn@kernel.org
X-Mailing-List: bpf@vger.kernel.org

On 10/05, Jesper Dangaard Brouer wrote:
> On 04/10/2022 20.26, Stanislav Fomichev wrote:
> > On Tue, Oct 4, 2022 at 2:29 AM Jesper Dangaard Brouer wrote:
> > >
> > > On 04/10/2022 01.55, sdf@google.com wrote:
> > > > On 09/07, Jesper Dangaard Brouer wrote:
> > > > > This patchset exposes the traditional hardware offload hints to
> > > > > XDP and relies on BTF to expose the layout to users.
> > > > >
> > > > > The main idea is that the kernel and NIC drivers simply define
> > > > > the struct layouts they choose to use for XDP-hints. These
> > > > > XDP-hints structs get naturally and automatically described via
> > > > > BTF and implicitly exported to users. NIC drivers populate and
> > > > > record their own BTF ID as the last member in the XDP metadata
> > > > > area (making it easily accessible by AF_XDP userspace at a known
> > > > > negative offset from packet data start).
> > > > >
> > > > > A naming convention for the structs (xdp_hints_*) is used such
> > > > > that userspace can find and decode the BTF layout and match
> > > > > against the provided BTF IDs. Thus, no new UAPI interfaces are
> > > > > needed for exporting what XDP-hints a driver supports.
> > > > >
> > > > > The patch "i40e: Add xdp_hints_union" introduces the idea of
> > > > > creating a union named "xdp_hints_union" in every driver, which
> > > > > contains all xdp_hints_* structs this driver can support. This
> > > > > makes it easier/quicker to find and parse the relevant BTF
> > > > > types. (Seeking input before fixing up all drivers in the
> > > > > patchset).
> > > > >
> > > > > The main differences from RFC-v1:
> > > > >  - Drop the idea of a BTF "origin" (vmlinux, module or local)
> > > > >  - Instead use the full 64-bit BTF ID that combines object+type ID
> > > > >
> > > > > I've taken some of Alexandr/Larysa's libbpf patches and
> > > > > integrated those.
> > > > >
> > > > > The patchset exceeds netdev's usual max-15-patches rule. My
> > > > > excuse is that three NIC drivers (i40e, ixgbe and mvneta) get
> > > > > XDP-hints support, which required some refactoring to remove the
> > > > > SKB dependencies.
> > > >
> > > > Hey Jesper,
> > > >
> > > > I took a quick look at the series.
> > >
> > > Appreciate that! :-)
> > >
> > > > Do we really need the enum with the flags?
> > >
> > > The primary reason for using an enum is that it gets exposed as BTF.
> > > The proposal is that userspace needs to obtain the flags via BTF,
> > > such that they don't become UAPI, but something we can change later.
> > >
> > > > We might eventually hit that "first 16 bits are reserved" issue?
> > > >
> > > > Instead of exposing the enum with the flags, why not solve it as
> > > > follows:
> > > > a. We define UAPI struct xdp_rx_hints with _all_ possible hints
> > >
> > > How can we know _all_ possible hints from the beginning(?).
> > >
> > > A UAPI + central struct dictating all possible hints will limit
> > > innovation.
> >
> > We don't need to know them all in advance. The same way we don't know
> > them all for the flags enum. That UAPI xdp_rx_hints can be extended
> > any time some driver needs some new hint offload. The benefit here is
> > that we have a "common registry" of all offloads and different
> > drivers have an opportunity to share.
> >
> > Think of it like current __sk_buff vs sk_buff. xdp_rx_hints is a fake
> > uapi struct (__sk_buff) and the access to it gets translated into
> > _xdp_rx_hints offsets (sk_buff).
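To make the cover letter's "known negative offset" point concrete, here
is a sketch (my names, not code from the patchset) of how an AF_XDP
userspace consumer could recover the 64-bit BTF ID that the driver wrote
as the last member of the metadata area:

#include <stdint.h>
#include <string.h>

/* The driver ends its xdp_hints_* struct with the full 64-bit BTF ID
 * (object + type ID), placed directly in front of the packet data, so
 * userspace can read it back at a fixed negative offset. */
static inline uint64_t xdp_hints_btf_full_id(const void *pkt_data)
{
	uint64_t id;

	/* memcpy instead of a pointer cast: the metadata area is not
	 * guaranteed to be naturally aligned for a uint64_t everywhere */
	memcpy(&id, (const uint8_t *)pkt_data - sizeof(id), sizeof(id));
	return id;
}

An AF_XDP RX loop would call this on each descriptor's data pointer and
then dispatch on the returned ID.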
> > > > b. Each device defines a much denser _xdp_rx_hints struct with
> > > >    the metadata that it supports
> > >
> > > Thus, the NIC device is limited to what is defined in the UAPI
> > > struct xdp_rx_hints. Again this limits innovation.
> >
> > I guess what I'm missing from your series is the bpf/userspace side.
> > Do you have an example on the bpf side that will work for, say,
> > xdp_hints_ixgbe_timestamp?
> >
> > Suppose you pass this custom hints btf_id via xdp_md as proposed,
>
> I just want to reiterate why we place btf_full_id at the "end inline".
> This makes it easily available for AF_XDP to consume. Plus, we already
> have to write info into this metadata cache-line anyway, thus it's
> almost free. Moving btf_full_id into xdp_md would require expanding
> both xdp_buff and xdp_frame (+ an extra store for converting
> buff-to-frame). If AF_XDP needs this btf_full_id, the BPF-prog _could_
> move/copy it from xdp_md to metadata, but that would just waste cycles,
> so why not store it once in a known location.
>
> One option, for convenience, would be to map xdp_md->btf_full_id to
> load the btf_full_id value from the metadata. But that would
> essentially be syntax sugar and adds UAPI.
>
> > what's the action on the bpf side to consume this?
> >
> > if (ctx_hints_btf_id == xdp_hints_ixgbe_timestamp_btf_id /* supposedly
> >     populated at runtime by libbpf? */) {
>
> See e.g. bpf_core_type_id_kernel(struct xdp_hints_ixgbe_timestamp).
> AFAIK libbpf will make this a constant at load/setup time, and give us
> dead-code elimination.

Even with bpf_core_type_id_kernel() you still would have the following:

  if (ctx_hints_btf_id == bpf_core_type_id_kernel(struct xdp_hints_ixgbe)) {
  } else if (the same for every driver that has custom hints) {
  }

Toke has a good suggestion on hiding this behind a helper; either
pre-generated on the libbpf side or a kfunc. We should try to hide this
per-device logic if possible; otherwise we'll get to per-device XDP
programs that only work on some special deployments. OTOH, we'll
probably get there with the hints anyway?

> > 	// do something with rx_timestamp
> > 	// also, handle xdp_hints_ixgbe and then xdp_hints_common ?
> > } else if (ctx_hints_btf_id == xdp_hints_ixgbe) {
> > 	// do something else
> > 	// plus explicitly handle xdp_hints_common here?
> > } else {
> > 	// handle xdp_hints_common
> > }
>
> I added a BPF-helper that can tell us if the layout is compatible with
> xdp_hints_common, which is basically the only UAPI the patchset
> introduces. The code handling xdp_hints_common should be common.
>
> I'm not super happy with the BPF-helper approach, so suggestions are
> welcome. E.g. xdp_md/ctx->is_hint_common could be one approach, and
> ctx->has_hint another (ctx is often called xdp, so it reads
> xdp->has_hint).
>
> One feature I need from the BPF-helper is to "disable" the xdp_hints
> and allow the BPF-prog to use the entire metadata area for something
> else (avoiding it being misinterpreted by the next prog or after a
> redirect).

As mentioned in the previous emails, let's try to have a bpf side
example/selftest for the next round?

I also feel like xdp_hints_common is a bit distracting. It makes the
common case easy and it hides the discussion/complexity about
per-device hints. Maybe we can drop this common case altogether? Why
can't every driver have a custom hints struct? If we agree that the
naming/sizes will be the same across them (and review catches/guarantees
that), why do we even care about having a common xdp_hints_common
struct?
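For concreteness, here is roughly the per-driver dispatch I expect we'd
end up writing. This is a sketch only: the struct layouts are invented
stand-ins for whatever the drivers define, and I'm assuming the type ID
sits in the low 32 bits of the 64-bit object+type ID (the exact encoding
is up to the patchset):

#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* Illustrative driver hint layouts; the real ones come from the
 * drivers' module BTF. */
struct xdp_hints_i40e_timestamp {
	__u64 rx_timestamp;
} __attribute__((preserve_access_index));

struct xdp_hints_ixgbe {
	__u32 rx_hash;
} __attribute__((preserve_access_index));

static __always_inline void dispatch_hints(__u64 ctx_hints_btf_id)
{
	/* assumption: low 32 bits of btf_full_id hold the type ID */
	__u32 type_id = (__u32)ctx_hints_btf_id;

	/* bpf_core_type_id_kernel() is resolved by libbpf at load time,
	 * so each comparison below is against a constant */
	if (type_id == bpf_core_type_id_kernel(struct xdp_hints_i40e_timestamp)) {
		/* use rx_timestamp */
	} else if (type_id == bpf_core_type_id_kernel(struct xdp_hints_ixgbe)) {
		/* use the ixgbe-specific fields */
	} else {
		/* fall back to xdp_hints_common handling */
	}
}

This is exactly the per-device chain I'd like a libbpf- or
kfunc-provided helper to hide.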
> > What I'd like to avoid is an xdp program targeting specific drivers.
> > Where possible, we should aim towards something like "if this device
> > has rx_timestamp offload -> use it", without depending too much on
> > specific btf_ids.
>
> I do understand your wish, and adding rx_timestamps to xdp_hints_common
> would be too easy (and IMHO a waste of u64/8 bytes for all the packets
> not needing this timestamp). Hopefully we can come up with a good
> solution together.
>
> One idea would be to extend libbpf to lookup or translate the struct
> name
>
>  struct xdp_hints_DRIVER_timestamp {
>    __u64 rx_timestamp;
>  } __attribute__((preserve_access_index));
>
> into e.g. xdp_hints_i40e_timestamp, if an ifindex was provided when
> loading the XDP prog, and have the bpf_core_type_id_kernel() result of
> the struct return the id of xdp_hints_i40e_timestamp.
>
> But this idea doesn't really work for the veth redirect use-case :-(
> As veth needs to handle xdp_hints from other drivers.

Agreed. If we want redirect to work, then the parsing should either be
mostly pre-generated by libbpf to include all possible btf ids that
matter, or done similarly by a kfunc. The idea that we can pre-generate
a per-device bpf program seems to be out of the window now?

> > > > c. The subset of fields in _xdp_rx_hints should match the ones
> > > >    from xdp_rx_hints (we essentially standardize on the field
> > > >    names/sizes)
> > > > d. We expose the _xdp_rx_hints btf id via netlink for each device
> > >
> > > For this proposed design you would still need more than one BTF ID
> > > or _xdp_rx_hints struct, because not all packets contain all hints.
> > > The most common case is HW timestamping, which some HW only supports
> > > for PTP frames.
> > >
> > > Plus, I don't see a need to expose anything via netlink, as we can
> > > just use the existing BTF information from the module. Thus, we
> > > avoid creating more UAPI.
> >
> > See above. I think even with your series, that btf_id info should
> > also come via netlink so the programs can query it before loading and
> > do the required adjustments. Otherwise, I'm not sure I understand
> > what I need to do with a btf_id that comes via xdp_md/xdp_frame. It
> > seems too late? I need to know them in advance to at least populate
> > those ids into the bpf program itself?
>
> Yes, we need to know these IDs in advance, and we can. I don't think we
> need the netlink interface, as we can already read out the BTF layout
> and IDs today. I coded it up in userspace, where the intended consumer
> is AF_XDP (as libbpf already does this itself).
>
> See this code:
>  - https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_ids.c
>  - https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_read.c

SG, if we can have some convention on the names where we can reliably
parse out all possible structs with the hints, let's rely solely on
vmlinux + vmlinux module btf.

> > > > e. libbpf will query and do offset relocations for
> > > >    xdp_rx_hints -> _xdp_rx_hints at load time
> > > >
> > > > Would that work? Then it seems like we can replace bitfields with
> > > > the
> > >
> > > I used to be a fan of bitfields, until I discovered that they are
> > > bad for performance, because compilers cannot optimize these.
> >
> > Ack, good point, something to keep in mind.
> >
> > > > following:
> > > >
> > > >   if (bpf_core_field_exists(struct xdp_rx_hints, vlan_tci)) {
> > > >           /* use that hint */
> > >
> > > Fairly often a VLAN will not be set in packets, so we still have to
> > > read and check a bitfield/flag to see if the VLAN value is valid.
> > > (Guess it is implicit in the above code).
> >
> > That's a fair point. Then we need two signals?
> >
> > 1. Whether this particular offload is supported for the device at all
> >    (via that bpf_core_field_exists or something similar)
> > 2. Whether this particular packet has that particular metadata (via
> >    your proposed flags)
> >
> >   if (device I'm attaching xdp to has vlan offload) {
> >     // via bpf_core_field_exists?
> >     if (particular packet comes with a vlan tag) {
> >       // via your proposed bitfield flags?
> >     }
> >   }
> >
> > Or are we assuming that (2) is fast enough and we don't care about
> > (1)? Because (1) can 'if (0)' the whole branch and make the verifier
> > remove that part.
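Spelling out what (1) + (2) would look like together (again a sketch:
the struct and the flag bit are invented, only bpf_core_field_exists()
and BPF_CORE_READ() are real libbpf CO-RE macros):

#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

/* The hypothetical "fake UAPI" struct from this thread; CO-RE would
 * relocate accesses to the device's real layout at load time. */
struct xdp_rx_hints {
	__u32 flags;
	__u32 vlan_tci;
} __attribute__((preserve_access_index));

#define HINT_FLAG_VLAN_VALID (1U << 0)	/* invented flag bit */

static __always_inline int rx_vlan_tci(struct xdp_rx_hints *hints)
{
	/* (1) does this device/kernel expose the field at all? This
	 * resolves to a load-time constant, so the dead branch is
	 * removed and the verifier never sees it. */
	if (!bpf_core_field_exists(struct xdp_rx_hints, vlan_tci))
		return -1;

	/* (2) does this particular packet carry a valid VLAN tag? */
	if (!(BPF_CORE_READ(hints, flags) & HINT_FLAG_VLAN_VALID))
		return -1;

	return BPF_CORE_READ(hints, vlan_tci);
}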
> > > > }
> > > >
> > > > All we need here is for libbpf to, again, do the xdp_rx_hints ->
> > > > _xdp_rx_hints translation before it evaluates
> > > > bpf_core_field_exists()?
> > > >
> > > > Thoughts? Any downsides? Am I missing something?
> > >
> > > Well, the downside is primarily that this design limits innovation.
> > >
> > > Each time a NIC driver wants to introduce a new hardware hint, they
> > > have to update the central UAPI xdp_rx_hints struct first.
> > >
> > > The design in the patchset is open for innovation. Drivers can
> > > extend their own xdp_hints_<driver>_xxx struct(s). They still have
> > > to land their patches upstream, but avoid mangling a central UAPI
> > > struct. As upstream we review driver changes and should focus on
> > > sane struct member naming(+size), especially if this "sounds" like
> > > a hint/feature that more drivers are likely to support. With help
> > > from BTF relocations, a new driver can support the same
> > > hint/feature if naming(+size) match (without necessarily the same
> > > offset in the struct).
> >
> > The opposite side of this approach is that we'll have 'ixgbe_hints'
> > with 'rx_timestamp' and 'mvneta_hints' with something like
> > 'rx_tstamp'.
>
> Well, as I wrote, reviewers should ask drivers to use the same member
> name.

SG!

> > > > Also, about the TX side: I feel like the same can be applied
> > > > there, the program works with xdp_tx_hints and libbpf will
> > > > rewrite to _xdp_tx_hints. xdp_tx_hints might have fields like
> > > > "has_tx_vlan:1"; those, presumably, can be relocatable by libbpf
> > > > as well?
> > >
> > > Good to think ahead for the TX-side, even though I think we should
> > > focus on landing the RX-side first.
> > >
> > > I notice your naming xdp_rx_hints vs. xdp_tx_hints. I have named
> > > the common struct xdp_hints_common, without an RX/TX direction
> > > indication. Maybe this is wrong of me, but my thinking was that
> > > most of the common hints can be directly used as TX-side hints. I'm
> > > hoping the TX-side xdp-hints will need little-to-no adjustment
> > > before using the hints as TX "instructions". I'm hoping that
> > > XDP-redirect will just work and the xmit driver can use the
> > > XDP-hints area.
> > >
> > > Please correct me if I'm wrong.
> > > The checksum fields hopefully translate to similar TX offload
> > > "actions". The VLAN offload hint should translate directly to the
> > > TX-side.
> > >
> > > I can easily be convinced we should name it xdp_hints_rx_common
> > > from the start, but then I will propose that xdp_hints_tx_common
> > > have the checksum and VLAN fields+flags at the same locations, such
> > > that we don't take any performance hit from moving them to
> > > "TX-side" hints, making XDP-redirect just work.
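FWIW, that "same locations" requirement could even be checked at compile
time. The layouts below are invented purely to illustrate the
constraint; only the matching offsets matter:

#include <stddef.h>
#include <linux/types.h>

struct xdp_hints_rx_common {
	__u32 rx_hash;		/* RX-only member */
	__u16 vlan_tci;
	__u16 csum_flags;
};

struct xdp_hints_tx_common {
	__u32 reserved;		/* keeps the shared members aligned */
	__u16 vlan_tci;		/* consumed as a TX "instruction" */
	__u16 csum_flags;
};

/* If these hold, XDP-redirect can hand the metadata area to the xmit
 * driver unchanged. */
_Static_assert(offsetof(struct xdp_hints_rx_common, vlan_tci) ==
	       offsetof(struct xdp_hints_tx_common, vlan_tci),
	       "VLAN hint must not move between RX and TX");
_Static_assert(offsetof(struct xdp_hints_rx_common, csum_flags) ==
	       offsetof(struct xdp_hints_tx_common, csum_flags),
	       "checksum hint must not move between RX and TX");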
However once case that I'm interested > > in is rx_timestamp vs tx_timestamp. For rx, I'm getting the timestamp > > in the metadata; for tx, I'm merely setting a flag somewhere to > > request it for async delivery later (I hope we plan to support that > > for af_xdp?). So the layout might be completely different :-( > > > Yes, it is definitely in my plans to support handling at TX-completion > time, so you can extract the TX-wire-timestamp. This is easy for AF_XDP > as it has the CQ (Completion Queue) step. > I'm getting ahead of myself, but for XDP I imagine that driver will > populate this xdp_tx_hint in DMA TX-completion function, and we can add > a kfunc "not-a-real-hook" to xdp_return_frame that can run another XDP > BPF-prog that can inspect the xdp_tx_hint in metadata. Can we also place that xdp_tx_hint somewhere in the completion ring for AF_XDP to consume? > At this proposed kfunc xdp_return_frame call point, we likely cannot know > what driver that produced the xdp_hints metadata either, and thus not lock > our design or BTF-reloacations to assume which driver is it loaded on. > [... cut ... getting too long] > --Jesper