From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E4B4C433FE for ; Thu, 17 Nov 2022 23:47:48 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231194AbiKQXrr (ORCPT ); Thu, 17 Nov 2022 18:47:47 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:49796 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230287AbiKQXrq (ORCPT ); Thu, 17 Nov 2022 18:47:46 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A0E07A446 for ; Thu, 17 Nov 2022 15:46:53 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1668728812; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=t4amhDHKsYSMHWw3ZB+fT/YBGFiZSg1p/vLZ6sF81EM=; b=W9ufZfOkWmgJ1aSXqdLZObDb5NGpXqaZbJFc4BV/VJ2TdXY3w1ic2yhwDeJHHgg79qcj5e Vs8MK7g3+KFMJBnukj7owgWPPDoopkue6jht5FJwCeNGXzxiH7NTC6NKgOMYy0fE6FmoWY FkgjqxhbMC9jeCL1Z5OvXF+jgOV6Sh4= Received: from mail-ed1-f72.google.com (mail-ed1-f72.google.com [209.85.208.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-94-J0HNvF8fN06dkMJNDbYIAg-1; Thu, 17 Nov 2022 18:46:51 -0500 X-MC-Unique: J0HNvF8fN06dkMJNDbYIAg-1 Received: by mail-ed1-f72.google.com with SMTP id z9-20020a05640235c900b0046358415c4fso2036118edc.9 for ; Thu, 17 Nov 2022 15:46:51 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:message-id:date:references :in-reply-to:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=t4amhDHKsYSMHWw3ZB+fT/YBGFiZSg1p/vLZ6sF81EM=; b=fyznLbl4YG3koqFiOg/0BK0sD+WMxR3fqFM/56nUcMk+hm8RjdvoIEIw/kNXTtqAyQ x83cO4kOmUOUlpbmCtjEfEX3v3OXaLo61lpj8PTkkq380zAcntEl/3OkjfG5vIMLyzW0 Zql3MvQxGwzHG1VgUWDwGwxJCy8xQx7yei6DyadtzY68BZQ9YA8JpCQ0jM80CNwfaDgD TR4uaujeKGmE0pwNMQ7c2qCqBtVzxozXm8Kt5nXQdjNyCX+8bl52BfF7w8cDmdKFEGZW chkyTQAkJ8tC0y3mCKggqSF/hzE8dhnrRngE+QXZE0AWvDryY5+PWC2mEeFEt0P9fvqR Za3g== X-Gm-Message-State: ANoB5plxYu4L1RShqT0NKFCN7gHai60droVS6EJN3JGzaFViwfrV5KYu +4sENtDXbB1GLA2kRghUzuT/E2GvEXkg4QoPaFU06zSRQHy3svYv/6TQMi4n64GKnOusVmVYz6N T1tkZGGOi9SRp X-Received: by 2002:a17:906:791:b0:7ad:14f8:7583 with SMTP id l17-20020a170906079100b007ad14f87583mr4074832ejc.185.1668728810206; Thu, 17 Nov 2022 15:46:50 -0800 (PST) X-Google-Smtp-Source: AA0mqf5xbcfwM/CIQaQKBLzuqeKKGNTHSR6fYH18SUafOXx+saLrIDVlD+zoA8TAuKC4YfWcmovUKg== X-Received: by 2002:a17:906:791:b0:7ad:14f8:7583 with SMTP id l17-20020a170906079100b007ad14f87583mr4074814ejc.185.1668728809728; Thu, 17 Nov 2022 15:46:49 -0800 (PST) Received: from alrua-x1.borgediget.toke.dk ([45.145.92.2]) by smtp.gmail.com with ESMTPSA id v10-20020a170906292a00b007ad96726c42sm959726ejd.91.2022.11.17.15.46.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 17 Nov 2022 15:46:48 -0800 (PST) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id ED9B37A701D; Fri, 18 Nov 2022 00:46:46 +0100 (CET) From: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= To: Stanislav Fomichev , Alexei Starovoitov Cc: John Fastabend , Martin KaFai Lau , bpf , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Song Liu , Yonghong Song , KP Singh , Hao Luo , Jiri Olsa , David Ahern , Jakub Kicinski , Willem de Bruijn , Jesper Dangaard Brouer , Anatoly Burakov , Alexander Lobakin , Magnus Karlsson , Maryam Tahhan , xdp-hints@xdp-project.net, Network Development Subject: Re: [xdp-hints] Re: [PATCH bpf-next 05/11] veth: Support rx timestamp metadata for xdp In-Reply-To: References: <20221115030210.3159213-1-sdf@google.com> <20221115030210.3159213-6-sdf@google.com> <87h6z0i449.fsf@toke.dk> <8735ajet05.fsf@toke.dk> <6374854883b22_5d64b208e3@john.notmuch> <34f89a95-a79e-751c-fdd2-93889420bf96@linux.dev> <878rkbjjnp.fsf@toke.dk> <6375340a6c284_66f16208aa@john.notmuch> <637576962dada_8cd03208b0@john.notmuch> <87wn7t4y0g.fsf@toke.dk> X-Clacks-Overhead: GNU Terry Pratchett Date: Fri, 18 Nov 2022 00:46:46 +0100 Message-ID: <874juxywih.fsf@toke.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Precedence: bulk List-ID: X-Mailing-List: bpf@vger.kernel.org Stanislav Fomichev writes: > On Thu, Nov 17, 2022 at 8:59 AM Alexei Starovoitov > wrote: >> >> On Thu, Nov 17, 2022 at 3:32 AM Toke H=C3=B8iland-J=C3=B8rgensen wrote: >> > >> > Stanislav Fomichev writes: >> > >> > >> > Doesn't look like the descriptors are as nice as you're trying to >> > >> > paint them (with clear hash/csum fields) :-) So not sure how much >> > >> > CO-RE would help. >> > >> > At least looking at mlx4 rx_csum, the driver consults three diffe= rent >> > >> > sets of flags to figure out the hash_type. Or am I just unlucky w= ith >> > >> > mlx4? >> > >> >> > >> Which part are you talking about ? >> > >> hw_checksum =3D csum_unfold((__force __sum16)cqe->checksum); >> > >> is trivial enough for bpf prog to do if it has access to 'cqe' poin= ter >> > >> which is what John is proposing (I think). >> > > >> > > I'm talking about mlx4_en_process_rx_cq, the caller of that check_cs= um. >> > > In particular: if (likely(dev->features & NETIF_F_RXCSUM)) branch >> > > I'm assuming we want to have hash_type available to the progs? >> > >> > I agree we should expose the hash_type, but that doesn't actually look >> > to be that complicated, see below. >> > >> > > But also, check_csum handles other corner cases: >> > > - short_frame: we simply force all those small frames to skip checks= um complete >> > > - get_fixed_ipv6_csum: In IPv6 packets, hw_checksum lacks 6 bytes fr= om >> > > IPv6 header >> > > - get_fixed_ipv4_csum: Although the stack expects checksum which >> > > doesn't include the pseudo header, the HW adds it >> > > >> > > So it doesn't look like we can just unconditionally use cqe->checksu= m? >> > > The driver does a lot of massaging around that field to make it >> > > palatable. >> > >> > Poking around a bit in the other drivers, AFAICT it's only a subset of >> > drivers that support CSUM_COMPLETE at all; for instance, the Intel >> > drivers just set CHECKSUM_UNNECESSARY for TCP/UDP/SCTP. I think the >> > CHECKSUM_UNNECESSARY is actually the most important bit we'd want to >> > propagate? >> > >> > AFAICT, the drivers actually implementing CHECKSUM_COMPLETE need access >> > to other data structures than the rx descriptor to determine the status >> > of the checksum (mlx4 looks at priv->flags, mlx5 checks rq->state), so >> > just exposing the rx descriptor to BPF as John is suggesting does not >> > actually give the XDP program enough information to act on the checksum >> > field on its own. We could still have a separate kfunc to just expose >> > the hw checksum value (see below), but I think it probably needs to be >> > paired with other kfuncs to be useful. >> > >> > Looking at the mlx4 code, I think the following mapping to kfuncs (in >> > pseudo-C) would give the flexibility for XDP to access all the bits it >> > needs, while inlining everything except getting the full checksum for >> > non-TCP/UDP traffic. An (admittedly cursory) glance at some of the oth= er >> > drivers (mlx5, ice, i40e) indicates that this would work for those >> > drivers as well. >> > >> > >> > bpf_xdp_metadata_rx_hash_supported() { >> > return dev->features & NETIF_F_RXHASH; >> > } >> > >> > bpf_xdp_metadata_rx_hash() { >> > return be32_to_cpu(cqe->immed_rss_invalid); >> > } >> > >> > bpf_xdp_metdata_rx_hash_type() { >> > if (likely(dev->features & NETIF_F_RXCSUM) && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | MLX4_CQE_STATUS= _UDP)) && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && >> > cqe->checksum =3D=3D cpu_to_be16(0xffff)) >> > return PKT_HASH_TYPE_L4; >> > >> > return PKT_HASH_TYPE_L3; >> > } >> > >> > bpf_xdp_metadata_rx_csum_supported() { >> > return dev->features & NETIF_F_RXCSUM; >> > } >> > >> > bpf_xdp_metadata_rx_csum_level() { >> > if ((cqe->status & cpu_to_be16(MLX4_CQE_STATUS_TCP | >> > MLX4_CQE_STATUS_UDP)) && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IPOK)) && >> > cqe->checksum =3D=3D cpu_to_be16(0xffff)) >> > return CHECKSUM_UNNECESSARY; >> > >> > if (!(priv->flags & MLX4_EN_FLAG_RX_CSUM_NON_TCP_UDP && >> > (cqe->status & cpu_to_be16(MLX4_CQE_STATUS_IP_ANY))) && >> > !short_frame(len)) >> > return CHECKSUM_COMPLETE; /* we could also omit this case = entirely */ >> > >> > return CHECKSUM_NONE; >> > } >> > >> > /* this one could be called by the metadata_to_skb code */ >> > bpf_xdp_metadata_rx_csum_full() { >> > return check_csum() /* BPF_CALL this after refactoring so it is skb-= agnostic */ >> > } >> > >> > /* this one would be for people like John who want to re-implement >> > * check_csum() themselves */ >> > bpf_xdp_metdata_rx_csum_raw() { >> > return cqe->checksum; >> > } >> >> Are you proposing a bunch of per-driver kfuncs that bpf prog will call. >> If so that works, but bpf prog needs to pass dev and cqe pointers >> into these kfuncs, so they need to be exposed to the prog somehow. >> Probably through xdp_md ? No, I didn't mean we should call per-driver kfuncs; the examples above were meant to be examples of what the mlx4 driver would unrolls those kfuncs to. Sorry that that wasn't clear. > So far I'm doing: > > struct mlx4_xdp_buff { > struct xdp_buff xdp; > struct mlx4_cqe *cqe; > struct mlx4_en_dev *mdev; > } > > And then the kfuncs get ctx (aka xdp_buff) as a sole argument and can > find cqe/mdev via container_of. > > If we really need these to be exposed to the program, can we use > Yonghong's approach from [0]? I don't think we should expose them to the BPF program; I like your approach of stuffing them next to the CTX pointer and de-referencing that. This makes it up to the driver which extra objects it needs, and the caller doesn't have to know/care. I'm not vehemently opposed to *also* having the rx-desc pointer directly accessible (in which case Yonghong's kfunc approach is probably fine). However, as mentioned in my previous email, I doubt how useful that descriptor itself will be... >> This way we can have both: bpf prog reading cqe fields directly >> and using kfuncs to access things. >> Inlining of kfuncs should be done generically. >> It's not a driver job to convert native asm into bpf asm. > > Ack. I can replace the unrolling with something that just resolves > "generic" kfuncs to the per-driver implementation maybe? That would at > least avoid netdev->ndo_kfunc_xxx indirect calls at runtime.. As stated above, I think we should keep the unrolling. If we end up with an actual CALL instruction for every piece of metadata that's going to suck performance-wise; unrolling is how we keep this fast enough! :) -Toke