Subject: Re: [RFC net-next 00/18] virtio_net XDP offload
From: Jason Wang
To: Jakub Kicinski
Cc: Song Liu, Jesper Dangaard Brouer, Daniel Borkmann,
    "Michael S. Tsirkin", netdev@vger.kernel.org, John Fastabend,
    qemu-devel@nongnu.org, Alexei Starovoitov, "David S. Miller",
    Prashant Bhole, kvm@vger.kernel.org, Yonghong Song,
    Andrii Nakryiko, Martin KaFai Lau
Date: Thu, 28 Nov 2019 11:41:52 +0800
Message-ID: <285af7e2-6a4d-b20c-0aeb-165e3cd4309d@redhat.com>
In-Reply-To: <20191127114913.0363a0e8@cakuba.netronome.com>

On 2019/11/28 上午3:49, Jakub Kicinski wrote:
> On Wed, 27 Nov 2019 10:59:37 +0800, Jason Wang wrote:
>> On 2019/11/27 上午4:35, Jakub Kicinski wrote:
>>> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
>>>> Note: This RFC has been sent to netdev as well as qemu-devel lists
>>>>
>>>> This series introduces XDP offloading from virtio_net. It is based on
>>>> the following work by Jason Wang:
>>>> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
>>>>
>>>> Current XDP performance in virtio-net is far from what we can achieve
>>>> on host. Several major factors cause the difference:
>>>> - Cost of virtualization
>>>> - Cost of virtio (populating virtqueue and context switching)
>>>> - Cost of vhost, it needs more optimization
>>>> - Cost of data copy
>>>> Because of above reasons there is a need of offloading XDP program to
>>>> host. This set is an attempt to implement XDP offload from the guest.
>>> This turns the guest kernel into a uAPI proxy.
>>>
>>> BPF uAPI calls related to the "offloaded" BPF objects are forwarded
>>> to the hypervisor, they pop up in QEMU which makes the requested call
>>> to the hypervisor kernel. Today it's the Linux kernel tomorrow it may
>>> be someone's proprietary "SmartNIC" implementation.
>>>
>>> Why can't those calls be forwarded at the higher layer? Why do they
>>> have to go through the guest kernel?
>>
>> I think doing forwarding at higher layer have the following issues:
>>
>> - Need a dedicated library (probably libbpf) but application may choose
>> to do eBPF syscall directly
>> - Depends on guest agent to work
> This can be said about any user space functionality.

Yes, but the feature would then have too many unnecessary dependencies:
a dedicated library, a guest agent, a host agent, etc. It can only work
for some specific setups and will lead to vendor-specific
implementations.
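To make the "eBPF syscall directly" point concrete: with the kernel
doing the forwarding, a guest application keeps using the ordinary BPF
object flow. A rough sketch, assuming a libbpf flow like the one bpftool
already uses for NFP offload ("xdp_prog.o", "eth0" and the function name
are placeholders, not something defined by this series):

#include <net/if.h>
#include <linux/if_link.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

/* Hedged sketch: load an XDP object, request device offload via the
 * program/map ifindex, then attach in HW mode. This mirrors the
 * existing NFP offload flow rather than anything defined by this RFC. */
int load_offloaded_xdp(void)
{
        int ifindex = if_nametoindex("eth0");   /* placeholder ifname */
        struct bpf_object *obj;
        struct bpf_program *prog;
        struct bpf_map *map;

        if (!ifindex)
                return -1;

        obj = bpf_object__open("xdp_prog.o");   /* placeholder object */
        if (libbpf_get_error(obj))
                return -1;

        /* Mark programs and maps for device offload before loading. */
        bpf_object__for_each_program(prog, obj)
                bpf_program__set_ifindex(prog, ifindex);
        bpf_object__for_each_map(map, obj)
                bpf_map__set_ifindex(map, ifindex);

        if (bpf_object__load(obj)) {
                bpf_object__close(obj);
                return -1;
        }

        prog = bpf_program__next(NULL, obj);
        if (!prog)
                return -1;

        /* Attach in HW mode; in this proposal the virtio-net driver,
         * not a user space agent, forwards the program to the device. */
        return bpf_set_link_xdp_fd(ifindex, bpf_program__fd(prog),
                                   XDP_FLAGS_HW_MODE);
}

No guest agent or side channel is involved; the application only opts
in by setting the ifindex.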
>
>> - Can't work for virtio-net hardware, since it still requires a hardware
>> interface for carrying offloading information
> The HW virtio-net presumably still has a PF and hopefully reprs for
> VFs, so why can't it attach the program there?

Then you still need an interface for carrying such information.
Assuming we had a virtio-net VF with reprs, it would work like:

libbpf(guest) -> guest agent -> host agent -> libbpf(host) ->
BPF syscall -> VF reprs/PF driver -> VF/PF reprs -> virtio-net VF

You would still need a vendor-specific way of passing the eBPF commands
from the driver to the reprs/PF, and possibly it could still be a
virtio interface there.

In this proposal it works out of the box, as simply as:

libbpf(guest) -> guest kernel -> virtio-net driver -> virtio-net VF
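The only requirement on the last hop is that it is a virtio command, so
the same interface works whether the device is the virtual one
implemented in qemu or real hardware. Purely as an illustration (this
is NOT the command layout defined in the patches; every name and field
below is made up), the driver-to-device leg could look like:

#include <linux/types.h>

/* Hypothetical example only, not taken from this series: a control
 * command a virtio-net driver could use to hand verified eBPF bytecode
 * to the device for offload. */
struct virtio_net_ctrl_xdp_prog {
        __le32 prog_type;       /* BPF_PROG_TYPE_XDP */
        __le32 insn_cnt;        /* number of struct bpf_insn entries */
        __le32 flags;
        __le32 reserved;
        __u8 insns[];           /* insn_cnt * sizeof(struct bpf_insn) bytes */
};

Whether the device JITs the bytecode itself or passes it on to the host
kernel is entirely a device-side decision.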
If the request comes from the host (e.g. flow offloading, configuration,
etc.), VF reprs are a perfect fit. But if the request comes from the
guest, the much longer journey looks like a burden (dependencies, bugs,
etc.).

More importantly, we cannot assume how the virtio-net HW is structured;
it might not even be an SR-IOV or PCI card.

>
>> - Implement at the level of kernel may help for future extension like
>> BPF object pinning and eBPF helper etc.
> No idea what you mean by this.

My understanding is that we should narrow the gap between non-offloaded
and offloaded eBPF programs. Making maps or progs visible to the kernel
may help keep a unified API, e.g. object pinning through sysfs,
tracepoints, debugging, etc.

>
>> Basically, this series is trying to have an implementation of
>> transporting eBPF through virtio, so it's not necessarily a guest to
>> host but driver and device. For device, it could be either a virtual one
>> (as done in qemu) or a real hardware.
> SmartNIC with a multi-core 64bit ARM CPUs is as much of a host as
> is the x86 hypervisor side. This set turns the kernel into a uAPI
> forwarder.

Not necessarily. As has been done for the NFP, the driver filters out
the features that are not supported, and the bpf object is still
visible in the kernel (and see the comment above).
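For example (a minimal sketch; ndo_bpf and XDP_SETUP_PROG_HW are the
existing kernel interfaces, but the function names and the capability
check below are assumptions, not code from this series), the virtio-net
driver could do the same kind of filtering:

#include <linux/netdevice.h>
#include <linux/netlink.h>
#include <linux/bpf.h>

/* Hedged sketch: virtnet_supports_xdp_offload() and
 * virtnet_xdp_offload() are assumed helpers. The point is that the
 * driver still sees the verified bpf_prog and can reject anything the
 * device cannot support, instead of blindly forwarding uAPI. */
static int virtnet_xdp_cmd(struct net_device *dev, struct netdev_bpf *xdp)
{
        switch (xdp->command) {
        case XDP_SETUP_PROG_HW:
                if (xdp->prog && !virtnet_supports_xdp_offload(dev)) {
                        NL_SET_ERR_MSG_MOD(xdp->extack,
                                           "device cannot offload this program");
                        return -EOPNOTSUPP;
                }
                /* Hand the program to the device, e.g. via a virtio command. */
                return virtnet_xdp_offload(dev, xdp->prog, xdp->extack);
        case XDP_SETUP_PROG:
                /* Existing non-offloaded XDP path in virtio_net. */
                return virtnet_xdp_set(dev, xdp->prog, xdp->extack);
        default:
                return -EINVAL;
        }
}

That is the same shape as the existing NFP offload path, where the
driver and the kernel's bpf offload infrastructure cooperate.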
>
> 3 years ago my answer to this proposal would have been very different.
> Today after all the CPU bugs it seems like the SmartNICs (which are
> just another CPU running proprietary code) may just take off..
>

That's interesting, but a vendor may choose to use an FPGA rather than
a SoC in this case. Anyhow, a discussion like this is somewhat out of
the scope of this series.

>>> If kernel performs no significant work (or "adds value", pardon the
>>> expression), and problem can easily be solved otherwise we shouldn't
>>> do the work of maintaining the mechanism.
>> My understanding is that it should not be much difference compared to
>> other offloading technology.
> I presume you mean TC offloads? In virtualization there is inherently a
> hypervisor which will receive the request, be it an IO hub/SmartNIC or
> the traditional hypervisor on the same CPU.
>
> The ACL/routing offloads differ significantly, because it's either the
> driver that does all the HW register poking directly or the complexity
> of programming a rule into a HW table is quite low.
>
> Same is true for the NFP BPF offload, BTW, the driver does all the
> heavy lifting and compiles the final machine code image.

Yes, and this series benefits from the infrastructure invented for the
NFP. But I'm not sure this is a good point, since technically the
machine code could be generated by the SmartNIC as well.

>
> You can't say verifying and JITing BPF code into machine code entirely
> in the hypervisor is similarly simple.

Yes, and that's why we chose to do it on the device (host), to simplify
things.

>
> So no, there is a huge difference.
>
>>> The approach of kernel generating actual machine code which is then
>>> loaded into a sandbox on the hypervisor/SmartNIC is another story.
>> We've considered such way, but actual machine code is not as portable as
>> eBPF bytecode consider we may want:
>>
>> - Support migration
>> - Further offload the program to smart NIC (e.g through macvtap
>> passthrough mode etc).
> You can re-JIT or JIT for SmartNIC..? Having the BPF bytecode does not
> guarantee migration either,

Yes, but it's more portable than machine code.

> if the environment is expected to be
> running different version of HW and SW.

Right, we plan to have feature negotiation.

> But yes, JITing in the guest
> kernel when you don't know what to JIT for may be hard,

Yes.

> I was just
> saying that I don't mean to discourage people from implementing
> sandboxes which run JITed code on SmartNICs. My criticism is (as
> always?) against turning the kernel into a one-to-one uAPI forwarder
> into unknown platform code.

We have FUSE, and I think this is not only a forwarder; we may do much
more work on top in the future. As for unknown platform code, I'm not
sure why we need to care about that. There's no way for us to prevent
such implementations, and if we try to formalize it through a
specification (the virtio spec and probably an eBPF spec), it may
actually help.

>
> For cloud use cases I believe the higher layer should solve this.
>

Technically possible, but it has lots of drawbacks.

Thanks