From: Jason Wang <jasowang@redhat.com>
To: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Song Liu <songliubraving@fb.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	netdev@vger.kernel.org, John Fastabend <john.fastabend@gmail.com>,
	qemu-devel@nongnu.org, Alexei Starovoitov <ast@kernel.org>,
	"David S . Miller" <davem@davemloft.net>,
	Prashant Bhole <prashantbhole.linux@gmail.com>,
	kvm@vger.kernel.org, Yonghong Song <yhs@fb.com>,
	Andrii Nakryiko <andriin@fb.com>, Martin KaFai Lau <kafai@fb.com>
Subject: Re: [RFC net-next 00/18] virtio_net XDP offload
Date: Thu, 28 Nov 2019 11:41:52 +0800	[thread overview]
Message-ID: <285af7e2-6a4d-b20c-0aeb-165e3cd4309d@redhat.com> (raw)
In-Reply-To: <20191127114913.0363a0e8@cakuba.netronome.com>


On 2019/11/28 3:49 AM, Jakub Kicinski wrote:
> On Wed, 27 Nov 2019 10:59:37 +0800, Jason Wang wrote:
>> On 2019/11/27 4:35 AM, Jakub Kicinski wrote:
>>> On Tue, 26 Nov 2019 19:07:26 +0900, Prashant Bhole wrote:
>>>> Note: This RFC has been sent to netdev as well as qemu-devel lists
>>>>
>>>> This series introduces XDP offloading from virtio_net. It is based on
>>>> the following work by Jason Wang:
>>>> https://netdevconf.info/0x13/session.html?xdp-offload-with-virtio-net
>>>>
>>>> Current XDP performance in virtio-net is far from what we can achieve
>>>> on host. Several major factors cause the difference:
>>>> - Cost of virtualization
>>>> - Cost of virtio (populating virtqueue and context switching)
>>>> - Cost of vhost, it needs more optimization
>>>> - Cost of data copy
>>>> Because of above reasons there is a need of offloading XDP program to
>>>> host. This set is an attempt to implement XDP offload from the guest.
>>> This turns the guest kernel into a uAPI proxy.
>>>
>>> BPF uAPI calls related to the "offloaded" BPF objects are forwarded
>>> to the hypervisor, they pop up in QEMU which makes the requested call
>>> to the hypervisor kernel. Today it's the Linux kernel tomorrow it may
>>> be someone's proprietary "SmartNIC" implementation.
>>>
>>> Why can't those calls be forwarded at the higher layer? Why do they
>>> have to go through the guest kernel?
>>
>> I think doing forwarding at higher layer have the following issues:
>>
>> - Need a dedicated library (probably libbpf) but application may choose
>>    to do eBPF syscall directly
>> - Depends on guest agent to work
> This can be said about any user space functionality.


Yes, but the feature may have too many unnecessary dependencies: a 
dedicated library, a guest agent, a host agent, etc. This can only work 
for some specific setups and will lead to vendor-specific implementations.


>
>> - Can't work for virtio-net hardware, since it still requires a hardware
>> interface for carrying  offloading information
> The HW virtio-net presumably still has a PF and hopefully reprs for
> VFs, so why can't it attach the program there?


Then you still need an interface for carrying such information. 
Assuming we had a virtio-net VF with representors, it would work like:

libbpf(guest) -> guest agent -> host agent -> libbpf(host) -> BPF 
syscall -> VF reprs/PF driver -> VF/PF reprs -> virtio-net VF

We would still need a vendor-specific way to pass eBPF commands from 
the driver to the reprs/PF, and possibly it could still be a virtio 
interface there.

In this proposal, it works out of the box, as simply as:

libbpf(guest) -> guest kernel -> virtio-net driver -> virtio-net VF
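
To make the guest side concrete, here is a minimal sketch (not taken 
from this series; the insn/license handling and the device name are 
placeholders) of loading an XDP program for device offload through the 
plain bpf(2) syscall, reusing the existing prog_ifindex mechanism that 
offloads like NFP already rely on:

	#include <linux/bpf.h>
	#include <net/if.h>
	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int load_offloaded_xdp(const struct bpf_insn *insns, __u32 insn_cnt)
	{
		union bpf_attr attr;

		memset(&attr, 0, sizeof(attr));
		attr.prog_type = BPF_PROG_TYPE_XDP;
		attr.insns = (__u64)(unsigned long)insns;
		attr.insn_cnt = insn_cnt;
		attr.license = (__u64)(unsigned long)"GPL";
		/* Requesting offload: verification/translation is delegated
		 * to the device behind this ifindex.
		 */
		attr.prog_ifindex = if_nametoindex("eth0");

		return syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
	}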

If the request comes from the host (e.g. flow offloading, configuration, 
etc.), VF representors are a perfect fit. But if the request comes from 
the guest, the much longer journey looks like a burden (dependencies, 
bugs, etc.).

More importantly, we cannot assume how the virtio-net HW is structured; 
it might not even be an SR-IOV or PCI card.


>
>> - Implement at the level of kernel may help for future extension like
>>    BPF object pinning and eBPF helper etc.
> No idea what you mean by this.


My understanding is that we should narrow the gap between non-offloaded 
and offloaded eBPF programs. Making maps or progs visible to the kernel 
may help preserve a unified API, e.g. object pinning through sysfs, 
tracepoints, debugging, etc.
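
As a rough illustration of that last point (just a sketch; the function 
name and pin path are arbitrary), pinning would then look exactly the 
same for an offloaded program as for a host one:

	#include <bpf/bpf.h>

	/* The call is identical whether prog_fd refers to a host-JITed or
	 * an offloaded program; /sys/fs/bpf is the conventional bpffs
	 * mount point.
	 */
	int pin_prog(int prog_fd)
	{
		return bpf_obj_pin(prog_fd, "/sys/fs/bpf/xdp_offload_prog");
	}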


>
>> Basically, this series is trying to have an implementation of
>> transporting eBPF through virtio, so it's not necessarily a guest to
>> host but driver and device. For device, it could be either a virtual one
>> (as done in qemu) or a real hardware.
> SmartNIC with a multi-core 64bit ARM CPUs is as much of a host as
> is the x86 hypervisor side. This set turns the kernel into a uAPI
> forwarder.


Not necessarily. As NFP has done, the driver filters out the features 
that are not supported, and the bpf object is still visible in the 
kernel (and see the comment above).
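
As a rough sketch of where such filtering hooks in on the driver side 
(the mydrv_* names are placeholders; only the XDP_SETUP_PROG_HW command 
and the struct netdev_bpf fields are existing kernel infrastructure):

	#include <linux/netdevice.h>

	static int mydrv_xdp(struct net_device *dev, struct netdev_bpf *bpf)
	{
		switch (bpf->command) {
		case XDP_SETUP_PROG_HW:
			/* Reject programs using features the device cannot
			 * offload, otherwise hand the verified program over.
			 */
			return mydrv_setup_offload(dev, bpf->prog, bpf->extack);
		default:
			return -EINVAL;
		}
	}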


>
> 3 years ago my answer to this proposal would have been very different.
> Today after all the CPU bugs it seems like the SmartNICs (which are
> just another CPU running proprietary code) may just take off..
>

That's interesting, but a vendor may choose to use an FPGA rather than a 
SoC in this case. Anyhow, a discussion like this is somewhat out of the 
scope of this series.


>>> If kernel performs no significant work (or "adds value", pardon the
>>> expression), and problem can easily be solved otherwise we shouldn't
>>> do the work of maintaining the mechanism.
>> My understanding is that it should not be much difference compared to
>> other offloading technology.
> I presume you mean TC offloads? In virtualization there is inherently a
> hypervisor which will receive the request, be it an IO hub/SmartNIC or
> the traditional hypervisor on the same CPU.
>
> The ACL/routing offloads differ significantly, because it's either the
> driver that does all the HW register poking directly or the complexity
> of programming a rule into a HW table is quite low.
>
> Same is true for the NFP BPF offload, BTW, the driver does all the
> heavy lifting and compiles the final machine code image.


Yes, and this series benefits from the infrastructure introduced for 
NFP. But I'm not sure this is a strong point, since technically the 
machine code could be generated by a smart NIC as well.


>
> You can't say verifying and JITing BPF code into machine code entirely
> in the hypervisor is similarly simple.


Yes, and that's why we chose to do it on the device (host) to simplify 
things.


>
> So no, there is a huge difference.
>

>>> The approach of kernel generating actual machine code which is then
>>> loaded into a sandbox on the hypervisor/SmartNIC is another story.
>> We've considered such way, but actual machine code is not as portable as
>> eBPF bytecode consider we may want:
>>
>> - Support migration
>> - Further offload the program to smart NIC (e.g through macvtap
>>    passthrough mode etc).
> You can re-JIT or JIT for SmartNIC..? Having the BPF bytecode does not
> guarantee migration either,


Yes, but it's more portable than machine code.


> if the environment is expected to be
> running different version of HW and SW.


Right, we plan to have feature negotiation.
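
(For example, on the virtio side this could be gated by a feature bit 
negotiated like any other; the bit name and number below are purely 
hypothetical, only to illustrate the idea:)

	#include <linux/virtio.h>
	#include <linux/virtio_config.h>

	/* Hypothetical feature bit, for illustration only. */
	#define VIRTIO_NET_F_XDP_OFFLOAD	60

	static bool vi_has_xdp_offload(struct virtio_device *vdev)
	{
		return virtio_has_feature(vdev, VIRTIO_NET_F_XDP_OFFLOAD);
	}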


> But yes, JITing in the guest
> kernel when you don't know what to JIT for may be hard,


Yes.


> I was just
> saying that I don't mean to discourage people from implementing
> sandboxes which run JITed code on SmartNICs. My criticism is (as
> always?) against turning the kernel into a one-to-one uAPI forwarder
> into unknown platform code.


We have FUSE, and I think this is not only a forwarder; we may do much 
more work on top of it in the future. As for unknown platform code, I'm 
not sure why we need to care about that. There's no way for us to 
prevent such implementations, and if we try to formalize this through a 
specification (the virtio spec and probably an eBPF spec), it may 
actually help.


>
> For cloud use cases I believe the higher layer should solve this.
>

Technically possible, but it has lots of drawbacks.

Thanks


Thread overview: 87+ messages
2019-11-26 10:07 [RFC net-next 00/18] virtio_net XDP offload Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 01/18] bpf: introduce bpf_prog_offload_verifier_setup() Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 02/18] net: core: rename netif_receive_generic_xdp() to do_generic_xdp_core() Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 03/18] net: core: export do_xdp_generic_core() Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 04/18] tuntap: check tun_msg_ctl type at necessary places Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 05/18] vhost_net: user tap recvmsg api to access ptr ring Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 06/18] tuntap: remove usage of ptr ring in vhost_net Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 07/18] tun: set offloaded xdp program Prashant Bhole
2019-12-01 16:35   ` David Ahern
2019-12-02  2:44     ` Jason Wang
2019-12-01 16:45   ` David Ahern
2019-12-02  2:47     ` Jason Wang
2019-12-09  0:24       ` Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 08/18] tun: run offloaded XDP program in Tx path Prashant Bhole
2019-12-01 16:39   ` David Ahern
2019-12-01 20:56     ` David Miller
2019-12-01 21:40       ` Michael S. Tsirkin
2019-12-01 21:54         ` David Miller
2019-12-02  2:56           ` Jason Wang
2019-12-02  2:45     ` Jason Wang
2019-11-26 10:07 ` [RFC net-next 09/18] tun: add a way to inject Tx path packet into Rx path Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 10/18] tun: handle XDP_TX action of offloaded program Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 11/18] tun: run xdp prog when tun is read from file interface Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 12/18] virtio-net: store xdp_prog in device Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 13/18] virtio_net: use XDP attachment helpers Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 14/18] virtio_net: add XDP prog offload infrastructure Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 15/18] virtio_net: implement XDP prog offload functionality Prashant Bhole
2019-11-27 20:42   ` Michael S. Tsirkin
2019-11-28  2:53     ` Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 16/18] bpf: export function __bpf_map_get Prashant Bhole
2019-11-26 10:07 ` [RFC net-next 17/18] virtio_net: implment XDP map offload functionality Prashant Bhole
2019-11-26 20:19   ` kbuild test robot
2019-11-26 10:07 ` [RFC net-next 18/18] virtio_net: restrict bpf helper calls from offloaded program Prashant Bhole
2019-11-26 20:35 ` [RFC net-next 00/18] virtio_net XDP offload Jakub Kicinski
2019-11-27  2:59   ` Jason Wang
2019-11-27 19:49     ` Jakub Kicinski
2019-11-28  3:41       ` Jason Wang [this message]
2019-11-27 20:32   ` Michael S. Tsirkin
2019-11-27 23:40     ` Jakub Kicinski
2019-12-02 15:29       ` Michael S. Tsirkin
2019-11-28  3:32   ` Alexei Starovoitov
2019-11-28  4:18     ` Jason Wang
2019-12-01 16:54       ` David Ahern
2019-12-02  2:48         ` Jason Wang
