bpf.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Toshiaki Makita <toshiaki.makita1@gmail.com>
To: John Fastabend <john.fastabend@gmail.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <jakub.kicinski@netronome.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Jiri Pirko <jiri@resnulli.us>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	Jozsef Kadlecsik <kadlec@netfilter.org>,
	Florian Westphal <fw@strlen.de>,
	Pravin B Shelar <pshelar@ovn.org>
Cc: netdev@vger.kernel.org, bpf@vger.kernel.org,
	William Tu <u9012063@gmail.com>,
	Stanislav Fomichev <sdf@fomichev.me>
Subject: Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP
Date: Sun, 27 Oct 2019 22:06:18 +0900	[thread overview]
Message-ID: <59ca42a2-3779-4c9a-a094-8222405c65fe@gmail.com> (raw)
In-Reply-To: <5daf34614a4af_30ac2b1cb5d205bce4@john-XPS-13-9370.notmuch>

On 19/10/23 (水) 1:54:57, John Fastabend wrote:
> Toshiaki Makita wrote:
>> On 2019/10/19 0:22, John Fastabend wrote:
>>> Toshiaki Makita wrote:
>>>> This is a PoC for an idea to offload flow, i.e. TC flower and nftables,
>>>> to XDP.
>>> I've only read the cover letter so far but...
>> Thank you for reading this long cover letter.
>>>> * Motivation
>>>> The purpose is to speed up flow based network features like TC flower and
>>>> nftables by making use of XDP.
>>>> I chose flow feature because my current interest is in OVS. OVS uses TC
>>>> flower to offload flow tables to hardware, so if TC can offload flows to
>>>> XDP, OVS also can be offloaded to XDP.
>>> This adds a non-trivial amount of code and complexity so I'm
>>> critical of the usefulness of being able to offload TC flower to
>>> XDP when userspace can simply load an XDP program.
>>> Why does OVS use tc flower at all if XDP is about 5x faster using
>>> your measurements below? Rather than spend energy adding code to
>>> a use case that as far as I can tell is narrowly focused on offload
>>> support can we enumerate what is missing on XDP side that blocks
>>> OVS from using it directly?
>> I think nothing is missing for direct XDP use, as long as XDP datapath
>> only partially supports OVS flow parser/actions like xdp_flow.
>> The point is to avoid duplicate effort when someone wants to use XDP
>> through TC flower or nftables transparently.
> I don't know who this "someone" is that wants to use XDP through TC
> flower or nftables transparently. TC at least is not known for a
> great uapi. It seems to me that it would be a relatively small project
> to write a uapi that ran on top of a canned XDP program to add
> flow rules. This could match tc cli if you wanted but why not take
> the opportunity to write a UAPI that does flow management well.

Not sure why TC is not a great uapi (or cli you mean?).
I'm not thinking it's a good idea to add another API when there is an 
API to do the same thing.

> Are there users of tc_flower in deployment somewhere?

Yes at least I know a few examples in my company...
To me flower is easier to use compared to u32 as it can automatically 
dissect flows, so I usually use it when some filter or redirection is 
necessary on devices level.

> I don't have
> lots of visibility but my impression is these were mainly OVS
> and other switch offload interface.
> OVS should be sufficiently important that developers can right a
> native solution in my opinion. Avoiding needless layers of abstraction
> in the process. The nice thing about XDP here is you can write
> an XDP program and set of maps that matches OVS perfectly so no
> need to try and fit things in places.

I think the current XDP program for TC matches OVS as well, so I don't 
see such merit. TC flower and ovs kmod is similar in handling flows 
(both use mask list and look up hash table per mask).

>>> Additionally for hardware that can
>>> do XDP/BPF offload you will get the hardware offload for free.
>> This is not necessary as OVS already uses TC flower to offload flows.
> But... if you have BPF offload hardware that doesn't support TCAM
> style flow based offloads direct XDP offload would seem easier IMO.
> Which is better? Flow table based offloads vs BPF offloads likely
> depends on your use case in my experience.

I thought having single control path for offload is simple and 
intuitive. If XDP is allowed to be offloaded, OVS needs to have more 
knobs to enable offload of XDP programs.
I don't have ideas about difficulty in offload drivers' implementation 

>>> Yes I know XDP is bytecode and you can't "offload" bytecode into
>>> a flow based interface likely backed by a tcam but IMO that doesn't
>>> mean we should leak complexity into the kernel network stack to
>>> fix this. Use the tc-flower for offload only (it has support for
>>> this) if you must and use the best (in terms of Mpps) software
>>> interface for your software bits. And if you want auto-magic
>>> offload support build hardware with BPF offload support.
>>> In addition by using XDP natively any extra latency overhead from
>>> bouncing calls through multiple layers would be removed.
>> To some extent yes, but not completely. Flow insertion from userspace
>> triggered by datapath upcall is necessary regardless of whether we use
>> TC or not.
> Right but these are latency involved with OVS architecture not
> kernel implementation artifacts. Actually what would be an interesting
> metric would be to see latency of a native xdp implementation.
> I don't think we should add another implementation to the kernel
> that is worse than what we have.
>   xdp_flow  TC        ovs kmod
>   --------  --------  --------
>   22ms      6ms       0.6ms
> TC is already order of magnitude off it seems :(

Yes but I'm unsure why it takes so much time. I need to investigate that...

> If ovs_kmod is .6ms why am I going to use something that is 6ms or
> 22ms. I expect a native xdp implementation using a hash map to be
> inline with ovs kmod if not better. So if we have already have
> an implementation in kernel that is 5x faster and better at flow
> insertion another implementation that doesn't meet that threshold
> should probably not go in kernel.

Note that simple hash table (micro flows) implementation is hard.
It requires support for all key parameters including conntrack 
parameters, which does not look feasible now.
I only support mega flows using multiple hash tables each of which has 
its mask. This allows for support of partial set of keys which is 
feasible with XDP.

With this implementation I needed to emulate a list of masks using 
arrays, which I think is the main source of the latency, as it requires 
several syscalls to modify and lookup the BPF maps (I think it can be 
improved later though). This is the same even if we cross out the TC.

Anyway if the flow is really sensitive to latency, users should add the 
flow entry in advance. Going through userspace itself has non-negligible 

> Additionally, for the OVS use case I would argue the XDP native
> solution is straight forward to implement. Although I will defer
> to OVS datapath experts here but above you noted nothing is
> missing on the feature side?

Yes as long as it is partial flow key/action support.
For full datapath support it's far from feasible, as William experienced...

>>>> When TC flower filter is offloaded to XDP, the received packets are
>>>> handled by XDP first, and if their protocol or something is not
>>>> supported by the eBPF program, the program returns XDP_PASS and packets
>>>> are passed to upper layer TC.
>>>> The packet processing flow will be like this when this mechanism,
>>>> xdp_flow, is used with OVS.
>>> Same as obove just cross out the 'TC flower' box and add support
>>> for your missing features to 'XDP prog' box. Now you have less
>>> code to maintain and less bugs and aren't pushing packets through
>>> multiple hops in a call chain.
>> If we cross out TC then we would need similar code in OVS userspace.
>> In total I don't think it would be less code to maintain.
> Yes but I think minimizing kernel code and complexity is more important
> than minimizing code in a specific userspace application/use-case.
> Just think about the cost of a bug in kernel vs user space side. In
> user space you have ability to fix and release your own code in kernel
> side you will have to fix upstream, manage backports, get distributions
> involved, etc.
> I have no problem adding code if its a good use case but in this case
> I'm still not seeing it.

I can understand the kernel code is more important, but if we determine 
not to have the code in kernel, we probably need each code in each 
userland tools? OVS has XDP, TC (iproute2 or some new userland tool) has 
XDP, nft has XDP, and they will be the same functionality, which is 
duplicate effort.

>>>>    +-------------+
>>>>    | openvswitch |
>>>>    |    kmod     |
>>>>    +-------------+
>>>>           ^
>>>>           | if not match in filters (flow key or action not supported by TC)
>>>>    +-------------+
>>>>    |  TC flower  |
>>>>    +-------------+
>>>>           ^
>>>>           | if not match in flow tables (flow key or action not supported by XDP)
>>>>    +-------------+
>>>>    |  XDP prog   |
>>>>    +-------------+
>>>>           ^
>>>>           | incoming packets
>>>> Of course we can directly use TC flower without OVS to speed up TC.
>>> huh? TC flower is part of TC so not sure what 'speed up TC' means. I
>>> guess this means using tc flower offload to xdp prog would speed up
>>> general tc flower usage as well?
>> Yes.
>>> But again if we are concerned about Mpps metrics just write the XDP
>>> program directly.
>> I guess you mean any Linux users who want TC-like flow handling should develop
>> their own XDP programs? (sorry if I misunderstand you.)
>> I want to avoid such a situation. The flexibility of eBPF/XDP is nice and it's
>> good to have any program each user wants, but not every sysadmin can write low
>> level good performance programs like us. For typical use-cases like flow handling
>> easy use of XDP through existing kernel interface (here TC) is useful IMO.
> For OVS the initial use case I suggest write a XDP program tailored and
> optimized for OVS. Optimize it for this specific use case.

I think current code is already optimized for OVS as well as TC.

> If you want a general flow based XDP program write one, convince someone
> to deploy and build a user space application to manage it. No sysadmin
> has to touch this. Toke and others at RedHat appear to have this exact
> use case in mind.
>> ...
>>>> * About alternative userland (ovs-vswitchd etc.) implementation
>>>> Maybe a similar logic can be implemented in ovs-vswitchd offload
>>>> mechanism, instead of adding code to kernel. I just thought offloading
>>>> TC is more generic and allows wider usage with direct TC command.
>>>> For example, considering that OVS inserts a flow to kernel only when
>>>> flow miss happens in kernel, we can in advance add offloaded flows via
>>>> tc filter to avoid flow insertion latency for certain sensitive flows.
>>>> TC flower usage without using OVS is also possible.
>>> I argue to cut tc filter out entirely and then I think non of this
>>> is needed.
>> Not correct. Even with native XDP use, multiple map lookup/modification
>> from userspace is necessary for flow miss handling, which will lead to
>> some latency.
> I have not done got the data but I suspect the latency will be much
> closer to the ovs kmod .6ms than the TC or xdp_flow latency.

It should not as I explained above, although I should confirm it.

>> And there are other use-cases for direct TC use, like packet drop or
>> redirection for certain flows.
> But these can be implemented in XDP correct?

So want to avoid duplicate work...

>>>> Also as written above nftables can be offloaded to XDP with this
>>>> mechanism as well.
>>> Or same argument use XDP directly.
>> I'm thinking it's useful for sysadmins to be able to use XDP through
>> existing kernel interfaces.
> I agree its perhaps friendly to do so but for OVS not necessary and
> if sysadmins want a generic XDP flow interface someone can write one.
> Tell admins using your new tool gives 5x mpps improvement and orders
> of magnitude latency reduction and I suspect converting them over
> should be easy. Or if needed write an application in userspace that
> converts tc commands to native XDP map commands.
> I think for sysadmins in general (not OVS) use case I would work
> with Jesper and Toke. They seem to be working on this specific problem.

I have talked about this in Netdev 0x13 and IIRC Jesper was somewhat 
positive on this idea...
Jesper, any thoughts?

Toshiaki Makita

  parent reply	other threads:[~2019-10-27 13:06 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-18  4:07 Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 01/15] xdp_flow: Add skeleton of XDP based flow offload driver Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 02/15] xdp_flow: Add skeleton bpf program for XDP Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 03/15] bpf: Add API to get program from id Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 04/15] xdp: Export dev_check_xdp and dev_change_xdp Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 05/15] xdp_flow: Attach bpf prog to XDP in kernel after UMH loaded program Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 06/15] xdp_flow: Prepare flow tables in bpf Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 07/15] xdp_flow: Add flow entry insertion/deletion logic in UMH Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 08/15] xdp_flow: Add flow handling and basic actions in bpf prog Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 09/15] xdp_flow: Implement flow replacement/deletion logic in xdp_flow kmod Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 10/15] xdp_flow: Add netdev feature for enabling flow offload to XDP Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 11/15] xdp_flow: Implement redirect action Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 12/15] xdp_flow: Implement vlan_push action Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 13/15] bpf, selftest: Add test for xdp_flow Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 14/15] i40e: prefetch xdp->data before running XDP prog Toshiaki Makita
2019-10-18  4:07 ` [RFC PATCH v2 bpf-next 15/15] bpf, hashtab: Compare keys in long Toshiaki Makita
2019-10-18 15:22 ` [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP John Fastabend
2019-10-21  7:31   ` Toshiaki Makita
2019-10-22 16:54     ` John Fastabend
2019-10-22 17:45       ` Toke Høiland-Jørgensen
2019-10-24  4:27         ` John Fastabend
2019-10-24 10:13           ` Toke Høiland-Jørgensen
2019-10-27 13:19             ` Toshiaki Makita
2019-10-27 15:21               ` Toke Høiland-Jørgensen
2019-10-28  3:16                 ` David Ahern
2019-10-28  8:36                   ` Toke Høiland-Jørgensen
2019-10-28 10:08                     ` Jesper Dangaard Brouer
2019-10-28 19:07                       ` David Ahern
2019-10-28 19:05                     ` David Ahern
2019-10-31  0:18                 ` Toshiaki Makita
2019-10-31 12:12                   ` Toke Høiland-Jørgensen
2019-11-11  7:32                     ` Toshiaki Makita
2019-11-12 16:53                       ` Toke Høiland-Jørgensen
2019-11-14 10:11                         ` Toshiaki Makita
2019-11-14 12:41                           ` Toke Høiland-Jørgensen
2019-11-18  6:41                             ` Toshiaki Makita
2019-11-18 10:20                               ` Toke Høiland-Jørgensen
2019-11-22  5:42                                 ` Toshiaki Makita
2019-11-22 11:54                                   ` Toke Høiland-Jørgensen
2019-11-25 10:18                                     ` Toshiaki Makita
2019-11-25 13:03                                       ` Toke Høiland-Jørgensen
2019-11-18 10:28                               ` Toke Høiland-Jørgensen
2019-10-27 13:13         ` Toshiaki Makita
2019-10-27 15:24           ` Toke Høiland-Jørgensen
2019-10-27 19:17             ` David Miller
2019-10-31  0:32               ` Toshiaki Makita
2019-11-12 17:50                 ` William Tu
2019-11-14 10:06                   ` Toshiaki Makita
2019-11-14 17:09                     ` William Tu
2019-11-15 13:16                       ` Toke Høiland-Jørgensen
2019-11-12 17:38             ` William Tu
2019-10-23 14:11       ` Jamal Hadi Salim
2019-10-24  4:38         ` John Fastabend
2019-10-24 17:05           ` Jamal Hadi Salim
2019-10-27 13:27         ` Toshiaki Makita
2019-10-27 13:06       ` Toshiaki Makita [this message]
2019-10-21 11:23 ` Björn Töpel
2019-10-21 11:47   ` Toshiaki Makita

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=59ca42a2-3779-4c9a-a094-8222405c65fe@gmail.com \
    --to=toshiaki.makita1@gmail.com \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=fw@strlen.de \
    --cc=hawk@kernel.org \
    --cc=jakub.kicinski@netronome.com \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=john.fastabend@gmail.com \
    --cc=kadlec@netfilter.org \
    --cc=kafai@fb.com \
    --cc=netdev@vger.kernel.org \
    --cc=pablo@netfilter.org \
    --cc=pshelar@ovn.org \
    --cc=sdf@fomichev.me \
    --cc=songliubraving@fb.com \
    --cc=u9012063@gmail.com \
    --cc=xiyou.wangcong@gmail.com \
    --cc=yhs@fb.com \
    --subject='Re: [RFC PATCH v2 bpf-next 00/15] xdp_flow: Flow offload to XDP' \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).