Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP

From: Toshiaki Makita <toshiaki.makita1@gmail.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Martin KaFai Lau <kafai@fb.com>, Song Liu <songliubraving@fb.com>,
	Yonghong Song <yhs@fb.com>,
	"David S. Miller" <davem@davemloft.net>,
	Jakub Kicinski <jakub.kicinski@netronome.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	John Fastabend <john.fastabend@gmail.com>,
	Jamal Hadi Salim <jhs@mojatatu.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Jiri Pirko <jiri@resnulli.us>,
	netdev@vger.kernel.org, bpf@vger.kernel.org,
	William Tu <u9012063@gmail.com>
Subject: Re: [RFC PATCH bpf-next 00/14] xdp_flow: Flow offload to XDP
Date: Wed, 14 Aug 2019 16:33:56 +0900	[thread overview]
Message-ID: <f6160572-8fa8-0199-8d81-6159dd4cd5ff@gmail.com> (raw)
In-Reply-To: <20190814014445.3dnduyrass5jycr5@ast-mbp>

Hi Alexei, thank you for taking a look!

On 2019/08/14 10:44, Alexei Starovoitov wrote:
> On Tue, Aug 13, 2019 at 09:05:44PM +0900, Toshiaki Makita wrote:
>> This is a rough PoC for an idea to offload TC flower to XDP.
> ...
>>   xdp_flow  TC        ovs kmod
>>   --------  --------  --------
>>   4.0 Mpps  1.1 Mpps  1.1 Mpps
> 
> Is xdp_flow limited to 4 Mpps due to veth or something else?

Looking at perf, accumulation of each layer's overhead resulted in the number.
With XDP prog which only redirects packets and does not touch the data,
the drop rate is 10 Mpps. In this case the main overhead is XDP's redirect processing
and handling of 2 XDP progs (in veth and i40e).
In the xdp_flow test the overhead additionally includes flow key parse in XDP prog
and hash table lookup (including jhash calculation) which resulted in 4 Mpps.

>> So xdp_flow drop rate is roughly 4x faster than software TC or ovs kmod.
>>
>> OTOH the time to add a flow increases with xdp_flow.
>>
>> ping latency of first packet when veth1 does XDP_PASS instead of DROP:
>>
>>   xdp_flow  TC        ovs kmod
>>   --------  --------  --------
>>   25ms      12ms      0.6ms
>>
>> xdp_flow does a lot of work to emulate TC behavior including UMH
>> transaction and multiple bpf map update from UMH which I think increases
>> the latency.
> 
> make sense, but why vanilla TC is so slow ?

No ideas. At least TC requires additional syscall to insert a flow compared to ovs kmod,
but 12 ms looks too long for that.

>> * Implementation
>>
>> xdp_flow makes use of UMH to load an eBPF program for XDP, similar to
>> bpfilter. The difference is that xdp_flow does not generate the eBPF
>> program dynamically but a prebuilt program is embedded in UMH. This is
>> mainly because flow insertion is considerably frequent. If we generate
>> and load an eBPF program on each insertion of a flow, the latency of the
>> first packet of ping in above test will incease, which I want to avoid.
> 
> I think UMH approach is a good fit for this.
> Clearly the same algorithm can be done as kernel code or kernel module, but
> bpfilter-like UMH is a safer approach.
> 
>> - patch 9
>>   Add tc-offload-xdp netdev feature and hooks to call xdp_flow kmod in
>>   TC flower offload code.
> 
> The hook into UMH from TC looks simple. Do you expect the same interface to be
> reused from OVS ?

Do you mean openvswitch kernel module by OVS?
If so, no, at this point. TC hook is simple because I reused flow offload mechanism.
OVS kmod does not have offload interface and ovs-vswitchd is using TC for offload.
I wanted to reuse this mechanism for offloading to XDP, so using TC.

>> * About alternative userland (ovs-vswitchd etc.) implementation
>>
>> Maybe a similar logic can be implemented in ovs-vswitchd offload
>> mechanism, instead of adding code to kernel. I just thought offloading
>> TC is more generic and allows wider usage with direct TC command.
>>
>> For example, considering that OVS inserts a flow to kernel only when
>> flow miss happens in kernel, we can in advance add offloaded flows via
>> tc filter to avoid flow insertion latency for certain sensitive flows.
>> TC flower usage without using OVS is also possible.
>>
>> Also as written above nftables can be offloaded to XDP with this
>> mechanism as well.
> 
> Makes sense to me.
> 
>>    bpf, hashtab: Compare keys in long
> 
> 3Mpps vs 4Mpps just from this patch ?
> or combined with i40 prefech patch ?

Combined.

>>   drivers/net/ethernet/intel/i40e/i40e_txrx.c  |    1 +
> 
> Could you share "perf report" for just hash tab optimization
> and for i40 ?

Sure, I'll get some more data and post them.

> I haven't seen memcmp to be bottle neck in hash tab.
> What is the the of the key?

typo of "size of the key"? IIRC 64 bytes.

Toshiaki Makita