From: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>
To: Jonathan Lemon <bsd@fb.com>,
	Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: "Magnus Karlsson" <magnus.karlsson@intel.com>,
	"Björn Töpel" <bjorn.topel@intel.com>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Network Development" <netdev@vger.kernel.org>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"Jakub Kicinski" <jakub.kicinski@netronome.com>
Subject: Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
Date: Mon, 13 May 2019 16:30:58 -0700	[thread overview]
Message-ID: <d55253dd-42b4-9cb1-ddc9-4f74c06ec845@intel.com> (raw)
In-Reply-To: <D40B5C89-53F8-4EC1-AB35-FB7C395864DE@fb.com>

On 5/13/2019 1:42 PM, Jonathan Lemon wrote:
> Tossing in my .02 cents:
> 
> 
> I anticipate that most users of AF_XDP will want packet processing
> for a given RX queue occurring on a single core - otherwise we end
> up with cache delays.  The usual model is one thread, one socket,
> one core, but this isn't enforced anywhere in the AF_XDP code; it is
> up to the user to set this up.

AF_XDP with busy poll should allow a single thread to poll a given RX
queue and use a single core.
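
With the xdpsock sample something close to this can be approximated
already today, e.g. by pinning the application to the core that services
the queue. A rough sketch (interface name, queue id and core number are
placeholders; xdpsock's -p option uses the regular poll() syscall, the
busy-poll behavior itself is what this series would add):

  # run the rxdrop benchmark on queue 3, pinned to core 3
  taskset -c 3 ./samples/bpf/xdpsock -i <i/f> -q 3 -r -p

The point of busy polling is that this same pinned thread would then
also drive napi for that queue, instead of relying on the
interrupt/softirq path.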

> 
> On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
>> I'm not saying that we shouldn't do busy-poll. I'm saying it's
>> complementary, but in all cases single core per af_xdp rq queue
>> with user thread pinning is preferred.
> 
> So I think we're on the same page here.
> 
>> Stack rx queues and af_xdp rx queues should look almost the same from
>> the napi point of view. Stack -> normal napi in softirq. af_xdp -> new
>> kthread to work with both poll and busy-poll. The only difference
>> between poll and busy-poll will be the running context: new kthread vs
>> user task.
> ...
>> A burst of 64 packets on stack queues or some other work in softirqd
>> will spike the latency for af_xdp queues if softirq is shared.
> 
> True, but would it be shared?  This goes back to the current model
> which, as used by Intel, is:
> 
>      (channel == RX, TX, softirq)
> 
> MLX, on the other hand, wants:
> 
>      (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
> 
> Which would indeed lead to sharing.  The more I look at the above, the
> more I start to dislike it.  Perhaps this should be disallowed?
> 
> I believe there was some mention at LSF/MM that the 'channel' concept
> was something specific to HW and really shouldn't be part of the SW API.
> 
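
As a side note on the 'channel' terminology: what is visible to the user
today is the ethtool channel count, e.g. (interface name is a
placeholder):

  ethtool -l <i/f>                # show current channel counts
  ethtool -L <i/f> combined 16    # resize to 16 combined channels

Splitting a channel into RX.stack/RX.AF_XDP pieces as described above
would not be visible through this interface, which is probably part of
the concern raised at LSF/MM.
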
>> Hence the proposal for new napi_kthreads:
>> - user creates af_xdp socket and binds to _CPU_ X then
>> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>> exposed)
>> - spawns kthread pinned to cpu X
>> - configures irq for that af_xdp queue to fire on cpu X
>> - user space with the help of libbpf pins its processing thread to
>> that cpu X
>> - repeat above for as many af_xdp sockets as there are cpus
>>    (it's also ok to pick the same cpu X for different af_xdp sockets;
>>    then the new kthread is shared)
>> - user space configures hw to RSS to this set of af_xdp sockets.
>>    Since the ethtool api is a mess I propose to use the af_xdp api to
>>    do this rss config.
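
Most of these steps can be approximated by hand today, which may be
useful for prototyping the model before any new kernel pieces exist.
A rough sketch (irq number, interface, core and application name are
placeholders):

  # fire the af_xdp queue's interrupt on cpu 4
  echo 4 > /proc/irq/<af_xdp-queue-irq>/smp_affinity_list

  # pin the user-space processing thread to the same cpu
  taskset -c 4 ./my_af_xdp_app -q <queue-id> &

  # spread RSS across rx rings, e.g. equally over the first 8;
  # only an approximation of per-socket-set rss config
  ethtool -X <i/f> equal 8

What is missing is the dedicated napi kthread and a clean per-socket-set
way to do the RSS part, which is where a libbpf/af_xdp api would help.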
> 
> 
>   From a high-level point of view, this sounds quite sensible, but some
> details do need to be ironed out.  This essentially enforces a model of:
> 
>      (af_xdp = RX.af_xdp + bound_cpu)
>        (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
> 
> (temporarily ignoring TX for right now)
> 
> 
> I foresee two issues with the above approach:
>     1. hardware limitations in the number of queues/rings
>     2. RSS/steering rules
> 
>> - user creates af_xdp socket and binds to _CPU_ X then
>> - driver allocates single af_xdp rq queue (queue ID doesn't need to be
>> exposed)
> 
> Here, the driver may not be able to create an arbitrary RQ, but may need
> to tear down/reuse an existing one used by the stack.  This may not be
> an issue for modern hardware.
> 
>> - user space configures hw to RSS to this set of af_xdp sockets.
>>    Since the ethtool api is a mess I propose to use the af_xdp api to
>>    do this rss config.
> 
> Currently, RSS only steers default traffic.  On a system with shared
> stack/af_xdp queues, there should be a way to split the traffic types,
> unless we're talking about a model where all traffic goes to AF_XDP.
> 
> This classification has to be done by the NIC, since it comes before RSS
> steering - which currently means sending flow match rules to the NIC,
> which is less than ideal.  I agree that the ethtool interface is
> non-optimal, but it does make it clear to the user what's going on.

'tc' provides another interface to split NIC queues into groups of
queues, each group with its own RSS. For example:

tc qdisc add dev <i/f> root mqprio num_tc 3 map 0 1 2 \
    queues 2@0 32@2 8@34 hw 1 mode channel

will split the NIC queues into 3 groups of 2, 32 and 8 queues.

By default all packets go only to the first queue group with 2 queues.
Filters can be added to redirect packets to the other queue groups:

tc filter add dev <i/f> protocol ip ingress prio 1 flower \
    dst_ip 192.168.0.2 ip_proto tcp dst_port 1234 skip_sw hw_tc 1
tc filter add dev <i/f> protocol ip ingress prio 1 flower \
    dst_ip 192.168.0.3 ip_proto tcp dst_port 1234 skip_sw hw_tc 2

Here hw_tc indicates the queue group.

It should be possible to run AF_XDP on the third queue group (hw_tc 2)
by creating 8 AF_XDP sockets and binding them to queues 34-41.
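
As a rough sketch (interface name and core assignments are placeholders,
and the xdpsock sample stands in for a real application), one socket per
queue in the 8@34 group could be started like this:

  # one xdpsock instance per queue in the group, each pinned to its own core
  for i in $(seq 0 7); do
      taskset -c $i ./samples/bpf/xdpsock -i <i/f> -q $((34 + i)) -r &
  done

Each instance then maps to one RX queue and one core, matching the
one-socket/one-core model discussed above.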

Does this look like a reasonable model for using a subset of NIC queues
for AF_XDP applications?


> 
> Perhaps an af_xdp library that does some bookkeeping:
>     - open af_xdp socket
>     - define af_xdp_set as (classification, steering rules, other?)
>     - bind socket to (cpu, af_xdp_set)
>     - kernel:
>       - pins calling thread to cpu
>       - creates kthread if one doesn't exist, binds to irq and cpu
>       - has driver create RQ.af_xdp, possibly replacing RQ.stack
>       - applies (af_xdp_set) to NIC.
> 
> Seems workable, but a little complicated?  The complexity could be moved
> into a separate library.
> 
> 
>> imo that would be the simplest and most performant way of using af_xdp.
>> All configuration apis are under libbpf (or libxdp if we choose to
>> fork it).
>> End result is one af_xdp rx queue - one napi - one kthread - one user
>> thread, all pinned to the same cpu with the irq on that cpu.
>> Neither the poll nor the busy-poll approach will bounce data between
>> cpus. No 'shadow' queues to speak of, and it should solve the issues
>> that folks were bringing up in different threads.
> 
> Sounds like a sensible model from my POV.
> 

Thread overview: 24+ messages
2019-05-02  8:39 [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 1/7] net: fs: make busy poll budget configurable in napi_busy_loop Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 2/7] net: i40e: ixgbe: tun: veth: virtio-net: centralize xdp_rxq_info and add napi id Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 4/7] netdevice: introduce busy-poll setsockopt for AF_XDP Magnus Karlsson
2019-05-03  0:13   ` Samudrala, Sridhar
2019-05-03  6:35     ` Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 5/7] net: add busy-poll support for XDP sockets Magnus Karlsson
2019-05-03  0:23   ` Samudrala, Sridhar
2019-05-02  8:39 ` [RFC bpf-next 6/7] libbpf: add busy-poll support to " Magnus Karlsson
2019-05-02  8:39 ` [RFC bpf-next 7/7] samples/bpf: add busy-poll support to xdpsock sample Magnus Karlsson
2019-05-24  2:51   ` Ye Xiaolong
2019-05-27  6:58     ` Magnus Karlsson
2019-05-06 16:31 ` [RFC bpf-next 0/7] busy poll support for AF_XDP sockets Alexei Starovoitov
2019-05-07 11:51   ` Magnus Karlsson
2019-05-07 18:24     ` Alexei Starovoitov
2019-05-08 12:10       ` Magnus Karlsson
2019-05-16 12:37         ` Magnus Karlsson
2019-05-16 23:50           ` Samudrala, Sridhar
2019-05-17  7:50             ` Magnus Karlsson
2019-05-17 18:20           ` Jakub Kicinski
2019-05-18  8:49             ` Magnus Karlsson
2019-05-13 20:42       ` Jonathan Lemon
2019-05-13 23:30         ` Samudrala, Sridhar [this message]
2019-05-14  7:53         ` Björn Töpel
