From: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>
To: Jonathan Lemon <bsd@fb.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: "Magnus Karlsson" <magnus.karlsson@intel.com>,
"Björn Töpel" <bjorn.topel@intel.com>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"Network Development" <netdev@vger.kernel.org>,
"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
"Jakub Kicinski" <jakub.kicinski@netronome.com>
Subject: Re: [RFC bpf-next 0/7] busy poll support for AF_XDP sockets
Date: Mon, 13 May 2019 16:30:58 -0700 [thread overview]
Message-ID: <d55253dd-42b4-9cb1-ddc9-4f74c06ec845@intel.com> (raw)
In-Reply-To: <D40B5C89-53F8-4EC1-AB35-FB7C395864DE@fb.com>
On 5/13/2019 1:42 PM, Jonathan Lemon wrote:
> Tossing in my $0.02:
>
>
> I anticipate that most users of AF_XDP will want packet processing
> for a given RX queue occurring on a single core - otherwise we end
> up with cache delays. The usual model is one thread, one socket,
> one core, but this isn't enforced anywhere in the AF_XDP code; it is
> up to the user to set this up.
AF_XDP with busy-poll should allow a single thread to poll a given RX
queue using a single core.
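As a rough sketch of that model (the AF_XDP socket fd is assumed to be
created and bound to the RX queue elsewhere, and error handling is
omitted), the processing thread would look roughly like:

#define _GNU_SOURCE
#include <pthread.h>
#include <poll.h>
#include <sched.h>

/* One user thread, one AF_XDP socket, one core: pin ourselves to the
 * core that handles the queue's napi and poll the socket from there. */
static void rx_loop(int xsk_fd, int cpu)
{
        struct pollfd pfd = { .fd = xsk_fd, .events = POLLIN };
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (;;) {
                /* with the busy-poll support in this series the kernel
                 * can spin on the napi here instead of sleeping */
                poll(&pfd, 1, -1);
                /* ... drain the RX ring and refill the fill ring ... */
        }
}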
>
> On 7 May 2019, at 11:24, Alexei Starovoitov wrote:
>> I'm not saying that we shouldn't do busy-poll. I'm saying it's
>> complementary, but in all cases a single core per af_xdp rx queue
>> with user thread pinning is preferred.
>
> So I think we're on the same page here.
>
>> Stack rx queues and af_xdp rx queues should look almost the same from
>> the napi point of view. Stack -> normal napi in softirq. af_xdp -> a
>> new kthread to work with both poll and busy-poll. The only difference
>> between poll and busy-poll will be the running context: new kthread vs
>> user task.
> ...
>> A burst of 64 packets on stack queues or some other work in softirqd
>> will spike the latency for af_xdp queues if softirq is shared.
>
> True, but would it be shared? This goes back to the current model,
> which, as used by Intel, is:
>
> (channel == RX, TX, softirq)
>
> MLX, on the other hand, wants:
>
> (channel == RX.stack, RX.AF_XDP, TX.stack, TX.AF_XDP, softirq)
>
> Which would indeed lead to sharing. The more I look at the above, the
> more I dislike it. Perhaps this should be disallowed?
>
> I believe there was some mention at LSF/MM that the 'channel' concept
> was something specific to HW and really shouldn't be part of the SW API.
>
>> Hence the proposal for new napi_kthreads:
>> - user creates af_xdp socket and binds to _CPU_ X, then
>> - driver allocates a single af_xdp rx queue (queue ID doesn't need to
>> be exposed)
>> - spawns kthread pinned to cpu X
>> - configures irq for that af_xdp queue to fire on cpu X
>> - user space, with the help of libbpf, pins its processing thread to
>> that cpu X
>> - repeat the above for as many af_xdp sockets as there are cpus
>> (it's also ok to pick the same cpu X for different af_xdp sockets;
>> then the new kthread is shared)
>> - user space configures hw to RSS to this set of af_xdp sockets.
>> since the ethtool api is a mess I propose to use the af_xdp api to
>> do this rss config
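The irq-steering step above is something applications have to do by
hand today. As a rough sketch of what libbpf could wrap (the irq number
is a placeholder that would have to be looked up, e.g. from
/proc/interrupts):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Steer a queue's interrupt to the same cpu the processing thread is
 * pinned to, by writing to /proc/irq/<irq>/smp_affinity_list. */
static int pin_irq_to_cpu(unsigned int irq, int cpu)
{
        char path[64], buf[16];
        int fd, len, ret;

        snprintf(path, sizeof(path), "/proc/irq/%u/smp_affinity_list", irq);
        fd = open(path, O_WRONLY);
        if (fd < 0)
                return -1;
        len = snprintf(buf, sizeof(buf), "%d", cpu);
        ret = (write(fd, buf, len) == len) ? 0 : -1;
        close(fd);
        return ret;
}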
>
>
> From a high-level point of view, this sounds quite sensible, but it
> does need some details ironed out. The proposal above essentially
> enforces a model of:
>
> (af_xdp = RX.af_xdp + bound_cpu)
> (bound_cpu = hw.cpu + af_xdp.kthread + hw.irq)
>
> (temporarily ignoring TX for right now)
>
>
> I foresee two issues with the above approach:
> 1. hardware limitations in the number of queues/rings
> 2. RSS/steering rules
>
>> - user creates af_xdp socket and binds to _CPU_ X, then
>> - driver allocates a single af_xdp rx queue (queue ID doesn't need to
>> be exposed)
>
> Here, the driver may not be able to create an arbitrary RQ, but may
> need to tear down/reuse an existing one used by the stack. This may
> not be an issue for modern hardware.
>
>> - user space configures hw to RSS to this set of af_xdp sockets.
>> since the ethtool api is a mess I propose to use the af_xdp api to
>> do this rss config
>
> Currently, RSS only steers default traffic. On a system with shared
> stack/af_xdp queues, there should be a way to split the traffic types,
> unless we're talking about a model where all traffic goes to AF_XDP.
>
> This classification has to be done by the NIC, since it comes before
> RSS steering - which currently means sending flow-match rules to the
> NIC, and that is less than ideal. I agree that the ethtool interface
> is not optimal, but it does make it clear to the user what's going on.
'tc' provides another interface to split the NIC queues into groups of
queues, each with its own RSS. For example:

tc qdisc add dev <i/f> root mqprio num_tc 3 map 0 1 2 queues 2@0 32@2 8@34 hw 1 mode channel

will split the NIC queues into 3 groups of 2, 32 and 8 queues.
By default all packets go only to the first queue group (the 2 queues).
Filters can be added to redirect packets to the other queue groups:

tc filter add dev <i/f> protocol ip ingress prio 1 flower dst_ip 192.168.0.2 ip_proto tcp dst_port 1234 skip_sw hw_tc 1
tc filter add dev <i/f> protocol ip ingress prio 1 flower dst_ip 192.168.0.3 ip_proto tcp dst_port 1234 skip_sw hw_tc 2

Here hw_tc indicates the queue group.
It should be possible to run AF_XDP on queue group 3 by creating 8
af_xdp sockets and binding them to queues 34-41.
Does this look like a reasonable model for using a subset of NIC queues
for af_xdp applications?
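As a rough sketch (umem and ring setup omitted, error handling kept
minimal, and the names below made up for illustration), those 8 sockets
could be created and bound with libbpf's xsk API roughly like this:

#include <bpf/xsk.h>

#define QGRP3_FIRST_QID 34
#define QGRP3_NUM_QIDS  8

/* One socket per hardware queue in queue group 3 (queues 34-41).
 * umems[] is assumed to be set up elsewhere, one umem per socket. */
static struct xsk_socket *xsks[QGRP3_NUM_QIDS];
static struct xsk_ring_cons rxr[QGRP3_NUM_QIDS];
static struct xsk_ring_prod txr[QGRP3_NUM_QIDS];

static int create_qgrp3_sockets(const char *ifname, struct xsk_umem **umems)
{
        struct xsk_socket_config cfg = {
                .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
                .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
        };
        int i, err;

        for (i = 0; i < QGRP3_NUM_QIDS; i++) {
                err = xsk_socket__create(&xsks[i], ifname,
                                         QGRP3_FIRST_QID + i, umems[i],
                                         &rxr[i], &txr[i], &cfg);
                if (err)
                        return err;
        }
        return 0;
}

Each socket's processing thread would then be pinned to the core its
queue's irq is steered to, as discussed above.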
>
> Perhaps an af_xdp library that does some bookkeeping:
> - open af_xdp socket
> - define af_xdp_set as (classification, steering rules, other?)
> - bind socket to (cpu, af_xdp_set)
> - kernel:
>   - pins calling thread to cpu
>   - creates kthread if one doesn't exist, binds to irq and cpu
>   - has driver create RQ.af_xdp, possibly replacing RQ.stack
>   - applies (af_xdp_set) to NIC.
>
> Seems workable, but a little complicated? The complexity could be moved
> into a separate library.
>
>
>> imo that would be the simplest and most performant way of using af_xdp.
>> All configuration apis are under libbpf (or libxdp if we choose to
>> fork it)
>> End result is one af_xdp rx queue - one napi - one kthread - one user
>> thread.
>> All pinned to the same cpu with irq on that cpu.
>> Neither poll nor busy-poll will bounce data between cpus.
>> No 'shadow' queues to speak of, and this should solve the issues that
>> folks were bringing up in different threads.
>
> Sounds like a sensible model from my POV.
>