* AF_XDP new prefer busy poll
From: Dan Siemon @ 2021-04-01 20:08 UTC
  To: xdp-newbies

I've started working on adding SO_PREFER_BUSY_POLL [1] to afxdp-rs [2].
I have a few questions that I haven't been able to answer definitively
from docs or commits.

1) To confirm, is configuration like the below required?

echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout

2) It's not clear to me what polling operations are required. It looks
like the xdpsock example was modified to call recvfrom() and sendto()
in every situation where previously the condition was that the
need_wakeup flag was set on one of the queues. It looks like this
structure may do extra syscalls?

Is it sufficient to 'poll' (I don't mean the syscall here) the socket
once with one syscall operation, or do we need the equivalent of a send
and recv operation (like the example) in each loop iteration?

3) The patch linked below mentions adding recvmsg and sendmsg support
for busy polling. The xdpsock example uses recvfrom(). What is the set
of syscalls that can drive the busy polling? Is there a recommendation
for which one(s) should be used?

4) In situations where there are multiple sockets, will it work to do
one poll syscall with multiple fds to reduce the number of syscalls? Is
that a good idea?

5)

"If the application stops performing busy-polling via a system call,
the watchdog timer defined by gro_flush_timeout will timeout, and
regular softirq handling will resume."

Does this imply that if the application fails to poll within the
watchdog time that it needs to take action to get back into prefer busy
polling mode?

On the plus side, the initial performance numbers look good but there
are a lot of drops as traffic ramps up that I haven't figured out the
cause of yet. There are no drops once it's running in a steady state.

Thanks for any help or insight.

[1] - https://lwn.net/Articles/837010/
[2] - https://github.com/aterlo/afxdp-rs



* Re: AF_XDP new prefer busy poll
From: Björn Töpel @ 2021-04-05  8:26 UTC
  To: Dan Siemon; +Cc: Xdp, Karlsson, Magnus

Hey Dan, sorry for the late reply. I'm in Easter mode. :-)

On Thu, 1 Apr 2021 at 22:09, Dan Siemon <dan@coverfire.com> wrote:
>
> I've started working on adding SO_PREFER_BUSY_POLL [1] to afxdp-rs [2].
> I have a few questions that I haven't been able to answer definitively
> from docs or commits.
>
> 1) To confirm, is configuration like the below required?
>
> echo 2 | sudo tee /sys/class/net/ens785f1/napi_defer_hard_irqs
> echo 200000 | sudo tee /sys/class/net/ens785f1/gro_flush_timeout
>

Yes, but the defer count and timeout are really up to you. The "prefer
busy-polling" is built on top of commit 6f8b12d661d0 ("net: napi: add
hard irqs deferral feature").

> 2) It's not clear to me what polling operations are required. It looks
> like the xdpsock example was modified to call recvfrom() and sendto()
> in every situation where previously the condition was that the
> need_wakeup flag was set on one of the queues. It looks like this
> structure may do extra syscalls?
>

Yes. More below.

> Is it sufficient to 'poll' (I don't mean the syscall here) the socket
> once with one syscall operation, or do we need the equivalent of a send
> and recv operation (like the example) in each loop iteration?
>

The idea with busy-polling from a kernel perspective is that the
driver code is entered via the syscall (read or write). For the
receive side: syscall() -> enter the napi poll implementation of the
netdev, and pass the packets (if any) to the XDP socket ring.
With busy-polling enabled there are no interrupts or softirq
mechanisms that execute the driver code. IOW, it's up to userland to
call the driver via a syscall. Busy-polling will typically require
more syscalls than a need_wakeup mode (as you noted above).

Again, when you are executing in busy-polling mode, the userland
application has to do a syscall to run the driver code. Does that mean
that userland can starve the driver? No, and this is where the
timeout/defer count comes in. I'll expand on this in 5 (below).

> 3) The patch linked below mentions adding recvmsg and sendmsg support
> for busy polling. The xdpsock example uses recvfrom(). What is the set
> of syscalls that can drive the busy polling? Is there a recommendation
> for which one(s) should be used?
>

recvmsg/sendmsg and poll (which means read/recvfrom/recvmsg, and the
corresponding calls on the write side). Use recvfrom for rx queues, and
sendto for tx queues. Poll works as well, but the overhead for poll
is larger than send/recv.

> 4) In situations where there are multiple sockets, will it work to do
> one poll syscall with multiple fds to reduce the number of syscalls? Is
> that a good idea?
>

The current implementation is really a one socket/syscall thing.
Magnus and I had some ideas on extending busy-polling for a set of
sockets, but haven't had a use-case for it yet.

> 5)
>
> "If the application stops performing busy-polling via a system call,
> the watchdog timer defined by gro_flush_timeout will timeout, and
> regular softirq handling will resume."
>
> Does this imply that if the application fails to poll within the
> watchdog time that it needs to take action to get back into prefer busy
> polling mode?
>

Yes. If the application fails to poll within the specified timeout,
the kernel will do driver polling at the pace of the timeout, and if
there are no packets after "defer count" such polls, interrupts will
be re-enabled. This is to ensure that the driver is not starved by
userland. (With your values above, 200000 ns and a defer count of 2,
that means a driver poll roughly every 200 us, and interrupts coming
back after about two empty polls.) Have a look at Eric's commit above
for details on the defer/timeout logic.

> On the plus side, the initial performance numbers look good but there
> are a lot of drops as traffic ramps up that I haven't figured out the
> cause of yet. There are no drops once it's running in a steady state.
>

Interesting! Please let me know your results, and if you run into weirdness!


Also, please let me know if you need more input!

Cheers!
Björn


> Thanks for any help or insight.
>
> [1] - https://lwn.net/Articles/837010/
> [2] - https://github.com/aterlo/afxdp-rs
>


* Re: AF_XDP new prefer busy poll
From: Dan Siemon @ 2021-04-06  0:50 UTC
  To: Björn Töpel, Xdp; +Cc: Karlsson, Magnus

On Mon, 2021-04-05 at 10:26 +0200, Björn Töpel wrote:
> 
> > 3) The patch linked below mentions adding recvmsg and sendmsg
> > support
> > for busy polling. The xdpsock example uses recvfrom(). What is the
> > set
> > of syscalls that can drive the busy polling? Is there a
> > recommendation
> > for which one(s) should be used?
> > 
> 
> recvmsg/sendmsg and poll (which means read/recvfrom/recvmsg, and the
> corresponding calls on the write side). Use recvfrom for rx queues, and
> sendto for tx queues. Poll works as well, but the overhead for poll
> is larger than send/recv.

To clarify, does this mean:
* When a descriptor is added to fill ring or tx ring, call sendmsg.
* When looking for descriptors in completion ring or rx ring, first
call recvmsg()

?

Or are the fq and cq different vs. tx and rx?

It might be useful to outline an idealized xsk loop. The loop I have
looks something like:

for each socket:
1) Process completion queue (read from cq)
2) Try to receive descriptors (read from rx queue)
3) Send any pending packets (write to tx queue)
4) Add descriptors to fq [based on a deficit counter condition] (write
to fq)

[My use case is packet forwarding between sockets]

Ideally there wouldn't be a syscall in each of those four steps.

Is it acceptable to call recvmsg() once at the top of the loop and only
call sendmsg() if one of steps 3 or 4 wrote to a queue (fq, tx)?

In my use case, packet forwarding with dedicated cores, if one syscall
at the top of the loop did 'send' and 'receive' that might be more
efficient as the next iteration can process the descriptors written in
the previous iteration.

> 
> > 5)
> > 
> > "If the application stops performing busy-polling via a system
> > call,
> > the watchdog timer defined by gro_flush_timeout will timeout, and
> > regular softirq handling will resume."
> > 
> > Does this imply that if the application fails to poll within the
> > watchdog time that it needs to take action to get back into prefer
> > busy
> > polling mode?
> > 
> 
> Yes. If the application fails to poll within the specified timeout,
> the kernel will do driver polling at a pace of the timeout, and if
> there are no packets after "defer count" times, interrupts will be
> enabled. This to ensure that the driver is not starved by userland.
> Have a look at Eric's commit above for details on the defer/timeout
> logic.

I need to dig a bit to understand this more. How does the application
determine that interrupts have been re-enabled so it can disable them
again?

Thanks for your help.



* Re: AF_XDP new prefer busy poll
From: Björn Töpel @ 2021-04-06 14:30 UTC
  To: Dan Siemon; +Cc: Xdp, Karlsson, Magnus

On Tue, 6 Apr 2021 at 02:50, Dan Siemon <dan@coverfire.com> wrote:
>
> On Mon, 2021-04-05 at 10:26 +0200, Björn Töpel wrote:
> >
> > > 3) The patch linked below mentions adding recvmsg and sendmsg
> > > support
> > > for busy polling. The xdpsock example uses recvfrom(). What is the
> > > set
> > > of syscalls that can drive the busy polling? Is there a
> > > recommendation
> > > for which one(s) should be used?
> > >
> >
> > recvmsg/sendmsg and poll (which means read/recvfrom/recvmsg, and the
> > corresponding calls on the write side). Use recvfrom for rx queues, and
> > sendto for tx queues. Poll works as well, but the overhead for poll
> > is larger than send/recv.
>
> To clarify, does this mean:
> * When a descriptor is added to fill ring or tx ring, call sendmsg.
> * When looking for descriptors in completion ring or rx ring, first
> call recvmsg()
>
> ?

Not quite; Tx (completion/Tx ring): sendmsg, Rx (fill/Rx ring): recvmsg.

>
> Or are the fq and cq different vs. tx and rx?
>
> It might be useful to outline an idealized xsk loop. The loop I have
> looks something like:
>
> for each socket:
> 1) Process completion queue (read from cq)
> 2) Try to receive descriptors (read from rx queue)
> 3) Send any pending packets (write to tx queue)
> 4) Add descriptors to fq [based on a deficit counter condition] (write
> to fq)
>
> [My use case is packet forwarding between sockets]
>
> Ideally there wouldn't be a syscall in each of those four steps.
>
> Is it acceptable to call recvmsg() once at the top of the loop and only
> call sendmsg() if one of steps 3 or 4 wrote to a queue (fq, tx)?
>

Yes, and moreover on the Tx side, you can write multiple packets and
then call one sendmsg() (but then the latency will be worse).
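
So one iteration of your per-socket loop could be shaped roughly like
this (a sketch only; the actual ring processing via afxdp-rs/libbpf is
elided into comments, and wrote_tx stands in for whatever bookkeeping
you already keep):

#include <stdbool.h>
#include <sys/socket.h>

static void xsk_loop_iteration(int xsk_fd)
{
        bool wrote_tx = false;

        /* One recvfrom at the top drives Rx: the driver consumes the
         * fill ring (including entries queued by the previous
         * iteration) and produces descriptors onto the Rx ring. */
        recvfrom(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, NULL);

        /* 1) drain the completion ring (cq)                          */
        /* 2) read new descriptors from the Rx ring                   */
        /* 3) queue pending packets on the Tx ring -> wrote_tx = true */
        /* 4) refill the fill ring (fq); this is picked up by the
         *    recvfrom at the top of the next iteration               */

        /* One sendto drives Tx and completion reaping, only when
         * something was actually queued to send. */
        if (wrote_tx)
                sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
}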

> In my use case, packet forwarding with dedicated cores, if one syscall
> at the top of the loop did 'send' and 'receive' that might be more
> efficient as the next iteration can process the descriptors written in
> the previous iteration.
>
> >
> > > 5)
> > >
> > > "If the application stops performing busy-polling via a system
> > > call,
> > > the watchdog timer defined by gro_flush_timeout will timeout, and
> > > regular softirq handling will resume."
> > >
> > > Does this imply that if the application fails to poll within the
> > > watchdog time that it needs to take action to get back into prefer
> > > busy
> > > polling mode?
> > >
> >
> > Yes. If the application fails to poll within the specified timeout,
> > the kernel will do driver polling at the pace of the timeout, and if
> > there are no packets after "defer count" such polls, interrupts will
> > be re-enabled. This is to ensure that the driver is not starved by
> > userland. Have a look at Eric's commit above for details on the
> > defer/timeout logic.
>
> I need to dig a bit to understand this more. How does the application
> determine that interrupts have been re-enabled so it can disable them
> again?
>

The application doesn't need to care about that. It's really just an
implementation detail; the only thing the application needs to do is
set the timeout/defer count, and make sure to do syscalls. Depending
on the kind of flows, the timeout/defer count can be tweaked for
better latency. The fact that interrupts get re-enabled is just to
make sure that the driver isn't starved *if* the application is badly
behaved.

Does that make sense?


Cheers,
Björn

> Thanks for your help.
>

