linux-kernel.vger.kernel.org archive mirror
From: Eric Dumazet <edumazet@google.com>
To: Joe Damato <jdamato@fastly.com>
Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	chuck.lever@oracle.com, jlayton@kernel.org,
	linux-api@vger.kernel.org, brauner@kernel.org,
	davem@davemloft.net, alexander.duyck@gmail.com,
	sridhar.samudrala@intel.com, kuba@kernel.org,
	Wei Wang <weiwan@google.com>
Subject: Re: [net-next 0/3] Per epoll context busy poll support
Date: Wed, 24 Jan 2024 15:38:19 +0100	[thread overview]
Message-ID: <CANn89i+UiCRpu6M-hDga=dSTk1F5MjkgV=kKS6zC31pvOh78DQ@mail.gmail.com> (raw)
In-Reply-To: <20240124142008.GA1448@fastly.com>

On Wed, Jan 24, 2024 at 3:20 PM Joe Damato <jdamato@fastly.com> wrote:
>
> On Wed, Jan 24, 2024 at 09:20:09AM +0100, Eric Dumazet wrote:
> > On Wed, Jan 24, 2024 at 3:54 AM Joe Damato <jdamato@fastly.com> wrote:
> > >
> > > Greetings:
> > >
> > > TL;DR This builds on commit bf3b9f6372c4 ("epoll: Add busy poll support to
> > > epoll with socket fds.") by allowing user applications to enable
> > > epoll-based busy polling and set a busy poll packet budget on a per epoll
> > > context basis.
> > >
> > > To allow for this, two ioctls have been added for epoll contexts for
> > > getting and setting a new struct, struct epoll_params.
> > >
> > > This makes epoll-based busy polling much more usable for user
> > > applications than the current system-wide sysctl and hardcoded budget.
> > >
> > > Longer explanation:
> > >
> > > Presently epoll has support for a very useful form of busy poll based on
> > > the incoming NAPI ID (see also: SO_INCOMING_NAPI_ID [1]).
> > >
> > > This form of busy poll allows epoll_wait to drive NAPI packet processing
> > > which allows for a few interesting user application designs which can
> > > reduce latency and also potentially improve L2/L3 cache hit rates by
> > > deferring NAPI until userland has finished its work.
> > >
> > > The documentation available on this is, IMHO, a bit confusing so please
> > > allow me to explain how one might use this:
> > >
> > > 1. Ensure each application thread has its own epoll instance mapping
> > > 1-to-1 with NIC RX queues. An n-tuple filter would likely be used to
> > > direct connections with specific dest ports to these queues.
> > >
> > > 2. Optionally: Set up IRQ coalescing for the NIC RX queues where busy
> > > polling will occur. This can help prevent the userland app from being
> > > preempted by a hard IRQ while userland is running. Note this means that
> > > userland must take care to call epoll_wait and not take too long in
> > > userland since it now drives NAPI via epoll_wait.
> > >
> > > 3. Ensure that all incoming connections added to an epoll instance
> > > have the same NAPI ID. This can be done with a BPF filter when
> > > SO_REUSEPORT is used or getsockopt + SO_INCOMING_NAPI_ID when a single
> > > accept thread is used which dispatches incoming connections to threads.
> > >
> > > 4. Lastly, busy poll must be enabled via a sysctl
> > > (/proc/sys/net/core/busy_poll).
> > >
> > > The unfortunate part about step 4 above is that this enables busy poll
> > > system-wide which affects all user applications on the system,
> > > including epoll-based network applications which were not intended to
> > > be used this way or applications where increased CPU usage for lower
> > > latency network processing is unnecessary or not desirable.
> > >
> > > If the user wants to run one low latency epoll-based server application
> > > with epoll-based busy poll, but would like to run the rest of the
> > > applications on the system (which may also use epoll) without busy poll,
> > > this system-wide sysctl presents a significant problem.
> > >
> > > This change preserves the system-wide sysctl, but adds a mechanism (via
> > > ioctl) to enable or disable busy poll for epoll contexts as needed by
> > > individual applications, making epoll-based busy poll more usable.
> > >
> >
> > I think this description missed the napi_defer_hard_irqs and
> > gro_flush_timeout settings ?
>
> I'm not sure if those settings are strictly related to the change I am
> proposing which makes epoll-based busy poll something that can be
> enabled/disabled on a per-epoll context basis and allows the budget to be
> set as well, but maybe I am missing something? Sorry for my
> misunderstanding if so.
>
> IMHO: a single system-wide busy poll setting is difficult to use
> properly, and it is unfortunate that the packet budget is hardcoded. It
> would be extremely useful to be able to set both of these on a per-epoll
> basis, and I think my suggested change helps to solve this.
>
> Please let me know.
>
> Re the two settings you noted:
>
> I didn't mention those in the interest of brevity, but yes they can be used
> instead of or in addition to what I've described above.
>
> While those settings are very useful, IMHO, they have their own issues
> because they are system-wide as well. If they were settable per-NAPI, that
> would make it much easier to use them because they could be enabled for the
> NAPIs which are being busy-polled by applications that support busy-poll.
>
> Imagine you have 3 types of apps running side-by-side:
>   - A low latency epoll-based busy poll app,
>   - An app where latency doesn't matter as much, and
>   - A latency sensitive legacy app which does not yet support epoll-based
>     busy poll.
>
> In the first two cases, the settings you mention would be helpful or not
> make any difference, but in the third case the system-wide impact might be
> undesirable because having IRQs fire might be important to keep latency
> down.
>
> If your comment was more that my cover letter should have mentioned these,
> I can include that in a future cover letter or suggest some kernel
> documentation which will discuss all of these features and how they relate
> to each other.
>
> >
> > I would think that if an application really wants to make sure its
> > thread is the only one
> > eventually calling napi->poll(), we must make sure NIC interrupts stay masked.
> >
> > Current implementations of busy poll always release NAPI_STATE_SCHED bit when
> > returning to user space.
> >
> > It seems you want to make sure that the application, and only the
> > application, calls napi->poll()
> > at chosen times.
> >
> > Some kind of contract is needed, and the presence of the hrtimer
> > (currently only driven from dev->gro_flush_timeout)
> > would allow us to do that correctly.
> >
> > Whenever we 'trust' user space to perform the napi->poll() shortly, we
> > also want to arm the hrtimer to eventually detect that
> > the application took too long, and restart the other mechanisms (NIC IRQ based)
>
> There is another change [1] I've been looking at from a research paper [2]
> which does something similar to what you've described above -- it keeps
> IRQs suppressed during busy polling. The paper suggests a performance
> improvement is measured when using a mechanism like this to keep IRQs off.
> Please see the paper for more details.
>
> I haven't had a chance to reach out to the authors or to tweak this patch
> to attempt an RFC / submission for it, but it seems fairly promising in my
> initial synthetic tests.
>
> When I tested their patch, as you might expect, no IRQs were generated at
> all for the NAPIs that were being busy polled, but the rest of the
> NAPIs and queues were generating IRQs as expected.
>
> Regardless of the above patch: I think my proposed change is helpful and
> the IRQ suppression bit can be handled in a separate change in the future.
> What do you think?
>
> > Note that we added kthread-based napi polling, and we are working
> > to add a busy polling feature to these kthreads,
> > allowing NIC interrupts to be completely masked and latencies further reduced.
>
> I am aware of kthread-based NAPI polling, yes, but I was not aware that
> busy polling was being considered as a feature for them, thanks for the
> heads-up.
>
> > Thank you
>
> Thanks for your comments - I appreciate your time and attention.
>
> Could you let me know if your comments are meant as a NACK or similar?

Patch #2 needs the 'why' part, and why would we allow user space to
ask to poll up to 65535 packets...
There is a reason we have a warning in place when a driver attempts to
set a budget bigger than 64.

You cited recent papers; I wrote this one specifically about Linux busy
polling: https://netdevconf.info/2.1/papers/BusyPollingNextGen.pdf

Kthread busy polling had been in the pipeline when Wei sent her patches and follow-ups:

cb038357937ee4f589aab2469ec3896dce90f317 net: fix race between napi kthread mode and busy poll
5fdd2f0e5c64846bf3066689b73fc3b8dddd1c74 net: add sysfs attribute to control napi threaded mode
29863d41bb6e1d969c62fdb15b0961806942960e net: implement threaded-able napi poll loop support

I am saying that I am currently working on the kthread busy polling
implementation, after fixing two bugs in SCTP and UDP (which made me
wonder whether busy polling is really used these days).

I am also considering unifying napi_threaded_poll() and napi_busy_loop(),
and seeing your patches come in makes this work more challenging.


Thread overview: 24+ messages
2024-01-24  2:53 [net-next 0/3] Per epoll context busy poll support Joe Damato
2024-01-24  2:53 ` [net-next 1/3] eventpoll: support busy poll per epoll instance Joe Damato
2024-01-24  2:53 ` [net-next 2/3] eventpoll: Add per-epoll busy poll packet budget Joe Damato
2024-01-24  2:53 ` [net-next 3/3] eventpoll: Add epoll ioctl for epoll_params Joe Damato
2024-01-24 15:37   ` Joe Damato
2024-01-24  8:20 ` [net-next 0/3] Per epoll context busy poll support Eric Dumazet
2024-01-24 14:20   ` Joe Damato
2024-01-24 14:38     ` Eric Dumazet [this message]
2024-01-24 15:19       ` Joe Damato
2024-01-30 18:54   ` Samudrala, Sridhar
2024-02-02  3:28     ` Joe Damato
2024-02-02 17:23       ` Samudrala, Sridhar
2024-02-02 18:22         ` Jakub Kicinski
2024-02-02 19:33           ` Joe Damato
2024-02-02 19:58             ` Jakub Kicinski
2024-02-02 20:23               ` Joe Damato
2024-02-02 20:50                 ` Samudrala, Sridhar
2024-02-02 20:55                   ` Joe Damato
2024-02-03  1:15                 ` Jakub Kicinski
2024-02-05 18:17                   ` Stanislav Fomichev
2024-02-05 18:52                     ` Joe Damato
2024-02-05 19:48                       ` Jakub Kicinski
2024-02-05 18:51                   ` Joe Damato
2024-02-05 19:59                     ` Jakub Kicinski
