All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Björn Töpel" <bjorn.topel@intel.com>
To: Magnus Karlsson <magnus.karlsson@gmail.com>,
	Maxim Mikityanskiy <maximmi@nvidia.com>
Cc: "Björn Töpel" <bjorn.topel@gmail.com>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Network Development" <netdev@vger.kernel.org>,
	bpf <bpf@vger.kernel.org>,
	"Maxim Mikityanskiy" <maximmi@mellanox.com>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	"David S. Miller" <davem@davemloft.net>,
	kuba@kernel.org, hawk@kernel.org,
	"John Fastabend" <john.fastabend@gmail.com>,
	intel-wired-lan <intel-wired-lan@lists.osuosl.org>
Subject: Re: [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
Date: Tue, 8 Sep 2020 14:21:54 +0200	[thread overview]
Message-ID: <75146564-2774-47ac-cc75-40d74bea50d8@intel.com> (raw)
In-Reply-To: <CAJ8uoz3WbS7E1OiC5p8x+o48vwkN43R9JxMwvRvgVk4n3SNiZg@mail.gmail.com>

On 2020-09-08 13:37, Magnus Karlsson wrote:
> On Tue, Sep 8, 2020 at 12:33 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>>
>> On 2020-09-04 16:59, Björn Töpel wrote:
>>> On 2020-09-04 15:53, Björn Töpel wrote:
>>>> This series addresses a problem that arises when AF_XDP zero-copy is
>>>> enabled, and the kernel softirq Rx processing and userland process
>>>> is running on the same core.
>>>>
>>> [...]
>>>>
>>>
>>> @Maxim I'm not well versed in Mellanox drivers. Would this be relevant
>>> to mlx5 as well?
>>
>> Thanks for letting me know about this series! So the basic idea is to
>> stop processing hardware completions if the RX ring gets full, because
>> the application didn't have chance to run? Yes, I think it's also
>> relevant to mlx5, the issue is not driver-specific, and a similar fix is
>> applicable. However, it may lead to completion queue overflows - some
>> analysis is needed to understand what happens then and how to handle it.
>>
>> Regarding the feature, I think it should be opt-in (disabled by
>> default), because old applications may not wakeup RX after they process
>> packets in the RX ring.
>

First of all; Max, thanks for some really good input as usual!


> How about need_wakeup enable/disable at bind time being that opt-in,
> instead of a new option? It is off by default, and when it is off, the
> driver busy-spins on the Rx ring until it can put an entry there. It
> will not yield to the application by returning something less than
> budget. Applications need not check the need_wakeup flag. If
> need_wakeup is enabled by the user, the contract is that user-space
> needs to check the need_wakeup flag and act on it. If it does not,
> then that is a programming error and it can be set for any unspecified
> reason. No reason to modify the application, if it checks need_wakeup.
> But if this patch behaves like that I have not checked.
>

It does not behave exactly like this. If need_wakeup is enabled, the 
napi look exists, but if not the napi is rearmed -- so we'll get a less 
efficient loop. I'll need address this.

I'm leaning towards a more explicit opt-in like Max suggested. As Max 
pointed out, AF_XDP/XDP_TX is actually a non-insane(!) way of using 
AF_XDP zero-copy, which will suffer from the early exit.


> Good points in the rest of the mail, that I think should be addressed.
>

Yeah, I agree. I will take a step back an rethink. I'll experiment with
the busy-polling suggested by Jakub, and also an opt-in early exit.

Thanks for all the suggestions folks!


Cheers,
Björn


> /Magnus
> 
>> Is it required to change xdpsock accordingly?
>> Also, when need_wakeup is disabled, your driver implementation seems to
>> quit NAPI anyway, but it shouldn't happen, because no one will send a
>> wakeup.
>>
>> Waiting until the RX ring fills up, then passing control to the
>> application and waiting until the hardware completion queue fills up,
>> and so on increases latency - the busy polling approach sounds more
>> legit here.
>>
>> The behavior may be different depending on the driver implementation:
>>
>> 1. If you arm the completion queue and leave interrupts enabled on early
>> exit too, the application will soon be interrupted anyway and won't have
>> much time to process many packets, leading to app <-> NAPI ping-pong one
>> packet at a time, making NAPI inefficient.
>>
>> 2. If you don't arm the completion queue on early exit and wait for the
>> explicit wakeup from the application, it will easily overflow the
>> hardware completion queue, because we don't have a symmetric mechanism
>> to stop the application on imminent hardware queue overflow. It doesn't
>> feel correct and may be trickier to handle: if the application is too
>> slow, such drops should happen on driver/kernel level, not in hardware.
>>
>> Which behavior is used in your drivers? Or am I missing some more options?
>>
>> BTW, it should be better to pass control to the application before the
>> first dropped packet, not after it has been dropped.
>>
>> Some workloads different from pure AF_XDP, for example, 50/50 AF_XDP and
>> XDP_TX may suffer from such behavior, so it's another point to make a
>> knob on the application layer to enable/disable it.
>>
>>   From the driver API perspective, I would prefer to see a simpler API if
>> possible. The current API exposes things that the driver shouldn't know
>> (BPF map type), and requires XSK-specific handling. It would be better
>> if some specific error code returned from xdp_do_redirect was reserved
>> to mean "exit NAPI early if you support it". This way we wouldn't need
>> two new helpers, two xdp_do_redirect functions, and this approach would
>> be extensible to other non-XSK use cases without further changes in the
>> driver, and also the logic to opt-in the feature could be put inside the
>> kernel.
>>
>> Thanks,
>> Max
>>
>>>
>>> Cheers,
>>> Björn
>>

WARNING: multiple messages have this Message-ID (diff)
From: =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?= <bjorn.topel@intel.com>
To: intel-wired-lan@osuosl.org
Subject: [Intel-wired-lan] [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full
Date: Tue, 8 Sep 2020 14:21:54 +0200	[thread overview]
Message-ID: <75146564-2774-47ac-cc75-40d74bea50d8@intel.com> (raw)
In-Reply-To: <CAJ8uoz3WbS7E1OiC5p8x+o48vwkN43R9JxMwvRvgVk4n3SNiZg@mail.gmail.com>

On 2020-09-08 13:37, Magnus Karlsson wrote:
> On Tue, Sep 8, 2020 at 12:33 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>>
>> On 2020-09-04 16:59, Bj?rn T?pel wrote:
>>> On 2020-09-04 15:53, Bj?rn T?pel wrote:
>>>> This series addresses a problem that arises when AF_XDP zero-copy is
>>>> enabled, and the kernel softirq Rx processing and userland process
>>>> is running on the same core.
>>>>
>>> [...]
>>>>
>>>
>>> @Maxim I'm not well versed in Mellanox drivers. Would this be relevant
>>> to mlx5 as well?
>>
>> Thanks for letting me know about this series! So the basic idea is to
>> stop processing hardware completions if the RX ring gets full, because
>> the application didn't have chance to run? Yes, I think it's also
>> relevant to mlx5, the issue is not driver-specific, and a similar fix is
>> applicable. However, it may lead to completion queue overflows - some
>> analysis is needed to understand what happens then and how to handle it.
>>
>> Regarding the feature, I think it should be opt-in (disabled by
>> default), because old applications may not wakeup RX after they process
>> packets in the RX ring.
>

First of all; Max, thanks for some really good input as usual!


> How about need_wakeup enable/disable at bind time being that opt-in,
> instead of a new option? It is off by default, and when it is off, the
> driver busy-spins on the Rx ring until it can put an entry there. It
> will not yield to the application by returning something less than
> budget. Applications need not check the need_wakeup flag. If
> need_wakeup is enabled by the user, the contract is that user-space
> needs to check the need_wakeup flag and act on it. If it does not,
> then that is a programming error and it can be set for any unspecified
> reason. No reason to modify the application, if it checks need_wakeup.
> But if this patch behaves like that I have not checked.
>

It does not behave exactly like this. If need_wakeup is enabled, the 
napi look exists, but if not the napi is rearmed -- so we'll get a less 
efficient loop. I'll need address this.

I'm leaning towards a more explicit opt-in like Max suggested. As Max 
pointed out, AF_XDP/XDP_TX is actually a non-insane(!) way of using 
AF_XDP zero-copy, which will suffer from the early exit.


> Good points in the rest of the mail, that I think should be addressed.
>

Yeah, I agree. I will take a step back an rethink. I'll experiment with
the busy-polling suggested by Jakub, and also an opt-in early exit.

Thanks for all the suggestions folks!


Cheers,
Bj?rn


> /Magnus
> 
>> Is it required to change xdpsock accordingly?
>> Also, when need_wakeup is disabled, your driver implementation seems to
>> quit NAPI anyway, but it shouldn't happen, because no one will send a
>> wakeup.
>>
>> Waiting until the RX ring fills up, then passing control to the
>> application and waiting until the hardware completion queue fills up,
>> and so on increases latency - the busy polling approach sounds more
>> legit here.
>>
>> The behavior may be different depending on the driver implementation:
>>
>> 1. If you arm the completion queue and leave interrupts enabled on early
>> exit too, the application will soon be interrupted anyway and won't have
>> much time to process many packets, leading to app <-> NAPI ping-pong one
>> packet at a time, making NAPI inefficient.
>>
>> 2. If you don't arm the completion queue on early exit and wait for the
>> explicit wakeup from the application, it will easily overflow the
>> hardware completion queue, because we don't have a symmetric mechanism
>> to stop the application on imminent hardware queue overflow. It doesn't
>> feel correct and may be trickier to handle: if the application is too
>> slow, such drops should happen on driver/kernel level, not in hardware.
>>
>> Which behavior is used in your drivers? Or am I missing some more options?
>>
>> BTW, it should be better to pass control to the application before the
>> first dropped packet, not after it has been dropped.
>>
>> Some workloads different from pure AF_XDP, for example, 50/50 AF_XDP and
>> XDP_TX may suffer from such behavior, so it's another point to make a
>> knob on the application layer to enable/disable it.
>>
>>   From the driver API perspective, I would prefer to see a simpler API if
>> possible. The current API exposes things that the driver shouldn't know
>> (BPF map type), and requires XSK-specific handling. It would be better
>> if some specific error code returned from xdp_do_redirect was reserved
>> to mean "exit NAPI early if you support it". This way we wouldn't need
>> two new helpers, two xdp_do_redirect functions, and this approach would
>> be extensible to other non-XSK use cases without further changes in the
>> driver, and also the logic to opt-in the feature could be put inside the
>> kernel.
>>
>> Thanks,
>> Max
>>
>>>
>>> Cheers,
>>> Bj?rn
>>

  reply	other threads:[~2020-09-08 19:10 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-04 13:53 [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring is full Björn Töpel
2020-09-04 13:53 ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-04 13:53 ` [PATCH bpf-next 1/6] xsk: improve xdp_do_redirect() error codes Björn Töpel
2020-09-04 13:53   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-04 13:53 ` [PATCH bpf-next 2/6] xdp: introduce xdp_do_redirect_ext() function Björn Töpel
2020-09-04 13:53   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-04 13:53 ` [PATCH bpf-next 3/6] xsk: introduce xsk_do_redirect_rx_full() helper Björn Töpel
2020-09-04 13:53   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-04 15:11   ` Jesper Dangaard Brouer
2020-09-04 15:11     ` [Intel-wired-lan] " Jesper Dangaard Brouer
2020-09-04 15:39     ` Björn Töpel
2020-09-04 15:39       ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-07 12:45       ` Jesper Dangaard Brouer
2020-09-07 12:45         ` [Intel-wired-lan] " Jesper Dangaard Brouer
2020-09-04 13:53 ` [PATCH bpf-next 4/6] i40e, xsk: finish napi loop if AF_XDP Rx queue is full Björn Töpel
2020-09-04 13:53   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-04 13:53 ` [PATCH bpf-next 5/6] ice, " Björn Töpel
2020-09-04 13:53   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-04 13:53 ` [PATCH bpf-next 6/6] ixgbe, " Björn Töpel
2020-09-04 13:53   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-04 15:35   ` Jesper Dangaard Brouer
2020-09-04 15:35     ` [Intel-wired-lan] " Jesper Dangaard Brouer
2020-09-04 15:54     ` Björn Töpel
2020-09-04 15:54       ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-04 13:59 ` [PATCH bpf-next 0/6] xsk: exit NAPI loop when AF_XDP Rx ring " Björn Töpel
2020-09-04 13:59   ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-08 10:32   ` Maxim Mikityanskiy
2020-09-08 10:32     ` [Intel-wired-lan] " Maxim Mikityanskiy
2020-09-08 11:37     ` Magnus Karlsson
2020-09-08 11:37       ` [Intel-wired-lan] " Magnus Karlsson
2020-09-08 12:21       ` Björn Töpel [this message]
2020-09-08 12:21         ` =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-09 15:37     ` Jesper Dangaard Brouer
2020-09-09 15:37       ` [Intel-wired-lan] " Jesper Dangaard Brouer
2020-09-04 14:27 ` Jesper Dangaard Brouer
2020-09-04 14:27   ` [Intel-wired-lan] " Jesper Dangaard Brouer
2020-09-04 14:32   ` Björn Töpel
2020-09-04 14:32     ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-04 23:58     ` Jakub Kicinski
2020-09-04 23:58       ` [Intel-wired-lan] " Jakub Kicinski
2020-09-07 13:37       ` Björn Töpel
2020-09-07 13:37         ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-07 18:40         ` Jakub Kicinski
2020-09-07 18:40           ` [Intel-wired-lan] " Jakub Kicinski
2020-09-08  6:58           ` Björn Töpel
2020-09-08  6:58             ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-08 17:24             ` Jakub Kicinski
2020-09-08 17:24               ` [Intel-wired-lan] " Jakub Kicinski
2020-09-08 18:28               ` Björn Töpel
2020-09-08 18:28                 ` [Intel-wired-lan] " =?unknown-8bit?q?Bj=C3=B6rn_T=C3=B6pel?=
2020-09-08 18:34                 ` Jakub Kicinski
2020-09-08 18:34                   ` [Intel-wired-lan] " Jakub Kicinski

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=75146564-2774-47ac-cc75-40d74bea50d8@intel.com \
    --to=bjorn.topel@intel.com \
    --cc=ast@kernel.org \
    --cc=bjorn.topel@gmail.com \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=davem@davemloft.net \
    --cc=hawk@kernel.org \
    --cc=intel-wired-lan@lists.osuosl.org \
    --cc=john.fastabend@gmail.com \
    --cc=kuba@kernel.org \
    --cc=magnus.karlsson@gmail.com \
    --cc=magnus.karlsson@intel.com \
    --cc=maximmi@mellanox.com \
    --cc=maximmi@nvidia.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.