Re: [PATCH bpf-next 00/10] xsk: stop softirq processing on full XSK Rx queue

From: Magnus Karlsson <magnus.karlsson@gmail.com>
To: Maxim Mikityanskiy <maximmi@nvidia.com>
Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>,
	bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	magnus.karlsson@intel.com, bjorn@kernel.org,
	netdev@vger.kernel.org, brouer@redhat.com,
	alexandr.lobakin@intel.com, Tariq Toukan <tariqt@nvidia.com>,
	Saeed Mahameed <saeedm@nvidia.com>
Subject: Re: [PATCH bpf-next 00/10] xsk: stop softirq processing on full XSK Rx queue
Date: Wed, 13 Apr 2022 17:12:09 +0200	[thread overview]
Message-ID: <CAJ8uoz3Qwa7M1NtN0x376gsejSd6eVZ0sHYjQbBZtNaG0tOJcg@mail.gmail.com> (raw)
In-Reply-To: <5864171b-1e08-1b51-026e-5f09b8945076@nvidia.com>

On Wed, Apr 13, 2022 at 1:40 PM Maxim Mikityanskiy <maximmi@nvidia.com> wrote:
>
> On 2022-04-11 18:35, Maciej Fijalkowski wrote:
> > On Fri, Apr 08, 2022 at 03:48:44PM +0300, Maxim Mikityanskiy wrote:
> >> On 2022-04-08 12:08, Maciej Fijalkowski wrote:
> >>> On Thu, Apr 07, 2022 at 01:49:02PM +0300, Maxim Mikityanskiy wrote:
> >>>> On 2022-04-05 14:06, Maciej Fijalkowski wrote:
> >>>>> Hi!
> >>>>>
> >>>>> This is a revival of Bjorn's idea [0] to break NAPI loop when XSK Rx
> >>>>> queue gets full which in turn makes it impossible to keep on
> >>>>> successfully producing descriptors to XSK Rx ring. By breaking out of
> >>>>> the driver side immediately we will give the user space opportunity for
> >>>>> consuming descriptors from XSK Rx ring and therefore provide room in the
> >>>>> ring so that HW Rx -> XSK Rx redirection can be done.
> >>>>>
> >>>>> Maxim asked and Jesper agreed on simplifying Bjorn's original API used
> >>>>> for detecting the event of interest, so let's just simply check for
> >>>>> -ENOBUFS within Intel's ZC drivers after an attempt to redirect a buffer
> >>>>> to XSK Rx. No real need for redirect API extension.
> >>>>
> >>>
> >>> Hey Maxim!
> >>>
> >>>> I believe some of the other comments under the old series [0] might be still
> >>>> relevant:
> >>>>
> >>>> 1. need_wakeup behavior. If need_wakeup is disabled, the expected behavior
> >>>> is busy-looping in NAPI, you shouldn't break out early, as the application
> >>>> does not restart NAPI, and the driver restarts it itself, leading to a less
> >>>> efficient loop. If need_wakeup is enabled, it should be set on ENOBUFS - I
> >>>> believe this is the case here, right?
> >>>
> >>> Good point. We currently set need_wakeup flag for -ENOBUFS case as it is
> >>> being done for failure == true. You are right that we shouldn't be
> >>> breaking the loop on -ENOBUFS if need_wakeup flag is not set on xsk_pool,
> >>> will fix!
> >>>
> >>>>
> >>>> 2. 50/50 AF_XDP and XDP_TX mix usecase. By breaking out early, you prevent
> >>>> further packets from being XDP_TXed, leading to unnecessary latency
> >>>> increase. The new feature should be opt-in, otherwise such usecases suffer.
> >>>
> >>> Anyone performing a lot of XDP_TX (or XDP_PASS, etc) should be using the
> >>> regular copy-mode driver, while the zero-copy driver should be used when most
> >>> packets are sent up to user-space.
> >>
> >> You generalized that easily, but how can you be so sure that all mixed use
> >> cases can live with the much slower copy mode? Also, how do you apply your
> >> rule of thumb to the 75/25 AF_XDP/XDP_TX use case? It's both "a lot of
> >> XDP_TX" and "most packets are sent up to user-space" at the same time.
> >
> > We are not aware of a single user that has this use case.
>
> Doesn't mean there aren't any ;)
>
> Back in the original series, Björn said it was a valid use case:
>
>  > I'm leaning towards a more explicit opt-in like Max suggested. As Max
>  > pointed out, AF_XDP/XDP_TX is actually a non-insane(!) way of using
>  > AF_XDP zero-copy, which will suffer from the early exit.
>
> https://lore.kernel.org/bpf/75146564-2774-47ac-cc75-40d74bea50d8@intel.com/
>
> What has changed since then?

We now have real use cases and have become wiser ;-).

> In any case, it's a significant behavior change that breaks backward
> compatibility and affects the mentioned use case. Such changes should go
> as opt-in: we have need_wakeup and unaligned chunks as opt-in features,
> so I don't see any reason to introduce a breaking change this time.

In my opinion, the existing flags you mentioned are different. The
unaligned flag changes the format of the descriptor and the
need_wakeup flag can stop the driver so user space needs to explicitly
wake it up. These options really change the uapi and not refactoring
the application for this would really break it, so an opt-in was a
must. What Maciej is suggesting is about changing the performance for
a case that I have never seen (does not mean it does not exist though
because I do not know all users of course, but it is at least
uncommon). Do we want to put an opt-in for every performance change we
commit. I say no. But at some point a performance change might of
course be so large that it is warranted. It should also be for a use
case that exists, or at least we believe exists. I do not think that
is the case for this one. But if someone out there really has this use
case, please let me know and I will be happy to change my opinion.

> > What we do know
> > is that we have a lot of users that care about zero packet loss
> > performance when redirecting to an AF_XDP socket when using the zero-copy
> > driver. And this work addresses one of the corner cases and therefore
> > makes ZC driver better overall. So we say, focus on the cases people care
> > about. BTW, we do have users using mainly XDP_TX, XDP_PASS and
> > XDP_REDIRECT, but they are all using the regular driver for a good reason.
> > So we should not destroy those latencies as you mention.
> >
> >>
> >> At the moment, the application is free to decide whether it wants zerocopy
> >> XDP_TX or zerocopy AF_XDP, depending on its needs. After your patchset the
> >> only valid XDP verdict on zerocopy AF_XDP basically becomes "XDP_REDIRECT to
> >> XSKMAP". I don't think it's valid to break an entire feature to speed up
> >> some very specific use case.
> >
> > We disagree that it 'breaks an entire feature' - it is about hardening the
> > driver when user did not come up with an optimal combo of ring sizes vs
> > busy poll budget. Driver needs to be able to handle such cases in a
> > reasonable way.
>
> XDP_TX is a valid verdict on an XSK RX queue, and stopping XDP_TX
> processing for an indefinite time sounds to me as breaking the feature.
>   Improving performance doesn't justify breaking other stuff. It's OK to
> do so if the application explicitly acks that it doesn't care about
> XDP_TX, or (arguably) if it was the behavior from day one.
>
> > What is more, (at least Intel) zero-copy drivers are
> > written in a way that XDP_REDIRECT to XSKMAP is the most probable verdict
> > that can come out of XDP program. However, other actions are of course
> > supported, so with your arguments, you could even say that we currently
> > treat redir as 'only valid' action, which is not true.
>
> I did not say that, please don't attribute your speculations to me. One
> thing is optimizing for the most likely use case, the other is breaking
> unlikely use cases to improve the likely ones.
>
> > Just note that
> > Intel's regular drivers (copy-mode AF_XDP socket) are optimized for
> > XDP_PASS (as that is the default without an XDP program in place) as that
> > is the most probable verdict for that driver.
> >
> >>
> >> Moreover, in early days of AF_XDP there was an attempt to implement zerocopy
> >> XDP_TX on AF_XDP queues, meaning both XDP_TX and AF_XDP could be zerocopy.
> >> The implementation suffered from possible overflows in driver queues, thus
> >> wasn't upstreamed, but it's still a valid idea that potentially could be
> >> done if overflows are worked around somehow.
> >>
> >
> > That feature would be good to have, but it has not been worked on and
> > might never be worked on since there seems to be little interest in XDP_TX
> > for the zero-copy driver. This also makes your argument about disregarding
> > XDP_TX a bit exaggerated. As we stated before, we have not seen a use case
> > from a real user for this.
> >
> >>> For the zero-copy driver, this opt in is not
> >>> necessary. But it sounds like a valid option for copy mode, though could we
> >>> think about the proper way as a follow up to this work?
> >>
> >> My opinion is that the knob has to be part of initial submission, and the
> >> new feature should be disabled by default, otherwise we have huge issues
> >> with backward compatibility (if we delay it, the next update changes the
> >> behavior, breaking some existing use cases, and there is no way to work
> >> around it).
> >>
> >
> > We would not like to introduce knobs for every
> > feature/optimization/trade-off we could think of unless we really have to.
> > Let us keep it simple. Zero-copy is optimized for XDP_REDIRECT to an
> > AF_XDP socket. The regular, copy-mode driver is optimized for the case
> > when the packet is consumed by the kernel stack or XDP. That means that we
> > should not introduce this optimization for the regular driver, as you say,
> > but it is fine to do it for the zero-copy driver without a knob. If we
> > ever introduce this for the regular driver, yes, we would need a knob.
> >
> >>>>
> >>>> 3. When the driver receives ENOBUFS, it has to drop the packet before
> >>>> returning to the application. It would be better experience if your feature
> >>>> saved all N packets from being dropped, not just N-1.
> >>>
> >>> Sure, I'll re-run tests and see if we can omit freeing the current
> >>> xdp_buff and ntc bump, so that we would come back later on to the same
> >>> entry.
> >>>
> >>>>
> >>>> 4. A slow or malicious AF_XDP application may easily cause an overflow of
> >>>> the hardware receive ring. Your feature introduces a mechanism to pause the
> >>>> driver while the congestion is on the application side, but no symmetric
> >>>> mechanism to pause the application when the driver is close to an overflow.
> >>>> I don't know the behavior of Intel NICs on overflow, but in our NICs it's
> >>>> considered a critical error, that is followed by a recovery procedure, so
> >>>> it's not something that should happen under normal workloads.
> >>>
> >>> I'm not sure I follow on this one. Feature is about overflowing the XSK
> >>> receive ring, not the HW one, right?
> >>
> >> Right. So we have this pipeline of buffers:
> >>
> >> NIC--> [HW RX ring] --NAPI--> [XSK RX ring] --app--> consumes packets
> >>
> >> Currently, when the NIC puts stuff in HW RX ring, NAPI always runs and
> >> drains it either to XSK RX ring or to /dev/null if XSK RX ring is full. The
> >> driver fulfills its responsibility to prevent overflows of HW RX ring. If
> >> the application doesn't consume quick enough, the frames will be leaked, but
> >> it's only the application's issue, the driver stays consistent.
> >>
> >> After the feature, it's possible to pause NAPI from the userspace
> >> application, effectively disrupting the driver's consistency. I don't think
> >> an XSK application should have this power.
> >
> > It already has this power when using an AF_XDP socket. Nothing new. Some
> > examples, when using busy-poll together with gro_flush_timeout and
> > napi_defer_hard_irqs you already have this power. Another example is not
> > feeding buffers into the fill ring from the application side in zero-copy
> > mode. Also, application does not have to be "slow" in order to cause the
> > XSK Rx queue overflow. It can be the matter of not optimal budget choice
> > when compared to ring lengths, as stated above.
>
> (*)
>
> > Besides that, you are right, in copy-mode (without busy-poll), let us not
> > introduce this as this would provide the application with this power when
> > it does not have it today.
> >
> >>
> >>> Driver picks entries from fill ring
> >>> that were produced by app, so if app is slow on producing those I believe
> >>> this would be rather an underflow of ring, we would simply receive less
> >>> frames. For HW Rx ring actually being full, I think that HW would be
> >>> dropping the incoming frames, so I don't see the real reason to treat this
> >>> as critical error that needs to go through recovery.
> >>
> >> I'll double check regarding the hardware behavior, but it is what it is. If
> >> an overflow moves the queue to the fault state and requires a recovery,
> >> there is nothing I can do about that.
>
> I double-checked that, and it seems there is no problem I indicated in
> point 4. In the mlx5e case, if NAPI is delayed, there will be lack of RX
> WQEs, and hardware will start dropping packets, and it's an absolutely
> regular situation. Your arguments above (*) are valid.
>
> >> A few more thoughts I just had: mlx5e shares the same NAPI instance to serve
> >> all queues in a channel, that includes the XSK RQ and the regular RQ. The
> >> regular and XSK traffic can be configured to be isolated to different
> >> channels, or they may co-exist on the same channel. If they co-exist, and
> >> XSK asks to pause NAPI, the regular traffic will still run NAPI and drop 1
> >> XSK packet per NAPI cycle, unless point 3 is fixed. It can also be
> >
> > XSK does not pause the whole NAPI cycle, your HW XSK RQ just stops with
> > doing redirects to AF_XDP XSK RQ. Your regular RQ is not affected in any
> > way. Finally point 3 needs to be fixed.
> >
> > FWIW we also support having a channel/queue vector carrying more than one
> > RQ that is associated with particular NAPI instance.
> >
> >> reproduced if NAPI is woken up by XSK TX. Besides, (correct me if I'm wrong)
> >> your current implementation introduces extra latency to XSK TX if XSK RX
> >> asked to pause NAPI, because NAPI will be restarted anyway (by TX wakeup),
> >> and it could have been rescheduled by the kernel.
> >
> > Again, we don't pause NAPI. Tx and Rx processing are separate.
> >
> > As for Intel drivers, Tx is processed first. So even if we break at Rx due
> > to -ENOBUFS from xdp_do_redirect(), Tx work has already been done.
> >
> > To reiterate, we agreed on fixing point 1 and 3 from your original mail.
> > Valid and good points.
>
> Great that we agreed on 1 and 3! Point 4 can be dropped. For point 2, I
> wrote my thoughts above.
>
> >>
> >>> Am I missing something? Maybe I have just misunderstood you.
> >>>
> >>>>
> >>>>> One might ask why it is still relevant even after having proper busy
> >>>>> poll support in place - here is the justification.
> >>>>>
> >>>>> For xdpsock that was:
> >>>>> - run for l2fwd scenario,
> >>>>> - app/driver processing took place on the same core in busy poll
> >>>>>      with 2048 budget,
> >>>>> - HW ring sizes Tx 256, Rx 2048,
> >>>>>
> >>>>> this work improved throughput by 78% and reduced Rx queue full statistic
> >>>>> bump by 99%.
> >>>>>
> >>>>> For testing ice, make sure that you have [1] present on your side.
> >>>>>
> >>>>> This set, besides the work described above, also carries also
> >>>>> improvements around return codes in various XSK paths and lastly a minor
> >>>>> optimization for xskq_cons_has_entries(), a helper that might be used
> >>>>> when XSK Rx batching would make it to the kernel.
> >>>>
> >>>> Regarding error codes, I would like them to be consistent across all
> >>>> drivers, otherwise all the debuggability improvements are not useful enough.
> >>>> Your series only changed Intel drivers. Here also applies the backward
> >>>> compatibility concern: the same error codes than were in use have been
> >>>> repurposed, which may confuse some of existing applications.
> >>>
> >>> I'll double check if ZC drivers are doing something unusual with return
> >>> values from xdp_do_redirect(). Regarding backward comp, I suppose you
> >>> refer only to changes in ndo_xsk_wakeup() callbacks as others are not
> >>> exposed to user space? They're not crucial to me, but it improved my
> >>> debugging experience.
> >>
> >> Sorry if I wasn't clear enough. Yes, I meant the wakeup error codes. We
> >> aren't doing anything unusual with xdp_do_redirect codes (can't say for
> >> other drivers, though).
> >>
> >> Last time I wanted to improve error codes returned from some BPF helpers
> >> (make the errors more distinguishable), my patch was blocked because of
> >> backward compatibility concerns. To be on the safe side (i.e. to avoid
> >> further bug reports from someone who actually relied on specific codes), you
> >> might want to use a new error code, rather than repurposing the existing
> >> ones.
> >>
> >> I personally don't have objections about changing the error codes the way
> >> you did if you keep them consistent across all drivers, not only Intel ones.
> >
> > Got your point. So I'll either drop the patches or look through
> > ndo_xsk_wakeup() implementations and try to standardize the ret codes.
> > Hope this sounds fine.
>
> Yes, either way sounds absolutely fine to me.
>
> >
> > MF
>