From: Alexander Lobakin <aleksander.lobakin@intel.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Larysa Zaremba <larysa.zaremba@intel.com>,
	netdev@vger.kernel.org,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	linux-kernel@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Eric Dumazet <edumazet@google.com>,
	Michal Kubiak <michal.kubiak@intel.com>,
	intel-wired-lan@lists.osuosl.org,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"David S. Miller" <davem@davemloft.net>,
	Magnus Karlsson <magnus.karlsson@intel.com>
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch
Date: Tue, 6 Jun 2023 14:47:56 +0200	[thread overview]
Message-ID: <5aac6822-6fe5-e182-935e-7aa86f1e820d@intel.com> (raw)
In-Reply-To: <CAKgT0UeEz2Gqb62sn0pP3_yBMc-LpR0Twmv5_HTREvHBLpCsNw@mail.gmail.com>

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Fri, 2 Jun 2023 10:50:02 -0700

Sorry for the silence, had sorta long weekend :p

> On Fri, Jun 2, 2023 at 9:16 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:

[...]

>> Ok, maybe I phrased it badly.
>> If we don't stop the loop until the skb is passed up the stack, how can
>> we go out of the loop with an unfinished skb? Previously, I thought lots
>> of drivers do that, as you may exhaust your budget prior to reaching the
>> last fragment, so you'd get back to the skb on the next poll.
>> But if we count 1 skb as a budget unit, not a descriptor, how can we end
>> up breaking the loop prior to finishing the skb? I can imagine only one
>> situation: HW gave us some buffers, but is still processing the EOP
>> buffer, so we don't have any more descriptors to process, but the skb
>> is still unfinished. That sounds weird TBH, I thought HW processes
>> frames "atomically", i.e. it doesn't give you buffers until they hold
>> the whole frame :D
> 
> The problem is the frames aren't necessarily written back atomically.
> One big issue is descriptor write back. The hardware will try to cache
> line optimize things in order to improve performance. It is possible
> for a single frame to straddle either side of a cache line. As a
> result the first half may be written back, the driver then processes
> that cache line, and finds the next one isn't populated while the
> hardware is collecting enough descriptors to write back the next one.

Ah okay, that's what I was suspecting. So it's not atomic, and the
skb/xdp_buff is stored on the ring to handle such cases, not budget
exhaustion.
Thanks for the detailed explanation. 1 skb = 1 budget unit feels more
logical and optimal to me now :D
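To make this concrete, here's a minimal model of the scheme described above: budget is counted in frames, not descriptors, and an in-progress frame stays stashed on the ring when HW hasn't written back its EOP descriptor yet. All names and the flat `status` word are invented for illustration; real iavf descriptors look nothing like this.

```c
#define DESC_DD  0x1U /* descriptor done: HW has written it back */
#define DESC_EOP 0x2U /* end of packet: last buffer of the frame */

struct rx_desc { unsigned int status; };

struct rx_ring {
	struct rx_desc *desc;
	unsigned int next_to_clean;
	unsigned int count;
	int frags_in_progress; /* stand-in for the stashed skb/xdp_buff */
};

/* Returns the number of *complete* frames processed. Stops when the
 * frame budget is exhausted or when HW hasn't written back the next
 * descriptor yet -- which, per the write-back discussion above, can
 * happen mid-frame; the partial state simply survives until next poll.
 */
static int rx_poll(struct rx_ring *ring, int budget)
{
	int frames = 0;

	while (frames < budget) {
		struct rx_desc *d = &ring->desc[ring->next_to_clean];

		if (!(d->status & DESC_DD))
			break; /* partial frame stays stashed on the ring */

		ring->next_to_clean = (ring->next_to_clean + 1) % ring->count;
		ring->frags_in_progress++; /* collect this buffer as a frag */

		if (d->status & DESC_EOP) {
			/* frame complete: "pass it up the stack" */
			ring->frags_in_progress = 0;
			frames++;
		}
	}

	return frames;
}
```

Note that a huge descriptor budget still yields only one completed frame below: the second frame's EOP descriptor isn't done yet, so the poll exits with a frag stashed, and the next poll finishes the frame.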

> 
> It is also one of the reasons why I went to so much effort to prevent
> us from writing to the descriptor ring in the cleanup paths. You never
> know when you might be processing an earlier frame and accidentally
> wander into a section that is in the process of being written. I think
> that is addressed now mostly through the use of completion queues
> instead of the single ring that used to process both work and
> completions.

Completion rings are neat: you completely avoid writing anything to HW
on Rx polling and, vice versa, there's no descriptor read on refill. My
preference is to not refill anything in NAPI and to use a separate
workqueue for that, esp. given that most NICs nowadays have a "refill
me please" interrupt.
Please don't look at the idpf code; IIRC from what I've been told, they
do it the "old" way and touch both the receive and refill queues on Rx
polling :s :D
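A toy sketch of that separation (purely illustrative names, not idpf's or any real driver's API): the Rx path only consumes the completion queue HW writes to, while refill posts to a distinct buffer queue and can therefore run outside NAPI, e.g. from a workqueue.

```c
struct bufq   { unsigned int tail; };            /* CPU posts buffers  */
struct complq { unsigned int head, hw_done; };   /* HW posts completions */

/* Rx poll path: a read-only walk of HW-written completions.
 * No doorbell/register writes and no buffer posting happen here.
 */
static int complq_clean(struct complq *cq, int budget)
{
	int done = 0;

	while (done < budget && cq->head != cq->hw_done) {
		cq->head++; /* consume one completion */
		done++;
	}

	return done;
}

/* Refill path: touches only the buffer queue, so it never races the
 * Rx poll for a shared ring; post n fresh buffers, then ring the
 * doorbell once. This is what could run from a workqueue, kicked by
 * a "refill me please" interrupt.
 */
static void bufq_refill(struct bufq *bq, unsigned int n)
{
	bq->tail += n;
}
```

The design point is that each side owns its ring exclusively, which is exactly what kills the "wander into a section being written" hazard of the old single-ring model.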

>> ice has an xdp_buff on the ring for XDP multi-buffer. It's more
>> lightweight than an skb, but also carries the frags, since the frags
>> are part of shinfo, not the skb.
>> It's totally fine and we'll end up doing the same here; my question
>> was as I explained below.
> 
> Okay. I haven't looked at ice that closely so I wasn't aware of that.

No prob, just FYI. This moves us one step closer to passing something
more lightweight than an skb up the stack in non-extreme cases, so that
the stack will take care of it when GROing :)

>>> Yep, now the question is how many drivers can be pulled into using
>>> this library. The issue is going to be all the extra features and
>>> workarounds outside of your basic Tx/Rx will complicate the code since
>>> all the drivers implement them a bit differently. One of the reasons
>>> for not consolidating them was to allow for performance optimizing for
>>> each driver. By combining them you are going to likely need to add a
>>> number of new conditional paths to the fast path.
>>
>> When I counted the spots in the Rx polling function that need
>> switch-cases/ifs in order to be able to merge the code (e.g. parsing
>> the descriptors), it was around 4-5 (per packet). So only testing can
>> show whether adding the new branches actually hurts there.
> 
> The other thing is you may want to double check CPU(s) you are
> expected to support as last I knew switch statements were still
> expensive due to all the old spectre/meltdown workarounds.

Wait, are switch-cases also affected? I wasn't aware of that. For sure I
didn't even consider using ops/indirect calls, but switch-cases... I've
seen people replace indirections with switch-cases lots of times, what's
the point otherwise :D
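For context, the usual argument for switch-cases over indirect calls: with retpolines enabled, every indirect call goes through a return trampoline, while a switch over a small dense enum can compile to plain conditional branches and direct calls the compiler may even inline (kernels built with retpolines also pass -fno-jump-tables so the switch itself doesn't become an indirect jump). A hypothetical sketch, with all names made up:

```c
enum desc_fmt { DESC_FMT_LEGACY, DESC_FMT_FLEX };

/* Two hypothetical descriptor parsers for two descriptor layouts */
static int parse_legacy(int raw) { return raw & 0xffff; }
static int parse_flex(int raw)   { return raw >> 16; }

/* Indirect version: calling through this pointer costs a retpoline
 * trampoline per packet on mitigated CPUs.
 */
typedef int (*parse_fn)(int raw);

/* Switch version: direct branches; cheap as long as the predictor
 * does its job and the cases stay few and dense.
 */
static int parse_desc(enum desc_fmt fmt, int raw)
{
	switch (fmt) {
	case DESC_FMT_LEGACY:
		return parse_legacy(raw);
	case DESC_FMT_FLEX:
		return parse_flex(raw);
	}

	return 0;
}
```

Whether the handful of extra conditionals in a merged hot path actually hurts is exactly the "can only be figured out during testing" question above.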

Thanks,
Olek
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
