From: Alexander Lobakin <aleksander.lobakin@intel.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	Larysa Zaremba <larysa.zaremba@intel.com>,
	netdev@vger.kernel.org,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	linux-kernel@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Eric Dumazet <edumazet@google.com>,
	Michal Kubiak <michal.kubiak@intel.com>,
	intel-wired-lan@lists.osuosl.org,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	"David S. Miller" <davem@davemloft.net>,
	Magnus Karlsson <magnus.karlsson@intel.com>
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch
Date: Tue, 6 Jun 2023 14:47:56 +0200	[thread overview]
Message-ID: <5aac6822-6fe5-e182-935e-7aa86f1e820d@intel.com> (raw)
In-Reply-To: <CAKgT0UeEz2Gqb62sn0pP3_yBMc-LpR0Twmv5_HTREvHBLpCsNw@mail.gmail.com>

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Fri, 2 Jun 2023 10:50:02 -0700

Sorry for the silence, had sorta long weekend :p

> On Fri, Jun 2, 2023 at 9:16 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:

[...]

>> Ok, maybe I phrased it badly.
>> If we don't stop the loop until the skb is passed up the stack, how can
>> we go out of the loop with an unfinished skb? Previously, I thought lots
>> of drivers do that, as you may exhaust your budget prior to reaching the
>> last fragment, so you'd get back to the skb on the next poll.
>> But if we count 1 skb as a budget unit, not a descriptor, how can we end
>> up breaking the loop prior to finishing the skb? I can imagine only one
>> situation: HW gave us some buffers, but is still processing the EOP
>> buffer, so we don't have any more descriptors to process, but the skb
>> is still unfinished. That sounds weird TBH, I thought HW processes
>> frames "atomically", i.e. it doesn't give you buffers until they hold
>> the whole frame :D
> 
> The problem is the frames aren't necessarily written back atomically.
> One big issue is descriptor write back. The hardware will try to cache
> line optimize things in order to improve performance. It is possible
> for a single frame to straddle either side of a cache line. As a
> result the first half may be written back, the driver then processes
> that cache line, and finds the next one isn't populated while the
> hardware is collecting enough descriptors to write back the next one.

Ah okay, that's what I was suspecting. So it's not atomic, and the
skb/xdp_buff is stored on the ring to handle such cases, not budget
exhaustion.
Thanks for the detailed explanation. 1 skb = 1 budget unit feels more
logical and optimal to me now :D
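To make this concrete, here's a minimal model of the scheme described above: budget is counted in frames, not descriptors, and an in-progress frame stays stashed on the ring when HW hasn't written back its EOP descriptor yet. All names and the flat `status` word are invented for illustration; real iavf descriptors look nothing like this.

```c
#define DESC_DD  0x1U /* descriptor done: HW has written it back */
#define DESC_EOP 0x2U /* end of packet: last buffer of the frame */

struct rx_desc { unsigned int status; };

struct rx_ring {
	struct rx_desc *desc;
	unsigned int next_to_clean;
	unsigned int count;
	int frags_in_progress; /* stand-in for the stashed skb/xdp_buff */
};

/* Returns the number of *complete* frames processed. Stops when the
 * frame budget is exhausted or when HW hasn't written back the next
 * descriptor yet -- which, per the write-back discussion above, can
 * happen mid-frame; the partial state simply survives until next poll.
 */
static int rx_poll(struct rx_ring *ring, int budget)
{
	int frames = 0;

	while (frames < budget) {
		struct rx_desc *d = &ring->desc[ring->next_to_clean];

		if (!(d->status & DESC_DD))
			break; /* partial frame stays stashed on the ring */

		ring->next_to_clean = (ring->next_to_clean + 1) % ring->count;
		ring->frags_in_progress++; /* collect this buffer as a frag */

		if (d->status & DESC_EOP) {
			/* frame complete: "pass it up the stack" */
			ring->frags_in_progress = 0;
			frames++;
		}
	}

	return frames;
}
```

Note that a huge descriptor budget still yields only one completed frame below: the second frame's EOP descriptor isn't done yet, so the poll exits with a frag stashed, and the next poll finishes the frame.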

> 
> It is also one of the reasons why I went to so much effort to prevent
> us from writing to the descriptor ring in the cleanup paths. You never
> know when you might be processing an earlier frame and accidentally
> wander into a section that is in the process of being written. I think
> that is addressed now mostly through the use of completion queues
> instead of the single ring that used to process both work and
> completions.

Completion rings are neat: you completely avoid writing anything to HW
on Rx polling and, vice versa, there's no descriptor read on refill. My
preference is to not refill anything in NAPI and to use a separate
workqueue for that, esp. given that most NICs nowadays have a "refill
me please" interrupt.
Please don't look at the idpf code; IIRC from what I've been told, they
do it the "old" way and touch both the receive and refill queues on Rx
polling :s :D
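A toy sketch of that separation (purely illustrative names, not idpf's or any real driver's API): the Rx path only consumes the completion queue HW writes to, while refill posts to a distinct buffer queue and can therefore run outside NAPI, e.g. from a workqueue.

```c
struct bufq   { unsigned int tail; };            /* CPU posts buffers  */
struct complq { unsigned int head, hw_done; };   /* HW posts completions */

/* Rx poll path: a read-only walk of HW-written completions.
 * No doorbell/register writes and no buffer posting happen here.
 */
static int complq_clean(struct complq *cq, int budget)
{
	int done = 0;

	while (done < budget && cq->head != cq->hw_done) {
		cq->head++; /* consume one completion */
		done++;
	}

	return done;
}

/* Refill path: touches only the buffer queue, so it never races the
 * Rx poll for a shared ring; post n fresh buffers, then ring the
 * doorbell once. This is what could run from a workqueue, kicked by
 * a "refill me please" interrupt.
 */
static void bufq_refill(struct bufq *bq, unsigned int n)
{
	bq->tail += n;
}
```

The design point is that each side owns its ring exclusively, which is exactly what kills the "wander into a section being written" hazard of the old single-ring model.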

>> ice has an xdp_buff on the ring for XDP multi-buffer. It's more
>> lightweight than an skb, but also carries the frags, since the frags
>> are part of shinfo, not the skb.
>> It's totally fine and we'll end up doing the same here; my question
>> was as I explained below.
> 
> Okay. I haven't looked at ice that closely so I wasn't aware of that.

No prob, just FYI. This moves us one step closer to passing something
more lightweight than an skb up the stack in non-extreme cases, so that
the stack will take care of it when GROing :)

>>> Yep, now the question is how many drivers can be pulled into using
>>> this library. The issue is going to be all the extra features and
>>> workarounds outside of your basic Tx/Rx will complicate the code since
>>> all the drivers implement them a bit differently. One of the reasons
>>> for not consolidating them was to allow for performance optimizing for
>>> each driver. By combining them you are going to likely need to add a
>>> number of new conditional paths to the fast path.
>>
>> When I counted the spots in the Rx polling function that need
>> switch-cases/ifs in order to be able to merge the code (e.g. parsing
>> the descriptors), it was around 4-5 (per packet). So only testing can
>> show whether adding the new branches actually hurts there.
> 
> The other thing is you may want to double check CPU(s) you are
> expected to support as last I knew switch statements were still
> expensive due to all the old spectre/meltdown workarounds.

Wait, are switch-cases also affected? I wasn't aware of that. For sure I
didn't even consider using ops/indirect calls, but switch-cases... I've
seen people replace indirections with switch-cases lots of times, what's
the point otherwise :D
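For context, the usual argument for switch-cases over indirect calls: with retpolines enabled, every indirect call goes through a return trampoline, while a switch over a small dense enum can compile to plain conditional branches and direct calls the compiler may even inline (kernels built with retpolines also pass -fno-jump-tables so the switch itself doesn't become an indirect jump). A hypothetical sketch, with all names made up:

```c
enum desc_fmt { DESC_FMT_LEGACY, DESC_FMT_FLEX };

/* Two hypothetical descriptor parsers for two descriptor layouts */
static int parse_legacy(int raw) { return raw & 0xffff; }
static int parse_flex(int raw)   { return raw >> 16; }

/* Indirect version: calling through this pointer costs a retpoline
 * trampoline per packet on mitigated CPUs.
 */
typedef int (*parse_fn)(int raw);

/* Switch version: direct branches; cheap as long as the predictor
 * does its job and the cases stay few and dense.
 */
static int parse_desc(enum desc_fmt fmt, int raw)
{
	switch (fmt) {
	case DESC_FMT_LEGACY:
		return parse_legacy(raw);
	case DESC_FMT_FLEX:
		return parse_flex(raw);
	}

	return 0;
}
```

Whether the handful of extra conditionals in a merged hot path actually hurts is exactly the "can only be figured out during testing" question above.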

Thanks,
Olek
_______________________________________________
Intel-wired-lan mailing list
Intel-wired-lan@osuosl.org
https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
