From: Alexander Lobakin <aleksander.lobakin@intel.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>, Jesper Dangaard Brouer <hawk@kernel.org>, Larysa Zaremba <larysa.zaremba@intel.com>, netdev@vger.kernel.org, Ilias Apalodimas <ilias.apalodimas@linaro.org>, linux-kernel@vger.kernel.org, Christoph Hellwig <hch@lst.de>, Eric Dumazet <edumazet@google.com>, Michal Kubiak <michal.kubiak@intel.com>, intel-wired-lan@lists.osuosl.org, Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>, "David S. Miller" <davem@davemloft.net>, Magnus Karlsson <magnus.karlsson@intel.com>
Subject: Re: [Intel-wired-lan] [PATCH net-next v2 03/12] iavf: optimize Rx buffer allocation a bunch
Date: Tue, 6 Jun 2023 14:47:56 +0200
Message-ID: <5aac6822-6fe5-e182-935e-7aa86f1e820d@intel.com>
In-Reply-To: <CAKgT0UeEz2Gqb62sn0pP3_yBMc-LpR0Twmv5_HTREvHBLpCsNw@mail.gmail.com>

From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Fri, 2 Jun 2023 10:50:02 -0700

Sorry for the silence, had a sorta long weekend :p

> On Fri, Jun 2, 2023 at 9:16 AM Alexander Lobakin
> <aleksander.lobakin@intel.com> wrote:

[...]

>> Ok, maybe I phrased it badly.
>> If we don't stop the loop until the skb is passed up the stack, how can
>> we go out of the loop with an unfinished skb? Previously, I thought lots
>> of drivers do that, as you may exhaust your budget prior to reaching the
>> last fragment, so you'll get back to the skb on the next poll.
>> But if we count 1 skb as a budget unit, not a descriptor, how can we end
>> up breaking the loop prior to finishing the skb? I can imagine only one
>> situation: HW gave us some buffers, but is still processing the EOP
>> buffer, so we don't have any more descriptors to process, but the skb is
>> still unfinished. That sounds weird TBH, I thought HW processes frames
>> "atomically", i.e.
>> it doesn't give you buffers until they hold the whole
>> frame :D
>
> The problem is the frames aren't necessarily written back atomically.
> One big issue is descriptor write back. The hardware will try to cache
> line optimize things in order to improve performance. It is possible
> for a single frame to straddle either side of a cache line. As a
> result the first half may be written back, the driver then processes
> that cache line, and finds the next one isn't populated while the
> hardware is collecting enough descriptors to write back the next one.

Ah okay, that's what I was suspecting. So it's not atomic, and the
skb/xdp_buff is stored on the ring to handle such cases, not budget
exhaustion. Thanks for the detailed explanation. 1 skb = 1 budget unit
feels more logical to me now :D

> It is also one of the reasons why I went to so much effort to prevent
> us from writing to the descriptor ring in the cleanup paths. You never
> know when you might be processing an earlier frame and accidentally
> wander into a section that is in the process of being written. I think
> that is addressed now mostly through the use of completion queues
> instead of the single ring that used to process both work and
> completions.

Completion rings are neat: you totally avoid writing anything to HW on
Rx polling, and vice versa, no descriptor reads on refilling. My
preference is to not refill anything in NAPI and to do that in a
separate workqueue instead, esp. given that most NICs nowadays have a
"refill me please" interrupt. Please don't look at the idpf code; IIRC
from what I've been told, they do it the "old" way and touch both the
receive and refill queues on Rx polling :s :D

>> ice has xdp_buff on the ring for XDP multi-buffer. It's more lightweight
>> than skb, but also carries the frags, since frags are a part of shinfo,
>> not skb.
>> It's totally fine and we'll end up doing the same here, my question was
>> as I explained below.
>
> Okay.
> I haven't looked at ice that closely so I wasn't aware of that.

No prob, just FYI. This moves us one step closer to passing something
more lightweight than an skb up the stack in non-extreme cases, so that
the stack will take care of it when GROing :)

>>> Yep, now the question is how many drivers can be pulled into using
>>> this library. The issue is going to be all the extra features and
>>> workarounds outside of your basic Tx/Rx will complicate the code since
>>> all the drivers implement them a bit differently. One of the reasons
>>> for not consolidating them was to allow for performance optimizing for
>>> each driver. By combining them you are going to likely need to add a
>>> number of new conditional paths to the fast path.
>>
>> When I was counting the number of spots in the Rx polling function that
>> need to have switch-cases/ifs in order to be able to merge the code
>> (e.g. parsing the descriptors), it was something around 4-5 (per
>> packet). So it can only be figured out during testing whether adding
>> new branches actually hurts there.
>
> The other thing is you may want to double check the CPU(s) you are
> expected to support, as last I knew switch statements were still
> expensive due to all the old Spectre/Meltdown workarounds.

Wait, are switch-cases also affected? I wasn't aware of that. For sure I
didn't even consider using ops/indirect calls, but switch-cases... I've
seen people replace indirections with switch-cases lots of times, what's
the point otherwise :D

Thanks,
Olek