From: Jesper Dangaard Brouer <jbrouer@redhat.com>
To: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Cc: "John Fastabend" <john.fastabend@gmail.com>, "Daniel Borkmann" <daniel@iogearbox.net>, "Toke Høiland-Jørgensen" <toke@redhat.com>, alardam@gmail.com, magnus.karlsson@intel.com, bjorn.topel@intel.com, andrii.nakryiko@gmail.com, kuba@kernel.org, ast@kernel.org, netdev@vger.kernel.org, davem@davemloft.net, hawk@kernel.org, jonathan.lemon@gmail.com, bpf@vger.kernel.org, jeffrey.t.kirsher@intel.com, maciejromanfijalkowski@gmail.com, intel-wired-lan@lists.osuosl.org, "Marek Majtyka" <marekx.majtyka@intel.com>
Subject: Re: [PATCH v2 bpf 1/5] net: ethtool: add xdp properties flag set
Date: Wed, 9 Dec 2020 12:52:23 +0100
Message-ID: <20201209125223.49096d50@carbon>
In-Reply-To: <20201209095454.GA36812@ranger.igk.intel.com>

On Wed, 9 Dec 2020 10:54:54 +0100
Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote:

> On Tue, Dec 08, 2020 at 10:03:51PM -0800, John Fastabend wrote:
> > > On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote:
> > > > Jesper Dangaard Brouer wrote:
> > > > > On Fri, 4 Dec 2020 16:21:08 +0100
> > > > > Daniel Borkmann <daniel@iogearbox.net> wrote:
> > > > [...] pruning the thread to answer Jesper.
>
> I think you meant me, but thanks anyway for responding :)

I was about to say that ;-)

> > > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue
> > > > > resources, as the use-case is only DDoS. Today we have this problem
> > > > > with the ixgbe hardware, that cannot load XDP programs on systems with
> > > > > more than 192 CPUs.
> > > >
> > > > The ixgbe issues is just a bug or missing-feature in my opinion.
> > >
> > > Not a bug, rather HW limitation?
> >
> > Well hardware has some max queue limit. Likely <192 otherwise I would
> > have kept doing queue per core on up to 192. But, ideally we should
>
> Data sheet states its 128 Tx qs for ixgbe.
I likely remember wrong, maybe it was only ~96 CPUs. I do remember that
some TX queues were reserved for something else, and QA reported issues
(as I don't have this high-end system myself).

> > still load and either share queues across multiple cores or restrict
> > down to a subset of CPUs.
>
> And that's the missing piece of logic, I suppose.
>
> > Do you need 192 cores for a 10gbps nic, probably not.
>
> Let's hear from Jesper :p

LOL - of course you don't need 192 cores.

With XDP I will claim that you only need 2 cores (with high GHz) to
forward 10gbps wirespeed small packets. The point is that this only
works when we avoid atomic lock operations per packet and bulk the NIC
PCIe tail/doorbell updates.

It was actually John's invention/design to have a dedicated TX queue per
core, to avoid the atomic lock operation per packet when queuing packets
to the NIC.

10G @ 64B gives a budget of 67.2 ns (241 cycles @ 3.60GHz).

Atomic lock operations cost: [1]
 - Type:spin_lock_unlock         Per elem: 34 cycles(tsc) 9.485 ns
 - Type:spin_lock_unlock_irqsave Per elem: 61 cycles(tsc) 17.125 ns
(and atomics can also hurt instructions per cycle)

But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
(up to 16), so it should not be a problem to solve this by sharing a
TX-queue and taking a lock per 16 packets.

I still recommend that, for the fallback case, you allocate a number of
TX-queues and distribute these across CPUs, to avoid hitting a congested
lock (the above measurements are for the optimal, non-congested atomic
lock operation).

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

> > Yes, it requires some extra care, but should be doable
> > if someone cares enough. I gather current limitation/bug is because
> > no one has that configuration and/or has complained loud enough.
>
> I would say we're safe for queue per core approach for newer devices where
> we have thousands of queues to play with. Older devices combined with big
> cpu count can cause us some problems.
>
> Wondering if drivers could have a problem when user would do something
> weird as limiting the queue count to a lower value than cpu count and then
> changing the irq affinity?

Not sure what you mean. But on the XDP RX-side we use the softirq NAPI
guarantee to guard against concurrent access to our (per-CPU) data
structures.

>
> > > >
> > > > I think we just document that XDP_TX consumes resources and if users
> > > > care they shouldn't use XDP_TX in programs and in that case hardware
> > > > should via program discovery not allocate the resource. This seems
> > > > cleaner in my opinion than more bits for features.
> > >
> > > But what if I'm with some limited HW that actually has support for XDP
> > > and I would like to utilize XDP_TX?
> > >
> > > Not all drivers that support XDP consume Tx resources. Recently igb got
> > > support and it shares Tx queues between netstack and XDP.
> >
> > Makes sense to me.
> >
> > > I feel like we should have a sort-of best effort approach in case we
> > > stumble upon the XDP_TX in prog being loaded and query the driver if it
> > > would be able to provide the Tx resources on the current system, given
> > > that normally we tend to have a queue per core.
> >
> > Why do we need to query? I guess you want some indication from the
> > driver it's not going to be running in the ideal NIC configuration?
> > I guess printing a warning would be the normal way to show that. But,
> > maybe your point is you want something easier to query?
>
> I meant that given Jesper's example, what should we do? You don't have Tx
> resources to pull at all. Should we have a data path for that case that
> would share Tx qs between XDP/netstack? Probably not.

I think ixgbe should have a fallback mode, where it allocates e.g. 32
TX-queues for XDP xmits, or even just the same amount as RX-queues (I
think XDP_TX and XDP_REDIRECT can share these TX-queues dedicated to
XDP).
When in fallback mode a lock needs to be taken (sharded across CPUs),
but ndo_xdp_xmit will bulk up to 16 packets, so it should not matter too
much.

I do think ixgbe should output a dmesg log message to say it is in XDP
fallback mode with X number of TX-queues. Our QA usually collects the
dmesg output after a test run.

> > >
> > > In that case igb would say yes, ixgbe would say no and prog would be
> > > rejected.
> >
> > I think the driver should load even if it can't meet the queue per
> > core quota. Refusing to load at all or just dropping packets on the
> > floor is not very friendly. I think we agree on that point.
>
> Agreed on that. But it needs some work. I can dabble on that a bit.

I would really appreciate it if Intel can fix this in the ixgbe driver,
and implement a fallback method.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer