Issue
=====
With the current implementation at the time of i40e_init_msix(), i40e
creates vectors only based on the number of online CPUs. This would be
problematic for RT setup that includes a large number of isolated but
very few housekeeping CPUs. This is because in those setups an attempt
to move all IRQs from isolated to housekeeping CPUs may easily fail due
to per CPU vector limit.

Setup For The Issue
===================
I have triggered this issue on a setup that had a total of 72 cores
among which 68 were isolated and only 4 were left for housekeeping
tasks. I was using tuned's realtime-virtual-host profile to configure
the system. However, Tuned reported the error message 'Failed to set SMP
affinity of IRQ xxx to '00000040,00000010,00000005': [Errno 28] No space
left on the device' for several IRQs in tuned.log.
Note: There were other IRQs as well pinned to the housekeeping CPUs that
were generated by other drivers.

Fix
===
- In this proposed fix I have replaced num_online_cpus in
  i40e_init_msix() with the number of housekeeping CPUs.
- The reason why I chose to include both HK_FLAG_DOMAIN & HK_FLAG_WQ is
  because we would also need IRQ isolation with something like systemd's
  CPU affinity.

Testing
=======
To test this change I had added a tracepoint in i40e_init_msix() to find
the number of CPUs derived for vector creation with and without tuned's
realtime-virtual-host profile. As per expectation with the profile
applied I was only getting the number of housekeeping CPUs and all
available CPUs without it.

Nitesh Narayan Lal (1):
  i40e: limit the msix vectors based on housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

--
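
As a cross-check on the setup described in the cover letter, the affinity mask in the tuned error message can be decoded by counting its set bits. The sketch below is a userspace illustration only: it assumes the mask is a comma-separated list of 32-bit hex groups, and the helper names affinity_mask_weight() and popcount32() are hypothetical, not part of tuned or the kernel.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the set bits in one 32-bit mask group. */
static int popcount32(unsigned long v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Hypothetical helper: count the CPUs named by a comma-separated hex
 * affinity mask such as "00000040,00000010,00000005" (assumed to hold
 * 32 bits per group). Returns -1 if a group cannot be parsed.
 */
static int affinity_mask_weight(const char *mask)
{
	const char *p = mask;
	int total = 0;

	assert(mask != NULL);
	while (*p) {
		char *end;
		unsigned long group = strtoul(p, &end, 16);

		if (end == p || group > 0xffffffffUL)
			return -1;	/* not a valid 32-bit hex group */
		total += popcount32(group);
		if (*end == ',')
			end++;		/* step over the group separator */
		else if (*end != '\0')
			return -1;	/* unexpected character */
		p = end;
	}
	return total;
}
```

Calling affinity_mask_weight("00000040,00000010,00000005") returns 4, which matches the 4 housekeeping CPUs in the reported setup.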
In a realtime environment, it is essential to isolate
unwanted IRQs from isolated CPUs to prevent latency overheads.
Creating MSIX vectors only based on the online CPUs could lead
to a potential issue on an RT setup that has several isolated
CPUs but a very few housekeeping CPUs. This is because in these
kinds of setups an attempt to move the IRQs to the limited
housekeeping CPUs from isolated CPUs might fail due to the per
CPU vector limit. This could eventually result in latency spikes
because of the IRQ threads that we fail to move from isolated
CPUs. This patch prevents i40e to add vectors only based on
available online CPUs by using housekeeping_cpumask() to derive
the number of available housekeeping CPUs.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 5d807c8004f8..9691bececb86 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <linux/sched/isolation.h>
 
 /* Local includes */
 #include "i40e.h"
@@ -10933,11 +10934,13 @@ static int i40e_reserve_msix_vectors(struct i40e_pf *pf, int vectors)
 static int i40e_init_msix(struct i40e_pf *pf)
 {
 	struct i40e_hw *hw = &pf->hw;
+	const struct cpumask *mask;
 	int cpus, extra_vectors;
 	int vectors_left;
 	int v_budget, i;
 	int v_actual;
 	int iwarp_requested = 0;
+	int hk_flags;
 
 	if (!(pf->flags & I40E_FLAG_MSIX_ENABLED))
 		return -ENODEV;
@@ -10968,12 +10971,15 @@ static int i40e_init_msix(struct i40e_pf *pf)
 
 	/* reserve some vectors for the main PF traffic queues. Initially we
 	 * only reserve at most 50% of the available vectors, in the case that
-	 * the number of online CPUs is large. This ensures that we can enable
-	 * extra features as well. Once we've enabled the other features, we
-	 * will use any remaining vectors to reach as close as we can to the
-	 * number of online CPUs.
+	 * the number of online (housekeeping) CPUs is large. This ensures that
+	 * we can enable extra features as well. Once we've enabled the other
+	 * features, we will use any remaining vectors to reach as close as we
+	 * can to the number of online (housekeeping) CPUs.
 	 */
-	cpus = num_online_cpus();
+	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
+	mask = housekeeping_cpumask(hk_flags);
+	cpus = cpumask_weight(mask);
+
 	pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
 	vectors_left -= pf->num_lan_msix;
 
--
2.18.4
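
To make the effect of the changed CPU count concrete, here is a minimal userspace sketch of the reservation arithmetic in the patch, min_t(int, cpus, vectors_left / 2). The function name lan_msix_budget() is hypothetical, and the figure of 128 remaining vectors used in the note below is an assumed value for illustration, not something taken from the thread.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the driver expression
 *   pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
 * It reserves at most half of the remaining vectors for LAN queues.
 */
static int lan_msix_budget(int cpus, int vectors_left)
{
	int half = vectors_left / 2;

	assert(cpus >= 0 && vectors_left >= 0);
	return cpus < half ? cpus : half;
}
```

Assuming 128 vectors remain, passing the 72 online CPUs from the reported setup reserves 64 LAN vectors, while passing the 4 housekeeping CPUs reserves only 4, so far fewer IRQs later need to be moved onto the housekeeping CPUs.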
> -----Original Message-----
> From: Nitesh Narayan Lal <nitesh@redhat.com>
> Sent: Monday, June 15, 2020 1:21 PM
> To: linux-kernel@vger.kernel.org; frederic@kernel.org; mtosatti@redhat.com;
> sassmann@redhat.com; Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; Keller,
> Jacob E <jacob.e.keller@intel.com>; jlelli@redhat.com
> Subject: [Patch v1] i40e: limit the msix vectors based on housekeeping CPUs
>
> In a realtime environment, it is essential to isolate
> unwanted IRQs from isolated CPUs to prevent latency overheads.
> Creating MSIX vectors only based on the online CPUs could lead
> to a potential issue on an RT setup that has several isolated
> CPUs but a very few housekeeping CPUs. This is because in these
> kinds of setups an attempt to move the IRQs to the limited
> housekeeping CPUs from isolated CPUs might fail due to the per
> CPU vector limit. This could eventually result in latency spikes
> because of the IRQ threads that we fail to move from isolated
> CPUs. This patch prevents i40e to add vectors only based on
> available online CPUs by using housekeeping_cpumask() to derive
> the number of available housekeeping CPUs.
>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
Ok, so the idea is that "housekeeping" CPUs are to be used for general purpose configuration, and thus is a subset of online CPUs. By reducing the limit to just housekeeping CPUs, we ensure that we do not overload the system with more queues than can be handled by the general purpose CPUs?
Thanks,
Jake
On 6/15/20 4:48 PM, Keller, Jacob E wrote:
>
>> -----Original Message-----
>> From: Nitesh Narayan Lal <nitesh@redhat.com>
>> Sent: Monday, June 15, 2020 1:21 PM
>> To: linux-kernel@vger.kernel.org; frederic@kernel.org; mtosatti@redhat.com;
>> sassmann@redhat.com; Kirsher, Jeffrey T <jeffrey.t.kirsher@intel.com>; Keller,
>> Jacob E <jacob.e.keller@intel.com>; jlelli@redhat.com
>> Subject: [Patch v1] i40e: limit the msix vectors based on housekeeping CPUs
>>
>> In a realtime environment, it is essential to isolate
>> unwanted IRQs from isolated CPUs to prevent latency overheads.
>> Creating MSIX vectors only based on the online CPUs could lead
>> to a potential issue on an RT setup that has several isolated
>> CPUs but a very few housekeeping CPUs. This is because in these
>> kinds of setups an attempt to move the IRQs to the limited
>> housekeeping CPUs from isolated CPUs might fail due to the per
>> CPU vector limit. This could eventually result in latency spikes
>> because of the IRQ threads that we fail to move from isolated
>> CPUs. This patch prevents i40e to add vectors only based on
>> available online CPUs by using housekeeping_cpumask() to derive
>> the number of available housekeeping CPUs.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
> Ok, so the idea is that "housekeeping" CPUs are to be used for general purpose configuration, and thus is a subset of online CPUs. By reducing the limit to just housekeeping CPUs, we ensure that we do not overload the system with more queues than can be handled by the general purpose CPUs?

Yes. General purpose or the housekeeping CPUs or the non-isolated CPUs.

>
> Thanks,
> Jake
>
--
Nitesh
On Mon, Jun 15, 2020 at 04:21:25PM -0400, Nitesh Narayan Lal wrote:
> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
> + mask = housekeeping_cpumask(hk_flags);
> + cpus = cpumask_weight(mask);
Code like this has no business inside a driver. Please provide a
proper core API for it instead. Also please wire up
pci_alloc_irq_vectors* to use this API as well.
On 6/16/20 4:03 AM, Christoph Hellwig wrote:
> On Mon, Jun 15, 2020 at 04:21:25PM -0400, Nitesh Narayan Lal wrote:
>> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>> + mask = housekeeping_cpumask(hk_flags);
>> + cpus = cpumask_weight(mask);
> Code like this has no business inside a driver. Please provide a
> proper core API for it instead.

Ok, I will think of a better way of doing this.

> Also please wire up
> pci_alloc_irq_vectors* to use this API as well.

Understood, I will include this in a separate patch.

>
--
Thanks
Nitesh
On 6/16/20 4:03 AM, Christoph Hellwig wrote:
> On Mon, Jun 15, 2020 at 04:21:25PM -0400, Nitesh Narayan Lal wrote:
>> + hk_flags = HK_FLAG_DOMAIN | HK_FLAG_WQ;
>> + mask = housekeeping_cpumask(hk_flags);
>> + cpus = cpumask_weight(mask);
> Code like this has no business inside a driver. Please provide a
> proper core API for it instead. Also please wire up
> pci_alloc_irq_vectors* to use this API as well.
>

Hi Christoph,

I have been looking into using the nr_housekeeping_* API that I will be
defining within pci_alloc_irq_vectors* to limit the number of vectors.
However, I am wondering about a few things:

- Some drivers, such as i40e, have until now used the number of online
  CPUs to restrict the number of vectors that they create. Would it make
  sense to restrict the maximum vectors requested based on
  nr_online/housekeeping_cpus (though I will have to make sure that
  min_vecs is always satisfied)? The other option would be to check the
  total available vectors across all online/housekeeping CPUs when
  limiting maxvecs, which would probably be more accurate.

- Another thing that I am wondering about is the right way to test this
  change. Please let me know if you have any suggestions.

--
Nitesh
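
The first option raised in this message, capping the requested maximum at the housekeeping CPU count while still honouring min_vecs, can be sketched in userspace as follows. This is only an illustration of the clamping rule under that assumption: limit_max_vecs() and its hk_cpus parameter are hypothetical names, and nothing here is the API that was eventually proposed.

```c
#include <assert.h>

/*
 * Hypothetical helper: cap the requested maximum number of vectors at
 * the number of housekeeping CPUs, but never drop below the caller's
 * minimum, so a min_vecs requirement is still satisfiable.
 */
static unsigned int limit_max_vecs(unsigned int min_vecs,
				   unsigned int max_vecs,
				   unsigned int hk_cpus)
{
	assert(min_vecs <= max_vecs);

	if (hk_cpus < min_vecs)
		return min_vecs;	/* the minimum always wins */
	if (hk_cpus < max_vecs)
		return hk_cpus;		/* cap at housekeeping CPUs */
	return max_vecs;		/* request was already small enough */
}
```

With 4 housekeeping CPUs, a request for 1 to 72 vectors would be limited to 4, while a request that needs at least 8 vectors would keep 8. The second option in the message, summing the vectors actually available on each housekeeping CPU, would need per-CPU accounting that this sketch does not model.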