Linux-PCI Archive on lore.kernel.org
* [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs
@ 2020-09-28 18:35 Nitesh Narayan Lal
  2020-09-28 18:35 ` [PATCH v4 1/4] sched/isolation: API to get number of " Nitesh Narayan Lal
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-09-28 18:35 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	mtosatti, sassmann, jesse.brandeburg, lihong.yang, helgaas,
	nitesh, jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, peterz, juri.lelli, vincent.guittot, lgoncalv

This is a follow-up posting for "[PATCH v3 0/4] isolation: limit msix vectors
to housekeeping CPUs".

Issue
=====
With the current implementation, device drivers take only num_online_cpus()
into consideration when creating their MSI-X vectors. This works quite well
in a non-RT environment, but in an RT environment with a large number of
isolated CPUs and very few housekeeping CPUs it can lead to a problem. The
problem is triggered when something like tuned tries to move all the IRQs
from the isolated CPUs to the limited number of housekeeping CPUs, to prevent
interruptions to the latency-sensitive workload running on the isolated CPUs.
That move fails because of the per-CPU vector limitation.


Proposed Fix
============
In this patch-set, the following changes are proposed:
- A generic API housekeeping_num_online_cpus() which returns the number of
  online housekeeping CPUs based on the hk_flag passed by the caller.
- i40e: For the i40e driver, the num_online_cpus() used in i40e_init_msix()
  to calculate the number of MSI-X vectors is replaced with the above API,
  which returns the online housekeeping CPUs that are meant to handle
  managed-IRQ jobs.
- pci_alloc_irq_vectors(): With the help of housekeeping_num_online_cpus(),
  the max_vecs passed to pci_alloc_irq_vectors() is restricted to the online
  housekeeping CPUs (designated for managed-IRQ jobs), but only in an RT
  environment. However, if min_vecs exceeds the number of online housekeeping
  CPUs, max_vecs is limited to min_vecs instead (see the sketch below).
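
For illustration, the limiting described above boils down to roughly the
following (a sketch only; the exact change is the one in patch 4):

	unsigned int hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);

	/* Only clamp when some CPUs are actually isolated */
	if (hk_cpus < num_online_cpus()) {
		if (hk_cpus < min_vecs)
			max_vecs = min_vecs;	/* never below the driver's minimum */
		else if (hk_cpus < max_vecs)
			max_vecs = hk_cpus;	/* cap at the housekeeping CPUs */
	}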


Future Work
===========

- In the previous upstream discussion [1], it was decided that it would be
  better to have a generic framework that can be consumed by all the drivers
  to fix this kind of issue. However, that is long-term work, and since there
  are RT workloads that are being impacted by the reported issue right now,
  we agreed on the proposed per-device approach for the time being.


Testing
=======
Functionality:
- To test that the issue is resolved with the i40e change, I added a
  tracepoint in i40e_init_msix() to find the number of CPUs used for vector
  creation with and without tuned's realtime-virtual-host profile. As
  expected, with the profile applied I got only the number of housekeeping
  CPUs, and all available CPUs without it. Another way to verify is to check
  the number of IRQs that get created for an impacted device. I did a few
  more similar tests with different modes, e.g. with only nohz_full,
  isolcpus, etc.

Performance:
- To analyze the performance impact, I targeted the change introduced in
  pci_alloc_irq_vectors() and compared the results against a vanilla kernel
  (5.9.0-rc3).

  Setup Information:
  + I had a couple of 24-core machines connected back to back via a couple of
    mlx5 NICs, and I analyzed the average bitrate for server-client TCP and
    UDP transmission via iperf (example invocations below).
  + To minimize the bitrate variation of the iperf TCP and UDP stream tests,
    I applied tuned's network-throughput profile and disabled HT.
  Test Information:
  + For the environment that had no isolated CPUs:
    I tested with a single stream and with 24 streams (same as the number of
    online CPUs).
  + For the environment that had 20 isolated CPUs:
    I tested with a single stream, 4 streams (same as the number of
    housekeeping CPUs), and 24 streams (same as the number of online CPUs).
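
  The iperf invocations were along these lines (illustrative only; exact
  options such as test duration and UDP bandwidth targets varied per run):

	iperf -s                      # server side
	iperf -c <server-ip> -P 24    # client, 24 parallel TCP streams
	iperf -c <server-ip> -P 4     # client, 4 parallel TCP streams
	iperf -c <server-ip> -u       # client, UDP stream test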

  Results:
  # UDP Stream Test:
    + No degradation was observed in the UDP stream tests in either
      environment (with and without isolated CPUs) after the introduction
      of the patches.
  # TCP Stream Test - No isolated CPUs:
    + No noticeable degradation was observed.
  # TCP Stream Test - With isolated CPUs:
    + Multiple Streams (4)  - Average degradation of around 5-6%
    + Multiple Streams (24) - Average degradation of around 2-3%
    + Single Stream         - Even on a vanilla kernel the bitrate observed
                              for a TCP single-stream test seems to vary
                              significantly across different runs (e.g. the
                              variation between the best and the worst case
                              on a vanilla kernel was around 8-10%). A
                              similar variation was observed with the kernel
                              that included my patches. No additional
                              degradation was observed.

If there are any suggestions for more performance evaluation, I would
be happy to discuss/perform them.


Changes from v3[2]:
==================
- Moved the logic to limit the max_vecs from pci_alloc_irq_vectors() to
  pci_alloc_irq_vectors_affinity() as that's the exported interface and
  drivers using this API also need to be fixed (suggestion from Bjorn Helgaas).

Changes from v2[3]:
==================
- Renamed hk_num_online_cpus() to housekeeping_num_online_cpus() to keep
  the naming convention consistent (based on a suggestion from Peter
  Zijlstra and Frederic Weisbecker).
- Added an argument "enum hk_flags" to the housekeeping_num_online_cpus() API
  to make it more usable in different use-cases (based on a suggestion from
  Frederic Weisbecker).
- Replaced cpumask_weight(cpu_online_mask) with num_online_cpus() (suggestion
  from Bjorn Helgaas).
- Modified patch commit messages and comments based on Bjorn Helgaas's
  suggestions.

Changes from v1[4]:
==================
Patch1:
- Replaced num_housekeeping_cpus() with hk_num_online_cpus() and started
  using the cpumask corresponding to HK_FLAG_MANAGED_IRQ to derive the number
  of online housekeeping CPUs. This is based on Frederic Weisbecker's
  suggestion.
- Since hk_num_online_cpus() is self-explanatory, got rid of the comment
  that was added previously.
Patch2:
- Added a new patch that is meant to enable managed IRQ isolation for
  nohz_full CPUs. This is based on Frederic Weisbecker's suggestion.
Patch4 (PCI):
- For cases where min_vecs exceeds the number of online housekeeping CPUs,
  instead of skipping the modification of max_vecs, started restricting it
  based on min_vecs. This is based on a suggestion from Marcelo Tosatti.


[1] https://lore.kernel.org/lkml/20200922095440.GA5217@lenoir/
[2] https://lore.kernel.org/lkml/20200925182654.224004-1-nitesh@redhat.com/
[3] https://lore.kernel.org/lkml/20200923181126.223766-1-nitesh@redhat.com/
[4] https://lore.kernel.org/lkml/20200909150818.313699-1-nitesh@redhat.com/


Nitesh Narayan Lal (4):
  sched/isolation: API to get number of housekeeping CPUs
  sched/isolation: Extend nohz_full to isolate managed IRQs
  i40e: Limit msix vectors to housekeeping CPUs
  PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
 drivers/pci/msi.c                           | 18 ++++++++++++++++++
 include/linux/sched/isolation.h             |  9 +++++++++
 kernel/sched/isolation.c                    |  2 +-
 4 files changed, 30 insertions(+), 2 deletions(-)

-- 




* [PATCH v4 1/4] sched/isolation: API to get number of housekeeping CPUs
  2020-09-28 18:35 [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
@ 2020-09-28 18:35 ` Nitesh Narayan Lal
  2020-09-28 18:35 ` [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs Nitesh Narayan Lal
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-09-28 18:35 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	mtosatti, sassmann, jesse.brandeburg, lihong.yang, helgaas,
	nitesh, jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, peterz, juri.lelli, vincent.guittot, lgoncalv

Introduce a new API, housekeeping_num_online_cpus(), which can be used to
retrieve the number of online housekeeping CPUs based on the housekeeping
flag passed by the caller.

Some of the consumers of this API are device drivers that previously relied
only on num_online_cpus() to determine the number of MSI-X vectors to create.
In real-time environments, to minimize interruptions to isolated CPUs, all
device-specific IRQ vectors are often moved to the housekeeping CPUs;
having excess vectors could then cause the housekeeping CPUs to run out of
IRQ vectors.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 include/linux/sched/isolation.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/sched/isolation.h b/include/linux/sched/isolation.h
index cc9f393e2a70..e021b1846c1d 100644
--- a/include/linux/sched/isolation.h
+++ b/include/linux/sched/isolation.h
@@ -57,4 +57,13 @@ static inline bool housekeeping_cpu(int cpu, enum hk_flags flags)
 	return true;
 }
 
+static inline unsigned int housekeeping_num_online_cpus(enum hk_flags flags)
+{
+#ifdef CONFIG_CPU_ISOLATION
+	if (static_branch_unlikely(&housekeeping_overridden))
+		return cpumask_weight(housekeeping_cpumask(flags));
+#endif
+	return num_online_cpus();
+}
+
 #endif /* _LINUX_SCHED_ISOLATION_H */
-- 
2.18.2
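
A minimal usage sketch of the new helper (illustrative only; the foo_*
name is hypothetical, patch 3 does the equivalent in i40e):

	#include <linux/sched/isolation.h>

	/*
	 * Size the vector/queue request by the CPUs that are allowed to
	 * service managed-IRQ housekeeping work, not by all online CPUs.
	 */
	static unsigned int foo_max_queue_vectors(void)
	{
		return housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
	}

With CONFIG_CPU_ISOLATION disabled, or when no housekeeping override is set
on the kernel command line, the helper falls back to num_online_cpus(), so
existing behaviour is unchanged.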



* [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs
  2020-09-28 18:35 [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
  2020-09-28 18:35 ` [PATCH v4 1/4] sched/isolation: API to get number of " Nitesh Narayan Lal
@ 2020-09-28 18:35 ` Nitesh Narayan Lal
  2020-09-28 18:35 ` [PATCH v4 3/4] i40e: Limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-09-28 18:35 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	mtosatti, sassmann, jesse.brandeburg, lihong.yang, helgaas,
	nitesh, jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, peterz, juri.lelli, vincent.guittot, lgoncalv

Extend the nohz_full feature set to include isolation from managed IRQs.
This is required specifically for setups that use only nohz_full and still
require isolation to maintain lower latency for the listed CPUs.

Suggested-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 kernel/sched/isolation.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index 5a6ea03f9882..9df9598a9e39 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
 	unsigned int flags;
 
 	flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
-		HK_FLAG_MISC | HK_FLAG_KTHREAD;
+		HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
 
 	return housekeeping_setup(str, flags);
 }
-- 
2.18.2
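
As an example of the effect (assuming a 16-CPU system where CPUs 2-15 are
to be isolated), booting with:

	nohz_full=2-15

now also includes HK_FLAG_MANAGED_IRQ in the housekeeping flags, so managed
IRQs are kept off CPUs 2-15 whenever a housekeeping CPU is available in
their affinity mask, just as if

	isolcpus=managed_irq,2-15

had been specified in addition.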



* [PATCH v4 3/4] i40e: Limit msix vectors to housekeeping CPUs
  2020-09-28 18:35 [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
  2020-09-28 18:35 ` [PATCH v4 1/4] sched/isolation: API to get number of " Nitesh Narayan Lal
  2020-09-28 18:35 ` [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs Nitesh Narayan Lal
@ 2020-09-28 18:35 ` Nitesh Narayan Lal
  2020-09-28 18:35 ` [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() " Nitesh Narayan Lal
  2020-10-01 15:49 ` [PATCH v4 0/4] isolation: limit msix vectors " Frederic Weisbecker
  4 siblings, 0 replies; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-09-28 18:35 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	mtosatti, sassmann, jesse.brandeburg, lihong.yang, helgaas,
	nitesh, jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, peterz, juri.lelli, vincent.guittot, lgoncalv

If we have isolated CPUs designated for real-time tasks, IRQ vectors are
moved from the isolated CPUs to the housekeeping CPUs from userspace to keep
the latency overhead on the real-time CPUs to a minimum. Creating MSI-X
vectors based only on the online CPUs could lead to exhaustion of the
housekeeping CPUs' IRQ vectors in such environments.

Prevent i40e from creating vectors based only on the online CPUs by instead
retrieving the number of online housekeeping CPUs that are designated to
handle managed-IRQ jobs.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Acked-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 2e433fdbf2c3..370b581cd48c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -5,6 +5,7 @@
 #include <linux/of_net.h>
 #include <linux/pci.h>
 #include <linux/bpf.h>
+#include <linux/sched/isolation.h>
 #include <generated/utsrelease.h>
 
 /* Local includes */
@@ -11002,7 +11003,7 @@ static int i40e_init_msix(struct i40e_pf *pf)
 	 * will use any remaining vectors to reach as close as we can to the
 	 * number of online CPUs.
 	 */
-	cpus = num_online_cpus();
+	cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
 	pf->num_lan_msix = min_t(int, cpus, vectors_left / 2);
 	vectors_left -= pf->num_lan_msix;
 
-- 
2.18.2
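
A worked example on the cover letter's test box (24 online CPUs, 20 of them
isolated, 4 housekeeping; vectors_left is hypothetical):

	vectors_left = 100
	before: cpus = num_online_cpus()                              = 24
	        num_lan_msix = min(24, 100 / 2)                       = 24
	after:  cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ) = 4
	        num_lan_msix = min(4, 100 / 2)                        =  4

So the LAN queues, and therefore their vectors, are sized to the CPUs that
will actually service the interrupts after they are moved off the isolated
CPUs.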



* [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-09-28 18:35 [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
                   ` (2 preceding siblings ...)
  2020-09-28 18:35 ` [PATCH v4 3/4] i40e: Limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
@ 2020-09-28 18:35 ` Nitesh Narayan Lal
  2020-09-28 21:59   ` Bjorn Helgaas
                     ` (2 more replies)
  2020-10-01 15:49 ` [PATCH v4 0/4] isolation: limit msix vectors " Frederic Weisbecker
  4 siblings, 3 replies; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-09-28 18:35 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	mtosatti, sassmann, jesse.brandeburg, lihong.yang, helgaas,
	nitesh, jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, peterz, juri.lelli, vincent.guittot, lgoncalv

If we have isolated CPUs dedicated for use by real-time tasks, we try to
move IRQs to housekeeping CPUs from the userspace to reduce latency
overhead on the isolated CPUs.

If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
may exceed per-CPU vector limits.

When we have isolated CPUs, limit the number of vectors allocated by
pci_alloc_irq_vectors() to the minimum number required by the driver, or
to one per housekeeping CPU if that is larger.

Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
---
 drivers/pci/msi.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 30ae4ffda5c1..8c156867803c 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -23,6 +23,7 @@
 #include <linux/slab.h>
 #include <linux/irqdomain.h>
 #include <linux/of_irq.h>
+#include <linux/sched/isolation.h>
 
 #include "pci.h"
 
@@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
 				   struct irq_affinity *affd)
 {
 	struct irq_affinity msi_default_affd = {0};
+	unsigned int hk_cpus;
 	int nvecs = -ENOSPC;
 
+	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
+
+	/*
+	 * If we have isolated CPUs for use by real-time tasks, to keep the
+	 * latency overhead to a minimum, device-specific IRQ vectors are moved
+	 * to the housekeeping CPUs from the userspace by changing their
+	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
+	 * running out of IRQ vectors.
+	 */
+	if (hk_cpus < num_online_cpus()) {
+		if (hk_cpus < min_vecs)
+			max_vecs = min_vecs;
+		else if (hk_cpus < max_vecs)
+			max_vecs = hk_cpus;
+	}
+
 	if (flags & PCI_IRQ_AFFINITY) {
 		if (!affd)
 			affd = &msi_default_affd;
-- 
2.18.2



* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-09-28 18:35 ` [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() " Nitesh Narayan Lal
@ 2020-09-28 21:59   ` Bjorn Helgaas
  2020-09-29 17:46     ` Christoph Hellwig
  2020-10-16 12:20   ` Peter Zijlstra
  2020-10-20 14:16   ` Thomas Gleixner
  2 siblings, 1 reply; 28+ messages in thread
From: Bjorn Helgaas @ 2020-09-28 21:59 UTC (permalink / raw)
  To: Nitesh Narayan Lal, hch
  Cc: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	mtosatti, sassmann, jesse.brandeburg, lihong.yang,
	jeffrey.t.kirsher, jacob.e.keller, jlelli, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, peterz, juri.lelli, vincent.guittot, lgoncalv

[to: Christoph in case he has comments, since I think he wrote this code]

On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
> 
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
> 
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> ---
>  drivers/pci/msi.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 30ae4ffda5c1..8c156867803c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -23,6 +23,7 @@
>  #include <linux/slab.h>
>  #include <linux/irqdomain.h>
>  #include <linux/of_irq.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "pci.h"
>  
> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  				   struct irq_affinity *affd)
>  {
>  	struct irq_affinity msi_default_affd = {0};
> +	unsigned int hk_cpus;
>  	int nvecs = -ENOSPC;
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */
> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;
> +	}
> +
>  	if (flags & PCI_IRQ_AFFINITY) {
>  		if (!affd)
>  			affd = &msi_default_affd;
> -- 
> 2.18.2
> 


* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-09-28 21:59   ` Bjorn Helgaas
@ 2020-09-29 17:46     ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2020-09-29 17:46 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Nitesh Narayan Lal, hch, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, jeffrey.t.kirsher, jacob.e.keller, jlelli, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, peterz, juri.lelli, vincent.guittot, lgoncalv

On Mon, Sep 28, 2020 at 04:59:31PM -0500, Bjorn Helgaas wrote:
> [to: Christoph in case he has comments, since I think he wrote this code]

I think I actually suggested this a few iterations back.

> > +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> > +
> > +	/*
> > +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> > +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> > +	 * to the housekeeping CPUs from the userspace by changing their
> > +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> > +	 * running out of IRQ vectors.
> > +	 */
> > +	if (hk_cpus < num_online_cpus()) {

I would have moved the assignment to hk_cpus below the comment and
just above the if, but that is really just a minor style preference.

Otherwise this looks good:

Acked-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs
  2020-09-28 18:35 [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
                   ` (3 preceding siblings ...)
  2020-09-28 18:35 ` [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() " Nitesh Narayan Lal
@ 2020-10-01 15:49 ` Frederic Weisbecker
  2020-10-08 21:40   ` Nitesh Narayan Lal
  4 siblings, 1 reply; 28+ messages in thread
From: Frederic Weisbecker @ 2020-10-01 15:49 UTC (permalink / raw)
  To: Nitesh Narayan Lal, Peter Zijlstra
  Cc: linux-kernel, netdev, linux-pci, intel-wired-lan, mtosatti,
	sassmann, jesse.brandeburg, lihong.yang, helgaas,
	jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, juri.lelli, vincent.guittot, lgoncalv

On Mon, Sep 28, 2020 at 02:35:25PM -0400, Nitesh Narayan Lal wrote:
> Nitesh Narayan Lal (4):
>   sched/isolation: API to get number of housekeeping CPUs
>   sched/isolation: Extend nohz_full to isolate managed IRQs
>   i40e: Limit msix vectors to housekeeping CPUs
>   PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
> 
>  drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
>  drivers/pci/msi.c                           | 18 ++++++++++++++++++
>  include/linux/sched/isolation.h             |  9 +++++++++
>  kernel/sched/isolation.c                    |  2 +-
>  4 files changed, 30 insertions(+), 2 deletions(-)

Acked-by: Frederic Weisbecker <frederic@kernel.org>

Peter, if you're ok with the set, I guess this should go through
the scheduler tree?

Thanks.


* Re: [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs
  2020-10-01 15:49 ` [PATCH v4 0/4] isolation: limit msix vectors " Frederic Weisbecker
@ 2020-10-08 21:40   ` Nitesh Narayan Lal
  0 siblings, 0 replies; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-10-08 21:40 UTC (permalink / raw)
  To: Frederic Weisbecker, Peter Zijlstra
  Cc: linux-kernel, netdev, linux-pci, intel-wired-lan, mtosatti,
	sassmann, jesse.brandeburg, lihong.yang, helgaas,
	jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, juri.lelli, vincent.guittot, lgoncalv



On 10/1/20 11:49 AM, Frederic Weisbecker wrote:
> On Mon, Sep 28, 2020 at 02:35:25PM -0400, Nitesh Narayan Lal wrote:
>> Nitesh Narayan Lal (4):
>>   sched/isolation: API to get number of housekeeping CPUs
>>   sched/isolation: Extend nohz_full to isolate managed IRQs
>>   i40e: Limit msix vectors to housekeeping CPUs
>>   PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
>>
>>  drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
>>  drivers/pci/msi.c                           | 18 ++++++++++++++++++
>>  include/linux/sched/isolation.h             |  9 +++++++++
>>  kernel/sched/isolation.c                    |  2 +-
>>  4 files changed, 30 insertions(+), 2 deletions(-)
> Acked-by: Frederic Weisbecker <frederic@kernel.org>
>
> Peter, if you're ok with the set, I guess this should go through
> the scheduler tree?
>
> Thanks.

Hi Peter,

I wanted to follow up and check whether you have any concerns/suggestions
that I should address in this patch-set before it can be picked up.

-- 
Thanks
Nitesh




* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-09-28 18:35 ` [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() " Nitesh Narayan Lal
  2020-09-28 21:59   ` Bjorn Helgaas
@ 2020-10-16 12:20   ` Peter Zijlstra
  2020-10-18 18:14     ` Nitesh Narayan Lal
  2020-10-20 14:16   ` Thomas Gleixner
  2 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-10-16 12:20 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	mtosatti, sassmann, jesse.brandeburg, lihong.yang, helgaas,
	jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, juri.lelli, vincent.guittot, lgoncalv

On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
> If we have isolated CPUs dedicated for use by real-time tasks, we try to
> move IRQs to housekeeping CPUs from the userspace to reduce latency
> overhead on the isolated CPUs.
> 
> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
> may exceed per-CPU vector limits.
> 
> When we have isolated CPUs, limit the number of vectors allocated by
> pci_alloc_irq_vectors() to the minimum number required by the driver, or
> to one per housekeeping CPU if that is larger.
> 
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  drivers/pci/msi.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index 30ae4ffda5c1..8c156867803c 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -23,6 +23,7 @@
>  #include <linux/slab.h>
>  #include <linux/irqdomain.h>
>  #include <linux/of_irq.h>
> +#include <linux/sched/isolation.h>
>  
>  #include "pci.h"
>  
> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>  				   struct irq_affinity *affd)
>  {
>  	struct irq_affinity msi_default_affd = {0};
> +	unsigned int hk_cpus;
>  	int nvecs = -ENOSPC;
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */
> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;

is that:

		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
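
For concreteness, the suggested clamp() expands to
min(max(hk_cpus, min_vecs), max_vecs), which matches the open-coded checks
above: e.g. with min_vecs = 1, max_vecs = 24 and hk_cpus = 4, both forms
give 4; with hk_cpus = 32, both leave max_vecs at 24.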

Also, do we really need to have that conditional on hk_cpus <
num_online_cpus()? That is, why can't we do this unconditionally?

And what are the (desired) semantics vs hotplug? Using a cpumask without
excluding hotplug is racy.

> +	}
> +
>  	if (flags & PCI_IRQ_AFFINITY) {
>  		if (!affd)
>  			affd = &msi_default_affd;
> -- 
> 2.18.2
> 


* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-16 12:20   ` Peter Zijlstra
@ 2020-10-18 18:14     ` Nitesh Narayan Lal
  2020-10-19 11:11       ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-10-18 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	mtosatti, sassmann, jesse.brandeburg, lihong.yang, helgaas,
	jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, juri.lelli, vincent.guittot, lgoncalv



On 10/16/20 8:20 AM, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:29PM -0400, Nitesh Narayan Lal wrote:
>> If we have isolated CPUs dedicated for use by real-time tasks, we try to
>> move IRQs to housekeeping CPUs from the userspace to reduce latency
>> overhead on the isolated CPUs.
>>
>> If we allocate too many IRQ vectors, moving them all to housekeeping CPUs
>> may exceed per-CPU vector limits.
>>
>> When we have isolated CPUs, limit the number of vectors allocated by
>> pci_alloc_irq_vectors() to the minimum number required by the driver, or
>> to one per housekeeping CPU if that is larger.
>>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  drivers/pci/msi.c | 18 ++++++++++++++++++
>>  1 file changed, 18 insertions(+)
>>
>> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
>> index 30ae4ffda5c1..8c156867803c 100644
>> --- a/drivers/pci/msi.c
>> +++ b/drivers/pci/msi.c
>> @@ -23,6 +23,7 @@
>>  #include <linux/slab.h>
>>  #include <linux/irqdomain.h>
>>  #include <linux/of_irq.h>
>> +#include <linux/sched/isolation.h>
>>  
>>  #include "pci.h"
>>  
>> @@ -1191,8 +1192,25 @@ int pci_alloc_irq_vectors_affinity(struct pci_dev *dev, unsigned int min_vecs,
>>  				   struct irq_affinity *affd)
>>  {
>>  	struct irq_affinity msi_default_affd = {0};
>> +	unsigned int hk_cpus;
>>  	int nvecs = -ENOSPC;
>>  
>> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
>> +
>> +	/*
>> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
>> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
>> +	 * to the housekeeping CPUs from the userspace by changing their
>> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> +	 * running out of IRQ vectors.
>> +	 */
>> +	if (hk_cpus < num_online_cpus()) {
>> +		if (hk_cpus < min_vecs)
>> +			max_vecs = min_vecs;
>> +		else if (hk_cpus < max_vecs)
>> +			max_vecs = hk_cpus;
> is that:
>
> 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

Yes, I think this will do.

>
> Also, do we really need to have that conditional on hk_cpus <
> num_online_cpus()? That is, why can't we do this unconditionally?


FWIU, most of the drivers using this API already restrict the number of
vectors based on num_online_cpus(); if we do it unconditionally, we would
unnecessarily duplicate that restriction for cases where we don't have any
isolated CPUs.

Also, different drivers seem to take different factors into consideration
along with num_online_cpus() when determining the max_vecs to request; for
example, in the case of mlx5:
MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
               MLX5_EQ_VEC_COMP_BASE

Having hk_cpus < num_online_cpus() helps us ensure that we are only
changing the behavior when we have isolated CPUs.

Does that make sense?

>
> And what are the (desired) semantics vs hotplug? Using a cpumask without
> excluding hotplug is racy.

The housekeeping_mask should still remain constant, shouldn't it?
In any case, I can double-check this.

>
>> +	}
>> +
>>  	if (flags & PCI_IRQ_AFFINITY) {
>>  		if (!affd)
>>  			affd = &msi_default_affd;
>> -- 
>> 2.18.2
>>
-- 
Thanks
Nitesh




* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-18 18:14     ` Nitesh Narayan Lal
@ 2020-10-19 11:11       ` Peter Zijlstra
  2020-10-19 14:00         ` Marcelo Tosatti
  2020-10-19 14:21         ` Frederic Weisbecker
  0 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2020-10-19 11:11 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	mtosatti, sassmann, jesse.brandeburg, lihong.yang, helgaas,
	jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, juri.lelli, vincent.guittot, lgoncalv

On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> >> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> >> +
> >> +	/*
> >> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> >> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> >> +	 * to the housekeeping CPUs from the userspace by changing their
> >> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> >> +	 * running out of IRQ vectors.
> >> +	 */
> >> +	if (hk_cpus < num_online_cpus()) {
> >> +		if (hk_cpus < min_vecs)
> >> +			max_vecs = min_vecs;
> >> +		else if (hk_cpus < max_vecs)
> >> +			max_vecs = hk_cpus;
> > is that:
> >
> > 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> 
> Yes, I think this will do.
> 
> >
> > Also, do we really need to have that conditional on hk_cpus <
> > num_online_cpus()? That is, why can't we do this unconditionally?
> 
> FWIU most of the drivers using this API already restricts the number of
> vectors based on the num_online_cpus, if we do it unconditionally we can
> unnecessary duplicate the restriction for cases where we don't have any
> isolated CPUs.

unnecessary isn't really a concern here, this is a slow path. What's
important is code clarity.

> Also, different driver seems to take different factors into consideration
> along with num_online_cpus while finding the max_vecs to request, for
> example in the case of mlx5:
> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>                MLX5_EQ_VEC_COMP_BASE
> 
> Having hk_cpus < num_online_cpus() helps us ensure that we are only
> changing the behavior when we have isolated CPUs.
> 
> Does that make sense?

That seems to want to allocate N interrupts per cpu (plus some random
static amount, which seems weird, but whatever). This patch breaks that.

So I think it is important to figure out what that driver really wants
in the nohz_full case. If it wants to retain N interrupts per CPU, and
only reduce the number of CPUs, the proposed interface is wrong.

> > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > excluding hotplug is racy.
> 
> The housekeeping_mask should still remain constant, isn't?
> In any case, I can double check this.

The goal is very much to have that dynamically configurable.


* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-19 11:11       ` Peter Zijlstra
@ 2020-10-19 14:00         ` Marcelo Tosatti
  2020-10-19 14:25           ` Nitesh Narayan Lal
  2020-10-20  7:30           ` Peter Zijlstra
  2020-10-19 14:21         ` Frederic Weisbecker
  1 sibling, 2 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2020-10-19 14:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, juri.lelli, vincent.guittot,
	lgoncalv

On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
> On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> > >> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> > >> +
> > >> +	/*
> > >> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> > >> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> > >> +	 * to the housekeeping CPUs from the userspace by changing their
> > >> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> > >> +	 * running out of IRQ vectors.
> > >> +	 */
> > >> +	if (hk_cpus < num_online_cpus()) {
> > >> +		if (hk_cpus < min_vecs)
> > >> +			max_vecs = min_vecs;
> > >> +		else if (hk_cpus < max_vecs)
> > >> +			max_vecs = hk_cpus;
> > > is that:
> > >
> > > 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> > 
> > Yes, I think this will do.
> > 
> > >
> > > Also, do we really need to have that conditional on hk_cpus <
> > > num_online_cpus()? That is, why can't we do this unconditionally?
> > 
> > FWIU most of the drivers using this API already restricts the number of
> > vectors based on the num_online_cpus, if we do it unconditionally we can
> > unnecessary duplicate the restriction for cases where we don't have any
> > isolated CPUs.
> 
> unnecessary isn't really a concern here, this is a slow path. What's
> important is code clarity.
> 
> > Also, different driver seems to take different factors into consideration
> > along with num_online_cpus while finding the max_vecs to request, for
> > example in the case of mlx5:
> > MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
> >                MLX5_EQ_VEC_COMP_BASE
> > 
> > Having hk_cpus < num_online_cpus() helps us ensure that we are only
> > changing the behavior when we have isolated CPUs.
> > 
> > Does that make sense?
> 
> That seems to want to allocate N interrupts per cpu (plus some random
> static amount, which seems weird, but whatever). This patch breaks that.

On purpose. For the isolated CPUs we don't want network device 
interrupts (in this context).

> So I think it is important to figure out what that driver really wants
> in the nohz_full case. If it wants to retain N interrupts per CPU, and
> only reduce the number of CPUs, the proposed interface is wrong.

It wants N interrupts per non-isolated (AKA housekeeping) CPU.
Zero interrupts for isolated CPUs.

> > > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > > excluding hotplug is racy.
> > 
> > The housekeeping_mask should still remain constant, isn't?
> > In any case, I can double check this.
> 
> The goal is very much to have that dynamically configurable.

Yes, but this patch is a fix for a customer bug with the old, static
on-boot CPU isolation configuration.

---

Discussing the dynamic configuration (not this patch!) case:

We would need to enable/disable interrupts for a particular device
on a per-CPU basis. Such an interface does not exist yet.

Perhaps that is what you are looking for when writing "proposed interface
is wrong", Peter?





* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-19 11:11       ` Peter Zijlstra
  2020-10-19 14:00         ` Marcelo Tosatti
@ 2020-10-19 14:21         ` Frederic Weisbecker
  1 sibling, 0 replies; 28+ messages in thread
From: Frederic Weisbecker @ 2020-10-19 14:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, juri.lelli, vincent.guittot,
	lgoncalv

On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
> > > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > > excluding hotplug is racy.
> > 
> > The housekeeping_mask should still remain constant, isn't?
> > In any case, I can double check this.
> 
> The goal is very much to have that dynamically configurable.

Right, but unfortunately we won't be there for a little while. And the
existing code in these drivers allocating vectors doesn't even take
hotplug into account, as you spotted. So I agreed to let Nitesh fix this
issue on top of the existing code until he can look into providing some
infrastructure for this kind of vector allocation. The first step would be
to consolidate similar code from other drivers, then maybe handle hotplug,
and later dynamic housekeeping.


* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-19 14:00         ` Marcelo Tosatti
@ 2020-10-19 14:25           ` Nitesh Narayan Lal
  2020-10-20  7:30           ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-10-19 14:25 UTC (permalink / raw)
  To: Marcelo Tosatti, Peter Zijlstra
  Cc: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	sassmann, jesse.brandeburg, lihong.yang, helgaas,
	jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, juri.lelli, vincent.guittot, lgoncalv



On 10/19/20 10:00 AM, Marcelo Tosatti wrote:
> On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
>> On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:

[...]

>>>> Also, do we really need to have that conditional on hk_cpus <
>>>> num_online_cpus()? That is, why can't we do this unconditionally?
>>> FWIU most of the drivers using this API already restricts the number of
>>> vectors based on the num_online_cpus, if we do it unconditionally we can
>>> unnecessary duplicate the restriction for cases where we don't have any
>>> isolated CPUs.
>> unnecessary isn't really a concern here, this is a slow path. What's
>> important is code clarity.

Right, I can skip that check then.

>>
>>> Also, different driver seems to take different factors into consideration
>>> along with num_online_cpus while finding the max_vecs to request, for
>>> example in the case of mlx5:
>>> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>>>                MLX5_EQ_VEC_COMP_BASE
>>>
>>> Having hk_cpus < num_online_cpus() helps us ensure that we are only
>>> changing the behavior when we have isolated CPUs.
>>>
>>> Does that make sense?
>> That seems to want to allocate N interrupts per cpu (plus some random
>> static amount, which seems weird, but whatever). This patch breaks that.
> On purpose. For the isolated CPUs we don't want network device 
> interrupts (in this context).
>
>> So I think it is important to figure out what that driver really wants
>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>> only reduce the number of CPUs, the proposed interface is wrong.
> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> Zero interrupts for isolated CPUs.

Right, otherwise we may end up in a situation where we run out of per-CPU
vectors while moving the IRQs from the isolated CPUs to the housekeeping CPUs.


-- 
Thanks
Nitesh




* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-19 14:00         ` Marcelo Tosatti
  2020-10-19 14:25           ` Nitesh Narayan Lal
@ 2020-10-20  7:30           ` Peter Zijlstra
  2020-10-20 13:00             ` Nitesh Narayan Lal
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-10-20  7:30 UTC (permalink / raw)
  To: Marcelo Tosatti
  Cc: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, juri.lelli, vincent.guittot,
	lgoncalv

On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
> > So I think it is important to figure out what that driver really wants
> > in the nohz_full case. If it wants to retain N interrupts per CPU, and
> > only reduce the number of CPUs, the proposed interface is wrong.
> 
> It wants N interrupts per non-isolated (AKA housekeeping) CPU.

Then the patch is wrong and the interface needs changing from @min_vecs,
@max_vecs to something that expresses the N*nr_cpus relation.


* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-20  7:30           ` Peter Zijlstra
@ 2020-10-20 13:00             ` Nitesh Narayan Lal
  2020-10-20 13:41               ` Peter Zijlstra
  0 siblings, 1 reply; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-10-20 13:00 UTC (permalink / raw)
  To: Peter Zijlstra, Marcelo Tosatti
  Cc: linux-kernel, netdev, linux-pci, intel-wired-lan, frederic,
	sassmann, jesse.brandeburg, lihong.yang, helgaas,
	jeffrey.t.kirsher, jacob.e.keller, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, juri.lelli, vincent.guittot, lgoncalv



On 10/20/20 3:30 AM, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>> So I think it is important to figure out what that driver really wants
>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>> only reduce the number of CPUs, the proposed interface is wrong.
>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> Then the patch is wrong and the interface needs changing from @min_vecs,
> @max_vecs to something that expresses the N*nr_cpus relation.

Reading Marcelo's comment again I think what is really expected is 1
interrupt per non-isolated (housekeeping) CPU (not N interrupts).

My bad that I missed it initially.

-- 
Thanks
Nitesh




* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-20 13:00             ` Nitesh Narayan Lal
@ 2020-10-20 13:41               ` Peter Zijlstra
  2020-10-20 14:39                 ` Nitesh Narayan Lal
  0 siblings, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-10-20 13:41 UTC (permalink / raw)
  To: Nitesh Narayan Lal
  Cc: Marcelo Tosatti, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, juri.lelli, vincent.guittot,
	lgoncalv

On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
> 
> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
> > On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
> >>> So I think it is important to figure out what that driver really wants
> >>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
> >>> only reduce the number of CPUs, the proposed interface is wrong.
> >> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> > Then the patch is wrong and the interface needs changing from @min_vecs,
> > @max_vecs to something that expresses the N*nr_cpus relation.
> 
> Reading Marcelo's comment again I think what is really expected is 1
> interrupt per non-isolated (housekeeping) CPU (not N interrupts).

Then what is the point of them asking for N*nr_cpus when there is no
isolation?

Either everybody wants 1 interrupt per CPU and we can do the clamp
unconditionally, in which case we should go fix this user, or they want
multiple per cpu and we should go fix the interface.

It cannot be both.


* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-09-28 18:35 ` [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() " Nitesh Narayan Lal
  2020-09-28 21:59   ` Bjorn Helgaas
  2020-10-16 12:20   ` Peter Zijlstra
@ 2020-10-20 14:16   ` Thomas Gleixner
  2020-10-20 16:18     ` Nitesh Narayan Lal
  2 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2020-10-20 14:16 UTC (permalink / raw)
  To: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, nitesh, jeffrey.t.kirsher, jacob.e.keller,
	jlelli, hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, peterz, juri.lelli,
	vincent.guittot, lgoncalv

On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */

This is not true for managed interrupts. The interrupt affinity of
those cannot be changed by user space.

> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;
> +	}

So now with that assume a 16 core machine (HT off for simplicity)

17 Requested interrupts (1 general, 16 queues)

Managed interrupts will allocate

   1  general interrupt which is free movable by user space
   16 managed interrupts for queues (one per CPU)

This allows the driver to have 16 queues, i.e. one queue per CPU. These
interrupts are only used when an application on a CPU issues I/O.

With the above change this will result in

   1  general interrupt which is free movable by user space
   1  managed interrupts (possible affinity to all 16 CPUs, but routed
      to housekeeping CPU as long as there is one online)

So the device is now limited to a single queue which also affects the
housekeeping CPUs because now they have to share a single queue.

With larger machines this gets even worse.

So no. This needs way more thought for managed interrupts and you cannot
do that at the PCI layer. Only the affinity spreading mechanism can do
the right thing here.

Thanks,

        tglx


* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-20 13:41               ` Peter Zijlstra
@ 2020-10-20 14:39                 ` Nitesh Narayan Lal
  0 siblings, 0 replies; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-10-20 14:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Marcelo Tosatti, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, juri.lelli, vincent.guittot,
	lgoncalv



On 10/20/20 9:41 AM, Peter Zijlstra wrote:
> On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
>> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
>>> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>>>> So I think it is important to figure out what that driver really wants
>>>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>>>> only reduce the number of CPUs, the proposed interface is wrong.
>>>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
>>> Then the patch is wrong and the interface needs changing from @min_vecs,
>>> @max_vecs to something that expresses the N*nr_cpus relation.
>> Reading Marcelo's comment again I think what is really expected is 1
>> interrupt per non-isolated (housekeeping) CPU (not N interrupts).
> Then what is the point of them asking for N*nr_cpus when there is no
> isolation?
>
> Either everybody wants 1 interrupts per CPU and we can do the clamp
> unconditionally, in which case we should go fix this user, or they want
> multiple per cpu and we should go fix the interface.
>
> It cannot be both.

Based on my understanding, I don't think this is consistent; the number
of interrupts drivers request varies to the extent that some consumers of
this API request just one interrupt for their use.

This was one of the reasons why I thought of having a conditional
restriction.

But I agree there is a lack of consistency.

-- 
Nitesh




* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-20 14:16   ` Thomas Gleixner
@ 2020-10-20 16:18     ` Nitesh Narayan Lal
  2020-10-20 18:07       ` Thomas Gleixner
  0 siblings, 1 reply; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-10-20 16:18 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, peterz, juri.lelli,
	vincent.guittot, lgoncalv



On 10/20/20 10:16 AM, Thomas Gleixner wrote:
> On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>>  
>> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
>> +
>> +	/*
>> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
>> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
>> +	 * to the housekeeping CPUs from the userspace by changing their
>> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> +	 * running out of IRQ vectors.
>> +	 */
> This is not true for managed interrupts. The interrupts affinity of
> those cannot be changed by user space.

Ah Yes. Perhaps
s/IRQs/non-managed IRQ ?

>
>> +	if (hk_cpus < num_online_cpus()) {
>> +		if (hk_cpus < min_vecs)
>> +			max_vecs = min_vecs;
>> +		else if (hk_cpus < max_vecs)
>> +			max_vecs = hk_cpus;
>> +	}
> So now with that assume a 16 core machine (HT off for simplicity)
>
> 17 Requested interrupts (1 general, 16 queues)
>
> Managed interrupts will allocate
>
>    1  general interrupt which is free movable by user space
>    16 managed interrupts for queues (one per CPU)
>
> This allows the driver to have 16 queues, i.e. one queue per CPU. These
> interrupts are only used when an application on a CPU issues I/O.

Correct.

>
> With the above change this will result
>
>    1  general interrupt which is free movable by user space
>    1  managed interrupts (possible affinity to all 16 CPUs, but routed
>       to housekeeping CPU as long as there is one online)
>
> So the device is now limited to a single queue which also affects the
> housekeeping CPUs because now they have to share a single queue.
>
> With larger machines this gets even worse.

Yes, the change can impact performance; however, if we don't do it we may
have a latency impact instead, specifically on larger systems where most of
the CPUs are isolated, as we will then definitely fail to move all of the
IRQs away from the isolated CPUs to the housekeeping CPUs.

>
> So no. This needs way more thought for managed interrupts and you cannot
> do that at the PCI layer.

Maybe we should not be doing anything in the case of managed IRQs, as they
are anyway pinned to the housekeeping CPUs as long as we have the
'managed_irq' option included in the kernel cmdline.

>  Only the affinity spreading mechanism can do
> the right thing here.

I can definitely explore this further.

However, IMHO we would still need logic to prevent the devices from
creating excess vectors.

Do you agree?

>
> Thanks,
>
>         tglx
>
-- 
Thanks
Nitesh



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-20 16:18     ` Nitesh Narayan Lal
@ 2020-10-20 18:07       ` Thomas Gleixner
  2020-10-21 20:25         ` Thomas Gleixner
  0 siblings, 1 reply; 28+ messages in thread
From: Thomas Gleixner @ 2020-10-20 18:07 UTC (permalink / raw)
  To: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, peterz, juri.lelli,
	vincent.guittot, lgoncalv

On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
> On 10/20/20 10:16 AM, Thomas Gleixner wrote:
>> With the above change this will result
>>
>>    1  general interrupt which is free movable by user space
>>    1  managed interrupts (possible affinity to all 16 CPUs, but routed
>>       to housekeeping CPU as long as there is one online)
>>
>> So the device is now limited to a single queue which also affects the
>> housekeeping CPUs because now they have to share a single queue.
>>
>> With larger machines this gets even worse.
>
> Yes, the change can impact the performance, however, if we don't do that we
> may have a latency impact instead. Specifically, on larger systems where
> most of the CPUs are isolated as we will definitely fail in moving all of the
> IRQs away from the isolated CPUs to the housekeeping.

For non-managed interrupts I agree.

>> So no. This needs way more thought for managed interrupts and you cannot
>> do that at the PCI layer.
>
> Maybe we should not be doing anything in the case of managed IRQs as they
> are anyways pinned to the housekeeping CPUs as long as we have the
> 'managed_irq' option included in the kernel cmdline.

Exactly. For the PCI side this vector limiting has to be restricted to
the non-managed case.
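
A minimal sketch of one way that restriction could look, assuming the managed
case is identified by the PCI_IRQ_AFFINITY flag passed to
pci_alloc_irq_vectors_affinity(); pci_limit_vecs_to_housekeeping() is a
hypothetical helper name and housekeeping_num_online_cpus() is the API
proposed in patch 1/4:

        /* Sketch only: clamp the requested vectors for the non-managed case. */
        static unsigned int
        pci_limit_vecs_to_housekeeping(unsigned int min_vecs, unsigned int max_vecs,
                                       unsigned int flags)
        {
                unsigned int hk_cpus;

                /* Managed (PCI_IRQ_AFFINITY) vectors are already confined by
                 * the affinity spreading logic; leave them alone. */
                if (flags & PCI_IRQ_AFFINITY)
                        return max_vecs;

                hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
                if (hk_cpus < num_online_cpus())
                        max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

                return max_vecs;
        }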

>>  Only the affinity spreading mechanism can do
>> the right thing here.
>
> I can definitely explore this further.
>
> However, IMHO we would still need a logic to prevent the devices from
> creating excess vectors.

Managed interrupts are preventing exactly that by pinning the interrupts
and queues to one or a set of CPUs, which prevents vector exhaustion on
CPU hotplug.

Non-managed, yes that is and always was a problem. One of the reasons
why managed interrupts exist.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-20 18:07       ` Thomas Gleixner
@ 2020-10-21 20:25         ` Thomas Gleixner
  2020-10-21 21:04           ` Nitesh Narayan Lal
                             ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Thomas Gleixner @ 2020-10-21 20:25 UTC (permalink / raw)
  To: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, peterz, juri.lelli,
	vincent.guittot, lgoncalv, Jakub Kicinski, Dave Miller

On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
>> However, IMHO we would still need a logic to prevent the devices from
>> creating excess vectors.
>
> Managed interrupts are preventing exactly that by pinning the interrupts
> and queues to one or a set of CPUs, which prevents vector exhaustion on
> CPU hotplug.
>
> Non-managed, yes that is and always was a problem. One of the reasons
> why managed interrupts exist.

But why is this only a problem for isolation? The very same problem
exists vs. CPU hotplug and therefore hibernation.

On x86 we have at most 204 vectors available for device interrupts per
CPU. So assuming the only device interrupt in use is networking, any
machine which has more than 204 network interrupts (queues, aux ...)
active will prevent the machine from hibernating.

Aside from that, it's silly to have multiple queues targeted at a single
CPU in case of hotplug. And that's not a theoretical problem.  Some
power management schemes shut down sockets when the utilization of a
system is low enough, e.g. outside of working hours.

The whole point of multi-queue is to have locality so that traffic from
a CPU goes through the CPU local queue. What's the point of having two
or more queues on a CPU in case of hotplug?

The right answer to this is to utilize managed interrupts and have
according logic in your network driver to handle CPU hotplug. When a CPU
goes down, then the queue which is associated to that CPU is quiesced
and the interrupt core shuts down the relevant interrupt instead of
moving it to an online CPU (which causes the whole vector exhaustion
problem on x86). When the CPU comes online again, then the interrupt is
reenabled in the core and the driver reactivates the queue.
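
As a rough driver-side sketch of that scheme (the my_drv_*() queue helpers
are hypothetical placeholders; only cpuhp_setup_state() and the dynamic
hotplug state are real kernel interfaces):

        static enum cpuhp_state my_drv_hp_state;

        static int my_drv_cpu_online(unsigned int cpu)
        {
                /* The irq core re-enables the managed vector for this CPU;
                 * the driver just brings its per-CPU queue back up. */
                my_drv_activate_queue(cpu);
                return 0;
        }

        static int my_drv_cpu_offline(unsigned int cpu)
        {
                /* Quiesce the queue; the managed interrupt is shut down by
                 * the core instead of being migrated to another CPU. */
                my_drv_quiesce_queue(cpu);
                return 0;
        }

        static int my_drv_register_hotplug(void)
        {
                int ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mydrv:online",
                                            my_drv_cpu_online, my_drv_cpu_offline);
                if (ret < 0)
                        return ret;
                my_drv_hp_state = ret;
                return 0;
        }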

Thanks,

        tglx




^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-21 20:25         ` Thomas Gleixner
@ 2020-10-21 21:04           ` Nitesh Narayan Lal
  2020-10-22  0:02           ` Jakub Kicinski
  2020-10-22 12:28           ` Marcelo Tosatti
  2 siblings, 0 replies; 28+ messages in thread
From: Nitesh Narayan Lal @ 2020-10-21 21:04 UTC (permalink / raw)
  To: Thomas Gleixner, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, peterz, juri.lelli,
	vincent.guittot, lgoncalv, Jakub Kicinski, Dave Miller

[-- Attachment #1.1: Type: text/plain, Size: 2990 bytes --]


On 10/21/20 4:25 PM, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
>> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
>>> However, IMHO we would still need a logic to prevent the devices from
>>> creating excess vectors.
>> Managed interrupts are preventing exactly that by pinning the interrupts
>> and queues to one or a set of CPUs, which prevents vector exhaustion on
>> CPU hotplug.
>>
>> Non-managed, yes that is and always was a problem. One of the reasons
>> why managed interrupts exist.
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
>
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.

Yes, that is indeed the case.

>
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.
>
> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
>
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

IIRC, i40e does have something like that, where it suspends all IRQs
before hibernation and restores them when the CPU is back online.

I am not particularly sure about the other drivers.

This brings me to another discussion that Peter initiated, which is to
apply the proposed restriction unconditionally to all non-managed
IRQs.

Something along the lines of:

+       if (!pci_is_managed(dev))
+               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);


I am not particularly sure about this because I do not know what kind of
performance penalty it will have on the drivers in general, and whether
that will be acceptable at all. Any thoughts?
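
For what it's worth, when the housekeeping CPUs are a strict subset of the
online CPUs the clamp() above gives the same result as the if/else ladder from
the v4 hunk quoted earlier; the difference is only in the non-isolated case,
where the unguarded clamp() would also cap max_vecs at the number of online
CPUs. A small side-by-side sketch, with min_vecs/max_vecs as in
pci_alloc_irq_vectors() and hk_cpus coming from
housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ) as before:

        unsigned int hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);

        /* v4 hunk: only active when some CPUs are isolated */
        if (hk_cpus < num_online_cpus()) {
                if (hk_cpus < min_vecs)
                        max_vecs = min_vecs;
                else if (hk_cpus < max_vecs)
                        max_vecs = hk_cpus;
        }

        /*
         * clamp() form: identical outcome when hk_cpus < num_online_cpus(),
         * but without the guard it also caps max_vecs at the online CPU
         * count when nothing is isolated (hk_cpus == num_online_cpus()).
         */
        max_vecs = clamp(hk_cpus, min_vecs, max_vecs);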

However, this still doesn't solve the generic problem; an ideal solution
would be something like what you suggested.

Would it be sensible to think about having a generic API that can be
consumed by all the drivers and that can do both of the things you mentioned?

>
> Thanks,
>
>         tglx
>
>
>
-- 
Thanks
Nitesh


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-21 20:25         ` Thomas Gleixner
  2020-10-21 21:04           ` Nitesh Narayan Lal
@ 2020-10-22  0:02           ` Jakub Kicinski
  2020-10-22  0:27             ` Jacob Keller
  2020-10-22  8:28             ` Thomas Gleixner
  2020-10-22 12:28           ` Marcelo Tosatti
  2 siblings, 2 replies; 28+ messages in thread
From: Jakub Kicinski @ 2020-10-22  0:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, peterz, juri.lelli,
	vincent.guittot, lgoncalv, Dave Miller, Magnus Karlsson,
	Saeed Mahameed

On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:  
> >> However, IMHO we would still need a logic to prevent the devices from
> >> creating excess vectors.  
> >
> > Managed interrupts are preventing exactly that by pinning the interrupts
> > and queues to one or a set of CPUs, which prevents vector exhaustion on
> > CPU hotplug.
> >
> > Non-managed, yes that is and always was a problem. One of the reasons
> > why managed interrupts exist.  
> 
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
> 
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.
> 
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.
> 
> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
> 
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

I think Mellanox folks made some forays into managed irqs, but I don't
remember/can't find the details now.

For networking the locality / queue per core does not always work,
since the incoming traffic is usually spread based on a hash. Many
applications perform better when network processing is done on a small
subset of CPUs, and the application doesn't get interrupted every 100us.
So we do need extra user control here.

We have a bit of a uAPI problem since people had grown to depend on
IRQ == queue == NAPI to configure their systems. "The right way" out
would be a proper API which allows associating queues with CPUs rather
than IRQs, then we can use managed IRQs and solve many other problems.

Such a new API has been in the works / under discussion for a while now.

(Magnus, keep me honest here if you disagree that the queue API solves this.)
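
Purely as a strawman of what "associate queues with CPUs rather than IRQs"
could mean at the configuration level (hypothetical names, not an existing or
proposed uAPI):

        /* Hypothetical shape only -- not an existing or proposed interface. */
        struct queue_cpu_binding {
                u32             queue_id;       /* device queue index */
                struct cpumask  serviced_cpus;  /* CPUs this queue should serve */
        };

        /*
         * A driver could translate such bindings into managed-IRQ affinity
         * hints, while user space reasons about queues and CPUs instead of
         * raw IRQ numbers.
         */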

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-22  0:02           ` Jakub Kicinski
@ 2020-10-22  0:27             ` Jacob Keller
  2020-10-22  8:28             ` Thomas Gleixner
  1 sibling, 0 replies; 28+ messages in thread
From: Jacob Keller @ 2020-10-22  0:27 UTC (permalink / raw)
  To: Jakub Kicinski, Thomas Gleixner
  Cc: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jlelli, hch, bhelgaas,
	mike.marciniszyn, dennis.dalessandro, thomas.lendacky, jiri,
	mingo, peterz, juri.lelli, vincent.guittot, lgoncalv,
	Dave Miller, Magnus Karlsson, Saeed Mahameed



On 10/21/2020 5:02 PM, Jakub Kicinski wrote:
> On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote:
>> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
>>> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:  
>>>> However, IMHO we would still need a logic to prevent the devices from
>>>> creating excess vectors.  
>>>
>>> Managed interrupts are preventing exactly that by pinning the interrupts
>>> and queues to one or a set of CPUs, which prevents vector exhaustion on
>>> CPU hotplug.
>>>
>>> Non-managed, yes that is and always was a problem. One of the reasons
>>> why managed interrupts exist.  
>>
>> But why is this only a problem for isolation? The very same problem
>> exists vs. CPU hotplug and therefore hibernation.
>>
>> On x86 we have at max. 204 vectors available for device interrupts per
>> CPU. So assumed the only device interrupt in use is networking then any
>> machine which has more than 204 network interrupts (queues, aux ...)
>> active will prevent the machine from hibernation.
>>
>> Aside of that it's silly to have multiple queues targeted at a single
>> CPU in case of hotplug. And that's not a theoretical problem.  Some
>> power management schemes shut down sockets when the utilization of a
>> system is low enough, e.g. outside of working hours.
>>
>> The whole point of multi-queue is to have locality so that traffic from
>> a CPU goes through the CPU local queue. What's the point of having two
>> or more queues on a CPU in case of hotplug?
>>
>> The right answer to this is to utilize managed interrupts and have
>> according logic in your network driver to handle CPU hotplug. When a CPU
>> goes down, then the queue which is associated to that CPU is quiesced
>> and the interrupt core shuts down the relevant interrupt instead of
>> moving it to an online CPU (which causes the whole vector exhaustion
>> problem on x86). When the CPU comes online again, then the interrupt is
>> reenabled in the core and the driver reactivates the queue.
> 
> I think Mellanox folks made some forays into managed irqs, but I don't
> remember/can't find the details now.
> 

I remember looking into this a few years ago, and not getting very far
either.

> For networking the locality / queue per core does not always work,
> since the incoming traffic is usually spread based on a hash. Many
> applications perform better when network processing is done on a small
> subset of CPUs, and application doesn't get interrupted every 100us. 
> So we do need extra user control here.
> 
> We have a bit of a uAPI problem since people had grown to depend on
> IRQ == queue == NAPI to configure their systems. "The right way" out
> would be a proper API which allows associating queues with CPUs rather
> than IRQs, then we can use managed IRQs and solve many other problems.
> 

I think we (Intel) hit some of the same issues you mention.

I know I personally would like to see something that lets a lot of the
current driver-specific policy be moved out. I think it should be
possible to significantly simplify the abstraction used by the drivers.

> Such new API has been in the works / discussions for a while now.
> 
> (Magnus keep me honest here, if you disagree the queue API solves this.)
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-22  0:02           ` Jakub Kicinski
  2020-10-22  0:27             ` Jacob Keller
@ 2020-10-22  8:28             ` Thomas Gleixner
  1 sibling, 0 replies; 28+ messages in thread
From: Thomas Gleixner @ 2020-10-22  8:28 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, mtosatti, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, peterz, juri.lelli,
	vincent.guittot, lgoncalv, Dave Miller, Magnus Karlsson,
	Saeed Mahameed

On Wed, Oct 21 2020 at 17:02, Jakub Kicinski wrote:
> On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote:
>> The right answer to this is to utilize managed interrupts and have
>> according logic in your network driver to handle CPU hotplug. When a CPU
>> goes down, then the queue which is associated to that CPU is quiesced
>> and the interrupt core shuts down the relevant interrupt instead of
>> moving it to an online CPU (which causes the whole vector exhaustion
>> problem on x86). When the CPU comes online again, then the interrupt is
>> reenabled in the core and the driver reactivates the queue.
>
> I think Mellanox folks made some forays into managed irqs, but I don't
> remember/can't find the details now.
>
> For networking the locality / queue per core does not always work,
> since the incoming traffic is usually spread based on a hash. Many

That makes it problematic and is fundamentally different from block I/O.

> applications perform better when network processing is done on a small
> subset of CPUs, and application doesn't get interrupted every 100us. 
> So we do need extra user control here.

Ok.

> We have a bit of a uAPI problem since people had grown to depend on
> IRQ == queue == NAPI to configure their systems. "The right way" out
> would be a proper API which allows associating queues with CPUs rather
> than IRQs, then we can use managed IRQs and solve many other problems.
>
> Such new API has been in the works / discussions for a while now.

If there is anything which needs to be done/extended on the irq side
please let me know.

Thanks

        tglx

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
  2020-10-21 20:25         ` Thomas Gleixner
  2020-10-21 21:04           ` Nitesh Narayan Lal
  2020-10-22  0:02           ` Jakub Kicinski
@ 2020-10-22 12:28           ` Marcelo Tosatti
  2 siblings, 0 replies; 28+ messages in thread
From: Marcelo Tosatti @ 2020-10-22 12:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Nitesh Narayan Lal, linux-kernel, netdev, linux-pci,
	intel-wired-lan, frederic, sassmann, jesse.brandeburg,
	lihong.yang, helgaas, jeffrey.t.kirsher, jacob.e.keller, jlelli,
	hch, bhelgaas, mike.marciniszyn, dennis.dalessandro,
	thomas.lendacky, jiri, mingo, peterz, juri.lelli,
	vincent.guittot, lgoncalv, Jakub Kicinski, Dave Miller

On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
> >> However, IMHO we would still need a logic to prevent the devices from
> >> creating excess vectors.
> >
> > Managed interrupts are preventing exactly that by pinning the interrupts
> > and queues to one or a set of CPUs, which prevents vector exhaustion on
> > CPU hotplug.
> >
> > Non-managed, yes that is and always was a problem. One of the reasons
> > why managed interrupts exist.
> 
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
> 
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.
> 
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.

Exactly. It seems the proper way to handle this is to disable
individual vectors rather than moving them. And that is needed for 
dynamic isolate / unisolate anyway...

> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
> 
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

Aha... But it would be necessary to do that from userspace (for runtime
isolate/unisolate).


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, back to index

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-28 18:35 [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 1/4] sched/isolation: API to get number of " Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 2/4] sched/isolation: Extend nohz_full to isolate managed IRQs Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 3/4] i40e: Limit msix vectors to housekeeping CPUs Nitesh Narayan Lal
2020-09-28 18:35 ` [PATCH v4 4/4] PCI: Limit pci_alloc_irq_vectors() " Nitesh Narayan Lal
2020-09-28 21:59   ` Bjorn Helgaas
2020-09-29 17:46     ` Christoph Hellwig
2020-10-16 12:20   ` Peter Zijlstra
2020-10-18 18:14     ` Nitesh Narayan Lal
2020-10-19 11:11       ` Peter Zijlstra
2020-10-19 14:00         ` Marcelo Tosatti
2020-10-19 14:25           ` Nitesh Narayan Lal
2020-10-20  7:30           ` Peter Zijlstra
2020-10-20 13:00             ` Nitesh Narayan Lal
2020-10-20 13:41               ` Peter Zijlstra
2020-10-20 14:39                 ` Nitesh Narayan Lal
2020-10-19 14:21         ` Frederic Weisbecker
2020-10-20 14:16   ` Thomas Gleixner
2020-10-20 16:18     ` Nitesh Narayan Lal
2020-10-20 18:07       ` Thomas Gleixner
2020-10-21 20:25         ` Thomas Gleixner
2020-10-21 21:04           ` Nitesh Narayan Lal
2020-10-22  0:02           ` Jakub Kicinski
2020-10-22  0:27             ` Jacob Keller
2020-10-22  8:28             ` Thomas Gleixner
2020-10-22 12:28           ` Marcelo Tosatti
2020-10-01 15:49 ` [PATCH v4 0/4] isolation: limit msix vectors " Frederic Weisbecker
2020-10-08 21:40   ` Nitesh Narayan Lal

Linux-PCI Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-pci/0 linux-pci/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-pci linux-pci/ https://lore.kernel.org/linux-pci \
		linux-pci@vger.kernel.org
	public-inbox-index linux-pci

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-pci


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git